# Star Wars Data Science
## Network Analysis, Topic Modeling, and a Wordcloud!
https://linkedin.com/in/dennisbakhuis

## 4. Star Wars Topic Modeling

Until now we have covered the basics with some data exploration and created the mandatory wordcloud. We are now ready for more advanced analysis, and the first thing I want to try is topic modeling. The method described here is based on the work of Maarten Grootendorst, the creator of BERTopic.

Topic modeling is an unsupervised learning technique that can answer the following question: I have this bunch of texts, what are the most common topics these texts talk about?

For topic modeling we will make use of a Python package called Sentence-Transformers, which, as the name suggests, is based on the transformer architecture and is able to convert a full sentence into a vector. To find similar topics, we need to find vectors that are grouped together. Let's first create the sentence embeddings:

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import cm

from tqdm import tqdm

from sentence_transformers import SentenceTransformer
import umap
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [None]:
sent = pd.read_parquet('../Dataset/StarWars_Raw_Sentences.parquet')
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embeddings = model.encode(sent.sentence, show_progress_bar=True)
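
To make "vectors that are grouped together" a bit more concrete, here is a minimal sketch of cosine similarity, the same measure we pass to UMAP as `metric='cosine'`. The three short vectors are made-up stand-ins for the real 768-dimensional embeddings, and `cosine_sim` is a hypothetical helper, not part of Sentence-Transformers.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional stand-ins for sentence embeddings
# (made-up numbers, not real model output).
vec_jedi = [0.9, 0.1, 0.0]
vec_sith = [0.8, 0.2, 0.1]
vec_ship = [0.0, 0.1, 0.9]

sim_close = cosine_sim(vec_jedi, vec_sith)  # vectors pointing the same way
sim_far = cosine_sim(vec_jedi, vec_ship)    # nearly orthogonal vectors

print(f"jedi vs sith: {sim_close:.3f}")
print(f"jedi vs ship: {sim_far:.3f}")
```

Sentences whose embeddings point in a similar direction get a similarity close to 1, and those are the ones we hope will end up in the same cluster.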

The sentence vectors have a length of 768, which is very large for a cluster analysis; therefore, we will apply a dimension reduction. There are many choices, such as LDA or NMF, but here we will use a method called UMAP, which has the benefit of keeping the local structure intact.

In [None]:
n_components = 6
n_neighbors = 24

umap_embeddings = umap.UMAP(
    n_neighbors=n_neighbors, 
    n_components=n_components, 
    metric='cosine',
).fit_transform(embeddings)

After the dimension reduction we can try to identify the clusters using HDBSCAN. This will find groups of points that lie close together in the reduced space and mark everything that does not belong to a group as an outlier.

In [None]:
cluster = hdbscan.HDBSCAN(
    min_cluster_size=n_neighbors,
    metric='euclidean',                      
    cluster_selection_method='eom',
).fit(umap_embeddings)
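
Before plotting, it can help to check how many clusters were found and how many sentences ended up as outliers (HDBSCAN gives those the label -1). The sketch below uses a small made-up label list; `summarize_labels` is a hypothetical helper, and in the notebook you would pass `cluster.labels_` instead.

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters and outliers in an array of cluster labels."""
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)  # label -1 marks noise / outliers
    n_total = len(labels)
    return {
        "n_clusters": len(counts),
        "n_outliers": n_outliers,
        "outlier_fraction": n_outliers / n_total if n_total else 0.0,
        "largest": counts.most_common(3),  # (label, size) pairs
    }


# Toy labels for illustration only; use cluster.labels_ in the notebook.
toy_labels = [0, 0, 1, -1, 1, 1, 2, -1, 0, 1]
summary = summarize_labels(toy_labels)
print(summary)
```

If the outlier fraction turns out very high, changing `min_cluster_size` or the UMAP settings is probably the first thing to try.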

We have reduced the 768 features to only 6 features; however, this reduction is not enough if we want to plot the result on screen. For plotting we need at most 3 dimensions, and we will go down to 2 for better visibility. We again use UMAP, this time reducing the original embeddings from 768 dimensions all the way to 2, and combine the resulting coordinates with the cluster labels found earlier.

In [None]:
# Create DataFrame
plot_df = umap.UMAP(
    n_neighbors=n_neighbors,
    n_components=2,
    min_dist=0.0,
    metric='cosine'
).fit_transform(embeddings)

result = pd.DataFrame(plot_df, columns=['x', 'y'])
result['label'] = cluster.labels_

And now we can plot the result:

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

outliers = result.loc[result.label == -1, :]  # cluster -1 are "outliers"
clustered = result.loc[result.label != -1, :]

# plot outliers in gray
plt.scatter(
    outliers.x,
    outliers.y,
    color='#BDBDBD',
    s=0.05,
)

# plot clusters in color
plt.scatter(
    clustered.x,
    clustered.y,
    c=clustered.label,
    s=0.05, 
    cmap='hsv_r',
)
_, _ = ax.set_xlim([3, 15]), ax.set_ylim([4, 15])
_, _ = ax.set_xlabel('umap_1'), ax.set_ylabel('umap_2')

In [None]:
fig.savefig('../Assets/sw_topic1.png', bbox_inches="tight")

The result is very pretty, but to get an idea of whether these clusters make any sense we need to combine the cluster labels with their original sentences.

In [None]:
docs_df = pd.DataFrame({
    'Doc': sent.sentence.tolist(),
    'Topic': cluster.labels_,
    'Doc_ID': sent.index.tolist(),
})

# Combine all documents for each topic
docs_per_topic = (docs_df
    .groupby(['Topic'], as_index = False)
    .agg({'Doc': ' '.join})
)

We also want to find the most common keywords for each cluster, so we need to analyze each cluster separately. To solve this, Maarten Grootendorst came up with a class-based TF-IDF (c-TF-IDF), which works pretty neatly!

In [None]:
def c_tf_idf(documents, m, ngram_range=(1, 1)):
    count = CountVectorizer(
        ngram_range=ngram_range,
        stop_words="english",
    ).fit(documents)
    
    t = count.transform(documents).toarray()
    w = t.sum(axis=1)
    tf = np.divide(t.T, w)
    sum_t = t.sum(axis=0)
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
    tf_idf = np.multiply(tf, idf)

    return tf_idf, count
  
tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(sent))
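
To see what the function computes, here is a small worked example that repeats the same steps on a hand-made count matrix. The terms, counts, and `m = 10` are invented for illustration; only the arithmetic mirrors `c_tf_idf` above.

```python
import numpy as np

# Toy term counts: rows are topics (classes), columns are terms.
# All numbers are invented for illustration.
terms = ["jedi", "force", "senate"]
t = np.array([
    [4, 1, 0],  # topic 0
    [1, 1, 3],  # topic 1
])
m = 10  # assumed number of original (unjoined) sentences

w = t.sum(axis=1)            # total number of words per topic
tf = np.divide(t.T, w)       # term frequency per topic, shape (terms, topics)
sum_t = t.sum(axis=0)        # total count of each term over all topics
idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)  # rarer terms weigh more
tf_idf = np.multiply(tf, idf)

# Highest-scoring term for each topic
best_terms = [terms[int(tf_idf[:, k].argmax())] for k in range(t.shape[0])]
for topic, term in enumerate(best_terms):
    print(f"topic {topic}: top term = {term!r}")
```

A term that is frequent inside one topic but rare overall, like "senate" in the toy topic 1, ends up with the highest score for that topic.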

The c_tf_idf function treats all sentences of a cluster as one single document, which is why we joined them per topic before. The result is a list of the highest-scoring terms for each cluster, and hopefully they make some sense for our Wookieepedia dataset.

In [None]:
def extract_top_n_words_per_topic(
    tf_idf, 
    count, 
    docs_per_topic, 
    n=20
):
    # get_feature_names() was removed in scikit-learn 1.2
    words = count.get_feature_names_out()
    labels = list(docs_per_topic.Topic)
    tf_idf_transposed = tf_idf.T
    indices = tf_idf_transposed.argsort()[:, -n:]
    top_n_words = {
        label: [
            (words[j], tf_idf_transposed[i][j]) 
            for j in indices[i]
        ][::-1] 
        for i, label in enumerate(labels)
    }
    return top_n_words

def extract_topic_sizes(df):
    return (df
        .groupby(['Topic'])
        .Doc
        .count()
        .reset_index()
        .rename({"Topic": "Topic", "Doc": "Size"}, axis='columns')
        .sort_values("Size", ascending=False)
    )

top_n_words = extract_top_n_words_per_topic(
    tf_idf,
    count,
    docs_per_topic,
    n=20,
)
topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.head(10)
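
The `argsort()[:, -n:]` followed by `[::-1]` trick in `extract_top_n_words_per_topic` can look a bit cryptic, so here is a minimal sketch of the same indexing on a made-up score matrix. The words and scores are assumptions for illustration only.

```python
import numpy as np

# Toy scores: rows are topics, columns are words (made-up values).
words = ["jedi", "senate", "clone", "yavin"]
scores = np.array([
    [0.10, 0.50, 0.30, 0.05],  # topic 0
    [0.40, 0.02, 0.20, 0.90],  # topic 1
])
n = 2

# argsort sorts ascending, so the last n columns are the n largest scores.
indices = scores.argsort()[:, -n:]

# Reverse each list so the best word comes first.
top = {
    topic: [(words[j], float(scores[topic, j])) for j in indices[topic]][::-1]
    for topic in range(scores.shape[0])
}
print(top)
```

Each topic ends up with a list of `(word, score)` pairs sorted from highest to lowest score, which is the same structure as `top_n_words`.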

In [None]:
top_n_words[93]

Looking at the results, some clusters make sense but are not that exciting. There is a full cluster that consists of many colors, which is still pretty cool, knowing that it was found in an unsupervised way. There are, however, a couple of clusters that do have story-based topics. For example, a cluster containing Emperor Palpatine contains lots of political terms such as senate, supreme chancellor, and constitution. Palpatine also has links to clones, which are part of the master plan he created to destroy the Jedi.

However, my favorite cluster combines almost all the star characters with the terms missions and battle of Yavin. Of course, the battle of Yavin is one of the most important events in Star Wars, so it makes sense that it is often referenced. Still, finding this using topic modeling is pretty awesome.

In [None]:
for ix, group in top_n_words.items():
    if any(word == 'skywalker' for word, _ in group):
        print(ix)

In [None]:
top_n_words[162]

In [None]:
for ix, group in top_n_words.items():
    if any(word == 'palpatine' for word, _ in group):
        print(ix)

In [None]:
top_n_words[7]
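
Since we search for a keyword more than once, the lookup could be wrapped in a small helper. This is only a sketch: `topics_containing` is a hypothetical name, and the toy dictionary below uses invented topic IDs and scores rather than real results. It does assume the same structure as `top_n_words`, a dict of `(word, score)` lists.

```python
def topics_containing(top_n_words, keyword):
    """Return the topic IDs whose top words include the given keyword."""
    # CountVectorizer lowercases tokens by default, so we do the same.
    keyword = keyword.lower()
    return [
        topic
        for topic, pairs in top_n_words.items()
        if any(word == keyword for word, _ in pairs)
    ]


# Toy data with invented topic IDs and scores, for illustration only.
toy_top_n_words = {
    -1: [("said", 0.02), ("time", 0.01)],
    0: [("palpatine", 0.09), ("senate", 0.07), ("chancellor", 0.05)],
    1: [("skywalker", 0.08), ("yavin", 0.06), ("mission", 0.04)],
}

print(topics_containing(toy_top_n_words, "Skywalker"))
print(topics_containing(toy_top_n_words, "senate"))
```

In the notebook this would be called as `topics_containing(top_n_words, 'skywalker')`.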