# Dataset NMF Topic Visualization

This notebook visualizes and explores the NMF topic model, along with a few plots based on t-SNE and PCA. As before, I start by importing the packages and loading the previously trained models and the dataset.

In [5]:
# Import the packages
import joblib
import pandas as pd
import plotly.express as px

# Get the models and the dataset
nmf = joblib.load('../models/nmf.joblib')
tsne = joblib.load('../models/tsne.joblib')
pca = joblib.load('../models/pca.joblib')
vectorizer = joblib.load('../models/tfidf_vectorizer.joblib')
features = joblib.load('../models/tfidf_features.joblib')
df = pd.read_csv('../data/travel_processed.csv')

# Get the corpus: the array of all terms the TF-IDF vectorizer learned from the dataset
corpus = vectorizer.get_feature_names_out()


### Top terms in each topic

The model's components_ attribute is the topic/term matrix, H. Each row holds one topic's scores for every term in the corpus. argsort() returns the indices that sort those scores in ascending order, so the last few indices point to the most representative terms, and we can use them to look up the actual words in the corpus array we built earlier.

In [2]:
# Get the terms most associated with each topic
for i, doc in enumerate(nmf.components_):
    print("Topic {}: {}".format(i, ', '.join([str(x) for x in corpus[doc.argsort()[-6:]]])))

Topic 0: russian, discuss, attend, secretary, minister, foreign
Topic 1: inauguration, mubarak, address, assad, abbas, president
Topic 2: nacc, cento, council, attend, ministerial, nato
Topic 3: assad, discuss, process, east, middle, peace
Topic 4: turkish, discuss, chinese, israeli, senior, officials
Topic 5: informal, reagan, accompany, official, state, visit
Topic 6: forum, apec, conference, economic, attend, summit
Topic 7: reagan, nixon, clinton, president, bush, accompany
Topic 8: olmert, israeli, benjamin, netanyahu, minister, prime
Topic 9: moscow, overnight, return, stop, en, route
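
To make the argsort() step concrete, here is a minimal sketch on a made-up 2x5 topic/term matrix. The `toy_H` and `toy_terms` arrays and the `top_terms` helper are hypothetical; they only stand in for `nmf.components_` and `corpus`. Unlike the cell above, the slice is reversed so the strongest term comes first.

```python
import numpy as np

# Hypothetical stand-ins for nmf.components_ (H) and the vectorizer's terms
toy_terms = np.array(["visit", "summit", "peace", "nato", "minister"])
toy_H = np.array([
    [0.1, 0.9, 0.0, 0.7, 0.2],   # topic 0
    [0.3, 0.0, 0.8, 0.1, 0.6],   # topic 1
])

def top_terms(H, terms, n=3):
    """Return the n highest-scoring terms per topic, strongest first."""
    # argsort is ascending, so take the last n indices and reverse them
    return [terms[row.argsort()[-n:][::-1]].tolist() for row in H]

toy_top = top_terms(toy_H, toy_terms)
for i, words in enumerate(toy_top):
    print("Topic {}: {}".format(i, ", ".join(words)))
```

Dropping the `[::-1]` gives the same terms in ascending order of score, which is the order printed in the topic list above.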


### Documents by topic

Now we can use the fitted NMF model to transform the TF-IDF features into the document/topic matrix, W. Each row holds one document's weight for each topic, and the columns follow the topic order (0-9), so a column index is all we need to identify a topic. Since argsort() sorts in ascending order, its first index would be the least representative topic, so we take argmax() along each row to get the topic with the highest weight for each document. Then, we add it to the dataset for the visualizations.

In [7]:
# Transform the features and add each document's dominant topic to the dataset
W = nmf.transform(features)
df['topic'] = W.argmax(axis=1)
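
As a quick sanity check on how argsort() and argmax() index a document/topic matrix, the sketch below uses a made-up `toy_W` with three documents and four topics. The array and the variable names are assumptions for illustration, not values from the fitted model.

```python
import numpy as np

# Hypothetical document/topic weights: 3 documents (rows), 4 topics (columns)
toy_W = np.array([
    [0.05, 0.70, 0.10, 0.15],
    [0.60, 0.00, 0.30, 0.10],
    [0.20, 0.25, 0.05, 0.50],
])

order = toy_W.argsort(axis=1)     # topic indices per row, weakest to strongest
weakest = order[:, 0]             # first position: lowest-weight topic
strongest = order[:, -1]          # last position: highest-weight topic
dominant = toy_W.argmax(axis=1)   # same result as `strongest`, without a full sort

print(weakest.tolist())    # [0, 1, 2]
print(strongest.tolist())  # [1, 0, 3]
print(dominant.tolist())   # [1, 0, 3]
```

The first position of an ascending argsort() is the topic a document is least associated with, so the dominant topic has to come from the last position or from argmax().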

### Visualize using t-SNE and PCA

Here I plot each document and color it by topic to see whether NMF captured relationships similar to those in the t-SNE and PCA projections. I will also upload the interactive plots for the report on the website.

In [12]:

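# 2D t-SNE coordinates (tsne is indexed as an array of embedded points)
df['v1'] = tsne[:, 0]
df['v2'] = tsne[:, 1]
# Cast the topic to str so plotly treats it as a discrete color category
df['topic'] = df['topic'].astype(str)

fig = px.scatter(df, x='v1', y='v2', color='topic', hover_data=['remarks', 'visit_type'])
fig.show()

# First three PCA coordinates; v1 and v2 overwrite the t-SNE columns
df['v1'] = pca[:, 0]
df['v2'] = pca[:, 1]
df['v3'] = pca[:, 2]

fig2 = px.scatter_3d(df, x='v1', y='v2', z='v3', color='topic', hover_data=['remarks', 'visit_type'])

# Shrink the markers so the 3D point cloud stays readable
fig2.update_traces(marker_size=2)
fig2.show()

# Set the figure size and export the interactive plot
fig2.update_layout(width=700, height=700, autosize=True)
fig2.write_html("pca_topics.html")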