# Discourse Atom Topic Modeling (DATM) Tutorial 

## Part 1 of 2: Extract Atoms from Word Embedding Trained on your Text Data

* This code is written in Python 3.7.2, and uses Gensim version 3.8.3. 
* This code provides an outline of how we identified topics in a word embedding trained on our cleaned data and then explored the resulting topics. Note that we cannot redistribute the data used in our paper "Integrating Topic Modeling and Word Embedding" in any form; researchers must apply directly to the Centers for Disease Control and Prevention for access. Details on data access are provided in the paper. We add comments with tips for adapting this code to your data.

In [4]:
from __future__ import division
import pandas as pd
import math
from gensim.models import coherencemodel
import pickle
from scipy.linalg import norm
from sklearn.preprocessing import normalize
from scipy.stats import entropy
from sklearn.metrics.pairwise import cosine_similarity
import os
from itertools import combinations
import numpy as np
from gensim import corpora, models, similarities #calc all similarities at once, from http://radimrehurek.com/gensim/tut3.html
from gensim.models import Word2Vec, KeyedVectors
from random import seed, sample
import seaborn as sns
from ksvd import ApproximateKSVD 


from quality import reconst_qual, topic_diversity, coherence_centroid, coherence_pairwise #written for this jupyter notebook

## Input: Word2Vec model Trained on your Text Data

* Below, we use a free, public word2vec model pretrained on Google News to illustrate how to identify and explore atom vectors in a trained embedding space. [To download the model, click here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)

In [5]:
currentmodel = KeyedVectors.load_word2vec_format('C:/Users/arsen/Dropbox/GSRM/LexisNexis Data/GoogleW2V/GoogleNews-vectors-negative300.bin.gz', limit=40000, binary=True) 
# Change the path above to the directory where you downloaded your model. In this tutorial, we limit the w2v model to the top 40k words for efficiency.

## Extract Atoms with K-SVD

In [11]:
def do_aksvd(w2vmodel, n_comp, n_nonzeros, savelocation, save=False):
    # n_components is the number of discourse atoms; use fewer if the vocabulary is smallish.
    # transform_n_nonzero_coefs is the number of atoms (components) that a word can be a linear combination of.
    aksvd_t = ApproximateKSVD(n_components=n_comp, transform_n_nonzero_coefs=n_nonzeros)
    dictionary_t = aksvd_t.fit(w2vmodel.wv.vectors).components_  # the dictionary is the matrix of discourse atoms
    gamma_t = aksvd_t.transform(w2vmodel.wv.vectors)  # the gammas are the "weights" of each word on each discourse atom
    # len(dictionary_t[0])  # check that a discourse-atom vector has the same dimensions as the word vectors; note that each atom has norm 1
    if save:
        # Treat savelocation as a directory; file names start with the hyperparameters, e.g. "200comp5nonzeros_aksvd"
        prefix = os.path.join(str(savelocation), str(n_comp) + 'comp' + str(n_nonzeros) + 'nonzeros_')
        for suffix, obj in [('aksvd', aksvd_t), ('dictionary', dictionary_t), ('gamma', gamma_t)]:
            with open(prefix + suffix, 'wb') as outfile:
                pickle.dump(obj, outfile)
    return dictionary_t, gamma_t


Sample usage:

In [None]:
mydictionary, mygamma = do_aksvd(currentmodel, 200, 5,  os.getcwd(),  save=False) #200 topics, each word can be a linear combo of 5 topics
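
To make the roles of the two returned matrices concrete, the sketch below builds a toy dictionary and a toy sparse gamma matrix with NumPy and reconstructs word vectors as `gamma @ dictionary`. The values are synthetic (they are not output of `ApproximateKSVD`), and the sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
n_words, n_atoms, n_dims, n_nonzeros = 6, 4, 3, 2

# Toy "dictionary": one row per atom, each row scaled to unit length.
toy_dictionary = rng.normal(size=(n_atoms, n_dims))
toy_dictionary /= np.linalg.norm(toy_dictionary, axis=1, keepdims=True)

# Toy "gamma": one row per word, with exactly n_nonzeros nonzero weights per row.
toy_gamma = np.zeros((n_words, n_atoms))
for row in range(n_words):
    active = rng.choice(n_atoms, size=n_nonzeros, replace=False)
    toy_gamma[row, active] = rng.normal(size=n_nonzeros)

# Each word vector is approximated as a sparse linear combination of atoms.
reconstructed = toy_gamma @ toy_dictionary

print(reconstructed.shape)                  # (n_words, n_dims)
print(np.count_nonzero(toy_gamma, axis=1))  # number of nonzero weights per word
```

In the notebook, `mygamma` plays the role of `toy_gamma` and `mydictionary` the role of `toy_dictionary`.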

## Evaluate Internal Model Quality

* Using coherence, topic diversity, SSE, RMSE, or $R^2$
* These functions are imported from the `quality.py` file above; see the code in that file or the [paper](https://arxiv.org/abs/2106.14365) for details on these functions
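
As a rough illustration of one of these metrics, the sketch below computes topic diversity under the assumption that it is the share of unique words among the top words pooled across all topics. This is a common definition; the `topic_diversity` function in `quality.py` may differ in its details. The function name and the toy word lists are hypothetical.

```python
def topic_diversity_sketch(top_words_per_topic):
    """Share of unique words among all top words, pooled across topics."""
    all_words = [word for topic in top_words_per_topic for word in topic]
    return len(set(all_words)) / len(all_words)


# Two toy topics that share one word ("car").
toy_topics = [
    ["car", "truck", "road"],
    ["car", "driver", "speed"],
]

print(topic_diversity_sketch(toy_topics))  # 5 unique words out of 6
```

A value near 1 means the topics rarely share top words; a value near 0 means they largely repeat one another.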

In [32]:
print('Coherence (pairwise):', coherence_pairwise(currentmodel, mydictionary, top_n=25))

print('Topic Diversity:', topic_diversity(currentmodel, mydictionary, top_n=25))

print('SSE, RMSE, R2:', reconst_qual(currentmodel, mydictionary, mygamma))

Coherence (pairwise): 0.4617589
Topic Diversity: 0.9562
SSE, RMSE, R2: (282498.41000679834, 0.15343250687919166, 0.4359828777277901)
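
The printed RMSE is consistent with dividing the SSE by the number of matrix entries (40,000 words × 300 dimensions) before taking the square root. The sketch below computes the three reconstruction metrics on that basis with toy arrays. The $R^2$ convention used here (total sum of squares around the per-dimension mean) and all names are assumptions; `reconst_qual` in `quality.py` remains the authoritative definition.

```python
import numpy as np


def reconstruction_quality_sketch(word_vectors, dictionary, gamma):
    """Return (SSE, RMSE, R^2) for the approximation word_vectors ~ gamma @ dictionary."""
    reconstructed = gamma @ dictionary
    residuals = word_vectors - reconstructed
    sse = float(np.sum(residuals ** 2))
    rmse = float(np.sqrt(sse / word_vectors.size))
    # Total sum of squares around the per-dimension mean (an assumed convention).
    sst = float(np.sum((word_vectors - word_vectors.mean(axis=0)) ** 2))
    r2 = 1.0 - sse / sst
    return sse, rmse, r2


toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_dictionary = np.array([[1.0, 0.0], [0.0, 1.0]])
# The third word is only partly reconstructed because it uses a single atom.
toy_gamma = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

sse, rmse, r2 = reconstruction_quality_sketch(toy_vectors, toy_dictionary, toy_gamma)
print(sse, rmse, r2)
```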


Select hyperparameters (e.g., the number of atoms and/or the number of nonzeros) by training models over a range of values and using the quality metrics to choose the best model.

In [None]:
ntopics= []
nonzeros = []
cohere_pairwise= []
div=[]
sse= []
rmse =[]
r2=[]


for i in [25, 50, 75, 100, 200]: 
    for j in [2,5]:
        dictionary, gamma = do_aksvd(currentmodel, i, j, os.getcwd(),  save=True) #varying hyperparameters
        cohere_pairwise.append(coherence_pairwise(currentmodel, dictionary, top_n=25))
        div.append(topic_diversity(currentmodel, dictionary, top_n=25))
        rec= reconst_qual(currentmodel, dictionary, gamma)
        sse.append(rec[0])
        rmse.append(rec[1])
        r2.append(rec[2])
        ntopics.append(i)
        nonzeros.append(j)

In [42]:
quality_results = pd.DataFrame(data={'Components_Topics': ntopics,'Nonzeros': nonzeros,
                'CohereCossim_top25_mean': cohere_pairwise, 'Diversity_top25': div, 
                  'SSE': sse,'RMSE': rmse, 'R2': r2})

In [None]:
sns.set(style="ticks")
ax = sns.lineplot(x="Components_Topics", y="CohereCossim_top25_mean", hue="Nonzeros", data=quality_results, legend='full', sort=True)  # lineplot returns a matplotlib Axes
ax.legend(bbox_to_anchor=(1.3, .5), loc='center right')  #, borderaxespad=0.)
#ax.set(ylim=(.67, .85))
#ax.set(xlim=(0,550))

## Exploring the Model and Resulting Topics

Explore the 25 most similar words to each atom and their respective cosine similarities. Note that this is where you get a "topic": the distribution of words that characterizes an atom vector.

In [None]:
for i in range(0, len(mydictionary)): 
    print("Discourse_Atom " + str(i))
    print(currentmodel.wv.similar_by_vector(mydictionary[i], topn=25))  # what are the 25 most similar words to the ith discourse atom?
    print('\n')
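
The lookup in the loop above can be pictured as a cosine-similarity ranking of every word vector against one atom. The sketch below shows that idea with NumPy on a toy vocabulary. It is not Gensim's implementation, and the function name, words, and vectors are made up for illustration.

```python
import numpy as np


def most_similar_words_sketch(atom, word_vectors, vocabulary, topn=3):
    """Rank words by cosine similarity to a single atom vector."""
    unit_words = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    unit_atom = atom / np.linalg.norm(atom)
    similarities = unit_words @ unit_atom
    best = np.argsort(-similarities)[:topn]  # indices of the highest similarities first
    return [(vocabulary[i], float(similarities[i])) for i in best]


toy_vocabulary = ["river", "lake", "bank", "loan"]
toy_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.5, 0.5], [0.1, 1.0]])
toy_atom = np.array([1.0, 0.0])

top_words = most_similar_words_sketch(toy_atom, toy_vectors, toy_vocabulary, topn=2)
print(top_words)
```

The ranked list of (word, similarity) pairs is what the tutorial reads as the "topic" for that atom.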

#### Extract a Gender Dimension and Compute the Gender Loading of the Topics

In [None]:
manvec = np.mean([currentmodel.wv['male'],  currentmodel.wv['man'], currentmodel.wv['he'], currentmodel.wv['his'], currentmodel.wv['him'], currentmodel.wv['himself']], axis=0)
womanvec= np.mean([currentmodel.wv['female'],  currentmodel.wv['woman'], currentmodel.wv['she'], currentmodel.wv['hers'], currentmodel.wv['her'], currentmodel.wv['herself']], axis=0)

gendervec= normalize(womanvec.reshape(1, -1))-normalize(manvec.reshape(1, -1))

cossim_gender=[]
for i in range(0, len(mydictionary)):
    print("Discourse_Atom " + str(i))
    print(currentmodel.wv.similar_by_vector(mydictionary[i], topn=25))  # what are the most similar words to the ith discourse atom?
    loading = cosine_similarity(gendervec.reshape(1,-1), mydictionary[i].reshape(1,-1))[0]  # one-element array holding the gender loading
    print(loading)
    cossim_gender.append(loading)
    #print('\n')
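
The same steps (average the anchor words for each pole, normalize, subtract, then take the cosine similarity with an atom) can be written as one small function. The sketch below does this with NumPy on two-dimensional toy vectors. The function names and all values are hypothetical and serve only to show how the sign of the loading arises.

```python
import numpy as np


def unit(vector):
    """Scale a vector to unit length."""
    return vector / np.linalg.norm(vector)


def dimension_loading_sketch(pole_a_vectors, pole_b_vectors, atom):
    """Cosine similarity between an atom and the direction pointing from pole B to pole A."""
    pole_a = unit(np.mean(pole_a_vectors, axis=0))
    pole_b = unit(np.mean(pole_b_vectors, axis=0))
    direction = pole_a - pole_b
    return float(unit(direction) @ unit(atom))


# Toy anchor vectors for the two poles (placeholder values).
toy_pole_a = np.array([[0.0, 1.0], [0.2, 0.9]])
toy_pole_b = np.array([[1.0, 0.0], [0.9, 0.2]])
atom_near_a = np.array([0.1, 1.0])
atom_near_b = np.array([1.0, 0.1])

loading_a = dimension_loading_sketch(toy_pole_a, toy_pole_b, atom_near_a)
loading_b = dimension_loading_sketch(toy_pole_a, toy_pole_b, atom_near_b)
print(loading_a)  # positive: the atom leans toward pole A
print(loading_b)  # negative: the atom leans toward pole B
```

In the gender example, pole A corresponds to `womanvec` and pole B to `manvec`, so positive loadings fall on the feminine side.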

In [None]:
#look at the topics with the largest loading on this dimension (the scalar indicates strength of the loading, the sign indicates direction - whether on the feminine or masculine side)
zippes= zip( cossim_gender, [i for i in range(0, len(mydictionary))]) #get most fem/masc
sorted(zippes)
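
An alternative to sorting the zipped pairs is to flatten the loadings into one array and use `np.argsort`, which returns atom indices directly. The sketch below uses placeholder loadings rather than model output. In the notebook, `np.concatenate(cossim_gender)` would give the corresponding flat array, since each stored loading is a one-element array.

```python
import numpy as np

# Toy loadings for four atoms (placeholder values, not model output).
toy_loadings = np.array([0.12, -0.30, 0.05, 0.27])

order = np.argsort(toy_loadings)          # atom indices, most negative loading first
most_negative = order[:2].tolist()        # two atoms furthest toward the negative pole
most_positive = order[::-1][:2].tolist()  # two atoms furthest toward the positive pole

print(most_negative)
print(most_positive)
```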

In [None]:
#write results to CSV

genderedlevels= pd.DataFrame(np.concatenate( cossim_gender, axis=0 ), columns= ['gendered_connotation'])
genderedlevels.to_csv('gendered_connotations_of_topics.csv')

#### Extract an Indoor/Outdoor Dimension and Compute the Loading of the Topics on this Dimension

In [None]:
indoorvec = np.mean([currentmodel.wv['indoor'],  currentmodel.wv['indoors'] , currentmodel.wv['inside']], axis=0)
outdoorvec= np.mean([currentmodel.wv['outdoor'], currentmodel.wv['outdoors'], currentmodel.wv['outside']], axis=0)

indooroutdoorvec= normalize(indoorvec.reshape(1, -1))-normalize(outdoorvec.reshape(1, -1))

cossim_indout=[]
for i in range(0, len(mydictionary)):
    print("Discourse_Atom " + str(i))
    print([word for word, _ in currentmodel.wv.similar_by_vector(mydictionary[i], topn=15)])  # what are the most similar words to the ith discourse atom?
    loading = cosine_similarity(indooroutdoorvec.reshape(1,-1), mydictionary[i].reshape(1,-1))[0]  # one-element array holding the indoor/outdoor loading
    print(loading)
    cossim_indout.append(loading)
    #print('\n')

In [None]:
#look at the topics with the largest loading on this dimension (the scalar indicates strength of the loading, the sign indicates direction - whether indoor or outdoor)
zippes= zip( cossim_indout, [i for i in range(0, len(mydictionary))]) 
sorted(zippes)

In [None]:
#write results to CSV

indoutlevels= pd.DataFrame(np.concatenate( cossim_indout, axis=0 ), columns= ['indooroutdoor_connotation'])
indoutlevels.to_csv('indout_connotations_of_topics.csv')