# Discourse Atom Topic Modeling (DATM) Tutorial 

## Part 2 of 2: Mapping Atoms to Text Data

* This code is written in Python 3.7.2, and uses Gensim version 3.8.3. 

* This code provides an outline of how we assigned atoms to our cleaned data; Part 1 of 2 shows how to identify the atoms. Note that we cannot redistribute the data used in our paper "Integrating Topic Modeling and Word Embedding" in any form, and researchers must apply directly to the Centers for Disease Control and Prevention for access. Details on data access are provided in the paper. We add comments with tips for adapting this code to your data.  
* In our case, the goal of this code is to take a given narrative, get rolling windows of contexts from this narrative, find the SIF sentence embedding of each rolling window, and match the SIF embedding to the closest (by cosine similarity) atom in the dictionary of atoms from Part 1. The SIF embedding is the maximum a posteriori (MAP) estimate of the discourse vector for that window, which we then match to an atom. So we'll get out a rolling window (i.e., a sequence) of atoms underlying the narrative.
* In our case, we get atoms separately for law enforcement (narle) and medical examiner (narcme) narratives, and then combine the two distributions, as described in our paper. 
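
As a rough illustration of the matching step described above, the following minimal sketch assigns each window embedding to its closest atom by cosine similarity. The helper `nearest_atoms` and the toy 2-dimensional vectors are hypothetical and not part of our pipeline; the function we actually use is `sif_atom_seqs`, defined later in this notebook.

```python
import numpy as np

def nearest_atoms(window_vecs, atoms):
    """Return, for each window vector, the index of the atom with the highest cosine similarity."""
    w = window_vecs / np.linalg.norm(window_vecs, axis=1, keepdims=True)  # unit-length windows
    a = atoms / np.linalg.norm(atoms, axis=1, keepdims=True)              # unit-length atoms
    return np.argmax(w @ a.T, axis=1).tolist()  # dot product of unit vectors = cosine similarity

# Toy data (hypothetical): three 2-d "atoms" and four window embeddings.
toy_atoms = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
toy_windows = np.array([[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1], [0.6, 0.5]])

toy_atom_seq = nearest_atoms(toy_windows, toy_atoms)
print(toy_atom_seq)  # [0, 1, 2, 0]
```

The output has one atom index per window, which is the "sequence of atoms" format produced for each narrative.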

In [1]:
from __future__ import division
import pandas as pd
import cython
import pickle
from gensim.models import Word2Vec 
from sklearn.preprocessing import normalize
from random import sample
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import math
from scipy.linalg import norm
from collections import Counter
from ksvd import ApproximateKSVD 
from itertools import combinations
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import TruncatedSVD
import re
import string
from time import perf_counter #used to time the sampling in samp_cts below
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from random import seed, sample
%matplotlib inline

Load in a word embedding model trained on the corpus (in our case, violent death narratives)

In [27]:
w2vmodel=Word2Vec.load('')
#w2vmodel.init_sims(replace=False) #normalize word-vector lengths. May speed up if set replace=True, but then can't go back to the original (non-normalized) word vectors.

Load in dictionary of atoms identified in your word embedding (see "DATM Tutorial Part 1 of 2" for code) 

In [28]:
#load back in the pickled dictionary of atom vectors
infile = open('','rb')
dictionary=pickle.load(infile)
infile.close()

### Prepare two pieces of information from the text data, which we will need to compute SIF Sentence Embeddings (MAP) of any given sentence 

* SIF Sentence Embedding is from: "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" https://github.com/PrincetonML/SIF
* To do SIF embeddings, we need to prep functions and two pieces of information from the raw text: (1) frequency weights for each word, and (2) the "common discourse vector" ($C_0$)

In [30]:
file = open("", 'rb') 
#this is the text data, which has already been turned into trigrams and bigrams, cleaned, and tokenized. Each record is a word-tokenized "document" (a list of tokens).
#note that the cells below access it as a pandas DataFrame with 'narcme_toked' and 'narle_toked' columns.
corpus= pickle.load(file) 
file.close()

In [1]:
#len(corpus) #list of 307249 narratives

**1. The first input to SIF embeddings is an estimate of the frequency weights (based on probabilities) for each word in the corpus. Compute this here.**
* This will naturally downweight stopwords when we compute a sentence embedding. It requires the raw text data of the corpus. 
* Either train a dictionary of weights (1), or upload a saved dictionary (2). 
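
To make the weighting concrete, here is a minimal sketch of the SIF weight $a/(a + p(w))$ on made-up counts. The function `sif_weights` and the toy counts are hypothetical; in this notebook the weights come from the Word2Vec vocabulary counts via `get_freq_dict`.

```python
from collections import Counter

def sif_weights(token_counts, a=1e-3):
    """Map each word to a / (a + p(word)), where p(word) is its relative frequency."""
    total = sum(token_counts.values())
    return {word: a / (a + count / total) for word, count in token_counts.items()}

# Hypothetical counts: one very frequent word, one moderately frequent, one rare.
toy_counts = Counter({"the": 9000, "victim": 900, "overdose": 100})
toy_weights = sif_weights(toy_counts, a=1e-3)

for word, weight in toy_weights.items():
    print(word, round(weight, 4))
```

Frequent words such as "the" receive a weight close to zero, while rare words keep a much larger weight, which is why stopwords contribute little to the window embedding.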

In [32]:
#weight_a controls how strongly frequent words are down-weighted. Arora et al. report that values between .0001 and .001 work best; a lower value of a down-weights frequent words more aggressively relative to less frequent words.
def get_freq_dict(w2vmodel, weight_a=.001):
    counts = {word: w2vmodel.wv.vocab[word].count for word in w2vmodel.wv.vocab} #raw word counts (Gensim 3.x API)
    total= sum(counts.values())
    freq_dictionary = {word: weight_a/(weight_a + (count / total)) for word, count in counts.items()} #SIF weight: a/(a + p(w))
    return(freq_dictionary)

#function to yield a weighted sentence, using the above weight dictionary
def get_weighted_sent(tokedsent,freq_dict, w2vmodel=w2vmodel): 
    weightedsent= [freq_dict[word]*w2vmodel.wv[word] for word in tokedsent if word in freq_dict] #skip words that are not in the vocabulary
    return(sum(weightedsent)/len(weightedsent)) #average of the weighted word vectors: their sum divided by the number of in-vocabulary words. Raises ZeroDivisionError if no word is in the vocabulary.


In [34]:
freq_dict= get_freq_dict(w2vmodel, weight_a=.001) 

**2. The second input to SIF embeddings is $C_0$, the common discourse vector. Compute this here.**
* Get this with a random sample of discourse vectors since the data is so large, or compute using all narratives. 
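
The idea behind $C_0$ can be sketched on synthetic data: take the top singular vector of a matrix of context vectors, then subtract each vector's projection onto it. This sketch swaps in `np.linalg.svd` for the `TruncatedSVD` used in this notebook to keep the example small; on a small dense matrix both should recover the same direction up to sign. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical context vectors: one shared direction plus small random noise.
shared = np.array([1.0, 0.0, 0.0, 0.0])
toy_vecs = 3.0 * shared + 0.1 * rng.standard_normal((200, 4))

# Top right singular vector = the dominant common direction (our toy c0).
_, _, vt = np.linalg.svd(toy_vecs, full_matrices=False)
toy_c0 = vt[0]

# Remove each vector's projection onto c0.
toy_cleaned = toy_vecs - np.outer(toy_vecs @ toy_c0, toy_c0)

print(abs(toy_c0 @ shared))                 # close to 1: c0 recovers the shared direction
print(np.abs(toy_cleaned @ toy_c0).max())   # close to 0: nothing is left along c0
```

After removal, the cleaned vectors are orthogonal to the common direction, so what remains reflects what distinguishes one context window from another.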

In [35]:
def samp_cts(docs, n_sample, windowsize, freq_dictionary):
    sampnarrs=  sample(docs, n_sample) #sample of narratives. Takes two random windows, and the discourse vector of each window, from each sampled narrative. 
    sampvecs= []

    t1_start = perf_counter()  

    for i in sampnarrs: #adjusting here to corpus sample, but consider using full corpus for final SIF embeddings. 
        if len(i)>windowsize: #only use narratives longer than windowsize words
            for _ in range(2): #two random windows per narrative
                n= sample(range(0,len(i)-windowsize), 1)[0] #random start position in the narrative (at least windowsize steps before the end)
                sent= i[n:n+windowsize] #random context window 
                if any(w in freq_dictionary for w in sent): #skip windows with no in-vocabulary words
                    sampvecs.append(get_weighted_sent(sent, freq_dictionary)) #discourse vector of the window (not of the whole narrative)
    sampvecs= np.asarray(sampvecs)
    t1_stop = perf_counter()
    print(t1_stop-t1_start) #elapsed time in seconds
    return(sampvecs)

def get_c0(sampvecs):
    svd = TruncatedSVD(n_components=1, n_iter=10, random_state=0) #only keeping top component, using same method as in SIF embedding code
    svd.fit(sampvecs) #1st singular vector is now c_0
    return(svd.components_[0])

def remove_c0(comdiscvec, modcontextvecs):
    curcontextvec= [X - X.dot(comdiscvec.transpose()) * comdiscvec for X in modcontextvecs] #remove c_0 from all the cts
    curcontextvec=np.asarray(curcontextvec)
    return(curcontextvec)

In [None]:
sampvecs2_narcme= samp_cts(list(corpus['narcme_toked']), 50000, 10, freq_dict) #we used a random sample of 50,000 context vectors
sampvecs2_narcme= normalize(sampvecs2_narcme, axis=1) #l2 normalize the resulting context vectors

sampvecs2_narle= samp_cts(list(corpus['narle_toked']), 50000, 10, freq_dict) #we used random sample of 50,000 context vectors
sampvecs2_narle= normalize(sampvecs2_narle, axis=1) #l2 normalize the resulting context vectors

In [None]:
pc0_narcme= get_c0(sampvecs2_narcme)
pc0_narle= get_c0(sampvecs2_narle)

In [None]:
sampvecs2_narcme = remove_c0(pc0_narcme, sampvecs2_narcme) 
sampvecs2_narle = remove_c0(pc0_narle, sampvecs2_narle) 

###  Resulting function to get SIF MAPs along rolling windows, for a given narrative. 

* This is the function we use to find rolling windows and assign MAPs to them, for a given narrative.
* Note that the function hardcodes our embedding size, which was 200 dimensions; change the `reshape(1,200)` calls if your embeddings have a different dimensionality. 
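
For intuition about the rolling windows themselves, here is a minimal sketch that slides a fixed-size window over a token list one token at a time. The generator `rolling_windows` and the toy tokens are hypothetical; `sif_atom_seqs` below builds its windows inline with an iterator instead.

```python
from collections import deque

def rolling_windows(tokens, window_size):
    """Yield every run of `window_size` consecutive tokens, advancing one token at a time."""
    window = deque(maxlen=window_size)  # the oldest token drops out automatically
    for tok in tokens:
        window.append(tok)
        if len(window) == window_size:
            yield list(window)

toy_tokens = ["victim", "found", "at", "home", "by", "neighbor"]
toy_windows = list(rolling_windows(toy_tokens, 4))

for w in toy_windows:
    print(w)
```

A narrative with N in-vocabulary tokens yields N - window_size + 1 windows, so the atom sequence for a narrative has that many entries.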

In [39]:
def sif_atom_seqs(toked_narrative, window_size, topics_dictionary, c0_vector, freq_dict, w2vmodel): 
    
    toked_narr2 = [i for i in toked_narrative if i in w2vmodel.wv.vocab] #remove words not in vocab
    if len(toked_narr2)> 19 :  #only narratives with at least 20 tokens in the w2v model vocab are considered. 
        it = iter(toked_narr2) 
        win = [next(it) for cnt in range(0,window_size)] #first context window
        MAPs= normalize(remove_c0( c0_vector, get_weighted_sent(win, freq_dict, w2vmodel).reshape(1,200))) #doing the SIF map here. Hardcoding in the dimensionality of the space to speed this up.
        for e in it: # Subsequent windows
            win[:-1] = win[1:]
            win[-1] = e
            MAPs = np.vstack((MAPs, normalize(remove_c0(c0_vector, get_weighted_sent(win, freq_dict, w2vmodel).reshape(1,200)))))  #this will be matrix of MAPs

        costri= linear_kernel(MAPs, topics_dictionary) 
        atomsseq= np.argmax(costri, axis=1) #this is for the index of the closest atom to each of the MAPs
        #maxinRow = np.amax(costri, axis=1) #this is for the closest atom's cossim value to each of the maps
        return(atomsseq.tolist()) #returns sequence of the closest atoms to the MAPs
    else:
        return(None)


Sample usage of sif_atom_seqs, on a single narrative:

In [None]:
sif_atom_seqs(corpus['narcme_toked'].iloc[0], 10, dictionary, pc0_narcme, freq_dict, w2vmodel) #get SIFs, then get atoms. Returns the closest atom for each rolling window of the narrative

## Transform Documents in a Corpus into Sequences of Atoms 

* This is the **final result** we want from all code above
* First, get c0 from narcme narratives, and then get the atom sequence for the narcme narratives
* Then, get c0 from narle narratives, and then get the atom sequence for the narle narratives

Get SIF Atom Seqs on NARCME narratives

In [41]:
corpus['narcme_atom_seq']= corpus['narcme_toked'].apply(lambda x: sif_atom_seqs(x, 10, dictionary , pc0_narcme, freq_dict, w2vmodel) )

#convert the atom seq to a string of atoms, since this format is needed for the vectorizer (and easier to work with later) in a CSV, too
corpus['narcme_atom_seq'] = corpus['narcme_atom_seq'].apply(lambda x: ' '.join([str(elem) for elem in x])  if(np.all(pd.notnull(x)))  else x ) #https://thispointer.com/python-how-to-convert-a-list-to-string/

300.2712618999999


Get SIF Atom Seq on NARLE narratives

In [42]:
corpus['narle_atom_seq']= corpus['narle_toked'].apply(lambda x: sif_atom_seqs(x, 10, dictionary , pc0_narle, freq_dict, w2vmodel) )

#convert the atom seq to a string of atoms, since this format is needed for the vectorizer (and easier to work with later) in a CSV, too
corpus['narle_atom_seq'] = corpus['narle_atom_seq'].apply(lambda x: ' '.join([str(elem) for elem in x])  if(np.all(pd.notnull(x)))  else x ) #https://thispointer.com/python-how-to-convert-a-list-to-string/

231.07621459999973


## Reformatting the Resulting Atom Sequences into Variables, by Vectorizing the Atoms

Replace missing atom sequences (NAs, from narratives that were empty or too short) with empty strings to avoid errors, and combine the two sequences (i.e., narcme and narle sequences)

In [22]:
corpus['narcme_atom_seq'] = corpus['narcme_atom_seq'].fillna('')
corpus['narle_atom_seq'] = corpus['narle_atom_seq'].fillna('')
corpus['narcme_narle_atom_seq_combined'] = corpus["narcme_atom_seq"].map(str) + ' ' + corpus["narle_atom_seq"].map(str) #combine these, adding ' ' in the middle; this works fine even if either the narle or the narcme sequence is empty

Transform each sequence into a distribution over topics

In [24]:
#token_pattern is needed because the default pattern only keeps tokens of 2+ characters, which would drop the single-digit atom IDs (0-9).
#use_idf=False with norm='l1' gives each document's atom proportions, so that longer documents don't get more weight (l1 normalizes with absolute values, but all our values are positive anyway).
#atoms that never occur in any document get no column.
bow_transformer = TfidfVectorizer(analyzer = 'word', norm='l1', use_idf=False, token_pattern=r'\S+')
bow_transformer.fit(corpus['narcme_narle_atom_seq_combined'].dropna(inplace=False)) #input needs to be in format 'word word word', NOT tokenized already

vecked = bow_transformer.transform(corpus['narcme_narle_atom_seq_combined'].tolist()).toarray() #this is the "feature" data, now in an array for sklearn models
corpus = pd.concat([corpus,pd.DataFrame(vecked, columns = bow_transformer.get_feature_names(), index=corpus.index)], axis=1, sort=False) #in scikit-learn 1.0+, use get_feature_names_out() instead
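
With `use_idf=False` and `norm='l1'`, each row produced by the vectorizer is the share of windows assigned to each atom. The following minimal sketch reproduces that arithmetic for a single document in plain Python; the helper `atom_distribution` and the toy sequence are hypothetical and only illustrate what the vectorizer settings above amount to.

```python
from collections import Counter

def atom_distribution(atom_seq_string):
    """Turn a space-separated atom sequence into a dict of atom -> proportion of windows."""
    counts = Counter(atom_seq_string.split())
    total = sum(counts.values())
    if total == 0:  # empty sequence (e.g., a narrative that was too short)
        return {}
    return {atom: n / total for atom, n in counts.items()}

toy_dist = atom_distribution("12 7 12 12 40")
print(toy_dist)
```

Here atom 12 accounts for three of the five windows, so its proportion is 0.6, and the proportions for any non-empty sequence sum to 1.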

Save CSV with final atom assignments

In [25]:
corpus.to_csv("") 