# Text Pre-Processing for Visualization with pyLDAvis

This notebook demonstrates the processing steps needed to take raw text and format it so that it can be visualized with [pyLDAvis](https://pypi.org/project/pyLDAvis/).

## Imports

First, import the required Python packages and load the data into a pandas DataFrame.

In [2]:
import pyLDAvis
import json
import numpy as np
import pandas as pd
from nltk.stem import snowball
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
from itertools import compress

In [3]:
file_name = "Isla Vista - All Excerpts - 1_2_2019.xlsx"
data = pd.read_excel(file_name, sheet_name='Dedoose Excerpts Export')
print(data.shape)
data = data.dropna(axis=0)
print(data.shape)
print(data.columns)

## Compute Components required for pyLDAvis

The input to pyLDAvis requires computing several things from the documents:

* vocab: all the unique terms in the full dataset
* document lengths: the length of each excerpt
* topic-term distributions: for each topic, compute the frequency of terms in the vocab
* document-topic distribution: for each document, compute the distribution of topics it contains
* term frequency: the frequency of each term in the full dataset

Each of these is computed from the topic labels that the human annotators assigned to this dataset.
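
As a rough illustration of how these five inputs fit together, the sketch below builds them by hand for a toy corpus of three documents and two topics, then checks the constraints they have to satisfy. All values are invented, and `check_inputs` is a hypothetical helper written for this example, not part of pyLDAvis.

```python
import math

# Toy inputs: 2 topics, 3 documents, 4 vocabulary terms (all values invented).
vocab = ["student", "campus", "vigil", "law"]
term_frequency = [5, 4, 2, 1]          # corpus-wide count of each vocab term
doc_lengths = [6, 4, 2]                # number of tokens in each document
topic_term_dists = [                   # one row per topic, one column per term
    [0.5, 0.3, 0.2, 0.0],
    [0.25, 0.25, 0.0, 0.5],
]
doc_topic_dists = [                    # one row per document, one column per topic
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]

def check_inputs(topic_term, doc_topic, lengths, terms, freqs):
    """Hypothetical helper: verify that the five inputs are mutually consistent."""
    n_topics, n_terms, n_docs = len(topic_term), len(terms), len(lengths)
    assert len(freqs) == n_terms
    assert all(len(row) == n_terms for row in topic_term)
    assert len(doc_topic) == n_docs
    assert all(len(row) == n_topics for row in doc_topic)
    # every distribution row should sum to 1
    assert all(math.isclose(sum(row), 1.0) for row in topic_term)
    assert all(math.isclose(sum(row), 1.0) for row in doc_topic)
    return n_topics, n_docs, n_terms

print(check_inputs(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency))
```

The same relationships should hold for the real data: the topic-term matrix has one column per vocab entry, and the document-topic matrix has one row per document.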

### Build Vocab and Term Frequencies

The following code tokenizes and stems the text, then uses the resulting tokens to compute the vocab, the term frequencies, and the document lengths.

In [79]:
excerpts = list(data['Excerpt'])

In [80]:
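# build the stemmer once; ignore_stopwords=True leaves English stopwords unstemmed
stemmer = snowball.SnowballStemmer("english", ignore_stopwords=True)

def stem_tokenizer(doc):
    """Tokenize a document, stem each token, and keep lowercase alphabetic tokens."""
    tokens = word_tokenize(doc)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    return [tok.lower() for tok in stemmed_tokens if tok.isalpha()]

stem_tokenizer(excerpts[3])[0:10]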

['a',
 'student',
 'last',
 'friday',
 'kill',
 'six',
 'peopl',
 'and',
 'wound',
 'more']

In [81]:
# create a subset of the data frame that holds only the main account label types
main_types_df = pd.DataFrame()
for col_name in data.columns:
    # candidate columns have upper-case names without an underscore
    if '_' not in str(col_name) and str(col_name).isupper():
        # skip columns that contain any value greater than 1
        if not any(data[col_name].values > 1):
            main_types_df[col_name] = data[col_name]
            
main_types_df.index = range(1, main_types_df.shape[0]+1)

# drop rows and excerpts with no label
# build vocab and doc_lengths
all_words = []
doc_lengths = []
main_types_excerpts = []
for idx, doc in enumerate(excerpts):
    if sum(main_types_df.loc[idx+1]) < 1:
        # if this document had no main type label
        main_types_df = main_types_df.drop([idx+1], axis = 0)
    else:
        main_types_excerpts.append(doc)
        doc_toks = stem_tokenizer(doc)
        all_words.extend(doc_toks)
        doc_lengths.append(len(doc_toks))
fdist = FreqDist(all_words)
fdistmc = fdist.most_common()
vocab = [word for word, count in fdistmc]
term_frequency = [count for word, count in fdistmc]
print("number of documents: "+str(len(doc_lengths)))

number of documents: 7453
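
To see what these three quantities look like on a small input, here is a minimal sketch on an invented three-document corpus. It assumes a simplified whitespace tokenizer (`simple_tokenizer`, a stand-in for `stem_tokenizer`) and uses `collections.Counter`, which `FreqDist` subclasses, so `most_common()` behaves the same way.

```python
from collections import Counter

def simple_tokenizer(doc):
    """Stand-in tokenizer: lowercase, split on whitespace, keep alphabetic tokens."""
    return [tok for tok in doc.lower().split() if tok.isalpha()]

toy_docs = ["Students held a vigil", "A vigil for students", "Campus vigil"]

toy_words = []
toy_doc_lengths = []
for doc in toy_docs:
    toks = simple_tokenizer(doc)
    toy_words.extend(toks)
    toy_doc_lengths.append(len(toks))

# most_common() sorts by descending count, so vocab and frequencies stay aligned
counts = Counter(toy_words).most_common()
toy_vocab = [word for word, _ in counts]
toy_term_frequency = [count for _, count in counts]

print(toy_vocab[0], toy_term_frequency[0], toy_doc_lengths)
```

Because both lists come from the same `most_common()` result, position `k` in the vocab always matches position `k` in the term frequencies, and the term frequencies add up to the total of the document lengths.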


### Build Topic-Term Distributions

For each topic, compute the distribution of terms across all the documents labeled with that topic.
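
The idea can be sketched on toy data first. In the example below, the documents, labels, and the `topic_term_distribution` helper are all invented for illustration. Tokens from every document carrying a label are pooled, and counts are turned into relative frequencies over the shared vocab. Terms that never occur in a topic get the smallest positive float instead of an exact zero, which mirrors the `np.nextafter` call used in this notebook.

```python
from collections import Counter

toy_vocab = ["vigil", "student", "law", "gun"]

# tokenized documents with their (possibly multiple) topic labels; all invented
toy_docs = [
    (["vigil", "student", "vigil"], {"MOURNING"}),
    (["law", "gun", "gun", "student"], {"POLICY"}),
    (["vigil", "law"], {"MOURNING", "POLICY"}),
]

# smallest positive double, the same value as np.nextafter(0.0, 1.0)
TINY = 5e-324

def topic_term_distribution(label):
    """Relative frequency of each vocab term in the documents tagged with `label`."""
    counts = Counter(
        tok for toks, labels in toy_docs if label in labels for tok in toks
    )
    total = sum(counts.values())
    return [counts[word] / total if counts[word] else TINY for word in toy_vocab]

mourning = topic_term_distribution("MOURNING")
policy = topic_term_distribution("POLICY")
print(mourning)
print(sum(policy))
```

A document with two labels contributes its tokens to both topics, which is also what happens in the full computation that follows.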

In [82]:
stop_words = set(stopwords.words('english'))
freq_dist_dict = {}
topic_size = []
topic_num_words = []
for i, coln in enumerate(main_types_df.columns, start=1):
    # select the excerpts labeled with this topic
    categ_excerpts = list(compress(main_types_excerpts, main_types_df[coln].values))
    exq = [stem_tokenizer(doc) for doc in categ_excerpts]
    excerpt_words = [tok for tok_list in exq for tok in tok_list]
    topic_size.append(len(exq))
    topic_num_words.append(len(excerpt_words))
    print("Topic "+str(i)+": "+coln+" number of excerpts: "+str(len(exq)))
    # drop stopwords before counting the term frequencies for this topic
    words = [word for word in excerpt_words if word.lower() not in stop_words and word.isalpha()]
    freq_dist_dict[coln] = FreqDist(words)

Topic 1: ACCOUNT number of excerpts: 1799
Topic 2: EVENT number of excerpts: 1280
Topic 3: GRIEF number of excerpts: 646
Topic 4: HERO number of excerpts: 13
Topic 5: INVESTIGATION number of excerpts: 301
Topic 6: JOURNEY number of excerpts: 224
Topic 7: LEGAL number of excerpts: 15
Topic 8: MEDIA number of excerpts: 157
Topic 9: MISCELLANEOUS number of excerpts: 52
Topic 10: MOURNING number of excerpts: 671
Topic 11: PERPETRATOR number of excerpts: 739
Topic 12: PHOTO number of excerpts: 317
Topic 13: POLICY number of excerpts: 816
Topic 14: RACECULTURE number of excerpts: 23
Topic 15: RESOURCES number of excerpts: 190
Topic 16: SAFETY number of excerpts: 168
Topic 17: SOCIALSUPPORT number of excerpts: 310
Topic 18: THREAT number of excerpts: 103
Topic 19: TRAUMA number of excerpts: 696
Topic 20: VICTIMS number of excerpts: 526


In [83]:
topic_term_dists = []
# smallest positive float, used instead of 0 for terms that never occur in a topic
tiny = np.nextafter(0.0, 1.0)

for coln in main_types_df.columns:
    ffdist = freq_dist_dict[coln]
    fdist = [ffdist.freq(word) if word in ffdist else tiny for word in vocab]
    print("categ: "+str(coln)+" len of freq dist "+str(len(fdist))+" sum of vector: "+str(sum(fdist)))
    topic_term_dists.append([float(i) for i in fdist])

categ: ACCOUNT len of freq dist 7464 sum of vector: 1.000000000000016
categ: EVENT len of freq dist 7464 sum of vector: 0.9999999999999862
categ: GRIEF len of freq dist 7464 sum of vector: 1.0000000000000384
categ: HERO len of freq dist 7464 sum of vector: 0.9999999999999978
categ: INVESTIGATION len of freq dist 7464 sum of vector: 1.0000000000000042
categ: JOURNEY len of freq dist 7464 sum of vector: 1.0000000000000069
categ: LEGAL len of freq dist 7464 sum of vector: 1.0000000000000029
categ: MEDIA len of freq dist 7464 sum of vector: 0.9999999999999828
categ: MISCELLANEOUS len of freq dist 7464 sum of vector: 1.000000000000019
categ: MOURNING len of freq dist 7464 sum of vector: 1.0000000000000138
categ: PERPETRATOR len of freq dist 7464 sum of vector: 1.0000000000000095
categ: PHOTO len of freq dist 7464 sum of vector: 1.0000000000000178
categ: POLICY len of freq dist 7464 sum of vector: 0.9999999999999629
categ: RACECULTURE len of freq dist 7464 sum of vector: 0.9999999999999936
categ: RESOURCE

### Document-Topic Distributions

Compute the distribution of topics for each document.

In [84]:
doc_topic_dists = []
for index, rowi in main_types_df.iterrows():
    row = list(rowi)
    total = sum(row)
    if total == 0:
        # a row without any label cannot be normalized; report it
        print(row)
    elif total > 1.01 or total < 0.99:
        # normalize the row so that it sums to 1
        row = [r/total for r in row]
    doc_topic_dists.append([float(i) for i in row])
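
The same row normalization can also be written in vectorized form. The following is a minimal sketch on an invented 0/1 label matrix, assuming every document has at least one label (which holds here because unlabeled excerpts were dropped earlier). It is an alternative formulation, not the code used in this notebook.

```python
import numpy as np

# Toy multi-hot label matrix: 3 documents x 4 topics (values invented)
labels = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
], dtype=float)

# keepdims=True keeps a column shape so the division broadcasts row by row
row_sums = labels.sum(axis=1, keepdims=True)
if np.any(row_sums == 0):
    raise ValueError("every document needs at least one topic label")

toy_doc_topic = labels / row_sums
print(toy_doc_topic.sum(axis=1))
```

A document with two labels ends up with weight 0.5 on each, and a document with three labels with one third on each, so the annotators' labels are treated as equally weighted topics.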

## Format for pyLDAvis

To pass these components to pyLDAvis, they must be collected in a dictionary. This dictionary can also be saved as JSON for loading at a later time.

The output visualization can be saved as HTML or displayed directly in a Jupyter notebook.

In [87]:
data_dict = {'topic_term_dists': topic_term_dists, 
            'doc_topic_dists': doc_topic_dists,
            'doc_lengths': doc_lengths,
            'vocab': vocab,
            'term_frequency': term_frequency}
print('Topic-Term shape: %s' % str(np.array(data_dict['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(data_dict['doc_topic_dists']).shape))

Topic-Term shape: (20, 7464)
Doc-Topic shape: (7453, 20)


In [89]:
# save data as json
with open('viz.json', 'w') as json_file:  
    json.dump(data_dict, json_file)
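
Loading the saved dictionary back is the mirror image of the cell above. The sketch below round-trips a small invented dictionary through a temporary file; the file name `toy_viz.json` and the values are made up for illustration.

```python
import json
import os
import tempfile

# Toy dictionary with the same five keys (values invented)
toy_dict = {
    "vocab": ["vigil", "student"],
    "term_frequency": [3, 2],
    "doc_lengths": [3, 2],
    "topic_term_dists": [[0.6, 0.4]],
    "doc_topic_dists": [[1.0], [1.0]],
}

path = os.path.join(tempfile.mkdtemp(), "toy_viz.json")
with open(path, "w") as json_file:
    json.dump(toy_dict, json_file)

# read the file back and confirm nothing changed
with open(path) as json_file:
    reloaded = json.load(json_file)

assert reloaded == toy_dict
```

This works because the dictionary holds plain Python lists and numbers. NumPy integer types such as `np.int64` are not JSON serializable and would need to be converted first.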

In [91]:
vis_data = pyLDAvis.prepare(**data_dict)
pyLDAvis.save_html(vis_data, 'viz.html')
pyLDAvis.display(vis_data)

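To interpret the visualization above, the topic numbers it displays must be mapped back to the original topic names. This can be done using the "topic_order" attribute of the object returned by pyLDAvis.prepare, which lists the original topic positions (counting from 1) in display order.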

In [92]:
# order the columns for pyldavis
col_order = vis_data.topic_order
categs = list(main_types_df.columns)
string_list = [""]*len(col_order)
for idx, i in enumerate(col_order):
    msg = "Topic "+str(idx+1)+": "+categs[i-1]+", number of words: "+str(topic_num_words[i-1])
    print(msg)
    string_list[idx] = msg

Topic 1: ACCOUNT, number of words: 176649
Topic 2: POLICY, number of words: 91699
Topic 3: EVENT, number of words: 98494
Topic 4: VICTIMS, number of words: 56075
Topic 5: PERPETRATOR, number of words: 77519
Topic 6: MOURNING, number of words: 48223
Topic 7: TRAUMA, number of words: 57295
Topic 8: GRIEF, number of words: 40609
Topic 9: PHOTO, number of words: 17948
Topic 10: INVESTIGATION, number of words: 20006
Topic 11: SOCIALSUPPORT, number of words: 21641
Topic 12: JOURNEY, number of words: 16370
Topic 13: MEDIA, number of words: 12788
Topic 14: RESOURCES, number of words: 12299
Topic 15: SAFETY, number of words: 11733
Topic 16: THREAT, number of words: 9304
Topic 17: MISCELLANEOUS, number of words: 4667
Topic 18: HERO, number of words: 1511
Topic 19: RACECULTURE, number of words: 1424
Topic 20: LEGAL, number of words: 1737
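
The lookup can be sketched in isolation. In the example below, the three category names are taken from this dataset, but the `topic_order` values are hypothetical and chosen only to show the 1-based indexing.

```python
categs = ["ACCOUNT", "EVENT", "GRIEF"]

# hypothetical ordering: displayed topic 1 is original topic 1,
# displayed topic 2 is original topic 3, displayed topic 3 is original topic 2
topic_order = [1, 3, 2]

# subtract 1 because topic_order counts from 1 while Python lists count from 0
display_to_label = {
    rank: categs[orig - 1] for rank, orig in enumerate(topic_order, start=1)
}
print(display_to_label)
```

Keeping the mapping in a dictionary makes it easy to look up the label for any topic number shown in the visualization.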


In [1]:
# print only the category names, in the order used by pyLDAvis
col_order = vis_data.topic_order
categs = list(main_types_df.columns)
for i in col_order:
    print(categs[i-1])