## Topic Modeling Guide

This notebook is a guide to creating topic models from the words contained in the articles. It uses the [LDA](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) (Latent Dirichlet Allocation) topic modeling technique to create interactive HTML topic model visualizations. Check the quality of the data before proceeding; ideally, the text should already be pre-processed. A code cell containing the pre-processing algorithm is included in the NGramsGuide.ipynb file if needed.

## Setup

Import the necessary packages. You may need to [install](https://packaging.python.org/en/latest/tutorials/installing-packages/) any packages that are not already available in your Python kernel.

In [1]:
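import os
import pickle
from pprint import pprint

import gensim
import gensim.corpora as corpora
import pandas as pd
import pyLDAvis
# In pyLDAvis 3.x this module was renamed to pyLDAvis.gensim_models;
# adjust this import (and the call in create_topic_model) if needed.
import pyLDAvis.gensim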

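Load the data set(s). Change the argument to the path on your local computer that leads to the file. For example, if the file is in your Downloads folder, the path may look like `/Users/firstnamelastname/Downloads/Bridging Racial Violence Compiled Data.xlsx`. The cell below uses `pd.read_csv`, so the file must be a CSV; for an Excel file like the one in this example, use `pd.read_excel` instead.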

In [3]:
atl41 = pd.read_csv('/Users/clairefenton/Desktop/Emory/BRV Research/Data/ATL_1941_new_preprocessing.csv')
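
As a sketch of building such a path without typing the user name by hand, the standard-library `pathlib` module can start from the home directory. The folder and file name below are placeholders borrowed from the examples in this guide, not a required location.

```python
from pathlib import Path

# Assumption: the CSV sits in the Downloads folder of the current user.
# Replace the folder and file name with wherever your copy actually lives.
data_path = Path.home() / "Downloads" / "ATL_1941_new_preprocessing.csv"

print(data_path)           # the full path on this machine
print(data_path.exists())  # False means the path needs adjusting
```

The resulting `data_path` can be passed directly to `pd.read_csv`.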

## Word List Creation

Create a word list for the topic models using the following parameters:
* df: the data set
* text_col: a string value representing the name of the column in df containing the pre-processed text
* words_to_remove: a list of strings containing words to be removed from the word list

The `words_to_remove` parameter should be used to remove common words that add no value to the topic models (e.g. "say," "was," "they").

In [6]:
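def create_word_list(df, text_col, words_to_remove):
    # Split each document's pre-processed text into a list of words
    word_list = [text.split() for text in df[text_col]]

    # Build one cleaned list per document, dropping every unwanted word
    remove = set(words_to_remove)
    cleaned_word_list = [[word for word in words if word not in remove]
                         for words in word_list]
    return cleaned_word_list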
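
To make the expected result concrete, here is a small sketch of the same split-and-filter idea on two made-up sentences. It uses a plain list of strings in place of the data set column, and the sample text and variable names are illustrative only.

```python
# Made-up, already pre-processed documents standing in for df[text_col]
documents = ["mayor say city was quiet", "they say there was a meeting"]
words_to_remove = ["say", "was", "they"]

# Split each document into words, then drop every unwanted word
remove = set(words_to_remove)
word_list = [[w for w in doc.split() if w not in remove] for doc in documents]

print(word_list)
# [['mayor', 'city', 'quiet'], ['there', 'a', 'meeting']]
```

Each inner list corresponds to one article, so the result has one list per row of the data set.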

## Topic Model Creation

Create a topic model using the following parameters:
* word_list: the word list created from the `create_word_list` function
* n: an integer value representing the number of topics to be included in the model
* year: a string value representing the year(s) from which the data comes
* file_path: a string value representing the file path to which the HTML topic models should be saved 

The function saves the prepared visualization data as a pickle file and writes the interactive visualization to `file_path` as `BRV_<year>_<n>.html`. Include the trailing slash in `file_path`, because the file name is appended to it directly.

In [13]:
def create_topic_model(word_list, n, year, file_path):
    # Map each unique word to an integer id, then convert every
    # document into a bag-of-words list of (word id, count) pairs
    id2word = corpora.Dictionary(word_list)
    corpus = [id2word.doc2bow(text) for text in word_list]

    # Train an LDA model with n topics
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=n)

    # Prepare the interactive visualization and cache it as a pickle file
    pyLDAvis.enable_notebook()
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
    LDAvis_data_filepath = file_path + str(n)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

    # Save the visualization as HTML, e.g. BRV_1941_27.html
    pyLDAvis.save_html(LDAvis_prepared, file_path + 'BRV_' + year + '_' + str(n) + '.html')

    # Return the prepared data so a notebook cell can display it inline
    return LDAvis_prepared
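
The `Dictionary` and `doc2bow` steps turn each article into a bag of words: a list of (word id, count) pairs. The sketch below imitates that idea in plain Python so it runs without gensim. It is an illustration of the concept, not gensim's implementation, and the ids it assigns will not necessarily match those gensim would assign.

```python
from collections import Counter

# Toy word list in the shape returned by create_word_list
word_list = [["city", "mayor", "city"], ["mayor", "meeting"]]

# Assign each unique word an integer id (here: alphabetical order)
vocabulary = sorted({word for doc in word_list for word in doc})
word_to_id = {word: i for i, word in enumerate(vocabulary)}

# Count how often each word id occurs in each document
corpus = [sorted(Counter(word_to_id[w] for w in doc).items()) for doc in word_list]

print(word_to_id)  # {'city': 0, 'mayor': 1, 'meeting': 2}
print(corpus)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order within an article is discarded in this representation; only the counts are kept, which is all the LDA model uses.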

If desired, filter the data set to only the articles containing instances of racial violence before creating the word list.

In [None]:
atl41_rv = atl41[atl41['entry'] == 1]
word_list_41 = create_word_list(atl41_rv, 'text_x', ['say', 'there'])

The example below creates a topic model with 27 topics from the Atlanta Daily World 1941 articles containing instances of racial violence. The file is saved to `/Users/clairefenton/Downloads/` with the name `BRV_1941_27.html`.

In [None]:
create_topic_model(word_list_41, 27, '1941', '/Users/clairefenton/Downloads/')

To create multiple topic models from the same data, a loop can be helpful. The loop below creates topic models in five-topic increments, starting at 25 topics and ending at 45 topics (the `range` stop value of 50 is excluded).

In [40]:
for i in range(25, 50, 5):
    create_topic_model(word_list_41, i, '1941', '/Users/clairefenton/Downloads/')
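
As a quick check before running the loop, the sketch below lists the topic counts that `range(25, 50, 5)` produces and the HTML file names that follow from the naming pattern in `create_topic_model`. It only builds strings and does not train any models.

```python
year = '1941'
file_path = '/Users/clairefenton/Downloads/'

# range stops before 50, so the last topic count is 45
topic_counts = list(range(25, 50, 5))
html_files = [file_path + 'BRV_' + year + '_' + str(n) + '.html' for n in topic_counts]

print(topic_counts)  # [25, 30, 35, 40, 45]
for name in html_files:
    print(name)
```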