# Topic Model Creation and Inspection

This notebook trains a set of topic models. You must have a dictionary file (extension: .lda_dict) and a corpus in Matrix Market format (extension: .mm). If you do not have these files, the script will fail. Use the Preparation Notebook to create them. If you do not have docbins, create them in the Corpus Linguistics notebook before using the Preparation Notebook for Topic Modelling.

Topic modelling has been introduced in class and you can read more about it in our readings (particularly Nelson, Lukito and Pruden, and Ganesh and Faggiani). 

You need to decide how many topics you want the model to give you. If you ask for 5 topics, you will get 5; if you ask for 50, you will get 50. We call this number of topics "k". We will refer to "k" very often; remember, it just means the number of topics!

You should select at least 3 k values, e.g. 5, 10, 15. The example cell in this notebook trains every k from 2 to 10, just for the sake of demonstration. You should then evaluate each model using the visualization you will find in each model's folder (discussed in class). If you are satisfied with a model, you can use it, and in the following notebook we will transform your original dataset with topic probabilities. First, of course, we have to train the models and inspect them. The files you need to do that are included here.
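As a small illustration of how a list of k values can be written in plain Python, here is a minimal sketch. The names `ks_small`, `ks_sweep` and `ks_test` are hypothetical and only used here; the notebook itself expects a single list called `ks`.

```python
# Three widely spaced values, as suggested above
ks_small = [5, 10, 15]

# Every value from 2 to 10 inclusive (range() stops before its end point)
ks_sweep = list(range(2, 11))

# A single value is fine for a quick test, as long as it is still a list
ks_test = [10]

print(ks_small)
print(ks_sweep)
print(ks_test)
```

Whichever list you settle on, assign it to `ks` in the setup cell so the training code picks it up.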

Note that for each run (that is, for each k value you choose), this notebook may need to run for a long time. If it takes an extremely long time on your computer, please contact me and I can run the computation for you (as this is a hardware limitation).

In the cell below, set up your variables so the script knows where to look. This is similar to the Preparation notebook.

In [2]:
from drs_topic_model import *

# Here, you will add in the appropriate information so that the code below will work

# spacy_model is the language model you are using; the default value should be correct for English. If you are using a different model, make sure to use the name of the model exactly as it is given, which you can find in the model section of spacy.io
spacy_model = "en_core_web_sm"

# project save name is the name of your project and is used to create an easy to read name for the different folders and files the script will create automatically
project_save_name = "anxiety"

# remove_stopwords is whether or not you want the script to remove "non-lexical" words, such as "the" or "and". Change the value to False if you wish to include these words.
# if you set it to True when you did the dictionary and corpus in the previous notebook, this MUST BE THE SAME! Otherwise you may get errors or completely incorrect results!
remove_stopwords = True 

# IMPORTANT: This is where you set your k values. This example trains every k from 2 to 10, but you may want totally different numbers, e.g. [5, 10, 15].
# this must be a list of numbers separated by commas; the default is just an example. You can also just put in one value as long as it is in square brackets, e.g. [10] for k=10. This will only train one model, which is good just to test the code.
ks = list(range(2, 11)) # you should always try multiple k values! this way you can compare models rather than just trust that one worked properly

### Create Topic Models

Just run the code cell below to create the models. This might take some time!

If you have errors loading the corpus or dictionary, you might need to check you have the right project_save_name.
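One way to check this before training is to confirm that both input files exist at the paths the training cell reads from. The sketch below is only an illustration: `missing_inputs` is a hypothetical helper, not part of `drs_topic_model`, and it assumes the same `../../intermediate_data/` folder layout that the training cell uses.

```python
from pathlib import Path


def missing_inputs(project_save_name, base_dir="../../intermediate_data"):
    """Return the expected input files that do not exist on disk."""
    # Assumed layout: <base_dir>/<name>_topic_models_params/<name>.<extension>
    params_dir = Path(base_dir) / f"{project_save_name}_topic_models_params"
    expected = [
        params_dir / f"{project_save_name}.lda_dict",
        params_dir / f"{project_save_name}.mm",
    ]
    return [path for path in expected if not path.exists()]


# Report any file that cannot be found for the example project name
for path in missing_inputs("anxiety"):
    print(f"Missing: {path}")
```

If nothing is printed, both files were found. If a file is reported missing, compare the folder name on disk with your `project_save_name`.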

In [3]:
# Now we train the topic model. If you want to use the other notebook, you can stop here; there is no need to run this cell.
# Before running this cell, double check the values you used for k. Are they what you want? Don't try more than 3 k values the first time you run this, so you know how long it will take.
# This will try to use as much of your CPU as possible, so be prepared to leave your computer alone for a while, and try to make sure it is connected to a power source.

print("Loading dictionary and corpus.")

# load up our dictionary and corpus files from disk

#dict_file = f"./{project_save_name}_topic_models_intermediate_data/{project_save_name}.lda_dict"
dict_file = f"../../intermediate_data/{project_save_name}_topic_models_params/{project_save_name}.lda_dict"

#corpus_file = f"./{project_save_name}_topic_models_intermediate_data/{project_save_name}.mm"
corpus_file = f"../../intermediate_data/{project_save_name}_topic_models_params/{project_save_name}.mm"

dictionary = Dictionary.load(dict_file)
corpus = MmCorpus(corpus_file)

# now we train the topic model
# to handle the file structure, we first make sure the output folder exists.
# exist_ok=True means nothing happens if the folder is already there,
# so the models are created either way.

print("Processing topic models. This will take at least a few minutes for each 'k' or number of topics you selected.")

topic_model_directory = "../../output_data/" + project_save_name + "_topic_models/"

os.makedirs(topic_model_directory, exist_ok=True)

create_topic_model(
    corpus,
    dictionary,
    ks,
    project_directory = topic_model_directory
)

print("Finished generating topic models. Examine each one, and either run this notebook again with a different set of k values (you can skip the ones you have already trained to save computing time) or move on to the dataframe transformation notebook.")

Loading dictionary and corpus.
Processing topic models. This will take at least a few minutes for each 'k' or number of topics you selected.
Computing model, k = 2
     Computing evaluation data...
     Saving Model...
     Generating topic/term display...
     Generating interactive topic visualisation...
Computing model, k = 3
     Computing evaluation data...
     Saving Model...
     Generating topic/term display...
     Generating interactive topic visualisation...
Computing model, k = 4
     Computing evaluation data...
     Saving Model...
     Generating topic/term display...
     Generating interactive topic visualisation...
Computing model, k = 5
     Computing evaluation data...
     Saving Model...
     Generating topic/term display...
     Generating interactive topic visualisation...
Computing model, k = 6
     Computing evaluation data...
     Saving Model...
     Generating topic/term display...
     Generating interactive topic visualisation...
Computing model, k = 7
 