# Preparation Notebook for Topic Modelling

In this notebook, we will complete all the steps required to create a topic model, which is an automated text labelling technique. You can read more about topic models and their use in Nelson, Lukito and Pruden, and Ganesh and Faggiani. 

Topic modelling is computationally expensive, so we first need to create two files that store our data efficiently. The first is a "dictionary", which is a database that maps each word to a numeric ID. For example, the word "cat|NOUN" might be represented in this dictionary by the number 1253 (the exact number is arbitrary; this is just to give you an idea).
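To make the idea of a word-to-ID mapping concrete, here is a minimal sketch in plain Python. It is not how `create_gensim_dictionary` works internally; the function name `build_dictionary` and the sample tokens are made up for illustration, and the IDs in your own dictionary will differ.

```python
def build_dictionary(documents):
    """Give each distinct token the next unused integer ID, in order of first appearance."""
    token_to_id = {}
    for tokens in documents:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


# Hypothetical, already-tokenised documents in the "word|POS" style described above
documents = [
    ["cat|NOUN", "sit|VERB", "mat|NOUN"],
    ["dog|NOUN", "chase|VERB", "cat|NOUN"],
]

token_to_id = build_dictionary(documents)
print(token_to_id["cat|NOUN"])  # 0
print(len(token_to_id))         # 5
```

Note that "cat|NOUN" appears in both documents but only receives one ID, which is what lets every later step refer to the word by a single number.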

Then we need a corpus. Our computers usually don't have enough memory (RAM) to hold all the documents in a dataset at once, so we will represent the words in each document as a series of numbers and store them in a file. We will generate a MatrixMarket corpus, which handles all of this for us.
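As a rough illustration of what "a series of numbers" means here, the sketch below counts word IDs per document (a "bag of words") and writes the counts in the Matrix Market coordinate text format, with one row per document and one column per word. The helper names and the toy data are hypothetical, and this is not the code that `generate_mmcorpus` runs; it only shows the general shape of the file.

```python
from collections import Counter


def to_bag_of_words(tokens, token_to_id):
    """Turn one document into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


def to_matrix_market(bags, num_terms):
    """Render bag-of-words documents as Matrix Market coordinate text (1-based indices)."""
    entries = [
        (doc_no, token_id + 1, count)
        for doc_no, bag in enumerate(bags, start=1)
        for token_id, count in bag
    ]
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{len(bags)} {num_terms} {len(entries)}",  # rows, columns, non-zero entries
    ]
    lines += [f"{d} {t} {c}" for d, t, c in entries]
    return "\n".join(lines)


# Hypothetical toy dictionary and documents
token_to_id = {"cat|NOUN": 0, "sit|VERB": 1, "dog|NOUN": 2}
docs = [["cat|NOUN", "sit|VERB", "cat|NOUN"], ["dog|NOUN", "cat|NOUN"]]

bags = [to_bag_of_words(doc, token_to_id) for doc in docs]
print(bags)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
print(to_matrix_market(bags, num_terms=len(token_to_id)))
```

Because each document becomes a short list of numbers on its own lines, a program can read the file one document at a time instead of loading everything into memory.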

After we have created both of these files, we can move on (in this notebook) to <i>training</i> topic models or we can load these files in the main topic modelling notebook and train the topic models there.

Note: if you get errors saying that a module was not found, you probably just need to install the missing modules in your Python environment.
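One way to see which modules are missing before running the rest of the notebook is sketched below. The list of module names is an assumption based on the libraries this notebook appears to rely on; adjust it to match whatever your own error message mentions.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot find in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of modules; change it to match your own error messages
needed = ["pandas", "spacy", "gensim"]
print("Missing modules:", missing_modules(needed))
```

Anything reported as missing can usually be installed from a terminal with your package manager, for example `pip install gensim`, and then the notebook kernel restarted.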

You need a folder of docbins to use this notebook. If you have not created them yet, use the Corpus Linguistics Notebook to do so.
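If you are not sure whether your docbins folder is in place, a quick check along these lines may help. The helper `list_docbin_files` is hypothetical, and the folder path is an assumption that follows the naming pattern used later in this notebook; replace it with your own location.

```python
from pathlib import Path


def list_docbin_files(folder):
    """Return the sorted files in `folder`, or an empty list if the folder does not exist."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.is_file())


# Assumed location; change this to wherever your docbins are stored
files = list_docbin_files("../../intermediate_data/anxiety_docbins")
print(f"Found {len(files)} file(s)")
```

If this reports zero files, check the path for typos or go back and create the docbins first.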

In [1]:
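import os  # used below to create folders; imported explicitly rather than relying on the wildcard import

import pandas as pd
from drs_topic_model import *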

### Step 1

Closely read the code cell below. In it, you will have to set a few variables.

Make sure to use the right spaCy model name; you can find it on the page where you downloaded the model.

Also pay close attention to the project_save_name and docbins_path variables. You will need to change these to suit your project.

Finally, as we discussed in class, topic models need a number of topics to be pre-set. For example, you need to tell the model if you want 10 topics or 20. Below you will see a value for ks, which is a list of numbers separated by commas in square brackets, like so:

        ks = [10,20,30,40,50]
    
Just make sure that you enter numbers and no quotation marks. Each k will then create a different topic model, and you can then compare the outputs.
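A small sanity check like the one below can catch the most common mistakes with ks before any training starts. The function `validate_ks` is a hypothetical helper, not part of `drs_topic_model`, and the rules it enforces (a non-empty list of positive whole numbers) are an assumption about what is sensible here.

```python
def validate_ks(ks):
    """Raise a helpful error unless ks is a non-empty list of positive whole numbers."""
    if not isinstance(ks, list) or not ks:
        raise ValueError("ks must be a non-empty list, e.g. [10, 20, 30]")
    for k in ks:
        # bool is a subclass of int in Python, so rule it out explicitly
        if isinstance(k, bool) or not isinstance(k, int):
            raise ValueError(f"{k!r} is not a whole number; remove any quotation marks")
        if k < 1:
            raise ValueError(f"{k} is not a positive number of topics")
    return ks


print(validate_ks([10, 20, 30, 40, 50]))  # [10, 20, 30, 40, 50]
```

Calling `validate_ks(["10", "20"])` raises an error that points at the quotation marks, which is easier to understand than a failure partway through training.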

In [2]:
# Step 1: Read all the documents and create a few files
# Here, you will add in the appropriate information so that the code below will work

# spacy_model is the language model you are using; the default value should be correct for most use cases, but make sure to use the name of the model exactly as it is given, which you can find in the model section of spacy.io
spacy_model = "en_core_web_sm"

# project save name is the name of your project and is used to create an easy to read name for the different folders and files the script will create automatically
project_save_name = "anxiety"

# create a folder for the probabilities (nothing happens if it already exists)
os.makedirs(f"../../intermediate_data/{project_save_name}_topic_models_params", exist_ok=True)
    
# docbins path is a folder where your docbins are stored
docbins_path = f"../../intermediate_data/{project_save_name}_docbins"
#docbins_path = 'anxiety_docbins'

# remove_stopwords is whether or not you want the script to remove "non-lexical" words, such as "the" or "and". Change the value to False if you wish to include these words.
remove_stopwords = True 

# if you want to train a topic model in this notebook directly (rather than the following one), put in your k values here
# this must be a list of numbers separated by commas, the default is just an example. You can also just put in one value as long as it is in square brackets, eg. [10] for k=10. 
ks = list(range(2, 11)) # k = 2, 3, ..., 10. You should always try multiple k values! This way you can compare models rather than just trust that one worked properly

### Step 2

Run the following code cells to generate the files needed for the topic model training.

Once you have run these cells, you can stop the notebook and quit JupyterLab and Anaconda, or continue to train your topic model.
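If you are not sure whether you have already run this step, the sketch below checks for the two output files. It assumes the files are saved as the `filename` prefix plus the extensions ".lda_dict" and ".mm"; that naming is a guess, so compare it against the files that actually appear in your folder.

```python
from pathlib import Path


def preprocessing_outputs_exist(prefix):
    """Check for '<prefix>.lda_dict' and '<prefix>.mm' (assumed naming convention)."""
    return all(Path(f"{prefix}{ext}").is_file() for ext in (".lda_dict", ".mm"))


# Assumed prefix, following the pattern used for `filename` in this notebook
project_name = "anxiety"
prefix = f"../../intermediate_data/{project_name}_topic_models_params/{project_name}"
print("Already preprocessed:", preprocessing_outputs_exist(prefix))
```

If both files are found, you can skip the preprocessing and go straight to training.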

In [4]:
# Here, we do all the preprocessing, which creates some files that store our data in an efficient way for our topic model to run.
# You only need to run this once! If you already have an ".lda_dict" and ".mm" file, you can move on to the Topic Modelling notebook.
# If not, you can run this here, and train your first model in the following step.

docbins = generate_docbin_paths(docbin_folder = docbins_path,
                                project_save_name = project_save_name)

print("Processing dictionary. This may take some time...")

dictionary = create_gensim_dictionary(
    docbins,
    spacy_model = spacy_model,
    remove_stopwords = remove_stopwords, # use the value chosen in Step 1
    filename = f"../../intermediate_data/{project_save_name}_topic_models_params/{project_save_name}"
)

print("Processing corpus. This may take some time...")

corpus = generate_mmcorpus(
    docbins,
    dictionary,
    spacy_model = spacy_model, # use the model chosen in Step 1
    remove_stopwords = remove_stopwords, # must match the setting used for the dictionary
    filename = f"../../intermediate_data/{project_save_name}_topic_models_params/{project_save_name}"
)

Processing dictionary. This may take some time...
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_0.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_1.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_2.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_3.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_4.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_5.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_6.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_7.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_8.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_9.db
     Processing docbin:  ../../intermediate_data/anxiety_docbins/anxiety_docbin_10.db
     Processin

### Step 3

This step is optional. You can train the topic model here, or you can go to the Main Topic Modelling Notebook and follow the steps there. You also don't need to go any further if you are trying to train a Word2Vec model, as that only needs the corpus and dictionary.

This can take a long time, depending on how much data you have.

In [None]:
# Now we train the topic model. If you want to use the other notebook, you can stop here; there is no need to run this cell.
# Before running this cell, double check the values you used for k. Are they what you want? Don't try more than 3 k values the first time you run this, so you know how long it will take.
# This will try to use as much of your CPU as possible, so be prepared to leave your computer alone for a while, and try to make sure it is connected to a power source.

print("Processing topic models. This will take at least a few minutes for each 'k' or number of topics you selected.")

topic_model_directory = "./" + project_save_name + "_topic_models/"

# create the output folder if it does not exist yet, then train one model per k
os.makedirs(topic_model_directory, exist_ok=True)
create_topic_model(
    corpus,
    dictionary,
    ks,
    project_directory = topic_model_directory
)