# Turbo Topics Training
Per feedback from Professor Collins-Thompson, we will trial the turbo topics model for the unsupervised portion of our project. Turbo topics uses permutation tests to identify the words and multi-word phrases that are significant to a topic.

Turbo topics is documented by its creators here: https://arxiv.org/pdf/0907.1013.pdf.

## Dependencies
This section contains all imports and initialized global variables.

### Clone GitHub Repository
I am including this step so the notebook can run in multiple environments (for this project, Google Colab and the Great Lakes Cluster).

In [1]:
!git clone https://github.com/d-atallah/implicit_gender_bias.git
!git clone https://github.com/blei-lab/turbotopics.git

Cloning into 'implicit_gender_bias'...
remote: Enumerating objects: 172, done.
remote: Counting objects: 100% (99/99), done.
remote: Compressing objects: 100% (91/91), done.
remote: Total 172 (delta 54), reused 19 (delta 8), pack-reused 73
Receiving objects: 100% (172/172), 566.77 KiB | 3.52 MiB/s, done.
Resolving deltas: 100% (81/81), done.
Cloning into 'turbotopics'...
remote: Enumerating objects: 11, done.
remote: Total 11 (delta 0), reused 0 (delta 0), pack-reused 11
Receiving objects: 100% (11/11), 9.81 KiB | 9.81 MiB/s, done.
Resolving deltas: 100% (2/2), done.


### Import Libraries
Here we import all necessary libraries, along with a configuration file from our repository that contains functions shared across our notebooks.

In [19]:
from implicit_gender_bias import config as cf
import pandas as pd
import numpy as np
import joblib
import os
import nltk
from nltk.tokenize.casual import TweetTokenizer
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# install Python 2.7 for compatibility with the turbotopics code
!sudo apt-get install python2.7

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libpython2.7-minimal libpython2.7-stdlib python2.7-minimal
Suggested packages:
  python2.7-doc binfmt-support
The following NEW packages will be installed:
  libpython2.7-minimal libpython2.7-stdlib python2.7 python2.7-minimal
0 upgraded, 4 newly installed, 0 to remove and 33 not upgraded.
Need to get 3,967 kB of archives.
After this operation, 16.0 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 libpython2.7-minimal amd64 2.7.18-13ubuntu1.1 [347 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python2.7-minimal amd64 2.7.18-13ubuntu1.1 [1,394 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 libpython2.7-stdlib amd64 2.7.18-13ubuntu1.1 [1,977 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python2.7 amd64 2

### Set Recurring Variables
Here I set the variables used throughout this notebook. The file path depends on the environment (Colab vs. GLC). I also specify the exact files to use for training.

In [4]:
filepath = cf.filepath()
X_train = pd.read_csv(filepath + 'trns/annotations_X_train.csv').iloc[:,1:]
# this is an unsupervised model so I will likely not use y_train
y_train = pd.read_csv(filepath + 'trns/annotations_y_train.csv').iloc[:,1:]

# create storage path if it does not exist
tt_path = filepath + 'trns/turbotopics/'
os.makedirs(tt_path, exist_ok=True)

# initializing stop_words
stop_words = cf.stop
wnl = WordNetLemmatizer()
lemma_stop_words = [wnl.lemmatize(wrd) for wrd in stop_words]

Mounted at /content/drive
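
`cf.filepath()` lives in our shared config and is not shown in this notebook. As a rough sketch of how an environment-aware path helper could work, the function below checks whether the Colab package is loaded and otherwise falls back to an environment variable. The function name, the default Colab path, and the `IGB_DATA_ROOT` variable are all hypothetical and are not taken from `config.py`.

```python
import os
import sys


def resolve_data_root(colab_root="/content/drive/MyDrive/project/",
                      env_var="IGB_DATA_ROOT"):
    """Return a data directory ending in '/', depending on where the code runs.

    All names and paths here are illustrative assumptions.
    """
    if "google.colab" in sys.modules:
        # google.colab is normally already loaded inside a Colab runtime
        root = colab_root
    else:
        # Elsewhere (e.g. a cluster), read the location from the environment,
        # falling back to the current working directory
        root = os.environ.get(env_var, os.getcwd())
    return root if root.endswith("/") else root + "/"


data_root = resolve_data_root()
print(data_root + "trns/annotations_X_train.csv")
```

Returning a path with a trailing slash keeps string concatenations such as `filepath + 'trns/...'` working unchanged.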


In [5]:
X_train.head()

| | index | source | post_text | response_text | sentiment | relevance | sourceID |
|---|---|---|---|---|---|---|---|
| 0 | 1749 | facebook_wiki | Second week of physical therapy and making gre... | Tiger on a leash. Stay clear! | Mixed | Content | facebook_wiki1749 |
| 1 | 3463 | facebook_congress | Thank you 37th district dems PCOs for nominati... | Congratulations !!! You deserved this honor . ... | Positive | Poster | facebook_congress3463 |
| 2 | 917 | facebook_wiki | Holiday survival tips: Never talk politics wit... | i'll have a bloody Mary and a Steak Sandwich p... | Neutral | Irrelevant | facebook_wiki917 |
| 3 | 5294 | facebook_congress | Over the past five years, the Obama Administra... | Things will never be any different with someon... | Negative | Content | facebook_congress5294 |
| 4 | 14819 | ted | Martin Seligman gave a talk about brain, educa... | I like seligman and his studys but I dont unde... | Mixed | ContentPoster | ted14819 |


## Compute Significant N-Grams
Before training the model, we use part of the turbo topics functionality to compute all significant unigrams and bigrams. This demonstrates the recursive abilities that make turbo topics stand out from other topic modelers.

In [6]:
# save corpus to text file, one response per line
corpus = '\n'.join(X_train.response_text)
corpus_path = tt_path + 'corpus.txt'
with open(corpus_path, 'w') as corpus_file:
  corpus_file.write(corpus)
# create an empty ngrams file to hold the saved results
out_path = tt_path + "ngrams.txt"
open(out_path, 'w').close()

In [7]:
# Commented out to avoid a rerun. Results are saved to the file at `out_path`.
# %%bash -s "$corpus_path" "$out_path"
# cd turbotopics
# python2.7 compute_ngrams.py --corpus $1 --pval 0.001 --perm True --out $2
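
To give some intuition for what "significant" means here, the sketch below runs a plain permutation test on a single bigram: it counts how often two words appear side by side, then compares that count with the counts obtained after shuffling the tokens many times. This is a simplified illustration of the general idea and not the turbotopics implementation, which the paper describes in terms of likelihood ratio statistics and nested tests. The toy corpus, function names, and parameter values are made up for the example.

```python
import random


def bigram_count(tokens, first, second):
    """Count how many times `first` is immediately followed by `second`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)


def permutation_p_value(tokens, first, second, n_perm=500, seed=0):
    """Estimate how often shuffled text yields at least the observed bigram count."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    observed = bigram_count(tokens, first, second)
    shuffled = list(tokens)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys word order but keeps word frequencies
        if bigram_count(shuffled, first, second) >= observed:
            at_least_as_extreme += 1
    # The +1 terms keep the estimate strictly above zero
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical toy corpus in which "new" is always followed by "york"
tokens = ("new york is big . i love new york . new york never sleeps . "
          "the city is big and the city is loud").split()

p_value = permutation_p_value(tokens, "new", "york")
print("observed count:", bigram_count(tokens, "new", "york"))
print("permutation p-value:", p_value)
```

A small p-value suggests the pair co-occurs more often than word frequencies alone would explain, which is the kind of evidence the `--pval` threshold above is applied to.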

## Model Training
Model training will be conducted in two steps. First we will train an LDA model, then we will use the results of our trained LDA to implement turbo topics.

### Data Preprocessing
Lowercase all response text, tokenize it with the tweet tokenizer, lemmatize each token, and remove all stop words except pronouns.

In [8]:
# create class that lemmatizes tweet tokens
# this will be used when creating the term matrix
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        self.tt = TweetTokenizer(preserve_case=False, reduce_len=True,
                                 strip_handles=True, match_phone_numbers=False)
    def __call__(self, doc):
        # CountVectorizer calls the tokenizer on one document at a time
        return [self.wnl.lemmatize(t) for t in self.tt.tokenize(doc)]

def train_vectorizer(text_data, vectorizer=CountVectorizer, tokenizer=LemmaTokenizer()):
    """
    Trains a vectorizer on the provided text data and returns the vectorizer instance,
    the document-term matrix, and the feature names.

    Parameters:
    - text_data: List of text documents to be vectorized.
    - vectorizer: Vectorizer class to be used for text vectorization. Defaults to CountVectorizer.
    - tokenizer: Callable used to tokenize each document. Defaults to a LemmaTokenizer instance.

    Returns:
    - instance: The trained vectorizer instance.
    - matrix: The document-term matrix resulting from fitting the vectorizer on `text_data`.
    - features: An array of feature names generated by the vectorizer.
    """
    # Initialize the vectorizer with specified configurations
    instance = vectorizer(
        strip_accents=None,  # Do not strip accents
        lowercase=True,  # Convert characters to lowercase before tokenizing
        tokenizer=tokenizer,  # Call the tokenizer instance on each document
        token_pattern=None,  # Since a tokenizer is provided, token_pattern is not used
        stop_words=list(lemma_stop_words),  # Remove stop_words but keep pronouns
        ngram_range=(1, 1),  # Consider only single words (1-grams)
        # min_df=0.05,  # Minimum document frequency for filtering terms
        # max_df=0.95,  # Maximum document frequency for filtering terms
        max_features=None  # No limit on the number of features
    )

    # Fit the vectorizer on the provided text data and transform the data into a matrix
    matrix = instance.fit_transform(text_data)

    # Retrieve the feature names generated by the vectorizer
    features = instance.get_feature_names_out()

    return instance, matrix, features

In [9]:
instance, matrix, features = train_vectorizer(X_train.response_text)
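
The stop word handling is the least visible part of this step, since `cf.stop` is defined in our config file. As a minimal sketch of the idea (lowercase, tokenize, then drop stop words while keeping pronouns), the example below uses only the standard library. The word lists and the regex tokenizer are stand-ins I made up for illustration; they are not the contents of `cf.stop` or a replacement for `TweetTokenizer`.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only
BASE_STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of",
                   "he", "she", "her", "him", "they"}
PRONOUNS = {"he", "she", "her", "him", "his", "hers", "they", "them"}

# Pronouns are removed from the stop list so they survive preprocessing
STOP_WORDS = BASE_STOP_WORDS - PRONOUNS


def preprocess(text):
    """Lowercase, split into simple word tokens, and drop non-pronoun stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


kept = preprocess("She was the first to thank him")
print(kept)
```

Keeping pronouns matters for this project because gendered pronouns are part of the signal we want the topics to capture.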

### Train LDA
A trained LDA is a dependency for the turbo topics model. Turbo topics takes the unigram-based LDA model and recursively tests the words before and after each unigram to find significant topical phrases.

In [12]:
# build lda model
lda_model = LDA(n_components = 2,
                max_iter = 10,
                learning_method = 'online', # better for large datasets
                random_state = 42,
                n_jobs = -1)                # use all available processors
# build gridsearch
params = {'learning_decay'  : np.arange(5, 9)/10,
          'max_iter'        : np.arange(1, 5)*10,
          'learning_offset' : np.arange(1, 5)*10}
gs_cv = GridSearchCV(estimator = lda_model, param_grid = params, verbose = 5)
# fit model
gs_cv.fit(matrix)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV 1/5] END learning_decay=0.5, learning_offset=10, max_iter=10;, score=-188034.729 total time=  21.7s
[CV 2/5] END learning_decay=0.5, learning_offset=10, max_iter=10;, score=-194061.578 total time=  17.9s
[CV 3/5] END learning_decay=0.5, learning_offset=10, max_iter=10;, score=-192616.194 total time=  17.6s
[CV 4/5] END learning_decay=0.5, learning_offset=10, max_iter=10;, score=-199471.488 total time=  16.5s
[CV 5/5] END learning_decay=0.5, learning_offset=10, max_iter=10;, score=-198879.209 total time=  16.3s
[CV 1/5] END learning_decay=0.5, learning_offset=10, max_iter=20;, score=-187805.113 total time=  32.1s
[CV 2/5] END learning_decay=0.5, learning_offset=10, max_iter=20;, score=-193836.282 total time=  33.0s
[CV 3/5] END learning_decay=0.5, learning_offset=10, max_iter=20;, score=-192453.899 total time=  32.6s
[CV 4/5] END learning_decay=0.5, learning_offset=10, max_iter=20;, score=-199332.969 total time=  31.8s
[C
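
The "64 candidates, totalling 320 fits" line in the log follows directly from the parameter grid: three parameters with four values each, crossed with five folds. The sketch below reproduces that arithmetic with the standard library. The value lists are written out by hand to mirror the `np.arange` expressions above, and the fold count is taken from the log.

```python
from itertools import product

# Values written out by hand to mirror the np.arange expressions in the grid
param_grid = {
    "learning_decay": [0.5, 0.6, 0.7, 0.8],
    "max_iter": [10, 20, 30, 40],
    "learning_offset": [10, 20, 30, 40],
}
n_folds = 5  # number of folds reported in the log

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

At roughly 16 to 33 seconds per fit in the log above, adding even one more value to any parameter raises the total run time noticeably, so the grid is worth keeping small.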

### Run Turbo Topics on Best LDA

In [23]:
joblib.dump(gs_cv, filepath + 'trns/turbotopics/lda_gs.pkl')
final_lda_model = gs_cv.best_estimator_
joblib.dump(final_lda_model, filepath + 'trns/turbotopics/best_lda.pkl')
print("Model's Params: ", gs_cv.best_params_)
print("Log Likelihood Score: ", gs_cv.best_score_)
print("Model Perplexity: ", final_lda_model.perplexity(matrix))

Model's Params:  {'learning_decay': 0.8, 'learning_offset': 40, 'max_iter': 10}
Log Likelihood Score:  -190464.93192245113
Model Perplexity:  1775.0204419460545
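
To make the perplexity figure easier to read: perplexity is the exponential of the negative average log likelihood per token, so a model that spreads probability uniformly over V words has a perplexity of V. The sketch below checks that relationship with made-up numbers. The token count and vocabulary size are hypothetical and are not values from our corpus.

```python
import math


def perplexity(total_log_likelihood, n_tokens):
    """Perplexity as exp of the negative average log likelihood per token."""
    return math.exp(-total_log_likelihood / n_tokens)


# Hypothetical uniform model: every token gets probability 1 / vocab_size
vocab_size = 1775
n_tokens = 10_000
uniform_log_likelihood = n_tokens * math.log(1 / vocab_size)

uniform_perplexity = perplexity(uniform_log_likelihood, n_tokens)
print(round(uniform_perplexity, 2))
```

Read this way, a perplexity near 1,775 means the model is about as uncertain per token as a uniform choice among roughly 1,775 words. Note that the score printed above is averaged over held-out cross-validation folds, while the perplexity is computed on the full matrix, so the two printed numbers should not be expected to convert into each other with this formula.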


In [15]:
lda_transform = final_lda_model.transform(matrix)

In [18]:
final_lda_model.components_

array([[5.26549011e+03, 7.74228580e+02, 1.22267881e+00, ...,
        1.51661252e+00, 5.03856732e-01, 4.49021243e+01],
       [8.51717686e+00, 1.80885498e+01, 1.21075861e+01, ...,
        5.26445855e-01, 1.08583638e+01, 7.57377166e-01]])
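
Each row of `components_` holds one topic's weight for every vocabulary term, in the same order as `features`, so the raw array is easier to interpret once the largest weights are mapped back to words. The sketch below shows that lookup on a toy example in plain Python. The vocabulary and weights are invented for illustration and the helper name is my own.

```python
def top_words(topic_weights, vocab, n=3):
    """Return the n vocabulary terms with the largest weights for one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic_weights[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


# Hypothetical vocabulary and topic-word weights (2 topics x 5 terms)
vocab = ["congratulations", "he", "she", "talk", "vote"]
components = [
    [40.2, 3.1, 55.0, 0.5, 12.7],
    [0.9, 61.3, 2.2, 48.8, 30.4],
]

for topic_id, weights in enumerate(components):
    print(topic_id, top_words(weights, vocab))
```

With the real objects, the same function could be called on each row of `final_lda_model.components_` with `features` as the vocabulary.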

In [31]:
# save vocab separated by newlines
with open(filepath + 'trns/turbotopics/vocab.txt', 'w') as f:
  f.write('\n'.join(features))