<a href="https://colab.research.google.com/github/d-atallah/implicit_gender_bias/blob/cluster_da/turbotopics_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turbo Topics Training
Per feedback from Professor Collins-Thompson, we will trial the turbo topics model for the unsupervised portion of our project. Turbo topics is a permutation-based method that identifies the words and phrases significant to a topic.

Turbo topics is documented by its creators here: https://arxiv.org/pdf/0907.1013.pdf.
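
To make the idea of a permutation test concrete, here is a minimal sketch in plain Python. It is not the turbotopics code and does not reproduce the paper's test statistic or its recursive phrase search; it only estimates how often a shuffled token sequence contains a bigram at least as many times as the observed text does. The function names, the toy sentence, and the default of 1,000 shuffles are all assumptions made for illustration.

```python
import random


def bigram_count(tokens, first, second):
    """Count adjacent occurrences of (first, second) in a token list."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)


def permutation_p_value(tokens, first, second, n_perm=1000, seed=0):
    """Estimate how often shuffled text yields at least the observed bigram count."""
    rng = random.Random(seed)
    observed = bigram_count(tokens, first, second)
    shuffled = list(tokens)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys word order, keeps word frequencies
        if bigram_count(shuffled, first, second) >= observed:
            hits += 1
    # add-one smoothing keeps the estimate strictly positive
    return (hits + 1) / (n_perm + 1)


# toy text, invented for this example
tokens = ("new york is big . i love new york . new york never sleeps . "
          "the city is big . i love the city").split()
p_phrase = permutation_p_value(tokens, "new", "york")
p_other = permutation_p_value(tokens, "is", "big")
print(p_phrase, p_other)
```

A small estimate for `new york` suggests the pair occurs together more often than word frequencies alone would explain, which is the kind of evidence a phrase-finding method looks for.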

## Dependencies
This section contains all imports and initialized global variables.

### Clone GitHub Repository
I am including this step to ensure usability in multiple environments (i.e. Google Colab and Great Lakes Cluster for this project).

In [None]:
!git clone https://github.com/d-atallah/implicit_gender_bias.git
!git clone https://github.com/blei-lab/turbotopics.git

Cloning into 'implicit_gender_bias'...
remote: Enumerating objects: 175, done.
remote: Counting objects: 100% (102/102), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 175 (delta 55), reused 19 (delta 8), pack-reused 73
Receiving objects: 100% (175/175), 579.70 KiB | 2.76 MiB/s, done.
Resolving deltas: 100% (82/82), done.
Cloning into 'turbotopics'...
remote: Enumerating objects: 11, done.
remote: Total 11 (delta 0), reused 0 (delta 0), pack-reused 11
Receiving objects: 100% (11/11), 9.81 KiB | 9.81 MiB/s, done.
Resolving deltas: 100% (2/2), done.


### Import Libraries
Here we import all necessary libraries, along with a configuration module from our repository that contains functions shared across our notebooks.

In [None]:
from implicit_gender_bias import config as cf
import pandas as pd
import numpy as np
import joblib
import os
import nltk
from nltk.tokenize.casual import TweetTokenizer
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
# install older version of python for compatibility with turbotopics code
!sudo apt-get install python2.7

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libpython2.7-minimal libpython2.7-stdlib python2.7-minimal
Suggested packages:
  python2.7-doc binfmt-support
The following NEW packages will be installed:
  libpython2.7-minimal libpython2.7-stdlib python2.7 python2.7-minimal
0 upgraded, 4 newly installed, 0 to remove and 33 not upgraded.
Need to get 3,967 kB of archives.
After this operation, 16.0 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 libpython2.7-minimal amd64 2.7.18-13ubuntu1.1 [347 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python2.7-minimal amd64 2.7.18-13ubuntu1.1 [1,394 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 libpython2.7-stdlib amd64 2.7.18-13ubuntu1.1 [1,977 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python2.7 amd64 2

### Set Recurring Variables
Here I will set variables that will be used throughout this notebook. Filepath is specified based on environment (Colab vs. GLC). I am also specifying the exact files I want to use for training.

In [None]:
filepath = cf.filepath()
X_train = pd.read_csv(filepath + 'trns/annotations_X_train.csv').iloc[:,1:]
# this is an unsupervised model so I will likely not use y_train
y_train = pd.read_csv(filepath + 'trns/annotations_y_train.csv').iloc[:,1:]

# create storage path if it does not exist
tt_path = filepath + 'trns/turbotopics/'
os.makedirs(tt_path, exist_ok=True)

# initializing stop_words
stop_words = cf.stop
wnl = WordNetLemmatizer()
lemma_stop_words = [wnl.lemmatize(wrd) for wrd in stop_words]

Mounted at /content/drive
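
The environment-dependent path logic lives in `cf.filepath()`, which is not shown in this notebook. As a rough sketch of how such a helper could work, the snippet below picks a root directory based on whether the `google.colab` package is loaded and creates an output folder if it is missing. `resolve_data_root`, `ensure_dir`, and the cluster path are hypothetical; only the Colab path appears elsewhere in this notebook.

```python
import os
import sys
import tempfile


def resolve_data_root(colab_root, cluster_root):
    """Return the data root for the current environment (hypothetical helper)."""
    # Colab loads the google.colab package at startup, so its presence in
    # sys.modules is a common way to detect that environment.
    in_colab = "google.colab" in sys.modules
    return colab_root if in_colab else cluster_root


def ensure_dir(path):
    """Create the directory (and parents) if needed, then return the path."""
    os.makedirs(path, exist_ok=True)
    return path


# the cluster path below is a placeholder, not the project's real location
root = resolve_data_root("/content/drive/MyDrive/RtGender/", "/path/to/cluster/data/")

# demonstrate ensure_dir in a throwaway directory rather than on real data
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_dir(os.path.join(tmp, "trns", "turbotopics"))
    created_exists = os.path.isdir(created)
print(root, created_exists)
```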


In [None]:
X_train.head()

Unnamed: 0,index,source,post_text,response_text,sentiment,relevance,sourceID
0,1749,facebook_wiki,Second week of physical therapy and making gre...,Tiger on a leash. Stay clear!,Mixed,Content,facebook_wiki1749
1,3463,facebook_congress,Thank you 37th district dems PCOs for nominati...,Congratulations !!! You deserved this honor . ...,Positive,Poster,facebook_congress3463
2,917,facebook_wiki,Holiday survival tips: Never talk politics wit...,i'll have a bloody Mary and a Steak Sandwich p...,Neutral,Irrelevant,facebook_wiki917
3,5294,facebook_congress,"Over the past five years, the Obama Administra...",Things will never be any different with someon...,Negative,Content,facebook_congress5294
4,14819,ted,"Martin Seligman gave a talk about brain, educa...",I like seligman and his studys but I dont unde...,Mixed,ContentPoster,ted14819


## Compute Significant N-Grams
Before doing model training, we will use part of the turbo topics functionality to compute all significant unigrams and bigrams. This will demonstrate the recursive abilities that make turbo topics stand out from other topic modelers.

In [None]:
# save corpus to text file, one response per line
corpus = '\n'.join(X_train.response_text)
corpus_path = tt_path + 'corpus.txt'
with open(corpus_path, 'w') as corpus_file:
    corpus_file.write(corpus)
# create an empty ngrams file to hold the saved results
out_path = tt_path + "ngrams.txt"
with open(out_path, 'w'):
    pass

In [None]:
### COMMENTING to avoid rerun. File saved in "out_path" variable.
# %%bash -s "$corpus_path" "$out_path"
# cd turbotopics
# python2.7 compute_ngrams.py --corpus $1 --pval 0.001 --perm True --out $2

## Model Training
Model training will be conducted in two steps. First we will train an LDA model, then we will use the results of our trained LDA to implement turbo topics.

### Data Preprocessing
We lowercase all response text, apply tweet tokenization followed by lemmatization, and remove all stop words except pronouns.

In [None]:
# create class that lemmatizes tweet tokens
# this will be used when creating the term matrix
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        self.tt = TweetTokenizer(preserve_case=False, reduce_len=True,
                                 strip_handles=True, match_phone_numbers=False)
    def __call__(self, docs):
        return [self.wnl.lemmatize(t) for t in self.tt.tokenize(docs)]

def train_vectorizer(text_data, vectorizer=CountVectorizer, tokenizer=LemmaTokenizer()):
    """
    Trains a vectorizer on the provided text data and returns the vectorizer instance,
    the document-term matrix, and the feature names.

    Parameters:
    - text_data: List of text documents to be vectorized.
    - vectorizer: Vectorizer class to be used for text vectorization. Defaults to CountVectorizer.
    - tokenizer: Callable used to tokenize each document. Defaults to a LemmaTokenizer instance.

    Returns:
    - instance: The trained vectorizer instance.
    - matrix: The document-term matrix resulting from fitting the vectorizer on `text_data`.
    - features: An array of feature names generated by the vectorizer.
    """
    # Initialize the vectorizer with specified configurations
    instance = vectorizer(
        strip_accents=None,  # Do not strip accents
        lowercase=True,  # Convert all characters to lowercase before tokenizing
        tokenizer=tokenizer,  # Call the tokenizer instance on each document
        token_pattern=None,  # Since a tokenizer is provided, token_pattern is not used
        stop_words=list(lemma_stop_words),  # Remove stop_words but keep pronouns
        ngram_range=(1, 1),  # Consider only single words (1-grams)
        # min_df=0.05,  # Minimum document frequency for filtering terms
        # max_df=0.95,  # Maximum document frequency for filtering terms
        max_features=None  # No limit on the number of features
    )

    # Fit the vectorizer on the provided text data and transform the data into a matrix
    matrix = instance.fit_transform(text_data)

    # Retrieve the feature names generated by the vectorizer
    features = instance.get_feature_names_out()

    return instance, matrix, features
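
The stop word list itself comes from `cf.stop`, which is not shown here. As a minimal sketch of the "remove stop words but keep pronouns" step described above, the snippet below subtracts a pronoun set from a base stop word set before filtering tokens. Both word lists and both function names are invented for illustration and are much shorter than any real list.

```python
# toy lists, assumed for this example only
BASE_STOP_WORDS = {"the", "a", "and", "is", "of", "she", "he", "her", "his", "they"}
PRONOUNS = {"i", "you", "we", "she", "he", "her", "his", "they", "them"}


def pronoun_preserving_stop_words(stop_words, pronouns):
    """Drop pronouns from the stop word list so they survive filtering."""
    return sorted(set(stop_words) - set(pronouns))


def filter_tokens(tokens, stop_words):
    """Remove every token that appears in the stop word list."""
    stop = set(stop_words)
    return [t for t in tokens if t not in stop]


stop = pronoun_preserving_stop_words(BASE_STOP_WORDS, PRONOUNS)
kept = filter_tokens("she is the chair of the committee".split(), stop)
print(kept)  # ['she', 'chair', 'committee']
```

Keeping pronouns matters for this project because gendered pronouns are among the signals the topics might pick up.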

### Train LDA w/CountVectorizer
Turbo topics depends on a trained LDA model. It takes the unigram-based LDA model and recursively tests the words before and after each unigram to identify significant topical phrases.

In [None]:
instance, matrix, features = train_vectorizer(X_train.response_text)

In [None]:
# # build lda model
# lda_model = LDA(n_components = 2,
#                 max_iter = 10,
#                 learning_method = 'online', # better for large datasets
#                 random_state = 42,
#                 n_jobs = -1)                # use all available processors
# # build gridsearch
# params = {'learning_decay'  : np.arange(5, 8)/10,
#           # 'max_iter'        : np.arange(1, 5)*10,
#           'learning_offset' : np.arange(1, 4)*10}
# gs_cv = GridSearchCV(estimator = lda_model, param_grid = params, verbose = 5)
# # fit model
# gs_cv.fit(matrix)

In [None]:
# # save
# joblib.dump(gs_cv, filepath + 'trns/turbotopics/lda_gs.pkl')
# final_lda_model = gs_cv.best_estimator_
# joblib.dump(final_lda_model, filepath + 'trns/turbotopics/best_lda.pkl')

### Run Turbo Topics on Best LDA (CountVectorizer)

In [None]:
gs_cv = joblib.load(filepath + 'trns/turbotopics/lda_gs.pkl')
final_lda_model = joblib.load(filepath + 'trns/turbotopics/best_lda.pkl')
print("Model's Params: ", gs_cv.best_params_)
print("Log Likelihood Score: ", gs_cv.best_score_)
print("Model Perplexity: ", final_lda_model.perplexity(matrix))

Model's Params:  {'learning_decay': 0.8, 'learning_offset': 40, 'max_iter': 10}
Log Likelihood Score:  -190464.93192245113
Model Perplexity:  1775.0204419460545
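
For readers comparing the two numbers above: scikit-learn's documentation defines LDA perplexity as `exp(-1 * log-likelihood per word)`. The sketch below applies that definition to made-up inputs; `perplexity_from_log_likelihood` is a hypothetical helper, and the real estimator works from an approximate variational bound rather than an exact likelihood.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, total_word_count):
    """Perplexity as exp of the negative per-word log likelihood."""
    per_word = log_likelihood / total_word_count
    return math.exp(-per_word)


# toy numbers: 1,000 words, each assigned probability 1/50 by the model
toy_log_likelihood = 1000 * math.log(1 / 50)
toy_perplexity = perplexity_from_log_likelihood(toy_log_likelihood, 1000)
print(toy_perplexity)
```

Lower perplexity is better. Because the per-word normalizer depends on the document-term matrix, perplexities computed on count and TF-IDF matrices are likely not directly comparable.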


In [None]:
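# refit the grid-search candidate with the lowest mean test score
# (the largest absolute log likelihood) for comparison with the best estimator
params = gs_cv.cv_results_['params'][gs_cv.cv_results_['mean_test_score'].argmin()]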
another_lda = LDA(learning_decay = params['learning_decay'],
                  learning_offset = params['learning_offset'],
                  max_iter = params['max_iter'],
                  n_components = 2,
                  learning_method = 'online', # better for large datasets
                  random_state = 42,
                  n_jobs = -1)                # use all available processors
another_lda_transform = another_lda.fit_transform(matrix)
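
`GridSearchCV` treats a higher `mean_test_score` as better, so `best_params_` corresponds to the `argmax` of that array, while the cell above indexes with `argmin`. The sketch below shows the difference on a dictionary shaped like `cv_results_`; the helper name and all scores are invented for illustration.

```python
def candidate_at(cv_results, pick_best=True):
    """Return (params, score) for the highest- or lowest-scoring candidate."""
    scores = list(cv_results["mean_test_score"])
    chooser = max if pick_best else min
    idx = chooser(range(len(scores)), key=scores.__getitem__)
    return cv_results["params"][idx], scores[idx]


# toy stand-in for gs_cv.cv_results_ (values are made up)
toy_results = {
    "params": [{"learning_decay": 0.5}, {"learning_decay": 0.6}, {"learning_decay": 0.7}],
    "mean_test_score": [-190500.0, -190480.0, -190520.0],
}
best_params, best_score = candidate_at(toy_results, pick_best=True)
worst_params, worst_score = candidate_at(toy_results, pick_best=False)
print(best_params, worst_params)
```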

In [None]:
lda_transform = final_lda_model.transform(matrix)

In [None]:
a = lda_transform[:,0]
b = lda_transform[:,1]
topics = np.where(a > b, 1, 2)
assign = '\n '.join([str(i) + ':' + str(top) for (i, top) in enumerate(topics)])

In [None]:
# save vocab separated by newlines
vocab_path = tt_path + 'vocab.dat'
with open(vocab_path, 'w') as f: f.write('\n'.join(features))
# save index:topic document
assign_path = tt_path + 'assign.dat'
with open(assign_path, 'w') as f: f.write(assign)
# assign output location
tt_out_path = tt_path + 'tt_result'

In [None]:
%%bash -s "$corpus_path" "$vocab_path" "$assign_path" "$tt_out_path"
# run turbotopics; the four paths arrive as positional arguments $1 to $4
cd turbotopics
python2.7 lda_topics.py --corpus "$1" --vocab "$2" --assign "$3" --out "$4" --ntopics 2 --min-count 1 --pval 0.001

reading vocabulary from /content/drive/MyDrive/RtGender/trns/turbotopics/vocab.dat
writing topic 0
computing initial counts
writing topic 1
computing initial counts
analyzing 5 terms
checking out        : marg = [     1,      1]; bigram =     1;val = 2.77e+00; null = 1.08e+01 rejected
introduced house    : marg = [     1,      1]; bigram =     1;val = 2.77e+00; null = 1.08e+01 rejected
back in             : marg = [     1,      1]; bigram =     1;val = 2.77e+00; null = 1.08e+01 rejected
know its            : marg = [     1,      1]; bigram =     1;val = 2.77e+00; null = 1.08e+01 rejected


In [None]:
top_df = pd.DataFrame(lda_transform, columns = ['t1', 't2'])
top_df['labels'] = y_train
top_df['label_num'] = np.where(top_df.labels == 'W', 2, 1)
top_df['topic'] = np.where(top_df.t1 > top_df.t2, 1, 2)
top_df[top_df.label_num == top_df.topic].count()/top_df.shape[0]

t1           0.511075
t2           0.511075
labels       0.511075
label_num    0.511075
topic        0.511075
dtype: float64

In [None]:
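# agreement with the labels is slightly higher for the candidate with the
# largest absolute log likelihood, but it is still close to chance
# note: the 'W' mapping below is reversed relative to the previous cell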
another_df = pd.DataFrame(another_lda_transform, columns = ['t1', 't2'])
another_df['labels'] = y_train
another_df['label_num'] = np.where(another_df.labels == 'W', 1, 2)
another_df['topic'] = np.where(another_df.t1 > another_df.t2, 1, 2)
another_df[another_df.label_num == another_df.topic].count()/another_df.shape[0]

t1           0.522801
t2           0.522801
labels       0.522801
label_num    0.522801
topic        0.522801
dtype: float64
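
Topic numbers from an unsupervised model are arbitrary, so the choice of which topic counts as `'W'` changes the reported agreement. One way to remove that dependence is to score every topic-to-label mapping and keep the best one. The sketch below does this in plain Python; `best_agreement` and the toy data are assumptions for illustration, and the helper assumes there are no more topics than labels.

```python
from itertools import permutations


def best_agreement(labels, topics):
    """Return the highest agreement over all topic-to-label mappings, and that mapping."""
    label_values = sorted(set(labels))
    topic_values = sorted(set(topics))
    best_score, best_map = 0.0, {}
    for perm in permutations(label_values, len(topic_values)):
        mapping = dict(zip(topic_values, perm))
        matches = sum(1 for lab, top in zip(labels, topics) if mapping[top] == lab)
        score = matches / len(labels)
        if score > best_score:
            best_score, best_map = score, mapping
    return best_score, best_map


# toy data, invented for this example
toy_labels = ["W", "W", "M", "M", "W"]
toy_topics = [2, 2, 1, 1, 1]
score, mapping = best_agreement(toy_labels, toy_topics)
print(score, mapping)  # 0.8 {1: 'M', 2: 'W'}
```

With two topics and two labels this reduces to `max(p, 1 - p)`, so the result never falls below 0.5.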

### Train LDA w/TF-IDF
This section repeats the LDA training above with a TF-IDF matrix in place of raw counts. As before, the trained LDA is the model that turbo topics would build on.

In [None]:
instance, matrix, features = train_vectorizer(X_train.response_text,
                                              vectorizer = TfidfVectorizer)
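
To show what changes when `TfidfVectorizer` replaces `CountVectorizer`, here is a minimal sketch of the weighting that scikit-learn documents as its default: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is written in plain Python on a two-document toy corpus and is not the library's implementation; `tfidf_rows` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()} if norm else {})
    return rows


# toy corpus, invented for this example
docs = [["great", "job"], ["great", "speech", "speech"]]
rows = tfidf_rows(docs)
print(rows[0])
```

Each row now holds fractional weights rather than integer counts, with terms that appear in fewer documents weighted more heavily.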

In [None]:
# build lda model
tf_idf_lda_model = LDA(n_components = 2,
                       max_iter = 30,
                       learning_method = 'online', # better for large datasets
                       random_state = 42,
                       n_jobs = -1)                # use all available processors
# build gridsearch
params = {'learning_decay'  : np.arange(5, 8)/10,
          'learning_offset' : np.arange(1, 4)*10}
tf_idf_gs_cv = GridSearchCV(estimator = tf_idf_lda_model,
                            param_grid = params, verbose = 5)
# fit model
tf_idf_gs_cv.fit(matrix)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END learning_decay=0.5, learning_offset=10;, score=-55500.753 total time=  46.0s
[CV 2/5] END learning_decay=0.5, learning_offset=10;, score=-56138.961 total time=  46.4s
[CV 3/5] END learning_decay=0.5, learning_offset=10;, score=-56090.799 total time=  46.5s
[CV 4/5] END learning_decay=0.5, learning_offset=10;, score=-56667.564 total time=  45.2s
[CV 5/5] END learning_decay=0.5, learning_offset=10;, score=-56754.164 total time=  46.7s
[CV 1/5] END learning_decay=0.5, learning_offset=20;, score=-55496.838 total time=  46.3s
[CV 2/5] END learning_decay=0.5, learning_offset=20;, score=-56058.931 total time=  45.9s
[CV 3/5] END learning_decay=0.5, learning_offset=20;, score=-56087.097 total time=  46.0s
[CV 4/5] END learning_decay=0.5, learning_offset=20;, score=-56684.547 total time=  46.1s
[CV 5/5] END learning_decay=0.5, learning_offset=20;, score=-56677.742 total time=  46.6s
[CV 1/5] END learning_decay=0.5, learnin

In [None]:
# save
joblib.dump(tf_idf_gs_cv, filepath + 'trns/turbotopics/lda_gs_tfidf.pkl')
tf_idf_lda_model = tf_idf_gs_cv.best_estimator_
joblib.dump(tf_idf_lda_model, filepath + 'trns/turbotopics/best_lda_tfidf.pkl')

['/content/drive/MyDrive/RtGender/trns/turbotopics/best_lda_tfidf.pkl']

### Run Turbo Topics on Best LDA (TF-IDF)

In [None]:
tf_idf_gs_cv = joblib.load(filepath + 'trns/turbotopics/lda_gs_tfidf.pkl')
tf_idf_lda_model = joblib.load(filepath + 'trns/turbotopics/best_lda_tfidf.pkl')
print("Model's Params: ", tf_idf_gs_cv.best_params_)
print("Log Likelihood Score: ", tf_idf_gs_cv.best_score_)
print("Model Perplexity: ", tf_idf_lda_model.perplexity(matrix))

Model's Params:  {'learning_decay': 0.7, 'learning_offset': 30}
Log Likelihood Score:  -56141.56878551216
Model Perplexity:  7571.076499030928


In [None]:
# refit the grid-search candidate with the lowest mean test score for comparison
params = tf_idf_gs_cv.cv_results_['params'][tf_idf_gs_cv.cv_results_['mean_test_score'].argmin()]
print(params)
another_tf_idf_lda = LDA(learning_decay = params['learning_decay'],
                 learning_offset = params['learning_offset'],
                 max_iter = 30,
                 n_components = 2,
                 learning_method = 'online', # better for large datasets
                 random_state = 42,
                 n_jobs = -1)                # use all available processors
another_tf_idf_lda_transform = another_tf_idf_lda.fit_transform(matrix)

{'learning_decay': 0.7, 'learning_offset': 10}


In [None]:
tf_idf_lda_transform = tf_idf_lda_model.transform(matrix)

In [None]:
tf_idf_lda_df = pd.DataFrame(tf_idf_lda_transform, columns = ['t1', 't2'])
tf_idf_lda_df['labels'] = y_train
tf_idf_lda_df['label_num'] = np.where(tf_idf_lda_df.labels == 'W', 2, 1)
tf_idf_lda_df['topic'] = np.where(tf_idf_lda_df.t1 > tf_idf_lda_df.t2, 1, 2)
tf_idf_lda_df[tf_idf_lda_df.label_num == tf_idf_lda_df.topic].count()/tf_idf_lda_df.shape[0]

t1           0.514875
t2           0.514875
labels       0.514875
label_num    0.514875
topic        0.514875
dtype: float64

In [None]:
# agreement with the labels is marginally higher for the candidate with the
# largest absolute log likelihood, but it is still close to chance
another_tf_idf_lda_df = pd.DataFrame(another_tf_idf_lda_transform, columns = ['t1', 't2'])
another_tf_idf_lda_df['labels'] = y_train
another_tf_idf_lda_df['label_num'] = np.where(another_tf_idf_lda_df.labels == 'W', 2, 1)
another_tf_idf_lda_df['topic'] = np.where(another_tf_idf_lda_df.t1 > another_tf_idf_lda_df.t2, 1, 2)
another_tf_idf_lda_df[another_tf_idf_lda_df.label_num == another_tf_idf_lda_df.topic].count()/another_tf_idf_lda_df.shape[0]

t1           0.515635
t2           0.515635
labels       0.515635
label_num    0.515635
topic        0.515635
dtype: float64

### Train LDA w/TF-IDF (more topics)
This section reuses the TF-IDF matrix and grid searches over the number of topics (2 through 6), holding `learning_decay` and `learning_offset` fixed at 0.7 and 10.

In [None]:
instance, matrix, features = train_vectorizer(X_train.response_text,
                                              vectorizer = TfidfVectorizer)

In [None]:
# build lda model
lda_model_tops = LDA(max_iter = 10,
                     learning_decay = 0.7,
                     learning_offset = 10,
                     learning_method = 'online', # better for large datasets
                     random_state = 42,
                     n_jobs = -1)                # use all available processors
# build gridsearch
params = {'n_components'  : np.arange(2, 7)}
gs_cv_top = GridSearchCV(estimator = lda_model_tops, param_grid = params, verbose = 5)
# fit model
gs_cv_top.fit(matrix)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5] END ...............n_components=2;, score=-55488.976 total time=  18.8s
[CV 2/5] END ...............n_components=2;, score=-56492.097 total time=  17.4s
[CV 3/5] END ...............n_components=2;, score=-56403.473 total time=  16.0s
[CV 4/5] END ...............n_components=2;, score=-56946.467 total time=  16.5s
[CV 5/5] END ...............n_components=2;, score=-57019.545 total time=  15.9s
[CV 1/5] END ...............n_components=3;, score=-58719.514 total time=  19.8s
[CV 2/5] END ...............n_components=3;, score=-60496.551 total time=  17.6s
[CV 3/5] END ...............n_components=3;, score=-60514.029 total time=  17.3s
[CV 4/5] END ...............n_components=3;, score=-61047.805 total time=  18.2s
[CV 5/5] END ...............n_components=3;, score=-61100.387 total time=  17.9s
[CV 1/5] END ...............n_components=4;, score=-63230.033 total time=  17.9s
[CV 2/5] END ...............n_components=4;, scor

In [None]:
gs_cv_top.best_estimator_

In [None]:
# refit the grid-search candidate with the lowest mean test score for comparison
params = gs_cv_top.cv_results_['params'][gs_cv_top.cv_results_['mean_test_score'].argmin()]
print(params)
another_lda_more_top = LDA(learning_decay = 0.7,
                           learning_offset = 10,
                           max_iter = 10,
                           n_components = params['n_components'],
                           learning_method = 'online', # better for large datasets
                           random_state = 42,
                           n_jobs = -1)                # use all available processors
lda_more_top_trans = another_lda_more_top.fit_transform(matrix)

{'n_components': 6}


In [None]:
# Much easier to see 6 distinct topics than 2
more_top_trns = pd.DataFrame(lda_more_top_trans, columns = ['t1', 't2', 't3', 't4', 't5', 't6'])
more_top_trns['labels'] = y_train
more_top_trns.head()

Unnamed: 0,t1,t2,t3,t4,t5,t6,labels
0,0.107433,0.05234,0.2147,0.05234,0.053287,0.519901,W
1,0.492034,0.048048,0.315471,0.048048,0.048351,0.048048,W
2,0.148524,0.047997,0.659332,0.048027,0.048106,0.048014,M
3,0.339117,0.065608,0.059471,0.029629,0.439689,0.066486,M
4,0.04595,0.045766,0.045841,0.045766,0.532863,0.283814,M
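
With more than two topics, the `np.where(t1 > t2, 1, 2)` pattern used earlier no longer applies; the dominant topic is the column with the largest weight. The sketch below computes it for the five rows shown above using plain Python, with 1-based topic numbers to match the `t1` to `t6` column names. `dominant_topic` is a hypothetical helper.

```python
# the five rows displayed above, columns t1..t6
rows = [
    [0.107433, 0.05234, 0.2147, 0.05234, 0.053287, 0.519901],
    [0.492034, 0.048048, 0.315471, 0.048048, 0.048351, 0.048048],
    [0.148524, 0.047997, 0.659332, 0.048027, 0.048106, 0.048014],
    [0.339117, 0.065608, 0.059471, 0.029629, 0.439689, 0.066486],
    [0.04595, 0.045766, 0.045841, 0.045766, 0.532863, 0.283814],
]


def dominant_topic(weights):
    """Return the 1-based index of the largest topic weight."""
    return max(range(len(weights)), key=weights.__getitem__) + 1


dominant = [dominant_topic(r) for r in rows]
print(dominant)  # [6, 1, 3, 5, 5]
```

On the full NumPy array the same result should be available as `lda_more_top_trans.argmax(axis=1) + 1`.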
