# Word2Vec Training
We will train several word2vec models on our annotations file, one for each of the following subsets of the training data:
- The whole training set,
- Each individual source in the training set,
- All sources dubbed formal (e.g. TED),
- All sources dubbed informal (e.g. Reddit).

We will be comparing the performance of each of these embeddings when paired with our supervised and unsupervised models.
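
The groupings above can be sketched as a simple lookup. The group membership mirrors the lists used in the training loop later in this notebook; the `SOURCE_GROUPS` name and the `resolve_sources` helper are illustrative only and are not part of the repository's `config` module.

```python
# Illustrative names; group membership follows the training loop in this notebook.
SOURCE_GROUPS = {
    "formal": ["ted", "facebook_congress"],
    "informal": ["fitocracy", "facebook_wiki", "reddit"],
}

def resolve_sources(name, all_sources):
    """Return the list of raw sources that a given model name covers."""
    if name == "all":
        return list(all_sources)
    # A single source such as "reddit" simply maps to itself.
    return list(SOURCE_GROUPS.get(name, [name]))

all_sources = ["facebook_wiki", "facebook_congress", "ted", "reddit", "fitocracy"]
for model_name in ["all", "formal", "informal", "ted"]:
    print(model_name, resolve_sources(model_name, all_sources))
```

Keeping the mapping in one place means a new group could be added without touching the loop that trains the models.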

## Dependencies
This section contains all imports and initialized global variables.

### Clone GitHub Repository
This step ensures the notebook can run in multiple environments (for this project, Google Colab and the Great Lakes Cluster).

In [1]:
!git clone https://github.com/d-atallah/implicit_gender_bias.git

fatal: destination path 'implicit_gender_bias' already exists and is not an empty directory.
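
The `fatal` message above appears whenever the cell is re-run after the repository has already been cloned. One way to make the step repeatable is to skip the clone when the destination already exists. This is a minimal sketch, not the approach used in this notebook; `clone_if_missing` and its `run` parameter are hypothetical names introduced for illustration.

```python
import os
import subprocess

def clone_if_missing(repo_url, dest, run=subprocess.run):
    """Clone repo_url into dest unless dest already exists and is non-empty."""
    if os.path.isdir(dest) and os.listdir(dest):
        return False  # nothing to do, the repository is already present
    # check=True raises CalledProcessError if git exits with an error
    run(["git", "clone", repo_url, dest], check=True)
    return True
```

In this notebook it could be called as `clone_if_missing("https://github.com/d-atallah/implicit_gender_bias.git", "implicit_gender_bias")`. The `run` argument exists only so the function can be exercised without network access.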


### Import Libraries
Here we import all necessary libraries, along with a configuration module from our repository that contains shared functions used across our notebooks.

In [1]:
from implicit_gender_bias import config as cf
import pandas as pd
import numpy as np
import joblib
import os
from gensim.models import Word2Vec
from tqdm import tqdm

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Set Recurring Variables
Here I set variables that are used throughout this notebook. The filepath depends on the environment (Colab vs. GLC). I also specify which files from the raw RtGender sources to split into train and test sets. The extract_dfs function is flexible: it accepts any number of files to split and load.

In [2]:
filepath = cf.filepath()
files = ['annotations']
df_dict = cf.extract_dfs(filepath, files)
raw_annotations = df_dict['annotations']
# create the output directory if it does not already exist
os.makedirs(filepath + 'trns/', exist_ok=True)
load_dict = cf.load_df(filepath, raw_annotations, 'annotations')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
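
The internals of `cf.filepath()` are not shown in this notebook. As a rough illustration of how a base path could be chosen per environment, the sketch below checks whether the Colab package is loaded. The function name `pick_base_path` and both example paths are assumptions made for this example, not the project's real helper or locations.

```python
import sys

def pick_base_path(colab_path, cluster_path, modules=None):
    """Return colab_path when running inside Google Colab, else cluster_path."""
    modules = sys.modules if modules is None else modules
    # The presence of google.colab in the loaded modules is a common
    # heuristic for detecting a Colab runtime.
    return colab_path if "google.colab" in modules else cluster_path

# Both paths are placeholders, not the project's real locations.
base = pick_base_path("/content/drive/MyDrive/project/", "/path/on/cluster/project/")
print(base)
```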


## Begin Training
We will train one model for each source (or group of sources) and each combination of preprocessing settings. The preprocessed training data and the trained models are saved to the shared drive.

In [3]:
df = pd.read_csv(load_dict['X_train']).iloc[:,1:]
df.head()

| Unnamed: 0 | index | source | post_text | response_text | sentiment | relevance | sourceID |
|---|---|---|---|---|---|---|---|
| 0 | 1749 | facebook_wiki | Second week of physical therapy and making gre... | Tiger on a leash. Stay clear! | Mixed | Content | facebook_wiki1749 |
| 1 | 3463 | facebook_congress | Thank you 37th district dems PCOs for nominati... | Congratulations !!! You deserved this honor . ... | Positive | Poster | facebook_congress3463 |
| 2 | 917 | facebook_wiki | Holiday survival tips: Never talk politics wit... | i'll have a bloody Mary and a Steak Sandwich p... | Neutral | Irrelevant | facebook_wiki917 |
| 3 | 5294 | facebook_congress | Over the past five years, the Obama Administra... | Things will never be any different with someon... | Negative | Content | facebook_congress5294 |
| 4 | 14819 | ted | Martin Seligman gave a talk about brain, educa... | I like seligman and his studys but I dont unde... | Mixed | ContentPoster | ted14819 |


In [4]:
src_lst = list(df.source.unique())
src_lst.extend(['formal', 'informal', 'all'])
rm_stop_param_lst = [True, False]
lemma_param_lst = [True, False]
subj_param_lst = [True, False]
print(src_lst)

['facebook_wiki', 'facebook_congress', 'ted', 'reddit', 'fitocracy', 'formal', 'informal', 'all']
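
With eight source names and three boolean preprocessing flags, the training loop covers 8 x 2 x 2 x 2 = 64 configurations. The sketch below enumerates them with `itertools.product` and builds names in the same pattern the loop uses. The `config_name` helper is introduced here for illustration and does not exist in the repository.

```python
from itertools import product

sources = ['facebook_wiki', 'facebook_congress', 'ted', 'reddit',
           'fitocracy', 'formal', 'informal', 'all']
flags = [True, False]

def config_name(src, stop, lemma, subj):
    """Build the file stem for one source and preprocessing combination."""
    return f"{src}_stop_param_{stop}_lemma_param_{lemma}_subj_param_{subj}"

# product yields every (source, stop, lemma, subj) combination exactly once
grid = [config_name(*combo) for combo in product(sources, flags, flags, flags)]
print(len(grid))  # 64
print(grid[0])    # facebook_wiki_stop_param_True_lemma_param_True_subj_param_True
```

Enumerating the grid up front makes it easy to check how many models will be trained before starting a run that takes several minutes.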


In [5]:
train_dict = {}
# create output directories if they do not already exist
for sub_dir in ['data/', 'models/']:
  os.makedirs(filepath + 'trns/word2vec/' + sub_dir, exist_ok=True)
# iterate over each source and combination of preprocessing settings
for src in tqdm(src_lst):
  for stop_param in rm_stop_param_lst:
    for lemma_param in lemma_param_lst:
      for subj_param in subj_param_lst:
        name = 'word2vec/data/' + src + '_stop_param_' + str(stop_param) + \
          '_lemma_param_' + str(lemma_param) + '_subj_param_' + str(subj_param)
        if src == 'formal': srcs = ['ted', 'facebook_congress']
        elif src == 'informal': srcs = ['fitocracy', 'facebook_wiki', 'reddit']
        elif src == 'all': srcs = list(df.source.unique())
        else: srcs = [src]
        # below function inits df AND saves to shared drive
        temp_df = cf.preprocess(filepath = load_dict['X_train'], name = name,
                                rm_stopwords = stop_param, subj_rm = subj_param,
                                lemmatize = lemma_param)
        temp_ser = temp_df[temp_df.source.isin(srcs)].processed_response
        # below are the parameters I found online to be recommended for short text
        temp_model = Word2Vec(temp_ser, vector_size = 100, window = 5,
                              min_count = 5, sg = 1, hs = 0, negative = 5,
                              sample = 1e-3, workers = 4, epochs = 10)
        name2 = 'word2vec/models/' + src + '_stop_param_' + str(stop_param) + \
          '_lemma_param_' + str(lemma_param) + '_subj_param_' + str(subj_param)
        # saving model to shared location
        temp_model.save(filepath + 'trns/' + name2 + '.pkl')

100%|██████████| 8/8 [09:20<00:00, 70.04s/it]
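
gensim's `Word2Vec` expects its training input to be an iterable of sentences, each one a list of string tokens. If a sentence is passed as a raw string, its individual characters are treated as tokens. This notebook does not show whether `processed_response` holds token lists or strings, so the sketch below normalizes either form before training. The `as_token_lists` helper is hypothetical, and whitespace splitting is a stand-in for whatever tokenizer the preprocessing step actually uses.

```python
def as_token_lists(docs):
    """Normalize documents into lists of string tokens."""
    normalized = []
    for doc in docs:
        if isinstance(doc, str):
            # naive whitespace tokenizer, used here only as a placeholder
            normalized.append(doc.split())
        else:
            normalized.append([str(tok) for tok in doc])
    return normalized

sample = ["tiger on a leash", ["stay", "clear"]]
print(as_token_lists(sample))
```

The normalized list could then be passed as the first argument to `Word2Vec` in place of `temp_ser`.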
