<a href="https://colab.research.google.com/github/d-atallah/implicit_gender_bias/blob/cluster_da/cluster_exploratory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering Using Tagged Subjects
Topic modeling (LDA) did not separate the two genders within our documents, so we turn to clustering. The approach is to cluster documents by the grammatical subjects they address, in the hope that the resulting clusters correspond to the two genders. We can then use word2vec to find words of other parts of speech that frequently co-occur with the clustered subjects.

## Dependencies
This section contains all imports and initialized global variables.

### Clone Github Repository
I am including this step to ensure usability in multiple environments (i.e. Google Colab and Great Lakes Cluster for this project).

In [1]:
!git clone https://github.com/d-atallah/implicit_gender_bias.git

fatal: destination path 'implicit_gender_bias' already exists and is not an empty directory.


### Import Libraries
Here we import all necessary libraries, along with a configuration file from our repository that contains functions shared across our notebooks.

In [2]:
from implicit_gender_bias import config as cf
import pandas as pd
import numpy as np
import joblib
import re
import os
from collections import Counter

### text preprocessing dependencies
import nltk
from nltk.tokenize.casual import TweetTokenizer
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
# nltk has tweet tokenizer and POS functionality
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### spaCy dependencies for POS tagging
import spacy
nlp = spacy.load("en_core_web_sm")

### sklearn dependencies
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Set Recurring Variables
Here I set variables that are used throughout this notebook. The filepath depends on the environment (Colab vs. GLC). I also specify the exact files used for training.

In [3]:
filepath = cf.filepath()
X_train = pd.read_csv(filepath + 'trns/annotations_X_train.csv').iloc[:,1:]
# this is an unsupervised model so I will likely not use y_train
y_train = pd.read_csv(filepath + 'trns/annotations_y_train.csv').iloc[:,1:]

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
X_train.head()

Unnamed: 0,index,source,post_text,response_text,sentiment,relevance,sourceID
0,1749,facebook_wiki,Second week of physical therapy and making gre...,Tiger on a leash. Stay clear!,Mixed,Content,facebook_wiki1749
1,3463,facebook_congress,Thank you 37th district dems PCOs for nominati...,Congratulations !!! You deserved this honor . ...,Positive,Poster,facebook_congress3463
2,917,facebook_wiki,Holiday survival tips: Never talk politics wit...,i'll have a bloody Mary and a Steak Sandwich p...,Neutral,Irrelevant,facebook_wiki917
3,5294,facebook_congress,"Over the past five years, the Obama Administra...",Things will never be any different with someon...,Negative,Content,facebook_congress5294
4,14819,ted,"Martin Seligman gave a talk about brain, educa...",I like seligman and his studys but I dont unde...,Mixed,ContentPoster,ted14819


### Prepare Functions
This section will contain the functions needed to preprocess the data for our various clustering models.

In [5]:
# create storage path if it does not exist
path = filepath + 'trns/cluster/'
if not os.path.exists(path):
  print(os.path.exists(path))
  os.makedirs(path)

False


In [6]:
### This may not be used for this exercise as we are only interested
###  in clustering documents by their subjects.
###  Any stop words identified as subjects will be important for this analysis.
# initializing stop_words
# must lemmatize to use with train_vectorizer function
stop_words = cf.stop
wnl = WordNetLemmatizer()
lemma_stop_words = [wnl.lemmatize(wrd) for wrd in stop_words]

In [7]:
# so as not to confuse the POS tagger, we need to remove emojis and tags
# the function and regex below were posted by toshi456 on stackoverflow
def deEmojify(text):
    regex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters; NOTE: this wide range also
                                  #  covers most of the BMP above U+24C2 (including CJK text)
        u"\U0001f926-\U0001f937"  # gesture emojis
        u"\U00010000-\U0010ffff"  # all supplementary-plane characters
        u"\u2640-\u2642"  # gender symbols
        u"\u2600-\u2B55"  # misc. symbols
        u"\u200d"  # zero-width joiner
        u"\u23cf"  # eject symbol
        u"\u23e9"  # fast-forward symbol
        u"\u231a"  # watch
        u"\ufe0f"  # variation selector-16
        u"\u3030"  # wavy dash
                           "]+", flags = re.UNICODE)
    return regex_pattern.sub(r'', text)

In [8]:
# similarly we want to remove user tags
#  as these will only apply to specific responses
def rm_userid(text):
  return re.sub(r'@[^\s]+', '', text)

In [9]:
# similarly we want to remove links
#  as these will only apply to specific responses
def rm_link(text):
  return re.sub(r'https?://\S+', '', text)
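
As a quick illustration of what the two regex helpers above remove, here is a minimal, self-contained sketch. The sample string and the `strip_handles_sketch` / `strip_links_sketch` names are made up for this example; they only mirror the patterns used in `rm_userid` and `rm_link`.

```python
import re

def strip_handles_sketch(text):
    # same pattern as rm_userid: "@" followed by any run of non-whitespace
    return re.sub(r'@[^\s]+', '', text)

def strip_links_sketch(text):
    # same pattern as rm_link: http:// or https:// up to the next whitespace
    return re.sub(r'https?://\S+', '', text)

sample = "@jane check https://example.com/a?b=1 now"  # made-up response text
cleaned = strip_links_sketch(strip_handles_sketch(sample))
# the substitutions leave extra spaces behind, so collapse whitespace
cleaned = ' '.join(cleaned.split())
print(cleaned)
```

Both patterns delete only the matched span, so surrounding whitespace is left in place; collapsing it afterwards is optional here because the text is re-tokenized later anyway.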

In [10]:
### This may not be used for this exercise as we are only interested
###  in clustering documents by their subjects.
###  Any stop words identified as subjects will be important for this analysis.
# create class that lemmatizes tweet tokens
# this will be used when creating the term matrix
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        self.tt = TweetTokenizer(preserve_case=False, reduce_len=True,
                                 strip_handles=True, match_phone_numbers=False)
    def __call__(self, docs):
        return [self.wnl.lemmatize(t) for t in self.tt.tokenize(docs)]

In [11]:
# creates a term matrix
def train_vectorizer(text_data, vectorizer=CountVectorizer, tokenizer=LemmaTokenizer()):
    """
    Trains a vectorizer on the provided text data and returns the vectorizer instance,
    the document-term matrix, and the feature names.

    Parameters:
    - text_data: List of text documents to be vectorized.
    - vectorizer: Vectorizer class to be used for text vectorization. Defaults to CountVectorizer.
    - tokenizer: Callable tokenizer instance used for tokenizing the text documents. Defaults to a LemmaTokenizer instance.

    Returns:
    - instance: The trained vectorizer instance.
    - matrix: The document-term matrix resulting from fitting the vectorizer on `text_data`.
    - features: An array of feature names generated by the vectorizer.
    """
    # Initialize the vectorizer with specified configurations
    instance = vectorizer(
        strip_accents=None,  # Do not strip accents
        lowercase=True,  # Convert characters to lowercase before tokenizing
        tokenizer=tokenizer,  # Call the tokenizer instance on each document
        token_pattern=None,  # Since a tokenizer is provided, token_pattern is not used
        stop_words=list(lemma_stop_words),  # Remove stop_words but keep pronouns
        ngram_range=(1, 1),  # Consider only single words (1-grams)
        # min_df=0.05,  # Minimum document frequency for filtering terms
        # max_df=0.95,  # Maximum document frequency for filtering terms
        max_features=None  # No limit on the number of features
    )

    # Fit the vectorizer on the provided text data and transform the data into a matrix
    matrix = instance.fit_transform(text_data)

    # Retrieve the feature names generated by the vectorizer
    features = instance.get_feature_names_out()

    return instance, matrix, features

## Prepare Data
We apply the functions above to preprocess the text.

In [12]:
# remove emojis and links
responses = X_train.response_text.apply(lambda x: rm_link(deEmojify(x)))
# tokenize with TweetTokenizer for its preserve_case, strip_handles,
#  and match_phone_numbers options, then rejoin the tokens for spaCy POS tagging.
tt = TweetTokenizer(preserve_case = False, reduce_len = True,
                    strip_handles = True, match_phone_numbers = False)
response_lst = [' '.join(tt.tokenize(doc)) for doc in responses]
# POS tag using spaCy
tag_pipe = list(nlp.pipe(response_lst))
tag_doc_lst = []
for doc in tag_pipe:
  processed_doc = [(tok.text, tok.dep_, tok.pos_) for tok in doc]
  tag_doc_lst.append(processed_doc)
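
Since the stated goal is to cluster documents by their subjects, the `(text, dep, pos)` tuples built above could be filtered by dependency label. The following is only a sketch on hand-written tuples, not the notebook's method: `SUBJECT_DEPS`, `subject_tokens`, and the toy documents are hypothetical, and treating `nsubj` and `nsubjpass` as the subject labels is an assumption.

```python
# dependency labels treated as "subject" here (an assumption for this sketch)
SUBJECT_DEPS = {"nsubj", "nsubjpass"}

def subject_tokens(tagged_doc):
    """Return the token texts whose dependency label marks a subject."""
    return [text for text, dep, pos in tagged_doc if dep in SUBJECT_DEPS]

# hand-written stand-ins for entries of tag_doc_lst
example_docs = [
    [("she", "nsubj", "PRON"), ("deserved", "ROOT", "VERB"),
     ("this", "det", "DET"), ("honor", "dobj", "NOUN")],
    [("stay", "ROOT", "VERB"), ("clear", "acomp", "ADJ")],
]

subjects_per_doc = [subject_tokens(doc) for doc in example_docs]
print(subjects_per_doc)
```

Documents without an explicit subject (such as imperatives) yield an empty list, which any downstream vectorizer would need to handle.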

### Exploratory: Count of POS by Document
Are there any POS that one gender uses more than the other?

In [13]:
# count each pos
pos_cnt_lst = []
for doc in tag_doc_lst:
   temp_cnt = Counter(pos for word, dep, pos in doc)
   pos_cnt_lst.append(temp_cnt)
# create df and attach labels
df_pos_cnts = pd.DataFrame(pos_cnt_lst).fillna(0)
df_pos_cnts['gender'] = y_train
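
The counting step above can be traced on a small example without spaCy or pandas. This is a sketch with hand-written tuples; `toy_docs`, `all_tags`, and `rows` are illustrative names, and the zero-fill mirrors what `fillna(0)` does for tags that a document never uses.

```python
from collections import Counter

# hand-written stand-ins for entries of tag_doc_lst: (text, dep, pos)
toy_docs = [
    [("tiger", "ROOT", "NOUN"), ("on", "prep", "ADP"),
     ("a", "det", "DET"), ("leash", "pobj", "NOUN")],
    [("stay", "ROOT", "VERB"), ("clear", "acomp", "ADJ")],
]

# one Counter of POS tags per document
pos_counts = [Counter(pos for _, _, pos in doc) for doc in toy_docs]

# shared, sorted column order across all documents
all_tags = sorted(set().union(*pos_counts))

# dense rows, with 0 for tags that do not occur in a document
rows = [[cnt.get(tag, 0) for tag in all_tags] for cnt in pos_counts]

print(all_tags)
print(rows)
```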

In [14]:
df_pos_cnts.head()

Unnamed: 0,NOUN,ADP,DET,PUNCT,VERB,ADJ,PRON,PART,AUX,CCONJ,INTJ,ADV,SCONJ,PROPN,NUM,X,SYM,gender
0,2.0,1.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,W
1,3.0,0.0,1.0,5.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,W
2,3.0,0.0,2.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,M
3,4.0,7.0,1.0,6.0,8.0,1.0,14.0,1.0,7.0,1.0,0.0,7.0,2.0,1.0,0.0,0.0,0.0,M
4,3.0,0.0,0.0,1.0,2.0,0.0,4.0,1.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,M


In [15]:
diff = df_pos_cnts.groupby(['gender']).mean().T
diff['diff'] = abs(diff.M - diff.W)
cols = diff[diff['diff'] > 0.00999].index
diff[diff['diff'] > 0.00999]

gender,M,W,diff
NOUN,3.39746,3.329543,0.067917
ADP,1.616296,1.567224,0.049072
DET,1.362963,1.29699,0.065973
PUNCT,2.886772,2.909253,0.022481
VERB,2.567196,2.530658,0.036538
ADJ,1.404444,1.374136,0.030308
PRON,2.512804,2.531327,0.018522
PART,0.608889,0.582386,0.026503
AUX,1.355344,1.313043,0.0423
PROPN,0.702222,0.717503,0.015281


### Exploratory: Count of DEPs by Document
Are there any dependencies that one gender uses more than the other?

In [16]:
# count each dep
dep_cnt_lst = []
for doc in tag_doc_lst:
   temp_cnt = Counter(dep for word, dep, pos in doc)
   dep_cnt_lst.append(temp_cnt)
# create df and attach labels
df_dep_cnts = pd.DataFrame(dep_cnt_lst).fillna(0)
df_dep_cnts['gender'] = y_train

In [17]:
df_dep_cnts.head()

Unnamed: 0,ROOT,prep,det,pobj,punct,acomp,nsubj,dobj,aux,relcl,...,predet,npadvmod,expl,agent,meta,csubj,preconj,parataxis,csubjpass,gender
0,2.0,1.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,W
1,3.0,0.0,1.0,0.0,5.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,W
2,1.0,0.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,M
3,4.0,7.0,1.0,7.0,6.0,1.0,7.0,3.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,M
4,1.0,0.0,0.0,0.0,1.0,0.0,2.0,2.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,M


In [18]:
df_dep_cnts.groupby(['gender']).mean().diff()

gender,ROOT,prep,det,pobj,punct,acomp,nsubj,dobj,aux,relcl,...,oprd,predet,npadvmod,expl,agent,meta,csubj,preconj,parataxis,csubjpass
M,,,,,,,,,,,...,,,,,,,,,,
W,0.046844,-0.049025,-0.06285,-0.052034,0.013909,0.011041,-0.049994,0.007122,-0.036408,-0.010358,...,-0.00559,0.000561,-0.005795,-0.012522,-0.002518,-0.000454,0.003049,0.000306,0.000547,3.4e-05


### Train Agglomerative Clustering on POS
Train an agglomerative clustering model on the POS counts for each document, using only the tags whose mean counts differ between genders by roughly 0.01 or more.

In [19]:
ac = AgglomerativeClustering(n_clusters = 2)
ac.fit(df_pos_cnts.loc[:, df_pos_cnts.columns.isin(cols)])

In [20]:
# clustering on POS counts alone performs at roughly chance level
# note: the cluster-to-gender mapping (1 -> 'W') is arbitrary
df_pos_cnts['ac'] = np.where(ac.labels_ == 1, 'W', 'M')
acc = df_pos_cnts[df_pos_cnts.gender == df_pos_cnts.ac].shape[0]/df_pos_cnts.shape[0]
print(f'Accuracy: {acc}')

Accuracy: 0.5006514657980456
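
Because cluster IDs carry no inherent meaning, the accuracy depends on which cluster is mapped to which gender. One way to account for this with two clusters is to score both mappings and keep the better one. The sketch below is illustrative only: `best_two_cluster_accuracy` and the toy labels are hypothetical, and it assumes every true label is one of the two names.

```python
def best_two_cluster_accuracy(true_labels, cluster_ids, names=("M", "W")):
    """Accuracy under the better of the two cluster-to-label mappings.

    Assumes cluster_ids contains only 0 and 1 and every true label is in names.
    """
    n = len(true_labels)
    # mapping A: cluster 0 -> names[0], cluster 1 -> names[1]
    hits = sum(names[c] == t for t, c in zip(true_labels, cluster_ids))
    acc = hits / n
    # mapping B swaps the names, which turns every hit into a miss and vice versa
    return max(acc, 1 - acc)

toy_true = ["M", "W", "W", "M", "W"]   # made-up labels
toy_clusters = [1, 0, 0, 1, 1]          # made-up cluster assignments
best = best_two_cluster_accuracy(toy_true, toy_clusters)
print(best)
```

With only two clusters, a result close to 0.5 under either mapping indicates that the clusters do not track the labels.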


### Prepare TF-IDF Matrix
Create a TF-IDF matrix and train a clustering model on that.

In [30]:
instance, matrix, features = train_vectorizer(X_train.response_text,
                                              vectorizer = TfidfVectorizer,
                                              tokenizer = LemmaTokenizer())

In [31]:
# ac = AgglomerativeClustering(n_clusters = 2,
#                              metric = 'cosine',
#                              linkage = 'complete')
# ac.fit(matrix.toarray())

In [32]:
# acc_test_df = pd.DataFrame()
# acc_test_df['gender'] = y_train
# acc_test_df['clustr'] = np.where(ac.labels_ == 1, 'M', 'W')
# acc = acc_test_df[acc_test_df.gender == acc_test_df.clustr].shape[0]/acc_test_df.shape[0]
# acc

In [33]:
# ac.labels_

In [34]:
from sklearn.mixture import BayesianGaussianMixture as bgm
from sklearn.decomposition import TruncatedSVD

In [35]:
tsvd = TruncatedSVD(n_components = 9, n_iter = 25, random_state = 42)#, algorithm = 'arpack', tol = 0.001)
tsvd_trns = tsvd.fit_transform(matrix)

In [36]:
baygauss = bgm(n_components = 2, random_state = 42)
baygauss.fit(tsvd_trns)

In [37]:
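bgm_test_df = pd.DataFrame()
bgm_test_df['gender'] = y_train
pred = baygauss.predict(tsvd_trns)
bgm_test_df['clustr'] = np.where(pred == 1, 'W', 'M')
# use a separate name so the BayesianGaussianMixture alias `bgm` is not overwritten
bgm_acc = bgm_test_df[bgm_test_df.gender == bgm_test_df.clustr].shape[0]/bgm_test_df.shape[0]
# distance from chance, independent of which cluster is called 'W' or 'M'
print(abs(1 - 2*bgm_acc))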

0.05928338762214991
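
The printed score can be read as follows. Writing $a$ for the fraction of documents whose cluster label matches the gender label, a short derivation shows that the score does not depend on which cluster is named `W` or `M`, and what agreement the value above corresponds to (rounded):

```latex
s = \lvert 1 - 2a \rvert, \qquad
a' = 1 - a \;\Rightarrow\; \lvert 1 - 2a' \rvert = \lvert 2a - 1 \rvert = s

s \approx 0.0593 \;\Rightarrow\; a = \frac{1 \pm s}{2} \approx 0.530 \ \text{or} \ 0.470
```

A score of 0 means the clusters agree with the labels exactly half the time, and a score of 1 means the clusters match the labels perfectly under one of the two mappings.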


In [1]:
# NOTE: this cell relies on pd, filepath, instance, and tsvd from the cells above
X_test = pd.read_csv(filepath + 'trns/annotations_X_test.csv').iloc[:,1:]
# this is an unsupervised model so I will likely not use y_test
y_test = pd.read_csv(filepath + 'trns/annotations_y_test.csv').iloc[:,1:]

# reuse the vectorizer fitted on the training data so the test matrix
#  has the same columns that tsvd was fitted on
matrix_t = instance.transform(X_test.response_text)
tsvd_trns_t = tsvd.transform(matrix_t)

NameError: name 'pd' is not defined
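
When held-out text is projected with a fitted SVD, the test matrix needs the same columns as the training matrix, which is usually achieved by calling `transform` on the already-fitted vectorizer rather than fitting a new one. The sketch below shows this pattern on toy documents; the documents and variable names are made up, and the default `TfidfVectorizer` tokenization is used instead of the notebook's `LemmaTokenizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# made-up stand-ins for the training and test responses
train_docs = ["she runs fast", "he walks slowly",
              "they talk loudly", "we read quietly"]
test_docs = ["she talks", "he reads slowly"]

# fit the vocabulary and IDF weights on the training text only
vec = TfidfVectorizer()
matrix_train = vec.fit_transform(train_docs)

# transform (not fit_transform) keeps the training vocabulary and column order
matrix_test = vec.transform(test_docs)

# fit the SVD on the training matrix, then project both sets
svd = TruncatedSVD(n_components=2, random_state=42)
reduced_train = svd.fit_transform(matrix_train)
reduced_test = svd.transform(matrix_test)

print(matrix_train.shape, matrix_test.shape)
print(reduced_train.shape, reduced_test.shape)
```

Words that appear only in the test documents are dropped by `transform`, since they have no column in the training vocabulary.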

In [None]:
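bgm_test_df = pd.DataFrame()
bgm_test_df['gender'] = y_test
# predict on the transformed test data rather than the training data
pred = baygauss.predict(tsvd_trns_t)
bgm_test_df['clustr'] = np.where(pred == 1, 'W', 'M')
# use a separate name so the BayesianGaussianMixture alias `bgm` is not overwritten
bgm_acc = bgm_test_df[bgm_test_df.gender == bgm_test_df.clustr].shape[0]/bgm_test_df.shape[0]
print(abs(1 - 2*bgm_acc))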