# Gender classification

Sarah Nam initially wrote this notebook. Jae Yeon Kim reviwed the notebook, edited the markdown, and reproduced, commented on and made substantial changes in the code.

## Import libraries 

In [1]:

import pandas as pd
import numpy as np
import string

import collections
from collections import Counter
import matplotlib.pyplot as plt
import re

# NLTK
import nltk
import nltk as nlp
# nltk.download('punkt') You may need to download the dataset
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import stopwords

# ML

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB # Naive-Bayes
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression # Linear models
from xgboost import XGBClassifier # Xgboost

################### Validation ######################
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import LeavePOut
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold

################### Vectorizer ######################
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import PCA

################### Model evals #####################
from sklearn.metrics import accuracy_score # Accuracy score
from sklearn.metrics import balanced_accuracy_score # Balanced accuracy score
from sklearn.metrics import cohen_kappa_score # Cohen's Kappa score
from sklearn.metrics import precision_score # Precision

################### Imbalanced data #####################
from sklearn.utils import resample # for resampling

# Custom functions
from clean_text import clean_tweet

## Load data

In [2]:

tweets = pd.read_csv('/home/jae/intersectional-bias-in-ml/raw_data/tweet.csv')

tweets.drop('Unnamed: 0', axis=1, inplace=True)

tweets.columns

Index(['Tweet', 'Type', 'Number of Votes'], dtype='object')

In [3]:
# Name columns 
tweets.columns = ['text', 'label', 'votes']

# See dimentions 
tweets.shape

(99995, 3)

In [4]:
# dropping duplicates to remove the effects of boosting
#tweets = tweets.drop_duplicates()
#tweets.shape

## Clean text 

Borrowed the text clean code (`clean_text.py`) from racial classification to make the preprocessing step consistent across different classifiers.

In [5]:

# Clean text
tweets_clean = tweets.copy()

tweets_clean['text'] = clean_tweet(tweets_clean['text'])

tweets_clean.head()

Unnamed: 0,text,label,votes
0,man it would fucking rule if we had a party ...,abusive,4
1,it is time to draw close to him 128591127995 f...,normal,4
2,if you notice me start to act different or dis...,normal,5
3,forget unfollowers i believe in growing 7 new ...,normal,3
4,hate being sexually frustrated like i wanna ...,abusive,4


## Import and wrangle training data

1. Sarah Nam found an open source Twitter data which was gender labeled (as in, it was written by a person of a certain gender):https://www.kaggle.com/crowdflower/twitter-user-gender-classification
2. The dataset includes image, username and other data. She only used the gender label and the sample tweet for training. 
3. The confidence level of the gender label was also used to sort out which data points would be useful for training our model. I decided that tweets with confidence level greater than .8 were to be used for training. This was because setting confidence threshold to 1 proved to return too few data points for training.

* Unfortunately, no clear explanation on what confidence means is provided in the Kaggle webpage. 

In [6]:
labeled_dat = pd.read_csv("/home/jae/intersectional-bias-in-ml/raw_data/gender_classified.csv", sep=",", engine='python')

training_data = labeled_dat[labeled_dat['gender:confidence'] > .8][['gender', 'text', 'gender:confidence']]

training_data['gender'].value_counts()

female     5371
male       4658
brand      3788
unknown     122
Name: gender, dtype: int64

In [7]:
# Getting rid of the unknown labeled data
training_data = training_data[training_data['gender'] != 'unknown']

# Clean text
training_data['text'] = clean_tweet(training_data['text']) 

training_data['text'].isnull().values.any()

False

- This dataset included not just male and female tweets, but also tweets by brand twitter accounts and some which were unknown. 
- Sarah Nam removed tweets for which the gender was unknown, and decided to use two dummy variables to encode the gender. One dummy variable was 1 for male and 0 for non-male (i.e. female or brand). A similar dummy variable was used for females. 
- Note that the classes are imbalanced in terms of their size.

In [8]:
training_data = training_data.copy()[['text', 'gender']]

# Inspect unique values 
training_data['gender'].unique()

array(['male', 'female', 'brand'], dtype=object)

In [9]:
# Create male and female columns 
training_data['male'] = [1 if i == 'male' else 0 for i in training_data['gender'].values]
training_data['female'] = [1 if i == 'female' else 0 for i in training_data['gender'].values]

In [10]:
# Check the class balance 

## Male
training_data['male'].value_counts()

0    9159
1    4658
Name: male, dtype: int64

In [11]:

## Female
training_data['female'].value_counts()

0    8446
1    5371
Name: female, dtype: int64

## Upsampling 

To fix the imbalance problem, Jae Kim randomly oversampled the minority class. 

In [12]:
# Custom function for upsample. Jae Kim adapted some code from here: https://elitedatascience.com/imbalanced-classes 

def upsample(data, condition): 

    df_majority = data[data[condition] == 0]
    df_minority = data[data[condition] == 1]
    
    # Upsample (oversample) minority class 
    
    df_minority_upsampled = resample(df_minority, 
                                 replace = True,     # sample with replacement
                                 n_samples = 8000,    # to match majority class
                                 random_state = 1234) # reproducible results
    
    # Combine majority class with upsampled minority class
    data = pd.concat([df_majority, df_minority_upsampled])
    
    return(data)


## Feature extraction (bag-of-words model)

In [13]:

# Vectorizer

vectorizer = CountVectorizer(strip_accents='ascii', 
                             max_features = 5000, # 5,000 is large enough
                             min_df = 1, # minimum frequency 1
                             ngram_range = (1,2), # ngram 
                             binary = True)


In [14]:
# Turn text into document-term matrix

def dtm_train(data, condition):

    ############################### Upsampling ################################

    data = upsample(data, condition)
    
    ############################### DOCUMENT-TERM MATRIX ################################
    
    # BOW model 
    
    features = vectorizer.fit_transform(data['text']).todense() # Turn into a sparse matrix    

    # Response variable
    
    response = data[condition].values # values 

    ############################### STRATIFIED RANDOM SAMPLING ################################
    
    # Split into training and testing sets 

    X_train, X_test, y_train, y_test = train_test_split(features, response, 
                                                        test_size = 0.2, # training = 80%, test = 20%
                                                        random_state = 1234) 
    
    return(X_train, y_train, X_test, y_test)

In [15]:
# Male DTM

male_dtm = dtm_train(training_data, 'male')

# Female DTM

female_dtm = dtm_train(training_data, 'female')

## Train classifiers


### Functions for various ML models

In [16]:
# Lasso

def fit_logistic_regression(X_train, y_train):
    model = LogisticRegression(fit_intercept = True,
                               penalty = 'l1', # Lasso 
                               solver = 'liblinear') # for small datasets
    # sage solver is faster but doesn't coverge in this case
    model.fit(X_train, y_train)
    return model

# Naive-Bayes 

def fit_bayes(X_train, y_train):
    model = GaussianNB()
    model.fit(X_train, y_train)
    return model

# Xgboost

def fit_xgboost(X_train, y_train):
    model = XGBClassifier(random_state = 42,
                         seed = 2, 
                         colsample_bytree = 0.6,
                         subsample = 0.7)
    model.fit(X_train, y_train)
    return model

### Function for evaluating ML models (accuracy and balanced accuracy)

In [17]:
def test_model(model, X_train, y_train, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
#   print("Accuracy:", accuracy, "\n"
#          "Balanced accuracy:", balanced_accuracy)
    return(accuracy, balanced_accuracy)

### Model fitting 

In [18]:
def fit_models(data):
    # Lasso
    lasso = fit_logistic_regression(data[0], data[1])
    # Naive-Bayes
    bayes = fit_bayes(data[0], data[1])
    # Xgboost
    xgboost = fit_xgboost(data[0], data[1])
    
    return(lasso, bayes, xgboost)

In [19]:

male_fit = fit_models(male_dtm)

female_fit = fit_models(female_dtm)


## Model evaluations 

### Function for testing multiple models

In [20]:
def test_models(models, data):
    lasso = test_model(models[0], data[0], data[1], data[2], data[3]) 
    bayes = test_model(models[1], data[0], data[1], data[2], data[3])
    xgboost = test_model(models[2], data[0], data[1], data[2], data[3])
    return(lasso, bayes, xgboost)

Evaluate multiple models for each data.

In [21]:
male_models = test_models(male_fit, male_dtm)

female_models = test_models(female_fit, female_dtm)

### Function for putting the model evaluations into a table.

In [22]:
def eval_table(data):
    table = pd.DataFrame(list(data), columns= ['Accuracy','Balanced Accuracy'])
    table.insert(loc = 0, column = 'Models', value = ['Lasso', 'Bayes', 'XGBoost'])
    return(table)

In [23]:

eval_table(male_models)

Unnamed: 0,Models,Accuracy,Balanced Accuracy
0,Lasso,0.701923,0.700363
1,Bayes,0.618881,0.632854
2,XGBoost,0.667249,0.660138


In [24]:

eval_table(female_models)

Unnamed: 0,Models,Accuracy,Balanced Accuracy
0,Lasso,0.753191,0.753121
1,Bayes,0.681459,0.68399
2,XGBoost,0.71307,0.712317


## Prediction

### Function for predicting the unlabeled data (tweets)

In [25]:
def predict_text(text, model):   
      
    # BOW model 
    
    features = vectorizer.fit_transform(text).todense()
    
    # Prediction
    
    preds = model.predict(features)
    
    return preds

### Apply the function to the tweets

In [26]:
male_predicted = predict_text(tweets_clean['text'], male_fit[0])
female_predicted = predict_text(tweets_clean['text'], female_fit[0])

### Data quality check 

In [27]:
# The only data cleaning that was done was to lower case everything.
# Verifying there are no null entries in the tweet text.
tweets_clean['text'].isnull().values.any()

False

In [28]:
tweets_clean['male'] = male_predicted
tweets_clean['female'] = female_predicted

In [29]:
tweets_clean['male'].value_counts()

0    53068
1    46927
Name: male, dtype: int64

In [30]:
tweets_clean['female'].value_counts()

1    60492
0    39503
Name: female, dtype: int64

In [31]:
tweets_clean.head()

Unnamed: 0,text,label,votes,male,female
0,man it would fucking rule if we had a party ...,abusive,4,1,0
1,it is time to draw close to him 128591127995 f...,normal,4,0,0
2,if you notice me start to act different or dis...,normal,5,1,1
3,forget unfollowers i believe in growing 7 new ...,normal,3,1,0
4,hate being sexually frustrated like i wanna ...,abusive,4,1,1


## Export the predicted values 

In [None]:
tweets_clean.columns

In [32]:
tweets_clean.to_csv("/home/jae/intersectional-bias-in-ml/processed_data/gender_predictions.csv", sep=',', encoding='utf-8', 
                    header=["text", "label", "votes", "male", "female"], index=True)