Model results when we do not remove the words ['woman', 'women','man','ladies','guys','men']

## Import libraries

In [332]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV,  cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier, ExtraTreesClassifier

from bs4 import BeautifulSoup 

import nltk
import re
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

%matplotlib inline

## Import data

In [333]:
# import dataset from AskMen and AskWomen subreddit
df = pd.read_csv('./Datasets/askmen_top.csv')
df2 = pd.read_csv('./Datasets/askwomen_top.csv')

In [334]:
# set dataframe display to show full text 
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_columns',10000)

In [335]:
# return only columns required and assign new variables
askwomen = df2[['title','subreddit','selftext']]
askmen = df[['title','subreddit','selftext']]

In [336]:
# get shape of askmen dataframe
askmen.shape

(1000, 3)

In [337]:
# get shape of askwomen dataframe
askwomen.shape

(1000, 3)

## Combine data

In [338]:
# combine into one dataframe
df = pd.concat([askwomen, askmen])

In [339]:
# view dataframe
df.head(3)

Unnamed: 0,title,subreddit,selftext
0,"Reminder: Trans women are women. If you see transphobic commentary on this subreddit, please report",AskWomen,"Recently, we've seen an uptick in transphobic commentary. We wanted to take this time to reiterate our commitment to trans women feeling welcome here. It's askwomen policy that trans women are women, full stop, no qualifiers. So if you see transphobic commentary, please report it. And we will continue to not allow bigotry in this subreddit."
1,"A new dating app is launched. Instead of a photo of a person, it shows you a photo of their bedroom, car, kitchen, shoes, how they have their tea/coffee, things like that... what photo would tell you the most about someone, and would you be most interested to see to choose a potential date?",AskWomen,
2,"When Kamala Harris said ‘I am speaking’ while she was being interrupted over and over, how did that resonate with you?",AskWomen,"Sorry guys - this post has gotten traction because it resonated with a lot of people but the mods have locked it indefinitely. \n\nI posted this question to understand what a moment felt for many women after I saw my own sister wince. It’s small question but the response has been powerful. I feel a lot of people can be heard and a lot of people like myself can learn. \n\nHopefully if they open this sooner rather than later, we can hear more experiences and comments geared towards the question in hand. I am not sure exactly why this question in particular has been locked for this long.\n\nEdit 2: it’s been a month and it’s looking like this post was locked because of its content as opposed to clearing out any comments as the mods have suggested. Wonder if they had an issue with the question or the Kamala Harris?"


In [340]:
# replace nans with blank 
df.fillna("",inplace=True)
# combine all text into one column
df['combined'] = df['title'] +" "+ df['selftext']
# remove columns that are no longer in use
df.drop(columns=['title', 'selftext'], inplace=True)

In [341]:
# check for null values
df.isnull().sum()

subreddit    0
combined     0
dtype: int64

#### Baseline Accuracy

In [342]:
# check data for ratio between the 2 subreddits
df['subreddit'].value_counts(normalize=True)

AskWomen    0.5
AskMen      0.5
Name: subreddit, dtype: float64

When solving a classification problem, we always have to take note of the **baseline accuracy**. This baseline accuracy is derived from the class blanace of out target variable. In this case, out dataset consists of **50:50 AskMen and AskWomen** posts. We have **stratified our train-test-split** to ensure that our target variable keeps this classification ratio. With that, the baseline accuracy we will use to measure the accuracy of our model will be exactly **50%.0**. 

## Dataset Preparation

#### Removing unnecessary texts

limiting factors. in the context of this project we have removed keywords such as woman, women, man, ladies, guys and men to make the prediction task more challenging. In a real life scenario, where there is no such limitations, our model will be able to predict much better. 

In [343]:
# Function to convert a raw review to a string of words
# The input is a single string (a raw post), and 
# the output is a single string (a preprocessed post)
  
def review_to_words(raw_review, words_to_remove):
    
    # 1. Remove HTML.
    review_text = BeautifulSoup(raw_review).get_text()
    
    # 2. Remove non-letters and https
    letters_only = re.sub("[^a-zA-Z]|https", " ", review_text)
    
    # 4. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 5. In Python, searching a set is much faster than searching a list, so convert the stopwords to a set.
    stops = set(stopwords.words('english')) 
    others= {'reddit', 'subreddit','askwomen', 'askmen'}
    
    # 6. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    meaningful_words2 = [w for w in meaningful_words if w not in others]
    meaningful_words3 = [w for w in meaningful_words2 if w not in words_to_remove]
    
    # 7. Join the words back into one string separated by space, and return the result.
    return(" ".join(meaningful_words3))

In [344]:
#sanity check: check stopwords to see what has been removed
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [345]:
# Initialize an empty list to hold the clean reviews.
clean_df = []

# For every review in our training set...
for reviews in df['combined']:
    
    # Convert review to words, then append to clean_train_reviews.
    clean_df.append(review_to_words(reviews,""))

In [346]:
clean_df[0]

'reminder trans women women see transphobic commentary please report recently seen uptick transphobic commentary wanted take time reiterate commitment trans women feeling welcome policy trans women women full stop qualifiers see transphobic commentary please report continue allow bigotry'

#### Train Test Split 

In [351]:
X = clean_df
y = df['subreddit']

In [352]:
# Create train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    stratify = y,
                                                    shuffle=True,
                                                    random_state = 42)

In [353]:
len(X_train)

1500

## Quick Test Models

In an attempt to finding the best model for our dataset, we will run through our dataset on 9 different Machine Learning algorithms with default values and run through 3 scores: the CrossVal Score with 5 fold, train score and test score on 8 Maching Learning algrorithms. 

The scores in this test is no way representative of what the best scores the model can give should we tune the parameters but it might help us to narrow down our models. 

**Why was this method used?**  
Prior to deciding on quick testing each models, I have run through 4 models (Logistic Regression, KNN, Naive Bayes, Decision Tree and Bagging with base model Decision Tree) with aggressive fine tuning on the parameters. You may refer to Project 3: Testing Model Parameters. 

The results have shown that the best models are Logistics Regression, Naive Bayes followed by Decision Tree, with KNN scoring the worst. I have noticed that whilst the quick and dirty method doesn't give me the best score a model can produce, its results is in line with my findings when I heavily fine tune the parameter of the model. With KNN being the worst performing model and logistic regressions as the best performing one. 

However, with consideration to the time it takes to fine tune each model, I have attempted to instead run through a quick and dirty test to see which model runs best on my data set. I will then narrow this down to 3 different models and fine tune its parameters to further improve the scores. 


In [354]:
# create function to compare cvec and tvec scores for a model
def compare_cvectvec(model):

    # instantiate pipe with cvec and model 
    pipe = Pipeline([('cvec', CountVectorizer()),
                     ('model', model)
                    ])
    
    # fit pipe with train dataset
    pipe.fit(X_train, y_train)
    
    # get scores
    print('CVEC 5 Folds Cross-Val Score:' + str(cross_val_score(pipe, X_train, y_train, cv =5).mean()))
    print('CVEC Train Score :' + str(pipe.score(X_train,y_train)))
    print('CVEC Test Score :' + str(pipe.score(X_test,y_test)))
    
    # instantiate pipe with tvec and model 
    pipe2 = Pipeline([('tvec', TfidfVectorizer()),
                      ('model', model)
                     ])
    
    # fit pipe with train dataset
    pipe2.fit(X_train, y_train)
    
    # get scores
    print('TVEC 5 Folds Cross-Val Score :' + str(cross_val_score(pipe2, X_train, y_train, cv =5).mean()))
    print('TVEC Train Score :' + str(pipe.score(X_train,y_train)))
    print('TVEC Test Score :' + str(pipe.score(X_test,y_test)))

#### Linear Regression with Count Vectorizer/ TFIFD Vectorizer

In [355]:
compare_cvectvec(LogisticRegression(max_iter=1000, random_state=42))

CVEC 5 Folds Cross-Val Score:0.826
CVEC Train Score :0.9766666666666667
CVEC Test Score :0.808
TVEC 5 Folds Cross-Val Score :0.8146666666666669
TVEC Train Score :0.8606666666666667
TVEC Test Score :0.768


#### K-Nearest Neighbors with Count Vectorizer/ TFIFD Vectorizer


In [356]:
compare_cvectvec(KNeighborsClassifier())

CVEC 5 Folds Cross-Val Score:0.6046666666666666
CVEC Train Score :0.6933333333333334
CVEC Test Score :0.628
TVEC 5 Folds Cross-Val Score :0.7013333333333333
TVEC Train Score :0.8166666666666667
TVEC Test Score :0.652


#### Naive Baynes with Count Vectorizer/ TFIFD Vectorizer

In [357]:
compare_cvectvec(MultinomialNB())

CVEC 5 Folds Cross-Val Score:0.7360000000000001
CVEC Train Score :0.9073333333333333
CVEC Test Score :0.73
TVEC 5 Folds Cross-Val Score :0.7313333333333334
TVEC Train Score :0.906
TVEC Test Score :0.728


#### Decision Tree with Count Vectorizer/ TFIFD Vectorizer

In [358]:
compare_cvectvec(DecisionTreeClassifier(random_state= 42))

CVEC 5 Folds Cross-Val Score:0.7673333333333334
CVEC Train Score :0.9993333333333333
CVEC Test Score :0.76
TVEC 5 Folds Cross-Val Score :0.764
TVEC Train Score :0.7906666666666666
TVEC Test Score :0.678


#### Decision Tree Bagging Classifier with Count Vectorizer/ TFIFD Vectorizer

In [359]:
compare_cvectvec(BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),
    random_state=42)
                )

CVEC 5 Folds Cross-Val Score:0.8073333333333335
CVEC Train Score :0.984
CVEC Test Score :0.812
TVEC 5 Folds Cross-Val Score :0.8033333333333333
TVEC Train Score :0.882
TVEC Test Score :0.756


#### Random Forest Classifier with Count Vectorizer/ TFIFD Vectorizer

In [360]:
compare_cvectvec(RandomForestClassifier(random_state= 42))

CVEC 5 Folds Cross-Val Score:0.8253333333333334
CVEC Train Score :0.9986666666666667
CVEC Test Score :0.808
TVEC 5 Folds Cross-Val Score :0.8213333333333335
TVEC Train Score :0.984
TVEC Test Score :0.774


#### Adaboost with Count Vectorizer/ TFIFD Vectorizer

In [361]:
compare_cvectvec(AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(),
    random_state=42)
                )

CVEC 5 Folds Cross-Val Score:0.772
CVEC Train Score :0.9993333333333333
CVEC Test Score :0.758
TVEC 5 Folds Cross-Val Score :0.772
TVEC Train Score :0.8446666666666667
TVEC Test Score :0.714


#### Support Vector Machine with Count Vectorizer/ TFIFD Vectorizer

In [362]:
compare_cvectvec(SVC(random_state= 42))

CVEC 5 Folds Cross-Val Score:0.8033333333333333
CVEC Train Score :0.8886666666666667
CVEC Test Score :0.81
TVEC 5 Folds Cross-Val Score :0.8166666666666667
TVEC Train Score :0.5166666666666667
TVEC Test Score :0.5


## Testing Models

### Testing Model 1 - Logistic Regression

The cross validated score for the logistic regression with default parameters is 0.74.
We will fine tune the parameters and try to get a better score than this. 

In [363]:
# instantiating the pipeline
lrtvecpipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(random_state=42, max_iter =1000))
])

In [364]:
# fine tuning parameters
lrtvecpipe_params={'tvec__max_df': [0.3 ,0.5],
             'tvec__max_features': [2000, 3000, 4000],
             'tvec__min_df': [1, 2, 3],
             'tvec__ngram_range': [(1, 1), (1, 2)],
             'lr__C': [1.0, 2.0, 3.0],
#              'lr__penalty': ['l1', 'l2'],
#              'lr__solver': ['liblinear']
              }

In [365]:
# instantiating grid search
lrtvecgs = GridSearchCV(
    lrtvecpipe,                       # what model is used?
    param_grid = lrtvecpipe_params,   # what parameters values are we searching)
    cv=5)                             # 5-fold cross-validation.

In [366]:
# fit grid search with train set
lrtvecgs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('lr',
                                        LogisticRegression(max_iter=1000,
                                                           random_state=42))]),
             param_grid={'lr__C': [1.0, 2.0, 3.0], 'tvec__max_df': [0.3, 0.5],
                         'tvec__max_features': [2000, 3000, 4000],
                         'tvec__min_df': [1, 2, 3],
                         'tvec__ngram_range': [(1, 1), (1, 2)]})

In [367]:
# set model parameters to best parameters
lrtvecgs_bestmodel = lrtvecgs.best_estimator_

In [368]:
# create function to get cross_val score, train score, test score and best params
# this will be useful as we will be testing and comparing multiple models
def get_best_scores(gsmodel, bestmodel, X_train, X_test, y_train, y_test):
    # print cross val score
    print('Best Cross Val Score: ' + str(gsmodel.best_score_))
    # print train score on best model
    print('Best Train Score: ' + str(bestmodel.score(X_train, y_train)))
    # print test score on best model
    print('Best Test Score: ' + str(bestmodel.score(X_test, y_test)))
    # print best parameters for model
    print('Best Parameters:' + json.dumps(gsmodel.best_params_, indent=2))

In [369]:
# get best scores and parameters for logistic regression with TF-IDF vectorizer
get_best_scores(lrtvecgs, lrtvecgs_bestmodel, X_train, X_test, y_train, y_test)

Best Cross Val Score: 0.8220000000000001
Best Train Score: 0.942
Best Test Score: 0.816
Best Parameters:{
  "lr__C": 1.0,
  "tvec__max_df": 0.3,
  "tvec__max_features": 4000,
  "tvec__min_df": 1,
  "tvec__ngram_range": [
    1,
    2
  ]
}


In [370]:
#get coefficients of X variables from logistic regression 
lrtvecgs_bestmodel['lr'].coef_

array([[-0.32039498,  0.11670507, -0.10112859, ..., -0.10290752,
        -0.084625  ,  0.03142645]])

In [371]:
lrtveccoef = pd.DataFrame(lrtvecgs_bestmodel['lr'].coef_, columns = lrtvecgs_bestmodel['tvec'].get_feature_names()).T

In [372]:
lrtveccoef.sort_values(ascending=False, by=0).head(15)

Unnamed: 0,0
women,3.732349
ladies,2.637589
partner,1.07585
etc,1.047371
career,0.840673
mean,0.796092
learn,0.735907
products,0.688392
period,0.679349
ever,0.668767


In [373]:
# create functions to find posts with keyword
def review_with_word(word, lst):
    all_reviews = []
    for post in lst:
        if len(re.findall(f"\s{word}\W", post))> 0:
            all_reviews.append(post)
    return pd.DataFrame(all_reviews)

In [265]:
review_with_word('ladies', X_train)

Unnamed: 0,0
0,ladi best comeback someon slut shame
1,nsfw dead give away woman watch much porn bed notic askreddit thread target ladi made wonder women deal porn obsess guy revers thing my ex obsess porn time sex lack emot men thought
2,ladi deal pimpl ingrown hair around pubic area
3,ladi still feel like kid manag full function adult deal life like deal car issu financ paperwork etc without doubt
4,ladi increas worth believ guy amaz thank share advic stori
...,...
59,ladi dislik avoid take pictur like avoid someon want take pictur
60,ladi difficult part relationship complet normal
61,ladi makeup tip beginn tip product newbi i want start mascara i know look alright prep need done put makeup normal face wash clean i never anyon teach thing i m sorri i sound stupid haha edit thank comment i bought skincar product gon na work skin routin first sinc i never done i also bought mascara thank guy
62,women emo phase emo song edit amount respons absolut wild i love read everyon answer honest take back my middl school high school day thank my fellow former former emo ladi answer edit mod pleas let know allow love user u humbledfool creat playlist respons delight everyon respons realli remind one my favorit subredddit open spotifi com playlist m xdqv fbwss zwfnsr si bj qu yoqzq jlbdtxrysa


In [225]:
lrtveccoef.sort_values(ascending=False, by=0).tail(15)

Unnamed: 0,0
time,-1.682367
nsfw,-1.702461
fuck,-1.749726
talk,-1.788449
get,-1.840142
want,-1.881128
girlfriend,-1.887077
even,-1.915714
attract,-1.929326
ask,-2.096007


### Understanding Tokens causing misclassification

In [226]:
# instantiate an empty dataframe
lr_actualvspred = pd.DataFrame(lrtvecgs_bestmodel.predict_proba(X_test))

# add new column for reviews
lr_actualvspred['descr'] = X_test

# add actual subreddit classification to column 'actual'
lr_actualvspred['actual'] = y_test.to_list()

# add predicted subreddit classification to column 'predict'
lr_actualvspred['predicted'] = lrtvecgs_bestmodel.predict(X_test)

# filter out only reviews that have been wrongly misclassified
wrongly_classified_data = lr_actualvspred[lr_actualvspred['actual']!=lr_actualvspred['predicted']]

In [227]:
#check misclassified data
wrongly_classified_data.head(5)

Unnamed: 0,0,1,descr,actual,predicted
3,0.40317,0.59683,favorit nonsexu activ,AskMen,AskWomen
5,0.146301,0.853699,boundari women non negoti,AskMen,AskWomen
29,0.455436,0.544564,extent girl skin qualiti acn scar etc matter often notic girl amaz bad skin,AskMen,AskWomen
31,0.439347,0.560653,i support my partner griev,AskMen,AskWomen
36,0.439341,0.560659,male prostitut legal demand femal much would charg hour,AskMen,AskWomen


In [228]:
# create new dataframe for reviews that has been misclassified as AskWomen
false_pred_askwomen = wrongly_classified_data[wrongly_classified_data['predicted']=='AskWomen']

In [229]:
# understand reviews that has been misclassified as AskWomen
false_pred_askwomen.sort_values(by=0, ascending=False).head(60)

Unnamed: 0,0,1,descr,actual,predicted
457,0.498347,0.501653,ever woman way uncomfort work,AskMen,AskWomen
460,0.497476,0.502524,often crazi thought go head whilst maintain cool exterior,AskMen,AskWomen
367,0.489853,0.510147,fellow singl guy ya late edit apolog i respond everyon expect level respons send good vibe across multivers,AskMen,AskWomen
355,0.488318,0.511682,men look like summer sup shitlord time karmawhor get doxx post pictur internet might get attent random peopl would never speak otherwis open floodgat rule imgur link pleas instagram link remov know upload imgur googl neckbeard make awkward sexual advanc comment anyon cop perma want learn give compliment peopl without sound like total perv go wikihow someth thread dude ladi post thread remov ladi thread also post www com r comment hyzli women look like summer oh yeah go make comment like i m insecur post pictur lmao instead actual particip comment remov well happi post rememb insecur,AskMen,AskWomen
347,0.483296,0.516704,late bloomer eventu got live order stori i m year old man live my life fear i slowli chang better i like hear stori especi one start take life serious late live manag made,AskMen,AskWomen
306,0.471717,0.528283,fella use bathroom work prefer stand sit cri,AskMen,AskWomen
230,0.465289,0.534711,sweetest thing girl ever done,AskMen,AskWomen
287,0.464415,0.535585,younger brother hit stop hit,AskMen,AskWomen
422,0.460923,0.539077,manag work everi day hi year sinc i start work colleg sudden month ago lost mean joy,AskMen,AskWomen
480,0.460499,0.539501,biggest victori year,AskMen,AskWomen


In [230]:
# create new dataframe for reviews that has been misclassified as AskMen
false_pred_askmen = wrongly_classified_data[wrongly_classified_data['predicted']=='AskMen']

In [231]:
# understand reviews that has been misclassified as AskMen
false_pred_askmen.sort_values(by=0, ascending=False).head(60)

Unnamed: 0,0,1,descr,actual,predicted
337,0.844229,0.155771,favorit type soup edit my inbox mega full guy serious love soup huh edit messag today i expect soup fever get far lol,AskWomen,AskMen
136,0.7752,0.2248,kind experi regard misdiagnosi ignor medic problem i m struggl done long time taken serious enough get test someth i everi symptom i understand major problem lot women particular i wonder experi edit thought stori young week old i hospit pneumonia bronchiol my left lung collaps i breath my right i low chanc surviv howev diagnosi my mum told initi hospit stay i stabl go home mum point blank refus say know child instead sat hour wait room i becam blue unconsci eventu put oxygen nicu doctor apologis my mum repeat ignor i later news pneumonia babi symptom,AskWomen,AskMen
352,0.753002,0.246998,women guy friend guy besti make clear friend flirt,AskWomen,AskMen
359,0.742887,0.257113,know peopl featur queer eye much show actual chang way live live watch new season seem like realli turn peopl live around seem crazi week cours i know also lot goe make show wonder seem like realli stick self care lifestyl chang guy help x b edit fuck guy say keep get remov,AskWomen,AskMen
493,0.740581,0.259419,valentin day mega thread check thing gift food plan valentin day order avoid sea valentin galentin post one mega thread thread rule advic gift relax ask away also obvious ask relationship stuff monday look advic make sure descript succinct better inform give better answer receiv suggest sort new see well new stuff,AskWomen,AskMen
276,0.713468,0.286532,stop look girl start look woman,AskWomen,AskMen
336,0.692354,0.307646,work home kind make peopl understand convers distract difficult love one want share someth interest interact listen work time,AskWomen,AskMen
41,0.691357,0.308643,found say friend friend none true past happen,AskWomen,AskMen
134,0.684737,0.315263,look forward summer edit thank silver i m go tri repli everyth go take bit,AskWomen,AskMen
116,0.677013,0.322987,choic attract men women would choos e common theme straight women wish attract women due bad past experi men misogini one exampl expect chore exampl wish give im realli curious,AskWomen,AskMen


#### Tokenizing misclassified data

In [232]:
# Instantiate a new vectorizer with ngram_range matching that of the logistic regression best parameters
tvec = TfidfVectorizer(ngram_range =(1,2), use_idf=False)

In [233]:
# fit vectorizer with data and tokenize text
tvec_fit = tvec.fit_transform(false_pred_askwomen['descr'])

In [234]:
# convert results into dataframe with feature name
false_women_misclassifying_words = pd.DataFrame(tvec_fit.toarray().sum(axis=0).tolist(), tvec.get_feature_names(), columns=['tfidf_women'])

In [235]:
# describe the distribution of words 
false_women_misclassifying_words.describe()

Unnamed: 0,tfidf_women
count,889.0
mean,0.214332
std,0.18613
min,0.060523
25%,0.099504
50%,0.174078
75%,0.258199
max,2.096181


In [236]:
# filter words that has above 0.9 tfidf score
false_women_misclassifying_words[false_women_misclassifying_words['tfidf_women']>0.9].sort_values(by='tfidf_women', ascending=False)

Unnamed: 0,tfidf_women
ever,2.096181
love,1.723292
thing,1.648973
like,1.4269
feel,1.403346
post,1.257485
women,1.248011
live,1.239269
year,1.21796
work,1.132138


In [237]:
false_women_misclassified_words = false_women_misclassifying_words[false_women_misclassifying_words['tfidf_women']>0.9].index.tolist()

In [238]:
tvec2 = TfidfVectorizer(
  ngram_range= (1,2))

In [239]:
tvec2_fit = tvec2.fit_transform(false_pred_askmen['descr'])

In [240]:
false_men_misclassifying_words = pd.DataFrame(tvec2_fit.toarray().sum(axis=0).tolist(), tvec2.get_feature_names(), columns=['tfidf_men'])

In [241]:
false_men_misclassifying_words.describe()

Unnamed: 0,tfidf_men
count,1733.0
mean,0.148529
std,0.133352
min,0.050743
25%,0.073175
50%,0.097292
75%,0.177926
max,1.268371


In [242]:
false_men_misclassifying_words[false_men_misclassifying_words['tfidf_men']>1].sort_values(by='tfidf_men', ascending=False)

Unnamed: 0,tfidf_men
get,1.268371
thing,1.152441
women,1.120778
friend,1.076979
one,1.050552
look,1.001137


In [243]:
false_men_misclassified_words = false_men_misclassifying_words[false_men_misclassifying_words['tfidf_men']>1].index.tolist()

In [244]:
high_misclassification_token = set.union(set(false_men_misclassified_words), set(false_women_misclassified_words)) 

In [245]:
# sanity check: check the selected high misclassification tokens that we are going to remove
high_misclassification_token

{'ever',
 'feel',
 'friend',
 'get',
 'like',
 'live',
 'look',
 'love',
 'one',
 'post',
 'start',
 'thing',
 'women',
 'work',
 'year'}

In [246]:
# Initialize new train and test sets
X1_train = []
X1_test = []
y1_train = y_train
y1_test = y_test

# remove words in reviews that are in the high_misclassification_token list
for reviews in X_train:
    X1_train.append(review_to_words(reviews, high_misclassification_token))
    
for reviews in X_test:
    X1_test.append(review_to_words(reviews, high_misclassification_token))

In [247]:
# instantiating a new pipeline for Logistic Regression with TF-IDF vectorizer
lrtvecpipe1 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(random_state=42, max_iter=4000))
])

In [248]:
# instantiating grid search with similar parameters to that above
lrtvecgs1 = GridSearchCV(
    lrtvecpipe1,                       # what model is used?
    param_grid = lrtvecpipe_params,    # what parameters values are we searching)
    cv=5)                              # 5-fold cross-validation.

In [249]:
# fit model with new train set (high_misclassification_token removed)
lrtvecgs1.fit(X1_train, y1_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('lr',
                                        LogisticRegression(max_iter=4000,
                                                           random_state=42))]),
             param_grid={'lr__C': [1.0, 2.0, 3.0], 'tvec__max_df': [0.3, 0.5],
                         'tvec__max_features': [2000, 3000, 4000],
                         'tvec__min_df': [1, 2, 3],
                         'tvec__ngram_range': [(1, 1), (1, 2)]})

In [250]:
# set paramaters in model to best parameters
lrtvecgs1_bestmodel = lrtvecgs1.best_estimator_

In [251]:
# get scores for model 
get_best_scores(lrtvecgs1, lrtvecgs1_bestmodel, X1_train, X1_test, y1_train, y1_test)

Best Cross Val Score: 0.8320000000000001
Best Train Score: 0.9426666666666667
Best Test Score: 0.824
Best Parameters:{
  "lr__C": 2.0,
  "tvec__max_df": 0.3,
  "tvec__max_features": 3000,
  "tvec__min_df": 2,
  "tvec__ngram_range": [
    1,
    2
  ]
}


In [252]:
#get coefficients of X variables from logistic regression 
lrtvecgs1_bestmodel['lr'].coef_

array([[-0.49419796, -0.85028383,  0.36245181, ..., -0.39638225,
         0.10491519, -0.21207972]])

In [253]:
lrtveccoef1 = pd.DataFrame(lrtvecgs1_bestmodel['lr'].coef_, columns = lrtvecgs1_bestmodel['tvec'].get_feature_names()).T

In [254]:
lrtveccoef1.sort_values(ascending=False, by=0).head(15)

Unnamed: 0,0
ladi,3.470867
partner,1.692335
marri,1.639589
career,1.448382
best,1.244288
hair,1.220776
cri,1.202605
posit,1.1926
etc,1.190158
absolut,1.118182


In [255]:
lrtveccoef1.sort_values(ascending=False, by=0).tail(20)

Unnamed: 0,0
see,-1.552017
wife,-1.559436
man,-1.582686
mine,-1.716034
say,-1.718194
male,-1.73561
attract,-1.770127
nsfw,-1.784005
talk,-1.789933
fuck,-1.834243


### Testing Model 2 - Random Forest


In [256]:
# instantiating a pipeling for random forest classifier with TF-IDF vectorizer
rftfidpipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('rf', RandomForestClassifier(random_state= 42, n_estimators= 500))
     ])

In [257]:
# set parameters 
rftfidpipe_params={
             'tvec__max_df': [0.3],
             'tvec__max_features': [4000],     #[3000, 4000, 5000]
             'tvec__min_df': [2, 4, 6],
             'tvec__ngram_range': [(1, 1), (1, 2)], #[(1, 1), (1, 2), (1,3)],
             'rf__max_features' : [400 ,500, 800],             #['auto', 'sqrt', 'log2']
             'rf__max_depth':[100, 200],
}

In [258]:
# run grid search on pipeline against various parameters
rftfidgs = GridSearchCV(
    rftfidpipe,     # what object are we optimizing?
    param_grid = rftfidpipe_params,
    cv=5,verbose=2) # what parameters values are we searching) # 5-fold cross-validation.

In [259]:
rftfidgs.fit(X1_train, y1_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] END rf__max_depth=100, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=2, tvec__ngram_range=(1, 1); total time=   4.7s
[CV] END rf__max_depth=100, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=2, tvec__ngram_range=(1, 1); total time=   4.7s
[CV] END rf__max_depth=100, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=2, tvec__ngram_range=(1, 1); total time=   4.6s
[CV] END rf__max_depth=100, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=2, tvec__ngram_range=(1, 1); total time=   5.3s
[CV] END rf__max_depth=100, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=2, tvec__ngram_range=(1, 1); total time=   5.6s
[CV] END rf__max_depth=100, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=2, tvec__ngram_range=(1, 2); total time=   5.1s
[CV] END rf__m

[CV] END rf__max_depth=100, rf__max_features=500, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 1); total time=  11.8s
[CV] END rf__max_depth=100, rf__max_features=500, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 1); total time=   9.9s
[CV] END rf__max_depth=100, rf__max_features=500, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 2); total time=  12.3s
[CV] END rf__max_depth=100, rf__max_features=500, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 2); total time=  12.0s
[CV] END rf__max_depth=100, rf__max_features=500, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 2); total time=  11.6s
[CV] END rf__max_depth=100, rf__max_features=500, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 2); total time=  11.2s
[CV] END rf__max_depth=100, rf__max_features=500, tvec__max_df=0.3, tvec__ma

[CV] END rf__max_depth=200, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 2); total time=  11.5s
[CV] END rf__max_depth=200, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 2); total time=  10.9s
[CV] END rf__max_depth=200, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 2); total time=   9.3s
[CV] END rf__max_depth=200, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 1); total time=  13.4s
[CV] END rf__max_depth=200, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 1); total time=  13.9s
[CV] END rf__max_depth=200, rf__max_features=400, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=6, tvec__ngram_range=(1, 1); total time=  13.2s
[CV] END rf__max_depth=200, rf__max_features=400, tvec__max_df=0.3, tvec__ma

[CV] END rf__max_depth=200, rf__max_features=800, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 1); total time=  21.6s
[CV] END rf__max_depth=200, rf__max_features=800, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 1); total time=  19.5s
[CV] END rf__max_depth=200, rf__max_features=800, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 1); total time=  20.8s
[CV] END rf__max_depth=200, rf__max_features=800, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 1); total time=  19.3s
[CV] END rf__max_depth=200, rf__max_features=800, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 2); total time=  20.0s
[CV] END rf__max_depth=200, rf__max_features=800, tvec__max_df=0.3, tvec__max_features=4000, tvec__min_df=4, tvec__ngram_range=(1, 2); total time=  18.3s
[CV] END rf__max_depth=200, rf__max_features=800, tvec__max_df=0.3, tvec__ma

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('rf',
                                        RandomForestClassifier(n_estimators=500,
                                                               random_state=42))]),
             param_grid={'rf__max_depth': [100, 200],
                         'rf__max_features': [400, 500, 800],
                         'tvec__max_df': [0.3], 'tvec__max_features': [4000],
                         'tvec__min_df': [2, 4, 6],
                         'tvec__ngram_range': [(1, 1), (1, 2)]},
             verbose=2)

In [260]:
# set model parameters to best paramaters
rftfidgs_bestmodel = rftfidgs.best_estimator_

In [261]:
# get accuracy scores and best parameters for model
get_best_scores(rftfidgs, rftfidgs_bestmodel, X1_train, X1_test, y1_train, y1_test)

Best Cross Val Score: 0.8153333333333332
Best Train Score: 0.9986666666666667
Best Test Score: 0.83
Best Parameters:{
  "rf__max_depth": 100,
  "rf__max_features": 400,
  "tvec__max_df": 0.3,
  "tvec__max_features": 4000,
  "tvec__min_df": 2,
  "tvec__ngram_range": [
    1,
    2
  ]
}


### Testing Model 3 - Support Vector Machines


In [179]:
# instantiating pipeline for support vector machine with TF-IDF vectorizer
svctvecpipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('svc', SVC(random_state= 42))
])

In [181]:
# set parameters for support vector machine
svctvecpipe_params={'tvec__max_df': [0.3, 0.5],
                    'tvec__max_features': [2000, 3000, 4000],
                    'tvec__min_df': [1, 2, 3],
                    'tvec__ngram_range': [(1, 1), (1, 2)],
                    'svc__kernel': ['linear','rbf', 'sigmoid'],
                    'svc__C': [1.0],
                    }

In [182]:
# grid search model across various parameters
svctvecdgs = GridSearchCV(
    svctvecpipe, # what object are we optimizing?
    param_grid = svctvecpipe_params,
    cv=5) # what parameters values are we searching) # 5-fold cross-validation.

In [183]:
# fit model with train set
svctvecdgs.fit(X1_train, y1_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('svc', SVC(random_state=42))]),
             param_grid={'svc__C': [100, 10, 1.0, 0.1, 0.001],
                         'svc__kernel': ['linear', 'rbf', 'sigmoid'],
                         'tvec__max_df': [0.3, 0.5],
                         'tvec__max_features': [2000, 3000, 4000],
                         'tvec__min_df': [1, 2, 3],
                         'tvec__ngram_range': [(1, 1), (1, 2)]})

In [184]:
# assign best parameters to model
svctvecgs_bestmodel = svctvecdgs.best_estimator_

In [185]:
# get the accuracy scores and best parameters for model
get_best_scores(svctvecdgs, svctvecgs_bestmodel, X1_train, X1_test, y1_train, y1_test)

Best Cross Val Score: 0.8299999999999998
Best Train Score: 0.996
Best Test Score: 0.814
Best Parameters:{
  "svc__C": 1.0,
  "svc__kernel": "rbf",
  "tvec__max_df": 0.3,
  "tvec__max_features": 4000,
  "tvec__min_df": 1,
  "tvec__ngram_range": [
    1,
    2
  ]
}
