# Project 3: Web APIs & Classification

## Problem Statement



## Executive Summary



### Contents:
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Cleaning](#Data-Cleaning)
- [Pre-Processing](#Pre-Processing)
- [Modelling](#Modelling)

## Exploratory Data Analysis

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

from nltk.tokenize import RegexpTokenizer

# Import CountVectorizer and TfidfVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# display max columns
pd.set_option('display.max_columns',None)

In [2]:
# Load the AskMen posts and select the necessary columns

askmen = pd.read_csv('datasets/askmen.csv')
askmen = askmen[['subreddit','selftext','title']]
askmen.head()

Unnamed: 0,subreddit,selftext,title
0,AskMen,"Sup shitlords,\n\nThis is a very hectic time i...",Askmen Coronavirus Response
1,AskMen,"In the four years that I’ve been a mod here, t...","ThE sUb Is CaLlEd ""AsKmEn"" NoT ""aSkWoMeN """
2,AskMen,"Probably tmi, and I should probably make an al...","As an adult, have you ever pooped your pants? ..."
3,AskMen,I for one am struggling as a 25 year old with ...,Men whose fathers didn’t really teach you abou...
4,AskMen,I hope this doesn't start an internet tough gu...,What is the manliest thing you've ever done?


In [3]:
askmen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149 entries, 0 to 1148
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1149 non-null   object
 1   selftext   611 non-null    object
 2   title      1149 non-null   object
dtypes: object(3)
memory usage: 27.1+ KB


In [4]:
askmen.isnull().sum()

subreddit      0
selftext     538
title          0
dtype: int64

In [5]:
askmen.shape

(1149, 3)

In [6]:
# Load the AskWomen posts and select the necessary columns

askwomen = pd.read_csv('datasets/askwomen.csv')
askwomen = askwomen[['subreddit','selftext','title']]
askwomen.head()

Unnamed: 0,subreddit,selftext,title
0,AskWomen,This is the third version of our COVID-19 post...,Don't Bring May Flowers - Coronavrius Mega Thr...
1,AskWomen,,What are the best parts about living on your own?
2,AskWomen,,"Does anyone experience ""emotional hangovers,"" ..."
3,AskWomen,"For example, having strong opinions on persona...",How do you deal with critical people that spea...
4,AskWomen,,What’s one mistake your mother made that you v...


In [7]:
askwomen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1137 non-null   object
 1   selftext   306 non-null    object
 2   title      1137 non-null   object
dtypes: object(3)
memory usage: 26.8+ KB


In [8]:
askwomen.isnull().sum()

subreddit      0
selftext     831
title          0
dtype: int64

In [9]:
askwomen.shape

(1137, 3)

## Data Cleaning

In [10]:
# Concatenate selftext with title (no separator); a missing selftext is treated as an empty string

askmen['selftext'] = askmen['selftext'].str.cat(askmen['title'],na_rep='')
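
The behaviour of `str.cat` with `na_rep=''` can be checked on a tiny frame. This is a minimal sketch with made-up rows, not the scraped data; the `sep=" "` variant is an assumption about what may be preferable, since without a separator the last word of the body runs straight into the first word of the title.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scraped posts (values are made up).
toy = pd.DataFrame({
    "selftext": ["Body text.", np.nan],
    "title": ["A title?", "Title only"],
})

# As in the cell above: no separator, a missing selftext becomes ''.
no_sep = toy["selftext"].str.cat(toy["title"], na_rep="")

# Variant: one space between the fields, then strip the leading space
# left behind when selftext was missing.
with_sep = toy["selftext"].str.cat(toy["title"], sep=" ", na_rep="").str.strip()

print(no_sep.tolist())
print(with_sep.tolist())
```

The first list shows the two fields fused together; the second keeps them as separate words.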

In [11]:
# AskMen: keep only the label and the combined text

askmen = askmen[['subreddit','selftext']]
askmen.head()

Unnamed: 0,subreddit,selftext
0,AskMen,"Sup shitlords,\n\nThis is a very hectic time i..."
1,AskMen,"In the four years that I’ve been a mod here, t..."
2,AskMen,"Probably tmi, and I should probably make an al..."
3,AskMen,I for one am struggling as a 25 year old with ...
4,AskMen,I hope this doesn't start an internet tough gu...


In [12]:
# Check null values

askmen.isnull().sum()

subreddit    0
selftext     0
dtype: int64

In [13]:
# Concatenate selftext with title (no separator); a missing selftext is treated as an empty string

askwomen['selftext'] = askwomen['selftext'].str.cat(askwomen['title'],na_rep = '')

In [14]:
# AskWomen: keep only the label and the combined text

askwomen = askwomen[['subreddit','selftext']]
askwomen.head()

Unnamed: 0,subreddit,selftext
0,AskWomen,This is the third version of our COVID-19 post...
1,AskWomen,What are the best parts about living on your own?
2,AskWomen,"Does anyone experience ""emotional hangovers,"" ..."
3,AskWomen,"For example, having strong opinions on persona..."
4,AskWomen,What’s one mistake your mother made that you v...


In [15]:
# Check null values

askwomen.isnull().sum()

subreddit    0
selftext     0
dtype: int64

In [16]:
# Concat askmen and askwomen

df = pd.concat([askmen,askwomen],axis=0)

In [17]:
# reset index

df.reset_index(drop=True,inplace=True)

In [18]:
df.shape

(2286, 2)

## Pre-Processing

In [19]:
'''
Convert AskMen & AskWomen into binary labels:

0 for AskWomen
1 for AskMen
'''

df['subreddit'] = df['subreddit'].map({'AskMen': 1,'AskWomen': 0})

In [20]:
df.dtypes

subreddit     int64
selftext     object
dtype: object

In [21]:
df.head()

Unnamed: 0,subreddit,selftext
0,1,"Sup shitlords,\n\nThis is a very hectic time i..."
1,1,"In the four years that I’ve been a mod here, t..."
2,1,"Probably tmi, and I should probably make an al..."
3,1,I for one am struggling as a 25 year old with ...
4,1,I hope this doesn't start an internet tough gu...


In [22]:
# Baseline accuracy: the share of the majority class

df['subreddit'].value_counts(normalize=True)

1    0.502625
0    0.497375
Name: subreddit, dtype: float64
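
The baseline can be reproduced by hand from the row counts reported earlier. A small sketch, assuming the baseline is a classifier that always predicts the majority class:

```python
# Row counts reported by askmen.shape and askwomen.shape above.
n_askmen, n_askwomen = 1149, 1137
total = n_askmen + n_askwomen

# A classifier that always predicts the majority class (AskMen, label 1)
# is correct exactly as often as that class appears.
baseline_accuracy = max(n_askmen, n_askwomen) / total

print(total)
print(round(baseline_accuracy, 6))
```

Any model worth keeping has to beat this figure of roughly 50.3%.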

In [23]:
X = df[['selftext']]
y = df['subreddit']

In [24]:
# Create train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.3,
                                                    random_state = 42)
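
The split above does not stratify on the label. With classes this balanced it probably matters little, but as a sketch of the option, `stratify` keeps the class proportions identical in both subsets. The toy arrays below are illustrative and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with perfectly balanced labels (illustrative only).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratify, each subset keeps the 50/50 class balance.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```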

In [25]:
X_train.shape

(1600, 1)

In [26]:
X_test.shape

(686, 1)

In [27]:
# Check for HTML code artifacts

X_train['selftext'][300]

"I've been dating this guy for a few weeks who I'm suuuupppppeeeerrrr into and things are going perfectly, but I know he's talked about me to his friends/shown them my picture but I don't know what kinds of things he would say to them? the only specific thing I know is that he sent my picture to the group chat he has with his close friends saying he hoped I looked like my picture in person and that I seem cool before our first date\n\nBut anyway, I'm curious about most men and their take on this in general lolWhat kinds of things do guys talk about when it comes to girls they're dating?"

In [28]:
# Create a function to clean selftext

def cleaning(text):
    
    # Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]"," ", text)

    # Convert to lower case
    words = letters_only.lower()
    
    return words
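
To see what the function does to a post, it can be run on a short made-up string. The sketch below repeats the same logic so that it runs on its own; the sample text is hypothetical.

```python
import re

def cleaning(text):
    # Same logic as the function above: letters only, lower-cased.
    return re.sub("[^a-zA-Z]", " ", text).lower()

sample = "I'm 25!\nOK"
cleaned = cleaning(sample)

# Every removed character leaves one space behind, so runs of spaces appear.
print(repr(cleaned))

# Splitting on whitespace shows the resulting words; note that the
# apostrophe turns "I'm" into the two tokens "i" and "m".
print(cleaned.split())
```

The extra spaces are harmless for the vectorizers used later, since their default token pattern only keeps runs of two or more word characters. That same default also discards single-letter tokens such as "i" and "m".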

In [29]:
# Initialize empty lists to hold the cleaned posts.
X_train_clean = []
X_test_clean = []

# Run through each row in X_train selftext
for row in X_train['selftext']:
    
    # Clean the text and append to X_train_clean
    X_train_clean.append(cleaning(row))
    
# Run through each row in X_test selftext
for row in X_test['selftext']:
    
    # Clean the text and append to X_test_clean
    X_test_clean.append(cleaning(row))

## Modelling

### Model 1 : CountVectorizer & LogisticRegression

In [30]:
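# Add the subreddit names to the English stop words so the model cannot rely on them

from sklearn.feature_extraction import text

# Convert to a list: recent scikit-learn releases reject a frozenset for stop_words
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["askmen","askwomen"]))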

In [31]:
# Set up pipeline

pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words=my_stop_words)),
    ('lr', LogisticRegression(solver = 'lbfgs'))
])

In [32]:
# Set the pipe parameter

pipe_params = {
    'cvec__max_features': [500,1_000,1_500,2_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}
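
It helps to know how much work this grid implies before fitting. The sketch below counts the combinations with `itertools.product`; the step prefix is dropped from the keys for brevity, and the fold count of 5 is taken from the `cv=5` setting used in the search.

```python
from itertools import product

# The grid from the cell above, without the 'cvec__' prefix.
grid = {
    "max_features": [500, 1_000, 1_500, 2_000],
    "min_df": [2, 3],
    "max_df": [0.9, 0.95],
    "ngram_range": [(1, 1), (1, 2)],
}

# Every combination of one value per parameter is one candidate model.
n_candidates = len(list(product(*grid.values())))

# Each candidate is fitted once per cross-validation fold.
cv_folds = 5
n_cv_fits = n_candidates * cv_folds

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, the search also fits the best candidate once more on the full training set.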

In [33]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, 
                  param_grid=pipe_params, 
                  cv=5)

In [34]:
# Fit GridSearch to training data.
gs.fit(X_train_clean, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [35]:
# Gridsearch best score

gs.best_score_

0.985625

In [36]:
# Get the best hyperparameters

gs.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 1500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1)}

In [37]:
# Save the best model 

gs.model = gs.best_estimator_

In [38]:
# Score model on training set

gs.model.score(X_train_clean,y_train)

0.9875

In [39]:
# Score model on testing set

gs.model.score(X_test_clean,y_test)

0.9810495626822158

### Inference

The model reaches about 98% accuracy on the test set (0.981), far above the 50.3% baseline. Training accuracy (0.9875) is marginally higher than test accuracy, a gap of under one percentage point, so any overfitting is slight.
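
The reported accuracies translate into a small number of misclassified posts. A quick arithmetic sketch using the set sizes and scores printed above:

```python
# Set sizes from X_train.shape / X_test.shape and the scores reported above.
n_train, n_test = 1600, 686
train_acc, test_acc = 0.9875, 0.9810495626822158

# Number of misclassified posts in each set.
train_errors = round(n_train * (1 - train_acc))
test_errors = round(n_test * (1 - test_acc))

# Train/test gap expressed in percentage points.
gap_points = (train_acc - test_acc) * 100

print(train_errors, test_errors, round(gap_points, 2))
```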

### Model 2: TfidfVectorizer & LogisticRegression

In [40]:
# Set up pipeline

tvec_pipe = Pipeline([
    ('tvec', TfidfVectorizer(stop_words=my_stop_words)),
    ('lr', LogisticRegression(solver = 'lbfgs'))
])

In [41]:
# Set the pipe params

pipe_params = {
    'tvec__max_features': [500,1_000,1_500,2_000],
    'tvec__min_df': [2, 3],
    'tvec__max_df': [.9, .95],
    'tvec__ngram_range': [(1,1), (1,2)]
}

In [42]:
# Instantiate GridSearchCV.

gs_tvec = GridSearchCV(tvec_pipe, 
                  param_grid=pipe_params, 
                  cv=5)

In [43]:
# Fit GridSearch to training data.
gs_tvec.fit(X_train_clean, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tvec',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [44]:
# Gridsearch best score

gs_tvec.best_score_

0.9774999999999998

In [45]:
# Get the best hyperparameters

gs_tvec.best_params_

{'tvec__max_df': 0.9,
 'tvec__max_features': 2000,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 1)}

In [46]:
# Save the best model 

gs_tvec.model = gs_tvec.best_estimator_

In [47]:
# Score model on training set

gs_tvec.model.score(X_train_clean,y_train)

0.9875

In [48]:
# Score model on testing set

gs_tvec.model.score(X_test_clean,y_test)

0.9810495626822158

### Inference

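The TfidfVectorizer pipeline scores the same as the CountVectorizer pipeline on both the training set (0.9875) and the test set (0.981, about 98%), although its best cross-validated score is slightly lower (0.9775 versus 0.9856).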