# Project 3 : 3-NLP

## Problem statement

> #### Build a model that predicts whether a given post came from the cat subreddit or the dog subreddit

> - **Create and compare two models**. Use at least two classifiers of your choosing: random forest, logistic regression, KNN, etc.

### Game plan

1. cvec + rf
2.   tfidf + rf

3. cvec + lr
4.   tfidf + lr

5. cvec + knn
6.   tfidf + knn

7. cvec + nb + gs : try 2grams 3grams

8. cvec + boost

9. stack / lv1 = best1 + best2 + best3 / lv2 = lr

**Extra:** maybe use lemmatization or stemming

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, f1_score

In [2]:
## These helpers may or may not be used below; they are defined here just in case
## Lemmatization
from nltk import word_tokenize   
from nltk.stem import WordNetLemmatizer
import re
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        regex_num_ponctuation = r'(\d+)|([^\w\s])'
        regex_little_words = r'(\b\w{1,2}\b)'
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) 
                if not re.search(regex_num_ponctuation, t) and not re.search(regex_little_words, t)]
# got this from https://stackoverflow.com/questions/47423854/sklearn-adding-lemmatizer-to-countvectorizer

## Stemming
import nltk.stem    
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])
# got this from https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn

# Custom preprocessor - from breakfast hours
def my_preprocessor(text):
    text = text.lower() # a custom preprocessor overrides the vectorizer's lowercasing, so do it here
    text = re.sub(r'\n', '', text)
    text = re.findall(r"[\w']+|\$[\d\.]+", text)
    text = " ".join(text)
    
    return text
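
To see what `my_preprocessor` actually hands to the vectorizer, the sketch below repeats the function (with raw-string patterns) and runs it on a made-up title. The sample string is an assumption for illustration, not a row from the dataset.

```python
import re

def my_preprocessor(text):
    # same steps as the notebook's helper, repeated so this block runs on its own
    text = text.lower()
    text = re.sub(r'\n', '', text)                  # newlines are deleted, not replaced by a space
    tokens = re.findall(r"[\w']+|\$[\d\.]+", text)  # words (with apostrophes) or dollar amounts
    return " ".join(tokens)

sample = "My cat's toy\ncost $5.50!"   # hypothetical title
cleaned = my_preprocessor(sample)
print(cleaned)
```

Because the newline is removed rather than replaced by a space, the words on either side of a line break are joined (`toycost` here). Whether that matters depends on how often post titles contain newlines.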

## Load Data

In [3]:
df = pd.read_csv('../data/nlp_cat_dog.csv')
df.drop(columns='Unnamed: 0', inplace=True)
df.shape

(8000, 2)

## Create `X` and `y`

In [4]:
X = df['title']
X.head(), X.shape

(0       The “how is it only Wednesday?!” mood - 😄 (OC)
 1                      The Most Wonderful Time Of Year
 2            My grumpy boy doesn’t like being cuddled.
 3                                   If I fits it sits!
 4    Friendly baby got into our garage last night t...
 Name: title, dtype: object,
 (8000,))

In [5]:
y = df['subreddit']
y.sample(5) #1=cats, 0=dogs

7755    0
985     1
2233    1
6113    0
3667    1
Name: subreddit, dtype: int64

### Baseline Accuracy

In [6]:
df['subreddit'].value_counts(normalize=True) #each category split equally 50:50

1    0.5
0    0.5
Name: subreddit, dtype: float64

> The baseline accuracy is 0.5, so the test score of any model should be higher than 0.5 and as close to 1.0 as possible.
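
The same baseline can be checked with a model that always predicts one class. This is a minimal sketch on stand-in labels; `DummyClassifier` is not among the notebook's imports, and `X_toy` / `y_toy` are hypothetical placeholders for the real data.

```python
from sklearn.dummy import DummyClassifier

# hypothetical stand-in for the 50:50 labels (1 = cats, 0 = dogs)
y_toy = [0, 1] * 50
X_toy = [[0]] * len(y_toy)      # the dummy model ignores the features

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

With balanced classes the dummy model is right half of the time, which is the 0.5 floor that every pipeline below is compared against.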

### Train/Test split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.15)
X_train.shape, X_test.shape, y_train.shape, y_test.shape, y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)

((6800,),
 (1200,),
 (6800,),
 (1200,),
 0    0.5
 1    0.5
 Name: subreddit, dtype: float64,
 0    0.5
 1    0.5
 Name: subreddit, dtype: float64)

## Modeling

> The ordering of the methods follows an idea from Simon's Hackathon group.

In [8]:
# set default cv
cv=5

### 1. cvec + rf (no grid search, because default params scored best)

In [9]:
pipe_1 = Pipeline([
    ('cvec', CountVectorizer(preprocessor=my_preprocessor)),
    ('rf', RandomForestClassifier())
])

params_1 = {
    # after many trials, the default params gave the best result for the stack
}

In [10]:
model_1 = GridSearchCV(pipe_1, param_grid=params_1, cv=cv, n_jobs=-1)

In [11]:
model_1.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        CountVectorizer(preprocessor=<function my_preprocessor at 0x7fc2f39bc310>)),
                                       ('rf', RandomForestClassifier())]),
             n_jobs=-1, param_grid={})

In [12]:
model_1.best_score_, model_1.best_params_

(0.7444117647058823, {})

In [13]:
model_1.score(X_train, y_train), model_1.score(X_test, y_test)

(0.9882352941176471, 0.7508333333333334)

> After several trials, the default params gave the best score.
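
With an empty `param_grid`, `GridSearchCV` evaluates a single candidate (the pipeline as given), so `best_score_` is simply the mean cross-validated accuracy. The sketch below illustrates this on made-up titles; `X_toy` and `y_toy` are hypothetical, and `LogisticRegression` is used instead of the random forest so the two numbers are deterministic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# made-up titles, 10 per class, only so the block runs on its own
X_toy = [f"my cat purrs on the sofa {i}" for i in range(10)] + \
        [f"my dog barks in the park {i}" for i in range(10)]
y_toy = [1] * 10 + [0] * 10

pipe = Pipeline([('cvec', CountVectorizer()), ('lr', LogisticRegression())])

# an empty grid means one candidate: the pipeline with its current params
search = GridSearchCV(pipe, param_grid={}, cv=5).fit(X_toy, y_toy)

# plain 5-fold cross-validation of the same pipeline
cv_mean = cross_val_score(pipe, X_toy, y_toy, cv=5).mean()

print(search.best_score_, cv_mean)
```

Note that `RandomForestClassifier()` in `pipe_1` has no `random_state`, so its scores can differ slightly from run to run.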

### 2. tfidf + rf

In [14]:
pipe_2 = Pipeline([
    ('tf', TfidfVectorizer()),
    ('rf', RandomForestClassifier(random_state=42))
])

In [15]:
params_2 = {
    'tf__stop_words': ['english'],
    'tf__max_features': [4000, 6000],
    'tf__min_df': [2],
    'tf__max_df': [.9],
    'tf__ngram_range': [(1,1)],
    'rf__n_estimators':[600],
    'rf__max_depth':[None],
    'rf__min_samples_split':[6,8,10],
    'rf__min_samples_leaf': [2],
    'rf__ccp_alpha':[0, .1, .01],
}


In [16]:
# note: params_2 above is not passed in; this search runs with the default params only
model_2 = GridSearchCV(pipe_2, param_grid={}, cv=cv)

In [17]:
model_2.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf', TfidfVectorizer()),
                                       ('rf',
                                        RandomForestClassifier(random_state=42))]),
             param_grid={})

In [18]:
model_2.best_score_, model_2.best_params_

(0.7385294117647059, {})

In [19]:
model_2.score(X_train, y_train), model_2.score(X_test, y_test)

(0.9880882352941176, 0.7533333333333333)

### 3. cvec + lr (do more grid search)

In [20]:
pipe_3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression()),
])

In [21]:
params_3 = {
    'cvec__stop_words': ['english'],
    'cvec__max_features': [5000, 6000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8, .9],
    'cvec__ngram_range': [(1,1),(1, 2)],
    'lr__C':[0.01, 0.1, 1],
}

In [22]:
model_3 = GridSearchCV(pipe_3, param_grid=params_3, cv=cv)

In [23]:
model_3.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('lr', LogisticRegression())]),
             param_grid={'cvec__max_df': [0.8, 0.9],
                         'cvec__max_features': [5000, 6000],
                         'cvec__min_df': [1, 2],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': ['english'],
                         'lr__C': [0.01, 0.1, 1]})

In [24]:
model_3.best_score_, model_3.best_params_

(0.7588235294117647,
 {'cvec__max_df': 0.8,
  'cvec__max_features': 6000,
  'cvec__min_df': 1,
  'cvec__ngram_range': (1, 2),
  'cvec__stop_words': 'english',
  'lr__C': 1})

In [25]:
model_3.score(X_train, y_train), model_3.score(X_test, y_test)

(0.9041176470588236, 0.7666666666666667)

### 4. stemmed cvec + lr (tfidf commented out; no grid search, because default params scored best)

In [26]:
pipe_4 = Pipeline([
    #('tf', TfidfVectorizer()),
    ('cvec', StemmedCountVectorizer(stop_words='english')),
    ('lr', LogisticRegression()),
])

In [27]:
params_4 = {
    # default is the best
}

In [28]:
model_4 = GridSearchCV(pipe_4, param_grid=params_4, cv=cv)

In [29]:
model_4.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        StemmedCountVectorizer(stop_words='english')),
                                       ('lr', LogisticRegression())]),
             param_grid={})

In [30]:
model_4.best_score_, model_4.best_params_

(0.7614705882352941, {})

In [31]:
model_4.score(X_train, y_train), model_4.score(X_test, y_test)

(0.9105882352941177, 0.7833333333333333)

### 5. cvec + knn

In [32]:
pipe_5 = Pipeline([
    ('cvec', CountVectorizer()),
    ('knn', KNeighborsClassifier()),
])

In [33]:
params_5 = {
    'cvec__stop_words': [None, 'english'],
    'cvec__max_features': [2000, 3000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [0.7, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2)]
}

In [34]:
model_5 = GridSearchCV(pipe_5, param_grid=params_5, cv=cv)

In [35]:
model_5.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'cvec__max_df': [0.7, 0.8],
                         'cvec__max_features': [2000, 3000],
                         'cvec__min_df': [1, 2],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': [None, 'english']})

In [36]:
model_5.best_score_, model_5.best_params_

(0.7023529411764706,
 {'cvec__max_df': 0.7,
  'cvec__max_features': 2000,
  'cvec__min_df': 2,
  'cvec__ngram_range': (1, 1),
  'cvec__stop_words': 'english'})

In [37]:
model_5.score(X_train, y_train), model_5.score(X_test, y_test)

(0.8039705882352941, 0.7041666666666667)

### 6. tfidf + knn

In [38]:
pipe_6 = Pipeline([
    ('tf', TfidfVectorizer()),
    ('knn', KNeighborsClassifier())
])

In [39]:
params_6 = {
    'tf__stop_words': ['english'],
    'tf__max_features': [500],
    'tf__min_df': [1,2],
    'tf__max_df': [.7],
    'tf__ngram_range': [(1, 2),(1,3)]
}

In [40]:
model_6 = GridSearchCV(pipe_6, param_grid=params_6, cv=cv)

In [41]:
model_6.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf', TfidfVectorizer()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'tf__max_df': [0.7], 'tf__max_features': [500],
                         'tf__min_df': [1, 2],
                         'tf__ngram_range': [(1, 2), (1, 3)],
                         'tf__stop_words': ['english']})

In [42]:
model_6.best_score_, model_6.best_params_

(0.6460294117647059,
 {'tf__max_df': 0.7,
  'tf__max_features': 500,
  'tf__min_df': 2,
  'tf__ngram_range': (1, 2),
  'tf__stop_words': 'english'})

In [43]:
model_6.score(X_train, y_train), model_6.score(X_test, y_test)

(0.7489705882352942, 0.6541666666666667)

### 7. cvec + nb + gs : try 1grams 2grams (grid search over the stemmed CountVectorizer params)

In [44]:
pipe_7 = Pipeline([
    ('cvec', StemmedCountVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])

In [45]:
params_7 = {
    'cvec__max_features': [4000, 6000, 8000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [.8, .9],
    'cvec__ngram_range': [(1,1),(1, 2)]
}

In [46]:
model_7 = GridSearchCV(pipe_7, param_grid=params_7, cv=cv)

In [47]:
model_7.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        StemmedCountVectorizer(stop_words='english')),
                                       ('nb', MultinomialNB())]),
             param_grid={'cvec__max_df': [0.8, 0.9],
                         'cvec__max_features': [4000, 6000, 8000],
                         'cvec__min_df': [2, 3, 4],
                         'cvec__ngram_range': [(1, 1), (1, 2)]})

In [48]:
model_7.best_score_, model_7.best_params_

(0.7498529411764705,
 {'cvec__max_df': 0.8,
  'cvec__max_features': 4000,
  'cvec__min_df': 2,
  'cvec__ngram_range': (1, 1)})

In [49]:
model_7.score(X_train, y_train), model_7.score(X_test, y_test)

(0.8289705882352941, 0.7708333333333334)

### 8. cvec + boost

In [50]:
pipe_8 = Pipeline([
    ('cvec', CountVectorizer(preprocessor=my_preprocessor)),
    ('boost', AdaBoostClassifier())
])

In [51]:
params_8 = {
    'cvec__stop_words': ['english'],
    'cvec__max_features': [4000, 6000, 8000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [.8, .9],
    'cvec__ngram_range': [(1,1),(1, 2)]
}

In [52]:
model_8 = GridSearchCV(pipe_8, param_grid=params_8, cv=cv)

In [53]:
model_8.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        CountVectorizer(preprocessor=<function my_preprocessor at 0x7fc2f39bc310>)),
                                       ('boost', AdaBoostClassifier())]),
             param_grid={'cvec__max_df': [0.8, 0.9],
                         'cvec__max_features': [4000, 6000, 8000],
                         'cvec__min_df': [2, 3, 4],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': ['english']})

In [54]:
model_8.best_score_, model_8.best_params_

(0.7232352941176471,
 {'cvec__max_df': 0.8,
  'cvec__max_features': 4000,
  'cvec__min_df': 4,
  'cvec__ngram_range': (1, 1),
  'cvec__stop_words': 'english'})

In [55]:
model_8.score(X_train, y_train), model_8.score(X_test, y_test)

(0.7322058823529412, 0.7283333333333334)

### 9. stack
> Check best score first

In [56]:
total_test_score =[]
total_test_score.append(model_1.score(X_test, y_test))
total_test_score.append(model_2.score(X_test, y_test))
total_test_score.append(model_3.score(X_test, y_test))
total_test_score.append(model_4.score(X_test, y_test))
total_test_score.append(model_5.score(X_test, y_test))
total_test_score.append(model_6.score(X_test, y_test))
total_test_score.append(model_7.score(X_test, y_test))
total_test_score.append(model_8.score(X_test, y_test))

for i in range(len(total_test_score)):
    print('model', i+1, '=', round(total_test_score[i], 4), '*'*int((total_test_score[i])/0.05), i+1)

model 1 = 0.7508 *************** 1
model 2 = 0.7533 *************** 2
model 3 = 0.7667 *************** 3
model 4 = 0.7833 *************** 4
model 5 = 0.7042 ************** 5
model 6 = 0.6542 ************* 6
model 7 = 0.7708 *************** 7
model 8 = 0.7283 ************** 8
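
As a cross-check on which models score highest, the ranking can be computed directly. This sketch hard-codes the test accuracies from the printout above; the dictionary `test_scores` is a hypothetical helper, not something the notebook defines.

```python
# test accuracies copied from the printout above, keyed by model number
test_scores = {1: 0.7508, 2: 0.7533, 3: 0.7667, 4: 0.7833,
               5: 0.7042, 6: 0.6542, 7: 0.7708, 8: 0.7283}

# sort model numbers by score, highest first, and keep three
top3 = sorted(test_scores, key=test_scores.get, reverse=True)[:3]
print(top3)
```

By test accuracy alone the top three are models 4, 7 and 3, while the stack in the next cell is built from `model_1`, `model_4` and `model_7`. That choice may reflect a wish for three different model families, but this is an assumption. Ranking on the test set also reuses it for model selection; ranking on each search's `best_score_` would avoid that.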


In [57]:
# Use models 1, 4, 7
level_1 = [
    ('model_1', model_1.best_estimator_),
    ('model_4', model_4.best_estimator_),
    ('model_7', model_7.best_estimator_),
]
level_2 = LogisticRegression()

model_9 = StackingClassifier(estimators=level_1,
                             final_estimator=level_2)

In [58]:
model_9.fit(X_train, y_train)

StackingClassifier(estimators=[('model_1',
                                Pipeline(steps=[('cvec',
                                                 CountVectorizer(preprocessor=<function my_preprocessor at 0x7fc2f39bc310>)),
                                                ('rf',
                                                 RandomForestClassifier())])),
                               ('model_4',
                                Pipeline(steps=[('cvec',
                                                 StemmedCountVectorizer(stop_words='english')),
                                                ('lr', LogisticRegression())])),
                               ('model_7',
                                Pipeline(steps=[('cvec',
                                                 StemmedCountVectorizer(max_df=0.8,
                                                                        max_features=4000,
                                                                        min_df=2,
      

In [59]:
model_9.score(X_train, y_train), model_9.score(X_test, y_test)

(0.9501470588235295, 0.7941666666666667)
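
To make the two levels of the stack more concrete, here is a minimal sketch on made-up titles, assuming two base pipelines instead of three. `X_toy`, `y_toy` and the estimator names `lr_pipe` / `nb_pipe` are hypothetical. It shows that the level-2 model is trained on one column per base model (the predicted probability of class 1).

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up titles, 10 per class, only so the block runs on its own
X_toy = [f"my cat purrs on the sofa {i}" for i in range(10)] + \
        [f"my dog barks in the park {i}" for i in range(10)]
y_toy = [1] * 10 + [0] * 10

# level 1: each base model gets its own vectorizer inside a pipeline
level_1 = [
    ('lr_pipe', Pipeline([('cvec', CountVectorizer()), ('lr', LogisticRegression())])),
    ('nb_pipe', Pipeline([('cvec', CountVectorizer()), ('nb', MultinomialNB())])),
]

# level 2: logistic regression on the base models' cross-validated predictions
stack = StackingClassifier(estimators=level_1,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_toy, y_toy)

# what the level-2 model sees: one probability column per base model
meta_features = stack.transform(X_toy)
print(meta_features.shape)
```

`StackingClassifier` refits clones of the estimators it is given, so passing already fitted `best_estimator_` objects, as the notebook does, reuses their parameters rather than their fitted state.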

___

## Evaluate

In [60]:
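def get_recall(fitted_model):
    """Return [recall, f1] on the test set for the positive class (1 = cats)."""
    y_preds = fitted_model.predict(X_test)
    cm = confusion_matrix(y_test, y_preds)
    tn, fp, fn, tp = cm.ravel()
    # recall = share of actual cat posts that the model labels as cats
    recall = np.round((tp)/(tp+fn), 2)
    f1 = np.round(f1_score(y_test, y_preds), 2)
    return [recall, f1]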

In [61]:
model = [model_1, model_2, model_3, model_4, model_5, model_6, model_7, model_8, model_9]
score_list = []
for i in range(len(model)):
    score_list.append(get_recall(model[i]))

In [62]:
for i in range(len(model)):
    print('model', i+1, ', recall =', score_list[i][0], ', f1 =', score_list[i][1])

model 1 , recall = 0.81 , f1 = 0.77
model 2 , recall = 0.81 , f1 = 0.77
model 3 , recall = 0.83 , f1 = 0.78
model 4 , recall = 0.84 , f1 = 0.79
model 5 , recall = 0.85 , f1 = 0.74
model 6 , recall = 0.77 , f1 = 0.69
model 7 , recall = 0.77 , f1 = 0.77
model 8 , recall = 0.96 , f1 = 0.78
model 9 , recall = 0.84 , f1 = 0.8
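
Recall and F1 together imply a precision, which helps when reading model 8's numbers. The sketch below rearranges the F1 formula; it uses the rounded values from the printout, so the result is approximate.

```python
# model 8, from the printout above (both values are rounded to 2 decimals)
recall = 0.96
f1 = 0.78

# F1 = 2PR / (P + R)  =>  P = F1 * R / (2R - F1)
precision = f1 * recall / (2 * recall - f1)
print(round(precision, 2))
```

A precision of roughly 0.66 means about one in three posts that model 8 labels as cats is actually a dog post. The high recall therefore comes from predicting "cat" often, which matches its lower test accuracy of 0.7283.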


### Best predictor: Stack with 3 models
1. Pipeline CountVectorizer(preprocessor=my_preprocessor) + RandomForestClassifier with default options
2. Pipeline StemmedCountVectorizer(stop_words='english') + LogisticRegression with default options
3. Pipeline StemmedCountVectorizer(stop_words='english', max_df=0.8, max_features=4000, min_df=2) + MultinomialNB with default options

### Summary
- The stack model will be sent to the image processing team, with the caution that its score is quite low (test accuracy 0.79).
- CVEC+ADA (model 8) has a high recall of 0.96, so this model will also be passed to their team for predicting cats.
- The score may improve with other text-processing techniques.