# Project 3 : 3-NLP

## Problem statement

> #### Build a model that predicts whether a given post came from the cats or the dogs subreddit

> - **Create and compare two models**. Use at least two classifiers of your choosing: random forest, logistic regression, KNN, etc.

### Game plan

1. stem_cvec + rf
2. tfidf + rf
3. cvec + lr
4. stem_cvec + lr (default)
5. cvec + knn
6. tfidf + knn
7. stem_cvec + nb
8. cvec + adaboost
9. stack with best 3 from model 1-8

** Extra: maybe use lemmatization or stemming when vectorizing

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, f1_score  # plot_confusion_matrix was unused here and is removed in newer scikit-learn releases

In [2]:
## These helpers may or may not be used; they are kept here just in case
## Lemmatization
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # drop tokens containing digits or punctuation, and tokens that are only 1-2 characters long
        regex_num_punctuation = r'(\d+)|([^\w\s])'
        regex_little_words = r'(\b\w{1,2}\b)'
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)
                if not re.search(regex_num_punctuation, t) and not re.search(regex_little_words, t)]
# got this from https://stackoverflow.com/questions/47423854/sklearn-adding-lemmatizer-to-countvectorizer

## Stemming
import nltk.stem    
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # wrap the default analyzer so every token it yields is stemmed
        analyzer = super().build_analyzer()
        return lambda doc: [english_stemmer.stem(w) for w in analyzer(doc)]
# got this from https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn

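# Custom preprocessor - from breakfast hours
def my_preprocessor(text):
    text = text.lower() # lowercase manually: a custom preprocessor overrides the vectorizer's built-in lowercasing
    text = re.sub(r'\n', '', text) # remove newline characters
    text = re.findall(r"[\w']+|\$[\d\.]+", text) # keep words (with apostrophes) and dollar amounts
    text = " ".join(text)
    
    return text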
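
To make the effect of the custom preprocessor concrete, here is a small standalone sketch that repeats the same three steps on two made-up strings. The sample strings are assumptions for illustration only and are not taken from the dataset.

```python
import re

def my_preprocessor(text):
    """Same steps as the notebook's preprocessor: lowercase, drop newlines, keep words and dollar amounts."""
    text = text.lower()
    text = re.sub(r'\n', '', text)
    tokens = re.findall(r"[\w']+|\$[\d\.]+", text)
    return " ".join(tokens)

# Hypothetical post titles, for illustration only
sample_1 = "My CAT, Sanji: worth $5.50!"
sample_2 = "grumpy\nfloof"

cleaned_1 = my_preprocessor(sample_1)
cleaned_2 = my_preprocessor(sample_2)
print(cleaned_1)  # my cat sanji worth $5.50
print(cleaned_2)  # grumpyfloof
```

Because the newline is replaced with an empty string, the words on either side of it are joined into one token. Substituting a space instead would keep them separate, if that is the intent.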

## Load Data

In [3]:
df = pd.read_csv('../data/nlp_cat_dog.csv')
#df.drop(columns='Unnamed: 0', inplace=True)
df.shape

(10000, 2)

## Create `X` and `y`

In [4]:
X = df['title']
X.head(), X.shape

(0    Sanji kept staring at me so i seized the oppor...
 1        I make minimalistic clay cats. The Dinky Cats
 2          Why has my cat been throwing up her treats?
 3                                    My grumpy floof 💕
 4              When you hear the other cats being fed.
 Name: title, dtype: object,
 (10000,))

In [5]:
y = df['subreddit']
y.sample(5) #1=cats, 0=dogs

1646    1
1551    1
4896    1
2168    1
9439    0
Name: subreddit, dtype: int64

### Baseline Accuracy
#### Any prediction model should beat the baseline accuracy of 0.5

In [6]:
df['subreddit'].value_counts(normalize=True) #each category split equally 50:50

1    0.5
0    0.5
Name: subreddit, dtype: float64

> The test score of every model should be higher than 0.5 and as close to 1.0 as possible
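
The 0.5 baseline is the accuracy obtained by always predicting the most frequent class. A minimal sketch of that calculation follows; the `baseline_accuracy` helper and the toy label list are hypothetical stand-ins for `df['subreddit']`, not part of the notebook.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy stand-in for df['subreddit']: 1 = cats, 0 = dogs, split 50:50
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]
print(baseline_accuracy(toy_labels))  # 0.5
```

With an imbalanced label column the same function would return the majority-class share, which is then the number a model has to beat.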

### Train/Test split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.15)
X_train.shape, X_test.shape, y_train.shape, y_test.shape, y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)

((8500,),
 (1500,),
 (8500,),
 (1500,),
 1    0.5
 0    0.5
 Name: subreddit, dtype: float64,
 1    0.5
 0    0.5
 Name: subreddit, dtype: float64)

## Modeling

> The ordering of the methods was inspired by Simon's Hackathon group

In [8]:
# set default cv
cv=5

### 1. stem_cvec + rf

In [9]:
pipe_1 = Pipeline([
    ('cvec', StemmedCountVectorizer(preprocessor=my_preprocessor)),
    ('rf', RandomForestClassifier())
])

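params_1 = {
    # note: an integer max_df is an absolute document count, so 1 means "at most one document"; 1.0 would mean 100%
    'cvec__max_df': [0.5, 0.8, 1],
    'rf__ccp_alpha':[0, .1, .01],
}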
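
One detail of the `cvec__max_df` values: in scikit-learn's vectorizers a float `max_df` is read as a proportion of documents, while an integer is read as an absolute document count, so `1` and `1.0` behave very differently. The pure-Python sketch below imitates that rule on a toy corpus. `prune_by_max_df` and `toy_docs` are illustrative names, not scikit-learn APIs, and the whitespace tokenization is a simplification.

```python
from collections import Counter

def prune_by_max_df(docs, max_df):
    """Keep only terms whose document frequency does not exceed the max_df threshold.

    A float max_df is a proportion of documents; an int max_df is an absolute count.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_doc_count = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, freq in doc_freq.items() if freq <= max_doc_count)

toy_docs = ["cat sat", "cat ran", "dog ran"]
vocab_int = prune_by_max_df(toy_docs, 1)      # terms seen in at most 1 document
vocab_float = prune_by_max_df(toy_docs, 1.0)  # terms seen in at most 100% of documents
print(vocab_int)    # ['dog', 'sat']
print(vocab_float)  # ['cat', 'dog', 'ran', 'sat']
```

Under this rule, a grid entry of `1` keeps only terms that occur in a single document, which is probably not what a "no upper limit" setting was meant to do.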

In [10]:
model_1 = GridSearchCV(pipe_1, param_grid=params_1, cv=cv, n_jobs=-1)

In [11]:
model_1.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        StemmedCountVectorizer(preprocessor=<function my_preprocessor at 0x7f8bc3bc6310>)),
                                       ('rf', RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'cvec__max_df': [0.5, 0.8, 1],
                         'rf__ccp_alpha': [0, 0.1, 0.01]})

In [12]:
model_1.best_score_, model_1.best_params_

(0.9055294117647058, {'cvec__max_df': 0.8, 'rf__ccp_alpha': 0})

In [13]:
model_1.score(X_train, y_train), model_1.score(X_test, y_test)

(0.9981176470588236, 0.9113333333333333)

> After several trial runs, settings close to the defaults gave the best score (the default `ccp_alpha=0` was selected)

### 2. tfidf + rf

In [14]:
pipe_2 = Pipeline([
    ('tf', TfidfVectorizer()),
    ('rf', RandomForestClassifier(random_state=42))
])

In [15]:
params_2 = {
    'tf__stop_words': ['english'],
    'tf__max_features': [4000, 6000],
    'tf__min_df': [2],
    'tf__max_df': [.9],
    'tf__ngram_range': [(1,1)],
    'rf__n_estimators':[600],
    'rf__max_depth':[None],
    'rf__min_samples_split':[6,8,10],
    'rf__min_samples_leaf': [2],
    'rf__ccp_alpha':[0, .1, .01],
}


In [16]:
model_2 = GridSearchCV(pipe_2, param_grid=params_2, cv=cv) #using default

In [17]:
model_2.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf', TfidfVectorizer()),
                                       ('rf',
                                        RandomForestClassifier(random_state=42))]),
             param_grid={'rf__ccp_alpha': [0, 0.1, 0.01],
                         'rf__max_depth': [None], 'rf__min_samples_leaf': [2],
                         'rf__min_samples_split': [6, 8, 10],
                         'rf__n_estimators': [600], 'tf__max_df': [0.9],
                         'tf__max_features': [4000, 6000], 'tf__min_df': [2],
                         'tf__ngram_range': [(1, 1)],
                         'tf__stop_words': ['english']})

In [18]:
model_2.best_score_, model_2.best_params_

(0.8978823529411765,
 {'rf__ccp_alpha': 0,
  'rf__max_depth': None,
  'rf__min_samples_leaf': 2,
  'rf__min_samples_split': 8,
  'rf__n_estimators': 600,
  'tf__max_df': 0.9,
  'tf__max_features': 4000,
  'tf__min_df': 2,
  'tf__ngram_range': (1, 1),
  'tf__stop_words': 'english'})

In [19]:
model_2.score(X_train, y_train), model_2.score(X_test, y_test)

(0.9447058823529412, 0.894)

### 3. cvec + lr

In [20]:
pipe_3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression()),
])

In [21]:
params_3 = {
    'cvec__stop_words': ['english'],
    'cvec__max_features': [5000, 6000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8, .9],
    'cvec__ngram_range': [(1,1),(1, 2)],
    'lr__C':[0.01, 0.1, 1],
}

In [22]:
model_3 = GridSearchCV(pipe_3, param_grid=params_3, cv=cv)

In [23]:
model_3.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('lr', LogisticRegression())]),
             param_grid={'cvec__max_df': [0.8, 0.9],
                         'cvec__max_features': [5000, 6000],
                         'cvec__min_df': [1, 2],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': ['english'],
                         'lr__C': [0.01, 0.1, 1]})

In [24]:
model_3.best_score_, model_3.best_params_

(0.9065882352941177,
 {'cvec__max_df': 0.8,
  'cvec__max_features': 5000,
  'cvec__min_df': 1,
  'cvec__ngram_range': (1, 2),
  'cvec__stop_words': 'english',
  'lr__C': 1})

In [25]:
model_3.score(X_train, y_train), model_3.score(X_test, y_test)

(0.96, 0.896)

### 4. stem_cvec + lr (default)

In [26]:
pipe_4 = Pipeline([
    ('cvec', StemmedCountVectorizer(stop_words='english')),
    ('lr', LogisticRegression()),
])

In [27]:
params_4 = {
    # default
}

In [28]:
model_4 = GridSearchCV(pipe_4, param_grid=params_4, cv=cv)

In [29]:
model_4.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        StemmedCountVectorizer(stop_words='english')),
                                       ('lr', LogisticRegression())]),
             param_grid={})

In [30]:
model_4.best_score_, model_4.best_params_

(0.906235294117647, {})

In [31]:
model_4.score(X_train, y_train), model_4.score(X_test, y_test)

(0.9642352941176471, 0.908)

### 5. cvec + knn

In [32]:
pipe_5 = Pipeline([
    ('cvec', CountVectorizer()),
    ('knn', KNeighborsClassifier()),
])

In [33]:
params_5 = {
    'cvec__stop_words': [None, 'english'],
    'cvec__max_features': [2000, 3000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [0.7, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2)]
}

In [34]:
model_5 = GridSearchCV(pipe_5, param_grid=params_5, cv=cv)

In [35]:
model_5.fit(X_train, y_train)


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'cvec__max_df': [0.7, 0.8],
                         'cvec__max_features': [2000, 3000],
                         'cvec__min_df': [1, 2],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': [None, 'english']})

In [36]:
model_5.best_score_, model_5.best_params_

(0.8514117647058825,
 {'cvec__max_df': 0.7,
  'cvec__max_features': 2000,
  'cvec__min_df': 2,
  'cvec__ngram_range': (1, 2),
  'cvec__stop_words': 'english'})

In [37]:
model_5.score(X_train, y_train), model_5.score(X_test, y_test)


(0.9032941176470588, 0.8633333333333333)

### 6. tfidf + knn

In [38]:
pipe_6 = Pipeline([
    ('tf', TfidfVectorizer()),
    ('knn', KNeighborsClassifier())
])

In [39]:
params_6 = {
    'tf__stop_words': ['english'],
    'tf__max_features': [500],
    'tf__min_df': [1,2],
    'tf__max_df': [.7],
    'tf__ngram_range': [(1, 2),(1,3)]
}

In [40]:
model_6 = GridSearchCV(pipe_6, param_grid=params_6, cv=cv)

In [41]:
model_6.fit(X_train, y_train)


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf', TfidfVectorizer()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'tf__max_df': [0.7], 'tf__max_features': [500],
                         'tf__min_df': [1, 2],
                         'tf__ngram_range': [(1, 2), (1, 3)],
                         'tf__stop_words': ['english']})

In [42]:
model_6.best_score_, model_6.best_params_

(0.7930588235294118,
 {'tf__max_df': 0.7,
  'tf__max_features': 500,
  'tf__min_df': 1,
  'tf__ngram_range': (1, 3),
  'tf__stop_words': 'english'})

In [43]:
model_6.score(X_train, y_train), model_6.score(X_test, y_test)


(0.8522352941176471, 0.784)

### 7. stem_cvec + nb

In [44]:
pipe_7 = Pipeline([
    ('cvec', StemmedCountVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])

In [45]:
params_7 = {
    'cvec__max_features': [None, 1000, 2000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [.7, .8],
    'cvec__ngram_range': [(1,1),(1, 2),(2,2)]
}

In [46]:
model_7 = GridSearchCV(pipe_7, param_grid=params_7, cv=cv)

In [47]:
model_7.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        StemmedCountVectorizer(stop_words='english')),
                                       ('nb', MultinomialNB())]),
             param_grid={'cvec__max_df': [0.7, 0.8],
                         'cvec__max_features': [None, 1000, 2000],
                         'cvec__min_df': [1, 2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2), (2, 2)]})

In [48]:
model_7.best_score_, model_7.best_params_

(0.8965882352941176,
 {'cvec__max_df': 0.7,
  'cvec__max_features': None,
  'cvec__min_df': 1,
  'cvec__ngram_range': (1, 2)})

In [49]:
model_7.score(X_train, y_train), model_7.score(X_test, y_test)

(0.9731764705882353, 0.904)

### 8. cvec + adaboost

In [50]:
pipe_8 = Pipeline([
    ('cvec', CountVectorizer(preprocessor=my_preprocessor)),
    ('boost', AdaBoostClassifier())
])

In [51]:
params_8 = {
    'cvec__stop_words': ['english'],
    'cvec__max_features': [4000, 6000, 8000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [.8, .9],
    'cvec__ngram_range': [(1,1),(1, 2)]
}

In [52]:
model_8 = GridSearchCV(pipe_8, param_grid=params_8, cv=cv)

In [53]:
model_8.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        CountVectorizer(preprocessor=<function my_preprocessor at 0x7f8bc3bc6310>)),
                                       ('boost', AdaBoostClassifier())]),
             param_grid={'cvec__max_df': [0.8, 0.9],
                         'cvec__max_features': [4000, 6000, 8000],
                         'cvec__min_df': [2, 3, 4],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'cvec__stop_words': ['english']})

In [54]:
model_8.best_score_, model_8.best_params_

(0.8641176470588234,
 {'cvec__max_df': 0.8,
  'cvec__max_features': 4000,
  'cvec__min_df': 3,
  'cvec__ngram_range': (1, 1),
  'cvec__stop_words': 'english'})

In [55]:
model_8.score(X_train, y_train), model_8.score(X_test, y_test)

(0.8682352941176471, 0.8546666666666667)

### 9. stack
> Compare the test scores of models 1-8 first

In [56]:
total_test_score =[]
total_test_score.append(model_1.score(X_test, y_test))
total_test_score.append(model_2.score(X_test, y_test))
total_test_score.append(model_3.score(X_test, y_test))
total_test_score.append(model_4.score(X_test, y_test))
total_test_score.append(model_5.score(X_test, y_test))
total_test_score.append(model_6.score(X_test, y_test))
total_test_score.append(model_7.score(X_test, y_test))
total_test_score.append(model_8.score(X_test, y_test))

for i in range(len(total_test_score)):
    print('model', i+1, '=', round(total_test_score[i], 4), '*'*int((total_test_score[i]-0.9)/0.001), i+1)


model 1 = 0.9113 *********** 1
model 2 = 0.894  2
model 3 = 0.896  3
model 4 = 0.908 ******** 4
model 5 = 0.8633  5
model 6 = 0.784  6
model 7 = 0.904 **** 7
model 8 = 0.8547  8


### Select the best 3 to stack
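
The pick of the three highest-scoring models can be reproduced from the printed test accuracies. This is a minimal sketch: the `test_scores` dictionary copies the values shown in the output above, and `top_k` is a hypothetical helper, not part of the notebook.

```python
# Test accuracies as printed above, keyed by model number
test_scores = {1: 0.9113, 2: 0.894, 3: 0.896, 4: 0.908,
               5: 0.8633, 6: 0.784, 7: 0.904, 8: 0.8547}

def top_k(scores, k=3):
    """Return the k keys with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

best_three = top_k(test_scores)
print(best_three)  # [1, 4, 7]
```

Ranking by each search's cross-validated `best_score_` instead would be one way to keep the test set out of model selection.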

In [57]:
# Use models 1, 4, 7 (the three highest test scores)
level_1 = [
    ('model_1', model_1.best_estimator_),
    ('model_4', model_4.best_estimator_),
    ('model_7', model_7.best_estimator_),
]
level_2 = LogisticRegression()

model_9 = StackingClassifier(estimators=level_1,
                             final_estimator=level_2)

In [58]:
model_9.fit(X_train, y_train)

StackingClassifier(estimators=[('model_1',
                                Pipeline(steps=[('cvec',
                                                 StemmedCountVectorizer(max_df=0.8,
                                                                        preprocessor=<function my_preprocessor at 0x7f8bc3bc6310>)),
                                                ('rf',
                                                 RandomForestClassifier(ccp_alpha=0))])),
                               ('model_4',
                                Pipeline(steps=[('cvec',
                                                 StemmedCountVectorizer(stop_words='english')),
                                                ('lr', LogisticRegression())])),
                               ('model_7',
                                Pipeline(steps=[('cvec',
                                                 StemmedCountVectorizer(max_df=0.7,
                                                                        ngram

In [59]:
model_9.score(X_train, y_train), model_9.score(X_test, y_test)

(0.9891764705882353, 0.9306666666666666)

___

#### Prepare function and data for evaluation

In [60]:
def get_score(fitted_model):
    y_preds = fitted_model.predict(X_test)
    cm = confusion_matrix(y_test, y_preds)
    tn, fp, fn, tp = cm.ravel()
    train_score = "%.3f" % fitted_model.score(X_train, y_train)
    test_score = "%.3f" % fitted_model.score(X_test, y_test)
    recall = "%.3f" % ((tp)/(tp+fn))
    f1 = "%.3f" % f1_score(y_test, y_preds)
    return [train_score, test_score, recall, f1, cm]
# got idea from https://stackoverflow.com/questions/19986662/rounding-a-number-in-python-but-keeping-ending-zeros

In [61]:
model = [model_1, model_2, model_3, model_4, model_5, model_6, model_7, model_8, model_9]
score_list = []
for i in range(len(model)):
    score_list.append(get_score(model[i]))


## Evaluation
### Summary Train/Test score

In [62]:
for i in range(len(model)):
    print('model', i+1, '| train_score =', score_list[i][0] , '| test_score =', score_list[i][1])

model 1 | train_score = 0.998 | test_score = 0.911
model 2 | train_score = 0.945 | test_score = 0.894
model 3 | train_score = 0.960 | test_score = 0.896
model 4 | train_score = 0.964 | test_score = 0.908
model 5 | train_score = 0.903 | test_score = 0.863
model 6 | train_score = 0.852 | test_score = 0.784
model 7 | train_score = 0.973 | test_score = 0.904
model 8 | train_score = 0.868 | test_score = 0.855
model 9 | train_score = 0.989 | test_score = 0.931
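
A quick way to read this table is to look at the gap between train and test accuracy, since a large gap suggests overfitting. The sketch below recomputes that gap from the printed values; the `scores` dictionary simply copies the table and `gaps` is an illustrative name.

```python
# (train_score, test_score) per model, copied from the summary above
scores = {
    1: (0.998, 0.911), 2: (0.945, 0.894), 3: (0.960, 0.896),
    4: (0.964, 0.908), 5: (0.903, 0.863), 6: (0.852, 0.784),
    7: (0.973, 0.904), 8: (0.868, 0.855), 9: (0.989, 0.931),
}

# Train-minus-test accuracy gap, rounded to three decimals
gaps = {m: round(train - test, 3) for m, (train, test) in scores.items()}

for m, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print('model', m, '| gap =', "%.3f" % gap)
```

On these numbers model 1 has the widest gap and model 8 the narrowest.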


### Summary Recall (Sensitivity) / Specificity / F1

In [63]:
for i in range(len(model)):
    cm = score_list[i][4]
    tn, fp, fn, tp = cm.ravel()
    # recall (sensitivity) = tp/(tp+fn); specificity = tn/(tn+fp)
    print('model', i+1,f'| {tn=} |', f'{fp=}\t| ', f'{fn=}\t|', f'{tp=} |', f'recall={"%.2f" % ((tp)/(tp+fn))} |', f'specificity={"%.2f" % ((tn)/(tn+fp))} |', f'f1={score_list[i][3]} |', f'({i+1})')

model 1 | tn=649 | fp=101	|  fn=32	| tp=718 | recall=0.96 | specificity=0.87 | f1=0.915 | (1)
model 2 | tn=630 | fp=120	|  fn=39	| tp=711 | recall=0.95 | specificity=0.84 | f1=0.899 | (2)
model 3 | tn=623 | fp=127	|  fn=29	| tp=721 | recall=0.96 | specificity=0.83 | f1=0.902 | (3)
model 4 | tn=648 | fp=102	|  fn=36	| tp=714 | recall=0.95 | specificity=0.86 | f1=0.912 | (4)
model 5 | tn=581 | fp=169	|  fn=36	| tp=714 | recall=0.95 | specificity=0.77 | f1=0.874 | (5)
model 6 | tn=549 | fp=201	|  fn=123	| tp=627 | recall=0.84 | specificity=0.73 | f1=0.795 | (6)
model 7 | tn=696 | fp=54	|  fn=90	| tp=660 | recall=0.88 | specificity=0.93 | f1=0.902 | (7)
model 8 | tn=548 | fp=202	|  fn=16	| tp=734 | recall=0.98 | specificity=0.73 | f1=0.871 | (8)
model 9 | tn=680 | fp=70	|  fn=34	| tp=716 | recall=0.95 | specificity=0.91 | f1=0.932 | (9)
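
Every column in this table follows from the four confusion-matrix counts. As a worked check, the sketch below recomputes the metrics for model 9 from its printed counts. Note that `tp/(tp+fn)` is recall (also called sensitivity), while `tn/(tn+fp)` is specificity. `rates_from_counts` is a hypothetical helper, not part of the notebook.

```python
def rates_from_counts(tn, fp, fn, tp):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity, true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        'recall': recall,
        'specificity': specificity,
        'precision': precision,
        'f1': f1,
        'accuracy': accuracy,
    }

# Counts for model 9 as printed in the table above
model_9_rates = rates_from_counts(tn=680, fp=70, fn=34, tp=716)
for name, value in model_9_rates.items():
    print(f'{name}: {value:.3f}')
```

The recomputed F1 (0.932) and accuracy (0.931) agree with the values reported for model 9.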


### Best predictor: model 9, a stack of the following 3 models
- Model 1 : StemmedCountVectorizer + RandomForestClassifier
- Model 4 : StemmedCountVectorizer + LogisticRegression
- Model 7 : StemmedCountVectorizer + MultinomialNB

### Summary
- The stacked model (model 9) will be sent to the image processing team
- It will be sent along with the high-recall model 8 (recall 0.98) and the high-specificity model 7 (true-negative rate 0.93)
- The score may improve with more data, different text-preprocessing techniques, and further hyperparameter tuning