# Project 3: Web APIs & Classification

General Assembly DSI 19 Project 3 Adrian Teng 

# Executive Summary

This project is split into three parts:
- API & EDA (01_api_eda)
- Processing & Modeling (02_processing_modeling)
- Conclusion (03_conclusion)

In the third notebook, 03_conclusion, the selected models were tested against unseen data and further evaluated using the distribution of true positive, false positive, false negative and true negative predictions. Additional metrics (accuracy, sensitivity, specificity and precision) were also computed.

Feature analysis was carried out to support the problem statement in the conclusion.

# Content

- Model 1: CountVectorizer + MultinomialNB
- Model 2: TfidfVectorizer + Logistic Regression
- Feature Analysis
- Conclusion


In [169]:

# library imports
import requests
import time
import pandas as pd
import numpy as np
import ast
import regex as re
from tqdm import tqdm
import collections


# preprocessing and modeling imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [170]:
r = 42

In [171]:
df = pd.read_csv('../datasets/tifucofns.csv')
df_pre = pd.read_csv('../datasets/df_pre.csv')
X_train = pd.read_csv('../datasets/X_train.csv', index_col = 0)
X_test = pd.read_csv('../datasets/X_test.csv', index_col = 0)
y_train = pd.read_csv('../datasets/y_train.csv', index_col = 0)
y_test = pd.read_csv('../datasets/y_test.csv', index_col = 0)

In [172]:
df_pre.isnull().sum()

post_stem    1
post_lem     1
tifu         0
dtype: int64

In [173]:
df_pre.dropna(inplace = True) #remove nan values

### Model 1: CountVectorizer MultinomialNB

In [174]:
m1_steps = [('m1_cv',CountVectorizer(stop_words='english', ngram_range=(1,1))),
           ('m1_mnb',MultinomialNB())]

In [175]:
pipe_1 = Pipeline(m1_steps)

In [176]:
pipe_1.fit(X_train.post_lem, y_train.tifu)

Pipeline(steps=[('m1_cv', CountVectorizer(stop_words='english')),
                ('m1_mnb', MultinomialNB())])

In [180]:
#train score
pipe_1.score(X_train.post_lem, y_train.tifu)

0.9711711711711711

In [178]:
#test score
pipe_1.score(X_test.post_lem, y_test.tifu)

0.8652291105121294

In [179]:
#unseen test score
pipe_1.score(df_pre.post_lem, df_pre.tifu)

0.9446320054017556

In [104]:
tn, fp, fn, tp = confusion_matrix(df_pre.tifu, pipe_1.predict(df_pre.post_lem)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))


True Negatives: 839
False Positives: 47
False Negatives: 35
True Positives: 560

Accuracy:  0.9446320054017556
Sensitivity:  0.9411764705882353
Specificity:  0.9469525959367946
Precision:  0.9225700164744646
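The four counts above are enough to reproduce every reported metric, so the repeated print statements could be folded into one helper. The following is a minimal sketch, assuming a hypothetical function name `classification_metrics` that is not part of the original notebook, checked here against the Model 1 counts:

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, sensitivity, specificity and precision from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),   # share of r/tifu posts (class 1) that were caught
        "specificity": tn / (tn + fp),   # share of r/confessions posts (class 0) that were caught
        "precision": tp / (tp + fp),     # share of predicted r/tifu posts that really are r/tifu
    }

# Model 1 counts copied from the output above
m1_metrics = classification_metrics(tn=839, fp=47, fn=35, tp=560)
for name, value in m1_metrics.items():
    print(f"{name}: {value:.4f}")
```

In the notebook itself, the same helper could be called as `classification_metrics(*confusion_matrix(y_true, y_pred).ravel())`, since `ravel()` returns the counts in the order `tn, fp, fn, tp` for a binary problem.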


In [165]:
m1 = pipe_1.named_steps['m1_mnb']
cv1 = pipe_1.named_steps['m1_cv']
# the vectorizer was already fitted inside the pipeline, so transform is enough
cv1.transform(X_train.post_lem)

<1110x13829 sparse matrix of type '<class 'numpy.int64'>'
	with 94359 stored elements in Compressed Sparse Row format>

### Model 2: TfidfVectorizer Logistic Regression

In [181]:
m2_steps = [('m2_tf',TfidfVectorizer(stop_words='english', ngram_range=(1,1), max_features=30000)),
            ('m2_ss',StandardScaler(with_mean=False)),
            ('m2_lr',LogisticRegression())]

In [182]:
pipe_2 = Pipeline(m2_steps)

In [183]:
pipe_2.fit(X_train.post_lem, y_train.tifu)

Pipeline(steps=[('m2_tf',
                 TfidfVectorizer(max_features=30000, stop_words='english')),
                ('m2_ss', StandardScaler(with_mean=False)),
                ('m2_lr', LogisticRegression())])

In [184]:
#train score
pipe_2.score(X_train.post_lem, y_train.tifu)

1.0

In [185]:
#test score
pipe_2.score(X_test.post_lem, y_test.tifu)

0.8867924528301887

In [186]:
#unseen data score
pipe_2.score(df_pre.post_lem, df_pre.tifu)

0.9716407832545577

In [187]:
tn, fp, fn, tp = confusion_matrix(df_pre.tifu, pipe_2.predict(df_pre.post_lem)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 871
False Positives: 15
False Negatives: 27
True Positives: 568

Accuracy:  0.9716407832545577
Sensitivity:  0.9546218487394958
Specificity:  0.9830699774266366
Precision:  0.9742710120068611
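Set against the Model 1 result earlier in this notebook, these counts show where the improvement comes from: false positives fall from 47 to 15 and false negatives from 35 to 27. A short sketch that tallies the total misclassifications per model, using only the counts printed in the two outputs (the dictionary labels are illustrative):

```python
# Confusion-matrix counts copied from the two outputs in this notebook
counts = {
    "Model 1 (CV + MNB)": {"tn": 839, "fp": 47, "fn": 35, "tp": 560},
    "Model 2 (TF-IDF + LR)": {"tn": 871, "fp": 15, "fn": 27, "tp": 568},
}

errors = {}
for model, c in counts.items():
    total = sum(c.values())              # number of posts scored
    errors[model] = c["fp"] + c["fn"]    # every misclassified post
    print(f"{model}: {errors[model]} errors out of {total} posts "
          f"({errors[model] / total:.2%})")
```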


In [188]:
m2 = pipe_2.named_steps['m2_lr']

In [189]:
tf2 = pipe_2.named_steps['m2_tf']

In [190]:
# the vectorizer was already fitted inside the pipeline, so transform is enough
tf2.transform(X_train.post_lem)

<1110x13829 sparse matrix of type '<class 'numpy.float64'>'
	with 94359 stored elements in Compressed Sparse Row format>

In [191]:
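# MultinomialNB.coef_ was removed in scikit-learn 1.1; for a binary problem it held
# the class-1 log probabilities, i.e. feature_log_prob_[1]
# (on scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out())
m1_df = pd.DataFrame(m1.feature_log_prob_[1], index=cv1.get_feature_names_out(), columns=['coef'])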

In [197]:
m1_df.coef.sort_values(ascending=False).head(20)

just       -4.704559
like       -4.996557
time       -5.224905
didn       -5.273912
got        -5.297008
dr         -5.486834
tl         -5.501059
went       -5.547989
really     -5.569234
know       -5.581580
said       -5.739360
going      -5.761420
ve         -5.795450
thought    -5.810954
day        -5.822742
don        -5.842703
happened   -5.863070
car        -5.913714
today      -5.931181
did        -5.990155
Name: coef, dtype: float64

In [193]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
m2_df = pd.DataFrame(m2.coef_.T, index=tf2.get_feature_names_out(), columns=['coef'])

In [195]:
m2_df.coef.sort_values(ascending=False).head(20)

tl              0.355900
dr              0.342180
tldr            0.173506
hour            0.105832
happened        0.096826
today           0.095596
minutes         0.092480
obligatory      0.089691
forgot          0.084122
decided         0.081912
morning         0.074589
went            0.073344
sitting         0.069247
prank           0.069129
forgetting      0.068725
got             0.068466
immediately     0.067605
sleep           0.067374
accidentally    0.066832
starts          0.066280
Name: coef, dtype: float64

### Feature Analysis

In [200]:
#function to print the n most frequent words of each class
def important_features(vectorizer,classifier,n=20):
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()
    # feature_count_ holds the raw word counts per class, so this ranks by frequency
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    print("Important words for r/confessions\n")
    for count, feat in topn_class1:
        print(class_labels[0], count, feat)
    print("-----------------------------------------\n")
    print("Important words for r/tifu\n")
    for count, feat in topn_class2:
        print(class_labels[1], count, feat)

In [201]:
#print the top 20 words of each subreddit
important_features(pipe_1.named_steps['m1_cv'], pipe_1.named_steps['m1_mnb'], 20)

Important words for r/confessions

0 864.0 just
0 775.0 like
0 589.0 don
0 508.0 know
0 476.0 feel
0 412.0 want
0 410.0 really
0 402.0 ve
0 398.0 time
0 361.0 people
0 296.0 life
0 295.0 think
0 279.0 didn
0 265.0 got
0 237.0 friends
0 233.0 years
0 215.0 day
0 207.0 did
0 206.0 things
0 201.0 said
-----------------------------------------

Important words for r/tifu

1 773.0 just
1 577.0 like
1 459.0 time
1 437.0 didn
1 427.0 got
1 353.0 dr
1 348.0 tl
1 332.0 went
1 325.0 really
1 321.0 know
1 274.0 said
1 268.0 going
1 259.0 ve
1 255.0 thought
1 252.0 day
1 247.0 don
1 242.0 happened
1 230.0 car
1 226.0 today
1 213.0 did
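The overlap between the two top-20 lists can be counted directly rather than estimated by eye. A small check, using the two word lists copied from the output above (the variable names are illustrative):

```python
# Top-20 words per subreddit, copied from the printed output
confessions_top20 = [
    "just", "like", "don", "know", "feel", "want", "really", "ve", "time", "people",
    "life", "think", "didn", "got", "friends", "years", "day", "did", "things", "said",
]
tifu_top20 = [
    "just", "like", "time", "didn", "got", "dr", "tl", "went", "really", "know",
    "said", "going", "ve", "thought", "day", "don", "happened", "car", "today", "did",
]

# words that appear in both lists
shared = sorted(set(confessions_top20) & set(tifu_top20))
print(len(shared), "shared words:", shared)
print(f"overlap: {len(shared) / len(tifu_top20):.0%}")
```

The words unique to each list ("feel", "want", "people", "life" for r/confessions; "tl", "dr", "went", "happened", "today" for r/tifu) are the ones most likely to help a classifier tell the subreddits apart.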


## Conclusion

- 1. Model 2 (TF-IDF Logistic Regression) has the highest accuracy on both the test set and the unseen data compared with the other models, using accuracy as the success metric for separating text from two similar subreddits.

- 2. The two subreddits are very similar: in the feature analysis, 12 of the top 20 words (60%) appear in both lists.

- 3. Beyond the accuracy score, the sensitivity, specificity and precision of Model 2 are all higher than those of Model 1 (CountVectorizer MultinomialNB).

- 4. Hence, when a post would fit either subreddit, a Reddit user can consider posting in 'TIFU' instead of 'Confessions' to attract more viewers.