# Project 3: Web APIs & Classification

#### By Bhupesh Kumar

### Step 2-3 : Data Cleaning and Modeling 

In [1]:
import pandas as pd
import numpy as np 

import re
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [4]:
dataset = pd.read_csv('raw_title_dataset')
dataset.head()

Unnamed: 0.1,Unnamed: 0,post_title,rank
0,0,Weekly Tech Support Thread - [July 09],0
1,1,Important Announcement + Mod Applications,0
2,2,Well Why Not 😀,0
3,3,Paid apps actually worth it in 2019?,0
4,4,My iPhone collection as of yesterday! Just mis...,0


In [5]:
del dataset['Unnamed: 0']

In [7]:
dataset.head()

Unnamed: 0,post_title,rank
0,Weekly Tech Support Thread - [July 09],0
1,Important Announcement + Mod Applications,0
2,Well Why Not 😀,0
3,Paid apps actually worth it in 2019?,0
4,My iPhone collection as of yesterday! Just mis...,0


In [8]:
dataset.tail()

Unnamed: 0,post_title,rank
1802,3 XL Asurion replacement: should I expect the ...,1
1803,The Pixel 3a disappointing design decisions,1
1804,Pixel 3a screen protector,1
1805,Does anyone know how I can sync my pictures fr...,1
1806,do I have to give up Pixel XL original photos ...,1


In [9]:
dataset.shape

(1807, 2)

#### Cleaning the text data

In [10]:
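def clean_posts(posts):
    """Return a post title lowercased, with every non-letter replaced by a space."""
    # Replace anything that is not an ASCII letter (digits, punctuation, emoji) with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", posts)

    # Convert to lower case, split into individual words, and rejoin with single spaces.
    words = letters_only.lower().split()
    return " ".join(words)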

In [11]:
dataset['post_title'] = dataset['post_title'].apply(clean_posts)

In [13]:
dataset.head()

Unnamed: 0,post_title,rank
0,weekly tech support thread july,0
1,important announcement mod applications,0
2,well why not,0
3,paid apps actually worth it in,0
4,my iphone collection as of yesterday just miss...,0


In [15]:
dataset.tail()

Unnamed: 0,post_title,rank
1802,xl asurion replacement should i expect the ver...,1
1803,the pixel a disappointing design decisions,1
1804,pixel a screen protector,1
1805,does anyone know how i can sync my pictures fr...,1
1806,do i have to give up pixel xl original photos ...,1
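
The cleaning step keeps only letters, so digits and emoji disappear along with punctuation (which is why "Pixel 3a" shows up as "pixel a" above). A minimal sketch of that behaviour on three titles from the dataset; `clean_title` is a stand-in name for `clean_posts`, and it assumes the standard-library `re` module behaves the same as `regex` for this pattern:

```python
import re

def clean_title(title):
    """Mirror of clean_posts: keep ASCII letters only, lowercase, collapse whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", title)
    return " ".join(letters_only.lower().split())

samples = [
    "Paid apps actually worth it in 2019?",
    "Pixel 3a screen protector",
    "Well Why Not 😀",
]
cleaned = [clean_title(t) for t in samples]

# Show each title before and after cleaning.
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Dropping digits means model names such as "3a" or "XR" lose part of their identity, which may be worth revisiting if those tokens turn out to be informative.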


In [16]:
dataset.isnull().sum()

post_title    0
rank          0
dtype: int64

### Baseline Modeling 

In [17]:
X = dataset['post_title']
y = dataset['rank']

In [18]:
y.value_counts(normalize =True)

1    0.501384
0    0.498616
Name: rank, dtype: float64

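The baseline score is almost 50/50, which is good because it means our data is balanced. Always predicting the majority class would be right about 50.1% of the time, so that is the accuracy any model has to beat.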
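
To make the baseline concrete, here is a small sketch of the majority-class calculation. The counts (906 ones and 901 zeros) are an assumption, derived from the proportions printed above and the 1807 rows; `majority_baseline` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Assumed class counts, reconstructed from the value_counts proportions.
labels = [1] * 906 + [0] * 901
baseline = majority_baseline(labels)
print(round(baseline, 6))
```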

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                  random_state =42,
                                                 stratify = y)

### Modeling 

I am going to use two models to train and test the data: Logistic Regression and Multinomial Naive Bayes.

I will also run CountVectorizer three ways: with no stop words, with English stop words, and with custom stop words.


##### No stop words 

In [27]:
# Instantiate a CountVectorizer
count_vec = CountVectorizer()

# Fit the vectorizer on the training titles, then transform both sets into count matrices
X_train_cv = count_vec.fit_transform(X_train)
X_test_cv = count_vec.transform(X_test)

In [28]:
X_train.shape

(1355,)

In [29]:
X_test.shape

(452,)

In [30]:
X_train_cv.shape

(1355, 2238)

In [31]:
X_test_cv.shape

(452, 2238)

In [32]:
vocab = count_vec.get_feature_names()
print(vocab)



In [33]:
type(vocab)

list

The shape of the training and testing data changes because CountVectorizer converts the text into a matrix with one row per title and one column per vocabulary word (2238 here), so the models can be trained on those features.
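
The text-to-matrix step can be sketched in plain Python. This is only a rough approximation of CountVectorizer with default settings (lowercasing, tokens of two or more word characters, alphabetical vocabulary); the real class returns a sparse matrix, and `bag_of_words` and the toy titles are made up for illustration:

```python
import re

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Tokens of two or more word characters, matching CountVectorizer's default pattern.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}

    # One row per document, one column per vocabulary word.
    matrix = [[0] * len(vocab) for _ in docs]
    for row, toks in zip(matrix, tokenized):
        for tok in toks:
            row[index[tok]] += 1
    return vocab, matrix

toy_titles = ["pixel a screen protector", "my iphone screen cracked"]
vocab, matrix = bag_of_words(toy_titles)
print(vocab)
for row in matrix:
    print(row)
```

Note that the single-letter token "a" never becomes a feature, so even the "no stop words" run silently drops one-character words.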

#### Logistic Regression model 1

In [42]:
# Initialize the model, fit it, and predict
lr = LogisticRegression()
lr.fit(X_train_cv,y_train) 
y_pred_lr = lr.predict(X_test_cv)
metrics.accuracy_score(y_test,y_pred_lr)

0.8650442477876106

In [47]:
lr.score(X_train_cv, y_train) * 100  # LR training score

98.67158671586715

In [48]:
lr.score(X_test_cv,y_test) * 100 # LR testing score 

86.50442477876106

#### Confusion Matrix 
A confusion matrix for two classes is a 2 x 2 table that shows the actual labels (rows) against the predicted labels (columns).

In [45]:
metrics.confusion_matrix(y_test, y_pred_lr)

array([[200,  25],
       [ 36, 191]])

In [46]:
cf_m1 = metrics.confusion_matrix(y_test, y_pred_lr)
cf_lr = pd.DataFrame(data=cf_m1, columns=['Predicted Iphone (0)', 
                                       'Predicted Google Pixel(1)'], 
                     index=['Actual Iphone(0)', 'Actual Google Pixel (1)'])
cf_lr

Unnamed: 0,Predicted Iphone (0),Predicted Google Pixel(1)
Actual Iphone(0),200,25
Actual Google Pixel (1),36,191


The logistic regression model has an accuracy of 98.67 percent on the training data and 86.50 percent on the testing data.
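
The testing accuracy can be read straight off the confusion matrix. The sketch below uses the four counts shown above and treats Google Pixel (1) as the positive class, which is an assumption made here for the precision and recall figures:

```python
# Rows are actual labels, columns are predicted labels (scikit-learn's layout).
cm = [[200, 25],
      [36, 191]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total        # share of all test titles classified correctly
precision_pixel = tp / (tp + fp)    # of titles predicted Pixel, how many were Pixel
recall_pixel = tp / (tp + fn)       # of actual Pixel titles, how many were found

print(f"accuracy={accuracy:.4f} precision={precision_pixel:.4f} recall={recall_pixel:.4f}")
```

Here the model misses more Pixel titles (36) than iPhone titles (25), which accuracy alone does not show.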



#### Bayes Classifier ( Multinomial ) model 1

In [49]:
# Initialize the model, fit it, and predict

nb = MultinomialNB()
nb.fit(X_train_cv,y_train)
y_pred_nb = nb.predict(X_test_cv)
metrics.accuracy_score(y_test,y_pred_nb)

0.8849557522123894

In [52]:
nb.score(X_train_cv,y_train) * 100 # NB training Score

95.64575645756457

In [53]:
nb.score(X_test_cv,y_test) * 100 # NB testing

88.49557522123894

In [54]:
metrics.confusion_matrix(y_test, y_pred_nb)

array([[202,  23],
       [ 29, 198]])

In [56]:
cf_m2 = metrics.confusion_matrix(y_test, y_pred_nb)
cf_nb = pd.DataFrame(data=cf_m2, columns=['Predicted Iphone (0)', 
                                       'Predicted Google Pixel(1)'], 
                     index=['Actual Iphone(0)', 'Actual Google Pixel (1)'])
cf_nb

Unnamed: 0,Predicted Iphone (0),Predicted Google Pixel(1)
Actual Iphone(0),202,23
Actual Google Pixel (1),29,198


The Multinomial model has an accuracy of 95.65 percent on the training data and 88.50 percent on the testing data.

### Stop Words (english)

I am now going to experiment with stop words. I will pass the English stop words to CountVectorizer and see whether my results are any different.

In [65]:
english_stopwords = stopwords.words('english')
count_vec_english = CountVectorizer(stop_words= english_stopwords)

X_train_cv_e = count_vec_english.fit_transform(X_train)
X_test_cv_e = count_vec_english.transform(X_test)

In [66]:
X_train_cv_e.shape

(1355, 2116)

In [67]:
X_test_cv_e.shape

(452, 2116)

Using English stop words changed the shape of the data: the number of features decreased from 2238 to 2116.
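
As a rough illustration of what stop-word removal does to a single title, here is a plain-Python sketch. `toy_stopwords` is a small hand-picked subset chosen for this example (the notebook uses NLTK's full English list), and `drop_stopwords` is a hypothetical helper:

```python
# Hand-picked subset for illustration only; NLTK's English list is much longer.
toy_stopwords = {"it", "in", "my", "as", "of", "the", "i", "do", "to"}

def drop_stopwords(title, stop_set):
    """Remove every word of the title that appears in stop_set."""
    return " ".join(word for word in title.split() if word not in stop_set)

title = "paid apps actually worth it in"
filtered = drop_stopwords(title, toy_stopwords)
print(filtered)

# Feature counts reported above for the two vectorizers.
removed_features = 2238 - 2116
print(removed_features)
```

Only 122 columns disappear, so the matrix is barely smaller, but those columns are among the most frequent words in the titles.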

#### Logistic Regression model 2

In [74]:
# Initialize the model, fit it, and predict

lr = LogisticRegression()
lr.fit(X_train_cv_e,y_train)

y_pred_lr_e = lr.predict(X_test_cv_e)
metrics.accuracy_score(y_test,y_pred_lr_e)

0.8805309734513275

In [75]:
lr.score(X_train_cv_e, y_train) * 100  # LR training score

98.45018450184502

In [76]:
lr.score(X_test_cv_e,y_test) * 100 # LR testing score 

88.05309734513274

In [77]:
metrics.confusion_matrix(y_test, y_pred_lr_e)

array([[207,  18],
       [ 36, 191]])

In [78]:
cf_m3 = metrics.confusion_matrix(y_test, y_pred_lr_e)
cf_lr_2 = pd.DataFrame(data=cf_m3, columns=['Predicted Iphone (0)', 
                                       'Predicted Google Pixel(1)'], 
                     index=['Actual Iphone(0)', 'Actual Google Pixel (1)'])
cf_lr_2

Unnamed: 0,Predicted Iphone (0),Predicted Google Pixel(1)
Actual Iphone(0),207,18
Actual Google Pixel (1),36,191


The logistic regression model with English stop words has an accuracy of 98.45 percent on the training data and 88.05 percent on the testing data.

#### Bayes Classifier ( Multinomial ) model 2

In [86]:
# Initialize the model, fit it, and predict

nb = MultinomialNB()
nb.fit(X_train_cv_e,y_train)

y_pred_nb_e = nb.predict(X_test_cv_e)
metrics.accuracy_score(y_test,y_pred_nb_e)

0.8783185840707964

In [87]:
nb.score(X_train_cv_e,y_train) * 100 # NB training Score

95.64575645756457

In [88]:
nb.score(X_test_cv_e,y_test) * 100 # Nb testing

87.83185840707965

In [89]:
metrics.confusion_matrix(y_test, y_pred_nb_e)

array([[196,  29],
       [ 26, 201]])

In [90]:
cf_m4 = metrics.confusion_matrix(y_test, y_pred_nb_e)
cf_nb_2 = pd.DataFrame(data=cf_m4, columns=['Predicted Iphone (0)', 
                                       'Predicted Google Pixel(1)'], 
                     index=['Actual Iphone(0)', 'Actual Google Pixel (1)'])
cf_nb_2

Unnamed: 0,Predicted Iphone (0),Predicted Google Pixel(1)
Actual Iphone(0),196,29
Actual Google Pixel (1),26,201


The Multinomial model with English stop words has an accuracy of 95.65 percent on the training data and 87.83 percent on the testing data.

### Custom Stop words 

For the custom stop words, I am going to keep the English stop words and add three more. The three I am adding are mainly the subreddit names: iphone, google, pixel.

In [92]:
custom_stopwords = stopwords.words('english')
custom_stopwords.extend(['iphone','google','pixel'])
count_vec_custom = CountVectorizer(stop_words= custom_stopwords)

X_train_cv_c = count_vec_custom.fit_transform(X_train)
X_test_cv_c = count_vec_custom.transform(X_test)

In [95]:
X_train_cv_c.shape

(1355, 2113)

In [96]:
X_test_cv_c.shape

(452, 2113)

#### Logistic Regression model 3

In [98]:
# Initialize the model, fit it, and predict

lr = LogisticRegression()
lr.fit(X_train_cv_c,y_train)

y_pred_lr_c = lr.predict(X_test_cv_c)
metrics.accuracy_score(y_test,y_pred_lr_c)

0.7522123893805309

In [99]:
lr.score(X_train_cv_c, y_train) * 100  # LR training score

96.38376383763838

In [100]:
lr.score(X_test_cv_c,y_test) * 100 # LR testing score 

75.22123893805309

In [101]:
metrics.confusion_matrix(y_test, y_pred_lr_c)

array([[166,  59],
       [ 53, 174]])

In [102]:
cf_m5 = metrics.confusion_matrix(y_test, y_pred_lr_c)
cf_lr_3 = pd.DataFrame(data=cf_m5, columns=['Predicted Iphone (0)', 
                                       'Predicted Google Pixel(1)'], 
                     index=['Actual Iphone(0)', 'Actual Google Pixel (1)'])
cf_lr_3

Unnamed: 0,Predicted Iphone (0),Predicted Google Pixel(1)
Actual Iphone(0),166,59
Actual Google Pixel (1),53,174


The logistic regression model with custom stop words has an accuracy of 96.38 percent on the training data and 75.22 percent on the testing data.

#### Bayes Classifier ( Multinomial ) model 3

In [104]:
# Initialize the model, fit it, and predict

nb = MultinomialNB()
nb.fit(X_train_cv_c,y_train)

y_pred_nb_c = nb.predict(X_test_cv_c)
metrics.accuracy_score(y_test,y_pred_nb_c)

0.7676991150442478

In [105]:
nb.score(X_train_cv_c,y_train) * 100 # NB training Score

92.69372693726937

In [106]:
nb.score(X_test_cv_c,y_test) * 100 # NB testing score 

76.76991150442478

In [107]:
metrics.confusion_matrix(y_test, y_pred_nb_c)

array([[172,  53],
       [ 52, 175]])

In [109]:
cf_m6 = metrics.confusion_matrix(y_test, y_pred_nb_c)
cf_nb_3 = pd.DataFrame(data=cf_m6, columns=['Predicted Iphone (0)', 
                                       'Predicted Google Pixel(1)'], 
                     index=['Actual Iphone(0)', 'Actual Google Pixel (1)'])
cf_nb_3

Unnamed: 0,Predicted Iphone (0),Predicted Google Pixel(1)
Actual Iphone(0),172,53
Actual Google Pixel (1),52,175


### Conclusion 

After comparing both models with different stop words, we can see a large difference in accuracy, especially with custom stop words: removing iphone, google, and pixel dropped testing accuracy from roughly 88 percent to 75-77 percent, which suggests those words carry much of the signal. The basic approach to an NLP problem is CCM: Collect, Clean, and Model. However, a lot more goes into it, as we saw with stop words and different models, and I hope this basic approach is helpful. A next step could be to try different models, such as Random Forest or K-Nearest Neighbors. CountVectorizer also has many parameters, and the best thing to do is experiment with them; I might have gotten better results with more time for that. As of now, all of the models are overfitting, because training accuracy is noticeably higher than testing accuracy in every case. The best model so far is Multinomial Naive Bayes with no stop words, at 88.50 percent testing accuracy; its gap between training and testing accuracy is about 7 percentage points.
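
The comparison in the conclusion can be reproduced from the scores printed in the cells above. This is a small bookkeeping sketch; the dictionary keys are labels invented here, and the numbers are copied from the notebook output:

```python
# (training accuracy %, testing accuracy %) for each model, as printed above.
results = {
    "LR, no stop words": (98.67158671586715, 86.50442477876106),
    "NB, no stop words": (95.64575645756457, 88.49557522123894),
    "LR, English stop words": (98.45018450184502, 88.05309734513274),
    "NB, English stop words": (95.64575645756457, 87.83185840707965),
    "LR, custom stop words": (96.38376383763838, 75.22123893805309),
    "NB, custom stop words": (92.69372693726937, 76.76991150442478),
}

# Gap between training and testing accuracy, a rough indicator of overfitting.
gaps = {name: train - test for name, (train, test) in results.items()}

# Model with the highest testing accuracy.
best = max(results, key=lambda name: results[name][1])

for name, (train, test) in results.items():
    print(f"{name:<24} train={train:6.2f} test={test:6.2f} gap={gaps[name]:5.2f}")
print("best on test:", best)
```

Every gap is positive, and the custom stop word runs show the widest gaps as well as the lowest testing scores.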