## Main Goals: 
<ol>
<li>Create a cleaned development dataset that can be used to complete the modeling step of this project
    <ul>
<li>Perform NLP preprocessing steps on the text</li>
<li>Split into testing and training datasets</li>
<li>Vectorize the dataset</li>
    </ul>
</li>
<li>Modeling: Build a <b>Negative Tweet Detector</b>
    <ul>
<li>Build and evaluate models</li>
<li>Compare models</li>
    </ul>
</li>
</ol>

### 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import nltk
import re
import string
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
import time
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

### 2. Load Data

In [2]:
df = pd.read_csv('data/cleaned_tweets.csv')
print(df.shape)
df.head()

(4982, 6)


Unnamed: 0,date,cleaned_tweet,polarity,sentiment,text_len,text_word_count
0,2021-05-12,right now we welcome competition just no...,0.543,positive,83,12
1,2021-05-12,hahaha unfollowed tile a company who was...,0.1,positive,104,17
2,2021-05-12,i was thinking it might be in corenfc but i ...,0.0,neutral,94,17
3,2021-05-12,this is super clever creating a new battery ...,0.187,positive,98,18
4,2021-05-12,any one be interested if i did an airtag give...,0.25,positive,52,10


### 3. Text Preprocessing

#### 3.1 NLP Preprocessing

In the previous step, we cleaned the tweet text after loading the dataset: punctuation was removed and all words were lowercased. A few more preprocessing steps are needed before the data can be fed to a model.

In [3]:
# Tokenization
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in df['cleaned_tweet']]

# Remove stop words
stopword = nltk.corpus.stopwords.words('english')
no_stops=[]
for i in all_tokens:
    new_no_stops = [t for t in i if t not in stopword]
    no_stops.append(new_no_stops)

# Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
# Lemmatize every token of every tweet into a new list of lists: lemmatized
# (a comprehension also handles tweets left with no tokens after stop word removal)
lemmatized = [[wordnet_lemmatizer.lemmatize(t) for t in tokens] for tokens in no_stops]

In [4]:
df['preprocessed'] = lemmatized
df.head()

Unnamed: 0,date,cleaned_tweet,polarity,sentiment,text_len,text_word_count,preprocessed
0,2021-05-12,right now we welcome competition just no...,0.543,positive,83,12,"[right, welcome, competition, apple, tile, air..."
1,2021-05-12,hahaha unfollowed tile a company who was...,0.1,positive,104,17,"[hahaha, unfollowed, tile, company, born, thri..."
2,2021-05-12,i was thinking it might be in corenfc but i ...,0.0,neutral,94,17,"[thinking, might, corenfc, seen, anything, spe..."
3,2021-05-12,this is super clever creating a new battery ...,0.187,positive,98,18,"[super, clever, creating, new, battery, backpl..."
4,2021-05-12,any one be interested if i did an airtag give...,0.25,positive,52,10,"[one, interested, airtag, giveaway]"


In [5]:
# Keep only the feature and target columns used for modeling
# (.copy() avoids a SettingWithCopyWarning when new columns are added later)
df = df[['preprocessed','sentiment']].copy()

#### 3.2 CountVectorizer: Vectorizing our dataset

In [6]:
# Label positive and neutral sentiment as 0, and label negative sentiment as 1
df['negative_sentiment'] = df.sentiment.map({'positive':0,'neutral':0,'negative':1})
y = df['negative_sentiment']

In [7]:
df.preprocessed = df.preprocessed.apply(lambda x: " ".join(x))

In [8]:
X_train, X_test, y_train, y_test = train_test_split(df['preprocessed'], y,test_size = 0.33,random_state = 33)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3337,)
(1645,)
(3337,)
(1645,)


In [9]:
count_vectorizer = CountVectorizer()

# learn training data vocabulary and use it to create a document-term matrix: count_train 
count_train = count_vectorizer.fit_transform(X_train)
# transform testing data (using fitted vocabulary) into a document-term matrix: count_test 
count_test = count_vectorizer.transform(X_test)
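
To make the difference between fitting and transforming concrete, here is a minimal pure-Python sketch of a bag-of-words vectorizer. It is an illustration only, not how `CountVectorizer` is implemented: the helper names `fit_vocabulary` and `transform` and the toy tweets are hypothetical, and tokens are split on whitespace rather than with scikit-learn's default token pattern.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct whitespace token in the training docs to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Build a dense document-term matrix; tokens outside the vocabulary are ignored."""
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        matrix.append(row)
    return matrix

# Hypothetical toy data standing in for X_train and X_test
train_docs = ["airtag lost key", "apple airtag airtag"]
test_docs = ["lost airtag wallet"]  # "wallet" was never seen during fitting

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)
print(vocab)
print(train_matrix)
print(test_matrix)
```

The vocabulary is learned from the training documents only, so a word that appears only in the test data gets no column. That is why the test set goes through `transform` and not `fit_transform`.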

In [10]:
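# Print the first 200 features of the count_vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(count_vectorizer.get_feature_names_out()[:200].tolist())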

['00', '000', '000apple', '03', '05', '07', '07115164', '07airtag', '08132321259', '09', '0mm', '0ver', '10', '100', '10brain', '11', '119', '12', '120', '120m', '127', '128', '13', '132', '1379', '14', '149', '15', '1500', '15k', '16', '19', '1978', '1986', '19999', '1k', '1mi', '1st', '20', '2001', '2018', '2019', '2020', '2021', '2022', '21', '216', '237', '24', '25', '2516', '279', '280', '2837472', '29', '2fa', '2nd', '30', '300', '301', '30am', '30k', '319', '328', '32mb', '33', '33k', '349', '35', '35k', '360', '3d', '3dprinting', '3mm', '3rd', '3v', '40', '400', '400ft', '449', '479', '482', '486', '490', '499', '4agze', '4am', '4ever', '4k', '50', '500', '50m', '52', '52832', '53pm', '54', '5g', '5k', '5th', '5x', '60', '60m', '62', '658', '699', '6ft', '6user', '72', '75key', '7999', '7th', '835', '877', '8970', '8m', '8mm', '8th', '8v', '90', '900', '90deg', '95', '987', '99', '99link', '9to5m', '9to5mac', 'a2f6', 'a52', 'aapl', 'ab', 'abccentralvic', 'abcmsh', 'abcwimmera',

We will inspect the vectors to see what they look like.

In [11]:
# Create the CountVectorizer DataFrame: count_df
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())
count_df.head()

Unnamed: 0,00,000,000apple,03,05,07,07115164,07airtag,08132321259,09,...,yup,zac,zarak,zdnet,zdnets,zee,zero,zip,zone,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# Find the most frequent words in the training set by sorting total counts in descending order
count = pd.DataFrame(count_df.sum())
countdf = count.sort_values(0,ascending=False).head(20)
countdf[0:11]

Unnamed: 0,0
airtag,3038
apple,1559
new,282
find,276
hacked,255
researcher,244
security,241
tracker,227
airtags,213
lost,182


Besides the hashtag keyword "airtag", the most common words were "apple" (the company behind AirTag), "new", "find", "hacked", "researcher", "security", "tracker", "airtags", "lost", and "already". 

### 4. Modeling

#### 4.1 Training and testing the "Negative Tweet Detector"

##### 4.1.1 Multinomial Naive Bayes classifier

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). We'll first train and test a Naive Bayes model using the CountVectorizer data.

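Only about 14% of the tweets belong to class 1 (negative) and 86% to class 0. Accuracy is a misleading metric on such an imbalanced dataset because of the accuracy paradox: a model that always predicts class 0 would already be about 86% accurate while detecting no negative tweets at all.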
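
The accuracy paradox can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical sample of 1,000 tweets with the 14% / 86% class balance quoted above and a trivial "model" that always predicts class 0. The counts are illustrative and not taken from the dataset.

```python
# Hypothetical sample of 1,000 tweets with the class balance quoted above
n_total = 1000
n_negative = 140                 # class 1 (14%)
n_other = n_total - n_negative   # class 0 (86%)

# Confusion-matrix cells for a trivial "model" that always predicts class 0
tp = 0            # negative tweets correctly flagged
fn = n_negative   # negative tweets missed
tn = n_other      # non-negative tweets correctly passed
fp = 0            # non-negative tweets wrongly flagged

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy looks respectable even though no negative tweet is ever detected, which is why the cells below report precision, recall, and F1 alongside accuracy.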

In [13]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
# Define scoring metrics
scoring = ['accuracy','precision', 'recall', 'f1']
# Use Cross-validation to evaluate the model
cv_results = cross_validate(nb_classifier, count_train, y_train, cv=5,scoring=scoring)

In [14]:
# Print the mean cross validation score for each metrics
for key in cv_results:
    print(key,': ', np.mean(cv_results[key]))

fit_time :  0.004010915756225586
score_time :  0.0030123710632324217
test_accuracy :  0.872341523848854
test_precision :  0.5432328786936461
test_recall :  0.5504232441088995
test_f1 :  0.545932682299735


Now I will **tune** the Naive Bayes classifier by trying out several alpha values to see which one yields the best performance. (The default alpha value for the Multinomial Naive Bayes classifier is 1.0.)
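
As background on what alpha controls: Multinomial Naive Bayes estimates the probability of a word given a class with additive smoothing, (N_yi + alpha) / (N_y + alpha * n), where N_yi is the count of the word in the class, N_y is the total word count of the class, and n is the vocabulary size. The sketch below evaluates that formula on made-up counts. The function name and the numbers are hypothetical.

```python
def smoothed_word_prob(word_count, class_total, vocab_size, alpha):
    """Additive (Lidstone) smoothing: (N_yi + alpha) / (N_y + alpha * n)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Made-up counts: a word never seen in a class that has 50 tokens, with a 20-word vocabulary
for alpha in (0.0, 0.1, 1.0):
    p = smoothed_word_prob(word_count=0, class_total=50, vocab_size=20, alpha=alpha)
    print(f"alpha={alpha}: P(word | class) = {p:.4f}")
```

A smaller alpha means less smoothing. With alpha equal to 0, a word that never occurred in a class gets probability zero, so very small values should be interpreted with care.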

In [15]:
# Create the list of alphas
alphas = np.arange(0,1,0.1)
# Define scoring metrics
scoring = ['accuracy','precision', 'recall', 'f1']

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Use Cross-validation to evaluate the model
    cv_results = cross_validate(nb_classifier, count_train, y_train, cv=5,scoring=scoring)
    for key in cv_results:
        print(key,': ', np.mean(cv_results[key]))    
    print()

Alpha:  0.0
fit_time :  0.0034763336181640623
score_time :  0.003389263153076172
test_accuracy :  0.8873223567856791
test_precision :  0.5976697820733202
test_recall :  0.5952871196522536
test_f1 :  0.5962756384752638

Alpha:  0.1
fit_time :  0.0015959739685058594
score_time :  0.002991914749145508
test_accuracy :  0.8603506629918574
test_precision :  0.5005561849111642
test_recall :  0.6959505833905284
test_f1 :  0.5817408806724007

Alpha:  0.2
fit_time :  0.001596212387084961
score_time :  0.0027923107147216795
test_accuracy :  0.8546566537090735
test_precision :  0.4864664449206434
test_recall :  0.695973461450469
test_f1 :  0.5722637412984226

Alpha:  0.30000000000000004
fit_time :  0.0015958309173583984
score_time :  0.0031913280487060546
test_accuracy :  0.8531574033342609
test_precision :  0.4827224771305293
test_recall :  0.6767101349805535
test_f1 :  0.5627636869386878

Alpha:  0.4
fit_time :  0.0013968944549560547
score_time :  0.003191089630126953
test_accuracy :  0.85765470
[... output truncated ...]

Alpha:  0.6000000000000001
fit_time :  0.0017955780029296875
score_time :  0.0027923583984375
test_accuracy :  0.8645490129186904
test_precision :  0.5117885789662385
test_recall :  0.6231983527796843
test_f1 :  0.5616363697037045

Alpha:  0.7000000000000001
fit_time :  0.0023941993713378906
score_time :  0.003190279006958008
test_accuracy :  0.8657466177091095
test_precision :  0.5170530541583174
test_recall :  0.5932052161976664
test_f1 :  0.552132001536439

Alpha:  0.8
fit_time :  0.001994419097900391
score_time :  0.0033908367156982424
test_accuracy :  0.8666461679339971
test_precision :  0.5211368986361002
test_recall :  0.5696636925188743
test_f1 :  0.54391351030959

Alpha:  0.9
fit_time :  0.0017955303192138672
score_time :  0.003789377212524414
test_accuracy :  0.8687437718266615
test_precision :  0.5293105274491254
test_recall :  0.5589567604667124
test_f1 :  0.5432705427504866



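Since the main purpose of the classifier is to detect as many negative tweets as possible, I treat recall as the most important metric: it measures the share of actual negative tweets (the positive class, 1) that the model catches. Among the results shown above, alpha values of 0.1 and 0.2 give practically the same, highest recall (about 0.696), and 0.1 has the better precision and F1 score, so I will pick 0.1 as the best alpha for the Multinomial Naive Bayes classifier.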
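
One way to make this selection explicit is sketched below. The recall and F1 values are copied (rounded to five decimals) from the cross-validation output above, for the alphas whose results are fully shown. The tie-breaking rule (compare recall to three decimals, then prefer the higher F1) is an assumption added for illustration, and `select_alpha` is a hypothetical helper.

```python
# Mean 5-fold CV scores copied from the output above (rounded to 5 decimals).
# Only alphas whose results are fully shown are included.
cv_scores = {
    0.0: {"recall": 0.59529, "f1": 0.59628},
    0.1: {"recall": 0.69595, "f1": 0.58174},
    0.2: {"recall": 0.69597, "f1": 0.57226},
    0.3: {"recall": 0.67671, "f1": 0.56276},
    0.7: {"recall": 0.59321, "f1": 0.55213},
    0.8: {"recall": 0.56966, "f1": 0.54391},
    0.9: {"recall": 0.55896, "f1": 0.54327},
}

def select_alpha(scores, decimals=3):
    """Pick the alpha with the highest recall; break near-ties on F1 (assumed rule)."""
    return max(
        scores,
        key=lambda a: (round(scores[a]["recall"], decimals), scores[a]["f1"]),
    )

best_alpha = select_alpha(cv_scores)
print("Selected alpha:", best_alpha)
```

Recall for alpha 0.1 and 0.2 differs only in the fifth decimal place, far below the fold-to-fold variation, so the higher F1 score decides in favor of 0.1.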

##### 4.1.2 Logistic regression

In [16]:
# Create a logistic regression model
logreg = LogisticRegression()
# Define scoring metrics
scoring = ['accuracy','precision', 'recall', 'f1']

# Use Cross-validation to evaluate the model
cv_results = cross_validate(logreg, count_train, y_train, cv=5,scoring=scoring)
# Print the mean cross validation score for each metrics
for key in cv_results:
    print(key,': ', np.mean(cv_results[key]))

fit_time :  0.061238861083984374
score_time :  0.0035968780517578124
test_accuracy :  0.9196891075420375
test_precision :  0.8886842306426239
test_recall :  0.48833218943033624
test_f1 :  0.626209757683717


Now I will tune the Logistic Regression classifier by trying out several C values to see which one yields the best performance. (The default C value for the Logistic Regression classifier is 1.0.)

In [17]:
# Create the list of Cs
cs = np.arange(0.5, 11, 1)
# Define scoring metrics
scoring = ['accuracy','precision', 'recall', 'f1']

# Iterate over the Cs and print the corresponding score
for c in cs:
    print('C: ', c)
    # Instantiate the classifier: 
    logreg = LogisticRegression(C=c)
    # Use Cross-validation to evaluate the model
    cv_results = cross_validate(logreg, count_train, y_train, cv=5,scoring=scoring)
    for key in cv_results:
        print(key,': ', np.mean(cv_results[key]))    
    print()

C:  0.5
fit_time :  0.03651537895202637
score_time :  0.002987241744995117
test_accuracy :  0.9145939006544632
test_precision :  0.9236246949290428
test_recall :  0.4261725005719515
test_f1 :  0.5807533387945887

C:  1.5
fit_time :  0.04965529441833496
score_time :  0.003197002410888672
test_accuracy :  0.920588208889567
test_precision :  0.888020313020313
test_recall :  0.4969114619080302
test_f1 :  0.6335627624663321

C:  2.5
fit_time :  0.05585112571716309
score_time :  0.0027965545654296876
test_accuracy :  0.9217853648026286
test_precision :  0.8541069990461121
test_recall :  0.5332875772134523
test_f1 :  0.6539758715716903

C:  3.5
fit_time :  0.05845065116882324
score_time :  0.0025945663452148437
test_accuracy :  0.9211861135300612
test_precision :  0.8435276638584988
test_recall :  0.5375886524822695
test_f1 :  0.6536863569858513

C:  4.5
fit_time :  0.06204352378845215
score_time :  0.002791881561279297
test_accuracy :  0.9220852148775911
test_precision :  0.8436882209481478
[... output truncated ...]

According to the results above, I will pick 10.5 as the best C for my Logistic Regression classifier since it gives the highest accuracy and recall scores.

#### 4.2 Final Model Selection

In this section, I'll compare the performances of MultinomialNB(alpha=0.1) and LogisticRegression(C=10.5). I'll use recall as my metric for final selection.

##### 4.2.1 Performance of the Multinomial Naive Bayes classifier

In [18]:
# Evaluate recall by cross-validation using training set
nb_recall = cross_validate(MultinomialNB(alpha=0.1), count_train, y_train, 
                            scoring='recall', cv=5, n_jobs=-1)
print(np.mean(nb_recall['test_score']), np.std(nb_recall['test_score']))

0.6959505833905284 0.07205608326270745


In [19]:
# Evaluate the performance of the Naive Bayes Classifier using the test set
nb_classifier = MultinomialNB(alpha=0.1)
nb_classifier.fit(count_train,y_train)
pred = nb_classifier.predict(count_test)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test,pred)
print(cm)
# Extract true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = cm.ravel()
print("The recall score is:", tp/(tp+fn))

[[1336   71]
 [  90  148]]
The recall score is: 0.6218487394957983


##### 4.2.2 Performance of the Logistic Regression Classifier

In [20]:
# Evaluate recall by cross-validation using training set
logreg_recall = cross_validate(LogisticRegression(C=10.5), count_train, y_train, 
                            scoring='recall', cv=5, n_jobs=-1)
print(np.mean(logreg_recall['test_score']), np.std(logreg_recall['test_score']))

0.5654312514298787 0.06306498130540593


In [21]:
# Evaluate the performance of the Logistic Regression Classifier using the test set
logreg = LogisticRegression(C=10.5)
logreg.fit(count_train,y_train)
y_pred = logreg.predict(count_test)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)
# Extract true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = cm.ravel()
print("The recall score is:", tp/(tp+fn))

confusion matrix:
 [[1387   20]
 [  90  148]]
The recall score is: 0.6218487394957983
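
The confusion matrices carry more information than recall alone. As a sketch, the helper below (hypothetical name `summarize`) derives precision, recall, and F1 from the four cells, using the two test-set matrices exactly as printed above.

```python
def summarize(tn, fp, fn, tp):
    """Compute precision, recall, and F1 for the positive class from confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Cells in scikit-learn's layout for binary labels: [[tn, fp], [fn, tp]]
nb_scores = summarize(tn=1336, fp=71, fn=90, tp=148)
logreg_scores = summarize(tn=1387, fp=20, fn=90, tp=148)

for name, scores in [("Naive Bayes", nb_scores), ("Logistic Regression", logreg_scores)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

On these numbers the two models share the same test-set recall, while Logistic Regression produces fewer false positives (20 versus 71).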


##### 4.2.3 Conclusion

The Multinomial Naive Bayes classifier has a higher cross-validation recall score than Logistic Regression, by about 0.13 (0.70 versus 0.57). On the test set, both models reach the same recall score (0.62). Since recall is the selection metric, I'll pick the Multinomial Naive Bayes classifier as the final model.

In [22]:
# Instantiate the selected model (not yet fitted)
best_model = MultinomialNB(alpha=0.1)
best_model

MultinomialNB(alpha=0.1)

### 5. Summary

To prepare the data for model fitting, I performed NLP preprocessing steps on the tweet text. The steps included tokenization to split sentences into tokens, stop word removal, and lemmatization to convert words to their meaningful base forms. I then used CountVectorizer to vectorize the dataset, which represents the text as numerical data for modeling.  
Only the preprocessed tweet text column was used to predict sentiment classes. To build a negative tweet detector, all tweets with 'positive' and 'neutral' sentiment labels were relabeled as 0, and tweets with the 'negative' sentiment label were relabeled as 1.  
This gives a binary classification problem, for which I tried two classification models: a Multinomial Naive Bayes classifier and a Logistic Regression classifier.  
Evaluating a model on the same data it was trained on gives an overly optimistic estimate and hides overfitting. To avoid that, I used k-fold cross-validation: the training set is split into k smaller folds, a model is trained on k-1 of them and validated on the remaining fold, and this is repeated so that each fold serves once as the validation set.  
I first tried both models with their default parameters and evaluated their performance in terms of accuracy, precision, recall, and F1 using cross-validation.  
Next, I tuned the hyperparameters of both models separately. With the selected hyperparameters, I evaluated each model's recall by cross-validation on the training data and then on the held-out test set. I picked the Multinomial Naive Bayes classifier with alpha equal to 0.1 as the final model since its cross-validation recall was higher than that of the Logistic Regression classifier by 0.13, and both models had the same recall on the test set.
Therefore, if we use the chosen model for negative tweet detection, it is expected to detect roughly 62% to 70% of negative tweets (0.62 recall on the test set, 0.70 in cross-validation).