## Main Goals: 
<ol>
<li>Create a cleaned development dataset that can be used to complete the modeling step of this project
    <ul>
<li>Apply NLP preprocessing steps to the text</li>
<li>Split the data into training and testing datasets</li>
<li>Vectorize the dataset</li>
    </ul>
</li>
<li>Modeling: Build a <b>Negative Tweet Detector</b>
    <ul>
<li>Build and evaluate models</li>
<li>Compare models</li>
    </ul>
</li>
</ol>

### 1. Import Libraries

In [15]:
import pandas as pd
import os
import nltk
import re
import string
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
import time
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

### 2. Load Data

In [2]:
df = pd.read_csv('data/cleaned_tweets.csv')
print(df.shape)
df.head()

(4982, 6)


Unnamed: 0,date,cleaned_tweet,polarity,sentiment,text_len,text_word_count
0,2021-05-12,right now we welcome competition just no...,0.543,positive,83,12
1,2021-05-12,hahaha unfollowed tile a company who was...,0.1,positive,104,17
2,2021-05-12,i was thinking it might be in corenfc but i ...,0.0,neutral,94,17
3,2021-05-12,this is super clever creating a new battery ...,0.187,positive,98,18
4,2021-05-12,any one be interested if i did an airtag give...,0.25,positive,52,10


### 3. Text Preprocessing

#### 3.1 NLP Preprocessing

In the previous step, we cleaned the tweet text after loading the dataset: we removed all punctuation and lowercased the words. Before fitting the data to a model, we still need to tokenize the text, remove stop words, and lemmatize the remaining tokens.

In [3]:
# Tokenization
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in df['cleaned_tweet']]

# Remove stop words
stopword = nltk.corpus.stopwords.words('english')
no_stops=[]
for i in all_tokens:
    new_no_stops = [t for t in i if t not in stopword]
    no_stops.append(new_no_stops)

# Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = []
for i in no_stops:
    # Lemmatize every token of the tweet; a tweet with no tokens left yields an empty list
    new_lemmatized = [wordnet_lemmatizer.lemmatize(j) for j in i]
    lemmatized.append(new_lemmatized)

In [4]:
# Define a function to perform all the NLP preprocessing steps above
def preprocess_text(text):
    # Tokenization
    tknzr = TweetTokenizer()
    all_tokens = [tknzr.tokenize(t) for t in text]

    # Remove stop words
    stopword = nltk.corpus.stopwords.words('english')
    no_stops=[]
    for i in all_tokens:
        no_stops_new = [t for t in i if t not in stopword]
        no_stops.append(no_stops_new)

    # Lemmatization
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized = []
    for i in no_stops:
        # Lemmatize every token of the tweet; a tweet with no tokens left yields an empty list
        lemmatized_new = [wordnet_lemmatizer.lemmatize(j) for j in i]
        lemmatized.append(lemmatized_new)
    return lemmatized

In [5]:
preprocess_text(df['cleaned_tweet'])

[['right', 'welcome', 'competition', 'apple', 'tile', 'airtag', 'apple'],
 ['hahaha',
  'unfollowed',
  'tile',
  'company',
  'born',
  'thrived',
  'thanks',
  'apple',
  'bitterly',
  'aga'],
 ['thinking', 'might', 'corenfc', 'seen', 'anything', 'specificall'],
 ['super',
  'clever',
  'creating',
  'new',
  'battery',
  'backplate',
  'roku',
  'remote',
  'fit',
  'airtag'],
 ['one', 'interested', 'airtag', 'giveaway'],
 ['wanna',
  'get',
  'airtag',
  'since',
  'covid',
  '19',
  'go',
  'a9lan',
  'lolololol',
  'would',
  'useless',
  'atm'],
 ['mailed',
  'airtag',
  'tracked',
  'progress',
  'happened',
  'mac',
  'security',
  'blog'],
 ['esperando', 'sair', 'airtag', 'na', 'shopee'],
 ['def', 'went', 'function', 'form', 'also', 'bought', 'plain', 'airtag'],
 ['mailed', 'airtag', 'tracked', 'progress', 'technews', 'news'],
 ['chipolo',
  'undercut',
  'airtag',
  'buck',
  'find',
  'tracker',
  'apple',
  'appletrainer'],
 ['new', 'best', 'story', 'mailed', 'airtag', 'tr

In [6]:
df['preprocessed'] = lemmatized
df.head()

Unnamed: 0,date,cleaned_tweet,polarity,sentiment,text_len,text_word_count,preprocessed
0,2021-05-12,right now we welcome competition just no...,0.543,positive,83,12,"[right, welcome, competition, apple, tile, air..."
1,2021-05-12,hahaha unfollowed tile a company who was...,0.1,positive,104,17,"[hahaha, unfollowed, tile, company, born, thri..."
2,2021-05-12,i was thinking it might be in corenfc but i ...,0.0,neutral,94,17,"[thinking, might, corenfc, seen, anything, spe..."
3,2021-05-12,this is super clever creating a new battery ...,0.187,positive,98,18,"[super, clever, creating, new, battery, backpl..."
4,2021-05-12,any one be interested if i did an airtag give...,0.25,positive,52,10,"[one, interested, airtag, giveaway]"


In [7]:
# keep only the feature and target that we will use for our model
# (copy so that adding columns later does not trigger a SettingWithCopyWarning)
df = df[['preprocessed','sentiment']].copy()

#### 3.2 CountVectorizer: Vectorizing our dataset

In [8]:
# label positive and neutral sentiment as 0, and label negative sentiment as 1
df['negative_sentiment'] = df.sentiment.map({'positive':0,'neutral':0,'negative':1})
y = df['negative_sentiment']

In [9]:
df.preprocessed = df.preprocessed.apply(lambda x: " ".join(x))

In [10]:
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
X_train, X_test, y_train, y_test = train_test_split(df['preprocessed'], y,test_size = 0.33,random_state = 33)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3337,)
(1645,)
(3337,)
(1645,)
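
The shapes above follow from how a fractional `test_size` becomes a row count. As a quick check of the arithmetic, assuming the test share is rounded up and the remainder goes to the training set (an assumption that matches the printed output):

```python
import math

n_samples = 4982   # rows in df, from df.shape above
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```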


In [11]:
count_vectorizer = CountVectorizer()

# learn training data vocabulary and use it to create a document-term matrix: count_train 
count_train = count_vectorizer.fit_transform(X_train)
# transform testing data (using fitted vocabulary) into a document-term matrix: count_test 
count_test = count_vectorizer.transform(X_test)
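
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of the same idea. It is a simplification, not scikit-learn's implementation: it splits on whitespace only, and the helper names `fit_vocabulary` and `transform` as well as the toy documents are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted for a stable order)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Build a dense document-term count matrix; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Hypothetical toy documents, not taken from the dataset
train_docs = ["airtag lost airtag", "apple airtag hacked"]
test_docs = ["airtag stolen"]  # "stolen" never appears in the training documents

vocab = fit_vocabulary(train_docs)          # learned from training data only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)   # reuses the training vocabulary

print(vocab)
print(train_matrix)
print(test_matrix)
```

Because the vocabulary is learned from the training documents alone, a word that appears only in the test set gets no column, which is why the test data is transformed with the fitted vocabulary rather than refitted.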

In [12]:
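# Print the first 200 features of the count_vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
print(count_vectorizer.get_feature_names()[:200])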

['00', '000', '000apple', '03', '05', '07', '07115164', '07airtag', '08132321259', '09', '0mm', '0ver', '10', '100', '10brain', '11', '119', '12', '120', '120m', '127', '128', '13', '132', '1379', '14', '149', '15', '1500', '15k', '16', '19', '1978', '1986', '19999', '1k', '1mi', '1st', '20', '2001', '2018', '2019', '2020', '2021', '2022', '21', '216', '237', '24', '25', '2516', '279', '280', '2837472', '29', '2fa', '2nd', '30', '300', '301', '30am', '30k', '319', '328', '32mb', '33', '33k', '349', '35', '35k', '360', '3d', '3dprinting', '3mm', '3rd', '3v', '40', '400', '400ft', '449', '479', '482', '486', '490', '499', '4agze', '4am', '4ever', '4k', '50', '500', '50m', '52', '52832', '53pm', '54', '5g', '5k', '5th', '5x', '60', '60m', '62', '658', '699', '6ft', '6user', '72', '75key', '7999', '7th', '835', '877', '8970', '8m', '8mm', '8th', '8v', '90', '900', '90deg', '95', '987', '99', '99link', '9to5m', '9to5mac', 'a2f6', 'a52', 'aapl', 'ab', 'abccentralvic', 'abcmsh', 'abcwimmera',

Next, we inspect the vectors to see what they look like.

In [13]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names())
count_df.head()

Unnamed: 0,00,000,000apple,03,05,07,07115164,07airtag,08132321259,09,...,yup,zac,zarak,zdnet,zdnets,zee,zero,zip,zone,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# See the most used words in the training set by sorting the counts in descending order
count = pd.DataFrame(count_df.sum())
countdf = count.sort_values(0,ascending=False).head(20)
countdf[0:11]

Unnamed: 0,0
airtag,3038
apple,1559
new,282
find,276
hacked,255
researcher,244
security,241
tracker,227
airtags,213
lost,182


Besides the hashtag keyword "airtag", the most common words were "apple" (the company behind AirTag), "new", "find", "hacked", "researcher", "security", "tracker", "airtags", "lost", and "already".

### 4. Modeling

#### 4.1 Training and testing the "Negative Tweet Detector"

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). We'll first train and test a Naive Bayes model using the CountVectorizer data.

In [21]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
# Fit the classifier to the training data
%time nb_classifier.fit(count_train,y_train)
# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

Wall time: 5.94 ms
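
For intuition about what the classifier fitted above estimates, the following is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is an illustration under stated assumptions, not scikit-learn's code: the function names and the toy count matrix are hypothetical, and `alpha=1.0` is chosen to mirror the `MultinomialNB` default.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities from counts."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word within class c, plus the smoothing term
        word_totals = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        denom = sum(word_totals)
        log_likelihood[c] = [math.log(t / denom) for t in word_totals]
    return classes, log_prior, log_likelihood

def predict(row, classes, log_prior, log_likelihood):
    """Return the class with the highest log posterior score for one count vector."""
    scores = {
        c: log_prior[c] + sum(n * lw for n, lw in zip(row, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)

# Hypothetical toy data; columns stand for the words ["airtag", "love", "hacked"]
X_toy = [[1, 1, 0], [1, 2, 0], [1, 0, 1], [1, 0, 2]]
y_toy = [0, 0, 1, 1]  # 1 = negative tweet, 0 = not negative

model = fit_multinomial_nb(X_toy, y_toy)
print(predict([1, 0, 1], *model))  # a tweet that mentions "hacked"
print(predict([1, 1, 0], *model))  # a tweet that mentions "love"
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing term keeps a word that was never seen in a class from forcing that class's probability to zero.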


In [26]:
# Calculate the accuracy score: score
print("The accuracy score is:", metrics.accuracy_score(y_test,pred))
# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test,pred)
print(cm)

The accuracy score is: 0.9045592705167174
[[1368   39]
 [ 118  120]]


In [33]:
# extract true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = metrics.confusion_matrix(y_test,pred).ravel()
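
Accuracy alone can hide how the detector behaves on the minority (negative) class. As a worked check, the four counts from the confusion matrix printed above can be turned into precision, recall, and F1 by hand; the variable names below are illustrative.

```python
# Counts copied from the confusion matrix printed above
tn, fp, fn, tp = 1368, 39, 118, 120
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # share of tweets flagged as negative that really are negative
recall = tp / (tp + fn)       # share of truly negative tweets that were flagged
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total  # accuracy of always predicting "not negative"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} baseline={baseline:.3f}")
```

The accuracy reproduces the score reported above. The recall of about 0.50 means roughly half of the negative tweets in the test set are missed, and always predicting "not negative" would already reach about 0.855 accuracy on this split.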

In [38]:
# print message text for the false positives (non-negative tweets incorrectly classified as negative tweets)
print('There are {} non-negative tweets incorrectly classified as negative tweets'.format(fp))
X_test[y_test < pred]

There are 39 non-negative tweets incorrectly classified as negative tweets


1361    io 14 6 let list email contact method lost airtag
2418    downplay achievement airtag us fairly shelf no...
444       io 14 6 user use email address airtag lost mode
968     apple airtag work almost well scary side read ...
421                                put airtag butt asleep
3332    engrave character language english airtag buti...
1517    iphone io 14 watchos 7 apple today released th...
3868    beep get airtag request phone find recorded em...
3172    gone ahead purchased good condition ipod click...
584     apple airtag jailbroken could used redirect fi...
4524                                airtag find interface
2226    randomdumber apple airtag jailbroken could use...
545     apple airtag jailbroken could used redirect fi...
3874    appleinsider airtag owner taken apart device s...
418                         airtag attach anything really
4715    change color precision finding screen far away...
4782    update apple responds airtag pulled major aust...
3991    iphone

In [45]:
# print message text for the false negatives (negative tweets incorrectly classified as non-negative tweets)
print('There are {} negative tweets incorrectly classified as non-negative tweets'.format(fn))
X_test[y_test > pred].head()

There are 118 negative tweets incorrectly classified as non-negative tweets


3328       video clever hack turn airtag thin card wallet
3240                                         tear airtags
2304    little brother love playing scavenger hunt air...
1060    introduced airtag buy single pack n32 500 appl...
2254                        apple airtag hacked bad sound
Name: preprocessed, dtype: object
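
The two filters above rely on the labels being 0 and 1: comparing the true label with the prediction picks out exactly one kind of error. A small sketch with plain lists shows the logic (the labels and predictions here are hypothetical, not model output):

```python
# Hypothetical labels and predictions (1 = negative tweet, 0 = not negative)
y_true = [0, 0, 1, 1, 0, 1]
y_hat = [0, 1, 1, 0, 1, 1]

# true < predicted holds only when true = 0 and predicted = 1: a false positive
false_positive_idx = [i for i, (t, p) in enumerate(zip(y_true, y_hat)) if t < p]
# true > predicted holds only when true = 1 and predicted = 0: a false negative
false_negative_idx = [i for i, (t, p) in enumerate(zip(y_true, y_hat)) if t > p]

print(false_positive_idx, false_negative_idx)
```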

In [49]:
# calculate predicted probabilities for X_test (poorly calibrated)
y_pred_prob = nb_classifier.predict_proba(count_test)[:, 1]
y_pred_prob

array([7.34558212e-08, 1.14985135e-02, 1.04806256e-04, ...,
       3.91153359e-02, 9.72373044e-05, 2.04032125e-05])

In [50]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.8726296488744752
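
One way to read the AUC is as the probability that a randomly chosen negative tweet receives a higher predicted probability than a randomly chosen non-negative one. The sketch below computes that pairwise quantity directly on hypothetical data; the function name `pairwise_auc` is illustrative, and the quadratic loop is meant for intuition rather than for large test sets.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities, not taken from the model above
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

auc = pairwise_auc(labels, scores)
print(auc)
```

Because only the ordering of the scores matters, the AUC stays informative even when the predicted probabilities themselves are poorly calibrated.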