# Sentiment Analysis for Tweets

Given a dataset of tweets annotated with sentiment labels, build a classification model using Naive Bayes.

## Load the data

Let's load our tweets dataset.

In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Keep the file path and the DataFrame in separate variables
data_path = '/content/drive/MyDrive/Lab 6: Text Classification/P6_Sentiment_Analysis_Tweets.csv'
data = pd.read_csv(data_path)

In [4]:
data.head()

,Unnamed: 0,ItemID,Sentiment,SentimentText
0,0,1,0,is so sad for my APL frie...
1,1,2,0,I missed the New Moon trail...
2,2,3,1,omg its already 7:30 :O
3,3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,4,5,0,i think mi bf is cheating on me!!! ...


In [5]:
data["Sentiment"].value_counts()

0    5812
1    4188
Name: Sentiment, dtype: int64
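The classes are not perfectly balanced, so it helps to know what accuracy a trivial classifier would reach by always predicting the most frequent label. The snippet below is a minimal sketch that reuses the counts printed above; the variable names are my own and are not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 5812, 1: 4188}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy obtained by always predicting the most frequent label.
baseline_accuracy = counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any model trained later should beat this baseline to be considered useful.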

In [6]:
# Drop the two identifier columns; only the label and the tweet text are needed
data.drop(['Unnamed: 0', 'ItemID'], axis=1, inplace=True)

In [7]:
data.columns=['label','text']
data.head()

,label,text
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


## Clean the data

I will use the `clean_text` function from the cleaning_functions.py script that we created in the third week (P3_Text_Normalization).

In [8]:
import sys

sys.path.insert(1, '/content/drive/MyDrive/Lab 3: 25th of January/') # Insert folder path to be able to call cleaning_functions.py

In [9]:
from cleaning_functions import clean_text  # only clean_text is used in this notebook

In [10]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
data["clean_text"] = data["text"].apply(lambda x: clean_text(x, 'stemming'))
data.head()

,label,text,clean_text
0,0,is so sad for my APL frie...,sad apl friend
1,0,I missed the New Moon trail...,miss new moon trailer
2,1,omg its already 7:30 :O,omg alreadi
3,0,.. Omgaga. Im sooo im gunna CRy. I'...,omgaga im sooo im gunna cri dentist sinc supos...
4,0,i think mi bf is cheating on me!!! ...,think mi bf cheat
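The cleaning_functions.py script is not shown in this notebook, so the following is only a rough sketch of the kind of normalization the output above suggests: lowercasing, removing punctuation and digits, and dropping stop words. The function name `simple_clean` and the tiny stop-word list are hypothetical, and stemming is left out to keep the example dependency-free, so its output will differ from the real `clean_text`.

```python
import re

# Tiny illustrative stop-word list (an assumption; the real script
# presumably relies on the NLTK stop-word list downloaded above).
STOP_WORDS = {"i", "is", "so", "for", "my", "the", "on", "me", "its",
              "a", "an", "and", "to", "of"}

def simple_clean(text):
    """Lowercase, keep only alphabetic tokens, and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(simple_clean("I missed the New Moon trailer..."))
```

A stemmer would additionally reduce "missed" to "miss", which is what the `clean_text` column shows.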


## Vectorize the data

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
bow_model=CountVectorizer()
bow_vec=bow_model.fit_transform(data["clean_text"])

In [15]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow_data = pd.DataFrame(bow_vec.toarray(), columns=bow_model.get_feature_names_out())
bow_data.head()



Unnamed: 0,aaaaaaah,aaaarrgghhhhhhhh,aaah,aaahh,aaahhhhh,aaahhhhhh,aaarrrgggghhh,aah,aahh,aahhh,aaliyah,aamaustin,aaron,aaroncarl,aaronxnow,aaronyonda,aarrgh,aarrrrggghhhhh,aawww,aay,abagail,abbey,abbi,abc,abduct,abduzeedo,aber,aberdeen,aberonlin,abfahrt,abhl,abi,abil,abit,abl,abnorm,abomin,aboot,abort,abortiontuesday,...,zaho,zahra,zammi,zazzl,zcultfm,zelda,zeldman,zenbitch,zend,zendoc,zephyrlili,zepinkladi,zeppelin,zero,zeu,zevin,zheyamada,ziggi,zn,zochula,zoe,zoebo,zoey,zoffitcha,zoidberg,zombi,zombieninja,zomg,zone,zoo,zopiclon,zoro,zumba,zune,zzzzz,zzzzzzzzzzzz,zzzzzzzzzzzzz,ãªnfase,ðµ,øªù
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
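To make the table above easier to read, here is a small pure-Python sketch of what a bag-of-words matrix contains: one row per document and one column per vocabulary term, with each cell holding the number of times the term occurs. It splits on whitespace for simplicity, whereas `CountVectorizer` uses its own tokenization (by default it ignores single-character tokens), so treat it as an illustration rather than a reimplementation.

```python
from collections import Counter

# Three cleaned tweets taken from the clean_text column above.
docs = ["sad apl friend", "miss new moon trailer", "omg alreadi"]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary term.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocab])

print(vocab)
for row in rows:
    print(row)
```

With roughly ten thousand tweets the vocabulary grows to thousands of columns, which is why almost every cell in the real matrix is zero.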


## Create the model

In [17]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(bow_data,data['label'],test_size=.2)
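The split above is random, so the scores change from run to run, and the dense DataFrame can use a lot of memory. As a possible alternative, the sketch below splits the sparse matrix returned by `CountVectorizer` directly and fixes the random seed. The toy texts, the `random_state` value, and the use of `stratify` are my own assumptions and are not part of the original notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical), five documents per class.
texts = ["sad cri", "miss friend", "cheat bf", "sad dentist", "cri miss",
         "love day", "happi sun", "great love", "fun happi", "great day"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X = CountVectorizer().fit_transform(texts)  # SciPy sparse matrix

# random_state makes the split repeatable; stratify keeps the class
# proportions the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

print(issparse(X_train), X_train.shape[0], X_test.shape[0])
```

`MultinomialNB` accepts sparse input, so the matrix does not need to be converted with `toarray()` before training.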

In [21]:
from sklearn.naive_bayes import MultinomialNB

# fit_prior=True (the default) learns the class priors from the training labels
nb_model = MultinomialNB(fit_prior=True).fit(X_train, y_train)
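To show what the classifier learns, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which matches the default `alpha=1.0` of `MultinomialNB`. The helper names `train_nb` and `predict_nb` and the toy documents are hypothetical; this illustrates the idea and is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())

    model = {}
    for label in class_docs:
        total = sum(term_counts[label].values())
        log_prior = math.log(class_docs[label] / len(docs))
        log_likelihood = {
            term: math.log((term_counts[label][term] + alpha)
                           / (total + alpha * len(vocab)))
            for term in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Terms that never appeared in training are skipped.
        scores[label] = log_prior + sum(
            log_likelihood[t] for t in doc.split() if t in log_likelihood
        )
    return max(scores, key=scores.get)

docs = ["sad cri miss", "sad cheat", "love great day", "happi love"]
labels = [0, 0, 1, 1]
model = train_nb(docs, labels)

print(predict_nb(model, "sad day"))
print(predict_nb(model, "love love"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.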

## Evaluate the model

In [22]:
nb_prediction = nb_model.predict(X_test)
nb_precision, nb_recall, nb_fscore, nb_support = score(y_test, nb_prediction, pos_label=1, average='binary')
nb_accuracy = (nb_prediction == y_test).mean()  # fraction of correct predictions
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(nb_precision, 3),
                                                        round(nb_recall, 3),
                                                        round(nb_accuracy, 3)))

Precision: 0.754 / Recall: 0.635 / Accuracy: 0.76
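Precision is the share of tweets predicted as class 1 that really are class 1, while recall is the share of actual class 1 tweets that the model found. A recall lower than precision, as here, means the model misses more class 1 tweets than it wrongly flags. The sketch below computes the three metrics from the confusion counts on a small made-up example; the function name `binary_metrics` and the toy labels are my own.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Compute precision, recall and accuracy from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    tn = sum(1 for t, p in pairs if t != pos_label and p != pos_label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Toy labels (not taken from the tweets dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(f"Precision: {precision:.3f} / Recall: {recall:.3f} / Accuracy: {accuracy:.3f}")
```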
