

# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [66]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy.sparse # scientific computing (sparse matrices are used later on)

### 1. Importing our Data

In [67]:
# Question: Given new tweets, create a sentiment analysis model that will 
# predict whether a tweet will contain positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [68]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

It seems this dataset will need some cleaning: the column names are not meaningful (the header shown above looks like a row of data). We also don't need every column to create our model, so we will drop the ones we don't need.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [69]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [70]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [71]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [72]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [73]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
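
As a quick check on class balance, the counts reported by `value_counts()` earlier can be turned into proportions. The sketch below is a minimal illustration that hard-codes those two counts; the `label_names` mapping is our own naming for readability and is not part of the dataset.

```python
# Counts copied from the value_counts() output earlier in this notebook
counts = {0: 5067, 4: 4933}

# Hypothetical readable names for the two target values
label_names = {0: "negative", 4: "positive"}

total = sum(counts.values())
shares = {label_names[k]: v / total for k, v in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The two classes are close to evenly split, so accuracy is a reasonable first metric for this dataset.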

In [74]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [75]:
# Text Cleaning: Removing all urls/links
# ---
# 
# 'http\S+' already matches https links, so two alternatives are enough
df['text'] = df['text'].apply(lambda x: re.sub(r'http\S+|www\S+', '', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [76]:
# Text Cleaning: Removing @ and # characters
# ---
# regex=True is passed explicitly so the pattern is treated as a character class
df['text'] = df.text.str.replace(r'[@#]', '', regex=True)
df[['text']].head()




Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,lindork Tres sad. I was totally a Max fan. SY...
4,"Crap, I was counting down the hours until my d..."


In [77]:
# Text Cleaning: Conversion to lowercase
# ---
df['text'] = df.text.str.lower() 
df[['text']].head()


Unnamed: 0,text
0,obama forges his muslim alliance against the c...
1,had the most spectacular prom ever but now my...
2,i am overwhelmed today taking a moment to eat...
3,lindork tres sad. i was totally a max fan. sy...
4,"crap, i was counting down the hours until my d..."


In [78]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...

# Installing wordninja and textblob

!pip3 install wordninja
!pip3 install textblob

# Importing those libraries

import wordninja 
from textblob import TextBlob


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [79]:
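# Performing the split
# wordninja.split returns a list of words, which we join back into one string.
# As the output below shows, punctuation is also dropped by this step.

df['text'] = df.text.apply(lambda x: wordninja.split(str(x)))
df['text'] = df.text.str.join(' ')
df[['text']].head(5)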


Unnamed: 0,text
0,obama forges his muslim alliance against the c...
1,had the most spectacular prom ever but now my ...
2,i am overwhelmed today taking a moment to eat ...
3,lin dork tres sad i was totally a max fan sytycd
4,crap i was counting down the hours until my da...
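
To build some intuition for what splitting concatenated words involves, here is a toy sketch based on dynamic programming. It is not wordninja's implementation: as far as we understand, wordninja scores candidate words with a word-frequency model, whereas this sketch uses a tiny hand-written vocabulary (`toy_vocab`) and flat costs. The function name and the cost values are assumptions made for illustration only.

```python
def split_concatenated(text, vocab, max_len=20):
    """Split text into pieces, preferring few pieces that are known words."""
    n = len(text)
    # best[i] holds (cost, pieces) for the cheapest split of text[:i]
    best = [(0, [])] + [(float("inf"), [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Known words cost 1; unknown chunks cost 10 per character
            cost = 1 if piece in vocab else 10 * len(piece)
            candidate = best[start][0] + cost
            if candidate < best[end][0]:
                best[end] = (candidate, best[start][1] + [piece])
    return best[n][1]

# Toy vocabulary (an assumption; a real splitter uses a large word list)
toy_vocab = {"good", "morning", "twitter", "go", "od"}
print(split_concatenated("goodmorningtwitter", toy_vocab))
```

With flat costs the split with the fewest known words wins, which is why "good" is preferred over "go" followed by "od".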


In [80]:
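# Text Cleaning: Removing punctuation characters
# ---
# A raw string keeps \w and \s intact; regex=True makes the pattern a regular expression
df['text'] = df.text.str.replace(r'[^\w\s]', '', regex=True)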
df[['text']].sample(10)



Unnamed: 0,text
8506,jo it ou 2 that s awesome im not gion g 2 1 th...
7529,good morning lt 3 i love how its only just sat...
9861,robbie cur lee good night s is and god bless you
7303,shan ta noo i did that too 3 days ago but some...
6585,looking forward to my night out with women wh...
9570,ready fuels when god was handing out names he ...
145,why do you come across my mind still youre so ...
8560,the e real fd hc hurry up and get home i miss you
1852,at dinner with nick rp ive got an excellent vi...
7625,i dont understand how you use twitter


In [81]:
# Text Cleaning: Removing stop words

# import library
import nltk
nltk.download('stopwords')

# import stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [82]:
# Find out the number of stop words in the tweets

df['no_of_stopwords'] = df.text.apply(lambda x: len([word for word in x.split() if word in stop]))
df[['text','no_of_stopwords']].sample(10)



Unnamed: 0,text,no_of_stopwords
717,demonic angel 81 yay but u have no idea what y...,5
9209,jane safari an omg wow thats amazing did you d...,6
6662,deconstruct o open to new members but otherwis...,3
3417,ieee 802 dot 16 e hah aha yes s s absolutely g...,2
4210,mard hi ah read it on ohno they didnt livejour...,12
33,i just updated my multiply blog about my gabe ...,10
2758,meredith gould my absolute pleasure ma am,3
7170,pod nosh re cringe ing those buggers playing a...,5
3502,gators lost in softball tonight guess it will ...,6
5975,i have a lisp,3


In [83]:
# Remove the stop words

df['text'] = df.text.apply(lambda x: " ".join(word for word in x.split() if word not in stop))
df[['text']].sample(5)


Unnamed: 0,text
5444,first tweet
9077,monday morning sun shining new emails single d...
7865,work taking hour lunch
3844,mateo camargo buy un tu ant ly hmv guys see fr...
9634,computer keeps overheating try catch fave shows


In [84]:
# Text Cleaning: Lemmatization

# For lemmatization, we will need to download wordnet

nltk.download('wordnet')

from textblob import Word

# Lemmatizing our text


df['lemmatization'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text', 'lemmatization']].head(10)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,lemmatization
0,obama forges muslim alliance civilized world d...,obama forge muslim alliance civilized world di...
1,spectacular prom ever bed serenading must answ...,spectacular prom ever bed serenading must answ...
2,overwhelmed today taking moment eat pray,overwhelmed today taking moment eat pray
3,lin dork tres sad totally max fan sytycd,lin dork tres sad totally max fan sytycd
4,crap counting hours dad could come home amp he...,crap counting hour dad could come home amp hel...
5,dc b tv dc b tv go check things buy others loo...,dc b tv dc b tv go check thing buy others look...
6,mr ke never gmail anymore,mr ke never gmail anymore
7,alex jeffrey id loved come couple unfortunate ...,alex jeffrey id loved come couple unfortunate ...
8,br rrr heading work chilly today,br rrr heading work chilly today
9,ga bri iii ella nee ed talk u good new sss,ga bri iii ella nee ed talk u good new ss


We won't remove numerics because we could lose some of the meaning of our text without them. We could also prepare our text further by performing spelling correction, but this is a resource-intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [85]:
# Feature Construction: Length of tweet (character count, including spaces)

df['length_of_tweet'] = df.text.str.len()
df[['text','length_of_tweet']].sample(5)



Unnamed: 0,text,length_of_tweet
2638,waiting tomorrow school almost soon high school,47
9307,corey grand aaa nd ccas mom decided let go im ...,51
968,really need real bed,20
8849,please hope time nothing,24
6781,fast cars 800 coming see morro,30


In [86]:
# Feature Construction: Word count 

df['word_count'] = df["text"].apply(lambda x: len(str(x).split(" ")))
df.sample(5)


Unnamed: 0,target,text,no_of_stopwords,lemmatization,length_of_tweet,word_count
9216,4,digital maverick enjoy sunshine,5,digital maverick enjoy sunshine,31,4
4004,4,day salcedo girls,3,day salcedo girl,17,3
8810,0,taking ri yoko get shots,2,taking ri yoko get shot,24,5
8043,0,still stiffness sunday maybe counter another w...,5,still stiffness sunday maybe counter another w...,60,8
9117,0,tired feel almost comatose didnt sleep well la...,7,tired feel almost comatose didnt sleep well la...,63,10


In [87]:
df.shape

(10000, 6)

In [88]:
# Feature Construction: Word density
# ---
# Here each tweet's word count is divided by the total number of tweets,
# so the column sums to the average number of words per tweet.

# Number of tweets in df
number_of_rows = len(df)
print(number_of_rows)

df['Word density'] = df['word_count'] / number_of_rows
df.sample(5)


10000


Unnamed: 0,target,text,no_of_stopwords,lemmatization,length_of_tweet,word_count,Word density
5814,0,go work,5,go work,7,2,0.0002
5545,0,trip bangkok,3,trip bangkok,12,2,0.0002
4375,4,esa bas thank awesome stranger 40 th follower,4,esa ba thank awesome stranger 40 th follower,45,8,0.0008
6203,4,mango x 3 like song,4,mango x 3 like song,19,5,0.0005
597,4,good,3,good,4,1,0.0001
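
The length and word-count features can be checked by hand on one of the sampled tweets. The sketch below also computes an alternative notion of density, the average number of characters per word; that alternative is our own assumption for comparison and is not the `Word density` column computed in this notebook.

```python
def basic_features(text):
    """Compute simple length-based features for one cleaned tweet."""
    words = text.split()
    n_chars = len(text)   # includes spaces, like Series.str.len()
    n_words = len(words)
    # Alternative density (assumption): average characters per word
    chars_per_word = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {"length": n_chars, "word_count": n_words, "chars_per_word": chars_per_word}

print(basic_features("really need real bed"))
```

For this tweet the length is 20, which agrees with the `length_of_tweet` value shown for the same text in the earlier sample.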


In [89]:
# Feature Construction: Noun count


# First, we will download punkt and the averaged_perceptron_tagger into our notebook environment.
# These will allow us to find the part-of-speech tags.

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [90]:
# We create a function that counts the words in a given sentence whose part-of-speech tag belongs to a given group


from textblob import TextBlob
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

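def pos_check(x, flag):
    # Count the words in x whose part-of-speech tag belongs to the group named by flag
    cnt = 0
    try:
        wiki = TextBlob(x)
        for _, tag in wiki.tags:
            if tag in pos_dic[flag]:
                cnt += 1
    except Exception:
        # If tagging fails for this text, keep whatever count we have so far
        pass
    return cnt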
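
The counting logic can be tried out without a tagger by feeding it hand-written `(word, tag)` pairs. The tags below are hypothetical stand-ins for what a part-of-speech tagger might return, not actual tagger output, and `count_pos` is our own simplified name.

```python
# Hand-written (word, tag) pairs standing in for tagger output (hypothetical)
tagged = [("washing", "VBG"), ("machine", "NN"), ("died", "VBD"),
          ("yesterday", "NN"), ("buy", "VB"), ("new", "JJ"), ("one", "CD")]

pos_groups = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adj": {"JJ", "JJR", "JJS"},
}

def count_pos(tagged_words, group):
    """Count the words whose tag belongs to the named group."""
    return sum(1 for _, tag in tagged_words if tag in pos_groups[group])

print(count_pos(tagged, "noun"), count_pos(tagged, "verb"), count_pos(tagged, "adj"))  # 2 3 1
```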


In [91]:
# Noun Count

df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))
df[['text','noun_count']].sample(10)

Unnamed: 0,text,noun_count
3876,peter 76 sounds like busy day fun stuff good l...,6
1289,feeling far away friends right good support da...,3
4130,kr azy mary lol daughter turned 2 drives crazy...,5
6177,close yet far failure amp disappointed,2
3523,brig ens judo learning boring,2
7836,prophet lady new amp getting verified takes ti...,6
3832,babies,1
183,im someones hero makes feel special,2
7128,sarah spined limit story 140 ch tell annoying,4
916,watching roni play g tar hero wont left play c...,9


In [92]:
# Feature Construction: Verb count
# ---
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))
df[['text','verb_count']].sample(10)

Unnamed: 0,text,verb_count
6825,car think pretty cool nar mac friendly great m...,1
8179,middle morning amp nose kinda runnin yup p sick,0
2901,woke cutest text message world lt 3,1
5574,taylor swift 13s dateline special airs sunday ...,3
1970,jessica ashs house playing lips mint bbq yeste...,1
1830,amanda palmer would like captain pimp ship cal...,3
9138,air condo tion er broken work us 90 degrees bu...,2
1664,hi people,0
932,newb nb yep shou ask crazy tw cr uci fire brai...,1
8679,ian cap stick 4000 thats guess,0


In [93]:
# Feature Construction: Adjective count / Tweet
# ---
df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))
df[['text','adj_count']].sample(10)


Unnamed: 0,text,adj_count
2912,todays trip fu son fail berry gela plain straw...,0
7201,good morning malta sending tweets mu ovos job ...,2
4511,ip christina five sounds pretty good ill check,2
9283,thinks everyone needs hug kids little bit tighter,2
2561,monika maple boo still,0
5321,washing machine died yesterday buy new one,1
8274,summer time woke get shots today,0
3762,playing pet society,1
1822,inspector latest webkit badly broken,1
8648,stars mile e omg thats sad love babies happen sad,2


In [94]:
# Feature Construction: Adverb count / Tweet
# ---

df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))
df[['text','adv_count']].sample(10)


Unnamed: 0,text,adv_count
9438,kr uci al ive got ur om let te quite ready u w...,2
2197,still accounting im tired,1
1631,kirstie alley really like participate 24 twitt...,1
2673,made mistake yesterday,0
6569,twitter love website,0
97,hammock amazing still one peice im proud,1
1798,going dentist,0
3284,roque j 75 yea see order something dont eat ge...,0
9101,right b wlaz think u agree u hang w rp tom mor...,1
5330,gab bouc cori luv n blast atl stay cool wanna ...,0


In [95]:
# Feature Construction: Pronoun 
# ---
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))
df[['text','pron_count']].sample(10)



Unnamed: 0,text,pron_count
2957,doctor follow ill jeez thank u country punk ga...,1
8513,holly mae 20 sorry youre feeling yuck london p...,0
4090,pet results yet hell knows wish cum soon nervous,0
6844,rac hie sweets hear ya well dont let ppl get y...,0
441,za bbs thank,0
7707,finished treasury lots colors back work,0
1974,spastic tiger msn talk p wee ease one else im ...,0
4773,leaving leanns house going boardwalk,0
9980,going relax weekend start packing room,0
8390,cant believe im stuck work weather lush,0


In [96]:
# Feature Construction: Subjectivity
# ---

def get_subjectivity(text):
    # Fall back to 0.0 if TextBlob cannot process the text
    try:
        textblob = TextBlob(text)
        subj = textblob.sentiment.subjectivity
    except Exception:
        subj = 0.0
    return subj

df['subjectivity'] = df.text.apply(get_subjectivity)
df[['text', 'subjectivity']].sample(10)



Unnamed: 0,text,subjectivity
4723,loving whole quo dirty carol vo order man ravi...,0.7875
232,yi anno poul os like,0.0
3248,damn broke another fly swatter lol de j vu las...,0.591667
1892,rock berry movies amp downloads sweeter words ...,0.0
2887,jonas brothers fa v turn right far think aweso...,0.783929
3464,w ock eez hol hol spain,0.0
181,elias 8909 omg need go see together girl lol,0.7
2455,got sent home school nearly fainted r e felt i...,0.7
7035,cant wait hour long season finale hills goodnight,0.4
6092,tweet food get follows taco bell truck canters...,0.0


In [97]:
# Feature Construction: Polarity
# ---

def get_polarity(text):
    # Fall back to 0.0 if TextBlob cannot process the text
    try:
        textblob = TextBlob(text)
        pol = textblob.sentiment.polarity
    except Exception:
        pol = 0.0
    return pol

df['polarity'] = df.text.apply(get_polarity)
df[['text', 'polarity']].sample(10)


Unnamed: 0,text,polarity
8574,ok far iphone os 3 0 isnt bad steps right dire...,-0.102857
540,going airport drop mom,0.0
7722,panda cat baby im retarded thats looked pretty...,-0.179167
5878,vamps r us welcome,0.8
2367,shawn 3 k hoo chuck although gloomy today much...,0.5
4716,mrs double u pug gy another friend meeting w a...,0.0
7562,mi smile youre cute,0.4
7696,doll life yeah unfortunately,-0.5
2076,anti mate yes really need hang summer pca stat...,0.2
4352,elliott danger actually h avn,0.0


In [98]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---
from nltk import word_tokenize, ngrams

# Word ngrams

list(ngrams(word_tokenize(df['text'][0]), 2)) 

[('obama', 'forges'),
 ('forges', 'muslim'),
 ('muslim', 'alliance'),
 ('alliance', 'civilized'),
 ('civilized', 'world'),
 ('world', 'didnt'),
 ('didnt', 'even'),
 ('even', 'drop'),
 ('drop', 'cup'),
 ('cup', 'tea')]

In [99]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---

list(ngrams(df['text'][0], 2))


[('o', 'b'),
 ('b', 'a'),
 ('a', 'm'),
 ('m', 'a'),
 ('a', ' '),
 (' ', 'f'),
 ('f', 'o'),
 ('o', 'r'),
 ('r', 'g'),
 ('g', 'e'),
 ('e', 's'),
 ('s', ' '),
 (' ', 'm'),
 ('m', 'u'),
 ('u', 's'),
 ('s', 'l'),
 ('l', 'i'),
 ('i', 'm'),
 ('m', ' '),
 (' ', 'a'),
 ('a', 'l'),
 ('l', 'l'),
 ('l', 'i'),
 ('i', 'a'),
 ('a', 'n'),
 ('n', 'c'),
 ('c', 'e'),
 ('e', ' '),
 (' ', 'c'),
 ('c', 'i'),
 ('i', 'v'),
 ('v', 'i'),
 ('i', 'l'),
 ('l', 'i'),
 ('i', 'z'),
 ('z', 'e'),
 ('e', 'd'),
 ('d', ' '),
 (' ', 'w'),
 ('w', 'o'),
 ('o', 'r'),
 ('r', 'l'),
 ('l', 'd'),
 ('d', ' '),
 (' ', 'd'),
 ('d', 'i'),
 ('i', 'd'),
 ('d', 'n'),
 ('n', 't'),
 ('t', ' '),
 (' ', 'e'),
 ('e', 'v'),
 ('v', 'e'),
 ('e', 'n'),
 ('n', ' '),
 (' ', 'd'),
 ('d', 'r'),
 ('r', 'o'),
 ('o', 'p'),
 ('p', ' '),
 (' ', 'c'),
 ('c', 'u'),
 ('u', 'p'),
 ('p', ' '),
 (' ', 't'),
 ('t', 'e'),
 ('e', 'a')]
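
Both the word-level and character-level n-grams shown here come from the same idea: slide a window of size n over a sequence. A minimal pure-Python sketch, with `make_ngrams` as our own helper name, works on a list of tokens and on a plain string alike.

```python
def make_ngrams(sequence, n):
    """Return all runs of n consecutive items as tuples."""
    return list(zip(*(sequence[i:] for i in range(n))))

tokens = "obama forges muslim alliance".split()
print(make_ngrams(tokens, 2))   # word bigrams
print(make_ngrams("tea", 2))    # character bigrams
```

A sequence of length m yields m - n + 1 n-grams, so longer tweets contribute more features.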

In [100]:
df.shape



(10000, 14)

In [101]:
df.head(3)


Unnamed: 0,target,text,no_of_stopwords,lemmatization,length_of_tweet,word_count,Word density,noun_count,verb_count,adj_count,adv_count,pron_count,subjectivity,polarity
0,0,obama forges muslim alliance civilized world d...,9,obama forge muslim alliance civilized world di...,68,11,0.0011,6,3,1,1,0,0.9,0.4
1,4,spectacular prom ever bed serenading must answ...,13,spectacular prom ever bed serenading must answ...,83,12,0.0012,5,3,2,1,0,0.85,0.65
2,0,overwhelmed today taking moment eat pray,5,overwhelmed today taking moment eat pray,40,6,0.0006,4,2,0,0,0,0.0,0.0


In [112]:
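# Let's prepare the constructed features for modeling
# ---
# Columns 4 to 11 hold length_of_tweet through pron_count; subjectivity and
# polarity (columns 12 and 13) are left out of X_metadata.
X_metadata = np.array(df.iloc[:, 4:12])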
X_metadata

array([[6.8e+01, 1.1e+01, 1.1e-03, ..., 1.0e+00, 1.0e+00, 0.0e+00],
       [8.3e+01, 1.2e+01, 1.2e-03, ..., 2.0e+00, 1.0e+00, 0.0e+00],
       [4.0e+01, 6.0e+00, 6.0e-04, ..., 0.0e+00, 0.0e+00, 0.0e+00],
       ...,
       [4.5e+01, 8.0e+00, 8.0e-04, ..., 2.0e+00, 2.0e+00, 0.0e+00],
       [3.4e+01, 6.0e+00, 6.0e-04, ..., 2.0e+00, 1.0e+00, 0.0e+00],
       [4.4e+01, 1.0e+01, 1.0e-03, ..., 0.0e+00, 0.0e+00, 0.0e+00]])

In [113]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df['text'])

df_word_vect.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
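
To see what the numbers in a TF-IDF matrix mean, the weighting can be computed by hand on a few short documents. The sketch below follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that scikit-learn documents as its defaults. It is a simplification under stated assumptions: it splits on whitespace only and ignores n-grams, stop words and `max_features`, so it will not reproduce the matrix above. The function name `tfidf_rows` is our own.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["good morning", "good night", "sad morning"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Terms that appear in fewer documents, such as "night" here, receive a higher weight than terms shared across documents, such as "good".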

In [114]:
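# Character-level TF-IDF; stop_words only applies to the word analyzer, so it is omitted here.
# A separate variable keeps the fitted word-level vectorizer available.
tfidf_char = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3))
df_char_vect = tfidf_char.fit_transform(df['text'])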

df_char_vect.toarray()

array([[0.24763637, 0.        , 0.        , ..., 0.        , 0.08440112,
        0.        ],
       [0.22588248, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.17512614, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.22314039, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.18370694, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.28063543, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [115]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#

X = scipy.sparse.hstack([df_word_vect, df_char_vect, X_metadata])
X

<10000x2008 sparse matrix of type '<class 'numpy.float64'>'
	with 943618 stored elements in COOrdinate format>

In [116]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [117]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [118]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [119]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [120]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.7275
Logistic Regression Classifier: 
 0.7305


In [121]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[757 293]
 [252 698]]
Logistic Regression Classifier: 
 [[761 289]
 [250 700]]


In [122]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.74      1050
           4       0.70      0.73      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluating our Models**

* **Accuracy:** the percentage of texts that were assigned the correct class (sentiment).
* **Precision:** out of all the texts the classifier predicted for a class, the percentage that actually belong to that class.
* **Recall:** out of all the texts that actually belong to a class, the percentage the classifier predicted for that class.
* **F1 Score:** the harmonic mean of precision and recall.
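
These metrics can be recomputed by hand from a confusion matrix. The sketch below uses the logistic regression confusion matrix printed earlier, where rows are the true classes and columns are the predicted classes; the helper name `metrics_for_class` is our own.

```python
# Logistic regression confusion matrix from the output earlier in this notebook
# Rows are true classes (0, 4); columns are predicted classes (0, 4)
cm = [[761, 289],
      [250, 700]]

def metrics_for_class(cm, k):
    """Return precision, recall and F1 for the class at index k."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp   # predicted k, truly another class
    fn = sum(cm[k]) - tp                              # truly k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.4f}")
for k, label in enumerate([0, 4]):
    p, r, f = metrics_for_class(cm, k)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The values agree with the accuracy score and the classification report for logistic regression, which is a useful sanity check on how the report is derived.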

To improve our model, we can try other text processing techniques that would better prepare our data for fitting. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model, logistic regression, had an accuracy of 73.05%, and we can use it to classify new tweets. We can improve this performance through hyperparameter tuning and further feature engineering.