<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [11]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # librariy for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific conputing 

### 1. Importing our Data

In [23]:
# Question: Given a new tweets, create a sentiment analysis model that will 
# predict whether a tweet will contain positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#

# see entire column content in the dataframe
pd.set_option('display.max_columns', None)  

df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [24]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

Seems this dataset will need some data cleaning i.e. columns. We also don't need some columns to perform create our model. We will drop those columns.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [25]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.sample(10)

Unnamed: 0,id,target,t_id,created_at,query,user,text
3959,1276060,4,2001165871,Tue Jun 02 00:02:11 PDT 2009,NO_QUERY,sabrinasg,@m4s Looking forward to it
8777,480659,0,2179337746,Mon Jun 15 08:58:14 PDT 2009,NO_QUERY,skpargania,I am very tired
7327,623899,0,2229720785,Thu Jun 18 16:06:46 PDT 2009,NO_QUERY,supergeeker,"@ackstay yeah, um, she left like, insides &amp..."
8920,556123,0,2204214768,Wed Jun 17 01:37:29 PDT 2009,NO_QUERY,swellvintage,@jamiesmart Still no internet soz. I WILL be ...
6278,586708,0,2216055355,Wed Jun 17 18:58:53 PDT 2009,NO_QUERY,JLSapphire914,Somebody send me a picture text message cuz I ...
6759,838099,4,1558975990,Sun Apr 19 09:55:23 PDT 2009,NO_QUERY,H4L3Yx,jus woke upp eatin breakfast at noon! i love w...
8466,313442,0,2001752926,Tue Jun 02 01:57:26 PDT 2009,NO_QUERY,weezii_d,yeah assignments
869,1130980,4,1975786067,Sat May 30 15:54:02 PDT 2009,NO_QUERY,mmystifier,Now back at home....poker stars on Tv
4012,410949,0,2059992814,Sat Jun 06 18:02:56 PDT 2009,NO_QUERY,animarieee,Where is everyone?
2911,404343,0,2058399791,Sat Jun 06 14:51:55 PDT 2009,NO_QUERY,mayank25may,@Krishna_Teja dude steve is not giving the key...


In [26]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.sample(10)

Unnamed: 0,target,text
8796,4,told my mum my maths result. she said she'll s...
2597,4,it's my day off.. finally!! a nice rainy day 4...
9056,0,@Kristie999 Thanks girl i did over just a litt...
573,0,"I have no money to actually buy playboy, other..."
5183,4,at home thank god.
445,0,sry guys some probs wid yahoo messenger... can...
5430,0,scared of tornado warnings
2577,4,"@ilex_ Hey you Seeing clearly yet? LOL Nah,..."
2566,0,Updated my blog with Good Bye Gary aka the auc...
8434,0,"Uhhh I have a fever now, I can't stop sweating..."


In [27]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [28]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [29]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 

In [31]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [32]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [42]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# Remove @

# df['no_of_ampersats'] = df.text.apply(lambda x: len([x for x in x.split() if x.startswith('@')]))
# df[['text','no_of_ampersats']].head(10)

df['text'] = df.text.str.replace('@',' ')
df.head(5)



Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [43]:
# Remove #

df['text'] = df.text.str.replace('#',' ')
df.head(5)

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,lindork Tres sad. I was totally a Max fan. ...
4,0,"Crap, I was counting down the hours until my d..."


In [45]:
# Text Cleaning: Conversion to lowercase
# ---
# Count number of text in uppercase
df['no_of_uppercase'] = df.text.apply(lambda x: len([x for x in x.split() if x.isupper()]))
df[['text','no_of_uppercase']].head(10)


Unnamed: 0,text,no_of_uppercase
0,Obama forges his Muslim alliance against the c...,0
1,Had the most spectacular prom ever but now my...,0
2,I am overwhelmed today taking a moment to eat...,1
3,lindork Tres sad. I was totally a Max fan. ...,2
4,"Crap, I was counting down the hours until my d...",1
5,"DCBTV DCBTV I had to go check some things, b...",3
6,smrorke why are you never on gmail anymore,0
7,"Alex_Jeffreys I'd have loved to have come, ju...",0
8,Brrrr ! Heading to work.... Chilly today,0
9,gabriiiiella I neeed to talk to youu.. good ...,1


In [47]:

# Lowercasing Text
# ---
df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))
df[['text']].head(10)

Unnamed: 0,text
0,obama forges his muslim alliance against the c...
1,had the most spectacular prom ever but now my ...
2,i am overwhelmed today taking a moment to eat ...
3,lindork tres sad. i was totally a max fan. sytycd
4,"crap, i was counting down the hours until my d..."
5,"dcbtv dcbtv i had to go check some things, buy..."
6,smrorke why are you never on gmail anymore
7,"alex_jeffreys i'd have loved to have come, jus..."
8,brrrr ! heading to work.... chilly today
9,gabriiiiella i neeed to talk to youu.. good ne...


In [49]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take few minutes...
# ---


# Installing wordnija and textblob
# ---
#

!pip3 install wordninja
!pip3 install textblob


# Importing those libraries
# ---
#
import wordninja 
from textblob import TextBlob

[33mDEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621[0m[33m
[33mDEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621[0m[33m
You should consider upgrading via the '/usr/local/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.[0m[33m
[0m[33mDEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621[0m[33m
[33mDEPRECATION: Configuring installation sche

In [51]:
# Performing the split
# ---
#
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')
df[['text']].sample(10) 

Unnamed: 0,text
5783,tom mcfly i wish i'd be there with you guys
8719,paul k ne bel boo my jam
3003,dom stark i'd forgotten about that i'm not wor...
3057,lar on james get meh wet like haw t sex i moan...
2712,n azzi e 86 sorry i haven't exactly had a lot ...
8734,tired dd work til 8 then a lovely 4 hour drive...
1034,is going to miss the team when they leave tomo...
3060,i wanna go para sailing and jet skiing in bora...
1408,had her ipod playing so loud that her ears hur...
2790,mort ici a 626 we're about to board the plane ...


In [53]:
# Text Cleaning: Removing punctuation characters
# ---
# Count number of punctuation chars
df['punctuation_count'] = df.text.apply(lambda x: len("".join(_ for _ in x if _ in x.split()))) 
df[['text', 'punctuation_count']].head(10)


Unnamed: 0,text,punctuation_count
0,obama forges his muslim alliance against the c...,9
1,had the most spectacular prom ever but now my ...,14
2,i am overwhelmed today taking a moment to eat ...,9
3,lin dork tres sad i was totally a max fan sytycd,8
4,crap i was counting down the hours until my da...,8
5,dc b tv dc b tv i had to go check some things ...,6
6,s mr or ke why are you never on gmail anymore,1
7,alex jeffrey s i'd have loved to have come jus...,12
8,br rrr heading to work chilly today,0
9,ga bri iii ella i nee ed to talk to you u good...,7


In [55]:
# Removing Punctuation Characters
# ---
# 
df['text'] = df.text.str.replace('[^\w\s]','')
df[['text']].head(10)

  df['text'] = df.text.str.replace('[^\w\s]','')


Unnamed: 0,text
0,obama forges his muslim alliance against the c...
1,had the most spectacular prom ever but now my ...
2,i am overwhelmed today taking a moment to eat ...
3,lin dork tres sad i was totally a max fan sytycd
4,crap i was counting down the hours until my da...
5,dc b tv dc b tv i had to go check some things ...
6,s mr or ke why are you never on gmail anymore
7,alex jeffrey s id have loved to have come just...
8,br rrr heading to work chilly today
9,ga bri iii ella i nee ed to talk to you u good...


In [59]:
# Text Cleaning: Removing stop words
# ---
# Import natural language tooklit (nltk) library
# 

import nltk
nltk.download('stopwords')
# import a list of stopwords 
from nltk.corpus import stopwords
stop = stopwords.words('english')
# View the stopwords
stop


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Barayne/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [60]:
# Finding Stop Words
df['no_of_stopwords'] = df.text.apply(lambda x: len([x for x in x.split() if x in stop]))
df[['text','no_of_stopwords']].head(10)

Unnamed: 0,text,no_of_stopwords
0,obama forges his muslim alliance against the c...,9
1,had the most spectacular prom ever but now my ...,13
2,i am overwhelmed today taking a moment to eat ...,5
3,lin dork tres sad i was totally a max fan sytycd,3
4,crap i was counting down the hours until my da...,14
5,dc b tv dc b tv i had to go check some things ...,7
6,s mr or ke why are you never on gmail anymore,6
7,alex jeffrey s id have loved to have come just...,10
8,br rrr heading to work chilly today,1
9,ga bri iii ella i nee ed to talk to you u good...,4


In [61]:
# Remove the stopwords
df['text'] = df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df[['text']].head(10)

Unnamed: 0,text
0,obama forges muslim alliance civilized world d...
1,spectacular prom ever bed serenading must answ...
2,overwhelmed today taking moment eat pray
3,lin dork tres sad totally max fan sytycd
4,crap counting hours dad could come home amp he...
5,dc b tv dc b tv go check things buy others loo...
6,mr ke never gmail anymore
7,alex jeffrey id loved come couple unfortunate ...
8,br rrr heading work chilly today
9,ga bri iii ella nee ed talk u good new sss


In [63]:
# Conirm stopwords have been removed
df['no_of_stopwords'] = df.text.apply(lambda x: len([x for x in x.split() if x in stop]))
df[['text','no_of_stopwords']].head(10)

Unnamed: 0,text,no_of_stopwords
0,obama forges muslim alliance civilized world d...,0
1,spectacular prom ever bed serenading must answ...,0
2,overwhelmed today taking moment eat pray,0
3,lin dork tres sad totally max fan sytycd,0
4,crap counting hours dad could come home amp he...,0
5,dc b tv dc b tv go check things buy others loo...,0
6,mr ke never gmail anymore,0
7,alex jeffrey id loved come couple unfortunate ...,0
8,br rrr heading work chilly today,0
9,ga bri iii ella nee ed talk u good new sss,0


In [72]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW
#

# For lemmatization, we will need to download wordnet
#
# import wordnet
nltk.download('wordnet')

# import 0mw-1.4
nltk.download('omw-1.4')


[nltk_data] Downloading package wordnet to /Users/Barayne/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/Barayne/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [75]:
# Lemmatizing our text
# ---
df['lemmatization'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text', 'lemmatization']].head(10)

Unnamed: 0,text,lemmatization
0,obama forges muslim alliance civilized world d...,obama forge muslim alliance civilized world di...
1,spectacular prom ever bed serenading must answ...,spectacular prom ever bed serenading must answ...
2,overwhelmed today taking moment eat pray,overwhelmed today taking moment eat pray
3,lin dork tres sad totally max fan sytycd,lin dork tres sad totally max fan sytycd
4,crap counting hours dad could come home amp he...,crap counting hour dad could come home amp hel...
5,dc b tv dc b tv go check things buy others loo...,dc b tv dc b tv go check thing buy others look...
6,mr ke never gmail anymore,mr ke never gmail anymore
7,alex jeffrey id loved come couple unfortunate ...,alex jeffrey id loved come couple unfortunate ...
8,br rrr heading work chilly today,br rrr heading work chilly today
9,ga bri iii ella nee ed talk u good new sss,ga bri iii ella nee ed talk u good new ss


We won't remove numerics because we could loose meaning of our text if we lost the numerics. We could also further prepare our text by performing spelling correction but this is a resource intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [78]:
# Feature Construction: Length of tweet
# ---
df['length_of_tweet'] = df.text.str.len()
df[['text','length_of_tweet']].sample(10)

Unnamed: 0,text,length_of_tweet
4812,dev yn burton far away hey im glad made even b...,72
2527,mandy 3000,10
9991,gona watch true blood starting,30
5236,ryan seacrest rip adam,22
2714,michelle mistake get mask,25
9010,got home caroline wedding,25
76,famous pen name lost please help find good home,47
4787,loves chocolate milk gf yeah,28
172,ive got wine laptop dvr full quo worlds dumbes...,63
2572,gods gift 2 even im sleep im work n quo beauty...,56


In [None]:
# Feature Construction: Word count 
# ---



In [88]:
# Feature Construction: Word density (Average no. of words / tweet)
# ---

def avg_word(sentence):
  words = sentence.split()
  return (sum(len(word) for word in words)/len(words))

df['avg_word_length'] = df.text.apply(lambda x: avg_word(x))
df[['text','avg_word_length']].sample(5)

ZeroDivisionError: division by zero

In [86]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we will download the punkt and the averaged_perceptron_tagger into our notebook environment. 
# which will allow us to find the part of speech tags.
# ---
#
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/Barayne/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/Barayne/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [84]:
# We create the function to check and get the part of speech tag count of a words in a given sentence
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt

In [87]:
# Noun Count
# ---
df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))
df[['text','noun_count']].sample(10)

Unnamed: 0,text,noun_count
8881,finishing talking everything fine strange thin...,6
9318,summer kick suite freedom tonight eh still sha...,7
5559,dougie mcfly mcfly day brazil special mcfly ho...,5
8000,kenner jacobs indeed coming see u b 4 u left w...,6
4626,kev b due nt sad,2
3137,blog hopping ill never get tired everyday,2
5077,time warner peeps without power right,5
2999,upset kicked peyton lucas one tree hill mik call,3
8730,moving process officially begins new apt sweet,2
1087,toyota gladstone try rub spent afternoon pools...,12


In [89]:
# Feature Construction: Verb count
# ---
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))
df[['text','verb_count']].sample(10)

Unnamed: 0,text,verb_count
322,ri krak shop thanks much,0
6280,kfar 10 wu h wu h wu h lauren lol,1
5151,go lee cy go lee cy yes girl l im dead serious,4
7815,im bus,0
4502,sun scrape bai im paris anymore na ko tet bai,0
4743,zaibatsu arent looking,1
7030,need get together,1
1990,google app engine doesnt support j axb rest ea...,1
9507,lance armstrong agree please show,0
6161,tom mcfly went hilton three times stayed day u...,1


In [90]:
# Feature Construction: Adjective count / Tweet
# ---
df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))
df[['text','adj_count']].sample(10)


Unnamed: 0,text,adj_count
9911,bobby l lew lister k ry ten live one night,2
4688,arm deacon night,0
2560,morning im soo glad sunny today makes smile tw...,2
4595,rubber pacers suck 6 mouth pain eat anything,0
8009,miss clar cal,2
1156,laura walker xo sister hah beck nelson going g,0
1660,eh,0
8521,david archie hi really dont anything say wanna...,2
6505,sold 1 st car ebay pleasure surprise yay shall,2
5912,pitch engine lol charge head heart didnt even ...,3


In [91]:
# Feature Construction: Adverb count / Tweet
# ---
df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))
df[['text','adv_count']].sample(10)

Unnamed: 0,text,adv_count
7564,im sjsu studying reg holy cow reg looks tough,0
5646,laundry done ironing packing,0
80,going vegas maybe maybe ill two months,2
701,fat man johnson atl hah needs cop younger sister,0
2329,wanna take sun bath backyard go dentist well m...,3
8281,tomorrow tomorrow love ya tomorrow remember th...,0
8012,bore eeeeee e ed,1
1399,lovin new blackberry,0
5114,beautiful day airsoft,0
5635,foto lex ic sadly thats whats happening good,1


In [92]:
# Feature Construction: Pronoun 
# ---
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))
df[['text','pron_count']].sample(10)

Unnamed: 0,text,pron_count
9159,ugh bronchitis horrible going bed,0
1772,stretch able pants butt uncomfortable,0
8724,first online purchase im flickr pro,0
3672,ram las thx follow friday recommendation,0
0,obama forges muslim alliance civilized world d...,0
7707,finished treasury lots colors back work,0
770,go banana tailpipe said hell stick ports,0
79,unn allman yeah looks like quo busy quo fuckin...,1
709,neither sol j el job looks like problem dentist,0
69,sore throat runny nose lethargy fun,0


In [100]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW
# 
def get_subjectivity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

df['subjectivity'] = df.text.apply(get_subjectivity)
df[['text', 'subjectivity']].sample(10)

Unnamed: 0,text,subjectivity
9588,jem maha tty really hope joke j emma today fir...,0.0
9334,kids australia green tea designs thanks girls ...,0.0
9855,telecom bum im showing youve online consistent...,0.0
2385,x r sorry bad night,0.0
1693,mr jack enjoy rush work last series unfortunat...,0.0
7490,kelly dot compton going oh thats everyone skin...,0.0
8681,thats good sleep shouldnt started morning eati...,0.0
7880,dd lovato words sad buu u,0.0
1028,got home chinese food put shay las info blast ...,0.0
6787,st b jordan know deadlines multi task theyll hear,0.0


In [130]:
# Feature Construction: Polarity
# ---
def get_polarity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol

df['polarity'] = df.text.apply(get_polarity)
df[['text', 'polarity']].sample(10)


Unnamed: 0,text,polarity
536,terrence j 106 drunk hell yes laid gimme,0.0
3300,princess 2802 hah miss boy smells matter got a...,0.0
9793,n takaya could much fun testing different play...,0.0
477,hey tear catcher theyve belgium shows,0.0
2761,listen please anyone,0.0
9628,bl ka demi c eh en fo di ou ke gen yon le glis...,0.0
9296,baz anna mother laughing lol bl c gr lt 3 ssc,0.0
3150,watching nutty madams videos,0.0
2020,twitter titter started nothing still,0.0
1604,sad beautiful get local recommendations change...,0.0


In [131]:
# Library for TD-IDF
from sklearn.feature_extraction.text import TfidfVectorizer 

In [132]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---

# from nltk import word_tokenize, ngrams

# list(ngrams(word_tokenize(df['text'][0]), 2)) 

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df.text) 



In [133]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
# list(ngrams(df['text'][0], 2))
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3),  stop_words= 'english')
df_char_vect = tfidf.fit_transform(df.text)


In [134]:
# Let's prepare the constructed features for modeling
# ---
#
X_metadata = np.array(df.iloc[:, 2:12])
X_metadata

array([[0, 9,
        'obama forges his muslim alliance against the civilized world and he didnt even drop in for a cup of tea',
        ..., 3, 1, 1],
       [0, 14,
        'had the most spectacular prom ever but now my bed is serenading me and i must answer sweet dreams my friends what a wonderful day',
        ..., 3, 2, 1],
       [1, 9, 'i am overwhelmed today taking a moment to eat and pray',
        ..., 2, 0, 0],
       ...,
       [0, 5, 'hah a linas hyper already well lucky you im in college',
        ..., 1, 2, 2],
       [0, 0, 'omg really good day happened right here', ..., 1, 2, 1],
       [0, 7,
        'love 2 cook pie i saw you on division and 68 th but you didnt see me',
        ..., 3, 0, 0]], dtype=object)

In [136]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

TypeError: no supported conversion for types: (dtype('float64'), dtype('float64'), dtype('O'))

In [109]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [111]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

NameError: name 'X' is not defined

In [None]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

In [None]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [None]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

In [None]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

In [None]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

**Evaluation our Models**

* **Accuracy:** the percentage of texts that were assigned the correct topic.
* **Precision:** the percentage of texts the classifier classified correctly out of the total number of texts it predicted for each topic
* **Recall:** the percentage of texts the model predicted for each topic out of the total number of texts it should have predicted for that topic.
* **F1 Score:** the average of both precision and recall.

To improve our model, we can try perfoming other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model had an accuracy of 73.25% and use it for classifying newer tweets. We can improve this performance by performing hyperparameter tuning and feature engineering methods. 