
<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [None]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy.sparse # scientific computing (sparse matrices are used later)

### 1. Importing our Data

In [None]:
# Question: Given new tweets, create a sentiment analysis model that will 
# predict whether a tweet contains positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [None]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

This dataset needs some cleaning: the file has no proper header row, so the first tweet record was read as the column names. We also don't need every column to create our model, so we will rename the columns and drop the ones we don't need.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [None]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [None]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [None]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [None]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [None]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. A target value of 0 marks a text with negative sentiment, while a value of 4 marks a text with positive sentiment.
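For readability, the numeric codes can be mapped to label names. The sketch below uses a toy frame in place of `df`; the `sentiment` column name and the `label_names` dictionary are assumptions made for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical data)
toy = pd.DataFrame({'target': [0, 4, 0, 4]})

# Map the numeric codes to readable labels (0 = negative, 4 = positive)
label_names = {0: 'negative', 4: 'positive'}
toy['sentiment'] = toy['target'].map(label_names)

print(toy)
```

The models below are trained on the original 0/4 codes, so this mapping is only a convenience for inspecting the data.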

In [None]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [None]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [None]:
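# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# YOUR CODE GOES BELOW

# Replace '#' first, then apply the '@' replacement to the result
# (reading from df.text again would discard the first replacement)
df['tweet_clean'] = df.text.str.replace('#', ' ', regex=False)
df['tweet_clean'] = df.tweet_clean.str.replace('@', ' ', regex=False)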


df[['tweet_clean']].sample(5)


Unnamed: 0,tweet_clean
8719,paulknebel - boo. my jam
3843,Has just had a row with her mum : . Annoyed
7947,o_O -- This morning the pain is manifesting in...
3619,mcinnes oh dear hamish'll get his back soon ...
5642,I wanna see my kidssss.


In [None]:
# Text Cleaning: Conversion to lowercase
# ---
# YOUR CODE GOES BELOW

df['tweet_clean'] = df.tweet_clean.apply(lambda x: " ".join(x.lower() for x in x.split()))
df[['tweet_clean']].sample(5)


Unnamed: 0,tweet_clean
3562,grace notes nyc oh hope worked okay thanks try...
6765,theater today friends
2473,sitting car outside c hippie whilst wife joins...
4526,pinky penny o wu wit u talking mens
2872,weather crap today


In [None]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordninja and textblob
!pip3 install wordninja
!pip3 install textblob


# Importing those libraries
import wordninja 
from textblob import TextBlob




In [None]:
# Performing the split

df['tweet_clean'] = df.tweet_clean.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['tweet_clean'] = df.tweet_clean.str.join(' ')
df[['tweet_clean']].sample(10) 


Unnamed: 0,tweet_clean
5090,adrienne bailon yea heard feel like im person ...
7521,discovered fairly impressive gash leg upon wak...
1350,x miss katie x but i icon make sense
9837,got done working im sore
274,hay duchovny hayley updates showing people pub...
5813,jay handsome lol sup mr handsome
3988,kr n karina want something fight maybe fightin...
3594,est swiss im fine sun came back sweden today d...
4772,smelling good food across street gonna break s...
6487,millie magsaysay wha aaa at text lt 3


In [None]:
# Text Cleaning: Removing punctuation characters
# ---
# YOUR CODE GOES BELOW

# regex=True is required for the pattern to be treated as a regular expression
df['tweet_clean'] = df.tweet_clean.str.replace(r'[^\w\s]', '', regex=True)
df[['tweet_clean']].sample(5)


Unnamed: 0,tweet_clean
4753,elle c tro cutie sick baby he s trouble sleepi...
879,burnt roof mouth
6048,rob st witt oh fuck hah
9098,s tucking tong liao not happy
4369,alicia loves j ls yeahh h time ago


In [None]:
# Text Cleaning: Removing stop words
# ---
# YOUR CODE GOES BELOW

# Importing the natural language tool kit and downloading the stopwords
import nltk
nltk.download('stopwords')

# Importing a list of English stopwords

from nltk.corpus import stopwords
stop = stopwords.words('english')

# Removing the stopwords
df['tweet_clean'] = df.tweet_clean.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df[['tweet_clean']].sample(5)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,tweet_clean
4123,davina sky apprentice two hours going tired tom
1029,eggs hall 5 c mon
6521,nancy yeh thanks didnt enough miles upgrade 1 ...
6560,il 429 fine kinder coughing lot sleep weathers...
7838,unfortunately isnt nice right


In [None]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW

# For lemmatization, we will need to download wordnet
nltk.download('wordnet')
from textblob import Word

# Lemmatizing our text
df['lemmatization'] = df.tweet_clean.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['tweet_clean', 'lemmatization']].sample(10)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,tweet_clean,lemmatization
7686,sang chattanooga tn hungry b c b racy last ate 10,sang chattanooga tn hungry b c b racy last ate 10
6861,avalon 789 lucky make sure treats right hah aha,avalon 789 lucky make sure treat right hah aha
1235,heather official make people drink petrol set ...,heather official make people drink petrol set ...
153,postal gu las things going lately read u reall...,postal gu la thing going lately read u really ...
9545,3 58 sleep site wouldnt bad except need finish...,3 58 sleep site wouldnt bad except need finish...
3886,time time gott bit nasty curse name blah hh,time time gott bit nasty curse name blah hh
283,isaac sousa v,isaac sousa v
4767,getting new twitter need followers please,getting new twitter need follower please
8767,ed burton free version germany yet pay,ed burton free version germany yet pay
6548,alright alright fuck dont hit cars im ya,alright alright fuck dont hit car im ya


We won't remove numerics because the text could lose meaning without them. We could also further prepare our text by performing spelling correction, but this is a resource-intensive process that we will skip for now.
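The cleaning steps above can be collected into one function so that the same preprocessing can be applied to any new tweet. This is a minimal sketch rather than the notebook's exact pipeline: `clean_tweet` and `TOY_STOPWORDS` are hypothetical names, the stop-word set is a tiny illustrative stand-in for NLTK's English list, the sample tweet is made up, and the word-splitting and lemmatization steps are left out.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list
TOY_STOPWORDS = {'i', 'a', 'the', 'was', 'to', 'is', 'my', 'that'}

def clean_tweet(text, stopwords=TOY_STOPWORDS):
    """Apply the main cleaning steps to a single tweet (sketch)."""
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'[@#]', ' ', text)            # replace @ and # with spaces
    text = text.lower()                          # convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)          # strip punctuation
    words = [w for w in text.split() if w not in stopwords]
    return ' '.join(words)

# Made-up example tweet
print(clean_tweet("@someone Tres sad. I was totally a fan! #fail http://example.com/x"))
```

With a DataFrame, a function like this could be applied with `df.text.apply(clean_tweet)`.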

#### Feature Engineering Techniques 

In [None]:
# Feature Construction: Length of tweet
# ---
# YOUR CODE GOES BELOW

def tweet_length(sentence):
  words = sentence.split()
  z = (sum(len(word) for word in words))
  return z

df['tweet_length'] = df.tweet_clean.apply(lambda x: tweet_length(x))
df[['tweet_clean','tweet_length']].sample(5)


Unnamed: 0,tweet_clean,tweet_length
1746,dub lover tried upload one song sisters laptop...,77
4010,gareth cliff gareth durban heavenly today,36
7229,got proper nights sleep last night whole eight...,44
1737,think one jake ryan,16
6163,todays mm tz thank song music via music stall,37


In [None]:
# Feature Construction: Word count 
# ---
# YOUR CODE GOES BELOW

df['word_count'] = df.tweet_clean.apply(lambda x: len(str(x).split(" ")))
df[['tweet_clean', 'tweet_length', 'word_count']].sample(5)

Unnamed: 0,tweet_clean,tweet_length,word_count
1406,ac tully gutted katy perry gig another 11 week...,56,12
2565,cunning hamster moved back feb gave 6 months,37,8
518,meg urban know feeling,19,4
9618,decided might good idea misjudge hair amp prod...,72,14
6495,yup predicted macbook ran space could render d...,49,8


In [None]:
# Feature Construction: Average word length (characters per word in a tweet)
# ---
# YOUR CODE GOES BELOW

def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0 
  return z

df['avg_word_length'] = df.tweet_clean.apply(lambda x: avg_word(x))
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length']].sample(5)


Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length
6674,sum 1 talk 2 please,15,5,3.0
7553,sad professional someday id really like share ...,74,13,5.692308
11,318 carsten injured cool didnt catch one xs n ...,51,12,4.25
4696,wishing barney happy birthday tomorrow get wis...,46,8,5.75
7143,go watch twilight ill back mo ring tweet goodbye,40,9,4.444444


In [None]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we will download punkt and the averaged_perceptron_tagger into our notebook environment,
# which will allow us to find the part-of-speech tags.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


# We create a function that counts the words in a given sentence that carry a given part-of-speech tag
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except Exception:
        # If tagging fails, return the count accumulated so far (usually 0)
        pass
    return cnt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# Noun Count
# ---
# YOUR CODE GOES BELOW

df['noun_count'] = df.tweet_clean.apply(lambda x: pos_check(x, 'noun'))
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length', 'noun_count']].sample(10)


Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length,noun_count
4369,alicia loves j ls yeahh h time ago,27,8,3.375,5
449,rb cort quo rb vi seven tenths oh yeah quo cor...,72,21,3.428571,13
3943,purple bint dont nobody workplace time friday ...,70,13,5.384615,9
3630,friend hope thank god 4 hope lol,26,7,3.714286,4
3252,polo 65 th right,13,4,3.25,1
701,fat man johnson atl hah needs cop younger sister,40,9,4.444444,5
6589,caustic cay la get wear one super fancy hair f...,61,13,4.692308,8
9707,allie power happy birth da yyyy yyyy,30,7,4.285714,5
9899,0 breaker tried midway take nwa 1 stop saves 2...,67,17,3.941176,5
8473,stephen harris pesto curry never thought add t...,73,14,5.214286,7


In [None]:
# Feature Construction: Verb count
# ---
# YOUR CODE GOES BELOW

df['verb_count'] = df.tweet_clean.apply(lambda x: pos_check(x, 'verb'))
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length', 'noun_count','verb_count']].sample(5)

Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length,noun_count,verb_count
6912,niels decided instead make pancakes home lazy ...,44,8,5.5,5,2
5703,malcolm ingram,13,2,6.5,2,0
570,hello slow getting going today feel like curli...,52,11,4.727273,5,5
4879,think im one ppl world loves mondays oh love m...,43,10,4.3,5,2
4642,slice coffee cake im snack particularly yummy ...,63,12,5.25,6,1


In [None]:
# Feature Construction: Adjective count / Tweet
# ---
# YOUR CODE GOES BELOW

df['adj_count'] = df.tweet_clean.apply(lambda x: pos_check(x, 'adj'))
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length', 'noun_count', 'verb_count', 'adj_count']].sample(10)


Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length,noun_count,verb_count,adj_count
4221,sore sluggish magenta cried put e bay guess fe...,61,13,4.692308,4,3,4
2986,hate storms,10,2,5.0,2,0,0
5079,shaun divine love,15,3,5.0,2,0,1
4707,seem elie indeed,14,3,4.666667,1,1,0
5752,ugh management project back tech tomorrow im g...,56,11,5.090909,5,1,4
9531,awake mt wanna get petty fer today x,29,8,3.625,5,2,1
7668,finished show donnie back city home,30,6,5.0,3,2,0
5372,muc bru la 3 hours flight went missing long ti...,60,13,4.615385,6,3,3
1393,tizzy sizzle berg ok well tube strike next thu...,75,16,4.6875,5,2,4
5239,photo julia coleman bought new nordstrom yeste...,44,7,6.285714,5,1,1


In [None]:
# Feature Construction: Adverb count / Tweet
# ---
# YOUR CODE GOES BELOW

df['adv_count'] = df.tweet_clean.apply(lambda x: pos_check(x, 'adv'))
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length', 'noun_count', 'verb_count', 'adj_count', 'adv_count']].sample(10)


Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length,noun_count,verb_count,adj_count,adv_count
3423,star laced ms mia u u u u saw called yesterday...,43,13,3.307692,7,2,4,0
6927,c go w voting starts hour vote,24,7,3.428571,4,2,1,0
5137,serial seb lol realised peter bells quo requir...,65,11,5.909091,8,1,2,0
2512,enjoying glass pinot noir chilling long day,37,7,5.285714,3,2,1,0
4292,marketers n cel ebs ruined twitter,29,6,4.833333,3,2,1,0
4485,angry snout loved comic much,24,5,4.8,1,1,3,0
7887,got hair cut almost persuaded mum let go satur...,69,16,4.3125,6,5,2,3
744,x lainey x hey,11,4,2.75,3,0,1,0
6090,fashion enemy good morning,23,4,5.75,2,0,1,1
1714,happy squared real lii dont get,26,6,4.333333,3,1,2,0


In [None]:
# Feature Construction: Pronoun 
# ---
# YOUR CODE GOES BELOW

df['pron_count'] = df.tweet_clean.apply(lambda x: pos_check(x, 'pron'))
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length', 'noun_count', 'verb_count', 'adj_count', 'adv_count', 'pron_count']].sample(10)


Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length,noun_count,verb_count,adj_count,adv_count,pron_count
2283,something wrong twitter app,24,4,6.0,3,0,1,0,0
335,working holiday houses damn infinite loops,37,6,6.166667,3,2,1,0,0
9129,nea gy cody line ly pali would love give info ...,67,18,3.722222,9,2,0,4,0
9796,oh mexican martini good,20,4,5.0,2,0,2,0,0
2858,damn gonna rain day tomorrow,24,5,4.8,4,1,0,0,0
3347,real jordin last year looks,23,5,4.6,3,0,2,0,0
1962,arch eu th 1 away twitter 12 hrs crazy need he...,75,19,3.947368,9,3,2,1,0
8794,tumble moose thanks,17,3,5.666667,1,0,2,0,0
2439,ms ebony luv sorry delay lifting weights break...,70,14,5.0,8,2,2,2,0
9607,got home sleepy today fun though,27,6,4.5,3,2,0,0,0


In [None]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW

def get_subjectivity(text):
    textblob = TextBlob(text)
    subj = textblob.sentiment.subjectivity
    return subj

df['subjectivity'] = df.tweet_clean.apply(get_subjectivity)
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length', 'noun_count', 'verb_count', 'adj_count', 'adv_count', 'pron_count', 'subjectivity']].sample(10)


Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length,noun_count,verb_count,adj_count,adv_count,pron_count,subjectivity
2040,going purse find nail clippers found hole lini...,65,14,4.642857,5,3,4,1,0,0.6
4433,spencer pratt country code phone number ill ca...,53,10,5.3,7,1,2,0,0,1.0
6384,wolverine really good,19,3,6.333333,1,0,1,1,0,0.6
3818,work 5 trying go sleep,18,5,3.6,1,2,1,0,0,0.0
818,drummer g 217 lucky meet though fully believe,38,8,4.75,3,1,1,1,0,0.833333
8106,going movies hannah jordan victoria,31,5,6.2,3,2,0,0,0,0.0
407,productivity shot week trying tomorrow,34,5,6.8,4,1,0,0,0,0.0
817,laura bouvier x done guys fun left,28,7,4.0,4,2,1,0,0,0.1
2564,tired drunk night bus home,22,5,4.4,4,0,1,0,0,0.85
1412,jordan knight choosing pic please please next ...,66,13,5.076923,7,3,3,0,0,0.25641


In [None]:
# Feature Construction: Polarity
# ---
# YOUR CODE GOES BELOW

def get_polarity(text):
    textblob = TextBlob(text)
    pol = textblob.sentiment.polarity
    return pol

df['polarity'] = df.tweet_clean.apply(get_polarity)
df[['tweet_clean', 'tweet_length', 'word_count', 'avg_word_length', 'noun_count', 'verb_count', 'adj_count', 'adv_count', 'pron_count', 'subjectivity', 'polarity']].sample(5)


Unnamed: 0,tweet_clean,tweet_length,word_count,avg_word_length,noun_count,verb_count,adj_count,adv_count,pron_count,subjectivity,polarity
5982,sad face back work tom 4 days break kkk kkk kk...,40,12,3.333333,7,1,2,1,0,0.5,-0.25
6133,still cant believe summer im programed correctly,42,7,6.0,2,2,1,2,0,0.0,0.0
6379,bed got early day moro est london snatch car b...,48,11,4.363636,6,2,2,1,0,0.15,0.05
8971,ginger cm 2 prizes left first party second one...,75,20,3.75,8,5,4,0,0,0.306667,0.293333
5328,ashleigh 92 okay baby nothing really gonna go ...,55,12,4.583333,3,3,1,4,0,0.433333,0.4


In [None]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---
# YOUR CODE GOES BELOW
# Importing the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_tweets_vect = tfidf.fit_transform(df['tweet_clean'])

# Show feature matrix / Previewing the created sparse matrix
#
df_tweets_vect.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
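To make the numbers in a TF-IDF matrix less opaque, the sketch below computes one by hand for three made-up documents. It uses a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which to my understanding is what `TfidfVectorizer` does with its default settings; treat it as an illustration of the idea rather than a replacement for the library. All names here (`docs`, `doc_freq`, `tfidf_row`) are hypothetical.

```python
import math
from collections import Counter

# Made-up corpus of already-cleaned documents
docs = ["good day", "bad day", "good good morning"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Term counts times idf, scaled so the row has unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Words that occur in fewer documents (such as `bad`) get a larger weight than words shared across documents (such as `day`), and most entries are zero, which is why the real matrix is stored in sparse form.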

In [None]:
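# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() is its replacement
tfidf.get_feature_names_out().tolist()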



['00',
 '09',
 '10',
 '100',
 '11',
 '12',
 '13',
 '15',
 '16',
 '19',
 '20',
 '2009',
 '21',
 '22',
 '24',
 '30',
 '40',
 '50',
 'aaa',
 'aaaa',
 'aaaaaa',
 'able',
 'account',
 'actually',
 'ad',
 'adam',
 'add',
 'added',
 'afternoon',
 'ago',
 'agree',
 'ah',
 'aha',
 'ahh',
 'ahh hh',
 'air',
 'airport',
 'al',
 'album',
 'alex',
 'alright',
 'amazing',
 'amp',
 'andy',
 'angel',
 'anna',
 'answer',
 'anymore',
 'apparently',
 'apple',
 'arent',
 'arm',
 'art',
 'ashley',
 'ask',
 'asleep',
 'ate',
 'aw',
 'awake',
 'awards',
 'away',
 'awesome',
 'ay',
 'baby',
 'bad',
 'bag',
 'ball',
 'band',
 'bar',
 'bb',
 'beach',
 'bear',
 'beat',
 'beautiful',
 'bed',
 'bee',
 'believe',
 'ben',
 'best',
 'better',
 'big',
 'bike',
 'bird',
 'birthday',
 'bit',
 'black',
 'blog',
 'bloody',
 'blue',
 'body',
 'boo',
 'book',
 'books',
 'bored',
 'boring',
 'bought',
 'bout',
 'box',
 'boy',
 'boys',
 'break',
 'breakfast',
 'bring',
 'broke',
 'broken',
 'brother',
 'brothers',
 'brown',
 

In [None]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
# YOUR CODE GOES BELOW

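# stop_words only applies to the word analyzer, so it is omitted here.
# A separate name keeps the fitted word-level vectorizer (tfidf) available.
tfidf_char = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3))
df_char_vect = tfidf_char.fit_transform(df['tweet_clean'])
df_char_vect.toarray()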


array([[0.24731812, 0.        , 0.        , ..., 0.        , 0.08428422,
        0.        ],
       [0.2254523 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.17457094, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.22579768, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.1835366 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.2803263 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [None]:
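# Let's prepare the constructed features for modeling
# ---
# This step is skipped: the slice below starts at the text columns
# (tweet_clean, lemmatization), so it produces an object array rather than
# a numeric one. The output below shows the result of running it anyway.
# X_metadata = np.array(df.iloc[:, 2:12])
# X_metadata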

array([['obama forges muslim alliance civilized world didnt even drop cup tea',
        'obama forge muslim alliance civilized world didnt even drop cup tea',
        58, ..., 1, 1, 0],
       ['spectacular prom ever bed serenading must answer sweet dreams friends wonderful day',
        'spectacular prom ever bed serenading must answer sweet dream friend wonderful day',
        72, ..., 2, 1, 0],
       ['overwhelmed today taking moment eat pray',
        'overwhelmed today taking moment eat pray', 35, ..., 0, 0, 0],
       ...,
       ['hah lin hyper already well lucky im college',
        'hah lin hyper already well lucky im college', 36, ..., 2, 2, 0],
       ['omg really good day happened right',
        'omg really good day happened right', 29, ..., 2, 1, 0],
       ['love 2 cook pie saw division 68 th didnt see',
        'love 2 cook pie saw division 68 th didnt see', 35, ..., 0, 0, 0]],
      dtype=object)
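One way to avoid the object array is to select only the numeric feature columns before converting to NumPy. The sketch below shows the idea on a toy frame; the frame, the identity matrix standing in for a TF-IDF matrix, and the names `X_text` and `X_combined` are assumptions for illustration. On the real `df`, the `target` column would also have to be excluded, since it is numeric too.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy frame standing in for df: two text columns plus two numeric features
toy = pd.DataFrame({
    'tweet_clean': ['good day', 'bad day'],
    'lemmatization': ['good day', 'bad day'],
    'word_count': [2, 2],
    'polarity': [0.7, -0.7],
})

# Keep only numeric columns so the array is float, not object
X_metadata = toy.select_dtypes(include='number').to_numpy(dtype=float)

# Placeholder sparse matrix standing in for a TF-IDF matrix
X_text = sparse.csr_matrix(np.eye(2))

# Wrap the dense block as sparse so hstack returns a single sparse matrix
X_combined = sparse.hstack([X_text, sparse.csr_matrix(X_metadata)]).tocsr()

print(X_combined.shape)
```

A caveat if these features were added to the models below: `MultinomialNB` rejects negative inputs, so a feature such as polarity (which ranges from -1 to 1) would need rescaling first.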

In [None]:
# We combine our two TF-IDF (sparse) matrices; X_metadata is left out because that step was skipped
# ---
#
X = scipy.sparse.hstack([df_tweets_vect, df_char_vect])
X

<10000x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 877768 stored elements in COOrdinate format>

In [None]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [None]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
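The split above is random, so the class proportions in the test set can drift from those in the full data (the reports further down show 1050 negative and 950 positive test tweets, against 5067 and 4933 overall). Passing `stratify` to `train_test_split` keeps the proportions aligned. A minimal sketch on toy data, with hypothetical variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 50 per class, labelled 0 and 4 as in the notebook
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 50 + [4] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both classes are equally represented in the test set
print((y_te == 0).sum(), (y_te == 4).sum())
```

With data as close to balanced as this dataset, the effect is small, so this is an optional refinement rather than a correction.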

In [None]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [None]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [None]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.734
Logistic Regression Classifier: 
 0.7305


In [None]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[766 284]
 [248 702]]
Logistic Regression Classifier: 
 [[760 290]
 [249 701]]


In [None]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.76      0.73      0.74      1050
           4       0.71      0.74      0.73       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluating our Models**

* **Accuracy:** the percentage of tweets that were assigned the correct sentiment class.
* **Precision:** of the tweets the classifier assigned to a class, the percentage that truly belong to that class.
* **Recall:** of the tweets that truly belong to a class, the percentage the classifier assigned to that class.
* **F1 Score:** the harmonic mean of precision and recall.
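These definitions can be checked by hand against the Naive Bayes confusion matrix printed earlier. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes. The sketch below treats class 4 (positive sentiment) as the "positive" class, which is a choice made here for illustration.

```python
# Naive Bayes confusion matrix from above
#        pred 0  pred 4
cm = [[766, 284],   # true 0
      [248, 702]]   # true 4

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of tweets predicted as 4, how many are truly 4
recall = tp / (tp + fn)      # of tweets that are truly 4, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 2), round(recall, 2), round(f1, 2))
```

The results agree with the accuracy score and with the row for class 4 in the Naive Bayes classification report.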

To improve our model, we can try performing other text processing techniques that would better prepare our data for fitting. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.
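As an illustration of hyperparameter tuning, the sketch below runs a small grid search over the smoothing parameter `alpha` of `MultinomialNB`. The data is synthetic and the grid values are arbitrary assumptions; with the notebook's data, `X_train` and `y_train` would take the place of `X_toy` and `y_toy`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative features standing in for the TF-IDF matrix
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(200, 20))
y_toy = np.where(X_toy[:, 0] > X_toy[:, 1], 4, 0)

# Candidate smoothing values (arbitrary choices for illustration)
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0]}

# 5-fold cross-validated search, scored by accuracy
search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to `LogisticRegression`, for example by searching over its regularization strength `C`.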

### 5. Recommendations


Our best model, the Naive Bayes classifier, reached an accuracy of 73.4% on the test set, and we can use it to classify new tweets. We can try to improve this performance through hyperparameter tuning and further feature engineering.
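To classify new tweets, the fitted vectorizer and the classifier have to be applied together, in the same order as during training. One convenient way to do that is a scikit-learn pipeline. The sketch below is a simplified stand-in for the model in this notebook: it uses a single word-level TF-IDF step and a handful of made-up, already-cleaned tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned training tweets; 0 = negative, 4 = positive
train_texts = ['sad tired sick', 'hate rain sad', 'love happy fun', 'great day love']
train_labels = [0, 0, 4, 4]

# The pipeline vectorizes the text and then fits the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New tweets go through the same vectorizer before prediction
new_tweets = ['happy fun day', 'sick sad rain']
print(model.predict(new_tweets))
```

New tweets should go through the same cleaning steps as the training data before being passed to `predict`.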