<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [216]:
# Importing the required libraries
import pandas as pd # library for data manipulation
import numpy as np  # librariy for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific conputing 

### 1. Importing our Data

In [217]:
# Question: Given a new tweets, create a sentiment analysis model that will 
# predict whether a tweet will contain positive or negative sentiment.
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the civilized world - http://tinyurl.com/pqcops . And he didn't even drop in for a cup of tea
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,"Had the most spectacular prom ever but now my bed is serenading me and i must answer, sweet dreams my friends what a wonderful day"
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat and pray!!!!
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #SYTYCD
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my dad could come home &amp; help me watch my son but now he said is going out to dinner first"


### 2. Data Exploration

In [218]:
# We can determine the size of our dataset
df.shape

(10000, 7)

Seems this dataset will need some data cleaning i.e. columns. We also don't need some columns to perform create our model. We will drop those columns.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [None]:
# We rename the columns for ease of referencing our columns later on
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

In [None]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

In [None]:
# Understanding the distribution of target
df.target.value_counts() 

In [222]:
# Let's determine whether our columns have the right data types
df.dtypes

target     int64
text      object
dtype: object

In [223]:
# What values are in our target variable?
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 

In [224]:
# Let's check for missing values 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [225]:
# Text Cleaning: Removing all urls/links
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the civilized world - . And he didn't even drop in for a cup of tea
1,"Had the most spectacular prom ever but now my bed is serenading me and i must answer, sweet dreams my friends what a wonderful day"
2,I am overwhelmed today taking a moment to eat and pray!!!!
3,@lindork Tres sad. I was totally a Max fan. #SYTYCD
4,"Crap, I was counting down the hours until my dad could come home &amp; help me watch my son but now he said is going out to dinner first"


In [226]:
# Text Cleaning: Removing @ and # characters or replace them with space
df['text'] = df.text.str.replace('#','')
df['text'] = df.text.str.replace('@','')
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the civilized world - . And he didn't even drop in for a cup of tea
1,"Had the most spectacular prom ever but now my bed is serenading me and i must answer, sweet dreams my friends what a wonderful day"
2,I am overwhelmed today taking a moment to eat and pray!!!!
3,lindork Tres sad. I was totally a Max fan. SYTYCD
4,"Crap, I was counting down the hours until my dad could come home &amp; help me watch my son but now he said is going out to dinner first"


In [227]:
# Text Cleaning: Conversion to lowercase
df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))
df[['text']].sample(10)

Unnamed: 0,text
6536,elliottkember don't think it's going to be as easy as that
4526,pinkypenny =o wut is wit u and talking to my mens
498,"_annella that's the rumor, at least. if it turns out to be true, i will probably cry or something dramatic. boys."
4591,italian websites do not like connections from the us.
4890,annoyer i've only seen star trek (excellent movie). up should be a riot. birthday best wishes to the 3 stars-of-the-day! wish i could go
5987,therealtiffany aww thats a nice pic i luv u guys so much
5958,ms_geey give me the review later yak..
5961,allycupcake well i'll check it out. see when you'll be in my area. i think it would be neat. see if i cant get a front row spot
5925,drvictor i'll ponder it victor..
5922,its still at work


In [228]:
# Text Cleaning: Splitting concatenated words

# Installing wordnija and textblob
!pip3 install wordninja
!pip3 install textblob

# Importing those libraries
import wordninja 
from textblob import TextBlob



In [None]:
# Performing the split
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))
df['text'] = df.text.str.join(' ')
df[['text']].sample(10)

In [230]:
# Text Cleaning: Removing punctuation characters
df['text'] = df.text.str.replace('[^\w\s]','')

In [231]:
# Text Cleaning: Removing stop words
#We first import a list of stopwords in English
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
#Removing the stop words
df['text'] = df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df[['text']].sample(5)

In [None]:
# Text Cleaning: Lemmatization

# For lemmatization, we will need to download wordnet

nltk.download('wordnet')

from textblob import Word

# Lemmatizing our text
df['text'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text']].sample(10)

We won't remove numerics because we could loose meaning of our text if we lost the numerics. We could also further prepare our text by performing spelling correction but this is a resource intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [None]:
# Feature Construction: Length of tweet
# ---
# YOUR CODE GOES BELOW

df['length_of_tweet'] = df.text.str.len()
pd.set_option('display.max_colwidth', None) #view the full column width
df[['text','length_of_tweet']].sample(5)


In [None]:
# Feature Construction: Word count 
df['word_count'] = df.text.apply(lambda x: len(str(x).split(" ")))
df[['text', 'word_count']].sample(5)

In [236]:
# Feature Construction: Word density (Average no. of words / tweet)
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0 
  return z

df['avg_word_length'] = df.text.apply(lambda x: avg_word(x))
df[['text','avg_word_length']].sample(5)


Unnamed: 0,text,avg_word_length
5918,benny bugatti ben rub,4.5
519,important role finished,7.0
8141,wonder screamo hottie pant kyle show tonight,5.428571
8648,star mile e omg thats sad love baby happen sad,3.7
2199,254 mocha charlie yes working,5.0


In [237]:
# Feature Construction: Noun count
#We download the punkt and the averaged_perceptron_tagger into our notebook environment. 
# which will allow us to find the part of speech tags.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [238]:
# We create the function to check and get the part of speech tag count of a words in a given sentence
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt

In [239]:
# Noun Count
df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))
df[['text','noun_count']].sample(5)

Unnamed: 0,text,noun_count
933,jing new mix car woo p woo p,6
2121,wish dine ring,2
666,hope get job dow hall need job acting career isnt going well,7
879,burnt roof mouth,2
5262,travis garland hey n lt f rig awesome got sister hooked guy im soo sad,9


In [240]:
# Feature Construction: Verb count
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))
df[['text','verb_count']].sample(5)

Unnamed: 0,text,verb_count
1988,whitaker actually see marketing peep eye glaze bring quo p amp l quo learn peep learn,3
8193,ray beauty 222 word thats fair want job tell monica hook,2
1276,everyones sunday coming along im excited lot show scheduled week,3
1549,really looking forward start working 5 30 morning guess dont really choice,3
9940,official mg n fox good j ee bus beautiful,0


In [241]:
# Feature Construction: Adjective count / Tweet
df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))
df[['text','adj_count']].sample(5)


Unnamed: 0,text,adj_count
2263,friday almost 60 follower today 50 guy like,0
6636,love girlfriend really everything,0
4519,ivan harris hey go last night expecting see broadway,1
8790,cute guy wasnt montagues tonight 2 sufficient eye candy evening,1
1805,hate closing,0


In [242]:
# Feature Construction: Adverb count / Tweet
df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))
df[['text','adv_count']].sample(5)


Unnamed: 0,text,adv_count
799,rainn wilson working new film new episode office really kill,1
9985,flu im gonna h shopping time sister ooo im damn excited,0
6572,enough pondering pudding sunday need sunshine lolly tt fn,2
1642,tweeted nearly enough ge meri around tweet shes gone work 20 min boo pata pon fucking awesome,2
2547,awa seer ive recently done post like wait,1


In [243]:
# Feature Construction: Pronoun 
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))
df[['text','pron_count']].sample(5)


Unnamed: 0,text,pron_count
7384,x scrunch ix dont want anymore,0
8277,larry larry see blue screen reflection man bridge glass,0
9270,im heartbroken discover dairy queen plan return cotton candy blizzard ever,0
6669,sad lost engagement ring,0
2042,im sorry cant keep way used update come trio nd account,0


In [182]:
# Feature Construction: Subjectivity
def get_subjectivity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

df['subjectivity'] = df.text.apply(get_subjectivity)
df[['text', 'subjectivity']].sample(5)


Unnamed: 0,text,subjectivity
7100,left home itty salad,0.0
5869,cant sleep without seth,0.0
4344,ted roddy site checks errors html web cant make site works 000 web host,0.0
7228,sitting bonfire w tibet,0.0
448,eating papa lentil crackers,0.0


In [244]:
# Feature Construction: Polarity
def get_polarity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol

df['polarity'] = df.text.apply(get_polarity)
df[['text', 'polarity']].sample(5)


Unnamed: 0,text,polarity
4969,ya wanna stalk j k mma go shop pin q ill water tower lol add samar 14 hotmail com,0.0
5750,morning tweet well ive got work bright sunny day shame work,0.0
3949,still process making tough choice,0.0
4536,im dreading going work,0.0
2071,watching trailer various movie drinking morning coffee,0.0


In [246]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Feature Construction: Word Level N-Gram TF-IDF Feature 
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df['text']) 
df_word_vect.toarray()
tfidf.get_feature_names()
pd.DataFrame(df_word_vect.toarray(), columns=tfidf.get_feature_names())




Unnamed: 0,00,09,10,100,11,12,13,15,16,17,19,20,2009,21,22,24,30,40,50,aaa,aaaaaa,able,account,actually,ad,adam,add,added,afternoon,age,ago,agree,ah,aha,ahh,ahh hh,air,airport,al,album,...,work,work today,working,world,worry,worse,worst,worth,wouldnt,wow,write,wrong,wu,ww,www,xd,xo,xx,xxx,ya,yay,ye,yea,yeah,year,yep,yes,yesterday,yo,youll,young,youre,youtube,youve,yr,yummy,yy,yyyy,zz,zzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.413509,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [247]:
# Feature Construction: Character Level N-Gram TF-IDF 
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3),  stop_words= 'english')
df_char_vect = tfidf.fit_transform(df['text'])
df_char_vect.toarray()

array([[0.2442019 , 0.        , 0.        , ..., 0.        , 0.08323056,
        0.        ],
       [0.23113135, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.17551132, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.21992303, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.18421584, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.28106559, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [250]:
# Let's prepare the constructed features for modeling
X_metadata = np.array(df.iloc[:, 2:12])
X_metadata

array([[67.        , 11.        ,  5.18181818, ...,  2.        ,
         0.        ,  0.        ],
       [81.        , 12.        ,  5.83333333, ...,  1.        ,
         0.        ,  0.        ],
       [40.        ,  6.        ,  5.83333333, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [45.        ,  8.        ,  4.75      , ...,  2.        ,
         0.        ,  0.        ],
       [34.        ,  6.        ,  4.83333333, ...,  1.        ,
         0.        ,  0.        ],
       [44.        , 10.        ,  3.5       , ...,  0.        ,
         0.        ,  0.        ]])

In [251]:
# We combine our two tfidf (sparse) matrices and X_metadata
X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

<10000x2009 sparse matrix of type '<class 'numpy.float64'>'
	with 938102 stored elements in COOrdinate format>

In [253]:
# Getting our response variable
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [254]:
# Splitting our data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [255]:
# Fitting our model
# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [256]:
# Making predictions
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [257]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.7285
Logistic Regression Classifier: 
 0.734


In [258]:
# Confusion matrices
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[762 288]
 [255 695]]
Logistic Regression Classifier: 
 [[765 285]
 [247 703]]


In [259]:
# Classification Reports
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.73      0.74      1050
           4       0.71      0.73      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.76      0.73      0.74      1050
           4       0.71      0.74      0.73       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluation our Models**

* **Accuracy:** the percentage of texts that were assigned the correct topic.
* **Precision:** the percentage of texts the classifier classified correctly out of the total number of texts it predicted for each topic
* **Recall:** the percentage of texts the model predicted for each topic out of the total number of texts it should have predicted for that topic.
* **F1 Score:** the average of both precision and recall.

To improve our model, we can try perfoming other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model had an accuracy of 73.25% and use it for classifying newer tweets. We can improve this performance by performing hyperparameter tuning and feature engineering methods. 