
Instructions

Background Information

The management of a marketing firm would like to track the sentiments of their
customers. This would help shorten the time it takes to act on
feedback.


# AfterWork Data Science: Getting Started with Text Analysis

### Prerequisites

In [None]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific computing

### 1. Importing our Data

In [None]:
# Question: Given a set of new tweets, create a sentiment analysis model that will 
# predict whether each tweet contains positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [None]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

This dataset needs some cleaning: the columns have no meaningful names, and several of them are not needed to create our model. We will rename the columns and drop the ones we don't need.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [None]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [None]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [None]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [None]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [None]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
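
Because the labels are 0 and 4 rather than 0 and 1, it can be convenient to remap them before modelling or reporting. The following is a minimal sketch in plain Python; `LABEL_MAP`, `LABEL_NAMES` and `to_binary` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical mapping from the dataset's codes to binary labels and names.
LABEL_MAP = {0: 0, 4: 1}
LABEL_NAMES = {0: "negative", 1: "positive"}

def to_binary(targets):
    """Map raw target codes (0 or 4) to 0/1; any other code raises KeyError."""
    return [LABEL_MAP[t] for t in targets]

sample_targets = [0, 4, 0, 0, 4]  # same codes as the first five rows above
binary = to_binary(sample_targets)
print(binary)
print([LABEL_NAMES[b] for b in binary])
```

On the DataFrame itself, `df['target'].map(LABEL_MAP)` should give the same result.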

In [None]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [None]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [None]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
# YOUR CODE GOES BELOW
#
## We will replace # with a space ' '
df['tweet_rp_hash'] = df.text.str.replace('#', ' ', regex=False)
df[['text', 'tweet_rp_hash']].sample(5)



Unnamed: 0,text,tweet_rp_hash
2703,sippin my wine..layin dwn in dark...take contr...,sippin my wine..layin dwn in dark...take contr...
5177,@g_willow There may be a statistic out there s...,@g_willow There may be a statistic out there s...
5269,"well well well, i am really praying for good t...","well well well, i am really praying for good t..."
4250,I want Ashley to come in Paris !!,I want Ashley to come in Paris !!
1474,which is the best twitter client ?? can somebo...,which is the best twitter client ?? can somebo...


In [None]:
## We will replace @ with a space ' '
df['tweet_rp_at'] = df.tweet_rp_hash.str.replace('@', ' ', regex=False)
df[['text', 'tweet_rp_at']].sample(5)

Unnamed: 0,text,tweet_rp_at
6004,Mmm chocolate chip cookies Thank you @rachela...,Mmm chocolate chip cookies Thank you rachela...
3087,not looking forward to tomorrow...how do you p...,not looking forward to tomorrow...how do you p...
8286,please feel free to comment on my pics. commen...,please feel free to comment on my pics. commen...
6547,@lindachka @shmalala 40 days till boys like an...,lindachka shmalala 40 days till boys like an...
7389,im at my uncles house im really bored!!,im at my uncles house im really bored!!
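
The two replacements can also be done in one pass. This sketch uses `str.translate` from the standard library; `MARKER_TABLE` and `strip_markers` are hypothetical names, and the sample tweet is made up for illustration.

```python
# Translation table that turns both '#' and '@' into a space in one pass.
MARKER_TABLE = str.maketrans({'#': ' ', '@': ' '})

def strip_markers(text):
    """Replace hashtag and mention markers with spaces, keeping the words."""
    return text.translate(MARKER_TABLE)

print(repr(strip_markers("@lindork Tres sad. #fail")))
```

Replacing with a space rather than an empty string leaves a leading space where a tweet started with `@`, which is why the later lowercase step also calls `str.strip()`.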


In [None]:
# Text Cleaning: Conversion to lowercase
# ---
# Before converting, we count the fully uppercase words in each tweet
# ---
#
df['no_of_uppercase'] = df.tweet_rp_at.apply(lambda x: len([w for w in x.split() if w.isupper()]))
df[['tweet_rp_at','no_of_uppercase']].sample(10)


Unnamed: 0,tweet_rp_at,no_of_uppercase
7188,"Beat_Control i can't, she's gone to bed",0
6289,I'm babysitting but the kids are asleep and I'...,1
1034,is going to miss the team when they leave tomo...,0
1762,Sitting one seat away from Jenn. Thanks to Mar...,0
6572,Enough of this pondering over puddingsunday I...,2
4638,i dread having the dreams i've always wanted t...,0
8672,Off to church.. 40 min late yes but umm better...,0
8814,JasonBradbury Waves to the audience,0
4306,lilyroseallen I hope all goes well on the 22 ...,1
62,"next DnD game that he's a player, he's going t...",0


In [None]:
# Text Cleaning: Conversion to lowercase
# Tweet text: strip leading/trailing whitespace and convert to lowercase
df['text'] = df.tweet_rp_at.str.strip().str.lower()
df.text

0       obama forges his muslim alliance against the c...
1       had the most spectacular prom ever  but now my...
2       i am overwhelmed today  taking a moment to eat...
3       lindork tres sad. i was totally a max fan.   s...
4       crap, i was counting down the hours until my d...
                              ...                        
9995    somehow survived the day without dying... of b...
9996    kpbslu06 booo!  i got it on hd plus peaking at...
9997    haha lina's hyper already well lucky you i'm i...
9998              omg really good day happened right here
9999    love2cookpie i saw you on division and 68th bu...
Name: text, Length: 10000, dtype: object

In [None]:
# Confirming that no uppercase words remain after cleaning
df['no_of_uppercase'] = df.text.apply(lambda x: len([w for w in x.split() if w.isupper()]))
df[['text','no_of_uppercase']].sample(10)

Unnamed: 0,text,no_of_uppercase
9371,helloxxtaylor thats exactly the type of person,0
9137,skdevitt nah it's be octy or otus,0
5386,says bein ill with pneumonia really does suck ...,0
6412,ã¯â¿â½i m?t quã¯â¿â½ ?au ??u ko th? ch?u n?i....,0
2254,"janabelle_xo nothing, just sitting in my rooom...",0
6047,its only midnight and i'm exhausted ! off to m...,0
5197,nkfan1 knlsmom i must've missed it too.,0
9699,bluesangel80 you got it.. and i keep my promises.,0
5158,good morning everyone :o i hate waking up earl...,0
7559,jonthanjay i have a hard time sleeping too. i ...,0


In [None]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordninja
# ---
#
! pip install wordninja


# ---
#


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
     |████████████████████████████████| 541 kB 9.8 MB/s 
Building wheels for collected packages: wordninja
  Building wheel for wordninja (setup.py) ... done
  Created wheel for wordninja: filename=wordninja-2.0.0-py3-none-any.whl size=541551 sha256=bbf46667a1c0fb30c4ad281fc642f063470901a910ce6513f2b3122c67daaf52
  Stored in directory: /root/.cache/pip/wheels/dd/3f/eb/a2692e3d2b9deb1487b09ba4967dd6920bd5032bfd9ff7acfc
Successfully built wordninja
Installing collected packages: wordninja
Successfully installed wordninja-2.0.0


In [None]:
# Importing those libraries
import wordninja

In [None]:
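# Performing the split
# ---
# Note: the result of each split is not stored, so this loop leaves
# df['tweet_rp_at'] unchanged, as the preview in the next cell confirms.
#

for wordstring in df['tweet_rp_at']:
    split = wordninja.split(wordstring)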


In [None]:
df['tweet_rp_at'].head(10)

0    Obama forges his Muslim alliance against the c...
1    Had the most spectacular prom ever  but now my...
2    I am overwhelmed today  taking a moment to eat...
3     lindork Tres sad. I was totally a Max fan.   ...
4    Crap, I was counting down the hours until my d...
5     DCBTV  DCBTV I had to go check some things, b...
6          smrorke why are you never on gmail anymore 
7     Alex_Jeffreys I'd have loved to have come, ju...
8            Brrrr ! Heading to work.... Chilly today 
9     gabriiiiella I neeed to talk to youu..  good ...
Name: tweet_rp_at, dtype: object
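
Splitting concatenated words can be thought of as searching for the cheapest way to cut a string into known words. The following is a toy stand-in written with the standard library only. It is not wordninja's implementation, and the vocabulary, costs and function name are hypothetical.

```python
# Hypothetical toy vocabulary, most frequent word first (rank is used as cost).
VOCAB = ["with", "love", "good", "morning", "day"]
COST = {word: rank + 1 for rank, word in enumerate(VOCAB)}
UNKNOWN_COST_PER_CHAR = 100  # heavy penalty so known words are preferred
MAX_WORD_LEN = 20

def segment(text):
    """Split `text` into the cheapest sequence of pieces under COST."""
    # best[i] = (total cost, start index of the last piece) for text[:i]
    best = [(0, 0)]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = text[start:end]
            cost = COST.get(piece, UNKNOWN_COST_PER_CHAR * len(piece))
            candidates.append((best[start][0] + cost, start))
        best.append(min(candidates))
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

print(segment("withlove"))
print(segment("goodmorning"))
```

With wordninja installed, keeping the result would look something like `df['tweet_rp_at'].apply(lambda t: ' '.join(wordninja.split(t)))`, assigned to a column of your choice.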

In [None]:
# Text Cleaning: Removing punctuation characters
# ---
# YOUR CODE GOES BELOW
#

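# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['tweet_rp_at'] = df.tweet_rp_at.str.replace(r'[^\w\s]', '', regex=True)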
df[['tweet_rp_at']].sample(5)


  


Unnamed: 0,tweet_rp_at
8615,megasaurus_x the last 2 brisbane i wanted to...
9830,withlovekristin I dont have a car today
2158,Sad that her nieces are gone
396,just call me gonks mc giverygonks lol L oh i ...
4995,Theres going to be a dance for the graduating ...
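
The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace. A small standard-library sketch makes the side effects visible; `remove_punctuation` is a hypothetical helper name and the first sample sentence is made up.

```python
import re

# Anything that is not a letter, digit, underscore or whitespace is removed.
PUNCT_PATTERN = re.compile(r'[^\w\s]')

def remove_punctuation(text):
    """Strip punctuation; underscores survive because \\w includes them."""
    return PUNCT_PATTERN.sub('', text)

print(remove_punctuation("Crap, I don't have a car today!"))
print(remove_punctuation("megasaurus_x"))
```

Apostrophes are removed too, so "don't" becomes "dont". This matters for the next step, because contractions in the stop word list such as "you're" no longer match the cleaned text.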


In [None]:
# Text Cleaning: Removing stop words
# ---
# YOUR CODE GOES BELOW
# 
#Import nltk
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
#checking stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
# Then let's see the no. of stopwords in the text
# ---
# 
df['no_of_stopwords'] = df.tweet_rp_at.apply(lambda x: len([x for x in x.split() if x in stop]))
df[['tweet_rp_at','no_of_stopwords']].sample(10)

Unnamed: 0,tweet_rp_at,no_of_stopwords
8072,karissamitha oh ya heard that b4 ada tmn lulu...,3
4064,garnettdc Ive tried that but I didnt stick 2 ...,4
6942,I cant wait to see bruno lt3,1
6877,kayteeeleanor ill write a letter of complaint...,8
5633,Azlen Get one of those biodegradable floating...,9
7689,Enterprise people are dropping like flies My g...,3
5317,RossBOnline I wouldnt say serious time maybe ...,5
9582,kayrbair I am just randomly quoting songs,2
9687,is wishing everyone a very beautiful and prosp...,6
957,zaheyraw too bad,1


In [None]:
# Removing stop words
df['tweet_rp_at'] = df.tweet_rp_at.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df[['tweet_rp_at']].sample(5)

Unnamed: 0,tweet_rp_at
4824,Someone bring mattress I dont wanna sleep floor
737,Having goodno great day already amp DrMiracles...
4877,DarlingNikiWyre Aww man You playing video game...
7919,still dont understand love soooo much
3124,csquaredsmiles obviously Ive forgotten I Ill u...


In [None]:
# Let's see if the stop words are there... 0 means we don't have any stopwords.
# ---
#
df['no_of_stopwords'] = df.tweet_rp_at.apply(lambda x: len([x for x in x.split() if x in stop]))
df[['tweet_rp_at','no_of_stopwords']].sample(5)

Unnamed: 0,tweet_rp_at,no_of_stopwords
4097,I think im getting sick,0
4507,hey Tweeties auntie jus got n bad car accident...,0
5070,dapiedra u r welcome,0
8103,brittgow reEdublogs TV isnt unfortunately wait...,0
568,LouLouK I gave Wired UK last issue mens mag ai...,0


In [None]:
# Text Cleaning: Stemming (used here in place of lemmatization)
# ---

# We will use PorterStemmer from the NLTK library to perform 
# stemming, so let's import it.
# ---
# 
from nltk.stem import PorterStemmer
st = PorterStemmer()



In [None]:
# Stemming
# ---
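# Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc.
# The Porter stemmer strips the end of a word using fixed rules and also lowercases it,
# so the result is not always a dictionary word (e.g. "college" becomes "colleg").
# We use stemming to group words that share the same root.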
# ---
#
df['stemming'] = df['tweet_rp_at'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
df[['tweet_rp_at', 'stemming']].sample(10)

Unnamed: 0,tweet_rp_at,stemming
7479,acalderwood yep wpg penticton amp brandon,acalderwood yep wpg penticton amp brandon
5649,pizza fam,pizza fam
2417,I think Im one left college,i think im one left colleg
6524,DavidArchie OMG THIS IS THE BEST HANNAH MONTAN...,davidarchi omg thi is the best hannah montana ...
3191,stephtheripper yeah lol president fan club sho...,stephtheripp yeah lol presid fan club shot sta...
2925,I cant believe james isnt giving ucla commence...,i cant believ jame isnt give ucla commenc spee...
5552,624 one year ago wish could go back time,624 one year ago wish could go back time
51,emzyjonas meee june 15th november 17th novembe...,emzyjona meee june 15th novemb 17th novemb 22n...
1980,I woke realizing history paper due like Tomorr...,i woke realiz histori paper due like tomorrow ...
4839,alannaaaa felt like earlier watching 90210 che...,alannaaaa felt like earlier watch 90210 cheer


We won't remove numbers because our text could lose meaning without them. We could also prepare our text further by performing spelling correction, but this is a resource-intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [None]:
# Feature Construction: Bag of words (token counts per tweet)
# ---
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
bag_of_words = count.fit_transform(df['stemming'])

# Show feature matrix / Previewing the created sparse matrix as a dense array
# ---
#
bag_of_words.toarray()
#


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
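
The matrix is mostly zeros because each tweet contains only a handful of the thousands of vocabulary terms. A hand-rolled version on two tiny documents shows what each row and column means. This is a simplified sketch with a hypothetical function name: unlike `CountVectorizer` with default settings, it splits on whitespace only and does not lowercase or drop one-character tokens.

```python
from collections import Counter

def bag_of_words_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    tokenized = [doc.split() for doc in docs]
    # Sorted vocabulary, so columns are in alphabetical order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

vocab, rows = bag_of_words_matrix(["pizza fam", "pizza pizza time"])
print(vocab)
print(rows)
```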

In [None]:
# Get feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = count.get_feature_names_out().tolist()

# View feature names
feature_names



['00',
 '000webhost',
 '06',
 '0630',
 '07',
 '09',
 '09011',
 '0f',
 '0ff',
 '0fletcher',
 '0mg',
 '0n',
 '0o0o0oh',
 '10',
 '100',
 '1000',
 '10000',
 '100000',
 '10000000',
 '1000000giraff',
 '1000am',
 '1000th',
 '1002',
 '100th',
 '100x',
 '101',
 '1012',
 '1014511',
 '1015',
 '1017',
 '1021pm',
 '103',
 '1030',
 '1030pm',
 '104',
 '1040pm',
 '105',
 '1099',
 '10am',
 '10k',
 '10km',
 '10monthold',
 '10pm',
 '10quot',
 '10th',
 '10x',
 '10year',
 '10yr',
 '11',
 '110',
 '1109pm',
 '110i',
 '111',
 '1112',
 '11394607',
 '116',
 '11am',
 '11pm',
 '11th',
 '11week',
 '12',
 '120',
 '1200',
 '123',
 '1230',
 '1236am',
 '124',
 '125',
 '1250',
 '1250am',
 '12am',
 '12hr',
 '12th',
 '13',
 '130',
 '132',
 '1337sauc',
 '134k',
 '136',
 '13th',
 '13whatthefuck',
 '14',
 '140',
 '1400',
 '140conf',
 '145',
 '14th',
 '15',
 '1500',
 '1501',
 '15328',
 '15c',
 '15min',
 '15quot',
 '15th',
 '15thxxx',
 '16',
 '160',
 '16th',
 '16yr',
 '17',
 '17472th',
 '17incher',
 '17th',
 '18',
 '180',
 '1

In [None]:
# Creating a dataframe to visualise our matrix
# ---
#
pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

Unnamed: 0,00,000webhost,06,0630,07,09,09011,0f,0ff,0fletcher,...,ãâãâãâãâ,ãâãâãâãâ¾ãâ¹,ãâãâãâãâãâ,ãâãâãâãâãâµ,ãâãâãâãâãâãâ,ãâãâããâºãâãâ¹,ããâºãâãâãâãâãâ,ããâãâ,ããâãâãâµ,ããâãâãâãâãâãâ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Feature Construction: Word count 
# ---
# YOUR CODE GOES BELOW
# 


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
# Feature Construction: Word density (Average no. of words / tweet)
# ---
# YOUR CODE GOES BELOW
#
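
One possible way to fill in the word count and word density cells is sketched below. The notebook does not give a formula for word density, so treating it as characters per word is an assumption, and both function names are hypothetical.

```python
def word_count(text):
    """Number of whitespace-separated tokens in the text."""
    return len(text.split())

def word_density(text):
    """Average number of characters per word; 0.0 for empty text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(word_count("pizza fam"), word_density("pizza fam"))
```

Applied to the DataFrame, these could be used as `df['stemming'].apply(word_count)` and `df['stemming'].apply(word_density)`.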


In [None]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we will download punkt and averaged_perceptron_tagger into our notebook environment,
# which will allow us to find the part-of-speech tags.
# ---
#


# We create a function to get the count of a given part-of-speech tag among the words in a sentence


In [None]:
# Noun Count
# ---
# YOUR CODE GOES BELOW
#


In [None]:
# Feature Construction: Verb count
# ---
# YOUR CODE GOES BELOW
#

In [None]:
# Feature Construction: Adjective count / Tweet
# ---
# YOUR CODE GOES BELOW
#


In [None]:
# Feature Construction: Adverb count / Tweet
# ---
# YOUR CODE GOES BELOW
#


In [None]:
# Feature Construction: Pronoun 
# ---
# YOUR CODE GOES BELOW
#


In [None]:
# Feature Construction: Subjectivity
# ---
# YOUR CODE GOES BELOW
# 


In [None]:
# Feature Construction: Polarity
# ---
# YOUR CODE GOES BELOW
# 


In [None]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# ---
# YOUR CODE GOES BELOW
#


In [None]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
# YOUR CODE GOES BELOW
# 
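
The two TF-IDF cells above could be filled in along the following lines. This is a sketch on three toy strings rather than the notebook's data. The variable names `df_word_vect` and `df_char_vect` match the ones a later cell expects, while the n-gram ranges and `max_features` values are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for df['stemming'].
texts = ["pizza fam",
         "wish could go back time",
         "one year ago wish could go back"]

# Word-level TF-IDF over unigrams and bigrams.
word_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=1000)
df_word_vect = word_vectorizer.fit_transform(texts)

# Character-level TF-IDF over 2- and 3-character sequences.
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3), max_features=1000)
df_char_vect = char_vectorizer.fit_transform(texts)

print(df_word_vect.shape, df_char_vect.shape)
```

Both results are sparse matrices with one row per text, so they can be stacked side by side with other per-tweet features.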


In [None]:
# Let's prepare the constructed features for modeling
# ---
#
X_metadata = np.array(df.iloc[:, 2:12])
X_metadata

In [None]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

In [None]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [None]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

In [None]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [None]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

In [None]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

In [None]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

**Evaluating our Models**

* **Accuracy:** the percentage of texts that were assigned the correct class (positive or negative).
* **Precision:** for each class, the percentage of texts predicted as that class that actually belong to it.
* **Recall:** for each class, the percentage of texts belonging to that class that the model predicted as that class.
* **F1 Score:** the harmonic mean of precision and recall.
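
These metrics can be computed by hand from the counts of true and false positives and negatives. The sketch below does so for the positive class (label 4) on a small made-up set of labels, with F1 taken as the harmonic mean of precision and recall. The function name is hypothetical, and the numbers are not results from the notebook.

```python
def binary_metrics(y_true, y_pred, positive=4):
    """Accuracy, plus precision/recall/F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: the model finds only half of the positive tweets.
y_true_demo = [4, 4, 4, 4, 0, 0, 0, 0]
y_pred_demo = [4, 4, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

Here precision is 1.0 and recall is 0.5, yet F1 is about 0.67 rather than 0.75, because the harmonic mean is pulled toward the lower of the two values.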

To improve our model, we can try performing other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model had an accuracy of 73.25%, and we recommend using it to classify new tweets. We can improve this performance with hyperparameter tuning and further feature engineering.