<a href="https://colab.research.google.com/github/fkivuti/Getting-Started-With-Text-Analysis/blob/main/Getting_Started_with_Text_Analysis_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [1]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # librariy for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific conputing 

### 1. Importing our Data

In [43]:
# Question: Given a new tweets, create a sentiment analysis model that will 
# predict whether a tweet will contain positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [44]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

Seems this dataset will need some data cleaning i.e. columns. We also don't need some columns to perform create our model. We will drop those columns.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [45]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [46]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [47]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [7]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [8]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 

In [9]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [48]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [49]:
# Text Cleaning: Removing @ and # characters or replace them with space
# ---
df['text'] = df['text'].str.replace('[#@]', '')
df.head()


Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,lindork Tres sad. I was totally a Max fan. SY...
4,0,"Crap, I was counting down the hours until my d..."


In [50]:
# Text Cleaning: Conversion to lowercase
# ---
df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))
df[['text']].sample(5)


Unnamed: 0,text
291,mariahcarey i think suddenly 30 is a cute eter...
4426,zigzag_girl im gonna im getting a new laptop i...
828,to anyone who ordered my brushes before june 5...
3220,im sleepy too. i tried napping it didn't work.
3130,"svallie looks good, but somehow cthulhu and pi..."


In [51]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take few minutes...
# ---
# YOUR CODE GOES BELOW
!pip3 install wordninja
!pip3 install textblob


# Importing those libraries
# ---
#
import wordninja 
from textblob import TextBlob



In [52]:
# Performing the split
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')
df[['text']].sample(10)


Unnamed: 0,text
2485,she turn out fine at least she didn't annoy me...
9185,and it is coming soon
7595,tom mcfly prog s
9822,feel like ive been at work all my life somebod...
4998,jbl over 1494 yes i know sometimes italy seems...
1351,steno the social network usability principles ...
1702,be ke meyer i think too early to tell there's ...
2365,on vacation easter is here woo
88,he yy t witz x
393,it's kind of cool outside which is fine by me ...


In [53]:
# Text Cleaning: Removing punctuation characters
# ---
df['text'] = df.text.str.replace('[^\w\s]','')
df[['text']].sample(5)


Unnamed: 0,text
5725,waiting for sandy
9122,got charger shes using ear f ones guess ill ha...
5089,ted the bear 999 the world tis crazy yyyy yyyy...
1191,jonas brothers gu u ys why your youtube accoun...
9336,i never even got to visit lol


In [57]:
# Text Cleaning: Removing stop words
# we will use the natural language tooklit (nltk) library
import nltk
nltk.download('stopwords')

# import a list of stopwords in english
from nltk.corpus import stopwords
stop = stopwords.words('english')

# remove the stop words
df['text'] = df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df[['text']].sample(5)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text
8516,mary j artist lol missed u last night miss mary
1462,iona aaa omg say thing
146,arrived mums place birthday diner pancakes sti...
413,hey souljaboytellem yo ooo come e back nyc im ...
4438,shannon renee yes im getting touch rural hillb...


In [60]:
# Text Cleaning: Lemmatization
# ---
nltk.download('wordnet')
from textblob import Word


# Lemmatizing our text
df['text'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text']].sample(10)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text
6195,tabby bottom hi love
7776,smell nice trying perfume boot could spent muc...
3457,flash da jag war think thats like rancho n iii...
5923,two knotty boy damn wish staying ca little longer
3938,put hack phone install unsigned apps signed ap...
8241,got burned b c stupid forgot sunscreen chest
7459,work depressing hell want someone f uk come ho...
7847,story far never came 2 time saw mx px went alone
2308,cant get ticket see jane addiction
6009,got back vince last night living long day pack...


We won't remove numerics because we could loose meaning of our text if we lost the numerics. We could also further prepare our text by performing spelling correction but this is a resource intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [62]:
# Feature Construction: Length of tweet
# ---
df['length_of_text'] = df.text.str.len()
df[['text','length_of_text']].sample(5)


Unnamed: 0,text,length_of_text
5208,mmm butter toast wanna see electro anna e dani...,87
1821,going g sunset tt shop p ping shit miss b,41
5265,al gore disappointed least 10 x follower ashto...,55
4693,greek dude account everywhere livejournal mysp...,74
4080,rather hungover,15


In [63]:
# Feature Construction: Word count 
# ---
df['word_count'] = df.text.apply(lambda x: len(str(x).split(" ")))
df[['text', 'word_count']].sample(5)


Unnamed: 0,text,word_count
8402,tom mcfly next year year arena tour x,8
2142,got home,2
4763,naw id see ca describe issue find trouble yet,9
271,good morning nice day today taking mango park,8
3597,mj newham stupidly went 24 month contract oh w...,17


In [75]:
# Feature Construction: Word density (Average no. of words / tweet)
# ---
df['word_density'] = df['word_count'].sum()/df['word_count'].count()
df['word_density']

0       9.0412
1       9.0412
2       9.0412
3       9.0412
4       9.0412
         ...  
9995    9.0412
9996    9.0412
9997    9.0412
9998    9.0412
9999    9.0412
Name: word_density, Length: 10000, dtype: float64

In [64]:
# Feature Construction: Noun count
# ---
# First, we will download the punkt and the averaged_perceptron_tagger into our notebook environment. 
# which will allow us to find the part of speech tags.
# ---
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


# We create the function to check and get the part of speech tag count of a words in a given sentence
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [65]:
# Noun Count
# ---
df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))
df[['text','noun_count']].sample(10)


Unnamed: 0,text,noun_count
9685,one hundred foot oh really tough one think ang...,5
8048,kl,1
2598,chill axing g gg,3
3217,give strength,1
5327,ronald heft hate towards fsb also spy master f...,8
2088,chris daughtry dont know youve done already id...,6
1518,zara x im creep didnt block baha xxx,8
5974,sims 3 tomorrow,2
4025,r city rock city vi glad made back safely missed,6
388,high high life high party invite pen fume yuk ...,9


In [66]:
# Feature Construction: Verb count
# ---
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))
df[['text','verb_count']].sample(10)

Unnamed: 0,text,verb_count
5906,thanks mel u hit da spot,1
6258,peter fac ell must busy return love,1
1382,good morning guy,0
1033,woo oo bruin celt part tonight,1
6325,im bit behind stavro flat ley excellent bg,0
4247,jessie bay lin way 2 rock golf stance jessie b...,1
9465,simple conf u im winn im los smile,0
3595,fleet week glad take ri bridge work,1
3597,mj newham stupidly went 24 month contract oh w...,2
8144,decided stop eating fish could become true veg...,2


In [67]:
# Feature Construction: Adjective count / Tweet
# ---
df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))
df[['text','adj_count']].sample(10)


Unnamed: 0,text,adj_count
516,cath ster morn u ever fix ur sound problem for...,4
3552,love love love studio 44 cast crew thanks ever...,0
4882,photo via fuck yeah miley cyrus miley body hot...,3
4117,arni bella yup fun watching 21,1
2485,turn fine least didnt annoy malay h w 11 09 pm...,4
2566,updated blog good bye gary aka auction rebel,2
8362,wu hum mm wait cum power like bf hate,0
5436,get known radio l mao 0 0 wat zach cuter name ...,0
468,venomous one prince charming,1
5410,mandy 29 havent spoken age,0


In [68]:
# Feature Construction: Adverb count / Tweet
# ---
df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))
df[['text','adv_count']].sample(10)


Unnamed: 0,text,adv_count
5283,mr wiggin z hey mr nt actually month thought i...,3
1231,x skyline like life ambition chosen career per...,0
9728,getting lunch francesca roman grandma,0
6452,br end yn left early cry,0
7525,amanda holden good city serious shopping lol h...,1
2602,tik shi right live music everywhere back miss,3
5532,sims 3 doesnt come new zealand 5 th june boo,0
9522,pick nose much,1
9980,going relax weekend start packing room,0
2971,kay ley bum im actual shit scared,0


In [69]:
# Feature Construction: Pronoun 
# ---
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))
df[['text','pron_count']].sample(10)


Unnamed: 0,text,pron_count
9313,dealing mistake idiot colleague made draining,0
9694,fingertip holding crack foundation know let go...,0
9917,miley cyrus mandy pl eeee see e make another v...,0
9427,lovely feel like crap birthday isnt lovely,0
4773,leaving leanns house going boardwalk,0
872,jamie malt man im looking best day go think im...,0
8139,didnt even get free vegetable first day pick job,0
2289,work temp fine pure javascript time rd g salsa...,0
3411,ca pe still hanging saturday night x oxo,0
4608,costa vida fred um molly called back said deli...,0


In [71]:
# Feature Construction: Subjectivity
# ---
# Function to get subjectivity of text using the module textblob
def get_subjectivity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

df['subjectivity'] = df.text.apply(get_subjectivity)
df[['text', 'subjectivity']].sample(10)


Unnamed: 0,text,subjectivity
9523,teeth insist falling apart,0.0
6565,ada wada towel quo sub eth sen mati c quo guide,0.0
8847,twi crack addict ww fair guess vote either,0.0
6330,switched morning shift tomorrow wednesday awes...,0.0
5381,sivan aish good morning,0.0
933,jing new mix car woo p woo p,0.0
2310,god feel like shit never watch jur r asic park...,0.0
9822,feel like ive work life somebody txt f id like...,0.0
6722,dd lovato aw cant wait see love demi dream day...,0.0
7704,really dont want work pool tomorrow,0.0


In [70]:
# Feature Construction: Polarity
# ---
# Function to get polarity of text using the module textblob
def get_polarity(text):
    try:
        textblob = TextBlob(unicode(text, 'utf-8'))
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol

df['polarity'] = df.text.apply(get_polarity)
df[['text', 'polarity']].sample(10)


Unnamed: 0,text,polarity
1186,computer today great,0.0
4939,oh summer come quickly,0.0
1027,nick carter say hi sissy kiss aaa bed time man...,0.0
8551,1 st day work,0.0
2817,watching hannah montana love hannah montana gt...,0.0
8411,im finally home cant get online,0.0
2701,came back wwdc tampa fl find local burger loca...,0.0
8707,brian clayton ahh studying going enough time s...,0.0
2724,ka vo 830 acceptable,0.0
3689,bubble boy good luck,0.0


In [78]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 
# Importing the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df['text'])

# Show feature matrix / Priviewing the created sparse matrix
#
df_word_vect.toarray()


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [79]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3),  stop_words= 'english')
df_char_vect = tfidf.fit_transform(df['text'])
df_char_vect.toarray()


array([[0.2442019 , 0.        , 0.        , ..., 0.        , 0.08323056,
        0.        ],
       [0.23113135, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.17551132, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.21992303, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.18421584, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.28106559, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [95]:
# Let's prepare the constructed features for modeling
# ---
#
X_metadata = np.array(df.iloc[:, 3:12])
X_metadata

array([[67., 11.,  6., ...,  0.,  0.,  0.],
       [81., 12.,  5., ...,  0.,  0.,  0.],
       [40.,  6.,  4., ...,  0.,  0.,  0.],
       ...,
       [45.,  8.,  3., ...,  0.,  0.,  0.],
       [34.,  6.,  2., ...,  0.,  0.,  0.],
       [44., 10.,  5., ...,  0.,  0.,  0.]])

In [96]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect, X_metadata])
X

<10000x2009 sparse matrix of type '<class 'numpy.float64'>'
	with 928106 stored elements in COOrdinate format>

In [81]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [97]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [98]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [99]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [100]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.729
Logistic Regression Classifier: 
 0.7315


In [101]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[758 292]
 [250 700]]
Logistic Regression Classifier: 
 [[759 291]
 [246 704]]


In [102]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.76      0.72      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluation our Models**

* **Accuracy:** the percentage of texts that were assigned the correct topic.
* **Precision:** the percentage of texts the classifier classified correctly out of the total number of texts it predicted for each topic
* **Recall:** the percentage of texts the model predicted for each topic out of the total number of texts it should have predicted for that topic.
* **F1 Score:** the average of both precision and recall.

To improve our model, we can try perfoming other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


Our best model had an accuracy of 72.9% and use it for classifying newer tweets. We can improve this performance by performing hyperparameter tuning and feature engineering methods. 