<font color="#4b76b7">To start practicing, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [3]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # librariy for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific conputing 

### 1. Importing our Data

In [4]:
# Question: Given a new tweets, create a sentiment analysis model that will 
# predict whether a tweet will contain positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('/content/project_ds.csv', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [5]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

Seems this dataset will need some data cleaning i.e. columns. We also don't need some columns to perform create our model. We will drop those columns.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [6]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [7]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [8]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [9]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [10]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 

In [11]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [12]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [21]:
# Text Cleaning: Removing @ and # characters or replace them with space

df['text'] = df.text.str.replace('#',' ')
df['text'] = df.text.str.replace('@',' ') 
df[['text']].sample(10) 


Unnamed: 0,text
6374,rumple doodles boring saturday night ugh i dk ...
7430,anthony cash cash you better come back soon lt 3
7305,bummed that our camping trip to ny got cancelled
8268,good morning wanna stay home but can't bloody ...
1486,joey mcintyre missed all your tweets again joe...
9137,sk devi tt nah it's be oct y or ot us
6867,angela james oh poor bride that's sad good sto...
1420,wv goo street team i wake up to a goo song eve...
5244,better in pink by the way if you make any web ...
3461,just woke up omg soo sick still a ww


In [20]:
# Text Cleaning: Conversion to lowercase

df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))
df[['text']].sample(10) 


Unnamed: 0,text
1406,is ac tully gutted about the katy perry gig an...
5612,just be lying on your shelf ' ' whatever i rea...
6164,kl a sik 1 hah a true a wee no i know sucks to...
3836,its hot today more vitamin d for me hah a
6031,snickers 1015 yup this was a bad flu year i ha...
3945,niro who
5483,lol well i am responding via my blackberry pho...
5791,goodbye cocktail with anna tear at flatiron lo...
2969,ugh school wont give my mom the marks
9126,very serious business planning with ma duck fo...


In [18]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take few minutes...
# ---
#
# Library for Stop words
!pip3 install wordninja
!pip3 install textblob
import wordninja 
from textblob import TextBlob

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

# Library for Lemmatization
nltk.download('wordnet')
from textblob import Word

# Library for Noun count
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Library for TD-IDF
from sklearn.feature_extraction.text import TfidfVectorizer 

Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
[?25l[K     |▋                               | 10 kB 19.6 MB/s eta 0:00:01[K     |█▏                              | 20 kB 22.9 MB/s eta 0:00:01[K     |█▉                              | 30 kB 11.2 MB/s eta 0:00:01[K     |██▍                             | 40 kB 9.7 MB/s eta 0:00:01[K     |███                             | 51 kB 4.8 MB/s eta 0:00:01[K     |███▋                            | 61 kB 5.5 MB/s eta 0:00:01[K     |████▎                           | 71 kB 5.9 MB/s eta 0:00:01[K     |████▉                           | 81 kB 6.0 MB/s eta 0:00:01[K     |█████▌                          | 92 kB 6.6 MB/s eta 0:00:01[K     |██████                          | 102 kB 5.3 MB/s eta 0:00:01[K     |██████▋                         | 112 kB 5.3 MB/s eta 0:00:01[K     |███████▎                        | 122 kB 5.3 MB/s eta 0:00:01[K     |███████▉                        | 133 kB 5.3 MB/s eta 0:00:01[K     |

In [19]:
# Performing the split
# ---
df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')
df[['text']].sample(10) 


Unnamed: 0,text
1591,s pb's weather sucks
2929,ti qui 54 hate you
1181,still nothing to say
2054,mike last ort yeah watching the news now
5923,two knotty boys damn wish i was staying in ca ...
5972,the rain sucks really didn't want to get out o...
3121,damm nx megan i don't understand it either im ...
5903,dan lopez 2012 no i did not i'll have 2 look i...
9781,young q ok twi cw to mexico why don't u come t...
1319,sole i rie e yummy enjoy


In [23]:
# Text Cleaning: Removing punctuation characters

df['text'] = df.text.str.replace('[^\w\s]','')
df[['text']].sample(10) 

  after removing the cwd from sys.path.


Unnamed: 0,text
6008,e hm ce eye hm i hope youll be ok soon teena hugs
2871,lady les hur r yeah make sure ur following me
7101,n iq yap ici c hope its nothing serious may ur...
3759,hates cleaning
9752,i really want jon and kate to stay together
2986,i hate storms
1279,the redstone so no more splits on the ice
1342,far too many late nites
3825,so i hv spent 14 hrs trying to get vista xp ub...
1838,james waters no no i still fully intend on get...


In [24]:
# Text Cleaning: Removing stop words

df['text'] = df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df[['text']].sample(10) 


Unnamed: 0,text
7938,derek maca rio well miss lucario
8358,ms res dont think goalie time check face bette...
9048,juse dayne dirty juse im sleepy shit amp u wen...
9726,good afternoon sw aq qed missy baby
5291,burnt little lady
2151,j sw ching ive take pic
9206,r sue naga thanks bringing ticket today
6954,giraud official meet get ready flirt
8341,fl avid j wish would allow image signature gma...
2591,mia r buh bye res zz pati lazy weekend huh yea...


In [25]:
# Text Cleaning: Lemmatization
# ---
# Lemmatizing our text
df['text'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text']].sample(10) 

Unnamed: 0,text
9842,tho man tho supposed call
1408,ipod playing loud ear hurt
851,another day home resting knee knee surgery sta...
4402,jon bon 88 seems like long time pk lol u worki...
5520,im twittering sitting next josh mayor dann fest
7951,richie j 5 oh shush scaring jesus outta lol
2032,eating haag en daz coffee ice cream yummy
9310,yarn lust color way idea
4312,prom awesome bad sunburn foot cut killing
6945,vi x ster 25 im fine hun got friend visiting a...


We won't remove numerics because we could loose meaning of our text if we lost the numerics. We could also further prepare our text by performing spelling correction but this is a resource intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [27]:
# Feature Construction: Length of tweet

df['length_of_text'] = df.text.str.len()



In [36]:
# Feature Construction: Word count 
df['word_count'] = df.text.apply(lambda x: len(str(x).split(" ")))

In [30]:
# Feature Construction: Word density (Average no. of words / tweet)

df['avg_word_length'] = df.text.apply(lambda x: avg_word(x)) 

In [None]:
# Feature Construction: Noun count
#
# First, we will download the punkt and the averaged_perceptron_tagger into our notebook environment. 
# which will allow us to find the part of speech tags.
# ---
#


# We create the function to check and get the part of speech tag count of a words in a given sentence


In [26]:
# Custom Functions
# ---
#

# Avg. words
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0 
  return z

# Noun count
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt

# Subjectivity 
def get_subjectivity(tweet):
    try:
        textblob = TextBlob(unicode(tweet, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

# Polarity
def get_polarity(tweet):
    try:
        textblob = TextBlob(unicode(tweet, 'utf-8'))
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol

In [37]:
# Noun Count

#
df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))

In [38]:
# Feature Construction: Verb count

df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))

In [39]:
# Feature Construction: Adjective count / Tweet
# ---

df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))


In [40]:
# Feature Construction: Adverb count / Tweet

df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))


In [41]:
# Feature Construction: Pronoun 

df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))

In [42]:
# Feature Construction: Subjectivity
# ---
df['subjectivity'] = df.text.apply(get_subjectivity)


In [43]:
# Feature Construction: Polarity
# ---
df['polarity'] = df.text.apply(get_polarity)


In [45]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 

# Feature Construction: Word Level N-Gram TF-IDF Feature 
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df.text) 



In [46]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
# ---
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3),  stop_words= 'english')
df_char_vect = tfidf.fit_transform(df.text)


In [47]:
# Let's prepare the constructed features for modeling
# ---
#
X_metadata = np.array(df.iloc[:, 2:12])
X_metadata

array([[67.        , 11.        ,  5.18181818, ...,  0.        ,
         0.        ,  0.        ],
       [81.        , 12.        ,  5.83333333, ...,  0.        ,
         0.        ,  0.        ],
       [40.        ,  6.        ,  5.83333333, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [45.        ,  8.        ,  4.75      , ...,  0.        ,
         0.        ,  0.        ],
       [34.        ,  6.        ,  4.83333333, ...,  0.        ,
         0.        ,  0.        ],
       [44.        , 10.        ,  3.5       , ...,  0.        ,
         0.        ,  0.        ]])

In [48]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

<10000x2010 sparse matrix of type '<class 'numpy.float64'>'
	with 938159 stored elements in COOrdinate format>

In [49]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [50]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [51]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [52]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [53]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.7265
Logistic Regression Classifier: 
 0.731


In [54]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[759 291]
 [256 694]]
Logistic Regression Classifier: 
 [[760 290]
 [248 702]]


In [55]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.74      1050
           4       0.70      0.73      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.72      0.74      1050
           4       0.71      0.74      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



**Evaluation our Models**

* **Accuracy:** the percentage of texts that were assigned the correct topic.
* **Precision:** the percentage of texts the classifier classified correctly out of the total number of texts it predicted for each topic
* **Recall:** the percentage of texts the model predicted for each topic out of the total number of texts it should have predicted for that topic.
* **F1 Score:** the average of both precision and recall.

To improve our model, we can try perfoming other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.

### 5. Recommendations


This best model had an accuracy of 73.25% and use it for classifying newer tweets. We can improve this performance by performing hyperparameter tuning and feature engineering methods. 