## Outline

1. Read libraries
2. Data preparation
	- Re-split data into train and test
3. Data vectorization (for train data)
	- Word tokenization
	- Word stemming
	- Word Lemmatization
	- Bag of Words (BoW) with train data
	- Tf - idf with train data
4. Data vectorization (for test data)
	- Word tokenization
	- Word stemming
	- Word Lemmatization
	- BoW with test data
	- TF-IDF with test data
5. Model Building: Sentiment Analysis
	- Splitting the Dataset into Train and Test set
	- Logistic Regression
	- Logistic Regression (2)
	- Support Vector Machine (SVM)
	- Gaussian NB classifier
	- MultinomialNB classifier
	- Xgboost classifier
	- Decision Tree
	- Random Forest
	- Deep Learning Classification (RNN-LSTM)
	- Using Vader Pre-trained model
6. Data prediction
7. Save Model
8. Load Model
9. Conclusion

## 1) Read libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# for nlp
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob

# for stemming
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

# for Lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizing = WordNetLemmatizer()

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for machine learning
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, f1_score
# from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

# save and load models
import pickle

# import warnings
import warnings
warnings.filterwarnings("ignore")

## 2. Data preparation

**This includes**:
    - Re-split data into train and test

In [2]:
# Read data
path = ''
df = pd.read_csv(path + 'trump_analyzed_df.csv')
del df['Unnamed: 0']
df

Unnamed: 0,id,label,tweet,tidy_tweet,hashtag,word_count,char_count,avg_word,stopwords,hashtags,numerics,upper_case
0,1,0.0,@user when a father is dysfunctional and is s...,dysfunctional selfish drags kids dysfunction #run,run,21,102,4.555556,10,1,0,0
1,2,0.0,@user @user thanks for #lyft credit i can't us...,thanks #lyft credit cause offer wheelchair van...,lyft disapointed getthanked,22,122,5.315789,5,3,0,0
2,3,0.0,bihday your majesty,majesty,,5,21,5.666667,1,0,0,0
3,4,0.0,#model i love u take with u all the time in ...,#model,model,17,86,4.928571,5,1,0,0
4,5,0.0,factsguide: society now #motivation,factsguide society #motivation,motivation,8,39,8.000000,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
49154,49155,,thought factory: left-right polarisation! #tru...,thought factory left right polarisation #trump...,trump uselections leadership politics ...,13,108,8.727273,0,6,0,0
49155,49156,,feeling like a mermaid ð #hairflip #neverre...,feeling mermaid #hairflip #neverready #formal ...,hairflip neverready formal wedding gown ...,15,96,6.307692,1,7,0,0
49156,49157,,#hillary #campaigned today in #ohio((omg)) &am...,#hillary #campaigned #ohio used words assets l...,hillary campaigned ohio omg clinton ra...,20,145,7.411765,3,5,0,0
49157,49158,,"happy, at work conference: right mindset leads...",work conference right mindset leads culture de...,work mindset,15,104,7.500000,2,2,0,0


In [None]:
# Read data
path = ''
df = pd.read_csv(path + 'trump_analyzed_df.csv')
del df['Unnamed: 0']

In [5]:
df.columns

Index(['id', 'script', 'tidy_script', 'hashtag', 'word_count', 'char_count',
       'avg_word', 'stopwords', 'hashtags', 'numerics', 'upper_case'],
      dtype='object')

In [4]:
df.shape

(12, 11)

### 2.1 Re-split data into train and test

In [6]:
# check data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           12 non-null     int64  
 1   script       12 non-null     object 
 2   tidy_script  12 non-null     object 
 3   hashtag      0 non-null      float64
 4   word_count   12 non-null     int64  
 5   char_count   12 non-null     int64  
 6   avg_word     12 non-null     float64
 7   stopwords    12 non-null     int64  
 8   hashtags     12 non-null     int64  
 9   numerics     12 non-null     int64  
 10  upper_case   12 non-null     int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 1.2+ KB


In [7]:
df['hashtag']='none'

In [4]:
# split the data based on label
# train_df = df[0:31962]
# test_df = df[31962:]

**After re-splitting**

- train data: (31962, 3)
- test data: (17197, 2)

In [8]:
# chek train dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           12 non-null     int64  
 1   script       12 non-null     object 
 2   tidy_script  12 non-null     object 
 3   hashtag      12 non-null     object 
 4   word_count   12 non-null     int64  
 5   char_count   12 non-null     int64  
 6   avg_word     12 non-null     float64
 7   stopwords    12 non-null     int64  
 8   hashtags     12 non-null     int64  
 9   numerics     12 non-null     int64  
 10  upper_case   12 non-null     int64  
dtypes: float64(1), int64(7), object(3)
memory usage: 1.2+ KB


The most important column here is **'tidy_tweet',** which has the pre-processed version of the tweets. We will do verctorization and modeling over this column.

In [9]:
# drop rows where tidy_tweet = null
df = df[df['tidy_script'].notna()]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           12 non-null     int64  
 1   script       12 non-null     object 
 2   tidy_script  12 non-null     object 
 3   hashtag      12 non-null     object 
 4   word_count   12 non-null     int64  
 5   char_count   12 non-null     int64  
 6   avg_word     12 non-null     float64
 7   stopwords    12 non-null     int64  
 8   hashtags     12 non-null     int64  
 9   numerics     12 non-null     int64  
 10  upper_case   12 non-null     int64  
dtypes: float64(1), int64(7), object(3)
memory usage: 1.1+ KB


In [38]:
df2 = df[['tidy_script']].copy()

In [39]:
df2.head()

Unnamed: 0,tidy_script
0,hello iowa congratulations iowa hawkers today ...
1,running many love marjorie help take house sen...
2,grew gutted thing round question could could w...
3,nation america mourn loss brave brilliant amer...
4,crowd tell goes wish show goes looked televisi...


In [42]:
df2['word_count'] = df2['tidy_script'].apply(lambda x: len(x.split()))

In [45]:
df2.word_count.sum()/6

7103.5

#### Generate 7100 rows of new data

In [46]:
df2['tidy_script'].apply(lambda x: [ x.split()[i:i+5] for i in range(len(x.split())-6)])

0     [[hello, iowa, congratulations, iowa, hawkers]...
1     [[running, many, love, marjorie, help], [many,...
2     [[grew, gutted, thing, round, question], [gutt...
3     [[nation, america, mourn, loss, brave], [ameri...
4     [[crowd, tell, goes, wish, show], [tell, goes,...
5     [[seen, vanity, president, seen, vanity], [van...
6     [[charlie, introduction, beautiful, fearless, ...
7     [[audience, matt, precedes, done, can], [matt,...
8     [[broke, rolling, topic, elite, firms], [rolli...
9     [[audience, audience, crosstalk, well, ohio], ...
10    [[well, michael, congratulations, reelection, ...
11    [[well, hello, can, miss, miss], [hello, can, ...
Name: tidy_script, dtype: object

## 3. Data vectorization (for train data)

In order to use textual data for predictive modeling, words need to then be encoded as integers, or floating-point values, for use as inputs in machine learning algorithms. This process is called feature extraction (or vectorization). This includes:

- Word tokenization
- Data normalization
    - Word stemming
    - Word Lemmatization
- Bag of Words (BoW) with train data
- Tf - idf with train data

### 3.1 Word tokenization

In [10]:
df['token'] = df['tidy_script'].apply(lambda x: word_tokenize(x))
df.head()

Unnamed: 0,id,script,tidy_script,hashtag,word_count,char_count,avg_word,stopwords,hashtags,numerics,upper_case,token
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa....",hello iowa congratulations iowa hawkers today ...,none,15295,86940,4.581675,5641,0,43,362,"[hello, iowa, congratulations, iowa, hawkers, ..."
1,1,\n\n\n\n \nDonald Trump: (03:37)\nWe have grea...,running many love marjorie help take house sen...,none,13511,76565,4.573238,5115,0,41,271,"[running, many, love, marjorie, help, take, ho..."
2,2,\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. W...,grew gutted thing round question could could w...,none,1130,6763,4.575557,403,0,2,61,"[grew, gutted, thing, round, question, could, ..."
3,3,\n\n\n\n \nDonald Trump: (00:00)\nAs one natio...,nation america mourn loss brave brilliant amer...,none,321,1860,4.742236,136,0,1,6,"[nation, america, mourn, loss, brave, brillian..."
4,4,\n\n\n\n \nDonald Trump: (08:53)\nThank you. T...,crowd tell goes wish show goes looked televisi...,none,13628,77118,4.548464,5021,0,51,345,"[crowd, tell, goes, wish, show, goes, looked, ..."


### 3.2 Word stemming

In [11]:
# Created one more columns tweet_stemmed it shows tweets' stemmed version
df['script_stemmed'] = df['token'].apply(lambda x: ' '.join([stemming.stem(i) for i in x]))
df.head()

Unnamed: 0,id,script,tidy_script,hashtag,word_count,char_count,avg_word,stopwords,hashtags,numerics,upper_case,token,script_stemmed
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa....",hello iowa congratulations iowa hawkers today ...,none,15295,86940,4.581675,5641,0,43,362,"[hello, iowa, congratulations, iowa, hawkers, ...",hello iowa congratul iowa hawker today thrill ...
1,1,\n\n\n\n \nDonald Trump: (03:37)\nWe have grea...,running many love marjorie help take house sen...,none,13511,76565,4.573238,5115,0,41,271,"[running, many, love, marjorie, help, take, ho...",run mani love marjori help take hous send nanc...
2,2,\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. W...,grew gutted thing round question could could w...,none,1130,6763,4.575557,403,0,2,61,"[grew, gutted, thing, round, question, could, ...",grew gut thing round question could could wife...
3,3,\n\n\n\n \nDonald Trump: (00:00)\nAs one natio...,nation america mourn loss brave brilliant amer...,none,321,1860,4.742236,136,0,1,6,"[nation, america, mourn, loss, brave, brillian...",nation america mourn loss brave brilliant amer...
4,4,\n\n\n\n \nDonald Trump: (08:53)\nThank you. T...,crowd tell goes wish show goes looked televisi...,none,13628,77118,4.548464,5021,0,51,345,"[crowd, tell, goes, wish, show, goes, looked, ...",crowd tell goe wish show goe look televis tele...


### 3.3 Word Lemmatization

In [12]:
df['script_lemmatized'] = df['token'].apply(lambda x: ' '.join([lemmatizing.lemmatize(i) for i in x]))
df.head()

Unnamed: 0,id,script,tidy_script,hashtag,word_count,char_count,avg_word,stopwords,hashtags,numerics,upper_case,token,script_stemmed,script_lemmatized
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa....",hello iowa congratulations iowa hawkers today ...,none,15295,86940,4.581675,5641,0,43,362,"[hello, iowa, congratulations, iowa, hawkers, ...",hello iowa congratul iowa hawker today thrill ...,hello iowa congratulation iowa hawker today th...
1,1,\n\n\n\n \nDonald Trump: (03:37)\nWe have grea...,running many love marjorie help take house sen...,none,13511,76565,4.573238,5115,0,41,271,"[running, many, love, marjorie, help, take, ho...",run mani love marjori help take hous send nanc...,running many love marjorie help take house sen...
2,2,\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. W...,grew gutted thing round question could could w...,none,1130,6763,4.575557,403,0,2,61,"[grew, gutted, thing, round, question, could, ...",grew gut thing round question could could wife...,grew gutted thing round question could could w...
3,3,\n\n\n\n \nDonald Trump: (00:00)\nAs one natio...,nation america mourn loss brave brilliant amer...,none,321,1860,4.742236,136,0,1,6,"[nation, america, mourn, loss, brave, brillian...",nation america mourn loss brave brilliant amer...,nation america mourn loss brave brilliant amer...
4,4,\n\n\n\n \nDonald Trump: (08:53)\nThank you. T...,crowd tell goes wish show goes looked televisi...,none,13628,77118,4.548464,5021,0,51,345,"[crowd, tell, goes, wish, show, goes, looked, ...",crowd tell goe wish show goe look televis tele...,crowd tell go wish show go looked television t...


### 3.4 Bag of Words (BoW) with train data

CounterVectorization is a SciKitLearn library takes any text document and returns each unique word as a feature with the count of number of times that word occurs.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
bow_vectorizer

CountVectorizer(max_df=0.9, max_features=1000, min_df=2, stop_words='english')

In [14]:
# bag-of-words stemmed
trainbow_stem = bow_vectorizer.fit_transform(df['script_stemmed'])
trainbow_stem

<12x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 6732 stored elements in Compressed Sparse Row format>

In [15]:
trainbow_stem.toarray()

array([[4, 2, 2, ..., 4, 0, 0],
       [1, 1, 1, ..., 5, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 1, ..., 2, 0, 1],
       [4, 2, 5, ..., 3, 0, 3],
       [1, 0, 5, ..., 5, 3, 0]], dtype=int64)

In [16]:
# bow lemmatized
trainbow_lemm = bow_vectorizer.fit_transform(df['script_lemmatized'])
trainbow_lemm

<12x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 6710 stored elements in Compressed Sparse Row format>

In [17]:
trainbow_lemm.toarray()

array([[4, 2, 0, ..., 4, 0, 0],
       [1, 1, 0, ..., 5, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 2, ..., 2, 0, 1],
       [4, 5, 2, ..., 3, 0, 3],
       [1, 2, 2, ..., 5, 3, 0]], dtype=int64)

**Note**: Why we need toarray()? This method converts the sparse matrix representation to a dense ndarray representation.

**YOUR TURN**

Can you use the code above without using toarray() function?

### 3.5 Tf - idf with train data

**TF-IDF** works by penalizing the common words by assigning them lower weights while giving importance to words which are rare in the entire corpus but appear in good numbers in few documents.

Let’s have a look at the important terms related to TF-IDF:

TF = (Number of times term t appears in a document)/(Number of terms in the document)
IDF = log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.
TF-IDF = TF*IDF

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
tfidf_vectorizer

TfidfVectorizer(max_df=0.9, max_features=1000, min_df=2, stop_words='english')

In [20]:
# tf-idf stemmed
traintfidf_stem = tfidf_vectorizer.fit_transform(df['script_stemmed'])
traintfidf_stem.toarray()

array([[0.02140815, 0.0136249 , 0.01157825, ..., 0.02508921, 0.        ,
        0.        ],
       [0.00596989, 0.00759889, 0.00645743, ..., 0.03498193, 0.        ,
        0.00907048],
       [0.01779407, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.00801385, 0.00681006, ..., 0.01475689, 0.        ,
        0.0095658 ],
       [0.0259271 , 0.01650091, 0.03505562, ..., 0.02278888, 0.        ,
        0.02954469],
       [0.00662603, 0.        , 0.03583579, ..., 0.03882675, 0.03330764,
        0.        ]])

In [21]:
# tf-idf lemmatized
traintfidf_lemm = tfidf_vectorizer.fit_transform(df['script_lemmatized'])
traintfidf_lemm.toarray()

array([[0.02193933, 0.01186553, 0.        , ..., 0.02571172, 0.        ,
        0.        ],
       [0.00600493, 0.00649534, 0.        , ..., 0.03518728, 0.        ,
        0.00912373],
       [0.01817601, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.00670454, 0.02077182, ..., 0.01452823, 0.        ,
        0.00941758],
       [0.02499032, 0.03378902, 0.02093685, ..., 0.02196549, 0.        ,
        0.02847721],
       [0.00662474, 0.01433153, 0.02220078, ..., 0.03881919, 0.03330116,
        0.        ]])

## 4. Data vectorization (for test data)

This includes:
    - Word tokenization
    - Word stemming
    - Word Lemmatization
    - BoW with test data
    - TF-IDF with test data

## 5. Model Building: Sentiment Analysis

We are now done with all the pre-modeling stages required to get the data in the proper form and shape. We will be building models on the datasets with different feature sets prepared in the earlier sections — Bag-of-Words, and TF-IDF vectors. We will use the following algorithms to build models:

- Splitting the Dataset into Train and Test set
- Logistic Regression
- Logistic Regression (2)
- Support Vector Machine (SVM)
- Gaussian NB classifier
- MultinomialNB classifier
- Xgboost classifier
- Decision Tree
- Random Forest
- Deep Learning Classification
- Vader Pre-trained model

**A Note on Evaluation Metrics**

We will use the following evaluation metrics:


1. **Accuracy score** is the rate of correct predictions.Out of every 100 predictions made, the model was correct 94 times. It is used when we want to know the number of correct predictions, which is, when the algorithm correctly predicts a type T when it is actually type T. It takes into account all the possible classes and how much we predicted correctly.The score should be as high as possible.

2. **F1 Score** It is the weighted average of Precision and Recall is used when the datasets don't have an equal representation for each type that's being classified. Therefore, this score takes both false positives and false negatives into account. It is suitable for uneven class distribution problems. It is calculated as follows: F1 Score = 2 (Recall Precision) / (Recall + Precision)

3. **A confusion matrix** is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. Confusion matrix represents accurate predictions made along the diagonal of the matrix.

In [31]:
from IPython.display import display
from PIL import Image

In [None]:
# Confusion matrix
path="C:\\Users\\lenovo\\Tutorials\\03. Data Science\\DS images 2\\confusion-matrix.png"
display(Image.open(path))

In [None]:
# Evaluation metrics
path="C:\\Users\\lenovo\\Tutorials\\03. Data Science\\DS images 2\\f1-score.jpg"
display(Image.open(path))

### 5.1 Splitting the Dataset into Train and Test set

In [24]:
traintfidf_lemm.shape

(12, 1000)

In [30]:
X=traintfidf_lemm #x: predictors


In [33]:
xtrain,xtest,ytrain,ytest=train_test_split(X,y,test_size=.3,random_state=42)

We will aplly an number of classifiers

### 5.2 Logistic Regression

In [34]:
lr=LogisticRegression() # for lemmatized data
lr.fit(xtrain,ytrain)

LogisticRegression()

In [35]:
predict_lr=lr.predict(xtest)

In [36]:
print("accuracy score :", accuracy_score(predict_lr,ytest))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_lr,ytest))

print(confusion_matrix(predict_lr,ytest))
print(classification_report(predict_lr,ytest))

accuracy score : 0.9455175309678774
f1 score : 0.4188129899216126
[[8820  479]
 [  40  187]]
              precision    recall  f1-score   support

         0.0       1.00      0.95      0.97      9299
         1.0       0.28      0.82      0.42       227

    accuracy                           0.95      9526
   macro avg       0.64      0.89      0.70      9526
weighted avg       0.98      0.95      0.96      9526



### 5.3 Logistic Regression (2)

In [None]:
X1=traintfidf_stem # for stemmed data
y1=train_df['label']

In [None]:
x1train,x1test,y1train,y1test=train_test_split(X1,y1,test_size=.3,random_state=42)

In [None]:
lr1=LogisticRegression()
lr1.fit(x1train,y1train)

In [None]:
predict_lr1=lr1.predict(x1test)

In [None]:
# accuracy score
print("accuracy score :", accuracy_score(predict_lr1,y1test))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_lr1,ytest))

print(confusion_matrix(predict_lr1,y1test))
print(classification_report(predict_lr1,y1test))

### 5.4 Support Vector Machine (SVM)

In [None]:
svc=SVC()
svc.fit(xtrain,ytrain)
predict_svc=svc.predict(xtest)

In [None]:
# accuracy score
print("accuracy score :", accuracy_score(predict_svc,ytest))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_svc,ytest))

print(confusion_matrix(predict_svc,ytest))
print(classification_report(predict_svc,ytest))

### 5.5 Gaussian NB classifier

**Naive Bayes** is a classification technique based on Bayes' Theorem. Bayes’ theorem is based conditional probability which states the likelihood the occurrence of event “A” given another event “B” has already happened. There are 3 type of Naïve Bayes:

1. **Gaussian** -> The model assume that the data follows normal distribution and all our features are continuous.
2. **Bernoulli** -> It assumes that all our features are binary such that they only take two values: 0s and 1s.
3. **Multinomial** -> It assumes that the data has discreate value such as ratings between 1 to 5.


In [None]:
nb=GaussianNB()
nb.fit(xtrain.toarray(),ytrain)
predict_nb=nb.predict(xtest.toarray())

In [None]:
# accuracy score
print("accuracy score :", accuracy_score(predict_nb,ytest))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_nb,ytest))

print(confusion_matrix(predict_nb,ytest))
print(classification_report(predict_nb,ytest))

### 5.6 MultinomialNB classifier

In [None]:
mlnb = MultinomialNB()
mlnb.fit(xtrain.toarray(),ytrain)
predict_mlnb=mlnb.predict(xtest.toarray())

In [None]:
# accuracy score
print("accuracy score :", accuracy_score(predict_mlnb,ytest))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_mlnb,ytest))

print(confusion_matrix(predict_mlnb,ytest))
print(classification_report(predict_mlnb,ytest))

### 5.7 Xgboost classifier

In [None]:
xgb = XGBClassifier()
xgb.fit(xtrain.toarray(),ytrain)
predict_xgb=xgb.predict(xtest.toarray())

In [None]:
# accuracy score
print("accuracy score :", accuracy_score(predict_xgb,ytest))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_xgb,ytest))

print(confusion_matrix(predict_xgb,ytest))
print(classification_report(predict_xgb,ytest))

**Note**

Xgboost classifier is the preferred classifier to use in data science competitions

### 5.8 Decision Tree

In [38]:
dt = DecisionTreeClassifier()
dt.fit(xtrain.toarray(),ytrain)
predict_dt = dt.predict(xtest.toarray())

KeyboardInterrupt: 

In [None]:
# accuracy score
print("accuracy score :", accuracy_score(predict_dt,ytest))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_dt,ytest))

print(confusion_matrix(predict_dt,ytest))
print(classification_report(predict_dt,ytest))

### 5.9 Random Forest

In [40]:
xtrain.shape, ytrain.shape

((22225, 1000), (22225,))

In [46]:
rf = RandomForestClassifier()
rf.fit(xtrain.toarray(),ytrain) # you can test with grid search methodology
predict_rf = rf.predict(xtest.toarray())

In [47]:
# accuracy score
print("accuracy score :", accuracy_score(predict_rf,ytest))

# calculating the f1 score for the validation set
print("f1 score :", f1_score(predict_rf,ytest))

print(confusion_matrix(predict_rf,ytest))
print(classification_report(predict_rf,ytest))

accuracy score : 0.9517111064455175
f1 score : 0.5568400770712909
[[8777  377]
 [  83  289]]
              precision    recall  f1-score   support

         0.0       0.99      0.96      0.97      9154
         1.0       0.43      0.78      0.56       372

    accuracy                           0.95      9526
   macro avg       0.71      0.87      0.77      9526
weighted avg       0.97      0.95      0.96      9526



### 5.10 Deep Learning Classification

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding, SpatialDropout1D
from keras.layers import Bidirectional

In [None]:
max_features = 220
tokenizer = Tokenizer(num_words = max_features, split = (' '))
tokenizer.fit_on_texts(train_df['tweet'].values)
X = tokenizer.texts_to_sequences(train_df['tweet'].values)

# making all the tokens into same sizes using padding.
X = pad_sequences(X, maxlen = max_features)
X.shape

In [None]:
Y = train_df['label'].values

In [None]:
model = Sequential()
model.add(Embedding(max_features, 64, input_length = X.shape[1], trainable=False))
model.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.50))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [None]:
model.fit(X, Y,batch_size=1500,epochs = 1)

In [None]:
prediction = model.predict(X)
classes_x=np.argmax(prediction,axis=1)

In [None]:
from sklearn.metrics import accuracy_score
score = accuracy_score(Y, classes_x)
print(score)

### 5.11 Using Vader Pre-trained model

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

In [None]:
train_df['score']=train_df['tweet'].apply(lambda tweet: sid.polarity_scores(tweet))
train_df.head()

In [None]:
train_df['compound']  = train_df['score'].apply(lambda score_dict: score_dict['compound'])
train_df.head()

In [None]:
train_df['comp_score'] = train_df['compound'].apply(lambda c: 1 if c >0 else 0)
train_df.head()

In [None]:
from sklearn.metrics import accuracy_score
score = accuracy_score(train_df['label'], train_df['comp_score'])
print(score)

**YOUR TURN**

Can you suggest other classifiers?

## 6. Data prediction

**Random Forst model** has given us the best performance so far in terms of F1-score and accuracy. Let’s try to do predictions using the test data.

In [48]:
# Random Forst
test_predict_rf = rf.predict(testtfidf_lemm)
test_predict_rf

array([0., 0., 0., ..., 0., 0., 0.])

In [49]:
test_df['label'] = test_predict_rf
test_df.head()

Unnamed: 0,id,tweet,tidy_tweet,token,tweet_stemmed,tweet_lemmatized,label
31962,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...,"[#, studiolife, #, aislife, #, requires, #, pa...",# studiolif # aislif # requir # passion # dedi...,# studiolife # aislife # requires # passion # ...,0.0
31963,31964,@user #white #supremacists want everyone to s...,#white #supremacists everyone #birds #movie,"[#, white, #, supremacists, everyone, #, birds...",# white # supremacist everyon # bird # movi,# white # supremacist everyone # bird # movie,0.0
31964,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways heal #acne #altwaystoheal #healing,"[safe, ways, heal, #, acne, #, altwaystoheal, ...",safe way heal # acn # altwaystoh # heal,safe way heal # acne # altwaystoheal # healing,0.0
31965,31966,is the hp and the cursed child book up for res...,cursed child book reservations already #harryp...,"[cursed, child, book, reservations, already, #...",curs child book reserv alreadi # harrypott # p...,cursed child book reservation already # harryp...,0.0
31966,31967,"3rd #bihday to my amazing, hilarious #nephew...",#bihday amazing hilarious #nephew ahmir uncle ...,"[#, bihday, amazing, hilarious, #, nephew, ahm...",# bihday amaz hilari # nephew ahmir uncl dave ...,# bihday amazing hilarious # nephew ahmir uncl...,0.0


In [50]:
test_df['label'].value_counts()

0.0    16710
1.0      349
Name: label, dtype: int64

### A. Prediction: Positive

In [51]:
# increase column width
pd.set_option('max_colwidth', 400)

In [52]:
prediction_pos = test_df[test_df['label'] == 1]
prediction_pos = prediction_pos[['id','tweet','label']]
prediction_pos

Unnamed: 0,id,tweet,label
32084,32085,"@user @user @user always, always, always somebody else's fault... #bigot",1.0
32096,32097,#rainbow over wall street a good way to end training!! #officialrainbowspotterâ¦,1.0
32107,32108,4u nonhockey people. hockeys babe ruth died. gordie howe was beyond myth and legend. hockey has lost mr. hockey.,1.0
32190,32191,@user so relaxed #peaceful âºï¸,1.0
32228,32229,@user @user it is still fucking bullshit that they are giving #bigots a platform to spread their vile #hate. #boyâ¦,1.0
...,...,...,...
48847,48848,happy 16th anniversary pti! ð·ðºð¯â #pilipinasteleserv #pti16thanniversary #cowboygrillâ¦,1.0
48863,48864,just saw #thelivingandthedead trailer @user looks soo good! ðð #cantwait,1.0
48898,48899,@user ha! good riddance! #blacklivesmatter,1.0
48969,48970,i remember days ago i just wana. say thank you almighty godÂ¤Â¤back to back.++***+++*### blessed friday to all my palz in nation wild.,1.0


### B. Prediction: Negative

In [53]:
prediction_neg = test_df[test_df['label'] == 0]
prediction_neg = prediction_neg[['id','tweet','label']]
prediction_neg

Unnamed: 0,id,tweet,label
31962,31963,#studiolife #aislife #requires #passion #dedication #willpower to find #newmaterialsâ¦,0.0
31963,31964,@user #white #supremacists want everyone to see the new â #birdsâ #movie â and hereâs why,0.0
31964,31965,safe ways to heal your #acne!! #altwaystoheal #healthy #healing!!,0.0
31965,31966,"is the hp and the cursed child book up for reservations already? if yes, where? if no, when? ððð #harrypotter #pottermore #favorite",0.0
31966,31967,"3rd #bihday to my amazing, hilarious #nephew eli ahmir! uncle dave loves you and missesâ¦",0.0
...,...,...,...
49154,49155,thought factory: left-right polarisation! #trump #uselections2016 #leadership #politics #brexit #blm &gt;3,0.0
49155,49156,feeling like a mermaid ð #hairflip #neverready #formal #wedding #gown #dresses #mermaid â¦,0.0
49156,49157,"#hillary #campaigned today in #ohio((omg)) &amp; used words like ""assets&amp;liability"" never once did #clinton say thee(word) #radicalization",0.0
49157,49158,"happy, at work conference: right mindset leads to culture-of-development organizations #work #mindset",0.0


## 7. Save Model

In [54]:
# save the model to disk
# we give what ever name in fist line (the model will be stored in that name)
# in second line we provide the name of our model (which is classifier in our case)

import pickle
RVC_filename = 'finalized_RFC_model.sav' # finalized_RFC_model: is the new model name
pickle.dump(rf, open(RVC_filename, 'wb')) # rf: random forest

**Model will be saved in the current directory**

In [55]:
# Save tfidf vectorizer
# Save fit vectorizer and fit tfidftransformer, use in prediction

tfidftransformer_path = 'tfidf-vectorizer.pkl' # tfidf-vectorizer is the new vectorizer name
with open(tfidftransformer_path, 'wb') as fw:
    pickle.dump(traintfidf_lemm, fw) # traintfidf_lemm: vectorizer name

**YOUR TURN**

There is another library to handle Machine learning models, called **'joblib'.** Can you use it to do the same job as pickle here?

## 8. Load Model

In [29]:
# 1) load logistic regression
import pickle
with open('finalized_RFC_model.sav', 'rb') as f:
    rf = pickle.load(f)

rf

RandomForestClassifier()

In [28]:
# 2) Load Tfidf vectorizer
import pickle
tfidftransformer_path = 'tfidf-vectorizer.pkl'
vectorizer = pickle.load(open(tfidftransformer_path, "rb"))
vectorizer

<31751x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 95447 stored elements in Compressed Sparse Row format>

In [31]:
vectorizer.shape

(31751, 1000)

In [33]:
rf.predict(X).sum()

0.0

**YOUR TURN**

Can you apply this model to new unseen twitter data? Don't forget to pre-process and vectorize data before injecting it into the model!!

## 9. Conclusion

In this module work, Logistic Regression, Support Vector Classifier, Gaussian NB, MultinomialNB, Decision Tree, Random Forest and XGBoost classifiers are used to perform Twitter sentiment analysis, out of these algorithms Random Forest classifier works best in terms of accuracy and F1 evaluation measures.