# Twitter Sentiment Analysis Project

In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import en_core_web_md
import spacy
from bs4 import BeautifulSoup as bf
import re


This project is going to predict how well the trained model is able to predict the sentiment of the tweets, using the data of 1.6 million tweets available on the Kaggle Dataset.

# Extrating the Data from the file, Data Exploration and Data Processing

Let's extract the from the file and store it into the variable as a Dataframe.

In [2]:
df = pd.read_csv("data/Tweets.csv", encoding = "ISO-8859-1", names = ["Sentiment", "Id", "Date", "Query_Type", "Username", "Tweet"])
df.head()

Unnamed: 0,Sentiment,Id,Date,Query_Type,Username,Tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


## Data Exploration

Let's explore the data and under the datasets more thoroughly. Let's explore the datatype of each columns.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentiment   1600000 non-null  int64 
 1   Id          1600000 non-null  int64 
 2   Date        1600000 non-null  object
 3   Query_Type  1600000 non-null  object
 4   Username    1600000 non-null  object
 5   Tweet       1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


It can be seen from the information of the datasets that none of the cells of the data frame are empty, which concludes that we do not need to delete or input any information to fill those cells. Thus, we can move forward with data engineering and data processing for further analysis.

Before moving forward with the data engineering and processing, let's check the description of each column.

In [4]:
df.describe(include = "all")

Unnamed: 0,Sentiment,Id,Date,Query_Type,Username,Tweet
count,1600000.0,1600000.0,1600000,1600000,1600000,1600000
unique,,,774363,1,659775,1581466
top,,,Mon Jun 15 12:53:14 PDT 2009,NO_QUERY,lost_dog,isPlayer Has Died! Sorry
freq,,,20,1600000,549,210
mean,2.0,1998818000.0,,,,
std,2.000001,193576100.0,,,,
min,0.0,1467810000.0,,,,
25%,0.0,1956916000.0,,,,
50%,2.0,2002102000.0,,,,
75%,4.0,2177059000.0,,,,


From the above information and the description stated into the datasets, we can conclude that the columns: id, Date, and username are not required, as they do not have any role in the analysis that makes the analysis deviate from its true value.

Moreover, from the above information it can also be concluded that the column Query_Type can also be removed as it only has one unique value in it which is "NO_QUERY", which means that it will not be able to contribute any importance towards the analysis of the sentiment of the tweets. Let's remove those columns and move forward which further data exploration.

In [5]:
df = df.drop(columns = ["Id", "Date", "Query_Type", "Username"])
df.head()

Unnamed: 0,Sentiment,Tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


Now, lets check how many unique values and there counts are present in the Sentiment Column.

In [6]:
df["Sentiment"].value_counts()

Sentiment
0    800000
4    800000
Name: count, dtype: int64

It can be seen that there are two unique values (0 and 4) and both of them are equal in number, which concludes that there is no skewness in the target column. Thus, no need to use target transformation techniques such as TargetTransformer, which changes the skewness of the target column to be more normal bell-shaped.

Moreover, changing the data of the target column, changing the value 4 in the target column to 1, as it will be more aligned towards the general meaning of the outcome in the classification problem, where the 0 represents the negative tweet and the 1 presents the positive tweet.

In [7]:
df = df.replace({"Sentiment":{4:1}})
df["Sentiment"].value_counts()

Sentiment
0    800000
1    800000
Name: count, dtype: int64

Now, let's explore the Tweets column.

In [8]:
df["Tweet"].head(10)

0    @switchfoot http://twitpic.com/2y1zl - Awww, t...
1    is upset that he can't update his Facebook by ...
2    @Kenichan I dived many times for the ball. Man...
3      my whole body feels itchy and like its on fire 
4    @nationwideclass no, it's not behaving at all....
5                        @Kwesidei not the whole crew 
6                                          Need a hug 
7    @LOLTrish hey  long time no see! Yes.. Rains a...
8                 @Tatiana_K nope they didn't have it 
9                            @twittera que me muera ? 
Name: Tweet, dtype: object

It can be seen from the above output that the Tweet column do contain a lot of unnecessary data such as urls, and html tags that do not contribute towards the sentiment analysis. Thus, we need to process the data and remove those unwanted texts from the data for better analysis.

## Data Processing

First of all, let's us process the html tags and its data from the tweets. For this one of the most used package is going to be used which is known as BeautifulSoup that decodes the html data and gives the actual text or the text it displays after parsing the html data.

In [9]:
i = 0

for text in df["Tweet"]:
    parsedText = bf(text).get_text()

    # Let's remove the URLs using the re
    parsedText = re.sub(r'http[A-Za-z0-9.:/]+', ' ', parsedText)

    #Let's remove the mentions "@"
    parsedText = re.sub(r'@[A-Za-z0-9]+', ' ', parsedText)

    #Let's remove the special characters
    parsedText = re.sub(r'[^A-Za-z0-9]', ' ', parsedText)
    
    df["Tweet"][i] = parsedText
    i += 1

df.head(10)

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



Unnamed: 0,Sentiment,Tweet
0,0,Awww that s a bummer You shoulda got ...
1,0,is upset that he can t update his Facebook by ...
2,0,I dived many times for the ball Managed to ...
3,0,my whole body feels itchy and like its on fire
4,0,no it s not behaving at all i m mad why a...
5,0,not the whole crew
6,0,Need a hug
7,0,hey long time no see Yes Rains a bit on...
8,0,K nope they didn t have it
9,0,que me muera


Let's remove the common punctuation or irrelevant words from the text that do not give any insights into the sentiment of the tweet, such as words like I, you, etc.

In [10]:
nlp = en_core_web_md.load()
text = nlp.pipe(df["Tweet"])

import nltk
nltk.download("stopwords")
stop_words = stopwords.words("english")

# print(text)

irrelevant_pos=["ADV", "PRON", "CCONJ", "PUNCT", "PART", "DET", "ADP"]
clean_text = "";
i = 0;

for token in text:
    clean_text_ind = "";
    for word in token:
        if word not in stop_words and word.pos_ not in irrelevant_pos:
            # print("I am Here")
            clean_text_ind = clean_text_ind + str(word.lemma_.lower()) + " "

    df.loc[i, "Tweet_cleaned"] = clean_text_ind
    i = i + 1
        
df.head(10)

[nltk_data] Downloading package stopwords to /Users/foram/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Sentiment,Tweet,Tweet_cleaned
0,0,Awww that s a bummer You shoulda got ...,awww s bummer shoulda get david ca...
1,0,is upset that he can t update his Facebook by ...,be upset that can update facebook texte mi...
2,0,I dived many times for the ball Managed to ...,dive many time for ball manage save 50 ...
3,0,my whole body feels itchy and like its on fire,whole body feel itchy fire
4,0,no it s not behaving at all i m mad why a...,no s behave m mad why be because ca...
5,0,not the whole crew,whole crew
6,0,Need a hug,need hug
7,0,hey long time no see Yes Rains a bit on...,hey long time see yes rain bit bit...
8,0,K nope they didn t have it,k nope didn t have
9,0,que me muera,que muera


Thus, all the data is being cleaned and is ready for being inputted into the models to get trained on them.

# Training the model

### Splitting the data into train and test data

Utilizing the methods of the scikit-learn library to split the data into train and test data, with the ratio of 80:20.

In [11]:
from sklearn.model_selection import (
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)

In [12]:
train_df, test_df = train_test_split(df, test_size = 20, random_state = 123)

In [13]:
X_train = train_df[["Tweet_cleaned"]]
y_train = train_df["Sentiment"]

X_test = test_df[["Tweet_cleaned"]]
y_test = test_df["Sentiment"]

Using the CountVectorizer to create a table of the word to document that has the information of the number of times a particular word is written in a document.

In [14]:
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
countVec = CountVectorizer(max_features = 1000)
transformed_X_train = countVec.fit_transform(X_train["Tweet_cleaned"])
transformed_X_test = countVec.transform(X_test["Tweet_cleaned"])

## Training different models

### Random Forest Classifier

We are going to utilize the advantages of the ensembles to train the model that are less prone to overfitting.

In [16]:
from sklearn.ensemble import RandomForestClassifier

In [17]:
rf = RandomForestClassifier(n_jobs = -1, random_state = 123)

rf_param_grid = {"n_estimators": np.arange(1, 100, 10), 
              "max_depth": np.arange(1, 20, 2)}
rf_random_search = RandomizedSearchCV(rf, param_distributions=rf_param_grid, n_iter=100, n_jobs=-1, return_train_score=True, random_state=76)
rf_random_search.fit(transformed_X_train, y_train)

In [18]:
print("Best Hyperparamters are: ", rf_random_search.best_params_)
print("Best Score is:", rf_random_search.best_score_)

Best Hyperparamters are:  {'n_estimators': 91, 'max_depth': 17}
Best Score is: 0.7006968837110463


From the above results, it can be concluded that the hyperparamters of values n_estimators of 91 and max_depth of 17 have the most optimal results when using the Random Forest Classifier and has achieved the Cross Validation score of 70%.

### Logistic Regression

Let's use the logistic regression technique to train the model and compare the optimized results with the Random Forest Classifier.

In [19]:
from sklearn.linear_model import LogisticRegression

In [20]:
lr = LogisticRegression()

C_vals = 10.0 ** np.arange(-1.5, 2, 0.5)

lr_param_grid = {"C": C_vals}


lr_random_search = RandomizedSearchCV(lr, param_distributions = lr_param_grid, n_iter = 100, n_jobs = -1, return_train_score = True, random_state = 76)
lr_random_search.fit(transformed_X_train, y_train)



In [21]:
print("Best Hyperparamters are: ", lr_random_search.best_params_)
print("Best Score is:", lr_random_search.best_score_)

Best Hyperparamters are:  {'C': 31.622776601683793}
Best Score is: 0.743343666795835


From the above results, it can be concluded that the hyperparameters of values C = 31.623 have the most optimal results when using the Logistic Regression as the training model and has a cross-validation score of 74%.

### Conclusion on training model

From the above two models, it can be concluded that the model that is more fit for the tweet sentimental analysis is the (Logistic Regression model, with the hyperparameters of values C = 31.623 that have the cross-validation of 74%. Now, let us check the score achieved by the test data using the most fit model trained above, Logistic Regression.

In [22]:
# using the Randomized Search CV variable to score the test data
print("The accuracy achieved on test score for the logistic regression model is: ", lr_random_search.score(transformed_X_test, y_test))

The accuracy achieved on test score for the logistic regression model is:  0.7


From the above results, it can be concluded that the best model trained can achieve an accuracy score of 70% to correctly predict the sentimental analysis of the tweet that has never been studied by the model before.

# Summary

Summary of important results

| Model | Metric | Training Data | Testing Data
| ---- | ----- | ------------- | --------- |
| Logistic Regression | Accuracy | 74% | 70%|

Even though the ensembles are more powered compared to linear models in training data in general, using the advantages of multiple independent training models, which later on predicts the outcome through voting. It can be seen that the linear models were more capable to learn more insights into the text and predict more correct sentiment of the tweets compared to the Random Forest Classifiers.

In terms of improvements that can be done in the future:
 - We can train other types of models and compare the results with them, such as training the LightGBM or can also use the averaging or stacking model to train multiple different models and use them to get the outcome.

