### Tweets Sentiment Prediction using NLP and Classification algorithm

The goal is to predict whether a tweet carries negative or positive sentiment. This helps a reader or a business classify tweets quickly and base decisions on the sentiment. Knowing the sentiment also makes it easier to summarize outcomes for performance measurement, which is otherwise a highly manual process.


#### Dataset:
The dataset contains tweets, each tagged with a sentiment label.
Source: https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset

### Importing dataset

In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv(r"Tweets.csv")

#### Dataset Exploration

In [3]:
data.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [4]:
data["sentiment"].value_counts()

neutral     11118
positive     8582
negative     7781
Name: sentiment, dtype: int64

In this section, we will keep only the positive and negative tweets and treat the task as binary classification.

In [5]:
data = data[(data['sentiment'] == "positive") | (data['sentiment'] == "negative")]

In [6]:
data["sentiment"].value_counts()

positive    8582
negative    7781
Name: sentiment, dtype: int64

Both classes are well represented and roughly balanced (8582 positive vs. 7781 negative). We'll use only the 'text' and 'sentiment' columns.

In [7]:
data = data[['text','sentiment']]

In [8]:
data.head()

Unnamed: 0,text,sentiment
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative
6,2am feedings for the baby are fun when he is a...,positive


In [26]:
data = data.reset_index(drop=True) # replace the old, non-contiguous index with 0..n-1

In [27]:
data = data[['text','sentiment']]

In [28]:
data.head()

Unnamed: 0,text,sentiment
0,Sooo SAD I will miss you here in San Diego!!!,negative
1,my boss is bullying me...,negative
2,what interview! leave me alone,negative
3,"Sons of ****, why couldn`t they put them on t...",negative
4,2am feedings for the baby are fun when he is a...,positive


Encoding the sentiment as a number (0 = negative, 1 = positive) for model training

In [43]:
data['sentiment'] = data['sentiment'].apply(lambda x: 0 if x == "negative" else 1)

In [44]:
data.head()

Unnamed: 0,text,sentiment
0,Sooo SAD I will miss you here in San Diego!!!,0
1,my boss is bullying me...,0
2,what interview! leave me alone,0
3,"Sons of ****, why couldn`t they put them on t...",0
4,2am feedings for the baby are fun when he is a...,1


### NLP - Cleaning the text data

Functionalities and explanations for each cleaning step will be added as a part of another detailed readout.

In [None]:
# Libraries for cleaning text data
import re
import nltk
nltk.download('stopwords') # To identify stop words
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer  # To stem each word to its base form

In [30]:
word_list = []
ps = PorterStemmer()
all_stopwords = stopwords.words('english')   # using English stopwords
all_stopwords.remove('not')                  # keep 'not' because it negates the meaning
stopword_set = set(all_stopwords)            # build the set once for fast lookups
for text in data['text']:
  tweet = re.sub('[^a-zA-Z]', ' ', text)     # keep letters only
  tweet = tweet.lower().split()
  tweet = [ps.stem(word) for word in tweet if word not in stopword_set] # stem every word that is not a stop word
  word_list.append(' '.join(tweet))

In [33]:
word_list[1:5]

['boss bulli',
 'interview leav alon',
 'son put releas alreadi bought',
 'feed babi fun smile coo']
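
The cleaning steps in the loop above can also be wrapped in a small function, which makes it easier to swap the word-normalization step later. This is a minimal sketch using only the standard library: the `clean_tweet` name, the `normalize` parameter and the tiny `STOP_WORDS` set are assumptions made for illustration, and the set stands in for nltk's English stop-word list.

```python
import re

# Tiny illustrative stop-word set (an assumption); the notebook itself uses
# nltk's stopwords.words('english') with 'not' removed.
STOP_WORDS = frozenset({"i", "you", "me", "my", "is", "will", "here", "in", "what", "the", "a"})

def clean_tweet(text, stop_words=STOP_WORDS, normalize=None):
    """Keep letters only, lowercase, drop stop words, optionally normalize each word."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # replace every non-letter with a space
    tokens = letters_only.lower().split()
    kept = [tok for tok in tokens if tok not in stop_words]
    if normalize is not None:
        kept = [normalize(tok) for tok in kept]      # e.g. a stemmer or lemmatizer
    return " ".join(kept)

print(clean_tweet("my boss is bullying me..."))        # boss bullying
print(clean_tweet("what interview! leave me alone"))   # interview leave alone
```

In the notebook, passing `normalize=ps.stem` would reproduce the stemming step, provided the stop-word set is replaced by the nltk list.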

Stemming leaves several non-words here ('bulli', 'leav', 'alon', 'alreadi'), so the cleaning step can be improved.
Let us use lemmatization, which reduces each word to its dictionary form (lemma) instead of chopping off suffixes.

Downloading required resources

In [None]:
import nltk
nltk.download('wordnet')  # WordNet data that WordNetLemmatizer relies on
nltk.download('omw-1.4')

In [37]:
from nltk.stem import WordNetLemmatizer
 
# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()

In [38]:
word_list = []
all_stopwords = stopwords.words('english')   # using English stopwords
all_stopwords.remove('not')                  # keep 'not' because it negates the meaning
stopword_set = set(all_stopwords)            # build the set once for fast lookups
for text in data['text']:
  tweet = re.sub('[^a-zA-Z]', ' ', text)     # keep letters only
  tweet = tweet.lower().split()
  tweet = [wnl.lemmatize(word) for word in tweet if word not in stopword_set] # lemmatize every word that is not a stop word
  word_list.append(' '.join(tweet))

In [47]:
word_list[1:5]

['bos bullying',
 'interview leave alone',
 'son put release already bought',
 'feeding baby fun smile coo']

The lemmatized list is more readable, though still imperfect (for example, 'boss' becomes 'bos'). We will use this list to train the model and evaluate its performance.

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(word_list).toarray()
y = data.iloc[:, -1].values

Checking the number of features (vocabulary size) in X

In [59]:
len(X[1])

16221
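
To make the bag-of-words step concrete, here is a rough standard-library sketch of the idea behind `CountVectorizer`: every distinct word gets a column, and each tweet becomes a row of word counts. It is a simplification, not scikit-learn's implementation (for instance, the default `CountVectorizer` tokenizer also ignores single-character tokens), and the `docs` list and `to_count_vector` helper are made up for illustration.

```python
from collections import Counter

docs = ["bos bullying", "interview leave alone", "leave bos alone alone"]  # toy cleaned tweets

# Vocabulary: every distinct word, sorted alphabetically, mapped to a column index.
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def to_count_vector(doc, vocabulary):
    """Return one row of the document-term matrix: a count per vocabulary word."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]  # dicts keep insertion order

matrix = [to_count_vector(doc, vocabulary) for doc in docs]
print(list(vocabulary))  # ['alone', 'bos', 'bullying', 'interview', 'leave']
print(matrix)            # [[0, 1, 1, 0, 0], [1, 0, 0, 1, 1], [2, 1, 0, 0, 1]]
```

Each row has one entry per vocabulary word, which is why the row length reported for the real data equals the vocabulary size.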

### Splitting dataset for training and testing

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### Model training

Using a Gaussian Naive Bayes classifier to classify the sentiment

In [50]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

### Testing with test data

In [53]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 1]
 ...
 [0 0]
 [1 1]
 [0 1]]


### Measuring accuracy and making confusion matrix

In [54]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1295  237]
 [1010  731]]


0.6190039718912312
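
The confusion matrix says more than the single accuracy figure. The sketch below recomputes accuracy by hand from the numbers printed above and adds precision and recall, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, class 0 first). The variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# with label 0 (negative) first and label 1 (positive) second.
cm = [[1295, 237],
      [1010, 731]]

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total       # share of all tweets classified correctly
precision_pos = tp / (tp + fp)     # of tweets predicted positive, share truly positive
recall_pos = tp / (tp + fn)        # of truly positive tweets, share found
recall_neg = tn / (tn + fp)        # of truly negative tweets, share found

print(f"accuracy        = {accuracy:.4f}")
print(f"precision (pos) = {precision_pos:.4f}")
print(f"recall (pos)    = {recall_pos:.4f}")
print(f"recall (neg)    = {recall_neg:.4f}")
```

Recall on the positive class is only about 42%: 1010 of the 1741 positive tweets in the test set are predicted as negative, so this model leans heavily toward the negative class.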

Now limiting the vocabulary to the 15000 most frequent words to see whether performance on the test data improves

In [61]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features= 15000)
X = vectorizer.fit_transform(word_list).toarray()
y = data.iloc[:, -1].values

In [62]:
len(X[1])

15000
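
`max_features` keeps only the terms with the highest total count across the corpus and drops the rest. A rough standard-library sketch of that selection on toy data follows; it illustrates the idea rather than scikit-learn's implementation, and the `limited_vocabulary` helper and `docs` list are made up for this example.

```python
from collections import Counter

docs = ["good good day", "bad day", "good morning", "bad bad bad traffic"]  # toy cleaned tweets

# Total count of each term over the whole corpus:
# bad = 4, good = 3, day = 2, morning = 1, traffic = 1
term_counts = Counter(word for doc in docs for word in doc.split())

def limited_vocabulary(term_counts, max_features):
    """Keep the max_features most frequent terms, returned in alphabetical order."""
    top_terms = [term for term, _ in term_counts.most_common(max_features)]
    return sorted(top_terms)

print(limited_vocabulary(term_counts, 3))  # ['bad', 'day', 'good']
```

Rare words such as 'morning' and 'traffic' are the first to go, so lowering `max_features` trims the long tail of words that appear in only a few tweets.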

In [63]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1285  247]
 [ 972  769]]


0.6275588145432325

Accuracy rises slightly, from 61.9% to 62.8%, when the vocabulary is limited to 15000 words, which suggests that the rarest words were contributing to overfitting. Reducing the vocabulary further to 12000 words:

In [64]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features= 12000)
X = vectorizer.fit_transform(word_list).toarray()
y = data.iloc[:, -1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1186  346]
 [ 503 1238]]


0.7406049495875344

With a 12000-word vocabulary, test accuracy reaches about 74%, up from about 62% with the full vocabulary.
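
Since the same vectorize, split, train and score steps are repeated for each vocabulary size, they can be folded into one loop. The sketch below mirrors the notebook's procedure (same split settings, `GaussianNB` on dense counts). The `accuracy_by_vocab_size` function and the toy data are hypothetical, added only so the example runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def accuracy_by_vocab_size(texts, labels, sizes, test_size=0.20, random_state=0):
    """Return {max_features: test accuracy} for GaussianNB trained on word counts."""
    results = {}
    for size in sizes:
        # max_features=None keeps the full vocabulary
        X = CountVectorizer(max_features=size).fit_transform(texts).toarray()
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=test_size, random_state=random_state)
        classifier = GaussianNB().fit(X_train, y_train)
        results[size] = accuracy_score(y_test, classifier.predict(X_test))
    return results

# Toy data (hypothetical) so the sketch is self-contained
toy_texts = [
    "love sunny day", "great fun friend", "happy birthday love",
    "good morning smile", "awesome great show", "fun happy weekend",
    "hate rainy day", "boss bullying sad", "miss sad alone",
    "bad headache tired", "awful boring show", "tired sick hate",
]
toy_labels = [1] * 6 + [0] * 6

print(accuracy_by_vocab_size(toy_texts, toy_labels, sizes=[5, 10, None], test_size=0.25))
```

On the notebook's data the call would look like `accuracy_by_vocab_size(word_list, y, sizes=[12000, 15000, None])`. Choosing the vocabulary size by test accuracy effectively tunes on the test set, so a separate validation split would give a more trustworthy estimate.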