# Mini Project 3 – Twitter Sentiment Analysis Using NLP and Python

**Scenario:** Text data holds insights that numeric data alone cannot provide, and Natural Language Processing (NLP) gives us the tools to extract them. Twitter is one of the largest platforms where people post messages, express their feelings about a topic, and share knowledge as text. Analyzing this text supports decisions in several use cases: for example, judging the sentiment of tweets, or using product reviews and comments to gauge how a product is performing in the market.


NLP lets us study the patterns in text data and relate them to a label, so that we can predict our target variable.

**Objective:** Use Python libraries (Pandas for data operations, Seaborn and Matplotlib for data visualization and EDA tasks, NLTK to extract and analyze the information, and Sklearn for model building and performance evaluation) to predict the sentiment category of each tweet.

## Dataset description: 
The data contain information about many Tweets in the form of text and their types, as mentioned below.

**Tweets:** Data is in the form of a sentence written by individuals.

**category:** Numeric(0: Neutral, -1: Negative, 1: Positive) (It is our dependent variable)

## Tasks to be performed:

The following tasks are to be performed:

- Read the data from the given Excel file.

- Change the dependent variable to categorical (0 to “Neutral”, -1 to “Negative”, 1 to “Positive”).

- Perform missing value analysis and drop all null/missing values.

- Clean the text (remove every non-alphanumeric symbol, convert all words to lower case, and remove punctuation and stopwords).

- Create a new column holding the length of each sentence (the number of words it contains).

- Split the data into independent (X) and dependent (y) variables.

- Perform operations on the text data.

- **Hints:**

    - Do one-hot encoding for each sentence (use TensorFlow)

    - Add padding from the front side (use TensorFlow)

    - Build an LSTM model and compile it (describe features, input length, vocabulary size, information drop-out layer, activation function for output)

    - Create dummy variables for the dependent variable

    - Split the data into train and test sets

- Train the new model.

- Normalize the predictions to match the format of the original data (predictions may be decimals, so the class whose value is nearest to 1 is predicted as 1 and the others are set to 0).

- Measure performance metrics and accuracy.

- Print the classification report.

# Importing all the required modules:-

In [1]:
import pandas as pd
import numpy as np
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings("ignore")




# Task-1:- 

**Read the Data from the Given excel file.**

In [2]:
# Step-1:- Import the dataset (provided here as a CSV file) into Twitter_Data_df:
Twitter_Data_df = pd.read_csv(r"Twitter_Data.csv")

# Step-2:- Checking the shape, columns and info of our Twitter_Data_df data:
print('Shape of the Twitter_Data_df Dataset is:- ',
      Twitter_Data_df.shape,'\n')

print("Columns of Twitter_Data_df Dataset is:- \n",
     Twitter_Data_df.columns,'\n')

print('The info of Twitter_Data_df Dataset is:- \n')
Twitter_Data_df.info()

Shape of the Twitter_Data_df Dataset is:-  (162980, 2) 

Columns of Twitter_Data_df Dataset is:- 
 Index(['clean_text', 'category'], dtype='object') 

The info of Twitter_Data_df Dataset is:- 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB


In [3]:
# Step-3:- Checking the description of our Twitter_Data_df data:

print('Description of our Twitter_Data_df data is:- ')
Twitter_Data_df.describe()

Description of our Twitter_Data_df data is:- 


|       | category |
|-------|----------|
| count | 162973.0 |
| mean  | 0.225436 |
| std   | 0.781279 |
| min   | -1.0 |
| 25%   | 0.0 |
| 50%   | 0.0 |
| 75%   | 1.0 |
| max   | 1.0 |


In [4]:
# Step-4:- Viewing the starting 10 records of Twitter_Data_df Data:

Twitter_Data_df.head(10)

|   | clean_text | category |
|---|------------|----------|
| 0 | when modi promised “minimum government maximum... | -1.0 |
| 1 | talk all the nonsense and continue all the dra... | 0.0 |
| 2 | what did just say vote for modi  welcome bjp t... | 1.0 |
| 3 | asking his supporters prefix chowkidar their n... | 1.0 |
| 4 | answer who among these the most powerful world... | 1.0 |
| 5 | kiya tho refresh maarkefir comment karo | 0.0 |
| 6 | surat women perform yagna seeks divine grace f... | 0.0 |
| 7 | this comes from cabinet which has scholars lik... | 0.0 |
| 8 | with upcoming election india saga going import... | 1.0 |
| 9 | gandhi was gay does modi | 1.0 |


# Task-2:- 

**Change our dependent variable to categorical. ( 0 to “Neutral,” -1 to “Negative”, 1 to “Positive”)**

In [5]:
# Step-1:- Checking out the unique values in our 'category' column:

print('The unique values in our category column are:- ',
      Twitter_Data_df['category'].unique())

print('View the total value:- \n', 
      Twitter_Data_df['category'].value_counts())

The unique values in our category column are:-  [-1.  0.  1. nan]
View the total value:- 
 category
 1.0    72250
 0.0    55213
-1.0    35510
Name: count, dtype: int64


In [6]:
# Step-2:- Change dependent variable to categorical:
Twitter_Data_df['category'] = Twitter_Data_df[
    'category'].map({0: 'Neutral', -1: 'Negative', 1: 'Positive'})

# Step-3:- Checking out the unique values in our 'category' column after change:
print('The values after Change dependent variable to categorical:- \n',
      Twitter_Data_df['category'].unique())

The values after Change dependent variable to categorical:- 
 ['Negative' 'Neutral' 'Positive' nan]


# Task-3:- 

**Do Missing value analysis and drop all null/missing values**

In [7]:
# Step-1:- Checking out the total null/missing values in our dataset:

Twitter_Data_df.isnull().sum()

clean_text    4
category      7
dtype: int64

In [8]:
# Step-2:- Drop all null/missing values present in our dataset:
Twitter_Data_df.dropna(inplace=True)

# Step-3:- Checking for NaN values in our dataset after dropping them:
Twitter_Data_df.isnull().sum()

clean_text    0
category      0
dtype: int64

# Task-4:- 

**Do text cleaning. (remove every symbol except alphanumeric, transform all words to lower case, and remove punctuation and stopwords )**

In [9]:
# Step-1:- Set the stopwords to 'english':
stop_words = set(stopwords.words('english'))

# Step-2:- Creating a function to do text cleaning:
def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

# Step-3:- Now apply the function to our 'clean_text' column:
Twitter_Data_df['cleaned_text'] = Twitter_Data_df['clean_text'].apply(clean_text)

# Step-4:- Viewing our Twitter_Data_df dataset after performing text cleaning is:
Twitter_Data_df.head(10)

|   | clean_text | category | cleaned_text |
|---|------------|----------|--------------|
| 0 | when modi promised “minimum government maximum... | Negative | modi promised minimum government maximum gover... |
| 1 | talk all the nonsense and continue all the dra... | Neutral | talk nonsense continue drama vote modi |
| 2 | what did just say vote for modi  welcome bjp t... | Positive | say vote modi welcome bjp told rahul main camp... |
| 3 | asking his supporters prefix chowkidar their n... | Positive | asking supporters prefix chowkidar names modi ... |
| 4 | answer who among these the most powerful world... | Positive | answer among powerful world leader today trump... |
| 5 | kiya tho refresh maarkefir comment karo | Neutral | kiya tho refresh maarkefir comment karo |
| 6 | surat women perform yagna seeks divine grace f... | Neutral | surat women perform yagna seeks divine grace n... |
| 7 | this comes from cabinet which has scholars lik... | Neutral | comes cabinet scholars like modi smriti hema t... |
| 8 | with upcoming election india saga going import... | Positive | upcoming election india saga going important p... |
| 9 | gandhi was gay does modi | Positive | gandhi gay modi |
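
The cleaning steps above can be tried on a single string without loading the dataset. The following is a minimal sketch, not the notebook's exact function: `DEMO_STOP_WORDS` and `clean_text_demo` are hypothetical names, and the tiny hand-picked stopword set is an assumption made so the example runs without the NLTK corpus (the notebook uses `stopwords.words('english')`, which requires a one-time `nltk.download('stopwords')`).

```python
import re

# Assumption: a small illustrative stopword set; the notebook uses NLTK's
# full English list instead.
DEMO_STOP_WORDS = {"the", "and", "all", "for", "is", "a", "to"}

def clean_text_demo(text):
    """Lowercase, keep only letters/digits/whitespace, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(word for word in text.split() if word not in DEMO_STOP_WORDS)

sample = "Talk ALL the nonsense, and continue all the drama!"
cleaned = clean_text_demo(sample)
print(cleaned)  # talk nonsense continue drama
```

Lowercasing has to happen before the stopword filter, because the stopword list is lowercase and "ALL" would otherwise slip through.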


# Task-5:- 

**Create a new column and find the length of each sentence (how many words they contain)**

In [10]:
# Step-1:- Checking the shape of our data before creating a new column:
print('Shape of the Twitter_Data_df Dataset before create a new column is:- \n',
      Twitter_Data_df.shape,'\n')

# Step-2:- Creating a function to count words in a sentence:
def count_words(cleaned_text):
    return len(cleaned_text.split())

# Step-3:- Create a new column as 'sentence_length' for length of each sentence:
Twitter_Data_df['sentence_length'] = Twitter_Data_df[
    'cleaned_text'].apply(count_words)

# Step-4:- Checking the shape of our data after creating a new column:
print('Shape of the Twitter_Data_df Dataset after create a new column is:- \n',
      Twitter_Data_df.shape)

# Step-5:- Viewing our Twitter_Data_df dataset after finding the length of each sentence:
Twitter_Data_df.head(10)

Shape of the Twitter_Data_df Dataset before create a new column is:- 
 (162969, 3) 

Shape of the Twitter_Data_df Dataset after create a new column is:- 
 (162969, 4)


|   | clean_text | category | cleaned_text | sentence_length |
|---|------------|----------|--------------|-----------------|
| 0 | when modi promised “minimum government maximum... | Negative | modi promised minimum government maximum gover... | 21 |
| 1 | talk all the nonsense and continue all the dra... | Neutral | talk nonsense continue drama vote modi | 6 |
| 2 | what did just say vote for modi  welcome bjp t... | Positive | say vote modi welcome bjp told rahul main camp... | 13 |
| 3 | asking his supporters prefix chowkidar their n... | Positive | asking supporters prefix chowkidar names modi ... | 19 |
| 4 | answer who among these the most powerful world... | Positive | answer among powerful world leader today trump... | 10 |
| 5 | kiya tho refresh maarkefir comment karo | Neutral | kiya tho refresh maarkefir comment karo | 6 |
| 6 | surat women perform yagna seeks divine grace f... | Neutral | surat women perform yagna seeks divine grace n... | 10 |
| 7 | this comes from cabinet which has scholars lik... | Neutral | comes cabinet scholars like modi smriti hema t... | 9 |
| 8 | with upcoming election india saga going import... | Positive | upcoming election india saga going important p... | 21 |
| 9 | gandhi was gay does modi | Positive | gandhi gay modi | 3 |


# Task-6:- 

**Split data into independent (X) and dependent (y) variables**

In [11]:
# Step-1:- Split data into independent (X) and dependent (y) variables:
X = Twitter_Data_df['cleaned_text']
y = Twitter_Data_df['category']

# Step-2:- To check the shape of X & y:
print("The shape of our X is:- ",X.shape)
print("The shape of our y is:- ",y.shape)

# Step-3:- Checking the X data:
X.head()

The shape of our X is:-  (162969,)
The shape of our y is:-  (162969,)


0    modi promised minimum government maximum gover...
1               talk nonsense continue drama vote modi
2    say vote modi welcome bjp told rahul main camp...
3    asking supporters prefix chowkidar names modi ...
4    answer among powerful world leader today trump...
Name: cleaned_text, dtype: object

In [12]:
# Step-4:- Checking the y data:

y.head()

0    Negative
1     Neutral
2    Positive
3    Positive
4    Positive
Name: category, dtype: object

# Task-7:- 

**Do operations on text data**

- **Hints:**

    - **Do one-hot encoding for each sentence (use TensorFlow)**

    - **Add padding from the front side (use TensorFlow)**

    - **Build an LSTM model and compile it (describe features, input length, vocabulary size, information drop-out layer, activation function for output)**

    - **Do dummy variable creation for the dependent variable**

    - **Split the data into train and test sets**

**Q-1:- Doing one-hot encoding for each sentence (use TensorFlow)**

In [13]:
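# Step-1:- Encode each sentence as a sequence of integer word indices.
# Note: Tokenizer assigns one integer per word (not a one-hot vector);
# the Embedding layer later maps these indices to dense vectors.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)
# Pre-pad with zeros so that every sequence has the same length:
X_Encoded = pad_sequences(X, padding='pre')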

# Step-2:- Checking the encoded data:
X_Encoded

array([[   0,    0,    0, ..., 4033, 5335, 2687],
       [   0,    0,    0, ...,  703,    9,    1],
       [   0,    0,    0, ...,   46,    1, 3861],
       ...,
       [   0,    0,    0, ..., 2472, 7332,  316],
       [   0,    0,    0, ...,  472,  285,  514],
       [   0,    0,    0, ...,   48,  366,  105]])
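
To make the encoding step concrete, here is a pure-Python sketch of the idea behind integer encoding: each distinct word gets an integer, with more frequent words receiving smaller indices and 0 kept free for padding. It only illustrates the concept under those assumptions; `build_word_index` and `texts_to_sequences_demo` are hypothetical helper names, not Keras functions, and the real `Tokenizer` also applies its own filtering and lowercasing.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer; frequent words get smaller indices.
    Index 0 is left unused so it can serve as the padding value."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences_demo(texts, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

docs = ["vote modi", "talk nonsense vote modi", "modi promised"]
word_index = build_word_index(docs)
sequences = texts_to_sequences_demo(docs, word_index)

print(word_index)  # {'modi': 1, 'vote': 2, 'talk': 3, 'nonsense': 4, 'promised': 5}
print(sequences)   # [[2, 1], [3, 4, 2, 1], [1, 5]]
```

A true one-hot encoding would instead turn each index into a vector as long as the vocabulary, which is why an integer sequence followed by an embedding layer is the more compact choice for a large vocabulary.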

**Q-2:- Adding padding from the front side (use Tensorflow)**

In [14]:
# Step-1:- Adding padding from the front side.
# X_Encoded was already pre-padded in the previous cell, so this call leaves it unchanged:
X_padded = pad_sequences(X_Encoded, padding='pre')

# Step-2:- Checking the padding data:
X_padded

array([[   0,    0,    0, ..., 4033, 5335, 2687],
       [   0,    0,    0, ...,  703,    9,    1],
       [   0,    0,    0, ...,   46,    1, 3861],
       ...,
       [   0,    0,    0, ..., 2472, 7332,  316],
       [   0,    0,    0, ...,  472,  285,  514],
       [   0,    0,    0, ...,   48,  366,  105]])
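
Front ("pre") padding simply prepends zeros until every sequence is as long as the longest one, which is why the arrays above start with runs of 0. A minimal sketch of that behavior, assuming no maximum length is imposed; `pad_pre` is a hypothetical helper written for illustration, not the Keras function:

```python
def pad_pre(sequences, value=0):
    """Left-pad each sequence with `value` up to the longest sequence length."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]

padded = pad_pre([[2, 1], [3, 4, 2, 1], [1, 5]])
print(padded)  # [[0, 0, 2, 1], [3, 4, 2, 1], [0, 0, 1, 5]]
```

Padding at the front keeps the real words at the end of the sequence, closest to the final LSTM step whose output feeds the classifier.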

**Q-3:- Build an LSTM model and compile it (describe features, input length, vocabulary size, information drop-out layer, activation function for output, )**

In [15]:
# Step-1:- Build our LSTM model:

vocab_size = len(tokenizer.word_index) + 1
input_length = X_Encoded.shape[1]

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=100, 
                    input_length=input_length))
model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=3, activation='softmax'))

# Step-2:- Compile our LSTM Model:
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])





**Q-4:- Doing dummy variable creation for the dependent variable**

In [16]:
# Step-1:- Create dummy (one-hot) variables for the dependent variable.
# get_dummies returns only the indicator columns, so no concat/drop is needed:
y_dummy = pd.get_dummies(y).astype(int)

# Step-2:- Checking the dummy data:
y_dummy

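|        | Negative | Neutral | Positive |
|--------|----------|---------|----------|
| 0      | 1 | 0 | 0 |
| 1      | 0 | 1 | 0 |
| 2      | 0 | 0 | 1 |
| 3      | 0 | 0 | 1 |
| 4      | 0 | 0 | 1 |
| ...    | ... | ... | ... |
| 162975 | 1 | 0 | 0 |
| 162976 | 1 | 0 | 0 |
| 162977 | 0 | 1 | 0 |
| 162978 | 0 | 1 | 0 |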
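
The column order of the dummy variables matters later, because the class indices 0, 1 and 2 reported by the model refer to these columns. The sketch below reproduces the idea in pure Python, assuming the columns are ordered alphabetically as in the output above; `one_hot_labels` is a hypothetical helper, not a pandas function.

```python
def one_hot_labels(labels):
    """Return the sorted class names and one indicator row per label."""
    classes = sorted(set(labels))  # alphabetical column order
    position = {name: i for i, name in enumerate(classes)}
    rows = [[1 if position[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

classes, rows = one_hot_labels(["Negative", "Neutral", "Positive", "Positive"])
print(classes)  # ['Negative', 'Neutral', 'Positive']
print(rows)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
```

With this ordering, index 0 corresponds to Negative, 1 to Neutral and 2 to Positive.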


**Q-5:- split the data into tests and train** 

In [17]:
# Step-1:- Split our dataset into train and test datasets:
X_train, X_test, y_train, y_test = train_test_split(X_padded, y_dummy,
                            test_size=0.2, random_state= 7)


# Step-2:- To check the shape of "X_train", "X_test" & "y_train", "y_test":
print("The shape of X_train is:- ", X_train.shape)
print("The shape of X_test is:- ", X_test.shape)
print("The shape of y_train is:- ", y_train.shape)
print("The shape of y_test is:- ", y_test.shape)

The shape of X_train is:-  (130375, 43)
The shape of X_test is:-  (32594, 43)
The shape of y_train is:-  (130375, 3)
The shape of y_test is:-  (32594, 3)


# Task-8:- 

**Train new model**

In [18]:
# Step-1:- Train a new model:

model.fit(X_train, y_train, epochs=10, 
          batch_size=32, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x19e3f648bd0>

# Task-9:- 

**Normalize the prediction as same as the original data(prediction might be in decimal, so whoever is nearest to 1 is predicted as yes and set other as 0)**

In [19]:
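# Step-1:- Predict class probabilities on the test set:
y_pred = model.predict(X_test)

# Step-2:- Round each probability to 0 or 1:
y_pred_rounded = np.round(y_pred)

# Step-3:- Convert each rounded row to a class index (0, 1 or 2).
# Note: if no probability in a row exceeds 0.5, the row rounds to all zeros
# and argmax returns 0; np.argmax(y_pred, axis=1) avoids this edge case.
y_pred_normalized = np.argmax(y_pred_rounded, axis=1)

# Step-4:- Printing out the normalized data:
y_pred_normalized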



array([1, 0, 1, ..., 2, 2, 2], dtype=int64)
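
Rounding before taking the argmax has one edge case worth seeing in isolation: with three classes, the largest softmax probability can be below 0.5, in which case every value rounds to 0 and the first index wins by default. The sketch below uses made-up probabilities and a hypothetical `argmax` helper in pure Python to show the difference; it is an illustration, not a re-run of the notebook's model.

```python
def argmax(row):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(row)), key=lambda i: row[i])

probs = [0.30, 0.45, 0.25]            # made-up softmax output, no value above 0.5
rounded = [round(p) for p in probs]   # every entry rounds down to 0

idx_after_round = argmax(rounded)     # tie between all zeros, so index 0
idx_direct = argmax(probs)            # index of the largest probability

print(rounded, idx_after_round, idx_direct)  # [0, 0, 0] 0 1
```

Taking the argmax of the raw probabilities always selects the most likely class, so it is the safer way to "normalize" a softmax output.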

# Task-10:- 

**Measure performance metrics and accuracy**

In [20]:
# Step-1:- Measure performance metrics and accuracy:
accuracy = accuracy_score(np.argmax(y_test, axis=1), y_pred_normalized)

# Step-2:- Printing out the accuracy score:
print("The accuracy score is:- ", accuracy)

The accuracy score is:-  0.8750997116033625


# Task-11:- 

**print Classification report**

In [21]:
# Step-1:- Printing out the Classification report:

print(classification_report(np.argmax(y_test, axis=1), y_pred_normalized))

              precision    recall  f1-score   support

           0       0.81      0.81      0.81      7033
           1       0.92      0.88      0.90     11052
           2       0.88      0.90      0.89     14509

    accuracy                           0.88     32594
   macro avg       0.87      0.86      0.87     32594
weighted avg       0.88      0.88      0.88     32594
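
The class labels 0, 1 and 2 in the report follow the dummy-variable column order, so they stand for Negative, Neutral and Positive. To show what the precision, recall and f1-score columns mean, here is a small pure-Python sketch on made-up labels; `precision_recall_f1`, `CLASS_NAMES` and the demo lists are hypothetical and are not taken from the notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred, cls):
    """Per-class precision, recall and F1 from true/predicted label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

CLASS_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}
y_true_demo = [0, 0, 1, 1, 2, 2, 2, 2]   # made-up ground truth
y_pred_demo = [0, 1, 1, 1, 2, 2, 0, 2]   # made-up predictions

for cls, name in CLASS_NAMES.items():
    p, r, f = precision_recall_f1(y_true_demo, y_pred_demo, cls)
    print(f"{name:<9} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision answers "of the tweets predicted as this class, how many were right", while recall answers "of the tweets that truly belong to this class, how many were found".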



# Submitted by Biswakant Nayak