<a href="https://colab.research.google.com/github/chomssky/chomssky/blob/main/nlp.sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **NLP PROJECT: SENTIMENT ANALYSIS USING TF-IDF VECTORS**


# **Project Introduction**

In the age of social media, the ability to understand public sentiment through textual data has become increasingly important. This project aims to design and implement a machine learning model that performs sentiment analysis on a dataset containing tweets. By utilizing TF-IDF (Term Frequency-Inverse Document Frequency) vectors, the project seeks to transform raw textual data into numerical features, enabling effective classification of sentiments. The insights gained from this analysis can be valuable for businesses, researchers, and policymakers to gauge public opinion and respond accordingly.

# **Problem Statement**

The primary challenge addressed in this project is the need for an effective method to analyze and classify sentiments expressed in tweets. Traditional methods of sentiment analysis often struggle with the nuances of language, especially in informal contexts like social media. This project will leverage TF-IDF vectors to enhance the representation of textual data, thereby improving the accuracy of sentiment classification.

# **Methodology**

The methodology for this project comprises the following steps:

*   Data Preparation: The provided dataset was unzipped and loaded from its CSV files. The data was then cleaned by removing irrelevant information and handling missing values.
*   Feature Extraction: The TF-IDF technique was used to convert the tweet text into numerical features. This involved calculating the term frequency and inverse document frequency for each term in the dataset.
*   Model Development: The dataset was split into training and test sets. Machine learning algorithms, including Logistic Regression and Support Vector Machines, were implemented to classify the sentiments based on the TF-IDF features.
*   Model Evaluation: The model's performance was evaluated using accuracy, precision, recall, and F1-score.
*   Result Interpretation: The results were analyzed to draw insights regarding the sentiments expressed in the tweets.

# **Data Description**
The dataset utilized for this sentiment analysis project is in CSV format and contains six fields:

*   Polarity (Column 0): Sentiment label for the tweet (0: Negative, 2: Neutral, 4: Positive).
*   Tweet ID (Column 1): A unique identifier for each tweet.
*   Date (Column 2): Timestamp of when the tweet was posted.
*   Query (Column 3): Search query used to retrieve the tweet (or NO_QUERY if none).
*   User (Column 4): Username of the account that posted the tweet.
*   Text (Column 5): The content of the tweet, consisting of raw text.

# **Import and set up the required libraries**

In [1]:
# Import the required libraries for nlp
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import zipfile as zf

# Unzip the Dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')
!unzip "/content/drive/MyDrive/archive (9).zip" -d "/content/"

Mounted at /content/drive
Archive:  /content/drive/MyDrive/archive (9).zip
  inflating: /content/test.csv       
  inflating: /content/testdata.manual.2009.06.14.csv  
  inflating: /content/train.csv      
  inflating: /content/training.1600000.processed.noemoticon.csv  


# Load the dataset

In [3]:
# Load the dataset for the test data
df = pd.read_csv('/content/test.csv', encoding='latin-1')

# Show the first 5 rows of the test dataset
df.head()

Unnamed: 0,textID,text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,morning,0-20,Afghanistan,38928346.0,652860.0,60.0
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,noon,21-30,Albania,2877797.0,27400.0,105.0
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,night,31-45,Algeria,43851044.0,2381740.0,18.0
3,01082688c6,happy bday!,positive,morning,46-60,Andorra,77265.0,470.0,164.0
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,noon,60-70,Angola,32866272.0,1246700.0,26.0


In [29]:
# Load the training dataset
df1 = pd.read_csv('/content/train.csv', encoding='latin-1')

# Show the first 5 rows of the training dataset
df1.head()


Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


In [30]:
# Load the testdataset.manual.2009
df2 = pd.read_csv('/content/testdata.manual.2009.06.14.csv', encoding='latin-1')

# Show the first 5 rows of the testdataset.manual.2009
df2.head()

Unnamed: 0,4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right."
0,4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
1,4,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
2,4,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
3,4,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...
4,4,8,Mon May 11 03:22:00 UTC 2009,kindle2,GeorgeVHulme,@richardebaker no. it is too big. I'm quite ha...
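
In the output above, the first tweet appears where the column names should be, which suggests that this file has no header row. A minimal sketch of one way to load such a file is shown here, assuming the six-field layout from the Data Description. The column labels in `COLUMNS` are hypothetical names chosen for illustration, and the two sample rows are taken (abridged) from the output above so that the snippet runs without the original file.

```python
import io

import pandas as pd

# Hypothetical column labels, following the six fields in the Data Description
COLUMNS = ["polarity", "tweet_id", "date", "query", "user", "text"]

# Two rows in the same layout as the file, with no header line
sample = io.StringIO(
    '4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,'
    '"@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, '
    'but the 2 is fantastic in its own right."\n'
    '4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,'
    '"Reading my kindle2... Love it..."\n'
)

# header=None stops pandas from consuming the first tweet as column names;
# names=COLUMNS supplies the labels instead
df2_fixed = pd.read_csv(sample, header=None, names=COLUMNS)
print(df2_fixed[["polarity", "user", "text"]])
```

With the real file, the same `header=None` and `names=` arguments would be passed alongside the file path and `encoding='latin-1'`.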


In [31]:
# Load the training.1600000.processed.noemoticon
df3 = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='latin-1')

# Show the first 5 rows of the training.1600000.processed.noemoticon
df3.head()


Unnamed: 0,polarity of tweet,id of the tweet,date of the tweet,query,user,text of the tweet
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


**Observations**

The zip file containing the CSV datasets was unzipped, and the four (4) datasets were loaded and read.


# **Explore the Dataset**

In [47]:
# Check the structure of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3535 entries, 0 to 3534
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   textID            3534 non-null   object 
 1   text              3534 non-null   object 
 2   sentiment         3534 non-null   object 
 3   Time of Tweet     3534 non-null   object 
 4   Age of User       3534 non-null   object 
 5   Country           3534 non-null   object 
 6   Population -2020  3534 non-null   float64
 7   Land Area (Km²)   3534 non-null   float64
 8   Density (P/Km²)   3534 non-null   float64
dtypes: float64(3), object(6)
memory usage: 276.2+ KB


In [5]:
# Check the shape of the dataset
df.shape

(4815, 9)

**Observations**

The dataset contains nine (9) variables: 'textID', 'text', 'sentiment', 'Time of Tweet', 'Age of User', 'Country', 'Population -2020', 'Land Area (Km²)', and 'Density (P/Km²)'. Six of them are objects and three (3) are floats. As loaded, the dataset has 4,815 rows and nine (9) columns (the `df.shape` output). The `df.info()` output lists 3,535 entries, which appears to reflect the dataset after the 1,280 duplicate rows are dropped in a later cell.

In [42]:
# Check the dataset for missing values
df.isnull().sum()

Unnamed: 0,0
textID,1281
text,1281
sentiment,1281
Time of Tweet,1281
Age of User,1281
Country,1281
Population -2020,1281
Land Area (Km²),1281
Density (P/Km²),1281


In [43]:
# Check for duplicate values
df.duplicated().sum()

np.int64(1280)

In [44]:
# Drop the duplicate rows
df.drop_duplicates(inplace=True)
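
The counts above (1,281 missing values in every column and 1,280 duplicates) are consistent with a block of completely empty rows: pandas treats empty rows as duplicates of each other, so `drop_duplicates` would keep one of them. The following is a small sketch on a hypothetical toy frame, not the project data, showing that behaviour and how `dropna(how="all")` could remove the remaining empty row.

```python
import numpy as np
import pandas as pd

# Toy frame: two distinct rows, one repeated row, and three completely empty rows
toy = pd.DataFrame(
    {
        "textID": ["a1", "b2", "a1", np.nan, np.nan, np.nan],
        "text": ["happy bday!", "i like it", "happy bday!", np.nan, np.nan, np.nan],
    }
)

# Empty rows count as duplicates of the first empty row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates keeps the first copy of each row, including one empty row
deduped = toy.drop_duplicates()

# dropna(how="all") removes rows in which every column is missing
cleaned = deduped.dropna(how="all")

print(n_duplicates, len(deduped), len(cleaned))
```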

# **Preprocess the Text Data**

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract features and labels
X = df1['text']  # The tweets
y = df1['sentiment']  # The sentiment labels

# Preprocess the text data - Convert to lowercase
X = X.str.lower()
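
The sample rows shown earlier contain URLs and @-mentions, which lowercasing alone leaves in place. As a hedged illustration of a further cleaning step, the sketch below strips both with regular expressions. The function name `clean_tweet` and the two patterns are assumptions made for this example and are not part of the original notebook.

```python
import re

# Illustrative patterns: web links and @username mentions
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and remove URLs and @-mentions."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


print(clean_tweet("http://twitpic.com/4w75p - I like it!!"))
print(clean_tweet("@Kwesidei not the whole crew"))
```

If adopted, a function like this could be applied to the text column with `X.fillna('').apply(clean_tweet)` before vectorizing.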

# **Transform Text Data into TF-IDF Vectors**

In [46]:
# Create a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Adjust max_features as needed

# Fill any potential NaN values with an empty string
X = X.fillna('')

# Fit and transform the text data
X_tfidf = tfidf_vectorizer.fit_transform(X)

# **Split the Data**

In [39]:
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
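
In the cells above, the vectorizer is fitted on all tweets before the split, so the IDF weights are computed with the test text included. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below shows one way to do that with a scikit-learn `Pipeline`; the toy tweets and labels are hypothetical, and `stratify` and `max_iter=1000` are choices made for this example rather than settings from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four tweets per class
texts = [
    "i hate this rainy day", "my boss is bullying me",
    "so sad i will miss you", "this is the worst week",
    "last session of the day", "going to the store now",
    "reading a book at home", "the meeting is at noon",
    "happy bday to you", "i like it a lot",
    "love my new phone", "what a great morning",
]
labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4

# Split the raw text first; stratify keeps the class proportions in both parts
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(max_features=5000)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
pipeline.fit(text_train, label_train)

# At prediction time the test text is transformed with the training vocabulary
predictions = pipeline.predict(text_test)
print(list(zip(text_test, predictions)))
```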


# **Train a Machine Learning Model**

In [40]:
from sklearn.linear_model import LogisticRegression

# Create a model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)
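
The methodology also mentions Support Vector Machines, while the cell above trains only logistic regression. As a hedged sketch of how a linear SVM could be fitted on the same kind of TF-IDF features, the example below uses `LinearSVC` on a hypothetical toy corpus. It is an illustration of the technique, not the configuration used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data for illustration only
toy_texts = [
    "my boss is bullying me", "so sad i will miss you",
    "last session of the day", "the meeting is at noon",
    "happy bday to you", "love my new phone",
]
toy_labels = ["negative", "negative", "neutral", "neutral", "positive", "positive"]

# Turn the text into TF-IDF features
svm_vectorizer = TfidfVectorizer()
toy_features = svm_vectorizer.fit_transform(toy_texts)

# Fit a linear support vector classifier on the TF-IDF features
svm_model = LinearSVC()
svm_model.fit(toy_features, toy_labels)

# Predict on the training texts, only to show the call pattern
svm_predictions = svm_model.predict(toy_features)
print(list(svm_predictions))
```

In the notebook's own variables, the equivalent calls would be `LinearSVC().fit(X_train, y_train)` followed by `predict(X_test)`, evaluated with the same metrics as the logistic regression model.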


# **Evaluate the Model**

In [41]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.69674367837002
Classification Report:
               precision    recall  f1-score   support

    negative       0.73      0.61      0.67      1562
     neutral       0.63      0.76      0.69      2230
    positive       0.79      0.70      0.74      1705

    accuracy                           0.70      5497
   macro avg       0.72      0.69      0.70      5497
weighted avg       0.71      0.70      0.70      5497
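
The classification report does not show which classes are confused with each other. A confusion matrix can fill that gap. The sketch below builds one with `confusion_matrix` on hypothetical toy labels; the values are made up for illustration and are not results from this model.

```python
from sklearn.metrics import confusion_matrix

class_names = ["negative", "neutral", "positive"]

# Hypothetical true and predicted labels for seven tweets
y_true_toy = ["negative", "negative", "neutral", "neutral", "neutral", "positive", "positive"]
y_pred_toy = ["negative", "neutral", "neutral", "neutral", "positive", "positive", "neutral"]

# Rows are the true classes, columns are the predicted classes,
# both in the order given by class_names
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=class_names)
print(cm)
```

With the notebook's variables, the call would be `confusion_matrix(y_test, y_pred, labels=class_names)`.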



# **Summary**
This project implements a logistic regression model for sentiment analysis using TF-IDF vectors. The model classifies tweets as negative, neutral, or positive based on their textual content. On the held-out test split of 5,497 tweets it reaches about 70% accuracy and a macro-averaged F1-score of 0.70, which is a moderate result that leaves room for improvement.

**Evaluation metrics adopted**

Accuracy: This metric shows the proportion of correctly classified instances out of the total instances. The model achieves an overall accuracy of 0.6967, meaning that approximately 70% of its predictions on the test set are correct. This is a decent result, but it leaves room for improvement in the model's performance.

Precision: This metric shows the ratio of true positive predictions to the total predicted positives. It indicates how many of the predicted positive instances were actually positive.

Recall (Sensitivity): It shows the ratio of true positive predictions to the total actual positives. It measures the model's ability to identify all relevant instances.

F1 Score: It shows the harmonic mean of precision and recall, providing a balance between the two metrics.
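
The three definitions above can be checked by hand for a single class. The sketch below counts true positives, false positives, and false negatives for the "neutral" class on hypothetical toy labels (not this model's predictions) and compares the results with scikit-learn's `precision_recall_fscore_support`.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels for seven tweets
true_labels = ["negative", "negative", "neutral", "neutral", "neutral", "positive", "positive"]
pred_labels = ["negative", "neutral", "neutral", "neutral", "positive", "positive", "neutral"]
target = "neutral"

pairs = list(zip(true_labels, pred_labels))
tp = sum(t == target and p == target for t, p in pairs)  # correctly predicted neutral
fp = sum(t != target and p == target for t, p in pairs)  # predicted neutral, actually not
fn = sum(t == target and p != target for t, p in pairs)  # actual neutral, missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Same quantities from scikit-learn, restricted to the target class
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(
    true_labels, pred_labels, labels=[target], average=None
)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```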

**Classification Report Breakdown**

The classification report provides precision, recall, F1-score, and support for each sentiment category (negative, neutral, positive).

1. **Negative Sentiment**

Precision: 0.73

Of all the instances predicted as negative, 73% were actually negative. This indicates a relatively good ability to identify negative sentiments without misclassifying too many neutral or positive sentiments.

Recall: 0.61

Out of all actual negative sentiments, the model correctly identified 61%. This suggests that the model misses some negative sentiments (39% of actual negatives are not identified).

F1-score: 0.67

The F1-score is a balance between precision and recall. A score of 0.67 indicates a moderate performance in identifying negative sentiments, reflecting the trade-off between precision and recall.

Support: 1562

This is the number of actual instances of negative sentiment in the dataset.

2. **Neutral Sentiment**

Precision: 0.63

The model's precision for neutral sentiments is lower than for negative sentiments, meaning it misclassifies some neutral tweets as negative or positive.

Recall: 0.76

The model correctly identifies 76% of the actual neutral sentiments, which is relatively high. This indicates that the model is better at capturing neutral sentiments compared to negative ones.

F1-score: 0.69

The F1-score here reflects a decent balance, with the model performing better in recall than precision for neutral sentiments.

Support: 2230

The number of actual neutral instances.

3. **Positive Sentiment**

Precision: 0.79

The model has its highest precision on positive sentiments: 79% of the tweets it predicts as positive are actually positive.

Recall: 0.70

It correctly identifies 70% of the actual positive sentiments, indicating that it misses 30% of them.

F1-score: 0.74

This score indicates a strong performance in classifying positive sentiments, balancing precision and recall effectively.

Support: 1705

The number of actual positive instances.

**Averages**

Macro Average:

Precision: 0.72

Recall: 0.69

F1-score: 0.70

The macro average treats all classes equally, giving an overall sense of model performance across all sentiment categories.

Weighted Average:

Precision: 0.71

Recall: 0.70

F1-score: 0.70

The weighted average takes into account the support (number of instances) for each class, providing a more nuanced view of performance relative to the distribution of classes in the dataset.
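
As a quick check of how the two averages are formed, the sketch below recomputes them from the per-class values in the classification report. Because the inputs are the rounded figures from the report, the results agree with the reported averages only to two decimal places.

```python
# Per-class (precision, recall, f1, support) as printed in the classification report
report = {
    "negative": (0.73, 0.61, 0.67, 1562),
    "neutral": (0.63, 0.76, 0.69, 2230),
    "positive": (0.79, 0.70, 0.74, 1705),
}

total_support = sum(row[3] for row in report.values())

# Macro average: plain mean over classes, ignoring class size
macro = [sum(row[i] for row in report.values()) / len(report) for i in range(3)]

# Weighted average: each class contributes in proportion to its support
weighted = [
    sum(row[i] * row[3] for row in report.values()) / total_support
    for i in range(3)
]

print("macro   :", [round(value, 2) for value in macro])
print("weighted:", [round(value, 2) for value in weighted])
```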

# **Conclusion**
The sentiment analysis model developed in this project provides a baseline framework for understanding public sentiment expressed in tweets. By combining TF-IDF feature extraction with a logistic regression classifier, the project takes a first step toward handling the informal language found on social media.

**Strengths:** The model performs well in identifying positive sentiments (high precision and decent recall) and has a reasonable precision for negative sentiments.

**Weaknesses:** Recall for negative sentiments is the lowest of the three classes (0.61), indicating that the model misses a substantial share of negative instances. Precision for neutral sentiments is also the lowest (0.63), suggesting that tweets from the negative and positive classes are often labeled as neutral.

# **Recommendations for Action**
*   Deployment: It is recommended to implement the model as a web application or API to allow users to analyze sentiments in real time.
*   Model Improvement: To improve the model performance, it is also recommended to experiment with more complex models, such as ensemble methods and deep learning, to improve recall and precision, particularly for negative sentiments.
*   Feature Engineering: It is advisable to explore additional features or different text representation techniques, including word embeddings, to enhance the model's understanding of sentiment.
*   User Feedback: Gather feedback from users to refine the model and enhance its accuracy and usability.
*   Broader Dataset: Consider expanding the dataset to include tweets from various sources and languages to improve the model's generalizability.
*   Error Analysis: Finally, it is recommended to conduct a thorough analysis of misclassified instances to understand common patterns and improve the model accordingly.






