## Project: Sentiment Analysis using NLP

-------------

## **Objective**

In this project, we perform sentiment analysis on a small dataset of movie reviews. The goal is to classify each review as positive, negative, or neutral using Natural Language Processing (NLP) and a machine learning classification model. The project involves several steps: data preprocessing, feature extraction using TF-IDF, training a machine learning model, and evaluating its performance.


## Step 1: Import Required Libraries

We start by importing the Python libraries used for data processing, visualization, and machine learning. These include:

- `pandas` for data handling and manipulation.
- `sklearn` modules for data splitting, feature extraction, model training, and evaluation.
- `nltk` for natural language preprocessing tasks like tokenization, stopword removal, and lemmatization.

`numpy`, `matplotlib`, and `seaborn` are imported as well, although the cells in this notebook do not call them directly.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## **Import Data**

##  Load Sample Dataset

For this project, we are using a **sample dataset** consisting of 10 text reviews labeled as `positive`, `negative`, or `neutral`.  
This allows us to demonstrate the entire workflow without needing to upload any external file.  
Once you're familiar with the code, you can later replace this sample with your own dataset.

## Step 2: Creating Sample Dataset

We define a sample dataset as a Python dictionary and convert it into a pandas DataFrame. The dataset consists of short text reviews and their corresponding sentiment labels, which will be used for supervised learning.

- `text`: the review content.
- `sentiment`: the labeled class (positive, negative, or neutral).


In [None]:
# Sample sentiment dataset (10 examples)
data = {
    "text": [
        "I loved the movie! It was fantastic and had great acting.",
        "Absolutely terrible. Waste of time and money.",
        "The plot was interesting but the execution was boring.",
        "What an amazing experience, I would watch it again!",
        "I hated it. The story made no sense and it was too long.",
        "Best film of the year! Highly recommended.",
        "Not good, not bad, just average.",
        "Disappointing ending to an otherwise good film.",
        "It was okay, but I've seen better.",
        "Incredible visuals and strong performances!"
    ],
    "sentiment": [
        "positive", "negative", "negative", "positive", "negative",
        "positive", "neutral", "negative", "neutral", "positive"
    ]
}

# Load into DataFrame
df = pd.DataFrame(data)
df.head()


| | text | sentiment |
|---|---|---|
| 0 | I loved the movie! It was fantastic and had gr... | positive |
| 1 | Absolutely terrible. Waste of time and money. | negative |
| 2 | The plot was interesting but the execution was... | negative |
| 3 | What an amazing experience, I would watch it a... | positive |
| 4 | I hated it. The story made no sense and it was... | negative |
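
Before modelling, it can help to check how the labels are distributed. The sketch below counts the sentiment labels of the sample dataset using only the standard library; the label list is copied from the dataset cell, and the variable names are illustrative.

```python
from collections import Counter

# Labels copied from the sample dataset, in row order.
sentiments = [
    "positive", "negative", "negative", "positive", "negative",
    "positive", "neutral", "negative", "neutral", "positive",
]

counts = Counter(sentiments)
total = len(sentiments)

# Print each class with its count and share of the dataset.
for label, n in counts.most_common():
    print(f"{label:<9} {n:>2}  ({n / total:.0%})")
```

The `neutral` class has only two examples, so any train/test split will leave the model with very few neutral reviews to learn from.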


## Step 3: Data Preprocessing

This step cleans and prepares the text data for analysis. The code cell in this step:

- Converts text to lowercase.
- Tokenizes it with NLTK's `word_tokenize` and keeps only purely alphabetic tokens, which drops punctuation, digits, and special characters.
- Removes common English stopwords using NLTK.
- Applies lemmatization with the `WordNetLemmatizer` to reduce words to their dictionary form.

These operations help standardize the text and reduce noise in the data.


In [None]:
# Download required NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that word_tokenize looks up in recent NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Define stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define the preprocessing function
def preprocess(text):
    try:
        # Lowercase and tokenize, then keep alphabetic, non-stopword tokens
        tokens = word_tokenize(str(text).lower())
        tokens = [lemmatizer.lemmatize(w) for w in tokens if w.isalpha() and w not in stop_words]
        return " ".join(tokens)
    except Exception as e:
        # A failure here (for example a missing NLTK resource) yields an empty string
        print("Error during preprocessing:", e)
        return ""

# Apply the preprocessing to the text column
df['clean_text'] = df['text'].apply(preprocess)

# Display original and cleaned text
df[['text', 'clean_text']].head()


## Step 4: Feature Extraction using TF-IDF

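We use the `TfidfVectorizer` from `sklearn` to convert the cleaned text into numerical features.

The code cell in this step first rebuilds `clean_text` with a simpler regex-based cleaner that does not depend on the NLTK tokenizer: it lowercases the text, strips everything except letters and whitespace, and removes stopwords. The vectorizer itself is fitted in the next step's cell.

- TF-IDF (Term Frequency-Inverse Document Frequency) weights each term by how often it appears in a document, scaled down by the number of documents that contain it. This highlights distinctive words while reducing the impact of common but less informative ones.

The output is a sparse matrix representation of the text, suitable for machine learning models.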
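
To make the weighting concrete, the sketch below computes TF-IDF by hand in plain Python. It follows the formula that `TfidfVectorizer` uses with its default settings (smoothed IDF, then L2 normalisation of each row), but it is a simplified illustration rather than the library's implementation. The three toy documents and the helper names `idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Three toy documents (hypothetical, already cleaned and whitespace-tokenized).
docs = [
    "great movie great acting",
    "terrible movie",
    "average movie average plot",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Raw term count times IDF, then scaled so the vector has unit L2 norm.
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Because "movie" occurs in every document, its IDF is the minimum value of 1, while terms that occur in a single document receive a larger weight.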


In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to clean the text
def clean_text(text):
    text = text.lower()                                # lowercase
    text = re.sub(r'[^a-z\s]', '', text)               # remove punctuation/numbers
    words = text.split()                               # tokenize
    words = [word for word in words if word not in stop_words]  # remove stopwords
    return ' '.join(words)

# Apply cleaning
df['clean_text'] = df['text'].apply(clean_text)




In [None]:
print(df[['text', 'clean_text']].head())


                                                text  \
0  I loved the movie! It was fantastic and had gr...   
1      Absolutely terrible. Waste of time and money.   
2  The plot was interesting but the execution was...   
3  What an amazing experience, I would watch it a...   
4  I hated it. The story made no sense and it was...   

                             clean_text  
0    loved movie fantastic great acting  
1  absolutely terrible waste time money  
2     plot interesting execution boring  
3        amazing experience would watch  
4           hated story made sense long  


## Step 5: Splitting Data

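We split the feature matrix (`X`) and corresponding labels (`y`) into training and testing sets using `train_test_split`, so that the model is evaluated on reviews it has not seen during training. The code cell in this step builds `X` with TF-IDF; the split itself is performed in a later cell, just before training.

- 80% of the data is used for training (8 of the 10 reviews).
- 20% is used for testing (2 of the 10 reviews).

A random seed (`random_state=42`) ensures reproducibility of the split.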
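
The effect of an 80/20 split on ten reviews can be sketched without scikit-learn. The code below is a plain-Python stand-in: it shuffles the row indices with a fixed seed and holds out 20% of them. It will not reproduce the exact rows that `train_test_split` selects, and the variable names are illustrative.

```python
import random
from collections import Counter

# Labels copied from the sample dataset, in row order.
sentiments = [
    "positive", "negative", "negative", "positive", "negative",
    "positive", "neutral", "negative", "neutral", "positive",
]

test_fraction = 0.2
indices = list(range(len(sentiments)))

# A seeded generator makes the shuffle, and therefore the split, repeatable.
rng = random.Random(42)
rng.shuffle(indices)

n_test = max(1, round(test_fraction * len(indices)))
test_idx = sorted(indices[:n_test])
train_idx = sorted(indices[n_test:])

print("train size:", len(train_idx), Counter(sentiments[i] for i in train_idx))
print("test size: ", len(test_idx), Counter(sentiments[i] for i in test_idx))
```

With only two held-out reviews, at least one of the three classes is necessarily missing from the test set, so per-class metrics on such a split should be read with caution.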


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])


## Step 6: Model Training

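We train a Multinomial Naive Bayes classifier (`MultinomialNB`), a simple probabilistic model that is widely used for text classification. The code cells in this step prepare its inputs: they drop empty rows, vectorize the text, and split the data. The call to `model.fit` appears in the Step 7 cell.

This step involves learning the parameters that map the input features to sentiment classes.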
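
The classifier fitted in this notebook's final code cell is scikit-learn's `MultinomialNB`. As a rough illustration of what such a model learns, the sketch below implements multinomial Naive Bayes with add-one smoothing on raw word counts in plain Python. It is a simplified stand-in, not scikit-learn's implementation; the toy training sentences and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = {w for t in texts for w in t.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Add-alpha smoothing keeps unseen (class, word) pairs from getting zero probability.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Words that never appeared in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in text.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

train_texts = [
    "great fantastic acting",
    "amazing great film",
    "terrible boring plot",
    "boring waste time",
]
train_labels = ["positive", "positive", "negative", "negative"]

log_prior, log_like = train_nb(train_texts, train_labels)
print(predict_nb("great film", log_prior, log_like))
print(predict_nb("boring plot", log_prior, log_like))
```

The learned parameters are just the class priors and the per-class word probabilities, which is why the model trains quickly but depends heavily on having enough examples of each word per class.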


In [None]:
# Remove rows where clean_text is empty or only whitespace
df = df[df['clean_text'].str.strip().astype(bool)]

# Check again to ensure there's valid text to vectorize
if df['clean_text'].isnull().all() or df['clean_text'].str.strip().eq('').all():
    raise ValueError("No valid text found after preprocessing. Please check your input data.")

# Vectorize the clean text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['sentiment']


In [None]:
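# Vectorization
# Note: this refits the vectorizer on the raw `text` column, replacing the
# `clean_text` features built in the previous cell. TfidfVectorizer still
# lowercases and tokenizes the text itself, but no stopwords are removed.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Labels
y = df['sentiment']

# Split: 80% train, 20% test, with a fixed seed for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)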


## Step 7: Model Evaluation

The code cell in this step fits the classifier on the training split and then evaluates it on the test data. Evaluation metrics include:

- **Accuracy**: Overall correctness of the model.
- **Classification Report**: Provides precision, recall, and F1-score for each class.
- **Confusion Matrix**: A matrix showing actual vs predicted class labels to better understand misclassifications.

This analysis helps determine how well the model performs and where it can be improved.
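
The metrics can also be computed by hand, which makes their definitions explicit. The sketch below uses the test-set outcome implied by this notebook's confusion matrix output (one negative and one neutral review, both predicted as positive; the order of the two rows is not known and does not affect the result). The helper name `precision_recall` is illustrative, and undefined ratios are reported as 0.0.

```python
labels = ["negative", "neutral", "positive"]

# Outcome implied by the notebook's confusion matrix.
y_true = ["negative", "neutral"]
y_pred = ["positive", "positive"]

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: rows are true labels, columns are predicted labels.
index = {label: i for i, label in enumerate(labels)}
matrix = [[0] * len(labels) for _ in labels]
for t, p in zip(y_true, y_pred):
    matrix[index[t]][index[p]] += 1

def precision_recall(label):
    # Precision = TP / predicted as label; recall = TP / actually label.
    i = index[label]
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)
    actual = sum(matrix[i])
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

print("Accuracy:", accuracy)
for row in matrix:
    print(row)
for label in labels:
    print(label, precision_recall(label))
```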


In [None]:
# Import necessary libraries
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.0

Confusion Matrix:
 [[0 0 1]
 [0 0 1]
 [0 0 0]]

Classification Report:
               precision    recall  f1-score   support

    negative       0.00      0.00      0.00       1.0
     neutral       0.00      0.00      0.00       1.0
    positive       0.00      0.00      0.00       0.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0





## Results and Conclusion

### Model Performance Summary

The model evaluated above is a Multinomial Naive Bayes classifier trained on TF-IDF features, and on this tiny dataset it did not classify the held-out reviews correctly. Based on the evaluation metrics on the test set:

- **Accuracy**: 0.0. The test set contains only two reviews, and both were misclassified.
- **Precision and Recall**: 0.00 for every class. The one negative and one neutral test review were both predicted as positive, and no positive review was present in the test set.
- **Confusion Matrix**: Every prediction falls in the `positive` column, so the model did not separate the sentiment categories on this split.

### Observations

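- TF-IDF turns each review into a sparse numeric vector, but with only eight training reviews the model likely sees too few examples of each word to learn reliable weights.
- The final model was fitted on the raw `text` column, so the stopword removal from the cleaning step did not affect the reported scores; only the lowercasing that `TfidfVectorizer` applies by default was in effect.
- Multinomial Naive Bayes is a simple and common baseline for text classification, but a two-review test set is too small to say how well it generalizes. The training split is also imbalanced (four positive, three negative, and one neutral review), which is consistent with the model predicting `positive` for both test reviews.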

### Conclusion

This project demonstrates the process of building a sentiment analysis pipeline using Natural Language Processing and machine learning techniques, from text cleaning and vectorization to model training and evaluation. With only ten reviews, the resulting classifier is a demonstration of the workflow rather than a reliable model.

Further improvements can include:
- Expanding the dataset for better generalization.
- Trying other models such as Logistic Regression, Random Forest, or deep learning-based classifiers.
- Implementing hyperparameter tuning and cross-validation.
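
As a minimal sketch of the cross-validation idea, the code below generates k-fold train/test index splits in plain Python, so that every review is used for testing exactly once. The function name `k_fold_indices` is hypothetical. In practice, scikit-learn's `KFold`, `StratifiedKFold`, or `cross_val_score` would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# Ten reviews, five folds: each fold holds out two reviews.
folds = list(k_fold_indices(10, 5))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train} test={test}")
```

Averaging a metric over all folds gives a more stable estimate than a single two-review test set, although the rows should be shuffled first if the data is ordered by class.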

This lays a strong foundation for more advanced NLP applications in real-world scenarios.
