# Sentiment Analysis with a Single Neuron Classifier

## Introduction

In this exercise, you'll build a **single neuron classifier** using PyTorch to perform sentiment analysis on movie reviews. The dataset consists of movie reviews categorized as either **positive** or **negative**. By training the model, you'll learn to classify reviews based on their text content.

Each review is first converted into numerical features (word counts) using the **bag-of-words representation**. A single **linear neuron** then maps those counts to a sentiment score, and a sigmoid turns that score into the probability that the review is positive.

The dataset, the IMDb Large Movie Review Dataset, can be downloaded from [the Stanford AI Lab page](https://ai.stanford.edu/~amaas/data/sentiment/).

By the end of this exercise, you will:

1. Understand how to preprocess text data for machine learning using *CountVectorizer*.
2. Implement and train a binary classifier using a single neuron (*nn.Linear*).
3. Evaluate model performance using metrics like accuracy, precision, recall, and specificity.
4. Test the model on custom reviews to make predictions.

This hands-on exercise shows how well a simple linear classifier can perform on text and provides a foundation for understanding text classification problems.

![Alt Text](./images/mr.png)

## Importing Libraries

This cell imports the necessary libraries:

 - *torch* and *torch.nn*: For building and training the machine learning model.
 - *pandas*: For managing and manipulating the dataset in a tabular format.
 - *CountVectorizer*: Converts text data into numerical features using a bag-of-words representation.
 - *load_files*: Loads the IMDb movie review dataset from a directory structure.

In [1]:
import torch
from torch import nn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files

## Loading and Preparing the Dataset

- *load_files*: Loads the IMDb dataset from the *pos* and *neg* folders. Labels are assigned in alphabetical order of folder name, so *neg* reviews get label 0 and *pos* reviews get label 1.
- *pd.DataFrame*: Converts the dataset into a pandas DataFrame for easier processing.
- *decode('utf-8')*: Decodes the text data from bytes to a readable string format.
- **Sentiment Conversion**: Converts sentiment labels to floating-point values for compatibility with PyTorch.

In [2]:
# Load the IMDb movie review dataset (download and extract it first if needed)
dataset = load_files("./data/aclImdb_v1/aclImdb/train", categories=["pos", "neg"], shuffle=True, random_state=42)
data, target = dataset.data, dataset.target

# Convert data to a DataFrame for easier handling
df = pd.DataFrame({"review": [doc.decode('utf-8') for doc in data], "sentiment": target})
df["sentiment"] = df["sentiment"].astype(float)  # Convert sentiment to float for PyTorch
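
The label mapping can be checked on a throwaway corpus that mimics the dataset's folder layout. This is a minimal sketch: the temporary directory, the two one-line "reviews", and the names `toy` and `label_of` are assumptions made for illustration and are not part of the IMDb data.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny corpus with the same layout as aclImdb/train:
# one sub-folder per class, one text file per review (hypothetical content).
with tempfile.TemporaryDirectory() as root:
    for label, text in [("pos", "great film"), ("neg", "dull film")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "0.txt").write_text(text, encoding="utf-8")

    # Files are read while the temporary directory still exists.
    toy = load_files(root, categories=["pos", "neg"], shuffle=True, random_state=42)

# Folder names are sorted before being numbered, so "neg" comes first.
print(toy.target_names)

# Without an `encoding` argument the documents come back as bytes.
label_of = {doc.decode("utf-8"): int(t) for doc, t in zip(toy.data, toy.target)}
print(label_of)
```

Here the review stored under `neg` receives label 0 and the one stored under `pos` receives label 1, regardless of the order in which the categories are listed.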

## Splitting the Dataset

- *sample(frac=0.8)*: Randomly selects 80% of the data for training.
- **Validation Set**: The remaining 20% is used for validation.

In [3]:
df_train = df.sample(frac=0.8, random_state=0)
df_val = df.drop(index=df_train.index)
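
The split above relies on the DataFrame index being unique, since the validation rows are whatever remains after dropping the sampled index labels. A minimal sketch on a hypothetical ten-row frame (the `toy_` names and the data are illustrative assumptions) shows the two parts are disjoint and cover every row:

```python
import pandas as pd

# Hypothetical ten-row dataset with the same two columns as the notebook's df.
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": [float(i % 2) for i in range(10)],
})

# Same pattern as the cell above: sample 80%, then drop those rows to get the rest.
toy_train = toy_df.sample(frac=0.8, random_state=0)
toy_val = toy_df.drop(index=toy_train.index)

overlap = set(toy_train.index) & set(toy_val.index)
print(len(toy_train), len(toy_val), overlap)
```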

## Converting Text to Numerical Features

- *CountVectorizer*: Converts text into a bag-of-words representation, dropping English stop words and retaining the 1000 most frequent remaining words.
- *fit_transform*: Fits the vectorizer on the training data and transforms it into a sparse matrix of word counts.
- *transform*: Transforms the validation data using the fitted vectorizer.

In [4]:
cv = CountVectorizer(max_features=1000, stop_words="english")
reviews_train = cv.fit_transform(df_train["review"])
reviews_val = cv.transform(df_val["review"])

## Creating Tensors

- *X_train* and *X_val*: Convert the sparse matrices of word counts into dense tensors for PyTorch.
- *y_train* and *y_val*: Convert the target sentiment values into tensors and reshape them to match the model's output format.

In [5]:
X_train = torch.tensor(reviews_train.todense(), dtype=torch.float32)
y_train = torch.tensor(df_train["sentiment"].values, dtype=torch.float32).reshape((-1, 1))

X_val = torch.tensor(reviews_val.todense(), dtype=torch.float32)
y_val = torch.tensor(df_val["sentiment"].values, dtype=torch.float32).reshape((-1, 1))
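
The shapes matter here: *nn.Linear* with one output returns a column of shape `(N, 1)`, and *BCEWithLogitsLoss* expects the target to have the same shape as the model output. The following sketch illustrates the conversion on a hypothetical 2×3 count matrix. It uses `toarray()`, which returns a plain NumPy array, as an alternative to `todense()`. The `toy_` names and values are assumptions.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix

# Hypothetical sparse word-count matrix: 2 reviews, 3 vocabulary words.
toy_counts = csr_matrix(np.array([[2, 0, 1], [0, 1, 0]]))
toy_labels = np.array([1.0, 0.0])

# Densify, then convert to float32 tensors as in the cell above.
X_toy = torch.tensor(toy_counts.toarray(), dtype=torch.float32)
y_toy = torch.tensor(toy_labels, dtype=torch.float32).reshape((-1, 1))

print(X_toy.shape, y_toy.shape)
```

Densifying is affordable here because only 1000 features are kept. With a much larger vocabulary, the dense matrix could become too large to hold in memory.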

## Defining the Model

- *nn.Linear(1000, 1)*: A single neuron with 1000 input features (word counts) and 1 output (sentiment score).
- *BCEWithLogitsLoss*: Combines a Sigmoid activation with Binary Cross-Entropy loss for classification tasks.
- *SGD*: Optimizer that updates the model's weights using stochastic gradient descent with a learning rate of 0.01.

In [10]:
model = nn.Linear(1000, 1)
loss_fn = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
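
Because *BCEWithLogitsLoss* applies the sigmoid internally, the model itself outputs raw scores (logits) and needs no activation layer. The sketch below checks on random, made-up logits that the fused loss matches a sigmoid followed by *BCELoss*, and also the binary cross-entropy formula written out by hand. The variable names are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical raw model outputs and their 0/1 targets.
logits = torch.randn(4, 1)
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# 1) Fused version used in the notebook.
fused = nn.BCEWithLogitsLoss()(logits, targets)

# 2) Explicit sigmoid followed by plain binary cross-entropy.
probs = torch.sigmoid(logits)
two_step = nn.BCELoss()(probs, targets)

# 3) The formula written out: mean of -[y*log(p) + (1-y)*log(1-p)].
manual = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()

print(fused.item(), two_step.item(), manual.item())
```

The three values agree up to floating-point error. The fused version is generally preferred because it is more numerically stable for large positive or negative logits.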

## Training the Model

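- **Training Loop**: Runs for 20,000 epochs, updating the model weights to minimize the loss. Each epoch is one full-batch gradient step, because the entire training set is passed through the model at once.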
- *model.train()*: Puts the model in training mode.
- **Gradient Descent**:
    - Clears gradients (*optimizer.zero_grad()*).
    - Computes predictions (*outputs*).
    - Calculates the loss (*loss_fn*).
    - Backpropagates the gradients (*loss.backward()*).
    - Updates the weights (*optimizer.step()*).
- **Progress Printing**: Displays the loss every 1000 epochs.

In [11]:
# Training loop
for epoch in range(20000):
    # Training pass
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = loss_fn(outputs, y_train)
    loss.backward()
    optimizer.step()
    
    # Print loss every 1000 epochs
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item()}")

Epoch 0, Loss: 0.6949761509895325
Epoch 1000, Loss: 0.48841580748558044
Epoch 2000, Loss: 0.44055911898612976
Epoch 3000, Loss: 0.41479140520095825
Epoch 4000, Loss: 0.39772844314575195
Epoch 5000, Loss: 0.3852979838848114
Epoch 6000, Loss: 0.3757256269454956
Epoch 7000, Loss: 0.36807674169540405
Epoch 8000, Loss: 0.36179855465888977
Epoch 9000, Loss: 0.3565382957458496
Epoch 10000, Loss: 0.3520577847957611
Epoch 11000, Loss: 0.3481898009777069
Epoch 12000, Loss: 0.34481292963027954
Epoch 13000, Loss: 0.341836541891098
Epoch 14000, Loss: 0.3391915559768677
Epoch 15000, Loss: 0.33682435750961304
Epoch 16000, Loss: 0.33469244837760925
Epoch 17000, Loss: 0.3327619731426239
Epoch 18000, Loss: 0.33100515604019165
Epoch 19000, Loss: 0.3293995261192322
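
The same loop can be tried on a toy problem small enough to run instantly. The sketch below is not the notebook's model: the four two-feature examples, the `toy_` names, the 200 steps, and the learning rate of 0.1 are all assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data: two "word count" features; label 1 when feature 0 dominates.
X_toy = torch.tensor([[3.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 2.0]])
y_toy = torch.tensor([[1.0], [1.0], [0.0], [0.0]])

toy_model = nn.Linear(2, 1)
toy_loss_fn = nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

losses = []
toy_model.train()
for epoch in range(200):
    toy_optimizer.zero_grad()                      # clear old gradients
    loss = toy_loss_fn(toy_model(X_toy), y_toy)    # forward pass + loss
    loss.backward()                                # backpropagate
    toy_optimizer.step()                           # update weights and bias
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

As in the log above, the loss falls quickly at first and then more slowly. This is the typical behavior of gradient descent on this kind of loss.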


## Defining the Evaluation Function

- **Evaluation Mode**: Switches the model to evaluation mode (*model.eval()*) and disables gradient computation (*torch.no_grad()*).
- **Metrics**:
    - **Accuracy**: Fraction of all reviews classified correctly.
    - **Precision**: Fraction of predicted positives that are actually positive.
    - **Recall** (Sensitivity): Fraction of actual positives that are predicted positive.
    - **Specificity**: Fraction of actual negatives that are predicted negative.

In [12]:
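def evaluate_model(X, y):
    """Print accuracy, precision, recall, and specificity of the global model on (X, y)."""
    model.eval()
    with torch.no_grad():
        # Threshold the predicted probability at 0.5 to get boolean predictions
        y_pred = torch.sigmoid(model(X)) > 0.5
        accuracy = (y_pred == y).type(torch.float32).mean().item()
        # Precision: among predicted positives, the fraction that are truly positive
        precision = (y_pred[y_pred == 1] == y[y_pred == 1]).type(torch.float32).mean().item()
        # Recall: among actual positives, the fraction predicted positive
        recall = (y_pred[y == 1] == y[y == 1]).type(torch.float32).mean().item()
        # Specificity: among actual negatives, the fraction predicted negative
        specificity = (y_pred[y == 0] == y[y == 0]).type(torch.float32).mean().item()

        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall (Sensitivity): {recall:.4f}")
        print(f"Specificity: {specificity:.4f}")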
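
The four metrics are easier to follow when written in terms of confusion-matrix counts. The sketch below works through ten made-up labels and predictions (all names and values are illustrative assumptions) and checks that the masked-mean style used in the function gives the same precision as the count-based formula.

```python
import torch

# Hypothetical ground truth (4 positives, 6 negatives) and boolean predictions.
y_true = torch.tensor([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_hat = torch.tensor([True, True, True, False, True, True, False, False, False, False])

truth = y_true.bool()
tp = (y_hat & truth).sum().item()     # predicted positive, actually positive
fp = (y_hat & ~truth).sum().item()    # predicted positive, actually negative
fn = (~y_hat & truth).sum().item()    # predicted negative, actually positive
tn = (~y_hat & ~truth).sum().item()   # predicted negative, actually negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

# Same precision computed with a boolean mask, as in evaluate_model.
precision_masked = (y_hat[y_hat] == truth[y_hat]).float().mean().item()

print(tp, fp, fn, tn)
print(accuracy, precision, recall, specificity, precision_masked)
```

Note that if a model predicts no positives at all, the precision denominator is zero. The masked-mean version then returns `nan`, and the count-based version raises a division error.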

## Evaluating and Testing the Model

- **Training and Validation Evaluation**:
    - Uses *evaluate_model()* to assess accuracy, precision, recall, and specificity on both datasets.
- **Custom Predictions**:
    - Predicts the sentiment of new, unseen reviews.
    - Classifies each review by its sigmoid score:
        1. \> 0.5: Positive sentiment.
        2. <= 0.5: Negative sentiment.

Comparing the training and validation metrics shows how well the model generalizes to unseen data.

In [13]:
# Evaluate on training data
print("Evaluating on the training data:")
evaluate_model(X_train, y_train)

# Evaluate on validation data
print("Evaluating on the validation data:")
evaluate_model(X_val, y_val)

# Custom predictions
custom_reviews = [
    "This movie was fantastic! The plot was engaging and the acting was great.",
    "I really hated this movie. It was a waste of time and money.",
    "The film was okay, but nothing special. It had some good moments."
]

custom_messages = cv.transform(custom_reviews)
X_custom = torch.tensor(custom_messages.todense(), dtype=torch.float32)

model.eval()
with torch.no_grad():
    predictions = torch.sigmoid(model(X_custom))
    print("\nCustom Predictions:")
    for review, pred in zip(custom_reviews, predictions):
        sentiment = "Positive" if pred > 0.5 else "Negative"
        print(f"Review: {review}\nPredicted Sentiment: {sentiment} (Score: {pred.item():.4f})\n")

Evaluating on the training data:
Accuracy: 0.8699
Precision: 0.8602
Recall (Sensitivity): 0.8853
Specificity: 0.8543
Evaluating on the validation data:
Accuracy: 0.8578
Precision: 0.8480
Recall (Sensitivity): 0.8630
Specificity: 0.8528

Custom Predictions:
Review: This movie was fantastic! The plot was engaging and the acting was great.
Predicted Sentiment: Positive (Score: 0.6094)

Review: I really hated this movie. It was a waste of time and money.
Predicted Sentiment: Negative (Score: 0.2225)

Review: The film was okay, but nothing special. It had some good moments.
Predicted Sentiment: Positive (Score: 0.5199)
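
A score close to 0.5, such as the 0.5199 for the third review, means the raw output of the neuron is close to zero, so the model is nearly undecided. Thresholding the sigmoid output at 0.5 is the same as checking the sign of the raw score. The sketch below illustrates this on a few made-up logits (the values are assumptions, not model outputs).

```python
import torch

# Hypothetical raw scores from the neuron, including one exactly at zero.
logits = torch.tensor([-2.0, -0.1, 0.0, 0.1, 2.0])
probs = torch.sigmoid(logits)

by_prob = probs > 0.5   # rule used in the notebook
by_logit = logits > 0   # equivalent rule on the raw score

print(probs)
print(by_prob.tolist(), by_logit.tolist())
```

Both rules label a raw score of exactly zero (probability 0.5) as negative, matching the `<= 0.5` case above.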

