# Lecture 69: Sentiment Analysis with NLP

This notebook demonstrates how to build a **sentiment analysis model** with scikit-learn that classifies movie reviews as positive or negative. We'll load the IMDb dataset through the Hugging Face `datasets` library, preprocess the text, extract features using TF-IDF, train a logistic regression model, and evaluate its performance. The steps include:

- Loading and exploring the IMDb dataset
- Preprocessing text (cleaning, tokenization, lemmatization)
- Converting text to numerical features with TF-IDF
- Training a logistic regression model
- Evaluating the model with accuracy, precision, recall, and F1-score
- Making predictions on sample texts

This approach is a simple baseline built from traditional NLP techniques. On the 2,000-review test subset used here, it reaches roughly 87% accuracy.

## Setup and Imports

Let's import the necessary libraries and set up the environment.

In [2]:
%%capture
!pip install -U datasets

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from datasets import load_dataset

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

# Set random seed for reproducibility
np.random.seed(42)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Loading and Exploring the IMDb Dataset

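We'll use the IMDb dataset from the `datasets` library, which contains 50,000 labeled movie reviews (positive = 1, negative = 0), split evenly into 25,000 training and 25,000 test examples. To keep the computation manageable, we'll randomly sample 10,000 training reviews and 2,000 test reviews.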

In [16]:
dataset = load_dataset("imdb")


# Convert each split to a pandas DataFrame and draw a reproducible random subset
df_train = pd.DataFrame(dataset['train']).sample(10000, random_state=42)  # 10,000 training reviews
df_test = pd.DataFrame(dataset['test']).sample(2000, random_state=42)     # 2,000 test reviews

# Extract texts and labels
X_train = df_train['text']
y_train = df_train['label']
X_test = df_test['text']
y_test = df_test['label']

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

# Display sample reviews
print("\nSample Reviews:")
for i in range(2):
    print(f"Review {i+1}: {X_train.iloc[i][:100]}...\nLabel: {'Positive' if y_train.iloc[i] == 1 else 'Negative'}\n")

Training data shape: (10000,)
Test data shape: (2000,)

Sample Reviews:
Review 1: Dumb is as dumb does, in this thoroughly uninteresting, supposed black comedy. Essentially what star...
Label: Negative

Review 2: I dug out from my garage some old musicals and this is another one of my favorites. It was written b...
Label: Positive



## Text Preprocessing

We'll preprocess the text by:
- Converting to lowercase
- Removing every character that is not a letter or whitespace (punctuation, digits, and other symbols)
- Tokenizing the text
- Removing stop words
- Applying lemmatization to reduce words to their base form

In [17]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words and lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing
X_train_processed = X_train.apply(preprocess_text)
X_test_processed = X_test.apply(preprocess_text)

print("Sample Preprocessed Review:")
print(X_train_processed.iloc[0][:200] + "...")

Sample Preprocessed Review:
dumb dumb thoroughly uninteresting supposed black comedy essentially start chris klein trying maintain low profile eventually morphs uninspired version three amigo without laugh order black comedy wor...
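
One detail of the cleaning step is worth checking. Raw IMDb reviews often contain HTML line breaks such as `<br />`, and deleting non-letter characters outright (rather than replacing them with a space) can glue neighbouring words together. The sketch below uses only the standard library to compare the regex from `preprocess_text` with a hypothetical variant, `clean_with_html_strip`, that removes tags first and substitutes spaces. The variant and the sample sentence are illustrations, not part of the original notebook, and adopting the variant would likely change the metrics reported later.

```python
import re

def clean_original(text):
    """Same cleaning as preprocess_text above, before tokenization."""
    text = text.lower()
    return re.sub(r'[^a-zA-Z\s]', '', text)

def clean_with_html_strip(text):
    """Hypothetical variant: drop HTML tags, then replace non-letters with spaces."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # remove tags such as <br />
    text = re.sub(r'[^a-z\s]', ' ', text)  # replace (not delete) other symbols
    return ' '.join(text.split())          # collapse repeated whitespace

review = "It was good.<br /><br />The end!"  # made-up example
print(clean_original(review))         # it was goodbr br the end
print(clean_with_html_strip(review))  # it was good the end
```

In the first result, `good` and the tag name fuse into the token `goodbr`, which no longer matches the vocabulary entry for `good`.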


## Feature Extraction with TF-IDF

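We'll convert the preprocessed text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). A term's weight grows with how often it appears in a review and shrinks with the number of reviews in the corpus that contain it, so frequent but uninformative words count for less. The vectorizer keeps the 5,000 most frequent terms (`max_features=5000`) and considers both single words and two-word phrases (`ngram_range=(1, 2)`).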
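
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`. The `tfidf` helper and the three sentences are made up for illustration and ignore details such as tokenization rules and `max_features`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["great movie great cast", "boring movie", "great fun"]  # toy corpus
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

In the second document, `boring` receives a higher weight than `movie` because it occurs in fewer documents of the corpus.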

In [18]:
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_processed)

# Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(X_test_processed)

print(f"TF-IDF Training data shape: {X_train_tfidf.shape}")
print(f"TF-IDF Test data shape: {X_test_tfidf.shape}")

TF-IDF Training data shape: (10000, 5000)
TF-IDF Test data shape: (2000, 5000)
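
The `ngram_range=(1, 2)` setting means the vocabulary is drawn from both single tokens and pairs of adjacent tokens. The helper below is a hypothetical stand-in (not scikit-learn's implementation) that shows which candidate features a short token sequence produces.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams for n in [n_min, n_max], joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = "plot kept engaged throughout".split()
print(ngrams(tokens))
# ['plot', 'kept', 'engaged', 'throughout', 'plot kept', 'kept engaged', 'engaged throughout']
```

Bigrams are most useful for phrases whose meaning differs from their parts, such as "not good". Such a bigram can only be formed if `not` survives preprocessing. NLTK's English stop word list includes common negations, so it is worth checking `'not' in stop_words` before relying on bigrams to capture negation.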


## Training the Sentiment Analysis Model

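We'll train a logistic regression model, a linear classifier that works well as a baseline for binary tasks like sentiment analysis on sparse TF-IDF features. Setting `max_iter=1000` raises the solver's iteration limit above the default of 100 so that it has room to converge.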
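
As a rough picture of what the fitted model computes, logistic regression takes a weighted sum of the feature values, adds a bias, and passes the result through the sigmoid function to obtain a probability for the positive class. The sketch below uses made-up weights and feature values purely for illustration. In the notebook, the learned values live in `model.coef_` and `model.intercept_`.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(features, weights, bias):
    """P(label = 1) for one document given sparse {term: value} features."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

# Hypothetical weights (illustrative values, not taken from the trained model)
weights = {"fantastic": 2.0, "boring": -2.5, "movie": 0.0}
bias = 0.0

features = {"fantastic": 0.8, "movie": 0.6}  # hypothetical TF-IDF values
p = predict_proba_positive(features, weights, bias)
print(f"P(positive) = {p:.3f}")
print("Positive" if p >= 0.5 else "Negative")
```

Terms with positive weights push the prediction toward Positive and terms with negative weights push it toward Negative, which is what makes the model straightforward to inspect.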

In [19]:
# Initialize and train logistic regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_tfidf, y_train)

# Make predictions on test set
y_pred = model.predict(X_test_tfidf)

## Evaluating the Model

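We'll evaluate the model using accuracy, precision, recall, F1-score, and a detailed classification report. With scikit-learn's defaults, precision, recall, and F1 are computed for the positive class (label 1).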

In [20]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Accuracy: 0.8715
Precision: 0.8576
Recall: 0.8781
F1-Score: 0.8677

Classification Report:
              precision    recall  f1-score   support

    Negative       0.88      0.87      0.88      1040
    Positive       0.86      0.88      0.87       960

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000
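
The headline numbers can be cross-checked by hand from a confusion matrix. The counts below are not printed by the notebook. They are back-calculated from the reported precision, recall, and class supports (960 positive, 1,040 negative), so treat them as an inference rather than as output.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption,
# not notebook output). The positive class is label 1.
tp, fp, fn, tn = 843, 140, 117, 900

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of reviews predicted positive, the fraction truly positive
recall = tp / (tp + fn)     # of truly positive reviews, the fraction the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
```

These formulas reproduce the values 0.8715, 0.8576, 0.8781, and 0.8677 shown above.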



## Making Predictions on Sample Texts

Let's test the model on some custom review texts to see how it predicts sentiment.

In [21]:
# Sample texts for prediction
sample_texts = [
    "This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout.",
    "I was really disappointed with this film. The story was boring and the characters were unlikable.",
    "An average movie with some good moments but nothing special overall."
]

# Preprocess and transform sample texts
sample_texts_processed = [preprocess_text(text) for text in sample_texts]
sample_tfidf = tfidf_vectorizer.transform(sample_texts_processed)

# Predict sentiment
sample_predictions = model.predict(sample_tfidf)

# Display predictions
print("\nSample Text Predictions:")
for text, pred in zip(sample_texts, sample_predictions):
    print(f"Text: {text[:100]}...\nPredicted Sentiment: {'Positive' if pred == 1 else 'Negative'}\n")


Sample Text Predictions:
Text: This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout....
Predicted Sentiment: Positive

Text: I was really disappointed with this film. The story was boring and the characters were unlikable....
Predicted Sentiment: Negative

Text: An average movie with some good moments but nothing special overall....
Predicted Sentiment: Positive
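
Note that `model.predict` returns only a hard label, which is why the lukewarm third review still comes out as Positive. `LogisticRegression` also provides `predict_proba`, and one possible way to surface borderline cases is to abstain when the positive-class probability is close to 0.5. The sketch below shows the idea with a hypothetical helper, `label_with_margin`; the probabilities and the margin of 0.15 are made-up values, not results from the trained model.

```python
def label_with_margin(p_positive, margin=0.15):
    """Map P(positive) to a label, abstaining when it is close to 0.5."""
    if p_positive >= 0.5 + margin:
        return "Positive"
    if p_positive <= 0.5 - margin:
        return "Negative"
    return "Uncertain"

# Hypothetical probabilities, e.g. as returned by model.predict_proba(sample_tfidf)[:, 1]
for p in (0.93, 0.08, 0.56):
    print(p, label_with_margin(p))
```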



## Explanation

- **Dataset**: Used a subset of the IMDb dataset (10,000 training, 2,000 test reviews) labeled as positive or negative.
- **Preprocessing**: Applied text cleaning, tokenization, stop word removal, and lemmatization to standardize the text.
- **Feature Extraction**: Used TF-IDF to convert text into numerical features, keeping up to 5,000 unigram and bigram terms weighted by their importance.
- **Model**: Trained a logistic regression model, which is effective for binary classification and interpretable.
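- **Evaluation**: Measured accuracy (0.8715), precision (0.8576), recall (0.8781), and F1-score (0.8677) on the 2,000-review test subset.
- **Predictions**: Tested the model on three custom texts. The clearly positive and clearly negative reviews were classified as expected, while the lukewarm review was labeled Positive, since a binary model has no neutral class.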

To extend this work, consider:
- Using other models (e.g., SVM, Naive Bayes, or deep learning with LSTM)
- Incorporating word embeddings (e.g., Word2Vec, GloVe) for better semantic understanding
- Tuning hyperparameters (e.g., TF-IDF max_features, n-grams, or regularization in logistic regression)
- Handling imbalanced data or expanding to multi-class sentiment (e.g., neutral)