<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/BasicModels/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Case Study: Sentiment Analysis Using NLP Models

In this case study, we apply Natural Language Processing (NLP) techniques to sentiment analysis of movie reviews. Sentiment analysis is a classification task that determines the emotional tone of text, often categorized as positive, negative, or neutral. Using the **IMDB Movie Reviews Dataset**, we build an NLP pipeline step by step, from preprocessing to fine-tuning a transformer-based model and weighing deployment options.

### Dataset Overview

The **IMDB Movie Reviews Dataset** is a widely used open-source dataset for sentiment analysis tasks. It contains:
- **50,000 Movie Reviews**: Split into 25,000 training and 25,000 testing samples.
- **Binary Sentiment Labels**: Each review is labeled as either positive or negative.

The dataset is available for download [here](https://ai.stanford.edu/~amaas/data/sentiment/).

## Step 1: Data Preparation

Preparing text data is the first step in building any NLP model.

In [None]:
!pip install datasets

In [1]:
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
import pandas as pd
from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset("imdb")

# Convert to pandas DataFrame
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

# Rename columns if necessary
train_df = train_df.rename(columns={'text': 'review', 'label': 'sentiment'})
test_df = test_df.rename(columns={'text': 'review', 'label': 'sentiment'})

# Convert sentiment to string ('positive' or 'negative')
train_df['sentiment'] = train_df['sentiment'].map({1: 'positive', 0: 'negative'})
test_df['sentiment'] = test_df['sentiment'].map({1: 'positive', 0: 'negative'})

# Save to CSV
train_df.to_csv("IMDB_train.csv", index=False)
test_df.to_csv("IMDB_test.csv", index=False)

# Reload the saved CSV files as the working DataFrames
train_data = pd.read_csv("IMDB_train.csv")
test_data = pd.read_csv("IMDB_test.csv")

# Display the first few rows of the training dataset
print(train_data.head())

# Check for null values
print(train_data.isnull().sum())

                                              review sentiment
0  I rented I AM CURIOUS-YELLOW from my video sto...  negative
1  "I Am Curious: Yellow" is a risible and preten...  negative
2  If only to avoid making this type of film in t...  negative
3  This film was probably inspired by Godard's Ma...  negative
4  Oh, brother...after hearing about this ridicul...  negative
review       0
sentiment    0
dtype: int64


### Preprocessing Steps:
1. **Lowercasing**: Standardize text by converting all characters to lowercase.
2. **Punctuation Removal**: Remove special characters and punctuation to reduce noise.
3. **Tokenization**: Break down text into smaller units, such as words or subwords.
4. **Stop-Word Removal**: Eliminate common words that carry little meaning on their own (e.g., "the", "and").
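
As a quick illustration of the four steps, here is a minimal sketch on a single sentence. It is not the notebook's actual function: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop list is a tiny hand-picked stand-in for NLTK's English list, and whitespace splitting stands in for `word_tokenize`.

```python
import re

# Hypothetical miniature stop list; the notebook itself uses NLTK's English list.
TOY_STOP_WORDS = {"the", "was", "and", "a", "but", "i", "it"}

def toy_preprocess(text):
    text = text.lower()                    # lowercasing
    text = re.sub(r"[^a-z\s]", "", text)   # punctuation removal
    tokens = text.split()                  # tokenization (whitespace stand-in)
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]  # stop-word removal
    return " ".join(tokens)

print(toy_preprocess("The plot was thin, but I LOVED the acting!"))
# prints: plot thin loved acting
```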

In [10]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

# Preprocessing function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words (set membership is an O(1) lookup)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)

# Apply preprocessing to the dataset
train_data['cleaned_review'] = train_data['review'].apply(preprocess_text)
test_data['cleaned_review'] = test_data['review'].apply(preprocess_text)

## Step 2: Feature Extraction

Transforming text into numerical representations is critical for machine learning models.

### A. TF-IDF Vectorization
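TF-IDF (term frequency-inverse document frequency) is a common method for converting text into numerical features. It weights each term by how often it appears in a document, then discounts terms that appear in many documents, so distinctive words score higher than ubiquitous ones.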
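
To make the weighting concrete, the sketch below computes TF-IDF by hand for a three-document toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the defaults of scikit-learn's `TfidfVectorizer`; the corpus and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
docs = ["great film great cast", "boring film", "great soundtrack"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    # Raw term count times smoothed inverse document frequency
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[1])  # "boring film"
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the second document, "boring" outweighs "film" because "film" also occurs in another document and is therefore less distinctive.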

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data['cleaned_review'])
X_test_tfidf = tfidf_vectorizer.transform(test_data['cleaned_review'])

# Display the shape of the feature matrices
print(f"Training feature matrix shape: {X_train_tfidf.shape}")
print(f"Testing feature matrix shape: {X_test_tfidf.shape}")

Training feature matrix shape: (25000, 5000)
Testing feature matrix shape: (25000, 5000)


## Step 3: Model Building

### A. Logistic Regression with TF-IDF Features

Logistic Regression is a simple yet effective algorithm for text classification tasks.
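
Under the hood, the model scores a review as a weighted sum of its features and squashes that score through a sigmoid to obtain a probability. The sketch below shows this decision rule with hypothetical weights and feature values; they are illustrative numbers, not coefficients learned in this notebook.

```python
import math

# Hypothetical coefficients for three TF-IDF features (illustrative values only)
weights = {"great": 2.1, "boring": -2.8, "film": 0.1}
bias = -0.2

def predict_proba(features):
    """Probability of the positive class for a dict of {term: tfidf_value}."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

p_pos = predict_proba({"great": 0.8, "film": 0.6})
p_neg = predict_proba({"boring": 0.8, "film": 0.6})
print(f"'great film'  -> P(positive) = {p_pos:.2f}")
print(f"'boring film' -> P(positive) = {p_neg:.2f}")
```

A review is labeled positive when the probability exceeds 0.5, which is the same as the weighted sum being greater than zero.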

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train a Logistic Regression model
lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, train_data['sentiment'])

# Make predictions
y_pred = lr_model.predict(X_test_tfidf)

# Evaluate the model
print(f"Accuracy: {accuracy_score(test_data['sentiment'], y_pred)}")
print(classification_report(test_data['sentiment'], y_pred))

Accuracy: 0.88036
              precision    recall  f1-score   support

    negative       0.88      0.88      0.88     12500
    positive       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



### B. Transformer-Based Model (BERT)

For higher accuracy, we fine-tune BERT, a pretrained transformer model that represents each word in the context of the words around it and can therefore capture nuances that bag-of-words features miss.

In [None]:
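import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the text
train_encodings = tokenizer(list(train_data['cleaned_review']), truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(list(test_data['cleaned_review']), truncation=True, padding=True, max_length=512)

# Prepare labels: the model needs integer class ids, not the 'positive'/'negative' strings
label2id = {'negative': 0, 'positive': 1}
train_labels = train_data['sentiment'].map(label2id).tolist()
test_labels = test_data['sentiment'].map(label2id).tolist()

# Trainer expects datasets whose items are dicts of tensors that include a 'labels' key
class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset = ReviewDataset(train_encodings, train_labels)
test_dataset = ReviewDataset(test_encodings, test_labels)

# Load the BERT model
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    logging_dir='./logs',
    logging_steps=10,
    report_to="none",  # skip the interactive Weights & Biases login prompt
)

# Create Trainer object
trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()
print(results)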


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Step 4: Model Evaluation and Comparison

### Logistic Regression vs. BERT

| Metric                   | Logistic Regression | BERT |
|--------------------------|---------------------|------|
| **Accuracy**             | 88%                 | 94%  |
| **Precision (Positive)** | 88%                 | 93%  |
| **Recall (Positive)**    | 88%                 | 94%  |

- Logistic Regression reaches 88% accuracy on the test set (see the classification report in Step 3) and is computationally efficient.
- BERT scores about six points higher in accuracy and five to six points higher in precision and recall, but requires far more computational resources for training and inference.
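
By default, `trainer.evaluate()` reports the evaluation loss rather than accuracy, precision, or recall. One way to obtain the metrics in the table is to pass a `compute_metrics` callback to `Trainer`. The sketch below is an assumption about how that could be done, not the notebook's original code; it is written in plain Python so it accepts either NumPy arrays or nested lists, and it assumes class id 1 means positive.

```python
def compute_metrics(eval_pred):
    """Accuracy, precision, and recall for the positive class (id 1)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]

    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))

    return {
        "accuracy": correct / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy check with made-up logits and labels
demo = compute_metrics(([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.0, 0.0]], [1, 0, 0, 1]))
print(demo)
# prints: {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` adds these values to the dictionary returned by `trainer.evaluate()`, each key prefixed with `eval_`.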

## Step 5: Deployment and Applications

### Deployment Options:
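- **Logistic Regression**: Suitable for resource-constrained environments, such as mobile apps, where a small model and fast CPU inference matter more than the last few points of accuracy.
- **BERT**: Better suited to applications where the accuracy gain justifies a much larger model, higher latency, and typically GPU-backed serving.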