<a href="https://colab.research.google.com/github/git-ginwook/InsightToInterface/blob/OutOfMemoryError/InsightToInterface_ReviewClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Classification

## Dataset


[IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/code)

- Download the dataset directly from Kaggle with `kagglehub`

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

In [None]:
import pandas as pd

reviews_df = pd.read_csv(path + "/IMDB Dataset.csv")
reviews_df.head()

## GPU Device Setup

### 1. Verify A100 GPU setup

In [None]:
import torch

# Check if GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")  # Automatically selects the first available GPU
    gpu_name = torch.cuda.get_device_name(0)

    # Check if it's an A100 GPU
    if "A100" in gpu_name:
        print(f"Successfully set up {gpu_name}!")
    else:
        print(f"GPU assigned: {gpu_name}. Note: It's not an A100 GPU.")
else:
    device = torch.device("cpu")
    print("GPU not available. Using CPU.")

# Print CUDA version
print(f"CUDA Version: {torch.version.cuda}")
print(f"PyTorch Version: {torch.__version__}")


### 2. Verify GPU performance

In [None]:
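import time

# Dummy tensor operation to benchmark the GPU (skipped when no GPU is available)
if torch.cuda.is_available():
    size = 10000

    # Create random tensors on the device selected in the previous cell
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)

    # Time matrix multiplication on GPU
    torch.cuda.synchronize()  # Make sure tensor creation has finished before timing
    start = time.time()
    c = torch.matmul(a, b)
    torch.cuda.synchronize()  # Wait for GPU to finish
    end = time.time()

    print(f"Time for matrix multiplication on GPU: {end - start:.4f} seconds")

    # Release the benchmark tensors (about 1.2 GB in float32) before loading the model
    del a, b, c
    torch.cuda.empty_cache()
else:
    print("GPU not available. Skipping the GPU benchmark.")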
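
A single timed call can be noisy, and the first GPU call often includes one-off setup cost. The following is a minimal sketch of a timing helper with warm-up runs; `time_call` and its parameters are hypothetical names, not part of this notebook, and the workload shown is a CPU stand-in.

```python
import time


def time_call(fn, warmup=1, repeats=3):
    """Return the mean wall-clock seconds per call of fn, after untimed warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# CPU stand-in workload (assumption for illustration).
# For GPU work, fn should end with torch.cuda.synchronize() so the timer
# measures finished kernels rather than just the launch.
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"Mean time per call: {elapsed:.6f} seconds")
```

Averaging over several repeats after a warm-up tends to give a more stable number than a single measurement.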


## Models


[Customer Segmentation using LLMs: Advanced Clustering Techniques for Effective Targeting](https://ai.plainenglish.io/customer-segmentation-using-llms-advanced-clustering-techniques-for-effective-targeting-493116116ab6)

### 1. BERT for Text Embedding -> review sentiments (e.g., + or -)

Batch processing
- Delete tensors that are no longer needed, then call `torch.cuda.empty_cache()` between batches to release cached GPU memory
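
The batching pattern itself can be sketched independently of BERT. This is a minimal illustration: `iter_batches` is a hypothetical helper and the toy reviews are made up.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy data (made up for illustration)
toy_reviews = ["great film", "terrible plot", "loved it", "too long", "a classic"]
for batch in iter_batches(toy_reviews, batch_size=2):
    print(len(batch), batch)
```

The last batch may be smaller than `batch_size`, so downstream code should not assume a fixed batch length.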

In [None]:
from transformers import BertTokenizer, BertModel
import numpy as np
import torch
import pandas as pd
from sklearn.cluster import KMeans

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)  # use GPU if available
model.eval()  # Inference mode: disables dropout

# Movie reviews to embed
customer_reviews = reviews_df["review"].to_list()

# Parameters
batch_size = 64  # Reviews per batch; lower this if you hit a CUDA out-of-memory error
num_clusters = 2  # Number of clusters for K-Means

# Collect embeddings from every batch so K-Means is fitted once on the full dataset.
# Fitting K-Means separately per batch would give cluster IDs that are not comparable across batches.
all_embeddings = []

# Process in batches
for i in range(0, len(customer_reviews), batch_size):
    # Get the current batch
    batch_reviews = customer_reviews[i:i + batch_size]

    # Tokenize and encode the batch
    inputs = tokenizer(batch_reviews, return_tensors='pt', padding=True, truncation=True, max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Load data to GPU

    # Get the embeddings from the BERT model
    with torch.no_grad():
        outputs = model(**inputs)
        # Mean-pool over real tokens only, so padding does not skew the average
        mask = inputs['attention_mask'].unsqueeze(-1).float()
        embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

    # Bring the embeddings back from GPU to CPU as a numpy array
    all_embeddings.append(embeddings.cpu().numpy())

    # Delete GPU tensors that are no longer needed, then release cached memory
    del inputs, outputs, mask, embeddings
    torch.cuda.empty_cache()

# Perform K-Means clustering once on all embeddings
embeddings = np.concatenate(all_embeddings)
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

# Store each review with its cluster
results = pd.DataFrame({
    'Review': customer_reviews,
    'Cluster': labels
})

# Show the clusters
print(results)
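
Mean pooling is sensitive to padding: when shorter reviews in a batch are padded, the padding positions contribute to a plain average. The sketch below uses plain Python lists and made-up numbers to show the difference; `mean_pool` is a hypothetical helper, not a `transformers` function.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors; if a mask is given, skip positions where it is 0."""
    if attention_mask is None:
        kept = token_vectors
    else:
        kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("no tokens to pool")
    # zip(*kept) iterates over embedding dimensions (columns)
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy sequence (made up): two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

print(mean_pool(tokens))        # padding position included in the average
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

With the mask applied, only the two real tokens are averaged, which is the behaviour usually wanted for sentence-level embeddings.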


Calculate Accuracy Score

In [None]:
from sklearn.metrics import accuracy_score

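# Map 'positive' to 1 and 'negative' to 0 without overwriting the original column,
# so this cell can be re-run safely
true_labels = reviews_df['sentiment'].map({'positive': 1, 'negative': 0})

# Calculate the accuracy score
# K-Means cluster IDs are arbitrary (cluster 0 is not guaranteed to be 'negative'),
# so score both possible assignments and keep the better one
accuracy = accuracy_score(true_labels, results['Cluster'].astype(int))
accuracy = max(accuracy, 1 - accuracy)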

# Print the accuracy score
print("Accuracy:", accuracy)
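
Because K-Means assigns cluster IDs arbitrarily, comparing them directly with the sentiment labels can understate the agreement. A minimal sketch of scoring two clusters under both possible assignments follows; `aligned_accuracy` is a hypothetical helper and the toy labels are made up.

```python
def aligned_accuracy(true_labels, cluster_ids):
    """Accuracy for two clusters under the better of the two ID-to-label mappings."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    matches = sum(t == c for t, c in zip(true_labels, cluster_ids))
    direct = matches / len(true_labels)
    # Swapping 0 <-> 1 turns every match into a mismatch and vice versa
    return max(direct, 1 - direct)


# Toy data (made up): the clusters are the exact inverse of the labels
toy_true = [1, 1, 0, 0, 1]
toy_clusters = [0, 0, 1, 1, 0]
print(aligned_accuracy(toy_true, toy_clusters))  # 1.0
```

For two balanced clusters this score never drops below 0.5, so a value near 0.5 indicates that the clusters carry little sentiment information.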

Visualization

In [None]:
import matplotlib.pyplot as plt

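# Count the number of reviews in each cluster, ordered by cluster ID
# (value_counts sorts by frequency by default, which would misalign labels and colors)
cluster_counts = results['Cluster'].astype(int).value_counts().sort_index()

# Map cluster numbers to meaningful labels
# NOTE: K-Means cluster IDs are arbitrary; check which cluster holds mostly negative
# reviews before relying on this mapping
cluster_labels = {0: 'Negative', 1: 'Positive'}
cluster_counts.index = cluster_counts.index.map(cluster_labels)

# Define colors for each cluster
colors = ['red', 'green']  # Cluster 0 is red, cluster 1 is green (matches the sorted index)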

# Plot the counts as a bar graph
plt.figure(figsize=(8, 6))
cluster_counts.plot(kind='barh', color=colors)

# Adjust the axis limits to include the annotations
max_count = cluster_counts.max()
plt.xlim(0, max_count * 1.1)  # Add 10% extra space to the right of the largest bar

# Add title and xy-labels
plt.title('Sentiment of Movie Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Sentiment')

# Annotate bars with the total count
for index, value in enumerate(cluster_counts):
    plt.text(value + 100, index, f'{value:,}', va='center')  # Add commas to the number

plt.show()

Evaluation

### 2. Latent Dirichlet Allocation (LDA) -> review themes (e.g., price, quality)
- Not working yet; needs investigation

### 3. Sentiment Analysis
- TBD