<a href="https://colab.research.google.com/github/git-ginwook/InsightToInterface/blob/movies_bert/InsightToInterface_ReviewClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Review Classification

## Dataset


[IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/code)

- Download the dataset directly from Kaggle with `kagglehub`

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/versions/1


In [None]:
import pandas as pd

reviews_df = pd.read_csv(path + "/IMDB Dataset.csv")
reviews_df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
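
The preview shows raw `<br />` tags inside the review text. If those should not reach the tokenizer, one option is to strip them first. The sketch below is an assumption about preprocessing, not a step the notebook performs; `strip_html_breaks` and the sample string are hypothetical.

```python
import re

def strip_html_breaks(text):
    """Replace <br>, <br/> and <br /> tags with a space, then collapse whitespace."""
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample string, not taken from the dataset
sample = "Great film.<br /><br />Would watch again."
print(strip_html_breaks(sample))  # Great film. Would watch again.
```

Applied to the DataFrame, this could look like `reviews_df["review"].map(strip_html_breaks)`.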


## GPU Device Setup

### 1. Verify A100 GPU setup

In [None]:
import torch

# Check if GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")  # Automatically selects the first available GPU
    gpu_name = torch.cuda.get_device_name(0)

    # Check if it's an A100 GPU
    if "A100" in gpu_name:
        print(f"Successfully set up {gpu_name}!")
    else:
        print(f"GPU assigned: {gpu_name}. Note: It's not an A100 GPU.")
else:
    device = torch.device("cpu")
    print("GPU not available. Using CPU.")

# Print CUDA and PyTorch versions
print(f"CUDA Version: {torch.version.cuda}")
print(f"PyTorch Version: {torch.__version__}")


Successfully set up NVIDIA A100-SXM4-40GB!
CUDA Version: 12.1
PyTorch Version: 2.5.1+cu121


### 2. Verify GPU performance

In [None]:
import time

# Dummy tensor operation to benchmark GPU
device = torch.device("cuda")
size = 10000

# Create random tensors
a = torch.randn(size, size, device=device)
b = torch.randn(size, size, device=device)

# Time matrix multiplication on GPU
start = time.time()
c = torch.matmul(a, b)
torch.cuda.synchronize()  # Wait for GPU to finish
end = time.time()

print(f"Time for matrix multiplication on GPU: {end - start:.4f} seconds")


Time for matrix multiplication on GPU: 0.1635 seconds
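
A single timed call like the one above can include one-off costs (CUDA context and cuBLAS initialization, or tensor-creation kernels that are still pending), so the number may overstate steady-state matmul time. A minimal sketch of a warm-up-then-average pattern follows; the `benchmark` helper and its parameters are hypothetical names, not part of the notebook.

```python
import time

def benchmark(fn, sync=lambda: None, warmup=2, repeats=5):
    """Return the mean wall-clock seconds per call of fn().

    sync is called before and after the timed loop; for GPU work it should
    block until all queued kernels have finished.
    """
    for _ in range(warmup):      # untimed calls absorb one-off setup costs
        fn()
    sync()                       # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()                       # wait for the timed work to finish
    return (time.perf_counter() - start) / repeats

# CPU-only demonstration so the sketch runs anywhere
elapsed = benchmark(lambda: sum(range(10_000)))
print(f"Mean time per call: {elapsed:.6f} seconds")
```

With the tensors from the cell above, this could be called as `benchmark(lambda: torch.matmul(a, b), sync=torch.cuda.synchronize)`.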


## Models


[Customer Segmentation using LLMs: Advanced Clustering Techniques for Effective Targeting](https://ai.plainenglish.io/customer-segmentation-using-llms-advanced-clustering-techniques-for-effective-targeting-493116116ab6)

### 1. BERT for Text Embedding -> review sentiment (e.g., positive or negative)

In [None]:
from transformers import BertTokenizer, BertModel
import torch

from sklearn.cluster import KMeans

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device) # use GPU

# All reviews from the dataset as a list of strings
customer_reviews = reviews_df["review"].to_list()

# Tokenize and encode all reviews in one batch
# (too large for a 40 GB GPU; see the error below)
inputs = tokenizer(customer_reviews, return_tensors='pt', padding=True, truncation=True)
inputs = inputs.to(device) # load data to GPU

# Get the embeddings from the BERT model
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)

# Convert embeddings to a numpy array for clustering
embeddings = embeddings.cpu().numpy() # bring back output from GPU to CPU

# Perform K-Means clustering (two clusters, one per expected sentiment)
kmeans = KMeans(n_clusters=2, random_state=42)  # fixed seed for reproducibility
labels = kmeans.fit_predict(embeddings)

# Add the results to a DataFrame for better visualization
df = pd.DataFrame({
    'Review': customer_reviews,
    'Cluster': labels
})

# Show the clusters
print(df)



OutOfMemoryError: CUDA out of memory. Tried to allocate 73.24 GiB. GPU 0 has a total capacity of 39.56 GiB of which 36.99 GiB is free. Process 160480 has 2.57 GiB memory in use. Of the allocated memory 2.11 GiB is allocated by PyTorch, and 45.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
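
The 73.24 GiB in the error above matches one float32 tensor of shape (50000, 512, 768): all 50,000 reviews padded to BERT's 512-token limit with hidden size 768, at 4 bytes per value. Feeding the model smaller batches keeps that tensor small. The sketch below is one possible way to do this, not the notebook's tested fix: `hidden_state_gib` and `embed_in_batches` are hypothetical helpers, `batch_size=64` is an untuned assumption, and the pooling is switched to an attention-mask-weighted mean so padding tokens do not dilute the average (the original cell uses a plain mean).

```python
def hidden_state_gib(n_texts, seq_len=512, hidden=768, bytes_per_value=4):
    """Size in GiB of one activation tensor of shape (n_texts, seq_len, hidden)."""
    return n_texts * seq_len * hidden * bytes_per_value / 2**30

print(f"{hidden_state_gib(50_000):.2f} GiB")  # 73.24 GiB, as in the error message
print(f"{hidden_state_gib(64):.2f} GiB")      # 0.09 GiB for a batch of 64

def embed_in_batches(texts, tokenizer, model, device, batch_size=64):
    """Mean-pool BERT token embeddings batch by batch.

    Returns a NumPy array of shape (len(texts), hidden_size).
    """
    # Imported here so the size estimate above runs without PyTorch installed
    import numpy as np
    import torch

    model.eval()
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state       # (B, T, H)
        # Zero out padding positions before averaging over tokens
        mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)  # (B, T, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        chunks.append(pooled.cpu().numpy())                  # free GPU memory per batch
    return np.concatenate(chunks, axis=0)
```

With the objects from the cell above, `embeddings = embed_in_batches(customer_reviews, tokenizer, model, device)` would then give the array passed to `KMeans`.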

### 2. Latent Dirichlet Allocation (LDA) -> review themes (e.g., price, quality)
- Not working yet: both topics come out with the same two words. Needs investigation.

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample Dataset
documents = [
    "Machine learning is fascinating.",
    "Artificial intelligence and deep learning are branches of machine learning.",
    "I like icecream.",
    "AI and machine learning are transforming industries.",
    "Blueberry is delicious."
]

# Step 1: Text Preprocessing (e.g., Tokenization, Stopword Removal, Vectorization)
vectorizer = CountVectorizer(
    max_df=0.95,  # Ignore terms that appear in more than 95% of documents
    min_df=2,     # Ignore terms that appear in fewer than 2 documents
    stop_words='english'  # Remove common stopwords
)
X = vectorizer.fit_transform(documents)

# Step 2: Apply LDA
n_topics = 2  # Specify the number of topics
lda_model = LatentDirichletAllocation(
    n_components=n_topics,   # Number of topics
    max_iter=10,             # Maximum number of iterations
    learning_method='batch', # Batch or online learning
    random_state=42          # Random seed for reproducibility
)
lda_model.fit(X)

# Step 3: Display Topics
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 5  # Number of top words to display per topic
feature_names = vectorizer.get_feature_names_out()
display_topics(lda_model, feature_names, no_top_words)

# Step 4: Topic Distribution for Documents
doc_topic_distribution = lda_model.transform(X)
print("\nDocument-Topic Distributions:")
print(doc_topic_distribution)


Topic 1:
learning machine
Topic 2:
machine learning

Document-Topic Distributions:
[[0.72330431 0.27669569]
 [0.81495224 0.18504776]
 [0.5        0.5       ]
 [0.72330431 0.27669569]
 [0.5        0.5       ]]
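
A likely cause of the identical topics above is the vocabulary filter: with `min_df=2` on five short documents, only terms that occur in at least two documents survive. The sketch below recomputes document frequencies in plain Python to show which terms remain. It only approximates `CountVectorizer`: the regex mirrors its default token pattern and `STOP_WORDS` is a small stand-in for the full English stop-word list, both assumptions made for illustration.

```python
import re
from collections import Counter

documents = [
    "Machine learning is fascinating.",
    "Artificial intelligence and deep learning are branches of machine learning.",
    "I like icecream.",
    "AI and machine learning are transforming industries.",
    "Blueberry is delicious.",
]

# Small stand-in for scikit-learn's English stop-word list (assumption)
STOP_WORDS = {"is", "and", "are", "of", "i"}

def tokenize(doc):
    """Lowercase and return the unique non-stop-word tokens of 2+ word characters."""
    return set(re.findall(r"\b\w\w+\b", doc.lower())) - STOP_WORDS

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in documents for term in tokenize(doc))

# min_df=2 keeps only terms present in at least two documents
vocab = sorted(term for term, n in doc_freq.items() if n >= 2)
print(vocab)  # ['learning', 'machine']

# Documents with no surviving terms become all-zero rows for LDA
empty_docs = [i for i, doc in enumerate(documents) if not tokenize(doc) & set(vocab)]
print(empty_docs)  # [2, 4]
```

The documents at indices 2 and 4 keep no terms at all, which is consistent with their uniform `[0.5 0.5]` rows. Fitting on the actual reviews in `reviews_df`, or lowering `min_df` for toy data, would give LDA more vocabulary to work with.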
