<a href="https://colab.research.google.com/github/pksungwan/InsightToInterface/blob/imdb-bert-clustering/IMDB_Clustering_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Clustering Movies and TV Shows Using BERT Embeddings

The [BERT model](https://huggingface.co/google-bert/bert-base-uncased) can encode input text as a fixed-length embedding vector. Here the input text is the plot descriptions (the `Overview` column) of movies and TV shows from an IMDB dataset. The embeddings produced by the model are then grouped with K-Means clustering.

Source: https://medium.com/ai-in-plain-english/customer-segmentation-using-llms-advanced-clustering-techniques-for-effective-targeting-493116116ab6
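
The code cell in this notebook turns BERT's per-token outputs into one vector per description by averaging over the sequence dimension. Because the batch is padded, that plain average also includes padding positions. The sketch below shows the pooling step on small dummy tensors and compares it with a variant that uses the attention mask to skip padding. `masked_mean_pool` and the dummy values are hypothetical, written for illustration only; they are not part of the original notebook.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sequence; clamp avoids division by zero
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Dummy "hidden states": batch of 2 sequences, 3 tokens each, hidden size 2.
# The third token of the first sequence stands in for padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)  # padding excluded
plain = hidden.mean(dim=1)               # padding included, as in the notebook cell

print(pooled)
print(plain)
```

With real model output, the same function could be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. Whether this changes the clusters for a given dataset is something to check empirically.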


In [None]:
import torch
import kagglehub
import pandas as pd
from sklearn.cluster import KMeans
from transformers import BertTokenizer, BertModel

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Import IMDB dataset
path = kagglehub.dataset_download("harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows")
imdb_data = pd.read_csv(path + "/imdb_top_1000.csv")

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)

# Movies and TV Shows Descriptions
descriptions = imdb_data["Overview"][:10].to_list()

# Tokenize and encode the descriptions (padded to the longest one in the batch)
inputs = tokenizer(descriptions, return_tensors='pt', padding=True, truncation=True)
inputs = inputs.to(device)

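# Get the embeddings from the BERT model
with torch.no_grad():
    outputs = model(**inputs)
    # Mean-pool the token vectors into one vector per description
    # (this plain mean also averages over padding positions)
    embeddings = outputs.last_hidden_state.mean(dim=1)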

# Convert embeddings to a numpy array for clustering
embeddings = embeddings.cpu().numpy()

# Perform K-Means clustering (no random_state is set, so cluster assignments can vary between runs)
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(embeddings)

# Add the results to a DataFrame for better visualization
df = pd.DataFrame({
    'Review': descriptions,
    'Cluster': labels
})

# Show the clusters
print(df)



                                              Review  Cluster
0  Two imprisoned men bond over a number of years...        1
1  An organized crime dynasty's aging patriarch t...        0
2  When the menace known as the Joker wreaks havo...        1
3  The early life and career of Vito Corleone in ...        1
4  A jury holdout attempts to prevent a miscarria...        1
5  Gandalf and Aragorn lead the World of Men agai...        1
6  The lives of two mob hitmen, a boxer, a gangst...        1
7  In German-occupied Poland during World War II,...        0
8  A thief who steals corporate secrets through t...        1
9  An insomniac office worker and a devil-may-car...        1
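
The run above fixes `n_clusters=2` without checking whether two groups suit the data. One common way to compare candidate values is the silhouette score, which rewards tight, well-separated clusters. The sketch below is a minimal illustration on synthetic vectors standing in for BERT embeddings; `pick_k`, `toy_embeddings`, the candidate range, and the seed are all hypothetical choices, not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(embeddings, candidates, seed=0):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in candidates:
        # A fixed random_state makes the assignments repeatable across runs
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for embeddings: three well-separated groups in 8 dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
toy_embeddings = np.vstack([c + rng.normal(scale=0.5, size=(20, 8)) for c in centers])

best_k, scores = pick_k(toy_embeddings, candidates=range(2, 6))
print(best_k, scores)
```

On the real data, `toy_embeddings` would be replaced by the `embeddings` array from the cell above. With only 10 descriptions the scores are likely to be noisy, so a larger slice of the dataset would give a more meaningful comparison.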
