<a href="https://colab.research.google.com/github/armandossrecife/teste/blob/main/my_movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My movie recommendation system

## 1. Setup and Data Preparation

In [1]:
!pip install torch transformers pandas scikit-learn

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

In [2]:
!wget https://files.grouplens.org/datasets/movielens/ml-32m.zip

--2025-05-01 14:40:16--  https://files.grouplens.org/datasets/movielens/ml-32m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 238950008 (228M) [application/zip]
Saving to: ‘ml-32m.zip’


2025-05-01 14:40:22 (46.6 MB/s) - ‘ml-32m.zip’ saved [238950008/238950008]



In [3]:
!ls -lia

total 233372
     49 drwxr-xr-x 1 root root      4096 May  1 14:40 .
 262182 drwxr-xr-x 1 root root      4096 May  1 14:37 ..
6815756 drwxr-xr-x 4 root root      4096 Apr 29 13:36 .config
 267203 -rw-r--r-- 1 root root 238950008 Oct 13  2023 ml-32m.zip
     50 drwxr-xr-x 1 root root      4096 Apr 29 13:36 sample_data


In [4]:
!unzip ml-32m.zip

Archive:  ml-32m.zip
   creating: ml-32m/
  inflating: ml-32m/tags.csv         
  inflating: ml-32m/links.csv        
  inflating: ml-32m/README.txt       
  inflating: ml-32m/checksums.txt    
  inflating: ml-32m/ratings.csv      
  inflating: ml-32m/movies.csv       


In [5]:
!cat ml-32m/README.txt

Summary

This dataset (ml-32m) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 32000204 ratings and 2000072 tag applications across 87585 movies. These data were created by 200948 users between January 09, 1995 and October 12, 2023. This dataset was generated on October 13, 2023.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


Usage License

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability f

In [6]:
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

In [22]:
# Load MovieLens data (download from https://grouplens.org/datasets/movielens/)
movies = pd.read_csv('ml-32m/movies.csv')  # columns: movieId, title, genres
ratings = pd.read_csv('ml-32m/ratings.csv')  # columns: userId, movieId, rating, timestamp

In [23]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
87580,292731,The Monroy Affaire (2022),Drama
87581,292737,Shelter in Solitude (2023),Comedy|Drama
87582,292753,Orca (2023),Drama
87583,292755,The Angry Breed (1968),Drama


In [24]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858
...,...,...,...,...
32000199,200948,79702,4.5,1294412589
32000200,200948,79796,1.0,1287216292
32000201,200948,80350,0.5,1294412671
32000202,200948,80463,3.5,1350423800


## 2. Preprocess the Data

In [25]:
# Calculate average rating per movie
movie_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()

# Merge with movie data
movies = movies.merge(movie_ratings, on='movieId')

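# Clean titles (remove year in parentheses)
# regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default
movies['clean_title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()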

# Create genre list
movies['genre_list'] = movies['genres'].str.split('|')

# Sample data to work with (for demo purposes)
movies = movies.head(2000)  # Use full dataset for better results

In [26]:
movies

Unnamed: 0,movieId,title,genres,rating,clean_title,genre_list
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.897438,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.275758,Jumanji (1995),"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),Comedy|Romance,3.139447,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.845331,Waiting to Exhale (1995),"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),Comedy,3.059602,Father of the Bride Part II (1995),[Comedy]
...,...,...,...,...,...,...
1995,2084,Newsies (1992),Children|Musical,3.179399,Newsies (1992),"[Children, Musical]"
1996,2085,101 Dalmatians (One Hundred and One Dalmatians...,Adventure|Animation|Children,3.444388,101 Dalmatians (One Hundred and One Dalmatians...,"[Adventure, Animation, Children]"
1997,2086,One Magic Christmas (1985),Drama|Fantasy,3.079186,One Magic Christmas (1985),"[Drama, Fantasy]"
1998,2087,Peter Pan (1953),Animation|Children|Fantasy|Musical,3.561506,Peter Pan (1953),"[Animation, Children, Fantasy, Musical]"


In [28]:
movies.sort_values(by='rating', ascending=False)

Unnamed: 0,movieId,title,genres,rating,clean_title,genre_list
314,318,"Shawshank Redemption, The (1994)",Crime|Drama,4.404614,"Shawshank Redemption, The (1994)","[Crime, Drama]"
840,858,"Godfather, The (1972)",Crime|Drama,4.317030,"Godfather, The (1972)","[Crime, Drama]"
1173,1203,12 Angry Men (1957),Drama,4.265311,12 Angry Men (1957),[Drama]
49,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,4.265070,"Usual Suspects, The (1995)","[Crime, Mystery, Thriller]"
1190,1221,"Godfather: Part II, The (1974)",Crime|Drama,4.264468,"Godfather: Part II, The (1974)","[Crime, Drama]"
...,...,...,...,...,...,...
1671,1739,3 Ninjas: High Noon On Mega Mountain (1998),Action|Children|Comedy,1.592284,3 Ninjas: High Noon On Mega Mountain (1998),"[Action, Children, Comedy]"
1900,1989,Prom Night III: The Last Kiss (1989),Horror,1.585938,Prom Night III: The Last Kiss (1989),[Horror]
1901,1990,Prom Night IV: Deliver Us From Evil (1992),Horror,1.467949,Prom Night IV: Deliver Us From Evil (1992),[Horror]
1447,1495,Turbo: A Power Rangers Movie (1997),Action|Adventure|Children,1.434416,Turbo: A Power Rangers Movie (1997),"[Action, Adventure, Children]"


In [29]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   movieId      2000 non-null   int64  
 1   title        2000 non-null   object 
 2   genres       2000 non-null   object 
 3   rating       2000 non-null   float64
 4   clean_title  2000 non-null   object 
 5   genre_list   2000 non-null   object 
dtypes: float64(1), int64(1), object(4)
memory usage: 93.9+ KB


## 3. Create Movie Embeddings with BERT

In [30]:
# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')


In [31]:
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

In [32]:
# Create embeddings for movie titles
movies['title_embedding'] = movies['clean_title'].apply(lambda x: get_bert_embedding(x))

In [33]:
# Create embeddings for genres (treat as a single string)
movies['genre_embedding'] = movies['genres'].apply(lambda x: get_bert_embedding(x))

## 4. Build Recommendation System

In [34]:
from sklearn.preprocessing import normalize
import numpy as np

In [36]:
# Combine title and genre embeddings with rating
title_embeddings = np.stack(movies['title_embedding'].values)
genre_embeddings = np.stack(movies['genre_embedding'].values)
ratings_scaled = MinMaxScaler().fit_transform(movies[['rating']])

In [37]:
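# Weighted combination (adjust weights as needed)
# ratings_scaled has shape (n_movies, 1), so broadcasting adds the same
# scaled rating to every embedding dimension
combined_embeddings = 0.5 * title_embeddings + 0.3 * genre_embeddings + 0.2 * ratings_scaled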
combined_embeddings = normalize(combined_embeddings)

In [38]:
# Store in dataframe
movies['combined_embedding'] = list(combined_embeddings)

In [41]:
def recommend_movies(movie_title, top_n=5):
    # Get embedding for input movie
    input_embedding = get_bert_embedding(movie_title)

    # Find most similar movies
    similarities = cosine_similarity(
        [input_embedding],
        np.stack(movies['combined_embedding'].values)
    )[0]

    # Get top matches
    top_indices = similarities.argsort()[-top_n:][::-1]
    recommendations = movies.iloc[top_indices][['title', 'genres', 'rating']]

    return recommendations

## 5. Example Usage

In [42]:
# Get recommendations for a movie
recommendations = recommend_movies("The Dark Knight")
print("Recommended movies similar to 'The Dark Knight':")
print(recommendations)

Recommended movies similar to 'The Dark Knight':
                                                  title     genres    rating
1891     Friday the 13th Part VII: The New Blood (1988)     Horror  2.088446
1761                              Prince Valiant (1997)  Adventure  2.342593
1892  Friday the 13th Part VIII: Jason Takes Manhatt...     Horror  1.854839
1895          Halloween III: Season of the Witch (1982)     Horror  2.093382
1889     Friday the 13th Part V: A New Beginning (1985)     Horror  2.079764


# Explanation

## Movie Recommendation System with MovieLens Dataset

Here's a complete example of building a movie recommendation system using the MovieLens dataset, PyTorch, and transformer-based embeddings.

### 1. Setup and Data Preparation

First, let's install required packages and load the data:

```python
!pip install torch transformers pandas scikit-learn
```

```python
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Load MovieLens data (download from https://grouplens.org/datasets/movielens/)
movies = pd.read_csv('ml-32m/movies.csv')  # columns: movieId, title, genres
ratings = pd.read_csv('ml-32m/ratings.csv')  # columns: userId, movieId, rating, timestamp
```

### 2. Preprocess the Data

```python
# Calculate average rating per movie
movie_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()

# Merge with movie data
movies = movies.merge(movie_ratings, on='movieId')

# Clean titles (remove year in parentheses)
# regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default
movies['clean_title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

# Create genre list
movies['genre_list'] = movies['genres'].str.split('|')

# Sample data to work with (for demo purposes)
movies = movies.head(2000)  # Use full dataset for better results
```
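
The title-cleaning and genre-splitting steps can be checked in isolation on a tiny frame. The sketch below is only an illustration: the `toy` DataFrame and its two rows are made up for the example, and the pattern is slightly stricter than the one above (it anchors the year to the end of the title).

```python
import pandas as pd

# Toy frame with the same column names as ml-32m/movies.csv (rows are illustrative)
toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Shawshank Redemption, The (1994)"],
    "genres": ["Adventure|Animation", "Crime|Drama"],
})

# regex=True makes pandas interpret the pattern as a regular expression
toy["clean_title"] = (
    toy["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True).str.strip()
)

# A single-character separator is split literally, so "|" needs no escaping
toy["genre_list"] = toy["genres"].str.split("|")

print(toy[["clean_title", "genre_list"]])
```

If `clean_title` still shows the year after running the real preprocessing step, the most likely cause is a missing `regex=True`.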

### 3. Create Movie Embeddings with BERT

```python
# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Create embeddings for movie titles
movies['title_embedding'] = movies['clean_title'].apply(lambda x: get_bert_embedding(x))

# Create embeddings for genres (treat as a single string)
movies['genre_embedding'] = movies['genres'].apply(lambda x: get_bert_embedding(x))
```
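
`get_bert_embedding` encodes one string per call, so no padding tokens are present and a plain mean over tokens is fine. If the titles were encoded in batches instead, padded positions would enter the average unless they are masked out. The following is a minimal NumPy sketch of that masked mean; `masked_mean_pool` is a hypothetical helper, and the same arithmetic would apply to `last_hidden_state` and `attention_mask` tensors.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden: (batch, seq_len, dim) token vectors
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # (batch, 1), avoids division by zero
    return summed / counts

# Two sequences of three tokens; the last token of the first one is padding
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

Batching in this way is one option for reducing the number of forward passes over the 2,000 titles.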

### 4. Build Recommendation System

```python
from sklearn.preprocessing import normalize
import numpy as np

# Combine title and genre embeddings with rating
title_embeddings = np.stack(movies['title_embedding'].values)
genre_embeddings = np.stack(movies['genre_embedding'].values)
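ratings_scaled = MinMaxScaler().fit_transform(movies[['rating']])

# Weighted combination (adjust weights as needed)
# ratings_scaled has shape (n_movies, 1), so broadcasting adds the same
# scaled rating to every embedding dimension
combined_embeddings = 0.5 * title_embeddings + 0.3 * genre_embeddings + 0.2 * ratings_scaled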
combined_embeddings = normalize(combined_embeddings)

# Store in dataframe
movies['combined_embedding'] = list(combined_embeddings)

def recommend_movies(movie_title, top_n=5):
    # Get embedding for input movie
    input_embedding = get_bert_embedding(movie_title)
    
    # Find most similar movies
    similarities = cosine_similarity(
        [input_embedding],
        np.stack(movies['combined_embedding'].values)
    )[0]
    
    # Get top matches
    top_indices = similarities.argsort()[-top_n:][::-1]
    recommendations = movies.iloc[top_indices][['title', 'genres', 'rating']]
    
    return recommendations
```
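
`recommend_movies` embeds the query as raw title text, while the candidates are stored as a mixture of title, genre, and rating, so the two sides are not built the same way. When the query movie is already in the catalogue, one possible alternative is to compare stored vectors with each other. The sketch below assumes that setting; `recommend_by_index` and the toy `titles` and `vectors` are hypothetical and not part of the code above.

```python
import numpy as np

def recommend_by_index(vectors, titles, query_title, top_n=2):
    """Rank items by cosine similarity to the stored vector of `query_title`."""
    idx = titles.index(query_title)                  # raises ValueError if the title is absent
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[idx]                          # cosine similarity to the query row
    sims[idx] = -np.inf                              # never recommend the query itself
    order = np.argsort(sims)[::-1][:top_n]
    return [(titles[i], float(sims[i])) for i in order]

# Toy catalogue: B points almost the same way as A, C is orthogonal, D is opposite
titles = ["A", "B", "C", "D"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

result = recommend_by_index(vectors, titles, "A")
print(result)
```

With real data, `vectors` would be `np.stack(movies['combined_embedding'].values)` and `titles` the matching list of movie titles.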

### 5. Example Usage

```python
# Get recommendations for a movie
recommendations = recommend_movies("The Dark Knight")
print("Recommended movies similar to 'The Dark Knight':")
print(recommendations)
```

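### 6. Expected Output

The output below is the one recorded in the notebook run above, which used only the first 2,000 movies of `ml-32m`. That subset ends at `movieId` 2087 and does not contain "The Dark Knight" (2008), so the closest matches are mostly horror sequels rather than related titles:

```
Recommended movies similar to 'The Dark Knight':
                                                  title     genres    rating
1891     Friday the 13th Part VII: The New Blood (1988)     Horror  2.088446
1761                              Prince Valiant (1997)  Adventure  2.342593
1892  Friday the 13th Part VIII: Jason Takes Manhatt...     Horror  1.854839
1895          Halloween III: Season of the Witch (1982)     Horror  2.093382
1889     Friday the 13th Part V: A New Beginning (1985)     Horror  2.079764
```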

### Key Components Explained:

1. **BERT Embeddings**:
   - We use BERT to convert movie titles and genres into numerical vectors that capture semantic meaning
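   - Title embeddings are meant to place related titles (for example "Dark Knight" and "Batman") close together, although mean-pooled `bert-base-uncased` vectors for short titles do not guarantee this
   - Genre embeddings encode the pipe-separated genre string, so movies that share genres tend to get similar vectors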

2. **Combined Features**:
   - We combine title embeddings, genre embeddings, and average ratings
   - The weights (0.5, 0.3, 0.2) can be adjusted based on which factors are more important

3. **Cosine Similarity**:
   - Measures the cosine of the angle between two vectors (1 = same direction, 0 = orthogonal, -1 = opposite direction)
   - Finds movies with embedding vectors closest to the input movie
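
The weighted sum used in this notebook adds the scaled rating to every embedding dimension through broadcasting. A different way to combine the three signals, shown here only as a sketch under that assumption, is to normalize each block, weight it, and concatenate, so that each feature keeps its own dimensions. The helpers `l2_normalize`, `combine`, and `cosine` and all toy values are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def combine(title_vecs, genre_vecs, ratings, weights=(0.5, 0.3, 0.2)):
    """Concatenate weighted blocks so each feature keeps its own dimensions."""
    w_title, w_genre, w_rating = weights
    blocks = [
        w_title * l2_normalize(title_vecs),
        w_genre * l2_normalize(genre_vecs),
        w_rating * ratings.reshape(-1, 1),   # ratings assumed already scaled to [0, 1]
    ]
    return np.hstack(blocks)

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

title_vecs = np.array([[1.0, 0.0], [0.0, 2.0]])
genre_vecs = np.array([[3.0, 4.0], [3.0, 4.0]])
ratings = np.array([1.0, 0.0])
combined = combine(title_vecs, genre_vecs, ratings)

same = cosine(np.array([1.0, 0.0]), np.array([2.0, 0.0]))       # same direction
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # no overlap
opposite = cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))   # opposite direction
print(combined.shape, same, orthogonal, opposite)
```

Because the blocks are concatenated, changing one weight only rescales that feature's dimensions instead of shifting the whole vector.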

### Improvements for Production:

1. Use the full MovieLens dataset (`ml-32m`, about 32 million ratings) instead of the first 2,000 movies for better recommendations
2. Add user-specific filtering based on viewing history
3. Implement a neural network to learn optimal feature weights
4. Cache embeddings for faster performance
5. Add popularity/recency factors to the scoring
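
As an illustration of item 4, embeddings can be written to disk once and reloaded on later runs. This is a minimal sketch, not a production cache: `load_or_compute` and `fake_embeddings` are hypothetical names, and `fake_embeddings` merely stands in for the slow BERT pass.

```python
import os
import tempfile

import numpy as np

def load_or_compute(path, compute_fn):
    """Return the array stored at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        return np.load(path)
    array = np.asarray(compute_fn())
    np.save(path, array)
    return array

calls = []

def fake_embeddings():
    # Stand-in for the slow BERT pass; records how often it is invoked
    calls.append(1)
    return np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "title_embeddings.npy")
    first = load_or_compute(cache_path, fake_embeddings)    # computes and saves
    second = load_or_compute(cache_path, fake_embeddings)   # loads from disk

print(len(calls), np.array_equal(first, second))
```

A real cache would also need to be invalidated when the model or the set of titles changes, which this sketch does not handle.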