# MARES - Model Development

## Goal

To build a model that:

1. Gets all of the good reviews (>= 4 stars).
2. Summarizes each user's taste as a feature vector:
    - user_a -> [f1_sum, f2_sum, ..., fn_sum]
3. Uses KNN to cluster users and calculate similarities between them.
4. Asks a new user for their favorite movies, then calculates that user's vector of summarized features (taste).
5. Finds similar users.
6. Finds the movies that those similar users liked.
7. Recommends those movies to the new user.
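
The steps above can be sketched end to end on toy data. This is a minimal illustration, not the notebook's final model: the `liked` frame, its two genre columns, and the `new_user_taste` vector are hypothetical, and scikit-learn's unsupervised `NearestNeighbors` is assumed as the KNN step.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: one row per (user, liked movie) with the movie's features.
liked = pd.DataFrame({
    "userId":     [1, 1, 2, 2, 3, 3],
    "movieId":    [10, 11, 10, 12, 13, 14],
    "gen_action": [1, 1, 1, 1, 0, 0],
    "gen_drama":  [0, 0, 0, 1, 1, 1],
})

# Step 2: one taste vector per user (here the mean of the liked movies' features).
taste = liked.groupby("userId")[["gen_action", "gen_drama"]].mean()

# Steps 3 and 5: index the users, then find the nearest neighbour of a new taste vector.
knn = NearestNeighbors(n_neighbors=1).fit(taste.to_numpy())
new_user_taste = np.array([[1.0, 0.2]])  # hypothetical new user
_, idx = knn.kneighbors(new_user_taste)
similar_users = taste.index[idx[0]].tolist()

# Steps 6 and 7: recommend the movies that the similar users liked.
mask = liked["userId"].isin(similar_users)
recommended = sorted(liked.loc[mask, "movieId"].unique().tolist())
print(similar_users, recommended)
```

With real data, `n_neighbors` would be larger, and movies the new user has already seen would be filtered out of the recommendations.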

## Imports

In [246]:
import pandas as pd
import pprint
import numpy as np
from google.cloud import bigquery
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier


## Getting the data

In [247]:
client = bigquery.Client()

ratings_query = """
    SELECT *
    FROM `le-wagon-march2024.mares_movies.ratings`
"""

movies_query = """
    SELECT *
    FROM `le-wagon-march2024.mares_movies.movie_detail`
"""

ratings_job = client.query(ratings_query)
movies_job = client.query(movies_query)

ratings_df = ratings_job.result().to_dataframe()
movies_df = movies_job.result().to_dataframe()

In [248]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     Int64  
 1   movieId    Int64  
 2   rating     float64
 3   timestamp  Int64  
dtypes: Int64(3), float64(1)
memory usage: 834.5 MB


In [249]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62620 entries, 0 to 62619
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movie_id      62620 non-null  Int64  
 1   title         62620 non-null  object 
 2   img_path      60555 non-null  object 
 3   year          62054 non-null  object 
 4   avg_rating    59244 non-null  float64
 5   rating_count  62620 non-null  Int64  
 6   genres        62620 non-null  object 
 7   summary       61360 non-null  object 
 8   duration      61746 non-null  float64
 9   pop_score     62492 non-null  object 
dtypes: Int64(2), float64(2), object(6)
memory usage: 4.9+ MB


In [250]:
movies_df = movies_df.rename(columns={"movie_id":"movieId"})
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62620 entries, 0 to 62619
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movieId       62620 non-null  Int64  
 1   title         62620 non-null  object 
 2   img_path      60555 non-null  object 
 3   year          62054 non-null  object 
 4   avg_rating    59244 non-null  float64
 5   rating_count  62620 non-null  Int64  
 6   genres        62620 non-null  object 
 7   summary       61360 non-null  object 
 8   duration      61746 non-null  float64
 9   pop_score     62492 non-null  object 
dtypes: Int64(2), float64(2), object(6)
memory usage: 4.9+ MB


## Dropping duplicates

In [251]:
movies_df = movies_df.drop_duplicates(subset=['movieId'], keep='first')

In [252]:
movies_df = movies_df.reset_index(drop=True)

In [253]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movieId       62423 non-null  Int64  
 1   title         62423 non-null  object 
 2   img_path      60358 non-null  object 
 3   year          61857 non-null  object 
 4   avg_rating    59047 non-null  float64
 5   rating_count  62423 non-null  Int64  
 6   genres        62423 non-null  object 
 7   summary       61163 non-null  object 
 8   duration      61549 non-null  float64
 9   pop_score     62295 non-null  object 
dtypes: Int64(2), float64(2), object(6)
memory usage: 4.9+ MB


## Cleaning Data

### Movies data frame

In [254]:
movies_titles_df = movies_df[["movieId","title"]]
movies_relevant_features_df = movies_df.drop(columns=["img_path", "summary"])

In [255]:
movies_titles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  Int64 
 1   title    62423 non-null  object
dtypes: Int64(1), object(1)
memory usage: 1.0+ MB


In [256]:
movies_relevant_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movieId       62423 non-null  Int64  
 1   title         62423 non-null  object 
 2   year          61857 non-null  object 
 3   avg_rating    59047 non-null  float64
 4   rating_count  62423 non-null  Int64  
 5   genres        62423 non-null  object 
 6   duration      61549 non-null  float64
 7   pop_score     62295 non-null  object 
dtypes: Int64(2), float64(2), object(4)
memory usage: 3.9+ MB


#### Year

Goal: bucketize the release years.

First we drop the movies without a year. Then we split the remaining years into five equal-width buckets with pd.cut, labeled A (oldest) to E (most recent).

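
To see which years land in which bucket, `pd.cut` can also return the bin edges through `retbins=True`. The sketch below uses a small hypothetical list of years spanning the same 1874 to 2019 range as the data, so the edges should match the ones used in the notebook.

```python
import pandas as pd

# Hypothetical sample that covers the same min (1874) and max (2019) as the data.
years = pd.Series([1874, 1900, 1951, 1962, 1995, 2019])

# bins=5 asks for five equal-width intervals between the min and the max.
buckets, edges = pd.cut(
    years, bins=5, labels=["A", "B", "C", "D", "E"], retbins=True
)
print(buckets.tolist())
print(edges)
```

Each bucket spans (2019 - 1874) / 5 = 29 years, so a 1995 movie falls in E and a 1962 movie in D.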

In [257]:
# Keep only the movies that have a year, then cast the year to int
clean_1 = movies_relevant_features_df[movies_relevant_features_df.year.notna()].copy()
clean_1["year"] = clean_1["year"].astype(int)

In [258]:
clean_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 61857 entries, 0 to 62422
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movieId       61857 non-null  Int64  
 1   title         61857 non-null  object 
 2   year          61857 non-null  int64  
 3   avg_rating    58520 non-null  float64
 4   rating_count  61857 non-null  Int64  
 5   genres        61857 non-null  object 
 6   duration      61040 non-null  float64
 7   pop_score     61735 non-null  object 
dtypes: Int64(2), float64(2), int64(1), object(3)
memory usage: 4.4+ MB


In [259]:
clean_1.year.describe()

count    61857.000000
mean      1992.033125
std         25.375008
min       1874.000000
25%       1976.000000
50%       2002.000000
75%       2012.000000
max       2019.000000
Name: year, dtype: float64

In [260]:
bucketize_years = lambda x: pd.cut(x, bins=5, labels=['A', 'B', 'C', 'D', 'E'])

In [261]:
clean_1['year_bucketized'] = bucketize_years(clean_1.year)

In [262]:
clean_1.head()

Unnamed: 0,movieId,title,year,avg_rating,rating_count,genres,duration,pop_score,year_bucketized
0,4,Waiting to Exhale,1995,2.853547,2523,Comedy|Drama|Romance,127.0,13.24,E
1,6,Heat,1995,3.854909,24588,Action|Crime|Thriller,170.0,49.132,E
2,7,Sabrina,1995,3.363666,12132,Comedy|Romance,127.0,15.428,E
3,10,GoldenEye,1995,3.421458,28265,Action|Adventure|Thriller,130.0,45.794,E
4,14,Nixon,1995,3.423489,5509,Drama,192.0,18.411,E


In [263]:
clean_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 61857 entries, 0 to 62422
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   movieId          61857 non-null  Int64   
 1   title            61857 non-null  object  
 2   year             61857 non-null  int64   
 3   avg_rating       58520 non-null  float64 
 4   rating_count     61857 non-null  Int64   
 5   genres           61857 non-null  object  
 6   duration         61040 non-null  float64 
 7   pop_score        61735 non-null  object  
 8   year_bucketized  61857 non-null  category
dtypes: Int64(2), category(1), float64(2), int64(1), object(3)
memory usage: 4.4+ MB


In [264]:
clean_1 = clean_1.reset_index(drop=True)

In [265]:
clean_1

Unnamed: 0,movieId,title,year,avg_rating,rating_count,genres,duration,pop_score,year_bucketized
0,4,Waiting to Exhale,1995,2.853547,2523,Comedy|Drama|Romance,127.0,13.24,E
1,6,Heat,1995,3.854909,24588,Action|Crime|Thriller,170.0,49.132,E
2,7,Sabrina,1995,3.363666,12132,Comedy|Romance,127.0,15.428,E
3,10,GoldenEye,1995,3.421458,28265,Action|Adventure|Thriller,130.0,45.794,E
4,14,Nixon,1995,3.423489,5509,Drama,192.0,18.411,E
...,...,...,...,...,...,...,...,...,...
61852,206303,The Crimes That Bind,2018,3.166667,3,Mystery|Thriller,119.0,5.622,E
61853,206531,Sons of Denmark,2019,2.500000,2,Drama|Thriller,119.0,3.712,E
61854,206893,Matthias & Maxime,2019,2.500000,3,Drama,119.0,20.536,E
61855,207323,The 8-Year Engagement,2017,4.000000,1,Drama|Romance,119.0,9.107,E


In [266]:
clean_1 = clean_1.drop(columns=["year"])
clean_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   movieId          61857 non-null  Int64   
 1   title            61857 non-null  object  
 2   avg_rating       58520 non-null  float64 
 3   rating_count     61857 non-null  Int64   
 4   genres           61857 non-null  object  
 5   duration         61040 non-null  float64 
 6   pop_score        61735 non-null  object  
 7   year_bucketized  61857 non-null  category
dtypes: Int64(2), category(1), float64(2), object(3)
memory usage: 3.5+ MB


In [267]:
clean_1.year_bucketized.unique().tolist()

['E', 'D', 'C', 'B', 'A']

In [268]:
year_encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
year_unique_values = clean_1.year_bucketized.unique().tolist()
year_unique_values_reshaped = np.array(year_unique_values).reshape(-1, 1)
year_encoder.fit(year_unique_values_reshaped)
year_encoded_values = year_encoder.transform(clean_1.year_bucketized.values.reshape(-1, 1))
year_encoded_values.columns = ["years_A","years_B","years_C","years_D","years_E"]
year_encoded_values


Unnamed: 0,years_A,years_B,years_C,years_D,years_E
0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
61852,0.0,0.0,0.0,0.0,1.0
61853,0.0,0.0,0.0,0.0,1.0
61854,0.0,0.0,0.0,0.0,1.0
61855,0.0,0.0,0.0,0.0,1.0


In [269]:
clean_1[year_encoded_values.columns] = year_encoded_values
clean_1 = clean_1.drop(columns=["year_bucketized"])
clean_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movieId       61857 non-null  Int64  
 1   title         61857 non-null  object 
 2   avg_rating    58520 non-null  float64
 3   rating_count  61857 non-null  Int64  
 4   genres        61857 non-null  object 
 5   duration      61040 non-null  float64
 6   pop_score     61735 non-null  object 
 7   years_A       61857 non-null  float64
 8   years_B       61857 non-null  float64
 9   years_C       61857 non-null  float64
 10  years_D       61857 non-null  float64
 11  years_E       61857 non-null  float64
dtypes: Int64(2), float64(7), object(3)
memory usage: 5.8+ MB


#### Genres

In [270]:
clean_2 = clean_1.copy()

In [271]:
clean_2.genres

0             Comedy|Drama|Romance
1            Action|Crime|Thriller
2                   Comedy|Romance
3        Action|Adventure|Thriller
4                            Drama
                   ...            
61852             Mystery|Thriller
61853               Drama|Thriller
61854                        Drama
61855                Drama|Romance
61856                      Mystery
Name: genres, Length: 61857, dtype: object

In [272]:
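# fillna returns a new Series, so assign it back (genres has no nulls here, so this is only a safeguard)
clean_2["genres"] = clean_2["genres"].fillna(" ")
clean_2.genres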

0             Comedy|Drama|Romance
1            Action|Crime|Thriller
2                   Comedy|Romance
3        Action|Adventure|Thriller
4                            Drama
                   ...            
61852             Mystery|Thriller
61853               Drama|Thriller
61854                        Drama
61855                Drama|Romance
61856                      Mystery
Name: genres, Length: 61857, dtype: object

In [273]:
clean_2['genres'] = clean_2['genres'].str.split('|')

In [274]:
clean_2.genres

0             [Comedy, Drama, Romance]
1            [Action, Crime, Thriller]
2                    [Comedy, Romance]
3        [Action, Adventure, Thriller]
4                              [Drama]
                     ...              
61852              [Mystery, Thriller]
61853                [Drama, Thriller]
61854                          [Drama]
61855                 [Drama, Romance]
61856                        [Mystery]
Name: genres, Length: 61857, dtype: object

In [275]:
len(clean_2)

61857

In [276]:
mlb = MultiLabelBinarizer()

one_hot_encoded_genres = mlb.fit_transform(clean_2['genres'])

one_hot_encoded_genres_df = pd.DataFrame(one_hot_encoded_genres, columns=mlb.classes_)

one_hot_encoded_genres_df.columns = one_hot_encoded_genres_df.columns.str.lower().str.replace('-', '_').str.replace(' ', '_')

# Drop the "no genres listed" placeholder column, which sorts first among the classes
one_hot_encoded_genres_df = one_hot_encoded_genres_df.drop(one_hot_encoded_genres_df.columns[0], axis=1)
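
The same multi-hot encoding can be sketched with pandas alone, dropping the placeholder column by name rather than by position. The three sample strings are hypothetical, and the placeholder label `(no genres listed)` is an assumption about how the raw data spells it.

```python
import pandas as pd

# Hypothetical sample of pipe-separated genre strings.
genres = pd.Series(["Comedy|Drama", "Film-Noir", "(no genres listed)"])

# One 0/1 column per distinct genre.
dummies = genres.str.get_dummies(sep="|")

# Assumed placeholder label; errors="ignore" keeps this safe if it is absent.
dummies = dummies.drop(columns=["(no genres listed)"], errors="ignore")

# Same naming convention as the notebook: lower case, underscores, "gen_" prefix.
dummies.columns = dummies.columns.str.lower().str.replace("-", "_").str.replace(" ", "_")
dummies = dummies.add_prefix("gen_")
print(dummies)
```

Dropping by name fails safely if the column order ever changes, whereas dropping `columns[0]` would silently remove a real genre.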

In [277]:
one_hot_encoded_genres_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   action       61857 non-null  int64
 1   adventure    61857 non-null  int64
 2   animation    61857 non-null  int64
 3   children     61857 non-null  int64
 4   comedy       61857 non-null  int64
 5   crime        61857 non-null  int64
 6   documentary  61857 non-null  int64
 7   drama        61857 non-null  int64
 8   fantasy      61857 non-null  int64
 9   film_noir    61857 non-null  int64
 10  horror       61857 non-null  int64
 11  imax         61857 non-null  int64
 12  musical      61857 non-null  int64
 13  mystery      61857 non-null  int64
 14  romance      61857 non-null  int64
 15  sci_fi       61857 non-null  int64
 16  thriller     61857 non-null  int64
 17  war          61857 non-null  int64
 18  western      61857 non-null  int64
dtypes: int64(19)
memory usage: 9.0 MB


In [278]:
one_hot_encoded_genres_df = one_hot_encoded_genres_df.add_prefix("gen_")

In [279]:
clean_2[one_hot_encoded_genres_df.columns] = one_hot_encoded_genres_df
clean_2

Unnamed: 0,movieId,title,avg_rating,rating_count,genres,duration,pop_score,years_A,years_B,years_C,...,gen_film_noir,gen_horror,gen_imax,gen_musical,gen_mystery,gen_romance,gen_sci_fi,gen_thriller,gen_war,gen_western
0,4,Waiting to Exhale,2.853547,2523,"[Comedy, Drama, Romance]",127.0,13.24,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,0
1,6,Heat,3.854909,24588,"[Action, Crime, Thriller]",170.0,49.132,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
2,7,Sabrina,3.363666,12132,"[Comedy, Romance]",127.0,15.428,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,0
3,10,GoldenEye,3.421458,28265,"[Action, Adventure, Thriller]",130.0,45.794,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
4,14,Nixon,3.423489,5509,[Drama],192.0,18.411,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61852,206303,The Crimes That Bind,3.166667,3,"[Mystery, Thriller]",119.0,5.622,0.0,0.0,0.0,...,0,0,0,0,1,0,0,1,0,0
61853,206531,Sons of Denmark,2.500000,2,"[Drama, Thriller]",119.0,3.712,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
61854,206893,Matthias & Maxime,2.500000,3,[Drama],119.0,20.536,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
61855,207323,The 8-Year Engagement,4.000000,1,"[Drama, Romance]",119.0,9.107,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,0


In [280]:
clean_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 31 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          61857 non-null  Int64  
 1   title            61857 non-null  object 
 2   avg_rating       58520 non-null  float64
 3   rating_count     61857 non-null  Int64  
 4   genres           61857 non-null  object 
 5   duration         61040 non-null  float64
 6   pop_score        61735 non-null  object 
 7   years_A          61857 non-null  float64
 8   years_B          61857 non-null  float64
 9   years_C          61857 non-null  float64
 10  years_D          61857 non-null  float64
 11  years_E          61857 non-null  float64
 12  gen_action       61857 non-null  int64  
 13  gen_adventure    61857 non-null  int64  
 14  gen_animation    61857 non-null  int64  
 15  gen_children     61857 non-null  int64  
 16  gen_comedy       61857 non-null  int64  
 17  gen_crime   

In [281]:
clean_2 = clean_2.drop(columns=["genres"])

#### Duration

In [282]:
clean_3 = clean_2.copy()
clean_3

Unnamed: 0,movieId,title,avg_rating,rating_count,duration,pop_score,years_A,years_B,years_C,years_D,...,gen_film_noir,gen_horror,gen_imax,gen_musical,gen_mystery,gen_romance,gen_sci_fi,gen_thriller,gen_war,gen_western
0,4,Waiting to Exhale,2.853547,2523,127.0,13.24,0.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,0
1,6,Heat,3.854909,24588,170.0,49.132,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
2,7,Sabrina,3.363666,12132,127.0,15.428,0.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,0
3,10,GoldenEye,3.421458,28265,130.0,45.794,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
4,14,Nixon,3.423489,5509,192.0,18.411,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61852,206303,The Crimes That Bind,3.166667,3,119.0,5.622,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,0,1,0,0
61853,206531,Sons of Denmark,2.500000,2,119.0,3.712,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
61854,206893,Matthias & Maxime,2.500000,3,119.0,20.536,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
61855,207323,The 8-Year Engagement,4.000000,1,119.0,9.107,0.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,0


In [283]:
clean_3.duration.describe()

count    61040.000000
mean        93.517824
std         30.959350
min          0.000000
25%         85.000000
50%         94.000000
75%        105.000000
max        803.000000
Name: duration, dtype: float64

In [284]:
clean_3['length'] = clean_3['duration'].fillna('unknown').map(lambda x: 'unknown' if x == 'unknown' 
                                                                    else 'short' if x < 85 
                                                                    else 'medium' if x <= 105 
                                                                    else 'long')
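
The thresholds 85 and 105 are the 25th and 75th percentiles from the `describe()` output above. A vectorized version of the same rule can be sketched with `np.select`, which avoids mixing strings and floats in one column. The `duration` values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes, including one missing value.
duration = pd.Series([60.0, 85.0, 105.0, 170.0, np.nan])

# Conditions are checked in order; the first match wins.
length = pd.Series(
    np.select(
        [duration.isna(), duration < 85, duration <= 105],
        ["unknown", "short", "medium"],
        default="long",
    ),
    index=duration.index,
)
print(length.tolist())
```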

In [285]:
clean_3 = clean_3.drop(columns=["duration"])

In [286]:
clean_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 30 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          61857 non-null  Int64  
 1   title            61857 non-null  object 
 2   avg_rating       58520 non-null  float64
 3   rating_count     61857 non-null  Int64  
 4   pop_score        61735 non-null  object 
 5   years_A          61857 non-null  float64
 6   years_B          61857 non-null  float64
 7   years_C          61857 non-null  float64
 8   years_D          61857 non-null  float64
 9   years_E          61857 non-null  float64
 10  gen_action       61857 non-null  int64  
 11  gen_adventure    61857 non-null  int64  
 12  gen_animation    61857 non-null  int64  
 13  gen_children     61857 non-null  int64  
 14  gen_comedy       61857 non-null  int64  
 15  gen_crime        61857 non-null  int64  
 16  gen_documentary  61857 non-null  int64  
 17  gen_drama   

In [287]:
unique_len_values = clean_3.length.unique().tolist()

In [288]:
length_encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")

In [289]:
unique_len_values_reshaped = np.array(unique_len_values).reshape(-1, 1)

In [290]:
unique_len_values_reshaped

array([['long'],
       ['short'],
       ['unknown'],
       ['medium']], dtype='<U7')

In [291]:
length_encoder.fit(unique_len_values_reshaped)

In [292]:
len_encoded_values = length_encoder.transform(clean_3['length'].values.reshape(-1, 1))

In [293]:
len_encoded_values 

Unnamed: 0,x0_long,x0_medium,x0_short,x0_unknown
0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0
...,...,...,...,...
61852,1.0,0.0,0.0,0.0
61853,1.0,0.0,0.0,0.0
61854,1.0,0.0,0.0,0.0
61855,1.0,0.0,0.0,0.0


In [294]:
len_encoded_values.columns = ["len_lon","len_medium","len_short", "len_unknown"]
len_encoded_values

Unnamed: 0,len_lon,len_medium,len_short,len_unknown
0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0
...,...,...,...,...
61852,1.0,0.0,0.0,0.0
61853,1.0,0.0,0.0,0.0
61854,1.0,0.0,0.0,0.0
61855,1.0,0.0,0.0,0.0


In [295]:
clean_3[len_encoded_values.columns] = len_encoded_values
clean_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 34 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          61857 non-null  Int64  
 1   title            61857 non-null  object 
 2   avg_rating       58520 non-null  float64
 3   rating_count     61857 non-null  Int64  
 4   pop_score        61735 non-null  object 
 5   years_A          61857 non-null  float64
 6   years_B          61857 non-null  float64
 7   years_C          61857 non-null  float64
 8   years_D          61857 non-null  float64
 9   years_E          61857 non-null  float64
 10  gen_action       61857 non-null  int64  
 11  gen_adventure    61857 non-null  int64  
 12  gen_animation    61857 non-null  int64  
 13  gen_children     61857 non-null  int64  
 14  gen_comedy       61857 non-null  int64  
 15  gen_crime        61857 non-null  int64  
 16  gen_documentary  61857 non-null  int64  
 17  gen_drama   

In [296]:
clean_3 = clean_3.drop(columns=["length","len_unknown"])
clean_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 32 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          61857 non-null  Int64  
 1   title            61857 non-null  object 
 2   avg_rating       58520 non-null  float64
 3   rating_count     61857 non-null  Int64  
 4   pop_score        61735 non-null  object 
 5   years_A          61857 non-null  float64
 6   years_B          61857 non-null  float64
 7   years_C          61857 non-null  float64
 8   years_D          61857 non-null  float64
 9   years_E          61857 non-null  float64
 10  gen_action       61857 non-null  int64  
 11  gen_adventure    61857 non-null  int64  
 12  gen_animation    61857 non-null  int64  
 13  gen_children     61857 non-null  int64  
 14  gen_comedy       61857 non-null  int64  
 15  gen_crime        61857 non-null  int64  
 16  gen_documentary  61857 non-null  int64  
 17  gen_drama   

#### avg_rating, rating_count and pop_score

In [297]:
clean_4 = clean_3.copy()

In [298]:
imputer = SimpleImputer(strategy='mean').set_output(transform="pandas")

In [299]:
cols_to_impute = clean_4[["avg_rating","rating_count","pop_score"]]
cols_to_impute

Unnamed: 0,avg_rating,rating_count,pop_score
0,2.853547,2523,13.24
1,3.854909,24588,49.132
2,3.363666,12132,15.428
3,3.421458,28265,45.794
4,3.423489,5509,18.411
...,...,...,...
61852,3.166667,3,5.622
61853,2.500000,2,3.712
61854,2.500000,3,20.536
61855,4.000000,1,9.107


In [300]:
cols_to_impute.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   avg_rating    58520 non-null  float64
 1   rating_count  61857 non-null  Int64  
 2   pop_score     61735 non-null  object 
dtypes: Int64(1), float64(1), object(1)
memory usage: 1.5+ MB


In [301]:
cols_to_impute = cols_to_impute.copy()
cols_to_impute = cols_to_impute.replace("NaN.", np.nan)
cols_to_impute = cols_to_impute.astype(float)

In [302]:
cols_to_impute.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   avg_rating    58520 non-null  float64
 1   rating_count  61857 non-null  float64
 2   pop_score     61044 non-null  float64
dtypes: float64(3)
memory usage: 1.4 MB


In [303]:
imputed_data = imputer.fit_transform(cols_to_impute)
imputed_data

Unnamed: 0,avg_rating,rating_count,pop_score
0,2.853547,2523.0,13.240
1,3.854909,24588.0,49.132
2,3.363666,12132.0,15.428
3,3.421458,28265.0,45.794
4,3.423489,5509.0,18.411
...,...,...,...
61852,3.166667,3.0,5.622
61853,2.500000,2.0,3.712
61854,2.500000,3.0,20.536
61855,4.000000,1.0,9.107


In [304]:
clean_4 = clean_4.drop(columns=cols_to_impute.columns)
clean_4[cols_to_impute.columns] = imputed_data
clean_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 32 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          61857 non-null  Int64  
 1   title            61857 non-null  object 
 2   years_A          61857 non-null  float64
 3   years_B          61857 non-null  float64
 4   years_C          61857 non-null  float64
 5   years_D          61857 non-null  float64
 6   years_E          61857 non-null  float64
 7   gen_action       61857 non-null  int64  
 8   gen_adventure    61857 non-null  int64  
 9   gen_animation    61857 non-null  int64  
 10  gen_children     61857 non-null  int64  
 11  gen_comedy       61857 non-null  int64  
 12  gen_crime        61857 non-null  int64  
 13  gen_documentary  61857 non-null  int64  
 14  gen_drama        61857 non-null  int64  
 15  gen_fantasy      61857 non-null  int64  
 16  gen_film_noir    61857 non-null  int64  
 17  gen_horror  

In [305]:
cols_to_scale = clean_4[["avg_rating","rating_count","pop_score"]]
scaler = MinMaxScaler().set_output(transform="pandas")
scaled_data = scaler.fit_transform(cols_to_scale)
scaled_data

Unnamed: 0,avg_rating,rating_count,pop_score
0,0.523011,0.030960,0.034768
1,0.745535,0.301727,0.129021
2,0.636370,0.148875,0.040514
3,0.649213,0.346848,0.120255
4,0.649664,0.067603,0.048347
...,...,...,...
61852,0.592593,0.000037,0.014763
61853,0.444444,0.000025,0.009748
61854,0.444444,0.000037,0.053928
61855,0.777778,0.000012,0.023915


In [306]:
clean_4 = clean_4.drop(columns=cols_to_scale.columns)
clean_4[cols_to_scale.columns] = scaled_data
clean_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61857 entries, 0 to 61856
Data columns (total 32 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          61857 non-null  Int64  
 1   title            61857 non-null  object 
 2   years_A          61857 non-null  float64
 3   years_B          61857 non-null  float64
 4   years_C          61857 non-null  float64
 5   years_D          61857 non-null  float64
 6   years_E          61857 non-null  float64
 7   gen_action       61857 non-null  int64  
 8   gen_adventure    61857 non-null  int64  
 9   gen_animation    61857 non-null  int64  
 10  gen_children     61857 non-null  int64  
 11  gen_comedy       61857 non-null  int64  
 12  gen_crime        61857 non-null  int64  
 13  gen_documentary  61857 non-null  int64  
 14  gen_drama        61857 non-null  int64  
 15  gen_fantasy      61857 non-null  int64  
 16  gen_film_noir    61857 non-null  int64  
 17  gen_horror  

In [307]:
clean_4[cols_to_scale.columns]

Unnamed: 0,avg_rating,rating_count,pop_score
0,0.523011,0.030960,0.034768
1,0.745535,0.301727,0.129021
2,0.636370,0.148875,0.040514
3,0.649213,0.346848,0.120255
4,0.649664,0.067603,0.048347
...,...,...,...
61852,0.592593,0.000037,0.014763
61853,0.444444,0.000025,0.009748
61854,0.444444,0.000037,0.053928
61855,0.777778,0.000012,0.023915


#### Final movies data frame cleaned

In [308]:
movies_cleand_df = clean_4.copy()

In [309]:
movies_cleand_df.to_csv("movies_cleaned.csv")

### Ratings data frame

In [310]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,104575,1204,5.0,1135287920
1,43840,969,5.0,974948098
2,47269,1247,5.0,1052089349
3,155815,59387,5.0,1570898833
4,157275,1589,5.0,1458600812


In [311]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     Int64  
 1   movieId    Int64  
 2   rating     float64
 3   timestamp  Int64  
dtypes: Int64(3), float64(1)
memory usage: 834.5 MB


In [312]:
# Count the missing values per column
ratings_df.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [313]:
ratings_df_no_time = ratings_df.drop(columns=["timestamp"])
ratings_df_no_time.head()

Unnamed: 0,userId,movieId,rating
0,104575,1204,5.0
1,43840,969,5.0
2,47269,1247,5.0
3,155815,59387,5.0
4,157275,1589,5.0


#### Removing the rating column from the ratings df because new users won't have ratings

In [314]:
cleaned_ratings_df = ratings_df_no_time.drop(columns=["rating"])
cleaned_ratings_df.head()

Unnamed: 0,userId,movieId
0,104575,1204
1,43840,969
2,47269,1247
3,155815,59387
4,157275,1589
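
Step 1 of the goal keeps only good reviews (>= 4 stars). A minimal sketch of that filter, applied before the rating column is dropped, could look like this. The `ratings` frame is hypothetical toy data.

```python
import pandas as pd

# Hypothetical ratings: one good, one bad, one exactly on the threshold.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 11, 10],
    "rating": [5.0, 2.5, 4.0],
})

# Keep the liked movies only, then drop the rating itself.
good = ratings.loc[ratings["rating"] >= 4, ["userId", "movieId"]]
good = good.reset_index(drop=True)
print(good)
```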


### Joining movies and ratings

In [315]:
df_complete = cleaned_ratings_df.merge(movies_cleand_df, on="movieId", how="left")
df_complete.head()

Unnamed: 0,userId,movieId,title,years_A,years_B,years_C,years_D,years_E,gen_action,gen_adventure,...,gen_sci_fi,gen_thriller,gen_war,gen_western,len_lon,len_medium,len_short,avg_rating,rating_count,pop_score
0,104575,1204,Lawrence of Arabia,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.808008,0.16165,0.0919
1,43840,969,"African Queen, The",0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.792414,0.126137,0.054713
2,47269,1247,"Graduate, The",0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.78478,0.229768,0.088302
3,155815,59387,"Fall, The",0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.75997,0.031316,0.059308
4,157275,1589,Cop Land,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.61733,0.059086,0.048768
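
A left join keeps every rating, including ratings for movies that were removed during cleaning (for example, movies without a year). Those rows get NaN features, which would explain the `gen_` columns appearing as float64 after the merge. The sketch below counts such rows with `indicator=True` on hypothetical toy frames.

```python
import pandas as pd

# Hypothetical data: movieId 99 was rated but is missing from the movies table.
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 11, 99]})
movies = pd.DataFrame({"movieId": [10, 11], "gen_drama": [1, 0]})

merged = ratings.merge(
    movies,
    on="movieId",
    how="left",
    validate="many_to_one",  # fails if movieId is duplicated in movies
    indicator=True,          # adds a _merge column describing each row's origin
)
unmatched = int((merged["_merge"] == "left_only").sum())
print(unmatched, merged["gen_drama"].dtype)
```

Switching to `how="inner"` would drop those ratings instead of carrying NaN rows into the user summary.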


In [316]:
df_complete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 33 columns):
 #   Column           Dtype  
---  ------           -----  
 0   userId           Int64  
 1   movieId          Int64  
 2   title            object 
 3   years_A          float64
 4   years_B          float64
 5   years_C          float64
 6   years_D          float64
 7   years_E          float64
 8   gen_action       float64
 9   gen_adventure    float64
 10  gen_animation    float64
 11  gen_children     float64
 12  gen_comedy       float64
 13  gen_crime        float64
 14  gen_documentary  float64
 15  gen_drama        float64
 16  gen_fantasy      float64
 17  gen_film_noir    float64
 18  gen_horror       float64
 19  gen_imax         float64
 20  gen_musical      float64
 21  gen_mystery      float64
 22  gen_romance      float64
 23  gen_sci_fi       float64
 24  gen_thriller     float64
 25  gen_war          float64
 26  gen_western      float64
 27  len_lon   

## User "Taste" Summary

In [317]:
df_no_movie_id_and_title = df_complete.drop(columns=["movieId","title"])
df_no_movie_id_and_title.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 31 columns):
 #   Column           Dtype  
---  ------           -----  
 0   userId           Int64  
 1   years_A          float64
 2   years_B          float64
 3   years_C          float64
 4   years_D          float64
 5   years_E          float64
 6   gen_action       float64
 7   gen_adventure    float64
 8   gen_animation    float64
 9   gen_children     float64
 10  gen_comedy       float64
 11  gen_crime        float64
 12  gen_documentary  float64
 13  gen_drama        float64
 14  gen_fantasy      float64
 15  gen_film_noir    float64
 16  gen_horror       float64
 17  gen_imax         float64
 18  gen_musical      float64
 19  gen_mystery      float64
 20  gen_romance      float64
 21  gen_sci_fi       float64
 22  gen_thriller     float64
 23  gen_war          float64
 24  gen_western      float64
 25  len_lon          float64
 26  len_medium       float64
 27  len_short 

In [318]:
cols_to_summarize = df_no_movie_id_and_title.columns.difference(['userId'])

# Group by 'userId' and average all other columns
user_summary = df_no_movie_id_and_title.groupby('userId')[cols_to_summarize].mean().reset_index()

In [319]:
user_summary.head()

Unnamed: 0,userId,avg_rating,gen_action,gen_adventure,gen_animation,gen_children,gen_comedy,gen_crime,gen_documentary,gen_drama,...,len_lon,len_medium,len_short,pop_score,rating_count,years_A,years_B,years_C,years_D,years_E
0,1,0.734406,0.057143,0.157143,0.028571,0.042857,0.328571,0.114286,0.014286,0.757143,...,0.528571,0.414286,0.057143,0.069996,0.105391,0.0,0.014286,0.2,0.242857,0.542857
1,2,0.701244,0.358696,0.407609,0.092391,0.13587,0.342391,0.097826,0.0,0.494565,...,0.722826,0.244565,0.032609,0.10437,0.23544,0.0,0.0,0.027174,0.277174,0.695652
2,3,0.656232,0.508397,0.30229,0.076336,0.073282,0.268702,0.201527,0.00458,0.354198,...,0.708397,0.268702,0.021374,0.114227,0.133151,0.0,0.0,0.006107,0.056489,0.937405
3,4,0.697993,0.599174,0.471074,0.128099,0.115702,0.334711,0.152893,0.020661,0.202479,...,0.681818,0.264463,0.041322,0.140197,0.171947,0.0,0.0,0.004132,0.132231,0.863636
4,5,0.662441,0.178218,0.207921,0.039604,0.089109,0.485149,0.138614,0.0,0.445545,...,0.613861,0.356436,0.029703,0.080413,0.266468,0.0,0.0,0.009901,0.188119,0.80198
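
Because the genre, year, and length columns are 0/1 indicators, averaging them per user gives the share of that user's rated movies carrying each feature. The following is a minimal sketch of that idea on made-up data; the `toy` frame and its values are assumptions for illustration, not rows from the real dataset.

```python
import pandas as pd

# Toy stand-in for df_no_movie_id_and_title: one row per (user, rated movie).
# All values here are invented for illustration.
toy = pd.DataFrame({
    "userId":     [1, 1, 1, 1, 2, 2],
    "gen_comedy": [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "gen_drama":  [1.0, 1.0, 1.0, 0.0, 1.0, 0.0],
})

# Same pattern as the notebook cell: every column except userId is averaged.
feature_cols = toy.columns.difference(["userId"])
toy_summary = toy.groupby("userId")[feature_cols].mean().reset_index()

# User 1 rated 4 movies: 2 comedies (0.5) and 3 dramas (0.75).
print(toy_summary)
```

Read this way, a value such as 0.757143 in `gen_drama` for user 1 in the output above would mean that roughly three quarters of that user's rated movies are dramas.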


In [320]:
user_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162541 entries, 0 to 162540
Data columns (total 31 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   userId           162541 non-null  Int64  
 1   avg_rating       162541 non-null  float64
 2   gen_action       162541 non-null  float64
 3   gen_adventure    162541 non-null  float64
 4   gen_animation    162541 non-null  float64
 5   gen_children     162541 non-null  float64
 6   gen_comedy       162541 non-null  float64
 7   gen_crime        162541 non-null  float64
 8   gen_documentary  162541 non-null  float64
 9   gen_drama        162541 non-null  float64
 10  gen_fantasy      162541 non-null  float64
 11  gen_film_noir    162541 non-null  float64
 12  gen_horror       162541 non-null  float64
 13  gen_imax         162541 non-null  float64
 14  gen_musical      162541 non-null  float64
 15  gen_mystery      162541 non-null  float64
 16  gen_romance      162541 non-null  floa

In [321]:
user_summary.to_csv("users_avg_summarized.csv")

## Getting the preprocessed data

#### Creating a function to grab all the preprocessed movies
Now that the data is preprocessed, we uploaded movies_cleaned_df to BQ. This lets us simulate how we will fetch the movies in production.

#### Creating a function to grab preprocessed movies by MovieId

In [322]:
def get_preprocessed_movies_by_id(ids):
    ids_str = ','.join(map(str, ids))
    
    movies_preprocessed_query = f"""
        SELECT *
        FROM `le-wagon-march2024.mares_movies.movies_preprocessed`
        WHERE movieId IN ({ids_str})
    """
    
    # Execute on BQ
    client = bigquery.Client()
    preprocessed_job = client.query(movies_preprocessed_query)
    preprocessed = preprocessed_job.result().to_dataframe()
    preprocessed = preprocessed[sorted(preprocessed.columns)]

    return preprocessed
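
The function above interpolates `ids` straight into the SQL text, so an empty list would produce an invalid `IN ()` clause and a non-integer value would end up inside the query. One possible guard, sketched below, is to validate the IDs before joining them. `build_id_list` is a hypothetical helper that is not part of the notebook.

```python
import operator


def build_id_list(ids):
    """Hypothetical helper: return a comma-separated string of integer IDs.

    Raises ValueError for an empty input and TypeError for anything that is
    not an integer, so only plain integers reach the SQL text.
    """
    # operator.index accepts Python and NumPy integers but rejects
    # strings and floats with a TypeError.
    clean = [operator.index(raw) for raw in ids]
    if not clean:
        raise ValueError("ids must contain at least one id")
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return ",".join(str(i) for i in dict.fromkeys(clean))


print(build_id_list([1, 2, 2, 1721]))
```

The returned string could then replace `ids_str` in the query template. Deduplicating does not change the rows an `IN` clause matches; it only shortens the query.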


In [323]:
get_preprocessed_movies_by_id([1, 2])
    

Unnamed: 0,avg_rating,gen_action,gen_adventure,gen_animation,gen_children,gen_comedy,gen_crime,gen_documentary,gen_drama,gen_fantasy,...,len_short,movieId,pop_score,rating_count,title,years_A,years_B,years_C,years_D,years_E
0,0.61145,0,1,0,1,0,0,0,0,1,...,0.0,2,0.043715,0.297309,Jumanji,0.0,0.0,0.0,0.0,1.0
1,0.754157,0,1,1,1,1,0,0,0,1,...,1.0,1,0.312758,0.703256,Toy Story,0.0,0.0,0.0,0.0,1.0


#### Getting all the users' interests (training data)

In [324]:
def get_users_interests():
    users_interests_query = """
        SELECT *
        FROM `le-wagon-march2024.mares_movies.users_interests`;
        """
    
    # Execute on BQ
    client = bigquery.Client()
    users_interests_job = client.query(users_interests_query)
    users_interests = users_interests_job.result().to_dataframe()
    # Sort the columns so the order is stable across calls
    users_interests = users_interests[sorted(users_interests.columns)]
    return users_interests


In [333]:
users_interests = get_users_interests()
users_interests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162541 entries, 0 to 162540
Data columns (total 31 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   avg_rating       162541 non-null  float64
 1   gen_action       162541 non-null  float64
 2   gen_adventure    162541 non-null  float64
 3   gen_animation    162541 non-null  float64
 4   gen_children     162541 non-null  float64
 5   gen_comedy       162541 non-null  float64
 6   gen_crime        162541 non-null  float64
 7   gen_documentary  162541 non-null  float64
 8   gen_drama        162541 non-null  float64
 9   gen_fantasy      162541 non-null  float64
 10  gen_film_noir    162541 non-null  float64
 11  gen_horror       162541 non-null  float64
 12  gen_imax         162541 non-null  float64
 13  gen_musical      162541 non-null  float64
 14  gen_mystery      162541 non-null  float64
 15  gen_romance      162541 non-null  float64
 16  gen_sci_fi       162541 non-null  floa

#### User Interests pipeline

In [351]:
def get_user_interest_vector(favorite_movies_id):
    # Fetch the preprocessed feature rows for the user's favorite movies
    preproc_favorite_movies = get_preprocessed_movies_by_id(favorite_movies_id)
    # Keep only the feature columns, in the same sorted order as the training data
    preproc_favorite_movies = preproc_favorite_movies.drop(columns=["movieId","title"])
    preproc_favorite_movies = preproc_favorite_movies[sorted(preproc_favorite_movies.columns)]
    # Average the favorites into a single-row DataFrame (the user's "taste" vector)
    return preproc_favorite_movies.mean().to_frame().T
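
The interest vector is only comparable to the training data if its columns appear in the same order as the matrix the model was fitted on. Below is a small offline sketch of the averaging step with an explicit column alignment. The `favorites` frame, `train_columns`, and `interest_vector` are illustrative assumptions and do not touch BigQuery.

```python
import pandas as pd

# Hypothetical column order of the training matrix (a tiny subset).
train_columns = ["avg_rating", "gen_drama", "gen_romance"]

# Toy stand-in for the rows returned by get_preprocessed_movies_by_id.
favorites = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A", "Movie B"],
    "gen_romance": [1, 0],
    "gen_drama": [1, 1],
    "avg_rating": [0.8, 0.6],
})


def interest_vector(movies, columns):
    """Average the feature rows and return one row in the given column order."""
    features = movies.drop(columns=["movieId", "title"])
    vector = features.mean().to_frame().T
    # Selecting by name raises a KeyError if a training column is missing,
    # instead of silently producing a misaligned vector.
    return vector[columns]


vec = interest_vector(favorites, train_columns)
print(vec)
```

Selecting the columns by name, rather than relying on both sides being sorted, makes a schema mismatch fail early.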

#### Getting ratings by user id

In [354]:
def get_ratings_by_users_id(ids):
    ids_str = ','.join(map(str, ids))
    
    ratings_by_user_query = f"""
        SELECT *
        FROM `le-wagon-march2024.mares_movies.ratings`
        WHERE userId IN ({ids_str})
    """
    # Execute on BQ
    client = bigquery.Client()
    ratings_job = client.query(ratings_by_user_query)
    ratings = ratings_job.result().to_dataframe()
    return ratings

## The Model

In [334]:
users_ids = users_interests.userId.values

In [397]:
preproc_favorite_movies = get_preprocessed_movies_by_id([1721])
preproc_favorite_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 32 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   avg_rating       1 non-null      float64
 1   gen_action       1 non-null      Int64  
 2   gen_adventure    1 non-null      Int64  
 3   gen_animation    1 non-null      Int64  
 4   gen_children     1 non-null      Int64  
 5   gen_comedy       1 non-null      Int64  
 6   gen_crime        1 non-null      Int64  
 7   gen_documentary  1 non-null      Int64  
 8   gen_drama        1 non-null      Int64  
 9   gen_fantasy      1 non-null      Int64  
 10  gen_film_noir    1 non-null      Int64  
 11  gen_horror       1 non-null      Int64  
 12  gen_imax         1 non-null      Int64  
 13  gen_musical      1 non-null      Int64  
 14  gen_mystery      1 non-null      Int64  
 15  gen_romance      1 non-null      Int64  
 16  gen_sci_fi       1 non-null      Int64  
 17  gen_thriller     1 n

In [398]:
X = users_interests.drop(columns=["userId"])
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162541 entries, 0 to 162540
Data columns (total 30 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   avg_rating       162541 non-null  float64
 1   gen_action       162541 non-null  float64
 2   gen_adventure    162541 non-null  float64
 3   gen_animation    162541 non-null  float64
 4   gen_children     162541 non-null  float64
 5   gen_comedy       162541 non-null  float64
 6   gen_crime        162541 non-null  float64
 7   gen_documentary  162541 non-null  float64
 8   gen_drama        162541 non-null  float64
 9   gen_fantasy      162541 non-null  float64
 10  gen_film_noir    162541 non-null  float64
 11  gen_horror       162541 non-null  float64
 12  gen_imax         162541 non-null  float64
 13  gen_musical      162541 non-null  float64
 14  gen_mystery      162541 non-null  float64
 15  gen_romance      162541 non-null  float64
 16  gen_sci_fi       162541 non-null  floa

In [394]:
X.to_numpy().shape

(162541, 30)

In [399]:
knn_model = NearestNeighbors(n_neighbors=5)

knn_model.fit(X)

In [400]:
test_interests = get_user_interest_vector([1721])
test_interests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 30 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   avg_rating       1 non-null      float64
 1   gen_action       1 non-null      float64
 2   gen_adventure    1 non-null      float64
 3   gen_animation    1 non-null      float64
 4   gen_children     1 non-null      float64
 5   gen_comedy       1 non-null      float64
 6   gen_crime        1 non-null      float64
 7   gen_documentary  1 non-null      float64
 8   gen_drama        1 non-null      float64
 9   gen_fantasy      1 non-null      float64
 10  gen_film_noir    1 non-null      float64
 11  gen_horror       1 non-null      float64
 12  gen_imax         1 non-null      float64
 13  gen_musical      1 non-null      float64
 14  gen_mystery      1 non-null      float64
 15  gen_romance      1 non-null      float64
 16  gen_sci_fi       1 non-null      float64
 17  gen_thriller     1 n

In [401]:
len(test_interests)

1

In [402]:
distances, indices = knn_model.kneighbors(test_interests, n_neighbors=5)

In [403]:
indices[0]

array([14920, 15104,  2276, 59487, 14003])

In [404]:
# kneighbors returns row positions in X, so select positionally with iloc
similar_users = users_interests.iloc[indices[0]].userId.tolist()
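
To make the lookup step concrete, here is a minimal sketch of the same flow on four invented users: fit `NearestNeighbors`, query with a new taste vector, and map the returned row positions back to user IDs. The data and the names `toy_users`, `toy_knn`, and `neighbour_ids` are assumptions for illustration. It relies on scikit-learn's default metric, which is Euclidean distance.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Invented taste vectors: two action fans and two romance fans.
toy_users = pd.DataFrame({
    "userId":      [101, 102, 103, 104],
    "gen_action":  [0.9, 0.8, 0.1, 0.0],
    "gen_romance": [0.1, 0.2, 0.9, 1.0],
})
X_toy = toy_users.drop(columns=["userId"])

toy_knn = NearestNeighbors(n_neighbors=2).fit(X_toy)

# A new user who only likes romance; columns match the fitted data.
query = pd.DataFrame({"gen_action": [0.0], "gen_romance": [0.95]})
toy_distances, toy_indices = toy_knn.kneighbors(query)

# toy_indices holds row positions in X_toy, nearest first,
# so iloc is used to recover the user IDs.
neighbour_ids = toy_users.iloc[toy_indices[0]]["userId"].tolist()
print(neighbour_ids)
```

Here the two romance fans (104, then 103) come back as the nearest neighbours, with distances sorted in ascending order.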

In [405]:
ratings_from_similar_users = get_ratings_by_users_id(similar_users)
ratings_from_similar_users

Unnamed: 0,userId,movieId,rating,timestamp
0,126459,912,5.0,974712290
1,62667,3993,5.0,1286565757
2,1500,6953,4.0,1554781617
3,117704,90405,4.5,1524756012
4,119373,3260,5.0,1537666637
...,...,...,...,...
152,117704,84374,4.5,1524756127
153,117704,103539,4.5,1524757545
154,117704,127202,4.5,1524757521
155,1500,116797,5.0,1554780156


In [406]:
# Note: the strict > 4 keeps only 4.5 and 5.0 ratings, while the stated goal is >= 4 stars
good_ratings_from_similar_users = ratings_from_similar_users[ratings_from_similar_users.rating > 4]
good_ratings_from_similar_users

Unnamed: 0,userId,movieId,rating,timestamp
0,126459,912,5.0,974712290
1,62667,3993,5.0,1286565757
3,117704,90405,4.5,1524756012
4,119373,3260,5.0,1537666637
5,117704,108190,4.5,1524755452
...,...,...,...,...
152,117704,84374,4.5,1524756127
153,117704,103539,4.5,1524757545
154,117704,127202,4.5,1524757521
155,1500,116797,5.0,1554780156


In [407]:
recommendations_ids = good_ratings_from_similar_users.movieId.tolist()


In [408]:
recommendations = get_preprocessed_movies_by_id(recommendations_ids)

In [409]:
recommendations.title[0:20]

0            Dances with Wolves
1            North by Northwest
2            Gone with the Wind
3               Say Anything...
4                    Casablanca
5        Affair to Remember, An
6                  Pretty Woman
7         Sense and Sensibility
8     Shawshank Redemption, The
9                  Forrest Gump
10      Remains of the Day, The
11             Schindler's List
12                         Emma
13            Good Will Hunting
14                      Titanic
15          Shakespeare in Love
16             Playing by Heart
17                 Notting Hill
18                  Howards End
19                       Quills
Name: title, dtype: object
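
The list above includes Titanic, which appears to be the favorite passed in (movieId 1721), and the titles are not ordered by how many similar users liked them. One possible refinement, sketched below on invented ratings, is to drop the user's own favorites and rank the remaining movies by the number of distinct similar users who rated them highly. `rank_recommendations`, `toy_ratings`, and the `min_rating=4.0` default (taken from the ">= 4 stars" goal, not from the `> 4` filter used in the cell above) are assumptions.

```python
import pandas as pd

# Invented ratings from three "similar" users.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [1721, 912, 912, 356, 356, 912],
    "rating":  [5.0, 4.5, 5.0, 4.0, 3.0, 4.5],
})
favorite_ids = [1721]


def rank_recommendations(ratings, favorites, min_rating=4.0):
    """Return movie IDs liked by similar users, most widely liked first."""
    liked = ratings[ratings["rating"] >= min_rating]
    # Do not recommend movies the user already listed as favorites.
    liked = liked[~liked["movieId"].isin(favorites)]
    # Count how many distinct similar users liked each movie.
    counts = liked.groupby("movieId")["userId"].nunique()
    return counts.sort_values(ascending=False).index.tolist()


ranked_ids = rank_recommendations(toy_ratings, favorite_ids)
print(ranked_ids)
```

The ranked IDs could then be passed to `get_preprocessed_movies_by_id` to recover titles, as in the cells above.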