<img src="images/StreamYou.png" alt="example" style="width:1500px; height:500px;">

# StreamYou: Movie Recommendation Engine
* Author: Angela Loyola
* Student pace: Self-Paced
* Instructor name: Mark Barbour

## Overview & Business Problem


Consumers today have thousands of movies available to them for viewing at any given moment om many different streaming platforms. While access to so much content is what drives many users to have subscriptions to multiple platforms, the sheer volumne can be overwhelming at the same time. 

The massive movie catalogs leave consumers grappling with the challenge of choice as they start their scrolling journey. While the vast selection caters to diverse tastes, the experience of navigating through countless titles can be time consuming and overwhelming. The more time users spend sifting through choices, the higher the likelihood of encountering decision fatigue, frustration, and, ultimately, moving away from using the platform. 

StreamYou is seeking to address this issue as they've observed increased scrolling time on their platform reducing their conversion rates (when users select and watch a movie). For the context of this project, I am a data scientist working on optimizing the recommendation engine at StreamYou to address the following concerns: 

<b> 1. Reduced Satisfaction: </b> Users struggle to find relevant movies, leading to decreased satisfaction with the platform.

<b> 2. Lower Conversion Rates: </b> Lengthy scrolling sessions resulting in users leaving the platform before making a selection, impacting conversion rates.

<b> 3. Lack of Genre Specific Recommendations: </b> Users don't currently have a way to express a genre preference to explicitly choose and receive recommendations based on their preferred movie genre. 


## Data Understanding and Exploration

In [1]:
### Importing all packages and libaries 
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import json  
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
from surprise import Dataset, Reader, accuracy, NormalPredictor, KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline, SVD, BaselineOnly, SVDpp, NMF, SlopeOne, CoClustering
from surprise.accuracy import rmse
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise.prediction_algorithms import SVD, SVDpp, NMF, BaselineOnly, NormalPredictor, knns

In [3]:
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from plotly.io import to_image
init_notebook_mode(connected=True)

<b> Load and Preview Data </b>

Data used in this analysis: 
1. Movie DataFrame includes unique Movie Ids, Movie Title and Genre information 
2. Rating Dataframe includes unique User Ids, Movie Ids, Rating and Timestamp information 

The data for this project was already very clean and did not require any manipulation. There were no null values and all columns were in their expected data type objects. 

Every user in the rating dataframe has reviewed at least 20 movies and each movie has been reviewed at least once. Rating are on a scale of 1-5 with decimal values available as well. 

In [4]:
movie_df = pd.read_csv('data/movies.csv')

In [5]:
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
rating_df = pd.read_csv('data/ratings.csv')

In [7]:
rating_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [9]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
rating_df.userId.value_counts() ##Every User has at least 20 reviews

414    2698
599    2478
474    2108
448    1864
274    1346
       ... 
406      20
595      20
569      20
431      20
442      20
Name: userId, Length: 610, dtype: int64

In [11]:
rating_df.movieId.value_counts() ##Every Movie has at least 1 review

356       329
318       317
296       307
593       279
2571      278
         ... 
5986        1
100304      1
34800       1
83976       1
8196        1
Name: movieId, Length: 9724, dtype: int64

In [12]:
rating_df['rating'].unique()

array([4. , 5. , 3. , 2. , 1. , 4.5, 3.5, 2.5, 0.5, 1.5])

<b> Visualize Distribution of Data </b> 

Over 80% of the movies in the data set have a rating above 3. This shows a skew in the data to favor higher ratings and this is something to take into consideration when looking at the results. 

In [13]:
data = rating_df['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x=data.index,
               text=['{:.1f} %'.format(val) for val in (data.values / rating_df.shape[0] * 100)],
               textposition='auto',
               textfont=dict(color='#000000'),
               y=data.values,
               )

# Create layout
layout = dict(title='Distribution of Movie Ratings'.format(rating_df.shape[0]),
              xaxis=dict(title='Rating', tickvals=list(data.index), ticktext=['{:.1f}'.format(val) for val in list(data.index)]),
              yaxis=dict(title='Count'),)

# Create plot
fig = go.Figure(data=[trace], layout=layout)

In [14]:
fig

## Creating User and Item ID Dataframe

In this section, I dropped timestamp as it's not needed for run a recommendation engine and dropped any duplicate values in case a user watched a movie twice and provided the same rating. 

In [15]:
data = rating_df[['userId', 'movieId', 'rating']]

In [16]:
data = data.drop_duplicates(['userId', 'movieId'])

In [17]:
data.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


## Surprise Set Up

In setting up surprise to run the different recommendation models, I identified there were less users in my training set than movies. For computional time, we'll user user to user similarity instead of item because of this. This choice will also help when there are new users on the platform with limited rating history. 

In [18]:
reader = Reader(rating_scale=(1, 5))
surprise_data = Dataset.load_from_df(data, reader)
trainset, testset = train_test_split(surprise_data, test_size=0.2, random_state=123)

In [19]:
surprise_data

<surprise.dataset.DatasetAutoFolds at 0x2337169bd00>

In [20]:
print('Number of users: ', trainset.n_users, '\n')
print('Number of movies: ', trainset.n_items, '\n')

Number of users:  610 

Number of movies:  8974 



In [21]:
print('Type trainset :',type(trainset),'\n')
print('Type testset :',type(testset))

Type trainset : <class 'surprise.trainset.Trainset'> 

Type testset : <class 'list'>


## User to User using Cosine Similarity

Running a a KNN Basic model as my baseline model to have a comparison as I try different approaches. 

I will be using Root Mean Squared Error to compare model performance in each approach. I chose RMSE as it will provide a measure of how well the model's predictions align with the actual observed values and predicts movie ratings. 

In [22]:
sim_cos = {"name": "cosine", "user_based": True}

In [23]:
basic = knns.KNNBasic(sim_options=sim_cos)
basic.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x23371410eb0>

In [24]:
predictions = basic.test(testset)

In [25]:
print(accuracy.rmse(predictions))

RMSE: 0.9692
0.969223803904209


## Testing Pearson Correlation

I ran the same baseline model, but using Pearson correlation to see which metric best aligns with the characteristics of my data as Cosine performs better with non-linear relationships and Pearson handles linear patterns better. 

Both models performend similarly, but the KNN Basic model using Cosine similarity performed slightly better with an RMSE of .96 compared to .97 using Pearson correltion. Going forward in other models, I will continue to use Cosine similarity. 

In [26]:
sim_pearson = {"name": "pearson", "user_based": True}
basic_pearson = knns.KNNBasic(sim_options=sim_pearson)
basic_pearson.fit(trainset)
predictions = basic_pearson.test(testset)
print(accuracy.rmse(predictions))

Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.9708
0.9708314484587909


## KNN with Means

The next model choice was KNN with Means with Cosine Similarity to improve upon the Baseline model by: 

<b> Normalizing User Ratings: </b> This well help the model account for users that consistnetly rate items higher or lower than the average. By normalizing the ratings, the model can focus on user preferences relative to their own tendencies. 

<b> Addressing Skewed Distributions: </b> As discussed in the exploratory section, the data is skewed showing most ratings higher than 3. This model helps address this and show a more balanced representation of user preference.

The RMSE was .89 which shows it performed better than the baseline model with an RMSE of .97. 

In [27]:
sim_pearson = {"name": "cosine", "user_based": True}
knn_means = knns.KNNWithMeans(sim_options=sim_pearson)
knn_means.fit(trainset)
predictions = knn_means.test(testset)
print(accuracy.rmse(predictions))

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8961
0.8961015085170626


## KNN Baseline

The next model choice was KNN with Baseline to test normalization in a different way. While KNN with means (the previous model) uses the mean to address individual user bias, KNN Baseline encompases both user and item biases. This approach will consider the user baseline predictions and the popularity or unpopularity of items. 

The RMSE was .87 which shows a slight improvement from the RMSE of KNN with Means that was .89. 

In [28]:
sim_pearson = {"name": "cosine", "user_based": True}
knn_baseline = knns.KNNBaseline(sim_options=sim_pearson)
knn_baseline.fit(trainset)
predictions = knn_baseline.test(testset)
print(accuracy.rmse(predictions))

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8738
0.8738196364118903


## Grid Search for SVD Model

The next model leveraged Grid Search to find the optimal parameters for using Singluar Value Decompostion. By using grid search, I wanted to identify the set of hyperparameters that would optimize model performance and RMSE. 

The best parameters found by grid search for RMSE were: 
- n_factors: 20
- n_epochs: 10
- lr_all: 0.005
- reg_all: 0.4

Overall, the model performend better than KNN Basic and KNN with Means. While model performed closely to the KNN Baseline model, the RMSE was still slightly higher at .89 compared to .87. This was not surprising as SVD models assume linear relationships and the data has shown to have non-linear relationships, as identified when testing between Cosine and Pearson similarity metrics. 

In [29]:
from surprise.prediction_algorithms import SVD
from surprise.model_selection import GridSearchCV

param_grid = {'n_factors':[20, 100],'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs_model = GridSearchCV(SVD,param_grid=param_grid,n_jobs = -1,joblib_verbose=5)
gs_model.fit(surprise_data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   27.8s finished


In [30]:
# Print the best parameters found by the grid search
print("Best parameters found by grid search:", gs_model.best_params)

# Print the best cross-validated score
print("Best cross-validated score:", gs_model.best_score)

Best parameters found by grid search: {'rmse': {'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, 'mae': {'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}}
Best cross-validated score: {'rmse': 0.8902740370312057, 'mae': 0.6886888284841641}


In [31]:
svd = SVD(n_factors=20, n_epochs=10, lr_all= 0.005, reg_all=0.4)
svd.fit(trainset)
predictions = svd.test(testset)
print(accuracy.rmse(predictions))

RMSE: 0.8923
0.8923464348205502


## Final Model Choice: KNN Baseline 

The KNN Baseline was chosen as the best model as it had the lowest RMSE in the modeling process: .87. When compared to the .96 RMSE of the KNN Basic model (first model used as baseline), this is almost a full point drop which is a significant improvement on the rating scale of 1-5. 

## Predictions using the Best Model: KNN Baseline

This section runs the full model and leverages a function where a user can input their ID number and the number of recommmendations they want. The function will output the Movie titles and Genres as well as the predicted rating. 

In [32]:
trainset_full = surprise_data.build_full_trainset()

In [33]:
trainset_full

<surprise.trainset.Trainset at 0x2335d4061c0>

In [34]:
sim_pearson = {"name": "cosine", "user_based": True}
best_model = knns.KNNBaseline(sim_options=sim_pearson)
best_model.fit(trainset_full)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x23379b1c910>

In [35]:
def viewer_recommendations():
    viewer = int(input("User ID: "))
    n_recs = int(input("How many recommendations would you like? "))
    already_reviewed = [rating_df.loc[viewer, "movieId"]]
    not_reviewed = movie_df.copy()
    not_reviewed = not_reviewed[not_reviewed.movieId.isin(already_reviewed) == False]
    not_reviewed.reset_index(inplace=True)
    not_reviewed["predicted_rating"] = not_reviewed["movieId"].apply(lambda x: best_model.predict(viewer, x).est)
    not_reviewed.sort_values(by="predicted_rating", ascending=False, inplace=True)
    not_reviewed = not_reviewed[['movieId','title', 'genres', 'predicted_rating']].head(n_recs)
    return not_reviewed

In [36]:
viewer_recommendations()

User ID: 414
How many recommendations would you like? 5


Unnamed: 0,movieId,title,genres,predicted_rating
7811,92494,Dylan Moran: Monster (2004),Comedy|Documentary,5.0
5434,25947,Unfaithfully Yours (1948),Comedy,5.0
8106,100556,"Act of Killing, The (2012)",Documentary,5.0
6666,57502,Cat Soup (Nekojiru-so) (2001),Adventure|Animation|Drama|Horror,5.0
2328,3086,Babes in Toyland (1934),Children|Comedy|Fantasy|Musical,5.0


## Add Category Selection to the Recommendations
The address the issue of category specific recommendations outlined in the Business Problem section. The code and function below allows the user to also select a specific Genre they want receommendations for. 

In [37]:
movie_df.genres

0       Adventure|Animation|Children|Comedy|Fantasy
1                        Adventure|Children|Fantasy
2                                    Comedy|Romance
3                              Comedy|Drama|Romance
4                                            Comedy
                           ...                     
9737                Action|Animation|Comedy|Fantasy
9738                       Animation|Comedy|Fantasy
9739                                          Drama
9740                               Action|Animation
9741                                         Comedy
Name: genres, Length: 9742, dtype: object

In [38]:
movie_df['genres_list'] = movie_df['genres'].str.split('|')

In [39]:
movie_df['genres_list']

0       [Adventure, Animation, Children, Comedy, Fantasy]
1                          [Adventure, Children, Fantasy]
2                                       [Comedy, Romance]
3                                [Comedy, Drama, Romance]
4                                                [Comedy]
                              ...                        
9737                 [Action, Animation, Comedy, Fantasy]
9738                         [Animation, Comedy, Fantasy]
9739                                              [Drama]
9740                                  [Action, Animation]
9741                                             [Comedy]
Name: genres_list, Length: 9742, dtype: object

In [40]:
subcategories = []
for row in movie_df["genres_list"]:
    for genre in row:
        value = genre.lower()
        if value not in subcategories:
            subcategories.append(value)

In [41]:
def viewer_recommendations_category():
    viewer = int(input("User ID: "))
    n_recs = int(input("How many recommendations would you like? "))
    available_genres = [subcategory.capitalize() for subcategory in subcategories if subcategory != "(no genres listed)"]
    #Show available Genres
    print("Available Genres:")
    for i, genre in enumerate(available_genres, start=1):
        print(f"{i}. {genre}")
    genre_choice = int(input("Enter the number corresponding to the movie genre you want to watch: "))
    # Validate user input
    if 1 <= genre_choice <= len(available_genres):
        request_genre = subcategories[genre_choice - 1]
    else:
        print("Invalid genre choice. Defaulting to 'Adventure'.")
        request_genre = subcategories[0]
    already_reviewed = [rating_df.loc[viewer, "movieId"]] 
    not_reviewed = movie_df.copy()
    not_reviewed = not_reviewed[~not_reviewed.movieId.isin(already_reviewed)]
    not_reviewed.reset_index(inplace=True)
    not_reviewed["predicted_rating"] = not_reviewed["movieId"].apply(lambda x: best_model.predict(viewer, x).est)
    # Filter by requested genre
    not_reviewed = not_reviewed[not_reviewed["genres"].str.contains(request_genre, case=False)]
    not_reviewed.sort_values(by="predicted_rating", ascending=False, inplace=True)
    not_reviewed = not_reviewed[['movieId', 'title', 'genres', 'predicted_rating']].head(n_recs)
    return not_reviewed

In [42]:
viewer_recommendations_category()

User ID: 414
How many recommendations would you like? 5
Available Genres:
1. Adventure
2. Animation
3. Children
4. Comedy
5. Fantasy
6. Romance
7. Drama
8. Action
9. Crime
10. Thriller
11. Horror
12. Mystery
13. Sci-fi
14. War
15. Musical
16. Documentary
17. Imax
18. Western
19. Film-noir
Enter the number corresponding to the movie genre you want to watch: 1


Unnamed: 0,movieId,title,genres,predicted_rating
6666,57502,Cat Soup (Nekojiru-so) (2001),Adventure|Animation|Drama|Horror,5.0
8706,124404,"Snowflake, the White Gorilla (2011)",Adventure|Animation|Children|Comedy,4.957347
7759,91355,Asterix and the Vikings (Astérix et les Viking...,Adventure|Animation|Children|Comedy|Fantasy,4.957347
8725,126921,The Fox and the Hound 2 (2006),Adventure|Animation|Children|Comedy,4.957347
9193,150554,The Love Bug (1997),Adventure|Children|Comedy|Fantasy,4.957347


## Allow Users to Add Additional Ratings

This function was written to allow users to add ratings for other movies. This would expand the current ratings dataframe and allows the recommendation engine to be agile to improvements. 

This would be helpful in cases where a user did not watch a movie on the streaming platform, but they've seen it and want to review it. Additionally, if a user has not seen a movie in a specifc genre, this would help the model capture more information on the user for the genre specific recommendations.  

In [43]:
def movie_rater(user_id, genre=None):
    rating_list = []
    rated_movies = set(rating_df[(rating_df['userId'] == user_id)]['movieId'].values)
    print("Please rate the movies on a scale of 1-5")
    while len(rating_list) < 5:
        if genre:
            movie = movie_df[(~movie_df['movieId'].isin(rated_movies)) & movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df[(~movie_df['movieId'].isin(rated_movies))].sample(1)
        if movie.empty:
            print("No more movies available to rate.")
            break
        movie_title = movie['title'].values[0]
        print(f"Title: {movie_title}")
        while True:
            rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
            if rating.lower() == 'n':
                break
            elif rating.replace('.', '', 1).isdigit() and 1 <= float(rating) <= 5:
                rating_one_movie = {'userId': user_id, 'movieId': movie['movieId'].values[0], 'rating': float(rating)}
                rating_list.append(rating_one_movie)
                rated_movies.add(movie['movieId'].values[0])
                break
            else:
                print("Invalid input. Please enter a number between 1 and 5 (including decimal values).")
    return rating_list

In [45]:
### Example Usage
user_id = 414 
ratings = movie_rater(user_id, genre='Action')

Please rate the movies on a scale of 1-5
Title: World War Z (2013)
How do you rate this movie on a scale of 1-5, press n if you have not seen :
2
Title: War of the Worlds (2005)
How do you rate this movie on a scale of 1-5, press n if you have not seen :
3
Title: Faster (2010)
How do you rate this movie on a scale of 1-5, press n if you have not seen :
4
Title: King Solomon's Mines (1950)
How do you rate this movie on a scale of 1-5, press n if you have not seen :
5
Title: Adventures of Tintin, The (2011)
How do you rate this movie on a scale of 1-5, press n if you have not seen :
6
Invalid input. Please enter a number between 1 and 5 (including decimal values).
How do you rate this movie on a scale of 1-5, press n if you have not seen :
1


In [46]:
ratings

[{'userId': 414, 'movieId': 103249, 'rating': 2.0},
 {'userId': 414, 'movieId': 64997, 'rating': 3.0},
 {'userId': 414, 'movieId': 82744, 'rating': 4.0},
 {'userId': 414, 'movieId': 25962, 'rating': 5.0},
 {'userId': 414, 'movieId': 90746, 'rating': 1.0}]

## Make New Predictions for the User

In this step, the ratings dataframe is updated with the new data and the model is run again to learn from the new ratings added. The function for viewer recommendations by category is updated to use the new model. 

In [47]:
user_ratings = pd.DataFrame(ratings)
rating_df.drop(columns = 'timestamp', inplace =True)
new_ratings_df = pd.concat([rating_df, user_ratings], axis=0)
new_data = Dataset.load_from_df(new_ratings_df,reader)

In [48]:
sim_pearson = {"name": "cosine", "user_based": True}
updated_best_model = knns.KNNBaseline(sim_options=sim_pearson)
updated_best_model.fit(new_data.build_full_trainset())

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x23379b15d60>

In [49]:
def viewer_recommendations_category_updated():
    viewer = int(input("User Id: "))
    n_recs = int(input("How many recommendations would you like? "))
    available_genres = [subcategory.capitalize() for subcategory in subcategories if subcategory != "(no genres listed)"]
    #Show available Genres
    print("Available Genres:")
    for i, genre in enumerate(available_genres, start=1):
        print(f"{i}. {genre}")
    genre_choice = int(input("Enter the number corresponding to the movie genre you want to watch: "))
    # Validate user input
    if 1 <= genre_choice <= len(available_genres):
        request_genre = subcategories[genre_choice - 1]
    else:
        print("Invalid genre choice. Defaulting to 'Adventure'.")
        request_genre = subcategories[0]
    already_reviewed = [rating_df.loc[viewer, "movieId"]] 
    not_reviewed_updated = movie_df.copy()
    not_reviewed_updated = not_reviewed_updated[~not_reviewed_updated.movieId.isin(already_reviewed)]
    not_reviewed_updated.reset_index(inplace=True)
    not_reviewed_updated["predicted_rating"] = not_reviewed_updated["movieId"].apply(lambda x: updated_best_model.predict(viewer, x).est)
    # Filter by requested genre
    not_reviewed_updated = not_reviewed_updated[not_reviewed_updated["genres"].str.contains(request_genre, case=False)]
    not_reviewed_updated.sort_values(by="predicted_rating", ascending=False, inplace=True)
    not_reviewed_updated = not_reviewed_updated[['movieId', 'title', 'genres', 'predicted_rating']].head(n_recs)
    return not_reviewed_updated

In [50]:
viewer_recommendations_category_updated()

User Id: 414
How many recommendations would you like? 2
Available Genres:
1. Adventure
2. Animation
3. Children
4. Comedy
5. Fantasy
6. Romance
7. Drama
8. Action
9. Crime
10. Thriller
11. Horror
12. Mystery
13. Sci-fi
14. War
15. Musical
16. Documentary
17. Imax
18. Western
19. Film-noir
Enter the number corresponding to the movie genre you want to watch: 1


Unnamed: 0,movieId,title,genres,predicted_rating
6666,57502,Cat Soup (Nekojiru-so) (2001),Adventure|Animation|Drama|Horror,5.0
8706,124404,"Snowflake, the White Gorilla (2011)",Adventure|Animation|Children|Comedy,4.956661


## Limitations and Potential Next Steps

One of the limitations of the final model is that it does not address the cold start problem, as in the model needs previous history from the user to build recommendations. This is a result of using user to user similarity which the final model does. 

As a next step on this analysis, it would be necessary to explore model choices that can work with information beyond user history. For example, content-based recommendations or hybrid models that use item features to make recommendations. By doing so, this recommendation model could be used more effectively when new users join the platform. 

## Conclusion

The final model addresses the three concerns outlines in the business problem section. By leveraging this model StreamYou can increase user satisifcation by helping them find relevant movies within their category of choice. Consequently, when users receive better recommendations for the specific category they want to watch, this should have an impact on the conversion rates of the platform as users will need to scroll less to find a movie. 