---

🎬 **Welcome to the Movie Recommendation System Notebook!** 🍿

In this notebook, we'll dive into the exciting world of movie recommendations. Whether you're a movie buff looking for new gems or a data enthusiast exploring recommendation systems, this journey is for you!

🤖 Throughout this notebook, we'll learn how to build a hybrid recommendation system that combines the best of both worlds: non-personalized and collaborative filtering. This approach allows us to provide meaningful recommendations for both new and existing users.

📚 So grab your popcorn, sit back, and let's get started on this cinematic adventure in recommendation systems!

---

## 1. Import

In [1]:
import pandas as pd

In [2]:
rating_data = pd.read_csv('ratings.csv')

rating_data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


- The data contains the following features:

|feature|description|
|:--|:--|
|`userId`|User ID|
|`movieId`|Movie ID|
|`rating`|The rating assigned by user-id x to movie-id i|
|`timestamp`|Timestamp of recording the rating given by the user to the movie|
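
The `timestamp` values look like Unix epoch seconds; this is an assumption based on their magnitude in the preview above. A minimal sketch of converting them to readable datetimes, using a small hand-built frame (`sample` and `rated_at` are illustrative names that are not used elsewhere in this notebook):

```python
import pandas as pd

# Two rows copied from the preview above; `sample` stands in for rating_data
# so that the snippet runs without ratings.csv.
sample = pd.DataFrame({
    'userId': [1, 1],
    'movieId': [1, 3],
    'rating': [4.0, 4.0],
    'timestamp': [964982703, 964981247],
})

# Assumption: timestamps are seconds since 1970-01-01 (UTC)
sample['rated_at'] = pd.to_datetime(sample['timestamp'], unit='s')

print(sample[['movieId', 'rated_at']])
```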

In [3]:
rating_data.shape

(100836, 4)

## 2. Wrangling Data

In [4]:
min_ratings = 5         # keep movieIds that received more than this many ratings
min_user_ratings = 5    # keep userIds that rated more than this many movieIds

In [5]:
# Look for movieIds whose total ratings are > min_ratings
cond_movie = rating_data['movieId'].value_counts() > min_ratings

# Output the filtered movie_id
filtered_movie_id = (
     cond_movie[cond_movie] # 1. filter movieId that meets the conditions above
     .index                 # 2. extract its movieId (its index)
     .tolist()              # 3. then save it in a list
)

# Show results for the first 10 movieIds
filtered_movie_id[:10]

[356, 318, 296, 593, 2571, 260, 480, 110, 589, 527]

In [6]:
# Look for userIds whose total number of ratings is > min_user_ratings
cond_user = rating_data['userId'].value_counts() > min_user_ratings

# Output the filtered user_id
filtered_user_id = (
     cond_user[cond_user] # 1. filter userIds that meet the conditions above
     .index               # 2. extract its userId (its index)
     .tolist()            # 3. then save it in a list
)

# Show results for the first 10 userIds
filtered_user_id[:10]

[414, 599, 474, 448, 274, 610, 68, 380, 606, 288]

In [7]:
# Create the final dataset based on the conditions above
rating_data_final = (
     rating_data[                                         # Keep only the ratings where
         (rating_data['movieId'].isin(filtered_movie_id)) # cond 1: movieId is in filtered_movie_id
         &                                                # AND
         (rating_data['userId'].isin(filtered_user_id))   # cond 2: userId is in filtered_user_id
     ]
)

rating_data_final

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100830,610,166528,4.0,1493879365
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047


Out of roughly 100,000 ratings, about 90,000 remain after filtering.
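
The same filter can also be written with `groupby(...).transform('count')`. The sketch below is one possible alternative, shown on a toy frame (`toy`, `min_count`, and `filter_by_min_count` are hypothetical names). Like the cells above, it counts on the unfiltered data, so a movie could in principle fall below the threshold once some users are removed.

```python
import pandas as pd

def filter_by_min_count(df, min_count):
    """Keep rows whose movieId and userId each appear more than min_count times."""
    # Number of ratings per movie / per user, aligned with the original rows
    movie_count = df.groupby('movieId')['rating'].transform('count')
    user_count = df.groupby('userId')['rating'].transform('count')
    return df[(movie_count > min_count) & (user_count > min_count)]

# Toy data: movie 30 and user 3 each appear only once
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.0, 5.0, 4.0, 2.0, 1.0],
})

filtered = filter_by_min_count(toy, min_count=1)
print(filtered)
```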

## 3. Determine whether the user is new or not

<br>
<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/intro-ai-recommendersystem/5_01.png">
</center>

In [8]:
user_id = 843

In [9]:
# To find out whether the user ID is new or not,
# we check it against all user IDs that have rated a movie
import numpy as np

old_user_id = np.array(list(set(rating_data_final['userId'])))
old_user_id[:50]

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])

In [10]:
# Now, check whether the user ID is new or not
# and store the result in is_old_user
is_old_user = user_id in old_user_id

is_old_user

False

- `user_id = 843` is a new user
- Let's verify this

In [12]:
# We check the activity of user ID 843
rating_data_final[rating_data_final['userId'] == 843]

Unnamed: 0,userId,movieId,rating,timestamp


- No activity, which confirms that this user is new
- Now wrap this check in a function that returns the user ID status

In [13]:
# Function to check whether the user ID entered is new or old
def check_user_id(user_id, rating_data):
    """
    Function to check if the user ID is new or existing.

    This function returns the status as a dictionary.

    Parameters
    ----------
    user_id : int
        User ID

    rating_data : pd.DataFrame
        User's rating activity on movies

    Returns
    -------
    status : dict
        Status of the user ID
    """
    # First, find the set of old_user_ids from the rating_data
    old_user_ids = np.array(list(set(rating_data['userId'])))

    # Next, check if the user id is in old_user_ids
    is_old_user = user_id in old_user_ids

    # Finally, return the status as a dictionary
    status = {
        'userId': user_id,
        'is_old_user': is_old_user
    }

    return status

In [15]:
# Check the status of user_id = 3
check_user_id(user_id = 3,
              rating_data = rating_data_final)

{'userId': 3, 'is_old_user': True}

## 4. Create non-personalized models

- This model recommends the most-rated movies, i.e. those with the highest number of ratings

<br>
<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/intro-ai-recommendersystem/5_02.png">
</center>

In [17]:
# Rank the movies by the number of ratings they received
groupby_movie_id = (
    rating_data_final                           # 1. get rating data
    .groupby(by='movieId')                      # 2. group by movieId
    .agg({'rating': 'count'})                   # 3. aggregate count of ratings for each group
    .sort_values(by='rating', ascending=False)  # 4. then sort by highest to lowest rating count
)

groupby_movie_id

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
356,329
318,317
296,307
593,279
2571,278
...,...
3007,6
2976,6
2971,6
78893,6


In [18]:
# Then, we simply select the top 10 movie IDs
recommend_id = groupby_movie_id.index[:10].to_numpy()

recommend_id

array([ 356,  318,  296,  593, 2571,  260,  480,  110,  589,  527],
      dtype=int64)

In [21]:
def get_non_personalized(n, rating_data):
    """
    Function to output non-personalized recommended movie IDs.
    Input only contains rating data (no need for user IDs),
        as all recommendations will be the same for all users.

    Parameters
    ----------
    n : int
        Number of recommendations to output

    rating_data : pd.DataFrame
        Rating data

    Returns
    -------
    recommend_id : np.array
        List containing the recommended movie IDs
    """
    # First, group by movieId in the rating data
    groupby_movie_id = (
        rating_data                                 # 1. get rating data
        .groupby(by='movieId')                      # 2. group by movieId
        .agg({'rating': 'count'})                   # 3. aggregate count of ratings for each group
        .sort_values(by='rating', ascending=False)  # 4. sort by highest to lowest rating count
    )

    # Next, select n recommendations
    recommend_id = (
        groupby_movie_id                            # 1. in groupby_movie_id
        .index[:n]                                  # 2. select top n indices
        .to_numpy()                                 # 3. and convert to Numpy array
    )

    # Finally, return recommend_id
    return recommend_id

In [22]:
get_non_personalized(n = 10,
                     rating_data = rating_data_final)

array([ 356,  318,  296,  593, 2571,  260,  480,  110,  589,  527],
      dtype=int64)

---
**Fun Fact**
1. What is the problem with this non-personalized recommendation system?  
---> It favors older films: the longer a film has been available, the more ratings it accumulates.
2. How can we fix the problem?  
---> Select the most-rated films from the last xx years only.
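
One way to apply this fix is to count only ratings made within a recent window. A minimal sketch, assuming `timestamp` holds Unix seconds and measuring the window back from the newest rating in the data. `get_recent_non_personalized` and `years` are hypothetical names that do not appear elsewhere in this notebook. Note that this filters by when a rating was made, not by a film's release year, which the rating data does not contain.

```python
import pandas as pd

def get_recent_non_personalized(n, rating_data, years):
    """Return the n most-rated movie IDs, counting only recent ratings."""
    # Assumption: 'timestamp' is in Unix seconds
    seconds_per_year = 365 * 24 * 60 * 60
    cutoff = rating_data['timestamp'].max() - years * seconds_per_year

    # Keep only ratings made at or after the cutoff
    recent = rating_data[rating_data['timestamp'] >= cutoff]

    return (
        recent['movieId']
        .value_counts()   # rating count per movie, highest first
        .index[:n]        # top n movie IDs
        .to_numpy()
    )

# Toy data: movie 10 has many old ratings, movies 20 and 30 have recent ones
toy = pd.DataFrame({
    'userId':    [1, 2, 3, 1, 2, 3],
    'movieId':   [10, 10, 10, 20, 20, 30],
    'rating':    [5.0, 4.0, 4.0, 3.0, 4.0, 5.0],
    'timestamp': [1_000_000_000, 1_000_000_100, 1_000_000_200,
                  1_499_999_000, 1_499_999_500, 1_500_000_000],
})

print(get_recent_non_personalized(n=2, rating_data=toy, years=1))
```

With a short window, the heavily rated but old movie 10 drops out of the list; with a window long enough to cover all ratings, it returns to the top.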

---
## 5. Creating personalized models: collaborative filtering

<br>
<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/intro-ai-recommendersystem/5_03.png">
</center>

In [24]:
from surprise import Reader, Dataset

# Create a reader object
reader = Reader(rating_scale=(1, 5)) # Define the rating scale. We use a scale of 1-5

# Create a data object
# This object was created to enable cross-validation of the model
data = Dataset.load_from_df(df = rating_data_final[['userId', 'movieId', 'rating']], # Insert data
                             reader = reader)                                        # Enter reader

In [25]:
from surprise import KNNBaseline

# Create a trainset
trainset = data.build_full_trainset()

# Do training
algo_best = KNNBaseline()
algo_best.fit(trainset)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x1e90e844e90>

In [31]:
# This is the code to predict the rating that user-u will give to item-i
def predict_item_rating(user_id, item_id, model):
    """
    Predict the rating that user-u will give to item-i

    Parameters
    ----------
    user_id : int
        User ID

    item_id : int
        Item ID

    model : surprise model
        Collaborative filtering model

    Returns
    -------
    dict
        A dictionary containing the user ID, item ID, predicted rating, and prediction details
    """
    # Perform prediction
    pred = model.predict(uid=user_id, iid=item_id)

    # Extract prediction result
    return {
        'user': user_id,
        'item': item_id,
        'rating_pred': np.round(pred.est, 1),
        'details': pred.details
    }

# create a function to generate n movie recommendations for a user
def get_personalized(user_id, n, model, rating_data):
    """
    Function to find movie recommendations for a user based on collaborative filtering

    Parameters
    ----------
    user_id : int
        User ID

    n : int
        Number of recommendations

    model : surprise model
        Collaborative filtering model

    rating_data : pd.DataFrame
        Historical rating data

    Returns
    -------
    recommend_movie_id : np.array
        List containing recommendations of movie IDs for user_id
    """
    # First, extract all movie IDs from the rating data
    all_movie_id = set(rating_data['movieId'])

    # Next, find the movie IDs that the user has watched
    watched_movie_id = set(rating_data['movieId'][rating_data['userId'] == user_id])

    # Then, find the movie IDs that the user has not watched yet
    unwatched_movie_id = np.array(list(all_movie_id - watched_movie_id))

    # Iterate to predict the rating for unwatched movie IDs
    unwatched_rating = []
    for movie_id in unwatched_movie_id:
        pred = predict_item_rating(user_id=user_id, 
                                   item_id=movie_id, 
                                   model=model)
        
        unwatched_rating.append(pred['rating_pred'])
    unwatched_rating = np.array(unwatched_rating)

    # Next, sort the unwatched_rating from high to low and take the top n indices with the highest ratings
    recommend_idx = (
        np.argsort(unwatched_rating)[::-1]  # sort ratings from high to low
        [:n]                               # take n highest values (indices)
    )

    # Finally, create a recommendation of movie IDs
    recommend_movie_id = unwatched_movie_id[recommend_idx]

    return recommend_movie_id

In [32]:
# Find the top 10 movie recommendations for user 610
recommend_id = get_personalized(
                    user_id = 610,                   # fill in with user id
                    n = 10,                          # fill in with the number of recommendations
                    model = algo_best,               # fill in with the final algorithm from collaborative filtering
                    rating_data = rating_data_final  # fill in with historical data
                )
recommend_id

array([177593,   2239,   3451,   2202,   1248, 170705,   5747,   1041,
         3266,   3201])
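
Because `predict_item_rating` rounds each prediction to one decimal, several unwatched movies can share the same score, and the order among tied movies then depends on how `np.argsort` arranges them. The sketch below isolates the ranking step on made-up scores (`movie_ids`, `scores`, and `top_n` are illustrative and not taken from the model) and uses a stable sort so that ties resolve by position in the input array:

```python
import numpy as np

def top_n(movie_ids, scores, n):
    """Return the n movie IDs with the highest scores; ties keep input order."""
    # Negating the scores turns an ascending sort into a descending one;
    # kind='stable' preserves the input order among equal scores.
    order = np.argsort(-scores, kind='stable')[:n]
    return movie_ids[order]

# Made-up predictions: movies 101 and 103 are tied at 4.5
movie_ids = np.array([101, 102, 103, 104])
scores = np.array([4.5, 5.0, 4.5, 3.0])

print(top_n(movie_ids, scores, n=3))
```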

## 6. Combine all models into a system

<br>
<center>
<img src="https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/intro-ai-recommendersystem/5_04.png">
</center>

- Flowchart
  1. Check whether the user is new or not
  2. If the user is new, give a non-personalized recommendation
  3. If the user is an old user, provide a personalized recommendation

In [33]:
# Input user id & number of recommendations (n)
user_id = 1
n = 10

# Check if the user id is an old or new user
results = check_user_id(user_id = user_id, 
                        rating_data = rating_data_final)

# Make recommendations based on the condition
if not results['is_old_user']:
    # If the user id is a new user
    # Call non-personalized recommendations
    recommend_id = get_non_personalized(n = n, 
                                        rating_data = rating_data_final)
else:
    # If the user id is an old user
    # Call collaborative filtering recommendations
    recommend_id = get_personalized(user_id = user_id, 
                                    n = n, 
                                    model= algo_best, 
                                    rating_data = rating_data_final)

# Display the recommendations
recommend_id

array([  1446, 116897,   5690,   1235,    741,  80906,   1237, 159817,
          750,   2239])

In [50]:
def get_recommendation(user_id, n, model, rating_data):
    """
    Function that contains a hybrid recommendation system to help solve
    the problem of cold-start users.

    If the user is new -> non-personalized recommendation
    If the user is old -> collaborative filtering recommendation

    Parameters
    ----------
    user_id : int
        User ID

    n : int
        Number of recommendations desired

    model : surprise object
        Model for collaborative filtering recommendations

    rating_data : pd.DataFrame
        Historical rating data

    Returns
    -------
    recommend_id : np.array
        n recommended movie IDs
    """
    # First, check if the user ID is an old or new user
    results = check_user_id(user_id=user_id, rating_data=rating_data)

    # Next, make recommendations based on the condition
    if not results['is_old_user']:
        # If the user ID is a new user
        print('=' * 40)
        print('This user is a new user')
        print('=' * 40)
        # Call non-personalized recommendations
        recommend_id = get_non_personalized(n=n, rating_data=rating_data)
    else:
        # If the user ID is an old user
        print('=' * 40)        
        print('This user is an old user')
        print('=' * 40)
        # Call collaborative filtering recommendations
        recommend_id = get_personalized(user_id=user_id, n=n, model=model, rating_data=rating_data)

    # Return the recommendation result
    status = {
        'user_id': user_id,
        'recommend_id': recommend_id
    }

    return status

In [51]:
# Check recommendations for old users (userId = 200)
get_recommendation(user_id = 200,
                   n = 10,
                   model = algo_best,
                   rating_data = rating_data_final)

This user is an old user


{'user_id': 200,
 'recommend_id': array([177593,   3266,   2202,   2239,   3451,   1683, 106642,   1104,
          1204,   3200])}

In [52]:
# Load metadata from movies
movie_metadata = pd.read_csv('movies.csv', index_col='movieId')

In [53]:
# Check recommendations for old users (userId = 32)
status = get_recommendation(user_id = 32,
                            n = 10,
                            model = algo_best,
                            rating_data = rating_data_final)

# Search for movie metadata
movie_metadata.loc[status['recommend_id']]

This user is an old user


Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
177593,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama
2239,Swept Away (Travolti da un insolito destino ne...,Comedy|Drama
3451,Guess Who's Coming to Dinner (1967),Drama
106642,"Day of the Doctor, The (2013)",Adventure|Drama|Sci-Fi
3266,Man Bites Dog (C'est arrivé près de chez vous)...,Comedy|Crime|Drama|Thriller
2202,Lifeboat (1944),Drama|War
3201,Five Easy Pieces (1970),Drama
1217,Ran (1985),Drama|War
3022,"General, The (1926)",Comedy|War
1683,"Wings of the Dove, The (1997)",Drama|Romance


In [54]:
# Check recommendations for new users (userId = 1500)
status = get_recommendation(user_id = 1500,
                            n = 10,
                            model = algo_best,
                            rating_data = rating_data_final)

# Look up movie metadata
movie_metadata.loc[status['recommend_id']]

This user is a new user


Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
356,Forrest Gump (1994),Comedy|Drama|Romance|War
318,"Shawshank Redemption, The (1994)",Crime|Drama
296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
110,Braveheart (1995),Action|Drama|War
589,Terminator 2: Judgment Day (1991),Action|Sci-Fi
527,Schindler's List (1993),Drama|War


---

🚀 **It's a Wrap!**

Congratulations on completing the Hybrid Recommendation System notebook! 🎉 You've learned how to tackle the challenge of recommending movies to both new and existing users using a combination of non-personalized and collaborative filtering approaches. 

🎥 Remember, understanding user preferences and providing accurate recommendations are key in enhancing user experience and engagement. Keep exploring new techniques and datasets to further improve your recommendation systems!

🌟 Thank you for joining me on this journey. Happy recommending! 😊🍿

---