
# <p style="font-family:newtimeroman;color:#A60000;font-weight: bold;font-size:150%;"> Hybrid Recommender Project</p>



* An online movie streaming website wants to develop a **hybrid recommender system** for its users.
* For the user whose ID is given, make predictions using **item-based and user-based recommender methods**.
* Take 5 recommendations from the user-based model and 5 from the item-based model, for a total of 10 recommendations from the 2 models.
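
The "5 + 5" combination described above can be sketched as follows. This is only an illustration: the function name `combine_recommendations` and the placeholder titles are hypothetical and do not appear in the notebook.

```python
def combine_recommendations(user_based, item_based, per_model=5):
    """Take up to `per_model` titles from each ranked list, skipping duplicates."""
    combined = []
    for source in (user_based, item_based):
        taken = 0
        for title in source:
            if taken == per_model:
                break
            if title not in combined:
                combined.append(title)
                taken += 1
    return combined


# Placeholder titles, ranked best first (made up for this example)
user_based = ["U1", "U2", "U3", "U4", "U5", "U6"]
item_based = ["I1", "U2", "I2", "I3", "I4", "I5", "I6"]

final = combine_recommendations(user_based, item_based)
print(final)
```

A title that both models suggest ("U2" here) is counted only once, so the item-based list contributes its next-best title instead.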

# Dataset

* The data comes from the MovieLens 20M dataset and is read from two files: `movie.csv` and `rating.csv`.
* `rating.csv` holds one row per rating, including the time the rating was given.

* **movieId**: Unique movie identifier, present in both files
* **title**: Movie title, including the release year (Example: Wild at Heart (1990))
* **genres**: Genres of the movie
* **userId**: Unique user identifier
* **rating**: Score the user gave to the movie
* **timestamp**: Date and time the rating was given

# Task 1: Data Preparation

In [1]:
import pandas as pd
pd.options.display.max_columns=None
pd.options.display.max_rows=10
pd.options.display.float_format = '{:.3f}'.format
pd.options.display.width = 1000

**Step 1: Read the movie and rating datasets.**

* Dataset containing movieId, movie title, and movie genres

In [2]:
movie = pd.read_csv("/kaggle/input/movielens-20m-dataset/movie.csv")

* Dataset containing userId, movieId, the rating given to the movie, and the time of the rating

In [3]:

rating = pd.read_csv("/kaggle/input/movielens-20m-dataset/rating.csv")

**Step 2: Add the titles and genres of the movies to the rating dataset using the movie dataset.**
> * The rating dataset identifies the movies that users rated only by id.
> * We add the title and genres for each id from the movie dataset.

In [4]:
df = pd.merge(movie,rating, how="inner", on="movieId")

**Step 3: Calculate how many people voted for each movie in total. Exclude movies with fewer than 1000 total votes from the dataset.**
> * We calculate how many people voted for each movie.
> * We keep the names of the movies with fewer than 1000 total votes in rare_movies and remove them from the dataset.

In [5]:
values_pd = df["title"].value_counts()

rare_movies = values_pd[values_pd < 1000].index

df_ = df[~df["title"].isin(rare_movies)]
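
The same filtering pattern on a tiny made-up table may make the three lines above easier to follow. The titles, ratings, and the threshold `min_votes = 2` are assumptions for illustration; the notebook itself uses 1000.

```python
import pandas as pd

# Toy data: one row per (user, movie) vote; values are made up
toy = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 1],
    "title":  ["A", "A", "A", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.5, 5.0],
})

min_votes = 2  # hypothetical threshold for the toy data

# Number of votes per title: A has 3, B has 2, C has 1
vote_counts = toy["title"].value_counts()

# Titles below the threshold are treated as rare
rare_titles = vote_counts[vote_counts < min_votes].index

# Keep only the rows whose title is not rare
toy_common = toy[~toy["title"].isin(rare_titles)]
print(toy_common)
```

Only title "C" falls below the threshold, so its single row is removed and the five rows for "A" and "B" remain.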

**Step 4: Create a pivot table from the dataframe with userIds in the index, movie titles in the columns, and ratings as values (name it user_movie_df).**

In [6]:
user_movie_df = df_.pivot_table(index="userId",columns="title",values="rating")
user_movie_df.shape

(138493, 3159)
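
As a small sketch of what this pivot produces, the toy example below (made-up users, titles, and ratings) builds a 3 x 3 user-movie matrix. Cells for movies a user has not rated come out as `NaN`, which is what the later steps rely on when they call `dropna` and `notnull`.

```python
import pandas as pd

# Toy long-format ratings; values are made up
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title":  ["A", "B", "A", "C", "B"],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Rows: users, columns: titles, values: ratings
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")
print(toy_matrix)

# Count how many cells actually hold a rating
n_rated = int(toy_matrix.notna().sum().sum())
print(n_rated, "of", toy_matrix.size, "cells are filled")
```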

**Step 5: Wrap all of the above operations in a function.**

In [7]:
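def yuk(df1, df2, min_vote=1000):
    """Build the user-movie rating matrix from the movie (df1) and rating (df2) dataframes."""
    df1, df2 = df1.copy(), df2.copy()
    merged = pd.merge(df1, df2, how="inner", on="movieId")
    # Count votes per title and drop titles with fewer than min_vote votes
    counter = merged["title"].value_counts()
    rares = counter[counter < min_vote].index
    merged_non_rares = merged[~merged["title"].isin(rares)]
    # Rows: users, columns: movie titles, values: ratings
    pivottable = merged_non_rares.pivot_table(values="rating", index="userId", columns="title")
    return pivottable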

# Task 2: Determining the Movies Watched by the User for Recommendation

**Step 1: Select a random user id.**

In [8]:
random_user = user_movie_df.sample(1,random_state=45).index[0]

**Step 2: Create a new dataframe named random_user_df with the observation units of the selected user.**

In [9]:
random_user_df = user_movie_df[user_movie_df.index==random_user]

**Step 3: Assign the movies that the selected user voted for to a list called movies_watched.**

In [10]:
movies_watched = random_user_df.dropna(axis=1).columns.tolist()

# Task 3: Accessing Data and Ids of Other Users Watching the Same Movies

**Step 1: Select the columns of the movies watched by the selected user from user_movie_df and create a new dataframe called movies_watched_df.**

In [11]:
movies_watched_df = user_movie_df[movies_watched]

**Step 2: Create a new variable called user_movie_count that holds, for each user, how many of the movies watched by the selected user they have also watched.**

In [12]:
user_movie_count = movies_watched_df.notnull().sum(axis=1)
user_movie_count.max()

33

**Step 3: We consider users who have watched more than 60 percent of the movies rated by the selected user as similar users.**
> * Create a list called users_same_movies from the ids of these users.

In [13]:
users_same_movies = user_movie_count[user_movie_count > (movies_watched_df.shape[1] * 60 ) / 100].index
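
The overlap count and the 60 percent cut-off from Tasks 2 and 3 can be traced on a toy matrix. All user ids, titles, and ratings below are made up, and `target` stands in for `random_user`.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix; NaN means "not rated"
toy_matrix = pd.DataFrame(
    {
        "A": [5.0, 4.0, np.nan, 3.0],
        "B": [3.0, np.nan, np.nan, 4.0],
        "C": [4.0, 2.0, 1.0, np.nan],
        "D": [np.nan, 5.0, 2.0, 1.0],
    },
    index=pd.Index([10, 20, 30, 40], name="userId"),
)

target = 10  # hypothetical selected user

# Movies the selected user has rated: A, B, C
watched = toy_matrix.loc[target].dropna().index.tolist()

# For every user, how many of those movies they have rated too
overlap = toy_matrix[watched].notnull().sum(axis=1)

# Strictly more than 60 percent of 3 movies, i.e. more than 1.8
threshold = len(watched) * 60 / 100
similar = overlap[overlap > threshold].index.tolist()
print(overlap.to_dict(), similar)
```

Users 20 and 40 share two of the three movies and pass the cut-off, while user 30 shares only one. The selected user passes as well, since they share every movie with themselves.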

# Task 4: Identifying the Users Most Similar to the User for Recommendation

**Step 1: Filter the movies_watched_df dataframe to keep only the users in the users_same_movies list, who are similar to the selected user.**

In [14]:
filtered_df = movies_watched_df[movies_watched_df.index.isin(users_same_movies)]

**Step 2: Create a new variable called corr_df that holds the correlations of the users with each other.**

In [15]:
corr_df = filtered_df.T.corr().unstack()
corr_df[random_user]

userId
91        0.399
130      -0.052
156      -0.027
158       0.427
160       0.294
          ...  
138208    0.260
138279    0.427
138382   -0.045
138415    0.482
138483    0.568
Length: 4139, dtype: float64
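
A minimal sketch of why the matrix is transposed before `corr()`: pandas correlates columns, so users have to be moved into the columns first. The ratings below are made up. For brevity the sketch reads one column of the correlation matrix instead of unstacking it; because the matrix is symmetric, that column holds the same values.

```python
import pandas as pd

# Toy ratings: rows are users, columns are movies (values are made up)
ratings = pd.DataFrame(
    {"A": [5.0, 4.0, 1.0], "B": [3.0, 2.0, 4.0], "C": [4.0, 3.0, 2.0]},
    index=pd.Index([10, 20, 30], name="userId"),
)

# Transpose so that each user becomes a column, then correlate the columns
user_corr = ratings.T.corr()

target = 10  # hypothetical selected user

# Correlation of every other user with the selected user
corr_with_target = user_corr[target].drop(target)
print(corr_with_target)
```

User 20 rates every movie exactly one point lower than user 10, so their correlation is 1 up to rounding, while user 30 has the opposite taste and gets a negative value.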

**Step 3: Create a new dataframe called top_users by filtering the users with high correlation (above 0.65) with the selected user.**


In [16]:
top_users = pd.DataFrame(corr_df[random_user][corr_df[random_user] > 0.65], columns=["corr"])

**Step 4: Merge the top_users dataframe with the rating dataset.**

In [17]:
top_users_ratings = pd.merge(top_users, rating[["userId", "movieId", "rating"]], how='inner', on="userId")

# Task 5: Calculating the Weighted Average Recommendation Score and Keeping the Top 5 Movies

**Step 1: Create a new variable called weighted_rating which is the product of each user's corr and rating values.**

In [18]:
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']

**Step 2: Create a new dataframe called recommendation_df containing the movie id and the average value of all users' weighted ratings for each movie.**

In [19]:
recommendation_df = top_users_ratings.pivot_table(values="weighted_rating", index="movieId", aggfunc="mean")

**Step 3: In recommendation_df, select movies with a weighted rating greater than 3.5 and sort them by weighted rating in descending order.**
> * Save the first 5 observations as movies_to_be_recommend.

In [20]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 3.5].sort_values(by="weighted_rating", ascending=False).head(5)
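
The weighting in this task can be followed on a few made-up rows. The user ids, movie ids, and correlations are assumptions for illustration. The sketch uses `groupby(...).mean()`, which gives the same per-movie averages as the `pivot_table(..., aggfunc="mean")` call above.

```python
import pandas as pd

# Hypothetical similar users and their correlation with the selected user
top_users = pd.DataFrame({"userId": [20, 40], "corr": [0.9, 0.8]})

# Toy ratings; user 99 is not a similar user and drops out of the merge
ratings = pd.DataFrame({
    "userId":  [20, 20, 40, 40, 99],
    "movieId": [1, 2, 1, 3, 1],
    "rating":  [5.0, 3.0, 4.0, 5.0, 1.0],
})

merged = top_users.merge(ratings, how="inner", on="userId")

# Each rating is scaled by how similar its author is to the selected user
merged["weighted_rating"] = merged["corr"] * merged["rating"]

# Average the weighted ratings per movie, best first
scores = (
    merged.groupby("movieId")["weighted_rating"]
    .mean()
    .sort_values(ascending=False)
)

# Keep only movies above the 3.5 cut-off used in the notebook
recommended = scores[scores > 3.5].index.tolist()
print(scores, recommended)
```

Because each rating is multiplied by a correlation of at most 1 and the mean is not divided by the sum of the correlations, the scores sit below the raw rating scale. A fixed cut-off such as 3.5 is therefore stricter than it would be on unweighted ratings.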


**Step 4: Get the titles of the 5 recommended movies.**

In [21]:
movie["title"][movie["movieId"].isin(movies_to_be_recommend.index)]

52                     Lamerica (1994)
1838                   Whatever (1998)
1973    Incredible Journey, The (1963)
2400             She's All That (1999)
3031                Tumbleweeds (1999)
Name: title, dtype: object

# Task 6: Item-Based Recommendation

**Make item-based suggestions based on the movie that the user watched most recently and rated the highest.**

In [22]:
user = 108170


**Step 1: Among the movies that the user gave 5 points, get the id of the one with the most recent rating.**

In [23]:
pick = rating[(rating["rating"] == 5) & (rating["userId"]==user)].sort_values(by="timestamp", ascending=False).iloc[0]["movieId"]


**Step 2: Filter the user_movie_df dataframe created in the user-based recommendation section according to the selected movie id.**

In [24]:
picked_movie_name = movie["title"][movie["movieId"]==pick].iloc[0]

filtered_movies = user_movie_df[picked_movie_name]

filtered_movies[filtered_movies.notna()]


userId
394      3.500
424      2.000
455      3.500
563      4.500
664      5.000
          ... 
137683   3.000
137839   3.500
138206   4.000
138208   2.500
138307   4.000
Name: Wild at Heart (1990), Length: 1537, dtype: float64

**Step 3: Using the filtered dataframe, find the correlation between the selected movie and the other movies and sort them.**

In [25]:
users_wo_user = user_movie_df.drop(user,axis=0).drop(movies_watched,axis=1)
filtered_wo_user = filtered_movies.drop(user,axis=0)

movies_similarity = users_wo_user.corrwith(filtered_wo_user).sort_values(ascending=False).reset_index()

movies_similarity.columns=["title","similarity"]
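
A toy illustration of the `corrwith` call, assuming made-up titles and ratings. Each column is correlated with the picked movie using only the users who rated both, so a movie with very few common raters can score a perfect correlation by accident.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix; NaN means "not rated"
m = pd.DataFrame(
    {
        "Picked":   [5.0, 4.0, 2.0, 1.0],
        "Alike":    [4.5, 4.0, 2.5, 1.0],
        "Opposite": [1.0, 2.0, 4.0, 5.0],
        "Sparse":   [np.nan, 3.0, np.nan, 1.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# Correlate every other movie with the picked movie, most similar first
sim = (
    m.drop(columns="Picked")
    .corrwith(m["Picked"])
    .sort_values(ascending=False)
)
print(sim)
```

"Sparse" ranks first here even though only two users rated both movies, because any two distinct points lie on a straight line. This is one reason the notebook removes rarely rated movies before computing similarities.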

**Step 4: Suggest the first 5 movies other than the selected movie itself.**

In [26]:
movies_similarity.sort_values(by="similarity",ascending=False).iloc[1:6]

| | title | similarity |
|---|---|---|
| 1 | My Science Project (1985) | 0.57 |
| 2 | Mediterraneo (1991) | 0.539 |
| 3 | Old Man and the Sea, The (1958) | 0.536 |
| 4 | National Lampoon's Senior Trip (1995) | 0.533 |
| 5 | Clockwatchers (1997) | 0.483 |


# Task 7: Hybrid Recommendation

* **Combine the movies that are highly similar to the movie the user rated 5 with the movies that other similar users rated 3.5 and above on average.**
* **Multiply the similarity score by the average weighted rating and recommend by ranking on the result.**
* **Recommend 10 movies.**

In [27]:
movies_ordered_by_rating = pd.merge(recommendation_df,movie,how="inner",on="movieId")[["movieId","weighted_rating","title"]]

merged = pd.merge(movies_similarity,movies_ordered_by_rating,how="inner", on="title")

merged["hybrid"] = merged["similarity"] * merged["weighted_rating"]

merged[["title","hybrid"]].sort_values(by="hybrid", ascending=False).head(10)

| | title | hybrid |
|---|---|---|
| 0 | Wild at Heart (1990) | 2.236 |
| 2 | Mediterraneo (1991) | 1.561 |
| 3 | Old Man and the Sea, The (1958) | 1.336 |
| 22 | Naked (1993) | 1.328 |
| 5 | Clockwatchers (1997) | 1.299 |
| 7 | Lost Highway (1997) | 1.276 |
| 42 | Beautiful Thing (1996) | 1.269 |
| 15 | Guys and Dolls (1955) | 1.23 |
| 35 | Giant (1956) | 1.192 |
| 122 | Angela's Ashes (1999) | 1.184 |
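
The hybrid scoring step can be sketched on two tiny made-up tables. The titles and scores are assumptions, and the column names simply mirror the ones used in the notebook.

```python
import pandas as pd

# Item-based side: similarity to the picked movie (made-up values)
similarity = pd.DataFrame(
    {"title": ["A", "B", "C"], "similarity": [0.9, 0.5, 0.6]}
)

# User-based side: average weighted rating from similar users (made-up values)
user_based = pd.DataFrame(
    {"title": ["A", "B", "D"], "weighted_rating": [2.0, 4.0, 5.0]}
)

# The inner join keeps only titles that both models have scored
hybrid = similarity.merge(user_based, how="inner", on="title")

# Hybrid score: product of the two signals
hybrid["hybrid"] = hybrid["similarity"] * hybrid["weighted_rating"]

ranked = hybrid.sort_values(by="hybrid", ascending=False)
print(ranked[["title", "hybrid"]])
```

Titles "C" and "D" drop out because each is scored by only one of the two models. "B" ends up ahead of "A" because its higher weighted rating outweighs its lower similarity.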
