# Developing a User-Based Recommendation System using the Datasets 'Movie and Rating'

# Business problem

### An online movie streaming platform wants to improve its recommendation system. Having already tried content-based and product-based recommendation systems, the company wants its recommendations to be more personalized. So far, suggestions have been made according to movies with similar rating patterns, but the company now wants to tailor these general suggestions further, based on the similarity between users.

# Dataset story

### The dataset is provided by MovieLens. It contains movies and the ratings given to these movies: more than 20,000,000 ratings for approximately 27,000 movies.

# Variables

### The dataset contains many different tables, but we use only 2 CSV files.

#### movie.csv
* movieId - Unique movie number
* title - movie name

#### rating.csv
* userId - Unique user number
* movieId - Unique movie number
* rating - the rating given to the movie by the user
* timestamp - date and time of the rating

# Importing the libraries

In [1]:
import pandas as pd

# Reading and preparing the dataset

### Let's define the 'create_user_movie_df' function to prepare the dataset

In [2]:
def create_user_movie_df():
    # Load the ratings and the movie titles
    rating = pd.read_csv('/kaggle/input/movie-rating/rating.csv')
    movie = pd.read_csv('/kaggle/input/movie-rating/movie.csv')
    df = movie.merge(rating, how='left', on='movieId')
    df.columns = [col.lower() for col in df.columns]
    # Count ratings per title and drop movies with 1000 or fewer ratings
    rating_counts = df['title'].value_counts()
    rare_movies = rating_counts[rating_counts <= 1000].index
    common_movies = df[~df['title'].isin(rare_movies)]
    # Rows are users, columns are titles, values are ratings (NaN if not rated)
    user_movie_df = common_movies.pivot_table(index=['userid'], columns=['title'], values='rating')
    return user_movie_df

user_movie_df = create_user_movie_df()
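
To see what the merge and pivot steps produce, here is a minimal sketch on made-up data. The frames `toy_movie` and `toy_rating` and all of their values are hypothetical stand-ins for movie.csv and rating.csv, and the rare-movie filter is left out to keep the example small.

```python
import pandas as pd

# Hypothetical stand-ins for movie.csv and rating.csv (made-up values)
toy_movie = pd.DataFrame({"movieId": [1, 2, 3],
                          "title": ["Movie A", "Movie B", "Movie C"]})
toy_rating = pd.DataFrame({"userId": [10, 10, 20, 20, 30],
                           "movieId": [1, 2, 1, 3, 2],
                           "rating": [4.0, 3.0, 5.0, 2.0, 4.5]})

# Same merge and lower-casing as in create_user_movie_df
toy_df = toy_movie.merge(toy_rating, how="left", on="movieId")
toy_df.columns = [col.lower() for col in toy_df.columns]

# One row per user, one column per title; movies a user did not rate become NaN
toy_user_movie = toy_df.pivot_table(index="userid", columns="title", values="rating")
print(toy_user_movie)
```

Each row of the result is one user's rating vector, which is the shape the rest of the notebook works with.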

In [3]:
user_movie_df.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),...,Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userid,,,,,,,,,,,,,,,,,,,,,
1.0,,,,,,,,,,,...,,,,,,,,,,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,


### Let's select a random user

In [4]:
random_user = int(pd.Series(user_movie_df.index).sample(1).iloc[0])
random_user

47509

# Getting the watched movies

### Next, we determine which movies the randomly selected user has watched.

In [5]:
random_user_df = user_movie_df[user_movie_df.index == random_user]
movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
movies_watched

['Birdcage, The (1996)',
 'Broken Arrow (1996)',
 'Eraser (1996)',
 'Fargo (1996)',
 'Father of the Bride Part II (1995)',
 'Grumpier Old Men (1995)',
 'Happy Gilmore (1996)',
 'Heat (1995)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Mission: Impossible (1996)',
 'Nutty Professor, The (1996)',
 'Phenomenon (1996)',
 'Primal Fear (1996)',
 'Rock, The (1996)',
 'Sabrina (1995)',
 'Star Trek: First Contact (1996)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Toy Story (1995)',
 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
 'Twister (1996)',
 'Willy Wonka & the Chocolate Factory (1971)']
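
As a small illustration of the same idea, the sketch below picks the rated titles from a single user's row. The frame `toy` and its values are made up, and selecting the row with `.loc` is just one possible variant of the boolean-index approach used above.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix (made-up ratings)
toy = pd.DataFrame(
    {"Movie A": [4.0, np.nan], "Movie B": [np.nan, 3.0], "Movie C": [5.0, 2.0]},
    index=pd.Index([10, 20], name="userid"),
)
target_user = 10

# Take the user's row and keep only the titles with a rating
row = toy.loc[target_user]
toy_watched = row[row.notna()].index.tolist()
print(toy_watched)  # ['Movie A', 'Movie C']
```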

### Let's get the number of movies watched by the user

In [6]:
print(f'The number of the movies watched by the user is {len(movies_watched)}')

The number of the movies watched by the user is 22


### So, we now have the movies watched by the randomly selected user, and how many there are.

### Now, let's verify one of the movies watched by the selected user

In [7]:
user_movie_df.loc[user_movie_df.index == random_user, user_movie_df.columns == 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)']

title,Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
userid,
47509.0,4.0


### As you can see, the user rated the movie 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', so they have watched it.

# Getting other users watching the same movies

In [8]:
movies_watched_df = user_movie_df[movies_watched]
movies_watched_df

title,"Birdcage, The (1996)",Broken Arrow (1996),Eraser (1996),Fargo (1996),Father of the Bride Part II (1995),Grumpier Old Men (1995),Happy Gilmore (1996),Heat (1995),Independence Day (a.k.a. ID4) (1996),Mission: Impossible (1996),...,Primal Fear (1996),"Rock, The (1996)",Sabrina (1995),Star Trek: First Contact (1996),Star Wars: Episode IV - A New Hope (1977),Star Wars: Episode VI - Return of the Jedi (1983),Toy Story (1995),Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Willy Wonka & the Chocolate Factory (1971)
userid,,,,,,,,,,,,,,,,,,,,,
1.0,,,,,,,,,,,...,,,,,4.0,,,3.5,,
2.0,,,,,,4.0,,,,,...,,,,5.0,5.0,5.0,,,,
3.0,,,,,,,,,3.0,,...,,,,5.0,5.0,5.0,4.0,4.0,,5.0
4.0,,,,,,,,3.0,,,...,,5.0,,,,,,1.0,,
5.0,5.0,,,3.0,,,2.0,,5.0,3.0,...,,,,,5.0,5.0,,,5.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138489.0,,,,,,,,,,,...,,,,,,,,,,
138490.0,,,,,,,,,,,...,,,,,,,,5.0,,
138491.0,,,,,,,,,,,...,,,,,,,2.0,,,
138492.0,,,,,,,,,,,...,,,,,,,,,,3.5


### So, we have reduced the dataset to only the movies watched by the selected user.

### Let's reduce the data further to users who watched more than 60% of the movies the selected user watched, by setting a threshold value
### It doesn't make sense to keep people who have watched only one movie in common with the selected user, because a single shared movie tells us nothing about how similar their tastes are.
### Therefore, we keep only the users who have watched more than a certain number of movies in common.

### Let's determine how many movies each user in the dataset (movies_watched_df) has watched

In [9]:
user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count

userid
1.0          2
2.0          4
3.0          8
4.0          3
5.0         10
            ..
138489.0     0
138490.0     1
138491.0     1
138492.0     1
138493.0     6
Length: 138493, dtype: int64

### Since we will filter on 'userid', which is currently the index, it must be converted to a column.

In [10]:
user_movie_count = user_movie_count.reset_index()
user_movie_count

,userid,0
0,1.0,2
1,2.0,4
2,3.0,8
3,4.0,3
4,5.0,10
...,...,...
138488,138489.0,0
138489,138490.0,1
138490,138491.0,1
138491,138492.0,1


### Let's rename the columns

In [11]:
user_movie_count.columns = ['userid', 'movie_count']
user_movie_count

,userid,movie_count
0,1.0,2
1,2.0,4
2,3.0,8
3,4.0,3
4,5.0,10
...,...,...
138488,138489.0,0
138489,138490.0,1
138490,138491.0,1
138491,138492.0,1


### Let's get the users who watched more than 60 percent of the movies the selected user watched

In [12]:
percent = len(movies_watched) * 0.6
user_same_movies = user_movie_count[user_movie_count['movie_count'] > percent]['userid']
user_same_movies

11            12.0
18            19.0
23            24.0
53            54.0
68            69.0
            ...   
138386    138387.0
138396    138397.0
138410    138411.0
138473    138474.0
138479    138480.0
Name: userid, Length: 9930, dtype: float64
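
The threshold step can be sketched on made-up data as follows. The frame `toy`, the user ids, and the `watched` list are hypothetical; only the 60% rule comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix (made-up ratings); user 10 is the target user
toy = pd.DataFrame(
    {"Movie A": [4.0, 5.0, np.nan, 3.0],
     "Movie B": [3.0, np.nan, np.nan, 4.0],
     "Movie C": [5.0, 4.0, 2.0, np.nan]},
    index=pd.Index([10, 20, 30, 40], name="userid"),
)
watched = ["Movie A", "Movie B", "Movie C"]  # movies rated by the target user

# Number of the target user's movies that each user has also rated
overlap = toy[watched].notna().sum(axis=1)

# Keep users who rated more than 60% of those movies
threshold = len(watched) * 0.6
similar_enough = overlap[overlap > threshold].index.tolist()
print(similar_enough)  # [10, 20, 40]
```

The target user always passes this filter, because they have rated all of their own movies. User 30 shares only one movie with the target user and is dropped.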

### Let's count the users who watched all the movies the selected user watched (this count includes the selected user)

In [13]:
user_movie_count[user_movie_count['movie_count'] == len(movies_watched)].count()

userid         205
movie_count    205
dtype: int64

# Determination of similarities

### Goal: to identify users with the most similar behavior to the selected user
### This process can be performed in three stages
* Aggregating user and other users' data
* Creating the correlation dataframe
* Finding the most similar users (top_users)

### Stage 1: Aggregating user and other users' data

In [14]:
final_df = pd.concat([movies_watched_df[movies_watched_df.index.isin(user_same_movies)], random_user_df[movies_watched]])
final_df

title,"Birdcage, The (1996)",Broken Arrow (1996),Eraser (1996),Fargo (1996),Father of the Bride Part II (1995),Grumpier Old Men (1995),Happy Gilmore (1996),Heat (1995),Independence Day (a.k.a. ID4) (1996),Mission: Impossible (1996),...,Primal Fear (1996),"Rock, The (1996)",Sabrina (1995),Star Trek: First Contact (1996),Star Wars: Episode IV - A New Hope (1977),Star Wars: Episode VI - Return of the Jedi (1983),Toy Story (1995),Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),Willy Wonka & the Chocolate Factory (1971)
userid,,,,,,,,,,,,,,,,,,,,,
12.0,,3.0,4.0,5.0,2.0,3.0,4.0,3.0,3.0,3.0,...,,4.0,3.0,,4.0,,4.0,3.0,3.0,
19.0,5.0,3.0,4.0,5.0,,4.0,,5.0,4.0,3.0,...,4.0,4.0,5.0,,,,5.0,2.0,4.0,5.0
24.0,,4.0,,2.0,2.0,,5.0,4.0,4.0,4.0,...,,4.0,3.0,3.0,5.0,5.0,4.0,4.0,,2.0
54.0,,2.0,4.0,5.0,3.0,,,3.0,5.0,3.0,...,,4.0,,5.0,4.0,3.0,4.0,5.0,4.0,3.0
69.0,,3.0,3.0,3.0,3.0,,3.0,4.0,4.0,4.0,...,,4.0,,5.0,5.0,5.0,4.0,5.0,3.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138397.0,2.0,,4.5,4.0,,5.0,4.5,5.0,5.0,4.0,...,,5.0,,3.5,3.5,3.5,,4.5,,3.5
138411.0,3.5,3.0,2.5,4.0,,,2.5,5.0,2.5,2.5,...,,4.0,,4.5,5.0,5.0,5.0,5.0,2.0,5.0
138474.0,4.0,4.0,5.0,,4.0,,5.0,5.0,,4.0,...,,5.0,,5.0,5.0,5.0,5.0,5.0,,5.0
138480.0,3.0,3.0,4.0,,,3.0,4.0,,4.0,4.0,...,,5.0,,4.0,4.0,,4.0,4.0,4.0,3.0


### Stage 2: Creating the correlation dataframe


### Let's transpose the dataframe so that the users become columns, and then calculate the correlations between them.

In [15]:
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
corr_df

userid    userid  
136696.0  66208.0    -1.0
65379.0   78200.0    -1.0
57176.0   83858.0    -1.0
49766.0   115050.0   -1.0
10635.0   108403.0   -1.0
                     ... 
12435.0   43820.0     1.0
138480.0  138480.0    1.0
25460.0   29553.0     1.0
18101.0   66208.0     1.0
12.0      3052.0      NaN
Length: 14261534, dtype: float64
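
When only the correlations with one user are needed, a lighter alternative is to correlate every user with that user directly. The sketch below uses pandas `DataFrame.corrwith` on made-up data; it is not the method used in this notebook, and the frame `toy` and the user ids are hypothetical. One possible reason to consider it: `drop_duplicates()` on a Series removes rows with equal values, so applied to correlations it may also drop different user pairs that happen to share the same correlation value.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix (made-up ratings); user 10 is the target user
toy = pd.DataFrame(
    {"Movie A": [4.0, 5.0, 1.0],
     "Movie B": [3.0, 3.0, 4.0],
     "Movie C": [5.0, 4.0, 2.0],
     "Movie D": [2.0, np.nan, 5.0]},
    index=pd.Index([10, 20, 30], name="userid"),
)
target_user = 10

# After transposing, each column is a user. corrwith computes the Pearson
# correlation of every column with the target user's column, using only
# the movies that both users have rated.
by_user = toy.T
toy_corr = by_user.corrwith(by_user[target_user]).drop(target_user)
print(toy_corr.sort_values(ascending=False))
```

In this toy data, user 20 rates similarly to the target user (correlation 0.5 over the three shared movies), while user 30 rates in the opposite direction and gets a negative correlation.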

### Let's convert the result to a dataframe to make it easier to read

In [16]:
corr_df = pd.DataFrame(corr_df, columns=['corr'])
corr_df.index.names = ['user_id_1', 'user_id_2']
corr_df = corr_df.reset_index()
corr_df

,user_id_1,user_id_2,corr
0,136696.0,66208.0,-1.0
1,65379.0,78200.0,-1.0
2,57176.0,83858.0,-1.0
3,49766.0,115050.0,-1.0
4,10635.0,108403.0,-1.0
...,...,...,...
14261529,12435.0,43820.0,1.0
14261530,138480.0,138480.0,1.0
14261531,25460.0,29553.0,1.0
14261532,18101.0,66208.0,1.0


### Stage 3: Finding the most similar users (top_users)

### Let's get the users whose correlation with the selected user is 0.65 or higher and call them 'top_users'

In [17]:
top_users = corr_df[(corr_df['user_id_1'] == random_user) & (corr_df['corr'] >= 0.65)][['user_id_2', 'corr']].reset_index(drop=True).sort_values('corr', ascending=False)
top_users

,user_id_2,corr
171,91740.0,0.937229
170,96849.0,0.872742
169,76435.0,0.854592
168,80677.0,0.832424
167,84374.0,0.824160
...,...,...
4,49936.0,0.651595
3,29598.0,0.651463
2,14285.0,0.650985
1,92636.0,0.650380


### Since we do not yet know which ratings these users gave to which movies, let's merge the rating file with the top_users data to find out.

### First, let's rename user_id_2 to userId

In [18]:
top_users.rename(columns = {'user_id_2': 'userId'}, inplace=True)

In [19]:
rating = pd.read_csv('/kaggle/input/movie-rating/rating.csv')
top_users_rating = top_users.merge(rating[['userId', 'movieId', 'rating']], how='inner')
top_users_rating

,userId,corr,movieId,rating
0,91740.0,0.937229,1,3.5
1,91740.0,0.937229,6,3.0
2,91740.0,0.937229,10,2.5
3,91740.0,0.937229,16,3.5
4,91740.0,0.937229,19,2.0
...,...,...,...,...
126207,75939.0,0.650167,31747,1.0
126208,75939.0,0.650167,33493,4.5
126209,75939.0,0.650167,33794,5.0
126210,75939.0,0.650167,35836,4.5


In [20]:
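# Drop the selected user's own rows; copy so that a column can be added later without a SettingWithCopyWarning
top_users_rating = top_users_rating[top_users_rating['userId'] != random_user].copy()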
top_users_rating

,userId,corr,movieId,rating
0,91740.0,0.937229,1,3.5
1,91740.0,0.937229,6,3.0
2,91740.0,0.937229,10,2.5
3,91740.0,0.937229,16,3.5
4,91740.0,0.937229,19,2.0
...,...,...,...,...
126207,75939.0,0.650167,31747,1.0
126208,75939.0,0.650167,33493,4.5
126209,75939.0,0.650167,33794,5.0
126210,75939.0,0.650167,35836,4.5


### As a result, we obtained the users with the highest correlation with the selected user, together with their ratings for various movies. These are users who watched more than 60% of the movies the selected user watched and whose correlation with the selected user is 0.65 or higher.

# Score calculation

### Let's calculate a weighted recommendation score. To do this, we multiply each rating by the correlation of the user who gave it, so the score reflects both the rating and how similar that user is to the selected user.

In [21]:
top_users_rating['weighted_rating'] = top_users_rating['corr'] * top_users_rating['rating']
top_users_rating.sort_values('weighted_rating', ascending=False).head()

,userId,corr,movieId,rating,weighted_rating
112,91740.0,0.937229,1196,5.0,4.686146
31,91740.0,0.937229,260,5.0,4.686146
486,91740.0,0.937229,7118,5.0,4.686146
282,91740.0,0.937229,2712,5.0,4.686146
407,91740.0,0.937229,4825,5.0,4.686146


### As you can see, there is one row per user and movie, so the same movieId can appear many times, once for each top user who rated it. Therefore, if we group by movieId and take the mean of weighted_rating, we get one final score per movie.

In [22]:
recommendation_df = top_users_rating.groupby('movieId').agg({'weighted_rating': 'mean'}).sort_values('weighted_rating', ascending=False)
recommendation_df = recommendation_df.reset_index()
recommendation_df

,movieId,weighted_rating
0,26701,4.686146
1,47836,4.686146
2,53002,4.059863
3,59143,4.059863
4,51007,3.970387
...,...,...
9574,108689,0.328150
9575,89041,0.328150
9576,7002,0.328150
9577,27136,0.328150
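
The scoring step can be traced by hand on a small made-up example. The frame `toy_top` and all of its values are hypothetical; the calculation (mean of corr * rating per movie) follows the cells above.

```python
import pandas as pd

# Hypothetical similar users, their correlation with the selected user,
# and their ratings (made-up values)
toy_top = pd.DataFrame({
    "userId":  [20, 20, 40, 40],
    "corr":    [0.9, 0.9, 0.7, 0.7],
    "movieId": [1, 2, 1, 3],
    "rating":  [5.0, 3.0, 4.0, 5.0],
})
toy_top["weighted_rating"] = toy_top["corr"] * toy_top["rating"]

# Mean of corr * rating per movie, sorted from best to worst
toy_scores = (toy_top.groupby("movieId")["weighted_rating"]
              .mean()
              .sort_values(ascending=False))
print(toy_scores)
```

Movie 1 gets (0.9 * 5.0 + 0.7 * 4.0) / 2 = 3.65, movie 3 gets 0.7 * 5.0 = 3.5, and movie 2 gets 0.9 * 3.0 = 2.7. Note that a movie rated by a single highly correlated user can reach the top of the list, which is consistent with the first rows of the output above, where 4.686146 matches 0.937229 * 5.0. Requiring a minimum number of raters per movie would be one possible variation, but it is not part of this notebook.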


### We can filter on the weighted_rating variable to get the movies that the user may like. For example, movies with a weighted_rating greater than 3.5 can be treated as movies the user is likely to enjoy.

In [23]:
movies_to_be_recommended = recommendation_df[recommendation_df['weighted_rating'] > 3.5]
movies_to_be_recommended

,movieId,weighted_rating
0,26701,4.686146
1,47836,4.686146
2,53002,4.059863
3,59143,4.059863
4,51007,3.970387
5,27357,3.970387
6,583,3.970387
7,2830,3.970387
8,31705,3.970387
9,864,3.970387


### So, 10 movies have a weighted_rating higher than 3.5, and we can recommend them to the selected user.

### Let's look up the titles of the recommended movies

In [24]:
movie = pd.read_csv('/kaggle/input/movie-rating/movie.csv')
movies_to_be_recommended = movies_to_be_recommended.merge(movie[['movieId', 'title']])
movies_to_be_recommended

,movieId,weighted_rating,title
0,26701,4.686146,Patlabor: The Movie (Kidô keisatsu patorebâ: T...
1,47836,4.686146,Sketches of Frank Gehry (2005)
2,53002,4.059863,Georgia Rule (2007)
3,59143,4.059863,Super High Me (2007)
4,51007,3.970387,Days of Glory (Indigènes) (2006)
5,27357,3.970387,Old Men in New Cars (Gamle mænd i nye biler) (...
6,583,3.970387,Dear Diary (Caro Diario) (1994)
7,2830,3.970387,Cabaret Balkan (Bure Baruta) (1998)
8,31705,3.970387,Beautiful Boxer (2003)
9,864,3.970387,"Wife, The (1995)"


### As a result, we recommended 10 movies to the selected user, based on the ratings of similar users.

# Thank you for checking my notebook!