 # Recommender Systems

<img src="https://cdn-gkenl.nitrocdn.com/uMXnJIMvZJzBKtDYnZeeVwfFXPbPMFBY/assets/static/optimized/rev-e313826/wp-content/uploads/2021/06/recommendersystemblog1.jpg">

## Introduction

A recommender system is an information filtering system, typically built on machine learning (ML) algorithms, that predicts a customer's rating of or preference for a product.

There are two common methods for building a recommender system:

1. Content-based recommendation
2. Collaborative filtering

Content-based recommendation matches items on their own attributes, here the keywords in a movie title.
Collaborative filtering finds similar users based on user-item interactions, here the movie ratings.

Here I have used both types to create a movie recommender system.
    


### Data set:
**MovieLens 25M Dataset**

This dataset (ml-25m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 25000095 ratings and 1093360 tag applications across 62423 movies. These data were created by 162541 users between January 09, 1995 and November 21, 2019. This dataset was generated on November 21, 2019.



### ML algorithm used:
**cosine_similarity**

Cosine similarity yields similarity scores between the search query and the selected text data (here, the movie titles). Content with higher similarity scores is recommended first.
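As a rough illustration of the measure itself, here is a minimal NumPy sketch. The helper `cosine` and the three word-count vectors are made up for this example; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word-count vectors over the vocabulary ["harry", "potter", "rings"]
toy_query = np.array([1.0, 1.0, 0.0])       # "harry potter"
toy_same_words = np.array([2.0, 2.0, 0.0])  # same words, each repeated twice
toy_other = np.array([0.0, 0.0, 1.0])       # "rings"

print(cosine(toy_query, toy_same_words))  # close to 1: same direction
print(cosine(toy_query, toy_other))       # 0: no words in common
```

Because only the direction of the vectors matters, a longer title that repeats the same words scores as high as a short one.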

## Table of Contents
Import Libraries

Checking DataSet

Recommender system: Title Similarity
>Creating Search Function

Recommender system: User Ratings

Movie Recommendations



In [2]:
import numpy as np 
import pandas as pd
from colorama import init, Fore, Back, Style


import ipywidgets as wid
from IPython.display import display

In [3]:
movie_df=pd.read_csv('movies.csv')
movie_df.sample(5)

Unnamed: 0,movieId,title,genres
37405,152575,"Butcher, Baker, Nightmare Maker (1982)",Horror
54825,190471,Chicken with Vinegar (1985),Comedy|Crime|Mystery|Romance
14168,73344,"Prophet, A (Un Prophète) (2009)",Crime|Drama
15270,80622,Rhapsody in Blue (1945),Drama|Musical
16114,85014,Autumn Leaves (1956),Drama


## Cleaning the dataset

In [4]:
import re

def clean_title(title):
    # keep only letters, digits and spaces so punctuation does not affect matching
    title = re.sub(r"[^a-zA-Z0-9 ]", "", title)
    return title

In [5]:
movie_df["clean_title"] = movie_df["title"].apply(clean_title)

In [6]:
movie_df.sample(5)

Unnamed: 0,movieId,title,genres,clean_title
45617,170755,Small Town Killers (2017),Comedy|Crime,Small Town Killers 2017
3152,3245,I Am Cuba (Soy Cuba/Ya Kuba) (1964),Drama,I Am Cuba Soy CubaYa Kuba 1964
25804,124550,Monday Night Mayhem (2002),Drama,Monday Night Mayhem 2002
14261,73983,"Maid, The (Nana, La) (2009)",Drama,Maid The Nana La 2009
60129,202531,The Last Resort (2018),Documentary,The Last Resort 2018


In [7]:
movie_df.to_csv("movies_Cleaned.csv")

## Recommender system: Title Similarity

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))

tfidf = vectorizer.fit_transform(movie_df["clean_title"])
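
To get a feel for what `ngram_range=(1,2)` does, the sketch below fits a separate, throwaway vectorizer on one made-up cleaned title and lists the features it learns. The names `toy_vectorizer` and `toy_features` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Throwaway vectorizer with the same n-gram setting as the notebook
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_vectorizer.fit(["Harry Potter 2002"])

# vocabulary_ maps each learned feature (single words and word pairs) to a column index
toy_features = sorted(toy_vectorizer.vocabulary_)
print(toy_features)
```

Word pairs such as `harry potter` let a query match on short phrases, not just on individual words.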

In [9]:
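from sklearn.metrics.pairwise import cosine_similarity

def search(title):
    # e.g. title = "Harry Potter"
    # clean the query the same way the stored titles were cleaned
    title = clean_title(title)
    query = vectorizer.transform([title])
    sim = cosine_similarity(query, tfidf).flatten()
    # argpartition finds the 7 most similar titles but leaves them unordered,
    # so sort those 7 by similarity, best match first
    top = np.argpartition(sim, -7)[-7:]
    top = top[np.argsort(sim[top])[::-1]]
    return movie_df.iloc[top]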


Now we get results related to our query. Instead of passing the query in code,
I am going to build an interactive search box using ipywidgets.

In [36]:
movie_search = wid.Text(
    value=" ",
    description="Movie Title:",
    disabled=False
)

movie_list = wid.Output()

def search_type(data):
    # called on every change of the text box; data["new"] holds the current text
    with movie_list:
        movie_list.clear_output()
        display(data)
        title = data["new"]
        if len(title) > 3:
            display(search(title))

movie_search.observe(search_type, names="value")



display(movie_search,movie_list)

Text(value=' ', description='Movie Title:')

Output()

The new widget-based search box is ready! 👻 It returns at most 7 results.
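
The limit of 7 comes from the `np.argpartition` call inside the search function. A small sketch with made-up scores (the array `toy_sim` and `k = 3` are assumptions for illustration) shows how `argpartition` selects the top entries, and why an extra sort is needed if the best match should come first:

```python
import numpy as np

# Made-up similarity scores for six hypothetical titles
toy_sim = np.array([0.10, 0.90, 0.30, 0.70, 0.05, 0.50])
k = 3

# argpartition guarantees the k largest scores sit in the last k slots,
# but it does not promise any particular order among them
top_k = np.argpartition(toy_sim, -k)[-k:]

# Sorting just those k indices by score puts the best match first
top_k_sorted = top_k[np.argsort(toy_sim[top_k])[::-1]]

print(sorted(top_k.tolist()))  # [1, 3, 5]
print(top_k_sorted.tolist())   # [1, 3, 5], i.e. scores 0.90, 0.70, 0.50
```

Partitioning first and sorting only the selected entries avoids sorting the whole similarity array.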

## Recommender system: User Ratings

In [11]:
ratings=pd.read_csv('ratings.csv')
ratings.sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
20822420,135429,3018,4.0,1057597963
4817036,31458,2710,2.0,1111474992
3346990,22175,4801,3.0,1197794459
12715141,82233,79,4.0,834329435
1288314,8680,1985,1.0,944913392


In [12]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

All columns are ints and floats.

In [13]:
# let's look up Lord of the Rings: The Two Towers by its movieId
movie_id=5952
movie = movie_df[movie_df["movieId"] == movie_id]
movie

Unnamed: 0,movieId,title,genres,clean_title
5840,5952,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,Lord of the Rings The Two Towers The 2002


In [14]:
# users who liked the same movie (rated it above 4)
similar_users=ratings[(ratings["movieId"]==movie_id)&(ratings["rating"]>4)]["userId"].unique()

In [15]:
similar_users

array([     2,      4,     13, ..., 162532, 162533, 162541], dtype=int64)

In [16]:
len(similar_users)

25118

The users above share your preference for this movie 🖖


Those similar users also watched and liked:

In [17]:
similar_users_liked=ratings[(ratings["userId"].isin(similar_users))&(ratings["rating"]>4)]

In [18]:
similar_users_liked

Unnamed: 0,userId,movieId,rating,timestamp
72,2,110,5.0,1141416589
74,2,151,4.5,1141415643
76,2,260,5.0,1141417172
79,2,318,5.0,1141417181
80,2,333,5.0,1141415931
...,...,...,...,...
25000085,162541,8983,4.5,1240953211
25000086,162541,31658,4.5,1240953287
25000089,162541,45517,4.5,1240953353
25000090,162541,50872,4.5,1240953372


In [19]:
similar_users_liked.value_counts()

userId  movieId  rating  timestamp 
2       110      5.0     1141416589    1
109221  59501    5.0     1267892243    1
        68358    5.0     1267891285    1
        68269    5.0     1267889841    1
        68157    5.0     1267928191    1
                                      ..
55318   293      4.5     1243118031    1
        260      5.0     1237763268    1
        50       4.5     1237763359    1
55316   74458    5.0     1266874885    1
162541  63876    5.0     1240952515    1
Length: 1795254, dtype: int64

Let's pick the movies that many of the similar users liked, so that a cutoff can turn them into a recommendation list.

In [20]:
similar_users_watchlist=ratings[(ratings["userId"].isin(similar_users))&(ratings["rating"]>4)]["movieId"]
similar_users_watchlist

72            110
74            151
76            260
79            318
80            333
            ...  
25000085     8983
25000086    31658
25000089    45517
25000090    50872
25000094    63876
Name: movieId, Length: 1795254, dtype: int64

Above are some of the movies liked by similar users.

In [21]:
similar_users_watch_cutoff = similar_users_watchlist.value_counts() / len(similar_users)

similar_users_watch_cutoff = similar_users_watch_cutoff[similar_users_watch_cutoff > .10]
similar_users_watch_cutoff

5952     1.000000
4993     0.835496
7153     0.805478
2571     0.522653
318      0.476272
           ...   
81834    0.102357
3000     0.102078
588      0.101959
1580     0.101242
33493    0.100406
Name: movieId, Length: 131, dtype: float64
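
The cutoff step can be tried on a toy table. This is only a sketch: the ratings below are invented, and the threshold is raised to 50% so that the effect is visible with four users (the notebook uses 10%).

```python
import pandas as pd

# Toy data, invented for illustration: movies liked by 4 "similar" users
toy_liked = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 4],
    "movieId": [10, 20, 10, 30, 10, 20],
})
n_similar = toy_liked["userId"].nunique()  # 4 users

# Fraction of similar users who liked each movie
toy_share = toy_liked["movieId"].value_counts() / n_similar

# Keep only movies liked by more than half of the similar users
toy_cutoff = toy_share[toy_share > 0.5]
print(toy_cutoff)
```

Movie 10 is liked by 3 of 4 users (0.75) and survives; movies 20 (0.50) and 30 (0.25) are dropped.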

To get more personalized results, I have to discount movies that everyone likes, such as blockbusters.

In [22]:
all_users=ratings[(ratings["movieId"].isin(similar_users_watch_cutoff.index))&(ratings["rating"]>4)]

In [23]:
all_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
23,1,3949,5.0,1147868678
29,1,4973,4.5,1147869080
37,1,6016,5.0,1147869090
48,1,7361,5.0,1147880055
...,...,...,...,...
25000058,162541,4995,5.0,1240951903
25000062,162541,5618,4.5,1240953299
25000065,162541,5952,5.0,1240952617
25000078,162541,7153,5.0,1240952613


These are the high ratings (above 4) given by all users to the movies that made the cutoff in our similar-users data.

In [24]:
print(Style.BRIGHT + Back.RED + Fore.BLACK, "Total no of users")
len(all_users["userId"].unique())


Total no of users


150512

In [25]:
print(Style.BRIGHT + Back.RED + Fore.BLACK, "Movies watch count:")
all_users["movieId"].value_counts() 


Movies watch count:


318      51678
296      42988
2571     36851
356      35527
593      34114
         ...  
88125     5367
68358     5268
81834     5159
5349      5015
33493     4003
Name: movieId, Length: 131, dtype: int64

In [26]:
all_user_watchlist=all_users["movieId"].value_counts() / len(all_users["userId"].unique())
 

In [27]:
all_user_watchlist

318      0.343348
296      0.285612
2571     0.244838
356      0.236041
593      0.226653
           ...   
88125    0.035658
68358    0.035001
81834    0.034276
5349     0.033320
33493    0.026596
Name: movieId, Length: 131, dtype: float64

The movies at the top are most probably well liked by everyone. Let's check by comparing them with the similar-users list.

In [28]:
records=pd.concat([similar_users_watch_cutoff,all_user_watchlist],axis=1)
records.columns=["similar","all"]

In [29]:
records

Unnamed: 0,similar,all
1,0.195875,0.125140
32,0.148021,0.100623
47,0.241620,0.144945
50,0.273549,0.201173
110,0.246994,0.161402
...,...,...
91529,0.152003,0.055072
99114,0.128235,0.057384
109487,0.164225,0.074286
112852,0.117525,0.043086


In [30]:
records["score"]=records["similar"]/records["all"]

In [31]:
records=records.sort_values("score",ascending=False)
records

Unnamed: 0,similar,all,score
5952,1.000000,0.166884,5.992197
7153,0.805478,0.173142,4.652115
4993,0.835496,0.188131,4.441031
33493,0.100406,0.026596,3.775249
5349,0.109364,0.033320,3.282266
...,...,...,...
750,0.122900,0.095149,1.291663
593,0.290469,0.226653,1.281558
912,0.117326,0.092903,1.262891
150,0.102675,0.090876,1.129834


Let's take the highest-scoring movies for recommendation (`head(15)` keeps up to 15).

In [32]:
records.head(15).merge(movie_df,left_index=True,right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
5840,1.0,0.166884,5.992197,5952,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,Lord of the Rings The Two Towers The 2002
7028,0.805478,0.173142,4.652115,7153,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy,Lord of the Rings The Return of the King The 2003
4887,0.835496,0.188131,4.441031,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy,Lord of the Rings The Fellowship of the Ring T...
9952,0.100406,0.026596,3.775249,33493,Star Wars: Episode III - Revenge of the Sith (...,Action|Adventure|Sci-Fi,Star Wars Episode III Revenge of the Sith 2005
5241,0.109364,0.03332,3.282266,5349,Spider-Man (2002),Action|Adventure|Sci-Fi|Thriller,SpiderMan 2002
7742,0.159209,0.048581,3.277188,8368,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy|IMAX,Harry Potter and the Prisoner of Azkaban 2004
5704,0.12071,0.036834,3.277118,5816,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy,Harry Potter and the Chamber of Secrets 2002
10408,0.123417,0.037984,3.249223,40815,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller|IMAX,Harry Potter and the Goblet of Fire 2005
6416,0.249303,0.07711,3.233081,6539,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy,Pirates of the Caribbean The Curse of the Blac...
4790,0.147106,0.047418,3.102307,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy,Harry Potter and the Sorcerers Stone aka Harry...


Finally, we get pretty good recommendations based on the computed scores.
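
To make the score concrete, here is a small sketch with invented fractions. The two movie labels and their numbers are hypothetical; only the `similar / all` ratio mirrors the notebook.

```python
import pandas as pd

# Invented fractions: share of similar users vs. share of all users who liked each movie
toy_similar = pd.Series({"niche_sequel": 0.30, "blockbuster": 0.50})
toy_all = pd.Series({"niche_sequel": 0.05, "blockbuster": 0.40})

toy_records = pd.concat([toy_similar, toy_all], axis=1)
toy_records.columns = ["similar", "all"]

# A ratio well above 1 means similar users like the movie far more than users in general
toy_records["score"] = toy_records["similar"] / toy_records["all"]
toy_records = toy_records.sort_values("score", ascending=False)
print(toy_records)
```

The blockbuster is liked by more similar users in absolute terms, but it is liked almost as often by everyone else, so its score (about 1.25) ends up below that of the niche sequel (about 6).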

In [33]:
def find_Simialr_users(movieId):
    # users who rated this movie above 4
    similar_users=ratings[(ratings["movieId"]==movieId)&(ratings["rating"]>4)]["userId"].unique()
    # every movie those users rated above 4
    similar_users_liked=ratings[(ratings["userId"].isin(similar_users))&(ratings["rating"]>4)]["movieId"]

    # keep movies liked by more than 10% of the similar users
    similar_users_liked= similar_users_liked.value_counts() / len(similar_users)
    similar_users_liked = similar_users_liked[similar_users_liked > .10]

    # how popular those same movies are among all users
    all_users=ratings[(ratings["movieId"].isin(similar_users_liked.index))&(ratings["rating"]>4)]
    all_user_watchlist=all_users["movieId"].value_counts() / len(all_users["userId"].unique())

    records=pd.concat([similar_users_liked,all_user_watchlist],axis=1)
    records.columns=["similar","all"]

    # score above 1: similar users like the movie more than users in general
    records["score"]=records["similar"]/records["all"]
    records=records.sort_values("score",ascending=False)
    return records.head(15).merge(movie_df,left_index=True,right_on="movieId")[["score","title","genres"]]

In [34]:
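from sklearn.metrics.pairwise import cosine_similarity

def search_tittle(tittle):
    # clean the query the same way the stored titles were cleaned
    tittle=clean_title(tittle)
    queary=vectorizer.transform([tittle])
    sim=cosine_similarity(queary,tfidf).flatten()
    # argpartition leaves the 7 best matches unordered,
    # so sort them to make row 0 the closest title
    i=np.argpartition(sim,-7)[-7:]
    i=i[np.argsort(sim[i])[::-1]]
    result=movie_df.iloc[i]
    return result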

In [37]:
movie_search = wid.Text(
    value=" ",
    description="Movie Title:",
    disabled=False
)

movie_list = wid.Output()

def search_type(data):
    with movie_list:
        movie_list.clear_output()
        display(data)
        tittle = data["new"]
        if len(tittle) > 3:
            # use the first search result as the movie to recommend from
            search_result = search_tittle(tittle)
            movie_id = search_result.iloc[0]["movieId"]
            display(find_Simialr_users(movie_id))

movie_search.observe(search_type, names="value")



display(movie_search,movie_list)

Text(value=' ', description='Movie Title:')

Output()