### The Power of Recommendation Engines

Suppose you're planning to buy a laptop but have no idea which configuration is right. You would probably ask friends and colleagues for recommendations, and they would suggest laptops based on your requirements, their own knowledge, and what is currently trending. Amazon does the same thing: it recommends laptops based on your previous searches and on popularity, and it keeps showing you its best recommendations, tempting you to buy even after you have dropped the plan. All the major companies build recommendations into their products; YouTube, for example, shows recommendations based on your interests and activity.

Before we explore how to implement one, note that there are two types of recommendation engine:

* Content Based Filtering
* Collaborative Filtering

#### Content Based Filtering
This algorithm recommends products which are similar to the ones that a user has liked in the past.

#### Collaborative Filtering
The collaborative filtering algorithm uses “User Behavior” for recommending items.
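
To make "user behavior" concrete, here is a minimal sketch of item-to-item collaborative filtering. The rating matrix is made up for illustration, and this is only one of several ways collaborative filtering can be done; it is not the method used in the rest of this kernel.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
])

# Item-item similarity computed purely from how users rated the items.
item_sim = cosine_similarity(ratings.T)

# Items ranked by similarity to item 0, excluding item 0 itself.
order = np.argsort(-item_sim[0])
most_similar_to_item0 = [int(i) for i in order if i != 0]
```

Item 1 comes out closest to item 0 because the same users rated both highly; no information about the items themselves is used.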

In this kernel, we look at a Content Based Filtering implementation.

Our task: when a user searches for a movie, we recommend the top 10 similar movies.

The implementation is simple: for each movie, we combine keywords from the several datasets provided into one bag of keywords, compute the similarity between every pair of movies, and return the most similar ones.
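
Before working with the real data, the idea can be sketched on a toy catalogue. The titles and keywords below are invented for illustration; only the overall flow (keywords, TF-IDF, cosine similarity, ranking) mirrors what this kernel does.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; titles and keywords are made up for illustration.
toy = pd.DataFrame({
    "title": ["Space Quest", "Galaxy Raid", "Tea at Noon"],
    "keywords": [
        "space aliens adventure",
        "space aliens battle",
        "romance drama countryside",
    ],
})

# Turn each keyword string into a TF-IDF vector, then compare all pairs.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy["keywords"])
sim = cosine_similarity(tfidf)

# Rank the other movies by similarity to "Space Quest" (row 0).
ranked = sim[0].argsort()[::-1]
closest = [toy["title"][i] for i in ranked if i != 0]
```

"Galaxy Raid" ranks first because it shares two keywords with "Space Quest", while "Tea at Noon" shares none.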

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
import os
os.listdir("../input/movielens-20m-dataset")
os.chdir("../input/movielens-20m-dataset/")

In [None]:
genome_tags = pd.read_csv("genome_tags.csv")

# We don't use the link dataset, since it has no useful features for prediction
link = pd.read_csv("link.csv")
genome_scores = pd.read_csv("genome_scores.csv")

# For efficiency and compatibility, we pick only the first 5000 rows
movies = pd.read_csv("movie.csv",nrows=5000)
rating = pd.read_csv("rating.csv")
tag = pd.read_csv("tag.csv")


In [None]:
# Dataset shape
print("genome_tags shape is {}".format(genome_tags.shape))
print("genome_scores shape is {}".format(genome_scores.shape))
print("movies shape is {}".format(movies.shape))
print("rating shape is {}".format(rating.shape))
print("tag shape is {}".format(tag.shape))

In [None]:
print(genome_scores.columns)
print(movies.columns)
print(rating.columns)
print(tag.columns)

The movieId feature is common to all the datasets; using it, we'll combine them into final_dataset.

In [None]:
# The genome_scores dataset has a relevance feature that says how relevant a tag is to a movie.
# Its value ranges from 0 to 1; we keep only tags with relevance above 0.5, which should give better
# predictions. We then merge in the tag names from the genome_tags dataset.
genome_scores = genome_scores[genome_scores['relevance']> 0.5].merge(genome_tags,on='tagId',how='left') 

# Concatenate all the tags that belong to a movie to form one tag collection per movie
genome_scores = genome_scores.groupby('movieId')['tag'].apply(' '.join).reset_index()
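
As a small illustration of the grouping step, the sketch below applies the same `groupby(...).apply(' '.join)` pattern to a hypothetical three-row frame with the same column names.

```python
import pandas as pd

# Hypothetical rows in the same shape as the merged genome data.
rows = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tag": ["animation", "pixar", "adventure"],
})

# One row per movie, with all of its tags joined into a single string.
collected = rows.groupby("movieId")["tag"].apply(" ".join).reset_index()
```

Movie 1 ends up with the single string `animation pixar`, and movie 2 with `adventure`.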

In [None]:
final_dataset = pd.merge(movies,genome_scores,on='movieId',how='left')

In [None]:
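# Rename tag to keywords
tag = tag.rename(columns={"tag": "keywords"})
# Assign the result back: fillna(inplace=True) on a selected column may not update the frame in recent pandas
tag['keywords'] = tag['keywords'].fillna('')
tag = tag.groupby('movieId')['keywords'].apply(' '.join).reset_index()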

In [None]:
final_dataset = pd.merge(final_dataset,tag,on='movieId',how='left')

In [None]:
final_dataset.head()

In [None]:
final_dataset['genres'].head()

In [None]:
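# Fill missing values before concatenating; otherwise a single NaN makes the whole result NaN
# and that movie's genres are lost
final_dataset['keywords'] = final_dataset['keywords'].fillna("") + " " + final_dataset['tag'].fillna("") + " " + \
    final_dataset['genres'].str.replace("|", " ", regex=False)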
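
A note on missing values in this step: because the datasets are joined with `how='left'`, a movie without tags gets NaN, and adding a string to NaN gives NaN. The sketch below, on a hypothetical two-row frame, shows how the order of filling and concatenating changes the result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second movie has genres but no tags.
df = pd.DataFrame({
    "keywords": ["toys", np.nan],
    "genres": ["Animation|Children", "Drama|War"],
})

# regex=False treats "|" as a literal character, not as regex alternation.
genres = df["genres"].str.replace("|", " ", regex=False)

# Concatenating first: the NaN swallows the genres of the second row.
naive = (df["keywords"] + " " + genres).fillna("")

# Filling first: the genres survive.
safe = df["keywords"].fillna("") + " " + genres
```

With the first order, the second movie ends up with an empty keyword string; with the second, it keeps `Drama War`.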

In [None]:
# rating will be used for collaborative filtering, so we skip it for now
# final_dataset = pd.merge(final_dataset,rating,on='movieId',how='left')

In [None]:
# Both tag and genres values have been added to keywords, so we drop these columns
final_dataset.drop(['tag','genres'],inplace=True,axis=1)

In [None]:
final_dataset.columns

In [None]:
c_vect = TfidfVectorizer(stop_words='english')
X = c_vect.fit_transform(final_dataset['keywords'])

In [None]:
# Other similarity/distance metrics are available, such as Euclidean distance, Manhattan distance and the Pearson coefficient,
# but cosine similarity is a common choice for sparse TF-IDF vectors
cosine_sim = cosine_similarity(X)
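
One detail worth knowing here: `TfidfVectorizer` normalises each row to unit length by default (`norm='l2'`), so a plain dot product already equals the cosine similarity. The sketch below checks this on a few made-up documents; `X_toy` is a hypothetical name used to avoid clashing with `X` above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["space aliens adventure", "space aliens battle", "romance drama"]
X_toy = TfidfVectorizer().fit_transform(docs)

# Rows are already L2-normalised, so the dot product (linear_kernel)
# gives the same matrix as cosine_similarity.
same = np.allclose(cosine_similarity(X_toy), linear_kernel(X_toy))
```

On larger matrices `linear_kernel` can be a slightly cheaper option, since it skips the normalisation that `cosine_similarity` repeats.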

In [None]:
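def get_movie_recommendation(movie_name):
    # Match the search text literally (regex=False), so characters such as "(" are not treated as a pattern
    idx = final_dataset[final_dataset['title'].str.contains(movie_name, regex=False)].index
    if len(idx):
        # Rank all movies by similarity to the first match; position 0 is normally the movie itself, so skip it
        movie_indices = sorted(enumerate(cosine_sim[idx[0]]), key=lambda x: x[1], reverse=True)[1:11]
        return [i[0] for i in movie_indices]
    return []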
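
The `[1:11]` slice relies on the searched movie sorting first. If another movie has an identical keyword vector, the two tie and the searched movie is not guaranteed to be dropped. A possible alternative, sketched here with a hypothetical helper named `top_k_similar`, is to exclude the movie by its position instead.

```python
import numpy as np

def top_k_similar(scores, own_index, k=10):
    """Return up to k positions with the highest scores, skipping own_index."""
    order = np.argsort(scores)[::-1]  # positions sorted from highest to lowest score
    return [int(i) for i in order if i != own_index][:k]

# Hypothetical similarity row for the movie at position 1.
scores = np.array([0.1, 1.0, 0.7, 0.3])
top_two = top_k_similar(scores, own_index=1, k=2)
```

Here `top_two` contains positions 2 and 3, the two highest-scoring movies other than the one at position 1.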

In [None]:
title = "Toy Story 2"
recommended_movie_list = get_movie_recommendation(title)
movies.iloc[recommended_movie_list].set_index('movieId')

Our system returns movies that are similar to Toy Story 2.

The major drawback of this approach is that it returns the same list of movies to every user who searches for Toy Story, irrespective of their interests and likes. To make predictions based on user behaviour, we need a different algorithm: collaborative filtering.

I'm writing another kernel on collaborative filtering and will update this one once it is completed.

**Please upvote it if you like it. Thanks**