# Finding movies similar to a specific movie

We want to load the ratings from the file u.data (MovieLens 100k dataset) and merge them with the movie titles from u.item. The files have no headers, so we set the column names manually. The two frames are merged on movie_id.

In [None]:
import pandas as pd

rating_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('./ml-100k/u.data', sep='\t', names=rating_cols, usecols=range(3), encoding="ISO-8859-1")

movies_cols = ['movie_id', 'title']
movies = pd.read_csv('./ml-100k/u.item', sep='|', names=movies_cols, usecols=range(2), encoding="ISO-8859-1")

# Merge on the shared movie_id column so every rating row gets its title
ratings = pd.merge(movies, ratings, on='movie_id').sort_values(by='rating', ascending=False)


In [None]:
ratings.head(10)

From this data we can build a pivot table that shows how each user rated each movie. The matrix is sparse: most cells are empty (NaN), because each user rated only a small fraction of the movies. Filling every empty cell with zeros in a dense in-memory matrix would waste space, so it is better to use a sparse matrix and/or let pandas handle the missing values.

In [None]:
movies_ratings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
movies_ratings.head()
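
To make the sparsity concrete, here is a minimal sketch that measures the fraction of empty cells in a pivot table. The `toy` frame and its values are invented for illustration; on the real data the same expression can be applied to `movies_ratings`.

```python
import pandas as pd

# Toy ratings, invented for illustration: 3 users, 4 movies, 5 ratings
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['A', 'B', 'B', 'C', 'D'],
    'rating':  [5, 3, 4, 2, 5],
})

toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')

# Fraction of user/movie cells that hold no rating
sparsity = toy_matrix.isna().to_numpy().mean()
print(f"shape={toy_matrix.shape}, sparsity={sparsity:.2f}")
```

Here 7 of the 12 cells are empty. The real matrix is much sparser, since a typical user rates only a small share of all movies.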

We can see how users rated Good Will Hunting (1997):

In [None]:
specific_movie_ratings = movies_ratings['Good Will Hunting (1997)']
specific_movie_ratings.head()

We can use the pandas corrwith function to compute the correlation between the user rating vector of our movie and the rating vector of every other movie. A value of 1.0 is the highest level and means perfect positive correlation.

In [None]:
similar_movies = movies_ratings.corrwith(specific_movie_ratings)
similar_movies = similar_movies.dropna()
df = pd.DataFrame(similar_movies)
df.head(10)
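
The following sketch shows what corrwith does with missing ratings, using a tiny matrix invented for illustration (movies `A`, `B`, `C` are hypothetical). For each column, only the users who rated both movies contribute to the correlation, which we can check by computing one value by hand.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix, invented for illustration; NaN means "not rated"
toy = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0],
    'B': [4.0, 5.0, 2.0, np.nan],
    'C': [1.0, 2.0, 5.0, 4.0],
}, index=pd.Index([1, 2, 3, 4], name='user_id'))

# Correlation of every column with column A
corr = toy.corrwith(toy['A'])

# The same number for B computed by hand: keep only users who rated both A and B
both = toy[['A', 'B']].dropna()
manual_b = np.corrcoef(both['A'], both['B'])[0, 1]

print(corr)
print(manual_b)
```

Movie `C` is rated exactly opposite to `A`, so its correlation is -1, while `B` is rated similarly by the three users it shares with `A`.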

Finally, we can sort by the correlation value and see the recommended movies

In [None]:
similar_movies.sort_values(ascending=False).head(10)

We should remove niche movies from the data, that is, movies rated by only a few people, because a correlation computed from a handful of shared ratings is not reliable

In [None]:
# Number of ratings and mean rating per movie
movie_stats = ratings.groupby('title').agg({'rating': ['size', 'mean']})
movie_stats.head()
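
A small sketch of why niche movies distort the ranking, using invented data (`Popular` and `Niche` are hypothetical titles). When two movies share only two raters, the correlation is always exactly 1 or -1 as long as the ratings differ, no matter how the movies relate. The `min_periods` argument of `Series.corr` is one possible way to demand a minimum overlap; it is shown as an alternative to the filter used in this notebook, not a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy matrix, invented for illustration: Niche was rated by only two users
toy = pd.DataFrame({
    'Popular': [5.0, 4.0, 2.0, 1.0, 3.0],
    'Niche':   [np.nan, 3.0, np.nan, 1.0, np.nan],
})

# Number of users who rated both movies
overlap = toy[['Popular', 'Niche']].dropna().shape[0]

# Two shared points always lie on a straight line, so the correlation is perfect
niche_corr = toy['Popular'].corr(toy['Niche'])

# Requiring at least 3 shared ratings turns the unreliable value into NaN
guarded_corr = toy['Popular'].corr(toy['Niche'], min_periods=3)

print(overlap, niche_corr, guarded_corr)
```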

We keep only the movies rated by at least 100 people

In [None]:
popular_movies = movie_stats['rating']['size'] >= 100
print(popular_movies)
movie_stats[popular_movies].sort_values([('rating', 'mean')], ascending=False)[:20]

Now we can join the stats of the popular movies with the similarity scores (by title)

In [None]:
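# Select the 'rating' level to get flat columns (size, mean), then join the named similarity Series by title
df = movie_stats[popular_movies]['rating'].join(similar_movies.rename('similarity'))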

In [None]:
df.sort_values(['similarity'], ascending=False)[:10]

As an exercise, try other thresholds for the minimum number of people who rated a movie (initially 100)
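
To experiment with the threshold more easily, the steps of this notebook can be wrapped in one function. This is only a sketch: `similar_movies_for`, its parameters `min_ratings` and `top`, and the toy data are hypothetical names and values introduced for illustration. It assumes a frame with `user_id`, `title` and `rating` columns, like the merged `ratings` frame above.

```python
import pandas as pd

def similar_movies_for(ratings, title, min_ratings=100, top=10):
    """Hypothetical helper: rank movies by correlation with `title`,
    keeping only movies that have at least `min_ratings` ratings."""
    matrix = ratings.pivot_table(index='user_id', columns='title', values='rating')
    similarity = matrix.corrwith(matrix[title]).rename('similarity')
    stats = ratings.groupby('title')['rating'].agg(['size', 'mean'])
    popular = stats[stats['size'] >= min_ratings]
    result = popular.join(similarity).dropna(subset=['similarity'])
    # Drop the movie itself, which trivially correlates with itself
    result = result.drop(index=title, errors='ignore')
    return result.sort_values('similarity', ascending=False).head(top)

# Toy ratings, invented for illustration: D was rated by only two users
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2],
    'title':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D'],
    'rating':  [5, 4, 1, 2, 4, 5, 2, 1, 2, 5, 4, 5, 3],
})

strict = similar_movies_for(toy, 'A', min_ratings=3)
loose = similar_movies_for(toy, 'A', min_ratings=1)
print(strict)
print(loose)
```

With the loose threshold the rarely rated movie `D` jumps to the top of the list, while the strict threshold filters it out. On the real data a call could look like `similar_movies_for(ratings, 'Good Will Hunting (1997)', min_ratings=50)`.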