## Step 1: Load the Dependencies

In this notebook, we will generate item-item recommendations using a technique called collaborative filtering (https://en.wikipedia.org/wiki/Collaborative_filtering). Let's get started!

In [3]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Step 2: Load the Data

Let's download a small version of the MovieLens dataset, available as a zip file at http://files.grouplens.org/datasets/movielens/ml-latest-small.zip. From ml-latest-small.zip, we need two files in a local data/ directory: ratings.csv and movies.csv.
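If you would rather fetch the files from code than by hand, the following is a minimal sketch using only the standard library. The helper names `extract_csvs` and `download_movielens` are hypothetical, and the sketch assumes the notebook reads from a `data/` folder, as the cells below do. It copies any `ratings.csv` and `movies.csv` found in the archive, whatever folder they sit in.

```python
import io
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the text above; the helper names below are hypothetical.
MOVIELENS_URL = "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
WANTED = ("ratings.csv", "movies.csv")


def extract_csvs(zip_source, dest="data", wanted=WANTED):
    """Copy the wanted CSV files out of a zip archive into dest, flattening folders."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            name = Path(member).name  # drop any folder prefix inside the archive
            if name in wanted:
                target = dest_dir / name
                target.write_bytes(archive.read(member))
                written.append(target)
    return written


def download_movielens(url=MOVIELENS_URL, dest="data"):
    """Download the archive into memory and extract the CSVs we need."""
    with urllib.request.urlopen(url) as response:
        payload = io.BytesIO(response.read())
    return extract_csvs(payload, dest)


# Uncomment to fetch the files (requires network access):
# download_movielens()
```

`extract_csvs` accepts either a path to a zip file that is already on disk or a file-like object, so it can also be pointed at a manually downloaded archive.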


In [4]:
ratings = pd.read_csv("data/ratings.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
movies = pd.read_csv("data/movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Step 3: Exploratory Data Analysis

In Part 1 of this tutorial series, we will focus on the ratings dataset; we'll need movies in subsequent sections. The ratings dataset holds one row for each rating a user gave a movie. Let's see how many ratings, unique movies, and unique users it contains.



In [8]:
n_ratings = len(ratings)
n_movies = ratings['movieId'].nunique()
n_users = ratings['userId'].nunique()


print(f"Number of ratings: {n_ratings}")
print(f"Number of unique movieId's: {n_movies}")
print(f"Number of unique users: {n_users}")
print(f"Average number of ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average number of ratings per movie: {round(n_ratings/n_movies, 2)}")

Number of ratings: 100836
Number of unique movieId's: 9724
Number of unique users: 610
Average number of ratings per user: 165.3
Average number of ratings per movie: 10.37
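The two averages above are ratios of totals. As a quick sanity check, the sketch below shows on a small made-up frame (the `toy` data is purely illustrative, not taken from MovieLens) that each ratio equals the mean of the per-user or per-movie rating counts.

```python
import pandas as pd

# Toy data (made up for illustration): 3 users, 4 movies, 6 ratings.
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 40, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_ratings = len(toy)
n_users = toy["userId"].nunique()
n_movies = toy["movieId"].nunique()

# Ratio of totals, as in the cell above.
avg_per_user = n_ratings / n_users
avg_per_movie = n_ratings / n_movies

# The same quantities computed as the mean of per-group row counts.
mean_user_count = toy.groupby("userId").size().mean()
mean_movie_count = toy.groupby("movieId").size().mean()

print(avg_per_user, mean_user_count)
print(avg_per_movie, mean_movie_count)
```

The two numbers on each printed line agree because the per-group counts always sum to the total number of rows.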



Now, let's take a look at how many ratings each user has made. We can do this with pandas' groupby() and count(), which group the data by userId and count the number of ratings for each user.

In [11]:
user_freq = ratings[['userId','movieId']].groupby('userId').count().reset_index()
user_freq.columns = ['userId', 'n_ratings']
user_freq

Unnamed: 0,userId,n_ratings
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44
...,...,...
605,606,1115
606,607,187
607,608,831
608,609,37
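For comparison, here is a small sketch of the same per-user count on made-up data (the `toy` frame is illustrative only), alongside an alternative that uses `size()`. Both give the same table here; which one to prefer is a matter of taste.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 40, 20],
})

# Approach from the cell above: count movieId entries per user, then rename.
via_count = toy[["userId", "movieId"]].groupby("userId").count().reset_index()
via_count.columns = ["userId", "n_ratings"]

# Alternative: size() counts rows per group and names the column in one step.
via_size = toy.groupby("userId").size().reset_index(name="n_ratings")

print(via_size)
```

One difference to keep in mind: `count()` skips missing values in the counted column, while `size()` counts every row. The two only agree when `movieId` has no missing entries.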
