### MovieLens dataset
Simple association rule extraction from the MovieLens dataset.  

The MovieLens 20M Dataset includes 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. It also includes tag genome data with 12 million relevance scores across 1,100 tags.  
Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.  

Here we use the Apriori algorithm to discover relations between movie ratings. If a user gave a certain movie a positive rating, which other movies will they also rate positively? This is a rudimentary form of recommendation system: if you liked one film, you will probably like another that is frequently liked alongside it.
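
To make the two thresholds used later concrete, here is a minimal sketch of how support and confidence are computed for a single rule. The toy transactions and the helper names `support` and `confidence` are illustrative assumptions; they are not part of this notebook's data or of efficient-apriori.

```python
# Toy transactions: each set holds the movie IDs one hypothetical user rated positively.
transactions = [
    {1, 2, 3},
    {1, 2},
    {1, 3},
    {2, 3},
    {1, 2, 3},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


# Rule {1} -> {2}: how often the pair occurs, and how often 2 appears given 1.
rule_support = support({1, 2}, transactions)
rule_confidence = confidence({1}, {2}, transactions)
print(f"supp={rule_support:.2f}, conf={rule_confidence:.2f}")
```

Apriori avoids enumerating every possible itemset by relying on the fact that any superset of an infrequent itemset is also infrequent, so candidates below the minimum support can be pruned early.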

In this notebook we:  
* Load and preprocess data  
* Select relevant information  
* Extract rules

Dataset available at:  
https://grouplens.org/datasets/movielens/   
Efficient-apriori  
https://pypi.org/project/efficient-apriori/

This notebook was made using Nabucodonosor - CCAD Universidad Nacional de Córdoba


In [1]:
from efficient_apriori import apriori
import pandas as pd
path = '/users/salcaide/Movielens/'

**Data Loading and Preprocessing**

In [2]:
ratings = pd.read_csv(path+'ratings.csv')

What's the form of our data?

In [3]:
print('Ratings data shape: ' + str(ratings.shape))

Ratings data shape: (20000263, 4)


In [4]:
ratings.head()

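|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|------------|
| 0 | 1 | 2 | 3.5 | 1112486027 |
| 1 | 1 | 29 | 3.5 | 1112484676 |
| 2 | 1 | 32 | 3.5 | 1112484819 |
| 3 | 1 | 47 | 3.5 | 1112484727 |
| 4 | 1 | 50 | 3.5 | 1112484580 |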


Are there duplicated rows or missing data?

In [5]:
ratings.duplicated().value_counts()

False    20000263
dtype: int64

In [6]:
ratings.isnull().values.any()

False

In [7]:
print('Unique users in database: ' + str(len(ratings.userId.unique())))

Unique users in database: 138493


In [8]:
ratings.shape[0]/len(ratings.userId.unique())

144.4135299257002

We have over 20 million movie ratings from 138,493 different users, so on average each user rated about 144 movies.  
To find the rules we are looking for, we assume that a rating of 3 points or above means a positive impression, or at least an acceptable film. So we keep only the ratings that meet that threshold.

In [9]:
len(ratings[ratings.rating >= 3].userId)

16486759

Movie titles are listed in a separate file, movies.csv.

In [10]:
movies = pd.read_csv(path+'movies.csv')

In [11]:
movies.head()

|   | movieId | title | genres |
|---|---------|-------|--------|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [12]:
len(movies)

27278

In [13]:
movies.duplicated().value_counts()

False    27278
dtype: int64

In [14]:
movies.isnull().values.any()

False

In [15]:
ratings.shape[0]/len(movies)

733.2012244299435

On average, each movie was rated about 733 times.

In [16]:
ratings.sort_values(by='userId',inplace=True)

In [17]:
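listby_user = []
# groupby yields (userId, group) pairs; keep each user's positively rated movie IDs
for _, user_ratings in ratings[ratings.rating >= 3].groupby('userId'):
    listby_user.append(list(user_ratings.movieId.values))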

In [18]:
transactions=listby_user
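
The grouping step above can be illustrated on a small scale without pandas. The sketch below uses made-up `(userId, movieId, rating)` rows, which are an assumption for illustration only, and applies the same cut-off of 3 to produce one transaction per user.

```python
from collections import defaultdict

# Hypothetical (userId, movieId, rating) rows standing in for ratings.csv.
rows = [
    (1, 10, 4.0),
    (1, 20, 2.5),
    (1, 30, 3.0),
    (2, 10, 5.0),
    (2, 30, 1.0),
    (3, 20, 3.5),
]

POSITIVE_THRESHOLD = 3  # same cut-off the notebook uses for a "liked" movie

# Collect, per user, the movies rated at or above the threshold.
liked = defaultdict(list)
for user_id, movie_id, rating in rows:
    if rating >= POSITIVE_THRESHOLD:
        liked[user_id].append(movie_id)

# One transaction (tuple of movie IDs) per user who liked at least one movie.
toy_transactions = [tuple(movie_ids) for movie_ids in liked.values()]
print(toy_transactions)
```

Users with no rating at or above the threshold produce no transaction at all, so the number of transactions can be smaller than the number of users.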

**Rule discovery**

In [19]:
from efficient_apriori import apriori
# Keep itemsets found in at least 9% of transactions and rules with confidence >= 0.3
itemsets, rules = apriori(transactions, min_support=0.09, min_confidence=0.3)

In [None]:
# Sort by ascending confidence, so the strongest rules are printed last
rules = sorted(rules, key=lambda rule: rule.confidence)
for rule in rules:
    print(rule)

{480} -> {10, 589} (conf: 0.300, supp: 0.115, lift: 2.313, conv: 1.243)
{480} -> {50, 296, 457} (conf: 0.300, supp: 0.115, lift: 1.954, conv: 1.209)
{593} -> {296, 356, 592} (conf: 0.300, supp: 0.131, lift: 1.883, conv: 1.201)
{110} -> {260, 1210, 1270} (conf: 0.300, supp: 0.108, lift: 1.811, conv: 1.192)
{1196} -> {260, 1258} (conf: 0.300, supp: 0.093, lift: 2.899, conv: 1.281)
{260} -> {1196, 1210, 1291, 2571} (conf: 0.300, supp: 0.111, lift: 2.573, conv: 1.262)
{1} -> {356, 589, 593} (conf: 0.300, supp: 0.100, lift: 1.609, conv: 1.162)
{150} -> {480, 589, 592, 593} (conf: 0.300, supp: 0.097, lift: 2.292, conv: 1.242)
{589} -> {296, 1214} (conf: 0.300, supp: 0.106, lift: 2.137, conv: 1.228)
{1198} -> {47, 296, 356} (conf: 0.300, supp: 0.091, lift: 1.712, conv: 1.178)
{1198} -> {260, 593, 1136, 1196} (conf: 0.300, supp: 0.091, lift: 2.909, conv: 1.281)
{1196} -> {1136, 4993} (conf: 0.300, supp: 0.093, lift: 2.517, conv: 1.258)
{1196} -> {296, 318, 593, 2571} (conf: 0.300, supp: 0.093,

**Results**

Some high-confidence rules are:   
{260, 589, 1198, 1210, 1240, 1270} -> {1196} (conf: 0.982, supp: 0.091, lift: 3.161, conv: 37.825)  
{1196, 1210, 2571, 5952, 7153} -> {4993} (conf: 0.979, supp: 0.093, lift: 3.910, conv: 34.878)  
{1210, 2571, 5952, 7153} -> {4993} (conf: 0.976, supp: 0.098, lift: 3.900, conv: 31.267)  
{260, 296, 1221} -> {858} (conf: 0.970, supp: 0.091, lift: 3.391, conv: 23.472)    
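
The four metrics printed with each rule are related. As a rough check, and assuming the standard definitions lift = conf / supp(consequent) and conviction = (1 - supp(consequent)) / (1 - conf), the sketch below recovers the consequent's support and the conviction from the rounded values of the first rule listed above.

```python
# Printed (rounded) values for the rule {260, 589, 1198, 1210, 1240, 1270} -> {1196}.
conf = 0.982
lift = 3.161

# lift = conf / supp(consequent), so the consequent's support can be recovered.
consequent_support = conf / lift

# conviction = (1 - supp(consequent)) / (1 - conf)
conviction = (1 - consequent_support) / (1 - conf)

print(f"supp(consequent) ~ {consequent_support:.3f}")
print(f"conviction ~ {conviction:.1f}")
```

This gives a consequent support of roughly 0.31 and a conviction of roughly 38. The small gap from the reported 37.825 is consistent with the inputs being rounded to three decimals, since conviction is very sensitive to `1 - conf` when confidence is close to 1.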

Let's look up the titles behind these movie IDs.


In [None]:
# Movie IDs that appear in the four rules above
names = [260, 296, 589, 858, 1196, 1198, 1210, 1221, 1240, 1270, 2571, 4993, 5952, 7153]
for movie in names:
    display(movies[movies['movieId'] == movie])

{Star Wars: Episode IV - A New Hope, Terminator 2: Judgment Day, Raiders of the Lost Ark (Indiana Jones), Star Wars: Episode VI - Return of the Jedi, The Terminator, Back to the Future} -> {Star Wars: Episode V - The Empire Strikes Back} (conf: 0.982, supp: 0.091, lift: 3.161, conv: 37.825)

{Star Wars: Episode V - The Empire Strikes Back, Star Wars: Episode VI - Return of the Jedi, The Matrix, The Lord of the Rings: The Two Towers, The Lord of the Rings: The Return of the King} -> {The Lord of the Rings: The Fellowship of the Ring} (conf: 0.979, supp: 0.093, lift: 3.910, conv: 34.878)

{Star Wars: Episode VI - Return of the Jedi, The Matrix, The Lord of the Rings: The Two Towers, The Lord of the Rings: The Return of the King} -> {The Lord of the Rings: The Fellowship of the Ring} (conf: 0.976, supp: 0.098, lift: 3.900, conv: 31.267)

{Star Wars: Episode IV - A New Hope, Pulp Fiction, The Godfather: Part II} -> {The Godfather} (conf: 0.970, supp: 0.091, lift: 3.391, conv: 23.472)

So, if a user liked  
* Star Wars Episodes IV and VI   
* Terminator 1 and 2   
* Indiana Jones   
* Back to the Future   

then they will probably enjoy watching Star Wars Episode V.  

Rules that seem obvious and agree with logic and common sense give us a first impression of the value of this algorithm.  


**Pending**  
* Exploring the dataset.   
* Searching for the most valuable rules based on metrics    
* Exploring other rules. Transactions may be arranged taking into account movie genres, year, or low ratings    
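
As a possible starting point for the metric-based search, the sketch below filters rules by a confidence floor and ranks the survivors by lift. `RuleStats` is a hypothetical stand-in for the rule objects, the threshold is arbitrary, and the numbers are copied from the Results section.

```python
from collections import namedtuple

# Hypothetical stand-in for rule objects; values copied from the Results section.
RuleStats = namedtuple("RuleStats", ["lhs", "rhs", "confidence", "lift"])

reported = [
    RuleStats((260, 589, 1198, 1210, 1240, 1270), (1196,), 0.982, 3.161),
    RuleStats((1196, 1210, 2571, 5952, 7153), (4993,), 0.979, 3.910),
    RuleStats((1210, 2571, 5952, 7153), (4993,), 0.976, 3.900),
    RuleStats((260, 296, 1221), (858,), 0.970, 3.391),
]

MIN_CONFIDENCE = 0.975  # illustrative threshold, not a recommendation

# Keep rules above the confidence floor, then rank by lift, highest first.
ranked = sorted(
    (r for r in reported if r.confidence >= MIN_CONFIDENCE),
    key=lambda r: r.lift,
    reverse=True,
)
for r in ranked:
    print(r.lhs, "->", r.rhs, f"lift={r.lift}")
```

The same filter and sort key should carry over to the `rules` list returned by `apriori`, assuming its items expose `confidence` and `lift` attributes in the way `rule.confidence` is used earlier in this notebook.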
