# Predicting Movie Ratings with a Decision Tree

In this report, we build, step by step, a decision tree that predicts user ratings of movies using the MovieLens dataset, available at http://grouplens.org/datasets/movielens/

The dataset consists of three main files:

**movies.dat:** this file contains information about every movie, in the format < movie id > :: < movie name > :: < pipe-separated list of genres >

**users.dat:** this file contains information about every user, in the format < user id > :: < user gender > :: < user age > :: < occupation > :: < zip code >

**ratings.dat:** this file contains information about every rating, in the format < user id > :: < movie id > :: < rating > :: < timestamp >
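
To make the three formats concrete, here is a minimal sketch that splits one record of each kind on the `::` separator. The helper `parse_line`, the field names, and the sample records are illustrative assumptions written for Python 3; they are not part of the notebook's own code.

```python
def parse_line(line, field_names):
    """Split one '::'-separated record into a dict keyed by field name."""
    values = line.rstrip('\n').split('::')
    return dict(zip(field_names, values))

# Sample records written in the three formats described above (illustrative).
movie = parse_line("1::Toy Story (1995)::Animation|Children's|Comedy\n",
                   ['movie_id', 'name', 'genres'])
user = parse_line("1::F::1::10::48067\n",
                  ['user_id', 'gender', 'age', 'occupation', 'zip_code'])
rating = parse_line("1::1193::5::978300760\n",
                    ['user_id', 'movie_id', 'rating', 'timestamp'])

# The genre field is itself a pipe-separated list.
movie['genres'] = movie['genres'].split('|')

print(movie)
print(user)
print(rating)
```

Every value comes back as a string, so numeric fields such as the rating or the age still need an explicit conversion before they can be compared numerically.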


## Pre-processing data

The first step in this project is to join the data from the three files into a single table of features and labels. For the genre information, we add one binary variable per genre to each movie.
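
As a minimal sketch of that genre encoding, the snippet below turns a pipe-separated genre string into one flag per known genre. The function name `genre_flags` and the shortened genre list are assumptions made for illustration (Python 3), not the notebook's implementation.

```python
# Illustrative subset of genres; the real list is longer.
known_genres = ['Action', 'Comedy', 'Drama', 'Romance']

def genre_flags(genre_field, genres):
    """Map a pipe-separated genre string to one boolean per known genre."""
    movie_genres = set(genre_field.split('|'))
    return {genre: genre in movie_genres for genre in genres}

flags = genre_flags('Comedy|Romance', known_genres)
print(flags)
```

Each movie therefore contributes a fixed-width row of flags regardless of how many genres it belongs to, which is what lets the tree later test a single genre at a time.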

The function **build_features** does all the processing and returns a single pandas dataframe that holds the features together with the rating label column.

In [1]:
import pandas as pd
from IPython.display import display

def build_features():
    """
    read data from movies, users and rating and return a single pandas dataframe
    of the joined tables, containing all info on ratings
    """
    # movie genres
    genres = [
        'Action',
        'Adventure',
        'Animation',
        'Children\'s',
        'Comedy',
        'Documentary',
        'Drama',
        'Fantasy',
        'Film-Noir',
        'Horror',
        'Musical',
        'Mystery',
        'Romance',
        'Sci-Fi',
        'Thriller',
        'War',
        'Western'
    ]
    # reading movies data
    movies_df = pd.DataFrame(columns=['movie_id'] + genres)
    with open('../data/movies.dat', 'r') as file:
        lines = file.readlines()
        for idx, line in enumerate(lines):
            row = line.rstrip('\n').split("::")
            movie_genres = row[-1].split('|')
            row = row[:-2]  # keep only the movie id; drop name and genre list
            for genre in genres:
                # one flag per genre, stored as the string 'True' or 'False'
                row.append(str(genre in movie_genres))
            movies_df.loc[idx] = row
    
    # reading users data
    users_df = pd.read_table('../data/users.dat',
                             names=['user_id', 'gender', 'age', 'ocupation', 'zip_code'],
                             sep='::', engine='python')
    # ids and occupation codes are categorical, so store them as strings
    users_df['user_id'] = users_df['user_id'].astype(str)
    users_df['ocupation'] = users_df['ocupation'].astype(str)
    
    # reading ratings data
    ratings_df = pd.read_table('../data/ratings.dat',
                               names=['user_id', 'movie_id', 'rating', 'timestamp'],
                               sep='::', engine='python')
    # convert the join keys to strings so they match the other tables
    ratings_df['movie_id'] = ratings_df['movie_id'].astype(str)
    ratings_df['user_id'] = ratings_df['user_id'].astype(str)
    
    # join tables
    user_ratings_df = ratings_df.merge(users_df, how='inner', on='user_id')
    features_df = user_ratings_df.merge(movies_df, how='inner', on='movie_id')
    
    # drop unwanted features
    features_df = features_df.drop(['timestamp', 'zip_code'], axis=1)
    
   
    # print first 5 rows
    display(features_df.head()) 
    
    return features_df
    
data = build_features()

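|   | user_id | movie_id | rating | gender | age | ocupation | Action | Adventure | Animation | Children's | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | F | 1 | 10 | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 2 | 1193 | 5 | M | 56 | 16 | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 12 | 1193 | 4 | M | 25 | 12 | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 15 | 1193 | 4 | M | 25 | 7 | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 17 | 1193 | 5 | M | 50 | 1 | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |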


## Building the Tree


#### Splitting the dataset

We can now start building the decision tree. A good first step is a function that, given some data rows, a feature, and a value, splits the rows into two disjoint subsets according to that value.

In [2]:
def split_df(df, feature, value):
    """
    splits a data frame in two separate subsets: 
    one where all rows have row[feature] >= value for ints or floats and row[feature] == value for strings
    and another one where all rows have row[feature] < value for ints or floats and row[feature] != value for strings 
    """
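    if isinstance(value, (int, float)):
        # numeric feature: threshold split
        df1 = df[df[feature] >= value]
        df2 = df[df[feature] < value]
    else:
        # categorical feature: equality split
        df1 = df[df[feature] == value]
        df2 = df[df[feature] != value]

    # return the two row subsets themselves, not the boolean masks
    return df1, df2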
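
To show what such a split is expected to produce, here is a small self-contained sketch on a toy frame. The helper `split_rows` is a hypothetical stand-in written for this example (Python 3), and the toy values are made up; it follows the behaviour described in the docstring above, returning the two row subsets.

```python
import pandas as pd

def split_rows(df, feature, value):
    """Hypothetical stand-in: return (matching rows, remaining rows)."""
    if isinstance(value, (int, float)):
        mask = df[feature] >= value   # numeric: threshold split
    else:
        mask = df[feature] == value   # categorical: equality split
    # ~mask is the complement, so the two subsets are disjoint
    return df[mask], df[~mask]

# Made-up toy data, not taken from MovieLens.
toy = pd.DataFrame({
    'age': [18, 25, 35, 50],
    'gender': ['F', 'M', 'F', 'M'],
    'rating': [5, 3, 4, 2],
})

older, younger = split_rows(toy, 'age', 35)
women, others = split_rows(toy, 'gender', 'F')

print(older)
print(younger)
print(women)
print(others)
```

In both cases the two subsets together contain every row of the input exactly once, which is the property the tree relies on when it recurses into each branch.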

#### Counting labels

We now write a function that, given a dataframe, counts how many rows carry each label value, which will be useful when choosing the nodes of the tree.

In [3]:
def label_counts(df, label_name):
    """
    returns the count of each label value
    """
    return df[label_name].value_counts()
    
print(label_counts(data, 'rating'))

4    348971
3    261197
5    226310
2    107557
1     56174
Name: rating, dtype: int64


#### Entropy

We need a way to measure the homogeneity of a set of rows. We will use the standard entropy function for that.
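
For reference, the entropy of a set $S$ whose rows carry $k$ distinct labels is commonly written as follows, where $p_i$ is the fraction of rows in $S$ with the $i$-th label. The symbols $S$, $k$, and $p_i$ are notation introduced here for illustration.

```latex
H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i
```

The value is zero when every row has the same label and reaches its maximum, $\log_2 k$, when all $k$ labels are equally frequent.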

In [4]:
from math import log 
log2 = lambda x: log(x)/log(2)

def entropy(df, label_name):
    """
    returns the entropy in a dataset
    """
    counts = label_counts(df, label_name)
    s = 0.0
    size = len(df.index)  # size of the dataframe passed in, not the global data
    for _, count in counts.items():
        p = float(count)/size
        s -= p*log2(p)
    return s

print(entropy(data, 'rating'))

2.1002315644
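
To get a feel for the scale of this number, the sketch below computes the same quantity on three tiny label lists. The function `entropy_of_labels` is a hypothetical pure-Python helper written for this example (Python 3), and the label lists are made up.

```python
from collections import Counter
from math import log2

def entropy_of_labels(labels):
    """Entropy (in bits) of a list of label values."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

pure = entropy_of_labels([5, 5, 5, 5])      # a single label value
balanced = entropy_of_labels([1, 2, 3, 4])  # four equally frequent values
skewed = entropy_of_labels([4, 4, 4, 5])    # one dominant value

print(pure, balanced, skewed)
```

With five possible rating values the entropy can be at most log2(5), roughly 2.32, so the value of about 2.10 obtained on the full dataset indicates that the ratings are far from homogeneous before any split is made.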
