# Predicting Movie Ratings with a Decision Tree

In this report, we build, step by step, a decision tree that predicts user ratings of movies using the MovieLens dataset, available at http://grouplens.org/datasets/movielens/

The dataset consists of 3 main files:

**movies.dat:** this file contains information on all movies, in the format < movie id > :: < movie name > :: < pipe-separated list of genres >

**users.dat:** this file contains information on all users, in the format < user id > :: < user gender > :: < user age > :: < occupation > :: < zip code >

**ratings.dat:** this file contains information on all ratings, in the format < user id > :: < movie id > :: < rating > :: < timestamp >
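
As a quick illustration of these formats, the sketch below splits one line of each file on the `::` separator. The helper `parse_line` and the three sample lines are made up for this example; they are not taken from the dataset or from the report's code.

```python
def parse_line(line, sep="::"):
    """Split one raw line of a .dat file into its string fields."""
    return line.rstrip("\n").split(sep)

# Made-up sample lines that only illustrate the three layouts.
movie = parse_line("1::Some Movie (1995)::Comedy|Drama\n")
user = parse_line("1::F::1::10::00000\n")
rating = parse_line("1::1::5::978300760\n")

# The last movie field is itself a pipe-separated list of genres.
movie_id = int(movie[0])
title = movie[1]
movie_genres = movie[2].split("|")

print(movie_genres)
print(user)
print(rating)
```

Every field comes back as a string, so numeric fields such as ids, ages and ratings still need an explicit cast.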


## Pre-processing data

The first step in this project is to join the data from the 3 files into a feature matrix and a label vector. For the genre information, we add one binary variable per genre to each movie, set to 1 if the movie belongs to that genre and 0 otherwise.
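
The genre encoding can be sketched on a toy table. This is an alternative formulation using pandas' `Series.str.get_dummies`, not the loop used in this report, and the three rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: a movie id and a pipe-separated genre string.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "genres": ["Comedy|Drama", "Action", "Action|Comedy"],
})

# One 0/1 column per genre that occurs in the data.
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Keep the id and attach the binary genre columns next to it.
encoded = pd.concat([movies[["movie_id"]], genre_flags], axis=1)
print(encoded)
```

One difference worth noting: `get_dummies` only creates columns for genres that actually appear, whereas a fixed genre list guarantees the same columns regardless of the input.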

The function **build_features** does all the processing and returns two objects, X and y, corresponding to the features and the labels.

In [71]:
import pandas as pd

def build_features():
    """
    Read the movies, users and ratings files, join them into a single
    pandas DataFrame, and split it into features and labels.
    """
    # movie genres
    genres = [
        'Action',
        'Adventure',
        'Animation',
        'Children\'s',
        'Comedy',
        'Documentary',
        'Drama',
        'Fantasy',
        'Film-Noir',
        'Horror',
        'Musical',
        'Mystery',
        'Romance',
        'Sci-Fi',
        'Thriller',
        'War',
        'Western'
    ]
    # reading movies data
    movies_df = pd.DataFrame(columns=['movie_id'] + genres)
    with open('../data/movies.dat', 'r') as file:
        for idx, line in enumerate(file):
            row = line.strip().split("::")
            movie_genres = row[-1].split('|')
            row = row[:-2]  # keep only the movie id (drop title and genres)
            row[0] = int(row[0])  # cast id to integer
            # append one 0/1 flag per known genre
            for genre in genres:
                row.append(int(genre in movie_genres))
            movies_df.loc[idx] = row
    
    # reading users data
    users_df = pd.read_table('../data/users.dat',
                             names=['user_id', 'gender', 'age', 'ocupation', 'zip_code'],
                             sep='::', engine='python')
    users_df['gender'] = users_df['gender'].apply(lambda x: int(x == 'M'))
    
    # reading ratings data
    ratings_df = pd.read_table('../data/ratings.dat', 
                    names=['user_id', 'movie_id', 'rating', 'timestamp'], 
                    sep='::', engine='python')
    
    # join tables
    user_ratings_df = ratings_df.merge(users_df, how='inner', on='user_id')
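    # note: the outer join also keeps any movie that has no ratings,
    # as a row with NaN in the user and rating columns
    features_df = user_ratings_df.merge(movies_df, how='outer', on='movie_id')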
    
    # drop unwanted features
    features_df = features_df.drop(['timestamp', 'zip_code'], axis=1)
    
    # separate labels from features
    labels = features_df['rating']
    features = features_df.drop('rating', axis=1)
    
    # print first 5 rows
    print(features.head())
    print(labels.head())

    return features, labels

X, y = build_features()

   user_id  movie_id  gender   age  ocupation  Action  Adventure  Animation  \
0      1.0    1193.0     0.0   1.0       10.0     0.0        0.0        0.0   
1      2.0    1193.0     1.0  56.0       16.0     0.0        0.0        0.0   
2     12.0    1193.0     1.0  25.0       12.0     0.0        0.0        0.0   
3     15.0    1193.0     1.0  25.0        7.0     0.0        0.0        0.0   
4     17.0    1193.0     1.0  50.0        1.0     0.0        0.0        0.0   

   Children's  Comedy   ...     Fantasy  Film-Noir  Horror  Musical  Mystery  \
0         0.0     0.0   ...         0.0        0.0     0.0      0.0      0.0   
1         0.0     0.0   ...         0.0        0.0     0.0      0.0      0.0   
2         0.0     0.0   ...         0.0        0.0     0.0      0.0      0.0   
3         0.0     0.0   ...         0.0        0.0     0.0      0.0      0.0   
4         0.0     0.0   ...         0.0        0.0     0.0      0.0      0.0   

   Romance  Sci-Fi  Thriller  War  Western  