# Importing

Importing all the necessary libraries

In [1]:
import os
import pandas as pd
import numpy as np

import seaborn as sns
from numpy.linalg import svd
from sklearn.model_selection import train_test_split

# Data Processing

In [2]:
directory = "/Users/vbraun/Downloads/training_set/"

Code to append the movieId to every record in each source file, if this has not been done already. This allows all source files to be loaded into a dataframe with one line of code, without having to add the movieId separately before concatenating the source files.

In [3]:
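def movieId_to_source():
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        with open(path, 'r') as f:
            # Read once: a second readlines() call on the same handle returns [].
            lines = f.readlines()
        # Assumption: the first line of an unformatted file is the header '<movieId>:'.
        header = lines[0].strip()
        if ',' in header:
            # A header such as '1:,1' means the movieId was already appended.
            print('Already formatted, continuing')
            return
        # Take the movieId from the file's own header instead of a fixed counter.
        suffix = ',' + header.split(':')[0]
        with open(path, 'w') as f:
            f.writelines(line.strip() + suffix + '\n' for line in lines)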

In [4]:
movieId_to_source()

Already formatted, continuing
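As a non-destructive alternative, the movieId could be attached in memory while reading, instead of rewriting the source files. The following is a minimal sketch, assuming each unmodified file starts with a header line of the form `<movieId>:` followed by `userId,rating,date` rows (consistent with the header check above). `read_movie_file` is a hypothetical helper name that is not part of this notebook, and the sample rows are toy data.

```python
import io
import pandas as pd

def read_movie_file(file_obj):
    """Read one rating file and return a DataFrame with an itemId column.

    Assumes the first line is '<movieId>:' and the remaining lines are
    'userId,rating,date' rows (an assumption about the file layout).
    """
    header = file_obj.readline().strip()
    item_id = int(header.split(':')[0])
    # read_csv continues from the current position, i.e. after the header line.
    df = pd.read_csv(file_obj, header=None, names=['userId', 'rating', 'date'])
    df['itemId'] = item_id
    return df

# Toy example using an in-memory file instead of a real source file.
sample = io.StringIO("1:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n")
sample_df = read_movie_file(sample)
print(sample_df)
```

With this approach the files on disk stay untouched, so the step can be rerun safely without an "already formatted" check.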


Creating the movie dataframe by concatenating all the source files without their header line (skiprows=1), then naming the columns.

In [5]:
movie_df = pd.concat(
    pd.read_csv(os.path.join(directory, fname), skiprows=1, header=None)
    for fname in os.listdir(directory)
).rename(columns={0: 'userId', 1: 'rating', 2: 'date', 3: 'itemId'})

display(movie_df.head(5))

|   | userId | rating | date | itemId |
|---|---|---|---|---|
| 0 | 1488844 | 3 | 2005-09-06 | 1 |
| 1 | 822109 | 5 | 2005-05-13 | 1 |
| 2 | 885013 | 4 | 2005-10-19 | 1 |
| 3 | 30878 | 4 | 2005-12-26 | 1 |
| 4 | 823519 | 3 | 2004-05-03 | 1 |


Dropping the date column, as it is not relevant for this RecSys.

In [6]:
movie_df = movie_df.drop(columns='date')

### Data filtering

To allow for faster development, only the first 100 movies are selected. This restriction will be removed for the final model.

In [25]:
filtered_movie_df = movie_df[movie_df['itemId'] <= 100]
print('Length of dataset:',len(filtered_movie_df))

Length of dataset: 352771


To filter the dataset by activity and reduce the sparsity of the data, the data is grouped by movie and by user. The resulting tables show how many ratings each movie has received and how many ratings each user has given.

In [20]:
filtered_movie_count = filtered_movie_df[['itemId','userId']].groupby('itemId').count().reset_index().rename(columns={'userId':'user_count'})
filtered_user_count = filtered_movie_df[['itemId','userId']].groupby('userId').count().reset_index().rename(columns={'itemId':'item_count'})

display(filtered_movie_count.head(3),filtered_user_count.head(3))

|   | itemId | user_count |
|---|---|---|
| 0 | 1 | 547 |
| 1 | 2 | 145 |
| 2 | 3 | 2012 |


|   | userId | item_count |
|---|---|---|
| 0 | 6 | 1 |
| 1 | 7 | 4 |
| 2 | 42 | 1 |
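The same per-movie and per-user counts can be illustrated on a small made-up frame. This sketch uses `groupby(...).size()`, which counts the rows per group directly. The toy data and the `toy_*` variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy ratings: one row per (user, movie) rating.
toy_df = pd.DataFrame({
    'userId': [6, 7, 7, 7, 42, 42],
    'rating': [3, 5, 4, 2, 1, 4],
    'itemId': [1, 1, 2, 3, 2, 3],
})

# Number of ratings each movie has received.
toy_movie_count = toy_df.groupby('itemId').size().reset_index(name='user_count')
# Number of ratings each user has given.
toy_user_count = toy_df.groupby('userId').size().reset_index(name='item_count')

print(toy_movie_count)
print(toy_user_count)
```

Compared with `count()` followed by `rename`, `size()` names the count column in one step and does not depend on which other column happens to be selected.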


To reduce the sparsity of the dataset, we filter out the users who have rated 10% or less of the total number of movies.

In [26]:
filtered_movie_df = filtered_movie_df[filtered_movie_df['userId'].isin(filtered_user_count[filtered_user_count['item_count']/len(filtered_movie_count) > 0.10]['userId'])]

print('Length of dataset:',len(filtered_movie_df))

Length of dataset: 8380
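To check whether a user filter of this kind actually reduces sparsity, the fraction of empty cells in the user-item matrix can be measured before and after filtering. The following is a minimal sketch on toy data. The `sparsity` helper, the toy frame, and the 0.30 threshold are assumptions for illustration (the notebook uses 0.10), and the helper assumes each user rates a movie at most once.

```python
import pandas as pd

def sparsity(ratings):
    """Fraction of empty cells in the user-item matrix implied by `ratings`.

    Assumes each (userId, itemId) pair appears at most once.
    """
    n_users = ratings['userId'].nunique()
    n_items = ratings['itemId'].nunique()
    return 1.0 - len(ratings) / (n_users * n_items)

toy_df = pd.DataFrame({
    'userId': [1, 1, 1, 1, 2, 3],
    'itemId': [1, 2, 3, 4, 1, 2],
    'rating': [5, 4, 3, 2, 4, 1],
})

min_fraction = 0.30  # toy threshold; the notebook uses 0.10
n_items = toy_df['itemId'].nunique()

# Per-row count of how many movies that row's user has rated.
item_count = toy_df.groupby('userId')['itemId'].transform('count')
active_df = toy_df[item_count / n_items > min_fraction]

print('Sparsity before:', sparsity(toy_df))
print('Sparsity after: ', sparsity(active_df))
```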


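Finally, the movies that have been rated by 50 or fewer users are filtered out of the dataset. Note that `filtered_movie_count` was computed before the user filter above, so these counts refer to the data before the less active users were removed.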

In [27]:
filtered_movie_df = filtered_movie_df[filtered_movie_df['itemId'].isin(filtered_movie_count[filtered_movie_count['user_count']>50]['itemId'])]

print('Length of dataset:',len(filtered_movie_df))

Length of dataset: 8380
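The cell above filters with `filtered_movie_count`, which was built before the user filter was applied. If the intent is to count only the ratings that remain, the counts could be recomputed on the current frame. A minimal sketch on toy data follows. The toy frame, the `min_users` name, and its value of 2 are assumptions for illustration (the notebook uses 50).

```python
import pandas as pd

toy_df = pd.DataFrame({
    'userId': [1, 2, 3, 1, 2, 1],
    'itemId': [10, 10, 10, 20, 20, 30],
    'rating': [4, 5, 3, 2, 4, 5],
})

min_users = 2  # toy threshold; the notebook uses 50

# Count raters per movie on the frame as it currently stands,
# broadcast back to one value per row.
user_count = toy_df.groupby('itemId')['userId'].transform('count')
popular_df = toy_df[user_count > min_users]

print('Length of dataset:', len(popular_df))
```

Using `transform('count')` keeps the counts aligned with the rows being filtered, so no separate count table can go stale between filtering steps.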
