# Data Preprocessing and Transformation

Below, we reduce the dataset from a size that is too large to manage or process to a manageable one. We do this by dropping the Date column, downcasting the remaining columns to smaller integer types, and keeping only ratings of frequently rated movies (at least 100 ratings) made by highly active customers (at least 1000 ratings).
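
As a rough illustration of these three steps, here is a minimal sketch on a toy table. The values and the thresholds of 2 are made up for the example; the notebook itself uses 100 and 1000 on the full ratings file.

```python
import pandas as pd

# Toy ratings table (hypothetical values) with the same columns as the real data.
ratings = pd.DataFrame({
    "MovieID":    [1, 1, 1, 2, 2, 3],
    "CustomerID": [10, 11, 12, 10, 11, 10],
    "Rating":     [4, 5, 3, 2, 4, 1],
    "Date":       ["2005-01-01"] * 6,
})

min_movie_ratings = 2  # assumed toy threshold (the notebook uses 100)
min_user_ratings = 2   # assumed toy threshold (the notebook uses 1000)

# Count ratings per movie and per customer on the full table.
movie_counts = ratings["MovieID"].value_counts()
user_counts = ratings["CustomerID"].value_counts()

keep_movies = movie_counts[movie_counts >= min_movie_ratings].index
keep_users = user_counts[user_counts >= min_user_ratings].index

# Filter rows, drop Date, and downcast the remaining integer columns.
small = (
    ratings[ratings["MovieID"].isin(keep_movies) & ratings["CustomerID"].isin(keep_users)]
    .drop(columns=["Date"])
    .astype({"MovieID": "int16", "CustomerID": "int32", "Rating": "int8"})
)
print(small)
print(small.dtypes)
```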

In [1]:
import math
import os
from math import sqrt

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import NearestNeighbors

In [2]:
# os is already imported in the first cell; data paths are resolved relative to the working directory.
cwd = os.getcwd()

In [3]:
# movie_titles.csv has no header row; keep only the third column (index 2), which holds the title.
movie_title = pd.read_csv(cwd + "/data/movie_titles.csv", encoding='unicode_escape', usecols=[2], header=None)
movie_title.columns = ['title']
movie_title

Unnamed: 0,title
0,Dinosaur Planet
1,Isle of Man TT 2004 Review
2,Character
3,Paula Abdul's Get Up & Dance
4,The Rise and Fall of ECW
...,...
17765,Where the Wild Things Are and Other Maurice Se...
17766,Fidel Castro: American Experience
17767,Epoch
17768,The Company


In [4]:
movie = pd.read_csv(cwd + "/data/final.csv")
movie.describe()

Unnamed: 0,MovieID,CustomerID,Rating
count,100480500.0,100480500.0,100480500.0
mean,9070.915,1322489.0,3.60429
std,5131.891,764536.8,1.085219
min,1.0,6.0,1.0
25%,4677.0,661198.0,3.0
50%,9051.0,1319012.0,4.0
75%,13635.0,1984455.0,4.0
max,17770.0,2649429.0,5.0


In [5]:
# Number of ratings per movie.
movie_freq = pd.DataFrame(movie.groupby('MovieID').size(), columns=['count'])
threshold = 100

# Keep movies with at least `threshold` ratings.
popular_movies = list(set(movie_freq.query('count>=@threshold').index))

# Ratings DataFrame after dropping unpopular movies.
data_popular_movies = movie[movie.MovieID.isin(popular_movies)]

print('shape of original data:', movie.shape)
print('shape of data_popular_movies', data_popular_movies.shape)
print("No. of movies rated at least 100 times:", len(popular_movies))

shape of original data: (100480507, 4)
shape of data_popular_movies (100400918, 4)
No. of movies rated at least 100 times: 16795


In [6]:
# Number of ratings per customer, counted on the full ratings table.
user_freq = pd.DataFrame(movie.groupby('CustomerID').size(), columns=['count'])
# Many users rate very rarely, so we keep only users with at least 1000 ratings.
threshold = 1000
active_user = list(set(user_freq.query('count>=@threshold').index))
data_popular_movies_active_user = data_popular_movies[data_popular_movies.CustomerID.isin(active_user)]

print('shape of original data:', movie.shape)
print('shape of data_popular_movies', data_popular_movies.shape)
print('shape of data_popular_movies_active_user', data_popular_movies_active_user.shape)
print('No. of users who rated at least 1000 times:', len(active_user))

print('user number of new matrix', len(active_user))
print('movie number of new matrix', len(popular_movies))

shape of original data: (100480507, 4)
shape of data_popular_movies (100400918, 4)
shape of data_popular_movies_active_user (18757426, 4)
No. of users who rated at least 1000 times: 13141
user number of new matrix 13141
movie number of new matrix 16795
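
The last two numbers are the lengths of the filter lists, not counts taken from the filtered table. A movie can pass the popularity filter and still vanish if none of its raters is an active user, so it may be worth confirming the matrix dimensions with `nunique()`. The sketch below shows the effect on a made-up toy table with an assumed threshold of 2; it does not claim the two counts differ in the real data.

```python
import pandas as pd

# Hypothetical toy ratings: every movie has 2 ratings, but movie 3 is rated
# only by customers who each rated a single movie.
ratings = pd.DataFrame({
    "MovieID":    [1, 1, 2, 2, 3, 3],
    "CustomerID": [10, 11, 10, 11, 12, 13],
    "Rating":     [4, 5, 3, 4, 2, 1],
})
threshold = 2  # assumed toy threshold for both filters

# Same idea as the cells above: count per movie and per customer on the full table.
movie_freq = ratings.groupby("MovieID").size()
user_freq = ratings.groupby("CustomerID").size()
popular_movies = movie_freq[movie_freq >= threshold].index
active_user = user_freq[user_freq >= threshold].index

filtered = ratings[ratings["MovieID"].isin(popular_movies)
                   & ratings["CustomerID"].isin(active_user)]

print("movies passing the filter:", len(popular_movies))
print("movies left in filtered data:", filtered["MovieID"].nunique())
print("users left in filtered data:", filtered["CustomerID"].nunique())
```

In this toy case three movies pass the popularity filter, but only two remain once inactive customers are removed.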


In [7]:
print(data_popular_movies_active_user.memory_usage(), '\n')
print("Memory Usage: ", data_popular_movies_active_user.memory_usage().sum() / (1024**2), " MB")

Index         150059408
MovieID       150059408
CustomerID    150059408
Rating        150059408
Date          150059408
dtype: int64 

Memory Usage:  715.5390167236328  MB
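
The figure above can be reproduced by hand: every column, including the index, costs 8 bytes per row. The sketch below redoes that arithmetic and estimates the footprint after dropping Date and downcasting to int16, int32, and int8. It assumes the index stays int64. Note also that `memory_usage()` without `deep=True` counts only the 8-byte pointers of an object column, so if Date holds strings its true cost is higher than reported.

```python
n_rows = 18_757_426  # rows in data_popular_movies_active_user, from the shape printed earlier

# Before: Index, MovieID, CustomerID, Rating, Date at 8 bytes per row each.
bytes_before = n_rows * 8 * 5

# After: int64 index (8) + int16 MovieID (2) + int32 CustomerID (4) + int8 Rating (1); Date dropped.
bytes_after = n_rows * (8 + 2 + 4 + 1)

mb_before = bytes_before / 1024**2
mb_after = bytes_after / 1024**2
print(f"before: {mb_before:.2f} MB, after: {mb_after:.2f} MB")
```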


In [8]:
# astype returns a new DataFrame, so rebinding the name avoids assigning into a
# slice of `movie`, which is what triggers pandas' SettingWithCopyWarning.
data_popular_movies_active_user = data_popular_movies_active_user.astype(
    {'MovieID': 'int16', 'CustomerID': 'int32', 'Rating': 'int8'}
)

# Drop Date and cap at 20M rows (a no-op here: only 18,757,426 rows remain).
cleanedMovie = data_popular_movies_active_user.drop(columns=['Date']).iloc[:20000000]

In [9]:
print("Memory Usage: ", cleanedMovie.memory_usage().sum() / (1024**2), " MB")

Memory Usage:  268.3271312713623  MB
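
Downcasting is only safe when every value fits the target type; otherwise `astype` can silently wrap around. A quick check with `np.iinfo`, using the minima and maxima from the `describe()` output earlier, might look like the sketch below. The `observed` dictionary is a hand-copied summary for illustration, not something the notebook computes.

```python
import numpy as np

# (min, max, target dtype) per column; min and max copied from describe() above.
observed = {
    "MovieID":    (1, 17770, np.int16),
    "CustomerID": (6, 2649429, np.int32),
    "Rating":     (1, 5, np.int8),
}

fits = {}
for col, (lo, hi, dtype) in observed.items():
    info = np.iinfo(dtype)
    fits[col] = bool(info.min <= lo and hi <= info.max)
    print(f"{col}: [{lo}, {hi}] fits {np.dtype(dtype).name} "
          f"[{info.min}, {info.max}]: {fits[col]}")

# CustomerID would not fit a 16-bit integer, which is why int32 is used for it.
customer_fits_int16 = observed["CustomerID"][1] <= np.iinfo(np.int16).max
print("CustomerID fits int16:", customer_fits_int16)
```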


In [10]:
cleanedMovie.to_pickle("cleanedMovie.pkl")
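
Unlike a CSV export, a pickle keeps the downcast dtypes, so the memory savings survive a reload. The sketch below checks this round trip on a small made-up frame written to a temporary directory; the file name `toy.pkl` is hypothetical. As a general precaution, `pd.read_pickle` should only be used on files from a trusted source.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for cleanedMovie with the same column dtypes (hypothetical values).
toy = pd.DataFrame({
    "MovieID":    [1, 2, 3],
    "CustomerID": [6, 1322489, 2649429],
    "Rating":     [3, 4, 5],
}).astype({"MovieID": "int16", "CustomerID": "int32", "Rating": "int8"})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The reloaded frame has the same values and the same compact dtypes.
print(restored.dtypes)
print("identical after reload:", restored.equals(toy))
```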