# Building a Recommendation System Mini-Project - Part 1: A Sneak Peek at the Data

This notebook introduces the objective of the project and describes the data we are working with: the Netflix Prize data, which is publicly available [here](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data).

## Problem Statement

We are attempting to build a slightly-better-than-naive recommendation system that suggests movies the system believes users may be interested in.

## Objective

The business objective is to increase the average session time per user, the number of subscriptions, and the number of monthly active users. Since we do not have a real application here, treat these metrics as a reference only.

## Data

### Rating table

The rating table is split into 4 files, `combined_data_*.txt`, where `*` runs from 1 to 4. Each file has the following format:

```
MovieID1:
CustomerID11,Rating11,Date11
CustomerID12,Rating12,Date12
```
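
The layout above can be read as a stream: a line ending in `:` opens a new movie block, and every line after it is a rating for that movie. Below is a minimal sketch of that idea using only the standard library. The helper name `iter_ratings` and the sample values are hypothetical and serve only as illustration.

```python
import io

def iter_ratings(lines):
    """Yield (customer_id, rating, date, movie_id) tuples from block-formatted lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(':'):
            # A header line such as "1:" opens a new movie block
            movie_id = int(line[:-1])
            continue
        customer_id, rating, date = line.split(',')
        yield int(customer_id), int(rating), date, movie_id

# Made-up sample in the same layout as combined_data_*.txt
sample = io.StringIO("1:\n10,3,2005-09-06\n20,5,2005-05-13\n2:\n30,4,2005-10-19\n")
rows = list(iter_ratings(sample))
```

Each yielded tuple already carries its movie ID, which is the flat shape the rest of this notebook works towards.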

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import glob

In [2]:
rating_files = glob.glob('archive/combined_data_*.txt')
rating_files

['archive/combined_data_2.txt',
 'archive/combined_data_3.txt',
 'archive/combined_data_1.txt',
 'archive/combined_data_4.txt']

In [3]:
df_rating = pd.concat([pd.read_csv(
    filename,
    header=None,
    names=['customer_id', 'rating', 'date'],
    parse_dates=['date']
) for filename in rating_files])
df_rating.head()

  customer_id  rating       date
0       4500:     NaN        NaT
1     2532865     4.0 2005-07-26
2      573364     3.0 2005-06-20
3     1696725     3.0 2004-02-27
4     1253431     3.0 2004-03-31


We would like to convert this into the below format:

```
CustomerID11,Rating11,Date11,MovieID1
CustomerID12,Rating12,Date12,MovieID1
```

In [4]:
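# Header rows such as "4500:" have no rating; copy their text into movie_id
df_rating['movie_id'] = np.where(df_rating['rating'].isna(), df_rating['customer_id'], np.nan)
# Strip the trailing colon to keep only the numeric ID
df_rating['movie_id'] = df_rating['movie_id'].str.split(':').str[0]
# Propagate each movie ID down to the rating rows that follow it
df_rating['movie_id'] = df_rating['movie_id'].ffill()
# Drop the header rows, which carry no rating or date
df_rating.dropna(subset=['rating', 'date'], inplace=True)
df_rating.head()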

  customer_id  rating       date movie_id
1     2532865     4.0 2005-07-26     4500
2      573364     3.0 2005-06-20     4500
3     1696725     3.0 2004-02-27     4500
4     1253431     3.0 2004-03-31     4500
5     1265574     2.0 2003-09-01     4500


In [5]:
df_rating.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 1 to 26851925
Data columns (total 4 columns):
 #   Column       Dtype         
---  ------       -----         
 0   customer_id  object        
 1   rating       float64       
 2   date         datetime64[ns]
 3   movie_id     object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 3.7+ GB


In [6]:
# Convert customer_id and movie_id to numeric columns
df_rating = df_rating.astype({
    'customer_id': 'int',
    'movie_id': 'int'
})
df_rating.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 1 to 26851925
Data columns (total 4 columns):
 #   Column       Dtype         
---  ------       -----         
 0   customer_id  int64         
 1   rating       float64       
 2   date         datetime64[ns]
 3   movie_id     int64         
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 3.7 GB
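
At roughly 3.7 GB, the frame is heavy for a typical laptop. One possible way to shrink it is to downcast the numeric columns. The sketch below runs on a tiny made-up frame, not on `df_rating`, and assumes that the IDs fit in 32 bits and that ratings are small whole numbers. Both assumptions should be checked against the real data before applying this.

```python
import pandas as pd

# Tiny made-up frame with the same columns and dtypes as df_rating
toy = pd.DataFrame({
    'customer_id': [2532865, 573364, 1696725],
    'rating': [4.0, 3.0, 3.0],
    'date': pd.to_datetime(['2005-07-26', '2005-06-20', '2004-02-27']),
    'movie_id': [4500, 4500, 4500],
})

before = toy.memory_usage(index=False).sum()

# Assumption: IDs fit in int32 and ratings are whole numbers that fit in int8
toy_small = toy.astype({
    'customer_id': 'int32',
    'rating': 'int8',
    'movie_id': 'int32',
})

after = toy_small.memory_usage(index=False).sum()
```

Comparing `before` and `after` shows the saving per row. On 100 million rows, even a few bytes per row add up.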


In [7]:
df_rating.isna().sum()

customer_id    0
rating         0
date           0
movie_id       0
dtype: int64

### Movie table

The movie table is contained in `movie_titles.csv`, with format `MovieID,YearOfRelease,Title`.

We found two problems in the data:
- The encoding is not UTF-8. After trying several encodings, we found that `latin-1` works.
- Titles containing commas are not wrapped in quotes.

Therefore, we use the code snippet below to resolve these problems.

In [8]:
import csv

# The file is not UTF-8; latin-1 decodes it without errors
encoding = 'latin-1'

with open('archive/movie_titles.csv', 'r', encoding=encoding) as infile:
    with open('archive/modified_movie_titles.csv', 'w', newline='', encoding=encoding) as outfile:
        # csv.writer automatically quotes any field that contains a comma
        writer = csv.writer(outfile)
        for line in infile:
            if not line.strip():
                continue  # skip blank lines
            # Split on the first two commas only, so commas inside a title
            # stay part of the title instead of creating extra columns
            movie_id, year, title = line.rstrip('\r\n').split(',', 2)
            writer.writerow([movie_id, year, title])
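
To see why quoting resolves the comma issue, the following sketch round-trips a title containing a comma through the `csv` module on in-memory data. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up raw lines in the MovieID,YearOfRelease,Title layout;
# the second title contains a comma
raw_lines = ["1,2003,Dinosaur Planet", "2,1999,Hello, World"]

buffer = io.StringIO()
writer = csv.writer(buffer)
for line in raw_lines:
    # Split on the first two commas only so the title stays in one piece
    movie_id, year, title = line.split(',', 2)
    writer.writerow([movie_id, year, title])  # fields containing a comma get quoted

written = buffer.getvalue().splitlines()

# Reading the quoted output back yields exactly three fields per row
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Here `written[1]` is `2,1999,"Hello, World"`, and reading it back gives the title as a single field without the quotes.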

In [9]:
df_movies = pd.read_csv(
    'archive/modified_movie_titles.csv',
    header=None,
    names=['movie_id', 'year_of_release', 'title'],
    parse_dates=['year_of_release'],
    encoding='latin-1'
)
df_movies.head()

   movie_id year_of_release                         title
0         1      2003-01-01               Dinosaur Planet
1         2      2004-01-01    Isle of Man TT 2004 Review
2         3      1997-01-01                     Character
3         4      1994-01-01  Paula Abdul's Get Up & Dance
4         5      2004-01-01      The Rise and Fall of ECW


In [10]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17770 entries, 0 to 17769
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   movie_id         17770 non-null  int64         
 1   year_of_release  17763 non-null  datetime64[ns]
 2   title            17770 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 416.6+ KB
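
The `Non-Null Count` column shows that `year_of_release` is missing for 7 of the 17770 movies (17770 - 17763). A small sketch of how such rows could be listed, shown on a made-up frame rather than the real `df_movies`:

```python
import pandas as pd

# Made-up stand-in for df_movies; one release year is missing
toy_movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'year_of_release': pd.to_datetime(['2003-01-01', None, '1997-01-01']),
    'title': ['Dinosaur Planet', 'Hypothetical Title', 'Character'],
})

# Boolean mask that is True where the release year is missing
missing_year = toy_movies['year_of_release'].isna()
n_missing = int(missing_year.sum())
rows_missing = toy_movies.loc[missing_year, ['movie_id', 'title']]
```

Whether to drop these rows or keep them with a missing year depends on how the release year is used later.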


In [11]:
# Export processed data into a new folder, creating it if it does not exist
import os
os.makedirs('processed', exist_ok=True)
df_rating.to_csv('processed/user_ratings.csv', index=False)
df_movies.to_csv('processed/movies.csv', index=False)
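
CSV does not store column types, so whoever loads these files later has to restate them. Below is a hedged sketch of that round trip on an in-memory buffer. The sample rows are made up, and the dtype choices are assumptions to adjust as needed.

```python
import io
import pandas as pd

# Made-up rows in the same layout as processed/user_ratings.csv
toy = pd.DataFrame({
    'customer_id': [2532865, 573364],
    'rating': [4.0, 3.0],
    'date': pd.to_datetime(['2005-07-26', '2005-06-20']),
    'movie_id': [4500, 4500],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restate the types on load; without parse_dates the date column
# comes back as plain strings
reloaded = pd.read_csv(
    buffer,
    dtype={'customer_id': 'int64', 'rating': 'float64', 'movie_id': 'int64'},
    parse_dates=['date'],
)
```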