# Transform

We need to prepare the data before it can be fed to deep learning algorithms. There are two main transformations:

- **Preprocessing**:
    - Users: Convert to label encoded values
    - Items: Convert to label encoded values
    - User Side Features: Convert to 
    - Ratings: Convert the interaction to Explicit or Implicit signals


- **Data Splitting**: Create train and test datasets for evaluating the model
    - Random Split
    - Stratified Split
    - Chronological Split
    

In [4]:
import numpy as np
import pandas as pd

In [5]:
users = pd.read_csv("data/users.csv")
items = pd.read_csv("data/items.csv")
ratings = pd.read_csv("data/ratings.csv")

In [6]:
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [7]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [20]:
items.head()

Unnamed: 0,movie_id,title,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,SciFi,Thriller,War,Western,year,overview,original_language,runtime,vote_average,vote_count
0,1,Toy Story (1995),0,0,0,1,1,1,0,0,...,0,0,0,0,1995.0,"Led by Woody, Andy's toys live happily in his ...",en,81.0,7.9,10878.0
1,2,GoldenEye (1995),0,1,1,0,0,0,0,0,...,0,1,0,0,1995.0,James Bond must unmask the mysterious head of ...,en,130.0,6.8,2037.0
2,3,Four Rooms (1995),0,0,0,0,0,0,0,0,...,0,1,0,0,1995.0,It's Ted the Bellhop's first night on the job....,en,98.0,6.1,1251.0
3,4,Get Shorty (1995),0,1,0,0,0,1,0,0,...,0,0,0,0,1995.0,Chili Palmer is a Miami mobster who gets sent ...,en,105.0,6.5,501.0
4,5,Copycat (1995),0,0,0,0,0,0,1,0,...,0,1,0,0,1995.0,An agoraphobic psychologist and a female detec...,en,124.0,6.5,424.0


## Preprocessing

In [2]:
from sklearn.preprocessing import LabelEncoder

In [103]:
def encode_user_item(df, user_col, item_col):
    """Function to encode users and items
    
    Params:     
        df (pd.DataFrame): Pandas data frame to be used.
        user_col (string): Name of the user column.
        item_col (string): Name of the item column.
    
    Returns: 
        encoded_df (pd.DataFrame): Copy of df with added user_index and item_index columns
        user_encoder: sklearn LabelEncoder fitted on the users
        item_encoder: sklearn LabelEncoder fitted on the items
    """
    
    encoded_df = df.copy()
    
    user_encoder = LabelEncoder()
    user_encoder.fit(encoded_df[user_col].values)
    n_users = len(user_encoder.classes_)
    
    item_encoder = LabelEncoder()
    item_encoder.fit(encoded_df[item_col].values)
    n_items = len(item_encoder.classes_)

    encoded_df["user_index"] = user_encoder.transform(encoded_df[user_col])
    encoded_df["item_index"] = item_encoder.transform(encoded_df[item_col])
    
    print("Number of users: ", n_users)
    print("Number of items: ", n_items)
    
    return encoded_df, user_encoder, item_encoder

In [102]:
encoded_ratings, user_encoder, item_encoder = encode_user_item(ratings, "user_id", "movie_id")

Number of users:  943
Number of items:  1682
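
The returned encoders are worth keeping because they map in both directions: `transform` turns raw ids into contiguous indices, and `inverse_transform` recovers the raw ids later, for example when reporting recommendations. A minimal sketch on a toy frame (the values and the `toy_` names are made up for illustration and are not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy ratings with non-contiguous ids (made-up values for illustration)
toy_ratings = pd.DataFrame({
    "user_id": [196, 186, 22, 196],
    "movie_id": [242, 302, 377, 302],
})

toy_user_encoder = LabelEncoder().fit(toy_ratings["user_id"])

# classes_ is sorted, so 22 -> 0, 186 -> 1, 196 -> 2
user_index = toy_user_encoder.transform(toy_ratings["user_id"])

# Map the indices back to the original ids
recovered_ids = toy_user_encoder.inverse_transform(user_index)
```

Because the indices run from 0 to the number of distinct ids minus 1, they can be used directly as row positions in an embedding table.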


In [21]:
ratings['rating'] = ratings['rating'].astype(np.float32)
min_rating = ratings['rating'].min()
max_rating = ratings['rating'].max()
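
The preprocessing list mentions converting ratings to explicit or implicit signals. The sketch below shows one plausible way to do each on a toy frame. It assumes the minimum and maximum rating are used to rescale the explicit signal to [0, 1]; the threshold of 4 for the implicit signal is an arbitrary illustrative choice, and the column names `rating_scaled` and `liked` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings (made-up values for illustration)
toy = pd.DataFrame({"rating": np.array([3, 3, 1, 2, 1, 5], dtype=np.float32)})

toy_min = toy["rating"].min()
toy_max = toy["rating"].max()

# Explicit signal: keep the rating value, rescaled to the [0, 1] range
toy["rating_scaled"] = (toy["rating"] - toy_min) / (toy_max - toy_min)

# Implicit signal: 1 if the rating reaches the threshold, else 0
# (the threshold is an assumption, not taken from the notebook)
LIKE_THRESHOLD = 4.0
toy["liked"] = (toy["rating"] >= LIKE_THRESHOLD).astype(np.int8)
```

With an implicit signal, the model predicts whether an interaction is positive rather than how high the rating is.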

## Data Splitting

- Random Split
- Stratified Split
- Chronological Split

### Random Split 

In [87]:
def random_split(df, ratios):
    
    """Function to split pandas DataFrame into train, validation and test
    
    Params:     
        df (pd.DataFrame): Pandas data frame to be split.
        ratios (list of floats): list of ratios for split. The ratios have to sum to 1.
    
    Returns: 
        list: List of pd.DataFrame split by the given specifications.
    """
    seed = 42                                     # Fixed random seed for reproducibility
    df = df.sample(frac=1, random_state=seed)     # Shuffle the data
    samples = df.shape[0]      # Number of samples
    
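    # Cumulative ratios without the final 1.0, e.g. [0.7, 0.2, 0.1] -> [0.7, 0.9]
    split_ratio = np.cumsum(ratios).tolist()[:-1]
    
    # Get the rounded integer split index
    split_index = [round(x * samples) for x in split_ratio]
    
    # Split the data at those row positions (plain iloc slices keep each part a DataFrame)
    bounds = [0] + split_index + [samples]
    splits = [df.iloc[start:end] for start, end in zip(bounds[:-1], bounds[1:])]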

    return splits
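
To see the index arithmetic at work, the same steps can be traced on a ten-row toy frame. This is a self-contained sketch rather than a call into the function above; the toy data and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings table (made-up values for illustration)
toy = pd.DataFrame({"user_id": np.arange(10), "rating": np.arange(10) % 5 + 1})

ratios = [0.7, 0.2, 0.1]
shuffled = toy.sample(frac=1, random_state=42)

# Cumulative ratios without the final 1.0 give the row positions of the cuts
cut_points = [int(round(r * len(shuffled))) for r in np.cumsum(ratios).tolist()[:-1]]
bounds = [0] + cut_points + [len(shuffled)]
train, val, test = [shuffled.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

sizes = (len(train), len(val), len(test))

# Every row lands in exactly one split
covered = sorted(pd.concat([train, val, test]).index) == sorted(toy.index)
```

For ten rows the cuts fall at positions 7 and 9, so the three parts hold 7, 2 and 1 rows.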

### Stratified Split