## AutoEncoder Assignment

Willmer R. Quiñones

### Download the data from the webpage and unzip it

In [1]:
import urllib.request
url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
urllib.request.urlretrieve(url, 'ml-100k.zip')  

('ml-100k.zip', <http.client.HTTPMessage at 0x7f0190ae21d0>)

In [2]:
!apt-get install p7zip-full
!unzip ml-100k.zip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
p7zip-full is already the newest version (16.02+dfsg-6).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.
Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-1

### Reading the datasets:
##### u.data and u.item

In [3]:
import os, copy
import numpy as np
import pandas as pd

pd.options.mode.chained_assignment = None  # default='warn'

In [4]:
data = pd.read_csv('ml-100k/u.data', sep="\t", header=None)
data.columns =['user_id', 'item_id', 'rating', 'timestamp']

item = pd.read_csv('ml-100k/u.item', sep="|", header=None, encoding='latin-1')
item.columns =['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url', 'unknown', 'action', 
               'adventure', 'animation', 'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy', 'film_noir', 
               'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western']
df = data.pivot(index = 'user_id', columns = 'item_id', values = 'rating')

In [5]:
# Checking the size of data
## Number of rows should be equal to number of users and number of columns should be equal to number of items
df.shape[0] == len(set(data['user_id'].tolist())) and df.shape[1] == len(item['item_id'].tolist())

True

In [6]:
number_of_users = len(df)
print(f"There are {number_of_users} users")

There are 943 users


### Splitting the dataset

#### Split by movies

In [7]:
df = df.fillna(0)
df.columns.name = None

Machine learning models usually work better with normalized data, so before building the autoencoder I normalize all the ratings in the dataset with min-max scaling. (I could have used L2 normalization, but with min-max it is easy to recover the original values.)
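As a minimal sketch of why min-max scaling is easy to invert, assuming ratings on the 0-5 scale used here (the array values and variable names below are made up for illustration):

```python
import numpy as np

# Hypothetical ratings on the 0-5 scale (0 means "not rated")
ratings = np.array([[5.0, 3.0, 0.0],
                    [0.0, 4.0, 1.0]])

r_min, r_max = 0.0, 5.0

# Min-max scaling maps [r_min, r_max] onto [0, 1]
scaled = (ratings - r_min) / (r_max - r_min)

# Inverting the scaling recovers the original ratings
restored = scaled * (r_max - r_min) + r_min
```

With `r_min = 0` the scaling reduces to a division by 5, which is what the next cell does.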

In [8]:
### Normalizing the data (min-max normalization)
### min being 0 and max being 5
df_backup = df.copy()
df_values = df.values
df[df.columns] = df_values/5

In [9]:
df.head(5)

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,1643,1644,1645,1646,1647,1648,1649,1650,1651,1652,1653,1654,1655,1656,1657,1658,1659,1660,1661,1662,1663,1664,1665,1666,1667,1668,1669,1670,1671,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,1.0,0.6,0.8,0.6,0.6,1.0,0.8,0.2,1.0,0.6,0.4,1.0,1.0,1.0,1.0,1.0,0.6,0.8,1.0,0.8,0.2,0.8,0.8,0.6,0.8,0.6,0.4,0.8,0.2,0.6,0.6,1.0,0.8,0.4,0.2,0.4,0.4,0.6,0.8,0.6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.8,0.8,0.0,0.0,0.0,0.0,0.6,0.0,0.0,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.8,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.6,0.0,0.0,0.8,0.6,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As required, the data was separated into training and testing sets. The testing set should contain the most recently rated movies; as a proxy, I took the most recent movies by release date. Selecting movies by rating timestamp alone does not guarantee that most of their ratings are recent (for example, a user may have watched and rated an old movie recently).

In [10]:
total_movies = df.shape[1]
train_size = int(total_movies * 0.7)
test_size = total_movies - train_size

print(f'Number of movies in test set: {test_size}')
print(f'Number of movies in train set: {train_size}')

Number of movies in test set: 505
Number of movies in train set: 1177


The data will be split by the time each movie was released: the newest movies go into the testing set.
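A small sketch of a release-date split on made-up data may help make the slicing concrete. The `toy_` names and the five dates are hypothetical; only the date format matches `u.item`. Sorting oldest-first places the newest items at the end of the list, so the tail becomes the test set:

```python
import pandas as pd

# Hypothetical miniature of u.item with the same date format
toy_items = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5],
    'release_date': ['01-Jan-1995', '14-Feb-1997', '22-Mar-1996',
                     '01-Jan-1998', '30-Jun-1994'],
})
toy_items['datetime'] = pd.to_datetime(toy_items['release_date'],
                                       format='%d-%b-%Y')

# Oldest first, so the newest items sit at the end of the list
ordered_ids = toy_items.sort_values('datetime')['item_id'].tolist()

n_train = int(len(ordered_ids) * 0.7)   # 3 of the 5 toy items
toy_train_ids = ordered_ids[:n_train]   # oldest 70%
toy_test_ids = ordered_ids[n_train:]    # newest 30%
```

Whichever sort direction is used, the slice that becomes the test set has to come from the "newest" end of the list.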

In [11]:
# Converting release date to datetime format
# .copy() avoids modifying a slice of `item` when the new column is added
item_release_date = item[['item_id', 'title', 'release_date']].copy()
item_release_date['datetime'] = pd.to_datetime(item_release_date['release_date'], format = '%d-%b-%Y')

In [12]:
# Sorting the item ids by release date, newest first
by_time_release = item_release_date.sort_values('datetime', ascending = False)['item_id'].tolist()

In [13]:
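# Note: the list is sorted newest-first, so this slice assigns the oldest releases
# (and any undated item, which sorts last) to the test set; the results below use this split.
# Sorting with ascending = True would hold out the newest movies instead.
test_ids = by_time_release[train_size:]
train_ids = by_time_release[:train_size]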

In [14]:
# Splitting the dataset
train_df = df[train_ids]
test_df = df[test_ids]

In [15]:
### Finding out if there are common columns
if len(set(train_ids).intersection(set(test_ids))) == 0:
  print('All good')
else:
  print('Check again...')


All good


### Q1. Simple Autoencoder

#### Keras implementation

In [16]:
import keras
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Dropout, Activation
from keras.optimizers import Adam
from keras import regularizers
from keras import backend as K
from keras.layers import BatchNormalization

#### Data to feed the model

In [17]:
train_size = len(train_df.columns)

### Training input
x_train = train_df.values

### Testing input
test_values = test_df.values
zeros = np.zeros((number_of_users, train_size - len(test_df.columns)))
x_test = np.concatenate((test_values, zeros), axis = 1)

### Real values for testing
y_true = test_df.values

In [18]:
### Encoding dimension value
encoding_dim = 32

### Input layer size
input_layer = Input(shape = (train_size, ))

#### Training the model

The autoencoder used in Q1 has a single hidden layer with an encoding dimension of 32. Tanh is used as the activation function in the hidden layer and sigmoid in the output layer.

In [19]:
### Loss Function to use
def rmse(y_true, y_pred):
  return K.sqrt(K.mean(K.square(y_pred - y_true)))

## Building model
simple_autoencoder = Sequential([
    ## Encoder
    Dense(encoding_dim, input_shape = (train_size, ), activation = 'tanh'),
    
    ## Decoder    
    Dense(train_size, activation = 'sigmoid'),
])
simple_autoencoder.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                37696     
_________________________________________________________________
dense_1 (Dense)              (None, 1177)              38841     
Total params: 76,537
Trainable params: 76,537
Non-trainable params: 0
_________________________________________________________________
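The parameter counts in the summary can be checked by hand: a Dense layer holds one weight per input-output pair plus one bias per output unit. A short check using the layer sizes reported above (the helper name `dense_params` is only illustrative):

```python
# A Dense layer has (inputs * outputs) weights plus one bias per output unit
def dense_params(n_inputs, n_outputs):
    return n_inputs * n_outputs + n_outputs

n_movies = 1177   # input/output width (train_size)
n_hidden = 32     # encoding_dim

encoder_params = dense_params(n_movies, n_hidden)   # 1177*32 + 32
decoder_params = dense_params(n_hidden, n_movies)   # 32*1177 + 1177
total_params = encoder_params + decoder_params
```

These reproduce the 37,696, 38,841 and 76,537 figures in the summary.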


RMSE was used as the loss function. Since Keras does not provide a built-in RMSE loss, I defined my own. Adam was used as the optimizer, and the model was trained for 200 epochs with a batch size of 128.

For testing, since the data is very sparse, it would be unfair to compare the zeros (entries the user has not rated yet) with the output of the model (a value very close to zero): those entries would make the loss look artificially low. For this reason, only the non-zero values were used in the RMSE calculation (this is done for Q2 and Q4 as well).
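A minimal NumPy sketch of this masking idea, on made-up normalized ratings, is shown here. The function name `masked_rmse` and the toy arrays are assumptions for illustration, not the notebook's own helper:

```python
import numpy as np

def masked_rmse(y_true, y_pred):
    """RMSE computed only over entries where y_true is non-zero (i.e. rated)."""
    rated = y_true > 0
    diff = y_pred[rated] - y_true[rated]
    return np.sqrt(np.mean(diff ** 2))

# Hypothetical normalized ratings; 0 marks "not rated"
toy_true = np.array([[0.8, 0.0, 0.4],
                     [0.0, 1.0, 0.0]])
toy_pred = np.array([[0.6, 0.1, 0.4],
                     [0.2, 0.7, 0.0]])

# RMSE over every entry, including the unrated zeros
plain = np.sqrt(np.mean((toy_pred - toy_true) ** 2))
# RMSE over the three rated entries only
masked = masked_rmse(toy_true, toy_pred)
```

In this toy case the unrated entries are predicted close to zero, so including them lowers the score relative to the masked version.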


In [20]:
## Train model with train set.
batch_size = 128
epochs = 200

simple_autoencoder.compile(loss = rmse,
                          optimizer = Adam())

simple_autoencoder.fit(x_train, x_train,
                batch_size=batch_size,
                epochs=epochs,
                shuffle=True,
                verbose=0,
                validation_data=(x_test, x_test))

<tensorflow.python.keras.callbacks.History at 0x7f018ff83940>

#### Testing the Model

In [21]:
from sklearn.metrics import mean_squared_error
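def test_rmse(y_true, y_pred):
  # Keep only the entries that were actually rated
  non_zeros = y_true > 0
  y_true = y_true[non_zeros]
  y_pred = y_pred[non_zeros]
  # Note: mean_squared_error returns the MSE, so the scores printed below
  # are MSE values; apply np.sqrt to the result to obtain the RMSE
  return mean_squared_error(y_true, y_pred)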

In [22]:
### Predict values
y_pred = simple_autoencoder.predict(x_test)
y_pred = y_pred[:,:len(test_df.columns)]
y_pred[y_true == 0] = 0

In [23]:
## Calculating error: RMSE formula
q1_loss = test_rmse(y_true, y_pred)
print(q1_loss)

0.5547575167023033


### Q2. Deep Autoencoder

The deep autoencoder consists of 4 layers for the encoder and 4 layers for the decoder; each layer uses batch normalization, and every layer except the output layer is followed by dropout (50%). The loss function, optimizer, batch size and number of epochs are kept the same as in Q1.

In [24]:
## Building model

deep_autoencoder = Sequential([
    ## Encoder
    Dense(encoding_dim * 16, input_shape = (train_size, )),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5),
    
    Dense(encoding_dim * 4),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5),
    
    Dense(encoding_dim * 2),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5),
    
    Dense(encoding_dim),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5),  
    
    ## Decoder
    Dense(encoding_dim * 2),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5), 
    
    Dense(encoding_dim * 4),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5), 
    
    Dense(encoding_dim * 16),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5), 
    
    Dense(train_size),
    BatchNormalization(),
    Activation('sigmoid'),
])

In [25]:
# Using the same parameters as Q1
deep_autoencoder.compile(loss = rmse,
                          optimizer = Adam())

batch_size = 128
epochs = 200

deep_autoencoder.fit(x_train, x_train,
                batch_size=batch_size,
                epochs=epochs,
                shuffle=True,
                verbose=0,
                validation_data=(x_test, x_test))

<tensorflow.python.keras.callbacks.History at 0x7f00dd2f39b0>

In [26]:
# Testing
y_pred = deep_autoencoder.predict(x_test)
y_pred = y_pred[:,:len(test_df.columns)]
y_pred[y_true == 0] = 0

In [27]:
## Calculating error: RMSE formula
q2_loss = test_rmse(y_true, y_pred)
print(q2_loss)

0.3441531889613632


In [28]:
print(q1_loss - q2_loss)

0.21060432774094012


There’s an improvement over Q1. Since the loss decreased, out of curiosity I built another deep autoencoder, this time with 3 layers for the encoder and 3 for the decoder, and with a single dropout layer between the encoder and the decoder. Its score is only slightly better than the previous one, so perhaps there’s no need to go that deep for this dataset.

In [29]:
deep_autoencoder_2 = Sequential([
    ## Encoder
    Dense(encoding_dim * 4, input_shape = (train_size, )),
    BatchNormalization(),
    Activation('tanh'),
    #Dropout(0.5),
    
    Dense(encoding_dim * 2),
    BatchNormalization(),
    Activation('tanh'),
    #Dropout(0.5),
    
    Dense(encoding_dim),
    BatchNormalization(),
    Activation('tanh'),
    Dropout(0.5),  
    
    ## Decoder
    Dense(encoding_dim * 2),
    BatchNormalization(),
    Activation('tanh'),
    #Dropout(0.5), 
    
    Dense(encoding_dim * 4),
    BatchNormalization(),
    Activation('tanh'),
    #Dropout(0.5), 
    
    Dense(train_size),
    BatchNormalization(),
    Activation('sigmoid'),
])

In [30]:
# Using the same parameters as Q1
deep_autoencoder_2.compile(loss = rmse,
                          optimizer = Adam())

batch_size = 128
epochs = 200

deep_autoencoder_2.fit(x_train, x_train,
                batch_size=batch_size,
                epochs=epochs,
                shuffle=True,
                verbose=0,
                validation_data=(x_test, x_test))

<tensorflow.python.keras.callbacks.History at 0x7f00db4f7d30>

In [31]:
# Testing
y_pred = deep_autoencoder_2.predict(x_test)
y_pred = y_pred[:,:len(test_df.columns)] #test size
y_pred[y_true == 0] = 0

## Calculating error: RMSE formula
q2_loss_b = test_rmse(y_true, y_pred)
print(q2_loss_b)

0.33690872210728307


In [32]:
print(q2_loss - q2_loss_b)

0.007244466854080145


### Q3. Auxiliary Information

In [33]:
user = pd.read_csv('ml-100k/u.user', sep="|", header=None)
user.columns =['user_id', 'age', 'gender', 'occupation', 'zip_code']

In [34]:
user.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In the user dataframe, three pieces of information can be useful: age, gender and occupation.

#### One Hot Encoding

In [35]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

Using one-hot encoder for occupation

In [36]:
occupations = np.sort(np.array(list(set(user['occupation'].values))))
print(occupations)

['administrator' 'artist' 'doctor' 'educator' 'engineer' 'entertainment'
 'executive' 'healthcare' 'homemaker' 'lawyer' 'librarian' 'marketing'
 'none' 'other' 'programmer' 'retired' 'salesman' 'scientist' 'student'
 'technician' 'writer']


In [37]:
# Using SKLearn, one can easily convert categorical values to binary values
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(occupations)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

In [38]:
occupation_definition = pd.DataFrame(onehot_encoded, columns = occupations).transpose()
occupation_definition.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
administrator,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
artist,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doctor,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
educator,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
engineer,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


For gender, Female is encoded as [1, 0] and Male as [0, 1]. Age is not categorical data, but converting it into categories can convey more information to the model. I therefore used the age categories defined on the Statistics Canada website (https://www.statcan.gc.ca/eng/concepts/definitions/age2), in which children are 0 to 14 years old, youth are 15 to 24 years old, adults are 25 to 64 years old and seniors are 65 and older.

In [39]:
genders = np.sort(np.array(list(set(user['gender'].values))))

# Using SKLearn, one can easily convert categorical values to binary values
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(genders)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

In [40]:
gender_definition = pd.DataFrame(onehot_encoded, columns = genders).transpose()
print(gender_definition)

     0    1
F  1.0  0.0
M  0.0  1.0


Age: since age by itself conveys little information (just one column), it is transformed into categorical data.

The categories follow this definition of age (https://www.statcan.gc.ca/eng/concepts/definitions/age2): children = 0-14 years old, youth = 15-24 years old, adults = 25-64 years old, seniors = 65 and over.
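One way to express these age brackets is `pd.cut` with right-inclusive bins. This is a sketch on made-up ages chosen to sit on the category boundaries; the names `toy_ages` and `age_groups` are hypothetical, and the labels mirror the ones used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary values of each category
toy_ages = pd.Series([7, 14, 15, 24, 25, 64, 65, 73])

# Right-inclusive bins: (-1, 14], (14, 24], (24, 64], (64, inf]
age_groups = pd.cut(toy_ages,
                    bins=[-1, 14, 24, 64, np.inf],
                    labels=['child', 'young', 'adult', 'senior'])
```

Checking the boundary ages (14, 24, 64) explicitly is a cheap way to confirm that the bins match the stated definition.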

In [41]:
ages = ['adult', 'child', 'senior', 'young']

In [42]:
# Using SKLearn, one can easily convert categorical values to binary values
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(ages)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

In [43]:
age_definition = pd.DataFrame(onehot_encoded, columns = ages).transpose()
age_definition

Unnamed: 0,0,1,2,3
adult,1.0,0.0,0.0,0.0
child,0.0,1.0,0.0,0.0
senior,0.0,0.0,1.0,0.0
young,0.0,0.0,0.0,1.0


#### Building the one-hot encoded dataframe per user

Occupations

In [44]:
occupations_df = pd.DataFrame(columns = occupations, index = df.index)

In [45]:
# Using SKLearn, one can easily convert categorical values to binary values
values = np.array(list(user['occupation'].values))
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

In [46]:
occupations_df[list(occupations)] = onehot_encoded
occupations_df.head()

user_id,administrator,artist,doctor,educator,engineer,entertainment,executive,healthcare,homemaker,lawyer,librarian,marketing,none,other,programmer,retired,salesman,scientist,student,technician,writer
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Genders

In [47]:
gender_df = pd.DataFrame(columns = genders, index = df.index)
# Using SKLearn, one can easily convert categorical values to binary values
values = np.array(list(user['gender'].values))
label_encoder = LabelEncoder()

integer_encoded = label_encoder.fit_transform(values)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

# Building dataframe
gender_df[list(genders)] = onehot_encoded
gender_df.head()

user_id,F,M
1,0.0,1.0
2,1.0,0.0
3,0.0,1.0
4,0.0,1.0
5,1.0,0.0


Age

In [48]:
user2 = user.copy()  # work on a copy so the numeric ages in `user` are preserved
user2['age'] = user2['age'].astype(object)  # the column will hold category labels
for i in range(len(user2)):
  age = user.loc[i, 'age']
  if age < 15:
    user2.iloc[i, user2.columns.get_loc('age')] = 'child'
  elif age < 25:
    user2.iloc[i, user2.columns.get_loc('age')] = 'young'
  elif age < 65:  # adults are 25 to 64 inclusive
    user2.iloc[i, user2.columns.get_loc('age')] = 'adult'
  else:
    user2.iloc[i, user2.columns.get_loc('age')] = 'senior'

In [49]:
age_df = pd.DataFrame(columns = ages, index = df.index)
# Using SKLearn, one can easily convert categorical values to binary values
values = np.array(list(user2['age'].values))
label_encoder = LabelEncoder()

integer_encoded = label_encoder.fit_transform(values)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

# Building dataframe
age_df[list(ages)] = onehot_encoded
age_df.head()

user_id,adult,child,senior,young
1,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
5,1.0,0.0,0.0,0.0


#### Creating user-profile

This dataframe describes the characteristics of each user. It will later be appended to the ratings.

In [50]:
# Adding occupations
user_profile = pd.concat([occupations_df, gender_df, age_df], ignore_index=False, axis = 1)
user_profile.head()

user_id,administrator,artist,doctor,educator,engineer,entertainment,executive,healthcare,homemaker,lawyer,librarian,marketing,none,other,programmer,retired,salesman,scientist,student,technician,writer,F,M,adult,child,senior,young
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
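As a compact alternative to the encoder steps above (not the approach used in this notebook), `pd.get_dummies` can build a similar profile in one call. The sketch below uses a made-up three-user table; the `toy_` names and the `age_group` column are assumptions, and the resulting columns carry pandas' default `column_value` prefixes:

```python
import pandas as pd

# Hypothetical slice of u.user, with age already mapped to a category
toy_users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'occupation': ['technician', 'other', 'writer'],
    'gender': ['M', 'F', 'M'],
    'age_group': ['young', 'adult', 'young'],
}).set_index('user_id')

# One indicator column per category value, for all three fields at once
toy_profile = pd.get_dummies(toy_users).astype(float)
```

Each row then contains exactly three ones (one per field), the same property the concatenated profile above has.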


### Q4. Combining Auxiliary Information into Autoencoder

First, for each user I built a user profile dataframe with the encoding explained in Q3. Then I attached the user profile to the normalized training and testing data. Finally, the model architecture from Q1 was used to predict the ratings.

In [51]:
# Adding profile to train and test sets
profile_values = user_profile.values

x_train_q4 = np.concatenate((x_train, profile_values), axis = 1)
x_test_q4 = np.concatenate((x_test, profile_values), axis = 1)

In [52]:
# Defining autoencoder again
train_size_q4 = x_train_q4.shape[1]

## Building model
simple_autoencoder = Sequential([
    ## Encoder
    Dense(encoding_dim, input_shape = (train_size_q4, ), activation = 'tanh'),
    
    ## Decoder    
    Dense(train_size_q4, activation = 'sigmoid'),
])

In [53]:
## Train model with train set.
simple_autoencoder.compile(loss = rmse,
                          optimizer = Adam())

batch_size = 128
epochs = 200

simple_autoencoder.fit(x_train_q4, x_train_q4,
                batch_size=batch_size,
                epochs=epochs,
                shuffle=True,
                verbose=0,
                validation_data=(x_test_q4, x_test_q4))

<tensorflow.python.keras.callbacks.History at 0x7f00da8a8978>

In [54]:
# Testing
y_pred = simple_autoencoder.predict(x_test_q4)
y_pred = y_pred[:,:len(test_df.columns)] #test size
y_pred[y_true == 0] = 0

## Calculating error: RMSE formula
q4_loss = test_rmse(y_true, y_pred)
print(q4_loss)

0.5561152069482995


In [55]:
print(q1_loss - q4_loss)

-0.0013576902459961904


### Q5. Cold-Start Problem

To get the users (from the testing set) with the lowest sparsity, I counted the number of zeros in each user's ratings and took the users with the fewest zeros. Then I randomly set ratings to zero by multiplying the original ratings by a mask in which each entry is 0 with probability 0.8 and 1 with probability 0.2.
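The masking step can be sketched on made-up data as follows. The seeded generator and the `toy_` names are assumptions added for reproducibility; they are not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the mask is reproducible

# Hypothetical normalized ratings for 5 users and 8 movies
toy_ratings = rng.choice([0.2, 0.4, 0.6, 0.8, 1.0], size=(5, 8))

# Each entry is kept with probability 0.2 and zeroed with probability 0.8
keep_mask = rng.choice(2, size=toy_ratings.shape, p=[0.8, 0.2])
toy_masked = toy_ratings * keep_mask
```

Because the mask is drawn per entry, the fraction of hidden ratings is 80% only in expectation, not exactly.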

In [56]:
# Finding the low-sparsity users by counting the zero ratings per user
temp = {
    'user': test_df.index,
    'number_zeros': (test_df[test_df.columns] == 0).sum(axis=1)
}

sparcity_df = pd.DataFrame(temp, columns = temp.keys())

In [57]:
low_sparcity_users = sparcity_df.sort_values('number_zeros').head(5)['user'].tolist()
print(low_sparcity_users)

[405, 234, 450, 13, 7]


In [58]:
lsu_df = test_df.loc[low_sparcity_users]

In [59]:
# One independent 0/1 draw per entry: 0 with probability 0.8, 1 with probability 0.2
mask = np.random.choice(2, size=lsu_df.shape, p=[0.8, 0.2])

In [60]:
lsu_zeros = np.multiply(lsu_df.values, mask)

In [61]:
new_lsu_df = test_df.loc[low_sparcity_users]
new_lsu_df[new_lsu_df.columns] = lsu_zeros

Adding the user profile information

In [62]:
zeros = np.zeros((5, train_size - new_lsu_df.shape[1]))
x_lsu = np.concatenate((new_lsu_df.values, zeros), axis = 1)

In [63]:
profile_lsu = user_profile.loc[low_sparcity_users].values
x_lsu = np.concatenate((x_lsu, profile_lsu), axis = 1)

Predict ratings

In [64]:
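# Note: x_lsu is already on the normalized [0, 1] scale used for training, so the
# factor of 5 rescales both the ratings and the one-hot profile columns
y_pred = simple_autoencoder.predict(x_lsu*5)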
y_pred = y_pred[:, :lsu_df.shape[1]] #test size
y_pred[lsu_df.values == 0] = 0

In [65]:
## Calculating error: RMSE formula
q5_loss = test_rmse(lsu_df.values, y_pred)
print(q5_loss)

0.4768370001356463


In [66]:
# Actual ratings
actual_ratings = copy.deepcopy(lsu_df)

# Predicted ratings
pred_ratings = pd.DataFrame(index = actual_ratings.index, columns = actual_ratings.columns)
pred_ratings[:] = y_pred
pred_ratings.head()

user_id,1422,169,1402,1401,399,1332,398,1488,1487,1436,397,467,468,470,1539,469,1141,701,735,576,577,78,79,80,732,734,82,582,1146,83,578,736,579,581,942,77,963,962,961,684,...,1451,1116,432,494,607,478,404,524,1172,836,1473,132,133,1122,136,491,497,633,612,1458,99,1286,967,613,1461,671,1203,615,835,1397,604,493,430,1580,1124,656,617,1542,675,267
405,0.136835,0.006778,0.0,0.0,0.006541,0.0,0.011312,0.010982,0.005936,0.0,0.006847,0.005654,0.009566,0.008156,0.006573,0.006609,0.0,0.0,0.006624,0.00652,0.007313,0.006457,0.010181,0.008401,0.051078,0.023971,0.00566,0.005873,0.016165,0.005335,0.007073,0.025819,0.005788,0.006212,0.006755,0.017911,0.0,0.0,0.0,0.00681,...,0.0,0.0,0.024274,0.0,0.0,0.0,0.01206,0.074233,0.0,0.0,0.0,0.008282,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006982,0.0,0.0,0.0,0.0,0.129631,0.0,0.0,0.0,0.0,0.0,0.0,0.006212,0.045235,0.0,0.04457,0.0,0.0,0.062753,0.0
234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008156,0.0,0.0,0.0,0.0,0.006624,0.0,0.0,0.0,0.010181,0.0,0.051078,0.0,0.00566,0.005873,0.0,0.0,0.0,0.0,0.0,0.0,0.006755,0.017911,0.030052,0.0,0.0,0.0,...,0.107407,0.0,0.024274,0.016506,0.03107,0.006098,0.01206,0.074233,0.006654,0.005842,0.0,0.008282,0.015153,0.0,0.006147,0.023623,0.028346,0.0,0.010062,0.04604,0.006982,0.0,0.0,0.024702,0.009023,0.129631,0.041367,0.028801,0.011312,0.006968,0.005516,0.006515,0.006212,0.0,0.0,0.04457,0.006538,0.0,0.062753,0.0
450,0.0,0.006778,0.006794,0.006404,0.006541,0.0,0.0,0.0,0.0,0.0,0.0,0.005654,0.009566,0.008156,0.0,0.006609,0.0,0.0,0.006624,0.0,0.0,0.006457,0.010181,0.008401,0.051078,0.023971,0.00566,0.005873,0.0,0.005335,0.0,0.025819,0.0,0.0,0.006755,0.017911,0.0,0.0,0.0,0.0,...,0.0,0.007953,0.024274,0.016506,0.03107,0.006098,0.0,0.0,0.006654,0.0,0.0,0.008282,0.015153,0.0,0.006147,0.023623,0.028346,0.006601,0.010062,0.0,0.006982,0.045243,0.008215,0.024702,0.0,0.129631,0.041367,0.0,0.0,0.0,0.005516,0.006515,0.006212,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.011312,0.0,0.0,0.0,0.0,0.005654,0.0,0.0,0.0,0.0,0.0,0.0,0.006624,0.00652,0.0,0.006457,0.010181,0.0,0.051078,0.0,0.00566,0.0,0.0,0.005335,0.007073,0.025819,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00681,...,0.0,0.0,0.024274,0.016506,0.0,0.006098,0.01206,0.074233,0.0,0.005842,0.0,0.008282,0.0,0.0,0.0,0.023623,0.028346,0.0,0.010062,0.0,0.006982,0.0,0.0,0.024702,0.0,0.129631,0.0,0.028801,0.011312,0.0,0.005516,0.006515,0.006212,0.0,0.0,0.04457,0.006538,0.0,0.062753,0.0
7,0.0,0.0,0.0,0.0,0.006541,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008156,0.0,0.0,0.0,0.0,0.0,0.00652,0.007313,0.006457,0.010181,0.008401,0.0,0.0,0.00566,0.005873,0.0,0.0,0.007073,0.0,0.005788,0.006212,0.0,0.017911,0.0,0.0,0.0,0.0,...,0.0,0.0,0.024274,0.0,0.03107,0.0,0.01206,0.0,0.0,0.0,0.0,0.008282,0.015153,0.0,0.006147,0.023623,0.028346,0.006601,0.010062,0.0,0.006982,0.0,0.0,0.024702,0.0,0.129631,0.0,0.028801,0.0,0.0,0.005516,0.0,0.006212,0.0,0.0,0.04457,0.006538,0.0,0.062753,0.0


For each user, I took the 10 best movies (according to the model) and built a dataframe with the required information. An example of the resulting recommendation is shown below:
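The top-N selection can also be written with `np.argsort`. This is a sketch on a made-up score matrix; `toy_scores`, `toy_movie_ids` and `top_n` are hypothetical names, and N is 3 here instead of 10 to keep the example small:

```python
import numpy as np

# Hypothetical predicted ratings: 2 users x 5 movies
toy_scores = np.array([[0.10, 0.90, 0.30, 0.70, 0.50],
                       [0.80, 0.20, 0.60, 0.40, 0.05]])
toy_movie_ids = np.array([11, 22, 33, 44, 55])  # column position -> movie id

top_n = 3
# argsort is ascending, so sort the negated scores to get the best first
top_idx = np.argsort(-toy_scores, axis=1)[:, :top_n]
top_movies = toy_movie_ids[top_idx]
```

Each row of `top_movies` lists one user's recommended movie ids, ordered from highest to lowest predicted rating.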

In [67]:
# Getting the recommended movies
pred_ratings_t = pred_ratings.transpose()
users = pred_ratings_t.columns.tolist()
recommendations = {}
for u in users:
  recommendations[u] = pred_ratings_t[[u]].sort_values(u, ascending = False).head(10).index.tolist()


In [68]:
recommendations_df = pd.DataFrame(columns =['user', 'movie_id', 'movie_title', 'pred_rating', 'actual_rating'])

for u in users:
  temp_df = pd.DataFrame(columns = ['user', 'movie_id', 'movie_title', 'pred_rating', 'actual_rating'])
  
  # Appending user id
  temp_df['user'] = [u] * 10
  
  # Appending movie id
  temp_df['movie_id'] = recommendations[u]
  
  # Appending movie titles, ratings
  movie_titles = []
  pred_rating = []
  true_rating = []
  for m in recommendations[u]:
    movie_titles.append(item[item['item_id'] == m]['title'].values[0])    
    pred_rating.append(pred_ratings.loc[u][m] * 5)
    true_rating.append(actual_ratings.loc[u][m] * 5)
    
  temp_df['movie_title'] = movie_titles
  temp_df['pred_rating'] = pred_rating
  temp_df['actual_rating'] = true_rating
  
  # Appending to overall df
  # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
  recommendations_df = pd.concat([recommendations_df, temp_df], ignore_index = True)
  

In [69]:
recommendations_df.head(5)

Unnamed: 0,user,movie_id,movie_title,pred_rating,actual_rating
0,405,201,Evil Dead II (1987),2.187047,1.0
1,405,673,Cape Fear (1962),1.726695,5.0
2,405,712,Tin Men (1987),1.616848,1.0
3,405,214,Pink Floyd - The Wall (1982),1.558061,4.0
4,405,446,Burnt Offerings (1976),1.29794,1.0


As you can see, the simple autoencoder model does not predict the actual ratings well. The reason, as I understand it, is that the loss function that Keras uses to train the model makes no distinction between the zero values and the movies that were actually rated.

Hence, the model does not properly pick up the pattern in the users’ ratings (due to the zeros). As an improvement, PyTorch could be used, since it gives more flexibility to play with loss functions.