# Music Recommendation

In this notebook I create and train a model for the dataset at https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data. The aim is to predict whether a user will listen to a song a second time within a month of the first listen. A recommendation system can be built on top of this prediction.

#### The structure of the data:

**train.csv**

msno: user id
song_id: song id
source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search.
source_screen_name: name of the layout a user sees.
source_type: the entry point from which a user first plays music in the mobile apps. An entry point could be album, online-playlist, song, etc.
target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, target=0 otherwise.

**songs.csv**

The songs. Note that data is in unicode.

song_id
song_length: in ms
genre_ids: genre category. Some songs have multiple genres and they are separated by |
artist_name
composer
lyricist
language

**members.csv**

user information.

msno
city
bd: age. Note: this column has outlier values, please use your judgement.
gender
registered_via: registration method
registration_init_time: format %Y%m%d
expiration_date: format %Y%m%d

#### Load Python libraries

In [1]:
#1 for data preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import numpy as np
import pandas as pd

#2 for building a model 
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow.keras.backend as K

## Data Preparation

#### Loading the data. A 1% sample is used to speed up execution; the notebook can also be run on the full dataset

In [2]:
# Load data
df = pd.read_csv('./data/train.csv')

# 1% sample of items
df = df.sample(frac=0.01)


#### Join data of songs, members and events (train) into one dataframe

In [3]:
# Load and join songs data
songs = pd.read_csv('./data/songs.csv')
df = pd.merge(df, songs, on='song_id', how='left')
del songs

# Load and join members data
members = pd.read_csv('./data/members.csv')
df = pd.merge(df, members, on='msno', how='left')
del members

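#### Check how much data is missing in each column: missing values as a % of the non-missing count (`df.isna().mean()*100` would give the share of all rows instead)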

In [4]:
df.isna().sum()/df.count()*100

msno                       0.000000
song_id                    0.000000
source_system_tab          0.342755
source_screen_name         5.960588
source_type                0.293646
target                     0.000000
song_length                0.002711
genre_ids                  1.548542
artist_name                0.002711
composer                  29.056749
lyricist                  75.193541
language                   0.005422
city                       0.000000
bd                         0.000000
gender                    66.431295
registered_via             0.000000
registration_init_time     0.000000
expiration_date            0.000000
dtype: float64
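The figures above divide the number of missing values by the number of non-missing values, which is not the same as the share of all rows. A small worked example for the `composer` column, using the row counts reported by `df.info()` further down in this notebook (the variable names are mine, for illustration only):

```python
# Counts for the `composer` column, taken from the df.info() output in this notebook
total_rows = 73774
non_null = 57164
missing = total_rows - non_null

# What df.isna().sum() / df.count() * 100 computes: missing relative to non-missing
ratio_to_present = missing / non_null * 100

# Share of all rows that are missing (what df.isna().mean() * 100 would report)
share_of_rows = missing / total_rows * 100

print(f"missing / non-missing: {ratio_to_present:.2f}%")
print(f"missing / all rows:    {share_of_rows:.2f}%")
```

The first number reproduces the value printed for `composer` above; the second is noticeably smaller, so the two should not be confused when deciding which columns to drop.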

#### Check what the data looks like

In [5]:
df.head(5)

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,target,song_length,genre_ids,artist_name,composer,lyricist,language,city,bd,gender,registered_via,registration_init_time,expiration_date
0,R2FvE4x+hefwL4evbgNGNCng13c1MkP3PHocNtU91qU=,0FbtVUWrMLk0Gl9WNMwwJbKTVmvzyt7aJUM532uvsps=,my library,Local playlist more,local-playlist,1,225048.0,921,Frozen,Robert Lopez| Kristen Anderson-Lopez,,52.0,5,47,male,9,20100728,20171218
1,zRunn9KBc0bXEkd7vQKIapskqhhKz9jK2VOIOScBPXI=,8esjTau8b/ZNlndJ1vRAvBLKjGcReGFVCXbS997tUZg=,listen with,Others profile more,song,0,274552.0,465,蔡淳佳 (Joi Chua),,,3.0,9,0,,4,20161230,20171101
2,abQj3Jp4gyZQtOovZdZeXgPS9uPQhDLJK2EohLAEqf0=,nPMmTq6c5pA9WMHXRJVjZxtnm3l06GjTHWCAo2IG5JI=,my library,Local playlist more,local-library,1,376721.0,465,Namie Amuro (安室奈美恵),TETSUYA KOMURO,TETSUYA KOMURO,17.0,5,31,,9,20141210,20180327
3,3kfP6l/SfLV7vhNmKMXmw2WWbJ6YHAekgyJtBGCzPSo=,xNnUQnABeXxUF59mngBVVcBpGiBvkFKIHR3Hkj+rGW0=,my library,Local playlist more,local-library,1,266518.0,458,蕭閎仁,蕭閎仁,蕭閎仁+林尚德,3.0,1,0,,7,20131220,20170930
4,u3g6JVGfbP3kxvJrdUX4jDyF8+eEDl4CE7k2Mpn+/As=,zHqZ07gn+YvF36FWzv9+y8KiCMhYhdAUS+vSIKY3UZY=,discover,Discover Feature,online-playlist,1,189361.0,1616|1609,Alan Walker,Alan Walker| Jesper Borgen| Anders Froen| Gunn...,Alan Walker| Jesper Borgen| Anders Froen| Gunn...,52.0,15,32,male,9,20111127,20170710


#### Drop the columns with a lot of missing values and also registration, expiration dates and song_length, just for simplification

I did it just to save time on data cleaning and preparation. These columns can easily be included if necessary.

In [6]:
df.drop(['lyricist','gender','registration_init_time','expiration_date','song_length'],axis=1,inplace=True)

#### Rename bd to age and clean the age column by replacing unrealistic values with 0

In [7]:
df.rename(columns={'bd':'age'},inplace=True)

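# Keep ages strictly between 10 and 75, set everything else to 0 (assigned back instead of chained inplace)
df['age'] = df['age'].where((df['age']<75) & (df['age']>10), 0)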

#### How much data is missing from the age column in %

In [8]:
df['age'][df['age']==0].count()/df['age'].count()*100

39.74435437959173

#### Create a column that indicates whether the age is missing (0 or 1)

In [9]:
df['missing_age']=(df['age']==0).astype('int8')
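The cleaning rule and the indicator column can be illustrated without pandas. This is a minimal sketch on hypothetical ages (the list `raw_ages` is made up), using the same bounds as the cell above:

```python
# Hypothetical raw ages, including the kinds of outliers the data description warns about
raw_ages = [47, 0, 31, -5, 102, 10, 75, 28]

# Keep ages strictly between 10 and 75; everything else becomes 0 ("missing")
clean_ages = [age if 10 < age < 75 else 0 for age in raw_ages]

# Binary indicator so the model can tell a real age from the 0 placeholder
missing_age = [int(age == 0) for age in clean_ages]

print(clean_ages)
print(missing_age)
```

Note that the bounds are exclusive, so ages of exactly 10 or 75 are also treated as missing.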

#### Create a list of all columns that will be treated as categorical

In [10]:
cat_col=['msno', 'song_id', 'source_system_tab', 'source_screen_name',
        'source_type', 'genre_ids', 'artist_name', 'composer',
        'language','registered_via','city']

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73774 entries, 0 to 73773
Data columns (total 14 columns):
msno                  73774 non-null object
song_id               73774 non-null object
source_system_tab     73522 non-null object
source_screen_name    69624 non-null object
source_type           73558 non-null object
target                73774 non-null int64
genre_ids             72649 non-null object
artist_name           73772 non-null object
composer              57164 non-null object
language              73770 non-null float64
city                  73774 non-null int64
age                   73774 non-null int64
registered_via        73774 non-null int64
missing_age           73774 non-null int8
dtypes: float64(1), int64(4), int8(1), object(8)
memory usage: 8.0+ MB


#### Convert columns to categorical and replace NA with additional category 'unknown'. Add this new category to all categorical columns even if they have no NA values

In [12]:
for col in cat_col:
    df[col] = df[col].astype('category')
    # Assign the results back: the inplace variants are not available in recent pandas
    if 'unknown' not in df[col].cat.categories:
        df[col] = df[col].cat.add_categories(['unknown'])
    df[col] = df[col].fillna('unknown')

#### Train test split

In [13]:
train,test=train_test_split(df,test_size=0.1)

#### Create a dict mapping each category to its category index, for use when serving the model

In [14]:
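# Map each category to its integer code, i.e. its position in cat.categories
# (cat.codes holds one code per row, so it cannot be zipped with the categories)
categories_map={col:{cat:code for code,cat in enumerate(train[col].cat.categories)}\
               for col in train.select_dtypes(include=['category']).columns}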
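A category-to-index map pairs each category with its position in the list of categories. That is different from pairing the categories with the per-row codes, which have one entry per data row. A minimal pure-Python sketch with hypothetical values (the `rows` list and the variable names are assumptions for illustration):

```python
# Hypothetical column values; pandas sorts categories, so sorted() is used here,
# and the extra 'unknown' category is appended at the end
rows = ["song", "album", "song", "radio", "album", "song"]
categories = sorted(set(rows)) + ["unknown"]

# Correct map: each category paired with its position in the category list
category_to_index = {cat: idx for idx, cat in enumerate(categories)}

# Per-row codes, analogous to Series.cat.codes: one entry per row, not per category
row_codes = [category_to_index[value] for value in rows]

# Zipping categories with per-row codes pairs unrelated items and gives a wrong map
mismatched = dict(zip(categories, row_codes))

print(category_to_index)
print(mismatched)
```

The second dictionary assigns the same index to two different categories, which would silently corrupt the inputs at serving time.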

#### Save dict for serving the model

In [15]:
cat_map_filename = "cat_map.pkl"
joblib.dump(categories_map, cat_map_filename) 

['cat_map.pkl']

#### Creating target values for model prediction and removing target from the data sets

In [16]:
YY_train=np.array(train['target'])
YY_test =np.array(test['target'])

train.drop('target',inplace=True,axis=1)
test.drop('target',inplace=True,axis=1)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


#### In the test set categories that were not present in the train set should be set to 'unknown'

In [17]:
for col in test.select_dtypes(include=['category']).columns:
    # train and test share the same category list through the dtype,
    # so compare against the values actually observed in train
    seen = train[col].unique()
    test[col] = test[col].where(test[col].isin(seen), 'unknown')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
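The same fallback is needed when serving the model: a value that has no entry in the saved category map should be encoded as 'unknown'. A minimal sketch, assuming a map of the form {category: index}; the helper `encode` and the example values are hypothetical:

```python
def encode(value, category_to_index, fallback="unknown"):
    """Return the index for value, or the fallback index for unseen or missing values."""
    return category_to_index.get(value, category_to_index[fallback])

# Hypothetical map for one categorical column
category_to_index = {"album": 0, "radio": 1, "song": 2, "unknown": 3}

# 'podcast' was never seen and None stands for a missing value
incoming = ["song", "podcast", None, "album"]
codes = [encode(value, category_to_index) for value in incoming]

print(codes)
```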


#### Create datasets as dictionaries of numpy arrays with the right dimensions for training and testing, for categorical and binary columns

In [18]:
#create data for prediction
XX_train={}
for col in train.select_dtypes(include=['category']).columns:
    XX_train[col]=np.expand_dims(np.array(train[col].cat.codes),axis=-1)

for col in train.select_dtypes(include=['int8']).columns:
    XX_train[col]=np.expand_dims(np.array(train[col].values),axis=-1)

In [19]:
XX_test={}
for col in test.select_dtypes(include=['category']).columns:
    XX_test[col]=np.expand_dims(np.array(test[col].cat.codes),axis=-1)

for col in test.select_dtypes(include=['int8']).columns:
    XX_test[col]=np.expand_dims(np.array(test[col].values),axis=-1)

#### Numerical columns should be scaled. In the present case there is just one numerical column, 'age'

In [20]:
scaler_age=MinMaxScaler()

#### Create and scale the age column in the data

In [21]:
scaler_age=MinMaxScaler()

XX_train['age'] = np.array(train['age'],dtype='float32')

XX_train['age']=np.expand_dims(XX_train['age'],axis=-1)

XX_train['age']=scaler_age.fit_transform(XX_train['age'])

#### Same for the test data, except that the scaler is not refitted, to avoid data leakage

In [22]:
XX_test['age'] = np.array(test['age'],dtype='float32')

XX_test['age']=np.expand_dims(XX_test['age'],axis=-1)

XX_test['age']=scaler_age.transform(XX_test['age'])
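To make the fit/transform split concrete, here is a minimal pure-Python sketch of min-max scaling, x' = (x - min) / (max - min), where min and max come from the training data only. The helper names and the example ages are hypothetical, and this is an illustration of the idea rather than a replacement for `MinMaxScaler`:

```python
def fit_min_max(values):
    """Learn the scaling parameters from the training data only."""
    return min(values), max(values)

def transform(values, lo, hi):
    """Apply already learned parameters; a constant column maps to 0.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Hypothetical cleaned ages (0 is the placeholder for a missing age)
train_ages = [0, 18, 25, 40, 74]
test_ages = [0, 30, 74]

lo, hi = fit_min_max(train_ages)              # fitted on train only
train_scaled = transform(train_ages, lo, hi)
test_scaled = transform(test_ages, lo, hi)    # reuses the train parameters

print(train_scaled)
print(test_scaled)
```

Because the parameters come from the training data, a test value outside the training range would be mapped outside [0, 1]; that is expected and is the price of not looking at the test set.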


#### Save scaler to use it later for serving the model

In [23]:
scaler_filename = "scaler.pkl"
joblib.dump(scaler_age, scaler_filename) 

['scaler.pkl']

#### Create a dictionary containing the numbers of categories in each column and a list of numerical features. It will be necessary for building the model later

In [24]:
cat_dim_train={col:len(train[col].cat.categories) for col in train.select_dtypes(include=['category']).columns}
cont_features=[col for col in train.select_dtypes(exclude=['category']).columns if col != 'id'] #and col!='target'

## Building the model

#### Defining inputs for categorical data as a dictionary of the form {feature: input_tensor}

In [25]:
def build_cat_inputs(cat_dim):
    return {feature:Input((1,),name=feature,dtype='int64') for feature in cat_dim}       

In [26]:
cat_inputs=build_cat_inputs(cat_dim_train)

#### Defining inputs for numerical data in the same form

In [27]:
def build_cont_inputs(cont_features):
    return {feature:Input((1,),name=feature,dtype='float32') for feature in cont_features}

In [28]:
cont_inputs=build_cont_inputs(cont_features)

#### Creating function that builds the model

This function returns the model and the sorted list of features, which is needed to pass the input arrays to the model in the correct order. I've used the functional API of Keras.

In [29]:
def build_model(cat_inputs,cont_inputs,cat_dim):
    
    inputs=cat_inputs.copy()
    inputs.update(cont_inputs)
    
    #list of ordered features to define correct form of the input data
    features=sorted(list(inputs.keys()))

    vec=cont_inputs.copy()
    for key in cat_inputs.keys():
        
        # Embed categorical values; the embedding dimension is number_of_categories//2, capped at 50
        vec[key]=Embedding(cat_dim[key],min(cat_dim[key]//2,50))(cat_inputs[key])
        
        # Reshape the embedding vectors to get rid of extra dimension and unify the shape of all vectors
        vec[key]=Lambda(lambda x: K.squeeze(x,axis=1))(vec[key])

    # Concatenate all numerical inputs and embedding vectors
    x=Concatenate(axis=-1)([vec[feature] for feature in features])
    
    # Stack of fully connected layers with BatchNorm and Dropout in between
    x=BatchNormalization()(x)
    x=Dense(128,activation='relu')(x)
    x=Dropout(0.5)(x)
    x=Dense(128,activation='relu')(x)
    x=BatchNormalization()(x)
    x=Dense(128,activation='relu')(x)
    x=Dropout(0.5)(x)
    x=Dense(128,activation='relu')(x)
    out=Dense(1,activation='sigmoid')(x)
    
    model=Model(inputs=[inputs[feature] for feature in features], outputs=[out])
        
    return model,features
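The embedding size rule used in `build_model` is easy to check by hand. The sketch below applies the same `min(n // 2, 50)` rule to hypothetical category counts (the numbers in `cat_dim` are made up) and shows how many weights each embedding table would hold, since an embedding table has one row per category:

```python
def embedding_dim(n_categories, cap=50):
    """Half the number of categories (integer division), but never more than cap."""
    return min(n_categories // 2, cap)

# Hypothetical category counts per column, for illustration only
cat_dim = {"msno": 15000, "city": 22, "registered_via": 6}

for name, n_categories in cat_dim.items():
    dim = embedding_dim(n_categories)
    n_weights = n_categories * dim  # one vector of length dim per category
    print(f"{name}: {n_categories} categories -> dim {dim}, {n_weights} weights")
```

With this rule the high-cardinality columns such as user and song ids dominate the parameter count, and a column with a single category would get dimension 0, so the rule assumes at least two categories per column.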

In [30]:
mmodel,features=build_model(cat_inputs,cont_inputs,cat_dim_train)

In [31]:
mmodel.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
artist_name (InputLayer)        (None, 1)            0                                            
__________________________________________________________________________________________________
city (InputLayer)               (None, 1)            0                                            
__________________________________________________________________________________________________
composer (InputLayer)           (None, 1)            0                                            
__________________________________________________________________________________________________
genre_ids (InputLayer)          (None, 1)            0                                            
__________________________________________________________________________________________________
language (

#### Organizing training and testing datasets in correctly ordered lists for feeding into model.

That's where the list of features we got together with the model is useful

In [32]:
X_train=[XX_train[feature] for feature in features]

In [33]:
X_test=[XX_test[feature] for feature in features]
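The reason for carrying the `features` list around is that a dictionary of arrays has no guaranteed relation to the order of the model inputs, while a list does. A minimal sketch with hypothetical arrays (plain nested lists stand in for the numpy arrays):

```python
# Hypothetical per-feature inputs, each of shape (n_samples, 1)
inputs_by_name = {
    "city": [[5], [9]],
    "age": [[0.5], [0.0]],
    "artist_name": [[12], [7]],
}

# Same rule as in build_model: the feature names in sorted order
features = sorted(inputs_by_name)

# Ordered list that lines up one-to-one with the model inputs
ordered_inputs = [inputs_by_name[name] for name in features]

print(features)
print(ordered_inputs)
```

At serving time the saved feature list plays the same role: incoming data is arranged in that order before calling `predict`.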

## Training the model

In [34]:
opt=Adam(0.0005)

#### Use early stopping to avoid overfitting

In [35]:
call_back=keras.callbacks.EarlyStopping()

In [36]:
mmodel.compile(opt,loss='binary_crossentropy',metrics=['accuracy'])

In [37]:
mmodel.fit(X_train,YY_train,batch_size=128,epochs=5,validation_data=(X_test,YY_test),callbacks=[call_back])

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 66396 samples, validate on 7378 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5


<tensorflow.python.keras.callbacks.History at 0x1df4b095390>

#### Check the performance on test data

In [38]:
predict_labels=np.floor(mmodel.predict(X_test)+1./2.)
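The expression `floor(p + 1/2)` is a compact way of thresholding the predicted probability at 0.5. A small check on hypothetical probabilities, using only the standard library:

```python
import math

# Hypothetical predicted probabilities
probs = [0.1, 0.49, 0.5, 0.51, 0.93]

# The notebook's rule: add one half and round down
by_floor = [math.floor(p + 0.5) for p in probs]

# The equivalent explicit threshold
by_threshold = [int(p >= 0.5) for p in probs]

print(by_floor)
print(by_threshold)
```

Writing the threshold explicitly makes it easier to try other cut-offs if precision and recall need to be traded off differently.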

In [39]:
print(metrics.classification_report(YY_test, predict_labels))

             precision    recall  f1-score   support

          0       0.62      0.58      0.60      3669
          1       0.61      0.65      0.63      3709

avg / total       0.62      0.62      0.62      7378
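For reference, the columns of the report follow from simple counts of true positives, false positives and false negatives. A minimal sketch on hypothetical labels (the helper `precision_recall_f1` and the label lists are illustrative, not the data above):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical true and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Calling the helper with `positive=0` gives the row for the other class.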



#### Save the model

In [40]:
mmodel.save('music_rec.hdf5')

#### Save the features list for serving the model

In [41]:
features_filename = "features.pkl"
joblib.dump(features, features_filename) 

['features.pkl']

#### Check that the saved model can be loaded without problems

In [42]:
model1=load_model('music_rec.hdf5',compile=False)

In [43]:
predict_labels=np.floor(model1.predict(X_test)+1./2.)

In [44]:
print(metrics.classification_report(YY_test, predict_labels))

             precision    recall  f1-score   support

          0       0.62      0.58      0.60      3669
          1       0.61      0.65      0.63      3709

avg / total       0.62      0.62      0.62      7378



#### Save the uncompiled model to discard the training information and save space (the model file size drops from 38 MB to 12 MB)

In [45]:
model1.save('music_rec.hdf5')