# Music Recommendation Project

This is the first section of the Capstone Project for Udacity's Machine Learning Engineer Nanodegree.

This notebook imports the raw [data available from Kaggle](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data), explores and cleans it, engineers new features, and saves the resulting features for training in the next section.

Author: Ben Walsh \
February 7, 2021

## Contents

1. [Data Import](#data-import)
2. [Data Exploration](#data-explore)
3. [Data Cleaning](#data-clean)
4. Feature Engineering
5. Feature Selection
6. [Save Data](#save-data)

## <a class="anchor" id="data-import"></a>1. Data Import

### Import libraries

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

First, define the paths to the raw input files: training/test data, member data, and song data.

In [2]:
train_file = './data-input-raw/train.csv'
member_file = './data-input-raw/members.csv'
song_file = './data-input-raw/songs.csv'

### Import training/test data

In [3]:
if os.path.exists(train_file):
    train_data = pd.read_csv(train_file)
else:
    print('Train data file {} not found!'.format(train_file))

### Import member data

In [4]:
if os.path.exists(member_file):
    member_data = pd.read_csv(member_file)
else:
    print('Member data file {} not found!'.format(member_file))

### Import song data

In [5]:
if os.path.exists(song_file):
    song_data = pd.read_csv(song_file)
else:
    print('Song data file {} not found!'.format(song_file))
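
The three import cells above print a message when a file is missing, but later cells would then fail with a `NameError` on the undefined DataFrame. One possible alternative, sketched below, is to stop immediately with an exception. The helper name `load_csv_or_fail` and the temporary demo file are hypothetical and are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_csv_or_fail(path):
    """Read a CSV into a DataFrame, raising immediately if the file is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError('Data file {} not found!'.format(path))
    return pd.read_csv(path)


# Demonstrate on a small temporary file rather than the real Kaggle data
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'demo.csv')
    pd.DataFrame({'msno': ['a', 'b'], 'target': [1, 0]}).to_csv(demo_file, index=False)

    # Existing file: loads normally
    demo_data = load_csv_or_fail(demo_file)

    # Missing file: raises instead of silently continuing
    try:
        load_csv_or_fail(os.path.join(tmp_dir, 'missing.csv'))
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

With this pattern, a call such as `load_csv_or_fail(train_file)` would halt the notebook at the import step rather than several cells later.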

## <a class="anchor" id="data-explore"></a>2. Data Exploration

### Training Data

In [6]:
train_data.head()

| | msno | song_id | source_system_tab | source_screen_name | source_type | target |
|---|---|---|---|---|---|---|
| 0 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | explore | Explore | online-playlist | 1 |
| 1 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM= | my library | Local playlist more | local-playlist | 1 |
| 2 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY= | my library | Local playlist more | local-playlist | 1 |
| 3 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | 2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs= | my library | Local playlist more | local-playlist | 1 |
| 4 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | 3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc= | explore | Explore | online-playlist | 1 |


In [7]:
train_data.describe()

| | target |
|---|---|
| count | 7377418.0 |
| mean | 0.5035171 |
| std | 0.4999877 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 1.0 |


In [8]:
print('Train data # of data points: {}'.format(train_data.shape[0]))
print('Train data # of columns: {}'.format(train_data.shape[1]))
print('Percent of training target values that are 1: {:.2f}'.format(100*train_data['target'].sum()/train_data.shape[0]))


Train data # of data points: 7377418
Train data # of columns: 6
Percent of training target values that are 1: 50.35


#### Observations

Target values are well balanced between 0 and 1 (50.35% are 1). All `source_*` features are text-valued, so any that are kept will have to be re-encoded as numeric variables.
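
Both observations can be checked programmatically. The sketch below uses a small made-up DataFrame (`toy_train` is hypothetical, not the Kaggle data) to compute the share of each target class and to list the non-numeric columns that would need encoding.

```python
import pandas as pd

# Hypothetical miniature stand-in for train_data
toy_train = pd.DataFrame({
    'source_system_tab': ['explore', 'my library', 'my library', 'discover'],
    'source_type': ['online-playlist', 'local-playlist', 'local-playlist', 'radio'],
    'target': [1, 1, 0, 0],
})

# Fraction of rows in each target class (the fractions sum to 1)
target_share = toy_train['target'].value_counts(normalize=True)

# Columns that are not numeric and therefore need encoding before modelling
text_cols = [
    col for col in toy_train.columns
    if not pd.api.types.is_numeric_dtype(toy_train[col])
]
```

Applied to the real `train_data`, `text_cols` would be expected to include the `source_*` columns as well as the `msno` and `song_id` identifiers.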

### Member Data

In [9]:
member_data.head()

| | msno | city | bd | gender | registered_via | registration_init_time | expiration_date |
|---|---|---|---|---|---|---|---|
| 0 | XQxgAYj3klVKjR3oxPPXYYFp4soD4TuBghkhMTD4oTw= | 1 | 0 | | 7 | 20110820 | 20170920 |
| 1 | UizsfmJb9mV54qE9hCYyU07Va97c0lCRLEQX3ae+ztM= | 1 | 0 | | 7 | 20150628 | 20170622 |
| 2 | D8nEhsIOBSoE6VthTaqDX8U6lqjJ7dLdr72mOyLya2A= | 1 | 0 | | 4 | 20160411 | 20170712 |
| 3 | mCuD+tZ1hERA/o5GPqk38e041J8ZsBaLcu7nGoIIvhI= | 1 | 0 | | 9 | 20150906 | 20150907 |
| 4 | q4HRBfVSssAFS9iRfxWrohxuk9kCYMKjHOEagUMV6rQ= | 1 | 0 | | 4 | 20170126 | 20170613 |


In [10]:
member_data.describe()

| | city | bd | registered_via | registration_init_time | expiration_date |
|---|---|---|---|---|---|
| count | 34403.0 | 34403.0 | 34403.0 | 34403.0 | 34403.0 |
| mean | 5.371276 | 12.280935 | 5.953376 | 20139940.0 | 20169010.0 |
| std | 6.243929 | 18.170251 | 2.287534 | 29540.15 | 7320.925 |
| min | 1.0 | -43.0 | 3.0 | 20040330.0 | 19700100.0 |
| 25% | 1.0 | 0.0 | 4.0 | 20121030.0 | 20170200.0 |
| 50% | 1.0 | 0.0 | 7.0 | 20150900.0 | 20170910.0 |
| 75% | 10.0 | 25.0 | 9.0 | 20161100.0 | 20170930.0 |
| max | 22.0 | 1051.0 | 16.0 | 20170230.0 | 20201020.0 |


In [11]:
print('Member data # of data points: {}'.format(member_data.shape[0]))
print('Member data # of columns: {}'.format(member_data.shape[1]))
print('Number of unique cities that members come from: {}'.format(len(member_data['city'].unique())))
print('Percent of gender values that are empty: {:.2f}'.format(100*member_data['gender'].isna().sum()/len(member_data['gender'])))


Member data # of data points: 34403
Member data # of columns: 7
Number of unique cities that members come from: 21
Percent of gender values that are empty: 57.85


#### Observations

About 58% of gender values are missing. Dropping the rows with a NaN gender would therefore remove more than half of the member data, so the recommendation is to remove gender as a feature instead.
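
The same reasoning can be applied to every column at once. The following is a minimal sketch on made-up data; `toy_members` and the 50% cut-off `max_missing_pct` are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for member_data
toy_members = pd.DataFrame({
    'city': [1, 1, 13, 5],
    'gender': [np.nan, 'female', np.nan, np.nan],
})

# Percent of missing values in each column
missing_pct = 100 * toy_members.isna().mean()

# Assumed cut-off: drop any column with more than half of its values missing
max_missing_pct = 50
cols_to_drop = missing_pct[missing_pct > max_missing_pct].index.tolist()

toy_members_clean = toy_members.drop(columns=cols_to_drop)
```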

### Song Data

In [12]:
song_data.head()

| | song_id | song_length | genre_ids | artist_name | composer | lyricist | language |
|---|---|---|---|---|---|---|---|
| 0 | CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E= | 247640 | 465 | 張信哲 (Jeff Chang) | 董貞 | 何啟弘 | 3.0 |
| 1 | o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU= | 197328 | 444 | BLACKPINK | TEDDY\| FUTURE BOUNCE\| Bekuh BOOM | TEDDY | 31.0 |
| 2 | DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0= | 231781 | 465 | SUPER JUNIOR | | | 31.0 |
| 3 | dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE= | 273554 | 465 | S.H.E | 湯小康 | 徐世珍 | 3.0 |
| 4 | W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o= | 140329 | 726 | 貴族精選 | Traditional | Traditional | 52.0 |


In [13]:
song_data.describe()

| | song_length | language |
|---|---|---|
| count | 2296320.0 | 2296319.0 |
| mean | 246993.5 | 32.378 |
| std | 160920.0 | 24.33241 |
| min | 185.0 | -1.0 |
| 25% | 183600.0 | -1.0 |
| 50% | 226627.0 | 52.0 |
| 75% | 277269.0 | 52.0 |
| max | 12173850.0 | 59.0 |


In [14]:
print('Song data # of data points: {}'.format(song_data.shape[0]))
print('Song data # of columns: {}'.format(song_data.shape[1]))
print('Number of unique languages that songs come from: {}'.format(len(song_data['language'].unique())))
print('Percent of songs with missing genre info: {:.2f}%'.format(100*song_data['genre_ids'].isna().sum()/song_data.shape[0]))

Song data # of data points: 2296320
Song data # of columns: 7
Number of unique languages that songs come from: 11
Percent of songs with missing genre info: 4.10%


In [15]:
song_data['genre_ids'].value_counts()

465                    567911
958                    176349
2022                   168870
1609                   166457
2122                   139938
                        ...  
864|850|437|857|843         1
465|1011|139                1
465|2130|2122               1
1040|423                    1
829|465                     1
Name: genre_ids, Length: 1045, dtype: int64

#### Observations

About 4% of genre values are missing, which is small enough that it may be easiest to remove those rows. Some genre entries contain multiple IDs separated by '|'. These will have to be cleaned to be consistent. The classes are also very imbalanced: the five most prevalent IDs dominate the distribution, particularly the most frequent one, ID 465.
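
To see how often each individual genre ID occurs, regardless of whether it appears alone or inside a '|'-separated entry, the entries can be split and counted. This is a sketch on a made-up Series (`toy_genres` is hypothetical), not a step in the original notebook.

```python
import pandas as pd

# Hypothetical sample of genre_ids values, including a missing entry
toy_genres = pd.Series(['465', '465|1011', '958', None, '465|958'], name='genre_ids')

# Split multi-valued entries, put each ID on its own row, then count occurrences
genre_counts = (
    toy_genres.dropna()
    .str.split('|')
    .explode()
    .value_counts()
)
```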

Artist, composer, and lyricist carry information, but genre is the most useful association available in this data. To simplify, these three columns will be dropped.

Language is encoded as a numeric variable, but the values are IDs rather than quantities, so they will have to be one-hot encoded.
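
One detail worth checking when counting language IDs: `len(series.unique())` counts NaN as a value, while `series.nunique()` excludes it by default. Since `describe()` reports one fewer language value than there are rows, the count of 11 printed above probably includes a NaN entry. The sketch below shows the difference on made-up data (`toy_language` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical language column with one missing value
toy_language = pd.Series([3.0, 52.0, -1.0, np.nan, 52.0], name='language')

# unique() keeps NaN as its own entry
n_with_nan = len(toy_language.unique())

# nunique() ignores NaN unless dropna=False is passed
n_without_nan = toy_language.nunique()
```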

## <a class="anchor" id="data-clean"></a>3. Data Cleaning

### Convert categorical variables to numerical variables

In [16]:
train_data['source_system_tab'].value_counts()

my library      3684730
discover        2179252
search           623286
radio            476701
listen with      212266
explore          167949
notification       6185
settings           2200
Name: source_system_tab, dtype: int64

Since `source_system_tab` has more than 2 unique values, use one-hot encoding.

In [17]:
# Generate one-hot encoding
source_system_tab_oh = pd.get_dummies(train_data['source_system_tab'])

# Concatenate new OH columns to dataframe
train_data = pd.concat([train_data, source_system_tab_oh], axis=1)

# Drop original feature
train_data = train_data.drop('source_system_tab', axis=1)

Also one-hot encode the song language, which should be a useful piece of information to predict whether a user listens to the song again.

In [18]:
# Generate one-hot encoding
language_oh = pd.get_dummies(song_data['language'])

# Concatenate new OH columns to dataframe
song_data = pd.concat([song_data, language_oh], axis=1)

# Drop original feature
song_data = song_data.drop('language', axis=1)
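
The cells above name the new columns after the raw category values, so the language columns end up with float names such as `3.0`. A possible variation, sketched here on made-up data, lets `pd.get_dummies` encode, rename, and drop the original column in one call. The `toy_songs` frame and the `language` prefix are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for song_data
toy_songs = pd.DataFrame({
    'song_id': ['s1', 's2', 's3'],
    'language': [3.0, 52.0, 3.0],
})

# Encode the column in place: the original 'language' column is replaced
# by one prefixed indicator column per category
toy_songs_oh = pd.get_dummies(toy_songs, columns=['language'], prefix='language')
```

Prefixed names make it easier to tell language indicators apart from `source_system_tab` indicators once the tables are merged.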

Drop rows with NaN values in `genre_ids`.

In [19]:
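# Keep only rows with a genre; copy() avoids SettingWithCopyWarning when adding columns later
song_data = song_data[song_data['genre_ids'].notna()].copy()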

## 4. Feature Engineering

Convert `genre_ids` to a numerical variable. To start, simplify by taking the first entry of each multi-valued ID.

In [20]:
# Vectorized: keep the first '|'-separated ID and cast it to an integer
song_data['genre_ids_num'] = song_data['genre_ids'].str.split('|').str[0].astype(int)

In [21]:
song_data = song_data.drop('genre_ids', axis=1)
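
Taking only the first entry discards the remaining genres of multi-genre songs. If that information turns out to matter, one alternative is a multi-hot encoding, where each genre ID gets its own indicator column. This is a sketch of that alternative on made-up data, not what the notebook does; `toy_songs` and the `genre_` prefix are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for song_data
toy_songs = pd.DataFrame({
    'song_id': ['s1', 's2', 's3'],
    'genre_ids': ['465', '465|1011', '958'],
})

# One indicator column per genre ID; a song with several genres gets several 1s
genre_multi_hot = toy_songs['genre_ids'].str.get_dummies(sep='|').add_prefix('genre_')
```

The trade-off is width: the real data would gain one column per distinct genre ID, whereas the first-entry approach adds a single column.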

## 5. Feature Selection

### Merge Data
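Merge the training data with member information and song information. `pd.merge` defaults to an inner join, so training rows without a match are dropped: the row count falls from 7,377,418 before the merges to 7,258,963 after them.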

In [22]:
train_data = pd.merge(train_data, member_data, on='msno')
train_data = pd.merge(train_data, song_data, on='song_id')
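
To see how many rows a merge would drop before committing to it, a left join with `indicator=True` can be inspected first. The sketch below uses made-up frames (`toy_train` and `toy_songs` are hypothetical) and also passes `validate='many_to_one'`, which raises an error if `song_id` is duplicated on the right-hand side.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train_data and song_data
toy_train = pd.DataFrame({
    'msno': ['u1', 'u1', 'u2'],
    'song_id': ['s1', 's2', 's3'],
    'target': [1, 0, 1],
})
toy_songs = pd.DataFrame({
    'song_id': ['s1', 's2'],
    'song_length': [247640, 197328],
})

# Default inner join: the row for 's3' disappears because no song matches
inner = pd.merge(toy_train, toy_songs, on='song_id')

# Left join with an indicator column keeps every training row and labels its match status
checked = pd.merge(
    toy_train, toy_songs, on='song_id',
    how='left', indicator=True, validate='many_to_one',
)
n_unmatched = int((checked['_merge'] == 'left_only').sum())
```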

In [23]:
train_data.head()

| | msno | song_id | source_screen_name | source_type | target | discover | explore | listen with | my library | notification | ... | 3.0 | 10.0 | 17.0 | 24.0 | 31.0 | 38.0 | 45.0 | 52.0 | 59.0 | genre_ids_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | Explore | online-playlist | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 359 |
| 1 | pouJqjNRmZOnRNzzMWWkamTKkIGHyvhl/jo4HgbncnM= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | Online playlist more | online-playlist | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 359 |
| 2 | xbodnNBaLMyqqI7uFJlvHOKMJaizuWo/BB/YHZICcKo= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | Local playlist more | local-library | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 359 |
| 3 | s0ndDsjI79amU0RBiullFN8HRz9HjE++34jGNa7zJ/s= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | Local playlist more | local-library | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 359 |
| 4 | Vw4Umh6/qlsJDC/XMslyAxVvRgFJGHr53yb/nrmY1DU= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | Local playlist more | local-library | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 359 |


In [24]:
train_data.describe()

| | target | discover | explore | listen with | my library | notification | radio | search | settings | city | ... | -1.0 | 3.0 | 10.0 | 17.0 | 24.0 | 31.0 | 38.0 | 45.0 | 52.0 | 59.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | ... | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 | 7258963.0 |
| mean | 0.5037226 | 0.2942685 | 0.02263726 | 0.02870686 | 0.5003621 | 0.0008213294 | 0.0653457 | 0.08418641 | 0.0002978387 | 7.510791 | ... | 0.04197983 | 0.544693 | 0.02301279 | 0.03334374 | 0.01080499 | 0.08984203 | 2.038859e-05 | 0.0002241367 | 0.255503 | 0.000571156 |
| std | 0.4999862 | 0.4557133 | 0.1487441 | 0.1669814 | 0.4999999 | 0.02864708 | 0.2471349 | 0.2776672 | 0.01725543 | 6.641459 | ... | 0.2005431 | 0.4979986 | 0.149944 | 0.1795326 | 0.1033839 | 0.2859553 | 0.004515327 | 0.01496952 | 0.4361436 | 0.02389205 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 22.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Select Data
Remove IDs from input feature data

In [25]:
drop_id_cols = ['msno', 'song_id']

In [26]:
train_data = train_data.drop(drop_id_cols, axis=1)

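Remove features that will not be used as model inputs: `gender` and the artist, composer, and lyricist columns flagged in the observations above, plus `source_screen_name` and `source_type`, which were not encoded and remain text-valued.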

In [27]:
drop_bad_feat_cols = ['source_screen_name', 'source_type', 'gender', 'artist_name', 'composer', 'lyricist']
train_data = train_data.drop(drop_bad_feat_cols, axis=1)

In [28]:
y_col = 'target'
X = train_data.drop(y_col, axis=1)
y = train_data[y_col]

For algorithm evaluation, split the available training data into training and test sets, holding out 20% for testing.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
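
The split above is random, so it produces different rows each time the notebook runs. If reproducibility is wanted, `train_test_split` accepts a `random_state` seed, and `stratify` keeps the class proportions equal in both sets. The sketch below uses made-up data; `toy_X`, `toy_y`, and the seed value 0 are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature feature matrix and perfectly balanced target
toy_X = pd.DataFrame({'song_length': range(10)})
toy_y = pd.Series([0, 1] * 5, name='target')

# A fixed seed makes the split repeatable; stratify preserves the 50/50 class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# Running the same call again yields exactly the same test rows
_, X_te_again, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)
```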

## <a class="anchor" id="save-data"></a> 6. Save Data

In [30]:
# Create the output directory if it does not already exist
os.makedirs('./data-input-clean', exist_ok=True)
X_train.to_csv('./data-input-clean/X_train.csv', index=False)
X_test.to_csv('./data-input-clean/X_test.csv', index=False)
y_train.to_csv('./data-input-clean/y_train.csv', index=False)
y_test.to_csv('./data-input-clean/y_test.csv', index=False)