# Music Recommendation Project

This notebook covers data pre-processing for the KKBox music recommendation challenge: import, exploration, cleaning, and feature engineering, ending with saving the resulting feature data. The [data is available on Kaggle](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data).

Author: Ben Walsh \
February 7, 2020

## Contents

1. [Data Import](#data-import)
2. Data Exploration
3. Data Cleaning
4. Feature Engineering
5. Saving Data

## <a class="anchor" id="data-import"></a>1. Data Import

### Import libraries

In [1]:
import pandas as pd
import os

First, define the paths to the three input files: training data, member data, and song data. Each file is then imported in its own cell.

In [85]:
train_file = './data-input/train.csv'
member_file = './data-input/members.csv'
song_file = './data-input/songs.csv'

### Import training data

In [88]:
if os.path.exists(train_file):
    train_data = pd.read_csv(train_file)
else:
    print('Training data file {} not found!'.format(train_file))

### Import member data

In [86]:
if os.path.exists(member_file):
    member_data = pd.read_csv(member_file)
else:
    print('Member data file {} not found!'.format(member_file))

### Import song data

In [87]:
if os.path.exists(song_file):
    song_data = pd.read_csv(song_file)
else:
    print('Song data file {} not found!'.format(song_file))
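
The three import cells above only print a message when a file is missing, which leaves the corresponding DataFrame undefined for later cells. As a minimal sketch of one alternative, the hypothetical helper `load_csv` below (not part of the original notebook) raises an error instead, so the problem surfaces at the import step:

```python
import os

import pandas as pd


def load_csv(path, label):
    """Hypothetical helper: read a CSV, or fail loudly if the file is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError('{} data file {} not found!'.format(label, path))
    return pd.read_csv(path)


# Example usage with the paths defined above (commented out so the sketch
# runs without the Kaggle files present):
# train_data = load_csv('./data-input/train.csv', 'Training')
# member_data = load_csv('./data-input/members.csv', 'Member')
# song_data = load_csv('./data-input/songs.csv', 'Song')
```

Whether to raise or print is a design choice; raising is usually preferable in a notebook that is run top to bottom, because every later cell depends on these DataFrames.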

## 2. Data Exploration

### Training Data

In [93]:
train_data.head()

| | msno | song_id | source_system_tab | source_screen_name | source_type | target |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | explore | Explore | online-playlist | 1 |
| 1 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM= | my library | Local playlist more | local-playlist | 1 |
| 2 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY= | my library | Local playlist more | local-playlist | 1 |
| 3 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | 2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs= | my library | Local playlist more | local-playlist | 1 |
| 4 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | 3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc= | explore | Explore | online-playlist | 1 |


In [90]:
train_data.describe()

| | target |
| --- | --- |
| count | 7377418.0 |
| mean | 0.5035171 |
| std | 0.4999877 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 1.0 |


In [94]:
print('Train data # of data points: {}'.format(train_data.shape[0]))
print('Train data # of columns: {}'.format(train_data.shape[1]))
print('Percent of training target values that are 1: {:.2f}'.format(100*train_data['target'].sum()/train_data.shape[0]))


Train data # of data points: 7377418
Train data # of columns: 6
Percent of training target values that are 1: 50.35


#### Observations

Target values are well balanced between 0 and 1: 50.35% of the training rows have a target of 1.
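
The same balance check can be read directly off the class proportions. The sketch below uses a small made-up frame (`toy_train`, illustrative values only) in place of `train_data`; `value_counts(normalize=True)` returns the fraction of rows in each class.

```python
import pandas as pd

# Toy stand-in for train_data; the values are illustrative only.
toy_train = pd.DataFrame({'target': [1, 0, 1, 1, 0, 0, 1, 0]})

# Fraction of rows in each class, ordered by class label.
class_share = toy_train['target'].value_counts(normalize=True).sort_index()
print(class_share)

# For a 0/1 column, the mean equals the fraction of rows that are 1.
share_of_ones = toy_train['target'].mean()
```

Because the mean of a 0/1 column is the fraction of ones, the mean of 0.5035 reported by `train_data.describe()` agrees with the 50.35% printed above.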

### Member Data

In [20]:
member_data.head()

| | msno | city | bd | gender | registered_via | registration_init_time | expiration_date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | XQxgAYj3klVKjR3oxPPXYYFp4soD4TuBghkhMTD4oTw= | 1 | 0 | NaN | 7 | 20110820 | 20170920 |
| 1 | UizsfmJb9mV54qE9hCYyU07Va97c0lCRLEQX3ae+ztM= | 1 | 0 | NaN | 7 | 20150628 | 20170622 |
| 2 | D8nEhsIOBSoE6VthTaqDX8U6lqjJ7dLdr72mOyLya2A= | 1 | 0 | NaN | 4 | 20160411 | 20170712 |
| 3 | mCuD+tZ1hERA/o5GPqk38e041J8ZsBaLcu7nGoIIvhI= | 1 | 0 | NaN | 9 | 20150906 | 20150907 |
| 4 | q4HRBfVSssAFS9iRfxWrohxuk9kCYMKjHOEagUMV6rQ= | 1 | 0 | NaN | 4 | 20170126 | 20170613 |


In [21]:
member_data.describe()

| | city | bd | registered_via | registration_init_time | expiration_date |
| --- | --- | --- | --- | --- | --- |
| count | 34403.0 | 34403.0 | 34403.0 | 34403.0 | 34403.0 |
| mean | 5.371276 | 12.280935 | 5.953376 | 20139940.0 | 20169010.0 |
| std | 6.243929 | 18.170251 | 2.287534 | 29540.15 | 7320.925 |
| min | 1.0 | -43.0 | 3.0 | 20040330.0 | 19700100.0 |
| 25% | 1.0 | 0.0 | 4.0 | 20121030.0 | 20170200.0 |
| 50% | 1.0 | 0.0 | 7.0 | 20150900.0 | 20170910.0 |
| 75% | 10.0 | 25.0 | 9.0 | 20161100.0 | 20170930.0 |
| max | 22.0 | 1051.0 | 16.0 | 20170230.0 | 20201020.0 |


In [39]:
print('Member data # of data points: {}'.format(member_data.shape[0]))
print('Member data # of columns: {}'.format(member_data.shape[1]))
print('Number of unique cities that members come from: {}'.format(len(member_data['city'].unique())))
print('Percent of gender values that are empty: {:.2f}'.format(100*member_data['gender'].isna().sum()/len(member_data['gender'])))


Member data # of data points: 34403
Member data # of columns: 7
Number of unique cities that members come from: 21
Percent of gender values that are empty: 57.85


#### Observations

Gender values are mostly empty (57.85% missing). Dropping the rows with a missing gender would discard more than half of the member data, so the recommendation is to remove gender as a feature instead.
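
The trade-off between dropping rows and dropping the column can be sketched on a small made-up frame. `toy_members` and its values are illustrative assumptions, not the real member data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for member_data; the values are illustrative only.
toy_members = pd.DataFrame({
    'msno': ['a', 'b', 'c', 'd'],
    'city': [1, 1, 13, 5],
    'gender': [np.nan, 'female', np.nan, 'male'],
})

# Percentage of missing values in each column.
missing_pct = 100 * toy_members.isna().mean()

# Dropping rows with a missing gender discards half of this toy frame...
rows_kept = toy_members.dropna(subset=['gender'])

# ...whereas dropping the column keeps every member.
members_no_gender = toy_members.drop(columns=['gender'])

print(missing_pct)
print(len(rows_kept), len(members_no_gender))
```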

### Song Data

In [40]:
song_data.head()

| | song_id | song_length | genre_ids | artist_name | composer | lyricist | language |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E= | 247640 | 465 | 張信哲 (Jeff Chang) | 董貞 | 何啟弘 | 3.0 |
| 1 | o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU= | 197328 | 444 | BLACKPINK | TEDDY\| FUTURE BOUNCE\| Bekuh BOOM | TEDDY | 31.0 |
| 2 | DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0= | 231781 | 465 | SUPER JUNIOR | NaN | NaN | 31.0 |
| 3 | dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE= | 273554 | 465 | S.H.E | 湯小康 | 徐世珍 | 3.0 |
| 4 | W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o= | 140329 | 726 | 貴族精選 | Traditional | Traditional | 52.0 |


In [41]:
song_data.describe()

| | song_length | language |
| --- | --- | --- |
| count | 2296320.0 | 2296319.0 |
| mean | 246993.5 | 32.378 |
| std | 160920.0 | 24.33241 |
| min | 185.0 | -1.0 |
| 25% | 183600.0 | -1.0 |
| 50% | 226627.0 | 52.0 |
| 75% | 277269.0 | 52.0 |
| max | 12173850.0 | 59.0 |


In [114]:
print('Song data # of data points: {}'.format(song_data.shape[0]))
print('Song data # of columns: {}'.format(song_data.shape[1]))
print('Number of unique languages that songs come from: {}'.format(len(song_data['language'].unique())))
print('Percent of songs with missing genre info: {:.2f}%'.format(100*song_data['genre_ids'].isna().sum()/song_data.shape[0]))

Song data # of data points: 2296320
Song data # of columns: 7
Number of unique languages that songs come from: 11
Percent of songs with missing genre info: 4.10%
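
One caveat on the language count: `song_data.describe()` reports 2,296,319 non-null `language` values against 2,296,320 rows, so at least one language value is missing. Since `unique()` includes NaN in its result, the figure of 11 presumably counts the missing value as a language. A minimal sketch of the difference, using a made-up Series (`toy_language`, illustrative values only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for song_data['language']; the values are illustrative only.
toy_language = pd.Series([3.0, 31.0, 31.0, 52.0, np.nan])

# unique() returns NaN as one of the distinct values.
count_with_nan = len(toy_language.unique())

# nunique() excludes NaN by default (dropna=True).
count_without_nan = toy_language.nunique()

print(count_with_nan, count_without_nan)
```

If the intent is to count actual languages, `nunique()` is likely the safer call.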


In [121]:
song_data['genre_ids'].value_counts()

465                 567911
958                 176349
2022                168870
1609                166457
2122                139938
                     ...  
921|829                  1
786|2086                 1
1609|465|1011            1
352|1995|430|359         1
465|1981                 1
Name: genre_ids, Length: 1045, dtype: int64

#### Observations

About 4% of songs (4.10%) have no genre information, which is small enough that it may be easiest to drop those rows. Some `genre_ids` entries contain multiple IDs separated by '|'; these will have to be cleaned into a consistent form. The genre distribution is also very imbalanced: the five most frequent IDs dominate, and the most frequent, 465, alone accounts for 567,911 of the 2,296,320 songs (about 25%).
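
The notebook does not fix a cleaning strategy for multi-valued genres at this point, so the following is only a sketch of two common options, shown on a made-up frame (`toy_songs`, illustrative values only): keep the first listed genre, or expand to one row per song and genre pair.

```python
import numpy as np
import pandas as pd

# Toy stand-in for song_data; the values are illustrative only.
toy_songs = pd.DataFrame({
    'song_id': ['s1', 's2', 's3', 's4'],
    'genre_ids': ['465', '921|829', np.nan, '1609|465|1011'],
})

# Remove songs with no genre information.
with_genre = toy_songs.dropna(subset=['genre_ids'])

# Option A: keep only the first listed genre for each song.
first_genre = with_genre['genre_ids'].str.split('|').str[0]

# Option B: one row per (song, genre) pair.
long_form = with_genre.assign(
    genre_id=with_genre['genre_ids'].str.split('|')
).explode('genre_id')

print(first_genre.tolist())
print(long_form[['song_id', 'genre_id']])
```

Option A keeps one row per song, which is simpler to merge with the training data; Option B preserves all genre information at the cost of duplicating songs.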