<a href="https://colab.research.google.com/github/Wizolingo/Music-Recommendation-System/blob/main/Music_Recommendation_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Music Recommendation Algorithm

The goal of this task is to predict the probability that a user will listen to a song again after their first observable listening event within a time window. In the training set, the target is 1 if one or more recurring listening events are triggered within a month of the user's first observable listening event, and 0 otherwise.
The two datasets used in this project are the KKBOX and Spotify datasets.

### KKBOX Data

KKBOX provides a training data set containing the first observable listening event for each unique user-song pair within a specific time period. Metadata for each user and each song is also provided.

Tables
#### train.csv
* msno: user id
* song_id: song id
* source_system_tab: the name of the tab where the event was triggered. System tabs categorize the functions of the KKBOX mobile apps. For example, the tab my library contains functions for manipulating local storage, and the tab search contains functions related to search.
* source_screen_name: name of the layout a user sees.
* source_type: the entry point from which a user first plays music in the mobile apps. An entry point could be album, online-playlist, song, etc.
* target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user's very first observable listening event, target=0 otherwise.

#### songs.csv
Song metadata. Note that the data is in Unicode.

* song_id
* song_length: in ms
* genre_ids: genre category. Some songs have multiple genres and they are separated by |
* artist_name
* composer
* lyricist
* language
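
Because `genre_ids` packs several genres into one pipe-delimited string, it usually has to be split before it can be used as a feature. The following is a minimal sketch on made-up rows, not part of the original notebook; the `genre_id` and `genre_count` names are introduced here only for illustration.

```python
import pandas as pd

# Toy rows standing in for songs.csv; the ids and genre values are made up.
songs = pd.DataFrame({
    "song_id": ["a", "b", "c"],
    "genre_ids": ["465", "1259|359", None],
})

# Split the pipe-delimited string into a list, then give each genre its own row.
genres_long = (
    songs.assign(genre_id=songs["genre_ids"].str.split("|"))
         .explode("genre_id")
         .drop(columns="genre_ids")
)

# Number of genres per song (0 where genre_ids is missing).
genre_count = songs["genre_ids"].str.count(r"\|").add(1).fillna(0).astype(int)

print(genres_long)
print(genre_count.tolist())
```

The long format suits per-genre aggregation, while the count is a single numeric column that can be attached directly to the song table.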

#### members.csv
User information.
* msno
* city
* bd: age. Note: this column has outlier values, please use your judgement.
* gender
* registered_via: registration method
* registration_init_time: format %Y%m%d
* expiration_date: format %Y%m%d

#### song_extra_info.csv
* song_id
* name: the name of the song.
* isrc: International Standard Recording Code, which in theory can be used to identify a song. Note, however, that ISRCs generated by providers have not been officially verified, so the information in an ISRC, such as the country code and reference year, can be misleading or incorrect. Multiple songs can share one ISRC, since a single recording may be re-published several times.
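
An ISRC is a 12-character code made up of a 2-letter country code, a 3-character registrant code, a 2-digit reference year, and a 5-digit designation code. The sketch below shows one way the country code and reference year could be pulled out; `parse_isrc` is a hypothetical helper, not something the notebook defines, and its output inherits the reliability caveats described above.

```python
import re

# ISRC layout: 2-letter country code, 3-character registrant code,
# 2-digit reference year, 5-digit designation code.
ISRC_PATTERN = re.compile(r"^([A-Z]{2})([A-Z0-9]{3})(\d{2})(\d{5})$")


def parse_isrc(isrc):
    """Return the ISRC fields as a dict, or None if the value is missing or malformed."""
    if not isinstance(isrc, str):
        return None  # e.g. NaN for a song without an ISRC
    match = ISRC_PATTERN.match(isrc.replace("-", "").upper())
    if match is None:
        return None
    country, registrant, year, designation = match.groups()
    return {
        "country": country,
        "registrant": registrant,
        "year_2digit": year,  # the century is not encoded in the ISRC
        "designation": designation,
    }


print(parse_isrc("GBUM71602854"))
```

The year is kept as a two-digit string because the code itself does not say which century it belongs to.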

#### sample_submission.csv
Sample submission file in the expected submission format.

* id: same as id in test.csv
* target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, target=0 otherwise.

### Spotify Data

The Spotify data contains information and audio features of songs in the dataset.

#### dataset
* track_id: Song id
* track_name: Song's title
* artists: Name of the artist(s)
* duration_ms: Duration of the song
* release_date: Date the song was released
* year: Year of release of the song
* acousticness: How acoustic the song is
* danceability: How easy it is to dance to the song
* energy: How energetic the song is
* instrumentalness: How instrumental in nature the song is
* liveness: How likely it is the song is a live recording
* loudness: How loud the song is
* speechiness: How much the song is focused on spoken word
* tempo: The tempo of the song
* valence: How positive the mood of the song is
* mode: Modality of the song (1 = major, 0 = minor)
* key: The musical key the song is played in
* popularity: Popularity of the song (not a ranking)
* explicit: Whether the song contains explicit lyrics or not
* genre: Genre of song

### Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import glob
from google.colab import drive
import warnings
warnings.filterwarnings('ignore')

### Loading Datasets

In [2]:
drive.mount('/content/drive')
folder_path = '/content/drive/MyDrive/kkbox_spotify_data/*.csv'
file_paths = glob.glob(folder_path)

Mounted at /content/drive


In [3]:
# Checking the file paths
file_paths

['/content/drive/MyDrive/kkbox_spotify_data/song_extra_info.csv',
 '/content/drive/MyDrive/kkbox_spotify_data/spotify_data_1921_to_2020.csv',
 '/content/drive/MyDrive/kkbox_spotify_data/members.csv',
 '/content/drive/MyDrive/kkbox_spotify_data/train.csv',
 '/content/drive/MyDrive/kkbox_spotify_data/songs.csv']

In [4]:
train_df = pd.read_csv('/content/drive/MyDrive/kkbox_spotify_data/train.csv')
songs_df = pd.read_csv('/content/drive/MyDrive/kkbox_spotify_data/songs.csv')
members_df = pd.read_csv('/content/drive/MyDrive/kkbox_spotify_data/members.csv')
songs_extra_info_df = pd.read_csv('/content/drive/MyDrive/kkbox_spotify_data/song_extra_info.csv')

### Data Preparation

#### Merging the datasets

In [5]:
t_s_merged = pd.merge(train_df, songs_df, on='song_id', how='left')
t_s_se_merged = pd.merge(t_s_merged, songs_extra_info_df, on='song_id', how='left')
songs_data = pd.merge(t_s_se_merged, members_df, on='msno', how='left')
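
A left join keeps the row count of `train_df` only if the join key is unique in the right-hand table. The sketch below, on made-up toy frames rather than the real files, shows how the `validate` argument of `pd.merge` could be used to check that assumption and how unmatched rows can be counted.

```python
import pandas as pd

# Toy stand-ins for train_df and songs_df; all values are made up.
train = pd.DataFrame({
    "msno": ["u1", "u2", "u1"],
    "song_id": ["s1", "s1", "s2"],
    "target": [1, 0, 1],
})
songs = pd.DataFrame({
    "song_id": ["s1", "s3"],
    "song_length": [200000, 180000],
})

# validate="many_to_one" raises MergeError if song_id is duplicated in `songs`,
# which would otherwise silently multiply the training rows.
merged = pd.merge(train, songs, on="song_id", how="left", validate="many_to_one")

# With a unique right-hand key, the left join preserves the left row count.
rows_preserved = len(merged) == len(train)

# Training rows with no match in `songs` get NaN in the song columns.
unmatched = int(merged["song_length"].isna().sum())

print(rows_preserved, unmatched)
```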

#### Reading the merged dataset

In [6]:
songs_data.head()

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,target,song_length,genre_ids,artist_name,composer,lyricist,language,name,isrc,city,bd,gender,registered_via,registration_init_time,expiration_date
0,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=,explore,Explore,online-playlist,1,206471.0,359,Bastille,Dan Smith| Mark Crew,,52.0,Good Grief,GBUM71602854,1,0,,7,20120102,20171005
1,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM=,my library,Local playlist more,local-playlist,1,284584.0,1259,Various Artists,,,52.0,Lords of Cardboard,US3C69910183,13,24,female,9,20110525,20170911
2,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY=,my library,Local playlist more,local-playlist,1,225396.0,1259,Nas,N. Jones、W. Adams、J. Lordan、D. Ingle,,52.0,Hip Hop Is Dead(Album Version (Edited)),USUM70618761,13,24,female,9,20110525,20170911
3,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs=,my library,Local playlist more,local-playlist,1,255512.0,1019,Soundway,Kwadwo Donkoh,,-1.0,Disco Africa,GBUQH1000063,13,24,female,9,20110525,20170911
4,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc=,explore,Explore,online-playlist,1,187802.0,1011,Brett Young,Brett Young| Kelly Archer| Justin Ebach,,52.0,Sleep Without You,QM3E21606003,1,0,,7,20120102,20171005


#### Checking the number of rows and columns

In [7]:
print('There are', songs_data.shape[0], 'rows and', songs_data.shape[1], 'columns in the data')

There are 7377418 rows and 20 columns in the data


#### More information on the data

In [8]:
songs_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7377418 entries, 0 to 7377417
Data columns (total 20 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   msno                    object 
 1   song_id                 object 
 2   source_system_tab       object 
 3   source_screen_name      object 
 4   source_type             object 
 5   target                  int64  
 6   song_length             float64
 7   genre_ids               object 
 8   artist_name             object 
 9   composer                object 
 10  lyricist                object 
 11  language                float64
 12  name                    object 
 13  isrc                    object 
 14  city                    int64  
 15  bd                      int64  
 16  gender                  object 
 17  registered_via          int64  
 18  registration_init_time  int64  
 19  expiration_date         int64  
dtypes: float64(2), int64(6), object(12)
memory usage: 1.2+ GB


The registration_init_time and expiration_date columns are stored as integers rather than datetimes; we will address that.
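
One possible way to convert these integer columns is `pd.to_datetime` with the `%Y%m%d` format given in the data description. This is a sketch on two made-up rows; the `membership_days` column is a hypothetical derived feature that the notebook does not compute.

```python
import pandas as pd

# Toy rows in the same integer format as members.csv.
members = pd.DataFrame({
    "registration_init_time": [20120102, 20110525],
    "expiration_date": [20171005, 20170911],
})

for col in ["registration_init_time", "expiration_date"]:
    # Cast to string first so the %Y%m%d format is applied to the digits.
    members[col] = pd.to_datetime(members[col].astype(str), format="%Y%m%d")

# Hypothetical derived feature: membership length in days.
members["membership_days"] = (
    members["expiration_date"] - members["registration_init_time"]
).dt.days

print(members.dtypes)
```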

#### Reading the spotify data

In [10]:
df_spotify = pd.read_csv('/content/drive/MyDrive/kkbox_spotify_data/spotify_data_1921_to_2020.csv')
df_spotify.head()

Unnamed: 0,id,name,artists,duration_ms,release_date,year,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,mode,key,popularity,explicit
0,0gNNToCW3qjabgTyBSjt3H,!Que Vida! - Mono Version,['Love'],220560,11/1/66,1966,0.525,0.6,0.54,0.00305,0.1,-11.803,0.0328,125.898,0.547,1,9,26,0
1,0tMgFpOrXZR6irEOLNWwJL,"""40""",['U2'],157840,2/28/83,1983,0.228,0.368,0.48,0.707,0.159,-11.605,0.0306,150.166,0.338,1,8,21,0
2,2ZywW3VyVx6rrlrX75n3JB,"""40"" - Live",['U2'],226200,8/20/83,1983,0.0998,0.272,0.684,0.0145,0.946,-9.728,0.0505,143.079,0.279,1,8,41,0
3,6DdWA7D1o5TU2kXWyCLcch,"""40"" - Remastered 2008",['U2'],157667,2/28/83,1983,0.185,0.371,0.545,0.582,0.183,-9.315,0.0307,150.316,0.31,1,8,37,0
4,3vMmwsAiLDCfyc1jl76lQE,"""40"" - Remastered 2008",['U2'],157667,2/28/83,1983,0.185,0.371,0.545,0.582,0.183,-9.315,0.0307,150.316,0.31,1,8,35,0


#### More information on the spotify data

In [11]:
df_spotify.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169907 entries, 0 to 169906
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                169907 non-null  object 
 1   name              169907 non-null  object 
 2   artists           169907 non-null  object 
 3   duration_ms       169907 non-null  int64  
 4   release_date      169907 non-null  object 
 5   year              169907 non-null  int64  
 6   acousticness      169907 non-null  float64
 7   danceability      169907 non-null  float64
 8   energy            169907 non-null  float64
 9   instrumentalness  169907 non-null  float64
 10  liveness          169907 non-null  float64
 11  loudness          169907 non-null  float64
 12  speechiness       169907 non-null  float64
 13  tempo             169907 non-null  float64
 14  valence           169907 non-null  float64
 15  mode              169907 non-null  int64  
 16  key               16

**Filtering the Spotify data**

In [12]:
spotify_data_filtered = df_spotify.loc[(df_spotify['year']>=2004) & (df_spotify['year']<=2016)]

In [13]:
spotify_data_filtered.head()

Unnamed: 0,id,name,artists,duration_ms,release_date,year,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,mode,key,popularity,explicit
5,25Sd73fleKUVPNqITPZkn1,"""45""",['The Gaslight Anthem'],202493,1/1/12,2012,0.000696,0.315,0.97,0.0,0.277,-4.709,0.102,178.068,0.423,1,8,48,0
19,3ozivYJGJGq6TSzdy8m64X,"""DEVILS NEVER CRY""(スタッフロール)",['Capcom Sound Team'],319907,3/31/05,2005,0.000894,0.264,0.951,0.0442,0.127,-7.356,0.146,149.99,0.159,1,7,49,0
51,3rG8ZkmKHb4Ms6CsSzEITv,"""The Take Over, The Breaks Over""",['Fall Out Boy'],213587,1/1/07,2007,0.00614,0.609,0.917,2e-05,0.0775,-2.563,0.0477,149.948,0.67,1,9,57,0
52,4zCfMDdf5QXPKEqxdinXvB,"""The Take Over, The Breaks Over""",['Fall Out Boy'],213587,2/6/07,2007,0.00614,0.609,0.917,2e-05,0.0775,-2.563,0.0477,149.948,0.67,1,9,51,0
60,0P6USuYzHP8GdAyNKLkTZi,#1 Crush,['Garbage'],285107,1/1/07,2007,0.000256,0.635,0.647,0.00132,0.358,-7.055,0.0235,94.196,0.464,0,7,41,0


**Joining the KKBOX and Spotify filtered data**

In [14]:
df_kkbox_spotify = pd.merge(spotify_data_filtered, songs_data, on='name', how='inner')

In [15]:
pd.set_option('display.max_columns', None)
df_kkbox_spotify.head()

Unnamed: 0,id,name,artists,duration_ms,release_date,year,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,mode,key,popularity,explicit,msno,song_id,source_system_tab,source_screen_name,source_type,target,song_length,genre_ids,artist_name,composer,lyricist,language,isrc,city,bd,gender,registered_via,registration_init_time,expiration_date
0,617KSbx52ACbnQBxSsG26X,#Beautiful,"['Mariah Carey', 'Miguel']",199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,LmCPb0h/s7H7s+Wmb5JwRwJTbasi85lneTR85TXW6DY=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,search,Artist more,top-hits-for-artist,0,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,9,33,male,7,20140729,20170930
1,617KSbx52ACbnQBxSsG26X,#Beautiful,"['Mariah Carey', 'Miguel']",199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,0eQZY/YLb+YGy3XCo6Q1PP1wQGO7wNuBI8mDGS7/NqQ=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,my library,Local playlist more,local-playlist,1,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,22,32,female,9,20060303,20180819
2,617KSbx52ACbnQBxSsG26X,#Beautiful,"['Mariah Carey', 'Miguel']",199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,NMjW8qxMk/ojwFH12PXQmwEsP48/c3rLi7MWU2QHa7k=,lpC0aGdS6Z1I54mHyU7skZt9NalF9lkr531OlEI9AHY=,my library,Local playlist more,local-library,0,199784.0,465,Mariah Carey,Miguel Pimentel|Mariah Carey|Nathan Perez|Broo...,,52.0,USUM71305567,13,24,male,3,20130903,20171002
3,617KSbx52ACbnQBxSsG26X,#Beautiful,"['Mariah Carey', 'Miguel']",199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,CNviohuVgv0rl0YC+b49AFL2UiB01ZNijpkJRzRfuOI=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,my library,Local playlist more,local-library,0,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,1,0,,7,20120118,20171004
4,617KSbx52ACbnQBxSsG26X,#Beautiful,"['Mariah Carey', 'Miguel']",199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,Xa3bbIsP+DO7sQdTiljlnXoiRJl6JXU5xDe9y3HG6Lc=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,my library,Local playlist more,local-library,0,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,5,31,male,9,20150705,20180111
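
Note that the join key is the song title alone, so any Spotify track is paired with every KKBOX event whose song has the same title, even when the recordings differ. In the preview above, one Spotify id is matched to two different KKBOX `song_id` values. The toy example below, with made-up values, sketches how this fan-out arises and how an additional artist check could narrow it; exact string equality on artist names is a simplifying assumption.

```python
import pandas as pd

# Toy data: two different tracks share the title "Closer". Values are made up.
spotify = pd.DataFrame({
    "id": ["t1", "t2"],
    "name": ["Closer", "Closer"],
    "artists": ["Artist A", "Artist B"],
})
kkbox = pd.DataFrame({
    "msno": ["u1", "u2", "u3"],
    "name": ["Closer", "Closer", "Closer"],
    "artist_name": ["Artist A", "Artist A", "Artist B"],
})

# Joining on title alone pairs every KKBOX event with every same-titled track.
by_name = pd.merge(spotify, kkbox, on="name", how="inner")
print(len(by_name))  # 2 tracks x 3 events = 6 rows

# Keeping only pairs whose artist also agrees removes the cross-matches.
# Real artist strings would need normalising before such a comparison.
by_name_artist = by_name[by_name["artists"] == by_name["artist_name"]]
print(len(by_name_artist))  # 3 rows, one per listening event
```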


**Removing the special characters in the name and artists column values**

In [16]:
# regex=False treats each pattern as a literal string; '[' on its own is not a valid regex
df_kkbox_spotify['name'] = df_kkbox_spotify['name'].str.replace('#', '', regex=False)
df_kkbox_spotify['artists'] = df_kkbox_spotify['artists'].str.replace("[", '', regex=False)
df_kkbox_spotify['artists'] = df_kkbox_spotify['artists'].str.replace("]", '', regex=False)
df_kkbox_spotify['artists'] = df_kkbox_spotify['artists'].str.replace("'", '', regex=False)

In [17]:
# changing the delimiter in the artists column values from ',' to '|'
df_kkbox_spotify['artists'] = df_kkbox_spotify['artists'].str.replace(",", '|', regex=False)
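
The `artists` values are string representations of Python lists, so an alternative to character-by-character replacement is to parse them. The sketch below uses `ast.literal_eval` on made-up sample strings; `join_artists` is a hypothetical helper. Unlike the replacements above, it keeps apostrophes inside artist names and leaves no space after the delimiter, so its output differs slightly from the notebook's.

```python
import ast

import pandas as pd

# Sample values in the same stringified-list format as the Spotify file.
artists = pd.Series([
    "['Mariah Carey', 'Miguel']",
    "['U2']",
    "[\"Guns N' Roses\"]",
])


def join_artists(raw, sep="|"):
    """Parse a stringified Python list and join its items with `sep`."""
    return sep.join(ast.literal_eval(raw))


cleaned = artists.apply(join_artists)
print(cleaned.tolist())
# ['Mariah Carey|Miguel', 'U2', "Guns N' Roses"]
```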

**Renaming columns**

In [18]:
column_mapping = {'id':'track_id',
                  'name':'track_name',
                 'bd': 'age'}

df_kkbox_spotify = df_kkbox_spotify.rename(columns=column_mapping)

In [19]:
df_kkbox_spotify.head()

Unnamed: 0,track_id,track_name,artists,duration_ms,release_date,year,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,mode,key,popularity,explicit,msno,song_id,source_system_tab,source_screen_name,source_type,target,song_length,genre_ids,artist_name,composer,lyricist,language,isrc,city,age,gender,registered_via,registration_init_time,expiration_date
0,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,LmCPb0h/s7H7s+Wmb5JwRwJTbasi85lneTR85TXW6DY=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,search,Artist more,top-hits-for-artist,0,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,9,33,male,7,20140729,20170930
1,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,0eQZY/YLb+YGy3XCo6Q1PP1wQGO7wNuBI8mDGS7/NqQ=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,my library,Local playlist more,local-playlist,1,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,22,32,female,9,20060303,20180819
2,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,NMjW8qxMk/ojwFH12PXQmwEsP48/c3rLi7MWU2QHa7k=,lpC0aGdS6Z1I54mHyU7skZt9NalF9lkr531OlEI9AHY=,my library,Local playlist more,local-library,0,199784.0,465,Mariah Carey,Miguel Pimentel|Mariah Carey|Nathan Perez|Broo...,,52.0,USUM71305567,13,24,male,3,20130903,20171002
3,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,CNviohuVgv0rl0YC+b49AFL2UiB01ZNijpkJRzRfuOI=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,my library,Local playlist more,local-library,0,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,1,0,,7,20120118,20171004
4,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,Xa3bbIsP+DO7sQdTiljlnXoiRJl6JXU5xDe9y3HG6Lc=,WTQo/GJtT2qyQvqTRfsznnW6S63QNi2gDjS9DudG9ik=,my library,Local playlist more,local-library,0,202570.0,465,Mariah Carey,Mariah Carey| Brook Davis| Miguel Pimentel| Na...,,52.0,USUM71305567,5,31,male,9,20150705,20180111


In [20]:
df_kkbox_spotify.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3055369 entries, 0 to 3055368
Data columns (total 38 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   track_id                object 
 1   track_name              object 
 2   artists                 object 
 3   duration_ms             int64  
 4   release_date            object 
 5   year                    int64  
 6   acousticness            float64
 7   danceability            float64
 8   energy                  float64
 9   instrumentalness        float64
 10  liveness                float64
 11  loudness                float64
 12  speechiness             float64
 13  tempo                   float64
 14  valence                 float64
 15  mode                    int64  
 16  key                     int64  
 17  popularity              int64  
 18  explicit                int64  
 19  msno                    object 
 20  song_id                 object 
 21  source_system_tab       object 

**Drop unnecessary columns**

In [21]:
df_kkbox_spotify.columns

Index(['track_id', 'track_name', 'artists', 'duration_ms', 'release_date',
       'year', 'acousticness', 'danceability', 'energy', 'instrumentalness',
       'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'mode',
       'key', 'popularity', 'explicit', 'msno', 'song_id', 'source_system_tab',
       'source_screen_name', 'source_type', 'target', 'song_length',
       'genre_ids', 'artist_name', 'composer', 'lyricist', 'language', 'isrc',
       'city', 'age', 'gender', 'registered_via', 'registration_init_time',
       'expiration_date'],
      dtype='object')

In [22]:
columns_to_drop = ['song_id','song_length','artist_name','composer','lyricist',
                   'language','isrc','city','gender','registered_via',
                   'registration_init_time','expiration_date']

In [23]:
df_kkbox_spotify.drop(columns_to_drop, axis=1,inplace=True)

In [24]:
df_kkbox_spotify.shape

(3055369, 26)

### Statistical summary of the data

In [25]:
df_kkbox_spotify.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
track_id,3055369.0,12957.0,7crMiinWx373rNBZBaVske,15149.0,,,,,,,
track_name,3055369.0,9873.0,Closer,136341.0,,,,,,,
artists,3055369.0,4883.0,Taylor Swift,77397.0,,,,,,,
duration_ms,3055369.0,,,,239880.856529,55043.40575,30301.0,208067.0,232520.0,261114.0,975267.0
release_date,3055369.0,1956.0,1/1/12,87933.0,,,,,,,
year,3055369.0,,,,2011.490323,3.74372,2004.0,2008.0,2012.0,2015.0,2016.0
acousticness,3055369.0,,,,0.231674,0.278303,2e-06,0.0153,0.0935,0.395,0.996
danceability,3055369.0,,,,0.591012,0.145389,0.0,0.499,0.602,0.696,0.986
energy,3055369.0,,,,0.658484,0.204604,2e-05,0.519,0.696,0.819,0.999
instrumentalness,3055369.0,,,,0.045269,0.172773,0.0,0.0,1e-06,0.00016,0.999


* Taylor Swift is the artist who appears most often in the data.
* The most common source of listening events is the user's personal library.
* The age column has extreme and unrealistic values, with a minimum of -43 and a maximum of 1030. This indicates that there are outliers in this column.
* The average liveness of songs in the data is 0.18, and 75% of songs have a liveness of 0.21 or less, which suggests that most tracks are studio recordings rather than live performances.
* 75% of songs have a speechiness of approximately 0.1 or less, showing that most songs are not dominated by spoken word.

### Outlier treatment

**Checking outliers using z-score**

In [26]:
mean_of_age = np.mean(df_kkbox_spotify['age'])
mode_of_age = df_kkbox_spotify['age'].mode()[0]
std_of_age = np.std(df_kkbox_spotify['age'])
print("Mean of Age: ",mean_of_age)
print("Mode of Age: ",mode_of_age)
print("Standard Deviation of Age: ",std_of_age)

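threshold = 3
outlier = []
# Flag ages more than `threshold` standard deviations above the mean.
# The check is one-sided, so unusually low (e.g. negative) ages are not flagged.
for i in df_kkbox_spotify['age']:
    z = (i-mean_of_age)/std_of_age
    if z > threshold:
        outlier.append(i)
print('Total outliers in the column: ', len(outlier))
print("Maximum Age Outlier: ", max(outlier))
print("Minimum Age Outlier: ", min(outlier))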

Mean of Age:  16.905725625939127
Mode of Age:  0
Standard Deviation of Age:  19.645736081161477
Total outliers in the column:  3139
Maximum Age Outlier:  1030
Minimum Age Outlier:  78


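* There are 3139 values in the age column (originally bd) that lie more than three standard deviations above the mean.
* Ages from 78 to 1030 are flagged as outliers and can be removed. The check is one-sided, so negative ages are not flagged by it; the removal step below uses a cutoff of 80 and also drops ages of 0 or below.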
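
The loop above flags only values far above the mean, because it tests `z > threshold` rather than the absolute z-score. A vectorised, two-sided variant might look like the sketch below; the ages are made up for illustration and are not taken from the data.

```python
import pandas as pd

# Toy ages, made up for illustration: one implausibly high, one negative.
age = pd.Series([25] * 50 + [30] * 50 + [200, -150])

# ddof=0 gives the population standard deviation, matching np.std.
z = (age - age.mean()) / age.std(ddof=0)

threshold = 3
one_sided = age[z > threshold]        # what a z > threshold test catches
two_sided = age[z.abs() > threshold]  # also catches unusually low values

print(one_sided.tolist())
print(sorted(two_sided.tolist()))
```

Because the mean and standard deviation are themselves pulled by extreme values, a z-score cutoff is best treated as a rough screen and combined with domain limits such as a plausible age range.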

**Removing outliers in age column**

In [27]:
df_kkbox_spotify.drop(df_kkbox_spotify[df_kkbox_spotify['age'] > 80].index, inplace = True)
df_kkbox_spotify.drop(df_kkbox_spotify[df_kkbox_spotify['age'] <=0].index, inplace = True)

### Checking for Missing Values

In [28]:
print("Total Null values in the dataset:\n", df_kkbox_spotify.isnull().sum())

Total Null values in the dataset:
 track_id                  0
track_name                0
artists                   0
duration_ms               0
release_date              0
year                      0
acousticness              0
danceability              0
energy                    0
instrumentalness          0
liveness                  0
loudness                  0
speechiness               0
tempo                     0
valence                   0
mode                      0
key                       0
popularity                0
explicit                  0
msno                      0
source_system_tab      6697
source_screen_name    81468
source_type            5275
target                    0
genre_ids             14905
age                       0
dtype: int64


### Handling missing values

**Checking the percentage of missing values for the columns in question**

In [29]:
percentage_missing = df_kkbox_spotify.isnull().sum() * 100 / len(df_kkbox_spotify)
percentage_missing

track_id              0.000000
track_name            0.000000
artists               0.000000
duration_ms           0.000000
release_date          0.000000
year                  0.000000
acousticness          0.000000
danceability          0.000000
energy                0.000000
instrumentalness      0.000000
liveness              0.000000
loudness              0.000000
speechiness           0.000000
tempo                 0.000000
valence               0.000000
mode                  0.000000
key                   0.000000
popularity            0.000000
explicit              0.000000
msno                  0.000000
source_system_tab     0.369503
source_screen_name    4.494950
source_type           0.291045
target                0.000000
genre_ids             0.822375
age                   0.000000
dtype: float64

source_screen_name has the highest share of missing values, at about 4.5%.

In [30]:
# extracting the rows where source_screen_name is null
df_kkbox_spotify.loc[df_kkbox_spotify['source_screen_name'].isnull()]

Unnamed: 0,track_id,track_name,artists,duration_ms,release_date,year,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,mode,key,popularity,explicit,msno,source_system_tab,source_screen_name,source_type,target,genre_ids,age
18,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,mHanXIId8t0t4y7YZCdaSwc3/LnnkF/EKOlWy+jNukQ=,discover,,top-hits-for-artist,0,465,24
30,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,1d4acB2bhEZCFjSRwvO4ls8PrBtvNTlkcAcxYx8FcWE=,listen with,,listen-with,0,465,27
86,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,ohnJTn+6M+W0aBR5LMcP8MTTKB7GscEY3zif1vB3Cjc=,my library,,top-hits-for-artist,0,465,38
97,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,VxVaRlhgqAetKLLBOQR6SKvSp1VbYpQ15HjPAetXjME=,my library,,top-hits-for-artist,0,465,27
105,617KSbx52ACbnQBxSsG26X,Beautiful,Mariah Carey| Miguel,199867,5/27/14,2014,0.346,0.677,0.749,0.0,0.347,-5.405,0.0391,107.042,0.469,1,4,54,0,H/MB+uttGlGIMr7qru4HYG4dVbh0e1mBuoryBOKz/RU=,discover,,top-hits-for-artist,0,465,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3055305,4mwzUFtpBHJVyBs16YjjPN,몸매 Mommae (feat. Ugly Duck),Jay Park| Ugly Duck,204327,11/5/15,2015,0.218,0.745,0.669,0.0,0.104,-5.362,0.2420,94.089,0.681,0,9,59,0,cIAWv5YOFzHtganCl67btiKdXsGEHMAOUSvGQbhcbC8=,my library,,top-hits-for-artist,0,1259,37
3055306,4mwzUFtpBHJVyBs16YjjPN,몸매 Mommae (feat. Ugly Duck),Jay Park| Ugly Duck,204327,11/5/15,2015,0.218,0.745,0.669,0.0,0.104,-5.362,0.2420,94.089,0.681,0,9,59,0,HihZqURxXxAaZN1JI66HV6YuJlGDgp7ZRtHtuBtn+q4=,my library,,top-hits-for-artist,1,1259,43
3055324,4mwzUFtpBHJVyBs16YjjPN,몸매 Mommae (feat. Ugly Duck),Jay Park| Ugly Duck,204327,11/5/15,2015,0.218,0.745,0.669,0.0,0.104,-5.362,0.2420,94.089,0.681,0,9,59,0,zYTm2T59lT+Z/hpoywdET+EGMcRhwi4rzL7WD09iRC0=,my library,,top-hits-for-artist,0,1259,25
3055352,4mwzUFtpBHJVyBs16YjjPN,몸매 Mommae (feat. Ugly Duck),Jay Park| Ugly Duck,204327,11/5/15,2015,0.218,0.745,0.669,0.0,0.104,-5.362,0.2420,94.089,0.681,0,9,59,0,3TZaybijIyaqU3P+moI/mzFWcAj6aR0p7bmxLT9iRq8=,,,online-playlist,0,1259,39


There is no definite pattern to the missingness of these values, so we can either drop the rows with missing values or replace the missing values with the column modes.
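
Eyeballing rows is one way to look for a pattern. A more systematic check is to compare the target rate and the other source columns between rows with and without a missing value. The sketch below runs on made-up rows, so its numbers say nothing about the real data.

```python
import pandas as pd

# Toy rows, made up for illustration.
df = pd.DataFrame({
    "source_screen_name": ["Explore", None, "Local playlist more", None, "Explore", None],
    "source_system_tab": ["explore", "discover", "my library", "my library", "explore", None],
    "target": [1, 0, 1, 0, 1, 0],
})

is_missing = df["source_screen_name"].isna()

# Repeat-listen rate for rows with and without a recorded screen name.
target_rate = df.groupby(is_missing)["target"].mean()

# Share of missing screen names per system tab (missing tabs kept as their own group).
missing_by_tab = is_missing.groupby(df["source_system_tab"].fillna("missing")).mean()

print(target_rate)
print(missing_by_tab)
```

If these rates differ markedly between groups on the real data, the values are unlikely to be missing completely at random, and dropping rows could bias the sample.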

**Removing null values**

In [31]:
data_null_removed = df_kkbox_spotify.dropna()
data_null_removed.shape

(1716132, 26)

**Replacing null values with mode**

In [32]:
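data_mode_replaced = df_kkbox_spotify.copy()
for col in data_mode_replaced.columns:
    # fillna returns a new Series, so the result must be assigned back to the column
    data_mode_replaced[col] = data_mode_replaced[col].fillna(data_mode_replaced[col].mode()[0])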

In [33]:
data_mode_replaced.shape

(1812434, 26)