# Music Recommendation System 
- A music recommendation system can suggest songs to users based on their listening patterns.
- `Data Description:`

- `Data Source:` https://www.kaggle.com/competitions/kkbox-music-recommendation-challenge/data

- `Dataset Description`: 

- In this task, you will be asked to predict the chances of a user listening to a song repetitively after the first observable listening event within a time window was triggered. If there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, its target is marked 1, and 0 otherwise in the training set. The same rule applies to the testing set.

- KKBOX provides a training data set that consists of information about the first observable listening event for each unique user-song pair within a specific time duration. Metadata of each unique user and song pair is also provided. The use of public data to increase the level of accuracy of your prediction is encouraged.

- The train and the test data are selected from users' listening history in a given time period. Note that this time period is chosen to be before the WSDM-KKBox Churn Prediction time period. The train and test sets are split based on time, and the public/private split is based on unique user/song pairs.

`Tables:`

1. `train.csv`
- msno: user id

- song_id: song id

- source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search.

- source_screen_name: name of the layout a user sees.

- source_type: the entry point from which a user first plays music on the mobile apps. An entry point could be album, online-playlist, song, etc.

- target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, target=0 otherwise.

2. `test.csv`
- id: row id (will be used for submission)

- msno: user id

- song_id: song id

- source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search.

- source_screen_name: name of the layout a user sees.

- source_type: the entry point from which a user first plays music on the mobile apps. An entry point could be album, online-playlist, song, etc.

3. `sample_submission.csv`: sample submission file in the format that we expect you to submit

- id: same as id in test.csv
- target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, target=0 otherwise.

4. `songs.csv`: the songs. Note that the data is in Unicode.
- song_id
- song_length: in ms
- genre_ids: genre category. Some songs have multiple genres, separated by `|`
- artist_name
- composer
- lyricist
- language
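
Because `genre_ids` can hold several genres joined by `|`, it usually needs splitting before it can be used as a feature. The following is a minimal sketch on made-up rows; the derived column names `genre_list` and `genre_count` are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows in the shape of songs.csv; the values are made up for illustration.
songs = pd.DataFrame({
    "song_id": ["a", "b", "c"],
    "genre_ids": ["465", "444|1259", None],
})

# Split on "|" so each song carries a list of genre ids; missing stays missing.
songs["genre_list"] = songs["genre_ids"].str.split("|")

# Number of genres per song (0 when genre_ids is missing).
songs["genre_count"] = songs["genre_list"].apply(
    lambda g: len(g) if isinstance(g, list) else 0
)

# One row per (song, genre) pair, handy for counting or joining by genre.
song_genres = songs[["song_id", "genre_list"]].explode("genre_list").dropna()
```
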

5. `members.csv`: user information.

- msno
  
- city
  
- bd: age. Note: this column has outlier values; please use your judgement.

- gender

- registered_via: registration method

- registration_init_time: format %Y%m%d

- expiration_date: format %Y%m%d
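
The `bd` column is described as containing outliers. One possible way to handle them is to treat implausible ages as missing and keep a flag for the model. This is only a sketch: the bounds `MIN_AGE` and `MAX_AGE` and the column names `bd_clean` and `bd_missing` are assumptions, not values taken from the dataset documentation.

```python
import pandas as pd

# Toy ages, including zero, negative and very large values.
members = pd.DataFrame({"bd": [0, 24, -43, 1051, 31]})

# Assumed plausible age range; anything outside is treated as unknown.
MIN_AGE, MAX_AGE = 10, 90
valid = members["bd"].between(MIN_AGE, MAX_AGE)

# Keep the age where it is plausible, NaN otherwise.
members["bd_clean"] = members["bd"].where(valid)

# Binary flag so a model can still learn from "age not given".
members["bd_missing"] = (~valid).astype(int)
```
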


6. `song_extra_info.csv`
- song_id
- song name: the name of the song.
- isrc: International Standard Recording Code, which in theory can be used as the identity of a song. Note, however, that ISRCs generated by providers have not been officially verified, so the information in an ISRC, such as country code and reference year, can be misleading or incorrect. Multiple songs can share one ISRC, since a single recording may be re-published several times.
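
An ISRC is a 12-character code: a 2-letter country code, a 3-character registrant code, a 2-digit year of reference and a 5-digit designation code. A minimal sketch of pulling out the country and the two-digit year is shown below. The sample code is made up, the column names `isrc_country` and `isrc_year2` are assumptions, and, as noted above, the extracted values may be unreliable.

```python
import pandas as pd

# Made-up ISRC values for illustration; None stands for a missing code.
extra = pd.DataFrame({"isrc": ["TWUM71200043", None]})

# Characters 0-1: country code.
extra["isrc_country"] = extra["isrc"].str[:2]

# Characters 5-6: two-digit year of reference (NaN when the code is missing
# or the slice is not numeric).
extra["isrc_year2"] = pd.to_numeric(extra["isrc"].str[5:7], errors="coerce")
```
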

# Importing Important Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,accuracy_score
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Loading datasets

In [2]:
train_data = pd.read_csv(r"train.csv")

In [3]:
train_data.head()

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,target
0,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=,explore,Explore,online-playlist,1
1,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM=,my library,Local playlist more,local-playlist,1
2,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY=,my library,Local playlist more,local-playlist,1
3,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs=,my library,Local playlist more,local-playlist,1
4,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc=,explore,Explore,online-playlist,1


In [4]:
train_data.shape

(7377418, 6)

In [5]:
members_data = pd.read_csv(r"members.csv")
members_data.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time,expiration_date
0,XQxgAYj3klVKjR3oxPPXYYFp4soD4TuBghkhMTD4oTw=,1,0,,7,20110820,20170920
1,UizsfmJb9mV54qE9hCYyU07Va97c0lCRLEQX3ae+ztM=,1,0,,7,20150628,20170622
2,D8nEhsIOBSoE6VthTaqDX8U6lqjJ7dLdr72mOyLya2A=,1,0,,4,20160411,20170712
3,mCuD+tZ1hERA/o5GPqk38e041J8ZsBaLcu7nGoIIvhI=,1,0,,9,20150906,20150907
4,q4HRBfVSssAFS9iRfxWrohxuk9kCYMKjHOEagUMV6rQ=,1,0,,4,20170126,20170613


In [6]:
members_data.shape

(34403, 7)

In [7]:
songs_data = pd.read_csv(r"songs.csv")
songs_data.head()

Unnamed: 0,song_id,song_length,genre_ids,artist_name,composer,lyricist,language
0,CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=,247640,465,張信哲 (Jeff Chang),董貞,何啟弘,3.0
1,o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=,197328,444,BLACKPINK,TEDDY| FUTURE BOUNCE| Bekuh BOOM,TEDDY,31.0
2,DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=,231781,465,SUPER JUNIOR,,,31.0
3,dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=,273554,465,S.H.E,湯小康,徐世珍,3.0
4,W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=,140329,726,貴族精選,Traditional,Traditional,52.0


In [8]:
songs_data.shape

(2296320, 7)

- Merge the training data with the other required tables, `songs_data` and `members_data`.

In [9]:
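# Attach song metadata to each training row.
train_df = pd.merge(train_data, songs_data, on='song_id', how='left')
del songs_data

# NOTE: this merge starts from train_data again, so the song columns added
# above are discarded; the 12-column outputs below reflect that. Merge from
# train_df instead to keep the song metadata.
train_df = pd.merge(train_data, members_data, on='msno', how='left')
del members_data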
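
When several tables are joined in sequence, each merge has to start from the result of the previous one, otherwise the earlier columns are lost. A minimal sketch on toy frames follows; the values are made up, and `validate="many_to_one"` is an optional safety check that raises if a lookup table contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for train.csv, songs.csv and members.csv.
train = pd.DataFrame({"msno": ["u1", "u2"], "song_id": ["s1", "s2"], "target": [1, 0]})
songs = pd.DataFrame({"song_id": ["s1", "s2"], "song_length": [247640, 197328]})
members = pd.DataFrame({"msno": ["u1", "u2"], "city": [1, 13]})

# First merge: add song columns to the training rows.
merged = train.merge(songs, on="song_id", how="left", validate="many_to_one")

# Second merge: continue from `merged`, not from `train`, so the song
# columns are kept alongside the member columns.
merged = merged.merge(members, on="msno", how="left", validate="many_to_one")
```
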

In [10]:
train_df.head()

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,target,city,bd,gender,registered_via,registration_init_time,expiration_date
0,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=,explore,Explore,online-playlist,1,1,0,,7,20120102,20171005
1,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM=,my library,Local playlist more,local-playlist,1,13,24,female,9,20110525,20170911
2,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY=,my library,Local playlist more,local-playlist,1,13,24,female,9,20110525,20170911
3,Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=,2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs=,my library,Local playlist more,local-playlist,1,13,24,female,9,20110525,20170911
4,FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=,3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc=,explore,Explore,online-playlist,1,1,0,,7,20120102,20171005


In [11]:
train_df.shape

(7377418, 12)

In [12]:
train_df.isna().value_counts()

msno   song_id  source_system_tab  source_screen_name  source_type  target  city   bd     gender  registered_via  registration_init_time  expiration_date
False  False    False              False               False        False   False  False  False   False           False                   False              4185007
                                                                                          True    False           False                   False              2774290
                                   True                False        False   False  False  False   False           False                   False               212630
                                                                                          True    False           False                   False               177307
                True               True                True         False   False  False  False   False           False                   False                12283
                     

In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7377418 entries, 0 to 7377417
Data columns (total 12 columns):
 #   Column                  Dtype 
---  ------                  ----- 
 0   msno                    object
 1   song_id                 object
 2   source_system_tab       object
 3   source_screen_name      object
 4   source_type             object
 5   target                  int64 
 6   city                    int64 
 7   bd                      int64 
 8   gender                  object
 9   registered_via          int64 
 10  registration_init_time  int64 
 11  expiration_date         int64 
dtypes: int64(6), object(6)
memory usage: 675.4+ MB


In [14]:
train_df.describe()

Unnamed: 0,target,city,bd,registered_via,registration_init_time,expiration_date
count,7377418.0,7377418.0,7377418.0,7377418.0,7377418.0,7377418.0
mean,0.5035171,7.511399,17.53927,6.794068,20128100.0,20171570.0
std,0.4999877,6.641625,21.55447,2.275774,30172.81,3869.831
min,0.0,1.0,-43.0,3.0,20040330.0,19700100.0
25%,0.0,1.0,0.0,4.0,20110700.0,20170910.0
50%,1.0,5.0,21.0,7.0,20131020.0,20170930.0
75%,1.0,13.0,29.0,9.0,20151020.0,20171010.0
max,1.0,22.0,1051.0,13.0,20170130.0,20201020.0


# Handling Missing Values (Nulls)

In [15]:
def find_dirty_values(data):
    dtypes = pd.DataFrame(data.dtypes,columns=["Data Type"])
    dtypes["Unique Values"]=data.nunique().sort_values(ascending=True)
    dtypes["Null Values"]=data.isnull().sum()
    dtypes["% null Values"]=data.isnull().sum()/len(data)
    return dtypes.sort_values(by="Null Values" , ascending=False).style.background_gradient(cmap='YlOrRd',axis=0)

In [16]:
result = find_dirty_values(train_df)
result

Unnamed: 0,Data Type,Unique Values,Null Values,% null Values
gender,object,2,2961479,0.401425
source_screen_name,object,20,414804,0.056226
source_system_tab,object,8,24849,0.003368
source_type,object,12,21539,0.00292
msno,object,30755,0,0.0
song_id,object,359966,0,0.0
target,int64,2,0,0.0
city,int64,21,0,0.0
bd,int64,92,0,0.0
registered_via,int64,5,0,0.0


In [17]:
def handling_missing_values(dataframe):
    cat_cols = dataframe.select_dtypes(include='O').columns
    num_cols = dataframe.select_dtypes(include=(np.number)).columns
    for col in cat_cols:
        dataframe[col] = dataframe[col].fillna('UnKnown')
    for col in num_cols:
        dataframe[col] = dataframe[col].interpolate(method='linear')
    return dataframe
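
Linear interpolation fills a gap from the neighbouring rows, which is meaningful only when the row order carries information. In this data the numeric columns have no nulls, so the choice has no effect here, but an order-independent alternative could look like the sketch below. The helper name `impute_simple` and the use of the column median are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

def impute_simple(df, cat_fill="Unknown"):
    """Fill numeric nulls with the column median and other nulls with a label."""
    out = df.copy()  # leave the caller's frame untouched
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(cat_fill)
    return out

# Toy frame with one missing value per column.
toy = pd.DataFrame({"gender": [None, "female", "male"], "bd": [24.0, np.nan, 30.0]})
filled = impute_simple(toy)
```
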

In [18]:
train_df = handling_missing_values(train_df)

In [19]:
result = find_dirty_values(train_df)
result 

Unnamed: 0,Data Type,Unique Values,Null Values,% null Values
msno,object,30755,0,0.0
song_id,object,359966,0,0.0
source_system_tab,object,9,0,0.0
source_screen_name,object,21,0,0.0
source_type,object,13,0,0.0
target,int64,2,0,0.0
city,int64,21,0,0.0
bd,int64,92,0,0.0
gender,object,3,0,0.0
registered_via,int64,5,0,0.0


In [20]:
date_cols = ['registration_init_time','expiration_date']
def date_formatting(date):
    date = str(date)
    date = f"{date[:4]}-{date[4:6]}-{date[6:]}"
    return date
def paring_date(dataframe,date):
    dataframe[date] = pd.to_datetime(dataframe[date])
    dataframe[f"{date}-day"] = dataframe[date].dt.day
    dataframe[f"{date}-month"] = dataframe[date].dt.month
    dataframe[f"{date}-year"] = dataframe[date].dt.year
    return dataframe

In [21]:
for col in date_cols:
    train_df[col] = train_df[col].apply(date_formatting)
    train_df = paring_date(train_df,col)
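
The same day, month and year features can also be derived without the intermediate string formatting, by passing the `%Y%m%d` format to `pd.to_datetime` directly. A minimal sketch on toy values:

```python
import pandas as pd

# Toy values in the integer YYYYMMDD form used by members.csv.
df = pd.DataFrame({"registration_init_time": [20120102, 20110525, 20170126]})

# Parse with an explicit format; errors="coerce" turns malformed values
# into NaT instead of raising.
parsed = pd.to_datetime(
    df["registration_init_time"].astype(str), format="%Y%m%d", errors="coerce"
)

# Same derived columns as in the cells above.
df["registration_init_time-day"] = parsed.dt.day
df["registration_init_time-month"] = parsed.dt.month
df["registration_init_time-year"] = parsed.dt.year
```
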

In [22]:
train_df.loc[:,'registration_init_time':].head()

Unnamed: 0,registration_init_time,expiration_date,registration_init_time-day,registration_init_time-month,registration_init_time-year,expiration_date-day,expiration_date-month,expiration_date-year
0,2012-01-02,2017-10-05,2,1,2012,5,10,2017
1,2011-05-25,2017-09-11,25,5,2011,11,9,2017
2,2011-05-25,2017-09-11,25,5,2011,11,9,2017
3,2011-05-25,2017-09-11,25,5,2011,11,9,2017
4,2012-01-02,2017-10-05,2,1,2012,5,10,2017


In [23]:
train_df.shape

(7377418, 18)

In [24]:
cat_cols = train_df.select_dtypes(include="O").columns
encoder = LabelEncoder()
for col in cat_cols:
    train_df[col] = encoder.fit_transform(train_df[col])
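
The loop above reuses one `LabelEncoder`, so after it finishes only the mapping of the last column is still available. If the mappings are needed later, for example to encode `test.csv` consistently or to decode predictions, one encoder per column can be kept. A minimal sketch on toy data; the `encoders` dictionary is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns after null filling.
df = pd.DataFrame({
    "gender": ["female", "male", "UnKnown"],
    "source_type": ["album", "song", "album"],
})

# Fit a separate encoder per column and remember it by column name.
encoders = {}
for col in ["gender", "source_type"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

# The stored encoder can map the integer codes back to the original labels.
original_gender = encoders["gender"].inverse_transform(df["gender"])
```
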

In [25]:
x = train_df.drop(columns={"registration_init_time","expiration_date","song_id","msno"})
x.head()

Unnamed: 0,source_system_tab,source_screen_name,source_type,target,city,bd,gender,registered_via,registration_init_time-day,registration_init_time-month,registration_init_time-year,expiration_date-day,expiration_date-month,expiration_date-year
0,2,7,7,1,1,0,0,7,2,1,2012,5,10,2017
1,4,8,5,1,13,24,1,9,25,5,2011,11,9,2017
2,4,8,5,1,13,24,1,9,25,5,2011,11,9,2017
3,4,8,5,1,13,24,1,9,25,5,2011,11,9,2017
4,2,7,7,1,1,0,0,7,2,1,2012,5,10,2017


In [26]:
x.shape

(7377418, 14)

# Training and Testing Data

In [27]:
#Training 
y = x.pop('target')
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.3,random_state=42)
print(f"Shape Of Training Data Set : ",x_train.shape)
print(f"Shape Of Testing Data Set :",x_test.shape)
print(f"Shape Of Train Label :",y_train.shape)
print(f"Shape Of Test Label :",y_test.shape)

Shape Of Training Data Set :  (5164192, 13)
Shape Of Testing Data Set : (2213226, 13)
Shape Of Train Label : (5164192,)
Shape Of Test Label : (2213226,)
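
A purely random split places rows of the same user on both sides of the split, which can make the test accuracy look better than it would be for unseen users. One possible alternative is a group-wise split by user, sketched below on synthetic data. The `users` array stands in for `msno` and is an assumption; the notebook drops `msno` before splitting.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic features, labels and user ids (20 users, 100 rows).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
users = rng.integers(0, 20, size=100)  # stand-in for msno

# Hold out roughly 30% of the users; all rows of a user stay together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=users))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```
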


# Using Different Classification Techniques 
## Logistic Regression

In [28]:
lr = LogisticRegression(max_iter=300,C=0.001,penalty="l2")
lr.fit(x_train,y_train)
train_pred = lr.predict(x_train)
test_pred = lr.predict(x_test)
lr_train_acc = accuracy_score(y_train,train_pred)
lr_test_acc = accuracy_score(y_test,test_pred)
print("Training Accuracy : ",lr_train_acc)
print("Test Accuracy : ",lr_test_acc)
print("Classification Report:\n",classification_report(y_train,train_pred))

Training Accuracy :  0.6002234231415099
Test Accuracy :  0.6005170732677096
Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.55      0.58   2564524
           1       0.59      0.65      0.62   2599668

    accuracy                           0.60   5164192
   macro avg       0.60      0.60      0.60   5164192
weighted avg       0.60      0.60      0.60   5164192



## Naive Bayes

In [29]:
NB = GaussianNB(var_smoothing=0.01)
NB.fit(x_train,y_train)
train_pred = NB.predict(x_train)
test_pred = NB.predict(x_test)
NB_train_acc = accuracy_score(y_train,train_pred)
NB_test_acc = accuracy_score(y_test,test_pred)
print("Training Accuracy : ",NB_train_acc)
print("Test Accuracy : ",NB_test_acc)
print("Classification Report:\n",classification_report(y_train,train_pred))

Training Accuracy :  0.5583400074977848
Test Accuracy :  0.5584576541211788
Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.29      0.39   2564524
           1       0.54      0.82      0.65   2599668

    accuracy                           0.56   5164192
   macro avg       0.58      0.56      0.52   5164192
weighted avg       0.58      0.56      0.52   5164192



## Decision Tree

In [30]:
Dtree = DecisionTreeClassifier(max_depth=None,min_samples_leaf=1,min_samples_split=5)
Dtree.fit(x_train,y_train)
train_pred = Dtree.predict(x_train)
test_pred = Dtree.predict(x_test)
Dtree_train_acc = accuracy_score(y_train,train_pred)
Dtree_test_acc = accuracy_score(y_test,test_pred)
print("Training Accuracy : ",Dtree_train_acc)
print("Test Accuracy : ",Dtree_test_acc)
print("Classification Report:\n",classification_report(y_train,train_pred))

Training Accuracy :  0.7318786753087414
Test Accuracy :  0.7133501052310067
Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.74      0.73   2564524
           1       0.74      0.72      0.73   2599668

    accuracy                           0.73   5164192
   macro avg       0.73      0.73      0.73   5164192
weighted avg       0.73      0.73      0.73   5164192
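
The tree above uses fixed hyperparameters with unlimited depth. `GridSearchCV` is imported at the top of the notebook but never used; a minimal sketch of how it could tune the tree is shown below on synthetic data. The parameter grid is an assumption chosen for illustration, and on the full data set a search like this would be considerably slower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic binary classification problem.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Candidate settings that limit how far the tree can grow.
param_grid = {"max_depth": [3, 6, 12, None], "min_samples_leaf": [1, 20]}

# 3-fold cross-validated search scored by accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```
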



# Boosting Classification Algorithms
## Gradient Boosting

In [31]:
gb = GradientBoostingClassifier(learning_rate=.1,max_depth=4,n_estimators=50)
gb.fit(x_train,y_train)
train_pred = gb.predict(x_train)
test_pred = gb.predict(x_test)
gb_train_acc = accuracy_score(y_train,train_pred)
gb_test_acc = accuracy_score(y_test,test_pred)
print("Training Accuracy : ",gb_train_acc)
print("Test Accuracy : ",gb_test_acc)
print("Classification Report:\n",classification_report(y_train,train_pred))

Training Accuracy :  0.6265076898767513
Test Accuracy :  0.626886725530967
Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.66      0.64   2564524
           1       0.64      0.60      0.62   2599668

    accuracy                           0.63   5164192
   macro avg       0.63      0.63      0.63   5164192
weighted avg       0.63      0.63      0.63   5164192



## XGBoost

In [32]:
xg = xgb.XGBClassifier(random_state=0,learning_rate=.01,max_depth=3,n_estimators=50)
xg.fit(x_train,y_train)
train_pred = xg.predict(x_train)
test_pred = xg.predict(x_test)
xg_train_acc = accuracy_score(y_train,train_pred)
xg_test_acc = accuracy_score(y_test,test_pred)
print("Training Accuracy : ",xg_train_acc)
print("Test Accuracy : ",xg_test_acc)
print("Classification Report:\n",classification_report(y_train,train_pred))

Training Accuracy :  0.6236526449829906
Test Accuracy :  0.6240736373059055
Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.67      0.64   2564524
           1       0.64      0.58      0.61   2599668

    accuracy                           0.62   5164192
   macro avg       0.63      0.62      0.62   5164192
weighted avg       0.63      0.62      0.62   5164192



# Comparing Different models

In [35]:
cols = [
    ["Logistic Regression",lr_train_acc,lr_test_acc],
    ["Naive Bayes",NB_train_acc,NB_test_acc],
    ["Decision Trees",Dtree_train_acc,Dtree_test_acc],
    ["Gradient Boosting",gb_train_acc,gb_test_acc],
    ["XGBoost",xg_train_acc,xg_test_acc]
    ]
results = pd.DataFrame( cols,
                       columns = ["Model","Training Accuracy %","Test Evaluation %"]).sort_values(
                        by="Test Evaluation %",ascending=False)
results.style.background_gradient(cmap='Set1')

Unnamed: 0,Model,Training Accuracy %,Test Evaluation %
2,Decision Trees,0.731879,0.71335
3,Gradient Boosting,0.626508,0.626887
4,XGBoost,0.623653,0.624074
0,Logistic Regression,0.600223,0.600517
1,Naive Bayes,0.55834,0.558458
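
The task is phrased as predicting the chance of a repeat listen, so a ranking metric on predicted probabilities, such as ROC AUC, is a natural complement to accuracy on hard 0/1 predictions. A minimal sketch on synthetic data follows; it uses logistic regression only as an example, and any of the fitted classifiers above that expose `predict_proba` could be scored the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data and a 70/30 split.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=300).fit(X_tr, y_tr)

# Probability of the positive class (target=1) for each test row.
proba = model.predict_proba(X_te)[:, 1]

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking.
auc = roc_auc_score(y_te, proba)
```
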


In [34]:
test_df = x_test.reset_index()
test_df.head()

Unnamed: 0,index,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,registration_init_time-day,registration_init_time-month,registration_init_time-year,expiration_date-day,expiration_date-month,expiration_date-year
0,1919950,7,15,9,1,0,0,7,24,1,2014,6,10,2017
1,358522,1,19,11,22,21,1,3,2,11,2014,17,6,2017
2,5324459,6,14,8,14,30,1,9,10,3,2006,24,9,2017
3,3353377,4,8,5,11,26,1,9,28,9,2007,30,9,2017
4,3930239,1,1,11,14,0,0,9,8,3,2015,8,9,2017


- So, after merging in the member data and comparing different classification algorithms (standard and boosting), the Decision Tree has the highest test accuracy (about 0.71) and is the best fit for our music recommendation model.

# Submitting the final result

In [33]:
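# NOTE: x_test is the 30% hold-out split of train.csv, not the Kaggle test.csv,
# so the ids below are training-row indices rather than test.csv ids.
predictions = Dtree.predict(x_test)
sub = pd.DataFrame()
sub['id'] = test_df['index']
sub['target'] = predictions
sub.to_csv('submission.csv',index=False)
sub.head()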

Unnamed: 0,id,target
0,1919950,0
1,358522,0
2,5324459,0
3,3353377,1
4,3930239,0
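
A submission in the `sample_submission.csv` format pairs each `id` from `test.csv` with a predicted target. The sketch below shows one way to build such a frame from predicted probabilities. The helper `build_submission` is hypothetical, the toy data is made up, and the test features are assumed to have already gone through the same preprocessing as the training features.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def build_submission(model, test_features, ids):
    """Return a frame with columns id and target (probability of target=1)."""
    proba = model.predict_proba(test_features)[:, 1]
    return pd.DataFrame({"id": list(ids), "target": proba})

# Toy training data with two already-encoded feature columns.
train_x = pd.DataFrame({"source_type": [0, 1, 0, 1], "city": [1, 13, 1, 5]})
train_y = [0, 1, 0, 1]
model = DecisionTreeClassifier(random_state=0).fit(train_x, train_y)

# Toy test rows with the same columns, plus their row ids.
test_x = pd.DataFrame({"source_type": [1, 0], "city": [13, 1]})
sub = build_submission(model, test_x, ids=[0, 1])
```
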
