### Build Prediction Model to predict whether a user will listen to an Artist’s songs

## Read Data <a name="read_data"></a>

In [1]:
import pandas as pd

In [2]:
# Read file
plays_data = pd.read_csv("/cxldata/gle/usersha1-artmbid-artname-plays.tsv",sep="\t",header=None)

In [3]:
plays_data.columns = ["user_id","artist_id","artist_name","no_plays"]

In [4]:
# Read user profile data
user_profile = pd.read_csv("/cxldata/gle/usersha1-profile.tsv", sep="\t",header=None)
user_profile.head()

Unnamed: 0,0,1,2,3,4
0,00000c289a1829a808ac09c00daf10bc3c4e223b,f,22.0,Germany,"Feb 1, 2007"
1,00001411dc427966b17297bf4d69e7e193135d89,f,,Canada,"Dec 4, 2007"
2,00004d2ac9316e22dc007ab2243d6fcb239e707d,,,Germany,"Sep 1, 2006"
3,000063d3fe1cf2ba248b9e3c3f0334845a27a6bf,m,19.0,Mexico,"Apr 28, 2008"
4,00007a47085b9aab8af55f52ec8846ac479ac4fe,m,28.0,United States,"Jan 27, 2006"


In [5]:
user_profile.columns = ["user_id","gender","age","country","registered_on"]
len(set(user_profile["user_id"]))

359347

## Predict whether a User will listen to an Artist's Songs

## Approach <a name="approach"></a>


This is the approach we will follow:
    1. For any given artist id, we look at the user ids who have listened to that artist. These users are labeled "Yes" in our dataset
    2. All other users, who have not listened to that particular artist, are labeled "No". This makes it a binary classification problem
    3. Instead of considering all the users (~360k), we only consider the top 50k users by total play count
    4. Finally, we write a generic function for this, so that the logic can be applied to any given artist id
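
The steps above can be sketched on a toy table. This is only an illustration under assumptions: the rows in `toy_plays`, the helper name `label_users_for_artist` and its `top_n` parameter are invented for the example and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for plays_data (values are invented for illustration)
toy_plays = pd.DataFrame({
    "user_id":   ["u1", "u1", "u2", "u3", "u3", "u4"],
    "artist_id": ["a1", "a2", "a1", "a2", "a3", "a3"],
    "no_plays":  [10, 5, 7, 20, 1, 2],
})

def label_users_for_artist(plays, artist_id, top_n):
    """Hypothetical helper: label the top_n users (by total plays) 1/0 for one artist."""
    # Step 3: keep only the most active users
    top_users = plays.groupby("user_id")["no_plays"].sum().nlargest(top_n).index
    # Step 1: users with at least one row for this artist
    listeners = set(plays.loc[plays["artist_id"] == artist_id, "user_id"])
    # Steps 1-2: 1 for listeners, 0 for everyone else
    labels = pd.DataFrame({"user_id": top_users})
    labels["target"] = labels["user_id"].isin(listeners).astype(int)
    return labels

toy_labels = label_users_for_artist(toy_plays, "a1", top_n=3)
print(toy_labels)
```

Using `isin` keeps the labeling vectorised, which matters more at 50k users than on this toy table.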
    

In [6]:

plays_data["user_id"].nunique()

358868

In [7]:
tot_plays_user = plays_data.groupby("user_id")["no_plays"].sum()
tot_plays_artist = plays_data.groupby("artist_id")["no_plays"].sum()
tot_plays_user.shape

(358868,)

In [8]:
tot_plays_user.sort_values(ascending=False,inplace=True)
tot_plays_artist.sort_values(ascending=False,inplace=True)

In [9]:
# We next select the top 50k users as per their total number of plays
top_50k_users = tot_plays_user.index[0:50000]

<b> We will next build a classification model for the most-played artist.
We will eventually wrap this in a generic function so that the logic can be run for any given artist id.
</b>

In [10]:
selected_artist = tot_plays_artist.index[0]
selected_artist

'b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d'

In [11]:
# We next filter our total plays dataset on this artist id as well as the top 50k user ids
artist_play_data = plays_data[(plays_data["artist_id"] == selected_artist) & (plays_data["user_id"].isin(top_50k_users))]

In [12]:
artist_play_data.head()

Unnamed: 0,user_id,artist_id,artist_name,no_plays
257,0000c176103e538d5c9828e695fed4f7ae42dd01,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,the beatles,704
547,000163263d2a41a3966a3746855b8b75b7d7aa83,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,the beatles,170
1544,000532f6886f086f61037acd896828f0b5b36bf2,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,the beatles,3182
2050,000752c87a61bc4247f5219b4769c347c0062c8a,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,the beatles,248
10080,00248667343aef7179c66db4d3d4de737403c572,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,the beatles,321


In [13]:
artist_play_data.shape

(12264, 4)

## Creating Target Variable <a name="target_var"></a>

<b> We can see that ~12k of the top 50k users have listened to this artist, i.e. the remaining ~38k users haven't. </b>

We next create the dataset that we will use for training the classification model.
The ~12k listeners get the "Yes" label (target = 1), while the remaining ~38k users get the "No" label (target = 0).

In [14]:
# We define 2 variables - "users_yes", i.e. users who have listened to the artist,
# and "users_no", i.e. users who have NOT listened to the artist
users_yes = list(artist_play_data["user_id"])
users_yes_set = set(users_yes)  # set membership is much faster than scanning a list
users_no = [x for x in top_50k_users if x not in users_yes_set]
len(users_no)

37736

In [15]:
# Finally we create our model dataset

users_yes_df = pd.DataFrame({"user_id": users_yes,"target": 1})
users_no_df = pd.DataFrame({"user_id": users_no,"target": 0})
model_df = pd.concat([users_yes_df,users_no_df])
model_df.shape

(50000, 2)

In [16]:
model_df["target"].value_counts()

0    37736
1    12264
Name: target, dtype: int64

<b> So we finally have our model dataset with the target variable.
We next need to map the features corresponding to each user id.</b>

## Creating Features/Feature Engineering <a name="features"></a>

In [17]:
# Map features
model_df_with_features = pd.merge(model_df, user_profile, on = "user_id",how="left")
model_df_with_features.head()

Unnamed: 0,target,user_id,gender,age,country,registered_on
0,1,0000c176103e538d5c9828e695fed4f7ae42dd01,m,20.0,United Kingdom,"Jan 14, 2006"
1,1,000163263d2a41a3966a3746855b8b75b7d7aa83,m,27.0,Sweden,"Jan 5, 2007"
2,1,000532f6886f086f61037acd896828f0b5b36bf2,f,,Finland,"Feb 12, 2006"
3,1,000752c87a61bc4247f5219b4769c347c0062c8a,f,21.0,United States,"Jul 18, 2005"
4,1,00248667343aef7179c66db4d3d4de737403c572,m,20.0,Sweden,"Apr 15, 2004"


In [18]:
# We next check for missing values
model_df_with_features.isnull().sum()

target              0
user_id             0
gender           3794
age              6857
country             0
registered_on       0
dtype: int64

In [19]:
# We do a mean imputation for age and a mode imputation for gender
age_impute = model_df_with_features["age"].mean()
gender_impute = model_df_with_features["gender"].value_counts().index[0]
age_impute

24.002735090281156

In [20]:
gender_impute

'm'
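
The choice between mean and median matters when a few extreme ages are present. The following is a minimal sketch on invented values (the ages in `toy_age` are assumptions, not taken from the dataset) showing how the two imputation values can differ.

```python
import pandas as pd

# Invented ages with one missing value and one outlier
toy_age = pd.Series([19.0, 21.0, 22.0, None, 24.0, 90.0])

mean_age = toy_age.mean()      # NaN is skipped by default
median_age = toy_age.median()  # less sensitive to the outlier

filled_with_mean = toy_age.fillna(mean_age)
filled_with_median = toy_age.fillna(median_age)
print(mean_age, median_age)
```

If the real age column contains implausible values, the median may be the safer fill value; the notebook's code uses the mean.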

In [21]:
model_df_with_features.fillna({"age": age_impute, "gender": gender_impute}, inplace = True)

Unnamed: 0,target,user_id,gender,age,country,registered_on
0,1,0000c176103e538d5c9828e695fed4f7ae42dd01,m,20.000000,United Kingdom,"Jan 14, 2006"
1,1,000163263d2a41a3966a3746855b8b75b7d7aa83,m,27.000000,Sweden,"Jan 5, 2007"
2,1,000532f6886f086f61037acd896828f0b5b36bf2,f,24.002735,Finland,"Feb 12, 2006"
3,1,000752c87a61bc4247f5219b4769c347c0062c8a,f,21.000000,United States,"Jul 18, 2005"
4,1,00248667343aef7179c66db4d3d4de737403c572,m,20.000000,Sweden,"Apr 15, 2004"
5,1,00277ccecc376837e57b6d6b58330d1bafc90c73,m,31.000000,Brazil,"Nov 11, 2007"
6,1,0033ee7378661b88b245b1f67cc622ff63a51061,m,24.002735,United States,"Jun 5, 2006"
7,1,003c3c21a7ee4f8ce34e82f204d5aaf63432de87,m,22.000000,Turkey,"Mar 14, 2007"
8,1,00458c96257bab27657adca90732ecc4904300de,f,24.000000,Ghana,"Jan 28, 2006"
9,1,00489b25aafa16486bc0b5521fe001f46cc55b34,m,21.000000,Sweden,"Aug 30, 2007"


In [22]:
model_df_with_features.isnull().sum()

target           0
user_id          0
gender           0
age              0
country          0
registered_on    0
dtype: int64

In [23]:
# We convert the registration date into the number of days since registration
model_df_with_features["registered_on"] = pd.to_datetime(model_df_with_features["registered_on"])
model_df_with_features["registered_on"][0:5]


0   2006-01-14
1   2007-01-05
2   2006-02-12
3   2005-07-18
4   2004-04-15
Name: registered_on, dtype: datetime64[ns]

In [24]:
model_df_with_features["curr_date"] = pd.to_datetime("2017-12-01")

In [25]:
duration = model_df_with_features["curr_date"] - model_df_with_features["registered_on"]

In [26]:
import numpy as np
duration=(duration / np.timedelta64(1, 'D')).astype(int)

In [27]:
model_df_with_features["duration"] = duration

In [28]:
# Finally we drop the registered_on and curr_date columns
model_df_with_features.drop(labels = ["curr_date","registered_on"],axis=1,inplace=True)
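
The same "days since registration" feature can also be derived without a helper column. The sketch below is an alternative under assumptions: the three dates in `toy_registered` are sample values in the profile file's date style, and the explicit `format` string is an assumption about that style.

```python
import pandas as pd

# Sample registration dates in the "Mon D, YYYY" style of the profile file
toy_registered = pd.Series(["Jan 14, 2006", "Jan 5, 2007", "Apr 15, 2004"])

# Fixed reference date, matching the one used in the notebook
reference_date = pd.Timestamp("2017-12-01")

# Parse the strings, subtract from the reference date,
# then use Series.dt.days to extract whole days from the timedeltas
parsed = pd.to_datetime(toy_registered, format="%b %d, %Y")
toy_duration = (reference_date - parsed).dt.days
print(toy_duration.tolist())
```

Subtracting from a scalar `Timestamp` avoids creating and then dropping a `curr_date` column.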

### 1-hot-encoding of Country Variable

In [29]:
model_df_with_features["country"].nunique()

220

We have a total of 220 countries. Instead of encoding all of them, we keep the top 50 countries (by user count) and group all other countries under "Others".
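
The bucketing described above can be sketched on a toy column. The country codes and the `keep_top` value are invented for illustration; `keep_top` plays the role of the 50 used in the notebook.

```python
import pandas as pd

# Invented country values for illustration
toy_country = pd.Series(["US", "US", "DE", "DE", "DE", "FI", "GH"])
keep_top = 2

# Most frequent categories, by count
top_values = toy_country.value_counts().index[:keep_top]

# Series.where keeps values where the condition holds and replaces the rest
toy_bucketed = toy_country.where(toy_country.isin(top_values), "Others")

# One indicator column per remaining category (cast to int for a uniform dtype)
toy_dummies = pd.get_dummies(toy_bucketed).astype(int)
print(toy_dummies.columns.tolist())
```

Capping the number of categories keeps the feature matrix small and avoids indicator columns that are almost always zero.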

In [30]:
top_countries = model_df_with_features["country"].value_counts().index[0:50]
top_countries

Index([u'United States', u'Germany', u'United Kingdom', u'Poland', u'Sweden',
       u'Finland', u'Russian Federation', u'Brazil', u'Canada', u'Australia',
       u'Netherlands', u'Japan', u'Spain', u'Norway', u'France',
       u'Czech Republic', u'Mexico', u'Turkey', u'Ukraine', u'Belgium',
       u'Italy', u'Bulgaria', u'Portugal', u'Austria', u'Croatia', u'Denmark',
       u'Romania', u'Switzerland', u'Argentina', u'Chile', u'Lithuania',
       u'New Zealand', u'Latvia', u'Estonia', u'Slovakia', u'Serbia',
       u'Ireland', u'Hungary', u'Belarus', u'Israel', u'Slovenia', u'Colombia',
       u'Greece', u'South Africa', u'Thailand', u'Venezuela', u'Antarctica',
       u'Philippines', u'India', u'China'],
      dtype='object')

In [31]:
# We update the country column
import numpy as np
model_df_with_features["country"] = np.where(model_df_with_features["country"].isin(top_countries),
                                            model_df_with_features["country"],"Others")

In [32]:
# Finally we do a 1-hot-encoding of the country variable
country_dummies = pd.get_dummies(model_df_with_features["country"])
country_dummies.head()

Unnamed: 0,Antarctica,Argentina,Australia,Austria,Belarus,Belgium,Brazil,Bulgaria,Canada,Chile,...,South Africa,Spain,Sweden,Switzerland,Thailand,Turkey,Ukraine,United Kingdom,United States,Venezuela
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [33]:
model_df_with_features = pd.concat([model_df_with_features,country_dummies],axis=1)
model_df_with_features.shape

(50000, 57)

In [34]:
# We also convert gender to numeric values
model_df_with_features["gender"] = [1 if gender == "m" else 2 for gender in model_df_with_features["gender"]]

In [35]:
model_df_with_features["gender"].value_counts()

1    40261
2     9739
Name: gender, dtype: int64

In [36]:
model_df_with_features.columns

Index([u'target', u'user_id', u'gender', u'age', u'country', u'duration',
       u'Antarctica', u'Argentina', u'Australia', u'Austria', u'Belarus',
       u'Belgium', u'Brazil', u'Bulgaria', u'Canada', u'Chile', u'China',
       u'Colombia', u'Croatia', u'Czech Republic', u'Denmark', u'Estonia',
       u'Finland', u'France', u'Germany', u'Greece', u'Hungary', u'India',
       u'Ireland', u'Israel', u'Italy', u'Japan', u'Latvia', u'Lithuania',
       u'Mexico', u'Netherlands', u'New Zealand', u'Norway', u'Others',
       u'Philippines', u'Poland', u'Portugal', u'Romania',
       u'Russian Federation', u'Serbia', u'Slovakia', u'Slovenia',
       u'South Africa', u'Spain', u'Sweden', u'Switzerland', u'Thailand',
       u'Turkey', u'Ukraine', u'United Kingdom', u'United States',
       u'Venezuela'],
      dtype='object')

## Training a Binary Classification Model <a name="model_train"></a>

In [37]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(model_df_with_features.drop(labels=["user_id","target","country"],axis=1),
                                                   model_df_with_features["target"],
                                                    test_size=0.3, random_state = 123)
X_train.shape

(35000, 54)

In [38]:
y_train.value_counts()

0    26323
1     8677
Name: target, dtype: int64

In [39]:
# Next we train a Logistic Regression model
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
train_model = lr_model.fit(X_train,y_train)

## Model Performance Evaluation <a name="model_perf"></a>

In [40]:
# We score the model on the test set
pred_test_prob = pd.Series(lr_model.predict_proba(X_test)[:,1])

In [41]:
pred_test_prob.describe()

count    15000.000000
mean         0.249910
std          0.076672
min          0.054701
25%          0.202855
50%          0.243487
75%          0.329724
max          0.408924
dtype: float64

The maximum predicted probability is only about 0.41, so no test case crosses the default 0.5 cutoff.

In [42]:
from sklearn.metrics import roc_auc_score,confusion_matrix

In [43]:
pred_test_prob[0:12]

0     0.227852
1     0.268917
2     0.271604
3     0.295505
4     0.132653
5     0.347541
6     0.211663
7     0.154567
8     0.251927
9     0.244254
10    0.367789
11    0.249566
dtype: float64

In [44]:
roc_auc_score(y_test,pred_test_prob)

0.62428558876621343

### Inference

The area under the ROC curve (AUC) is only about 0.62, so the model ranks listeners above non-listeners only modestly better than chance (0.5). This could be due to the limited number of features (gender, age, country and account age) that we had at our disposal to create the model.
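
One way to read the AUC figure: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes that quantity directly on invented labels and scores. `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels and scores, for illustration only
example_labels = [0, 0, 1, 0, 1]
example_scores = [0.10, 0.30, 0.35, 0.40, 0.32]
example_auc = pairwise_auc(example_labels, example_scores)
print(example_auc)
```

Because AUC depends only on the ranking of the scores, it is unaffected by the fact that all predicted probabilities sit below 0.5.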


In [45]:
# We next build a confusion matrix
# By default, predict() applies a 0.5 probability cutoff
# Given that our max probability is only ~0.41, we have to use a different threshold
# If the probability is > 75th percentile we tag the case as 1, else 0
threshold = pred_test_prob.quantile(q=0.75)  # compute the cutoff once
pred_test_class = (pred_test_prob > threshold).astype(int)
confusion_matrix(y_test, pred_test_class)

array([[8939, 2474],
       [2311, 1276]])

In [46]:
# We compute the accuracy from this confusion matrix
from sklearn.metrics import accuracy_score
accuracy_score(y_test,pred_test_class)

0.68100000000000005

In [47]:
# We have a 68% accuracy (the accuracy number will change depending on the threshold chosen)
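
Accuracy alone can be misleading on an imbalanced target, so it may help to derive a few more figures from the same confusion matrix. The sketch below only re-uses the four counts printed above; the variable names are chosen for the example.

```python
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted class, scikit-learn's convention)
tn, fp = 8939, 2474
fn, tp = 2311, 1276
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted listeners who really listened
recall = tp / (tp + fn)      # share of real listeners that were flagged
# Accuracy of a trivial rule that predicts "no" for every user
baseline_accuracy = (tn + fp) / total

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(baseline_accuracy, 3))
```

On these counts the always-"no" rule reaches about 76% accuracy, higher than the model's 68% at this threshold, while precision and recall are both around 35%.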

In [48]:
# Finally we try running this model on the whole of our dataset
# To do that we would have to convert the country and gender variables
model_df_updated = model_df_with_features.copy()
model_df_updated.columns

Index([u'target', u'user_id', u'gender', u'age', u'country', u'duration',
       u'Antarctica', u'Argentina', u'Australia', u'Austria', u'Belarus',
       u'Belgium', u'Brazil', u'Bulgaria', u'Canada', u'Chile', u'China',
       u'Colombia', u'Croatia', u'Czech Republic', u'Denmark', u'Estonia',
       u'Finland', u'France', u'Germany', u'Greece', u'Hungary', u'India',
       u'Ireland', u'Israel', u'Italy', u'Japan', u'Latvia', u'Lithuania',
       u'Mexico', u'Netherlands', u'New Zealand', u'Norway', u'Others',
       u'Philippines', u'Poland', u'Portugal', u'Romania',
       u'Russian Federation', u'Serbia', u'Slovakia', u'Slovenia',
       u'South Africa', u'Spain', u'Sweden', u'Switzerland', u'Thailand',
       u'Turkey', u'Ukraine', u'United Kingdom', u'United States',
       u'Venezuela'],
      dtype='object')

# Generic Function for any artist <a name = "func"></a>

The model that we have built so far is for one particular artist id. We will next create a function that runs these steps for any given artist id.

In [49]:
def generate_artist_prediction(artist_id):
    # We filter our total plays dataset on this artist id as well as the top 50k user ids
    artist_play_data = plays_data[(plays_data["artist_id"] == artist_id) & (plays_data["user_id"].isin(top_50k_users))]
    
    # Create target variable
    users_yes = list(artist_play_data["user_id"])
    users_yes_set = set(users_yes)  # faster membership checks than a list
    users_no = [x for x in top_50k_users if x not in users_yes_set]
    users_yes_df = pd.DataFrame({"user_id": users_yes,"target": 1})
    users_no_df = pd.DataFrame({"user_id": users_no,"target": 0})
    model_df = pd.concat([users_yes_df,users_no_df])
    
    # Map features
    model_df_with_features = pd.merge(model_df, user_profile, on = "user_id",how="left")
    
    # Missing value imputation
    # We do a mean imputation for age and a mode imputation for gender
    age_impute = model_df_with_features["age"].mean()
    gender_impute = model_df_with_features["gender"].value_counts().index[0]
    model_df_with_features.fillna({"age": age_impute, "gender": gender_impute}, inplace = True)
    
    # We convert the registration date into the number of days since registration
    model_df_with_features["registered_on"] = pd.to_datetime(model_df_with_features["registered_on"])
    model_df_with_features["curr_date"] = pd.to_datetime("2017-12-01")
    duration = model_df_with_features["curr_date"] - model_df_with_features["registered_on"]
    import numpy as np
    duration=(duration / np.timedelta64(1, 'D')).astype(int)
    model_df_with_features["duration"] = duration
    # Finally we drop the registered_on and curr_date columns
    model_df_with_features.drop(labels = ["curr_date","registered_on"],axis=1,inplace=True)
    
    #1-hot encoding of countries
    top_countries = model_df_with_features["country"].value_counts().index[0:50]
    model_df_with_features["country"] = np.where(model_df_with_features["country"].isin(top_countries),
                                            model_df_with_features["country"],"Others")
    country_dummies = pd.get_dummies(model_df_with_features["country"])
    model_df_with_features = pd.concat([model_df_with_features,country_dummies],axis=1)
    # We also convert gender to numeric values
    model_df_with_features["gender"] = [1 if gender == "m" else 2 for gender in model_df_with_features["gender"]]
    
    # Train model
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(model_df_with_features.drop(labels=["user_id","target","country"],axis=1),
                                                   model_df_with_features["target"],
                                                    test_size=0.3, random_state = 123)
    # Next we train a Logistic Regression model
    from sklearn.linear_model import LogisticRegression
    lr_model = LogisticRegression()
    train_model = lr_model.fit(X_train,y_train)

    # model performance
    # We score the model on the test set
    pred_test_prob = pd.Series(lr_model.predict_proba(X_test)[:,1])
    
    from sklearn.metrics import roc_auc_score,confusion_matrix
    auc = roc_auc_score(y_test,pred_test_prob)
    # Compute the 75th-percentile cutoff once, then threshold all scores
    threshold = pred_test_prob.quantile(q=0.75)
    pred_test_class = (pred_test_prob > threshold).astype(int)
    conf_matrix = confusion_matrix(y_test, pred_test_class)
    
    res = {"auc":auc, "confusion_matrix": conf_matrix}
    return res

In [50]:
# We can now use this function to run a prediction model for any given artist
# For example, the artist ranked 11th by total plays:
selected_artist = tot_plays_artist.index[10]

In [51]:
generate_artist_prediction(selected_artist)

{'auc': 0.6776041410603082, 'confusion_matrix': array([[10456,  3104],
        [  794,   646]])}