# Song Recommendation Challenge


## Goal


Company XYZ is a very early stage startup. They allow people to stream music from their mobile for free. Right now, they still only have songs from the Beatles in their music collection, but they are planning to expand soon.

They still have all their data in json files and they are interested in getting some basic info about their users as well as building a very preliminary song recommendation model in order to increase user engagement.


## Challenge Description

You are the fifth employee at company XYZ. The good news is that if the company becomes big, you will become very rich with the stocks. The bad news is that at such an early stage the data is usually very messy. All their data is stored in json format.

The company CEO asked you some very specific questions:

- What are the top 3 and the bottom 3 states in terms of number of users?


- What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.


- The CEO wants to send a gift to the first user who signed-up for each state. That is, the first user who signed-up from California, from Oregon, etc. Can you give him a list of those users?


- Build a function that takes as an input any of the songs in the data and returns the most likely song to be listened to next. That is, if, for instance, a user is currently listening to “Eight Days A Week”, which song has the highest probability of being played right after it by the same user? This is going to be V1 of a song recommendation model.


- How would you set up a test to check whether your model works well?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
ls

08_Song_Recommendation_Challenge.ipynb  song.json


In [3]:
df = pd.read_json('song.json')
df.head()

Unnamed: 0,id,song_played,time_played,user_id,user_sign_up_date,user_state
0,GOQMMKSQQH,Hey Jude,2015-06-11 21:51:35,122,2015-05-16,Louisiana
1,HWKKBQKNWI,We Can Work It Out,2015-06-06 16:49:19,3,2015-05-01,Ohio
2,DKQSXVNJDH,Back In the U.S.S.R.,2015-06-14 02:11:29,35,2015-05-04,New Jersey
3,HLHRIDQTUW,P.s. I Love You,2015-06-08 12:26:10,126,2015-05-16,Illinois
4,SUKJCSBCYW,Sgt. Pepper's Lonely Hearts Club Band,2015-06-28 14:57:00,6,2015-05-01,New Jersey


In [4]:
df.shape

(4000, 6)

In [5]:
# Check for missing values: none found in any column

df.isnull().sum()

id                   0
song_played          0
time_played          0
user_id              0
user_sign_up_date    0
user_state           0
dtype: int64

In [6]:
df.dtypes

id                   object
song_played          object
time_played          object
user_id               int64
user_sign_up_date    object
user_state           object
dtype: object

## Question 1

What are the top 3 and the bottom 3 states in terms of number of users?

In [7]:
df_userid = df[['user_state','user_id']].drop_duplicates()
df_userid = df_userid.groupby('user_state').count().sort_values('user_id',ascending = False).reset_index()
df_userid

Unnamed: 0,user_state,user_id
0,New York,23
1,California,21
2,Texas,15
3,Pennsylvania,9
4,Ohio,9
5,Illinois,7
6,Florida,7
7,North Carolina,6
8,New Jersey,6
9,Massachusetts,6


In [8]:
df_userid.head(3)

Unnamed: 0,user_state,user_id
0,New York,23
1,California,21
2,Texas,15


In [9]:
df_userid.tail(10)

Unnamed: 0,user_state,user_id
31,Washington,2
32,New Mexico,1
33,Idaho,1
34,North Dakota,1
35,Connecticut,1
36,Iowa,1
37,Rhode Island,1
38,Nebraska,1
39,Arizona,1
40,Kansas,1


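So the top 3 states are New York, California and Texas. The bottom 3 are not unique: many states are tied at a single user (the last nine rows above all have 1 user), so any three of them qualify.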
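Because of those ties, a plain `head(3)` / `tail(3)` picks an arbitrary subset. The following is a minimal sketch, not the notebook's method, that counts distinct users with `nunique` and uses `keep='all'` so every state tied at the cut-off is returned. The `toy` frame and its values are made up for illustration; only the column names follow `df`.

```python
import pandas as pd

# Toy play log (made-up values); columns mirror those in `df`.
toy = pd.DataFrame({
    'user_state': ['New York', 'New York', 'New York', 'Texas', 'Texas', 'Idaho', 'Kansas'],
    'user_id':    [1, 1, 2, 3, 4, 5, 6],
})

# Count distinct users per state (repeated plays by one user count once).
users_per_state = toy.groupby('user_state')['user_id'].nunique()

# keep='all' returns every state tied at the cut-off instead of an arbitrary pick.
top = users_per_state.nlargest(1, keep='all')
bottom = users_per_state.nsmallest(1, keep='all')

print(top)
print(bottom)
```

On the real data the same calls with `3` instead of `1` would list all states tied for the third-lowest user count.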

## Question 2

What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.

In [10]:
df_engage = df_userid.sort_values(by='user_state').reset_index(drop=True)
df_engage

Unnamed: 0,user_state,user_id
0,Alabama,4
1,Alaska,2
2,Arizona,1
3,Arkansas,2
4,California,21
5,Colorado,3
6,Connecticut,1
7,Florida,7
8,Georgia,6
9,Idaho,1


In [11]:
df_engage2 = df.groupby(['user_state'])['time_played'].count().reset_index()
df_engage = pd.merge(df_engage, df_engage2, how ='left', on = 'user_state')
df_engage

Unnamed: 0,user_state,user_id,time_played
0,Alabama,4,104
1,Alaska,2,58
2,Arizona,1,22
3,Arkansas,2,34
4,California,21,425
5,Colorado,3,54
6,Connecticut,1,16
7,Florida,7,180
8,Georgia,6,135
9,Idaho,1,26


In [12]:
df_engage['engagement'] = df_engage.time_played/df_engage.user_id
df_engage = df_engage.sort_values('engagement',ascending = False).reset_index(drop=True)

In [13]:
df_engage.head(3)

Unnamed: 0,user_state,user_id,time_played,engagement
0,Nebraska,1,36,36.0
1,Alaska,2,58,29.0
2,Mississippi,3,85,28.333333


In [14]:
df_engage.tail(3)

Unnamed: 0,user_state,user_id,time_played,engagement
38,Minnesota,4,42,10.5
39,Virginia,2,17,8.5
40,Kansas,1,8,8.0
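Engagement here is defined as plays per user, and the extremes are states with only one to four users, so the ranking is noisy. As a hedged sketch of one possible refinement, the code below also reports the median plays per user and flags states with too few users. The `toy` data and the `MIN_USERS` threshold are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

# Toy play log (made-up values); one row per play, columns mirror `df`.
toy = pd.DataFrame({
    'user_state': ['Ohio'] * 5 + ['Utah'] * 2,
    'user_id':    [1, 1, 1, 1, 2, 3, 3],
})

# Number of plays for each (state, user) pair.
plays_per_user = toy.groupby(['user_state', 'user_id']).size()

# Summarise each state: user count, mean and median plays per user.
summary = plays_per_user.groupby(level='user_state').agg(
    users='count', mean_plays='mean', median_plays='median'
)

# Hypothetical threshold: states below it are too small to rank reliably.
MIN_USERS = 2
summary['reliable'] = summary['users'] >= MIN_USERS

print(summary.sort_values('mean_plays', ascending=False))
```

`mean_plays` matches the `engagement` column computed above; the extra columns only add context for interpreting it.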


In [15]:
states = list(df_engage.user_state.unique())
states

['Nebraska',
 'Alaska',
 'Mississippi',
 'South Carolina',
 'Rhode Island',
 'Idaho',
 'North Dakota',
 'Kentucky',
 'Alabama',
 'Florida',
 'North Carolina',
 'Missouri',
 'Oklahoma',
 'Ohio',
 'Iowa',
 'Georgia',
 'Maryland',
 'Arizona',
 'Illinois',
 'Louisiana',
 'Oregon',
 'Washington',
 'Tennessee',
 'New York',
 'California',
 'Pennsylvania',
 'New Jersey',
 'Wisconsin',
 'Utah',
 'Colorado',
 'New Mexico',
 'Arkansas',
 'Michigan',
 'Connecticut',
 'Texas',
 'Massachusetts',
 'Indiana',
 'West Virginia',
 'Minnesota',
 'Virginia',
 'Kansas']

## Question 3

The CEO wants to send a gift to the first user who signed-up for each state. That is, the first user who signed-up from California, from Oregon, etc. Can you give him a list of those users?

In [16]:
df_signup = df[['user_id','user_sign_up_date','user_state']].drop_duplicates()
df_signup

Unnamed: 0,user_id,user_sign_up_date,user_state
0,122,2015-05-16,Louisiana
1,3,2015-05-01,Ohio
2,35,2015-05-04,New Jersey
3,126,2015-05-16,Illinois
4,6,2015-05-01,New Jersey
5,147,2015-05-18,Texas
6,155,2015-05-19,Texas
7,171,2015-05-19,Illinois
8,174,2015-05-19,Rhode Island
9,170,2015-05-19,Oregon


In [17]:
df_signup.user_sign_up_date = pd.to_datetime(df_signup.user_sign_up_date)

In [18]:
df_signup.dtypes

user_id                       int64
user_sign_up_date    datetime64[ns]
user_state                   object
dtype: object

In [19]:
df_signup2 = df_signup.groupby('user_state').min().reset_index()
df_signup2

Unnamed: 0,user_state,user_id,user_sign_up_date
0,Alabama,5,2015-05-01
1,Alaska,106,2015-05-12
2,Arizona,105,2015-05-12
3,Arkansas,78,2015-05-08
4,California,39,2015-05-04
5,Colorado,166,2015-05-19
6,Connecticut,127,2015-05-16
7,Florida,41,2015-05-04
8,Georgia,16,2015-05-02
9,Idaho,165,2015-05-19


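This gives the earliest sign-up date for each state, but `min()` is applied to each column separately and returns only one `user_id` per state, so users who signed up on the same earliest date are lost. Next, I find all of the earliest users in each state.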

In [20]:
# Collect every user whose sign-up date equals the earliest date in their state,
# so that ties are kept.
early_frames = []

for state in states:
    state_signups = df_signup[df_signup.user_state == state]
    first_date = state_signups.user_sign_up_date.min()
    early_frames.append(state_signups[state_signups.user_sign_up_date == first_date])

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
df_earlyusers = pd.concat(early_frames)



In [21]:
df_earlyusers.reset_index()

Unnamed: 0,index,user_id,user_sign_up_date,user_state
0,362,134,2015-05-16,Nebraska
1,112,106,2015-05-12,Alaska
2,56,23,2015-05-02,Mississippi
3,430,26,2015-05-02,Mississippi
4,132,64,2015-05-08,South Carolina
5,8,174,2015-05-19,Rhode Island
6,33,165,2015-05-19,Idaho
7,242,135,2015-05-17,North Dakota
8,37,34,2015-05-04,Kentucky
9,100,5,2015-05-01,Alabama
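The per-state loop can also be written without iteration. This is a minimal sketch under the assumption that the table has the same columns as `df_signup`; the `toy_signup` values are made up. It broadcasts each state's earliest date back to every row with `transform('min')` and keeps the rows that match, ties included.

```python
import pandas as pd

# Toy sign-up table (made-up values); columns mirror `df_signup`.
toy_signup = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'user_sign_up_date': pd.to_datetime(
        ['2015-05-02', '2015-05-02', '2015-05-03', '2015-05-01']),
    'user_state': ['Mississippi', 'Mississippi', 'Mississippi', 'Alaska'],
})

# Earliest sign-up date within each state, aligned to the original rows.
first_date = toy_signup.groupby('user_state')['user_sign_up_date'].transform('min')

# Keep every user whose date equals the state's earliest date (ties included).
first_users = (toy_signup[toy_signup['user_sign_up_date'] == first_date]
               .sort_values(['user_state', 'user_id'])
               .reset_index(drop=True))

print(first_users)
```

In the toy data both Mississippi users who signed up on 2015-05-02 are kept, which mirrors the two Mississippi rows in the output above.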


## Question 4

Build a function that takes as an input any of the songs in the data and returns the most likely song to be listened to next. That is, if, for instance, a user is currently listening to “Eight Days A Week”, which song has the highest probability of being played right after it by the same user? This is going to be V1 of a song recommendation model.

In [22]:
# This is the table we need for song recommendation. 

df_songs = df.sort_values(['user_id','time_played']).reset_index(drop=True)
df_songs

Unnamed: 0,id,song_played,time_played,user_id,user_sign_up_date,user_state
0,IVJSWTXIDO,Yesterday,2015-06-05 14:30:22,1,2015-05-01,Oregon
1,FUIFDVMJMQ,While My Guitar Gently Weeps,2015-06-07 18:54:56,1,2015-05-01,Oregon
2,ICVFRNKBJT,The Long And Winding Road,2015-06-08 22:37:41,1,2015-05-01,Oregon
3,IHVXAYDOYU,Reprise / Day in the Life,2015-06-10 18:00:05,1,2015-05-01,Oregon
4,ECUPTRAWWI,I Feel Fine,2015-06-15 15:46:46,1,2015-05-01,Oregon
5,RFGQEHFCRI,Hello Goodbye,2015-06-19 14:54:57,1,2015-05-01,Oregon
6,MOZQZUAILW,Here Comes The Sun,2015-06-21 21:53:48,1,2015-05-01,Oregon
7,FBKNENUGZG,Can't Buy Me Love,2015-06-22 08:05:01,1,2015-05-01,Oregon
8,EUGLZCXXPH,Birthday,2015-06-25 12:32:22,1,2015-05-01,Oregon
9,VICPUPDPCV,Here Comes The Sun,2015-06-25 20:28:47,1,2015-05-01,Oregon


In [23]:
songs = list(df_songs.song_played.unique())
len(songs)

97

In [24]:
from collections import Counter


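def recommendation(song):
    """Return the song(s) most often played right after `song` by the same user."""
    # df_songs is sorted by user_id and time_played.
    # Column 1 is song_played and column 3 is user_id.
    follow = []
    for i in range(df_songs.shape[0] - 1):
        if df_songs.iloc[i, 1] == song and df_songs.iloc[i, 3] == df_songs.iloc[i + 1, 3]:
            follow.append(df_songs.iloc[i + 1, 1])

    counts = Counter(follow)
    if not counts:
        # The song was never followed by another play from the same user.
        return []

    # Return every song tied for the highest count.
    best = max(counts.values())
    return [s for s, c in counts.items() if c == best]

In [25]:
# randint excludes the upper bound, so this draws a valid index from 0 to len(songs) - 1.
i = np.random.randint(0, len(songs))
print("The song is: ", songs[i])
print("Recommendation(s):", recommendation(songs[i]))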

The song is:  Eight Days A Week
Recommendation(s): ['Come Together']


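Double check: list every play of one song, then inspect the row that immediately follows each play. A following row counts only if it has the same `user_id` (row 306 belongs to user 18 rather than user 17, so it is excluded).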

In [26]:
song = 'ANYTIME AT ALL'

df_songs[df_songs.song_played == song]

Unnamed: 0,id,song_played,time_played,user_id,user_sign_up_date,user_state
305,JBNWMWPHMV,ANYTIME AT ALL,2015-06-26 17:59:55,17,2015-05-02,Texas
698,DFJWEKYDGI,ANYTIME AT ALL,2015-06-25 16:35:33,35,2015-05-04,New Jersey
1522,THEFEJNPWZ,ANYTIME AT ALL,2015-06-05 11:58:43,76,2015-05-08,New York
1829,HMAAAJEPGH,ANYTIME AT ALL,2015-06-15 12:44:37,95,2015-05-11,Missouri
2957,JDKEKVRZUM,ANYTIME AT ALL,2015-06-03 12:42:06,152,2015-05-18,Texas
3059,VMTLHGGYSD,ANYTIME AT ALL,2015-06-26 19:22:26,158,2015-05-19,North Carolina
3161,GNQWNSYDOX,ANYTIME AT ALL,2015-06-21 18:00:36,163,2015-05-19,New York


In [27]:
df_songs.iloc[[306,699,1523,1830,2958,3060,3162]]

Unnamed: 0,id,song_played,time_played,user_id,user_sign_up_date,user_state
306,KNXKOTOAOV,Come Together,2015-06-01 14:46:13,18,2015-05-02,Maryland
699,SBSGSORLYB,IN MY LIFE,2015-06-27 21:11:42,35,2015-05-04,New Jersey
1523,ADNRXVARTA,Helter Skelter,2015-06-06 16:52:27,76,2015-05-08,New York
1830,EQTQLRSOBK,Can't Buy Me Love,2015-06-17 12:28:51,95,2015-05-11,Missouri
2958,GZCVTPEYKN,NOWHERE MAN,2015-06-05 12:25:52,152,2015-05-18,Texas
3060,TTPUVUNTUA,Here Comes The Sun,2015-06-27 17:04:16,158,2015-05-19,North Carolina
3162,POQHFGKCOO,Let It Be,2015-06-21 23:30:44,163,2015-05-19,New York
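Scanning the whole table on every call is slow for larger logs. As an alternative sketch (not the function used above), the transition counts can be built once with `groupby(...).shift(-1)` and then looked up per song. The helper names `build_next_song_counts` and `most_likely_next` and the `toy` data are hypothetical; only the column names follow `df`.

```python
import pandas as pd

# Toy play log (made-up values); columns mirror `df`.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'time_played': ['2015-06-01 10:00:00', '2015-06-01 10:05:00', '2015-06-01 10:10:00',
                    '2015-06-02 09:00:00', '2015-06-02 09:05:00', '2015-06-02 09:10:00'],
    'song_played': ['Hey Jude', 'Let It Be', 'Hey Jude',
                    'Hey Jude', 'Let It Be', 'Yesterday'],
})


def build_next_song_counts(plays):
    """Count (song, next_song) pairs within each user's play history."""
    ordered = plays.sort_values(['user_id', 'time_played'])
    # shift(-1) within each user: the last play of a user gets NaN, so no
    # transition ever crosses from one user to the next.
    next_song = ordered.groupby('user_id')['song_played'].shift(-1)
    pairs = pd.DataFrame({'song': ordered['song_played'], 'next_song': next_song}).dropna()
    return pairs.groupby(['song', 'next_song']).size()


def most_likely_next(counts, song):
    """Return all songs tied for the most frequent follower of `song`."""
    if song not in counts.index.get_level_values('song'):
        return []
    followers = counts.loc[song]
    return sorted(followers[followers == followers.max()].index)


counts = build_next_song_counts(toy)
print(most_likely_next(counts, 'Hey Jude'))
print(most_likely_next(counts, 'Let It Be'))
```

Dividing each count by the total for its `song` would turn the table into estimated transition probabilities, which is the quantity the question asks about.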


## Question 5

How would you set up a test to check whether your model works well?

We need to run an A/B test: randomly split users into two groups, a control group and an experiment group. The control group gets no recommendations, while the experiment group is recommended the next song by the model. After the test has run for some time, perform a one-tailed t-test on the average number of plays per hour.

$H_0$: the population average number of plays per hour is the same in both groups

$H_a$: the population average number of plays per hour is higher in the experiment group than in the control group
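The test itself can be sketched as follows. This is only an illustration on simulated numbers, not real results: the group sizes, means and the `ALPHA` level are assumptions, and each observation is taken to be one user's average plays per hour. It uses Welch's variant of the t-test (`equal_var=False`), which does not assume equal variances, and the `alternative='greater'` option, which requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Simulated per-user 'average plays per hour' (made-up numbers, for illustration only).
rng = np.random.default_rng(0)
control = rng.normal(loc=5.0, scale=1.0, size=200)
experiment = rng.normal(loc=6.0, scale=1.0, size=200)

# One-tailed Welch t-test. H_a: the experiment mean is greater than the control mean.
t_stat, p_value = stats.ttest_ind(
    experiment, control, equal_var=False, alternative='greater'
)

# Assumed significance level.
ALPHA = 0.05
reject_h0 = p_value < ALPHA

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Randomizing at the user level and aggregating to one number per user keeps the observations independent, which the t-test assumes.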