Company XYZ is a very early-stage startup that lets people stream music from their mobile devices for free. Right now its catalog contains only songs by the Beatles, but the company plans to expand soon. All of its data still lives in JSON files. The company wants some basic information about its users, plus a very preliminary song recommendation model to increase user engagement. Working with JSON files is an important skill: an early-stage startup may not have a proper database yet, so all of its data may be in JSON, and third-party data is often delivered as JSON as well.



Goal: increase user engagement


The company CEO asked you very specific questions:

1) What are the top 3 and the bottom 3 states in terms of number of users?
2) What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.
3) The CEO wants to send a gift to the first user who signed up in each state. That is, the first user who signed up from California, from Oregon, etc. Can you give him a list of those users?
4) Build a function that takes as input any of the songs in the data and returns the song most likely to be listened to next. That is, if, for instance, a user is currently listening to "Eight Days A Week", which song has the highest probability of being played right after it by the same user? This is going to be v1 of a song recommendation model.
5) How would you set up a test to check whether your model works well and is improving engagement?

In [1]:
import json
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

In [80]:
# Read the play log from the JSON file
with open("data/song.json") as f:
    data = json.load(f)

df = pd.DataFrame(data)
df.set_index("id", inplace=True)
# Parse the timestamp columns into datetimes
df["time_played"] = pd.to_datetime(df.time_played)
df["user_sign_up_date"] = pd.to_datetime(df.user_sign_up_date)
df.head()

#df.to_csv("data/song_dataframe.csv")

id,song_played,time_played,user_id,user_sign_up_date,user_state
GOQMMKSQQH,Hey Jude,2015-06-11 21:51:35,122,2015-05-16,Louisiana
HWKKBQKNWI,We Can Work It Out,2015-06-06 16:49:19,3,2015-05-01,Ohio
DKQSXVNJDH,Back In the U.S.S.R.,2015-06-14 02:11:29,35,2015-05-04,New Jersey
HLHRIDQTUW,P.s. I Love You,2015-06-08 12:26:10,126,2015-05-16,Illinois
SUKJCSBCYW,Sgt. Pepper's Lonely Hearts Club Band,2015-06-28 14:57:00,6,2015-05-01,New Jersey


## Question 1

1) What are the top 3 and the bottom 3 states in terms of number of users? 

In [42]:
# Count distinct users per state, sorted from fewest to most users
user_count = df.groupby("user_state").user_id.nunique().sort_values(ascending=True)

# The states at the bottom tie at a single user, so print 9 rows to show the tie
print(user_count.head(9))
print(user_count.tail(3))
 

user_state
Arizona         1
New Mexico      1
Connecticut     1
Idaho           1
Nebraska        1
Rhode Island    1
Iowa            1
Kansas          1
North Dakota    1
Name: user_id, dtype: int64
user_state
Texas         15
California    21
New York      23
Name: user_id, dtype: int64


In [51]:
len(np.unique(df.user_id))  # 196 unique users
len(df)                     # 4000 records
df.shape                    # 4000 rows and 5 columns


(4000, 5)

## Question 2
2) What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.

* User engagement definition
A first idea is the number of songs played per user, averaged by state, and then picking the top and bottom 3 states. However, the number of songs a user has played depends on how long that user has been on the platform, so the engagement metric should be a rate: songs played per unit of time.
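The definition above can also be normalized by the number of users, so that large states do not rank high simply because they have more listeners. The following is a minimal sketch under stated assumptions, not the notebook's method: the toy `plays` frame, the function name `plays_per_user_per_day`, and the fixed observation window `window_days` are all hypothetical.

```python
import pandas as pd

# Toy play log using the notebook's column names; the values are made up.
plays = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3, 3],
    "user_state": ["Ohio", "Ohio", "Ohio", "Ohio", "Texas", "Texas", "Texas"],
})

def plays_per_user_per_day(plays, window_days):
    """Average plays per user per day in each state.

    Assumes every user was observed for the same `window_days`.
    """
    per_state = plays.groupby("user_state").agg(
        total_played=("user_id", "count"),   # one row per play
        n_users=("user_id", "nunique"),      # distinct listeners in the state
    )
    per_state["plays_per_user_per_day"] = (
        per_state["total_played"] / per_state["n_users"] / window_days
    )
    return per_state.sort_values("plays_per_user_per_day", ascending=False)

engagement = plays_per_user_per_day(plays, window_days=2)
print(engagement)
```

With this normalization a state with one very active user can outrank a state with several casual users, which is closer to "how much do users in this state use the product" than a raw play count.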

In [61]:
def count_by_state(df):
    """Summarize plays for one state; all rows in df come from the same state."""
    total_played = df.shape[0]
    first_play_dt = df.time_played.min()
    last_play_dt = df.time_played.max()
    duration = last_play_dt - first_play_dt
    # total_seconds() / 60 gives minutes (dividing by 3600 would give hours)
    duration_minutes = duration.total_seconds() / 60.0
    return pd.Series([first_play_dt, last_play_dt, duration, duration_minutes, total_played],
                     index=["first_play_dt", "last_play_dt", "duration", "duration_minutes", "total_played"])

In [71]:
counts_by_states = df.groupby("user_state").apply(count_by_state)

In [72]:
# Plays per minute in each state; multiply by 60 for plays per hour
counts_by_states["plays_per_minute"] = counts_by_states.total_played / counts_by_states.duration_minutes
counts_by_states.sort_values(by="plays_per_minute", ascending=False, inplace=True)
counts_by_states

user_state,first_play_dt,last_play_dt,duration,duration_minutes,total_played,plays_per_minute
New York,2015-06-01 06:14:45,2015-06-28 21:36:40,27 days 15:21:55,39801.916667,469,0.011783
California,2015-06-01 06:33:03,2015-06-28 20:35:50,27 days 14:02:47,39722.783333,425,0.010699
Texas,2015-06-01 06:09:04,2015-06-28 20:28:35,27 days 14:19:31,39739.516667,230,0.005788
Ohio,2015-06-01 05:02:54,2015-06-28 22:22:25,27 days 17:19:31,39919.516667,209,0.005236
Florida,2015-06-01 09:29:39,2015-06-28 22:59:27,27 days 13:29:48,39689.8,180,0.004535
Pennsylvania,2015-06-01 05:19:08,2015-06-28 21:44:20,27 days 16:25:12,39865.2,179,0.00449
North Carolina,2015-06-01 12:40:31,2015-06-28 23:26:38,27 days 10:46:07,39526.116667,154,0.003896
Illinois,2015-06-01 12:15:13,2015-06-28 18:07:10,27 days 05:51:57,39231.95,149,0.003798
Georgia,2015-06-01 06:41:36,2015-06-28 21:37:34,27 days 14:55:58,39775.966667,135,0.003394
Missouri,2015-06-01 05:36:55,2015-06-28 18:32:34,27 days 12:55:39,39655.65,127,0.003203


## Question 3
3) The CEO wants to send a gift to the first user who signed up in each state. That is, the first user who signed up from California, from Oregon, etc. Can you give him a list of those users?

In [84]:
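# SQL equivalent: select *, row_number() over (partition by user_state order by user_sign_up_date)
# and keep the rows where the row number is 1
def min_signup(df):
    """Return the row with the earliest sign-up date; all rows in df come from the same state."""
    idx = df.user_sign_up_date.idxmin()  # idxmin returns the index label, which .loc expects
    return df.loc[idx]

first_sign_up = df.groupby("user_state").apply(min_signup)
first_sign_up.loc[:, ["user_id", "user_sign_up_date"]]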

user_state,user_id,user_sign_up_date
Alabama,5,2015-05-01
Alaska,106,2015-05-12
Arizona,105,2015-05-12
Arkansas,78,2015-05-08
California,39,2015-05-04
Colorado,173,2015-05-19
Connecticut,127,2015-05-16
Florida,41,2015-05-04
Georgia,20,2015-05-02
Idaho,165,2015-05-19
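
The same "first row per group" result can be obtained without `apply`, by sorting and keeping the first row per state. This is a sketch on made-up data, not the notebook's output; the `users` frame is hypothetical, and breaking same-day ties by the lowest `user_id` is an assumption, since sign-up dates appear to have only day resolution.

```python
import pandas as pd

# Toy sign-up data using the notebook's column names; the values are made up.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "user_state": ["California", "California", "Alabama", "Alabama"],
    "user_sign_up_date": pd.to_datetime(
        ["2015-05-04", "2015-05-02", "2015-05-01", "2015-05-03"]
    ),
})

first_per_state = (
    users
    # Earliest sign-up first within each state; user_id breaks same-day ties.
    .sort_values(["user_state", "user_sign_up_date", "user_id"])
    # Keep only the first (earliest) row of each state.
    .drop_duplicates("user_state", keep="first")
    .set_index("user_state")[["user_id", "user_sign_up_date"]]
)
print(first_per_state)
```

Making the tie-break explicit matters here because several users can share the earliest date in a state, and the gift should go to a reproducible choice.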


## Question 4:
4) Build a function that takes as an input any of the songs in the data and returns the most likely song to be listened next. That is, if, for instance, a user is currently listening to "Eight Days A Week", which song has the highest probability of being played right after it by the same user? This is going to be v1 of a song recommendation model. 

In [101]:
# Equivalent to SQL LEAD(song_played) OVER (PARTITION BY user_id ORDER BY time_played):
# for each play, find the song that the same user played next.
df_sorted = df.sort_values(['user_id', 'time_played'], ascending=True).reset_index()

# Shift song_played up by one row within each user.
# A user's last play has no successor, so it is marked "NA".
df_sorted['nextsong_played'] = df_sorted.groupby('user_id')['song_played'].shift(-1).fillna("NA")


### Question 4 method 1: recommend the song that most often follows the current song in the existing records

In [189]:
def SongRecommend(song):
    """Return a message naming the song most often played right after `song` by the same user."""
    # Keep only plays of `song` that are followed by another play from the same user
    sample = df_sorted.loc[(df_sorted["song_played"] == song) & (df_sorted["nextsong_played"] != "NA"), :]
    count = len(sample)
    if count > 0:
        # Empirical probability of each next song, given the current song
        pivot_by_nextsong = sample.groupby("nextsong_played").size()
        SongMetrics = pd.DataFrame(pivot_by_nextsong).reset_index().rename(columns={0: "Frequency"})
        SongMetrics["Frequency"] = SongMetrics["Frequency"] / count
        top = SongMetrics.sort_values("Frequency", ascending=False).head(1)
        song = top.iloc[0, 0]
        score = top.iloc[0, 1]
    else:
        song = "No Recommendation"
        score = 0.0
    Recom = "Here recommend \"{0}\" as next song to be played with probability {1:f}".format(song, score)
    return Recom

In [190]:
SongRecommend("Eight Days A Week")


'Here recommend "Come Together" as next song to be played with probability 0.137931'

With method 4.1, I would recommend "Come Together" after "Eight Days A Week".
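
Method 4.1 amounts to estimating first-order transition probabilities between songs. The sketch below shows the same idea with only the standard library, building all transition counts once so that lookups are cheap. The play log and the function names `build_transitions` and `most_likely_next` are hypothetical, and the integer timestamps are stand-ins for real datetimes.

```python
from collections import Counter, defaultdict

# Hypothetical play log as (user_id, timestamp, song); the values are made up.
plays = [
    (1, 1, "Eight Days A Week"),
    (1, 2, "Come Together"),
    (1, 3, "Eight Days A Week"),
    (1, 4, "Hey Jude"),
    (2, 1, "Eight Days A Week"),
    (2, 2, "Come Together"),
    (2, 3, "Hey Jude"),
]

def build_transitions(plays):
    """Count song -> next-song transitions within each user's history."""
    transitions = defaultdict(Counter)
    ordered = sorted(plays)  # tuples sort by user_id first, then timestamp
    for (user_a, _, song_a), (user_b, _, song_b) in zip(ordered, ordered[1:]):
        if user_a == user_b:  # never link one user's last play to the next user
            transitions[song_a][song_b] += 1
    return transitions

def most_likely_next(transitions, song):
    """Return (next_song, estimated probability), or (None, 0.0) if unseen."""
    counts = transitions.get(song)
    if not counts:
        return None, 0.0
    next_song, n = counts.most_common(1)[0]
    return next_song, n / sum(counts.values())

transitions = build_transitions(plays)
print(most_likely_next(transitions, "Eight Days A Week"))
```

In this toy log "Eight Days A Week" is followed twice by "Come Together" and once by "Hey Jude", so the estimate is 2/3. A song that is never followed by anything returns `(None, 0.0)`, mirroring the "No Recommendation" branch.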

### Question 4 method 2
Use item-based collaborative filtering: songs played by the same users are considered similar.

In [237]:
# Step 1: build the Song-User matrix
song_user = df.groupby(['song_played', 'user_id'])['song_played'].count().unstack(fill_value=0)
song_user = song_user.astype(int)

song_user.head(5)

song_played,1,2,3,4,5,6,7,8,9,10,...,191,192,193,194,195,196,197,198,199,200
A Day In The Life,0,0,1,3,0,2,0,0,0,0,...,0,0,3,3,0,2,0,0,2,0
A Hard Day's Night,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
A Saturday Club Xmas/Crimble Medley,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ANYTIME AT ALL,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Across The Universe,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [245]:
# Step 2: build song-song similarity matrix
song_user_norm = normalize(song_user, axis=1)  # L2-normalize each song's row of play counts
# After row normalization, the dot product below is the cosine similarity between songs;
# without it, songs with many plays would dominate the similarity scores.
similarity = np.dot(song_user_norm, song_user_norm.T)  # song-song similarity matrix
similarity_df = pd.DataFrame(similarity, index=song_user.index, columns=song_user.index)

similarity_df.head()

song_played,A Day In The Life,A Hard Day's Night,A Saturday Club Xmas/Crimble Medley,ANYTIME AT ALL,Across The Universe,All My Loving,All You Need Is Love,And Your Bird Can Sing,BAD BOY,BALLAD OF JOHN AND YOKO,...,We Can Work It Out,When I'm 64,While My Guitar Gently Weeps,Wild Honey Pie,With a Little Help From My Friends,YOUR MOTHER SHOULD KNOW,Yellow Submarine,Yesterday,You Never Give Me Your Money,You're Going To Lose That Girl
A Day In The Life,1.0,0.235702,0.074536,0.119523,0.212132,0.355023,0.329404,0.152145,0.210819,0.172133,...,0.464938,0.030429,0.508964,0.223607,0.359092,0.037268,0.318198,0.35322,0.087841,0.0
A Hard Day's Night,0.235702,1.0,0.0,0.0,0.1,0.136931,0.111803,0.0,0.0,0.091287,...,0.259548,0.129099,0.210099,0.0,0.0,0.0,0.05,0.195468,0.074536,0.0
A Saturday Club Xmas/Crimble Medley,0.074536,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.109435,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0
ANYTIME AT ALL,0.119523,0.0,0.0,1.0,0.0,0.154303,0.094491,0.109109,0.0,0.0,...,0.116991,0.0,0.138107,0.089087,0.183942,0.0,0.0,0.146845,0.0,0.0
Across The Universe,0.212132,0.1,0.0,0.0,1.0,0.091287,0.0,0.0,0.0,0.0,...,0.138426,0.0,0.116722,0.0,0.0,0.0,0.0,0.043437,0.0,0.0
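
To see why the row normalization in step 2 matters, the sketch below compares raw dot products with cosine similarity on a tiny made-up song-user matrix. It uses plain NumPy in place of `sklearn.preprocessing.normalize`; dividing each row by its L2 norm is the same operation as `normalize(..., axis=1)` with its default norm. The matrix values are hypothetical.

```python
import numpy as np

# Rows are songs, columns are users; entries are play counts (made-up values).
counts = np.array([
    [10.0, 0.0, 10.0],  # song A
    [1.0, 0.0, 1.0],    # song B: same listeners as A, but far fewer plays
    [8.0, 6.0, 0.0],    # song C: heavily played, mostly by a different audience
])

# Raw dot products reward songs that simply have many plays.
raw = counts @ counts.T

# Cosine similarity: scale every row to unit length before the dot product.
unit = counts / np.linalg.norm(counts, axis=1, keepdims=True)
cosine = unit @ unit.T

print("raw scores for song A:   ", raw[0])
print("cosine scores for song A:", cosine[0].round(3))
```

On the raw scores, song C looks closer to A than B does, only because C has more plays. After normalization, B (identical audience) scores 1.0 and C drops to about 0.57.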


In [232]:
# Return the k songs most similar to the given song
def get_TopK(song, similarity, k=1):
    scores = similarity.loc[song].drop(song)  # exclude the song itself
    df = scores.sort_values(ascending=False).head(k).reset_index()
    df = df.rename(columns={"song_played": "Recommend Song", song: "Similarity Score"})
    return df

In [244]:
#get_TopK("A Hard Day's Night",similarity_df,3)
get_TopK("Eight Days A Week",similarity_df,3)

,Recommend Song,Similarity Score
0,Revolution,58
1,Get Back,52
2,Let It Be,45


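With method 4.2, the output shown above ranks "Revolution" first after "Eight Days A Week". The scores in that output are integers greater than 1, so they cannot be cosine similarities; they are consistent with dot products on the un-normalized song-user matrix, and the ranking may change once the normalization from step 2 is applied.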

## Question 5:
5) How would you set up a test to check whether your model works well and is improving engagement? 

Experiment design:
Perform an A/B test.
Metric: user engagement as proposed in question 2, computed per user so that each user contributes one observation.
Experiment setup: run the experiment in New York or California, the states with the most users. Randomly assign 50% of users to the default experience (control group) and 50% of users to the recommendation model (experiment group).
Test: run the experiment for a fixed period, then conduct a one-tailed t-test to check whether user engagement in the experiment group is higher than in the control group.
Hypotheses:
    H0: user engagement in the experiment group is lower than or equal to that in the control group.
    H1: user engagement in the experiment group is higher than in the control group.
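
The one-sided comparison above can be sketched with the standard library alone. Note that this swaps in a permutation test as a stand-in for the t-test named above, since the standard library has no t-distribution; with SciPy available, a one-sided two-sample t-test would be the direct equivalent. The per-user engagement values and the function name `one_sided_permutation_test` are hypothetical.

```python
import random
from statistics import mean

def one_sided_permutation_test(control, treatment, n_permutations=5000, seed=0):
    """Estimate the p-value for H1: mean(treatment) > mean(control).

    Under H0 the group labels are exchangeable, so we shuffle them and count
    how often the shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_treat = len(treatment)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)

# Made-up per-user engagement (for example, plays per day) in each group.
control = [3, 4, 2, 5, 3, 4, 3, 2]
treatment = [6, 7, 5, 8, 6, 7, 6, 5]

observed, p_value = one_sided_permutation_test(control, treatment)
print(f"observed lift = {observed}, one-sided p-value = {p_value:.4f}")
```

Rejecting H0 when the p-value falls below a significance level chosen before the experiment (commonly 0.05) supports the claim that the recommendation model increases engagement.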