# 2021: Week 8 - Karaoke Data

February 24, 2021

Challenge by: Jenny Martin

Recently I was helping a colleague prep some karaoke data and I thought it was too fun a subject to resist turning into a Preppin' Data challenge! I had a lot of fun creating the dataset and imagining the type of person who may sing one song and then not bother with the rest of the session. 

We will need to make some assumptions as part of our data prep:

- Customers often don't sing the entire song
- Sessions last 60 minutes
- Customers arrive a maximum of 10 minutes before their sessions begin

I will warn you that this challenge may be a little on the trickier end of the spectrum!

## Inputs

1. Karaoke song choices and what time they began 

<img src='https://1.bp.blogspot.com/-OKoZi-s2jrI/X-Iam5_rHYI/AAAAAAAAAqg/XUkttbXMfNEfez_Q2lPotOCVSiqGUPrPACLcBGAsYHQ/w400-h223/Karaoke%2BInput.png'>

2. Customer entry times 

<img src='https://1.bp.blogspot.com/-fFeRcrRKbvE/X-IasecCtrI/AAAAAAAAAqk/X0Y9RMIJBWAoHK74_QoC3YDaiVMdnodTACLcBGAsYHQ/s0/Customer%2BEntry.png'>

## Requirements

- [Input the data](https://drive.google.com/file/d/1ACDlDqxsWUxfojo_ZAcEe7GuYXw4rfYI/view?usp=drivesdk)
- Calculate the time between songs
- If the time between songs is greater than (or equal to) 59 minutes, flag this as being a new session
- Create a session number field
- Number the songs in order for each session
- Match the customers to the correct session, based on their entry time
    - The Customer ID field should be null if there were no customers who arrived 10 minutes (or less) before the start of the session
- [Output the data](https://drive.google.com/file/d/1RCADhTi7YruO3-wzit2PQoTj_GfqVcOl/view?usp=sharing)

## Output

<img src='https://1.bp.blogspot.com/-KZrVSOQMowk/YDe9W8-t97I/AAAAAAAAAww/rziET4GoHtk3JehmSfcmu_my7KPSNM8pgCLcBGAsYHQ/w640-h184/Karaoke%2BOutput2.png'>

- 6 fields
    - Session #
    - Customer ID
    - Song Order
    - Date
    - Artist
    - Song
- 988 rows (989 including headers)

In [602]:
import pandas as pd
import numpy as np
import datetime as dt

# Load data
karaoke_choices = pd.read_excel('Copy of Karaoke Dataset.xlsx', engine='openpyxl', sheet_name = 'Karaoke Choices')
karaoke_choices

Unnamed: 0,Date,Artist,Song
0,2020-12-22 13:59:59.971,Wham!,Last Christmas
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way
...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary


In [603]:
customers = pd.read_excel('Copy of Karaoke Dataset.xlsx', engine='openpyxl', sheet_name = 'Customers')
customers

Unnamed: 0,Customer ID,Entry Time
0,3fdc46,2020-12-27 06:55:00
1,3fdc46,2020-12-31 03:55:00
2,3fdc46,2021-01-02 08:55:00
3,3fdc46,2021-01-09 05:55:00
4,3fdc46,2021-02-01 06:55:00
...,...,...
296,bdc39c,2021-01-28 10:53:00
297,8d850e,2020-12-28 02:51:00
298,8d850e,2021-01-12 22:51:00
299,8d850e,2021-01-26 11:51:00


In [604]:
# Calculate the time between songs
# shift(-1) places the next song's start time on the current row (NaT on the last row)
karaoke_choices['Next song'] = karaoke_choices['Date'].shift(-1)
karaoke_choices['Time between songs'] = karaoke_choices['Next song'] - karaoke_choices['Date']
karaoke_choices

Unnamed: 0,Date,Artist,Song,Next song,Time between songs
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000,0 days 01:00:00.029000
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010,0 days 00:02:00.010000
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019,0 days 00:02:00.009000
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000,0 days 02:55:59.981000
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029,0 days 01:00:00.029000
...,...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038,0 days 00:02:00.009000
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010,0 days 00:02:59.972000
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981,0 days 00:02:59.971000
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990,0 days 00:02:00.009000
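
The gap calculation can be tried on a tiny frame first. This is a minimal sketch on made-up timestamps (the `songs` frame and its values are hypothetical, not taken from the workbook), assuming the rows are sorted by start time before shifting.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the 'Karaoke Choices' sheet
songs = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-12-22 14:00:00",
        "2020-12-22 15:00:00",
        "2020-12-22 15:02:00",
        "2020-12-22 18:00:00",
    ])
}).sort_values("Date", ignore_index=True)

# shift(-1) pulls the next row's start time onto the current row;
# the last row has no successor, so it gets NaT
songs["Next song"] = songs["Date"].shift(-1)
songs["Time between songs"] = songs["Next song"] - songs["Date"]
print(songs)
```

The last row ends up with a missing gap, which is worth remembering when the gap is later compared against the 59-minute threshold.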


In [605]:
# If the time between songs is greater than (or equal to) 59 minutes, flag this as being a new session
# Note: the flag sits on the last song of a session, because the gap is measured to the next song
karaoke_choices['Time between songs in minutes'] = karaoke_choices['Time between songs'].dt.total_seconds()/60
karaoke_choices['New session'] = (karaoke_choices['Time between songs in minutes'] >= 59).astype(int)
karaoke_choices

Unnamed: 0,Date,Artist,Song,Next song,Time between songs,Time between songs in minutes,New session
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000,0 days 01:00:00.029000,60.000483,1
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010,0 days 00:02:00.010000,2.000167,0
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019,0 days 00:02:00.009000,2.000150,0
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000,0 days 02:55:59.981000,175.999683,1
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029,0 days 01:00:00.029000,60.000483,1
...,...,...,...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038,0 days 00:02:00.009000,2.000150,0
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010,0 days 00:02:59.972000,2.999533,0
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981,0 days 00:02:59.971000,2.999517,0
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990,0 days 00:02:00.009000,2.000150,0
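
The same flag can be computed without converting to minutes, by comparing the gaps to a `pd.Timedelta` directly. A small sketch, assuming a hypothetical `gaps` series whose values mirror the first few gaps shown above plus a missing value for the final song:

```python
import pandas as pd

# Hypothetical gaps; None stands in for the last song, which has no successor
gaps = pd.Series(pd.to_timedelta([
    "0 days 01:00:00.029",
    "0 days 00:02:00.010",
    "0 days 02:55:59.981",
    None,
]))

# Comparisons against a missing gap (NaT) evaluate to False, so the last song is flagged 0
new_session = (gaps >= pd.Timedelta(minutes=59)).astype(int)
print(new_session.tolist())
```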


In [606]:
# Create a session number field
# 'New session' is flagged on the last song of each session, so shift the flag down
# one row before taking a running total; sessions are numbered from 1
karaoke_choices['Session #'] = karaoke_choices['New session'].shift(fill_value=0).cumsum() + 1
karaoke_choices

Unnamed: 0,Date,Artist,Song,Next song,Time between songs,Time between songs in minutes,New session,Session #
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000,0 days 01:00:00.029000,60.000483,1,1
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010,0 days 00:02:00.010000,2.000167,0,2
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019,0 days 00:02:00.009000,2.000150,0,2
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000,0 days 02:55:59.981000,175.999683,1,2
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029,0 days 01:00:00.029000,60.000483,1,3
...,...,...,...,...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038,0 days 00:02:00.009000,2.000150,0,298
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010,0 days 00:02:59.972000,2.999533,0,298
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981,0 days 00:02:59.971000,2.999517,0,298
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990,0 days 00:02:00.009000,2.000150,0,298
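
To see why a shifted running total gives the same numbering as incrementing a counter row by row, here is a short sketch on a hypothetical flag series (`new_session` below is toy data, not the real column):

```python
import pandas as pd

# Hypothetical flags: 1 marks the last song of a session
new_session = pd.Series([1, 0, 0, 1, 1, 0])

# Loop version: record the current session, then move on if this row closes it
session, looped = 1, []
for flag in new_session:
    looped.append(session)
    session += flag

# Vectorised version: shift the flags down one row, then take a running total
session_number = new_session.shift(fill_value=0).cumsum() + 1
print(looped, session_number.tolist())
```

Both approaches rely on the rows already being in chronological order.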


In [607]:
# Number the songs in order for each session
karaoke_choices["Song Order"] = karaoke_choices.groupby("Session #")["Date"].rank("dense", ascending=True)
karaoke_choices["Song Order"] = karaoke_choices["Song Order"].astype(int)
karaoke_choices

Unnamed: 0,Date,Artist,Song,Next song,Time between songs,Time between songs in minutes,New session,Session #,Song Order
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000,0 days 01:00:00.029000,60.000483,1,1,1
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010,0 days 00:02:00.010000,2.000167,0,2,1
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019,0 days 00:02:00.009000,2.000150,0,2,2
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000,0 days 02:55:59.981000,175.999683,1,2,3
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029,0 days 01:00:00.029000,60.000483,1,3,1
...,...,...,...,...,...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038,0 days 00:02:00.009000,2.000150,0,298,4
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010,0 days 00:02:59.972000,2.999533,0,298,5
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981,0 days 00:02:59.971000,2.999517,0,298,6
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990,0 days 00:02:00.009000,2.000150,0,298,7
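
An alternative to a dense rank is `groupby(...).cumcount()`, which numbers rows by position within each group and returns integers directly. This is a sketch on toy data (the `sessions` frame is hypothetical); unlike a dense rank, it would give two songs with identical timestamps different numbers.

```python
import pandas as pd

# Toy data (hypothetical): three sessions, already assigned a session number
sessions = pd.DataFrame({
    "Session #": [1, 2, 2, 2, 3],
    "Date": pd.to_datetime([
        "2020-12-22 14:00:00",
        "2020-12-22 15:00:00",
        "2020-12-22 15:02:00",
        "2020-12-22 15:04:00",
        "2020-12-22 18:00:00",
    ]),
}).sort_values("Date", ignore_index=True)

# cumcount() starts at 0 within each session, so add 1 for a 1-based song order
sessions["Song Order"] = sessions.groupby("Session #").cumcount() + 1
print(sessions)
```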


In [608]:
# Match the customers to the correct session, based on their entry time

# create session start time column
session_start_df = karaoke_choices[["Session #","Date"]]
session_start_df = session_start_df.groupby("Session #").agg(Date=('Date','min')).reset_index()
session_start_df

Unnamed: 0,Session #,Date
0,1,2020-12-22 13:59:59.971
1,2,2020-12-22 15:00:00.000
2,3,2020-12-22 18:00:00.000
3,4,2020-12-22 19:00:00.029
4,5,2020-12-22 22:59:59.971
...,...,...
293,294,2021-02-01 13:00:00.029
294,295,2021-02-01 15:00:00.000
295,296,2021-02-01 16:59:59.971
296,297,2021-02-01 21:00:00.000


In [609]:
# rename column and merge to dataset
session_start_df.rename(columns={'Date':'Session Start'}, inplace=True)
karaoke_choices = karaoke_choices.merge(session_start_df, on='Session #', how='inner')
karaoke_choices

Unnamed: 0,Date,Artist,Song,Next song,Time between songs,Time between songs in minutes,New session,Session #,Song Order,Session Start
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000,0 days 01:00:00.029000,60.000483,1,1,1,2020-12-22 13:59:59.971
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010,0 days 00:02:00.010000,2.000167,0,2,1,2020-12-22 15:00:00.000
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019,0 days 00:02:00.009000,2.000150,0,2,2,2020-12-22 15:00:00.000
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000,0 days 02:55:59.981000,175.999683,1,2,3,2020-12-22 15:00:00.000
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029,0 days 01:00:00.029000,60.000483,1,3,1,2020-12-22 18:00:00.000
...,...,...,...,...,...,...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038,0 days 00:02:00.009000,2.000150,0,298,4,2021-02-01 22:00:00.029
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010,0 days 00:02:59.972000,2.999533,0,298,5,2021-02-01 22:00:00.029
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981,0 days 00:02:59.971000,2.999517,0,298,6,2021-02-01 22:00:00.029
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990,0 days 00:02:00.009000,2.000150,0,298,7,2021-02-01 22:00:00.029
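
The aggregate-then-merge steps above can also be collapsed into a single `transform`, which broadcasts each session's earliest timestamp back onto every row of that session. A minimal sketch on a hypothetical `songs` frame:

```python
import pandas as pd

# Toy data (hypothetical)
songs = pd.DataFrame({
    "Session #": [1, 2, 2, 2, 3],
    "Date": pd.to_datetime([
        "2020-12-22 14:00:00",
        "2020-12-22 15:00:00",
        "2020-12-22 15:02:00",
        "2020-12-22 15:04:00",
        "2020-12-22 18:00:00",
    ]),
})

# transform("min") returns one value per original row, aligned to the original index
songs["Session Start"] = songs.groupby("Session #")["Date"].transform("min")
print(songs)
```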


In [610]:
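# Customers arrive a maximum of 10 minutes before their session begins.
# Rounding both timestamps to the nearest 20 minutes puts any two times that lie
# within 10 minutes of the same 20-minute mark into the same interval.
threshold = 20  # interval width in minutes (10 minutes either side of each mark)
threshold_ns = threshold * 60 * 1e9  # interval width in nanoseconds
threshold_ns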

1200000000000.0

In [611]:
# compute "interval" to which each session belongs
karaoke_choices['interval'] = pd.to_datetime(np.round(karaoke_choices['Session Start'].astype(np.int64) / threshold_ns) * threshold_ns)
karaoke_choices

Unnamed: 0,Date,Artist,Song,Next song,Time between songs,Time between songs in minutes,New session,Session #,Song Order,Session Start,interval
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000,0 days 01:00:00.029000,60.000483,1,1,1,2020-12-22 13:59:59.971,2020-12-22 14:00:00
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010,0 days 00:02:00.010000,2.000167,0,2,1,2020-12-22 15:00:00.000,2020-12-22 15:00:00
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019,0 days 00:02:00.009000,2.000150,0,2,2,2020-12-22 15:00:00.000,2020-12-22 15:00:00
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000,0 days 02:55:59.981000,175.999683,1,2,3,2020-12-22 15:00:00.000,2020-12-22 15:00:00
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029,0 days 01:00:00.029000,60.000483,1,3,1,2020-12-22 18:00:00.000,2020-12-22 18:00:00
...,...,...,...,...,...,...,...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038,0 days 00:02:00.009000,2.000150,0,298,4,2021-02-01 22:00:00.029,2021-02-01 22:00:00
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010,0 days 00:02:59.972000,2.999533,0,298,5,2021-02-01 22:00:00.029,2021-02-01 22:00:00
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981,0 days 00:02:59.971000,2.999517,0,298,6,2021-02-01 22:00:00.029,2021-02-01 22:00:00
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990,0 days 00:02:00.009000,2.000150,0,298,7,2021-02-01 22:00:00.029,2021-02-01 22:00:00


In [612]:
customers['interval'] = pd.to_datetime(np.round(customers['Entry Time'].astype(np.int64) / threshold_ns) * threshold_ns)
customers

Unnamed: 0,Customer ID,Entry Time,interval
0,3fdc46,2020-12-27 06:55:00,2020-12-27 07:00:00
1,3fdc46,2020-12-31 03:55:00,2020-12-31 04:00:00
2,3fdc46,2021-01-02 08:55:00,2021-01-02 09:00:00
3,3fdc46,2021-01-09 05:55:00,2021-01-09 06:00:00
4,3fdc46,2021-02-01 06:55:00,2021-02-01 07:00:00
...,...,...,...
296,bdc39c,2021-01-28 10:53:00,2021-01-28 11:00:00
297,8d850e,2020-12-28 02:51:00,2020-12-28 03:00:00
298,8d850e,2021-01-12 22:51:00,2021-01-12 23:00:00
299,8d850e,2021-01-26 11:51:00,2021-01-26 12:00:00
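
The nanosecond arithmetic above has a built-in counterpart in `Series.dt.round`. The sketch below uses made-up timestamps to show what a 20-minute rounding window does and does not capture. One thing it suggests: the window is symmetric, so an arrival up to 10 minutes *after* a session start would land in the same interval too.

```python
import pandas as pd

# Hypothetical timestamps, all in the same string format
times = pd.Series(pd.to_datetime([
    "2020-12-22 13:59:59.971",  # session start just before the hour
    "2020-12-22 13:54:00.000",  # customer 6 minutes early
    "2020-12-22 14:06:00.000",  # customer 6 minutes late
    "2020-12-22 13:49:00.000",  # customer 11 minutes early
]))

# Round each timestamp to the nearest 20-minute mark
intervals = times.dt.round("20min")
print(intervals)
```

Here the first three values share the 14:00 interval, while the customer who arrived 11 minutes early falls into the 13:40 interval and would not be matched.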


In [613]:
# merge datasets on interval
# left join so that songs from sessions with no matching customer are kept, with a null Customer ID
output = karaoke_choices.merge(customers, on='interval', how='left')
output

Unnamed: 0,Date,Artist,Song,Next song,Time between songs,Time between songs in minutes,New session,Session #,Song Order,Session Start,interval,Customer ID,Entry Time
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000,0 days 01:00:00.029000,60.000483,1,1,1,2020-12-22 13:59:59.971,2020-12-22 14:00:00,cd2834,2020-12-22 13:54:00
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010,0 days 00:02:00.010000,2.000167,0,2,1,2020-12-22 15:00:00.000,2020-12-22 15:00:00,2de3d7,2020-12-22 14:55:00
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019,0 days 00:02:00.009000,2.000150,0,2,2,2020-12-22 15:00:00.000,2020-12-22 15:00:00,2de3d7,2020-12-22 14:55:00
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000,0 days 02:55:59.981000,175.999683,1,2,3,2020-12-22 15:00:00.000,2020-12-22 15:00:00,2de3d7,2020-12-22 14:55:00
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029,0 days 01:00:00.029000,60.000483,1,3,1,2020-12-22 18:00:00.000,2020-12-22 18:00:00,6990000000000000162183243612064360401218956014...,2020-12-22 17:51:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038,0 days 00:02:00.009000,2.000150,0,298,4,2021-02-01 22:00:00.029,2021-02-01 22:00:00,cdda70,2021-02-01 21:53:00
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010,0 days 00:02:59.972000,2.999533,0,298,5,2021-02-01 22:00:00.029,2021-02-01 22:00:00,cdda70,2021-02-01 21:53:00
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981,0 days 00:02:59.971000,2.999517,0,298,6,2021-02-01 22:00:00.029,2021-02-01 22:00:00,cdda70,2021-02-01 21:53:00
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990,0 days 00:02:00.009000,2.000150,0,298,7,2021-02-01 22:00:00.029,2021-02-01 22:00:00,cdda70,2021-02-01 21:53:00
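
Another way to express "arrived 10 minutes (or less) before the start of the session" is `pd.merge_asof` with a backward direction and a tolerance. This is a swapped-in technique rather than the interval merge used above, sketched on hypothetical sessions and customer IDs. It keeps only the nearest earlier arrival per session, so it would need adapting if several customers can share a session.

```python
import pandas as pd

# Toy data (hypothetical); both key columns are cast to the same datetime dtype
sessions = pd.DataFrame({
    "Session #": [1, 2, 3],
    "Session Start": pd.to_datetime([
        "2020-12-22 13:59:59.971",
        "2020-12-22 15:00:00.000",
        "2020-12-22 18:00:00.000",
    ]).astype("datetime64[ns]"),
})
customers = pd.DataFrame({
    "Customer ID": ["aaa111", "bbb222"],
    "Entry Time": pd.to_datetime([
        "2020-12-22 13:54:00",
        "2020-12-22 14:55:00",
    ]).astype("datetime64[ns]"),
})

# For each session, take the latest entry at or before the start,
# but only if it is no more than 10 minutes earlier
matched = pd.merge_asof(
    sessions.sort_values("Session Start"),
    customers.sort_values("Entry Time"),
    left_on="Session Start",
    right_on="Entry Time",
    direction="backward",
    tolerance=pd.Timedelta(minutes=10),
)
print(matched)
```

Session 3 has no arrival within the window, so its Customer ID is left null, which is the behaviour the requirements ask for.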


In [614]:
# Output the data
# 6 fields: Session #, Customer ID, Song Order, Date, Artist, Song
# 988 rows (989 including headers)

# reduce columns for output
output = output[['Session #','Customer ID','Song Order','Date','Artist','Song']]
output

Unnamed: 0,Session #,Customer ID,Song Order,Date,Artist,Song
0,1,cd2834,1,2020-12-22 13:59:59.971,Wham!,Last Christmas
1,2,2de3d7,1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5
2,2,2de3d7,2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana
3,2,2de3d7,3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go
4,3,6990000000000000162183243612064360401218956014...,1,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way
...,...,...,...,...,...,...
983,298,cdda70,4,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire
984,298,cdda70,5,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show
985,298,cdda70,6,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road
986,298,cdda70,7,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary
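
Before writing the file, it can help to check the result against the stated output shape (6 fields, 988 rows). The helper below is a hypothetical sketch, not part of the original solution; `check_output` and the toy frame are made up for illustration.

```python
import pandas as pd

# Field names and order as listed in the challenge's Output section
EXPECTED_COLUMNS = ["Session #", "Customer ID", "Song Order", "Date", "Artist", "Song"]

def check_output(df, expected_rows):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems

# Toy one-row frame (hypothetical values) to exercise the helper
toy = pd.DataFrame(
    [[1, "aaa111", 1, pd.Timestamp("2020-12-22 14:00:00"), "Wham!", "Last Christmas"]],
    columns=EXPECTED_COLUMNS,
)
problems_ok = check_output(toy, expected_rows=1)
problems_bad = check_output(toy, expected_rows=988)
print(problems_ok, problems_bad)
```

On the real data the call would be `check_output(output, expected_rows=988)`.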
