# 2021: Week 8 - Karaoke Data

February 24, 2021

Challenge by: Jenny Martin

Recently I was helping a colleague prep some karaoke data and I thought it was too fun a subject to resist turning into a Preppin' Data challenge! I had a lot of fun creating the dataset and imagining the type of person who may sing one song and then not bother with the rest of the session. 

We will need to make some assumptions as part of our data prep:

- Customers often don't sing the entire song
- Sessions last 60 minutes
- Customers arrive a maximum of 10 minutes before their sessions begin

I will warn you that this challenge may be a little on the trickier end of the spectrum!

## Inputs

1. Karaoke song choices and what time they began 

<img src='https://1.bp.blogspot.com/-OKoZi-s2jrI/X-Iam5_rHYI/AAAAAAAAAqg/XUkttbXMfNEfez_Q2lPotOCVSiqGUPrPACLcBGAsYHQ/w400-h223/Karaoke%2BInput.png'>

2. Customer entry times 

<img src='https://1.bp.blogspot.com/-fFeRcrRKbvE/X-IasecCtrI/AAAAAAAAAqk/X0Y9RMIJBWAoHK74_QoC3YDaiVMdnodTACLcBGAsYHQ/s0/Customer%2BEntry.png'>

## Requirements

- [Input the data](https://drive.google.com/file/d/1ACDlDqxsWUxfojo_ZAcEe7GuYXw4rfYI/view?usp=drivesdk)
- Calculate the time between songs
- If the time between songs is greater than (or equal to) 59 minutes, flag this as being a new session
- Create a session number field
- Number the songs in order for each session
- Match the customers to the correct session, based on their entry time
    - The Customer ID field should be null if there were no customers who arrived 10 minutes (or less) before the start of the session
- [Output the data](https://drive.google.com/file/d/1RCADhTi7YruO3-wzit2PQoTj_GfqVcOl/view?usp=sharing)
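
As a rough illustration of the session steps above (time between songs, new-session flag, session number, song order), here is a minimal pandas sketch on invented toy data. The column names `Date`, `Session #` and `Song Order` follow the challenge; the rows are made up, and the `diff`/`cumsum`/`cumcount` combination is only one possible way to do it.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the challenge file)
songs = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01 15:00:00",
        "2021-01-01 15:03:00",
        "2021-01-01 15:07:00",
        "2021-01-01 18:00:00",
        "2021-01-01 18:04:00",
    ]),
    "Song": ["A", "B", "C", "D", "E"],
}).sort_values("Date", ignore_index=True)

# Minutes since the previous song (NaN for the very first song)
gap_minutes = songs["Date"].diff().dt.total_seconds() / 60

# A song opens a new session if it is the first song or follows a gap of 59+ minutes
new_session = gap_minutes.isna() | (gap_minutes >= 59)

# A running count of session starts gives the session number
songs["Session #"] = new_session.cumsum()

# Position of each song within its session, starting at 1
songs["Song Order"] = songs.groupby("Session #").cumcount() + 1

print(songs)
```

Measuring the gap back to the previous song (rather than forward to the next one) means the flag sits on the first song of each session, so a cumulative sum is enough to number the sessions.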

## Output

<img src='https://1.bp.blogspot.com/-KZrVSOQMowk/YDe9W8-t97I/AAAAAAAAAww/rziET4GoHtk3JehmSfcmu_my7KPSNM8pgCLcBGAsYHQ/w640-h184/Karaoke%2BOutput2.png'>

- 6 fields
    - Session #
    - Customer ID
    - Song Order
    - Date
    - Artist
    - Song
- 988 rows (989 including headers)

In [27]:
import pandas as pd
import numpy as np
import datetime as dt

# Load data
karaoke_choices = pd.read_excel('Copy of Karaoke Dataset.xlsx', engine='openpyxl', sheet_name = 'Karaoke Choices')
karaoke_choices



Unnamed: 0,Date,Artist,Song
0,2020-12-22 13:59:59.971,Wham!,Last Christmas
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way
...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary


In [28]:
customers = pd.read_excel('Copy of Karaoke Dataset.xlsx', engine='openpyxl', sheet_name = 'Customers')
customers

Unnamed: 0,Customer ID,Entry Time
0,3fdc46,2020-12-27 06:55:00
1,3fdc46,2020-12-31 03:55:00
2,3fdc46,2021-01-02 08:55:00
3,3fdc46,2021-01-09 05:55:00
4,3fdc46,2021-02-01 06:55:00
...,...,...
296,bdc39c,2021-01-28 10:53:00
297,8d850e,2020-12-28 02:51:00
298,8d850e,2021-01-12 22:51:00
299,8d850e,2021-01-26 11:51:00


In [29]:
# Calculate the time between songs
karaoke_choices['Next song'] = karaoke_choices['Date'].shift(-1)  # start time of the following song (NaT on the last row)
karaoke_choices
# karaoke_choices['Time between songs'] = karaoke_choices['Next song'] - karaoke_choices['Date']

Unnamed: 0,Date,Artist,Song,Next song
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,2020-12-22 15:00:00.000
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,2020-12-22 15:02:00.010
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2020-12-22 15:04:00.019
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2020-12-22 18:00:00.000
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,2020-12-22 19:00:00.029
...,...,...,...,...
983,2021-02-01 22:09:00.029,Kings Of Leon,Sex Is On Fire,2021-02-01 22:11:00.038
984,2021-02-01 22:11:00.038,Hugh Jackman & The Greatest Showman Cast,The Greatest Show,2021-02-01 22:14:00.010
985,2021-02-01 22:14:00.010,Lil Nas X,Old Town Road,2021-02-01 22:16:59.981
986,2021-02-01 22:16:59.981,Ike And Tina Turner,Proud Mary,2021-02-01 22:18:59.990


In [30]:
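# If the time between songs is greater than (or equal to) 59 minutes, flag this as being a new session
# 'Time between songs' has to exist before it is converted to minutes
karaoke_choices['Time between songs'] = karaoke_choices['Next song'] - karaoke_choices['Date']
karaoke_choices['Time between songs in minutes'] = karaoke_choices['Time between songs'].dt.total_seconds()/60
# the flag sits on the last song of a session: the following song starts the new one
karaoke_choices['New session'] = np.where(karaoke_choices['Time between songs in minutes']>=59,1,0)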

In [None]:
# Create a session number field
# loop through dataframe and increase session id if new session flagged
session = 1
session_list = []
for i in range(len(karaoke_choices['New session'])):
    session_list.append(session)
    session += karaoke_choices['New session'][i]

karaoke_choices['Session #'] = session_list



In [None]:
# Number the songs in order for each session
karaoke_choices["Song Order"] = karaoke_choices.groupby("Session #")["Date"].rank("dense", ascending=True)
karaoke_choices["Song Order"] = karaoke_choices["Song Order"].astype(int)



In [None]:
# Match the customers to the correct session, based on their entry time

# create session start time column
session_start_df = karaoke_choices[["Session #","Date"]]
session_start_df = session_start_df.loc[session_start_df.groupby("Session #").Date.idxmin()].reset_index(drop=True)



In [None]:
# rename column and merge to dataset
new_session_columns = ['Session #','Session Start']
session_start_df.columns  = new_session_columns
karaoke_choices = karaoke_choices.merge(session_start_df, on='Session #', how='inner')



In [None]:
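# create time interval to match datasets on
# rounding to the nearest 20 minutes puts times within 10 minutes either side of a boundary in the same interval
threshold = 20  # interval width in minutes
threshold_ns = threshold * 60 * 1e9  # interval width in nanoseconds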



In [None]:
# compute "interval" to which each session start and customer entry belongs
karaoke_choices['interval'] = karaoke_choices['Session Start'].dt.round(f'{threshold}min')
customers['interval'] = customers['Entry Time'].dt.round(f'{threshold}min')



In [None]:
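# merge datasets on interval
# a left merge keeps sessions with no matching customer, leaving Customer ID null as the requirements ask
output = karaoke_choices.merge(customers, on='interval', how='left')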
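
Rounding both timestamps to a shared interval also matches customers who arrive shortly after a session starts. As a possible alternative, here is a sketch using `pd.merge_asof` with a 10-minute tolerance, so that only entries at or before the session start are considered. The toy frames and the names `toy_sessions`, `toy_customers` and `matched` are invented for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy data, invented for illustration
toy_sessions = pd.DataFrame({
    "Session #": [1, 2, 3],
    "Session Start": pd.to_datetime([
        "2021-01-01 15:00:00",
        "2021-01-01 18:00:00",
        "2021-01-01 21:00:00",
    ]),
})
toy_customers = pd.DataFrame({
    "Customer ID": ["aaa111", "bbb222", "ccc333"],
    "Entry Time": pd.to_datetime([
        "2021-01-01 14:55:00",  # 5 minutes before session 1
        "2021-01-01 17:40:00",  # 20 minutes before session 2: too early
        "2021-01-01 21:05:00",  # after session 3 has started
    ]),
})

# For each session start, look backward for the most recent entry time
# and keep it only if it is no more than 10 minutes earlier.
# Both frames must be sorted on their time keys for merge_asof.
matched = pd.merge_asof(
    toy_sessions.sort_values("Session Start"),
    toy_customers.sort_values("Entry Time"),
    left_on="Session Start",
    right_on="Entry Time",
    direction="backward",
    tolerance=pd.Timedelta(minutes=10),
)

print(matched[["Session #", "Customer ID"]])
```

Note that `merge_asof` keeps at most one customer per session (the latest arrival), which assumes each session belongs to a single customer. The resulting `Session #` to `Customer ID` mapping could then be left-merged back onto the song-level table.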



In [None]:
# Output the data
# 6 fields: Session #, Customer ID, Song Order, Date, Artist, Song
# 988 rows (989 including headers)

# reduce columns for output
output = output[['Session #','Customer ID','Song Order','Date','Artist','Song']]
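
The challenge specifies 6 fields and 988 rows, so a quick sanity check before exporting may help. The `check_output` helper below is hypothetical (it is not part of the original notebook) and is shown here on an invented one-row frame.

```python
import pandas as pd

# Field names and order as listed in the challenge's Output section
EXPECTED_COLUMNS = ["Session #", "Customer ID", "Song Order", "Date", "Artist", "Song"]

def check_output(df, expected_rows=988):
    """Return a list of problems; an empty list means the frame matches the spec."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems

# Toy frame with the right columns but only one row (invented values)
toy = pd.DataFrame(
    [[1, None, 1, pd.Timestamp("2021-01-01 15:00"), "Artist", "Song"]],
    columns=EXPECTED_COLUMNS,
)
problems = check_output(toy)
print(problems)
```

In the notebook this would be called as `check_output(output)` after the column selection above.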