

# **Will you skip this music track or not?**


The public part of the dataset consists of roughly 130 million listening sessions with associated user interactions on the Spotify service. 

The task is to predict whether individual tracks encountered in a listening session will be skipped by a particular user. In order to do this, complete information about the first half of a user’s listening session is provided, while the prediction is to be carried out on the second half. Participants have access to metadata, as well as acoustic descriptors, for all the tracks encountered in listening sessions.

https://www.aicrowd.com/challenges/spotify-sequential-skip-prediction-challenge

Brost, B., Mehrotra, R., & Jehan, T. (2019, May). The music streaming sessions dataset. In The World Wide Web Conference (pp. 2594-2600).



Because the full dataset is too large for experimenting with data manipulation, Spotify provides a mini dataset for this purpose.

In this notebook, we wrangle the data to inspect its quality and engineer features for machine learning modeling.


# Mount Google Drive to Colab

In [1]:
# # For Colab only
#from google.colab import drive
#drive.mount('/content/drive')
#%cd /content/drive/MyDrive/Capstone_SpotifyStreaming/notebooks

#!pip install featuretools==0.4.0
#!pip install -U featuretools
#!pip install featuretools

# # check the installed packages
# pip list -v 

In [None]:
import numpy as np
import pandas as pd
import featuretools as ft
import time
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Load the data and perform some data cleaning/re-coding as described in 2_mini_EDA

In [None]:
# load the track information (mini version)

tf_df = pd.read_csv('../data/raw/data/track_features/tf_mini.csv')
log_df = pd.read_csv('../data/raw/data/training_set/log_mini.csv')

In [None]:
# perform some data cleaning/re-coding as described in 2_mini_EDA

tf_df_dummy = pd.get_dummies(tf_df, columns=['key','time_signature','mode'])
log_df_dummy = pd.get_dummies(log_df.drop(columns = ['session_length',  'hist_user_behavior_reason_end', 'hist_user_behavior_n_seekfwd','hist_user_behavior_n_seekback']), columns=['hist_user_behavior_reason_start', 'context_type'])


In [None]:
tf_df_dummy.head().T

In [None]:
log_df_dummy.head().T

# Add skipping information as new columns

In [None]:
session_id = log_df_dummy['session_id'].unique()
print('number of sessions in this mini dataset:',len(session_id))


In [None]:
# combine the four skip flags into a single ordinal label column
def skip_label(df):
    skip = (df['not_skipped']==False).astype(int)*4 # no skip: 0, ultra-late skip: 4
    # The assignments below must run in this order: if skip_1 is True, then skip_2 and skip_3 are True too,
    # so the earliest skip has to be written last to take precedence.
    skip[df['skip_3']==True] = 3 # late skip
    skip[df['skip_2']==True] = 2 # mid skip
    skip[df['skip_1']==True] = 1 # early skip
    return skip

log_df_dummy['skip_label'] = skip_label(log_df_dummy)
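To make the ordering of the assignments concrete, here is a minimal sketch that applies the same labeling logic to a small made-up table. The `toy` dataframe and its values are assumptions for illustration only; they are not rows from the dataset.

```python
import pandas as pd

def skip_label(df):
    # same logic as in the cell above: start from 0 (not skipped) or 4 (ultra-late skip)
    skip = (df['not_skipped'] == False).astype(int) * 4
    # earlier skips overwrite later ones, so skip_1 is assigned last
    skip[df['skip_3'] == True] = 3
    skip[df['skip_2'] == True] = 2
    skip[df['skip_1'] == True] = 1
    return skip

# hypothetical rows: one per possible label
toy = pd.DataFrame({
    'skip_1':      [False, False, False, False, True],
    'skip_2':      [False, False, False, True,  True],
    'skip_3':      [False, False, True,  True,  True],
    'not_skipped': [True,  False, False, False, False],
})
labels = skip_label(toy)
print(labels.tolist())
```

The last row has all three skip flags set, and it ends up with label 1 only because the `skip_1` assignment runs last.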


In [None]:
# make a column which has session ID and skip info
log_df_dummy['session_id_skip_label'] = log_df_dummy['session_id'] + '_skip_' + log_df_dummy['skip_label'].astype(str)
log_df_dummy['session_id_skip_1_False'] = log_df_dummy['session_id'] * (log_df_dummy['skip_1'] == False)
log_df_dummy['session_id_skip_1_True'] = log_df_dummy['session_id'] * (log_df_dummy['skip_1'] == True)
log_df_dummy['session_id_not_skipped_True'] = log_df_dummy['session_id'] * (log_df_dummy['not_skipped'] == True)
log_df_dummy['session_id_not_skipped_False'] = log_df_dummy['session_id'] * (log_df_dummy['not_skipped'] == False)

log_df_dummy.head(10).T
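The `session_id * (condition)` expressions above rely on Python's rule that a string multiplied by `True` is itself and multiplied by `False` is an empty string. A more explicit way to express the same idea is `Series.where`. The sketch below uses made-up IDs and flags (`ids`, `flags` are hypothetical names) and compares the two on plain Python values.

```python
import pandas as pd

# hypothetical session IDs and skip_1 flags
ids = ['s1', 's1', 's2']
flags = [True, False, True]

# element-wise string * bool, as plain Python evaluates it
elementwise = [s * b for s, b in zip(ids, flags)]

# explicit version: keep the ID where the flag holds, otherwise use an empty string
explicit = pd.Series(ids).where(pd.Series(flags), '')

print(elementwise)
print(explicit.tolist())
```

Rows where the condition does not hold all share the empty-string value, so they fall into one common group if the column is later used as a groupby key.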

# Calculate the feature distance/similarity between adjacent tracks

The track information dataframe contains the acoustic analysis of each track, including its scores on 8 acoustic features (see: https://benanne.github.io/2014/08/05/spotify-cnns.html). Therefore, within each session, I calculate the ***distance*** (or ***similarity***) between each track and the track played immediately before it.

In [None]:
# extract the acoustic features of each track
df = log_df_dummy.merge(tf_df_dummy[['track_id','acousticness','beat_strength','danceability',
                                     'dyn_range_mean', 'energy', 'flatness','instrumentalness', 'liveness', 
                                     'loudness', 'mechanism', 'organism','speechiness','valence',
                                     'acoustic_vector_0','acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
                                     'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']], 
                        left_on = 'track_id_clean', 
                        right_on = 'track_id')
df.sort_values(by = ['session_id', 'session_position'],inplace = True)
df.head().T

In [None]:

temp_data = df.loc[df['session_id'] == session_id[0]]
temp_data

In [None]:
def cal_dist(x):
    from scipy.spatial.distance import cdist
    Y_euc = cdist(x, x, 'euclidean')
    Y_cos = cdist(x, x, 'cosine')
    Y_man = cdist(x, x, 'cityblock')
    # The 1st track of each session has no previous track, so its distance is set to 0 as a placeholder
    euc_dist = [0]
    cos_dist = [0]
    man_dist = [0]
    for n in range(1,len(x)):
        euc_dist.append(Y_euc[n,n-1])
        cos_dist.append(Y_cos[n,n-1])
        man_dist.append(Y_man[n,n-1])
    return euc_dist, cos_dist, man_dist
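`cal_dist` builds the full pairwise distance matrix but only uses the entries just below the diagonal. As a possible alternative, the sketch below computes only the distances between consecutive rows. `adjacent_dist` is a hypothetical helper, not part of the original notebook, and the random matrix stands in for a scaled session.

```python
import numpy as np
from scipy.spatial.distance import cdist

def adjacent_dist(x):
    """Distances between each row of x and the row before it (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    prev, curr = x[:-1], x[1:]
    diff = curr - prev
    euc = np.sqrt((diff ** 2).sum(axis=1))
    man = np.abs(diff).sum(axis=1)
    cos = 1.0 - (prev * curr).sum(axis=1) / (
        np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1))
    # the first track has no predecessor, so pad with 0 as cal_dist does
    pad = lambda d: np.concatenate(([0.0], d))
    return pad(euc), pad(cos), pad(man)

# made-up data: 6 "tracks" with 4 features each
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
euc, cos, man = adjacent_dist(x)

# reference: sub-diagonal of the full pairwise matrices used in cal_dist
ref_euc = np.diag(cdist(x, x, 'euclidean'), k=-1)
ref_cos = np.diag(cdist(x, x, 'cosine'), k=-1)
ref_man = np.diag(cdist(x, x, 'cityblock'), k=-1)
print(np.allclose(euc[1:], ref_euc), np.allclose(cos[1:], ref_cos), np.allclose(man[1:], ref_man))
```

The work grows linearly with session length instead of quadratically. Sessions in this dataset are short, so the benefit is mostly in clarity.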


In [None]:
# calculate the distance/similarity within each session
from sklearn.preprocessing import StandardScaler

# the last 20% of the rows of each session will be used as the testing dataset, so they must not be used to fit the scaler
train_perc = 0.8

sel_col_names = ['skip_label','acousticness','beat_strength','danceability',
                        'dyn_range_mean', 'energy', 'flatness','instrumentalness', 'liveness', 
                        'loudness', 'mechanism', 'organism','speechiness','valence',
                        'acoustic_vector_0','acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
                        'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']

start_time = time.time()

for s_id in session_id:
    temp_data = []
    temp_mat = []
    temp_data = df.loc[df['session_id'] == s_id, sel_col_names]
    temp_mat = temp_data.drop(columns = ['skip_label']).copy()
    scaler = StandardScaler()
    scaler.fit(temp_mat[0:round(len(temp_mat)*train_perc)])
    temp_mat_scaled = scaler.transform(temp_mat)
#     temp_mat_scaled_skip0 = temp_mat_scaled[temp_data['skip_label']==0]
#     temp_mat_scaled_skip1 = temp_mat_scaled[temp_data['skip_label']==1]
#     temp_mat_scaled_skipR = temp_mat_scaled[temp_data['skip_label']>1]
    
    
    euc_dist_all, cos_dist_all, man_dist_all = cal_dist(temp_mat_scaled)
    df.loc[temp_data.index, 'euc_dist_all'] = euc_dist_all
    df.loc[temp_data.index, 'cos_dist_all'] = cos_dist_all
    df.loc[temp_data.index, 'man_dist_all'] = man_dist_all
    
#     euc_dist_skip0, cos_dist_skip0, man_dist_skip0 = cal_dist(temp_mat_scaled_skip0)
#     loc_index0 = temp_data.index[temp_data['skip_label']==0].tolist()
#     df.loc[loc_index0, 'euc_dist_skip0'] = euc_dist_skip0
#     df.loc[loc_index0, 'cos_dist_skip0'] = cos_dist_skip0
#     df.loc[loc_index0, 'man_dist_skip0'] = man_dist_skip0
    
#     euc_dist_skip1, cos_dist_skip1, man_dist_skip1 = cal_dist(temp_mat_scaled_skip1)
#     loc_index1 = temp_data.index[temp_data['skip_label']==1].tolist()
#     df.loc[loc_index1, 'euc_dist_skip1'] = euc_dist_skip1
#     df.loc[loc_index1, 'cos_dist_skip1'] = cos_dist_skip1
#     df.loc[loc_index1, 'man_dist_skip1'] = man_dist_skip1
    
#     euc_dist_skipR, cos_dist_skipR, man_dist_skipR = cal_dist(temp_mat_scaled_skipR)
#     loc_indexR = temp_data.index[temp_data['skip_label']>1].tolist()
#     df.loc[loc_indexR, 'euc_dist_skipR'] = euc_dist_skipR
#     df.loc[loc_indexR, 'cos_dist_skipR'] = cos_dist_skipR
#     df.loc[loc_indexR, 'man_dist_skipR'] = man_dist_skipR

print('***It takes ',(time.time() - start_time)/60, ' minutes.***')


In [None]:
log_df_dummy2 = df.drop(columns = ['skip_label','track_id','acousticness','beat_strength','danceability',
                                     'dyn_range_mean', 'energy', 'flatness','instrumentalness', 'liveness', 
                                     'loudness', 'mechanism', 'organism','speechiness','valence',
                                     'acoustic_vector_0','acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
                                     'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7'])
log_df_dummy2['session_id_skip_label'] = log_df_dummy2['session_id_skip_label'].astype('category')
log_df_dummy2['session_id_skip_1_False'] = log_df_dummy2['session_id_skip_1_False'].astype('category')
log_df_dummy2['session_id_skip_1_True'] = log_df_dummy2['session_id_skip_1_True'].astype('category')
log_df_dummy2['session_id_not_skipped_True'] = log_df_dummy2['session_id_not_skipped_True'].astype('category')
log_df_dummy2['session_id_not_skipped_False'] = log_df_dummy2['session_id_not_skipped_False'].astype('category')

# **Use Featuretools to do automatic feature engineering**

In [None]:
# First, initialize an EntitySet with a name
es = ft.EntitySet(id="spotify_data")

In [None]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="tf",
    dataframe=tf_df_dummy,
    index="track_id",
)

es

In [None]:
es['tf'].ww.schema

In [None]:
# add dataframe
# the 'session_position' contains order information within each session

es = es.add_dataframe(
    dataframe_name="log", dataframe=log_df_dummy2, make_index = True, index="event_id", time_index="session_position",
)

es

In [None]:
es['log'].ww.schema

In [None]:
# When two DataFrames have a one-to-many relationship, we call the “one” DataFrame, the “parent DataFrame”. A relationship between a parent and child is defined like this:
# (parent_dataframe, parent_column, child_dataframe, child_column)
es = es.add_relationship("tf", "track_id", "log", "track_id_clean")
es

In [None]:
# turn on "features_only = True" to experiment with the function without computing the feature_matrix

primi_parameters = {
    "include_groupby_columns":{"log": ["session_id","session_id_skip_1_False","session_id_skip_1_True","session_id_not_skipped_True","session_id_not_skipped_False"]},
    "ignore_groupby_dataframes": ["tf"],
    "ignore_columns": {"log":["context_type_catalog","context_type_charts","context_type_editorial_playlist","context_type_personalized_playlist","context_type_radio","context_type_user_collection","hour_of_day","date"]}
                   }
primi_parameters

In [None]:
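import copy

# use a deep copy: dict.copy() is shallow, so editing the nested 'ignore_columns' dict below
# would otherwise also change primi_parameters
primi_parameters_ignoreCategoricalAcoustic = copy.deepcopy(primi_parameters)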


key_cols = [col for col in tf_df_dummy.columns if 'key_' in col]
time_cols = [col for col in tf_df_dummy.columns if 'time_signature_' in col]

primi_parameters_ignoreCategoricalAcoustic['ignore_columns']['tf'] = key_cols+time_cols
primi_parameters_ignoreCategoricalAcoustic

In [None]:
feature_defs_test = ft.dfs(entityset=es,
                        target_dataframe_name="log",
                        groupby_trans_primitives=["Diff","CumSum", "CumMean", "CumMin", "CumMax"],
                        agg_primitives=[],
                        trans_primitives=[],
                        primitive_options={"diff": primi_parameters,
                                           "cum_sum": primi_parameters,
                                           "cum_mean": primi_parameters,
                                           "cum_min": primi_parameters_ignoreCategoricalAcoustic,
                                           "cum_max": primi_parameters_ignoreCategoricalAcoustic
                                          },
                        features_only = True,
                        n_jobs=-1)


# feature_defs_test = ft.dfs(entityset=es,
#                         target_dataframe_name="log",
#                         groupby_trans_primitives=["Diff","CumSum", "CumMean", "CumMin", "CumMax"],
#                         agg_primitives=[],
#                         trans_primitives=[],
#                         primitive_options={"diff": {"include_groupby_columns": {"log": ["session_id","session_id_skip_1_False","session_id_skip_1_True","session_id_not_skipped_True","session_id_not_skipped_False"]},"ignore_groupby_dataframes": ["tf"]},
#                                         "cum_sum": {"include_groupby_columns": {"log": ["session_id","session_id_skip_1_False","session_id_skip_1_True","session_id_not_skipped_True","session_id_not_skipped_False"]},"ignore_groupby_dataframes": ["tf"]},
#                                         "cum_mean": {"include_groupby_columns": {"log": ["session_id","session_id_skip_1_False","session_id_skip_1_True","session_id_not_skipped_True","session_id_not_skipped_False"]},"ignore_groupby_dataframes": ["tf"]},
#                                         "cum_min": {"include_groupby_columns": {"log": ["session_id","session_id_skip_1_False","session_id_skip_1_True","session_id_not_skipped_True","session_id_not_skipped_False"]},"ignore_groupby_dataframes": ["tf"]},
#                                         "cum_max": {"include_groupby_columns": {"log": ["session_id","session_id_skip_1_False","session_id_skip_1_True","session_id_not_skipped_True","session_id_not_skipped_False"]},"ignore_groupby_dataframes": ["tf"]}},
#                         features_only = True,
#                         n_jobs=-1)

feature_defs_test

In [None]:
len(feature_defs_test)

In [None]:
import warnings
def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()
#warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

In [None]:
# see this guide for specifying groupby options: https://featuretools.alteryx.com/en/stable/guides/specifying_primitive_options.html
# 'session_id' is specified as a GroupBy option because we care about what happened within each session.
# The tf (track information) dataframe does not contain any order or session information, so it does not need to be specified as a GroupBy option.

#warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
# warnings.simplefilter("ignore")


start_time = time.time()

feature_matrix, feature_defs = ft.dfs(entityset=es,
                        target_dataframe_name="log",
                        groupby_trans_primitives=["Diff","CumSum", "CumCount", "CumMean", "CumMin", "CumMax"],
                        agg_primitives=[],
                        trans_primitives=[],
                        primitive_options={"diff": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_sum": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_count": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_mean": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_min": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_max": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]}},
                        features_only = False,
                        n_jobs=-1)

print('***It takes ',(time.time() - start_time)/60, ' minutes.***')

# save feature matrix
feature_matrix.to_csv('../data/processed/feature_matrix_noSkip_skip1_TF.csv')


In [None]:
feature_matrix

In [None]:
feature_defs

In [None]:
print(feature_defs[350])
ft.describe_feature(feature_defs[350])

In [28]:
feature_matrix.ww.schema

Column,Logical Type,Semantic Tag(s)
session_id,Categorical,['category']
session_position,Integer,['numeric']
skip_1,Boolean,[]
skip_2,Boolean,[]
skip_3,Boolean,[]
not_skipped,Boolean,[]
context_switch,Integer,['numeric']
no_pause_before_play,Integer,['numeric']
short_pause_before_play,Integer,['numeric']
long_pause_before_play,Integer,['numeric']


# Check the results of Featuretools

In [46]:
# load the feature matrix to check whether it was properly saved.

feature_matrix2 = pd.read_csv('../data/processed/feature_matrix_skipLabel_distance.csv')
feature_matrix2.sort_values(by = ['session_id','session_position'],inplace=True)
feature_matrix2.head().T

Unnamed: 0,0,10000,20000,30000,40000
event_id,0,1,2,3,4
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e
session_position,1,2,3,4,5
skip_1,False,False,False,False,False
skip_2,False,False,False,False,False
...,...,...,...,...,...
DIFF(tf.time_signature_5) by session_id_skip_label,,0.0,0.0,0.0,0.0
DIFF(tf.us_popularity_estimate) by session_id,,-0.071405,0.103248,-0.004938,0.00346
DIFF(tf.us_popularity_estimate) by session_id_skip_label,,-0.071405,0.103248,-0.004938,0.00346
DIFF(tf.valence) by session_id,,0.184898,0.03671,0.275558,0.003501


In [48]:
# inspect the data of a session
a = feature_matrix2[feature_matrix2['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']
# check whether the features generated by Featuretools respect the session grouping and the track order. They appear to.
a[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_skip_label','session_id_skip_label','session_position']]

Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_skip_label,session_id_skip_label,session_position
0,0.069717,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,1
10000,0.061158,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,2
20000,0.045354,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,3
30000,0.229936,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,4
40000,0.24098,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,5
50000,0.133586,0.78073,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.133586,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_3,6
60000,0.409848,1.190578,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.409848,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_1,7
70000,0.103687,1.294266,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.513535,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_1,8
80000,0.049853,1.344119,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.049853,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_2,9
90000,0.154609,1.498729,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.668145,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_1,10
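The two cumulative-sum columns in the table above can be reproduced with plain pandas, which helps confirm what the `by session_id` and `by session_id_skip_label` groupings mean. The sketch below copies the speechiness values and skip labels from that table; the column names `cum_by_session` and `cum_by_group` and the short session ID `'s0'` are placeholders chosen for illustration.

```python
import pandas as pd

# speechiness values and skip labels copied from the session shown above
sess = pd.DataFrame({
    'session_id': ['s0'] * 10,
    'speechiness': [0.069717, 0.061158, 0.045354, 0.229936, 0.24098,
                    0.133586, 0.409848, 0.103687, 0.049853, 0.154609],
    'skip_label': [0, 0, 0, 0, 0, 3, 1, 1, 2, 1],
})
sess['session_id_skip_label'] = sess['session_id'] + '_skip_' + sess['skip_label'].astype(str)

# running total over the whole session, in row order
sess['cum_by_session'] = sess.groupby('session_id')['speechiness'].cumsum()
# running total restarted for every (session, skip label) combination
sess['cum_by_group'] = sess.groupby('session_id_skip_label')['speechiness'].cumsum()

print(sess[['speechiness', 'skip_label', 'cum_by_session', 'cum_by_group']])
```

Up to rounding of the displayed inputs, both columns should agree with the table. In particular, the `skip_1` rows at positions 7, 8 and 10 accumulate together even though a `skip_2` row sits between them.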


In [49]:
sel_cols = [col for col in feature_matrix2.columns if 'by session_id_skip_label' in col]
sel_cols

['CUM_COUNT(session_id) by session_id_skip_label',
 'CUM_COUNT(session_id_skip_label) by session_id_skip_label',
 'CUM_COUNT(track_id_clean) by session_id_skip_label',
 'CUM_MAX(context_switch) by session_id_skip_label',
 'CUM_MAX(context_type_catalog) by session_id_skip_label',
 'CUM_MAX(context_type_charts) by session_id_skip_label',
 'CUM_MAX(context_type_editorial_playlist) by session_id_skip_label',
 'CUM_MAX(context_type_personalized_playlist) by session_id_skip_label',
 'CUM_MAX(context_type_radio) by session_id_skip_label',
 'CUM_MAX(context_type_user_collection) by session_id_skip_label',
 'CUM_MAX(cos_dist_all) by session_id_skip_label',
 'CUM_MAX(euc_dist_all) by session_id_skip_label',
 'CUM_MAX(hist_user_behavior_reason_start_appload) by session_id_skip_label',
 'CUM_MAX(hist_user_behavior_reason_start_backbtn) by session_id_skip_label',
 'CUM_MAX(hist_user_behavior_reason_start_clickrow) by session_id_skip_label',
 'CUM_MAX(hist_user_behavior_reason_start_endplay) by sess

In [50]:
feature_matrix2['session_id_skip_label'].isna().sum()

0

In [52]:
start_time = time.time()
for s_id in pd.unique(feature_matrix2['session_id']):
    # work on a copy of the session's rows to avoid assigning into a view of feature_matrix2
    temp = feature_matrix2[feature_matrix2['session_id']==s_id].copy()
    # shift the rows down by 1 within each session so that each row only reflects the history up to the previous row;
    # without shifting, the skip label of the current track would leak into the features!
    temp[sel_cols] = temp[sel_cols].shift(periods=1, axis=0, fill_value=0) 
    #temp.fillna(method = 'ffill', axis = 0, inplace=True)
    feature_matrix2[feature_matrix2['session_id']==s_id] = temp

print('***It takes ',(time.time() - start_time)/60, ' mins.***')

***It takes  26.485645016034443  mins.***
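The per-session loop above took about 26 minutes on the mini dataset. A grouped shift should produce the same result in a single vectorized call. The sketch below demonstrates it on a small made-up frame (the `toy` dataframe and its single feature column are assumptions), under the assumption that the rows are already sorted by `session_id` and `session_position`.

```python
import pandas as pd

# hypothetical feature matrix with two sessions
toy = pd.DataFrame({
    'session_id': ['a', 'a', 'a', 'b', 'b'],
    'session_position': [1, 2, 3, 1, 2],
    'CUM_SUM(x) by session_id_skip_label': [0.1, 0.3, 0.2, 0.5, 0.9],
})
sel_cols = [col for col in toy.columns if 'by session_id_skip_label' in col]

toy = toy.sort_values(by=['session_id', 'session_position'])
# shift each session's rows down by one; the first row of every session gets 0
toy[sel_cols] = toy.groupby('session_id')[sel_cols].shift(periods=1, fill_value=0)

print(toy)
```

Because the shift is applied per group, no value crosses from the end of one session into the start of the next, which is the same guarantee the loop provides.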


In [53]:
# check the data again after shifting the rows
b = feature_matrix2[feature_matrix2['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']
# after shifting, each 'by session_id_skip_label' value should equal the previous row's value within the session (0 for the first row)
b[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_skip_label','session_id_skip_label','session_position']]

Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_skip_label,session_id_skip_label,session_position
0,0.069717,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.0,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,1
10000,0.061158,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,2
20000,0.045354,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,3
30000,0.229936,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,4
40000,0.24098,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_0,5
50000,0.133586,0.78073,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_3,6
60000,0.409848,1.190578,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.133586,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_1,7
70000,0.103687,1.294266,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.409848,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_1,8
80000,0.049853,1.344119,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.513535,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_2,9
90000,0.154609,1.498729,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.049853,0_00006f66-33e5-4de7-a324-2d18e439fc1e_skip_1,10


In [7]:
# feature_matrix2.loc[feature_matrix2['session_id_not_skipped'].isna(),sel_cols] = np.nan
# b = feature_matrix2[feature_matrix2['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']
# b[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_not_skipped','session_id_not_skipped','session_position']]

Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_not_skipped,session_id_not_skipped,session_position
0,0.069717,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1
10000,0.061158,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2
20000,0.045354,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3
30000,0.229936,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4
40000,0.24098,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5
50000,0.133586,0.78073,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,6
60000,0.409848,1.190578,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,7
70000,0.103687,1.294266,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,8
80000,0.049853,1.344119,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,9
90000,0.154609,1.498729,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,10


In [9]:
# feature_matrix3 = feature_matrix2.sort_values(by = ['session_id','session_position']).copy()
# start_time = time.time()
# for s_id in pd.unique(feature_matrix2['session_id']):
#     temp = []
#     temp = feature_matrix3[feature_matrix3['session_id']==s_id]
#     # shift the rows down by 1 within each session because it reflects the history (no_skip) up to the previous row
#     # without shifting, the information of no_skip will leak into the features!!
#     temp[sel_cols] = temp[sel_cols].shift(periods=1, axis=0) 
#     temp.fillna(method = 'ffill', axis = 0, inplace=True)
#     feature_matrix3[feature_matrix3['session_id']==s_id] = temp

# print('***It takes ',(time.time() - start_time)/60, ' mins.***')


# c = feature_matrix3[feature_matrix3['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']
# c[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_not_skipped','session_id_not_skipped','not_skipped','session_position']]

***It takes  1416.1911041736603  secs.***


Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_not_skipped,session_id_not_skipped,not_skipped,session_position
0,0.069717,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,1
10000,0.061158,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,2
20000,0.045354,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,3
30000,0.229936,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,4
40000,0.24098,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,5
50000,0.133586,0.78073,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,6
60000,0.409848,1.190578,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,7
70000,0.103687,1.294266,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,8
80000,0.049853,1.344119,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,9
90000,0.154609,1.498729,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,10


In [10]:
d = feature_matrix3[feature_matrix3['session_id'] == '0_0000a72b-09ac-412f-b452-9b9e79bded8f']
d[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_not_skipped','session_id_not_skipped','not_skipped','session_position']]


Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_not_skipped,session_id_not_skipped,not_skipped,session_position
1,0.192012,0.192012,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,1
10001,0.068105,0.260117,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,2
20001,0.075064,0.335181,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,3
30001,0.045026,0.380207,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,4
40001,0.12405,0.504257,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,5
50001,0.239021,0.743278,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,6
60001,0.039155,0.782433,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,7
70001,0.041675,0.824108,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,8
80001,0.134822,0.958929,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,9
90001,0.140254,1.099183,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,10


In [12]:
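# feature_matrix3 is only defined in the commented-out cell above; the shifted features from the active cell are in feature_matrix2
feature_matrix2.to_csv('../data/processed/feature_matrix_skip_processed.csv')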

# Scale standardization & training/testing data split will be performed at the model fitting stage.
