

# **Will you skip this music track or not?**


The public part of the dataset consists of roughly 130 million listening sessions with associated user interactions on the Spotify service. 

The task is to predict whether individual tracks encountered in a listening session will be skipped by a particular user. In order to do this, complete information about the first half of a user’s listening session is provided, while the prediction is to be carried out on the second half. Participants have access to metadata, as well as acoustic descriptors, for all the tracks encountered in listening sessions.

https://www.aicrowd.com/challenges/spotify-sequential-skip-prediction-challenge

Brost, B., Mehrotra, R., & Jehan, T. (2019, May). The music streaming sessions dataset. In The World Wide Web Conference (pp. 2594-2600).



As the entire dataset is too large for experimenting with data manipulation, Spotify provides a mini dataset for this purpose.

In this script, we wrangle the data to inspect its quality and then engineer features for machine learning modeling.


# Mount Google drive to Colab

In [1]:
# # For Colab only
#from google.colab import drive
#drive.mount('/content/drive')
#%cd /content/drive/MyDrive/Capstone_SpotifyStreaming/notebooks

#!pip install featuretools==0.4.0
#!pip install -U featuretools
#!pip install featuretools

# # check the installed packages
# pip list -v 

In [2]:
import numpy as np
import pandas as pd
import featuretools as ft
import time
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Load the data and perform some data cleaning/re-coding as described in 2_mini_EDA

In [3]:
# load the track information (mini version)

tf_df = pd.read_csv('../data/raw/data/track_features/tf_mini.csv')
log_df = pd.read_csv('../data/raw/data/training_set/log_mini.csv')

In [4]:
# perform some data cleaning/re-coding as described in 2_mini_EDA

tf_df_dummy = pd.get_dummies(tf_df, columns=['key','time_signature','mode'])
log_df_dummy = pd.get_dummies(log_df.drop(columns = ['session_length',  'hist_user_behavior_reason_end', 'hist_user_behavior_n_seekfwd','hist_user_behavior_n_seekback']), columns=['hist_user_behavior_reason_start', 'context_type'])
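
To illustrate the re-coding step, here is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative, not taken from the dataset). `pd.get_dummies` replaces each listed column with one indicator column per observed category and leaves the other columns untouched.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the track features, values are made up.
toy = pd.DataFrame({
    "track_id": ["t_a", "t_b", "t_c"],
    "key": [0, 7, 0],
    "mode": ["major", "minor", "major"],
})

# One indicator column is created per distinct value of 'key' and 'mode'.
toy_dummy = pd.get_dummies(toy, columns=["key", "mode"])
print(toy_dummy.columns.tolist())
```

The untouched `track_id` column stays first, followed by the indicator columns named `<column>_<value>`.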


In [5]:
tf_df_dummy.head().T

Unnamed: 0,0,1,2,3,4
track_id,t_a540e552-16d4-42f8-a185-232bd650ea7d,t_67965da0-132b-4b1e-8a69-0ef99b32287c,t_0614ecd3-a7d5-40a1-816e-156d5872a467,t_070a63a0-744a-434e-9913-a97b02926a29,t_d6990e17-9c31-4b01-8559-47d9ce476df1
duration,109.706673,187.693329,160.839996,175.399994,369.600006
release_year,1950,1950,1951,1951,1951
us_popularity_estimate,99.975414,99.96943,99.602549,99.665018,99.991764
acousticness,0.45804,0.916272,0.812884,0.396854,0.728831
beat_strength,0.519497,0.419223,0.42589,0.400934,0.371328
bounciness,0.504949,0.54553,0.50828,0.35999,0.335115
danceability,0.399767,0.491235,0.491625,0.552227,0.483044
dyn_range_mean,7.51188,9.098376,8.36867,5.967346,5.802681
energy,0.817709,0.154258,0.358813,0.514585,0.721442


In [6]:
log_df_dummy.head().T

Unnamed: 0,0,1,2,3,4
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e
session_position,1,2,3,4,5
track_id_clean,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,t_64f3743c-f624-46bb-a579-0f3f9a07a123
skip_1,False,False,False,False,False
skip_2,False,False,False,False,False
skip_3,False,False,False,False,False
not_skipped,True,True,True,True,True
context_switch,0,0,0,0,0
no_pause_before_play,0,1,1,1,1
short_pause_before_play,0,0,0,0,0


# add skipping information as new columns

In [7]:
session_id = log_df_dummy['session_id'].unique()
print('number of sessions in this mini dataset:',len(session_id))


number of sessions in this mini dataset: 10000


In [8]:
# combine the four skip flags into a single label column
def skip_label(df):
    skip = (df['not_skipped']==False).astype(int)*4 # no skip: 0, ultra-late skip: 4
    # The assignments must follow this order: if skip_1 is True, then skip_2 and skip_3 are True too,
    # so the earliest skip has to be assigned last in order to win.
    skip[df['skip_3']==True] = 3 # late skip
    skip[df['skip_2']==True] = 2 # mid skip
    skip[df['skip_1']==True] = 1 # early skip
    return skip

log_df_dummy['skip_label'] = skip_label(log_df_dummy)
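
As a quick sanity check of this coding scheme, the sketch below applies the same precedence logic to five made-up rows, one per possible label. The frame `toy` and the helper name `skip_label_demo` are illustrative only.

```python
import pandas as pd

def skip_label_demo(df):
    # Start from 0 (not skipped) or 4 (skipped, but none of the skip_1..3 flags set).
    skip = (~df["not_skipped"]).astype(int) * 4
    # Assign from the latest to the earliest threshold so the earliest one wins.
    skip[df["skip_3"]] = 3
    skip[df["skip_2"]] = 2
    skip[df["skip_1"]] = 1
    return skip

# Made-up rows covering each flag pattern.
toy = pd.DataFrame({
    "skip_1":      [False, False, False, True,  False],
    "skip_2":      [False, False, True,  True,  False],
    "skip_3":      [False, True,  True,  True,  False],
    "not_skipped": [True,  False, False, False, False],
})
labels = skip_label_demo(toy).tolist()
print(labels)
```

Reversing the order of the three assignments would label every skipped row as 3, because `skip_3` is True whenever an earlier flag is.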


In [9]:
# make a column which has session ID and skip info
log_df_dummy['session_id_skip_label'] = log_df_dummy['session_id'] + '_skip_' + log_df_dummy['skip_label'].astype(str)

log_df_dummy.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e
session_position,1,2,3,4,5,6,7,8,9,10
track_id_clean,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,t_64f3743c-f624-46bb-a579-0f3f9a07a123,t_c815228b-3212-4f9e-9d4f-9cb19b248184,t_e23c19f5-4c32-4557-aa44-81372c2e3705,t_0be6eced-f56f-48bd-8086-f2e0b760fdee,t_f3ecbd3b-9e8e-4557-b8e0-39cfcd7e65dd,t_2af4dfa0-7df3-4b7e-b7ab-353ba48237f9
skip_1,False,False,False,False,False,False,True,True,False,True
skip_2,False,False,False,False,False,False,True,True,True,True
skip_3,False,False,False,False,False,True,True,True,True,True
not_skipped,True,True,True,True,True,False,False,False,False,False
context_switch,0,0,0,0,0,0,0,0,0,0
no_pause_before_play,0,1,1,1,1,1,1,1,1,1
short_pause_before_play,0,0,0,0,0,0,0,0,0,0
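
The combined `session_id_skip_label` key makes it possible to group rows by session and skip outcome at once. Below is a minimal sketch with made-up session IDs, assuming a running count per group is the statistic of interest (the column name `n_so_far` is hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2", "s2"],
    "skip_label": [0, 1, 0, 1, 1],
})
# Same construction as in the cell above: session ID plus skip label.
toy["session_id_skip_label"] = (
    toy["session_id"] + "_skip_" + toy["skip_label"].astype(str)
)
# 1-based running count of rows within each (session, skip label) group.
toy["n_so_far"] = toy.groupby("session_id_skip_label").cumcount() + 1
print(toy)
```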


# Calculate the feature distance/similarity between adjacent tracks

The track information dataframe contains acoustic analysis results, including each track's scores on 8 acoustic features (see: https://benanne.github.io/2014/08/05/spotify-cnns.html). Therefore, within each session, I would like to calculate the ***distance*** or ***similarity*** between each track and the track played immediately before it.

In [10]:
# extract the acoustic features of each track
df = log_df_dummy.merge(tf_df_dummy[['track_id','acousticness','beat_strength','danceability',
                                     'dyn_range_mean', 'energy', 'flatness','instrumentalness', 'liveness', 
                                     'loudness', 'mechanism', 'organism','speechiness','valence',
                                     'acoustic_vector_0','acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
                                     'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']], 
                        left_on = 'track_id_clean', 
                        right_on = 'track_id')
df.sort_values(by = ['session_id', 'session_position'],inplace = True)
df.head().T

Unnamed: 0,0,45,50,327,353
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e
session_position,1,2,3,4,5
track_id_clean,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,t_64f3743c-f624-46bb-a579-0f3f9a07a123
skip_1,False,False,False,False,False
skip_2,False,False,False,False,False
skip_3,False,False,False,False,False
not_skipped,True,True,True,True,True
context_switch,0,0,0,0,0
no_pause_before_play,0,1,1,1,1
short_pause_before_play,0,0,0,0,0
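
Because `merge` performs an inner join by default, any log row whose `track_id_clean` has no match in the track table would be dropped silently. The sketch below, on made-up IDs, shows one way to check for that with the `indicator` option. The frame names `log_toy` and `tf_toy` are illustrative.

```python
import pandas as pd

log_toy = pd.DataFrame({
    "track_id_clean": ["t_1", "t_2", "t_9"],
    "session_position": [1, 2, 3],
})
tf_toy = pd.DataFrame({"track_id": ["t_1", "t_2"], "energy": [0.8, 0.3]})

# A left join keeps every log row; '_merge' records where each row was found.
checked = log_toy.merge(
    tf_toy,
    how="left",
    left_on="track_id_clean",
    right_on="track_id",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "track_id_clean"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean the inner join used above loses no rows.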


In [11]:

temp_data = df.loc[df['session_id'] == session_id[0]]
temp_data

Unnamed: 0,session_id,session_position,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,short_pause_before_play,...,speechiness,valence,acoustic_vector_0,acoustic_vector_1,acoustic_vector_2,acoustic_vector_3,acoustic_vector_4,acoustic_vector_5,acoustic_vector_6,acoustic_vector_7
0,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,False,False,False,True,0,0,0,...,0.069717,0.152255,-0.815775,0.386409,0.23016,0.028028,-0.333373,0.015452,-0.35359,0.205826
45,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,False,False,False,True,0,1,0,...,0.061158,0.337152,-0.713646,0.363718,0.310315,-0.042222,-0.383164,0.066357,-0.365308,0.15792
50,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,False,False,False,True,0,1,0,...,0.045354,0.373862,-0.742541,0.375599,0.25266,-0.049007,-0.299745,0.063341,-0.486689,0.181604
327,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,False,False,False,True,0,1,0,...,0.229936,0.64942,-0.705116,0.317562,0.289141,-0.03892,-0.393358,0.092719,-0.364418,0.285603
353,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5,t_64f3743c-f624-46bb-a579-0f3f9a07a123,False,False,False,True,0,1,0,...,0.24098,0.652921,-0.868489,0.33128,0.210478,0.08474,-0.333287,-0.025706,-0.51035,0.182315
475,0_00006f66-33e5-4de7-a324-2d18e439fc1e,6,t_c815228b-3212-4f9e-9d4f-9cb19b248184,False,False,True,False,0,1,0,...,0.133586,0.661081,-0.817504,0.283297,0.387589,0.279636,-0.280334,0.117993,0.106159,0.311233
537,0_00006f66-33e5-4de7-a324-2d18e439fc1e,7,t_e23c19f5-4c32-4557-aa44-81372c2e3705,True,True,True,False,0,1,0,...,0.409848,0.10942,-0.748412,0.321976,0.237488,0.00348,-0.315287,0.032431,-0.464694,0.200836
540,0_00006f66-33e5-4de7-a324-2d18e439fc1e,8,t_0be6eced-f56f-48bd-8086-f2e0b760fdee,True,True,True,False,0,1,0,...,0.103687,0.389913,-0.921928,0.35974,0.293674,0.115302,-0.274987,0.043193,-0.444351,0.211909
541,0_00006f66-33e5-4de7-a324-2d18e439fc1e,9,t_f3ecbd3b-9e8e-4557-b8e0-39cfcd7e65dd,False,True,True,False,0,1,0,...,0.049853,0.338321,-0.744412,0.3087,0.230126,0.066493,-0.242549,0.02537,-0.40321,0.15935
601,0_00006f66-33e5-4de7-a324-2d18e439fc1e,10,t_2af4dfa0-7df3-4b7e-b7ab-353ba48237f9,True,True,True,False,0,1,0,...,0.154609,0.257672,-0.647221,0.316101,0.251329,-0.041532,-0.252359,0.059971,-0.313696,0.126421


In [12]:
def cal_dist(x):
    from scipy.spatial.distance import cdist
    Y_euc = cdist(x, x, 'euclidean')
    Y_cos = cdist(x, x, 'cosine')
    Y_man = cdist(x, x, 'cityblock')
    # The 1st track of each session has no preceding track, so its distance is set to 0
    euc_dist = [0]
    cos_dist = [0]
    man_dist = [0]
    for n in range(1,len(x)):
        euc_dist.append(Y_euc[n,n-1])
        cos_dist.append(Y_cos[n,n-1])
        man_dist.append(Y_man[n,n-1])
    return euc_dist, cos_dist, man_dist
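
`cal_dist` builds the full pairwise matrix and then reads only the entries just below the diagonal. The adjacent-track distances can also be computed directly from consecutive rows. The following is a NumPy-only sketch of that idea (the function name `adjacent_dist` is hypothetical, and this is not the code used for the results in this notebook). It returns arrays rather than lists and uses 0 for the first track, matching `cal_dist`.

```python
import numpy as np

def adjacent_dist(x):
    x = np.asarray(x, dtype=float)
    cur, prev = x[1:], x[:-1]          # each track and its predecessor
    diff = cur - prev
    euc = np.sqrt((diff ** 2).sum(axis=1))
    man = np.abs(diff).sum(axis=1)
    # cosine distance = 1 - cosine similarity
    norms = np.linalg.norm(cur, axis=1) * np.linalg.norm(prev, axis=1)
    cos = 1.0 - (cur * prev).sum(axis=1) / norms
    pad = np.zeros(1)                  # placeholder for the first track
    return (np.concatenate([pad, euc]),
            np.concatenate([pad, cos]),
            np.concatenate([pad, man]))

# Made-up 2-D "features" for three consecutive tracks.
x = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
euc, cos, man = adjacent_dist(x)
print(euc, cos, man)
```

The second and third rows point in the same direction, so their cosine distance is 0 even though their Euclidean distance is 2, which is one reason to keep more than one metric.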


In [13]:
# calculate the distance/similarity within each session
from sklearn.preprocessing import StandardScaler

# The last 20% of the rows in each session will be used as the test set, so the scaler is fitted only on the first 80%.
train_perc = 0.8

sel_col_names = ['skip_label','acousticness','beat_strength','danceability',
                        'dyn_range_mean', 'energy', 'flatness','instrumentalness', 'liveness', 
                        'loudness', 'mechanism', 'organism','speechiness','valence',
                        'acoustic_vector_0','acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
                        'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']

start_time = time.time()

for s_id in session_id:
    temp_data = []
    temp_mat = []
    temp_data = df.loc[df['session_id'] == s_id, sel_col_names]
    temp_mat = temp_data.drop(columns = ['skip_label']).copy()
    scaler = StandardScaler()
    scaler.fit(temp_mat[0:round(len(temp_mat)*train_perc)])
    temp_mat_scaled = scaler.transform(temp_mat)
    temp_mat_scaled_skip0 = temp_mat_scaled[temp_data['skip_label']==0]
    temp_mat_scaled_skip1 = temp_mat_scaled[temp_data['skip_label']==1]
    temp_mat_scaled_skipR = temp_mat_scaled[temp_data['skip_label']>1]
    
    
    euc_dist_all, cos_dist_all, man_dist_all = cal_dist(temp_mat_scaled)
    df.loc[temp_data.index, 'euc_dist_all'] = euc_dist_all
    df.loc[temp_data.index, 'cos_dist_all'] = cos_dist_all
    df.loc[temp_data.index, 'man_dist_all'] = man_dist_all
    
#     euc_dist_skip0, cos_dist_skip0, man_dist_skip0 = cal_dist(temp_mat_scaled_skip0)
#     loc_index0 = temp_data.index[temp_data['skip_label']==0].tolist()
#     df.loc[loc_index0, 'euc_dist_skip0'] = euc_dist_skip0
#     df.loc[loc_index0, 'cos_dist_skip0'] = cos_dist_skip0
#     df.loc[loc_index0, 'man_dist_skip0'] = man_dist_skip0
    
#     euc_dist_skip1, cos_dist_skip1, man_dist_skip1 = cal_dist(temp_mat_scaled_skip1)
#     loc_index1 = temp_data.index[temp_data['skip_label']==1].tolist()
#     df.loc[loc_index1, 'euc_dist_skip1'] = euc_dist_skip1
#     df.loc[loc_index1, 'cos_dist_skip1'] = cos_dist_skip1
#     df.loc[loc_index1, 'man_dist_skip1'] = man_dist_skip1
    
#     euc_dist_skipR, cos_dist_skipR, man_dist_skipR = cal_dist(temp_mat_scaled_skipR)
#     loc_indexR = temp_data.index[temp_data['skip_label']>1].tolist()
#     df.loc[loc_indexR, 'euc_dist_skipR'] = euc_dist_skipR
#     df.loc[loc_indexR, 'cos_dist_skipR'] = cos_dist_skipR
#     df.loc[loc_indexR, 'man_dist_skipR'] = man_dist_skipR

print('***It takes ',(time.time() - start_time)/60, ' minutes.***')


***It takes  1.9491318027178446  minutes.***
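
The scaling step above fits the scaler on the first 80% of each session and then transforms the whole session, so statistics from the held-out rows do not leak into the features. A NumPy sketch of the same idea on made-up numbers follows. It mimics the default behavior of `StandardScaler` (centering, then dividing by the population standard deviation); the function name `scale_with_train_stats` is hypothetical.

```python
import numpy as np

def scale_with_train_stats(mat, train_perc=0.8):
    mat = np.asarray(mat, dtype=float)
    n_train = round(len(mat) * train_perc)
    mean = mat[:n_train].mean(axis=0)
    std = mat[:n_train].std(axis=0)   # population std (ddof=0)
    std[std == 0] = 1.0               # leave constant columns unscaled
    return (mat - mean) / std

# Made-up session of five tracks with one feature; the last row is held out.
session = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
scaled = scale_with_train_stats(session)
print(scaled.ravel())
```

The held-out value ends up far outside the training range instead of pulling the mean and standard deviation toward itself.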


In [14]:
log_df_dummy2 = df.drop(columns = ['skip_label','track_id','acousticness','beat_strength','danceability',
                                     'dyn_range_mean', 'energy', 'flatness','instrumentalness', 'liveness', 
                                     'loudness', 'mechanism', 'organism','speechiness','valence',
                                     'acoustic_vector_0','acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
                                     'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7'])
log_df_dummy2['session_id_skip_label'] = log_df_dummy2['session_id_skip_label'].astype('category')

# **Use Featuretools to do automatic feature engineering**

In [15]:
# First, initialize an EntitySet with a name
es = ft.EntitySet(id="spotify_data")

In [16]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="tf",
    dataframe=tf_df_dummy,
    index="track_id",
)

es

Entityset: spotify_data
  DataFrames:
    tf [Rows: 50704, Columns: 46]
  Relationships:
    No relationships

In [17]:
es['tf'].ww.schema

Unnamed: 0_level_0,Logical Type,Semantic Tag(s)
Column,Unnamed: 1_level_1,Unnamed: 2_level_1
track_id,Unknown,['index']
duration,Double,['numeric']
release_year,Integer,['numeric']
us_popularity_estimate,Double,['numeric']
acousticness,Double,['numeric']
beat_strength,Double,['numeric']
bounciness,Double,['numeric']
danceability,Double,['numeric']
dyn_range_mean,Double,['numeric']
energy,Double,['numeric']


In [18]:
# add dataframe
# the 'session_position' contains order information within each session

es = es.add_dataframe(
    dataframe_name="log", dataframe=log_df_dummy2, make_index = True, index="event_id", time_index="session_position",
)

es

Entityset: spotify_data
  DataFrames:
    tf [Rows: 50704, Columns: 46]
    log [Rows: 167880, Columns: 35]
  Relationships:
    No relationships

In [19]:
es['log'].ww.schema

Unnamed: 0_level_0,Logical Type,Semantic Tag(s)
Column,Unnamed: 1_level_1,Unnamed: 2_level_1
event_id,Integer,['index']
session_id,Categorical,['category']
session_position,Integer,"['numeric', 'time_index']"
track_id_clean,Unknown,[]
skip_1,Boolean,[]
skip_2,Boolean,[]
skip_3,Boolean,[]
not_skipped,Boolean,[]
context_switch,Integer,['numeric']
no_pause_before_play,Integer,['numeric']


In [20]:
# When two DataFrames have a one-to-many relationship, we call the “one” DataFrame, the “parent DataFrame”. A relationship between a parent and child is defined like this:
# (parent_dataframe, parent_column, child_dataframe, child_column)
es = es.add_relationship("tf", "track_id", "log", "track_id_clean")
es

Entityset: spotify_data
  DataFrames:
    tf [Rows: 50704, Columns: 46]
    log [Rows: 167880, Columns: 35]
  Relationships:
    log.track_id_clean -> tf.track_id

In [21]:
# turn on "features_only = True" to experiment with the function without computing the feature_matrix
feature_defs_test = ft.dfs(entityset=es,
                        target_dataframe_name="log",
                        groupby_trans_primitives=["Diff","CumSum", "CumCount", "CumMean", "CumMin", "CumMax"],
                        agg_primitives=[],
                        trans_primitives=[],
                        primitive_options={"diff": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_sum": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_count": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_mean": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_min": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_max": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]}},
                        features_only = True,
                        n_jobs=-1)

feature_defs_test

[<Feature: session_id>,
 <Feature: session_position>,
 <Feature: skip_1>,
 <Feature: skip_2>,
 <Feature: skip_3>,
 <Feature: not_skipped>,
 <Feature: context_switch>,
 <Feature: no_pause_before_play>,
 <Feature: short_pause_before_play>,
 <Feature: long_pause_before_play>,
 <Feature: hist_user_behavior_is_shuffle>,
 <Feature: hour_of_day>,
 <Feature: premium>,
 <Feature: hist_user_behavior_reason_start_appload>,
 <Feature: hist_user_behavior_reason_start_backbtn>,
 <Feature: hist_user_behavior_reason_start_clickrow>,
 <Feature: hist_user_behavior_reason_start_endplay>,
 <Feature: hist_user_behavior_reason_start_fwdbtn>,
 <Feature: hist_user_behavior_reason_start_playbtn>,
 <Feature: hist_user_behavior_reason_start_remote>,
 <Feature: hist_user_behavior_reason_start_trackdone>,
 <Feature: hist_user_behavior_reason_start_trackerror>,
 <Feature: context_type_catalog>,
 <Feature: context_type_charts>,
 <Feature: context_type_editorial_playlist>,
 <Feature: context_type_personalized_playlis

In [22]:
len(feature_defs_test)

773
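
To make the group-by transform primitives more concrete, the sketch below computes rough pandas counterparts of `Diff`, `CumSum`, `CumCount`, `CumMean`, `CumMin` and `CumMax` within each session on made-up data. It illustrates what the generated features represent; it is not Featuretools' own implementation, and details such as the running count starting at 1 are assumptions here.

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2", "s2"],
    "session_position": [1, 2, 3, 1, 2],
    "loudness": [-7.0, -5.0, -9.0, -4.0, -6.0],
}).sort_values(["session_id", "session_position"])

# Every statistic restarts at the beginning of each session.
g = toy.groupby("session_id")["loudness"]
toy["diff"] = g.diff()                      # change from the previous track
toy["cum_sum"] = g.cumsum()
toy["cum_count"] = g.cumcount() + 1         # assumed 1-based running count
toy["cum_mean"] = g.transform(lambda s: s.expanding().mean())
toy["cum_min"] = g.cummin()
toy["cum_max"] = g.cummax()
print(toy)
```

The first row of each session has no predecessor, so its `diff` is missing.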

In [23]:
import warnings
def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()
#warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
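
The commented-out line hints at a narrower alternative to silencing everything: ignore only pandas' `PerformanceWarning`. Here is a small sketch of that approach, using manually raised warnings as stand-ins for the ones emitted during feature computation.

```python
import warnings
import pandas as pd

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record everything by default
    # The most recently added filter takes precedence, so only this category is dropped.
    warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
    warnings.warn("DataFrame is highly fragmented", pd.errors.PerformanceWarning)
    warnings.warn("something else worth seeing", UserWarning)

remaining = [str(w.message) for w in caught]
print(remaining)
```

Filters set this way apply to the current Python process. When the computation is spread over several workers, the workers may still print their own warnings.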

In [24]:
# check this for specifying groupby: https://featuretools.alteryx.com/en/stable/guides/specifying_primitive_options.html
# 'session_id' is specified as a GroupBy option because we care about what happened in each session.
# The tf (track information) dataframe does not contain any order or session information, so it does not need to be specified as a GroupBy option.

#warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
# warnings.simplefilter("ignore")


start_time = time.time()

feature_matrix, feature_defs = ft.dfs(entityset=es,
                        target_dataframe_name="log",
                        groupby_trans_primitives=["Diff","CumSum", "CumCount", "CumMean", "CumMin", "CumMax"],
                        agg_primitives=[],
                        trans_primitives=[],
                        primitive_options={"diff": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_sum": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_count": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_mean": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_min": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]},
                                        "cum_max": {"include_groupby_columns": {"log": ["session_id","session_id_skip_label"]},"ignore_groupby_dataframes": ["tf"]}},
                        features_only = False,
                        n_jobs=-1)

print('***It takes ',(time.time() - start_time)/60, ' minutes.***')

# save feature matrix
feature_matrix.to_csv('../data/processed/feature_matrix_skipLabel_distance.csv')


EntitySet scattered to 8 workers in 7 seconds


PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  frame[name] = f.default_value

***It takes  34.22096426884333  minutes.***


In [None]:
feature_matrix

In [None]:
feature_defs

In [30]:
print(feature_defs[350])
ft.describe_feature(feature_defs[350])

<Feature: CUM_MAX(tf.loudness) by session_id>


'The cumulative maximum of the "loudness" for the instance of "tf" associated with this instance of "log" for each "session_id".'
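
As a rough illustration of what this description means, the same quantity can be sketched in plain pandas as a per-session running maximum. The toy frame and its values are made up, and this is only an analogue of the feature, not the Featuretools implementation.

```python
import pandas as pd

# Hypothetical toy log: two sessions with made-up loudness values, in play order.
toy = pd.DataFrame({
    "session_id": ["a", "a", "a", "b", "b"],
    "loudness": [-10.0, -5.0, -7.0, -3.0, -8.0],
})

# Running maximum of loudness, restarted at the start of every session.
toy["cum_max_loudness"] = toy.groupby("session_id")["loudness"].cummax()
print(toy)
```

Within session `a` the value stays at -5.0 once the louder second track has been seen; session `b` starts over from its own first track.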

In [32]:
feature_matrix.ww.schema

Column,Logical Type,Semantic Tag(s)
session_id,Categorical,['category']
session_position,Integer,['numeric']
skip_1,Boolean,[]
skip_2,Boolean,[]
skip_3,Boolean,[]
not_skipped,Boolean,[]
context_switch,Integer,['numeric']
no_pause_before_play,Integer,['numeric']
short_pause_before_play,Integer,['numeric']
long_pause_before_play,Integer,['numeric']


# Check the results of Featuretools

In [2]:
# load the feature matrix to check whether it was properly saved.

feature_matrix2 = pd.read_csv('../data/processed/feature_matrix_skip.csv')
feature_matrix2.head().T

Unnamed: 0,0,1,2,3,4
event_id,0,20,40,60,80
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_0000a72b-09ac-412f-b452-9b9e79bded8f,0_00010fc5-b79e-4cdf-bc4c-f140d0f99a3a,0_00016a3d-9076-4f67-918f-f29e3ce160dc,0_00018b58-deb8-4f98-ac5e-d7e01b346130
session_position,1,1,1,1,1
skip_1,False,False,False,True,False
skip_2,False,True,False,True,False
...,...,...,...,...,...
DIFF(tf.time_signature_5) by session_id_not_skipped,,,0.0,0.0,0.0
DIFF(tf.us_popularity_estimate) by session_id,,,,,
DIFF(tf.us_popularity_estimate) by session_id_not_skipped,,,1.191625,-0.068586,-0.358257
DIFF(tf.valence) by session_id,,,,,


In [3]:
# inspect the data of a session

a = feature_matrix2[feature_matrix2['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']

a

Unnamed: 0,event_id,session_id,session_position,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,short_pause_before_play,...,DIFF(tf.time_signature_3) by session_id,DIFF(tf.time_signature_3) by session_id_not_skipped,DIFF(tf.time_signature_4) by session_id,DIFF(tf.time_signature_4) by session_id_not_skipped,DIFF(tf.time_signature_5) by session_id,DIFF(tf.time_signature_5) by session_id_not_skipped,DIFF(tf.us_popularity_estimate) by session_id,DIFF(tf.us_popularity_estimate) by session_id_not_skipped,DIFF(tf.valence) by session_id,DIFF(tf.valence) by session_id_not_skipped
0,0,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1,False,False,False,True,0,0,0,...,,,,,,,,,,
10000,1,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2,False,False,False,True,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.071405,-0.071405,0.184898,0.184898
20000,2,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3,False,False,False,True,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.103248,0.103248,0.03671,0.03671
30000,3,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4,False,False,False,True,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.004938,-0.004938,0.275558,0.275558
40000,4,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5,False,False,False,True,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00346,0.00346,0.003501,0.003501
50000,5,0_00006f66-33e5-4de7-a324-2d18e439fc1e,6,False,False,True,False,0,1,0,...,0.0,-1.0,-1.0,0.0,1.0,1.0,-0.000896,0.022073,0.008161,0.249168
60000,6,0_00006f66-33e5-4de7-a324-2d18e439fc1e,7,True,True,True,False,0,1,0,...,0.0,0.0,1.0,0.0,-1.0,0.0,-0.136035,-0.133144,-0.551661,-0.648615
70000,7,0_00006f66-33e5-4de7-a324-2d18e439fc1e,8,True,True,True,False,0,1,0,...,0.0,-1.0,0.0,1.0,0.0,0.0,-2.617748,-2.752594,0.280493,0.212967
80000,8,0_00006f66-33e5-4de7-a324-2d18e439fc1e,9,False,True,True,False,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.754632,-0.000881,-0.051592,-0.173766
90000,9,0_00006f66-33e5-4de7-a324-2d18e439fc1e,10,True,True,True,False,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.041017,-0.040328,-0.080649,-0.347373


In [4]:
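# Check whether the Featuretools-generated features respect session boundaries and row order.
# For the 'by session_id' features they do. For the 'by session_id_not_skipped' features, rows whose
# key is NaN (skipped tracks) appear to be accumulated across sessions (e.g. 4759.09 at position 6),
# so those values are masked in the cells that follow.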

a[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_not_skipped','session_id_not_skipped','session_position']]

Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_not_skipped,session_id_not_skipped,session_position
0,0.069717,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1
10000,0.061158,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2
20000,0.045354,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3
30000,0.229936,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4
40000,0.24098,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5
50000,0.133586,0.78073,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4759.094724,,6
60000,0.409848,1.190578,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5784.135332,,7
70000,0.103687,1.294266,0_00006f66-33e5-4de7-a324-2d18e439fc1e,6768.998141,,8
80000,0.049853,1.344119,0_00006f66-33e5-4de7-a324-2d18e439fc1e,7727.900896,,9
90000,0.154609,1.498729,0_00006f66-33e5-4de7-a324-2d18e439fc1e,8695.171707,,10


In [5]:
sel_cols = [col for col in feature_matrix2.columns if 'by session_id_not_skipped' in col]
sel_cols

['CUM_COUNT(session_id) by session_id_not_skipped',
 'CUM_COUNT(session_id_not_skipped) by session_id_not_skipped',
 'CUM_COUNT(track_id_clean) by session_id_not_skipped',
 'CUM_MAX(context_switch) by session_id_not_skipped',
 'CUM_MAX(context_type_catalog) by session_id_not_skipped',
 'CUM_MAX(context_type_charts) by session_id_not_skipped',
 'CUM_MAX(context_type_editorial_playlist) by session_id_not_skipped',
 'CUM_MAX(context_type_personalized_playlist) by session_id_not_skipped',
 'CUM_MAX(context_type_radio) by session_id_not_skipped',
 'CUM_MAX(context_type_user_collection) by session_id_not_skipped',
 'CUM_MAX(hist_user_behavior_reason_start_appload) by session_id_not_skipped',
 'CUM_MAX(hist_user_behavior_reason_start_backbtn) by session_id_not_skipped',
 'CUM_MAX(hist_user_behavior_reason_start_clickrow) by session_id_not_skipped',
 'CUM_MAX(hist_user_behavior_reason_start_endplay) by session_id_not_skipped',
 'CUM_MAX(hist_user_behavior_reason_start_fwdbtn) by session_id_not

In [6]:
feature_matrix2['session_id_not_skipped'].isna()

0         False
1          True
2          True
3          True
4          True
          ...  
167875    False
167876     True
167877     True
167878     True
167879    False
Name: session_id_not_skipped, Length: 167880, dtype: bool

In [7]:
feature_matrix2.loc[feature_matrix2['session_id_not_skipped'].isna(),sel_cols] = np.nan
b = feature_matrix2[feature_matrix2['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']
b[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_not_skipped','session_id_not_skipped','session_position']]

Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_not_skipped,session_id_not_skipped,session_position
0,0.069717,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1
10000,0.061158,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2
20000,0.045354,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3
30000,0.229936,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4
40000,0.24098,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5
50000,0.133586,0.78073,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,6
60000,0.409848,1.190578,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,7
70000,0.103687,1.294266,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,8
80000,0.049853,1.344119,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,9
90000,0.154609,1.498729,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,,10
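
The masking step can be sketched on toy data. In this minimal example the pooling of missing keys is only simulated with a sentinel string (an assumption for illustration; how Featuretools handles NaN group keys is not confirmed here), and the column names `value` and `cum_value` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy log: two sessions, each with one played and one skipped track.
toy = pd.DataFrame({
    "session_id": ["a", "a", "b", "b"],
    "not_skipped": [True, False, True, False],
    "value": [1.0, 10.0, 2.0, 20.0],
})

# Group key: the session ID for played tracks, NaN for skipped ones.
toy["session_id_not_skipped"] = toy["session_id"].where(toy["not_skipped"])

# Simulate all missing keys landing in one shared group (assumption).
pooled_key = toy["session_id_not_skipped"].fillna("__missing__")
toy["cum_value"] = toy.groupby(pooled_key)["value"].cumsum()
pooled = toy["cum_value"].tolist()  # row 3 mixes sessions a and b: 10 + 20

# Mask the rows whose key is missing, as in the cell above.
toy.loc[toy["session_id_not_skipped"].isna(), "cum_value"] = np.nan
print(toy)
```

After masking, only rows with a valid key keep their cumulative value; the cross-session sums are replaced by NaN.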


In [9]:
# Work on a copy sorted by session and position so that shifting follows play order.
feature_matrix3 = feature_matrix2.sort_values(by=['session_id', 'session_position']).copy()
start_time = time.time()
for s_id in pd.unique(feature_matrix3['session_id']):
    mask = feature_matrix3['session_id'] == s_id
    temp = feature_matrix3[mask].copy()  # explicit copy avoids SettingWithCopyWarning
    # Shift the not-skipped history down one row within the session, so each row
    # only sees outcomes up to the previous track. Without the shift, the current
    # row's not_skipped label would leak into its own features.
    temp[sel_cols] = temp[sel_cols].shift(periods=1, axis=0)
    # Forward-fill remaining NaNs within the session (all columns), so skipped rows
    # carry the last known history.
    temp = temp.ffill(axis=0)
    feature_matrix3[mask] = temp

print('***It takes ', time.time() - start_time, ' secs.***')


c = feature_matrix3[feature_matrix3['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']
c[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_not_skipped','session_id_not_skipped','not_skipped','session_position']]

***It takes  1416.1911041736603  secs.***


Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_not_skipped,session_id_not_skipped,not_skipped,session_position
0,0.069717,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,1
10000,0.061158,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.069717,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,2
20000,0.045354,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.130874,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,3
30000,0.229936,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.176229,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,4
40000,0.24098,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.406164,0_00006f66-33e5-4de7-a324-2d18e439fc1e,True,5
50000,0.133586,0.78073,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,6
60000,0.409848,1.190578,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,7
70000,0.103687,1.294266,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,8
80000,0.049853,1.344119,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,9
90000,0.154609,1.498729,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0.647144,0_00006f66-33e5-4de7-a324-2d18e439fc1e,False,10
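
The per-session loop is slow because it filters the full frame once per session. A vectorized alternative can be sketched with `groupby`, shown here on toy data. It is not the code that produced the output above, `hist_feature` is a hypothetical stand-in for the columns in `sel_cols`, and unlike the loop it forward-fills only the selected columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for the feature matrix: two sessions, one history feature.
toy = pd.DataFrame({
    "session_id": ["a", "a", "a", "b", "b"],
    "session_position": [1, 2, 3, 1, 2],
    "hist_feature": [0.1, np.nan, 0.4, np.nan, 0.7],
})
hist_cols = ["hist_feature"]  # plays the role of sel_cols

toy = toy.sort_values(["session_id", "session_position"]).reset_index(drop=True)

# Shift each session's history columns down by one row ...
shifted = toy.groupby("session_id")[hist_cols].shift(1)
# ... then forward-fill the gaps, again without crossing session boundaries.
shifted = shifted.groupby(toy["session_id"]).ffill()
toy[hist_cols] = shifted
print(toy)
```

The first row of every session stays NaN because no history exists yet, and values never carry over from one session to the next.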


In [10]:
d = feature_matrix3[feature_matrix3['session_id'] == '0_0000a72b-09ac-412f-b452-9b9e79bded8f']
d[['tf.speechiness','CUM_SUM(tf.speechiness) by session_id','session_id','CUM_SUM(tf.speechiness) by session_id_not_skipped','session_id_not_skipped','not_skipped','session_position']]


Unnamed: 0,tf.speechiness,CUM_SUM(tf.speechiness) by session_id,session_id,CUM_SUM(tf.speechiness) by session_id_not_skipped,session_id_not_skipped,not_skipped,session_position
1,0.192012,0.192012,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,1
10001,0.068105,0.260117,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,2
20001,0.075064,0.335181,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,3
30001,0.045026,0.380207,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,4
40001,0.12405,0.504257,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,5
50001,0.239021,0.743278,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,6
60001,0.039155,0.782433,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,7
70001,0.041675,0.824108,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,8
80001,0.134822,0.958929,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,9
90001,0.140254,1.099183,0_0000a72b-09ac-412f-b452-9b9e79bded8f,,,False,10


In [12]:
feature_matrix3.to_csv('../data/processed/feature_matrix_skip_processed.csv')

# Feature scaling (standardization) and the train/test split will be performed at the model-fitting stage.
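
One way that later step might look is sketched below, under the assumption that whole sessions are held out together and that scaling statistics come from the training rows only. The toy data, the 20% test share, and the names `feature` and `feature_z` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: five sessions with two rows each.
toy = pd.DataFrame({
    "session_id": np.repeat(["s1", "s2", "s3", "s4", "s5"], 2),
    "feature": np.arange(10, dtype=float),
})

# Split by session so that rows from one session never land on both sides.
rng = np.random.default_rng(0)
sessions = rng.permutation(toy["session_id"].unique().tolist())
n_test = max(1, int(0.2 * len(sessions)))  # assumed 20% test share
test_sessions = sessions[:n_test].tolist()

is_test = toy["session_id"].isin(test_sessions)
train, test = toy[~is_test].copy(), toy[is_test].copy()

# Standardize with statistics estimated on the training rows only.
mean, std = train["feature"].mean(), train["feature"].std(ddof=0)
train["feature_z"] = (train["feature"] - mean) / std
test["feature_z"] = (test["feature"] - mean) / std
print(len(train), len(test))
```

Fitting the scaler on the training rows alone keeps information about the held-out sessions from leaking into the features.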
