![spotify_logo](../assets/Spotify_Logo_CMYK_Green.png)

# Spotify Skip Prediction: Introduction
Notebook: 1 of 7
 
Author: Alex Thach - alcthach@gmail.com  
BrainStation Data Science Capstone Project - Winter 2022  
April 4, 2022

---

### About the Project
The goal of this project is to predict whether a user will skip a given track on Spotify, a music streaming platform. Track skipping is often overlooked as a research topic on music streaming platforms, overshadowed by research on recommendation systems. However, uncovering trends in track skipping may allow for new approaches to music curation for the listener, such as ad-hoc or on-the-fly playlist curation. In essence, understanding the probability that a user will skip a certain track may also clue us in as to whether we want to include that song in the user's playlist. Although Spotify markets unlimited skips as a feature for premium users, it appears to rank as the least important of the premium features, which might suggest there is room to develop a feature that lets the platform account for users' transient listening moods.


---

### Data Acquisition/Summary

![image.png](../assets/aicrowd_logo_sm.png)
- Data was sourced from AICrowd, a data science competition platform
- Specifically from the [Spotify Sequential Skip Prediction Challenge](https://www.aicrowd.com/challenges/spotify-sequential-skip-prediction-challenge)

---

Below you will find Data Dictionaries for the Session Logs, Track Features, and a mini-dictionary for `'context_type'`, a field found in the Session Logs dataset.

#### Listening Session Logs Dataset - Schema

|Field|Values|
|-----|------|
|session_id| E.g. 65_283174c5-551c-4c1b-954b-cb60ffcc2aec - unique identifier for the session that this row is a part of|
|session_position| {1-20} - position of row within session|
|session_length| {10-20} - number of rows in session|
|track_id_clean|E.g. t_13d34e4b-dc9b-4535-963d-419afa8332ec - unique identifier for the track played. This is linked with track_id in the track features and metadata table.|
|skip_1| Boolean indicating if the track was only played very briefly|
|skip_2|Boolean indicating if the track was only played briefly|
|skip_3| Boolean indicating if most of the track was played|
|not_skipped| Boolean indicating that the track was played in its entirety|
|context_switch| Boolean indicating if the user changed context between the previous row and the current row. This could for example occur if the user switched from one playlist to another.|
|no_pause_before_play|Boolean indicating if there was no pause between playback of the previous track and this track|
|short_pause_before_play| Boolean indicating if there was a short pause between playback of the previous track and this track|
|long_pause_before_play| Boolean indicating if there was a long pause between playback of the previous track and this track|
|hist_user_behavior_n_seekfwd| Number of times the user did a seek forward within track|
|hist_user_behavior_n_seekback| Number of times the user did a seek back within track|
|hist_user_behavior_is_shuffle| Boolean indicating if the user encountered this track while shuffle mode was activated|
|hour_of_day| {0-23} - The hour of day|
|date| E.g. 2018-09-18 - The date|
|premium| Boolean indicating if the user was on premium or not. This has potential implications for skipping behavior.|
|context_type| E.g. editorial playlist - what type of context the playback occurred within|
|hist_user_behavior_reason_start| E.g. fwdbtn - the user action which led to the current track being played|
|hist_user_behavior_reason_end| E.g. trackdone - the user action which led to the current track playback ending|

---

#### Track Features Dataset - Schema

|Field|Values|
|-----|------|
|track_id|E.g. t_13d34e4b-dc9b-4535-963d-419afa8332ec - unique identifier for the track played. This is linked with track_id_clean in the session logs|
|duration|Length of track in seconds|
|release_year|Estimate of the year the track was released|
|us_popularity_estimate|Estimate of the US popularity percentile of the track as of October 12, 2018|
|acousticness|A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.|
|beat_strength|\*This measure is undocumented. It ranges between 0 and 1. The audio feature likely describes the rhythmic and perceptual characteristics contributing to the perceived clarity of the beat in a track.|
|bounciness|\*This measure is undocumented. It ranges between 0 and 1. This audio feature is likely a compound measure based on perceptual and acoustical factors contributing to a sense of ‘bounce’ in the track, likely involving calculating attack slopes and general length of musical onsets.|
|danceability|Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.|
|dyn_range_mean|This measure is not specified. It is a positive value and, in all likelihood, reflects the dynamic range in a track, given in decibels (dB). We assume this value measures the mean dynamic range in the track, as a measure of the range in amplitude within a given window of samples.|
|energy|Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.|
|flatness|\*This measure is undocumented. It ranges between 0 and 1. This audio feature is likely a compound measure based on perceptual and acoustical factors contributing to a sense of ‘flatness’ in the track, likely involving a low density of events and few auditory transients.|
|instrumentalness|Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.|
|key|The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. Values range from -1 to 11.|
|loudness|The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.|
|mechanism|\*This measure is undocumented. It ranges between 0 and 1. This audio feature is likely a compound measure of multiple perceptual features and other features, indicating if a track is perceived as ‘mechanic’, as opposed to ‘organic’ (see organism).|
|mode|Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.|
|organism|\*This measure is undocumented. It ranges between 0 and 1. This audio feature is likely a compound measure of multiple perceptual features and other audio features, indicating if a track is perceived as ‘organic’, in the sense that tempo and acousticness may exhibit a higher variability and dynamic range may be high.|
|speechiness|Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.|
|tempo|The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.|
|time_signature|An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4".|
|valence|A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).|
|acoustic_vector_0|Spotify provides an 8-dimensional acoustic vector (components 0 to 7) representing each track. See http://benanne.github.io/2014/08/05/spotify-cnns.html and http://papers.nips.cc/paper/5004-deep-content-based|
|acoustic_vector_1|See acoustic_vector_0|
|acoustic_vector_2|See acoustic_vector_0|
|acoustic_vector_3|See acoustic_vector_0|
|acoustic_vector_4|See acoustic_vector_0|
|acoustic_vector_5|See acoustic_vector_0|
|acoustic_vector_6|See acoustic_vector_0|
|acoustic_vector_7|See acoustic_vector_0|

Source: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features  
\* Undocumented features not found on Spotify Developer website above can be found at: [Diurnal fluctuations in musical preference (Heggli, 2021)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8580447)

---

#### Mini-Dictionary for 'context_type'

|`'context_type'`|Description|
|------------|-----------|
|charts|songs from top charts|
|personalized_playlist|playlists created by other users|
|radio|songs accessed through Spotify’s radio feature|
|editorial_playlist|playlists that are created by Spotify, to promote certain music|
|catalog|all songs on Spotify|
|user_collection|songs that are from the user’s personal song collection|

Source: [https://spotify.humspace.ucla.edu/insights/](https://spotify.humspace.ucla.edu/insights/)

---

In [1]:
# Runs setup script, imports, plotting settings, reads in raw data
%run -i "../scripts/at-setup.py" 

Dataframes in the global name space now include:
session_logs_df
track_features_df


#### Characteristics of the Datasets
In the section below I will highlight the general characteristics of the datasets.

In [2]:
# Importing a helper function to check for missing values, duplicates, and dimensions of each dataset
from helper_functions import data_summary

#### Session Logs

##### A First Glance

In [3]:
# Checks for nan values, duplicates and dimensions for 'session_logs_df'
data_summary(session_logs_df)

Missing values: 0
Duplicate values: 0
Rows: 167880
Columns: 21


Based on the cell output above,
- There are no missing values
- There are no duplicate values
- There are 167,880 rows and 21 columns
---
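
The implementation of `data_summary` is not shown in this notebook. As a rough illustration only, a helper that prints the same four figures could look like the sketch below. The name `data_summary_sketch`, its return value, and the toy DataFrame are all hypothetical, and the real helper may count missing values or duplicates differently.

```python
import pandas as pd


def data_summary_sketch(df: pd.DataFrame) -> dict:
    """Print and return basic quality checks for a DataFrame.

    Hypothetical stand-in for the project's `data_summary` helper,
    whose actual implementation is not shown in this notebook.
    """
    summary = {
        "missing": int(df.isna().sum().sum()),      # total number of NaN cells
        "duplicates": int(df.duplicated().sum()),   # fully duplicated rows
        "rows": df.shape[0],
        "columns": df.shape[1],
    }
    print(f"Missing values: {summary['missing']}")
    print(f"Duplicate values: {summary['duplicates']}")
    print(f"Rows: {summary['rows']}")
    print(f"Columns: {summary['columns']}")
    return summary


# Toy frame with one missing value and one duplicated row
toy_df = pd.DataFrame({"a": [1, 1, 2, None], "b": ["x", "x", "y", "z"]})
toy_summary = data_summary_sketch(toy_df)
```

Counting duplicates with `DataFrame.duplicated()` treats a row as a duplicate only when every column matches an earlier row, which is one reasonable reading of "Duplicate values" in the output above.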

In [4]:
# Gets summary of descriptive statistics of 'session_logs' dataset
session_logs_df.describe().T

| |count|mean|std|min|25%|50%|75%|max|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|session_position|167880.0|9.325911|5.457638|1.0|5.0|9.0|14.0|20.0|
|session_length|167880.0|17.651823|3.422025|10.0|15.0|20.0|20.0|20.0|
|context_switch|167880.0|0.040904|0.198069|0.0|0.0|0.0|0.0|1.0|
|no_pause_before_play|167880.0|0.767602|0.422363|0.0|1.0|1.0|1.0|1.0|
|short_pause_before_play|167880.0|0.146635|0.353742|0.0|0.0|0.0|0.0|1.0|
|long_pause_before_play|167880.0|0.172832|0.378103|0.0|0.0|0.0|0.0|1.0|
|hist_user_behavior_n_seekfwd|167880.0|0.038909|0.367295|0.0|0.0|0.0|0.0|60.0|
|hist_user_behavior_n_seekback|167880.0|0.046259|0.606558|0.0|0.0|0.0|0.0|151.0|
|hour_of_day|167880.0|14.193084|5.996243|0.0|11.0|15.0|19.0|23.0|


Something to consider is that the features vary in scale. For example, `'context_switch'` ranges from 0.0 to 1.0, while `'session_position'` ranges from 1.0 to 20.0. This is worth keeping in mind when employing machine learning methods that are sensitive to scale, such as Logistic Regression models.
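
One common way to put features on a comparable scale is standardization. The sketch below is not the approach chosen in this project (that decision comes later); it only illustrates the idea on a toy DataFrame, `toy_logs`, whose values are made up to mimic the two columns mentioned above.

```python
import pandas as pd

# Toy stand-in for two session-log columns that live on different scales
toy_logs = pd.DataFrame({
    "context_switch": [0, 0, 1, 0],
    "session_position": [1, 5, 10, 20],
})

# Standardize each column: subtract its mean and divide by its
# population standard deviation (ddof=0)
scaled = (toy_logs - toy_logs.mean()) / toy_logs.std(ddof=0)

print(scaled.round(3))
```

After this transformation every column has mean 0 and standard deviation 1. In a modeling workflow the mean and standard deviation would normally be computed on the training split only and then reused for the test split.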

---

In [5]:
# Gets the datatypes within 'session_logs' dataset
session_logs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167880 entries, 0 to 167879
Data columns (total 21 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   session_id                       167880 non-null  object
 1   session_position                 167880 non-null  int64 
 2   session_length                   167880 non-null  int64 
 3   track_id_clean                   167880 non-null  object
 4   skip_1                           167880 non-null  bool  
 5   skip_2                           167880 non-null  bool  
 6   skip_3                           167880 non-null  bool  
 7   not_skipped                      167880 non-null  bool  
 8   context_switch                   167880 non-null  int64 
 9   no_pause_before_play             167880 non-null  int64 
 10  short_pause_before_play          167880 non-null  int64 
 11  long_pause_before_play           167880 non-null  int64 
 12  hist_user_behavi

Briefly scanning through the cell output above, the session logs dataset contains a combination of boolean, integer and object-type columns. Of the object-type columns, `'session_id'` and `'track_id_clean'` appear to serve as identifiers: `'session_id'` identifies individual listening sessions, and `'track_id_clean'` identifies the song that's being presented to the listener. Another object-type column in the dataset is `'date'`, which likely contains the date of the session.
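
Because `'date'` is stored as an object (string) column, it may be useful to parse it into a proper datetime type before deriving calendar features. The sketch below assumes the `YYYY-MM-DD` format seen in the data dictionary and uses a toy DataFrame; the `day_of_week` column is a hypothetical example of a derived feature, not something present in the dataset.

```python
import pandas as pd

# Toy stand-in for the 'date' column, using the format shown in the schema
toy_logs = pd.DataFrame({"date": ["2018-07-14", "2018-07-15", "2018-09-18"]})

# Parse the strings into datetime64 values
toy_logs["date"] = pd.to_datetime(toy_logs["date"], format="%Y-%m-%d")

# Example derived feature: day of week, where Monday=0 and Sunday=6
toy_logs["day_of_week"] = toy_logs["date"].dt.dayofweek

print(toy_logs.dtypes)
```

Passing an explicit `format` makes parsing stricter, so unexpected date strings raise an error instead of being silently misread.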

---

##### Taking a Look at some Observations within the `'session_logs_df'`
*Specifically, the first 5 rows, last 5 rows, and 5 rows selected at random.*

In [6]:
# Displays first 5 rows of 'session_logs_df'
session_logs_df.head()

Unnamed: 0,session_id,session_position,session_length,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,...,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end
0,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1,20,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,False,False,False,True,0,0,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
1,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2,20,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
2,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3,20,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
3,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4,20,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
4,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5,20,t_64f3743c-f624-46bb-a579-0f3f9a07a123,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone


In [7]:
# Displays last 5 rows of 'session_logs_df'
session_logs_df.tail()

Unnamed: 0,session_id,session_position,session_length,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,...,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end
167875,0_0eaeef5d-25e9-4429-bd55-af15d3604c9f,16,20,t_360910e8-2a84-42b0-baf1-59abcf96a1f2,False,False,False,True,0,1,...,0,0,0,False,13,2018-07-15,True,user_collection,trackdone,trackdone
167876,0_0eaeef5d-25e9-4429-bd55-af15d3604c9f,17,20,t_aa2fff77-9b0a-4fa3-a685-ecef50310e8a,False,False,False,True,0,1,...,0,0,0,False,13,2018-07-15,True,user_collection,trackdone,trackdone
167877,0_0eaeef5d-25e9-4429-bd55-af15d3604c9f,18,20,t_f673e1b7-4ebe-4fc1-ac24-a9f25de70381,False,False,False,True,0,1,...,0,0,0,False,13,2018-07-15,True,user_collection,trackdone,trackdone
167878,0_0eaeef5d-25e9-4429-bd55-af15d3604c9f,19,20,t_e172e8e7-7161-42a9-acb0-d606346c8f87,False,False,False,True,0,1,...,0,0,0,False,13,2018-07-15,True,user_collection,trackdone,trackdone
167879,0_0eaeef5d-25e9-4429-bd55-af15d3604c9f,20,20,t_77977dd6-597e-4425-8f8f-4efb32ecfba6,False,False,False,True,0,1,...,0,0,0,False,13,2018-07-15,True,user_collection,trackdone,trackdone


In [8]:
# Gets 5 random observations from 'session_logs_df'
session_logs_df.sample(5)

Unnamed: 0,session_id,session_position,session_length,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,...,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end
54565,0_04aa9994-b4af-45bc-b905-047228f5b89b,7,20,t_2137818b-341f-4f46-aee2-d21db39efacb,True,True,True,False,0,1,...,0,0,0,True,22,2018-07-14,True,catalog,clickrow,endplay
141890,0_0c740d96-a951-4636-b587-ffa738db4bba,7,20,t_1dba18ad-1ddb-4e97-983b-4af57e18d84a,True,True,True,False,0,1,...,0,0,0,False,22,2018-07-14,True,editorial_playlist,fwdbtn,fwdbtn
24207,0_0217e6f9-2135-4243-9df8-90ed06fd502c,2,20,t_0788d50a-2341-4bfa-be2c-381e0e3cf172,False,False,True,False,0,0,...,1,0,0,False,17,2018-07-14,True,radio,trackdone,fwdbtn
43078,0_03b6f4c1-5d26-4f98-ae0d-ba7e77ee89f4,3,15,t_1767cb03-7e2c-4262-aa40-924dbe14eb96,False,False,False,True,0,1,...,0,0,0,False,18,2018-07-15,True,radio,fwdbtn,trackdone
146768,0_0cd78e4d-4c86-4b24-8780-e2ae221ac2a9,11,20,t_e9901b0e-dc9d-429c-a8a9-dde6ab877d54,True,True,True,False,0,1,...,0,0,0,True,15,2018-07-15,False,catalog,fwdbtn,fwdbtn


Based on the 3 cell outputs above, I don't suspect anything strange with the dataset.

---

##### Interpreting a Single Observation in `'session_logs_df'`

In [9]:
# Gets a single observation from 'session_logs_df'
session_logs_df.sample()

Unnamed: 0,session_id,session_position,session_length,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,...,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end
151695,0_0d45b4a2-01d3-41b7-8f69-74659a1cf887,4,20,t_1dba18ad-1ddb-4e97-983b-4af57e18d84a,False,False,False,True,0,1,...,0,0,0,False,17,2018-07-15,True,charts,trackdone,trackdone


Comments on the Observation Above:  
*Note: `sample()` draws a different random row each time the cell is run, so the row displayed above may not match the one described here.*
- A single row in `'session_logs_df'` appears to represent one song being played in a particular listening session
- We know which session this observation belongs to, i.e. `'0_094a5ac3-a16f-4620-924a-f6456ffa7d55'`
- This was the 15th track played in a session with a total length of 20 tracks
- We also know which song was played; however, it should be noted that the `'track_id_clean'` is internal to the dataset, i.e. we have no way of figuring out the name and artist of the song
- In addition, we can see that the song was skipped: both `'skip_2'` and `'skip_3'` have values of `True`. The latter indicates that most of the track was played before skipping, and it also seems to include the condition of `'skip_2'`, which indicates that the track was played only briefly
- We go on to see that the song was played on 2018-07-15 during the 18:00 hour
- It also shows that the listener was a premium Spotify user
- Lastly, it shows how the user arrived at the current track and how the track ended; in this case they arrived at the song using the forward button, and the track ended when they pressed the forward button again

Based on the observations above, it would appear that the session logs dataset contains comprehensive information about how the user interacted with each track in the listening session, and the context in which the song was played. For example, were they a premium user? What kind of playlist did the song belong to? Etc.
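
The comments above suggest that the skip flags are nested, i.e. a track flagged by `'skip_2'` also seems to be flagged by `'skip_3'`. A quick way to test that impression is to check for rows where an earlier flag is `True` while the next one is `False`. The sketch below runs on a toy DataFrame whose rows copy flag patterns visible in the outputs above; whether the nesting holds across the full dataset is an assumption that would need to be checked on `session_logs_df` itself.

```python
import pandas as pd

# Toy rows mirroring skip-flag patterns seen in the displayed observations
toy_logs = pd.DataFrame({
    "skip_1": [True, False, False, False],
    "skip_2": [True, True, False, False],
    "skip_3": [True, True, True, False],
    "not_skipped": [False, False, False, True],
})

# "A implies B" holds when A is never True while B is False
skip_1_implies_2 = (~toy_logs["skip_1"] | toy_logs["skip_2"]).all()
skip_2_implies_3 = (~toy_logs["skip_2"] | toy_logs["skip_3"]).all()

print("skip_1 implies skip_2:", skip_1_implies_2)
print("skip_2 implies skip_3:", skip_2_implies_3)
```

If both checks hold on the real data, the three skip columns can be read as increasingly lenient definitions of a skip, which matters when choosing a single target variable for modeling.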

---

Before I move on to the track features dataset, I will pull the entire listening session that the example observation belongs to, just to get an understanding of how an example listening session might look.

In [10]:
# Gets entire listening session from the example above
session_logs_df[session_logs_df['session_id'] == '0_094a5ac3-a16f-4620-924a-f6456ffa7d55']

Unnamed: 0,session_id,session_position,session_length,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,...,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end
106531,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,1,20,t_a3d09175-9e2c-4295-802c-dcc5d153e482,True,True,True,False,0,0,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn
106532,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,2,20,t_98db3bc2-e559-4f99-9e2b-28804832947d,True,True,True,False,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn
106533,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,3,20,t_8246d51e-3575-4bff-ba6c-d1fbd3cc0f12,True,True,True,False,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn
106534,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,4,20,t_59fe8388-f99a-4bcb-9295-3c977c2ecc37,False,True,True,False,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn
106535,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,5,20,t_d84861e6-4418-4222-8eb5-9aa66785a099,False,False,False,True,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,trackdone
106536,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,6,20,t_4ca96b01-dc8d-4ffa-9669-aca1110bdca0,True,True,True,False,0,0,...,1,0,0,True,18,2018-07-15,True,user_collection,trackdone,fwdbtn
106537,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,7,20,t_4e244ce1-1893-433d-9d29-968db45ada4f,True,True,True,False,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn
106538,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,8,20,t_47afc5f9-448c-48fd-9656-e8d7332b04e2,True,True,True,False,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn
106539,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,9,20,t_c9a1736b-befb-4f1c-8842-75c4a7969000,True,True,True,False,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn
106540,0_094a5ac3-a16f-4620-924a-f6456ffa7d55,10,20,t_9495e5a3-780d-49f7-9c32-3e4eda30aec5,False,False,True,False,0,1,...,0,0,0,True,18,2018-07-15,True,user_collection,fwdbtn,fwdbtn


Now you can see what an example listening session might look like for a user! It's a lot to digest, and I'll admit the tabular data above is not fun to look at. However, the same observations made for a single row can be extended to the whole listening session, which makes it possible to analyze patterns within a user's listening session. I will revisit this during the Exploratory Data Analysis portion of this project. For now I'm just gathering an initial impression of the data, so I will move on to the track features dataset.

#### Track Features Dataset

##### A First Glance

In [11]:
# Checks for nan values, duplicates and dimensions for 'track_features_df'
data_summary(track_features_df)

Missing values: 0
Duplicate values: 0
Rows: 50704
Columns: 30


Based on the cell output above,
- There are no missing values
- There are no duplicate values
- There are 50,704 rows and 30 columns
---

In [12]:
# Gets summary of descriptive statistics of 'track_features_df' dataset
track_features_df.describe().T

| |count|mean|std|min|25%|50%|75%|max|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|duration|50704.0|222.26798|72.224839|30.01333|183.9333|214.866669|250.426666|1787.760986|
|release_year|50704.0|2010.324748|11.471866|1950.0|2009.0|2015.0|2017.0|2018.0|
|us_popularity_estimate|50704.0|99.455131|1.139222|90.0189|99.50967|99.865444|99.961131|99.999997|
|acousticness|50704.0|0.250336|0.276047|0.0|0.02529069|0.135821|0.406142|0.995796|
|beat_strength|50704.0|0.492075|0.158102|0.0|0.3743785|0.493764|0.604994|0.990419|
|bounciness|50704.0|0.514526|0.182595|0.0|0.3724626|0.522266|0.655912|0.97259|
|danceability|50704.0|0.611742|0.166146|0.0|0.5024992|0.625145|0.736241|0.984952|
|dyn_range_mean|50704.0|8.21595|2.410626|0.0|6.356671|8.015075|9.792092|32.342781|
|energy|50704.0|0.64144|0.207957|0.0|0.5078394|0.661368|0.803185|0.999877|
|flatness|50704.0|0.996548|0.045611|0.0|0.9759097|1.00379|1.026137|1.103213|


Again, something to consider is the difference in scale across features. The largest values are found in the `'release_year'` column, while other features, like `'liveness'`, are orders of magnitude smaller. I will likely address these scale concerns later in this project, but this is something to keep in mind.

---

In [13]:
# Gets the datatypes within 'track_features_df' dataset
track_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50704 entries, 0 to 50703
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   track_id                50704 non-null  object 
 1   duration                50704 non-null  float64
 2   release_year            50704 non-null  int64  
 3   us_popularity_estimate  50704 non-null  float64
 4   acousticness            50704 non-null  float64
 5   beat_strength           50704 non-null  float64
 6   bounciness              50704 non-null  float64
 7   danceability            50704 non-null  float64
 8   dyn_range_mean          50704 non-null  float64
 9   energy                  50704 non-null  float64
 10  flatness                50704 non-null  float64
 11  instrumentalness        50704 non-null  float64
 12  key                     50704 non-null  int64  
 13  liveness                50704 non-null  float64
 14  loudness                50704 non-null

Compared to `'session_logs_df'`, there are only two object columns. One is `'track_id'`, which, like its counterpart in the other dataset, appears to be a unique identifier, and might also serve as a foreign key to help join the two tables in the future. The other is `'mode'`, which indicates whether the song is in a 'major' key or a 'minor' key. In terms of preparing the dataset for modeling, most of the features are already in numeric form, which means they are likely a step closer to being used as inputs to a machine learning model. In contrast, `'mode'` is a feature that we might transform into numeric form before using it as a model input.
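
One simple transformation for `'mode'` is a direct mapping to 1 and 0, following the data dictionary (major is 1, minor is 0). The sketch below assumes the column holds the literal strings `'major'` and `'minor'`; that should be confirmed with `track_features_df['mode'].unique()` before applying it. The DataFrame `toy_tracks` and the column `mode_numeric` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for the 'mode' column of the track features dataset
toy_tracks = pd.DataFrame({"mode": ["major", "minor", "major"]})

# Assumed labels: verify against the real column before relying on this
mode_map = {"major": 1, "minor": 0}
toy_tracks["mode_numeric"] = toy_tracks["mode"].map(mode_map)

# Labels missing from the mapping become NaN, so check that none slipped through
assert toy_tracks["mode_numeric"].notna().all()

print(toy_tracks)
```

Since there are only two categories, a single 0/1 column carries the same information as one-hot encoding would.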

---

##### Taking a Look at some Observations within the `'track_features_df'`
*Specifically, the first 5 rows, last 5 rows, and 5 rows selected at random.*

In [14]:
# Displays first 5 rows of 'track_features_df'
track_features_df.head()

Unnamed: 0,track_id,duration,release_year,us_popularity_estimate,acousticness,beat_strength,bounciness,danceability,dyn_range_mean,energy,...,time_signature,valence,acoustic_vector_0,acoustic_vector_1,acoustic_vector_2,acoustic_vector_3,acoustic_vector_4,acoustic_vector_5,acoustic_vector_6,acoustic_vector_7
0,t_a540e552-16d4-42f8-a185-232bd650ea7d,109.706673,1950,99.975414,0.45804,0.519497,0.504949,0.399767,7.51188,0.817709,...,4,0.935512,-0.033284,-0.411896,-0.02858,0.349438,0.832467,-0.213871,-0.299464,-0.675907
1,t_67965da0-132b-4b1e-8a69-0ef99b32287c,187.693329,1950,99.96943,0.916272,0.419223,0.54553,0.491235,9.098376,0.154258,...,3,0.359675,0.145703,-0.850372,0.12386,0.746904,0.371803,-0.420558,-0.21312,-0.525795
2,t_0614ecd3-a7d5-40a1-816e-156d5872a467,160.839996,1951,99.602549,0.812884,0.42589,0.50828,0.491625,8.36867,0.358813,...,4,0.726769,0.02172,-0.743634,0.333247,0.568447,0.411094,-0.187749,-0.387599,-0.433496
3,t_070a63a0-744a-434e-9913-a97b02926a29,175.399994,1951,99.665018,0.396854,0.400934,0.35999,0.552227,5.967346,0.514585,...,4,0.859075,0.039143,-0.267555,-0.051825,0.106173,0.614825,-0.111419,-0.265953,-0.542753
4,t_d6990e17-9c31-4b01-8559-47d9ce476df1,369.600006,1951,99.991764,0.728831,0.371328,0.335115,0.483044,5.802681,0.721442,...,4,0.562343,0.131931,-0.292523,-0.174819,-0.034422,0.717229,-0.016239,-0.392694,-0.455496


In [15]:
# Displays last 5 rows of 'track_features_df'
track_features_df.tail()

Unnamed: 0,track_id,duration,release_year,us_popularity_estimate,acousticness,beat_strength,bounciness,danceability,dyn_range_mean,energy,...,time_signature,valence,acoustic_vector_0,acoustic_vector_1,acoustic_vector_2,acoustic_vector_3,acoustic_vector_4,acoustic_vector_5,acoustic_vector_6,acoustic_vector_7
50699,t_402930af-4174-47ec-b1fd-593d93597624,184.686798,2018,99.315966,0.584765,0.521544,0.515087,0.65314,7.68422,0.336433,...,4,0.542063,-0.196001,0.301727,0.23888,-0.391421,0.01669,0.247235,-0.399387,-0.192473
50700,t_e5f9a069-a893-452e-ab21-49b4eaebfbd0,251.813324,2018,99.918573,0.40668,0.5652,0.693126,0.844861,11.176841,0.709085,...,4,0.472353,-0.54516,0.271596,0.274377,0.043951,-0.322946,0.150802,0.159378,0.384336
50701,t_3983306d-13b4-4027-9391-7236ca93d2bf,157.520004,2018,98.517692,0.001279,0.414721,0.341769,0.463543,5.405471,0.975503,...,4,0.766519,0.112592,0.368523,-0.46695,-0.468494,0.640088,0.050771,-0.258999,0.258766
50702,t_74eb6e99-210b-440c-8d7b-4db6617d1c80,129.105392,2018,99.902866,0.139452,0.688375,0.73372,0.850959,10.778521,0.666146,...,4,0.058505,-0.855291,0.365487,0.273034,0.108294,-0.206204,0.007847,-0.408226,0.143629
50703,t_aec0dda3-19b4-4222-be07-327c12f21456,213.689392,2018,99.961653,0.154936,0.327645,0.36552,0.520514,6.707502,0.880616,...,4,0.414714,-0.095364,0.382398,0.231202,-0.41975,-0.010569,0.17347,-0.403821,-0.377302


In [16]:
# Gets 5 random observations from 'track_features_df'
track_features_df.sample(5)

Unnamed: 0,track_id,duration,release_year,us_popularity_estimate,acousticness,beat_strength,bounciness,danceability,dyn_range_mean,energy,...,time_signature,valence,acoustic_vector_0,acoustic_vector_1,acoustic_vector_2,acoustic_vector_3,acoustic_vector_4,acoustic_vector_5,acoustic_vector_6,acoustic_vector_7
11010,t_daf96940-b963-4b4d-9bcb-db0da7e4a672,236.149658,2018,99.241347,1.6e-05,0.266018,0.23079,0.51923,4.886916,0.989784,...,4,0.244273,0.15073,0.535133,-0.386667,-0.4427,0.245195,0.117197,-0.241758,0.840968
4157,t_9c4db64c-997a-4f7e-aafa-fadec8aaf59a,223.433289,2017,99.570246,0.00921,0.336797,0.312577,0.565488,5.7328,0.963295,...,4,0.632195,-0.112127,0.613662,-0.038689,-0.671888,0.079821,0.216979,-0.406181,0.112312
12249,t_473393b3-a507-409f-8413-6a56a5c10505,300.359985,1998,99.877318,0.007655,0.757457,0.788417,0.844983,11.484257,0.921363,...,4,0.367058,-0.733565,0.255447,0.146758,0.146551,-0.263699,-0.034799,-0.352633,0.216652
42829,t_bc8f4b8b-2e71-4ee0-aa9a-b187ef5292c8,169.973328,2018,99.997734,0.836207,0.583486,0.630654,0.77279,9.408275,0.437035,...,4,0.70698,-0.343416,0.426422,0.245223,-0.372917,-0.177993,0.189173,-0.394945,-0.01979
113,t_1a189907-6a94-446b-bb99-5357174db234,318.813324,1969,99.992084,0.074764,0.287649,0.262264,0.29938,5.283388,0.442851,...,4,0.434246,0.11478,-0.227969,-0.026377,-0.006003,0.548675,-0.023654,-0.283713,-0.423066


Based on the 3 cell outputs above, I don't suspect anything strange with the dataset.

---

##### Interpreting a Single Observation in `'track_features_df'`

In [17]:
# Gets a single observation from 'track_features_df'
track_features_df.sample()

Unnamed: 0,track_id,duration,release_year,us_popularity_estimate,acousticness,beat_strength,bounciness,danceability,dyn_range_mean,energy,...,time_signature,valence,acoustic_vector_0,acoustic_vector_1,acoustic_vector_2,acoustic_vector_3,acoustic_vector_4,acoustic_vector_5,acoustic_vector_6,acoustic_vector_7
45227,t_5cad67a8-d55b-4a4a-8607-2077383c7c9a,275.346497,2016,99.828642,0.203832,0.434354,0.38549,0.545908,6.10162,0.786755,...,4,0.540536,0.04476,0.36306,0.243301,-0.51087,-0.060339,0.184722,-0.34508,-0.467421


Comments on the Observation Above:
- A single row in `'track_features_df'` likely represents a song that was featured in the session logs dataset
- In addition, I'm able to see what the characteristics of the song are and some metadata related to the song
- This dataset is a little less intense than the session logs dataset, however provides very important information about the characteristics of the songs in a given listening session
- **One of the major steps that might help drive this project forward is to merge the session logs dataset with the track features dataset, to provide a more comprehensive understanding of what kind of songs a user was listening to and how they interacted with these songs**
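
As a minimal sketch of the merge described in the last point, the two tables can be joined on their track identifiers, `'track_id_clean'` in the session logs and `'track_id'` in the track features, as noted in the data dictionaries. The toy DataFrames below are made up for illustration, and the choice of a left join is an assumption rather than a decision already made in this project.

```python
import pandas as pd

# Toy stand-ins for the two datasets
toy_logs = pd.DataFrame({
    "session_id": ["s1", "s1", "s2"],
    "session_position": [1, 2, 1],
    "track_id_clean": ["t_a", "t_b", "t_a"],
})
toy_tracks = pd.DataFrame({
    "track_id": ["t_a", "t_b"],
    "duration": [200.0, 180.5],
})

# Left join keeps every session-log row and attaches the track's features
merged = toy_logs.merge(
    toy_tracks,
    how="left",
    left_on="track_id_clean",
    right_on="track_id",
    validate="many_to_one",  # raises if a track_id appears twice in toy_tracks
)

print(merged)
```

A left join preserves the number of rows in the session logs, and `validate="many_to_one"` guards against accidentally multiplying rows if the track table ever contained repeated identifiers.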

---

# Conclusion

So far, you have learned about the problem statement, value-add, and goal of this project. In addition, you've learned a bit about the data, including:
- Where it was sourced
- The general characteristics of the data
- How the data has informed some of the decisions we might make in the future

In the next notebook I will dive a bit deeper into the data. I will perform some Exploratory Data Analysis and try to build a stronger understanding of the problem space, using that phase of the project to inform my decisions as they relate to developing a machine learning model to predict song skips.