![AltText](http://www.scdn.co/i/_global/open-graph-default.png "Spotify Logo")


# Introduction and Motivations


## Proposal

To introduce the subject material and understand the motivations behind this study, we first recall our original project proposal:
>As people who like music and have received some form of musical training, we decided to base our data science project on songs.  A look at the Spotify data obtained from Kaggle shows lists of top-ranking songs, some song classifications, and some attributes known as "audio analyses."  Metrics such as "danceability" and "energy" are given quantitative values and are available through Spotify's API.  We intend to use this data as a data lake for our experiments in order to find out what Spotify looks for in a hit.

>We'd like to answer questions such as "Given today's trends, what does it take to make it to Spotify's top 50 songs?"  and "What's more important for streaming numbers today, instrumentalness or danceability?"  We add the qualifier for the present because at the moment we do not have access to such ranking data from 2016 or earlier.

## Defining the Question


For the unfamiliar, Spotify is a digital music, podcast, and video streaming service that provides access to more than 30 million songs.  Spotify's Charts rank songs by number of streams; we obtained the top 100 songs of 2017 to conduct our analysis.  The dataset is available here: https://www.kaggle.com/nadintamer/top-tracks-of-2017/downloads/featuresdf.csv/1 

As outlined in the proposal, we'd like to find out what makes up a Spotify hit.  Which of the audio features are the most important?  What gets more listens: vocal-based music like rap, or instrumentals like EDM?  The potential significance of such questions is at least two-fold: 

1. Develop some sort of model that could determine if a song's characteristics are enough to be a Spotify chart-topper
2. Perhaps even reverse-engineer how Spotify determines these song attributes

## Getting the Data

First, we set up the environment:


In [23]:
%matplotlib inline
import pandas
import seaborn

The dataset contains more than the names of the songs and their rankings, as shown below.  We also check that the dataset is complete.  In this case, every field is populated for all 100 songs, and nothing seems to be duplicated.

In [24]:
sdata = pandas.read_csv('./featuresdf.csv')
sdata.head(4)


| | id | name | artists | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7qiZfU4dY1lWllzX7mPBI | Shape of You | Ed Sheeran | 0.825 | 0.652 | 1.0 | -3.183 | 0.0 | 0.0802 | 0.581 | 0.0 | 0.0931 | 0.931 | 95.977 | 233713.0 | 4.0 |
| 1 | 5CtI0qwDJkDQGwXD1H1cL | Despacito - Remix | Luis Fonsi | 0.694 | 0.815 | 2.0 | -4.328 | 1.0 | 0.12 | 0.229 | 0.0 | 0.0924 | 0.813 | 88.931 | 228827.0 | 4.0 |
| 2 | 4aWmUDTfIPGksMNLV2rQP | Despacito (Featuring Daddy Yankee) | Luis Fonsi | 0.66 | 0.786 | 2.0 | -4.757 | 1.0 | 0.17 | 0.209 | 0.0 | 0.112 | 0.846 | 177.833 | 228200.0 | 4.0 |
| 3 | 6RUKPb4LETWmmr3iAEQkt | Something Just Like This | The Chainsmokers | 0.617 | 0.635 | 11.0 | -6.769 | 0.0 | 0.0317 | 0.0498 | 1.4e-05 | 0.164 | 0.446 | 103.019 | 247160.0 | 4.0 |


In [25]:
sdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
id                  100 non-null object
name                100 non-null object
artists             100 non-null object
danceability        100 non-null float64
energy              100 non-null float64
key                 100 non-null float64
loudness            100 non-null float64
mode                100 non-null float64
speechiness         100 non-null float64
acousticness        100 non-null float64
instrumentalness    100 non-null float64
liveness            100 non-null float64
valence             100 non-null float64
tempo               100 non-null float64
duration_ms         100 non-null float64
time_signature      100 non-null float64
dtypes: float64(13), object(3)
memory usage: 12.6+ KB
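
The `info()` output confirms that no field has missing values, but it does not show duplicates by itself. A minimal sketch of how both checks could be made explicit follows. The helper name `summarize_completeness`, the choice of `name` plus `artists` as the identifying key, and the three-row `sample` frame (values copied from the `head()` output above) are assumptions for illustration, not part of the original notebook.

```python
import pandas


def summarize_completeness(df, key_columns):
    """Count missing cells, fully duplicated rows, and repeated key combinations."""
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df.duplicated(subset=key_columns).sum()),
    }


# Hypothetical stand-in for the real table; replace with `sdata` in the notebook.
sample = pandas.DataFrame({
    "name": ["Shape of You", "Despacito - Remix", "Something Just Like This"],
    "artists": ["Ed Sheeran", "Luis Fonsi", "The Chainsmokers"],
    "danceability": [0.825, 0.694, 0.617],
})

report = summarize_completeness(sample, key_columns=["name", "artists"])
print(report)
```

In the notebook itself, the same call could be made as `summarize_completeness(sdata, ["name", "artists"])`; all three counts being zero would support the statement that the dataset is complete and free of duplicates.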


The first three fields identify each track:
+ **id**: 
    + Spotify URI
+ **name**: 
    + title
+ **artists**: 
    + contributing artists


The following fields are song attributes or audio features that are provided by Spotify via the Spotify API (descriptions taken from https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/):
+ **danceability**: 
    + describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. 
    + value of 0.0 is least danceable and 1.0 is most danceable.
+ **energy**: 
    + measure from 0.0 to 1.0 which represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. 
    + perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
+ **key**: 
    + the key the track is in. 
    + integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
+ **loudness**:  
    + overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. 
    + values typically range between -60 and 0 dB.
+ **mode**: 
    + indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. 
    + major is represented by 1 and minor is 0.
+ **speechiness**:  
    + detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. 
    + values above 0.66 describe tracks that are probably made entirely of spoken words. 
    + values between 0.33 and 0.66 describe tracks that may contain both music and speech. 
    + values below 0.33 most likely represent music and other non-speech-like tracks.
+ **acousticness**: 
    + confidence measure from 0.0 to 1.0 of whether the track is acoustic. 
    + 1.0 represents high confidence the track is acoustic.
+ **instrumentalness**: 
    + predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. 
    + values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
+ **liveness**: 
    + detects the presence of an audience in the recording. 
    + a value above 0.8 indicates a strong likelihood that the track is live.
+ **valence**: 
    + measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
+ **tempo**:  
    + overall estimated tempo of a track in beats per minute (BPM). 
    + derives directly from the average beat duration.
+ **duration_ms**: 
    + duration of the track in milliseconds.
+ **time_signature**: 
    + estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
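
Several of the fields above are numeric codes or thresholds that are easier to read once decoded. The sketch below shows one way to translate `key` and `mode` into a label, bucket `speechiness` by the 0.33 and 0.66 cut-offs quoted above, and convert `duration_ms` to minutes and seconds. The function names and band labels are hypothetical; treating values exactly at 0.33 or 0.66 as "music and speech" is also an assumption, since the descriptions do not say which band the boundaries belong to.

```python
# Pitch Class notation: index 0 = C, 1 = C♯/D♭, 2 = D, ... 11 = B
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def key_name(key, mode):
    """Translate the numeric key and mode into a label such as 'D major'."""
    index = int(key)
    if not 0 <= index <= 11:
        return "unknown"  # guard against values outside the pitch-class range
    quality = "major" if int(mode) == 1 else "minor"
    return f"{PITCH_CLASSES[index]} {quality}"


def speechiness_band(value):
    """Bucket speechiness using the 0.33 and 0.66 thresholds."""
    if value > 0.66:
        return "mostly spoken word"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


def format_duration(duration_ms):
    """Convert milliseconds to a 'minutes:seconds' string, dropping fractions."""
    minutes, seconds = divmod(int(duration_ms // 1000), 60)
    return f"{minutes}:{seconds:02d}"


# Values for "Shape of You" taken from the first row of the dataset
print(key_name(1.0, 0.0))         # C♯/D♭ minor
print(speechiness_band(0.0802))   # mostly music
print(format_duration(233713.0))  # 3:53
```

Applied to the whole table, these helpers could be used with `DataFrame.apply` to add readable columns alongside the raw numeric ones without altering the original values.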

# Preliminary Analysis