# Taylor Swift | The Eras Tour Official Setlist Data

https://www.kaggle.com/datasets/yukawithdata/taylor-swift-the-eras-tour-official-setlist-data/data


**Attribute Descriptions**:
* **artist_name**: The name of the artist (Taylor Swift)
* **track_name**: The title of the track
* **is_explicit**: Indicates whether the track contains explicit content
* **album_release_date**: The release date of the album the track appears on
* **genres**: A list of genres associated with the artist
<br></br>

* **danceability**: A measure from 0.0 to 1.0 indicating how suitable a track is for dancing, based on a combination of musical elements
* **valence**: A measure from 0.0 to 1.0 indicating the musical positiveness conveyed by a track
* **energy**: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity
* **loudness**: The overall loudness of a track in decibels (dB)
* **acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic
<br></br>

* **instrumentalness**: Predicts whether a track contains no vocals
* **liveness**: Detects the presence of an audience in the recording
* **speechiness**: Detects the presence of spoken words in a track
<br></br>

* **key**: The key the track is in. Integers map to pitches using standard Pitch Class notation
* **tempo**: The overall estimated tempo of a track in beats per minute (BPM)
* **mode**: Modality of the track (1 for major, 0 for minor)
* **duration_ms**: The length of the track in milliseconds
* **time_signature**: An estimated overall time signature of a track
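
The `key` and `mode` integers are easier to read once decoded. The following is a minimal sketch, assuming the dataset follows Spotify's audio-features convention (pitch classes 0 to 11 starting at C, `-1` when no key was detected, `mode` 1 for major and 0 for minor). `describe_key` and `duration_minutes` are hypothetical helpers, not part of the dataset or of any library.

```python
# Pitch Class notation: 0 = C, 1 = C#/Db, ..., 11 = B
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a readable label such as 'A major' (hypothetical helper)."""
    if not 0 <= key < len(PITCH_CLASSES):
        return "unknown"  # assumption: values outside 0..11 (e.g. -1) mean no key was detected
    quality = "major" if mode == 1 else "minor"  # assumption: 1 = major, 0 = minor
    return f"{PITCH_CLASSES[key]} {quality}"


def duration_minutes(duration_ms: int) -> float:
    """Convert a duration in milliseconds to minutes."""
    return duration_ms / 60_000


# Values taken from the "Cruel Summer" row of the dataset: key=9, mode=1, duration_ms=178427
print(describe_key(9, 1))
print(round(duration_minutes(178427), 2))
```

With a loaded frame, the same helper could be applied row-wise, for example `df.apply(lambda r: describe_key(r['key'], r['mode']), axis=1)`.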

**Target Feature**:
* **popularity**: A score between 0 and 100, with 100 being the most popular

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Prepare dataset

In [3]:
df = pd.read_csv(r'/Users/xiao/Projects/git/AnalysisProjects/data/era_tour_setlist.csv', index_col='track_name').drop_duplicates()

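# Clean dataset: parse release dates as datetimes
df['album_release_date'] = pd.to_datetime(df['album_release_date'], format='%Y-%m-%d')

# Drop constant columns (a single unique value carries no information)
df = df.loc[:, df.nunique() > 1]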

df.head()

track_name,is_explicit,album_release_date,danceability,valence,energy,loudness,acousticness,instrumentalness,liveness,speechiness,key,tempo,mode,duration_ms,time_signature,popularity
Miss Americana & The Heartbreak Prince,False,2019-08-23,0.662,0.487,0.747,-6.926,0.028,0.00615,0.138,0.0736,11,150.088,0,234147,4,79
Cruel Summer,False,2019-08-23,0.552,0.564,0.702,-5.707,0.117,2.1e-05,0.105,0.157,9,169.994,1,178427,4,94
The Man,False,2019-08-23,0.777,0.633,0.658,-5.191,0.0767,0.0,0.0901,0.054,0,110.048,1,190360,4,82
You Need To Calm Down,False,2019-08-23,0.771,0.714,0.671,-5.617,0.00929,0.0,0.0637,0.0553,2,85.026,1,171360,4,81
Lover,False,2019-08-23,0.359,0.453,0.543,-7.582,0.492,1.6e-05,0.118,0.0919,7,68.534,1,221307,4,88
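
The filter `df.loc[:, df.nunique() > 1]` used in the cleaning step can be seen in isolation on a toy frame. This is only an illustrative sketch: the `toy` frame is an assumption built from a few values of the first three rows above, plus a constant `artist_name` column.

```python
import pandas as pd

# Toy frame: "artist_name" is constant, the other columns vary
toy = pd.DataFrame({
    "artist_name": ["Taylor Swift", "Taylor Swift", "Taylor Swift"],
    "tempo": [150.088, 169.994, 110.048],
    "mode": [0, 1, 1],
})

# nunique() counts distinct values per column; the boolean mask keeps
# only the columns that have more than one distinct value
mask = toy.nunique() > 1
filtered = toy.loc[:, mask]

print(list(filtered.columns))
```

Note that `nunique()` ignores missing values by default, so a column holding one value plus `NaN` entries would also be dropped by this filter.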


### Population Statistics

In [30]:
df_num = df.select_dtypes(include=['float', 'int'])
# Summary statistics plus skewness, excess kurtosis, and distinct-value counts per feature
summary = df_num.describe().T
summary.assign(skew=df_num.skew(), kurtosis=df_num.kurtosis(), **{'unique number': df_num.nunique()})

feature,count,mean,std,min,25%,50%,75%,max,skew,kurtosis,unique number
danceability,44.0,0.599477,0.11055,0.359,0.53425,0.6095,0.6465,0.872,-0.051732,0.270381,42
valence,44.0,0.395736,0.202988,0.0499,0.22525,0.416,0.51175,0.903,0.372844,-0.428446,43
energy,44.0,0.576909,0.172917,0.24,0.4085,0.6,0.713,0.866,-0.38662,-0.993673,44
loudness,44.0,-7.895841,2.509138,-14.132,-10.2365,-7.5105,-5.707,-3.546,-0.358812,-0.680257,43
acousticness,44.0,0.280691,0.293417,0.000443,0.036375,0.1265,0.52425,0.92,0.894185,-0.554949,44
instrumentalness,44.0,0.000798,0.003592,0.0,0.0,3e-06,6.2e-05,0.0232,5.979696,37.347872,27
liveness,44.0,0.137098,0.085399,0.0576,0.09085,0.106,0.14625,0.475,2.342496,5.800986,41
speechiness,44.0,0.075068,0.063762,0.0253,0.0345,0.0567,0.0811,0.387,3.069187,12.692495,44
key,44.0,5.272727,3.636786,0.0,2.0,6.5,8.0,11.0,-0.245256,-1.182387,10
tempo,44.0,123.261841,32.815035,68.534,95.7155,122.92,151.068,192.004,0.269944,-1.091466,43


In [29]:
# Tendency of numerical features: boxplot, density 


danceability        42
valence             43
energy              44
loudness            43
acousticness        44
instrumentalness    27
liveness            41
speechiness         44
key                 10
tempo               43
mode                 2
duration_ms         44
time_signature       2
popularity          18
dtype: int64
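
The boxplot and density views named in the cell comment could look roughly like the sketch below. It is not the notebook's own code: `plot_tendency` is a hypothetical helper, the `sample` frame is an assumption built from the first five rows of the preview, and a normalized histogram (`density=True`) stands in for a smoothed density curve.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_tendency(df_num: pd.DataFrame):
    """Draw one boxplot and one normalized histogram per numeric column (hypothetical helper)."""
    features = list(df_num.columns)
    fig, axes = plt.subplots(2, len(features),
                             figsize=(3 * len(features), 6), squeeze=False)
    for i, col in enumerate(features):
        values = df_num[col].dropna().to_numpy()
        axes[0, i].boxplot(values)                    # top row: spread and outliers
        axes[0, i].set_title(col)
        axes[1, i].hist(values, bins=10, density=True)  # bottom row: approximate density
        axes[1, i].set_xlabel(col)
    fig.tight_layout()
    return fig, axes


# Assumed sample: three features from the first five rows of the dataset
sample = pd.DataFrame({
    "danceability": [0.662, 0.552, 0.777, 0.771, 0.359],
    "energy": [0.747, 0.702, 0.658, 0.671, 0.543],
    "loudness": [-6.926, -5.707, -5.191, -5.617, -7.582],
})
fig, axes = plot_tendency(sample)
```

In the notebook this would be called as `plot_tendency(df_num)`. Since `seaborn` is already imported, `sns.kdeplot` is an alternative for a smoothed density curve in the bottom row.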

In [None]:
# Correlation: pairwise Pearson correlation between the numeric features
corr_matrix = df.select_dtypes(include='number').corr()
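
Since `popularity` is the feature to predict, one natural use of a correlation matrix is to rank the other features by how strongly they correlate with it. The sketch below is one possible approach, not the notebook's own: `rank_by_target_correlation` is a hypothetical helper, and the `sample` frame is an assumption holding a few columns from the first five rows of the preview, far too few rows for real conclusions.

```python
import pandas as pd

# Assumed sample: selected columns from the first five rows of the dataset
sample = pd.DataFrame({
    "danceability": [0.662, 0.552, 0.777, 0.771, 0.359],
    "energy": [0.747, 0.702, 0.658, 0.671, 0.543],
    "tempo": [150.088, 169.994, 110.048, 85.026, 68.534],
    "popularity": [79, 94, 82, 81, 88],
})


def rank_by_target_correlation(frame: pd.DataFrame, target: str = "popularity") -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with the target (hypothetical helper)."""
    corr = frame.select_dtypes(include="number").corr()
    # Drop the target's correlation with itself, then sort by magnitude
    return corr[target].drop(target).sort_values(key=lambda s: s.abs(), ascending=False)


ranking = rank_by_target_correlation(sample)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom. With `seaborn` imported, `sns.heatmap(corr, annot=True)` is a common way to view the full matrix.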