# Project Initialization

In [1]:
import pandas as pd
import numpy as np

In [46]:
df = pd.read_csv('./data/dataset.csv', index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 114000 entries, 0 to 113999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          114000 non-null  object 
 1   artists           113999 non-null  object 
 2   album_name        113999 non-null  object 
 3   track_name        113999 non-null  object 
 4   popularity        114000 non-null  int64  
 5   duration_ms       114000 non-null  int64  
 6   explicit          114000 non-null  bool   
 7   danceability      114000 non-null  float64
 8   energy            114000 non-null  float64
 9   key               114000 non-null  int64  
 10  loudness          114000 non-null  float64
 11  mode              114000 non-null  int64  
 12  speechiness       114000 non-null  float64
 13  acousticness      114000 non-null  float64
 14  instrumentalness  114000 non-null  float64
 15  liveness          114000 non-null  float64
 16  valence           11

The `df.info()` output shows that `artists`, `album_name`, and `track_name` each have 113,999 non-null values out of 114,000 entries, so at least one row is missing important fields. Let's drop any rows that contain nulls.
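
To see exactly which rows are affected before dropping them, a check along these lines may help. This is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from the dataset); the same two expressions should work on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: one row is missing three text fields.
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "artists": ["Artist A", None, "Artist C"],
    "album_name": ["Album A", None, "Album C"],
    "track_name": ["Song A", None, "Song C"],
    "popularity": [10, 0, 30],
})

# Count missing values per column.
null_counts = toy.isna().sum()

# Select every row that has at least one missing value.
rows_with_nulls = toy[toy.isna().any(axis=1)]

print(null_counts)
print(rows_with_nulls)
```

Inspecting `rows_with_nulls` first confirms whether the nulls sit in one row or are spread across several before `dropna` removes them.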

In [47]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113999 entries, 0 to 113999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          113999 non-null  object 
 1   artists           113999 non-null  object 
 2   album_name        113999 non-null  object 
 3   track_name        113999 non-null  object 
 4   popularity        113999 non-null  int64  
 5   duration_ms       113999 non-null  int64  
 6   explicit          113999 non-null  bool   
 7   danceability      113999 non-null  float64
 8   energy            113999 non-null  float64
 9   key               113999 non-null  int64  
 10  loudness          113999 non-null  float64
 11  mode              113999 non-null  int64  
 12  speechiness       113999 non-null  float64
 13  acousticness      113999 non-null  float64
 14  instrumentalness  113999 non-null  float64
 15  liveness          113999 non-null  float64
 16  valence           11

Now let's remove rows that are exact duplicates of another row.

In [80]:
df.drop_duplicates(inplace=True)
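
It can be worth counting duplicates before dropping them, since `drop_duplicates()` with no arguments only removes rows that match on every column. The sketch below uses a hypothetical `toy` frame (invented values, not the real data) to show the difference between exact duplicates and rows that merely share a `track_id`.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical; rows 2 and 3 share an id but differ in genre.
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "b2"],
    "track_genre": ["pop", "pop", "rock", "indie"],
    "popularity": [50, 50, 40, 40],
})

# duplicated() marks a row True when an identical row appeared earlier.
n_exact_duplicates = int(toy.duplicated().sum())

# Restricting the check to one column counts repeated ids, even if other columns differ.
n_repeated_ids = int(toy.duplicated(subset="track_id").sum())

# Only the exact duplicate is removed; the two "b2" rows both survive.
deduped = toy.drop_duplicates()

print(n_exact_duplicates, n_repeated_ids, len(deduped))
```

Whether rows that share a `track_id` should also be collapsed depends on the question being asked; the notebook keeps them.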

All fixed! Now let's reset the index and move on to answering some burning questions about the data.

In [88]:
# drop=True discards the old index instead of keeping it as a new "index" column
df.reset_index(drop=True, inplace=True)

# Questions

This section looks at three questions that use roughly the same columns, so I'll create a variable to hold the relevant column names.

In [89]:
relevant_columns = ["track_id", "track_name", "artists", "popularity", "danceability", "energy", "tempo", "track_genre"]

## What's the most popular genre?

Let's start by setting up the data.

In [90]:
data1 = df[relevant_columns]
data1

Unnamed: 0,track_id,track_name,artists,popularity,danceability,energy,tempo,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Comedy,Gen Hoshino,73,0.676,0.4610,87.917,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ghost - Acoustic,Ben Woodward,55,0.420,0.1660,77.489,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,To Begin Again,Ingrid Michaelson;ZAYN,57,0.438,0.3590,76.332,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Can't Help Falling In Love,Kina Grannis,71,0.266,0.0596,181.740,acoustic
4,5vjLSffimiIP26QG5WcN2K,Hold On,Chord Overstreet,82,0.618,0.4430,119.949,acoustic
...,...,...,...,...,...,...,...,...
113544,2C3TZjDRiAzdyViavDJ217,Sleep My Little Boy,Rainy Lullaby,21,0.172,0.2350,125.995,world-music
113545,1hIz5L4IB9hN3WRYPOCGPw,Water Into Light,Rainy Lullaby,22,0.174,0.1170,85.239,world-music
113546,6x8ZfSoqDjuNa5SVP5QjvX,Miss Perfumado,Cesária Evora,22,0.629,0.3290,132.378,world-music
113547,2e6sXL2bYv4bSz6VTdnfLs,Friends,Michael W. Smith,41,0.587,0.5060,135.960,world-music


In [91]:
data1.value_counts(["track_genre"]).sort_values()

track_genre
romance         904
classical       933
german          963
dance           965
honky-tonk      981
               ... 
indie-pop      1000
rock-n-roll    1000
industrial     1000
swedish        1000
acoustic       1000
Length: 114, dtype: int64

Since the sorted counts show that every genre has a similar number of songs (between 904 and 1,000), we don't need to bother with weighted averages.

In [92]:
grouped_data = data1.groupby("track_genre").agg({
    'popularity': 'mean',
}).reset_index()
grouped_data

Unnamed: 0,track_genre,popularity
0,acoustic,42.483000
1,afrobeat,24.407407
2,alt-rock,33.896897
3,alternative,24.361361
4,ambient,44.208208
...,...,...
109,techno,39.042000
110,trance,37.636637
111,trip-hop,34.499498
112,turkish,40.700701


Now that we have the dataset we want to use, let's see which genre is the most popular.

In [93]:
question_1_data = grouped_data.sort_values("popularity", ascending=False)
question_1_data

Unnamed: 0,track_genre,popularity
81,pop-film,59.280280
65,k-pop,56.963928
15,chill,53.704705
94,sad,52.379000
44,grunge,49.582583
...,...,...
13,chicago-house,12.333667
24,detroit-techno,11.183367
67,latin,8.363636
93,romance,3.549779


The most popular genre in this dataset is pop-film, with a mean popularity of about 59.3.
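
If only the top few genres are needed, a full sort is not required. The following is a small sketch on hypothetical data (`toy` and the genre names are placeholders) that uses `idxmax` and `nlargest` on the per-genre means.

```python
import pandas as pd

# Hypothetical tracks with placeholder genre names.
toy = pd.DataFrame({
    "track_genre": ["genre_a", "genre_a", "genre_b", "genre_b", "genre_c"],
    "popularity": [60, 58, 57, 55, 4],
})

# Mean popularity per genre, as a Series indexed by genre.
mean_popularity = toy.groupby("track_genre")["popularity"].mean()

# Label of the genre with the highest mean.
top_genre = mean_popularity.idxmax()

# The two highest means, already in descending order.
top_two = mean_popularity.nlargest(2)

print(top_genre)
print(top_two)
```

On the real data, the equivalent would presumably be `grouped_data.nlargest(5, "popularity")`, which returns the same leading rows as the sorted frame above.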

## How does a song's danceability relate to the song's popularity?

Let's set up the data.

In [94]:
data2 = df[relevant_columns]
data2

Unnamed: 0,track_id,track_name,artists,popularity,danceability,energy,tempo,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Comedy,Gen Hoshino,73,0.676,0.4610,87.917,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ghost - Acoustic,Ben Woodward,55,0.420,0.1660,77.489,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,To Begin Again,Ingrid Michaelson;ZAYN,57,0.438,0.3590,76.332,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Can't Help Falling In Love,Kina Grannis,71,0.266,0.0596,181.740,acoustic
4,5vjLSffimiIP26QG5WcN2K,Hold On,Chord Overstreet,82,0.618,0.4430,119.949,acoustic
...,...,...,...,...,...,...,...,...
113544,2C3TZjDRiAzdyViavDJ217,Sleep My Little Boy,Rainy Lullaby,21,0.172,0.2350,125.995,world-music
113545,1hIz5L4IB9hN3WRYPOCGPw,Water Into Light,Rainy Lullaby,22,0.174,0.1170,85.239,world-music
113546,6x8ZfSoqDjuNa5SVP5QjvX,Miss Perfumado,Cesária Evora,22,0.629,0.3290,132.378,world-music
113547,2e6sXL2bYv4bSz6VTdnfLs,Friends,Michael W. Smith,41,0.587,0.5060,135.960,world-music


In [95]:
grouped_data2 = data2.groupby("danceability").agg({
    'popularity': 'mean',
}).reset_index()
grouped_data2

Unnamed: 0,danceability,popularity
0,0.0000,37.197452
1,0.0513,23.000000
2,0.0532,62.000000
3,0.0545,33.000000
4,0.0548,0.000000
...,...,...
1169,0.9810,8.500000
1170,0.9820,49.000000
1171,0.9830,5.000000
1172,0.9840,10.000000


We can sort to see which danceability values have the highest mean popularity, but this relationship is better viewed as a chart in the dashboard.

In [97]:
question_2_data = grouped_data2.sort_values("popularity", ascending=False)
question_2_data

Unnamed: 0,danceability,popularity
6,0.0555,80.0
79,0.0685,72.0
42,0.0630,66.0
277,0.0983,65.0
156,0.0808,65.0
...,...,...
263,0.0965,0.0
12,0.0574,0.0
47,0.0640,0.0
11,0.0569,0.0
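
Grouping on the exact danceability value produces more than a thousand groups, some of which may contain only a single track, so the extremes of the sorted table can be noisy. One possible alternative, sketched here on hypothetical data (`toy`, the bin edges, and the column name `dance_bin` are all assumptions), is to bin danceability with `pd.cut` and report the group size next to the mean.

```python
import pandas as pd

# Hypothetical tracks; values are invented for illustration.
toy = pd.DataFrame({
    "danceability": [0.05, 0.15, 0.35, 0.45, 0.65, 0.70, 0.95],
    "popularity": [10, 20, 30, 40, 50, 60, 70],
})

# Four equal-width bins over the 0-1 danceability range (an arbitrary choice).
bins = [0.0, 0.25, 0.5, 0.75, 1.0]
toy["dance_bin"] = pd.cut(toy["danceability"], bins=bins, include_lowest=True)

# Mean popularity and number of tracks per bin.
binned = (
    toy.groupby("dance_bin", observed=True)["popularity"]
    .agg(["mean", "count"])
    .reset_index()
)

print(binned)
```

Keeping the `count` column makes it easy to spot bins whose mean rests on very few tracks.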


## How does the song's tempo correlate to the song's danceability?

Let's set up the data.

In [99]:
data3 = df[relevant_columns]
data3

Unnamed: 0,track_id,track_name,artists,popularity,danceability,energy,tempo,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Comedy,Gen Hoshino,73,0.676,0.4610,87.917,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ghost - Acoustic,Ben Woodward,55,0.420,0.1660,77.489,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,To Begin Again,Ingrid Michaelson;ZAYN,57,0.438,0.3590,76.332,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Can't Help Falling In Love,Kina Grannis,71,0.266,0.0596,181.740,acoustic
4,5vjLSffimiIP26QG5WcN2K,Hold On,Chord Overstreet,82,0.618,0.4430,119.949,acoustic
...,...,...,...,...,...,...,...,...
113544,2C3TZjDRiAzdyViavDJ217,Sleep My Little Boy,Rainy Lullaby,21,0.172,0.2350,125.995,world-music
113545,1hIz5L4IB9hN3WRYPOCGPw,Water Into Light,Rainy Lullaby,22,0.174,0.1170,85.239,world-music
113546,6x8ZfSoqDjuNa5SVP5QjvX,Miss Perfumado,Cesária Evora,22,0.629,0.3290,132.378,world-music
113547,2e6sXL2bYv4bSz6VTdnfLs,Friends,Michael W. Smith,41,0.587,0.5060,135.960,world-music


In [102]:
grouped_data3 = data3.groupby("tempo").agg({
    'danceability': 'mean',
}).reset_index()
grouped_data3

Unnamed: 0,tempo,danceability
0,0.000,0.000
1,30.200,0.334
2,30.322,0.391
3,31.834,0.558
4,34.262,0.780
...,...,...
45647,220.081,0.469
45648,220.084,0.225
45649,220.525,0.244
45650,222.605,0.241


We can sort to see how danceable each tempo is, but the pattern will be easier to see in the dashboard.

In [103]:
question_3_data = grouped_data3.sort_values("danceability", ascending=False)
question_3_data

Unnamed: 0,tempo,danceability
20268,115.347,0.9850
23643,122.514,0.9760
17343,109.009,0.9750
24722,124.663,0.9740
23342,121.950,0.9710
...,...,...
409,61.880,0.0548
553,63.428,0.0545
242,58.621,0.0532
42471,175.152,0.0513
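
Since the question asks about correlation, a single coefficient can complement the chart. The sketch below is one way to compute it, using a hypothetical `toy` frame with invented values; on the real data the same calls would be applied to `data3["tempo"]` and `data3["danceability"]`.

```python
import pandas as pd

# Hypothetical tracks: danceability peaks at a mid-range tempo.
toy = pd.DataFrame({
    "tempo": [60.0, 90.0, 120.0, 150.0, 180.0],
    "danceability": [0.30, 0.55, 0.80, 0.60, 0.35],
})

# Pearson correlation measures the strength of a linear relationship.
pearson_r = toy["tempo"].corr(toy["danceability"])

# Spearman correlation is the Pearson correlation of the ranks,
# so it captures monotonic relationships that are not linear.
spearman_r = toy["tempo"].rank().corr(toy["danceability"].rank())

print(round(pearson_r, 3), round(spearman_r, 3))
```

Both coefficients summarize only monotonic trends. A relationship that rises and then falls, as in this toy example, can score near zero even though tempo and danceability are clearly related, which is one reason the chart remains useful.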
