In [29]:
import pandas as pd

#import dataset
spotify_songs = pd.read_csv('data/final_spotify_data.csv')

#subset data based on playlist genre
w = spotify_songs.loc[spotify_songs['playlist'] == 'workout']
c = spotify_songs.loc[spotify_songs['playlist'] == 'chill']
p = spotify_songs.loc[spotify_songs['playlist'] == 'party']
f = spotify_songs.loc[spotify_songs['playlist'] == 'focus']

# Inferential Statistics

## Are the variables significant in terms of predicting playlist genre?

We will use inferential statistics to determine which variables are good candidates for our models. To do this, I will use hypothesis testing to check whether the variables differ significantly across the playlist categories.

### Are workout and party playlists significantly different from each other?

In the data storytelling stage, we found many duplicate songs shared between the Workout and Party playlists. This might indicate that the two playlists are very similar. I want to use hypothesis testing to check whether the two playlists are significantly different from each other.

##### Two sample T test:

H0: the means of the samples are the same.

H1: the means of the samples are not the same.
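
As a side note, the standard two-sample t-test assumes both groups have equal variances. A minimal sketch of Welch's variant, which drops that assumption, is shown here on synthetic data. The arrays `group_a` and `group_b` are made-up stand-ins, not columns from the Spotify dataset, and using Welch's test is an optional alternative rather than what this notebook does.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for one audio feature in two playlists (not the real data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.70, scale=0.10, size=200)
group_b = rng.normal(loc=0.65, scale=0.20, size=180)

# equal_var=False requests Welch's t-test, which does not assume equal variances.
stat, pval = ttest_ind(group_a, group_b, equal_var=False)

print("T Statistic: %.3f" % stat)
print("P-Value: %f" % pval)
```

If the two playlists have clearly different spreads for a feature, the Welch p-value is usually the safer one to read.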


In [3]:
from scipy import stats
from scipy.stats import ttest_ind


features = ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 
            'liveness', 'loudness', 'speechiness', 'tempo', 'valence']

for feature in features:
    data1, data2 = w[feature], p[feature]
    stat, pval = ttest_ind(data1, data2)

    print(feature, "\n--------------------")
    print("T Statistic: %.3f" % stat)
    print("P-Value: %f" % pval)
    if pval > 0.05:
        print("The means are the same (do not reject null hypothesis)\n")
    else:
        print("The means are not the same (reject null hypothesis)\n")

acousticness 
--------------------
T Statistic: -7.784
P-Value: 0.000000
The means are not the same (reject null hypothesis)

danceability 
--------------------
T Statistic: -6.753
P-Value: 0.000000
The means are not the same (reject null hypothesis)

duration_ms 
--------------------
T Statistic: 0.903
P-Value: 0.366590
The means are the same (do not reject null hypothesis)

energy 
--------------------
T Statistic: 7.677
P-Value: 0.000000
The means are not the same (reject null hypothesis)

instrumentalness 
--------------------
T Statistic: 3.432
P-Value: 0.000623
The means are not the same (reject null hypothesis)

liveness 
--------------------
T Statistic: 0.322
P-Value: 0.747544
The means are the same (do not reject null hypothesis)

loudness 
--------------------
T Statistic: 6.251
P-Value: 0.000000
The means are not the same (reject null hypothesis)

speechiness 
--------------------
T Statistic: 1.177
P-Value: 0.239262
The means are the same (do not reject null hypothesis)


The t-test results show that duration, liveness, and speechiness are not significantly different between the workout and party playlists.

The t-test applies only to continuous variables. For our categorical variables, I will use the chi-squared test.

##### Chi-Squared Test

H0: the two samples are independent.

H1: there is a dependency between the samples.
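
To make the null hypothesis concrete, here is a small sketch of how the expected counts under independence are computed. The observed counts are taken from the workout/party mode crosstab in this section; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the workout/party mode crosstab
# (rows: mode 0 and 1; columns: party, workout).
observed = np.array([[224, 233],
                     [340, 290]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Under H0 (independence): expected count = row total * column total / grand total.
expected_manual = row_totals * col_totals / n

# chi2_contingency returns the same expected table as its fourth value.
_, _, dof, expected_scipy = chi2_contingency(observed)

print(expected_manual)
print("Degrees of freedom:", dof)
```

The chi-squared statistic then measures how far the observed counts are from these expected counts.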

In [27]:
workout_party = spotify_songs[(spotify_songs['playlist'] == 'workout') | (spotify_songs['playlist'] == 'party')]

mode = pd.crosstab(workout_party['mode'], workout_party['playlist'], margins = True)


observed = mode.iloc[0:2,0:2]   # Get table without totals for later use
print(mode, "\n")


chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)

print('mode', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The two samples are independent (do not reject null hypothesis)\n")
else:
    print("The two samples are dependent (reject null hypothesis)\n")

playlist  party  workout   All
mode                          
0           224      233   457
1           340      290   630
All         564      523  1087 

mode 
--------------------
Chi-Squared: 2.408
P-Value: 0.120707
Expected: [[237.11867525 219.88132475]
 [326.88132475 303.11867525]]
The two samples are independent (do not reject null hypothesis)



In [26]:
key = pd.crosstab(workout_party['key'], workout_party['playlist'], margins = True)


observed = key.iloc[0:12,0:2]   # Get table without totals for later use
print(key, "\n")

chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)

print('key', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The two samples are independent (do not reject null hypothesis)\n")
else:
    print("The two samples are dependent (reject null hypothesis)\n")

playlist  party  workout   All
key                           
0            75       46   121
1            82       68   150
2            35       49    84
3            16       13    29
4            35       30    65
5            45       46    91
6            53       50   103
7            47       51    98
8            32       31    63
9            51       41    92
10           44       39    83
11           49       59   108
All         564      523  1087 

key 
--------------------
Chi-Squared: 12.348
P-Value: 0.338056
Expected: [[62.78196872 58.21803128]
 [77.82888684 72.17111316]
 [43.58417663 40.41582337]
 [15.04691812 13.95308188]
 [33.72585097 31.27414903]
 [47.21619135 43.78380865]
 [53.4425023  49.5574977 ]
 [50.84820607 47.15179393]
 [32.68813247 30.31186753]
 [47.7350506  44.2649494 ]
 [43.06531739 39.93468261]
 [56.03679853 51.96320147]]
The two samples are independent (do not reject null hypothesis)



In [23]:
time_sig = pd.crosstab(workout_party['time_signature'], workout_party['playlist'], margins = True)


observed = time_sig.iloc[0:4,0:2]   # Get table without totals for later use
print(time_sig, "\n")

chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)

print('time signature', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The two samples are independent (do not reject null hypothesis)\n")
else:
    print("The two samples are dependent (reject null hypothesis)\n")

playlist        party  workout   All
time_signature                      
1                   1        1     2
3                   7       10    17
4                 548      508  1056
5                   8        4    12
All               564      523  1087 

time signature 
--------------------
Chi-Squared: 1.834
P-Value: 0.607554
Expected: [[  1.03771849   0.96228151]
 [  8.82060718   8.17939282]
 [547.91536339 508.08463661]
 [  6.22631095   5.77368905]]
The two samples are independent (do not reject null hypothesis)



The tests show no evidence that any of the three categorical variables depends on whether a song is in the party or the workout playlist (we do not reject independence). However, a common rule of thumb for trusting chi-squared results is that every expected frequency should be at least 5. The time signature table has two expected frequencies below 5, both in the row for time signature 1, so we cannot trust the chi-squared result for this variable.
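
The expected-frequency check can be sketched as a small helper. The function name `low_expected_cells` and the default threshold are assumptions made for illustration; the counts are copied from the workout/party time signature crosstab above.

```python
import numpy as np
from scipy.stats import chi2_contingency

def low_expected_cells(observed, threshold=5):
    """Return the expected table and how many of its cells fall below threshold."""
    _, _, _, expected = chi2_contingency(observed)
    return expected, int((expected < threshold).sum())

# Workout/party time signature counts
# (rows: time signature 1, 3, 4, 5; columns: party, workout).
time_sig_counts = np.array([[1, 1],
                            [7, 10],
                            [548, 508],
                            [8, 4]])

expected, n_low = low_expected_cells(time_sig_counts)
print("Cells with expected count below 5:", n_low)
print("Smallest expected count: %.3f" % expected.min())
```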

The small expected frequencies come from the small number of songs with a time signature of 1. To fix this, I will test one time signature category against all the others combined. Since time signature 4 is the most common value, I will test it against the rest to see whether it depends on playlist genre. The pandas get_dummies method creates an indicator column for each time signature; the column for 4 equals 1 if the track has that time signature and 0 if it does not.
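
A boolean comparison is one alternative way to build the same "4 versus everything else" indicator. The sketch below uses a tiny made-up frame named `songs` that only mimics the two relevant columns; it is not the real dataset.

```python
import pandas as pd

# Tiny made-up frame with the two columns used here (not the real data).
songs = pd.DataFrame({
    "time_signature": [4, 4, 3, 4, 5, 4, 1, 4],
    "playlist": ["party", "workout", "party", "party",
                 "workout", "workout", "party", "workout"],
})

# True where the track has time signature 4, False for every other value.
is_four = (songs["time_signature"] == 4).rename("is_four")

# 2 x 2 table of counts per playlist; without margins, no slicing is needed later.
table = pd.crosstab(is_four, songs["playlist"])
print(table)
```

Leaving out `margins=True` means the table can be passed to the chi-squared test directly, without first removing the totals.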

In [61]:
dummies = pd.get_dummies(workout_party['time_signature'])
dummies.head()

   1  3  4  5
0  0  0  1  0
1  0  0  1  0
2  0  0  1  0
3  0  0  1  0
4  0  0  1  0


In [64]:
timesig_tab = pd.crosstab(dummies[4], workout_party['playlist'], margins = True)


observed = timesig_tab.iloc[0:2,0:2]  
print(timesig_tab)

chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)  # p_val, so the party subset p is not overwritten

print('time signature', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The samples are independent (do not reject null hypothesis)\n")
else:
    print("The samples are dependent (reject null hypothesis)\n")

playlist  party  workout   All
4                             
0            16       15    31
1           548      508  1056
All         564      523  1087
time signature 
--------------------
Chi-Squared: 0.023
P-Value: 0.879593
Expected: [[ 16.08463661  14.91536339]
 [547.91536339 508.08463661]]
The samples are independent (do not reject null hypothesis)




### Are variables significantly different across playlist genres?

Now let's test whether the playlists differ significantly from each other across all four genres.


##### Kruskal-Wallis H Test

The Kruskal-Wallis test checks for significant differences in a continuous dependent variable across the groups of a categorical independent variable (two or more groups). It is the non-parametric counterpart to the one-way ANOVA test.

H0: the distributions of all categories are equal.

H1: the distributions of one or more categories are not equal.


In [30]:
from scipy.stats import kruskal


features = ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 
            'liveness', 'loudness', 'speechiness', 'tempo', 'valence']

for feature in features:
    data1, data2, data3, data4 = w[feature], p[feature], c[feature], f[feature]
    stat, pval = kruskal(data1, data2, data3, data4)

    print(feature, "\n--------------------")
    print("Kruskal-Wallis H: %.3f" % stat)
    print("P-Value: %f" % pval)
    if pval > 0.05:
        print("Distributions are the same (do not reject null hypothesis)\n")
    else:
        print("Distributions are not the same (reject null hypothesis)\n")

acousticness 
--------------------
Kruskal-Wallis H: 973.123
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

danceability 
--------------------
Kruskal-Wallis H: 519.786
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

duration_ms 
--------------------
Kruskal-Wallis H: 98.430
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

energy 
--------------------
Kruskal-Wallis H: 1121.449
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

instrumentalness 
--------------------
Kruskal-Wallis H: 1137.366
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

liveness 
--------------------
Kruskal-Wallis H: 71.343
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

loudness 
--------------------
Kruskal-Wallis H: 1197.174
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

speechiness 
--------------------
Kruskal-Wallis H: 507.99

We reject the null hypothesis for all the variables. For each one, the distribution differs in at least one playlist genre, although the test does not tell us which genres differ. These variables are therefore good candidate features for our model.
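
The Kruskal-Wallis test is an omnibus test, so a significant result does not say which playlists differ. One possible follow-up, which this notebook does not run, is a pairwise Mann-Whitney U test with a Bonferroni correction. The sketch below uses synthetic data; the `groups` dictionary and its values are made up for illustration.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

# Synthetic stand-ins for one feature in the four playlists (not the real data).
rng = np.random.default_rng(42)
groups = {
    "workout": rng.normal(0.80, 0.10, 150),
    "party": rng.normal(0.78, 0.10, 150),
    "chill": rng.normal(0.45, 0.15, 150),
    "focus": rng.normal(0.30, 0.15, 150),
}

# Omnibus test: does at least one playlist differ?
h_stat, h_pval = kruskal(*groups.values())
print("Kruskal-Wallis H: %.3f, P-Value: %f" % (h_stat, h_pval))

# Follow-up: compare every pair and multiply each p-value by the
# number of comparisons (Bonferroni), capping the result at 1.
pairs = list(combinations(groups, 2))
adjusted = {}
for a, b in pairs:
    _, pval = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    adjusted[(a, b)] = min(pval * len(pairs), 1.0)
    print("%s vs %s: adjusted P-Value %f" % (a, b, adjusted[(a, b)]))
```

With four playlists there are six pairs, so each raw p-value is multiplied by 6 before it is compared with 0.05.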

The Kruskal-Wallis test requires a continuous or ordinal variable, so we will use a different test for our three categorical variables.



##### Chi-Squared Test

H0: the samples are independent.

H1: there is a dependency between the samples.



In [8]:
mode_tab = pd.crosstab(spotify_songs['mode'], spotify_songs['playlist'], margins = True)


observed = mode_tab.iloc[0:2,0:4]   # Get table without totals for later use
print(mode_tab)

playlist  chill  focus  party  workout   All
mode                                        
0           145    183    224      233   785
1           392    342    340      290  1364
All         537    525    564      523  2149


In [12]:
chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)

print('mode', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The samples are independent (do not reject null hypothesis)\n")
else:
    print("The samples are dependent (reject null hypothesis)\n")

mode 
--------------------
Chi-Squared: 38.642
P-Value: 0.000000
Expected: [[196.15867846 191.7752443  206.0214053  191.04467194]
 [340.84132154 333.2247557  357.9785947  331.95532806]]
The samples are dependent (reject null hypothesis)



In [14]:
key_tab = pd.crosstab(spotify_songs['key'], spotify_songs['playlist'], margins = True)


observed = key_tab.iloc[0:12,0:4]  
print(key_tab)

playlist  chill  focus  party  workout   All
key                                         
0            63     68     75       46   252
1            39     56     82       68   245
2            57     39     35       49   180
3            24     25     16       13    78
4            45     38     35       30   148
5            50     50     45       46   191
6            38     39     53       50   180
7            65     71     47       51   234
8            37     42     32       31   142
9            41     38     51       41   171
10           33     26     44       39   142
11           45     33     49       59   186
All         537    525    564      523  2149


In [15]:
chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)

print('key', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The samples are independent (do not reject null hypothesis)\n")
else:
    print("The samples are dependent (reject null hypothesis)\n")

key 
--------------------
Chi-Squared: 64.504
P-Value: 0.000842
Expected: [[62.97068404 61.56351792 66.13680782 61.32899023]
 [61.22149837 59.8534202  64.29967427 59.62540717]
 [44.97906003 43.97394137 47.24057701 43.80642159]
 [19.49092601 19.05537459 20.47091671 18.98278269]
 [36.98278269 36.15635179 38.84225221 36.01861331]
 [47.72778036 46.66123779 50.12750116 46.48348069]
 [44.97906003 43.97394137 47.24057701 43.80642159]
 [58.47277804 57.16612378 61.41275012 56.94834807]
 [35.48348069 34.69055375 37.26756631 34.55839926]
 [42.73010703 41.7752443  44.87854816 41.61610051]
 [35.48348069 34.69055375 37.26756631 34.55839926]
 [46.47836203 45.43973941 48.81526291 45.26663564]]
The samples are dependent (reject null hypothesis)



In [32]:
timesig_tab = pd.crosstab(spotify_songs['time_signature'], spotify_songs['playlist'], margins = True)


observed = timesig_tab.iloc[0:4,0:4]  
print(timesig_tab)

playlist        chill  focus  party  workout   All
time_signature                                    
1                   5     12      1        1    19
3                  36     97      7       10   150
4                 485    393    548      508  1934
5                  11     23      8        4    46
All               537    525    564      523  2149


In [33]:
chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)

print('time signature', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The samples are independent (do not reject null hypothesis)\n")
else:
    print("The samples are dependent (reject null hypothesis)\n")

time signature 
--------------------
Chi-Squared: 200.912
P-Value: 0.000000
Expected: [[  4.74778967   4.64169381   4.98650535   4.62401117]
 [ 37.48255002  36.64495114  39.36714751  36.50535133]
 [483.27501163 472.47557003 507.57375523 470.6756631 ]
 [ 11.49464867  11.23778502  12.0725919   11.19497441]]
The samples are dependent (reject null hypothesis)



The chi-squared tests indicate a significant relationship between playlist genre and each of mode, key, and time signature. However, the time signature table has four expected frequencies below 5 (the entire row for time signature 1). Therefore, we cannot trust the chi-squared result for this variable.

Because of this, we can run a coarser test instead. Since time signature 4 is the most common value, I will test it against all other values combined to see whether it depends on playlist genre. As before, I will use the pandas get_dummies method to create an indicator column for time signature 4, which equals 1 if the track has that time signature and 0 if it does not.

In [56]:
dummies = pd.get_dummies(spotify_songs['time_signature'])
dummies.head()

   1  3  4  5
0  0  0  1  0
1  0  0  1  0
2  0  0  1  0
3  0  0  1  0
4  0  0  1  0


In [58]:
timesig_tab = pd.crosstab(dummies[4], spotify_songs['playlist'], margins = True)


observed = timesig_tab.iloc[0:2,0:4]  
print(timesig_tab)

chi2, p_val, dof, exp = stats.chi2_contingency(observed= observed)

print('time signature', "\n--------------------")
print("Chi-Squared: %.3f" % chi2)
print("P-Value: %f" % p_val)
print("Expected:", exp)
if p_val > 0.05:
    print("The samples are independent (do not reject null hypothesis)\n")
else:
    print("The samples are dependent (reject null hypothesis)\n")

playlist  chill  focus  party  workout   All
4                                           
0            52    132     16       15   215
1           485    393    548      508  1934
All         537    525    564      523  2149
time signature 
--------------------
Chi-Squared: 195.453
P-Value: 0.000000
Expected: [[ 53.72498837  52.52442997  56.42624477  52.3243369 ]
 [483.27501163 472.47557003 507.57375523 470.6756631 ]]
The samples are dependent (reject null hypothesis)



The results show a significant relationship between playlist genre and whether a track has a time signature of 4 rather than any other time signature.



### Conclusion:

Our tests show that all our variables vary significantly across playlist genres, so all of the features are good candidates for our model.

However, duration, liveness, and speechiness are not significantly different between the workout and party playlists, and none of the three categorical variables (mode, key, time signature) showed a dependence on playlist within that pair. This will be good to keep in mind when evaluating our model.