In [20]:
import pandas as pd

#import dataset
spotify_songs = pd.read_csv('data/final_spotify_data.csv')

#subset data based on playlist genre
w = spotify_songs.loc[spotify_songs['playlist'] == 'workout']
c = spotify_songs.loc[spotify_songs['playlist'] == 'chill']
p = spotify_songs.loc[spotify_songs['playlist'] == 'party']
f = spotify_songs.loc[spotify_songs['playlist'] == 'focus']

# Inferential Statistics

## Are the variable significant in terms of predicting Playlist genre?

We will use inferential statistics to determine which variables will be good candidates for our models. To do this, I will use to hypothesis testing to prove that the variables are significantly different across the playlist categories. 

### Are workout and party playlists significantly different from each other?

In the data storytelling stage, we found that there were many duplicate songs shared between Workout and Party playlists. This might indicate that the two playlists are very similar. I want to perform hypothesis testing to confirm if that two playlists are significantly different from each other. 

##### Two sample T test:

H0: the means of the samples are the same.

H1: the means of the samples are not the same.


In [47]:
from scipy import stats
from scipy.stats import ttest_ind


features = ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 
            'liveness', 'loudness', 'speechiness', 'tempo', 'valence']

for feature in features:
    data1, data2 = w[feature], p[feature]
    stat, pval = ttest_ind(data1, data2)

    print(feature, "\n--------------------")
    print("T Statistic: %.3f" % stat)
    print("P-Value: %f" % pval)
    if pval > 0.05:
        print("The means are the same (do not reject null hypothesis)\n")
    else:
        print("The means are not the same (reject null hypothesis)\n")

acousticness 
--------------------
T Statistic: -7.784
P-Value: 0.000000
The means are not the same (reject null hypothesis)

danceability 
--------------------
T Statistic: -6.753
P-Value: 0.000000
The means are not the same (reject null hypothesis)

duration_ms 
--------------------
T Statistic: 0.903
P-Value: 0.366590
The means are the same (do not reject null hypothesis)

energy 
--------------------
T Statistic: 7.677
P-Value: 0.000000
The means are not the same (reject null hypothesis)

instrumentalness 
--------------------
T Statistic: 3.432
P-Value: 0.000623
The means are not the same (reject null hypothesis)

liveness 
--------------------
T Statistic: 0.322
P-Value: 0.747544
The means are the same (do not reject null hypothesis)

loudness 
--------------------
T Statistic: 6.251
P-Value: 0.000000
The means are not the same (reject null hypothesis)

speechiness 
--------------------
T Statistic: 1.177
P-Value: 0.239262
The means are the same (do not reject null hypothesis)

t

The results from the t-test show that duration, liveness, and speechiness are not significantly different between workout and party playlist.

### Are variables significantly different across playlist genres?

Now lets test if all playlists are significantly different from each other. 



##### Kruskal-Wallis H Test

The Kruskal-Wallis test assesses for significant differences on a continuous dependent variable by a categorical independent variable (with two or more groups). It is the non-paramteric counterpart to ANOVA test.

H0: the distributions of all categories are equal.

H1: the distributions of one or more categories are not equal.


In [48]:
from scipy.stats import kruskal


features = ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 
            'liveness', 'loudness', 'speechiness', 'tempo', 'valence']

for feature in features:
    data1, data2, data3, data4 = w[feature], p[feature], c[feature], f[feature]
    stat, pval = kruskal(data1, data2, data3, data4)

    print(feature, "\n--------------------")
    print("Kruskal-Wallis H: %.3f" % stat)
    print("P-Value: %f" % pval)
    if pval > 0.05:
        print("Distributions are the same (do not reject null hypothesis)\n")
    else:
        print("Distributions are not the same (reject null hypothesis)\n")

acousticness 
--------------------
Kruskal-Wallis H: 973.123
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

danceability 
--------------------
Kruskal-Wallis H: 519.786
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

duration_ms 
--------------------
Kruskal-Wallis H: 98.430
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

energy 
--------------------
Kruskal-Wallis H: 1121.449
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

instrumentalness 
--------------------
Kruskal-Wallis H: 1137.366
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

liveness 
--------------------
Kruskal-Wallis H: 71.343
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

loudness 
--------------------
Kruskal-Wallis H: 1197.174
P-Value: 0.000000
Distributions are not the same (reject null hypothesis)

speechiness 
--------------------
Kruskal-Wallis H: 507.99

We reject the null hypothesis for all the variables. There is a significant difference between all genre categories. Therefore, we can conclude that all these variables will be good features to include in our model. 

Kruskal-Wallis is only valid for continuous variables. We will use a different test for our three categorical variables.



##### Chi-Squared Test

H0: the two samples are independent.

H1: there is a dependency between the samples.



In [62]:
mode_tab = pd.crosstab(spotify_songs['mode'], spotify_songs['playlist'], margins = True)


observed = mode_tab.iloc[0:5,0:3]   # Get table without totals for later use
print(mode_tab)

playlist  chill  focus  party  workout   All
mode                                        
0           145    183    224      233   785
1           392    342    340      290  1364
All         537    525    564      523  2149


In [64]:
chi2, p, dof, exp = stats.chi2_contingency(observed= observed)

print('mode', "\n--------------------")
print("Chi-Sqaured: %.3f" % chi2)
print("P-Value: %f" % p)
if p > 0.05:
    print("The two samples are independent (do not reject null hypothesis)\n")
else:
    print("The two samples are dependent (reject null hypothesis)\n")

mode 
--------------------
Chi-Sqaured: 20.117
P-Value: 0.000473
The two samples are dependent (reject null hypothesis)



In [65]:
key_tab = pd.crosstab(spotify_songs['key'], spotify_songs['playlist'], margins = True)


observed = key_tab.iloc[0:5,0:3]  

In [66]:
chi2, p, dof, exp = stats.chi2_contingency(observed= observed)

print('key', "\n--------------------")
print("Chi-Sqaured: %.3f" % chi2)
print("P-Value: %f" % p)
if p > 0.05:
    print("The two samples are independent (do not reject null hypothesis)\n")
else:
    print("The two samples are dependent (reject null hypothesis)\n")

key 
--------------------
Chi-Sqaured: 25.792
P-Value: 0.001140
The two samples are dependent (reject null hypothesis)



In [67]:
timesig_tab = pd.crosstab(spotify_songs['time_signature'], spotify_songs['playlist'], margins = True)


observed = timesig_tab.iloc[0:5,0:3]  

In [68]:
chi2, p, dof, exp = stats.chi2_contingency(observed= observed)

print('time signature', "\n--------------------")
print("Chi-Sqaured: %.3f" % chi2)
print("P-Value: %f" % p)
if p > 0.05:
    print("The two samples are independent (do not reject null hypothesis)\n")
else:
    print("The two samples are dependent (reject null hypothesis)\n")

time signature 
--------------------
Chi-Sqaured: 134.510
P-Value: 0.000000
The two samples are dependent (reject null hypothesis)



The chi-squared test shows a significant relationship between mode, key, time signature and playlist genre. Therefore, these categorical variables are good candidates for our model. 

### Conclusion:

Our tests show that all our variables significantly vary based on playlist genre. All of the features are good candidates for our model.

However, we did see that duration, liveness, and speechiness are not significantly different between workout and party playlists. This will be good to keep in mind when evaluating our model. 