# <center>Saving Frat Parties, One Bop at a Time</center>
<center>Aleem Virani</center>

## <center>Introduction</center>
Fraternity parties: a staple of the modern college experience. Every college student seems to share this experience, no matter what college they go to. On television, these parties are portrayed as unforgettable, with DJs playing hit after hit, students dancing on tables, and everybody smiling and having a good time. However, after conducting my own research and talking with some of my more "outgoing" friends, I found that this isn't the case. Sure, people do have a good time, but the good times are short-lived. Columbia student Huber Gonzalez documented <a href = https://www.columbiaspectator.com/spectrum/2016/09/22/your-first-frat-party-play-play-timeline-if-youve-never-been-frat-party-its-time/>his experience</a> at a party at Columbia. In his words, the "turn-up" started at 12:04 am and was quickly followed by the "realization" at 12:43 am where, according to Gonzalez, "...the frat's completely out of booze, the music starts to suck, and everyone's ready to bounce the f*** outta there." In about 40 minutes, the "party" is seemingly over. Assuming an average song length of around 3 minutes, the party ended after 13 songs. If things don't change fast, the phrase "Party All Night" will be as relevant to the next generation as the fax machine is to this generation, and who better to lead this change than a third-year Computer Science nerd with asthma? In this project, I use the Spotify API and Spotipy to look at which audio features make up party songs, and I build a classification model to help people pick out good songs to play at a party.

## <center>Setup</center>

For this project, we need to import and install multiple libraries: <a href = https://spotipy.readthedocs.io/en/2.21.0/>Spotipy</a>, <a href = https://pandas.pydata.org/pandas-docs/stable/>Pandas</a>, <a href = https://seaborn.pydata.org/>Seaborn</a>, 
<a href = https://docs.scipy.org/doc/numpy/user/>Numpy</a>, <a href = http://scikit-learn.org/stable/documentation.html>Scikit-learn</a>, and more.

In [362]:
%%capture
pip install spotipy --upgrade

In [363]:
%%capture
pip install lazypredict

In [364]:
#Used to interact with Spotify API
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
#Used for classification
from xgboost import XGBClassifier
from lazypredict.Supervised import LazyClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC
from sklearn.metrics import precision_score
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

Now that we have our tools, the next thing we have to do is be able to make calls to the Spotify API through Spotipy. To do this, we need a client id and a client secret key. Both of these can be obtained by registering an app through the <a href = https://developer.spotify.com/dashboard/login>Spotify for Developers</a> portal.

In [365]:
#Create a Spotipy object to make Spotify API calls
auth_manager = SpotifyClientCredentials(client_id='client id',client_secret='client secret')
sp = spotipy.Spotify(auth_manager=auth_manager)
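
Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch (not part of the original notebook), the credentials can be read from environment variables instead. The variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy looks for by default; the helper `load_spotify_credentials` is a hypothetical name introduced here for illustration.

```python
import os

def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper: it raises a clear error up front instead of
    failing later inside an API call when a variable is missing.
    """
    env = os.environ if env is None else env
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Usage (needs real credentials, so it is left commented out):
# client_id, client_secret = load_spotify_credentials()
# auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
# sp = spotipy.Spotify(auth_manager=auth_manager)
```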

Great! Now that we have a Spotipy object, we can start collecting our data.

## <center>Data Collection</center>

The first step of this process is to obtain songs that would play at a fraternity party. To get the songs, I simply searched "Fraternity Playlists" on the Spotify Web Player and stored the playlist IDs in a list. From this, I used the Spotipy object to get the name of each song and stored it, along with the song ID, in a Data Frame. I also added a column "Frat" and filled it with 1s to indicate that these are good songs to play at a party.

In [366]:
frat_playlists = ['12mcHHoTRZF6cAxfjAvMPP', '3Gd67DHBoA9QRSYk2hhHdq', '2KfDfNcRVrNVYgU46eJYcJ', '2mmdzFwPVURPPeWSG2Gadh', '08SEeX1N03RuaRAmKhin8F', '1ea6YoJmz0eaKzIxjW5PPQ', '1RQ3vAw4gEmLeJE96faCVc']

In [367]:
songs = []
#Collect the IDs of all the songs present in these playlists and store them in an array
for p in frat_playlists:
    x = sp.playlist_tracks(p)
    for item in x['items']:
        #Skip entries whose track or ID is missing (e.g. removed tracks)
        if item['track'] and item['track']['id']:
            songs.append(item['track']['id'])
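
One caveat: `playlist_tracks` returns a single page of results (at most 100 items by default), so longer playlists are cut off. A hedged sketch of walking through every page follows. `collect_track_ids` is a hypothetical helper, and `client` is assumed to behave like the `sp` object above, exposing `playlist_tracks` and `next`.

```python
def collect_track_ids(client, playlist_id):
    """Collect track IDs from every page of a playlist.

    `client` is assumed to expose `playlist_tracks` and `next` the way a
    spotipy.Spotify object does.
    """
    ids = []
    page = client.playlist_tracks(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            # Some entries can come back without a usable track or ID
            if track and track.get("id"):
                ids.append(track["id"])
        # `next` returns the following page, or None after the last one
        page = client.next(page)
    return ids
```

In the loop above, `songs.extend(collect_track_ids(sp, p))` could stand in for the inner loop.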

In [368]:
#Got rid of all duplicate IDs by transforming the list to a set and back
songs = list(set(songs))

In [369]:
#Create empty DataFrame
df = pd.DataFrame()

In [None]:
#Get the title of each of these songs and add them to the Data Frame
title = []
for i in songs:
    t = sp.track(i)
    title.append(t['name'])
#Add the ID and title of the tracks to the Data Frame
df['ID'] = songs
df['Title'] = title
#Mark each of these entries as 1 to indicate that these are good party songs and add to the Data Frame
df['Frat'] = [1] * len(df)

The next step of this process was to get the audio features of each song in the Data Frame and add them to it. This is an important step, as these features will help determine whether a song is good for a party or not.

In [None]:
#Go through each of the rows in the DF and get the danceability, energy, loudness, speechiness, 
#acousticness, instrumentalness, liveness, valence, and tempo of each song
danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo = [],[],[],[],[],[],[],[],[]
for index, row in df.iterrows():
    #Make a central call to the Spotipy API to get the audio features of each track
    info = sp.audio_features(tracks = [df.loc[index, 'ID']])[0]
    danceability.append(info['danceability'])
    energy.append(info['energy'])
    loudness.append(info['loudness'])
    speechiness.append(info['speechiness'])
    acousticness.append(info['acousticness'])
    instrumentalness.append(info['instrumentalness'])
    liveness.append(info['liveness'])
    valence.append(info['valence'])
    tempo.append(info['tempo'])

#Add these metrics to the Data Frame
df['danceability'] = danceability
df['energy'] = energy
df['loudness'] = loudness
df['speechiness'] = speechiness
df['acousticness'] = acousticness
df['instrumentalness'] = instrumentalness
df['liveness'] = liveness
df['valence'] = valence
df['tempo'] = tempo
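
Requesting features one track at a time costs one API call per song. Since `audio_features` accepts a list of IDs (up to 100 per call, according to the Spotipy documentation), the calls can be batched. The sketch below is one possible way to do it, not the approach used in this notebook; `fetch_audio_features` and `FEATURES` are hypothetical names, and `client` is assumed to behave like `sp`.

```python
FEATURES = ["danceability", "energy", "loudness", "speechiness", "acousticness",
            "instrumentalness", "liveness", "valence", "tempo"]

def fetch_audio_features(client, track_ids, batch_size=100):
    """Return one dict of audio features per track ID, in the same order.

    Tracks for which no features come back get None for every feature.
    """
    rows = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        for info in client.audio_features(tracks=batch):
            info = info or {}  # an entry may be None when features are unavailable
            rows.append({name: info.get(name) for name in FEATURES})
    return rows
```

The resulting list can be turned into columns with `pd.DataFrame(rows)` and joined to the song Data Frame with `pd.concat(..., axis=1)`, assuming both share the same default index.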

In [None]:
#Output party song Data Frame
df

Now that we have created a Data Frame for songs that would be good to play at a frat party, it's time to do the same thing for songs that would be __bad__ to play at a frat party. Besides figuring out how Spotipy worked, this was the hardest part of the project. What would count as a bad song? I ended up crowdsourcing some answers and found that "bad" songs include songs written by Taylor Swift, sad songs, slow songs, songs written by Olivia Rodrigo, country songs, and Kidz Bop. As I did before, I created a new Data Frame that incorporates these selections and filled it with the song ID, the song name, an indicator showing that these songs are not meant for a party, and the audio features of each track. The one step that differed from the previous Data Frame creation was that I removed any "bad" track that also appeared in the "good" list of songs. This was to make sure that each song belonged to either the "good" section or the "bad" section, and not both.

In [None]:
bad_music = ['5ksaUaYEgnywCzO6nmAIwN', '6duuzFPn741MPpmaurNbH1', '37i9dQZF1EQmPV0vrce2QZ', '37i9dQZF1DWWEJlAGA9gs0', '4kStQdar45aPq6v97qT2Dc', '3a6Rd7GxLcl6ZCSfy2B0oL', '1Tsa6hKcC2TIJ6ZcbsEhNx', '37i9dQZF1DZ06evO0WqnZe']

In [None]:
bad_songs = []
#Collect the IDs of all the songs present in these playlists and store them in an array
for p in bad_music:
    x = sp.playlist_tracks(p)
    for item in x['items']:
        #Skip entries whose track or ID is missing (e.g. removed tracks)
        if item['track'] and item['track']['id']:
            bad_songs.append(item['track']['id'])

In [None]:
#If a bad song also appears as a good song, drop it from bad_songs.
#Build a new list instead of popping while iterating, which skips elements.
good_ids = set(songs)
bad_songs = [i for i in bad_songs if i not in good_ids]
#Got rid of all duplicate IDs by transforming the list to a set and back
bad_songs = list(set(bad_songs))

In [None]:
#Create a Bad Data Frame for bad songs to play at a party 
bdf = pd.DataFrame()

In [None]:
#Get the title of each of these songs and add them to the Bad Data Frame
bad_title = []
for i in bad_songs:
    t = sp.track(i)
    bad_title.append(t['name'])
#Add the ID and title of the tracks to the Bad Data Frame
bdf['ID'] = bad_songs 
bdf['Title'] = bad_title
#Mark each of these entries as 0 to indicate that these are bad party songs and add to the Bad Data Frame
bdf['Frat'] = [0]*len(bdf)

In [None]:
#Bad Data Frame
bdf

In [None]:
#Go through each of the rows in the Bad DF and get the danceability, energy, loudness, speechiness, 
#acousticness, instrumentalness, liveness, valence, and tempo of each song
danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo = [],[],[],[],[],[],[],[],[]
for index, row in bdf.iterrows():
    #Make a central call to the Spotipy API to get the audio features of each track
    info = sp.audio_features(tracks = [bdf.loc[index, 'ID']])[0]
    danceability.append(info['danceability'])
    energy.append(info['energy'])
    loudness.append(info['loudness'])
    speechiness.append(info['speechiness'])
    acousticness.append(info['acousticness'])
    instrumentalness.append(info['instrumentalness'])
    liveness.append(info['liveness'])
    valence.append(info['valence'])
    tempo.append(info['tempo'])
#Add these metrics to the Bad Data Frame
bdf['danceability'] = danceability
bdf['energy'] = energy
bdf['loudness'] = loudness
bdf['speechiness'] = speechiness
bdf['acousticness'] = acousticness
bdf['instrumentalness'] = instrumentalness
bdf['liveness'] = liveness
bdf['valence'] = valence
bdf['tempo'] = tempo

Now that we have our two Data Frames, it's time to combine them into one main Data Frame so we can start with our data exploration stage.

In [None]:
#Good df
df

In [None]:
#Bad df
bdf

In [None]:
#Combine dfs (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, bdf], ignore_index=True)

In [None]:
#Combined df
df

## <center>Data Exploration</center>

For this section, I wanted to go over the audio features of the songs found within the Data Frame. The first thing I wanted to know was if certain features were more influential than others when determining if a song is good for a party. To do this, I decided to create a correlation matrix and heatmap of the correlation matrix. 

In [None]:
sns.set_theme()

In [None]:
#Heatmap of correlation matrix
sns.heatmap(df.corr(numeric_only=True))

In [None]:
#Correlation matrix of df
df.corr(numeric_only=True)

To me, a strong correlation is anything above 0.4 in absolute value. Based on this, the attributes most strongly correlated with the "Frat" label are __"danceability"__, __"energy"__, __"loudness"__, __"speechiness"__, and __"acousticness"__. With this in mind, let's look at the distribution of these attributes across the fraternity songs and non-fraternity songs within our Data Frame.
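
Reading the strongest features off a heatmap by eye is error-prone, so the ranking can also be computed directly. The sketch below uses made-up toy values (not the notebook's data) and the same 0.4 cutoff; the names `toy`, `ranked`, and `strong` are introduced here for illustration. Taking the absolute value matters because a feature can be strongly but negatively correlated with the label.

```python
import pandas as pd

# Made-up toy values, for illustration only (not the notebook's data)
toy = pd.DataFrame({
    "Frat":         [1, 1, 1, 1, 0, 0, 0, 0],
    "energy":       [0.9, 0.8, 0.85, 0.7, 0.3, 0.4, 0.2, 0.5],
    "acousticness": [0.1, 0.2, 0.05, 0.3, 0.8, 0.7, 0.9, 0.6],
    "tempo":        [120, 100, 110, 130, 125, 105, 115, 128],
})

# Correlation of each feature with the label (the label itself is dropped)
corr_with_label = toy.corr()["Frat"].drop("Frat")

# Rank by absolute strength and keep the features above the 0.4 cutoff
ranked = corr_with_label.abs().sort_values(ascending=False)
strong = ranked[ranked > 0.4].index.tolist()

print(corr_with_label)
print(strong)
```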

In [None]:
display(sns.histplot(data = df, x = 'danceability', hue = 'Frat').set(title = 'Histogram of Danceability of Songs'))

The main thing I saw in the histogram above was that the danceability of both frat songs and non-frat songs is left-skewed. However, it seems that, on average, the danceability of frat songs is higher than that of non-frat songs. Overall, while both classes of songs are danceable, you would find more people dancing to a frat song.
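
Skew and averages judged from a histogram can be double-checked numerically. As a small sketch on made-up values (not the notebook's data), grouping by the label and aggregating gives the mean, median, and skewness per class. A negative skew value means left-skewed and a positive one means right-skewed.

```python
import pandas as pd

# Made-up toy values, for illustration only
toy = pd.DataFrame({
    "Frat": [1] * 6 + [0] * 6,
    "danceability": [0.9, 0.85, 0.8, 0.82, 0.88, 0.4,   # long tail to the left
                     0.2, 0.25, 0.3, 0.22, 0.28, 0.9],  # long tail to the right
})

# One row per class: negative skew = left-skewed, positive skew = right-skewed
summary = toy.groupby("Frat")["danceability"].agg(["mean", "median", "skew"])
print(summary)
```

The same call on the real Data Frame, with any of the audio feature columns in place of `danceability`, would give the numbers behind the histograms in this section.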

In [None]:
display(sns.histplot(data = df, x = 'energy', hue = 'Frat').set(title = 'Histogram of Energy of Songs'))

In this histogram, it seems that the energy of frat songs is left-skewed, while the energy of non-frat songs is slightly right-skewed. This means that non-frat songs tend to sit on the lower end of the energy scale, while frat songs tend to have higher energy. This doesn't surprise me, as the whole reason people play frat songs at a party is to get people active and moving.

In [None]:
display(sns.histplot(data = df, x = 'loudness', hue = 'Frat').set(title = 'Histogram of Loudness of Songs'))

Here, non-frat songs are left-skewed while frat songs are roughly normally distributed. The loudness of non-frat songs varies widely, with values ranging from -40 to -4 dB, while frat songs range from -12 to 0 dB. Lastly, frat songs tend to be louder than non-frat songs.

In [None]:
display(sns.histplot(data = df, x = 'speechiness', hue = 'Frat').set(title = 'Histogram of Speechiness of Songs'))

In this histogram, it's clear that non-frat songs are right-skewed and barely speechy at all. This is in stark contrast to frat songs, which are also right-skewed but far more spread out, with speechiness varying between 0.04 and 0.5.

In [None]:
display(sns.histplot(data = df, x = 'acousticness', hue = 'Frat').set(title = 'Histogram of Acousticness of Songs'))

Lastly, in this histogram, non-frat songs seem to be bimodal while frat songs are unimodal and skewed to the right. This tells me that the acousticness of frat songs tends to be on the lower end, while the acousticness of non-frat songs tends to be either very low or very high. From this data exploration, we now know that fraternity songs tend to be loud, high in energy, somewhat speechy, danceable, and low in acousticness. I was surprised that tempo wasn't a factor in this classification, as I thought that faster songs would be more favorable at parties, but I guess that this is not the case.


## <center>Song Classification</center>

Now that we've taken a look at which factors influence the classification of a song the most, the question remains: "Can we come up with a way to determine what songs are good to play at parties?" The answer, of course, is yes, with the help of a classification model. The first step in this process is to create a train-test split for the data. Here, I decided not to use every attribute that we collected earlier, but rather the 5 attributes that we found to be most impactful when deciding if a song is good to play at a frat party.

In [None]:
#Target is the frat column
y = df['Frat']
#Classify on 'danceability', 'energy', 'loudness', 'speechiness', and 'acousticness'
X = df[['danceability', 'energy', 'loudness', 'speechiness','acousticness']]
#Use 20% of the data to test and the rest to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
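
If the two classes are not equally represented, a purely random split can leave the test set with a different class balance than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below runs on made-up data (`X_toy` and `y_toy` are hypothetical) to show that the class ratio is preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up toy data: 70 non-frat songs (0) and 30 frat songs (1)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 5))
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy keeps the 70/30 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of frat songs in each split
print(y_tr.mean(), y_te.mean())
```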

The next step in the process is to train a model with this data. However, I had no idea which model would be best in this scenario. To solve this, I decided to use LazyClassifier to run a bunch of different classification models and pick the best one to use on the data.

In [None]:
#Create and deploy LazyClassifier model
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)
models

The top four models that the LazyClassifier displayed were GaussianNB, NearestCentroid, SGDClassifier, and SVC, with accuracies of 0.87, 0.86, 0.87, and 0.87 respectively. This is good, but I wanted to see whether using all the data to classify would give better accuracy.

In [None]:
#Target is the frat column
y = df['Frat']
#Use all the Spotify attributes to classify a song
X = df.drop(['Title', 'ID', 'Frat'], axis = 1)
#Use 20% of the data to test and the rest to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

In [None]:
#Create and deploy LazyClassifier model
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)
models

It turns out that by using all the data, the results do improve. In this case, the top models were SVC, NuSVC, LinearSVC, and RandomForestClassifier, with accuracies of 0.9, 0.89, 0.89, and 0.89 respectively. Another way to potentially improve the classification results was to standardize the data. The general formula I used was to subtract the column's mean from each value in the column and divide by the column's standard deviation.

In [None]:
#For each column, subtract the mean from each value in the column and divide it by the standard 
#deviation of the column.
feature_cols = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
                'instrumentalness', 'liveness', 'valence', 'tempo']
df[feature_cols] = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
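
The formula applied to each column is the z-score, $z = (x - \mu) / \sigma$. The cell above computes $\mu$ and $\sigma$ over the full Data Frame, so the rows that later end up in the test set influence the scaling. A common alternative, shown here only as a sketch on made-up numbers, is to compute the statistics on the training rows and reuse them for the test rows. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` uses the population version (`ddof=0`), so the two give slightly different values.

```python
import pandas as pd

# Made-up toy values, for illustration only
train = pd.DataFrame({"tempo": [100.0, 120.0, 140.0], "energy": [0.2, 0.5, 0.8]})
test = pd.DataFrame({"tempo": [130.0], "energy": [0.5]})

# Statistics come from the training rows only
mu = train.mean()
sigma = train.std()  # sample standard deviation (ddof=1)

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma  # reuse the training mean and std

print(train_z)
print(test_z)
```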

In [None]:
#Standardized DF
df

Now that we have standardized the data, let's repeat the steps above and see whether or not the models have improved.

In [None]:
#Target is the frat column
y = df['Frat']
#Classify on 'danceability', 'energy', 'loudness', 'speechiness', and 'acousticness'
X = df[['danceability', 'energy', 'loudness', 'speechiness','acousticness']]
#Use 20% of the data to test and the rest to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

In [None]:
#Create and deploy LazyClassifier model
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)
models

In [None]:
#Target is the frat column
y = df['Frat']
#Use all the Spotify attributes to classify a song
X = df.drop(['Title', 'ID', 'Frat'], axis = 1)
#Use 20% of the data to test and the rest to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

In [None]:
#Create and deploy LazyClassifier model
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)
models

Overall, neither the top classifiers nor their scores changed at all, which leads me to conclude that standardizing the data didn't really matter. Again, the models in which every attribute was used tended to yield better results.

Although these models yielded great results, the key metric that I really care about here is precision, as I want to find what proportion of predicted frat songs was actually correct. I decided to take the top 4 models that yielded the best results in the classification above and measure the precision of each of them.

The first model I tested was NuSVC:

In [None]:
#Create and test NuSVC model
clf = make_pipeline(StandardScaler(), NuSVC())
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
precision_score(y_test, y_predict, average='micro')

Here, the precision score of this model was 89.03%.
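
A note on `average='micro'`: for a single-label problem like this one, micro-averaged precision counts every correct prediction across both classes, which makes it equal to plain accuracy. The precision of the frat class alone (of the songs predicted to be frat songs, how many really are) is what `precision_score` returns with its default `average='binary'`. The sketch below uses made-up labels, not the notebook's predictions, to show the difference.

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels for illustration: 1 = frat song, 0 = non-frat song
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

micro = precision_score(y_true, y_pred, average='micro')
binary = precision_score(y_true, y_pred)  # default: precision of class 1 only
accuracy = accuracy_score(y_true, y_pred)

print(micro, accuracy)  # these two are equal (7 of 10 predictions are correct)
print(binary)           # 2 true positives out of 4 predicted frat songs
```

On the real data, calling `precision_score(y_test, y_predict)` without the `average` argument would report the frat-class precision, which may differ from the micro-averaged scores quoted in this section.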

The next model I tested was Random Forest Classifier:

In [None]:
#Create and test RandomForestClassifier model
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
precision_score(y_test, y_predict, average='micro')

From this, the precision score of this model was 89.03%.

The next model I tested was SVC:

In [None]:
#Create and test SVC model
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
precision_score(y_test, y_predict, average='micro')

Here, the precision score of this model was 89.87%: a small improvement, under one percentage point, over the previous models.

The last model I checked was LinearSVC:

In [None]:
#Create and test LinearSVC model
clf = make_pipeline(StandardScaler(),LinearSVC())
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
precision_score(y_test, y_predict, average='micro')

Here, the precision score of this model was 89.03%.

Overall, it seems that the best model to use here is <a href = https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC>SVC</a>, as it has the highest precision out of all the models tested.

## <center>Conclusion</center>
It turns out that making a song, let alone a hit song, is a lot deeper than people expect it to be. Before this project, I had no idea about the metrics that Spotify stores for each song. However, with the use of these metrics, we were able not only to identify defining characteristics of popular party songs, but also to use this information and more to create a model that can help DJs and hosts everywhere curate playlists that keep partygoers happy all over the world. Although the approach of this project was to determine good songs to play at a party, the project is flexible enough for you to adapt it to your personal needs, such as making a playlist for a friend or finding songs that match your current mood. The next step, if any, for this project is to find ways to increase the precision and accuracy of these models and eventually refine them enough that they can replace a human altogether. I would also love to test the validity of my model by curating a playlist for a party and playing it for an actual crowd. Overall, I hope that this was worth the read and that you were able to learn something you didn't know earlier.