# Table Of Contents
* ## Importing Contents
* ## Basic Cleaning
* ## Basic Exploration
* ## Advanced Cleaning
* ## Exploratory Data Analysis
* ## Machine Learning

# **Importing Modules**

In [1]:
!pip install tqdm



In [2]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb
from time import sleep
ROOT = "/content"
from tqdm import tqdm
import json

# **Spotify API**

We first obtain our raw data from the Spotify Million Playlist Dataset:
https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge

The raw dataset provides the basic details of tracks on Spotify, such as name, artist and identifier (URI). We extract a few slices from the full dataset (~60,000 songs) and upload them to Google Colab.

In [3]:
!pip install spotipy==2.16



In [4]:
from google.colab import files
files.upload()


In [None]:
!unzip mpd.slice2.zip

Next, we import the spotipy library to interact with the Spotify API directly from Google Colab. This allows us to send requests to the Spotify API and extract the song-specific details needed by our machine learning algorithms.

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# Read the credentials from environment variables instead of hard-coding secrets in the notebook
client_credentials_manager = SpotifyClientCredentials(client_id=os.environ["SPOTIPY_CLIENT_ID"],
                                                      client_secret=os.environ["SPOTIPY_CLIENT_SECRET"])
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)
def get_URI(link):
  return link.split("/")[-1].split("?")[0]
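As a quick illustration of what `get_URI` does, here is a minimal sketch. The share link below is made up for the example (its track ID is a placeholder, not a real track from the dataset).

```python
def get_URI(link):
    # Keep the last path segment and drop any query string
    return link.split("/")[-1].split("?")[0]

# Hypothetical share link, used only for illustration
example_link = "https://open.spotify.com/track/EXAMPLETRACKID?si=abc123"
track_id = get_URI(example_link)
print(track_id)

# A "spotify:track:..." URI contains no slashes, so it passes through unchanged
uri_unchanged = get_URI("spotify:track:EXAMPLETRACKID")
print(uri_unchanged)
```

The `track_uri` values in the playlist slices are already in the `spotify:track:...` form, so the helper is only needed when starting from a web link.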

In [None]:
df = pd.DataFrame()

In [None]:
file = "mpd.slice.1.json"
print("Processing "+ file)
with open(file) as train_file:
    dictionary = json.load(train_file)
    
playlists = [x["tracks"] for x in dictionary["playlists"]]
tracks = [x["track_uri"] for playlist in playlists for x in playlist]

In [None]:
for i in range(1,7):
  file = "mpd.slice.{}.json".format(i)
  print("Processing "+ file)
  with open(file) as train_file:
      dictionary = json.load(train_file)
      
  playlists = [x["tracks"] for x in dictionary["playlists"]]
  tracks = [x["track_uri"] for playlist in playlists for x in playlist]


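  # Request details in batches of 50 tracks, including the final partial batch
  for start in tqdm(range(0, len(tracks), 50)):
    part = tracks[start:start+50]
    try:
      part_data = sp.audio_features(part)
      part_info = sp.tracks(part)['tracks']

      # Look up the first listed artist of each track to get the artist name and genres
      part_artists_info = sp.artists([track['artists'][0]['id'] for track in part_info])["artists"]
      part_artist = [artist["name"] for artist in part_artists_info]
      part_genre = ["/".join(artist["genres"]) for artist in part_artists_info]

      part_df = pd.concat([pd.DataFrame(part_data).drop(["id","uri","track_href","analysis_url"],axis=1), pd.DataFrame(part_info)[["name","popularity"]]],axis=1)
      part_df["artist"] = part_artist
      part_df["genre"] = part_genre

      df = pd.concat([df, part_df])

    except Exception as err:
      # Skip batches that fail, but report them instead of silently hiding the error
      print("Skipping batch starting at {}: {}".format(start, err))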
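
The batching of track URIs into groups of 50 can be sketched on its own, without any API calls. The helper name `chunked` and the placeholder URIs are assumptions made for this example. Stepping up to `len(items)` means a final, shorter batch is still produced.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder URIs standing in for the real track list
tracks = ["spotify:track:{}".format(n) for n in range(120)]
batches = list(chunked(tracks, size=50))
print([len(b) for b in batches])
```

Each batch could then be passed to the API wrapper in place of `part`.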



In [None]:
df

In [None]:
df.to_csv("spotify_dataset.csv")

# **Importing the dataset**

Here, we upload our extracted data and display it to get a general understanding of it

In [None]:
# Upload "spotify_dataset.csv"
from google.colab import files
files.upload()

In [None]:
df = pd.read_csv("spotify_dataset.csv")
df.head()

In [None]:
# Drop irrelevant columns
df = df.drop(["Unnamed: 0",'type','time_signature'],axis=1)

In [None]:
df.info()
print("There are {} unique titles, {} unique artists, and {} unique genres".format(len(df.name.unique()),
                                                                                   len(df.artist.unique()), 
                                                                                   len(df.genre.unique())))

## **Label References**


title - name of the song

artist - artist of the song

genre - the genre of the track

year - the release year of the recording. (may be unreliable due to re-releases and reuploads)

bpm (beats per minute) - The tempo of the song.

energy - The energy of a song - the higher the value, the more energetic

dance - The higher the value, the easier it is to dance to

volume (dB) - The higher the value, the louder the song

live - The higher the value, the more likely the song is a live recording.

valence - The higher the value, the more positive mood for the song.

length - The duration of the song.

acous (acoustics) - The higher the value the more acoustic the song is.

speech (speechiness) - The higher the value the more spoken words the song contains.

pop (popularity) - The higher the value the more popular the song is.

# **Basic Cleaning**

We proceed to do basic cleaning of the data such as handling NaN values and duplicates. This ensures that our models can be trained without skewed data

## Handling NaN values

In [None]:
# Check for NaN values in the dataset
df[df.isna().any(axis=1)].head()

Observation: Looks like most of the NaN comes from the genre column

In [None]:
# Check for NaN that isn't under genre
df[df.drop("genre",axis=1).isna().any(axis=1)]

In [None]:
# Remove NaN rows (excluding those in genre)

df = df.dropna(how='any', subset=['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'name', 'popularity', 'artist'])
df


## Handling duplicates

In [None]:
# sort df by popularity
df = df.sort_values("popularity", ascending=False)

# Remove duplicate track titles, keeps row with highest pop value
df = df.drop_duplicates('name', keep="first")
df
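
To make the "sort, then drop duplicates" step concrete, here is a small sketch on a made-up three-row frame (the song names, artists and scores are invented for illustration).

```python
import pandas as pd

# Toy data: "Song A" appears twice with different popularity scores
toy = pd.DataFrame({
    "name": ["Song A", "Song B", "Song A"],
    "artist": ["X", "Y", "Z"],
    "popularity": [40, 55, 72],
})

# Sorting first means keep="first" retains the most popular row per title
deduped = (toy.sort_values("popularity", ascending=False)
              .drop_duplicates("name", keep="first"))
print(deduped)
```

Note that deduplicating on `name` alone also collapses different songs that happen to share a title, as the two "Song A" rows by different artists show.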

# **Basic Exploration**

After cleaning, we start to perform basic exploratory data analysis on the numerical and categorical data. This gives us a rough understanding of the overall distribution and allows us to continue with more advanced data cleaning and feature engineering.

In [None]:
# split columns into categorical and numerical data types
categorical = ["artist",'genre']
numerical = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms','popularity']

## Numerical Data

In [None]:
# Plot box-plot, histogram and violin plot for all the numerical data
def plot_numericals(dataframe):
  f, axes = plt.subplots(len(numerical), 3, figsize=(20, 13), constrained_layout = True)
  for i,num in enumerate(tqdm(numerical)):
    data = pd.DataFrame(dataframe[num])
    sb.boxplot(data = data, orient = "h", ax = axes[i,0],color="green")
    sb.histplot(data = data, ax = axes[i,1],color="blue")
    sb.violinplot(data = data, orient = "h", ax = axes[i,2],color="yellow")

In [None]:
# Check the distribution for each of the numerical values
plot_numericals(df)

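Observations:
1. Most audio features lie in the range 0-1; the exceptions are loudness (negative dB values) and columns on their own scales, such as tempo, duration_ms, key and popularity
2. There is a disproportionate number of tracks with 0 popularity
3. The instrumentalness column is heavily skewed, with most values close to 0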

## Categorical Data

In [None]:
# View the different values in artist column
df["artist"].value_counts()

Observation: There are 16964 unique artists. With this many distinct values, the artist column is unlikely to be useful for training our predictive model

In [None]:
# View the different values in genre column
df["genre"].value_counts()[:20]

Observation: each genre value is a mix of multiple subgenres, so preprocessing is required before we can make use of this categorical column

In [None]:
# Explore and find the most frequent genres
df["genre"].nunique()

Each song can be tagged with multiple genres. Our team decided to reduce the dimensionality of genres by collating the most frequently occurring genres

In [None]:
# Split the combined genres into their individual categories
dfcpy = df["genre"].str.split("/", expand=True)
# Stack every genre instance into one column and count how often each genre appears
# (avoids hard-coding the number of split columns)
test = dfcpy.stack().value_counts().rename_axis("genre").reset_index(name="freq")

In [None]:
# Sort by frequency and get the most frequent genres
test=test.sort_values("freq",ascending=False)
test.columns=["genre",'freq']

In [None]:
test=test.groupby(by=["genre"]).sum().sort_values("freq",ascending=False)
genres = test.index.tolist()[:25]
genres.reverse()

In [None]:
genres

We now have a list of the top 25 most frequent genres (least to most frequent)
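
The idea behind the genre ranking (split every combined label on "/", then count each individual genre) can be sketched with the standard library alone. The genre strings below are invented examples in the same "a/b/c" format as the genre column.

```python
from collections import Counter

# Made-up genre strings in the same "a/b/c" format as the genre column
genre_column = ["dance pop/pop", "pop/post-teen pop", "rap/trap", "pop"]

# Count every individual genre across all rows
counts = Counter(g for row in genre_column for g in row.split("/"))

# Most frequent genre first
ranked = [genre for genre, _ in counts.most_common()]
print(counts)
```

In the notebook, the same ranking is cut to the top 25 entries and reversed so that it runs from least to most frequent.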

# **Advanced Cleaning**

## Numerical Data

Looking at the preliminary EDA, we decided to normalize the data so that every numerical feature is represented on a scale of 0-1. We also decided to remove songs with 0 popularity, because there is a disproportionate number of them and they are songs with minimal plays on Spotify.

In [None]:
# Create a copy
df_clean = df.copy()
# Remove tracks with 0 popularity
df_clean = df_clean[df_clean["popularity"]>0]
#df_clean['popularity']

In [None]:
# Normalize the data using min-max so each numerical value is between 0-1
def normalize(dataframe):
  dataframe[numerical] = (dataframe[numerical] - dataframe[numerical].min()) / (dataframe[numerical].max() - dataframe[numerical].min())
normalize(df_clean)
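
For reference, min-max scaling maps each value x to (x - min) / (max - min). A minimal sketch on a plain list, with illustrative loudness values that are not taken from the dataset:

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative loudness values in dB
loudness = [-20.0, -10.0, -5.0, 0.0]
scaled = min_max(loudness)
print(scaled)
```

The smallest value maps to 0 and the largest to 1. A column in which every value is identical would cause a division by zero, so this sketch assumes at least two distinct values.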

In [None]:
df_clean["instrumentalness"].describe()

In [None]:
# Instrumentalness column is positively skewed, so perform a square-root transformation
df_clean["instrumentalness"] = np.sqrt(df_clean["instrumentalness"])
df_clean['popularity']
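
To see why a square root helps with a positively skewed column on a 0-1 scale, here is a small sketch with hand-picked values (not taken from the dataset). Values near 0 are spread out, while 0 and 1 stay fixed.

```python
import math

# Hand-picked values crowded near 0, as in a positively skewed 0-1 column
values = [0.0, 0.0001, 0.01, 0.25, 1.0]
transformed = [math.sqrt(v) for v in values]

for before, after in zip(values, transformed):
    print(before, "->", round(after, 4))
```

Because the square root is monotonic, the ordering of the tracks is unchanged; only the spacing between values differs.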

In [None]:
plot_numericals(df_clean)

## Categorical Data

Based on our previous genre EDA, we fill the NA values in genre and keep only the tracks that belong to one of the top 25 most frequent genres.

In [None]:
# Fill NA values as "no genre"
df_clean['genre'] = df_clean['genre'].fillna('no genre')

In [None]:
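# Simplify genres: relabel each track with a single top-25 genre.
# Relabelled rows are upper-cased, so they no longer match later (lower-case) genres
# and the first match in `genres` order wins.

for genre in genres:
  mask = df_clean["genre"].str.contains(genre, case=True, na=False, regex=False)
  # Use .loc instead of chained indexing so the assignment reliably modifies df_clean
  df_clean.loc[mask, "genre"] = genre.upper()
genres = [genre.upper() for genre in genres]
df_clean = df_clean.loc[df_clean['genre'].isin(genres)]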


In [None]:
df_clean["genre"].value_counts()

# **Exploratory data analysis**

After cleaning our data, we can perform exploratory data analysis to find any correlations in our data. For this project, we wish to train a model to predict whether a song will be popular or not, hence we decided to convert the numerical popularity label into binary "pop" and "not pop".

In [None]:
df_pop = df_clean.copy()

names = ["Not pop", "pop"]
df_pop['pop'] = pd.qcut(df_pop['popularity'],
                              q=[0, .9, 1],
                              labels=names)
df_pop.info()
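
The quantile edges `q=[0, .9, 1]` place roughly the top 10% of popularity scores in the "pop" class. A minimal sketch on synthetic scores from 1 to 100 (an assumption made for illustration; with real data, ties in popularity can shift the exact split):

```python
import pandas as pd

# Synthetic popularity scores 1..100, one track per score
popularity = pd.Series(range(1, 101))

# Bottom 90% -> "Not pop", top 10% -> "pop"
labels = pd.qcut(popularity, q=[0, .9, 1], labels=["Not pop", "pop"])
n_pop = int((labels == "pop").sum())
n_not_pop = int((labels == "Not pop").sum())
print(labels.value_counts())
```

The resulting classes are imbalanced by construction, which is why the later models are evaluated with a metric that accounts for class frequency.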

## Graph plots

Plot a heatmap to check linear correlations between labels

In [None]:
# Find correlation coefficient between numerical values and popularity
def plot_heatmap(dataframe):
  f = plt.figure(figsize=(12, 8))
  sb.heatmap(dataframe[numerical].corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f")
plot_heatmap(df_pop)

At a glance, there are strong correlations between loudness, energy and acousticness. However, popularity does not seem to have a high linear correlation with any of the labels. The highest correlation observed is -0.14, between popularity and instrumentalness, which is considered a weak correlation.

Next, we plot boxplots of the labels to explore the categorical relationship with popularity

In [None]:
def plot_boxplot(dataframe, metric="pop"):

  for i, num in enumerate(numerical):
    f = plt.figure(figsize=(12, 3))
    sb.boxplot(x = num, y = metric, data = dataframe, orient = "h")

plot_boxplot(df_pop)

We observed that most of the labels do not provide clear distinction between them and the song's popularity. This leads us to the hypothesis that multiple variables are needed to determine a song's popularity

To get a better visualization, we used a pairplot to check the distribution of popular and unpopular songs with reference to their individual labels

In [None]:
sb.pairplot(data = df_pop,hue="pop")

Overall, there does not seem to be a clear distinction between popularity and any one variable. But labels such as duration and loudness suggest that popular songs vary less compared to their unpopular counterparts

Another visualization we did was to plot the top genres with reference to the labels.

In [None]:
sb.pairplot(data = df_pop,hue="genre")

The lack of a distinct pattern in the above pairplot suggests that the genre of a song is not dependent on any one variable

## Feature Selection

Using our domain-specific knowledge of songs, we dropped the unnecessary labels mode and key as they do not affect how people hear the song

In [None]:
numericals=['danceability', 'energy', 'loudness',  'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']
metrics = numericals + genres

## One Hot Encoding

We use one-hot encoding to convert the values in our genre column into individual binary columns.

In [None]:
df_onehot = df_pop.copy()
one_hot = pd.get_dummies(df_onehot["genre"])
df_onehot = df_onehot.join(one_hot)
print(df_onehot.columns.tolist())
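
On a toy frame the effect of one-hot encoding is easy to inspect. The three rows below are made up for illustration.

```python
import pandas as pd

# Toy genre column with two distinct values
toy = pd.DataFrame({"genre": ["POP", "RAP", "POP"]})

# One indicator column per distinct genre, joined back onto the original frame
one_hot = pd.get_dummies(toy["genre"])
encoded = toy.join(one_hot)
print(encoded.columns.tolist())
```

Each row has exactly one indicator set, matching its value in the original `genre` column.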

In [None]:
df_onehot

# **Classification**





## Preparing train test splits

We import the necessary libraries and split our data randomly into train and test sets. The train sets are used to train our machine learning models, and the test sets are used to evaluate their performance.

In [None]:
# Import essential models and functions from sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, balanced_accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Extract Response and Predictors

y = pd.DataFrame(df_onehot["pop"])
X = pd.DataFrame(df_onehot[metrics])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

## Decision Tree

We first explored using a simple decision tree model to classify our dataset.

In [None]:
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_leaf_nodes=30,max_depth=15, class_weight="balanced",min_samples_split=900)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)
balanced_score  = balanced_accuracy_score(y_test, y_test_pred)
print(balanced_score)
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
results = confusion_matrix(y_test, y_test_pred)
sb.heatmap(results, 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

tpr = results[1][1]/results[1].sum()
tnr = results[0][0]/results[0].sum()
print("True positive rate: \t{}".format(tpr))
print("True negative rate: \t{}".format(tnr))

Although this model provides a decent true positive rate, its low true negative rate makes it impractical and suggests that it is biased toward predicting songs as popular

We make use of grid search to tune the hyperparameters of the decision tree. Limiting these parameters prevents our tree from overfitting and allows our model to achieve higher accuracy on the test data. The parameters we have chosen are:
1. max_depth - limits the maximum depth that the tree can reach
2. max_leaf_nodes - limits the number of leaf nodes the tree can have
3. min_samples_split - requires a node to have a certain number of samples before splitting is allowed
4. max_features - restricts the number of features to be considered for each split
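
A grid search tries every combination of the listed values, so the cost grows multiplicatively with each added parameter. The sketch below counts the candidates for a grid with the same values as the decision tree search in this notebook. The fold count of 5 is the scikit-learn default for `GridSearchCV` when `cv` is not set.

```python
from itertools import product

# Same value lists as the decision tree grid used in this notebook
parameters = {
    "max_depth": [None, 10, 20, 30, 40, 50],
    "max_leaf_nodes": [None, 10, 20, 30, 40, 50],
    "class_weight": ["balanced"],
    "min_samples_split": [2, 50, 500, 1000, 1500],
    "criterion": ["gini", "entropy"],
    "max_features": [None, "sqrt", "log2"],
}

# Every combination of one value per parameter is one candidate model
n_candidates = len(list(product(*parameters.values())))
cv_folds = 5  # GridSearchCV default when cv is not specified
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

This is why the later fine-tuning cells fix most parameters to a single value and vary only one or two.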




As our popular to unpopular song ratio is imbalanced, we switched to a different scoring metric to evaluate the model's performance. "Balanced accuracy" is the average of the recall obtained on each class (here, the mean of the true positive rate and the true negative rate), so both classes count equally regardless of their frequency in the dataset. It is an alternative solution to over/under sampling.
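
A short worked example shows how balanced accuracy differs from plain accuracy on imbalanced classes. The confusion matrix counts below are invented for illustration. The layout follows scikit-learn's `confusion_matrix` convention (rows are true classes, columns are predicted classes).

```python
def rates(confusion):
    """Return (TPR, TNR) for a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts only: 900 unpopular and 100 popular songs
confusion = [[630, 270], [40, 60]]
tpr, tnr = rates(confusion)

balanced = (tpr + tnr) / 2       # mean of the per-class recalls
plain = (630 + 60) / 1000        # fraction of all songs classified correctly
print(tpr, tnr, balanced, plain)
```

With the same class sizes, a model that always predicts "Not pop" would reach a plain accuracy of 0.9 but a balanced accuracy of only 0.5, which is why the latter is the more informative score here.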

In [None]:
# Create a function to evaluate the performance of a grid search model
def print_grid_report(model):
  # print the best parameters found by gridsearch, and its score on the test set
  print(model.best_params_)
  y_test_pred = model.predict(X_test)
  y_train_pred = model.predict(X_train)
  print("Balanced accuracy score (test):" , model.score(X_test, y_test))

  #Plot confusion matrix for train and test data
  f, axes = plt.subplots(1, 2, figsize=(12, 4))
  sb.heatmap(confusion_matrix(y_train, y_train_pred),
            annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
  results = confusion_matrix(y_test, y_test_pred)
  sb.heatmap(results, 
            annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
  
  #Calculate true positive and true negative rates from the test confusion matrix computed above
  tpr = results[1][1]/results[1].sum()
  tnr = results[0][0]/results[0].sum()
  print("True positive rate: \t{}".format(tpr))
  print("True negative rate: \t{}".format(tnr))

In [None]:
# Find the best parameter range to use
parameters = {"max_depth": [None,10,20,30,40,50], "max_leaf_nodes": [None,10,20,30,40,50], "class_weight":["balanced"], "min_samples_split":[2,50,500,1000,1500],
              "criterion":["gini", "entropy"],"max_features":[None,"sqrt", "log2"]}
grid = GridSearchCV(DecisionTreeClassifier(), parameters, refit = True, verbose = 3,n_jobs=-1,scoring="balanced_accuracy")
grid.fit(X_train, y_train)
 
print_grid_report(grid)

Using grid search, we have managed to reduce the number of false positives predicted


However, the overall balanced accuracy score for the decision tree model is relatively low (0.577), so it cannot be considered reliable in predicting popular songs

## Random Forest

The next alternative we try is the random forest classifier. A random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In [None]:
# Random Forest using Train Data
dectree = RandomForestClassifier(max_depth=30, max_leaf_nodes=30, class_weight="balanced",n_estimators=400)  # create the random forest object
dectree.fit(X_train, y_train.values.ravel())                    # train the random forest model

# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

results = confusion_matrix(y_test, y_test_pred)
tpr = results[1][1]/results[1].sum()
tnr = results[0][0]/results[0].sum()
print(tnr, tpr)

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])

sb.heatmap(results, 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])


# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))

print("True positive rate: \t{}".format(tpr))
print("True negative rate: \t{}".format(tnr))

Similar to our decision tree, we run grid search to find the best parameters for the random forest model

In [None]:
# Find the general best parameters 
parameters = {"max_depth": [10,20,30,40], "max_leaf_nodes": [30,40,50,60], "class_weight":["balanced"], "min_samples_split":[2],
              "criterion":["gini"],"max_features":["sqrt"], "n_estimators":[400]}
grid = GridSearchCV(RandomForestClassifier(), parameters, refit = True, verbose = 3,n_jobs=-1, scoring='balanced_accuracy')
grid.fit(X_train, y_train.values.ravel())

print_grid_report(grid)

## K nearest neighbours

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# K-nearest-neighbours baseline using Train Data
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array

# Predict Response corresponding to Predictors
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)

# Grid search
parameters = {"n_neighbors": [10,20,30,40,50], "weights":["uniform", "distance"], "leaf_size":[30,40,50,60],
              "p":[1, 2]}
grid = GridSearchCV(KNeighborsClassifier(), parameters, refit = True, verbose = 3, n_jobs=-1, scoring='balanced_accuracy')
grid.fit(X_train, y_train.values.ravel())
 
# print best parameter after tuning
print(grid.best_params_)
grid_predictions = grid.predict(X_test)

   
# print classification report 
print(classification_report(y_test, grid_predictions))

accuracy = balanced_accuracy_score(y_test, grid_predictions)
print(accuracy)

# Check the Goodness of Fit (on Train Data)
# grid.score uses the search's scoring metric, i.e. balanced accuracy
print("Goodness of Fit of Model \tTrain Dataset")
print("Balanced Accuracy \t:", grid.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Balanced Accuracy \t:", grid.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test (both from the tuned model)
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, grid.predict(X_train)),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])

results = confusion_matrix(y_test, grid_predictions)
sb.heatmap(results, 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

tpr = results[1][1]/results[1].sum()
tnr = results[0][0]/results[0].sum()
print("True positive rate: \t{}".format(tpr))
print("True negative rate: \t{}".format(tnr))

In [None]:
# Fine-tune the parameters for the best performance
parameters = {"n_neighbors": [i for i in range(5,15)], "weights":["distance"], "leaf_size":[30],
              "p":[2]}
grid = GridSearchCV(KNeighborsClassifier(), parameters, refit = True, verbose = 3,n_jobs=-1,scoring="balanced_accuracy")
grid.fit(X_train, y_train.values.ravel())
 
print_grid_report(grid)

## Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Logistic regression baseline using Train Data
logreg = LogisticRegression()
logreg.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array

# Predict Response corresponding to Predictors
y_train_pred = logreg.predict(X_train)
y_test_pred = logreg.predict(X_test)

# Grid search
parameters = {"solver":["sag", "saga"], "class_weight":["balanced"], "max_iter":[20,40,60,80,100],
              "n_jobs":[-1]}
grid = GridSearchCV(LogisticRegression(), parameters, refit = True, verbose = 3, n_jobs=-1, scoring='balanced_accuracy')
grid.fit(X_train, y_train.values.ravel())

 
# print best parameter after tuning 
print(grid.best_params_) 
grid_predictions = grid.predict(X_test) 

   
# print classification report 
print(classification_report(y_test, grid_predictions))

accuracy = balanced_accuracy_score(y_test, grid_predictions)
print(accuracy)

# Check the Goodness of Fit (on Train Data)
# grid.score uses the search's scoring metric, i.e. balanced accuracy
print("Goodness of Fit of Model \tTrain Dataset")
print("Balanced Accuracy \t:", grid.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Balanced Accuracy \t:", grid.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test (both from the tuned model)
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, grid.predict(X_train)),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])

results = confusion_matrix(y_test, grid_predictions)
sb.heatmap(results, 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

tpr = results[1][1]/results[1].sum()
tnr = results[0][0]/results[0].sum()
print("True positive rate: \t{}".format(tpr))
print("True negative rate: \t{}".format(tnr))

In [None]:
# Fine-tune the parameters for the best performance
parameters = {"solver": ["sag"], "class_weight": ["balanced"], "max_iter":[i for i in range(50,70)], "n_jobs":[-1]}
grid = GridSearchCV(LogisticRegression(), parameters, refit = True, verbose = 3,n_jobs=-1,scoring="balanced_accuracy")
grid.fit(X_train, y_train.values.ravel())
 
print_grid_report(grid)

## Gradient boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import classification_report, confusion_matrix

# Scale the data
scaler = MinMaxScaler()
X_train_transformed = scaler.fit_transform(X_train)
X_test_transformed = scaler.transform(X_test)

# Split the Dataset into Train and Test
# X_train, X_test, y_train, y_test = train_test_split(X_train_transformed, y_train, test_size = 0.2)

# Gradient boosting baseline using Train Data
gbc = GradientBoostingClassifier()
gbc.fit(X_train_transformed, y_train.values.ravel())  # pass y as a 1-D array

# Predict Response corresponding to Predictors
y_train_pred = gbc.predict(X_train_transformed)
y_test_pred = gbc.predict(X_test_transformed)

# Grid search
parameters = {"n_estimators":[20,40,60,80,100], "learning_rate":[0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1], "max_depth":[2,3,5],
              "min_samples_split":[100], "min_samples_leaf":[30], "max_features":["sqrt"]}
grid = GridSearchCV(GradientBoostingClassifier(), parameters, refit = True, verbose = 3, n_jobs=-1, scoring='balanced_accuracy')
grid.fit(X_train_transformed, y_train.values.ravel())

 
# print best parameter after tuning 
print(grid.best_params_)
grid_predictions = grid.predict(X_test_transformed)

   
# print classification report 
print(classification_report(y_test, grid_predictions))

accuracy = balanced_accuracy_score(y_test, grid_predictions)
print(accuracy)

# Check the Goodness of Fit (on Train Data)
# grid.score uses the search's scoring metric, i.e. balanced accuracy
print("Goodness of Fit of Model \tTrain Dataset")
print("Balanced Accuracy \t:", grid.score(X_train_transformed, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Balanced Accuracy \t:", grid.score(X_test_transformed, y_test))
print()

# Plot the Confusion Matrix for Train and Test (both from the tuned model)
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, grid.predict(X_train_transformed)),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])

results = confusion_matrix(y_test, grid_predictions)
sb.heatmap(results, 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

tpr = results[1][1]/results[1].sum()
tnr = results[0][0]/results[0].sum()
print("True positive rate: \t{}".format(tpr))
print("True negative rate: \t{}".format(tnr))

In [None]:
# Fine-tune the parameters(max_depth + min_samples_split) for the best performance
parameters = {"n_estimators":[80], "learning_rate":[0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1], "max_depth":[i for i in range(5,15,2)],
              "min_samples_split":[j for j in range(100,500,100)], "min_samples_leaf":[30], "max_features":["sqrt"]}
grid = GridSearchCV(GradientBoostingClassifier(), parameters, refit = True, verbose = 3,n_jobs=-1,scoring="balanced_accuracy")
grid.fit(X_train_transformed, y_train.values.ravel())
 
print_grid_report(grid)

## Misc

In [None]:
rows = []
for i in tqdm(range(2,15)):
  # Random forest with a varying cap on leaf nodes (max_depth is fixed at 2)
  dectree = RandomForestClassifier(max_depth=2,max_leaf_nodes=i,class_weight="balanced")  # create the random forest object
  dectree.fit(X_train, y_train.values.ravel())                    # train the random forest model

  # Predict Response corresponding to Predictors
  y_train_pred = dectree.predict(X_train)
  y_test_pred = dectree.predict(X_test)
  results = confusion_matrix(y_test, y_test_pred)
  tpr = results[1][1]/results[1].sum()
  tnr = results[0][0]/results[0].sum()
  # "depth" holds max_leaf_nodes here; "Score" is the balanced accuracy (mean of TPR and TNR)
  rows.append({"depth":i, "TPR":tpr, "TNR":tnr,"Score":abs(tnr+tpr)/2,"0.7":0.7})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
r = pd.DataFrame(rows)

sb.lineplot(data=r,y="TPR",x="depth",color="red")
sb.lineplot(data=r,y="TNR",x="depth",color="green")
sb.lineplot(data=r,y="Score",x="depth",color="grey")
sb.lineplot(data=r,y="0.7",x="depth",color="black")

In [None]:
metric=['danceability', 'energy',  'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']


f, axes = plt.subplots(len(metric)+1,1, figsize=(8, 20),constrained_layout = True)
axes.flatten()

for index, m in enumerate(metric):
  new_metric = metric.copy()
  new_metric.remove(m)
  y = pd.DataFrame(df_pop["pop"])
  X = pd.DataFrame(df_pop[new_metric]) 

  # Split the Dataset into Train and Test
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,random_state=1)

  rows = []
  for i in tqdm(range(5,15)):
    dectree = DecisionTreeClassifier(max_depth = i, class_weight="balanced",min_samples_leaf=60,min_samples_split=500)  # create the decision tree object
    dectree.fit(X_train, y_train)                    # train the decision tree model

    # Predict Response corresponding to Predictors
    y_train_pred = dectree.predict(X_train)
    y_test_pred = dectree.predict(X_test)
    results = confusion_matrix(y_test, y_test_pred)
    tpr = results[1][1]/results[1].sum()
    tnr = results[0][0]/results[0].sum()
    rows.append({"depth":i, "TPR":tpr, "TNR":tnr,"Score":abs(tnr+tpr)/2,"0.7":0.7})

  # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
  r = pd.DataFrame(rows)


  sb.lineplot(data=r,y="TPR",x="depth",color="red",ax=axes[index])
  sb.lineplot(data=r,y="TNR",x="depth",color="green",ax=axes[index])
  sb.lineplot(data=r,y="Score",x="depth",color="grey",ax=axes[index])
  sb.lineplot(data=r,y="0.7",x="depth",color="black",ax=axes[index])
  axes[index].set(xlabel=m,ylim=(0,1))

y = pd.DataFrame(df_pop["pop"])
X = pd.DataFrame(df_pop[metric]) 

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,random_state=1)

rows = []
for i in tqdm(range(5,15)):
  dectree = DecisionTreeClassifier(max_depth = i, class_weight="balanced",min_samples_leaf=10,min_samples_split=30)  # create the decision tree object
  dectree.fit(X_train, y_train)                    # train the decision tree model

  # Predict Response corresponding to Predictors
  y_train_pred = dectree.predict(X_train)
  y_test_pred = dectree.predict(X_test)
  results = confusion_matrix(y_test, y_test_pred)
  tpr = results[1][1]/results[1].sum()
  tnr = results[0][0]/results[0].sum()
  rows.append({"depth":i, "TPR":tpr, "TNR":tnr,"Score":abs(tnr+tpr)/2,"0.7":0.7})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
r = pd.DataFrame(rows)

index+=1
sb.lineplot(data=r,y="TPR",x="depth",color="red",ax=axes[index])
sb.lineplot(data=r,y="TNR",x="depth",color="green",ax=axes[index])
sb.lineplot(data=r,y="Score",x="depth",color="grey",ax=axes[index])
sb.lineplot(data=r,y="0.7",x="depth",color="black",ax=axes[index])
axes[index].set(xlabel="Original",ylim=(0,1))


In [None]:
# Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(200,50))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=X_train.columns, 
          class_names=["UnPopular","Popular"])

# **Insights**

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('dQw4w9WgXcQ')