## 💪 Challenge
Your task is to devise an analytically-backed, dance-themed playlist for the company's summer party. Your choices must be justified with a comprehensive report explaining your methodology and reasoning. Below are some suggestions on how you might want to start curating the playlist:
* Use descriptive statistics and data visualization techniques to explore the audio features and understand their relationships.
* Develop and apply a machine learning model that predicts a song's `danceability`. 
* Interpret the model outcomes and utilize your data-driven insights to curate your ultimate dance party playlist of the top 50 songs according to your model.

## 💾 The Data
You have assembled information on more than `125` genres of Spotify music tracks in a file called `spotify.csv`, with each genre containing approximately `1000` tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in `October 2022`.
Each row represents a track that has some audio features associated with it.

| Column     | Description              |
|------------|--------------------------|
| `track_id` | The Spotify ID number of the track. |
| `artists` | Names of the artists who performed the track, separated by a `;` if there's more than one.|
| `album_name` | The name of the album that includes the track.|
| `track_name` | The name of the track.|
| `popularity` | Numerical value ranges from `0` to `100`, with `100` being the highest popularity. This is calculated based on the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently.|
| `duration_ms` | The length of the track, measured in milliseconds.|
| `explicit` | Indicates whether the track contains explicit lyrics. `true` means it does, `false` means it does not or it's unknown.|
| `danceability` | A score ranges between `0.0` and `1.0` that represents the track's suitability for dancing. This is calculated by algorithm and is determined by factors like tempo, rhythm stability, beat strength, and regularity.|
| `energy` | A score ranges between `0.0` and `1.0` indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy.|
| `key` | The key the track is in. Integers map to pitches using standard Pitch class notation. E.g.`0 = C`, `1 = C♯/D♭`, `2 = D`, and so on. If no key was detected, the value is `-1`.| 
| `loudness` | The overall loudness, measured in decibels (dB).|
| `mode` |  The modality of a track, represented as `1` for major and `0` for minor.| 
| `speechiness` | Measures the amount of spoken words in a track. A value close to `1.0` denotes speech-based content, while `0.33` to `0.66` indicates a mix of speech and music like rap. Values below `0.33` are usually music and non-speech tracks.| 
| `acousticness` | A confidence measure ranges from `0.0` to `1.0`, with `1.0` representing the highest confidence that the track is acoustic.|
| `instrumentalness` | Instrumentalness estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to `1.0` indicates a higher probability that the track lacks vocal content.|
| `liveness` | A measure of the probability that the track was performed live. Scores above `0.8` indicate a high likelihood of the track being live.|
| `valence` | A score from `0.0` to `1.0` representing the track's positiveness. High scores suggest a more positive or happier track.|
| `tempo` | The track's estimated tempo, measured in beats per minute (BPM).|
| `time_signature` | An estimate of the track's time signature (meter), which is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from `3` to `7` indicating time signatures of `3/4`, to `7/4`.|
| `track_genre` |  The genre of the track.|

[Source](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset) (data has been modified)

# 📜 Executive Summary:

This report provides a comprehensive analysis of a music dataset, focusing on danceability as a key attribute, and explores various aspects of data preprocessing, genre analysis, audio features, and machine learning modeling. The objective of this analysis is to gain insights into the factors that influence danceability and create a playlist of songs optimized for danceability.

**Data Exploration:** Initial data inspection using .info() and .describe() functions to understand dataset structure and statistics. Plotting visualizations to gain insights.

**Data Pre-processing:** Duplicate entries are removed for data accuracy. Missing values are imputed.

**Genre Analysis:** An exploration of music genres' distribution in the dataset, providing insights for subsequent genre-based investigations.Analysis of how different genres correlate with danceability scores, helping identify high-danceability genres.

**Feature Correlation:** Visualization of feature correlations with danceability through a heatmap and histograms, offering insights into feature importance.

**Clustering**:K-Means clustering is used to cluster the genres, the elbow method is used to find the optimal number of clusters based on song danceability and valence.

**ML Modeling**: After the validation of the chosen ML models,The KNN Regressor Machine learning model is used to predict danceability scores based on selected features. Feature selection, data splitting, and performance evaluation are done to ensure the most optimal results.

**Playlist Creation:** A curated dance playlist is generated based on the analysis and modeling results, catering to dance enthusiasts.

**Playlist Refining** : Refining the playlist with a weight based algorithm

## 👩‍💻 Reading Data

In this step, we read the necessary data for our analysis. The dataset was provided by the competition and originally sourced from Kaggle.
For this task, let's import the pandas library and use it to unveil our dataset.

In [88]:
import pandas as pd
spotify = pd.read_csv('data/spotify.csv')
spotify

## 🔎 Exploring the Data

In this step, we will explore the dataset to gain insights and understand the structure of the data.

Let's start by examining the first few rows of the dataset.


In [89]:
spotify.head()

Looks like we have got a lot of da'ta'cing to do!

## Using the `.info()` function

The `.info()` function is a useful method in pandas that provides a concise summary of a DataFrame. It gives information about the column names, the number of non-null values, and the data types of each column.
Using the `.info()` function is a quick way to get an overview of the structure and content of a DataFrame.

In [90]:
spotify.info()

### **Recommendation**

## What do we know so far? 🤔
**Dataset Size:** The dataset contains a total of 113,027 entries or rows.

**Columns:** There are 20 columns in this dataset, each representing different attributes of music tracks.

**Data Types:**

- Most columns contain numeric data types, including integers (int64) and floating-point numbers (float64).
- The explicit column is represented as a boolean (bool) data type, indicating whether a track contains explicit content.
- Several columns, such as track_id, artists, album_name, track_name, and track_genre, are of object (object) data type, which typically represents strings or categorical data.

## Using the `.describe()` function

The `.describe()` function is a useful method in pandas that provides descriptive statistics of a DataFrame. It gives information about the count, mean, standard deviation, minimum, maximum, and quartiles of the numerical columns in the DataFrame.

Using the `.describe()` function is a quick way to get an overview of the distribution and summary statistics of the numerical data in a DataFrame.


In [91]:
spotify.describe()

### Recommendation

By carefully observing the above result, we get a clearer picture of how data is distributed over each numerical column. Few observations that can be inferred are-
- Popularity, duration, key, loudness, tempo and time_signature have occurences that go beyond the range of 0-1 and would need to be dealt with differently.
- 25% of values in instrumentalness are less than or equal to 0 which can be imputed.
- 75% of values in mode are equal to 1, so mode can be dropped from the feature selection process as it conveys little information about each individual track.

## Time to put our 'data dancing' shoes ON!
Now that we have a statistical picture of what is going on in our dataset. Let us start exploring and pre-processing our data for the best results.

### Dropping duplicates

The dataset has many song tracks which are repeated because of the different albums or genres they belong to. (One-to-many relationship). So let us first find out the actual number of unique tracks in our data.
This code below calculates and stores the count of unique track names in a Spotify dataset and then assigns this count to the variable track_count.

In [92]:
track_count=len(spotify.track_name.unique())
track_count

Might as well find the number of duplicated tracks-

In [93]:
duplicate_rows = spotify[spotify.duplicated(['track_name'])]
len(duplicate_rows)

That is quite a number of duplicates. Now let us drop them, ensuring that we retain the first occurence of every duplicate track name and make our dataset a little easier to deal with. However, doing this might result in the information loss related to genres as the same song can belong to different genres. Although, doing this is the viable option to keep redundancy in check.

In [94]:
spotify = spotify.drop_duplicates(subset=['track_name'])
spotify

### 📊 Genre Analysis

The genre of a song plays a significant role in its overall appeal and popularity. It enables music platforms and recommendation systems to suggest similar songs or artists based on a user's preferences. This facilitates music discovery and allows listeners to explore new genres and expand their musical horizons.
In our case, genre analysis would lead us to curate a relevant dance music playlist for our company!

Initally, the dataset had 125 unique genres. After dropping duplicates we have 113 unique genres as illustrated by the code below:

In [95]:
genres = spotify.track_genre.unique()
len(genres)

The distribution of song tracks across genres can be better perceived using visulization like the one demonstrated below. 

In [96]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.countplot(data=spotify, x='track_genre', order=spotify['track_genre'].value_counts().index, palette='viridis')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.title('Distribution of Data by Genre')
plt.xticks(rotation=90)  # Rotate x-axis labels for readability
plt.tight_layout()
# Show the plot
plt.show()

### Recommendation

Based on the genre count plot, we can make the following recommendations:

1. Explore more songs in the genres with the highest count to discover popular and trending tracks. Metal based songs have highest frequency, followed by afrobeat and cantopop.
2. We can consider diversifying the playlist by including songs from genres with lower counts such as indie house and reggaeton.
3. Analyze the characteristics of songs in different genres to understand the preferences of the target audience.

The above can also be interpretted as below using **.value_counts()** function

In [97]:
genre_counts =spotify['track_genre'].value_counts()
genre_counts

Here is a view of all types of unique genres in our data for better readibility-

In [98]:
genres

### Genre Vs Danceability
A song's genre can tell a lot about whether a song is danceable or not. Songs belonging to genres like sad,metal,sleep etc. can definitely not qualify as danceable and hence the distrubution of danceability across various genres needs to be kept in mind. 

This code below creates three **boxplots** using Seaborn to visualize the distribution of danceability scores for music genres in the dataset, dividing the genres into thirds since one boxplot would lead to overcrowding.

In [99]:
def create_genre_boxplot(data, title):
    sns.boxplot(data=data, x='track_genre', y='danceability', width=0.6)
    plt.xlabel('Genre')
    plt.ylabel('Danceability')
    plt.title(title)
    plt.xticks(rotation=90)
    plt.show()

third = len(genres) // 3
genres1 = genres[:third]
genres2 = genres[third:2*third]
genres3 = genres[2*third:]

subset1 = spotify[spotify['track_genre'].isin(genres1)]
create_genre_boxplot(subset1, 'Boxplot of Danceability for First Third of Genres')

subset2 = spotify[spotify['track_genre'].isin(genres2)]
create_genre_boxplot(subset2, 'Boxplot of Danceability for Second Third of Genres')

subset3 = spotify[spotify['track_genre'].isin(genres3)]
create_genre_boxplot(subset3, 'Boxplot of Danceability for Final Third of Genres')

### Recommendation

Boxplots are valuable for understanding data distribution patterns. By closely examining each sub-boxplot, we can identify genres that are unsuitable for further analysis and modeling based on their danceability distribution and even checking outliers.
Genres with lowest danceability scores on average:
- iranian
- sleep
- grindcore
- opera

However, it's also essential to consider our context, as we are curating a playlist for a company's dance party. Genres like "children" and "kids" are likely irrelevant for our specific goal. Even songs which are ambient, or belong to classical genre are not suitable for dancing.

Following is the list of genres that are not being considered:

### Dropping Irrelevant Genres

In [100]:
genres_to_drop = ['children','study','sad','emo','kids','black-metal','opera','sleep','classical','ambient','death metal','grindcore','iranian','new-age','disney','idm','grunge']

# Create a new DataFrame without the specified genres
spotify =spotify[~spotify['track_genre'].isin(genres_to_drop)]

In [101]:
genres2=spotify["track_genre"].unique()
genres2

This further reduces the number of rows in our dataset to even more relevant data for our context.

In [102]:
# Print the resulting DataFrame
spotify

## 🎼 Audio Features
Let's now explore the actual audio features and see where it takes us!

### Data pre-processing

As mentioned earlier in the notebook, 25% of the values in instrumentalness column were 0. Let us deal with them. First we find the count of such values and then impute them with mean value of the column

In [103]:
count = spotify["instrumentalness"].value_counts()[0]
count

In [104]:
#Replacing '0' values with mean
spotify['instrumentalness'] = spotify['instrumentalness'].replace(0, spotify['instrumentalness'].mean())
print(spotify['instrumentalness'])

Let us now convert boolean values in explicit columns to 0 or 1. Where 0 indicates False value and 1 indicated True value.

In [105]:
spotify['explicit'] = spotify['explicit'].astype(int)
spotify['explicit']

## 💡 Correlation of features with danceability

### Heatmap

This heatmap visualization, created using Seaborn, offers a vivid representation of how different attributes interact. Bright spots indicate strong positive correlations, while darker areas suggest weaker or negative associations. Analyzing these correlations is essential for understanding the interplay of musical features and can help us uncover fascinating insights into the world of music

In [106]:
correlation = spotify.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation, annot=True)
plt.xticks(rotation=45)
plt.yticks(fontsize=10)
plt.show()

It is clear from the above heatmap that valence is highly correlated with danceability.It would be too early to disregard the rest as correlation is just one face of analysis.

### Histograms

Let us plot histograms to analyse each audio feature at once for a better understanding.

In [107]:
spotify.hist(figsize=(15,15), color='purple')
plt.tight_layout()
plt.show()

### Recommendation

The histograms provide a visual representation of the distribution of each audio feature in the Spotify dataset. 
- The 'duration' distribution is also not uniform across the data and can be dropped.

- The 'explicit' distribution is again discrete, and since most of the songs are non-explicit, we can drop this feature.

- Most of the distributions are skewed and hence a suitable scaling technique is required for processing the dataset.

Dropping the features that might not be necessary for further analysis:

In [108]:
spotify=spotify.drop(['track_id','album_name','duration_ms',],axis=1)

### Individual Correlation with Danceability
In this segment of our exploration, we turn to the Yellowbrick library to shed light on the relationships between various musical attributes and the 'danceability' of tracks in our Spotify dataset.
By examining how different features interact with the 'danceability' of music tracks, we gain a deeper understanding of which attributes have a significant impact and in what direction. This knowledge becomes invaluable for making informed decisions in our analysis and modeling processes, particularly when curating a playlist for a company's dance party.

In [109]:
import numpy as np
from sklearn import datasets
from yellowbrick.target.feature_correlation import feature_correlation

X, y = spotify.drop(['track_name','danceability','track_genre','artists'], axis=1), spotify['danceability']
feature_names = X.columns.tolist()
features = np.array(feature_names)
visualizer = feature_correlation(X, y, labels=features)
plt.tight_layout()

This diverging plot gives us a better picture of the type of correlation of features with the target feature i.e Danceability

### Scatter Plots
Each plot explores the correlation between danceability and specific audio features of music tracks. The code below employs Linear Regression to analyze the relationships between danceability and audio features such as Valence, Speechiness, Acousticness, Liveness, Energy, and Instrumentalness.

In [110]:
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def plot_feature(ax, feature_name, X, y, x_label):
    regr = LinearRegression()
    regr.fit(X, y)

    ax.scatter(X, y, alpha=0.5, label=f'{feature_name} vs. Danceability')
    ax.plot(X, regr.predict(X), color="red", linewidth=3)
    ax.set_xlabel(x_label)
    ax.set_ylabel("Danceability")
    ax.set_title("Correlation")

# Create a single figure for all the subplots
fig, axes = plt.subplots(3, 3, figsize=(20, 10))
fig.subplots_adjust(wspace=0.3, hspace=0.5)

# List of features and their corresponding column names
features = ["Valence", "Speechiness", "Acousticness", "Liveness", "Energy", "Instrumentalness", "Loudness", "Tempo","Popularity"]
columns = ["valence", "speechiness", "acousticness", "liveness", "energy", "instrumentalness", "loudness", "tempo","popularity"]

for i, feature_name in enumerate(features):
    row, col = divmod(i, 3)
    plot_feature(
        axes[row, col],
        feature_name,
        spotify[[columns[i]]].values,
        spotify["danceability"].values,
        feature_name
    )
plt.show()


### Recommendation

The scatter plots above are a better explanation of the histograms we saw earlier.
- Valence, energy and loudness have the most prominent positive linear relationship with danceability.
- Acousticness, liveness, intrumentalness and tempo have negative relationship with danceability.
- Popularity and key are int type so they can be analysed in a different manner.

It is very apparent that valence is the strongest indicator of a song's danceability. However, the same cannot be said for speechiness due to a large concentration of outliers in the extreme ends of the graph even though there is a positive correlation.
Energy and loudness can also be considered as  decent positive indicators of a song's danceability

## 🛠 Processing Data

As we observed from the plotted histograms above, most of the distribution of features is skewed such as that of energy, intrumentalness, acousticness,liveness and speechiness. It is important that we preprocess our data carefully.

Robust Scaling is a suitable choice for preprocessing music-related data, especially when dealing with potentially skewed distributions and outliers. It helps ensure that the scaled data maintains its integrity, making it more suitable for various machine learning and data analysis tasks.

In [111]:
from sklearn.preprocessing import RobustScaler
# Select the columns to be scaled (exclude 'track_genre')
numeric_columns = ['popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature', 'explicit', 'mode']

# Create a RobustScaler instance
scaler = RobustScaler()

# Apply Robust scaling to the selected columns
spotify[numeric_columns] = scaler.fit_transform(spotify[numeric_columns])

In [112]:
spotify

Our features are now scaled according to the requirement.

## Clustering for Genre
Since there are too many genres, encoding them individually would increase the dimensionality of the dataset. Hence another way of including the genres for our task is to create meaningful clusters of genres. This can be done by creating clusters of genres not only based on daceability but also on valence which is highly correlated with danceability.

Valence reflects the emotional mood of a song, while danceability measures its rhythmic and tempo characteristics. Clustering based on these features makes genres more interpretable, relevant to how listeners perceive music, and useful for recommendations and creative exploration

### Calculating Genre Statistics

Our first task is to calculate the mean danceability and valence score for each genre. This step condenses the numerous tracks within a genre into a single attribute representing its average danceability. 

In [113]:
# Calculate the mean danceability and mean valence for each genre
genre_stats = spotify.groupby('track_genre').agg({'danceability': 'mean', 'valence': 'mean'}).reset_index()

### 🕵️‍♀️ Finding the Optimal Number of Clusters


Now, let's use clustering to group genres based on their danceability scores. We can employ the "Elbow Method" to help us decide. This technique involves running the clustering algorithm for different values of K and plotting a graph of the distortion (sum of squared distances from each point to its assigned center) against K.

In [114]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

range_n_clusters = [3, 4, 5, 6, 7]
distortions = []

for num_clusters in range_n_clusters:
    # initialise kmeans
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(genre_stats[['danceability', 'valence']])
    cluster_labels = kmeans.labels_
    
    # calculate distortion
    distortions.append(sum(np.min(cdist(genre_stats[['danceability', 'valence']], kmeans.cluster_centers_, 'euclidean'), axis=1)) / genre_stats.shape[0])

plt.plot(range_n_clusters, distortions, 'bx-')
plt.xlabel('Values of K') 
plt.ylabel('Distortion') 
plt.title('Elbow Method For Optimal k')
plt.show()

In the resulting graph, you'll notice the first distinctive "elbow point," where the distortion starts to decrease at a slower rate. This point suggests the optimal number of clusters which is k=4. In our case, it helps us determine how many distinct clusters of music genres exist based on their danceability scores.

In [115]:
# Number of clusters (you can adjust this based on your preference)
num_clusters = 4

# Perform K-Means clustering based on mean danceability and mean valence
kmeans = KMeans(n_clusters=num_clusters, random_state=10)
genre_stats['cluster'] = kmeans.fit_predict(genre_stats[['danceability', 'valence']])
custom_cluster_labels = list(range(num_clusters))

# Plot the clusters
plt.scatter(genre_stats['danceability'], genre_stats['valence'], c=genre_stats['cluster'], cmap='rainbow')
plt.xlabel('Mean Danceability')
plt.ylabel('Mean Valence')
plt.title('Clusters of Music Genres Based on Mean Danceability and Mean Valence')
plt.show()

# Display the cluster assignments
print(genre_stats[['track_genre', 'danceability', 'valence', 'cluster']])

Since we have have our genre clusters now, we no longer need the original track genre information so we can use a mapping operation to replace the original 'track_genre' values in the Spotify dataset with corresponding cluster labels.

In [116]:
genre_to_cluster = dict(zip(genre_stats['track_genre'], genre_stats['cluster']))
# Use the mapping to replace track_genre in the original Spotify dataset
spotify['track_genre'] = spotify['track_genre'].map(genre_to_cluster)

In [117]:
spotify

We have our clusters but that is not the end of our problem here. Now the track genre column is out of scale as compared to other features. So we can use One-hot encoding to transform our clusters to make them interpretable to the model.

In [118]:
#One-hot encoding
spotify = pd.get_dummies(spotify, columns=['track_genre'], prefix='genre')

In [119]:
spotify

The dataset seems good to go now.

## 🤖 ML Modeling
Now that we have the complete picture of our data, it is time to call our DJ and get the party started. Our DJ here is none other than our Machine Learning model that would predict the Top 50 danceable songs for us.

### Feature Selection

In the data preparation process for machine learning modeling, relevant features have been thoughtfully selected to predict track 'danceability'. Instrumentalness,explicit,speechiness have not been considered in the process due to having little or no effect in determining danceability of a song. The categorical features such as track name and track id have also been omitted for smooth training, leaving our dataset ready for machine learning algorithms to unveil the factors that truly determine a song's dance-worthiness.

In [120]:
# Select relevant features and the target variable
features =["popularity","danceability","energy","key","loudness","mode","acousticness","liveness",
         "valence","tempo","time_signature","genre_0","genre_1","genre_2","genre_3"]
#categorical_feature = ['track_genre']  # Add 'track_genre' as a categorical feature
target = ['danceability']

### Traint-Test Split
Now, it's time to split our dataset into two parts: training and testing. Allocating 80% of the data for training machine learning models and reserving 20% for performance testing is essential for model generalization. A fixed random seed (random_state=42) guarantees consistent and comparable results across runs.

In [121]:
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(spotify[features], spotify[target], test_size=0.2, random_state=10)

In [122]:
from sklearn.metrics import mean_squared_error, make_scorer, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
import xgboost as xgb
models = [
    RandomForestRegressor(n_estimators=50,max_depth=5,min_samples_split=5,random_state=10), 
    xgb.XGBRegressor(n_estimators=50, random_state=42,alpha=1.0),
    SVR(),
    KNeighborsRegressor()
]
model_names = ['Random Forest', 'XGBoost','SVM', 'KNN']
mse_scores = []
r2_scores = []
for model in models:
    mse_scores.append(-cross_val_score(model, X_train, y_train, cv=5,
                                       scoring='neg_mean_squared_error').mean())
    r2_scores.append(cross_val_score(model, X_train, y_train, cv=5, scoring='r2').mean())
results = pd.DataFrame({'Model': model_names, 'MSE': mse_scores, 'R-squared': r2_scores})
print(results)

Although, RandomForest and XGBoost clearly seem to outperform other algorithms, their R2 score might indicate a tendecy to overfit. This could be due to many reasons, but for the sake of simplicity we can safely decide to go ahead with KNN Regressor due to it's fair R2 score and acceptable MSE. Also KNN is non-parametric, which means it can capture complex relationships in the data without assuming a specific functional form.

In [123]:
from sklearn.neighbors import KNeighborsRegressor

# Create a KNN regression model
knn_model = KNeighborsRegressor(n_neighbors=5)  # You can adjust the number of neighbors (n_neighbors) as needed

# Fit the KNN model to the training data
knn_model.fit(X_train, y_train.values.ravel())

# Predict danceability scores for all songs in the dataset using features prepared above
spotify['predicted_danceability'] = knn_model.predict(spotify[features])

# Sort the data by predicted danceability score in descending order
spotify.sort_values(by='predicted_danceability', ascending=False, inplace=True)

## 📈 Performace Evaluation
MAE, MSE and R2 score collectively provide insights into how well the RandomForestRegressor model is performing in predicting danceability scores. A lower MAE and MSE, along with a higher R-squared, indicate better model performance. It's essential to evaluate these metrics to ensure that the model meets the desired level of accuracy and can effectively predict danceability in your dataset.

In [124]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Calculate predictions on the test set
y_pred = knn_model.predict(X_test)

# Calculate evaluation metrics
mae= mean_absolute_error(y_test, y_pred)
mse= mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.6f}")
print(f"Mean Squared Error (MSE): {mse:.8f}")
print(f"R-squared (R^2): {r2:.4f}")

- The MAE measures the average absolute difference between the actual and predicted values. In this case, a MAE of 0.08 indicates that, on average, the model's predictions have an absolute difference of 0.139 from the actual values.
- A lower MAE is generally better, so a MAE of 0.139 suggests that the model's predictions are relatively close to the actual values.
- The MSE is similar to MAE but squares the differences before averaging them. A lower MSE indicates a better fit of the model to the data.
- With an MSE of 0.03, the model's predictions have, on average, a squared difference of 0.03 from the actual values.
- An R^2 value of 0.935 indicates that approximately 94% of the variability in the target variable can be explained by the model's independent variables.

### Regression Plot
This scatter plot provides a clear comparison between the predicted (in blue) and actual (in red) danceability scores for a collection of songs. Each point on the plot represents a song, and their positions relative to the diagonal line (where actual equals predicted) reveal how closely the predictions align with reality. This visual assessment helps us gauge the accuracy of our predictive model, with points clustering around the diagonal line indicating more accurate predictions.

In [125]:
danceability = spotify['danceability']
predicted_danceability = spotify['predicted_danceability']

# Create a scatter plot for actual danceability (in blue)
plt.figure(figsize=(10, 6))
plt.scatter(danceability, predicted_danceability, alpha=0.05, label='Predicted', color='blue',s=20)

# Create a scatter plot for predicted danceability (in red)
plt.scatter(danceability, danceability, alpha=0.05, label='Actual', color='red',s=10)

plt.title('Scatter Plot: Actual vs. Predicted Danceability')
plt.xlabel('Actual Danceability')
plt.ylabel('Predicted Danceability')
plt.grid(True)

# Add a legend to distinguish between actual and predicted
plt.legend(loc='best')
plt.show()

### 🎶 The Playlist

In [126]:
# Extract the top 50 songs
top_50_danceable_songs= spotify.head(50)

# Display the top 50 danceable songs
top_50_danceable_songs[['track_name', 'artists', 'predicted_danceability']]

Voila! We have our top 50 dance songs for the ultimate summer party, but our work here is not done yet, to add a little spice to this playlist, let us develop an algorithm that orders these 50 songs like how a DJ would. Typically in a party setting, DJs start with a selection of songs that have a moderate tempo and gradually increase the energy level as the party progresses. They may begin with some well-known and easy-to-dance-to tracks and then transition to more energetic beats.

In [128]:
from sklearn.preprocessing import MinMaxScaler

# Define the weights for each feature based on importance
weights = {
    'valence': 0.25,   # Adjust the weight for valence based on importance
    'popularity': 0.35,  # Adjust the weight for popularity based on importance
    'energy': 0.25,     # Adjust the weight for energy based on importance
    'tempo': 0.15      # Adjust the weight for tempo based on importance
}

# Create a Min-Max scaler
scaler = MinMaxScaler()

# Normalize the feature columns except 'track_name' and 'artists'
normalized_features = top_50_danceable_songs[['valence', 'popularity', 'energy', 'tempo']].copy()
normalized_features = scaler.fit_transform(normalized_features)
top_50_danceable_songs[['valence', 'popularity', 'energy', 'tempo']] = normalized_features

# Calculate a weighted score for each song using the defined weights
top_50_danceable_songs['weighted_score'] = (
    top_50_danceable_songs['valence'] * weights['valence'] +
    top_50_danceable_songs['popularity'] * weights['popularity'] +
    top_50_danceable_songs['energy'] * weights['energy'] +
    top_50_danceable_songs['tempo'] * weights['tempo']
)

# Filter songs with moderate to high tempo
moderate_to_high_tempo_songs = top_50_danceable_songs[top_50_danceable_songs['tempo'] >= 0.5]

# Calculate how many additional songs are needed to reach a total of 50
additional_songs_needed = 50 - len(moderate_to_high_tempo_songs)

# If additional songs are needed, select them from lower tempo range
if additional_songs_needed > 0:
    low_tempo_songs = top_50_danceable_songs[top_50_danceable_songs['tempo'] < 0.5]
    # Sort the low tempo songs by weighted score in descending order and select the top ones
    additional_songs = low_tempo_songs.sort_values(by='weighted_score', ascending=False).head(additional_songs_needed)
    # Concatenate the additional songs with the moderate to high tempo songs
    final_song_list = pd.concat([moderate_to_high_tempo_songs, additional_songs])
else:
    # If no additional songs are needed, use only the moderate to high tempo songs
    final_song_list = moderate_to_high_tempo_songs

# Sort the final list of songs by weighted score in descending order
playlist = final_song_list.sort_values(by='weighted_score', ascending=False)
playlist = playlist[playlist['artists'] != playlist['artists'].shift()]


# Display the final list of songs
print(playlist[['track_name', 'artists', 'weighted_score']])


In [129]:
playlist[['track_name', 'artists', 'weighted_score']]

### The End

Our final playlist is ready to be played! I hope you enjoyed this groovy journey as much as I did. Any feedback is welcome! Let the dancing begin!