# **[Using Machine Learning to Recommend Songs](https://www.sciencebuddies.org/science-fair-projects/project_ideas/ArtificialIntelligence_p012/artificial-intelligence/K-Means_Spotify)**

This notebook was developed by Science Buddies [www.sciencebuddies.org](https://www.sciencebuddies.org/) as part of a science project to allow students to explore and learn about artificial intelligence. For personal use, this notebook can be downloaded and modified with attribution. For all other uses, please see our [Terms and Conditions of Fair Use](https://www.sciencebuddies.org/about/terms-and-conditions-of-fair-use).  

**Troubleshooting tips**
*   Read the written instructions at Science Buddies and the text and comments on this page carefully.
*   If you make changes that break the code, you can download a fresh copy of this notebook and start over.
*   If you are using this notebook for a science project and need help, visit our [Ask an Expert](https://www.sciencebuddies.org/science-fair-projects/ask-an-expert-intro) forum for assistance.

## **How To Use This Notebook**

This notebook contains text fields, like this one, that give you information about the project and instructions.

In [None]:
# There are also code blocks, like this one.

# The green lines in a code block are comments. Comments describe what the code does.

# The non-green text in a code block is the Python code. Click on the triangle in the top left corner to run this code block.

print("Congratulations, you ran a code block! Try changing the text in the code and running it again.")

## **Importing Libraries**
We will start this science project by importing some necessary libraries. These libraries contain functions that we will use to load, cluster, and visualize our song data. The comments tell you what each library is for.

In [None]:
# This library provides support for working with arrays and matrices, along with various mathematical functions
# to operate on these arrays
import numpy as np

# Pandas is a powerful data manipulation and analysis library. It provides data structures like DataFrames and
# Series that allow for easy data handling and analysis
import pandas as pd

# This is part of the scikit-learn library and is used to perform K-means clustering, an unsupervised machine
# learning algorithm that groups similar data points into clusters.
from sklearn.cluster import KMeans

# Matplotlib is a widely used plotting library in Python. The "pyplot" submodule provides functions to create various
# types of plots and visualizations
import matplotlib.pyplot as plt

# Another part of scikit-learn, PCA (Principal Component Analysis) is used for dimensionality reduction. It helps
# transform high-dimensional data into a lower-dimensional representation while preserving as much of the variance as possible.
from sklearn.decomposition import PCA

pd.set_option("display.max_columns", None)    # Sets the maximum number of columns to be displayed in a Pandas DataFrame to be unlimited
pd.set_option("display.max_rows", None)       # Sets the maximum number of rows to be displayed in a Pandas DataFrame to be unlimited
pd.set_option("display.width", None)          # Sets the maximum width of the display for a Pandas DataFrame to be unlimited
pd.set_option("display.max_colwidth", None)   # Sets the maximum width of column contents to be unlimited, allowing for complete display of text data

print("You have imported all the libraries.")

## Loading the Data into a Pandas DataFrame

In [None]:
# Load the CSV file into a pandas DataFrame
df = pd.read_csv("https://www.sciencebuddies.org/ai/colab/spotify.csv?t=AQX3FN5n47RYT9PqwsfVEk9Iuje0qM8M3qwo6EMrO_NZpw")

# We can see what the dataframe looks like by using the head function
df.head()

## Preprocessing the Dataset

Dropping NaN Values

In [None]:
# Display the shape before dropping NaN values
print("Shape before dropping NaN values:", df.shape)

# Drop NaN values from the DataFrame
df.dropna(inplace=True)

# Display the shape after dropping NaN values
print("Shape after dropping NaN values:", df.shape)
df.head()

Dropping Features

In [None]:
# TODO: List the columns that you want to drop
columns_to_drop = ['Artist']

# Create a new DataFrame that excludes the specified columns
dropped_df = df.drop(columns=columns_to_drop)

# Let's check if our specified columns are no longer there!
dropped_df.head()

Since KMeans is a distance-based algorithm, it is crucial to normalize or scale the features to ensure that all features contribute equally to the distance calculations.
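
To see why scaling matters, here is a minimal sketch in plain Python. The three songs and their tempo and danceability values are made up for illustration and are not taken from the dataset; the `min_max_scale` helper is also hypothetical. It uses the same min-max formula as the normalization step in this notebook.

```python
import math

# Hypothetical songs described by (tempo in BPM, danceability between 0 and 1)
song_a = (120.0, 0.90)
song_b = (122.0, 0.10)   # similar tempo, very different danceability
song_c = (150.0, 0.88)   # different tempo, similar danceability

def min_max_scale(values):
    """Rescale a list of numbers so the smallest becomes 0 and the largest becomes 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

songs = [song_a, song_b, song_c]
tempos = min_max_scale([s[0] for s in songs])
dance = min_max_scale([s[1] for s in songs])
scaled = list(zip(tempos, dance))

# Euclidean distances before scaling: tempo dominates because its numbers are larger
raw_ab = math.dist(song_a, song_b)
raw_ac = math.dist(song_a, song_c)

# Euclidean distances after scaling: both features have a comparable influence
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print("Raw distances:    A-B =", round(raw_ab, 2), " A-C =", round(raw_ac, 2))
print("Scaled distances: A-B =", round(scaled_ab, 2), " A-C =", round(scaled_ac, 2))
```

Before scaling, song A looks far closer to song B than to song C only because tempo is measured in much larger numbers than danceability. After scaling, a large difference in either feature counts about the same.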

In [None]:
# We can use the describe() function to provide a summary of statistical information about the numerical columns in the DataFrame
dropped_df.describe()

In [None]:
# TODO: Identify the numerical feature columns you want to normalize
numerical_columns = ['Loudness']

# Create a copy of dropped_df so that dropped_df itself stays unchanged
final_df = dropped_df.copy()

# Apply min-max scaling to the selected numerical feature columns
final_df[numerical_columns] = (dropped_df[numerical_columns] - dropped_df[numerical_columns].min()) / (dropped_df[numerical_columns].max() - dropped_df[numerical_columns].min())

# Let's see what our normalization did!
final_df.describe()

## Clustering the Data
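
The elbow plot in the next code block tracks inertia, which scikit-learn defines as the sum of squared distances from each sample to its closest cluster center. The sketch below computes that quantity by hand for four made-up points. The `inertia` helper and the centroid positions are illustrative assumptions and are not part of the notebook's code.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        # Squared distance from this point to every centroid; keep the smallest
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for centroid in centroids
        )
    return total

# Two tight pairs of points that sit far apart from each other
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# One centroid in the middle of everything versus one centroid per pair
inertia_one_cluster = inertia(points, [(5.0, 0.5)])
inertia_two_clusters = inertia(points, [(0.0, 0.5), (10.0, 0.5)])

print("Inertia with 1 cluster: ", inertia_one_cluster)
print("Inertia with 2 clusters:", inertia_two_clusters)
```

Inertia always goes down as clusters are added, so the smallest value is not the goal. The "elbow" is the point where adding another cluster stops giving a large drop.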

In [None]:
# Function that plots the inertia for each number of clusters so you can pick the optimum at the "elbow"
def optimise_k_means(data, max_k):
    means = []
    inertias = []

    for k in range(1, max_k+1):
        kmeans = KMeans(n_clusters=k, n_init=10)  # n_init is the number of random restarts; it does not depend on max_k
        kmeans.fit(data)

        means.append(k)
        inertias.append(kmeans.inertia_)

    # Generate the elbow plot
    plt.figure(figsize=(10, 5))  # Create a new figure
    plt.plot(means, inertias, 'o-')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.grid(True)
    plt.show()

print("This code block has been run and the optimise_k_means() function is now available for use.")

In [None]:
optimise_k_means(final_df, 10)

## Applying K-Means Clustering

In [None]:
# Initialize a KMeans model with the number of clusters you chose from the elbow plot
kmeans = KMeans(n_clusters= , n_init='auto') # TODO: Insert number of clusters

# Fit the KMeans model to the data in 'final_df'
kmeans.fit(final_df)

# Assign cluster labels to each data point and add the 'Cluster' column to the original DataFrame
df['Cluster'] = kmeans.labels_
final_df['Cluster'] = kmeans.labels_

# Let's check out what our DataFrame looks like now!
df.head()

## Visualize the Model
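
PCA finds the directions along which the data varies the most and describes each point using those directions. This notebook uses scikit-learn's `PCA` to go from many features down to two. The sketch below is a simplified, hand-written version that reduces made-up two-feature points to one number each, using the closed-form solution that exists for two features. The function name `first_principal_component` and the data are assumptions for illustration only.

```python
import math

def first_principal_component(points):
    """Return the unit direction of greatest variance for 2-D points, plus their mean."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n

    # Entries of the 2x2 covariance matrix
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)

    # For a 2x2 symmetric matrix, the angle of the top eigenvector has a closed form
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return (math.cos(angle), math.sin(angle)), (mean_x, mean_y)

# Made-up points that lie along a diagonal line
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, center = first_principal_component(points)

# Project each centered point onto the direction to get one coordinate per point
scores = [
    (x - center[0]) * direction[0] + (y - center[1]) * direction[1]
    for x, y in points
]

print("Direction of greatest variance:", direction)
print("1-D coordinates:", [round(s, 3) for s in scores])
```

Because the points lie on a diagonal, a single coordinate along that diagonal keeps all of the information about where each point sits. Real data loses some detail when it is reduced, which is why a PCA plot is only an approximate picture of the clusters.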

In [None]:
# Perform PCA for dimensionality reduction
pca = PCA(n_components=2)
reduced_features = pca.fit_transform(final_df.drop('Cluster', axis=1)) # Exclude the cluster labels

# Add the reduced components to the DataFrame
final_df['pca_1'] = reduced_features[:, 0]
final_df['pca_2'] = reduced_features[:, 1]

# Create a scatter plot
plt.scatter(final_df['pca_1'], final_df['pca_2'], c=final_df['Cluster'], cmap='viridis')
plt.title('KMeans Clustering Results using PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

## Creating Our Song Recommendation Function
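
The idea behind the recommendation function is simple: look up the cluster of the chosen song, then pick other songs from that same cluster. Here is a minimal plain-Python sketch of that idea. The `catalog` dictionary, the song names, and the `recommend` function are hypothetical and only stand in for the real DataFrame. Unlike the notebook's function, this variation never returns the chosen song itself and never repeats a pick.

```python
import random

# Hypothetical catalog: track name -> cluster label (not taken from the real dataset)
catalog = {
    "Song A": 0, "Song B": 0, "Song C": 0,
    "Song D": 1, "Song E": 1, "Song F": 0,
}

def recommend(track_name, catalog, count=2, seed=None):
    """Pick up to `count` distinct tracks from the same cluster as `track_name`."""
    if track_name not in catalog:
        return []  # Unknown track: nothing to recommend

    cluster = catalog[track_name]

    # Keep tracks from the same cluster, leaving out the track that was chosen
    candidates = [
        track for track, label in catalog.items()
        if label == cluster and track != track_name
    ]

    # random.sample draws without replacement, so no track appears twice
    rng = random.Random(seed)
    return rng.sample(candidates, min(count, len(candidates)))

picks = recommend("Song A", catalog, count=2, seed=0)
print(picks)
```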

In [None]:
# This function attempts to find the index of a given track name in the 'Track' column of the DataFrame
def find_track_index(track_name, df):
    try:
        # Attempt to find the index of the first occurrence of 'track_name' in the 'Track' column of 'df'
        track_index = df[df['Track'] == track_name].index[0]
        # Return the index if found
        return track_index
    except IndexError:
        # If the track name is not found, return None
        return None

In [None]:
# This function finds song recommendations based on a given track name and the DataFrame 'df'
def find_song_recommendation(track_name, df):
    # Call the 'find_track_index' function to get the index of the provided 'track_name'
    track_index = find_track_index(track_name, df)

    # Stop early if the track is not in the DataFrame (find_track_index returned None)
    if track_index is None:
        print("Could not find '" + track_name + "' in the dataset. Check the spelling and capitalization.")
        return

    # Retrieve the cluster label of the provided track using its index
    cluster = df.loc[track_index, 'Cluster']

    # Create a filter to select rows in 'df' that belong to the same cluster as the provided track
    same_cluster = (df['Cluster'] == cluster)

    # Apply the filter to 'df' to get a DataFrame containing songs from the same cluster
    filtered_df = df[same_cluster]

    # Generate song recommendations by randomly selecting tracks from the same cluster
    for i in range(5):
        # Randomly sample one track from the filtered DataFrame (the same song can come up more than once)
        recommendation = filtered_df.sample()
        # Print the recommended track's title and artist
        print(recommendation.iloc[0]['Track'] + ' by ' + recommendation.iloc[0]['Artist'])

In [None]:
# TODO: Experiment with inputting different song names!
find_song_recommendation('Clint Eastwood', df)

## Creating Our Song Randomizer Function

In [None]:
def find_random_song(track_name, df):
    # Call the 'find_track_index' function to get the index of the provided 'track_name'
    track_index = find_track_index(track_name, df)

    # Stop early if the track is not in the DataFrame (find_track_index returned None)
    if track_index is None:
        print("Could not find '" + track_name + "' in the dataset. Check the spelling and capitalization.")
        return

    # Retrieve the cluster label of the provided track using its index
    cluster = df.loc[track_index, 'Cluster']

    # Create a filter to select rows in 'df' that don't belong to the same cluster as the provided track
    other_clusters = (df['Cluster'] != cluster)

    # Apply the filter to 'df' to get a DataFrame containing songs from different clusters
    filtered_df = df[other_clusters]

    # Generate random songs by randomly selecting tracks from the filtered DataFrame
    for i in range(5):
        # Randomly sample one track from the filtered DataFrame (the same song can come up more than once)
        random_song = filtered_df.sample()
        # Print the random song's title and artist
        print(random_song.iloc[0]['Track'] + ' by ' + random_song.iloc[0]['Artist'])

In [None]:
# TODO: Experiment with inputting different song names!
find_random_song('Blue Flame', df)

## Evaluating the Model

In [None]:
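# TODO: Insert the accuracies for each of the functions
# (each list needs at least one value, otherwise the averages below will raise a ZeroDivisionError)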
recommendations_accuracy = []
random_songs_accuracy = []

recommendations_average = sum(recommendations_accuracy) / len(recommendations_accuracy)
random_songs_average = sum(random_songs_accuracy) / len(random_songs_accuracy)

print("Recommendations average accuracy:", recommendations_average)
print("Random songs average accuracy:", random_songs_average)