benitomartin/mlops-music-clustering

MLOps Music Clustering 🎸

This is a personal MLOps project based on this Kaggle dataset with music features from Spotify. Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉

Tech Stack

Visual Studio Code, Jupyter Notebook, Python, Pandas, NumPy, Matplotlib, Plotly, scikit-learn, FastAPI, Docker, Anaconda, Linux, AWS, Git, Streamlit

Project Structure

The project has been structured with the following folders and files:

  • .github/workflows: CI/CD Pipeline
  • data: dataset
  • images: images from results
  • streamlit: streamlit app
  • notebooks: EDA and Modelling performed at the beginning of the project to establish a baseline
  • model: saved best model
  • requirements.txt: project requirements
  • Dockerfile: docker image for deployment
  • app.py: FastAPI app

Project Description

Exploratory Data Analysis

The dataset was obtained from Kaggle and contains 232,725 rows and the following columns with song features:

  • genre
  • danceability
  • loudness
  • artist_name
  • duration_ms
  • mode
  • track_name
  • energy
  • speechiness
  • track_id
  • instrumentalness
  • tempo
  • popularity
  • keys
  • time_signature
  • acousticness
  • liveness
  • valence

To prepare the data for modelling, an Exploratory Data Analysis was conducted to preprocess the numerical features and choose suitable scalers for the preprocessing pipeline. Plotting three features against each other before scaling shows that most features are not strongly correlated, with the exception of acousticness, energy and loudness.
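A correlation check like the one described can be sketched with pandas; the DataFrame below uses random stand-in data rather than the real Spotify features:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three of the Spotify feature columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "acousticness": rng.random(500),
    "energy": rng.random(500),
    "loudness": rng.random(500) * -60,  # loudness is negative dB in the dataset
})

# Pairwise Pearson correlations between the numerical features.
corr = df.corr()
print(corr.round(2))
```

On the real dataset, the acousticness/energy/loudness cells of this matrix would show the stronger correlations mentioned above.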

To choose the scalers, the distribution and boxplot of each feature were analyzed. Features with significant outliers were scaled with RobustScaler, normally distributed features with StandardScaler, and the rest with MinMaxScaler.
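This mixed-scaler setup maps naturally onto a scikit-learn ColumnTransformer. The feature-to-scaler assignment below is hypothetical (the real split comes from the distribution/boxplot analysis) and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical assignment of features to scalers.
outlier_feats = ["loudness", "speechiness"]  # heavy outliers -> RobustScaler
normal_feats = ["danceability", "tempo"]     # roughly normal -> StandardScaler
other_feats = ["energy", "valence"]          # the rest       -> MinMaxScaler

preprocessor = ColumnTransformer([
    ("robust", RobustScaler(), outlier_feats),
    ("standard", StandardScaler(), normal_feats),
    ("minmax", MinMaxScaler(), other_feats),
])

# Synthetic stand-in data with the assumed column names.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 6)),
                  columns=outlier_feats + normal_feats + other_feats)
X_scaled = preprocessor.fit_transform(df)
```

The transformer keeps each group's scaler fitted separately, so the same object can be reused at inference time inside the deployed pipeline.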

Afterwards, the scaled features were fitted to a PCA model with the following objectives:

  • reduce dimensionality to get a better visual feedback on our clustering
  • use the orthogonality of the principal components so that the KMeans algorithm increases its clustering power

A threshold of 95% explained variance was set to determine the number of principal components, which ended up being 3.
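scikit-learn supports this directly: passing a float below 1 to PCA keeps the smallest number of components whose cumulative explained variance reaches that threshold. A minimal sketch on synthetic data with three dominant directions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic scaled features where 3 directions carry almost all the variance.
rng = np.random.default_rng(0)
scales = np.array([5, 4, 3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
X = rng.normal(size=(500, 10)) * scales

# n_components=0.95 keeps just enough components for 95% explained variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

On this toy data the threshold selects 3 components, matching the outcome reported for the project.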

Then the optimal number of clusters for K-Means was determined using the within-cluster sum of squares (WCSS) and the "elbow" or "knee" point in the WCSS curve:

  • Calculate WCSS for different numbers of clusters: The code iterates through a range of cluster numbers from 1 to max_clusters - 1 and fits a K-Means model to the data. The kmeans.inertia_ attribute returns the WCSS for the current number of clusters, which is then appended to the wcss list.

  • Determine the optimal number of clusters: The KneeLocator is used to find the optimal number of clusters based on the WCSS values. The 'elbow' or 'knee' point represents the optimal number of clusters where adding more clusters doesn't significantly reduce the WCSS.

The results after scaling, getting the number of PCs (3) and clusters (5), show a clear grouping of the numerical features, as well as the distribution of the features along each cluster.

Modelling

After the EDA, a classifier modelling was performed using the following models:

  • KNeighborsClassifier
  • MLPClassifier
  • SVC
  • AdaBoostClassifier
  • DecisionTreeClassifier
  • GaussianNB
  • RandomForestClassifier
  • QuadraticDiscriminantAnalysis

All models achieved an accuracy above 0.9, with SVC delivering the best results with only 58 false negatives and false positives.
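A comparison loop like the one described can be sketched as below, using a subset of the listed models and synthetic data standing in for the scaled features with the 5 cluster labels as targets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 5 classes, mimicking the 5 KMeans cluster labels.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# Fit every candidate and compare test accuracy.
scores = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The same loop extends to the remaining classifiers in the list above by adding them to the `models` dict.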

Deployment

For the deployment, a CI/CD pipeline was set up that pushes a Docker image to an ECR repository and launches it on an EC2 instance in AWS. The app can then be used in Streamlit either by loading the saved model or by using the Service URL of the EC2 instance.

AWS CI/CD

Create an IAM user with the following policies:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonEC2FullAccess

Create EC2 instance:

  • t2.micro
  • Allow: HTTPS, SSH and HTTP
  • 30 GB of gp2 storage instead of the default 8 GB
  • Get a key pair
  • Add a security group rule for port 8000

Create an ECR repository for the Docker image:

  • 406345071577.dkr.ecr.eu-central-1.amazonaws.com/music

Open the EC2 instance and install Docker with the following commands:

sudo apt-get update -y

sudo apt-get upgrade

curl -fsSL https://get.docker.com -o get-docker.sh

sudo sh get-docker.sh

sudo usermod -aG docker ubuntu

newgrp docker

Check that Docker is installed:

docker --version

Configure EC2 as self-hosted runner:

  • In the GitHub repository: Settings -> Actions -> Runners -> New self-hosted runner -> choose Linux as the OS -> then run each command one by one in the EC2 terminal

Setup GitHub secrets:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION
  • AWS_ECR_LOGIN_URI
  • ECR_REPOSITORY_NAME

Make sure these values match the ones used in the workflow YAML file, then push the code to GitHub.
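The workflow wiring can be sketched as below. This is a hypothetical skeleton, not the project's actual .github/workflows file: job names, action versions and the run commands are illustrative, while the secret names match the list above.

```yaml
name: CI/CD

on:
  push:
    branches: [main]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and push image to ECR
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          aws ecr get-login-password --region ${{ secrets.AWS_REGION }} | \
            docker login --username AWS --password-stdin ${{ secrets.AWS_ECR_LOGIN_URI }}
          docker build -t ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest .
          docker push ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest

  deploy:
    needs: build-and-push
    runs-on: self-hosted   # the EC2 instance registered as a runner above
    steps:
      - name: Pull and run the image
        run: |
          docker pull ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest
          docker run -d -p 8000:8000 ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest
```

Because the deploy job runs on the self-hosted EC2 runner, the pull/run commands execute directly on the instance that serves the FastAPI app on port 8000.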

App

The Streamlit app can be run with the saved model or with the Service URL. By selecting the desired features, a music playlist corresponding to a cluster will be generated. Clicking Search on YouTube selects a random song and redirects you to YouTube to listen to it.
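The Service URL path can be sketched as a small client like the one below. The endpoint route, payload fields and response key are assumptions for illustration; the real schema lives in app.py:

```python
import requests

# Hypothetical base URL; replace with the EC2 Service URL (port 8000).
API_URL = "http://ec2-service-url:8000"


def build_payload(danceability: float, energy: float, valence: float) -> dict:
    """Assemble the feature payload sent to the FastAPI app.

    The field names here are illustrative; check app.py for the real schema.
    """
    return {"danceability": danceability, "energy": energy, "valence": valence}


def predict_cluster(payload: dict) -> int:
    """POST the selected features and return the predicted cluster id.

    The /predict route and 'cluster' response key are assumed names.
    """
    resp = requests.post(f"{API_URL}/predict", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["cluster"]
```

The Streamlit app would call something like `predict_cluster(build_payload(...))` with the slider values, then render the playlist for the returned cluster.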