benitomartin/mlops-music-clustering

MLOps Music Clustering 🎸

This is a personal MLOps project based on this Kaggle dataset with music features from Spotify. Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉

Tech Stack

Visual Studio Code, Jupyter Notebook, Python, Pandas, NumPy, Matplotlib, Plotly, scikit-learn, FastAPI, Docker, Anaconda, Linux, AWS, Git, Streamlit

Project Structure

The project has been structured with the following folders and files:

  • .github/workflows: CI/CD Pipeline
  • data: dataset
  • images: images from results
  • streamlit: streamlit app
  • notebooks: EDA and Modelling performed at the beginning of the project to establish a baseline
  • model: saved best model
  • requirements.txt: project requirements
  • Dockerfile: docker image for deployment
  • app.py: FastAPI app

Project Description

Exploratory Data Analysis

The dataset was obtained from Kaggle and contains 232,725 rows and the following columns with song features:

  • genre
  • danceability
  • loudness
  • artist_name
  • duration_ms
  • mode
  • track_name
  • energy
  • speechiness
  • track_id
  • instrumentalness
  • tempo
  • popularity
  • keys
  • time_signature
  • acousticness
  • liveness
  • valence

To prepare the data for modelling, an Exploratory Data Analysis was conducted to preprocess the numerical features and choose suitable scalers for the preprocessing pipeline. Plotting three features against each other before scaling shows that most features are not strongly correlated, with the exception of acousticness, energy and loudness.
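A correlation check like the one described can be sketched with pandas; the DataFrame below uses random stand-in data rather than the real Spotify features:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three of the Spotify feature columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "acousticness": rng.random(500),
    "energy": rng.random(500),
    "loudness": rng.random(500) * -60,  # loudness is negative dB in the dataset
})

# Pairwise Pearson correlations between the numerical features.
corr = df.corr()
print(corr.round(2))
```

On the real dataset, the acousticness/energy/loudness cells of this matrix would show the stronger correlations mentioned above.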

To choose the scalers, the distribution and boxplot of each feature were analyzed. Features with significant outliers were scaled with RobustScaler, normally distributed features with StandardScaler, and the rest with MinMaxScaler.
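This mixed-scaler setup maps naturally onto a scikit-learn ColumnTransformer. The feature-to-scaler assignment below is hypothetical (the real split comes from the distribution/boxplot analysis) and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical assignment of features to scalers.
outlier_feats = ["loudness", "speechiness"]  # heavy outliers -> RobustScaler
normal_feats = ["danceability", "tempo"]     # roughly normal -> StandardScaler
other_feats = ["energy", "valence"]          # the rest       -> MinMaxScaler

preprocessor = ColumnTransformer([
    ("robust", RobustScaler(), outlier_feats),
    ("standard", StandardScaler(), normal_feats),
    ("minmax", MinMaxScaler(), other_feats),
])

# Synthetic stand-in data with the assumed column names.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 6)),
                  columns=outlier_feats + normal_feats + other_feats)
X_scaled = preprocessor.fit_transform(df)
```

The transformer keeps each group's scaler fitted separately, so the same object can be reused at inference time inside the deployed pipeline.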

Afterwards, the scaled features were fitted to a PCA model with the following objectives:

  • reduce dimensionality to get a better visual feedback on our clustering
  • use the orthogonality of the principal components so that the KMeans algorithm increases its clustering power

A threshold of 95% explained variance was set to determine the number of principal components, which ended up being 3.
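scikit-learn supports this directly: passing a float below 1 to PCA keeps the smallest number of components whose cumulative explained variance reaches that threshold. A minimal sketch on synthetic data with three dominant directions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic scaled features where 3 directions carry almost all the variance.
rng = np.random.default_rng(0)
scales = np.array([5, 4, 3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
X = rng.normal(size=(500, 10)) * scales

# n_components=0.95 keeps just enough components for 95% explained variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

On this toy data the threshold selects 3 components, matching the outcome reported for the project.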

Then the optimal number of clusters for K-Means was determined using the within-cluster sum of squares (WCSS) and the "elbow" or "knee" point in the WCSS curve:

  • Calculate WCSS for different numbers of clusters: The code iterates through a range of cluster numbers from 1 to max_clusters - 1 and fits a K-Means model to the data. The kmeans.inertia_ attribute returns the WCSS for the current number of clusters, which is then appended to the wcss list.

  • Determine the optimal number of clusters: The KneeLocator is used to find the optimal number of clusters based on the WCSS values. The 'elbow' or 'knee' point represents the optimal number of clusters where adding more clusters doesn't significantly reduce the WCSS.

The results after scaling, getting the number of PCs (3) and clusters (5), show a clear grouping of the numerical features, as well as the distribution of the features along each cluster.

Modelling

After the EDA, a classifier modelling was performed using the following models:

  • KNeighborsClassifier
  • MLPClassifier
  • SVC
  • AdaBoostClassifier
  • DecisionTreeClassifier
  • GaussianNB
  • RandomForestClassifier
  • QuadraticDiscriminantAnalysis

All models achieved an accuracy above 0.9, with SVC delivering the best results with only 58 false negatives and false positives.
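A comparison loop like the one described can be sketched as below, using a subset of the listed models and synthetic data standing in for the scaled features with the 5 cluster labels as targets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 5 classes, mimicking the 5 KMeans cluster labels.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# Fit every candidate and compare test accuracy.
scores = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The same loop extends to the remaining classifiers in the list above by adding them to the `models` dict.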

Deployment

For the deployment, a CI/CD pipeline was set up that pushes a Docker image to an ECR repository and launches it on an EC2 instance in AWS. The app can then be used in Streamlit either by loading the saved model or by using the Service URL of the EC2 instance.

AWS CI/CD

Create an IAM user with the following policies:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonEC2FullAccess

Create EC2 instance:

  • t2.micro
  • Allow: HTTPS, SSH and HTTP
  • 30 GB of gp2 storage instead of the default 8 GB
  • Get a key pair
  • Add a security group rule for port 8000

Create an ECR repository for the Docker image:

  • 406345071577.dkr.ecr.eu-central-1.amazonaws.com/music

Open the EC2 instance and install Docker with the following commands:

sudo apt-get update -y

sudo apt-get upgrade

curl -fsSL https://get.docker.com -o get-docker.sh

sudo sh get-docker.sh

sudo usermod -aG docker ubuntu

newgrp docker

Check that Docker is installed:

docker --version

Configure EC2 as self-hosted runner:

  • In the GitHub repository: Settings -> Actions -> Runners -> New self-hosted runner -> choose Linux as the OS -> then run each command one by one in the EC2 terminal

Setup GitHub secrets:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION
  • AWS_ECR_LOGIN_URI
  • ECR_REPOSITORY_NAME

Make sure these values match the ones used in the workflow YAML file, then push the code to GitHub.
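The workflow wiring can be sketched as below. This is a hypothetical skeleton, not the project's actual .github/workflows file: job names, action versions and the run commands are illustrative, while the secret names match the list above.

```yaml
name: CI/CD

on:
  push:
    branches: [main]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and push image to ECR
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          aws ecr get-login-password --region ${{ secrets.AWS_REGION }} | \
            docker login --username AWS --password-stdin ${{ secrets.AWS_ECR_LOGIN_URI }}
          docker build -t ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest .
          docker push ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest

  deploy:
    needs: build-and-push
    runs-on: self-hosted   # the EC2 instance registered as a runner above
    steps:
      - name: Pull and run the image
        run: |
          docker pull ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest
          docker run -d -p 8000:8000 ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest
```

Because the deploy job runs on the self-hosted EC2 runner, the pull/run commands execute directly on the instance that serves the FastAPI app on port 8000.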

App

The Streamlit app can be run with the saved model or with the Service URL. By selecting the desired features, a music playlist corresponding to a cluster will be generated. Clicking Search on YouTube selects a random song and redirects you to YouTube to listen to it.
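The Service URL path can be sketched as a small client like the one below. The endpoint route, payload fields and response key are assumptions for illustration; the real schema lives in app.py:

```python
import requests

# Hypothetical base URL; replace with the EC2 Service URL (port 8000).
API_URL = "http://ec2-service-url:8000"


def build_payload(danceability: float, energy: float, valence: float) -> dict:
    """Assemble the feature payload sent to the FastAPI app.

    The field names here are illustrative; check app.py for the real schema.
    """
    return {"danceability": danceability, "energy": energy, "valence": valence}


def predict_cluster(payload: dict) -> int:
    """POST the selected features and return the predicted cluster id.

    The /predict route and 'cluster' response key are assumed names.
    """
    resp = requests.post(f"{API_URL}/predict", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["cluster"]
```

The Streamlit app would call something like `predict_cluster(build_payload(...))` with the slider values, then render the playlist for the returned cluster.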