# Machine Learning with Spotify

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Silhouette_of_bare_tree_branches_under_twilight_sky.jpg/640px-Silhouette_of_bare_tree_branches_under_twilight_sky.jpg' width = 600>


Spotify has data on millions of songs. It assigns certain attributes to each song to describe the music. In machine learning, we can use these attributes (known as **features**) to train a model. Here are a few:

- **Mode**: 1=major, 0=minor
- **Tempo**: beats per minute
- **Duration**: length of the song (in milliseconds)
- **Time signature**: 4=4/4, 3=3/4

- **Acousticness**: confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- **Danceability**: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- **Energy**: measure from 0.0 to 1.0 representing the perceived intensity and activity of a track. Typically, energetic tracks feel fast, loud, and noisy.
- **Instrumentalness**: predicts whether a track contains no vocals.  Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- **Liveness**: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
- **Speechiness**: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
- **Valence**: measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

    

**Import Pandas**

**Read in the Spotify dataset: `../data/spotify_all_genres_tracks.csv`**

**Get a summary of the numerical data.**

**List the column names.**
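One possible way to work through the loading steps above is sketched here. So that the sketch runs on its own, it reads a tiny made-up CSV from memory; the column names (`genre`, `danceability`, `energy`, `tempo`) and the values are assumptions, not taken from the real file.

```python
import io
import pandas as pd

# Tiny made-up stand-in for the real file so the sketch runs on its own.
# In the notebook you would instead call:
#   spotify = pd.read_csv('../data/spotify_all_genres_tracks.csv')
csv_text = """genre,danceability,energy,tempo
rock,0.45,0.90,140.0
jazz,0.55,0.35,95.0
pop,0.80,0.70,120.0
rock,0.50,0.85,150.0
"""
spotify = pd.read_csv(io.StringIO(csv_text))

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(spotify.describe())

# Column names as a plain Python list
print(spotify.columns.tolist())
```

`describe()` skips the text column here because, by default, it summarizes only numeric columns when any are present.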

We are interested in predicting the genre of each song using the other features.

**Identify the genres that are used in this dataset.**

In machine learning, you are interested in predicting something. In this case, it is the genre of the song. By convention, we set this "target" equal to `y`.
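A minimal sketch of listing the genres and setting the target follows. The DataFrame is made up for illustration, and the column name `genre` is an assumption; check the real column names before copying this.

```python
import pandas as pd

# Made-up rows; the real dataset has many more songs and columns
spotify = pd.DataFrame({
    'genre': ['rock', 'jazz', 'pop', 'rock', 'jazz'],
    'energy': [0.90, 0.35, 0.70, 0.85, 0.40],
})

# Distinct genre labels, and how many songs carry each one
print(spotify['genre'].unique())
print(spotify['genre'].value_counts())

# The column we want to predict becomes the target
y = spotify['genre']
```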

We can choose which features we want to use to train the model.

**Choose several features and place them in a list, stored in a variable called `features`.**

**By convention, the list of features is set equal to the variable `X`.**

**View and summarize the feature set.**
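Selecting features might look like the sketch below. The three feature names are assumed column names chosen for illustration, and the data is made up; any numeric columns from the real dataset could be used instead.

```python
import pandas as pd

# Made-up rows standing in for the real dataset
spotify = pd.DataFrame({
    'genre': ['rock', 'jazz', 'pop', 'rock'],
    'danceability': [0.45, 0.55, 0.80, 0.50],
    'energy': [0.90, 0.35, 0.70, 0.85],
    'tempo': [140.0, 95.0, 120.0, 150.0],
})

# Columns the model is allowed to look at (assumed names)
features = ['danceability', 'energy', 'tempo']

# Indexing with a list of names returns a DataFrame with just those columns
X = spotify[features]

print(X.head())      # first few rows
print(X.describe())  # summary statistics for each feature
```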

We will use the machine learning library **Scikit-learn** (`sklearn`), specifically its decision tree classifier, to train the model.

**Import the sklearn library.**

**Assign the model to an object called `spotify_model`. Specify a number for `random_state` to ensure the same results on each run.**

**Train the model using the features and target data.**

**Make predictions for the first 5 songs of the dataset.**

**Check the accuracy of the model's predictions.**
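The train-and-predict steps above can be sketched as follows. The data is randomly generated and the labelling rule (high energy means `rock`) is invented purely so the example has something to learn; only `spotify_model`, `X`, and `y` come from the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Spotify features (assumed column names)
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    'danceability': rng.random(n),
    'energy': rng.random(n),
    'tempo': rng.uniform(60, 180, n),
})
# Invented rule so the target is learnable
y = pd.Series(np.where(X['energy'] > 0.5, 'rock', 'jazz'), name='genre')

# random_state fixes the tie-breaking so results repeat between runs
spotify_model = DecisionTreeClassifier(random_state=1)
spotify_model.fit(X, y)

# Predict the first five songs and compare with the true labels by eye
predictions = spotify_model.predict(X.head())
print(predictions)
print(y.head().tolist())
```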

Since we are training and testing the model on thousands of songs, we need an efficient way of measuring the model's accuracy.

**Import the `accuracy_score` function from Scikit-learn.**

**Run the model on all of the songs in the dataset.**

**Measure the accuracy of the model.**
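A minimal sketch of scoring the model on the full dataset is shown below, again on synthetic data with an invented labelling rule. Note that scoring on the same rows the model was trained on tends to give an overly optimistic number.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumed column names, invented rule)
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    'energy': rng.random(n),
    'tempo': rng.uniform(60, 180, n),
})
y = pd.Series(np.where(X['energy'] > 0.5, 'rock', 'jazz'), name='genre')

spotify_model = DecisionTreeClassifier(random_state=1)
spotify_model.fit(X, y)

# Predict every song, then compare predictions with the true genres
all_predictions = spotify_model.predict(X)
acc = accuracy_score(y, all_predictions)  # fraction of correct predictions
print(acc)  # 1.0 here, because the unrestricted tree has already seen every row
```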

## Training and Validating Your Model

Properly training your model requires both training and validating it with two different subsets of the data.

**Import the `train_test_split` function from Scikit-learn.**

**Split data into training and validation data, for both the features (`X`) and the target (`y`).**

**View the subsets.**
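The split might be sketched like this, using synthetic data in place of the real features. The names `train_X`, `val_X`, `train_y`, and `val_y` follow the ones used later in this notebook; the `random_state` value is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed column names, invented rule)
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    'energy': rng.random(n),
    'tempo': rng.uniform(60, 180, n),
})
y = pd.Series(np.where(X['energy'] > 0.5, 'rock', 'jazz'), name='genre')

# By default 25% of the rows are held out; X and y are split on the same rows
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

print(train_X.shape, val_X.shape)
print(train_y.head())
```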

**Assign the model to an object called `spotify_model`. Choose a maximum number of leaf nodes and specify a number for `random_state` to ensure the same results on each run.**

**Train the model on the training subset.**

**Use the new model to predict the genres for the validation subset.**

**Score the accuracy of the model's predictions.**
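Putting the training and validation steps together, one possible sketch is shown below. The data is synthetic, and the choice of 10 leaf nodes is arbitrary, made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumed column names, invented rule)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'energy': rng.random(n),
    'tempo': rng.uniform(60, 180, n),
})
y = pd.Series(np.where(X['energy'] > 0.5, 'rock', 'jazz'), name='genre')

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Fit only on the training subset
spotify_model = DecisionTreeClassifier(max_leaf_nodes=10, random_state=1)
spotify_model.fit(train_X, train_y)

# Score on songs the model has never seen
val_predictions = spotify_model.predict(val_X)
val_acc = accuracy_score(val_y, val_predictions)
print(val_acc)
```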

## Improving the model

### Tree depth

We can decide how complex we want the decision tree to be by specifying the number of leaf nodes. Limiting the number of leaf nodes can help control the complexity of the tree, prevent overfitting, and improve generalization on unseen data.

<center><img src = '../imgs/charlie.jpeg' width = 600></center>


Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. The figure below plots error rather than accuracy, so visually we want the low point of the (red) validation curve.

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Overfitting_svg.svg/640px-Overfitting_svg.svg.png' width = 400>

**Assign a value to the variable `max_leaf_nodes`.**

**Retrain the model with the maximum number of leaf nodes and measure its accuracy on the validation subset.**

**Define a function called `get_acc` that accepts `max_leaf_nodes`, `train_X`, `val_X`, `train_y`, and `val_y` as arguments and returns the accuracy of the model.**

**Test the function `get_acc`.**
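A minimal sketch of `get_acc` and a quick test of it follows. The argument order matches the prompt above; the synthetic data, the invented labelling rule, and the fixed `random_state=1` inside the function are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def get_acc(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Train a tree with the given leaf limit and return validation accuracy."""
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return accuracy_score(val_y, model.predict(val_X))


# Synthetic stand-in data to try the function out
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'energy': rng.random(n),
    'tempo': rng.uniform(60, 180, n),
})
y = pd.Series(np.where(X['energy'] > 0.5, 'rock', 'jazz'), name='genre')
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

result = get_acc(5, train_X, val_X, train_y, val_y)
print(result)
```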

**Create a `for` loop that runs the `get_acc` function for 5, 50, 500, and 5000 `max_leaf_nodes` and prints the accuracy for each model run.**

**Modify the `for` loop to find the optimal number for `max_leaf_nodes`.**
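The loop over candidate tree sizes could be sketched as below. The data is synthetic with some labels deliberately scrambled so that larger trees have noise to overfit; the names `scores` and `best_size` are hypothetical helpers, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def get_acc(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return accuracy_score(val_y, model.predict(val_X))


# Synthetic data: an invented rule plus about 15% scrambled labels
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    'energy': rng.random(n),
    'tempo': rng.uniform(60, 180, n),
})
labels = np.where(X['energy'] > 0.5, 'rock', 'jazz')
labels[rng.random(n) < 0.15] = 'pop'
y = pd.Series(labels, name='genre')
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Record validation accuracy for each candidate size
scores = {}
for max_leaf_nodes in [5, 50, 500, 5000]:
    scores[max_leaf_nodes] = get_acc(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"max_leaf_nodes={max_leaf_nodes}: accuracy={scores[max_leaf_nodes]:.3f}")

# The size with the highest validation accuracy
best_size = max(scores, key=scores.get)
print("best:", best_size)
```

To narrow the search, the candidate list can be replaced with a finer range around the best value, for example `range(2, 50)`.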


### Improve Your Model

Continue to adjust the parameters of your model to see if you can improve its accuracy. Consider changing the features that are used to train the model and finding a balance between underfitting and overfitting the model.

## Random Forest

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/D%C3%BClmen%2C_Kirchspiel%2C_Bauerschaft_B%C3%B6rnste_--_2017_--_6919.jpg/640px-D%C3%BClmen%2C_Kirchspiel%2C_Bauerschaft_B%C3%B6rnste_--_2017_--_6919.jpg' width = 600>

A Random Forest Classifier is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model for classification tasks. It is one of the most popular machine learning algorithms due to its simplicity, versatility, and effectiveness.
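As a rough sketch, a random forest can be trained and scored with the same pattern used for the decision tree. The data here is synthetic, the name `forest_model` is hypothetical, and `n_estimators=100` (the number of trees) is simply the library default written out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed column names, invented rule)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'energy': rng.random(n),
    'tempo': rng.uniform(60, 180, n),
})
y = pd.Series(np.where(X['energy'] > 0.5, 'rock', 'jazz'), name='genre')
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Each tree is fit on a bootstrap sample; the forest combines their predictions
forest_model = RandomForestClassifier(n_estimators=100, random_state=1)
forest_model.fit(train_X, train_y)

forest_acc = accuracy_score(val_y, forest_model.predict(val_X))
print(forest_acc)
```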