## Dataset Info

The dataset represents audio features for a number of audio files.

  + The audio files were obtained from artists' music videos on YouTube using the `pytube` / `pytubefix` package.
  + The audio features were obtained using the `librosa` package.
  + Each YouTube video / song was split into 30-second chunks called tracks, and each row represents one of these 30-second tracks.
  + Columns ending with "mean" represent mean values, and columns ending with "var" represent variance values
  + Based on audio sampling methods used, the `track_length` is represented in units where 22050 represents one second.



## Data Loading

In [2]:
import pandas as pd
from pandas import read_csv

TRACK_LENGTH = 30
N_MFCC = 13

csv_filepath = ("/Users/chibuzor/Downloads/Python Repo/AI-Powered-Music-Recommendation-System/Dataset/"
    # "https://github.com/s2t2/ml-music-2023/raw/main/data/youtube/features_v1/"
    f"length_{TRACK_LENGTH}_mfcc_{N_MFCC}_features.csv"
)
df = read_csv(csv_filepath)
df.head()

Unnamed: 0,artist_name,video_id,audio_filename,track_number,track_length,tempo,chroma_stft_mean,chroma_stft_var,rms_mean,rms_var,...,mfcc_9_mean,mfcc_9_var,mfcc_10_mean,mfcc_10_var,mfcc_11_mean,mfcc_11_var,mfcc_12_mean,mfcc_12_var,mfcc_13_mean,mfcc_13_var
0,frank_sinatra,rSrc7aulay8,Fly Me To The Moon (In Other Words).mp4,1,661500,123.046875,0.381353,0.108483,0.035173,0.000843,...,6.451812,96.068996,1.410364,101.259422,1.96838,89.808526,9.885193,85.470773,0.186532,85.876381
1,frank_sinatra,rSrc7aulay8,Fly Me To The Moon (In Other Words).mp4,2,661500,117.453835,0.325079,0.095037,0.059724,0.000803,...,2.529402,163.63882,1.359641,111.545326,-0.145113,95.362841,9.985497,107.73133,0.40804,86.850819
2,frank_sinatra,rSrc7aulay8,Fly Me To The Moon (In Other Words).mp4,3,661500,117.453835,0.39123,0.091207,0.062308,0.001372,...,-4.243943,77.650651,5.9319,59.56585,0.710004,56.129559,5.494471,71.811565,-2.069526,62.430867
3,frank_sinatra,rSrc7aulay8,Fly Me To The Moon (In Other Words).mp4,4,661500,117.453835,0.354422,0.093518,0.070922,0.001636,...,-0.558803,111.401847,4.309566,79.8292,1.935659,77.462955,9.47608,86.830481,-2.179213,78.055285
4,frank_sinatra,LWXUdqvVO8Y,Somethin Stupid.mp4,1,661500,103.359375,0.372852,0.09057,0.078486,0.001692,...,-7.024814,67.282919,-6.018267,87.265848,-7.079118,86.895205,-1.009947,69.512053,-9.353167,55.724105


In [3]:
# Printing shape of the dataset
print(f"Shape of the dataset: {df.shape}")

# Printing the number of columns and rows
rows, columns = df.shape
print("Number of rows:",rows)
print("Number of columns:",columns)

Shape of the dataset: (1818, 46)
Number of rows: 1818
Number of columns: 46


In [4]:
# Print the column names.
print ('COLUNM NAME')
print ('-------------')
for column in df.columns:
    print(column)

COLUNM NAME
-------------
artist_name
video_id
audio_filename
track_number
track_length
tempo
chroma_stft_mean
chroma_stft_var
rms_mean
rms_var
spectral_centroid_mean
spectral_centroid_var
spectral_bandwidth_mean
spectral_bandwidth_var
spectral_rolloff_mean
spectral_rolloff_var
zero_crossing_rate_mean
zero_crossing_rate_var
tonnetz_mean
tonnetz_var
mfcc_1_mean
mfcc_1_var
mfcc_2_mean
mfcc_2_var
mfcc_3_mean
mfcc_3_var
mfcc_4_mean
mfcc_4_var
mfcc_5_mean
mfcc_5_var
mfcc_6_mean
mfcc_6_var
mfcc_7_mean
mfcc_7_var
mfcc_8_mean
mfcc_8_var
mfcc_9_mean
mfcc_9_var
mfcc_10_mean
mfcc_10_var
mfcc_11_mean
mfcc_11_var
mfcc_12_mean
mfcc_12_var
mfcc_13_mean
mfcc_13_var


In [5]:
# Print the number of unique songs, as indicated by the video_id column 

print(f"Count of Unique Songs: {df['video_id'].nunique()}")

Count of Unique Songs: 206


In [6]:
# Print the number of unique artists, as indicated by the artist_name column.

print(f"Count of Unique Artists: {df['artist_name'].nunique()}")

Count of Unique Artists: 24


## Artist Analysis


In [7]:
# Rows per Artist:

print(df['artist_name'].value_counts())

artist_name
miles_davis          118
john_coltrane        103
taylor_swift         100
coldplay              99
chris_stapleton       92
ariana_grande         92
jay_z                 92
bruce_springsteen     87
john_mayer            83
pink_floyd            83
beethoven             75
adele                 74
tupac                 70
bach                  70
dr_dre                69
maggie_rogers         68
led_zeppelin          67
alicia_keys           64
rihanna               63
john_legend           54
jason_aldean          54
andrea_bocelli        52
ac_dc                 49
frank_sinatra         40
Name: count, dtype: int64


In [8]:
# Songs per Artist:

# Group by artist, count unique songs, and sort by count in descending order
group_artist = df.groupby('artist_name')
unique_songs_per_artist = group_artist['audio_filename'].nunique().sort_values(ascending=False)

# Print the result
print(unique_songs_per_artist)

artist_name
taylor_swift         12
ariana_grande        12
chris_stapleton      12
coldplay             12
jay_z                11
john_mayer           10
bruce_springsteen    10
maggie_rogers         9
jason_aldean          8
adele                 8
miles_davis           8
tupac                 8
dr_dre                8
bach                  8
alicia_keys           8
frank_sinatra         7
john_coltrane         7
john_legend           7
beethoven             7
led_zeppelin          7
andrea_bocelli        7
pink_floyd            7
rihanna               7
ac_dc                 6
Name: audio_filename, dtype: int64


In [9]:
import nbformat


In [10]:
# Horizontal bar chart of the number of songs per artist, where the largest bars are on top.

import plotly.express as px

# Group by artist, count unique songs, and sort by count in descending order
grouping_artist = df.groupby('artist_name')
songs_per_artist = grouping_artist['audio_filename'].nunique().reset_index().sort_values(by='audio_filename', ascending=True)

# Creating a horizontal bar chart using Plotly
fig = px.bar(songs_per_artist,
             x='audio_filename', y='artist_name', orientation='h',
             height=650, width=1000,
             title='Number of Songs per Artist',
             labels={'audio_filename': 'Number of Songs', 'artist_name': 'Artist'})

fig.show()

In [11]:
# Total Duration in Minutes per Artist
# New column called track_duration_sec representing the track duration in seconds, and a new column called track_duration_min representing the track duration in minutes.

# Column per second
df['track_duration_sec'] = df['track_length']/22050

# Column per minute
df['track_duration_min'] = df['track_duration_sec']/60

In [12]:
# Using groupby, print the total duration in minutes per artist.

# Group by artist and sum the duration in minutes
df_grouped = df.groupby('artist_name')['track_duration_min'].sum().sort_values(ascending=False)

print(df_grouped)

artist_name
miles_davis          59.0
john_coltrane        51.5
taylor_swift         50.0
coldplay             49.5
ariana_grande        46.0
chris_stapleton      46.0
jay_z                46.0
bruce_springsteen    43.5
john_mayer           41.5
pink_floyd           41.5
beethoven            37.5
adele                37.0
tupac                35.0
bach                 35.0
dr_dre               34.5
maggie_rogers        34.0
led_zeppelin         33.5
alicia_keys          32.0
rihanna              31.5
jason_aldean         27.0
john_legend          27.0
andrea_bocelli       26.0
ac_dc                24.5
frank_sinatra        20.0
Name: track_duration_min, dtype: float64


+ The artist with the longest total duration: `Miles Davis`
+ The artist with the shortest total duration: `Frank Sinatra`

In [13]:
# Create a horizontal bar chart of the total duration in minutes per artist, where the largest bars are on top.

# Sort df_grouped in descending order
df_grouped_sorted = df_grouped.sort_values(ascending=True)

# Creating a horizontal bar chart using Plotly
fig = px.bar(df_grouped_sorted,
             x = df_grouped_sorted.values, y = df_grouped_sorted.index, orientation = 'h',
             height = 650, width = 1000,
             title='Total Duration per Artist',
             labels={'x': 'Duration (Mins)', 'artist_name': 'Artist'})

fig.show()

In [14]:
# Average Tempo per Artist
# Using groupby, print the average tempo per artist.

# Group by artist, count unique songs, and sort by count in descending order
group_artist = df.groupby('artist_name')
average_tempo_per_artist = group_artist['tempo'].mean().sort_values(ascending=False)

# Print the result
print(average_tempo_per_artist)

artist_name
jay_z                131.795568
beethoven            129.735525
john_coltrane        128.196795
adele                127.675907
ariana_grande        127.312845
miles_davis          126.816801
rihanna              126.761981
jason_aldean         126.464488
chris_stapleton      126.166627
coldplay             124.537433
pink_floyd           124.496634
tupac                122.128452
frank_sinatra        121.527763
john_mayer           121.052416
bach                 119.993363
ac_dc                119.782243
taylor_swift         118.584689
led_zeppelin         117.711761
maggie_rogers        117.125269
john_legend          116.857779
bruce_springsteen    116.255265
andrea_bocelli       114.514543
alicia_keys          111.908552
dr_dre               110.993482
Name: tempo, dtype: float64


+ The artist with the highest average tempo: `Jay Z`
+ The artist with the lowest average tempo: `Dr Dre`

In [15]:
# Create a horizontal bar chart of the average tempo per artist, where the largest bars are on top.

# Sort df_grouped in descending order
average_tempo_per_artist_sorted = average_tempo_per_artist.sort_values(ascending=True)

# Creating a horizontal bar chart using Plotly
fig = px.bar(average_tempo_per_artist_sorted,
             x = average_tempo_per_artist_sorted.values, y = average_tempo_per_artist_sorted.index, orientation = 'h',
             height = 650, width = 1000,
             title='Average Tempo per Artist',
             labels={'x': 'Tempo (BPM)', 'artist_name': 'Artist'})

fig.show()

In [16]:
# Print the variance in tempo for each artist
tempo_variance_per_artist = df.groupby('artist_name')['tempo'].var().sort_values(ascending=False)
tempo_variance_per_artist = tempo_variance_per_artist.reset_index()

print(tempo_variance_per_artist)

          artist_name        tempo
0               tupac  1154.445256
1              dr_dre   924.390317
2               jay_z   779.302222
3     chris_stapleton   724.111820
4        led_zeppelin   668.316129
5        taylor_swift   624.064072
6        jason_aldean   615.191101
7   bruce_springsteen   597.348392
8         alicia_keys   555.928764
9             rihanna   512.382646
10      ariana_grande   495.034984
11        john_legend   469.615001
12      john_coltrane   466.104578
13      maggie_rogers   438.469787
14      frank_sinatra   362.290527
15         john_mayer   343.170670
16           coldplay   312.686523
17              adele   299.420981
18              ac_dc   234.394016
19          beethoven   232.235508
20     andrea_bocelli   194.820338
21        miles_davis   167.675505
22         pink_floyd   125.919142
23               bach    65.055989


## Artist Classification


### Train and Test Split

In [17]:
# Selecting target and features

# Target
y = df['artist_name']

# Features
x = df.drop(columns=["artist_name", 'video_id', 'audio_filename', 'track_length', 'track_duration_sec', 'track_duration_min'])

In [18]:
print("X:", x.shape)
print("Y:", y.shape)

X: (1818, 42)
Y: (1818,)


### Data Scaling

In [19]:
x = x.select_dtypes(include='number')
x_scaled = (x - x.mean()) / x.std()
print("SCALED VALUES: MEAN & STDEV:")
print("--------")
print(x_scaled.describe().T[["mean", "std"]])

SCALED VALUES: MEAN & STDEV:
--------
                                 mean  std
track_number             3.517538e-17  1.0
tempo                   -8.403008e-16  1.0
chroma_stft_mean         1.563350e-17  1.0
chroma_stft_var         -2.626429e-15  1.0
rms_mean                -1.719685e-16  1.0
rms_var                  7.816752e-17  1.0
spectral_centroid_mean   2.266858e-16  1.0
spectral_centroid_var    8.598427e-17  1.0
spectral_bandwidth_mean  7.504082e-16  1.0
spectral_bandwidth_var   0.000000e+00  1.0
spectral_rolloff_mean   -1.954188e-16  1.0
spectral_rolloff_var    -2.032355e-16  1.0
zero_crossing_rate_mean -4.690051e-17  1.0
zero_crossing_rate_var  -1.367932e-17  1.0
tonnetz_mean            -2.345026e-17  1.0
tonnetz_var              0.000000e+00  1.0
mfcc_1_mean             -7.035077e-17  1.0
mfcc_1_var              -7.816752e-18  1.0
mfcc_2_mean              2.892198e-16  1.0
mfcc_2_var               1.563350e-17  1.0
mfcc_3_mean             -7.816752e-18  1.0
mfcc_3_var      

### Train Test Split

In [20]:
%pip install scikit-learn



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [21]:
# C) Train/Test Split:
# Perform a train/test split on the target and scaled features. Use 20% of the rows in the test set. Use a random state for reproducibility.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size= 0.2, random_state=99)

print(f"x_train shape: {x_train.shape}")
print(f"x_test shape: {x_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

x_train shape: (1454, 42)
x_test shape: (364, 42)
y_train shape: (1454,)
y_test shape: (364,)


### Model Training

In [22]:
# Using a LogisticRegression model from sklearn, train the model on the training data.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)

In [23]:
from pandas import Series

coefs = model.coef_

# Shape of Coefficients
print(f"Coefficient  shape: {coefs.shape}")

Coefficient  shape: (24, 42)


In [24]:
coef_series = pd.Series(model.coef_[0], index=x_train.columns)

print("Model Coefficients")
print(coef_series.sort_values(ascending=False))

Model Coefficients
mfcc_5_mean                1.572781
mfcc_11_mean               1.200513
mfcc_3_var                 1.132944
mfcc_4_var                 1.071336
mfcc_1_mean                1.016710
zero_crossing_rate_mean    0.648296
spectral_rolloff_mean      0.630266
spectral_centroid_mean     0.520662
tonnetz_mean               0.475521
chroma_stft_mean           0.443427
spectral_bandwidth_var     0.349344
mfcc_6_var                 0.274007
spectral_rolloff_var       0.239327
mfcc_12_mean               0.192956
spectral_bandwidth_mean    0.064816
spectral_centroid_var      0.035511
mfcc_1_var                -0.026653
mfcc_2_var                -0.098279
tempo                     -0.113061
mfcc_6_mean               -0.119610
mfcc_8_mean               -0.119980
mfcc_8_var                -0.176268
rms_mean                  -0.185410
mfcc_10_mean              -0.222570
mfcc_13_var               -0.229519
mfcc_11_var               -0.249361
tonnetz_var               -0.298946
mfcc_4_me

In [25]:
# Choose one of the artists, and using the coeficients, determine the top five most predictive features for that artist.

# Extract coefficients for the chosen artist
artist_coefs = model.coef_[4]

artist_coef_series = pd.Series(artist_coefs, index=x_train.columns)

# Sorting to get the top 5 predictive features
top_5_features_artist = artist_coef_series.sort_values(ascending=False).head(5)

# Get artist's name
chosen_artist = model.classes_[4]

# Display features
print(f"Top Five Most Predictive Features for {chosen_artist}:")
print(top_5_features_artist)

Top Five Most Predictive Features for ariana_grande:
rms_mean                 1.661495
spectral_rolloff_mean    1.332904
mfcc_9_mean              1.246814
mfcc_2_var               1.152729
chroma_stft_mean         1.150108
dtype: float64


The Top Five Most Predictive Features for Ariana Grande are:

*   rms_mean: `1.661495`
*   spectral_rolloff_mean: `1.332904`
*   mfcc_9_mean: `1.246814`
*   mfcc_2_var: `1.152729`
*   chroma_stft_mean: `1.150108`

### Model Evaluation

In [26]:
# E) Model Evaluation: Evaluate the model using a classification report.

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import plotly.figure_factory as ff

# Predictions
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1]

In [27]:
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
                   precision    recall  f1-score   support

            ac_dc       1.00      1.00      1.00        11
            adele       0.58      0.64      0.61        11
      alicia_keys       0.55      0.55      0.55        11
   andrea_bocelli       1.00      0.69      0.82        13
    ariana_grande       0.75      0.67      0.71         9
             bach       0.86      0.92      0.89        13
        beethoven       0.90      1.00      0.95         9
bruce_springsteen       0.52      0.80      0.63        15
  chris_stapleton       0.72      1.00      0.84        18
         coldplay       0.50      0.75      0.60        20
           dr_dre       0.92      0.92      0.92        12
    frank_sinatra       0.80      0.33      0.47        12
     jason_aldean       0.75      0.75      0.75         8
            jay_z       0.86      0.69      0.77        26
    john_coltrane       1.00      0.86      0.92        35
      john_legend       0.50   

Overall Model Performance:

*   Accuracy: The model achieved an overall accuracy of `74%`, meaning it correctly predicted the artist for about 3 out of 4 tracks in the test set.
*  Precision: On average, the model's precision was `76%` (weighted avg),indicating how well the predictions matched the true labels without including false positives.
*   Recall: The model's average recall was `74%`, reflecting its ability to identify tracks for each artist correctly without missing any.
*   F1-score: The weighted average F1-score was `74%`, balancing precision and recall for the entire dataset.

Random Chance Guess: There are 24 artists in the dataset, so a random guess would have a 1/24 chance of being correct, which is about `4.17%` accuracy. The model's `74%` accuracy is far better than random chance.

Highest F1 Scores:

*   Ac_dc: `1.00` (perfect precision and recall for 11 samples).
*   Beethoven: `0.95` (high precision and perfect recall for 9 samples).
*   Dr Dre: `0.92` (excellent balance of precision and recall for 12 samples).
*   John Coltrane: `0.92` (perfect recall and strong precision for 35 samples).

Lowest F1 Scores:
*   John Legend: `0.35` (low recall of 27% and poor precision of 50%).   
*   Frank Sinatra: `0.47` (low recall of 33% despite good precision of 80%).
*   Alicia Keys: `0.55` (moderate precision and recall of 55%).
*   Led Zeppelin: `0.55` (precision and recall of 60% and 50%, respectively).

Summary:
The model performed significantly better than random guessing, with notable accuracy for artists with distinct musical characteristics (e.g., `Ac_dc, Beethoven`). However, performance was weaker for some artists with overlapping features or fewer distinct tracks, such as `Frank Sinatra and John Legend`.

In [28]:
from sklearn.metrics import confusion_matrix
import plotly.express as px

def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Confusion matrix whose i-th row and j-th column
    # ... indicates the number of samples with
    # ... true label being i-th class (ROW)
    # ... and predicted label being j-th class (COLUMN)
    cm = confusion_matrix(y_true, y_pred)

    class_names = sorted(y_test.unique().tolist())

    cm = confusion_matrix(y_test, y_pred, labels=class_names)

    title = title or "Confusion Matrix"
    if subtitle:
        title += f"<br><sup>{subtitle}</sup>"

    fig = px.imshow(cm, x=class_names, y=class_names, height=height,
                    labels={"x": "Predicted", "y": "Actual"},
                    color_continuous_scale="Blues", text_auto=True,
    )
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)

    fig.show()



plot_confusion_matrix(y_test, y_pred, height=900)

#### Top three pairs of artists who were most confused with each other.

*   `4` - Predicted: `Cold Play` Actual: `Frank Sinatra`
*   `3` - Predicted: `Chris Stapleton` Actual: `John Mayer`
*   `3` - Predicted: `Miles Davis` Actual: `John Coltrane`



## Dimensionality Reduction

Performing dimensionality reduction on the audio features.

In [29]:
# A) Use a PCA model from sklearn with two components. Use a random state for reproducibility. 
# Training the model on the scaled data to obtain the embeddings.

from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=99)

pca_embeddings = pca.fit_transform(x_scaled)

print(f"PCA embeddings (2 components):\n{pca_embeddings[:5]}")  # Show first 5 rows for reference

PCA embeddings (2 components):
[[ 2.28569942  0.76033006]
 [ 2.55098238  0.02637963]
 [-1.07916235  0.46552184]
 [ 0.25819892  0.4874113 ]
 [-1.17209878 -0.10913769]]


In [30]:
# Explained variance ratio for each component
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_[0]:.4f}, {pca.explained_variance_ratio_[1]:.4f}")

# Print the sum of explained variance
print(f"Total Explained Variance: {pca.explained_variance_.sum():.4f}")

Explained Variance Ratio: 0.2146, 0.1670
Total Explained Variance: 16.0262


The explained variance ratios indicate how much of the total variance in the dataset is captured by each principal component. In this case:

+ First component explains `21.46%` of the variance
Second component explains `16.70%` of the variance
Together, these two components account for `38.16%` of the total variance
+ For the sum of explained variance for all components: The sum of explained variance `16.03` indicates the total amount of variance retained after performing PCA.

This suggests that while these two components capture a significant portion of the data's variability, there's still a considerable amount of information `(61.84%)` not explained by these components. This implies the dataset has complexity that isn't fully represented by just these two principal components.

### Track Embeddings Plot

In [31]:
# Wrap the embeddings in a pandas.DataFrame, using appropriate column names ("component_1" and "component_2"), and appropriate index values.

pca_df = pd.DataFrame(pca_embeddings, columns=["component_1", "component_2"], index=x_scaled.index)

# Print the first few rows of the resulting DataFrame
print(pca_df.head(10))

   component_1  component_2
0     2.285699     0.760330
1     2.550982     0.026380
2    -1.079162     0.465522
3     0.258199     0.487411
4    -1.172099    -0.109138
5    -0.994269    -0.913397
6    -1.205898    -0.118464
7    -2.317212     0.158391
8    -1.869898    -1.077974
9    -0.177722    -1.181422


In [32]:
# Plot the embeddings using a scatterplot, with "component_1" on the x axis and "component_2" on the y axis. Color on artist label.

# Create the scatter plot using Plotly
fig = px.scatter(pca_df, x='component_1', y='component_2', color=df['artist_name'],
                 title="PCA Embeddings of Audio Features",
                 labels={'component_1': 'Component 1', 'component_2': 'Component 2', 'color': 'Artist Name'},
                 color_continuous_scale='Set2')

# Show the plot
fig.show()

> The plot for `Beethoven` appears clustered on the left side of component_1, contrasting with other artists such as `Andrea Bocelli, Maggie Rogers`, and `Adele`, whose points overlap significantly. This suggests a potential similarity in how their music is recommended or consumed. Notably, there are a few outliers, including `Ariana Grande`.

### Artist Centroids Plot

In [33]:
# E) Calculate the centroids for each artist. The centroid for a given artist is comprised of the "component_1" mean and the "component_2" mean for that artist.

# Adding the 'artist' label to the pca_df
pca_df['artist'] = y_train

# Calculating the centroids for each artist
centroids = pca_df.groupby('artist')[['component_1', 'component_2']].mean()

# Print the centroids
print(centroids)

                   component_1  component_2
artist                                     
ac_dc                -2.650474     2.703622
adele                 0.101009    -0.642272
alicia_keys           0.617628    -0.153177
andrea_bocelli        1.372134    -2.115613
ariana_grande         4.242824     1.510066
bach                 -0.175216    -3.555729
beethoven            -1.796875    -5.714091
bruce_springsteen    -2.742395     0.874943
chris_stapleton      -0.228719    -0.130422
coldplay             -1.555954    -0.157235
dr_dre                3.001062     3.247707
frank_sinatra        -0.766613    -0.158465
jason_aldean         -1.088423     0.222854
jay_z                 0.511442     2.449474
john_coltrane        -2.056012    -0.644847
john_legend           1.112730    -0.185813
john_mayer           -2.056997     0.109605
led_zeppelin         -1.620247     0.212123
maggie_rogers         1.752033    -0.453864
miles_davis           0.635915    -0.380263
pink_floyd           -1.135664  

In [34]:
# F) Plot the centroids using a scatterplot, with "component_1" on the x axis and "component_2" on the y axis. Color on artist label.

# Plot the centroids with artist color
fig = px.scatter(centroids, x='component_1', y='component_2', color=centroids.index,
                 title="Artist Centroids in PCA Space",
                 labels={'component_1': 'PCA Component 1', 'component_2': 'PCA Component 2', 'artist': 'Artist Name'},
                 color_continuous_scale='Viridis')

fig.update_traces(marker=dict(size=24))

fig.show()

+ Artists in Similar Areas: Several artists cluster closely together, suggesting similarities in their data features. For instance, `John Mayer, Led Zeppelin, Jason Aldean`, and F`rank Sinatra` are grouped to the left of component_1, while `Taylor Swift, John Legend, Alicia Keys, and Maggie Rogers` form a cluster near the center-right of component_1, indicating shared characteristics.

+ More Distinct Artists: dr_dre stands out at the far right of component_1, reflecting a unique feature set. Similarly, `Ariana Grande` is separated from other artists, positioned far right but in a different region than `Dr dre`. On the bottom left of component_2, beethoven and bach emphasize their distinctiveness from pop artists. Additionally, `Ac_dc` and the `Jay Z` and `Tupac` pair exhibit their uniqueness in separate regions.

## PCA Tuning

In [35]:
# List to store explained variance and number of components
explained_variance_list = []

for n in range(1, x_scaled.shape[1] + 1):
    # Initialize PCA model with n components
    pca = PCA(n_components=n, random_state=99)

    # Train PCA model on scaled data (all rows of the features)
    pca.fit(x_scaled)

    # Calculate the sum of explained variance for n components
    explained_variance_sum = pca.explained_variance_ratio_.sum()

    # Print the sum of explained variance for n components
    print(f"Number of components: {n}, Sum of explained variance: {explained_variance_sum:.4f}")

    # Store the sum of explained variance and number of components in a list
    explained_variance_list.append({'n_components': n, 'explained_variance_sum': explained_variance_sum})

Number of components: 1, Sum of explained variance: 0.2146
Number of components: 2, Sum of explained variance: 0.3816
Number of components: 3, Sum of explained variance: 0.4773
Number of components: 4, Sum of explained variance: 0.5454
Number of components: 5, Sum of explained variance: 0.5905
Number of components: 6, Sum of explained variance: 0.6281
Number of components: 7, Sum of explained variance: 0.6596
Number of components: 8, Sum of explained variance: 0.6868
Number of components: 9, Sum of explained variance: 0.7127
Number of components: 10, Sum of explained variance: 0.7367
Number of components: 11, Sum of explained variance: 0.7599
Number of components: 12, Sum of explained variance: 0.7812
Number of components: 13, Sum of explained variance: 0.8005
Number of components: 14, Sum of explained variance: 0.8174
Number of components: 15, Sum of explained variance: 0.8326
Number of components: 16, Sum of explained variance: 0.8467
Number of components: 17, Sum of explained varian

In [36]:
explained_variance_df = pd.DataFrame(explained_variance_list)

# Print the DataFrame
print(explained_variance_df)

    n_components  explained_variance_sum
0              1                0.214583
1              2                0.381577
2              3                0.477308
3              4                0.545424
4              5                0.590539
5              6                0.628133
6              7                0.659592
7              8                0.686835
8              9                0.712749
9             10                0.736742
10            11                0.759914
11            12                0.781168
12            13                0.800504
13            14                0.817374
14            15                0.832558
15            16                0.846739
16            17                0.860355
17            18                0.873572
18            19                0.885747
19            20                0.896589
20            21                0.906169
21            22                0.915500
22            23                0.924526
23            24

### Explained Variance Plot

In [37]:
# Plot a line chart of the explained variance on the y axis and the number of components on the x axis
fig = px.line(explained_variance_df, x='n_components', y='explained_variance_sum', markers=True,
              title='Explained Variance vs. Number of Components',
              labels={'n_components': 'Number of Components', 'explained_variance_sum': 'Explained Variance'})

fig.add_hline(y=0.8, line_width=1, line_dash="dot", line_color="red")
fig.add_hline(y=0.9, line_width=1, line_dash="dot", line_color="green")

# Show the plot
fig.show()

### The line chart shows the explained variance on the y-axis and the number of components on the x-axis. As the number of components increases, the explained variance also increases, approaching 100% when all components are included.


*   ***80% Variance:*** To achieve 80% of the variance, `13` components are required.
*   ***90% Variance:*** To achieve 90% of the variance, `21` components are required.

### Scree Plot

In [38]:
import plotly.graph_objects as go

# Get the eigenvalues (explained variance)
eigenvalues = pca.explained_variance_

# Plot the scree plot using Plotly
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, len(eigenvalues) + 1)), y=eigenvalues, mode='lines+markers'))

fig.update_layout(title='Scree Plot', xaxis_title='Number of Components',
    yaxis_title='Eigenvalue', template='plotly_white'
)
fig.add_hline(y=1, line_width=1, line_dash="dot", line_color="blue", )
fig.show()

The scree plot shows that the eigenvalues decrease as the number of components increases. The "elbow" point, where the curve starts to flatten, is at the first component. This indicates that the first component captures the most significant variance, and additional components contribute less significantly to the total variance.

*   The greatest number of components where the eigenvalue is greater than one: `10`
*   The curve shows an elbow bend starting at `component 4` and continuing through component 9`, after which the increase in explained variance becomes significantly smaller.


### Optimal Number of Components

Answer: `21 components`

Based on the explained variance and scree plot analyses, I would select `21` components for this music recommendation dataset. This choice preserves `90%` of the variance, striking a good balance between reducing dimensionality and retaining most of the information. I also considered using  `10` as the scree plot shows eigenvalues greater than `1`.

## Artist Classification (Reduced Feature Embeddings)

In [39]:
# Number of components identified in PCA Tuning 
n_components = 21

# Train PCA model on the scaled data with the identified number of components
pca = PCA(n_components=n_components, random_state=99)
reduced_embeddings = pca.fit_transform(x_scaled)

# Print the sum of explained variance when using this number of components
explained_variance_sum = pca.explained_variance_ratio_.sum()
print(f"Sum of explained variance with {n_components} components: {explained_variance_sum:.4f}")

Sum of explained variance with 21 components: 0.9062


In [40]:
# Store the embeddings in a pandas DataFrame with appropriate column names and index values
column_names = [f'PC{i}' for i in range(1, n_components + 1)]
reduced_embeddings_df = pd.DataFrame(reduced_embeddings, columns=column_names, index=df["artist_name"])

# Add / merge the artist name labels back in for charting later
reduced_embeddings_df['artist_name'] = df['artist_name'].values

In [41]:
# Perform train/test split using the embeddings as the features
X_train, X_test, y_train, y_test = train_test_split(reduced_embeddings_df[column_names], reduced_embeddings_df['artist_name'], test_size=0.2, random_state=99)

In [42]:
# Train a LogisticRegression model using the embeddings training data
model = LogisticRegression(random_state=99)
model.fit(X_train, y_train)

In [43]:
# Model Evaluation: Evaluate the model using a classification report.

# Predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

In [44]:
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
                   precision    recall  f1-score   support

            ac_dc       1.00      1.00      1.00        11
            adele       0.67      0.73      0.70        11
      alicia_keys       0.50      0.45      0.48        11
   andrea_bocelli       0.78      0.54      0.64        13
    ariana_grande       0.56      0.56      0.56         9
             bach       0.71      0.92      0.80        13
        beethoven       0.89      0.89      0.89         9
bruce_springsteen       0.48      0.87      0.62        15
  chris_stapleton       0.68      0.94      0.79        18
         coldplay       0.47      0.45      0.46        20
           dr_dre       0.69      0.92      0.79        12
    frank_sinatra       0.80      0.33      0.47        12
     jason_aldean       0.55      0.75      0.63         8
            jay_z       0.90      0.69      0.78        26
    john_coltrane       0.97      0.80      0.88        35
      john_legend       0.38   

In [45]:
def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Confusion matrix whose i-th row and j-th column
    # ... indicates the number of samples with
    # ... true label being i-th class (ROW)
    # ... and predicted label being j-th class (COLUMN)
    cm = confusion_matrix(y_true, y_pred)

    class_names = sorted(y_test.unique().tolist())

    cm = confusion_matrix(y_test, y_pred, labels=class_names)

    title = title or "Confusion Matrix"
    if subtitle:
        title += f"<br><sup>{subtitle}</sup>"

    fig = px.imshow(cm, x=class_names, y=class_names, height=height,
                    labels={"x": "Predicted", "y": "Actual"},
                    color_continuous_scale="Blues", text_auto=True,
    )
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)

    fig.show()



plot_confusion_matrix(y_test, y_pred, height=900)

+ The reduced model with 21 components achieved an overall accuracy of `70%`, which is significantly lower than initial artist classification.
+ The artists with the highest F1-scores are: `Ac_dc(F1=1.00)`, `beethoven(F1=0.89)`,`John Coltrane(F1=0.88)`.
+ The artists with the lowest F1-scores are: `John Legend(F1=0.32)`, `Cold Play(F1=0.46)`, `Led Zeppelin(F1=O.43)`.

+ The pairs of artists who were most confused with each other are:
    + `5 times` - `Predicted: Bruce Springsteen`, `Actual: Coldplay`
    + `6 times` - `Predicted: Cold Play`, `Actual: Frank Sinatra`

+ The performance metrics, including precision, recall, and F1-score, have generally declined. The classification results from Part 5D indicate that while the dimensionality reduction process effectively reduced the feature space, it likely led to a loss of relevant information compared to the original artist classification in Part 3. This is reflected in the accuracy drop from `74%` to `70%`.