# Moosic Modelling :: Final Iteration

* MOOSIC: a mood-based music recommendation system








---
## MODEL SKETCH : baseline model 

---

### Baseline content-based music track recommender by mood category

* suggests music based on the user's interests or on the user's mood? (open question)
* insights and filters based on the feature variables in our data
* candidate metric: cosine similarity to measure the similarity of tracks, genres, etc.
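
As a rough illustration of the cosine similarity metric mentioned above, the sketch below compares a few tracks by their audio features. The three feature vectors and the helper name `cosine_similarity_matrix` are made up for this example and are not part of the Moosic data or code.

```python
import numpy as np

def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Clip the norms so that an all-zero row does not cause a division by zero.
    unit_rows = features / np.clip(norms, 1e-12, None)
    return unit_rows @ unit_rows.T

# Hypothetical (valence, energy, danceability) vectors for three tracks.
tracks = np.array([
    [0.90, 0.80, 0.70],  # upbeat track
    [0.85, 0.75, 0.65],  # another upbeat track
    [0.10, 0.90, 0.20],  # low-valence, high-energy track
])

similarity = cosine_similarity_matrix(tracks)

# Most similar track to track 0, ignoring track 0 itself.
scores = similarity[0].copy()
scores[0] = -np.inf
most_similar_to_first = int(np.argmax(scores))
print(most_similar_to_first)
```

Cosine similarity only compares the direction of the feature vectors, so features on very different scales (for example tempo next to valence) would normally be scaled first.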

<br>

### Algorithms and Options

* K-means clustering algorithm (unsupervised), using the mini-batch variant

* t-SNE for dimensionality reduction and visualisation based on our mood labels

* cluster similarity modelling based on the 1-D mood indicator, valence (V), and on the 2-D mood indicators, valence and energy

* baseline focus: content-based recommender system based on a user input query 
    - if-else construct based on mood clusters
    - output a playlist with N = 5 randomized music track recommendations based on the query
    - mood choices: the user's current mood cluster and their preferred target mood for the playlist
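
A minimal sketch of the if-else idea on the 2-D mood indicators. The ranges are copied from the commented-out `mood_clusters` ranges that appear later in this notebook. Reading the first tuple as valence and the second as energy is an assumption, and `mood_from_affect` is a hypothetical helper name.

```python
# (valence_range, energy_range) per mood; the valence/energy order is assumed.
MOOD_RANGES = {
    "happy":     ((0.5, 1.0), (0.5, 0.75)),
    "euphoric":  ((0.5, 1.0), (0.75, 1.0)),
    "tense":     ((0.0, 0.5), (0.75, 1.0)),
    "angry":     ((0.0, 0.5), (0.5, 0.75)),
    "depressed": ((0.0, 0.5), (0.25, 0.5)),
    "sad":       ((0.0, 0.5), (0.0, 0.25)),
    "calm":      ((0.5, 1.0), (0.0, 0.25)),
    "relaxed":   ((0.5, 1.0), (0.25, 0.5)),
}

def _in_range(value: float, low: float, high: float) -> bool:
    # Half-open interval [low, high); the top bin also includes 1.0.
    return low <= value < high or (high == 1.0 and value == 1.0)

def mood_from_affect(valence: float, energy: float) -> str:
    """Return the mood whose valence and energy ranges contain the inputs."""
    for mood, (valence_range, energy_range) in MOOD_RANGES.items():
        if _in_range(valence, *valence_range) and _in_range(energy, *energy_range):
            return mood
    raise ValueError("valence and energy must both lie in [0, 1]")

print(mood_from_affect(0.9, 0.9))
print(mood_from_affect(0.2, 0.1))
```

A lookup table like this keeps the eight branches in one place, which is easier to adjust than a chain of nested if-else statements.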


<br>


---






---
## MODEL SKETCH : main model v1

---

### Content-based music track recommender by mood category

* suggests music based on the user's interests or on the user's mood? (open question)
* insights and filters based on the feature variables in our data
* track-to-track similarity computation based on a metric: cosine similarity between tracks

<br>

### Algorithms and Options

* K-means clustering algorithm (unsupervised), using mini-batch k-means

* t-SNE for dimensionality reduction and visualisation based on our mood labels

* similarity modelling based on the 2-D mood (affect) indicators, valence (V) and energy (E)
    - cluster the music tracks and measure their similarity
    - then derive the mood of each cluster from the average valence of that cluster

* categorical mood labels are label encoded (which may introduce an ordering bias)
  - next step: use the get_dummies method to one-hot encode the mood categories

* modelling : clustering + classifying + predicting/ recommending
  - clustering : mini-batch k-means
  - text vectorization with TfidfVectorizer on our textual data (user preferences and mood targets )
  - track to track similarity computation
  - recommend random N = 5 (or top N = 5) music track based on user preferences

* content (track-to-track) recommendation system based on the user's mood goal and genre choice
  - track-to-track similarity matrix computed with a linear kernel
  - query data based on the user's preferences and the clustered data
    - new feature that combines the user preferences (mood goal and genre): TF-IDF text vectorizer, otherwise the mood vector only
    - vectorize the strings, then compute and store the similarity scores of the tracks
  - track_similarity_matrix: track-to-track similarity based on the associated mood and genre
  - assign the computed track similarity scores back to our data
  - reset and drop the index, then concatenate with the clustered data for the next step
  - classify tracks based on their similarity score and predict N = 5 random samples





* main focus: content-based recommender system based on a user input query 
    - if-else construct based on mood clusters
    - output a playlist with N = 5 randomized music track recommendations based on the query
    - mood choices: the user's current mood cluster and their preferred target mood for the playlist


* options (for main): 
    - svm 
    - classification/prediction of track to track to user preference scores
    - genetic algorithm for feature selection for an ML model : search and optimization in large solution space
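
The TF-IDF and linear-kernel steps listed above could look roughly like the sketch below. The four toy tracks, the `preference_text` column and the `top_n_similar` helper are assumptions made for illustration; the real notebook works on the clustered Moosic data instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy data standing in for the real track table.
tracks = pd.DataFrame({
    "track_name": ["Track A", "Track B", "Track C", "Track D"],
    "mood_goal": ["happy", "happy", "sad", "calm"],
    "core_genres": ["pop", "pop", "blues", "jazz"],
})

# New feature combining mood goal and genre as one text string per track.
tracks["preference_text"] = tracks["mood_goal"] + " " + tracks["core_genres"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(tracks["preference_text"])

# TF-IDF rows are L2-normalised by default, so the linear kernel
# (a plain dot product) equals the cosine similarity here.
track_similarity = linear_kernel(tfidf, tfidf)

def top_n_similar(query: str, n: int = 2) -> pd.DataFrame:
    """Rank tracks by similarity to a 'mood genre' query string."""
    query_vector = vectorizer.transform([query])
    scores = linear_kernel(query_vector, tfidf).ravel()
    top_idx = scores.argsort()[::-1][:n]
    return tracks.iloc[top_idx].assign(score=scores[top_idx])

playlist = top_n_similar("happy pop", n=2)
print(playlist[["track_name", "score"]])
```

The default tokenizer only keeps tokens of two or more word characters, so genre names such as "r&b" or "hip hop" would need a custom token pattern or a cleaned-up label before vectorizing.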

<br>


---




# Flow chart of thought process (v3)


![Flow chart for model idea iteration v3](/images/moosic_process_current_workflow.jpeg)






## Importing required libraries




In [None]:


# IMPORT LIBRARIES


try:

    import numpy as np
    import pandas as pd
    import random as rnd
    #from tqdm.notebook import tqdm as tqdm
    from tqdm import tqdm 
    #from .autonotebook import tqdm as notebook_tqdm
    import time

    # databases - sql
    #from dotenv import dotenv_values
    #import sqlalchemy

    # visualisation
    import seaborn as sns
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap

    # split data - avoid data leakage
    from sklearn.model_selection import train_test_split


    # cross validation, hyperparameter tuning
    from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, KFold
    from sklearn.model_selection import cross_val_score

    # preprocessing, scaling
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler

    # modelling - clustering
    from sklearn.cluster import KMeans, MiniBatchKMeans, MeanShift, DBSCAN
    from kmodes.kmodes import KModes
    from kmodes.kprototypes import KPrototypes

    # text converter/ vectorizer
    from sklearn.feature_extraction.text  import TfidfVectorizer

    # modelling - classification
    from xgboost import XGBClassifier
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.svm import SVC
    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import MultinomialNB, GaussianNB
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
    from sklearn.ensemble import VotingClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression

    # high dimensional usage - dimensionality reduction
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA
    from umap import UMAP

    # metrics
    from sklearn import metrics
    from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
    from sklearn.metrics import classification_report, confusion_matrix, roc_curve
    from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
    from sklearn.metrics import adjusted_rand_score, silhouette_score, v_measure_score, ndcg_score, precision_score, \
        recall_score, f1_score, average_precision_score

    # pipeline
    from sklearn.pipeline import Pipeline
    from sklearn.pipeline import make_pipeline


except ImportError as error:
    print(f"Installation of the required dependencies necessary! {error}")

    %pip install numpy
    %pip install pandas
    #%pip install dotenv
    #%pip install sqlalchemy
    %pip install seaborn
    %pip install matplotlib
    %pip install scikit-learn
    %pip install xgboost
    %pip install kmodes
    %pip install umap-learn
    %pip install tqdm
    %pip install ipywidgets
    print("Dependencies installed: re-run this cell so that all imports complete.")


import warnings
warnings.filterwarnings('ignore')





# color scheme

- custom_palette = { violet: #2B2960, blue: #00A1D8, orange: #F08144, yellow: #FDC20C, green: #29A744, eggshell: #FFF4D5}


- custom_palette =[#2B2960, #00A1D8, #F08144, #FDC20C, #29A744, #FFF4D5]



In [None]:
# setting color scheme for plots


hex_colors = ['#2B2960', '#00A1D8', '#F08144', '#FDC20C', '#29A744', '#22E87E', '#FFF4D5', '#E3CFBF']

# sns.set_palette returns None, so keep the palette object separately
custom_palette = sns.color_palette(hex_colors)
sns.set_palette(custom_palette)

# pandas plot: colormap = custom_cmap_hex, plt/sns plot : cmap = custom_cmap_hex
custom_cmap_hex1 = ListedColormap(sns.color_palette(hex_colors).as_hex())

custom_cmap_hex1 

# pandas plot: colormap = custom_cmap_hex, plt/sns plot : cmap = custom_cmap_hex


In [None]:


# euphoric = #FFF2CC, happy = #FFD966, tense = #E99999, angry = #DD7E6B, depressed = #A2C4C9, sad = #76A5AF, relaxed = #B5D7A8, calm = #D9EAD3,
hex_colors_quad = ['#ecd07a', '#FFD966', '#E99999', '#DD7E6B', '#A2C4C9', '#76A5AF', '#B5D7A8', '#D9EAD3']
custom_cmap_quad = ListedColormap(sns.color_palette(hex_colors_quad).as_hex())

custom_cmap_quad 



## Loading the data


In [None]:
# load the modelling data file for moosic with all music track samples

moosic_data_all = pd.read_csv('../../data/processed/moosic_data_processed.csv', low_memory=False)

# get shape 

print(f"Music data: There are {moosic_data_all.shape[0]} observations and {moosic_data_all.shape[1]} feature variables ")
print('----------'*10)

moosic_data_all.head(2)

In [None]:
# count of moods - all tracks

moosic_data_all['mood_goal'].value_counts()



In [None]:
# the dataset is unbalanced with respect to the associated mood
#    cap every mood category at the same number of samples
# sample size = 15000 or 20000

def get_balanced_data(processed_dataset, sample_size = 20000):

    ''' 
    Return at most sample_size randomly sampled rows per mood_goal category.

    Categories with fewer than sample_size rows are kept whole, so the result
    is only fully balanced when every category has at least sample_size rows.
    '''

    sampled_groups = []

    for mood_label, group in processed_dataset.groupby('mood_goal'):

        if len(group) >= sample_size: 
            sampled_groups.append(group.sample(sample_size, random_state=42))
        else:
            sampled_groups.append(group)

    # concatenate once instead of growing a DataFrame inside the loop
    sampled_moosic_data = pd.concat(sampled_groups).reset_index(drop=True)

    print(f"Finished sampling: at most {sample_size} samples per mood category.")

    mood_label_counts = sampled_moosic_data['mood_goal'].value_counts()
    print(f"Samples per mood label:\n{mood_label_counts}")
    print("______"*10)

    return sampled_moosic_data






In [None]:
## display balanced moosic : mood-music data

moosic_data_samples = get_balanced_data(moosic_data_all, sample_size = 15000)
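# one-hot encode core_genres (the original column is kept) and drop the other genre columns
# note: drop_first=True omits the first genre column in sorted order
genre_dummies = pd.get_dummies(moosic_data_samples['core_genres'], drop_first=True, dtype=int)
moosic_data_samples = pd.concat([moosic_data_samples.drop(['main_genres','core_genres','genres'], axis = 1),
        moosic_data_samples['core_genres'], genre_dummies], axis=1)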


print(f"Music data: There are {moosic_data_samples.shape[0]} observations and {moosic_data_samples.shape[1]} feature variables ")
print('----------'*10)

moosic_data_samples.head()


---

# Splitting the dataset for modelling : 

* train, test split : splitting the data to avoid data leakage 
* drop columns = ['artist_id', 'track_id' , 'artist_name',  'track_name', 'core_genres']
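
A small sketch of the stratified 80/20 split described above, on a made-up 20-row frame. Only `train_test_split` and its arguments mirror the notebook; the toy data is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 "happy" and 10 "sad" tracks with a single feature.
toy = pd.DataFrame({
    "valence": [i / 20 for i in range(20)],
    "mood_goal": ["happy"] * 10 + ["sad"] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["valence"]],
    toy["mood_goal"],
    test_size=0.20,
    random_state=42,
    shuffle=True,
    stratify=toy["mood_goal"],  # keep the mood proportions in both splits
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Stratifying on the mood column keeps every mood represented in the test set, and splitting before any scaling or encoding is fitted is what guards against leakage.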




In [None]:

# all features
all_features = ['artists_id' , 'track_id', 'artist_name', 'track_name', 'genres', 'danceability','energy', 'key', 'speechiness', 'acousticness', 
                        'release_date', 'explicit', 'loudness', 'mode', 'time_signature', 'followers', 'instrumentalness',
                        'artist_popularity',  'track_popularity', 'tempo', 'main_genres', 'core_genres', 'mood_goal', 'blues', 'classical', 'country', 
                        'disco', 'dubstep', 'edm','electronic', 'folk', 'funk', 'gospel', 'hip hop', 'house', 'indie rock', 'jazz', 'metal', 'other', 
                        'pop', 'punk rock', 'r&b', 'reggae', 'rock','rockabilly','soul', 'techno']





In [None]:

# function to split dataset for modelling 
# splitting dataset and also retain specific features to be outputted

def split_dataset(dataset, target_feature = 'mood_goal', input_features = ['track_id', 'valence','energy', 'mood_goal'],  
                        drop_features = ['track_id'], output_features = ['track_id', 'mood_goal']): 

    # output_features must be a subset of input_features and must contain
    # target_feature, because Y and y are both taken from the X splits below

    # defining X (input feature vectors) and y (target feature)
    X_data = dataset[input_features]
    y_data = dataset[target_feature]

    # splitting the dataset into train and test, stratified on the target
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, random_state=42, shuffle=True, stratify=y_data)

    # Y: the identifying columns that the recommender outputs at the end
    Y_train = X_train[output_features].reset_index(drop=True)
    Y_test = X_test[output_features].reset_index(drop=True)

    # X: the model inputs, without the dropped columns
    X_train = X_train[input_features].drop(drop_features, axis=1).reset_index(drop=True)
    X_test =  X_test[input_features].drop(drop_features, axis=1).reset_index(drop=True)

    # y: the target column, aligned with the reset index of X and Y
    y_train = Y_train[target_feature].reset_index(drop=True)
    y_test = Y_test[target_feature].reset_index(drop=True)

    data  = {
        'X_data':X_data,
        'y_data':y_data,
        'X_train':X_train,
        'X_test':X_test,
        'y_train':y_train,
        'y_test':y_test,        
        'Y_train':Y_train,
        'Y_test':Y_test,
        }

    return data




In [None]:

#sample_size = 20000

mood_categories = ['angry', 'calm', 'depressed', 'euphoric', 'happy', 'relaxed', 'sad', 'tense']

# baseline_features = moosic_data.columns.tolist()

# all features
all_features = ['artists_id' , 'track_id', 'artist_name', 'track_name', 'genres', 'danceability','energy', 'key', 'speechiness', 'acousticness', 
                        'release_date', 'explicit', 'loudness', 'mode', 'time_signature', 'followers', 'instrumentalness',
                        'artist_popularity',  'track_popularity', 'tempo', 'main_genres', 'core_genres', 'mood_goal', 'blues', 'classical', 'country', 
                        'disco', 'dubstep', 'edm','electronic', 'folk', 'funk', 'gospel', 'hip hop', 'house', 'indie rock', 'jazz', 'metal', 'other', 
                        'pop', 'punk rock', 'r&b', 'reggae', 'rock','rockabilly','soul', 'techno']


#features to be dropped
drop_features = ['track_id', 'track_name', 'artist_name',  'core_genres', 'mood_goal', 'release_date']


# feature columns
baseline_input_features = ['track_id', 'track_name', 'artist_name',  'core_genres', 'mood_goal','danceability', 'valence', 'tempo','energy', 
                        'release_date', 'key', 'speechiness', 'acousticness', 'instrumentalness',  'loudness', 'mode', 'time_signature',
                        'blues', 'classical', 'country', 'disco', 'dubstep', 'edm','electronic', 'folk', 'funk', 'gospel', 'hip hop', 'house', 'indie rock', 
                        'jazz', 'metal', 'other', 'pop', 'punk rock', 'r&b', 'reggae', 'rock','rockabilly','soul', 'techno']


# features to be output by the recommender at the end
output_features = ['track_id', 'track_name', 'artist_name',  'core_genres', 'mood_goal']


# target column
target_feature = 'mood_goal'



In [None]:
# get data splits
# n=15



moosic_data_nk = split_dataset(moosic_data_samples, target_feature = target_feature,
                        input_features = baseline_input_features,  output_features = output_features,
                            drop_features=drop_features)




In [None]:
# X features (input):  X_train, X_test
# Y features (output):  Y_train, Y_test
# y feature (target):  y_train, y_test


print("X train (input features) ")
print("-----"*10)
X_train = moosic_data_nk['X_train']
display(X_train.head(2))

print("X test (input features) ")
print("-----"*10)
X_test = moosic_data_nk['X_test']
display(X_test.head(2))


print("Y train (target output features) ")
print("-----"*10)
Y_train = moosic_data_nk['Y_train']
display(Y_train.head(2))


print("Y test (target output features) ")
print("-----"*10)
Y_test = moosic_data_nk['Y_test']
display(Y_test.head(2))


print("y train (target feature) ")
print("-----"*10)
y_train = moosic_data_nk['y_train']
display(y_train.head(2))


print("y test (target feature) ")
print("-----"*10)
y_test = moosic_data_nk['y_test']
display(y_test.head(2))


---
# Modelling

* elbow method
* dimensionality reduction / visualization of clusters in a lower (2-D) dimension
* clustering
* (optional) visualization of the clusters in a lower (2-D) dimension
* prediction/classification models (multi-class, possibly multi-label)
* recommender: top N = 5 based on item-item similarity computed on the clustered data
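
To make the elbow method from the list above concrete, here is a sketch on synthetic data. The three Gaussian blobs are an assumption for illustration; with the real track features the curve is usually much less clear-cut.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic data: three well-separated 2-D blobs of 100 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

# WCSS (inertia) for each candidate number of clusters.
wcss = {}
for k in range(1, 7):
    model = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=42)
    model.fit(points)
    wcss[k] = model.inertia_

# The elbow is the k after which extra clusters barely reduce the WCSS.
for k, inertia in wcss.items():
    print(f"k={k}: WCSS={inertia:.1f}")
```

For this toy data the WCSS drops sharply up to k = 3 and flattens afterwards, which is the pattern to look for in the plots below.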


---

In [None]:

def optimal_cluster_plot(data, tsne=True):

    wcss_inertias = []
    k_values = range(4, 20)

    # optionally project the data to 2-D with t-SNE before clustering
    if tsne:
        data = TSNE(n_components=2, random_state=42).fit_transform(data)

    # fit one mini-batch k-means model per candidate k and record its inertia
    for k in k_values:
        model = MiniBatchKMeans(n_clusters=k, random_state=42)
        model.fit(data)
        wcss_inertias.append(model.inertia_)
        
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.plot(k_values, wcss_inertias, '-o')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS : Within clusters sum of squares')
    plt.title('Elbow method to determine optimal K number of clusters')
    plt.xticks(k_values)
    plt.show()








In [None]:

# optimal number of clusters using elbow method analysis to compare with the researched  basic number of moods for music (8)

optimal_cluster_plot(X_train, tsne=True)



In [None]:
data = X_train[['valence',  'energy']]
optimal_cluster_plot(data, tsne=True)

In [None]:
# without tsne

optimal_cluster_plot(X_train, tsne=False)

In [None]:

# function clustering model (+ tsne) 


def clustering_model(data,  pca = True, tsne=True, params = {'n_clusters': 8}, sample_size=5000, *args, **kwargs):


    data = data.head(sample_size)
    scaled_data = MinMaxScaler().fit_transform(data)
    data_col = data.columns.to_list()
    data = pd.DataFrame(scaled_data, columns = data_col)

    if pca == True:
        pca = PCA(n_components=2, random_state = 42)

        pca_start_time = time.time()

        data = MinMaxScaler().fit_transform(pca.fit_transform(data))

        pca_end_time = time.time()
        pca_train_time = pca_end_time - pca_start_time
        print(f"Time taken for dimensionality reduction using PCA: {pca_train_time:.2f} seconds")

        data = pd.DataFrame(data, columns = ['dimension_0', 'dimension_1'])

    else:
        data = data


    model = MiniBatchKMeans(**params)

    kme_start_time = time.time()

    data['cluster_labels'] = model.fit_predict(data)

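    kme_end_time = time.time()
    kme_train_time = kme_end_time - kme_start_time
    print(f"Time taken for mini-batch k-means clustering: {kme_train_time:.2f} seconds")

    cluster_labels = model.labels_ 
    cluster_centers = model.cluster_centers_ 

    # t-SNE is always run here; the tsne argument is currently not used
    # note: the embedding input includes the cluster_labels column added above
    tsne_model = TSNE(n_components=2, random_state = 42)

    tsne_start_time = time.time()

    tsne_embeddings = tsne_model.fit_transform(data)

    tsne_end_time = time.time()
    tsne_train_time = tsne_end_time - tsne_start_time
    print(f"Time taken for the t-SNE embedding: {tsne_train_time:.2f} seconds")
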

    
    # kp_model_name = "minikmeans.pickle"
    # tp_model_name = "tsne_embed.pickle"

    # pickle.dump(model, open(kp_model_name, 'wb'))
    # pickle.dump(model, open(tp_model_name, 'wb'))


    # kj_model_name = "minikmeans.joblib"
    # tj_model_name = "tsne_embed.joblib"

    # joblib.dump(model, kj_model_name)
    # joblib.dump(model, tj_model_name)



    return data, tsne_embeddings, cluster_labels, cluster_centers






In [None]:


# defining the actual y (target data - mood_goal)
# y : y_train, y_test


# encode categorical data : mood
y_train_dummies = pd.get_dummies(y_train, drop_first=True).replace({True: 1, False: 0})
encoded_y_train = pd.concat([y_train, y_train_dummies], axis=1)

y_test_dummies = pd.get_dummies(y_test, drop_first=True).replace({True: 1, False: 0})
encoded_y_test = pd.concat([y_test, y_test_dummies], axis=1)


print("encoded target (train) data - mood_goal ")
print("-----"*10)
display(encoded_y_train.head(2))


print("encoded target (test) data - mood_goal ")
print("-----"*10)
display(encoded_y_test.head(2))


#mood_labels = ["depressed", "sad", "anxious",  "neutral", "calm", "euphoric", "energetic", "happy"]
#mood_1d_class = [0, 1, 2, 3, 4, 5, 6, 7]


# add targets for evaluation for clusters later

#moosic_data["mood_label"] = (pd.cut(moosic_data["valence"], bins=mood_valence_values, labels=mood_1d_labels)).astype('string')
#moosic_data["mood_class"] = (pd.cut(moosic_data["valence"], bins=mood_valence_values, labels=mood_1d_class)).astype('Int64')

#moosic_data.head(2)
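
The difference between label encoding and one-hot encoding of the mood categories can be seen on a tiny made-up series. The four mood values below are an assumption chosen for illustration.

```python
import pandas as pd

moods = pd.Series(["angry", "calm", "sad", "calm"], name="mood_goal")

# Label encoding: one integer per category, which implies an order
# (angry < calm < sad) that has no meaning for moods.
label_encoded = moods.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(moods, dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Passing `drop_first=True`, as the cell above does, removes the first column; that avoids redundant columns for linear models but means one mood is represented by all zeros.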


In [None]:
# checking clustered data features

clustered_data, clustered_embeddings, cluster_labels, cluster_centers = clustering_model(X_train,  pca = False, tsne = True, 
                                                                        params = {'n_clusters': 8}, sample_size = 5000)

clustered_data.head(2)

In [None]:
# baseline: recommendation based on predictions of music track (mood-genre groups) and recommendation
# - mood labels are labeled encoded (but may introduce ordering bias)
## * --> np.unique(y_train['mood_2d_label'].copy()) = ['angry' 'calm' 'depressed' 'euphoric' 'happy' 'relaxed' 'sad' 'tense']
## * --> np.unique(LabelEncoder().fit_transform(y_train[['mood_2d_label']]).copy()) = array([0, 1, 2, 3, 4, 5, 6, 7])
## * --> np.concatenate([b, c]).tolist() =  ['angry','calm','depressed','euphoric','happy','relaxed','sad','tense',0,1,2,3,4,5,6,7]
#mood_list_types = ['angry','calm','depressed','euphoric','happy','relaxed','sad','tense', 0, 1, 2, 3, 4, 5, 6, 7]


baseline_cluster_params = {
    'n_clusters' : 8,
    'batch_size' : 500,
    'random_state' : 42,
    'init' : 'k-means++' #random
}


mood_clusters  = {
                        'happy' : 4,
                        'euphoric' : 3,                  
                        'tense' : 7, 
                        'angry' :0, 
                        'depressed' : 2, 
                        'sad' :6, 
                        'calm' : 1, 
                        'relaxed' : 5

    }


def baseline_clustering(x_train,  mood_clusters = mood_clusters, cluster_params = baseline_cluster_params,
                        pca=False, tsne=True, sample_size = 10000,  *args, **kwargs):


    # clustering with mini-batch kmeans 
    # note: sample_size is not passed on, so clustering_model uses its own default of 5000 rows
    clustered_data, clustered_embeddings, cluster_labels, cluster_centers = clustering_model(x_train,  pca = pca, tsne = tsne, 
                                                                        params = cluster_params)

    #feature matrix of one-hot encoded cluster representation
    feature_matrix = np.eye(cluster_params['n_clusters'])[cluster_labels]

    #track_similarity_matrix = 1 - pairwise_distances(clustered_data, cluster_centers, metric='cosine')



    #print("________"*20)
    #print("________"*20)

    # visualize the clustered data
    fig, ax = plt.subplots(figsize = (16, 10))
    # fig, ax = plt.subplots()
    # fig, ax = plt.figure()

    scaled_embed = MinMaxScaler().fit_transform(clustered_embeddings)
    plt.scatter(scaled_embed[:, 0], scaled_embed[:, 1], c=cluster_labels, cmap=custom_cmap_quad)
    #plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker='X', color='black') 

    # for cluster_centerx, cluster_centery in zip(cluster_centers[:, 0], cluster_centers[:, 1]):

    #    plt.scatter(cluster_centerx, cluster_centery, marker='X', color='black') 


    plt.title('Music - Mood Clustered Data ', pad=15, fontsize = 20, weight = 'bold', color='#928d8d')#'#C3C7C5')
    plt.colorbar()

    # hide the tick marks and both axes
    plt.xticks([]) 
    plt.yticks([]) 
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # Set the axis colors
    ax.grid(color='#cfcecb')
    ax.set_facecolor('#eeeeee')
    ax.spines['bottom'].set_color('#eeeeee')
    ax.spines['top'].set_color('#eeeeee')
    ax.spines['right'].set_color('#eeeeee')
    ax.spines['left'].set_color('#eeeeee')
    ax.xaxis.label.set_color('#eeeeee')
    ax.yaxis.label.set_color('#eeeeee')

    # save before showing: in a notebook plt.show() closes the figure,
    # so a savefig call placed after it writes an empty image
    plt.savefig('../images/1_clusters_count_plot.png', transparent=True)
    plt.show()

    return clustered_data, clustered_embeddings, scaled_embed




In [None]:
np.unique(Y_train['mood_goal'].copy())


In [None]:
# mood_clusters  = {
#                         'happy' : 4,
#                         'euphoric' : 3,                  
#                         'tense' : 7, 
#                         'angry' :0, 
#                         'depressed' : 2, 
#                         'sad' :6, 
#                         'calm' : 1, 
#                         'relaxed' : 5

#     }


mood_clusters  = {
                'relaxed' : 0,
                'happy' : 1,
                'euphoric' : 2,
                'depressed' : 3, 
                'sad' :4,  
                'calm' : 5,                                          
                'tense' : 6, 
                'angry' :7
    }

#mood_list_types = list(mood_clusters.items())
mood_list_types = [item for sublist in mood_clusters.items() for item in sublist]

mood_list_types

In [None]:
# mood_labels = ["relaxed", "happy", "euphoric", "depressed", "sad",  "calm", "tense",  "angry", ]
# mood_cat_values = [0, 1, 2, 3, 4, 5, 6, 7]


# # add targets for evaluation for clusters later

# moosic_data["mood_label"] = (pd.cut(moosic_data["mood_labels"], bins=mood_valence_values, labels=mood_1d_labels)).astype('string')
# moosic_data["mood_class"] = (pd.cut(moosic_data["valence"], bins=mood_valence_values, labels=mood_1d_class)).astype('Int64')


In [None]:
# run baseline
# - run clustering model : kmeans/ mini-batch kmeans
# - dimensionality reduction and visualization with tsne
# - query clustered data and output N = 5 recommendations
# - final data shows music track name and its associated, artist, mood goal/group, core genre, and its K clustered labels.

# mood_clusters  = {
#                         'happy' : [(0.5, 1.0), (0.5, 0.75)],
#                         'euphoric' : [(0.5, 1.0), (0.75, 1.0)],                  
#                         'tense' : [(0.0, 0.5), (0.75, 1.0)], 
#                         'angry' :[(0.0, 0.5), (0.5, 0.75)], 
#                         'depressed' : [(0.0, 0.5), (0.25, 0.5)], 
#                         'sad' :[(0.0, 0.5), (0.0, 0.25)], 
#                         'calm' : [(0.5, 1.0), (0.0, 0.25)], 
#                         'relaxed' : [(0.5, 1.0), (0.25, 0.5)]

#     }


mood_clusters  = {
                        'happy' : 4,
                        'euphoric' : 3,                  
                        'tense' : 7, 
                        'angry' :0, 
                        'depressed' : 2, 
                        'sad' :6, 
                        'calm' : 1, 
                        'relaxed' : 5

    }


clustered_data, clustered_embeddings, scaled_embeddings = baseline_clustering(X_train,  mood_clusters = mood_clusters, cluster_params = baseline_cluster_params,
                        pca=False, tsne=True, sample_size = 15000)




In [None]:

# recommend top N = 5 or random N = 5 music track based on user preferences
def baseline_recommender(clustered_data, y_train, clustered_embeddings, scaled_embed, user_preferences = {'mood_goal': 'relaxed', 'preferred_genre': 'hip hop'}, 
                        playlist_length= 5, mood_clusters = mood_clusters, *args, **kwargs):


    recommender_start_time = time.time()

    # recommender: query data based on user's preferences and clustered data


    y_train_classes = np.unique(y_train['mood_goal'])
    # LabelEncoder expects a 1-D array, so pass the column as a Series
    y_train_classes_encoded = LabelEncoder().fit_transform(y_train['mood_goal'])
    mood_list_types = np.concatenate([y_train_classes, np.unique(y_train_classes_encoded)]).tolist()
    y_train['mood_label'] = y_train_classes_encoded

    #y_train['mood_label'] = y_train.replace({'relaxed' : 0, 'happy' : 1,'euphoric' : 2, 'depressed' : 3, 
    #           'sad' :4,  'calm' : 5,  'tense' : 6, 'angry' :7})

    #mood_list_types = [item for sublist in mood_clusters.items() for item in sublist]

    final_data = pd.concat([ y_train[['track_id', 'track_name', 'artist_name', 'mood_goal', 'core_genres', 'mood_label' ]], clustered_data['cluster_labels']], axis=1)

    # only moods that have a cluster assigned in mood_clusters can be queried
    if user_preferences['mood_goal'] in mood_clusters:
        choice = mood_clusters[user_preferences['mood_goal']]
        query_data = final_data.query("cluster_labels == @choice") 

    else:
        raise ValueError(f"Mood goal '{user_preferences['mood_goal']}' is unavailable; choose one of {list(mood_clusters)}.")

    #print("________"*20)
    #print("________"*20)


    print(f"Recommended music tracks based on {user_preferences['mood_goal']}: \n ")
    print("________"*15)

    #print(f" Enjoy these {playlist_length} music tracks from spotify")
    #print("             "*10)

    #recommended_moosic_playlist = query_data.sample(n=playlist_length, random_state = 42, replace=False)

    #print("________"*10)
    moosic_randomN_idx = np.random.choice(
                            query_data.index,
                            size = playlist_length,
                            replace= False # random N tracks without repeats
                            )
    
    # the sampled values are index labels, so select with .loc
    recommended_moosic_playlist = final_data.loc[moosic_randomN_idx]


    recommender_end_time = time.time()
    recommender_train_time = recommender_end_time - recommender_start_time
    print(f"Time taken for the recommender model : {recommender_train_time:.2f} seconds")

    #print("________"*20)
    #print("________"*20)

    return final_data, recommended_moosic_playlist






In [None]:


# baseline recommender
final_data, n_mood_music = baseline_recommender(clustered_data, Y_train, clustered_embeddings, scaled_embeddings, user_preferences = {'mood_goal': 'relaxed', 'preferred_genre': 'pop'}, 
                        playlist_length= 5, mood_clusters = mood_clusters)


# display recommendations
n_mood_music

In [None]:
final_data['mood_goal'].value_counts()

In [None]:

#groups = final_data.groupby('cluster_labels')[['mood_label', 'cluster_labels']].value_counts()

#groups.to_dict()

##


* tracks count grouped by cluster labels

``````
{0.0: 603,
 1.0: 679,
 2.0: 1278,
 3.0: 792,
 4.0: 146,
 5.0: 244,
 6.0: 293,
 7.0: 965}
``````
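
K-means cluster ids are arbitrary, so a fixed mood-to-cluster dictionary only holds for one particular run. One possible way to derive the mapping from the data is a majority vote per cluster, sketched below on made-up labels. The toy frame is an assumption; the column names `mood_goal` and `cluster_labels` follow the notebook.

```python
import pandas as pd

# Toy stand-in for final_data with a mood and a cluster id per track.
toy = pd.DataFrame({
    "mood_goal":      ["happy", "happy", "happy", "sad", "sad", "calm", "calm", "calm"],
    "cluster_labels": [2, 2, 0, 1, 1, 0, 0, 2],
})

# Rows: cluster ids, columns: moods, values: track counts.
contingency = pd.crosstab(toy["cluster_labels"], toy["mood_goal"])

# Majority vote: the most frequent mood inside each cluster.
cluster_to_mood = contingency.idxmax(axis=1).to_dict()

print(contingency)
print(cluster_to_mood)
```

Several clusters can end up with the same majority mood, and some moods may win no cluster at all, so the result should be checked before it replaces a hand-written mapping.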



In [None]:
# evaluation metrics : clustering


def cluster_evaluation_metrics(ground_truth, predictions, silhouette=True):

    # collect every score under the name of its metric function
    scores = {}
    
    if not silhouette:
        clustering_metrics = [
                # label-vs-label metrics, both inputs of shape (n_samples, )
                metrics.rand_score,
                metrics.fowlkes_mallows_score,
                metrics.homogeneity_score,
                metrics.completeness_score,
                metrics.v_measure_score,
                metrics.mutual_info_score,
                metrics.adjusted_rand_score,
                metrics.adjusted_mutual_info_score
            ]

        for metric in clustering_metrics:
            scores[metric.__name__] = metric(ground_truth.to_numpy(), predictions.to_numpy())

    else:
        cluster_scores_metrics = [
                # these expect a feature matrix (n_samples, n_features) and 1-D labels;
                # here the ground-truth labels stand in for the feature matrix
                metrics.silhouette_score,
                metrics.calinski_harabasz_score,
            ]

        for metric in cluster_scores_metrics:
            scores[metric.__name__] = metric(ground_truth.to_numpy().reshape(-1,1), predictions.to_numpy().ravel())

    return scores







In [None]:
# display final dataset for all clustered samples

final_data.head(2)


In [None]:
# drop null/empty rows

final_data = final_data.dropna()
empty_values = final_data.isna().sum()
print(empty_values)

In [None]:
final_data.head(2)

In [None]:
# evaluation metrics


print(metrics.rand_score(final_data['mood_label'], final_data['cluster_labels']))

print(metrics.cluster.adjusted_rand_score(final_data['mood_label'], final_data['cluster_labels']))

print(metrics.homogeneity_score(final_data['mood_label'], final_data['cluster_labels']))


print(metrics.completeness_score(final_data['mood_label'], final_data['cluster_labels']))


print(metrics.v_measure_score(final_data['mood_label'], final_data['cluster_labels']))

print(metrics.silhouette_score(final_data['mood_label'].to_numpy().reshape(-1,1), final_data['cluster_labels'].to_numpy().reshape(-1,1)))

print(metrics.mutual_info_score(final_data['mood_label'], final_data['cluster_labels']))
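
The silhouette score is defined on a feature matrix plus one cluster label per sample. The sketch below shows that call shape on synthetic 2-D points; the data is an assumption, and in this notebook the scaled valence and energy columns would be a natural choice for the feature matrix.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic features: two tight blobs in a (valence, energy)-like plane.
rng = np.random.default_rng(42)
features = np.vstack([
    rng.normal(0.2, 0.02, size=(50, 2)),
    rng.normal(0.8, 0.02, size=(50, 2)),
])
labels = np.array([0] * 50 + [1] * 50)

# Labels that match the blobs give a score close to 1.
score = silhouette_score(features, labels)

# Randomly permuted labels no longer match the geometry and score far lower.
shuffled_labels = rng.permutation(labels)
score_shuffled = silhouette_score(features, shuffled_labels)

print(f"matching labels: {score:.2f}, shuffled labels: {score_shuffled:.2f}")
```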



# Main model 



In [None]:
main_input_features = ['track_id', 'track_name', 'artist_name',  'core_genres', 'mood_goal','danceability', 'release_date',
                    'valence', 'energy',  'tempo', 'acousticness', 'instrumentalness', #'key', 'loudness',
                    'blues', 'classical', 'country', 'disco', 'dubstep', 'edm','electronic', 
                    'folk', 'funk', 'gospel', 'hip hop', 'house', 'indie rock', 'jazz', 'metal', 
                    'other', 'pop', 'punk rock', 'r&b', 'reggae', 'rock','rockabilly','soul', 'techno']

# features to be output by the recommender at the end
output_features = ['track_id', 'track_name', 'artist_name',  'core_genres', 'mood_goal']

# target column
target_feature = 'mood_goal'



In [None]:
# get data splits


main_moosic_data_nk = split_dataset(moosic_data_samples, mood_categories = mood_categories, target_feature = target_feature,
                        input_features = main_input_features,  output_features = output_features)




In [None]:
# X features (input):  X_train, X_test
# Y features (output):  Y_train, Y_test
# y feature (target):  y_train, y_test


print("X train (input features) ")
print("-----"*10)
Xm_train = main_moosic_data_nk['X_train']
display(Xm_train.head(2))

print("X test (input features) ")
print("-----"*10)
Xm_test = main_moosic_data_nk['X_test']
display(Xm_test.head(2))


print("Y train (target output features) ")
print("-----"*10)
Ym_train = main_moosic_data_nk['Y_train']
display(Ym_train.head(2))


print("Y test (target output features) ")
print("-----"*10)
Ym_test = main_moosic_data_nk['Y_test']
display(Ym_test.head(2))


print("y train (target feature) ")
print("-----"*10)
ym_train = main_moosic_data_nk['y_train']
display(ym_train.head(2))


print("y test (target feature) ")
print("-----"*10)
ym_test = main_moosic_data_nk['y_test']
display(ym_test.head(2))


In [None]:
Xm_train.shape

In [None]:


# defining the actual y (target data - mood_goal)
# y : y_train, y_test


# encode categorical data : mood
ym_train_dummies = pd.get_dummies(ym_train, drop_first=True).replace({True: 1, False: 0})
encoded_ym_train = pd.concat([ym_train, ym_train_dummies], axis=1)

ym_test_dummies = pd.get_dummies(ym_test, drop_first=True).replace({True: 1, False: 0})
encoded_ym_test = pd.concat([ym_test, ym_test_dummies], axis=1)


print("encoded target (train) data - mood_goal ")
print("-----"*10)
display(encoded_ym_train.head(2))


print("encoded target (test) data - mood_goal ")
print("-----"*10)
display(encoded_ym_test.head(2))


# mood_labels = ["depressed", "sad", "anxious",  "neutral", "calm", "euphoric", "energetic", "happy"]
# mood_1d_class = [0, 1, 2, 3, 4, 5, 6, 7]


# # add targets for evaluation for clusters later

# moosic_data["mood_label"] = (pd.cut(moosic_data["valence"], bins=mood_valence_values, labels=mood_1d_labels)).astype('string')
# moosic_data["mood_class"] = (pd.cut(moosic_data["valence"], bins=mood_valence_values, labels=mood_1d_class)).astype('Int64')

# moosic_data.head(2)
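
The encoding step above can be checked on a toy series. This is a minimal sketch, assuming a hypothetical `toy_moods` series that is not part of the Moosic data. It contrasts `LabelEncoder`, which maps each mood to an integer and so implies an order, with `pd.get_dummies`, which gives one indicator column per mood.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy target, standing in for the mood_goal column
toy_moods = pd.Series(["happy", "sad", "calm", "happy"], name="mood_goal")

# label encoding: one integer per class, classes sorted alphabetically
toy_labels = LabelEncoder().fit_transform(toy_moods)

# one-hot encoding: one 0/1 column per class, no implied order
toy_dummies = pd.get_dummies(toy_moods, dtype=int)

# drop_first=True removes the first class column, which is implied by the others
toy_dummies_reduced = pd.get_dummies(toy_moods, drop_first=True, dtype=int)

print(toy_labels)
print(toy_dummies)
print(toy_dummies_reduced)
```

With `drop_first=True` a row of all zeros stands for the dropped class, which is worth remembering when the encoded columns are read back later.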


In [None]:
#Xm_train.columns.to_list()

In [None]:
# global variables


modelling_data = {
    
    'X_train' : Xm_train,
    'X_test' : Xm_test,
    'Y_train' : Ym_train,
    'Y_test' : Ym_test,
    'y_train' : ym_train,
    'y_test' : ym_test,
    'encoded_y_train' : encoded_ym_train,
    'encoded_y_test' : encoded_ym_test

    }


main_cluster_params = {
        'n_clusters' : 8,
        'batch_size' : 500,
        'random_state' : 42,
        'init' : 'k-means++' 
    }

# clf_params = {
#         'n_estimators': [50, 100, 200],
#         'learning_rate': [0.01, 0.1, 0.2],
#         'max_depth': [None, 3, 8, 11, 20]
#     }

user_preferences = {'mood goal': 'relaxed', 'preferred_genre': 'hip hop'}

In [None]:


mood_clusters  = {
                        'happy' : 4,
                        'euphoric' : 3,                  
                        'tense' : 7, 
                        'angry' :0, 
                        'depressed' : 2, 
                        'sad' :6, 
                        'calm' : 1, 
                        'relaxed' : 5

    }


def moosic_clustering(Xm_train, playlist_length= 5, cluster_params = main_cluster_params, mood_clusters = mood_clusters,
                        pca=False, tsne=True, sample_size = 10000, *args, **kwargs):


    # model : clustering with mini-batch kmeans 
    clustered_data, clustered_embeddings, cluster_labels, cluster_centers = clustering_model(Xm_train,  pca = pca, tsne = tsne, 
                                                                            params = cluster_params, sample_size = sample_size)


    #scaled_embed = MinMaxScaler().fit_transform(clustered_embeddings)

    # visualize the clustered data
    fig, ax = plt.subplots(figsize = (16, 10))
    

    scaled_embed = MinMaxScaler().fit_transform(clustered_embeddings)
    plt.scatter(scaled_embed[:, 0], scaled_embed[:, 1], c=cluster_labels, cmap=custom_cmap_quad)
    #plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker='X', color='black') 

    # for cluster_centerx, cluster_centery in zip(cluster_centers[:, 0], cluster_centers[:, 1]):

    #    plt.scatter(cluster_centerx, cluster_centery, marker='X', color='black') 


    # no legend: the scatter has no labelled artists, the colorbar below identifies the clusters
    plt.title('Music - Mood Clustered Data ', pad=15, fontsize = 20, weight = 'bold', color='#928d8d')#'#C3C7C5')
    plt.colorbar()

    get_axes = plt.gca()
    plt.xticks([]) 
    plt.yticks([]) 
    xax = get_axes.axes.get_xaxis()
    
    xax = xax.set_visible(False)

    yax = get_axes.axes.get_yaxis()
    yax = yax.set_visible(False)

    # Set the axis colors
    ax.set_facecolor('#F7F3F0') #E2E2DB
    ax.grid(color='#cfcecb')
    ax.set_facecolor('#eeeeee') #eeeeee
    ax.spines['bottom'].set_color('#eeeeee')
    ax.spines['top'].set_color('#eeeeee')
    ax.spines['right'].set_color('#eeeeee')
    ax.spines['left'].set_color('#eeeeee')
    ax.xaxis.label.set_color('#eeeeee')
    ax.yaxis.label.set_color('#eeeeee')

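    # save before show: in a notebook plt.show() closes the figure, so saving afterwards writes a blank image
    plt.savefig('../images/3_clusters_main.png', transparent=True)
    plt.show()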


    return clustered_data, clustered_embeddings, scaled_embed




In [None]:
###

main_clustered_data, main_clustered_embeddings, main_scaled_embeddings = moosic_clustering(Xm_train, playlist_length= 5, cluster_params = main_cluster_params, 
                                                                            mood_clusters = mood_clusters, pca=False, tsne=True, sample_size = 15000)




In [None]:
tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))


In [None]:
# concat aligns on the index; rows without a matching index get NaN and are dropped below
m_cluster_data = pd.concat([ Ym_train, main_clustered_data['cluster_labels']], axis=1)



In [None]:

m_cluster_data['mood_genre'] = m_cluster_data['mood_goal'] + ' ' + m_cluster_data['core_genres']



In [None]:
m_cluster_data

In [None]:
# drop null/empty rows

null_rows111 = m_cluster_data[m_cluster_data.isnull().T.any()].index
m_cluster_data = m_cluster_data.drop(null_rows111)
empty_values1 = m_cluster_data.isna().sum()
print(empty_values1)

In [None]:
m_cluster_data

In [None]:

track_mood_genre_vector = tfidf_vectorizer.fit_transform(m_cluster_data['mood_genre'])
track_similarity_matrix = linear_kernel(track_mood_genre_vector, track_mood_genre_vector)
track_similarity_data = pd.DataFrame(track_similarity_matrix, index=m_cluster_data['track_name'], columns=m_cluster_data['track_name'])
track_similarity_data = track_similarity_data.reset_index(drop=True)
track_similarity_data = track_similarity_data.rename_axis(None, axis=1)
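
The vectorization and similarity step above can be illustrated on a handful of strings. This is a minimal sketch, assuming hypothetical `toy_mood_genre` strings in the same "mood genre" form as the `mood_genre` column. It shows that `linear_kernel` on TF-IDF vectors equals cosine similarity, because `TfidfVectorizer` L2-normalises each row by default, and that a user query is vectorized with `transform` on the already fitted vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# hypothetical "mood genre" strings, standing in for the mood_genre column
toy_mood_genre = ["relaxed pop", "relaxed pop", "happy pop", "sad folk"]

toy_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
toy_vectors = toy_vectorizer.fit_transform(toy_mood_genre)

# rows are L2-normalised, so the plain dot product is the cosine similarity
toy_similarity = linear_kernel(toy_vectors, toy_vectors)

# vectorize a user query with the fitted vocabulary (transform, not fit_transform)
toy_query = toy_vectorizer.transform(["relaxed pop"])
toy_query_scores = linear_kernel(toy_query, toy_vectors).ravel()

print(np.round(toy_similarity, 2))
print(np.round(toy_query_scores, 2))
```

Tracks that share the full mood and genre string score 1, tracks that share only one word score between 0 and 1, and tracks with no word in common score 0.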


In [None]:

#m_cluster_data['mood_genre_vectors'] = track_mood_genre_vector
mtrack_similarity_matrix = linear_kernel(track_mood_genre_vector, track_mood_genre_vector)


In [None]:

# recommend top N = 5 or random N = 5 music tracks based on user preferences
def moosic_recommender(main_clustered_data, main_clustered_embeddings, main_scaled_embeddings, Ym_train, user_preferences = {'mood': 'relaxed', 'genre': 'pop'},   
                        playlist_length= 5, mood_clusters = mood_clusters, which_sim = 'mood', *args, **kwargs):

    recommender_start_time = time.time()

    # get numerical representations of mood_genre vectors
    tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))

    # user query in the same "<mood> <genre>" form as the mood_genre column
    user_preference = " ".join([user_preferences['mood'].lower(), user_preferences['genre'].lower()])

    # so as to query data based on user's preferences and clustered data
    # (concat aligns on the index; rows without a matching index get NaN and are dropped)
    moosic_cluster_data = pd.concat([ Ym_train, main_clustered_data['cluster_labels']], axis=1)
    moosic_cluster_data = moosic_cluster_data.dropna().reset_index(drop=True)
    moosic_cluster_data['mood_genre'] = moosic_cluster_data['mood_goal'] + ' ' + moosic_cluster_data['core_genres']

    ## track vectorization and track - to - track similarity computation
    track_mood_genre_vector = tfidf_vectorizer.fit_transform(moosic_cluster_data['mood_genre'])

    # the user query is vectorized with the vocabulary fitted on the tracks
    user_vector = tfidf_vectorizer.transform([user_preference])

    scored_track_data = moosic_cluster_data.copy(deep = True)

    # track to track similarity based on track_name
    if which_sim == "track_name":
        track_similarity_matrix = linear_kernel(track_mood_genre_vector, track_mood_genre_vector)
        track_similarity_data = pd.DataFrame(track_similarity_matrix, columns=moosic_cluster_data['track_name'])
        track_similarity_data = track_similarity_data.rename_axis(None, axis=1)
        scored_track_data = pd.concat([moosic_cluster_data, track_similarity_data], axis=1)

    # track to track similarity within each mood_genre group
    similarity_matrices = {}
    if which_sim == "mood":
        for mood_genre, group in moosic_cluster_data.groupby('mood_genre'):
            # compute similarity matrix within the mood_genre
            group_vectors = track_mood_genre_vector[group.index.to_numpy()]
            similarity_matrices[mood_genre] = linear_kernel(group_vectors)

    # query track to track similarity moosic data based on user input
    queried_user_data = scored_track_data.copy(deep = True)

    # pickle dataframe:
    import pickle

    with open('df.bin', 'wb') as f:
        pickle.dump(queried_user_data, f, pickle.HIGHEST_PROTOCOL)
    queried_user_data = (queried_user_data.query(" mood_genre == @user_preference ")).reset_index(drop=True)

    ## - pick random N mood-music tracks for the category entered
    moosic_randomN_idx = np.random.choice(
                            queried_user_data.index,
                            size = min(playlist_length, len(queried_user_data)), # cannot sample more tracks than the query returned
                            replace= False
                            )

    #recommended_moosic_playlist = queried_user_data[['track_id', 'track_name', 'artist_name']].iloc[moosic_randomN_idx]
    recommended_moosic_playlist = queried_user_data[['track_id', 'track_name', 'artist_name']].iloc[:playlist_length]

    recommender_end_time = time.time()
    recommender_train_time = recommender_end_time - recommender_start_time
    print(f"Time taken for the recommender model : {recommender_train_time:.2f} seconds")

    return recommended_moosic_playlist, scored_track_data






In [None]:
def moosic_recommender(mood, genre, modelling_data = modelling_data, playlist_length= 5, cluster_params = baseline_cluster_params,
                        pca=False, tsne=True, sample_size = 10000, *args, **kwargs):


    # data : train, test features and targets
    # (errors='ignore' skips listed columns that are not present in the feature set)
    drop_features = ['key', 'speechiness',  'instrumentalness', 'tempo', 'acousticness']
    X_train, X_test = modelling_data['X_train'].drop(columns=drop_features, errors='ignore'), modelling_data['X_test'].drop(columns=drop_features, errors='ignore')
    Y_train, Y_test = modelling_data['Y_train'], modelling_data['Y_test']
    y_train, y_test = modelling_data['y_train'], modelling_data['y_test']
        
    # get dummies
    #encoded_y_train, encoded_y_test = modelling_data['encoded_y_train'], modelling_data['encoded_y_test']

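    # Label encoder: fit on the training targets only, then reuse the same mapping for the test targets
    label_encoder = LabelEncoder()
    encoded_y_train, encoded_y_test = label_encoder.fit_transform(y_train), label_encoder.transform(y_test)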



    # model : clustering with mini-batch kmeans 
    clustered_data, clustered_embeddings, cluster_labels, cluster_centers = clustering_model(X_train,  pca = pca, tsne = tsne, 
                                                                            params = cluster_params, sample_size = sample_size)


    #scaled_embed = MinMaxScaler().fit_transform(clustered_embeddings)

    # visualize the clustered data
    fig, ax = plt.subplots(figsize = (16, 10))

    scaled_embed = clustered_embeddings #MinMaxScaler().fit_transform(clustered_embeddings)
    plt.scatter(scaled_embed[:, 0], scaled_embed[:, 1], c=cluster_labels, cmap=custom_cmap_hex1)


    # no legend: the scatter has no labelled artists, the colorbar below identifies the clusters
    plt.title('Music - Mood Clustered Data ', pad=15, fontsize = 20, weight = 'bold', color='#2B2960')
    plt.colorbar()

    get_axes = plt.gca()
    plt.xticks([]) 
    plt.yticks([]) 
    xax = get_axes.axes.get_xaxis()
    
    xax = xax.set_visible(False)

    yax = get_axes.axes.get_yaxis()
    yax = yax.set_visible(False)

    # Set the axis colors
    ax.set_facecolor('#E2E2DB') #E2E2DB
    ax.spines['bottom'].set_color('#e3dfdb')
    ax.spines['top'].set_color('#e3dfdb')
    ax.spines['right'].set_color('#e3dfdb')
    ax.spines['left'].set_color('#e3dfdb')
    ax.xaxis.label.set_color('#605F5F')
    ax.yaxis.label.set_color('#605F5F')


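    # save before show: in a notebook plt.show() closes the figure, so saving afterwards writes a blank image
    plt.savefig('../images/1_mainkclusters_plot.png', transparent=True)
    plt.show()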




    # model recommender (classification/similarity and prediction/filter) part: 

    moosic_cluster_data = pd.concat([ Y_train, clustered_data['cluster_labels']], axis=1)
    moosic_cluster_data['mood_genre'] = moosic_cluster_data['mood_goal'] + ' ' + moosic_cluster_data['core_genres']


    ## track vectorization and track - to - track similarity computation

    tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
    track_mood_genre_vector = tfidf_vectorizer.fit_transform(moosic_cluster_data['mood_genre'])

    track_similarity_matrix = linear_kernel(track_mood_genre_vector, track_mood_genre_vector)
    track_similarity_data = pd.DataFrame(track_similarity_matrix, index=moosic_cluster_data['track_name'], columns=moosic_cluster_data['track_name'])
    track_similarity_data = track_similarity_data.reset_index(drop=True)
    track_similarity_data = track_similarity_data.rename_axis(None, axis=1)

    # features track_similarity_data
    tsd_features = track_similarity_data.columns.tolist()
    scored_track_data = pd.concat([moosic_cluster_data, track_similarity_data], axis=1)

    mood_genre = f'{mood.lower()} {genre.lower()}'

    # query track to track similarity moosic data based on user input
    queried_user_data = scored_track_data.copy(deep = True)
    # pickle dataframe:
    import pickle
    
    with open('df.bin', 'wb') as f:
        pickle.dump(queried_user_data, f, pickle.HIGHEST_PROTOCOL)
    queried_user_data = (queried_user_data.query(" mood_genre == @mood_genre ")).reset_index(drop=True)


    ## - pick random N mood-music tracks for the category entered

    moosic_randomN_idx = np.random.choice(
                            queried_user_data.index,
                            size = min(playlist_length, len(queried_user_data)), # cannot sample more tracks than the query returned
                            replace= False
                            )

    #recommended_moosic_playlist = queried_user_data[['track_id', 'track_name', 'artist_name']].iloc[moosic_randomN_idx]
    recommended_moosic_playlist = queried_user_data[['track_id', 'track_name', 'artist_name']].iloc[:playlist_length]

    return recommended_moosic_playlist, scored_track_data


In [None]:
user_preferences = {'mood goal': 'relaxed', 'preferred_genre': 'pop'}


recommended_moosic_playlist, scored_track_data = moosic_recommender(user_preferences['mood goal'], user_preferences['preferred_genre'], modelling_data = modelling_data, playlist_length= 5, cluster_params = main_cluster_params, 
                            pca=False, tsne=True, sample_size = 10000)



recommended_moosic_playlist


In [None]:
scored_track_data

In [None]:

from sklearn.metrics import rand_score, homogeneity_score, completeness_score, v_measure_score, silhouette_score


print(rand_score(scored_track_data['mood_label'], scored_track_data['cluster_labels']))

print(homogeneity_score(scored_track_data['mood_label'], scored_track_data['cluster_labels']))


print(completeness_score(scored_track_data['mood_label'], scored_track_data['cluster_labels']))


print(v_measure_score(scored_track_data['mood_label'], scored_track_data['cluster_labels']))

print(silhouette_score(scored_track_data['mood_label'].to_numpy().reshape(-1,1), scored_track_data['cluster_labels'].to_numpy().reshape(-1,1)))


In [None]:

def web_app_query(mood, genre, playlist_length = 5):
    
    # moosic_recommender returns (playlist, scored_track_data); only the playlist is served
    playlist, _ = moosic_recommender(mood, genre, modelling_data = modelling_data, playlist_length= playlist_length, cluster_params = main_cluster_params, 
                            pca=False, tsne=True, sample_size = 10000)

    #return model
    return playlist.to_dict('records')


# model = web_app_query('calm', 'pop', playlist_length = 5)



In [None]:
web_app_query('happy', 'pop')

In [None]:
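import pickle

# note: pickle stores a function by reference (module and name) only, not its code or the data it uses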

with open('model.bin', 'wb') as f:
    pickle.dump(web_app_query, f, pickle.HIGHEST_PROTOCOL)
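
Since pickling a function stores only its name, a web app that loads `model.bin` still needs the notebook's data and functions to be importable. An alternative is to serialise the fitted estimator itself. This is a minimal sketch, assuming hypothetical `toy_features` and a small `MiniBatchKMeans` model in place of the real pipeline.

```python
import pickle

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# hypothetical toy features, standing in for the scaled audio features
rng = np.random.default_rng(42)
toy_features = rng.random((200, 2))

toy_model = MiniBatchKMeans(n_clusters=4, batch_size=50, random_state=42, n_init=3)
toy_model.fit(toy_features)

# serialise the fitted estimator, so the learned cluster centres travel with it
toy_model_bytes = pickle.dumps(toy_model, protocol=pickle.HIGHEST_PROTOCOL)
restored_model = pickle.loads(toy_model_bytes)

# the restored model should reproduce the original cluster assignments
same_labels = np.array_equal(restored_model.predict(toy_features), toy_model.predict(toy_features))
print(same_labels)
```

The same pattern works with a file handle via `pickle.dump` and `pickle.load`; the loading side needs a compatible scikit-learn version installed.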

* Delete prints
* change dataframe into list of dicts


In [None]:
# web_app_query already returns a list of dicts (records)
model = web_app_query('calm', 'pop', playlist_length = 5)
model

In [None]:
main_clustered_data, main_clustered_embeddings, main_scaled_embeddings = moosic_clustering(Xm_train, playlist_length= 5, cluster_params = main_cluster_params, 
                                                                            mood_clusters = mood_clusters, pca=False, tsne=True, sample_size = 20000)



In [None]:


params_kmodels = {
    
        'n_clusters': 2,
        'init_kmode': ['Cao', 'Huang'], 
        'random_state': 42
    
    }



In [None]:

# functions
# -- dimensionality reduction (tsne, umap, pca)
# -- clustering models (kmeans, minibatch kmeans, kmode, minibatch kmode, kmeans + kmode, mean shift, dbscan)
# -- text vectorization and similarity computation (tfidf vectorizer, cosine similarity, linear kernel)
# -- classification models (xgboost, svm, logistic regression, naive bayes, random forest)




def dimensionality_reduction(dataset, method='tsne', n_components=2, random_state=42):

    """
    Dimensionality reduction techniques

    """

    methods = {
        'pca': PCA(n_components=n_components, random_state=random_state),
        'tsne': TSNE(n_components=n_components, random_state=random_state),
        'umap': UMAP(n_components=n_components, random_state=random_state)
    }

    if method in methods:
        reduction_method = methods[method]
        return reduction_method.fit_transform(dataset)
    else:
        raise ValueError(f"The dimensionality reduction technique: '{method}' is currently not available.")




def clustering_models(dataset, model='kmeans', n_clusters=2):

    """
    Clustering models

    """


    models = {
        'kmeans': KMeans(n_clusters),
        'minibatch_kmeans': MiniBatchKMeans(n_clusters),
        'kmode': KModes(n_clusters),
        'kprototypes': KPrototypes(n_clusters),
        'mean_shift': MeanShift(),
        'dbscan': DBSCAN()
    }

    if model in models:
        clustering_model = models[model]
        return clustering_model.fit_predict(dataset)
    else:
        raise ValueError(f"The clustering model: '{model}' is currently not available.")


def text_processing(dataset, vectorizer='tfidf'):

    """
    Similarity matrix computation - Text vectorization

    """

    vectorizers = {
        'tfidf': lambda d: TfidfVectorizer().fit_transform(d),
        'cosine_similarity': lambda d: cosine_similarity(d),
        'linear_kernel': lambda d: linear_kernel(d)
    }

    if vectorizer in vectorizers:
        text_vectorizer = vectorizers[vectorizer]
        return text_vectorizer(dataset)
    else:
        raise ValueError(f"The text vectorization method : '{vectorizer}' is currently not available.")




def classification_models(dataset, labels, model='xgboost', random_state=42):

    """
    Classification models

    """

    models = {
        'xgboost': XGBClassifier(random_state=random_state),
        'svm': SVC(random_state=random_state),
        'logistic_regression': LogisticRegression(random_state=random_state),
        'gaussian_naive_bayes': GaussianNB(),
        'multinomial_naive_bayes': MultinomialNB(),
        'random_forest': RandomForestClassifier(random_state=random_state),
        'adaboost': AdaBoostClassifier(random_state=random_state),
        'gradientboost': GradientBoostingClassifier(random_state=random_state)
    }

    if model in models:
        classification_model = models[model]
        return classification_model.fit(dataset, labels)
    else:
        raise ValueError(f"The classification model: '{model}' is currently not listed.")








In [None]:
# functions
# -- evaluation metrics (rand index, silhouette score, vmeasure score, ndcg, precision, recall)
# -- 
# -- 
# -- 
# -- 







In [None]:

# function clustering model (+ tsne) 


def pca_tsne(data,  pca = True, tsne=True, n_components=8, sample_size=5000, *args, **kwargs):


    data = data.head(sample_size)
    scaled_data = MinMaxScaler().fit_transform(data)
    data_col = data.columns.to_list()
    data = pd.DataFrame(scaled_data, columns = data_col)

    if pca == True:
        pca = PCA(n_components=2, random_state = 42)

        pca_start_time = time.time()

        data = MinMaxScaler().fit_transform(pca.fit_transform(data))

        pca_end_time = time.time()
        pca_train_time = pca_end_time - pca_start_time
        print(f"Time taken for dimensionality reduction using PCA: {pca_train_time:.2f} seconds")

        data = pd.DataFrame(data, columns = ['dimension_0', 'dimension_1'])



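    # embeddings stay None when t-SNE is skipped, so the return below never hits an unbound name
    tsne_embeddings = None

    if tsne == True:
        tsne = TSNE(n_components=2, random_state = 42)

        tsne_start_time = time.time()

        tsne_embeddings = tsne.fit_transform(data)

        tsne_end_time = time.time()
        tsne_train_time = tsne_end_time - tsne_start_time
        print(f"Time taken for dimensionality reduction using t-SNE: {tsne_train_time:.2f} seconds")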




    return data, tsne_embeddings
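
The scale, reduce and embed sequence above can be tried on synthetic data. This is a minimal sketch, assuming hypothetical random `toy_features` in place of the audio features. It runs PCA and t-SNE side by side on the same scaled input so the two 2-D outputs can be compared.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# hypothetical toy features: 150 tracks with 6 numeric audio features
rng = np.random.default_rng(42)
toy_features = rng.random((150, 6))

# scale every feature to [0, 1] so no single feature dominates the distances
toy_scaled = MinMaxScaler().fit_transform(toy_features)

# linear projection onto the two directions of largest variance
toy_pca = PCA(n_components=2, random_state=42).fit_transform(toy_scaled)

# non-linear embedding for visualisation; perplexity must be below the sample count
toy_tsne = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(toy_scaled)

print(toy_pca.shape, toy_tsne.shape)
```

t-SNE has no `transform` for unseen points, so it suits plotting rather than preprocessing for a model that must score new tracks.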






In [None]:

# function clustering model (+ tsne) 


def clustering_model(data,  pca = True, tsne=True, params = {'n_clusters': 8}, sample_size=5000, *args, **kwargs):


    data = data.head(sample_size)
    scaled_data = MinMaxScaler().fit_transform(data)
    data_col = data.columns.to_list()
    data = pd.DataFrame(scaled_data, columns = data_col)

    if pca == True:
        pca = PCA(n_components=2, random_state = 42)

        pca_start_time = time.time()

        data = MinMaxScaler().fit_transform(pca.fit_transform(data))

        pca_end_time = time.time()
        pca_train_time = pca_end_time - pca_start_time
        print(f"Time taken for dimensionality reduction using PCA: {pca_train_time:.2f} seconds")

        data = pd.DataFrame(data, columns = ['dimension_0', 'dimension_1'])

    else:
        data = data


    model = MiniBatchKMeans(**params)

    kme_start_time = time.time()

    data['cluster_labels'] = model.fit_predict(data)

    kme_end_time = time.time()
    kme_train_time = kme_end_time - kme_start_time
    

    cluster_labels = model.labels_ 
    cluster_centers = model.cluster_centers_ 

    #if tsne == True:
    tsne = TSNE(n_components=2, random_state = 42)

    tsne_start_time = time.time()

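    # embed the feature columns only; including cluster_labels would pull same-cluster points together in the plot
    tsne_embeddings = tsne.fit_transform(data.drop(columns=['cluster_labels']))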

    tsne_end_time = time.time()
    tsne_train_time = tsne_end_time - tsne_start_time
        

    #else:
    #    data = data


    
    # kp_model_name = "minikmeans.pickle"
    # tp_model_name = "tsne_embed.pickle"

    # pickle.dump(model, open(kp_model_name, 'wb'))
    # pickle.dump(model, open(tp_model_name, 'wb'))


    # kj_model_name = "minikmeans.joblib"
    # tj_model_name = "tsne_embed.joblib"

    # joblib.dump(model, kj_model_name)
    # joblib.dump(model, tj_model_name)



    return data, tsne_embeddings, cluster_labels, cluster_centers
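
The clustering step above can be sketched in isolation. This is a minimal example, assuming hypothetical blob data with two columns named `valence` and `energy` for illustration only. It passes a parameter dictionary to `MiniBatchKMeans` in the same way as `main_cluster_params`.

```python
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# hypothetical toy data: three well separated groups of points
toy_points, _ = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [-5, 5]], random_state=42)
toy_data = pd.DataFrame(MinMaxScaler().fit_transform(toy_points), columns=["valence", "energy"])

toy_params = {"n_clusters": 3, "batch_size": 100, "random_state": 42, "init": "k-means++", "n_init": 3}
toy_model = MiniBatchKMeans(**toy_params)

# fit on the feature columns only, then attach the labels as a new column
toy_labels = toy_model.fit_predict(toy_data[["valence", "energy"]])
toy_data["cluster_labels"] = toy_labels

print(toy_data["cluster_labels"].value_counts())
print(toy_model.cluster_centers_)
```

Keeping the labels out of the matrix that is later embedded or scored avoids feeding the cluster assignment back in as a feature.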






# Hyperparameter tuning

In [None]:

# classifier/prediction model
# hyperparameter optimization



def classifier_tuning(x_data, y_data, model_params, random_state=42, *args, **kwargs): # model_name = 'random forest',  *args, **kwargs):

    classifiers = [
        ('RandomForestClassifier', RandomForestClassifier()),
        ('XGBClassifier', XGBClassifier()),
        ('LinearSVC', LinearSVC()),
        ('MultinomialNB', MultinomialNB()),
        ('AdaBoostClassifier', AdaBoostClassifier()),
        ('GradientBoostingClassifier', GradientBoostingClassifier())
    ]

    
    cv = KFold(n_splits=3, shuffle=True, random_state=42)

    best_classifiers = {}

    for model_name, clf_model in classifiers: 

        cv_space = model_params[model_name]
        #grid_search = GridSearchCV(classifiers, param_grid, cv=cv)
        random_search = RandomizedSearchCV(clf_model, param_distributions=cv_space, n_iter=10,
                                scoring='accuracy', n_jobs=-1, cv=cv, random_state=random_state)

        random_search.fit(x_data, y_data)
        best_classifiers[model_name] = random_search.best_estimator_

        print(f"The best hyperparameters for {model_name} are: \n {random_search.best_params_}")
        print(f"The best score for {model_name} is: {random_search.best_score_}")

        scores = cross_val_score(random_search, x_data, y_data, scoring='accuracy', cv=3, n_jobs=-1)

        print( f" The {model_name} accuracy is : mean - {np.mean(scores):.3f} &  std - {np.std(scores):.3f} " )

    return best_classifiers



# tune


model_params = {
    'RandomForestClassifier': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 3, 8, 11, 20],
        'min_samples_split': [2, 5, 10]
    },
    'XGBClassifier': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [None, 3, 8, 11, 20]
    },
    'LinearSVC': {
        'C': [0.1, 1.0, 10.0],
        'penalty': ['l1', 'l2']
    },
    'MultinomialNB': {
        'alpha': [0.01, 0.1, 1.0],
        'fit_prior': [True, False]
    },
    'GradientBoostingClassifier': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [None, 3, 8, 11, 20]
    },
    'AdaBoostClassifier': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2]
    },
}

#x_data = 

#y_data 

classifier_tuning(X_train, y_train['cluster_labels'], model_params, random_state=42)
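
The randomized search used in `classifier_tuning` can be tried on a small synthetic problem first. This is a minimal sketch, assuming hypothetical `make_classification` data and a reduced random forest search space so that it runs quickly. It is not the tuning run for the Moosic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# hypothetical toy problem: 3 classes, 8 numeric features
toy_X, toy_y = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=42)

# reduced search space, same keys as the RandomForestClassifier entry in model_params
toy_space = {
    "n_estimators": [10, 25, 50],
    "max_depth": [None, 3, 8],
    "min_samples_split": [2, 5, 10],
}

toy_cv = KFold(n_splits=3, shuffle=True, random_state=42)
toy_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=toy_space,
    n_iter=5,              # number of sampled parameter combinations
    scoring="accuracy",
    cv=toy_cv,
    random_state=42,
)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)
print(f"{toy_search.best_score_:.3f}")
```

`best_score_` is the mean cross-validated accuracy of the best sampled combination, so it is an optimistic estimate unless it is confirmed on held-out data.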



# Error analysis : clustering and classification metrics


In [None]:
# error analysis : clustering and classification metrics


def classification_metrics(ground_truth, predictions):

    y_true = ground_truth.to_numpy().ravel()
    y_pred = predictions.to_numpy().ravel()

    # note: roc_auc_score and roc_curve on hard class predictions only apply to binary labels
    score_rocauc = roc_auc_score(y_true, y_pred)
    print(f"ROC AUC of: {score_rocauc:.2f} ")

    score_acc = accuracy_score(y_true, y_pred)
    print(f"Accuracy of: {score_acc:.2f} ")

    score_report = classification_report(y_true, y_pred)
    print(f"Classification report: \n{score_report}")

    score_matrix = confusion_matrix(y_true, y_pred)
    print(f"Confusion matrix: \n{score_matrix}")

    score_roc = roc_curve(y_true, y_pred)
    print(f"ROC curve (fpr, tpr, thresholds): {score_roc}")


    scores = {
        'roc_auc_score' : score_rocauc,
        'accuracy_score' : score_acc,
        'classification_report' : score_report,
        'confusion_matrix' : score_matrix,
        'roc_curve' : score_roc,
        }
    

    return scores






In [None]:

# evaluation of models
# - recommended music tracks based on mood and genre
# - actual music track groupings based on mood and genre

def clf_evaluation_metrics(predictions, ground_truth, playlist_length= 5, user_preferences = {'mood goal': 'relaxed', 'preferred_genre': 'hip hop'}):
    
    moosic_groups = set(ground_truth)
    top_n = list(predictions[:playlist_length])
    n_recommended = set(top_n)
    intersection = moosic_groups.intersection(n_recommended)

    # precision at recommended playlist length, N = 5 (top n? random?)
    precision_n = len(intersection) / playlist_length

    # recall at recommended playlist length: share of the relevant tracks that were recommended
    recall_n = len(intersection) / len(moosic_groups) if moosic_groups else 0

    # f1_score at recommended playlist length, N = 5 (top n? random?)
    f1_score_n = 0 if (precision_n + recall_n) == 0 else 2 * (precision_n * recall_n) / (precision_n + recall_n)

    # average precision at recommended playlist length: mean of precision@k over the ranks k that hold a relevant track
    hits = 0
    precisions_at_hits = []
    for k, track in enumerate(top_n, start=1):
        if track in moosic_groups:
            hits += 1
            precisions_at_hits.append(hits / k)
    map_n = np.mean(precisions_at_hits) if precisions_at_hits else 0

    # discounted cumulative gain (DCG) at recommended playlist length, N = 5 (top n? random?)
    dcg_n = sum((2 ** 1 - 1) / np.log2(i + 2) for i, track in enumerate(top_n) if track in moosic_groups)

    # normalized discounted cumulative gain (NDCG): the ideal ranking places all relevant tracks first
    optimal_dcg_n = sum((2 ** 1 - 1) / np.log2(i + 2) for i in range(min(len(moosic_groups), playlist_length)))
    ndcg_n = 0 if optimal_dcg_n == 0 else dcg_n / optimal_dcg_n


    eval_data = {
        'user_preferences' : user_preferences,
        'playlist_length' : playlist_length,
        'predictions' : predictions,
        'ground_truth' : ground_truth,
        'moosic_groups' : moosic_groups,
        'n_recommended' : n_recommended,
        'intersection' : intersection,
        'precision_n' : precision_n,
        'recall_n' : recall_n,
        'f1_score_n' : f1_score_n,
        'map_n' : map_n,
        'dcg_n' : dcg_n,
        'optimal_dcg_n' : optimal_dcg_n,
        'ndcg_n' : ndcg_n
        }


    return eval_data
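
The ranking metrics above can be verified by hand on a tiny playlist. This is a minimal sketch, assuming a hypothetical recommendation list `toy_recommended` and a hypothetical set of relevant tracks `toy_relevant`, with binary relevance (a track is either relevant or not).

```python
import numpy as np

# hypothetical playlist of N = 5 recommended tracks, in ranked order
toy_recommended = ["a", "b", "c", "d", "e"]
# hypothetical relevant tracks for the user's mood and genre
toy_relevant = {"a", "c", "x"}
n = 5

top_n = toy_recommended[:n]
hits = [track in toy_relevant for track in top_n]

# precision@n: share of recommended tracks that are relevant
precision_at_n = sum(hits) / n
# recall@n: share of relevant tracks that were recommended
recall_at_n = sum(hits) / len(toy_relevant)

# DCG: each hit is discounted by the log of its rank (rank 0 gets weight 1)
dcg = sum(1 / np.log2(rank + 2) for rank, hit in enumerate(hits) if hit)
# ideal DCG: all relevant tracks placed at the top of the list
ideal_dcg = sum(1 / np.log2(rank + 2) for rank in range(min(len(toy_relevant), n)))
ndcg = dcg / ideal_dcg if ideal_dcg > 0 else 0.0

print(precision_at_n, round(recall_at_n, 3), round(dcg, 3), round(ndcg, 3))
```

Here two of the five recommendations are relevant, at ranks 1 and 3, so the discounted gain is 1 + 1/2, and the normalisation divides by the gain of a list that starts with all three relevant tracks.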







# User data - test

|  user_id  |  user_name  | preferred_genre | mood_goal | previous_choices |
|:---------:|:-----------:|:---------------:|:---------:|:----------------:|
|  m1m0h0  |  apollo  | hip hop | happy | ['new age', 'any'] |
|  m1m0h1  |  egwu | any | euphoric | ['electronic', 'tense'] |
|  m1m0h2  |  aurras   | folk | sad | ['world/traditional', 'euphoric'] |
|  m1m0h3  |  pelios  | jazz | tense | ['country', 'relaxed'] |
|  m1m0h4  |  inuaria  | metal | calm | ['blues', 'angry'] |
|  m1m0h5  |  psyche  | blues | depressed | ['any', 'happy'] |
|  m1m0h6  |  ihy  | pop | any | ['folk', 'sad'] |
|  m1m0h7  |  ova  | rock | angry | ['jazz', 'relaxed'] |
|  m1m0h8  |  thalia  | any | relaxed | ['hip hop', 'calm'] |




In [None]:
# test

user_preferences = {'mood goal': ['relaxed', 'tense'], 'preferred_genre': ['hip hop', 'folk']}
# one row per test user (passing the dict directly keeps the lists as columns)
user_test_data = pd.DataFrame(user_preferences)

user_test_data



---

#### Initial analysis of the metrics

For the mini-batch kmeans clustering with:  
* rand_score of: 0.78, about 78% of track pairs are grouped consistently (together or apart) in the predicted clusters and the true mood_class; this score is not corrected for chance, so it looks better than the adjusted scores below
* fowlkes_mallows_score of: 0.17, low agreement between the predicted clusters and the true mood_class
* homogeneity_score of: 0.09, low score indicates that most clusters mix tracks from several true mood_class labels
* completeness_score of: 0.09, low score indicates that tracks of the same mood_class are split across several predicted clusters
* v_measure_score of: 0.09, low overall quality of clustering (harmonic mean of the two scores above)
* mutual_info_score of: 0.18, a small amount of shared information (unnormalised, so hard to judge on its own)
* adjusted_rand_score of: 0.04, close to 0, which is the level expected by chance
* adjusted_mutual_info_score of: 0.09, low level of agreement beyond what is expected by chance

<br>

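For the mini-batch kmeans clustering, the internal metrics (computed from the features and predicted clusters only) are:  
* silhouette_score of: -0.22, negative, so on average tracks lie closer to a neighbouring cluster than to their own; the clusters overlap and are not well separated
* calinski_harabasz_score of: 2729.40, a higher value means more between-cluster variance relative to within-cluster variance, but the score has no fixed scale, so it is only meaningful when compared with other clusterings of the same data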

<br>

* rand_score: the fraction of data point pairs on which the predicted clusters and the true clusters agree (same cluster in both, or different clusters in both), 0 (no agreement) to 1 (identical to true clusters) 
* fowlkes_mallows_score: the geometric mean of pairwise precision and recall between the predicted clusters and the true clusters, 0 (not a good match/clustering) to 1 (perfect) 
* homogeneity_score: a measure of how much each cluster contains only data points that belong to a single class
* completeness_score: a measure of how well all data points that belong to the same class are assigned to the same cluster
* v_measure_score: the harmonic mean of homogeneity and completeness, a balanced measure of the quality of clusters 
* mutual_info_score: the measure of the amount of information shared between true and predicted clusters (not normalised)
* adjusted_rand_score: a variation of the rand index score that accounts for chance, about 0 for random labelings and 1 for identical clusterings
* adjusted_mutual_info_score: a variation of the mutual info score that accounts for chance
* silhouette_score: it measures the quality of clusters by evaluating how similar each data point is to its own cluster compared to the nearest other cluster, -1 (worst) to 1 (best)
* calinski_harabasz_score: the variance ratio criterion, it measures the cluster quality as the ratio of between-cluster to within-cluster variance (higher is better)
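
The pair-counting view behind rand_score and fowlkes_mallows_score can be sketched in plain Python. This is an illustration on made-up toy labels, not the notebook's data; `pair_counts` is a hypothetical helper, and in practice the scikit-learn functions named above would be used instead.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Count sample pairs by whether they share a true class and a predicted cluster."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            tp += 1  # together in both labelings
        elif same_cluster:
            fp += 1  # grouped by the model, different true class
        elif same_class:
            fn += 1  # same true class, split by the model
        else:
            tn += 1  # apart in both labelings
    return tp, fp, fn, tn


# toy labels, made up for illustration
labels_true = ['happy', 'happy', 'sad', 'sad', 'calm', 'calm']
labels_pred = [0, 0, 1, 2, 2, 2]

tp, fp, fn, tn = pair_counts(labels_true, labels_pred)

# Rand index: share of pairs on which both labelings agree
rand_index = (tp + tn) / (tp + fp + fn + tn)
# Fowlkes-Mallows: geometric mean of pairwise precision and recall
fowlkes_mallows = tp / sqrt((tp + fp) * (tp + fn)) if tp else 0.0

print(f"pairs (tp, fp, fn, tn): {(tp, fp, fn, tn)}")
print(f"rand index: {rand_index:.2f}, fowlkes-mallows: {fowlkes_mallows:.2f}")
```

The Rand index counts the pairs that are apart in both labelings (`tn`) as agreement, while Fowlkes-Mallows ignores them, which is one reason the two scores can differ widely on the same clustering.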



<br>

In summary, the model 

* agrees with the true mood clusters on 78% of music track pairs (rand_score of 0.78), based on valence and other audio features; this is not the same as assigning 78% of the tracks to the right mood cluster
* is close to chance level once the agreement is corrected for chance (adjusted_rand_score of 0.04), so the 78 % similarity to the actual music-mood (1-D) clusters overstates the quality of the baseline model
* also shows that clusters are not well separated and a lot of music data belonging to the same mood class were not assigned to the same cluster
* suggests that the music track samples may belong to multiple mood classes
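
A rough chance baseline helps put the rand_score of 0.78 in context. The sketch below assumes k equally sized mood classes, k equally sized clusters assigned independently of the classes, and a large number of tracks; under these assumptions a random pair shares a class with probability 1/k and shares a cluster with probability 1/k. The value k = 8 is an assumption read off the eight mood_goal values in the test user table, not a confirmed property of the dataset, and `expected_rand_index_by_chance` is a hypothetical helper.

```python
def expected_rand_index_by_chance(k):
    """Approximate Rand index of a random clustering against k balanced classes.

    Assumes k balanced clusters assigned independently of the classes and a
    large sample, so a random pair shares a class (or cluster) with probability 1/k.
    """
    p_same = 1 / k
    # the labelings agree when the pair is together in both or apart in both
    return p_same ** 2 + (1 - p_same) ** 2


chance_ri = {k: expected_rand_index_by_chance(k) for k in (2, 4, 8)}
for k, value in chance_ri.items():
    print(f"k = {k}: chance-level rand index ~ {value:.3f}")
```

Under these assumptions the chance-level Rand index for k = 8 is about 0.78, which fits the near-zero adjusted_rand_score reported above.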



---


In [None]:
# similarity between the predicted and actual mood clusters
# by what percentage are they similar?
# rand index score of 0.78 (share of track pairs on which both clusterings agree)
# in terms of % 

RI = 0.78  # rand_score reported in the metrics analysis above
RI_rate = RI * 100
print(f"The predicted and true clusters agree on {RI_rate:.2f} % of track pairs")
print(f"The baseline model predicted clusters are approximately {RI_rate:.2f} % similar to the actual music-mood (1-D) clusters")

