# Atividade  
Ajustar o script para classificar gêneros musicais do Spotify considerando a base de dados [disponível no HuggingFace](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset).

In [37]:
import pandas as pd

from matplotlib import pyplot as plt

from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from csv import reader
from math import sqrt
plt.rcParams['figure.figsize'] = [16, 10]

import random
from random import seed
from random import randrange

import seaborn as sns

import requests
import io
    
# Downloading the csv file from your GitHub account

url = "https://raw.githubusercontent.com/Zuluke/Projetos-AM/main/spotify_activity/dataset.csv?token=GHSAT0AAAAAACPRK7QNQXQV5QYEVVSPIZ7EZQ75FOQ" # Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe

df = pd.read_csv(io.StringIO(download.decode('utf-8')))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 1 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   404: Not Found  0 non-null      object
dtypes: object(1)
memory usage: 132.0+ bytes


In [28]:
df.track_genre.nunique()
df.info()

AttributeError: 'DataFrame' object has no attribute 'track_genre'

## Sem atributos categóricos

### Preprocessamento  
Antes de tudo, é fundamental remover os atributos categóricos não-numéricos. No caso do atributo "key", ele será mantido sendo tratado com método one-hot encoding, tendo a coluna referente ao valor desconhecido descartada. Por último, os atributos popularity, duration_ms, loudness, tempo e time_signature serão padronizados com min_max_scaler para manter os valores entre 0 e 1.

In [19]:
scaler = MinMaxScaler()

# Fazendo uma cópia de amostra do conjunto de dados
df_aj, _, = train_test_split(df, train_size=0.5, stratify=df.track_genre)

# Removendo colunas categóricas nominais e de palavras de baixo calão
df_aj = df_aj.drop(columns=['track_id','energy','speechiness','acousticness','instrumentalness','liveness',
                                      'duration_ms','explicit' ,'artists', 'popularity','loudness','time_signature',
                                      'album_name', 'tempo','track_name','key', 'mode','danceability'])
df_aj = df_aj

df_aj.dropna(inplace=True, axis=0, how='any')

# Criando dicionários para converter labels em números e vice-versa
dict_label_num = {}
dict_num_label = {}
for index in range(len(df_aj.track_genre.unique())):
  dict_label_num[df_aj.track_genre.unique()[index]] = index
  dict_num_label[index] = df_aj.track_genre.unique()[index]

# Renomeando valores de track_genre
df_aj.track_genre = df_aj.track_genre.map(dict_label_num)

In [24]:
df_aj.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57000 entries, 24001 to 31540
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   57000 non-null  int64  
 1   valence      57000 non-null  float64
 2   track_genre  57000 non-null  int64  
dtypes: float64(1), int64(2)
memory usage: 1.7 MB


In [25]:
# Divisão de dados de atributos e classe
sptf_X = df_aj.drop(columns='track_genre') #caracteristicas
sptf_Y = df_aj.track_genre #classe

# Divisão em conjuntos de treino, teste e validação
sptf_X_train, sptf_X_test, sptf_Y_train, sptf_Y_test = train_test_split(sptf_X, sptf_Y, test_size=0.40, random_state=10)
sptf_X_test, sptf_X_valid, sptf_Y_test, sptf_Y_valid = train_test_split(sptf_X_test, sptf_Y_test, test_size=0.50, random_state=10)

sptf_X_train = sptf_X_train.values
sptf_X_test = sptf_X_test.values
sptf_X_valid = sptf_X_valid.values
sptf_Y_train = sptf_Y_train.values
sptf_Y_test = sptf_Y_test.values
sptf_Y_valid = sptf_Y_valid.values

### KNN

In [20]:
# we create an instance of Neighbours Classifier and fit the data.
sptf_clf = neighbors.KNeighborsClassifier()

# Construindo o espaco de busca por configuracoes do classificador
k_range = range(3, 81, 2) #k
k_scores_train = []
k_scores_train_full = []
k_scores_valid = []
vet_distancias = ['euclidean', 'manhattan']
best_f1 = 0
#p_range = range(1, 198) #k
# use iteration to caclulator different k in models, then return the average accuracy based on the cross validation
for k in vet_distancias:
  for j in tqdm(k_range, desc=f'Distância {k}'):
    knn = neighbors.KNeighborsClassifier(n_neighbors=j, metric=k)
    scores = cross_val_score(knn, sptf_X_train, sptf_Y_train, cv=5, scoring='f1_weighted')
    k_scores_train.append(scores.mean())
    if scores.mean() > best_f1:
      sptf_clf = neighbors.KNeighborsClassifier(n_neighbors=j, metric=k)
      best_f1 = scores.mean()
    knn.fit(sptf_X_train, sptf_Y_train)
    k_scores_train_full.append(f1_score(sptf_Y_train, knn.predict(sptf_X_train), average='weighted'))
    k_scores_valid.append(f1_score(sptf_Y_valid, knn.predict(sptf_X_valid), average='weighted'))

#treinando o classificador
sptf_clf = sptf_clf.fit(sptf_X_train, sptf_Y_train)

# plot to see clearly
plt.plot(list(range(0,len(k_scores_train))), k_scores_train)
plt.plot(list(range(0,len(k_scores_train_full))), k_scores_train_full)
plt.plot(list(range(0,len(k_scores_valid))), k_scores_valid)
plt.legend(('Score medio Treino CV', 'Conj. Treino', 'Conj. Validacao'),
           loc='upper center', shadow=True)
plt.xlabel('Values of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()

print("F1 de treinamento clf: %0.3f" %  f1_score(sptf_Y_train, sptf_clf.predict(sptf_X_train), average='weighted'))
print("F1 de validação clf: %0.3f" %  f1_score(sptf_Y_valid, sptf_clf.predict(sptf_X_valid), average='weighted'))
print("F1 de teste clf: %0.3f" %  f1_score(sptf_Y_test, sptf_clf.predict(sptf_X_test), average='weighted'))

Distância euclidean:   0%|          | 0/39 [00:03<?, ?it/s]


KeyboardInterrupt: 

In [None]:
sptf_clf

### LVQ

In [25]:
# LVQ for the Ionosphere Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
	dataset = list()
	with open(filename, 'r') as file:
		csv_reader = reader(file)
		for row in csv_reader:
			if not row:
				continue
			dataset.append(row)
	return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
	class_values = [row[column] for row in dataset]
	unique = set(class_values)
	lookup = dict()
	for i, value in enumerate(unique):
		lookup[value] = i
	for row in dataset:
		row[column] = lookup[row[column]]
	return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = int(len(dataset) / n_folds)
	for i in range(n_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index))
		dataset_split.append(fold)
	return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
	folds = cross_validation_split(dataset, n_folds)
	scores = list()
	for fold in folds:
		train_set = list(folds)
		train_set.remove(fold)
		train_set = sum(train_set, [])
		test_set = list()
		for row in fold:
			row_copy = list(row)
			test_set.append(row_copy)
			row_copy[-1] = None
		predicted = algorithm(train_set, test_set, *args)
		actual = [row[-1] for row in fold]
		accuracy = accuracy_metric(actual, predicted)
		scores.append(accuracy)
	return scores

# calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
	distance = 0.0
	for i in range(len(row1)-1):
		distance += (row1[i] - row2[i])**2
	return sqrt(distance)

# Locate the best matching unit
def get_best_matching_unit(codebooks, test_row):
	distances = list()
	for codebook in codebooks:
		dist = euclidean_distance(codebook, test_row)
		distances.append((codebook, dist))
	distances.sort(key=lambda tup: tup[1])
	return distances[0][0]

# Make a prediction with codebook vectors
def predict(codebooks, test_row):
	bmu = get_best_matching_unit(codebooks, test_row)
	return bmu[-1]

# Create a random codebook vector
def random_codebook(train):
	n_records = len(train)
	n_features = len(train[0])
	codebook = [train[randrange(n_records)][i] for i in range(n_features)]
	return codebook

# Train a set of codebook vectors
def train_codebooks(train, n_codebooks, lrate, epochs):
	codebooks = [random_codebook(train) for i in range(n_codebooks)]
	for epoch in range(epochs):
		rate = lrate * (1.0-(epoch/float(epochs)))
		for row in train:
			bmu = get_best_matching_unit(codebooks, row)
			for i in range(len(row)-1):
				error = row[i] - bmu[i]
				if bmu[-1] == row[-1]:
					bmu[i] += rate * error
				else:
					bmu[i] -= rate * error
	return codebooks

# LVQ Algorithm
def learning_vector_quantization(train, test, n_codebooks, lrate, epochs):
	codebooks = train_codebooks(train, n_codebooks, lrate, epochs)
	predictions = list()
	for row in test:
		output = predict(codebooks, row)
		predictions.append(output)
	return(predictions)


In [78]:
import numpy as np

# Adjusting evaluation algorithm with cross validation split to consider F1-score
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
	folds = cross_validation_split(dataset, n_folds)
	scores = list()
	for fold in range(len(folds)):
		train_set = [folds[index] for index in range(len(folds)) if (not (index != fold))]
		train_set = sum(train_set, [])
		test_set = list()
		for row in folds[fold]:
			row_copy = list(row)
			test_set.append(row_copy)
			row_copy[-1] = None
		predicted = algorithm(train_set, test_set, *args)
		actual = [row[-1] for row in folds[fold]]
		f1score = f1_score(actual, predicted, average='weighted')
		scores.append(f1score)
	return scores

# Test LVQ on Spotify dataset
random.seed(1)
# load and prepare data
dataset = df_aj.copy().values
# evaluate algorithm
n_folds = 5
n_epochs = 20
learn_rate = [0.01,0.2, 0.5]
n_codebooks = [40,50,60]
best_mean_f1 = [0]
best_scores = [0] * n_folds

for rate in learn_rate:
	for prototypes in tqdm(n_codebooks, desc=f'Taxa {rate}'):
		scores = evaluate_algorithm(dataset, learning_vector_quantization, n_folds, prototypes, rate, n_epochs)
		if (sum(scores)/float(len(scores))) > best_mean_f1:
			best_mean_f1 = (sum(scores)/float(len(scores)))
			best_scores = scores
	
	print(f'Rate: {rate}')
	print('Melhores pontuações: %s' % best_scores)
	print('Melhor F1-score médio: %.3f%%' % best_mean_f1,'\n\n')

print('Melhores pontuações: %s' % best_scores)
print('Melhor F1-score médio: %.6f%%' % best_mean_f1)

Taxa 0.01:   0%|          | 0/3 [00:00<?, ?it/s]

Taxa 0.01: 100%|██████████| 3/3 [04:53<00:00, 97.73s/it] 


Rate: 0.01
Melhores pontuações: [0.00016811912440815201, 0.00016811912440815201, 0.00018443153931745087, 0.00010025717547068612, 0.00014064388528733107]
Melhor F1-score médio: 0.000% 




  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[

Rate: 0.2
Melhores pontuações: [0.0001945779065922428, 0.00012367733956300674, 0.00016180370948234708, 0.00022294909219929435, 0.00016180370948234708]
Melhor F1-score médio: 0.000% 




  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[i] -= rate * error
  bmu[i] += rate * error
  distance += (row1[i] - row2[i])**2
  bmu[

Rate: 0.5
Melhores pontuações: [0.0001945779065922428, 0.00012367733956300674, 0.00016180370948234708, 0.00022294909219929435, 0.00016180370948234708]
Melhor F1-score médio: 0.000% 


Melhores pontuações: [0.0001945779065922428, 0.00012367733956300674, 0.00016180370948234708, 0.00022294909219929435, 0.00016180370948234708]
Melhor F1-score médio: 0.000173%



