<a href="https://colab.research.google.com/github/amadords/Projetos-Publicos/blob/master/Classificador_de_M%C3%BAsica_Spotify.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Music Classifier for Spotify**
---

[![LinkedIn](https://img.shields.io/badge/LinkedIn-DanielSousaAmador-cyan.svg)](https://www.linkedin.com/in/daniel-sousa-amador)
[![GitHub](https://img.shields.io/badge/GitHub-amadords-darkblue.svg)](https://github.com/amadords)
[![Medium](https://img.shields.io/badge/Medium-DanielSousaAmador-white.svg)](https://daniel-s-amador.medium.com/)


**Streaming** services have been part of people's lives for some years now, thanks to their **ease of use** and **accessibility**.

Users get convenience, and the companies that own the technology get recurring revenue, since subscribers pay every month to use the service.


But do you know what streaming is?


Streaming services **deliver content over the internet** without requiring you to download it, sit through advertising, waste time searching for it, or risk infecting your smartphone or computer with a virus.

**Examples** of streaming services are **Netflix** for films and **Spotify** for music. You probably know both.

![spotify](https://image.freepik.com/fotos-gratis/jovem-casal-ouvindo-musica-com-app-spotify_23-2147987797.jpg)

## About the project

The goal here is to use **data made available by Spotify itself** to build a **classifier** that **identifies songs a user is likely to enjoy**.


The original documentation for the **features** is [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/).

## Checklist

1. Problem definition
2. Exploratory data analysis
3. Preprocessing
4. Pipeline and Machine Learning
5. Model tuning

# 1. Problem Definition

Not only streaming services, but any company that offers a product or service wants its recommendations to users to be more accurate. Take Amazon, for example: when you have just created an account, it knows nothing about what interests you, but as soon as you start buying items, saving them, or adding them to your cart, **recommendations** begin to appear.

Netflix and Spotify work the same way: when you create an account, the only suggestions are whatever is trending. But you don't necessarily like what everyone else likes, do you? The **algorithms cannot guess**, so **once you start listening to songs, others are suggested to you**!

Several measurements are used to make this classification, such as how danceable a song is or how long it lasts. Each feature is covered in the exploratory analysis section.

**Importing the required libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

**Reading the dataset**

In [None]:
dataset = pd.read_csv('https://raw.githubusercontent.com/amadords/data/main/data.csv', sep=',')

**Viewing the first rows**

In [None]:
dataset.head()

Unnamed: 0,id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
0,0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,1,Mask Off,Future
1,1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,1,Redbone,Childish Gambino
2,2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,1,Xanny Family,Future
3,3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.23,1,Master Of None,Beach House
4,4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4.0,0.904,1,Parallel Lines,Junior Boys


# 2. Exploratory data analysis

* **'id'**
    * The track's identification number.
* **'acousticness'**
    * Indicates how acoustic the track is, from 0.0 (not acoustic) to 1.0 (maximally acoustic).
* **'danceability'**
    * Describes how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* **'duration_ms'**
    * The duration of the track in milliseconds.
* **'energy'**
    * A measure from 0.0 (least energetic) to 1.0 (most energetic) representing the intensity and activity of the track. Energetic tracks typically feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features that contribute to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
* **'instrumentalness'**
    * Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken-word tracks are clearly "vocal". The closer the value is to 1.0, the more likely the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
* **'key'**
    * The key of the track, where:
        * 0 = C
        * 1 = Db
        * 2 = D
        * 3 = Eb
        * 4 = E
        * 5 = F
        * 6 = Gb
        * 7 = G
        * 8 = Ab
        * 9 = A
        * 10 = Bb
        * 11 = B
* **'liveness'**
    * Detects the presence of an audience in the recording. Higher liveness values represent a higher probability that the track was performed live. A value above 0.8 indicates a strong likelihood that the track is live.
* **'loudness'**
    * The overall loudness of a track in decibels (dB). Loudness values are computed over the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Typical values range between -60 and 0 dB.
* **'mode'**
    * The modality of the track: 1 for major and 0 for minor.
* **'speechiness'**
    * Detects the presence of spoken words in a track. The more exclusively speech-like the recording (for example, talk show, audiobook, poetry), the closer the value is to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including cases such as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
* **'tempo'**
    * The estimated tempo of the track in beats per minute (BPM), not to be confused with its duration. In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
* **'time_signature'**
    * An estimated overall time signature of the track. The time signature (meter) is a notational convention that specifies how many beats are in each bar (or measure). It is usually 4, since most songs are in 4/4.
* **'valence'**
    * A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (for example, happy, cheerful, euphoric), while tracks with low valence sound more negative (for example, sad, depressed, angry).
* **'target'**
    * Indicates whether users liked the song:
        * 0 = Disliked
        * 1 = Liked
* **'song_title'**
    * The track's title.
* **'artist'**
    * The name of the artist or band.
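
The encodings described above can be turned back into human-readable values. The sketch below is only an illustration on a tiny hand-made frame: `demo`, `KEY_NAMES`, `key_name`, and `duration_min` are names introduced here and are not part of the original notebook.

```python
import pandas as pd

# Pitch-class names in the order listed above (0 = C ... 11 = B).
KEY_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

# Hypothetical mini-frame with the same column names as the dataset.
demo = pd.DataFrame({"key": [2, 1, 5], "duration_ms": [204600, 326933, 199413]})

# Map each integer key to its name and convert milliseconds to minutes.
demo["key_name"] = demo["key"].map(dict(enumerate(KEY_NAMES)))
demo["duration_min"] = demo["duration_ms"] / 60000

print(demo)
```

Derived columns like these are meant for inspection only; the models later in the notebook use the original numeric columns.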

**Q1. Are there null values in our dataset?**

No!

In [None]:
dataset.isnull().sum()

id                  0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
target              0
song_title          0
artist              0
dtype: int64

**Q2. Does the target variable contain any values other than 0 and 1?**

No!

Recall that 0 marks songs users did not like and 1 marks songs they liked.

In [None]:
dataset.target.unique()

array([1, 0])

**Q3. What do the summary statistics of the data look like?**

In [None]:
dataset.describe()

Unnamed: 0,id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target
count,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0
mean,1008.0,0.18759,0.618422,246306.2,0.681577,0.133286,5.342588,0.190844,-7.085624,0.612295,0.092664,121.603272,3.96827,0.496815,0.505702
std,582.402066,0.259989,0.161029,81981.81,0.210273,0.273162,3.64824,0.155453,3.761684,0.487347,0.089931,26.685604,0.255853,0.247195,0.500091
min,0.0,3e-06,0.122,16042.0,0.0148,0.0,0.0,0.0188,-33.097,0.0,0.0231,47.859,1.0,0.0348,0.0
25%,504.0,0.00963,0.514,200015.0,0.563,0.0,2.0,0.0923,-8.394,0.0,0.0375,100.189,4.0,0.295,0.0
50%,1008.0,0.0633,0.631,229261.0,0.715,7.6e-05,6.0,0.127,-6.248,1.0,0.0549,121.427,4.0,0.492,1.0
75%,1512.0,0.265,0.738,270333.0,0.846,0.054,9.0,0.247,-4.746,1.0,0.108,137.849,4.0,0.691,1.0
max,2016.0,0.995,0.984,1004627.0,0.998,0.976,11.0,0.969,-0.307,1.0,0.816,219.331,5.0,0.992,1.0


**Next, we will look at some plots of the data**

These are the styles available in the **Matplotlib** library.

In [None]:
plt.style.available

['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

**The plots will help us look for patterns in the data**

**Q4. Is there any relationship between the `'acousticness'` and `'danceability'` variables?**

Apparently not.

In [None]:
%matplotlib notebook
style.use("seaborn-colorblind")
dataset.plot(x='acousticness', y='danceability', c='target', kind='scatter', colormap='Accent_r');

[Scatter plot: 'acousticness' (x) vs. 'danceability' (y), points colored by 'target']

**Q5. Is there any relationship between the `'tempo'` and `'valence'` variables?**

Apparently not here either.

In [None]:
%matplotlib notebook
style.use("seaborn-colorblind")
dataset.plot(x='tempo', y='valence', c='target', kind='scatter' , colormap='Accent_r');

[Scatter plot: 'tempo' (x) vs. 'valence' (y), points colored by 'target']

**Q6. Is there any relationship between the `'tempo'` and `'speechiness'` variables?**

Once again, apparently not.

In [None]:
%matplotlib notebook
style.use("seaborn-colorblind")
dataset.plot(x='tempo', y='speechiness', c='target', kind='scatter' , colormap='Accent');

[Scatter plot: 'tempo' (x) vs. 'speechiness' (y), points colored by 'target']

**Q7. Is there any relationship between the `'danceability'` and `'energy'` variables?**

Once more, no.

In [None]:
%matplotlib notebook
style.use('classic')
dataset.plot(x='danceability', y='energy', c='target', kind='scatter' , colormap='Reds');

[Scatter plot: 'danceability' (x) vs. 'energy' (y), points colored by 'target']

**Try it yourself!**

We found no obvious relationship between the features in these four plots, but our check was far from exhaustive. You can try the other combinations to look for a correlation that matters.
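
One way to check every pair at once, rather than plotting them one by one, is a correlation matrix. The sketch below is a minimal illustration and not part of the original analysis: it runs on synthetic data (`demo`, with a `loudness` column deliberately built to depend on `energy`), and the same `corr()` call could be applied to the numeric columns of the real `dataset`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's `dataset` (all values are made up).
rng = np.random.default_rng(0)
n = 200
energy = rng.uniform(0, 1, n)
demo = pd.DataFrame({
    "energy": energy,
    # loudness is constructed to rise with energy, so this pair should correlate
    "loudness": -20 + 15 * energy + rng.normal(0, 1, n),
    "valence": rng.uniform(0, 1, n),
})

# Pearson correlation between every pair of numeric columns.
corr = demo.corr()

# Keep each pair once (upper triangle), then rank by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a non-linear pattern; the scatter plots remain useful for that.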

# 3. Preprocessing

Here we will use two approaches, which you can read more about [here](https://bit.ly/2Siq0YU).

First we will use Label Encoder followed by One Hot Encoder, and then Get Dummies.

**Splitting the data**

In [None]:
classes = dataset['target']
dataset.drop('target', axis=1, inplace=True) # dataset keeps every column except the target (now in classes)

**Creating a function to remove columns whenever needed**

In [None]:
def remove_features(lista_features):
    # Drops each listed column from the global `dataset`, in place
    for i in lista_features:
        dataset.drop(i, axis=1, inplace=True)
    return 'OK'

**Removing non-representative features**

The track id and the track title are essentially unique to each row, so they carry no useful signal for our model.

In [None]:
remove_features(['id','song_title'])

'OK'
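
Because `remove_features` modifies the global `dataset` in place, running the same cell twice raises a `KeyError` (the columns are already gone). A possible alternative, shown here only as a sketch, is a helper that returns a new frame and leaves its input untouched. `without_features` and `demo` are hypothetical names introduced for this illustration.

```python
import pandas as pd

# Tiny made-up frame standing in for the notebook's `dataset`.
demo = pd.DataFrame({
    "id": [0, 1],
    "song_title": ["Mask Off", "Redbone"],
    "tempo": [150.062, 160.083],
})


def without_features(frame, feature_names):
    """Return a copy of `frame` without the given columns (hypothetical helper)."""
    return frame.drop(columns=feature_names)


trimmed = without_features(demo, ["id", "song_title"])

print(list(trimmed.columns))  # ['tempo']
print(list(demo.columns))     # ['id', 'song_title', 'tempo'] (original is unchanged)
```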

**Viewing the new dataset**

With the features already removed.

In [None]:
dataset.head(3)

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,Future
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,Childish Gambino
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,Future


**Let's check some information about the columns**

Just to confirm everything is in order.

Apparently it is.

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2017 entries, 0 to 2016
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      2017 non-null   float64
 1   danceability      2017 non-null   float64
 2   duration_ms       2017 non-null   int64  
 3   energy            2017 non-null   float64
 4   instrumentalness  2017 non-null   float64
 5   key               2017 non-null   int64  
 6   liveness          2017 non-null   float64
 7   loudness          2017 non-null   float64
 8   mode              2017 non-null   int64  
 9   speechiness       2017 non-null   float64
 10  tempo             2017 non-null   float64
 11  time_signature    2017 non-null   float64
 12  valence           2017 non-null   float64
 13  artist            2017 non-null   object 
dtypes: float64(10), int64(3), object(1)
memory usage: 220.7+ KB


## Label Encoder

**Creating the Label Encoder object**

In [None]:
# import the class
from sklearn.preprocessing import LabelEncoder
# instantiate the encoder
enc = LabelEncoder()
# inteiros receives the 'artist' column encoded as integers
inteiros = enc.fit_transform(dataset['artist'])
# types of the original column and of the encoded result
type(dataset['artist']), type(inteiros)

(pandas.core.series.Series, numpy.ndarray)
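
To make the encoding concrete, here is a small sketch on four hand-picked artist names (the list `artists` and the encoder `enc_demo` are illustrative and separate from the notebook's `enc`). `LabelEncoder` sorts the unique labels and uses each label's position in that sorted list as its code.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative input: 'Future' appears twice and must get the same code both times.
artists = ["Future", "Childish Gambino", "Future", "Beach House"]

enc_demo = LabelEncoder()
codes = enc_demo.fit_transform(artists)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(enc_demo.classes_.tolist())  # ['Beach House', 'Childish Gambino', 'Future']
print(codes.tolist())              # [2, 1, 2, 0]

# inverse_transform recovers the original labels from the codes.
print(enc_demo.inverse_transform(codes).tolist())
```

Note that the integer codes imply an ordering (0 < 1 < 2) that has no musical meaning, which is one reason the notebook goes on to one-hot encoding.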

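**Viewing the unique values**

Of the *'artist'* column, already encoded.

NumPy arrays have no `.unique()` method (that is a pandas method), so here we wrap the array in `set()`; `np.unique(inteiros)` would also work.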

In [None]:
set(inteiros)

{0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,
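
As an alternative to `set()`, NumPy's `np.unique` returns the distinct values as a sorted array and can also count occurrences. A minimal sketch on a made-up array (`codes` is illustrative, not the notebook's `inteiros`):

```python
import numpy as np

# Made-up encoded values with repeats.
codes = np.array([449, 222, 449, 61, 222])

# Sorted distinct values plus how many times each one appears.
unique_codes, counts = np.unique(codes, return_counts=True)

print(unique_codes.tolist())  # [61, 222, 449]
print(counts.tolist())        # [1, 2, 2]
print(len(unique_codes))      # number of distinct values
```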


**Creating a new column called 'artist_inteiros'**

It adds the encoded artist values to the DataFrame.

In [None]:
dataset['artist_inteiros'] = inteiros

**Making a copy of the dataset that keeps the categorical 'artist' feature**

So that we can use it again later in its original, categorical form.

In [None]:
dataset_com_artist=dataset.copy()

**Removing 'artist'**

In [None]:
remove_features(['artist'])

'OK'

**Viewing the dataset that still has 'artist'**

In [None]:
dataset_com_artist.head(3)

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist,artist_inteiros
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,Future,449
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,Childish Gambino,222
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,Future,449


**Viewing the dataset without 'artist'**

In [None]:
dataset.head(3) # the artist feature is gone

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist_inteiros
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,449
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,222
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,449


## One Hot Encoding

In [None]:
# import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# instantiate a OneHotEncoder object
ohe = OneHotEncoder()

**The `.values` attribute returns all the values as a NumPy array**

In [None]:
dataset.values

array([[1.02000e-02, 8.33000e-01, 2.04600e+05, ..., 4.00000e+00,
        2.86000e-01, 4.49000e+02],
       [1.99000e-01, 7.43000e-01, 3.26933e+05, ..., 4.00000e+00,
        5.88000e-01, 2.22000e+02],
       [3.44000e-02, 8.38000e-01, 1.85707e+05, ..., 4.00000e+00,
        1.73000e-01, 4.49000e+02],
       ...,
       [8.57000e-03, 6.37000e-01, 2.07200e+05, ..., 4.00000e+00,
        4.70000e-01, 9.47000e+02],
       [1.64000e-03, 5.57000e-01, 1.85600e+05, ..., 4.00000e+00,
        6.23000e-01, 1.24200e+03],
       [2.81000e-03, 4.46000e-01, 2.04520e+05, ..., 4.00000e+00,
        4.02000e-01, 1.32000e+02]])

**Turning the dataset into a NumPy array**

If you read the [notebook](https://github.com/danielamador12/public-projects/blob/master/pr%C3%A1tica_pre-processamento_e_metricas.ipynb), you saw that One Hot Encoder requires a bit more work.

In [None]:
dataset_array = dataset.values
dataset_array.shape

(2017, 14)

**Getting the number of rows and saving it in a variable**

In [None]:
num_rows = dataset_array.shape[0] # shape[0] is the number of rows
num_rows

2017

**Viewing the integer column**

This is our column of artists already encoded with *Label Encoder*.

The column we want is the 14th; its index is 13 because indexing runs from 0 to 13.

In [None]:
dataset_array[:, 13]

array([ 449.,  222.,  449., ...,  947., 1242.,  132.])

**Reshaping the array into a single column**

`OneHotEncoder` expects 2-D input, so the 1-D array is reshaped to `(n_samples, 1)`.

`len(inteiros)` is the number of rows (2017).

`1` is the number of columns.

In [None]:
inteiros = inteiros.reshape(len(inteiros), 1)
type(inteiros), inteiros.shape

(numpy.ndarray, (2017, 1))

**Creating the new features from the presence matrix**

In [None]:
# create the new features
novas_features = ohe.fit_transform(inteiros)
# display the new features
novas_features
# 1343 columns (one per artist) across 2017 rows

<2017x1343 sparse matrix of type '<class 'numpy.float64'>'
	with 2017 stored elements in Compressed Sparse Row format>
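
The reshape and one-hot steps above can be seen on a tiny scale in the sketch below. The values in `codes` and the names `ohe_demo`, `presence`, and `dense` are illustrative only. Each row of the result contains exactly one 1, in the column that corresponds to its code, which is why the real matrix stores only 2017 non-zero elements.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up integer codes, reshaped to a column: one sample per row, one feature.
codes = np.array([2, 1, 2, 0]).reshape(-1, 1)

ohe_demo = OneHotEncoder()
presence = ohe_demo.fit_transform(codes)  # sparse matrix by default

# Convert to a dense array only for display; large matrices are best kept sparse.
dense = presence.toarray()

print(presence.shape)  # (4, 3): 4 samples, 3 distinct codes
print(dense)
```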

**Let's check the type of the result**

Note that it is now a *sparse matrix*.

In [None]:
type(novas_features)

scipy.sparse.csr.csr_matrix

**Concatenating and viewing the features**

In [None]:
# Concatenate the new features, converted to a dense array, onto the existing array
dataset_array = np.concatenate([dataset_array, novas_features.toarray()], axis=1)
# View the number of rows and columns
dataset_array.shape

(2017, 1357)

**Converting to a DataFrame and viewing the columns**

In [None]:
# convert
dataf = pd.DataFrame(dataset_array)
# view
dataf.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,1317,1318,1319,1320,1321,1322,1323,1324,1325,1326,1327,1328,1329,1330,1331,1332,1333,1334,1335,1336,1337,1338,1339,1340,1341,1342,1343,1344,1345,1346,1347,1348,1349,1350,1351,1352,1353,1354,1355,1356
0,0.0102,0.833,204600.0,0.434,0.0219,2.0,0.165,-8.795,1.0,0.431,150.062,4.0,0.286,449.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.199,0.743,326933.0,0.359,0.00611,1.0,0.137,-10.401,1.0,0.0794,160.083,4.0,0.588,222.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0344,0.838,185707.0,0.412,0.000234,2.0,0.159,-7.148,1.0,0.289,75.044,4.0,0.173,449.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Remember what it looked like before the transformation?**

In [None]:
dataset_com_artist.head(3)

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist,artist_inteiros
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,Future,449
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,Childish Gambino,222
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,Future,449


**Type of the table above**

In [None]:
type(dataset_com_artist)

pandas.core.frame.DataFrame

**Dropping 'artist_inteiros'**

To keep *'artist'* without Label Encoder or One Hot Encoder applied.

In [None]:
dataset_com_artist.drop('artist_inteiros',axis=1,inplace=True)

**Viewing the DataFrame again**

In [None]:
dataset_com_artist.head(3)

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,Future
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,Childish Gambino
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,Future


**Let's check the information**

Just to be sure it is the same as before, and it is.

In [None]:
dataset_com_artist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2017 entries, 0 to 2016
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      2017 non-null   float64
 1   danceability      2017 non-null   float64
 2   duration_ms       2017 non-null   int64  
 3   energy            2017 non-null   float64
 4   instrumentalness  2017 non-null   float64
 5   key               2017 non-null   int64  
 6   liveness          2017 non-null   float64
 7   loudness          2017 non-null   float64
 8   mode              2017 non-null   int64  
 9   speechiness       2017 non-null   float64
 10  tempo             2017 non-null   float64
 11  time_signature    2017 non-null   float64
 12  valence           2017 non-null   float64
 13  artist            2017 non-null   object 
dtypes: float64(10), int64(3), object(1)
memory usage: 220.7+ KB


## Get Dummies

As I explained [here](https://bit.ly/2Siq0YU), *Get Dummies* does the work of both Label Encoder and One Hot Encoder in a single step.

**Applying get_dummies to the data**

In [None]:
dataset_com_artist=pd.get_dummies(dataset_com_artist, columns=['artist'], prefix=['artist'])
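
For a sense of what this call produces, here is a minimal sketch on a three-row frame (`demo` and `encoded` are illustrative names). The categorical column is replaced by one indicator column per distinct value, named with the given prefix, while the other columns pass through unchanged.

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column.
demo = pd.DataFrame({
    "tempo": [150.062, 160.083, 75.044],
    "artist": ["Future", "Childish Gambino", "Future"],
})

# Same call shape as in the notebook, applied to the small frame.
encoded = pd.get_dummies(demo, columns=["artist"], prefix=["artist"])

print(list(encoded.columns))
# ['tempo', 'artist_Childish Gambino', 'artist_Future']
print(encoded)
```

The indicator columns may be `uint8` or `bool` depending on the pandas version; both behave as 0/1 values when passed to scikit-learn.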

**Viewing the DataFrame after the transformation**

In [None]:
dataset_com_artist.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,artist_!!!,artist_*NSYNC,artist_10cm,artist_2 Chainz,artist_2 LIVE CREW,artist_20th Century Steel Band,artist_21 Savage,artist_2milly,artist_3LW,artist_4 Non Blondes,artist_5 Seconds of Summer,artist_5kinAndBone5,artist_A Day To Remember,artist_A Guy Called Gerald,artist_A Tribe Called Quest,artist_A Trust Unclean,artist_A Wake in Providence,artist_A$AP Ferg,artist_A$AP Rocky,artist_A-1,artist_A-Trak,artist_AFI,artist_AJ Tracey,artist_ASTR,artist_Aaron Shust,artist_Above & Beyond,"artist_Above, Below",...,artist_Worlds Famous Supreme Team,artist_Wyclef Jean,artist_Wynton Marsalis,artist_X-Press 2,artist_XIA,artist_Xantos,artist_Xavier Davis,artist_Xscape,artist_YACHT,artist_Yacht Club,artist_Yeah Yeah Yeahs,artist_Yeasayer,artist_Yelena Eckemoff,artist_Yelle,artist_Yellow Claw,artist_Young & Sick,artist_Young M.A.,artist_Young Thug,artist_Young the Giant,artist_ZAYN,artist_ZHU,artist_ZZT,artist_Zac Brown Band,artist_Zach Williams,artist_Zapp,artist_Zara Larsson,artist_Zdar,artist_Zedd,artist_Zeds Dead,artist_Zion & Lennox,artist_alt-J,artist_deadmau5,artist_for KING & COUNTRY,artist_one sonic society,artist_tUnE-yArDs,artist_tobyMac,artist_권나무 Kwon Tree,artist_도시총각 Dosichonggak,artist_카우칩스 The CowChips,artist_플랫핏 Flat Feet
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4.0,0.904,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Viewing the generated features**

In [None]:
dataset_com_artist.columns[13:]

Index(['artist_!!!', 'artist_*NSYNC', 'artist_10cm', 'artist_2 Chainz',
       'artist_2 LIVE CREW', 'artist_20th Century Steel Band',
       'artist_21 Savage', 'artist_2milly', 'artist_3LW',
       'artist_4 Non Blondes',
       ...
       'artist_alt-J', 'artist_deadmau5', 'artist_for KING & COUNTRY',
       'artist_one sonic society', 'artist_tUnE-yArDs', 'artist_tobyMac',
       'artist_권나무 Kwon Tree', 'artist_도시총각 Dosichonggak',
       'artist_카우칩스 The CowChips', 'artist_플랫핏 Flat Feet'],
      dtype='object', length=1343)

**How many columns do we have?**

In [None]:
len(dataset_com_artist.columns)

1356

**Viewing the column types**

In [None]:
dataset_com_artist.dtypes

acousticness                float64
danceability                float64
duration_ms                   int64
energy                      float64
instrumentalness            float64
                             ...   
artist_tobyMac                uint8
artist_권나무 Kwon Tree          uint8
artist_도시총각 Dosichonggak      uint8
artist_카우칩스 The CowChips      uint8
artist_플랫핏 Flat Feet          uint8
Length: 1356, dtype: object

**Checking for missing values**

Once again, none.

`.sum().sum()` totals the missing values across the whole dataset. If there were any, we would need to find out which rows contain them.

In [None]:
dataset_com_artist.isnull().sum().sum()

0
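
If the total had not been zero, the affected rows could be located as in the sketch below. It uses a made-up frame (`demo`) with two deliberately missing values; the same expressions would apply to `dataset_com_artist`.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values planted on purpose.
demo = pd.DataFrame({
    "tempo": [150.062, np.nan, 75.044],
    "valence": [0.286, 0.588, np.nan],
})

# Total number of missing cells in the whole frame.
total_missing = demo.isnull().sum().sum()

# Rows that contain at least one missing value.
rows_with_missing = demo[demo.isnull().any(axis=1)]

print(total_missing)                     # 2
print(rows_with_missing.index.tolist())  # [1, 2]
```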

# 4. Pipeline and Machine Learning

As we also discussed [here](https://bit.ly/2Siq0YU), *pipelines* **automate the steps** of a *Machine Learning* workflow, so in a real project they are crucial for testing different configurations.

**Importing the Pipeline and preprocessing classes**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

**Training the SVM algorithm**

All the data is used here, with no train/test split.

In [None]:
clf = svm.SVC().fit(dataset_com_artist,classes)

**Renaming the get_dummies dataset**

In [None]:
dataset_get=dataset_com_artist

**Let's recap the datasets we have**

* **dataset** = Label Encoder only
* **dataset_array** = Label Encoder and One Hot Encoder
* **dataset_get** = Get Dummies

In [None]:
clf

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

**Creating a function that applies cross-validation and returns the score**

Cross-validation evaluates the model on several different train/validation splits, which gives a more stable performance estimate than a single split.

In [None]:
def Acuracia(clf,X,y):
    # clf is the classifier (or pipeline), X holds the features and y the class labels
    # cross_val_predict returns, for each sample, the prediction made while it was in the held-out fold (10 folds)
    resultados=cross_val_predict(clf,X,y,cv=10)
    # compare the true labels (y) with the cross-validated predictions
    return metrics.accuracy_score(y,resultados)

**Running cross-validation with the 'Acuracia' function on the get_dummies data**

In [None]:
Acuracia(clf,dataset_get,classes)

0.5577590480912246
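
To show what `cross_val_predict` does inside `Acuracia`, the sketch below runs it on synthetic data generated with `make_classification` (an assumption made for illustration; the notebook uses the Spotify data). It also computes per-fold accuracies with `cross_val_score`, which is a different but closely related way to summarize the same procedure.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = SVC()

# One out-of-fold prediction per sample: each sample is predicted by a model
# that was trained on the other nine folds.
preds = cross_val_predict(model, X_demo, y_demo, cv=10)
pooled_accuracy = metrics.accuracy_score(y_demo, preds)

# Accuracy of each fold separately; the spread shows how much a single split can vary.
fold_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")

print(f"pooled accuracy: {pooled_accuracy:.3f}")
print(f"fold mean: {fold_scores.mean():.3f}, fold std: {fold_scores.std():.3f}")
```

The pooled accuracy and the mean of the fold scores are usually close but need not be identical, since folds can differ slightly in size.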

**Creating the first pipeline**

In [None]:
pip_1 = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', svm.SVC())
])

**Printing the pipeline steps**

In [None]:
pip_1.steps

[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('clf',
  SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False))]

**Calling the Acuracia function**

Passing *pip_1* along with the song data and the classes.

In [None]:
Acuracia(pip_1,dataset_get,classes)

0.58601883986118

**Creating several pipelines**

In [None]:
pip_2 = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('clf', svm.SVC())
])

pip_3 = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', svm.SVC(kernel='rbf'))
])

pip_4 = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', svm.SVC(kernel='poly'))
])

pip_5 = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', svm.SVC(kernel='linear'))
])

**Acuracia function using the pip_2 pipeline**

In [None]:
Acuracia(pip_2,dataset_get,classes)

0.7223599405057015

**Test with Label Encoder and StandardScaler**

In [None]:
Acuracia(pip_1,dataset,classes)

0.7149231531978185

**Test with Label Encoder and MinMaxScaler**

In [None]:
# Test with only LabelEncoder applied to the 'artist' column, using the 'pip_2' pipeline
Acuracia(pip_2,dataset,classes)

0.6757560733763014

**Testing kernel performance**

*Kernels* are mathematical functions that implicitly map the data into a space where the classes are easier to separate, and each kernel does this mapping in a different way. The *scikit-learn* [documentation](https://scikit-learn.org/stable/modules/svm.html#svm-kernels) includes self-explanatory illustrations.


Let's run the tests!

In [None]:
# Testing the RBF kernel
Acuracia(pip_3,dataset,classes)

0.7149231531978185

In [None]:
# Testing the poly kernel
Acuracia(pip_4,dataset,classes)

0.6683192860684184

In [None]:
# Testing the linear kernel
Acuracia(pip_5,dataset,classes)

0.6236985622211205
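
The three kernel tests above can also be written as a single loop. The sketch below does so on a synthetic two-class dataset from `make_moons`, chosen here only because its classes are not linearly separable; the scores it prints say nothing about the Spotify data, and `kernel_scores` is a name introduced for this illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data (illustrative only).
X_demo, y_demo = make_moons(n_samples=300, noise=0.2, random_state=0)

kernel_scores = {}
for kernel in ("rbf", "poly", "linear"):
    # Same structure as pip_3, pip_4 and pip_5: scaling followed by an SVC.
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel=kernel))])
    kernel_scores[kernel] = cross_val_score(pipe, X_demo, y_demo, cv=10).mean()

# Print the kernels from best to worst mean cross-validated accuracy.
for kernel, score in sorted(kernel_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kernel}: {score:.3f}")
```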

# 5. Model tuning

Now that we have seen that the **RBF kernel** performed best, let's *tune* the model to try to improve it.

If you want to learn more about *tuning*, read [here](https://bit.ly/30mKZ1q).

**Importing the GridSearchCV utility**

In [None]:
from sklearn.model_selection import GridSearchCV

**Lists of parameter values**

In [None]:
# List of C values
lista_C = [0.001, 0.01, 0.1, 1, 10, 100]

# List of gamma values
lista_gamma = [0.001, 0.01, 0.1, 1, 10, 100]

**Creating a dictionary that maps each parameter to its list of values**

In [None]:
parametros_grid = dict(clf__C=lista_C, clf__gamma=lista_gamma)

**Inspecting the dictionary**

In [None]:
parametros_grid

{'clf__C': [0.001, 0.01, 0.1, 1, 10, 100],
 'clf__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
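The `clf__` prefix in the keys follows the scikit-learn convention for nested parameters: the step name, a double underscore, then the parameter name. The sketch below rebuilds a pipeline with the same step names as `pip_3` to show that these keys are what the pipeline itself exposes. The values passed to `set_params` are arbitrary examples.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same step names as pip_3: 'scaler' and 'clf'
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(kernel="rbf")),
])

# Every nested parameter is exposed as "<step name>__<parameter name>"
params = pipe.get_params()
print("clf__C" in params, "clf__gamma" in params)

# set_params accepts the same keys that GridSearchCV will vary
pipe.set_params(clf__C=10, clf__gamma=0.1)
svc = pipe.named_steps["clf"]
print(svc.C, svc.gamma)
```

If the step had been named differently, for example `svm`, the grid keys would need to be `svm__C` and `svm__gamma` instead.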

**Creating the Grid object**

It receives the Pipeline, the parameter dictionary, and the cross-validation settings.

In [None]:
grid = GridSearchCV(pip_3, parametros_grid, cv=10, scoring='accuracy')
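To get a sense of how much work this search involves, the sketch below counts the candidate combinations with `ParameterGrid`. It re-declares the grid under the hypothetical name `param_grid` so that the snippet stands alone.

```python
from sklearn.model_selection import ParameterGrid

# Same values as lista_C and lista_gamma, repeated so the snippet is self-contained
param_grid = {
    "clf__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "clf__gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}

n_candidates = len(ParameterGrid(param_grid))  # 6 values of C x 6 values of gamma
n_folds = 10                                   # matches cv=10 in the grid object
n_fits = n_candidates * n_folds                # cross-validation fits only

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit on the full data using the best combination found.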

**Running the grid search**

We pass the training data and the classes.

In [None]:
grid.fit(dataset, classes)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('clf',
                                        SVC(C=1.0, break_ties=False,
                                            cache_size=200, class_weight=None,
                                            coef0=0.0,
                                            decision_function_shape='ovr',
                                            degree=3, gamma='scale',
                                            kernel='rbf', max_iter=-1,
                                            probability=False,
                                            random_state=None, shrinking=True,
                                            tol=0.00

### Grid results

**Printing the scores for each combination**

In [None]:
grid.cv_results_

{'mean_fit_time': array([0.16862829, 0.16384068, 0.16585724, 0.16461494, 0.1658251 ,
        0.12558331, 0.16383872, 0.16390057, 0.16476038, 0.16446714,
        0.20363884, 0.16255207, 0.1637917 , 0.15339496, 0.14663818,
        0.16802161, 0.21254966, 0.16983471, 0.15208127, 0.13727636,
        0.13497531, 0.20882921, 0.22251258, 0.17495315, 0.14344277,
        0.13973792, 0.18883848, 0.19495778, 0.1967833 , 0.16223192,
        0.15713582, 0.22736824, 0.407865  , 0.19488795, 0.1987797 ,
        0.16485555]),
 'mean_score_time': array([0.01540256, 0.01598303, 0.01566043, 0.01570935, 0.01656446,
        0.01331666, 0.01546686, 0.0156646 , 0.01563735, 0.01616101,
        0.01633623, 0.01350739, 0.0157707 , 0.01493793, 0.01356168,
        0.01554289, 0.01620092, 0.01347089, 0.01411741, 0.01267643,
        0.01127617, 0.01561558, 0.01627514, 0.01379132, 0.01257651,
        0.01083372, 0.0103991 , 0.01555111, 0.01616237, 0.0135407 ,
        0.01159751, 0.00982633, 0.00962164, 0.015537  , 0.

**Printing the best parameters**

In [None]:
grid.best_params_

{'clf__C': 100, 'clf__gamma': 0.01}

**Viewing the best score**

In [None]:
grid.best_score_

0.720858578395153

**Viewing the keys of the grid results**

In [None]:
grid.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_clf__C', 'param_clf__gamma', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'split5_test_score', 'split6_test_score', 'split7_test_score', 'split8_test_score', 'split9_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])
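Because `cv_results_` is a dictionary of equal-length arrays, it is often easier to read as a table. The sketch below is one way to do that with pandas. It assumes pandas is installed, and it fits a reduced grid on synthetic data (the names `demo_grid`, `X_demo`, and `y_demo` are hypothetical) so that it runs on its own. With the notebook's fitted `grid`, only the last three statements would be needed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data and a reduced grid so the sketch runs quickly (assumptions)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel="rbf"))])
demo_grid = GridSearchCV(
    pipe,
    {"clf__C": [0.1, 1, 10], "clf__gamma": [0.01, 0.1]},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_grid.cv_results_)
columns = ["param_clf__C", "param_clf__gamma",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").to_string(index=False))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top-ranked combination is clearly better than the runners-up or only marginally so.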

The truth is that there are always more options to consider and new tests to run, including other algorithms. It is up to the **Data Scientist** to explore these options as fully as possible!

# Thank you!

Thank you for giving some of your time and attention here. I hope it has been useful for your growth in some way. If you have any questions or suggestions, don't hesitate to get in touch on [LinkedIn](https://www.linkedin.com/in/daniel-sousa-amador) and check out my other projects on [GitHub](https://github.com/amadords).



<center><img width="90%" src="https://raw.githubusercontent.com/danielamador12/Portfolio/master/github.png"></center>