## Projeto de Inteligencia Artificial
### Predição de resultados para partidas da NBA

### 1. Introdução
    A finalidade deste projeto é, utilizando as técnicas de machine learning ensinadas, produzir uma classificação de vitória ou derrota de times da NBA, num periodo entre 2014 e 2018, a partir de estatísticas vindas de uma série de jogos das equipes, armazenadas em uma base de dados.

#### Cabeçalho dos includes 

In [24]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 

dataframe = pd.read_csv('nba.games.stats.csv')

### 2. Análise da base de dados
    Primeiramente olhamos o tamanho de nossa base de dados para termos noção de volume de dados que será analisado, em seguida devemos observar os tipos de dados que estão armazenados nela.

In [25]:
len(dataframe)

9840

In [26]:
dataframe.dtypes

Unnamed: 0                    int64
Team                         object
Game                          int64
Date                         object
Home                         object
Opponent                     object
WINorLOSS                    object
TeamPoints                    int64
OpponentPoints                int64
FieldGoals                    int64
FieldGoalsAttempted           int64
FieldGoals.                 float64
X3PointShots                  int64
X3PointShotsAttempted         int64
X3PointShots.               float64
FreeThrows                    int64
FreeThrowsAttempted           int64
FreeThrows.                 float64
OffRebounds                   int64
TotalRebounds                 int64
Assists                       int64
Steals                        int64
Blocks                        int64
Turnovers                     int64
TotalFouls                    int64
Opp.FieldGoals                int64
Opp.FieldGoalsAttempted       int64
Opp.FieldGoals.             

In [27]:
dataframe.head()

Unnamed: 0.1,Unnamed: 0,Team,Game,Date,Home,Opponent,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
0,1,ATL,1,2014-10-29,Away,TOR,L,102,109,40,...,27,33,0.818,16,48,26,13,9,9,22
1,2,ATL,2,2014-11-01,Home,IND,W,102,92,35,...,18,21,0.857,11,44,25,5,5,18,26
2,3,ATL,3,2014-11-05,Away,SAS,L,92,94,38,...,27,38,0.711,11,50,25,7,9,19,15
3,4,ATL,4,2014-11-07,Away,CHO,L,119,122,43,...,20,27,0.741,11,51,31,6,7,19,30
4,5,ATL,5,2014-11-08,Home,NYK,W,103,96,33,...,8,11,0.727,13,44,26,2,6,15,29


In [28]:
dataframe.isnull().sum()

Unnamed: 0                  0
Team                        0
Game                        0
Date                        0
Home                        0
Opponent                    0
WINorLOSS                   0
TeamPoints                  0
OpponentPoints              0
FieldGoals                  0
FieldGoalsAttempted         0
FieldGoals.                 0
X3PointShots                0
X3PointShotsAttempted       0
X3PointShots.               0
FreeThrows                  0
FreeThrowsAttempted         0
FreeThrows.                 0
OffRebounds                 0
TotalRebounds               0
Assists                     0
Steals                      0
Blocks                      0
Turnovers                   0
TotalFouls                  0
Opp.FieldGoals              0
Opp.FieldGoalsAttempted     0
Opp.FieldGoals.             0
Opp.3PointShots             0
Opp.3PointShotsAttempted    0
Opp.3PointShots.            0
Opp.FreeThrows              0
Opp.FreeThrowsAttempted     0
Opp.FreeTh

    Depois que conseguimos ter uma ideia da "cara" da base de dados e verificarmos se existem dados nulos dentro dela, podemos iniciar com o tratamento de algumas das colunas. Neste caso, começaremos com 'Home', que diz se a partida foi em casa ou fora de casa, e determinamos 1 para partidas em casa e 0 para partidas fora de casa. Para a coluna 'WINorLOSS', fizemos o mesmo tratamento, onde 'W', que representa uma vitória, será convertido para 1, e 'L', que representa uma derrota, será convertido para 0. Também deletamos 'Unnamed: 0' que representa o índice para nossa base, desnecessário durante o procedimento.

In [29]:
dataframe['Home'] = np.where(dataframe['Home'] == 'Home', 1, 0) #1 para em casa e 0 para fora de casa

In [30]:
dataframe['WINorLOSS'] = np.where(dataframe['WINorLOSS'] == 'W', 1, 0) # 1 para win 0 para loss

In [31]:
dataframe.head()

Unnamed: 0.1,Unnamed: 0,Team,Game,Date,Home,Opponent,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
0,1,ATL,1,2014-10-29,0,TOR,0,102,109,40,...,27,33,0.818,16,48,26,13,9,9,22
1,2,ATL,2,2014-11-01,1,IND,1,102,92,35,...,18,21,0.857,11,44,25,5,5,18,26
2,3,ATL,3,2014-11-05,0,SAS,0,92,94,38,...,27,38,0.711,11,50,25,7,9,19,15
3,4,ATL,4,2014-11-07,0,CHO,0,119,122,43,...,20,27,0.741,11,51,31,6,7,19,30
4,5,ATL,5,2014-11-08,1,NYK,1,103,96,33,...,8,11,0.727,13,44,26,2,6,15,29


In [32]:
del dataframe['Unnamed: 0']

In [33]:
dataframe.head()

Unnamed: 0,Team,Game,Date,Home,Opponent,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,FieldGoalsAttempted,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
0,ATL,1,2014-10-29,0,TOR,0,102,109,40,80,...,27,33,0.818,16,48,26,13,9,9,22
1,ATL,2,2014-11-01,1,IND,1,102,92,35,69,...,18,21,0.857,11,44,25,5,5,18,26
2,ATL,3,2014-11-05,0,SAS,0,92,94,38,92,...,27,38,0.711,11,50,25,7,9,19,15
3,ATL,4,2014-11-07,0,CHO,0,119,122,43,93,...,20,27,0.741,11,51,31,6,7,19,30
4,ATL,5,2014-11-08,1,NYK,1,103,96,33,81,...,8,11,0.727,13,44,26,2,6,15,29


### 3. Tratamento da base
    Depois de arrumarmos algumas colunas, podemos seguir para uma análise a partir da matriz de correlação, buscando entender o que está relacionado com o que. Durante a análise, percebemos alta correlação em Field Goals e Team Points, então, decidimos deixar apenas uma coluna, de forma que a análise fique mais eficiente, já que ambos representam uma similaridade alta.

In [34]:
df = pd.DataFrame(dataframe)
corr = df.corr()
corr.style.background_gradient().set_precision(2)

Unnamed: 0,Game,Home,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,FreeThrows,FreeThrowsAttempted,FreeThrows.,OffRebounds,TotalRebounds,Assists,Steals,Blocks,Turnovers,TotalFouls,Opp.FieldGoals,Opp.FieldGoalsAttempted,Opp.FieldGoals.,Opp.3PointShots,Opp.3PointShotsAttempted,Opp.3PointShots.,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
Game,1.0,-0.00044,-0.0017,0.059,0.061,0.078,0.058,0.044,0.045,0.053,0.012,-0.037,-0.037,-0.00046,0.0028,0.018,0.083,-0.0093,-0.017,-0.081,-0.09,0.08,0.058,0.046,0.045,0.052,0.013,-0.036,-0.035,-0.0012,0.0029,0.019,0.085,-0.008,-0.017,-0.081,-0.091
Home,-0.00044,1.0,0.16,0.11,-0.11,0.083,-0.0052,0.094,0.031,0.0022,0.047,0.057,0.058,0.011,0.026,0.086,0.11,0.0055,0.079,-0.027,-0.08,-0.083,0.0052,-0.094,-0.031,-0.0022,-0.047,-0.057,-0.058,-0.011,-0.026,-0.086,-0.11,-0.0055,-0.079,0.027,0.08
WINorLOSS,-0.0017,0.16,1.0,0.46,-0.46,0.38,-0.042,0.45,0.24,0.033,0.32,0.14,0.11,0.11,-0.04,0.26,0.31,0.14,0.17,-0.12,-0.11,-0.38,0.042,-0.45,-0.24,-0.033,-0.32,-0.14,-0.11,-0.11,0.04,-0.26,-0.31,-0.14,-0.17,0.12,0.11
TeamPoints,0.059,0.11,0.46,1.0,0.36,0.84,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15,0.27,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22
OpponentPoints,0.061,-0.11,-0.46,0.36,1.0,0.27,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22,0.84,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15
FieldGoals,0.078,0.083,0.38,0.84,0.27,1.0,0.44,0.78,0.33,0.11,0.36,-0.18,-0.19,0.0049,-0.00092,0.08,0.64,0.098,0.062,-0.15,0.051,0.22,0.22,0.087,0.11,0.12,0.032,0.11,0.11,0.024,0.006,-0.26,0.1,-0.1,-0.14,0.037,-0.13
FieldGoalsAttempted,0.058,-0.0052,-0.042,0.28,0.28,0.44,1.0,-0.22,0.092,0.26,-0.14,-0.22,-0.22,-0.037,0.51,0.42,0.2,0.12,0.037,-0.28,0.074,0.22,0.22,0.08,0.11,0.12,0.036,0.14,0.14,0.017,-0.045,0.32,0.16,-0.18,0.26,0.13,-0.14
FieldGoals.,0.044,0.094,0.45,0.71,0.098,0.78,-0.22,1.0,0.29,-0.066,0.49,-0.036,-0.047,0.031,-0.35,-0.21,0.55,0.021,0.042,0.028,0.0061,0.087,0.08,0.037,0.04,0.045,0.0084,0.029,0.021,0.014,0.038,-0.51,0.0035,0.0096,-0.33,-0.054,-0.047
X3PointShots,0.045,0.031,0.24,0.52,0.14,0.33,0.092,0.29,1.0,0.75,0.69,-0.099,-0.12,0.028,-0.12,-0.013,0.42,0.0081,-0.0021,-0.0058,0.033,0.11,0.11,0.04,0.087,0.12,0.0043,0.043,0.041,0.0044,-0.032,-0.096,0.044,-0.008,-0.13,-0.035,-0.1
X3PointShotsAttempted,0.053,0.0022,0.033,0.27,0.16,0.11,0.26,-0.066,0.75,1.0,0.068,-0.092,-0.099,0.0072,0.017,0.09,0.23,0.047,-0.0056,-0.012,0.047,0.12,0.12,0.045,0.12,0.14,0.021,0.053,0.052,0.0085,-0.054,0.12,0.075,0.002,-0.083,0.03,-0.071


In [35]:
del dataframe['FieldGoals']

In [36]:
df = pd.DataFrame(dataframe)
corr = df.corr()
corr.style.background_gradient().set_precision(2)

Unnamed: 0,Game,Home,WINorLOSS,TeamPoints,OpponentPoints,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,FreeThrows,FreeThrowsAttempted,FreeThrows.,OffRebounds,TotalRebounds,Assists,Steals,Blocks,Turnovers,TotalFouls,Opp.FieldGoals,Opp.FieldGoalsAttempted,Opp.FieldGoals.,Opp.3PointShots,Opp.3PointShotsAttempted,Opp.3PointShots.,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
Game,1.0,-0.00044,-0.0017,0.059,0.061,0.058,0.044,0.045,0.053,0.012,-0.037,-0.037,-0.00046,0.0028,0.018,0.083,-0.0093,-0.017,-0.081,-0.09,0.08,0.058,0.046,0.045,0.052,0.013,-0.036,-0.035,-0.0012,0.0029,0.019,0.085,-0.008,-0.017,-0.081,-0.091
Home,-0.00044,1.0,0.16,0.11,-0.11,-0.0052,0.094,0.031,0.0022,0.047,0.057,0.058,0.011,0.026,0.086,0.11,0.0055,0.079,-0.027,-0.08,-0.083,0.0052,-0.094,-0.031,-0.0022,-0.047,-0.057,-0.058,-0.011,-0.026,-0.086,-0.11,-0.0055,-0.079,0.027,0.08
WINorLOSS,-0.0017,0.16,1.0,0.46,-0.46,-0.042,0.45,0.24,0.033,0.32,0.14,0.11,0.11,-0.04,0.26,0.31,0.14,0.17,-0.12,-0.11,-0.38,0.042,-0.45,-0.24,-0.033,-0.32,-0.14,-0.11,-0.11,0.04,-0.26,-0.31,-0.14,-0.17,0.12,0.11
TeamPoints,0.059,0.11,0.46,1.0,0.36,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15,0.27,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22
OpponentPoints,0.061,-0.11,-0.46,0.36,1.0,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22,0.84,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15
FieldGoalsAttempted,0.058,-0.0052,-0.042,0.28,0.28,1.0,-0.22,0.092,0.26,-0.14,-0.22,-0.22,-0.037,0.51,0.42,0.2,0.12,0.037,-0.28,0.074,0.22,0.22,0.08,0.11,0.12,0.036,0.14,0.14,0.017,-0.045,0.32,0.16,-0.18,0.26,0.13,-0.14
FieldGoals.,0.044,0.094,0.45,0.71,0.098,-0.22,1.0,0.29,-0.066,0.49,-0.036,-0.047,0.031,-0.35,-0.21,0.55,0.021,0.042,0.028,0.0061,0.087,0.08,0.037,0.04,0.045,0.0084,0.029,0.021,0.014,0.038,-0.51,0.0035,0.0096,-0.33,-0.054,-0.047
X3PointShots,0.045,0.031,0.24,0.52,0.14,0.092,0.29,1.0,0.75,0.69,-0.099,-0.12,0.028,-0.12,-0.013,0.42,0.0081,-0.0021,-0.0058,0.033,0.11,0.11,0.04,0.087,0.12,0.0043,0.043,0.041,0.0044,-0.032,-0.096,0.044,-0.008,-0.13,-0.035,-0.1
X3PointShotsAttempted,0.053,0.0022,0.033,0.27,0.16,0.26,-0.066,0.75,1.0,0.068,-0.092,-0.099,0.0072,0.017,0.09,0.23,0.047,-0.0056,-0.012,0.047,0.12,0.12,0.045,0.12,0.14,0.021,0.053,0.052,0.0085,-0.054,0.12,0.075,0.002,-0.083,0.03,-0.071
X3PointShots.,0.012,0.047,0.32,0.48,0.035,-0.14,0.49,0.69,0.068,1.0,-0.05,-0.066,0.033,-0.2,-0.12,0.37,-0.029,0.00076,0.0051,0.0054,0.032,0.036,0.0084,0.0043,0.021,-0.02,0.013,0.012,-0.0028,0.012,-0.26,-0.02,-0.011,-0.11,-0.073,-0.074


    Na linha seguinte, utilizamos uma função para procurar duplicatas, de forma que diminua parte da base que pode influenciar viés.

In [37]:
dataframe.drop_duplicates(None,'first',inplace=True)

In [38]:
df = pd.DataFrame(dataframe)
corr = df.corr()
corr.style.background_gradient().set_precision(2)

Unnamed: 0,Game,Home,WINorLOSS,TeamPoints,OpponentPoints,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,FreeThrows,FreeThrowsAttempted,FreeThrows.,OffRebounds,TotalRebounds,Assists,Steals,Blocks,Turnovers,TotalFouls,Opp.FieldGoals,Opp.FieldGoalsAttempted,Opp.FieldGoals.,Opp.3PointShots,Opp.3PointShotsAttempted,Opp.3PointShots.,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
Game,1.0,-0.00044,-0.0017,0.059,0.061,0.058,0.044,0.045,0.053,0.012,-0.037,-0.037,-0.00046,0.0028,0.018,0.083,-0.0093,-0.017,-0.081,-0.09,0.08,0.058,0.046,0.045,0.052,0.013,-0.036,-0.035,-0.0012,0.0029,0.019,0.085,-0.008,-0.017,-0.081,-0.091
Home,-0.00044,1.0,0.16,0.11,-0.11,-0.0052,0.094,0.031,0.0022,0.047,0.057,0.058,0.011,0.026,0.086,0.11,0.0055,0.079,-0.027,-0.08,-0.083,0.0052,-0.094,-0.031,-0.0022,-0.047,-0.057,-0.058,-0.011,-0.026,-0.086,-0.11,-0.0055,-0.079,0.027,0.08
WINorLOSS,-0.0017,0.16,1.0,0.46,-0.46,-0.042,0.45,0.24,0.033,0.32,0.14,0.11,0.11,-0.04,0.26,0.31,0.14,0.17,-0.12,-0.11,-0.38,0.042,-0.45,-0.24,-0.033,-0.32,-0.14,-0.11,-0.11,0.04,-0.26,-0.31,-0.14,-0.17,0.12,0.11
TeamPoints,0.059,0.11,0.46,1.0,0.36,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15,0.27,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22
OpponentPoints,0.061,-0.11,-0.46,0.36,1.0,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22,0.84,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15
FieldGoalsAttempted,0.058,-0.0052,-0.042,0.28,0.28,1.0,-0.22,0.092,0.26,-0.14,-0.22,-0.22,-0.037,0.51,0.42,0.2,0.12,0.037,-0.28,0.074,0.22,0.22,0.08,0.11,0.12,0.036,0.14,0.14,0.017,-0.045,0.32,0.16,-0.18,0.26,0.13,-0.14
FieldGoals.,0.044,0.094,0.45,0.71,0.098,-0.22,1.0,0.29,-0.066,0.49,-0.036,-0.047,0.031,-0.35,-0.21,0.55,0.021,0.042,0.028,0.0061,0.087,0.08,0.037,0.04,0.045,0.0084,0.029,0.021,0.014,0.038,-0.51,0.0035,0.0096,-0.33,-0.054,-0.047
X3PointShots,0.045,0.031,0.24,0.52,0.14,0.092,0.29,1.0,0.75,0.69,-0.099,-0.12,0.028,-0.12,-0.013,0.42,0.0081,-0.0021,-0.0058,0.033,0.11,0.11,0.04,0.087,0.12,0.0043,0.043,0.041,0.0044,-0.032,-0.096,0.044,-0.008,-0.13,-0.035,-0.1
X3PointShotsAttempted,0.053,0.0022,0.033,0.27,0.16,0.26,-0.066,0.75,1.0,0.068,-0.092,-0.099,0.0072,0.017,0.09,0.23,0.047,-0.0056,-0.012,0.047,0.12,0.12,0.045,0.12,0.14,0.021,0.053,0.052,0.0085,-0.054,0.12,0.075,0.002,-0.083,0.03,-0.071
X3PointShots.,0.012,0.047,0.32,0.48,0.035,-0.14,0.49,0.69,0.068,1.0,-0.05,-0.066,0.033,-0.2,-0.12,0.37,-0.029,0.00076,0.0051,0.0054,0.032,0.036,0.0084,0.0043,0.021,-0.02,0.013,0.012,-0.0028,0.012,-0.26,-0.02,-0.011,-0.11,-0.073,-0.074


    Também excluimos a coluna 'Game' da análise, já que o mesmo tem como propósito similar a um índice. Em seguida, removemos as colunas com informações relacionadas aos oponentes, já que, essencialmente, estas são informações duplicadas. Para fins de exemplificação, temos: existe uma linha que representa uma vitória do Chicago Bulls contra o Boston Celtics, da mesma forma, existe uma linha que representa uma derrota do Boston Celtics contra o Chicago Bulls, então, essas duas linhas estão trazendo a mesma informação, portanto, não há necessidade de analisar colunas que possui estatisticas de oponentes, pelo menos nesta aplicação.

In [39]:
del dataframe['Game']

In [40]:
df = pd.DataFrame(dataframe)
corr = df.corr()
corr.style.background_gradient().set_precision(2)

Unnamed: 0,Home,WINorLOSS,TeamPoints,OpponentPoints,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,FreeThrows,FreeThrowsAttempted,FreeThrows.,OffRebounds,TotalRebounds,Assists,Steals,Blocks,Turnovers,TotalFouls,Opp.FieldGoals,Opp.FieldGoalsAttempted,Opp.FieldGoals.,Opp.3PointShots,Opp.3PointShotsAttempted,Opp.3PointShots.,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
Home,1.0,0.16,0.11,-0.11,-0.0052,0.094,0.031,0.0022,0.047,0.057,0.058,0.011,0.026,0.086,0.11,0.0055,0.079,-0.027,-0.08,-0.083,0.0052,-0.094,-0.031,-0.0022,-0.047,-0.057,-0.058,-0.011,-0.026,-0.086,-0.11,-0.0055,-0.079,0.027,0.08
WINorLOSS,0.16,1.0,0.46,-0.46,-0.042,0.45,0.24,0.033,0.32,0.14,0.11,0.11,-0.04,0.26,0.31,0.14,0.17,-0.12,-0.11,-0.38,0.042,-0.45,-0.24,-0.033,-0.32,-0.14,-0.11,-0.11,0.04,-0.26,-0.31,-0.14,-0.17,0.12,0.11
TeamPoints,0.11,0.46,1.0,0.36,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15,0.27,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22
OpponentPoints,-0.11,-0.46,0.36,1.0,0.28,0.098,0.14,0.16,0.035,0.2,0.19,0.029,0.017,-0.28,0.12,-0.091,-0.15,0.038,0.22,0.84,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15
FieldGoalsAttempted,-0.0052,-0.042,0.28,0.28,1.0,-0.22,0.092,0.26,-0.14,-0.22,-0.22,-0.037,0.51,0.42,0.2,0.12,0.037,-0.28,0.074,0.22,0.22,0.08,0.11,0.12,0.036,0.14,0.14,0.017,-0.045,0.32,0.16,-0.18,0.26,0.13,-0.14
FieldGoals.,0.094,0.45,0.71,0.098,-0.22,1.0,0.29,-0.066,0.49,-0.036,-0.047,0.031,-0.35,-0.21,0.55,0.021,0.042,0.028,0.0061,0.087,0.08,0.037,0.04,0.045,0.0084,0.029,0.021,0.014,0.038,-0.51,0.0035,0.0096,-0.33,-0.054,-0.047
X3PointShots,0.031,0.24,0.52,0.14,0.092,0.29,1.0,0.75,0.69,-0.099,-0.12,0.028,-0.12,-0.013,0.42,0.0081,-0.0021,-0.0058,0.033,0.11,0.11,0.04,0.087,0.12,0.0043,0.043,0.041,0.0044,-0.032,-0.096,0.044,-0.008,-0.13,-0.035,-0.1
X3PointShotsAttempted,0.0022,0.033,0.27,0.16,0.26,-0.066,0.75,1.0,0.068,-0.092,-0.099,0.0072,0.017,0.09,0.23,0.047,-0.0056,-0.012,0.047,0.12,0.12,0.045,0.12,0.14,0.021,0.053,0.052,0.0085,-0.054,0.12,0.075,0.002,-0.083,0.03,-0.071
X3PointShots.,0.047,0.32,0.48,0.035,-0.14,0.49,0.69,0.068,1.0,-0.05,-0.066,0.033,-0.2,-0.12,0.37,-0.029,0.00076,0.0051,0.0054,0.032,0.036,0.0084,0.0043,0.021,-0.02,0.013,0.012,-0.0028,0.012,-0.26,-0.02,-0.011,-0.11,-0.073,-0.074
FreeThrows,0.057,0.14,0.32,0.2,-0.22,-0.036,-0.099,-0.092,-0.05,1.0,0.92,0.33,0.052,0.063,-0.15,0.039,0.013,0.023,0.2,0.11,0.14,0.029,0.043,0.053,0.013,0.18,0.19,0.016,0.044,-0.068,0.036,-0.0074,0.0039,0.037,0.73


In [41]:
features_to_drop = ['Date', 'Opponent', 'OpponentPoints',
                        'Opp.FieldGoals', 'Opp.3PointShotsAttempted', 'Opp.3PointShots.', 'Opp.FreeThrows', 
                        'Opp.FreeThrowsAttempted', 'Opp.FreeThrows.', 'Opp.OffRebounds', 'Opp.TotalRebounds', 
                        'Opp.Assists', 'Opp.Steals', 'Opp.Blocks', 'Opp.Turnovers', 'Opp.TotalFouls',
                        'Opp.FieldGoalsAttempted', 'Opp.FieldGoals.', 'Opp.3PointShots']

In [42]:
x = dataframe.drop(features_to_drop, axis=1)

In [43]:
df = pd.DataFrame(x)
corr = df.corr()
corr.style.background_gradient().set_precision(2)

Unnamed: 0,Home,WINorLOSS,TeamPoints,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,FreeThrows,FreeThrowsAttempted,FreeThrows.,OffRebounds,TotalRebounds,Assists,Steals,Blocks,Turnovers,TotalFouls
Home,1.0,0.16,0.11,-0.0052,0.094,0.031,0.0022,0.047,0.057,0.058,0.011,0.026,0.086,0.11,0.0055,0.079,-0.027,-0.08
WINorLOSS,0.16,1.0,0.46,-0.042,0.45,0.24,0.033,0.32,0.14,0.11,0.11,-0.04,0.26,0.31,0.14,0.17,-0.12,-0.11
TeamPoints,0.11,0.46,1.0,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15
FieldGoalsAttempted,-0.0052,-0.042,0.28,1.0,-0.22,0.092,0.26,-0.14,-0.22,-0.22,-0.037,0.51,0.42,0.2,0.12,0.037,-0.28,0.074
FieldGoals.,0.094,0.45,0.71,-0.22,1.0,0.29,-0.066,0.49,-0.036,-0.047,0.031,-0.35,-0.21,0.55,0.021,0.042,0.028,0.0061
X3PointShots,0.031,0.24,0.52,0.092,0.29,1.0,0.75,0.69,-0.099,-0.12,0.028,-0.12,-0.013,0.42,0.0081,-0.0021,-0.0058,0.033
X3PointShotsAttempted,0.0022,0.033,0.27,0.26,-0.066,0.75,1.0,0.068,-0.092,-0.099,0.0072,0.017,0.09,0.23,0.047,-0.0056,-0.012,0.047
X3PointShots.,0.047,0.32,0.48,-0.14,0.49,0.69,0.068,1.0,-0.05,-0.066,0.033,-0.2,-0.12,0.37,-0.029,0.00076,0.0051,0.0054
FreeThrows,0.057,0.14,0.32,-0.22,-0.036,-0.099,-0.092,-0.05,1.0,0.92,0.33,0.052,0.063,-0.15,0.039,0.013,0.023,0.2
FreeThrowsAttempted,0.058,0.11,0.27,-0.22,-0.047,-0.12,-0.099,-0.066,0.92,1.0,-0.029,0.092,0.091,-0.17,0.051,0.016,0.028,0.21


    Depois das análises dentro da matriz de correlação, devemos tratar a coluna nominal 'Team', de forma com que a mesma seja convertida para valores numéricos, neste caso, binários. Portanto, utilizando o OneHotEncoder, convertemos cada valor nominal diferente em uma coluna de valores binários, então, por exemplo, se o time for o Cleveland Cavaliers, estará o valor 1 em sua coluna e 0 para as outras que representam os outros times.

In [44]:
x_enc = pd.get_dummies(x)

In [45]:
x_enc.head()

Unnamed: 0,Home,WINorLOSS,TeamPoints,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,FreeThrows,FreeThrowsAttempted,...,Team_OKC,Team_ORL,Team_PHI,Team_PHO,Team_POR,Team_SAC,Team_SAS,Team_TOR,Team_UTA,Team_WAS
0,0,0,102,80,0.5,13,22,0.591,9,17,...,0,0,0,0,0,0,0,0,0,0
1,1,1,102,69,0.507,7,20,0.35,25,33,...,0,0,0,0,0,0,0,0,0,0
2,0,0,92,92,0.413,8,25,0.32,8,11,...,0,0,0,0,0,0,0,0,0,0
3,0,0,119,93,0.462,13,33,0.394,20,26,...,0,0,0,0,0,0,0,0,0,0
4,1,1,103,81,0.407,9,22,0.409,28,36,...,0,0,0,0,0,0,0,0,0,0


In [46]:
df = pd.DataFrame(x_enc)
corr = df.corr()
corr.style.background_gradient().set_precision(2)

Unnamed: 0,Home,WINorLOSS,TeamPoints,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,FreeThrows,FreeThrowsAttempted,FreeThrows.,OffRebounds,TotalRebounds,Assists,Steals,Blocks,Turnovers,TotalFouls,Team_ATL,Team_BOS,Team_BRK,Team_CHI,Team_CHO,Team_CLE,Team_DAL,Team_DEN,Team_DET,Team_GSW,Team_HOU,Team_IND,Team_LAC,Team_LAL,Team_MEM,Team_MIA,Team_MIL,Team_MIN,Team_NOP,Team_NYK,Team_OKC,Team_ORL,Team_PHI,Team_PHO,Team_POR,Team_SAC,Team_SAS,Team_TOR,Team_UTA,Team_WAS
Home,1.0,0.16,0.11,-0.0052,0.094,0.031,0.0022,0.047,0.057,0.058,0.011,0.026,0.086,0.11,0.0055,0.079,-0.027,-0.08,-3.5e-19,-3.5e-19,-8.9e-19,3.7e-19,-8.6e-19,1.2e-19,-3.5e-19,1.2e-19,-3.5e-19,9e-19,4e-19,-3.8e-19,-8.9e-19,-1.3e-19,1.2e-19,-6.4e-19,6.5e-19,1.5e-19,3.7e-19,-6e-19,4e-19,9e-19,-6.4e-19,-3.5e-19,-3.5e-19,4e-19,4e-19,-1.3e-19,3.7e-19,3.8e-19
WINorLOSS,0.16,1.0,0.46,-0.042,0.45,0.24,0.033,0.32,0.14,0.11,0.11,-0.04,0.26,0.31,0.14,0.17,-0.12,-0.11,0.012,0.036,-0.065,-0.0045,-0.012,0.053,-0.017,-0.017,-0.014,0.11,0.06,0.01,0.043,-0.074,-0.0023,0.0068,-0.0045,-0.046,-0.0079,-0.062,0.035,-0.057,-0.063,-0.065,0.024,-0.049,0.075,0.058,0.015,0.017
TeamPoints,0.11,0.46,1.0,0.28,0.71,0.52,0.27,0.48,0.32,0.27,0.18,-0.012,0.093,0.57,0.1,0.057,-0.12,0.15,-0.01,0.017,-0.021,-0.024,-0.015,0.054,-0.026,0.04,-0.034,0.15,0.09,-0.017,0.054,-0.024,-0.066,-0.051,-0.029,0.0025,0.013,-0.059,0.053,-0.047,-0.049,0.0011,0.026,-0.019,0.00065,0.04,-0.065,0.014
FieldGoalsAttempted,-0.0052,-0.042,0.28,1.0,-0.22,0.092,0.26,-0.14,-0.22,-0.22,-0.037,0.51,0.42,0.2,0.12,0.037,-0.28,0.074,-0.023,0.05,-0.0018,0.043,0.0089,-0.024,-0.0097,0.048,0.055,0.045,-0.0088,-0.002,-0.035,0.043,-0.046,-0.062,-0.069,-0.03,0.028,0.017,0.059,0.019,-0.0075,0.048,0.034,-0.028,-0.027,-0.021,-0.12,0.01
FieldGoals.,0.094,0.45,0.71,-0.22,1.0,0.29,-0.066,0.49,-0.036,-0.047,0.031,-0.35,-0.21,0.55,0.021,0.042,0.028,0.0061,0.0025,-0.027,-0.025,-0.047,-0.059,0.039,-0.024,-0.004,-0.043,0.12,-0.0017,0.0065,0.056,-0.051,-0.034,0.015,0.051,0.022,0.017,-0.034,0.0062,-0.016,-0.056,-0.034,-0.0056,0.0094,0.051,0.02,0.0037,0.04
X3PointShots,0.031,0.24,0.52,0.092,0.29,1.0,0.75,0.69,-0.099,-0.12,0.028,-0.12,-0.013,0.42,0.0081,-0.0021,-0.0058,0.033,0.045,0.048,-0.00057,-0.027,0.004,0.12,0.061,0.021,-0.0053,0.14,0.2,-0.042,0.039,-0.043,-0.085,-0.036,-0.09,-0.14,-0.016,-0.07,-0.017,-0.041,0.031,-0.028,0.059,-0.063,-0.047,0.021,-0.0018,-0.034
X3PointShotsAttempted,0.0022,0.033,0.27,0.26,-0.066,0.75,1.0,0.068,-0.092,-0.099,0.0072,0.017,0.09,0.23,0.047,-0.0056,-0.012,0.047,0.06,0.079,0.02,-0.036,0.012,0.13,0.094,0.038,0.0059,0.11,0.29,-0.07,0.034,-0.028,-0.1,-0.044,-0.13,-0.18,-0.029,-0.087,0.00083,-0.035,0.072,-0.0079,0.059,-0.1,-0.091,0.021,-0.0091,-0.067
X3PointShots.,0.047,0.32,0.48,-0.14,0.49,0.69,0.068,1.0,-0.05,-0.066,0.033,-0.2,-0.12,0.37,-0.029,0.00076,0.0051,0.0054,0.0092,-0.0075,-0.023,-0.0033,-0.015,0.03,-0.0043,-0.0086,-0.016,0.081,-0.00035,0.01,0.021,-0.036,-0.024,-0.014,0.0096,-0.025,0.0085,-0.0093,-0.025,-0.02,-0.028,-0.032,0.026,0.019,0.033,0.011,0.01,0.024
FreeThrows,0.057,0.14,0.32,-0.22,-0.036,-0.099,-0.092,-0.05,1.0,0.92,0.33,0.052,0.063,-0.15,0.039,0.013,0.023,0.2,-0.025,-0.0048,-0.0015,-0.0039,0.047,0.0022,-0.043,0.022,-0.061,-0.017,0.075,-0.018,0.04,0.009,0.02,-0.036,-0.0061,0.084,-0.022,-0.044,0.043,-0.067,-0.026,0.021,-0.0094,0.014,-0.02,0.061,-0.011,-0.022
FreeThrowsAttempted,0.058,0.11,0.27,-0.22,-0.047,-0.12,-0.099,-0.066,0.92,1.0,-0.029,0.092,0.091,-0.17,0.051,0.016,0.028,0.21,-0.03,-0.02,-0.0058,-0.019,0.039,0.0058,-0.057,0.025,-0.025,-0.034,0.1,-0.03,0.075,0.02,0.0062,-0.02,-0.0084,0.063,-0.024,-0.065,0.056,-0.063,0.0012,0.025,-0.025,0.024,-0.039,0.043,-0.00052,-0.02


### 4. Treinamento
    De início, armazenamos em variáveis auxiliares X e y, onde X será nossa base de treino e y será a coluna que queremos predizer. Em seguida, dividimos o o treinamento e teste em 80% e 20%, respectivamente.

In [47]:
X = x_enc.drop(columns=['WINorLOSS'])
y = x_enc.WINorLOSS

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

    Aqui, criamos uma função para que nos ajude a utilizar dois tipos de modelo de treinamento em seguida. Nela, buscamos achar os melhores valores de teste por meio do K-Fold Cross-Validation, onde K=10, como é normalmente utilizado. Depois de acharmos os melhores parâmetros, tentamos predizer os dados a partir de medidas como acurácia, utilizando o score() para observarmos a precisão e o f1-score, de forma que valide a acurácia encontrada. Também geramos a matriz de confusão para analisarmos os true negatives e true positives e, em seguida, observamos as medidas de sensibilidade e especificidade. Por fim, é mostrado o tempo de processamento dos algoritmos utilizados.

In [50]:
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
import time
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

def rf_train_test(model, train_X, train_y, test_X, test_y, features, param_grid=None): 
    start_time = time.time()
    
    # search param combination that results in higher accuracy using K fold cross-validation
    clf = GridSearchCV(model, param_grid, cv=10, scoring="accuracy", return_train_score=False)
    
    # fit the model with best params
    clf.fit(train_X, train_y)
    print("Best parameters combination: %s\n" % clf.best_params_)

    # predict using fitted model on test/validation data
    y_true, y_pred = test_y, clf.best_estimator_.predict(test_X)
    
    # predict class probabilities
    predictions = clf.predict_proba(test_X)
    
    # keep only the positive class to AUROC metric
    predictions = [p[1] for p in predictions]
    print("\nAUROC accuracy: ", roc_auc_score(y_true, predictions))

    # report over test dataset
    print(classification_report(y_true, y_pred))
    test_score = (clf.score(test_X, test_y)*100)
    print("Test score: %s" % "{0:.3f}%\n".format(test_score))
    
    cm=confusion_matrix(test_y, y_pred)
    print('Matriz de Confusão:')
    print(cm)
    
    tp, fp, fn, tn = confusion_matrix(y_test, y_pred).ravel()
    sensibility = tn / (tn+fn)
    print('\nSensibilidade: ', round(sensibility,2))
    
    specificity = tn / (tn+fp)
    print('\nEspecificidade: ', round(specificity, 2))
    
    # calculate training and testing time
    processing_time = time.time() - start_time
    print("\nProcessing time: %s seconds \n" % "{0:.3f}".format(processing_time))

#### Modelo Árvore de Decisão
    Utilizamos o modelo de Árvore de Decisão neste primeiro exemplo de teste para observar se conseguimos alcançar um pure subset, que seria um ramo da árvore que podemos garantir valores identicos, tanto para 'ganhou' ou 'perdeu', neste caso, '1' ou '0', respectivamente.

In [53]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

train_X = X_train
train_y = y_train
test_X = X_test
test_y = y_test


print("\n ------------ DecisionTreeClassifier \n")
rf_train_test(DecisionTreeClassifier(), 
              train_X, train_y, 
              test_X, test_y,  
              x_enc,
              {
               'max_features': ['sqrt', 'log2'],
               'criterion': ['gini', 'entropy'],
               'random_state': [np.random.seed(1234)],
               'max_depth': [5, 10, 15, None],
              }
              )


 ------------ DecisionTreeClassifier 

Best parameters combination: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'random_state': None}


AUROC accuracy:  0.7525306989273968
              precision    recall  f1-score   support

           0       0.68      0.72      0.70       972
           1       0.71      0.66      0.69       996

    accuracy                           0.69      1968
   macro avg       0.69      0.69      0.69      1968
weighted avg       0.69      0.69      0.69      1968

Test score: 69.258%

Matriz de Confusão:
[[704 268]
 [337 659]]

Sensibilidade:  0.66

Especificidade:  0.71

Processing time: 1.880 seconds 



    Pelos valores de precisão e acurácia, podemos perceber que a árvore de decisão não conseguiu encontrar um pure subset, portanto é justificado os valores serem um tanto baixos, portanto uma análise a partir de outro modelo pode ser considerado.

#### Modelo Random Forest Classifier
    Como o teste a partir da Árvore de Decisão deixou a desejar, foi escolhido um outro tipo de modelo, Random Forest. Nele, pegamos o conceito da Árvore de Decisão e o expandimos. Este algoritmo funciona criando várias Árvores de Decisão, onde cada uma possui um atributo de classificação, este será utilizado para descobrir qual árvore está mais próxima de uma classe. No fim, a Floresta escolherá a árvore com mais votos e a tomará como resultado. De início, temos um modelo que é muito bom em problemas de classificação, no entanto, o modelo basicamente toma conta da análise, ou seja, possuimos pouco controle deste, mesmo tentanto parâmetros diferentes ou seeds aleatórias, utilizadas tanto neste modelo quanto o Árvore de Decisão.

In [54]:
from sklearn.ensemble import RandomForestClassifier

train_X = X_train
train_y = y_train
test_X = X_test
test_y = y_test

print("\n ------------ Random Forest \n")
rf_train_test(RandomForestClassifier(), 
              train_X, train_y, 
              test_X, test_y,  
              x_enc,
              {'n_estimators': [50, 70, 100],
               'max_features': ['sqrt', 'log2'],
               'criterion': ['gini', 'entropy'],
               'random_state': [np.random.seed(1234)],
               'max_depth': [5, 10, 15, None],
               'n_jobs': [-1]}
              )


 ------------ Random Forest 

Best parameters combination: {'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 100, 'n_jobs': -1, 'random_state': None}


AUROC accuracy:  0.8921968739154148
              precision    recall  f1-score   support

           0       0.81      0.82      0.81       972
           1       0.82      0.81      0.82       996

    accuracy                           0.82      1968
   macro avg       0.82      0.82      0.82      1968
weighted avg       0.82      0.82      0.82      1968

Test score: 81.504%

Matriz de Confusão:
[[795 177]
 [187 809]]

Sensibilidade:  0.81

Especificidade:  0.82

Processing time: 138.550 seconds 



    Com bons valores de acurácia e precisão, como era de se esperar, o modelo Random Forest foi mais eficaz do que o Decision Tree, devido a tantas árvores geradas durante sua execução. Além disso, o Random Forest utiliza dois conceitos em suas aplicaçãoes: Bagging e Boosting. O bagging serve para, basicamente, melhorar a estabilidade dos dados, reduzir a variancia e ajuda na prevenção de overfitting, que esta base possivelmente contem, devido ao tratamento de encoder para conversão da coluna 'Team'. Enquanto o boosting, basicamente, reduz o viés e a variancia em casos supervisionados.

### 5. Conclusão
    Por fim, este projeto foi realizado para fins de iniciação de aprendizado tanto em IA, machine learning e, sutilmente, data science, até mesmo Python. Neste, conseguimos entender os principios do tratamento de dados, tanto em conversões nominais para numéricas quanto o que fazer com dados faltanto, colunas bem correlacionadas. Durante esta aplicação, também conseguimos aprender um modelo de aprendizagem diferente como o Random Forest, já que o conceito de Árvore de Decisão já apareceu em outras ocasiões e aplicações. O intervalo de resultados encontrados pelo modelo escolhido foram satisfatórios em várias iterações que foram feitas para o treinamento, deixando-o satisfatório.
    No entanto, podemos apontar algumas falhas durante este processo, uma delas foi omitir a coluna 'Date', que poderia trazer consigo uma análise mais realistica da base, então, faltou o tratamento dela. Como foi discutido na apresentação, 'Date' poderia ser classificada de acordo com o ano do jogo ao invés da data completa, ou até mesmo sendo classificada por meio de seasons, se o interesse fosse avaliar as tendencias das mesmas. O que também seria uma possível falha seria o provável overfitting que pode acontecer durante o encode da coluna 'Team', apesar do modelo Random Forest seja dito que previne overfitting, também não podemos dizer de forma clara que, de fato, não ocorreu, devido ao próprio modelo.