# Data Analysis 20-2: Class Project
**Presented by:** Alvaro Andrés Gómez Rey, Peter Steven Mesa Franco, Santiago Rosero Cordoba

## 1. Cleaning and EDA
### 1.1. Importing libraries and loading the DataFrame
We import the libraries needed for the EDA and load the data on NBA players for the 2017-2018 season.

In [1]:
import numpy as np  # matrix and vector operations
import pandas as pd  # data handling
import matplotlib as mpl
import matplotlib.pyplot as plt  # plotting
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split  # dataset splitting for evaluation
from sklearn.model_selection import KFold, cross_val_score  # evaluation protocol
from sklearn import metrics
from sklearn.preprocessing import scale 
import seaborn as sns

data = pd.read_csv('2017-18_NBA_salary.csv', sep=',')

The data file (`2017-18_NBA_salary.csv`) must be in the same directory as this notebook.

### 1.2. Understanding the variables

In [24]:
print("Data Shape(R,C):",data.shape)
data.columns

Data Shape(R,C): (485, 28)


Index(['Player', 'Salary', 'NBA_Country', 'NBA_DraftNumber', 'Age', 'Tm', 'G',
       'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%',
       'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM',
       'DBPM', 'BPM', 'VORP'],
      dtype='object')

The initial DataFrame has 485 rows and 28 columns.
* Player: player name
* **Salary: annual salary (in dollars)**
* NBA_Country: country of origin
* NBA_DraftNumber: draft pick number
* Age: age
* Tm: team
* G: games played
* MP: minutes played
* PER: Player Efficiency Rating
* TS%: True Shooting Percentage
* 3PAr: three-point attempt rate
* FTr: free throw attempt rate
* ORB%: offensive rebound percentage
* DRB%: defensive rebound percentage
* TRB%: total rebound percentage
* AST%: assist percentage
* STL%: steal percentage
* BLK%: block percentage
* TOV%: turnover percentage
* USG%: usage percentage
* OWS: Offensive Win Shares (estimated number of wins contributed by a player through offensive performance)
* DWS: Defensive Win Shares (estimated number of wins contributed by a player through defensive performance)
* WS: Win Shares (estimated number of wins contributed by a player)
* WS/48: Win Shares per 48 (estimated number of wins contributed by a player per 48 minutes)
* OBPM: Offensive Box Plus/Minus (measure of a player's contribution to the team while on the court, based on offensive play)
* DBPM: Defensive Box Plus/Minus (measure of a player's contribution to the team while on the court, based on defensive play)
* BPM: Box Plus/Minus (measure of a player's contribution to the team while on the court)
* VORP: Value Over Replacement Player (measure of a player's total contribution to the team)

**Salary** is our dependent variable, since we want to predict its value from the other variables.
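The split between the dependent variable and the candidate predictors can be sketched as follows. This is a minimal illustration on three rows copied from the `data.head()` output, not the notebook's own modeling code. The names `sample`, `target_col`, and `id_cols`, and the choice to leave `Player` out as an identifier, are assumptions made for the example.

```python
import pandas as pd

# Three rows and a handful of columns copied from the data.head() output.
sample = pd.DataFrame({
    "Player": ["Zhou Qi", "Zaza Pachulia", "Zach Randolph"],
    "Salary": [815615, 3477600, 12307692],
    "Age": [22, 33, 36],
    "MP": [87, 937, 1508],
    "PER": [0.6, 16.8, 17.3],
})

target_col = "Salary"   # dependent variable named in the text
id_cols = ["Player"]    # assumption: identifiers are not used as predictors

y = sample[target_col]                           # what we want to predict
X = sample.drop(columns=[target_col] + id_cols)  # everything else

print(X.columns.tolist())
print(y.tolist())
```

On the full DataFrame the same two lines would apply to `data`; which columns to keep as predictors is a modeling decision that this sketch does not settle.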

In [29]:
data.head()

Unnamed: 0,Player,Salary,NBA_Country,NBA_DraftNumber,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
0,Zhou Qi,815615,China,43,22,HOU,16,87,0.6,0.303,0.593,0.37,6.5,16.8,11.7,1.5,1.1,6.8,18.2,19.5,-0.4,0.1,-0.2,-0.121,-10.6,0.5,-10.1,-0.2
1,Zaza Pachulia,3477600,Georgia,42,33,GSW,66,937,16.8,0.608,0.004,0.337,11.0,25.0,18.5,15.4,1.9,1.3,19.3,17.2,1.7,1.4,3.1,0.16,-0.6,1.3,0.8,0.7
2,Zach Randolph,12307692,USA,19,36,SAC,59,1508,17.3,0.529,0.193,0.14,7.0,23.8,15.0,14.9,1.4,0.6,12.5,27.6,0.3,1.1,1.4,0.046,-0.6,-1.3,-1.9,0.0
3,Zach LaVine,3202217,USA,13,22,CHI,24,656,14.6,0.499,0.346,0.301,1.4,14.4,7.7,18.6,1.8,0.5,9.7,29.5,-0.1,0.5,0.4,0.027,-0.7,-2.0,-2.6,-0.1
4,Zach Collins,3057240,USA,10,20,POR,62,979,8.2,0.487,0.387,0.146,4.9,18.3,11.7,7.3,0.8,2.5,15.6,15.5,-0.4,1.2,0.8,0.038,-3.7,0.9,-2.9,-0.2


*Sample rows from the DataFrame*

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485 entries, 0 to 484
Data columns (total 28 columns):
Player             485 non-null object
Salary             485 non-null int64
NBA_Country        485 non-null object
NBA_DraftNumber    485 non-null int64
Age                485 non-null int64
Tm                 485 non-null object
G                  485 non-null int64
MP                 485 non-null int64
PER                485 non-null float64
TS%                483 non-null float64
3PAr               483 non-null float64
FTr                483 non-null float64
ORB%               485 non-null float64
DRB%               485 non-null float64
TRB%               485 non-null float64
AST%               485 non-null float64
STL%               485 non-null float64
BLK%               485 non-null float64
TOV%               483 non-null float64
USG%               485 non-null float64
OWS                485 non-null float64
DWS                485 non-null float64
WS                 485 non-n

In [31]:
data.describe(include="all")

Unnamed: 0,Player,Salary,NBA_Country,NBA_DraftNumber,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
count,485,485.0,485,485.0,485.0,485,485.0,485.0,485.0,483.0,483.0,483.0,485.0,485.0,485.0,485.0,485.0,485.0,483.0,485.0,485.0,485.0,485.0,485.0,485.0,485.0,485.0,485.0
unique,483,,44,,,31,,,,,,,,,,,,,,,,,,,,,,
top,Kay Felder,,USA,,,TOT,,,,,,,,,,,,,,,,,,,,,,
freq,3,,374,,,55,,,,,,,,,,,,,,,,,,,,,,
mean,,6636507.0,,29.451546,26.263918,,50.16701,1154.142268,13.260825,0.535387,0.337383,0.263404,4.873814,14.950722,9.908247,12.947835,1.529485,1.713196,13.140373,18.89732,1.275464,1.176495,2.455258,0.079959,-1.270722,-0.489485,-1.760206,0.598763
std,,7392602.0,,21.12576,4.272297,,24.874872,811.357419,8.76928,0.112352,0.226894,0.294578,4.58281,6.84753,4.956436,9.112408,0.989562,1.683792,6.11529,5.940536,1.881444,1.03458,2.67367,0.162992,5.026275,2.389343,5.661447,1.245653
min,,46080.0,,1.0,19.0,,1.0,1.0,-41.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.3,0.0,-1.2,-1.063,-36.5,-14.3,-49.2,-1.3
25%,,1471382.0,,11.0,23.0,,29.0,381.0,9.8,0.5055,0.167,0.155,1.8,10.2,6.2,6.9,1.0,0.6,9.9,15.0,0.0,0.3,0.3,0.04,-2.7,-1.7,-3.6,-0.1
50%,,3202217.0,,25.0,26.0,,59.0,1134.0,13.2,0.545,0.346,0.231,3.2,14.0,8.7,9.9,1.5,1.2,12.5,17.9,0.8,1.0,1.8,0.083,-1.1,-0.4,-1.3,0.1
75%,,10000000.0,,47.0,29.0,,71.0,1819.0,16.5,0.5825,0.481,0.3195,7.0,18.8,13.3,17.6,1.9,2.2,15.75,22.2,2.0,1.8,3.6,0.123,0.4,1.0,0.5,0.9


Of the 485 records, two have null values in the following fields: TS%, 3PAr, FTr, and TOV%.
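One way to obtain these counts directly, rather than reading them off `data.info()`, is sketched below. The small `sample` frame is built by hand from three rows of the notebook output (Zhou Qi plus the two rows with missing values) so that the snippet runs without the CSV; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built sample: one complete row and the two rows with missing values.
sample = pd.DataFrame({
    "Player": ["Zhou Qi", "Tyler Lydon", "Trey McKinney-Jones"],
    "MP": [87, 2, 1],
    "TS%": [0.303, np.nan, np.nan],
    "3PAr": [0.593, np.nan, np.nan],
    "FTr": [0.37, np.nan, np.nan],
    "TOV%": [18.2, np.nan, np.nan],
})

null_counts = sample.isnull().sum()              # nulls per column
cols_with_nulls = null_counts[null_counts > 0]   # keep only affected columns
rows_with_nulls = int(sample.isnull().any(axis=1).sum())  # affected rows

print(cols_with_nulls)
print("rows with at least one null:", rows_with_nulls)
```

Applied to the full `data` DataFrame, the same calls should list the four columns named above.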

In [4]:
data[data.isnull().any(axis=1)]

Unnamed: 0,Player,Salary,NBA_Country,NBA_DraftNumber,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
29,Tyler Lydon,1579440,USA,24,21,DEN,1,2,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,-0.016,-5.6,-0.9,-6.5,0.0
37,Trey McKinney-Jones,46080,USA,62,27,IND,1,1,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,-0.001,-5.7,-0.1,-5.9,0.0


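The minimum value of each of these fields is 0, so null entries are unusual. Both players appear in a single game with at most two minutes played, so these rate statistics are probably undefined (no attempts recorded) rather than zero.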

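Two further records repeat a player who already appears in the data, **Kay Felder** (three rows in total), with different `Tm` values and some different statistics. **This is expected, since a player can play for more than one team within a season.** The `TOT` row appears to hold his combined totals: its G and MP (15 and 137) equal the sums of the DET (1 and 3) and CHI (14 and 134) rows. His salary is the same in every row.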

In [5]:
data[data["Player"].duplicated()]

Unnamed: 0,Player,Salary,NBA_Country,NBA_DraftNumber,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
225,Kay Felder,1312611,USA,54,22,CHI,14,134,3.6,0.386,0.273,0.182,0.8,10.8,5.6,23.0,1.1,0.7,17.4,28.0,-0.5,0.1,-0.5,-0.166,-8.2,-3.3,-11.5,-0.3
226,Kay Felder,1312611,USA,54,22,TOT,15,137,2.9,0.375,0.279,0.176,1.5,10.5,5.9,22.5,1.1,0.6,17.9,28.4,-0.6,0.1,-0.5,-0.185,-8.7,-3.5,-12.1,-0.3
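By default `duplicated()` flags only the repeats after the first occurrence, which is why two rows are listed above even though `describe` reports a frequency of 3 for this name. The sketch below, built by hand from the three Kay Felder rows in the notebook output, shows how `keep=False` returns every row for the player and checks whether the `TOT` row equals the sum of the per-team rows. Reading `TOT` as a season-total row is an assumption based on these numbers.

```python
import pandas as pd

# The three Kay Felder rows (indices 224-226), reduced to a few columns.
kay = pd.DataFrame(
    {
        "Player": ["Kay Felder"] * 3,
        "Tm": ["DET", "CHI", "TOT"],
        "G": [1, 14, 15],
        "MP": [3, 134, 137],
        "Salary": [1312611] * 3,
    },
    index=[224, 225, 226],
)

repeats_only = kay[kay["Player"].duplicated()]         # default keep="first"
all_rows = kay[kay["Player"].duplicated(keep=False)]   # every matching row

per_team = kay[kay["Tm"] != "TOT"]
total = kay[kay["Tm"] == "TOT"].iloc[0]
tot_matches = bool(
    per_team["G"].sum() == total["G"] and per_team["MP"].sum() == total["MP"]
)

print(len(repeats_only), len(all_rows), tot_matches)
```

If the check holds for the full data as well, keeping both the per-team rows and the `TOT` row would count the same minutes twice, which is worth bearing in mind during cleaning.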


The maximum value of TS% is 1.5, and the player with that value is the only one above 1.

In [34]:
data[data["TS%"]>1]

Unnamed: 0,Player,Salary,NBA_Country,NBA_DraftNumber,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
142,Naz Mitrou-Long,92160,Canada,62,24,UTA,1,1,134.1,1.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,45.1,0.1,0.0,0.1,2.713,68.7,-14.3,54.4,0.0
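A TS% above 1 is possible because of how the statistic is defined: TS% = PTS / (2 × (FGA + 0.44 × FTA)). The sketch below uses that standard formula with a hypothetical helper `true_shooting`. The box-score line of one made three-pointer on one attempt is an assumption: it is consistent with this row (3PAr = 1.0, FTr = 0.0, TS% = 1.5, one minute played), but the raw counts are not in the dataset.

```python
def true_shooting(points, fga, fta):
    """Return TS% = PTS / (2 * (FGA + 0.44 * FTA)), or None with no attempts."""
    denom = 2 * (fga + 0.44 * fta)
    if denom == 0:
        return None  # undefined: no field-goal or free-throw attempts
    return points / denom

# Assumed line: a single made three-pointer and nothing else.
one_made_three = true_shooting(points=3, fga=1, fta=0)

# A player with no attempts at all has an undefined TS%.
no_attempts = true_shooting(points=0, fga=0, fta=0)

print(one_made_three, no_attempts)
```

The undefined case offers a plausible reading of the null TS% values seen earlier for players with only one or two minutes on the court.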


In [36]:
data[data["PER"]<0]

Unnamed: 0,Player,Salary,NBA_Country,NBA_DraftNumber,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
6,Xavier Silas,74159,USA,62,30,BOS,2,7,-4.9,0.0,0.667,0.0,15.9,15.4,15.7,0.0,7.2,0.0,0.0,19.2,-0.1,0.0,0.0,-0.251,-12.6,-0.7,-13.3,0.0
22,Vander Blue,50000,USA,62,25,LAL,5,45,-1.8,0.255,0.4,0.4,0.0,2.4,1.2,8.1,1.1,0.0,33.8,8.4,-0.1,0.0,-0.1,-0.103,-8.7,-1.4,-10.1,-0.1
84,Scotty Hopson,74159,USA,62,28,DAL,1,8,-4.6,0.266,0.0,2.0,0.0,0.0,0.0,15.8,0.0,0.0,34.7,16.4,0.0,0.0,0.0,-0.237,-14.8,-5.9,-20.7,0.0
135,Nicolas Brussino,1312611,Argentina,62,24,ATL,4,10,-4.6,0.0,1.0,0.0,0.0,33.5,16.8,0.0,0.0,0.0,0.0,8.7,0.0,0.0,0.0,-0.18,-13.8,-4.8,-18.6,0.0
143,Nate Wolters,50000,USA,38,26,UTA,5,19,-2.9,0.167,0.0,0.0,6.1,5.9,6.0,7.1,0.0,0.0,0.0,14.2,-0.1,0.0,-0.1,-0.141,-10.3,-2.2,-12.5,-0.1
147,Mindaugas Kuzminskas,3025035,Lithuania,62,28,NYK,1,2,-41.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,43.6,0.0,0.0,0.0,-1.063,-36.5,-12.7,-49.2,0.0
161,Matt Williams,50000,USA,62,24,MIA,3,11,-0.7,0.417,0.833,0.0,0.0,10.2,5.1,0.0,0.0,0.0,14.3,28.6,-0.1,0.0,0.0,-0.196,-8.6,-8.6,-17.1,0.0
188,Luis Montero,50000,Dominican Rep...,62,24,DET,2,8,-15.5,0.0,0.0,0.0,0.0,28.2,13.8,0.0,0.0,0.0,66.7,16.6,-0.1,0.0,-0.1,-0.443,-23.8,-2.1,-25.8,0.0
224,Kay Felder,1312611,USA,54,22,DET,1,3,-31.6,0.0,0.5,0.0,35.9,0.0,18.4,0.0,0.0,0.0,33.3,44.2,-0.1,0.0,-0.1,-1.005,-29.5,-11.9,-41.4,0.0
241,Josh McRoberts,6021175,USA,37,30,DAL,2,6,-12.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.6,0.0,0.0,0.0,-0.189,-9.8,-2.3,-12.1,0.0


In [7]:
data.dtypes

Player              object
Salary               int64
NBA_Country         object
NBA_DraftNumber      int64
Age                  int64
Tm                  object
G                    int64
MP                   int64
PER                float64
TS%                float64
3PAr               float64
FTr                float64
ORB%               float64
DRB%               float64
TRB%               float64
AST%               float64
STL%               float64
BLK%               float64
TOV%               float64
USG%               float64
OWS                float64
DWS                float64
WS                 float64
WS/48              float64
OBPM               float64
DBPM               float64
BPM                float64
VORP               float64
dtype: object

### 1.3. Cleaning

### 1.4. Analyzing the relationships between variables