# Logistic Regression

The difference from multiple linear regression is that the target here is a binary outcome, either a win or a loss, rather than a continuous value.
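As a minimal sketch of what this means in practice: logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function to obtain a probability, and then thresholds that probability to decide between a win (1) and a loss (0). The weights, bias and input below are made-up illustrative values, not the model fitted later in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and input, chosen only for illustration
weights = np.array([2.0, -1.0])
bias = 0.5
x = np.array([1.0, 3.0])

z = weights @ x + bias          # linear part, as in linear regression
p_win = sigmoid(z)              # estimated probability of the positive class
predicted = int(p_win >= 0.5)   # threshold at 0.5 to get a 0/1 label
print(p_win, predicted)
```

The linear part is the same as in linear regression; only the final mapping to a probability and a class label differs.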

In [10]:
# First, import the libraries
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [3]:
# Next, load the data
# Note: '__file__' is a literal string here, so wdir resolves to the current working directory
wdir = os.path.dirname(os.path.realpath('__file__'))
data = pd.read_csv(wdir+'/fourFactors.csv')

# Check that the data looks correct
pd.set_option('display.expand_frame_repr', False)
data.head(10)

Unnamed: 0,efg,efg_opp,ftr,ftr_opp,orb,orb_opp,scr,scr_opp,tov,tov_opp,won
0,0.525,0.5,0.73913,0.923077,0.228571,0.2,80,74,0.146128,0.171317,1
1,0.516949,0.515873,0.6,0.789474,0.21875,0.333333,67,80,0.13624,0.100806,0
2,0.575472,0.490909,0.761905,0.777778,0.407407,0.242424,77,68,0.138427,0.137137,1
3,0.373016,0.689394,0.785714,0.733333,0.35,0.266667,58,102,0.206517,0.110294,0
4,0.675,0.410448,0.8125,0.7,0.32,0.340426,94,69,0.129803,0.126728,1
5,0.530303,0.470588,1.0,0.85,0.222222,0.21875,73,65,0.129333,0.189702,1
6,0.401786,0.314286,0.708333,0.888889,0.225806,0.326531,62,60,0.203447,0.133452,1
7,0.453333,0.472727,0.846154,0.590909,0.333333,0.28125,79,65,0.100312,0.236183,1
8,0.507937,0.45283,0.785714,0.8,0.314286,0.1875,75,64,0.103681,0.114613,1
9,0.492424,0.552632,0.866667,0.916667,0.333333,0.321429,78,85,0.141844,0.117555,0


The columns come in pairs: for each statistic (`efg`, `ftr`, `orb`, `scr` and `tov`) the first column refers to Obradoiro and the second one (suffix `_opp`) to the opponent, ten columns in total.  
Finally, the `won` column is 1 if Obradoiro won and 0 if it lost.

The dataset has 34 rows and 11 columns.

In [4]:
# Generate an overview of the data
data.describe()

Unnamed: 0,efg,efg_opp,ftr,ftr_opp,orb,orb_opp,scr,scr_opp,tov,tov_opp,won
count,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0
mean,0.523693,0.524913,0.754046,0.792098,0.293585,0.291643,77.941176,80.647059,0.146697,0.138363,0.411765
std,0.086719,0.083036,0.103986,0.105577,0.071591,0.076257,12.494705,13.984585,0.042138,0.043342,0.499554
min,0.373016,0.314286,0.545455,0.5,0.111111,0.111111,58.0,60.0,0.062625,0.053476,0.0
25%,0.464773,0.491133,0.67,0.73141,0.257373,0.244318,69.0,69.25,0.119671,0.106855,0.0
50%,0.512443,0.516557,0.755952,0.8,0.286041,0.289916,77.0,77.5,0.137334,0.134795,0.0
75%,0.582031,0.569994,0.816761,0.885417,0.333333,0.333333,86.25,91.25,0.180298,0.168625,1.0
max,0.759615,0.701613,1.0,0.944444,0.475,0.487179,112.0,116.0,0.269507,0.236183,1.0


This returns the count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles of each column.
It does not distinguish between wins and losses, however, so we need to group by the `won` column.

In [5]:
# Means for wins and losses
data.groupby(['won']).mean()

won,efg,efg_opp,ftr,ftr_opp,orb,orb_opp,scr,scr_opp,tov,tov_opp
0,0.508639,0.563719,0.736848,0.826916,0.298262,0.300173,77.1,88.2,0.147922,0.124894
1,0.545198,0.469476,0.778616,0.74236,0.286903,0.279458,79.142857,69.857143,0.144946,0.157604


The group means show that wins come with an eFG% of almost 0.55 (versus 0.51 in losses), and that the free-throw factor differs more between wins and losses than turnovers or rebounding do.
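To make that comparison explicit, the sketch below subtracts the mean in losses from the mean in wins for each factor and sorts by the size of the gap. The numbers are copied from the grouped output above (the `scr` columns are left out); the names `means` and `gap` are introduced here only for illustration.

```python
import pandas as pd

# Group means copied from the output above (index: won = 0 or 1)
means = pd.DataFrame(
    {
        "efg": [0.508639, 0.545198],
        "efg_opp": [0.563719, 0.469476],
        "ftr": [0.736848, 0.778616],
        "ftr_opp": [0.826916, 0.742360],
        "orb": [0.298262, 0.286903],
        "orb_opp": [0.300173, 0.279458],
        "tov": [0.147922, 0.144946],
        "tov_opp": [0.124894, 0.157604],
    },
    index=pd.Index([0, 1], name="won"),
)

# Mean in wins minus mean in losses, largest absolute gap first
gap = means.loc[1] - means.loc[0]
gap = gap.reindex(gap.abs().sort_values(ascending=False).index)
print(gap)
```

These are raw differences on each factor's own scale, so they describe the data rather than measure how much each factor contributes to winning.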

### Model training

The data must be split into a training set and a test set.

In [7]:
# Select the input features (X) and the target (Y)
X = data[['efg','efg_opp','ftr','ftr_opp','orb','orb_opp','tov','tov_opp']]
Y = data['won']
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=4)
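With only 34 games, a purely random split can leave the test set with a win/loss ratio quite different from the full dataset. One optional variation, which this notebook does not use, is the `stratify` parameter of `train_test_split`. The sketch below applies it to the first ten rows shown earlier, restricted to two features for brevity; the name `sample` is introduced here only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# First ten rows of the dataset shown above (two features only, for brevity)
sample = pd.DataFrame({
    "efg": [0.525, 0.516949, 0.575472, 0.373016, 0.675,
            0.530303, 0.401786, 0.453333, 0.507937, 0.492424],
    "efg_opp": [0.5, 0.515873, 0.490909, 0.689394, 0.410448,
                0.470588, 0.314286, 0.472727, 0.45283, 0.552632],
    "won": [1, 0, 1, 0, 1, 1, 1, 1, 1, 0],
})
X = sample[["efg", "efg_opp"]]
Y = sample["won"]

# stratify=Y keeps the win/loss ratio similar in both subsets
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=4, stratify=Y
)
print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```

Stratifying guarantees that both classes appear in the test set, which matters when accuracy is computed on only a handful of games.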

Now the model is trained.

In [9]:
# Create the model with scikit-learn and train it (in the same step)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(x_train, y_train)

The coefficients of the model are shown below.

In [12]:
clf.coef_

array([[ 0.08792366, -0.57117837,  0.11265059, -0.23421202, -0.19882258,
        -0.07891725, -0.04218186,  0.20709843]])
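The array is easier to read when each coefficient is paired with its feature name. The sketch below does this with the values copied from the output above, in the same column order as `X`; inside the notebook the equivalent would be `pd.Series(clf.coef_[0], index=X.columns)`. The names `coef_by_feature` and `ordered` are introduced here only for illustration.

```python
import pandas as pd

features = ["efg", "efg_opp", "ftr", "ftr_opp", "orb", "orb_opp", "tov", "tov_opp"]
# Values copied from the clf.coef_ output above
coef = [0.08792366, -0.57117837, 0.11265059, -0.23421202,
        -0.19882258, -0.07891725, -0.04218186, 0.20709843]

coef_by_feature = pd.Series(coef, index=features, name="coefficient")
# Order by absolute size, keeping the sign visible
ordered = coef_by_feature.reindex(
    coef_by_feature.abs().sort_values(ascending=False).index
)
print(ordered)
```

Because the features were not standardized before fitting, the magnitudes also depend on the scale of each column, so the ranking is only a rough guide.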

### Error calculation

To estimate the error, we use the data held out for testing. The `metrics` module was already imported at the start, so the first step is to compute the predictions:

In [11]:
# Compute the predicted results
y_pred = clf.predict(x_test)

Accuracy (fraction of correct predictions on the test set):

In [12]:
metrics.accuracy_score(y_test, y_pred)

0.5454545454545454

We show the actual value (y_test) and the model's prediction (y_pred).

In [13]:
'Actual: ', list(y_test), 'Predicted: ', list(y_pred)

('Actual: ',
 [1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
 'Predicted: ',
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
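The two lists show that the model predicted a loss for every test game, so the accuracy equals the share of losses in the test set. A confusion matrix and the recall for wins make this visible. The sketch below uses the lists copied from the output above; `cm` and `recall_wins` are names introduced here only for illustration.

```python
from sklearn import metrics

# Lists copied from the output above
y_test = [1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1])
accuracy = metrics.accuracy_score(y_test, y_pred)
# Fraction of actual wins that the model identified as wins
recall_wins = metrics.recall_score(y_test, y_pred, pos_label=1)
print(cm)
print(accuracy, recall_wins)
```

All six losses are classified correctly and all five wins are missed, which is why accuracy alone gives an incomplete picture here.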

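### Conclusions

Judging by the absolute value of the coefficients, the largest weight falls on the shooting factor, above all the opponent's eFG% (-0.57), followed by the opponent's free-throw factor (-0.23), the opponent's turnovers (0.21) and offensive rebounding (-0.20). The opponent's turnovers affect the outcome more than Obradoiro's own (0.21 versus -0.04).  
(percentages calculated from the sum of the absolute values)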
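The notebook does not show how those percentages were obtained. One plausible reading, sketched below as an assumption, is each coefficient's absolute value divided by the sum of all absolute values. The coefficients are copied from the `clf.coef_` output; `abs_coef` and `share` are names introduced here only for illustration.

```python
import numpy as np

features = ["efg", "efg_opp", "ftr", "ftr_opp", "orb", "orb_opp", "tov", "tov_opp"]
# Values copied from the clf.coef_ output
coef = np.array([0.08792366, -0.57117837, 0.11265059, -0.23421202,
                 -0.19882258, -0.07891725, -0.04218186, 0.20709843])

abs_coef = np.abs(coef)
# Assumed definition: percentage of the total absolute weight
share = 100 * abs_coef / abs_coef.sum()

# Print features from largest to smallest share
for name, pct in sorted(zip(features, share), key=lambda pair: -pair[1]):
    print(f"{name:8s} {pct:5.1f}%")
```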