# Lab Exercise: Linear, Multiple, and Logistic Regression

Author: Adrián Arroyo Calle




Using the well-known IRIS dataset, perform a classification with multiple linear regression. Two thirds of the data, chosen at random in a stratified manner, are used for training, and the remaining third for testing.

In [42]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.model_selection import StratifiedKFold

We load the IRIS dataset

In [43]:
iris = load_iris(return_X_y=False)

We normalize the input features to the range `[0,1]`

In [44]:
scaler = MinMaxScaler()
scaled_iris_data = scaler.fit_transform(iris.data)
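As a rough illustration of what this scaling step computes, the sketch below applies the min-max formula `(x - min) / (max - min)` column by column with plain NumPy. The toy matrix `X` is made up for the example and is not part of the exercise; the sketch also assumes no column is constant, a case that `MinMaxScaler` handles on its own.

```python
import numpy as np

# Hypothetical toy matrix: 3 samples, 2 features (not the IRIS data)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

# Per-feature (per-column) minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-max scaling: every column ends up in [0, 1]
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

Each feature is scaled independently, so features measured in different units end up on a comparable scale.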

We binarize (one-hot encode) the output classes so that we can fit one linear regressor per class

In [45]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
bin_iris_target = lb.fit_transform(iris.target)
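For three integer classes, the binarized target has one column per class, with a 1 in the column of the true class. A minimal NumPy sketch of the same idea, using a hypothetical `labels` array rather than the real targets:

```python
import numpy as np

# Hypothetical class labels for four samples
labels = np.array([0, 2, 1, 2])
n_classes = labels.max() + 1

# Row i of the identity matrix is the one-hot code of class i
one_hot = np.eye(n_classes, dtype=int)[labels]
print(one_hot)

# argmax along each row recovers the original label
recovered = one_hot.argmax(axis=1)
```

The `argmax` in the last line is the same trick used later to turn binarized targets back into class indices.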

We make a stratified holdout split, with 33% of the data held out for testing

In [46]:
x_train,x_test,y_train,y_test = train_test_split(scaled_iris_data,bin_iris_target,test_size=0.33,stratify=bin_iris_target)

In [47]:
y_test.shape[1]

3
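To see what stratification buys us, the following sketch splits a synthetic label vector with the same class balance as IRIS (50 samples per class) and counts the classes on each side. The fixed `random_state=0` is an assumption added here only to make the sketch repeatable; the cell above does not set one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the IRIS targets: 3 classes, 50 samples each
y = np.repeat(np.arange(3), 50)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

# random_state is an assumption, used only for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

# Class counts on each side of the split
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "test:", test_counts)
```

With 150 samples, `test_size=0.33` yields 50 test samples, and stratification keeps the per-class counts within one sample of each other.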

In [48]:
y_predict = np.zeros(y_test.shape, dtype=float)
for i in range(y_test.shape[1]):
    reg = LinearRegression()
    reg.fit(x_train,y_train[:,i])
    y_predict[:,i] = reg.predict(x_test)
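The loop above fits one ordinary least-squares model per target column. As a sketch of what that amounts to, the example below solves the same kind of problem with `np.linalg.lstsq` on random toy data (an assumption, not the IRIS data) and checks that solving column by column matches a single solve over the whole one-hot matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 30 samples, 4 features, 3 classes
X = rng.random((30, 4))
labels = rng.integers(0, 3, size=30)
Y = np.eye(3)[labels]  # one-hot targets

# Append a column of ones so the model has an intercept
A = np.hstack([X, np.ones((len(X), 1))])

# One least-squares solve for all three target columns at once
W_joint, *_ = np.linalg.lstsq(A, Y, rcond=None)

# The same thing, one target column at a time (as in the loop above)
W_loop = np.column_stack([
    np.linalg.lstsq(A, Y[:, i], rcond=None)[0]
    for i in range(Y.shape[1])
])

# Scores per class, and the predicted class as the largest score
scores = A @ W_joint
predicted = scores.argmax(axis=1)
```

Because the targets are independent in least squares, the per-column loop and the joint solve give the same coefficients.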

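Now, on the test data, we check which regressor gives the highest output for each sample and take it as the predicted class (the outputs are regression scores rather than true probabilities)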

In [49]:
score = np.sum(np.argmax(y_predict,axis=1) == np.argmax(y_test, axis=1))/y_test.shape[0]
print("Accuracy: %f" % score)

Accuracy: 0.820000
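The decision rule used here can be sketched on a handful of made-up scores. The values in `scores` are hypothetical and chosen to show that linear regression outputs may fall outside `[0,1]`, which is why they are better read as scores than as probabilities.

```python
import numpy as np

# Hypothetical regressor outputs for 3 samples (rows) and 3 classes (columns)
scores = np.array([[ 1.08, -0.05, -0.03],
                   [ 0.10,  0.55,  0.35],
                   [-0.12,  0.40,  0.72]])

# Hypothetical one-hot ground truth for the same samples
true_onehot = np.array([[1, 0, 0],
                        [0, 0, 1],
                        [0, 0, 1]])

# Predicted class = column with the largest score
pred = scores.argmax(axis=1)
true = true_onehot.argmax(axis=1)

# Accuracy = fraction of samples where prediction and truth agree
accuracy = np.mean(pred == true)
print(pred, true, accuracy)
```

`np.mean(pred == true)` is equivalent to the sum-and-divide expression in the cell above.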


Confusion matrix

In [50]:
s = accuracy_score(np.argmax(y_test,axis=1), np.argmax(y_predict, axis=1))
m = confusion_matrix(np.argmax(y_test,axis=1), np.argmax(y_predict, axis=1))
m

array([[16,  1,  0],
       [ 0, 13,  4],
       [ 0,  4, 12]])
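The matrix can be read row by row (true class) and column by column (predicted class), which is the convention `confusion_matrix` follows. The sketch below derives the overall accuracy and the per-class recall and precision from the matrix shown above.

```python
import numpy as np

# Confusion matrix from the output above (rows: true, columns: predicted)
m = np.array([[16,  1,  0],
              [ 0, 13,  4],
              [ 0,  4, 12]])

# Correct predictions sit on the diagonal
accuracy = np.trace(m) / m.sum()

# Recall: fraction of each true class that was recovered (row-wise)
recall = np.diag(m) / m.sum(axis=1)

# Precision: fraction of each predicted class that was correct (column-wise)
precision = np.diag(m) / m.sum(axis=0)

print(accuracy, recall, precision)
```

The diagonal sums to 41 out of 50 test samples, which matches the 0.82 accuracy reported earlier; almost all errors are confusions between the second and third classes.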

Cross-validation

In [51]:
score = 0
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(scaled_iris_data,iris.target):
    x_train, x_test = scaled_iris_data[train_index], scaled_iris_data[test_index]
    y_train, y_test = bin_iris_target[train_index], bin_iris_target[test_index]
    y_predict = np.zeros(y_test.shape,dtype=float)
    for i in range(y_test.shape[1]):
        reg = LinearRegression()
        reg.fit(x_train,y_train[:,i])
        y_predict[:,i] = reg.predict(x_test)
    score += accuracy_score(np.argmax(y_test,axis=1), np.argmax(y_predict, axis=1))
print("Accuracy: %f" % (score/10))

Accuracy: 0.826667
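To make the fold structure concrete, this sketch runs `StratifiedKFold` on a synthetic label vector with the same balance as IRIS (an assumption standing in for `iris.target`) and counts the classes in every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the IRIS targets: 3 classes, 50 samples each
y = np.repeat(np.arange(3), 50)
X = np.zeros((len(y), 1))  # features are irrelevant for the fold layout

skf = StratifiedKFold(n_splits=10)

# Class counts inside each of the 10 test folds
fold_counts = [np.bincount(y[test_idx], minlength=3)
               for _, test_idx in skf.split(X, y)]

for k, counts in enumerate(fold_counts):
    print("fold", k, "test class counts:", counts)
```

Each test fold holds 15 samples, 5 per class, so every fold accuracy moves in steps of 1/15 and the reported figure is the mean of ten such values.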


Logistic regression

Holdout method

In [52]:
x_train,x_test,y_train,y_test = train_test_split(scaled_iris_data,iris.target,test_size=0.33,stratify=iris.target)

In [53]:
clf = LogisticRegression()
clf.fit(x_train,y_train)
y_predict = clf.predict(x_test)
score = accuracy_score(y_test, y_predict)
print("Score: %f" % score)

Score: 0.820000
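Unlike the linear regressors, a fitted `LogisticRegression` exposes class probabilities through `predict_proba`. The sketch below is only illustrative: it fits on the full scaled dataset (an assumption made to keep it short, not the holdout split used above) and checks how the probabilities relate to `predict`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X = MinMaxScaler().fit_transform(iris.data)

# Illustrative fit on all samples; not the holdout protocol of the exercise
clf = LogisticRegression().fit(X, iris.target)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X[:5])
labels = clf.predict(X[:5])

# predict() returns the class with the highest probability
same = clf.classes_[proba.argmax(axis=1)] == labels
print(proba.round(3), labels, same.all())
```

Here the outputs are proper probabilities, so taking the most probable class is a more natural decision rule than comparing raw regression scores.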


In [54]:
m = confusion_matrix(y_test, y_predict)
m

array([[17,  0,  0],
       [ 0,  9,  7],
       [ 0,  2, 15]])

Cross-validation

In [55]:
score = 0
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(scaled_iris_data,iris.target):
    x_train, x_test = scaled_iris_data[train_index], scaled_iris_data[test_index]
    y_train, y_test = iris.target[train_index], iris.target[test_index]
    reg = LogisticRegression()
    reg.fit(x_train,y_train)
    y_predict = reg.predict(x_test)
    score += accuracy_score(y_test, y_predict)
print("Accuracy: %f" % (score/10))

Accuracy: 0.840000
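For a standard classifier such as `LogisticRegression`, the manual loop can also be written with `cross_val_score`, which by default scores a classifier by accuracy. This is a sketch of that alternative, not the code used to produce the figure above; the exact numbers may differ between scikit-learn versions, so no expected output is given.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X = MinMaxScaler().fit_transform(iris.data)
y = iris.target

# Same fold layout as the manual loop: 10 stratified folds, no shuffling
skf = StratifiedKFold(n_splits=10)

# One accuracy value per fold
fold_scores = cross_val_score(LogisticRegression(), X, y, cv=skf)

print("Accuracy: %f (+/- %f)" % (fold_scores.mean(), fold_scores.std()))
```

Keeping the per-fold scores, instead of only their sum, also lets us report the spread across folds next to the mean.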
