# Exercise 1 (1 pt.)
Clean the dataset of errors and irrelevant elements. Justify the decisions you made in a markdown cell.

The dataset contains several patients, each of whom has visited the doctor at least twice. We want to build a model that predicts the patient's status among Demented, Nondemented, and Converted. This will let us identify in time, above all, those patients whose initial diagnostic tests indicated no dementia but who ended up developing it.
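Since every patient contributes several rows, a plain random split can place visits of the same patient in both the training and the test set. The sketch below shows one possible way to keep each patient on a single side, using scikit-learn's `GroupShuffleSplit`. The patient ids and the feature matrix are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical patient ids, one entry per visit (not real data).
groups = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4])
# Placeholder feature matrix with one row per visit.
X = np.arange(len(groups)).reshape(-1, 1)

# Hold out roughly 20% of the patients (not 20% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=groups))

train_patients = set(groups[train_idx])
test_patients = set(groups[test_idx])
print("train patients:", sorted(train_patients))
print("test patients:", sorted(test_patients))
```

With this kind of split, the test score reflects patients the model has never seen, which is closer to the stated goal of flagging new patients early.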

In [2]:
# Load the data from the CSV and see what it looks like

import pandas as pd
import numpy as np

dataset = pd.read_csv('dementia_dataset.csv')

def transform_group(df):
    # Create three new columns with initial values of 0
    df['Nondemented'] = 0
    df['Demented'] = 0
    df['Converted'] = 0

    # Set the corresponding column to 1 based on the value in the "Group" column
    df.loc[df['Group'] == 'Nondemented', 'Nondemented'] = 1
    df.loc[df['Group'] == 'Demented', 'Demented'] = 1
    df.loc[df['Group'] == 'Converted', 'Converted'] = 1

    # Drop the original "Group" column
    df.drop('Group', axis=1, inplace=True)

    return df

def hand_to_binary(hand):
    if hand == 'R':
        return 1
    elif hand == 'L':
        return 0
    else:
        return None

def male_to_binary(male):
    if male == 'M':
        return 1
    else:
        return 0

dataset['Hand'] = dataset['Hand'].apply(hand_to_binary)
dataset['M/F'] = dataset['M/F'].apply(male_to_binary)
dataset = transform_group(dataset)

new_dataset = dataset.drop(['MRI ID', 'EDUC'], axis=1)

from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # Label encoder for the columns that contain strings
new_dataset['Subject ID'] = le.fit_transform(new_dataset['Subject ID'])

imputer = KNNImputer(missing_values=np.nan, n_neighbors=5, weights='distance', metric='nan_euclidean')
values = imputer.fit_transform(new_dataset)

dataset_filled = pd.DataFrame(values, columns=new_dataset.columns)

from sklearn.model_selection import train_test_split

display(dataset_filled)

data_x = dataset_filled[['Subject ID', 'Visit', 'MR Delay', 'M/F', 'Hand', 'Age', 'SES', 'MMSE', 'CDR', 'eTIV', 'nWBV', 'ASF']]
data_y = dataset_filled[['Nondemented', 'Demented', 'Converted']]

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.2, random_state=42)

| | Subject ID | Visit | MR Delay | M/F | Hand | Age | SES | MMSE | CDR | eTIV | nWBV | ASF | Nondemented | Demented | Converted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 87.0 | 2.000000 | 27.0 | 0.0 | 1987.0 | 0.696 | 0.883 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 2.0 | 457.0 | 1.0 | 1.0 | 88.0 | 2.000000 | 30.0 | 0.0 | 2004.0 | 0.681 | 0.876 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 75.0 | 3.135793 | 23.0 | 0.5 | 1678.0 | 0.736 | 1.046 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 2.0 | 560.0 | 1.0 | 1.0 | 76.0 | 2.006997 | 28.0 | 0.5 | 1738.0 | 0.713 | 1.010 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 3.0 | 1895.0 | 1.0 | 1.0 | 80.0 | 1.980747 | 22.0 | 0.5 | 1698.0 | 0.701 | 1.034 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 368 | 148.0 | 2.0 | 842.0 | 1.0 | 1.0 | 82.0 | 1.000000 | 28.0 | 0.5 | 1693.0 | 0.694 | 1.037 | 0.0 | 1.0 | 0.0 |
| 369 | 148.0 | 3.0 | 2297.0 | 1.0 | 1.0 | 86.0 | 1.000000 | 26.0 | 0.5 | 1688.0 | 0.675 | 1.040 | 0.0 | 1.0 | 0.0 |
| 370 | 149.0 | 1.0 | 0.0 | 0.0 | 1.0 | 61.0 | 2.000000 | 30.0 | 0.0 | 1319.0 | 0.801 | 1.331 | 1.0 | 0.0 | 0.0 |
| 371 | 149.0 | 2.0 | 763.0 | 0.0 | 1.0 | 63.0 | 2.000000 | 30.0 | 0.0 | 1327.0 | 0.796 | 1.323 | 1.0 | 0.0 | 0.0 |


#### Cleaning the data

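The Group, M/F, and Hand categories have been transformed into binary values: M/F and Hand each become a single 0/1 column, and Group is split into three 0/1 columns (Nondemented, Demented, Converted).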
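One side effect of storing Group as three 0/1 columns is that scikit-learn receives a two-dimensional target and treats the task as multilabel. In that mode a classifier can predict a row with no class at all, such as `[0. 0. 0.]`. The sketch below shows one possible alternative that collapses the three columns back into a single label per row. The small frame is invented for illustration.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Hypothetical one-hot target with the same column names as the notebook.
onehot = pd.DataFrame({
    'Nondemented': [1, 0, 0, 1],
    'Demented':    [0, 1, 0, 0],
    'Converted':   [0, 0, 1, 0],
})

# scikit-learn sees the 2D 0/1 matrix as a multilabel problem.
kind_onehot = type_of_target(onehot.to_numpy())

# idxmax(axis=1) returns, for each row, the name of the column holding the 1.
labels = onehot.idxmax(axis=1)
kind_labels = type_of_target(labels.to_numpy())

print(kind_onehot, kind_labels)
print(labels.tolist())
```

A single label column also makes it straightforward to compute a confusion matrix, since that metric expects one class per sample.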

Next, the EDUC column was removed from the dataset because the assignment statement does not mention it as relevant. MRI ID was also removed, since the identifier of an individual visit to the doctor is not considered relevant to whether a patient has dementia.

LabelEncoder was also used to turn the alphanumeric Subject ID cells into numbers, so that there would be no format problems.

Finally, the columns containing NaN values were identified, and KNNImputer (5 neighbors, distance weighting) was used to infer the missing values in SES and MMSE.
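To make the imputation step more concrete, here is a minimal sketch of how `KNNImputer` fills a gap from the nearest rows. The toy array and the choice of `n_neighbors=2` are assumptions made for the example; the notebook itself uses 5 neighbors on the real data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (not real data): the second row is missing its second value.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Distances are computed only on the coordinates both rows have,
# so here the first column decides which rows are nearest.
imputer = KNNImputer(n_neighbors=2, weights='distance', metric='nan_euclidean')
filled = imputer.fit_transform(X)

# Rows 0 and 2 are the two nearest and equally distant, so the gap
# is filled from their second-column values, 2.0 and 6.0.
print(filled)
```

Because imputed values are weighted averages of neighbours, they are generally not integers, which is why the imputed SES entries in the table are fractional.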

# Exercise 2 (1.5 pt.)

Use scikit-learn's multilayer perceptron. The minimum result to achieve is an accuracy of 65%.

##### Observations

The data must be scaled for the model to give decent results.
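As an illustration of the scaling step, the sketch below chains `StandardScaler` and `MLPClassifier` in a pipeline, so the scaler is fitted on the training rows only and then reused on the test rows. The synthetic data, the inflated first feature, and the small network size are assumptions for the example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (not the dementia dataset).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)
# Blow up one feature so the columns have very different ranges.
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data and applies
# the same transformation before every prediction.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
print(acc)
```

Keeping the scaler inside the pipeline avoids accidentally fitting it on the test set.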

In [11]:
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100), max_iter=500, alpha=0.0001, solver='adam', verbose=10, random_state=21, tol=0.000000001)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

print(y_pred)
print(np.array(y_test))

print(clf.score(x_test, y_test))


Iteration 1, loss = 1.99238050
Iteration 2, loss = 1.86512070
Iteration 3, loss = 1.76221869
Iteration 4, loss = 1.66389027
Iteration 5, loss = 1.57800542
Iteration 6, loss = 1.49361719
Iteration 7, loss = 1.41011146
Iteration 8, loss = 1.33270972
Iteration 9, loss = 1.25471251
Iteration 10, loss = 1.17230838
Iteration 11, loss = 1.09445632
Iteration 12, loss = 1.01416861
Iteration 13, loss = 0.93560762
Iteration 14, loss = 0.85858135
Iteration 15, loss = 0.78361560
Iteration 16, loss = 0.71591519
Iteration 17, loss = 0.65259096
Iteration 18, loss = 0.60227416
Iteration 19, loss = 0.55887511
Iteration 20, loss = 0.52433724
Iteration 21, loss = 0.49641988
Iteration 22, loss = 0.47215785
Iteration 23, loss = 0.44971629
Iteration 24, loss = 0.43171909
Iteration 25, loss = 0.41397956
Iteration 26, loss = 0.39909212
Iteration 27, loss = 0.38475645
Iteration 28, loss = 0.37227961
Iteration 29, loss = 0.36060686
Iteration 30, loss = 0.34942658
Iteration 31, loss = 0.33875993
Iteration 32, los



# Exercise 3 (1.5 pt.)

Use scikit-learn's Decision Tree algorithm to build a second model. Compute the accuracy and the confusion matrix. The minimum accuracy you must achieve is 85% (random_state=42).
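A minimal sketch of the workflow this exercise asks for is shown below: train a `DecisionTreeClassifier`, then compute the accuracy and the confusion matrix. It runs on synthetic three-class data with a single label per sample, which is an assumption of the example. It is not the notebook's solution and makes no claim about the score obtained on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

print(acc)
print(cm)
```

The diagonal of the matrix counts correct predictions, so the accuracy equals the sum of the diagonal divided by the number of test samples.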

#### Advantages
- Training is very fast
- The results are easy for a human to interpret (a white-box algorithm)
- It achieves good accuracy on some problems
- Trees can easily be converted into rules
- Data preparation is undemanding
- It can work with both quantitative and qualitative variables

#### Disadvantages
- Very sensitive to noise in the input
- Decision trees tend to overfit
- There is no guarantee that the generated tree is optimal
- Balancing the dataset before training is recommended
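Two of the points above can be illustrated briefly: limiting `max_depth` is one common way to curb overfitting, and `export_text` prints a fitted tree as readable rules. The sketch uses the iris dataset bundled with scikit-learn as a stand-in, and the depth of 2 is an arbitrary choice for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# An unconstrained tree keeps splitting until the leaves are pure.
deep = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Capping the depth gives a smaller tree that is less prone to overfitting.
shallow = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow.fit(iris.data, iris.target)

# Print the shallow tree as nested if/else style rules.
rules = export_text(shallow, feature_names=list(iris.feature_names))
print("deep tree depth:", deep.get_depth())
print("shallow tree depth:", shallow.get_depth())
print(rules)
```

For the balancing recommendation, `DecisionTreeClassifier` also accepts `class_weight='balanced'`, which reweights classes instead of resampling the data.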


##### Observations

## Examples of other things

In [16]:
## KNN 
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=2) 

model.fit(x_train, y_train)
prediction = model.predict(x_test)

print("Prediction")
print(prediction)
print("Test")
print(np.array(y_test))

print(model.score(x_test, y_test))




Prediction
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 0.]]
Test
[[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]