# Exercise: Classification with XGBoost

- Using the data from [Exercise: Bankruptcy Classification](../module_2/2_05.ipynb), train an XGBoost model.
- Generate a test set with 20% of the samples. You can also keep a validation set, which we will use later.
- The documentation for this algorithm is available at:
    - https://docs.aws.amazon.com/es_es/sagemaker/latest/dg/xgboost.html
    - https://docs.aws.amazon.com/es_es/sagemaker/latest/dg/xgboost.html#xgboost-modes See the section "Utilizar XGBoost como algoritmo integrado" (Use XGBoost as a built-in algorithm)
- You can use the parameter "eval_metric": "auc" to evaluate the model with the area under the ROC curve. More info at: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters
- There is an example at https://sagemaker-examples.readthedocs.io/en/latest/hyperparameter_tuning/xgboost_random_log/hpo_xgboost_random_log.html
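
For intuition about what `"eval_metric": "auc"` reports, here is a minimal pure-Python sketch of the area under the ROC curve, computed as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; they are not part of the exercise data, and this is not how XGBoost implements the metric internally.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen positive example
    gets a higher score than a randomly chosen negative one.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # pair ranked correctly
            elif p == n:
                correct += 0.5   # tie counts as half
    return correct / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = positive class, 0 = negative class.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so higher is better.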

In [None]:
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name

bucket = sess.default_bucket()
prefix = 'module_4/part_3'

print(role)
print(sess)
print(region)
print(bucket)
print(prefix)

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../module_2/data/data.csv')
df

In [None]:
# The label is taken from the first column; the remaining columns are the features.
y = df.iloc[:, 0]
x = df.iloc[:, 1:]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=1)
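
The split above produces only a training set and a validation set. If a separate held-out test set is also wanted, as the exercise statement suggests, one option is to split the row positions three ways. The following is a minimal sketch using only the standard library; the function name `three_way_split` and the 60/20/20 proportions are assumptions for illustration.

```python
import random


def three_way_split(n_rows, val_fraction=0.2, test_fraction=0.2, seed=1):
    """Shuffle row positions and cut them into train/validation/test lists."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be between 0 and 1")

    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # reproducible shuffle

    n_test = int(round(n_rows * test_fraction))
    n_val = int(round(n_rows * val_fraction))

    test_idx = positions[:n_test]
    val_idx = positions[n_test:n_test + n_val]
    train_idx = positions[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

With the DataFrame from this notebook, the returned positions could be passed to `df.iloc[...]`, for example `df.iloc[test_idx]` for the test rows.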

In [None]:
# SageMaker's built-in XGBoost expects the label in the first CSV column.
train = pd.concat([y_train, x_train], axis=1)
validation = pd.concat([y_val, x_val], axis=1)

In [None]:
# Write the files without index or header row, as the built-in algorithm expects for CSV input.
train.to_csv('train.csv', index=False, header=False)
validation.to_csv('validation.csv', index=False, header=False)
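
SageMaker's built-in XGBoost expects CSV training input without a header row and with the label in the first column, which is why the cells above put `y` first and pass `header=False`. A quick local sanity check before uploading could look like the sketch below. The helper name `check_xgboost_csv` and the in-memory sample are hypothetical, and binary 0/1 labels are assumed.

```python
import csv
import io


def check_xgboost_csv(handle, max_rows=1000):
    """Check that the first rows are fully numeric and start with a 0/1 label.

    Returns the number of rows inspected; raises ValueError on the first problem.
    """
    checked = 0
    for row_number, row in enumerate(csv.reader(handle), start=1):
        if row_number > max_rows:
            break
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            # A header row or an empty cell ends up here.
            raise ValueError(
                f"row {row_number} has a non-numeric or empty cell"
            ) from None
        if not values or values[0] not in (0.0, 1.0):
            raise ValueError(f"row {row_number} does not start with a 0/1 label")
        checked += 1
    return checked


# Hypothetical sample in the expected layout: label first, no header.
sample = io.StringIO("1,0.5,3.2\n0,1.1,0.7\n")
rows_checked = check_xgboost_csv(sample)
print(rows_checked)
```

For the files written above, the same check could be run with `with open('train.csv', newline='') as f: check_xgboost_csv(f)`.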

In [None]:
sess.upload_data(path='train.csv', bucket=bucket, key_prefix=f'{prefix}/data')

In [None]:
sess.upload_data(path='validation.csv', bucket=bucket, key_prefix=f'{prefix}/data')

#### Training the XGBoost model
- https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

In [None]:
image = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
print(image)

In [None]:
s3_train_data = f's3://{bucket}/{prefix}/data/train.csv'
s3_validation_data = f's3://{bucket}/{prefix}/data/validation.csv'

print(s3_train_data)
print(s3_validation_data)


In [None]:
train_input = sagemaker.TrainingInput(
    s3_train_data, 
    content_type="text/csv",
)
validation_input = sagemaker.TrainingInput(
    s3_validation_data,
    content_type="text/csv",
)

data_channels = {
    'train': train_input, 
    'validation': validation_input
}


In [None]:
s3_output_location = f's3://{bucket}/{prefix}/output'

hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": "50",
    "eval_metric": "auc",
}


estimator = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    hyperparameters=hyperparameters,
    instance_type="ml.c4.xlarge",
    output_path=s3_output_location,
    sagemaker_session=sess,
)
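
With `"objective": "binary:logistic"`, the trained model outputs a probability for the positive class, obtained by applying the logistic function to the model's raw score (log-odds). Turning that probability into a class label requires a threshold. The sketch below illustrates this with made-up raw scores; the function names and the 0.5 threshold are assumptions, not something fixed by the exercise.

```python
import math


def sigmoid(raw_score):
    """Logistic function: maps a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-raw_score))


def to_label(probability, threshold=0.5):
    """Classify as positive (1) when the probability reaches the threshold."""
    return 1 if probability >= threshold else 0


# Made-up raw scores, for illustration only.
raw_scores = [-2.0, 0.0, 1.5]
probabilities = [sigmoid(s) for s in raw_scores]
labels = [to_label(p) for p in probabilities]
print(labels)
```

Note that AUC is computed from the ranking of the probabilities, so it does not depend on the threshold chosen here.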


In [None]:
jobname = 'xgboost-quiebras-auc'

estimator.fit(
    inputs=data_channels,
    job_name=jobname,
)
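
SageMaker training job names must be unique within an AWS account and region, so re-running the cell above with the same fixed `jobname` is rejected. One common workaround is to append a timestamp. The sketch below is one possible way to do this; the helper name `unique_job_name` is hypothetical. It also checks the name against the pattern given in the CreateTrainingJob API reference (letters, digits, and hyphens, at most 63 characters).

```python
import re
from datetime import datetime, timezone

# Naming pattern for TrainingJobName as given in the CreateTrainingJob
# API reference: alphanumerics and hyphens, 1 to 63 characters.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def unique_job_name(base, now=None):
    """Append a UTC timestamp to `base` and validate the resulting name."""
    now = now or datetime.now(timezone.utc)
    name = f"{base}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name


jobname = unique_job_name("xgboost-quiebras-auc")
print(jobname)
```

The resulting `jobname` could then be passed to `estimator.fit(..., job_name=jobname)`. Alternatively, omitting `job_name` lets the SDK generate a name automatically.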