# Exercise: Classification with XGBoost

- Using the data from the exercise [Exercise: Bankruptcy Classification](../module_2/2_05.ipynb), train an XGBoost model.
- Create a test set with 20% of the samples. You can also keep a validation set, which we will use later.
- The documentation for this algorithm is available at:
    - https://docs.aws.amazon.com/es_es/sagemaker/latest/dg/xgboost.html
    - https://docs.aws.amazon.com/es_es/sagemaker/latest/dg/xgboost.html#xgboost-modes (see the section on using XGBoost as a built-in algorithm, "Utilizar XGBoost como algoritmo integrado")
- You can set the parameter "eval_metric": "auc" to evaluate the model with the area under the ROC curve. More information: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters
- There is a worked example at https://sagemaker-examples.readthedocs.io/en/latest/hyperparameter_tuning/xgboost_random_log/hpo_xgboost_random_log.html
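
The exercise asks for a 20% test set and, optionally, a separate validation set. One way to get both is to call `train_test_split` twice. The sketch below is only an illustration: it uses synthetic data instead of the bankruptcy dataset, and the 60/20/20 proportions and `stratify` option are assumptions, not part of the exercise statement.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 5 features,
# roughly 5% positive labels (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

# Step 1: hold out 20% of all samples as the test set.
# stratify keeps the share of positives similar in every split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Step 2: split the remaining 80% into train and validation.
# 25% of the remainder equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying is worth considering when one class is rare, because a purely random split can leave very few positives in the smaller sets.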

In [3]:
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name

bucket = sess.default_bucket()
prefix = 'module_4/part_3'

print(role)
print(sess)
print(region)
print(bucket)
print(prefix)

arn:aws:iam::467432373215:role/service-role/AmazonSageMaker-ExecutionRole-20221206T164397
<sagemaker.session.Session object at 0x7fab91aad410>
eu-west-1
sagemaker-eu-west-1-467432373215
module_4/part_3


In [4]:
import pandas as pd

In [5]:
df = pd.read_csv('../module_2/data/data.csv')
df

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.405750,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.827890,0.290202,0.026601,0.564050,1,0.016469
1,1,0.464291,0.538214,0.516730,0.610235,0.610235,0.998946,0.797380,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.601450,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.774670,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.998700,0.796967,0.808966,0.303350,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.035490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6814,0,0.493687,0.539468,0.543230,0.604455,0.604462,0.998992,0.797409,0.809331,0.303510,...,0.799927,0.000466,0.623620,0.604455,0.840359,0.279606,0.027064,0.566193,1,0.029890
6815,0,0.475162,0.538269,0.524172,0.598308,0.598308,0.998992,0.797414,0.809327,0.303520,...,0.799748,0.001959,0.623931,0.598306,0.840306,0.278132,0.027009,0.566018,1,0.038284
6816,0,0.472725,0.533744,0.520638,0.610444,0.610213,0.998984,0.797401,0.809317,0.303512,...,0.797778,0.002840,0.624156,0.610441,0.840138,0.275789,0.026791,0.565158,1,0.097649
6817,0,0.506264,0.559911,0.554045,0.607850,0.607850,0.999074,0.797500,0.809399,0.303498,...,0.811808,0.002837,0.623957,0.607846,0.841084,0.277547,0.026822,0.565302,1,0.044009


In [6]:
# The first column ("Bankrupt?") is the target; the remaining columns are the features.
y = df.iloc[:, 0]
x = df.iloc[:, 1:]

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Hold out 20% of the rows; this split is uploaded as the validation channel further down.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=1)

In [10]:
# Put the label first: the built-in XGBoost algorithm reads the target from the first CSV column.
train = pd.concat([y_train, x_train], axis=1)
validation = pd.concat([y_val, x_val], axis=1)

In [11]:
train.to_csv('train.csv', index=False, header=False)
validation.to_csv('validation.csv', index=False, header=False)
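
For CSV training input, the SageMaker built-in XGBoost algorithm expects the target in the first column and no header row, which is why the cells above concatenate the label first and write with `header=False`. The sketch below shows the same layout on a tiny made-up frame; the column names `feature_a` and `feature_b` and all values are hypothetical.

```python
import io
import pandas as pd

# Tiny hypothetical frame where the target is NOT in the first position.
df_demo = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3],
    "Bankrupt?": [1, 0, 0],
    "feature_b": [5.0, 6.0, 7.0],
})

target = "Bankrupt?"
# Move the target to the front, keeping the other columns in their original order.
ordered = df_demo[[target] + [c for c in df_demo.columns if c != target]]

# Write to an in-memory buffer without header or index, as done for train.csv above.
buffer = io.StringIO()
ordered.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Inspecting the first line of the file this way is a quick check before uploading to S3.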

In [12]:
sess.upload_data(path='train.csv', bucket=bucket, key_prefix=f'{prefix}/data')

's3://sagemaker-eu-west-1-467432373215/module_4/part_3/data/train.csv'

In [13]:
sess.upload_data(path='validation.csv', bucket=bucket, key_prefix=f'{prefix}/data')

's3://sagemaker-eu-west-1-467432373215/module_4/part_3/data/validation.csv'

#### Training the XGBoost model
- https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

In [14]:
image = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
print(image)

141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-xgboost:1.5-1


In [15]:
s3_train_data = f's3://{bucket}/{prefix}/data/train.csv'
s3_validation_data = f's3://{bucket}/{prefix}/data/validation.csv'

print(s3_train_data)
print(s3_validation_data)


s3://sagemaker-eu-west-1-467432373215/module_4/part_3/data/train.csv
s3://sagemaker-eu-west-1-467432373215/module_4/part_3/data/validation.csv


In [16]:
train_input = sagemaker.TrainingInput(
    s3_train_data, 
    content_type="text/csv",
)
validation_input = sagemaker.TrainingInput(
    s3_validation_data,
    content_type="text/csv",
)

data_channels = {
    'train': train_input, 
    'validation': validation_input
}


In [17]:
s3_output_location = f's3://{bucket}/{prefix}/output'

hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "binary:logistic",
    "num_round": "50",
    "eval_metric": "auc",
}


estimator = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    hyperparameters=hyperparameters,
    instance_type="ml.c4.xlarge",
    output_path=s3_output_location,
    sagemaker_session=sess,
)
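
The `"eval_metric": "auc"` setting reports the area under the ROC curve. One way to read that number: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition in plain Python. The helper `roc_auc` and the toy labels and scores are made up for illustration; this is not how XGBoost computes the metric internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Because AUC depends only on the ranking of the scores, it is less sensitive to class imbalance than plain accuracy.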


In [19]:
jobname = 'xgboost-quiebras-auc'

estimator.fit(
    inputs=data_channels,
    job_name=jobname,
)

2022-12-14 11:33:52 Starting - Starting the training job...
2022-12-14 11:34:15 Starting - Preparing the instances for training
ProfilerReport-1671017631: InProgress
............
2022-12-14 11:36:18 Downloading - Downloading input data...
2022-12-14 11:36:48 Training - Training image download completed. Training in progress..
[2022-12-14 11:36:50.313 ip-10-0-210-219.eu-west-1.compute.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-12-14:11:36:50:INFO] Imported framework sagemaker_xgboost_container.training
[2022-12-14:11:36:50:INFO] Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
[2022-12-14:11:36:50:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2022-12-14:11:36:50:INFO] No GPUs detected (normal if no gpus installed)
[2022-12-14:11:36:50:INFO] Running XGBoost Sagemaker in algorithm mode
[20