Dear candidate,

We are delighted that you are considering joining our data science team. As part of our evaluation process, we ask you to complete the following technical test.

**Task description:**

You are given an anonymized dataset consisting of columns named 'col1', 'col2', 'col3', and so on up to 'col20', plus a 'target' column that holds the target variable. Your task is to perform an exploratory data analysis (EDA) and build a machine learning model to predict 'target'.

**Task details:**

1. **Exploratory data analysis (EDA):** Perform a detailed exploratory analysis of the data. This should include, but is not limited to:
   - Descriptive statistics of the variables (minimum, maximum, mean, median, standard deviation, etc.).
   - Checks for missing or anomalous values.
   - Correlation analysis between the variables.
   - Visualizations to better understand the distributions and relationships in the data.

2. **Data preprocessing:** Carry out any cleaning or transformation the data needs, based on your EDA.

3. **Building a baseline model:** Train several machine learning models to predict 'target'. Start with simple models such as logistic regression and move on to more complex ones such as decision trees, random forest, SVM, XGBoost, etc.

4. **Model evaluation:** Evaluate the performance of each model using appropriate metrics. For example, if 'target' is a binary variable, you could consider precision, sensitivity, specificity, AUC-ROC, etc.

5. **Model selection:** Select the model you believe performs best. Justify your choice based on the evaluation metrics and any other relevant considerations.

6. **Prediction:** Use your selected model to make predictions on the dataset.

**Deliverables:**

Please provide the code you used to complete this task, together with a detailed report explaining your approach and results. The report should be well structured and easy to follow, so that the reviewers can understand your thought process and the decisions you made.

Good luck, and we look forward to seeing your solution!

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt  # plotting functions live in the pyplot submodule

from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_percentage_error
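
The metrics imported above are regression metrics. The brief also names precision, sensitivity, specificity, and AUC-ROC for the case where 'target' is binary, so a minimal sketch of those may be useful. The labels and scores below are made up for illustration, and the 0.5 threshold is an arbitrary assumption, not something taken from this notebook.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels and predicted scores, purely for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.7, 0.3, 0.8, 0.4, 0.2, 0.9, 0.1, 0.3])

# Turn scores into hard predictions; the 0.5 cut-off is an assumption.
y_pred = (y_score >= 0.5).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
sensitivity = recall_score(y_true, y_pred)   # tp / (tp + fn), also called recall
specificity = tn / (tn + fp)                 # derived from the confusion matrix
auc = roc_auc_score(y_true, y_score)         # uses the scores, not the hard predictions

print(f"precision={precision:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} auc={auc:.3f}")
```

Specificity is computed by hand here because it is derived directly from the confusion matrix counts; AUC-ROC is threshold-free, which is why it takes `y_score` rather than `y_pred`.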

In [3]:
url = './data/train.csv'
data = pd.read_csv(url)

In [4]:
df = data.copy()

## Starting with data preprocessing

In [5]:
df.head()

index,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target
0,4995,0.02,26.8,0.09,1.35,0.06,0.09,0.09,1.97,1.48,...,0.031,9.52,0.84,0.001,1.24,0.96,0.09,0.08,0.08,0
1,1709,1.13,3.95,0.32,3.95,0.03,4.66,0.56,1.89,1.36,...,0.17,13.83,1.15,0.006,10.55,3.76,0.02,0.45,0.06,0
2,7825,0.07,8.05,0.04,0.14,0.04,0.06,0.06,0.05,0.0,...,0.12,2.61,1.52,0.008,4.13,0.27,0.03,0.03,0.01,0
3,6918,0.09,26.4,0.03,0.92,0.09,0.57,0.0,0.36,0.03,...,0.055,5.22,1.96,0.0,11.32,0.25,0.09,0.01,0.05,0
4,5,0.94,14.47,0.03,2.88,0.003,0.8,0.43,1.38,0.11,...,0.135,9.75,1.89,0.006,27.17,5.42,0.08,0.19,0.02,1


In [6]:
df.tail()

index,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target
5592,3048,0.01,8.92,0.2,4.88,0.05,0.36,0.09,0.54,0.14,...,0.069,8.6,1.9,0.007,17.18,2.91,0.08,0.43,0.09,0
5593,2130,0.09,1.36,0.04,3.45,0.003,3.42,0.03,1.39,1.12,...,0.108,9.36,1.58,0.009,42.15,4.39,0.02,0.11,0.07,0
5594,5005,0.1,4.95,0.01,0.25,0.04,0.09,0.1,1.88,0.33,...,0.021,18.78,0.62,0.007,5.97,0.32,0.1,0.1,0.08,0
5595,2125,0.05,23.18,0.04,3.65,0.001,4.43,0.63,1.94,1.27,...,0.194,13.32,1.93,0.005,23.84,4.8,0.08,0.22,0.07,0
5596,3294,0.09,10.81,0.69,2.67,0.06,5.97,0.02,1.15,0.53,...,0.088,8.15,1.72,0.001,36.53,5.66,0.0,0.17,0.02,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5597 entries, 0 to 5596
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      5597 non-null   int64  
 1   col1    5597 non-null   float64
 2   col2    5597 non-null   float64
 3   col3    5597 non-null   float64
 4   col4    5597 non-null   float64
 5   col5    5597 non-null   float64
 6   col6    5597 non-null   float64
 7   col7    5597 non-null   float64
 8   col8    5597 non-null   float64
 9   col9    5597 non-null   float64
 10  col10   5597 non-null   float64
 11  col11   5597 non-null   float64
 12  col12   5597 non-null   float64
 13  col13   5597 non-null   float64
 14  col14   5597 non-null   float64
 15  col15   5597 non-null   float64
 16  col16   5597 non-null   float64
 17  col17   5597 non-null   float64
 18  col18   5597 non-null   float64
 19  col19   5597 non-null   float64
 20  col20   5597 non-null   float64
 21  target  5597 non-null   int64  
dtypes: float64(20), int64(2)

In [8]:
df.isnull().sum()

ID        0
col1      0
col2      0
col3      0
col4      0
col5      0
col6      0
col7      0
col8      0
col9      0
col10     0
col11     0
col12     0
col13     0
col14     0
col15     0
col16     0
col17     0
col18     0
col19     0
col20     0
target    0
dtype: int64

In [9]:
df[df.duplicated(keep=False)]

index,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target


### Observations
1. There are no null values.
2. There are no duplicate rows.
3. The data types are correct.
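
The three checks behind these observations can be gathered into one helper, which may be handy if the same checks need to be repeated on another file. This is a minimal sketch: `quality_report` and the `toy` frame are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarise null cells, duplicate rows, and non-numeric columns."""
    return {
        # Total number of missing cells across all columns.
        "null_cells": int(frame.isnull().sum().sum()),
        # Rows that repeat an earlier row (the first occurrence is not counted).
        "duplicate_rows": int(frame.duplicated().sum()),
        # Columns whose dtype is not numeric and may need encoding or fixing.
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }


# Tiny made-up frame with one missing value and one repeated row.
toy = pd.DataFrame({"col1": [0.1, 0.1, None], "target": [0, 0, 1]})
report = quality_report(toy)
print(report)
```

Note that `duplicated()` with its default `keep='first'` counts only the extra copies, whereas the `keep=False` call used in the cell above lists every row involved in a duplicate.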

# Exploratory analysis

In [10]:
df.describe(include='all')

statistic,ID,col1,col2,col3,col4,col5,col6,col7,col8,col9,...,col12,col13,col14,col15,col16,col17,col18,col19,col20,target
count,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,...,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0,5597.0
mean,3962.909952,0.674797,14.43884,0.163603,1.579702,0.042747,2.213093,0.25089,0.807654,0.766057,...,0.10005,9.819267,1.33188,0.005194,16.621249,2.931896,0.049602,0.149855,0.044999,0.114347
std,2311.129964,1.273677,8.851097,0.254659,1.22122,0.036072,2.581244,0.272472,0.652471,0.436435,...,0.057997,5.581795,0.568734,0.00296,17.729833,2.327347,0.028827,0.14417,0.026929,0.318261
min,0.0,0.0,-0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.001,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1963.0,0.04,6.85,0.03,0.55,0.008,0.1,0.05,0.09,0.4,...,0.049,4.94,1.01,0.003,2.22,0.82,0.02,0.04,0.02,0.0
50%,3922.0,0.07,14.39,0.05,1.21,0.04,0.55,0.09,0.75,0.76,...,0.103,9.86,1.42,0.005,7.84,2.43,0.05,0.08,0.05,0.0
75%,5960.0,0.29,22.26,0.1,2.51,0.07,4.33,0.45,1.39,1.16,...,0.151,14.69,1.76,0.008,29.98,4.67,0.07,0.25,0.07,0.0
max,7993.0,5.05,29.84,1.05,4.94,0.13,8.66,0.9,2.0,1.5,...,0.2,19.82,2.89,0.01,60.01,7.99,0.1,0.5,0.09,1.0
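
Two details in this summary may deserve a closer look. The mean of 'target' is about 0.114, which for a 0/1 column means roughly 11.4% of rows are in the positive class. Also, 'col2' has a minimum of -0.08, while every other feature shown has a minimum of at least 0. The sketch below shows one way to check both; the `toy` frame is a made-up stand-in for `df` and its values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for `df`; the values are illustrative only.
toy = pd.DataFrame({
    "col1": [0.02, 1.13, 0.07, 0.09],
    "col2": [26.8, -0.08, 8.05, 26.4],
    "target": [0, 0, 0, 1],
})

# Share of each class; for a 0/1 target the share of 1s equals the column mean.
class_share = toy["target"].value_counts(normalize=True)

# Feature columns whose minimum falls below zero.
mins = toy.drop(columns="target").min()
negative_cols = mins[mins < 0].index.tolist()

print(class_share)
print(negative_cols)
```

Whether a negative value is an error or a legitimate measurement cannot be decided from the anonymized column names alone, so the check only flags the columns for inspection.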


## Numerical variables