
Use the abalone dataset to predict the number of rings from the physical measurements using linear regression.

Note: As with trees, the number of rings in an abalone can be used to determine its age.
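
As a small illustration of the note above: the dataset's own description gives the age in years as the ring count plus 1.5. The sketch below applies that offset to the first few `Rings` values shown in this notebook; the variable names are only illustrative.

```python
import pandas as pd

# First few ring counts from the data shown in this notebook.
rings = pd.Series([15, 7, 9, 10, 7], name="Rings")

# Convention from the dataset description: age in years = rings + 1.5.
age_years = rings + 1.5
print(age_years.tolist())
```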

# Tasks:

1) Split the data into a feature matrix (X) and a target vector (y).

2) Create a train/test split of the data. Use a random state of 42 for consistency.

3) Use column transformers to transform the appropriate columns.

For the column transformations:

    a) Use column selectors to select the categorical columns and the numeric columns

    b) Use a OneHotEncoder to encode the categorical columns

    c) Use StandardScaler to scale the numeric columns

    d) Use a ColumnTransformer to match each transformation to its column type

    e) Transform the data and display the resulting NumPy arrays

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

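# abalone.data has no header row, so read_csv treats the first record as the column names here;
# passing header=None would keep that record as data.
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data Science - Coding Dojo/Data/abalone/abalone.data')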

df.head()

   M  0.455  0.365  0.095   0.514  0.2245   0.101   0.15  15
0  M   0.35  0.265   0.09  0.2255  0.0995  0.0485   0.07   7
1  F   0.53   0.42  0.135   0.677  0.2565  0.1415   0.21   9
2  M   0.44  0.365  0.125   0.516  0.2155   0.114  0.155  10
3  I   0.33  0.255   0.08   0.205  0.0895  0.0395  0.055   7
4  I  0.425    0.3  0.095  0.3515   0.141  0.0775   0.12   8


In [16]:
# Column names
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']

# Assign these names to the DataFrame columns
df.columns = column_names
df.head()

   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M    0.35     0.265    0.09        0.2255          0.0995          0.0485          0.07      7
1    F    0.53      0.42   0.135         0.677          0.2565          0.1415          0.21      9
2    M    0.44     0.365   0.125         0.516          0.2155           0.114         0.155     10
3    I    0.33     0.255    0.08         0.205          0.0895          0.0395         0.055      7
4    I   0.425       0.3   0.095        0.3515           0.141          0.0775          0.12      8


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4176 entries, 0 to 4175
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4176 non-null   object 
 1   Length          4176 non-null   float64
 2   Diameter        4176 non-null   float64
 3   Height          4176 non-null   float64
 4   Whole weight    4176 non-null   float64
 5   Shucked weight  4176 non-null   float64
 6   Viscera weight  4176 non-null   float64
 7   Shell weight    4176 non-null   float64
 8   Rings           4176 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [18]:
# 1. Split the data into a feature matrix (X) and a target vector (y)
X = df.drop(columns='Rings')
y = df['Rings']
# 2. Create a train/test split of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
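
When no `test_size` is given, `train_test_split` holds out 25% of the rows, and a fixed `random_state` makes the split repeatable. The sketch below checks both points on placeholder arrays with the same number of rows as `df` (4176); `X_demo` and `y_demo` are stand-ins, not the abalone data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with as many rows as the DataFrame above (4176).
X_demo = np.arange(4176).reshape(-1, 1)
y_demo = np.arange(4176)

# Default split: 75% train, 25% test.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=42)
print(np.array_equal(X_tr, X_tr2))
```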

In [19]:
# 3. Start the column transformation process
# Use column selectors to pick out the categorical and the numeric columns
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
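
A selector is a callable: given a DataFrame, it returns the names of the columns whose dtype matches. The sketch below shows this on a tiny hand-made frame; `demo` is illustrative only, and the `Sex` column is explicitly created with `object` dtype because the default dtype for strings depends on the pandas version.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Tiny illustrative frame; 'Sex' is forced to object dtype on purpose.
demo = pd.DataFrame({
    "Sex": pd.Series(["M", "F", "I"], dtype=object),
    "Length": [0.455, 0.53, 0.33],
    "Rings": [15, 9, 7],
})

# Calling a selector on a DataFrame returns the matching column names.
cat_cols = make_column_selector(dtype_include="object")(demo)
num_cols = make_column_selector(dtype_include="number")(demo)
print(cat_cols, num_cols)
```

In the notebook, `Rings` is dropped from `X` before the selectors are applied, so the numeric selector only sees the seven measurement columns.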

In [21]:
# Instantiate the StandardScaler and the OneHotEncoder
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore')
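
With `handle_unknown='ignore'`, a category that was not present when the encoder was fitted is encoded as a row of zeros instead of raising an error. A minimal sketch, where `"X"` is a made-up category used only to trigger that behavior:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_sex = np.array([["M"], ["F"], ["I"]], dtype=object)
new_sex = np.array([["F"], ["X"]], dtype=object)  # "X" is a hypothetical unseen value

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_sex)

# The encoder returns a sparse matrix by default; convert it for display.
encoded = enc.transform(new_sex).toarray()
print(enc.categories_[0])  # categories are stored in sorted order
print(encoded)             # the unseen "X" row is all zeros
```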

In [22]:
# Build (transformer, selector) tuples for the numeric and categorical columns
num_tuple = (scaler, num_selector)  # tells the transformer which columns to scale
cat_tuple = (ohe, cat_selector)  # tells the transformer which columns to one-hot encode

In [23]:
# Use a ColumnTransformer to match each transformation to its column type
from sklearn.compose import make_column_transformer
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder='passthrough')

In [25]:
# Fit the transformer on the training data
col_transformer.fit(X_train)
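
Fitting on the training data only means the scaling statistics come from the training set, and the test set is later transformed with those same statistics. The sketch below illustrates this with a bare `StandardScaler` on toy numbers (not the abalone data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Mean and standard deviation are learned from the training rows only.
demo_scaler = StandardScaler().fit(train)
print(demo_scaler.mean_, demo_scaler.scale_)

# The test row is scaled with the training statistics: (4 - mean) / std.
test_scaled = demo_scaler.transform(test)
print(test_scaled)
```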

In [30]:
# Finally, transform the training and test sets
X_train_processed = col_transformer.transform(X_train)
X_test_processed = col_transformer.transform(X_test)
X_train_processed
# Only the last expression in a cell is displayed, so the output below is X_test_processed
X_test_processed

array([[ 0.75390131,  0.927397  ,  0.8242306 , ...,  1.        ,
         0.        ,  0.        ],
       [ 0.54478103,  0.57260174,  0.23742158, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.08471641,  0.11643641,  0.12005978, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.46113292,  0.72465685,  0.9415924 , ...,  1.        ,
         0.        ,  0.        ],
       [ 0.96302159,  1.07945211,  1.52840142, ...,  1.        ,
         0.        ,  0.        ],
       [-1.21182931, -0.84657929, -0.70147285, ...,  1.        ,
         0.        ,  0.        ]])

In [31]:
# The data is hard to read in that form, so we convert it to a pandas DataFrame
X_train_df = pd.DataFrame(X_train_processed)
X_train_df.head()
# Columns 0-6 hold the scaled numeric features; columns 7-9 hold the one-hot encoded Sex categories.

           0         1          2          3          4          5          6    7    8    9
0  -1.546422  -1.55617  -1.053558  -1.258947  -1.260633  -1.337852  -1.212341  0.0  0.0  1.0
1   0.795725  0.521917   0.706869   0.605245   0.789463   0.749584   0.409117  0.0  0.0  1.0
2   0.252013  0.319177   0.354783    0.37885   0.602065   0.040129   0.172355  1.0  0.0  0.0
3   1.172142  0.927397   0.824231   1.234461   1.277152   1.490874   0.940037  1.0  0.0  0.0
4  -1.462774   -1.4548   -1.17092  -1.233452  -1.177094  -1.205966  -1.212341  0.0  0.0  1.0


In [32]:
X_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3132 entries, 0 to 3131
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3132 non-null   float64
 1   1       3132 non-null   float64
 2   2       3132 non-null   float64
 3   3       3132 non-null   float64
 4   4       3132 non-null   float64
 5   5       3132 non-null   float64
 6   6       3132 non-null   float64
 7   7       3132 non-null   float64
 8   8       3132 non-null   float64
 9   9       3132 non-null   float64
dtypes: float64(10)
memory usage: 244.8 KB
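
The exercise statement mentions linear regression, which this notebook does not reach. A possible next step is to chain the column transformer with a `LinearRegression` in a pipeline. The sketch below does so on synthetic data, so the target and the resulting score say nothing about the real abalone dataset; all names (`demo`, `rings`, `model`) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: NOT the abalone dataset.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Sex": pd.Series(rng.choice(["M", "F", "I"], size=n), dtype=object),
    "Length": rng.uniform(0.1, 0.8, size=n),
})
rings = 20 * demo["Length"] + rng.normal(0, 0.5, size=n)  # made-up target

X_tr, X_te, y_tr, y_te = train_test_split(demo, rings, random_state=42)

preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include="number")),
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include="object")),
)

# The pipeline fits the preprocessing on the training data only, then the regression.
model = make_pipeline(preprocessor, LinearRegression())
model.fit(X_tr, y_tr)

predictions = model.predict(X_te)
r2 = model.score(X_te, y_te)  # R^2 on the held-out rows
print(predictions.shape, round(r2, 3))
```

Wrapping both steps in one pipeline keeps the scaler and encoder from ever being fitted on test rows.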
