# Data Preprocessing - Practical Example

This demonstration preprocesses a dataset on the United States population. The data are a modified subset of [this dataset](https://archive.ics.uci.edu/ml/datasets/Adult) and are stored in the file `census.csv`.

***Laura and Celeste***

In [1]:
import pandas as pd

In [2]:
# Let's import the data:
df = pd.read_csv('census.csv')
print(df)

        age         workclass         education                race     sex  \
0      39.0         State-gov         Bachelors               White    Male   
1      50.0  Self-emp-not-inc         Bachelors               White    Male   
2      38.0           Private       High-school               White    Male   
3      53.0           Private  Some-high-school               Black    Male   
4      28.0           Private         Bachelors               Black  Female   
...     ...               ...               ...                 ...     ...   
41711  33.0           Private         Bachelors               White    Male   
41712  39.0           Private         Bachelors               White  Female   
41713  38.0           Private         Bachelors               White    Male   
41714  44.0           Private         Bachelors  Asian-Pac-Islander    Male   
41715  35.0      Self-emp-inc         Bachelors               White    Male   

       hours_per_week  USA_born  label  
0         

In [3]:
# Let's look at the first rows of the dataset:
print(df.head())

    age         workclass         education   race     sex  hours_per_week  \
0  39.0         State-gov         Bachelors  White    Male            40.0   
1  50.0  Self-emp-not-inc         Bachelors  White    Male            13.0   
2  38.0           Private       High-school  White    Male            40.0   
3  53.0           Private  Some-high-school  Black    Male            40.0   
4  28.0           Private         Bachelors  Black  Female            40.0   

   USA_born  label  
0       1.0  <=50K  
1       1.0  <=50K  
2       1.0  <=50K  
3       1.0  <=50K  
4       0.0  <=50K  


In [4]:
# Column description (types and non-null counts, then summary statistics):
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41716 entries, 0 to 41715
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41617 non-null  float64
 1   workclass       41705 non-null  object 
 2   education       41702 non-null  object 
 3   race            41700 non-null  object 
 4   sex             41701 non-null  object 
 5   hours_per_week  41631 non-null  float64
 6   USA_born        41701 non-null  float64
 7   label           41716 non-null  object 
dtypes: float64(3), object(5)
memory usage: 2.5+ MB


             age  hours_per_week  USA_born
count    41617.0         41631.0   41701.0
mean   38.476608        40.74322  0.895062
std    13.365972       12.000085  0.306477
min         17.0             1.0       0.0
25%         28.0            40.0       1.0
50%         37.0            40.0       1.0
75%         47.0            45.0       1.0
max         90.0            99.0       1.0


In [5]:
# Let's check whether there are missing values:
df.isnull().sum()
# df.isnull() checks every cell for a null value and returns a boolean DataFrame.
# .sum() then adds up the True values, giving the number of nulls in each column.

age               99
workclass         11
education         14
race              16
sex               15
hours_per_week    85
USA_born          15
label              0
dtype: int64
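
The same check can also be read as a share of rows, which is sometimes easier to compare across columns. A minimal sketch on a hypothetical toy frame (the values are illustrative and are not taken from `census.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls.
toy = pd.DataFrame({
    "age": [39.0, np.nan, 38.0, 53.0],
    "sex": ["Male", "Female", None, "Male"],
})

missing_count = toy.isnull().sum()   # number of nulls per column
missing_share = toy.isnull().mean()  # fraction of nulls per column

print(missing_count)
print(missing_share)
```

Taking the mean of the boolean frame works because `True` counts as 1 and `False` as 0.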

In [6]:
# Let's discard the rows that have 3 or more missing values:
df = df[df.isnull().sum(axis=1) < 3].copy()
# .sum(axis=1) counts the number of null values in each row:
# axis=1 means the sum runs across the columns, giving one total per row.
# .copy() makes the result an independent DataFrame, so the assignments in later cells do not act on a slice of the original.
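
The row filter can also be written with `dropna`, whose `thresh` parameter is the minimum number of non-null values a row must have to be kept. A small sketch on a hypothetical four-column frame, assuming the goal is the same "fewer than 3 missing" rule:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has 3 missing values, row 2 has 2, row 3 has 1.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, np.nan, np.nan],
    "d": [1.0, 2.0, 3.0, 4.0],
})

# Boolean-mask form: keep rows with fewer than 3 missing values.
by_mask = toy[toy.isnull().sum(axis=1) < 3]

# dropna form: "fewer than 3 missing" means "at least n_columns - 2 present".
by_thresh = toy.dropna(thresh=toy.shape[1] - 2)

print(by_mask.index.tolist())
print(by_mask.equals(by_thresh))
```

Both forms keep the original index labels, which is why the row labels in later outputs still run up to 41715 after rows have been dropped.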


In [7]:
# Let's count the missing values again:
df.isnull().sum()

age               85
workclass          0
education          0
race               0
sex                0
hours_per_week    68
USA_born           0
label              0
dtype: int64

In [8]:
# Let's impute the missing values of age and hours worked per week with the median of each of those columns:
medianas = df[['age', 'hours_per_week']].median()
df[['age', 'hours_per_week']] = df[['age', 'hours_per_week']].fillna(medianas)
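
To see what the imputation does value by value, here is a minimal sketch on a hypothetical toy frame. `median()` skips NaN by default, and `fillna` with a Series fills each column with the entry that matches its column name:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "hours_per_week": [40.0, 50.0, np.nan, 30.0],
})

medians = toy.median()        # per-column medians, NaN ignored
filled = toy.fillna(medians)  # each column filled with its own median

print(medians)
print(filled)
```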

In [9]:
# Let's count the missing values again:
df.isnull().sum()

age               0
workclass         0
education         0
race              0
sex               0
hours_per_week    0
USA_born          0
label             0
dtype: int64

In [10]:
# Let's apply one-hot encoding to the "workclass" column:
one_hot = pd.get_dummies(df, columns=['workclass'])
print(one_hot)

        age         education                race     sex  hours_per_week  \
0      39.0         Bachelors               White    Male            40.0   
1      50.0         Bachelors               White    Male            13.0   
2      38.0       High-school               White    Male            40.0   
3      53.0  Some-high-school               Black    Male            40.0   
4      28.0         Bachelors               Black  Female            40.0   
...     ...               ...                 ...     ...             ...   
41711  33.0         Bachelors               White    Male            40.0   
41712  39.0         Bachelors               White  Female            36.0   
41713  38.0         Bachelors               White    Male            50.0   
41714  44.0         Bachelors  Asian-Pac-Islander    Male            40.0   
41715  35.0         Bachelors               White    Male            60.0   

       USA_born  label  workclass_Federal-gov  workclass_Local-gov  \
0    
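
A compact illustration of how `pd.get_dummies` names the new columns, using a hypothetical three-row frame. Note that in the notebook the result is stored in `one_hot`, so `df` itself keeps its original `workclass` column.

```python
import pandas as pd

# Hypothetical toy data with three distinct workclass values.
toy = pd.DataFrame({
    "age": [39.0, 50.0, 38.0],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

encoded = pd.get_dummies(toy, columns=["workclass"])

# One indicator column per category, named "<column>_<value>";
# the original "workclass" column is removed from the result.
print(encoded.columns.tolist())
```

The indicator columns are boolean or integer depending on the pandas version; passing `dtype=int` to `get_dummies` makes them integers in either case.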

In [11]:
# Let's find the values that the "education" column takes:
education_values = df['education'].unique()
print(education_values)

['Bachelors' 'High-school' 'Some-high-school' 'Masters' 'Some-college'
 'Middle-school' 'Doctorate' 'Some-middle-school' 'Preschool'
 'Elementary-school']


In [12]:
# Let's apply ordinal encoding to the "education" column.
# Caution: the list below uses level names such as '1st-4th' and 'HS-grad', which do not match
# the values found in the previous cell (e.g. 'High-school'); map() turns every unmatched value into NaN.
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']
education_mapping = {level: index for index, level in enumerate(education_order)}
df['education'] = df['education'].map(education_mapping)
print(df['education'])


0        12.0
1        12.0
2         NaN
3         NaN
4        12.0
         ... 
41711    12.0
41712    12.0
41713    12.0
41714    12.0
41715    12.0
Name: education, Length: 41694, dtype: float64


In [13]:
# Let's verify that the "education" column has the appropriate values:
print(df['education'].unique())


[12. nan 13.  9. 15.  0.]
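
The `nan` in this result comes from education levels that are absent from `education_order`. One way to avoid it is to build the list from the ten values that `unique()` reported. The sketch below does that on a hypothetical toy Series; the ranking of the levels is an assumption (it is not stated in the source), and in the notebook the mapping would have to be applied to the original string column, before it is overwritten.

```python
import pandas as pd

# Assumed ordering from lowest to highest level; not given in the source.
education_order = [
    "Preschool", "Elementary-school", "Some-middle-school", "Middle-school",
    "Some-high-school", "High-school", "Some-college", "Bachelors",
    "Masters", "Doctorate",
]
education_mapping = {level: rank for rank, level in enumerate(education_order)}

# Hypothetical toy data using the level names present in the dataset.
toy = pd.Series(["Bachelors", "High-school", "Some-high-school", "Doctorate"])

# Guard: fail early if an observed level is missing from the mapping,
# instead of silently producing NaN.
unmapped = set(toy.dropna().unique()) - set(education_mapping)
assert not unmapped, f"levels missing from the mapping: {unmapped}"

encoded = toy.map(education_mapping)
print(encoded.tolist())
```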


In [14]:
# Let's apply one-hot encoding to the "race" column:
one_hot_race = pd.get_dummies(df, columns=['race'])
print(one_hot_race)



        age         workclass  education     sex  hours_per_week  USA_born  \
0      39.0         State-gov       12.0    Male            40.0       1.0   
1      50.0  Self-emp-not-inc       12.0    Male            13.0       1.0   
2      38.0           Private        NaN    Male            40.0       1.0   
3      53.0           Private        NaN    Male            40.0       1.0   
4      28.0           Private       12.0  Female            40.0       0.0   
...     ...               ...        ...     ...             ...       ...   
41711  33.0           Private       12.0    Male            40.0       1.0   
41712  39.0           Private       12.0  Female            36.0       1.0   
41713  38.0           Private       12.0    Male            50.0       1.0   
41714  44.0           Private       12.0    Male            40.0       1.0   
41715  35.0      Self-emp-inc       12.0    Male            60.0       1.0   

       label  race_Amer-Indian-Eskimo  race_Asian-Pac-Islander 

In [15]:
# Let's apply binary encoding to the "sex" column:
df['sex'] = df['sex'].map({'Male': 0, 'Female': 1})
print(df['sex'])

0        0
1        0
2        0
3        0
4        1
        ..
41711    0
41712    1
41713    0
41714    0
41715    0
Name: sex, Length: 41694, dtype: int64


In [24]:
import category_encoders as ce

# Detect the columns that are still categorical (object or category dtype):
columnas_categoricas = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("🔍 Categorical columns detected:", columnas_categoricas)

# Apply binary encoding to those columns (note that this includes the target column "label"):
encoder = ce.BinaryEncoder(cols=columnas_categoricas)
df_encoded = encoder.fit_transform(df[columnas_categoricas])

# Replace the original categorical columns with their encoded versions:
df_final = pd.concat([df.drop(columns=columnas_categoricas), df_encoded], axis=1)



🔍 Categorical columns detected: ['workclass', 'race', 'label']
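
The idea behind binary encoding can be sketched in plain Python: give each category an integer code, then write that code in binary, one column per bit. The helper `binary_encode` below is hypothetical and is not the `category_encoders` implementation; it assumes codes start at 1 and follow the order of first appearance, which is consistent with the `workclass_*` values printed in the next cell.

```python
def binary_encode(values):
    """Return one list of bits per value (hypothetical helper for illustration)."""
    # Assign integer codes 1, 2, 3, ... in order of first appearance.
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1

    # Number of bits needed to write the largest code.
    width = max(codes.values()).bit_length()

    # Write each code as a zero-padded binary string and split it into bits.
    return [[int(bit) for bit in format(codes[value], f"0{width}b")]
            for value in values]


rows = binary_encode(["State-gov", "Self-emp-not-inc", "Private", "Private"])
print(rows)
```

With k categories this produces roughly log2(k) columns instead of the k columns of one-hot encoding, at the cost of columns that no longer correspond to a single category.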


In [25]:
# Let's look at the resulting dataset:
print(df_final.head())


    age  education  sex  hours_per_week  USA_born  workclass_0  workclass_1  \
0  39.0       12.0    0            40.0       1.0            0            0   
1  50.0       12.0    0            13.0       1.0            0            1   
2  38.0        NaN    0            40.0       1.0            0            1   
3  53.0        NaN    0            40.0       1.0            0            1   
4  28.0       12.0    1            40.0       0.0            0            1   

   workclass_2  race_0  race_1  race_2  label_0  label_1  
0            1       0       0       1        0        1  
1            0       0       0       1        0        1  
2            1       0       0       1        0        1  
3            1       0       1       0        0        1  
4            1       0       1       0        0        1  


In [26]:
# Let's look at the new description of the dataset
# (info() prints its report and returns None, which is why "None" appears in the output):
print(df_final.info())
print(df_final.describe())



<class 'pandas.core.frame.DataFrame'>
Index: 41694 entries, 0 to 41715
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41694 non-null  float64
 1   education       21033 non-null  float64
 2   sex             41694 non-null  int64  
 3   hours_per_week  41694 non-null  float64
 4   USA_born        41694 non-null  float64
 5   workclass_0     41694 non-null  int64  
 6   workclass_1     41694 non-null  int64  
 7   workclass_2     41694 non-null  int64  
 8   race_0          41694 non-null  int64  
 9   race_1          41694 non-null  int64  
 10  race_2          41694 non-null  int64  
 11  label_0         41694 non-null  int64  
 12  label_1         41694 non-null  int64  
dtypes: float64(4), int64(9)
memory usage: 4.5 MB
None
                age     education           sex  hours_per_week      USA_born  \
count  41694.000000  21033.000000  41694.000000    41694.000000  41694.000000   
mean

In [27]:
# Let's check the data type of each column:
print(df_final.dtypes)


age               float64
education         float64
sex                 int64
hours_per_week    float64
USA_born          float64
workclass_0         int64
workclass_1         int64
workclass_2         int64
race_0              int64
race_1              int64
race_2              int64
label_0             int64
label_1             int64
dtype: object


In [31]:
# Let's save the dataset to a new file and read it back to check the result:
df_final.to_csv('census_preprocessed.csv', index=False)
df = pd.read_csv('census_preprocessed.csv')
print(df.head())




    age  education  sex  hours_per_week  USA_born  workclass_0  workclass_1  \
0  39.0       12.0    0            40.0       1.0            0            0   
1  50.0       12.0    0            13.0       1.0            0            1   
2  38.0        NaN    0            40.0       1.0            0            1   
3  53.0        NaN    0            40.0       1.0            0            1   
4  28.0       12.0    1            40.0       0.0            0            1   

   workclass_2  race_0  race_1  race_2  label_0  label_1  
0            1       0       0       1        0        1  
1            0       0       0       1        0        1  
2            1       0       0       1        0        1  
3            1       0       1       0        0        1  
4            1       0       1       0        0        1  
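
The save-and-reload step can be turned into an explicit check. The sketch below uses a hypothetical toy frame and an in-memory buffer in place of `census_preprocessed.csv`, and asserts that the frame read back matches the one written, including the NaN entry.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data with the same kinds of columns as df_final.
toy = pd.DataFrame({
    "age": [39.0, 50.0],
    "education": [12.0, np.nan],
    "sex": [0, 1],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False: do not write the row labels
buffer.seek(0)
restored = pd.read_csv(buffer)

# Raises AssertionError if values, column names or dtypes differ.
pd.testing.assert_frame_equal(toy, restored)
print(restored)
```

Because the row labels are not written, the reloaded frame gets a fresh index starting at 0, so any gaps left by dropped rows disappear.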
