# Using numerical and categorical variables together

How can the preprocessing steps for numerical and categorical variables be combined?

Load the entire adult census dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [2]:
df = pd.read_csv('adult_cencus.csv')

### **Dealing with missing values**

In [3]:
df.isna().sum()

age                  0
workclass         2799
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     857
class                0
dtype: int64

In [5]:
df[['workclass', 'occupation', 'native_country']].dtypes

workclass         object
occupation        object
native_country    object
dtype: object

All missing values are in categorical features, so I fill them with **'Unknown'**.

In [6]:
cols = ['workclass', 'occupation', 'native_country']
# Assign the result back: fillna(inplace=True) on a column selection only modifies a temporary copy
df[cols] = df[cols].fillna('Unknown')
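Selecting several columns with a list returns a new DataFrame, so an in-place `fillna` on that selection does not reach `df`. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration) that contrasts filling in place with assigning the result back:

```python
import warnings

import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({
    "workclass": ["Private", None],
    "occupation": [None, "Sales"],
    "age": [39, 50],
})
cols = ["workclass", "occupation"]

# In-place fill on the selection: only the temporary copy is changed.
# pandas may emit a warning here, so it is silenced for this demonstration.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy[cols].fillna("Unknown", inplace=True)
missing_after_inplace = int(toy[cols].isna().sum().sum())

# Assigning the filled copy back updates the original frame.
toy[cols] = toy[cols].fillna("Unknown")
missing_after_assign = int(toy[cols].isna().sum().sum())

print(missing_after_inplace, missing_after_assign)
```

The first count stays at the original number of missing values, while the second drops to zero.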


Before continuing, note that the education and education_num features carry the same information, so only one of them is needed. I keep education_num and remove education from the dataset.

In [7]:
data, target = df.drop(columns=['education', 'class']), df['class']
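The claim that education and education_num carry the same information can be checked by confirming that each label maps to exactly one code and vice versa. This is a rough sketch on a hypothetical toy frame with illustrative values; on the real data the same two `groupby` calls would be run on `df`:

```python
import pandas as pd

# Hypothetical toy data standing in for the two census columns
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "education_num": [13, 9, 13, 14],
})

# Distinct codes seen for each label, and distinct labels seen for each code
codes_per_label = toy.groupby("education")["education_num"].nunique()
labels_per_code = toy.groupby("education_num")["education"].nunique()

# The two columns are redundant when the mapping is one-to-one in both directions
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```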

### **Separate categorical and numerical features**

In [8]:
from sklearn.compose import make_column_selector as selector

numerical = selector(dtype_include=np.number)(data)
categorical = selector(dtype_include=object)(data)
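`make_column_selector` returns a callable that, when applied to a DataFrame, gives the list of column names whose dtype matches. A small sketch on a hypothetical toy frame (column values are invented, and the string column is forced to `object` dtype so the example behaves the same across pandas versions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy data with two numeric columns and one object column
toy = pd.DataFrame({
    "age": [25, 38],
    "workclass": pd.Series(["Private", "Unknown"], dtype=object),
    "hours_per_week": [40, 50],
})

# Each selector call returns a plain list of column names
toy_numerical = selector(dtype_include=np.number)(toy)
toy_categorical = selector(dtype_include=object)(toy)

print(toy_numerical)
print(toy_categorical)
```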

Now the dataset is ready to be preprocessed.
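One possible way to combine the two preprocessing paths is scikit-learn's `ColumnTransformer`, which applies a different transformer to each list of columns. The sketch below is only an illustration under assumptions: the choice of `StandardScaler` for numerical columns and `OneHotEncoder` for categorical columns is not prescribed by this notebook, and the toy frame is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for `data`
toy = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "hours_per_week": [40, 50, 40, 60],
    "workclass": pd.Series(
        ["Private", "Unknown", "Private", "State-gov"], dtype=object
    ),
})
toy_numerical = ["age", "hours_per_week"]
toy_categorical = ["workclass"]

# Assumed transformer choices: scale numbers, one-hot encode categories
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), toy_numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
])

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

With the real data, the `numerical` and `categorical` lists built above would take the place of the toy lists.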