# Using numerical and categorical variables together

How can the preprocessing steps for numerical and categorical variables be combined?

Load the entire adult census dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [2]:
df = pd.read_csv('adult_cencus.csv')

### **Dealing with missing values**

In [3]:
df.isna().sum()

age                  0
workclass         2799
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     857
class                0
dtype: int64

In [5]:
df[['workclass', 'occupation', 'native_country']].dtypes

workclass         object
occupation        object
native_country    object
dtype: object

All missing values are in categorical features, so I choose to fill them with **'Unknown'**.

In [6]:
# Assign the result back: inplace=True on df[[...]] only fills a temporary copy and leaves df unchanged.
df = df.fillna({'workclass': 'Unknown', 'occupation': 'Unknown', 'native_country': 'Unknown'})
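
The copy pitfall when filling a subset of columns can be checked on a small frame. This is only a sketch: `toy` and its values are hypothetical and are not taken from the census data.

```python
import warnings
import pandas as pd

# Hypothetical toy frame standing in for the census data.
toy = pd.DataFrame({"workclass": ["Private", None], "age": [39, 50]})

# toy[["workclass"]] is a new DataFrame, so filling it in place leaves `toy` untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the copy-related warning pandas may emit
    toy[["workclass"]].fillna("Unknown", inplace=True)
missing_before = int(toy["workclass"].isna().sum())

# Assigning the filled result back is what actually updates the frame.
toy = toy.fillna({"workclass": "Unknown"})
missing_after = int(toy["workclass"].isna().sum())

print(missing_before, missing_after)
```

Passing a dict to `fillna` restricts the fill to the listed columns, so numerical columns are left alone.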


Before continuing with the processing steps, note that the `education` and `education_num` features carry the same information, so I'll keep just one of them: `education` will be removed from the dataset.

In [7]:
data, target = df.drop(columns=['education', 'class']), df['class']

### **Separate categorical and numerical features**

In [8]:
from sklearn.compose import make_column_selector as selector

numerical = selector(dtype_include=np.number)(data)
categorical = selector(dtype_include=object)(data)
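
To see what the selector returns, here is a minimal sketch on a hypothetical three-column frame (the `toy` frame is an assumption for illustration). Calling the selector on a DataFrame gives back a plain list of column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame; the string column is forced to object dtype on purpose.
toy = pd.DataFrame({
    "age": [25, 38],
    "workclass": pd.Series(["Private", "State-gov"], dtype=object),
    "hours_per_week": [40.0, 50.0],
})

# Each selector is a callable that returns the matching column names as a list.
num_cols = selector(dtype_include=np.number)(toy)
cat_cols = selector(dtype_include=object)(toy)

print(num_cols)
print(cat_cols)
```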

Now the dataset is ready to be preprocessed.

In [9]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 48842.0 | 38.643585 | 13.71051 | 17.0 | 28.0 | 37.0 | 48.0 | 90.0 |
| fnlwgt | 48842.0 | 189664.134597 | 105604.025423 | 12285.0 | 117550.5 | 178144.5 | 237642.0 | 1490400.0 |
| education_num | 48842.0 | 10.078089 | 2.570973 | 1.0 | 9.0 | 10.0 | 12.0 | 16.0 |
| capital_gain | 48842.0 | 1079.067626 | 7452.019058 | 0.0 | 0.0 | 0.0 | 0.0 | 99999.0 |
| capital_loss | 48842.0 | 87.502314 | 403.004552 | 0.0 | 0.0 | 0.0 | 0.0 | 4356.0 |
| hours_per_week | 48842.0 | 40.422382 | 12.391444 | 1.0 | 40.0 | 40.0 | 45.0 | 99.0 |


I note that the numerical features need to be scaled: their means range from about 10 (`education_num`) to about 190,000 (`fnlwgt`), and their standard deviations differ just as widely.
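
As a rough illustration of what standardization does, the sketch below scales two hypothetical columns with very different magnitudes. The numbers in `X` are made up for the example and are not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales (think of age next to fnlwgt).
X = np.array([
    [25.0, 120000.0],
    [38.0, 180000.0],
    [52.0, 240000.0],
])

# StandardScaler subtracts each column's mean and divides by its standard deviation.
X_scaled = StandardScaler().fit_transform(X)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means, col_stds)
```

After scaling, every column has a mean close to 0 and a standard deviation close to 1, so no feature dominates only because of its units.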

## Dispatch columns to a specific processor

I know that I need to treat data differently depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a `ColumnTransformer` class which will send specific
columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together (heterogeneously typed tabular data).

Each group of columns gets its own preprocessing, depending on its data type:

* **one-hot encoding** will be applied to categorical columns. Besides, we
  use `handle_unknown="ignore"` so that a category not seen during `fit`
  (typically a rare one) is encoded as all zeros instead of raising an error.
* **numerical scaling** will be applied to numerical columns, which will be standardized.
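
The effect of `handle_unknown="ignore"` can be sketched on a tiny hypothetical column. The category names below are picked for illustration only.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two hypothetical categories only.
demo_encoder = OneHotEncoder(handle_unknown="ignore")
demo_encoder.fit([["Private"], ["State-gov"]])

# A known category gets a single 1 in its own column.
known = demo_encoder.transform([["Private"]]).toarray()

# A category never seen during fit is encoded as a row of zeros instead of raising.
unseen = demo_encoder.transform([["Never-worked"]]).toarray()

print(known)
print(unseen)
```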

Each preprocessor passed to `ColumnTransformer` is described by three values:
- the preprocessor name, 
- the transformer, 
- and the columns.
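
As a minimal sketch of that three-value structure, the snippet below builds a `ColumnTransformer` on a hypothetical two-column frame. The names `"onehot"` and `"scaler"` and the `toy` data are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one numerical and one categorical column.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "workclass": ["Private", "State-gov", "Private"],
})

# Each entry is a (name, transformer, columns) tuple.
demo_preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("scaler", StandardScaler(), ["age"]),
])

transformed = demo_preprocessor.fit_transform(toy)
print(transformed.shape)
```

The output has one column per `workclass` category plus one scaled `age` column.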

First, let's create the preprocessors for the numerical and categorical
parts.

In [11]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

scaler = StandardScaler()
encoder = OneHotEncoder(handle_unknown='ignore')