<a href="https://colab.research.google.com/github/dajebbar/FreeCodeCamp-python-data-analysis/blob/main/num_cat_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Using numerical and categorical variables together
---


In [1]:
import pandas as pd

adult_census = pd.read_csv("./adult.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

## Selection based on data types
We will separate categorical and numerical variables by their data types. As we saw previously, the `object` data type corresponds to categorical columns (strings). We use the `make_column_selector` helper to select the corresponding columns.



In [8]:
from sklearn.compose import make_column_selector as selector

numerical_data_selector = selector(dtype_include='number')
categorical_data_selector = selector(dtype_include='object')

numerical_columns = numerical_data_selector(data)
categorical_columns = categorical_data_selector(data)

In [9]:
categorical_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
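To see what the selector does in isolation, here is a minimal sketch on a small hand-made data frame. The frame `toy` and its values are hypothetical and are not part of the census data; the string column is created with an explicit `object` dtype so that the example matches the selection rule used above.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame (not the census data): two numerical columns
# and one string column stored with the `object` dtype.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "hours-per-week": [40.0, 50.0, 35.5],
    "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
})

# A selector is a callable: given a data frame, it returns the list of
# column names whose dtype matches.
numerical_toy_columns = selector(dtype_include="number")(toy)
categorical_toy_columns = selector(dtype_include="object")(toy)

print(numerical_toy_columns)    # ['age', 'hours-per-week']
print(categorical_toy_columns)  # ['workclass']
```

Note that the selection relies only on the dtype: a category stored as integer codes would be picked up as a numerical column.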

## Dispatch columns to a specific processor

In the previous sections, we saw that we need to treat data differently depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a `ColumnTransformer` class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together (heterogeneously typed tabular data).

We first define the preprocessing to apply depending on the data type:

-  **one-hot encoding** will be applied to categorical columns. In addition, we use `handle_unknown="ignore"` so that a category that appears at prediction time but was never seen during training (which can happen with rare categories) does not raise an error.
-  **numerical scaling** will be applied to numerical columns, which will be standardized.

Now, we create our `ColumnTransformer` by specifying three values for each preprocessing step: the preprocessor name, the transformer, and the columns. First, let's create the preprocessors for the numerical and categorical parts.

In [10]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown='ignore')
numerical_preprocessor = StandardScaler()
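As a quick illustration of `handle_unknown='ignore'`, the sketch below fits an encoder on a tiny hypothetical column and then transforms a value it has never seen. The arrays `train` and `new` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column data: "Never-worked" is absent from `train`.
train = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
new = np.array([["Private"], ["Never-worked"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(new).toarray()
print(encoder.categories_[0])  # ['Private' 'State-gov']
print(encoded)
# [[1. 0.]
#  [0. 0.]]
```

The unknown category is encoded as a row of zeros instead of raising an error.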

Now, we create the transformer and associate each of these preprocessors with their respective columns.

In [11]:
from sklearn.compose import ColumnTransformer

# each entry is a (name, transformer, columns) tuple
preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_columns),
])

A `ColumnTransformer` does the following:

-  It **splits the columns** of the original dataset based on the column names or indices provided. We will obtain as many subsets as the number of transformers passed into the `ColumnTransformer`.
-  It **transforms each subset**. A specific transformer is applied to each subset: it will internally call `fit_transform` or `transform`. The output of this step is a set of transformed datasets.
-  It then **concatenates the transformed datasets** into a single dataset.  
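The three steps above can be sketched by hand and compared with what a `ColumnTransformer` returns. This is only an illustration on a hypothetical toy frame, not a description of the internal implementation. We pass `sparse_threshold=0` so that the output is always a dense array, which makes the comparison straightforward.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame (not the census data).
toy = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "workclass": pd.Series(
        ["Private", "State-gov", "Private", "Local-gov"], dtype=object
    ),
})
cat_cols, num_cols = ["workclass"], ["age"]

# Steps 1 and 2 by hand: split the columns, then transform each subset.
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(toy[cat_cols])
scaled = StandardScaler().fit_transform(toy[num_cols])

# Step 3 by hand: concatenate the transformed subsets column-wise.
manual = np.hstack([encoded.toarray(), scaled])

# The same three steps delegated to a ColumnTransformer.
toy_preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("standard-scaler", StandardScaler(), num_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
combined = toy_preprocessor.fit_transform(toy)

print(combined.shape)                 # (4, 4): 3 one-hot columns + 1 scaled
print(np.allclose(manual, combined))  # True
```

The output columns follow the order in which the transformers are listed, not the column order of the original data frame.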

Importantly, `ColumnTransformer` behaves like any other scikit-learn transformer. In particular, it can be combined with a classifier in a `Pipeline`:

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

We can display an interactive diagram with the following command:

In [13]:
from sklearn import set_config
set_config(display='diagram')
model

The final model is more complex than the previous models but still follows the same API (the same set of methods that can be called by the user):

-  the `fit` method is called to preprocess the data and then train the classifier on the preprocessed data;
-  the `predict` method makes predictions on new data;
-  the `score` method is used to predict on the test data and compare the predictions to the expected test labels to compute the accuracy.  
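Before applying this to the census data, here is a minimal self-contained sketch of the three methods on synthetic data. The frame `toy`, the labels, and the simple 150/50 slice used as a hold-out are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: the label depends only on `age`.
rng = np.random.default_rng(0)
n_samples = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n_samples),
    "workclass": pd.Series(
        rng.choice(["Private", "State-gov", "Self-emp"], size=n_samples),
        dtype=object,
    ),
})
toy_target = np.where(toy["age"] > 40, "high", "low")

toy_model = make_pipeline(
    ColumnTransformer([
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("standard-scaler", StandardScaler(), ["age"]),
    ]),
    LogisticRegression(max_iter=500),
)

# Naive hold-out for the sketch: first 150 rows to fit, last 50 to evaluate.
toy_model.fit(toy.iloc[:150], toy_target[:150])       # preprocess, then train
predictions = toy_model.predict(toy.iloc[150:])       # preprocess, then predict
accuracy = toy_model.score(toy.iloc[150:], toy_target[150:])  # mean accuracy
print(predictions[:5], accuracy)
```

Each call on the pipeline runs the preprocessing first, so the raw data frame can be passed directly to `fit`, `predict`, and `score`.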

Let's start by splitting our data into train and test sets.