# Adult UCI dataset

Classification with preprocessing and pipeline

In this example we will use the [Adult UCI dataset](https://archive.ics.uci.edu/dataset/2/adult).

The target is the binary `income` column, which specifies whether an individual's income is *low* or *high*.

In this example we will only perform a classification with a hold-out test set and no hyperparameter optimisation. Students are invited to extend the exercise with a `GridSearchCV` optimisation and several classifiers.

## Technical aspects
In `sklearn`, a `Pipeline` is a sequence of operations in which the output of one operation is the input to the next. Each operation is a transformation of the data, except the last one, which can be an estimator.

Examples of those operations are:
- column transformations:
  - encodings:
    - `OneHotEncoder` to transform a categorical column into a set of `0-1` columns
    - `OrdinalEncoder` to transform an ordinal column into a numeric one
  - imputation of null values, such as `SimpleImputer`, which requires a **strategy**: for example, nulls in numeric columns can be filled with the `median`, while nulls in categorical columns can be filled with a constant value such as `unknown`
  - numeric transformations, such as:
    - `MinMaxScaler`
    - `StandardScaler`
- estimators
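
The column transformations listed above can be tried in isolation before combining them. The following is a minimal sketch on a tiny invented dataframe (the column names `colour`, `size` and `height` are hypothetical and do not belong to the Adult dataset); it only illustrates what each transformer returns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "colour": ["red", "blue", np.nan, "red"],
    "size": ["small", "large", "medium", "small"],
    "height": [1.0, np.nan, 3.0, 5.0],
})

# Numeric nulls are replaced by the median of the non-null values (here 3.0)
height_filled = SimpleImputer(strategy="median").fit_transform(toy[["height"]])

# Categorical nulls are replaced by the constant "unknown"
colour_filled = SimpleImputer(
    strategy="constant", fill_value="unknown"
).fit_transform(toy[["colour"]])

# One 0-1 column per category: blue, red, unknown
colour_onehot = OneHotEncoder().fit_transform(colour_filled).toarray()

# Ordinal encoding with an explicit category order: small < medium < large
size_codes = OrdinalEncoder(
    categories=[["small", "medium", "large"]]
).fit_transform(toy[["size"]])

# Standard scaling: the result has zero mean and unit variance
height_scaled = StandardScaler().fit_transform(height_filled)

print(colour_onehot)
print(size_codes.ravel())
```

Note that `OneHotEncoder` returns a sparse matrix by default, which is why the sketch calls `.toarray()` before printing.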

In `sklearn`, the `ColumnTransformer` applies transformations to groups of columns in a uniform way. For each group it requires a name, a transformation pipeline and the list of columns to be transformed with that pipeline.

We will then build a final pipeline composed of the preprocessing step, implemented with the `ColumnTransformer`, and the classifier.
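
To make the structure concrete, here is a small sketch of a `ColumnTransformer` nested inside a final `Pipeline`. The data (`toy_X`, `toy_y`) and the step names are invented for illustration, and the one-hot encoding of the categorical column is one possible choice, not necessarily the one expected in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: one numeric and one categorical column, both with a null
toy_X = pd.DataFrame({
    "age": [25, 40, np.nan, 58, 33, 47],
    "job": ["clerk", "manager", "clerk", np.nan, "clerk", "manager"],
})
toy_y = pd.Series([0, 1, 0, 1, 0, 1])

# One small pipeline per group of columns
numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Each entry is (name, transformer, list of columns)
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipe, ["age"]),
    ("cat", categorical_pipe, ["job"]),
])

# The final pipeline chains the preprocessing and the classifier
toy_model = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
toy_model.fit(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X)
print(toy_pred)
```

Because the whole chain is a single estimator, calling `fit` and `predict` on it applies the same preprocessing to any data it receives.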

## Workflow
1. load the file `adults.csv` into a dataframe and explore it
1. use as target the column `income`, separate the predicting columns and the target into `X` and `y`
1. show the percentage of null values in each column
1. prepare the variable `categorical_features` containing the names of all the categorical features of `X`, excluding `education` (it will be ignored) and `sex` (it will be transformed separately)
1. prepare the variable `numeric_features` containing the names of all the numeric features, excluding `fnlwgt` (this is the *importance* of each observation; it can be used in the final classification report as `sample_weight`)
1. the target should be binary, but inspecting its unique values you will see that there are four distinct values due to typing errors; reduce the target to binary with an appropriate mapping (hint: you can use the `.map()` function of Pandas)
1. split into train and test
1. prepare the column transformers for:
  1. numeric: simple imputation with the median and standard scaling
  1. categorical: simple imputation with nulls substituted by `unknown`
  1. boolean: ordinal encoding
1. prepare a pipeline with:
  1. the column transformer as preprocessor
  1. the `DecisionTreeClassifier`
1. fit the pipeline on the training set
1. predict on the test set and produce a classification report
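
The target clean-up in the workflow above can be sketched with `.map()` as follows. The four raw values used here, in particular the variants with a trailing period, are an assumption made for illustration; check the actual values with `y.unique()` before writing the mapping.

```python
import pandas as pd

# Hypothetical raw target values; the trailing-period variants are an assumption
raw_income = pd.Series(["<=50K", ">50K", "<=50K.", ">50K.", "<=50K"])

# Map every spelling to one of the two intended classes
income_map = {
    "<=50K": "<=50K",
    "<=50K.": "<=50K",
    ">50K": ">50K",
    ">50K.": ">50K",
}
clean_income = raw_income.map(income_map)

print(clean_income.unique())
```

Values that are missing from the dictionary become `NaN` after `.map()`, so counting the nulls of the result is a quick way to verify that the mapping covers every spelling.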

In [15]:
import pandas as pd

In [27]:
url = '../data/adults.csv'
df = pd.read_csv(url)
df.head()

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

In [28]:

X = df.drop(columns=['income'])
y = df['income']

print("Feature (X):")
print(X.head())
print("\nTarget (y):")
print(y.head())

Feature (X):
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country  
0          2174             0              40  United-States  
1             0             0              

In [29]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# fixed seed for reproducibility
random_state = 346
np.random.seed(random_state)

In [32]:
# Percentage of null values
null_percentages = df.isnull().mean() * 100
print("Percentage of null values in each column:")
print(null_percentages)


Percentage of null values in each column:
age               0.000000
workclass         1.971664
fnlwgt            0.000000
education         0.000000
education-num     0.000000
marital-status    0.000000
occupation        1.977806
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    0.560993
income            0.000000
dtype: float64


In [33]:
# Prepare the categorical_features variable
categorical_features = X.select_dtypes(include=['object']).columns.difference(['education', 'sex'])
print("\nCategorical columns (excluding 'education' and 'sex'):")
print(categorical_features.tolist())


Categorical columns (excluding 'education' and 'sex'):
['marital-status', 'native-country', 'occupation', 'race', 'relationship', 'workclass']


In [34]:
# Prepare the numeric_features variable
numeric_features = X.select_dtypes(include=['number']).columns.difference(['fnlwgt'])
print("Numeric columns (excluding 'fnlwgt'):")
print(numeric_features.tolist())

Numeric columns (excluding 'fnlwgt'):
['age', 'capital-gain', 'capital-loss', 'education-num', 'hours-per-week']


### Here you should prepare the numeric, boolean and categorical transformers

`your_transformer = Pipeline(
  steps = [
              ("step name", transformer )
            , ("step name", transformer)
          ]
)`

In [18]:
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median"))
        , ("scaler", StandardScaler())]
)

# boolean_transformer = ...      # to be completed: ordinal encoding
# categorical_transformer = ...  # to be completed: imputation of nulls with `unknown`
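
One possible way to complete the two missing transformers is sketched below. This is not the official solution: in particular, the one-hot encoding step after the imputation is an assumption added here, because the `DecisionTreeClassifier` in `sklearn` needs numeric inputs. The `sample` dataframe is invented to show the effect.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Boolean column: ordinal encoding maps the two values to 0 and 1
boolean_transformer = Pipeline(
    steps=[("encoder", OrdinalEncoder())]
)

# Categorical columns: nulls replaced by "unknown", then one-hot encoded
# (the one-hot step is an assumption, see the text above)
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Invented sample to check the behaviour of the two transformers
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", np.nan, "State-gov"],
})
sex_codes = boolean_transformer.fit_transform(sample[["sex"]])
workclass_onehot = categorical_transformer.fit_transform(
    sample[["workclass"]]
).toarray()

print(sex_codes.ravel())
print(workclass_onehot)
```

`OrdinalEncoder` sorts the categories alphabetically by default, so here `Female` is encoded as `0` and `Male` as `1`.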

### Here you collect the transformers in the `ColumnTransformer`

In [10]:
# preprocessor = ColumnTransformer(
#     transformers=[
#           ("transformer name", transformer, list_of_features)
#         , ("transformer name", transformer, list_of_features)
#         , ("transformer name", transformer, list_of_features)
#     ]
#     , sparse_threshold=0 # this forces a dense output instead of a sparse matrix;
#                          # it can speed up some operations on small data
# )
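
The template above could be filled in along these lines. This is a sketch under assumptions: the three transformers are redefined here so the block runs on its own, the transformer names `num`, `cat` and `bool` are arbitrary, and the `toy` dataframe contains only a handful of invented rows with a subset of the Adult columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # assumed extra step
])
boolean_transformer = Pipeline(steps=[("encoder", OrdinalEncoder())])

# Invented rows with a subset of the Adult columns
toy = pd.DataFrame({
    "age": [39, 50, np.nan],
    "hours-per-week": [40, 13, 40],
    "workclass": ["State-gov", np.nan, "Private"],
    "sex": ["Male", "Female", "Male"],
    "education": ["Bachelors", "HS-grad", "11th"],
    "fnlwgt": [77516, 83311, 215646],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["age", "hours-per-week"]),
        ("cat", categorical_transformer, ["workclass"]),
        ("bool", boolean_transformer, ["sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_t = preprocessor.fit_transform(toy)
print(X_t.shape)
```

Columns that are not listed in any transformer, here `education` and `fnlwgt`, are dropped because the default value of the `remainder` parameter is `"drop"`.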

### Final operations

- prepare the final pipeline composed of the preprocessor and the classifier
- do the train/test split
- fit the pipeline to the train part
- predict the test part
- produce the classification report

In [12]:
# clf = Pipeline(
#     steps=[
#           ("step name", component)
#         , ("step name", component)
#         ]
# )
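
The final operations can be sketched end to end as follows. Since the data file is not available here, the sketch uses a synthetic dataset (`X_demo`, `y_demo`) with an invented labelling rule; the step names, the 25% test size and the stratified split are illustrative choices, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

random_state = 346
rng = np.random.default_rng(random_state)

# Synthetic stand-in for the Adult data (invented values)
n = 200
X_demo = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp-inc"], size=n),
    "sex": rng.choice(["Male", "Female"], size=n),
})
X_demo.loc[::20, "age"] = np.nan  # inject a few nulls
# Invented labelling rule, only to have a learnable target
y_demo = pd.Series(np.where(X_demo["age"].fillna(40) > 45, ">50K", "<=50K"))

preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), ["age"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["workclass"]),
        ("bool", OrdinalEncoder(), ["sex"]),
    ],
    sparse_threshold=0,
)

# Final pipeline: preprocessing followed by the classifier
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=random_state)),
])

# Hold-out split, fit on the training part, predict the test part
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=random_state, stratify=y_demo
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred[:5])
```

Fitting the preprocessor inside the pipeline means that the medians and scaling parameters are learned from the training part only, so no information from the test part leaks into the model.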

In the classification report, pass the `fnlwgt` column of the test set as the `sample_weight` parameter.

              precision    recall  f1-score   support

       <=50K       0.88      0.89      0.89 1404639602.0
        >50K       0.64      0.60      0.62 443124458.0

    accuracy                           0.82 1847764060.0
   macro avg       0.76      0.75      0.75 1847764060.0
weighted avg       0.82      0.82      0.82 1847764060.0
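
The very large `support` values in the report above are the sums of the weights in each class, not row counts. The sketch below reproduces this behaviour of `classification_report` on four invented predictions with invented weights; `output_dict=True` is used only to make the numbers easy to inspect.

```python
import numpy as np
from sklearn.metrics import classification_report

# Invented labels, predictions and fnlwgt-like weights
y_true = np.array(["<=50K", "<=50K", ">50K", ">50K"])
y_hat = np.array(["<=50K", ">50K", ">50K", ">50K"])
weights = np.array([100, 300, 200, 400])

# Each observation counts proportionally to its weight
report = classification_report(
    y_true, y_hat, sample_weight=weights, output_dict=True
)

print(report["<=50K"]["recall"])   # weight correctly predicted / weight of the class
print(report[">50K"]["support"])   # sum of the weights of the class
print(report["accuracy"])
```

In the exercise, the weights would be the `fnlwgt` values of the rows that ended up in the test set, selected with the same index as `y_test`.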

