# Python Learning Sessions: Feature Engineering

![impute](resources\sklearn.impute.png)

## Missing values

When dealing with missing values, there are many strategies to choose from:
* `Dropping the columns` - When a column has a lot of missing values, it is often better to drop the column.
* `Dropping the rows` - If the dataset is much larger than the number of rows with missing values, it is better to drop those rows. Be careful, however, as your model might encounter missing values at inference and may produce errors.
* `Imputation` - When the number of missing values is relatively small, imputation might improve the model's performance. There are a few ways to impute the missing values:
    * Last value carried forward - Commonly used in time series data.
    * Mean - Use the mean of the column to impute the missing values. Commonly used in regression.
    * Median - Use the median of the column to impute the missing values.
    * k-NN - Use the k-nearest neighbors to impute the missing values.
    * Using NA as a placeholder - Use NA as a placeholder for the missing values. Models will treat this as a different category.
    * Other logical rules - Use logical rules to impute the missing values.
* `Adding indicator variables` - Quite common in social sciences; adding an extra column that indicates if the value is missing or not.
* `Using a model that has native support for missing values` - Use a model that can process missing values, e.g., `HistGradientBoostingClassifier` or LightGBM's `LGBMClassifier`.
* `Random` - Use random values to impute the missing values.
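
Several of the strategies above can be sketched on a small toy table. This is only an illustration under stated assumptions: the `toy` DataFrame, its column names, the 50% threshold for dropping a column, and `n_neighbors=2` are all made up for the example and are not part of the taxi dataset used later.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data (illustrative only)
toy = pd.DataFrame({
    "distance": [1.0, 2.0, np.nan, 4.0, 5.0],
    "tip": [0.5, np.nan, 1.5, 2.0, 2.5],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Dropping the columns: keep columns with at least 50% non-missing values
# (the 50% cutoff is an assumption, not a rule)
min_non_missing = int(np.ceil(0.5 * len(toy)))
dropped_cols = toy.dropna(axis="columns", thresh=min_non_missing)

# Dropping the rows: remove every row that still has a missing value
dropped_rows = dropped_cols.dropna(axis="index")

# Last value carried forward: fill each gap with the previous observation
locf = dropped_cols.ffill()

# Median imputation plus indicator variables: add_indicator=True appends
# one 0/1 column per feature that had missing values during fit
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
median_filled = median_imputer.fit_transform(dropped_cols)

# k-NN imputation: fill each gap from the nearest rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(dropped_cols)

print(dropped_cols.columns.tolist())
print(len(dropped_rows))
print(median_filled.shape)
```

Fitting the imputers inside a `Pipeline`, as done in the later cells, keeps the imputation statistics from being computed on the validation folds.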

You have to be careful when dealing with missing values. There are several mechanisms by which missing values are generated (Gelman and Hill, 2007):
* `Completely random` - The missing values are generated completely randomly. Discarding these samples does not bias inference.
* `Missingness at random` - The missing values are not generated completely randomly; the probability that a value is missing may depend on other observed features, e.g., different groups of samples. The missingness can be ignored if the model accounts for the features it depends on.
* `Missing values depend on external factors` - The missing values are generated based on external factors that are not captured in the dataset. It is recommended to model the missingness explicitly.
* `Missing values that depend on the value of the feature` - The missing values are generated based on the value of the feature itself, e.g., people with low salary might not disclose it.
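
To see why the mechanism matters, the following sketch simulates the first and last cases on synthetic data. Everything here is an assumption made for illustration: the log-normal `salary` values, the 30% missing rate, and the 50% / 10% non-disclosure probabilities are invented, not taken from Gelman and Hill.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical salaries drawn from a log-normal distribution
salary = rng.lognormal(mean=10.0, sigma=0.5, size=100_000)

# Completely random: every value has the same 30% chance of being missing
mcar_mask = rng.random(salary.size) < 0.3

# Missingness depends on the value itself: salaries below the median
# are hidden with probability 0.5, the others with probability 0.1
is_low = salary < np.median(salary)
p_missing = np.where(is_low, 0.5, 0.1)
mnar_mask = rng.random(salary.size) < p_missing

true_mean = salary.mean()
mcar_mean = salary[~mcar_mask].mean()  # mean of what remains after discarding
mnar_mean = salary[~mnar_mask].mean()

print(f"true mean:                 {true_mean:,.0f}")
print(f"completely random:         {mcar_mean:,.0f}")
print(f"depends on value (biased): {mnar_mean:,.0f}")
```

Under these assumptions, discarding the completely random cases leaves the mean close to the true value, while the value-dependent case overestimates it because low salaries are underrepresented.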

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import set_config

set_config(display="diagram")

In [2]:
link = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/taxis.csv'
df = pd.read_csv(link, parse_dates=[0, 1])
df

| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn |
| 6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn |


In [5]:
def convert_text_cols_to_categorical(df):
    """Return a copy of df with every object (text) column cast to 'category'."""
    cols = df.select_dtypes(include=['object']).columns
    return df.astype({col: 'category' for col in cols})

X = df.pipe(convert_text_cols_to_categorical)

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   category      
 9   payment          6389 non-null   category      
 10  pickup_zone      6407 non-null   category      
 11  dropoff_zone     6388 non-null   category      
 12  pickup_borough   6407 non-null   category      
 13  dropoff_borough  6388 non-null   category      
dtypes: category(6), datetime64[ns](2), float

### Using a model that has native support for missing values

In [48]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.impute import SimpleImputer

preproc = ColumnTransformer(
    transformers=[
        # Drop the raw timestamps; they are not used as features here
        ('date', 'drop', make_column_selector(dtype_include=['datetime64[ns]'])),
        # Encode categories as numbers; categories unseen during fit become NaN
        (
            'categorical',
            OrdinalEncoder(
                handle_unknown='use_encoded_value',
                unknown_value=np.nan,
            ),
            make_column_selector(dtype_include=['category']),
        ),
    ],
    remainder='passthrough',
)
preproc

pipe = Pipeline(
    steps=[
        ('preprocess', preproc),
        ('variance', VarianceThreshold()),
        ('predictor', Lasso(alpha=0.8)),
    ]
)
pipe

In [49]:
pipe.fit(X.drop(columns='fare'), X['fare'])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [50]:
from sklearn.ensemble import HistGradientBoostingRegressor

pipe = Pipeline(
    steps=[
        ('preprocess', preproc),
        ('variance', VarianceThreshold()),
        ('predictor', HistGradientBoostingRegressor()),
    ]
)
pipe

In [51]:
pipe.fit(X.drop(columns='fare'), X['fare'])

In [61]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipe, 
    X=X.drop(columns='fare'),
    y=X['fare'],
    cv=3
)
'CV Score:', scores.mean()

('CV Score:', 0.9667588453587226)

### Univariate feature imputation

In [64]:
from sklearn.ensemble import RandomForestRegressor

pipe = Pipeline(
    steps=[
        ('preprocess', preproc),
        ('imputer', SimpleImputer()),
        ('variance', VarianceThreshold()),
        ('predictor', RandomForestRegressor()),
    ]
)
pipe

In [65]:
scores = cross_val_score(
    pipe, 
    X=X.drop(columns='fare'),
    y=X['fare'],
    cv=3
)
'CV Score:', scores.mean()

('CV Score:', 0.9866782101817507)

In [38]:
from sklearn.model_selection import cross_val_score

cross_val_score(
    pipe, 
    X=X.drop(columns='fare'),
    y=X['fare'],
    cv=5
)

Traceback (most recent call last):
  File "C:\Users\amaamorado\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\amaamorado\Anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 105, in __call__
    score = scorer(estimator, *args, **kwargs)
  File "C:\Users\amaamorado\Anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 418, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "C:\Users\amaamorado\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "C:\Users\amaamorado\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 707, in score
    Xt = transform.transform(Xt)
  File "C:\Users\amaamorado\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 748, in transform
    Xs = self._fit_transform(
  File "C:\Users\amaa

array([nan, nan, nan, nan, nan])

## References
* [6.4. Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html)
* Gelman, Andrew, and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press, 2007.