# Preprocessing

Today: A discussion of options during preprocessing to generate better predictions
- Missing values
- Outliers
- Feature construction
- Feature transformation/scaling
- Feature selection


Some say that applied ML is just data cleaning and processing

- Remember the Nate Silver quote?
- Andrew Ng (Coursera, Stanford, Google Brain, Baidu... super influential in CS)

# Missing Values

- think about what the right replacement is!
- why is it missing --> might help pick replacement or option
- different variables should likely get different filling options
- fancy (actually, easy) but can be powerful: nearest-neighbor (NN) replacement
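
A minimal sketch of two filling options, assuming "NN replacement" means nearest-neighbor imputation (sklearn's `KNNImputer`). The toy data and column names are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data (made up): age is missing for the third person
df = pd.DataFrame({
    "age":    [22.0, 24.0, np.nan, 61.0, 64.0],
    "income": [30.0, 32.0, 31.0, 90.0, 95.0],
})

# Option 1: fill with the column median (ignores the other variables)
median_filled = SimpleImputer(strategy="median").fit_transform(df)

# Option 2: fill with the average age of the 2 rows most similar on income
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)

print("median fill:", median_filled[2, 0])
print("KNN fill:   ", knn_filled[2, 0])
```

The median fill lands between the young and old groups, while the nearest-neighbor fill borrows from the rows that look like the person with the missing age.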

# Outliers

Can alter model estimations
- change inference
- reduce predictive power

[Which of these 4 datasets has problems with outliers?](https://ledatascifi.github.io/ledatascifi-2021/content/03/04b-whyplot.html?highlight=outliers)

[Options for dealing with them](https://ledatascifi.github.io/ledatascifi-2021/content/03/05d_outliers.html?highlight=outliers)
- Drop 
- Transform
    - winsorize: if a value is above the 99th percentile (p99), set it equal to p99
    - scaling (next slide)

In sklearn, some scalers & models are more sensitive to outliers...
- Good idea to address by default!

If you want to drop or winsorize inside of sklearn pipelines, you need to use [FunctionTransformer](https://scikit-learn.org/stable/modules/preprocessing.html#custom-transformers) 
- This lets you use outside functions (like `pd`) inside pipelines
- Still, this is harder than it should be, code-wise... should be updated in future
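
One way this could look, as a sketch: `winsorize` below is a hypothetical helper (not part of sklearn) that clips each column to its p1-p99 range. Caveat: `FunctionTransformer` is stateless, so the percentiles are recomputed on whatever data is passed in rather than learned from the training set.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def winsorize(X, lower=1, upper=99):
    """Hypothetical helper: clip each column to its [p1, p99] range."""
    X = np.asarray(X, dtype=float)
    lo, hi = np.percentile(X, [lower, upper], axis=0)
    return np.clip(X, lo, hi)

# Winsorize first, then scale, all inside one pipeline
pipe = make_pipeline(FunctionTransformer(winsorize), StandardScaler())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0, 0] = 1_000  # plant one extreme outlier

X_clean = pipe.fit_transform(X)
print("max before:", X.max(), " max after:", X_clean.max())
```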

Transforming within sklearn pipelines
- Covered 
- `RobustScaler` works inside of pipelines and is robust to outliers

# Feature construction

This is where really good models are made!

Examples on the next slide!

- interactions, e.g: 
    - story example: woman or child, woman and first class, finance AND coding
- polynomial expansions. If you have $X_1$ and $X_2$:
    - Poly of degree 2: $X_1$, $X_2$, $X_1^2$, $X_2^2$, $X_1*X_2$
    - [Visual example](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html#Validation-curves-in-Scikit-Learn)
- binning: 
    - not profits as a #, but profit bins, e.g.: lowest, low, negative, zero, positive, high, highest
    - remember the HW problem on year vs year dummies?
- extracting info from variables (e.g. date vars or text vars)
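
A quick sketch of binning and date extraction with pandas. The data, cut points, and bin labels are made up for illustration:

```python
import pandas as pd

# Made-up firm data
df = pd.DataFrame({
    "profit": [-50, -5, 0, 3, 20, 400],
    "date": pd.to_datetime(["2020-01-15", "2020-06-30", "2021-03-01",
                            "2021-12-31", "2022-07-04", "2022-11-11"]),
})

# Binning: the cut points are arbitrary choices here
df["profit_bin"] = pd.cut(
    df["profit"],
    bins=[-float("inf"), -10, 0, 10, float("inf")],
    labels=["big loss", "small loss/zero", "small profit", "big profit"],
)

# Extracting info from a date variable
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Year as dummies (one column per year) instead of year as a number
year_dummies = pd.get_dummies(df["year"], prefix="yr")

print(df)
print(year_dummies)
```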


```python
import seaborn as sns
from statsmodels.formula.api import ols as sm_ols
import statsmodels.api as sm

# for the polynomial
from sklearn.preprocessing import PolynomialFeatures
polynomial_features = PolynomialFeatures(degree=3)

titanic = sns.load_dataset("titanic")

# .copy() avoids pandas' SettingWithCopyWarning when columns are modified below
X = titanic[['age', 'sex', 'survived']].copy()
X['sex'] = (X['sex'] == 'female').astype(int)  # 1 = female, 0 = male
X = X.dropna()                                 # drop rows with missing age
y = X['survived']
X = X.drop('survived', axis=1)
X = polynomial_features.fit_transform(X)       # all terms up to degree 3
```


```python
print(f'''
R2 of different models 
---------------------------
Age + Gender:                 {sm_ols('survived ~ age + sex',data=titanic).fit().rsquared.round(4)}

Child + Gender:               {sm_ols('survived ~ (age<18) + sex',data=titanic).fit().rsquared.round(4)}
Child + Gender + Interaction: {sm_ols('survived ~ (age<18) + sex + sex*(age<18)',data=titanic).fit().rsquared.round(4)}

poly deg 3 of (Age + Gender): {sm.OLS(y,X).fit().rsquared.round(4)}

''')
```

```
R2 of different models 
---------------------------
Age + Gender:                 0.2911

Child + Gender:               0.2994
Child + Gender + Interaction: 0.3093

poly deg 3 of (Age + Gender): 0.3166
```




# Scaling and transformation

https://youtu.be/9rBc3rTsJsY?t=195
- StandardScaler is sensitive to outliers (which change mean & std)
- RobustScaler uses percentiles (not sensitive to outliers)
- [Skewed variables?](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_map_data_to_normal.html#sphx-glr-auto-examples-preprocessing-plot-map-data-to-normal-py) Use PowerTransformer, possibly
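
A small sketch of the difference between the two scalers, using made-up data with one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up data: nine ordinary values and one extreme outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

standard = StandardScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# How spread out are the nine ordinary points after each scaling?
standard_spread = standard[:9].max() - standard[:9].min()
robust_spread = robust[:9].max() - robust[:9].min()

print("StandardScaler spread of ordinary points:", standard_spread)
print("RobustScaler spread of ordinary points:  ", robust_spread)
```

With `StandardScaler`, the outlier inflates the std so much that the ordinary points get squashed together. `RobustScaler` centers on the median and divides by the interquartile range, so the ordinary points keep a usable spread.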

Illustration: SVM with 2 versions
- with and without scaling where one var is x10000.
- will post example later today
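
One possible version of such an illustration, as a sketch on simulated data (not the example referred to above). The assumption here is that the x10000 variable is pure noise and the small-scale variable carries the signal:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)          # the variable that determines y
noise = rng.normal(size=n) * 10_000  # irrelevant variable on a x10000 scale
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# Same model, with and without scaling, compared by 5-fold CV accuracy
acc_raw = cross_val_score(SVC(), X, y, cv=5).mean()
acc_scaled = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean()

print("SVM without scaling:", acc_raw)
print("SVM with scaling:   ", acc_scaled)
```

Without scaling, distances between observations are dominated by the large-scale variable, so the SVM mostly ignores the one that matters.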

# [Feature selection](https://scikit-learn.org/stable/modules/feature_selection.html) and/or extraction

- If you have too many variables, you will create an overfit model!
    - The link above shows options in sklearn to pick some variables
    - Options: `RFECV`, `SelectFromModel`, `SelectKBest`, `Lasso` (HW!)
    - Also: Picking based on "feature importance" from bagging/tree models 
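
A sketch of two of these options on simulated data, where only 5 of 50 features matter. The settings (`k=5`, `cv=5`, the noise level) are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import LassoCV

# Simulated data: 50 candidate features, only 5 truly matter
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Option 1: keep the k features with the strongest univariate relation to y
kbest = SelectKBest(score_func=f_regression, k=5).fit(X, y)
n_kbest = kbest.get_support().sum()

# Option 2: keep the features that Lasso gives a nonzero coefficient
lasso_select = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)
n_lasso = lasso_select.get_support().sum()

print("SelectKBest kept:", n_kbest)
print("Lasso kept:      ", n_lasso)
```

Both selectors have `fit`/`transform`, so they can go inside a pipeline before the model.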

An alternative to picking variables is "combining them" via PCA
- Useful when you have LOTS of vars AND think the "true" number that matter is low
- [illustration](https://www.textbook.ds100.org/ch/25/pca_dims.html#intuition-for-reducing-dimensions)
- pros: reduces overfitting, quicker estimation
- cons: hard (very!) to interpret what the model is doing 
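
A sketch of PCA on simulated data, where 20 observed variables are driven by only 2 hidden factors. The data-generating setup and the 95% variance threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated data: 20 observed variables built from 2 hidden factors plus noise
factors = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 20))
X = factors @ loadings + 0.1 * rng.normal(size=(300, 20))

# Scale first (PCA is sensitive to scale), then keep enough components
# to explain 95% of the variance
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_small = pca.fit_transform(X)

print(X.shape, "->", X_small.shape)
print(pca[-1].explained_variance_ratio_.round(3))
```

The new columns are combinations of all 20 original variables, which is why the resulting model is hard to interpret.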


## Next week

- Continue the discussion of preprocessing with practice
- Modeling tips (things that win competitions and endear bosses)
- Discussion of projects

### Student demos 
- Please send an email if you want a chance to do a demo for participation (I know 6-8 of you haven't yet from my records)

![](https://media.giphy.com/media/H7x1H0veAJlo4/giphy.gif)

