### Feature Engineering
#### Resources:
- kaggle course: https://www.kaggle.com/code/ryanholbrook/what-is-feature-engineering
- book 'Hands-on Machine Learning with Scikit-Learn, Kera and Tensorflow', pages 1-33

#### Definition
Feature engineering makes the data better suited to the problem at hand. It is usually used to improve a model's predictive performance.

Step 1: Establish a baseline by training the model on the un-augmented dataset

- Define X and y variables
- Train and score baseline model
- Train and score baseline model

Code example:
    X = df.copy()
    y = X.pop("CompressiveStrength")
    baseline = RandomForestRegressor(criterion="mae", random_state=0)
    baseline_score = cross_val_score(baseline, X, y, cv=5, scoring="neg_mean_absolute_error")
    baseline_score = -1 * baseline_score.mean()
    print(f"MAE Baseline Score: {baseline_score:.4}")

In [1]:
from xgboost import XGBRegressor

In [2]:
import pandas as pd
train_df = pd.read_csv('train.csv')
train_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
X = train_df.drop(columns='SalePrice')
X.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal


In [4]:
y = train_df['SalePrice']

In [5]:
from sklearn.model_selection import cross_val_score

In [6]:
scores = cross_val_score(XGBRegressor(objective='reg:squarederror'), X, y, scoring='neg_mean_squared_error')
(-scores)**0.5
# probably error in variables --- need to convert categorical variables?

Traceback (most recent call last):
  File "/Users/floramatos/opt/anaconda3/envs/PythonData/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/floramatos/opt/anaconda3/envs/PythonData/lib/python3.7/site-packages/xgboost/sklearn.py", line 504, in fit
    nthread=self.n_jobs)
  File "/Users/floramatos/opt/anaconda3/envs/PythonData/lib/python3.7/site-packages/xgboost/core.py", line 520, in __init__
    data, feature_names, feature_types
  File "/Users/floramatos/opt/anaconda3/envs/PythonData/lib/python3.7/site-packages/xgboost/core.py", line 420, in _convert_dataframes
    meta_type)
  File "/Users/floramatos/opt/anaconda3/envs/PythonData/lib/python3.7/site-packages/xgboost/core.py", line 294, in _maybe_pandas_data
    raise ValueError(msg + ', '.join(bad_fields))
ValueError: DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fie

array([nan, nan, nan, nan, nan])

Step 2:
After assessing correlations/mutual information, should we merge features that are highly correlated?
- For example, using dimensionality reduction and feature extraction, an unsupervised learning algorithm.
- Would this create a better model?

Bad Data -- what questions to ask in preparation for ML?

-> Is the data insufficient?
- Is 1460 cases sufficient data to train our ML model?
- "even for very simple problems you typically need thousands of examples" (page 23)

-> Is the data representative of the new cases you want to generalize to?
- Get a sense of the data through exploratory analysis.
- We don't have control over the data collection, so this is not on our hands.
- We can only identify an issue and deal with it.

-> Are there errors, outliers and noise in the training data?
We should spend time cleaning up our training data by:
- identifying outliers -- discard them? fix the errors manually?
- identifying rows with missing data -- drop rows? fill in the missing values?
- identifying columns with missing data -- delete feature? fill in the missing values? train model with and without feature?

-> Are some features irrelevant?
Feature engineering involves:
- feature selection = select most useful features to train on among existing features
- feature extraction = combine existing features to produce a more useful one (dimensionality reduction)
- feature creation = gather new data

# Mutual Information
Source code: https://www.kaggle.com/code/ryanholbrook/mutual-information

In [7]:
# define features and outcome variable
X = train_df.copy()
y = train_df.pop('SalePrice')

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

# All discrete features should now have integer dtypes (double-check this before using MI!)
discrete_features = X.dtypes == int

In [9]:
# check factorized categorical variables
X

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,0,65.0,8450,0,-1,0,0,0,...,0,-1,-1,-1,0,2,2008,0,0,208500
1,2,20,0,80.0,9600,0,-1,0,0,0,...,0,-1,-1,-1,0,5,2007,0,0,181500
2,3,60,0,68.0,11250,0,-1,1,0,0,...,0,-1,-1,-1,0,9,2008,0,0,223500
3,4,70,0,60.0,9550,0,-1,1,0,0,...,0,-1,-1,-1,0,2,2006,0,1,140000
4,5,60,0,84.0,14260,0,-1,1,0,0,...,0,-1,-1,-1,0,12,2008,0,0,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,0,62.0,7917,0,-1,0,0,0,...,0,-1,-1,-1,0,8,2007,0,0,175000
1456,1457,20,0,85.0,13175,0,-1,0,0,0,...,0,-1,0,-1,0,2,2010,0,0,210000
1457,1458,70,0,66.0,9042,0,-1,0,0,0,...,0,-1,2,0,2500,5,2010,0,0,266500
1458,1459,20,0,68.0,9717,0,-1,0,0,0,...,0,-1,-1,-1,0,4,2010,0,0,142125


In [10]:
# calculate mutual information
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [12]:
import matplotlib.pyplot as plt
import numpy as np

def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")


plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)

NameError: name 'mi_scores' is not defined

<Figure size 800x500 with 0 Axes>