# Feature Engineering
Feature engineering is the art (yes, art) of manually creating new features for a model with the goal of improving its performance. In this notebook, I explore (and record) a number of general, well-known techniques.
This notebook follows two sources:
* Kaggle [course](https://www.kaggle.com/learn/feature-engineering)
* DataCamp [course]()

## Baseline Model
Before creating new features, it is recommended to evaluate a relatively powerful baseline model on the initial features. Any improvement over this baseline then reflects the usefulness of the newly synthesized features.

## Mutual Information
### Definition
Having a large number of features might be extremely overwhelming. Thus, the first step is to reduce the available set of features into a smaller one that might serve as a starting point to build a powerful prediction model. Later, more features can be incorporated in the process.
Mutual Information (MI), unlike correlation, is capable of capturing any type of relationship between the target and the feature, not only linear ones.
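
To illustrate the difference, here is a minimal sketch on synthetic data (the variables `x` and `y` are made up for this example and are not part of the Ames dataset). The target is a deterministic but non-linear function of the feature, so the Pearson correlation is essentially zero while the MI estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Symmetric grid, so the linear (Pearson) correlation between x and x**2 vanishes
x = np.linspace(-1, 1, 1001)
y = x ** 2

pearson = np.corrcoef(x, y)[0, 1]
# mutual_info_regression expects a 2D feature matrix, hence the reshape
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f}")
```

The MI value itself comes from a nearest-neighbor estimator, so it should be read as a ranking signal rather than an exact quantity.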


### More details
In simple terms, MI reflects the extent to which the knowledge of one variable (feature) reduces the uncertainty of another (target).  
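On a technical note, MI is based on an information-theory measurement referred to as ***entropy***. The larger the entropy, the more uncertain the variable is. MI measures how much of the target's entropy is removed once the feature is known: the smaller the reduction, the less the two variables are tied to each other.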
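
For discrete variables, these quantities can be computed by hand. The sketch below assumes the standard identity MI(x; y) = H(x) + H(y) - H(x, y) and uses made-up toy data; the helper names `entropy` and `mutual_information` are illustrative and do not come from a library.

```python
import numpy as np
import pandas as pd

def entropy(series):
    """Shannon entropy (in bits) of a discrete variable."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """MI(x; y) = H(x) + H(y) - H(x, y) for two discrete variables."""
    joint = pd.Series(list(zip(x, y)))  # each (x, y) pair is one joint outcome
    return entropy(pd.Series(x)) + entropy(pd.Series(y)) - entropy(joint)

# Toy data (hypothetical): `same` is fully determined by the feature,
# while `unrelated` is independent of it.
feature = ["a", "a", "b", "b"]
same = [0, 0, 1, 1]
unrelated = [0, 1, 0, 1]

print(mutual_information(feature, same))       # 1 bit: the feature removes all uncertainty
print(mutual_information(feature, unrelated))  # 0 bits: the feature tells us nothing
```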

### Key Notes
There are important points to keep in mind:
* Each feature is evaluated separately. In other words, a feature might be quite powerful when combined with other features, yet not be significant on its own.
* A high MI score reflects a **potentially** useful feature. A transformation such as log, exponential, or polynomial might be needed to expose the relationship between the target and the feature to the model.
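
The first point can be seen on a small synthetic example (an XOR target, chosen here purely for illustration). Each feature alone carries no information about the target, while the pair determines it completely. The sketch uses `sklearn.metrics.mutual_info_score`, which works on discrete labels and returns MI in nats.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# All four combinations of two binary features, repeated so each is equally frequent
x1 = np.tile([0, 0, 1, 1], 50)
x2 = np.tile([0, 1, 0, 1], 50)
y = x1 ^ x2  # the target is the XOR of the two features

mi_x1 = mutual_info_score(x1, y)
mi_x2 = mutual_info_score(x2, y)
# Encode the pair (x1, x2) as a single discrete label to score the features jointly
mi_pair = mutual_info_score(2 * x1 + x2, y)

print(f"MI(x1; y)     = {mi_x1:.3f}")
print(f"MI(x2; y)     = {mi_x2:.3f}")
print(f"MI(x1, x2; y) = {mi_pair:.3f}")  # ln(2) nats, i.e. one full bit
```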
 

### Applying
The Kaggle course offers a great opportunity to practice this technique on the [***Ames***](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=data_description.txt) dataset. In the next section, we will consider a couple of related ideas that form the basis for more advanced techniques.

In [170]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [171]:
# set the defaults of matplotlib
# note: matplotlib >= 3.6 renamed this style to "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)


The function below relies on `Series.factorize` to convert non-numerical columns to integer codes; see this [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.factorize.html) for a better grasp of the code.

In [172]:
# this is a function to calculate the MI scores given data and labels
from sklearn.feature_selection import mutual_info_regression

def calculate_mi_scores(X, y):
    X = X.copy()
    # convert non-numerical data to numerical data
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # fill the Nan values in the numerical columns with 0 temporarily
    for col in X.select_dtypes(np.number):
        X[col] = X[col].fillna(0)

    # all data with an int dtype should be considered discrete (in the MI calculation)
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


In [173]:
def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

In [174]:
train_file = "ames.csv"
df = pd.read_csv(train_file).rename(columns={"SalePrice": "y"})
Y = "y"

In [175]:

features = ["YearBuilt", "MoSold", "ScreenPorch"]
sns.relplot(
    x="value", y=Y, col="variable", data=df.melt(id_vars=Y, value_vars=features), facet_kws=dict(sharex=False),
);


In [176]:
X = df.copy()
y = X.pop(Y)
mi_scores = calculate_mi_scores(X, y)
best_20_feats = list(mi_scores.head(20).index)

After finding the features with the highest MI scores, it is a good idea to search for features that interact with one or more of them. Such interactions can improve the model considerably.  
Creating new features might be tricky. Here are a couple of tips on how to discover and create new features:
* Build a better understanding of the features and the field of interest. Domain knowledge can be inspiring.
* Study previous work.
* Visualize the data: it is one of the most important tools.

### General ideas
1. It might be fruitful to apply a number of transformations such as ***log, exponential, polynomial***. These are mainly used to normalize the data, especially when it is heavily skewed.  
2. The more complex a combination is, the harder it is for the model to learn it on its own. Combinations involving different arithmetic operators are quite powerful; they are generally inspired by domain knowledge.
3. Some features simply indicate the presence or absence of a certain factor. Such features generally come together, and it might be useful to group them.
4. Certain features, such as strings, are quite complex. They can be broken down, as certain parts carry particular information (again, research is quite helpful in this regard).
5. If two features seem to interact, it might be useful to group one by the other and apply a number of aggregations.
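
As a small illustration of the first idea, the sketch below applies a log transform to a synthetic right-skewed feature (the `Area` series is made up for this example) and compares the skewness before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, e.g. an area or a price
area = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="Area")

# log1p computes log(1 + x), which stays defined when the feature contains zeros
area_log = np.log1p(area)

print(f"skewness before: {area.skew():.2f}")
print(f"skewness after:  {area_log.skew():.2f}")
```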

### Model and Feature Engineering
It helps to know the strengths and weaknesses of the model when creating features. Here are some guidelines:
* Linear models learn sums and differences naturally, but can't learn anything more complex.
* Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains.
* Linear models and neural nets generally do better with normalized features. Neural nets especially need features scaled to values not too far from 0. Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so.
* Tree models can learn to approximate almost any combination of features, but when a combination is especially important they can still benefit from having it explicitly created, especially when data is limited.
* Counts are especially helpful for tree models, since these models don't have a natural way of aggregating information across many features at once.
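
The point about ratios can be checked on synthetic data. In this sketch (all names and numbers are made up for illustration) the target is exactly the ratio of two features; a linear model fits it poorly from the raw features and almost perfectly once the ratio is supplied explicitly. The scores are in-sample R² values, which is enough to show the effect but is not a proper evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "a": rng.uniform(1, 10, 500),
    "b": rng.uniform(1, 10, 500),
})
y_demo = X_demo["a"] / X_demo["b"]  # the target is a pure ratio

# Linear model on the raw features only
r2_raw = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

# Same model after adding the ratio as an explicit feature
X_ratio = X_demo.assign(ratio=X_demo["a"] / X_demo["b"])
r2_ratio = LinearRegression().fit(X_ratio, y_demo).score(X_ratio, y_demo)

print(f"R^2 without ratio feature: {r2_raw:.3f}")
print(f"R^2 with ratio feature:    {r2_ratio:.3f}")
```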

In [177]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=3)
scoring = "neg_mean_squared_log_error"

def score(X, y, model):
    """Return the square root of the mean cross-validated MSLE of `model`."""
    X = X.copy()
    # factorize converts non-numerical data to integer codes
    for col in X.select_dtypes(["category", "object"]):
        X[col], _ = X[col].factorize()
    # fill missing numerical values with 0: a temporary solution
    for col in X.select_dtypes(np.number):
        X[col] = X[col].fillna(0)
    # scikit-learn returns the negated MSLE, so flip the sign before the square root
    neg_msle = cross_val_score(model, X, y, cv=kf, scoring=scoring)
    return np.sqrt(-neg_msle.mean())



In [178]:
best_20_feats.append(Y)
df_20 = df.copy()[best_20_feats]

In [190]:
X20 = df_20.copy()
y20 = X20.pop(Y)

In [180]:
from sklearn.ensemble import RandomForestRegressor as rfr

rf_base = rfr(n_estimators=200, max_depth=5, random_state=3)
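# score the baseline model on all features
print(score(X, y, rf_base))

# score the same model on the 20 features with the highest MI scores
print(score(X20, y20, rf_base))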

0.17436623262344367
0.17965724821215492


Let's create a number of new features:
* LivLotRatio = the ratio of GrLivArea to LotArea
* Spaciousness = the sum of FirstFlrSF and SecondFlrSF divided by TotRmsAbvGrd: the average space per room
* TotalOutsideSF = the sum of all porch areas and WoodDeckSF


In [181]:
# X_1: dataframe to hold the newly synthesized features
X_1 = pd.DataFrame()

X_1["LivLotRatio"] = X['GrLivArea'] / X['LotArea']
X_1["Spaciousness"] = (X['FirstFlrSF'] + X['SecondFlrSF']) / X['TotRmsAbvGrd']
X_1["TotalOutsideSF"] = X['WoodDeckSF'] +  X["OpenPorchSF"]+ X["EnclosedPorch"] + X["Threeseasonporch"] + X["ScreenPorch"]



If we find an interaction between a categorical and a continuous feature, consider the following procedure:
1. Create dummy variables from the categorical feature: `X_new = pd.get_dummies(df["Cat"], prefix="meaningful")`
2. Multiply them row by row with the continuous feature: `X_new = X_new.mul(df["Con"], axis=0)`



In [182]:
# let's apply this idea to our dataset with "BldgType" (building type) and "GrLivArea":
X_2 = pd.get_dummies(df['BldgType'], prefix='Bldg')
X_2 = X_2.mul(df['GrLivArea'], axis=0)

#### Counts feature engineering example

In [183]:
# consider the outdoor-surface features: count how many of them each house has
X_3 = pd.DataFrame()

porch_features = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]

# a surface counts as present when its area is greater than 0
X_3["PorchTypes"] = X[porch_features].gt(0).sum(axis=1)


#### Breaking down feature engineering example

In [184]:
df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

In [185]:
# the first word of each value (split on "_") gives a broader class of homes

X_4 = pd.DataFrame()

X_4["MSClass"] = X["MSSubClass"].str.split("_", n=1, expand=True)[0]


#### Grouping feature engineering example
A house's price is strongly affected by the neighborhood it is located in. Let's compute the median (or mean) living area of the houses in each neighborhood.

In [186]:
X_5 = pd.DataFrame()

# median above-ground living area of each neighborhood, broadcast back to every row
X_5["MedNhbdArea"] = X.groupby("Neighborhood")["GrLivArea"].transform("median")


In [191]:
# let's check how the model performs with the given new features
X20 = X20.join([X_1, X_2, X_3, X_4, X_5])

In [194]:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])

In [195]:
print(score(X_new, y, rf_base))

0.17451238570779448


#### Using Unsupervised learning
It is worth noting that unsupervised learning algorithms such as ***K-means*** can be applied to a set of interconnected features, with the resulting cluster label used as a new feature. To tune the hyperparameters, mainly the number of clusters, it is useful to measure the model's performance for each candidate value.
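
A minimal sketch of this idea follows. The data is synthetic and the names `feat_a`, `feat_b`, and `add_cluster_feature` are hypothetical; on the Ames data the same pattern would use a group of related columns and the notebook's own scoring function.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: two related features and a target
X_demo = pd.DataFrame({
    "feat_a": rng.normal(size=300),
    "feat_b": rng.normal(size=300),
})
y_demo = X_demo["feat_a"] * X_demo["feat_b"] + rng.normal(scale=0.1, size=300)

def add_cluster_feature(X, cols, k, seed=0):
    """Return a copy of X with a K-means cluster label computed from `cols`."""
    X = X.copy()
    scaled = StandardScaler().fit_transform(X[cols])  # K-means is sensitive to scale
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
    X["Cluster"] = kmeans.fit_predict(scaled)
    return X

# Measure the model's cross-validated error for each candidate number of clusters
model = RandomForestRegressor(n_estimators=50, random_state=0)
results = {}
for k in [2, 3, 4]:
    X_k = add_cluster_feature(X_demo, ["feat_a", "feat_b"], k)
    neg_mse = cross_val_score(model, X_k, y_demo, cv=5, scoring="neg_mean_squared_error")
    results[k] = -neg_mse.mean()

print(results)
```

Scaling before clustering is a design choice of this sketch: without it, the feature with the largest range dominates the distance computation.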

#### Using PCA

#### Categorical Encoding
This technique makes categorical features usable by the model by converting them to numerical values.
##### Target Encoding
Target encoding is any encoding that replaces a feature's categories with a number derived from the target. One popular approach is to map each category to the mean of the target over the rows belonging to that category.
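
Mean encoding can be sketched in one line with pandas. The toy frame below is made up for illustration; only the `groupby(...).transform("mean")` pattern matters.

```python
import pandas as pd

# Toy data (hypothetical values)
df_demo = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "C"],
    "y": [100, 120, 200, 220, 240, 300],
})

# Replace each category with the mean of the target within that category
df_demo["Neighborhood_enc"] = df_demo.groupby("Neighborhood")["y"].transform("mean")

print(df_demo)
```

Note that category `C` has a single row, so its encoding is just that row's target value, which is exactly the kind of case where this basic form can overfit.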


##### Smoothing
Target encoding carries certain risks that should be taken into account. Missing values (including categories the encoder has never seen) must be imputed somehow: for a relatively large dataset, this might not be an easy task. Additionally, rare categories cannot be represented accurately by practically any statistic. Target encoding in its most basic form might therefore lead to overfitting.  
The solution is the technique known as ***smoothing***. It can generally be expressed as follows:
$$\text{encoding} = w \cdot (\text{statistic in category}) + (1 - w) \cdot (\text{overall statistic})$$
where the weight $w$ is computed from the category frequency $n$ (the number of rows in the category) and a smoothing factor $m$:
$$w = \frac{n}{n + m}$$
The choice of $m$ is the result of several considerations: 
* if the target values within a category vary a lot, it might be a good idea to consider large values of $m$
* if the target values vary only slightly within a specified range, then smaller values of $m$ would not hurt.  
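
The formula can be worked through on a toy example. The function name `m_estimate_encode` and the data are hypothetical; the sketch simply applies $w = n / (n + m)$ per category with the mean as the statistic.

```python
import pandas as pd

def m_estimate_encode(categories, target, m):
    """Blend each category's mean target with the overall mean, using w = n / (n + m)."""
    overall = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    w = stats["count"] / (stats["count"] + m)
    smoothed = w * stats["mean"] + (1 - w) * overall
    return categories.map(smoothed)

demo = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "C"],
    "y": [100, 120, 200, 220, 240, 300],
})

# m = 0 gives w = 1, i.e. plain mean encoding with no smoothing
demo["enc_m0"] = m_estimate_encode(demo["Neighborhood"], demo["y"], m=0)
# m = 2 pulls every category toward the overall mean (about 196.67);
# the single-row category C moves the most: 300 becomes about 231.11
demo["enc_m2"] = m_estimate_encode(demo["Neighborhood"], demo["y"], m=2)

print(demo)
```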
  
  
When to consider target encoding:
* High-cardinality features: the larger the number of categories, the more troublesome it gets. In such situations, label encoding (mapping each category to an arbitrary integer) and one-hot encoding are not favorable choices.
* Domain-supported features: certain features can be quite relevant to the prediction even when they score poorly on feature metrics. A target encoding might bring the feature's usefulness to the surface.

When using target encoding, it is ***crucial*** to fit the encoder on a separate sample of the training data and not on all of it. Otherwise, the model might easily overfit. Instead of encoding the values manually, it is advisable to use the scikit-learn-compatible encoders of the `category_encoders` package, described in detail in this [link](https://practicaldatascience.co.uk/machine-learning/how-to-use-category-encoders-to-transform-categorical-variables).  
Smoothing can be applied using the ***MEstimateEncoder*** class.
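
The split described above can be sketched with pandas alone. This is an illustration of the workflow under stated assumptions, not the `MEstimateEncoder` implementation itself: the data is synthetic, the `Zipcode` column and the 25% encoding share are arbitrary choices, and the smoothing follows the formula given earlier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Zipcode": rng.choice(list("ABCDEFGH"), size=400),  # hypothetical categorical feature
    "y": rng.normal(100, 20, size=400),
})

# Use 25% of the rows only to learn the encoding; train the model on the rest
encode_split, train_split = train_test_split(data, test_size=0.75, random_state=0)

m = 5.0  # smoothing factor (an arbitrary choice for this sketch)
prior = encode_split["y"].mean()
stats = encode_split.groupby("Zipcode")["y"].agg(["mean", "count"])
w = stats["count"] / (stats["count"] + m)
encoding = w * stats["mean"] + (1 - w) * prior

# Apply the learned mapping; categories unseen in the encoding split fall back to the prior
train_split = train_split.copy()
train_split["Zipcode_enc"] = train_split["Zipcode"].map(encoding).fillna(prior)

print(train_split.head())
```

The rows in `encode_split` are never shown to the model, so the encoded values cannot leak the target of the rows the model is trained on.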