### Feature engineering

1) Start by ranking the features with a feature utility metric, a function that scores how strongly each feature is associated with the target.

2) Then choose a smaller set of the most useful features to develop first.

#### Mutual Information

The advantage of mutual information (MI) over correlation is that it can detect any kind of relationship, whereas (Pearson) correlation only detects linear relationships.

Mutual information is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other.

Technical note: what we're calling uncertainty is measured using a quantity from information theory known as "entropy". The entropy of a variable means roughly: "how many yes-or-no questions you would need to describe an occurrence of that variable, on average."
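
The "yes-or-no questions" reading can be checked on a few toy distributions. This is a minimal sketch using only the standard library; the helper name `entropy_bits` is made up for illustration, and the probabilities are invented examples.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])     # 1.0: one yes-or-no question
eight_way = entropy_bits([1 / 8] * 8)    # 3.0: three yes-or-no questions
biased_coin = entropy_bits([0.9, 0.1])   # below 1.0: the outcome is easier to guess

print(fair_coin, eight_way, biased_coin)
```

The more predictable a variable is, the lower its entropy, which is why the biased coin scores below the fair one.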

If the MI between two quantities is zero, they are independent: neither one tells you anything about the other.
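
To make the contrast with correlation concrete, here is a small sketch on made-up toy data. The helpers `pearson`, `entropy`, and `mutual_information` are hypothetical names written with the standard library only, and the MI value is a simple plug-in estimate from counts, not the estimator scikit-learn uses for continuous features.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def entropy(values):
    """Empirical entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Plug-in MI estimate in bits: H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# A purely nonlinear relationship: y is fully determined by x, yet not linearly
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
print(pearson(xs, ys))             # correlation sees nothing
print(mutual_information(xs, ys))  # MI is clearly positive

# Two independent binary variables: every combination is equally likely
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(mutual_information(a, b))    # 0.0
```

In the first case knowing `x` removes all uncertainty about `y`, so the MI equals the entropy of `y`, even though the linear correlation is zero.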

Things to remember:

    1) MI can help you to understand the relative potential of a feature as a predictor of the target
    
    2) It's possible for a feature to be very informative when interacting with other features, but not so informative 
       all alone. MI can't detect interactions between features. It is a univariate metric.
       
    3) The actual usefulness of a feature depends on the model you use it with. Just because a feature has a high 
       MI score doesn't mean your model will be able to do anything with that information. You may need to transform 
       the feature.
       
    4) The scikit-learn algorithm for MI treats discrete features differently from continuous features. 
       Consequently, you need to tell it which are which. Scikit-learn has two mutual information metrics 
       in its feature_selection module:
            
            a) one for real-valued targets (mutual_info_regression) 

            b) one for categorical targets (mutual_info_classif).
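
Point 2 above (a feature can be uninformative alone but informative in combination) can be seen in a tiny XOR example. This is a sketch on invented toy data; `mi_bits` is a hypothetical helper that computes a plug-in MI estimate from counts with the standard library, not a scikit-learn function.

```python
import math
from collections import Counter

def mi_bits(feature, target):
    """Plug-in mutual information, in bits, between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    px = Counter(feature)
    py = Counter(target)
    # Sum of p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Toy data: the target is the XOR of two binary features
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [u ^ v for u, v in zip(x1, x2)]

alone_1 = mi_bits(x1, y)                   # 0.0: x1 alone says nothing about y
alone_2 = mi_bits(x2, y)                   # 0.0: same for x2
together = mi_bits(list(zip(x1, x2)), y)   # 1.0: the pair determines y

print(alone_1, alone_2, together)
```

A univariate MI ranking would score both features at zero here, which is why a low score on its own is not enough reason to discard a feature.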

In [None]:
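import pandas as pd

# Assumes `df` is the already-loaded DataFrame with a "price" column
X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()
    
# All discrete features should now have integer dtypes (double-check this before using MI!)
# is_integer_dtype matches any integer width, so the check does not depend on the platform's default int
discrete_features = [pd.api.types.is_integer_dtype(dtype) for dtype in X.dtypes]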

In [None]:
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    # Estimate MI between each feature and the target, then label and rank the scores
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores

In [None]:
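import matplotlib.pyplot as plt
import numpy as np

def plot_utility_scores(scores):
    # Sort ascending so the highest-scoring feature ends up at the top of the chart
    scores = scores.sort_values(ascending=True)
    positions = np.arange(len(scores))  # one bar position per feature
    ticks = list(scores.index)
    plt.barh(positions, scores)
    plt.yticks(positions, ticks)
    plt.title("Mutual Information Scores")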


plt.figure(dpi=100, figsize=(8, 5))
plot_utility_scores(mi_scores)

In [None]:
import seaborn as sns

# As we might expect, the high-scoring curb_weight feature exhibits a strong relationship with price, the target.

sns.relplot(x="curb_weight", y="price", data=df);

# The fuel_type feature has a fairly low MI score, but as we can see from the figure,
# it clearly separates two price populations with different trends within the horsepower feature.
# This indicates that fuel_type contributes an interaction effect and might not be unimportant after all.
# Before deciding a feature is unimportant from its MI score, it's good to investigate any possible
# interaction effects. Domain knowledge can offer a lot of guidance here.

sns.lmplot(x="horsepower", y="price", hue="fuel_type", data=df);
