When you first encounter a new dataset:
1. Construct a ranking with a feature utility metric ("**mutual information**"), a function that measures the association between a feature and the target.
    - mutual information (MI): can detect any kind of relationship, while correlation only detects linear relationships.
    - high MI: the feature contains significant information about the target
    - low MI: the feature, taken on its own, does not provide much useful information about the target. Before discarding it, consider:
        - feature interactions: MI evaluates each feature individually w.r.t. the target, but some features become useful only when combined with other features. So low MI doesn't necessarily mean useless.
        - domain knowledge
        - correlation: check whether the low-MI feature is correlated with other highly informative features.
2. Choose a smaller set of the most useful features to develop first.
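
To make the contrast between MI and correlation concrete, here is a minimal sketch on toy data (not one of the course datasets). The helper names `pearson` and `mutual_information` are made up for this illustration. The MI here is a simple plug-in estimate for discrete values, measured in nats, and is not the nearest-neighbour estimator that scikit-learn's `mutual_info_regression` relies on.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def mutual_information(xs, ys):
    """Plug-in MI estimate (natural log, so in nats) for discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in joint.items()
    )


# Toy data: y is fully determined by x, but the relationship is not linear.
xs = [-2, -1, 0, 1, 2] * 20
ys = [x * x for x in xs]

print(pearson(xs, ys))             # 0.0: no linear relationship
print(mutual_information(xs, ys))  # about 1.05: x still tells us a lot about y
```

The correlation is zero because the parabola is symmetric, yet the MI is clearly positive, which is the sense in which MI "can detect any kind of relationship".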

Mutual Information
- a measure of how much knowing one quantity reduces uncertainty about the other.
- if you knew the value of a feature, how much more confident would you be about the target?
- uncertainty is measured with "entropy": roughly, how many yes-or-no questions you would need, on average, to describe an occurrence of that variable.
- MI is never negative; when MI = 0, the quantities are independent.
- values of MI > 2.0 are uncommon, since MI is a logarithmic quantity.
- MI is a univariate metric: it can't detect interactions between features.
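
The entropy and independence points above can be checked on small discrete examples. This is a sketch only: `entropy_bits` and `mi_bits` are hypothetical helper names, the data is invented, and base-2 logarithms are used so that entropy reads directly as a number of yes-or-no questions.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mi_bits(xs, ys):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


coin = ["H", "T"]      # two equally likely outcomes
die8 = list(range(8))  # eight equally likely outcomes
print(entropy_bits(coin))  # 1.0 -> one yes-or-no question
print(entropy_bits(die8))  # 3.0 -> three yes-or-no questions

# Independent variables: every (x, y) combination occurs equally often.
xs = [x for x in range(4) for _ in range(4)]
ys = [y for _ in range(4) for y in range(4)]
print(mi_bits(xs, ys))  # 0.0 -> knowing x says nothing about y
print(mi_bits(xs, xs))  # 2.0 -> knowing x removes all uncertainty about x
```

MI can never exceed the entropy of either variable, so a large score needs a variable with many distinguishable, reasonably balanced levels. That is one way to read why large MI values are uncommon.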

### Example - 1985 Automobiles

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use("seaborn-v0_8-whitegrid")  # named "seaborn-whitegrid" before Matplotlib 3.6

df = pd.read_csv("../input/fe-course-data/autos.csv")
df.head()

In [None]:
X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

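# All discrete features should now have integer dtypes (double-check this before using MI!)
# is_integer_dtype matches any integer width; `== int` can miss int64 columns on some platforms.
discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]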

In [None]:
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores

def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")


plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)

curb_weight          1.540126
highway_mpg          0.951700
length               0.621566
fuel_system          0.485085
stroke               0.389321
num_of_cylinders     0.330988
compression_ratio    0.133927
fuel_type            0.048139
Name: MI Scores, dtype: float64
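
With a ranking like the one above, the second step of the workflow (choosing a smaller set of features to develop first) can be sketched as follows. The function `top_features` and its `k` and `min_score` parameters are hypothetical, and the cutoff values are arbitrary choices for illustration, not recommendations from the course.

```python
# MI scores copied from the output above (every third feature of the ranking).
mi_scores = {
    "curb_weight": 1.540126,
    "highway_mpg": 0.951700,
    "length": 0.621566,
    "fuel_system": 0.485085,
    "stroke": 0.389321,
    "num_of_cylinders": 0.330988,
    "compression_ratio": 0.133927,
    "fuel_type": 0.048139,
}


def top_features(scores, k=None, min_score=0.0):
    """Return feature names sorted by score, optionally cut to the top k
    and/or to those scoring at least `min_score`."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, score in ranked if score >= min_score]
    return kept[:k]  # k=None keeps everything that passed the threshold


print(top_features(mi_scores, k=3))
# ['curb_weight', 'highway_mpg', 'length']
print(top_features(mi_scores, min_score=0.1))  # drops only fuel_type
```

A cutoff like this is only a starting point: a feature that falls below it, such as `fuel_type`, may still matter through an interaction.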

In [None]:
sns.relplot(x="curb_weight", y="price", data=df);

Before deciding a feature is unimportant from its MI score, it's good to investigate any possible interaction effects -- domain knowledge can offer a lot of guidance here.

In [None]:
sns.lmplot(x="horsepower", y="price", hue="fuel_type", data=df);
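
An extreme toy case shows why a low univariate MI score does not rule out an interaction effect. The data below is invented (it is not the automobile dataset) and `entropy_bits` and `mi_bits` are hypothetical helpers: the target is the exclusive-or of two binary features, so each feature alone scores zero while the pair determines the target completely.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mi_bits(xs, ys):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


# Two binary features covering all four combinations equally often.
a = [0, 0, 1, 1] * 25
b = [0, 1, 0, 1] * 25
y = [ai ^ bi for ai, bi in zip(a, b)]  # target depends only on the combination

print(mi_bits(a, y))                # 0.0 -> a alone looks useless
print(mi_bits(b, y))                # 0.0 -> b alone looks useless
print(mi_bits(list(zip(a, b)), y))  # 1.0 -> together they determine y
```

Real interactions are rarely this clean, but the same logic applies: a univariate score cannot see what a feature contributes in combination with others.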

## Exercises

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex2 import *

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression

# Set Matplotlib defaults
plt.style.use("seaborn-v0_8-whitegrid")  # named "seaborn-whitegrid" before Matplotlib 3.6
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)


# Load data
df = pd.read_csv("../input/fe-course-data/ames.csv")


# Utility functions from Tutorial
def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

In [None]:
features = ["YearBuilt", "MoSold", "ScreenPorch"]
sns.relplot(
    x="value", y="SalePrice", col="variable", data=df.melt(id_vars="SalePrice", value_vars=features), facet_kws=dict(sharex=False),
);

### 1. Understand mutual information
Based on the plots, which feature do you think would have the highest mutual information with SalePrice?

Own Ans: YearBuilt, because there is a clear pattern between SalePrice and YearBuilt. <br>
<br>
Correct Ans: Based on the plots, YearBuilt should have the highest MI score, since knowing the year tends to constrain SalePrice to a smaller range of possible values. This is generally not the case for MoSold. Finally, since ScreenPorch is usually just one value, 0, on average it won't tell you much about SalePrice (though more than MoSold).
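
The ScreenPorch argument can be illustrated numerically: a feature that is almost always the same value has very little entropy, and MI cannot exceed that entropy. The sketch below uses invented toy data with made-up names (`price`, `era`, `porch`), not the Ames data, and the same hypothetical helpers as a plug-in estimate in bits.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mi_bits(xs, ys):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


# Toy target: half the houses are "low" priced, half are "high".
price = ["low"] * 50 + ["high"] * 50
# A balanced feature that splits the houses exactly along the price bands.
era = [0] * 50 + [1] * 50
# A feature that is 0 for 95% of houses; the 5 houses with a porch are all "high".
porch = [0] * 95 + [1] * 5

print(mi_bits(era, price))    # 1.0 bit: knowing era pins down the price band
print(entropy_bits(porch))    # about 0.29 bits: the ceiling for porch's MI
print(mi_bits(porch, price))  # about 0.05 bits: informative only in rare cases
```

Even though every house with a porch is expensive in this toy data, the feature scores low because it says almost nothing about the 95 houses without one.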



In [None]:
# compute mutual information scores for the Ames features:
X = df.copy()
y = X.pop('SalePrice')

mi_scores = make_mi_scores(X, y)

In [None]:
print(mi_scores.head(20))  # top 20
print(mi_scores.tail(20))  # bottom 20

plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores.head(20))
plt.figure(dpi=100, figsize=(8, 5))  # separate figure so the two bar charts don't overlap
plot_mi_scores(mi_scores.tail(20))

OverallQual     0.581262
Neighborhood    0.569813
GrLivArea       0.496909
YearBuilt       0.437939
GarageArea      0.415014
TotalBsmtSF     0.390280
GarageCars      0.381467
FirstFlrSF      0.368825
BsmtQual        0.364779
KitchenQual     0.326194
ExterQual       0.322390
YearRemodAdd    0.315402
MSSubClass      0.287131
GarageFinish    0.265440
FullBath        0.251693
Foundation      0.236115
LotFrontage     0.233334
GarageType      0.226117
FireplaceQu     0.221955
SecondFlrSF     0.200658
Name: MI Scores, dtype: float64
ExterCond           0.020934
KitchenAbvGr        0.017677
BsmtHalfBath        0.013719
LotConfig           0.013637
ScreenPorch         0.012981
PoolArea            0.012831
MiscVal             0.010997
LowQualFinSF        0.009328
Heating             0.007622
Functional          0.006380
MiscFeature         0.004322
Street              0.003381
Condition2          0.003176
RoofMatl            0.002620
PoolQC              0.001370
Utilities           0.000291
Threeseasonporch    0.000000
BsmtFinSF2          0.000000
MoSold              0.000000
LandSlope           0.000000
Name: MI Scores, dtype: float64

### 2. Examine MI 
Do the scores seem reasonable? Do the high scoring features represent things you'd think most people would value in a home? Do you notice any themes in what they describe?

Own Ans: The scores seem reasonable.  
<br>
Correct Ans: Some common themes among most of these features are:<br>
Location: Neighborhood<br>
Size: all of the Area and SF features, and counts like FullBath and GarageCars <br>
Quality: all of the Qual features<br>
Year: YearBuilt and YearRemodAdd<br>
Types: descriptions of features and styles like Foundation and GarageType<br>
These are all the kinds of features you'll commonly see in real-estate listings (like on Zillow). It's good, then, that our mutual information metric scored them highly. On the other hand, the lowest-ranked features seem to mostly represent things that are rare or exceptional in some way, and so wouldn't be relevant to the average home buyer.

In [None]:
# investigate possible interaction effects for the BldgType feature
sns.catplot(x="BldgType", y="SalePrice", data=df, kind="boxen");

Investigate whether BldgType produces a significant interaction with either of the following: <br>
<br>
GrLivArea  # Above ground living area <br>
MoSold     # Month sold <br>
Run the following cell twice: the first time with feature = "GrLivArea" and the next time with feature = "MoSold".

In [None]:
# feature = "GrLivArea"
feature="MoSold"

sns.lmplot(
    x=feature, y="SalePrice", hue="BldgType", col="BldgType",
    data=df, scatter_kws={"edgecolor": 'w'}, col_wrap=3, height=4,
);

**The trend lines being significantly different from one category to the next indicates an interaction effect.**

### 3. Discover Interactions
From the plots, does BldgType seem to exhibit an interaction effect with either GrLivArea or MoSold?

Own Ans: yes.

<br>
Correct Ans: The trend lines within each category of BldgType are clearly very different, indicating an interaction between these features. Since knowing BldgType tells us more about how GrLivArea relates to SalePrice, we should consider including BldgType in our feature set. <br>
The trend lines for MoSold, however, are almost all the same. Knowing BldgType hasn't made MoSold any more informative.
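
The visual check on trend lines can also be done numerically by fitting a least-squares slope within each category and comparing the results. This is a sketch on invented data: the categories "A" and "B", the numbers, and the helper `ols_slope` are all hypothetical stand-ins, not values from the Ames dataset.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Toy rows of (category, feature, target); invented for illustration.
rows = [("A", x, 100 * x) for x in range(1, 6)]
rows += [("B", x, 20 * x + 500) for x in range(1, 6)]

# Collect the feature and target values separately for each category.
groups = defaultdict(lambda: ([], []))
for category, x, y in rows:
    groups[category][0].append(x)
    groups[category][1].append(y)

slopes = {cat: ols_slope(xs, ys) for cat, (xs, ys) in groups.items()}
print(slopes)  # {'A': 100.0, 'B': 20.0}
```

Slopes that differ this much between categories are the numeric counterpart of visibly different trend lines; near-identical slopes would suggest no interaction.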

In [None]:
mi_scores.head(10)

OverallQual     0.581262
Neighborhood    0.569813
GrLivArea       0.496909
YearBuilt       0.437939
GarageArea      0.415014
TotalBsmtSF     0.390280
GarageCars      0.381467
FirstFlrSF      0.368825
BsmtQual        0.364779
KitchenQual     0.326194
Name: MI Scores, dtype: float64

Themes: location, size, and quality. <br>
One strategy: combine these top features with other related features, especially those you've identified as creating interactions.