## Common transformations in Pandas:
1. Mathematical transforms: apply mathematical operations to columns
    - e.g., the Box-Cox transformation, a power transform that reshapes a skewed distribution to be closer to normal
2. Counts
    - aggregate binary/boolean features
    - e.g., count how many values are greater than 0 for each row 
3. Building-up and Breaking-down features
    - split strings into multiple features to extract useful information, or join features into a composite feature
4. Group transforms
    - aggregate information across multiple rows grouped by some category
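
As a hedged illustration of the Box-Cox transformation mentioned in item 1, the sketch below applies `scipy.stats.boxcox` to synthetic right-skewed data. The data and variable names are invented for illustration, and SciPy is an assumption here, since the notebook itself does not import it. Box-Cox requires strictly positive inputs.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive, right-skewed data (illustrative only)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# With lmbda left unset, boxcox estimates the lambda that maximizes the log-likelihood
transformed, fitted_lambda = stats.boxcox(skewed.to_numpy())
transformed = pd.Series(transformed)

print(f"skewness before: {skewed.skew():.2f}")
print(f"skewness after:  {transformed.skew():.2f}")
print(f"fitted lambda:   {fitted_lambda:.2f}")
```

A fitted lambda close to 0 means the transform behaves much like a plain logarithm.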

#### Tips on discovering new features
1. Understand the features. Refer to your dataset's data documentation.
2. Research the problem domain to acquire domain knowledge. 
3. Study previous work - solution write-ups from past Kaggle competitions.
4. Use data visualization.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

accidents = pd.read_csv("../input/fe-course-data/accidents.csv")
autos = pd.read_csv("../input/fe-course-data/autos.csv")
concrete = pd.read_csv("../input/fe-course-data/concrete.csv")
customer = pd.read_csv("../input/fe-course-data/customer.csv")

### 1. Mathematical transforms
- apply arithmetic operations to columns 

In [None]:
autos["stroke_ratio"] = autos.stroke / autos.bore

autos[["stroke", "bore", "stroke_ratio"]].head()

In [None]:
# Engine displacement (total cylinder volume), a measure of power
autos["displacement"] = (
    np.pi * ((0.5 * autos.bore) ** 2) * autos.stroke * autos.num_of_cylinders
)

Data visualization can suggest useful transformations.
For example, the distribution of WindSpeed is highly skewed, so taking its logarithm is an effective way to normalize it.

In [None]:
# If the feature has 0.0 values, use np.log1p (log(1+x)) instead of np.log
accidents["LogWindSpeed"] = accidents.WindSpeed.apply(np.log1p)

# Plot a comparison
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
# `fill` replaces the deprecated `shade` argument in recent seaborn versions
sns.kdeplot(accidents.WindSpeed, fill=True, ax=axs[0])
sns.kdeplot(accidents.LogWindSpeed, fill=True, ax=axs[1]);
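
The plot comparison above can be complemented with a numeric check. The following is a minimal sketch on synthetic data (a made-up stand-in for WindSpeed, not the accidents dataset) that compares skewness before and after `np.log1p`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a skewed, non-negative feature such as wind speed
rng = np.random.default_rng(42)
wind = pd.Series(rng.exponential(scale=5.0, size=2000))
wind.iloc[:50] = 0.0  # include exact zeros, which np.log would map to -inf

log_wind = np.log1p(wind)  # log(1 + x) maps 0.0 to 0.0

print(f"skewness before: {wind.skew():.2f}")
print(f"skewness after:  {log_wind.skew():.2f}")
```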

### 2. Counts
- aggregate binary or boolean features (presence or absence) to create a count 

In [None]:
roadway_features = ["Amenity", "Bump", "Crossing", "GiveWay",
    "Junction", "NoExit", "Railway", "Roundabout", "Station", "Stop",
    "TrafficCalming", "TrafficSignal"]
accidents["RoadwayFeatures"] = accidents[roadway_features].sum(axis=1)

accidents[roadway_features + ["RoadwayFeatures"]].head(10)

In [None]:
# .gt() is a shorthand for “greater than”
# check how many values are greater than 0 for each row 

components = [ "Cement", "BlastFurnaceSlag", "FlyAsh", "Water",
               "Superplasticizer", "CoarseAggregate", "FineAggregate"]
concrete["Components"] = concrete[components].gt(0).sum(axis=1)

concrete[components + ["Components"]].head(10)
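
For readers without the course data, the same counting pattern can be tried on a tiny frame. This is only a sketch: the values and the reduced column set are invented.

```python
import pandas as pd

# Toy concrete-style formulations (values are invented)
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 198.6],
    "FlyAsh": [0.0, 0.0, 132.4],
    "Superplasticizer": [2.5, 0.0, 0.0],
})

# .gt(0) yields a boolean frame; summing across axis=1 counts the True values per row
toy["Components"] = toy[["Cement", "FlyAsh", "Superplasticizer"]].gt(0).sum(axis=1)
print(toy)
```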

### 3. Building-up and Breaking-down features
- useful for entries that contain complex strings (e.g., phone numbers such as (999)-444-1234, street addresses, URLs, dates)
- The str accessor lets you apply string methods like split directly to columns.

In [None]:
customer[["Type", "Level"]] = (  # Create two new features
    customer["Policy"]           # from the Policy feature
    .str                         # through the string accessor
    .split(" ", expand=True)     # by splitting on " "
                                 # and expanding the result into separate columns
)

customer[["Policy", "Type", "Level"]].head(10)

|   | Policy       | Type      | Level |
|---|--------------|-----------|-------|
| 0 | Corporate L3 | Corporate | L3    |
| 1 | Personal L3  | Personal  | L3    |
| 2 | Personal L3  | Personal  | L3    |
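
The phone-number case mentioned at the top of this section could be handled with a regular expression rather than `split`. Here is a minimal sketch using `Series.str.extract`; the data and the column names `AreaCode`, `Exchange`, and `LineNumber` are hypothetical.

```python
import pandas as pd

# Invented phone numbers in the "(999)-444-1234" format
contacts = pd.DataFrame({"Phone": ["(999)-444-1234", "(212)-555-0100"]})

# Each capture group in the pattern becomes one column of the result
parts = contacts["Phone"].str.extract(r"\((\d{3})\)-(\d{3})-(\d{4})")
contacts[["AreaCode", "Exchange", "LineNumber"]] = parts
print(contacts)
```

Note that the extracted pieces are strings; convert them explicitly if numeric values are needed.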

- join simple features into a composite feature

In [None]:
autos["make_and_style"] = autos["make"] + "_" + autos["body_style"]
autos[["make", "body_style", "make_and_style"]].head()

### 4. Group transforms 
- aggregate information across multiple rows grouped by some category
- Use an aggregation function to combine two features: a categorical feature that provides the grouping and another feature whose values you wish to aggregate

In [None]:
# average income by state

customer["AverageIncome"] = (
    customer.groupby("State")  # for each state
    ["Income"]                 # select the income
    .transform("mean")         # and compute its mean
)

customer[["State", "Income", "AverageIncome"]].head(10)

In [None]:
# calculate the frequency with which each state occurs in the dataset:

customer["StateFreq"] = (
    customer.groupby("State")
    ["State"]
    .transform("count")
    / customer.State.count()
)

customer[["State", "StateFreq"]].head(10)
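
As a cross-check on the frequency encoding above, the same values can probably be obtained with `value_counts(normalize=True)` and `map`. The sketch below compares both approaches on an invented `State` column.

```python
import pandas as pd

toy = pd.DataFrame({"State": ["CA", "CA", "WA", "NV", "CA", "WA"]})

# Approach from the cell above: group size divided by the total row count
toy["StateFreq"] = (
    toy.groupby("State")["State"].transform("count") / toy["State"].count()
)

# Alternative: look up each state's relative frequency
toy["StateFreqAlt"] = toy["State"].map(toy["State"].value_counts(normalize=True))
print(toy)
```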

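- When using a training/validation split, create the grouped feature using only the training set and then join it to the validation set, so that no validation data leaks into the feature.
    - The validation set's merge method can do the join.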

In [None]:
# Create splits
df_train = customer.sample(frac=0.5)
df_valid = customer.drop(df_train.index)

# Create the average claim amount by coverage type, on the training set
df_train["AverageClaim"] = df_train.groupby("Coverage")["ClaimAmount"].transform("mean")

# Merge the values into the validation set
df_valid = df_valid.merge(
    df_train[["Coverage", "AverageClaim"]].drop_duplicates(),
    on="Coverage",
    how="left",
)

df_valid[["Coverage", "AverageClaim"]].head(10)
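
One detail worth keeping in mind with this pattern: a category that appears only in the validation set has no training aggregate, so the left merge leaves it as NaN. The sketch below shows this on invented data. The fallback to the overall training mean is an assumption for illustration, not something the course prescribes.

```python
import pandas as pd

train = pd.DataFrame({
    "Coverage": ["Basic", "Basic", "Premium"],
    "ClaimAmount": [100.0, 300.0, 500.0],
})
valid = pd.DataFrame({"Coverage": ["Basic", "Premium", "Extended"]})

# Per-category mean computed from training rows only
avg_claim = (
    train.groupby("Coverage")["ClaimAmount"].mean()
    .rename("AverageClaim")
    .reset_index()
)

# A left merge keeps every validation row; unseen categories get NaN
valid = valid.merge(avg_claim, on="Coverage", how="left")
n_unseen = int(valid["AverageClaim"].isna().sum())

# One possible fallback (an assumption): the overall training mean
valid["AverageClaim"] = valid["AverageClaim"].fillna(train["ClaimAmount"].mean())
print(valid)
```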

#### Tips on Creating Features
It's good to keep in mind your model's own strengths and weaknesses when creating features. Here are some guidelines:
- Linear models learn sums and differences naturally, but can't learn anything more complex.
- Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains.
- Linear models and neural nets generally do better with normalized features. Neural nets especially need features scaled to values not too far from 0. 
- Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so.
- Tree models can learn to approximate almost any combination of features, but when a combination is especially important they can still benefit from having it explicitly created, especially when data is limited.
- Counts are especially helpful for tree models, since these models don't have a natural way of aggregating information across many features at once.

(The guidelines above are taken from the Kaggle online course.)
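
To make the point about ratios concrete, here is a small sketch on synthetic data in which the target is exactly `x1 / x2`. It fits ordinary least squares with NumPy (the helper `r_squared` is hypothetical, written for this example) once on the raw columns and once with the ratio added as a feature.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=500)
x2 = rng.uniform(1.0, 10.0, size=500)
y = x1 / x2  # the target depends only on the ratio


def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return R^2."""
    design = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual.var() / target.var()


r2_raw = r_squared([x1, x2], y)             # raw columns only
r2_ratio = r_squared([x1, x2, x1 / x2], y)  # with the ratio feature added

print(f"R^2 without ratio feature: {r2_raw:.3f}")
print(f"R^2 with ratio feature:    {r2_ratio:.3f}")
```

The linear fit on raw columns leaves a noticeable share of the variance unexplained, while adding the ratio column lets the same model fit the target almost exactly.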

## Exercises

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor


def score_dataset(X, y, model=XGBRegressor()):
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")
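
For reference, the RMSLE metric named in `score_dataset` can be sketched directly in NumPy. The helper `rmsle` and the prices below are invented for illustration; the formula uses `log1p`, which matches how scikit-learn defines mean squared log error.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared log error: sqrt(mean((log1p(pred) - log1p(true))^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Invented sale prices and predictions
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

Because the error is measured on a log scale, it penalizes relative rather than absolute differences in price.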

### 1. Create mathematical transforms

In [None]:
X_1 = pd.DataFrame()  # dataframe to hold the new features

X_1["LivLotRatio"] = df.GrLivArea / df.LotArea
X_1["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
X_1["TotalOutsideSF"] = (
    df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch
    + df.Threeseasonporch + df.ScreenPorch
)
q_1.check()

If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
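
The snippet above uses placeholder column names. A runnable version on a toy frame (invented values, with column names borrowed from the Ames data) might look like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "BldgType": ["OneFam", "Duplex", "OneFam"],
    "GrLivArea": [1500.0, 1800.0, 2100.0],
})

# One indicator column per category
dummies = pd.get_dummies(toy["BldgType"], prefix="Bldg")

# Multiply each indicator column row-by-row by the numeric feature
interaction = dummies.mul(toy["GrLivArea"], axis=0)
print(interaction)
```

Each row keeps its living area only in the column of its own building type and is zero elsewhere.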

### 2. Interaction with a categorical 
- We discovered an interaction between BldgType and GrLivArea in Exercise 2. Now create their interaction features.

In [None]:
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
X_2 = pd.get_dummies(df.BldgType, prefix='Bldg')
# Multiply
X_2 = X_2.mul(df.GrLivArea, axis=0)

# Check your answer
q_2.check()

### 3. Count Feature
- Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature PorchTypes that counts how many of the following are greater than 0.0.

In [None]:
X_3 = pd.DataFrame()

# YOUR CODE HERE
types = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
         "Threeseasonporch", "ScreenPorch"]
X_3["PorchTypes"] = df[types].gt(0).sum(axis=1)

# Check your answer
q_3.check()

### 4. Break down a categorical feature

In [None]:
# describe the type of a dwelling
df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting MSSubClass at the first underscore _. (Hint: In the split method use an argument n=1.)

In [None]:
X_4 = pd.DataFrame()

# YOUR CODE HERE
X_4['MSClass'] = (
    df["MSSubClass"]
    .str
    .split("_", n=1, expand=True))[0]

# Check your answer
q_4.check()
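
To see what the `n=1` argument changes, the sketch below splits a few sample category strings with and without it. The three strings are taken from the MSSubClass values listed above, shortened in one case for readability.

```python
import pandas as pd

classes = pd.Series(
    ["One_Story_1946_and_Newer", "Split_Foyer", "Duplex_All_Styles_and_Ages"]
)

# n=1 stops after the first underscore: column 0 is the first word, column 1 the remainder
first_split = classes.str.split("_", n=1, expand=True)
print(first_split)

# Without n, every underscore produces a split and shorter rows are padded with missing values
full_split = classes.str.split("_", expand=True)
print(full_split.shape)
```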

### 5. Use a grouped transform
- The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature MedNhbdArea that describes the median of GrLivArea grouped on Neighborhood.

In [None]:
X_5 = pd.DataFrame()

# YOUR CODE HERE
X_5["MedNhbdArea"] = (
    df.groupby("Neighborhood")
    ["GrLivArea"]
    .transform("median")
)

# Check your answer
q_5.check()
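
Since the motivation is how a home compares to typical homes nearby, a possible follow-up is to divide each home's area by its neighborhood median. The sketch below uses invented data, and the feature name `AreaVsNhbdMedian` is hypothetical and not part of the exercise.

```python
import pandas as pd

homes = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"],
    "GrLivArea": [1000.0, 1200.0, 2000.0, 1500.0, 1700.0],
})

# Broadcast each neighborhood's median back onto its rows
homes["MedNhbdArea"] = homes.groupby("Neighborhood")["GrLivArea"].transform("median")

# Hypothetical follow-up feature: size relative to the neighborhood median
homes["AreaVsNhbdMedian"] = homes["GrLivArea"] / homes["MedNhbdArea"]
print(homes)
```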

Now you've made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:

In [None]:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)