**This notebook is an exercise in the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course.  You can reference the tutorial at [this link](https://www.kaggle.com/ryanholbrook/creating-features).**

---


# Introduction #

In this exercise you'll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we're creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.

Run this cell to set everything up!

In [1]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor


def score_dataset(X, y, model=None):
    # Create the model per call instead of sharing one default instance
    if model is None:
        model = XGBRegressor()
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")

In [2]:
X.head(2)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YearSold,SaleType,SaleCondition
0,One_Story_1946_and_Newer_All_Styles,Residential_Low_Density,141.0,31770.0,Pave,No_Alley_Access,Slightly_Irregular,Lvl,AllPub,Corner,...,0.0,0.0,No_Pool,No_Fence,,0.0,5,2010,WD,Normal
1,One_Story_1946_and_Newer_All_Styles,Residential_High_Density,80.0,11622.0,Pave,No_Alley_Access,Regular,Lvl,AllPub,Inside,...,120.0,0.0,No_Pool,Minimum_Privacy,,0.0,6,2010,WD,Normal


In [3]:
y.head(2)

0    215000
1    105000
Name: SalePrice, dtype: int64

-------------------------------------------------------------------------------

Let's start with a few mathematical combinations. We'll focus on features describing areas -- having the same units (square feet) makes it easy to combine them in sensible ways. Since we're using XGBoost (a tree-based model), we'll focus on ratios and sums, which tree models tend to have trouble learning from the raw columns.

# 1) Create Mathematical Transforms

Create the following features:

- `LivLotRatio`: the ratio of `GrLivArea` to `LotArea`
- `Spaciousness`: the sum of `FirstFlrSF` and `SecondFlrSF` divided by `TotRmsAbvGrd`
- `TotalOutsideSF`: the sum of `WoodDeckSF`, `OpenPorchSF`, `EnclosedPorch`, `Threeseasonporch`, and `ScreenPorch`
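
As a warm-up, here is a minimal sketch of a ratio and a sum on a made-up two-row frame. The values and the names `toy`, `toy_features`, and `OutsideSF` are illustrative assumptions; only the column names are borrowed from the Ames data.

```python
import pandas as pd

# Hypothetical toy data; not taken from ames.csv
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 900.0],
    "LotArea": [10000.0, 4500.0],
    "WoodDeckSF": [200.0, 0.0],
    "ScreenPorch": [0.0, 120.0],
})

toy_features = pd.DataFrame(index=toy.index)
# Ratio: element-wise division of two columns
toy_features["LivLotRatio"] = toy["GrLivArea"] / toy["LotArea"]
# Sum: add several columns row by row
toy_features["OutsideSF"] = toy[["WoodDeckSF", "ScreenPorch"]].sum(axis=1)

print(toy_features)
```

Selecting a list of columns and calling `.sum(axis=1)` is equivalent to chaining `+`, and may be easier to read when many columns are involved.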

In [7]:
# YOUR CODE HERE
X_1 = pd.DataFrame()  # dataframe to hold new features

X_1["LivLotRatio"] = X['GrLivArea'] / X['LotArea']
X_1["Spaciousness"] = (X['FirstFlrSF'] + X['SecondFlrSF']) / X['TotRmsAbvGrd']
X_1["TotalOutsideSF"] = (
    X['WoodDeckSF'] + X['OpenPorchSF'] + X['EnclosedPorch']
    + X['Threeseasonporch'] + X['ScreenPorch']
)


# Check your answer
q_1.check()

<span style="color:#33cc33">Correct</span>

In [10]:
X_1.head()

Unnamed: 0,LivLotRatio,Spaciousness,TotalOutsideSF
0,0.052125,236.571429,272.0
1,0.077095,179.2,260.0
2,0.093152,221.5,429.0
3,0.189068,263.75,0.0
4,0.117787,271.5,246.0


In [8]:
# Lines below will give you a hint or solution code
#q_1.hint()
q_1.solution()

<span style="color:#33cc99">Solution:</span> 
```python

X_1["LivLotRatio"] = df.GrLivArea / df.LotArea
X_1["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
X_1["TotalOutsideSF"] = df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch + df.Threeseasonporch + df.ScreenPorch

```

-------------------------------------------------------------------------------

If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

```python
# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
```
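
To see what this pattern produces, here is a minimal sketch on a made-up three-row frame. The data and the names `toy`, `dummies`, and `interaction` are illustrative assumptions. `dtype=float` is passed so that the indicator columns are numeric whichever pandas version is installed.

```python
import pandas as pd

# Hypothetical toy data; not taken from ames.csv
toy = pd.DataFrame({
    "BldgType": ["OneFam", "Duplex", "OneFam"],
    "GrLivArea": [1656.0, 896.0, 1329.0],
})

# One indicator column per category, named "Bldg_<category>"
dummies = pd.get_dummies(toy["BldgType"], prefix="Bldg", dtype=float)

# axis=0 aligns the Series with the rows, so each row of indicators
# is scaled by that row's living area
interaction = dummies.mul(toy["GrLivArea"], axis=0)

print(interaction)
```

Each row ends up with the living area in the column for its own building type and zero everywhere else.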

# 2) Interaction with a Categorical

We discovered an interaction between `BldgType` and `GrLivArea` in Exercise 2. Now create their interaction features.

In [11]:
X.GrLivArea.head(2)

0    1656.0
1     896.0
Name: GrLivArea, dtype: float64

In [13]:
# X_2.head()

Unnamed: 0,Bldg_Duplex,Bldg_OneFam,Bldg_Twnhs,Bldg_TwnhsE,Bldg_TwoFmCon
0,0,1,0,0,0
1,0,1,0,0,0
2,0,1,0,0,0
3,0,1,0,0,0
4,0,1,0,0,0


In [19]:
# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
X_2 = pd.get_dummies(X.BldgType, prefix="Bldg")
# Multiply
X_2 = X_2.mul(X.GrLivArea, axis=0)


# Check your answer
q_2.check()

<span style="color:#33cc33">Correct</span>

In [23]:
X_2.sample(100)

Unnamed: 0,Bldg_Duplex,Bldg_OneFam,Bldg_Twnhs,Bldg_TwnhsE,Bldg_TwoFmCon
1494,0.0,1117.0,0.0,0.0,0.0
2070,0.0,998.0,0.0,0.0,0.0
5,0.0,1604.0,0.0,0.0,0.0
99,0.0,1478.0,0.0,0.0,0.0
1726,0.0,0.0,0.0,1405.0,0.0
...,...,...,...,...,...
2627,1556.0,0.0,0.0,0.0,0.0
2417,0.0,2373.0,0.0,0.0,0.0
2357,0.0,1309.0,0.0,0.0,0.0
986,0.0,1287.0,0.0,0.0,0.0


In [16]:
# Lines below will give you a hint or solution code
#q_2.hint()
q_2.solution()

<span style="color:#33cc99">Solution:</span> 
```python

X_2 = pd.get_dummies(df.BldgType, prefix="Bldg")
X_2 = X_2.mul(df.GrLivArea, axis=0)

```

# 3) Count Feature

Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature `PorchTypes` that counts how many of the following are greater than 0.0:

```
WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch
```
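
The counting idiom can be sketched on a small made-up frame first. The values and the names `toy` and `counts` are illustrative assumptions: `gt(0)` turns each cell into `True`/`False`, and summing across `axis=1` counts the `True` values in each row.

```python
import pandas as pd

# Hypothetical toy data; not taken from ames.csv
toy = pd.DataFrame({
    "WoodDeckSF": [210.0, 0.0],
    "OpenPorchSF": [62.0, 0.0],
    "ScreenPorch": [0.0, 0.0],
})

# Boolean mask of "area present", then count True values per row
counts = toy.gt(0).sum(axis=1)

print(counts)
```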

In [24]:
porch_types = ['WoodDeckSF','OpenPorchSF','EnclosedPorch','Threeseasonporch','ScreenPorch']
X[porch_types].head()

Unnamed: 0,WoodDeckSF,OpenPorchSF,EnclosedPorch,Threeseasonporch,ScreenPorch
0,210.0,62.0,0.0,0.0,0.0
1,140.0,0.0,0.0,0.0,120.0
2,393.0,36.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0
4,212.0,34.0,0.0,0.0,0.0


In [25]:
X_3 = pd.DataFrame()

# YOUR CODE HERE
X_3["PorchTypes"] = X[porch_types].gt(0).sum(axis=1)


# Check your answer
q_3.check()

<span style="color:#33cc33">Correct</span>

In [27]:
X_3.head()

Unnamed: 0,PorchTypes
0,2
1,2
2,2
3,0
4,2


In [26]:
# Lines below will give you a hint or solution code
#q_3.hint()
q_3.solution()

<span style="color:#33cc99">Solution:</span> 
```python

X_3 = pd.DataFrame()

X_3["PorchTypes"] = df[[
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "Threeseasonporch",
    "ScreenPorch",
]].gt(0.0).sum(axis=1)

```

# 4) Break Down a Categorical Feature

`MSSubClass` describes the type of a dwelling:

In [29]:
X.MSSubClass.head(2)

0    One_Story_1946_and_Newer_All_Styles
1    One_Story_1946_and_Newer_All_Styles
Name: MSSubClass, dtype: object

In [28]:
df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting `MSSubClass` at the first underscore `_`. (Hint: In the `split` method use an argument `n=1`.)
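
A minimal sketch of the split on a standalone Series, assuming three sample strings copied from the category list above (the names `sample` and `first_word` are illustrative). With `n=1` only the first underscore is used as a split point, and `expand=True` returns a DataFrame whose column `0` holds the leading word.

```python
import pandas as pd

sample = pd.Series([
    "One_Story_1946_and_Newer_All_Styles",
    "Split_Foyer",
    "Duplex_All_Styles_and_Ages",
])

# Split once at the first "_" and keep the left-hand piece
first_word = sample.str.split("_", n=1, expand=True)[0]

print(first_word.tolist())
```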

In [35]:
'One_Story_1946_and_Newer_All_Styles'.split("_", 1)

['One', 'Story_1946_and_Newer_All_Styles']

In [38]:
X['MSSubClass'].str.split("_", n=1, expand=True)

Unnamed: 0,0,1
0,One,Story_1946_and_Newer_All_Styles
1,One,Story_1946_and_Newer_All_Styles
2,One,Story_1946_and_Newer_All_Styles
3,One,Story_1946_and_Newer_All_Styles
4,Two,Story_1946_and_Newer
...,...,...
2925,Split,or_Multilevel
2926,One,Story_1946_and_Newer_All_Styles
2927,Split,Foyer
2928,One,Story_1946_and_Newer_All_Styles


In [39]:
X_4 = pd.DataFrame()

# YOUR CODE HERE
# Pass `n` by keyword; recent pandas versions no longer accept it positionally
X_4['MSClass'] = X['MSSubClass'].str.split("_", n=1, expand=True)[0]

# Check your answer
q_4.check()

<span style="color:#33cc33">Correct</span>

In [36]:
# Lines below will give you a hint or solution code
#q_4.hint()
q_4.solution()

<span style="color:#33cc99">Solution:</span> 
```python

X_4 = pd.DataFrame()

X_4["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]

```

# 5) Use a Grouped Transform

The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature `MedNhbdArea` that describes the *median* of `GrLivArea` grouped on `Neighborhood`.
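
Here is a minimal sketch of a grouped transform on made-up data. The values, the two-neighborhood frame `toy`, and the extra `AreaVsMedian` column are illustrative assumptions and not part of the exercise. Unlike an aggregation, `transform` returns one value per original row, so the result lines up with the frame it came from.

```python
import pandas as pd

# Hypothetical toy data; not taken from ames.csv
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "GrLivArea": [1000.0, 1200.0, 2000.0, 800.0, 1000.0],
})

# Broadcast each neighborhood's median back onto its rows
toy["MedNhbdArea"] = toy.groupby("Neighborhood")["GrLivArea"].transform("median")

# One possible follow-up: how far each home sits from its neighborhood median
toy["AreaVsMedian"] = toy["GrLivArea"] - toy["MedNhbdArea"]

print(toy)
```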

In [45]:
X.Neighborhood.value_counts().head(7)

North_Ames            443
College_Creek         267
Old_Town              239
Edwards               194
Somerset              182
Northridge_Heights    166
Gilbert               165
Name: Neighborhood, dtype: int64

In [41]:
X_5 = pd.DataFrame()

# YOUR CODE HERE
X_5["MedNhbdArea"] = (X.groupby('Neighborhood')['GrLivArea'].transform("median"))

# Check your answer
q_5.check()

<span style="color:#33cc33">Correct</span>

In [42]:
X_5.MedNhbdArea

0       1200.0
1       1200.0
2       1200.0
3       1200.0
4       1560.0
         ...  
2925    1282.0
2926    1282.0
2927    1282.0
2928    1282.0
2929    1282.0
Name: MedNhbdArea, Length: 2930, dtype: float64

In [None]:
# Lines below will give you a hint or solution code
#q_5.hint()
q_5.solution()

Now you've made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:

In [None]:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)

# Keep Going #

[**Untangle spatial relationships**](https://www.kaggle.com/ryanholbrook/clustering-with-k-means) by adding cluster labels to your dataset.

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/221677) to chat with other Learners.*