**This notebook is an exercise in the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course.  You can reference the tutorial at [this link](https://www.kaggle.com/ryanholbrook/creating-features).**

---


# Introduction #

In this exercise you'll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we're creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.

Run this cell to set everything up!

In [1]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor


def score_dataset(X, y, model=XGBRegressor()):
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")



-------------------------------------------------------------------------------

Let's start with a few mathematical combinations. We'll focus on features describing areas -- having the same units (square feet) makes it easy to combine them in sensible ways. Since we're using XGBoost (a tree-based model), we'll focus on ratios and sums.

# 1) Create Mathematical Transforms

Create the following features:

- `LivLotRatio`: the ratio of `GrLivArea` to `LotArea`
- `Spaciousness`: the sum of `FirstFlrSF` and `SecondFlrSF` divided by `TotRmsAbvGrd`
- `TotalOutsideSF`: the sum of `WoodDeckSF`, `OpenPorchSF`, `EnclosedPorch`, `Threeseasonporch`, and `ScreenPorch`

In [6]:
# YOUR CODE HERE
X_1 = pd.DataFrame()  # dataframe to hold new features

X_1["LivLotRatio"] = df.GrLivArea / df.LotArea
X_1["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
X_1["TotalOutsideSF"] = df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch + df.Threeseasonporch + df.ScreenPorch


# Check your answer
q_1.check()

<span style="color:#33cc33">Correct</span>

In [4]:
# Lines below will give you a hint or solution code
q_1.hint()
q_1.solution()

<span style="color:#3366cc">Hint:</span> Your code should look something like:
```python
X_1["LivLotRatio"] = ____ / ____
X_1["Spaciousness"] = (____ + ____) / ____
X_1["TotalOutsideSF"] = ____ + ____ + ____ + ____ + ____
```


<span style="color:#33cc99">Solution:</span> 
```python

X_1["LivLotRatio"] = df.GrLivArea / df.LotArea
X_1["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
X_1["TotalOutsideSF"] = df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch + df.Threeseasonporch + df.ScreenPorch

```

-------------------------------------------------------------------------------

If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

```python
# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
```

# 2) Interaction with a Categorical

We discovered an interaction between `BldgType` and `GrLivArea` in Exercise 2. Now create their interaction features.
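The pattern above can be tried on a tiny example first. This is a minimal sketch on made-up data: the `toy` frame and its `Kind` and `Area` columns are hypothetical and not part of the Ames dataset. The indicators are cast to float so the result is numeric regardless of which dtype `get_dummies` returns in your pandas version.

```python
import pandas as pd

# Hypothetical toy data, not the Ames dataset
toy = pd.DataFrame({
    "Kind": ["A", "B", "A"],
    "Area": [100.0, 200.0, 300.0],
})

# One indicator column per category, cast to float for a numeric product
dummies = pd.get_dummies(toy["Kind"], prefix="Kind").astype(float)

# Multiply each row of indicators by that row's Area
interaction = dummies.mul(toy["Area"], axis=0)
print(interaction)
```

Each row keeps its `Area` in the column for its own category and gets 0 everywhere else, which lets the model learn a separate effect of area for each category.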

In [7]:
# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
X_2 = pd.get_dummies(df.BldgType, prefix="Bldg")
# Multiply
X_2 = X_2.mul(df.GrLivArea, axis=0)


# Check your answer
q_2.check()

<span style="color:#33cc33">Correct</span>

In [None]:
# Lines below will give you a hint or solution code
#q_2.hint()
#q_2.solution()

# 3) Count Feature

Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature `PorchTypes` that counts how many of the following are greater than 0.0:

```
WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch
```
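One way to build such a count is to turn each column into a boolean with a comparison and then sum across the row. The sketch below uses made-up data: the `toy` frame and its `DeckSF` and `PorchSF` columns are hypothetical stand-ins for the columns listed above.

```python
import pandas as pd

# Hypothetical toy data: two kinds of outdoor area for three homes
toy = pd.DataFrame({
    "DeckSF": [0.0, 120.0, 50.0],
    "PorchSF": [0.0, 0.0, 30.0],
})

# gt(0.0) gives True where the area is positive; summing along axis=1
# counts the True values in each row
counts = toy[["DeckSF", "PorchSF"]].gt(0.0).sum(axis=1)
print(counts.tolist())
```

The first home has no outdoor area of either kind, the second has one kind, and the third has both.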

In [9]:
X_3 = pd.DataFrame()

# YOUR CODE HERE
X_3["PorchTypes"] = df[[
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "Threeseasonporch",
    "ScreenPorch",
]].gt(0.0).sum(axis=1)


# Check your answer
q_3.check()

<span style="color:#33cc33">Correct</span>

In [8]:
# Lines below will give you a hint or solution code
q_3.hint()
q_3.solution()

<span style="color:#3366cc">Hint:</span> Your code should look something like:
```python
X_3 = pd.DataFrame()

X_3["PorchTypes"] = df[[
    ____,
    ____,
    ____,
    ____,
    ____,
]].____.sum(axis=1)
```


<span style="color:#33cc99">Solution:</span> 
```python

X_3 = pd.DataFrame()

X_3["PorchTypes"] = df[[
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "Threeseasonporch",
    "ScreenPorch",
]].gt(0.0).sum(axis=1)

```

# 4) Break Down a Categorical Feature

`MSSubClass` describes the type of a dwelling:

In [10]:
df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting `MSSubClass` at the first underscore `_`. (Hint: In the `split` method use an argument `n=1`.)
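To see what `n=1` does, here is a small sketch on three of the category strings shown above (the Series `s` is just an illustration). With `n=1` the string is split only at the first underscore, and `expand=True` returns the pieces as DataFrame columns.

```python
import pandas as pd

s = pd.Series([
    "One_Story_1946_and_Newer",
    "Split_Foyer",
    "Duplex_All_Styles_and_Ages",
])

# Split once, at the first underscore; column 0 holds the first word,
# column 1 holds everything after it
parts = s.str.split("_", n=1, expand=True)
print(parts)
```

Without `n=1`, every underscore would produce a new column, and shorter strings would be padded with missing values.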

In [12]:
X_4 = pd.DataFrame()

# YOUR CODE HERE
X_4["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]

# Check your answer
q_4.check()

<span style="color:#33cc33">Correct</span>

In [11]:
# Lines below will give you a hint or solution code
q_4.hint()
q_4.solution()

<span style="color:#3366cc">Hint:</span> Your code should look something like:
```python
X_4 = pd.DataFrame()

X_4["MSClass"] = df.____.str.____(____, n=1, expand=True)[____]
```


<span style="color:#33cc99">Solution:</span> 
```python

X_4 = pd.DataFrame()

X_4["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]

```

# 5) Use a Grouped Transform

The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature `MedNhbdArea` that describes the *median* of `GrLivArea` grouped on `Neighborhood`.
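A grouped transform computes one statistic per group and then broadcasts it back to every row of that group, so the result lines up with the original rows. A minimal sketch on made-up data (the `toy` frame and its `Hood` and `Area` columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: five homes in two neighborhoods
toy = pd.DataFrame({
    "Hood": ["N", "N", "N", "S", "S"],
    "Area": [1000, 1200, 2000, 800, 900],
})

# Median area per neighborhood, repeated on every row of that neighborhood
toy["MedArea"] = toy.groupby("Hood")["Area"].transform("median")
print(toy)
```

Every home in `N` gets the same value (1200.0) and every home in `S` gets 850.0. Using `.agg("median")` instead would return one row per neighborhood, which could not be assigned as a column directly.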

In [17]:
X_5 = pd.DataFrame()

# YOUR CODE HERE
X_5["MedNhbdArea"] = (
    df.groupby("Neighborhood")
    ["GrLivArea"]
    .transform("median")
)
# Check your answer
q_5.check()

<span style="color:#33cc33">Correct</span>

In [16]:
# Lines below will give you a hint or solution code
q_5.hint()
q_5.solution()

<span style="color:#3366cc">Hint:</span> Your code should look something like:
```python
X_5 = pd.DataFrame()

X_5["MedNhbdArea"] = df.____("Neighborhood")["____"].transform(____)
```


<span style="color:#33cc99">Solution:</span> 
```python

X_5 = pd.DataFrame()

X_5["MedNhbdArea"] = df.groupby("Neighborhood")["GrLivArea"].transform("median")

```

Now you've made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:

In [18]:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)

0.13865658070461215

```python
def score_dataset(X, y, model=XGBRegressor()):
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score
```
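The `score_dataset` function above evaluates a model with cross-validation. It is written for regression problems where the goal is to *minimize the root mean squared log error (RMSLE)*. Two quantities are involved:

- the root mean squared log error (RMSLE), which the function returns

- the negative mean squared log error (`neg_mean_squared_log_error`), which scikit-learn computes for each fold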

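##### Root mean squared log error (RMSLE)
The Root Mean Squared Logarithmic Error (RMSLE) is a common evaluation metric for regression problems, especially when the target spans a wide range of values. Like the Root Mean Squared Error (RMSE), it measures the gap between predictions and true values, but it takes the logarithm of both (after adding 1) before computing the difference.

RMSLE is defined as:
\begin{align*}
\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(\log\left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2}
\end{align*}
where:
- $n$ is the number of observations
- $y_i$ is the actual value of the $i$-th observation
- $\hat{y}_i$ is the predicted value of the $i$-th observation
- $\log$ is the natural logarithm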
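The formula can be checked numerically. The sketch below writes RMSLE out by hand with NumPy and compares it with the square root of scikit-learn's `mean_squared_log_error`. The helper name `rmsle` and the sample numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up actual and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 180.0, 330.0])


def rmsle(y_true, y_pred):
    """RMSLE written out term by term; log1p(x) computes log(x + 1)."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


manual = rmsle(y_true, y_pred)
from_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(manual, from_sklearn)
```

The two values should agree up to floating-point error, since scikit-learn's metric is the mean squared log error without the square root.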

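RMSLE has the following properties:

Penalizes under-prediction more than over-prediction: unlike RMSE, RMSLE depends on the ratio between the prediction and the true value rather than on their absolute difference. For the same absolute error, a prediction that is too low is therefore penalized more heavily than one that is too high.

Less sensitive to outliers: the log transform compresses large values, so a few very large targets do not dominate the score. This makes RMSLE useful when the target spans a wide range.

Requires non-negative values: because the formula uses $\log(y + 1)$, a value of zero is fine. Negative values are the problem: the logarithm is undefined once $y \le -1$, and scikit-learn's implementation raises an error when targets or predictions fall outside the range it accepts. Check for negative targets and predictions before using RMSLE.

Interpretability: RMSLE is fairly easy to interpret, because for small errors it is roughly the typical relative error between predictions and actual values.

In short, RMSLE may be a suitable metric if your data contains outliers, if you care more about the ratio between prediction and truth than about their difference, or if you want under-prediction to be penalized more heavily. It is commonly used when predicting sales, inventory, prices, or other non-negative targets.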
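Two of these properties are easy to verify with a single observation, where RMSLE reduces to the absolute difference of the logs. This is a small sketch with made-up numbers; the helper name `log_error` is hypothetical.

```python
import numpy as np


def log_error(y_true, y_pred):
    """Absolute log error for one observation (RMSLE with n = 1)."""
    return abs(np.log1p(y_true) - np.log1p(y_pred))


# Same absolute error (50), opposite directions
under = log_error(100.0, 50.0)    # prediction too low
over = log_error(100.0, 150.0)    # prediction too high
print(under > over)

# Same relative error (10%), very different absolute errors
small_scale = log_error(100.0, 110.0)
large_scale = log_error(10_000.0, 11_000.0)
print(small_scale, large_scale)
```

The under-prediction gets the larger error, and the two 10% errors score almost the same even though one is off by 10 and the other by 1,000.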

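##### Negative mean squared log error (neg_mean_squared_log_error)

`neg_mean_squared_log_error` is scikit-learn's name for the mean squared log error (MSLE) with its sign flipped. It is the negated *squared* error, with no square root, so it is not simply RMSLE with a minus sign. The sign flip exists because scikit-learn's scoring utilities follow the convention that a higher score means better performance.

It is defined as:
\begin{align*}
-\text{MSLE} = -\frac{1}{n} \sum_{i=1}^n \left(\log\left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2
\end{align*}
where:
- $n$ is the number of observations.
- $y_i$ is the actual value of the $i$-th observation.
- $\hat{y}_i$ is the predicted value of the $i$-th observation.


Like RMSLE, this score reflects the relative error between predictions and true values rather than the absolute error. Because of the log transform it is less sensitive to outliers, and it penalizes under-prediction more than over-prediction.

When using `cross_val_score` or other cross-validation utilities in scikit-learn, you can pass `scoring="neg_mean_squared_log_error"`. It ranks models the same way RMSLE does but in the opposite direction, so the best model is the one with the highest (least negative) score. To recover RMSLE, flip the sign and take the square root: $\text{RMSLE} = \sqrt{-\text{score}}$.

As with RMSLE, zero targets are fine but negative values are not: they can make the logarithm undefined, and scikit-learn raises an error for values outside the range it accepts.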
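The following sketch shows the sign convention in practice. It uses synthetic data and a `DummyRegressor` that always predicts the training mean, both chosen only to keep the example small and to guarantee non-negative predictions. Neither comes from the exercise.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data with strictly positive targets (an assumption for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.uniform(100.0, 500.0, size=50)

model = DummyRegressor(strategy="mean")
fold_scores = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_squared_log_error",
)
print(fold_scores)  # one value per fold, each zero or negative

# Pooled conversion: average the fold MSLEs, then take the square root
rmsle_pooled = np.sqrt(-fold_scores.mean())

# Alternative: convert each fold to RMSLE first, then average
rmsle_per_fold = np.sqrt(-fold_scores).mean()
print(rmsle_pooled, rmsle_per_fold)
```

The pooled version is what `score_dataset` computes. The two numbers are usually close but not identical, and the pooled one is never smaller, because the square root of an average is at least the average of the square roots.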


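### A walkthrough of each part of the code:

#### Parameters:

- `X`: the feature matrix.
- `y`: the target variable.
- `model`: the model to evaluate. Defaults to an XGBoost regressor (`XGBRegressor()`).

#### Label encoding:

Categorical features are label encoded. Every column with dtype `"category"` or `"object"` is treated as categorical and replaced by the integer codes returned by `factorize()`. The codes are written back into `X`, so the DataFrame passed to the function is modified in place.

#### Cross-validation:

Model performance is estimated with 5-fold cross-validation.
The scoring is `neg_mean_squared_log_error`, the mean squared log error with its sign flipped, because in scikit-learn a higher score conventionally means better performance.

#### Returned score:

The function averages the fold scores, flips the sign back, and takes the square root, which gives the root mean squared log error (RMSLE). Lower is better.


You can use this function to compare different models and feature engineering strategies: call it with different models or feature sets and pick the one with the lowest returned score.

Note that `X` should be a DataFrame and that `y` must not contain negative values. RMSLE uses $\log(y + 1)$, so zeros are fine, but negative targets or predictions can make the logarithm undefined and cause scikit-learn to raise an error.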
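Because `score_dataset` assigns the encoded columns back into `X`, the DataFrame you pass in is changed by the call. If you would rather keep your features untouched, one option is to encode a copy. The sketch below is a hypothetical variant, not the course's function: the name `score_dataset_copy`, the toy data, and the use of `DecisionTreeRegressor` in place of XGBoost are all assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def score_dataset_copy(X, y, model):
    """Hypothetical variant of score_dataset that encodes a copy of X."""
    X = X.copy()  # leave the caller's DataFrame unchanged
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    return np.sqrt(-score.mean())


# Made-up data: one numeric and one categorical feature, positive target
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Size": rng.uniform(50.0, 200.0, size=40),
    "Kind": pd.Categorical(rng.choice(["A", "B"], size=40)),
})
toy_y = 1000.0 + 10.0 * toy_X["Size"]

score = score_dataset_copy(toy_X, toy_y, DecisionTreeRegressor(random_state=0))
print(score)
print(toy_X["Kind"].dtype)  # still categorical after the call
```

The scoring logic is the same as in the original function. Only the `copy()` call and the required `model` argument differ.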

# Keep Going #

[**Untangle spatial relationships**](https://www.kaggle.com/ryanholbrook/clustering-with-k-means) by adding cluster labels to your dataset.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/feature-engineering/discussion) to chat with other learners.*