**This notebook is an exercise in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/data-leakage).**

---


Most people find target leakage very tricky until they've thought about it for a long time.

So, before trying to think about leakage in the housing price example, we'll go through a few examples in other applications. Things will feel more familiar once you come back to a question about house prices.

# Setup

The questions below will give you feedback on your answers. Run the following cell to set up the feedback system.

```python
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex7 import *
print("Setup Complete")
```

```
Setup Complete
```


# Step 1: The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:
- The current month (January, February, etc)
- Advertising expenditures in the previous month
- Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
- The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the _leather used_ feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

After you have thought about your answer, check it against the solution below.


```python
# Check your answer (Run this code cell to receive credit!)
q_1.check()
```

<span style="color:#33cc33">Correct:</span> 

This is tricky, and it depends on details of how data is collected (which is common when thinking about leakage). Would you at the beginning of the month decide how much leather will be used that month? If so, this is ok. But if that is determined during the month, you would not have access to it when you make the prediction. If you have a guess at the beginning of the month, and it is subsequently changed during the month, the actual amount used during the month cannot be used as a feature (because it causes leakage).
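The availability question can be made concrete by writing down when each feature's value becomes known and comparing that with the prediction date. The sketch below is only an illustration: the feature names, the prediction date, and the "known on" dates are all assumptions, not part of Nike's actual data.

```python
from datetime import date

# Hypothetical prediction date: the forecast is made on the first day of the month.
prediction_date = date(2024, 3, 1)

# Assumed dates on which each feature's value becomes known (illustrative only).
feature_known_on = {
    "current_month": date(2024, 3, 1),
    "ad_spend_previous_month": date(2024, 2, 29),
    "unemployment_rate_month_start": date(2024, 3, 1),
    "leather_used_current_month": date(2024, 3, 31),  # only known at month end
}

# A feature is safe only if its value exists on or before the prediction date.
safe_features = [
    name for name, known_on in feature_known_on.items()
    if known_on <= prediction_date
]
leaky_features = [name for name in feature_known_on if name not in safe_features]

print("Safe to use:", safe_features)
print("Potential leakage:", leaky_features)
```

If leather usage were instead fixed at the start of the month, its "known on" date would move to the prediction date and the feature would pass the same check.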


# Step 2: Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?


```python
# Check your answer (Run this code cell to receive credit!)
q_2.check()
```

<span style="color:#33cc33">Correct:</span> 

This could be fine, but it depends on whether they order shoelaces first or leather first. If they order shoelaces first, you won't know how much leather they've ordered when you predict their shoelace needs. If they order leather first, then you'll have that number available when you place your shoelace order, and you should be ok.


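To restate: the goal is to predict how many shoelaces Nike needs in a given month, and the candidate feature is the amount of leather ordered before that month. The feature is plausibly useful, because the amount of leather affects how many shoes are produced and therefore how many shoelaces are needed.

The question is whether the value is already known when the prediction is made, and that depends on Nike's ordering process:

- If shoelaces are ordered before leather, the leather order does not exist yet when you predict shoelace demand. Using it as a feature would cause leakage, because the value is unavailable at prediction time.
- If leather is ordered before shoelaces, the leather order is already known when you predict, so using it does not cause leakage.

The general rule is that a model should only use features whose values are available at the moment of prediction.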
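One way to enforce "only orders placed before the prediction" when building the training table is a point-in-time join. The sketch below uses `pandas.merge_asof` for this; the tables, column names, and quantities are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical moments at which a shoelace forecast is made.
predictions = pd.DataFrame(
    {"prediction_time": pd.to_datetime(["2024-02-01", "2024-03-01"])}
)

# Hypothetical leather orders with the time each order was placed.
leather_orders = pd.DataFrame(
    {
        "order_time": pd.to_datetime(["2024-01-20", "2024-02-10", "2024-03-05"]),
        "leather_ordered_kg": [500, 650, 700],
    }
)

# For each prediction, attach the most recent order placed strictly before it.
# Orders placed at or after the prediction time are never attached.
features = pd.merge_asof(
    predictions.sort_values("prediction_time"),
    leather_orders.sort_values("order_time"),
    left_on="prediction_time",
    right_on="order_time",
    direction="backward",
    allow_exact_matches=False,
)

print(features)
```

The order placed on 2024-03-05 is not attached to either prediction, because it did not exist yet when those forecasts were made.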

# Step 3: Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.

The most important features in his model are:
- Current price of the currency
- Amount of the currency sold in the last 24 hours
- Change in the currency price in the last 24 hours
- Change in the currency price in the last 1 hour
- Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over $\$$100 in the last year, and yet his model's average error is less than $\$$1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?


```python
# Check your answer (Run this code cell to receive credit!)
q_3.check()
```

<span style="color:#33cc33">Correct:</span> 

There is no source of leakage here. These features should be available at the moment you want to make a prediction, and they're unlikely to be changed in the training data after the prediction target is determined. But the way he describes accuracy could be misleading if you aren't careful. If the price moves gradually, today's price will be an accurate predictor of tomorrow's price, but it may not tell you whether it's a good time to invest. For instance, if it is $\$$100 today, a model predicting a price of $\$$100 tomorrow may seem accurate, even if it can't tell you whether the price is going up or down from the current price. A better prediction target would be the change in price over the next day. If you can consistently predict whether the price is about to go up or down (and by how much), you may have a winning investment opportunity.
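The point about misleading accuracy can be illustrated with simulated data. The sketch below assumes a toy price series (a random walk with made-up parameters, not real cryptocurrency data) and scores the trivial forecast "tomorrow's price equals today's price".

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy series: one year of daily prices that drift by small random steps.
prices = 100 + np.cumsum(rng.normal(loc=0.0, scale=1.0, size=365))

today = prices[:-1]
tomorrow = prices[1:]

# Naive forecast: tomorrow's price is the same as today's price.
naive_prediction = today

mean_abs_error = np.mean(np.abs(tomorrow - naive_prediction))
price_range = prices.max() - prices.min()

# The naive forecast never predicts a rise, so it gives no trading signal.
predicted_up = naive_prediction > today

print(f"Price range over the year: {price_range:.2f}")
print(f"Mean absolute error of naive forecast: {mean_abs_error:.2f}")
print(f"Days on which the naive forecast says 'buy': {predicted_up.sum()}")
```

The error is small compared with the yearly price range, yet the forecast never says "buy". Evaluating a model on the predicted change in price, as the solution suggests, would expose this.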


# Step 4: Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea. 
1. Take all surgeries by each surgeon and calculate the infection rate for that surgeon.
2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues?
Does it pose any train-test contamination issues?


```python
# Check your answer (Run this code cell to receive credit!)
q_4.check()
```

<span style="color:#33cc33">Correct:</span> 

This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if you are careful).

You have target leakage if a given patient's outcome contributes to the infection rate for his surgeon, which is then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage if you calculate the surgeon's infection rate by using only the surgeries before the patient we are predicting for. Calculating this for each surgery in your training data may be a little tricky.

You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed, including those from the test-set. The result would be that your model could look very accurate on the test set, even if it wouldn't generalize well to new patients after the model is deployed. This would happen because the surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when seeing new data. So this contamination defeats the purpose of the test set.
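One way to compute a surgeon's infection rate from earlier surgeries only is a running total per surgeon that excludes the current row. This is a minimal sketch in pandas, assuming a hypothetical table with `surgeon`, `date`, and `infected` columns; the names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per surgery.
surgeries = pd.DataFrame(
    {
        "surgeon": ["A", "A", "A", "B", "B", "A"],
        "date": pd.to_datetime(
            ["2023-01-01", "2023-01-05", "2023-01-09",
             "2023-01-02", "2023-01-07", "2023-01-12"]
        ),
        "infected": [1, 0, 1, 0, 0, 0],
    }
).sort_values("date").reset_index(drop=True)

grouped = surgeries.groupby("surgeon")["infected"]

# Infections among this surgeon's earlier surgeries (current row excluded).
prior_infections = grouped.cumsum() - surgeries["infected"]
# Number of earlier surgeries by this surgeon.
prior_count = grouped.cumcount()

# Rate over earlier surgeries only; NaN when the surgeon has no history yet.
surgeries["prior_infection_rate"] = prior_infections / prior_count

print(surgeries)
```

Each patient's own outcome never enters their feature value, which is what removes the target leakage. A surgeon's first surgery has no history, so the rate is missing there and would need a fallback such as imputation.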



In other words, both problems come down to which surgeries feed into a surgeon's infection rate.

1. Target leakage: if the rate for a surgeon includes the outcome of the very patient you are predicting for, the feature carries information about the target itself. Computing the rate from only the surgeries that surgeon performed before the patient's surgery avoids this, because none of those outcomes depend on the patient in question.

2. Train-test contamination: if the rate is computed over every surgery in the dataset, including the test set, test-set outcomes help build a feature that the model is then evaluated on. Test-set performance looks too optimistic and no longer reflects how the model will handle new patients. Computing the rate from training data only avoids this.

One practical caveat: a surgeon with only a few surgeries in the training set will have a noisy infection rate, so the feature can be unreliable for that surgeon.

So the idea is workable, provided the infection rate uses only training-set surgeries that took place before the surgery being predicted.
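For the train-test contamination side, the surgeon rates can be computed from the training rows alone and then looked up for the test rows. The sketch below assumes hypothetical `train` and `test` tables; falling back to the overall training rate for surgeons not seen in training is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical training surgeries with known outcomes.
train = pd.DataFrame(
    {"surgeon": ["A", "A", "B", "B"], "infected": [1, 0, 0, 0]}
)
# Hypothetical test patients; surgeon "C" never appears in training.
test = pd.DataFrame({"surgeon": ["A", "C"]})

# Rates are computed from training data only, never from test outcomes.
surgeon_rates = train.groupby("surgeon")["infected"].mean()
overall_rate = train["infected"].mean()

# Look up each test patient's surgeon; unseen surgeons get the overall rate.
test["surgeon_infection_rate"] = (
    test["surgeon"].map(surgeon_rates).fillna(overall_rate)
)

print(test)
```

For the training rows themselves, the feature would still need to exclude each patient's own outcome (for example by using only earlier surgeries) to avoid target leakage.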

# Step 5: Housing Prices

You will build a model to predict housing prices.  The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website.  Here are four features that could be used as predictors.
1. Size of the house (in square meters)
2. Average sales price of homes in the same neighborhood
3. Latitude and longitude of the house
4. Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?


```python
# Fill in the line below with one of 1, 2, 3 or 4.
potential_leakage_feature = 2

# Check your answer
q_5.check()
```

<span style="color:#33cc33">Correct:</span> 

2 is the source of target leakage. Here is an analysis for each feature: 

1. The size of a house is unlikely to be changed after it is sold (though technically it's possible). But typically this will be available when we need to make a prediction, and the data won't be modified after the home is sold. So it is pretty safe. 

2. We don't know the rules for when this is updated. If the field is updated in the raw data after a home was sold, and the home's sale is used to calculate the average, this constitutes a case of target leakage. At an extreme, if only one home is sold in the neighborhood, and it is the home we are trying to predict, then the average will be exactly equal to the value we are trying to predict.  In general, for neighborhoods with few sales, the model will perform very well on the training data.  But when you apply the model, the home you are predicting won't have been sold yet, so this feature won't work the same as it did in the training data. 

3. These don't change, and will be available at the time we want to make a prediction. So there's no risk of target leakage here. 

4. This also doesn't change, and it is available at the time we want to make a prediction. So there's no risk of target leakage here.
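The extreme case described for feature 2 is easy to reproduce. The sketch below uses made-up sales data and compares a neighborhood average that includes each home's own sale with one that leaves it out; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sales: neighborhood N1 has a single sale, N2 has three.
sales = pd.DataFrame(
    {
        "neighborhood": ["N1", "N2", "N2", "N2"],
        "price": [300000, 200000, 250000, 150000],
    }
)

grouped = sales.groupby("neighborhood")["price"]

# Leaky version: the average includes the home's own sale price.
sales["avg_including_own_sale"] = grouped.transform("mean")

# Safer version: leave the home's own sale out of its neighborhood average.
# With a single sale there is nothing left to average, so the result is NaN.
other_sales_total = grouped.transform("sum") - sales["price"]
other_sales_count = grouped.transform("count") - 1
sales["avg_excluding_own_sale"] = other_sales_total / other_sales_count

print(sales)
```

For the single-sale neighborhood, the leaky average is exactly the target, while the leave-one-out version has no value at all. In a deployed model the average would also need to use only sales recorded before the prediction is made.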


```python
q_5.hint()
q_5.solution()
```

<span style="color:#3366cc">Hint:</span> Which of these features might be updated in a database after the house is sold? That's the one to worry about.

<span style="color:#33cc99">Solution:</span> 2 is the source of target leakage. The per-feature analysis is identical to the one shown after the `q_5.check()` output above.

# Conclusion
Leakage is a hard and subtle issue. You should be proud if you picked up on the issues in these examples.

Now you have the tools to build highly accurate models and to spot some of the hardest practical problems that come up when applying them in practice.

There is still a lot of room to build knowledge and experience. Try out a [Competition](https://www.kaggle.com/competitions) or look through our [Datasets](https://kaggle.com/datasets) to practice your new skills.

Again, Congratulations!

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intermediate-machine-learning/discussion) to chat with other learners.*