**This notebook is an exercise in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/data-leakage).**

---


Most people find target leakage very tricky until they've thought about it for a long time.

So, before trying to think about leakage in the housing price example, we'll go through a few examples in other applications. Things will feel more familiar once you come back to a question about house prices.

# Setup

The questions below will give you feedback on your answers. Run the following cell to set up the feedback system.

```python
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex7 import *
print("Setup Complete")
```

Setup Complete


# Step 1: The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:
- The current month (January, February, etc)
- Advertising expenditures in the previous month
- Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
- The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the _leather used_ feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

After you have thought about your answer, check it against the solution below.

*This is tricky, and it depends on details of how data is collected (which is common when thinking about leakage). Would you at the beginning of the month decide how much leather will be used that month? If so, this is ok. But if that is determined during the month, you would not have access to it when you make the prediction. If you have a guess at the beginning of the month, and it is subsequently changed during the month, the actual amount used during the month cannot be used as a feature (because it causes leakage).*

**CHATGPT ANALYSIS:**

Yes, the leather used feature constitutes a source of data leakage in this context. Here's why:

**Understanding Data Leakage**
Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. This usually happens when the model has access to future information or directly related information that would not be available at the time of prediction.

**Analysis of the Feature**

**Leather Used in the Current Month:**
The amount of leather used in the current month is directly related to the number of shoes produced.
The number of shoes produced, in turn, directly affects the number of shoelaces needed.

**Why It Is Leakage**

1-Timing and Availability: When predicting the number of shoelaces needed for a given month, the model should only use information that would be available before or at the beginning of that month. The amount of leather used in the current month would not be known until the end of the month or as the month progresses. Therefore, it provides future information.

2-Direct Relationship: Since the amount of leather used is directly tied to shoe production, including this feature gives the model almost direct access to the target variable (number of shoelaces needed), which would not be available in a real-world scenario when making predictions for the upcoming month.

**Dependency Conditions**
If you are making predictions at the beginning of the month, before any shoes are produced or leather is used, then using the amount of leather used in the current month is a clear case of data leakage.

**Conclusion**
To avoid data leakage, you should exclude the amount of leather used in the current month from the features used to train your model. Instead, focus on features that are genuinely predictive and available at the time of prediction, such as:

- The current month (January, February, etc.)
- Advertising expenditures in the previous month
- Various macroeconomic features (like the unemployment rate) as of the beginning of the current month

By doing so, you ensure that the model's performance reflects a realistic scenario where predictions are made without future information.
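
One way to make the timing argument concrete is to record, for every candidate feature, when its value becomes known, and keep only the features that are known at the moment the forecast is made. The sketch below is illustrative only: the feature names and the `start_of_month` / `end_of_month` labels are assumptions, and whether leather usage really is unknown at the start of the month depends on how Nike plans production.

```python
# Hypothetical availability labels for the features in this step.
# "start_of_month": known when the forecast is made.
# "end_of_month": only known after the month is over (assumed here for leather used).
FEATURE_AVAILABILITY = {
    "current_month": "start_of_month",
    "ad_spend_previous_month": "start_of_month",
    "unemployment_rate_month_start": "start_of_month",
    "leather_used_current_month": "end_of_month",
}


def usable_features(availability, prediction_time="start_of_month"):
    """Return the names of features whose values are known at prediction time."""
    return [name for name, known_at in availability.items() if known_at == prediction_time]


safe_features = usable_features(FEATURE_AVAILABILITY)
print(safe_features)
```

If leather usage were fixed at the beginning of the month, its label would change to `start_of_month` and the feature would pass the filter, which matches the "it depends" answer in the solution.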

# Step 2: Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?

*This could be fine, but it depends on whether they order shoelaces first or leather first. If they order shoelaces first, you won't know how much leather they've ordered when you predict their shoelace needs. If they order leather first, then you'll have that number available when you place your shoelace order, and you should be ok.*

**CHATGPT ANALYSIS:**
Using the amount of leather ordered leading up to a given month as a predictor in your shoelace model could potentially change the situation regarding data leakage, but it does depend on specific details:

**Considerations for Leather Ordered as a Predictor**

1-Timing of the Data: If the amount of leather ordered is known and fixed before the month starts, then using this data should not constitute leakage. This is because you would be using information that is available before the month begins, which is legitimate for making predictions.

2-Correlation with Shoelaces Needed: The amount of leather ordered could still be strongly correlated with the number of shoelaces needed, but as long as this information is known in advance (before the production starts), it is a valid predictor. Companies often base orders on anticipated production, which can be influenced by factors like past demand, forecasts, and planned marketing efforts.

**It Depends On:**

1-Ordering Practices: If Nike places leather orders well in advance (e.g., a month or more before the production month), then the amount of leather ordered is a valid predictor and not a source of leakage. This practice means the order quantities reflect anticipatory measures rather than reactive measures based on the current month's production data.

2-Consistency and Timing: Ensure that the ordering data you use is consistent with what would be available at the time of prediction. For instance, if you are predicting shoelace needs for March, you should use the leather ordered in January or February, assuming these orders are based on projected needs and not adjusted frequently.

**Example Scenarios:**

1-No Leakage: If Nike orders leather in December for use in January, and you are predicting January's shoelace needs using December's order data, there is no leakage because you are using pre-existing information.

2-Potential Leakage: If Nike frequently adjusts leather orders within the current month based on real-time production needs and you use these updated order figures, this could be considered leakage because you are incorporating information that reflects the current month's production dynamics.

**Conclusion**
Using the amount of leather ordered leading up to a given month can be a legitimate predictor without causing data leakage, provided that the orders are placed sufficiently in advance and reflect anticipated rather than real-time production needs. Always ensure that the predictor variables align with the information available at the time predictions are made. This approach helps maintain the integrity and applicability of your predictive model.
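
The condition "known before the forecast is made" can be enforced with a date cutoff. The following is a minimal sketch under assumed data: the order dates, the quantities, and the helper `leather_ordered_before` are hypothetical and only illustrate the idea of ignoring orders placed after the forecast date.

```python
from datetime import date

# Hypothetical leather orders: (date the order was placed, square meters ordered).
leather_orders = [
    (date(2023, 12, 5), 1200.0),
    (date(2023, 12, 20), 800.0),
    (date(2024, 1, 10), 300.0),  # mid-month adjustment, not known on January 1
]


def leather_ordered_before(orders, forecast_date):
    """Total leather ordered strictly before the day the forecast is made."""
    return sum(amount for placed_on, amount in orders if placed_on < forecast_date)


january_forecast_date = date(2024, 1, 1)
feature_value = leather_ordered_before(leather_orders, january_forecast_date)
print(feature_value)  # 2000.0
```

The January 10 order is excluded because it would not exist yet when the January shoelace forecast is produced.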

# Step 3: Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.

The most important features in his model are:
- Current price of the currency
- Amount of the currency sold in the last 24 hours
- Change in the currency price in the last 24 hours
- Change in the currency price in the last 1 hour
- Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over $\$$100 in the last year, and yet his model's average error is less than $\$$1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

*There is no source of leakage here. These features should be available at the moment you want to make a prediction, and they're unlikely to be changed in the training data after the prediction target is determined. But, the way he describes accuracy could be misleading if you aren't careful. If the price moves gradually, today's price will be an accurate predictor of tomorrow's price, but it may not tell you whether it's a good time to invest. For instance, if it is $\$$100 today, a model predicting a price of $\$$100 tomorrow may seem accurate, even if it can't tell you whether the price is going up or down from the current price. A better prediction target would be the change in price over the next day. If you can consistently predict whether the price is about to go up or down (and by how much), you may have a winning investment opportunity.*
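
To see why a small average error proves little, compare it with a "model" that simply repeats today's price. The price series below is made up for illustration, and this baseline is not the friend's model; it only shows that a low error in price level is achievable with no information about direction.

```python
# Hypothetical daily closing prices that drift slowly, in dollars.
prices = [100.0, 100.4, 100.1, 100.6, 100.9, 100.5, 101.0]

# Persistence baseline: predict that tomorrow's price equals today's price.
predictions = prices[:-1]
actuals = prices[1:]

# Mean absolute error of the baseline on the price level.
mae = sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)

# Predicted change relative to the current price: always zero for this baseline,
# so it never says whether the price is about to go up or down.
predicted_changes = [p - today for p, today in zip(predictions, prices[:-1])]

print(round(mae, 2))  # 0.4
print(predicted_changes)
```

The baseline's error is well under $1, yet it gives no buy or sell signal, which is why the change in price is the more informative target.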

**CHATGPT ANALYSIS:**

Your analysis is thorough and touches on critical aspects of model evaluation and potential pitfalls in your friend's approach. Here's a breakdown of your points and some additional commentary:

**Key Points:**

**No Source of Leakage:**

You correctly identify that the features your friend is using should be available at the moment the prediction is made. There’s no indication of data leakage in the traditional sense (where future data influences the model training).

**Misleading Accuracy Description:**

You highlight an essential issue: the model's reported accuracy (average error less than $1) might be misleading. A small error in the predicted price level does not show that the model captures market dynamics, because a model that simply repeats today's price would also score well whenever prices move gradually from one day to the next.

**Gradual Price Movements:**

Your explanation about gradual price movements is spot-on. If the price changes gradually, using the current price as a predictor for the next day can yield low error rates but does not provide actionable insights for investment decisions.

**Additional Considerations:**

**Predicting the Change in Price:**

You propose a better prediction target: the change in price over the next day. This approach aligns more closely with the goal of identifying profitable trading opportunities. By predicting the direction and magnitude of price changes, the model can offer more valuable insights.

**Market Volatility:**

The cryptocurrency market is known for its high volatility. A model predicting such a market must account for sudden price swings and external factors (e.g., news events, regulatory changes) that significantly impact prices.

**Model Validation:**

It's crucial to ensure the model is validated properly using techniques like cross-validation and testing on out-of-sample data to avoid overfitting and provide a realistic assessment of its predictive power.

**Trading Costs and Practical Considerations:**

In real-world trading, factors like transaction costs, market liquidity, and slippage must be considered. Even if a model predicts price changes accurately, these factors can erode potential profits.

**Conclusion:**

Your friend's model may not have a direct source of leakage, but the way accuracy is reported and interpreted is critical. Simply predicting tomorrow's price with a low error does not necessarily translate to profitable trading strategies. Focusing on predicting price changes and ensuring rigorous model validation can provide more realistic insights and help avoid potential pitfalls. This nuanced understanding is vital for making informed investment decisions based on model predictions.

# Step 4: Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea. 
1. Take all surgeries by each surgeon and calculate that surgeon's infection rate.
2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues?
Does it pose any train-test contamination issues?

*This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if you are careful).

You have target leakage if a given patient's outcome contributes to the infection rate for his surgeon, which is then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage if you calculate the surgeon's infection rate by using only the surgeries before the patient we are predicting for. Calculating this for each surgery in your training data may be a little tricky.

You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed, including those from the test-set. The result would be that your model could look very accurate on the test set, even if it wouldn't generalize well to new patients after the model is deployed. This would happen because the surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when seeing new data. So this contamination defeats the purpose of the test set.*
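
The "only the surgeries before the patient we are predicting for" idea can be sketched as a running count per surgeon. The data and the helper `prior_infection_rates` below are hypothetical, and the sketch assumes the rows are already sorted chronologically.

```python
from collections import defaultdict

# Hypothetical surgeries in chronological order: (surgeon_id, infected as 0 or 1).
surgeries = [("A", 0), ("B", 1), ("A", 1), ("A", 0), ("B", 0)]


def prior_infection_rates(rows):
    """For each surgery, the surgeon's infection rate over strictly earlier surgeries.

    Returns None when the surgeon has no earlier surgeries in the data.
    """
    counts = defaultdict(int)
    infections = defaultdict(int)
    features = []
    for surgeon, infected in rows:
        n = counts[surgeon]
        # The feature is computed before this patient's outcome is added.
        features.append(infections[surgeon] / n if n else None)
        counts[surgeon] += 1
        infections[surgeon] += infected
    return features


print(prior_infection_rates(surgeries))  # [None, None, 0.0, 0.5, 1.0]
```

Because each patient's outcome is added to the running totals only after the feature is computed, a patient's own infection status never feeds into their own feature value. How to fill the `None` entries (for example with an overall rate) is a separate modeling choice.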

**CHATGPT ANALYSIS:**

Using the surgeon's average infection rate as a feature in your model can pose both target leakage and train-test contamination issues if not handled carefully. Let's break down the potential problems:

**Target Leakage Issues**
Target leakage occurs when the predictors include information that is present in the training data but would not be available at the time of prediction in a real-world scenario, typically because it is determined after the outcome is known.

1-Post-Surgery Data: If the infection rate is calculated using data that includes the current patient’s outcome, it would mean incorporating future information (the current patient's infection status) into the prediction. This constitutes target leakage because you're using the outcome you're trying to predict to help make the prediction.

2-Dynamic Infection Rates: Even if the infection rate is calculated excluding the current patient, using infection rates that are updated frequently based on recent surgeries can inadvertently include recent outcomes that wouldn't have been known at the time of the patient's surgery.

**Train-Test Contamination Issues**
Train-test contamination occurs when information from the test set leaks into the training set, leading to overly optimistic performance estimates.

1-Shared Information: If the infection rates are calculated from the entire dataset (including both training and test sets), this would mean information from the test set is influencing the training set, contaminating the model evaluation process.

2-Proper Separation: To avoid this, infection rates should be calculated using only the training data, and those same training-derived rates should then be applied to the validation and test rows. The test set should be entirely separate and not influence any feature engineering done on the training set.

**How to Properly Incorporate Surgeon Information**
To avoid these issues, you need to ensure that the infection rate calculation for surgeons is done in a way that avoids both target leakage and train-test contamination:

1-Cross-Validation: Use cross-validation and recompute the infection rates inside each split, using only that split's training folds. This way, the rate assigned to a patient never includes patients from that patient's own fold, thus avoiding leakage.

2-Out-of-Fold Estimates: When training the model, you can use out-of-fold estimates for infection rates. For each patient, the infection rate of their surgeon is calculated based on surgeries in other folds, not including the current fold.

3-Train-Test Split: When you finally evaluate the model on the test set, ensure that the surgeon's infection rates used for the test set are calculated solely from the training set data. This ensures no contamination occurs from the test set back into the training set.
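
The train-test split point above can be sketched as a fit-then-apply pattern: rates are learned from the training rows only and then looked up for the test rows. Everything in the block is illustrative. The data, the function `fit_surgeon_rates`, and the choice to fall back to the overall training rate for unseen surgeons are assumptions, not part of the exercise.

```python
from collections import defaultdict

# Hypothetical rows: (surgeon_id, infected as 0 or 1).
train_rows = [("A", 0), ("A", 1), ("B", 0), ("B", 0)]
test_rows = [("A", 1), ("C", 0)]  # surgeon "C" never appears in the training data


def fit_surgeon_rates(rows):
    """Compute per-surgeon infection rates and the overall rate from the given rows."""
    totals = defaultdict(int)
    infections = defaultdict(int)
    for surgeon, infected in rows:
        totals[surgeon] += 1
        infections[surgeon] += infected
    rates = {surgeon: infections[surgeon] / totals[surgeon] for surgeon in totals}
    overall = sum(infections.values()) / sum(totals.values())
    return rates, overall


# Fit on training data only; test outcomes are never read.
rates, overall_rate = fit_surgeon_rates(train_rows)

# Apply to the test set, falling back to the overall training rate for unseen surgeons.
test_features = [rates.get(surgeon, overall_rate) for surgeon, _ in test_rows]
print(test_features)  # [0.5, 0.25]
```

This handles contamination of the test set only. Rows inside the training set would still need out-of-fold or prior-only rates, as described in the points above, so that a patient's own outcome is not part of their feature.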

**Summary**
Incorporating surgeon infection rates as a feature can be valuable but requires careful handling to avoid target leakage and train-test contamination:

1-Avoid Target Leakage: Ensure that the infection rates used for a patient do not include their own outcome or outcomes from patients who would not have been known at the time of their surgery.

2-Avoid Train-Test Contamination: Ensure that infection rates are calculated from the training data only and then applied unchanged to the validation and test sets, so that no test-set outcome influences any feature used in model training or evaluation.

By taking these precautions, you can effectively use surgeon-specific infection rates without compromising the integrity and reliability of your predictive model.

# Step 5: Housing Prices

You will build a model to predict housing prices.  The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website.  Here are four features that could be used as predictors.
1. Size of the house (in square meters)
2. Average sales price of homes in the same neighborhood
3. Latitude and longitude of the house
4. Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

*2 is the source of target leakage. Here is an analysis for each feature:

1. The size of a house is unlikely to be changed after it is sold (though technically it's possible). But typically this will be available when we need to make a prediction, and the data won't be modified after the home is sold. So it is pretty safe.

2. We don't know the rules for when this is updated. If the field is updated in the raw data after a home was sold, and the home's sale is used to calculate the average, this constitutes a case of target leakage. At an extreme, if only one home is sold in the neighborhood, and it is the home we are trying to predict, then the average will be exactly equal to the value we are trying to predict. In general, for neighborhoods with few sales, the model will perform very well on the training data. But when you apply the model, the home you are predicting won't have been sold yet, so this feature won't work the same as it did in the training data.

3. These don't change, and will be available at the time we want to make a prediction. So there's no risk of target leakage here.

4. This also doesn't change, and it is available at the time we want to make a prediction. So there's no risk of target leakage here.*
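
The extreme case described for feature 2 is easy to reproduce. In the sketch below the sales data and the helper `neighborhood_average` are made up: when a neighborhood has a single sale, the average that includes that sale equals the target exactly, and excluding the home's own sale leaves nothing to average.

```python
# Hypothetical historic sales: (neighborhood, sale price in dollars).
sales = [("north", 300_000), ("north", 340_000), ("south", 500_000)]


def neighborhood_average(rows, neighborhood, exclude_index=None):
    """Average sale price in a neighborhood, optionally leaving out one row by index."""
    prices = [
        price
        for i, (name, price) in enumerate(rows)
        if name == neighborhood and i != exclude_index
    ]
    return sum(prices) / len(prices) if prices else None


# Leaky: the only "south" sale is the home we are trying to predict.
leaky_feature = neighborhood_average(sales, "south")

# Excluding the home's own sale shows there is no usable history for it.
safe_feature = neighborhood_average(sales, "south", exclude_index=2)

print(leaky_feature, sales[2][1])  # 500000.0 500000
print(safe_feature)  # None
```

Leaving out the home's own sale is only a partial fix. For a deployed model, the average should use only sales that closed before the listing appears, which is a date cutoff rather than a simple exclusion.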

**CHATGPT ANALYSIS:**

In the context of building a model to predict housing prices, leakage occurs when information from outside the training dataset is used to create the model. This can result in overly optimistic performance estimates and poor generalization to new data.

**Among the features listed:**

1-Size of the house (in square meters)
2-Average sales price of homes in the same neighborhood
3-Latitude and longitude of the house
4-Whether the house has a basement

The feature most likely to be a source of leakage is:

2. Average sales price of homes in the same neighborhood

**Explanation:**

Size of the house (in square meters): This is a legitimate predictor because it's a direct characteristic of the house whose price you want to predict. There is no leakage here as it is not derived from future information or the target variable.

Average sales price of homes in the same neighborhood: This feature is likely to cause leakage. The average sales price of homes in the same neighborhood can indirectly include information about the target variable (the sale price of the house in question) especially if the average is calculated using recent sales data which could include the sale price of the house being predicted. If historical data is used properly, it might not be direct leakage, but it needs careful handling to ensure it doesn't include information from the target variable.

Latitude and longitude of the house: These are legitimate predictors because they are fixed geographical features that can help in understanding the location value of the house. They do not involve any future information related to the house price.

Whether the house has a basement: This is also a legitimate predictor because it's a specific characteristic of the house and does not contain any future information.

**Conclusion:**

The average sales price of homes in the same neighborhood is most likely to be a source of leakage because it can contain information that includes or correlates highly with the target variable (the price of the house being predicted). Care must be taken to ensure that when this feature is used, it is calculated in a way that does not incorporate the current target house price or future information that would not be available at prediction time.

# Conclusion
Leakage is a hard and subtle issue. You should be proud if you picked up on the issues in these examples.

Now you have the tools to make highly accurate models, and pick up on the most difficult practical problems that arise with applying these models to solve real problems.



---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intermediate-machine-learning/discussion) to chat with other learners.*