

---
### Kernel Density Estimation (KDE) Plots for Model Evaluation

**Introduction**
Kernel Density Estimation (KDE) plots visualize a data distribution by estimating its probability density function (PDF). In regression analysis they are particularly useful for comparing the distribution of actual values with the distribution of predicted values. Since Seaborn deprecated `distplot`, `kdeplot()` is the function to use for drawing them when assessing model performance.
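
To make "estimating the PDF" concrete, here is a minimal sketch of a one-dimensional Gaussian KDE written with NumPy only. The helper name `gaussian_kde_1d`, the sample data, and the fixed bandwidth of 0.5 are illustrative assumptions, not part of the original example. Plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`.

    f_hat(x) = 1 / (n * h) * sum_i K((x - x_i) / h),
    where K is the standard normal PDF and h is the bandwidth.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample, shape (len(grid), n)
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_values = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel_values.sum(axis=1) / (samples.size * bandwidth)

# Hypothetical sample data, for illustration only
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=200)

# Extend the grid well past the data so the estimate can decay to zero
grid = np.linspace(samples.min() - 5.0, samples.max() + 5.0, 1000)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density should integrate to approximately 1 (simple Riemann sum)
dx = grid[1] - grid[0]
area = float(np.sum(density) * dx)
print(f"Area under the estimated density: {area:.3f}")
```

Each sample contributes one small bell curve, and the estimate is their average. That averaging is why the resulting curve is smooth.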

---
### Why Use KDE Plots?

KDE plots are beneficial in model evaluation for the following reasons:
* They provide a smooth approximation of the data distribution.
* They help compare the true vs. predicted distributions effectively.
* Unlike histograms, KDE plots do not depend on bin size or bin placement, although their smoothness does depend on the chosen bandwidth.
* They can highlight deviations between observed and predicted values.
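
The point about bins and bandwidth can be sketched as follows, assuming SciPy is available. The sample data, the bin counts, and the two `bw_method` factors are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample data, for illustration only
rng = np.random.default_rng(1)
values = rng.normal(loc=0.0, scale=1.0, size=300)

# The same data binned two ways gives two different-looking histograms
counts_coarse, _ = np.histogram(values, bins=5, density=True)
counts_fine, _ = np.histogram(values, bins=40, density=True)
print("Tallest bar with 5 bins: ", counts_coarse.max())
print("Tallest bar with 40 bins:", counts_fine.max())

# A KDE has no bins, but its bandwidth plays a comparable role:
# a scalar bw_method is used directly as the bandwidth factor
grid = np.linspace(-4.0, 4.0, 400)
kde_narrow = gaussian_kde(values, bw_method=0.1)(grid)
kde_wide = gaussian_kde(values, bw_method=1.0)(grid)
print("KDE peak with a narrow bandwidth:", kde_narrow.max())
print("KDE peak with a wide bandwidth:  ", kde_wide.max())
```

A wide bandwidth flattens the curve and can hide real structure. A narrow one produces a spiky curve that follows individual samples.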

---
### Implementing KDE Plots in Python

We will use Seaborn's `kdeplot()` function to draw the KDE plots. Drawing the actual and predicted values on the same axes gives a smooth estimate of each distribution and makes differences between them easy to see.

---
### Example: Evaluating a Regression Model
The following code demonstrates how to train a simple linear regression model, generate predictions, and visualize the actual vs. predicted distributions using KDE plots. We generate synthetic data that follows a linear relationship with added noise, split it into training and testing sets, fit the model on the training set, and then evaluate its predictions on the test set.

---
**Python Code Implementation (in separate Colab cells)**

**Cell 1: Import Libraries**

In [None]:
import numpy as npy
import pandas as pds
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error # Not used in plot, but relevant for model evaluation

---
**Cell 2: Generating Sample Data**

In [None]:
# Generating Sample Data
npy.random.seed(42) # for reproducibility
x_feature = npy.random.rand(100) * 10 # Feature: 100 random values between 0 and 10
y_target = 3 * x_feature + npy.random.normal(0, 3, 100)  # Target: Linear relation with x_feature + Gaussian noise (mean=0, std_dev=3)
data = pds.DataFrame({'X': x_feature, 'Y': y_target})

print("First 5 rows of the generated data:")
print(data.head())

---
**Cell 3: Splitting Data**

In [None]:
# Splitting Data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data[['X']],  # Feature(s) - note the double brackets to keep it as a DataFrame
    data['Y'],    # Target variable
    test_size=0.2, # 20% of data will be used for testing
    random_state=42 # Ensures the split is the same every time, for reproducibility
)

print(f"\nShape of training features (X_train): {X_train.shape}")
print(f"Shape of testing features (X_test): {X_test.shape}")
print(f"Shape of training target (y_train): {y_train.shape}")
print(f"Shape of testing target (y_test): {y_test.shape}")

---
**Cell 4: Training a Model and Making Predictions**

In [None]:
# Training a Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train) # Train the model using the training data

# Generating predictions on the test set
y_pred = model.predict(X_test)

print("\nFirst 5 actual test values (y_test):")
print(y_test.head().values)
print("First 5 predicted values (y_pred):")
print(y_pred[:5])

---
**Cell 5: Plotting KDE for Observed vs. Predicted Values**

In [None]:
# Plotting KDE for Observed (Actual) vs. Predicted Values
plt.figure(figsize=(10, 6)) # Set the figure size for better readability

# KDE plot for actual values
sns.kdeplot(y_test, label='Actual Values', fill=True, alpha=0.5, color='blue', linewidth=2)

# KDE plot for predicted values
sns.kdeplot(y_pred, label='Predicted Values', fill=True, alpha=0.5, color='red', linewidth=2)

plt.xlabel('Target Variable Value') # Label for the x-axis
plt.ylabel('Density') # Label for the y-axis
plt.title('KDE Plot: Actual vs. Predicted Values Distribution') # Title of the plot
plt.legend() # Show the legend to identify the curves
plt.grid(True, linestyle='--', alpha=0.7) # Add a grid for easier reading
plt.show() # Display the plot

---
**Output**
The resulting Kernel Density Estimation (KDE) plot compares the distribution of the actual values (blue) with that of the predicted values (red) from the linear regression model.
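
A KDE plot compares distributions, not individual predictions, so it is usually read alongside numeric metrics. The sketch below is a self-contained variant of the example that reports MAE and R-squared on the test set. It draws its data with `np.random.default_rng`, so the numbers will not match the cells above exactly. `r2_score` is an addition that the original cells do not import.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data in the same spirit as the example (not the identical draw)
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=100)
y = 3.0 * x + rng.normal(0.0, 3.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Pointwise error metrics that the density comparison cannot provide
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.2f}")
```

Two distributions can look alike even when individual predictions are paired with the wrong targets. A low MAE together with well-matched curves is therefore stronger evidence than either on its own.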

---
### Interpretation of the KDE Plot
**Overlap Between Distributions:** The two curves have a significant overlap, indicating that the model has captured the general distribution of the actual target values reasonably well. However, the predicted values slightly deviate from the actual values in some regions.
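
The amount of overlap can also be expressed as a number. The sketch below approximates the area shared by two estimated densities, assuming SciPy's `gaussian_kde` with its default bandwidth. The helper `kde_overlap` and the sample arrays are hypothetical and only illustrate the idea.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_overlap(a, b, num_points=512):
    """Approximate the area shared by the KDEs of two 1-D samples.

    Values near 1 mean nearly identical curves; values near 0 mean almost no overlap.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pad the grid so both estimated densities can decay toward zero
    pad = 3.0 * max(a.std(), b.std())
    grid = np.linspace(min(a.min(), b.min()) - pad, max(a.max(), b.max()) + pad, num_points)
    density_a = gaussian_kde(a)(grid)
    density_b = gaussian_kde(b)(grid)
    dx = grid[1] - grid[0]
    # Integrate the pointwise minimum of the two curves (Riemann sum)
    return float(np.sum(np.minimum(density_a, density_b)) * dx)

# Hypothetical data: one set of close predictions, one badly shifted set
rng = np.random.default_rng(3)
actual = rng.normal(15.0, 9.0, size=200)
close_pred = actual + rng.normal(0.0, 1.0, size=200)
far_pred = actual + 30.0

print("Overlap with close predictions:  ", kde_overlap(actual, close_pred))
print("Overlap with shifted predictions:", kde_overlap(actual, far_pred))
```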

**Peak Differences (Mode Shifts):** The blue (actual) curve peaks slightly higher than the red curve, meaning that the actual values are more concentrated around certain values. The red (predicted) curve has a second peak, suggesting that the model may be slightly misestimating certain ranges of the target variable.

**Spread of the Distributions:** The actual values (blue) have a wider spread than the predicted values (red). Some narrowing is expected, because the predictions lie on the fitted line and do not reproduce the noise in the observations. A much narrower predicted curve would suggest that the model is underestimating the variance of the target (a sign of over-smoothing or bias).
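
The difference in spread can be checked numerically. The sketch below fits a least-squares line in-sample with `np.polyfit` (no train/test split, unlike the example above) on hypothetical data. In that setting the variance of the actual values equals the variance of the fitted values plus the variance of the residuals, so the fitted values cannot be more spread out than the actual ones.

```python
import numpy as np

# Hypothetical data with the same structure as the example: y = 3x + noise
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=500)
y = 3.0 * x + rng.normal(0.0, 3.0, size=500)

# Ordinary least-squares line, fitted and evaluated on the same data
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

print(f"Std of actual values: {y.std():.2f}")
print(f"Std of fitted values: {fitted.std():.2f}")
print(f"Std of residuals:     {residuals.std():.2f}")
```

On a held-out test set the decomposition holds only approximately, but the same tendency usually shows up as a narrower predicted curve.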

**Tails of the Distributions:** The tails of the predicted values closely follow those of the actual values, meaning the model does not generate extreme predictions beyond what was observed in the data. If there were a significant mismatch in the tails, it could indicate that the model struggles with extreme cases.

---
### Conclusion
KDE plots are a powerful visualization tool for comparing the distribution of predicted values with that of actual values in regression analysis. Replacing the deprecated `distplot` with `kdeplot()` keeps model evaluation workflows compatible with current Seaborn releases.

---
### Practice Questions (50)

**Conceptual Understanding of KDE**

1.  What does KDE stand for?
2.  What is the primary purpose of a Kernel Density Estimation plot?
3.  What does a KDE plot visually represent (e.g., PMF, PDF, CDF)?
4.  How does a KDE plot differ visually from a histogram?
5.  What is a "kernel" in the context of KDE? (General understanding)
6.  What parameter in KDE can affect the "smoothness" of the resulting curve? (Hint: bandwidth)
7.  Why is KDE considered a non-parametric method for density estimation?
8.  Can KDE plots be used for both univariate and bivariate data?
9.  What does the area under a KDE curve represent?
10. If a KDE curve has a high peak at a certain value, what does it signify?

**KDE Plots in Model Evaluation**

11. Why are KDE plots useful for evaluating regression models?
12. When comparing actual vs. predicted values using KDE, what would an ideal plot look like?
13. What does it mean if the KDE plot of predicted values is much narrower than that of actual values?
14. What might it indicate if the KDE plot of predicted values has its peak shifted to the right of the actual values' peak?
15. If the tails of the predicted values' KDE plot are much "fatter" than the actual values' KDE, what could this imply about the model?
16. Can KDE plots help identify if a model is systematically biased in its predictions? How?
17. Why are KDE plots generally preferred over histograms for comparing distributions in model evaluation?
18. What was the Seaborn function often used for plotting distributions (including KDE) before `kdeplot` became the primary recommendation?
19. If two KDE curves (actual vs. predicted) have very little overlap, what does this suggest about the model's performance?
20. How can KDE plots complement other numerical evaluation metrics like Mean Absolute Error (MAE) or R-squared?

**Python Implementation (Seaborn `kdeplot`)**

21. Which Python library is commonly used for creating KDE plots?
22. What is the primary Seaborn function used to generate a KDE plot?
23. What does the `data` parameter in `sns.kdeplot(data=...)` typically expect?
24. How can you plot two KDEs (e.g., for `y_actual` and `y_predicted`) on the same axes using Seaborn?
25. What does the `fill=True` argument do in `sns.kdeplot()`?
26. How do you add a legend to a Seaborn plot that has multiple KDEs?
27. What `matplotlib.pyplot` function is used to display the plot after defining it?
28. How can you change the color of a KDE plot in Seaborn?
29. What parameter in `sns.kdeplot()` might you adjust to change the smoothness of the density estimate?
30. If you have a Pandas DataFrame `df` with a column 'values', how would you create a KDE plot for this column?

**Interpretation of KDE Plots (Scenarios)**

*Imagine you are looking at a KDE plot comparing 'Actual Values' (blue) and 'Predicted Values' (red) from a regression model.*

31. If the red curve is almost identical to the blue curve, what does this imply?
32. If the blue curve shows two distinct peaks (bimodal) but the red curve shows only one peak, what does this suggest?
33. If the red curve's peak is significantly lower and wider than the blue curve's peak, but they are centered at the same value, what could this mean?
34. If the red curve is consistently to the left of the blue curve, what type of error might the model be making?
35. If both curves have similar shapes and peaks, but the red curve's tails are much shorter (cut off) compared to the blue curve, what is a potential issue?
36. You observe that the predicted values' KDE (red) has a small, unexpected bump in an area where the actual values' KDE (blue) is near zero. What might this indicate?
37. If the KDE for predicted values is perfectly symmetrical around zero, while the actual values are skewed, what does it say about the model?
38. How would you interpret a situation where the actual values show a wide spread, but the predicted values are tightly clustered, resulting in a very narrow KDE?
39. If a model is improved, how would you expect the KDE plot of its predictions (compared to actuals) to change?
40. Can a KDE plot tell you if your model is overfitting? How might it give clues? (Hint: Compare training set predictions vs. test set predictions if possible, or consider the narrowness of test predictions).

**General & Comparative Questions**

41. Besides regression, what other types of model evaluation or data analysis might benefit from KDE plots?
42. What is one advantage of KDE plots over a simple scatter plot of actual vs. predicted values?
43. What is one disadvantage or limitation of using KDE plots for model evaluation?
44. How does sample size affect the reliability or appearance of a KDE plot?
45. True or False: KDE plots can directly provide a numerical score of model accuracy.
46. In the provided example code, what is the purpose of `npy.random.seed(42)`?
47. What does `test_size=0.2` in `train_test_split` signify?
48. If `fill=True` is used for two overlapping KDEs, what argument can help distinguish them better? (Hint: `alpha`)
49. Why is it important to evaluate a model on a separate *test set* rather than the *training set* when plotting these KDEs for performance assessment?
50. If the KDE plots for actual and predicted values are very different, what might be your next steps in the model improvement process?
