<a href="https://colab.research.google.com/github/dajley/-Analyzing-Impact-of-Outlier-Detection-Guideline/blob/main/Analyzing_Impact_of_Outlier_Detection_Guideline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Analyzing Impact of Outlier Detection on Predictive Performance of ML Models**

• Identify data distribution (normal, skewed, etc.)

• Detect & replace outliers (IQR, Z-score)

• Build ML models (LR, SVR, etc.)

• Evaluate model performance with and without outliers (R² score, MAE, RMSE, MSE)


**Identifying the distribution of your data** is a key step in exploratory data analysis (EDA), especially when you're preparing for outlier detection, feature engineering, or model selection. It means understanding the shape and characteristics of the data's frequency distribution: where the values center, how widely they spread, and whether the tails are symmetric.

How to Identify Data's Distribution:

*   Descriptive Statistics
    (Mean, Median, Standard Deviation, Skewness, and Kurtosis)

*   Visualization Techniques
*   Interpretation
*   Statistical Tests
*   Comparisons and Considerations
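As a rough illustration of how the descriptive statistics listed above separate a symmetric distribution from a skewed one, the sketch below compares two synthetic samples. The data, seed, and variable names are assumptions made for this example only; they are not part of any real dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Two synthetic samples (assumed for illustration only)
samples = {
    "symmetric": rng.normal(loc=50, scale=5, size=5000),           # roughly normal
    "right_skewed": rng.lognormal(mean=3, sigma=0.8, size=5000),   # long right tail
}

summary = {}
for name, values in samples.items():
    summary[name] = {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values, ddof=1)),
        "skewness": float(stats.skew(values)),
        # Fisher definition: close to 0 for a normal distribution
        "excess_kurtosis": float(stats.kurtosis(values)),
    }
    print(name, {k: round(v, 3) for k, v in summary[name].items()})
```

In the skewed sample the mean sits above the median and the skewness is clearly positive, which is the usual signature of a long right tail.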





For an agricultural dataset that gives crop yield as a function of several features:


*   Understand each feature (categorical variables vs. numerical variables, and the target variable)
*   Explore Numerical Distributions
*   Check for Skewed Distributions
*   Use Boxplot for Outlier Detection
*   Visualize Yield by Category
*   Check for Relationships Between Inputs
*   Modeling Implication Tips









To identify the data distribution of an agriculture dataset with crop yield as the target and several input features (e.g., rainfall, temperature, soil quality, fertilizer use, etc.), you’ll want to combine statistical analysis and visualizations. Here's a structured approach:

**Understand the Dataset**
Identify target and features: Crop yield is your target variable, and the rest are features.

Feature types: Determine which features are numerical, categorical, or ordinal.
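One possible way to separate the target from the features and group the features by type is sketched below. The small DataFrame and its column names (rainfall_mm, temperature_c, region, soil_quality) are hypothetical stand-ins, and treating an ordered pandas categorical as "ordinal" is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative assumptions
df = pd.DataFrame({
    "rainfall_mm": [820.5, 640.0, 910.2, 700.1],
    "temperature_c": [24.1, 27.3, 22.8, 25.0],
    "region": ["north", "south", "north", "east"],
    "soil_quality": pd.Categorical(
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"],
        ordered=True,
    ),
    "crop_yield": [3.2, 2.7, 3.9, 3.0],
})

target = "crop_yield"
features = df.drop(columns=[target])

# Numeric columns versus everything else
numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()

# Ordered categoricals are treated as ordinal features in this sketch
ordinal = [
    col for col in categorical
    if isinstance(features[col].dtype, pd.CategoricalDtype)
    and features[col].dtype.ordered
]

print("numerical:", numerical)
print("categorical:", categorical)
print("ordinal:", ordinal)
```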

**Visualize Distributions**
Use plots to get an intuitive understanding of the data distributions.

For Numerical Features:

*   Histogram or KDE (Kernel Density Estimation) plot: shows how data points are distributed and helps identify skewness, modality, and potential outliers.
*   Boxplot: highlights the spread, median, and outliers.

For Categorical Features:

*   Bar plots to see the frequency of each category.

For Crop Yield:

*   Histogram/KDE: understand its spread and shape.
*   Q-Q plot: check for normality visually.
*   Boxplot grouped by categorical features (e.g., region or crop type).

**Statistical Summaries**

*   df.describe() for numerical features.
*   .value_counts() or pd.crosstab() for categorical ones.
*   Skewness and kurtosis to understand asymmetry and tail heaviness.
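The summaries above could be produced along the following lines. This is a minimal sketch on a hypothetical six-row sample; the column names crop_type and rainfall_mm are assumptions, not columns confirmed by the dataset.

```python
import pandas as pd

# Hypothetical sample; values and column names are illustrative assumptions
df = pd.DataFrame({
    "rainfall_mm": [820, 640, 910, 700, 1500, 760],
    "region": ["north", "south", "north", "east", "south", "north"],
    "crop_type": ["wheat", "maize", "wheat", "rice", "maize", "rice"],
    "crop_yield": [3.2, 2.7, 3.9, 3.0, 6.5, 3.1],
})

# Count, mean, std, min, quartiles, and max for every numeric column
numeric_summary = df.describe()

# Frequency of each category, and a two-way frequency table
region_counts = df["region"].value_counts()
region_by_crop = pd.crosstab(df["region"], df["crop_type"])

# Sample skewness and excess kurtosis per numeric column
shape_stats = df[["rainfall_mm", "crop_yield"]].agg(["skew", "kurt"])

print(numeric_summary)
print(region_counts)
print(region_by_crop)
print(shape_stats)
```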

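**Distribution Fit Tests**
Check if the target (crop yield) follows a common statistical distribution:

*   Shapiro-Wilk Test
*   Kolmogorov-Smirnov Test
*   Anderson-Darling Test

When run against a normal reference distribution, each of these tests the null hypothesis that the data are normally distributed, so a small p-value (or a test statistic above the critical value) is evidence against normality.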
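A hedged sketch of running all three tests with SciPy follows. The crop_yield array here is synthetic and deliberately right-skewed, so it only stands in for the real target. Note that the Kolmogorov-Smirnov test needs a fully specified reference distribution; plugging in the sample mean and standard deviation, as done below, makes its p-value approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic, right-skewed stand-in for the crop yield column (an assumption)
crop_yield = rng.lognormal(mean=1.0, sigma=0.6, size=300)

# Shapiro-Wilk: returns a statistic and a p-value
shapiro_stat, shapiro_p = stats.shapiro(crop_yield)

# Kolmogorov-Smirnov against a normal with the sample's mean and std.
# Because the parameters are estimated from the same data, treat the p-value as approximate.
ks_stat, ks_p = stats.kstest(
    crop_yield, "norm", args=(crop_yield.mean(), crop_yield.std(ddof=1))
)

# Anderson-Darling: compare the statistic with the critical value at the 5% level
ad = stats.anderson(crop_yield, dist="norm")
crit_5pct = ad.critical_values[np.isclose(ad.significance_level, 5.0)][0]
ad_reject_5pct = bool(ad.statistic > crit_5pct)

print(f"Shapiro-Wilk: stat={shapiro_stat:.3f}, p={shapiro_p:.3g}")
print(f"Kolmogorov-Smirnov: stat={ks_stat:.3f}, p={ks_p:.3g}")
print(f"Anderson-Darling: stat={ad.statistic:.3f}, reject at 5%: {ad_reject_5pct}")
```

With large samples these tests flag even small departures from normality, so it usually helps to read them alongside the histogram and Q-Q plot rather than on their own.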

**Correlation & Relationships**
Understand how features relate to crop yield:

*   Correlation matrix (.corr()) for numeric features.
*   Scatter plots: Feature vs Crop Yield.
*   Pairplots (e.g., using Seaborn) for multivariate visualization.
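To illustrate the correlation step, the sketch below ranks features by the strength of their linear correlation with the target. The data is synthetic: yield is constructed to depend on rainfall and not on temperature, purely so the ranking has a known answer. The column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic features (assumed for illustration)
rainfall = rng.normal(800, 120, n)
temperature = rng.normal(25, 3, n)
# By construction, yield depends on rainfall plus noise, not on temperature
crop_yield = 0.004 * rainfall + rng.normal(0, 0.2, n)

df = pd.DataFrame({
    "rainfall_mm": rainfall,
    "temperature_c": temperature,
    "crop_yield": crop_yield,
})

# Pearson correlation matrix over numeric columns
corr = df.corr(numeric_only=True)

# Correlation of each feature with the target, strongest first
yield_corr = (
    corr["crop_yield"]
    .drop("crop_yield")
    .sort_values(key=abs, ascending=False)
)
print(yield_corr)
```

Pearson correlation only captures linear relationships; passing method="spearman" to .corr() is one option when a monotonic but nonlinear relationship is suspected.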

**Feature Interactions & Multivariate Distributions**
Use pairplot or multidimensional KDE plots.

Apply PCA or t-SNE to reduce dimensions and observe structure in 2D.
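A minimal sketch of the PCA option, assuming scikit-learn is available, is shown below. The feature matrix is random synthetic data with two deliberately correlated columns; features are standardized first because PCA is sensitive to scale. t-SNE is not shown here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic stand-in for five numeric features (an assumption for illustration)
X = rng.normal(size=(150, 5))
# Make the second feature strongly correlated with the first
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=150)

# Standardize so that no feature dominates because of its units
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_

print("2D shape:", X_2d.shape)
print("Explained variance ratio:", np.round(explained, 3))
```

The two columns of X_2d can then be passed to a scatter plot, optionally colored by crop yield, to look for clusters or isolated points.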



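In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

# Load the dataset (same file used in the modeling cell below)
df = pd.read_csv('crop_yield_data.csv')

# Histogram and KDE of the target
plt.figure()
sns.histplot(df['crop_yield'], kde=True)
plt.show()

# Boxplot of the target grouped by region (on its own figure so plots don't overlap)
plt.figure()
sns.boxplot(x='region', y='crop_yield', data=df)
plt.show()

# Q-Q plot against a normal distribution
plt.figure()
stats.probplot(df['crop_yield'], dist="norm", plot=plt)
plt.show()

# Shapiro-Wilk test: a small p-value suggests the data is not normally distributed
stat, p = stats.shapiro(df['crop_yield'])
print(f'Shapiro-Wilk Test: stat={stat:.3f}, p={p:.3f}')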


By the end of this process, you should know:

Is your data normally distributed?

Are there any outliers?

How do features relate to crop yield?

Is data transformation (e.g., log-transform) needed?
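To make the last question concrete, the sketch below measures skewness before and after a log transform. The series is synthetic and right-skewed by construction; np.log1p is used as one common choice because it is defined at zero, which is an assumption about the data being non-negative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Synthetic, right-skewed stand-in for crop yield (an assumption)
crop_yield = pd.Series(rng.lognormal(mean=1.0, sigma=0.7, size=1000), name="crop_yield")

skew_before = crop_yield.skew()

# log(1 + x): compresses the long right tail; requires x >= 0
log_yield = np.log1p(crop_yield)
skew_after = log_yield.skew()

print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after log1p: {skew_after:.3f}")
```

If a model is trained on the transformed target, its predictions can be mapped back to the original scale with np.expm1.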

**Detect outliers (using IQR or Z-score)**
Flag the outliers but don't remove them yet; instead, keep track of which rows contain them.
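Flagging without removing might look like the sketch below, which applies both the Z-score rule (|z| > 3) and the IQR rule (1.5 × IQR fences) and stores the result in boolean columns. The data is synthetic with one injected extreme value, and the column names outlier_zscore and outlier_iqr are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Synthetic features (assumed for illustration)
df = pd.DataFrame({
    "rainfall_mm": rng.normal(800, 50, 200),
    "temperature_c": rng.normal(25, 2, 200),
})
df.loc[10, "rainfall_mm"] = 2500.0  # injected extreme value

features = ["rainfall_mm", "temperature_c"]

# Z-score rule: more than 3 standard deviations from the column mean
z_scores = (df[features] - df[features].mean()) / df[features].std()
z_flags = z_scores.abs() > 3

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df[features].quantile(0.25)
q3 = df[features].quantile(0.75)
iqr = q3 - q1
iqr_flags = (df[features] < q1 - 1.5 * iqr) | (df[features] > q3 + 1.5 * iqr)

# Keep every row; only record which rows contain at least one flagged value
flagged = df.copy()
flagged["outlier_zscore"] = z_flags.any(axis=1)
flagged["outlier_iqr"] = iqr_flags.any(axis=1)

print(flagged[["outlier_zscore", "outlier_iqr"]].sum())
```

The two rules can disagree: the Z-score depends on the mean and standard deviation, which the outliers themselves inflate, while the IQR fences are based on quartiles and are less affected.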

**Create three versions of the dataset**
*   Dataset A (Remove outliers): Drop rows with outliers.
*   Dataset B (Median Imputation): Replace outlier values with the median of that feature.
*   Dataset C (Mean Imputation): Replace outlier values with the mean of that feature.

**Train ML models on each dataset**
Split each dataset into train and test sets (same random seed), train the same ML model (e.g., Random Forest), and evaluate performance.

**Compare results**
Evaluate RMSE, MAE, and R² for each dataset and analyze which outlier handling method works best.
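As a small worked example of what these metrics measure, the sketch below computes them by hand with NumPy on three made-up values. The numbers are toy inputs chosen so the arithmetic is easy to follow; they are not results from any model.

```python
import numpy as np

# Toy values for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred                       # [1, 0, -2]

mse = np.mean(residuals ** 2)                     # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                               # same units as the target
mae = np.mean(np.abs(residuals))                  # (1 + 0 + 2) / 3

ss_res = np.sum(residuals ** 2)                   # unexplained variation: 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation: 8
r2 = 1 - ss_res / ss_tot                          # 1 - 5/8

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.3f}")
```

Because RMSE squares the residuals, a single large error moves it more than it moves MAE, which is why the two metrics can rank outlier-handling methods differently.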

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

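# Load dataset
df = pd.read_csv('crop_yield_data.csv')
# One-hot encode categorical features (e.g., region) so the model can use them
X = pd.get_dummies(df.drop(columns=['crop_yield']), dtype=float)
y = df['crop_yield']

# --- 1. Detect Outliers using IQR (numeric features only) ---
num_cols = df.drop(columns=['crop_yield']).select_dtypes(include='number').columns
# Cast to float so median/mean values can be written into integer columns
X[num_cols] = X[num_cols].astype(float)
Q1 = X[num_cols].quantile(0.25)
Q3 = X[num_cols].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Create a mask where True means the value is an outlier;
# encoded categorical columns are never flagged
outlier_mask = ((X[num_cols] < lower_bound) | (X[num_cols] > upper_bound))
outlier_mask = outlier_mask.reindex(columns=X.columns, fill_value=False)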

# --- 2. Create datasets ---

# Dataset A: Remove rows with any outlier
rows_to_remove = outlier_mask.any(axis=1)
X_remove = X.loc[~rows_to_remove]
y_remove = y.loc[~rows_to_remove]

# Dataset B: Replace outliers with median
X_median = X.copy()
for col in X.columns:
    median_val = X[col].median()
    X_median.loc[outlier_mask[col], col] = median_val

# Dataset C: Replace outliers with mean
X_mean = X.copy()
for col in X.columns:
    mean_val = X[col].mean()
    X_mean.loc[outlier_mask[col], col] = mean_val

# --- 3. Train-test split (same random state for comparability) ---
def split_data(X, y):
    return train_test_split(X, y, test_size=0.3, random_state=42)

X_train_orig, X_test_orig, y_train_orig, y_test_orig = split_data(X, y)
X_train_remove, X_test_remove, y_train_remove, y_test_remove = split_data(X_remove, y_remove)
X_train_median, X_test_median, y_train_median, y_test_median = split_data(X_median, y)
X_train_mean, X_test_mean, y_train_mean, y_test_mean = split_data(X_mean, y)

# --- 4. Train and evaluate ---
def train_evaluate(X_train, y_train, X_test, y_test):
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    return rmse, mae, r2

results = {}
results['Original'] = train_evaluate(X_train_orig, y_train_orig, X_test_orig, y_test_orig)
results['Remove Outliers'] = train_evaluate(X_train_remove, y_train_remove, X_test_remove, y_test_remove)
results['Median Imputation'] = train_evaluate(X_train_median, y_train_median, X_test_median, y_test_median)
results['Mean Imputation'] = train_evaluate(X_train_mean, y_train_mean, X_test_mean, y_test_mean)

# --- 5. Print results ---
for method, metrics in results.items():
    print(f"{method}: RMSE={metrics[0]:.3f}, MAE={metrics[1]:.3f}, R2={metrics[2]:.3f}")
