
# Supervised Classification - Exploratory Data Analysis (EDA)

## Introduction
This notebook performs **Exploratory Data Analysis (EDA)** on a **synthetic supervised classification dataset** generated with scikit-learn's `make_classification`.  
The dataset contains **1,000 samples** with **8 features** (5 informative, 2 redundant, 1 noise) and a binary **target variable (0 or 1)**.

### **EDA Includes:**
- Dataset shape and summary statistics
- Missing value checks
- Correlation heatmap
- Feature distributions by class (histograms with KDE)
- Outlier detection via boxplots
- Empirical Cumulative Distribution Function (CDF) of each feature

---


In [None]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Generate Synthetic Classification Dataset
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, 
                           n_redundant=2, n_classes=2, random_state=42)

# Create a DataFrame
columns = [f'Feature_{i+1}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=columns)
df['Target'] = y  # Add target variable

# Display dataset shape
df.shape


In [None]:

# Summary statistics
df.describe()
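
Because `Target` is binary, its row in the summary statistics is easier to read as class counts and shares. The sketch below is one way to tabulate them; `class_balance` and the demo series are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd

def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class label (hypothetical helper)."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Small illustrative series (made-up labels, not the notebook's dataset)
demo_target = pd.Series([0, 1, 1, 0, 1, 1], name="Target")
balance = class_balance(demo_target)
print(balance)
```

In the notebook this would be called as `class_balance(df['Target'])`. With its default settings, `make_classification` should produce roughly balanced classes, so shares near 0.5 are the expected result.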


In [None]:

# Check for missing values
df.isnull().sum()


In [None]:

# Correlation Matrix Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
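
A heatmap with this many cells can be hard to scan, so it may help to list the strongest pairwise correlations as a table. The following is a minimal sketch, assuming only numeric columns; `top_correlated_pairs` and the demo frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by magnitude but keep the sign in the returned values.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Small illustrative frame (made-up values, not the notebook's dataset)
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # exact multiple of x
    "z": [5, 3, 4, 1, 2],
})
top = top_correlated_pairs(demo, n=3)
print(top)
```

In the notebook, `top_correlated_pairs(df)` would rank feature-to-feature and feature-to-target pairs together, since `Target` is a column of `df`.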


In [None]:

# Histograms & KDE plots
plt.figure(figsize=(12, 8))
for i, col in enumerate(df.columns[:-1], 1):
    plt.subplot(3, 3, i)
    sns.histplot(data=df, x=col, hue="Target", kde=True, bins=30, element="step")
    plt.title(f"Distribution of {col} by Target")
plt.tight_layout()
plt.show()
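
The per-class histograms can be complemented with a numeric summary of how far apart the class means are. The sketch below divides the difference in class means by each feature's overall standard deviation; this is a simple illustrative measure chosen here, not a method the notebook prescribes, and `class_mean_gap` is a hypothetical helper.

```python
import pandas as pd

def class_mean_gap(frame: pd.DataFrame, target: str = "Target") -> pd.Series:
    """Difference in class means (class 1 minus class 0) in units of overall std."""
    features = frame.drop(columns=[target])
    means = features.groupby(frame[target]).mean()
    overall_std = features.std(ddof=1)
    gap = (means.loc[1] - means.loc[0]) / overall_std
    # Largest absolute gap first
    return gap.sort_values(key=lambda s: s.abs(), ascending=False)

# Small illustrative frame (made-up values, not the notebook's dataset)
demo = pd.DataFrame({
    "f1": [0.0, 0.0, 2.0, 2.0],   # class means differ
    "f2": [1.0, 3.0, 1.0, 3.0],   # class means are identical
    "Target": [0, 0, 1, 1],
})
gap = class_mean_gap(demo)
print(gap)
```

Applied as `class_mean_gap(df)`, features near the top of the result are the ones whose histograms should show the clearest shift between classes. The measure only captures differences in location, so a feature can still separate the classes through its shape or through interactions with other features.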


In [None]:

# Boxplots for Outlier Detection
plt.figure(figsize=(12, 6))
sns.boxplot(data=df.iloc[:, :-1])
plt.xticks(rotation=45)
plt.title("Boxplot of Features to Detect Outliers")
plt.show()
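
The points drawn beyond the whiskers can also be counted. The sketch below applies the 1.5 × IQR rule, which is the default whisker length in seaborn's `boxplot`; `iqr_outlier_counts` and the demo frame are hypothetical and introduced only for illustration.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Small illustrative frame (made-up values, not the notebook's dataset)
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],  # 100.0 lies far above the upper bound
    "b": [1.0, 2.0, 3.0, 4.0, 5.0],    # no value outside the bounds
})
counts = iqr_outlier_counts(demo)
print(counts)
```

In the notebook, `iqr_outlier_counts(df.iloc[:, :-1])` would give a per-feature count that should roughly match the points shown beyond the whiskers.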


In [None]:

# Cumulative Distribution Function (CDF)
plt.figure(figsize=(8, 5))
for col in df.columns[:-1]:
    sorted_data = np.sort(df[col])
    # Empirical CDF: the i-th smallest value (1-based) maps to i / n, so the curve ends at 1
    cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    plt.plot(sorted_data, cdf, label=col)

plt.title("Cumulative Distribution Function (CDF)")
plt.xlabel("Feature Value")
plt.ylabel("Cumulative Probability")
plt.legend()
plt.show()
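
The empirical CDF computation can be wrapped in a small reusable function. The sketch below uses one common convention, in which the i-th smallest of n values is assigned the cumulative probability i / n; `ecdf` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    # The i-th smallest value (1-based) gets probability i / n.
    y = np.arange(1, n + 1) / n
    return x, y

# Small illustrative input (made-up values, not the notebook's dataset)
x, y = ecdf([3.0, 1.0, 2.0, 4.0])

# Smallest value at which at least half of the data has been reached
median_like = x[np.searchsorted(y, 0.5)]
print(x, y, median_like)
```

For a single feature in the notebook, `ecdf(df['Feature_1'])` returns the arrays that can be passed directly to `plt.plot`.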



## **Conclusion**
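- The dataset has **1,000 samples and 8 features**, plus a binary target column.
- No missing values were found, as expected for data generated by `make_classification`.
- The correlation heatmap shows the pairwise linear relationships among the features and the target. Because two features were generated as linear combinations of the informative ones (`n_redundant=2`), some noticeable feature-to-feature correlations are expected.
- The boxplots flag points beyond the whiskers (1.5 × IQR from the quartiles) as potential outliers.
- The per-class histograms and KDE curves show how each feature's distribution differs between the two classes.
- The empirical CDFs allow the location and spread of all features to be compared on a single plot.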

This completes our **Exploratory Data Analysis (EDA)** for this classification dataset. 🚀
