# 🌍 Comprehensive AQI Anomaly Detection & Explainable AI System

## Project Overview
This notebook implements a complete AI/ML system for:
- **Real-time air quality anomaly detection** using ensemble methods
- **Predictive modeling** with advanced machine learning algorithms
- **Explainable AI** using SHAP and LIME frameworks
- **Interactive visualizations** and production-ready pipelines

**Dataset:** Air Quality Index (AQI) data from 26 Indian cities (2015-2020)

**Key Features:**
- 🔍 Multi-algorithm anomaly detection (Isolation Forest, LOF, Z-score)
- 📊 Advanced time series analysis and feature engineering
- 🤖 Ensemble predictive modeling (Random Forest, XGBoost, Voting Regressor)
- 🔬 Explainable AI with SHAP and LIME
- 📈 Comprehensive model evaluation and comparison
- 🚀 Production-ready prediction pipeline
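
To make the multi-algorithm idea concrete, here is a minimal sketch of how Isolation Forest, LOF, and a Z-score rule could be combined by majority vote. The function name `ensemble_anomaly_flags`, the thresholds, the "at least two of three" voting rule, and the synthetic data are all assumptions for illustration; the notebook's actual combination logic may differ.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def ensemble_anomaly_flags(X, contamination=0.05, z_threshold=3.0, min_votes=2):
    """Flag rows that at least `min_votes` of the three detectors call anomalous.

    Hypothetical helper: the parameter defaults are illustrative, not tuned.
    """
    X = np.asarray(X, dtype=float)

    # Isolation Forest: fit_predict returns -1 for outliers, 1 for inliers
    iso = IsolationForest(contamination=contamination, random_state=42)
    iso_flag = iso.fit_predict(X) == -1

    # Local Outlier Factor uses the same -1 / 1 convention
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    lof_flag = lof.fit_predict(X) == -1

    # Z-score rule: anomalous if any feature lies beyond the threshold
    z_flag = (np.abs(zscore(X, axis=0)) > z_threshold).any(axis=1)

    votes = iso_flag.astype(int) + lof_flag.astype(int) + z_flag.astype(int)
    return votes >= min_votes


# Synthetic stand-in for pollutant readings (not the real AQI data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=5.0, size=(200, 2))
X_demo[:3] = [[150.0, 150.0], [160.0, 140.0], [145.0, 155.0]]  # injected extremes

flags = ensemble_anomaly_flags(X_demo)
print(f"Flagged {flags.sum()} of {len(flags)} rows")
```

Requiring agreement between detectors tends to reduce false positives compared with any single method, at the cost of missing anomalies that only one detector is sensitive to.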

---

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Machine Learning - Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Machine Learning - Models
from sklearn.ensemble import IsolationForest, RandomForestRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Machine Learning - Evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Explainable AI
import shap
from lime import lime_tabular

# Statistical analysis
from scipy import stats
from scipy.stats import zscore
from statsmodels.tsa.seasonal import seasonal_decompose

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print("🤖 Scikit-learn ready for ML tasks")

In [None]:
# Statistical Summary
print("="*80)
print("📊 STATISTICAL SUMMARY")
print("="*80)

# Numerical columns summary
df.describe().T.style.background_gradient(cmap='YlOrRd')

In [None]:
# Dataset Information
print("="*80)
print("📋 DATASET INFORMATION")
print("="*80)

print(f"\n{'Column Name':<20} {'Data Type':<15} {'Non-Null Count':<15} {'Null Count':<15} {'Null %'}")
print("-"*85)

for col in df.columns:
    non_null = df[col].notna().sum()
    null_count = df[col].isna().sum()
    null_pct = (null_count / len(df)) * 100
    dtype = str(df[col].dtype)
    print(f"{col:<20} {dtype:<15} {non_null:<15} {null_count:<15} {null_pct:.2f}%")

print(f"\n💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"🔢 Total Data Points: {df.shape[0] * df.shape[1]:,}")

In [None]:
# Load the AQI dataset
dataset_path = '../data/dataset.csv'

try:
    df = pd.read_csv(dataset_path)
    print("✅ Dataset loaded successfully!")
except FileNotFoundError:
    print("❌ Dataset not found. Trying alternative path...")
    dataset_path = '../../dataset.csv'
    df = pd.read_csv(dataset_path)
    print("✅ Dataset loaded from alternative path!")

# Report size and date coverage regardless of which path was used
print(f"📋 Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"📅 Date range: {df['Date'].min()} to {df['Date'].max()}")

# Display first few rows
print("\n" + "="*80)
print("📊 DATASET PREVIEW (First 5 rows)")
print("="*80)
df.head()
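
One caveat about the date range printed by the loading cell: `pd.read_csv` reads `Date` as text by default, so `min()` and `max()` compare strings, which only matches chronological order for ISO-style formats such as `YYYY-MM-DD`. The sketch below shows one way to parse the column first. The three-row sample, its city names, and its values are hypothetical placeholders, not rows from the real dataset.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV; actual columns may differ
sample_csv = io.StringIO(
    "City,Date,AQI\n"
    "CityA,2015-01-01,120\n"
    "CityA,2020-07-01,85\n"
    "CityB,2018-03-15,\n"
)

demo = pd.read_csv(sample_csv)

# errors="coerce" turns unparseable dates into NaT instead of raising
demo["Date"] = pd.to_datetime(demo["Date"], errors="coerce")

start, end = demo["Date"].min(), demo["Date"].max()
print(f"Date range: {start:%Y-%m-%d} to {end:%Y-%m-%d}")
print(f"Unparseable dates: {demo['Date'].isna().sum()}")
```

With a real datetime column, later steps such as resampling or seasonal decomposition can rely on a proper time index rather than on string ordering.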

## 1️⃣ Project Setup and Data Loading

Let's start by importing all necessary libraries and loading our AQI dataset.