# Health Impact Classification

#### Data Description

##### Record Information
RecordID: A unique identifier assigned to each record (1 to 5811).

##### Air Quality Metrics
AQI: Air Quality Index, a measure of how polluted the air currently is or how polluted it is forecast to become.<br>
PM10: Concentration of particulate matter less than 10 micrometers in diameter (μg/m³).<br>
PM2_5: Concentration of particulate matter less than 2.5 micrometers in diameter (μg/m³).<br>
NO2: Concentration of nitrogen dioxide (ppb).<br>
SO2: Concentration of sulfur dioxide (ppb).<br>
O3: Concentration of ozone (ppb).<br>

##### Weather Conditions
Temperature: Temperature in degrees Celsius (°C).<br>
Humidity: Humidity percentage (%).<br>
WindSpeed: Wind speed in meters per second (m/s).<br>

##### Health Impact Metrics
RespiratoryCases: Number of respiratory cases reported.<br>
CardiovascularCases: Number of cardiovascular cases reported.<br>
HospitalAdmissions: Number of hospital admissions reported.<br>

##### Target Variable: Health Impact Class
HealthImpactScore: A score indicating the overall health impact based on air quality and other related factors, ranging from 0 to 100.<br>
HealthImpactClass: Classification of the health impact based on the health impact score:<br>

0: 'Very High' (HealthImpactScore >= 80)<br>
1: 'High' (60 <= HealthImpactScore < 80)<br>
2: 'Moderate' (40 <= HealthImpactScore < 60)<br>
3: 'Low' (20 <= HealthImpactScore < 40)<br>
4: 'Very Low' (HealthImpactScore < 20)<br>
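
The thresholds above can be written as a small score-to-class mapping. This is only an illustrative sketch: the helper name `score_to_class` is hypothetical, and it assumes the listed boundaries are exactly how `HealthImpactClass` was derived from `HealthImpactScore`, which the dataset description does not confirm.

```python
import numpy as np
import pandas as pd

def score_to_class(scores):
    """Map HealthImpactScore values (0 to 100) to class labels 0-4 (hypothetical helper)."""
    scores = pd.Series(scores, dtype="float64")
    conditions = [
        scores >= 80,  # 0: Very High
        scores >= 60,  # 1: High
        scores >= 40,  # 2: Moderate
        scores >= 20,  # 3: Low
    ]
    # np.select takes the first matching condition; scores below 20 fall through to 4: Very Low
    classes = np.select(conditions, [0, 1, 2, 3], default=4)
    return pd.Series(classes, index=scores.index, name="HealthImpactClass")

example = score_to_class([95.0, 80.0, 79.9, 45.0, 20.0, 5.0])
print(example.tolist())  # [0, 0, 1, 2, 3, 4]
```

Note that a lower class number means a more severe impact, so class 0 is the worst case.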

### Set-Up and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Import CSV as DataFrame
df = pd.read_csv('df/air_quality_health_impact_data.csv')
print(df.head()) # Preview first 5 rows

In [None]:
# Dataset Size and Structure
df.shape  # 5811 rows, 15 columns

### Pre-Processing

In [None]:
# No missing values
df.isnull().sum()

In [None]:
# Remove identifier, redundant for prediction analysis
df = df.drop(columns=["RecordID"], errors="ignore")

### Exploratory Data Analysis (EDA)

In [None]:
# Datatypes and Missing Values
df.info() # Integer - int64(4) and Float - float64(11)

In [None]:
# Statistical Summary
df.describe().T # Count, mean, standard deviation, minimum, maximum, and quartiles
# Central Tendency and Spread of Data

In [None]:
df.nunique() # Number of unique values per column; low counts indicate many repeated values

#### Univariate Analysis

##### Numerical Values

In [None]:
# Histogram Mean
def hist_mean(data, **kwargs):
    sns.histplot(data['Value'], kde=True, **kwargs)
    plt.axvline(data['Value'].mean(), color='red', linestyle='dashed', linewidth=2) # Mean line

# Facet Grids
def facet_plots(df, plot_type, col_wrap):

    # Melt to long format - categorical column 'Feature' against 'Value'
    numeric_df = df.select_dtypes(include='number')  # Keep numeric columns only before melting
    df_long = numeric_df.melt(var_name='Feature', value_name='Value')
    
    # Create FacetGrid
    gr = sns.FacetGrid(df_long, col='Feature', col_wrap=col_wrap, sharex=False, sharey=False)
    # Plot Type
    if plot_type == 'hist': # Histogram
        gr.map_dataframe(hist_mean)
    elif plot_type == 'box': # Boxplot
        gr.map(sns.boxplot, 'Value')
    else:
        raise ValueError(f"Unknown plot_type: {plot_type!r}")

    # Titles
    gr.set_titles("{col_name}")
    gr.figure.subplots_adjust(top=0.9)
    plt.show()

    # WHO recommended standards (2021 guidelines, short-term means)
    # PM2_5: 15 μg/m³
    # PM10: 45 μg/m³
    # NO2: 25 μg/m³ (≈ 13.2 ppb)
    # SO2: 40 μg/m³ (≈ 15.3 ppb)
    # O3: 100 μg/m³ over 8 hours (≈ 51 ppb)
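
The comments above quote gas limits in both μg/m³ and ppb. A minimal sketch of that conversion follows, assuming ideal-gas behaviour at 25 °C and 1 atm (molar volume of about 24.45 L/mol). The helper name `ugm3_to_ppb` and the reference conditions are assumptions, not something the notebook states.

```python
# Molar masses in g/mol
MOLAR_MASS = {"NO2": 46.01, "SO2": 64.07, "O3": 48.00}

# Molar volume of an ideal gas at 25 °C and 1 atm in L/mol (assumed reference conditions)
MOLAR_VOLUME_25C = 24.45

def ugm3_to_ppb(gas, ugm3):
    """Convert a mass concentration (μg/m³) to a mixing ratio (ppb) for a given gas."""
    return ugm3 * MOLAR_VOLUME_25C / MOLAR_MASS[gas]

def ppb_to_ugm3(gas, ppb):
    """Inverse conversion, from ppb back to μg/m³."""
    return ppb * MOLAR_MASS[gas] / MOLAR_VOLUME_25C

no2_ppb = ugm3_to_ppb("NO2", 25)
so2_ppb = ugm3_to_ppb("SO2", 40)
print(round(no2_ppb, 1), round(so2_ppb, 1))
```

Under these assumptions the results land close to the ppb figures in the comments. Small differences come from the reference temperature chosen. The particulate limits (PM10, PM2_5) have no ppb equivalent because they are not gases.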


In [None]:
# Distribution
facet_plots(df, 'hist', 7)

In [None]:
# Boxplots
facet_plots(df, 'box', 7)

##### Categorical Values

In [None]:
# Health Impact Class
# Bar Chart Palette
custom_colors = ["#FF9999", '#FFCC99', '#FFFF99', '#CCFF99', "#99FFF0"]
# Plot counts
ax = sns.countplot(x='HealthImpactClass', hue='HealthImpactClass', data=df, palette=custom_colors, legend=False)
# Title
plt.title('Health Impact Class Frequency')
# Add count labels 
for p in ax.patches:
    count = int(p.get_height())
    ax.annotate(f'{count}', 
                (p.get_x() + p.get_width() / 2, p.get_height()), 
                ha='center', va='bottom')
ax.set(xlabel='Health Impact Class', ylabel='Count')
plt.show()
# Data heavily skewed towards health impact of "Very High", Health Impact Score >= 80
#print(df['HealthImpactClass'].value_counts())
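
To put a number on the skew noted above, class counts can be turned into percentages. The sketch below uses a small made-up label list rather than the real dataset, and the helper name `class_balance` is hypothetical.

```python
import pandas as pd

def class_balance(labels):
    """Return count and percentage share per class, ordered by class label."""
    labels = pd.Series(labels)
    counts = labels.value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy labels for illustration only (not taken from the dataset)
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 4]
summary = class_balance(toy_labels)
print(summary)
```

On the notebook data this would be called as `class_balance(df['HealthImpactClass'])`. A strongly dominant class is a hint that plain accuracy may be a misleading metric later on.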

#### Multivariate Analysis

In [None]:
# Histogram with KDE overlay per feature - understanding spread and skewness
features = df.columns
n_cols = 4
n_rows = int(np.ceil(len(features) / n_cols))  # Grid rows needed to fit every feature
plt.figure(figsize=(15, n_rows * 3))

for idx, feature in enumerate(features, 1):
    plt.subplot(n_rows, n_cols, idx)
    sns.histplot(df[feature], kde=True)
    plt.axvline(df[feature].mean(), color='red', linestyle='dashed', linewidth=2) # Mean Line
    plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}") # Skewness
plt.tight_layout()
plt.show()

# Cases, Admissions, Health Impact Metrics have highest variance 

##### Correlation HeatMap

In [None]:
plt.figure(figsize=(15, 10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='Spectral', linewidths=1)
plt.title('Correlation Heatmap')
plt.show()
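
A heatmap with this many features can be hard to scan, so a ranked list of the strongest pairs is a useful companion. The sketch below is one way to build it. The helper name `top_correlations` is hypothetical and the data is a toy frame, not the dataset.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by strength regardless of sign
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # Perfectly correlated with "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top = top_correlations(toy, n=2)
print(top)
```

On the notebook data this would be called as `top_correlations(df)`.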

##### Pollutants vs Cases and Hospital Admission Rates

In [None]:
# Pollutants list
pollutants = ["PM10","PM2_5", "NO2", "SO2", "O3"]

# Pollutant Correlation
def pollutant_correlations(feature):

    correlations = (df[pollutants].corrwith(df[feature])).sort_values() # Calculate correlation
    fig, ax = plt.subplots(figsize=(8, 6)) # Barplot
    bars = ax.barh(correlations.index, correlations.values, edgecolor='black')
    
    # Bar colours based on correlation
    vmin, vmax = correlations.min(), correlations.max()
    if vmin < 0 and vmax > 0:
        # Mixed: Red and Green
        divnorm = mpl.colors.TwoSlopeNorm(vmin=vmin, vcenter=0, vmax=vmax)
        cmap = plt.cm.RdYlGn
    elif vmax <= 0:
        # All negative: Red (reversed map so the strongest negative value is darkest)
        divnorm = mpl.colors.Normalize(vmin=vmin, vmax=0)
        cmap = plt.cm.Reds_r
    else:
        # All positive: Green
        divnorm = mpl.colors.Normalize(vmin=0, vmax=vmax)
        cmap = plt.cm.Greens
    
    # Apply colour to each bar
    div_colors = cmap(divnorm(correlations.values))
    for bar, color in zip(bars, div_colors):
        bar.set_facecolor(color)

    # Add labels and layout
    ax.set_xlabel(f"Pollutant Correlation with {feature}")
    plt.show()

pollutant_correlations("RespiratoryCases")
pollutant_correlations("CardiovascularCases")
pollutant_correlations("HospitalAdmissions")
# correlations = df[pollutants].corrwith(df['RespiratoryCases'])
# correlations.plot(kind='bar', title='Correlation with Respiratory Cases')


### Feature Selection

### Training the Model

### Testing the Model

### Performance Evaluation