**Dataset: Air Quality and Public Health Impact**

This dataset, taken from Kaggle (https://www.kaggle.com/datasets/rabieelkharoua/air-quality-and-health-impact-dataset/data), contains comprehensive information on air quality and its impact on public health, with a total of 5,811 records. It includes variables such as the Air Quality Index (AQI), concentrations of various pollutants, weather conditions, and health impact metrics. The target variable is the **Health Impact Class**, which categorizes the health impact based on air quality and related factors.



In [1]:
import pandas as pd

df = pd.read_csv('airquality.csv')

### Exploratory Data Analysis

In [2]:
df.head(5)

Unnamed: 0,RecordID,AQI,PM10,PM2_5,NO2,SO2,O3,Temperature,Humidity,WindSpeed,RespiratoryCases,CardiovascularCases,HospitalAdmissions,HealthImpactScore,HealthImpactClass
0,1,187.270059,295.853039,13.03856,6.639263,66.16115,54.62428,5.150335,84.424344,6.137755,7,5,1,97.244041,0.0
1,2,475.357153,246.254703,9.984497,16.318326,90.499523,169.621728,1.543378,46.851415,4.521422,10,2,0,100.0,0.0
2,3,365.996971,84.443191,23.11134,96.317811,17.87585,9.006794,1.169483,17.806977,11.157384,13,3,0,100.0,0.0
3,4,299.329242,21.020609,14.273403,81.234403,48.323616,93.161033,21.925276,99.473373,15.3025,8,8,1,100.0,0.0
4,5,78.00932,16.987667,152.111623,121.235461,90.866167,241.795138,9.217517,24.906837,14.534733,9,0,1,95.182643,0.0


**Record Information**
- **RecordID**: A unique identifier assigned to each record (1 to 5811).

---

**Air Quality Metrics**
- **AQI**: Air Quality Index, a measure of how polluted the air currently is or how polluted it is forecast to become.
- **PM10**: Concentration of particulate matter less than 10 micrometers in diameter (μg/m³).
- **PM2_5**: Concentration of particulate matter less than 2.5 micrometers in diameter (μg/m³).
- **NO2**: Concentration of nitrogen dioxide (ppb).
- **SO2**: Concentration of sulfur dioxide (ppb).
- **O3**: Concentration of ozone (ppb).

---

**Weather Conditions**
- **Temperature**: Temperature in degrees Celsius (°C).
- **Humidity**: Humidity percentage (%).
- **WindSpeed**: Wind speed in meters per second (m/s).

---

**Health Impact Metrics**
- **RespiratoryCases**: Number of respiratory cases reported.
- **CardiovascularCases**: Number of cardiovascular cases reported.
- **HospitalAdmissions**: Number of hospital admissions reported.

---

**Target Variable: Health Impact Class**
- **HealthImpactScore**: A score indicating the overall health impact based on air quality and related factors, ranging from 0 to 100.

- **HealthImpactClass**: Classification of the health impact based on the **HealthImpactScore**:
  - **0**: 'Very High' (**HealthImpactScore** ≥ 80)
  - **1**: 'High' (60 ≤ **HealthImpactScore** < 80)
  - **2**: 'Moderate' (40 ≤ **HealthImpactScore** < 60)
  - **3**: 'Low' (20 ≤ **HealthImpactScore** < 40)
  - **4**: 'Very Low' (**HealthImpactScore** < 20)
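
The thresholds above can be written as a small lookup. The following is a minimal sketch rather than code from the dataset author: the function name `score_to_class` is hypothetical, and it assumes the class boundaries are exactly as listed.

```python
from bisect import bisect_right

# Lower bounds of classes 3, 2, 1 and 0 (scores below 20 fall into class 4).
_THRESHOLDS = [20, 40, 60, 80]


def score_to_class(score: float) -> int:
    """Map a HealthImpactScore (0-100) to a HealthImpactClass (0-4)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # bisect_right counts how many thresholds are <= score (0..4).
    # A higher score means a more severe impact, i.e. a lower class number.
    return 4 - bisect_right(_THRESHOLDS, score)


for s in (97.24, 80, 79.9, 45, 20, 5):
    print(s, score_to_class(s))
```

With pandas, something like `df['HealthImpactScore'].apply(score_to_class)` could be compared against the `HealthImpactClass` column to check whether the data actually follows these boundaries.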

---

**Conclusion**

This dataset offers a comprehensive view of the relationship between air quality and public health, making it ideal for research, predictive modeling, and statistical analysis.


In [3]:
df.shape

(5811, 15)

In [6]:
df.isnull().sum()

RecordID               0
AQI                    0
PM10                   0
PM2_5                  0
NO2                    0
SO2                    0
O3                     0
Temperature            0
Humidity               0
WindSpeed              0
RespiratoryCases       0
CardiovascularCases    0
HospitalAdmissions     0
HealthImpactScore      0
HealthImpactClass      0
dtype: int64

In [7]:
df.duplicated().sum()

0

In [8]:
df.dtypes

RecordID                 int64
AQI                    float64
PM10                   float64
PM2_5                  float64
NO2                    float64
SO2                    float64
O3                     float64
Temperature            float64
Humidity               float64
WindSpeed              float64
RespiratoryCases         int64
CardiovascularCases      int64
HospitalAdmissions       int64
HealthImpactScore      float64
HealthImpactClass      float64
dtype: object

In [9]:
df = df.drop(columns=['RecordID'])

In [14]:
import plotly.graph_objects as go

corr_matrix = df.corr()
fig = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
    colorscale='Viridis',
    colorbar=dict(title='Correlation Coefficient'),
    zmin=-1,
    zmax=1,
    text=corr_matrix.values, 
    texttemplate='%{text:.2f}', 
    textfont=dict(size=10, color='white')
))

fig.update_layout(
    title='Correlation Matrix of Air Quality and Health Impact Data',
    xaxis_title='Variables',
    yaxis_title='Variables',
    xaxis=dict(tickangle=-45),
    yaxis=dict(tickangle=-45),
    xaxis_title_font=dict(size=12),
    yaxis_title_font=dict(size=12),
    title_font=dict(size=14),
    width=800,  
    height=600  
)

# Show the plot
fig.show()

The heatmap shows that the positive correlations come from the sensor-measured numeric columns such as AQI, PM10, PM2_5, NO2, SO2 and O3. Even so, all of the features will be used for modelling later on.
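
To read the same information without a heatmap, the correlations with the target can be ranked directly. This is a minimal sketch on a small synthetic frame: the values are made up for illustration and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
aqi = rng.uniform(0, 500, n)
demo = pd.DataFrame({
    "AQI": aqi,
    "Humidity": rng.uniform(10, 100, n),
    # Synthetic target that depends on AQI only, plus noise (an assumption
    # made purely for this demo, not a property of the real data).
    "HealthImpactScore": np.clip(0.2 * aqi + rng.normal(0, 5, n), 0, 100),
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (
    demo.corr()["HealthImpactScore"]
    .drop("HealthImpactScore")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

On the real data, the same expression applied to `df` would list each feature's correlation with `HealthImpactScore` in descending order of strength.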

### Preprocessing and Modelling


In [16]:
df = df.drop(columns = 'HealthImpactClass')

For the modelling step I will use regression to predict the numerical feature "HealthImpactScore", so the 'HealthImpactClass' column is removed.

In [17]:
df.head(2)

Unnamed: 0,AQI,PM10,PM2_5,NO2,SO2,O3,Temperature,Humidity,WindSpeed,RespiratoryCases,CardiovascularCases,HospitalAdmissions,HealthImpactScore
0,187.270059,295.853039,13.03856,6.639263,66.16115,54.62428,5.150335,84.424344,6.137755,7,5,1,97.244041
1,475.357153,246.254703,9.984497,16.318326,90.499523,169.621728,1.543378,46.851415,4.521422,10,2,0,100.0


In [20]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['HealthImpactScore'])
y = df['HealthImpactScore']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [26]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'ElasticNet Regression': ElasticNet(),
    'Decision Tree Regression': DecisionTreeRegressor(),
    'Random Forest Regression': RandomForestRegressor(),
    'Gradient Boosting Regression': GradientBoostingRegressor(),
    'HistGradientBoostingRegressor': HistGradientBoostingRegressor(),
    'LGBMRegressor': LGBMRegressor(),
    'XGBRegressor': XGBRegressor()
}

In [27]:
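from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Collect one row of test-set metrics per model
results = []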

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'Mean Squared Error': mse,
        'Mean Absolute Error': mae,
        'R^2 Score': r2
    })

results_df = pd.DataFrame(results)

print("Regression Model Performance:")
print(results_df.to_string(index=False))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000141 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2343
[LightGBM] [Info] Number of data points in the train set: 4648, number of used features: 12
[LightGBM] [Info] Start training from score 93.922841
Regression Model Performance:
                        Model  Mean Squared Error  Mean Absolute Error  R^2 Score
            Linear Regression           92.682879             7.354111   0.505338
             Ridge Regression           92.684487             7.353716   0.505329
             Lasso Regression           99.245761             7.141586   0.470311
        ElasticNet Regression          109.664001             7.070361   0.414707
     Decision Tree Regression           25.615426             2.328902   0.863287
     Random Forest Regression           10.104620             1.549161  

Next, we tune the hyperparameters of the tree-based models to see whether accuracy can be improved.
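
Grid search fits every combination of the listed values once per cross-validation fold, so its cost grows quickly. As a small sketch of that arithmetic, the snippet below counts the candidates for the decision-tree grid used in the next cell with scikit-learn's `ParameterGrid`; the variable names are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the decision-tree grid in the tuning cell.
tree_grid = {
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(tree_grid))  # 4 * 3 * 3 combinations
n_fits = n_candidates * 5                     # 5-fold cross-validation
print(n_candidates, n_fits)  # 36 180
```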

In [28]:
from sklearn.model_selection import GridSearchCV

param_grids = {
    'Decision Tree Regression': {
        'model': DecisionTreeRegressor(),
        'param_grid': {
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'Random Forest Regression': {
        'model': RandomForestRegressor(),
        'param_grid': {
            'n_estimators': [50, 100, 200, 300, 400],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'Gradient Boosting Regression': {
        'model': GradientBoostingRegressor(),
        'param_grid': {
            'n_estimators': [50, 100, 200, 300],
            'learning_rate': [0.001, 0.01, 0.1],
            'max_depth': [3, 5, 7]
        }
    },
    'HistGradientBoostingRegressor': {
        'model': HistGradientBoostingRegressor(),
        'param_grid': {
            'learning_rate': [0.001, 0.01, 0.1],
            'max_iter': [100, 200, 300],
            'max_depth': [3, 5, 7],
            'l2_regularization': [0.1, 1, 10]
        }
    },
    'LGBMRegressor': {
        'model': LGBMRegressor(),
        'param_grid': {
            'n_estimators': [50, 100, 200, 300, 400],
            'learning_rate': [0.001, 0.01, 0.1],
            'num_leaves': [31, 63, 127]
        }
    },
    'XGBRegressor': {
        'model': XGBRegressor(),
        'param_grid': {
            'n_estimators': [50, 100, 200, 300, 400],
            'learning_rate': [0.001, 0.01, 0.1],
            'max_depth': [3, 5, 7],
            'subsample': [0.8, 1.0]
        }
    }
}

results = []

for name, config in param_grids.items():
    grid_search = GridSearchCV(estimator=config['model'], param_grid=config['param_grid'], 
                               cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
    grid_search.fit(X_train_scaled, y_train)
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test_scaled)
    
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'Best Parameters': grid_search.best_params_,
        'Mean Squared Error': mse,
        'Mean Absolute Error': mae,
        'R^2 Score': r2
    })

results_df = pd.DataFrame(results)


print("Tuned Model Performance:")
print(results_df.to_string(index=False))

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Fitting 5 folds for each of 180 candidates, totalling 900 fits
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Fitting 5 folds for each of 81 candidates, totalling 405 fits
Fitting 5 folds for each of 45 candidates, totalling 225 fits
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000337 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2343
[LightGBM] [Info] Number of data points in the train set: 4648, number of used features: 12
[LightGBM] [Info] Start training from score 93.922841
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Tuned Model Performance:
                        Model                                                                         Best Parameters  Mean Squared Error  Mean Absolute Error  R^2 Score
     Decision Tree Regression                       {'max_depth': 30, 'min_samples_leaf': 

We can conclude that HistGradientBoostingRegressor has the best accuracy, so it will be the only algorithm used in the simple air-quality prediction app that follows.