<a href="https://colab.research.google.com/github/dongpradip/AirQuality_project/blob/main/Model_building.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing the required libraries

In [None]:
# importing all the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib

### Reading the dataset from the drive

In [None]:
df = pd.read_csv('/content/drive/MyDrive/programming_for_data_analysis/cleaned_air_quality.csv')

In [None]:
df.head()

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket,Year,Month,Day
0,Jorapokhar,2017-04-20,217.13,119.49,7.75,9.26,27.38,6.75,0.32,28.43,18.88,0.0,0.0,0.0,148.0,Moderate,2017,4,20
1,Jorapokhar,2017-04-21,217.13,170.61,8.0,10.2,27.38,6.75,0.27,29.35,15.85,0.0,0.0,0.0,148.0,Moderate,2017,4,21
2,Jorapokhar,2017-04-22,217.13,124.64,7.92,9.45,27.38,6.75,0.29,33.34,17.76,0.0,0.0,0.0,135.0,Moderate,2017,4,22
3,Jorapokhar,2017-04-23,217.13,107.36,7.74,9.39,27.38,6.75,0.31,34.1,21.71,0.0,0.0,0.0,107.0,Moderate,2017,4,23
4,Jorapokhar,2017-04-24,217.13,178.28,7.49,10.72,27.38,6.75,0.33,38.16,17.94,0.0,0.0,0.0,124.0,Moderate,2017,4,24
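
The `Unnamed: 0` column in the preview above is usually the old row index that an earlier `to_csv` call wrote into the file. A minimal sketch, using a small in-memory CSV as a stand-in for the real file (the sample rows and the names `raw` and `sample_df` are hypothetical), of reading that column back as the index with `index_col=0`:

```python
import io

import pandas as pd

# Hypothetical stand-in for cleaned_air_quality.csv: the first, unnamed
# column is the row index written by a previous DataFrame.to_csv call.
csv_text = """,City,AQI
0,Jorapokhar,148.0
1,Jorapokhar,135.0
"""

# Default read: the unnamed column appears as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that column as the index instead of a data column.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample_df.columns.tolist())
```

The column is not in the `features` list used later, so leaving it in place does not affect the models.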


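## Using Label Encoder to encode the categorical data

The AQI buckets have a natural severity order:

Good < Satisfactory < Moderate < Poor < Very Poor < Severe

* `LabelEncoder` does not preserve this order. It sorts the class names alphabetically, so the codes are Good = 0, Moderate = 1, Poor = 2, Satisfactory = 3, Severe = 4, Very Poor = 5.
* The codes are used here only as class labels for the classifier, so the numbering does not affect training.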

In [None]:
le = LabelEncoder()
y_encoded = le.fit_transform(df['AQI_Bucket'])
df['AQI_Bucket'] = y_encoded
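
`LabelEncoder` assigns integer codes by sorting the class names, so the codes follow alphabetical order rather than severity order. The sketch below, on a hypothetical six-value sample, prints the resulting mapping, decodes the codes back to names, and shows one way to get severity-ordered codes with an explicit dictionary (`demo_le`, `severity_order`, and `ordinal_codes` are illustrative names, not part of the notebook).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample containing each AQI bucket once.
buckets = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

# LabelEncoder sorts the class names, so codes are alphabetical.
demo_le = LabelEncoder()
codes = demo_le.fit_transform(buckets)
label_to_code = {label: code for code, label in enumerate(demo_le.classes_)}
print(label_to_code)

# inverse_transform maps integer codes back to the original names.
decoded = list(demo_le.inverse_transform(codes))
print(decoded)

# Alternative: an explicit mapping that follows the severity order.
severity_order = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]
ordinal_map = {label: rank for rank, label in enumerate(severity_order)}
ordinal_codes = [ordinal_map[label] for label in buckets]
print(ordinal_codes)
```

In the notebook itself, `le.classes_` gives the same lookup for the fitted encoder, and `le.inverse_transform` can turn predicted codes back into bucket names.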

In [None]:
df.head()

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket,Year,Month,Day
0,Jorapokhar,2017-04-20,217.13,119.49,7.75,9.26,27.38,6.75,0.32,28.43,18.88,0.0,0.0,0.0,148.0,1,2017,4,20
1,Jorapokhar,2017-04-21,217.13,170.61,8.0,10.2,27.38,6.75,0.27,29.35,15.85,0.0,0.0,0.0,148.0,1,2017,4,21
2,Jorapokhar,2017-04-22,217.13,124.64,7.92,9.45,27.38,6.75,0.29,33.34,17.76,0.0,0.0,0.0,135.0,1,2017,4,22
3,Jorapokhar,2017-04-23,217.13,107.36,7.74,9.39,27.38,6.75,0.31,34.1,21.71,0.0,0.0,0.0,107.0,1,2017,4,23
4,Jorapokhar,2017-04-24,217.13,178.28,7.49,10.72,27.38,6.75,0.33,38.16,17.94,0.0,0.0,0.0,124.0,1,2017,4,24


In [None]:
df.isnull().sum()

Unnamed: 0,0
City,0
Date,0
PM2.5,0
PM10,0
NO,0
NO2,0
NOx,0
NH3,0
CO,0
SO2,0


In [None]:
# separating the independent variables (X) from the dependent variables: y for regression, z for classification
features = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene']
target = 'AQI'
next_target = 'AQI_Bucket'
X = df[features]
y = df[target]
z = df[next_target]

In [None]:
# using sklearn to split the dataset for regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
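# using sklearn to split the dataset for classification
# (same X, test_size and random_state as above, so X_train and X_test are identical to the regression split)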
X_train, X_test, z_train, z_test = train_test_split(X, z, test_size=0.2, random_state=42)
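
Both split calls above use the same `X`, `test_size`, and `random_state`, so they select the same rows and the second call overwrites `X_train` and `X_test` with identical values. `train_test_split` accepts any number of arrays, so the same result can be had in a single call. A sketch on synthetic data (the `*_demo` arrays are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X (features), y (AQI) and z (encoded bucket).
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3))
y_demo = rng.random(100)
z_demo = rng.integers(0, 6, size=100)

# One call splits all three arrays along the same row partition.
X_tr, X_te, y_tr, y_te, z_tr, z_te = train_test_split(
    X_demo, y_demo, z_demo, test_size=0.2, random_state=42
)

# A separate call with the same random_state picks the same rows.
X_tr2, X_te2, z_tr2, z_te2 = train_test_split(
    X_demo, z_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_tr, X_tr2), np.array_equal(z_te, z_te2))
```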

In [None]:
# Checking the shape
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
print("z_train shape:", z_train.shape)
print("z_test shape:", z_test.shape)

X_train shape: (47249, 12)
X_test shape: (11813, 12)
y_train shape: (47249,)
y_test shape: (11813,)
z_train shape: (47249,)
z_test shape: (11813,)


### Building a model (Random Forest Regressor)
* A larger `n_estimators` would likely make the model somewhat more accurate, but it also makes the saved model file larger. Because of the file size limit on GitHub, I reduced it to 50.

In [None]:
model = RandomForestRegressor(n_estimators=50, random_state=42)

model.fit(X_train, y_train)
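
The size trade-off behind the choice of `n_estimators` can be checked directly, since each extra tree adds to the serialized size of the forest. A rough sketch on synthetic data (the data and the `pickled_size` helper are hypothetical; actual sizes depend on the dataset and on how deep the trees grow):

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the pollutant features.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 4))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.1, size=500)


def pickled_size(n_trees):
    """Fit a forest with n_trees trees and return its pickled size in bytes."""
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    forest.fit(X_demo, y_demo)
    return len(pickle.dumps(forest))


# Compare the serialized size for a small and a larger forest.
sizes = {n_trees: pickled_size(n_trees) for n_trees in (10, 50)}
print(sizes)
```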

In [None]:
# making predictions on the test set
y_pred = model.predict(X_test)

In [None]:
y_pred

array([111.44, 343.46, 121.84, ...,  98.5 , 114.5 , 249.28])

### Building a model (Random Forest Classifier)

In [None]:
rfc_model = RandomForestClassifier(n_estimators=100, random_state=42)

rfc_model.fit(X_train, z_train)

In [None]:
z_pred = rfc_model.predict(X_test)

### Evaluation of the Model

In [None]:
# evaluation of Random Forest Regressor
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)

Mean Absolute Error: 10.086867744520168
Mean Squared Error: 558.8994626213038
Root Mean Squared Error: 23.641054600446736
R-squared: 0.9705431053819722


### Inference
The Random Forest Regressor performs well in predicting AQI values on the held-out test set.
* The model achieves an R² of **0.97**, meaning it explains about **97% of the variance in AQI** on the test data from the input pollutant features.
* The Mean Absolute Error (MAE) of approximately **10 units** shows that, on average, the predicted AQI differs from the actual value by about 10 AQI points.
* The Root Mean Squared Error (RMSE) of around **23.6** units is more than twice the MAE, which indicates that most predictions are close to the true values but a few have large errors.

Overall, the model is accurate for AQI prediction on this dataset, though it could likely be improved through feature engineering, hyperparameter tuning, or handling of potential outliers.
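
For reference, the three regression metrics can be written out with NumPy. The sketch below uses a small made-up set of values (not the notebook's data) in which one prediction is far off, to show why RMSE rises well above MAE when a few errors are large.

```python
import numpy as np

# Made-up AQI values: four predictions are off by 5, one is off by 50.
actual = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
predicted = np.array([105.0, 145.0, 205.0, 245.0, 350.0])

errors = actual - predicted

# MAE: mean of absolute errors, every error counts linearly.
demo_mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error, large errors dominate.
demo_rmse = np.sqrt(np.mean(errors ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
demo_r2 = 1 - ss_res / ss_tot

print(demo_mae, demo_rmse, demo_r2)
```

Here the single 50-point miss lifts the RMSE to roughly 22.8 while the MAE stays at 14, the same pattern as the gap between the notebook's RMSE and MAE.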

In [None]:
# evaluation of Random Forest Classifier
print("Accuracy:", accuracy_score(z_test, z_pred))
print("\nClassification Report:\n", classification_report(z_test, z_pred))
print("\nConfusion Matrix:\n", confusion_matrix(z_test, z_pred))

Accuracy: 0.9647845593837298

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.94      0.95       639
           1       0.96      0.97      0.97      3965
           2       0.94      0.95      0.94      1220
           3       0.97      0.97      0.97      4220
           4       0.96      0.97      0.97       547
           5       0.97      0.95      0.96      1222

    accuracy                           0.96     11813
   macro avg       0.96      0.96      0.96     11813
weighted avg       0.96      0.96      0.96     11813


Confusion Matrix:
 [[ 602    0    0   37    0    0]
 [   1 3842   31   87    0    4]
 [   0   43 1153    2    0   22]
 [  21   95    1 4103    0    0]
 [   0    0    4    0  532   11]
 [   0    2   34    0   21 1165]]


## Inference
* The model achieves an overall accuracy of 96.5% on the test set.
* Precision, recall, and F1-scores for each class are consistently high (0.94–0.97), showing that the classifier performs well across all AQI buckets, including the smaller classes.
* The confusion matrix shows 416 misclassifications out of 11,813 test samples. The class codes are alphabetical (0 = Good, 1 = Moderate, 2 = Poor, 3 = Satisfactory, 4 = Severe, 5 = Very Poor), so neighboring rows in the matrix are not neighboring categories. Read in severity order, 402 of the 416 errors fall between adjacent AQI categories (for example Moderate and Satisfactory), which is reasonable given the similarity in pollutant levels.

Overall, the Random Forest Classifier is very effective in predicting AQI categories.
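
The figures in the confusion matrix can be checked with a few lines of NumPy. The sketch below copies the matrix from the output above and assumes the class codes follow `LabelEncoder`'s alphabetical ordering of the six bucket names (`names`, `severity_order`, and `rank` are illustrative helpers, not part of the notebook).

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = np.array([
    [602,    0,    0,   37,   0,    0],
    [  1, 3842,   31,   87,   0,    4],
    [  0,   43, 1153,    2,   0,   22],
    [ 21,   95,    1, 4103,   0,    0],
    [  0,    0,    4,    0, 532,   11],
    [  0,    2,   34,    0,  21, 1165],
])

# Assumption: codes 0..5 are the bucket names in alphabetical order.
names = ["Good", "Moderate", "Poor", "Satisfactory", "Severe", "Very Poor"]
severity_order = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]
rank = np.array([severity_order.index(name) for name in names])

# Overall accuracy and per-class recall.
accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)

# Count errors whose true and predicted buckets are one severity step apart.
total_errors = cm.sum() - np.trace(cm)
adjacent = np.abs(rank[:, None] - rank[None, :]) == 1
adjacent_errors = cm[adjacent].sum()

print("accuracy:", accuracy)
print("recall per class:", dict(zip(names, recall.round(3))))
print("errors between adjacent buckets:", adjacent_errors, "of", total_errors)
```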

### Saving the Trained Model


In [None]:
joblib.dump(model, '/content/drive/MyDrive/programming_for_data_analysis/aqi_rfmodel1.pkl', compress=5)

['/content/drive/MyDrive/programming_for_data_analysis/aqi_rfmodel1.pkl']

In [None]:
joblib.dump(rfc_model, '/content/drive/MyDrive/programming_for_data_analysis/aqi_rfcmodel1.pkl', compress=5)

['/content/drive/MyDrive/programming_for_data_analysis/aqi_rfcmodel1.pkl']

## Inference
Here, I used the `compress` parameter of `joblib.dump` to reduce the size of the saved models, because GitHub was not accepting files that exceed its 25 MB size limit.
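
The effect of `compress` can be measured by saving the same model with and without it and comparing file sizes. A minimal sketch on a small synthetic model written to a temporary directory (the data, file names, and the 20-tree forest are hypothetical; the size reduction on the real models will differ):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the trained AQI regressor.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 4))
y_demo = X_demo.sum(axis=1)
forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "raw.pkl")
    small_path = os.path.join(tmp_dir, "compressed.pkl")

    joblib.dump(forest, raw_path)                # default: no compression
    joblib.dump(forest, small_path, compress=5)  # same level as in the notebook

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)

    # Loading needs no extra arguments; joblib detects the compression itself.
    reloaded = joblib.load(small_path)

# Compression changes only the file on disk, not the model's predictions.
same_predictions = np.array_equal(forest.predict(X_demo), reloaded.predict(X_demo))
print(raw_size, small_size, same_predictions)
```

Higher `compress` levels give smaller files at the cost of slower saving and loading.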