# ***Air Quality Index Prediction***

## **Organized by - DataVerse (12-hr ML Hackathon) of Cognizance IIT Roorkee**


### *By Deepak Kaura*


### **Objective:** Predict the Air Quality Index (AQI) using pollutant and meteorological data.

#### *Evaluation Metrics: Root Mean Squared Error (RMSE).*
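
As a quick illustration of the metric, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it with NumPy; the function name `rmse` and the sample numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only, not taken from the dataset
example_score = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Lower values are better, and the score is expressed in the same units as the target.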


#### **Dataset Information**

The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground-truth hourly averaged concentrations for CO, Non-Methanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx), and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.


**Attribute Information:**

* Date (DD/MM/YYYY)
* Time (HH.MM.SS)
* True hourly averaged concentration CO in mg/m^3 (reference analyzer)
* PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
* True hourly averaged overall Non-Methanic Hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
* True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
* PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
* True hourly averaged NOx concentration in ppb (reference analyzer)
* PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
* True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
* PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
* PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
* Temperature (T) in °C
* Relative Humidity (RH) in %
* Absolute Humidity (AH)

## **Import all the required modules, Loading and Reading dataset**

In [92]:
import pandas as pd
import numpy as np

csv_filename="AirQualityUCI.csv"
df=pd.read_csv(csv_filename, sep=";" , parse_dates= ['Date','Time'])

We use the AirQualityUCI.csv file as our dataset. It is a ';'-separated file, so we pass `sep=";"` to the `read_csv` function. We also pass the `parse_dates` parameter so that pandas attempts to recognize the 'Date' and 'Time' columns and format them accordingly.
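
The `df.info()` output in the next section lists `Date` and `Time` as `object`, which suggests the strings were not converted to datetimes. One possible approach, sketched here on two sample rows in the documented `DD/MM/YYYY` and `HH.MM.SS` layout, is to join the two strings and parse them with an explicit format. The column name `Datetime` is a hypothetical choice, not part of the dataset.

```python
import pandas as pd

# Two sample rows in the file's Date / Time layout (illustrative)
sample = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004"],
    "Time": ["18.00.00", "19.00.00"],
})

# Join the two strings and parse them with an explicit format;
# "Datetime" is a hypothetical column name, not part of the dataset
sample["Datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"],
    format="%d/%m/%Y %H.%M.%S",
)
```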

## **Checking each column data types**

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   object 
 3   PT08.S1(CO)    9357 non-null   float64
 4   NMHC(GT)       9357 non-null   float64
 5   C6H6(GT)       9357 non-null   object 
 6   PT08.S2(NMHC)  9357 non-null   float64
 7   NOx(GT)        9357 non-null   float64
 8   PT08.S3(NOx)   9357 non-null   float64
 9   NO2(GT)        9357 non-null   float64
 10  PT08.S4(NO2)   9357 non-null   float64
 11  PT08.S5(O3)    9357 non-null   float64
 12  T              9357 non-null   object 
 13  RH             9357 non-null   object 
 14  AH             9357 non-null   object 
 15  Unnamed: 15    0 non-null      float64
 16  Unnamed: 16    0 non-null      float64
dtypes: float64(10), object(7)
memory usage: 1.2+ MB


In [66]:
df.head()

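|  | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | Unnamed: 15 | Unnamed: 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/03/2004 | 18.00.00 | 2,6 | 1360.0 | 150.0 | 11,9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13,6 | 48,9 | 0,7578 | NaN | NaN |
| 1 | 10/03/2004 | 19.00.00 | 2 | 1292.0 | 112.0 | 9,4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13,3 | 47,7 | 0,7255 | NaN | NaN |
| 2 | 10/03/2004 | 20.00.00 | 2,2 | 1402.0 | 88.0 | 9,0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11,9 | 54,0 | 0,7502 | NaN | NaN |
| 3 | 10/03/2004 | 21.00.00 | 2,2 | 1376.0 | 80.0 | 9,2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11,0 | 60,0 | 0,7867 | NaN | NaN |
| 4 | 10/03/2004 | 22.00.00 | 1,6 | 1272.0 | 51.0 | 6,5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11,2 | 59,6 | 0,7888 | NaN | NaN |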


## **Checking missing values**

In [67]:
df.isnull().sum()

| Column | Null count |
|---|---|
| Date | 114 |
| Time | 114 |
| CO(GT) | 114 |
| PT08.S1(CO) | 114 |
| NMHC(GT) | 114 |
| C6H6(GT) | 114 |
| PT08.S2(NMHC) | 114 |
| NOx(GT) | 114 |
| PT08.S3(NOx) | 114 |
| NO2(GT) | 114 |


*The data contains null values, so we drop the columns and rows that consist entirely of nulls (`how="all"`).*

In [93]:
df.dropna(how="all",axis=1,inplace=True)

In [94]:
df.dropna(how="all",axis=0,inplace=True)
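
To make the effect of `how="all"` concrete, here is a small sketch on a made-up frame (the name `toy` and its values are purely illustrative). It contrasts `how="all"`, which removes only fully empty columns and rows, with the default `how="any"`, which removes every row holding at least one missing value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "empty": [np.nan, np.nan, np.nan],
})

# how="all" removes only columns/rows in which every value is missing
cleaned = toy.dropna(how="all", axis=1).dropna(how="all", axis=0)

# how="any" (the default) would remove every row with at least one NaN
strict = toy.dropna(how="any", axis=0)
```

On this toy frame, `cleaned` keeps the partially filled second row, while `strict` ends up with no rows because the `empty` column is missing everywhere.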

### **Time for Auto Visualization using ydata-profiling (formerly Pandas-Profiling)**

In [70]:

!pip install -U ydata-profiling


In [71]:

#importing ydata-profiling

from ydata_profiling import ProfileReport

#generating a report

ProfileReport(df)




### **Checking the shape of the dataset**

In [72]:
df.shape

(9357, 15)

In [73]:
df.tail()

|  | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9352 | 04/04/2005 | 10.00.00 | 3,1 | 1314.0 | -200.0 | 13,5 | 1101.0 | 472.0 | 539.0 | 190.0 | 1374.0 | 1729.0 | 21,9 | 29,3 | 0,7568 |
| 9353 | 04/04/2005 | 11.00.00 | 2,4 | 1163.0 | -200.0 | 11,4 | 1027.0 | 353.0 | 604.0 | 179.0 | 1264.0 | 1269.0 | 24,3 | 23,7 | 0,7119 |
| 9354 | 04/04/2005 | 12.00.00 | 2,4 | 1142.0 | -200.0 | 12,4 | 1063.0 | 293.0 | 603.0 | 175.0 | 1241.0 | 1092.0 | 26,9 | 18,3 | 0,6406 |
| 9355 | 04/04/2005 | 13.00.00 | 2,1 | 1003.0 | -200.0 | 9,5 | 961.0 | 235.0 | 702.0 | 156.0 | 1041.0 | 770.0 | 28,3 | 13,5 | 0,5139 |
| 9356 | 04/04/2005 | 14.00.00 | 2,2 | 1071.0 | -200.0 | 11,9 | 1047.0 | 265.0 | 654.0 | 168.0 | 1129.0 | 816.0 | 28,5 | 13,1 | 0,5028 |
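
The `NMHC(GT)` column in these last rows holds -200.0, the sentinel that the dataset description uses for missing values. The cells shown here do not convert it, so `isnull()` does not count these entries. A minimal sketch of one way to expose them, on a made-up frame (the names `toy`, `toy_nan`, and `missing_per_column` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NMHC(GT)": [150.0, -200.0, -200.0],
    "NOx(GT)": [166.0, 103.0, -200.0],
})

# Treat the -200 sentinel as a missing value
toy_nan = toy.replace(-200, np.nan)

# Count missing entries per column after the replacement
missing_per_column = toy_nan.isna().sum()
```

Whether to then drop or impute those entries is a separate modelling decision.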


In [96]:
cols = list(df.columns[2:])

*As you may have noticed, some values in our data use a comma in place of the decimal point. For example, 9.4 is written as 9,4. We handle that now by converting those columns to floats.*

In [97]:
for col in cols:
    # Columns read as strings use ',' as the decimal mark; convert them to float
    if df[col].dtype != 'float64':
        df[col] = df[col].str.replace(',', '.', regex=False).astype(float)

df.head()

|  | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/03/2004 | 18.00.00 | 2.6 | 1360.0 | 150.0 | 11.9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.6 | 48.9 | 0.7578 |
| 1 | 10/03/2004 | 19.00.00 | 2.0 | 1292.0 | 112.0 | 9.4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13.3 | 47.7 | 0.7255 |
| 2 | 10/03/2004 | 20.00.00 | 2.2 | 1402.0 | 88.0 | 9.0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11.9 | 54.0 | 0.7502 |
| 3 | 10/03/2004 | 21.00.00 | 2.2 | 1376.0 | 80.0 | 9.2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11.0 | 60.0 | 0.7867 |
| 4 | 10/03/2004 | 22.00.00 | 1.6 | 1272.0 | 51.0 | 6.5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.2 | 59.6 | 0.7888 |
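
As an alternative to converting after loading, `read_csv` accepts a `decimal` parameter that handles decimal commas at parse time. The sketch below runs on a two-line in-memory sample in the same layout as the file; the variable names `raw` and `sample` and the reduced set of columns are illustrative.

```python
import io
import pandas as pd

# Two lines in the same layout as the file: ';' separators, ',' decimals
raw = (
    "Date;Time;CO(GT);T\n"
    "10/03/2004;18.00.00;2,6;13,6\n"
    "10/03/2004;19.00.00;2;13,3\n"
)

# decimal="," tells read_csv to parse 2,6 as 2.6
sample = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
```

With this option the numeric columns arrive as floats, so no per-column string replacement is needed.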


In [98]:
features=list(df.columns)

*We now define our features and leave out those that are unlikely to help the prediction. For example, the raw date is not a very useful feature for predicting future values.*

In [99]:
features.remove('Date')
features.remove('Time')
features.remove('PT08.S4(NO2)')

In [100]:
X = df[features]
y = df['C6H6(GT)']

In [101]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import model_selection, metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score

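Here we try to predict the C6H6(GT) values, so we set that column as our target variable. Note that `features` above is built from all columns and only `Date`, `Time`, and `PT08.S4(NO2)` are removed, so `X` still contains `C6H6(GT)` itself; the near-perfect scores reported below should be read with that in mind.

We split the dataset into 60% training and 40% testing sets.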
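
One way to keep the target out of the inputs is to build the feature list by exclusion. This is a minimal sketch on a made-up frame, not the notebook's own code; the names `toy`, `excluded`, `feature_cols`, `X_toy`, and `y_toy` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004"],
    "Time": ["18.00.00", "19.00.00"],
    "CO(GT)": [2.6, 2.0],
    "C6H6(GT)": [11.9, 9.4],
    "T": [13.6, 13.3],
})

target = "C6H6(GT)"
excluded = {"Date", "Time", target}

# Keep every column that is neither excluded nor the target itself
feature_cols = [c for c in toy.columns if c not in excluded]

X_toy = toy[feature_cols]
y_toy = toy[target]
```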

In [102]:
# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y, test_size=0.4, random_state=0)

In [83]:
print(X_train.shape, y_train.shape)

(5614, 12) (5614,)


## **Random forest regression**

In [103]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Initialize the Random Forest Regressor
forest = RandomForestRegressor(n_estimators=1000,
                               criterion='squared_error',  # called 'mse' in older scikit-learn releases
                               random_state=1,
                               n_jobs=-1)

# Fit the model
forest.fit(X_train, y_train)

# Make predictions
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)

# Calculate MSE and RMSE for train and test sets
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

# Print results
print('MSE train: %.3f, test: %.3f' % (mse_train, mse_test))
print('RMSE train: %.3f, test: %.3f' % (rmse_train, rmse_test))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))


MSE train: 0.006, test: 0.008
RMSE train: 0.076, test: 0.091
R^2 train: 1.000, test: 1.000
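
A single train/test split gives one estimate of the error. As a complementary check, `cross_val_score` (imported earlier but not used in the cells shown) can report RMSE over several folds. The sketch below uses synthetic stand-in data so that it runs on its own; `X_demo`, `y_demo`, the fold count, and the smaller `n_estimators` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data so the snippet runs on its own
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=1)

# scikit-learn returns negated RMSE so that higher is better; flip the sign
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
cv_rmse = -neg_rmse
print("RMSE per fold:", np.round(cv_rmse, 3))
```

On the real data, `X_demo` and `y_demo` would be replaced by the feature matrix and target defined above.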


In [105]:
y_test_pred = pd.DataFrame(y_test_pred, columns=['predicted_values'])

# Merge X_test and y_test_pred
result = pd.concat([X_test.reset_index(drop=True), y_test_pred.reset_index(drop=True)], axis=1)

# View the merged DataFrame
print(result)

      CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  \
0        1.3        917.0     101.0       5.2          772.0     98.0   
1        NaN        918.0    -200.0       1.6          551.0    114.0   
2        0.4        766.0    -200.0       2.4          616.0     43.0   
3        0.5        775.0    -200.0       2.3          609.0     58.0   
4     -200.0       1010.0    -200.0       3.0          650.0   -200.0   
...      ...          ...       ...       ...            ...      ...   
3738     0.8        956.0    -200.0       5.6          794.0    144.0   
3739     3.5       1310.0    -200.0      21.7         1344.0    376.0   
3740     0.5        817.0    -200.0       3.0          651.0     41.0   
3741     NaN        804.0    -200.0       1.5          546.0     81.0   
3742     1.5        996.0    -200.0       6.7          843.0   -200.0   

      PT08.S3(NOx)  NO2(GT)  PT08.S5(O3)     T    RH      AH  predicted_values  
0           1098.0     78.0        605.0  