<a href="https://colab.research.google.com/github/bishtanuj/AQUAP/blob/main/AQUAP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Devfolio Hackathon**

Project Title: **Air Quality Prediction**

Team Name: **Code Mania**



**About Dataset**

**T:** Average Temperature (°C) <br>
**TM:** Maximum temperature (°C) <br>
**Tm:**  Minimum temperature (°C) <br>
**SLP:**  Atmospheric pressure at sea level (hPa) <br>
**H:** Average relative humidity (%) <br>
**VV:** Average visibility (Km) <br>
**V:** Average wind speed (Km/h) <br>
**VM:** Maximum sustained wind speed (Km/h) <br>
**PM2.5:** AQI Data <br> 

In [105]:
# Setting Notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print("Setup Complete")

Setup Complete


**Load and examine the data**

In this step we will load the data
* Load the data into DataFrame called `aquap_data`.
* The corresponding filepath is `aquap_filepath`.


In [106]:
# Read the file into a variable aquap_data from path of the file
aquap_data = pd.read_csv("/content/dataset.csv")

Checking that the dataset loaded properly by printing the first five rows

In [107]:
aquap_data

Unnamed: 0.1,Unnamed: 0,T,TM,Tm,SLP,H,VV,V,VM,PM 2.5
0,0,7.4,9.8,4.8,1017.6,93.0,0.5,4.3,9.4,219.720833
1,1,7.8,12.7,4.4,1018.5,87.0,0.6,4.4,11.1,182.187500
2,2,6.7,13.4,2.4,1019.4,82.0,0.6,4.8,11.1,154.037500
3,3,8.6,15.5,3.3,1018.7,72.0,0.8,8.1,20.6,223.208333
4,4,12.4,20.9,4.4,1017.3,61.0,1.3,8.7,22.2,200.645833
...,...,...,...,...,...,...,...,...,...,...
1088,1088,18.1,24.0,11.2,1015.4,56.0,1.8,15.9,25.9,288.416667
1089,1089,17.8,25.0,10.7,1015.8,54.0,2.3,9.4,22.2,256.833333
1090,1090,13.9,24.5,11.4,1015.0,95.0,0.6,8.7,14.8,169.000000
1091,1091,16.3,23.0,9.8,1016.9,78.0,1.1,7.4,16.5,186.041667


**Description:**</br>
As according to the data -
 - PM 2.5 is the target value and dependent on all the other values.
 - And the features are T,TH,Tm,SLP,H,VV,V,VM are independent values on the basis on that target value is defined.



In [108]:
aquap_data=aquap_data.drop(columns='Unnamed: 0')
aquap_data.head()

Unnamed: 0,T,TM,Tm,SLP,H,VV,V,VM,PM 2.5
0,7.4,9.8,4.8,1017.6,93.0,0.5,4.3,9.4,219.720833
1,7.8,12.7,4.4,1018.5,87.0,0.6,4.4,11.1,182.1875
2,6.7,13.4,2.4,1019.4,82.0,0.6,4.8,11.1,154.0375
3,8.6,15.5,3.3,1018.7,72.0,0.8,8.1,20.6,223.208333
4,12.4,20.9,4.4,1017.3,61.0,1.3,8.7,22.2,200.645833


In [109]:
# Show the basic information about Data.
aquap_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093 entries, 0 to 1092
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   T       1093 non-null   float64
 1   TM      1093 non-null   float64
 2   Tm      1093 non-null   float64
 3   SLP     1093 non-null   float64
 4   H       1093 non-null   float64
 5   VV      1093 non-null   float64
 6   V       1093 non-null   float64
 7   VM      1093 non-null   float64
 8   PM 2.5  1092 non-null   float64
dtypes: float64(9)
memory usage: 77.0 KB


In [110]:
# Provides Descriptive Statics about the data
aquap_data.describe() 

Unnamed: 0,T,TM,Tm,SLP,H,VV,V,VM,PM 2.5
count,1093.0,1093.0,1093.0,1093.0,1093.0,1093.0,1093.0,1093.0,1092.0
mean,26.009241,32.482251,19.460201,1008.081885,62.918573,2.003111,6.75151,15.805124,109.090984
std,7.237401,6.679078,7.438653,7.529237,15.709816,0.747541,3.841137,7.308435,84.46579
min,6.7,9.8,0.0,991.5,20.0,0.3,0.4,1.9,0.0
25%,19.3,27.8,12.1,1001.1,54.0,1.6,3.7,11.1,41.833333
50%,28.2,34.2,21.2,1008.1,64.0,1.9,6.5,14.8,83.458333
75%,31.7,37.0,26.0,1015.0,74.0,2.6,9.1,18.3,158.291667
max,38.5,45.5,32.7,1023.2,98.0,5.8,24.4,57.6,404.5


In [111]:
aquap_data.columns

Index(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'], dtype='object')

In [112]:
aquap_data.isnull().sum()

T         0
TM        0
Tm        0
SLP       0
H         0
VV        0
V         0
VM        0
PM 2.5    1
dtype: int64

Drop the null values in PM 2.5

In [113]:
aquap_data.dropna(inplace=True)

Importing sklearn for applying machine learning

In [127]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as sm
from sklearn.metrics import balanced_accuracy_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor

In [115]:
X = aquap_data.iloc[:, :-1]
Y = aquap_data.iloc[:, -1]

In [116]:
X.head()

Unnamed: 0,T,TM,Tm,SLP,H,VV,V,VM
0,7.4,9.8,4.8,1017.6,93.0,0.5,4.3,9.4
1,7.8,12.7,4.4,1018.5,87.0,0.6,4.4,11.1
2,6.7,13.4,2.4,1019.4,82.0,0.6,4.8,11.1
3,8.6,15.5,3.3,1018.7,72.0,0.8,8.1,20.6
4,12.4,20.9,4.4,1017.3,61.0,1.3,8.7,22.2


In [117]:
Y.head()

0    219.720833
1    182.187500
2    154.037500
3    223.208333
4    200.645833
Name: PM 2.5, dtype: float64

In [118]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=0)

In [119]:
X_train

Unnamed: 0,T,TM,Tm,SLP,H,VV,V,VM
1068,21.7,30.0,12.2,1014.5,52.0,1.1,6.3,16.5
181,13.6,20.4,7.5,1020.7,67.0,1.1,10.2,40.7
426,31.8,39.6,25.4,1005.6,35.0,1.9,11.1,27.8
1019,28.3,35.8,26.8,1003.9,89.0,1.6,4.3,11.1
9,11.9,18.9,6.3,1020.1,76.0,1.1,8.3,20.6
...,...,...,...,...,...,...,...,...
1034,30.8,35.2,26.8,1005.0,76.0,2.7,3.7,18.3
764,26.8,33.4,15.7,1014.5,52.0,2.4,4.6,9.4
836,31.9,35.0,28.3,1001.3,72.0,2.1,2.8,9.4
560,11.9,15.7,7.5,1018.9,86.0,0.8,6.1,11.1


In [120]:
Y_train

1068    235.333333
181     166.916667
426      80.208333
1019     57.583333
9       107.625000
           ...    
1034     30.666667
764       1.791667
836      38.250000
560     107.625000
685      48.916667
Name: PM 2.5, Length: 764, dtype: float64

In [121]:
X_test

Unnamed: 0,T,TM,Tm,SLP,H,VV,V,VM
785,33.1,39.0,24.0,1003.0,26.0,2.6,10.4,24.1
742,12.0,17.0,6.7,1019.8,75.0,1.3,7.6,13.0
748,15.2,22.6,6.6,1018.2,56.0,1.9,8.9,22.2
986,32.9,39.0,27.4,996.6,50.0,2.4,9.4,27.8
480,29.4,34.4,26.4,999.9,86.0,1.8,2.2,14.8
...,...,...,...,...,...,...,...,...
492,32.1,36.6,28.0,1000.8,56.0,1.9,15.4,25.9
62,29.9,38.8,21.7,1000.8,29.0,1.9,12.0,24.1
79,34.1,40.3,28.8,999.7,59.0,1.9,10.9,16.5
300,30.5,36.2,25.4,997.3,65.0,2.3,18.5,29.4


In [122]:
Y_test

785    104.625000
742    125.891667
748    279.600000
986    110.416667
480     31.333333
          ...    
492     44.125000
62     142.500000
79      65.166667
300     22.333333
194      0.000000
Name: PM 2.5, Length: 328, dtype: float64

In [128]:
linearreg = LinearRegression()

In [129]:
linearreg.fit(X_train, Y_train)

LinearRegression()

In [130]:
Y_Predict = linearreg.predict(X_test)

In [131]:
print("Mean Absolute Error (MAE) =", round(sm.mean_absolute_error(Y_test, Y_Predict), 2))
print("Mean Squared Error (MSE) =", round(sm.mean_squared_error(Y_test, Y_Predict), 2))

Mean Absolute Error (MAE) = 44.84
Mean Squared Error (MSE) = 3687.54


In [132]:
clf = RandomForestRegressor(n_estimators=10)

clf.fit(X_train, Y_train)
y_pred = clf.predict(X_test)

In [None]:
rfr.score(X_test, Y_test)

In [133]:
clf.score(X_test, Y_test)

0.7615711167322905