<h1>Assignment 1</h1>
<h2>Example of implementing AI to predict Air Pollution Index based on meteorological data and specific time of the year</h2>

This Jupyter Notebook is part of an assignment in the CloudEARTHi project student workshop. The assignment provides a simple demonstration of how AI can be used to predict the Air Pollution Index from meteorological data and the time of year. The notebook should not be considered a fully functional application, a correct class prediction algorithm, or an example of a well-organized AI, but rather a demo tailored to the education of non-specialists. It should be used together with the assignment description file.

<h3>Code and description</h3>

The first section of the notebook contains its imports, which define the modules used by the application. In the order they are imported:
<ul>
<li><strong>csv</strong> - handles CSV files; in the current application it is used to load the data from the pre-prepared CSV files;</li>
<li><strong>numpy</strong> – handles arrays and matrices;</li>
<li><strong>pandas</strong> – creates dataframes, which are tables that in the current application are used to visualize the data;</li>
<li><strong>sklearn</strong> – provides various AI and data processing algorithms; in the current application it is used to split the data and build a random forest classifier. As this module is rather large, only the required classes are loaded.</li>
</ul>

In [26]:
import csv
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

This code block loads the data from the CSV file into a list.
Data is available for the following locations:
<ul>
<li><strong>Varna, Bulgaria</strong> - file name: Varna.csv</li>
<li><strong>Sofia, Bulgaria</strong> - file name: Sofia.csv</li>
<li><strong>Tromso, Norway</strong> - file name: Tromso.csv</li>
<li><strong>Edinburgh, UK</strong> - file name: Edinburgh.csv</li>
<li><strong>Vienna, Austria</strong> - file name: Vienna.csv</li>
<li><strong>New Delhi, India</strong> - file name: NewDelhi.csv</li>
</ul>

To run the code with a different data set, replace the file name at the line indicated by the comment.

The CSV files include consolidated data related to weather forecasts and the Air Quality Index. The weather data is extracted with the meteostat module - https://github.com/meteostat/meteostat-python, while the Air Quality Index data is downloaded from https://aqicn.org/data-platform/register/ and evaluated using the table on https://aqicn.org/.
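
The class labels in the files (such as Good and Moderate) correspond to bands of the numeric index. A minimal sketch of such a mapping follows, assuming the commonly published six-band scale (0-50 Good, 51-100 Moderate, and so on). The function name `aqi_category` is hypothetical, and the exact procedure used to prepare the CSV files is not shown in this notebook.

```python
def aqi_category(aqi):
    """Map a numeric AQI value to a category label (assumed six-band scale)."""
    bands = [
        (50, 'Good'),
        (100, 'Moderate'),
        (150, 'Unhealthy for Sensitive Groups'),
        (200, 'Unhealthy'),
        (300, 'Very Unhealthy'),
    ]
    # Return the label of the first band whose upper limit covers the value.
    for upper_limit, label in bands:
        if aqi <= upper_limit:
            return label
    return 'Hazardous'

print(aqi_category(42))
print(aqi_category(87))
```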

In [27]:
data=[]
with open('Varna.csv', newline='') as csvfile: #Replace the name of the file in this line. 
                                               #Keep the apostrophes. 
                                               #For example: with open('Sofia.csv', newline='') 
                                               #will run the application with data for Sofia city.
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in spamreader:
        data.append(row)
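
As an alternative to the csv module, pandas can read such a file directly into a dataframe and convert the numeric columns automatically. The sketch below is only an illustration: it reads two sample rows from an in-memory string instead of a file, and it assumes that the files start with a header row in the layout shown.

```python
import io
import pandas as pd

# Two sample rows in the assumed layout of the assignment files.
sample_csv = io.StringIO(
    "date,month,dayofweek,tavg,tmin,tmax,prcp,wspd,pres,AQI\n"
    "2018-02-12 00:00:00,2,0,2.3,1.0,4.2,4.3,10.3,1012.0,Good\n"
    "2018-02-13 00:00:00,2,1,4.9,2.0,7.0,0.0,9.3,1014.9,Moderate\n"
)

# With a real file this would be, for example, pd.read_csv('Varna.csv').
df = pd.read_csv(sample_csv)
print(df.dtypes)
```

Unlike the list produced by csv.reader, where every value is a string, the numeric columns of the dataframe are already numbers.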

This code block converts the list that holds the data from the CSV file into a dataframe. The dataframe is used to better visualize the data. A snippet of the loaded data is presented in the Out line below this block.
In the snippet:
<ul>
<li><strong>date</strong> - the date the measurement was taken on. This field is removed when the data is used for training;</li>
<li><strong>month</strong> - the month of the year - 1-12;</li>
<li><strong>dayofweek</strong> - the day of the week - 0-6, where 0 is Monday;</li>
<li><strong>tavg</strong> - average temperature in <sup>o</sup>C;</li>
<li><strong>tmin</strong> - minimum temperature in <sup>o</sup>C;</li>
<li><strong>tmax</strong> - maximum temperature in <sup>o</sup>C;</li>
<li><strong>prcp</strong> - precipitation in mm;</li>
<li><strong>wspd</strong> - wind speed in km/h;</li>
<li><strong>pres</strong> - air pressure in hPa;</li>
<li><strong>AQI</strong> - air quality index.</li>
</ul>
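
The month and dayofweek fields can be derived from the date field. The following is a minimal sketch of one way to do this, assuming the timestamp format shown in the data snippet; the helper name `calendar_features` is hypothetical and is not part of the assignment files.

```python
from datetime import datetime

def calendar_features(date_text):
    """Return (month, dayofweek) for a timestamp such as '2018-02-12 00:00:00'."""
    moment = datetime.strptime(date_text, '%Y-%m-%d %H:%M:%S')
    # weekday() counts from Monday = 0 to Sunday = 6.
    return moment.month, moment.weekday()

print(calendar_features('2018-02-12 00:00:00'))
print(calendar_features('2018-02-16 00:00:00'))
```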

In [28]:
dataHeader=data[0]
dfData=pd.DataFrame(data[1:],columns=dataHeader)
data=data[1:]
dfData.head()

Unnamed: 0,date,month,dayofweek,tavg,tmin,tmax,prcp,wspd,pres,AQI
0,2018-02-12 00:00:00,2,0,2.3,1.0,4.2,4.3,10.3,1012.0,Good
1,2018-02-13 00:00:00,2,1,4.9,2.0,7.0,0.0,9.3,1014.9,Moderate
2,2018-02-14 00:00:00,2,2,5.8,3.0,8.0,0.3,21.5,1009.7,Moderate
3,2018-02-15 00:00:00,2,3,3.7,1.0,8.8,10.4,16.9,1018.3,Moderate
4,2018-02-16 00:00:00,2,4,3.9,1.0,11.0,0.0,14.3,1024.8,Moderate


This block of code processes the data by removing entries with missing information, marked with nan.

In [29]:
toRemove=[]
for member in data:
    for el in member:
        if el=='nan':
            toRemove.append(member)
            break
            
for member in toRemove:
    data.remove(member)
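
The same filtering can be written more compactly with a list comprehension. This is only a sketch on made-up rows; the 'nan' entry in the second row was inserted for the purpose of the illustration.

```python
rows = [
    ['2018-02-12 00:00:00', '2', '0', '2.3', '1.0', '4.2', '4.3', '10.3', '1012.0', 'Good'],
    ['2018-02-13 00:00:00', '2', '1', 'nan', '2.0', '7.0', '0.0', '9.3', '1014.9', 'Moderate'],
    ['2018-02-14 00:00:00', '2', '2', '5.8', '3.0', '8.0', '0.3', '21.5', '1009.7', 'Moderate'],
]

# Keep only the rows in which no field equals the string 'nan'.
clean_rows = [row for row in rows if 'nan' not in row]
print(len(rows), 'rows before,', len(clean_rows), 'rows after')
```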

This block of code removes the date field as it is not used in training the random forest algorithm.

In [30]:
for member in data:
    del member[0]

This block of code splits the data into features X (the month, the day of the week and the meteorological data) and classes Y (AQI).
X and Y are converted from lists to arrays, as the sklearn modules require arrays as input data.

In [31]:
X=[]
Y=[]
for member in data:
    X.append(member[:-1])
    Y.append(member[-1])
    
X=np.array(X)
Y=np.array(Y)
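
Once the rows are in a numpy array, the same separation can be done with slicing. The sketch below uses two sample rows and also converts the features to floating point numbers, which is an optional extra step that the notebook itself does not perform.

```python
import numpy as np

rows = [
    ['2', '0', '2.3', '1.0', '4.2', '4.3', '10.3', '1012.0', 'Good'],
    ['2', '1', '4.9', '2.0', '7.0', '0.0', '9.3', '1014.9', 'Moderate'],
]

table = np.array(rows)
features = table[:, :-1].astype(float)  # all columns except the last one
labels = table[:, -1]                   # the last column holds the class
print(features.shape, labels)
```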

This line of code splits the data. The features and the classes are split into training data and test data. The test data in this specific case is set aside in order to provide a demonstration for the user. The train-to-test split ratio is 9:1.

In [32]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1)
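
The effect of test_size can be checked on a small made-up data set. The sketch below also passes random_state, which makes the split repeatable, and stratify, which keeps the class proportions equal in both parts; neither argument is used in the notebook, and they are shown here only as options.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)        # 100 made-up samples, 2 features
y_demo = np.array(['Good', 'Moderate'] * 50)   # made-up, balanced class labels

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo)
print(len(x_tr), 'training samples,', len(x_te), 'test samples')
```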

This initializes the Random Forest Classifier. The settings of the classifier are not tuned in this case; they are set only to provide a demonstration. When building an application, each of the parameters below must be set and fine-tuned in order to get maximum accuracy.

In [33]:
clf = RandomForestClassifier(
    n_estimators=30,
    min_samples_split=5,
    max_depth=20, 
    min_samples_leaf=7,
    min_weight_fraction_leaf=0.1,
    max_features='auto',  # same as 'sqrt' for classifiers; scikit-learn 1.3 and later accept only 'sqrt'
    max_leaf_nodes=5,
    oob_score=True,
    random_state=0)
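
One common way to fine-tune such parameters is a grid search with cross-validation. The following is a minimal sketch on synthetic data; the grid values are made up for the illustration and are not recommended settings for the assignment data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the features and classes of the notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination of these values is trained and scored with 3-fold cross-validation.
param_grid = {'n_estimators': [10, 30], 'max_depth': [3, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy:', search.best_score_)
```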

The classifier is then trained.

In [34]:
clf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=20, max_features='auto',
                       max_leaf_nodes=5, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=7, min_samples_split=5,
                       min_weight_fraction_leaf=0.1, n_estimators=30,
                       n_jobs=None, oob_score=True, random_state=0, verbose=0,
                       warm_start=False)

When the training is complete, the score or accuracy of the classifier can be displayed. The value shown here is the out-of-bag (OOB) score: the rate of accurate predictions when each training sample is predicted only by the trees that did not see it during training.
A reliable classifier should have a score higher than 0.9. It has to be noted, however, that a high score is not always an indication of a functional and precise classifier.

In [35]:
print('OOB Score:'+str(clf.oob_score_))

OOB Score:0.7304653204565408
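
The OOB score is estimated from the training data only. It can be compared with the accuracy on a held-out test set, as in the sketch below, which uses synthetic data and a classifier with mostly default settings rather than the ones from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and classes of the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=30, oob_score=True, random_state=0)
demo_clf.fit(x_tr, y_tr)

oob_accuracy = demo_clf.oob_score_           # estimated on out-of-bag training samples
test_accuracy = demo_clf.score(x_te, y_te)   # measured on the held-out test data
print('OOB score:', oob_accuracy)
print('Test accuracy:', test_accuracy)
```

In the notebook, the equivalent call would be clf.score(x_test, y_test).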


This block of code displays the feature importance, which shows how much each feature has contributed to the classification. The feature importances sum to 1. The row labels are taken from dataHeader[1:-1], which skips the date column and the AQI class column.

In [36]:
dfFeatures=pd.DataFrame(clf.feature_importances_,index=dataHeader[1:-1])
print(dfFeatures)

                  0
month      0.190510
dayofweek  0.013877
tavg       0.143583
tmin       0.132285
tmax       0.118898
prcp       0.031744
wspd       0.299970
pres       0.069131
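
To read the importances more easily, they can be sorted, and their sum can be verified. The sketch below trains a classifier on synthetic data, so the numbers it prints have no relation to the assignment data; only the feature names are borrowed from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

names = ['month', 'dayofweek', 'tavg', 'tmin', 'tmax', 'prcp', 'wspd', 'pres']

# Synthetic data with one column per feature name.
X_demo, y_demo = make_classification(n_samples=200, n_features=len(names), random_state=0)
demo_clf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_demo, y_demo)

importances = pd.Series(demo_clf.feature_importances_, index=names)
ranked = importances.sort_values(ascending=False)  # most important feature first
print(ranked)
print('Sum of importances:', importances.sum())
```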


This code displays a snippet of the test data.

In [37]:
y_list=y_test.tolist()
x_list=x_test.tolist()
test_data=[]
for x,y in zip (x_list,y_list):
    x.append(y)
    test_data.append(x)
dfTestData=pd.DataFrame(test_data,columns=dataHeader[1:])
print(dfTestData)

    month dayofweek  tavg  tmin  tmax prcp  wspd    pres       AQI
0       8         0  22.2  15.0  27.2  0.0  12.3  1014.3      Good
1       8         0  25.4  21.0  32.0  0.0  15.5  1012.8      Good
2       7         3  20.3  16.0  26.4  0.8  14.1  1009.3  Moderate
3      12         5   8.6   5.8  10.2  5.8  12.7  1018.7  Moderate
4       3         5   4.0  -1.0   8.4  0.0   9.2  1011.0  Moderate
..    ...       ...   ...   ...   ...  ...   ...     ...       ...
122     6         1  14.6  12.0  19.6  0.0   8.4  1008.3      Good
123     7         1  24.8  18.0  32.0  0.0   7.3  1012.8      Good
124     9         2  21.8  17.0  28.0  0.0  13.1  1019.3      Good
125     6         3  25.6  20.0  32.0  0.0   7.0  1015.0      Good
126     5         1  18.2  14.0  23.0  0.0  24.4  1005.9      Good

[127 rows x 9 columns]


The data above can be used to test the classifier. To do that, the features (month, dayofweek, tavg, tmin, tmax, prcp, wspd, pres) have to be written in the line below, as in the example. When entered correctly and run, the classifier should return a result that can be compared to the AQI from the table.

In [38]:
prediction=clf.predict([[12,1,2.7,1,7.5,8.9,20.9,1014]])
print('Prediction of AQI:'+prediction[0])

Prediction of AQI:Good
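
Instead of typing one row at a time, all test rows can be predicted at once and placed next to their known classes. The following is a sketch on synthetic data with made-up Good and Moderate labels; the names `demo_clf` and `comparison` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up data standing in for X and Y from the notebook.
X_demo, y_numeric = make_classification(n_samples=300, n_features=8, random_state=0)
y_demo = np.where(y_numeric == 1, 'Good', 'Moderate')
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=30, random_state=0).fit(x_tr, y_tr)

# Predict every test row and compare the result with the known class.
comparison = pd.DataFrame({'actual': y_te, 'predicted': demo_clf.predict(x_te)})
comparison['match'] = comparison['actual'] == comparison['predicted']
print(comparison.head())
print('Matching rows:', comparison['match'].sum(), 'of', len(comparison))
```

In the notebook, clf.predict(x_test) and y_test would take the place of the made-up arrays.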
