# Air Quality in Istanbul

**T	:**Average Temperature (°C)

**TM:**Maximum temperature (°C)

**Tm:**Minimum temperature (°C)

**SLP:**	Atmospheric pressure at sea level (hPa)

**H	:** Average relative humidity (%)

**PP:**Total rainfall and / or snowmelt (mm)

**VV:**Average visibility (Km)

**V	:**Average wind speed (Km/h)

**VM:**Maximum sustained wind speed (Km/h)

**VG:**Maximum speed of wind (Km/h)

**RA:**Indicate if there was rain or drizzle (In the monthly average, total days it rained)

**SN:**Snow indicator (In the monthly average, total days that snowed)

**TS:**Indicates whether there storm (In the monthly average, Total days with thunderstorm)

**FG:**Indicates whether there was fog (In the monthly average, Total days with fog)

### Data Sources

1. AQI (PM2.5, PM10 etc) => https://aqicn.org/data-platform/register/
2. Others (T, TM, Tm etc.) => https://en.tutiempo.net/istanbul.html

### Useful Links to understand project items better

1. PM2.5 vs PM10 => https://smartairfilters.com/en/blog/pm10-pm2-5-difference-particle-air-pollution/
2. Conversion from PM10 to PM2.5 => https://www.epd.gov.hk/epd/english/environmentinhk/air/guide_ref/guide_aqa_model_g5.html
3. Equation of calculating Air Quality Index => https://en.wikipedia.org/wiki/Air_quality_index

## Data Importing and Understanding

In [85]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

In [62]:
combined_data = pd.read_csv("combined_data.csv")

combined_data.head(10)

Unnamed: 0.1,Unnamed: 0,Day,T,TM,Tm,SLP,H,PP,VV,V,VM,VG,RA,SN,TS,FG
0,0,1,7.8,10.0,4.0,,70.0,0.0,11.1,12.4,20.6,,,,,
1,1,2,7.8,11.0,3.7,,81.0,0.0,7.7,7.4,13.0,,,,,
2,2,3,8.1,10.2,4.0,,84.0,0.0,6.6,6.5,16.5,,,,,
3,3,4,,,,,,,,,,,,,,
4,4,5,,,,,,,,,,,,,,
5,5,6,,,,,,,,,,,,,,
6,6,7,1.4,5.0,0.0,,86.0,9.91,6.8,18.9,40.7,,,,,
7,7,8,-2.0,5.0,-3.0,,90.0,,4.5,29.3,42.4,,,,,
8,8,9,-1.7,1.3,-5.0,,71.0,,8.4,12.4,33.5,,,,,
9,9,10,,,,,,,,,,,,,,


In [63]:
combined_data.describe()

Unnamed: 0.1,Unnamed: 0,Day,T,TM,Tm,SLP,H,PP,VV,V,VM,VG
count,2556.0,2556.0,1301.0,1301.0,1301.0,1.0,1299.0,1268.0,1166.0,1301.0,1301.0,794.0
mean,1277.5,15.725743,16.405304,19.978324,12.823444,1019.1,67.465743,1.45,9.38482,15.731207,26.753574,42.592065
std,737.997967,8.800168,7.420103,8.114314,7.186704,,10.493348,4.008207,1.357422,5.826768,8.369578,11.138365
min,0.0,1.0,-4.4,-2.9,-6.7,1019.1,34.0,0.0,1.4,0.7,5.4,18.3
25%,638.75,8.0,10.3,13.4,7.0,1019.1,60.0,0.0,9.2,11.3,20.6,35.2
50%,1277.5,16.0,16.0,20.0,12.5,1019.1,68.0,0.0,10.0,15.0,25.9,42.4
75%,1916.25,23.0,23.4,27.3,19.0,1019.1,75.0,0.51,10.0,19.6,31.7,50.0
max,2555.0,31.0,31.5,37.0,26.0,1019.1,97.0,34.04,12.6,41.5,72.0,98.2


In [64]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2556 entries, 0 to 2555
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  2556 non-null   int64  
 1   Day         2556 non-null   int64  
 2   T           1301 non-null   float64
 3   TM          1301 non-null   float64
 4   Tm          1301 non-null   float64
 5   SLP         1 non-null      float64
 6   H           1299 non-null   float64
 7   PP          1268 non-null   float64
 8   VV          1166 non-null   float64
 9   V           1301 non-null   float64
 10  VM          1301 non-null   float64
 11  VG          794 non-null    float64
 12  RA          1664 non-null   object 
 13  SN          2475 non-null   object 
 14  TS          2376 non-null   object 
 15  FG          2506 non-null   object 
dtypes: float64(10), int64(2), object(4)
memory usage: 319.6+ KB


## Data Cleansing and Manipulation

In [65]:
df = combined_data.iloc[:, 2:-5].copy()
df.head(10)

Unnamed: 0,T,TM,Tm,SLP,H,PP,VV,V,VM
0,7.8,10.0,4.0,,70.0,0.0,11.1,12.4,20.6
1,7.8,11.0,3.7,,81.0,0.0,7.7,7.4,13.0
2,8.1,10.2,4.0,,84.0,0.0,6.6,6.5,16.5
3,,,,,,,,,
4,,,,,,,,,
5,,,,,,,,,
6,1.4,5.0,0.0,,86.0,9.91,6.8,18.9,40.7
7,-2.0,5.0,-3.0,,90.0,,4.5,29.3,42.4
8,-1.7,1.3,-5.0,,71.0,,8.4,12.4,33.5
9,,,,,,,,,


In [66]:
df.shape

(2556, 9)

In [67]:
df.isna().sum()

T      1255
TM     1255
Tm     1255
SLP    2555
H      1257
PP     1288
VV     1390
V      1255
VM     1255
dtype: int64

In [68]:
df.notna().sum()

T      1301
TM     1301
Tm     1301
SLP       1
H      1299
PP     1268
VV     1166
V      1301
VM     1301
dtype: int64

In [69]:
df.dropna(axis=0, how="all", inplace=True)
df.dropna(axis=1, how="all", inplace=True)

# SLP has only 1 in 2556. So removed.
df.drop("SLP", axis=1, inplace=True)
df.shape

(1301, 8)

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1301 entries, 0 to 2555
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   T       1301 non-null   float64
 1   TM      1301 non-null   float64
 2   Tm      1301 non-null   float64
 3   H       1299 non-null   float64
 4   PP      1268 non-null   float64
 5   VV      1166 non-null   float64
 6   V       1301 non-null   float64
 7   VM      1301 non-null   float64
dtypes: float64(8)
memory usage: 91.5 KB


In [71]:
df.describe()

Unnamed: 0,T,TM,Tm,H,PP,VV,V,VM
count,1301.0,1301.0,1301.0,1299.0,1268.0,1166.0,1301.0,1301.0
mean,16.405304,19.978324,12.823444,67.465743,1.45,9.38482,15.731207,26.753574
std,7.420103,8.114314,7.186704,10.493348,4.008207,1.357422,5.826768,8.369578
min,-4.4,-2.9,-6.7,34.0,0.0,1.4,0.7,5.4
25%,10.3,13.4,7.0,60.0,0.0,9.2,11.3,20.6
50%,16.0,20.0,12.5,68.0,0.0,10.0,15.0,25.9
75%,23.4,27.3,19.0,75.0,0.51,10.0,19.6,31.7
max,31.5,37.0,26.0,97.0,34.04,12.6,41.5,72.0


In [72]:
df.isna().sum()

T       0
TM      0
Tm      0
H       2
PP     33
VV    135
V       0
VM      0
dtype: int64

In [73]:
df["VV"].describe()

count    1166.000000
mean        9.384820
std         1.357422
min         1.400000
25%         9.200000
50%        10.000000
75%        10.000000
max        12.600000
Name: VV, dtype: float64

In [74]:
# Filling NA values of Total rainfall and / or snowmelt (mm) with Mean of the values
df["PP"].fillna(df["PP"].mean(), inplace=True)

# Filling NA values of Average visibility (Km) with Mean of the values
df["VV"].fillna(df["VV"].mean(), inplace=True)

# Filling NA values of Average relative humidity (%) with Mean of the values 
df["H"].fillna(df["H"].mean(), inplace=True)

In [75]:
df.isna().sum()

T     0
TM    0
Tm    0
H     0
PP    0
VV    0
V     0
VM    0
dtype: int64

In [None]:
# PM10 looks best to use of calculation Air Quality Index due less missed values.
# The equation of calculating AQI based on chosen concentration. 
# (AQI_high - AQI_low)/(PM10_high - PM10_low) * (PM10 - PM10_low) + AQI_low 

AQI = (AQI_high - AQI_low)/(PM10_high - PM10_low) * (PM10 - PM10_low) + AQI_low 



AttributeError: 'DataFrame' object has no attribute 'pm10'

### The Necessary Values of Calculation Air Quality Index,

![AQI Values](AQI_values.png)

## EDA (Explanatory Data Analysis)

In [76]:
df.head()

Unnamed: 0,T,TM,Tm,H,PP,VV,V,VM
0,7.8,10.0,4.0,70.0,0.0,11.1,12.4,20.6
1,7.8,11.0,3.7,81.0,0.0,7.7,7.4,13.0
2,8.1,10.2,4.0,84.0,0.0,6.6,6.5,16.5
6,1.4,5.0,0.0,86.0,9.91,6.8,18.9,40.7
7,-2.0,5.0,-3.0,90.0,1.45,4.5,29.3,42.4


In [96]:
aqi = pd.read_csv("istanbul_aqi.csv")
aqi.replace(["", " "], np.nan, inplace=True)
aqi.isna().sum()

date        0
 pm25    1173
 pm10      70
 o3       881
 no2       99
 so2      306
 co       200
dtype: int64

9.94

In [87]:
aqi.shape

(1719, 7)

In [114]:
aqi.head(33)

Unnamed: 0,date,pm25,pm10,o3,no2,so2,co
0,2020/8/1,48,14.0,18.0,16.0,2.0,2.0
1,2020/8/2,42,11.0,17.0,11.0,1.0,1.0
2,2020/8/3,36,10.0,12.0,14.0,1.0,1.0
3,2020/8/4,28,9.0,,,,
4,2020/8/5,35,,,,,
5,2020/7/1,71,32.0,6.0,21.0,2.0,4.0
6,2020/7/2,57,20.0,5.0,22.0,2.0,2.0
7,2020/7/3,46,19.0,5.0,19.0,1.0,3.0
8,2020/7/4,52,24.0,7.0,20.0,1.0,3.0
9,2020/7/5,56,21.0,9.0,23.0,1.0,
