# Data Analysis Project: Air Quality
- Name: Aditya Nugraha
- Email: indonesia.adit@gmail.com
- Dicoding ID: nugraha8

**DOI**: https://doi.org/10.24432/C5RK5G

**No**: row number \
**year**: year of data in this row \
**month**: month of data in this row \
**day**: day of data in this row \
**hour**: hour of data in this row \
**PM2.5**: PM2.5 concentration (ug/m^3) \
**PM10**: PM10 concentration (ug/m^3) \
**SO2**: SO2 concentration (ug/m^3) \
**NO2**: NO2 concentration (ug/m^3) \
**CO**: CO concentration (ug/m^3) \
**O3**: O3 concentration (ug/m^3) \
**TEMP**: temperature (degree Celsius) \
**PRES**: pressure (hPa) \
**DEWP**: dew point temperature (degree Celsius) \
**RAIN**: precipitation (mm) \
**wd**: wind direction \
**WSPM**: wind speed (m/s) \
**station**: name of the air-quality monitoring site 

## Defining the Business Questions

- Which months have the best and worst air quality?
- Which month has the most ideal weather conditions?

## Preparing All Required Libraries

In [1]:
import os

import numpy as np
import pandas as pd

# import matplotlib.pyplot as plt
# import seaborn as sns
import plotly.graph_objects as go

## Data Wrangling

### Gathering Data

In [2]:
path = 'datasets/Air-quality-dataset/PRSA_Data_20130301-20170228/'
filenames = os.listdir(path)
filenames

['PRSA_Data_Aotizhongxin_20130301-20170228.csv',
 'PRSA_Data_Changping_20130301-20170228.csv',
 'PRSA_Data_Dingling_20130301-20170228.csv',
 'PRSA_Data_Dongsi_20130301-20170228.csv',
 'PRSA_Data_Guanyuan_20130301-20170228.csv',
 'PRSA_Data_Gucheng_20130301-20170228.csv',
 'PRSA_Data_Huairou_20130301-20170228.csv',
 'PRSA_Data_Nongzhanguan_20130301-20170228.csv',
 'PRSA_Data_Shunyi_20130301-20170228.csv',
 'PRSA_Data_Tiantan_20130301-20170228.csv',
 'PRSA_Data_Wanliu_20130301-20170228.csv',
 'PRSA_Data_Wanshouxigong_20130301-20170228.csv']

In [3]:
# dfs = [Aotizhongxin, Changping, Dingling, Dongsi, Guanyuan, Gucheng, 
#       Huairou, Nongzhanguan, Shunyi, Tiantan, Wanliu, Wanshouxigong] \
#           = [pd.read_csv(path+filename) for filename in filenames]

df = pd.concat([pd.read_csv(path+filename) for filename in filenames])
df.set_index(['station', 'No']).groupby(level=0).head(5)

station,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM
Aotizhongxin,1,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,NNW,4.4
Aotizhongxin,2,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,N,4.7
Aotizhongxin,3,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,NNW,5.6
Aotizhongxin,4,2013,3,1,3,6.0,6.0,11.0,11.0,300.0,72.0,-1.4,1024.5,-19.4,0.0,NW,3.1
Aotizhongxin,5,2013,3,1,4,3.0,3.0,12.0,12.0,300.0,72.0,-2.0,1025.2,-19.5,0.0,N,2.0
Changping,1,2013,3,1,0,3.0,6.0,13.0,7.0,300.0,85.0,-2.3,1020.8,-19.7,0.0,E,0.5
Changping,2,2013,3,1,1,3.0,3.0,6.0,6.0,300.0,85.0,-2.5,1021.3,-19.0,0.0,ENE,0.7
Changping,3,2013,3,1,2,3.0,3.0,22.0,13.0,400.0,74.0,-3.0,1021.3,-19.9,0.0,ENE,0.2
Changping,4,2013,3,1,3,3.0,6.0,12.0,8.0,300.0,81.0,-3.6,1021.8,-19.1,0.0,NNE,1.0
Changping,5,2013,3,1,4,3.0,3.0,14.0,8.0,300.0,81.0,-3.5,1022.3,-19.4,0.0,N,2.1
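
The loading step above relies on `os.listdir`, which returns every entry in the folder in arbitrary order, and on `pd.concat` without `ignore_index`, which leaves each file's own row labels in place. A minimal sketch of a stricter loader follows; the function name `load_station_files`, the glob pattern, and the two synthetic demo files are assumptions made for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_station_files(directory, pattern="PRSA_Data_*.csv"):
    """Read every CSV matching `pattern` in `directory` into one DataFrame."""
    # sorted() makes the file order deterministic across platforms
    files = sorted(Path(directory).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")
    # ignore_index=True gives the combined frame a unique 0..n-1 index
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


# Tiny demonstration with two synthetic files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("A", "B"):
        pd.DataFrame({"No": [1, 2], "station": [name, name]}).to_csv(
            Path(tmp) / f"PRSA_Data_{name}_demo.csv", index=False
        )
    demo = load_station_files(tmp)

print(demo)
```

With the real data, the call would presumably be `load_station_files(path)` using the `path` variable defined above.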


### Assessing Data

In [4]:
list(df['station'].unique())

['Aotizhongxin',
 'Changping',
 'Dingling',
 'Dongsi',
 'Guanyuan',
 'Gucheng',
 'Huairou',
 'Nongzhanguan',
 'Shunyi',
 'Tiantan',
 'Wanliu',
 'Wanshouxigong']

In [5]:
# stations = [filename[10:-22] for filename in filenames]
stations = list(df['station'].unique())
print(f'Station Names ({len(stations)}):', stations)

Station Names (12): ['Aotizhongxin', 'Changping', 'Dingling', 'Dongsi', 'Guanyuan', 'Gucheng', 'Huairou', 'Nongzhanguan', 'Shunyi', 'Tiantan', 'Wanliu', 'Wanshouxigong']


In [6]:
# Number of Rows/Records
print('# Total Records:', len(df))
print('# Record for each Stations:\n'+df.groupby('station').size().to_string(header=False))

# Total Records: 420768
# Record for each Stations:
Aotizhongxin     35064
Changping        35064
Dingling         35064
Dongsi           35064
Guanyuan         35064
Gucheng          35064
Huairou          35064
Nongzhanguan     35064
Shunyi           35064
Tiantan          35064
Wanliu           35064
Wanshouxigong    35064


In [7]:
# Data Types
print('# Data Types:\n'+df.dtypes.to_string())

# Data Types:
No           int64
year         int64
month        int64
day          int64
hour         int64
PM2.5      float64
PM10       float64
SO2        float64
NO2        float64
CO         float64
O3         float64
TEMP       float64
PRES       float64
DEWP       float64
RAIN       float64
wd          object
WSPM       float64
station     object


In [8]:
# Null Values
print(df.isna().sum().map(lambda x: f'{x} ({x / len(df) * 100:.2f}%)').to_string())

No             0 (0.00%)
year           0 (0.00%)
month          0 (0.00%)
day            0 (0.00%)
hour           0 (0.00%)
PM2.5       8739 (2.08%)
PM10        6449 (1.53%)
SO2         9021 (2.14%)
NO2        12116 (2.88%)
CO         20701 (4.92%)
O3         13277 (3.16%)
TEMP         398 (0.09%)
PRES         393 (0.09%)
DEWP         403 (0.10%)
RAIN         390 (0.09%)
wd          1822 (0.43%)
WSPM         318 (0.08%)
station        0 (0.00%)
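
The totals above do not show whether the missing values are spread evenly or concentrated at a few stations. A minimal sketch of a per-station breakdown, run here on a small synthetic frame (the `demo` data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data.
demo = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "PM2.5": [1.0, np.nan, 3.0, 4.0],
    "CO": [np.nan, np.nan, 300.0, 400.0],
})

# Share of missing values per column, broken down by station (in percent).
missing_pct = (
    demo.drop(columns="station")
    .isna()
    .groupby(demo["station"])
    .mean()
    .mul(100)
)
print(missing_pct)
```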


In [9]:
# Categorical Data Checker

# undefined = ['No', 'station']
categorical = ['year', 'month', 'day', 'hour', 'wd']
numerical = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM']

# option to set the width as wide as possible
pd.set_option('display.max_colwidth', None)

# Values of Categorical Columns
cat_values = pd.DataFrame(zip(categorical, [list(df[i].value_counts().sort_index().index) for i in categorical]))
cat_values.columns = ['attributes', 'values']
cat_values['n_values'] = [len(j) for j in cat_values['values']]
display(cat_values[['attributes', 'n_values', 'values']])

pd.reset_option('display.max_colwidth') # reset width option

Unnamed: 0,attributes,n_values,values
0,year,5,"[2013, 2014, 2015, 2016, 2017]"
1,month,12,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]"
2,day,31,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]"
3,hour,24,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]"
4,wd,16,"[E, ENE, ESE, N, NE, NNE, NNW, NW, S, SE, SSE, SSW, SW, W, WNW, WSW]"


In [10]:
# Numerical Data Checker

df[numerical].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PM2.5,412029.0,79.793428,80.822391,2.0,20.0,55.0,111.0,999.0
PM10,414319.0,104.602618,91.772426,2.0,36.0,82.0,145.0,999.0
SO2,411747.0,15.830835,21.650603,0.2856,3.0,7.0,20.0,500.0
NO2,408652.0,50.638586,35.127912,1.0265,23.0,43.0,71.0,290.0
CO,400067.0,1230.766454,1160.182716,100.0,500.0,900.0,1500.0,10000.0
O3,407491.0,57.372271,56.661607,0.2142,11.0,45.0,82.0,1071.0
TEMP,420370.0,13.538976,11.436139,-19.9,3.1,14.5,23.3,41.6
PRES,420375.0,1010.746982,10.474055,982.4,1002.3,1010.4,1019.0,1042.8
DEWP,420365.0,2.490822,13.793847,-43.4,-8.9,3.1,15.1,29.1
RAIN,420378.0,0.064476,0.821004,0.0,0.0,0.0,0.0,72.5


According to a document on air quality published by the relevant government agency [1], air quality can be described by the following variables: PM2.5, PM10, SO2, NO2, CO, and O3. 
These variables have the following value ranges:
- **PM2.5**: 0 - 500 ug/m^3
- **PM10**: 0 - 604 ug/m^3
- **SO2**: 0 - 1004 ppb, or 0 - 2725.10 ug/m^3
- **NO2**: 0 - 2049 ppb, or 0 - 3993.68 ug/m^3
- **CO**: 0 - 50.4 ppm, or 0 - 59808.80 ug/m^3
- **O3**: 0 - 0.604 ppm, or 0 - 1228.29 ug/m^3

However, the document also explains that these ranges can be exceeded under extreme conditions such as volcanic eruptions, fires, and similar events,\
so the values in this dataset are concluded to be within an acceptable range.
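
The ppb/ppm to ug/m^3 conversions in the list can be reproduced with the ideal gas law. The sketch below is an illustration only: the function name `ppb_to_ugm3` and the molar masses are supplied here, and the temperature of 14.5 degrees Celsius at 1 atm is an inference from the listed figures (it reproduces them closely), not something the source states.

```python
R = 8.314462618  # universal gas constant, J/(mol*K)


def ppb_to_ugm3(ppb, molar_mass, temp_c=25.0, pressure_pa=101_325.0):
    """Convert a gas mixing ratio in ppb to a mass concentration in ug/m^3."""
    # molar volume of an ideal gas at the given conditions, in litres per mole
    molar_volume_l = R * (temp_c + 273.15) / pressure_pa * 1000.0
    return ppb * molar_mass / molar_volume_l


# Molar masses in g/mol and the upper bounds from the list (ppm * 1000 = ppb).
MOLAR_MASS = {"SO2": 64.066, "NO2": 46.0055, "CO": 28.010, "O3": 47.998}
UPPER_PPB = {"SO2": 1004, "NO2": 2049, "CO": 50_400, "O3": 604}

for gas, ppb in UPPER_PPB.items():
    # temp_c=14.5 is an assumed conversion temperature, see the note above
    print(gas, round(ppb_to_ugm3(ppb, MOLAR_MASS[gas], temp_c=14.5), 2))
```

At the more common reference temperature of 25 degrees Celsius the same bounds come out a few percent lower, so the conversion temperature should be stated whenever such limits are compared against data.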

The weather variables TEMP, PRES, DEWP, RAIN, and WSPM likewise have very reasonable and acceptable value ranges.

In [11]:
# check for duplicated rows

df.drop('No', axis=1).duplicated().sum()

0

In [12]:
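# Check for outliers with the 1.5 * IQR rule
# Caveat: df still contains NaN at this point, so np.percentile returns NaN for
# every column and the comparisons below are always False. The empty lists in
# the output reflect that, not an absence of outliers.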
outliers = []

for col in numerical:
    data = df[col]
    q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
    iqr = q75 - q25
    cut_off = iqr * 1.5
    minimum, maximum = q25 - cut_off, q75 + cut_off
    outliers.append([x for x in data if x < minimum or x > maximum])
    
pd.Series(data=outliers, index=numerical)

PM2.5    []
PM10     []
SO2      []
NO2      []
CO       []
O3       []
TEMP     []
PRES     []
DEWP     []
RAIN     []
WSPM     []
dtype: object
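
Because `np.percentile` propagates NaN, an IQR check on columns with missing values needs to skip them first. A minimal sketch of a NaN-aware variant follows; the helper name `iqr_outlier_counts` and the synthetic column `x` are assumptions for illustration, and the sketch returns counts rather than the full lists of values.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] per column, ignoring NaN."""
    counts = {}
    for col in columns:
        values = frame[col].dropna()
        q25, q75 = values.quantile([0.25, 0.75])
        cut_off = k * (q75 - q25)
        lower, upper = q25 - cut_off, q75 + cut_off
        counts[col] = int(((values < lower) | (values > upper)).sum())
    return pd.Series(counts)


# Synthetic column with one obvious outlier and one missing value.
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0, np.nan]})

print(np.percentile(demo["x"], 25))     # NaN, because the input contains NaN
print(iqr_outlier_counts(demo, ["x"]))  # the value 100.0 is flagged
```

On the real data this would be called as `iqr_outlier_counts(df, numerical)`. Heavily skewed columns such as RAIN, whose quartiles are all 0, would flag every non-zero value under this rule.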

### Cleaning Data

Based on the preceding data assessment, the following needs to be done:
1. Handle the NaN values

In [13]:
# Check NaN values

# df.dropna(inplace=True)
df_nan_only = df[df.isna().any(axis=1)]
df_clean = df[~df.isna().any(axis=1)]

print('Total Records:', len(df))
print('Total Records without NaN:', len(df_clean))

print('Total Records with NaN:', len(df_nan_only))
display(df_nan_only.isna().sum())

Total Records: 420768
Total Records without NaN: 382168
Total Records with NaN: 38600


No             0
year           0
month          0
day            0
hour           0
PM2.5       8739
PM10        6449
SO2         9021
NO2        12116
CO         20701
O3         13277
TEMP         398
PRES         393
DEWP         403
RAIN         390
wd          1822
WSPM         318
station        0
dtype: int64

In [14]:
# Fill NaN using interpolate, ffill, and bfill,
# assuming values a few hours earlier or later do not differ much

df_sorted_fillna = []
for station in stations:
    
    # make sure rows are sorted by date and hour within each station
    sorted_per_group = df[df['station'] == station].drop(['No'], axis=1)
    sorted_per_group = sorted_per_group.sort_values(['year', 'month', 'day', 'hour'])
    sorted_per_group = sorted_per_group.reset_index(drop=True).reset_index() 
    sorted_per_group = sorted_per_group.rename(columns={'index': 'No'})
    sorted_per_group['No'] = sorted_per_group['No'] + 1
    
    # linear interpolation fills each gap from the valid values on either side of it.
    # NaN at the very start of a series have no earlier valid value, so interpolate
    # leaves them untouched; they are handled separately in the cells that follow.
    sorted_per_group[numerical] = sorted_per_group[numerical].interpolate(method='linear', axis=0)
    
    # 'wd' is categorical, so it cannot be interpolated; ffill and bfill are used
    # instead, each with limit=1. Without a limit, a single observed value could be
    # propagated across an arbitrarily long gap, and the result would not be valid.
    sorted_per_group['wd'] = sorted_per_group['wd'].ffill(limit=1)
    sorted_per_group['wd'] = sorted_per_group['wd'].bfill(limit=1)
    
    # collect the per-station frames so they can be combined again later
    df_sorted_fillna.append(sorted_per_group)

# combine the per-station frames back into one dataset
df_new = pd.concat(df_sorted_fillna).reset_index(drop=True)
# df_new.set_index(['station', 'No']).groupby(level=0).head(5)
df_new.isna().sum()

No           0
year         0
month        0
day          0
hour         0
PM2.5        0
PM10         0
SO2          0
NO2         22
CO           0
O3           0
TEMP         0
PRES         0
DEWP         0
RAIN         0
wd         183
WSPM         0
station      0
dtype: int64
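
The leftover NaN in NO2 and wd follow from how the fill methods behave at the edges of a gap. A small sketch on synthetic series (the values are invented for illustration) shows both effects:

```python
import numpy as np
import pandas as pd

# Numeric case: two leading NaN, an interior gap, and a trailing NaN.
s = pd.Series([np.nan, np.nan, 2.0, np.nan, np.nan, 8.0, np.nan])
filled = s.interpolate(method="linear")
# The interior gap and the trailing NaN get values; the leading NaN remain.
print(filled.tolist())

# Categorical case: a gap of three rows with limit=1 in each direction.
wd = pd.Series(["N", None, None, None, "E"])
wd_filled = wd.ffill(limit=1).bfill(limit=1)
# Only the rows adjacent to an observed value are filled; the middle stays empty.
print(wd_filled.tolist())
```

This matches the result above: NO2 keeps NaN only at the start of some stations' series, and wd keeps NaN only inside gaps longer than two rows.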

In [15]:
# check the remaining NaN values in the NO2 column
df_new[df_new.drop('wd', axis=1).isna().any(axis=1)]

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
70128,1,2013,3,1,0,4.0,4.0,3.0,,200.0,82.0,-2.3,1020.8,-19.7,0.0,E,0.5,Dingling
70129,2,2013,3,1,1,7.0,7.0,3.0,,200.0,80.0,-2.5,1021.3,-19.0,0.0,ENE,0.7,Dingling
175320,1,2013,3,1,0,6.0,18.0,5.0,,800.0,88.0,0.1,1021.1,-18.6,0.0,NW,4.4,Gucheng
175321,2,2013,3,1,1,6.0,15.0,5.0,,800.0,88.0,-0.3,1021.5,-19.0,0.0,NW,4.0,Gucheng
175322,3,2013,3,1,2,5.0,18.0,5.5,,700.0,52.0,-0.7,1021.5,-19.8,0.0,WNW,4.6,Gucheng
175323,4,2013,3,1,3,6.0,20.0,6.0,,650.0,62.5,-1.0,1022.7,-21.2,0.0,W,2.8,Gucheng
175324,5,2013,3,1,4,5.0,17.0,5.0,,600.0,73.0,-1.3,1023.0,-21.4,0.0,WNW,3.6,Gucheng
175325,6,2013,3,1,5,4.0,11.0,3.0,,700.0,87.0,-1.8,1023.6,-21.9,0.0,E,1.2,Gucheng
175326,7,2013,3,1,6,3.0,6.0,3.0,,700.0,92.0,-2.6,1024.3,-20.4,0.0,ENE,1.2,Gucheng
175327,8,2013,3,1,7,5.0,5.0,3.0,,800.0,86.0,-0.9,1025.6,-20.5,0.0,ENE,1.1,Gucheng


In [16]:
# again assume that values a few hours earlier or later do not differ much unless something unusual happens,
# so the plan was to average the 5 rows before and after each gap.
# however, the rows at index 70128 and 70129 are the first and second rows of their station,
# so the mean of the 10 rows that follow is used instead
display(df_new.iloc[70130:70130+10])
print('Nilai rata-rata NO2 pada 10 baris berikutnya adalah:', df_new.iloc[70130:70130+10]['NO2'].mean())

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
70130,3,2013,3,1,2,5.0,5.0,3.0,2.0,200.0,79.0,-3.0,1021.3,-19.9,0.0,ENE,0.2,Dingling
70131,4,2013,3,1,3,6.0,6.0,3.0,2.666667,200.0,79.0,-3.6,1021.8,-19.1,0.0,NNE,1.0,Dingling
70132,5,2013,3,1,4,5.0,5.0,3.0,3.333333,200.0,81.0,-3.5,1022.3,-19.4,0.0,N,2.1,Dingling
70133,6,2013,3,1,5,6.0,6.0,3.0,4.0,200.0,79.0,-4.5,1022.6,-19.5,0.0,NNW,1.7,Dingling
70134,7,2013,3,1,6,5.0,10.0,3.0,4.0,200.0,77.0,-4.5,1023.4,-19.5,0.0,NNW,1.8,Dingling
70135,8,2013,3,1,7,5.0,6.0,3.0,2.0,200.0,80.0,-2.1,1024.6,-20.0,0.0,NW,2.5,Dingling
70136,9,2013,3,1,8,8.0,7.0,3.0,3.0,200.0,79.0,-0.2,1025.2,-20.5,0.0,NNW,2.8,Dingling
70137,10,2013,3,1,9,8.0,8.0,3.0,2.0,200.0,81.0,0.6,1025.3,-20.4,0.0,NNW,3.8,Dingling
70138,11,2013,3,1,10,8.0,5.0,3.0,2.0,200.0,83.0,2.0,1025.1,-21.3,0.0,N,2.2,Dingling
70139,12,2013,3,1,11,3.0,2.0,3.0,2.0,200.0,83.0,3.6,1024.8,-20.7,0.0,NNE,2.7,Dingling


Nilai rata-rata NO2 pada 10 baris berikutnya adalah: 2.7


In [17]:
# fill NO2 for the rows at index 70128 and 70129
df_new.loc[[70128,70129], 'NO2'] = 2.7
df_new.loc[70128:70129]

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
70128,1,2013,3,1,0,4.0,4.0,3.0,2.7,200.0,82.0,-2.3,1020.8,-19.7,0.0,E,0.5,Dingling
70129,2,2013,3,1,1,7.0,7.0,3.0,2.7,200.0,80.0,-2.5,1021.3,-19.0,0.0,ENE,0.7,Dingling


In [18]:
# apply the same approach to the rows at index 175320 through 175339,
# using the mean of the 10 rows that follow
display(df_new.iloc[175340:175340+10])
print('Nilai rata-rata NO2 pada 10 baris berikutnya adalah:', df_new.iloc[175340:175340+10]['NO2'].mean())

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
175340,21,2013,3,1,20,13.0,25.0,12.0,5.0,1100.0,61.0,1.6,1027.1,-18.4,0.0,ESE,1.9,Gucheng
175341,22,2013,3,1,21,15.0,23.0,14.0,13.0,1200.0,52.0,1.0,1028.1,-17.4,0.0,SSE,0.7,Gucheng
175342,23,2013,3,1,22,16.0,28.0,16.0,19.0,1200.0,45.0,1.3,1028.4,-17.6,0.0,E,1.0,Gucheng
175343,24,2013,3,1,23,16.0,28.0,14.0,20.0,1100.0,44.0,0.2,1028.6,-17.6,0.0,ESE,1.1,Gucheng
175344,25,2013,3,2,0,14.0,18.0,24.0,43.0,1399.0,25.0,-0.3,1028.9,-18.0,0.0,ENE,1.4,Gucheng
175345,26,2013,3,2,1,18.0,39.0,25.0,28.0,1300.0,37.0,-0.7,1029.2,-17.9,0.0,E,1.0,Gucheng
175346,27,2013,3,2,2,19.0,29.0,30.0,13.0,1100.0,47.0,-0.8,1028.8,-18.0,0.0,ENE,1.4,Gucheng
175347,28,2013,3,2,3,15.0,24.0,20.0,12.0,1000.0,46.0,-1.2,1028.6,-17.9,0.0,ENE,1.1,Gucheng
175348,29,2013,3,2,4,13.0,16.0,21.0,24.0,1200.0,37.0,-1.3,1028.7,-18.9,0.0,NE,1.1,Gucheng
175349,30,2013,3,2,5,12.0,23.0,26.0,11.0,1200.0,50.0,-1.3,1028.0,-18.4,0.0,ENE,1.4,Gucheng


Nilai rata-rata NO2 pada 10 baris berikutnya adalah: 18.8


In [19]:
# fill NO2 for the rows at index 175320 through 175339
df_new.loc[175320:175339, 'NO2'] = 18.8
df_new.loc[175320:175339]

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
175320,1,2013,3,1,0,6.0,18.0,5.0,18.8,800.0,88.0,0.1,1021.1,-18.6,0.0,NW,4.4,Gucheng
175321,2,2013,3,1,1,6.0,15.0,5.0,18.8,800.0,88.0,-0.3,1021.5,-19.0,0.0,NW,4.0,Gucheng
175322,3,2013,3,1,2,5.0,18.0,5.5,18.8,700.0,52.0,-0.7,1021.5,-19.8,0.0,WNW,4.6,Gucheng
175323,4,2013,3,1,3,6.0,20.0,6.0,18.8,650.0,62.5,-1.0,1022.7,-21.2,0.0,W,2.8,Gucheng
175324,5,2013,3,1,4,5.0,17.0,5.0,18.8,600.0,73.0,-1.3,1023.0,-21.4,0.0,WNW,3.6,Gucheng
175325,6,2013,3,1,5,4.0,11.0,3.0,18.8,700.0,87.0,-1.8,1023.6,-21.9,0.0,E,1.2,Gucheng
175326,7,2013,3,1,6,3.0,6.0,3.0,18.8,700.0,92.0,-2.6,1024.3,-20.4,0.0,ENE,1.2,Gucheng
175327,8,2013,3,1,7,5.0,5.0,3.0,18.8,800.0,86.0,-0.9,1025.6,-20.5,0.0,ENE,1.1,Gucheng
175328,9,2013,3,1,8,5.0,9.0,5.0,18.8,900.0,81.0,0.1,1026.1,-20.3,0.0,ENE,3.0,Gucheng
175329,10,2013,3,1,9,4.0,10.0,6.0,18.8,900.0,82.0,1.1,1026.1,-20.6,0.0,NE,2.8,Gucheng
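
The two manual fills above hard-code row positions and mean values. One way to express the same idea without them is sketched below: fill the leading NaN of each station's series with the mean of the first `n` valid rows that follow. The helper name `fill_leading_nan` and the synthetic data are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd


def fill_leading_nan(series, n=10):
    """Replace NaN at the start of `series` with the mean of the next n rows."""
    first_valid = series.first_valid_index()
    if first_valid is None:
        return series.copy()  # nothing to average from
    start = series.index.get_loc(first_valid)  # assumes a unique index
    out = series.copy()
    if start > 0:
        out.iloc[:start] = series.iloc[start:start + n].mean()
    return out


# Synthetic data: each station starts with one or two missing NO2 values.
demo = pd.DataFrame({
    "station": ["A"] * 4 + ["B"] * 4,
    "NO2": [np.nan, np.nan, 2.0, 4.0, np.nan, 10.0, 20.0, 30.0],
})
demo["NO2"] = demo.groupby("station")["NO2"].transform(fill_leading_nan, n=2)
print(demo)
```

Applied to `df_new` with `n=10`, this should fill the same rows as the manual steps, though the fill values are not rounded to one decimal.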


In [20]:
# check the NaN values in the wd column
df_new[df_new.isna().any(axis=1)]

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
31316,31317,2016,9,25,20,182.0,182.0,2.0,82.0,1600.0,46.0,24.714286,1009.571429,19.114286,0.0,,1.666667,Aotizhongxin
31317,31318,2016,9,25,21,137.0,146.0,2.0,44.0,1400.0,122.0,23.971429,1009.857143,18.871429,0.0,,1.850000,Aotizhongxin
31318,31319,2016,9,25,22,107.0,107.0,2.0,34.0,1300.0,108.0,23.228571,1010.142857,18.628571,0.0,,2.033333,Aotizhongxin
31392,31393,2016,9,29,0,43.0,106.0,2.0,85.0,1100.0,2.0,10.733333,1019.000000,5.700000,0.0,,0.000000,Aotizhongxin
31393,31394,2016,9,29,1,36.0,83.0,2.0,67.0,800.0,2.0,9.900000,1018.750000,6.300000,0.0,,0.000000,Aotizhongxin
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
419788,34085,2017,1,19,4,102.0,118.0,41.0,88.0,3100.0,42.0,-2.028571,1024.714286,-9.114286,0.0,,1.033333,Wanshouxigong
419789,34086,2017,1,19,5,111.0,137.0,40.0,84.0,2800.0,5.0,-2.192857,1024.571429,-8.871429,0.0,,1.050000,Wanshouxigong
419790,34087,2017,1,19,6,124.0,142.0,32.0,84.0,2700.0,4.0,-2.357143,1024.428571,-8.628571,0.0,,1.066667,Wanshouxigong
419948,34245,2017,1,25,20,241.0,241.0,39.0,108.0,3200.0,10.0,-2.800000,1025.000000,-9.000000,0.0,,0.000000,Wanshouxigong


In [21]:
# wd is categorical and too many NaN rows remain to fill them individually,
# so the remaining NaN in wd are replaced with 'Unknown'

df_new['wd'] = df_new['wd'].fillna('Unknown')
df_new[df_new['wd'] == 'Unknown']

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
31316,31317,2016,9,25,20,182.0,182.0,2.0,82.0,1600.0,46.0,24.714286,1009.571429,19.114286,0.0,Unknown,1.666667,Aotizhongxin
31317,31318,2016,9,25,21,137.0,146.0,2.0,44.0,1400.0,122.0,23.971429,1009.857143,18.871429,0.0,Unknown,1.850000,Aotizhongxin
31318,31319,2016,9,25,22,107.0,107.0,2.0,34.0,1300.0,108.0,23.228571,1010.142857,18.628571,0.0,Unknown,2.033333,Aotizhongxin
31392,31393,2016,9,29,0,43.0,106.0,2.0,85.0,1100.0,2.0,10.733333,1019.000000,5.700000,0.0,Unknown,0.000000,Aotizhongxin
31393,31394,2016,9,29,1,36.0,83.0,2.0,67.0,800.0,2.0,9.900000,1018.750000,6.300000,0.0,Unknown,0.000000,Aotizhongxin
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
419788,34085,2017,1,19,4,102.0,118.0,41.0,88.0,3100.0,42.0,-2.028571,1024.714286,-9.114286,0.0,Unknown,1.033333,Wanshouxigong
419789,34086,2017,1,19,5,111.0,137.0,40.0,84.0,2800.0,5.0,-2.192857,1024.571429,-8.871429,0.0,Unknown,1.050000,Wanshouxigong
419790,34087,2017,1,19,6,124.0,142.0,32.0,84.0,2700.0,4.0,-2.357143,1024.428571,-8.628571,0.0,Unknown,1.066667,Wanshouxigong
419948,34245,2017,1,25,20,241.0,241.0,39.0,108.0,3200.0,10.0,-2.800000,1025.000000,-9.000000,0.0,Unknown,0.000000,Wanshouxigong


In [22]:
# final check to make sure no NaN remain
df_new.isna().sum()

No         0
year       0
month      0
day        0
hour       0
PM2.5      0
PM10       0
SO2        0
NO2        0
CO         0
O3         0
TEMP       0
PRES       0
DEWP       0
RAIN       0
wd         0
WSPM       0
station    0
dtype: int64

In [23]:
# pandas date_range: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
# freq: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases

# pd.to_datetime(df_new[['year', 'month', 'day', 'hour']])
# pd.date_range('2013-03-01 01:00:00', periods=5, freq='D')
# pd.date_range(start='2013-03-01 01:00:00', periods=5, freq='-1D')
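
The commented-out lines above point toward building a proper timestamp from the `year`, `month`, `day`, and `hour` columns. A minimal sketch of that step on synthetic rows follows; the column name `datetime` is an assumption made for illustration.

```python
import pandas as pd

# Synthetic rows shaped like the dataset's date columns.
demo = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [3, 3, 3],
    "day": [1, 1, 1],
    "hour": [0, 1, 2],
})

# pd.to_datetime assembles timestamps from columns named year/month/day/hour.
demo["datetime"] = pd.to_datetime(demo[["year", "month", "day", "hour"]])

# With timestamps in place, gaps in the hourly series are easy to detect.
is_hourly = demo["datetime"].diff().dropna().eq(pd.Timedelta(hours=1)).all()
print(demo)
print("strictly hourly:", is_hourly)
```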

## Exploratory Data Analysis (EDA)

### Exploring the Numerical Data

This stage explores the numerical data, namely: \
PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, and WSPM 

and displays summary statistics of each attribute for every station. The statistics shown are: \
Min, Max, Range, Mean, Median, and Std

In [24]:
# creating function for detailed descriptive analytic
def summary_stats(df_model, n=4):
    # central tendency: mean, median
    mean = pd.DataFrame(df_model.apply(np.mean)).T
    median = pd.DataFrame(df_model.apply(np.median)).T

    # dispersion: std, min, max, range
    std = pd.DataFrame(df_model.apply(np.std)).T
    min_value = pd.DataFrame(df_model.apply(min)).T
    max_value = pd.DataFrame(df_model.apply(max)).T
    range_value = pd.DataFrame(df_model.apply(lambda x: x.max() - x.min())).T

    # concatenates
    summary_stats = pd.concat([min_value, max_value, range_value, mean, median, std]).T.reset_index()
    summary_stats.columns = ['attributes','min','max', 'range','mean','median', 'std']
    
    return round(summary_stats, n)

In [25]:
for station in stations:
    print(station)
    display(summary_stats(df_new[df_new['station'] == station][numerical]))
    print('-'*75)

Aotizhongxin


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,3.0,898.0,895.0,82.5406,58.0,81.9552
1,PM10,2.0,984.0,982.0,110.21,87.0,95.2612
2,SO2,0.2856,341.0,340.7144,17.4593,9.0,22.702
3,NO2,2.0,290.0,288.0,59.0741,53.0,37.0004
4,CO,100.0,10000.0,9900.0,1264.6924,900.0,1239.3942
5,O3,0.2142,423.0,422.7858,55.3286,41.0,57.3267
6,TEMP,-16.8,40.5,57.3,13.5814,14.5,11.4003
7,PRES,985.9,1042.0,56.1,1011.8517,1011.4,10.4044
8,DEWP,-35.3,28.5,63.8,3.1203,3.8,13.6901
9,RAIN,0.0,72.5,72.5,0.0674,0.0,0.9098


---------------------------------------------------------------------------
Changping


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,2.0,882.0,880.0,70.9864,46.0,72.3967
1,PM10,2.0,999.0,997.0,94.7886,72.0,83.9273
2,SO2,0.2856,310.0,309.7144,14.9431,7.0,21.0119
3,NO2,1.8477,226.0,224.1523,44.2062,36.0,29.5023
4,CO,100.0,10000.0,9900.0,1156.9902,800.0,1122.5967
5,O3,0.2142,429.0,428.7858,57.9763,46.0,54.2579
6,TEMP,-16.6,41.4,58.0,13.6716,14.6,11.3673
7,PRES,982.4,1036.5,54.1,1007.7712,1007.4,10.2259
8,DEWP,-35.1,27.2,62.3,1.4869,1.7,13.8287
9,RAIN,0.0,52.1,52.1,0.0603,0.0,0.7523


---------------------------------------------------------------------------
Dingling


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,3.0,881.0,878.0,66.8456,41.0,73.444
1,PM10,2.0,905.0,903.0,84.1137,60.0,80.2409
2,SO2,0.2856,156.0,155.7144,11.7978,5.0,15.6521
3,NO2,1.0265,205.0,203.9735,27.3036,19.0,26.2964
4,CO,100.0,10000.0,9900.0,925.1118,600.0,894.6746
5,O3,0.2142,500.0,499.7858,70.5308,62.0,58.6238
6,TEMP,-16.6,41.4,58.0,13.6716,14.6,11.3673
7,PRES,982.4,1036.5,54.1,1007.7712,1007.4,10.2259
8,DEWP,-35.1,27.2,62.3,1.4869,1.7,13.8287
9,RAIN,0.0,52.1,52.1,0.0603,0.0,0.7523


---------------------------------------------------------------------------
Dongsi


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,3.0,737.0,734.0,86.1442,61.0,86.259
1,PM10,2.0,955.0,953.0,110.3476,86.0,98.2385
2,SO2,0.2856,300.0,299.7144,18.5061,10.0,22.9544
3,NO2,2.0,258.0,256.0,53.9531,47.0,34.2128
4,CO,100.0,10000.0,9900.0,1331.913,1000.0,1169.0186
5,O3,0.6426,1071.0,1070.3574,57.7004,45.0,58.3263
6,TEMP,-16.8,41.1,57.9,13.6682,14.6,11.4596
7,PRES,987.1,1042.0,54.9,1012.5519,1012.2,10.2663
8,DEWP,-35.3,28.8,64.1,2.4451,3.0,13.8117
9,RAIN,0.0,46.4,46.4,0.064,0.0,0.786


---------------------------------------------------------------------------
Guanyuan


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,2.0,680.0,678.0,82.8975,59.0,81.0691
1,PM10,2.0,999.0,997.0,109.3723,89.0,92.3396
2,SO2,1.0,293.0,292.0,17.6093,8.0,23.6312
3,NO2,2.0,270.0,268.0,58.1393,51.0,35.2097
4,CO,100.0,10000.0,9900.0,1258.327,900.0,1151.631
5,O3,0.2142,415.0,414.7858,54.8171,40.0,57.1984
6,TEMP,-16.8,40.5,57.3,13.5814,14.5,11.4003
7,PRES,985.9,1042.0,56.1,1011.8517,1011.4,10.4044
8,DEWP,-35.3,28.5,63.8,3.1203,3.8,13.6901
9,RAIN,0.0,72.5,72.5,0.0674,0.0,0.9098


---------------------------------------------------------------------------
Gucheng


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,2.0,770.0,768.0,84.0748,60.0,82.9935
1,PM10,2.0,994.0,992.0,119.2616,100.0,97.5318
2,SO2,0.2856,500.0,499.7144,15.7058,7.0,23.2823
3,NO2,2.0,276.0,274.0,55.8302,50.0,36.5127
4,CO,100.0,10000.0,9900.0,1338.1001,985.7143,1214.0277
5,O3,0.2142,450.0,449.7858,58.0,45.0,57.171
6,TEMP,-15.6,41.6,57.2,13.8558,14.8,11.2948
7,PRES,984.0,1038.1,54.1,1008.8357,1008.5,10.1042
8,DEWP,-34.6,27.4,62.0,2.6004,3.0,13.7899
9,RAIN,0.0,41.9,41.9,0.0644,0.0,0.8381


---------------------------------------------------------------------------
Huairou


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,2.0,762.0,760.0,69.5017,47.0,70.9899
1,PM10,2.0,993.0,991.0,92.4227,69.0,84.7909
2,SO2,0.2856,315.0,314.7144,12.4467,4.0,19.3616
3,NO2,1.0265,231.0,229.9735,32.0773,25.0,26.2692
4,CO,100.0,10000.0,9900.0,1019.6719,800.0,890.5267
5,O3,0.2142,444.0,443.7858,60.8722,50.0,55.6194
6,TEMP,-19.9,40.3,60.2,12.4306,13.5,11.7542
7,PRES,982.8,1036.5,53.7,1007.6123,1007.3,10.0251
8,DEWP,-43.4,29.1,72.5,2.2188,2.7,14.0597
9,RAIN,0.0,45.9,45.9,0.068,0.0,0.8485


---------------------------------------------------------------------------
Nongzhanguan


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,2.0,844.0,842.0,85.0795,59.0,86.6913
1,PM10,2.0,995.0,993.0,109.384,85.0,96.087
2,SO2,0.5712,257.0,256.4288,18.7601,9.0,24.3796
3,NO2,2.0,273.0,271.0,58.0951,51.0,36.3798
4,CO,100.0,10000.0,9900.0,1327.7676,900.0,1257.0808
5,O3,0.2142,390.0,389.7858,58.4329,45.0,58.3298
6,TEMP,-16.8,41.1,57.9,13.6682,14.6,11.4596
7,PRES,987.1,1042.0,54.9,1012.5519,1012.2,10.2663
8,DEWP,-35.3,28.8,64.1,2.4451,3.0,13.8117
9,RAIN,0.0,46.4,46.4,0.064,0.0,0.786


---------------------------------------------------------------------------
Shunyi


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,2.0,941.0,939.0,79.438,55.0,81.4991
1,PM10,2.0,999.0,997.0,99.2719,77.0,90.5777
2,SO2,0.2856,239.0,238.7144,13.446,5.0,19.4632
3,NO2,2.0,258.0,256.0,44.0947,37.0,30.9073
4,CO,100.0,10000.0,9900.0,1197.9047,900.0,1162.9043
5,O3,0.2142,351.7164,351.5022,54.2188,42.0,54.5919
6,TEMP,-16.8,40.6,57.4,13.3755,14.4,11.4847
7,PRES,988.0,1042.8,54.8,1013.0719,1012.8,10.1774
8,DEWP,-36.0,27.5,63.5,2.4508,3.1,13.7323
9,RAIN,0.0,37.3,37.3,0.061,0.0,0.7611


---------------------------------------------------------------------------
Tiantan


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,3.0,821.0,818.0,82.0331,58.0,80.8943
1,PM10,2.0,988.0,986.0,106.5371,85.0,90.2765
2,SO2,0.5712,273.0,272.4288,14.51,7.0,20.2772
3,NO2,2.0,241.0,239.0,53.2588,47.0,32.0158
4,CO,100.0,10000.0,9900.0,1305.3333,900.0,1179.4332
5,O3,0.4284,674.0,673.5716,56.1481,40.0,59.4575
6,TEMP,-16.8,41.1,57.9,13.6682,14.6,11.4596
7,PRES,987.1,1042.0,54.9,1012.5519,1012.2,10.2663
8,DEWP,-35.3,28.8,64.1,2.4451,3.0,13.8117
9,RAIN,0.0,46.4,46.4,0.064,0.0,0.786


---------------------------------------------------------------------------
Wanliu


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,2.0,957.0,955.0,83.4676,59.0,82.1239
1,PM10,2.0,951.0,949.0,110.7079,88.0,93.5383
2,SO2,0.2856,282.0,281.7144,18.4095,10.0,22.6796
3,NO2,1.6424,264.0,262.3576,65.6684,61.0,37.9664
4,CO,100.0,10000.0,9900.0,1328.7612,900.0,1263.2009
5,O3,0.2142,364.0,363.7858,46.9094,29.0,54.4509
6,TEMP,-15.8,40.5,56.3,13.4258,14.3,11.348
7,PRES,985.9,1040.3,54.4,1011.1027,1010.8,10.356
8,DEWP,-34.9,28.5,63.4,3.2638,4.0,13.6793
9,RAIN,0.0,72.5,72.5,0.0682,0.0,0.8965


---------------------------------------------------------------------------
Wanshouxigong


Unnamed: 0,attributes,min,max,range,mean,median,std
0,PM2.5,3.0,999.0,996.0,85.0675,60.0,85.9985
1,PM10,2.0,961.0,959.0,112.5058,91.0,98.1307
2,SO2,0.2856,411.0,410.7144,17.3634,8.0,24.1713
3,NO2,2.0,251.0,249.0,55.4954,49.0,35.8325
4,CO,100.0,9800.0,9700.0,1373.6186,1000.0,1228.144
5,O3,0.2142,358.0,357.7858,55.92,42.0,57.1555
6,TEMP,-16.8,40.6,57.4,13.7818,14.8,11.386
7,PRES,985.1,1042.0,56.9,1011.5162,1011.0,10.571
8,DEWP,-35.3,28.5,63.8,2.7055,3.3,13.7049
9,RAIN,0.0,46.4,46.4,0.0643,0.0,0.7968


---------------------------------------------------------------------------
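
The per-station loop prints one table per station, which makes stations hard to compare. As a sketch of an alternative, the same kind of statistics can be collected in a single table with `groupby(...).agg(...)`. The data below is synthetic. Note that pandas' `std` defaults to the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Synthetic stand-in for one pollutant column.
demo = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "PM2.5": [1.0, 2.0, 6.0, 10.0, 20.0, 30.0],
})

# One row per station, one column per statistic.
stats = demo.groupby("station")["PM2.5"].agg(["min", "max", "mean", "median", "std"])
stats["range"] = stats["max"] - stats["min"]
print(stats.round(4))
```

For the real data, `df_new.groupby('station')[numerical].agg([...])` would return the same layout with one column group per attribute.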


### Exploring the Categorical Data (wd)

This stage explores the categorical data, namely wd or wind direction,\
specifically by counting the occurrences of each wind direction for every station.


In [26]:
for station in stations:
    st_len = len(df_new[df_new['station'] == station])
    print(station)
    print(df_new[df_new['station'] == station]['wd'].value_counts().map(lambda x: f'{x} ({x / st_len * 100:.2f}%)').to_string())
    print('-'*22)

Aotizhongxin


NE         5161 (14.72%)
ENE        3956 (11.28%)
SW          3377 (9.63%)
E           2611 (7.45%)
NNE         2446 (6.98%)
WSW         2213 (6.31%)
SSW         2099 (5.99%)
N           2067 (5.89%)
NW          1867 (5.32%)
ESE         1719 (4.90%)
NNW         1589 (4.53%)
SE          1345 (3.84%)
S           1304 (3.72%)
W           1175 (3.35%)
WNW         1102 (3.14%)
SSE         1022 (2.91%)
Unknown       11 (0.03%)
----------------------
Changping
NNW        4791 (13.66%)
NW         3868 (11.03%)
N          3794 (10.82%)
WNW         2887 (8.23%)
ESE         2789 (7.95%)
E           2432 (6.94%)
NNE         1924 (5.49%)
SSE         1856 (5.29%)
SE          1829 (5.22%)
NE          1736 (4.95%)
S           1701 (4.85%)
W           1418 (4.04%)
ENE         1311 (3.74%)
SSW         1128 (3.22%)
SW           885 (2.52%)
WSW          706 (2.01%)
Unknown        9 (0.03%)
----------------------
Dingling
NNW        4791 (13.66%)
NW         3868 (11.03%)
N          3794 (10.82%)
WNW       
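
The per-station shares printed above can also be laid out as one station-by-direction table, which makes stations with identical distributions easy to spot. A minimal sketch with `pd.crosstab` on synthetic data:

```python
import pandas as pd

# Synthetic wind-direction records for two stations.
demo = pd.DataFrame({
    "station": ["A", "A", "A", "A", "B", "B"],
    "wd": ["N", "N", "NE", "Unknown", "S", "S"],
})

# normalize="index" turns counts into row-wise shares; scale to percent.
share = pd.crosstab(demo["station"], demo["wd"], normalize="index").mul(100).round(2)
print(share)
```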

## Visualization & Explanatory Analysis

### Question 1: Which months have the best and worst air quality in each year, based on the variables that describe the AQI: PM2.5, PM10, SO2, NO2, CO, and O3?

#### PM2.5 (Particulate Matter 2.5)

In [27]:
v1 = df_new.drop(['No', 'day', 'hour'], axis=1).groupby(['station', 'year', 'month']).mean(numeric_only=True).reset_index()
v1['year_month'] = pd.to_datetime(dict(year=v1['year'], month=v1['month'], day=1))

fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['PM2.5'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average PM2.5 per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, PM2.5 across all stations is consistently at its best (lowest) around `July` and `January` of each year.
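
Reading the best and worst months off a line chart is approximate. A minimal sketch of a numeric cross-check follows: average the pollutant per year and month, then take the month with the lowest and highest mean in each year. The data is synthetic and the names `best_month` and `worst_month` are assumptions for illustration.

```python
import pandas as pd

# Synthetic monthly values for two years.
demo = pd.DataFrame({
    "year": [2014] * 3 + [2015] * 3,
    "month": [1, 7, 12] * 2,
    "PM2.5": [90.0, 60.0, 120.0, 100.0, 55.0, 80.0],
})

monthly = demo.groupby(["year", "month"])["PM2.5"].mean()

# idxmin/idxmax return (year, month) tuples; keep only the month part.
best = monthly.groupby(level="year").idxmin().map(lambda ym: ym[1])
worst = monthly.groupby(level="year").idxmax().map(lambda ym: ym[1])

summary = pd.DataFrame({"best_month": best, "worst_month": worst})
print(summary)
```

A lower concentration is treated as better air here. On the real data the same steps could be run on `df_new`, optionally adding `station` to the grouping keys.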

#### PM10 (Particulate Matter 10)

In [28]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['PM10'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average PM10 per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, for PM10 the best air at all stations consistently falls around `July` and `January` each year.

#### SO2 (Sulfur Dioxide)

In [29]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['SO2'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average SO2 per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, for SO2 the best air at all stations consistently falls around `July` each year.

#### NO2 (Nitrogen Dioxide)

In [30]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['NO2'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average NO2 per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, NO2 levels differ considerably from station to station, yet the series consistently share the same pattern: the lowest, and therefore best, values occur around `July` each year.
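One way to check that stations share a pattern despite different levels is the pairwise correlation of their monthly series, since Pearson correlation ignores a constant offset. The following is a minimal sketch under stated assumptions: `station_pattern_correlation` is a hypothetical helper, the input is assumed to have the `station` and `year_month` columns of `v1`, and the `demo` values are made up.

```python
import pandas as pd


def station_pattern_correlation(monthly: pd.DataFrame, column: str) -> pd.DataFrame:
    """Pairwise Pearson correlation of the monthly series between stations."""
    # One column per station, one row per month.
    wide = monthly.pivot(index='year_month', columns='station', values=column)
    return wide.corr()


# Made-up values: station B sits 30 units above A but moves the same way.
months = pd.date_range('2014-01-01', periods=6, freq='MS')
demo = pd.DataFrame({
    'station': ['A'] * 6 + ['B'] * 6,
    'year_month': list(months) * 2,
    'NO2': [70.0, 60.0, 50.0, 40.0, 45.0, 55.0,
            100.0, 90.0, 80.0, 70.0, 75.0, 85.0],
})

corr = station_pattern_correlation(demo, 'NO2')
print(corr.round(2))
```

Values close to 1 between every pair of stations would support the claim that only the level differs, not the seasonal shape.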

#### CO (Carbon Monoxide)

In [31]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['CO'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average CO per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, for CO the best air at all stations consistently falls around `July` each year.

#### O3 (Ozone)

In [32]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['O3'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average O3 per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, for O3 the best air at all stations consistently falls around `January` each year.
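The six per-pollutant readings can be summarized in one table by averaging each pollutant over all stations and years for every calendar month and then taking the month with the lowest mean. This is a sketch rather than the notebook's own method: `cleanest_month_per_pollutant` and `POLLUTANTS` are hypothetical names, and the `demo` numbers are made up.

```python
import pandas as pd

POLLUTANTS = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']


def cleanest_month_per_pollutant(monthly: pd.DataFrame,
                                 columns=POLLUTANTS) -> pd.Series:
    """Calendar month (1-12) with the lowest mean, for each column present."""
    present = [c for c in columns if c in monthly.columns]
    # Average over all stations and years for each calendar month.
    profile = monthly.groupby('month')[present].mean()
    return profile.idxmin()


# Made-up values for illustration only, not taken from the dataset.
demo = pd.DataFrame({
    'station': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2014] * 6,
    'month': [1, 7, 10, 1, 7, 10],
    'SO2': [40.0, 5.0, 12.0, 44.0, 7.0, 14.0],
    'O3': [20.0, 95.0, 40.0, 24.0, 99.0, 44.0],
})

lowest = cleanest_month_per_pollutant(demo)
print(lowest)
```

Averaging across years hides year-to-year differences, so this complements the per-year charts rather than replacing them.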

### Question 2: Which months have the most ideal weather each year, based on temperature (TEMP), pressure (PRES), dew point temperature (DEWP), precipitation (RAIN), and wind speed (WSPM)?

#### TEMP (Temperature)

In [33]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['TEMP'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average Temperature per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, at all stations the temperature consistently peaks in July at about 27°C to 28°C and bottoms out in January at about -2°C to -6°C.\
The ideal temperature lies between 20°C and 25°C, although up to 30°C is still acceptable; it is fairly hot above 32°C and fairly cold below 15°C.\
It follows that the best temperatures consistently occur in May-June and August-September at every station, and the worst consistently occur around October-April.
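The thresholds above can be turned into a small labeling function. This is a minimal sketch under stated assumptions: `temperature_band` is a hypothetical helper, the labels `cool` (15°C to 20°C) and `warm` (30°C to 32°C) fill ranges the text leaves unnamed, and the `demo` temperatures are made up.

```python
import pandas as pd


def temperature_band(temp_c: float) -> str:
    """Map a monthly mean temperature (°C) to the comfort bands described above."""
    if pd.isna(temp_c):
        return 'unknown'
    if temp_c < 15:
        return 'cold'
    if temp_c < 20:
        return 'cool'        # assumption: band not named in the text
    if temp_c <= 25:
        return 'ideal'
    if temp_c <= 30:
        return 'acceptable'
    if temp_c <= 32:
        return 'warm'        # assumption: band not named in the text
    return 'hot'


# Made-up monthly means for illustration only, not taken from the dataset.
demo = pd.Series([-4.0, 14.0, 21.5, 27.5, 22.0],
                 index=['Jan', 'Apr', 'May', 'Jul', 'Sep'], name='TEMP')

bands = demo.map(temperature_band)
print(bands)
```

Applied to the real data, something like `v1['TEMP'].map(temperature_band)` would label every station-month, which makes the "best" and "worst" months explicit instead of read off the chart.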

#### PRES (Pressure)

In [34]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['PRES'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average Pressure per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Air pressure is linked to rainfall: the lower the air pressure, the more likely rain becomes, and vice versa.\
Based on the chart above, air pressure is consistently lowest around July and highest around January.

#### DEWP (Dew Point Temperature)

In [35]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['DEWP'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average Dew Point Temperature per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Dew point temperature is also linked to rainfall: the more water vapor the air holds, which can signal a higher chance of rain, the higher the temperature at which dew forms.\
In the chart above, the dew point is consistently lowest around January and highest around July, which suggests that rainfall will be high in July and low in January.

#### RAIN (Precipitation)

In [36]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['RAIN'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average Precipitation per Month',
    showlegend=True,
    plot_bgcolor='white',
)

The earlier analysis suggested that the lower the air pressure and the higher the dew point temperature, the more likely rain becomes.\
The chart above confirms this: rainfall is consistently highest around July and lowest around January.
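The relationship between the three variables can also be quantified with a correlation matrix of the monthly means. The sketch below is illustrative only: `weather_correlation` is a hypothetical helper, the input is assumed to have the `PRES`, `DEWP`, and `RAIN` columns of `v1`, and the `demo` values are made up.

```python
import pandas as pd


def weather_correlation(monthly: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between monthly mean pressure, dew point and rainfall."""
    return monthly[['PRES', 'DEWP', 'RAIN']].corr()


# Made-up monthly means for illustration only, not taken from the dataset.
demo = pd.DataFrame({
    'PRES': [1025.0, 1015.0, 1000.0, 1012.0],
    'DEWP': [-15.0, 0.0, 20.0, 5.0],
    'RAIN': [0.00, 0.02, 0.25, 0.05],
})

corr = weather_correlation(demo)
print(corr.round(2))
```

A negative `PRES`/`RAIN` coefficient and a positive `DEWP`/`RAIN` coefficient would agree with the charts. Correlation between monthly means largely reflects the shared seasonal cycle, so it should not be read as evidence of causation.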

#### WSPM (Wind Speed)

In [37]:
fig = go.Figure()

for station in stations:
    fig.add_trace(go.Scatter(
        x=v1[v1['station'] == station]['year_month'],
        y=v1[v1['station'] == station]['WSPM'], 
        mode='lines', 
        name=station 
    ))

fig.update_layout(
    title='Average Wind Speed per Month',
    showlegend=True,
    plot_bgcolor='white',
)

Based on the chart above, although the stations differ noticeably in level, they share the same pattern: wind speed is highest in March-May and November-January and lowest in June-October.

## Conclusion

- Based on the analysis for question 1, nearly all of the variables that describe AQI indicate that July is consistently the month with the best air quality, every year and at every station.
- Based on the analysis for question 2, the best months are September to October, because these months combine ideal temperatures with moderate rainfall and moderate wind speed.

## Source

[1] AirNow. “Technical Assistance Document for the Reporting of Daily Air Quality” 2018, https://www.airnow.gov/sites/default/files/2020-05/aqi-technical-assistance-document-sept2018.pdf