
# Building an Air Quality Category Classification Model for DKI Jakarta

## Data Preparation

### Import Library

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

### Loading Data

In [2]:
# Download Data
!wget https://github.com/aulialigar/Capstone_SIB_Dicoding/raw/main/datasets/data.zip

--2021-12-15 03:43:32--  https://github.com/aulialigar/Capstone_SIB_Dicoding/raw/main/datasets/data.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aulialigar/Capstone_SIB_Dicoding/main/datasets/data.zip [following]
--2021-12-15 03:43:32--  https://raw.githubusercontent.com/aulialigar/Capstone_SIB_Dicoding/main/datasets/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69827 (68K) [application/zip]
Saving to: ‘data.zip’


2021-12-15 03:43:32 (45.0 MB/s) - ‘data.zip’ saved [69827/69827]



In [3]:
# Extract the zip-compressed dataset files
!unzip data.zip

Archive:  data.zip
  inflating: data_forecasting_final.csv  
  inflating: test_data_classification.csv  
  inflating: train_data_classification.csv  


In [4]:
# Load the datasets with pandas
df_train = pd.read_csv('train_data_classification.csv')
df_test = pd.read_csv('test_data_classification.csv')

In [5]:
df_train

Unnamed: 0,pm10,so2,co,o3,no2,categori
0,60.0,4.0,73.0,27.0,14.0,3
1,32.0,2.0,16.0,33.0,9.0,0
2,27.0,2.0,19.0,20.0,9.0,0
3,22.0,2.0,16.0,15.0,6.0,0
4,25.0,2.0,17.0,15.0,8.0,0
...,...,...,...,...,...,...
4202,82.0,56.0,13.0,41.0,35.0,4
4203,82.0,53.0,18.0,40.0,45.0,4
4204,78.0,52.0,18.0,53.0,39.0,4
4205,90.0,54.0,15.0,81.0,35.0,4


In [6]:
df_test

Unnamed: 0,pm10,so2,co,o3,no2
0,58.509243,18.425926,29.222222,42.760012,22.213508
1,68.781558,14.471925,33.358762,51.942339,20.987288
2,31.6875,11.993007,21.34375,25.950862,10.232639
3,72.43735,22.073287,24.551591,77.545148,11.78623
4,74.80416,22.146537,24.858341,79.041463,11.677219
5,76.213284,22.219348,24.946994,80.494536,11.661887
6,76.777581,22.291317,24.844293,81.909962,11.729158
7,76.609904,22.362043,24.576982,83.293332,11.867954
8,75.82311,22.431122,24.171805,84.650238,12.067197
9,74.530055,22.498152,23.655506,85.986273,12.315809


In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4207 entries, 0 to 4206
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pm10      4207 non-null   float64
 1   so2       4207 non-null   float64
 2   co        4207 non-null   float64
 3   o3        4207 non-null   float64
 4   no2       4207 non-null   float64
 5   categori  4207 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 197.3 KB


In [8]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pm10    23 non-null     float64
 1   so2     23 non-null     float64
 2   co      23 non-null     float64
 3   o3      23 non-null     float64
 4   no2     23 non-null     float64
dtypes: float64(5)
memory usage: 1.0 KB


### Data Understanding

The output above shows that the training data has six columns: five input features and one target column.
1. `pm10`
2. `so2`
3. `co`
4. `o3`
5. `no2`
6. `categori` (target)

The `categori` column is the target to be predicted for the test data. It takes five values (identified in [Data Understanding and Preprocessing](https://github.com/aulialigar/Capstone_SIB_Dicoding/blob/main/notebooks/1_Data_Understanding_and_Preprocessing.ipynb)):

- 0 : Baik (Good)
- 1 : Berbahaya (Hazardous)
- 2 : Sangat Tidak Sehat (Very Unhealthy)
- 3 : Sedang (Moderate)
- 4 : Tidak Sehat (Unhealthy)
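
The integer codes above follow the alphabetical order of the category names, which is the order scikit-learn's `LabelEncoder` would assign. Whether the preprocessing notebook actually used `LabelEncoder` is an assumption; the sketch below only shows that the listed codes are consistent with it.

```python
from sklearn.preprocessing import LabelEncoder

# Category names from the list above, deliberately given in a non-alphabetical order
names = ["BAIK", "SEDANG", "TIDAK SEHAT", "SANGAT TIDAK SEHAT", "BERBAHAYA"]

encoder = LabelEncoder().fit(names)

# LabelEncoder sorts the class names, so the codes follow alphabetical order
code_to_name = {code: str(name) for code, name in enumerate(encoder.classes_)}
print(code_to_name)
```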

### Data Splitting

The training data is split again into a training set and a validation set, so that the models are fit on one part and evaluated on the other.

In [9]:
# Define the features (X) and the label (y)
X = df_train.drop(["categori"], axis=1)
y = df_train["categori"]

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=60)
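
If the five categories are unevenly represented, a plain random split can leave a rare class under-represented in the validation set. The sketch below uses hypothetical labels (not the notebook's data) to show the optional `stratify` argument of `train_test_split`, which keeps class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=60, stratify=y_demo
)

# With stratify, the 20-sample validation set keeps the 9:1 class ratio
print(np.bincount(y_va))  # [18  2]
```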

### Normalization

In [10]:
# Normalize with MinMaxScaler (fit on the training split only, then applied to both splits)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
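
`MinMaxScaler` maps each feature to `(x - min) / (max - min)`, where `min` and `max` come from the data passed to `fit`. A small sketch with hypothetical numbers shows that values outside the training range can therefore fall outside [0, 1] after `transform`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])  # hypothetical training values
val = np.array([[15.0], [40.0]])            # hypothetical validation values

demo_scaler = MinMaxScaler().fit(train)     # learns min=10 and max=30 from train only

scaled_val = demo_scaler.transform(val)     # computes (x - 10) / (30 - 10)
print(scaled_val.ravel())  # [0.25 1.5 ]
```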

## Modelling

In [11]:
# Prepare a DataFrame to collect the accuracy of each model
models = pd.DataFrame(index=['accuracy_score'], 
                      columns=['DecisionTree', 'KNN', 'RandomForest', 'SVM'])

### Decision Tree

In [12]:
# Build a Decision Tree model
tree_classifier = tree.DecisionTreeClassifier()
model_tree = tree_classifier.fit(X_train, y_train)

# Predict on the validation set with the Decision Tree model
tree_pred = model_tree.predict(X_val)

# Compute the accuracy and store the result
models.loc['accuracy_score','DecisionTree'] = accuracy_score(y_val, tree_pred)

### KNN

In [13]:
# Build a KNN model
model_knn = KNeighborsClassifier(n_neighbors=3)
model_knn.fit(X_train, y_train)

# Predict on the validation set with the KNN model
knn_pred = model_knn.predict(X_val)

# Compute the accuracy and store the result
models.loc['accuracy_score','KNN'] = accuracy_score(y_val, knn_pred)

### RandomForest

In [14]:
# Build a Random Forest model
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)

# Predict on the validation set with the Random Forest model
rf_pred = model_rf.predict(X_val)

# Compute the accuracy and store the result
models.loc['accuracy_score','RandomForest'] = accuracy_score(y_val, rf_pred)

### SVC

In [15]:
# Build an SVM classifier
model_svc = SVC()
model_svc.fit(X_train, y_train)

# Predict on the validation set with the SVM classifier
svc_pred = model_svc.predict(X_val)

# Compute the accuracy and store the result
models.loc['accuracy_score','SVM'] = accuracy_score(y_val, svc_pred)
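
The four modelling cells repeat the same fit, predict, and score pattern, so they could also be written as one loop over a dictionary of estimators. The sketch below is one possible arrangement, run on synthetic data from `make_classification` rather than the air quality dataset; the names `candidates`, `scores`, and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 5 features, 3 classes (not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=60)

candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# Fit each estimator and record its validation accuracy
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_va, estimator.predict(X_va))

results = pd.DataFrame(scores, index=["accuracy_score"])
print(results)
```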

### Model Evaluation

In [16]:
models

Unnamed: 0,DecisionTree,KNN,RandomForest,SVM
accuracy_score,0.977435,0.920428,0.982185,0.947743


From the results above, **Random Forest** performs best on the validation set, with an accuracy of about **98%** (0.982), narrowly ahead of Decision Tree (0.977).
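
Accuracy alone can hide a class that is rarely predicted correctly. As a complementary check, per-class recall and a confusion matrix could be computed on the validation predictions. The sketch below uses small hypothetical label vectors, not the notebook's `y_val` and `rf_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted category codes
y_true = [0, 0, 0, 0, 3, 3, 3, 4, 4, 2]
y_pred = [0, 0, 0, 0, 3, 3, 4, 4, 4, 4]
labels = [0, 2, 3, 4]

accuracy = accuracy_score(y_true, y_pred)
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(accuracy)          # overall share of correct predictions
print(per_class_recall)  # one value per label; class 2 is never recovered here
print(matrix)            # rows are true classes, columns are predicted classes
```

In this toy case the accuracy is 0.8 even though class 2 has a recall of 0, which is the kind of gap a single accuracy figure does not reveal.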

## Prediction on Test Data

### Normalization

In [17]:
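# Normalize with MinMaxScaler.
# Note: this refits the scaler on the test data, so the scaling differs from the
# one learned on X_train that the models were trained with. Calling
# scaler.transform(df_test) with the scaler fitted earlier would keep both on the
# same scale. The output below comes from the refit version.
scaler = MinMaxScaler()
scaler.fit(df_test)
new_df_test = scaler.transform(df_test)
new_df_test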

array([[0.59484798, 0.42866142, 0.7259534 , 0.20430178, 1.        ],
       [0.82266559, 0.1651842 , 1.        , 0.31590561, 0.89765187],
       [0.        , 0.        , 0.20400297, 0.        , 0.        ],
       [0.90374312, 0.67170549, 0.41652314, 0.62708728, 0.12967268],
       [0.95623381, 0.67658655, 0.43684539, 0.64527378, 0.12057387],
       [0.98748514, 0.68143835, 0.44271863, 0.66293473, 0.11929421],
       [1.        , 0.68623408, 0.43591465, 0.6801381 , 0.12490908],
       [0.99628129, 0.69094693, 0.41820525, 0.69695185, 0.13649386],
       [0.97883192, 0.69555006, 0.3913622 , 0.71344396, 0.15312395],
       [0.95015477, 0.70001668, 0.3571573 , 0.7296824 , 0.1738747 ],
       [0.91275275, 0.70431996, 0.31736233, 0.74573513, 0.19782152],
       [0.86912875, 0.70843308, 0.27374909, 0.76167014, 0.22403978],
       [0.82178567, 0.71232923, 0.22808937, 0.77755539, 0.25160485],
       [0.77322642, 0.7159816 , 0.18215495, 0.79345885, 0.27959213],
       [0.72595388, 0.71936335, 0.

### Predict

In [18]:
prediction = model_rf.predict(new_df_test)
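
The classifier only sees scaled numbers, so its predictions depend on new data being scaled with the same statistics as the training data. A minimal sketch with hypothetical values shows how reusing a fitted scaler and fitting a fresh one on the new data give different inputs for the same raw values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [50.0], [100.0]])  # hypothetical training feature
new = np.array([[40.0], [60.0]])            # hypothetical unseen data

train_scaler = MinMaxScaler().fit(train)

reused = train_scaler.transform(new)        # same scale the model was trained on
refit = MinMaxScaler().fit_transform(new)   # scale derived from the new data alone

print(reused.ravel())  # [0.4 0.6]
print(refit.ravel())   # [0. 1.]
```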

In [19]:
df_test.insert(loc=5, column='categori', value=prediction)
df_test

Unnamed: 0,pm10,so2,co,o3,no2,categori
0,58.509243,18.425926,29.222222,42.760012,22.213508,4
1,68.781558,14.471925,33.358762,51.942339,20.987288,4
2,31.6875,11.993007,21.34375,25.950862,10.232639,0
3,72.43735,22.073287,24.551591,77.545148,11.78623,4
4,74.80416,22.146537,24.858341,79.041463,11.677219,4
5,76.213284,22.219348,24.946994,80.494536,11.661887,2
6,76.777581,22.291317,24.844293,81.909962,11.729158,2
7,76.609904,22.362043,24.576982,83.293332,11.867954,2
8,75.82311,22.431122,24.171805,84.650238,12.067197,2
9,74.530055,22.498152,23.655506,85.986273,12.315809,2


### Mapping

In [20]:
categori_dict = {
    0: 'BAIK',
    1: 'BERBAHAYA',
    2: 'SANGAT TIDAK SEHAT',
    3: 'SEDANG',
    4: 'TIDAK SEHAT'
}

df_test.categori = df_test.categori.map(categori_dict)
df_test

Unnamed: 0,pm10,so2,co,o3,no2,categori
0,58.509243,18.425926,29.222222,42.760012,22.213508,TIDAK SEHAT
1,68.781558,14.471925,33.358762,51.942339,20.987288,TIDAK SEHAT
2,31.6875,11.993007,21.34375,25.950862,10.232639,BAIK
3,72.43735,22.073287,24.551591,77.545148,11.78623,TIDAK SEHAT
4,74.80416,22.146537,24.858341,79.041463,11.677219,TIDAK SEHAT
5,76.213284,22.219348,24.946994,80.494536,11.661887,SANGAT TIDAK SEHAT
6,76.777581,22.291317,24.844293,81.909962,11.729158,SANGAT TIDAK SEHAT
7,76.609904,22.362043,24.576982,83.293332,11.867954,SANGAT TIDAK SEHAT
8,75.82311,22.431122,24.171805,84.650238,12.067197,SANGAT TIDAK SEHAT
9,74.530055,22.498152,23.655506,85.986273,12.315809,SANGAT TIDAK SEHAT
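
`Series.map` with a dictionary replaces each code with its label and returns `NaN` for any code missing from the dictionary, so a quick null check after mapping can catch unexpected codes. A small sketch, where code 7 is a hypothetical value with no label:

```python
import pandas as pd

labels = {
    0: "BAIK",
    1: "BERBAHAYA",
    2: "SANGAT TIDAK SEHAT",
    3: "SEDANG",
    4: "TIDAK SEHAT",
}

codes = pd.Series([0, 4, 2, 7])  # 7 is a hypothetical code with no label
named = codes.map(labels)

print(named.tolist())
print(named.isna().sum())  # number of codes that had no label
```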


### Saving Result

In [21]:
df_test.to_csv('result.csv',index=False)

## Prediction on Forecasting Result

### Loading Data

In [22]:
# Download Data
!wget https://github.com/aulialigar/Capstone_SIB_Dicoding/raw/main/datasets/air_pollution_prediction.csv

--2021-12-15 03:43:33--  https://github.com/aulialigar/Capstone_SIB_Dicoding/raw/main/datasets/air_pollution_prediction.csv
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aulialigar/Capstone_SIB_Dicoding/main/datasets/air_pollution_prediction.csv [following]
--2021-12-15 03:43:33--  https://raw.githubusercontent.com/aulialigar/Capstone_SIB_Dicoding/main/datasets/air_pollution_prediction.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3246 (3.2K) [text/plain]
Saving to: ‘air_pollution_prediction.csv’


2021-12-15 03:43:33 (37.8 MB/s) - ‘air_pollution_prediction.csv’ saved [3246/3246]



In [23]:
# Load the dataset and drop the first column (tanggal) to keep only the features
df_forecast_result = pd.read_csv('air_pollution_prediction.csv')
new_df_forecast_result = df_forecast_result.iloc[:, 1:]
new_df_forecast_result

Unnamed: 0,pm10,so2,co,o3,no2
0,60.600922,49.358723,22.27212,83.021141,22.961458
1,58.361027,48.766232,24.21072,98.38868,19.515585
2,56.269749,48.218819,25.090414,111.493362,17.561869
3,54.317223,47.713047,25.489597,122.668373,16.45681
4,52.494259,47.245762,25.67074,132.197876,15.831768
5,50.792248,46.814026,25.752939,140.324158,15.478233
6,49.203175,46.415138,25.790241,147.253845,15.278267
7,47.71954,46.046597,25.807165,153.163147,15.165163
8,46.334335,45.706093,25.814844,158.202301,15.10119
9,45.041046,45.391499,25.818329,162.499451,15.065005


### Normalization

In [24]:
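# Normalize with MinMaxScaler.
# Note: as in the test-data step, this refits the scaler on the forecast data
# rather than reusing the scaling learned on X_train, so the inputs are not on
# the scale the model was trained with. The output below comes from the refit version.
scaler = MinMaxScaler()
scaler.fit(new_df_forecast_result)
new_df_forecast_result = scaler.transform(new_df_forecast_result)
new_df_forecast_result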

array([[1.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00],
       [9.23952730e-01, 9.16107901e-01, 5.46222123e-01, 1.48497975e-01,
        5.66206027e-01],
       [8.52951189e-01, 8.38598532e-01, 7.94085623e-01, 2.75129763e-01,
        3.20256661e-01],
       [7.86660453e-01, 7.66985236e-01, 9.06559921e-01, 3.83114946e-01,
        1.81143074e-01],
       [7.24768525e-01, 7.00821324e-01, 9.57598909e-01, 4.75199431e-01,
        1.02457813e-01],
       [6.66983090e-01, 6.39690894e-01, 9.80759413e-01, 5.53724459e-01,
        5.79520491e-02],
       [6.13032042e-01, 5.83211534e-01, 9.91269668e-01, 6.20686683e-01,
        3.27786821e-02],
       [5.62660755e-01, 5.31028918e-01, 9.96038165e-01, 6.77788824e-01,
        1.85402665e-02],
       [5.15631313e-01, 4.82816266e-01, 9.98201804e-01, 7.26482645e-01,
        1.04867819e-02],
       [4.71722506e-01, 4.38272185e-01, 9.99183664e-01, 7.68006408e-01,
        5.93161267e-03],
       [4.30727112e-01, 3.9711

### Predict

In [25]:
prediction = model_rf.predict(new_df_forecast_result)
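
One way to keep scaling and classification tied together for every new dataset is a scikit-learn pipeline, which applies the scaler fitted during training whenever `predict` is called. This is a sketch on synthetic data, not the approach taken in this notebook; `pipe` and the demo arrays are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with 5 features (not the air quality dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# The scaler is fitted inside the pipeline, on the training rows only
pipe = make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=0))
pipe.fit(X_demo[:150], y_demo[:150])

# predict() rescales the new rows with the stored training min/max first
preds = pipe.predict(X_demo[150:])
print(preds[:10])
```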

In [26]:
df_forecast_result.insert(loc=6, column='categori', value=prediction)
df_forecast_result

Unnamed: 0,tanggal,pm10,so2,co,o3,no2,categori
0,2021-08-01,60.600922,49.358723,22.27212,83.021141,22.961458,4
1,2021-08-02,58.361027,48.766232,24.21072,98.38868,19.515585,4
2,2021-08-03,56.269749,48.218819,25.090414,111.493362,17.561869,4
3,2021-08-04,54.317223,47.713047,25.489597,122.668373,16.45681,4
4,2021-08-05,52.494259,47.245762,25.67074,132.197876,15.831768,4
5,2021-08-06,50.792248,46.814026,25.752939,140.324158,15.478233,4
6,2021-08-07,49.203175,46.415138,25.790241,147.253845,15.278267,4
7,2021-08-08,47.71954,46.046597,25.807165,153.163147,15.165163,2
8,2021-08-09,46.334335,45.706093,25.814844,158.202301,15.10119,2
9,2021-08-10,45.041046,45.391499,25.818329,162.499451,15.065005,2


### Mapping

In [27]:
df_forecast_result.categori = df_forecast_result.categori.map(categori_dict)
df_forecast_result

Unnamed: 0,tanggal,pm10,so2,co,o3,no2,categori
0,2021-08-01,60.600922,49.358723,22.27212,83.021141,22.961458,TIDAK SEHAT
1,2021-08-02,58.361027,48.766232,24.21072,98.38868,19.515585,TIDAK SEHAT
2,2021-08-03,56.269749,48.218819,25.090414,111.493362,17.561869,TIDAK SEHAT
3,2021-08-04,54.317223,47.713047,25.489597,122.668373,16.45681,TIDAK SEHAT
4,2021-08-05,52.494259,47.245762,25.67074,132.197876,15.831768,TIDAK SEHAT
5,2021-08-06,50.792248,46.814026,25.752939,140.324158,15.478233,TIDAK SEHAT
6,2021-08-07,49.203175,46.415138,25.790241,147.253845,15.278267,TIDAK SEHAT
7,2021-08-08,47.71954,46.046597,25.807165,153.163147,15.165163,SANGAT TIDAK SEHAT
8,2021-08-09,46.334335,45.706093,25.814844,158.202301,15.10119,SANGAT TIDAK SEHAT
9,2021-08-10,45.041046,45.391499,25.818329,162.499451,15.065005,SANGAT TIDAK SEHAT


### Saving Result

#### CSV

In [28]:
df_forecast_result.to_csv('final_result.csv',index=False)

#### JSON

In [29]:
df_forecast_result.to_json("data.json", orient="records")
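
With `orient="records"`, `to_json` writes a JSON array with one object per row, keyed by column name. The sketch below uses a small hypothetical frame with rounded values and omits the path so that the JSON is returned as a string instead of being written to a file.

```python
import json
import pandas as pd

# Hypothetical two-row frame with the same kind of columns as the result
demo = pd.DataFrame({
    "tanggal": ["2021-08-01", "2021-08-02"],
    "pm10": [60.6, 58.4],
    "categori": ["TIDAK SEHAT", "TIDAK SEHAT"],
})

payload = demo.to_json(orient="records")  # returns a string when no path is given
records = json.loads(payload)

print(records[0])
```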