<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 45px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  4. Modeling Data </div>

## 4.1 Import Required Libraries

In [61]:

import pandas as pd
import numpy as np
import math
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier

import warnings  # Used to suppress warnings
warnings.filterwarnings("ignore")

## 4.2 Read CSV file

In [62]:
# Read the data from the CSV file
data = pd.read_csv('../Data/air_pollution_cleaned.csv')

# Preview the first rows
data.head()

Unnamed: 0,dt,aqi,co,no,no2,o3,so2,pm2_5,pm10,nh3
0,2021-01-01 00:00:00,3,700.95,0.44,35.99,17.35,32.9,20.33,26.64,8.99
1,2021-01-01 01:00:00,3,847.82,2.46,38.04,18.06,36.24,23.32,30.54,9.37
2,2021-01-01 02:00:00,3,894.55,5.25,38.39,23.25,41.01,24.16,31.93,9.25
3,2021-01-01 03:00:00,3,827.79,6.2,36.33,33.98,43.39,23.2,30.91,8.61
4,2021-01-01 04:00:00,2,660.9,3.69,29.13,54.36,35.76,19.5,25.6,6.21


## 4.3 The problem to be solved


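- Classify the AQI level (aqi) based on the concentration levels of pollutants: no, co, so2, no2, o3, pm2_5, pm10, nh3.
    - Input: no, co, so2, no2, o3, pm2_5, pm10, nh3 (note that the feature selection in section 4.4 uses seven of these columns and leaves out o3).
    - Output: aqi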

## 4.4 Training models

In [63]:
# Features (X)
X = data[['no', 'co', 'so2', 'no2', 'pm2_5', 'pm10', 'nh3']]
# Target variable (y): the AQI column
y = data['aqi']

### Feature Scaling

In [64]:
# Feature Scaling (Standardisation)
scaler = StandardScaler()
X = scaler.fit_transform(X)

### Split the data

In [65]:
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
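
One caveat about the order of the two steps above: fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. The following is a minimal sketch of the alternative, written in plain Python with made-up numbers. The helpers `standardize_fit` and `standardize_apply` are hypothetical and not part of scikit-learn; they only illustrate learning the mean and standard deviation from the training rows and reusing them for the test rows. With scikit-learn the same effect is usually obtained by calling `fit_transform` on `X_train` and `transform` on `X_test`.

```python
import statistics

def standardize_fit(column):
    """Return (mean, std) learned from the training values of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

def standardize_apply(column, mean, std):
    """Scale values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in column]

# Toy CO concentrations (illustrative numbers only, not from the dataset)
co_train = [700.0, 850.0, 900.0, 830.0]
co_test = [660.0, 1000.0]

mean, std = standardize_fit(co_train)                    # learned from training rows only
co_train_scaled = standardize_apply(co_train, mean, std)
co_test_scaled = standardize_apply(co_test, mean, std)   # reuse the statistics, do not refit

print(co_train_scaled)
print(co_test_scaled)
```

The population standard deviation (`pstdev`) is used here because it matches what `StandardScaler` computes.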


In [66]:
X_train

array([[ 2.74524674,  4.07667137,  1.58271265, ...,  3.05853296,
         3.40840197,  3.88376539],
       [-0.6461352 , -0.4186191 , -0.11226167, ..., -0.56421195,
        -0.55805232, -0.19491982],
       [-0.59461859, -0.53831987, -0.12899183, ..., -0.29559602,
        -0.29000729, -0.51291208],
       ...,
       [-0.60102613, -0.4186191 , -0.51134569, ..., -0.32624926,
        -0.32848684, -0.51291208],
       [-0.25091823, -0.47624813,  0.85146525, ..., -0.40454428,
        -0.32291167,  0.43901972],
       [ 0.29346624,  0.26409702,  1.44991701, ...,  0.28816965,
         0.23569767,  0.06376841]])

In [67]:
X_test

array([[-0.59103037, -0.56491636, -0.31197795, ..., -0.63352027,
        -0.67720774, -0.4024839 ],
       [-0.38265722, -0.32995751,  0.07037591, ..., -0.41808587,
        -0.45759009, -0.15606547],
       [ 0.36215505,  0.02470211,  0.88492557, ..., -0.19095646,
        -0.1697587 , -0.06608695],
       ...,
       [-0.55155993, -0.09942809,  0.31958558, ..., -0.05800265,
         0.02078065, -0.01394031],
       [ 1.0036778 ,  0.38822723,  0.05364575, ...,  0.3993338 ,
         0.50068183,  0.29689456],
       [ 1.59957887,  0.88474804,  0.33631574, ...,  0.83192608,
         0.77561382,  0.64658379]])

In [68]:
y_train

17799    5
19980    2
8047     4
6745     2
29311    2
        ..
16850    5
6265     3
11284    4
860      4
15795    5
Name: aqi, Length: 27050, dtype: int64

In [69]:
y_test

14673    2
12610    4
10838    5
9273     4
1275     4
        ..
103      2
20024    4
11679    5
16604    5
15067    5
Name: aqi, Length: 6763, dtype: int64

### 4.4.1. Decision Tree

- A Decision Tree is a supervised machine learning model used for classification tasks. It works by splitting data into subsets based on feature values, creating a hierarchical tree structure. Each internal node represents a feature or attribute, each branch corresponds to a decision rule, and each leaf node represents a class label. At each step, the model chooses the feature and splitting criterion that best separates the classes, commonly using metrics like Gini Impurity or Entropy (Information Gain) to evaluate the quality of splits.

- The tree continues to grow until a stopping condition is met, such as reaching a maximum depth, a minimum number of samples in a node, or achieving subsets that are pure (all samples belong to the same class). During prediction, the model traverses the tree from the root to a leaf based on the input feature values, assigning the corresponding class label at the leaf.
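
To make the split criteria mentioned above concrete, here is a small sketch that computes Gini impurity, entropy, and information gain for a toy node. The labels and the split are invented for illustration; scikit-learn performs the equivalent computation internally for every candidate split.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with AQI labels, split by a hypothetical threshold on one pollutant
parent = [2, 2, 2, 5, 5, 5]
left, right = [2, 2, 2], [5, 5, 5]   # a perfect split: each child is pure

print(gini(parent))                            # 0.5
print(entropy(parent))                         # 1.0
print(information_gain(parent, left, right))   # 1.0
```

A pure child node has zero impurity under both measures, so a split that separates the two classes completely recovers the full parent entropy as information gain.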

We constructed a Decision Tree model with a relatively simple structure. The model used the entropy impurity criterion, with a maximum tree depth of 5, a minimum of 2 samples required to split a node, and at least 1 sample per leaf node.

In [70]:
# Init Decision Tree
decision_tree_model = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_split=2, min_samples_leaf=1)

# train Decision Tree model
decision_tree_model.fit(X_train, y_train)

In [71]:
# Cross-validation (split the training data into 5 folds)
cross_validation_scores = cross_val_score(decision_tree_model, X_train, y_train, cv=5)
print("Accuracy ( using CV ):", cross_validation_scores)

Accuracy ( using CV ): [0.78262477 0.79057301 0.78151571 0.78262477 0.78539741]


In [72]:
# Predict on the test set
test_accuracy = decision_tree_model.score(X_test, y_test)
print(f'Accuracy (on test set): {test_accuracy}')

Accuracy (on test set): 0.7826408398639657



- As we can see, cross-validation and evaluation on the test set give very similar accuracy (about 78%). This consistency suggests that the model is not overfitting the training data.
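
The comparison can be made a little more precise by summarizing the fold scores. The sketch below copies the five cross-validation accuracies and the test accuracy from the outputs above and reports their mean, spread, and gap; the variable names are chosen for this example only.

```python
import statistics

# Fold accuracies and test accuracy copied from the outputs above
cv_scores = [0.78262477, 0.79057301, 0.78151571, 0.78262477, 0.78539741]
test_accuracy = 0.7826408398639657

mean_cv = statistics.fmean(cv_scores)
std_cv = statistics.stdev(cv_scores)   # sample standard deviation across folds

print(f"CV accuracy: {mean_cv:.4f} +/- {std_cv:.4f}")
print(f"Gap to test accuracy: {abs(mean_cv - test_accuracy):.4f}")
```

A gap that is small relative to the fold-to-fold spread is what one would expect from a model that generalizes consistently.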


### Classification report:
- Precision: the proportion of samples predicted as a given class that actually belong to that class.
- Recall: the proportion of samples that actually belong to a given class that the classifier correctly identifies.
- F1-score: the harmonic mean of precision and recall. F1-score = 2 * (precision * recall) / (precision + recall).
- Support: the number of actual data points in each class.
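
As an illustration of these definitions, the sketch below computes the four quantities for one class from raw labels, treating that class as "positive" and all others as "negative". The helper `per_class_metrics` and the toy labels are assumptions made for this example; `classification_report` does the same for every class at once.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class, treated one-vs-rest."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn   # support = number of true samples of the class

# Toy labels (illustrative, not taken from the dataset)
y_true = [3, 3, 3, 3, 4, 4, 2, 2]
y_pred = [3, 3, 4, 2, 3, 4, 2, 2]

precision, recall, f1, support = per_class_metrics(y_true, y_pred, label=3)
print(precision, recall, f1, support)
```

Here class 3 has 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.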




In [73]:
# Predict on test 
y_pred_tree = decision_tree_model.predict(X_test)
print(classification_report(y_test, y_pred_tree))

              precision    recall  f1-score   support

           1       0.88      0.94      0.91       193
           2       0.84      0.94      0.89      1545
           3       0.56      0.36      0.44      1151
           4       0.60      0.65      0.63      1446
           5       0.91      0.95      0.93      2428

    accuracy                           0.78      6763
   macro avg       0.76      0.77      0.76      6763
weighted avg       0.77      0.78      0.77      6763



- Classes 1 and 5:
    - High performance, with F1-scores of 0.91 and 0.93 and recall of 0.94 and 0.95. The model classifies these two classes effectively, with little misclassification.
- Class 2:
    - Precision is 0.84 and recall is 0.94 (F1-score 0.89), showing that the model handles this class quite well.
- Classes 3 and 4:
    - The performance is significantly lower for these classes (F1-score 0.44 for Class 3 and 0.63 for Class 4). Class imbalance is an unlikely cause, since Class 1 has the smallest support (193) yet scores well; overlapping pollutant ranges between neighbouring AQI levels, or features that do not separate these classes well, are a more plausible explanation.

**GridSearchCV**
- We will use GridSearchCV for hyperparameter tuning

In [None]:

# Init Decision Tree
DT_model = DecisionTreeClassifier()

# Define the grid of hyperparameters to search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15, 20, 25],
    'min_samples_split': [2, 5, 15, 25],  # an integer value must be >= 2; 1 is rejected by scikit-learn
    'min_samples_leaf': [1, 2, 3, 4, 5]
}

#  Perform grid search with cross-validation
gridcv_decision_tree = GridSearchCV(DT_model, param_grid, cv=5, scoring='accuracy')
gridcv_decision_tree.fit(X_train, y_train)

print("Best Parameters:", gridcv_decision_tree.best_params_)
print("Best Accuracy:", gridcv_decision_tree.best_score_)

Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Accuracy: 0.7947504621072089


Now that we have the best hyperparameters, we apply them to our model.

In [75]:
# Evaluate the model on the test set using the best hyperparameters
best_model_tree = gridcv_decision_tree.best_estimator_
y_pred = best_model_tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy}')

Test Accuracy: 0.7929912760609197


### Now we will apply this model to a sample

In [None]:
# Create sample
sample = {
    'no': 0.09,
    'co': 500.76,
    'so2': 26,
    'no2': 14.43,
    'pm2_5': 22.5,
    'pm10': 35.2,
    'nh3': 12,
}


In [76]:
# Predict AQI for the sample
predict_df = pd.DataFrame([sample])
new_sample = scaler.transform(predict_df)  # reuse the scaler fitted earlier
predicted_aqi = best_model_tree.predict(new_sample)

print(f'Predicted AQI: {predicted_aqi}')

Predicted AQI: [2]


### 4.4.2 Random Forest

- Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to solve classification and regression problems. It is widely used because combining many trees reduces overfitting and usually improves accuracy over a single tree.

- How It Works: 
    - Multiple training subsets are drawn from the original dataset using the bootstrap sampling method (random sampling with replacement).
    - Each decision tree is built on a different bootstrap sample, considering only a random subset of features at each split.
    - Using different data samples and features reduces the correlation among the trees.
    - Classification: For each data point, the trees in the forest make individual predictions. The final output is determined by majority vote.
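
The two ingredients described above, bootstrap sampling and majority voting, can be sketched in a few lines of plain Python. The helper names, row indices, and tree votes below are hypothetical and serve only to illustrate the idea. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, which usually leads to the same label.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most frequent class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))                         # stand-in for training row indices

sample = bootstrap_sample(rows, rng)           # some rows repeat, some are left out
out_of_bag = sorted(set(rows) - set(sample))   # rows this tree never sees
print(sample, out_of_bag)

# Hypothetical votes from five trees for one data point
tree_votes = [4, 3, 4, 4, 5]
print(majority_vote(tree_votes))               # 4
```

Because each tree sees a different resampled dataset, the trees make partly different errors, and combining them smooths those errors out.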

In [101]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
# Init Random Forest model
rf_model = RandomForestClassifier(random_state=42)
# Train Random Forest model
rf_model.fit(X_train, y_train)

# Cross-validation on train set
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5)
print("Cross-Validation Scores:", cv_scores)

#  Predict on the test set
test_accuracy = rf_model.score(X_test, y_test)
print(f'Test Accuracy: {test_accuracy}')

Cross-Validation Scores: [0.80924214 0.82107209 0.81774492 0.81441774 0.8168207 ]
Test Accuracy: 0.8181280496820937


### Classification

In [92]:
# Predict on test 
y_pred_RF = rf_model.predict(X_test)
print(classification_report(y_test, y_pred_RF))

              precision    recall  f1-score   support

           1       0.95      0.92      0.93       193
           2       0.89      0.94      0.92      1545
           3       0.65      0.53      0.59      1151
           4       0.66      0.68      0.67      1446
           5       0.91      0.95      0.93      2428

    accuracy                           0.82      6763
   macro avg       0.81      0.80      0.81      6763
weighted avg       0.81      0.82      0.81      6763



- Class 1: High precision (0.95) and recall (0.92) demonstrate that the model performs well in identifying this class, with few false positives.
- Class 2: Strong performance with a recall of 0.94 and an F1-score of 0.92, indicating the model correctly identifies most data points in this class.
- Class 3: The weakest performance, with an F1-score of 0.59. Both precision (0.65) and recall (0.53) are low, suggesting the model struggles to classify this class accurately. This could be due to overlapping features with other classes or insufficient class-specific data.
- Class 4: Moderate performance with an F1-score of 0.67. Recall (0.68) is slightly higher than precision (0.66), indicating some false positives for this class.
- Class 5: Excellent performance with an F1-score of 0.93 and recall of 0.95, showing that the model accurately identifies most instances of this class.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score


# Init RandomForest model
RF_model = RandomForestClassifier(random_state=42)

# Define the grid of hyperparameters to search 
param_grid = {
    'n_estimators': [20, 40, 50],
    'max_depth': [2,5,7,9],
    'min_samples_split': [2,5,7],
    'min_samples_leaf': [1, 2, 3, 4],
    'max_features': ['sqrt', 'log2'],    
    'bootstrap': [True, False]   
}

#  Perform grid search with cross-validation
gridcv_RF = GridSearchCV(RF_model, param_grid, cv=5, scoring='accuracy')
gridcv_RF.fit(X_train, y_train)

print("Best Parameters:", gridcv_RF.best_params_)
print("Best Accuracy:", gridcv_RF.best_score_)


Best Parameters: {'bootstrap': False, 'max_depth': 9, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 40}
Best Accuracy: 0.8082809611829944
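
To get a feel for the cost of this search, the sketch below counts the parameter combinations in the grid above and the number of model fits that 5-fold cross-validation implies. It uses only the standard library; the grid is copied from the cell above.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_estimators': [20, 40, 50],
    'max_depth': [2, 5, 7, 9],
    'min_samples_split': [2, 5, 7],
    'min_samples_leaf': [1, 2, 3, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
}
cv_folds = 5

candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds   # excludes the final refit on the full training set

print(len(candidates))   # 576
print(n_fits)            # 2880
```

Every `max_depth` value in this grid is 9 or less, while the default forest trained earlier grows unrestricted trees. That may explain why the best grid score (about 0.808) is slightly below the default model's cross-validation scores (about 0.81 to 0.82).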


In [98]:
# Evaluate the tuned Random Forest on the test set
best_model_RF = gridcv_RF.best_estimator_
y_pred = best_model_RF.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy}')

Test Accuracy: 0.8002365813987875


In [99]:
# Create new sample
sample_RF = {
    'no': 0.09,
    'co': 300.76,
    'so2': 16,
    'no2': 14.43,
    'pm2_5': 22.5,
    'pm10': 35.2,
    'nh3': 12,
}

In [100]:
# Predict AQI for the new sample
new_df = pd.DataFrame([sample_RF])
new_sample = scaler.transform(new_df)
predicted_aqi = best_model_RF.predict(new_sample)
print(f' AQI: {predicted_aqi}')

 AQI: [2]


### 4.4.3. SVM Model
The Support Vector Machine (SVM) model is a supervised learning algorithm primarily used for classification and regression problems. SVM focuses on finding the best decision boundary to separate different data classes.

How SVM Works:

- Finding the Decision Boundary:

   - For multi-dimensional data, SVM aims to identify a decision boundary (hyperplane) that separates data points into different classes. With two features, this boundary is a straight line; in higher-dimensional spaces, it becomes a hyperplane.
- Optimizing the Boundary:

   - SVM seeks the optimal boundary by choosing the one that maximizes the distance (margin) to the closest data points. These closest data points are known as support vectors.
- Kernel Trick:
   - When the data cannot be separated linearly, SVM uses the kernel trick to implicitly map the data into a higher-dimensional feature space where it can be linearly separated.
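
The kernel trick can be illustrated with a small, self-contained check. For a degree-2 polynomial kernel on 2-D points, the kernel value computed from the original points equals the inner product of the explicitly mapped 3-D feature vectors. The functions `phi` and `poly_kernel` are written for this example only; the models in this section use `kernel='linear'`, so this is a side illustration rather than part of the pipeline.

```python
from math import sqrt

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(x), phi(z))   # inner product computed in the 3-D feature space
implicit = poly_kernel(x, z)     # same quantity computed from the original 2-D points

print(explicit, implicit)
```

Both values agree up to floating-point rounding, which is why an SVM can work in the higher-dimensional space without ever constructing the mapped vectors.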

   

### Training model

In [None]:
#init SVM model
svm_model = SVC(kernel='linear', C=5, random_state=42)
svm_model.fit(X_train, y_train)
# cross validation on train set 
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=5)
print("Cross-Validation Scores:", cv_scores)

test_accuracy = svm_model.score(X_test, y_test)
print(f'Test Accuracy: {test_accuracy}')

Cross-Validation Scores: [0.75656192 0.76303142 0.75951941 0.76118299 0.76469501]
Test Accuracy: 0.7530681650155256


### Classification

In [104]:

y_pred_svm = svm_model.predict(X_test)
# classification_report
print("\nclassification_report:")
print(classification_report(y_test,y_pred_svm))



classification_report:
              precision    recall  f1-score   support

           1       0.83      0.76      0.79       193
           2       0.84      0.89      0.86      1545
           3       0.49      0.48      0.49      1151
           4       0.57      0.56      0.56      1446
           5       0.92      0.91      0.91      2428

    accuracy                           0.75      6763
   macro avg       0.73      0.72      0.72      6763
weighted avg       0.75      0.75      0.75      6763



- Class 1: Precision (0.83) is reasonably high, but recall (0.76) is lower, indicating that while most predicted Class 1 instances are correct, some true instances of Class 1 are missed by the model.
- Class 2: Good performance with both precision (0.84) and recall (0.89), resulting in an F1-score of 0.86. The model performs well in identifying most instances of this class.
- Class 3: The lowest performance with precision (0.49) and recall (0.48), leading to a low F1-score of 0.49. The model struggles to accurately classify Class 3, likely due to data overlap or insufficient feature differentiation between classes.
- Class 4: Moderate performance with precision (0.57) and recall (0.56), resulting in an F1-score of 0.56. The model faces challenges distinguishing this class from others, possibly due to ambiguous class boundaries.
- Class 5: High performance with precision (0.92) and recall (0.91), leading to a strong F1-score of 0.91. The model effectively identifies most instances of this class.

### Grid Search CV

In [105]:
# Create an SVM model
svm = SVC(kernel='linear', random_state=42)

# Define the values of C to try
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Create the GridSearchCV object
svm_grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV on the training data
svm_grid_search.fit(X_train, y_train)

# Print the best value of C
print("Best C:", svm_grid_search.best_params_['C'])

# Print the best mean cross-validated accuracy
print("Best Accuracy:", svm_grid_search.best_score_)

Best C: 10
Best Accuracy: 0.7619223659889094


In [107]:
# Create sample
sample = {
    'no': 0.09,
    'co': 300.76,
    'so2': 16,
    'no2': 14.43,
    'pm2_5': 22.5,
    'pm10': 35.2,
    'nh3': 12,
}

Now we will apply the tuned SVM model to this sample.

In [109]:
best_svm_model = svm_grid_search.best_estimator_
new_df = pd.DataFrame([sample ])
new_sample = scaler.transform(new_df)

# Predict aqi
predicted_aqi =  best_svm_model.predict(new_sample)

print(f'Predicted AQI: {predicted_aqi}')

Predicted AQI: [2]


## Summary

### The comparison between models:

1. Decision Tree:
    - Advantages:

        - Easy to understand and visualize. Decision rules are transparent.
        - Can model non-linear relationships between features.
        - Does not require normalization or standardization of data.
    - Disadvantages:

        - Overfitting: Prone to overfitting, especially with deep trees and small datasets.
        - Instability: Small changes in the data can lead to large changes in the model structure.
        - Tends to be biased toward classes with more data points, leading to poor performance on imbalanced datasets.
2. Random Forest (RF):
    - Advantages:

        - By combining multiple decision trees, Random Forest reduces overfitting compared to a single decision tree.
        - Typically yields better performance and generalization than a single Decision Tree.
        - Scalable to large datasets.
    - Disadvantages:
        - Less interpretable than a single decision tree due to the ensemble nature.
        - Requires more memory and processing power, especially with a large number of trees.
        - Training multiple trees can be time-consuming.
3. SVM (Support Vector Machine):
    - Advantages:
        - SVM is powerful when dealing with high-dimensional data.
        - Performs well with small to medium-sized datasets, especially when the classes are well-separated.
        - By maximizing the margin, SVM tends to avoid overfitting, especially with the use of regularization (C parameter).
        - The ability to apply kernels allows SVM to work on non-linearly separable data.
    - Disadvantages:

        - Training time is high, especially for large datasets.
        - The performance depends heavily on the choice of kernel, regularization, and other hyperparameters.
        - SVM models are not easily interpretable, especially in high-dimensional spaces.
        - SVM is sensitive to feature scaling and may perform poorly without proper normalization or standardization.
