# Predicting Dengue Case Severity

This notebook trains and compares several machine learning models that classify the severity of dengue cases from weather data.

## Import Libraries and Dataset

The dataset was obtained from https://www.kaggle.com/datasets/siddhvr/dengue-prediction from the author Siddhvr.<br><br>

The dataset has the following features:

**serial**: It represents a unique identifier for each data entry. It is likely used for tracking or indexing purposes.

**tempmax**: It refers to the maximum temperature recorded for a specific time period.

**tempmin**: It represents the minimum temperature recorded for a specific time period.

**temp**: It denotes the average temperature during the specified time period.

**feelslikemax**: It indicates the maximum "feels like" temperature, which takes into account factors such as humidity and wind to estimate how the temperature actually feels.

**feelslikemin**: It represents the minimum "feels like" temperature during the specified time period.

**feelslike**: It denotes the average "feels like" temperature, which is an estimation of how the temperature feels to humans.

**dew**: It refers to the dew point, which is the temperature at which air becomes saturated with moisture, leading to the formation of dew.

**humidity**: It represents the relative humidity, indicating the amount of moisture present in the air relative to the maximum amount it could hold at that temperature.

**precip**: It denotes the total precipitation (rainfall) recorded during the specified time period.

**precipprob**: It represents the probability of precipitation occurring during the specified time period.

**precipcover**: It indicates the coverage or extent of precipitation in the given area during the specified time period.

**snow**: It represents the amount of snowfall recorded during the specified time period.

**snowdepth**: It denotes the depth of snow on the ground during the specified time period.

**windspeed**: It represents the speed of wind recorded during the specified time period.

**winddir**: It indicates the direction from which the wind is blowing during the specified time period.

**sealevelpressure**: It refers to the atmospheric pressure at sea level during the specified time period.

**cloudcover**: It represents the extent of cloud cover during the specified time period.

**visibility**: It denotes the horizontal visibility, indicating how far an observer can see clearly during the specified time period.

**solarradiation**: It represents the amount of solar radiation received during the specified time period.

**solarenergy**: It indicates the solar energy level during the specified time period.

**uvindex**: It represents the UV index, which is a measure of the intensity of ultraviolet (UV) radiation from the sun.

**conditions**: It represents the weather conditions during the specified time period, such as sunny, cloudy, rainy, etc.

**stations**: It refers to the number of weather stations used to collect the data for the given region.

**cases**: It denotes the number of dengue cases recorded in the specified region during the specified time period.

**labels**: It represents the severity of dengue cases.
<br><br>
The model will try to predict the severity level of dengue cases.

Install the necessary dependencies:

```
pip install xgboost catboost numpy pandas scikit-learn
```

In [1]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn import svm
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

data = pd.read_csv('dengue-dataset.csv')

## Data Preprocessing

#### Handle missing values

Each feature is checked for missing values. Any row containing an empty value would be removed; the output below shows that no column has missing values, so no rows need to be dropped.

In [2]:
# Check for empty values in the DataFrame
print(data.isnull().any())

serial              False
tempmax             False
tempmin             False
temp                False
feelslikemax        False
feelslikemin        False
feelslike           False
dew                 False
humidity            False
precip              False
precipprob          False
precipcover         False
snow                False
snowdepth           False
windspeed           False
winddir             False
sealevelpressure    False
cloudcover          False
visibility          False
solarradiation      False
solarenergy         False
uvindex             False
conditions          False
stations            False
cases               False
labels              False
dtype: bool
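Since the check reports no missing values, nothing is removed in this notebook. For a dataset that does contain gaps, the removal step described above could be sketched as follows. The toy DataFrame and its values are invented purely for illustration and are not taken from the dengue dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the dengue dataset) with two incomplete rows
toy = pd.DataFrame({
    "temp": [30.1, np.nan, 28.4, 29.0],
    "humidity": [80.0, 75.0, np.nan, 70.0],
    "cases": [120, 95, 60, 150],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
cleaned = toy.dropna().reset_index(drop=True)
print(cleaned)
```

Dropping rows is the simplest option; imputation would be an alternative when too many rows are affected.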


#### Feature scaling

We decided to normalize the numerical features using RobustScaler because it is not sensitive to outliers and it can be used with data that is not normally distributed.
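To make the outlier argument concrete, the sketch below compares `RobustScaler` with a manual computation. With default settings it subtracts the median and divides by the interquartile range, so a single extreme value does not distort the scale of the remaining values. The five sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme outlier (100.0)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler with default settings centers on the median and
# scales by the interquartile range (75th minus 25th percentile)
scaled = RobustScaler().fit_transform(values)

# The same computation done by hand for comparison
median = np.median(values, axis=0)
q25, q75 = np.percentile(values, [25, 75], axis=0)
manual = (values - median) / (q75 - q25)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```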

In [3]:
# Select the columns to exclude from normalization
target_column = 'labels'
columns_to_exclude = ['serial', 'cases', target_column]

# Create a DataFrame with the columns to normalize
data_to_normalize = data.drop(columns=columns_to_exclude)

# Create an instance of RobustScaler
scaler = RobustScaler()

# Normalize the data
normalized_data = scaler.fit_transform(data_to_normalize)

# Convert the normalized data back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data_to_normalize.columns)

# Combine the excluded columns with the normalized DataFrame
data = pd.concat([data[columns_to_exclude], normalized_df], axis=1)

# Print the final data
print(data)

     serial    cases  labels   tempmax   tempmin      temp  feelslikemax   
0         0   4925.0  normal  0.658224 -0.158157  0.274700      0.215525  \
1         1   5077.0  normal  0.667463  0.210397  0.530576      0.466383   
2         2   7579.0  normal  0.803912  0.126632  0.551810      0.327655   
3         3  13706.0  normal  0.369322  0.024390  0.280731     -0.082849   
4         4     82.0  normal -0.290636 -0.233062 -0.381661     -0.287779   
..      ...      ...     ...       ...       ...       ...           ...   
597     597   6729.0  normal  0.167832 -0.181818  0.203516     -0.396825   
598     598  10541.0  normal  0.279720  0.424242  0.474871     -0.333333   
599     599   6396.0  normal  0.363636  0.393939  0.644467      0.333333   
600     600  10883.0  normal  0.951049  0.545455  0.915822      0.714286   
601     601   7311.0  normal  0.643357  0.393939  0.780145     -0.047619   

     feelslikemin  feelslike       dew  ...  windspeed   winddir   
0        0.093473  

#### Categorical encoding

Values in the **labels** feature are mapped to the following integers:

**Severe Risk**: 4,
**High Risk**: 3,
**Moderate Risk**: 2,
**Low Risk**: 1,
**Minimal to No risk**: 0

In [4]:
# Define label-value mapping
desired_order = ['Minimal to No risk', 'Low Risk', 'Moderate Risk', 'High Risk', 'Severe Risk']
label_mapping = {
    'Severe Risk': 4,
    'High Risk': 3,
    'Moderate Risk': 2,
    'Low Risk': 1,
    'Minimal to No risk': 0
}

# Create an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fit LabelEncoder with the unique labels from mapping dictionary
label_encoder.fit(desired_order)

# Encode the labels using the mapping dictionary
data['labels'] = data['labels'].map(label_mapping)

data



Unnamed: 0,serial,cases,labels,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,...,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,conditions,stations
0,0,4925.0,1,0.658224,-0.158157,0.274700,0.215525,0.093473,0.119011,-0.716338,...,0.007704,-0.268797,0.111727,-0.502018,0.332089,0.029927,0.033385,0.116438,0.279452,0.197260
1,1,5077.0,2,0.667463,0.210397,0.530576,0.466383,0.923721,0.431524,-0.480329,...,-0.087636,-0.415010,-0.424796,-0.596364,-0.177595,0.218941,0.223314,0.252033,-0.170732,-0.008130
2,2,7579.0,2,0.803912,0.126632,0.551810,0.327655,0.458871,0.234002,-0.895991,...,-0.052797,-0.039668,-0.155413,-0.313134,0.586795,0.301629,0.306306,0.414634,-0.044715,0.170732
3,3,13706.0,3,0.369322,0.024390,0.280731,-0.082849,0.351028,0.040200,-1.276890,...,0.079110,-0.201401,-0.446116,-0.522941,-0.117634,0.250756,0.258258,0.296748,-0.349593,-0.853659
4,4,82.0,0,-0.290636,-0.233062,-0.381661,-0.287779,-0.216643,-0.377284,-0.144659,...,0.341872,0.141840,0.367114,-0.703788,0.386772,-0.371592,-0.371341,-0.406504,-0.516260,2.951220
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
597,597,6729.0,2,0.167832,-0.181818,0.203516,-0.396825,-0.176471,-0.044280,-0.565181,...,0.130525,-0.402713,0.232821,-0.515504,-0.056299,0.595916,0.619347,1.000000,-0.500000,0.000000
598,598,10541.0,3,0.279720,0.424242,0.474871,-0.333333,0.411765,0.147601,-0.887020,...,0.087328,0.217889,0.416627,-0.511628,-0.056299,0.470997,0.470106,0.500000,-0.500000,0.000000
599,599,6396.0,2,0.363636,0.393939,0.644467,0.333333,0.382353,0.442804,-0.289319,...,0.294671,1.162357,0.465642,-1.286822,-0.056299,-0.133205,-0.126854,0.500000,-0.500000,0.000000
600,600,10883.0,3,0.951049,0.545455,0.915822,0.714286,1.205882,0.531365,-0.565181,...,-0.076819,-1.068249,0.330851,-1.201550,0.056299,-0.231356,-0.261170,0.000000,-0.500000,0.000000
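`Series.map` with a dictionary returns `NaN` for any value that is not a key in that dictionary, so a quick check after encoding can catch spelling or capitalization mismatches. The following is a minimal sketch of such a check; the sample series is hypothetical and its last entry is deliberately misspelled.

```python
import pandas as pd

label_mapping = {
    "Severe Risk": 4,
    "High Risk": 3,
    "Moderate Risk": 2,
    "Low Risk": 1,
    "Minimal to No risk": 0,
}

# Hypothetical labels; the last entry deliberately does not match any key
raw_labels = pd.Series(["Low Risk", "Severe Risk", "Minimal to No risk", "high risk"])

encoded = raw_labels.map(label_mapping)

# Values missing from the dictionary become NaN, so list them explicitly
unmapped = raw_labels[encoded.isnull()].unique().tolist()
print("Unmapped labels:", unmapped)
```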


#### Splitting the Dataset

Each model is trained on the training set and evaluated on the held-out test set (train set: 76%, test set: 24%).

In [5]:
# Select the columns to exclude from the train set
target_column = 'labels'
columns_to_exclude = ['serial', 'cases',target_column]

# Assign X to train variable and y to target variable
X = data.drop(columns=columns_to_exclude)
y = data[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=42)
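
A possible variation, which this notebook does not use, is to stratify the split so that each severity class keeps the same proportion in both subsets, and to fit the scaler on the training subset only so that test-set statistics do not influence preprocessing. The sketch below runs on synthetic stand-in data; the array shapes, class sizes, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (assumption): 200 samples, 4 features, 5 imbalanced classes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = np.repeat([0, 1, 2, 3, 4], [50, 30, 60, 40, 20])

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.24, random_state=42, stratify=y_demo
)

# Fit the scaler on the training subset only, then apply it to both subsets
scaler_demo = RobustScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts: ", np.bincount(y_te))
```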

## Model Training

In [6]:
# Create the models and fit them to the training data
XGBmodel = xgb.XGBClassifier()
SVMmodel = svm.SVC()
CBmodel = CatBoostClassifier()
RFmodel = RandomForestClassifier()
MLPmodel = MLPClassifier()

RFmodel.fit(X_train, y_train)
XGBmodel.fit(X_train, y_train)
SVMmodel.fit(X_train, y_train)
CBmodel.fit(X_train, y_train)
MLPmodel.fit(X_train, y_train)

Learning rate set to 0.076029
0:	learn: 1.5674598	total: 149ms	remaining: 2m 29s
1:	learn: 1.5243433	total: 154ms	remaining: 1m 16s
2:	learn: 1.4896486	total: 159ms	remaining: 52.7s
3:	learn: 1.4573383	total: 163ms	remaining: 40.5s
4:	learn: 1.4268021	total: 167ms	remaining: 33.3s
5:	learn: 1.3979126	total: 171ms	remaining: 28.3s
6:	learn: 1.3731625	total: 175ms	remaining: 24.8s
7:	learn: 1.3496267	total: 179ms	remaining: 22.2s
8:	learn: 1.3308840	total: 183ms	remaining: 20.1s
9:	learn: 1.3075892	total: 187ms	remaining: 18.5s
10:	learn: 1.2892089	total: 190ms	remaining: 17.1s
11:	learn: 1.2706920	total: 194ms	remaining: 16s
12:	learn: 1.2493734	total: 198ms	remaining: 15s
13:	learn: 1.2346830	total: 202ms	remaining: 14.2s
14:	learn: 1.2143706	total: 206ms	remaining: 13.5s
15:	learn: 1.1986530	total: 209ms	remaining: 12.9s
16:	learn: 1.1775384	total: 213ms	remaining: 12.3s
17:	learn: 1.1629958	total: 217ms	remaining: 11.8s
18:	learn: 1.1468049	total: 221ms	remaining: 11.4s
19:	learn: 1.



## Model Evaluation

#### Make predictions on the test sets and compare errors and accuracy

In [7]:
# CatBoost
cb_y_pred = CBmodel.predict(X_test)
cb_mse = mean_squared_error(y_test, cb_y_pred)
cb_accuracy = accuracy_score(y_test, cb_y_pred)
print('Catboost Mean Squared Error:', cb_mse)
print("Catboost RMSE",np.sqrt(mean_squared_error(y_test,cb_y_pred)))
print('Catboost MAE:', mean_absolute_error(y_test,cb_y_pred))
print('Catboost Accuracy: ', cb_accuracy)

# Random Forest Classifier
rf_y_pred = RFmodel.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print('RF Mean Squared Error:', rf_mse)
print("RF RMSE",np.sqrt(mean_squared_error(y_test,rf_y_pred)))
print('RF MAE:', mean_absolute_error(y_test,rf_y_pred))
print('RF Accuracy: ', rf_accuracy)

# XGBoost
xgb_y_pred = XGBmodel.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
xgb_accuracy = accuracy_score(y_test, xgb_y_pred)
print('XGB Mean Squared Error:', xgb_mse)
print("XGB RMSE",np.sqrt(mean_squared_error(y_test,xgb_y_pred)))
print('XGB MAE:', mean_absolute_error(y_test,xgb_y_pred))
print('XGB Accuracy: ', xgb_accuracy)

# Support Vector Machine
svm_y_pred = SVMmodel.predict(X_test)
svm_mse = mean_squared_error(y_test, svm_y_pred)
svm_accuracy = accuracy_score(y_test, svm_y_pred)
print('SVM Mean Squared Error:', svm_mse)
print("SVM RMSE",np.sqrt(mean_squared_error(y_test,svm_y_pred)))
print('SVM MAE:', mean_absolute_error(y_test,svm_y_pred))
print('SVM Accuracy: ', svm_accuracy)

# Multilayer Perceptron
mlp_y_pred = MLPmodel.predict(X_test)
mlp_mse = mean_squared_error(y_test, mlp_y_pred)
mlp_accuracy = accuracy_score(y_test, mlp_y_pred)
print('MLP Mean Squared Error:', mlp_mse)
print("MLP RMSE",np.sqrt(mean_squared_error(y_test,mlp_y_pred)))
print('MLP MAE:', mean_absolute_error(y_test,mlp_y_pred))
print('MLP Accuracy: ', mlp_accuracy)


Catboost Mean Squared Error: 0.8827586206896552
Catboost RMSE 0.9395523512235255
Catboost MAE: 0.5793103448275863
Catboost Accuracy:  0.5586206896551724
RF Mean Squared Error: 0.7862068965517242
RF RMSE 0.8866830868758714
RF MAE: 0.5517241379310345
RF Accuracy:  0.5517241379310345
XGB Mean Squared Error: 0.8344827586206897
XGB RMSE 0.9135002783911397
XGB MAE: 0.5724137931034483
XGB Accuracy:  0.5448275862068965
SVM Mean Squared Error: 1.1379310344827587
SVM RMSE 1.0667385033281394
SVM MAE: 0.6827586206896552
SVM Accuracy:  0.503448275862069
MLP Mean Squared Error: 1.013793103448276
MLP RMSE 1.0068729331193067
MLP MAE: 0.6413793103448275
MLP Accuracy:  0.5172413793103449


## Improve the model by hyperparameter tuning

To further improve the accuracy and reduce the error metrics, GridSearchCV is used to find the best hyperparameters for each model. Features that are irrelevant to the prediction (`serial` and `cases`) are excluded as well.

In [8]:
# Select the columns to exclude from the train set
target_column = 'labels'
columns_to_exclude = ['serial','cases', target_column]

# Assign X to train variable and y to target variable
X = data.drop(columns=columns_to_exclude)
y = data[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=42)

#### Random Forest Classifier

Random Forest is an ensemble learning technique that combines multiple decision trees to make predictions. It can handle both numerical and categorical variables, and it can capture non-linear relationships and interactions between weather variables and dengue cases. Averaging over many trees makes it less prone to overfitting than a single decision tree, and some implementations can handle missing values. It also provides feature importance measures, which help identify the most influential weather variables.
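
As a rough illustration of the feature importance measures mentioned above, the sketch below fits a forest on synthetic data and ranks the features by scikit-learn's impurity-based `feature_importances_` attribute. The dataset and the feature names are placeholders, not the dengue data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 6 features, of which 3 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# feature_importances_ holds impurity-based importances that sum to 1
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Applied to this notebook, the same attribute could be read from a fitted forest and paired with `X_train.columns`.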

In [9]:
RFmodel = RandomForestClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'n_estimators': [100, 200, 500],        # Number of trees
    'max_depth': [None, 5, 10],              # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],         # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],           # Minimum number of samples required to be at a leaf node
    'bootstrap': [True, False]               # Whether bootstrap samples are used when building trees
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=RFmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
rf_best_model = grid_search.best_estimator_

# Evaluate the model:
rf_y_pred = rf_best_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print('RF Mean Squared Error:', rf_mse)
print("RF RMSE",np.sqrt(mean_squared_error(y_test,rf_y_pred)))
print('RF MAE:', mean_absolute_error(y_test,rf_y_pred))
print('RF Accuracy: ', rf_accuracy)

Best Parameters:  {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best Score:  0.53179646440516
RF Mean Squared Error: 0.8137931034482758
RF RMSE 0.9021048184375671
RF MAE: 0.5793103448275863
RF Accuracy:  0.5310344827586206
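
An exhaustive grid search evaluates every combination of the listed values, so its cost grows multiplicatively with each added hyperparameter. The short sketch below counts the work implied by the Random Forest grid above; the variable names `rf_param_grid` and `cv_folds` are chosen here for illustration.

```python
from math import prod

# Same values as the Random Forest grid above
rf_param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}
cv_folds = 5

# An exhaustive grid search tries every combination of the listed values
n_candidates = prod(len(values) for values in rf_param_grid.values())

# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set.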


In [10]:
# Random Forest Confusion Matrix
print("RF Confusion Matrix: ")
print(confusion_matrix(y_test, rf_y_pred), "\n")

# Random Forest Classification Report
print("RF Classification Report: ")
print(classification_report(y_test, rf_y_pred))

RF Confusion Matrix: 
[[34  1  0  0  0]
 [ 0  7  7  6  0]
 [ 0  8 15 17  1]
 [ 0  1 11 20  5]
 [ 0  1  6  4  1]] 

RF Classification Report: 
              precision    recall  f1-score   support

           0       1.00      0.97      0.99        35
           1       0.39      0.35      0.37        20
           2       0.38      0.37      0.37        41
           3       0.43      0.54      0.48        37
           4       0.14      0.08      0.11        12

    accuracy                           0.53       145
   macro avg       0.47      0.46      0.46       145
weighted avg       0.52      0.53      0.52       145
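
The `support` column also gives a reference point for judging these accuracies: a model that always predicts the most frequent class. The sketch below computes that baseline from the class counts in the report above. It is offered as context and is not part of the original analysis.

```python
# Test-set class counts, copied from the "support" column of the report above
support = {0: 35, 1: 20, 2: 41, 3: 37, 4: 12}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```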



#### XGBoost

Gradient boosting algorithms such as XGBoost build an ensemble of weak prediction models (usually decision trees) sequentially. Each new model learns from the errors of the previous ones and focuses on the samples that are still predicted poorly. Gradient boosting models can capture complex interactions and non-linear relationships between weather variables and dengue cases. They typically offer high predictive accuracy, and XGBoost can handle missing values natively.

In [11]:
XGBmodel = xgb.XGBClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of boosting rounds
    'max_depth': [3, 5, 7],           # Maximum depth of each tree
    'learning_rate': [0.1, 0.01, 0.001],  # Learning rate
    'subsample': [0.8, 1.0],          # Subsample ratio of the training instances
    'colsample_bytree': [0.8, 1.0]    # Subsample ratio of columns when constructing each tree
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=XGBmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
xgb_best_model = grid_search.best_estimator_

# Evaluate the model:
xgb_y_pred = xgb_best_model.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
xgb_accuracy = accuracy_score(y_test, xgb_y_pred)
print('XGB Mean Squared Error:', xgb_mse)
print("XGB RMSE",np.sqrt(mean_squared_error(y_test,xgb_y_pred)))
print('XGB MAE:', mean_absolute_error(y_test,xgb_y_pred))
print('XGB Accuracy: ', xgb_accuracy)

Best Parameters:  {'colsample_bytree': 0.8, 'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Best Score:  0.5362398471094123
XGB Mean Squared Error: 0.7310344827586207
XGB RMSE 0.8550055454548939
XGB MAE: 0.5103448275862069
XGB Accuracy:  0.593103448275862


In [12]:
# XGBoost Confusion Matrix
print("XGB Confusion Matrix: ")
print(confusion_matrix(y_test, xgb_y_pred), "\n")

# XGBoost Classification Report
print("XGB Classification Report: ")
print(classification_report(y_test, xgb_y_pred))

XGB Confusion Matrix: 
[[33  2  0  0  0]
 [ 0 11  2  7  0]
 [ 0 13 14 14  0]
 [ 0  3  7 24  3]
 [ 0  1  3  4  4]] 

XGB Classification Report: 
              precision    recall  f1-score   support

           0       1.00      0.94      0.97        35
           1       0.37      0.55      0.44        20
           2       0.54      0.34      0.42        41
           3       0.49      0.65      0.56        37
           4       0.57      0.33      0.42        12

    accuracy                           0.59       145
   macro avg       0.59      0.56      0.56       145
weighted avg       0.62      0.59      0.59       145
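
Because the labels are ordinal integers, the error metrics printed for XGBoost can be recovered directly from its confusion matrix: each cell is weighted by how many severity levels separate the true and predicted class. The sketch below does this with NumPy as an illustration of how the numbers relate. It does not replace the scikit-learn metric functions used in the notebook.

```python
import numpy as np

# XGBoost confusion matrix from above (rows: true class, columns: predicted class)
cm = np.array([
    [33,  2,  0,  0,  0],
    [ 0, 11,  2,  7,  0],
    [ 0, 13, 14, 14,  0],
    [ 0,  3,  7, 24,  3],
    [ 0,  1,  3,  4,  4],
])

total = cm.sum()
classes = np.arange(cm.shape[0])

# distance[i, j] = |i - j|: how many severity levels apart truth and prediction are
distance = np.abs(classes[:, None] - classes[None, :])

accuracy = np.trace(cm) / total
mae = (cm * distance).sum() / total
mse = (cm * distance ** 2).sum() / total

print(f"Accuracy: {accuracy:.4f}")
print(f"MAE:      {mae:.4f}")
print(f"MSE:      {mse:.4f}")
```

An MAE of about 0.51 means predictions are off by roughly half a severity level on average.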



#### Support Vector Machine

Support vector machines (SVMs) aim to find an optimal hyperplane that separates the classes in the data, which makes them a candidate for classifying dengue severity from weather variables. They can handle high-dimensional data and, through kernel functions, non-linear relationships. With appropriate regularization they tend to generalize well and resist overfitting.
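
The role of the kernel can be seen on a small synthetic problem. The sketch below, which assumes two concentric rings of points as stand-in data, fits a linear and an RBF SVM and compares their held-out accuracy.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): two concentric rings, not linearly separable
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Fit one SVM per kernel and compare held-out accuracy
scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel, C=1.0, gamma="scale")
    model.fit(X_tr, y_tr)
    scores[kernel] = model.score(X_te, y_te)

print(scores)
```

Which kernel works better depends on the data, which is why the grid search in this section tries both.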

In [13]:
SVMmodel = svm.SVC()

# Define the parameter grid for tuning:
param_grid = {
    'C': [0.1, 10, 100, 1000],               # Penalty parameter C of the error term
    'kernel': ['linear', 'rbf'],     # Kernel type: linear or radial basis function (rbf)
    'gamma': ['scale', 'auto'],      # Kernel coefficient for 'rbf': scale or auto
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=SVMmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
svm_best_model = grid_search.best_estimator_

# Evaluate the model:
svm_y_pred = svm_best_model.predict(X_test)
svm_mse = mean_squared_error(y_test, svm_y_pred)
svm_accuracy = accuracy_score(y_test, svm_y_pred)
print('SVM Mean Squared Error:', svm_mse)
print("SVM RMSE",np.sqrt(mean_squared_error(y_test,svm_y_pred)))
print('SVM MAE:', mean_absolute_error(y_test,svm_y_pred))
print('SVM Accuracy: ', svm_accuracy)

Best Parameters:  {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Best Score:  0.48791208791208796
SVM Mean Squared Error: 0.993103448275862
SVM RMSE 0.9965457582448796
SVM MAE: 0.6206896551724138
SVM Accuracy:  0.5379310344827586


In [14]:
# SVM Confusion Matrix
print("SVM Confusion Matrix: ")
print(confusion_matrix(y_test, svm_y_pred), "\n")

# SVM Classification Report
print("SVM Classification Report: ")
print(classification_report(y_test, svm_y_pred))

SVM Confusion Matrix: 
[[31  1  0  3  0]
 [ 1  3 11  5  0]
 [ 1  1 22 11  6]
 [ 1  0 16 17  3]
 [ 0  0  3  4  5]] 

SVM Classification Report: 
              precision    recall  f1-score   support

           0       0.91      0.89      0.90        35
           1       0.60      0.15      0.24        20
           2       0.42      0.54      0.47        41
           3       0.42      0.46      0.44        37
           4       0.36      0.42      0.38        12

    accuracy                           0.54       145
   macro avg       0.54      0.49      0.49       145
weighted avg       0.56      0.54      0.53       145



#### CatBoost

In [15]:
CBmodel = CatBoostClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'learning_rate': [0.1, 0.01],   # Learning rate
    'depth': [4, 6, 8],             # Tree depth
    'l2_leaf_reg': [1, 3, 5]        # L2 regularization
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=CBmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
cb_best_model = grid_search.best_estimator_

# Evaluate the model:
cb_y_pred = cb_best_model.predict(X_test)
cb_mse = mean_squared_error(y_test, cb_y_pred)
cb_accuracy = accuracy_score(y_test, cb_y_pred)
print('Catboost Mean Squared Error:', cb_mse)
print("Catboost RMSE",np.sqrt(mean_squared_error(y_test,cb_y_pred)))
print('Catboost MAE:', mean_absolute_error(y_test,cb_y_pred))
print('Catboost Accuracy: ', cb_accuracy)

0:	learn: 1.5365256	total: 1.8ms	remaining: 1.8s
1:	learn: 1.4772096	total: 2.99ms	remaining: 1.49s
2:	learn: 1.4267813	total: 4.04ms	remaining: 1.34s
3:	learn: 1.3820881	total: 5.05ms	remaining: 1.26s
4:	learn: 1.3370256	total: 6.27ms	remaining: 1.25s
5:	learn: 1.3084213	total: 7.66ms	remaining: 1.27s
6:	learn: 1.2818221	total: 8.76ms	remaining: 1.24s
7:	learn: 1.2583970	total: 9.88ms	remaining: 1.23s
8:	learn: 1.2343676	total: 11ms	remaining: 1.21s
9:	learn: 1.2048492	total: 12.2ms	remaining: 1.21s
10:	learn: 1.1879965	total: 13.4ms	remaining: 1.2s
11:	learn: 1.1660835	total: 14.5ms	remaining: 1.19s
12:	learn: 1.1441712	total: 15.6ms	remaining: 1.18s
13:	learn: 1.1349505	total: 16.8ms	remaining: 1.18s
14:	learn: 1.1225085	total: 18.2ms	remaining: 1.2s
15:	learn: 1.1100711	total: 19.6ms	remaining: 1.21s
16:	learn: 1.0961643	total: 21.2ms	remaining: 1.23s
17:	learn: 1.0858798	total: 22.3ms	remaining: 1.22s
18:	learn: 1.0729124	total: 23.5ms	remaining: 1.21s
19:	learn: 1.0589693	total: 

In [16]:
# Catboost Confusion Matrix
print("CB Confusion Matrix: ")
print(confusion_matrix(y_test, cb_y_pred), "\n")

# Catboost Classification Report
print("CB Classification Report: ")
print(classification_report(y_test, cb_y_pred))

CB Confusion Matrix: 
[[34  1  0  0  0]
 [ 1  9  7  3  0]
 [ 0  8 17 15  1]
 [ 0  1 10 23  3]
 [ 0  1  4  5  2]] 

CB Classification Report: 
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        35
           1       0.45      0.45      0.45        20
           2       0.45      0.41      0.43        41
           3       0.50      0.62      0.55        37
           4       0.33      0.17      0.22        12

    accuracy                           0.59       145
   macro avg       0.54      0.52      0.53       145
weighted avg       0.58      0.59      0.58       145



#### Multilayer Perceptron

Multilayer perceptrons (MLPs) are feed-forward neural networks that can capture complex nonlinear relationships. This is useful when predicting dengue severity, because the incidence of the disease is influenced by many interacting factors such as temperature, humidity, precipitation, and other weather-related variables.

In [18]:
MLPmodel = MLPClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'hidden_layer_sizes': [(100,), (50, 50), (100, 50)],   # Sizes of hidden layers
    'activation': ['relu', 'tanh'],                        # Activation function
    'solver': ['adam', 'sgd'],                              # Optimization algorithm
    'alpha': [0.0001, 0.001, 0.01],                         # L2 penalty (regularization term)
    'learning_rate': ['constant', 'adaptive'],               # Learning rate schedule
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=MLPmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
mlp_best_model = grid_search.best_estimator_

# Evaluate the model:
mlp_y_pred = mlp_best_model.predict(X_test)
mlp_mse = mean_squared_error(y_test, mlp_y_pred)
mlp_accuracy = accuracy_score(y_test, mlp_y_pred)
print('MLP Mean Squared Error:', mlp_mse)
print("MLP RMSE",np.sqrt(mean_squared_error(y_test,mlp_y_pred)))
print('MLP MAE:', mean_absolute_error(y_test,mlp_y_pred))
print('MLP Accuracy: ', mlp_accuracy)



Best Parameters:  {'activation': 'tanh', 'alpha': 0.01, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
Best Score:  0.505303392259914
MLP Mean Squared Error: 1.0551724137931036
MLP RMSE 1.02721585550122
MLP MAE: 0.6275862068965518
MLP Accuracy:  0.5448275862068965




In [None]:
# MLP Confusion Matrix
print("MLP Confusion Matrix: ")
print(confusion_matrix(y_test, mlp_y_pred), "\n")

# MLP Classification Report
print("MLP Classification Report: ")
print(classification_report(y_test, mlp_y_pred))

MLP Confusion Matrix: 
[[31  1  1  2  0]
 [ 1 10  6  3  0]
 [ 2  9 18 11  1]
 [ 1  2 16 14  4]
 [ 0  1  4  2  5]] 

MLP Classification Report: 
              precision    recall  f1-score   support

           0       0.89      0.89      0.89        35
           1       0.43      0.50      0.47        20
           2       0.40      0.44      0.42        41
           3       0.44      0.38      0.41        37
           4       0.50      0.42      0.45        12

    accuracy                           0.54       145
   macro avg       0.53      0.52      0.53       145
weighted avg       0.54      0.54      0.54       145



## Results and Discussion

After training the models and using them to predict the test cases, the accuracy score of each model is collected in a single table:

In [None]:
import pandas as pd
scores = pd.DataFrame([['Random Forest Classifier', rf_accuracy], ['XGBoost', xgb_accuracy], ['SVM', svm_accuracy], ['Catboost', cb_accuracy], ['Multilayer Perceptron', mlp_accuracy]], columns=['Model', 'Accuracy Score'])
scores

|   | Model | Accuracy Score |
|---|---|---|
| 0 | Random Forest Classifier | 0.551724 |
| 1 | XGBoost | 0.593103 |
| 2 | SVM | 0.537931 |
| 3 | Catboost | 0.586207 |
| 4 | Multilayer Perceptron | 0.537931 |


Of the five models, XGBoost attained the highest accuracy score (59.31%). The other models are close behind, with accuracy scores ranging from 53.79% to 58.62%.

Since XGBoost attained the highest accuracy score of the five models, its confusion matrix and classification report are examined to identify its most common sources of error:

In [19]:
# XGBoost Confusion Matrix
print("XGB Confusion Matrix: ")
print(confusion_matrix(y_test, xgb_y_pred), "\n")

# XGBoost Classification Report
print("XGB Classification Report: ")
print(classification_report(y_test, xgb_y_pred))

XGB Confusion Matrix: 
[[33  2  0  0  0]
 [ 0 11  2  7  0]
 [ 0 13 14 14  0]
 [ 0  3  7 24  3]
 [ 0  1  3  4  4]] 

XGB Classification Report: 
              precision    recall  f1-score   support

           0       1.00      0.94      0.97        35
           1       0.37      0.55      0.44        20
           2       0.54      0.34      0.42        41
           3       0.49      0.65      0.56        37
           4       0.57      0.33      0.42        12

    accuracy                           0.59       145
   macro avg       0.59      0.56      0.56       145
weighted avg       0.62      0.59      0.59       145



In the confusion matrix, rows are true classes and columns are predicted classes. Many of the incorrect predictions come from the model classifying 'Moderate Risk' cases as 'Low Risk' (13 mistakes) or as 'High Risk' (14 mistakes).
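
The most frequent confusions can also be listed programmatically. The sketch below zeroes the diagonal of the XGBoost confusion matrix and sorts the remaining cells by count. The class names follow the label mapping defined earlier, and the helper variable names are illustrative.

```python
import numpy as np

class_names = ["Minimal to No risk", "Low Risk", "Moderate Risk", "High Risk", "Severe Risk"]

# XGBoost confusion matrix from above (rows: true class, columns: predicted class)
cm = np.array([
    [33,  2,  0,  0,  0],
    [ 0, 11,  2,  7,  0],
    [ 0, 13, 14, 14,  0],
    [ 0,  3,  7, 24,  3],
    [ 0,  1,  3,  4,  4],
])

# Zero the diagonal so only misclassifications remain
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Collect (count, true label, predicted label) and sort from most to least frequent
pairs = [
    (int(errors[i, j]), class_names[i], class_names[j])
    for i in range(errors.shape[0])
    for j in range(errors.shape[1])
    if errors[i, j] > 0
]
pairs.sort(key=lambda item: item[0], reverse=True)

for count, true_label, predicted_label in pairs[:3]:
    print(f"{count:2d}  true={true_label!r}  predicted={predicted_label!r}")
```

Most errors fall on neighbouring severity levels, which suggests the model mainly struggles with the boundaries between adjacent classes.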