# Predicting Dengue Case Severity

This notebook compares several machine learning classifiers (Random Forest, XGBoost, SVM, CatBoost, and a multilayer perceptron) for predicting the severity class of dengue cases.

## Import Libraries and Dataset

The dataset was obtained from https://www.kaggle.com/datasets/siddhvr/dengue-predictionfetal-health-classification, published by the author Siddhvr.

The dataset has the following features: 

**serial**: It represents a unique identifier for each data entry. It is likely used for tracking or indexing purposes.

**tempmax**: It refers to the maximum temperature recorded for a specific time period.

**tempmin**: It represents the minimum temperature recorded for a specific time period.

**temp**: It denotes the average temperature during the specified time period.

**feelslikemax**: It indicates the maximum "feels like" temperature, which takes into account factors such as humidity and wind to estimate how the temperature actually feels.

**feelslikemin**: It represents the minimum "feels like" temperature during the specified time period.

**feelslike**: It denotes the average "feels like" temperature, which is an estimation of how the temperature feels to humans.

**dew**: It refers to the dew point, which is the temperature at which air becomes saturated with moisture, leading to the formation of dew.

**humidity**: It represents the relative humidity, indicating the amount of moisture present in the air relative to the maximum amount it could hold at that temperature.

**precip**: It denotes the total precipitation (rainfall) recorded during the specified time period.

**precipprob**: It represents the probability of precipitation occurring during the specified time period.

**precipcover**: It indicates the coverage or extent of precipitation in the given area during the specified time period.

**snow**: It represents the amount of snowfall recorded during the specified time period.

**snowdepth**: It denotes the depth of snow on the ground during the specified time period.

**windspeed**: It represents the speed of wind recorded during the specified time period.

**winddir**: It indicates the direction from which the wind is blowing during the specified time period.

**sealevelpressure**: It refers to the atmospheric pressure at sea level during the specified time period.

**cloudcover**: It represents the extent of cloud cover during the specified time period.

**visibility**: It denotes the horizontal visibility, indicating how far an observer can see clearly during the specified time period.

**solarradiation**: It represents the amount of solar radiation received during the specified time period.

**solarenergy**: It indicates the solar energy level during the specified time period.

**uvindex**: It represents the UV index, which is a measure of the intensity of ultraviolet (UV) radiation from the sun.

**conditions**: It represents the weather conditions during the specified time period, such as sunny, cloudy, rainy, etc.

**stations**: It refers to the number of weather stations used to collect the data for the given region.

**cases**: It denotes the number of dengue cases recorded in the specified region during the specified time period.

**labels**: It represents the severity of dengue cases.

The model will try to predict the severity level of dengue cases.

Install the necessary dependencies:

```
pip install xgboost catboost pandas numpy scikit-learn
```

In [1]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn import svm
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('dengue-dataset.csv')

## Data Preprocessing

#### Handle missing values

Check each feature for missing values. Rows that contain a missing value are to be removed; the cell below only reports which columns contain missing values.

In [None]:
# Check for empty values in the DataFrame
print(data.isnull().any())
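
The removal step described above is not shown in the cell. A minimal sketch of one way to do it, assuming rows with any missing value should be dropped; the `toy` frame and its values are made up for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the dengue data (values are made up).
toy = pd.DataFrame({
    "tempmax": [31.0, np.nan, 29.5, 30.2],
    "humidity": [80.0, 75.0, np.nan, 82.0],
    "labels": ["Low Risk", "High Risk", "Low Risk", "Severe Risk"],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isnull().sum()

# Drop every row with at least one missing value, then rebuild a
# contiguous index so that later positional joins still line up.
cleaned = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Resetting the index matters here because a later cell recombines columns with `pd.concat(..., axis=1)`, which aligns rows by index.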

#### Feature scaling

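Scale the numerical features with RobustScaler, which centers each feature on its median and divides by its interquartile range, so outliers have less influence than with mean and standard deviation scaling.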

In [None]:
# Select the columns to exclude from normalization
target_column = 'labels'
columns_to_exclude = ['serial', 'cases', target_column]

# Create a DataFrame with the columns to normalize
data_to_normalize = data.drop(columns=columns_to_exclude)

# Create an instance of RobustScaler
scaler = RobustScaler()

# Normalize the data
normalized_data = scaler.fit_transform(data_to_normalize)

# Convert the normalized data back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data_to_normalize.columns)

# Combine the excluded columns with the normalized DataFrame
data = pd.concat([data[columns_to_exclude], normalized_df], axis=1)

# Print the final data
print(data)
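
To make the scaling step concrete, the sketch below compares `RobustScaler` with a manual median and interquartile-range computation on a single made-up column that contains one outlier. The array `values` is hypothetical and only serves to illustrate the default settings of the scaler:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one large outlier (values are made up).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Default RobustScaler: subtract the median, divide by the 25%-75% range.
scaled = RobustScaler().fit_transform(values)

# The same computation written out by hand.
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

The outlier stays large after scaling, but it does not distort the center or the spread used to scale the remaining values.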

#### Categorical encoding

Values in the **labels** feature are mapped to the following values:

**Severe Risk**: 4,
**High Risk**: 3,
**Moderate Risk**: 2,
**Low Risk**: 1,
**Minimal to No risk**: 0

In [None]:
# Define the label-to-value mapping (ordered from lowest to highest risk)
label_mapping = {
    'Severe Risk': 4,
    'High Risk': 3,
    'Moderate Risk': 2,
    'Low Risk': 1,
    'Minimal to No risk': 0
}

# Encode the labels using the mapping dictionary.
# The dictionary fixes the ordinal order explicitly; LabelEncoder is not
# used here because it would assign codes in alphabetical order instead.
data['labels'] = data['labels'].map(label_mapping)

data
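
`Series.map` silently turns any label that is missing from the dictionary into `NaN`, so it can be worth checking for unmapped values after encoding. A minimal sketch, where the series `raw` and the misspelled entry `'low risk'` are made up for illustration:

```python
import pandas as pd

label_mapping = {
    "Severe Risk": 4,
    "High Risk": 3,
    "Moderate Risk": 2,
    "Low Risk": 1,
    "Minimal to No risk": 0,
}

# Hypothetical labels; the last entry differs in case and will not match.
raw = pd.Series(["Low Risk", "Severe Risk", "Minimal to No risk", "low risk"])

# Labels that are not keys of the dictionary become NaN.
encoded = raw.map(label_mapping)

# Collect the original strings that failed to map.
unmapped = raw[encoded.isna()].unique()

print(encoded)
print("Unmapped labels:", list(unmapped))
```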



#### Splitting the Dataset

The data is split into a training set (76%), used to fit each model, and a test set (24%), used to evaluate it.

In [None]:
# Select the columns to exclude from the train set
target_column = 'labels'
columns_to_exclude = ['serial', 'cases',target_column]

# Assign X to train variable and y to target variable
X = data.drop(columns=columns_to_exclude)
y = data[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=42)
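
As an optional variation on the split above (not what this notebook does), the sketch below keeps the class proportions equal in both sets with `stratify` and fits the scaler on the training portion only, so that no statistics from the test set leak into preprocessing. The arrays `X_demo` and `y_demo` are synthetic and all names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data: 100 rows, 3 features, 5 balanced classes.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([0, 1, 2, 3, 4], 20)

# Same 76/24 ratio as above, but stratified on the class labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.24, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = RobustScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))
```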

## Model Training

In [None]:
# Create the models and fit them to the training data
XGBmodel = xgb.XGBClassifier()
SVMmodel = svm.SVC()
CBmodel = CatBoostClassifier()
RFmodel = RandomForestClassifier()
MLPmodel = MLPClassifier()

RFmodel.fit(X_train, y_train)
XGBmodel.fit(X_train, y_train)
SVMmodel.fit(X_train, y_train)
CBmodel.fit(X_train, y_train)
MLPmodel.fit(X_train, y_train)

## Model Evaluation

#### Make predictions on the test set and compare the error metrics and R2 score

In [None]:
# Note: r2_score is the coefficient of determination, not classification accuracy.

# CatBoost
cb_y_pred = CBmodel.predict(X_test)
cb_mse = mean_squared_error(y_test, cb_y_pred)
cb_r2 = r2_score(y_test, cb_y_pred)
print('CatBoost Mean Squared Error:', cb_mse)
print('CatBoost RMSE:', np.sqrt(cb_mse))
print('CatBoost MAE:', mean_absolute_error(y_test, cb_y_pred))
print('CatBoost R2 score:', cb_r2)

# Random Forest Classifier
rf_y_pred = RFmodel.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)
print('RF Mean Squared Error:', rf_mse)
print('RF RMSE:', np.sqrt(rf_mse))
print('RF MAE:', mean_absolute_error(y_test, rf_y_pred))
print('RF R2 score:', rf_r2)

# XGBoost
xgb_y_pred = XGBmodel.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
xgb_r2 = r2_score(y_test, xgb_y_pred)
print('XGB Mean Squared Error:', xgb_mse)
print('XGB RMSE:', np.sqrt(xgb_mse))
print('XGB MAE:', mean_absolute_error(y_test, xgb_y_pred))
print('XGB R2 score:', xgb_r2)

# Support Vector Machine
svm_y_pred = SVMmodel.predict(X_test)
svm_mse = mean_squared_error(y_test, svm_y_pred)
svm_r2 = r2_score(y_test, svm_y_pred)
print('SVM Mean Squared Error:', svm_mse)
print('SVM RMSE:', np.sqrt(svm_mse))
print('SVM MAE:', mean_absolute_error(y_test, svm_y_pred))
print('SVM R2 score:', svm_r2)

# Multilayer Perceptron
mlp_y_pred = MLPmodel.predict(X_test)
mlp_mse = mean_squared_error(y_test, mlp_y_pred)
mlp_r2 = r2_score(y_test, mlp_y_pred)
print('MLP Mean Squared Error:', mlp_mse)
print('MLP RMSE:', np.sqrt(mlp_mse))
print('MLP MAE:', mean_absolute_error(y_test, mlp_y_pred))
print('MLP R2 score:', mlp_r2)
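
Because the models are classifiers, classification metrics can complement the error metrics above. R2 is the coefficient of determination and does not measure the fraction of correct predictions, so the two numbers generally differ. The sketch below illustrates this on made-up label arrays (`y_true` and `y_hat` are hypothetical and not results from this notebook):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_absolute_error,
    r2_score,
)

# Hypothetical true and predicted severity codes (0 = minimal, 4 = severe).
y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0])
y_hat = np.array([0, 1, 2, 3, 3, 2, 2, 0])

# Fraction of exactly correct class predictions.
acc = accuracy_score(y_true, y_hat)

# Average distance between predicted and true codes; meaningful here
# only because the severity codes are ordered.
mae = mean_absolute_error(y_true, y_hat)

# Coefficient of determination, computed on the same codes.
r2 = r2_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2, 3, 4])

print("Accuracy:", acc)
print("MAE:", mae)
print("R2:", r2)
print(cm)
```

The confusion matrix additionally shows which severity levels get confused with each other, which a single score cannot.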


## Improve the model by hyperparameter tuning

To further improve the R2 score and reduce the errors, GridSearchCV is used to find the best parameters for each model. Each search uses 5-fold cross-validation and selects the parameters with the highest accuracy. Irrelevant features are removed as well.

In [None]:
# Select the columns to exclude from the train set
target_column = 'labels'
columns_to_exclude = ['serial','cases', target_column]

# Assign X to train variable and y to target variable
X = data.drop(columns=columns_to_exclude)
y = data[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=42)
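
The tuning cells that follow repeat the same search-and-evaluate pattern for every model. One way to factor that pattern out is sketched below; the helper `tune_and_score`, the synthetic data from `make_classification`, and the deliberately small grid are all hypothetical and only illustrate the idea:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_and_score(estimator, param_grid, X_train, y_train, X_test, y_test, cv=5):
    """Grid-search an estimator, then score the refitted best model on the test set."""
    search = GridSearchCV(estimator, param_grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
    return search.best_params_, search.best_score_, test_accuracy


# Synthetic stand-in data with three classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.24, random_state=42
)

# A small grid keeps the example fast; real grids are usually larger.
small_grid = {"n_estimators": [10, 25], "max_depth": [None, 3]}
best_params, cv_accuracy, test_accuracy = tune_and_score(
    RandomForestClassifier(random_state=42), small_grid, X_tr, y_tr, X_te, y_te
)

print("Best parameters:", best_params)
print("Cross-validated accuracy:", cv_accuracy)
print("Test accuracy:", test_accuracy)
```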

#### Random Forest Classifier

In [None]:
RFmodel = RandomForestClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'n_estimators': [100, 200, 500],        # Number of trees
    'max_depth': [None, 5, 10],              # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],         # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],           # Minimum number of samples required to be at a leaf node
    'bootstrap': [True, False]               # Whether bootstrap samples are used when building trees
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=RFmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
rf_best_model = grid_search.best_estimator_

# Evaluate the model (r2_score is the coefficient of determination, not accuracy):
rf_y_pred = rf_best_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)
print('RF Mean Squared Error:', rf_mse)
print('RF RMSE:', np.sqrt(rf_mse))
print('RF MAE:', mean_absolute_error(y_test, rf_y_pred))
print('RF R2 score:', rf_r2)
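
The cost of a grid search grows with the product of the number of options per parameter. As a rough sketch, scikit-learn's `ParameterGrid` can count the candidates for the Random Forest grid used above; the variable names are hypothetical, and the final refit on the full training set is not included in the count:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the Random Forest tuning cell.
rf_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Number of distinct parameter combinations: 3 * 3 * 3 * 3 * 2.
n_candidates = len(ParameterGrid(rf_grid))

# Each candidate is fitted once per cross-validation fold (cv=5).
n_fits = n_candidates * 5

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

If the search takes too long, passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel on all available cores.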

#### XGBoost

In [None]:
XGBmodel = xgb.XGBClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of boosting rounds
    'max_depth': [3, 5, 7],           # Maximum depth of each tree
    'learning_rate': [0.1, 0.01, 0.001],  # Learning rate
    'subsample': [0.8, 1.0],          # Subsample ratio of the training instances
    'colsample_bytree': [0.8, 1.0]    # Subsample ratio of columns when constructing each tree
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=XGBmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
xgb_best_model = grid_search.best_estimator_

# Evaluate the model (r2_score is the coefficient of determination, not accuracy):
xgb_y_pred = xgb_best_model.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
xgb_r2 = r2_score(y_test, xgb_y_pred)
print('XGB Mean Squared Error:', xgb_mse)
print('XGB RMSE:', np.sqrt(xgb_mse))
print('XGB MAE:', mean_absolute_error(y_test, xgb_y_pred))
print('XGB R2 score:', xgb_r2)

#### SVM

In [None]:
SVMmodel = svm.SVC()

# Define the parameter grid for tuning:
param_grid = {
    'C': [0.1, 10, 100, 1000],               # Penalty parameter C of the error term
    'kernel': ['linear', 'rbf'],     # Kernel type: linear or radial basis function (rbf)
    'gamma': ['scale', 'auto'],      # Kernel coefficient for 'rbf': scale or auto
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=SVMmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
svm_best_model = grid_search.best_estimator_

# Evaluate the model (r2_score is the coefficient of determination, not accuracy):
svm_y_pred = svm_best_model.predict(X_test)
svm_mse = mean_squared_error(y_test, svm_y_pred)
svm_r2 = r2_score(y_test, svm_y_pred)
print('SVM Mean Squared Error:', svm_mse)
print('SVM RMSE:', np.sqrt(svm_mse))
print('SVM MAE:', mean_absolute_error(y_test, svm_y_pred))
print('SVM R2 score:', svm_r2)

#### CatBoost

In [None]:
CBmodel = CatBoostClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'learning_rate': [0.1, 0.01],   # Learning rate
    'depth': [4, 6, 8],                    # Tree depth
    'l2_leaf_reg': [1, 3, 5]               # L2 regularization
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=CBmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
cb_best_model = grid_search.best_estimator_

# Evaluate the model (r2_score is the coefficient of determination, not accuracy):
cb_y_pred = cb_best_model.predict(X_test)
cb_mse = mean_squared_error(y_test, cb_y_pred)
cb_r2 = r2_score(y_test, cb_y_pred)
print('CatBoost Mean Squared Error:', cb_mse)
print('CatBoost RMSE:', np.sqrt(cb_mse))
print('CatBoost MAE:', mean_absolute_error(y_test, cb_y_pred))
print('CatBoost R2 score:', cb_r2)

#### Multilayer Perceptron

In [None]:
MLPmodel = MLPClassifier()

# Define the parameter grid for tuning:
param_grid = {
    'hidden_layer_sizes': [(100,), (50, 50), (100, 50)],   # Sizes of hidden layers
    'activation': ['relu', 'tanh'],                        # Activation function
    'solver': ['adam', 'sgd'],                              # Optimization algorithm
    'alpha': [0.0001, 0.001, 0.01],                         # L2 penalty (regularization term)
    'learning_rate': ['constant', 'adaptive']               # Learning rate schedule
}

# Perform grid search to find the best parameters:
grid_search = GridSearchCV(estimator=MLPmodel, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and best score:
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Get the best model with the tuned parameters:
mlp_best_model = grid_search.best_estimator_

# Evaluate the model (r2_score is the coefficient of determination, not accuracy):
mlp_y_pred = mlp_best_model.predict(X_test)
mlp_mse = mean_squared_error(y_test, mlp_y_pred)
mlp_r2 = r2_score(y_test, mlp_y_pred)
print('MLP Mean Squared Error:', mlp_mse)
print('MLP RMSE:', np.sqrt(mlp_mse))
print('MLP MAE:', mean_absolute_error(y_test, mlp_y_pred))
print('MLP R2 score:', mlp_r2)