# Covid-19 Risk Classification in Malaysia  
## Part 1: Introduction  

### Objective  
The objective of this project is to **classify daily Covid-19 risk levels in Malaysia** into two categories:  
- **High risk**: If daily new cases exceed the `RISK_THRESHOLD` (default: 4,000, based on MCO 3.0 policy trigger)  
- **Low risk**: Otherwise  

This classification problem will be approached using **three supervised learning methods**:  
1. Linear Regression  
2. K-Nearest Neighbors (KNN)  
3. Logistic Regression  
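
Linear regression predicts a continuous value rather than a class, so using it for classification requires a decision rule. The rule is not stated at this point in the notebook. One common approach, assumed here purely for illustration, is to regress on the 0/1 labels and treat any score at or above 0.5 as high risk. The toy data and the 0.5 cut-off below are hypothetical.

```python
import numpy as np

# Toy feature (e.g. a scaled case count) and 0/1 risk labels; values are made up.
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# Ordinary least squares with an explicit intercept column.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Continuous scores, then an assumed 0.5 cut-off to obtain class labels.
scores = X_design @ coef
y_pred = (scores >= 0.5).astype(int)
print(y_pred)
```

The continuous `scores` can also be reused for ranking-based metrics such as AUC, which need a score rather than a hard label.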

### Motivation  
- During the Covid-19 pandemic, understanding the risk level of infection was crucial for decision-making in healthcare, government policies, and public awareness.  
- On **12 May 2021**, the Malaysian government put **MCO 3.0** into effect nationwide as daily new cases exceeded **4,000**, later escalating to a **full lockdown on 1 June 2021** when cases surged above **9,000**.  
- This project uses the **4,000-case threshold** as a real-world benchmark for classifying **high-risk periods**, making predictions more interpretable and policy-relevant.  
- By comparing different machine learning models, we aim to evaluate which method is most effective for predicting risk levels, balancing **accuracy, interpretability, and computational efficiency**.  

### Dataset  
- Source: [Ministry of Health Malaysia (MoH) GitHub Repository](https://github.com/MoH-Malaysia/covid19-public)  
- We will use **country-level daily data** such as:  
  - Daily new cases  
  - Daily testing numbers  
  - Hospital and ICU utilization  
  - Vaccination progress  

For this project, a **6-month consecutive date range (1 March 2021 – 31 August 2021)** was selected.  
- The **midpoint of this range falls at the turn of 31 May to 1 June 2021**, which coincides with the start of the **Full Movement Control Order (FMCO)** on 1 June, so the window covers both the run-up to and the aftermath of the strictest phase of Malaysia’s Covid-19 response.  
- This ensures sufficient training samples while keeping the experiment focused on a **critical policy period**.  
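
The length of the window and its midpoint can be checked directly. This is a minimal sketch that uses only the two dates stated above; the variable names are illustrative.

```python
import pandas as pd

start = pd.Timestamp("2021-03-01")
end = pd.Timestamp("2021-08-31")

# Daily index with both endpoints included.
days = pd.date_range(start, end, freq="D")

# Exact midpoint of the interval between the first and last day.
midpoint = start + (end - start) / 2

print(len(days))
print(midpoint)
```

The window contains 184 daily rows, and the exact midpoint is noon on 31 May 2021, half a day before FMCO began on 1 June.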

### Methodology Overview  
The following steps will be performed in this notebook:  
1. **Data Collection** – Load Covid-19 datasets and preprocess them  
2. **Feature Engineering** – Select features, handle missing values, transform & scale data  
3. **Dataset Splitting** – Divide into training, validation, and testing sets  
4. **Model Training** – Train three models: Linear Regression, KNN, Logistic Regression  
5. **Plotting & Evaluation** – Evaluate with metrics & visualize ROC, PR, and error curves  
6. **Analysis & Conclusion** – Compare models and highlight key findings  
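
Step 3 does not fix the split proportions at this point. Because the data form a daily time series, one reasonable option is a chronological split, so that validation and test days always come after the training days. The sketch below assumes a 70/15/15 split with no shuffling. The helper name `chronological_split`, the fractions, and the placeholder data are all hypothetical.

```python
import pandas as pd

def chronological_split(df, train_frac=0.70, val_frac=0.15):
    """Split a frame into train/validation/test by date order, without shuffling."""
    df = df.sort_values("date").reset_index(drop=True)
    n = len(df)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Placeholder frame covering the project window; cases_new values are dummies.
dates = pd.date_range("2021-03-01", "2021-08-31", freq="D")
demo = pd.DataFrame({"date": dates, "cases_new": range(len(dates))})

train, val, test = chronological_split(demo)
print(len(train), len(val), len(test))
```

A random split would also be possible, but it lets the model train on days that follow the ones it is evaluated on, which tends to give optimistic scores on time-ordered data.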

### Hyperparameters to Tune  
Each model will be trained with baseline parameters and then fine-tuned with the following hyperparameters:  

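- **Linear Regression (Kotaro)**  
  - `alpha` → Regularization strength (applies to the Ridge/Lasso variants; plain least squares has no regularization term)  
  - `penalty` → Type of regularization (`l1` for Lasso, `l2` for Ridge)  
  - `max_iter` → Maximum number of iterations for convergence (relevant to iterative solvers)  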

- **K-Nearest Neighbors (Andrea)**  
  - `n_neighbors (k)` → Number of neighbors to consider  
  - `weights` → Uniform vs distance-based weighting  
  - `metric` → Distance measure (Euclidean, Manhattan, Minkowski)  

- **Logistic Regression (JeeSee)**  
  - `penalty` → Regularization type (`l1`, `l2`, `elasticnet`)  
  - `solver` → Optimization algorithm (`liblinear`, `lbfgs`, `saga`)  
  - `max_iter` → Maximum number of iterations for convergence  
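
The Logistic Regression options above cannot be combined freely. Assuming scikit-learn's `LogisticRegression` (the library is not named at this point in the notebook), `liblinear` accepts `l1` and `l2`, `lbfgs` accepts `l2` only, and `saga` accepts `l1`, `l2`, and `elasticnet`. The sketch below enumerates only the valid combinations in plain Python. The `max_iter` candidates are hypothetical.

```python
from itertools import product

# Penalties each solver accepts in scikit-learn's LogisticRegression,
# restricted to the three penalties listed above.
SUPPORTED = {
    "liblinear": {"l1", "l2"},
    "lbfgs": {"l2"},
    "saga": {"l1", "l2", "elasticnet"},
}

penalties = ["l1", "l2", "elasticnet"]
solvers = ["liblinear", "lbfgs", "saga"]
max_iters = [200, 1000]  # hypothetical candidate values

# Keep only the penalty/solver pairs the solver actually supports.
valid_grid = [
    {"penalty": p, "solver": s, "max_iter": m}
    for p, s, m in product(penalties, solvers, max_iters)
    if p in SUPPORTED[s]
]

print(len(valid_grid), "valid combinations out of",
      len(penalties) * len(solvers) * len(max_iters))
```

Filtering the grid up front avoids failed fits during tuning. The `elasticnet` penalty additionally requires an `l1_ratio` value, which is not part of the list above.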

### Evaluation Metrics  
We will compare all three models using:  
- Accuracy  
- Area Under Curve (AUC)  
- Recall  
- Precision  
- Specificity  
- F1 Score  
- Training Speed  
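
All of the threshold-based metrics in this list can be derived from the four confusion-matrix counts. The sketch below shows those definitions with high risk (1) as the positive class. The helper name `binary_metrics` and the example labels are hypothetical.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute threshold-based metrics from 0/1 labels (1 = high risk)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

AUC is not included because it needs continuous scores rather than hard labels, and training speed would be measured separately, for example by timing each model's fit call.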

### Visualization  
Plots will be used to support evaluation:  
- ROC Curve  
- Precision-Recall Curve  
- Overfitting/Underfitting curves (Train vs Validation Errors)  


In [87]:
# Import libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set start date, end date, and risk threshold
START_DATE = pd.to_datetime("2021-03-01")
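# DateOffset(months=6) lands on 2021-09-01; subtract one day so the inclusive
# date filter ends on 2021-08-31, matching the stated window
END_DATE = START_DATE + pd.DateOffset(months=6) - pd.Timedelta(days=1)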
RISK_THRESHOLD = 4000


## Part 2: Data Collection

### Data Source
We will use official Malaysian Covid-19 datasets provided by the **Ministry of Health Malaysia (MoH)**, available on GitHub:  
[https://github.com/MoH-Malaysia/covid19-public](https://github.com/MoH-Malaysia/covid19-public)

### Selected Files
For this project, we will mainly use:
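- `cases_malaysia.csv` → Daily Covid-19 cases (source of `cases_new`, from which the target is derived)  
- `tests_malaysia.csv` → Daily testing numbers  
- `hospital.csv` → Hospitalization data (state-level rows, summed by date to national totals)  
- `icu.csv` → ICU utilization data (state-level rows, summed by date to national totals)  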
- `vax_malaysia.csv` → Vaccination progress  

### Timeframe
We extract the **6-month consecutive date range** defined in Part 1: **1 March 2021 – 31 August 2021**, set by `START_DATE` and `END_DATE`.  
(This period can be adjusted if needed, but must remain consecutive.)

### Target Variable
We define the **binary classification target**:
- **High risk (1)** if `cases_new > RISK_THRESHOLD`  
- **Low risk (0)** otherwise
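
The comparison is strict, so a day with exactly 4,000 new cases is labeled low risk. The sketch below checks that boundary on made-up case counts. The frame `demo` is hypothetical and separate from the project data.

```python
import pandas as pd

RISK_THRESHOLD = 4000  # same default as in the notebook

# Made-up case counts chosen to straddle the threshold.
demo = pd.DataFrame({"cases_new": [3999, 4000, 4001]})
demo["risk_level"] = (demo["cases_new"] > RISK_THRESHOLD).astype(int)
print(demo["risk_level"].tolist())
```

Only the 4,001-case day is labeled high risk here. Switching to `>=` would move the boundary day into the high-risk class.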


In [89]:
# Read the csv files
datasets = {
    "cases": pd.read_csv("cases_malaysia.csv"),
    "tests": pd.read_csv("tests_malaysia.csv"),
    "hospital": pd.read_csv("hospital.csv"),
    "icu": pd.read_csv("icu.csv"),
    "vax": pd.read_csv("vax_malaysia.csv"),
}

# Clean data
for name, df in datasets.items():
    # Select 6-month range
    df["date"] = pd.to_datetime(df["date"])
    df = df[(df["date"] >= START_DATE) & (df["date"] <= END_DATE)].copy()

    # Group by date (and remove non-numeric columns)
    df = df.groupby("date").sum(numeric_only=True).reset_index()

    # Remove columns that have no or a single distinct value only
    nunique = df.nunique()
    cols_to_drop = nunique[nunique <= 1].index
    df.drop(columns=cols_to_drop, inplace=True, errors="ignore")

    # Fill in empty entries in numerical columns
    df.fillna(0, inplace=True)

    # Assign cleaned data
    datasets[name] = df

# Merge datasets on date
from functools import reduce
df = reduce(
    lambda left, right: pd.merge(left, right, on="date", how="left"),
    datasets.values()
)
# print("Final dataset shape:", df.shape)

# Create binary target variable (risk level)
df["risk_level"] = (df["cases_new"] > RISK_THRESHOLD).astype(int)

# Sort by date
df = df.sort_values("date").reset_index(drop=True)

df.columns

Index(['date', 'cases_new', 'cases_import', 'cases_recovered', 'cases_active',
       'cases_cluster', 'cases_unvax', 'cases_pvax', 'cases_fvax',
       'cases_child', 'cases_adolescent', 'cases_adult', 'cases_elderly',
       'cases_0_4', 'cases_5_11', 'cases_12_17', 'cases_18_29', 'cases_30_39',
       'cases_40_49', 'cases_50_59', 'cases_60_69', 'cases_70_79', 'cases_80',
       'rtk-ag', 'pcr', 'beds', 'beds_covid', 'beds_noncrit', 'admitted_pui',
       'admitted_covid', 'admitted_total', 'discharged_pui',
       'discharged_covid', 'discharged_total', 'hosp_covid', 'hosp_pui',
       'hosp_noncovid', 'beds_icu', 'beds_icu_rep', 'beds_icu_total',
       'beds_icu_covid', 'vent', 'vent_port', 'icu_covid', 'icu_pui',
       'icu_noncovid', 'vent_covid', 'vent_pui', 'vent_noncovid', 'vent_used',
       'vent_port_used', 'daily_partial', 'daily_full', 'daily_booster',
       'daily', 'daily_partial_adol', 'daily_full_adol', 'daily_partial_child',
       'daily_full_child', 'cumul_part