# **COVID-19 Disease Outbreak Prediction Project**


# COVID-19 Outbreak Prediction Analysis
##  **Author:** Muhammad Zeeshan Raza  
# **Data Scientist** | 
[![GitHub](https://img.icons8.com/material-outlined/24/000000/github.png)](https://github.com/Zeeshan314-ai) GitHub |
[![LinkedIn](https://img.icons8.com/color/24/000000/linkedin.png)](https://www.linkedin.com/in/muhammadzeeshanraza/) LinkedIn

## 1. **INTRODUCTION**


In this project, we aim to predict the future spread of COVID-19 using historical data. The model forecasts the number of confirmed cases for upcoming days based on past trends and patterns. This will help organizations, governments, and healthcare systems anticipate future outbreaks and make data-driven decisions.

**Problem Statement**:
COVID-19 has affected millions of lives globally, and understanding its future spread is essential for timely interventions. By predicting future outbreaks, governments and health organizations can better allocate resources, plan for healthcare demand, and implement preventive measures.

## **2. OBJECTIVE OF THE MODEL**

The goal of this model is to predict the future confirmed COVID-19 cases for the next few days, based on historical data, using machine learning techniques. The model will allow stakeholders to:

- Predict the upcoming wave of infections.

- Allocate resources based on forecasted cases.

- Take proactive steps before case numbers spike.

## **3. Data Understanding and Sources**


We use publicly available COVID-19 data, including information on daily confirmed cases, deaths, and recoveries. This data can be obtained from reliable sources like Johns Hopkins University, Kaggle, and other healthcare data repositories.

INSTALL THE MAJOR LIBRARIES 

In [2]:
pip install scikit-learn


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





### **3.1 IMPORT THE LIBRARIES** 

In [3]:
# Importing necessary libraries

import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt



### **3.2 IMPORT DATA**

In [4]:

# Load the dataset (example)
df = pd.read_csv('worldometer_data.csv')
df.head()


Unnamed: 0,Country/Region,Continent,Population,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,WHO Region
0,USA,North America,331198100.0,5032179,,162804.0,,2576668.0,,2292707.0,18296.0,15194.0,492.0,63139605.0,190640.0,Americas
1,Brazil,South America,212710700.0,2917562,,98644.0,,2047660.0,,771258.0,8318.0,13716.0,464.0,13206188.0,62085.0,Americas
2,India,Asia,1381345000.0,2025409,,41638.0,,1377384.0,,606387.0,8944.0,1466.0,30.0,22149351.0,16035.0,South-EastAsia
3,Russia,Europe,145940900.0,871894,,14606.0,,676357.0,,180931.0,2300.0,5974.0,100.0,29716907.0,203623.0,Europe
4,South Africa,Africa,59381570.0,538184,,9604.0,,387316.0,,141264.0,539.0,9063.0,162.0,3149807.0,53044.0,Africa


In [5]:
df.columns

Index(['Country/Region', 'Continent', 'Population', 'TotalCases', 'NewCases',
       'TotalDeaths', 'NewDeaths', 'TotalRecovered', 'NewRecovered',
       'ActiveCases', 'Serious,Critical', 'Tot Cases/1M pop', 'Deaths/1M pop',
       'TotalTests', 'Tests/1M pop', 'WHO Region'],
      dtype='object')

## **4. METHODOLOGY**

### **4.1- DATA PREPROCESSING**

Before building the model, we first need to prepare the data. This includes handling missing values, converting date columns, and creating necessary features.

In [6]:
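# The source file has no date column, so assign one placeholder snapshot date to every row
df['date'] = pd.to_datetime('2025-04-01')  # Example: a single fixed date, not a per-day timestamp

# Replace missing values with 0 (note: this treats unknown counts as true zeros)
df.fillna(0, inplace=True)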

# Check data types and null values
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Country/Region    209 non-null    object        
 1   Continent         209 non-null    object        
 2   Population        209 non-null    float64       
 3   TotalCases        209 non-null    int64         
 4   NewCases          209 non-null    float64       
 5   TotalDeaths       209 non-null    float64       
 6   NewDeaths         209 non-null    float64       
 7   TotalRecovered    209 non-null    float64       
 8   NewRecovered      209 non-null    float64       
 9   ActiveCases       209 non-null    float64       
 10  Serious,Critical  209 non-null    float64       
 11  Tot Cases/1M pop  209 non-null    float64       
 12  Deaths/1M pop     209 non-null    float64       
 13  TotalTests        209 non-null    float64       
 14  Tests/1M pop      209 non-
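
Filling every gap with 0 treats a missing count as a true zero, which may matter for columns such as `NewCases` that are empty in the preview above. A minimal sketch, using a small made-up frame rather than the real CSV, of how we might count the gaps per column before deciding how to fill them (the `toy` frame and the `NewCases_missing` indicator column are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "TotalCases": [500, 300, 200, 100],
    "NewCases": [np.nan, np.nan, 12.0, np.nan],
    "TotalDeaths": [50.0, np.nan, 8.0, 2.0],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum().sort_values(ascending=False)
missing_share = (missing / len(toy)).round(2)
print(pd.DataFrame({"missing": missing, "share": missing_share}))

# Filling with 0 asserts "no new cases"; an indicator column keeps the
# information that the value was originally unknown.
toy["NewCases_missing"] = toy["NewCases"].isna()
toy_filled = toy.fillna({"NewCases": 0, "TotalDeaths": 0})
```

Keeping the indicator column is one design option; dropping heavily missing columns would be another.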

### **4.2-FEATURE ENGINEERING**

In [7]:
# Create lag features and a 7-row moving average.
# Each row is a country (the data has no daily time axis), so these look back over neighbouring rows, not days.
df['7_day_avg_cases'] = df['TotalCases'].rolling(window=7).mean()
df['lag_1'] = df['TotalCases'].shift(1)  # Value from the previous row
df['lag_2'] = df['TotalCases'].shift(2)  # Value from two rows back

# Forward-fill missing values; df.ffill() replaces the deprecated fillna(method='ffill')
df.ffill(inplace=True)

# Correctly define X as a DataFrame with multiple features
X = df[['lag_1', 'lag_2', '7_day_avg_cases']]  # Feature matrix (should be 2D)
y = df['TotalCases']  # Target variable (should be 1D)

# Check the shapes of X and y before splitting
print("Shape of X:", X.shape)  # Should be (209, 3) - 2D array with 209 samples and 3 features
print("Shape of y:", y.shape)  # Should be (209,) - 1D array with 209 target values





Shape of X: (209, 3)
Shape of y: (209,)
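
One detail worth checking: forward fill copies the last known value downward, so it has nothing to put into the first rows produced by `rolling` and `shift`. The sketch below uses a made-up ten-row series (not the real data) to show which cells stay empty and one possible way to handle them, namely dropping the warm-up rows. Since each row in this dataset is a country, the "lags" are neighbouring rows rather than earlier days.

```python
import pandas as pd

# Toy series standing in for TotalCases; values are illustrative only.
toy = pd.DataFrame({"TotalCases": [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]})

toy["7_row_avg_cases"] = toy["TotalCases"].rolling(window=7).mean()
toy["lag_1"] = toy["TotalCases"].shift(1)  # value from the previous row
toy["lag_2"] = toy["TotalCases"].shift(2)  # value from two rows back

# Forward fill has nothing to copy into the leading rows, so they stay NaN.
after_ffill = toy.ffill()
print(after_ffill.isna().sum())

# One option: drop the warm-up rows so every feature is fully defined.
features = toy.dropna().reset_index(drop=True)
print(features.shape)
```

Dropping rows shrinks the dataset slightly; a backward fill would keep the rows but copies later values into earlier ones.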


In [8]:
print(X.shape)  # Should be (n_samples, n_features), e.g., (100, 3)
print(y.shape)  # Should be (n_samples,), e.g., (100,)


(209, 3)
(209,)


### **4.3-TRAIN-TEST SPLIT**

In [9]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Check the shapes of the datasets
print("Shape of X_train:", X_train.shape)  # Should be (167, 3) - 167 samples and 3 features
print("Shape of y_train:", y_train.shape)  # Should be (167,) - 167 target values


Shape of X_train: (167, 3)
Shape of y_train: (167,)
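
With `shuffle=False`, the split simply cuts the rows in their existing order: the first block trains the model and the last block tests it. The sketch below reproduces that arithmetic in plain Python, assuming the test size is rounded up, which matches the 167 training rows reported above. The helper name `ordered_split` is hypothetical.

```python
import math

def ordered_split(rows, test_fraction=0.2):
    """Split a sequence without shuffling: first part trains, last part tests."""
    n_test = math.ceil(len(rows) * test_fraction)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

rows = list(range(209))          # stand-in for the 209 rows of df
train, test = ordered_split(rows)
print(len(train), len(test))     # sizes of the two blocks
print(train[-1], test[0])        # last training row index, first test row index
```

Because the rows are never shuffled, the test block consists entirely of the rows at the end of the file.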


### **4.4-MODEL SELECTION**

In [10]:
from sklearn.ensemble import RandomForestRegressor

# Create the model: Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Display model details
print(model)


# Split the data into training and testing sets
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)



RandomForestRegressor(random_state=42)


### **4.5-MODEL TRAINING**

Now, let's train the Random Forest model using the training data.


In [11]:
from sklearn.ensemble import RandomForestRegressor

# Create and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Print a message indicating training is complete
print("Model training complete!")


Model training complete!


### **4.6-MODEL EVALUATION**

We will evaluate the model using Mean Absolute Error (MAE) to check the accuracy of our predictions.

In [12]:
from sklearn.metrics import mean_absolute_error

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)

# Print the MAE value
print(f"Mean Absolute Error: {mae}")


Mean Absolute Error: 238.46380952380954
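
An MAE is expressed in the same units as the target (here, cases), so a single number is hard to judge in isolation. One common sanity check is to compare it with a naive baseline that predicts each value with the one from the previous row, which is what the `lag_1` feature already holds. The sketch below computes MAE from its definition on made-up numbers; the function name and all values are illustrative assumptions, not results from this dataset.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative numbers only, not taken from the dataset.
actual = [120, 110, 95, 90, 70]
model_pred = [118, 104, 99, 85, 73]

# Naive baseline: predict each value with the one from the previous row.
previous_row = [130, 120, 110, 95, 90]

mae_model = mean_absolute_error_manual(actual, model_pred)
mae_naive = mean_absolute_error_manual(actual, previous_row)
print(f"model MAE: {mae_model:.1f}, naive MAE: {mae_naive:.1f}")
```

A model is only adding value if its MAE is clearly below that of the naive baseline.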


In [68]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

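# Model predictions
# NOTE: best_model, X_train_scaled and X_test_scaled are not defined in the cells shown above;
# they presumably come from a model-selection and feature-scaling step that this notebook does not show.
y_pred_train = best_model.predict(X_train_scaled)
y_pred_test = best_model.predict(X_test_scaled)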

# Calculate MAE
mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)

# Calculate MSE
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)

# Calculate RMSE (Root Mean Squared Error)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

# Calculate R-squared (R²)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

# Print all metrics
print(f"Training MAE: {mae_train}")
print(f"Test MAE: {mae_test}")
print(f"Training MSE: {mse_train}")
print(f"Test MSE: {mse_test}")
print(f"Training RMSE: {rmse_train}")
print(f"Test RMSE: {rmse_test}")
print(f"Training R²: {r2_train}")
print(f"Test R²: {r2_test}")


Training MAE: 1841.1555657026886
Test MAE: 2461.499438448847
Training MSE: 28314589.833303757
Test MSE: 61738103.48387001
Training RMSE: 5321.145537692401
Test RMSE: 7857.359829094631
Training R²: 0.996415009990202
Test R²: 0.9742331702727035
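
To make the four metrics concrete, the sketch below computes them directly from their definitions in plain Python on a tiny made-up example, then checks that the RMSE values printed above are indeed the square roots of the printed MSE values. The helper name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R² computed from their definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Illustrative values only.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics)

# Consistency check on the figures printed above: RMSE should equal sqrt(MSE).
assert math.isclose(math.sqrt(28314589.833303757), 5321.145537692401)
assert math.isclose(math.sqrt(61738103.48387001), 7857.359829094631)
```

Because MSE squares each error, RMSE reacts much more strongly than MAE to a handful of very large misses.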


---

## **5. COVID-19 Data Visualization with Plotly**


### **5.1 Line Chart (Cases Over Time)**


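A **line chart** is useful for visualizing the trend of COVID-19 cases over time, tracking the rise and fall of infections. Because `date` is a single placeholder value set during preprocessing, this plot only becomes a true time series once the data carries real per-day dates.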


In [51]:
fig = px.line(df, x='date', y='TotalCases', title='COVID-19 Cases Over Time', markers=True)
fig.show()


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Country/Region    209 non-null    object        
 1   Continent         209 non-null    object        
 2   Population        209 non-null    float64       
 3   TotalCases        209 non-null    int64         
 4   NewCases          209 non-null    float64       
 5   TotalDeaths       209 non-null    float64       
 6   NewDeaths         209 non-null    float64       
 7   TotalRecovered    209 non-null    float64       
 8   NewRecovered      209 non-null    float64       
 9   ActiveCases       209 non-null    float64       
 10  Serious,Critical  209 non-null    float64       
 11  Tot Cases/1M pop  209 non-null    float64       
 12  Deaths/1M pop     209 non-null    float64       
 13  TotalTests        209 non-null    float64       
 14  Tests/1M pop      209 non-

### **5.2 Bar Plot**


In [50]:
fig = px.bar(df, x='Country/Region', y='TotalCases', title='COVID-19 Cases by Country', color='TotalCases')
fig.show()


### **5.3 Scatter Plot (Cases vs Deaths)**


In [49]:
fig = px.scatter(df, x='TotalCases', y='TotalDeaths', title='Cases vs Deaths', color='Country/Region')
fig.show()


### **5.4 Histogram (Daily New Cases Distribution)**


In [48]:
fig = px.histogram(df, x='NewCases', title='Distribution of Daily COVID-19 Cases', nbins=50)
fig.show()


### **5.5 Box Plot (Outliers in COVID-19 Deaths)**


In [43]:
fig = px.box(df, y='TotalDeaths', title='COVID-19 Deaths Variability')
fig.show()


### **5.6 Choropleth Map (Global Cases Visualization)**


In [44]:
fig = px.choropleth(df, 
                    locations='Country/Region', 
                    locationmode='country names',  
                    color='TotalCases', 
                    title='Global COVID-19 Cases',
                    color_continuous_scale='Reds')

fig.show()


### **5.7 Heatmap (Correlation Between Variables)**


In [45]:
import plotly.figure_factory as ff

corr_matrix = df[['TotalCases', 'TotalDeaths', 'TotalRecovered']].corr()

fig = ff.create_annotated_heatmap(z=corr_matrix.values, 
                                  x=list(corr_matrix.columns), 
                                  y=list(corr_matrix.index), 
                                  colorscale='Blues')

fig.update_layout(title='COVID-19 Data Correlation Heatmap')
fig.show()


### **5.8 Area Chart (Cumulative Cases Growth)**


In [46]:
fig = px.area(df, x='date', y='TotalCases', title='Cumulative COVID-19 Cases Growth')
fig.show()


### **5.9 Animated Bubble Chart (Cases, Deaths, and Recoveries)**


In [47]:
fig = px.scatter(df, x='TotalCases', y='TotalDeaths', size='TotalRecovered',
                 color='Country/Region', animation_frame='date', title='COVID-19 Spread Over Time')
fig.show()


---

## **Conclusion & Next Steps**

Summary of the model's performance, possible improvements, and next steps.

**Model Performance Explanation**

The metrics below are for the model that predicts `TotalCases` from historical data, described here as a Linear Regression model and scored as `best_model` in the final evaluation cell. The Random Forest trained in Section 4 is scored separately and reaches a test MAE of 238.46. Based on these metrics the model performs reasonably well, but there are areas that can be improved.

**Evaluation Metrics:**

- Training MAE (Mean Absolute Error): 1841.16

- Test MAE: 2461.50

- Training MSE (Mean Squared Error): 28,314,589.83

- Test MSE: 61,738,103.48

- Training RMSE (Root Mean Squared Error): 5,321.15

- Test RMSE: 7,857.36

- Training R²: 0.9964

- Test R²: 0.9742

**Performance Insights:**

- **Training performance:** The model fits the training data very closely, with an R² of 0.9964, meaning it explains over 99% of the variance in the training data. The comparatively low MAE, MSE, and RMSE on the training data confirm this.

- **Test performance:** On the test data, R² drops slightly to 0.9742, which is still a good fit. However, the test MAE is about 34% higher than the training MAE and the test RMSE about 48% higher, which suggests the model may be overfitting the training data to some degree.
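
The size of the train/test gap can be quantified with a few ratios. The sketch below uses only the figures printed by the evaluation cell; the variable names are illustrative, and the interpretation in the comments is a rule of thumb rather than a formal test.

```python
# Figures copied from the evaluation output above.
train = {"mae": 1841.1555657026886, "rmse": 5321.145537692401, "r2": 0.996415009990202}
test = {"mae": 2461.499438448847, "rmse": 7857.359829094631, "r2": 0.9742331702727035}

mae_gap = test["mae"] / train["mae"]        # > 1 means worse on unseen data
rmse_gap = test["rmse"] / train["rmse"]
r2_drop = train["r2"] - test["r2"]

# RMSE far above MAE points to a few very large errors dominating the score.
tail_ratio_test = test["rmse"] / test["mae"]

print(f"MAE gap: {mae_gap:.2f}x, RMSE gap: {rmse_gap:.2f}x, R² drop: {r2_drop:.4f}")
print(f"Test RMSE / MAE: {tail_ratio_test:.2f}")
```

A test RMSE several times larger than the test MAE hints that a small number of rows, likely the countries with the highest case counts, account for most of the squared error.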

**Potential Improvements:**

- **Hyperparameter tuning:** The model could benefit from fine-tuning its hyperparameters (e.g., adjusting the regularization strength). This could help reduce overfitting and improve generalization to new data.

- **Adding more features:** Adding more predictive features, such as TotalTests, Deaths/1M pop, and Tests/1M pop, may improve the model's performance by providing more information for the prediction.

- **Trying different models:** While Linear Regression has performed well, more complex models such as Random Forest or XGBoost could potentially capture non-linear relationships in the data, leading to better predictions.
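
As a rough illustration of the tuning idea, the sketch below runs a small grid search over two Random Forest settings that limit tree complexity. It uses synthetic data so that it runs on its own; `X_demo`, `y_demo`, and the grid values are assumptions for illustration, and in the notebook `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data; the real X_train / y_train would be used instead.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=120)

# Hypothetical grid: shallower trees and larger leaves limit model complexity.
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),      # keeps row order, like shuffle=False
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"cross-validated MAE: {-search.best_score_:.3f}")
```

`TimeSeriesSplit` is chosen here only to stay consistent with the unshuffled split used earlier; a plain k-fold split would be an equally reasonable choice for data without a time axis.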




**Business Problem:**

The goal of this project was to predict the number of COVID-19 cases (TotalCases) in a country or region using historical data. The ability to predict future cases accurately can help governments and health organizations better allocate resources, plan for healthcare needs, and respond proactively to outbreaks.

**Dataset:**

The dataset used for this project contains COVID-19 statistics from various countries and regions. It includes the following columns:

- Country/Region: The country or region for which the data is reported.

- Population: The population of the country/region.

- TotalCases: The total number of confirmed COVID-19 cases.

- NewCases: The number of new confirmed cases.

- TotalDeaths: The total number of deaths due to COVID-19.

- NewDeaths: The number of new deaths.

- TotalRecovered: The total number of recoveries from COVID-19.

- ActiveCases: The number of currently active cases.

- Serious,Critical: The number of serious or critical cases.

- TotalTests: The number of tests conducted.

The dataset does not include a date column in this particular case, so we used features like NewCases, TotalDeaths, and ActiveCases to predict TotalCases.

**Model(s) Used:**

- Linear Regression: This simple linear model was selected for its interpretability and its ability to provide a baseline prediction.

- In future iterations, more complex models such as Random Forest or XGBoost may be explored for better performance.

**Data Preprocessing & Feature Engineering:**

- The dataset has no date column, so no date-based features were included in feature engineering.

- We used features like NewCases, TotalDeaths, ActiveCases, and lag features (shifted values) to predict the target variable, TotalCases.

- The data was cleaned to handle missing values using forward fill (df.fillna(method='ffill')).

**Model Evaluation:**

The performance of the Linear Regression model was evaluated using several metrics:

- Mean Absolute Error (MAE): The average absolute difference between the predicted and actual values. Lower values are better.

- Mean Squared Error (MSE): The average of the squared differences between the predicted and actual values.

- Root Mean Squared Error (RMSE): The square root of MSE, which expresses the error in the same units as the target.

- R² (Coefficient of Determination): A measure of how well the model's predictions fit the actual data. A higher R² value indicates a better fit.

**Performance Summary:**

- Training MAE: 1841.16

- Test MAE: 2461.50

- Training R²: 0.9964

- Test R²: 0.9742

These results show that the model performs well, explaining over 97% of the variance in the test set. However, the test MAE is higher than the training MAE, which suggests the model might be slightly overfitting the training data.