# Capstone Project: COVID-19

# Problem Statement

Given data about COVID-19 patients, write code to visualize the impact of the
pandemic, analyze trends in the rates of infection and recovery, and predict
the number of cases expected one week into the future based on current
trends.

# Project Objective

This project aims to conduct a comprehensive analysis of COVID-19 data. Using this data, we examine and visualize the pandemic's impact and key trends, such as the rates of infection and recovery, and develop forecasts of future case numbers.

# Data Description

The dataset consists of the following features:

0. **Province/State**: Name of the state
1. **Country/Region**: Name of the country
2. **Lat**: Latitude of the state
3. **Long**: Longitude of the state
4. **Date**: Date on which the cases were recorded
5. **Confirmed**: Total confirmed cases on that date
6. **Deaths**: Total confirmed deaths on that date
7. **Recovered**: Total number of recovered patients on that date
8. **Active**: Total number of active cases recorded on that date
9. **WHO Region**: Name of the WHO region to which the state belongs

# Data Preprocessing steps and inspiration

Data preprocessing is an important step in which we prepare the data for analysis by removing or replacing unnecessary or invalid rows and columns. The result is a cleaner dataset made up of informative features.

*Step 1: Import all necessary libraries*


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
!pip install plotly



In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

*Step 2: Read the dataset with the help of pandas*

In [None]:

df=pd.read_csv('/content/Covid_19_Clean_Complete (1).csv')
df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
0,,Afghanistan,33.93911,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean
1,,Albania,41.1533,20.1683,2020-01-22,0,0,0,0,Europe
2,,Algeria,28.0339,1.6596,2020-01-22,0,0,0,0,Africa
3,,Andorra,42.5063,1.5218,2020-01-22,0,0,0,0,Europe
4,,Angola,-11.2027,17.8739,2020-01-22,0,0,0,0,Africa


*Step 3: Find out columns and their data types*

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  14664 non-null  object 
 1   Country/Region  49068 non-null  object 
 2   Lat             49068 non-null  float64
 3   Long            49068 non-null  float64
 4   Date            49068 non-null  object 
 5   Confirmed       49068 non-null  int64  
 6   Deaths          49068 non-null  int64  
 7   Recovered       49068 non-null  int64  
 8   Active          49068 non-null  int64  
 9   WHO Region      49068 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 3.7+ MB


*Step 4: Find out total number of rows and columns in the dataset*

In [None]:
df.shape

(49068, 10)

*Step 5: Find null values in the dataset*

In [None]:
df.isnull().sum()
# Many rows have a null Province/State, so we do not drop them; doing so would lose a large share of the data.

Unnamed: 0,0
Province/State,34404
Country/Region,0
Lat,0
Long,0
Date,0
Confirmed,0
Deaths,0
Recovered,0
Active,0
WHO Region,0
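
Because the missing values are confined to the Province/State column, one option is to fill them with a placeholder rather than drop the rows. The sketch below is not part of the original analysis: the toy frame and the placeholder label `'Unknown'` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    'Province/State': [np.nan, 'Hubei', np.nan],
    'Country/Region': ['Albania', 'China', 'Angola'],
    'Confirmed': [0, 444, 0],
})

# Fill missing states with a placeholder label instead of dropping rows
toy_filled = toy.fillna({'Province/State': 'Unknown'})

# No rows are lost and no nulls remain in the column
rows_kept = len(toy_filled)
nulls_left = int(toy_filled['Province/State'].isnull().sum())
print(rows_kept, nulls_left)
```

A placeholder also keeps these rows visible in any later group-by on the state column, since pandas drops null group keys by default.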


In [None]:
# Rename the columns that have long names
df=df.rename(columns={'Province/State':'State', 'Country/Region':'Country'})
df.columns

Index(['State', 'Country', 'Lat', 'Long', 'Date', 'Confirmed', 'Deaths',
       'Recovered', 'Active', 'WHO Region'],
      dtype='object')

*Step 6: Find duplicate values in the dataset*

In [None]:
df.duplicated().sum()

0

In [None]:
#Total countries in the dataset
df.Country.nunique()

187

In [None]:
#Total states in the dataset
df.State.nunique()

78

In [None]:
# Finding Start and End date of data recording
print(f'Start date :{df.Date.min()} end_date :{df.Date.max()}')

Start date :2020-01-22 end_date :2020-07-27


In [None]:
pd.to_datetime(df.Date.max())-pd.to_datetime(df.Date.min())

Timedelta('187 days 00:00:00')
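
Note that 187 days is the gap between the first and last dates; counting both endpoints gives one more calendar day of records. A quick check, using only the start and end dates printed above (the variable names are ours):

```python
import pandas as pd

start = pd.Timestamp('2020-01-22')
end = pd.Timestamp('2020-07-27')

# Difference between the two endpoints
span_days = (end - start).days

# Inclusive count of calendar days between the endpoints
n_dates = len(pd.date_range(start, end, freq='D'))

print(span_days, n_dates)
```

If every location reports on every date, this is consistent with the 49,068 rows dividing evenly into 261 rows per date.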

In [None]:
start_date=df[df.Date=='2020-01-22']
end_date=df[df.Date=='2020-07-27']

In [None]:
# Data on the first and last days of data recording
initial_data_record= start_date.groupby('Country')[['Confirmed','Deaths','Recovered','Active']].sum().reset_index()
final_data_record=end_date.groupby('Country')[['Confirmed','Deaths','Recovered','Active']].sum().reset_index()
print(initial_data_record)
print('**************************************************************')
print(final_data_record)

                Country  Confirmed  Deaths  Recovered  Active
0           Afghanistan          0       0          0       0
1               Albania          0       0          0       0
2               Algeria          0       0          0       0
3               Andorra          0       0          0       0
4                Angola          0       0          0       0
..                  ...        ...     ...        ...     ...
182  West Bank and Gaza          0       0          0       0
183      Western Sahara          0       0          0       0
184               Yemen          0       0          0       0
185              Zambia          0       0          0       0
186            Zimbabwe          0       0          0       0

[187 rows x 5 columns]
**************************************************************
                Country  Confirmed  Deaths  Recovered  Active
0           Afghanistan      36263    1269      25198    9796
1               Albania       4880     144   

In [None]:
# Total confirmed cases from day one to the last recorded day
confirmed_cases= df.groupby(by='Date')['Confirmed'].sum().reset_index()
confirmed_cases

Unnamed: 0,Date,Confirmed
0,2020-01-22,555
1,2020-01-23,654
2,2020-01-24,941
3,2020-01-25,1434
4,2020-01-26,2118
...,...,...
183,2020-07-23,15510481
184,2020-07-24,15791645
185,2020-07-25,16047190
186,2020-07-26,16251796


In [None]:
# Total deaths from day one to the last recorded day
death_cases=df.groupby(by='Date')['Deaths'].sum().reset_index()
death_cases

Unnamed: 0,Date,Deaths
0,2020-01-22,17
1,2020-01-23,18
2,2020-01-24,26
3,2020-01-25,42
4,2020-01-26,56
...,...,...
183,2020-07-23,633506
184,2020-07-24,639650
185,2020-07-25,644517
186,2020-07-26,648621


In [None]:
# Total recoveries from day one to the last recorded day
recovered_cases=df.groupby(by='Date')['Recovered'].sum().reset_index()
recovered_cases

Unnamed: 0,Date,Recovered
0,2020-01-22,28
1,2020-01-23,30
2,2020-01-24,36
3,2020-01-25,39
4,2020-01-26,52
...,...,...
183,2020-07-23,8710969
184,2020-07-24,8939705
185,2020-07-25,9158743
186,2020-07-26,9293464


In [None]:
# Date-wise case summary
case_summ=df.groupby('Date')[['Confirmed','Deaths','Recovered','Active']].sum().reset_index()
case_summ


Unnamed: 0,Date,Confirmed,Deaths,Recovered,Active
0,2020-01-22,555,17,28,510
1,2020-01-23,654,18,30,606
2,2020-01-24,941,26,36,879
3,2020-01-25,1434,42,39,1353
4,2020-01-26,2118,56,52,2010
...,...,...,...,...,...
183,2020-07-23,15510481,633506,8710969,6166006
184,2020-07-24,15791645,639650,8939705,6212290
185,2020-07-25,16047190,644517,9158743,6243930
186,2020-07-26,16251796,648621,9293464,6309711
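
The totals above are cumulative, so the rate of infection is easier to read from day-over-day changes. The following is a minimal sketch using the first five confirmed totals from the table; the column names `New` and `Growth` are our own and do not appear in the dataset.

```python
import pandas as pd

# First five rows of the date-wise confirmed totals shown above
confirmed_head = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-22', '2020-01-23', '2020-01-24',
                            '2020-01-25', '2020-01-26']),
    'Confirmed': [555, 654, 941, 1434, 2118],
})

# Day-over-day increase in the cumulative total = newly confirmed cases
confirmed_head['New'] = confirmed_head['Confirmed'].diff()

# Relative daily growth of the cumulative total
confirmed_head['Growth'] = confirmed_head['Confirmed'].pct_change()

print(confirmed_head)
```

The first row has no previous day to compare against, so both derived columns are empty there.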


In [None]:
# Comparison of confirmed, recovered, death, and active cases from day one to the last recorded day
fig=px.line(case_summ,x='Date',
        y=['Confirmed','Deaths','Recovered','Active'] ,
        title='Cases distribution according to timeline',
        )
fig.show()

In [None]:
# Top 10 countries by active cases (daily Active counts summed over the whole recording period)
active= df.groupby(by='Country')['Active'].sum().reset_index().sort_values(by='Active',ascending=False).head(10)
fig= px.bar(active,
            y='Country',
            x=['Active'],
            title='Top 10 countries with most active cases',
            orientation='h',
            # With wide-form input, Plotly names the generated columns 'value' and 'variable' (lowercase)
            labels={'value':'Number of active cases', 'variable':'Metric'})
fig.show()
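
The chart above sums each country's daily Active counts over every recorded date. If the intent is the number of cases active at the end of the recording period, a snapshot of the last date may be closer to it. The sketch below shows that alternative on a toy frame; the values are made up, and the column names follow the renamed dataset.

```python
import pandas as pd

# Toy data: two countries, two dates (values are illustrative only)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Date': ['2020-07-26', '2020-07-27', '2020-07-26', '2020-07-27'],
    'Active': [100, 40, 30, 60],
})

# Keep only rows from the most recent date, then total per country
latest = toy[toy['Date'] == toy['Date'].max()]
active_latest = (latest.groupby('Country')['Active'].sum()
                       .sort_values(ascending=False)
                       .reset_index())

# Summing over all dates instead can rank the countries differently
active_summed = (toy.groupby('Country')['Active'].sum()
                    .sort_values(ascending=False)
                    .reset_index())

print(active_latest)
print(active_summed)
```

In this toy case the two approaches disagree on which country ranks first, which is why the choice of aggregation is worth stating alongside the chart.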

In [None]:
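# 10 countries with the fewest active cases (daily Active counts summed over the whole recording period)
active= df.groupby(by='Country')['Active'].sum().reset_index().sort_values(by='Active',ascending=False).tail(10)
fig= px.bar(active,
            x='Country',
            y=['Active'],
            title='10 countries with the fewest active cases',
            # With wide-form input, Plotly names the generated columns 'value' and 'variable' (lowercase)
            labels={'value':'Number of active cases', 'variable':'Metric'})
fig.show()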

In [None]:
# Fatality rate (Deaths/Confirmed) of each country, and comparison between the highest and lowest fatality rates
country_data= df.groupby(by='Country')[['Deaths','Confirmed']].sum().reset_index()
country_data['Fatality']= country_data.Deaths/country_data.Confirmed
# sort_values returns a new DataFrame, so the result must be assigned back
country_data= country_data.sort_values(by='Fatality', ascending=False)

# After sorting, the first row has the highest rate and the last row the lowest
highest_fatality=country_data.head(1)
lowest_fatality=country_data.tail(1)

fig= make_subplots(rows=1, cols=2, subplot_titles=('Country with highest Fatality','Country with lowest Fatality'),
                   shared_yaxes=True)

fig.add_trace(
    go.Bar(x=highest_fatality['Country'],
           y=highest_fatality['Fatality']),
    row=1,col=1)

fig.add_trace(
    go.Bar(x=lowest_fatality['Country'],
           y=lowest_fatality['Fatality']),
    row=1,col=2
)
fig.update_layout(yaxis_title='Fatality rate')
fig.show()

In [None]:
Regional_data= df.groupby('WHO Region')[['Recovered','Deaths','Confirmed','Active']].sum().reset_index()

fig= px.bar(Regional_data, x='WHO Region', y=['Recovered','Deaths','Confirmed','Active'],
            barmode='group', title="Clustered Bar Chart of COVID-19 Metrics by Region")
fig.show()

# Algorithm: Motivation and Reasons

The Prophet library is a good option for time series data for several reasons. First, it has a user-friendly design and a straightforward API that requires minimal tuning. Second, it fits trend and seasonality components automatically, without manual adjustment, and it lets the user add holidays or other events to improve model accuracy. Third, the model provides uncertainty intervals for its forecasts, giving an estimate of uncertainty that is valuable for decision making. Last but not least, Prophet copes well with missing values and outliers, which are very common in real-world data.

# Assumptions

Decomposition: The time series is assumed to be decomposable into
1. Trend: Long-term progression of the data
2. Seasonality: Periodic patterns
3. Holidays/Events: Events that have an impact on the data
4. Errors: Noise

Trend Modeling: The library assumes that the trend can be modeled as one of
1. Piecewise linear: Linear segments connected at changepoints, where the growth rate changes abruptly.
2. Logistic growth: Suitable when there is a saturating effect or capacity limit.

These changepoints can capture the major shifts in the time series trend.

Seasonality Modeling: Seasonal components are assumed to be periodic, repeating over consistent intervals, so past seasonal behaviour is a good indicator of future patterns.

Errors: Prophet assumes that the errors are independent and normally distributed. This helps in constructing uncertainty intervals around forecasts.

Holidays/Events: The model assumes that these events occur on specific dates and that their impact is fixed. This helps the model adjust the forecast around these events.
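
In symbols, the assumptions above are commonly written as follows. This is a sketch of the additive form described in the Prophet paper, not a full specification: the symbols $C$ (carrying capacity), $k$ (growth rate), $m$ (offset), $P$ (seasonal period) and $N$ (number of Fourier terms) are notation introduced here, and the changepoint adjustments to $k$ and $m$ are omitted, so the trend formulas describe a single segment between changepoints.

$$
\begin{aligned}
y(t) &= g(t) + s(t) + h(t) + \varepsilon_t \\
g_{\text{linear}}(t) &= k\,t + m \\
g_{\text{logistic}}(t) &= \frac{C}{1 + \exp\!\big(-k\,(t - m)\big)} \\
s(t) &= \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P} \right)
\end{aligned}
$$

Here $g(t)$ is the trend, $s(t)$ the seasonality, $h(t)$ the holiday or event effect, and $\varepsilon_t$ the noise term assumed to be normally distributed.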

In [None]:
!pip install prophet




In [None]:
from prophet import Prophet
model=Prophet(yearly_seasonality=False, daily_seasonality=True,interval_width=0.95, changepoint_prior_scale=0.1 ,uncertainty_samples=1000)

In [None]:
dataset=pd.read_csv('/content/Covid_19_Clean_Complete (1).csv',parse_dates=['Date'])

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Province/State  14664 non-null  object        
 1   Country/Region  49068 non-null  object        
 2   Lat             49068 non-null  float64       
 3   Long            49068 non-null  float64       
 4   Date            49068 non-null  datetime64[ns]
 5   Confirmed       49068 non-null  int64         
 6   Deaths          49068 non-null  int64         
 7   Recovered       49068 non-null  int64         
 8   Active          49068 non-null  int64         
 9   WHO Region      49068 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(4), object(3)
memory usage: 3.7+ MB


In [None]:
# Calculate total confirmed cases on each date
confirmed= dataset.groupby('Date')['Confirmed'].sum().reset_index()
confirmed

Unnamed: 0,Date,Confirmed
0,2020-01-22,555
1,2020-01-23,654
2,2020-01-24,941
3,2020-01-25,1434
4,2020-01-26,2118
...,...,...
183,2020-07-23,15510481
184,2020-07-24,15791645
185,2020-07-25,16047190
186,2020-07-26,16251796


In [None]:
confirmed.rename(columns={'Date': 'ds', 'Confirmed':'y'}, inplace=True)


In [None]:
model.fit(confirmed)

<prophet.forecaster.Prophet at 0x792269cdd6d0>

In [None]:
#Making predictions for the upcoming week
prediction= model.make_future_dataframe(periods=7)
forecast=model.predict(prediction)
forecast

Unnamed: 0,ds,trend,yhat_lower,yhat_upper,trend_lower,trend_upper,additive_terms,additive_terms_lower,additive_terms_upper,daily,...,weekly,weekly_lower,weekly_upper,yearly,yearly_lower,yearly_upper,multiplicative_terms,multiplicative_terms_lower,multiplicative_terms_upper,yhat
0,2020-01-22,-7.743521e+05,-1.711295e+04,6.168695e+03,-7.743521e+05,-7.743521e+05,768756.830100,768756.830100,768756.830100,-3.102682e+06,...,-6583.932389,-6583.932389,-6583.932389,3.878023e+06,3.878023e+06,3.878023e+06,0.0,0.0,0.0,-5.595238e+03
1,2020-01-23,-6.921265e+05,-9.787420e+03,1.245844e+04,-6.921265e+05,-6.921265e+05,693813.064115,693813.064115,693813.064115,-3.102682e+06,...,1043.048248,1043.048248,1043.048248,3.795452e+06,3.795452e+06,3.795452e+06,0.0,0.0,0.0,1.686542e+03
2,2020-01-24,-6.099010e+05,-2.585749e+03,2.083638e+04,-6.099010e+05,-6.099010e+05,618963.159206,618963.159206,618963.159206,-3.102682e+06,...,8881.150216,8881.150216,8881.150216,3.712764e+06,3.712764e+06,3.712764e+06,0.0,0.0,0.0,9.062183e+03
3,2020-01-25,-5.276754e+05,3.300881e+02,2.246164e+04,-5.276754e+05,-5.276754e+05,539272.740317,539272.740317,539272.740317,-3.102682e+06,...,11562.843925,11562.843925,11562.843925,3.630392e+06,3.630392e+06,3.630392e+06,0.0,0.0,0.0,1.159731e+04
4,2020-01-26,-4.454499e+05,-7.071681e+03,1.519181e+04,-4.454499e+05,-4.454499e+05,449340.771538,449340.771538,449340.771538,-3.102682e+06,...,3389.948058,3389.948058,3389.948058,3.548633e+06,3.548633e+06,3.548633e+06,0.0,0.0,0.0,3.890888e+03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,2020-07-30,1.739801e+07,1.721069e+07,1.723399e+07,1.739682e+07,1.739884e+07,-175666.317514,-175666.317514,-175666.317514,-3.102682e+06,...,1043.048248,1043.048248,1043.048248,2.925973e+06,2.925973e+06,2.925973e+06,0.0,0.0,0.0,1.722235e+07
191,2020-07-31,1.749805e+07,1.743608e+07,1.746021e+07,1.749537e+07,1.749981e+07,-49972.376595,-49972.376595,-49972.376595,-3.102682e+06,...,8881.150216,8881.150216,8881.150216,3.043829e+06,3.043829e+06,3.043829e+06,0.0,0.0,0.0,1.744808e+07
192,2020-08-01,1.759809e+07,1.764159e+07,1.766714e+07,1.759381e+07,1.760153e+07,56000.689844,56000.689844,56000.689844,-3.102682e+06,...,11562.843925,11562.843925,11562.843925,3.147120e+06,3.147120e+06,3.147120e+06,0.0,0.0,0.0,1.765409e+07
193,2020-08-02,1.769813e+07,1.781801e+07,1.784409e+07,1.769183e+07,1.770323e+07,133576.117112,133576.117112,133576.117112,-3.102682e+06,...,3389.948058,3389.948058,3389.948058,3.232868e+06,3.232868e+06,3.232868e+06,0.0,0.0,0.0,1.783171e+07


In [None]:
fig= px.line(forecast, x='ds', y=['yhat'])
fig.show()

In [None]:
#cross validation
from prophet.diagnostics import performance_metrics,cross_validation


In [None]:
df_cv= cross_validation(model, initial='80 days', period='5 days', horizon='10 days')
pm=performance_metrics(df_cv)
pm

INFO:prophet:Making 20 forecasts with cutoffs between 2020-04-13 00:00:00 and 2020-07-17 00:00:00


Unnamed: 0,horizon,mse,rmse,mae,mape,mdape,smape,coverage
0,1 days,37361980000.0,193292.479001,137881.903087,0.014972,0.015867,0.015134,0.2
1,2 days,50629200000.0,225009.332234,163295.168822,0.017499,0.020129,0.017721,0.2
2,3 days,65736680000.0,256391.653143,188762.923006,0.020335,0.022775,0.020623,0.25
3,4 days,85329610000.0,292112.319483,216685.157325,0.023144,0.025792,0.02351,0.25
4,5 days,109801200000.0,331362.576359,246394.836545,0.025762,0.029021,0.026218,0.3
5,6 days,137123000000.0,370301.275593,276786.231655,0.028558,0.03183,0.029112,0.3
6,7 days,170696500000.0,413154.381844,311474.332305,0.031706,0.036847,0.032386,0.3
7,8 days,212973000000.0,461490.005296,348506.233509,0.034723,0.040793,0.035556,0.3
8,9 days,253070400000.0,503061.05686,382310.647796,0.037582,0.044258,0.038559,0.3
9,10 days,303227300000.0,550660.780642,421245.007256,0.04104,0.047661,0.042186,0.3


# Model Evaluation and Techniques

In this project, we evaluated our Prophet-based forecasting model using several metrics, including MAE, RMSE, and MAPE, and also analyzed the coverage of the forecast’s uncertainty intervals. We applied time series cross-validation with a rolling forecast origin, in which each model is trained on all data up to a cutoff date, to assess out-of-sample performance.


On the technique side, Prophet decomposes the time series into trend, seasonal, and (when configured) holiday effects; no holiday effects were added in this model. We checked for missing and duplicate values during preprocessing and fit the model to make predictions for the upcoming week. For cross-validation, the first 80 days of the recorded data serve as the minimum training period (initial='80 days'). Cutoff dates are spaced 5 days apart (period='5 days'); at each cutoff Prophet refits the model on the data up to that date and predicts the next 10 days (horizon='10 days'). Because our dataset covers a short span of time, we tuned the model by enabling daily seasonality, widening interval_width to 0.95, raising changepoint_prior_scale to 0.1 so that the trend adapts more flexibly to changes, and drawing 1000 uncertainty_samples to estimate the uncertainty intervals.
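
To make the cross-validation schedule concrete, the sketch below reproduces the cutoff dates by hand. It assumes that cutoffs are placed by stepping back in `period`-sized steps from the last date minus the horizon, stopping once fewer than `initial` days of history would remain; this is our reading of the behaviour rather than Prophet's own code, but it matches the 20 cutoffs between 2020-04-13 and 2020-07-17 reported in the log above.

```python
import pandas as pd

first_date = pd.Timestamp('2020-01-22')
last_date = pd.Timestamp('2020-07-27')
initial = pd.Timedelta('80 days')
period = pd.Timedelta('5 days')
horizon = pd.Timedelta('10 days')

# Walk backwards from the latest cutoff that still leaves a full horizon,
# keeping every cutoff with at least `initial` days of history before it
cutoffs = []
cutoff = last_date - horizon
while cutoff >= first_date + initial:
    cutoffs.append(cutoff)
    cutoff -= period
cutoffs.reverse()

print(len(cutoffs), cutoffs[0].date(), cutoffs[-1].date())
```

Each cutoff yields one model fit and one 10-day forecast, so the performance table aggregates errors from all of these forecasts by horizon.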



# Inferences

1. Model predictions are more reliable in the short term than in the long term: the mean absolute percentage error (MAPE) increases steadily as the forecast horizon grows.
2. The very low MAPE (ranging from about 1.5% to 4.1%) suggests that the central forecasts (the predicted values themselves) are quite accurate on a relative percentage basis.
3. Even at its best, only about 30% of the actual values fall within the prediction intervals, far below the 95% implied by interval_width=0.95. This indicates that the uncertainty estimates are still too narrow.
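
For reference, the two metrics discussed in these inferences can be computed by hand from a frame shaped like the cross-validation output. The sketch below is simplified: the toy values are made up, and it ignores the per-horizon rolling aggregation that `performance_metrics` applies.

```python
import pandas as pd

# Toy stand-in for the frame returned by cross_validation (values are made up)
toy_cv = pd.DataFrame({
    'y':          [100.0, 200.0, 300.0, 400.0],
    'yhat':       [ 98.0, 210.0, 290.0, 420.0],
    'yhat_lower': [ 95.0, 205.0, 280.0, 410.0],
    'yhat_upper': [101.0, 215.0, 305.0, 430.0],
})

# MAPE: mean of |actual - predicted| / |actual|
mape = ((toy_cv['y'] - toy_cv['yhat']).abs() / toy_cv['y'].abs()).mean()

# Coverage: share of actual values that fall inside the predicted interval
inside = (toy_cv['y'] >= toy_cv['yhat_lower']) & (toy_cv['y'] <= toy_cv['yhat_upper'])
coverage = inside.mean()

print(round(mape, 4), coverage)
```

The toy rows illustrate how the two metrics can diverge: every point forecast is within 5% of the actual value, yet half of the actual values lie outside their intervals.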




# Future Possibilities

1. Incorporating Additional Data Sources:
Expand the dataset by including more recent data or adding more independent features, such as vaccination, policies, and demographics, to improve model inputs and capture external influences.
2. Advanced Modeling Techniques:
Experiment with alternative forecasting methods such as LSTM neural networks, ARIMA variants, or ensemble models to compare performance and robustness against Prophet.
3. Interactive Visualization Tools:
Create user-friendly dashboards using tools like Plotly Dash, Tableau, or Power BI. These platforms can help stakeholders visualize trends, compare predictions with actual data, and explore model components interactively.

# Conclusion

In this project, we developed a forecasting model for COVID-19 cases using Prophet. Using historical data spanning 187 days, we fit a model that decomposes the time series into trend and seasonality components and generated short-term forecasts. Our evaluation, through cross-validation and metrics such as MAE, RMSE, and MAPE, showed that the point forecasts are relatively accurate, especially for short horizons (1–3 days ahead), with low percentage errors.


Overall, the project highlights Prophet’s potential for capturing COVID-19 trends while also underlining the challenges of modeling uncertainty with a small dataset.
