## Step 1: Dataset Description and Objective

**Dataset Description:**

The dataset contains COVID-19 data (confirmed cases, recoveries, and deaths) from regions around the world, with a focus on India. It is available in CSV and Excel formats and is updated regularly. Each record includes the date, country/region, confirmed cases, recoveries, and deaths.

**Objective:**

This project aims to analyze COVID-19 trends using Python. Objectives include visualizing impacts, analyzing infection and recovery rates, predicting future cases using Facebook Prophet, and deriving actionable insights for informed decision-making.

## Step 2: Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objs as go

## Step 3: Data Collection and Preprocessing

**- Data loading**

In [None]:
df = pd.read_csv(r"E:\csv\Intel csv\covid_19_clean_complete (4).csv")
df.head()

**- Inspect first few rows**
* to understand the structure of the data.

In [None]:
# Display first 5 rows
df.head()

**- Checking column names**

In [None]:
df.columns

* The column names are inconsistent and awkward to type (mixed case, slashes, and spaces), so they are renamed below for clarity and ease of use.

**- Renaming the columns**
* This renaming ensures consistency and clarity in column names, which helps in avoiding confusion during data analysis and interpretation.

In [None]:
# Renaming the columns
df.rename(columns = {
    'Province/State': 'state',
    'Country/Region': 'country',
    'Lat':'lat',
    'Long':'long',
    'Date': 'date',
    'Confirmed':'confirmed',
    'Deaths':'deaths',
    'Recovered': 'recovered',
    'Active': 'active',
    'WHO Region':'WHO' 
}, inplace = True)

In [None]:
df.info()

From the provided DataFrame, we can deduce the following information:

`Geographical Data:`
* The dataset contains information about COVID-19 cases across different geographical locations, including latitude (Lat) and longitude (Long) coordinates.
* Each entry corresponds to a specific province/state (Province/State) and country/region (Country/Region). 

`Temporal Data:` The dataset includes data recorded over time, with the Date column indicating the date of observation.   

`COVID-19 Metrics:` Key metrics related to the COVID-19 pandemic are recorded, including:     
* Confirmed: Total confirmed COVID-19 cases.
* Deaths: Total deaths attributed to COVID-19.
* Recovered: Total number of individuals who have recovered from COVID-19.
* Active: Total active cases (confirmed cases minus deaths and recoveries). 

`Data Types:` The data types of the columns include:
* Float64 for latitude and longitude coordinates (Lat and Long).
* Int64 for numerical data such as confirmed cases, deaths, recoveries, and active cases.
* Object for categorical data such as province/state, country/region, date, and WHO region.

`Missing Values:` There are missing values in the Province/State column, indicating that not all entries have province/state information.

`WHO Region:` The WHO Region column categorizes countries/regions into World Health Organization (WHO) regions, providing additional contextual information for analysis.

Overall, this DataFrame provides a comprehensive snapshot of COVID-19 cases, deaths, and recoveries across different geographical locations over time, enabling analysis and visualization of the pandemic's impact.
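
The definition of Active given above (confirmed cases minus deaths and recoveries) can be checked directly. The following is a minimal sketch on made-up rows that use the renamed column names; `sample`, `expected_active`, and `mismatches` are hypothetical names, and the numbers are not taken from the dataset.

```python
import pandas as pd

# Toy rows with the renamed column names from this notebook (values are made up).
sample = pd.DataFrame({
    "confirmed": [100, 250, 40],
    "deaths": [5, 10, 1],
    "recovered": [60, 100, 9],
    "active": [35, 140, 30],
})

# Recompute active cases from the other three metrics.
expected_active = sample["confirmed"] - sample["deaths"] - sample["recovered"]

# Rows where the stored value disagrees with the recomputed one.
mismatches = sample[sample["active"] != expected_active]
print(len(mismatches))
```

Running the same comparison on `df` would show whether the stored `active` column agrees with this definition for every row.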

**- Checking for Duplicates:**
* Count duplicate rows so that any that exist can be removed, which protects data integrity.

In [None]:
# Checking Duplicates Values
df.duplicated().sum()

* Sum is 0, meaning there are no duplicate rows in the DataFrame.
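
To show what a non-zero result would look like, here is a small sketch on a made-up three-row frame. The names `toy`, `n_duplicates`, and `deduped` are hypothetical and the values are not from the dataset.

```python
import pandas as pd

# Made-up frame in which the first two rows are identical.
toy = pd.DataFrame({
    "country": ["India", "India", "Brazil"],
    "date": ["2020-07-27", "2020-07-27", "2020-07-27"],
    "confirmed": [10, 10, 20],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```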

**- Summary Statistics**

In [None]:
df.describe()

The `describe` output reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column. This gives a quick view of the distribution, variability, and magnitude of the COVID-19 metrics across locations and dates, which can guide the analysis that follows.

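**- Total number of active cases**

In [None]:
# The counts are cumulative, so total the latest date only rather than every date
df.loc[df["date"] == df["date"].max(), "active"].sum()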

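**- Country-wise totals of confirmed, death, recovered, and active cases**

In [None]:
# Keep only the latest date, because the counts are cumulative
top = df[df["date"] == "2020-07-27"]
# Select the numeric columns before summing so text columns are not aggregated
world = top.groupby("country")[["confirmed", "deaths", "recovered", "active"]].sum().reset_index()
world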
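
Filtering to a single date before grouping matters because the counts are running totals. The sketch below uses a made-up frame (`toy`, with hypothetical countries "A" and "B") to contrast summing over every date with taking a snapshot of the latest date.

```python
import pandas as pd

# Made-up cumulative counts for two countries over three days.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "date": ["2020-07-25", "2020-07-26", "2020-07-27"] * 2,
    "confirmed": [10, 15, 20, 1, 2, 3],
})

# Summing running totals over every date overstates the total.
summed_over_dates = toy.groupby("country")["confirmed"].sum()

# Filtering to the latest date first gives each country's running total.
# ISO-formatted date strings sort correctly, so max() finds the latest date.
latest = toy[toy["date"] == toy["date"].max()]
snapshot = latest.groupby("country")["confirmed"].sum()

print(summed_over_dates.to_dict(), snapshot.to_dict())
```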

## Step 4: Data Visualization

**4.1 Country-wise active cases**

In [None]:
# px.choropleth: This function from Plotly Express is used to create a choropleth map.

fig=px.choropleth(world,locations='country',
                  locationmode='country names',color='active',
                  hover_name='country', range_color=[1,15000],
                  color_continuous_scale="peach",title='Country-wise active cases')
fig.show()

* By looking at the above choropleth plot, we can get a general idea of which country has the most active cases and which has the least.

**4.2 Country-wise death cases**

In [None]:
figure = px.choropleth(world,locations="country",
                       locationmode = "country names", color="deaths",
                       hover_name="country",range_color=[1,10000],
                       color_continuous_scale="blues",
                       title="Country-wise death cases")
figure.show()

* By looking at the above choropleth plot, we can get a general idea of which country has the most death cases and which has the least.

**4.3 Country-wise recovered cases**

In [None]:
figure = px.choropleth(world,locations="country",
                       locationmode = "country names", color="recovered",
                       hover_name="country",range_color=[1,10000],
                       color_continuous_scale="blues",
                       title="Country-wise recovered cases")
figure.show()

* By looking at the above choropleth plot, we can get a general idea of which country has the most recovered cases and which has the least.

## Step 5: Analyzing Trends

**5.1 Trend of COVID-19 spread over time**

In [None]:
# Group by date and sum confirmed cases
total_cases = df.groupby("date")['confirmed'].sum().reset_index()

# Create a trace for the trend of confirmed cases
trace = go.Scatter(x=total_cases['date'], y=total_cases['confirmed'], mode='lines', line=dict(color='red'))

# Create layout
layout = go.Layout(title='Worldwide Confirmed Cases Over Time',
                   xaxis=dict(title='Dates', tickangle=90, tickfont=dict(size=10)),
                   yaxis=dict(title='Total cases', tickfont=dict(size=15)),
                   width=950, height=500)

# Create figure
fig = go.Figure(data=[trace], layout=layout)

# Show plot
fig.show()

* At first the number of cases is very low. It then rises gradually and, from about March 17, 2020, climbs steeply through July 27, 2020.

In [None]:
total_cases

* On January 22, 2020, the number of confirmed cases worldwide was 555.            
* On January 23, 2020, the number of confirmed cases worldwide was 654.         
* On July 27, 2020, the number of confirmed cases worldwide was 16,480,485.          

These data points indicate the progression of confirmed COVID-19 cases globally over time, highlighting the significant increase in cases from January to July 2020.
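
Because these figures are running totals, the number of new cases per day is the difference between consecutive dates. A minimal sketch using the two consecutive values quoted above (555 and 654); the names `cumulative`, `daily_new`, and `growth_pct` are hypothetical.

```python
import pandas as pd

# The two consecutive worldwide totals quoted in the text above.
cumulative = pd.Series(
    [555, 654],
    index=pd.to_datetime(["2020-01-22", "2020-01-23"]),
    name="confirmed",
)

# New cases per day: difference between consecutive running totals.
daily_new = cumulative.diff()

# Day-over-day growth in percent.
growth_pct = cumulative.pct_change() * 100

print(daily_new.iloc[1], round(growth_pct.iloc[1], 1))
```

The first entry of both results is `NaN`, since there is no earlier day to compare against. Applied to `total_cases['confirmed']`, the same `diff()` call would give a daily new-case series for the whole period.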

**5.2 Top 20 countries having most death cases**

In [None]:
# Counts are cumulative, so this sums each country's running total over every date
top_deaths = df.groupby(by="country")["deaths"].sum().sort_values(ascending=False).head(20).reset_index()
top_deaths

In [None]:
# Sort the data by deaths
top_deaths = top_deaths.sort_values(by='deaths', ascending=True)

# Create a horizontal bar plot
fig = go.Figure(go.Bar(
    x=top_deaths['deaths'],
    y=top_deaths['country'],
    orientation='h',  # horizontal bars
    marker=dict(color='red')  # bar color
))

# Customize layout
fig.update_layout(
    title='Top 20 Countries with Most Death Cases',
    xaxis=dict(title='Total Deaths', tickfont=dict(size=15)),
    yaxis=dict(title='Country', tickfont=dict(size=15)),
    width=950, height=550
)

# Show plot
fig.show()


**5.3 Top 20 Countries with highest active cases**

In [None]:
# Counts are cumulative, so this sums each country's running total over every date
top_actives = df.groupby(by="country")["active"].sum().sort_values(ascending=False).head(20).reset_index()
top_actives

In [None]:
# Sort the data by active cases 
top_actives = top_actives.sort_values(by='active', ascending=True)

# Create a horizontal bar plot
fig = go.Figure(go.Bar(
    x=top_actives['active'],
    y=top_actives['country'],
    orientation='h',  # horizontal bars
    marker=dict(color='blue')  # bar color
))

# Customize layout
fig.update_layout(
    title='Top 20 Countries with Most Active Cases',
    xaxis=dict(title='Total Cases', tickfont=dict(size=15)),
    yaxis=dict(title='Country', tickfont=dict(size=15)),
    width=950, height=550
)

# Show plot
fig.show()

**5.4 Top 20 Countries with highest confirmed cases**

In [None]:
# Counts are cumulative, so this sums each country's running total over every date
top_confirmed = df.groupby(by="country")["confirmed"].sum().sort_values(ascending=False).head(20).reset_index()
top_confirmed

In [None]:
# Sort the data by confirmed cases 
top_confirmed = top_confirmed.sort_values(by='confirmed', ascending=True)

# Create a horizontal bar plot
fig = go.Figure(go.Bar(
    x=top_confirmed['confirmed'],
    y=top_confirmed['country'],
    orientation='h',  # horizontal bars
    marker=dict(color='green')  # bar color
))

# Customize layout
fig.update_layout(
    title='Top 20 Countries with Most Confirmed Cases',
    xaxis=dict(title='Total Cases', tickfont=dict(size=15)),
    yaxis=dict(title='Country', tickfont=dict(size=15)),
    width=950, height=550
)

# Show plot
fig.show()

**5.5 Day-wise recovered, deaths, confirmed, and active cases for the 'top 5 countries with the most active cases'**
* The top 5 countries with the highest active cases are: US, Brazil, UK, Russia, and India.

**5.5.1 For US**

In [None]:
US = df[df.country=="US"]
US = US.groupby(by="date")[["recovered","deaths","confirmed","active"]].sum().reset_index()

In [None]:
US

**5.5.2 For Brazil**

In [None]:
Brazil = df[df.country=="Brazil"]
Brazil = Brazil.groupby(by="date")[["recovered","deaths","confirmed","active"]].sum().reset_index()

In [None]:
Brazil

**5.5.3 For UK**

In [None]:
UK = df[df.country =="United Kingdom"]
UK = UK.groupby(by = "date")[["recovered", "deaths", "confirmed", "active"]].sum().reset_index()

In [None]:
UK

**5.5.4 For Russia**

In [None]:
Russia = df[df.country == "Russia"]
Russia = Russia.groupby(by = "date")[["recovered", "deaths", "confirmed", "active"]].sum().reset_index()

In [None]:
Russia

**5.5.5 For India**

In [None]:
India = df[df.country =="India"]
India = India.groupby(by = "date")[["recovered", "deaths", "confirmed", "active"]].sum().reset_index()

In [None]:
India
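
The five cells above repeat the same filter-and-group pattern. One way to avoid the repetition is a small helper; the sketch below is an assumption about how it could look, not part of the original notebook. `country_daily`, `METRICS`, and the `toy` frame are hypothetical, and the values are made up.

```python
import pandas as pd

METRICS = ["recovered", "deaths", "confirmed", "active"]

def country_daily(frame: pd.DataFrame, country: str) -> pd.DataFrame:
    """Return day-wise totals of METRICS for one country."""
    subset = frame[frame["country"] == country]
    # Summing per date merges rows from several provinces of the same country.
    return subset.groupby("date")[METRICS].sum().reset_index()

# Made-up rows: two provinces for one country on the first date.
toy = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "United Kingdom", "India"],
    "date": ["2020-07-26", "2020-07-26", "2020-07-27", "2020-07-27"],
    "recovered": [1, 0, 2, 5],
    "deaths": [2, 1, 2, 1],
    "confirmed": [10, 5, 12, 20],
    "active": [7, 4, 8, 14],
})

uk = country_daily(toy, "United Kingdom")
print(uk)
```

With the real data, `country_daily(df, "US")` would produce the same frame as the `US` cell above.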

**5.6 Top 5 Countries: COVID-19 Confirmed Cases Over Time**

In [None]:
# Create one trace per country (x is the row index, i.e. days since the first record)
trace_us = go.Scatter(x=US.index, y=US['confirmed'], mode='lines', name='US', line=dict(color='pink'))
trace_brazil = go.Scatter(x=Brazil.index, y=Brazil['confirmed'], mode='lines', name='Brazil', line=dict(color='blue'))
trace_uk = go.Scatter(x=UK.index, y=UK['confirmed'], mode='lines', name='UK', line=dict(color='yellow'))
trace_russia = go.Scatter(x=Russia.index, y=Russia['confirmed'], mode='lines', name='Russia', line=dict(color='green'))
trace_india = go.Scatter(x=India.index, y=India['confirmed'], mode='lines', name='India', line=dict(color='red'))

# Combine traces into data list
data = [trace_us, trace_brazil, trace_uk, trace_russia, trace_india]

# Define layout
layout = go.Layout(
    title='Top 5 Countries: COVID-19 Confirmed Cases Over Time',
    xaxis=dict(title='No. of days', tickfont=dict(size=20)),
    yaxis=dict(title='Confirmed cases', tickfont=dict(size=20)),
    width=950, height=500
)

# Create figure
fig = go.Figure(data=data, layout=layout)

# Show plot
fig.show()

* Among these five countries, the United States has the highest number of confirmed cases, followed by Brazil, India, Russia, and the United Kingdom.

**5.7 Top 5 Countries: COVID-19 Death Cases Over Time**

In [None]:
# Create traces for each country
trace_us = go.Scatter(x=US.index, y=US['deaths'], mode='lines', name='US', line=dict(color='pink'))
trace_brazil = go.Scatter(x=Brazil.index, y=Brazil['deaths'], mode='lines', name='Brazil', line=dict(color='blue'))
trace_uk = go.Scatter(x=UK.index, y=UK['deaths'], mode='lines', name='UK', line=dict(color='yellow'))
trace_russia = go.Scatter(x=Russia.index, y=Russia['deaths'], mode='lines', name='Russia', line=dict(color='green'))
trace_india = go.Scatter(x=India.index, y=India['deaths'], mode='lines', name='India', line=dict(color='red'))

# Combine traces into data list
data = [trace_us, trace_brazil, trace_uk, trace_russia, trace_india]

# Define layout
layout = go.Layout(
    title='Top 5 Countries: COVID-19 Death Cases Over Time',
    xaxis=dict(title='No. of days', tickfont=dict(size=20)),
    yaxis=dict(title='Death cases', tickfont=dict(size=20)),
    width=950, height=500
)

# Create figure
fig = go.Figure(data=data, layout=layout)

# Show plot
fig.show()

* Among these five countries, the United States has the highest number of deaths, followed by Brazil, the United Kingdom, India, and Russia.

**5.8 Top 5 Countries: COVID-19 Recovered Cases Over Time**

In [None]:
# Create traces for each country
trace_us = go.Scatter(x=US.index, y=US['recovered'], mode='lines', name='US', line=dict(color='pink'))
trace_brazil = go.Scatter(x=Brazil.index, y=Brazil['recovered'], mode='lines', name='Brazil', line=dict(color='blue'))
trace_uk = go.Scatter(x=UK.index, y=UK['recovered'], mode='lines', name='UK', line=dict(color='yellow'))
trace_russia = go.Scatter(x=Russia.index, y=Russia['recovered'], mode='lines', name='Russia', line=dict(color='green'))
trace_india = go.Scatter(x=India.index, y=India['recovered'], mode='lines', name='India', line=dict(color='red'))

# Combine traces into data list
data = [trace_us, trace_brazil, trace_uk, trace_russia, trace_india]

# Define layout
layout = go.Layout(
    title='Top 5 Countries: COVID-19 Recovered Cases Over Time',
    xaxis=dict(title='No. of days', tickfont=dict(size=20)),
    yaxis=dict(title='Recovered cases', tickfont=dict(size=20)),
    width=950, height=500
)

# Create figure
fig = go.Figure(data=data, layout=layout)

# Show plot
fig.show()

* Among these five countries, Brazil has the highest number of recovered cases, followed by the US, India, Russia, and the United Kingdom.

## Step 6: Time Series Modeling

>**FORECASTING USING FBPROPHET**
>
>* Previously we discussed ARIMA, SARIMA, and SARIMAX. To build those models, the data must first be made stationary.
>
>* `With FBPROPHET there is no need to make the data stationary before forecasting.`
>* FBPROPHET is an open-source library from Facebook designed for time-series forecasting.
>* It can handle daily observations, irregular intervals, and seasonality in the data.

**6.1 Installing FBPROPHET**

In [None]:
!pip install prophet

**6.2 Importing prophet class**

In [None]:
from prophet import Prophet

**6.3 Data Preprocessing for FBPROPHET**

In [None]:
df.head()

In [None]:
# Checking information about data
df.info()

In [None]:
# Convert the date column to pandas datetime
df["date"] = pd.to_datetime(df["date"])

In [None]:
df.head()

In [None]:
df.info()

**Note for FBProphet:**

`In FBProphet, the input DataFrame must contain two columns with these exact names:`
* ds (datestamp): the date or timestamp of each observation. This is the independent variable.
* y: the numeric quantity to forecast, such as the number of confirmed cases, deaths, or recoveries. This is the dependent variable.
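
The two-column layout described above can be produced with a group-by followed by a rename. A minimal sketch on made-up rows; `toy` and `prophet_input` are hypothetical names.

```python
import pandas as pd

# Made-up rows: two regions report on the second date.
toy = pd.DataFrame({
    "date": ["2020-07-26", "2020-07-27", "2020-07-27"],
    "confirmed": [5, 7, 3],
})
toy["date"] = pd.to_datetime(toy["date"])

# One row per date, then rename to the ds / y names that Prophet expects.
prophet_input = (
    toy.groupby("date")["confirmed"]
    .sum()
    .reset_index()
    .rename(columns={"date": "ds", "confirmed": "y"})
)
print(prophet_input)
```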

**6.4 Creating a separate two-column DataFrame for confirmed cases**

In [None]:
# Select the column before summing so text columns are not aggregated
confirmed = df.groupby("date")["confirmed"].sum().reset_index()
confirmed

**6.5 Creating a separate two-column DataFrame for death cases**

In [None]:
# Select the column before summing so text columns are not aggregated
deaths = df.groupby("date")["deaths"].sum().reset_index()
deaths

**6.6 Creating a separate two-column DataFrame for recovered cases**

In [None]:
# Select the column before summing so text columns are not aggregated
recovered = df.groupby("date")["recovered"].sum().reset_index()
recovered

**6.7 Creating a separate two-column DataFrame for active cases**

In [None]:
# Select the column before summing so text columns are not aggregated
active = df.groupby("date")["active"].sum().reset_index()
active

## Step 7: Prediction and Visualization

**7.1 Forecasting for the confirmed cases** 

In [None]:
confirmed.rename(columns={"date":"ds","confirmed":"y"},inplace=True)

In [None]:
confirmed

**7.1.1 Defining the Confidence Interval in Prophet's Forecasting**
* `interval_width` sets the width of the uncertainty interval around the forecast. With `interval_width=0.95`, the `yhat_lower` and `yhat_upper` bounds form a 95% uncertainty interval, which leaves 5% outside the band.

In [None]:
con_model=Prophet(interval_width=0.95)  
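
To make the meaning of a 95% interval concrete, the sketch below takes the 2.5th and 97.5th percentiles of a set of simulated values. This mimics only the percentile step and rests on the assumption that the bounds come from quantiles of simulated forecasts; the normal samples and every name here (`samples`, `lower_p`, `upper_p`, `coverage`) are made up for illustration.

```python
import numpy as np

interval_width = 0.95

# Made-up simulated forecast values for a single future date.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1000.0, scale=50.0, size=10_000)

# A central 95% interval leaves 2.5% in each tail.
lower_p = 100 * (1 - interval_width) / 2
upper_p = 100 * (1 + interval_width) / 2
yhat_lower, yhat_upper = np.percentile(samples, [lower_p, upper_p])

# Share of simulated values that fall inside the band.
coverage = np.mean((samples >= yhat_lower) & (samples <= yhat_upper))
print(round(float(coverage), 2))
```

A wider `interval_width` moves both percentiles outward, so the shaded band in the forecast plot gets wider.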

**7.1.2 Model Training: Fitting Data to the Model**

In [None]:
con_model.fit(confirmed)

**7.1.3 Forecasting confirmed cases for the next 7 days**
* `periods=7` means we forecast confirmed cases for the 7 days after the last date in the data.
* The training data above has 188 rows. After running this code, 7 future dates are appended, so the frame has 188 + 7 = 195 rows.

In [None]:
future=con_model.make_future_dataframe(periods=7)

In [None]:
future
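
The row count can be reproduced with plain pandas. This sketch assumes daily data from January 22, 2020 to July 27, 2020, as described earlier; `history`, `horizon`, `future_dates`, and `all_dates` are hypothetical names, and the code only imitates the dates that a 7-period future frame would contain.

```python
import pandas as pd

# Daily dates covered by the training data.
history = pd.date_range("2020-01-22", "2020-07-27", freq="D")

# Seven more daily dates starting the day after the last observation.
horizon = 7
future_dates = pd.date_range(
    history[-1] + pd.Timedelta(days=1), periods=horizon, freq="D"
)

# History followed by the future dates.
all_dates = history.append(future_dates)
print(len(history), len(all_dates), all_dates[-1].date())
```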

In [None]:
forecast=con_model.predict(future)
forecast[["ds","yhat","yhat_lower","yhat_upper"]].tail(7)

**7.1.4 Visualization of Confirmed Cases Historical + Forecast of 7 days**

In [None]:
confirmed_plot=con_model.plot(forecast)
plt.legend()
plt.title("Visualization of Confirmed Cases Historical + prediction + Forecast of 7 days")
plt.show()

* The black dots are the actual data.
* The blue line is the model's prediction (`yhat`).
* Where the blue line extends past the last black dot, it shows the forecast of confirmed cases for the next 7 days.
* The light-blue shaded band around the line is the 95% uncertainty interval set by `interval_width=0.95`.
* The prediction can range from the lower edge of the band (`yhat_lower`) up to the upper edge (`yhat_upper`).

In [None]:
confirmed_forecast_plot1 = con_model.plot_components(forecast)
plt.show()

* Trend: This represents the long-term trend in our data.
* Weekly: This is the weekly seasonality component, i.e. the estimated day-of-week effect that is added to the trend. It is not the 7-day forecast itself.
    * The effect is about 7000 on Sunday, drops sharply on Monday, and drops further on Tuesday.
    * After Tuesday, the effect increases again.
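
The components plot splits the prediction into parts that add up. As a rough sketch, assuming an additive model with only a trend and a weekly component, the prediction for a date is the trend value plus the effect for that weekday. All numbers and names below (`trend`, `weekly_effect`, `yhat`) are made up for illustration and are not output from the fitted model.

```python
from datetime import date

# Made-up weekday effects, keyed by date.weekday() (Monday=0 ... Sunday=6).
weekly_effect = {6: 7000.0, 0: -2000.0}

# Made-up trend values for a Sunday and the following Monday.
trend = {
    date(2020, 7, 26): 16_200_000.0,  # Sunday
    date(2020, 7, 27): 16_420_000.0,  # Monday
}

# Additive model: prediction = trend + weekday effect.
yhat = {d: trend[d] + weekly_effect[d.weekday()] for d in trend}

for d, value in yhat.items():
    print(d, value)
```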

**7.2 Forecasting for death cases:**

In [None]:
deaths

In [None]:
deaths.rename(columns={"date":"ds","deaths":"y"},inplace=True)

In [None]:
deaths

In [None]:
death_model=Prophet(interval_width=0.95)

In [None]:
# Model training
death_model.fit(deaths)

In [None]:
future=death_model.make_future_dataframe(periods=7)

In [None]:
forecast=death_model.predict(future)
forecast[["ds","yhat","yhat_lower","yhat_upper"]].tail(7)

In [None]:
death_plot=death_model.plot(forecast)
plt.legend()
plt.title("Visualization of Death Cases Historical + prediction + Forecast of 7 days")
plt.show()

In [None]:
death_forecast_plot1 = death_model.plot_components(forecast)

From the weekly component plot, the effect dips on Sunday, rises from Monday to Friday to around 800, and then falls slightly to around 750 on Saturday.

**7.3 Forecasting for recovered cases:** 

In [None]:
recovered

In [None]:
recovered.rename(columns={"date":"ds","recovered":"y"},inplace=True)

In [None]:
recovered

In [None]:
recovered_model=Prophet(interval_width=0.95)

In [None]:
# Train the model (fit the data to the model)
recovered_model.fit(recovered)

In [None]:
future=recovered_model.make_future_dataframe(periods=7)

In [None]:
forecast=recovered_model.predict(future)
forecast[["ds","yhat","yhat_lower","yhat_upper"]].tail(7)

In [None]:
recovered_plot=recovered_model.plot(forecast)
plt.legend()
plt.title("Visualization of Recovered Cases Historical + prediction + Forecast of 7 days")
plt.show()

In [None]:
recovered_forecast_plot1 = recovered_model.plot_components(forecast)

**7.4 Forecasting for active cases:**

In [None]:
active

In [None]:
active.rename(columns={"date":"ds","active":"y"},inplace=True)

In [None]:
active

In [None]:
active_model=Prophet(interval_width=0.95)

In [None]:
# Train the model (fit the data to the model)
active_model.fit(active)

In [None]:
future=active_model.make_future_dataframe(periods=7)

In [None]:
forecast=active_model.predict(future)
forecast[["ds","yhat","yhat_lower","yhat_upper"]].tail(7)

In [None]:
active_plot=active_model.plot(forecast)
plt.legend()
plt.title("Visualization of Active Cases Historical + prediction + Forecast of 7 days")
plt.show()

In [None]:
active_forecast_plot1 = active_model.plot_components(forecast)

## Step 8: Model Saving and Loading

### 8.1: Import necessary libraries

In [None]:
!pip install joblib
import joblib
from prophet import Prophet

### 8.2: Saving trained models

In [None]:
# Make sure the target folder exists before saving
import os
os.makedirs('model', exist_ok=True)

# Save the trained confirmed model
joblib.dump(con_model, 'model/prophet_model_confirmed.joblib')

# Save the trained death model
joblib.dump(death_model, 'model/prophet_model_death.joblib')

# Save the trained recovered model
joblib.dump(recovered_model, 'model/prophet_model_recovered.joblib')

# Save the trained active model
joblib.dump(active_model, 'model/prophet_model_active.joblib')

print("Models saved successfully!")

### 8.3: Loading Models 

In [None]:
# Import necessary libraries
import joblib
from prophet import Prophet

# Load the trained confirmed model
loaded_con_model = joblib.load('model/prophet_model_confirmed.joblib')

# Load the trained death model
loaded_death_model = joblib.load('model/prophet_model_death.joblib')

# Load the trained recovered model
loaded_recovered_model = joblib.load('model/prophet_model_recovered.joblib')

# Load the trained active model
loaded_active_model = joblib.load('model/prophet_model_active.joblib')

print("Model loaded successfully")
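
The save-and-load round trip can be tried without a trained model. The sketch below uses the standard-library `pickle` as a stand-in for `joblib` and a plain dictionary as a stand-in for a fitted model, so it only illustrates the pattern: create the folder, write the file, read it back. `toy_model` and the file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in the notebook this would be a fitted Prophet model.
toy_model = {"name": "confirmed", "interval_width": 0.95}

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    # Writing fails if the target folder is missing, so create it first.
    model_dir.mkdir(parents=True, exist_ok=True)

    path = model_dir / "toy_model.pkl"
    with path.open("wb") as fh:
        pickle.dump(toy_model, fh)

    with path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored == toy_model)
```

Pickle-based files should only be loaded from sources you trust, and they are generally tied to the library versions that wrote them.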

### 8.4: Making predictions

In [None]:
# Future predictions using the loaded confirmed model
future = loaded_con_model.make_future_dataframe(periods=7)
forecast = loaded_con_model.predict(future)
confirmed_plot = loaded_con_model.plot(forecast)
confirmed_forecast_plot1 = loaded_con_model.plot_components(forecast)

# Future predictions using the loaded death model
future = loaded_death_model.make_future_dataframe(periods=7)
forecast = loaded_death_model.predict(future)
death_plot = loaded_death_model.plot(forecast)
death_forecast_plot1 = loaded_death_model.plot_components(forecast)

# Future predictions using the loaded recovered model
future = loaded_recovered_model.make_future_dataframe(periods=7)
forecast = loaded_recovered_model.predict(future)
recovered_plot = loaded_recovered_model.plot(forecast)
recovered_forecast_plot1 = loaded_recovered_model.plot_components(forecast)

# Future predictions using the loaded active model
future = loaded_active_model.make_future_dataframe(periods=7)
forecast = loaded_active_model.predict(future)
active_plot = loaded_active_model.plot(forecast)
active_forecast_plot1 = loaded_active_model.plot_components(forecast)

print("Future predictions generated successfully!")

## Step 9: Conclusion

Throughout the analyzed period from January to April, the recorded COVID-19 cases showed a relatively stable trend without significant fluctuations. However, from April onwards, a notable shift occurred, marked by an exponential increase in confirmed cases, active infections, deaths, and recoveries.

By examining the visualizations of confirmed cases, active infections, recoveries, and deaths, coupled with a 7-day forecast and historical data, we discern a consistent pattern with gradual changes over time. Yet, focusing specifically on the data from the last 7 days reveals more nuanced insights, showcasing sudden spikes and declines in case numbers.

This analysis underscores the importance of closely monitoring recent trends for timely intervention and response. Understanding these fluctuations in the short term can aid in implementing targeted strategies to mitigate the spread of the virus and manage healthcare resources effectively.

Moving forward, continued vigilance and adaptive measures are imperative to navigate the dynamic landscape of the COVID-19 pandemic and safeguard public health.