# **EDA_COVID-19_Global_Dataset**
# **By Amit Kharche**
**Follow me** on [LinkedIn](https://www.linkedin.com/in/amit-kharche) and [Medium](https://medium.com/@amitkharche14) for more insights on **Data Science** and **Artificial Intelligence (AI)**.

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)<br>
  - **3.2** [**Upgrading Libraries**](#Section32)<br>
  - **3.3** [**Importing Libraries**](#Section33)<br>

**4.** [**Data Acquisition & Description**](#Section4)<br>
**5.** [**Data Pre-Profiling**](#Section5)<br>
**6.** [**Data Cleaning**](#Section6)<br>
**7.** [**Data Post-Profiling**](#Section7)<br>
**8.** [**Exploratory Data Analysis**](#Section8)<br>
  - [**8.1. What is the global trend of confirmed cases over time?**](#section81)<br>
  - [**8.2. Which countries have the highest total confirmed cases?**](#section82)<br>
  - [**8.3. What is the distribution of deaths across WHO regions?**](#section83)<br>
  - [**8.4. How do recoveries compare across countries?**](#section84)<br>
  - [**8.5. What is the trend of active cases globally?**](#section85)<br>
  - [**8.6. Is there a correlation between confirmed cases and deaths?**](#section86)<br>
  - [**8.7. How do confirmed cases vary across WHO regions?**](#section87)<br>
  - [**8.8. What is the distribution of confirmed cases across countries?**](#section88)<br>
  - [**8.9. Which countries have the highest number of deaths?**](#section89)<br>
  - [**8.10. What is the geographical distribution of confirmed cases?**](#section810)<br>
  - [**8.11. What is the trend of deaths over time globally?**](#section811)<br>
  - [**8.12. How do active cases compare across countries?**](#section812)<br>

**9.** [**Summarization**](#Section9)<br>
  - **9.1** [**Conclusion**](#Section9.1)<br>
  - **9.2** [**Actionable Insights**](#Section9.2)<br>

---
<a name = Section1></a>
# **1. Introduction**
---
The **COVID-19 pandemic**, caused by the novel coronavirus **SARS-CoV-2**, emerged in **late 2019** and quickly escalated into a **global health crisis**. Worldwide, **governments**, **healthcare systems**, and **researchers** have relied on **data** to monitor the **spread**, **impact**, and **control** of the virus.

This dataset offers a **comprehensive**, **time-series** view of COVID-19 cases across **countries** and **regions**, capturing essential metrics such as:

- **Confirmed cases**
- **Deaths**
- **Recoveries**
- **Active cases**
- **Geographic coordinates**
- **WHO regional classifications**

The objective of this notebook is to perform **Exploratory Data Analysis (EDA)** to uncover **patterns**, **trends**, and **insights** that reveal the **dynamics** of the pandemic over time and across different **geographical** and **organizational** boundaries. These insights can support **public health strategies**, **policy-making**, and **future research**.

---

---
<a name = Section2></a>
# **2. Problem Statement**
The goal of this **Exploratory Data Analysis (EDA)** is to derive actionable insights from the **COVID-19 dataset** by examining its **structure**, **quality**, and **temporal dynamics**. This analysis will:

- Assess the **completeness** and **consistency** of the data.
- Explore the **temporal progression** of the pandemic at **global** and **regional** levels.
- Identify **countries** or **regions** with the highest and lowest **confirmed cases**, **death rates**, and **recovery rates**.
- Analyze **trends in active cases** over time and across **WHO regions**.
- Create **visualizations** to illustrate the **geographical spread** and **intensity** of the outbreak.
- Detect **anomalies**, **reporting inconsistencies**, and **data gaps** that may affect interpretation.

By the end of this EDA, we aim to uncover **meaningful patterns** that can support **public health responses**, guide **policy decisions**, and inspire **further research** into the pandemic’s impact and trajectory.
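
The death and recovery rates named in the goals above can be expressed as percentages of confirmed cases. The following is a minimal sketch on a made-up three-row frame that reuses the dataset's column names; the helper name `add_rates` and the output column labels are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy snapshot with the dataset's column names; the numbers are invented.
snapshot = pd.DataFrame({
    "Country/Region": ["A", "B", "C"],
    "Confirmed": [1000, 200, 0],
    "Deaths": [50, 2, 0],
    "Recovered": [800, 100, 0],
})

def add_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with death and recovery rates as percentages of confirmed cases."""
    out = df.copy()
    # Mask zero confirmed counts so the division yields NaN instead of inf.
    confirmed = out["Confirmed"].where(out["Confirmed"] > 0)
    out["Death Rate (%)"] = 100 * out["Deaths"] / confirmed
    out["Recovery Rate (%)"] = 100 * out["Recovered"] / confirmed
    return out

rates = add_rates(snapshot)
print(rates)
```

Masking zero counts keeps locations without confirmed cases from producing infinite or misleading rates when countries are ranked.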


---

---
<a name = Section3></a>
# **3. Installing and Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
!pip install -q ydata-profiling                                     # Library to generate basic statistics about data (successor to pandas-profiling)

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, **restart the runtime** so that the upgraded versions are loaded.

- Do not run the cells under Installing Libraries and Upgrading Libraries again after restarting the runtime.

In [None]:
!pip install -q --upgrade ydata-profiling                           # Library to generate basic statistics about data

<a name = Section33></a>
### **3.3 Importing Libraries**

- The cell below imports the basic libraries used in this notebook.

- Feel free to import additional libraries if you need them.


In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from ydata_profiling import ProfileReport                        # To perform data profiling
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # For numerical python operations
#-------------------------------------------------------------------------------------------------------------------------------
import plotly.graph_objs as go                                      # For interactive graphs
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings
import matplotlib.pyplot as plt                                     # Importing pyplot interface of matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

This section focuses on acquiring the dataset and understanding its **key features**.

The data captures the **global progression of COVID-19**, providing detailed, time-series information on the pandemic's spread. It contains **49,068 records** representing daily counts of **confirmed cases**, **deaths**, **recoveries**, and **active cases** across various countries and regions.

The dataset includes a mix of **geographic information**, **temporal attributes**, and **pandemic metrics**.

Below is the detailed description of each feature in the dataset:

| Id  | Feature              | Description                                                                 |
|-----|----------------------|-----------------------------------------------------------------------------|
| 01  | Province/State       | Province or state of the reported case                                     |
| 02  | Country/Region       | Country or region of the reported case                                     |
| 03  | Lat                  | Latitude of the location                                                    |
| 04  | Long                 | Longitude of the location                                                   |
| 05  | Date                 | Date of the reported case                                                  |
| 06  | Confirmed            | Number of confirmed cases                                                  |
| 07  | Deaths               | Number of deaths                                                           |
| 08  | Recovered            | Number of recoveries                                                       |
| 09  | Active               | Number of active cases                                                     |
| 10  | WHO Region           | WHO region classification                                                  |

> **Note**: Missing values appear in the `Province/State` column.
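
One consistency check that the feature table suggests is whether `Active` equals `Confirmed - Deaths - Recovered`. That relationship is an assumption about how the column was derived and is not stated in the dataset description, so treat the sketch below as a diagnostic only. The helper name `check_active_consistency` and the toy values are hypothetical.

```python
import pandas as pd

def check_active_consistency(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where Active differs from Confirmed - Deaths - Recovered.

    Assumes Active is meant to be derived this way; verify against the data source.
    """
    expected = df["Confirmed"] - df["Deaths"] - df["Recovered"]
    return df[df["Active"] != expected]

# Invented values; the last row is deliberately inconsistent (10 - 0 - 4 is 6, not 7).
toy = pd.DataFrame({
    "Confirmed": [100, 50, 10],
    "Deaths": [5, 1, 0],
    "Recovered": [60, 20, 4],
    "Active": [35, 29, 7],
})

mismatches = check_active_consistency(toy)
print(mismatches)
```

Once the data is loaded, `check_active_consistency(data)` would list any rows worth inspecting for reporting inconsistencies.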

In [None]:
data = pd.read_csv(filepath_or_buffer = 'https://raw.githubusercontent.com/amitkharche/exploratory_data_analysis_projects_amit_kharche/refs/heads/main/EDA_Covid19_India_Data_amit_kharche/covid_19_data.csv')
print('Data Shape:', data.shape)
data.head()

<a id=section301></a>
### Data Description

* The dataset contains COVID-19 case counts reported by date and country/region and, where available, by province/state.
* The dataset has __49,068 observations__ and __10 columns__. The name and description of each column are given in the feature table above.

In [None]:
data.describe().T

### **Data Information**

In [None]:
data.info()

**Observation**

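  - The dataset has **49,068 samples (rows)** and **10 columns**.
 
  - Going by the feature table, **6 columns** hold **numeric** values (`Lat`, `Long`, `Confirmed`, `Deaths`, `Recovered`, `Active`) and **4 columns** are read as **object** (`Province/State`, `Country/Region`, `Date`, `WHO Region`).
  
  - There are **missing** values in the `Province/State` column.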

---
<a name = Section5></a>
# **5. Data Pre-Profiling**
---

- Here, we profile the dataset **before preprocessing** with **ydata-profiling** (formerly Pandas Profiling) and save the **output file** as __Pre_Profiling_Report.html__. 


- The file is stored in the directory of your notebook. Open it from the Jupyter file explorer and review it to see what insights you can develop from it. 


- Alternatively, you can **render the profiling report** inside the **current Jupyter notebook**, for example with `profile.to_notebook_iframe()`. 

In [None]:
from ydata_profiling import ProfileReport   # pandas_profiling was renamed to ydata_profiling

# Generate the profile report (this can take a while on larger datasets)
profile = ProfileReport(data, title="Profile Report", explorative=True)

# Save the report as HTML
profile.to_file("Pre_Profiling_Report.html")

print("Profiling report saved successfully!")

---
<a name = Section6></a>
# **6. Data Cleaning**
---

<a id=section401></a>
### 6.1 Data Preprocessing

In [None]:
data.isnull().sum()

In [None]:
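# Fill missing values in Province/State with 'Not Available'
# (assign the result back; inplace fillna on a selected column is deprecated in recent pandas)
data['Province/State'] = data['Province/State'].fillna('Not Available')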

In [None]:
# Remove duplicate rows
data.drop_duplicates(inplace=True)

In [None]:
# Convert 'Date' to datetime format
data['Date'] = pd.to_datetime(data['Date'])

In [None]:
# Convert 'WHO Region' to categorical type
data['WHO Region'] = data['WHO Region'].astype('category')

In [None]:
# Check for any remaining missing values
missing_values = data.isnull().sum()
print("Missing values after preprocessing:")
print(missing_values)
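
The cleaning steps above can also be collected into a single function so they are applied in the same order every time the data is reloaded. This is a minimal sketch on an invented three-row frame; the function name `clean_covid_frame` is hypothetical, and the steps simply mirror the cells above.

```python
import pandas as pd

def clean_covid_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's cleaning steps and return a new frame."""
    out = df.copy()
    # Fill missing province/state names with a placeholder label.
    out["Province/State"] = out["Province/State"].fillna("Not Available")
    # Drop exact duplicate rows.
    out = out.drop_duplicates()
    # Parse dates and store WHO regions as a categorical column.
    out["Date"] = pd.to_datetime(out["Date"])
    out["WHO Region"] = out["WHO Region"].astype("category")
    return out.reset_index(drop=True)

# Invented rows: the first two are duplicates of each other.
raw = pd.DataFrame({
    "Province/State": [None, None, "X"],
    "Country/Region": ["A", "A", "B"],
    "Date": ["2020-01-22", "2020-01-22", "2020-01-23"],
    "Confirmed": [1, 1, 2],
    "WHO Region": ["Europe", "Europe", "Africa"],
})

cleaned = clean_covid_frame(raw)
print(cleaned.dtypes)
```

Returning a copy leaves the raw frame untouched, which makes it easy to compare the data before and after cleaning.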

<a id=section7></a>
---
<a name = Section7></a>
# **7. Data Post-Profiling**
---

#### Profiling after Data Preprocessing

- Here, we profile the dataset **after preprocessing** with **ydata-profiling** (formerly Pandas Profiling) and save the **output file** as __Post_Profiling_Report.html__.

In [None]:
from ydata_profiling import ProfileReport   # pandas_profiling was renamed to ydata_profiling

# Generate the profile report (this can take a while on larger datasets)
profile = ProfileReport(data, title="Profile Report", explorative=True)

# Save the report as HTML
profile.to_file("Post_Profiling_Report.html")

print("Profiling report saved successfully!")

---
<a name = Section8></a>
# **8. Exploratory Data Analysis**
---

<a id=section81></a>
**8.1. What is the trend of confirmed cases over time globally?**

In [None]:
data.groupby('Date')['Confirmed'].sum().plot(kind='line', figsize=(10, 5), title='Global Confirmed Cases Over Time')
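
If `Confirmed` holds cumulative totals (an assumption worth verifying, since the dataset description does not say so explicitly), the global line above shows the running total, and the day-over-day difference gives the number of new cases. A minimal sketch on invented data:

```python
import pandas as pd

# Invented cumulative counts for two countries over three days.
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "A", "B", "A", "B"],
    "Date": pd.to_datetime([
        "2020-01-22", "2020-01-22",
        "2020-01-23", "2020-01-23",
        "2020-01-24", "2020-01-24",
    ]),
    "Confirmed": [1, 0, 3, 2, 6, 2],
})

# Global running total per date, as in the plot above.
global_total = toy.groupby("Date")["Confirmed"].sum()

# Day-over-day change; the first day has no predecessor, so keep its total.
daily_new = global_total.diff().fillna(global_total.iloc[0]).astype(int)
print(daily_new)
```

Plotting `daily_new` alongside the cumulative series makes changes in the pace of spread easier to see than the running total alone.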

### Observation:


<a id=section82></a>
**8.2. Which countries have the highest number of confirmed cases?**

In [None]:
data.groupby('Country/Region')['Confirmed'].sum().nlargest(10).plot(kind='bar', figsize=(10, 5), title='Top 10 Countries by Confirmed Cases')
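
Summing `Confirmed` over every date adds the same cases repeatedly if the column is cumulative. Under that assumption (not confirmed by the dataset description), ranking countries on the most recent date is a safer alternative. The helper name `top_countries_latest` and the toy values below are hypothetical.

```python
import pandas as pd

def top_countries_latest(df: pd.DataFrame, column: str = "Confirmed", n: int = 10) -> pd.Series:
    """Rank countries by `column` on the most recent date, summing over provinces."""
    latest = df[df["Date"] == df["Date"].max()]
    return latest.groupby("Country/Region")[column].sum().nlargest(n)

# Invented cumulative counts: country A reports two provinces, country B one.
toy = pd.DataFrame({
    "Province/State": ["P1", "P2", "Not Available", "P1", "P2", "Not Available"],
    "Country/Region": ["A", "A", "B", "A", "A", "B"],
    "Date": pd.to_datetime(["2020-03-01"] * 3 + ["2020-03-02"] * 3),
    "Confirmed": [10, 5, 25, 20, 7, 26],
})

top = top_countries_latest(toy, n=1)
naive_leader = toy.groupby("Country/Region")["Confirmed"].sum().idxmax()
print(top)
print("Leader when summing over all dates:", naive_leader)
```

In this toy example the two approaches disagree on the leading country, which shows why the choice of aggregation matters for rankings.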

### Observation:



<a id=section83></a>
**8.3. What is the distribution of deaths across WHO regions?**

In [None]:
import seaborn as sns
sns.boxplot(x='WHO Region', y='Deaths', data=data)


### Observation:



<a id=section84></a>
**8.4. How do recoveries compare across different countries?**

In [None]:
#data.groupby('Country/Region')['Recovered'].sum().nlargest(10).plot(kind='barh', color='green', figsize=(10, 5), title='Top 10 Countries by Recoveries')


### Observation:



<a id=section85></a>
**8.5. What is the trend of active cases over time globally?**

In [None]:
#data.groupby('Date')['Active'].sum().plot(kind='area', alpha=0.4, figsize=(10, 5), title='Global Active Cases Over Time')

### Observation:




<a id=section86></a>
**8.6. What is the correlation between confirmed cases and deaths?**

In [None]:
#sns.scatterplot(x='Confirmed', y='Deaths', data=data)
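
A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on its linear strength. The sketch below uses pandas' `Series.corr`, which computes the Pearson coefficient by default, on invented values that are perfectly proportional.

```python
import pandas as pd

# Invented values where deaths are exactly 10% of confirmed cases.
toy = pd.DataFrame({
    "Confirmed": [10, 20, 30, 40],
    "Deaths": [1, 2, 3, 4],
})

# Pearson correlation between the two columns (the default method).
pearson_r = toy["Confirmed"].corr(toy["Deaths"])
print(f"Pearson r: {pearson_r:.3f}")
```

On the real data, `data['Confirmed'].corr(data['Deaths'])` would give the corresponding figure. Because rows from the same location on consecutive dates are not independent, the value is best read as descriptive.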

### Observation:




<a id=section87></a>
**8.7. How do confirmed cases vary across WHO regions?**

In [None]:
#sns.violinplot(x='WHO Region', y='Confirmed', data=data)

### Observation:




<a id=section88></a>
**8.8. What is the distribution of confirmed cases across countries?**

In [None]:
#data.groupby('Country/Region')['Confirmed'].sum().plot(kind='hist', bins=30, figsize=(10, 5), title='Distribution of Confirmed Cases')


### Observation:




<a id=section89></a>
**8.9. Which countries have the highest number of deaths?**

In [None]:
#data.groupby('Country/Region')['Deaths'].sum().nlargest(10).plot(kind='bar', color='red', figsize=(10, 5), title='Top 10 Countries by Deaths')


### Observation:



<a id=section810></a>
**8.10. What is the geographical distribution of confirmed cases?**

In [None]:
!pip install folium

In [None]:
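import folium

# Plot each location once, using the most recent date only,
# instead of adding one marker per location per day
latest = data[data['Date'] == data['Date'].max()]

covid_map = folium.Map(location=[20, 0], zoom_start=2)   # Named covid_map to avoid shadowing the built-in map()
for _, row in latest.iterrows():
    folium.CircleMarker(
        location=[row['Lat'], row['Long']],
        radius=max(row['Confirmed'] / 100000, 1),
        color='blue',
        fill=True,
        fill_color='blue'
    ).add_to(covid_map)
covid_map.save('covid_map.html')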


In [None]:
import folium

# Reuse the cleaned `data` from Section 6 rather than reloading the CSV,
# so the datetime conversion and filled values are kept
latest = data[data['Date'] == data['Date'].max()]   # Most recent date: one row per location

# Create a base map
covid_map = folium.Map(location=[20, 0], zoom_start=2)

# Add circle markers with a popup showing the case count
for _, row in latest.iterrows():
    if row['Confirmed'] > 0:
        folium.CircleMarker(
            location=[row['Lat'], row['Long']],
            radius=max(row['Confirmed'] / 100000, 1),
            color='blue',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6,
            popup=f"{row['Country/Region']}: {row['Confirmed']} cases"
        ).add_to(covid_map)

# Display the map inline
covid_map

### Observation:




In [None]:
import folium

# Reuse the cleaned `data` from Section 6 rather than reloading the CSV
latest = data[data['Date'] == data['Date'].max()]   # Most recent date: one row per location

# Create the map
covid_map = folium.Map(location=[20, 0], zoom_start=2)

# Add markers with tooltips
for _, row in latest.iterrows():
    if row['Confirmed'] > 0:
        folium.CircleMarker(
            location=[row['Lat'], row['Long']],
            radius=max(row['Confirmed'] / 100000, 1),
            color='blue',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6,
            tooltip=f"{row['Country/Region']}, {row['Province/State']}: {row['Confirmed']} cases"
        ).add_to(covid_map)

# Display the map
covid_map

<a id=section811></a>
**8.11. What is the trend of deaths over time globally?**

In [None]:
#data.groupby('Date')['Deaths'].sum().plot(kind='line', color='red', figsize=(10, 5), title='Global Deaths Over Time')

### Observation:



<a id=section812></a>
**8.12. How do active cases compare across countries?**

In [None]:
#data.groupby('Country/Region')['Active'].sum().nlargest(10).plot(kind='bar', color='orange', figsize=(10, 5), title='Top 10 Countries by Active Cases')

### Observation:



---
<a name = Section9></a>
# **9. Summarization**
---

<a name = Section9.1></a>
### **9.1 Conclusion**

* The dataset gives a **time-series** view of COVID-19 with **49,068 records** and **10 features**, covering **confirmed cases**, **deaths**, **recoveries**, **active cases**, **geographic coordinates**, and **WHO regions**.
* **Missing values** appear in the `Province/State` column and were filled with `'Not Available'`. **Duplicate rows** were removed, `Date` was converted to **datetime**, and `WHO Region` was converted to a **categorical** type.
* The analysis covers **global trends over time** for confirmed cases, active cases, and deaths, **country-level comparisons** of confirmed cases, deaths, recoveries, and active cases, and **distributions across WHO regions**.
* The relationship between **confirmed cases** and **deaths** and the **geographical distribution** of confirmed cases are examined with a scatter plot and an interactive map.


<a name = Section9.2></a>
### **9.2 Actionable Insights**


