## EDA on Covid-19 Dataset
Dataset downloaded from: *BigQuery -> Public Datasets -> covid19_jhu_csse -> summary*

**Anusha Cherupalla**

Perform EDA on your chosen dataset and produce a thorough EDA report for your Data Scientist peers that meets the following criteria:

- Report must perform basic EDA as per the video instructions and should include basic information and relevant statistics.
- Report should include all relevant steps, including how you loaded the file into a dataframe.
- Report must include screenshots of the functions you used and the outputs you generated to perform your analysis.
- Report must include histograms, boxplots, and scatter plots, as well as your interpretation of what these visuals indicate, for relevant fields. (You can include additional visualizations in the Appendix if desired.)
- Report must be no longer than 10 pages, not including Appendix materials for citations, additional contextual supporting information, etc. There are no page limits for the Appendix.

In [None]:
import pandas as pd

# parse the date column as datetime so min/max and plots are chronological
df = pd.read_csv('/Users/anushacherupalla/Downloads/Covid Cases Summary', parse_dates=['date'])

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

### Drop Columns?
Looking at the head and tail, the rows shown are all US data. I will list the countries and provinces in this dataset to see whether any columns are redundant and can be dropped.

In [None]:
# how many different countries are included in this data
df['country_region'].nunique()

In [None]:
# list countries and counts; create a df for this
c = df['country_region'].value_counts()
c_df = pd.DataFrame({'country_region': c.index, 'count': c.values})

In [None]:
print(c)

In [None]:
c_df

In [None]:
# how many different provinces/states are included in this data
df['province_state'].nunique()

In [None]:
# list provinces/states and counts; create a df for this
ps = df['province_state'].value_counts()
ps_df = pd.DataFrame({'province_state': ps.index, 'count': ps.values})

In [None]:
ps

In [None]:
ps_df
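
The value counts above show how rows are spread across countries and provinces. A complementary check for redundant columns is to look for columns that are entirely missing or hold a single distinct value. The sketch below is a minimal illustration on a made-up toy frame: `find_redundant_columns` is a hypothetical helper (not part of pandas), and the toy column names and values are assumptions, not taken from the dataset.

```python
import pandas as pd


def find_redundant_columns(frame: pd.DataFrame) -> list:
    """Return columns that are all-missing or hold one distinct value."""
    redundant = []
    for col in frame.columns:
        # nunique() ignores missing values by default, so an all-missing column gives 0
        if frame[col].nunique() <= 1:
            redundant.append(col)
    return redundant


# Toy data for illustration only (not from the real dataset)
toy = pd.DataFrame({
    'country_region': ['US', 'US', 'US'],
    'province_state': ['Ohio', 'Texas', 'Utah'],
    'fips': [None, None, None],
    'confirmed': [10, 25, 7],
})

print(find_redundant_columns(toy))
```

A column flagged this way is only a candidate for dropping; whether it is safe to drop depends on what the rest of the analysis needs.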

### Date Range

What is the range of dates that this dataset covers?

In [None]:
date = df['date']

min_date = date.min()
max_date = date.max()

date_range = (min_date, max_date)

print(f"The dataset covers dates from {min_date} to {max_date}.")
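
If the `date` column is read as text, `min()` and `max()` compare strings, which matches chronological order only for ISO-style `YYYY-MM-DD` values. A minimal sketch, using made-up toy dates, of converting to datetimes first and reporting the span in days:

```python
import pandas as pd

# Toy dates for illustration only (not from the real dataset)
toy_dates = pd.Series(['2020-03-01', '2020-01-22', '2020-02-15'])

# Convert text to datetime64 so min/max and arithmetic are chronological
parsed = pd.to_datetime(toy_dates)

first_day = parsed.min()
last_day = parsed.max()
span_days = (last_day - first_day).days

print(f"Dates run from {first_day.date()} to {last_day.date()} ({span_days} days).")
```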

### Confirmed Cases

In [None]:
confirmed = df['confirmed']

print(f"The dataset covers confirmed cases values from {confirmed.min()} to {confirmed.max()}.")

In [None]:
import matplotlib.pyplot as plt

confirmed = df['confirmed']

plt.hist(confirmed, bins=100, color='green', edgecolor='black')
plt.xlabel('Confirmed Cases')
plt.ylabel('Frequency')
plt.title('Histogram of Confirmed Cases')
plt.show()

In [None]:
import numpy as np

confirmed_drop = confirmed.dropna()
c_hist, c_bin_edges = np.histogram(confirmed_drop, bins=100)
c_bin_ranges = [f"{c_bin_edges[i]:.2f} - {c_bin_edges[i+1]:.2f}" for i in range(len(c_bin_edges) - 1)]
c_hist_df = pd.DataFrame({'Bin Range': c_bin_ranges, 'Counts': c_hist})

print(c_hist_df)
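
With 100 equal-width bins, a heavily right-skewed column places most rows in the first bin. If that turns out to be the case here (an assumption, not something verified in this report), log-spaced bin edges can make the table more informative. The sketch below uses made-up toy counts and a hypothetical choice of edges from 1 to 1,000,000.

```python
import numpy as np
import pandas as pd

# Toy counts for illustration only (not from the real dataset)
toy_counts = pd.Series([1, 3, 8, 20, 45, 120, 900, 15000, 250000])

# Logarithmic bins need strictly positive values
positive = toy_counts[toy_counts > 0]

# Six bins whose edges are evenly spaced in log10, from 10^0 to 10^6
log_edges = np.logspace(0, 6, num=7)
log_hist, _ = np.histogram(positive, bins=log_edges)

log_hist_df = pd.DataFrame({
    'Bin Range': [f"{lo:.0f} - {hi:.0f}" for lo, hi in zip(log_edges[:-1], log_edges[1:])],
    'Counts': log_hist,
})
print(log_hist_df)
```

Zero counts are excluded by the positivity filter, so they would need to be reported separately.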

In [None]:
# boxplot of confirmed cases
plt.figure(figsize=(8, 6))
plt.boxplot(confirmed_drop, vert=False)
plt.xlabel('Confirmed Cases')
plt.title('Box and Whisker Plot of Confirmed Cases')
plt.grid(True)
plt.show()
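
To put numbers behind the boxplot, the quartiles and outlier fences can be computed directly. Matplotlib's default whiskers (`whis=1.5`) extend to the furthest data points within 1.5 times the interquartile range of the box, and anything beyond is drawn as a flier. A minimal sketch on made-up toy counts:

```python
import pandas as pd

# Toy counts for illustration only (not from the real dataset)
toy_counts = pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 10, 400])

q1 = toy_counts.quantile(0.25)
q3 = toy_counts.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are what the boxplot draws as fliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = toy_counts[(toy_counts < lower_fence) | (toy_counts > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"{len(outliers)} of {len(toy_counts)} values fall outside the fences")
```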

In [None]:
# scatterplot of confirmed cases over time
plt.figure(figsize=(12, 6))
plt.scatter(date, confirmed, color='green', alpha=0.5)
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.title('Scatterplot of Confirmed Cases over Time')
plt.grid(True)

plt.show()

### Deaths

In [None]:
deaths = df['deaths']

print(f"The dataset covers deaths values from {deaths.min()} to {deaths.max()}.")

In [None]:
deaths = df['deaths']

plt.hist(deaths, bins=100, color='firebrick', edgecolor='brown')
plt.xlabel('Deaths')
plt.ylabel('Frequency')
plt.title('Histogram of Deaths')
plt.show()

In [None]:
# boxplot of deaths
deaths_drop = deaths.dropna()

plt.figure(figsize=(8, 6))
plt.boxplot(deaths_drop, vert=False)
plt.xlabel('Deaths')
plt.title('Box and Whisker Plot of Deaths')
plt.grid(True)
plt.show()

In [None]:
# scatterplot of deaths over time

plt.figure(figsize=(12, 6))
plt.scatter(date, deaths, color='firebrick', alpha=0.5)
plt.xlabel('Date')
plt.ylabel('Deaths')
plt.title('Scatterplot of Deaths over Time')
plt.grid(True)

plt.show()

### Recovered

In [None]:
recovered = df['recovered']

print(f"The dataset covers recovered cases values from {recovered.min()} to {recovered.max()}.")

In [None]:
recovered = df['recovered']

plt.hist(recovered, bins=100, color='rebeccapurple', edgecolor='black')
plt.xlabel('Recovered Cases')
plt.ylabel('Frequency')
plt.title('Histogram of Recovered Cases')
plt.show()

In [None]:
# boxplot of recovered cases
recovered_drop = recovered.dropna()

plt.figure(figsize=(8, 6))
plt.boxplot(recovered_drop, vert=False)
plt.xlabel('Recovered Cases')
plt.title('Box and Whisker Plot of Recovered Cases')
plt.grid(True)
plt.show()

In [None]:
# scatterplot of recovered cases over time

plt.figure(figsize=(12, 6))
plt.scatter(date, recovered, color='rebeccapurple', alpha=0.5)
plt.xlabel('Date')
plt.ylabel('Recovered Cases')
plt.title('Scatterplot of Recovered Cases over Time')
plt.grid(True)

plt.show()

### Active

In [None]:
active = df['active']

print(f"The dataset covers active cases values from {active.min()} to {active.max()}.")

In [None]:
active = df['active']

plt.hist(active, bins=100, color='gold', edgecolor='black')
plt.xlabel('Active Cases')
plt.ylabel('Frequency')
plt.title('Histogram of Active Cases')
plt.show()

In [None]:
# boxplot of active cases
active_drop = active.dropna()

plt.figure(figsize=(8, 6))
plt.boxplot(active_drop, vert=False)
plt.xlabel('Active Cases')
plt.title('Box and Whisker Plot of Active Cases')
plt.grid(True)
plt.show()

In [None]:
# scatterplot of active cases over time

plt.figure(figsize=(12, 6))
plt.scatter(date, active, color='gold', alpha=0.5)
plt.xlabel('Date')
plt.ylabel('Active Cases')
plt.title('Scatterplot of Active Cases over Time')
plt.grid(True)
 
plt.show()

### Time Series Line Plot for Confirmed, Deaths, Active and Recovered in the US

In [None]:
# filter data for US
location = 'US'
location_data = df[df['country_region'] == location]

# create a time series line plot (colors match the sections above)
plt.figure(figsize=(12, 6))
plt.plot(location_data['date'], location_data['confirmed'], label='Confirmed', color='green')
plt.plot(location_data['date'], location_data['deaths'], label='Deaths', color='firebrick')
plt.plot(location_data['date'], location_data['active'], label='Active', color='gold')
plt.plot(location_data['date'], location_data['recovered'], label='Recovered', color='rebeccapurple')

plt.xlabel('Date')
plt.ylabel('Counts')
plt.title(f'Time Series for {location}')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)

plt.show()
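
Because the US rows are broken out by province/state, the same date can appear on many rows, and a line plot over the raw rows connects every one of them. One option, assuming each row holds a separate location's count for that date, is to sum by date before plotting. A minimal sketch on made-up toy rows:

```python
import pandas as pd

# Toy rows for illustration only (not from the real dataset)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'province_state': ['Ohio', 'Texas', 'Ohio', 'Texas'],
    'confirmed': [5, 10, 8, 14],
    'deaths': [0, 1, 0, 2],
})

# One row per date: sum the counts over all provinces/states
daily = (
    toy.groupby('date')[['confirmed', 'deaths']]
    .sum()
    .sort_index()
)
print(daily)

# daily.index and daily['confirmed'] can then be passed to plt.plot
```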

In [None]:
df_drop = df.dropna()

df_drop.describe()
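
Note that `df.dropna()` removes every row with a missing value in any column, so one sparsely populated column can shrink the sample used for all the others. `describe()` already skips missing values column by column. The sketch below, on a made-up toy frame, compares the two approaches and counts missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not from the real dataset)
toy = pd.DataFrame({
    'confirmed': [10.0, 20.0, 30.0, 40.0],
    'recovered': [np.nan, np.nan, 5.0, 15.0],
})

# Missing values per column
print(toy.isna().sum())

# describe() ignores NaN within each column, so 'confirmed' keeps all 4 rows
per_column = toy.describe()

# dropna() first keeps only fully populated rows, so 'confirmed' drops to 2 rows
complete_rows = toy.dropna().describe()

print(per_column.loc[['count', 'mean']])
print(complete_rows.loc[['count', 'mean']])
```

Which version to report depends on whether the statistics should describe each column on its own or only the rows where every field is present.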