##Covid-19 South Korea Analysis
<div style="text-align: left; line-height: 0; padding-top: 3px;">
  <img src="https://miro.medium.com/max/3840/1*Mf9K7Nj-wMlZHqt4cSOWNA.jpeg"></div>

###Analyzing South Korea COVID-19 Pandemic Data
In this notebook we will try to understand the patterns underlying the coronavirus pandemic in South Korea. We will use the freely available outbreak data on Kaggle to answer questions like:

- Regions of South Korea affected by the virus
- Causes of COVID-19 infection
- Timeline trend of the Accumulated Cases
- Recovered Patients in each region
- Fatalities by Region
- % of Recovered Patient cases by type of Infection
- % of Fatality cases by type of Infection

###The Data
The data is taken from a public Kaggle data source.

In [4]:
%sql
DROP TABLE IF EXISTS patient_csv;

CREATE TABLE patient_csv
USING csv
OPTIONS (path "/databricks-datasets/COVID/SouthKorea/patient.csv", header "true")


In [5]:
%sql
-- Let's look at the patient data to familiarize ourselves with the schema and the data
select * from patient_csv limit 10

patient_id,sex,birth_year,country,region,disease,group,infection_reason,infection_order,infected_by,contact_number,confirmed_date,released_date,deceased_date,state
1,female,1984,China,filtered at airport,,,visit to Wuhan,1,,45,2020-01-20,2020-02-06,,released
2,male,1964,Korea,filtered at airport,,,visit to Wuhan,1,,75,2020-01-24,2020-02-05,,released
3,male,1966,Korea,capital area,,,visit to Wuhan,1,,16,2020-01-26,2020-02-12,,released
4,male,1964,Korea,capital area,,,visit to Wuhan,1,,95,2020-01-27,2020-02-09,,released
5,male,1987,Korea,capital area,,,visit to Wuhan,1,,31,2020-01-30,2020-03-02,,released
6,male,1964,Korea,capital area,,,contact with patient,2,3.0,17,2020-01-30,2020-02-19,,released
7,male,1991,Korea,capital area,,,visit to Wuhan,1,,9,2020-01-30,2020-02-15,,released
8,female,1957,Korea,Jeollabuk-do,,,visit to Wuhan,1,,113,2020-01-31,2020-02-12,,released
9,female,1992,Korea,capital area,,,contact with patient,2,5.0,2,2020-01-31,2020-02-24,,released
10,female,1966,Korea,capital area,,,contact with patient,3,6.0,43,2020-01-31,2020-02-19,,released


###No. of patients in each region
This tells us which regions recorded the most patients.

In [7]:
%sql
select region, count(patient_id) from patient_csv where region is not null group by region order by count(patient_id) desc

region,count(patient_id)
capital area,191
Gyeongsangbuk-do,140
Daegu,57
Daejeon,13
Gwangju,11
Gangwon-do,5
Jeju-do,4
filtered at airport,4
Jeollabuk-do,3
Jeollanam-do,3
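
To make the ranking easier to compare, the counts can be turned into percentage shares. The snippet below is a minimal sketch in plain Python, not part of the original notebook: the counts are copied from the output above, and the helper name `region_shares` is a hypothetical one introduced for illustration. The shares are relative to the regions displayed here only.

```python
# Counts copied from the query output above (only the regions displayed there).
region_counts = {
    "capital area": 191,
    "Gyeongsangbuk-do": 140,
    "Daegu": 57,
    "Daejeon": 13,
    "Gwangju": 11,
    "Gangwon-do": 5,
    "Jeju-do": 4,
    "filtered at airport": 4,
    "Jeollabuk-do": 3,
    "Jeollanam-do": 3,
}


def region_shares(counts):
    """Return each region's percentage share of the total, rounded to 1 decimal."""
    total = sum(counts.values())
    return {region: round(100 * n / total, 1) for region, n in counts.items()}


shares = region_shares(region_counts)
for region, pct in shares.items():
    print(f"{region}: {pct}%")
```

With these counts, "capital area" and "Gyeongsangbuk-do" together account for roughly three quarters of the displayed patients.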


###Infection reasons for the patients
Patients were infected for many reasons. In the output below, the January cases come from visits to Wuhan and from contact with an existing patient. In February, contact with a patient and visits to Daegu dominate, alongside a smaller number of cases linked to travel to other countries.
The pie charts below show this trend, with one chart per month.
This insight also highlights how the cases spread globally.

In [9]:
%sql
SELECT infection_reason, month(cast(confirmed_date as date)) AS month, count(patient_id) FROM patient_csv where infection_reason is not null
group by infection_reason, month order by month


infection_reason,month,count(patient_id)
visit to Wuhan,1,7
contact with patient,1,4
contact with patient in Japan,2,1
visit to China,2,2
pilgrimage to Israel,2,6
residence in Wuhan,2,2
visit to Vietnam,2,1
visit to Wuhan,2,1
visit to Daegu,2,49
contact with patient,2,67


###Timeline trend of the Accumulated Cases
The graph below shows how the number of patients increased from January 2020, with the highest spikes (~1062 patients) occurring between the end of February and the beginning of March 2020.

In [11]:
%sql
Use covid;
select confirmed_date, count(patient_id) from covid.patient_csv where confirmed_date is not null group by confirmed_date order by confirmed_date;

confirmed_date,count(patient_id)
2020-01-20,1
2020-01-24,1
2020-01-26,1
2020-01-27,1
2020-01-30,3
2020-01-31,4
2020-02-01,1
2020-02-02,3
2020-02-04,1
2020-02-05,5
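
The query above returns the number of patients confirmed on each date; an accumulated series is the running sum of those daily counts. As a rough sketch (plain Python, using only the ten rows displayed above; `running_total` is a hypothetical helper name), the accumulation could look like this:

```python
from itertools import accumulate

# (confirmed_date, daily count) rows copied from the output above.
daily = [
    ("2020-01-20", 1),
    ("2020-01-24", 1),
    ("2020-01-26", 1),
    ("2020-01-27", 1),
    ("2020-01-30", 3),
    ("2020-01-31", 4),
    ("2020-02-01", 1),
    ("2020-02-02", 3),
    ("2020-02-04", 1),
    ("2020-02-05", 5),
]


def running_total(daily):
    """Pair each date with the cumulative number of patients up to that date.

    Assumes the rows are already sorted by date, as the query's ORDER BY ensures.
    """
    dates = [d for d, _ in daily]
    totals = accumulate(n for _, n in daily)
    return list(zip(dates, totals))


for day, total in running_total(daily):
    print(day, total)
```

For the rows shown, the accumulated count reaches 21 by 2020-02-05.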


###Recovered Patients in each region
Looks like "capital area" has the highest number of recovered patients.

In [13]:
%sql
select region, datediff(released_date, confirmed_date) as recovery_time, count(patient_id) from patient_csv  where released_date is not null group by region, recovery_time order by count(patient_id) desc

region,recovery_time,count(patient_id)
capital area,7,5
capital area,5,4
capital area,16,3
capital area,10,3
capital area,4,3
Daegu,8,2
Gyeongsangbuk-do,8,2
capital area,17,2
capital area,24,2
capital area,19,2


###Fatalities by Region
Three regions were significantly affected in terms of fatalities. This matches what was reported in the news: on 19 February 2020, cases in South Korea jumped suddenly following a gathering at a Shincheonji Church.

In [15]:
%sql
select region, datediff(deceased_date, confirmed_date) as fatality_time, count(patient_id) from patient_csv  where deceased_date is not null group by region, fatality_time order by count(patient_id) desc

region,fatality_time,count(patient_id)
Daegu,6,4
Gyeongsangbuk-do,-1,4
Gyeongsangbuk-do,1,4
Daegu,8,3
Daegu,0,3
Daegu,5,3
Daegu,4,2
Gyeongsangbuk-do,3,2
Daegu,1,2
Gyeongsangbuk-do,4,2
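
Some rows above have a `fatality_time` of -1. Since `datediff(deceased_date, confirmed_date)` counts days from the second argument to the first, a negative value means the recorded death date precedes the recorded confirmation date. The sketch below mirrors that argument order in plain Python; the helper `datediff` is a stand-in written for illustration, and the second pair of dates is hypothetical, chosen only to produce a negative interval.

```python
from datetime import date


def datediff(end, start):
    """Days from `start` to `end` (ISO date strings), in the same
    argument order as the datediff(end, start) calls in the queries."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Patient 1 from the sample rows: confirmed 2020-01-20, released 2020-02-06.
print(datediff("2020-02-06", "2020-01-20"))  # 17

# Hypothetical dates: death recorded one day before confirmation.
print(datediff("2020-02-22", "2020-02-23"))  # -1
```

Rows like these may reflect posthumous confirmation or a data-entry issue, so it is worth deciding how to treat them before averaging durations.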


###% of Recovered Patient cases by type of Infection 
This graph shows the percentage of recovered patients by type of infection. About 40% of all recovered patients were infected through contact with a patient, and the second-largest group was patients who had visited Wuhan.

In [17]:
%sql

select infection_reason, datediff(released_date, confirmed_date) as recovery_time, count(patient_id) from patient_csv where released_date is not null group by infection_reason, recovery_time order by recovery_time desc

infection_reason,recovery_time,count(patient_id)
visit to Wuhan,32,1
contact with patient,25,1
contact with patient,24,2
visit to Wuhan,23,1
residence in Wuhan,22,1
contact with patient,22,1
residence in Wuhan,21,1
contact with patient,20,1
contact with patient,19,2
visit to Wuhan,17,2


###% of Fatality cases by type of Infection
This graph shows the percentage of deaths by type of infection. Most fatalities have no recorded infection reason. We can assume many factors contributed to the fatalities, but we would need more data or insights to confirm them.

In [19]:
%sql
select infection_reason, datediff(deceased_date, confirmed_date) as fatality, count(patient_id) from patient_csv where deceased_date is not null group by infection_reason, fatality order by fatality desc

infection_reason,fatality,count(patient_id)
,12,1
,8,3
,7,1
,6,4
,5,3
,4,4
,3,2
,2,2
,1,6
,0,5


Let's look for more insights into whether time plays a role in patient outcomes.

In [21]:
%sql
-- Number of patients by duration (in days) from the confirmed date to the deceased date. On average, about 3 days passed between confirmation and death.
select infection_reason, state, datediff(deceased_date, confirmed_date) as fatality, count(patient_id) from patient_csv where deceased_date is not null group by infection_reason, state, fatality order by fatality 


infection_reason,state,fatality,count(patient_id)
,deceased,-2,1
contact with patient,deceased,-1,1
,deceased,-1,3
,deceased,0,5
,deceased,1,6
,deceased,2,2
,deceased,3,2
,deceased,4,4
,deceased,5,3
,deceased,6,4
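
Because the output is grouped, an average duration has to weight each `fatality` value by its patient count. The sketch below shows that arithmetic in plain Python on the ten rows displayed above only, so it is not expected to reproduce an average over the full dataset; `weighted_mean` is a hypothetical helper name.

```python
# (fatality in days, patient count) pairs copied from the output above.
rows = [
    (-2, 1),
    (-1, 1),
    (-1, 3),
    (0, 5),
    (1, 6),
    (2, 2),
    (3, 2),
    (4, 4),
    (5, 3),
    (6, 4),
]


def weighted_mean(pairs):
    """Mean of the values, each weighted by its patient count."""
    total_patients = sum(n for _, n in pairs)
    return sum(days * n for days, n in pairs) / total_patients


# Average over every displayed row, including negative intervals.
print(round(weighted_mean(rows), 2))

# Average after dropping rows where death precedes confirmation.
non_negative = [(days, n) for days, n in rows if days >= 0]
print(round(weighted_mean(non_negative), 2))
```

Comparing the two figures shows how much the negative intervals pull the average down.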


In [22]:
%sql
-- Number of patients by duration (in days) from the confirmed date to the release date. On average, patients who recovered had spent more days since confirmation.

select infection_reason, state, datediff(released_date, confirmed_date) as recovery, count(patient_id) from patient_csv where released_date is not null group by infection_reason, state, recovery order by recovery 

infection_reason,state,recovery,count(patient_id)
visit to Daegu,released,1,1
,released,3,1
contact with patient,released,4,2
,released,4,2
,released,5,4
visit to Thailand,released,5,1
contact with patient,released,6,1
contact with patient in Singapore,released,7,1
contact with patient,released,7,3
,released,7,3
