<a href="https://colab.research.google.com/github/chitinglow/Covid19/blob/master/COVID19_Exploratory_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis on COVID-19

## Data Description

1. Case data: Data of COVID-19 infection cases in South Korea
2. Patient data
   * PatientInfo: Epidemiological data of COVID-19 patients in South Korea
   * PatientRoute: Route data of COVID-19 patients in South Korea
3. Time series data
   * Time: Time series data of COVID-19 status in South Korea
   * TimeAge: Time series data of COVID-19 status by age in South Korea
   * TimeGender: Time series data of COVID-19 status by gender in South Korea
   * TimeProvince: Time series data of COVID-19 status by province in South Korea
4. Additional data
   * Region: Location and statistical data of the regions in South Korea
   * Weather: Weather data for the regions of South Korea
   * SearchTrend: Trend data of keywords searched on NAVER, one of the largest portals in South Korea
   * SeoulFloating: Data of the floating population in Seoul, South Korea (from SK Telecom Big Data Hub)

In [0]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport

#case data
case = pd.read_csv("/content/drive/My Drive/Colab Notebooks/Case.csv")

#patient data
patientinfo = pd.read_csv("/content/drive/My Drive/Colab Notebooks/PatientInfo.csv")
patientroute = pd.read_csv("/content/drive/My Drive/Colab Notebooks/PatientRoute.csv")

#time series data
time = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Time.csv')
timeage = pd.read_csv('/content/drive/My Drive/Colab Notebooks/TimeAge.csv')
timegender = pd.read_csv('/content/drive/My Drive/Colab Notebooks/TimeGender.csv')
timeprovince = pd.read_csv('/content/drive/My Drive/Colab Notebooks/TimeProvince.csv')

#additional data
region = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Region.csv')
weather = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Weather.csv')
searchtrend = pd.read_csv('/content/drive/My Drive/Colab Notebooks/SearchTrend.csv')
searchfloating = pd.read_csv("/content/drive/My Drive/Colab Notebooks/SeoulFloating.csv")
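
The eleven `read_csv` calls above all point at the same folder, so they can be collapsed into a loop. The following is a minimal sketch, assuming every file sits in one directory and is named `<Table>.csv`; `DATA_DIR`, `TABLE_NAMES` and `load_tables` are hypothetical names introduced here for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Folder used by the cell above; adjust to wherever the CSV files live.
DATA_DIR = Path("/content/drive/My Drive/Colab Notebooks")

# File stems taken from the read_csv calls above.
TABLE_NAMES = [
    "Case", "PatientInfo", "PatientRoute",
    "Time", "TimeAge", "TimeGender", "TimeProvince",
    "Region", "Weather", "SearchTrend", "SeoulFloating",
]


def load_tables(data_dir, names=TABLE_NAMES):
    """Read <name>.csv for each name and return a dict keyed by lowercase name.

    Files that are not present are skipped, so the helper also works
    on a partial download.
    """
    tables = {}
    for name in names:
        path = Path(data_dir) / f"{name}.csv"
        if path.exists():
            tables[name.lower()] = pd.read_csv(path)
    return tables
```

With this helper, `tables = load_tables(DATA_DIR)` followed by `tables["case"]` would stand in for the `case` variable used below. The dictionary keeps the table names in one place, at the cost of slightly longer expressions.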

## Case information

### Case

In [34]:
case.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   case_id         111 non-null    int64 
 1   province        111 non-null    object
 2   city            111 non-null    object
 3   group           111 non-null    bool  
 4   infection_case  111 non-null    object
 5   confirmed       111 non-null    int64 
 6   latitude        111 non-null    object
 7   longitude       111 non-null    object
dtypes: bool(1), int64(2), object(5)
memory usage: 6.3+ KB


### Description of the case variables

* Case id: Unique identifier of infection case
* Province: Special City/Metropolitan City/Province
* City: City/County/District
* Group: TRUE (group infection); FALSE (not a group infection)
* Infection case: The infection case (the name of the cluster, or a general category such as overseas inflow)
* Confirmed: The cumulative number of confirmed cases
* Latitude: The latitude of the infection group
* Longitude: The longitude of the infection group 
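
`case.info()` reports `latitude` and `longitude` as `object` rather than `float64`, which usually means at least one value is not a number. The sketch below shows one way to coerce the columns. It assumes the non-numeric entries are placeholders such as `"-"`; the source does not show them, so the third row of the sample is hypothetical, as is the helper name `coerce_coordinates`.

```python
import pandas as pd

# Two rows copied from case.head() plus one hypothetical row whose
# coordinates are a "-" placeholder (an assumption about why the dtype is object).
case_sample = pd.DataFrame({
    "infection_case": ["Guro-gu Call Center", "Dongan Church", "etc"],
    "latitude": ["37.508163", "37.592888", "-"],
    "longitude": ["126.884387", "127.056766", "-"],
})


def coerce_coordinates(df, columns=("latitude", "longitude")):
    """Return a copy with the given columns parsed as floats.

    Values that cannot be parsed become NaN instead of raising an error.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


case_numeric = coerce_coordinates(case_sample)
print(case_numeric.dtypes)
print(case_numeric["latitude"].isna().sum(), "row(s) without usable coordinates")
```

Applied to the full table, `coerce_coordinates(case)` would make the coordinates usable for `describe()` or plotting, and the NaN count shows how many clusters have no recorded location.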

In [35]:
case.head()

| | case_id | province | city | group | infection_case | confirmed | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | Seoul | Guro-gu | True | Guro-gu Call Center | 98 | 37.508163 | 126.884387 |
| 1 | 1000002 | Seoul | Dongdaemun-gu | True | Dongan Church | 20 | 37.592888 | 127.056766 |
| 2 | 1000003 | Seoul | Guro-gu | True | Manmin Central Church | 41 | 37.481059 | 126.894343 |
| 3 | 1000004 | Seoul | Eunpyeong-gu | True | Eunpyeong St. Mary's Hospital | 14 | 37.63369 | 126.9165 |
| 4 | 1000005 | Seoul | Seongdong-gu | True | Seongdong-gu APT | 13 | 37.55713 | 127.0403 |


In [39]:
#descriptive stats for numerical values
case[['confirmed']].describe()

| | confirmed |
|---|---|
| count | 111.0 |
| mean | 87.27027 |
| std | 441.976345 |
| min | 0.0 |
| 25% | 5.0 |
| 50% | 10.0 |
| 75% | 31.0 |
| max | 4508.0 |


In [7]:
#Checking for missing values
case.isnull().sum()

case_id           0
province          0
city              0
group             0
infection_case    0
confirmed         0
latitude          0
longitude         0
dtype: int64

In [36]:
case.infection_case.value_counts()

etc                                      17
overseas inflow                          17
contact with patient                     16
Shincheonji Church                       15
Cheongdo Daenam Hospital                  3
Guro-gu Call Center                       3
Uijeongbu St. Mary’s Hospital             2
Onchun Church                             2
Seosan-si Laboratory                      2
Manmin Central Church                     2
Seongdong-gu APT                          1
Gyeongsan Jeil Silver Town                1
gym facility in Sejong                    1
Eunpyeong St. Mary's Hospital             1
Goesan-gun Jangyeon-myeon                 1
Geochang-gun Woongyang-myeon              1
Suwon Saeng Myeong Saem Church            1
Suyeong-gu Kindergarten                   1
Second Mi-Ju Hospital                     1
Haeundae-gu Catholic Church               1
Bonghwa Pureun Nursing Home               1
River of Grace Community Church           1
Hanmaeum Changwon Hospital      
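
`value_counts()` counts rows, so the numbers above say how many city-level entries mention each infection case, not how many people were infected. To rank clusters by size, the `confirmed` column can be summed per infection case. This is a minimal sketch: the first three rows come from `case.head()`, the fourth row (city and count) is hypothetical and only there to give the grouping something to add up, and `confirmed_by_case` is a name introduced here.

```python
import pandas as pd

case_sample = pd.DataFrame({
    "city": ["Guro-gu", "Dongdaemun-gu", "Guro-gu", "Hypothetical-si"],
    "infection_case": [
        "Guro-gu Call Center",
        "Dongan Church",
        "Manmin Central Church",
        "Guro-gu Call Center",  # hypothetical second entry for the same cluster
    ],
    "confirmed": [98, 20, 41, 7],  # the last value is made up for illustration
})


def confirmed_by_case(df):
    """Total confirmed cases per infection case, largest first."""
    return (
        df.groupby("infection_case")["confirmed"]
        .sum()
        .sort_values(ascending=False)
    )


totals = confirmed_by_case(case_sample)
print(totals)
```

On the full table the call would be `confirmed_by_case(case)`.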

## Patient Section

In [8]:
patientinfo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3253 entries, 0 to 3252
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          3253 non-null   int64  
 1   global_num          2082 non-null   float64
 2   sex                 3200 non-null   object 
 3   birth_year          2833 non-null   float64
 4   age                 3192 non-null   object 
 5   country             3142 non-null   object 
 6   province            3253 non-null   object 
 7   city                3177 non-null   object 
 8   disease             18 non-null     object 
 9   infection_case      2441 non-null   object 
 10  infection_order     31 non-null     float64
 11  infected_by         763 non-null    float64
 12  contact_number      597 non-null    float64
 13  symptom_onset_date  462 non-null    object 
 14  confirmed_date      3253 non-null   object 
 15  released_date       1137 non-null   object 
 16  deceas

### Description of Patient Info

* Patient id: Unique identifier of the patient
* Global num: The number given by KCDC
* Sex : Patient's sex
* Birth year: The birth year of the patient
* Age: The age of the patient (as an age group)
* Country: The country of the patient
* Province: The province of the patient
* City: The city of the patient
* Disease: TRUE (underlying disease); FALSE (no disease)
* Infection case: The case of infection
* Infection order: The order of infection
* Infected by: The ID of who infected the patient
* Contact number: The number of contacts with people
* Symptom onset date: The date of symptom onset
* Confirmed date: The date the infection was confirmed
* Released date: The date the patient was released
* Deceased date: The date the patient died
* State: isolated/released/deceased
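
Many of these columns are sparsely filled, which the non-null counts from `patientinfo.info()` already show. The sketch below turns those counts into missing-value percentages. The figures are copied from the `info()` output above (columns 0 to 15, since the output is cut off after that), and `missing_share` is a hypothetical helper name for doing the same thing directly on a DataFrame.

```python
import pandas as pd

TOTAL_ROWS = 3253  # RangeIndex length reported by patientinfo.info()

# Non-null counts copied from the info() output above.
non_null = pd.Series({
    "patient_id": 3253, "global_num": 2082, "sex": 3200, "birth_year": 2833,
    "age": 3192, "country": 3142, "province": 3253, "city": 3177,
    "disease": 18, "infection_case": 2441, "infection_order": 31,
    "infected_by": 763, "contact_number": 597, "symptom_onset_date": 462,
    "confirmed_date": 3253, "released_date": 1137,
})

# Percentage of rows with no value, most incomplete columns first.
missing_pct = ((1 - non_null / TOTAL_ROWS) * 100).round(1).sort_values(ascending=False)
print(missing_pct.head())


def missing_share(df):
    """Fraction of missing values per column of a DataFrame, largest first."""
    return df.isna().mean().sort_values(ascending=False)
```

With the table loaded, `missing_share(patientinfo)` gives the same ranking without copying numbers by hand. Columns such as `disease` and `infection_order` are almost entirely empty, so any analysis built on them covers only a small part of the patients.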

In [43]:
print(patientinfo.head())
# Only the last expression of a cell is displayed, so the age counts are computed but not shown
patientinfo.age.value_counts()
patientinfo.state.value_counts()

   patient_id  global_num     sex  ...  released_date deceased_date     state
0  1000000001         2.0    male  ...     2020-02-05           NaN  released
1  1000000002         5.0    male  ...     2020-03-02           NaN  released
2  1000000003         6.0    male  ...     2020-02-19           NaN  released
3  1000000004         7.0    male  ...     2020-02-15           NaN  released
4  1000000005         9.0  female  ...     2020-02-24           NaN  released

[5 rows x 18 columns]


isolated    1747
released    1439
deceased      67
Name: state, dtype: int64

In [12]:
patientroute.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5321 entries, 0 to 5320
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   patient_id  5321 non-null   int64  
 1   global_num  1081 non-null   float64
 2   date        5321 non-null   object 
 3   province    5321 non-null   object 
 4   city        5321 non-null   object 
 5   type        5321 non-null   object 
 6   latitude    5321 non-null   float64
 7   longitude   5321 non-null   float64
dtypes: float64(3), int64(1), object(4)
memory usage: 332.7+ KB


In [11]:
#Patient state
patientinfo['state'].value_counts()

isolated    1747
released    1439
deceased      67
Name: state, dtype: int64
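
The raw counts are easier to compare as shares of all patients. A minimal sketch, using only the three counts printed above:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
state_counts = pd.Series(
    {"isolated": 1747, "released": 1439, "deceased": 67}, name="state"
)

# Share of each state in percent, rounded to one decimal place.
state_share = (state_counts / state_counts.sum() * 100).round(1)
print(state_share)
```

On the DataFrame itself, `patientinfo['state'].value_counts(normalize=True)` returns the same proportions as fractions.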

In [10]:
#Patient counts by city and by province (patient info vs. patient route)
print(patientinfo['city'].value_counts(), '\n')
print(patientroute['city'].value_counts(), "\n")


print(patientinfo['province'].value_counts(), '\n')
print(patientroute['province'].value_counts())

Gyeongsan-si     630
Seongnam-si      123
Cheonan-si       105
Bucheon-si        73
Bonghwa-gun       71
                ... 
Taean-gun          1
Iksan-si           1
Gyeryong-si        1
Danyang-gun        1
Sancheong-gun      1
Name: city, Length: 150, dtype: int64 

Cheonan-si        465
Jung-gu           308
Wonju-si          274
Gangnam-gu        173
Guro-gu           162
                 ... 
Dongducheon-si      1
Naju-si             1
Osan-si             1
Yeoncheon-gun       1
Cheongsong-gun      1
Name: city, Length: 149, dtype: int64 

Gyeongsangbuk-do     1204
Gyeonggi-do           634
Seoul                 610
Chungcheongnam-do     139
Busan                 123
Gyeongsangnam-do      114
Incheon                87
Daegu                  63
Sejong                 46
Chungcheongbuk-do      44
Ulsan                  42
Daejeon                39
Gangwon-do             37
Gwangju                27
Jeollabuk-do           17
Jeollanam-do           15
Jeju-do                12
Name:

In [13]:
patientroute.describe()

| | patient_id | global_num | latitude | longitude |
|---|---|---|---|---|
| count | 5321.0 | 1081.0 | 5321.0 | 5321.0 |
| mean | 2666089000.0 | 2700.370953 | 36.652024 | 127.704122 |
| std | 1993828000.0 | 2874.370561 | 0.947463 | 0.923083 |
| min | 1000000000.0 | 2.0 | 33.454642 | 126.301005 |
| 25% | 1000000000.0 | 298.0 | 35.856172 | 126.958035 |
| 50% | 1300000000.0 | 1370.0 | 36.824764 | 127.122581 |
| 75% | 4100000000.0 | 4224.0 | 37.509682 | 128.628476 |
| max | 6100000000.0 | 9082.0 | 38.193169 | 129.475746 |


## Time Section

In [14]:
time.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       85 non-null     object
 1   time       85 non-null     int64 
 2   test       85 non-null     int64 
 3   negative   85 non-null     int64 
 4   confirmed  85 non-null     int64 
 5   released   85 non-null     int64 
 6   deceased   85 non-null     int64 
dtypes: int64(6), object(1)
memory usage: 4.8+ KB


In [15]:
time.describe()

| | time | test | negative | confirmed | released | deceased |
|---|---|---|---|---|---|---|
| count | 85.0 | 85.0 | 85.0 | 85.0 | 85.0 | 85.0 |
| mean | 7.905882 | 176692.447059 | 160680.811765 | 4573.588235 | 1691.094118 | 60.894118 |
| std | 8.046921 | 183124.164954 | 173503.582977 | 4367.632764 | 2549.75445 | 72.455245 |
| min | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 0.0 | 3110.0 | 2552.0 | 27.0 | 4.0 | 0.0 |
| 50% | 0.0 | 109591.0 | 71580.0 | 4212.0 | 31.0 | 22.0 |
| 75% | 16.0 | 338036.0 | 315447.0 | 8961.0 | 3166.0 | 111.0 |
| max | 16.0 | 518743.0 | 494815.0 | 10537.0 | 7447.0 | 217.0 |
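
The summary statistics above are hard to read if the columns are running totals, because the mean of a cumulative series has little meaning. Assuming `confirmed`, `released` and `deceased` are cumulative (the source states this only for the Case table), daily increments can be derived with `diff()`. The sample rows below are hypothetical, and `add_daily_columns` is a name introduced for this sketch.

```python
import pandas as pd

# Hypothetical cumulative counts, deliberately out of date order.
time_sample = pd.DataFrame({
    "date": ["2020-01-22", "2020-01-20", "2020-01-21"],
    "confirmed": [4, 1, 1],
    "deceased": [0, 0, 0],
})


def add_daily_columns(df, columns=("confirmed", "deceased")):
    """Sort by date and add new_<col> columns holding day-over-day increases.

    The first day has no predecessor, so its increment is set to the
    cumulative value itself.
    """
    out = (
        df.assign(date=pd.to_datetime(df["date"]))
        .sort_values("date")
        .reset_index(drop=True)
    )
    for col in columns:
        out[f"new_{col}"] = out[col].diff().fillna(out[col]).astype(int)
    return out


daily = add_daily_columns(time_sample)
print(daily)
```

For the real table, `add_daily_columns(time, columns=("test", "negative", "confirmed", "released", "deceased"))` would cover every count column listed by `time.info()`.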


In [16]:
timeage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387 entries, 0 to 386
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       387 non-null    object
 1   time       387 non-null    int64 
 2   age        387 non-null    object
 3   confirmed  387 non-null    int64 
 4   deceased   387 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 15.2+ KB


In [17]:
timeage.describe()

| | time | confirmed | deceased |
|---|---|---|---|
| count | 387.0 | 387.0 | 387.0 |
| mean | 0.0 | 963.503876 | 13.105943 |
| std | 0.0 | 712.198877 | 22.087828 |
| min | 0.0 | 32.0 | 0.0 |
| 25% | 0.0 | 440.0 | 0.0 |
| 50% | 0.0 | 833.0 | 1.0 |
| 75% | 0.0 | 1313.0 | 16.5 |
| max | 0.0 | 2879.0 | 103.0 |
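
Because every date contributes one row per age group, `describe()` mixes dates and groups together. A per-group view of the most recent date is often more informative. The sketch below computes the share of confirmed cases that ended in death for each group on the latest date. The sample values and age labels are hypothetical, and `latest_fatality_rate` is a name introduced here.

```python
import pandas as pd

# Hypothetical rows in the same layout as timeage (date, age, confirmed, deceased).
timeage_sample = pd.DataFrame({
    "date": ["2020-03-02", "2020-03-02", "2020-03-03", "2020-03-03"],
    "age": ["20s", "80s", "20s", "80s"],
    "confirmed": [1000, 80, 1200, 100],
    "deceased": [0, 4, 0, 6],
})


def latest_fatality_rate(df, group_col):
    """Deceased as a percentage of confirmed, per group, on the latest date.

    Dates are compared as strings, which is safe for the ISO
    YYYY-MM-DD format used in the sample.
    """
    latest = df[df["date"] == df["date"].max()].set_index(group_col)
    return (latest["deceased"] / latest["confirmed"] * 100).round(2)


rates = latest_fatality_rate(timeage_sample, "age")
print(rates)
```

The same helper works on `timegender` with `group_col="sex"`, since both tables share the `date`, `confirmed` and `deceased` columns.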


In [18]:
timegender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       86 non-null     object
 1   time       86 non-null     int64 
 2   sex        86 non-null     object
 3   confirmed  86 non-null     int64 
 4   deceased   86 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 3.5+ KB


In [19]:
timegender.describe()

| | time | confirmed | deceased |
|---|---|---|---|
| count | 86.0 | 86.0 | 86.0 |
| mean | 0.0 | 4335.651163 | 58.965116 |
| std | 0.0 | 1257.367406 | 30.65537 |
| min | 0.0 | 1591.0 | 9.0 |
| 25% | 0.0 | 3345.75 | 33.25 |
| 50% | 0.0 | 4174.0 | 56.5 |
| 75% | 0.0 | 5494.75 | 83.75 |
| max | 0.0 | 6294.0 | 115.0 |


In [20]:
#gender distribution
timegender['sex'].value_counts()

female    43
male      43
Name: sex, dtype: int64

In [22]:
timeprovince.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1445 entries, 0 to 1444
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       1445 non-null   object
 1   time       1445 non-null   int64 
 2   province   1445 non-null   object
 3   confirmed  1445 non-null   int64 
 4   released   1445 non-null   int64 
 5   deceased   1445 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 67.9+ KB


In [21]:
timeprovince['province'].value_counts()

Gyeonggi-do          85
Chungcheongnam-do    85
Jeollanam-do         85
Gangwon-do           85
Incheon              85
Daejeon              85
Sejong               85
Busan                85
Gyeongsangnam-do     85
Daegu                85
Jeollabuk-do         85
Ulsan                85
Gwangju              85
Seoul                85
Chungcheongbuk-do    85
Gyeongsangbuk-do     85
Jeju-do              85
Name: province, dtype: int64
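
Every province appears 85 times, and 17 provinces times 85 rows gives the 1445 rows reported by `timeprovince.info()`, which suggests a balanced panel with one row per province and date. A quick check along these lines can be sketched as follows; the sample rows are hypothetical and `is_balanced_panel` is a name introduced here.

```python
import pandas as pd


def is_balanced_panel(df, unit_col, time_col):
    """True if every unit has exactly one row for every time value in df."""
    if df.duplicated([unit_col, time_col]).any():
        return False
    per_unit = df.groupby(unit_col)[time_col].nunique()
    return bool((per_unit == df[time_col].nunique()).all())


# Hypothetical miniature of timeprovince: two provinces, two dates.
panel_sample = pd.DataFrame({
    "date": ["2020-01-20", "2020-01-21", "2020-01-20", "2020-01-21"],
    "province": ["Seoul", "Seoul", "Busan", "Busan"],
    "confirmed": [0, 1, 0, 0],
})

print(is_balanced_panel(panel_sample, "province", "date"))
# Dropping one row leaves Busan with a missing date.
print(is_balanced_panel(panel_sample.iloc[:-1], "province", "date"))
```

On the real data the call would be `is_balanced_panel(timeprovince, "province", "date")`. A balanced panel means provinces can be compared date by date without first filling gaps.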

## Region Section

In [23]:
region.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   code                      244 non-null    int64  
 1   province                  244 non-null    object 
 2   city                      244 non-null    object 
 3   latitude                  244 non-null    float64
 4   longitude                 244 non-null    float64
 5   elementary_school_count   244 non-null    int64  
 6   kindergarten_count        244 non-null    int64  
 7   university_count          244 non-null    int64  
 8   academy_ratio             244 non-null    float64
 9   elderly_population_ratio  244 non-null    float64
 10  elderly_alone_ratio       244 non-null    float64
 11  nursing_home_count        244 non-null    int64  
dtypes: float64(5), int64(5), object(2)
memory usage: 23.0+ KB


In [24]:
region.describe()

| | code | latitude | longitude | elementary_school_count | kindergarten_count | university_count | academy_ratio | elderly_population_ratio | elderly_alone_ratio | nursing_home_count |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 | 244.0 | 244.0 | 244.0 | 244.0 | 244.0 | 244.0 | 244.0 |
| mean | 32912.090164 | 36.396996 | 127.661401 | 74.180328 | 107.901639 | 4.151639 | 1.294754 | 20.92373 | 10.644672 | 1159.258197 |
| std | 19373.349736 | 1.060304 | 0.904781 | 402.713482 | 588.78832 | 22.513041 | 0.592898 | 8.087428 | 5.604886 | 6384.185085 |
| min | 10000.0 | 33.488936 | 126.263554 | 4.0 | 4.0 | 0.0 | 0.19 | 7.69 | 3.3 | 11.0 |
| 25% | 14027.5 | 35.405263 | 126.927663 | 14.75 | 16.0 | 0.0 | 0.87 | 14.1175 | 6.1 | 111.0 |
| 50% | 30075.0 | 36.386601 | 127.38425 | 22.0 | 31.0 | 1.0 | 1.27 | 18.53 | 8.75 | 300.0 |
| 75% | 51062.5 | 37.466119 | 128.473953 | 36.25 | 55.25 | 3.0 | 1.6125 | 27.2625 | 14.625 | 694.5 |
| max | 80000.0 | 38.380571 | 130.905883 | 6087.0 | 8837.0 | 340.0 | 4.18 | 40.26 | 24.7 | 94865.0 |


In [25]:
print(region['province'].value_counts(), "\n")
print(region['city'].value_counts(), "\n")

Gyeonggi-do          32
Seoul                26
Gyeongsangbuk-do     24
Jeollanam-do         23
Gangwon-do           19
Gyeongsangnam-do     19
Busan                17
Chungcheongnam-do    16
Jeollabuk-do         15
Chungcheongbuk-do    12
Incheon              11
Daegu                 9
Ulsan                 6
Gwangju               6
Daejeon               6
Korea                 1
Sejong                1
Jeju-do               1
Name: province, dtype: int64 

Jung-gu        6
Dong-gu        6
Seo-gu         5
Nam-gu         4
Buk-gu         4
              ..
Muan-gun       1
Busan          1
Jangsu-gun     1
Daedeok-gu     1
Jungnang-gu    1
Name: city, Length: 222, dtype: int64 
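
The province counts include a single `Korea` entry, and the maxima in `region.describe()` (for example 6087 elementary schools against a 75th percentile of 36.25) are far above the rest of the distribution. One plausible reading is that the table mixes city-level rows with a nationwide total. The sketch below drops that row before summarising. It rests on the assumption that `province == "Korea"` marks the national aggregate; the sample values are hypothetical apart from 6087, which is the maximum from the table above and is only assumed to belong to the `Korea` row. `drop_national_row` is a name introduced here.

```python
import pandas as pd

# Hypothetical miniature of region; only the 6087 figure appears in the source.
region_sample = pd.DataFrame({
    "province": ["Korea", "Seoul", "Seoul"],
    "city": ["Korea", "Gangnam-gu", "Jung-gu"],
    "elementary_school_count": [6087, 33, 12],
})


def drop_national_row(df, label="Korea"):
    """Return the rows whose province is not the nationwide aggregate label."""
    return df[df["province"] != label].reset_index(drop=True)


cities_only = drop_national_row(region_sample)
print(cities_only["elementary_school_count"].describe())
```

The city list also contains names such as `Busan`, so province-level totals may be present as well. Those would need a separate filter, for instance on rows where `city` equals `province`, after checking how single-row provinces such as Sejong are stored.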



## Weather Section

In [26]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25023 entries, 0 to 25022
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   code                   25023 non-null  int64  
 1   province               25023 non-null  object 
 2   date                   25023 non-null  object 
 3   avg_temp               25008 non-null  float64
 4   min_temp               25018 non-null  float64
 5   max_temp               25020 non-null  float64
 6   precipitation          25023 non-null  float64
 7   max_wind_speed         25014 non-null  float64
 8   most_wind_direction    24994 non-null  float64
 9   avg_relative_humidity  25003 non-null  float64
dtypes: float64(7), int64(1), object(2)
memory usage: 1.9+ MB


In [27]:
weather.describe()

| | code | avg_temp | min_temp | max_temp | precipitation | max_wind_speed | most_wind_direction | avg_relative_humidity |
|---|---|---|---|---|---|---|---|---|
| count | 25023.0 | 25008.0 | 25018.0 | 25020.0 | 25023.0 | 25014.0 | 24994.0 | 25003.0 |
| mean | 32124.645326 | 13.621057 | 9.437153 | 18.526379 | 3.267086 | 5.102778 | 195.947027 | 65.564572 |
| std | 20313.522756 | 9.636505 | 10.021912 | 9.686541 | 12.655798 | 2.022522 | 106.909278 | 17.232745 |
| min | 10000.0 | -14.8 | -19.2 | -11.9 | 0.0 | 1.0 | 20.0 | 10.4 |
| 25% | 13500.0 | 5.6 | 1.0 | 10.5 | 0.0 | 3.8 | 90.0 | 53.5 |
| 50% | 20000.0 | 14.1 | 9.5 | 19.4 | 0.0 | 4.7 | 230.0 | 66.6 |
| 75% | 50500.0 | 21.9 | 18.1 | 26.6 | 0.2 | 6.0 | 290.0 | 78.6 |
| max | 70000.0 | 33.9 | 30.3 | 40.0 | 310.0 | 29.4 | 360.0 | 100.0 |
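
The weather table has a `date` column stored as `object` and a handful of missing readings (for example 25008 of 25023 values for `avg_temp`). A minimal sketch of a monthly summary that parses the dates and lets pandas skip the gaps is shown below. The sample readings are hypothetical, and `monthly_mean_temp` is a name introduced here.

```python
import pandas as pd

# Hypothetical readings in the layout of the weather table, with one gap.
weather_sample = pd.DataFrame({
    "province": ["Seoul", "Seoul", "Seoul", "Seoul"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-01"],
    "avg_temp": [1.0, None, 3.0, 5.0],
})


def monthly_mean_temp(df):
    """Mean avg_temp per province and calendar month; missing readings are ignored."""
    out = df.assign(date=pd.to_datetime(df["date"]))
    out["month"] = out["date"].dt.to_period("M")
    return out.groupby(["province", "month"], as_index=False)["avg_temp"].mean()


monthly = monthly_mean_temp(weather_sample)
print(monthly)
print("missing avg_temp readings:", weather_sample["avg_temp"].isna().sum())
```

`mean()` skips NaN by default, so no imputation is needed for this summary. Whether gaps should be filled for other analyses depends on how the readings are used.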


## Search Section

In [28]:
searchtrend.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1564 entries, 0 to 1563
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         1564 non-null   object 
 1   cold         1564 non-null   float64
 2   flu          1564 non-null   float64
 3   pneumonia    1564 non-null   float64
 4   coronavirus  1564 non-null   float64
dtypes: float64(4), object(1)
memory usage: 61.2+ KB


In [29]:
searchtrend.describe()

| | cold | flu | pneumonia | coronavirus |
|---|---|---|---|---|
| count | 1564.0 | 1564.0 | 1564.0 | 1564.0 |
| mean | 0.193994 | 0.255173 | 0.227153 | 1.76622 |
| std | 0.470578 | 0.779597 | 0.473678 | 9.248131 |
| min | 0.05163 | 0.00981 | 0.06881 | 0.00154 |
| 25% | 0.106698 | 0.046405 | 0.132192 | 0.00618 |
| 50% | 0.13463 | 0.10608 | 0.16808 | 0.00863 |
| 75% | 0.166855 | 0.261647 | 0.212833 | 0.01227 |
| max | 15.72071 | 27.32727 | 11.3932 | 100.0 |


In [30]:
searchfloating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432000 entries, 0 to 431999
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        432000 non-null  object
 1   hour        432000 non-null  int64 
 2   birth_year  432000 non-null  int64 
 3   sex         432000 non-null  object
 4   province    432000 non-null  object
 5   city        432000 non-null  object
 6   fp_num      432000 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 23.1+ MB


In [31]:
searchfloating.describe()

| | hour | birth_year | fp_num |
|---|---|---|---|
| count | 432000.0 | 432000.0 | 432000.0 |
| mean | 11.5 | 45.0 | 27860.034884 |
| std | 6.922195 | 17.078271 | 13122.838441 |
| min | 0.0 | 20.0 | 4480.0 |
| 25% | 5.75 | 30.0 | 18940.0 |
| 50% | 11.5 | 45.0 | 25690.0 |
| 75% | 17.25 | 60.0 | 34000.0 |
| max | 23.0 | 70.0 | 127640.0 |


In [32]:
searchfloating['sex'].value_counts()

male      216000
female    216000
Name: sex, dtype: int64

In [33]:
searchfloating['city'].value_counts()

Guro-gu            17280
Gangbuk-gu         17280
Seocho-gu          17280
Jongno-gu          17280
Nowon-gu           17280
Eunpyeong-gu       17280
Gangnam-gu         17280
Yeongdeungpo-gu    17280
Geumcheon-gu       17280
Yangcheon-gu       17280
Songpa-gu          17280
Gangdong-gu        17280
Gangseo-gu         17280
Dongjag-gu         17280
Seodaemun-gu       17280
Seongbuk-gu        17280
Gwanak-gu          17280
Mapo-gu            17280
Dobong-gu          17280
Dongdaemun-gu      17280
Gwangjin-gu        17280
Jung-gu            17280
Jungnang-gu        17280
Yongsan-gu         17280
Seongdong-gu       17280
Name: city, dtype: int64
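
Each of the 25 districts has 17280 rows (25 × 17280 = 432000), one per combination of date, hour, `birth_year` and sex. The `birth_year` values range from 20 to 70, which looks more like an age group than a year of birth, although the source does not confirm this. To compare districts, the hourly and demographic breakdown can be summed into one figure per district and day. This is a minimal sketch on hypothetical rows; `daily_floating_population` is a name introduced here.

```python
import pandas as pd

# Hypothetical rows in the layout of the SeoulFloating table.
floating_sample = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "hour": [0, 0, 0],
    "birth_year": [20, 20, 20],
    "sex": ["female", "male", "female"],
    "city": ["Dobong-gu", "Dobong-gu", "Jung-gu"],
    "fp_num": [19140, 19950, 4480],
})


def daily_floating_population(df):
    """Sum fp_num over hours, age groups and sexes for each date and district."""
    return df.groupby(["date", "city"], as_index=False)["fp_num"].sum()


daily_fp = daily_floating_population(floating_sample)
print(daily_fp)
```

On the full table the call would be `daily_floating_population(searchfloating)`, using the variable name from the loading cell. Note that summing hourly figures counts the same person once per hour, so the result is a measure of activity rather than a head count.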