
# Exploratory Data Analysis on COVID-19

## Data Description

1. Case Data: Data on COVID-19 infection cases in South Korea
2. Patient Data
   * PatientInfo: Epidemiological data of COVID-19 patients in South Korea
   * PatientRoute: Route data of COVID-19 patients in South Korea
3. Time Series Data
   * Time: Time series data of COVID-19 status in South Korea
   * TimeAge: Time series data of COVID-19 status by age in South Korea
   * TimeGender: Time series data of COVID-19 status by gender in South Korea
   * TimeProvince: Time series data of COVID-19 status by province in South Korea
4. Additional Data
   * Region: Location and statistical data of the regions in South Korea
   * Weather: Weather data for the regions of South Korea
   * SearchTrend: Trend data of the keywords searched on NAVER, one of the largest portals in South Korea
   * SeoulFloating: Data on the floating population in Seoul, South Korea (from SK Telecom Big Data Hub)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import pandas_profiling

#case data
case = pd.read_csv("/content/drive/My Drive/Colab Notebooks/Case.csv")

#patient data
patientinfo = pd.read_csv("/content/drive/My Drive/Colab Notebooks/PatientInfo.csv")
patientroute = pd.read_csv("/content/drive/My Drive/Colab Notebooks/PatientRoute.csv")

#time series data
time = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Time.csv')
timeage = pd.read_csv('/content/drive/My Drive/Colab Notebooks/TimeAge.csv')
timegender = pd.read_csv('/content/drive/My Drive/Colab Notebooks/TimeGender.csv')
timeprovince = pd.read_csv('/content/drive/My Drive/Colab Notebooks/TimeProvince.csv')

#additional data
region = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Region.csv')
weather = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Weather.csv')
searchtrend = pd.read_csv('/content/drive/My Drive/Colab Notebooks/SearchTrend.csv')
searchfloating = pd.read_csv("/content/drive/My Drive/Colab Notebooks/SeoulFloating.csv")
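
The `pd.read_csv` calls above all read from one directory, so they could be collapsed into a loop. The following is a minimal sketch, assuming every file is named `<Table>.csv`; the helper name `load_tables` and its arguments are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir, names):
    """Read <name>.csv for each name into a dict keyed by the lowercased name."""
    data_dir = Path(data_dir)
    return {name.lower(): pd.read_csv(data_dir / f"{name}.csv") for name in names}


# Hypothetical usage with the directory from the cell above:
# tables = load_tables("/content/drive/My Drive/Colab Notebooks",
#                      ["Case", "PatientInfo", "PatientRoute", "Time"])
# case = tables["case"]
```

Keeping the frames in a dict makes it easy to run the same check, such as `info()` or `isnull().sum()`, over every table in one pass.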

## Case information

### Description of the case variables

* Case id: Unique identifier of the infection case
* Province: Special City/Metropolitan City/Province
* City: City/County/District
* Group: TRUE (group infection); FALSE (not a group infection)
* Infection case: The cluster or source of the infection (for example, "overseas inflow" or "contact with patient")
* Confirmed: The cumulative number of confirmed cases
* Latitude: The latitude of the infection group
* Longitude: The longitude of the infection group

In [6]:
case.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   case_id         111 non-null    int64 
 1   province        111 non-null    object
 2   city            111 non-null    object
 3   group           111 non-null    bool  
 4   infection_case  111 non-null    object
 5   confirmed       111 non-null    int64 
 6   latitude        111 non-null    object
 7   longitude       111 non-null    object
dtypes: bool(1), int64(2), object(5)
memory usage: 6.3+ KB
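
The `info()` output above reports `latitude` and `longitude` as `object` rather than `float64`, which usually means at least one value is not a plain number. The sketch below shows one way to coerce such columns. The `"-"` placeholder in the toy frame is an assumption about what the non-numeric values look like, and `case_demo` is a hypothetical stand-in for `case`.

```python
import pandas as pd

# Toy frame; the "-" placeholder is an assumption about why the columns load as object.
case_demo = pd.DataFrame(
    {
        "infection_case": ["Guro-gu Call Center", "etc"],
        "latitude": ["37.508163", "-"],
        "longitude": ["126.884387", "-"],
    }
)

for col in ["latitude", "longitude"]:
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    case_demo[col] = pd.to_numeric(case_demo[col], errors="coerce")

print(case_demo.dtypes)
print(case_demo["latitude"].isna().sum())  # rows without usable coordinates
```

After coercion, `isnull().sum()` would report the rows without coordinates, which the string placeholder otherwise hides.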


In [7]:
case.head()

Unnamed: 0,case_id,province,city,group,infection_case,confirmed,latitude,longitude
0,1000001,Seoul,Guro-gu,True,Guro-gu Call Center,98,37.508163,126.884387
1,1000002,Seoul,Dongdaemun-gu,True,Dongan Church,20,37.592888,127.056766
2,1000003,Seoul,Guro-gu,True,Manmin Central Church,41,37.481059,126.894343
3,1000004,Seoul,Eunpyeong-gu,True,Eunpyeong St. Mary's Hospital,14,37.63369,126.9165
4,1000005,Seoul,Seongdong-gu,True,Seongdong-gu APT,13,37.55713,127.0403


In [8]:
#descriptive stats for numerical values
case[['confirmed']].describe()

Unnamed: 0,confirmed
count,111.0
mean,87.27027
std,441.976345
min,0.0
25%,5.0
50%,10.0
75%,31.0
max,4508.0
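
The summary above shows a mean of about 87 confirmed cases per row against a median of 10 and a maximum of 4508, so a few clusters account for most of the total. Aggregating before comparing makes that easier to see. The sketch below uses only the five rows printed by `case.head()`; the same pattern should apply to `infection_case` or `province` on the full table.

```python
import pandas as pd

# First five rows of `case`, copied from the head() output above.
case_head = pd.DataFrame(
    {
        "city": ["Guro-gu", "Dongdaemun-gu", "Guro-gu", "Eunpyeong-gu", "Seongdong-gu"],
        "confirmed": [98, 20, 41, 14, 13],
    }
)

# Sum confirmed cases per city and list the largest first.
by_city = case_head.groupby("city")["confirmed"].sum().sort_values(ascending=False)
print(by_city)
```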


In [9]:
#Checking for missing values
case.isnull().sum()

case_id           0
province          0
city              0
group             0
infection_case    0
confirmed         0
latitude          0
longitude         0
dtype: int64

In [10]:
case.infection_case.value_counts()

etc                                      17
overseas inflow                          17
contact with patient                     16
Shincheonji Church                       15
Cheongdo Daenam Hospital                  3
Guro-gu Call Center                       3
Onchun Church                             2
Uijeongbu St. Mary’s Hospital             2
Manmin Central Church                     2
Seosan-si Laboratory                      2
Milal Shelter                             1
Suwon Saeng Myeong Saem Church            1
Geochang-gun Woongyang-myeon              1
Gyeongsan Jeil Silver Town                1
Seongdong-gu APT                          1
Dongan Church                             1
Haeundae-gu Catholic Church               1
Wings Tower                               1
Gyeongsan Cham Joeun Community Center     1
Geochang Church                           1
River of Grace Community Church           1
Jung-gu Fashion Company                   1
Suyeong-gu Kindergarten         

## Patient Section

### Description of Patient Info

* Patient id: Unique identifier of the patient
* Global num: The number given by KCDC
* Sex: The patient's sex
* Birth year: The birth year of the patient
* Age: The age group of the patient (by decade, e.g. 20s)
* Country: The country of the patient
* Province: The province of the patient
* City: The city of the patient
* Disease: TRUE (underlying disease); FALSE (no disease)
* Infection case: The case of infection
* Infection order: The order of infection
* Infected by: The ID of the person who infected the patient
* Contact number: The number of people the patient had contact with
* Symptom onset date: The date of symptom onset
* Confirmed date: The date the patient was confirmed
* Released date: The date the patient was released
* Deceased date: The date of death
* State: isolated/released/deceased

In [11]:
patientinfo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3253 entries, 0 to 3252
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          3253 non-null   int64  
 1   global_num          2082 non-null   float64
 2   sex                 3200 non-null   object 
 3   birth_year          2833 non-null   float64
 4   age                 3192 non-null   object 
 5   country             3142 non-null   object 
 6   province            3253 non-null   object 
 7   city                3177 non-null   object 
 8   disease             18 non-null     object 
 9   infection_case      2441 non-null   object 
 10  infection_order     31 non-null     float64
 11  infected_by         763 non-null    float64
 12  contact_number      597 non-null    float64
 13  symptom_onset_date  462 non-null    object 
 14  confirmed_date      3253 non-null   object 
 15  released_date       1137 non-null   object 
 16  deceas

In [38]:
patientinfo.head()

Unnamed: 0,patient_id,global_num,sex,birth_year,age,country,province,city,disease,infection_case,infection_order,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state
0,1000000001,2.0,male,1964.0,50s,Korea,Seoul,Gangseo-gu,,overseas inflow,1.0,,75.0,2020-01-22,2020-01-23,2020-02-05,,released
1,1000000002,5.0,male,1987.0,30s,Korea,Seoul,Jungnang-gu,,overseas inflow,1.0,,31.0,,2020-01-30,2020-03-02,,released
2,1000000003,6.0,male,1964.0,50s,Korea,Seoul,Jongno-gu,,contact with patient,2.0,2002000000.0,17.0,,2020-01-30,2020-02-19,,released
3,1000000004,7.0,male,1991.0,20s,Korea,Seoul,Mapo-gu,,overseas inflow,1.0,,9.0,2020-01-26,2020-01-30,2020-02-15,,released
4,1000000005,9.0,female,1992.0,20s,Korea,Seoul,Seongbuk-gu,,contact with patient,2.0,1000000000.0,2.0,,2020-01-31,2020-02-24,,released
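
The `patientinfo.info()` output shows several sparsely populated columns, for example `disease` with 18 non-null values out of 3253. Expressing missingness as a share per column makes such gaps easier to rank. The sketch below uses a subset of the five rows printed above, so `patient_head` is an illustrative stand-in for the full table.

```python
import numpy as np
import pandas as pd

# Selected columns from the patientinfo.head() output above.
patient_head = pd.DataFrame(
    {
        "patient_id": [1000000001, 1000000002, 1000000003, 1000000004, 1000000005],
        "disease": [np.nan] * 5,
        "infected_by": [np.nan, np.nan, 2002000000.0, np.nan, 1000000000.0],
        "symptom_onset_date": ["2020-01-22", None, None, "2020-01-26", None],
        "released_date": ["2020-02-05", "2020-03-02", "2020-02-19", "2020-02-15", "2020-02-24"],
    }
)

# isna() gives a boolean frame; its column means are the missing shares.
missing_share = patient_head.isna().mean().sort_values(ascending=False)
print(missing_share)
```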


In [39]:
patientinfo.age.value_counts()

20s     779
50s     582
40s     445
30s     418
60s     390
70s     195
80s     154
10s     136
0s       47
90s      45
100s      1
Name: age, dtype: int64

In [40]:
patientinfo.state.value_counts()

isolated    1747
released    1439
deceased      67
Name: state, dtype: int64
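
The date columns in `patientinfo` load as `object` (strings), so durations cannot be computed until they are parsed. The sketch below parses `confirmed_date` and `released_date` for the five patients shown in the `head()` output above and derives the days between confirmation and release; the column name `days_to_release` is introduced here for illustration.

```python
import pandas as pd

# Dates copied from the patientinfo.head() output above.
dates = pd.DataFrame(
    {
        "confirmed_date": ["2020-01-23", "2020-01-30", "2020-01-30", "2020-01-30", "2020-01-31"],
        "released_date": ["2020-02-05", "2020-03-02", "2020-02-19", "2020-02-15", "2020-02-24"],
    }
)

for col in ["confirmed_date", "released_date"]:
    dates[col] = pd.to_datetime(dates[col])

# Subtracting two datetime columns yields timedeltas; .dt.days extracts whole days.
dates["days_to_release"] = (dates["released_date"] - dates["confirmed_date"]).dt.days
print(dates)
```

On the full table, rows without a `released_date` would produce missing durations, which is expected for patients still isolated.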

### Description of Patient Route

* Patient id: Unique identifier of the patient
* Global num: The number given by KCDC
* Date: Date of the route (format: YYYY-MM-DD)
* Province: Name of Province
* City: Name of City
* Type: Location type (gym, hospital, etc.)
* Latitude: Route latitude
* Longitude: Route longitude

In [14]:
patientroute.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5321 entries, 0 to 5320
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   patient_id  5321 non-null   int64  
 1   global_num  1081 non-null   float64
 2   date        5321 non-null   object 
 3   province    5321 non-null   object 
 4   city        5321 non-null   object 
 5   type        5321 non-null   object 
 6   latitude    5321 non-null   float64
 7   longitude   5321 non-null   float64
dtypes: float64(3), int64(1), object(4)
memory usage: 332.7+ KB


In [43]:
patientroute.head()

Unnamed: 0,patient_id,global_num,date,province,city,type,latitude,longitude
0,1000000001,2.0,2020-01-22,Gyeonggi-do,Gimpo-si,airport,37.615246,126.715632
1,1000000001,2.0,2020-01-24,Seoul,Jung-gu,hospital,37.567241,127.005659
2,1000000002,5.0,2020-01-25,Seoul,Seongbuk-gu,etc,37.59256,127.017048
3,1000000002,5.0,2020-01-26,Seoul,Seongbuk-gu,store,37.59181,127.016822
4,1000000002,5.0,2020-01-26,Seoul,Seongdong-gu,public_transportation,37.563992,127.029534


In [44]:
patientroute[['global_num']].describe()

Unnamed: 0,global_num
count,1081.0
mean,2700.370953
std,2874.370561
min,2.0
25%,298.0
50%,1370.0
75%,4224.0
max,9082.0
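
Since `global_num` is an identifier, its mean and quartiles carry little meaning. A per-patient count of recorded locations is likely a more useful summary of the route table. The sketch below uses the five rows from `patientroute.head()`.

```python
import pandas as pd

# Rows copied from the patientroute.head() output above.
route_head = pd.DataFrame(
    {
        "patient_id": [1000000001, 1000000001, 1000000002, 1000000002, 1000000002],
        "type": ["airport", "hospital", "etc", "store", "public_transportation"],
    }
)

# Number of recorded route points per patient.
stops_per_patient = route_head.groupby("patient_id").size()
print(stops_per_patient)

# Frequency of each location type across all route points.
print(route_head["type"].value_counts())
```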


## Time Section

### Description of Time series data

* Date: Date of the reported cases (format: YYYY-MM-DD)
* Time: Hour of the report (24-hour format)
* Test: The cumulative number of tests
* Negative: The cumulative number of negative results
* Confirmed: The cumulative number of positive results
* Released: The cumulative number of released patients
* Deceased: The cumulative number of deaths


In [46]:
time.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       85 non-null     object
 1   time       85 non-null     int64 
 2   test       85 non-null     int64 
 3   negative   85 non-null     int64 
 4   confirmed  85 non-null     int64 
 5   released   85 non-null     int64 
 6   deceased   85 non-null     int64 
dtypes: int64(6), object(1)
memory usage: 4.8+ KB


In [47]:
time.head()

Unnamed: 0,date,time,test,negative,confirmed,released,deceased
0,2020-01-20,16,1,0,1,0,0
1,2020-01-21,16,1,0,1,0,0
2,2020-01-22,16,4,3,1,0,0
3,2020-01-23,16,22,21,1,0,0
4,2020-01-24,16,27,25,2,0,0


In [19]:
time.describe()

Unnamed: 0,time,test,negative,confirmed,released,deceased
count,85.0,85.0,85.0,85.0,85.0,85.0
mean,7.905882,176692.447059,160680.811765,4573.588235,1691.094118,60.894118
std,8.046921,183124.164954,173503.582977,4367.632764,2549.75445,72.455245
min,0.0,1.0,0.0,1.0,0.0,0.0
25%,0.0,3110.0,2552.0,27.0,4.0,0.0
50%,0.0,109591.0,71580.0,4212.0,31.0,22.0
75%,16.0,338036.0,315447.0,8961.0,3166.0,111.0
max,16.0,518743.0,494815.0,10537.0,7447.0,217.0
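
Because every count in `time` is cumulative, `describe()` summarizes running totals rather than daily activity. Differencing consecutive rows recovers the daily increments. The sketch below applies this to the five rows printed by `time.head()`.

```python
import pandas as pd

# Rows copied from the time.head() output above.
time_head = pd.DataFrame(
    {
        "date": ["2020-01-20", "2020-01-21", "2020-01-22", "2020-01-23", "2020-01-24"],
        "test": [1, 1, 4, 22, 27],
        "confirmed": [1, 1, 1, 1, 2],
    }
)
time_head["date"] = pd.to_datetime(time_head["date"])
time_head = time_head.set_index("date")

# diff() subtracts the previous row; the first row has no predecessor and becomes NaN.
daily = time_head[["test", "confirmed"]].diff()
print(daily)
```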


### Description of Time series data of COVID-19 status by age

* Date: Date of cases (format: YYYY-MM-DD)
* Time: Hour of the report (24-hour format)
* Age: The age group of the patients (in 10-year bands)
* Confirmed: The cumulative number of confirmed cases
* Deceased: The cumulative number of deaths

In [49]:
timeage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387 entries, 0 to 386
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       387 non-null    object
 1   time       387 non-null    int64 
 2   age        387 non-null    object
 3   confirmed  387 non-null    int64 
 4   deceased   387 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 15.2+ KB


In [50]:
timeage.head()

Unnamed: 0,date,time,age,confirmed,deceased
0,2020-03-02,0,0s,32,0
1,2020-03-02,0,10s,169,0
2,2020-03-02,0,20s,1235,0
3,2020-03-02,0,30s,506,1
4,2020-03-02,0,40s,633,1


In [21]:
timeage.describe()

Unnamed: 0,time,confirmed,deceased
count,387.0,387.0,387.0
mean,0.0,963.503876,13.105943
std,0.0,712.198877,22.087828
min,0.0,32.0,0.0
25%,0.0,440.0,0.0
50%,0.0,833.0,1.0
75%,0.0,1313.0,16.5
max,0.0,2879.0,103.0
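
With cumulative confirmed and deceased counts per age group, a crude fatality percentage among confirmed cases can be derived for any snapshot date. The sketch below uses the 2020-03-02 rows from `timeage.head()`. The column name `fatality_pct` is introduced here, and the figure is a simple ratio, not an adjusted estimate.

```python
import pandas as pd

# Rows copied from the timeage.head() output above (snapshot of 2020-03-02).
timeage_head = pd.DataFrame(
    {
        "date": ["2020-03-02"] * 5,
        "age": ["0s", "10s", "20s", "30s", "40s"],
        "confirmed": [32, 169, 1235, 506, 633],
        "deceased": [0, 0, 0, 1, 1],
    }
)

# Counts are cumulative, so only the most recent date is needed.
# ISO-formatted date strings sort chronologically, so max() works on the strings.
latest = timeage_head[timeage_head["date"] == timeage_head["date"].max()].copy()
latest["fatality_pct"] = 100 * latest["deceased"] / latest["confirmed"]
print(latest[["age", "confirmed", "deceased", "fatality_pct"]])
```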


### Description of Time series data in terms of gender

* Date: Date of cases (format: YYYY-MM-DD)
* Time: Hour of the report (24-hour format)
* Sex: The gender of the patients
* Confirmed: The cumulative number of confirmed cases
* Deceased: The cumulative number of deaths

In [22]:
timegender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       86 non-null     object
 1   time       86 non-null     int64 
 2   sex        86 non-null     object
 3   confirmed  86 non-null     int64 
 4   deceased   86 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 3.5+ KB


In [51]:
timegender.head()

Unnamed: 0,date,time,sex,confirmed,deceased
0,2020-03-02,0,male,1591,13
1,2020-03-02,0,female,2621,9
2,2020-03-03,0,male,1810,16
3,2020-03-03,0,female,3002,12
4,2020-03-04,0,male,1996,20


In [23]:
timegender.describe()

Unnamed: 0,time,confirmed,deceased
count,86.0,86.0,86.0
mean,0.0,4335.651163,58.965116
std,0.0,1257.367406,30.65537
min,0.0,1591.0,9.0
25%,0.0,3345.75,33.25
50%,0.0,4174.0,56.5
75%,0.0,5494.75,83.75
max,0.0,6294.0,115.0
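
`timegender` stores one row per date and sex, so `describe()` mixes male and female counts in a single column. Pivoting to one column per sex separates them. The sketch below uses the first four rows of `timegender.head()`; the `female_share` column is introduced here for illustration.

```python
import pandas as pd

# First four rows of the timegender.head() output above.
timegender_head = pd.DataFrame(
    {
        "date": ["2020-03-02", "2020-03-02", "2020-03-03", "2020-03-03"],
        "sex": ["male", "female", "male", "female"],
        "confirmed": [1591, 2621, 1810, 3002],
    }
)

# One row per date, one column per sex.
wide = timegender_head.pivot(index="date", columns="sex", values="confirmed")
wide["female_share"] = wide["female"] / (wide["female"] + wide["male"])
print(wide)
```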


### Description of Time series data in terms of province

* Date: Date of cases (format: YYYY-MM-DD)
* Time: Hour of the report (24-hour format)
* Province: The name of the province
* Confirmed: The cumulative number of confirmed cases in the province
* Released: The cumulative number of released patients in the province
* Deceased: The cumulative number of deaths in the province

In [25]:
timeprovince.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1445 entries, 0 to 1444
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       1445 non-null   object
 1   time       1445 non-null   int64 
 2   province   1445 non-null   object
 3   confirmed  1445 non-null   int64 
 4   released   1445 non-null   int64 
 5   deceased   1445 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 67.9+ KB


## Region Section

### Description of the region data

* Code: The code of the region
* Province: The name of the province
* City: The name of the city
* Latitude: Latitude of the region
* Longitude: Longitude of the region
* Elementary School Count: The number of elementary schools in the region
* Kindergarten Count: The number of kindergartens in the region
* University Count: The number of universities in the region
* Academy Ratio: The ratio of academies
* Elderly Population Ratio: The ratio of the elderly population
* Elderly Alone Ratio: The ratio of elderly households living alone
* Nursing Home Count: The number of nursing homes in the region


In [27]:
region.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   code                      244 non-null    int64  
 1   province                  244 non-null    object 
 2   city                      244 non-null    object 
 3   latitude                  244 non-null    float64
 4   longitude                 244 non-null    float64
 5   elementary_school_count   244 non-null    int64  
 6   kindergarten_count        244 non-null    int64  
 7   university_count          244 non-null    int64  
 8   academy_ratio             244 non-null    float64
 9   elderly_population_ratio  244 non-null    float64
 10  elderly_alone_ratio       244 non-null    float64
 11  nursing_home_count        244 non-null    int64  
dtypes: float64(5), int64(5), object(2)
memory usage: 23.0+ KB


In [53]:
region.head()

Unnamed: 0,code,province,city,latitude,longitude,elementary_school_count,kindergarten_count,university_count,academy_ratio,elderly_population_ratio,elderly_alone_ratio,nursing_home_count
0,10000,Seoul,Seoul,37.566953,126.977977,607,830,48,1.44,15.38,5.8,22739
1,10010,Seoul,Gangnam-gu,37.518421,127.047222,33,38,0,4.18,13.17,4.3,3088
2,10020,Seoul,Gangdong-gu,37.530492,127.123837,27,32,0,1.54,14.55,5.4,1023
3,10030,Seoul,Gangbuk-gu,37.639938,127.025508,14,21,0,0.67,19.49,8.5,628
4,10040,Seoul,Gangseo-gu,37.551166,126.849506,36,56,1,1.17,14.39,5.7,1080


In [52]:
region[["elementary_school_count", "kindergarten_count", "university_count", "academy_ratio", "elderly_population_ratio", "elderly_alone_ratio", "nursing_home_count"]].describe()

Unnamed: 0,elementary_school_count,kindergarten_count,university_count,academy_ratio,elderly_population_ratio,elderly_alone_ratio,nursing_home_count
count,244.0,244.0,244.0,244.0,244.0,244.0,244.0
mean,74.180328,107.901639,4.151639,1.294754,20.92373,10.644672,1159.258197
std,402.713482,588.78832,22.513041,0.592898,8.087428,5.604886,6384.185085
min,4.0,4.0,0.0,0.19,7.69,3.3,11.0
25%,14.75,16.0,0.0,0.87,14.1175,6.1,111.0
50%,22.0,31.0,1.0,1.27,18.53,8.75,300.0
75%,36.25,55.25,3.0,1.6125,27.2625,14.625,694.5
max,6087.0,8837.0,340.0,4.18,40.26,24.7,94865.0
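
The maximum of 6087 elementary schools sits far above the 75th percentile of 36.25, and the first row of `region.head()` (Seoul/Seoul, 607 schools) looks like a province-level total, not a district. If so, totals and districts are mixed in the same columns and inflate the mean and standard deviation. The sketch below separates them, assuming that a row whose `city` equals its `province` is an aggregate. That rule is an inference from the printed rows, not something the dataset description confirms.

```python
import pandas as pd

# Selected columns from the region.head() output above.
region_head = pd.DataFrame(
    {
        "province": ["Seoul"] * 5,
        "city": ["Seoul", "Gangnam-gu", "Gangdong-gu", "Gangbuk-gu", "Gangseo-gu"],
        "elementary_school_count": [607, 33, 27, 14, 36],
        "nursing_home_count": [22739, 3088, 1023, 628, 1080],
    }
)

# Assumption: city == province marks an aggregate (total) row.
is_total = region_head["province"] == region_head["city"]
districts = region_head[~is_total]
print(districts["elementary_school_count"].describe())
```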


## Weather Section

### Description of the weather data

* Code: The code of the region
* Province: The name of the province
* Date: Date (format: YYYY-MM-DD)
* avg_temp: The average temperature
* min_temp: The lowest temperature
* max_temp: The highest temperature
* precipitation: The daily precipitation
* max_wind_speed: The maximum wind speed
* most_wind_direction: The most frequent wind direction
* avg_relative_humidity: The average relative humidity

In [30]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25023 entries, 0 to 25022
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   code                   25023 non-null  int64  
 1   province               25023 non-null  object 
 2   date                   25023 non-null  object 
 3   avg_temp               25008 non-null  float64
 4   min_temp               25018 non-null  float64
 5   max_temp               25020 non-null  float64
 6   precipitation          25023 non-null  float64
 7   max_wind_speed         25014 non-null  float64
 8   most_wind_direction    24994 non-null  float64
 9   avg_relative_humidity  25003 non-null  float64
dtypes: float64(7), int64(1), object(2)
memory usage: 1.9+ MB


In [54]:
weather.head()

Unnamed: 0,code,province,date,avg_temp,min_temp,max_temp,precipitation,max_wind_speed,most_wind_direction,avg_relative_humidity
0,10000,Seoul,2016-01-01,1.2,-3.3,4.0,0.0,3.5,90.0,73.0
1,11000,Busan,2016-01-01,5.3,1.1,10.9,0.0,7.4,340.0,52.1
2,12000,Daegu,2016-01-01,1.7,-4.0,8.0,0.0,3.7,270.0,70.5
3,13000,Gwangju,2016-01-01,3.2,-1.5,8.1,0.0,2.7,230.0,73.1
4,14000,Incheon,2016-01-01,3.1,-0.4,5.7,0.0,5.3,180.0,83.9


In [57]:
weather[["avg_temp", "min_temp", "max_temp", "precipitation", "max_wind_speed", "most_wind_direction", "avg_relative_humidity"]].describe()

Unnamed: 0,avg_temp,min_temp,max_temp,precipitation,max_wind_speed,most_wind_direction,avg_relative_humidity
count,25008.0,25018.0,25020.0,25023.0,25014.0,24994.0,25003.0
mean,13.621057,9.437153,18.526379,3.267086,5.102778,195.947027,65.564572
std,9.636505,10.021912,9.686541,12.655798,2.022522,106.909278,17.232745
min,-14.8,-19.2,-11.9,0.0,1.0,20.0,10.4
25%,5.6,1.0,10.5,0.0,3.8,90.0,53.5
50%,14.1,9.5,19.4,0.0,4.7,230.0,66.6
75%,21.9,18.1,26.6,0.2,6.0,290.0,78.6
max,33.9,30.3,40.0,310.0,29.4,360.0,100.0
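
The weather table starts on 2016-01-01, whereas the `time` table shown earlier begins on 2020-01-20, so the summary above covers years that precede the outbreak. Restricting the rows to the outbreak period first may give a more relevant picture. The sketch below shows the filter on a toy frame: the first date comes from `weather.head()`, while the second row is a hypothetical placeholder added only so the filter has a row to keep.

```python
import pandas as pd

# Toy frame: row 0 uses a date from weather.head(); row 1 is a hypothetical placeholder.
weather_demo = pd.DataFrame(
    {
        "province": ["Seoul", "Seoul"],
        "date": ["2016-01-01", "2020-01-20"],
    }
)
weather_demo["date"] = pd.to_datetime(weather_demo["date"])

# 2020-01-20 is the first date in the `time` table.
outbreak_start = pd.Timestamp("2020-01-20")
weather_outbreak = weather_demo[weather_demo["date"] >= outbreak_start]
print(weather_outbreak)
```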


## Search Section

In [32]:
searchtrend.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1564 entries, 0 to 1563
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         1564 non-null   object 
 1   cold         1564 non-null   float64
 2   flu          1564 non-null   float64
 3   pneumonia    1564 non-null   float64
 4   coronavirus  1564 non-null   float64
dtypes: float64(4), object(1)
memory usage: 61.2+ KB


In [33]:
searchtrend.describe()

Unnamed: 0,cold,flu,pneumonia,coronavirus
count,1564.0,1564.0,1564.0,1564.0
mean,0.193994,0.255173,0.227153,1.76622
std,0.470578,0.779597,0.473678,9.248131
min,0.05163,0.00981,0.06881,0.00154
25%,0.106698,0.046405,0.132192,0.00618
50%,0.13463,0.10608,0.16808,0.00863
75%,0.166855,0.261647,0.212833,0.01227
max,15.72071,27.32727,11.3932,100.0


In [34]:
searchfloating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432000 entries, 0 to 431999
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        432000 non-null  object
 1   hour        432000 non-null  int64 
 2   birth_year  432000 non-null  int64 
 3   sex         432000 non-null  object
 4   province    432000 non-null  object
 5   city        432000 non-null  object
 6   fp_num      432000 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 23.1+ MB


In [35]:
searchfloating.describe()

Unnamed: 0,hour,birth_year,fp_num
count,432000.0,432000.0,432000.0
mean,11.5,45.0,27860.034884
std,6.922195,17.078271,13122.838441
min,0.0,20.0,4480.0
25%,5.75,30.0,18940.0
50%,11.5,45.0,25690.0
75%,17.25,60.0,34000.0
max,23.0,70.0,127640.0


In [36]:
searchfloating['sex'].value_counts()

female    216000
male      216000
Name: sex, dtype: int64

In [37]:
searchfloating['city'].value_counts()

Geumcheon-gu       17280
Gwanak-gu          17280
Dongdaemun-gu      17280
Gangseo-gu         17280
Yongsan-gu         17280
Eunpyeong-gu       17280
Gangnam-gu         17280
Guro-gu            17280
Yangcheon-gu       17280
Gwangjin-gu        17280
Jongno-gu          17280
Gangdong-gu        17280
Dongjag-gu         17280
Seongdong-gu       17280
Mapo-gu            17280
Seongbuk-gu        17280
Seocho-gu          17280
Gangbuk-gu         17280
Dobong-gu          17280
Yeongdeungpo-gu    17280
Seodaemun-gu       17280
Jung-gu            17280
Songpa-gu          17280
Nowon-gu           17280
Jungnang-gu        17280
Name: city, dtype: int64
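
The counts above are consistent with a balanced panel: 25 districts with 17280 rows each account for all 432000 rows. A short arithmetic check can also estimate how many days the table spans. The sketch below assumes six `birth_year` groups (20 through 70 in steps of 10, inferred from the `describe()` output) and 24 hourly records per day; neither is confirmed by the dataset description.

```python
rows_total = 432000      # from searchfloating.info()
n_cities = 25            # districts listed by value_counts()
rows_per_city = 17280    # identical count for every district

hours = 24               # hour ranges from 0 to 23
sexes = 2                # female and male, 216000 rows each
age_groups = 6           # assumption: 20, 30, 40, 50, 60, 70

# Every row is accounted for by the per-district counts.
assert n_cities * rows_per_city == rows_total

# Rows per district per day if each combination appears exactly once.
rows_per_day = hours * sexes * age_groups
days = rows_per_city // rows_per_day
print(days)  # 60
```

If the assumptions hold, each district has one record per hour, sex, and age group for 60 days.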