# Module 4 Challenge - COVID-19 Analysis in Korea

**Objectives:**

Practice the following concepts covered in the module:
* Learn to work with Databricks and its notebooks;
* Analyze data using PySpark.

**Assignment:**

We will use three data files related to COVID cases, which will also be made
available separately. They are stored inside Databricks, in the directory
dbfs:/databricks-datasets/COVID/coronavirusdataset/
- dbfs:/databricks-datasets/COVID/coronavirusdataset/Case.csv
- dbfs:/databricks-datasets/COVID/coronavirusdataset/PatientInfo.csv
- dbfs:/databricks-datasets/COVID/coronavirusdataset/PatientRoute.csv

With these files in hand, use PySpark to carry out the analyses needed to answer the
Challenge questions.

We recommend creating an account on Databricks Community Edition (free) and solving
the questions with PySpark in a Databricks notebook. The environment requires no
configuration, so you can start working on the questions immediately.

## 1. Datasets and initializing Spark

### Locating the datasets

First, let's locate the datasets we will use inside Databricks:

In [0]:
%fs ls /databricks-datasets/COVID/coronavirusdataset/

path,name,size,modificationTime
dbfs:/databricks-datasets/COVID/coronavirusdataset/.DS_Store,.DS_Store,6148,1594102716000
dbfs:/databricks-datasets/COVID/coronavirusdataset/Case.csv,Case.csv,11711,1595191979000
dbfs:/databricks-datasets/COVID/coronavirusdataset/PatientInfo.csv,PatientInfo.csv,488859,1595191979000
dbfs:/databricks-datasets/COVID/coronavirusdataset/PatientRoute.csv,PatientRoute.csv,718510,1594102718000
dbfs:/databricks-datasets/COVID/coronavirusdataset/Policy.csv,Policy.csv,5713,1595191981000
dbfs:/databricks-datasets/COVID/coronavirusdataset/Region.csv,Region.csv,19082,1595191981000
dbfs:/databricks-datasets/COVID/coronavirusdataset/SearchTrend.csv,SearchTrend.csv,71722,1595191981000
dbfs:/databricks-datasets/COVID/coronavirusdataset/SeoulFloating.csv,SeoulFloating.csv,49682281,1595191981000
dbfs:/databricks-datasets/COVID/coronavirusdataset/Time.csv,Time.csv,6604,1595191981000
dbfs:/databricks-datasets/COVID/coronavirusdataset/TimeAge.csv,TimeAge.csv,27114,1595191981000


### Creating a PySpark connection

In [0]:
pip install findspark 

Python interpreter will be restarted.
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
Python interpreter will be restarted.


In [0]:
import findspark
findspark.init()

from pyspark.context import SparkContext
sc=SparkContext.getOrCreate()

In [0]:
sc

In [0]:
# spark session
from pyspark.sql import SparkSession
spark_session =  SparkSession.builder.enableHiveSupport().getOrCreate()

In [0]:
# SparkSession created by Databricks
spark

## 2. Reading the data - Spark DataFrame

### Case - Dataset

In [0]:
# Case dataset
df1 = spark.read.csv('/databricks-datasets/COVID/coronavirusdataset/Case.csv', header=True, sep=',', inferSchema=True)
df1.show()

+--------+--------+---------------+-----+--------------------+---------+---------+----------+
| case_id|province|           city|group|      infection_case|confirmed| latitude| longitude|
+--------+--------+---------------+-----+--------------------+---------+---------+----------+
| 1000001|   Seoul|     Yongsan-gu| true|       Itaewon Clubs|      139|37.538621|126.992652|
| 1000002|   Seoul|      Gwanak-gu| true|             Richway|      119| 37.48208|126.901384|
| 1000003|   Seoul|        Guro-gu| true| Guro-gu Call Center|       95|37.508163|126.884387|
| 1000004|   Seoul|   Yangcheon-gu| true|Yangcheon Table T...|       43|37.546061|126.874209|
| 1000005|   Seoul|      Dobong-gu| true|     Day Care Center|       43|37.679422|127.044374|
| 1000006|   Seoul|        Guro-gu| true|Manmin Central Ch...|       41|37.481059|126.894343|
| 1000007|   Seoul|from other city| true|SMR Newly Planted...|       36|        -|         -|
| 1000008|   Seoul|  Dongdaemun-gu| true|       Dongan Churc

In [0]:
# or
display(df1)

case_id,province,city,group,infection_case,confirmed,latitude,longitude
1000001,Seoul,Yongsan-gu,True,Itaewon Clubs,139,37.538621,126.992652
1000002,Seoul,Gwanak-gu,True,Richway,119,37.48208,126.901384
1000003,Seoul,Guro-gu,True,Guro-gu Call Center,95,37.508163,126.884387
1000004,Seoul,Yangcheon-gu,True,Yangcheon Table Tennis Club,43,37.546061,126.874209
1000005,Seoul,Dobong-gu,True,Day Care Center,43,37.679422,127.044374
1000006,Seoul,Guro-gu,True,Manmin Central Church,41,37.481059,126.894343
1000007,Seoul,from other city,True,SMR Newly Planted Churches Group,36,-,-
1000008,Seoul,Dongdaemun-gu,True,Dongan Church,17,37.592888,127.056766
1000009,Seoul,from other city,True,Coupang Logistics Center,25,-,-
1000010,Seoul,Gwanak-gu,True,Wangsung Church,30,37.481735,126.930121


In [0]:
# print the schema tree
df1.printSchema()

root
 |--  case_id: integer (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- group: boolean (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- confirmed: integer (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
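
The schema above types `latitude` and `longitude` as strings, presumably because rows without coordinates carry a `-` placeholder (visible in the table output). A minimal pandas sketch of one way to make them numeric, using a small hand-made frame (`sample` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical sample mimicking three rows of Case.csv (not the full dataset).
sample = pd.DataFrame({
    "city": ["Yongsan-gu", "Gwanak-gu", "from other city"],
    "latitude": ["37.538621", "37.48208", "-"],
    "longitude": ["126.992652", "126.901384", "-"],
})

# errors="coerce" turns anything unparseable (here the "-" placeholder) into NaN.
for col in ("latitude", "longitude"):
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
print(sample["latitude"].isna().sum())  # rows without coordinates
```

After the conversion, numeric operations such as `min()` and `max()` skip the missing coordinates instead of comparing strings.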



In [0]:
# total number of rows in the dataset
df1.count()

Out[8]: 174

In [0]:
# show the object type
type(df1)

Out[9]: pyspark.sql.dataframe.DataFrame

[Using pandas]

In [0]:
# convert the PySpark DataFrame to pandas
df1p = df1.toPandas()
df1p

Unnamed: 0,case_id,province,city,group,infection_case,confirmed,latitude,longitude
0,1000001,Seoul,Yongsan-gu,True,Itaewon Clubs,139,37.538621,126.992652
1,1000002,Seoul,Gwanak-gu,True,Richway,119,37.48208,126.901384
2,1000003,Seoul,Guro-gu,True,Guro-gu Call Center,95,37.508163,126.884387
3,1000004,Seoul,Yangcheon-gu,True,Yangcheon Table Tennis Club,43,37.546061,126.874209
4,1000005,Seoul,Dobong-gu,True,Day Care Center,43,37.679422,127.044374
...,...,...,...,...,...,...,...,...
169,6100012,Gyeongsangnam-do,-,False,etc,20,-,-
170,7000001,Jeju-do,-,False,overseas inflow,14,-,-
171,7000002,Jeju-do,-,False,contact with patient,0,-,-
172,7000003,Jeju-do,-,False,etc,4,-,-


In [0]:
type(df1p)

Out[11]: pandas.core.frame.DataFrame

In [0]:
# view summary information
df1p.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0    case_id        174 non-null    int32 
 1   province        174 non-null    object
 2   city            174 non-null    object
 3   group           174 non-null    bool  
 4   infection_case  174 non-null    object
 5   confirmed       174 non-null    int32 
 6   latitude        174 non-null    object
 7   longitude       174 non-null    object
dtypes: bool(1), int32(2), object(5)
memory usage: 8.5+ KB


In [0]:
df1p.value_counts().sum()

Out[13]: 174

In [0]:
df1p.describe().round(4)

Unnamed: 0,case_id,confirmed
count,174.0,174.0
mean,2686216.0,65.4885
std,1943218.0,355.0977
min,1000001.0,0.0
25%,1100006.0,4.0
50%,1700004.0,10.0
75%,4100004.0,31.75
max,7000004.0,4511.0


In [0]:
# unique values 1 (this counts distinct frequencies; df1p['city'].nunique() counts distinct values)
df1p['city'].value_counts().nunique()

Out[15]: 7

In [0]:
# unique values 2 (this counts distinct frequencies; df1p['group'].nunique() counts distinct values)
df1p['group'].value_counts().nunique()

Out[16]: 2

In [0]:
# unique values 3 (this counts distinct frequencies; df1p['province'].nunique() counts distinct values)
df1p['province'].value_counts().nunique()

Out[17]: 10

In [0]:
# unique values 4 (this counts distinct frequencies; df1p['infection_case'].nunique() counts distinct values)
df1p['infection_case'].value_counts().nunique()

Out[18]: 8
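
The four cells above chain `value_counts()` with `nunique()`, which counts how many distinct frequencies occur rather than how many distinct values the column holds. A small sketch of the difference on a made-up column (`cities` is hypothetical data):

```python
import pandas as pd

# Hypothetical toy column: three distinct cities, two of which occur equally often.
cities = pd.Series(["Seoul", "Seoul", "Busan", "Busan", "Daegu"])

distinct_values = cities.nunique()                 # distinct cities
distinct_counts = cities.value_counts().nunique()  # distinct frequencies (2 and 1)

print(distinct_values)  # 3
print(distinct_counts)  # 2
```

The two numbers agree only when every value occurs a different number of times, so `Series.nunique()` is the safer call when the question is "how many distinct values".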

In [0]:
# mean number of confirmed cases
df1p['confirmed'].mean().round(2)

Out[41]: 65.49

In [0]:
# In the Case.csv dataset, the city (city) with the most cases (confirmed) was: Nam-gu
# note: value_counts() below counts rows per city, not confirmed cases
df1p['city'].value_counts()

Out[20]: -                  53
from other city    51
Seo-gu              5
Gangnam-gu          4
Gyeongsan-si        3
Seongnam-si         3
Jung-gu             3
Guro-gu             3
Dong-gu             2
Anyang-si           2
Jongno-gu           2
Geochang-gun        2
Gwanak-gu           2
Yangcheon-gu        2
Dalseong-gun        2
Suwon-si            2
Sejong              2
Geumcheon-gu        1
Uijeongbu-si        1
Bucheon-si          1
Gangseo-gu          1
Seosan-si           1
Jinju-si            1
Changnyeong-gun     1
Seocho-gu           1
Seongdong-gu        1
Bonghwa-gun         1
Nam-gu              1
Yeongdeungpo-gu     1
Wonju-si            1
Muan-gun            1
Dongnae-gu          1
Changwon-si         1
Yongsan-gu          1
Suyeong-gu          1
Haeundae-gu         1
Dobong-gu           1
Goesan-gun          1
Yangsan-si          1
Chilgok-gun         1
Dongdaemun-gu       1
Yechun-gun          1
Seodaemun-gu        1
Gumi-si             1
Jin-gu              1
C
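
The question above asks for the city with the most confirmed cases, while `value_counts()` counts rows per city. One way to get totals is to sum `confirmed` per city. This is a sketch on made-up rows; excluding the `-` and `from other city` placeholders is an assumption about how the question should be read:

```python
import pandas as pd

# Hypothetical rows in the shape of Case.csv; the numbers are made up.
cases = pd.DataFrame({
    "city": ["A-gu", "A-gu", "B-si", "-", "from other city"],
    "confirmed": [10, 5, 12, 40, 30],
})

# Assumption: placeholder entries do not name a real city, so drop them.
named = cases[~cases["city"].isin(["-", "from other city"])]

# Sum confirmed cases per city and pick the largest total.
totals = named.groupby("city")["confirmed"].sum()
top_city = totals.idxmax()

print(top_city, totals.loc[top_city])
```

On the real data the same pattern would be applied to `df1p` in place of `cases`.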

In [0]:
# In the Case.csv dataset, the mean number of cases (confirmed) in the province (province) of Seoul was:
df1p.groupby('province')['confirmed'].mean().round(2)

Out[44]: province
Busan                 15.60
Chungcheongbuk-do      8.57
Chungcheongnam-do     19.75
Daegu                668.00
Daejeon               13.10
Gangwon-do             7.75
Gwangju                8.60
Gyeonggi-do           45.45
Gyeongsangbuk-do     101.85
Gyeongsangnam-do      11.00
Incheon               28.86
Jeju-do                4.75
Jeollabuk-do           4.60
Jeollanam-do           5.00
Sejong                 8.17
Seoul                 33.68
Ulsan                 12.75
Name: confirmed, dtype: float64

### Patient info dataset

In [0]:
# PatientInfo dataset
df2 = spark.read.csv('/databricks-datasets/COVID/coronavirusdataset/PatientInfo.csv', header=True, sep=',', inferSchema=True)
display(df2)

patient_id,sex,age,country,province,city,infection_case,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state
1000000001,male,50s,Korea,Seoul,Gangseo-gu,overseas inflow,,75.0,2020-01-22,2020-01-23T00:00:00.000+0000,2020-02-05T00:00:00.000+0000,,released
1000000002,male,30s,Korea,Seoul,Jungnang-gu,overseas inflow,,31.0,,2020-01-30T00:00:00.000+0000,2020-03-02T00:00:00.000+0000,,released
1000000003,male,50s,Korea,Seoul,Jongno-gu,contact with patient,2002000001.0,17.0,,2020-01-30T00:00:00.000+0000,2020-02-19T00:00:00.000+0000,,released
1000000004,male,20s,Korea,Seoul,Mapo-gu,overseas inflow,,9.0,2020-01-26,2020-01-30T00:00:00.000+0000,2020-02-15T00:00:00.000+0000,,released
1000000005,female,20s,Korea,Seoul,Seongbuk-gu,contact with patient,1000000002.0,2.0,,2020-01-31T00:00:00.000+0000,2020-02-24T00:00:00.000+0000,,released
1000000006,female,50s,Korea,Seoul,Jongno-gu,contact with patient,1000000003.0,43.0,,2020-01-31T00:00:00.000+0000,2020-02-19T00:00:00.000+0000,,released
1000000007,male,20s,Korea,Seoul,Jongno-gu,contact with patient,1000000003.0,0.0,,2020-01-31T00:00:00.000+0000,2020-02-10T00:00:00.000+0000,,released
1000000008,male,20s,Korea,Seoul,etc,overseas inflow,,0.0,,2020-02-02T00:00:00.000+0000,2020-02-24T00:00:00.000+0000,,released
1000000009,male,30s,Korea,Seoul,Songpa-gu,overseas inflow,,68.0,,2020-02-05T00:00:00.000+0000,2020-02-21T00:00:00.000+0000,,released
1000000010,female,60s,Korea,Seoul,Seongbuk-gu,contact with patient,1000000003.0,6.0,,2020-02-05T00:00:00.000+0000,2020-02-29T00:00:00.000+0000,,released


In [0]:
# total number of rows in the dataset
df2.count()

Out[23]: 5165

In [0]:
# In the PatientInfo.csv dataset, the people of female sex (sex), from the city (city) of Jongno-gu, in the 10s age group (age) are:

filtro_f = df2.filter((df2.sex == 'female') & (df2.city == 'Jongno-gu') & (df2.age == '10s'))
filtro_f.show()

+----------+------+---+-------+--------+---------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------+--------+
|patient_id|   sex|age|country|province|     city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|released_date|deceased_date|   state|
+----------+------+---+-------+--------+---------+--------------------+-----------+--------------+------------------+-------------------+-------------+-------------+--------+
|1000000333|female|10s|  Korea|   Seoul|Jongno-gu|     overseas inflow|       null|          null|              null|2020-03-23 00:00:00|         null|         null|released|
|1000000342|female|10s|  Korea|   Seoul|Jongno-gu|contact with patient| 1000000333|          null|              null|2020-03-24 00:00:00|         null|         null|released|
+----------+------+---+-------+--------+---------+--------------------+-----------+--------------+------------------+--------

In [0]:
# count the results
filtro_f.count()

Out[25]: 2
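
The filter above combines three conditions with `&`; each comparison needs its own parentheses because `&` binds more tightly than `==` in Python. The same pattern works on a pandas DataFrame, sketched here on made-up rows (`patients` is hypothetical data):

```python
import pandas as pd

# Hypothetical rows in the shape of PatientInfo.csv.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "sex": ["female", "female", "male", None],
    "city": ["Jongno-gu", "Jongno-gu", "Jongno-gu", "Mapo-gu"],
    "age": ["10s", "20s", "10s", "10s"],
})

# Each comparison is parenthesized before being combined with &.
mask = (
    (patients["sex"] == "female")
    & (patients["city"] == "Jongno-gu")
    & (patients["age"] == "10s")
)
matches = patients[mask]

print(len(matches))
```

Rows with a missing `sex` compare as not equal and therefore drop out of the result.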

In [0]:
# convert the dataset to pandas (df2 now refers to the pandas DataFrame, not the Spark one)
df2 = df2.toPandas()
df2

Unnamed: 0,patient_id,sex,age,country,province,city,infection_case,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state
0,1000000001,male,50s,Korea,Seoul,Gangseo-gu,overseas inflow,,75,2020-01-22,2020-01-23,2020-02-05,NaT,released
1,1000000002,male,30s,Korea,Seoul,Jungnang-gu,overseas inflow,,31,,2020-01-30,2020-03-02,NaT,released
2,1000000003,male,50s,Korea,Seoul,Jongno-gu,contact with patient,2002000001,17,,2020-01-30,2020-02-19,NaT,released
3,1000000004,male,20s,Korea,Seoul,Mapo-gu,overseas inflow,,9,2020-01-26,2020-01-30,2020-02-15,NaT,released
4,1000000005,female,20s,Korea,Seoul,Seongbuk-gu,contact with patient,1000000002,2,,2020-01-31,2020-02-24,NaT,released
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5160,7000000015,female,30s,Korea,Jeju-do,Jeju-do,overseas inflow,,25,,2020-05-30,2020-06-13,NaT,released
5161,7000000016,,,Korea,Jeju-do,Jeju-do,overseas inflow,,,,2020-06-16,2020-06-24,NaT,released
5162,7000000017,,,Bangladesh,Jeju-do,Jeju-do,overseas inflow,,72,,2020-06-18,NaT,NaT,isolated
5163,7000000018,,,Bangladesh,Jeju-do,Jeju-do,overseas inflow,,,,2020-06-18,NaT,NaT,isolated


In [0]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5165 entries, 0 to 5164
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   patient_id          5165 non-null   int64         
 1   sex                 4043 non-null   object        
 2   age                 3785 non-null   object        
 3   country             5165 non-null   object        
 4   province            5165 non-null   object        
 5   city                5071 non-null   object        
 6   infection_case      4246 non-null   object        
 7   infected_by         1346 non-null   object        
 8   contact_number      791 non-null    object        
 9   symptom_onset_date  690 non-null    object        
 10  confirmed_date      5162 non-null   datetime64[ns]
 11  released_date       1587 non-null   datetime64[ns]
 12  deceased_date       66 non-null     datetime64[ns]
 13  state               5165 non-null   object      
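
The non-null counts above show that several columns are mostly empty. A short sketch of how the missing values per column could be tallied directly, on a made-up frame (`sample` is hypothetical data):

```python
import pandas as pd

# Hypothetical rows with gaps, in the shape of a few PatientInfo.csv columns.
sample = pd.DataFrame({
    "sex": ["male", None, "female"],
    "age": ["50s", None, None],
    "state": ["released", "isolated", "released"],
})

missing = sample.isna().sum()             # missing values per column
share = (missing / len(sample)).round(2)  # fraction of rows missing

print(missing)
print(share)
```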

In [0]:
# unique values 1 (this counts distinct frequencies; df2['sex'].nunique() counts distinct values)
df2['sex'].value_counts().nunique()

Out[28]: 2

In [0]:
# unique values 2 - correct answer (this counts distinct frequencies; df2['state'].nunique() counts distinct values)
df2['state'].value_counts().nunique()

Out[29]: 3

In [0]:
# unique values 3 (this counts distinct frequencies; df2['infection_case'].nunique() counts distinct values)
df2['infection_case'].value_counts().nunique()

Out[30]: 31

### Patient route dataset

In [0]:
# PatientRoute dataset
df3 = spark.read.csv('/databricks-datasets/COVID/coronavirusdataset/PatientRoute.csv', header=True, sep=',', inferSchema=True)
display(df3)

patient_id,date,province,city,type,latitude,longitude
1000000001,2020-01-22T00:00:00.000+0000,Gyeonggi-do,Gimpo-si,airport,37.61525,126.7156
1000000001,2020-01-24T00:00:00.000+0000,Seoul,Jung-gu,hospital,37.56724,127.0057
1000000002,2020-01-25T00:00:00.000+0000,Seoul,Seongbuk-gu,etc,37.59256,127.017
1000000002,2020-01-26T00:00:00.000+0000,Seoul,Seongbuk-gu,store,37.59181,127.0168
1000000002,2020-01-26T00:00:00.000+0000,Seoul,Seongdong-gu,public_transportation,37.56399,127.0295
1000000002,2020-01-26T00:00:00.000+0000,Seoul,Seongbuk-gu,public_transportation,37.59033,127.0152
1000000002,2020-01-26T00:00:00.000+0000,Seoul,Seongbuk-gu,store,37.58959,127.0098
1000000002,2020-01-27T00:00:00.000+0000,Seoul,Seongbuk-gu,restaurant,37.59206,127.0189
1000000002,2020-01-27T00:00:00.000+0000,Seoul,Dongdaemun-gu,store,37.56626,127.0658
1000000002,2020-01-28T00:00:00.000+0000,Seoul,Seongbuk-gu,etc,37.59167,127.0184


In [0]:
# total number of rows in the dataset
df3.count()

Out[47]: 10410

In [0]:
# convert the dataset to pandas
df3p = df3.toPandas()
df3p

Unnamed: 0,patient_id,date,province,city,type,latitude,longitude
0,1000000001,2020-01-22,Gyeonggi-do,Gimpo-si,airport,37.61525,126.7156
1,1000000001,2020-01-24,Seoul,Jung-gu,hospital,37.56724,127.0057
2,1000000002,2020-01-25,Seoul,Seongbuk-gu,etc,37.59256,127.0170
3,1000000002,2020-01-26,Seoul,Seongbuk-gu,store,37.59181,127.0168
4,1000000002,2020-01-26,Seoul,Seongdong-gu,public_transportation,37.56399,127.0295
...,...,...,...,...,...,...,...
10405,6100000098,2020-03-28,Seoul,Dobong-gu,hospital,37.64586,127.0286
10406,6100000120,2020-05-16,Gyeonggi-do,Gunpo-si,etc,37.36167,126.9352
10407,6100000120,2020-05-16,Gyeonggi-do,Suwon-si,public_transportation,37.28214,126.9699
10408,6100000120,2020-05-16,Daegu,Jung-gu,etc,35.87144,128.6014


In [0]:
df3p.describe()

Unnamed: 0,patient_id,latitude,longitude
count,10410.0,10410.0,10410.0
mean,2087839000.0,36.955887,127.434226
std,1784860000.0,0.840833,0.798448
min,1000000000.0,33.45464,126.301
25%,1000001000.0,36.365183,126.9294
50%,1000001000.0,37.47843,127.049
75%,3009000000.0,37.53695,127.9202
max,6100000000.0,38.19317,129.4757


In [0]:
# In the PatientRoute.csv dataset, the highest latitude found is:
df3p['latitude'].max()

Out[51]: 38.19317

In [0]:
# In the PatientRoute.csv dataset, the lowest longitude found is:
df3p['longitude'].min()

Out[53]: 126.301

In [0]:
# Patients detected (type) at the airport (airport) whose infection case (infection_case) was contact with a patient (contact with patient) return a count of:

# To answer the question:
# first, join PatientInfo (df2) and PatientRoute (df3); both are read again as Spark DataFrames because df2 was converted to pandas above
pat_info= spark.read.csv("dbfs:/databricks-datasets/COVID/coronavirusdataset/PatientInfo.csv", header="true", inferSchema="true")
pat_info.show(5)

+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|patient_id|   sex|age|country|province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|      released_date|deceased_date|   state|
+----------+------+---+-------+--------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul| Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|2020-01-23 00:00:00|2020-02-05 00:00:00|         null|released|
|1000000002|  male|30s|  Korea|   Seoul|Jungnang-gu|     overseas inflow|       null|            31|              null|2020-01-30 00:00:00|2020-03-02 00:00:00|         null|released|
|1000000003|  male|50s|  Korea|   Seoul|  Jongno-gu|contact with patient| 2002000001|

In [0]:
pat_route = spark.read.csv("dbfs:/databricks-datasets/COVID/coronavirusdataset/PatientRoute.csv", header="true", inferSchema="true")
pat_route.show(5)

+----------+-------------------+-----------+------------+--------------------+--------+---------+
|patient_id|               date|   province|        city|                type|latitude|longitude|
+----------+-------------------+-----------+------------+--------------------+--------+---------+
|1000000001|2020-01-22 00:00:00|Gyeonggi-do|    Gimpo-si|             airport|37.61525| 126.7156|
|1000000001|2020-01-24 00:00:00|      Seoul|     Jung-gu|            hospital|37.56724| 127.0057|
|1000000002|2020-01-25 00:00:00|      Seoul| Seongbuk-gu|                 etc|37.59256|  127.017|
|1000000002|2020-01-26 00:00:00|      Seoul| Seongbuk-gu|               store|37.59181| 127.0168|
|1000000002|2020-01-26 00:00:00|      Seoul|Seongdong-gu|public_transporta...|37.56399| 127.0295|
+----------+-------------------+-----------+------------+--------------------+--------+---------+
only showing top 5 rows



In [0]:
# with both loaded into variables, join them on their common column (patient_id)
patient = pat_info.join(pat_route, on='patient_id', how='leftouter')
patient.show(5)

+----------+----+---+-------+--------+-----------+---------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+-------------------+-----------+-----------+--------------------+--------+---------+
|patient_id| sex|age|country|province|       city| infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|      released_date|deceased_date|   state|               date|   province|       city|                type|latitude|longitude|
+----------+----+---+-------+--------+-----------+---------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+-------------------+-----------+-----------+--------------------+--------+---------+
|1000000001|male|50s|  Korea|   Seoul| Gangseo-gu|overseas inflow|       null|            75|        2020-01-22|2020-01-23 00:00:00|2020-02-05 00:00:00|         null|released|2020-01-24 00:00:00|      Seoul|    Jung-gu| 

In [0]:
# with the join done, build the filter on the patient DataFrame that matches the question
filtro_f = patient.filter((patient.type == 'airport') & (patient.infection_case == 'contact with patient'))
filtro_f.show()

+----------+------+---+-------+----------------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+-------------------+-----------+----------+-------+--------+---------+
|patient_id|   sex|age|country|        province|       city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|      released_date|deceased_date|   state|               date|   province|      city|   type|latitude|longitude|
+----------+------+---+-------+----------------+-----------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+-------------------+-----------+----------+-------+--------+---------+
|1000000034|  male|20s|  Korea|           Seoul|  Songpa-gu|contact with patient| 1000000031|          null|              null|2020-02-24 00:00:00|2020-03-17 00:00:00|         null|released|2020-02-20 00:00:00|    Inc

In [0]:
# count the number of matching rows
filtro_f.count()

Out[56]: 20
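
The count above is a count of joined rows, one per route entry, so a patient with several airport entries is counted more than once. If the question is read as asking for patients, counting distinct `patient_id` values may give a smaller number. A pandas sketch of the join, the filter, and both counts on made-up rows (`info` and `route` are hypothetical data):

```python
import pandas as pd

# Hypothetical rows in the shape of PatientInfo.csv and PatientRoute.csv.
info = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "infection_case": ["contact with patient", "overseas inflow", "contact with patient"],
})
route = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "type": ["airport", "airport", "airport", "store"],
})

# Left join on the shared key, mirroring how="leftouter" in the Spark cell.
joined = info.merge(route, on="patient_id", how="left")

hits = joined[
    (joined["type"] == "airport")
    & (joined["infection_case"] == "contact with patient")
]

row_count = len(hits)                         # route rows that match
patient_count = hits["patient_id"].nunique()  # distinct patients that match

print(row_count, patient_count)
```

In PySpark, the analogous distinct count would be `filtro_f.select('patient_id').distinct().count()`.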