# COVID-19 Analyses

## Digital Innovation One

### Prof. Dr. Neylson Crepalde

We will analyze the time series on the spread of COVID-19 around the world.

### Adilton Botelho da Rocha Filho

I will build this notebook while following the instructor, using the most recent dataset available on Kaggle.

First, we import the libraries needed for the project.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import plotly.express as px
import plotly.graph_objects as go

Importing the data for the project.
Here I am importing from a file I downloaded from Kaggle, but I could also use a URL that points to the dataset. In that case, I would create a variable holding the URL and, when creating the dataframe, pass that variable instead of the file name.

In [2]:
df = pd.read_csv("covid_19_data.csv", parse_dates=['ObservationDate', 'Last Update'])
df

| | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2020-01-22 | Anhui | Mainland China | 2020-01-22 17:00:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 2020-01-22 | Beijing | Mainland China | 2020-01-22 17:00:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 2020-01-22 | Chongqing | Mainland China | 2020-01-22 17:00:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 2020-01-22 | Fujian | Mainland China | 2020-01-22 17:00:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 2020-01-22 | Gansu | Mainland China | 2020-01-22 17:00:00 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 306424 | 306425 | 2021-05-29 | Zaporizhia Oblast | Ukraine | 2021-05-30 04:20:55 | 102641.0 | 2335.0 | 95289.0 |
| 306425 | 306426 | 2021-05-29 | Zeeland | Netherlands | 2021-05-30 04:20:55 | 29147.0 | 245.0 | 0.0 |
| 306426 | 306427 | 2021-05-29 | Zhejiang | Mainland China | 2021-05-30 04:20:55 | 1364.0 | 1.0 | 1324.0 |
| 306427 | 306428 | 2021-05-29 | Zhytomyr Oblast | Ukraine | 2021-05-30 04:20:55 | 87550.0 | 1738.0 | 83790.0 |
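As mentioned in the import step, the data could also come from a URL. `pd.read_csv` accepts a local path, a URL, or a file-like object, so the same call covers all three. The sketch below is only an illustration: `load_covid` is a hypothetical helper, `DATA_URL` is a placeholder that does not point to a real file, and the in-memory sample merely mimics the column layout shown above so the code runs without network access.

```python
import io

import pandas as pd

# Hypothetical placeholder: replace with the real location of the CSV.
DATA_URL = "https://example.com/covid_19_data.csv"


def load_covid(source):
    """Read the dataset from a local path, a URL, or a file-like object."""
    return pd.read_csv(source, parse_dates=["ObservationDate", "Last Update"])


# Illustrative in-memory sample with the same columns as the Kaggle file.
sample = io.StringIO(
    "SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered\n"
    "1,2020-01-22,Anhui,Mainland China,2020-01-22 17:00:00,1.0,0.0,0.0\n"
    "2,2020-01-22,Beijing,Mainland China,2020-01-22 17:00:00,14.0,0.0,0.0\n"
)
df_sample = load_covid(sample)
print(df_sample.dtypes)

# df_remote = load_covid(DATA_URL)  # would download the file from the URL
```

Keeping the source in a single variable means switching between the local file and a remote copy only changes one line.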


In [3]:
# Check the type of each column.
df.dtypes

SNo                         int64
ObservationDate    datetime64[ns]
Province/State             object
Country/Region             object
Last Update        datetime64[ns]
Confirmed                 float64
Deaths                    float64
Recovered                 float64
dtype: object

Column names are easier to work with when they contain no uppercase letters or special characters, so the function below cleans them up.

In [4]:
import re

def corrige_colunas(col_name):
    return re.sub(r"[/ ]", "", col_name).lower()

# The function above looks for slashes (/) and spaces in the column names and removes them; it also converts everything to lowercase.
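To make the effect concrete, the sketch below applies the cleaning function to a few of the original column names. The pattern here is written as `[/ ]`, which matches only a slash or a space; the names `original` and `limpos` are introduced just for this illustration.

```python
import re


def corrige_colunas(col_name):
    # Remove slashes and spaces, then convert to lowercase.
    return re.sub(r"[/ ]", "", col_name).lower()


original = ["SNo", "ObservationDate", "Province/State", "Country/Region", "Last Update"]
limpos = [corrige_colunas(col) for col in original]
print(limpos)
# ['sno', 'observationdate', 'provincestate', 'countryregion', 'lastupdate']
```

Names without spaces or slashes can also be used with attribute access, such as `df.countryregion`.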

In [5]:
# Fixing the dataframe's columns:

df.columns = [corrige_colunas(col) for col in df.columns]

In [6]:
df

| | sno | observationdate | provincestate | countryregion | lastupdate | confirmed | deaths | recovered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2020-01-22 | Anhui | Mainland China | 2020-01-22 17:00:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 2020-01-22 | Beijing | Mainland China | 2020-01-22 17:00:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 2020-01-22 | Chongqing | Mainland China | 2020-01-22 17:00:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 2020-01-22 | Fujian | Mainland China | 2020-01-22 17:00:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 2020-01-22 | Gansu | Mainland China | 2020-01-22 17:00:00 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 306424 | 306425 | 2021-05-29 | Zaporizhia Oblast | Ukraine | 2021-05-30 04:20:55 | 102641.0 | 2335.0 | 95289.0 |
| 306425 | 306426 | 2021-05-29 | Zeeland | Netherlands | 2021-05-30 04:20:55 | 29147.0 | 245.0 | 0.0 |
| 306426 | 306427 | 2021-05-29 | Zhejiang | Mainland China | 2021-05-30 04:20:55 | 1364.0 | 1.0 | 1324.0 |
| 306427 | 306428 | 2021-05-29 | Zhytomyr Oblast | Ukraine | 2021-05-30 04:20:55 | 87550.0 | 1738.0 | 83790.0 |


# Brazil

Only the data for Brazil will be investigated.

In [7]:
df.loc[df.countryregion == 'Brazil']

| | sno | observationdate | provincestate | countryregion | lastupdate | confirmed | deaths | recovered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 84 | 85 | 2020-01-23 | NaN | Brazil | 2020-01-23 17:00:00 | 0.0 | 0.0 | 0.0 |
| 2525 | 2526 | 2020-02-26 | NaN | Brazil | 2020-02-26 23:53:02 | 1.0 | 0.0 | 0.0 |
| 2631 | 2632 | 2020-02-27 | NaN | Brazil | 2020-02-26 23:53:02 | 1.0 | 0.0 | 0.0 |
| 2742 | 2743 | 2020-02-28 | NaN | Brazil | 2020-02-26 23:53:02 | 1.0 | 0.0 | 0.0 |
| 2852 | 2853 | 2020-02-29 | NaN | Brazil | 2020-02-29 21:03:05 | 2.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 306272 | 306273 | 2021-05-29 | Roraima | Brazil | 2021-05-30 04:20:55 | 103222.0 | 1635.0 | 96188.0 |
| 306290 | 306291 | 2021-05-29 | Santa Catarina | Brazil | 2021-05-30 04:20:55 | 965277.0 | 15174.0 | 921496.0 |
| 306292 | 306293 | 2021-05-29 | Sao Paulo | Brazil | 2021-05-30 04:20:55 | 3254893.0 | 111123.0 | 2895697.0 |
| 306298 | 306299 | 2021-05-29 | Sergipe | Brazil | 2021-05-30 04:20:55 | 233932.0 | 5054.0 | 208146.0 |


Not all rows are broken down by state: the earliest records have no provincestate value.

In [8]:
brasil = df.loc[
    (df.countryregion == 'Brazil') &
    (df.confirmed > 0)
]

In [9]:
brasil

| | sno | observationdate | provincestate | countryregion | lastupdate | confirmed | deaths | recovered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2525 | 2526 | 2020-02-26 | NaN | Brazil | 2020-02-26 23:53:02 | 1.0 | 0.0 | 0.0 |
| 2631 | 2632 | 2020-02-27 | NaN | Brazil | 2020-02-26 23:53:02 | 1.0 | 0.0 | 0.0 |
| 2742 | 2743 | 2020-02-28 | NaN | Brazil | 2020-02-26 23:53:02 | 1.0 | 0.0 | 0.0 |
| 2852 | 2853 | 2020-02-29 | NaN | Brazil | 2020-02-29 21:03:05 | 2.0 | 0.0 | 0.0 |
| 2981 | 2982 | 2020-03-01 | NaN | Brazil | 2020-02-29 21:03:05 | 2.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 306272 | 306273 | 2021-05-29 | Roraima | Brazil | 2021-05-30 04:20:55 | 103222.0 | 1635.0 | 96188.0 |
| 306290 | 306291 | 2021-05-29 | Santa Catarina | Brazil | 2021-05-30 04:20:55 | 965277.0 | 15174.0 | 921496.0 |
| 306292 | 306293 | 2021-05-29 | Sao Paulo | Brazil | 2021-05-30 04:20:55 | 3254893.0 | 111123.0 | 2895697.0 |
| 306298 | 306299 | 2021-05-29 | Sergipe | Brazil | 2021-05-30 04:20:55 | 233932.0 | 5054.0 | 208146.0 |


From a certain date onward, the file I am using reports Brazil's data separately for each state. Following the instructor's procedure to the letter would cause a problem at this point, because the file he used at the time did not have this breakdown. Below, I sum the state-level figures for each date and store the result in a new dataframe. It leaves out sno (the serial number), provincestate (the states), countryregion (identical for every row being handled) and lastupdate (which I believe is not useful for the analysis, since we already have observationdate).

In [10]:
brasil_corrigido = brasil.groupby(['observationdate'], as_index=False).agg({'confirmed': 'sum', 'deaths': 'sum', 'recovered': 'sum'})

In [11]:
brasil_corrigido

| | observationdate | confirmed | deaths | recovered |
| --- | --- | --- | --- | --- |
| 0 | 2020-02-26 | 1.0 | 0.0 | 0.0 |
| 1 | 2020-02-27 | 1.0 | 0.0 | 0.0 |
| 2 | 2020-02-28 | 1.0 | 0.0 | 0.0 |
| 3 | 2020-02-29 | 2.0 | 0.0 | 0.0 |
| 4 | 2020-03-01 | 2.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 454 | 2021-05-25 | 16194209.0 | 452031.0 | 14231991.0 |
| 455 | 2021-05-26 | 16274695.0 | 454429.0 | 14272174.0 |
| 456 | 2021-05-27 | 16342162.0 | 456674.0 | 14455810.0 |
| 457 | 2021-05-28 | 16391930.0 | 459045.0 | 14492701.0 |
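To see why the aggregation above is needed, the sketch below uses a small made-up frame (the dates and numbers are illustrative, not taken from the dataset). One date has a single national row and the other is split across two states. Grouping by `observationdate` and summing yields exactly one row per date in both cases.

```python
import pandas as pd

# Made-up rows: one national record, then one date split across two states.
toy = pd.DataFrame({
    "observationdate": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-02"]),
    "provincestate": [None, "Sao Paulo", "Sergipe"],
    "confirmed": [100.0, 70.0, 40.0],
    "deaths": [5.0, 4.0, 2.0],
    "recovered": [30.0, 25.0, 10.0],
})

# Same aggregation as in the notebook: one summed row per observation date.
por_dia = toy.groupby("observationdate", as_index=False).agg(
    {"confirmed": "sum", "deaths": "sum", "recovered": "sum"}
)
print(por_dia)
```

The national row passes through unchanged, while the two state rows collapse into a single total for their date.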


In [12]:
# Plot of the evolution of confirmed cases
px.line(brasil_corrigido, x='observationdate', y='confirmed', title='Confirmed cases in Brazil')

# New cases per day

In [13]:
# Functional programming technique

brasil_corrigido['novoscasos'] = list(map(
    lambda x: 0 if (x==0) else brasil_corrigido['confirmed'].iloc[x] - brasil_corrigido['confirmed'].iloc[x-1],
    np.arange(brasil_corrigido.shape[0])
))

In [14]:
brasil_corrigido

| | observationdate | confirmed | deaths | recovered | novoscasos |
| --- | --- | --- | --- | --- | --- |
| 0 | 2020-02-26 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2020-02-27 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2020-02-28 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2020-02-29 | 2.0 | 0.0 | 0.0 | 1.0 |
| 4 | 2020-03-01 | 2.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 454 | 2021-05-25 | 16194209.0 | 452031.0 | 14231991.0 | 73453.0 |
| 455 | 2021-05-26 | 16274695.0 | 454429.0 | 14272174.0 | 80486.0 |
| 456 | 2021-05-27 | 16342162.0 | 456674.0 | 14455810.0 | 67467.0 |
| 457 | 2021-05-28 | 16391930.0 | 459045.0 | 14492701.0 | 49768.0 |
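The daily increments can also be computed with the built-in pandas `Series.diff()` method. The sketch below compares it with the map-based version on a small made-up series (the values are illustrative only); it is an alternative to the cell above, not the instructor's method.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts, for illustration only.
toy = pd.DataFrame({"confirmed": [1.0, 1.0, 2.0, 5.0, 9.0]})

# Map-based version, following the notebook cell.
via_map = list(map(
    lambda x: 0 if x == 0 else toy["confirmed"].iloc[x] - toy["confirmed"].iloc[x - 1],
    np.arange(toy.shape[0]),
))

# Vectorized version: diff() leaves NaN in the first row, so fill it with 0.
via_diff = toy["confirmed"].diff().fillna(0)

print(via_diff.tolist())  # [0.0, 0.0, 1.0, 3.0, 4.0]
```

Both approaches give the same values; `diff()` avoids the explicit index arithmetic.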


In [15]:
# Visualizing

px.line(brasil_corrigido, x='observationdate', y='novoscasos', title='New cases per day')

# Deaths

In [16]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(x=brasil_corrigido.observationdate, y=brasil_corrigido.deaths, name='Deaths', mode='lines+markers', line={'color':'red'})
)

# Layout

fig.update_layout(title='COVID-19 deaths in Brazil')

fig.show()

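# Growth rate

The average daily growth rate between two dates that are n days apart is:

taxa_crescimento = (presente/passado)**(1/n) - 1

where presente is the value on the final date and passado is the value on the initial date.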
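As a quick worked example of the formula, the sketch below uses made-up numbers: a count that doubles from 100 to 200 over 10 days. It also checks that compounding the resulting daily rate over the same period recovers the final value.

```python
# Illustrative numbers: a count that doubles from 100 to 200 in 10 days.
passado = 100
presente = 200
n = 10

# Average daily growth rate, as a fraction.
taxa = (presente / passado) ** (1 / n) - 1
print(round(taxa * 100, 2))  # 7.18 (percent per day)

# Compounding the daily rate n times recovers the final value.
reconstruido = passado * (1 + taxa) ** n
```

A rate of about 7.18% per day is therefore the constant daily growth that would take the count from 100 to 200 in 10 days.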

In [17]:
def taxa_crescimento(dataset, variable, data_inicio=None, data_fim=None):

    # If data_inicio is None, use the first date on which the variable is above zero

    if data_inicio is None:
        data_inicio = dataset.observationdate.loc[dataset[variable] > 0].min()
    else:
        data_inicio = pd.to_datetime(data_inicio)

    # If data_fim is None, use the last date in the dataset

    if data_fim is None:
        data_fim = dataset.observationdate.iloc[-1]
    else:
        data_fim = pd.to_datetime(data_fim)

    # Get the present and past values

    passado = dataset.loc[dataset.observationdate == data_inicio, variable].values[0]
    presente = dataset.loc[dataset.observationdate == data_fim, variable].values[0]

    # Number of days between the two dates

    n = (data_fim - data_inicio).days

    # Compute the rate

    taxa = (presente/passado)**(1/n) - 1

    # Return the rate as a percentage

    return taxa*100

In [18]:
# Average daily growth rate (%) of confirmed COVID-19 cases in Brazil over the whole period

taxa_crescimento(brasil_corrigido, 'confirmed')

3.694820710228286
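The function also accepts `data_inicio` and `data_fim`, which the call above does not use. The sketch below shows how a specific window could be passed, using a condensed copy of the function and a made-up series that doubles every day (illustrative values only). Both dates must exist in the data, otherwise `.values[0]` raises an `IndexError`.

```python
import pandas as pd


def taxa_crescimento(dataset, variable, data_inicio=None, data_fim=None):
    # Condensed copy of the notebook function, repeated so the sketch runs alone.
    if data_inicio is None:
        data_inicio = dataset.observationdate.loc[dataset[variable] > 0].min()
    else:
        data_inicio = pd.to_datetime(data_inicio)
    if data_fim is None:
        data_fim = dataset.observationdate.iloc[-1]
    else:
        data_fim = pd.to_datetime(data_fim)
    passado = dataset.loc[dataset.observationdate == data_inicio, variable].values[0]
    presente = dataset.loc[dataset.observationdate == data_fim, variable].values[0]
    n = (data_fim - data_inicio).days
    return ((presente / passado) ** (1 / n) - 1) * 100


# Made-up series that doubles every day, for illustration only.
toy = pd.DataFrame({
    "observationdate": pd.date_range("2020-03-01", periods=5, freq="D"),
    "confirmed": [1.0, 2.0, 4.0, 8.0, 16.0],
})

# Whole period: from 1 to 16 in 4 days.
taxa_total = taxa_crescimento(toy, "confirmed")

# Restricted window: from 2 to 8 in 2 days.
taxa_janela = taxa_crescimento(
    toy, "confirmed", data_inicio="2020-03-02", data_fim="2020-03-04"
)

print(round(taxa_total, 6), round(taxa_janela, 6))
```

Because the toy series doubles daily, both calls return a rate of 100% per day (up to floating-point rounding), regardless of the window chosen.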