This script is based on the video **Visualizando Dados do Coronavírus (COVID19) com Python**, available at: <https://www.youtube.com/watch?v=Xk0zHZBa7LM>

# Loading the dataset

In [1]:
import pandas as pd

- Reading the dataset:

In [4]:
df = pd.read_csv('covid_19_clean_complete.csv', sep = ',', parse_dates=['Date'])
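As a minimal sketch of what `parse_dates` does in the call above, the example below reads a tiny in-memory CSV. The sample rows and the name `sample_csv` are made up for illustration and are not taken from the real file, whose date format may differ.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics some columns of the real CSV.
sample_csv = io.StringIO(
    "Country/Region,Date,Confirmed\n"
    "Brazil,2020-03-01,2\n"
    "Brazil,2020-03-02,2\n"
)

sample = pd.read_csv(sample_csv, sep=",", parse_dates=["Date"])

# With parse_dates the column holds datetimes, so date arithmetic works directly.
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Date"])
span_days = (sample["Date"].max() - sample["Date"].min()).days
print(is_datetime, span_days)
```

Without `parse_dates`, the `Date` column would be read as text and would need a separate `pd.to_datetime` call before any date arithmetic.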

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19836 entries, 0 to 19835
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Province/State  6080 non-null   object        
 1   Country/Region  19836 non-null  object        
 2   Lat             19836 non-null  float64       
 3   Long            19836 non-null  float64       
 4   Date            19836 non-null  datetime64[ns]
 5   Confirmed       19836 non-null  int64         
 6   Deaths          19836 non-null  int64         
 7   Recovered       19836 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(3), object(2)
memory usage: 1.2+ MB


- When pandas cannot infer a more specific type for a column (typically text or mixed values), it stores it as `object`.
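To illustrate the `object` fallback, here is a small sketch with a toy dataframe (the name `toy` and its values are invented for this example). A column whose values share no single numeric type ends up as `object`, while a column of plain integers gets an integer dtype.

```python
import pandas as pd

toy = pd.DataFrame({
    "mixed": [1, "two", 3.0],  # no single numeric type fits these values
    "counts": [1, 2, 3],       # all integers
})

mixed_dtype = toy["mixed"].dtype
counts_dtype = toy["counts"].dtype
print(toy.dtypes)
```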

- The following code suppresses all warnings:

In [6]:
import warnings
warnings.filterwarnings('ignore')
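A blanket `'ignore'` filter also hides deprecation notices that may be worth reading. As a possible alternative, the sketch below silences only one warning category; the function `noisy` is a hypothetical stand-in for any library call that emits a `FutureWarning`.

```python
import warnings


def noisy():
    """Hypothetical function that emits a FutureWarning."""
    warnings.warn("this behaviour will change", FutureWarning)
    return 42


# Record warnings with no category-specific filter: the warning is captured.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()
n_before = len(caught)

# Same call, but FutureWarning is ignored inside this block only.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=FutureWarning)
    noisy()
n_after = len(caught)

print(n_before, n_after)
```

Because `catch_warnings` restores the previous filters on exit, the suppression does not leak into the rest of the notebook.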

In [7]:
df.head()

|   | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 |  | Afghanistan | 33.0 | 65.0 | 2020-01-22 | 0 | 0 | 0 |
| 1 |  | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 | 0 | 0 |
| 2 |  | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 | 0 | 0 |
| 3 |  | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 | 0 | 0 |
| 4 |  | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 | 0 | 0 |


# Creating new columns

In [8]:
# Active cases: confirmed minus deaths and recoveries
df['Active'] = df['Confirmed'] - df['Deaths'] - df['Recovered']

In [9]:
# Replacing Mainland China with China
df['Country/Region'] = df['Country/Region'].replace('Mainland China', 'China')

In [10]:
# Filling missing province/state values with an empty string
df[['Province/State']] = df[['Province/State']].fillna('')

In [11]:
# Filling NA with zero
df[['Confirmed', 'Deaths', 'Recovered', 'Active']] = df[['Confirmed', 'Deaths', 'Recovered', 'Active']].fillna(0)

- Changing a column's type to integer:
    + If a **numeric** column is not recognized as integer, we can convert it:

 `df['Recovered'] = df['Recovered'].astype(int)`
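The fill-then-convert pattern can be sketched on a toy column (the dataframe `toy` and its values are invented for this example). A missing value forces the column to a float dtype, and `astype(int)` only succeeds once the gaps have been filled.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
toy = pd.DataFrame({"Recovered": [3, np.nan, 7]})
dtype_before = toy["Recovered"].dtype  # float, because NaN is a float

# Fill the gap first; converting a column that still contains NaN would raise an error.
toy["Recovered"] = toy["Recovered"].fillna(0).astype(int)
dtype_after = toy["Recovered"].dtype

print(dtype_before, dtype_after)
```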

In [12]:
df.head()

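|   | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|---|---|---|
| 0 |  | Afghanistan | 33.0 | 65.0 | 2020-01-22 | 0 | 0 | 0 | 0 |
| 1 |  | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 | 0 | 0 | 0 |
| 2 |  | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 | 0 | 0 | 0 |
| 3 |  | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 | 0 | 0 | 0 |
| 4 |  | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 | 0 | 0 | 0 |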


## **Examining the time data**

In [13]:
df.Date.describe()

count                   19836
unique                     76
top       2020-01-25 00:00:00
freq                      261
first     2020-01-22 00:00:00
last      2020-04-06 00:00:00
Name: Date, dtype: object

The data run from January 22, 2020 to April 6, 2020, covering 76 distinct dates.
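The layout of `describe()` for datetime columns depends on the pandas version (the `first` and `last` rows above come from an older release). A version-independent way to get the same facts is to ask for them directly, as in this sketch on a toy column whose dates are invented for illustration:

```python
import pandas as pd

# Hypothetical dates; the second one is repeated on purpose.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-22", "2020-01-23", "2020-01-23", "2020-04-06"]
    )
})

first_day = toy["Date"].min()    # earliest date in the column
last_day = toy["Date"].max()     # latest date in the column
n_days = toy["Date"].nunique()   # number of distinct dates

print(first_day.date(), last_day.date(), n_days)
```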

## Grouping data

- **Getting the number of confirmed, deaths, recovered and active cases grouped by date and country/region:**

In [15]:
# Sum the counts per date and country (the columns are selected with a list)
df_agrupado = df.groupby(['Date', 'Country/Region'])[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()
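To make the effect of this grouping concrete, the sketch below uses a toy dataframe (all names and numbers are invented) in which one country is split into two provinces. After grouping, the province rows collapse into a single row per date and country.

```python
import pandas as pd

# Hypothetical data: China appears twice (two provinces), Italy once.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-06"] * 3),
    "Country/Region": ["China", "China", "Italy"],
    "Province/State": ["Hubei", "Beijing", ""],
    "Confirmed": [10, 5, 7],
    "Deaths": [1, 0, 2],
})

toy_grouped = (
    toy.groupby(["Date", "Country/Region"])[["Confirmed", "Deaths"]]
    .sum()
    .reset_index()  # turn the group keys back into ordinary columns
)

print(toy_grouped)
```

The numeric columns are passed as a list (double brackets), which selects several columns from the grouped object at once.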

- **Sorting the dataframe by the highest number of confirmed cases:**

In [16]:
df_agrupado.sort_values(by = 'Confirmed', ascending=False)

|   | Date | Country/Region | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|
| 13971 | 2020-04-06 | US | 366614 | 10783 | 19581 | 336250 |
| 13787 | 2020-04-05 | US | 337072 | 9619 | 17448 | 310005 |
| 13603 | 2020-04-04 | US | 308850 | 8407 | 14652 | 285791 |
| 13419 | 2020-04-03 | US | 275586 | 7087 | 9707 | 258792 |
| 13235 | 2020-04-02 | US | 243453 | 5926 | 9001 | 228526 |
| ... | ... | ... | ... | ... | ... | ... |
| 8756 | 2020-03-09 | Mauritania | 0 | 0 | 0 | 0 |
| 4967 | 2020-02-17 | Zimbabwe | 0 | 0 | 0 | 0 |
| 8754 | 2020-03-09 | Mali | 0 | 0 | 0 | 0 |
| 4968 | 2020-02-18 | Afghanistan | 0 | 0 | 0 | 0 |


- **Creating a new dataframe called df_group_paises:**

In [17]:
df_group_paises = df_agrupado.sort_values(by = 'Confirmed', ascending=False)

In [18]:
df_group_paises.head()

|   | Date | Country/Region | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|---|
| 13971 | 2020-04-06 | US | 366614 | 10783 | 19581 | 336250 |
| 13787 | 2020-04-05 | US | 337072 | 9619 | 17448 | 310005 |
| 13603 | 2020-04-04 | US | 308850 | 8407 | 14652 | 285791 |
| 13419 | 2020-04-03 | US | 275586 | 7087 | 9707 | 258792 |
| 13235 | 2020-04-02 | US | 243453 | 5926 | 9001 | 228526 |


- **Grouping by date for recovered, deaths and active cases:**

In [19]:
# Daily worldwide totals (the columns are selected with a list)
temp = df.groupby('Date')[['Recovered', 'Deaths', 'Active']].sum().reset_index()

- **Reshaping the dataframe into variable and value columns to hold the counts of recovered, deaths and active cases:**