# Project: Machine Learning II
***Santander Coders 2023.2 Project | Ada***

**Instructor:** Jorge Chamby-Diaz

**Description:** This notebook contains the development of the project proposed as the final assignment for the `Machine Learning II` course.

**Dataset(s) Used:** [covid19br](https://github.com/wcota/covid19br/)

> W. Cota, "Monitoring the number of COVID-19 cases and deaths in Brazil at municipal and federative units level", SciELOPreprints:362 (2020), 10.1590/scielopreprints.362

## Part 0: Libraries and Imports

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

## Part 1: Exploring the Dataset

In [11]:
df = pd.read_csv('https://raw.githubusercontent.com/wcota/covid19br/master/cases-brazil-states.csv')

# As suggested by the dataset author,
# convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])

df.head(10)

Unnamed: 0,epi_week,date,country,state,city,newDeaths,deaths,newCases,totalCases,deathsMS,...,tests,tests_per_100k_inhabitants,vaccinated,vaccinated_per_100_inhabitants,vaccinated_second,vaccinated_second_per_100_inhabitants,vaccinated_single,vaccinated_single_per_100_inhabitants,vaccinated_third,vaccinated_third_per_100_inhabitants
0,9,2020-02-25,Brazil,SP,TOTAL,0,0,1,1,0,...,,,,,,,,,,
1,9,2020-02-25,Brazil,TOTAL,TOTAL,0,0,1,1,0,...,,,,,,,,,,
2,9,2020-02-26,Brazil,SP,TOTAL,0,0,0,1,0,...,,,,,,,,,,
3,9,2020-02-26,Brazil,TOTAL,TOTAL,0,0,0,1,0,...,,,,,,,,,,
4,9,2020-02-27,Brazil,SP,TOTAL,0,0,0,1,0,...,,,,,,,,,,
5,9,2020-02-27,Brazil,TOTAL,TOTAL,0,0,0,1,0,...,,,,,,,,,,
6,9,2020-02-28,Brazil,SP,TOTAL,0,0,1,2,0,...,,,,,,,,,,
7,9,2020-02-28,Brazil,TOTAL,TOTAL,0,0,1,2,0,...,,,,,,,,,,
8,9,2020-02-29,Brazil,SP,TOTAL,0,0,0,2,0,...,,,,,,,,,,
9,9,2020-02-29,Brazil,TOTAL,TOTAL,0,0,0,2,0,...,,,,,,,,,,
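
The date conversion above can also be done while reading the file. The sketch below is only an illustration: it uses a hypothetical two-row sample (not the real file) with a few of the column names seen in the output, and shows that `parse_dates` in `pd.read_csv` produces a datetime column just as `pd.to_datetime` does.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics a few columns of the real CSV.
sample_csv = io.StringIO(
    "epi_week,date,state,newCases,totalCases\n"
    "9,2020-02-25,SP,1,1\n"
    "9,2020-02-26,SP,0,1\n"
)

# parse_dates converts the listed columns to datetime during loading,
# which has the same effect as calling pd.to_datetime afterwards.
sample = pd.read_csv(sample_csv, parse_dates=["date"])

# A datetime column exposes the .dt accessor (year, month, weekday, ...).
print(sample["date"].dtype)
print(sample["date"].dt.year.tolist())
```

Either approach works; converting at load time simply keeps the parsing step next to the file path.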


In [12]:
print('Shape: ', df.shape)
df.info()

Shape:  (30842, 26)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30842 entries, 0 to 30841
Data columns (total 26 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   epi_week                               30842 non-null  int64         
 1   date                                   30842 non-null  datetime64[ns]
 2   country                                30842 non-null  object        
 3   state                                  30842 non-null  object        
 4   city                                   30842 non-null  object        
 5   newDeaths                              30842 non-null  int64         
 6   deaths                                 30842 non-null  int64         
 7   newCases                               30842 non-null  int64         
 8   totalCases                             30842 non-null  int64         
 9   deathsMS                               30

Using the column descriptions provided by the author [here](https://github.com/wcota/covid19br/blob/master/DESCRIPTION.md), we will filter some of the data and create new columns that may be useful for our analysis:

In [13]:
# Create new columns: active cases = total cases - deaths - recovered
df['active_cases'] = df['totalCases'] - df['deaths'] - df['recovered']

In [14]:

df[['epi_week', 'vaccinated', 'vaccinated_second', 'vaccinated_third', 'active_cases']].head(10)

Unnamed: 0,epi_week,vaccinated,vaccinated_second,vaccinated_third,active_cases
0,9,,,,
1,9,,,,
2,9,,,,
3,9,,,,
4,9,,,,
5,9,,,,
6,9,,,,
7,9,,,,
8,9,,,,
9,9,,,,
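
The `NaN` values in `active_cases` above follow from how pandas arithmetic treats missing data: `totalCases` and `deaths` have no nulls, so the missing operand must be `recovered`, and any subtraction involving `NaN` yields `NaN`. The sketch below reproduces this on a hypothetical two-row frame. The `active_cases_filled` column is not part of the original analysis; it assumes that a missing `recovered` value can be treated as zero, which may overstate active cases.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first row has no 'recovered' value yet.
toy = pd.DataFrame(
    {
        "totalCases": [1, 100],
        "deaths": [0, 5],
        "recovered": [np.nan, 60.0],
    }
)

# Same formula as in the notebook: NaN in 'recovered' propagates to the result.
toy["active_cases"] = toy["totalCases"] - toy["deaths"] - toy["recovered"]

# Assumption for illustration only: treat a missing 'recovered' as 0.
toy["active_cases_filled"] = (toy["totalCases"] - toy["deaths"]).sub(
    toy["recovered"], fill_value=0
)

print(toy)
```

Whether to fill, drop, or keep these rows depends on the analysis; keeping `NaN` makes the absence of data explicit.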


In [15]:
df[['epi_week', 'vaccinated', 'vaccinated_second', 'vaccinated_third', 'active_cases']].tail(10)

Unnamed: 0,epi_week,vaccinated,vaccinated_second,vaccinated_third,active_cases
30832,311,14716972.0,13809945.0,12211486.0,728776.0
30833,311,2979308.0,2772139.0,2030138.0,197245.0
30834,311,1328876.0,1186864.0,697735.0,116795.0
30835,311,473511.0,369284.0,175984.0,28436.0
30836,311,9877242.0,9431834.0,8439439.0,774032.0
30837,311,6308275.0,5862014.0,4166331.0,356600.0
30838,311,2023048.0,1860868.0,1376006.0,39613.0
30839,311,43381275.0,40857206.0,29057424.0,1440403.0
30840,311,1183632.0,1029319.0,595805.0,79842.0
30841,311,183869403.0,170385963.0,126348017.0,9004794.0
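
The first rows of the dataset show each date appearing once for a state (`SP`) and once with `state == 'TOTAL'`, so summing a column over all rows would likely double count. The sketch below is one possible way to separate the two groups, using a hypothetical toy frame; reading `TOTAL` as the national aggregate is an assumption based on the output shown here and should be checked against the author's column descriptions.

```python
import pandas as pd

# Hypothetical toy data with the same 'state' convention seen in df.head().
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-02-25", "2020-02-25", "2020-02-26", "2020-02-26"]
        ),
        "state": ["SP", "TOTAL", "SP", "TOTAL"],
        "totalCases": [1, 1, 1, 1],
    }
)

# Boolean mask marking the aggregate rows (assumed to be country-level totals).
is_total = toy["state"] == "TOTAL"

national = toy[is_total].reset_index(drop=True)  # one row per date
states = toy[~is_total].reset_index(drop=True)   # state-level rows only

print(national)
print(states)
```

Keeping the two frames separate allows state-level comparisons and national trends to be analyzed without mixing aggregation levels.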
