# A COVID-19 Analysis (week 5 report)

## Background

Based on the information at the [Pan-American Health Organization site](https://www.paho.org/bra/index.php?option=com_content&view=article&id=6101:covid19&Itemid=875):


> The World Health Organization (WHO) declared, on January 30, 2020, that the outbreak of the disease caused by the new coronavirus (COVID-19) constitutes a Public Health Emergency of International Concern, the Organization's highest alert level, as provided in the International Health Regulations. On March 11, 2020, COVID-19 was characterized by WHO as a pandemic.

## The problem

Through data analysis, I'll try to show the main epicenters of the virus's spread and the places potentially related to it. To keep the analysis concise and practical, it focuses on the city where I live, São Paulo.

Because the subject is so delicate, I'm not trying to jump to conclusions or assign blame.  
By the end of this work, I hope readers can make their own decisions with a little more knowledge of the situation.

## The data

In this work I'll explore some datasets related to the virus and the disease it causes.
The main data used in the analysis is downloaded from government sites, but some information (the district boundaries) was downloaded from other sources and stored beforehand on my GitHub account.

##### Demographic data:
https://www.prefeitura.sp.gov.br/cidade/secretarias/subprefeituras/subprefeituras/dados_demograficos/index.php

This data is scraped with BeautifulSoup.
It matters here because we want to show how the spread of the virus relates to the population density of the districts.
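
To make the scraping step concrete, here is a minimal sketch of turning an HTML table into a density figure per district. It is only an illustration under assumptions: the notebook uses BeautifulSoup, while this sketch uses the standard-library `html.parser` as a dependency-free stand-in, and the table layout, the district names, and the numbers in `SAMPLE_HTML` are all made up rather than taken from the city's page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical table: layout and values are invented for the example.
SAMPLE_HTML = """
<table>
  <tr><th>District</th><th>Area (km2)</th><th>Population</th></tr>
  <tr><td>District A</td><td>10</td><td>50000</td></tr>
  <tr><td>District B</td><td>4</td><td>60000</td></tr>
</table>
"""


def density_by_district(html):
    """Return {district name: inhabitants per km2} from the table rows."""
    parser = TableRows()
    parser.feed(html)
    _header, *body = parser.rows  # the first row holds the column titles
    return {name: float(pop) / float(area) for name, area, pop in body}


densities = density_by_district(SAMPLE_HTML)
print(densities)
```

A real page will probably need extra cleaning (for example, numbers written with thousands separators) before the values can be converted to floats.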

##### District boundaries data:
https://artefolha.carto.com/tables/distritos_sp/public/map


The site offers a link to download the GeoJSON of the city we're analyzing.
The data was downloaded and stored in my GitHub repository to be used in the Jupyter Notebook:
https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/distritos_sp.geojson

That data is used to plot the population density on a choropleth map with folium.
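
A choropleth only colors the districts whose names match between the boundaries file and the density table, so it can help to check the join before plotting. The sketch below shows one way to do that check. The property name `name`, the district names, and the density values are hypothetical placeholders, not the actual contents of `distritos_sp.geojson`.

```python
import json

# Hypothetical boundaries file: only the properties matter for this check.
SAMPLE_GEOJSON = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"name": "District A"}, "geometry": null},
    {"type": "Feature", "properties": {"name": "District B"}, "geometry": null},
    {"type": "Feature", "properties": {"name": "District C"}, "geometry": null}
  ]
}
""")

# Hypothetical density table (inhabitants per km2).
density = {"District A": 5000.0, "District B": 15000.0}


def unmatched_districts(geojson, values, name_property="name"):
    """Return the district names found in the GeoJSON but not in the table."""
    names = [f["properties"][name_property] for f in geojson["features"]]
    return sorted(n for n in names if n not in values)


missing = unmatched_districts(SAMPLE_GEOJSON, density)
print(missing)
```

Any name reported here would be left without data on the map, usually because of differences in spelling, accents, or letter case between the two sources. With folium, the same property is the one that would be referenced in the `key_on` argument of `folium.Choropleth` (for this sample, `"feature.properties.name"`).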

##### Covid data:
https://opendatasus.saude.gov.br/dataset/casos-nacionais

This is the core of this work.
This is the government data on Flu Syndrome notifications and COVID-19 tests.
The data is updated from Monday to Friday.
We'll use it to find the number of confirmed cases by week, by day, and by neighborhood, and to obtain geospatial data through geopy.
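
As an illustration of the weekly count, here is a small sketch that groups confirmed cases by ISO week. The field names `test_date` and `result`, the value `"positive"`, and the records themselves are assumptions made for the example; the real dataset has its own column names and codes.

```python
from collections import Counter
from datetime import date

# Hypothetical records: field names and values are placeholders.
records = [
    {"test_date": "2020-06-29", "result": "positive"},
    {"test_date": "2020-07-01", "result": "positive"},
    {"test_date": "2020-07-03", "result": "negative"},
    {"test_date": "2020-07-06", "result": "positive"},
]


def confirmed_per_week(rows):
    """Count positive results per (ISO year, ISO week)."""
    counts = Counter()
    for row in rows:
        if row["result"] != "positive":
            continue  # keep only confirmed cases
        year, week, _weekday = date.fromisoformat(row["test_date"]).isocalendar()
        counts[(year, week)] += 1
    return dict(counts)


weekly = confirmed_per_week(records)
print(weekly)
```

Keying on the ISO year together with the week number keeps weeks from different years apart if the series ever spans a year boundary.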

##### Venues data:
Here we'll use the [Foursquare API](https://developer.foursquare.com/places) to obtain the main venues in each neighborhood that reported potentially active COVID-19 cases.
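
For reference, a venue search around a point could be assembled roughly as below. This is a sketch under assumptions: it targets the legacy v2 `venues/explore` endpoint, which may no longer be available (check the current Foursquare documentation), and the coordinates, radius, limit, version date, and credentials are illustrative placeholders. The code only builds the URL and does not call the API.

```python
from urllib.parse import urlencode

# Assumption: legacy v2 endpoint; newer API versions use different URLs and auth.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius_m=500, limit=50, version="20200701"):
    """Build the query URL for venues around a latitude/longitude pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",    # search center
        "radius": radius_m,      # search radius in meters
        "limit": limit,          # maximum number of venues returned
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials and an illustrative point in central São Paulo.
url = build_explore_url(-23.5505, -46.6333, "YOUR_ID", "YOUR_SECRET")
print(url)
```

Keeping the URL construction in one function makes it easy to repeat the search for every neighborhood returned by the clustering step.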

## Methodology

First we scrape the government demographic data and show the districts with the highest population density on a choropleth map.

![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/populational%20density.png "Population density")

Then we present a basic overview of the historical COVID-19 data, showing the number of confirmed cases by week and the number of potentially active cases by day.

![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/confirmed%20cases%20by%20week.png "Confirmed cases per week")


![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/confirmed%20cases%20by%20day.png "Potentially active cases per day")


After that, the analysis relates the city's demographic data to geospatial data about the spread of the virus, focusing on potentially active cases.

With that in mind, we try to determine the main epicenter of the virus in the city. At this point, because the Flu Syndrome dataset is huge and compiled by the government from various sources, we have a lot of noise. Many of the neighborhoods located with geopy don't belong to the city we're analyzing:

![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/active%20cases%20noise.png "Data noise")


Because of the noise and the geospatial nature of the data, we chose DBSCAN (Density-Based Spatial Clustering of Applications with Noise) as the machine learning algorithm to explore and understand the neighborhood data.
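
To show why DBSCAN suits this kind of data, here is a small pure-Python version of the algorithm using great-circle distance. It is a teaching sketch, not the code used in the notebook; in practice a library implementation such as scikit-learn's `DBSCAN` would be the usual choice. The coordinates and the `eps_km` and `min_samples` values are illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
NOISE = -1


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def dbscan(points, eps_km, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""

    def neighbors(i):
        # Every point within eps_km of point i, including i itself.
        return [j for j, q in enumerate(points) if haversine_km(points[i], q) <= eps_km]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Illustrative points: three close together and one far away.
points = [
    (-23.550, -46.633),
    (-23.551, -46.634),
    (-23.549, -46.632),
    (-22.907, -43.173),
]
labels = dbscan(points, eps_km=1.0, min_samples=3)
print(labels)
```

The isolated point ends up with the noise label instead of being forced into a cluster, which is the behavior needed for geocoded neighborhoods that fall outside the city.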

By eliminating the noise and grouping the cases by cluster, we can determine the main epicenter in the city. We break this process into two clustering steps.

The first leads us to the data related to our city.

![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/first%20clusterization.png "First clustering")


The second spots the main epicenter inside the city area.

![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/second%20clusterization.png "Second clustering")


At that point, we use the Foursquare API to search for venues near the epicenter that are potentially related to the spread of the virus.

![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/venues%20found.png "Venues found")


Lastly, we present the main kinds of venues potentially related to the spread of the virus, along with the neighborhoods in the epicenter with the most spread and the places near those neighborhoods.

![alt text](https://raw.githubusercontent.com/UgoCesar19/Coursera_Capstone/master/main%20neighborhoods%20and%20venues.png "Main neighborhoods and venues")
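
The "main kinds of venues" can be read as the most frequent venue categories near the epicenter. A minimal sketch of that count follows. The venue names, categories, and neighborhoods are invented placeholders, and the field names are assumptions rather than the structure returned by the Foursquare API.

```python
from collections import Counter

# Hypothetical venues: names, categories and neighborhoods are placeholders.
venues = [
    {"name": "Venue 1", "category": "Bakery", "neighborhood": "Neighborhood A"},
    {"name": "Venue 2", "category": "Bakery", "neighborhood": "Neighborhood A"},
    {"name": "Venue 3", "category": "Pharmacy", "neighborhood": "Neighborhood A"},
    {"name": "Venue 4", "category": "Bakery", "neighborhood": "Neighborhood B"},
    {"name": "Venue 5", "category": "Pharmacy", "neighborhood": "Neighborhood B"},
    {"name": "Venue 6", "category": "Gym", "neighborhood": "Neighborhood B"},
]


def top_categories(rows, n=3):
    """Return the n most frequent venue categories with their counts."""
    return Counter(row["category"] for row in rows).most_common(n)


top = top_categories(venues, n=2)
print(top)
```

The same function can be applied to the venues of a single neighborhood to compare which kinds of places dominate in each one.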


## Results

It's important to state that at the time of this publication (07/2020) the COVID-19 crisis is still ongoing, so the data will change from day to day and the conclusions will evolve with it.  
The intention is to keep running this analysis as a tool to spot problems as they evolve.

## Discussion

The important thing about this analysis is that we draw our conclusions from the data, regardless of the battle between the media and the government.

I think the government could learn from this problem to invest more in health and better prepare the country's infrastructure, so that in the future we can make important decisions in less time, before things reach the point where we are now.

## Conclusion

Again, because this is a delicate subject, I'd like to keep this work as an evolving piece and wait for feedback to improve the analysis.

Also, although this work took the path that I chose, I hope other analysts can use it as an example and make their own analyses.

I also hope that people can make better decisions of their own based on my study. Everyone can help, especially if they are well informed.