<a href="https://colab.research.google.com/github/danielpy108/Covid19MXAnalysis/blob/master/Covid19_MX_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Covid19 MX analysis

The dataset is published by the Mexican government and can be downloaded from https://www.gob.mx/salud/documentos/datos-abiertos-152127?idiom=es.

We'll use NumPy and pandas for data manipulation and Plotly for visualization.

The idea is to check whether the dataset has been updated and, if so, download the new version; otherwise, keep the copy already stored in Google Drive.



## Modules needed to do the work

In [0]:
# Modules for making magic a real thing
import numpy as np
import pandas as pd
import plotly.graph_objects as go

In [4]:
# Google colab for reading from Google drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [0]:
# Modules needed to scrape the webpage in order to check for an update in the database 
import re
import requests
import lxml.html
from datetime import date

## Stage 1: Download and parse

After downloading the dataset, I'll scrape the website to see if there's an update 
in the data.

To do:

- [x] Scrape the webpage 
- [ ] Compare the date published on the webpage with the date of the downloaded dataset, and download the new version automatically if there is one
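
The unchecked item could be sketched roughly as follows. This is a minimal sketch, not the notebook's implementation: it assumes the downloaded CSV keeps the date at the start of its file name in `YYMMDD` form (as in `200427COVID19MEXICO.csv`) and that the webpage reports the date in `DD/MM/YYYY` form (as in `27/04/2020`). The helper names `local_dataset_date` and `needs_download` are hypothetical.

```python
from datetime import date, datetime
from pathlib import Path
from typing import Optional


def local_dataset_date(dset_dir: str) -> Optional[date]:
    """Return the newest date found in local CSV file names, or None.

    Assumption: file names look like 200427COVID19MEXICO.csv (YYMMDD prefix).
    """
    dates = []
    for csv_path in Path(dset_dir).glob('*COVID19MEXICO.csv'):
        try:
            dates.append(datetime.strptime(csv_path.name[:6], '%y%m%d').date())
        except ValueError:
            continue  # Skip files whose prefix is not a valid date
    return max(dates) if dates else None


def needs_download(latest_db_date: str, dset_dir: str) -> bool:
    """True if there is no local copy or the webpage date is newer."""
    remote = datetime.strptime(latest_db_date, '%d/%m/%Y').date()
    local = local_dataset_date(dset_dir)
    return local is None or local < remote
```

With the names used in this notebook, the check would be `needs_download(LATEST_DB_DATE, DSET_PATH)`.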

In [89]:
# Path to the directory that holds the dataset
DSET_PATH = '/content/drive/My Drive/Colab Notebooks/Covid19MXPythonAnalysis/Dataset'
!echo $DSET_PATH

/content/drive/My Drive/Colab Notebooks/Covid19MXPythonAnalysis/Dataset


In [79]:
# Scrape the page for updates in the database
WEBPAGE = 'https://www.gob.mx/salud/documentos/datos-abiertos-152127'
response = requests.get(WEBPAGE, stream=True)                                                        # Get the object response from the webpage, using stream is faster than string method
response.raw.decode_content = True                                                                   # Decode the raw content of the response object (HTML)
tree = lxml.html.parse(response.raw)                                                                 # Make the HTML parse tree in order to find elements by id, class, xpath, etc...
DB_INFO = ''.join\
    (tree.xpath('/html/body/main/div/div[1]/div[4]/div/table[2]/tbody/tr[1]/td[1]/text()'))          # Get the text of the XPATH provided HTML element 
date_pattern = re.compile(r'Base de Datos \* (.*)')  # Raw string, so \* reaches the regex engine as an escaped asterisk
LATEST_DB_DATE = date_pattern.search(DB_INFO).group(1)                                               # Extract the DATE pattern from the string
TODAY_DATE = date.today().strftime('%d/%m/%Y')                                                       # Get today date in the same format as the one in the webpage

print(f'The last date the database was updated: {LATEST_DB_DATE}')

The last date the database was updated: 27/04/2020


In [92]:
# Download dataset to Google Drive
# Keep the csv file
# Erase the .zip file
!wget http://187.191.75.115/gobmx/salud/datos_abiertos/datos_abiertos_covid19.zip -P '$DSET_PATH'
!unzip -d '$DSET_PATH' '$DSET_PATH/datos_abiertos_covid19.zip'
!rm '$DSET_PATH/datos_abiertos_covid19.zip'
!ls '$DSET_PATH'

200427COVID19MEXICO.csv
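
For automating the download from Python instead of shell commands, one possible sketch is shown below. It uses only the standard library (`urllib.request` and `zipfile`) rather than `wget`/`unzip`, and it keeps the archive in memory so there is no `.zip` file to erase afterwards. The function names `extract_csvs` and `download_dataset` are hypothetical; the URL is the one from the `wget` command.

```python
import io
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the wget command in the cell above
ZIP_URL = 'http://187.191.75.115/gobmx/salud/datos_abiertos/datos_abiertos_covid19.zip'


def extract_csvs(zip_bytes, dest_dir):
    """Extract only the .csv members of an in-memory zip into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith('.csv'):
                archive.extract(name, dest)
                extracted.append(dest / name)
    return extracted


def download_dataset(dest_dir, url=ZIP_URL, timeout=60):
    """Download the zip and extract its CSV files; nothing else is written to disk."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_csvs(response.read(), dest_dir)
```

Reading the whole archive into memory is only reasonable while the file is small; for a large archive, writing it to a temporary file first would be safer.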


## Stage 2: Analysis with Pandas 
<p> 
    The purpose of this notebook is to analyze the dataset provided
    by the Mexican government.
</p>

<table>
    <thead>
        <tr>
            <th> Name </th>
            <th> Age </th>
            <th> State </th>
            <th> Recovered </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th> Tyler Durden </th>
            <th> 34 </th>
            <th> LA </th>
            <th> 1 </th>
        </tr>
    </tbody>
</table>

### Information about the fields

+ RESULTADO: integer from 1 to 3
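
A quick way to confirm that a coded field stays within its documented range is to count its distinct values. The sketch below uses the five `RESULTADO` values visible in the first rows of the dataset rather than the full file; on the real data, `df['RESULTADO']` would take the place of the hand-built series.

```python
import pandas as pd

# RESULTADO values copied from the first five rows of the dataset
resultado = pd.Series([1, 2, 1, 3, 2], name='RESULTADO')

# True only if every value lies in the documented 1-3 range (inclusive)
in_range = bool(resultado.between(1, 3).all())

# Number of records per code, ordered by code
counts = resultado.value_counts().sort_index()
print(in_range)
print(counts)
```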

### To Do

- [ ] Perform statistical analysis (mean, var, std)
- [ ] Perform 'queries' to see which patients are the oldest, have diseases, etc.
- [ ] Check for correlation between variables
- [ ] Make different plots for visualizing and understanding the data
- [ ] Maybe create a basic ML supervised learning algorithm for prediction? 
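
The first three items could start from something like the sketch below. It is only an illustration on a handful of values copied from the first rows of the dataset, and it assumes that the code `1` in the disease columns means "yes" and `2` means "no"; that reading should be checked against the official data dictionary before drawing conclusions.

```python
import pandas as pd

# Three columns from the first five rows of the dataset
sample = pd.DataFrame({
    'EDAD':     [75, 31, 22, 26, 50],
    'DIABETES': [1, 2, 2, 2, 2],
    'OBESIDAD': [2, 1, 2, 1, 2],
})

# Basic statistics for age (pandas uses the sample variance, ddof=1)
age_stats = sample['EDAD'].agg(['mean', 'var', 'std'])

# 'Queries': the two oldest patients, and patients with diabetes
oldest = sample.nlargest(2, 'EDAD')
with_diabetes = sample[sample['DIABETES'] == 1]  # Assumption: 1 = yes

# Pairwise Pearson correlation between the selected columns
correlation = sample[['EDAD', 'DIABETES', 'OBESIDAD']].corr()

print(age_stats)
print(oldest)
print(with_diabetes)
print(correlation)
```

On the full dataset, codes such as `97` and `99` appear in several columns and would need to be filtered out or mapped to missing values before computing correlations.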

In [96]:
# Read the dataset from the CSV file
df = pd.read_csv(DSET_PATH + '/200427COVID19MEXICO.csv', engine='python')
df.head()

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,FECHA_INGRESO,FECHA_SINTOMAS,FECHA_DEF,INTUBADO,NEUMONIA,EDAD,NACIONALIDAD,EMBARAZO,HABLA_LENGUA_INDIG,DIABETES,EPOC,ASMA,INMUSUPR,HIPERTENSION,OTRA_COM,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,RESULTADO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI
0,2020-04-27,09e8dc,2,9,15,1,15,15,37,2,2020-04-09,2020-03-28,9999-99-99,2,1,75,1,2,2,1,2,2,2,2,2,2,2,2,2,2,1,99,México,99,1
1,2020-04-27,1dd782,2,12,9,1,15,9,3,1,2020-04-16,2020-04-02,9999-99-99,97,2,31,1,2,2,2,2,2,2,2,2,2,1,2,2,2,2,99,México,99,97
2,2020-04-27,0efbaf,2,9,28,2,16,28,32,1,2020-04-06,2020-04-04,9999-99-99,97,2,22,1,97,2,2,2,2,2,2,2,2,2,2,2,1,1,99,México,99,97
3,2020-04-27,013a6c,1,3,15,2,15,15,106,1,2020-04-16,2020-04-14,9999-99-99,97,2,26,1,97,2,2,2,2,2,2,2,2,1,2,2,1,3,99,México,99,97
4,2020-04-27,091a48,1,12,15,2,15,15,31,2,2020-04-06,2020-04-04,9999-99-99,2,1,50,1,97,2,2,2,2,2,2,2,2,2,2,2,2,2,99,México,99,2
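
The date columns are read as plain strings, and `FECHA_DEF` contains the placeholder `9999-99-99` in every row shown above. A possible way to handle this, sketched on values from the first two rows, is to parse the columns with `pd.to_datetime` and `errors='coerce'`, which turns invalid dates into `NaT`. Treating `9999-99-99` as "no date recorded" is an assumption here, not something the notebook states.

```python
import pandas as pd

# Date values copied from the first two rows of the dataset
sample = pd.DataFrame({
    'FECHA_SINTOMAS': ['2020-03-28', '2020-04-02'],
    'FECHA_INGRESO':  ['2020-04-09', '2020-04-16'],
    'FECHA_DEF':      ['9999-99-99', '9999-99-99'],
})

# Parse each date column; values that are not valid dates become NaT
for column in ['FECHA_SINTOMAS', 'FECHA_INGRESO', 'FECHA_DEF']:
    sample[column] = pd.to_datetime(sample[column], format='%Y-%m-%d', errors='coerce')

# Days between symptom onset and admission
days_to_admission = (sample['FECHA_INGRESO'] - sample['FECHA_SINTOMAS']).dt.days
print(sample.dtypes)
print(days_to_admission)
```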


### Get information about the dataset

In [103]:
# List the fields in the DS
fields = list(df.columns)
for n, field in enumerate(fields):
    print(f'Field #{n}:\t {field}')

Field #0:	 FECHA_ACTUALIZACION
Field #1:	 ID_REGISTRO
Field #2:	 ORIGEN
Field #3:	 SECTOR
Field #4:	 ENTIDAD_UM
Field #5:	 SEXO
Field #6:	 ENTIDAD_NAC
Field #7:	 ENTIDAD_RES
Field #8:	 MUNICIPIO_RES
Field #9:	 TIPO_PACIENTE
Field #10:	 FECHA_INGRESO
Field #11:	 FECHA_SINTOMAS
Field #12:	 FECHA_DEF
Field #13:	 INTUBADO
Field #14:	 NEUMONIA
Field #15:	 EDAD
Field #16:	 NACIONALIDAD
Field #17:	 EMBARAZO
Field #18:	 HABLA_LENGUA_INDIG
Field #19:	 DIABETES
Field #20:	 EPOC
Field #21:	 ASMA
Field #22:	 INMUSUPR
Field #23:	 HIPERTENSION
Field #24:	 OTRA_COM
Field #25:	 CARDIOVASCULAR
Field #26:	 OBESIDAD
Field #27:	 RENAL_CRONICA
Field #28:	 TABAQUISMO
Field #29:	 OTRO_CASO
Field #30:	 RESULTADO
Field #31:	 MIGRANTE
Field #32:	 PAIS_NACIONALIDAD
Field #33:	 PAIS_ORIGEN
Field #34:	 UCI


In [107]:
# Get the total number of records (patients) that have been studied
print(f'{df.shape[0]} patients have been studied')

71103 patients have been studied


## Making Plots 

Plotting the data makes it easier to understand.

In [122]:
ages = np.array(df['EDAD'])
tests = np.array(df['RESULTADO'], dtype=int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent

fig = go.Figure(
    data = [
        go.Histogram(
            x = ages,
            marker_color='#330C73',
        ),
    ]
)

fig.update_layout(
    title_text = 'Total population ages for Covid19 tests',
    xaxis_title_text = 'Age',
    yaxis_title_text = 'Frequency',
)

fig.show()