## COLOMBIA COMPLETE COVID-19 DATASET

**Author: Camilo Esteban Ruiz**

Email: ruiznho123@gmail.com

[LinkedIn Profile](http://linkedin.com/in/camesruiz)

## Introduction

Version 1.0 (March 23rd)

Coronavirus (COVID-19) broke out in Colombia with the first confirmed case in the country on March 6. Since then, the number of confirmed cases has been increasing, and the first deaths related to the virus are starting to be confirmed.

This notebook focuses on giving some insights into the spread of the virus in Colombia.

Feel free to click on the "Code" button above each output to go deeper into the code for the graphs.

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import plotly.express as px
from matplotlib.pyplot import plot
import seaborn as sn
%matplotlib inline 

from sklearn.linear_model import LinearRegression
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/colombia-shape-files-by-departments/depto.shp
/kaggle/input/colombia-shape-files-by-departments/depto.dbf
/kaggle/input/colombia-shape-files-by-departments/depto.shx
/kaggle/input/colombia-shape-files-by-departments/depto.prj
/kaggle/input/colombia-covid19-complete-dataset/covid-19-colombia-confirmed.csv
/kaggle/input/colombia-covid19-complete-dataset/colombia_departamentos.csv
/kaggle/input/colombia-covid19-complete-dataset/covid-19-colombia.csv
/kaggle/input/colombia-covid19-complete-dataset/covid-19-colombia-deaths.csv
/kaggle/input/colombia-covid19-complete-dataset/covid-19-colombia-all.csv
/kaggle/input/colombia-covid19-complete-dataset/Casos1.csv


In [4]:
#Data import

colombia_df = pd.read_csv('../input/colombia-covid19-complete-dataset/covid-19-colombia-all.csv')
confirmed = pd.read_csv('../input/colombia-covid19-complete-dataset/covid-19-colombia-confirmed.csv')
deaths = pd.read_csv('../input/colombia-covid19-complete-dataset/covid-19-colombia-deaths.csv',encoding='ISO-8859-1')
col_df = pd.read_csv('../input/colombia-covid19-complete-dataset/covid-19-colombia.csv')
cases = pd.read_csv('../input/colombia-covid19-complete-dataset/Casos1.csv',encoding='ISO-8859-1')

In [5]:
#Dataframe overview
col_df.head()

Unnamed: 0,date,confirmed,confirmed_daily,deaths,deaths_daily,recovered,recovered_daily
0,2020-03-05,0,0,0,0,0,0
1,2020-03-06,1,1,0,0,0,0
2,2020-03-07,1,0,0,0,0,0
3,2020-03-08,2,1,0,0,0,0
4,2020-03-09,3,1,0,0,0,0


In [6]:
#Number of actual active cases calculation
col_df['active'] = col_df.confirmed - col_df.deaths - col_df.recovered

## 1. Cases Evolution Over Time

Graphs of confirmed cases, deaths and recoveries over time are generated, starting from the day the first case was confirmed (March 6). The confirmed-case curve appears to follow an exponential trend.

The first case corresponds to a 19-year-old woman who had arrived in the country the week before from Milan, Italy.

In [7]:
#Plotting
fig = px.line(col_df, x="date", y="confirmed", 
              title="Colombia Confirmed Cases")
fig.show()

fig = px.line(col_df, x="date", y="deaths", 
              title="Colombia Confirmed Deaths")
fig.show()

fig = px.line(col_df, x="date", y="recovered", 
              title="Colombia Confirmed Recoveries")
fig.show()
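
One rough way to check the exponential trend mentioned above is to fit a straight line to the logarithm of the cumulative confirmed counts. The sketch below is not part of the original analysis: it uses only the ten rows of `col_df` printed in this notebook (March 5 to March 14), and the variable names and the choice of `np.polyfit` are illustrative assumptions.

```python
import numpy as np

# Cumulative confirmed cases for March 5 to March 14, copied from the col_df output
confirmed = np.array([0, 1, 1, 2, 3, 3, 9, 9, 16, 24])
days = np.arange(len(confirmed))

# log(0) is undefined, so keep only days with at least one confirmed case
mask = confirmed > 0
slope, intercept = np.polyfit(days[mask], np.log(confirmed[mask]), 1)

# Under exponential growth, cases ~ exp(intercept + slope * day)
doubling_time = np.log(2) / slope
print(f"Daily growth rate: {slope:.2f}, doubling time: {doubling_time:.1f} days")
```

A roughly straight line on the log scale is what exponential growth would look like; with so few data points, the fitted rate is only an indication.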

Total of active cases

*Confirmed cases - Deaths - Recoveries*

In [8]:
fig = px.line(col_df, x="date", y="active", 
              title="Colombia Active Cases")
fig.show()

In [9]:
cols = confirmed.keys()

confirmed1 = confirmed.loc[:, cols[1]:cols[-1]]
deaths1 = deaths.loc[:, cols[1]:cols[-1]]

In [10]:
#Day index for each row of col_df (the data starts on March 5th, one day before the first case)
days = np.arange(len(col_df.index)).reshape(-1, 1)

dates = confirmed1.keys()
state_cases = []
total_deaths = [] 

#Total number of cases
for i in dates:
    confirmed_sum = confirmed1[i].sum()
    death_sum = deaths1[i].sum()
    
    state_cases.append(confirmed_sum)
    total_deaths.append(death_sum)

print('Total number of confirmed cases: ',confirmed_sum)

Total number of confirmed cases:  378
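
The loop above adds up each date column one at a time. As a hedged alternative, pandas can sum all date columns in one call. The table below is hypothetical (made-up states and values) and only assumes a wide layout with one row per state and one column per date.

```python
import pandas as pd

# Hypothetical wide-format table: one row per state, one column per date
toy = pd.DataFrame({
    "state": ["A", "B", "C"],
    "2020-03-21": [10, 4, 0],
    "2020-03-22": [15, 6, 1],
    "2020-03-23": [21, 9, 2],
})

# Sum every date column at once instead of looping over the columns
totals_by_date = toy.drop(columns="state").sum()

# The last entry is the country-wide total on the most recent date
latest_total = int(totals_by_date.iloc[-1])
print(totals_by_date)
print("Total number of confirmed cases:", latest_total)
```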


## 2. Number of Cases by State (Departamento)

Current number of cases in each region. As of March 23, Bogota has the largest number of cases. Being the capital, it receives most of the country's international passenger traffic through El Dorado airport.

In [11]:
state_cases = np.array(state_cases).reshape(-1, 1)
total_deaths = np.array(total_deaths).reshape(1, -1)

deptos = np.array(confirmed.state)
total = np.array(confirmed.loc[:,cols[-1]])
total_d = np.array(deaths.loc[:,cols[-1]])

In [12]:
#Plotting
df = pd.DataFrame({'state':deptos,'confirmed':total})
fig = px.bar(df.sort_values('confirmed', ascending=False)[:10][::-1], 
             x='confirmed', y='state', color_discrete_sequence=['#84DCC6'],
             title='Confirmed Cases by State', text='confirmed', orientation='h')
fig.show()

Now, a graph showing the number of deaths per state is generated.

The first death corresponds to a 58-year-old taxi driver who passed away on March 16th in Cartagena after having contact with several international passengers he had picked up from the airport. His death was not officially linked to COVID-19 and reported until March 22nd.

In [13]:
#The column is named 'deaths' so the axis label matches the plotted data
df = pd.DataFrame({'state':deptos,'deaths':total_d})
fig = px.bar(df.sort_values('deaths', ascending=False)[:10][::-1], 
             x='deaths', y='state', color_discrete_sequence=['#84DCC6'],
             title='Confirmed Deaths by State', text='deaths', orientation='h')
fig.show()

## 3. Mortality and Recovery Rate

Mortality rate as a percentage of the total number of confirmed cases. Initially, the World Health Organization estimated that deaths amounted to about 2% of the total number of cases globally.

A graph showing the death rate over time in Colombia is generated.

In [14]:
#Mortality and recovery rate calculation
col_df['death_rate'] = (col_df.deaths/col_df.confirmed) * 100
col_df['recover_rate'] = (col_df.recovered/col_df.confirmed) * 100
col_df['inf_rate'] = (col_df.confirmed/48258494) * 100

col_df

Unnamed: 0,date,confirmed,confirmed_daily,deaths,deaths_daily,recovered,recovered_daily,active,death_rate,recover_rate,inf_rate
0,2020-03-05,0,0,0,0,0,0,0,,,0.0
1,2020-03-06,1,1,0,0,0,0,1,0.0,0.0,2e-06
2,2020-03-07,1,0,0,0,0,0,1,0.0,0.0,2e-06
3,2020-03-08,2,1,0,0,0,0,2,0.0,0.0,4e-06
4,2020-03-09,3,1,0,0,0,0,3,0.0,0.0,6e-06
5,2020-03-10,3,0,0,0,0,0,3,0.0,0.0,6e-06
6,2020-03-11,9,6,0,0,0,0,9,0.0,0.0,1.9e-05
7,2020-03-12,9,0,0,0,0,0,9,0.0,0.0,1.9e-05
8,2020-03-13,16,7,0,0,0,0,16,0.0,0.0,3.3e-05
9,2020-03-14,24,8,0,0,0,0,24,0.0,0.0,5e-05


In [15]:
#Long-format dataframe for plotting multiple traces
df_melt = col_df.melt(id_vars='date', value_vars=['death_rate', 'recover_rate'])

In [16]:
fig = px.line(df_melt, x="date", y="value", 
              title="Colombia Mortality and Recover Rate (%)",color='variable')

print('Death Rate: ',col_df.death_rate.max() , '%')
print('Recover Rate: ',col_df.recover_rate.max() , '%')
fig.show()

Death Rate:  0.9803921568627451 %
Recover Rate:  1.9607843137254901 %
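
Note that the values printed above come from `.max()`, so they are the highest rates observed on any day, which may differ from the rate on the most recent day. The sketch below illustrates the difference with hypothetical counts (the numbers are invented for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical cumulative counts, for illustration only
toy = pd.DataFrame({
    "confirmed": [50, 100, 200, 400],
    "deaths": [0, 2, 2, 3],
})
toy["death_rate"] = toy.deaths / toy.confirmed * 100

peak_rate = toy.death_rate.max()        # highest value seen on any day
latest_rate = toy.death_rate.iloc[-1]   # value on the most recent day
print(f"Peak: {peak_rate:.2f} %, latest: {latest_rate:.2f} %")
```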


Infection rate over time, assuming a population of 48,258,494 (the result of the 2018 census).

In [17]:
fig = px.line(col_df, x="date", y="inf_rate", 
              title="Colombia Infection Rate (%) (Population: 48,258,494)")

print('Infection Rate: ',col_df.inf_rate.max() , '%')
fig.show()

Infection Rate:  0.0007832817990548979 %
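
As a worked check of the figure above, the same arithmetic can be done by hand. This sketch assumes the maximum corresponds to the 378 confirmed cases reported earlier in this notebook; the cases-per-100,000 figure is an extra illustration that the original analysis does not compute.

```python
POPULATION = 48_258_494   # 2018 census figure used in this notebook
confirmed_total = 378     # total confirmed cases printed earlier

# Same formula as col_df['inf_rate']: share of the population, in percent
inf_rate_pct = confirmed_total / POPULATION * 100

# The same ratio expressed per 100,000 inhabitants (easier to read)
per_100k = confirmed_total / POPULATION * 100_000

print(f"Infection rate: {inf_rate_pct:.7f} %")
print(f"Cases per 100,000 inhabitants: {per_100k:.2f}")
```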


## 4. Confirmed Daily Cases

In [18]:
df = pd.DataFrame({'Date':col_df.date,'Confirmed':col_df.confirmed_daily})
fig = px.bar(df, y='Confirmed', x='Date', color_discrete_sequence=['#84DCC6'],
             title='Confirmed Daily Cases', text='Confirmed', orientation='v')
fig.show()

df = pd.DataFrame({'Date':col_df.date,'Deaths':col_df.deaths_daily})
fig = px.bar(df, y='Deaths', x='Date', color_discrete_sequence=['#84DCC6'],
             title='Confirmed Daily Deaths', text='Deaths', orientation='v')
fig.show()

df = pd.DataFrame({'Date':col_df.date,'Recovered':col_df.recovered_daily})
fig = px.bar(df, y='Recovered', x='Date', color_discrete_sequence=['#84DCC6'],
             title='Confirmed Daily Recoveries', text='Recovered', orientation='v')
fig.show()

In [19]:
df_melt = col_df.melt(id_vars='date', value_vars=['recovered_daily','deaths_daily', 'confirmed_daily'])
fig = px.bar(df_melt, y='value', x='date', color='variable',
             title='Confirmed Daily Cases', text='value', orientation='v',barmode='group')
fig.show()
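
The dataset already ships the `*_daily` columns, and how they were produced is not stated. A plausible reconstruction, given here as an assumption, is the day-to-day difference of the cumulative series. The sketch uses the ten `confirmed` values printed earlier in this notebook.

```python
import pandas as pd

# Cumulative confirmed cases for March 5 to March 14, copied from the col_df output
confirmed = pd.Series([0, 1, 1, 2, 3, 3, 9, 9, 16, 24])

# Daily new cases are the day-to-day difference of the cumulative series;
# the first day has no previous value, so it is filled with 0
daily = confirmed.diff().fillna(0).astype(int)
print(daily.tolist())
```

For these ten days the result agrees with the `confirmed_daily` values shown in the `col_df` output.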

## 5. Cases by Sex

Total number of confirmed cases by sex and age group.

In [20]:
cases

Unnamed: 0,ID,date,Departamento,Edad,Sexo,Tipo
0,1,2020-03-06,Bogota,10 a 19,F,Importado
1,2,2020-03-09,Valle,30 a 39,M,Importado
2,3,2020-03-09,Antioquia,50 a 59,F,Importado
3,4,2020-03-11,Antioquia,50 a 59,M,Relacionado
4,5,2020-03-11,Antioquia,20 a 29,M,Relacionado
...,...,...,...,...,...,...
301,302,2020-03-23,Meta,30 a 39,F,Importado
302,303,2020-03-23,Antioquia,70 a 79,F,Importado
303,304,2020-03-23,Antioquia,30 a 39,M,Importado
304,305,2020-03-23,Meta,20 a 29,M,Importado


In [21]:
#Count rows per sex; summing the boolean mask avoids positional indexing with [0]
male = (cases['Sexo'] == 'M').sum()
female = (cases['Sexo'] == 'F').sum()

sex_grouped = pd.DataFrame({'M': [male], 'F': [female]}).T
sex_grouped.columns = ['n']
sex_grouped

Unnamed: 0,n
M,157
F,149


In [22]:
fig = px.bar(sex_grouped, y='n', x= sex_grouped.index,
             title='Cases by Sex', text='n', orientation='v')
fig.show()

The graph above shows that cases are almost evenly distributed by sex, with males accounting for slightly more of the infected cases (157 vs. 149).
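
To put a number on how evenly cases are split by sex, the counts can be converted to percentages. This is a small sketch using the counts from the `sex_grouped` table; the variable names are illustrative.

```python
import pandas as pd

# Counts taken from the sex_grouped table above
sex_counts = pd.Series({"M": 157, "F": 149}, name="n")

# Share of each sex as a percentage of all counted cases
sex_share = (sex_counts / sex_counts.sum() * 100).round(1)
print(sex_share)
```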

The next plot shows that the largest number of infected cases is in the 20-29 age group, while the fewest are in the 0-9 and 80-89 groups, which are part of the most vulnerable population.

In [23]:
age_grouped = cases.groupby(['Edad']).count()
age_grouped['ID']

Edad
0 a 9       1
10 a 19    10
20 a 29    78
30 a 39    58
40 a 49    51
50 a 59    59
60 a 69    27
70 a 79    15
80 a 89     7
Name: ID, dtype: int64

In [24]:
fig = px.bar(age_grouped, y='ID', x= age_grouped.index,
             title='Cases by Age Groups', text='ID', orientation='v')
fig.show()
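
Sex and age are plotted separately above. A cross-tabulation is one possible way to look at both at once; it is not part of the original analysis. The sketch below uses only the nine rows of `cases` visible in the preview earlier in this section, so the counts are not the full-dataset totals.

```python
import pandas as pd

# The nine rows of `cases` visible in the preview above
sample = pd.DataFrame({
    "Edad": ["10 a 19", "30 a 39", "50 a 59", "50 a 59", "20 a 29",
             "30 a 39", "70 a 79", "30 a 39", "20 a 29"],
    "Sexo": ["F", "M", "F", "M", "M", "F", "F", "M", "M"],
})

# Count cases for every combination of age group and sex
age_by_sex = pd.crosstab(sample["Edad"], sample["Sexo"])
print(age_by_sex)
```

With the full dataframe loaded, `pd.crosstab(cases['Edad'], cases['Sexo'])` should give the complete table.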