## COVID-19 Case, Phase 1
Julián Correa, Guillermo Cortés, Cristian Sarmiento

## Data credits
The data used in this exercise comes from the **[COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19)**.

### Data preprocessing

In [1]:
## import libraries and dependencies
import pandas as pd

## read downloaded csv file
covid_cases = pd.read_csv('time_series_covid19_confirmed_global.csv')
covid_cases.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,209322,209340,209358,209362,209369,209390,209406,209436,209451,209451
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,334391,334408,334408,334427,334427,334427,334427,334427,334443,334457
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,271441,271448,271463,271469,271469,271477,271477,271490,271494,271496
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,47866,47875,47875,47875,47875,47875,47875,47875,47890,47890
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,105255,105277,105277,105277,105277,105277,105277,105277,105288,105288


On inspection, the table is in wide format: each date is a separate column of the data frame, which makes the data awkward to process. We use pandas' **melt** to turn the date columns into a **date** variable, with a second variable, **value**, holding the number of cases.

In [2]:
## melt dataframe to get dates
id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long']
df = covid_cases.melt(id_vars=id_vars, var_name='date')
df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
234896,,Sao Tome and Principe,0.1864,6.6131,4/13/22,5948
230010,,Tonga,-21.179,-175.1982,3/27/22,5591
85775,,Serbia,44.0165,21.0059,11/13/20,77264
197649,,Ukraine,48.3794,31.1656,12/5/21,3668794
82420,Yukon,Canada,64.2823,-135.0,11/2/20,23
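To make the reshaping step concrete, here is a minimal sketch of what `melt` does on a tiny wide table. The table `wide` and its numbers are made up for illustration; only the column layout mirrors the CSSE file.

```python
import pandas as pd

# Hypothetical wide table: one column per date, as in the CSSE file.
wide = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# melt keeps the id columns and stacks every other column into two new ones:
# 'date' (the former column name) and 'value' (the former cell content).
long = wide.melt(id_vars=['Country/Region'], var_name='date')
print(long)
```

Each (country, date) pair becomes its own row, which is the shape the rest of the notebook works with.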


## Checking variable types

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  104013 non-null  object 
 1   Country/Region  330327 non-null  object 
 2   Lat             328041 non-null  float64
 3   Long            328041 non-null  float64
 4   date            330327 non-null  object 
 5   value           330327 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 15.1+ MB


The **date** variable is not stored as a date type, so we convert it from **object** to **datetime64**.

In [4]:
## Transform date column to datetime format
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
111426,,"Korea, North",40.3399,127.5101,2021-02-10,0
84572,,Mauritius,-20.348404,57.552152,2020-11-09,467
104956,Nunavut,Canada,70.2998,-83.1076,2021-01-19,266
118038,Reunion,France,-21.1151,55.5364,2021-03-05,13125
85803,,Tuvalu,-7.1095,177.6493,2020-11-13,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Province/State  104013 non-null  object        
 1   Country/Region  330327 non-null  object        
 2   Lat             328041 non-null  float64       
 3   Long            328041 non-null  float64       
 4   date            330327 non-null  datetime64[ns]
 5   value           330327 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 15.1+ MB
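A short sketch of why the conversion matters: date strings in `m/d/yy` form sort alphabetically, not chronologically. The two dates below are arbitrary examples chosen for illustration, and `is_datetime64_any_dtype` is used only as one possible way to check the resulting dtype.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Two example dates in the same m/d/yy format as the CSSE column names.
raw = pd.Series(['10/1/20', '2/1/20'])

# As strings, '10/1/20' sorts before '2/1/20' because '1' < '2'.
string_order = raw.sort_values().tolist()

# After parsing, sorting is chronological: February comes before October.
parsed = pd.to_datetime(raw, format='%m/%d/%y')
date_order = parsed.sort_values().tolist()

print(string_order)
print(date_order)
print(is_datetime64_any_dtype(parsed))
```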


## 1. In which month was the highest number of infections recorded?

In [6]:
# df['month_year'] = df['date'].dt.strftime('%Y-%m')  ## slow to run: took about 12 seconds
df['month_year'] = df['date'].apply(lambda x: f'{x.year}-{x.month}')  ## this approach took only about 1.8 s
df_sorted = df.copy()
df_sorted.sort_values(by=['Country/Region', 'Province/State', 'date'], inplace=True)
## value is a cumulative count, so the difference within each group gives the new cases per day
## caveat: groupby drops NaN keys by default, so rows with no Province/State get NaN here and end up as 0
df_sorted['new_cases'] = df.groupby(['Country/Region', 'Province/State'])['value'].diff().fillna(0)
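One thing to watch in the cell above: `groupby` excludes rows whose key is NaN by default (`dropna=True`), and most rows in this dataset have no `Province/State`. The following is a minimal sketch, assuming pandas 1.1 or later (where the `dropna` parameter exists), of how `dropna=False` keeps those rows in a group of their own. The frame `toy` and its numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts: country 'A' has no province, 'B' has one.
toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B', 'B'],
    'Province/State': [np.nan, np.nan, 'P', 'P'],
    'value': [1, 4, 10, 15],
})
keys = ['Country/Region', 'Province/State']

# Default behaviour: rows with a NaN key belong to no group, so their
# difference is NaN and fillna(0) turns it into 0 new cases.
toy['new_default'] = toy.groupby(keys)['value'].diff()
toy['new_default'] = toy['new_default'].fillna(0)

# dropna=False treats NaN as a regular key, so country 'A' gets its own group.
toy['new_kept'] = toy.groupby(keys, dropna=False)['value'].diff()
toy['new_kept'] = toy['new_kept'].fillna(0)

print(toy)
```

Whether this changes the answers below would have to be checked by re-running the notebook with the adjusted grouping.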

In [7]:
df_agg_month = df_sorted.groupby(['month_year'], as_index=False)['new_cases'].sum()
df_agg_month

Unnamed: 0,month_year,new_cases
0,2020-1,9267.0
1,2020-10,91876.0
2,2020-11,164658.0
3,2020-12,224391.0
4,2020-2,69599.0
5,2020-3,20899.0
6,2020-4,53646.0
7,2020-5,37757.0
8,2020-6,18726.0
9,2020-7,29464.0
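The month labels above sort as text (`2020-1`, `2020-10`, `2020-11`, ...) because the label is built without zero padding. As a possible alternative, and only a sketch on three example dates, `Series.dt.to_period('M')` yields monthly periods that sort chronologically. No timing comparison is implied here.

```python
import pandas as pd

# Three example dates spread over 2020.
dates = pd.Series(pd.to_datetime(['2020-10-15', '2020-01-22', '2020-02-01']))

# Label built as in the notebook: no zero padding, so it sorts as text.
labels = dates.apply(lambda x: f'{x.year}-{x.month}')
text_order = sorted(labels.tolist())

# Monthly periods keep calendar order when sorted.
periods = dates.dt.to_period('M')
period_order = periods.sort_values().astype(str).tolist()

print(text_order)
print(period_order)
```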


In [8]:
cond_max_cases = df_agg_month['new_cases'] == df_agg_month['new_cases'].max()
df_agg_month[cond_max_cases]

Unnamed: 0,month_year,new_cases
24,2022-1,3348699.0


In [None]:
max_month = df_agg_month[cond_max_cases]['month_year'].iloc[0]
print(f'The month and year with the most cases is {max_month}')
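The boolean mask used above can also be written with `idxmax`, which returns the index label of the first maximum. A minimal sketch on a made-up monthly table (`monthly` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical monthly totals.
monthly = pd.DataFrame({
    'month_year': ['2020-1', '2020-2', '2020-3'],
    'new_cases': [5.0, 9.0, 2.0],
})

# idxmax gives the index label of the largest value; loc fetches that row.
top_row = monthly.loc[monthly['new_cases'].idxmax()]
top_month = top_row['month_year']
print(top_month)
```

The difference from the mask is in ties: the mask returns every row that reaches the maximum, while `idxmax` returns only the first one.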

## 2. In that same month, which country reported the most infections?

In [10]:
cond = (df_sorted['month_year'] == max_month)
df_country = df_sorted[cond].groupby(['Country/Region'], as_index=False)['new_cases'].sum()
df_country.sort_values(by='new_cases').tail()

Unnamed: 0,Country/Region,new_cases
128,Netherlands,36048.0
190,United Kingdom,41996.0
63,France,237440.0
33,Canada,849919.0
9,Australia,2154890.0


In [11]:
df_country[df_country['new_cases'] == df_country['new_cases'].max()]

Unnamed: 0,Country/Region,new_cases
9,Australia,2154890.0
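Sorting and taking the tail, as done above, can be condensed with `DataFrame.nlargest`, which returns the top rows already in descending order. A small sketch on a hypothetical per-country table:

```python
import pandas as pd

# Hypothetical per-country totals for one month.
by_country = pd.DataFrame({
    'Country/Region': ['A', 'B', 'C', 'D'],
    'new_cases': [10.0, 40.0, 25.0, 5.0],
})

# The three countries with the most new cases, largest first.
top3 = by_country.nlargest(3, 'new_cases')
print(top3)
```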


## 3. Which country has the fewest reported cases to date?

In [12]:
df_min_country = df.groupby('Country/Region', as_index=False).agg({'value':'sum'})
df_min_country[(df_min_country['value'] == df_min_country['value'].min())]

Unnamed: 0,Country/Region,value
93,"Korea, North",300


***