## COVID-19 Case Study: Phase 1
Julián Correa, Guillermo Cortés, Cristian Sarmiento

## Data credits
The data used in this exercise comes from the **[COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19)**.

### Data preprocessing

In [1]:
## import libraries and dependencies
import pandas as pd

## read downloaded csv file
covid_cases = pd.read_csv('time_series_covid19_confirmed_global.csv')
covid_cases.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,209322,209340,209358,209362,209369,209390,209406,209436,209451,209451
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,334391,334408,334408,334427,334427,334427,334427,334427,334443,334457
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,271441,271448,271463,271469,271469,271477,271477,271490,271494,271496
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,47866,47875,47875,47875,47875,47875,47875,47875,47890,47890
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,105255,105277,105277,105277,105277,105277,105277,105277,105288,105288


On inspection, the table is in wide (horizontal) format: each date is a separate column of the DataFrame, which makes the data hard to process. We use pandas' **melt** to reshape it into long format, so that **date** becomes one variable and **value** becomes another holding the cumulative number of confirmed cases.

In [2]:
## melt dataframe to get dates
id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long']
df = covid_cases.melt(id_vars=id_vars, var_name='date')
df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
46100,,Indonesia,-0.7893,113.9213,6/29/20,55092
87341,Gansu,China,35.7518,104.2861,11/19/20,181
63649,Heilongjiang,China,47.862,127.7615,8/29/20,948
145380,South Australia,Australia,-34.9285,138.6007,6/8/21,769
111792,,Somalia,5.152149,46.199616,2/11/21,5013
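
To make the reshaping concrete, here is a minimal sketch of the same wide-to-long step on a tiny table. The frame `wide` and its values are hypothetical and only illustrate the layout; `value_name='value'` spells out the default column name that the cell above relies on.

```python
import pandas as pd

# Hypothetical two-row table in the same wide layout: one column per date.
wide = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# Each date column becomes a set of rows: one row per (country, date) pair.
long_df = wide.melt(id_vars=['Country/Region'], var_name='date', value_name='value')
print(long_df)
```

Two countries and two date columns give four rows, which is why the melted COVID table grows to several hundred thousand rows.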


## Reviewing variable types

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  104013 non-null  object 
 1   Country/Region  330327 non-null  object 
 2   Lat             328041 non-null  float64
 3   Long            328041 non-null  float64
 4   date            330327 non-null  object 
 5   value           330327 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 15.1+ MB


The **date** variable is stored as **object** rather than as a date type, so we convert it to **datetime64**.

In [4]:
## Transform date column to datetime format
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
223553,,Japan,36.204824,138.252924,2022-03-05,5348422
329588,Saint Pierre and Miquelon,France,46.8852,-56.3159,2023-03-07,3452
73775,Ningxia,China,37.2692,106.1655,2020-10-03,75
309570,Prince Edward Island,Canada,46.5107,-63.4168,2022-12-28,55319
91761,,India,20.593684,78.96288,2020-12-04,9608211


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Province/State  104013 non-null  object        
 1   Country/Region  330327 non-null  object        
 2   Lat             328041 non-null  float64       
 3   Long            328041 non-null  float64       
 4   date            330327 non-null  datetime64[ns]
 5   value           330327 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 15.1+ MB
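
As a small illustration of the conversion, the sketch below parses a few date strings in the same `month/day/two-digit-year` layout. The sample strings are taken from dates that appear in the outputs above; the variable names are illustrative only.

```python
import pandas as pd

# Date strings in the layout used by the column headers of the source file.
raw_dates = pd.Series(['1/22/20', '12/4/20', '3/9/23'])

# %m = month, %d = day, %y = two-digit year; non-padded values such as '1' are accepted.
parsed = pd.to_datetime(raw_dates, format='%m/%d/%y')
print(parsed)
```

Passing an explicit `format` avoids any ambiguity between month-first and day-first readings of strings such as `3/9/23`.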


## 1. Which month had the highest number of infections?

In [38]:
# df['month_year'] = df['date'].dt.strftime('%Y-%m')  ## Slow to execute: takes 12 seconds
df_sorted = df.copy()
df_sorted.sort_values(by=['Country/Region', 'Province/State', 'date'], inplace=True)
df_sorted.drop('Province/State', axis=1, inplace=True)
# Aggregate values by 'Country/Region' and 'date'
df_sorted = df_sorted.groupby(['Country/Region', 'date'])['value'].sum().reset_index()
df_sorted['new_cases'] = df_sorted.groupby('Country/Region')['value'].transform(lambda x: x - x.shift(1).fillna(0))
df_sorted['month_year'] = df_sorted['date'].dt.to_period('M')
df_sorted

Unnamed: 0,Country/Region,date,value,new_cases,month_year
0,Afghanistan,2020-01-22,0,0.0,2020-01
1,Afghanistan,2020-01-23,0,0.0,2020-01
2,Afghanistan,2020-01-24,0,0.0,2020-01
3,Afghanistan,2020-01-25,0,0.0,2020-01
4,Afghanistan,2020-01-26,0,0.0,2020-01
...,...,...,...,...,...
229738,Zimbabwe,2023-03-05,264127,0.0,2023-03
229739,Zimbabwe,2023-03-06,264127,0.0,2023-03
229740,Zimbabwe,2023-03-07,264127,0.0,2023-03
229741,Zimbabwe,2023-03-08,264276,149.0,2023-03
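
The `new_cases` column turns cumulative totals into daily increments within each country. The sketch below checks that rule on a hypothetical toy frame and compares it with an equivalent formulation based on `diff()`; the data and variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical cumulative totals for two countries, already ordered by date.
toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1, 4, 4, 0, 2, 7],
})

grouped = toy.groupby('Country/Region')['value']

# Same rule as the cell above: today's total minus yesterday's total;
# the first row of each country keeps its own total.
via_lambda = grouped.transform(lambda x: x - x.shift(1).fillna(0))

# diff() leaves NaN on each country's first row, so fill it with that row's total.
via_diff = grouped.diff().fillna(toy['value'])

print(pd.DataFrame({'via_lambda': via_lambda, 'via_diff': via_diff}))
```

Both columns match row for row. Note that a daily increment computed this way can be negative if a cumulative total is ever revised downward.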


In [39]:
df_agg_month = df_sorted.groupby(['month_year'], as_index=False)['new_cases'].sum()
df_agg_month

Unnamed: 0,month_year,new_cases
0,2020-01,9927.0
1,2020-02,76096.0
2,2020-03,783348.0
3,2020-04,2412716.0
4,2020-05,2901229.0
5,2020-06,4292072.0
6,2020-07,7118656.0
7,2020-08,7941296.0
8,2020-09,8498335.0
9,2020-10,12122070.0


In [40]:
cond_max_cases = df_agg_month['new_cases'] == df_agg_month['new_cases'].max()
df_agg_month[cond_max_cases]

Unnamed: 0,month_year,new_cases
24,2022-01,90483564.0


In [42]:
max_month = df_agg_month[cond_max_cases]['month_year'].iloc[0]
print(f'The month and year with the most cases is {max_month}')

The month and year with the most cases is 2022-01
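
The peak month can also be located with `idxmax`, which returns the index label of the largest value. This is a sketch on a hypothetical three-month table, not the notebook's data.

```python
import pandas as pd

# Hypothetical monthly totals in the same shape as df_agg_month.
monthly = pd.DataFrame({
    'month_year': pd.PeriodIndex(['2020-01', '2020-02', '2020-03'], freq='M'),
    'new_cases': [10.0, 50.0, 30.0],
})

# idxmax gives the row label of the maximum; loc then reads the month on that row.
peak_label = monthly['new_cases'].idxmax()
peak_month = monthly.loc[peak_label, 'month_year']
print(peak_month)
```

If two months tie, `idxmax` returns only the first one, whereas the boolean mask used above keeps every tied row.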


## 2. In that same month, which country reported the most infections?

In [43]:
cond = (df_sorted['month_year'] == max_month)
df_country = df_sorted[cond].groupby(['Country/Region'], as_index=False)['new_cases'].sum()
df_country.sort_values(by='new_cases').tail()

Unnamed: 0,Country/Region,new_cases
190,United Kingdom,4420003.0
86,Italy,4857433.0
80,India,6607920.0
63,France,9188713.0
186,US,20336435.0


In [44]:
df_country[df_country['new_cases'] == df_country['new_cases'].max()]

Unnamed: 0,Country/Region,new_cases
186,US,20336435.0
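
For a ranking like this one, `DataFrame.nlargest` is a compact alternative to sorting and taking the tail. The sketch below uses a hypothetical three-country table.

```python
import pandas as pd

# Hypothetical per-country totals in the same shape as df_country.
by_country = pd.DataFrame({
    'Country/Region': ['A', 'B', 'C'],
    'new_cases': [5.0, 9.0, 7.0],
})

# Rows with the two largest new_cases values, largest first.
top2 = by_country.nlargest(2, 'new_cases')
print(top2)
```

The first row of the result is the country with the most cases, so the same call answers both the top-five view and the single-maximum question.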


## 3. Which country has the fewest reported cases to date?

In [45]:
df_min_country = df.groupby('Country/Region', as_index=False).agg({'value':'sum'})
df_min_country[(df_min_country['value'] == df_min_country['value'].min())]

Unnamed: 0,Country/Region,value
93,"Korea, North",300
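
Because `value` holds a running total for each date, summing it over every date counts the same cases once per day, so the 300 above is a sum of daily totals rather than a case count. A minimal sketch of an alternative, assuming the cumulative reading of `value` used in question 1: keep only the most recent date, then sum per country. The `toy` frame and its numbers are hypothetical, chosen so that the two approaches disagree.

```python
import pandas as pd

# Hypothetical cumulative totals for two countries over three days.
toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'] * 2),
    'value': [0, 0, 4, 2, 2, 3],
})

# Summing every date adds up running totals: A -> 4, B -> 7.
summed = toy.groupby('Country/Region')['value'].sum()

# Keep only the latest date, then sum per country (this also merges provinces).
latest = toy[toy['date'] == toy['date'].max()]
totals = latest.groupby('Country/Region', as_index=False)['value'].sum()

# Country or countries with the smallest total to date: here B, with 3.
lowest = totals[totals['value'] == totals['value'].min()]
print(lowest)
```

In this toy data the all-dates sum picks A while the latest-date total picks B, which shows how the choice of aggregation can change the answer.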

