## COVID-19 Case, Phase 1
Julián Correa, Guillermo Cortés, Cristian Sarmiento

### Data preprocessing

In [45]:
## import libraries and dependencies
import pandas as pd

## read downloaded csv file
covid_cases = pd.read_csv('time_series_covid19_confirmed_global.csv')
covid_cases.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,209322,209340,209358,209362,209369,209390,209406,209436,209451,209451
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,334391,334408,334408,334427,334427,334427,334427,334427,334443,334457
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,271441,271448,271463,271469,271469,271477,271477,271490,271494,271496
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,47866,47875,47875,47875,47875,47875,47875,47875,47890,47890
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,105255,105277,105277,105277,105277,105277,105277,105277,105288,105288


On inspection, the table is in wide (horizontal) format: each date is a separate column of the data frame, which makes the data awkward to process. We use pandas' **melt** to reshape it to long format, so that **date** becomes one variable and **value** becomes another variable holding the number of cases.
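
To make the reshape concrete, here is a minimal sketch on a made-up two-country table (the names `wide` and `long` and all numbers are illustrative assumptions, not part of the case file). It shows how `melt` keeps the identifier columns and stacks every other column into `date`/`value` pairs.

```python
import pandas as pd

# Toy wide-format table (made-up numbers), shaped like the case file:
# one row per country, one column per date.
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# melt keeps id_vars as identifier columns and stacks the remaining
# columns into rows; the stacked values land in a column named "value"
# by default, and var_name names the column holding the old headers.
long = wide.melt(id_vars=["Country/Region"], var_name="date")
print(long)
```

Each original row contributes one output row per date column, so two countries and two dates give four rows.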

In [48]:
## melt dataframe to get dates
id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long']
df = covid_cases.melt(id_vars=id_vars, var_name='date')
df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
48356,,Colombia,4.5709,-74.2973,7/7/20,124494
291730,St Martin,France,18.0708,-63.0501,10/27/22,12026
311045,Qinghai,China,35.7452,95.9956,1/2/23,655
312122,,Algeria,28.0339,1.6596,1/6/23,271250
178489,,Madagascar,-18.766947,46.869107,9/30/21,42898


### Reviewing variable types

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  104013 non-null  object 
 1   Country/Region  330327 non-null  object 
 2   Lat             328041 non-null  float64
 3   Long            328041 non-null  float64
 4   date            330327 non-null  object 
 5   value           330327 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 15.1+ MB


The **date** variable is not stored as a date type, so we convert it from **object** to **datetime64**.
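
As a small illustration of the conversion, the sketch below parses a few hand-written strings in the same `month/day/two-digit-year` layout as the column headers (the `dates` series is an assumption made up for the example). Passing an explicit `format` avoids ambiguity between day-first and month-first readings.

```python
import pandas as pd

# Hand-written sample strings in the same layout as the date headers.
dates = pd.Series(["1/22/20", "12/31/21", "3/9/23"])

# %m = month, %d = day, %y = two-digit year.
parsed = pd.to_datetime(dates, format="%m/%d/%y")

# Once parsed, the .dt accessor exposes calendar components.
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```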

In [50]:
## Transform date column to datetime format
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
40959,,Pakistan,30.3753,69.3451,2020-06-11,125933
109443,Cook Islands,New Zealand,-21.2367,-159.7777,2021-02-03,0
273319,,Paraguay,-23.4425,-58.4438,2022-08-24,714340
214821,,Comoros,-11.6455,43.3333,2022-02-03,7870
84651,,United Arab Emirates,23.424076,53.847818,2020-11-09,143289


In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Province/State  104013 non-null  object        
 1   Country/Region  330327 non-null  object        
 2   Lat             328041 non-null  float64       
 3   Long            328041 non-null  float64       
 4   date            330327 non-null  datetime64[ns]
 5   value           330327 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 15.1+ MB


#### 1. In which month was the largest number of cases recorded?

In [54]:
## derive month and year from the date column
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
## sum the reported values per (month, year) pair
df_agg_month = df.groupby(['month', 'year'], as_index=False).agg({'value': 'sum'})
df_agg_month

Unnamed: 0,month,year,value
0,1,2020,38557
1,1,2021,2924193988
2,1,2022,10275647283
3,1,2023,20669212563
4,2,2020,1672160
5,2,2021,3062477618
6,2,2022,11575136242
7,2,2023,18853048685
8,3,2020,8977762
9,3,2021,3768712195


In [55]:
df_agg_month[df_agg_month['value'] == df_agg_month['value'].max()]

Unnamed: 0,month,year,value
3,1,2023,20669212563
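
The file name matches the Johns Hopkins CSSE time series, whose values are cumulative counts, and the sample rows shown earlier only ever increase. Summing running totals over a month adds the same cases again on every later day, so it favours late, long months and is not the same as the cases reported in that month. A minimal sketch of one alternative, on made-up data and assuming `value` is cumulative: difference each location's series to get daily new cases, then total those per month. The frame `toy` and the column `new_cases` are hypothetical names for this example.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical locations.
toy = pd.DataFrame({
    "Country/Region": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(
        ["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"] * 2
    ),
    "value": [1, 3, 6, 10, 0, 5, 5, 7],
})

toy = toy.sort_values(["Country/Region", "date"])

# Daily new cases = difference of the cumulative series within each location.
# The first day of each location has no predecessor, so it keeps its own value.
toy["new_cases"] = (
    toy.groupby("Country/Region")["value"].diff().fillna(toy["value"])
)

# Total new cases per calendar month, then the month with the largest total.
monthly = toy.groupby(toy["date"].dt.to_period("M"))["new_cases"].sum()
peak_month = monthly.idxmax()
print(monthly)
print(peak_month)
```

On the real frame the grouping key would also need `Province/State` (with `dropna=False`, since most rows have no province), because several countries are split across province rows.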


#### 2. In that same month, which country reported the most cases?

In [60]:
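## select the row with the largest monthly total
## (iloc[0] would return the first row, January 2020, not the peak)
max_row = df_agg_month.loc[df_agg_month['value'].idxmax()]
max_month = max_row['month']
max_year = max_row['year']
cond = ((df['month'] == max_month) & (df['year'] == max_year))
df_country = df[cond].groupby('Country/Region', as_index=False).agg({'value':'sum'})

df_country[df_country['value'] == df_country['value'].max()]

Output recorded with the earlier `iloc[0]` selection, which picks January 2020 rather than the peak month:

Unnamed: 0,Country/Region,value
37,China,38008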


#### 3. Which country has the fewest reported cases to date?

In [61]:
df_min_country = df.groupby('Country/Region', as_index=False).agg({'value':'sum'})
df_min_country[(df_min_country['value'] == df_min_country['value'].min())]

Unnamed: 0,Country/Region,value
93,"Korea, North",300
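
If `value` is a running total, a country's case count "to date" is its value on the last available date, added up over its provinces, rather than the sum over all dates. A sketch on made-up data under that assumption (the frame `toy` and the names `latest`, `totals` and `lowest` are hypothetical):

```python
import pandas as pd

# Made-up cumulative counts: country A has two provinces, country B has none.
toy = pd.DataFrame({
    "Province/State": ["P1", "P1", "P2", "P2", None, None],
    "Country/Region": ["A", "A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-03-08", "2023-03-09"] * 3),
    "value": [10, 12, 4, 5, 20, 21],
})

# Keep only the rows for the most recent date in the data.
latest = toy[toy["date"] == toy["date"].max()]

# Add up provinces to get one cumulative total per country.
totals = latest.groupby("Country/Region")["value"].sum()

# Country with the smallest cumulative total on the last date.
lowest = totals.idxmin()
print(totals)
print(lowest)
```

With this approach the ranking no longer depends on how many days a country appears in the file, only on its most recent cumulative figure.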


***
___