## Caso COVID-19 Fase 1
Julián Correa, Guillermo Cortés, Cristian Sarmiento

## Creditos de la información
La data utilizada en el ejercicio es tomada del **repositorio de datos COVID-19 del Centro de Ciencias de Datos e Ingeniria de Sistemas (CSSE) de la Universidad Johns Hopkins (https://github.com/CSSEGISandData/COVID-19)[https://github.com/CSSEGISandData/COVID-19]** 

### Preprocesarmiento de información

In [1]:
## import libraries and dependencies
import pandas as pd

## read downloaded csv file
covid_cases = pd.read_csv('time_series_covid19_confirmed_global.csv')
covid_cases.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,209322,209340,209358,209362,209369,209390,209406,209436,209451,209451
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,334391,334408,334408,334427,334427,334427,334427,334427,334443,334457
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,271441,271448,271463,271469,271469,271477,271477,271490,271494,271496
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,47866,47875,47875,47875,47875,47875,47875,47875,47890,47890
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,105255,105277,105277,105277,105277,105277,105277,105277,105288,105288


Al realizar la inspección nos encontramos con que la tabla esta en formato horizontan donde las fechas son columnas en el data frame lo que hace complicado el procesamiento de la información. Procedemos hacer el comando **melt** de pandas para convertir **date** en variable y **value** como otra variable que contiene el numero de casos.

In [2]:
## melt dataframe to get dates
id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long']
df = covid_cases.melt(id_vars=id_vars, var_name='date')
df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
98217,,Sweden,60.128161,18.643501,12/26/20,396048
215162,,Hungary,47.1625,19.5033,2/4/22,1616846
99036,Curacao,Netherlands,12.1696,-68.99,12/29/20,4179
234038,,Solomon Islands,-9.6457,160.1562,4/10/22,12005
74399,Reunion,France,-21.1151,55.5364,10/5/20,4328


## Revisión de tipos de variables

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  104013 non-null  object 
 1   Country/Region  330327 non-null  object 
 2   Lat             328041 non-null  float64
 3   Long            328041 non-null  float64
 4   date            330327 non-null  object 
 5   value           330327 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 15.1+ MB


Encontramos que la variable **date** no corresponde al tipo de valor de fecha, procedemos hacer el ajuste convirtiendo **object** a tipo **datetime64**

In [4]:
## Transform date column to datetime format
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

df.sample(n=5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,value
176297,,Argentina,-38.4161,-63.6167,2021-09-23,5246998
303817,Liaoning,China,41.2956,122.6085,2022-12-08,2831
298451,,New Zealand,-40.9006,174.886,2022-11-19,1894029
262992,,Algeria,28.0339,1.6596,2022-07-20,266591
120454,,Senegal,14.4974,-14.4524,2021-03-13,36726


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330327 entries, 0 to 330326
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Province/State  104013 non-null  object        
 1   Country/Region  330327 non-null  object        
 2   Lat             328041 non-null  float64       
 3   Long            328041 non-null  float64       
 4   date            330327 non-null  datetime64[ns]
 5   value           330327 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 15.1+ MB


## 1. ¿En cuál mes se presentó el mayor número de contagios?

In [6]:
df['month'] = df['date'].apply(lambda x: x.month)
df_sorted = df.copy()
df_sorted.sort_values(by=['Country/Region', 'Province/State', 'date'], inplace=True)
df_sorted['new_cases'] = df.groupby(['Country/Region', 'Province/State'])['value'].diff().fillna(0)

In [7]:
df_sorted['month'] = df_sorted['date'].dt.month 
df_sorted['year'] = df_sorted['date'].dt.year
df_agg_month = df_sorted.groupby(['month', 'year'], as_index=False)['new_cases'].sum()
df_agg_month

Unnamed: 0,month,year,new_cases
0,1,2020,9267.0
1,1,2021,212988.0
2,1,2022,3348699.0
3,1,2023,560610.0
4,2,2020,69599.0
5,2,2021,104654.0
6,2,2022,1413694.0
7,2,2023,111660.0
8,3,2020,20899.0
9,3,2021,139870.0


In [8]:
df_agg_month[df_agg_month['new_cases'] == df_agg_month['new_cases'].max()]

Unnamed: 0,month,year,new_cases
2,1,2022,3348699.0


#### 2. ¿En ese mismo mes, cuál fue el país que reportó más contagios?

In [9]:
max_month = df_agg_month.iloc[0]['month']
max_year = df_agg_month.iloc[0]['year']
print(f'El mes y el año con mas casos es {int(max_month)}-{int(max_year)}')

El mes y el año con mas casos es 1-2020


In [12]:
cond = ((df_sorted['month'] == max_month) & (df_sorted['year'] == max_year))
df_country = df_sorted[cond]
df_country = df_country.groupby(['Country/Region', 'month', 'year'], as_index=False)['new_cases'].sum()
df_country_max[df_country['new_cases'] == df_agg_month['new_cases'].max()]

Unnamed: 0,Country/Region,month,year,new_cases
0,Afghanistan,1,2020,0.0
1,Albania,1,2020,0.0
2,Algeria,1,2020,0.0
3,Andorra,1,2020,0.0
4,Angola,1,2020,0.0
...,...,...,...,...
196,West Bank and Gaza,1,2020,0.0
197,Winter Olympics 2022,1,2020,0.0
198,Yemen,1,2020,0.0
199,Zambia,1,2020,0.0


#### 3. ¿Cuál es el país con el menor número de casos reportados hasta la fecha?

In [11]:
df_min_country = df.groupby('Country/Region', as_index=False).agg({'value':'sum'})
df_min_country[(df_min_country['value'] == df_min_country['value'].min())]

Unnamed: 0,Country/Region,value
93,"Korea, North",300


***
___