#### Evaluating bivariate and multivariate relationships between predictor and response variables

##### 1. Detecting outliers and extreme values in bivariate relationships

Importing libraries

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Loading the data

In [9]:
covidtotals = pd.read_csv( filepath_or_buffer = './data/covidtotals.csv', index_col = 'iso_code' )
covidtotals.info()

<class 'pandas.core.frame.DataFrame'>
Index: 221 entries, AFG to ZWE
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   lastdate             221 non-null    object 
 1   location             221 non-null    object 
 2   total_cases          192 non-null    float64
 3   total_deaths         185 non-null    float64
 4   total_cases_mill     192 non-null    float64
 5   total_deaths_mill    185 non-null    float64
 6   population           221 non-null    float64
 7   population_density   206 non-null    float64
 8   median_age           190 non-null    float64
 9   gdp_per_capita       193 non-null    float64
 10  aged_65_older        188 non-null    float64
 11  total_tests_thous    13 non-null     float64
 12  life_expectancy      217 non-null    float64
 13  hospital_beds_thous  170 non-null    float64
 14  diabetes_prevalence  200 non-null    float64
 15  region               221 non-null    object

A good way to start a bivariate analysis is to compute the correlations between the variables.

In [10]:
totvars = ['location','total_cases_mill','total_deaths_mill']
demovars = ['population_density','aged_65_older', 'gdp_per_capita','life_expectancy', 'diabetes_prevalence']
covidkeys = covidtotals.loc[:,totvars + demovars]
covidkeys.head().T

iso_code,AFG,ALB,DZA,AND,AGO
location,Afghanistan,Albania,Algeria,Andorra,Angola
total_cases_mill,3314.321,46061.922,3261.77,181466.382,1201.566
total_deaths_mill,139.102,853.43,86.338,1643.694,28.144
population_density,54.422,104.871,17.348,163.755,23.89
aged_65_older,2.581,13.188,6.211,,2.405
gdp_per_capita,1803.987,11803.431,13913.839,,5819.495
life_expectancy,64.83,78.57,76.88,83.73,61.15
diabetes_prevalence,9.59,10.08,6.73,7.97,3.94
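
Several of these columns contain missing values (for example, `aged_65_older` and `gdp_per_capita` are empty for Andorra above), so it helps to know how many observations back each correlation. The sketch below uses a small made-up DataFrame (the names `toy`, `x`, `y`, `z` are illustrative assumptions, not part of the dataset) to show that `DataFrame.corr` works with pairwise-complete observations, which can differ from dropping every incomplete row first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in different rows
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 6.0, np.nan, 10.0],
    'z': [5.0, np.nan, 4.0, 1.0, 2.0],
})

# Each coefficient uses only the rows where BOTH columns are present
pairwise = toy.corr(method='pearson')

# Alternative: keep only rows that are complete in every column
listwise = toy.dropna().corr(method='pearson')

# Number of rows shared by each pair of columns
present = toy.notna().astype(int)
n_obs = present.T @ present

print(pairwise.round(3))
print(listwise.round(3))
print(n_obs)
```

Checking `n_obs` next to the coefficients is one way to avoid over-reading a correlation that rests on very few rows.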


Now we build the Pearson correlation matrix.

In [44]:
# View the results as a full matrix
#corrmatrix = covidkeys.corr(method = 'pearson', numeric_only = True)

# Reshape to long format, drop self-correlations, and sort from highest to lowest
corrmatrix = (covidkeys
              .corr(method='pearson', numeric_only=True)
              .reset_index(names='Variable 1')
              .melt(id_vars='Variable 1', var_name='Variable 2')
              .query(' `Variable 1` != `Variable 2` ')
              .sort_values(by = 'value', ascending = False)
              )

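# Note: the two names are concatenated into one string first, so set() yields
# the set of its characters, not the pair of variable names
corrmatrix['aux'] = (corrmatrix['Variable 1'] + corrmatrix['Variable 2']).apply(set)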

corrmatrix

Unnamed: 0,Variable 1,Variable 2,value,aux
38,aged_65_older,life_expectancy,0.729937,"{_, a, o, f, c, t, d, y, p, e, g, x, l, r, n, ..."
26,life_expectancy,aged_65_older,0.729937,"{_, a, o, f, c, t, y, d, p, e, g, x, l, r, n, ..."
1,total_deaths_mill,total_cases_mill,0.709783,"{_, o, a, c, t, d, m, e, s, l, i, h}"
7,total_cases_mill,total_deaths_mill,0.709783,"{_, o, a, c, t, m, d, e, s, l, i, h}"
39,gdp_per_capita,life_expectancy,0.681222,"{_, a, c, f, t, d, y, p, e, g, x, l, r, n, i}"
33,life_expectancy,gdp_per_capita,0.681222,"{_, a, f, c, t, y, d, p, e, g, x, l, r, n, i}"
5,life_expectancy,total_cases_mill,0.570582,"{_, a, o, f, c, t, y, m, p, e, s, x, l, n, i}"
35,total_cases_mill,life_expectancy,0.570582,"{_, o, a, c, f, t, m, y, p, e, s, x, l, n, i}"
21,total_cases_mill,aged_65_older,0.533905,"{_, o, a, c, t, m, d, e, s, g, l, r, i, 5, 6}"
3,aged_65_older,total_cases_mill,0.533905,"{_, a, o, c, t, d, m, e, s, g, l, r, i, 5, 6}"
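
Every pair appears twice in the output above because the correlation matrix is symmetric. One possible way to list each pair only once is to mask everything except the upper triangle before reshaping. This is a minimal sketch on random data; the `demo` DataFrame and its column names are assumptions made for illustration, and the reshaping steps mirror the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for covidkeys: 4 numeric columns of random data
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=['a', 'b', 'c', 'd'])

corr = demo.corr(method='pearson')

# True strictly above the diagonal: drops self-pairs and mirrored pairs
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = (corr
         .where(upper)                        # masked cells become NaN
         .reset_index(names='Variable 1')
         .melt(id_vars='Variable 1', var_name='Variable 2')
         .dropna(subset=['value'])            # discard the masked cells
         .sort_values(by='value', ascending=False)
         )

print(pairs)
```

With four columns this leaves 4 * 3 / 2 = 6 rows, one per unordered pair.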


In [34]:
# Create an example DataFrame
data = {
    'Variable 1': ['aged_65_older', 'life_expectancy', 'total_deaths_mill', 'total_cases_mill'],
    'Variable 2': ['life_expectancy', 'aged_65_older', 'total_cases_mill', 'total_deaths_mill'],
    'value': [0.729937, 0.729937, 0.709783, 0.709783]
}

df = pd.DataFrame(data)

# Concatenate the "Variable 1" and "Variable 2" columns and convert the resulting string to a set (of characters)
df['Combined Set'] = df['Variable 1'] + df['Variable 2']
df['Combined Set'] = df['Combined Set'].apply(set)

# Show the resulting DataFrame
df

Unnamed: 0,Variable 1,Variable 2,value,Combined Set
0,aged_65_older,life_expectancy,0.729937,"{_, a, o, f, c, t, d, y, p, e, g, x, l, r, n, ..."
1,life_expectancy,aged_65_older,0.729937,"{_, a, o, f, c, t, y, d, p, e, g, x, l, r, n, ..."
2,total_deaths_mill,total_cases_mill,0.709783,"{_, o, a, c, t, d, m, e, s, l, i, h}"
3,total_cases_mill,total_deaths_mill,0.709783,"{_, o, a, c, t, m, d, e, s, l, i, h}"
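
As the output shows, applying `set` to the concatenated string produces a set of characters, so the column cannot reliably identify a pair of variables. A possible alternative, sketched here under the assumption that the goal is to recognise mirrored pairs, is to build a `frozenset` from the two names themselves. The names `example`, `pair` and `unique_pairs` are illustrative.

```python
import pandas as pd

# Same example data as in the cell above
example = pd.DataFrame({
    'Variable 1': ['aged_65_older', 'life_expectancy', 'total_deaths_mill', 'total_cases_mill'],
    'Variable 2': ['life_expectancy', 'aged_65_older', 'total_cases_mill', 'total_deaths_mill'],
    'value': [0.729937, 0.729937, 0.709783, 0.709783],
})

# Order-independent key: {A, B} equals {B, A}, and frozenset is hashable
example['pair'] = [frozenset(p) for p in zip(example['Variable 1'], example['Variable 2'])]

# Keep the first row of each mirrored pair
unique_pairs = example.drop_duplicates(subset='pair')

print(unique_pairs[['Variable 1', 'Variable 2', 'value']])
```

A `frozenset` is used rather than a plain `set` because it is hashable, which `drop_duplicates` needs in order to compare the keys.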
