## Pair programming exercises, 23 January: ANOVA

In [1]:
# Data handling
# -----------------------------------------------------------------------
import numpy as np
import pandas as pd
import random 

# Statistics
# -----------------------------------------------------------------------
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv("../datos/world_risk_index_sin_outliers_est.csv", index_col = 0)
df.head(2)

| | region | exposure_category | wri_category | vulnerability_category | susceptibility_category | wri | exposure | vulnerability | susceptibility | lack_of_coping_capabilities | lack_of_adaptive_capacities | year | exposure_Sklearn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Papua-Neuguinea | Very High | Very High | Very High | Very High | 2.90648 | 23.26 | 1.296928 | 1.179006 | 0.962932 | 1.537045 | 2011.0 | 0.895683 |
| 1 | Madagaskar | Very High | Very High | Very High | Very High | 2.594391 | 20.68 | 1.545395 | 2.260942 | 1.017385 | 0.974085 | 2011.0 | 0.792566 |


In [3]:
outliers = pd.read_csv("../datos/world_risk_index_outliers_est.csv", index_col = 0)
outliers.head(2)

| | region | exposure_category | wri_category | vulnerability_category | susceptibility_category | wri | exposure | vulnerability | susceptibility | lack_of_coping_capabilities | lack_of_adaptive_capacities | year | exposure_Sklearn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Vanuatu | Very High | Very High | High | High | 1.640675 | 56.33 | 0.801253 | 0.792708 | 0.541556 | 0.926242 | 2011.0 | 0.563758 |
| 1 | Tonga | Very High | Very High | Medium | Medium | 1.29257 | 56.04 | 0.376459 | 0.030528 | 0.707655 | 0.185736 | 2011.0 | 0.560853 |


### Column info
| Column | Data type | Description |
|--------|-----------|-------------|
| Region | String | Name of the region. |
| WRI | Decimal | World Risk Score of the region. |
| Exposure | Decimal | Risk/exposure to natural hazards such as earthquakes, hurricanes, floods, droughts, and sea level rise. |
| Vulnerability | Decimal | Vulnerability depending on infrastructure, nutrition, housing situation, and economic framework conditions. |
| Susceptibility | Decimal | Susceptibility depending on infrastructure, nutrition, housing situation, and economic framework conditions. |
| Lack of Coping Capabilities | Decimal | Coping capacities in dependence of governance, preparedness and early warning, medical care, and social and material security. |
| Lack of Adaptive Capacities | Decimal | Adaptive capacities related to coming natural events, climate change, and other challenges. |
| Year | Decimal | Year data is being described. |
| WRI Category | String | WRI Category for the given WRI Score. |
| Exposure Category | String | Exposure Category for the given Exposure Score. |
| Vulnerability Category | String | Vulnerability Category for the given Vulnerability Score. |
| Susceptibility Category | String | Susceptibility Category for the given Susceptibility Score. |

Link to the dataset: https://www.kaggle.com/datasets/tr1gg3rtrash/global-disaster-risk-index-time-series-dataset

### Our response variable is exposure_Sklearn; we want to know how the risk of natural disasters depends on the remaining variables

In [4]:
df.columns

Index(['region', 'exposure_category', 'wri_category', 'vulnerability_category',
       'susceptibility_category', 'wri', 'exposure', 'vulnerability',
       'susceptibility', 'lack_of_coping_capabilities',
       'lack_of_adaptive_capacities', 'year', 'exposure_Sklearn'],
      dtype='object')

In [5]:
lm = ols("exposure_Sklearn ~ region  + exposure_category + wri_category + vulnerability_category + susceptibility_category + wri + vulnerability + susceptibility + lack_of_coping_capabilities + lack_of_adaptive_capacities + year", data=df).fit()
sm.stats.anova_lm(lm)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| region | 282.0 | 52.15227 | 0.1849371 | 1515.901394 | 0.0 |
| exposure_category | 4.0 | 2.431976 | 0.607994 | 4983.634289 | 0.0 |
| wri_category | 4.0 | 0.2696753 | 0.06741882 | 552.621759 | 3.58138e-286 |
| vulnerability_category | 4.0 | 0.009013377 | 0.002253344 | 18.470318 | 8.374934e-15 |
| susceptibility_category | 4.0 | 0.004095541 | 0.001023885 | 8.392631 | 1.087883e-06 |
| wri | 1.0 | 0.9898364 | 0.9898364 | 8113.537339 | 0.0 |
| vulnerability | 1.0 | 0.3879276 | 0.3879276 | 3179.783122 | 0.0 |
| susceptibility | 1.0 | 0.0003623731 | 0.0003623731 | 2.970317 | 0.08502588 |
| lack_of_coping_capabilities | 1.0 | 0.0003803809 | 0.0003803809 | 3.117924 | 0.07765316 |
| lack_of_adaptive_capacities | 1.0 | 5.938657e-07 | 5.938657e-07 | 0.004868 | 0.9443868 |


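The `df` column gives each term's degrees of freedom. A numeric predictor always uses 1, while a categorical predictor uses one fewer than its number of levels. That is why *region* (282) and *exposure_category*, *wri_category*, *vulnerability_category* and *susceptibility_category* (4 each) stand out as the categorical columns, and every row with a value of 1 is numeric.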

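The F statistic measures how much of the variation in the response variable each predictor accounts for, relative to the residual variance. The largest values belong to *wri* (8113.5), *exposure_category* (4983.6) and *vulnerability* (3179.8), so these are the terms with the strongest apparent influence.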
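As a quick sanity check on how to read the table, each F is the term's `mean_sq` divided by the residual mean square. The residual row is not shown in the output, so this minimal sketch backs that value out from a few rows copied from the table. The variable names are illustrative only.

```python
# (mean_sq, F) pairs copied from the ANOVA table for the main dataframe.
rows = {
    "region": (0.1849371, 1515.901394),
    "exposure_category": (0.607994, 4983.634289),
    "wri": (0.9898364, 8113.537339),
    "vulnerability": (0.3879276, 3179.783122),
}

# F = mean_sq / mean_sq_residual, so mean_sq_residual = mean_sq / F.
implied_residual_ms = {term: ms / f for term, (ms, f) in rows.items()}

for term, value in implied_residual_ms.items():
    print(f"{term:20s} {value:.6e}")

# Every row should imply the same residual mean square (up to rounding).
values = list(implied_residual_ms.values())
spread = (max(values) - min(values)) / min(values)
print(f"relative spread: {spread:.2e}")
```

All four rows imply a residual mean square of roughly 1.22e-4, so a large F here simply means the term's mean square is many times that residual value.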

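In the PR(>F) column, *susceptibility* (0.085), *lack_of_coping_capabilities* (0.078) and *lack_of_adaptive_capacities* (0.944) are above 0.05, so at the 5% level we find no evidence that they influence our response variable. We reached the same conclusion for *year*, although its row is not shown in the output above.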

In [6]:
lm.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | exposure_Sklearn | R-squared: | 0.997 |
| Model: | OLS | Adj. R-squared: | 0.996 |
| Method: | Least Squares | F-statistic: | 1517.0 |
| Date: | Mon, 23 Jan 2023 | Prob (F-statistic): | 0.0 |
| Time: | 20:05:43 | Log-Likelihood: | 5434.1 |
| No. Observations: | 1706 | AIC: | -10260.0 |
| Df Residuals: | 1401 | BIC: | -8598.0 |
| Df Model: | 304 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0752 | 0.265 | -0.284 | 0.777 | -0.595 | 0.445 |
| region[T.Albania] | 0.0697 | 0.011 | 6.323 | 0.000 | 0.048 | 0.091 |
| region[T.Albanien] | 0.0773 | 0.009 | 8.956 | 0.000 | 0.060 | 0.094 |
| region[T.Algeria] | 0.0298 | 0.011 | 2.763 | 0.006 | 0.009 | 0.051 |
| region[T.Algerien] | 0.0327 | 0.008 | 3.965 | 0.000 | 0.017 | 0.049 |
| region[T.Angola] | 0.0063 | 0.005 | 1.246 | 0.213 | -0.004 | 0.016 |
| region[T.Argentina] | -0.0372 | 0.012 | -3.187 | 0.001 | -0.060 | -0.014 |
| region[T.Argentinien] | -0.0318 | 0.009 | -3.411 | 0.001 | -0.050 | -0.014 |
| region[T.Armenia] | 0.0188 | 0.011 | 1.667 | 0.096 | -0.003 | 0.041 |

| | | | |
|---|---|---|---|
| Omnibus: | 203.339 | Durbin-Watson: | 1.768 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1453.934 |
| Skew: | 0.3 | Prob(JB): | 0.0 |
| Kurtosis: | 7.483 | Cond. No. | 11700000.0 |


- For some regions the P>|t| of the *region* coefficients is below 0.05, so we keep this column (region influences the response variable).
- *exposure_category* also has an influence.
- The P>|t| column is dominated by *region*, which has 283 levels and contributes 282 of the model's 304 coefficients. Several countries appear under both an English and a German spelling (*Albania*/*Albanien*, *Algeria*/*Algerien*), which inflates that count.
- R-squared is 0.997, so our predictor variables explain 99.7% of the variance in the response variable (adjusted R-squared: 0.996).
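With this many coefficients it is worth checking how much the adjusted R-squared penalises the fit. A minimal sketch, using only the values reported by `lm.summary()` (R-squared is taken as the rounded 0.997, so the result is approximate):

```python
# Values read from the lm.summary() output.
r_squared = 0.997   # reported rounded to three decimals
n_obs = 1706        # No. Observations
df_model = 304      # Df Model (number of fitted slopes)
df_resid = n_obs - df_model - 1   # matches "Df Residuals" = 1401

# Adjusted R^2 rescales the unexplained share by (n - 1) / df_resid,
# which penalises models with many parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
print(df_resid, round(adj_r_squared, 3))   # 1401 0.996
```

The penalty is small here because 1706 observations still leave 1401 residual degrees of freedom, even with 304 model terms.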

In [7]:
outliers.head()

| | region | exposure_category | wri_category | vulnerability_category | susceptibility_category | wri | exposure | vulnerability | susceptibility | lack_of_coping_capabilities | lack_of_adaptive_capacities | year | exposure_Sklearn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Vanuatu | Very High | Very High | High | High | 1.640675 | 56.33 | 0.801253 | 0.792708 | 0.541556 | 0.926242 | 2011.0 | 0.563758 |
| 1 | Tonga | Very High | Very High | Medium | Medium | 1.29257 | 56.04 | 0.376459 | 0.030528 | 0.707655 | 0.185736 | 2011.0 | 0.560853 |
| 2 | Philippinen | Very High | Very High | High | High | 0.72511 | 45.09 | 0.552087 | 0.592868 | 0.773824 | 0.106661 | 2011.0 | 0.451167 |
| 3 | Salomonen | Very High | Very High | Very High | High | 0.628547 | 36.4 | 1.475212 | 1.440562 | 0.987862 | 1.731821 | 2011.0 | 0.364119 |
| 4 | Guatemala | Very High | Very High | High | High | 0.315014 | 38.42 | 0.588423 | 0.627259 | 0.439602 | 0.589349 | 2011.0 | 0.384353 |


In [8]:
outliers.isnull().sum()

region                         0
exposure_category              0
wri_category                   0
vulnerability_category         0
susceptibility_category        0
wri                            0
exposure                       0
vulnerability                  0
susceptibility                 0
lack_of_coping_capabilities    0
lack_of_adaptive_capacities    0
year                           0
exposure_Sklearn               0
dtype: int64

In [9]:
lm_outliers = ols("exposure_Sklearn ~ region  + exposure_category + wri_category + vulnerability_category + susceptibility_category + wri + vulnerability + susceptibility + lack_of_coping_capabilities + lack_of_adaptive_capacities + year", data=outliers).fit()
sm.stats.anova_lm(lm_outliers)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| region | 34.0 | 4.096072 | 0.120473 | 2230.781605 | 3.856812e-198 |
| exposure_category | 1.0 | 0.000751 | 0.000751 | 13.898944 | 0.0002667701 |
| wri_category | 3.0 | 0.013438 | 0.004479 | 82.944525 | 1.71913e-32 |
| vulnerability_category | 4.0 | 0.010774 | 0.002693 | 49.87483 | 3.000188e-27 |
| susceptibility_category | 4.0 | 0.033227 | 0.008307 | 153.816624 | 6.534501e-54 |
| wri | 1.0 | 0.304438 | 0.304438 | 5637.24385 | 3.22113e-127 |
| vulnerability | 1.0 | 0.019489 | 0.019489 | 360.884488 | 5.774573e-43 |
| susceptibility | 1.0 | 6.5e-05 | 6.5e-05 | 1.20552 | 0.2738606 |
| lack_of_coping_capabilities | 1.0 | 0.001618 | 0.001618 | 29.963477 | 1.658478e-07 |
| lack_of_adaptive_capacities | 1.0 | 6.7e-05 | 6.7e-05 | 1.241386 | 0.2668646 |


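The columns we keep are largely the same as for the dataframe `df`: *susceptibility* and *lack_of_adaptive_capacities* are again above 0.05. The one difference is *lack_of_coping_capabilities*, which is significant for the outliers (p ≈ 1.7e-07) but not for `df` (p ≈ 0.078).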
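One way to make this comparison explicit is to filter both ANOVA tables at the same threshold and compare the resulting sets. The sketch below is a minimal version that uses the PR(>F) values copied from the two outputs above. The helper `significant_terms` and the 0.05 cut-off constant are illustrative names, not part of statsmodels.

```python
ALPHA = 0.05  # significance threshold used throughout the notebook

# PR(>F) values copied from the ANOVA table for `df`.
p_main = {
    "region": 0.0,
    "exposure_category": 0.0,
    "wri_category": 3.58138e-286,
    "vulnerability_category": 8.374934e-15,
    "susceptibility_category": 1.087883e-06,
    "wri": 0.0,
    "vulnerability": 0.0,
    "susceptibility": 0.08502588,
    "lack_of_coping_capabilities": 0.07765316,
    "lack_of_adaptive_capacities": 0.9443868,
}

# PR(>F) values copied from the ANOVA table for `outliers`.
p_outliers = {
    "region": 3.856812e-198,
    "exposure_category": 0.0002667701,
    "wri_category": 1.71913e-32,
    "vulnerability_category": 3.000188e-27,
    "susceptibility_category": 6.534501e-54,
    "wri": 3.22113e-127,
    "vulnerability": 5.774573e-43,
    "susceptibility": 0.2738606,
    "lack_of_coping_capabilities": 1.658478e-07,
    "lack_of_adaptive_capacities": 0.2668646,
}


def significant_terms(p_values, alpha=ALPHA):
    """Return the set of terms whose p-value is below alpha."""
    return {term for term, p in p_values.items() if p < alpha}


kept_main = significant_terms(p_main)
kept_outliers = significant_terms(p_outliers)

print("kept in both:    ", sorted(kept_main & kept_outliers))
print("only in df:      ", sorted(kept_main - kept_outliers))
print("only in outliers:", sorted(kept_outliers - kept_main))
```

With a fitted model at hand, the same filter could be applied directly to the `PR(>F)` column of the DataFrame returned by `sm.stats.anova_lm` instead of hand-copied values.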

lm.summary()