### Types of missing data:

**MCAR: Missing Completely At Random**  
The missing values occur completely at random: whether a value is missing depends on neither the observed nor the unobserved data. Example: a device such as a security camera stops working.  
*How to handle this kind of missing data? Apply data deletion or imputation (imputation is generally preferred).*  
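
As a rough illustration of the two options, the sketch below applies deletion and mean imputation to synthetic MCAR data. The `reading` column, the 20% missing rate, and the normal distribution are assumptions made for the example, not part of the data set used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic complete data (hypothetical sensor readings).
full = pd.Series(rng.normal(loc=50, scale=10, size=5_000), name="reading")

# MCAR: every value has the same 20% chance of being lost.
mcar_mask = rng.random(full.size) < 0.20
observed = full.mask(mcar_mask)

deleted = observed.dropna()                 # data deletion
imputed = observed.fillna(observed.mean())  # single (mean) imputation

print(f"complete mean: {full.mean():.2f}")
print(f"deleted mean:  {deleted.mean():.2f}")
print(f"imputed mean:  {imputed.mean():.2f}")
print(f"std, complete vs imputed: {full.std():.2f} vs {imputed.std():.2f}")
```

Under MCAR both strategies should leave the mean close to the complete-data mean, while mean imputation shrinks the spread, which is one reason to prefer richer imputation methods.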

**MAR: Missing At Random**  
The missing values depend on other observed values. Example: devices require periodic maintenance to ensure consistent operation, so data will be missing during those maintenance periods.  
*How to handle this kind of missing data? Single or multiple imputation (consider one or several columns during imputation).*  
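
A minimal sketch of multiple imputation on synthetic MAR data, assuming scikit-learn's `IterativeImputer` with `sample_posterior=True` as the imputation engine. The columns `x` and `y`, the logistic missingness model, and the choice of five imputations are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data: `y` depends on the fully observed `x`.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x": x, "y": y})

# MAR: the chance that `y` is missing depends only on the observed `x`.
p_missing = 1 / (1 + np.exp(-2 * x))
df_mar = df.copy()
df_mar.loc[rng.random(n) < p_missing, "y"] = np.nan

# Multiple imputation: draw several completed data sets and pool the estimates.
pooled = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df_mar), columns=df_mar.columns)
    pooled.append(completed["y"].mean())

print(f"true mean of y:      {df['y'].mean():.3f}")
print(f"observed-only mean:  {df_mar['y'].mean():.3f}")
print(f"pooled imputed mean: {np.mean(pooled):.3f}")
```

Because the imputer uses `x` to predict `y`, the pooled estimate should land much closer to the true mean than the mean of the observed values alone.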

**MNAR: Missing Not At Random**  
The missingness depends on the missing values themselves. This kind of missing data is very difficult to identify, and we may not even know that data is missing. Example: measurement tools have limitations, so attempting to record values outside the measurement range produces missing values, as with a scale that cannot detect very small or very large weights.  
*How to handle this kind of missing data? Missing values of this kind require a sensitivity analysis.*  
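
One simple form of sensitivity analysis is a delta adjustment: impute as if the data were MAR, then shift the imputed values by a range of offsets and watch how the estimate moves. The sketch below assumes a hypothetical scale with a lower detection limit of 2.0 and log-normal weights; the `delta` grid is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical weights; the scale cannot register anything below 2.0.
true_weight = pd.Series(rng.lognormal(mean=1.0, sigma=0.5, size=3_000))
detection_limit = 2.0

# MNAR: a value is missing because of its own (unobserved) magnitude.
observed = true_weight.where(true_weight >= detection_limit)

# Sensitivity analysis: fill with the observed mean shifted by delta
# and record the resulting estimate of the overall mean.
results = {}
for delta in [0.0, -0.5, -1.0, -1.5, -2.0]:
    fill_value = observed.mean() + delta
    results[delta] = observed.fillna(fill_value).mean()

sensitivity = pd.Series(results, name="estimated_mean").rename_axis("delta")
print(sensitivity)
print(f"true mean: {true_weight.mean():.3f}")
```

If the conclusion changes materially across plausible values of `delta`, the analysis is sensitive to the MNAR assumption and should be reported as such.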

In [2]:
import janitor
import matplotlib.pyplot as plt
import missingno
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
import session_info
import sklearn.compose
import sklearn.impute
import sklearn.preprocessing
import statsmodels.api as sm
import statsmodels.datasets
import statsmodels.formula.api as smf

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.neighbors import KNeighborsRegressor
from statsmodels.graphics.mosaicplot import mosaic

In [None]:
%run pandas-missing-extension.ipynb

In [8]:
arg_di_df = pd.read_csv('../data/processed/WDICSV_PROCESSED.csv').clean_names(case_type="snake")
print(arg_di_df.shape)
arg_di_df.info()

(24, 100)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 100 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   year                      24 non-null     int64  
 1   eg_elc_accs_zs            24 non-null     float64
 2   fx_own_totl_40_zs         4 non-null      float64
 3   fx_own_totl_60_zs         4 non-null      float64
 4   fx_own_totl_ol_zs         4 non-null      float64
 5   fx_own_totl_pl_zs         4 non-null      float64
 6   fx_own_totl_so_zs         4 non-null      float64
 7   fx_own_totl_yg_zs         4 non-null      float64
 8   fx_own_totl_zs            4 non-null      float64
 9   it_cel_sets               24 non-null     float64
 10  it_mlt_main               24 non-null     float64
 11  it_net_bbnd               21 non-null     float64
 12  it_net_user_zs            24 non-null     float64
 13  ny_gdp_mktp_kd            24 non-null     float64
 14  n

In [4]:
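# set_style ignores rc keys that are not part of the style definition
# (such as figure.figsize), so set_theme applies both settings.
sns.set_theme(
    style="whitegrid",
    rc={
        "figure.figsize": (8, 6)
    }
)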

### Evaluating the missing-data mechanism with a t-test

**Information**  
The `alternative` argument of `scipy.stats.ttest_ind` sets the alternative hypothesis:  
`two-sided`: the means of the distributions underlying the samples are unequal.  
`less`: the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample.  
`greater`: the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample.  

In [72]:
# Flag missing values of sh_dth_1014 and split that indicator by regime.
groupby_series = (
    arg_di_df
        .select("year_of_dictatorship", "sh_dth_1014")
        .transform_column("sh_dth_1014", lambda x: x.isna(), elementwise = False)
        .groupby("year_of_dictatorship")
        .sh_dth_1014
)

def missing_indicator(is_dictatorship):
    # 0/1 missingness indicator for one group, or an empty Series if the group is absent.
    if is_dictatorship not in groupby_series.groups:
        return pd.Series(dtype=float)
    return groupby_series.get_group(is_dictatorship).astype(float)

non_military_years = missing_indicator(False)
military_years = missing_indicator(True)

# ttest_ind needs two 1-D samples. Passing the full DataFrames raised
# "ValueError: Array shapes are incompatible for broadcasting."
scipy.stats.ttest_ind(
    a = military_years,
    b = non_military_years,
    alternative="two-sided"
)

In [12]:
(
    smf.ols(
        formula="sh_dth_1014 ~ sh_dth_1519",
        data = arg_di_df
    )
    .fit()
    .summary()
    .tables[0]
)

Dep. Variable:     sh_dth_1014        R-squared:            0.189
Model:             OLS                Adj. R-squared:       0.153
Method:            Least Squares      F-statistic:          5.139
Date:              Sat, 14 Dec 2024   Prob (F-statistic):   0.0336
Time:              11:07:45           Log-Likelihood:       -146.64
No. Observations:  24                 AIC:                  297.3
Df Residuals:      22                 BIC:                  299.6
Df Model:          1
Covariance Type:   nonrobust
