### Types of missing data:

**MCAR: Missing Completely At Random**  
The missing values in the data set occur completely at random. They don't depend on any other data. Example: when a device such a security camera stops working.  
*How to handle this kind of missing data? Apply data deletion or imputation (imputation is more recommended)*  

**MAR: Missing At Random**  
    The missing values depends on other observed values. Example: devices required a periodic maintenance to ensure consistent operation, so the data will be missing during those maintenance period.  
    *How to handle this kind of missing data? Single or multiple imputation (consider one or several columns during imputation)*  

**MNAR: Missing Not At Random**  
    The missing values depends on the missing values themselves. They are very difficult to identify. And we may not even know that the data is missing. Example: tools have limitations. When attempting to track data out in areas beyond the measurement range, missing values are generated. Example, a scale not detecting very small or very large values.  
    *How to handle this kind of missing data? This kind of missing values require to perform sensivity analysis*  

In [1]:
import janitor
import matplotlib.pyplot as plt
import missingno
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
import session_info
import sklearn.compose
import sklearn.impute
import sklearn.preprocessing
import statsmodels.api as sm
import statsmodels.datasets
import statsmodels.formula.api as smf

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.neighbors import KNeighborsRegressor
from statsmodels.graphics.mosaicplot import mosaic

In [2]:
%run utils/u.0.0-pandas_missing_extension.ipynb

In [44]:
arg_di_df = pd.read_csv('../data/interim/WDICSV_INTERIM.csv').clean_names(case_type="snake")
years_of_military_dictatorship = [
    (1930,1932),
    (1943,1946),
    (1955,1958),
    (1962,1963),
    (1966,1973),
    (1976,1983)
]

arg_di_df['year_of_dictatorship'] = arg_di_df['year'].apply(lambda year: any(start <= year <= end for start, end in years_of_military_dictatorship))

print(arg_di_df.shape)
arg_di_df.info()

(64, 99)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 99 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   year                      64 non-null     int64  
 1   eg_elc_accs_zs            33 non-null     float64
 2   fx_own_totl_zs            4 non-null      float64
 3   fx_own_totl_ol_zs         4 non-null      float64
 4   fx_own_totl_40_zs         4 non-null      float64
 5   fx_own_totl_pl_zs         4 non-null      float64
 6   fx_own_totl_60_zs         4 non-null      float64
 7   fx_own_totl_so_zs         4 non-null      float64
 8   fx_own_totl_yg_zs         4 non-null      float64
 9   per_si_allsi_adq_pop_tot  12 non-null     float64
 10  per_allsp_adq_pop_tot     12 non-null     float64
 11  per_sa_allsa_adq_pop_tot  12 non-null     float64
 12  per_lm_alllm_adq_pop_tot  10 non-null     float64
 13  se_prm_tenr               26 non-null     float64
 14  sl_

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  arg_di_df['year_of_dictatorship'] = arg_di_df['year'].apply(lambda year: any(start <= year <= end for start, end in years_of_military_dictatorship))


In [4]:
sns.set_style(
    rc={
        "figure.figsize": (8, 6)
    },
    style="whitegrid"
)

In [47]:
arg_di_df = arg_di_df.assign(gdp_higher_than_past_year = lambda x: x["ny_gdp_mktp_kd_zg"] > 0)
arg_di_df[["gdp_higher_than_past_year", "ny_gdp_mktp_kd_zg"]]

Unnamed: 0,gdp_higher_than_past_year,ny_gdp_mktp_kd_zg
0,False,
1,True,5.427843
2,False,-0.852022
3,False,-5.308197
4,True,10.130298
...,...,...
59,False,-2.000861
60,False,-9.900485
61,True,10.718010
62,True,4.956370


In [48]:
def get_greatest_year_range(row):
    if(row["sp_pop_0014_to_zs"] > row["sp_pop_1564_to_zs"] and row["sp_pop_0014_to_zs"] > row["sp_pop_65_up_to_zs"]): return "0 to 14 years"
    elif(row["sp_pop_1564_to_zs"] > row["sp_pop_65_up_to_zs"] and row["sp_pop_1564_to_zs"] > row["sp_pop_0014_to_zs"]): return "15 to 64 years"
    elif(row["sp_pop_65_up_to_zs"] > row["sp_pop_0014_to_zs"] and row["sp_pop_65_up_to_zs"] > row["sp_pop_1564_to_zs"]): return "65 and above"

arg_di_df["greatest_year_range"] = arg_di_df.apply(lambda x: get_greatest_year_range(x), axis=1)
arg_di_df[["greatest_year_range", "sp_pop_0014_to_zs", "sp_pop_1564_to_zs", "sp_pop_65_up_to_zs"]]


Unnamed: 0,greatest_year_range,sp_pop_0014_to_zs,sp_pop_1564_to_zs,sp_pop_65_up_to_zs
0,15 to 64 years,31.116272,63.713503,5.170225
1,15 to 64 years,30.978752,63.712308,5.308940
2,15 to 64 years,30.831083,63.721570,5.447347
3,15 to 64 years,30.684197,63.730859,5.584945
4,15 to 64 years,30.543819,63.732703,5.723478
...,...,...,...,...
59,15 to 64 years,23.939544,64.459858,11.600598
60,15 to 64 years,23.656398,64.616148,11.727454
61,15 to 64 years,23.356737,64.821355,11.821907
62,15 to 64 years,23.053332,65.028340,11.918328


In [65]:
arg_di_df["labor_force_percentage_of_population"] = arg_di_df.apply(lambda x: "more_than_90" if x["sl_tlf_totl_in"] / x["sp_pop_totl"] > 0.2 else "up_to_90", axis=1)
# arg_di_df[["labor_force_percentage_of_population", "sl_tlf_totl_in", "sp_pop_totl"]]
arg_di_df["labor_force_percentage_of_population"].describe()

count               64
unique               2
top       more_than_90
freq                33
Name: labor_force_percentage_of_population, dtype: object

Evaluación del mecanismo de valores faltantes por prueba de t-test

Información
    two-sided: las medias de las distribuciones subyacentes a las muestras son desiguales.
    less: la media de la distribución subyacente a la primera muestra es menor que la media de la distribución subyacente a la segunda muestra.
    greater: la media de la distribución subyacente a la primera muestra es mayor que la media de la distribución subyacente a la segunda muestra.


In [52]:
groupby_series = (
    arg_di_df
        .select("year_of_dictatorship", "sh_dth_1014")
        .transform_column("sh_dth_1014", lambda x: x.isna(), elementwise = False)
        .groupby("year_of_dictatorship")
)

if groupby_series.indices.get(False) is None:
    non_military_years = pd.DataFrame()
else:
    non_military_years = arg_di_df.iloc[groupby_series.indices.get(False)]
    
if groupby_series.indices.get(True) is None:
    military_years = pd.DataFrame()
else:
    military_years = arg_di_df.iloc[groupby_series.indices.get(True)]

scipy.stats.ttest_ind(
    a = military_years,
    b = non_military_years,
    alternative="two-sided"
)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

TypeError: MissingMethods.missing_mosaic_plot() missing 3 required positional arguments: 'target_var', 'x_categorical_var', and 'y_categorical_var'

In [12]:
(
    smf.ols(
        formula="sh_dth_1014 ~ sh_dth_1519",
        data = arg_di_df
    )
    .fit()
    .summary()
    .tables[0]
)

0,1,2,3
Dep. Variable:,sh_dth_1014,R-squared:,0.189
Model:,OLS,Adj. R-squared:,0.153
Method:,Least Squares,F-statistic:,5.139
Date:,"Sat, 14 Dec 2024",Prob (F-statistic):,0.0336
Time:,11:07:45,Log-Likelihood:,-146.64
No. Observations:,24,AIC:,297.3
Df Residuals:,22,BIC:,299.6
Df Model:,1,,
Covariance Type:,nonrobust,,
