# Missing Data Handling Protocol

Here we define a protocol for identifying the missing-data mechanism of a database provided as user input and for imputing its missing values. The method works as follows:

- Initially, the user must provide the database with missing values as a CSV (Comma-Separated Values) file, together with its data type (numerical, categorical, or mixed);

- Four methods are used to identify MCAR (Missing Completely At Random): Dixon, Little, Homoscedasticity, and Regression-Based (RB). The result is given by vote; that is, if two or more methods consider the database to be MCAR, this is the final result;

- If the database is not identified as MCAR, the Regression-Based (RB) method is called to check for MAR (Missing At Random). If the technique identifies MAR, that result is displayed; otherwise, the mechanism is considered MNAR (Missing Not At Random). Note that the methods only provide an indication of MCAR, MAR, or MNAR; further analyses may be necessary, including the participation of a domain specialist;

- Next, the missing data imputation process is started:

> If the data type is Numerical, the detected missing mechanism is checked: if MCAR, impute using the KNN method; otherwise, impute with multiple imputation;

> If the data type is Categorical, the detected missing mechanism is checked: if MAR, impute using the KNN method; otherwise, impute with multiple imputation;

> If the data type is Mixed, the detected missing mechanism is checked: if MCAR, impute using GAIN; if MAR, impute using multiple imputation; if MNAR, impute using KNN.

- In the end, the complete (imputed) database is returned.
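
The detection steps above (the vote among the MCAR tests, followed by the MAR check) can be sketched as a small function. This is only an illustration under stated assumptions: `decide_mechanism` and its arguments are hypothetical names, each MCAR test is reduced to a boolean verdict, and the outcome of the regression-based MAR check is passed in rather than computed.

```python
def decide_mechanism(mcar_votes, rb_indicates_mar):
    """Return 'MCAR', 'MAR', or 'MNAR' from the individual test verdicts.

    mcar_votes: list of booleans, one per MCAR test (True = test indicates MCAR).
    rb_indicates_mar: boolean verdict of the regression-based MAR check.
    Both arguments are hypothetical stand-ins for the real test outputs.
    """
    # Two or more MCAR verdicts settle the result.
    if sum(bool(v) for v in mcar_votes) >= 2:
        return "MCAR"
    # Otherwise fall back to the regression-based check for MAR.
    if rb_indicates_mar:
        return "MAR"
    # Neither MCAR nor MAR was indicated.
    return "MNAR"


# Example verdicts in the order Dixon, Little, Homoscedasticity, RB.
print(decide_mechanism([True, False, True, False], rb_indicates_mar=False))
print(decide_mechanism([False, False, True, False], rb_indicates_mar=True))
print(decide_mechanism([False, False, False, False], rb_indicates_mar=False))
```

Keeping the decision separate from the tests themselves makes the rule easy to check by hand before running the R-backed tests.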

### Load and run packages

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Set your R installation path here (raw string so backslashes are kept literally)
import os
os.environ['R_HOME'] = r'C:\Program Files\R\R-4.2.1'

import sys
import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from IPython import display
# Adjust to the location of the project's src folder on your machine
os.chdir(r"C:\Anaconda3\Scripts\MALTA\src")
%run load_data/load_data.ipynb
display.clear_output(wait=True)
%run missing_data_generation/missing_values_generator.ipynb
display.clear_output(wait=True)
%run missing_mechanism_detection/little_test.ipynb
display.clear_output(wait=True)
%run missing_mechanism_detection/dixon_test.ipynb
display.clear_output(wait=True)
%run missing_mechanism_detection/homoscedasticity_test.ipynb
display.clear_output(wait=True)
%run missing_mechanism_detection/rb_test.ipynb
display.clear_output(wait=True)
%run imputation/imputation_methods.ipynb
display.clear_output(wait=True)

### Set filename and load dataset

In [None]:
# Load dataset
import chardet
import pandas as pd  # used by missing_data_handling_protocol below
import rpy2
from rpy2.rinterface_lib.embedded import RRuntimeError
from IPython import display
# Get the dataset file (with missing data)
filename = input(r'Enter the path and name of the .csv file (e.g., "C:\Anaconda\Scripts\dataset.csv"): ')
data_type = input('Enter the data type (N = Numerical, C = Categorical, M = Mixed): ')
delimiter = ";" #Change the delimiter if necessary
encode ='utf-8'
original_data = load_dataset(filename,delimiter, encode)
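
Because the data type is typed in by hand, it can help to normalize and validate it before it reaches the protocol. The helper below is a minimal sketch, not part of the original notebook: `normalize_data_type` and `VALID_TYPES` are hypothetical names, and the accepted codes are the ones listed in the prompt above.

```python
# Codes accepted by the prompt: N = Numerical, C = Categorical, M = Mixed.
VALID_TYPES = {"N": "Numerical", "C": "Categorical", "M": "Mixed"}


def normalize_data_type(raw):
    """Strip whitespace, upper-case the input, and reject unknown codes."""
    code = raw.strip().upper()
    if code not in VALID_TYPES:
        raise ValueError(
            f"Unknown data type {raw!r}; expected one of {sorted(VALID_TYPES)}."
        )
    return code


print(normalize_data_type(" n "))
print(VALID_TYPES[normalize_data_type("m")])
```

Applied to the user's answer (for example `data_type = normalize_data_type(data_type)`), this surfaces a typo immediately instead of later, during imputation.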


### Handle missing data

- A conditional structure is defined first to find the missing mechanism:

>> If two or more of the four MCAR tests indicate MCAR, the missing mechanism is set to MCAR;
>> Otherwise, if the RB test indicates MAR, the missing mechanism is set to MAR;
>> Otherwise, the missing mechanism is set to MNAR.

- A conditional structure is defined to perform the imputation of missing data:

> If the data type is Numerical, the detected missing mechanism is checked: if MCAR, impute using the KNN method; otherwise, impute with multiple imputation;

> If the data type is Categorical, the detected missing mechanism is checked: if MAR, impute using the KNN method; otherwise, impute with multiple imputation;

> If the data type is Mixed, the detected missing mechanism is checked: if MCAR, impute using GAIN; if MAR, impute using multiple imputation; if MNAR, impute using KNN.

- Return the complete (imputed) database.
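
The imputation rules above can also be written as a lookup table, which makes every (data type, mechanism) pair explicit. This is a sketch for illustration only: `IMPUTATION_RULES` and `select_imputation_method` are hypothetical names, and the function returns the method's name rather than calling any imputer.

```python
# Rows: data type code; columns: detected missing mechanism.
IMPUTATION_RULES = {
    "N": {"MCAR": "KNN", "MAR": "Multiple Imputation", "MNAR": "Multiple Imputation"},
    "C": {"MCAR": "Multiple Imputation", "MAR": "KNN", "MNAR": "Multiple Imputation"},
    "M": {"MCAR": "GAIN", "MAR": "Multiple Imputation", "MNAR": "KNN"},
}


def select_imputation_method(data_type, mechanism):
    """Return the name of the imputation method for the given pair."""
    try:
        return IMPUTATION_RULES[data_type][mechanism]
    except KeyError:
        raise ValueError(
            f"No rule for data type {data_type!r} and mechanism {mechanism!r}."
        ) from None


print(select_imputation_method("N", "MCAR"))
print(select_imputation_method("M", "MNAR"))
```

A table like this fails loudly on an unexpected data type or mechanism, whereas a chain of `if`/`elif` branches can silently skip every case.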

In [None]:
def missing_data_handling_protocol(data, data_type):
    """
    Handle missing data in a dataset, choosing the imputation method from the data type and the detected missing mechanism.

    Parameters:
        data (pandas DataFrame): The dataset with missing data; the last column is treated as the class attribute.
        data_type (str): The data type of the dataset ('N' = Numerical, 'C' = Categorical, or 'M' = Mixed).
        
    Returns:
        The imputed dataset.
    """   
    
    #######################################################################
    #Detecting Missing Mechanism
    #######################################################################
   
    original = True
    result_list = []

    #######################################################################
    #Dixon's Test
    #######################################################################

    try:
        result = dixon_test(data)
        if (len(result) > 1 and result[1] > 0.5):
            result_list.append("MCAR")
        else:
            result_list.append("")  # list.append() requires an argument

    except RRuntimeError as e:
        result_list.append("")
        print("Unable to invert array.")
        display.clear_output(wait=True)

    except UnicodeDecodeError as e:
        result_list.append("")
        print("Unable to open the file.")
        display.clear_output(wait=True)

    display.clear_output(wait=True)
    #######################################################################
    #Little's Test
    #######################################################################

    try:
        result = little_test(data)
        if (result == "The Missing Mechanism is MCAR."):
            result_list.append("MCAR")

        else: 
            result_list.append("")


    except RRuntimeError as e:
        result_list.append("")
        print("Unable to invert array.")   
        display.clear_output(wait=True)

    except UnicodeDecodeError as e:
        result_list.append("")
        print("Unable to open the file.")
        display.clear_output(wait=True)

    display.clear_output(wait=True)
    #######################################################################
    #Jalal's Test (homoscedasticity)
    #######################################################################

    try:
        result = homoscedasticity_test(data)
        if (result == "The Missing Mechanism is MCAR."):
            result_list.append("MCAR")

        else: 
            result_list.append("")


    except RRuntimeError as e:
        result_list.append("")
        print("Unable to invert array.")   
        display.clear_output(wait=True)

    except UnicodeDecodeError as e:
        result_list.append("")
        print("Unable to open the file.")
        display.clear_output(wait=True)

    display.clear_output(wait=True)
    #######################################################################
    #RB's Test
    #######################################################################

    try:
        result = rb_test(data)
        if (result == "The Missing Mechanism is MCAR."):
            result_list.append("MCAR")

        else: 
            result_list.append("")


    except RRuntimeError as e:
        result_list.append("")
        print("Unable to invert array.")  
        display.clear_output(wait=True)


    except UnicodeDecodeError as e:
        result_list.append("")
        print("Unable to open the file.")
        display.clear_output(wait=True)

    display.clear_output(wait=True)
    #######################################################################
    #Checking Results
    #######################################################################
    print('----------------------Checking Results-------------------------')
    if (result_list.count('MCAR') >= 2):
        print("The Missing Mechanism is MCAR.")
        missing_mechanism = 'MCAR'
    else:
        # Default to MNAR so missing_mechanism is always defined, even if rb_test raises below
        missing_mechanism = 'MNAR'
        try:
            result = rb_test(data)
            if (result == 'The Missing Mechanism is MAR.'):
                result_list.append("MAR")
                print("The Missing Mechanism is MAR.")
                missing_mechanism = 'MAR'
            else:
                print('The Missing Mechanism is MNAR.')
                result_list.append("MNAR")
                missing_mechanism = 'MNAR'
            print('------------------------------------------')

        except RRuntimeError as e:
            result_list.append("")
            print("Unable to invert array.")
            display.clear_output(wait=True)

        except UnicodeDecodeError as e:
            result_list.append("")
            print("Unable to open the file.")
            display.clear_output(wait=True)
            
    #######################################################################
    #Imputing Missing Values
    #######################################################################

    #Separate the class labels from the predictive attributes before imputation
    #Get the input features
    columns = data.columns
    class_name = columns[-1] #Get name of the last column (class)
    columns_tmp = list(columns)
    columns_tmp.remove(class_name)

    data_x = data.iloc[:, 0:data.shape[1]-1].values #Predictive attributes
    data_y = data.iloc[:, -1].values #Class attribute
    X = pd.DataFrame(data_x, columns = columns_tmp) #Get a pandas dataframe from data (predictive attributes)
    y = pd.DataFrame(data_y, columns = [class_name]) #Get a pandas dataframe from data (class attribute)
       
    if data_type == 'N':
        if missing_mechanism == 'MCAR':
            # Impute missing values using the KNN method
            imputed_data = knn_imputation(X, 10) #Complete data with predictive attributes
            print('------------------------------------------')
            print('KNN imputation completed successfully.')
        else:
            # Impute missing values using multiple imputation
            imputed_data = multiple_imputation(X, 10) #Complete data with predictive attributes
            print('------------------------------------------')
            print('Multiple imputation completed successfully.')

    elif data_type == 'C':
        if missing_mechanism == 'MAR':
            # Impute missing values using the KNN method
            imputed_data = knn_imputation(X, 10) #Complete data with predictive attributes
            print('------------------------------------------')
            print('KNN imputation completed successfully.')

        else:
            # Impute missing values using multiple imputation
            imputed_data = multiple_imputation(X, 10) #Complete data with predictive attributes
            print('------------------------------------------')
            print('Multiple imputation completed successfully.')

    elif data_type == 'M':
        if missing_mechanism == 'MCAR':
            # Impute missing values using GAIN (Generative Adversarial Imputation Network)
            gain_parameters = {'batch_size': 128, 'hint_rate': 0.9, 'alpha': 100, 'iterations': 1000}
            imputed_data = gain_imputation(X.values, gain_parameters) #Complete data with predictive attributes
            print('------------------------------------------')
            print('GAIN imputation completed successfully.')
        elif missing_mechanism == 'MAR':
            # Impute missing values using multiple imputation
            # (implement your own multiple imputation algorithm or use an existing package)
            imputed_data = multiple_imputation(X, 10) #Complete data with predictive attributes
            print('------------------------------------------')
            print('Multiple imputation completed successfully.')
        else:
            # Impute missing values using the KNN method
            imputed_data = knn_imputation(X, 10) #Complete data with predictive attributes
            print('------------------------------------------')
            print('KNN imputation completed successfully.')

    else:
        # Fail early instead of reaching the concat below with imputed_data undefined
        raise ValueError(f"Unknown data type {data_type!r}; expected 'N', 'C', or 'M'.")

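    # gain_imputation receives X.values, so it may return a NumPy array (an assumption;
    # check imputation_methods.ipynb). Wrap it so the index-based concat below works.
    if not isinstance(imputed_data, pd.DataFrame):
        imputed_data = pd.DataFrame(imputed_data, columns=columns_tmp)
    imputed_data = pd.concat([imputed_data, y.set_index(imputed_data.index)], axis=1)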
    
    return imputed_data


In [None]:
missing_data_handling_protocol(original_data, data_type)