# Context

The global population has grown rapidly over the past two centuries. It was estimated at 1 billion people in 1800 and had reached 6 billion by the turn of the twenty-first century. Roughly 227,000 people are added to the world every day, and the world's population is projected to exceed 11 billion by the end of the 21st century.

According to reports, the unsustainable increase in population, combined with a lack of access to adequate health care, food, and shelter, has led to a rise in the number of genetic disorders. Hereditary illnesses are becoming more common because the need for genetic testing is poorly understood. Children often die as a result of these illnesses, so genetic testing during pregnancy is critical.

# Task
You have been hired as a Machine Learning Engineer by a government agency. You are given a dataset that contains medical information about children who have genetic disorders. Your task is to predict the following:

- Genetic disorder
- Disorder subclass

#### Importing libraries and data

In [1]:
import pandas as pd
import numpy as np
import io
import gc
import time
import os
import pandas_profiling as pp 
from datetime import date


# settings
import warnings
warnings.filterwarnings("ignore")
gc.enable()

In [38]:
ROOT_PATH = 'C:\\Users\\Usuario\\Desktop\\Proyecto Final\\src\\data\\'

In [3]:
df_train = pd.read_csv(ROOT_PATH+'raw\\train.csv')
df_test = pd.read_csv(ROOT_PATH+'raw\\test.csv')

In [4]:
df_train.shape

(22083, 45)

In [5]:
df_test.shape

(9465, 43)

In [6]:
df_train.sample(3)

Unnamed: 0,Patient Id,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Patient First Name,Family Name,Father's name,...,Birth defects,White Blood cell count (thousand per microliter),Blood test result,Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5,Genetic Disorder,Disorder Subclass
14607,PID0x8813,10.0,No,No,No,Yes,5.046047,Bessie,Gordon,Advait,...,Singular,8.692767,slightly abnormal,1.0,1.0,0.0,1.0,0.0,Mitochondrial genetic inheritance disorders,Leigh syndrome
4538,PID0x3126,12.0,No,No,No,No,4.938106,Angela,Hanners,Konur,...,Singular,4.574872,,0.0,1.0,0.0,0.0,1.0,Mitochondrial genetic inheritance disorders,
2314,PID0x974c,12.0,No,No,Yes,No,4.73987,Janice,Jacks,Karamo,...,Multiple,11.276502,slightly abnormal,1.0,0.0,0.0,1.0,0.0,Mitochondrial genetic inheritance disorders,Mitochondrial myopathy


## Preprocessing

### Dropping unnecessary columns

In [7]:
drop_cols = ["Patient Id", "Patient First Name", "Family Name", "Father's name", 
             "Institute Name", "Location of Institute", "Parental consent", "Place of birth", 
             "Test 1", "Test 2", "Test 3", "Test 4", "Test 5"]

df_train.drop(drop_cols, axis=1, inplace=True)
df_test.drop(drop_cols, axis=1, inplace=True)

In [8]:
# drop the rows in train where either target is missing
df_train.dropna(how='any', subset=["Genetic Disorder", "Disorder Subclass"], inplace=True)
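
To make the effect of `dropna(how='any', subset=...)` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset. A row is dropped when either target is missing, while missing feature values are left alone.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in a feature and in both targets
toy = pd.DataFrame({
    "Patient Age": [3.0, np.nan, 8.0, 5.0],
    "Genetic Disorder": ["A", "B", np.nan, "C"],
    "Disorder Subclass": ["a1", "b1", "c1", np.nan],
})

# how='any' drops a row when at least one of the listed columns is NaN
kept = toy.dropna(how="any", subset=["Genetic Disorder", "Disorder Subclass"])

print(kept.index.tolist())               # labels of the rows that survive
print(kept["Patient Age"].isna().sum())  # NaN count left in the feature column
```

Rows 2 and 3 are removed because one of their targets is missing. Row 1 stays even though its `Patient Age` is NaN, which is why an imputation step is still needed afterwards.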

In [9]:
df_train.sample(3)

Unnamed: 0,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Mother's age,Father's age,Status,Respiratory Rate (breaths/min),...,Birth defects,White Blood cell count (thousand per microliter),Blood test result,Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5,Genetic Disorder,Disorder Subclass
13322,,Yes,,Yes,Yes,5.021891,36.0,,Alive,Normal (30-60),...,Multiple,4.320677,normal,1.0,1.0,0.0,1.0,0.0,Single-gene inheritance diseases,Cystic fibrosis
8838,3.0,Yes,Yes,Yes,No,4.924373,37.0,,Deceased,Tachypnea,...,Multiple,,abnormal,1.0,,,0.0,1.0,Mitochondrial genetic inheritance disorders,Mitochondrial myopathy
6817,8.0,Yes,No,Yes,No,5.047846,22.0,,Deceased,Tachypnea,...,Singular,6.966016,,1.0,1.0,1.0,0.0,,Mitochondrial genetic inheritance disorders,Leigh syndrome


In [10]:
# Exploratory data analysis
# Create a Pandas Profiling report to get a quick grasp of the data
reporteEdadsucio = pp.ProfileReport(df_train, title="Initial exploratory report", explorative=True)

# Make sure the folder the report is written to exists
os.makedirs("raw_data", exist_ok=True)

reporteEdadsucio.to_file("raw_data/EDA_geneticdisorder_test.html")

In [12]:
def unistats(df):
    import pandas as pd
    output_df = pd.DataFrame(columns=['Count','Missing','Unique','Dtype','Numeric','Mode','Mean','Min','25%','Median','75%','Max','Std','Skew','Kurt'])
    for col in df:
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].dtype !='bool' and pd.isnull(df[col]).all()!=True:
            
            output_df.loc[col] = [df[col].count(), df[col].isnull().sum(), df[col].nunique(), df[col].dtype,pd.api.types.is_numeric_dtype(df[col]),
                                  df[col].mode().values[0], df[col].mean(),df[col].min(), df[col].quantile(0.25), df[col].median(),
                                  df[col].quantile(0.75), df[col].max(), df[col].std(), df[col].skew(), df[col].kurt()]
        else:
            output_df.loc[col] = [df[col].count(), df[col].isnull().sum(), df[col].nunique(), df[col].dtype,pd.api.types.is_numeric_dtype(df[col]),
                                   '-' if pd.isnull(df[col]).all() else df[col].mode().values[0],'-','-','-','-','-','-','-','-','-']
    return output_df.sort_values(by=['Numeric', 'Skew','Unique'],ascending=False)

In [13]:
unistats(df_train)

Unnamed: 0,Count,Missing,Unique,Dtype,Numeric,Mode,Mean,Min,25%,Median,75%,Max,Std,Skew,Kurt
Symptom 5,16434,1613,2,float64,True,0.0,0.464342,0.0,0.0,0.0,1.0,1.0,0.498742,0.143008,-1.97979
Patient Age,16987,1060,15,float64,True,4.0,6.948784,0.0,3.0,7.0,11.0,14.0,4.314395,0.017398,-1.208603
White Blood cell count (thousand per microliter),16440,1607,14250,float64,True,3.0,7.47574,3.0,5.422143,7.470549,9.51747,12.0,2.65112,0.008824,-0.971344
Blood cell count (mcL),18047,0,18047,float64,True,4.14623,4.899198,4.14623,4.764199,4.900306,5.033654,5.609829,0.199061,0.004334,-0.051664
Symptom 4,16481,1566,2,float64,True,0.0,0.498999,0.0,0.0,0.0,1.0,1.0,0.500014,0.004005,-2.000227
No. of previous abortion,16501,1546,5,float64,True,2.0,1.999455,0.0,1.0,2.0,3.0,4.0,1.40947,0.001486,-1.288054
Father's age,13629,4418,45,float64,True,20.0,41.972559,20.0,30.0,42.0,53.0,64.0,13.064441,-0.004126,-1.2213
Mother's age,13590,4457,34,float64,True,23.0,34.576453,18.0,26.0,35.0,43.0,51.0,9.823005,-0.007945,-1.214144
Symptom 3,16517,1530,2,float64,True,1.0,0.537749,0.0,0.0,1.0,1.0,1.0,0.498588,-0.151442,-1.977305
Symptom 2,16401,1646,2,float64,True,1.0,0.5521,0.0,0.0,1.0,1.0,1.0,0.497293,-0.209562,-1.956322


## Separating the target, categorical, and non-categorical variables

In [15]:
def get_categorical_features(dataFrame):
    categorical_feats = [f for f in dataFrame.columns if dataFrame[f].dtype == 'object']
    return categorical_feats

def get_non_categorical_features(dataFrame, categorical_feats):
    non_categorical_features  = [f for f in dataFrame.columns if f not in categorical_feats]
    return non_categorical_features

def remove_element_from_list(list_of_elements, element):
    # `element` may be a single value or a list of values to remove
    to_remove = element if isinstance(element, list) else [element]
    return [e for e in list_of_elements if e not in to_remove]

In [16]:
target_feature = ['Genetic Disorder', 'Disorder Subclass']
categorical_features = get_categorical_features(df_train)
categorical_features = remove_element_from_list(categorical_features, target_feature)

non_categorical_features = get_non_categorical_features(df_train, categorical_features)
non_categorical_features = remove_element_from_list(non_categorical_features, target_feature)

In [18]:
print("Total variables (columns): ", len(df_train.columns))
print("Categorical variables: ", len(categorical_features))
print("Non-categorical variables: ", len(non_categorical_features))

Total variables (columns):  32
Categorical variables:  19
Non-categorical variables:  11
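
As an alternative sketch of the same split, `DataFrame.select_dtypes` can separate numeric columns from the rest once the targets are set aside. This is not the notebook's own helper. The toy frame and its values below are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values) with one text feature, one numeric feature, one target
toy = pd.DataFrame({
    "Status": ["Alive", "Deceased"],
    "Patient Age": [4.0, 9.0],
    "Genetic Disorder": ["A", "B"],
})
targets = ["Genetic Disorder"]

# Set the targets aside first so they are counted in neither group
features = toy.drop(columns=targets)

# Numeric columns are selected by dtype; everything else is treated as categorical
numeric = features.select_dtypes(include="number").columns.tolist()
categorical = [c for c in features.columns if c not in numeric]

print(categorical, numeric)
```

Dropping the targets before splitting avoids having to remove them from each list afterwards.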


In [19]:
def convert_to_categorical(df, column):
    df[column] = df[column].astype('category')
    return df

def convert_to_numeric(df, column):
    df[column] = df[column].apply(pd.to_numeric, errors='coerce')
    return df

### Separating the target variables from the DataFrame

In [20]:
target_feature = ['Genetic Disorder', 'Disorder Subclass']

df_target = pd.DataFrame(df_train, columns = target_feature)
df_train.drop(target_feature, axis=1, inplace=True, errors='ignore')

### Merging the train and test datasets for preprocessing

In [21]:
# rows from df_test get fuente (source) = 0
# rows from df_train get fuente (source) = 1

df_merged = pd.concat([df_test.assign(fuente=0), df_train.assign(fuente=1)])
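
The tag-and-stack pattern used here can be sketched on two tiny frames (hypothetical values). It shows that the `fuente` flag is enough to recover each part later, even though the row labels repeat after `pd.concat`.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical values)
train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [10.0, 20.0]})

# Tag each frame with its origin, then stack them
merged = pd.concat([test.assign(fuente=0), train.assign(fuente=1)])

# Row labels 0 and 1 now appear twice, so the split must rely on the flag
test_back = merged[merged["fuente"].eq(0)].drop(columns="fuente")
train_back = merged[merged["fuente"].eq(1)].drop(columns="fuente")

print(merged.index.has_duplicates)
print(train_back["x"].tolist(), test_back["x"].tolist())
```

Because the index is not unique after the concatenation, any later step that selects rows by label would be ambiguous. Filtering on the flag column avoids that.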


In [22]:
df_merged.sample(2)

Unnamed: 0,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Mother's age,Father's age,Status,Respiratory Rate (breaths/min),...,No. of previous abortion,Birth defects,White Blood cell count (thousand per microliter),Blood test result,Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5,fuente
11887,10.0,Yes,No,,No,4.966551,,,Deceased,,...,3.0,Singular,,inconclusive,,0.0,0.0,1.0,1.0,1
2619,3.0,No,Yes,,Yes,4.616247,41.0,45.0,Alive,-99.0,...,2.0,Multiple,12.0,slightly abnormal,1.0,0.0,1.0,0.0,0.0,0


### Applying ordinal encoding to the categorical variables to convert them to numeric while preserving the NaN values

In [25]:
categorical_features = get_categorical_features(df_merged)
non_categorical_features = get_non_categorical_features(df_merged, categorical_features)

In [26]:
categorical_features

["Genes in mother's side",
 'Inherited from father',
 'Maternal gene',
 'Paternal gene',
 'Status',
 'Respiratory Rate (breaths/min)',
 'Heart Rate (rates/min',
 'Follow-up',
 'Gender',
 'Birth asphyxia',
 'Autopsy shows birth defect (if applicable)',
 'Folic acid details (peri-conceptional)',
 'H/O serious maternal illness',
 'H/O radiation exposure (x-ray)',
 'H/O substance abuse',
 'Assisted conception IVF/ART',
 'History of anomalies in previous pregnancies',
 'Birth defects',
 'Blood test result']

In [27]:
non_categorical_features

['Patient Age',
 'Blood cell count (mcL)',
 "Mother's age",
 "Father's age",
 'No. of previous abortion',
 'White Blood cell count (thousand per microliter)',
 'Symptom 1',
 'Symptom 2',
 'Symptom 3',
 'Symptom 4',
 'Symptom 5',
 'fuente']

In [28]:
# ref: https://krrai77.medium.com/using-fancyimpute-in-python-eadcffece782

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()

# Encode the non-null values of a column and write them back, leaving NaN untouched
def ordinalencode(series):
    series = series.copy()  # work on a copy instead of relying on in-place changes to a view
    nonulls = np.array(series.dropna())
    impute_reshape = nonulls.reshape(-1, 1)
    impute_ordinal = encoder.fit_transform(impute_reshape)
    series.loc[series.notnull()] = np.squeeze(impute_ordinal)
    return series

# Encode every categorical variable and assign the result back explicitly
for column in categorical_features:
    df_merged[column] = ordinalencode(df_merged[column])
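
The idea of encoding only the observed values can be illustrated without scikit-learn. The sketch below swaps in `pandas.Categorical` codes in place of `OrdinalEncoder`, on a toy column with hypothetical values. For string categories both approaches sort the categories, so the resulting codes should match, but treat that as an assumption to verify on real data.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing entry
col = pd.Series(["Yes", np.nan, "No", "Yes"], name="Maternal gene")

# Categories are sorted, so "No" -> 0 and "Yes" -> 1; missing entries get code -1
codes = pd.Series(pd.Categorical(col).codes, index=col.index, dtype="float64")

# Turn the -1 placeholder back into NaN so an imputer can fill it later
encoded = codes.where(codes >= 0)

print(encoded.tolist())  # [1.0, nan, 0.0, 1.0]
```

Keeping NaN in place matters here because the next step relies on the imputer, not the encoder, to decide what the missing values should be.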

In [29]:
df_merged.sample(5)

Unnamed: 0,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Mother's age,Father's age,Status,Respiratory Rate (breaths/min),...,No. of previous abortion,Birth defects,White Blood cell count (thousand per microliter),Blood test result,Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5,fuente
2022,8.0,1.0,0.0,1.0,0.0,4.967373,28.0,27.0,1.0,1.0,...,1.0,2.0,5.058817,2.0,0.0,1.0,1.0,1.0,1.0,0
19686,5.0,1.0,0.0,0.0,0.0,5.299121,34.0,42.0,1.0,1.0,...,0.0,1.0,7.39519,3.0,1.0,1.0,1.0,0.0,1.0,1
10983,7.0,1.0,0.0,,1.0,5.189551,24.0,35.0,0.0,2.0,...,,1.0,11.982329,1.0,0.0,1.0,1.0,0.0,0.0,1
12051,13.0,1.0,0.0,1.0,0.0,5.226273,51.0,38.0,1.0,1.0,...,3.0,2.0,8.155844,4.0,1.0,0.0,1.0,1.0,1.0,1
5201,8.0,1.0,0.0,1.0,0.0,4.984761,50.0,30.0,1.0,1.0,...,2.0,1.0,6.465685,2.0,1.0,0.0,1.0,1.0,,1


### Replacing the NaN values using MICE (Multiple Imputation by Chained Equations)

In [30]:
from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer as mice

# Work on a copy of the merged dataset
dataset_impute = df_merged.copy()

# Apply MICE
dataset_impute_complete = mice(max_iter=150, verbose=1, initial_strategy='most_frequent',random_state=10).fit_transform(dataset_impute.values)

# Turning into df again
df_merged = pd.DataFrame(data=dataset_impute_complete, columns=dataset_impute.columns, index=dataset_impute.index)

[IterativeImputer] Completing matrix with shape (27512, 31)
[IterativeImputer] Change: 148.67376800752697, scaled tolerance: 0.099 
[IterativeImputer] Change: 1.5144384958663288, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.039169122130680356, scaled tolerance: 0.099 
[IterativeImputer] Early stopping criterion reached.
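
A minimal sketch of what `IterativeImputer` does, on a toy matrix with hypothetical values: each column with gaps is modelled from the other columns, and only the missing cells are filled in. The parameters below are illustrative and are not the ones used in the notebook.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix (hypothetical): the second column is roughly twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

# Observed entries come back unchanged; only the NaN cells are estimated
observed = ~np.isnan(X)
print(np.allclose(X_filled[observed], X[observed]))
print(X_filled)
```

The default estimator is a regression model, so the filled-in values are continuous even when the column originally held discrete codes.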


In [31]:
df_merged.sample(5)

Unnamed: 0,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Mother's age,Father's age,Status,Respiratory Rate (breaths/min),...,No. of previous abortion,Birth defects,White Blood cell count (thousand per microliter),Blood test result,Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5,fuente
4734,5.0,0.0,0.0,0.539822,0.0,4.888909,40.0,26.0,0.0,2.0,...,2.0,2.0,12.0,4.0,0.0,1.0,1.0,1.0,0.0,0.0
611,9.0,1.0,0.0,0.592867,1.0,4.970301,18.0,23.0,1.0,0.985022,...,-99.0,1.0,3.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0
21618,11.0,0.0,0.0,1.0,1.0,4.984722,25.0,52.0,0.0,1.0,...,2.0,2.0,7.483353,4.0,0.0,1.0,0.0,1.0,1.0,1.0
5374,12.0,1.0,1.0,0.594935,0.0,5.186028,45.0,59.0,1.0,2.0,...,1.0,0.0,-99.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0
8690,4.0,1.0,1.0,0.0,1.0,4.902166,51.0,26.0,1.0,2.0,...,2.0,1.0,4.354549,4.0,1.0,0.0,0.0,0.0,0.0,1.0
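
The sample above contains fractional values such as 0.539822 in `Maternal gene`, a column that was encoded as 0/1. That is expected, since the imputer's default estimator produces continuous predictions. If discrete codes are needed downstream, one option is to round each imputed value to the nearest code and clip it to the valid range. This is an assumption about what a later step might want, not something the notebook does. The values and `n_categories` below are hypothetical.

```python
import pandas as pd

# Hypothetical imputed column whose valid codes are 0 and 1
imputed = pd.Series([0.0, 0.539822, 1.0, 0.592867, -0.2, 1.3])

# Number of categories in the original column (assumed known from the encoder)
n_categories = 2

# Round to the nearest code, then clip anything outside the valid range
snapped = imputed.round().clip(lower=0, upper=n_categories - 1)

print(snapped.tolist())
```

Rounding discards the uncertainty carried by the fractional value, so whether to snap or keep the continuous estimate depends on the model trained afterwards.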


In [32]:
# Split back into test and train, dropping the helper column
df_test = df_merged[df_merged["fuente"].eq(0.0)].drop(columns=["fuente"])
df_train = df_merged[df_merged["fuente"].eq(1.0)].drop(columns=["fuente"])

In [33]:
df_train = pd.concat([df_train, df_target], axis=1)
df_train.sample(5)

Unnamed: 0,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Mother's age,Father's age,Status,Respiratory Rate (breaths/min),...,Birth defects,White Blood cell count (thousand per microliter),Blood test result,Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5,Genetic Disorder,Disorder Subclass
12578,0.0,1.0,1.0,1.0,1.0,5.209613,25.0,41.915435,0.0,2.0,...,1.496166,5.411932,2.0,1.0,1.0,0.555805,0.0,0.486924,Multifactorial genetic inheritance disorders,Diabetes
13661,10.0,1.0,0.0,0.556231,0.0,4.751998,35.0,41.908812,1.0,1.0,...,2.0,11.07209,3.0,1.0,0.0,0.0,0.0,0.461364,Single-gene inheritance diseases,Hemochromatosis
19113,1.0,1.0,0.0,1.0,0.0,4.88571,47.0,41.907921,0.0,1.0,...,1.0,12.0,3.0,1.0,0.523137,0.0,0.0,0.469329,Single-gene inheritance diseases,Tay-Sachs
10354,0.0,1.0,1.0,1.0,0.0,5.232346,25.0,41.914861,1.0,1.0,...,2.0,7.081961,2.463872,1.0,1.0,0.0,1.0,0.0,Single-gene inheritance diseases,Cystic fibrosis
14955,0.0,1.0,0.0,1.0,1.0,4.67965,26.0,21.0,1.0,1.0,...,1.0,5.155619,4.0,1.0,0.0,0.0,0.0,1.0,Mitochondrial genetic inheritance disorders,Leigh syndrome


In [34]:
print("NaN Values:",df_train.isna().any().sum())

NaN Values: 0


In [35]:
print("NaN Values:",df_test.isna().any().sum())

NaN Values: 0


In [36]:
# shape of data
print("Train shape: ",df_train.shape)
print("Test shape: ",df_test.shape)

Train shape:  (18047, 32)
Test shape:  (9465, 30)


In [39]:
# save preprocessed files
df_train.to_csv(ROOT_PATH+'processed\\train_preprocessed.csv', index=False)
df_test.to_csv(ROOT_PATH+'processed\\test_preprocessed.csv', index=False)