# Genetic disorders in children

**Context**

More than 9000 rare diseases have been described, and up to 350 million people worldwide live with one. Although each disease is individually rare, in aggregate they affect 5-6% of the population and represent a substantial challenge to global health systems. The majority of rare disorders are genetic in origin, with children under the age of five disproportionately affected (83%). Making a molecular diagnosis with current technologies and knowledge is often still a challenge: owing to disease heterogeneity and unknown variant pathogenicity, about half of patients with rare genetic diseases never receive a causal diagnosis.
Understanding the clinical implications of genetic and phenotypic variation is therefore crucial for delivering an early diagnosis and treatment to these patients.

**Dataset details**

The raw dataset contains the following files:
- train.csv : 22083 rows x 45 columns
- test.csv : 9465 rows x 43 columns
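
The two files differ by two columns. A quick way to see which ones is to compare the column lists. The sketch below uses tiny made-up frames in place of the real files, and it assumes (the notebook does not state this) that the columns missing from the test set are the two targets.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical rows, real column names)
train = pd.DataFrame({
    "Patient Id": ["PID0x1", "PID0x2"],
    "Patient Age": [2.0, 4.0],
    "Genetic Disorder": ["Mitochondrial genetic inheritance disorders", None],
    "Disorder Subclass": ["Leigh syndrome", "Cystic fibrosis"],
})
# Assumption: the test file lacks only the target columns
test = train.drop(columns=["Genetic Disorder", "Disorder Subclass"])

# Columns present in train but absent from test, in train order
target_cols = [col for col in train.columns if col not in test.columns]

print(train.shape, test.shape)
print(target_cols)
```

Running the same comparison on the real files would confirm or refute the assumption about the two extra columns.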

The columns that are used have the following information:
- Patient Id: Represents the unique identification number of a patient
- Patient Age: Represents the age of a patient
- Genes in mother's side: Represents a gene defect in a patient's mother
- Inherited from father: Represents a gene defect in a patient's father
- Maternal gene: Represents a gene defect in the patient's maternal side of the family
- Paternal gene: Represents a gene defect in a patient's paternal side of the family
- Blood cell count (mcL): Represents the blood cell count of a patient
- Patient First Name: Represents a patient's first name
- Family Name: Represents a patient's family name or surname
- Father's name: Represents a patient's father's name
- Mother's age: Represents a patient's mother's age
- Father's age: Represents a patient's father's age
- Institute Name: Represents the medical institute where a patient was born
- Location of Institute: Represents the location of the medical institute
- Status: Represents whether a patient is deceased
- Respiratory Rate (breaths/min): Represents a patient's respiratory (breathing) rate
- Heart Rate (rates/min): Represents a patient's heart rate
- Test 1 - Test 5: Represents different (masked) tests that were conducted on a patient
- Parental consent: Represents whether a patient's parents approved the treatment plan
- Follow-up: Represents a patient's level of risk (how intense their condition is)
- Gender: Represents a patient's gender
- Birth asphyxia: Represents whether a patient suffered from birth asphyxia
- Autopsy shows birth defect (if applicable): Represents whether a patient's autopsy showed any birth defects
- Place of birth: Represents whether a patient was born in a medical institute or home
- Folic acid details (peri-conceptional): Represents the periconceptional folic acid supplementation details of a patient
- H/O serious maternal illness: Represents an unexpected outcome of labor and delivery that resulted in significant short or long term consequences to a patient's mother
- H/O radiation exposure (x-ray): Represents whether a patient has any radiation exposure history
- H/O substance abuse: Represents whether a parent has a history of drug addiction
- Assisted conception IVF/ART: Represents the type of treatment used for infertility
- History of anomalies in previous pregnancies: Represents whether the mother had any anomalies in her previous pregnancies
- No. of previous abortion: Represents the number of abortions that a mother had
- Birth defects: Represents whether a patient has birth defects
- White Blood cell count (thousand per microliter): Represents a patient's white blood cell count
- Blood test result: Represents a patient's blood test results
- Symptom 1 - Symptom 5: Represents (masked) different types of symptoms that a patient had
- Genetic Disorder: Represents the genetic disorder that a patient has
- Disorder Subclass: Represents the subclass of the disorder

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

from warnings import filterwarnings
filterwarnings('ignore')

In [3]:
# Load the data
df_original = pd.read_csv("../data/raw/train.csv")

# Work on a copy so the original data stays unmodified, then display it
df = df_original.copy()
df

Unnamed: 0,Patient Id,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Patient First Name,Family Name,Father's name,...,Birth defects,White Blood cell count (thousand per microliter),Blood test result,Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5,Genetic Disorder,Disorder Subclass
0,PID0x6418,2.0,Yes,No,Yes,No,4.760603,Richard,,Larre,...,,9.857562,,1.0,1.0,1.0,1.0,1.0,Mitochondrial genetic inheritance disorders,Leber's hereditary optic neuropathy
1,PID0x25d5,4.0,Yes,Yes,No,No,4.910669,Mike,,Brycen,...,Multiple,5.522560,normal,1.0,,1.0,1.0,0.0,,Cystic fibrosis
2,PID0x4a82,6.0,Yes,No,No,No,4.893297,Kimberly,,Nashon,...,Singular,,normal,0.0,1.0,1.0,1.0,1.0,Multifactorial genetic inheritance disorders,Diabetes
3,PID0x4ac8,12.0,Yes,No,Yes,No,4.705280,Jeffery,Hoelscher,Aayaan,...,Singular,7.919321,inconclusive,0.0,0.0,1.0,0.0,0.0,Mitochondrial genetic inheritance disorders,Leigh syndrome
4,PID0x1bf7,11.0,Yes,No,,Yes,4.720703,Johanna,Stutzman,Suave,...,Multiple,4.098210,,0.0,0.0,0.0,0.0,,Multifactorial genetic inheritance disorders,Cancer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22078,PID0x5598,4.0,Yes,Yes,Yes,No,5.258298,Lynn,,Alhassane,...,Multiple,6.584811,inconclusive,0.0,0.0,1.0,0.0,0.0,Mitochondrial genetic inheritance disorders,Leigh syndrome
22079,PID0x19cb,8.0,No,Yes,No,Yes,4.974220,Matthew,Farley,Dartanion,...,Multiple,7.041556,inconclusive,1.0,1.0,1.0,1.0,0.0,Multifactorial genetic inheritance disorders,Diabetes
22080,PID0x3c4f,8.0,Yes,No,Yes,No,5.186470,John,,Cavani,...,Singular,7.715464,normal,0.0,0.0,0.0,1.0,,Mitochondrial genetic inheritance disorders,Mitochondrial myopathy
22081,PID0x13a,7.0,Yes,No,Yes,Yes,4.858543,Sharon,,Bomer,...,Multiple,8.437670,abnormal,1.0,1.0,1.0,0.0,0.0,,Leigh syndrome


In [7]:
# Drop the personal-information columns so that we work with anonymous data, as well as parental consent and the institute name and location, which are not relevant
df1 = df.drop(["Patient Id", "Patient First Name", "Family Name", "Father's name", "Parental consent", "Institute Name", "Location of Institute"], axis=1)

# The Test and Symptom columns are masked and carry no specific information, so they are dropped as well
df1 = df1.drop(["Test 1", "Test 2", "Test 3", "Test 4", "Test 5", "Symptom 1", "Symptom 2", "Symptom 3", "Symptom 4", "Symptom 5"], axis=1)

# Observe the remaining information of the dataframe
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22083 entries, 0 to 22082
Data columns (total 28 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Patient Age                                       20656 non-null  float64
 1   Genes in mother's side                            22083 non-null  object 
 2   Inherited from father                             21777 non-null  object 
 3   Maternal gene                                     19273 non-null  object 
 4   Paternal gene                                     22083 non-null  object 
 5   Blood cell count (mcL)                            22083 non-null  float64
 6   Mother's age                                      16047 non-null  float64
 7   Father's age                                      16097 non-null  float64
 8   Status                                            22083 non-null  object 
 9   Respiratory Rate 
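
The non-null counts reported by `info()` can also be read as a share of missing values per column, which makes the sparsest columns easier to spot. This is a minimal sketch on a made-up frame; the column names come from the dataset, the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Patient Age": [2.0, np.nan, 6.0, 12.0],
    "Maternal gene": ["Yes", "No", None, None],
    "Status": ["Alive", "Deceased", "Alive", "Alive"],
})

# Fraction of missing entries per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the notebook's data the same expression would be applied to `df1`.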

In [15]:
# Rename columns
df1.rename(columns={"Patient Age": "Patient_Age", "Genes in mother's side": "Mother_inherit", "Inherited from father": "Father_inherit",
                    "Maternal gene": "Maternal_gene", "Paternal gene": "Paternal_gene", "Blood cell count (mcL)": "Blood_cell_count",
                    "Mother's age": "Mother_age", "Father's age": "Father_age", "Respiratory Rate (breaths/min)": "Respiratory_rate",
                    "Heart Rate (rates/min": "Heart_rate", "Follow-up": "Follow_up", "Birth asphyxia": "Birth_asphyxia",
                    "Autopsy shows birth defect (if applicable)": "Autopsy_birth_defect", "Place of birth": "Place_birth",
                    "Folic acid details (peri-conceptional)": "Folic_acid", "H/O serious maternal illness": "Maternal_illness",
                    "H/O radiation exposure (x-ray)": "Radiation_exposure", "H/O substance abuse": "Substance_abuse",
                    "Assisted conception IVF/ART": "Assisted_conception", "History of anomalies in previous pregnancies": "History_previous_pregnancies",
                    "No. of previous abortion": "Number_abortions", "Birth defects": "Birth_defects", "White Blood cell count (thousand per microliter)": "WBC_count",
                    "Blood test result": "Blood_test", "Genetic Disorder": "Genetic_disorder", "Disorder Subclass": "Disorder_subclass"}, inplace=True)

**Working with missing information**

In [10]:
# Replace the placeholder categories that encode missing information with NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df1["Birth_asphyxia"] = df1["Birth_asphyxia"].replace(["No record", "Not available"], np.nan)

df1["Autopsy_birth_defect"] = df1["Autopsy_birth_defect"].replace(["None", "Not applicable"], np.nan)

df1["Radiation_exposure"] = df1["Radiation_exposure"].replace(["-", "Not applicable"], np.nan)

df1["Substance_abuse"] = df1["Substance_abuse"].replace(["-", "Not applicable"], np.nan)
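
The effect of this step can be checked on a small example. The sketch below keeps the placeholder strings in a dictionary (a hypothetical `placeholders` name, not part of the notebook) and uses `isin` with `mask` as one possible way to blank them out, then counts the resulting missing values.

```python
import pandas as pd

# Toy frame with hypothetical rows; the placeholder strings are those used in the notebook
toy = pd.DataFrame({
    "Birth_asphyxia": ["Yes", "No record", "Not available", "No"],
    "Radiation_exposure": ["-", "No", "Not applicable", "Yes"],
})

# Hypothetical mapping: column name -> strings that mean "missing"
placeholders = {
    "Birth_asphyxia": ["No record", "Not available"],
    "Radiation_exposure": ["-", "Not applicable"],
}

cleaned = toy.copy()
for col, values in placeholders.items():
    # mask() sets every entry where the condition holds to NaN
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(values))

print(cleaned.isna().sum())
```

Keeping the mapping in one place makes it easier to review which strings are treated as missing for each column.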

In [14]:
# The aim is to predict the Genetic Disorder and Disorder Subclass a patient has.
# If a training row has no value in either the Genetic Disorder or the Disorder Subclass column, we can assume that the genetic disease is unknown or not present.
# Either way, the row cannot be used as a labelled example, so we drop it.

df1.dropna(subset=["Genetic_disorder", "Disorder_subclass"], axis=0, inplace=True)
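
With `subset` given and the default `how="any"`, `dropna` removes a row as soon as one of the listed columns is missing. A small illustration on invented rows:

```python
import pandas as pd

# Toy frame with hypothetical rows: only the first row has both targets
toy = pd.DataFrame({
    "Genetic_disorder": ["Mitochondrial genetic inheritance disorders", None,
                         "Multifactorial genetic inheritance disorders", None],
    "Disorder_subclass": ["Leigh syndrome", "Cystic fibrosis", None, None],
})

# Default how="any": a row is dropped if either target is missing
labelled = toy.dropna(subset=["Genetic_disorder", "Disorder_subclass"])
print(len(toy), "->", len(labelled))
```

Rows that carry a subclass but no disorder (or the reverse) are therefore discarded too.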

In [None]:
# Fill missing values: mode for categorical columns, per-subclass mean for numerical columns.
# The result is assigned back to the column, because fillna(..., inplace=True) on a selected
# column is a chained assignment that newer pandas versions warn about or ignore.

# Categorical columns
mode_fill_cols = ["Autopsy_birth_defect", "Birth_asphyxia", "Radiation_exposure", "Substance_abuse",
                  "Maternal_gene", "History_previous_pregnancies", "Place_birth", "Assisted_conception",
                  "Follow_up", "Gender", "Respiratory_rate", "Birth_defects", "Folic_acid", "Blood_test",
                  "Maternal_illness", "Heart_rate", "Father_inherit"]
for col in mode_fill_cols:
    df1[col] = df1[col].fillna(df1[col].mode()[0])

# Numerical columns: mean of the patients that share the same disorder subclass
mean_fill_cols = ["Mother_age", "Father_age", "WBC_count", "Patient_Age", "Number_abortions"]
for col in mean_fill_cols:
    df1[col] = df1[col].fillna(df1.groupby("Disorder_subclass")[col].transform("mean"))
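
To make the two imputation rules concrete, here is a small worked example on invented rows. It fills a categorical column with its overall mode and a numerical column with the mean of the rows that share the same `Disorder_subclass`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Disorder_subclass": ["Diabetes", "Diabetes", "Diabetes", "Cancer", "Cancer"],
    "Mother_age": [30.0, np.nan, 40.0, 25.0, np.nan],
    "Gender": ["Male", None, "Male", "Female", "Male"],
})

# Categorical rule: most frequent value of the whole column
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

# Numerical rule: mean of the same subclass, aligned row by row via transform
group_mean = toy.groupby("Disorder_subclass")["Mother_age"].transform("mean")
toy["Mother_age"] = toy["Mother_age"].fillna(group_mean)

print(toy)
```

The missing Diabetes age becomes 35.0 (the mean of 30 and 40), and the missing Cancer age becomes 25.0, the only observed value in that group.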

In [None]:
numeric_cols = df1.select_dtypes(include=[np.number]).columns
categoric_cols = df1.select_dtypes(exclude=[np.number]).columns

In [None]:
# Autopsy_birth_defect is negatively correlated with Status, since every patient with an autopsy record is deceased.
# The two columns therefore carry redundant information, so we can drop one of them.
df3 = df2.drop(["Autopsy_birth_defect"], axis=1)
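
One way to check the relationship between the two columns is a cross-tabulation of `Status` against whether an autopsy value is present. The sketch below uses invented rows and assumes the check is run on the values before imputation, while the autopsy column still contains missing entries.

```python
import pandas as pd

# Toy frame with hypothetical rows (values as they might look before imputation)
toy = pd.DataFrame({
    "Status": ["Alive", "Alive", "Deceased", "Deceased", "Deceased"],
    "Autopsy_birth_defect": [None, None, "Yes", "No", "Yes"],
})

# True where an autopsy result is recorded
has_autopsy = toy["Autopsy_birth_defect"].notna()

# Rows: Status, columns: has_autopsy (False / True)
table = pd.crosstab(toy["Status"], has_autopsy)
print(table)
```

If the count for living patients with an autopsy record is zero, as in this toy data, the autopsy column adds little beyond what `Status` already encodes.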

In [None]:
# Save df3 as df_eda in data/processed
df3.to_csv("../data/processed/df_eda.csv")

# Machine Learning

In [None]:
df_eda = pd.read_csv("../data/processed/df_eda.csv", index_col=0)

In [None]:
# The goal is to predict both the genetic disorder and the disorder subclass of a patient, so we build one dataframe per target column.
Genetic_disorder = df_eda.drop(["Disorder_subclass"], axis=1)
Disorder_subclass = df_eda.drop(["Genetic_disorder"], axis=1)
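
From each of these dataframes, a feature matrix and an encoded target vector can then be derived. The following is only a sketch of one possible next step, using the `LabelEncoder` imported at the top of the notebook. The toy rows and the names `X`, `y` and `encoder` are assumptions, not taken from the original.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df_eda (hypothetical rows)
toy = pd.DataFrame({
    "Patient_Age": [2.0, 6.0, 12.0, 11.0],
    "Genetic_disorder": [
        "Mitochondrial genetic inheritance disorders",
        "Multifactorial genetic inheritance disorders",
        "Mitochondrial genetic inheritance disorders",
        "Multifactorial genetic inheritance disorders",
    ],
    "Disorder_subclass": ["Leigh syndrome", "Diabetes", "Leigh syndrome", "Cancer"],
})

# Dataframe for the first target: the other target column is removed
genetic_disorder = toy.drop(columns=["Disorder_subclass"])

# Features are everything except the target itself
X = genetic_disorder.drop(columns=["Genetic_disorder"])

# LabelEncoder maps class names to integers in sorted order
encoder = LabelEncoder()
y = encoder.fit_transform(genetic_disorder["Genetic_disorder"])

print(list(encoder.classes_))
print(list(y))
```

Keeping the fitted `encoder` around allows predictions to be mapped back to class names with `encoder.inverse_transform`.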