# Context

The global population has grown rapidly over the past two centuries. It was estimated at 1 billion people in 1800 and had reached 6 billion by the turn of the twenty-first century. Roughly 227,000 people are added to the world every day, and the population is projected to exceed 11 billion by the end of the 21st century.

According to reports, the unsustainable increase in population, combined with a lack of access to adequate health care, food, and shelter, has led to a rise in genetic disorders. Hereditary illnesses are becoming more common because the need for genetic testing is poorly understood. Children often die as a result of these illnesses, so genetic testing during pregnancy is critical.

# Task
You have been hired as a Machine Learning Engineer by a government agency. You are given a dataset containing medical information about children who have genetic disorders. Your task is to predict the following:

Genetic disorder

Disorder subclass

In [10]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.neighbors import LocalOutlierFactor
from scipy.stats import levene
from scipy.stats import norm
from scipy.stats import shapiro
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.impute import KNNImputer
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:.5f}'.format
plt.style.use('seaborn-whitegrid')
%matplotlib inline


In [11]:
BATCH_SIZE = 10
EPOCHS = 10
ROOT_PATH = 'processed_data'
# os.path.join keeps the paths portable across operating systems
TRAIN_PATH = os.path.join(ROOT_PATH, 'df_train_pr.csv')
TEST_PATH = os.path.join(ROOT_PATH, 'df_test_pr.csv')

In [12]:
train_data = pd.read_csv(TRAIN_PATH)
test_data = pd.read_csv(TEST_PATH)
df_train = train_data.copy()
df_test = test_data.copy()
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18672 entries, 0 to 18671
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Patient_Age                   18672 non-null  int64  
 1   Genes_Mother_Side             18672 non-null  object 
 2   Inherited_from_Father         18672 non-null  object 
 3   Maternal_Gene                 18672 non-null  object 
 4   Paternal_Gene                 18672 non-null  object 
 5   Blood_Cell_mcL                18672 non-null  float64
 6   Mother_Age                    18672 non-null  int64  
 7   Father_Age                    18672 non-null  int64  
 8   Status                        18672 non-null  object 
 9   Respiratory_Rate_Breaths_Min  18672 non-null  object 
 10  Heart_Rates_Min               18672 non-null  object 
 11  Follow_Up                     18672 non-null  object 
 12  Gender                        18672 non-null  object 
 13  B

In [13]:
target_col = ["Disorder_Subclass"]
# Treat columns with fewer than 12 distinct values as categorical
cat_columns = df_train.nunique()[df_train.nunique() < 12].keys().tolist()
print('Categorical Variables')
print(cat_columns)
# Numerical columns: everything that is neither categorical nor the target
num_columns = [x for x in df_train.columns if x not in cat_columns + target_col]
print('Numeric Variables')
print(num_columns)


Categorical Variables
['Genes_Mother_Side', 'Inherited_from_Father', 'Maternal_Gene', 'Paternal_Gene', 'Status', 'Respiratory_Rate_Breaths_Min', 'Heart_Rates_Min', 'Follow_Up', 'Gender', 'Birth_Asphyxia', 'Autopsy_Birth_Defect', 'Place_Birth', 'Folic_Acid', 'Maternal_Illness', 'Radiation_Exposure', 'Substance_Abuse', 'Assisted_Conception', 'History_Previous_Pregnancies', 'Previous_Abortion', 'Birth_Defects', 'Blood_Test_Result', 'Genetic_Disorder', 'Disorder_Subclass']
Numeric Variables
['Patient_Age', 'Blood_Cell_mcL', 'Mother_Age', 'Father_Age', 'White_Blood_Cell']


In [14]:
X_train = df_train.drop(['Disorder_Subclass','Genetic_Disorder'],axis =1)
y_train = df_train['Disorder_Subclass']



In [25]:

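labels = sorted(y_train.unique())

# Map each subclass name to an integer index (alphabetical order)
dict1 = {label: i for i, label in enumerate(labels)}

# Encode the target as a NumPy array of class indices
y_train = np.array([dict1[label] for label in y_train])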

In [26]:
y_train

array([5, 2, 3, ..., 7, 6, 3])

In [27]:
dict1

{"Alzheimer's": 0,
 'Cancer': 1,
 'Cystic fibrosis': 2,
 'Diabetes': 3,
 'Hemochromatosis': 4,
 "Leber's hereditary optic neuropathy": 5,
 'Leigh syndrome': 6,
 'Mitochondrial myopathy': 7,
 'Tay-Sachs': 8}
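
The manual mapping above can also be expressed with scikit-learn's `LabelEncoder`, which sorts the class names before assigning indices and therefore should yield the same alphabetical mapping. The sketch below uses a handful of made-up labels in place of the real `Disorder_Subclass` column; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the Disorder_Subclass column (illustrative only).
subclasses = np.array(["Diabetes", "Cancer", "Tay-Sachs", "Cancer", "Diabetes"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(subclasses)

# classes_ is stored in sorted order, so position i is the label for index i.
label_to_index = {label: int(i) for i, label in enumerate(encoder.classes_)}

# inverse_transform turns integer predictions back into subclass names.
decoded = encoder.inverse_transform(y_encoded)

print(y_encoded)
print(label_to_index)
```

Keeping the fitted encoder around is convenient later on, because model predictions come back as integers and usually need to be converted to subclass names before reporting.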

In [15]:
all_Data = pd.concat((X_train,df_test))
all_Data_E = pd.get_dummies(all_Data, columns=cat_columns[:-2])

In [16]:
X_train= pd.get_dummies(X_train, columns=cat_columns[:-2])
X_test= pd.get_dummies(df_test, columns=cat_columns[:-2])
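
Encoding the training and test frames separately can produce different dummy columns when a category appears in only one of them. One possible safeguard, sketched here on two small made-up frames (the column values are assumptions, not taken from the dataset), is to reindex the encoded test frame to the training columns.

```python
import pandas as pd

# Toy frames; "Ambiguous" appears only in the training data (illustrative values).
train = pd.DataFrame({"Gender": ["Male", "Female", "Ambiguous"], "Patient_Age": [3, 7, 1]})
test = pd.DataFrame({"Gender": ["Female", "Male"], "Patient_Age": [5, 2]})

train_enc = pd.get_dummies(train, columns=["Gender"])
test_enc = pd.get_dummies(test, columns=["Gender"])

# Align the test frame to the training columns: categories missing from the
# test data become all-zero columns, and categories unseen in training are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After alignment both frames share the same columns in the same order, which is what a fitted scikit-learn estimator expects at prediction time.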

In [17]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18672 entries, 0 to 18671
Data columns (total 57 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Patient_Age                                  18672 non-null  int64  
 1   Blood_Cell_mcL                               18672 non-null  float64
 2   Mother_Age                                   18672 non-null  int64  
 3   Father_Age                                   18672 non-null  int64  
 4   White_Blood_Cell                             18672 non-null  float64
 5   Genes_Mother_Side_No                         18672 non-null  uint8  
 6   Genes_Mother_Side_Yes                        18672 non-null  uint8  
 7   Inherited_from_Father_No                     18672 non-null  uint8  
 8   Inherited_from_Father_Yes                    18672 non-null  uint8  
 9   Maternal_Gene_No                             18672 non-null  uint8  
 10

In [18]:
# There are no differing values between the generated columns
#print(np.sum(X_train_E.columns!=X_test_E.columns))

In [19]:
models = [{'name': 'logreg','label': 'Logistic Regression',
           'classifier': LogisticRegression(random_state=88),
           'grid': {"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}},
          
          {'name': 'knn','label':'K Nearest Neighbors',
           'classifier':KNeighborsClassifier(),
           'grid': {"n_neighbors":np.arange(8)+1}},
          
          {'name': 'dsc','label': 'Decision Tree', 
           'classifier': DecisionTreeClassifier(random_state=88),
           'grid': {"max_depth":np.arange(8)+1}},
          
          {'name': 'rf', 'label': 'Random Forest',
           'classifier': RandomForestClassifier(random_state=88),
           'grid': {'n_estimators': [100, 200, 500, 700],'max_features': ['auto', 'sqrt', 'log2'],
                    'max_depth' : [2,3,4,5,6,7,8],'criterion' :['gini', 'entropy']}}]

In [20]:

def model_selection(classifier, name, grid, X_train, y_train, scoring):
    
    gridsearch_cv=GridSearchCV(classifier, 
                               grid,
                               cv=10, 
                               scoring = scoring,
                               verbose = 1,
                               n_jobs = -1)
    
    gridsearch_cv.fit(X_train, y_train)
    
    results_dict = {}
    
    results_dict['classifier_name'] = name    
    results_dict['classifier'] = gridsearch_cv.best_estimator_
    results_dict['best_params'] = gridsearch_cv.best_params_
    results_dict['ROC_AUC'] = gridsearch_cv.best_score_
    
    return results_dict


results = []
for m in models:    
    print(m['name'])    
    results.append(model_selection(m['classifier'], 
                                   m['name'],
                                   m['grid'],
                                   X_train, 
                                   y_train, 
                                   'roc_auc'))      
    print('completed')

logreg
Fitting 10 folds for each of 14 candidates, totalling 140 fits


ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
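
The error above arises because `LogisticRegression` defaults to the `lbfgs` solver, which does not accept an `l1` penalty. A minimal sketch of one way around it, assuming the `saga` solver (which supports both `l1` and `l2`), is shown below on synthetic data. Since the target here is multiclass, the plain `'roc_auc'` scorer would likely fail as well, so the sketch uses `'roc_auc_ovr'` instead. The dataset, grid values, and pipeline are illustrative assumptions, not the notebook's own configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real features and target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=88)

# saga accepts both l1 and l2 penalties; scaling helps it converge.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=2000, random_state=88),
)

grid = {
    "logisticregression__C": np.logspace(-2, 2, 3),
    "logisticregression__penalty": ["l1", "l2"],
}

# roc_auc_ovr averages one-vs-rest AUC over the classes.
search = GridSearchCV(pipe, grid, cv=3, scoring="roc_auc_ovr")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Recent scikit-learn releases have also removed `'auto'` as a `max_features` option for random forests, so the grid for that model may need the same kind of adjustment depending on the installed version.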

In [28]:
rand_forest = RandomForestClassifier()


rand_forest_param = {
    "criterion":['entropy'],
    "n_estimators": [700],
    "max_features": ['auto'],
    "max_depth": [8],
    'random_state':[88]
}


gs_rand_forest = GridSearchCV(rand_forest,
                         rand_forest_param,
                         cv = 10,
                         scoring = 'accuracy',
                         verbose = 1,
                         n_jobs = -1)

grids = {"gs_rand_forest": gs_rand_forest}
    

In [29]:
for nombre, grid_search in grids.items():
    grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits


In [32]:
y_train

array([5, 2, 3, ..., 7, 6, 3])

In [31]:
print(gs_rand_forest.best_score_)
print(gs_rand_forest.best_params_)
print(gs_rand_forest.best_estimator_)

0.26574586902690595
{'criterion': 'entropy', 'max_depth': 8, 'max_features': 'auto', 'n_estimators': 700, 'random_state': 88}
RandomForestClassifier(criterion='entropy', max_depth=8, n_estimators=700,
                       random_state=88)
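
An accuracy of about 0.27 across nine classes is easier to interpret next to a trivial baseline. The sketch below, on randomly generated data rather than the notebook's dataset, shows how a majority-class `DummyClassifier` could be cross-validated for that comparison; all names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Random features and an imbalanced 3-class target (synthetic stand-in data).
rng = np.random.default_rng(88)
X = rng.normal(size=(200, 4))
y = rng.choice(3, size=200, p=[0.5, 0.3, 0.2])

# Always predicts the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()

# Share of the majority class in the full target, for reference.
majority_share = np.bincount(y).max() / len(y)

print(baseline_acc, majority_share)
```

If a tuned model scores only slightly above this baseline, the features may carry little signal for the target, or the preprocessing may be discarding it.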


In [21]:
from supervised.automl import AutoML


In [33]:
automl = AutoML()
automl.fit(X_train, y_train)

Linear algorithm was disabled.
AutoML directory: AutoML_3
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 2 models
1_Baseline logloss 1.841309 trained in 1.19 seconds


2021-09-24 21:06:52,086 concurrent.futures ERROR exception calling callback for <Future at 0x254b586c4c0 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\Usuario\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "C:\Users\Usuario\anaconda3\lib\multiprocessing\queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "C:\Users\Usuario\anaconda3\lib\site-packages\supervised\__init__.py", line 3, in <module>
    from supervised.automl import AutoML
  File "C:\Users\Usuario\anaconda3\lib\site-packages\supervised\automl.py", line 3, in <module>
    from supervised.base_automl import BaseAutoML
  File "C:\Users\Usuario\anaconda3\lib\site-packages\supervised\base_automl.py", line 21, in <module>
    from supervised.algorithms.registry import AlgorithmsRegis

A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Problem during computing permutation importance. Skipping ...
2_DecisionTree logloss 1.793842 trained in 45.57 seconds
* Step default_algorithms will try to check up to 3 models


3_Default_Xgboost logloss 1.82256 trained in 72.65 seconds


4_Default_NeuralNetwork logloss 1.80776 trained in 20.8 seconds


5_Default_RandomForest logloss 1.782773 trained in 65.84 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 1.782773 trained in 0.8 seconds
AutoML fit time: 244.03 seconds
AutoML best model: 5_Default_RandomForest


AutoML()

In [34]:
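# Predictions are made on X_train, so despite the "Test MSE" label this is a
# training-set figure; MSE over integer class indices is also not a meaningful
# classification metric.
predictions = automl.predict(X_train)
print("Test MSE:", mean_squared_error(y_train, predictions))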



Test MSE: 5.846347472150814
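
Because the class indices are arbitrary, a classification metric computed on held-out rows is usually more informative than MSE on the training set. The following sketch illustrates the idea with a plain `RandomForestClassifier` on synthetic data; it does not use the AutoML object, and the split sizes and names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the encoded features and target.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=3, random_state=88)

# Hold out 25% of the rows, stratified so every class appears in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=88
)

clf = RandomForestClassifier(n_estimators=100, random_state=88).fit(X_fit, y_fit)

# Accuracy uses hard predictions; log loss uses the predicted probabilities.
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
val_logloss = log_loss(y_val, clf.predict_proba(X_val), labels=clf.classes_)

print(val_accuracy, val_logloss)
```

Log loss is the metric the AutoML run reports, so computing it on a validation split gives a figure that can be compared with the values in the training log.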


In [38]:
automl.score(X_train, y_train)

0.2727077977720651