<h1 style="text-align: center;"><strong>Prediction of Viral Load Suppression Project</strong></h1>

<p style="text-align: center;"><strong>Dègninou Yehadji</strong><br /><span style="color: #0000ff;">TU Dublin, Blanchardstown Campus</span><br /><span style="color: #0000ff;">Dublin 15</span><br /><span style="color: #0000ff;"><em>Email: <a href="mailto:degninou.yehadji@fulbrightmail.org">B00108474@mytudublin.ie</a></em></span> </p>

This project uses an HIV treatment dataset to predict viral load suppression for individuals in the cohort.

The target variable is Last ART VL Count, recoded into a binary variable indicating whether the individual's viral load is suppressed. A threshold of < 1000 RNA copies/ml defines a suppressed viral load.
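The recoding described above can be sketched as follows. This is a minimal illustration, not the project's actual data management code: the helper name `recode_vl_suppressed` and the example values are assumptions, while the column names and the threshold come from the text.

```python
import pandas as pd

# Threshold from the text: fewer than 1000 RNA copies/ml counts as suppressed
VL_THRESHOLD = 1000


def recode_vl_suppressed(vl_counts: pd.Series, threshold: int = VL_THRESHOLD) -> pd.Series:
    """Return 1 where the viral load is below the threshold, else 0."""
    return (vl_counts < threshold).astype(int)


# Illustrative values only, not taken from the cohort
example = pd.DataFrame({"Last ART VL Count": [40, 999, 1000, 25000]})
example["VL Suppressed_Yes"] = recode_vl_suppressed(example["Last ART VL Count"])
print(example)
```

Note that a missing viral load compares as not below the threshold and would be coded 0 here, so missing values would need to be handled before recoding.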

This Python script was developed to meet the objective presented above. Specifically, it is intended to:
- Perform data management tasks on the original dataset
- Show visualisations of the data using the Python plotting tools
- Develop and evaluate 3 classification models to predict viral load suppression
- Perform cross validation (use k-fold cross validation)
- Search for optimal values of the hyperparameters (tuning parameters) using GridSearchCV and/or RandomizedSearchCV in scikit-learn.
- Show the performance of the models by generating classification reports/confusion matrices, ROC curves etc.
- Comment on the process as a whole and in particular on the result of the estimation.
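The cross-validation and hyperparameter search steps listed above can be combined as in the sketch below. It is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the HIV cohort), and the `var_smoothing` grid is a hypothetical choice rather than the grid used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption), not the HIV cohort
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# var_smoothing is a smoothing parameter exposed by GaussianNB;
# the range of candidate values below is a hypothetical choice
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}

# Inner 10-fold cross-validation picks the best value from the grid
search = GridSearchCV(GaussianNB(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print("Best parameters found:", search.best_params_)

# Passing the search object to cross_val_score gives nested cross-validation:
# the grid search is repeated inside each of the 10 outer folds
nested_scores = cross_val_score(search, X, y, cv=10)
print("Mean accuracy:", nested_scores.mean())
print("Standard deviation:", nested_scores.std())
```

With an empty parameter grid, the grid search has a single candidate (the default settings), so the outer cross-validation then simply estimates the accuracy of the default model.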

This notebook is dedicated to evaluating a Gaussian Naive Bayes classifier on feature subsets selected by Gini index.

## Import required packages 

In [22]:
# Import required packages 

import os
import json

import warnings

import datetime
from datetime import timedelta
from scipy.stats import chi2_contingency
from scipy.stats import chisquare
from dython import nominal

from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import ExtraTreesClassifier

from ITMO_FS.filters.univariate import reliefF_measure
from ITMO_FS.filters.univariate import information_gain
from ITMO_FS.filters.univariate import gini_index
from ITMO_FS.filters.univariate import f_ratio_measure

from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier  
from sklearn import metrics 
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif

from sklearn.utils import resample

from sklearn.tree import export_graphviz
from IPython.display import Image  
import pydotplus
from io import StringIO

import missingno as msno 

import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sns
import xlrd

## Set working directory and load dataset

In [23]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r"..\..\MCS Thesis\Modeling")
data_train01 = pd.read_excel('data_train_clean.xlsx')
data_test01 = pd.read_excel('data_test_clean.xlsx')

In [24]:
data_train02 = data_train01.copy()
data_test02 = data_test01.copy()

In [25]:
data_train02.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16793 entries, 0 to 16792
Data columns (total 45 columns):
 #   Column                                     Non-Null Count  Dtype
---  ------                                     --------------  -----
 0   Gender_Male                                16793 non-null  int64
 1   Regiment schedule_3-Month                  16793 non-null  int64
 2   Regiment schedule_6-Month                  16793 non-null  int64
 3   Regiment schedule_Regular                  16793 non-null  int64
 4   Prior ART_Naive                            16793 non-null  int64
 5   Method into ART_New                        16793 non-null  int64
 6   TB Rx Started_Yes                          16793 non-null  int64
 7   TPT Outcome_No treatement                  16793 non-null  int64
 8   TPT Outcome_Other                          16793 non-null  int64
 9   TPT Outcome_Rx completed                   16793 non-null  int64
 10  Regimen At Baseline_1S3E                   167

### Select features with Gini = 0.16 to 0.45

In [26]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]
# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

In [27]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.74940476 0.75357143 0.75357143 0.75342466 0.74449077 0.74568195
 0.74925551 0.75044669 0.73853484 0.76474092]

Mean accuracy:  0.7503122961513372

Standard deviation:  0.006596309184906362
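Each experiment in this notebook types out the same column list twice, once for the training set and once for the test set. A possible alternative is to derive the list from the feature scores with a small helper, as sketched below. The helper `select_by_score`, the placeholder feature names, and the score values are all hypothetical; they are not the Gini values computed for this dataset.

```python
import pandas as pd


def select_by_score(scores: pd.Series, low: float, high: float) -> list:
    """Return feature names whose score lies in [low, high], lowest score first."""
    in_range = scores[(scores >= low) & (scores <= high)]
    return in_range.sort_values().index.tolist()


# Hypothetical scores for placeholder features (not the notebook's Gini values)
gini_scores = pd.Series({"feat_a": 0.16, "feat_b": 0.45, "feat_c": 0.49, "feat_d": 0.90})

cols_045 = select_by_score(gini_scores, 0.16, 0.45)
cols_049 = select_by_score(gini_scores, 0.16, 0.49)
print(cols_045)
print(cols_049)
```

The resulting list could then index both frames (for example `data_train02[cols_045]` and `data_test02[cols_045]`), which keeps the training and test columns in sync.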


### Select features with Gini = 0.16 to 0.49

In [28]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière']]

# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]

In [29]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.75357143 0.75952381 0.75535714 0.74568195 0.76116736 0.75699821
 0.76891007 0.75997618 0.74806432 0.7665277 ]

Mean accuracy:  0.7575778170112595

Standard deviation:  0.006959027349843323


### Select features with Gini = 0.16 to 0.50

In [30]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]
# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

In [31]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.75238095 0.76071429 0.75654762 0.74627755 0.76057177 0.75640262
 0.76831447 0.75938058 0.74865992 0.7659321 ]

Mean accuracy:  0.7575181868459117

Standard deviation:  0.006661533786928225


### Select features with Gini = 0.16 to 0.54

In [32]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]
# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

In [33]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.76130952 0.76607143 0.75595238 0.75699821 0.75461584 0.75640262
 0.76235855 0.76176295 0.74687314 0.7665277 ]

Mean accuracy:  0.7588872344649593

Standard deviation:  0.005631295553525414


### Select features with Gini = 0.16 to 0.66

In [34]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown',
'TPT Outcome_No treatement',
'TB Rx Started_Yes',
'TPT Outcome_Other',
'Regimen At Baseline_1S3E',
'Facility_CMC Matam',
'Regiment schedule_3-Month',
'Last ART Prescription_1Z3N']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown',
'TPT Outcome_No treatement',
'TB Rx Started_Yes',
'TPT Outcome_Other',
'Regimen At Baseline_1S3E',
'Facility_CMC Matam',
'Regiment schedule_3-Month',
'Last ART Prescription_1Z3N']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]
# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

In [35]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.73571429 0.725      0.72261905 0.74270399 0.7337701  0.73793925
 0.74091721 0.73436569 0.73555688 0.73555688]

Mean accuracy:  0.7344143339289259

Standard deviation:  0.005965797722898561


### Select features with Gini = 0.16 to 0.78

In [36]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown',
'TPT Outcome_No treatement',
'TB Rx Started_Yes',
'TPT Outcome_Other',
'Regimen At Baseline_1S3E',
'Facility_CMC Matam',
'Regiment schedule_3-Month',
'Last ART Prescription_1Z3N',
'Regiment schedule_Regular',
'TB Status At Last Visit_Other']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown',
'TPT Outcome_No treatement',
'TB Rx Started_Yes',
'TPT Outcome_Other',
'Regimen At Baseline_1S3E',
'Facility_CMC Matam',
'Regiment schedule_3-Month',
'Last ART Prescription_1Z3N',
'Regiment schedule_Regular',
'TB Status At Last Visit_Other']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]
# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

In [37]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.73928571 0.74107143 0.74583333 0.75104229 0.73496129 0.74508636
 0.74687314 0.74091721 0.74329958 0.75223347]

Mean accuracy:  0.7440603817465045

Standard deviation:  0.005026302376857503


### Select features with Gini = 0.16 to 0.90

In [38]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown',
'TPT Outcome_No treatement',
'TB Rx Started_Yes',
'TPT Outcome_Other',
'Regimen At Baseline_1S3E',
'Facility_CMC Matam',
'Regiment schedule_3-Month',
'Last ART Prescription_1Z3N',
'Regiment schedule_Regular',
'TB Status At Last Visit_Other',
'Last ART Prescription_2T3L',
'Last ART Prescription_Other',
'Second Line Rx_Yes',
'Last ART Prescription_2Z3L']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Regiment schedule_6-Month',
'Last ART Prescription_1T3E',
'Facility_CS Wanidara',
'Facility_CMC Flamboyants',
'Last ART CD4',
'Last ART Prescription_1T3O',
'TPT Outcome_Rx completed',
'Duration on ART (months)',
'Baseline CD4',
'Last Pre-ART CD4',
'Current Age',
'Age At ART Start',
'TB Status At Last Visit_No symptoms',
'Regimen At Baseline_1S3N',
'Facility_CS Dabompa',
'CPT at ART Start_Yes',
'Regimen At Baseline_1T3N',
'Facility_CMC Minière',
'Regimen At Baseline_1T3E',
'Facility_CS Tombolia',
'Regimen At Baseline_1Z3N',
'Prior ART_Naive',
'Last Pre-ART Stage',
'Gender_Male',
'Stage at ART Start',
'Facility_CMC Coléah',
'Method into ART_New',
'TB Status At Last Visit_Not Screened',
'Regimen At Baseline_Other',
'Facility_CS Gbessia port 1',
'TB Status At Last Visit_Screening Unknown',
'TPT Outcome_No treatement',
'TB Rx Started_Yes',
'TPT Outcome_Other',
'Regimen At Baseline_1S3E',
'Facility_CMC Matam',
'Regiment schedule_3-Month',
'Last ART Prescription_1Z3N',
'Regiment schedule_Regular',
'TB Status At Last Visit_Other',
'Last ART Prescription_2T3L',
'Last ART Prescription_Other',
'Second Line Rx_Yes',
'Last ART Prescription_2Z3L']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]
# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

In [39]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.72797619 0.71666667 0.75119048 0.7337701  0.71530673 0.72721858
 0.72066706 0.72245384 0.72364503 0.71709351]

Mean accuracy:  0.7255988187413143

Standard deviation:  0.010129459260342892


### Select features without CD4 features

In [40]:
# Create a subset with potential predictors - Training set
xtrain01 = data_train02[['Gender_Male',
'Regiment schedule_3-Month',
'Regiment schedule_6-Month',
'Regiment schedule_Regular',
'Prior ART_Naive',
'Method into ART_New',
'TB Rx Started_Yes',
'TPT Outcome_No treatement',
'TPT Outcome_Other',
'TPT Outcome_Rx completed',
'Regimen At Baseline_1S3E',
'Regimen At Baseline_1S3N',
'Regimen At Baseline_1T3E',
'Regimen At Baseline_1T3N',
'Regimen At Baseline_1Z3N',
'Regimen At Baseline_Other',
'Last ART Prescription_1T3E',
'Last ART Prescription_1T3O',
'Last ART Prescription_1Z3N',
'Last ART Prescription_2T3L',
'Last ART Prescription_2Z3L',
'Last ART Prescription_Other',
'Facility_CMC Coléah',
'Facility_CMC Flamboyants',
'Facility_CMC Matam',
'Facility_CMC Minière',
'Facility_CS Dabompa',
'Facility_CS Gbessia port 1',
'Facility_CS Tombolia',
'Facility_CS Wanidara',
'TB Status At Last Visit_No symptoms',
'TB Status At Last Visit_Not Screened',
'TB Status At Last Visit_Other',
'TB Status At Last Visit_Screening Unknown',
'CPT at ART Start_Yes',
'Second Line Rx_Yes',
'Last Pre-ART Stage',
'Stage at ART Start',
'Age At ART Start',
'Current Age',
'Duration on ART (months)']]

# Create a subset with potential predictors - Test set 
xtest01 = data_test02[['Gender_Male',
'Regiment schedule_3-Month',
'Regiment schedule_6-Month',
'Regiment schedule_Regular',
'Prior ART_Naive',
'Method into ART_New',
'TB Rx Started_Yes',
'TPT Outcome_No treatement',
'TPT Outcome_Other',
'TPT Outcome_Rx completed',
'Regimen At Baseline_1S3E',
'Regimen At Baseline_1S3N',
'Regimen At Baseline_1T3E',
'Regimen At Baseline_1T3N',
'Regimen At Baseline_1Z3N',
'Regimen At Baseline_Other',
'Last ART Prescription_1T3E',
'Last ART Prescription_1T3O',
'Last ART Prescription_1Z3N',
'Last ART Prescription_2T3L',
'Last ART Prescription_2Z3L',
'Last ART Prescription_Other',
'Facility_CMC Coléah',
'Facility_CMC Flamboyants',
'Facility_CMC Matam',
'Facility_CMC Minière',
'Facility_CS Dabompa',
'Facility_CS Gbessia port 1',
'Facility_CS Tombolia',
'Facility_CS Wanidara',
'TB Status At Last Visit_No symptoms',
'TB Status At Last Visit_Not Screened',
'TB Status At Last Visit_Other',
'TB Status At Last Visit_Screening Unknown',
'CPT at ART Start_Yes',
'Second Line Rx_Yes',
'Last Pre-ART Stage',
'Stage at ART Start',
'Age At ART Start',
'Current Age',
'Duration on ART (months)']]

# Target -- Training set
ytrain01 = data_train02[['VL Suppressed_Yes']]
# Target -- Test set
ytest01 = data_test02[['VL Suppressed_Yes']]

In [41]:
# Create Naive Bayes classifier object
NB = GaussianNB()

# Empty grid: GaussianNB is evaluated with its default settings
parameters = {}

# Conduct parameter optimization
# Create a grid search object
NB_classifier = GridSearchCV(NB, parameters, n_jobs=-1, cv=10)
# Fit the grid search
NB_classifier.fit(xtrain01, ytrain01.values.ravel())

# Best parameter set
# print('Best parameters found:\n', NB_classifier.best_params_)

# Use cross-validation to evaluate the model (target passed as a 1-D array)
CV_Result = cross_val_score(NB_classifier, xtrain01, ytrain01.values.ravel(), cv=10, n_jobs=-1)
print(); print('Accuracy of the 10-fold CVs: ', CV_Result)
print(); print('Mean accuracy: ', CV_Result.mean())
print(); print('Standard deviation: ', CV_Result.std())


Accuracy of the 10-fold CVs:  [0.72142857 0.7172619  0.7422619  0.72662299 0.7099464  0.71113758
 0.71054199 0.71530673 0.70875521 0.71232877]

Mean accuracy:  0.7175592047420517

Standard deviation:  0.009827567308955447


In [19]:
# data_test02.info()

End of the accuracy calculations for feature subsets selected over various Gini ranges.
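To compare the experiments side by side, the printed results can be collected into one table, as in the sketch below. The mean accuracies and standard deviations are copied from the cell outputs in this notebook and rounded to four decimals, and the feature counts are the lengths of the column lists in each cell. The table layout and column names are assumptions made for illustration.

```python
import pandas as pd

# Mean accuracy and standard deviation as printed by the cells in this notebook (rounded)
results = pd.DataFrame(
    [
        ("Gini 0.16 to 0.45", 6, 0.7503, 0.0066),
        ("Gini 0.16 to 0.49", 18, 0.7576, 0.0070),
        ("Gini 0.16 to 0.50", 21, 0.7575, 0.0067),
        ("Gini 0.16 to 0.54", 31, 0.7589, 0.0056),
        ("Gini 0.16 to 0.66", 38, 0.7344, 0.0060),
        ("Gini 0.16 to 0.78", 40, 0.7441, 0.0050),
        ("Gini 0.16 to 0.90", 44, 0.7256, 0.0101),
        ("Without CD4 features", 41, 0.7176, 0.0098),
    ],
    columns=["feature_set", "n_features", "mean_accuracy", "std"],
)

# Rank the feature sets from highest to lowest mean 10-fold accuracy
ranked = results.sort_values("mean_accuracy", ascending=False).reset_index(drop=True)
print(ranked)

best = ranked.loc[0, "feature_set"]
print("Highest mean accuracy:", best)
```

In this table the 0.16 to 0.54 range has the highest mean accuracy, although the top three feature sets differ by less than one standard deviation.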