#### Import libraries

In [1]:
%config Completer.use_jedi=False # comment if not needed
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
warnings.filterwarnings('ignore')

#### Load the dataset

In [2]:
df = pd.read_csv('risk_factors.csv', na_values='?')

In [3]:
df.shape
df.sample(10)

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
621,31,1.0,15.0,5.0,0.0,0.0,0.0,1.0,1.5,0.0,...,,,0,0,0,0,0,0,0,0
736,36,3.0,18.0,4.0,0.0,0.0,0.0,1.0,2.0,0.0,...,,,0,0,0,0,0,0,0,0
721,34,1.0,15.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
631,20,3.0,15.0,2.0,1.0,3.0,3.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
241,24,2.0,19.0,1.0,0.0,0.0,0.0,1.0,6.0,0.0,...,,,0,0,0,0,0,0,0,0
608,19,2.0,16.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,...,,,0,0,0,0,0,0,0,0
542,19,3.0,16.0,1.0,0.0,0.0,0.0,,,,...,,,0,0,0,0,0,0,0,0
430,19,4.0,15.0,1.0,0.0,0.0,0.0,1.0,0.42,0.0,...,,,0,0,0,0,0,0,0,0
324,33,,16.0,4.0,0.0,0.0,0.0,1.0,3.0,0.0,...,,,0,0,0,0,0,0,0,0
591,21,5.0,15.0,1.0,0.0,0.0,0.0,1.0,0.17,0.0,...,,,0,0,0,0,0,0,0,0


#### Split the dataset
Split the dataset into training, validation and test sets.

In [5]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)


In [6]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num) # random_state is set to a value for reproducible output.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed_num)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(480, 32)
(120, 32)
(258, 32)
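
In the cells above, the scaler is fitted on the full feature matrix before the split, so statistics from the validation and test rows influence how the training rows are scaled. A minimal sketch of the alternative, fitting the scaler on the training portion only, is shown below. It uses a small synthetic array (`X_demo`, `y_demo` and `scaler_demo` are hypothetical stand-ins, not the risk-factor data).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 100 rows, 3 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows stay unseen by the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

The same pattern extends to a validation set: call `transform` on it with the scaler fitted on the training rows.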


#### Data preprocessing
Perform data preprocessing such as normalization, standardization, label encoding etc.
______________________________________________________________________________________
Description:

##### Handling Boolean Values

In [7]:
# Convert boolean-like columns to numeric 0/1 flags
# Note: 'Smokes (years)' and 'Smokes (packs/year)' are continuous, so binarizing them discards their magnitude
boolean_cols = [
    'Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives', 'IUD',
    'STDs', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis',
    'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease',
    'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B',
    'STDs:HPV', 'Dx:Cancer', 'Dx:CIN', 'Dx:HPV', 'Dx', 'Hinselmann', 'Schiller', 'Citology', 'Biopsy'
]

# Parse the columns as numbers; entries that cannot be parsed become NaN
df[boolean_cols] = df[boolean_cols].apply(pd.to_numeric, errors='coerce')

# Replace values greater than 0 with 1; NaN > 0 is False, so missing values stay untouched
df[boolean_cols] = df[boolean_cols].applymap(lambda x: 1 if x > 0 else x)
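
The binarization step works because a comparison such as `NaN > 0` evaluates to `False`, so missing entries fall through unchanged. The sketch below illustrates this on a hypothetical toy frame (`toy` is not part of the dataset) using the vectorized `DataFrame.mask`, which is one possible alternative to `applymap`; recent pandas releases deprecate `applymap` in favor of `DataFrame.map`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a missing value and counts above 1
toy = pd.DataFrame({"Smokes": [0.0, 1.0, np.nan], "STDs": [2.0, 0.0, 3.0]})

# Where the condition is True the value is replaced by 1.0;
# NaN > 0 is False, so missing entries are left as NaN
binarized = toy.mask(toy > 0, 1.0)
print(binarized)
```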


##### Check for the sum of missing values

In [8]:
print(df.isnull().sum())

Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

##### Dropping Columns with too high missing values

In [9]:
# Set the threshold for dropping columns
threshold = 0.5  # keep columns with at least 50% non-null values

# Calculate the minimum number of non-null values required for each column to be retained
min_non_null_values = int(len(df) * threshold)

# Drop columns with too many missing values
df = df.dropna(axis=1, thresh=min_non_null_values)

# Print the remaining missing values count after dropping columns
print(df.isnull().sum())

df.shape


Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

(858, 34)
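
To make the effect of `thresh` concrete, here is a small sketch on a hypothetical four-row frame (`toy` and its column names are made up for illustration). A column is kept only if it has at least `thresh` non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column with 3 of 4 values present, one with 1 of 4
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

threshold = 0.5
min_non_null = int(len(toy) * threshold)  # 2 non-null values required

# Columns with fewer than min_non_null non-null values are dropped
kept = toy.dropna(axis=1, thresh=min_non_null)
print(list(kept.columns))
```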

##### Dropping Rows with missing values

In [10]:
# Drop rows with missing values
df = df.dropna()

# Print the remaining missing values count after dropping rows
print(df.isnull().sum())

# Print the shape of the cleaned dataset
df.shape

#export cleaned data
df.to_csv('risk_factors_clean.csv', index=False)

Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


#### Feature Selection
Perform feature selection to select the relevant features.
______________________________________________________________________________________
Feature Selection Using Random Forest

Step 1: Loading dataset

Step 2: Train Test Split

Step 3: Import Random Forest Classifier module.

Step 4: Training of Model

In [11]:
df = pd.read_csv('risk_factors_clean.csv', na_values='?')

In [12]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)

In [13]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.2, random_state=seed_num) # random_state is set to a value for reproducible output.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=seed_num)

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [15]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics  
print()
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))


ACCURACY OF THE MODEL: 0.8582089552238806
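
Because `y` has four target columns, `accuracy_score` here reports subset accuracy: a row counts as correct only if all of its labels are predicted correctly. The sketch below contrasts that with per-label accuracy on a hypothetical two-label example (`y_true` and `y_hat` are made-up arrays).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical ground truth and predictions for 4 samples and 2 labels
y_true = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_hat = np.array([[0, 0], [1, 1], [0, 1], [0, 1]])

# Subset accuracy: fraction of rows where every label matches
subset_acc = accuracy_score(y_true, y_hat)

# Per-label accuracy: each column evaluated on its own
per_label_acc = (y_true == y_hat).mean(axis=0)

print(subset_acc, per_label_acc)
```

Subset accuracy is the stricter of the two, so it can sit well below the accuracy of any individual target.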


In [16]:
feature_imp = pd.Series(clf.feature_importances_, index = X.columns).sort_values(ascending = False)
feature_imp  # display the feature importances in descending order

Age                                   0.200980
Hormonal Contraceptives (years)       0.162567
First sexual intercourse              0.159940
Number of sexual partners             0.115291
Num of pregnancies                    0.095743
IUD (years)                           0.055990
Hormonal Contraceptives               0.026115
STDs (number)                         0.023473
STDs:HIV                              0.021222
IUD                                   0.019347
Dx                                    0.016264
Dx:HPV                                0.014428
Smokes (packs/year)                   0.014421
Dx:Cancer                             0.013367
STDs: Number of diagnosis             0.012619
Smokes                                0.012092
Smokes (years)                        0.009874
STDs:condylomatosis                   0.007715
STDs                                  0.005804
STDs:vulvo-perineal condylomatosis    0.004987
Dx:CIN                                0.003644
STDs:syphilis
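
The ranking above can also be used to pick the top features programmatically instead of typing their names by hand. The following is a minimal sketch on synthetic data from `make_classification`; the column names `f0` to `f7` and the variables `X_demo`, `rf` and `top_features` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 8 features, 3 of them informative
X_arr, y_arr = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# A fixed random_state keeps the importance ranking reproducible
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_arr)

importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

# Keep the 5 highest-ranked columns
top_features = importances.head(5).index.tolist()
X_top = X_demo[top_features]
print(top_features)
```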

In [17]:
##Take 5 most important features
X = df[['Age','First sexual intercourse', 'Hormonal Contraceptives (years)', 'Number of sexual partners', 'Num of pregnancies']]
X_transform = scaler.fit_transform(X)

seed_num = 42
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.2, random_state=seed_num) # random_state is set to a value for reproducible output
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=seed_num)

In [18]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))

ACCURACY OF THE MODEL: 0.8432835820895522


_**As a result, the accuracy of the model using only the 5 most significant features (0.843) is close to
the accuracy of the model using all 30 features (0.858), suggesting that the selected features
capture most of the information needed for making predictions. The two runs use different split seeds (10 and 42), so the comparison is approximate.**_

#### Data modeling
Build the machine learning models. You must build at least two (2) predictive models. One of the predictive models must be either Decision Tree or Support Vector Machine.
______________________________________________________________________________________
Description:

In [None]:
#importing all required functions 
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV

## Decision Tree

In [19]:
from sklearn.tree import DecisionTreeClassifier

parameters = {
    'estimator__max_depth': [None, 1, 3, 8],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4]
}

dt_multi_model = MultiOutputClassifier(DecisionTreeClassifier(random_state=seed_num))
dt_gsc = GridSearchCV(dt_multi_model, parameters, cv=10, scoring=['accuracy', 'precision_micro', 'recall_micro', 'f1_micro'], n_jobs=3, refit='recall_micro')
dt_gsc.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=MultiOutputClassifier(estimator=DecisionTreeClassifier(random_state=42)),
             n_jobs=3,
             param_grid={'estimator__max_depth': [None, 1, 3, 8],
                         'estimator__min_samples_leaf': [1, 2, 4],
                         'estimator__min_samples_split': [2, 5, 10]},
             refit='recall_micro',
             scoring=['accuracy', 'precision_micro', 'recall_micro',
                      'f1_micro'])
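
When several scoring metrics are passed together with `refit='recall_micro'`, `best_score_` holds the mean cross-validated value of the refit metric only; the other metrics are available in `cv_results_` under `mean_test_<metric>`. The sketch below shows how they might be read out. It runs on synthetic multi-label data, and `X_demo`, `y_demo` and `search` are hypothetical names rather than objects from this notebook.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-label data: 120 samples, 6 features, 3 targets
X_demo, y_demo = make_multilabel_classification(
    n_samples=120, n_features=6, n_classes=3, random_state=0
)

search = GridSearchCV(
    MultiOutputClassifier(DecisionTreeClassifier(random_state=0)),
    {"estimator__max_depth": [1, 3]},
    cv=3,
    scoring=["recall_micro", "f1_micro"],
    refit="recall_micro",  # best_params_ and best_score_ follow this metric
)
search.fit(X_demo, y_demo)

# Look up each metric at the index of the selected parameter set
best = search.best_index_
best_recall = search.cv_results_["mean_test_recall_micro"][best]
best_f1 = search.cv_results_["mean_test_f1_micro"][best]

print("recall_micro:", best_recall)
print("f1_micro:", best_f1)
```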

In [20]:
print(f"\033[1mBest parameters found:\033[0m {dt_gsc.best_params_}")
# best_score_ is the mean cross-validated score of the refit metric (recall_micro), not F1
print(f"\033[1mBest recall_micro score:\033[0m {dt_gsc.best_score_}")

# Validate the parameter
print(f"\033[1mRecommended Model Accuracy with validation data:\033[0m {dt_gsc.best_estimator_.score(X_val, y_val)}")

Best parameters found: {'estimator__max_depth': None, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 2}
Best recall_micro score: 0.06927308802308803
Recommended Model Accuracy with validation data: 0.7222222222222222


_**Build Decision Tree model with the best parameters**_

In [21]:
from sklearn.tree import DecisionTreeClassifier

# Obtain the best parameters from hyperparameter tuning
best_params = {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}

# Create a decision tree classifier with the best parameters
# best_dt_model = MultiOutputClassifier(DecisionTreeClassifier(**best_params))
best_dt_model = dt_gsc.best_estimator_

In [22]:
# Fit the model to the training data
best_dt_model.fit(X_train, y_train)

# Make predictions on the test set
dt_y_pred = best_dt_model.predict(X_test)

## Support Vector Machine

In [23]:
from sklearn.svm import SVC

parameters = {'estimator__kernel': ['linear', 'sigmoid'], 'estimator__C': [0.1, 0.5, 1, 1.5], 'estimator__coef0': [0, 1, 2, 3, 4, 5]}
clf = MultiOutputClassifier(SVC(random_state=seed_num))
gsc = GridSearchCV(clf, parameters, cv=10, scoring=['accuracy', 'precision_micro', 'recall_micro', 'f1_micro'], n_jobs=3, refit='recall_micro')
gsc.fit(X_train, y_train)


GridSearchCV(cv=10,
             estimator=MultiOutputClassifier(estimator=SVC(random_state=42)),
             n_jobs=3,
             param_grid={'estimator__C': [0.1, 0.5, 1, 1.5],
                         'estimator__coef0': [0, 1, 2, 3, 4, 5],
                         'estimator__kernel': ['linear', 'sigmoid']},
             refit='recall_micro',
             scoring=['accuracy', 'precision_micro', 'recall_micro',
                      'f1_micro'])

In [24]:
print(f"\033[1mBest parameters found:\033[0m {gsc.best_params_}")
# best_score_ is the mean cross-validated score of the refit metric (recall_micro), not F1
print(f"\033[1mBest recall_micro score:\033[0m {gsc.best_score_}")
# Validate the parameter
print(f"\033[1mRecommended Model Accuracy with validation data:\033[0m {gsc.best_estimator_.score(X_val, y_val)}")

Best parameters found: {'estimator__C': 1, 'estimator__coef0': 1, 'estimator__kernel': 'sigmoid'}
Best recall_micro score: 0.04206349206349207
Recommended Model Accuracy with validation data: 0.7592592592592593


_**Build SVM model with the best parameters**_

In [25]:
from sklearn.svm import SVC
C = 1
model_svc = SVC(kernel='sigmoid', C=C, coef0=1)

multi_target_svm = MultiOutputClassifier(model_svc)  

# Fit the classifier to the data
multi_target_svm.fit(X_train, y_train)

# Predictions
svm_y_pred = multi_target_svm.predict(X_test)

#### Evaluate the models
Perform a comparison between the predictive models. <br>
Report the accuracy, recall, precision and F1-score measures as well as the confusion matrix if it is a classification problem. <br>
Report the R2 score, mean squared error and mean absolute error if it is a regression problem.
______________________________________________________________________________________
Description:
<br>
Since this is a classification problem that requires predicting multiple targets, we use **accuracy**, **recall**, **precision**, **F1-score** and the **confusion matrix** to evaluate both of our models.

For hyperparameter tuning, we use **accuracy**, **precision_micro**, **recall_micro** and **f1_micro** as scoring metrics to evaluate our multi-label classification models. Since our dataset involves medical diagnoses, we refit on **recall_micro** in order to minimize the number of _False Negatives_, which is crucial for our predictions.

This approach aligns with the healthcare objective of maximizing sensitivity: the model should identify as many true positives as possible, even if this comes at the expense of other metrics such as precision or overall accuracy.

In [26]:
#importing all required functions 
from sklearn.metrics import classification_report, multilabel_confusion_matrix, accuracy_score, confusion_matrix

_**Model Evaluation for Decision Tree**_

In [27]:
print('Accuracy:\n',accuracy_score(y_test, dt_y_pred))
print('\n')
print('Confusion Matrix:\n',multilabel_confusion_matrix(y_test, dt_y_pred))
print('\n')
print('Classification Report: \n', classification_report(y_test, dt_y_pred))

Accuracy:
 0.6417910447761194


Confusion Matrix:
 [[[118   9]
  [  7   0]]

 [[107  14]
  [  9   4]]

 [[108  15]
  [ 10   1]]

 [[116   7]
  [ 10   1]]]


Classification Report: 
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         7
           1       0.22      0.31      0.26        13
           2       0.06      0.09      0.07        11
           3       0.12      0.09      0.11        11

   micro avg       0.12      0.14      0.13        42
   macro avg       0.10      0.12      0.11        42
weighted avg       0.12      0.14      0.13        42
 samples avg       0.03      0.02      0.02        42



_**Model Evaluation for SVM**_

In [28]:
print('Accuracy:\n',accuracy_score(y_test, svm_y_pred))
print('\n')
print('Confusion Matrix:\n',multilabel_confusion_matrix(y_test, svm_y_pred))
print('\n')
print('Classification Report: \n', classification_report(y_test, svm_y_pred))

Accuracy:
 0.746268656716418


Confusion Matrix:
 [[[125   2]
  [  7   0]]

 [[114   7]
  [ 12   1]]

 [[118   5]
  [ 10   1]]

 [[115   8]
  [ 10   1]]]


Classification Report: 
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         7
           1       0.12      0.08      0.10        13
           2       0.17      0.09      0.12        11
           3       0.11      0.09      0.10        11

   micro avg       0.12      0.07      0.09        42
   macro avg       0.10      0.06      0.08        42
weighted avg       0.11      0.07      0.09        42
 samples avg       0.01      0.01      0.01        42



### _**Key Observation**_

_Accuracy_

- SVM model: Achieves an accuracy of **74.63%**.
- Decision Tree model: Achieves an accuracy of **64.18%**.

<br>

_Recall_

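- SVM model: Shows **lower** recall overall (micro-averaged recall of 0.07), with lower recall than the Decision Tree model on target 1 (0.08 versus 0.31) and equal recall on the other three targets.
- Decision Tree model: Shows **higher** recall overall (micro-averaged recall of 0.14), although it still misses most positive cases.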

<br>

_Confusion Matrix_

- SVM model: Produces **more** false negatives (39 across the four targets) but far fewer false positives (22), which is consistent with its higher accuracy.
- Decision Tree model: Produces slightly **fewer** false negatives (36) but more false positives (45). Both models miss most positive instances, which could lead to potential misdiagnoses if not addressed.

<br>

_Classification Report_

The two models have the same micro-averaged precision (0.12), while the Decision Tree model has **higher** micro-averaged recall (0.14 versus 0.07) and F1-score (0.13 versus 0.09) than the SVM model.
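
The false-negative counts and micro-averaged scores can be recomputed directly from the confusion matrices printed earlier. The sketch below copies those matrices by hand (each has the layout `[[TN, FP], [FN, TP]]` returned by `multilabel_confusion_matrix`); `micro_scores` is a hypothetical helper written for this check.

```python
import numpy as np

# Confusion matrices copied from the outputs above, one 2x2 block per target
dt_cm = np.array([[[118, 9], [7, 0]], [[107, 14], [9, 4]],
                  [[108, 15], [10, 1]], [[116, 7], [10, 1]]])
svm_cm = np.array([[[125, 2], [7, 0]], [[114, 7], [12, 1]],
                   [[118, 5], [10, 1]], [[115, 8], [10, 1]]])

def micro_scores(cm):
    """Pool FP, FN and TP over all targets, then compute precision and recall."""
    fp = int(cm[:, 0, 1].sum())
    fn = int(cm[:, 1, 0].sum())
    tp = int(cm[:, 1, 1].sum())
    return {
        "FP": fp,
        "FN": fn,
        "TP": tp,
        "precision": round(tp / (tp + fp), 2),
        "recall": round(tp / (tp + fn), 2),
    }

print("Decision Tree:", micro_scores(dt_cm))
print("SVM:", micro_scores(svm_cm))
```

The pooled values agree with the `micro avg` rows of the two classification reports.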

## Conclusion

In a nutshell, the SVM model achieves the higher subset accuracy (74.63% versus 64.18%) and produces fewer false positives than the Decision Tree model. However, in the context of medical diagnosis, where minimizing False Negatives (maximizing recall) is crucial to avoid misdiagnosing patients, the Decision Tree model is slightly ahead: its micro-averaged recall is 0.14 against 0.07 for the SVM model, and both values are very low.

Given these results, we recommend the SVM model as the champion model only when overall accuracy is the deciding criterion; when recall is the priority, the Decision Tree model is the better of the two. With either model missing most positive cases on the test set, neither is ready for deployment in real-world healthcare settings as it stands. Further fine-tuning of the hyperparameters could potentially improve recall.