#### Instruction (Read this)
- Use this template to develop your project. Do not change the steps. 
- For each step, you may add additional cells if needed.
- But remove <b>unnecessary</b> cells to ensure the notebook is readable.
- Marks will be <b>deducted</b> if the notebook is cluttered or difficult to follow due to excess or irrelevant content.
- <b>Briefly</b> describe the steps in the "Description:" field.
- <b>Do not</b> submit the dataset. 
- The submitted jupyter notebook will be executed using the uploaded dataset in eLearn.

#### Group Information

Group No: Cancer2

- Member 1: MUHAMMAD DANISH AIMAN BIN MUHAMMAD NAZIR
- Member 2: MUHAMMAD AMMAR BIN ADNAN 160932
- Member 3: ARDY QAWI BIN HASHIM
- Member 4: MAHDIL ASHRONIE BIN MUHAMAD MURTADZA


#### Import libraries

In [1]:
%config Completer.use_jedi=False # comment if not needed
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
warnings.filterwarnings('ignore')

#### Load the dataset

In [2]:
df = pd.read_csv('risk_factors.csv', na_values='?')

In [3]:
df.shape
df.sample(10)

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
845,19,2.0,15.0,2.0,0.0,0.0,0.0,1.0,0.75,0.0,...,,,0,0,0,0,0,0,0,0
679,50,2.0,17.0,7.0,0.0,0.0,0.0,1.0,5.0,0.0,...,,,0,0,0,0,0,0,0,0
516,20,2.0,16.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,,0,0,0,0,0,0,0,0
19,40,2.0,27.0,,0.0,0.0,0.0,0.0,0.0,1.0,...,,,0,0,0,0,0,0,0,0
716,19,3.0,15.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
250,25,2.0,18.0,2.0,0.0,0.0,0.0,1.0,0.25,0.0,...,,,0,0,0,0,0,0,0,0
802,15,2.0,14.0,1.0,0.0,0.0,0.0,1.0,0.16,0.0,...,,,0,0,0,0,0,0,0,0
161,28,2.0,18.0,2.0,0.0,0.0,0.0,1.0,6.0,0.0,...,,,0,0,0,0,0,0,0,0
639,17,2.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
732,26,2.0,17.0,2.0,0.0,0.0,0.0,0.0,0.0,,...,,,0,0,0,0,0,0,0,0


#### Split the dataset
Split the dataset into training, validation and test sets.

In [4]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)


In [5]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num)  # random_state is fixed for reproducible output
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed_num)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(480, 32)
(120, 32)
(258, 32)
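
The two-stage split above can be checked in isolation. The sketch below is only an illustration: it uses placeholder arrays (`X_demo`, `y_demo`, both hypothetical) with the same number of rows as the raw dataset, and it confirms that a 30% test split followed by a 20% validation split of the remainder gives roughly a 56/14/30 train/validation/test partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 858                      # row count of the raw dataset, per the shapes printed above
X_demo = np.zeros((n_rows, 32))   # placeholder features; the values do not matter here
y_demo = np.zeros((n_rows, 4))    # placeholder targets (four label columns)

# Stage 1: hold out 30% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)
# Stage 2: hold out 20% of the remaining rows as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.2, random_state=10)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)
print([round(size / n_rows, 2) for size in split_sizes])
```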


#### Data preprocessing
Perform data preprocessing such as normalization, standardization, label encoding etc.
______________________________________________________________________________________
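Description: The columns listed in `boolean_cols` are converted to numeric 0/1 values, columns with fewer than 50% non-null values are dropped, and the remaining rows with missing values are removed. The cleaned data is exported to `risk_factors_clean.csv`.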

##### Handling Boolean Values

In [6]:
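# Columns to be reduced to 0/1 indicators (they stay numeric, not Python bool)
# NOTE: 'Smokes (years)' and 'Smokes (packs/year)' hold durations/amounts, so listing them here reduces them to 0/1 as well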
boolean_cols = [
    'Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives', 'IUD',
    'STDs', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis',
    'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease',
    'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B',
    'STDs:HPV', 'Dx:Cancer', 'Dx:CIN', 'Dx:HPV', 'Dx', 'Hinselmann', 'Schiller', 'Citology', 'Biopsy'
]

# Coerce these columns to numeric; entries that cannot be parsed become NaN
df[boolean_cols] = df[boolean_cols].apply(pd.to_numeric, errors='coerce')

# Replace values greater than 0 with 1; zeros and missing values are left untouched
df[boolean_cols] = df[boolean_cols].applymap(lambda x: 1 if x > 0 else x)
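
As a quick check of the rule above, the following sketch applies the same lambda to a small hypothetical column (`toy["flag"]`, not part of the dataset). It relies on the fact that `NaN > 0` evaluates to `False`, so missing values fall through unchanged. The `Series.mask` call is shown as one possible vectorised equivalent, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical column containing a zero, two positive values and a missing value.
toy = pd.DataFrame({"flag": [0.0, 1.0, 3.0, np.nan]})

# Same rule as in the cell above: values > 0 become 1, everything else is kept as-is.
binarised = toy["flag"].map(lambda x: 1 if x > 0 else x)

# Vectorised alternative: overwrite entries where the condition holds.
vectorised = toy["flag"].mask(toy["flag"] > 0, 1)

print(binarised.tolist())
print(vectorised.tolist())
```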


##### Check for the sum of missing values

In [7]:
print(df.isnull().sum())

Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

##### Dropping Columns with too high missing values

In [8]:
# Set the threshold for dropping columns
threshold = 0.5  # keep columns with at least 50% non-null values

# Minimum number of non-null values a column needs in order to be retained
min_non_null_values = int(len(df) * threshold)

# Drop columns with too many missing values
df = df.dropna(axis=1, thresh=min_non_null_values)

# Print the remaining missing values count after dropping columns
print(df.isnull().sum())

df.shape


Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

(858, 34)
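
The behaviour of `dropna(axis=1, thresh=...)` can be illustrated on a small hypothetical frame (the column names below are invented for the example). A column is kept when its count of non-null values is at least `thresh`, so a column that is exactly half present survives a 50% threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three columns of decreasing completeness.
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],        # 3 of 4 values present
    "half_present":   [1.0, np.nan, np.nan, 4.0],     # 2 of 4 values present
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],  # 1 of 4 values present
})

min_non_null = int(len(toy) * 0.5)               # 2 non-null values required
kept = toy.dropna(axis=1, thresh=min_non_null)   # drop columns below the requirement

print(list(kept.columns))
```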

##### Dropping Rows with missing values

In [9]:
# Drop rows with missing values
df = df.dropna()

# Print the remaining missing values count after dropping rows
print(df.isnull().sum())

# Print the shape of the cleaned dataset
print(df.shape)

# Export the cleaned data
df.to_csv('risk_factors_clean.csv', index=False)

Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


#### Feature Selection
Perform feature selection to select the relevant features.
______________________________________________________________________________________
Feature Selection Using Random Forest

Step 1: Loading dataset

Step 2: Train Test Split

Step 3: Import Random Forest Classifier module.

Step 4: Training of Model

In [10]:
df = pd.read_csv('risk_factors_clean.csv', na_values='?')

In [11]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)

In [12]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num)  # random_state is fixed for reproducible output
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed_num)

In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics  
print()
 
# with four target columns, accuracy_score is the subset accuracy: a row counts as correct only if all four labels match
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))


ACCURACY OF THE MODEL: 0.8557213930348259


In [15]:
feature_imp = pd.Series(clf.feature_importances_, index = X.columns).sort_values(ascending = False)
feature_imp  # display the features sorted by importance, most important first

Age                                   0.210668
Hormonal Contraceptives (years)       0.167478
First sexual intercourse              0.154819
Num of pregnancies                    0.111136
Number of sexual partners             0.108137
Dx:HPV                                0.034878
IUD (years)                           0.031605
Dx:Cancer                             0.020848
Hormonal Contraceptives               0.020508
Dx                                    0.019488
IUD                                   0.018577
Smokes                                0.014778
STDs:HIV                              0.014193
Smokes (years)                        0.013717
STDs (number)                         0.013044
Smokes (packs/year)                   0.010953
STDs:vulvo-perineal condylomatosis    0.007454
STDs:condylomatosis                   0.006548
STDs: Number of diagnosis             0.006173
STDs                                  0.006048
Dx:CIN                                0.005728
STDs:vaginal 
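
Picking the top-ranked features can also be done programmatically instead of typing the column names by hand. The sketch below is a minimal illustration on synthetic data: `make_classification`, the column names `f0`..`f7` and the fixed `random_state` are assumptions made for the example and are not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 300 rows, 8 features with hypothetical names f0..f7.
X_arr, y_arr = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=10)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Fit a random forest; fixing random_state makes the ranking repeatable.
rf = RandomForestClassifier(n_estimators=100, random_state=10).fit(X_toy, y_arr)

# Rank features by impurity-based importance, highest first.
importances = pd.Series(rf.feature_importances_, index=X_toy.columns).sort_values(ascending=False)

# Keep the five highest-ranked columns.
top_k = importances.head(5).index.tolist()
X_selected = X_toy[top_k]
print(top_k)
```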

In [16]:
##Take 5 most important features
X = df[['Age','First sexual intercourse', 'Hormonal Contraceptives (years)', 'Number of sexual partners', 'Num of pregnancies']]
X_transform = scaler.fit_transform(X)

seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num)  # random_state is fixed for reproducible output
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed_num)

In [17]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))

ACCURACY OF THE MODEL: 0.8656716417910447


_**As a result, the accuracy of the model using only the 5 most important features (0.866) is comparable to,
and slightly higher than, the accuracy of the model using all 30 features (0.856), indicating that the selected
features capture the essential information for making predictions.**_

#### Data modeling
Build the machine learning models. You must build at least two (2) predictive models. One of the predictive models must be either Decision Tree or Support Vector Machine.
______________________________________________________________________________________
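Description: Two multi-output models are built, a Decision Tree and a Support Vector Machine. Each is wrapped in `MultiOutputClassifier` and tuned with `GridSearchCV` (3-fold cross-validation, refit on micro-averaged F1), then checked on the validation set.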

##### Decision Tree

In [18]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor
import matplotlib.pyplot as plt

parameters = {
    'estimator__max_depth': [1, 10, 20, 30],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4]
}

dt_multi_model = MultiOutputClassifier(DecisionTreeClassifier(random_state=seed_num))
dt_gsc = GridSearchCV(dt_multi_model, parameters, cv=3, scoring=['accuracy', 'precision_micro', 'recall_micro', 'f1_micro'], n_jobs=3, refit='f1_micro')
dt_gsc.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=MultiOutputClassifier(estimator=DecisionTreeClassifier(random_state=10)),
             n_jobs=3,
             param_grid={'estimator__max_depth': [1, 10, 20, 30],
                         'estimator__min_samples_leaf': [1, 2, 4],
                         'estimator__min_samples_split': [2, 5, 10]},
             refit='f1_micro',
             scoring=['accuracy', 'precision_micro', 'recall_micro',
                      'f1_micro'])

In [19]:
print(f"Best parameters found: {dt_gsc.best_params_}")
print(f"Best F1 score: {dt_gsc.best_score_}")

# Check the tuned model on the validation set (score() returns subset accuracy)
print(f"Recommended Model Accuracy with validation data: {dt_gsc.best_estimator_.score(X_val, y_val)}")

Best parameters found: {'estimator__max_depth': 20, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 2}
Best F1 score: 0.10833036984352773
Recommended Model Accuracy with validation data: 0.6702127659574468
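
The validation accuracy and the F1 score above measure different things, which is why they can diverge so much. For a multi-output classifier, `score()` reports subset accuracy (every label in a row must match), whereas micro-averaged F1 only rewards correctly predicted positives. A minimal sketch with hypothetical labels shows that a model predicting "all negative" can still reach a high subset accuracy on imbalanced data while its F1 is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 5 samples x 2 targets, mostly negative.
y_true = np.array([[0, 0], [0, 0], [0, 0], [1, 0], [0, 1]])
# A classifier that never predicts a positive label.
y_hat = np.zeros_like(y_true)

# Subset accuracy: fraction of rows in which every label is correct.
subset_acc = accuracy_score(y_true, y_hat)
# Micro F1: pooled over all labels; zero here because there are no true positives.
micro_f1 = f1_score(y_true, y_hat, average="micro", zero_division=0)

print(subset_acc, micro_f1)
```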


_**Build Decision Tree model with the best parameters**_

In [20]:
from sklearn.tree import DecisionTreeClassifier

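# Best max_depth found by the grid search above; the best min_samples_leaf (1) and
# min_samples_split (2) are the scikit-learn defaults, so they are not set explicitly
best_params = {'max_depth': 20}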

# Create a decision tree classifier with the best parameters
best_dt_model = MultiOutputClassifier(DecisionTreeClassifier(**best_params))

In [21]:
# Fit the model to the training data
best_dt_model.fit(X_train, y_train)

# Make predictions on the test set
dt_y_pred = best_dt_model.predict(X_test)
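
As an alternative to re-creating the model by hand, a fitted `GridSearchCV` with refitting enabled already exposes the tuned model as `best_estimator_`, retrained on the full data passed to `fit`. The sketch below illustrates this on synthetic multi-label data; `make_multilabel_classification` and the reduced parameter grid are assumptions made for the example, not the notebook's actual setup.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four binary targets.
X_toy, y_toy = make_multilabel_classification(
    n_samples=200, n_features=5, n_classes=4, random_state=10
)

# Small illustrative grid over the tree depth only.
search = GridSearchCV(
    MultiOutputClassifier(DecisionTreeClassifier(random_state=10)),
    {"estimator__max_depth": [1, 5, 10]},
    cv=3,
    scoring="f1_micro",
    refit=True,  # retrain the best configuration on all of X_toy / y_toy
)
search.fit(X_toy, y_toy)

# The refitted model carries the winning parameters and can predict directly.
tuned_model = search.best_estimator_
tuned_depth = tuned_model.get_params()["estimator__max_depth"]
toy_pred = tuned_model.predict(X_toy)
print(search.best_params_, toy_pred.shape)
```

Reusing `best_estimator_` keeps every tuned parameter (and the `random_state`) in sync with the search, which a hand-written parameter dictionary does not guarantee.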

##### Support Vector Machine

In [22]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.svm import SVC

parameters = {'estimator__kernel': ['linear', 'sigmoid'], 'estimator__C': [0.1, 0.5, 1, 5], 'estimator__coef0': [0, 2, 5]}
clf = MultiOutputClassifier(SVC(random_state=seed_num))
gsc = GridSearchCV(clf, parameters, cv=3, scoring=['accuracy', 'precision_micro', 'recall_micro', 'f1_micro'], n_jobs=3, refit='f1_micro')
gsc.fit(X_train, y_train)
gsc.cv_results_


{'mean_fit_time': array([0.00466673, 0.0053335 , 0.00433294, 0.00600036, 0.00400003,
        0.00500011, 0.00466633, 0.00500019, 0.00466633, 0.00533406,
        0.00466657, 0.00499996, 0.00466593, 0.00566657, 0.00466641,
        0.0056661 , 0.00433342, 0.00566721, 0.0049998 , 0.00533477,
        0.00533334, 0.00533382, 0.00533326, 0.0053335 ]),
 'std_fit_time': array([4.71819920e-04, 4.71314168e-04, 4.70583613e-04, 6.25769923e-07,
        2.24783192e-07, 0.00000000e+00, 4.71539032e-04, 5.94720425e-07,
        4.71202018e-04, 4.72775499e-04, 4.71707689e-04, 6.83651389e-07,
        4.71426560e-04, 4.71258002e-04, 4.71258605e-04, 4.71089465e-04,
        4.71089626e-04, 4.71538951e-04, 4.89903609e-07, 4.70418660e-04,
        4.71258123e-04, 4.71763815e-04, 4.71482786e-04, 4.71145571e-04]),
 'mean_score_time': array([0.00499916, 0.00466625, 0.00466649, 0.00433397, 0.00499988,
        0.00500003, 0.00466696, 0.00433366, 0.0043335 , 0.00466498,
        0.00466657, 0.00500011, 0.00466688, 0.00

In [23]:
print(gsc.best_params_)
print(gsc.best_score_)
# Check the tuned model on the validation set (score() returns subset accuracy)
print(f"Recommended Model Accuracy with validation data: {gsc.best_estimator_.score(X_val, y_val)}")

{'estimator__C': 1, 'estimator__coef0': 0, 'estimator__kernel': 'sigmoid'}
0.03773584905660377
Recommended Model Accuracy with validation data: 0.8085106382978723


_**Build SVM model with the best parameters**_

In [24]:
from sklearn.svm import SVC
C = 1
model_svc = SVC(kernel='sigmoid', C=C)

multi_target_svm = MultiOutputClassifier(model_svc)  

# Fit the classifier to the data
multi_target_svm.fit(X_train, y_train)

# Predictions
svm_y_pred = multi_target_svm.predict(X_test)

#### Evaluate the models
Perform a comparison between the predictive models. <br>
Report the accuracy, recall, precision and F1-score measures as well as the confusion matrix if it is a classification problem. <br>
Report the R2 score, mean squared error and mean absolute error if it is a regression problem.
______________________________________________________________________________________
Description:
<br>
Since this is a classification problem with multiple targets to predict, we use **accuracy**, **recall**, **precision**, **F1-score** and the **confusion matrix** to evaluate both models.

In [25]:
#importing all required functions 
from sklearn.metrics import classification_report, multilabel_confusion_matrix, accuracy_score

_**Model Evaluation for Decision Tree**_

In [26]:
print('Accuracy:\n',accuracy_score(y_test, dt_y_pred))
print('\n')
print('Confusion Matrix:\n',multilabel_confusion_matrix(y_test, dt_y_pred))
print('\n')
print('Classification Report: \n', classification_report(y_test, dt_y_pred))

Accuracy:
 0.5920398009950248


Confusion Matrix:
 [[[177  16]
  [  6   2]]

 [[158  22]
  [ 16   5]]

 [[167  25]
  [  7   2]]

 [[171  15]
  [ 14   1]]]


Classification Report: 
               precision    recall  f1-score   support

           0       0.11      0.25      0.15         8
           1       0.19      0.24      0.21        21
           2       0.07      0.22      0.11         9
           3       0.06      0.07      0.06        15

   micro avg       0.11      0.19      0.14        53
   macro avg       0.11      0.19      0.13        53
weighted avg       0.12      0.19      0.14        53
 samples avg       0.03      0.02      0.02        53
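
The per-label figures in the classification report can be recomputed from the confusion matrices printed above. Each 2x2 block from `multilabel_confusion_matrix` is laid out as `[[TN, FP], [FN, TP]]`. The sketch below copies the four Decision Tree matrices by hand and derives precision, recall and F1 from them.

```python
import numpy as np

# Per-label matrices copied from the Decision Tree output above,
# in the order Hinselmann, Schiller, Citology, Biopsy.
dt_cm = np.array([
    [[177, 16], [6, 2]],
    [[158, 22], [16, 5]],
    [[167, 25], [7, 2]],
    [[171, 15], [14, 1]],
])

# Unpack the four cells of every 2x2 block: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = dt_cm[:, 0, 0], dt_cm[:, 0, 1], dt_cm[:, 1, 0], dt_cm[:, 1, 1]

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

# Micro averages pool the counts over all labels before dividing.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(micro_precision, 2), round(micro_recall, 2))
```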



_**Model Evaluation for SVM**_

In [27]:
print('Accuracy:\n',accuracy_score(y_test, svm_y_pred))
print('\n')
print('Confusion Matrix:\n',multilabel_confusion_matrix(y_test, svm_y_pred))
print('\n')
print('Classification Report: \n', classification_report(y_test, svm_y_pred))

Accuracy:
 0.8208955223880597


Confusion Matrix:
 [[[190   3]
  [  8   0]]

 [[172   8]
  [ 21   0]]

 [[189   3]
  [  9   0]]

 [[181   5]
  [ 15   0]]]


Classification Report: 
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.00      0.00      0.00        21
           2       0.00      0.00      0.00         9
           3       0.00      0.00      0.00        15

   micro avg       0.00      0.00      0.00        53
   macro avg       0.00      0.00      0.00        53
weighted avg       0.00      0.00      0.00        53
 samples avg       0.00      0.00      0.00        53
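
The SVM results show why accuracy alone is misleading on this data. The sketch below copies the four SVM confusion matrices from the output above (layout `[[TN, FP], [FN, TP]]`) and computes the accuracy for each label separately: every label scores above 85% even though the model finds no true positives at all.

```python
import numpy as np

# Per-label matrices copied from the SVM output above,
# in the order Hinselmann, Schiller, Citology, Biopsy.
svm_cm = np.array([
    [[190, 3], [8, 0]],
    [[172, 8], [21, 0]],
    [[189, 3], [9, 0]],
    [[181, 5], [15, 0]],
])

# Accuracy per label: (TN + TP) divided by the number of test samples.
per_label_acc = (svm_cm[:, 0, 0] + svm_cm[:, 1, 1]) / svm_cm.sum(axis=(1, 2))
# True positives per label: the bottom-right cell of each block.
true_positives = svm_cm[:, 1, 1]

print(np.round(per_label_acc, 3))
print(true_positives)
```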

