#### Instruction (Read this)
- Use this template to develop your project. Do not change the steps. 
- For each step, you may add additional cells if needed.
- But remove <b>unnecessary</b> cells to ensure the notebook is readable.
- Marks will be <b>deducted</b> if the notebook is cluttered or difficult to follow due to excess or irrelevant content.
- <b>Briefly</b> describe the steps in the "Description:" field.
- <b>Do not</b> submit the dataset. 
- The submitted jupyter notebook will be executed using the uploaded dataset in eLearn.

#### Group Information

Group No: Cancer2

- Member 1:
- Member 2:
- Member 3:
- Member 4: MAHDIL ASHRONIE BIN MUHAMAD MURTADZA


#### Import libraries

In [199]:
%config Completer.use_jedi=False # comment if not needed
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
warnings.filterwarnings('ignore')

#### Load the dataset

In [200]:
df = pd.read_csv('risk_factors.csv', na_values='?')

In [201]:
print(df.shape)  # a bare df.shape is only displayed when it is the last expression in the cell
df.sample(10)

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
56,35,5.0,15.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
16,41,4.0,21.0,3.0,0.0,0.0,0.0,1.0,0.25,0.0,...,,,0,0,0,0,0,0,0,0
333,22,3.0,17.0,4.0,0.0,0.0,0.0,1.0,5.0,0.0,...,,,0,0,0,0,0,0,0,0
392,19,1.0,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
260,28,2.0,15.0,4.0,0.0,0.0,0.0,1.0,5.0,0.0,...,,,0,0,0,0,0,0,0,0
491,19,1.0,17.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,,0,0,0,0,0,0,0,0
81,31,3.0,15.0,3.0,1.0,16.0,0.8,0.0,0.0,1.0,...,,,0,0,0,0,0,0,0,0
728,32,3.0,21.0,2.0,0.0,0.0,0.0,1.0,8.0,0.0,...,,,0,0,0,0,0,0,0,0
129,30,2.0,14.0,6.0,0.0,0.0,0.0,1.0,5.0,0.0,...,,,0,0,0,0,0,0,0,0
96,35,5.0,11.0,,1.0,15.0,15.0,1.0,14.0,0.0,...,,,0,0,0,0,1,1,1,1


#### Split the dataset
Split the dataset into training, validation and test sets.

In [202]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)


In [203]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num)  # random_state is fixed for reproducible output
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed_num)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(480, 32)
(120, 32)
(258, 32)
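The cell above fits `StandardScaler` on the full feature matrix before splitting, so the validation and test rows contribute to the scaling statistics. A minimal sketch of the alternative, fitting the scaler on the training split only, is shown here on synthetic data. The `*_demo` and `*_s` names are illustrative assumptions and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and a single binary target
rng = np.random.default_rng(10)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# 70/30 split, then 80/20 split of the training part, mirroring the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=10)

# Mean and standard deviation are estimated from the training split only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_va_s = scaler_demo.transform(X_va)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_va_s.shape, X_te_s.shape)
```

With this ordering the validation and test sets are transformed with statistics they did not help estimate, which keeps the later evaluation honest.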


#### Data preprocessing
Perform data preprocessing such as normalization, standardization, label encoding etc.
______________________________________________________________________________________
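Description: Boolean-like columns are coerced to numeric and binarized, columns with fewer than 50% non-null values are dropped, the remaining rows with missing values are removed, and the cleaned data is exported to `risk_factors_clean.csv`.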

##### Handling Boolean Values

In [204]:
# Columns treated as yes/no flags.
# Note: 'Smokes (years)' and 'Smokes (packs/year)' are continuous quantities,
# so binarizing them here discards their magnitude.
boolean_cols = [
    'Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives', 'IUD',
    'STDs', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis',
    'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease',
    'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B',
    'STDs:HPV', 'Dx:Cancer', 'Dx:CIN', 'Dx:HPV', 'Dx', 'Hinselmann', 'Schiller', 'Citology', 'Biopsy'
]

# Coerce the boolean-like columns to numeric; unparseable entries become NaN
df[boolean_cols] = df[boolean_cols].apply(pd.to_numeric, errors='coerce')

# Replace values greater than 0 with 1; zeros and missing values are left untouched
df[boolean_cols] = df[boolean_cols].mask(df[boolean_cols] > 0, 1)
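To illustrate the binarization step, here is a small sketch on a made-up frame (`toy` and its values are invented for illustration). It shows that `'?'` entries coerce to `NaN` and that `NaN` survives the thresholding, because a comparison with `NaN` evaluates to False.

```python
import pandas as pd

# Made-up frame with '?' standing in for missing answers
toy = pd.DataFrame({
    "Smokes": ["0.0", "1.0", "?", "3.0"],
    "IUD": ["0.0", "2.0", "?", "1.0"],
})

# Strings become floats; anything unparseable becomes NaN
toy = toy.apply(pd.to_numeric, errors="coerce")

# Values above 0 collapse to 1; zeros and NaN are kept as they are
binary = toy.mask(toy > 0, 1)
print(binary)
```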


##### Check for the sum of missing values

In [205]:
print(df.isnull().sum())

Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

##### Dropping Columns with too high missing values

In [206]:
# Keep only columns with at least 50% non-null values
threshold = 0.5

# Minimum number of non-null values a column needs to be retained (thresh expects an integer count)
min_non_null_values = int(len(df) * threshold)

# Drop columns with too many missing values
df = df.dropna(axis=1, thresh=min_non_null_values)

# Print the remaining missing values count after dropping columns
print(df.isnull().sum())

df.shape


Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

(858, 34)
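The `thresh` argument of `dropna` counts non-null values rather than missing ones, which is easy to get backwards. A small sketch on an invented frame (column names are hypothetical) shows a column being kept or dropped under the same 50% rule, followed by the row-wise drop used in the next step.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],       # 3 of 4 values present
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 1 of 4 values present
})

# Fraction of missing values per column, for reference
print(toy.isna().mean())

# A column is kept only if it has at least this many non-null values
min_non_null = int(len(toy) * 0.5)
kept = toy.dropna(axis=1, thresh=min_non_null)

# Remaining rows that still contain a missing value are then removed
complete_rows = kept.dropna()
print(kept.columns.tolist(), complete_rows.shape)
```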

##### Dropping Rows with missing values

In [207]:
# Drop rows with missing values
df = df.dropna()

# Print the remaining missing values count after dropping rows
print(df.isnull().sum())

# Print the shape of the cleaned dataset
print(df.shape)

# Export the cleaned data
df.to_csv('risk_factors_clean.csv', index=False)

Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


#### Feature Selection
Perform feature selection to select the relevant features.
______________________________________________________________________________________
Feature Selection Using Random Forest

Step 1: Loading dataset

Step 2: Train Test Split

Step 3: Import Random Forest Classifier module.

Step 4: Training of Model

In [208]:
df = pd.read_csv('risk_factors_clean.csv', na_values='?')

In [209]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)

In [210]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num)  # random_state is fixed for reproducible output

In [211]:
from sklearn.ensemble import RandomForestClassifier

In [212]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics  
print()
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))


ACCURACY OF THE MODEL: 0.8557213930348259


In [213]:
feature_imp = pd.Series(clf.feature_importances_, index = X.columns).sort_values(ascending = False)
feature_imp  # display the feature importances in descending order

Age                                   0.209201
Hormonal Contraceptives (years)       0.161375
First sexual intercourse              0.157461
Number of sexual partners             0.112091
Num of pregnancies                    0.106691
IUD (years)                           0.038003
Hormonal Contraceptives               0.028212
Dx:HPV                                0.022131
STDs (number)                         0.020098
IUD                                   0.018015
Dx:Cancer                             0.017988
STDs:HIV                              0.016870
Smokes                                0.015408
Dx                                    0.013123
Smokes (years)                        0.013039
Smokes (packs/year)                   0.012752
STDs: Number of diagnosis             0.011544
STDs:condylomatosis                   0.005987
STDs                                  0.005765
Dx:CIN                                0.004757
STDs:vulvo-perineal condylomatosis    0.003851
STDs:syphilis

In [214]:
##Take 5 most important features
X = df[['Age','First sexual intercourse', 'Hormonal Contraceptives (years)', 'Number of sexual partners', 'Num of pregnancies']]

In [215]:
seed_num = 10
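# Scale and split the 5 selected features; X_transform still holds all 30 features from the earlier cell
X_selected = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=seed_num)  # random_state is fixed for reproducible output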

In [216]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics  
print()
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))


ACCURACY OF THE MODEL: 0.8656716417910447


_**The accuracy with the 5 most important features (0.866) is comparable to the accuracy with all 30 features (0.856), which suggests that the selected features capture most of the information needed for prediction.**_
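The comparison between all features and the top-ranked ones can be sketched end to end as follows. This is an illustration on synthetic data from `make_classification`, not the notebook's dataset; `fit_and_score`, the `f0`…`f9` column names, and the choice of 3 features are assumptions made for the example. Fixing `random_state` on the forest as well as on the split keeps the two accuracies comparable across runs.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, of which 3 carry signal
X_arr, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=10
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])


def fit_and_score(features):
    """Split on the given columns, fit a seeded forest, return (model, test accuracy)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo[features], y_demo, test_size=0.3, random_state=10
    )
    model = RandomForestClassifier(n_estimators=100, random_state=10).fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)


# Rank features with a model trained on all columns
full_model, full_acc = fit_and_score(list(X_demo.columns))
importances = pd.Series(
    full_model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

# Refit using only the top-ranked columns; the split is redone on the reduced frame
top_features = importances.head(3).index.tolist()
_, top_acc = fit_and_score(top_features)

print("all features:", round(full_acc, 3))
print("top features:", top_features, round(top_acc, 3))
```

The key detail is that the second split is taken from the reduced frame, so the second model really sees only the selected columns.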

#### Data modeling
Build the machine learning models. You must build at least two (2) predictive models. One of the predictive models must be either Decision Tree or Support Vector Machine.
______________________________________________________________________________________
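Description: A Support Vector Machine is wrapped in `MultiOutputClassifier` so that one classifier is fit per target (Hinselmann, Schiller, Citology, Biopsy). Its kernel, C and coef0 are tuned with a 3-fold grid search.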

## Decision Tree

## Support Vector Machine

In [225]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.svm import SVC

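parameters = {'estimator__kernel': ['linear', 'sigmoid'], 'estimator__C': [0.1, 0.5, 1, 5], 'estimator__coef0': [0, 2, 5]}
clf = MultiOutputClassifier(SVC())
# y has four target columns, so macro-averaged scorers are used;
# the plain 'precision', 'recall' and 'f1' scorers assume a single binary target
gsc = GridSearchCV(clf, parameters, cv=3, scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'], n_jobs=3, refit='f1_macro')
gsc.fit(X_train, y_train)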
gsc.cv_results_


{'mean_fit_time': array([0.01059834, 0.0102129 , 0.01136589, 0.01086052, 0.01018858,
        0.01153493, 0.0116748 , 0.01204038, 0.01253303, 0.01119765,
        0.01052586, 0.00965063, 0.01141707, 0.01039561, 0.01128443,
        0.01044703, 0.01192602, 0.01038774, 0.01905878, 0.01117333,
        0.02170038, 0.01460934, 0.0239025 , 0.01103179]),
 'std_fit_time': array([1.06802086e-03, 4.67004975e-04, 6.26880507e-04, 4.70490713e-04,
        4.70528232e-04, 1.43020040e-03, 1.28529031e-03, 1.47093657e-03,
        2.15847339e-03, 1.24384893e-03, 1.41423396e-03, 9.90039828e-05,
        6.01268159e-04, 6.28594682e-04, 5.36501222e-04, 1.37126401e-03,
        1.24133065e-03, 2.24760136e-03, 2.34242038e-03, 4.80603063e-04,
        5.21319968e-03, 5.68697260e-03, 7.65941610e-03, 4.75884612e-04]),
 'mean_score_time': array([0.01055702, 0.00720636, 0.00668073, 0.00768232, 0.00885415,
        0.00884628, 0.00753077, 0.00842086, 0.00753633, 0.00752211,
        0.00786757, 0.00753236, 0.00693353, 0.00

In [223]:
print(gsc.best_params_)
print(gsc.best_index_)

{'estimator__C': 0.1, 'estimator__coef0': 0, 'estimator__kernel': 'linear'}
0
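A best index of 0 is worth a sanity check, since the first candidate can also be returned when the refit metric could not be computed for any candidate. The sketch below, on synthetic multilabel data, runs a grid search with a macro-averaged F1 scorer (which accepts a multi-column target) and confirms that the scores in `cv_results_` are finite. The data, the reduced grid, and the `*_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Synthetic data: four binary targets, each driven by one feature plus noise
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(200, 5))
Y_demo = (X_demo[:, :4] + 0.5 * rng.normal(size=(200, 4)) > 0).astype(int)

param_grid = {
    "estimator__kernel": ["linear", "sigmoid"],
    "estimator__C": [0.1, 1],
}

search = GridSearchCV(
    MultiOutputClassifier(SVC()),
    param_grid,
    cv=3,
    scoring=["accuracy", "f1_macro"],  # averaged F1 works with several target columns
    refit="f1_macro",
)
search.fit(X_demo, Y_demo)

# If any of these were NaN, the ranking (and best_index_) would not be meaningful
f1_scores = search.cv_results_["mean_test_f1_macro"]
print(np.isfinite(f1_scores).all(), search.best_params_)
```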


**Build SVM model with the best parameters**

In [172]:
from sklearn.svm import SVC
C = 0.1
model_svc = SVC(kernel='linear', C=C)

In [224]:
multi_target_svm = MultiOutputClassifier(model_svc)  

# Fit the classifier to the data
multi_target_svm.fit(X_train, y_train)

# Predictions
y_pred = multi_target_svm.predict(X_test)
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix, classification_report
print(accuracy_score(y_test, y_pred))
print(multilabel_confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8557213930348259
[[[193   0]
  [  8   0]]

 [[176   4]
  [ 20   1]]

 [[192   0]
  [  9   0]]

 [[182   4]
  [ 14   1]]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.20      0.05      0.08        21
           2       0.00      0.00      0.00         9
           3       0.20      0.07      0.10        15

   micro avg       0.20      0.04      0.06        53
   macro avg       0.10      0.03      0.04        53
weighted avg       0.14      0.04      0.06        53
 samples avg       0.00      0.00      0.00        53






#### Evaluate the models
Perform a comparison between the predictive models. <br>
Report the accuracy, recall, precision and F1-score measures as well as the confusion matrix if it is a classification problem. <br>
Report the R2 score, mean squared error and mean absolute error if it is a regression problem.
______________________________________________________________________________________
Description: