#### Instruction (Read this)
- Use this template to develop your project. Do not change the steps. 
- For each step, you may add additional cells if needed.
- But remove <b>unnecessary</b> cells to ensure the notebook is readable.
- Marks will be <b>deducted</b> if the notebook is cluttered or difficult to follow due to excess or irrelevant content.
- <b>Briefly</b> describe the steps in the "Description:" field.
- <b>Do not</b> submit the dataset. 
- The submitted jupyter notebook will be executed using the uploaded dataset in eLearn.

#### Group Information

Group No: Cancer2

- Member 1:MUHAMMAD DANISH AIMAN BIN MUHAMMAD NAZIR
- Member 2:MUHAMMAD AMMAR BIN ADNAN
- Member 3:ARDY QAWI BIN HASHIM
- Member 4:MAHDIL ASHRONIE BIN MUHAMAD MURTADZA


#### Import libraries

In [4]:
%config Completer.use_jedi=False # comment if not needed
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
warnings.filterwarnings('ignore')

#### Load the dataset

In [5]:
df = pd.read_csv('risk_factors.csv', na_values='?')

In [6]:
df.shape
df.sample(10)

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
247,20,5.0,17.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
535,38,4.0,18.0,4.0,0.0,0.0,0.0,1.0,16.0,1.0,...,0.0,0,0,0,0,0,1,1,1,1
403,36,1.0,22.0,4.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,1,0,0,0,0,0,0,0,0
478,20,3.0,18.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
258,20,2.0,17.0,2.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
312,19,1.0,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
170,28,2.0,20.0,2.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
299,20,2.0,14.0,4.0,1.0,1.0,1.0,1.0,0.5,0.0,...,0.0,0,0,0,0,0,0,0,0,1
451,24,1.0,17.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,1,0
35,36,3.0,18.0,3.0,1.0,1.0,1.0,1.0,9.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0


#### Split the dataset
Split the dataset into training, validation and test sets.

In [7]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)


In [8]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num) # random_state is set to a value for reproducible␣output.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed_num)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(373, 30)
(94, 30)
(201, 30)


#### Data preprocessing
Perform data preprocessing such as normalization, standardization, label encoding etc.
______________________________________________________________________________________
Description:

##### Handling Boolean Values

In [9]:
# Convert boolean-like columns to proper boolean data types
boolean_cols = [
    'Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives', 'IUD',
    'STDs', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis',
    'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease',
    'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B',
    'STDs:HPV', 'Dx:Cancer', 'Dx:CIN', 'Dx:HPV', 'Dx', 'Hinselmann', 'Schiller', 'Citology', 'Biopsy'
]

# Convert boolean-like columns to numeric
df[boolean_cols] = df[boolean_cols].apply(pd.to_numeric, errors='coerce')

# Replace values greater than 0 with 1, leave missing values untouched
df[boolean_cols] = df[boolean_cols].applymap(lambda x: 1 if x > 0 else x)


##### Check for the sum of missing values

In [10]:
print(df.isnull().sum())

Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


##### Dropping Columns with too high missing values

In [11]:
# Set the threshold for dropping columns
threshold = 0.5  #keep columns with at least 50% non-null values

# Calculate the minimum number of non-null values required for each column to be retained
min_non_null_values = len(df) * threshold

# Drop columns with too many missing values
df = df.dropna(axis=1, thresh=min_non_null_values)

# Print the remaining missing values count after dropping columns
print(df.isnull().sum())

df.shape


Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


(668, 34)

##### Dropping Rows with missing values

In [12]:
# Drop rows with missing values
df = df.dropna()

# Print the remaining missing values count after dropping rows
print(df.isnull().sum())

# Print the shape of the cleaned dataset
df.shape

#export cleaned data
df.to_csv('risk_factors_clean.csv', index=False)

Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


#### Feature Selection
Perform feature selection to select the relevant features.
______________________________________________________________________________________
Feature Selection Using Random Forest

Step 1: Loading dataset

Step 2: Train Test Split

Step 3: Import Random Forest Classifier module.

Step 4: Training of Model

In [13]:
df = pd.read_csv('risk_factors_clean.csv', na_values='?')

In [14]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)

In [15]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num) # random_state is set to a value for reproducible␣output.

In [16]:
from sklearn.ensemble import RandomForestClassifier

In [17]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics  
print()
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))


ACCURACY OF THE MODEL: 0.8656716417910447


In [18]:
feature_imp = pd.Series(clf.feature_importances_, index = X.columns).sort_values(ascending = False)
feature_imp ##print the features significant in ascending order

Age                                   0.214121
Hormonal Contraceptives (years)       0.162846
First sexual intercourse              0.156386
Number of sexual partners             0.113129
Num of pregnancies                    0.094507
IUD (years)                           0.039175
Dx:HPV                                0.029488
Hormonal Contraceptives               0.027081
Dx:Cancer                             0.019484
STDs (number)                         0.018159
STDs:HIV                              0.017896
Smokes                                0.017291
IUD                                   0.013666
Smokes (years)                        0.013183
Dx                                    0.012113
Smokes (packs/year)                   0.011815
STDs: Number of diagnosis             0.011474
STDs:condylomatosis                   0.007146
STDs:vulvo-perineal condylomatosis    0.006712
STDs                                  0.004710
Dx:CIN                                0.003677
STDs:syphilis

In [19]:
##Take 5 most important features
X = df[['Age','First sexual intercourse', 'Hormonal Contraceptives (years)', 'Number of sexual partners', 'Num of pregnancies']]

In [20]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num) # random_state is set to a value for reproducible␣output.

In [21]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
 
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
 
# metrics are used to find accuracy or error
from sklearn import metrics  
print()
 
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))


ACCURACY OF THE MODEL: 0.8507462686567164


_**As the result, the accuracy of the model when we take the 5 most significant features is exactly
the same as the accuracy of the model using all 30 features, indicating that the selected features effectively 
capture the essential information for making predictions.**_

#### Data modeling
Build the machine learning models. You must build atleast two (2) predictive models. One of the predictive models must be either Decision Tree or Support Vector Machine.
______________________________________________________________________________________
Description:

## Decision Tree

In [40]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor

parameters = {
    'estimator__max_depth': [None, 10, 20, 30],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4]
}

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=seed_num))
gsc = GridSearchCV(clf, parameters, cv=3, scoring=['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'], n_jobs=3, refit='f1_weighted')
gsc.fit(X_train, y_train)
cv_results_dt = gsc.cv_results_

for mean_score, params in zip(cv_results_dt['mean_test_f1_weighted'], cv_results_dt['params']):
    print(f"Mean F1 score: {mean_score} with parameters: {params}")

Mean F1 score: 0.12725572770309615 with parameters: {'estimator__max_depth': None, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 2}
Mean F1 score: 0.1375685560419454 with parameters: {'estimator__max_depth': None, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 5}
Mean F1 score: 0.06822190155523489 with parameters: {'estimator__max_depth': None, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 10}
Mean F1 score: 0.1432551575408718 with parameters: {'estimator__max_depth': None, 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 2}
Mean F1 score: 0.09562769505626646 with parameters: {'estimator__max_depth': None, 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 5}
Mean F1 score: 0.05291406124739458 with parameters: {'estimator__max_depth': None, 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 10}
Mean F1 score: 0.034500419912319225 with parameters: {'estimator__max_depth': None, 'est

In [27]:
print(f"Best parameters found: {gsc.best_params_}")
print(f"Best F1 score: {gsc.best_score_}")


Best parameters found: {'estimator__max_depth': 10, 'estimator__min_samples_leaf': 2, 'estimator__min_samples_split': 2}
Best F1 score: 0.15794553594553593


In [37]:
from sklearn.tree import DecisionTreeClassifier

# Obtain the best parameters from hyperparameter tuning
best_params = {'max_depth': 10, 'min_samples_split': 2, 'min_samples_leaf': 1}

# Create a decision tree classifier with the best parameters
best_dt_model = DecisionTreeClassifier(**best_params)

# Fit the model to the training data
best_dt_model.fit(X_train, y_train)

In [38]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

# Get the best model
best_model = gsc.best_estimator_

# Make predictions on the test set
y_pred_test = best_model.predict(X_test)

# Convert predictions to DataFrame to align with y_test
y_pred_test_df = pd.DataFrame(y_pred_test, columns=y.columns)

# Initialize a dictionary to store metrics for each target
test_metrics = {}

# Calculate metrics for each target
for target in y.columns:
    accuracy = accuracy_score(y_test[target], y_pred_test_df[target])
    recall = recall_score(y_test[target], y_pred_test_df[target], average='weighted')
    precision = precision_score(y_test[target], y_pred_test_df[target], average='weighted')
    f1 = f1_score(y_test[target], y_pred_test_df[target], average='weighted')
    cm = confusion_matrix(y_test[target], y_pred_test_df[target])
    
    test_metrics[target] = {
        'accuracy': accuracy,
        'recall': recall,
        'precision': precision,
        'f1': f1,
        'confusion_matrix': cm
    }

# Print the metrics for each target
for target, metrics in test_metrics.items():
    print(f"\nMetrics for Decision Tree Model predicting {target}:")
    print(f"Accuracy: {metrics['accuracy']}")
    print(f"Recall: {metrics['recall']}")
    print(f"Precision: {metrics['precision']}")
    print(f"F1-Score: {metrics['f1']}")
    print(f"Confusion Matrix:\n{metrics['confusion_matrix']}")



Metrics for Decision Tree Model predicting Hinselmann:
Accuracy: 0.9303482587064676
Recall: 0.9303482587064676
Precision: 0.9208062252838373
F1-Score: 0.9255526491255065
Confusion Matrix:
[[187   6]
 [  8   0]]

Metrics for Decision Tree Model predicting Schiller:
Accuracy: 0.8656716417910447
Recall: 0.8656716417910447
Precision: 0.8363539445628998
F1-Score: 0.8489922891087798
Confusion Matrix:
[[171   9]
 [ 18   3]]

Metrics for Decision Tree Model predicting Citology:
Accuracy: 0.945273631840796
Recall: 0.945273631840796
Precision: 0.9120228005700143
F1-Score: 0.9283505744932626
Confusion Matrix:
[[190   2]
 [  9   0]]

Metrics for Decision Tree Model predicting Biopsy:
Accuracy: 0.9054726368159204
Recall: 0.9054726368159204
Precision: 0.8713738997321087
F1-Score: 0.8863332381959718
Confusion Matrix:
[[181   5]
 [ 14   1]]


#### Regression problem Decision Tree

In [41]:
# Train Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=seed_num)
dt_regressor.fit(X_train, y_train[target])

In [43]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
# Initialize dictionaries to store metrics for each target
dt_r2_scores = {}
dt_mse_scores = {}
dt_mae_scores = {}

# Evaluate Decision Tree Regressor for each target
for target in y.columns:
    dt_predictions = dt_regressor.predict(X_test)
    
    # Calculate metrics
    dt_r2_scores[target] = r2_score(y_test[target], dt_predictions)
    dt_mse_scores[target] = mean_squared_error(y_test[target], dt_predictions)
    dt_mae_scores[target] = mean_absolute_error(y_test[target], dt_predictions)

# Print metrics for Decision Tree Regressor for each target
for target in y.columns:
    print(f"\nMetrics for Decision Tree Regressor predicting {target}:")
    print(f"R2 Score: {dt_r2_scores[target]}")
    print(f"Mean Squared Error: {dt_mse_scores[target]}")
    print(f"Mean Absolute Error: {dt_mae_scores[target]}")



Metrics for Decision Tree Regressor predicting Hinselmann:
R2 Score: -1.6117632772020731
Mean Squared Error: 0.09981343283582089
Mean Absolute Error: 0.10074626865671642

Metrics for Decision Tree Regressor predicting Schiller:
R2 Score: -0.5453869047619047
Mean Squared Error: 0.14458955223880596
Mean Absolute Error: 0.1455223880597015

Metrics for Decision Tree Regressor predicting Citology:
R2 Score: -1.2173394097222228
Mean Squared Error: 0.09483830845771145
Mean Absolute Error: 0.09577114427860696

Metrics for Decision Tree Regressor predicting Biopsy:
R2 Score: -0.8055779569892472
Mean Squared Error: 0.12468905472636815
Mean Absolute Error: 0.1256218905472637


## Support Vector Machine

In [225]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.svm import SVC

parameters = {'estimator__kernel': ['linear', 'sigmoid'], 'estimator__C': [0.1, 0.5, 1, 5], 'estimator__coef0': [0, 2, 5]}
clf = MultiOutputClassifier(SVC())
gsc = GridSearchCV(clf, parameters, cv=3, scoring=['accuracy', 'precision', 'recall', 'f1'], n_jobs=3, refit='f1')
gsc.fit(X_train, y_train)
gsc.cv_results_


{'mean_fit_time': array([0.01059834, 0.0102129 , 0.01136589, 0.01086052, 0.01018858,
        0.01153493, 0.0116748 , 0.01204038, 0.01253303, 0.01119765,
        0.01052586, 0.00965063, 0.01141707, 0.01039561, 0.01128443,
        0.01044703, 0.01192602, 0.01038774, 0.01905878, 0.01117333,
        0.02170038, 0.01460934, 0.0239025 , 0.01103179]),
 'std_fit_time': array([1.06802086e-03, 4.67004975e-04, 6.26880507e-04, 4.70490713e-04,
        4.70528232e-04, 1.43020040e-03, 1.28529031e-03, 1.47093657e-03,
        2.15847339e-03, 1.24384893e-03, 1.41423396e-03, 9.90039828e-05,
        6.01268159e-04, 6.28594682e-04, 5.36501222e-04, 1.37126401e-03,
        1.24133065e-03, 2.24760136e-03, 2.34242038e-03, 4.80603063e-04,
        5.21319968e-03, 5.68697260e-03, 7.65941610e-03, 4.75884612e-04]),
 'mean_score_time': array([0.01055702, 0.00720636, 0.00668073, 0.00768232, 0.00885415,
        0.00884628, 0.00753077, 0.00842086, 0.00753633, 0.00752211,
        0.00786757, 0.00753236, 0.00693353, 0.00

In [223]:
print(gsc.best_params_)
print(gsc.best_index_)

{'estimator__C': 0.1, 'estimator__coef0': 0, 'estimator__kernel': 'linear'}
0


**Build SVM model with the best parameters**

In [172]:
from sklearn.svm import SVC
C = 0.1
model_svc = SVC(kernel='linear', C=C)

In [224]:
multi_target_svm = MultiOutputClassifier(model_svc)  

# Fit the classifier to the data
multi_target_svm.fit(X_train, y_train)

# Predictions
y_pred = multi_target_svm.predict(X_test)
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix,classification_report
print(accuracy_score(y_test, y_pred))
print(multilabel_confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8557213930348259
[[[193   0]
  [  8   0]]

 [[176   4]
  [ 20   1]]

 [[192   0]
  [  9   0]]

 [[182   4]
  [ 14   1]]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.20      0.05      0.08        21
           2       0.00      0.00      0.00         9
           3       0.20      0.07      0.10        15

   micro avg       0.20      0.04      0.06        53
   macro avg       0.10      0.03      0.04        53
weighted avg       0.14      0.04      0.06        53
 samples avg       0.00      0.00      0.00        53



#### Evaluate the models
Perform a comparison between the predictive models. <br>
Report the accuracy, recall, precision and F1-score measures as well as the confusion matrix if it is a classification problem. <br>
Report the R2 score, mean squared error and mean absolute error if it is a regression problem.
______________________________________________________________________________________
Description: