#### Instruction (Read this)
- Use this template to develop your project. Do not change the steps. 
- For each step, you may add additional cells if needed.
- But remove <b>unnecessary</b> cells to ensure the notebook is readable.
- Marks will be <b>deducted</b> if the notebook is cluttered or difficult to follow due to excess or irrelevant content.
- <b>Briefly</b> describe the steps in the "Description:" field.
- <b>Do not</b> submit the dataset. 
- The submitted jupyter notebook will be executed using the uploaded dataset in eLearn.

#### Group Information

Group No: Cancer2

- Member 1:
- Member 2:
- Member 3:
- Member 4:


#### Import libraries

In [1]:
%config Completer.use_jedi=False # comment if not needed
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
warnings.filterwarnings('ignore')

#### Load the dataset

In [2]:
df = pd.read_csv('risk_factors.csv', na_values='?')

In [3]:
df.shape

(858, 36)

#### Split the dataset
Split the dataset into training, validation and test sets.

In [4]:
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
scaler = StandardScaler()
X_transform = scaler.fit_transform(X)


In [5]:
seed_num = 10
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=seed_num) # random_state is set to a value for reproducible output.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=seed_num)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(480, 32)
(120, 32)
(258, 32)
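
The cells above fit `StandardScaler` on the full feature matrix before splitting, so the validation and test rows contribute to the scaling statistics. A minimal sketch of the alternative, splitting first and fitting the scaler on the training rows only, is shown below. It uses a small synthetic array rather than `risk_factors.csv`; `X_demo`, `y_demo` and `scaler_demo` are hypothetical names that are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and a single binary target (assumption).
rng = np.random.default_rng(10)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first: hold out 30% for testing, then 20% of the remainder for validation.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=10)

# Fit the scaler on the training rows only, then reuse it for the other splits.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_va_s = scaler_demo.transform(X_va)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_va_s.shape, X_te_s.shape)
```

With this ordering the training split has zero mean and unit variance per column, while the validation and test splits are transformed with the same training statistics.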


#### Data preprocessing
Perform data preprocessing such as normalization, standardization, label encoding etc.
______________________________________________________________________________________
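Description: Boolean-like columns are coerced to numeric 0/1 values, missing values are counted per column, columns with fewer than 50% non-null values are dropped, and the remaining rows that contain missing values are removed before the cleaned data is exported.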

##### Handling Boolean Values

In [6]:
# Columns that hold yes/no flags in the raw data.
# 'Smokes (years)' and 'Smokes (packs/year)' are continuous measurements, so they are not binarized.
boolean_cols = [
    'Smokes', 'Hormonal Contraceptives', 'IUD',
    'STDs', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis',
    'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease',
    'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B',
    'STDs:HPV', 'Dx:Cancer', 'Dx:CIN', 'Dx:HPV', 'Dx', 'Hinselmann', 'Schiller', 'Citology', 'Biopsy'
]

# Coerce the flag columns to numeric; unparseable entries become NaN
df[boolean_cols] = df[boolean_cols].apply(pd.to_numeric, errors='coerce')

# Replace values greater than 0 with 1; zeros and missing values are left untouched
df[boolean_cols] = df[boolean_cols].mask(df[boolean_cols] > 0, 1)
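
To make the effect of the binarization rule concrete, the sketch below applies it (values above 0 become 1, missing values stay missing) to a tiny hand-made frame. The frame `toy` and its values are illustrative only, and `DataFrame.mask` is one vectorized way to express the rule.

```python
import pandas as pd

# Toy frame with '?' placeholders, mimicking how the raw CSV marks missing values.
toy = pd.DataFrame({"Smokes": ["0.0", "1.0", "?"], "IUD": ["3.0", "?", "0.0"]})

# Coerce to numeric; anything unparseable (here '?') becomes NaN.
toy = toy.apply(pd.to_numeric, errors="coerce")

# Values greater than 0 become 1; zeros and NaN are left unchanged
# because the comparison NaN > 0 evaluates to False.
toy = toy.mask(toy > 0, 1)

print(toy)
```

The `3.0` entry collapses to `1`, while the two `?` entries remain missing and can still be counted by `isnull()` afterwards.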


##### Check for the sum of missing values

In [7]:
print(df.isnull().sum())

Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

##### Dropping Columns with too many missing values

In [8]:
# Set the threshold for dropping columns
threshold = 0.5  # keep columns with at least 50% non-null values

# Calculate the minimum number of non-null values required for each column to be retained
# (cast to int because dropna expects an integer thresh)
min_non_null_values = int(len(df) * threshold)

# Drop columns with too many missing values
df = df.dropna(axis=1, thresh=min_non_null_values)

# Print the remaining missing values count after dropping columns
print(df.isnull().sum())

df.shape


Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                  

(858, 34)
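
The `thresh` argument of `dropna` counts non-null values, so with a threshold of 0.5 a column is kept only if at least half of its entries are present. The following sketch illustrates this on a hypothetical four-row frame; the column names are made up for the example.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],        # 3 of 4 non-null: kept
    "half_present": [1.0, np.nan, np.nan, 4.0],       # 2 of 4 non-null: kept (meets the threshold)
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],  # 1 of 4 non-null: dropped
})

threshold = 0.5
min_non_null = int(len(demo) * threshold)  # 2 non-null values required

# axis=1 drops columns; rows are left as they are.
kept = demo.dropna(axis=1, thresh=min_non_null)
print(list(kept.columns))
```

Note that a column sitting exactly at the threshold is retained, because `thresh` means "at least this many non-null values".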

##### Dropping Rows with missing values

In [9]:
# Drop rows with missing values
df = df.dropna()

# Print the remaining missing values count after dropping rows
print(df.isnull().sum())

# Print the shape of the cleaned dataset
print(df.shape)

# Export the cleaned data
df.to_csv('risk_factors_clean.csv', index=False)

Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


#### Feature Selection
Perform feature selection to select the relevant features.
______________________________________________________________________________________
Description:
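
One possible approach to this step, shown here only as a sketch, is univariate selection with `SelectKBest`. The example uses synthetic data from `make_classification` and the ANOVA F-score (`f_classif`); the choice of scoring function, `k=3`, and the names `X_fs`, `y_fs` and `selector` are assumptions, not something this notebook prescribes. In a real run the selector would be fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 200 samples, 10 features, 3 of them informative (assumption).
X_fs, y_fs = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=10
)

# Keep the 3 features with the highest ANOVA F-scores against the target.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_fs, y_fs)

# Indices of the retained columns and the shape of the reduced matrix.
selected_idx = selector.get_support(indices=True)
print(selected_idx)
print(X_selected.shape)
```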

#### Data modeling
Build the machine learning models. You must build at least two (2) predictive models. One of the predictive models must be either Decision Tree or Support Vector Machine.
______________________________________________________________________________________
Description:
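
As a rough illustration of the requirement above, the sketch below trains a Decision Tree and a Support Vector Machine on synthetic data. It assumes a single binary target (for example one of the four diagnostic columns); the hyperparameters and the names `X_m`, `y_m`, `models` and `fitted` are placeholders rather than tuned or recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the cleaned dataset (assumption).
X_m, y_m = make_classification(n_samples=300, n_features=8, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.3, random_state=10)

# Two candidate models; the settings are illustrative defaults.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=10),
    "svm_rbf": SVC(kernel="rbf", C=1.0, random_state=10),
}

# Fit each model on the training split and report accuracy on the held-out split.
fitted = {}
for name, model in models.items():
    fitted[name] = model.fit(X_tr, y_tr)
    print(name, "test accuracy:", fitted[name].score(X_te, y_te))
```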

#### Evaluate the models
Perform a comparison between the predictive models. <br>
Report the accuracy, recall, precision and F1-score measures as well as the confusion matrix if it is a classification problem. <br>
Report the R2 score, mean squared error and mean absolute error if it is a regression problem.
______________________________________________________________________________________
Description:
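
The classification measures listed above can be computed with `sklearn.metrics`. The sketch below uses ten hand-written labels instead of real model output, so the values only demonstrate the calls; `y_true` and `y_pred` are hypothetical.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made ground truth and predictions for a binary problem (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
print(cm)
```

Running the same calls once per model on the same test split gives a like-for-like comparison between the predictive models.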