<a href="https://colab.research.google.com/github/amandafbri/structured-data-classifier/blob/main/risk_cancer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Risk Factors for Cervical Cancer**

## **Setup**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install autokeras

Collecting autokeras
  Downloading autokeras-1.0.16-py3-none-any.whl (166 kB)

In [12]:
import numpy as np
import autokeras as ak
import tensorflow as tf
import pandas as pd

# **DATA**

### **Dataset description**

Attributes as listed in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29):
* (int) Age
* (int) Number of sexual partners
* (int) First sexual intercourse (age)
* (int) Num of pregnancies
* (bool) Smokes
* (int) Smokes (years)
* (float) Smokes (packs/year)
* (bool) Hormonal Contraceptives
* (int) Hormonal Contraceptives (years)
* (bool) IUD
* (int) IUD (years)
* (bool) STDs
* (int) STDs (number)
* (bool) STDs:condylomatosis
* (bool) STDs:cervical condylomatosis
* (bool) STDs:vaginal condylomatosis
* (bool) STDs:vulvo-perineal condylomatosis
* (bool) STDs:syphilis
* (bool) STDs:pelvic inflammatory disease
* (bool) STDs:genital herpes
* (bool) STDs:molluscum contagiosum
* (bool) STDs:AIDS
* (bool) STDs:HIV
* (bool) STDs:Hepatitis B
* (bool) STDs:HPV
* (int) STDs: Number of diagnosis
* (int) STDs: Time since first diagnosis
* (int) STDs: Time since last diagnosis
* (bool) Dx:Cancer
* (bool) Dx:CIN
* (bool) Dx:HPV
* (bool) Dx
* (bool) Hinselmann: target variable
* (bool) Schiller: target variable
* (bool) Cytology: target variable
* (bool) Biopsy: target variable

Features used in the *Interpretable Machine Learning* book (https://christophm.github.io/interpretable-ml-book/cervical.html):
*   Age in years
*   Number of sexual partners
*   First sexual intercourse (age in years)
*   Number of pregnancies
*   Smoking yes or no
*   Smoking (in years)
*   Hormonal contraceptives yes or no
*   Hormonal contraceptives (in years)
*   Intrauterine device yes or no (IUD)
*   Number of years with an intrauterine device (IUD)
*   Has patient ever had a sexually transmitted disease (STD) yes or no
*   Number of STD diagnoses
*   Time since first STD diagnosis
*   Time since last STD diagnosis
*   The biopsy results “Healthy” or “Cancer”. Target outcome.


## **Load and prepare data**

In [4]:
risk_factors = '/content/drive/MyDrive/Estudo/Colab Notebooks/data/cancer/risk_factors_cervical_cancer.csv'

In [29]:
data = pd.read_csv(risk_factors)
data.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,IUD (years),STDs,STDs (number),STDs:condylomatosis,STDs:cervical condylomatosis,STDs:vaginal condylomatosis,STDs:vulvo-perineal condylomatosis,STDs:syphilis,STDs:pelvic inflammatory disease,STDs:genital herpes,STDs:molluscum contagiosum,STDs:AIDS,STDs:HIV,STDs:Hepatitis B,STDs:HPV,STDs: Number of diagnosis,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
2,34,1.0,?,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,?,?,0,0,0,0,0,0,0,0
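
The `?` cells in the preview are the dataset's missing-value markers. To see how many each column has, the sketch below counts them using only the standard library. `SAMPLE` is a hand-copied excerpt of three preview rows and four columns, and `count_missing` is a hypothetical helper, not part of the notebook. With pandas, passing `na_values='?'` to `pd.read_csv` would turn these cells into `NaN` at load time.

```python
import csv
import io

# Excerpt of the preview (rows 0, 2 and 3, four columns only).
SAMPLE = """Age,First sexual intercourse,STDs: Time since first diagnosis,Biopsy
18,15.0,?,0
34,?,?,0
52,16.0,?,0
"""


def count_missing(csv_text, marker="?"):
    """Return {column name: number of cells equal to the missing-value marker}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == marker:
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE)
for column, n_missing in missing.items():
    print(f"{column}: {n_missing}")
```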


Let's try AutoKeras without preprocessing or cleaning the data, leaving the `?` entries untouched.

In [36]:
# Split into 80/10/10
train, validation, test = np.split(
        data.sample(frac=1, random_state=0), [int(.8*len(data)), int(.9*len(data))]
    )

# Remember to check if there are examples of all labels in all sets
print(train['Biopsy'].value_counts())
print(validation['Biopsy'].value_counts())
print(test['Biopsy'].value_counts())

0    646
1     40
Name: Biopsy, dtype: int64
0    78
1     8
Name: Biopsy, dtype: int64
0    79
1     7
Name: Biopsy, dtype: int64
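
The counts above can be checked by hand. The sketch below recomputes the split sizes from the two cut points and shows how rare the positive class is. The total of 858 rows is the sum of the printed counts. The "baseline" is the accuracy of a hypothetical model that always predicts 0, a reference point added here for illustration.

```python
n_rows = 858  # 686 + 86 + 86, summed from the value_counts() output

# np.split cuts at these indices: [0, 686), [686, 772), [772, 858).
first_cut = int(0.8 * n_rows)
second_cut = int(0.9 * n_rows)
sizes = (first_cut, second_cut - first_cut, n_rows - second_cut)
print("split sizes:", sizes)

# (negatives, positives) per split, copied from the printed counts.
counts = {"train": (646, 40), "validation": (78, 8), "test": (79, 7)}

rates = {}
for name, (negatives, positives) in counts.items():
    total = negatives + positives
    positive_rate = positives / total
    baseline_accuracy = negatives / total  # always predicting 0
    rates[name] = (positive_rate, baseline_accuracy)
    print(f"{name}: positives={positive_rate:.1%}, baseline accuracy={baseline_accuracy:.1%}")
```

Always predicting 0 would already score roughly 92% accuracy on the test set, so accuracy alone says little about how well the positive cases are detected.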


In [37]:
x_train = train.copy()
y_train = x_train.pop("Biopsy")
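
The earlier reminder to check that every split contains both labels can also be enforced by construction. The sketch below is a minimal stratified 80/10/10 split in plain Python. `stratified_split` is a hypothetical helper, not what the notebook runs, and the label list is rebuilt from the printed counts (803 negatives, 55 positives). scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split row indices into len(fractions) parts, one label at a time."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    parts = [[] for _ in fractions]
    for indices in by_label.values():
        rng.shuffle(indices)
        start = 0
        cumulative = 0.0
        for part_number, fraction in enumerate(fractions):
            cumulative += fraction
            if part_number == len(fractions) - 1:
                stop = len(indices)  # last part takes the remainder
            else:
                stop = round(cumulative * len(indices))
            parts[part_number].extend(indices[start:stop])
            start = stop
    return parts


# Label counts taken from the notebook output: 803 negatives, 55 positives.
labels = [0] * 803 + [1] * 55
train_idx, val_idx, test_idx = stratified_split(labels)

for name, part in (("train", train_idx), ("validation", val_idx), ("test", test_idx)):
    positives = sum(labels[i] for i in part)
    print(f"{name}: {len(part)} rows, {positives} positives")
```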

# **AUTOML**

https://autokeras.com/structured_data_classifier/

## **Create and train the model**

AutoKeras can infer many of the [classifier parameters](https://autokeras.com/structured_data_classifier/) automatically, but passing them explicitly makes it clearer what is going on.

In [52]:
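clf = ak.StructuredDataClassifier(
    metrics=['accuracy', 'AUC'],  # metrics reported during training
    max_trials=3,                 # maximum number of candidate models to try
    objective='val_loss',         # criterion used to pick the best trial
    seed=123,                     # random seed for reproducibility
    overwrite=True                # discard any previous search in the project directory
)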

The `fit` method searches for the best model and hyperparameters.

In [54]:
clf.fit(
    x_train,
    y_train,
    epochs=10
)

Trial 4 Complete [00h 00m 08s]
val_loss: 0.1495509296655655

Best val_loss So Far: 0.1495509296655655
Total elapsed time: 00h 00m 23s
INFO:tensorflow:Oracle triggered exit
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: ./structured_data_classifier/best_model/assets


<tensorflow.python.keras.callbacks.History at 0x7fd6e1beb410>

In [60]:
trained_model = clf.export_model()

## **Loading the trained model**

Let's take a look at the best model found by AutoKeras!

In [66]:
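# custom_objects lets Keras resolve AutoKeras layers such as MultiCategoryEncoding
trained_model_import = tf.keras.models.load_model('/content/structured_data_classifier/best_model/', custom_objects=ak.CUSTOM_OBJECTS)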

In [67]:
trained_model_import.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 35)]              0         
_________________________________________________________________
multi_category_encoding (Mul (None, 35)                0         
_________________________________________________________________
normalization (Normalization (None, 35)                71        
_________________________________________________________________
dense (Dense)                (None, 32)                1152      
_________________________________________________________________
re_lu (ReLU)                 (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
_________________________________________________________________
classification_head_1 (Activ (None, 1)                 0     
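
The parameter counts in the summary can be reproduced by hand, which is a quick way to confirm what each layer does. In the sketch below, `dense_params` is a hypothetical helper. The breakdown of the `Normalization` layer (a mean and a variance per feature plus one sample counter) is an assumption that happens to match the reported 71.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 35  # input width reported by the summary

hidden = dense_params(n_features, 32)  # dense
output = dense_params(32, 1)           # dense_1

# Assumed layout: per-feature mean and variance, plus one scalar count.
normalization = 2 * n_features + 1

print(hidden, output, normalization)  # prints: 1152 33 71
```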

In [68]:
trained_model_import.get_config()

{'input_layers': [['input_1', 0, 0]],
 'layers': [{'class_name': 'InputLayer',
   'config': {'batch_input_shape': (None, 35),
    'dtype': 'string',
    'name': 'input_1',
    'ragged': False,
    'sparse': False},
   'inbound_nodes': [],
   'name': 'input_1'},
  {'class_name': 'Custom>MultiCategoryEncoding',
   'config': {'dtype': 'float32',
    'encoding': ListWrapper(['none', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int']),
    'name': 'multi_category_encoding',
    'trainable': True},
   'inbound_nodes': [[['input_1', 0, 0, {}]]],
   'name': 'multi_category_encoding'},
  {'class_name': 'Normalization',
   'config': {'axis': (-1,),
    'dtype': 'float32',
    'name': 'normalization',
    'trainable': True},
   'inbound_nodes': [[['multi_category_encoding', 0, 0, {}]]],
   'name': 'normalization'

## **Predicting**

In [None]:
y_test = test.pop('Biopsy')

In [96]:
# The input layer has dtype 'string'; np.unicode was an alias of str and is gone from recent NumPy
classes = trained_model_import.predict(test.astype(str))

In [97]:
classes.round()

array([[0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
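
Rounding the outputs amounts to thresholding the predicted probabilities at 0.5. The held-out labels in `y_test` can then be compared with the rounded predictions. The sketch below shows one way to do that in plain Python with a confusion matrix, precision and recall. The labels and probabilities are made-up values for illustration, not the notebook's predictions, and both helper functions are hypothetical.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding the probabilities."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example values, NOT the model's actual outputs.
y_true = [0, 0, 1, 0, 1, 0, 1, 0]
y_prob = [0.1, 0.7, 0.9, 0.2, 0.4, 0.05, 0.8, 0.3]

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
precision, recall = precision_recall(tp, fp, fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With only a handful of positives in the test set, recall on the positive class is usually more informative than overall accuracy.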