# **1. Library Import**

In [1]:
!pip install -q condacolab
import condacolab
condacolab.install()
!conda --version

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:25
🔁 Restarting kernel...
conda 22.9.0


In [1]:
!conda create --name stroke-detection python==3.9.1

Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json):

In [2]:
!conda activate stroke-detection


CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.
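
Each `!` command in Colab runs in its own subshell, so an activation would not carry over to later cells even with an initialized shell. One possible workaround is `conda run`, which executes a single command inside a named environment. The sketch below only builds such a command. The helper name `conda_run_command` is hypothetical, and the sketch assumes the `stroke-detection` environment was created successfully.

```python
from typing import List


def conda_run_command(env_name: str, *command: str) -> List[str]:
    """Build a `conda run` invocation for a command inside a named environment."""
    if not command:
        raise ValueError("at least one command token is required")
    return ["conda", "run", "-n", env_name, *command]


cmd = conda_run_command("stroke-detection", "python", "--version")
print(" ".join(cmd))

# To execute it from Python (requires conda on PATH and an existing environment):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

In a notebook cell, the same command can be issued directly as `!conda run -n stroke-detection python --version`.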




In [3]:
!pip install jupyter scikit-learn tensorflow tfx==1.11.0 flask joblib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting jupyter
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.7 MB)
Collecting tensorflow
  Downloading tensorflow-2.11.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (588.3 MB)
Collecting tfx==1.11.0
  Downloading tfx-1.11.0-py3-none-any.whl (2.7 MB)
Collecting flask
  Downloading Flask-2.2.2-py3-none-any.whl (101 kB)

In [4]:
!unzip Stroke-Disease-Detection.zip

Archive:  Stroke-Disease-Detection.zip
   creating: modules/
  inflating: modules/components.py   
  inflating: modules/trainer.py      
  inflating: modules/transform.py    
  inflating: modules/tuner.py        
   creating: monitoring/
  inflating: monitoring/Dockerfile   
  inflating: monitoring/prometheus.yml  
 extracting: Dockerfile              
   creating: config/
  inflating: config/prometheus.config  
   creating: images/


In [5]:
import os
import pandas as pd
from typing import Text

from absl import logging
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner
from modules import components

# **2. Data Loading**

## 2.1 Environment and Kaggle Credential

Set the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables in the [Colab](https://colab.research.google.com) runtime so that the Kaggle CLI can authenticate with the [Kaggle](https://kaggle.com) platform using a [Kaggle's Beta API](https://www.kaggle.com/docs/api) token.

In [6]:
os.environ['KAGGLE_USERNAME'] = 'andrewbjamesie'
os.environ['KAGGLE_KEY']      = 'b75c236a492526b84d0dc7517c37d48b'
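
An API key written directly into a notebook is visible to anyone who can read the file. A minimal sketch of one alternative follows: the values are supplied at runtime, validated, and then written to the environment. The helper `set_kaggle_credentials` is hypothetical and not part of the Kaggle client, and the values in the demonstration are placeholders.

```python
import os
from typing import MutableMapping


def set_kaggle_credentials(
    username: str,
    key: str,
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Store Kaggle credentials in the given environment mapping."""
    username, key = username.strip(), key.strip()
    if not username or not key:
        raise ValueError("both username and key must be non-empty")
    env["KAGGLE_USERNAME"] = username
    env["KAGGLE_KEY"] = key


# Demonstration with placeholder values and a plain dict instead of os.environ.
demo_env = {}
set_kaggle_credentials("example-user", "example-key", env=demo_env)
print(sorted(demo_env))
```

In an interactive session the arguments could come from `input()` and `getpass.getpass()`, so the key never appears in the saved notebook.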

## 2.2 Dataset Download

Download the dataset file `healthcare-dataset-stroke-data.csv` from Kaggle. This project uses the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset), which is distributed as a `.csv` ([comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values)) file.

In [7]:
!kaggle datasets download -d fedesoriano/stroke-prediction-dataset -f healthcare-dataset-stroke-data.csv

Downloading healthcare-dataset-stroke-data.csv to /content
  0% 0.00/310k [00:00<?, ?B/s]
100% 310k/310k [00:00<00:00, 78.7MB/s]


## 2.3 Dataset Preparation

In [8]:
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df = df.drop('id', axis=1)

In [9]:
df.isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [10]:
df = df.dropna()
df.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4909 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4909 non-null   object 
 1   age                4909 non-null   float64
 2   hypertension       4909 non-null   int64  
 3   heart_disease      4909 non-null   int64  
 4   ever_married       4909 non-null   object 
 5   work_type          4909 non-null   object 
 6   Residence_type     4909 non-null   object 
 7   avg_glucose_level  4909 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     4909 non-null   object 
 10  stroke             4909 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 460.2+ KB


In [12]:
df['age'] = df['age'].astype(int)

In [13]:
DATA_PATH = 'data'

if not os.path.exists(DATA_PATH):
    os.makedirs(DATA_PATH)

df.to_csv(os.path.join(DATA_PATH, 'healthcare-dataset-stroke-data.csv'), index=False)
df

| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 2 | Male | 80 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | Female | 49 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | Female | 79 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| 5 | Male | 81 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5104 | Female | 13 | 0 | 0 | No | children | Rural | 103.08 | 18.6 | Unknown | 0 |
| 5106 | Female | 81 | 0 | 0 | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | 0 |
| 5107 | Female | 35 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| 5108 | Male | 51 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |


# **3. Set Pipeline Variable**

In [14]:
PIPELINE_NAME = 'stroke-disease-pipeline'

# Pipeline inputs
DATA_ROOT = 'data'
TRANSFORM_MODULE_FILE = 'modules/transform.py'
TUNER_MODULE_FILE = 'modules/tuner.py'
TRAINER_MODULE_FILE = 'modules/trainer.py'

# Pipeline outputs
OUTPUT_BASE = 'outputs'

serving_model_dir = os.path.join(OUTPUT_BASE, 'serving_model')
pipeline_root = os.path.join(OUTPUT_BASE, PIPELINE_NAME)
metadata_path = os.path.join(pipeline_root, 'metadata.sqlite')
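
The three derived paths determine where every pipeline artifact lands. The sketch below reproduces the joins with `posixpath` so the result is the same on any platform. The `layout` dictionary is only an illustration device and is not used by the pipeline itself.

```python
import posixpath

pipeline_name = "stroke-disease-pipeline"
output_base = "outputs"

# Mirror the joins performed in the cell above.
layout = {
    "serving_model_dir": posixpath.join(output_base, "serving_model"),
    "pipeline_root": posixpath.join(output_base, pipeline_name),
}
layout["metadata_path"] = posixpath.join(layout["pipeline_root"], "metadata.sqlite")

for name, path in layout.items():
    print(f"{name}: {path}")
# serving_model_dir: outputs/serving_model
# pipeline_root: outputs/stroke-disease-pipeline
# metadata_path: outputs/stroke-disease-pipeline/metadata.sqlite
```

Component outputs such as `outputs/stroke-disease-pipeline/Trainer/model/8` in the run logs sit under `pipeline_root`.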

# **4. Pipeline Initialization**

In [15]:
def init_local_pipeline(
    components, pipeline_root: Text
) -> pipeline.Pipeline:
    """Init local pipeline

    Args:
        components (dict): tfx components
        pipeline_root (Text): path to pipeline directory

    Returns:
        pipeline.Pipeline: apache beam pipeline orchestration
    """
    logging.info(f"Pipeline root set to: {pipeline_root}")
    beam_args = [
        '--direct_running_mode=multi_processing',
        # 0 auto-detects the worker count based on the number of CPUs
        # available at execution time.
        '--direct_num_workers=0'
    ]

    return pipeline.Pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=pipeline_root,
        components=components,
        enable_cache=True,
        metadata_connection_config=metadata.sqlite_metadata_connection_config(
            metadata_path
        ),
        beam_pipeline_args=beam_args
    )
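
Python concatenates adjacent string literals, so a missing comma in a list of flags silently fuses two Beam arguments into one. A small sanity check before the list reaches the pipeline can catch this. The helper `validate_beam_args` below is a hypothetical sketch, and its heuristic would also reject a legitimate flag whose value contains `--`.

```python
from typing import List


def validate_beam_args(args: List[str]) -> List[str]:
    """Return args unchanged if each one looks like a single `--flag`."""
    for arg in args:
        if not arg.startswith("--") or arg.startswith("---"):
            raise ValueError(f"flag must start with exactly two dashes: {arg!r}")
        if "--" in arg[2:]:
            raise ValueError(f"two flags appear to be fused together: {arg!r}")
    return args


good = validate_beam_args([
    "--direct_running_mode=multi_processing",
    "--direct_num_workers=0",
])
print(good)

# A missing comma fuses adjacent string literals into one argument.
try:
    validate_beam_args(["--direct_running_mode=multi_processing" "--direct_num_workers=0"])
except ValueError as err:
    print(err)
```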

In [16]:
logging.set_verbosity(logging.INFO)

# Use distinct names so the imported `components` and `pipeline` modules
# are not shadowed, which would break the cell when it is re-run.
pipeline_components = components.init_components({
    'data_dir': DATA_ROOT,
    'transform_module': TRANSFORM_MODULE_FILE,
    'tuner_module': TUNER_MODULE_FILE,
    'training_module': TRAINER_MODULE_FILE,
    'training_steps': 5000,
    'eval_steps': 1000,
    'serving_model_dir': serving_model_dir
})

local_pipeline = init_local_pipeline(pipeline_components, pipeline_root)
BeamDagRunner().run(pipeline=local_pipeline)

Trial 30 Complete [00h 04m 23s]
val_loss: 0.30945828557014465

Best val_loss So Far: 0.15500961244106293
Total elapsed time: 01h 00m 49s
INFO:tensorflow:Oracle triggered exit
INFO:absl:Finished tuning... Tuner ID: tuner0
INFO:absl:Best HyperParameters: {'space': [{'class_name': 'Choice', 'config': {'name': 'num_layers', 'default': 1, 'conditions': [], 'values': [1, 2, 3], 'ordered': True}}, {'class_name': 'Int', 'config': {'name': 'dense_units', 'default': None, 'conditions': [], 'min_value': 16, 'max_value': 256, 'step': 16, 'sampling': None}}, {'class_name': 'Float', 'config': {'name': 'dropout_rate', 'default': 0.1, 'conditions': [], 'min_value': 0.1, 'max_value': 0.7, 'step': 0.1, 'sampling': None}}, {'class_name': 'Choice', 'config': {'name': 'learning_rate', 'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}}], 'values': {'num_layers': 1, 'dense_units': 48, 'dropout_rate': 0.6, 'learning_rate': 0.0001, 'tuner/epochs': 4, 'tuner/initial_epoch': 2, 'tuner/bracket': 2, 'tuner/round': 1, 'tuner/trial_id': '0003'}}
INFO:absl:Best Hyperparameters are written to outputs/stroke-disease-pi

Results summary
Results in outputs/stroke-disease-pipeline/Tuner/.system/executor_execution/7/.temp/7/stroke_disaster_kt
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x7ff256dcdc40>
Trial summary
Hyperparameters:
num_layers: 1
dense_units: 48
dropout_rate: 0.6
learning_rate: 0.0001
tuner/epochs: 4
tuner/initial_epoch: 2
tuner/bracket: 2
tuner/round: 1
tuner/trial_id: 0003
Score: 0.15500961244106293
Trial summary
Hyperparameters:
num_layers: 2
dense_units: 48
dropout_rate: 0.2
learning_rate: 0.0001
tuner/epochs: 2
tuner/initial_epoch: 0
tuner/bracket: 2
tuner/round: 0
Score: 0.15590107440948486
Trial summary
Hyperparameters:
num_layers: 1
dense_units: 48
dropout_rate: 0.6
learning_rate: 0.0001
tuner/epochs: 10
tuner/initial_epoch: 4
tuner/bracket: 2
tuner/round: 2
tuner/trial_id: 0013
Score: 0.1566063016653061
Trial summary
Hyperparameters:
num_layers: 2
dense_units: 48
dropout_rate: 0.2
learning_rate: 0.0001
tuner/epochs: 10
tuner/initial_epoch: 4
tuner/brac

INFO:absl:node Trainer is running.
INFO:absl:Running launcher for node_info {
  type {
    name: "tfx.components.trainer.component.Trainer"
    base_type: TRAIN
  }
  id: "Trainer"
}
contexts {
  contexts {
    type {
      name: "pipeline"
    }
    name {
      field_value {
        string_value: "stroke-disease-pipeline"
      }
    }
  }
  contexts {
    type {
      name: "pipeline_run"
    }
    name {
      field_value {
        string_value: "20230121-083653.609845"
      }
    }
  }
  contexts {
    type {
      name: "node"
    }
    name {
      field_value {
        string_value: "stroke-disease-pipeline.Trainer"
      }
    }
  }
}
inputs {
  inputs {
    key: "examples"
    value {
      channels {
        producer_node_query {
          id: "Transform"
        }
        context_queries {
          type {
            name: "pipeline"
          }
          name {
            field_value {
              string_value: "stroke-disease-pipeline"
            }
          }
     

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 gender_xf (InputLayer)         [(None, 3)]          0           []                               
                                                                                                  
 ever_married_xf (InputLayer)   [(None, 3)]          0           []                               
                                                                                                  
 work_type_xf (InputLayer)      [(None, 6)]          0           []                               
                                                                                                  
 Residence_type_xf (InputLayer)  [(None, 3)]         0           []                               
                                                                                            



INFO:tensorflow:Assets written to: outputs/stroke-disease-pipeline/Trainer/model/8/Format-Serving/assets


Epoch 2/2
Epoch 2: val_binary_accuracy did not improve from 0.95195
INFO:tensorflow:struct2tensor is not available.
INFO:tensorflow:tensorflow_decision_forests is not available.
INFO:tensorflow:tensorflow_text is not available.
INFO:tensorflow:Assets written to: outputs/stroke-disease-pipeline/Trainer/model/8/Format-Serving/assets
INFO:absl:Training complete. Model written to outputs/stroke-disease-pipeline/Trainer/model/8/Format-Serving. ModelRun written to outputs/stroke-disease-pipeline/Trainer/model_run/8
INFO:absl:Cleaning up stateless execution info.
INFO:absl:Execution 8 succeeded.
INFO:absl:Cleaning up stateful execution info.
INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'model': [Artifact(artifact: uri: "outputs/stroke-disease-pipeline/Trainer/model/8"
, artifact_type: name: "Model"
base_type: MODEL
)], 'model_run': [Artifact(artifact: uri: "outputs/stroke-disease-pipeline/Trainer/model_run/8"
, artifact_type: name: "ModelRun"
)]}) for execution 8
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:node Trainer is finished.
INFO:absl:node Evaluator is running.
INFO:absl:Running launcher for node_info {
  type {
    name: "tfx.components.evaluator.component.Evaluator



INFO:absl:The 'example_splits' parameter is not set, using 'eval' split.
INFO:absl:Evaluating model.
INFO:absl:udf_utils.get_fn {'fairness_indicator_thresholds': 'null', 'example_splits': 'null', 'eval_config': '{\n  "metrics_specs": [\n    {\n      "metrics": [\n        {\n          "class_name": "AUC"\n        },\n        {\n          "class_name": "Precision"\n        },\n        {\n          "class_name": "Recall"\n        },\n        {\n          "class_name": "ExampleCount"\n        },\n        {\n          "class_name": "TruePositives"\n        },\n        {\n          "class_name": "FalsePositives"\n        },\n        {\n          "class_name": "TrueNegatives"\n        },\n        {\n          "class_name": "FalseNegatives"\n        },\n        {\n          "class_name": "BinaryAccuracy",\n          "threshold": {\n            "change_threshold": {\n              "absolute": 0.0001,\n              "direction": "HIGHER_IS_BETTER"\n            },\n            "value_threshold": 

INFO:absl:Evaluation complete. Results written to outputs/stroke-disease-pipeline/Evaluator/evaluation/9.
INFO:absl:Checking validation results.


Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
INFO:absl:Blessing result True written to outputs/stroke-disease-pipeline/Evaluator/blessing/9.
INFO:absl:Cleaning up stateless execution info.
INFO:absl:Execution 9 succeeded.
INFO:absl:Cleaning up stateful execution info.
INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'blessing': [Artifact(artifact: uri: "outputs/stroke-disease-pipeline/Evaluator/blessing/9"
, artifact_type: name: "ModelBlessing"
)], 'evaluation': [Artifact(artifact: uri: "outputs/stroke-disease-pipeline/Evaluator/evaluation/9"
, artifact_type: name: "ModelEvaluation"
)]}) for execution 9
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:node Evaluator is finished.
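
The logged `eval_config` attaches a `change_threshold` (absolute `0.0001`, `HIGHER_IS_BETTER`) and a `value_threshold` to `BinaryAccuracy`. The sketch below is a simplified illustration of how such thresholds can gate a model; it is not TFMA's implementation. The function `is_blessed`, the `lower_bound` of `0.5`, and the baseline of `0.9519` are assumptions, since the `value_threshold` in the log is truncated. The candidate value `0.95195` is the best `val_binary_accuracy` from the training log, used here only as a stand-in.

```python
from typing import Optional


def is_blessed(
    candidate: float,
    baseline: Optional[float],
    lower_bound: float,
    min_improvement: float,
) -> bool:
    """Simplified gate: value threshold plus a higher-is-better change threshold."""
    if candidate < lower_bound:
        return False
    if baseline is None:
        # No earlier model to compare against: in this sketch only the
        # value threshold applies.
        return True
    return (candidate - baseline) >= min_improvement


# lower_bound=0.5 and baseline=0.9519 are illustrative assumptions.
first_run = is_blessed(0.95195, None, lower_bound=0.5, min_improvement=0.0001)
later_run = is_blessed(0.95195, 0.9519, lower_bound=0.5, min_improvement=0.0001)
print(first_run, later_run)
```

Under these assumed numbers the first call passes because there is no baseline, while the second fails because the improvement of about `0.00005` is below the required `0.0001`.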
INFO:absl:node Pusher is running.
INFO:absl:Running launcher for node_info {
  type {
    name: "tfx.components.pusher.component.Pusher"
    base_type: DEPLOY
  }
  id: "Pusher"
}
contexts {
  contexts {
    type {
      name: "pipeline"


In [17]:
!zip -r data.zip data/
!zip -r images.zip images/
!zip -r outputs.zip outputs/
!pip freeze > requirements.txt

  adding: data/ (stored 0%)
  adding: data/healthcare-dataset-stroke-data.csv (deflated 83%)
  adding: images/ (stored 0%)
  adding: images/model_plot.png (deflated 20%)
  adding: outputs/ (stored 0%)
  adding: outputs/serving_model/ (stored 0%)
  adding: outputs/serving_model/1674294000/ (stored 0%)
  adding: outputs/serving_model/1674294000/saved_model.pb (deflated 89%)
  adding: outputs/serving_model/1674294000/assets/ (stored 0%)
  adding: outputs/serving_model/1674294000/assets/vocab_compute_and_apply_vocabulary_1_vocabulary (stored 0%)
  adding: outputs/serving_model/1674294000/assets/vocab_compute_and_apply_vocabulary_4_vocabulary (deflated 18%)
  adding: outputs/serving_model/1674294000/assets/vocab_compute_and_apply_vocabulary_vocabulary (deflated 6%)
  adding: outputs/serving_model/1674294000/assets/vocab_compute_and_apply_vocabulary_3_vocabulary (stored 0%)
  adding: outputs/serving_model/1674294000/assets/vocab_compute_and_apply_vocabulary_2_vocabulary (stored 0%)
  adding:
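
The archives can also be produced from Python rather than the `zip` command, which avoids depending on the shell tool being installed. The following is a minimal sketch using `shutil.make_archive` from the standard library. The helper name `archive_directory` is hypothetical, and the demonstration runs in a temporary directory rather than on the project folders.

```python
import os
import shutil
import tempfile


def archive_directory(directory: str, workdir: str = ".") -> str:
    """Create `<directory>.zip` inside workdir and return the archive path."""
    return shutil.make_archive(
        base_name=os.path.join(workdir, directory),
        format="zip",
        root_dir=workdir,
        base_dir=directory,
    )


# Self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data"))
    with open(os.path.join(tmp, "data", "example.csv"), "w") as handle:
        handle.write("a,b\n1,2\n")
    archive = archive_directory("data", workdir=tmp)
    print(os.path.basename(archive))
```

In the notebook, the equivalent of the cell above would be calling `archive_directory` for `'data'`, `'images'`, and `'outputs'` with the default `workdir`.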