<a id="0"></a>
<h1>Contents</h1>
<hr>
<div style="line-height: 2">
<div style="line-height: 2.5;"><a href="#0"><h2 style="display: inline">.... 0. Notebook Initialisation</h2></a></div>
<div><a href="#01">........ <h3 style="display: inline">0.1. Package Imports</h3></a></div>
<div><a href="#02">........ <h3 style="display: inline">0.2. Data Loading</h3></a></div>
<br>
<div><a href="#1"><h2 style="display: inline">.... 1. CRISP-DM</h2></a></div>
<div style="line-height:2.5;"><a href="#11">........ <h3 style="display: inline">1.1. Business Understanding</h3></a></div>
<!--  -->
<div style="line-height:2.5;"><a href="#12">........ <h3 style="display: inline">1.2. Data Understanding</h3></a></div>
<div><a href="#121">............ <h4 style="display: inline">1.2.1. Data Dictionary</h4></a></div>
<div><a href="#122">............ <h4 style="display: inline">1.2.2. Data Correctness</h4></a></div>
<div><a href="#1221">................ <h5 style="display: inline">1.2.2.1. Checking for Missing Data</h5></a></div>
<div><a href="#1222">................ <h5 style="display: inline">1.2.2.2. Checking for Duplicated Data</h5></a></div>
<div><a href="#1223">................ <h5 style="display: inline">1.2.2.3. Checking for Corrupt Data</h5></a></div>
<div><a href="#123">............ <h4 style="display: inline">1.2.3. Data Distribution</h4></a></div>
<div><a href="#124">............ <h4 style="display: inline">1.2.4. Feature Inspection</h4></a></div>
<div><a href="#125">............ <h4 style="display: inline">1.2.5. Evaluation of Understanding</h4></a></div>
<div><a href="#126">............ <h4 style="display: inline">1.2.6. Actions</h4></a></div>
<!--  -->
<div style="line-height:2.5;"><a href="#13">........ <h3 style="display: inline">1.3. Data Preparation</h3></a></div>
<div><a href="#131">............ <h4 style="display: inline">1.3.1. Cleaning</h4></a></div>
<div><a href="#132">............ <h4 style="display: inline">1.3.2. Transformation</h4></a></div>
<div><a href="#133">............ <h4 style="display: inline">1.3.3. Stratification (TTS)</h4></a></div>
<!--  -->
<div style="line-height:2.5;"><a href="#14">........ <h3 style="display: inline">1.4. Modeling</h3></a></div>
<div><a href="#141">............ <h4 style="display: inline">1.4.1. Baseline Models (all features)</h4></a></div>
<div><a href="#142">............ <h4 style="display: inline">1.4.2. Baseline Models (selected features)</h4></a></div>
<div><a href="#143">............ <h4 style="display: inline">1.4.3. Model Selection</h4></a></div>
<div><a href="#144">............ <h4 style="display: inline">1.4.4. Hyperparameter Tuning</h4></a></div>
<!--  -->
<div style="line-height:2.5;"><a href="#15">........ <h3 style="display: inline">1.5. Evaluation</h3></a></div>
<!--  -->
<div style="line-height:2.5;"><a href="#16">........ <h3 style="display: inline">1.6. (Theoretical) Deployment</h3></a></div>
</div>

<!--  Data Understanding
Data dictionary
	Data assumptions
		Data conformity
			Cols exist
			Number of categories
			Category correctness
			Data types
		
		Missing data

		Duplicated data
			Duplicated records
			Duplicated attributes

		Data distribution/balance/outliers		

		Data correlations and feature inspection
			Univariate
			Covariate (with label)
			Multivariate (with label)
			Apriori

		Inspect and understand concerns

		Actions
		
Data prep
	Clean according to understanding
	Normalise and Encode
	TTS
		Compare with raw data


Modeling
	Baseline model all features & all models
		kfold
		multi-linear?
		dtree
		dnn	
	Baseline model selected features
		kfold
		dtree
		dnn
	Select best model
		tune hyperparams

Data prep2 
	impute missing data
	better or worse performance ? -->

<a id="0"></a>
<h2>0. Notebook Initialisation</h2>

<a id="01"></a>
<h3>0.1. Package Imports</h3>

In [10]:
## Import all libraries for use in notebook.
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import pandas as pd
import numpy as np

from sklearn import model_selection, linear_model
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse
from sklearn.preprocessing import MinMaxScaler

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

%matplotlib inline

<a href="#0">[back to top]</a>

<a id="02"></a>
<h3>0.2. Data Loading</h3>

In [11]:
path = "data.csv" ## Relative path to train/test data.
rawData = pd.read_csv(path) ## Will remain untouched for reference (apart from lower-casing the headers below).
rawData.columns = [col.lower() for col in rawData.columns] ## Make headers lowercase to avoid some trivial errors.
df = rawData.copy() ## Working copy.

rawNRows = rawData.shape[0]
rawNCols = rawData.shape[1]
rawColNames = list(rawData.columns) ## Headers were already lower-cased above; kept as a list for later checks.

rawData ## Show

Unnamed: 0,random,id,indication,diabetes,ihd,hypertension,arrhythmia,history,ipsi,contra,label
0,0.602437,218242,A-F,no,no,yes,no,no,78.0,20,NoRisk
1,0.602437,159284,TIA,no,no,no,no,no,70.0,60,NoRisk
2,0.602437,106066,A-F,no,yes,yes,no,no,95.0,40,Risk
3,0.128157,229592,TIA,no,no,yes,no,no,90.0,85,Risk
4,0.676862,245829,CVA,no,no,no,no,no,70.0,20,NoRisk
...,...,...,...,...,...,...,...,...,...,...,...
1515,0.391440,93406,A-F,no,yes,no,no,no,76.0,60,NoRisk
1516,0.253504,121814,A-F,no,no,yes,yes,no,90.0,75,Risk
1517,0.620373,101754,TIA,no,no,yes,no,no,75.0,20,NoRisk
1518,0.639342,263836,A-F,no,yes,no,no,no,70.0,45,NoRisk


<a href="#0">[back to top]</a>

<a id="1"></a>
<h2>1. CRISP-DM</h2>
<img src="crisp-dm.png" style="max-height:400px">
<a href="#0">[back to top]</a>

<a id="11"></a>
<h3>1.1. Business Understanding</h3>

<div style="font-size: 14px">
<p>DOMAIN: Cardio-vascular medicine / healthcare</p>
<p>PROBLEM TYPE: Classification</p>
<p>INPUTS: Tabulated patient data</p>
<p>OUTPUTS:</p>
    <ul>
        <li>Risk</li>
        <li>No Risk</li>
    </ul>
</div>

<a id="12"></a>
<h3>1.2. Data Understanding</h3>

<a id="121"></a>
<h4>1.2.1. Data Dictionary</h4>

<table>
    <tbody>
        <tr>
            <td>
                <p><strong>Attribute</strong></p>
            </td>
            <td>
                <p><strong>Value Type</strong></p>
            </td>
            <td>
                <p><strong>NumberOfValues</strong></p>
            </td>
            <td>
                <p><strong>Values</strong></p>
            </td>
            <td>
                <p><strong>Comment</strong></p>
            </td>
            <td>
                <p><strong>Non-clinical Description</strong></p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Random</p>
            </td>
            <td>
                <p>Real</p>
            </td>
            <td>
                <p>Number of Records</p>
            </td>
            <td>
                <p>Unique</p>
            </td>
            <td>
                <p>Real number to help in randomly sorting the data records</p>
            </td>
            <td>
                <p>Real number to&nbsp;help&nbsp;in randomly sorting the data records: Should be unique values.</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Id</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Max of Number of Records</p>
            </td>
            <td>
                <p>Unique to patient</p>
            </td>
            <td>
                <p>Anonymous patient record identifier: Should be unique values unless patient has multiple sessions</p>
            </td>
            <td>
                <p>Anonymous patient record identifier: Should be unique value per patient. Patient can have multiple sessions</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Indication</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Four</p>
            </td>
            <td>
                <p>{a-f, asx, cva, tia}</p>
            </td>
            <td>
                <p>What type of Cardiovascular event triggered the hospitalisation?</p>
            </td>
            <td>
                <p>What type of Cardiovascular event triggered the hospitalisation?</p><p> a-f :&nbsp;Atrial-Fibrillation</p>
                <p>asx&nbsp;:&nbsp;Asymptomatic Stenosis&nbsp;</p><p>cva&nbsp;: Cerebrovascular Accident (stroke)</p>
                <p>tia&nbsp;:&nbsp;Transient Ischemic Attack ("mini-stroke")</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Diabetes</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Diabetes?</p>
            </td>
            <td>
                <p>Does the patient suffer from Diabetes?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>IHD</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Coronary artery disease (CAD), also known as ischemic heart disease (IHD)?</p>
            </td>
            <td>
                <p>Does the patient suffer from Coronary artery disease (CAD), also known as ischemic heart disease (IHD)?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Hypertension</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Hypertension?</p>
            </td>
            <td>
                <p>Does the patient suffer from Hypertension?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Arrhythmia</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from</p>
                <p>Arrhythmia (i.e. erratic heart beat)?</p>
            </td>
            <td>
                <p>Does the patient suffer from Arrhythmia (i.e. erratic&nbsp;heart beat)?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>History</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Has the patient a history of</p>
                <p>Cardiovascular interventions?</p>
            </td>
            <td>
                <p>Has the patient a history of Cardiovascular interventions?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>IPSI</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Potentially 101</p>
            </td>
            <td>
                <p>[0, 100]</p>
            </td>
            <td>
                <p>Percentage figure for cerebral ischemic lesions defined as ipsilateral</p>
            </td>
            <td>
                <p>Percentage figure for cerebral ischemic lesions defined as ipsilateral</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Contra</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Potentially 101</p>
            </td>
            <td>
                <p>[0, 100]</p>
            </td>
            <td>
                <p>Percentage figure for contralateral cerebral ischemic lesions</p>
            </td>
            <td>
                <p>Percentage figure for contralateral cerebral ischemic lesions</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Label</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{risk, norisk}</p>
            </td>
            <td>
                <p>Is the patient at risk (Mortality)?</p>
            </td>
            <td>
                <p>Is the patient at risk (Mortality)?</p>
            </td>
        </tr>
    </tbody>
</table>

<br>
<b style="color: red;">NOTE:</b>
<p style="font-size: 14px">"Session" is also included in the non-clinical description, but not included in the data dictionary.</p>
<br>
<table>
    <tr>
        <td>
            <p><strong>Attribute</strong></p>
        </td>
        <td>
            <p><strong>Value Type</strong></p>
        </td>
        <td>
            <p><strong>NumberOfValues</strong></p>
        </td>
        <td>
            <p><strong>Values</strong></p>
        </td>
        <td>
            <p><strong>Comment</strong></p>
        </td>
        <td>
            <p><strong>Non-clinical Description</strong></p>
        </td>
    </tr>
    <tr>
        <td>
            <p>Session</p>
        </td>
        <td>
            <p>Unknown</p>
        </td>
        <td>
            <p>Max Number of Records (assumed)</p>
        </td>
        <td>
            <p>Unique to patient</p>
        </td>
        <td>
            <p>Unknown</p>
        </td>
        <td>
            <p>Anonymous patient session identifier.</p>
        </td>
    </tr>
</table>
<br>

<a href="#0">[back to top]</a>

<a id="122"></a>
<h4>1.2.2. Data Correctness</h4>
Check that the data conforms to the data dictionary and explore common pitfalls (e.g. missing or duplicated data).

<a id="1221"></a>
<h5>1.2.2.1. Checking for Missing Data</h5>
Look for records containing NaN or otherwise missing values.
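
A minimal sketch of this check, assuming pandas. The frame `sample` is a small hypothetical stand-in for `df` (its values are illustrative, not taken from the data), and the helper name `missing_report` is an assumption rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; values are illustrative only.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "indication": ["A-F", "TIA", None, "CVA"],
    "ipsi": [78.0, np.nan, 95.0, 90.0],
    "label": ["NoRisk", "NoRisk", "Risk", "Risk"],
})

def missing_report(frame):
    """Return per-column missing counts and the rows holding any missing value."""
    counts = frame.isna().sum()               # number of NaN/None entries per column
    rows = frame[frame.isna().any(axis=1)]    # records with at least one missing value
    return counts, rows

missing_counts, missing_rows = missing_report(sample)
print(missing_counts)
print(missing_rows)
```

Note that `isna()` only catches `NaN` and `None`. Placeholder strings (an empty string, for instance) would have to be converted first, for example through the `na_values` argument of `pd.read_csv`.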

<a href="#0">[back to top]</a>

<a id="1222"></a>
<h5>1.2.2.2. Checking for Duplicated Data</h5>
Look for records that are identical or nearly identical (e.g. differing only in their values for random or id).
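
One possible way to separate the two cases, sketched with pandas on a hypothetical frame (`sample` and its values are illustrative only). Treating `random` and `id` as the columns to ignore follows the example given above; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; values are illustrative only.
sample = pd.DataFrame({
    "random": [0.60, 0.60, 0.13, 0.68],
    "id": [101, 101, 102, 103],
    "indication": ["A-F", "A-F", "TIA", "TIA"],
    "ipsi": [78.0, 78.0, 90.0, 90.0],
    "contra": [20, 20, 85, 85],
    "label": ["NoRisk", "NoRisk", "Risk", "Risk"],
})

# Exact duplicates: every column matches another record.
exact_dupes = sample[sample.duplicated(keep=False)]

# Near duplicates: identical once the bookkeeping columns are ignored.
clinical_cols = [col for col in sample.columns if col not in ("random", "id")]
near_dupes = sample[sample.duplicated(subset=clinical_cols, keep=False)]

print(exact_dupes)
print(near_dupes)
```

`keep=False` marks every member of a duplicate group rather than only the repeats, which makes the groups easier to inspect before deciding what to drop.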

<a href="#0">[back to top]</a>

<a id="1223"></a>
<h5>1.2.2.3. Checking for Corrupt Data</h5>
Create an object that defines the assumptions described in the data dictionary (expected columns, categories, data types and value ranges), then flag any data that violates them.
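
A rough sketch of such an object, assuming a plain dictionary is sufficient. `ASSUMPTIONS` covers only a subset of the data dictionary, `find_concerns` and `sample` are hypothetical names, and comparing categories case-insensitively is an assumption made because the raw values (e.g. `A-F`, `NoRisk`) are capitalised differently from the dictionary.

```python
import pandas as pd

# Subset of the data dictionary, transcribed as machine-checkable rules.
ASSUMPTIONS = {
    "indication": {"allowed": {"a-f", "asx", "cva", "tia"}},
    "diabetes": {"allowed": {"no", "yes"}},
    "ipsi": {"range": (0, 100)},
    "contra": {"range": (0, 100)},
    "label": {"allowed": {"risk", "norisk"}},
}

def find_concerns(frame, assumptions):
    """Return a list of human-readable concerns; an empty list means no rule was violated."""
    concerns = []
    for col, rule in assumptions.items():
        if col not in frame.columns:
            concerns.append(f"{col}: column is missing")
            continue
        values = frame[col].dropna()  # missing values are handled by a separate check
        if "allowed" in rule:
            seen = set(values.astype(str).str.strip().str.lower())
            unexpected = sorted(seen - rule["allowed"])
            if unexpected:
                concerns.append(f"{col}: unexpected categories {unexpected}")
        if "range" in rule:
            low, high = rule["range"]
            numeric = pd.to_numeric(values, errors="coerce")  # non-numeric becomes NaN
            bad = int((numeric.isna() | (numeric < low) | (numeric > high)).sum())
            if bad:
                concerns.append(f"{col}: {bad} value(s) outside [{low}, {high}] or non-numeric")
    return concerns

# Hypothetical records with three deliberate problems.
sample = pd.DataFrame({
    "indication": ["A-F", "TIA", "ASx", "unknown"],
    "diabetes": ["no", "yes", "no", "no"],
    "ipsi": [78.0, 70.0, 95.0, 120.0],
    "contra": ["20", "60", "40", "n/a"],
    "label": ["NoRisk", "Risk", "NoRisk", "Risk"],
})
concerns = find_concerns(sample, ASSUMPTIONS)
for concern in concerns:
    print(concern)
```

Keeping the rules in data rather than in code means the remaining attributes can be added without touching the checking function.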

<a href="#0">[back to top]</a>

<a id="123"></a>
<h4>1.2.3. Data Distribution</h4>

<a href="#0">[back to top]</a>

<a id="124"></a>
<h4>1.2.4. Feature Inspection</h4>

<a href="#0">[back to top]</a>

<a id="125"></a>
<h4>1.2.5. Evaluation of Understanding</h4>

<a href="#0">[back to top]</a>

<a id="126"></a>
<h4>1.2.6. Actions</h4>

<a href="#0">[back to top]</a>

<a id="13"></a>
<h3>1.3. Data Preparation</h3>

<a id="131"></a>
<h4>1.3.1. Cleaning</h4>

<a href="#0">[back to top]</a>

<a id="132"></a>
<h4>1.3.2. Transformation</h4>

<a href="#0">[back to top]</a>

<a id="133"></a>
<h4>1.3.3. Stratification (TTS)</h4>

<a href="#0">[back to top]</a>

<a id="14"></a>
<h3>1.4. Modeling</h3>

<a href="#0">[back to top]</a>

<a id="15"></a>
<h3>1.5. Evaluation</h3>

<a href="#0">[back to top]</a>

<a id="16"></a>
<h3>1.6. (Theoretical) Deployment</h3>

<a href="#0">[back to top]</a>