# CRISP-DM
The CRISP-DM methodology is followed here, as closely as is possible within the context of this project.

<img src="crisp-dm.png" style="max-height:300px">

Most time is spent in the 'Data Understanding' phase, both to make up for the fact that there is no client communication beyond the given information and to allow for better-informed decisions in the 'Data Preparation' and 'Modelling' stages.

# Business Understanding
Beyond the given task definition and data dictionary, there will be no additional client/business communication. Therefore, some assumptions must be made based on *personal* experience, domain knowledge, and research.

 <hr>

**Below is a brief breakdown** of the problem definition and some domain considerations:

DOMAIN: Cardiovascular medicine / healthcare

- As a healthcare dataset it may be "natural", anonymised patient data, study data (e.g. clinical trial), or an aggregation of many different datasets.
- There is a chance there is "control" data (healthy cohorts) within the dataset or, similarly, focus groups that consist of unhealthy cohorts.
- Due to the (often) subjective nature of clinical diagnosis (i.e. different doctors with varying levels of experience make the diagnoses), some data may be mislabelled.
- Some diagnoses or features may be self-certified or be derived from incorrect patient interpretations (e.g. "Yes, I have been feeling...").
- Some features might represent the same thing (e.g. an alternative clinical test - both may be conducted or one might replace the other).

PROBLEM TYPE: Classification

INPUTS: Tabulated patient data; (up-to) 1520 records of 11 features

OUTPUTS:
- Risk
- No Risk

<hr>

**More objectively**, domain-specific terminology from the provided data dictionary can be researched further:

- Atrial Fibrillation
    - A form of **arrhythmia** (Atrial Fibrillation and other Arrhythmias, 2019).
    - Increases risk of stroke (https://www.nhs.uk/conditions/arrhythmia/).
    
    
- Asymptomatic Stenosis
    - Narrowing of the carotid artery without recent history of TIA or ischemic stroke (https://www.uptodate.com/contents/management-of-asymptomatic-carotid-atherosclerotic-disease).


- Cardiovascular Arrest
    - When the heart stops pumping blood - NOT a heart attack (https://www.bhf.org.uk/informationsupport/conditions/cardiac-arrest).
    - Can be caused by arrhythmias (https://www.heart.org/en/health-topics/cardiac-arrest/about-cardiac-arrest).
    
    
- Transient Ischemic Attack (described as a "mini-heart attack" in the data dictionary)
    - Risk increased by a-f, asx, diabetes and hypertension (https://www.nhs.uk/conditions/transient-ischaemic-attack-tia/; https://www.cardiosmart.org/Healthwise/hw22/6606/hw226606).
    - Actually a **mini-stroke**, not a heart attack.


- Diabetes
    - Type 2 makes up roughly 90% of cases, but the feature could refer to type 1, type 2, or a mix of both.


- IHD/CAD (Ischemic Heart Disease/Coronary Artery Disease)
    - Narrowing or blockage of the coronary arteries (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/coronary-heart-disease).


- Hypertension
    - i.e. high blood pressure.


- Arrhythmia (erratic heart beat)
    - Main types include **a-f**, tachycardia, bradycardia, heart block and ventricular fibrillation (may cause cardiac arrest).


- IPSI (ipsilateral cerebral ischemic lesions)
    - Ipsilateral means "same side". Based on the context, the side of comparison is likely the side of the brain on which the stroke occurred.


- Contra (contralateral cerebral ischemic lesions)
    - Contralateral means "opposite side". Based on the context, the side of comparison is likely the side of the brain on which the stroke occurred.


- (History) Cardiovascular Interventions
    - Typically, cardiac invasive treatments e.g. catheterisation. (https://onlinelibrary.wiley.com/doi/book/10.1002/9781444316704)

<hr>

**Based on this research**, there are some assumptions to be made:

- Patients with an indication of "a-f" should also be recorded as having an arrhythmia.


- Assuming IPSI and Contra are recorded at the same time, in relation to the same stroke or event, and since IPSI refers to the percentage of lesions on the same side and Contra to the percentage on the opposite side, the two values should sum to 100%.
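
Both assumptions can be tested directly once the data is loaded. The following is a minimal sketch, assuming pandas and column names that match the data dictionary (`Indication`, `Arrhythmia`, `IPSI`, `Contra`); the toy records are hypothetical and only illustrate the checks.

```python
import pandas as pd

# Hypothetical toy records; column names follow the data dictionary.
df = pd.DataFrame({
    "Indication": ["a-f", "a-f", "tia", "cva"],
    "Arrhythmia": ["yes", "no", "no", "yes"],
    "IPSI": [70, 80, 95, 50],
    "Contra": [30, 20, 10, 50],
})

# Assumption 1: an "a-f" indication implies Arrhythmia == "yes".
af_without_arrhythmia = df[(df["Indication"] == "a-f") & (df["Arrhythmia"] != "yes")]

# Assumption 2: IPSI and Contra are complementary percentages.
not_complementary = df[df["IPSI"] + df["Contra"] != 100]

print(len(af_without_arrhythmia), "a-f records without arrhythmia")
print(len(not_complementary), "records where IPSI + Contra != 100")
```

Records returned by either filter contradict the assumption and are worth inspecting before any correction is applied.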

<hr>

*References*

# Data Understanding
This section focuses on an in-depth understanding of the given data, its correctness and any patterns.
<hr>

## Data Dictionary
The data dictionary with all expected features and their format is included in the table below.

<table>
    <tbody>
        <tr>
            <td>
                <p><strong>Attribute</strong></p>
            </td>
            <td>
                <p><strong>Value Type</strong></p>
            </td>
            <td>
                <p><strong>NumberOfValues</strong></p>
            </td>
            <td>
                <p><strong>Values</strong></p>
            </td>
            <td>
                <p><strong>Comment</strong></p>
            </td>
            <td>
                <p><strong>Non-clinical Description</strong></p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Random</p>
            </td>
            <td>
                <p>Real</p>
            </td>
            <td>
                <p>Number of Records</p>
            </td>
            <td>
                <p>Unique</p>
            </td>
            <td>
                <p>Real number of help in randomly sorting the data records</p>
            </td>
            <td>
                <p>Real number of&nbsp;help&nbsp;in randomly sorting the data records: Should be unique values.</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Id</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Max of Number of Records</p>
            </td>
            <td>
                <p>Unique to patient</p>
            </td>
            <td>
                <p>Anonymous patient record identifier: Should be unique values unless patient has multiple sessions</p>
            </td>
            <td>
                <p>Anonymous patient record identifier: Should be unique value per patient. Patient can have multiple sessions</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Indication</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Four</p>
            </td>
            <td>
                <p>{a-f, asx, cva, tia}</p>
            </td>
            <td>
                <p>What type of Cardiovascular event triggered the hospitalisation?</p>
            </td>
            <td>
                <p>What type of Cardiovascular event triggered the hospitalisation?</p><p> a-f :&nbsp;Atrial-Fibrillation</p>
                <p>asx&nbsp;:&nbsp;Asymptomatic Stenosis&nbsp;</p><p>cva&nbsp;: Cardiovascular Arrest</p>
                <p>tia&nbsp;:&nbsp;Transient Ischemic Attack ("mini-heart attack")</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Diabetes</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Diabetes?</p>
            </td>
            <td>
                <p>Does the patient suffer from Diabetes?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>IHD</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Coronary artery disease (CAD), also known as ischemic heart disease (IHD)?</p>
            </td>
            <td>
                <p>Does the patient suffer from Coronary artery disease (CAD), also known as ischemic heart disease (IHD)?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Hypertension</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Hypertension?</p>
            </td>
            <td>
                <p>Does the patient suffer from Hypertension?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Arrhythmia</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from</p>
                <p>Arrhythmia (i.e. erratic heart beat)?</p>
            </td>
            <td>
                <p>Does the patient suffer from Arrhythmia (i.e. erratic&nbsp;heart beat)?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>History</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Has the patient a history of</p>
                <p>Cardiovascular interventions?</p>
            </td>
            <td>
                <p>Has the patient a history of Cardiovascular interventions?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>IPSI</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Potentially 101</p>
            </td>
            <td>
                <p>[0, 100]</p>
            </td>
            <td>
                <p>Percentage figure for cerebral ischemic lesions defined as ipsilateral</p>
            </td>
            <td>
                <p>Percentage figure for cerebral ischemic lesions defined as ipsilateral</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Contra</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Potentially 101</p>
            </td>
            <td>
                <p>[0, 100]</p>
            </td>
            <td>
                <p>Percentage figure for contralateral cerebral ischemic lesions</p>
            </td>
            <td>
                <p>Percentage figure for contralateral cerebral ischemic lesions</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Label</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{risk, norisk}</p>
            </td>
            <td>
                <p>Is the patient at risk (Mortality)?</p>
            </td>
            <td>
                <p>Is the patient at risk (Mortality)?</p>
            </td>
        </tr>
    </tbody>
</table>

<br>
<b style="color: red;">NOTE:</b> "Session" is also included in the non-clinical description, but not included in the data dictionary.
<br>
<table>
    <tr>
        <td>
            <p><strong>Attribute</strong></p>
        </td>
        <td>
            <p><strong>Value Type</strong></p>
        </td>
        <td>
            <p><strong>NumberOfValues</strong></p>
        </td>
        <td>
            <p><strong>Values</strong></p>
        </td>
        <td>
            <p><strong>Comment</strong></p>
        </td>
        <td>
            <p><strong>Non-clinical Description</strong></p>
        </td>
    </tr>
    <tr>
        <td>
            <p>Session</p>
        </td>
        <td>
            <p>Unknown</p>
        </td>
        <td>
            <p>Max Number of Records (assumed)</p>
        </td>
        <td>
            <p>Unique to patient</p>
        </td>
        <td>
            <p>Unknown</p>
        </td>
        <td>
            <p>Anonymous patient session identifier.</p>
        </td>
    </tr>
</table>
<br>

## Data Correctness
Check that the data conforms to the data dictionary and explore common pitfalls (e.g. missing or duplicate data).

### Conformity to Data Dictionary
corrected
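
One way to check conformity is to count, per column, the values that fall outside what the data dictionary allows. This is a sketch under assumptions: the allowed values are transcribed from the table above, `nonconforming_counts` is a hypothetical helper name, and the sample records are invented to show the behaviour.

```python
import pandas as pd

# Allowed values, transcribed from the data dictionary above.
NOMINAL_VALUES = {
    "Indication": {"a-f", "asx", "cva", "tia"},
    "Diabetes": {"no", "yes"},
    "IHD": {"no", "yes"},
    "Hypertension": {"no", "yes"},
    "Arrhythmia": {"no", "yes"},
    "History": {"no", "yes"},
    "Label": {"risk", "norisk"},
}
PERCENT_COLUMNS = ["IPSI", "Contra"]


def nonconforming_counts(df: pd.DataFrame) -> dict:
    """Count, per column, the non-missing values that break the dictionary."""
    counts = {}
    for col, allowed in NOMINAL_VALUES.items():
        if col in df.columns:
            values = df[col].dropna()
            counts[col] = int((~values.isin(allowed)).sum())
    for col in PERCENT_COLUMNS:
        if col in df.columns:
            raw = df[col].dropna()
            # Non-numeric entries (e.g. stray strings) become NaN and are counted.
            numeric = pd.to_numeric(raw, errors="coerce")
            bad = numeric.isna() | (numeric < 0) | (numeric > 100)
            counts[col] = int(bad.sum())
    return counts


# Hypothetical sample with deliberate errors.
sample = pd.DataFrame({
    "Indication": ["a-f", "ASX", "tia"],
    "Diabetes": ["no", "yes", "maybe"],
    "IPSI": [70, 120, 35],
    "Contra": ["30", "abc", 100],
})
print(nonconforming_counts(sample))
```

Missing values are deliberately excluded here so that they can be handled separately from values that are present but invalid.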

### Duplicates
dupes
noDupes
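
A possible way to separate exact duplicate records from repeated patient identifiers is sketched below. It assumes pandas; the frame names `dupes` and `no_dupes` mirror the notes above, and the records are hypothetical. Since the data dictionary allows a patient to have multiple sessions, a repeated `Id` on its own is not treated as an error.

```python
import pandas as pd

# Hypothetical records; "Id" may legitimately repeat across sessions.
df = pd.DataFrame({
    "Id": [1, 1, 2, 3, 3],
    "Indication": ["a-f", "a-f", "tia", "cva", "asx"],
    "IPSI": [70, 70, 95, 50, 60],
})

# Rows that are exact copies of an earlier row.
dupes = df[df.duplicated(keep="first")]

# The same data with exact copies removed.
no_dupes = df.drop_duplicates(keep="first").reset_index(drop=True)

# Ids that still occur more than once after removing exact copies
# (candidates for multiple sessions rather than data errors).
repeated_ids = no_dupes.loc[no_dupes["Id"].duplicated(keep=False), "Id"].unique()

print(len(dupes), len(no_dupes), list(repeated_ids))
```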

### Missing Data
imputed
dropped
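
The two variants noted above (an imputed dataset and a dropped dataset) could be produced as follows. This is only a sketch: imputing nominal columns with the mode and numeric columns with the median is an assumption made for illustration, not a choice stated in this document, and the records are hypothetical.

```python
import pandas as pd

# Hypothetical records with missing entries.
df = pd.DataFrame({
    "Diabetes": ["no", None, "yes", "no"],
    "IPSI": [70.0, 80.0, None, 50.0],
    "Label": ["risk", "norisk", "risk", "norisk"],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Variant 1: drop any record with a missing value.
dropped = df.dropna().reset_index(drop=True)

# Variant 2: impute nominal columns with the mode, numeric ones with the median.
imputed = df.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(missing_per_column.to_dict())
```

Keeping both frames allows the effect of each strategy to be compared later in the modelling stage.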

### Outliers
imputeExpected (correct)
drop
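
For the percentage features, values outside the dictionary range [0, 100] are the most obvious outliers. The sketch below shows a drop variant and a correction variant. The correction relies on the earlier assumption that IPSI and Contra sum to 100%, so it is only valid if that assumption holds in the data; `out_of_range` is a hypothetical helper and the records are invented.

```python
import pandas as pd

# Hypothetical records with two out-of-range percentages.
df = pd.DataFrame({
    "IPSI": [70, 180, 95, 50],
    "Contra": [30, 20, 5, 150],
})


def out_of_range(series: pd.Series) -> pd.Series:
    """True where a percentage falls outside the dictionary range [0, 100]."""
    return (series < 0) | (series > 100)


bad_ipsi = out_of_range(df["IPSI"])
bad_contra = out_of_range(df["Contra"])

# Variant 1: drop records with any out-of-range percentage.
dropped = df[~(bad_ipsi | bad_contra)].reset_index(drop=True)

# Variant 2: replace an out-of-range value with the complement of its
# counterpart (assumes IPSI + Contra == 100 and that the counterpart is valid).
corrected = df.copy()
fix_ipsi = bad_ipsi & ~bad_contra
fix_contra = bad_contra & ~bad_ipsi
corrected.loc[fix_ipsi, "IPSI"] = 100 - df.loc[fix_ipsi, "Contra"]
corrected.loc[fix_contra, "Contra"] = 100 - df.loc[fix_contra, "IPSI"]

print(corrected)
```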

### Other Assumptions
Random, ID, Session
drop or keep? clusterDf noClusterDf

## Distribution
### Univariate
df.hist (low, default, high bins)
    risk distribution (box plot)

### Multivariate
Check .corr and boxplot multiple features

# Data Preparation
phase description

## Cleaning

## Transformation
binarise
one-hot encoding / dummies
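
A minimal sketch of both transformations, assuming pandas: the two-valued nominal features are mapped to 0/1 and the four-valued `Indication` feature is one-hot encoded with `pd.get_dummies`. The records are hypothetical and only a subset of the columns is shown.

```python
import pandas as pd

# Hypothetical records using values from the data dictionary.
df = pd.DataFrame({
    "Indication": ["a-f", "asx", "cva", "tia"],
    "Diabetes": ["no", "yes", "no", "yes"],
    "Label": ["risk", "norisk", "risk", "norisk"],
})

# Binarise the two-valued nominal columns.
encoded = df.copy()
encoded["Diabetes"] = encoded["Diabetes"].map({"no": 0, "yes": 1})
encoded["Label"] = encoded["Label"].map({"norisk": 0, "risk": 1})

# One-hot encode the four-valued Indication column.
encoded = pd.get_dummies(encoded, columns=["Indication"], prefix="Indication", dtype=int)

print(encoded.columns.tolist())
```

Note that `Series.map` turns any value missing from the mapping into NaN, so it is safest to run this after the conformity checks.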

## Feature Selection
based on understanding
apriori
featureselection
rf
informed decision

## Stratification
tts
stratified kfold
!stratified kfold
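
The stratified options noted above could look like the following, assuming scikit-learn. The data is a hypothetical imbalanced toy set (25% "risk"), and the split sizes and fold count are chosen only so that the class ratio is easy to verify.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical imbalanced data: 20 records, 5 of them labelled "risk" (1).
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 5 + [0] * 15)

# Hold-out split that preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified cross-validation over the training portion.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_risk_counts = [
    int(y_train[val_idx].sum()) for _, val_idx in skf.split(X_train, y_train)
]

print(int(y_train.sum()), int(y_test.sum()), fold_risk_counts)
```

Swapping `StratifiedKFold` for `KFold` gives the non-stratified comparison, where the number of "risk" records per fold is no longer guaranteed to be even.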

# Modelling
description

Train CODE

## Baseline (Multiple Linear Regression)
foreach dataset, full featureset and selected features
## SGD
## SVM
## K-Nearest Neighbours
## Decision Tree
## Random Forest
## MLP

## Model Selection

## Model Tuning

# Evaluation

# Deployment

Revisits:
    
    - ID Cluster (when visualising id against contra and ipsi)
    
    - Contra strings (when distplot failed)