# CRISP-DM
Herein, the CRISP-DM methodology will be followed as closely as possible within the context of this project.

<img src="crisp-dm.png" style="max-height:300px">

Most time will be spent in the 'Data Understanding' phase, both to compensate for the lack of client communication beyond the given information and to allow for more informed decisions in the 'Data Preparation' and 'Modelling' stages.

## Business Understanding
Beyond the given task definition and data dictionary, there will be no additional client/business communication. Therefore, some assumptions must be made based on *personal* experience, domain knowledge, and research.

<hr>

**Below is a brief breakdown** of the problem definition and some domain considerations:

DOMAIN: Cardiovascular medicine / healthcare

- As a healthcare dataset, it may be "natural", anonymised patient data, study data (e.g. from a clinical trial), or an aggregation of many different datasets.
- The dataset may contain "control" data (healthy cohorts) or, similarly, focus groups that consist of unhealthy cohorts.
- Due to the (often) subjective nature of clinical diagnosis (i.e. different doctors with varying levels of experience make the diagnoses), some data may be mislabelled.
- Some diagnoses or features may be self-reported or derived from incorrect patient interpretations (e.g. "Yes, I have been feeling...").
- Some features might represent the same thing (e.g. alternative clinical tests, where both may be conducted or one might replace the other).

PROBLEM TYPE: Classification

INPUTS: Tabulated patient data; up to 1520 records, each with 11 features

OUTPUTS:
- Risk
- No Risk
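
The problem definition above can be turned into a quick sanity check to run once the data is loaded. The following is only a minimal sketch under stated assumptions: the target column name `label`, the exact label strings, and the reading that the 11 features exclude the label are all hypothetical and should be adjusted to match the actual file. Records are represented as plain dictionaries (with pandas, `df.to_dict("records")` produces this shape).

```python
# Values taken from the problem definition above.
MAX_RECORDS = 1520
N_FEATURES = 11
LABELS = {"Risk", "No Risk"}

# Hypothetical name for the target column; the real dataset may differ.
LABEL_COL = "label"


def problem_definition_issues(records):
    """Return human-readable mismatches between the data and the definition."""
    issues = []
    if len(records) > MAX_RECORDS:
        issues.append(f"{len(records)} records exceeds the expected {MAX_RECORDS}")
    for i, record in enumerate(records):
        # Assumption: the 11 features do not include the label column.
        features = [key for key in record if key != LABEL_COL]
        if len(features) != N_FEATURES:
            issues.append(
                f"record {i}: {len(features)} features, expected {N_FEATURES}"
            )
        if record.get(LABEL_COL) not in LABELS:
            issues.append(f"record {i}: unexpected label {record.get(LABEL_COL)!r}")
    return issues


# Toy records made up for illustration only (not taken from the dataset).
toy_features = {f"feature_{n}": 0 for n in range(N_FEATURES)}
good = {**toy_features, LABEL_COL: "Risk"}
bad = {**toy_features, LABEL_COL: "Unknown"}

print(problem_definition_issues([good]))
print(problem_definition_issues([good, bad]))
```

An empty list means the data agrees with the definition; any entries point to rows that deserve a closer look during 'Data Understanding'.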

<hr>

**More objectively**, domain-specific terminology from the provided data dictionary can be researched further:

- Atrial Fibrillation
    - A form of **arrhythmia** (Atrial Fibrillation and other Arrhythmias, 2019).
    - Increases risk of stroke (https://www.nhs.uk/conditions/arrhythmia/)
    
    
- Asymptomatic Stenosis
    - Narrowing of the carotid artery without a recent history of TIA or ischemic stroke (https://www.uptodate.com/contents/management-of-asymptomatic-carotid-atherosclerotic-disease).


- Cardiovascular Arrest
    - When the heart stops pumping blood - NOT a heart attack (https://www.bhf.org.uk/informationsupport/conditions/cardiac-arrest).
    - Can be caused by arrhythmias (https://www.heart.org/en/health-topics/cardiac-arrest/about-cardiac-arrest).
    
    
- Transient Ischemic Attack (mini heart attack)
    - Risk is increased by atrial fibrillation (a-f), asymptomatic stenosis (asx), diabetes and hypertension (https://www.nhs.uk/conditions/transient-ischaemic-attack-tia/; https://www.cardiosmart.org/Healthwise/hw22/6606/hw226606).
    - Actually a **mini-stroke**, not a heart attack.


- Diabetes
    - Type 2 makes up around 90% of cases, but the recorded cases could be type 1 or a mix of both.


- IHD/CAD (Ischemic Heart Disease/Coronary Artery Disease)
    - Narrowing or blockage of the coronary arteries (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/coronary-heart-disease).


- Hypertension
    - i.e. high blood pressure.


- Arrhythmia (erratic heart beat)
    - Main types include **a-f**, tachycardia, bradycardia, heart block and ventricular fibrillation (which may cause cardiac arrest).


- IPSI (ipsilateral cerebral ischemic lesions)
    - Ipsilateral means "same side". Based on the context, the side of comparison is likely the side of the brain on which the stroke occurred.


- Contra (contralateral cerebral ischemic lesions)
    - Contralateral means "opposite side". Based on the context, the side of comparison is likely the side of the brain on which the stroke occurred.


- (History) Cardiovascular Interventions
    - Typically, invasive cardiac treatments, e.g. catheterisation (https://onlinelibrary.wiley.com/doi/book/10.1002/9781444316704).

<hr>

**Based on this research**, there are a number of assumptions to be made:

- Patients with an indication of "a-f" should also be recorded as having an arrhythmia.


- Assuming IPSI and Contra are recorded at the same time and in relation to the same stroke or event, and given that IPSI refers to the percentage of lesions on the same side and Contra to the percentage on the opposite side, it would make sense for the two values to sum to 100%.
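
Both assumptions can be tested against the data rather than taken on trust. The sketch below is illustrative only: the column names (`a-f`, `arrhythmia`, `ipsi`, `contra`) and the boolean/numeric encodings are hypothetical, and the real dataset may store these as strings (e.g. "yes"/"no") that need cleaning first. Records are plain dictionaries (with pandas, `df.to_dict("records")` produces this shape).

```python
from math import isclose

# Hypothetical column names; the real dataset may label these differently.
AF_COL = "a-f"
ARR_COL = "arrhythmia"
IPSI_COL = "ipsi"
CONTRA_COL = "contra"


def af_without_arrhythmia(records):
    """Return records flagged with atrial fibrillation but not arrhythmia."""
    return [r for r in records if r[AF_COL] and not r[ARR_COL]]


def lesions_not_summing_to_100(records, tol=1e-6):
    """Return records whose IPSI and Contra percentages do not sum to 100."""
    return [
        r
        for r in records
        if not isclose(r[IPSI_COL] + r[CONTRA_COL], 100.0, abs_tol=tol)
    ]


# Toy records made up for illustration only (not taken from the dataset).
sample = [
    {"a-f": True, "arrhythmia": True, "ipsi": 70.0, "contra": 30.0},
    {"a-f": True, "arrhythmia": False, "ipsi": 80.0, "contra": 20.0},
    {"a-f": False, "arrhythmia": False, "ipsi": 75.0, "contra": 50.0},
]

print(len(af_without_arrhythmia(sample)))       # 1
print(len(lesions_not_summing_to_100(sample)))  # 1
```

If either check returns many records on the real data, the corresponding assumption should be treated as unsupported rather than used to "correct" the data.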

<hr>

*References*

## Data Understanding