<br></br>
# Data Mining and Decision Systems ACW
<br></br>
#### Student number: 201601628
<br>
<hr>

# 0. Notebook Initialisation

### 0.1. Package Imports
Import all libraries/packages used in the notebook.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import pandas as pd
import numpy as np

# from sklearn import model_selection, linear_model, svm
# from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, confusion_matrix, plot_confusion_matrix
# from sklearn.tree import DecisionTreeClassifier, plot_tree
# from sklearn.neural_network import MLPClassifier as mlp
# from sklearn.ensemble import RandomForestClassifier as rf
# from sklearn.feature_selection import SelectFromModel ## https://chrisalbon.com/machine_learning/trees_and_forests/feature_selection_using_random_forest/
# from sklearn.model_selection import StratifiedKFold ## https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold

# from pandas.api.types import is_string_dtype, is_numeric_dtype
from collections import defaultdict ## Used in automating and collating data discrepancies.

%matplotlib inline

### 0.2. Data Loading
Read in the file containing the data.

In [2]:
path = "data.csv" ## Relative path to train/test data.
rawData = pd.read_csv(path) ## Original data to make copies from and compare with.
rawData.head(3) ## Show dataframe to check it was read correctly.

Unnamed: 0,Random,Id,Indication,Diabetes,IHD,Hypertension,Arrhythmia,History,IPSI,Contra,label
0,0.602437,218242,A-F,no,no,yes,no,no,78.0,20,NoRisk
1,0.602437,159284,TIA,no,no,no,no,no,70.0,60,NoRisk
2,0.602437,106066,A-F,no,yes,yes,no,no,95.0,40,Risk


At first glance it can be seen that all column headers are unique, so for the sake of simplicity and to avoid trivial errors, convert them to lowercase.

**NB:** all other modifications will be made to copies of this dataframe.

In [3]:
rawData.columns = [col.lower() for col in rawData.columns] ## Make headers lowercase to avoid some trivial errors.
rawData.head(3) ## Show dataframe.

Unnamed: 0,random,id,indication,diabetes,ihd,hypertension,arrhythmia,history,ipsi,contra,label
0,0.602437,218242,A-F,no,no,yes,no,no,78.0,20,NoRisk
1,0.602437,159284,TIA,no,no,no,no,no,70.0,60,NoRisk
2,0.602437,106066,A-F,no,yes,yes,no,no,95.0,40,Risk


### 0.3. Utility Functions
Define any utility functions or properties used throughout the notebook.

In [4]:
rawNRows = rawData.shape[0] ## Get number of rows in original dataframe.
rawNCols = rawData.shape[1] ## Get number of columns in original dataframe.
rawColNames = rawData.columns.values # Get column names which will often be used as an iterator.
# concerns = defaultdict(list) ## Create a dict to store data discrepencies without littering notebook with outputs until required.

In [5]:
## For pretty printing.
# ''' n == number of indents '''
def Indent(n=1):
    indentSize = 4
    indent = (" " * indentSize) * n
    return indent

In [6]:
## Iterate over dictionary items and output the key and any values.
# ''' collection == dictionary object '''
# ''' label == string to prefix each dictionary key e.g. "1. " '''
def PrintDict(collection, label = ""):
    i = 1
    for key, value in collection.items():
        print("\n________________________________________________________________\n")  
        print(label + str(i) + ": " + key)
        i += 1

        for val in value:
            print(val)
    
    print("\n________________________________________________________________\n")  

In [7]:
## Impute a value in a given record based on the mode in a collection, using the knowledge that
## the data set is quite homogenous.
# ''' toImpute == feature to impute '''
# ''' record == pd series object '''
# ''' df == pd dataframe object '''
# ''' ignore == list of columns to ignore '''
# ''' output == bool : True = print result '''
def NNImpute(toImpute, record, df, ignore=[], output=True):
    neighbours = []
    
    # Look for records that are duplicated when ignoring the specifed columns and target feature.
    ignore.append(toImpute)
    tempDf = df.drop(columns=ignore)
    tempSeries = record.drop(labels=ignore)

    for index, row in tempDf.iterrows():
        if row.all() == tempSeries.all():
            neighbours.append(index)
    
    # Get the mode class of the neighbours.
    mode = df.iloc[neighbours][toImpute].mode()[0]
    
    if output:
        print("Based on " + str(len(neighbours)) + " neighbours: " + str(mode))
    
    return mode    

### Duplicates
dupes
noDupes

### Missing Data
imputed
dropped

### Outliers
imputeExpected (correct)
drop

### Other Assumptions
Random, ID, Session
drop or keep? clusterDf noClusterDf

## Distribution
### Univariate
df.hist (low, default, high bins)
    risk distribution (box plot)

### Multivariate
Check .corr and boxplot multiple features

# 3. Data Preperation
phase description

## Cleaning

## Transformation
binarise
1he/dummies

## Feature Selection
based on understanding
aprioiri
featureselection
rf
informed decision

## Stratification
tts
stratified kfold
!stratified kfold

# 4. Modelling
description

Train CODE

## Baseline (Multiple Linear Regression)
foreach dataset, full featureset and selected features
## SGD
## SVM
## K-Nearest Neighbours
## Decision Tree
## Random Forest
## MLP

## Model Selection

## Model Tuning

# 5. Evaluation

# 6. Deployment

<hr>

# CRISP DM
Herein, the CRISP DM data methodology is followed (as close as is possible in the context of this project).

<img src="crisp-dm.png" style="max-height:300px">

Most time is spent in the 'Data Understanding' phase to make up for the fact that there is no client communcation beyond the given information and to allow for better informed decisions in the 'Data Preperation' and 'Modelling' stages.

# 1. Business Understanding
Beyond the the given task definition and data dictionary, there will be no additional client/business communication. Therefore, some assumptions must be made based on *personal*: experience, domain knowledge, and research.

 <hr>

**Below is a brief breakdown** of the problem definition and some domain considerations:

DOMAIN: Cardio-vascular medicine / healthcare

- As a healthcare dataset it may be "natural", anonymised patient data, study data (e.g. clinical trial), or an aggregation of many different datasets.
- There is a chance there is "control" data (healthy cohorts) within the dataset or, similarly, focus groups that consist of unhealthy cohorts.
- Due to the (often) subjective nature of clinical diagnosis (i.e. different doctors with varying levels of experience make the diagnoses), some data may be mislabelled.
- Some diagnoses or features may be self-certified or be derived from incorrect patient interpretations (e.g. "Yes, I have been feeling...").
- Some features might represent the same thing (e.g. an alternative clincal test - both may be conducted or one might replace the other). 

PROBLEM TYPE: Classification

INPUTS: Tabulated patient data; (up-to) 1520 records of 11 features

OUTPUTS:
- Risk
- No Risk

<hr>

**More objectively**, domain-specific terminology from the provided data dictionary can be researched further:

- Atrial Fibrillation
    - A form of **arrythmia** (Atrial Fibrillation and other Arrhythmias, 2019).
    - Increases risk of stroke (https://www.nhs.uk/conditions/arrhythmia/)
    
    
- Asymptomatic Stenosis
    - Narrowing of the cartoid artery without recent history of TIA  or ischemic stroke (https://www.uptodate.com/contents/management-of-asymptomatic-carotid-atherosclerotic-disease).


- Cardiovascular Arrest
    - When the heart stops pumping blood - NOT a heart attack (https://www.bhf.org.uk/informationsupport/conditions/cardiac-arrest).
    - Can be caused by arrhythmias (https://www.heart.org/en/health-topics/cardiac-arrest/about-cardiac-arrest).
    
    
- Transient Ischemic Attack (mini heart attack)
    - Risk increased by a-f, asx, diabetes and hypertension (https://www.nhs.uk/conditions/transient-ischaemic-attack-tia/; https://www.cardiosmart.org/Healthwise/hw22/6606/hw226606).
    - Actually a **mini-stroke**, not heart attack.


- Diabetes
    - Type 2 makes up 90% of cases, but could be type 1 or a mix of both (https://www.bhf.org.uk/informationsupport/risk-factors/diabetes).


- IHD/CAD (Ischemic Heart Disease/Coronary Artery Disease)
    - Narrowing or blockage of the coronary arteries (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/coronary-heart-disease).


- Hypertension
    - i.e high blood pressure.


- Arrhythmia (erratic heart beat)
    - Main types include **a-f**, tachcardia, bradycardia heart block and ventricular fibrilation (may cause cardiac arrest) (https://www.nhs.uk/conditions/arrhythmia/).


- IPSI (ipsilateral cerebral ischemic lesions)
    - Ipsilateral means "same side". Based on the context, the side of comparsion is likely the side of the brain that the stroke occurred.


- Contra (contralateral cerebral ischemic lesions)
    - Contralateral means "opposite side". Based on the context, the side of comparsion is likely the side of the brain that the stroke occurred.


- (History) Cardiovascular Interventions
    - Typically, cardiac invasive treatments e.g. catheterisation. (https://onlinelibrary.wiley.com/doi/book/10.1002/9781444316704)

<hr>

**Based on these findings**, there are some assumptions to be made:

- Patients with an indication of "a-f" should also be be recorded as having an arrhythmia.


- The indication feature almost appears ordinal, with a-f and asx being cause for cva and tia; although it is difficult to verify this without communicating with professionals.


- Assuming IPSI and Contra are recorded at the same time in relation to the same stroke or event; and Since IPSI is reffering to the percentage of lesions on the same side and Contra on the opposite side, it would make sense for the 2 values to have sum of 100%

<hr>

*References*

    - 1
    
    - 2
   

# 2. Data Understanding
This section focuses on an in-depth understanding of the given date, its correctness and any patterns.
<hr>

## 2.1. Data Dictionary
The data dictionary with all expected features and their format is included in the table below.

<table>
    <tbody>
        <tr>
            <td>
                <p><strong>Attribute</strong></p>
            </td>
            <td>
                <p><strong>Value Type</strong></p>
            </td>
            <td>
                <p><strong>NumberOfValues</strong></p>
            </td>
            <td>
                <p><strong>Values</strong></p>
            </td>
            <td>
                <p><strong>Comment</strong></p>
            </td>
            <td>
                <p><strong>Non-clinical Description</strong></p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Random</p>
            </td>
            <td>
                <p>Real</p>
            </td>
            <td>
                <p>Number of Records</p>
            </td>
            <td>
                <p>Unique</p>
            </td>
            <td>
                <p>Real number of help in randomly sorting the data records</p>
            </td>
            <td>
                <p>Real number of&nbsp;help&nbsp;in randomly sorting the data records: Should be unique values.</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Id</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Max of Number of Records</p>
            </td>
            <td>
                <p>Unique to patient</p>
            </td>
            <td>
                <p>Anonymous patient record identifier: Should be unique values unless patient has multiple sessions</p>
            </td>
            <td>
                <p>Anonymous patient record identifier: Should be unique value per patient. Patient can have multiple sessions</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Indication</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Four</p>
            </td>
            <td>
                <p>{a-f, asx, cva, tia}</p>
            </td>
            <td>
                <p>What type of Cardiovascular event triggered the hospitalisation?</p>
            </td>
            <td>
                <p>What type of Cardiovascular event triggered the hospitalisation?</p><p> a-f :&nbsp;Atrial-Fibrillation</p>
                <p>asx&nbsp;:&nbsp;Asymptomatic Stenosis&nbsp;</p><p>cva&nbsp;: Cardiovascular Arrest</p>
                <p>tia&nbsp;:&nbsp;Transient Ischemic Attack ("mini-heart attack")</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Diabetes</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Diabetes?</p>
            </td>
            <td>
                <p>Does the patient suffer from Diabetes?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>IHD</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Coronary artery disease (CAD), also known as ischemic heart disease (IHD)?</p>
            </td>
            <td>
                <p>Does the patient suffer from Coronary artery disease (CAD), also known as ischemic heart disease (IHD)?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Hypertension</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from Hypertension?</p>
            </td>
            <td>
                <p>Does the patient suffer from Hypertension?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Arrhythmia</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Does the patient suffer from</p>
                <p>Arrhythmia (i.e. erratic heart beat)?</p>
            </td>
            <td>
                <p>Does the patient suffer from Arrhythmia (i.e. erratic&nbsp;heart beat)?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>History</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{no, yes}</p>
            </td>
            <td>
                <p>Has the patient a history of</p>
                <p>Cardiovascular interventions?</p>
            </td>
            <td>
                <p>Has the patient a history of Cardiovascular interventions?</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>IPSI</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Potentially 101</p>
            </td>
            <td>
                <p>[0, 100]</p>
            </td>
            <td>
                <p>Percentage figure for cerebral ischemic lesions defined as ipsilateral</p>
            </td>
            <td>
                <p>Percentage figure for cerebral ischemic lesions defined as ipsilateral</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Contra</p>
            </td>
            <td>
                <p>Integer</p>
            </td>
            <td>
                <p>Potentially 101</p>
            </td>
            <td>
                <p>[0, 100]</p>
            </td>
            <td>
                <p>Percentage figure for contralateral cerebral ischemic lesions</p>
            </td>
            <td>
                <p>Percentage figure for contralateral cerebral ischemic lesions</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Label</p>
            </td>
            <td>
                <p>Nominal</p>
            </td>
            <td>
                <p>Two</p>
            </td>
            <td>
                <p>{risk, norisk}</p>
            </td>
            <td>
                <p>Is the patient at risk (Mortality)?</p>
            </td>
            <td>
                <p>Is the patient at risk (Mortality)?</p>
            </td>
        </tr>
    </tbody>

<br>
<b style="color: red;">NOTE:</b> "Session" is also included in the non-clinical description, but not included in the data dictionary.
<br>
<table>
    <tr>
        <td>
            <p><strong>Attribute</strong></p>
        </td>
        <td>
            <p><strong>Value Type</strong></p>
        </td>
        <td>
            <p><strong>NumberOfValues</strong></p>
        </td>
        <td>
            <p><strong>Values</strong></p>
        </td>
        <td>
            <p><strong>Comment</strong></p>
        </td>
        <td>
            <p><strong>Non-clinical Description</strong></p>
        </td>
    </tr>
    <tr>
        <td>
            <p>Session</p>
        </td>
        <td>
            <p>Unknown</p>
        </td>
        <td>
            <p>Max Number of Records (assumed)</p>
        </td>
        <td>
            <p>Unique to patient</p>
        </td>
        <td>
            <p>Unknown</p>
        </td>
        <td>
            <p>Anonymous patient session identifier.</p>
        </td>
    </tr>
</table>
<br>



**Looking at the data dictionary** it can be seen that the "Random" and "Id" attributes are supposed to be unique. If this is true, the features should provide almost no benefit to any models and can be discarded to reduce the dimensionality of the problem.

The **Label** feature is the feature we want to predict and our ground-truth.

<hr>

## 2.2. Data Correctness
Check for data conformity to data dictionary and explore common pitfalls (e.g. missing or duplicate data).

<hr>

### 2.2.1. Conformity to Data Dictionary
The data dictionary serves as the foundation for assumptions made regarding the data.

The following python-object is a distillation of the data-dictionary which can be used to check the expected values/types etc. against the *actual* data.

In [8]:
## Object description:
# key == column/feature name.
# nVals == range of expected values for a continuous column.
# vals == possible values for any categoric or discrete column.

assumptions = {
    "random":{ ## Col name.
        "nVals": (rawNRows, rawNRows), # Range: unique per record. ## Real.
    },  
    "id":{
        "nVals": (1, rawNRows), ## Range: unique per patient. ## Integer.
    },
    "indication":{
        "vals": ["a-f","asx","cva","tia"] ## Possible values (except nan).
    },
    "diabetes":{
        "vals": ["yes", "no"]
    },
    "ihd":{
        "vals": ["yes", "no"]
    },
    "hypertension":{
        "vals": ["yes", "no"]
    },
    "arrhythmia":{
        "vals": ["yes", "no"]
    },
    "history": {
        "vals": ["yes", "no"]
    },
    "ipsi": {
        "vals": np.arange(0,101) # Percentage 0-100.
    },
    "contra": {
        "vals": np.arange(0,101), # Percentage 0-100.
    },
    "label": {
        "vals": ["risk", "norisk"]
    },
    "session":{ ## This feature was given separate to the dictionary.
        "nVals": (1, rawNRows), ## Unique per patient (assumed).
    },
}

<hr>

#### Compare Actual Data with Assumptions Object

In [9]:
df = rawData.copy() # Copy of the unmodified, raw data.
discrepancies = defaultdict(list) # Collate discrepencies.

# Iterate over assumptions object.
for key, value in assumptions.items():
    
    # If the expected feature exists in the actual data.
    if key in rawColNames:
        actualValues = df[key].dropna().unique() ## Ignore nan values in uniques (handle seperately).
    
        try:
            # Check expected values.
            expectedValues = value["vals"]
            if (not(set(actualValues) & set(expectedValues))):
                discrepancies["EXPECTED VALUES"].append(Indent(2) + key + "\n" + Indent(3)+ "Expected: " + str(set(expectedValues)) + "\n" + Indent(3)+ "Actual: " + str(set(actualValues)) + "\n")
        except:
            # No "vals" key; value is expected to be unique (nVals).
            actualNValues = len(actualValues)
            expectedNValues = value["nVals"]
            if (not(actualNValues >= expectedNValues[0]) or not(actualNValues <= expectedNValues[1])): ## Check actual number of values is within the expected range.           
                discrepancies["NUMBER OF UNIQUE VALUES"].append(Indent(2) + key + "\n" + Indent(3) + "Expected: " + str(expectedNValues) + "\n" + Indent(3)+ "Actual: " + str(actualNValues))
    else:
        # Expected column isn't present.
        discrepancies["MISSING COLUMNS"].append(key)

Format and output any descrepancies.

In [10]:
PrintDict(discrepancies, "Discrepancy ")


________________________________________________________________

Discrepancy 1: NUMBER OF UNIQUE VALUES
        random
            Expected: (1520, 1520)
            Actual: 1222

________________________________________________________________

Discrepancy 2: EXPECTED VALUES
        indication
            Expected: {'asx', 'cva', 'tia', 'a-f'}
            Actual: {'ASx', 'TIA', 'A-F', 'Asx', 'CVA'}

        contra
            Expected: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100}
            Actual: {'60 ', '90 ', '51 ', '70', '55 ', '75', '60', '15 ', '53', '20 ', '100 ', '10', '65 ', '40 ', '40', ' ', '25 ', '10 ', '90', '20', '85', '30

...

**Discrepancy 1: NUMBER OF UNIQUE VALUES**
    
- RANDOM was expected to be unique per patient, but only 1222 of 1520 records comply. Presumably, this can be attributed to null or duplicate values.

In [11]:
df["random"].describe()

count    1520.000000
mean        0.509545
std         0.284006
min         0.000295
25%         0.268531
50%         0.517616
75%         0.754724
max         0.999448
Name: random, dtype: float64

In [12]:
# Get number of random attributes that aren't unique.
nMissing = df["random"].shape[0] - df["random"].unique().shape[0]

# Get number of random attributes that are duplicated or null.
nDupes = df[df["random"].duplicated() == True].shape[0] ## Get number of duplicate random attributes.
nNan = df[df["random"] == np.isnan].shape[0]

# Calculate number of non-unique values not accounted for by nans and dupes.
stillMissing = nMissing - (nDupes + nNan)

print(str(nMissing) + " values are not unique.")
print(str(nDupes) + " 'random' values are duplicated.")
print(str(nNan) + " 'random' values are nan.")
print (str(stillMissing) + " non-unique records unaccounted for.")

298 values are not unique.
298 'random' values are duplicated.
0 'random' values are nan.
0 non-unique records unaccounted for.


The random feature isn't unique as as described in the data dictionary; there ae 298 duplicates.
<p><b style="color: red">ACTION:</b> The records where the random attributes are duplicated should be inspected further.<p>

In [13]:
## View all duplicate values.
randomDupes = df[df["random"].duplicated(keep=False)]
randomDupes.head()

Unnamed: 0,random,id,indication,diabetes,ihd,hypertension,arrhythmia,history,ipsi,contra,label
0,0.602437,218242,A-F,no,no,yes,no,no,78.0,20,NoRisk
1,0.602437,159284,TIA,no,no,no,no,no,70.0,60,NoRisk
2,0.602437,106066,A-F,no,yes,yes,no,no,95.0,40,Risk
8,0.678157,256128,TIA,no,no,yes,no,no,81.0,20,NoRisk
10,0.678157,174588,CVA,no,yes,yes,yes,no,75.0,50,Risk


Records don't appear to be duplicated where the random attribute is duplicated and it is apparent that some random codes are duplicated more than once (e.g. indexes 1, 2 and 3).

Considering the absence of the session column and the fact that the Id feature IS unique, it could be possible that the Id feature is actually the missing session column, and the random code is the patient id.

To prove or disprove this, the following looks at each random code to see if any diabetes or history values change more than once per random code. If no such pattern is detected, this supports the idea that random is actually the patient id and the id is the session.

In [14]:
contradictions = []

# Iterate through all the unique values in random.
for randVal in df["random"].unique():
    
    # Get the records with the current random value being inspected.
    randDf = df[df["random"] == randVal]
    
    try:
        # See if the value for history changes more than once. 
        if randDf["history"].value_counts()["yes"] > 1:
            contradictions.append(randVal)
            continue
    except:
        pass
    
    try:
        # See if the value for diabetes changes more than once. 
        if randDf["diabetes"].value_counts()["yes"] > 1:
            contradictions.append(randVal)
            continue
    except:
        pass

# Report any contradictions.
if len(contradictions) < 1:
    print("No contradictions found.")
else:
    contradictions ## Output list of random codes which disprove random being id.

No contradictions found.


It seems possible that the random feature is actually a patient identifier and the id column is a unique identifier for the session.

Arguments against this suggestion include the facts that values range between 0-1, which supports the concept of a sorting utility, and that the values in the id column are *very* unconventional for denoting sessions (expected values would be simpler, e.g. bl/baseline, 1/V1/V01).

<p><b style="color: red">ACTION:</b> Maintain the consideration that the random feature may be a patient identifier.<p>

<hr>


**Discrepancy 2: EXPECTED VALUES**
    
- **INDICATION** had an unexpected variant of ASx/Asx. Clinical research also abbreviates the condition as "ASX" suggesting that they are the same class as stipulated by the data dictionary (https://www.sciencedirect.com/science/article/pii/S0741521415010241).
    
<p><b style="color: red">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACTION:</b> Treat variations of "asx" as the same category.<p>

In [15]:
# These changes are fundamental, so best to work with them now.
correctedData = rawData.copy()

# Make all indication categories lowercase.
correctedData["indication"] = correctedData["indication"].apply(lambda x: str(x).lower())
correctedData["indication"].unique() # Output and confirm changes.

array(['a-f', 'tia', 'cva', 'asx', 'nan'], dtype=object)

<br>

- **CONTRA** is formatted as a string in the actual data, rather than the expected numeric format, although (with the exception of null values) the numeric equivalents are all within the expected range.

<p><b style="color: red">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACTION:</b> Convert contra values to numeric.<p>

In [16]:
# Convert empty strings to nan.
correctedData['contra'] = correctedData['contra'].replace(r'^\s*$', np.nan, regex=True)
# Convert all values to numeric.
correctedData["contra"] = correctedData["contra"].apply(lambda x: float(x))
correctedData["contra"].head(3) # Output and confirm changes.

0    20.0
1    60.0
2    40.0
Name: contra, dtype: float64

<br>


- **LABEL** has an additional, unexpected category: "Unknown".

In [17]:
## Output all values where the value of the label feature equals "Unknown".
correctedData[correctedData["label"] == "Unknown"]

Unnamed: 0,random,id,indication,diabetes,ihd,hypertension,arrhythmia,history,ipsi,contra,label
475,0.298074,173791,asx,no,yes,yes,no,no,70.0,55.0,Unknown
523,0.46017,283846,cva,no,no,yes,yes,no,95.0,100.0,Unknown


Since the requested outputs of the end-product are risk and norisk, and the fact that there are only 2 of 1520 datapoints with this classification (overwhelming imbalance); they are useless.

The options are to either impute the values, or drop them: although it isn't expected that either will have a significant effect since only 2 records are affected.

In [18]:
for index in correctedData[correctedData["label"] == "Unknown"].index.values:
    NNImpute("label", correctedData.iloc[index], correctedData, ignore=["random", "id"])

Based on 23 neighbours: NoRisk
Based on 283 neighbours: Risk


In [None]:
checks = []

for index in correctedData.index.values:
    print(index)
    imputed = NNImpute("label", correctedData.iloc[index], correctedData, ignore=["random", "id"], output=False)
    actual = correctedData.iloc[index]["label"]
    
    if imputed != actual:
        checks.append(index)
        
checks

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

### Duplicates
dupes
noDupes

### Missing Data
imputed
dropped

### Outliers
imputeExpected (correct)
drop

### Other Assumptions
Random, ID, Session
drop or keep? clusterDf noClusterDf

## Distribution
### Univariate
df.hist (low, default, high bins)
    risk distribution (box plot)

### Multivariate
Check .corr and boxplot multiple features

# 3. Data Preperation
phase description

## Cleaning

## Transformation
binarise
1he/dummies

## Feature Selection
based on understanding
aprioiri
featureselection
rf
informed decision

## Stratification
tts
stratified kfold
!stratified kfold

# 4. Modelling
description

Train CODE

## Baseline (Multiple Linear Regression)
foreach dataset, full featureset and selected features
## SGD
## SVM
## K-Nearest Neighbours
## Decision Tree
## Random Forest
## MLP

## Model Selection

## Model Tuning

# 5. Evaluation

# 6. Deployment

Revisits:
    
    - ID Cluster (when visualising id against contra and ipsi)
    
    - Contra strings (when distplot failed)