# Notebook Initialisation

## Package Imports
Import all libraries used in the notebook.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import pandas as pd
import numpy as np

from sklearn import model_selection, linear_model
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse
from sklearn.preprocessing import MinMaxScaler

from pandas.api.types import is_string_dtype ## Checks whether a column has a string dtype.
from pandas.api.types import is_numeric_dtype ## Checks whether a column has a numeric dtype.
from collections import defaultdict ## Used in automating and collating data discrepancies.

%matplotlib inline

## Data Loading
Read in the file containing the data.

In [6]:
path = "data.csv" ## Relative path to train/test data.
rawData = pd.read_csv(path) ## Original data to make copies from and compare with.
rawData.head() ## Show dataframe.

Unnamed: 0,Random,Id,Indication,Diabetes,IHD,Hypertension,Arrhythmia,History,IPSI,Contra,label
0,0.602437,218242,A-F,no,no,yes,no,no,78.0,20,NoRisk
1,0.602437,159284,TIA,no,no,no,no,no,70.0,60,NoRisk
2,0.602437,106066,A-F,no,yes,yes,no,no,95.0,40,Risk
3,0.128157,229592,TIA,no,no,yes,no,no,90.0,85,Risk
4,0.676862,245829,CVA,no,no,no,no,no,70.0,20,NoRisk


Make the column names lowercase to simplify typing and to avoid trivial errors.

In [10]:
rawData.columns = [col.lower() for col in rawData.columns] ## Make headers lowercase to avoid some trivial errors.
rawData.head(0) ## Show dataframe columns in df format but without data (.columns returns an Index).

Unnamed: 0,random,id,indication,diabetes,ihd,hypertension,arrhythmia,history,ipsi,contra,label
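
Lowercasing headers can, in principle, make two differently cased names collide. The following is a minimal sketch of the same step with a guard against that, using a hypothetical miniature frame (`demo`) built from a few of the column names shown above; it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature frame; column names are taken from the output above.
demo = pd.DataFrame({
    "Id": [218242, 159284],
    "IHD": ["no", "no"],
    "label": ["NoRisk", "NoRisk"],
})

# Vectorised equivalent of the list comprehension used in the cell above.
lowered = demo.columns.str.lower()

# Guard: stop if two headers collapse into the same lowercase name.
assert not lowered.duplicated().any(), "lowercasing produced duplicate column names"

demo.columns = lowered
print(list(demo.columns))
```

The guard costs little and fails loudly, which is preferable to silently ending up with two columns of the same name.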


## Misc
Define variables that store properties of the original data for easy access, along with any utility functions.

In [11]:
rawNRows = rawData.shape[0] ## Get number of rows in original dataframe.
rawNCols = rawData.shape[1] ## Get number of columns in original dataframe.
rawColNames = rawData.columns.values ## Get column names, which will often be iterated over.
concerns = defaultdict(list) ## Create a dict to store data discrepancies without littering the notebook with outputs until required.

## For pretty printing.
def Indent(n=1):
    """Return n levels of indentation (four spaces per level)."""
    return "    " * n
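
As an illustration of how `concerns` and `Indent` might work together, here is a minimal sketch that collects notes per column and prints them only when asked. The helper `format_concerns` and the example notes are hypothetical and do not come from the original notebook.

```python
from collections import defaultdict


def Indent(n=1):
    """Return n levels of indentation (four spaces per level)."""
    return "    " * n


# defaultdict(list) creates an empty list the first time a key is used,
# so notes can be appended without checking whether the key exists.
concerns = defaultdict(list)

# Hypothetical entries: the column names appear in the data above,
# the messages are illustrative only.
concerns["ipsi"].append("example note: check for missing values")
concerns["indication"].append("example note: check category spellings")


def format_concerns(concerns):
    """Build an indented, per-column report of the collected notes."""
    lines = []
    for column, notes in concerns.items():
        lines.append(column)
        for note in notes:
            lines.append(Indent() + "- " + note)
    return "\n".join(lines)


report = format_concerns(concerns)
print(report)
```

Collecting notes in one structure keeps intermediate cells quiet and lets all discrepancies be reviewed together at a later stage.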

# CRISP-DM
The CRISP-DM data mining methodology is followed from here on, as closely as is possible in the context of this project.

<img src="crisp-dm.png" style="max-height:300px">

Most time will be spent in the 'Data Understanding' phase to make up for the lack of client communication beyond the given information and to allow for more informed decisions in the 'Data Preparation' and 'Modelling' stages.

## 1. Business Understanding
In the absence of client/business communication, personal experience and domain knowledge will support any decisions made around the given information.

Below is a brief breakdown of the problem definition and some domain considerations:
<ol>
    <li><p><b>DOMAIN:</b> Cardiovascular medicine / healthcare</p>
<ul>
    <li>As a healthcare dataset, it may be "natural" anonymised patient data, study data (e.g. from a clinical trial), or an aggregation of many different datasets.</li>
    <li>The dataset may contain "control" data (healthy cohorts) or, similarly, focus groups that consist of unhealthy cohorts.</li>
    <li>Due to the largely subjective nature of clinical diagnosis (i.e. different doctors with varying levels of experience make the diagnoses), it is entirely possible that some records are mislabelled (have the wrong classification).</li>
    <li>It is also possible that some diagnoses or features are self-certified or derived from incorrect patient assumptions (e.g. "Yes, I have a history of...").</li>
    </ul></li>
<li><p><b>PROBLEM TYPE:</b> Classification</p></li>
<li><p><b>INPUTS:</b> Tabulated patient data; 1520 records of 11 features</p></li>
<li><p><b>OUTPUTS:</b></p>
    <ul>
        <li>Risk</li>
        <li>No Risk</li>
    </ul>
</li>
    </ol>
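
The input size and output classes listed above can be checked directly against the loaded data. The following is a minimal sketch that uses only three columns from the five rows displayed earlier as a stand-in for `rawData`, so the counts it prints are not those of the full dataset. Note that the class values appear in the data as `Risk` and `NoRisk`.

```python
import pandas as pd

# Stand-in for rawData: three columns from the five rows shown in the head() output.
sample = pd.DataFrame({
    "ipsi": [78.0, 70.0, 95.0, 90.0, 70.0],
    "contra": [20, 60, 40, 85, 20],
    "label": ["NoRisk", "NoRisk", "Risk", "Risk", "NoRisk"],
})

# Number of records and columns.
n_rows, n_cols = sample.shape

# Distinct output classes and how often each occurs.
class_counts = sample["label"].value_counts()

print(f"{n_rows} records, {n_cols} columns")
print(class_counts)
```

Running the same checks on `rawData` would confirm the record count and show whether the two classes are balanced, which matters when choosing evaluation metrics for a classifier.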