## Exploring Medical Data with Python

**What?:**
- Exploratory data analysis
- Linear Regression
- Python
- Libraries: scikit-learn, numpy, matplotlib, pandas, seaborn, bokeh

**Who?:**
- Medics
- Statisticians
- Machine Learning students

**Why?:**
- Visualising data
- Machine Learning (regression)
- Tools to understand data

**Noteable features**
- Pre-installed libraries
- Practice data sets
- Interactive visualisation
- On the spot results
- Easy translation to R
- Tutorial format

**How? Tools/methods used:**
- Bokeh - interactive visualisation
- Base SciPy stack - processing and accessing data
- Explanations integrated into code

<hr>

### Importing Libraries

Python has excellent support for data analysis through the [SciPy](https://www.scipy.org) stack. The core libraries are pre-installed on Noteable, so they only need to be imported:

In [None]:
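# scikit learn - used throughout for machine learning tools
import sklearn
import sklearn.datasets as ds
import sklearn.metrics          # used later as sklearn.metrics.r2_score
import sklearn.model_selection  # used later as sklearn.model_selection.train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error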

# numpy - array support
import numpy as np

# pandas - dataset tools
import pandas as pd

# matplotlib - data visualisation
import matplotlib.pyplot as plt
%matplotlib inline

# seaborn - notebook styling
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

# bokeh - visualisation library
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

# hide unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

<hr>

### Data

Scikit-learn provides built-in datasets that are useful for practice. Here, the [diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) is loaded directly into the notebook.

In [None]:
# load dataset to notebook as Bunch object
diabetes = ds.load_diabetes()

<hr>

### Context

This dataset comes with a description of the data. It is wise to save descriptions in the same directory as the data.

The next cell creates a text file and stores the data description, <code>diabetes.DESCR</code>, in it. The description is then printed to the notebook.

In [None]:
# store description in file (the with-block closes the file automatically)
with open("diabetes.txt", "w") as description:
    description.write(diabetes.DESCR)

# print to notebook (optional)
print(diabetes.DESCR)

<hr>

### Pandas

[Pandas](https://pandas.pydata.org) is a library for data analysis. It provides useful functions to inspect and manipulate data.

To use pandas functions, the data should be in a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [None]:
# convert data from Bunch to pandas dataframe
data = pd.DataFrame(data= np.c_[diabetes['data'], diabetes['target']],
                     columns= diabetes['feature_names'] + ['target'])
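
As an alternative sketch, the loader can build the dataframe itself through its `as_frame=True` argument. This assumes scikit-learn 0.23 or later; the names `diabetes_frame` and `data_alt` are made up for this illustration and are not used elsewhere in the notebook.

```python
import sklearn.datasets as ds

# ask the loader for pandas objects instead of numpy arrays
# (assumption: scikit-learn >= 0.23, where as_frame is available)
diabetes_frame = ds.load_diabetes(as_frame=True)

# .frame holds the ten features plus the 'target' column in one dataframe
data_alt = diabetes_frame.frame

print(data_alt.shape)
print(list(data_alt.columns))
```

Both routes should give the same table; the manual conversion above has the advantage of working on older installations.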

<hr>

### Exploring Data with pandas

Pandas functions can now be used to retrieve helpful information about the data:

In [None]:
# check dimensions of data
print(data.shape)

In [None]:
# check names of columns
print(data.columns)

In [None]:
# visually inspect a few datapoints
data.head()

In [None]:
# check types of all variables
data.dtypes

In [None]:
# count null values for all variables
data.isnull().sum()

In [None]:
# summary statistics for each variable
data.describe()
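
One further check that may help when choosing features for regression is the correlation of each feature with the target. The sketch below rebuilds the dataframe so that it runs on its own (assuming scikit-learn 0.23 or later for `as_frame=True`); the names `data_corr` and `target_corr` are hypothetical.

```python
import sklearn.datasets as ds

# rebuild the dataframe so this cell runs on its own
data_corr = ds.load_diabetes(as_frame=True).frame

# Pearson correlation of every feature with the target
target_corr = data_corr.corr()['target'].drop('target')

# order by absolute strength, strongest first
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)

print(target_corr)
```

Correlation only captures linear association with one feature at a time, so it is a rough guide rather than a selection rule.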

<hr>

# Linear Regression

Later on, the target and feature variables will be used separately. The next cell splits the data into two separate dataframes: X for the features, y for the target.

In [None]:
# split data frame to predictors and target
predictors = data.shape[1]-1
X = data.iloc[:,0:predictors]
y = data.iloc[:,predictors:predictors+1]
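
The same split can also be written with column names instead of positions, which may be easier to read when the target is not the last column. This is a minimal sketch on a made-up three-row frame; `toy`, `X_toy` and `y_toy` are hypothetical names and the values are invented.

```python
import pandas as pd

# tiny made-up frame standing in for the diabetes data
toy = pd.DataFrame({
    'bmi':    [0.06, -0.05, 0.04],
    's5':     [0.02, -0.07, 0.00],
    'target': [151.0, 75.0, 141.0],
})

# features: every column except the target
X_toy = toy.drop(columns='target')

# target: double brackets keep it a one-column dataframe, as in the cell above
y_toy = toy[['target']]

print(X_toy.shape, y_toy.shape)
```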

## Linear Regression - One Feature Variable

This section will create a linear regression model using only the s5 variable.

In [None]:
# plot s5 variable against target
data.plot(x='s5', y='target', kind='scatter', c='blue');

To perform linear regression, the samples will be split into a training set and a test set. The model is then trained on the training data.

In [None]:
# split data to train and test sets
X_train_s5, X_test, y_train_s5, y_test = sklearn.model_selection.train_test_split(
    X['s5'], y, test_size=0.05, random_state=0)

# convert to numpy array, force to correct dimensions
X_train_s5 = X_train_s5.to_numpy().reshape(-1,1)
y_train_s5 = y_train_s5.to_numpy().reshape(-1,1)
X_test     = X_test.to_numpy().reshape(-1,1)
y_test     = y_test.to_numpy().reshape(-1,1)
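
The `reshape(-1,1)` calls are needed because scikit-learn estimators expect the features as a 2D array of shape (samples, features), while selecting the single column `X['s5']` yields 1D data. A small sketch with made-up values (`x_1d` and `x_2d` are illustrative names):

```python
import numpy as np

# a 1D array of five made-up feature values
x_1d = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
print(x_1d.shape)

# -1 asks numpy to infer the row count; 1 fixes a single column
x_2d = x_1d.reshape(-1, 1)
print(x_2d.shape)
```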

In [None]:
# create instance of linear regression model
lreg = linear_model.LinearRegression()

# fit linear regression model on training data
lreg.fit(X_train_s5, y_train_s5)

# predict on test set
y_pred = lreg.predict(X_test)

# print result data
print("coefficient of s5:            ", lreg.coef_[0][0])
print("intercept:                    ", lreg.intercept_[0])
print("mse:                          ", mean_squared_error(y_test, y_pred))
print("coefficient of determination: ", sklearn.metrics.r2_score(y_test, y_pred))
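
To make the two metrics printed above less opaque, here is a sketch that computes them by hand with numpy. The arrays `y_true` and `y_hat` are made-up values for illustration, not results from the diabetes model.

```python
import numpy as np

# made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat  = np.array([2.0, 5.0, 8.0, 9.0])

# mean squared error: average of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# coefficient of determination:
# 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("mse:", mse)
print("r2: ", r2)
```

A model that always predicts the mean of the true values scores 0 on the coefficient of determination, and a perfect model scores 1.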

The plots below give a visual representation of the model:

In [None]:
# reshape for plotting
X_train_r = X_train_s5.reshape(-1)
y_train_r = y_train_s5.reshape(-1)
X_test_r  = X_test.reshape(-1)
y_test_r  = y_test.reshape(-1)
y_pred_r  = y_pred.reshape(-1)

In [None]:
# plot training data with bokeh functions

# specify output
output_notebook()

# specify figure (Bokeh 3.x takes width/height; older releases named these plot_width/plot_height)
p = figure(width=450, height=350)
p.circle(X_train_r, y_train_r, color='orange')
p.line(X_train_r, lreg.predict(X_train_s5).flatten(), color='blue')
p.title.text = "Training Data"
p.title.align = "center"
p.title.text_color = "purple"
p.title.text_font_size = "20px"
p.xaxis.axis_label = "s5"
p.yaxis.axis_label = "Target"

show(p)

In [None]:
# plot test data with bokeh functions

# specify output
output_notebook()

# specify figure (Bokeh 3.x takes width/height; older releases named these plot_width/plot_height)
p = figure(width=450, height=350)
p.circle(X_test_r, y_test_r, color='red')
p.line(X_test_r, y_pred_r, color='blue')
p.title.text = "Test Data"
p.title.align = "center"
p.title.text_color = "purple"
p.title.text_font_size = "20px"
p.xaxis.axis_label = "s5"
p.yaxis.axis_label = "Target"

show(p)
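
For a single feature, the ordinary least squares line that `LinearRegression` fits has a simple closed form, which can serve as a cross-check on the fitted coefficient and intercept. The sketch below uses made-up points rather than the s5 data; `x`, `y`, `slope` and `intercept` are illustrative names.

```python
import numpy as np

# made-up points lying close to the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# least-squares slope: co-variation of x and y divided by variation of x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# the fitted line always passes through the point of means
intercept = y.mean() - slope * x.mean()

print("slope:    ", slope)
print("intercept:", intercept)
```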

## Linear Regression - Multiple Feature Variables

In the example above, the coefficient of determination was 0.29. The coefficient of determination measures the proportion of variance in the target that the model explains: the closer it is to 1, the better the predictions match the test data.

One way to increase the coefficient of determination is to train on more features. The next example uses three features: bmi, s1 and s5.

In [None]:
# take bmi, s1, s5 variables only
features = ['bmi', 's1', 's5']
feats = X[features]

# split data to train and test sets
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    feats, y, test_size=0.05, random_state=0)

# convert data to numpy, force to correct dimensions
X_train = X_train.to_numpy()
y_train = y_train.to_numpy().reshape(-1,1)
X_test  = X_test.to_numpy()
y_test  = y_test.to_numpy().reshape(-1,1)

In [None]:
# create instance of linear regression model
lm = linear_model.LinearRegression()

# fit model on train data
lm.fit(X_train, y_train);

In [None]:
# print intercept and coefficients - lm
print("Intercept:", lm.intercept_[0])
print("\nCoefficients:")
for f, c in zip(features, lm.coef_[0]):
    print(f, ":  ", c)
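
With several features, the intercept and coefficients are still the solution of a least squares problem. As a hedged illustration, the sketch below fits synthetic data (not the diabetes data) both with `LinearRegression` and with `np.linalg.lstsq`; the names `X_demo`, `y_demo`, `model`, `A` and `beta` and the chosen true coefficients are all made up.

```python
import numpy as np
from sklearn import linear_model

# synthetic data: three features and a known linear relationship plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# fit with scikit-learn
model = linear_model.LinearRegression().fit(X_demo, y_demo)

# fit by least squares directly: prepend a column of ones for the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("sklearn:", model.intercept_, model.coef_)
print("lstsq:  ", beta[0], beta[1:])
```

The two approaches should agree to numerical precision, which shows that each coefficient is the estimated change in the target per unit change in that feature with the other features held fixed.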

In [None]:
# predict test set
y_pred = lm.predict(X_test)

In [None]:
# print results
print("mse:                          ", mean_squared_error(y_test, y_pred))
print("coefficient of determination: ", sklearn.metrics.r2_score(y_test, y_pred))

With three features, the coefficient of determination increases to 0.53.

Explain the results:



<hr>

# Summary

In this tutorial, you have seen:
- pandas functions to learn about data
- visualisation of 2D data using pandas/matplotlib
- use of bokeh to create interactive graphs
- linear regression with one, then multiple feature variables

Recommended next steps:
- Edit this notebook for greater understanding of the code
- Try it yourself! Create a notebook and follow the same steps with a different dataset!