A signature-based machine learning model for bipolar disorder and borderline personality disorder
This repository contains the code from the paper A signature-based machine learning model for bipolar disorder and borderline personality disorder:
Perez Arribas, I., Goodwin, G.M., Geddes, J.R., Lyons, T. and Saunders, K.E., 2018. A signature-based machine learning model for distinguishing bipolar disorder and borderline personality disorder. Translational Psychiatry, 8(1), p.274. DOI: 10.1038/s41398-018-0334-0.
- Reproducible Research Champions
- Setting up signatures-psychiatry
- Generating figures and tables from the paper
Reproducible Research Champions
In May 2018, Terry Lyons was selected as one of the Alan Turing Institute's Reproducible Research Champions - academics who encourage and promote reproducible research through their own work, and who want to take their latest project to the "next level" of reproducibility.
The Reproducible Research programme at the Turing is led by Kirstie Whitaker and Martin O'Reilly, with the Champions project also involving Louise Bowler from the Research Engineering Group.
Each of the Champions' projects will receive several weeks of support from the Research Engineering Group throughout 2018-2019. Over this time, we will work on the project with Terry and Imanol and will track our efforts in this repository. So far, we've added installation instructions, set the project up to work with Binder, made it possible to use synthetic data in the same workflow and added some tests. Given our focus on reproducibility, we obviously don't intend to change any of the code's core functionality - but we hope that our work, both past and future, will make it easier for you to install, use and test out your own ideas with the methods used in the
You can find out more about the Turing's Reproducible Research Champions project here.
The dataset used in the study is confidential, so we are not able to publicly release it. Access to the dataset is restricted to staff and students of the University of Oxford who have received the appropriate permission.
However, we feel that it is important that some similar data are provided for the purposes of demonstrating how the methods outlined in the paper work. To this end, we include three sets of synthetic data, and one example of a fake entry in the dataset.
Mood score data The original dataset contains the mood scores of participants who were either healthy or were diagnosed with borderline personality disorder or bipolar disorder. Participants recorded their mood score on a seven-point scale across six different categories (anxiety, elation, sadness. anger, irritability and energy) at approximately daily intervals. Further details of this dataset, which was collected as part of the Automated Monitoring of Symptoms Severity (AMoSS) study, can be found in the paper. Access to this dataset is limited and so it is not included in this repository.
Synthetic signature data Synthetic data is data that has been generated to exhibit the same statistical properties as the original data, without containing the original entries. The synthetic data in this repository was derived from the signatures of the original mood score data, and is therefore in signature form itself. Each dataset contains mood score signatures and their associated diagnostic classification. The synthetic signatures were derived from all mood score signatures from each of the three diagnostic classifications, so the concept of "participant" does not apply when using this dataset. Three synthetic datasets were generated and have been included in the
Fake mood score data This repository contains an example of a fake mood score dataset from one participant in
data/fake_patient.csv. This data is not statistically related to the original mood score data but is presented in the same format in order to illustrate the data normalisation process and how Figures 1 and 2 in the paper were generated.
Setting up signatures-psychiatry
The instructions below assume you are comfortable cloning a git repository and running Python scripts via the command line.
If not, you may find the tutorials available from GitHub and Software Carpentry helpful. You can also open an interactive version of this project on
mybinder.org by clicking the badge below.
Once the Binder project loads, open a Python 2 console and skip down this page to "Generating figures and tables from the paper".
Begin by obtaining a copy of this repository using
git clone https://github.com/alan-turing-institute/signatures-psychiatry.git
and move into the directory
This project uses
Python 2.7 and the packages listed in
Let's start by setting up a virtual environment and installing the dependencies inside it.
We give examples here for using
If you use CPython, use
virtualenv to set up a virtual environment named
Activate the environment and install the packages required by this project using
virtualenv env # Use if your default Python version is 2.7 virtualenv --python=<path-to-your-python-2.7> env # If not, specify the path manually source env/bin/activate pip install -r requirements.txt
If you use Anaconda, create the virtual environment with
conda and install the
pip package directly into the environment.
Once the environment is activated, the dependencies can be installed using
pip (we use
pip because the
esig package is not currently available through
conda create -n sig-psy python=2.7 pip conda activate sig-psy pip install -r requirements.txt
With the virtual environment set up and all the dependencies installed, we can use the scripts in this project by following the instructions below.
Generating figures and tables from the paper
If you are running this project via Binder (or any other Jupyter Lab installation), open a Python 2 console. The commands need a minor change in this environment - swap
%run, and use the
shift+enter keys to run the cell.
Table 1: Accuracy and area under the ROC curve
If you have access to the full dataset, run
To run the same analysis on the synthetic signatures, use
python pairwise_group_classification.py --synth
By default, the synthetic cohort
772192 will be used, or the cohort ID can be specified with
The random seed for this script is set by default to
83042, or it can be changed using
The pairwise values will be displayed in the terminal and also saved to a log file,
Table 2: Demographic characteristics of study participants
The content of this table was gathered manually.
Figure 1: Normalised anxiety scores of a sample participant
This repository contains a set of "fake" data from a single patient. To convert this dataset into the normalised format shown in Figure 1, run
Figure 2: Pairwise normalised mood scores of a sample participant
The same command used above also produces the pairwise mood score plots for the fake patient.
Figure 3 (top row):
The heat maps in Figure 3 can be produced from either the original data or the synthetic signatures. To use the original data, run
and to use the synthetic signatures, use
python heat_map.py --synth
Alternatives to the default cohort of
772192 and random seed of
1 can be set with, for example, the options
The script will save three figures in
.png format. The names of the figures and the settings can be found in the
heat_map.log file in the
Figure 3 (bottom row): Accuracy and MAE of predictions of future mood score
The lower row of plots in Figure 3 require mood score data, so we cannot use the synthetic signature data here.
Table 3: Summary of accuracy and MAE of predictions of each aspect of next-day mood score
To generate the accuracy and MAE scores, run
As the next-day mood score is required, this script cannot be applied to the synthetic data which is already in the signature format.