01 – Genotype Tutorial: Ancestry Prediction

A - Setup

In this tutorial, we will be using genotype data to train deep learning models for ancestry prediction.

Note

This tutorial goes into some detail about how EIR works, and how to use it. If you are more interested in quickly training the deep learning models for genomic prediction, the EIR-auto-GP project might be of use to you.

To start, please download processed sample data (or process your own .bed, .bim, .fam files with e.g. plink pipelines). The sample data we are using here for predicting ancestry is the public Human Origins dataset, but the same approach can just as well be used for e.g. disease predictions in other cohorts (for example the UK Biobank).

Examining the sample data, we can see the following structure:

processed_sample_data
├── arrays                      # Genotype data as NumPy arrays
├── data_final_gen.bim          # Variant information file accompanying the genotype arrays
└── human_origins_labels.csv    # Contains the target labels (what we want to predict from the genotype data)

Important

The label file ID column must be called "ID" (uppercase).

For this tutorial, we are going to use the data above to models to predict ancestry, of which there are 6 classes (Asia, Eastern Asia, Europe, Latin America and the Caribbean, Middle East and Sub-Saharan Africa). Before diving into the model training, we first have to configure our experiments.

To configure the experiments we want to run, we will use .yaml configurations. Running eirtrain --help, we can see the configurations needed:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/eirtrain_help.txt
    :language: console
    :lines: 2-

Above we can see that there are four types of configurations we can use: global, inputs, fusion and outputs. To see more details about what should be in these configuration files, we can check the :ref:`api-reference` reference.

Note

Instead of having to type out the configuration files below manually, you can download them from the docs/tutorials/tutorial_files/01_basic_tutorial directory in the project repository

While the global configuration has a lot of options, the only one we really need to fill in now is output_folder and evaluation interval (in batch iterations), so we have the following tutorial_01_globals.yaml file:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_globals.yaml
    :language: yaml
    :caption: tutorial_01_globals.yaml

We also need to tell the framework where to load inputs from, and some information about the input, for that we use an input .yaml configuration called tutorial_01_inputs.yaml:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_input.yaml
    :language: yaml
    :caption: tutorial_01_input.yaml

Above we can see that the input needs 3 fields: input_info, input_type_info and model_config. The input_info contains basic information about the input. The input_type_info contains information specific to the input type (in this case omics). Finally, the model_config contains configuration for the model that should be trained with the input data. For more information about the configurations, e.g. which parameters are relevant for the chosen models and what they do, head over to the :ref:`api-reference` reference.

Finally, we need to specify what outputs to predict during training. For that we will use the tutorial_01_outputs.yaml file with the following content:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_outputs.yaml
    :language: yaml
    :caption: tutorial_01_outputs.yaml

Note

You might notice that we have not written any fusion config so far. The fusion configuration controls how different modalities (i.e. input data types, for example genotype and clinical data) are combined using a neural network. While we indeed can configure the fusion, we will leave use the defaults for now. The default fusion model is a fully connected neural network.

With all this, we should have our project directory looking something like this:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/tutorial_folder.txt
    :language: console

B - Training

Training a GLN model

Now that we have our configurations set up, training is simply passing them to the framework, like so:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1.txt
    :language: console

This will generate a folder in the current directory called eir_tutorials, and eir_tutorials/tutorial_runs/a_using_eir/tutorial_01_run (note that the inner run name comes from the value in global_config we set before) will contain the results from our experiment.

Tip

You might try running the command above again after it partially/completely finishes, and most likely you will encounter a FileExistsError. This is to avoid accidentally overwriting previous experiments. When performing another run, we will have to delete/rename the experiment, or change it in the configuration (see below).

Examining the directory, we see the following structure (some files have been excluded here for brevity):

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/experiment_01_folder.txt
    :language: console

In the results folder for a given output, the [200, 400, 600] folders contain our validation results according to our sample_interval configuration in the global config.

We can examine how our model did with respect to accuracy (let's assume our targets are fairly balanced in this case) by checking the training_curve_ACC.png file:

Examining the actual predictions and how they matched the target labels, we can look at the confusion matrix in one of the evaluation folders of results/Origin/samples. When I ran this, I got the following at iteration 600:

In the training curve above, we can see that our model barely got going before the run finished! Let's try another experiment. We can change the output_folder value in 01_basic_tutorial/tutorial_01_globals.yaml, but the framework also supports rudimentary injection of values from the command line. Let's try that, setting a new run name, increasing the number of epochs and changing the learning rate:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_2.txt
    :language: console

Note

The injected values are according to the configuration filenames.

Looking at the training curve from that run, we can see we did a bit better:

We also notice that there is a gap between the training and evaluation performances, indicating that the model is starting to overfit on the training data. There are a bunch of regularisation settings we could try, such as increasing dropout in the input, fusion and output modules. Check the :ref:`api-reference` reference for a full overview.

C - Predicting on external samples

Predicting on samples with known labels

To predict on external samples, we run eirpredict. As we can see when running eirpredict --help, it looks quite similar to eirtrain:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/eirpredict_help.txt
    :language: console
    :lines: 2-

Generally we do not change much of the configs when predicting, with the exception of the input configs (and then mainly setting the input_source, i.e. where to load our samples to predict/test on from) and perhaps the global config (e.g. we might not compute attributions during training, but compute them on our test set by activating compute_attributions in the global config when predicting). Specific to eirpredict, we have to choose a saved model (--model_path), whether we want to evaluate the performance on the test set (--evaluate this means that the respective labels must be present in the --output_configs) and where to save the prediction results (--output_folder).

For the sake of this tutorial, we use one of the saved models from our previous training run and use it for inference using eirpredict module. Here, we will simply use it to predict on the same data as before.

Warning

We are only predicting on the same data we trained on in this tutorial to show how to use the eirpredict module. Always take care in separating what data you use for training and to evaluate generalization performance of your models!

Run the commands below, making sure you add the correct path of a saved model to the --model_path argument.

To test, we can run the following command (note that you will have to add the path to your saved model for the --model_path parameter below).

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1_PREDICT.txt
    :language: console

This will generate a file called calculated_metrics.json in the supplied output_folder as well as a folder for each output (in this case called ancestry_output containing the actual predictions and plots. Of course the metrics are quite nonsensical here, as we are predicting on the same data we trained on.

One of the files generated are the actual predictions, found in the predictions.csv file:

The True Label Untransformed column contains the actual labels, as they were in the raw data. The True Label column contains the labels after they have been numerically encoded / normalized in EIR. The other columns represent the raw network outputs for each of the classes.

Predicting on samples with unknown labels

Notice that when running the command above, we knew the labels of the samples we were predicting on. In practice, we are often predicting on samples we have no clue about the labels of. In this case, we can again use the eirpredict with slightly modified arguments:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1_PREDICT_UNKNOWN.txt
    :language: console
    :emphasize-lines: 4,6

We can notice a couple of changes here compared to the previous command:

We have removed the --evaluate flag, as we do not have the labels for the samples we are predicting on.
We have a different output configuation file, tutorial_01_outputs_unknown.yaml.
We have a different output folder, tutorial_01_unknown.

If we take a look at the tutorial_01_outputs_unknown.yaml file, we can see that it contains the following:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/tutorial_01_outputs_unknown.yaml
    :language: yaml
    :caption: tutorial_01_outputs_unknown.yaml
    :emphasize-lines: 3

Notice that everything is the same as before, but for output_source we have null instead of the .csv label file we had before.

Taking a look at the produced predictions.csv file, we can see that we only have the actual predictions, and no true labels:

D - Applying to your own data (e.g. UK Biobank)

Thank you for reading this far! Hopefully this tutorial introduced you well enough to the framework so you can apply it to your own data. For that, you will have to process it first (see: plink pipelines). Then you will have to set the relevant paths for the inputs (e.g. input_source, snp_file) and outputs (e.g. output_source, target_cat_columns or target_con_columns if you have continuous targets).

Important

If you are interested in quickly training deep learning models for genomic prediction, the EIR-auto-GP project might be of use to you.

When moving to large scale data such as the UK Biobank, the configurations we used on the ancestry toy data in this tutorial will likely not be sufficient. For example, the learning rate is likely too high. For this, here are some baseline configurations that we have found to work well as a starting point:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/large_scale_globals.yaml
    :language: yaml
    :caption: globals.yaml

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/large_scale_input_gln.yaml
    :language: yaml
    :caption: input_genotype.yaml

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/large_scale_input_tabular.yaml
    :language: yaml
    :caption: input_tabular.yaml

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/large_scale_fusion.yaml
    :language: yaml
    :caption: fusion.yaml

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/large_scale_output.yaml
    :language: yaml
    :caption: output.yaml

E - Serving

In this final section, we demonstrate serving our trained model as a web service and interacting with it using HTTP requests.

Starting the Web Service

To serve the model, use the following command:

eirserve --model-path [MODEL_PATH]

Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.

Here is an example of the command:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/commands/GLN_1_DEPLOY.txt
    :language: console

Sending Requests

With the server running, we can now send requests. The requests are prepared by loading numpy array data, converting it to base64 encoded strings, and then constructing a JSON payload.

Here's an example Python function demonstrating this process:

import numpy as np
import base64
import requests

def encode_numpy_array(file_path: str) -> str:
    array = np.load(file_path)
    encoded = base64.b64encode(array.tobytes()).decode('utf-8')
    return encoded

def send_request(url: str, payload: dict):
    response = requests.post(url, json=payload)
    return response.json()

encoded_data = encode_numpy_array('path_to_your_numpy_array.npy')
response = send_request('http://localhost:8000/predict', {'genotype': encoded_data})
print(response)

Analyzing Responses

Here are some examples of responses from the server for a set of inputs:

.. literalinclude:: ../tutorial_files/a_using_eir/01_basic_tutorial/serve_results/predictions.json
    :language: json
    :caption: predictions.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

01_basic_tutorial.rst

01_basic_tutorial.rst

01 – Genotype Tutorial: Ancestry Prediction

A - Setup

B - Training

Training a GLN model

C - Predicting on external samples

Predicting on samples with known labels

Predicting on samples with unknown labels

D - Applying to your own data (e.g. UK Biobank)

E - Serving

Starting the Web Service

Sending Requests

Analyzing Responses

Files

01_basic_tutorial.rst

Latest commit

History

01_basic_tutorial.rst

File metadata and controls

01 – Genotype Tutorial: Ancestry Prediction

A - Setup

B - Training

Training a GLN model

C - Predicting on external samples

Predicting on samples with known labels

Predicting on samples with unknown labels

D - Applying to your own data (e.g. UK Biobank)

E - Serving

Starting the Web Service

Sending Requests

Analyzing Responses