Patient phenotype and genotype similarity

This package is research code designed to measure the similarity between patients, using phenotype (specified as Human Phenotype Ontology (HPO) terms) and/or genotype (specified as the Exomiser-processed results of whole-exome VCF files).

Dependencies

Python 3.3+

Phenotypic similarity

Input file formats

JSON (default, `--patient-file-format phenotips`)

By default, phenotype data is expected in PhenoTips JSON export format, e.g.: phenotips_2017-02-01_00-01.json.

Here is an example JSON file.

CSV (`--patient-file-format csv`)

A simple CSV format is supported, with one patient per line and the following columns:

The patient's identifier (required)
The patient's first present HPO term (required)
The patient's second present HPO term (optional)
The patient's third present HPO term (optional), ...

For example:

Patient1,HP:0000001,HP:0000002,HP:0000003,HP:0000004
Patient2,HP:0000001,HP:0000002

Here is an example CSV file.
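To make the column layout concrete, here is a minimal sketch of a reader for this CSV format. The function name read_patients_csv is hypothetical and not part of this package; the sketch only assumes what is described above (identifier first, then one or more present HPO terms).

```python
import csv
import io


def read_patients_csv(handle):
    """Map each patient identifier to the list of its present HPO terms."""
    patients = {}
    for row in csv.reader(handle):
        # Drop empty cells (e.g. trailing commas) and skip blank lines
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue
        patient_id, terms = cells[0], cells[1:]
        if not terms:
            # The first HPO term is required by the format
            raise ValueError("patient %s has no HPO terms" % patient_id)
        patients[patient_id] = terms
    return patients


example = io.StringIO(
    "Patient1,HP:0000001,HP:0000002,HP:0000003,HP:0000004\n"
    "Patient2,HP:0000001,HP:0000002\n"
)
patients = read_patients_csv(example)
```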

Pair-wise phenotypic similarity

Pair-wise phenotypic similarity can be computed with the patient_similarity script, which supports a number of different similarity metrics. For example, to compute just the simGIC score:

python -m patient_similarity --log=INFO -s simgic test/test.json \
  data/hp.obo data/phenotype_annotation.tab

This prints the pairwise similarity scores to stdout, e.g.:

A	B	simgic
P0000001	P0000002	0.146613
P0000001	P0000003	0.191716
P0000001	P0000004	0.170512
P0000002	P0000003	0.124032
P0000002	P0000004	0.167785
P0000003	P0000004	0.291074
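For downstream analysis, the output above can be loaded back into Python. The following is a minimal sketch that assumes the output is tab-separated with the header row shown (A, B, then one column per score); read_scores is a hypothetical helper, not part of this package.

```python
import csv
import io


def read_scores(handle):
    """Return {(patient_a, patient_b): {score_name: value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    scores = {}
    for row in reader:
        # Columns A and B identify the pair; the rest are score columns
        pair = (row.pop("A"), row.pop("B"))
        scores[pair] = {name: float(value) for name, value in row.items()}
    return scores


output = io.StringIO(
    "A\tB\tsimgic\n"
    "P0000001\tP0000002\t0.146613\n"
    "P0000003\tP0000004\t0.291074\n"
)
scores = read_scores(output)
# Most similar pair according to simGIC
best_pair = max(scores, key=lambda pair: scores[pair]["simgic"])
```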

Specify -s multiple times to compute several scores; if -s is omitted, all scores are computed. Supported phenotypic similarity scores include:

jaccard
resnik
lin
jc
owlsim
ui
simgic
icca
TODO: add ebosimgic

See the PhenomeCentral paper for a comparison of many of these scores.
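As an illustration of the simplest kind of set-based score, here is a sketch of a Jaccard index computed over ancestor-closed term sets. It is not taken from this package's code, and the package's jaccard and ui scores may be defined differently. The toy hierarchy PARENTS and all function names are assumptions made for the example.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def closure(terms, parents):
    """Expand a set of terms with every ancestor of every term."""
    result = set()
    for term in terms:
        result |= ancestors(term, parents)
    return result


def jaccard(terms_a, terms_b, parents):
    """Size of the shared closure divided by the size of the joint closure."""
    a = closure(terms_a, parents)
    b = closure(terms_b, parents)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy is-a hierarchy (child -> parents); A is the root
PARENTS = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B"]}
score = jaccard(["D"], ["E"], PARENTS)  # shared {A, B} out of {A, B, D, E}
```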

Many of these similarity scores use the information content (IC) of the terms in the HPO. The information content of a term t is defined as IC(t) = -log_2(p(t)), where p(t) is the probability of the term. This probability can be estimated in many ways, such as the fraction of OMIM diseases that are annotated with the term (doi:10.1016/j.ajhg.2008.09.017).
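The definition above can be sketched in a few lines. The example estimates p(t) as the fraction of diseases annotated with t and then uses the resulting IC values in a simGIC-style score (summed IC of shared terms divided by summed IC of all terms). This is a simplified illustration, not this package's implementation: the toy corpus, the function names, and the assumption that all term sets are already ancestor-closed are made up for the example.

```python
import math


def information_content(disease_terms):
    """IC(t) = -log2(p(t)), with p(t) the fraction of diseases having t."""
    n = len(disease_terms)
    counts = {}
    for terms in disease_terms.values():
        for term in set(terms):
            counts[term] = counts.get(term, 0) + 1
    # log2(n / count) equals -log2(count / n)
    return {term: math.log2(n / count) for term, count in counts.items()}


def simgic(terms_a, terms_b, ic):
    """Summed IC of shared terms over summed IC of all terms."""
    a, b = set(terms_a), set(terms_b)
    union_ic = sum(ic[t] for t in a | b)
    shared_ic = sum(ic[t] for t in a & b)
    return shared_ic / union_ic if union_ic else 0.0


# Toy corpus: each disease lists its terms, ancestors included (A is the root)
DISEASES = {
    "D1": ["A", "B", "D"],
    "D2": ["A", "B", "E"],
    "D3": ["A", "C"],
    "D4": ["A", "B"],
}
ic = information_content(DISEASES)
score = simgic(["A", "B", "D"], ["A", "B", "E"], ic)
```

In this toy corpus the root term appears in every disease and so carries no information, while a term seen in only one of four diseases has an IC of 2 bits.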

A number of options have been added to support different variants of the IC computation:

--use-disease-prevalence: instead of weighting each disease uniformly, weight them by their estimated prevalence from Orphanet
--use-phenotype-frequency: instead of weighting each phenotype-disease association uniformly, weight them by the frequency of the association where available
--use-patient-phenotypes: count each patient as an additional entry in the corpus, alongside diseases, in the frequency estimation
--distribute-ic-to-leaves: evenly divide the observed frequency of each term amongst its children, so that all non-leaf nodes have zero frequency
--use-aoo: include an age-of-onset similarity penalty in the similarity scoring
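To show how such weighting could change the estimate of p(t), here is a sketch of one plausible reading of --use-disease-prevalence and --use-phenotype-frequency. It is an assumption for illustration only and is not derived from this package's code: term_probabilities, its arguments, and the toy numbers are all hypothetical.

```python
def term_probabilities(annotations, prevalence=None):
    """Estimate p(t) from weighted disease annotations.

    annotations: {disease: {term: frequency of the term in that disease}}
    prevalence:  optional {disease: weight}; diseases are uniform if None
    """
    weights = {d: (prevalence[d] if prevalence else 1.0) for d in annotations}
    total = sum(weights.values())
    mass = {}
    for disease, term_freqs in annotations.items():
        for term, freq in term_freqs.items():
            # Each association contributes disease weight x term frequency
            mass[term] = mass.get(term, 0.0) + weights[disease] * freq
    return {term: m / total for term, m in mass.items()}


# Toy data: T2 is seen in half of the patients with disease D1
ANNOTATIONS = {"D1": {"T1": 1.0, "T2": 0.5}, "D2": {"T1": 1.0}}
uniform = term_probabilities(ANNOTATIONS)
weighted = term_probabilities(ANNOTATIONS, prevalence={"D1": 3.0, "D2": 1.0})
```

With uniform disease weights, T2 gets probability 0.25; making D1 three times as prevalent as D2 raises it to 0.375, which lowers the term's information content.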

Updating the data files

This package includes data files from HPO and Orphanet sources, which should be updated occasionally.

data/hp.obo - See http://human-phenotype-ontology.github.io/downloads.html
data/phenotype_annotation.tab - See http://human-phenotype-ontology.github.io/downloads.html
data/en_product1.xml - See http://www.orphadata.org/cgi-bin/inc/product1.inc.php
data/en_product2.xml - See http://www.orphadata.org/cgi-bin/inc/product2.inc.php
