# OMIM annotations of constrained genes
This script combines regional constraint annotations and OMIM phenotype / inheritance mode annotations.

Note:
- Non-disease, susceptibility, and provisional phenotypes are dropped
- OMIM entries missing an Ensembl gene ID or inheritance information are dropped
- The inheritance patterns from OMIM are simplified to four categories: autosomal recessive, autosomal dominant, X-linked, and other
- Several genes have multiple entries in OMIM (multiple phenotypes and / or inheritance modes). These genes appear once per OMIM entry in the merged annotation.

## Import modules

In [1]:
import numpy as np
import pandas as pd

## OMIM phenotype data
Sanitised OMIM phenotype and inheritance data for each gene

In [2]:
# Read data to memory
omim = (
    pd.read_csv(
        "../outputs/genemap2_parsed.tsv",
        sep="\t",
        usecols=["ensg", "phenotype", "inheritance"],
    )
    .dropna(subset=["ensg", "inheritance"])  # Drop missing values
    .drop_duplicates()
)

In [3]:
# Exclude entries for susceptibility, provisional, and non disease phenotypes
m1 = omim.phenotype.str.startswith("[")  # Non-disease
m2 = omim.phenotype.str.startswith("{")  # Susceptibility
m3 = omim.phenotype.str.startswith("?")  # Provisional

masks = m1, m2, m3
strings = ["Non-disease", "Susceptibility", "Provisional"]

print("The following entries are dropped:")
for m, s in zip(masks, strings):
    print(f"{s} entries: {m.sum()}")

omim = omim[~(m1 | m2 | m3)].copy()  # Copy so later .loc assignments act on an independent frame

print(f"\nEntries remaining: {len(omim)}")

The following entries are dropped:
Non-disease entries: 84
Susceptibility entries: 518
Provisional entries: 623

Entries remaining: 5693
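The prefix-based filter above can be sketched on a toy table. The gene IDs and phenotype strings below are invented for illustration; only the three leading characters (`[`, `{`, `?`) are taken from the cell above.

```python
import pandas as pd

# Hypothetical entries, invented for illustration only
toy = pd.DataFrame(
    {
        "ensg": ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"],
        "phenotype": [
            "[Example trait]",
            "{Example susceptibility}",
            "?Example provisional disorder",
            "Example disorder",
        ],
    }
)

# One boolean mask per leading character, as in the cell above
prefixes = {"Non-disease": "[", "Susceptibility": "{", "Provisional": "?"}
masks = {label: toy.phenotype.str.startswith(p) for label, p in prefixes.items()}

# Combine the masks with a logical OR and keep the complement
drop = masks["Non-disease"] | masks["Susceptibility"] | masks["Provisional"]
kept = toy[~drop]

dropped_counts = {label: int(m.sum()) for label, m in masks.items()}
print(dropped_counts)
print(kept)
```

`str.startswith` matches the prefix literally, so the bracket and question mark need no escaping. If the phenotype column could contain missing values, passing `na=False` to `str.startswith` would keep the masks strictly boolean.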


In [4]:
# Sanitise the inheritance modes.
# Categorise as AD, AR, X-linked, or Other

print("Inheritance mode value counts:")
print(omim.inheritance.value_counts())

# Masks for filtering
m1 = omim.inheritance.str.contains("X-linked")
m2 = omim.inheritance.str.startswith("Autosomal recessive")
m3 = omim.inheritance.str.startswith("Autosomal dominant")

omim.loc[m1, "inheritance"] = "X-linked"
omim.loc[~(m1 | m2 | m3), "inheritance"] = "Other"

print("\nInheritance mode value counts after sanitising:")
print(omim.inheritance.value_counts())

Inheritance mode value counts:
Autosomal recessive          3080
Autosomal dominant           2215
X-linked recessive            197
X-linked                       66
X-linked dominant              65
Somatic mutation               25
Digenic dominant               16
Digenic recessive              11
Isolated cases                  5
Somatic mosaicism               5
Multifactorial                  3
Y-linked                        2
?Autosomal dominant             1
Pseudoautosomal dominant        1
Pseudoautosomal recessive       1
Name: inheritance, dtype: int64

Inheritance mode value counts after sanitising:
Autosomal recessive    3080
Autosomal dominant     2215
X-linked                328
Other                    70
Name: inheritance, dtype: int64
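As a cross-check on the categorisation, the same rules can be written as a plain function and applied to a handful of the inheritance labels listed in the value counts above. The function name `simplify` is hypothetical and not part of the original notebook; this is a sketch of the same logic, not a replacement for the masks.

```python
import pandas as pd

# A few inheritance labels copied from the value counts above
toy = pd.Series(
    [
        "Autosomal recessive",
        "X-linked dominant",
        "?Autosomal dominant",
        "Pseudoautosomal recessive",
        "Digenic dominant",
    ],
    name="inheritance",
)


def simplify(mode: str) -> str:
    """Hypothetical helper mirroring the mask logic in the cell above."""
    if "X-linked" in mode:
        return "X-linked"
    if mode.startswith(("Autosomal recessive", "Autosomal dominant")):
        return mode  # Autosomal labels are left unchanged
    return "Other"


simplified = toy.map(simplify)
print(simplified.tolist())
```

Because the autosomal rules use a prefix match, labels such as `?Autosomal dominant` and `Pseudoautosomal recessive` fall into `Other`, which agrees with the count of 70 in the output above.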


## Genic constraint data
A list of constrained transcripts and regions

In [5]:
# Read to memory
gc = pd.read_csv("../outputs/constrained_regions_labels.tsv", sep="\t")

## Gene IDs
Match ENST, ENSG, and HGNC symbol identifiers

In [6]:
# Read to memory
ids = pd.read_csv(
    "../outputs/gene_ids.tsv",
    sep="\t",
    header=0,
    names=["ensg", "enst", "symbol"],
)

# Drop ENSG and ENST version numbers
ids["ensg"] = ids.ensg.str.split(".").str[0]
ids["enst"] = ids.enst.str.split(".").str[0]

# Drop duplicate IDs
ids = ids.drop_duplicates()

# Show format of IDs
ids.head(1)

              ensg             enst symbol
0  ENSG00000186092  ENST00000641515  OR4F5
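The version-stripping step can be illustrated on a small table. The identifiers are those shown in the output above, but the version suffixes (`.1`, `.2`) are made up for this sketch.

```python
import pandas as pd

# Version suffixes here are hypothetical; only the base IDs come from the output above
toy = pd.DataFrame(
    {
        "ensg": ["ENSG00000186092.1", "ENSG00000186092.1"],
        "enst": ["ENST00000641515.1", "ENST00000641515.2"],
        "symbol": ["OR4F5", "OR4F5"],
    }
)

# Keep only the text before the first "." in each identifier
for col in ["ensg", "enst"]:
    toy[col] = toy[col].str.split(".").str[0]

# Rows that differed only by version number are now identical
toy = toy.drop_duplicates()
print(toy)
```

This also shows why `drop_duplicates()` comes after the stripping: rows that differ only in their version suffix collapse into one once the suffix is removed.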


## Merge datasets

In [7]:
# Merge datasets
df = gc.merge(ids)  # Constraint and gene IDs
df = df.merge(omim, how="left")  # OMIM annotations

print("Note that several genes have duplicate OMIM entries:")
for a, b in zip([omim, gc, df], ["OMIM", "Constraint", "Merged"]):
    print(f"{b} entries: {len(a)}")

Note that several genes have duplicate OMIM entries:
OMIM entries: 5693
Constraint entries: 46719
Merged entries: 50584
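The merged table has 3,865 more rows than the constraint table (50,584 versus 46,719). A left merge repeats a left-hand row once for every matching right-hand row, which a toy example makes concrete. The gene IDs, constraint labels, and phenotypes below are invented, and the explicit `on="ensg"` key is an assumption made for clarity; the cell above lets pandas infer the key from the shared column names.

```python
import pandas as pd

# Hypothetical constraint table: one row per gene
constraint = pd.DataFrame(
    {"ensg": ["G1", "G2"], "constraint": ["constrained", "unconstrained"]}
)

# Hypothetical OMIM table: gene G1 has two entries, gene G2 has none
omim_toy = pd.DataFrame(
    {
        "ensg": ["G1", "G1"],
        "phenotype": ["Disorder A", "Disorder B"],
        "inheritance": ["Autosomal dominant", "Autosomal recessive"],
    }
)

# Left merge: every constraint row is kept, and G1 is repeated per OMIM entry
merged = constraint.merge(omim_toy, on="ensg", how="left")

# Rows belonging to genes that appear more than once after the merge
n_repeated = int(merged.duplicated("ensg", keep=False).sum())
print(merged)
print(f"Rows from repeated genes: {n_repeated}")
```

Genes without an OMIM entry, such as `G2` here, are retained with missing values in the `phenotype` and `inheritance` columns.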


## Save to output

In [8]:
# Reorder columns
df = df[["symbol", "ensg", "enst", "region", "constraint", "phenotype", "inheritance"]]

# Write to output
df.to_csv(
    "../outputs/omim_phenotypes_in_constrained_regions.tsv", sep="\t", index=False
)