### An overview of OMIM data

[OMIM®](https://www.omim.org/) is an Online Catalog of Human Genes and Genetic Disorders.

This notebook aims to give an overview of the [data files](https://www.omim.org/downloads/). We will use pandas to read and manipulate the data files.

In [1]:
import pandas as pd

We will create pandas dataframes by hand.

In [2]:
d1 = pd.DataFrame({'Date': [1,2,3,4], 'Value': [12,13,14,15]})

In [3]:
d1

Unnamed: 0,Date,Value
0,1,12
1,2,13
2,3,14
3,4,15


In [4]:
d2 = pd.DataFrame({'Date': [1,2,3,4,5], 'Month': ['J','F','M','A','M']})

In [5]:
d2

Unnamed: 0,Date,Month
0,1,J
1,2,F
2,3,M
3,4,A
4,5,M


Sometimes data are missing. We can also create a dataframe with a missing value by hand, using np.nan.

In [6]:
import numpy as np

In [7]:
d3 = pd.DataFrame({'Month': ['M','J','F','M','A'], 'Value': [12,13,14,15,np.nan]})

In [8]:
d3

Unnamed: 0,Month,Value
0,M,12.0
1,J,13.0
2,F,14.0
3,M,15.0
4,A,
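
To check where values are missing, pandas provides isna() and notna(). The following is a minimal sketch on a stand-in frame built like d3; the names toy, missing_per_column and complete are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in for d3 above; `toy` is a name chosen for this illustration.
toy = pd.DataFrame({'Month': ['M', 'J', 'F', 'M', 'A'],
                    'Value': [12, 13, 14, 15, np.nan]})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows where 'Value' is present.
complete = toy[toy['Value'].notna()]
print(complete)
```

Because NaN is a floating-point value, the Value column is stored as floats, which is why 12 is displayed as 12.0 above.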


Next, we will create a dummy csv file and read it using [pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

We will run the shell command echo with the -e option, which makes echo interpret \n as a newline, to generate a csv file.

In [9]:
! echo -e "1,2,3\na,b,c" > test.csv

In [10]:
pd.read_csv('test.csv')

Unnamed: 0,1,2,3
0,a,b,c


We can also skip the first line. Remember that Python uses zero-based indexing, so the second line has index 1. With header=1, pandas takes the column names from the second line and discards the line before it.

In [11]:
pd.read_csv('test.csv', header=1)

Unnamed: 0,a,b,c
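
The effect of the header argument can be compared side by side without writing a file. This sketch assumes the same two-line content as test.csv and reads it from memory with io.StringIO; the variable names are illustrative.

```python
import io
import pandas as pd

# Same content as test.csv above, held in memory instead of on disk.
text = "1,2,3\na,b,c\n"

# header=0: the first line supplies the column names
# (this is what pandas infers when no column names are passed).
first = pd.read_csv(io.StringIO(text), header=0)

# header=1: the second line supplies the column names;
# the line before it is discarded, so no data rows remain.
second = pd.read_csv(io.StringIO(text), header=1)

print(list(first.columns), len(first))
print(list(second.columns), len(second))
```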


Let's remove the dummy file and read the OMIM data files.

In [12]:
! rm test.csv

We will skip the first three lines, since the header starts at the fourth line.

In [101]:
GeneMap = pd.read_csv('../ref/omim/genemap2.txt', sep='\t', header=3)

In [102]:
GeneMap

Unnamed: 0,# Chromosome,Genomic Position Start,Genomic Position End,Cyto Location,Computed Cyto Location,MIM Number,Gene Symbols,Gene Name,Approved Gene Symbol,Entrez Gene ID,Ensembl Gene ID,Comments,Phenotypes,Mouse Gene Symbol/ID
0,chr1,1.0,27600000.0,1p36,,607413.0,AD7CNTP,Alzheimer disease neuronal thread protein,,,,,,
1,chr1,1.0,27600000.0,1p36,,612367.0,ALPQTL2,"Alkaline phosphatase, plasma level of, QTL 2",,100196914.0,,linkage with rs1780324,"{Alkaline phosphatase, plasma level of, QTL 2}...",
2,chr1,1.0,123400001.0,1p,,606788.0,ANON1,"Anorexia nervosa, susceptibility to, 1",,171514.0,,,"{Anorexia nervosa, susceptibility to, 1}, 6067...",
3,chr1,1.0,27600000.0,1p36,,605462.0,BCC1,"Basal cell carcinoma, susceptibility to, 1",,100307118.0,,associated with rs7538876,"{Basal cell carcinoma, susceptibility to, 1}, ...",
4,chr1,1.0,27600000.0,1p36,,606928.0,BMND3,Bone mineral density QTL 3,,246259.0,,?another locus at 3p21,"[Bone mineral density QTL 3], 606928 (2)",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18302,#,,,,,,,,,,,,,
18303,# You can find sample code in Python to parse ...,,,,,,,,,,,,,
18304,#,,,,,,,,,,,,,
18305,# https://github.com/OMIM-org/genemap2-parser,,,,,,,,,,,,,


In [103]:
MimTittles = pd.read_csv('../ref/omim/mimTitles.txt', sep='\t', header=2)

In [104]:
MimTittles

Unnamed: 0,# Prefix,MIM Number,Preferred Title; symbol,Alternative Title(s); symbol(s),Included Title(s); symbols
0,,100050.0,"AARSKOG SYNDROME, AUTOSOMAL DOMINANT",,
1,Percent,100070.0,"AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1","ANEURYSM, ABDOMINAL AORTIC; AAA;; ABDOMINAL AO...",
2,Number Sign,100100.0,PRUNE BELLY SYNDROME; PBS,"ABDOMINAL MUSCLES, ABSENCE OF, WITH URINARY TR...",
3,,100200.0,ABDUCENS PALSY,,
4,Number Sign,100300.0,ADAMS-OLIVER SYNDROME 1; AOS1,"AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKUL...","APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFE..."
...,...,...,...,...,...
28348,"# Number Sign (#) Phenotype, molecular basis ...",,,,
28349,"# Percent (%) Phenotype or locus, molecular b...",,,,
28350,"# NULL (<null>) Other, mainly phenotypes with...",,,,
28351,# Caret (^) Entry has been removed from the d...,,,,


In [105]:
Mim2gen = pd.read_csv('../ref/omim/mim2gene.txt', sep="\t", header=4)

In [106]:
Mim2gen

Unnamed: 0,# MIM Number,MIM Entry Type (see FAQ 1.3 at https://omim.org/help/faq),Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
0,100050,predominantly phenotypes,,,
1,100070,phenotype,100329167.0,,
2,100100,phenotype,,,
3,100200,predominantly phenotypes,,,
4,100300,phenotype,,,
...,...,...,...,...,...
28335,620423,phenotype,,,
28336,620424,gene,646799.0,ZAR1L,ENSG00000189167
28337,620425,phenotype,,,
28338,620426,gene,389119.0,INKA1,ENSG00000185614


In [107]:
Morbidmap = pd.read_csv('../ref/omim/morbidmap.txt', sep="\t", header=3)

In [108]:
Morbidmap

Unnamed: 0,# Phenotype,Gene Symbols,MIM Number,Cyto Location
0,"17,20-lyase deficiency, isolated, 202110 (3)","CYP17A1, CYP17, P450C17",609300.0,10q24.32
1,"17-alpha-hydroxylase/17,20-lyase deficiency, 2...","CYP17A1, CYP17, P450C17",609300.0,10q24.32
2,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787.0,5p13.2
3,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301.0,10q26.13
4,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577.0,6p21.1
...,...,...,...,...
8640,# 3 - The molecular basis for the disorder is ...,,,
8641,# found in the gene.,,,
8642,# 4 - A contiguous gene deletion or duplicatio...,,,
8643,# are deleted or duplicated causing the phenot...,,,


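Let's merge GeneMap and MimTittles. By default, pd.merge performs an inner join on the columns the two dataframes have in common, which here is MIM Number.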

In [112]:
df1 = pd.merge(GeneMap, MimTittles)

In [113]:
df1

Unnamed: 0,# Chromosome,Genomic Position Start,Genomic Position End,Cyto Location,Computed Cyto Location,MIM Number,Gene Symbols,Gene Name,Approved Gene Symbol,Entrez Gene ID,Ensembl Gene ID,Comments,Phenotypes,Mouse Gene Symbol/ID,# Prefix,Preferred Title; symbol,Alternative Title(s); symbol(s),Included Title(s); symbols
0,chr1,1.0,27600000.0,1p36,,607413.0,AD7CNTP,Alzheimer disease neuronal thread protein,,,,,,,,ALZHEIMER DISEASE NEURONAL THREAD PROTEIN,AD7CNTP,
1,chr1,1.0,27600000.0,1p36,,612367.0,ALPQTL2,"Alkaline phosphatase, plasma level of, QTL 2",,100196914.0,,linkage with rs1780324,"{Alkaline phosphatase, plasma level of, QTL 2}...",,Percent,"ALKALINE PHOSPHATASE, PLASMA LEVEL OF, QUANTIT...",ALPQTL2,
2,chr1,1.0,123400001.0,1p,,606788.0,ANON1,"Anorexia nervosa, susceptibility to, 1",,171514.0,,,"{Anorexia nervosa, susceptibility to, 1}, 6067...",,Percent,"ANOREXIA NERVOSA, SUSCEPTIBILITY TO; ANON",AN,"ANOREXIA NERVOSA, SUSCEPTIBILITY TO, 1, INCLUD..."
3,chr1,1.0,27600000.0,1p36,,605462.0,BCC1,"Basal cell carcinoma, susceptibility to, 1",,100307118.0,,associated with rs7538876,"{Basal cell carcinoma, susceptibility to, 1}, ...",,Percent,"BASAL CELL CARCINOMA, SUSCEPTIBILITY TO, 1; BCC1",,"BASAL CELL CARCINOMA, NONSYNDROMIC, INCLUDED;;..."
4,chr1,1.0,27600000.0,1p36,,606928.0,BMND3,Bone mineral density QTL 3,,246259.0,,?another locus at 3p21,"[Bone mineral density QTL 3], 606928 (2)",,Percent,BONE MINERAL DENSITY QUANTITATIVE TRAIT LOCUS ...,"BONE MINERAL DENSITY, LOW, SUSCEPTIBILITY TO",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19214,#,,,,,,,,,,,,,,"# Number Sign (#) Phenotype, molecular basis ...",,,
19215,#,,,,,,,,,,,,,,"# Percent (%) Phenotype or locus, molecular b...",,,
19216,#,,,,,,,,,,,,,,"# NULL (<null>) Other, mainly phenotypes with...",,,
19217,#,,,,,,,,,,,,,,# Caret (^) Entry has been removed from the d...,,,


We can specify which column(s) to merge on and how to [merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html).

In [216]:
dftest = pd.merge(GeneMap, Morbidmap, on = ['MIM Number', 'Gene Symbols', 'Cyto Location'], how = 'inner')

In [217]:
dftest

Unnamed: 0,# Chromosome,Genomic Position Start,Genomic Position End,Cyto Location,Computed Cyto Location,MIM Number,Gene Symbols,Gene Name,Approved Gene Symbol,Entrez Gene ID,Ensembl Gene ID,Comments,Phenotypes,Mouse Gene Symbol/ID,# Phenotype
0,chr1,1.0,27600000.0,1p36,,612367.0,ALPQTL2,"Alkaline phosphatase, plasma level of, QTL 2",,100196914.0,,linkage with rs1780324,"{Alkaline phosphatase, plasma level of, QTL 2}...",,"{Alkaline phosphatase, plasma level of, QTL 2}..."
1,chr1,1.0,123400001.0,1p,,606788.0,ANON1,"Anorexia nervosa, susceptibility to, 1",,171514.0,,,"{Anorexia nervosa, susceptibility to, 1}, 6067...",,"{Anorexia nervosa, susceptibility to, 1} (2)"
2,chr1,1.0,27600000.0,1p36,,605462.0,BCC1,"Basal cell carcinoma, susceptibility to, 1",,100307118.0,,associated with rs7538876,"{Basal cell carcinoma, susceptibility to, 1}, ...",,"{Basal cell carcinoma, susceptibility to, 1} (2)"
3,chr1,1.0,27600000.0,1p36,,606928.0,BMND3,Bone mineral density QTL 3,,246259.0,,?another locus at 3p21,"[Bone mineral density QTL 3], 606928 (2)",,[Bone mineral density QTL 3] (2)
4,chr1,1.0,2300000.0,1p36.33,,618815.0,"C1DUPp36.33, DUP1p36.33","Chromosome 1p36.33 duplication syndrome, ATAD3...",,,,,"Chromosome 1p36.33 duplication syndrome, ATAD3...",,"Chromosome 1p36.33 duplication syndrome, ATAD3..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5764,#,,,,,,,,,,,,,,# 3 - The molecular basis for the disorder is ...
5765,#,,,,,,,,,,,,,,# found in the gene.
5766,#,,,,,,,,,,,,,,# 4 - A contiguous gene deletion or duplicatio...
5767,#,,,,,,,,,,,,,,# are deleted or duplicated causing the phenot...
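
The on and how arguments are easier to see on small frames. This sketch rebuilds two toy frames with the same content as d1 and d2 from the beginning of the notebook (named left and right here for illustration) and compares an inner join with an outer join.

```python
import pandas as pd

# Toy frames with the same content as d1 and d2 above.
left = pd.DataFrame({'Date': [1, 2, 3, 4], 'Value': [12, 13, 14, 15]})
right = pd.DataFrame({'Date': [1, 2, 3, 4, 5], 'Month': ['J', 'F', 'M', 'A', 'M']})

# With no arguments, merge joins on the shared column ('Date') with how='inner'.
default = pd.merge(left, right)

# Inner join: keep only the keys present in both frames.
inner = pd.merge(left, right, on='Date', how='inner')

# Outer join: keep every key; values missing on one side become NaN.
outer = pd.merge(left, right, on='Date', how='outer')

print(len(inner), len(outer))
print(outer)
```

The outer join keeps Date 5, which exists only in the right frame, and leaves its Value empty.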


There is a hash "#" in "# MIM Number".

In [91]:
Mim2gen

Unnamed: 0,# MIM Number,MIM Entry Type (see FAQ 1.3 at https://omim.org/help/faq),Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
0,100050,predominantly phenotypes,,,
1,100070,phenotype,100329167.0,,
2,100100,phenotype,,,
3,100200,predominantly phenotypes,,,
4,100300,phenotype,,,
...,...,...,...,...,...
28335,620423,phenotype,,,
28336,620424,gene,646799.0,ZAR1L,ENSG00000189167
28337,620425,phenotype,,,
28338,620426,gene,389119.0,INKA1,ENSG00000185614


To merge this dataframe with others, we will rename the column.

In [114]:
Mim2gen = Mim2gen.rename(columns={'# MIM Number':'MIM Number'})

In [115]:
Mim2gen

Unnamed: 0,MIM Number,MIM Entry Type (see FAQ 1.3 at https://omim.org/help/faq),Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
0,100050,predominantly phenotypes,,,
1,100070,phenotype,100329167.0,,
2,100100,phenotype,,,
3,100200,predominantly phenotypes,,,
4,100300,phenotype,,,
...,...,...,...,...,...
28335,620423,phenotype,,,
28336,620424,gene,646799.0,ZAR1L,ENSG00000189167
28337,620425,phenotype,,,
28338,620426,gene,389119.0,INKA1,ENSG00000185614
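
When several columns carry a leading hash, rename also accepts a function that is applied to every column name. This is a sketch on a toy frame, not the approach used in this notebook; the frame and the shortened column name Entry Type are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with an OMIM-style header; 'Entry Type' is a shortened, illustrative name.
toy = pd.DataFrame({'# MIM Number': [100050, 100070],
                    'Entry Type': ['predominantly phenotypes', 'phenotype']})

# lstrip('# ') removes any leading '#' and space characters from each name;
# names that do not start with either character are left unchanged.
cleaned = toy.rename(columns=lambda name: name.lstrip('# '))
print(list(cleaned.columns))
```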


In [116]:
df1 = pd.merge(df1, Mim2gen)

In [117]:
df1

Unnamed: 0,# Chromosome,Genomic Position Start,Genomic Position End,Cyto Location,Computed Cyto Location,MIM Number,Gene Symbols,Gene Name,Approved Gene Symbol,Entrez Gene ID,Ensembl Gene ID,Comments,Phenotypes,Mouse Gene Symbol/ID,# Prefix,Preferred Title; symbol,Alternative Title(s); symbol(s),Included Title(s); symbols,MIM Entry Type (see FAQ 1.3 at https://omim.org/help/faq),Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl)
0,chr1,1.0,27600000.0,1p36,,607413.0,AD7CNTP,Alzheimer disease neuronal thread protein,,,,,,,,ALZHEIMER DISEASE NEURONAL THREAD PROTEIN,AD7CNTP,,predominantly phenotypes,,,
1,chr1,1.0,27600000.0,1p36,,612367.0,ALPQTL2,"Alkaline phosphatase, plasma level of, QTL 2",,100196914.0,,linkage with rs1780324,"{Alkaline phosphatase, plasma level of, QTL 2}...",,Percent,"ALKALINE PHOSPHATASE, PLASMA LEVEL OF, QUANTIT...",ALPQTL2,,phenotype,100196914.0,,
2,chr1,1.0,123400001.0,1p,,606788.0,ANON1,"Anorexia nervosa, susceptibility to, 1",,171514.0,,,"{Anorexia nervosa, susceptibility to, 1}, 6067...",,Percent,"ANOREXIA NERVOSA, SUSCEPTIBILITY TO; ANON",AN,"ANOREXIA NERVOSA, SUSCEPTIBILITY TO, 1, INCLUD...",phenotype,171514.0,,
3,chr1,1.0,27600000.0,1p36,,605462.0,BCC1,"Basal cell carcinoma, susceptibility to, 1",,100307118.0,,associated with rs7538876,"{Basal cell carcinoma, susceptibility to, 1}, ...",,Percent,"BASAL CELL CARCINOMA, SUSCEPTIBILITY TO, 1; BCC1",,"BASAL CELL CARCINOMA, NONSYNDROMIC, INCLUDED;;...",phenotype,100307118.0,,
4,chr1,1.0,27600000.0,1p36,,606928.0,BMND3,Bone mineral density QTL 3,,246259.0,,?another locus at 3p21,"[Bone mineral density QTL 3], 606928 (2)",,Percent,BONE MINERAL DENSITY QUANTITATIVE TRAIT LOCUS ...,"BONE MINERAL DENSITY, LOW, SUSCEPTIBILITY TO",,phenotype,246259.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18226,chrY,25622115.0,25625511.0,Yq11.23,Yq11.23,400016.0,"CDY1, CDY","Chromodomain protein, Y chromosome",CDY1,9085.0,ENSG00000172288,,,,Asterisk,"CHROMODOMAIN PROTEIN, Y-LINKED, 1; CDY1","CHROMODOMAIN PROTEIN, Y CHROMOSOME; CDY",,gene,9085.0,CDY1,ENSG00000172288
18227,chrY,25728490.0,25733388.0,Yq11.23,Yq11.23,400036.0,TTTY3,"Testis-specific transcript, Y-linked, 3",TTTY3,114760.0,ENSG00000231141,,,,Asterisk,"TESTIS-SPECIFIC TRANSCRIPT, Y-LINKED, 3; TTTY3",,,gene,114760.0,TTTY3,ENSG00000231141
18228,chrY,26600001.0,57227415.0,Yq12,,475000.0,"GCY, TSY, STA","Growth control, Y-chromosome influenced",,2656.0,,,,,Percent,"GROWTH CONTROL, Y-CHROMOSOME INFLUENCED; GCY",STATURE; STA;; TOOTH SIZE; TS; TSY,,phenotype,2656.0,,
18229,chrY,1.0,57227415.0,Chr.Y,,400043.0,DFNY1,"Deafness, Y-linked 1",,724074.0,,,"Deafness, Y-linked 1, 400043 (2), Y-linked",,Percent,"DEAFNESS, Y-LINKED 1; DFNY1",,,phenotype,724074.0,,


In [118]:
df1 = pd.merge(df1, Morbidmap)

In [119]:
df1

Unnamed: 0,# Chromosome,Genomic Position Start,Genomic Position End,Cyto Location,Computed Cyto Location,MIM Number,Gene Symbols,Gene Name,Approved Gene Symbol,Entrez Gene ID,Ensembl Gene ID,Comments,Phenotypes,Mouse Gene Symbol/ID,# Prefix,Preferred Title; symbol,Alternative Title(s); symbol(s),Included Title(s); symbols,MIM Entry Type (see FAQ 1.3 at https://omim.org/help/faq),Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC),Ensembl Gene ID (Ensembl),# Phenotype
0,chr1,1.0,27600000.0,1p36,,612367.0,ALPQTL2,"Alkaline phosphatase, plasma level of, QTL 2",,100196914.0,,linkage with rs1780324,"{Alkaline phosphatase, plasma level of, QTL 2}...",,Percent,"ALKALINE PHOSPHATASE, PLASMA LEVEL OF, QUANTIT...",ALPQTL2,,phenotype,100196914.0,,,"{Alkaline phosphatase, plasma level of, QTL 2}..."
1,chr1,1.0,123400001.0,1p,,606788.0,ANON1,"Anorexia nervosa, susceptibility to, 1",,171514.0,,,"{Anorexia nervosa, susceptibility to, 1}, 6067...",,Percent,"ANOREXIA NERVOSA, SUSCEPTIBILITY TO; ANON",AN,"ANOREXIA NERVOSA, SUSCEPTIBILITY TO, 1, INCLUD...",phenotype,171514.0,,,"{Anorexia nervosa, susceptibility to, 1} (2)"
2,chr1,1.0,27600000.0,1p36,,605462.0,BCC1,"Basal cell carcinoma, susceptibility to, 1",,100307118.0,,associated with rs7538876,"{Basal cell carcinoma, susceptibility to, 1}, ...",,Percent,"BASAL CELL CARCINOMA, SUSCEPTIBILITY TO, 1; BCC1",,"BASAL CELL CARCINOMA, NONSYNDROMIC, INCLUDED;;...",phenotype,100307118.0,,,"{Basal cell carcinoma, susceptibility to, 1} (2)"
3,chr1,1.0,27600000.0,1p36,,606928.0,BMND3,Bone mineral density QTL 3,,246259.0,,?another locus at 3p21,"[Bone mineral density QTL 3], 606928 (2)",,Percent,BONE MINERAL DENSITY QUANTITATIVE TRAIT LOCUS ...,"BONE MINERAL DENSITY, LOW, SUSCEPTIBILITY TO",,phenotype,246259.0,,,[Bone mineral density QTL 3] (2)
4,chr1,1.0,2300000.0,1p36.33,,618815.0,"C1DUPp36.33, DUP1p36.33","Chromosome 1p36.33 duplication syndrome, ATAD3...",,,,,"Chromosome 1p36.33 duplication syndrome, ATAD3...",,Number Sign,"CHROMOSOME 1p36.33 DUPLICATION SYNDROME, ATAD3...",,,phenotype,,,,"Chromosome 1p36.33 duplication syndrome, ATAD3..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3940,chrY,6910697.0,7091683.0,Yp11.2,Yp11.2,400033.0,"TBL1Y, DFNY2","Transducin-beta-like 1, Y-linked",TBL1Y,90665.0,ENSG00000092377,mutation identified in 1 DFNY2 family,"?Deafness, Y-linked 2, 400047 (3), Y-linked",Tbl1x (MGI:1336172),Asterisk,"TRANSDUCIN-BETA-LIKE 1, Y-LINKED; TBL1Y",,,gene,90665.0,TBL1Y,ENSG00000092377,"?Deafness, Y-linked 2, 400047 (3)"
3941,chrY,10400001.0,26600000.0,Yq11,,400042.0,"DELYq11, CYDELq11, SPGFY1",Chromosome Yq11 interstitial deletion syndrome,,,,contiguous gene deletion syndrome,"Spermatogenic failure, Y-linked, 1, 400042 (4)...",,Number Sign,"SPERMATOGENIC FAILURE, Y-LINKED, 1; SPGFY1","SERTOLI CELL-ONLY SYNDROME, Y-LINKED;; SERTOLI...","SERTOLI CELL-ONLY SYNDROME, TYPE II, INCLUDED;...",phenotype,,,,"Spermatogenic failure, Y-linked, 1 (4)"
3942,chrY,10400001.0,57227415.0,Yq,,425500.0,HEY,"Hairy ears, Y-linked",,100188776.0,,,"?Hairy ears, Y-linked, 425500 (2), Y-linked",,,"HAIRY EARS, Y-LINKED","HYPERTRICHOSIS PINNAE AURIS, Y-LINKED",,predominantly phenotypes,100188776.0,,,"?Hairy ears, Y-linked (2)"
3943,chrY,1.0,57227415.0,Chr.Y,,400043.0,DFNY1,"Deafness, Y-linked 1",,724074.0,,,"Deafness, Y-linked 1, 400043 (2), Y-linked",,Percent,"DEAFNESS, Y-LINKED 1; DFNY1",,,phenotype,724074.0,,,"Deafness, Y-linked 1 (2)"


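So far we have read and merged four data files. What if we have more than four? Merging them one by one is tedious. We will merge these dataframes in one go. This time we also pass skipfooter to read_csv to drop the trailing comment lines that showed up as rows in the outputs above, and we rename the columns that start with a hash.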

In [193]:
genemap = pd.read_csv('/Volumes/archive/scratch/brownlab/ensembl/ref/omim/genemap2.txt', sep='\t', header=3, skipfooter=76, engine='python')
genemap.rename(columns={'# Chromosome':'Chromosome'}, inplace=True)

mimtittles = pd.read_csv('/Volumes/archive/scratch/brownlab/ensembl/ref/omim/mimTitles.txt', sep='\t', header=2, skipfooter=13, engine='python')
mimtittles.rename(columns={'# Prefix':'Prefix'}, inplace=True)

mim2gene = pd.read_csv('/Volumes/archive/scratch/brownlab/ensembl/ref/omim/mim2gene.txt', sep="\t", header=4)
# Assign short names to all five columns; this also removes the leading '# ' from '# MIM Number'.
mim2gene.columns = ['MIM Number','MIM Entry Type','Entrez Gene ID',
                    'Approved Gene Symbol','Ensembl Gene ID']

morbidmap = pd.read_csv('/Volumes/archive/scratch/brownlab/ensembl/ref/omim/morbidmap.txt', sep="\t", header=3, skipfooter=24, engine='python')
morbidmap['morbidmap'] = True
morbidmap = morbidmap[['MIM Number','morbidmap']].drop_duplicates()
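
The read_csv calls above use skipfooter to leave out the comment lines at the end of each file. A small sketch of that effect follows, using a made-up miniature file in a similar layout (a leading comment line, a header row, two data rows, two trailing comment lines). The content of text is an assumption for illustration, not real OMIM data.

```python
import io
import pandas as pd

# Made-up miniature file: comment line, header, two data rows, two trailing comments.
text = (
    "# leading comment\n"
    "# MIM Number\tEntry Type\n"
    "100050\tpredominantly phenotypes\n"
    "100070\tphenotype\n"
    "# trailing note 1\n"
    "# trailing note 2"
)

# Without skipfooter, the trailing comment lines are read as data rows.
with_footer = pd.read_csv(io.StringIO(text), sep='\t', header=1)

# skipfooter=2 drops the last two lines; it requires engine='python'.
without_footer = pd.read_csv(io.StringIO(text), sep='\t', header=1,
                             skipfooter=2, engine='python')

print(len(with_footer), len(without_footer))
```

The footer length differs per file, which is why each call above passes its own skipfooter value.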

In [195]:
# Merge all dataframes pairwise, from left to right, on their shared columns
from functools import reduce

dfs = [genemap, mimtittles, morbidmap, mim2gene]
df = reduce(lambda left, right: pd.merge(left, right, how = 'outer'), dfs)
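
What reduce does here can be checked on toy frames: it merges the first two frames, then merges the result with the third, and so on. The frames a, b and c and their values below are made up for illustration.

```python
from functools import reduce
import pandas as pd

# Three toy frames sharing the key column 'MIM Number' (values are made up).
a = pd.DataFrame({'MIM Number': [1, 2], 'Symbol': ['A', 'B']})
b = pd.DataFrame({'MIM Number': [2, 3], 'Title': ['T2', 'T3']})
c = pd.DataFrame({'MIM Number': [1, 3], 'morbidmap': [True, True]})

# reduce applies the merge cumulatively: merge(merge(a, b), c).
merged = reduce(lambda left, right: pd.merge(left, right, how='outer'), [a, b, c])

# The same result, written out step by step.
step_by_step = pd.merge(pd.merge(a, b, how='outer'), c, how='outer')

print(merged)
```

Since no on argument is given, each step joins on whatever columns the two frames share at that point, so it is worth checking the column names before merging.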

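We will use groupby(['column_name']).first() to get one record per group. Note that first() takes the first non-null value of each column within a group, so the resulting record can combine values from different rows.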

In [None]:
df_ = df.groupby(['MIM Number']).first().reset_index()
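
The difference between groupby(...).first() and simply keeping the first row of each group shows up when the first row has gaps. This is a sketch on a made-up toy frame; drop_duplicates is shown only for comparison and is not used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: MIM Number 1 appears twice, and its first row lacks a Symbol.
toy = pd.DataFrame({
    'MIM Number': [1, 1, 2],
    'Symbol': [np.nan, 'B', 'C'],
    'Title': ['T1', 'T1b', np.nan],
})

# first() picks the first non-null value per column within each group.
first_non_null = toy.groupby(['MIM Number']).first().reset_index()

# drop_duplicates keeps the first row of each group as it is, gaps included.
first_row = toy.drop_duplicates(subset='MIM Number', keep='first')

print(first_non_null)
print(first_row)
```

For MIM Number 1, first() returns Symbol B from the second row together with Title T1 from the first row, whereas drop_duplicates keeps the missing Symbol.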

We will keep only the rows with Chromosome information, and use the morbidmap column to distinguish disease from non-disease MIM entries.

We will take the subset of df_ with a value in ['Chromosome'], and fill the missing values in ['morbidmap'] with False.

In [None]:
df_ = df_[~df_['Chromosome'].isna()].fillna({'morbidmap': False})
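
The same filter-then-fill step on a toy frame is sketched below. The frame and its values are made up for illustration; notna() is used as an equivalent of ~isna().

```python
import numpy as np
import pandas as pd

# Toy frame: the second row has no Chromosome; two rows have no morbidmap flag.
toy = pd.DataFrame({
    'MIM Number': [1, 2, 3],
    'Chromosome': ['chr1', np.nan, 'chrY'],
    'morbidmap': [True, np.nan, np.nan],
})

# Keep rows that have a Chromosome, then fill gaps in 'morbidmap' only.
# Passing a dict to fillna limits the fill to the listed columns.
subset = toy[toy['Chromosome'].notna()].fillna({'morbidmap': False})

print(subset)
```

Passing a dictionary rather than a single value leaves missing values in all other columns untouched.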

We will export df_ as a tab separated file called omim.txt.

In [None]:
df_.to_csv('../ref/omim/omim.txt', sep='\t', index=False)