
# Mapping rare disease patients to an EHR system using ICD-10 codes
#### Rare disease data preprocessing (GARD-Orphanet): ICD-10 codes for use in the N3C Enclave
This notebook retrieves the concept IDs (ICD-10 codes) of rare diseases. We started from 12,003 rare diseases in GARD (https://rarediseases.info.nih.gov/) and mapped them to Orphanet (https://www.orpha.net/), which yielded 9,369 mapped Orphanet codes. We then applied the following inclusion/exclusion criteria:
 - Excluded groups of disorders
 - Dropped ICD-10 codes whose disease name has a phenotype (HPO) mapping

This notebook parses the following three files:
1. RDIP's Disease List for GARD-Orphanet mappings
2. Orphanet via Orphadata, July 2023 release:

    a. Product 1 for Orphanet entities, names, classification levels, and Xrefs to ICD-10 (note that this is ICD-10, not the US version, ICD-10-CM)

    b. Orphanet-SNOMED mappings for Xrefs to SNOMED
    


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## GARD-ORPHANET-MAPPING files

In [2]:
# The following file (Orphanet via Orphadata, July 2023 release) was converted from XML to XLSX
# This is the first sheet of the file ('en_product1')
ICD10 = pd.read_excel('C:\\Users\\,...,\\RD_Cohort_Paper_Feb_2025\\GARD-Orphanet-SNOMED+ICD10_2023-11-29.xlsx',sheet_name = 'en_product1', engine='openpyxl' )
print('DataFrame_size', ICD10.shape)
ICD10.head()


DataFrame_size (78021, 85)


Unnamed: 0,/JDBOR,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 75,Unnamed: 76,Unnamed: 77,Unnamed: 78,Unnamed: 79,Unnamed: 80,Unnamed: 81,Unnamed: 82,Unnamed: 83,Unnamed: 84
0,/@copyright,/@date,/@dbserver,/@version,/Availability/Licence/FullName,/Availability/Licence/FullName/@lang,/Availability/Licence/LegalCode,/Availability/Licence/ShortIdentifier,/DisorderList/@count,/DisorderList/Disorder/#id,...,/DisorderList/Disorder/SummaryInformationList/...,/DisorderList/Disorder/SummaryInformationList/...,/DisorderList/Disorder/SummaryInformationList/...,/DisorderList/Disorder/SummaryInformationList/...,/DisorderList/Disorder/SummaryInformationList/...,/DisorderList/Disorder/SummaryInformationList/...,/DisorderList/Disorder/SummaryInformationList/...,/DisorderList/Disorder/SynonymList/@count,/DisorderList/Disorder/SynonymList/Synonym,/DisorderList/Disorder/SynonymList/Synonym/@lang
1,Orphanet (c) 2023,2023-06-22 13:48:49,jdbc:sybase:Tds:canard.orpha.net:2020,1.3.25 / 4.1.7 [2023-04-17] (orientdb version),Creative Commons Attribution 4.0 International,en,https://creativecommons.org/licenses/by/4.0/le...,CC-BY-4.0,10839,1,...,83697,en,A rare primary bone dysplasia characterized by...,16907,16907,Definition,en,,,
2,Orphanet (c) 2023,2023-06-22 13:48:49,jdbc:sybase:Tds:canard.orpha.net:2020,1.3.25 / 4.1.7 [2023-04-17] (orientdb version),Creative Commons Attribution 4.0 International,en,https://creativecommons.org/licenses/by/4.0/le...,CC-BY-4.0,10839,1,...,,en,A rare primary bone dysplasia characterized by...,16907,,Definition,en,,,
3,Orphanet (c) 2023,2023-06-22 13:48:49,jdbc:sybase:Tds:canard.orpha.net:2020,1.3.25 / 4.1.7 [2023-04-17] (orientdb version),Creative Commons Attribution 4.0 International,en,https://creativecommons.org/licenses/by/4.0/le...,CC-BY-4.0,10839,1,...,,en,A rare primary bone dysplasia characterized by...,16907,,Definition,en,,,
4,Orphanet (c) 2023,2023-06-22 13:48:49,jdbc:sybase:Tds:canard.orpha.net:2020,1.3.25 / 4.1.7 [2023-04-17] (orientdb version),Creative Commons Attribution 4.0 International,en,https://creativecommons.org/licenses/by/4.0/le...,CC-BY-4.0,10839,1,...,,en,A rare primary bone dysplasia characterized by...,16907,,Definition,en,,,


In [3]:
# GARD-to-Orphanet mapping file (of 12004 GARD rare diseases, only 9369 mapped to Orphanet)
GARD_to_Orphanet = pd.read_excel('C:\\Users\\,...,\\RD_Cohort_Paper_Feb_2025\\GARD-Orphanet-SNOMED+ICD10_2023-11-29.xlsx',sheet_name = 'GARD-to-Orphanet' , engine='openpyxl')
print('GARD_to_Orphanet_DataFrame_size', GARD_to_Orphanet.shape)
GARD_to_Orphanet.head()

GARD_to_Orphanet_DataFrame_size (9369, 6)


Unnamed: 0,GardID,DataSource,SourceID,ClassificationLevel,DisorderType,SourceName
0,69,Orphanet,319247,Disorder,[Disease],Hantavirus pulmonary syndrome
1,86,Orphanet,314597,Disorder,[Malformation syndrome],Chudley-McCullough syndrome
2,73,Orphanet,101088,Subtype of disorder,[Clinical subtype],X-linked hyper-IgM syndrome
3,79,Orphanet,33067,Disorder,[Disease],"Metaphyseal chondrodysplasia, Jansen type"
4,6,Orphanet,93437,Group of disorders,[Clinical group],Acromesomelic dysplasia


In [4]:
# Select the columns relevant to sorting ICD-10 codes (.copy() avoids pandas' SettingWithCopyWarning on rename)
ICD10  = ICD10[['Unnamed: 37','Unnamed: 53','Unnamed: 59','Unnamed: 60','Unnamed: 61','Unnamed: 63']].copy()
# Rename the selected columns
ICD10.rename(columns = {'Unnamed: 37':'DisorderGroup','Unnamed: 53': 'DisorderMappingRelation','Unnamed: 59' : 'code', 'Unnamed: 60' : 'code_classification', 'Unnamed: 61' : 'name', 'Unnamed: 63' : 'OrphaCode'}, inplace = True)
# Drop the first row, which holds the original column headers
data_orpha = ICD10.iloc[1:]
data_orpha

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode
1,Disorder,,,,"Multiple epiphyseal dysplasia, Al-Gazali type",166024
2,Disorder,,,,"Multiple epiphyseal dysplasia, Al-Gazali type",166024
3,Disorder,NTBT (ORPHAcode is narrower than the targeted ...,LD24.61,ICD-11,"Multiple epiphyseal dysplasia, Al-Gazali type",166024
4,Disorder,NTBT (ORPHAcode is narrower than the targeted ...,Q77.3,ICD-10,"Multiple epiphyseal dysplasia, Al-Gazali type",166024
5,Disorder,E (Exact mapping: the two concepts are equival...,607131,OMIM,"Multiple epiphyseal dysplasia, Al-Gazali type",166024
...,...,...,...,...,...,...
78016,Disorder,,,,Hereditary persistence of fetal hemoglobin-int...,619233
78017,Group of disorders,,,,Rare hereditary autoinflammatory disease,619238
78018,Group of disorders,,,,Rare hereditary autoinflammatory disease,619238
78019,Group of disorders,,,,Rare hereditary autoinflammatory disease,619238
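The positional names (`Unnamed: 37`, `Unnamed: 53`, ...) depend on the exact column order of the converted sheet. As a minimal sketch of an alternative, assuming the first data row holds the real header labels as it does in the output above, the row can be promoted to column names so that columns are selected by label. The toy frame and the two path-style labels below are hypothetical stand-ins, not the actual sheet contents.

```python
import pandas as pd

# Toy stand-in for the converted sheet: row 0 carries the real header labels.
# Both labels are hypothetical examples, not copied from the Orphadata file.
raw = pd.DataFrame({
    "Unnamed: 0": ["/DisorderList/Disorder/Name", "Alexander disease"],
    "Unnamed: 1": ["/DisorderList/Disorder/OrphaCode", 58],
})


def promote_first_row(df):
    """Use the first row as the header and drop it from the body."""
    out = df.iloc[1:].copy()
    out.columns = df.iloc[0].tolist()
    return out.reset_index(drop=True)


tidy = promote_first_row(raw)
print(tidy.columns.tolist())
```

Selecting by label would then survive a re-export in which columns shift position, at the cost of longer column names.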


In [6]:
# Merge the GARD-to-Orphanet mapping file with the Orphanet-to-ICD-10 code file
merged_df = pd.merge(data_orpha, GARD_to_Orphanet, how = 'inner', left_on = 'OrphaCode', right_on = 'SourceID')
# select relevant columns
data_orpha_Gard = merged_df[['DisorderGroup', 'DisorderMappingRelation', 'code','code_classification', 'name', 'OrphaCode','GardID']]
data_orpha_Gard

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode,GardID
0,Disorder,,,,"Multiple epiphyseal dysplasia, Al-Gazali type",166024,17014
1,Disorder,,,,"Multiple epiphyseal dysplasia, Al-Gazali type",166024,17014
2,Disorder,NTBT (ORPHAcode is narrower than the targeted ...,LD24.61,ICD-11,"Multiple epiphyseal dysplasia, Al-Gazali type",166024,17014
3,Disorder,NTBT (ORPHAcode is narrower than the targeted ...,Q77.3,ICD-10,"Multiple epiphyseal dysplasia, Al-Gazali type",166024,17014
4,Disorder,E (Exact mapping: the two concepts are equival...,607131,OMIM,"Multiple epiphyseal dysplasia, Al-Gazali type",166024,17014
...,...,...,...,...,...,...,...
69458,Disorder,,,,Hereditary persistence of fetal hemoglobin-int...,619233,22458
69459,Disorder,,,,Hereditary persistence of fetal hemoglobin-int...,619233,22458
69460,Disorder,NTBT (ORPHAcode is narrower than the targeted ...,Q87.8,ICD-10,Hereditary persistence of fetal hemoglobin-int...,619233,22458
69461,Disorder,E (Exact mapping: the two concepts are equival...,617101,OMIM,Hereditary persistence of fetal hemoglobin-int...,619233,22458
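An inner merge silently discards Orphanet rows that have no GARD counterpart, and it can also miss matches when the two key columns have different dtypes. The sketch below shows one possible check on toy data: both keys are cast to a nullable integer type and `indicator=True` reports unmatched rows. The third row (OrphaCode 999999) is a made-up entry added only to demonstrate an unmatched key; the other values are taken from the tables above.

```python
import pandas as pd

# Toy frames; OrphaCode is stored as text here to mimic a mixed-type sheet column.
orpha = pd.DataFrame({
    "OrphaCode": ["166024", "447", "999999"],  # 999999 is hypothetical
    "name": [
        "Multiple epiphyseal dysplasia, Al-Gazali type",
        "Paroxysmal nocturnal hemoglobinuria",
        "made-up unmatched entry",
    ],
})
gard = pd.DataFrame({"SourceID": [166024, 447], "GardID": [17014, 7337]})

# Align the key dtypes before merging.
orpha["OrphaCode"] = pd.to_numeric(orpha["OrphaCode"], errors="coerce").astype("Int64")
gard["SourceID"] = gard["SourceID"].astype("Int64")

# A left merge with an indicator column shows which rows found no GARD match.
merged = orpha.merge(gard, how="left", left_on="OrphaCode",
                     right_on="SourceID", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["OrphaCode", "name"]])
```

Once the unmatched rows have been reviewed, filtering on `_merge == "both"` gives the same rows an inner merge would.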


In [7]:
# Select rows whose mapping relation is 'Exact mapping: the two concepts are equivalent'
data_orpha_new = data_orpha_Gard[data_orpha_Gard['DisorderMappingRelation'] == 'E (Exact mapping: the two concepts are equivalent)']
data_orpha_new

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode,GardID
4,Disorder,E (Exact mapping: the two concepts are equival...,607131,OMIM,"Multiple epiphyseal dysplasia, Al-Gazali type",166024,17014
5,Disorder,E (Exact mapping: the two concepts are equival...,C1846722,UMLS,"Multiple epiphyseal dysplasia, Al-Gazali type",166024,17014
9,Disorder,E (Exact mapping: the two concepts are equival...,203450,OMIM,Alexander disease,58,5774
10,Disorder,E (Exact mapping: the two concepts are equival...,D038261,MeSH,Alexander disease,58,5774
11,Disorder,E (Exact mapping: the two concepts are equival...,C0270726,UMLS,Alexander disease,58,5774
...,...,...,...,...,...,...,...
69338,Disorder,E (Exact mapping: the two concepts are equival...,618618,OMIM,MIR140-related spondyloepiphyseal dysplasia,623695,22495
69397,Disorder,E (Exact mapping: the two concepts are equival...,C0346360,UMLS,Conjunctival malignant melanoma,617910,10744
69430,Disorder,E (Exact mapping: the two concepts are equival...,C5680416,UMLS,Early-onset autoimmunity-autoinflammation-immu...,619948,22465
69434,Disorder,E (Exact mapping: the two concepts are equival...,617744,OMIM,Developmental delay-immunodeficiency-leukoence...,619979,22468
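The filter above compares against the full relation label, so any change in the label wording between Orphadata releases would return an empty result. A minimal sketch of a less brittle variant extracts the short code that precedes the parenthesis (`E`, `NTBT`, ...) and filters on that. The helper name `relation_code` is introduced here for illustration only.

```python
import re


def relation_code(label):
    """Return the short code before the first '(' in a mapping-relation label."""
    if not isinstance(label, str):
        return None  # missing values (NaN) carry no relation
    match = re.match(r"\s*([A-Za-z]+)\s*\(", label)
    return match.group(1) if match else None


labels = [
    "E (Exact mapping: the two concepts are equivalent)",
    "NTBT (ORPHAcode is narrower than the targeted ...",  # truncated as displayed above
    float("nan"),
]
codes = [relation_code(label) for label in labels]
print(codes)
```

With pandas, the same helper could be applied through `data_orpha_Gard['DisorderMappingRelation'].map(relation_code)` and the result compared with `'E'`.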


In [8]:
# Keep only rows with ICD-10 codes (575 diseases)
data_orpha_ICD10 = data_orpha_new[data_orpha_new['code_classification'] == 'ICD-10']
data_orpha_ICD10

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode,GardID
339,Disorder,E (Exact mapping: the two concepts are equival...,D59.5,ICD-10,Paroxysmal nocturnal hemoglobinuria,447,7337
584,Disorder,E (Exact mapping: the two concepts are equival...,Q80.1,ICD-10,Recessive X-linked ichthyosis,461,7904
685,Disorder,E (Exact mapping: the two concepts are equival...,D56.1,ICD-10,Beta-thalassemia,848,871
701,Disorder,E (Exact mapping: the two concepts are equival...,D56.0,ICD-10,Alpha-thalassemia,846,621
878,Disorder,E (Exact mapping: the two concepts are equival...,L40.3,ICD-10,Pustulosis palmaris et plantaris,163927,12820
...,...,...,...,...,...,...,...
68961,Disorder,E (Exact mapping: the two concepts are equival...,K00.0,ICD-10,Anodontia,99797,5818
69039,Disorder,E (Exact mapping: the two concepts are equival...,Q35.3,ICD-10,Cleft velum,99772,16907
69046,Disorder,E (Exact mapping: the two concepts are equival...,Q35.7,ICD-10,Bifid uvula,99771,19687
69067,Disorder,E (Exact mapping: the two concepts are equival...,L92.2,ICD-10,Granuloma faciale,615943,22442


In [9]:
data_orpha_ICD10.to_csv('data_orpha_ICD10.csv',index=False, encoding='utf-16-le', sep=',')

In [9]:
# Number of unique GARD IDs mapped with Orphanet & ICD-10 codes
len(data_orpha_ICD10['GardID'].unique())

564
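The number of unique GARD IDs is lower than the number of ICD-10 rows, which suggests that some GARD IDs carry more than one ICD-10 code. One way to list them is sketched below on toy data. GardID `1` and its two codes are a hypothetical pairing used only to show the pattern; the other rows come from the table above.

```python
import pandas as pd

rows = pd.DataFrame({
    "GardID": [7337, 7904, 1, 1],  # GardID 1 is a made-up example
    "code": ["D59.5", "Q80.1", "M31.0+", "N08.5*"],
})

# Count distinct ICD-10 codes per GARD ID and keep the IDs with more than one.
codes_per_gard = rows.groupby("GardID")["code"].nunique()
multi_code = codes_per_gard[codes_per_gard > 1]

print(rows["GardID"].nunique(), "unique GARD IDs in", len(rows), "rows")
print(multi_code)
```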

In [10]:
# Select the diseases classified as 'Group of disorders' (these are dropped in the next step)
GARD_Orpha_ICD10_with_group_disorder = data_orpha_ICD10[data_orpha_ICD10['DisorderGroup'] =='Group of disorders']
GARD_Orpha_ICD10_with_group_disorder

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode,GardID
5205,Group of disorders,E (Exact mapping: the two concepts are equival...,D24,ICD-10,Rare benign breast tumor,180253,12775
5384,Group of disorders,E (Exact mapping: the two concepts are equival...,Q51.4,ICD-10,Unilateral aplasia of the Müllerian ducts,180071,20173
5457,Group of disorders,E (Exact mapping: the two concepts are equival...,Q51.3,ICD-10,Bicornuate uterus,180134,20183
5651,Group of disorders,E (Exact mapping: the two concepts are equival...,E85.0,ICD-10,Amyloidosis,69,18676
5757,Group of disorders,E (Exact mapping: the two concepts are equival...,C84.0,ICD-10,Mycosis fungoides and variants,178566,20166
...,...,...,...,...,...,...,...
66743,Group of disorders,E (Exact mapping: the two concepts are equival...,D70,ICD-10,Constitutional neutropenia,101987,19809
66944,Group of disorders,E (Exact mapping: the two concepts are equival...,D81,ICD-10,Combined T and B cell immunodeficiency,101972,19806
66976,Group of disorders,E (Exact mapping: the two concepts are equival...,E23.0,ICD-10,Pituitary deficiency,101957,19801
68134,Group of disorders,E (Exact mapping: the two concepts are equival...,B87.0,ICD-10,Cutaneous myiasis,99983,19723


In [11]:
GARD_Orpha_ICD10_with_group_disorder.to_csv("ICD10_codes_drop_due_to_groups_of_disorders.csv", index = False)

### Drop rare disease GARD IDs classified as group of disorders

In [11]:
# Drop diseases classified as 'Group of disorders'
GARD_Orpha_ICD10_without_group_disorder = data_orpha_ICD10[data_orpha_ICD10['DisorderGroup'] !='Group of disorders']
GARD_Orpha_ICD10_without_group_disorder

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode,GardID
339,Disorder,E (Exact mapping: the two concepts are equival...,D59.5,ICD-10,Paroxysmal nocturnal hemoglobinuria,447,7337
584,Disorder,E (Exact mapping: the two concepts are equival...,Q80.1,ICD-10,Recessive X-linked ichthyosis,461,7904
685,Disorder,E (Exact mapping: the two concepts are equival...,D56.1,ICD-10,Beta-thalassemia,848,871
701,Disorder,E (Exact mapping: the two concepts are equival...,D56.0,ICD-10,Alpha-thalassemia,846,621
878,Disorder,E (Exact mapping: the two concepts are equival...,L40.3,ICD-10,Pustulosis palmaris et plantaris,163927,12820
...,...,...,...,...,...,...,...
68961,Disorder,E (Exact mapping: the two concepts are equival...,K00.0,ICD-10,Anodontia,99797,5818
69039,Disorder,E (Exact mapping: the two concepts are equival...,Q35.3,ICD-10,Cleft velum,99772,16907
69046,Disorder,E (Exact mapping: the two concepts are equival...,Q35.7,ICD-10,Bifid uvula,99771,19687
69067,Disorder,E (Exact mapping: the two concepts are equival...,L92.2,ICD-10,Granuloma faciale,615943,22442
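The dropped and retained tables are built with two separate comparisons. As a small sketch, a single boolean mask can produce both parts and makes it easy to confirm that no row is lost or counted twice. The three rows below are copied from the tables above.

```python
import pandas as pd

rows = pd.DataFrame({
    "DisorderGroup": ["Disorder", "Group of disorders", "Disorder"],
    "code": ["D59.5", "E85.0", "Q80.1"],
    "GardID": [7337, 18676, 7904],
})

# One mask drives both subsets, so they are complementary by construction.
is_group = rows["DisorderGroup"].eq("Group of disorders")
dropped = rows[is_group]
kept = rows[~is_group]

# Sanity check: the two parts add up to the original table.
assert len(dropped) + len(kept) == len(rows)
print(len(kept), "kept,", len(dropped), "dropped")
```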


In [12]:
GARD_Orpha_ICD10_without_group_disorder.to_csv('GARD_Orpha_ICD10_without_group_disorder.csv',index=False, encoding='utf-16-le', sep=',')

In [13]:
GARD_Orpha_ICD10_without_group_disorder['name'].to_csv('ICD10_codes_for_HPO_mapping.txt', sep='\t', index=False, encoding='utf-16-le')

### HPO mapping of the final disease list

In [41]:
import requests
import json
import csv
from urllib.parse import quote

def get_hpo(dis):
    # Remove any null bytes or unwanted characters from the input term
    dis = dis.replace('\x00', '')  # Remove any null characters
    # URL encode the disease term to handle spaces and special characters correctly
    encoded_dis = quote(dis)
    url = f'https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms={encoded_dis}'
    print(url)  # Print the URL for debugging
    my_response = requests.get(url)
    return my_response

# Specify the output CSV file name
output_file_name = "C:\\Users\\,...,\\RD_Cohort_Paper_Feb_2025\\ICD_10_code_mapping\\ICD10_codes_with_HPO_tag.csv"

# Open the output file in write mode with UTF-16-LE encoding for CSV
with open(output_file_name, 'w', encoding='utf-16-le', newline='') as output_file:
    # Create a CSV writer
    csv_writer = csv.writer(output_file)
    
    # Write the header row (optional, but useful to name the columns)
    csv_writer.writerow(["name", "HPO"])
    
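    # Open the input file in read mode as ISO-8859-1 (latin1). The names file was saved as UTF-16-LE,
    # so each character is followed by a null byte (removed below), and the header line and blank
    # lines are also read and queried; those extra rows are dropped after the results are loaded.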
    with open('C:\\Users\\,...,\\RD_Cohort_Paper_Feb_2025\\ICD_10_code_mapping\\ICD10_codes_for_HPO_mapping.txt', 'r', encoding='ISO-8859-1') as input_file:
        lines = input_file.readlines()
        for x in lines:
            # Strip extra spaces and newlines, replace underscores with spaces
            x = x.strip().replace('_', ' ')
            
            # Remove any null characters that might have appeared
            x = x.replace('\x00', '')
            
            # Ensure the disease name is properly handled for spaces and special characters
            print(x)  # Print the disease name (for debugging)
            
            # Call the get_hpo function to retrieve information
            info = get_hpo(x)
            
            if info.ok:
                j_data = json.loads(info.content)
                # For simplicity, you can convert the JSON data to a string (you could format it better if needed)
                hpo_data = json.dumps(j_data)
                # Write the disease name and HPO data to the CSV file
                csv_writer.writerow([x, hpo_data])



name
https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=name

https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=
Paroxysmal nocturnal hemoglobinuria
https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=Paroxysmal%20nocturnal%20hemoglobinuria

https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=
Recessive X-linked ichthyosis
https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=Recessive%20X-linked%20ichthyosis

https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=
Beta-thalassemia
https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=Beta-thalassemia

https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=
Alpha-thalassemia
https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=Alpha-thalassemia

https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=
Pustulosis palmaris et plantaris
https://clinicaltables.nlm.nih.gov/api/hpo/v3/search?terms=Pustulosis%20palmaris%20et%20plantaris

https://clinicaltables.nlm.nih.gov/api/hpo/v3/se
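The empty `terms=` queries in the log above appear to come from reading a UTF-16-LE file as ISO-8859-1: with Windows line endings, the carriage return and the line feed are separated by a null byte, so each is treated as its own line break. The sketch below reproduces the effect on a temporary file and shows a read that uses the matching encoding and skips the header. The file name and the two-name list are illustrative, and the Windows line endings are an assumption about how the names file was written.

```python
import os
import tempfile

names = ["Paroxysmal nocturnal hemoglobinuria", "Beta-thalassemia"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")  # hypothetical file name

    # Write a header plus names as UTF-16-LE with Windows line endings.
    with open(path, "w", encoding="utf-16-le", newline="\r\n") as fh:
        fh.write("name\n" + "\n".join(names) + "\n")

    # Mismatched read: null bytes split '\r' and '\n' into two line breaks.
    with open(path, "r", encoding="ISO-8859-1") as fh:
        mismatched = [line.strip().replace("\x00", "") for line in fh]

    # Matching read: skip the header line and ignore empty lines.
    with open(path, "r", encoding="utf-16-le") as fh:
        next(fh)
        matched = [line.strip() for line in fh if line.strip()]

print(mismatched[:4])
print(matched)
```

Reading with the matching encoding would avoid the header and empty-term requests, but the later cells that drop the first row and the NaN names would then need to be adjusted accordingly.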

#### This approach saves the results directly to a CSV file; the characters of a few disease names are garbled with this approach

### Import rare disease HPO mapping file

In [42]:
## HPO mapping of Rare Disease lists (ICD10 /SNOMED-CT codes) 
ICD10_HPO = pd.read_csv('ICD10_codes_with_HPO_tag.csv', encoding='utf-16-le')
print('shape of dataframe', ICD10_HPO.shape)
ICD10_HPO

shape of dataframe (929, 2)


Unnamed: 0,name,HPO
0,name,"[0, [], null, []]"
1,,"[19434, [""HP:0000001"", ""HP:0000002"", ""HP:00000..."
2,Paroxysmal nocturnal hemoglobinuria,"[1, [""HP:0004818""], null, [[""HP:0004818"", ""Par..."
3,,"[19434, [""HP:0000001"", ""HP:0000002"", ""HP:00000..."
4,Recessive X-linked ichthyosis,"[0, [], null, []]"
...,...,...
924,Granuloma faciale,"[0, [], null, []]"
925,,"[19434, [""HP:0000001"", ""HP:0000002"", ""HP:00000..."
926,Amniotic fluid embolism,"[0, [], null, []]"
927,,"[19434, [""HP:0000001"", ""HP:0000002"", ""HP:00000..."


In [43]:
# Drop the first row (the header line of the input file, which was also sent to the API)
ICD10_HPO = ICD10_HPO.iloc[1:].reset_index(drop=True)

# Drop rows where the 'name' column is NaN (blank lines read from the input file)
ICD10_HPO = ICD10_HPO.dropna(subset=["name"])
ICD10_HPO

Unnamed: 0,name,HPO
1,Paroxysmal nocturnal hemoglobinuria,"[1, [""HP:0004818""], null, [[""HP:0004818"", ""Par..."
3,Recessive X-linked ichthyosis,"[0, [], null, []]"
5,Beta-thalassemia,"[0, [], null, []]"
7,Alpha-thalassemia,"[0, [], null, []]"
9,Pustulosis palmaris et plantaris,"[1, [""HP:0100847""], null, [[""HP:0100847"", ""Pal..."
...,...,...
917,Anodontia,"[4, [""HP:0000674"", ""HP:0000706"", ""HP:0000677"",..."
919,Cleft velum,"[2, [""HP:0000185"", ""HP:0011819""], null, [[""HP:..."
921,Bifid uvula,"[1, [""HP:0000193""], null, [[""HP:0000193"", ""Bif..."
923,Granuloma faciale,"[0, [], null, []]"


In [44]:
# The following diseases mapped to HPO terms and are therefore dropped from the ICD-10 code list
ICD10_code_hpo = ICD10_HPO[ICD10_HPO['HPO'] != '[0, [], null, []]' ]
ICD10_code_hpo

Unnamed: 0,name,HPO
1,Paroxysmal nocturnal hemoglobinuria,"[1, [""HP:0004818""], null, [[""HP:0004818"", ""Par..."
9,Pustulosis palmaris et plantaris,"[1, [""HP:0100847""], null, [[""HP:0100847"", ""Pal..."
21,Retinoblastoma,"[1, [""HP:0009919""], null, [[""HP:0009919"", ""Ret..."
31,Wiskott-Aldrich syndrome,"[1, [""HP:6000472""], null, [[""HP:6000472"", ""Dec..."
47,Lamellar ichthyosis,"[1, [""HP:0007479""], null, [[""HP:0007479"", ""Con..."
...,...,...
879,Congenital mitral stenosis,"[1, [""HP:0011570""], null, [[""HP:0011570"", ""Con..."
885,Cleft hard palate,"[6, [""HP:0410005"", ""HP:0000176"", ""HP:5201003"",..."
917,Anodontia,"[4, [""HP:0000674"", ""HP:0000706"", ""HP:0000677"",..."
919,Cleft velum,"[2, [""HP:0000185"", ""HP:0011819""], null, [[""HP:..."
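The HPO column is compared against the literal string `'[0, [], null, []]'`, which depends on the exact spacing produced by `json.dumps`. A minimal sketch of a more tolerant check parses the stored JSON and reads the hit count from its first element. The four-element layout (count, codes, extra, display) is inferred from the rows shown above, and the display label in the second example is a placeholder rather than the real API text.

```python
import json


def hpo_hits(raw):
    """Return (total, hpo_ids) from a stored response string."""
    total, codes, _extra, _display = json.loads(raw)
    return total, list(codes)


no_hit = '[0, [], null, []]'
# The display label is a placeholder, not the actual text returned by the API.
one_hit = '[1, ["HP:0004818"], null, [["HP:0004818", "placeholder label"]]]'

for raw in (no_hit, one_hit):
    total, ids = hpo_hits(raw)
    print(total > 0, ids)
```

In the notebook, `ICD10_HPO['HPO'].map(lambda s: hpo_hits(s)[0] > 0)` would give a boolean column equivalent to the string comparison.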


In [45]:
ICD10_code_hpo.to_csv('ICD10_codes_drop_due_to_HPO.csv', index = False, encoding='utf-16-le')

In [46]:
# The following diseases have no HPO mapping
ICD10_code_without_hpo = ICD10_HPO[ICD10_HPO['HPO'] == '[0, [], null, []]' ]
ICD10_code_without_hpo

Unnamed: 0,name,HPO
3,Recessive X-linked ichthyosis,"[0, [], null, []]"
5,Beta-thalassemia,"[0, [], null, []]"
7,Alpha-thalassemia,"[0, [], null, []]"
11,Rett syndrome,"[0, [], null, []]"
13,Autosomal recessive polycystic kidney disease,"[0, [], null, []]"
...,...,...
911,Crimean-Congo hemorrhagic fever,"[0, [], null, []]"
913,Lassa fever,"[0, [], null, []]"
915,Marburg hemorrhagic fever,"[0, [], null, []]"
923,Granuloma faciale,"[0, [], null, []]"


In [47]:
ICD10_code_without_hpo.to_csv('RD_icd10_without_hpo.csv',index=False, encoding='utf-16-le', sep=',')

In [48]:
icd10_code_without_hpo = pd.read_csv('RD_icd10_without_hpo.csv', encoding='utf-16-le')
icd10_code_without_hpo

Unnamed: 0,name,HPO
0,Recessive X-linked ichthyosis,"[0, [], null, []]"
1,Beta-thalassemia,"[0, [], null, []]"
2,Alpha-thalassemia,"[0, [], null, []]"
3,Rett syndrome,"[0, [], null, []]"
4,Autosomal recessive polycystic kidney disease,"[0, [], null, []]"
...,...,...
352,Crimean-Congo hemorrhagic fever,"[0, [], null, []]"
353,Lassa fever,"[0, [], null, []]"
354,Marburg hemorrhagic fever,"[0, [], null, []]"
355,Granuloma faciale,"[0, [], null, []]"


In [49]:
GARD_Orpha_ICD10_without_group_disorder = pd.read_csv('GARD_Orpha_ICD10_without_group_disorder.csv', encoding='utf-16-le')
GARD_Orpha_ICD10_without_group_disorder

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode,GardID
0,Disorder,E (Exact mapping: the two concepts are equival...,D59.5,ICD-10,Paroxysmal nocturnal hemoglobinuria,447,7337
1,Disorder,E (Exact mapping: the two concepts are equival...,Q80.1,ICD-10,Recessive X-linked ichthyosis,461,7904
2,Disorder,E (Exact mapping: the two concepts are equival...,D56.1,ICD-10,Beta-thalassemia,848,871
3,Disorder,E (Exact mapping: the two concepts are equival...,D56.0,ICD-10,Alpha-thalassemia,846,621
4,Disorder,E (Exact mapping: the two concepts are equival...,L40.3,ICD-10,Pustulosis palmaris et plantaris,163927,12820
...,...,...,...,...,...,...,...
458,Disorder,E (Exact mapping: the two concepts are equival...,K00.0,ICD-10,Anodontia,99797,5818
459,Disorder,E (Exact mapping: the two concepts are equival...,Q35.3,ICD-10,Cleft velum,99772,16907
460,Disorder,E (Exact mapping: the two concepts are equival...,Q35.7,ICD-10,Bifid uvula,99771,19687
461,Disorder,E (Exact mapping: the two concepts are equival...,L92.2,ICD-10,Granuloma faciale,615943,22442


In [50]:
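# Merge the rare disease ICD-10 codes with the list of diseases without an HPO mapping, on disease name
# (note: despite its name, ICD10_code_with_hpo holds the diseases that have NO HPO match)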
ICD10_code_with_hpo = pd.merge(GARD_Orpha_ICD10_without_group_disorder,icd10_code_without_hpo, how = 'inner', on = 'name').drop_duplicates()
ICD10_code_with_hpo

Unnamed: 0,DisorderGroup,DisorderMappingRelation,code,code_classification,name,OrphaCode,GardID,HPO
0,Disorder,E (Exact mapping: the two concepts are equival...,Q80.1,ICD-10,Recessive X-linked ichthyosis,461,7904,"[0, [], null, []]"
1,Disorder,E (Exact mapping: the two concepts are equival...,D56.1,ICD-10,Beta-thalassemia,848,871,"[0, [], null, []]"
2,Disorder,E (Exact mapping: the two concepts are equival...,D56.0,ICD-10,Alpha-thalassemia,846,621,"[0, [], null, []]"
3,Disorder,E (Exact mapping: the two concepts are equival...,F84.2,ICD-10,Rett syndrome,778,5696,"[0, [], null, []]"
4,Disorder,E (Exact mapping: the two concepts are equival...,Q61.1,ICD-10,Autosomal recessive polycystic kidney disease,731,8378,"[0, [], null, []]"
...,...,...,...,...,...,...,...,...
368,Disorder,E (Exact mapping: the two concepts are equival...,A98.0,ICD-10,Crimean-Congo hemorrhagic fever,99827,19690,"[0, [], null, []]"
369,Disorder,E (Exact mapping: the two concepts are equival...,A96.2,ICD-10,Lassa fever,99824,19688,"[0, [], null, []]"
370,Disorder,E (Exact mapping: the two concepts are equival...,A98.3,ICD-10,Marburg hemorrhagic fever,99826,9444,"[0, [], null, []]"
371,Disorder,E (Exact mapping: the two concepts are equival...,L92.2,ICD-10,Granuloma faciale,615943,22442,"[0, [], null, []]"


In [51]:
len(ICD10_code_with_hpo['code'].unique())

357

In [52]:
len(ICD10_code_with_hpo['GardID'].unique())

349

In [53]:
ICD10_code_with_hpo.to_csv('Final_ICD10_codes_for_N3C_mapping.csv', index = False, encoding='utf-16-le')

In [54]:
# List of ICD-10 codes to map in the N3C Enclave (or any EHR system)
print(ICD10_code_with_hpo['code'].tolist() )

['Q80.1', 'D56.1', 'D56.0', 'F84.2', 'Q61.1', 'Q87.4', 'G10', 'D82.1', 'Q99.2', 'E76.1', 'E76.0', 'L93.1', 'Q77.4', 'Q16.3', 'M61.1', 'E79.1', 'Q77.5', 'I78.0', 'E80.5', 'K22.0', 'C45.1', 'F84.3', 'Q98.5', 'D47.5', 'G60.1', 'Q82.3', 'G90.1', 'Q23.4', 'Q86.0', 'Q12.1', 'Q77.1', 'P35.0', 'Q83.0', 'Q83.1', 'Q31.5', 'Q31.0', 'C57.0', 'Q51.0', 'Q51.5', 'P37.1', 'P35.8', 'C82.6', 'M35.2', 'M33.2', 'Q78.0', 'T88.3', 'D81.5', 'G70.0', 'Q85.1', 'M31.0+', 'N08.5*', 'M30.1', 'M31.7', 'M31.3', 'B75', 'E71.0', 'Q61.2', 'Q21.1', 'A54.3+', 'H13.1*', 'M31.4', 'A48.1', 'L10.0', 'F84.1', 'M30.3', 'Q99.0', 'E80.3', 'M72.2', 'E24.1', 'G40.8', 'K75.4', 'K74.3', 'L13.0', 'A05.1', 'L44.0', 'Q31.3', 'M91.1', 'G23.1', 'Q99.1', 'J63.2', 'Q97.0', 'D82.2', 'Q77.0', 'K03.5', 'P91.6', 'M11.1', 'Q16.1', 'Q01.1', 'Q31.1', 'Q78.3', 'Q06.2', 'U04.9', 'Q86.1', 'P35.1', 'D76.2', 'G23.0', 'Q72.4', 'O14.2', 'Q78.4', 'L40.1', 'G51.2', 'Q80.3', 'Q80.4', 'L81.3', 'O01.1', 'O01.0', 'I67.5', 'E31.0', 'Q06.0', 'Q44.6', 'L98.2', 
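A few codes in the list end in `+` or `*` (for example `M31.0+` and `N08.5*`), which look like the ICD-10 dagger and asterisk markers. EHR diagnosis tables usually store the bare code, so these entries may fail to match unless the marker is removed first. Whether the N3C Enclave expects the markers stripped is an assumption here; the sketch only shows the normalization step.

```python
def strip_icd10_markers(code):
    """Remove surrounding whitespace and trailing dagger (+) or asterisk (*) markers."""
    return code.strip().rstrip("+*")


# A few codes copied from the list above.
codes = ["Q80.1", "M31.0+", "N08.5*", "A54.3+", "H13.1*"]
clean_codes = [strip_icd10_markers(code) for code in codes]
print(clean_codes)
```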

### Diseases dropped from the ICD-10 code mapping due to the incidence filter
 These 8 diseases were dropped from the list by the incidence filter after mapping the 357 rare disease ICD-10 codes above to the N3C Enclave. Their incidence rate exceeds 6 per 10,000, and hence they were dropped.

In [2]:
## The following seven rare diseases were discarded by the incidence filter from the 357 rare disease ICD-10 codes above, which gives the final 350 ICD-10 codes
rd_discard = ['Q21.1','H90.5','M35.3','G50.1','L81.3','E85.4','A31.0','M72.2']
rd_discard

['Q21.1', 'H90.5', 'M35.3', 'G50.1', 'L81.3', 'E85.4', 'A31.0','M72.2']
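The discard list can be applied to the code list with a simple membership test. The sketch below also reports discard codes that are absent from the candidate list, which helps reconcile the counts before and after filtering. The five candidate codes are a short excerpt from the list printed earlier, not the full set.

```python
rd_discard = ['Q21.1', 'H90.5', 'M35.3', 'G50.1', 'L81.3', 'E85.4', 'A31.0', 'M72.2']

# Short excerpt of the candidate ICD-10 codes printed above (not the full list).
candidate_codes = ['Q80.1', 'Q21.1', 'L81.3', 'M72.2', 'D56.1']

discard = set(rd_discard)
kept_codes = [code for code in candidate_codes if code not in discard]

# Discard codes that never appeared among the candidates.
not_found = sorted(discard - set(candidate_codes))

print(kept_codes)
print(not_found)
```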