# Orphanet RD Linearization

All Orphanet disorders in the database are included in the Orphanet classification of rare diseases. This classification follows a principle of polyhierarchy: a disorder is placed in every classification corresponding to a medical specialty to which it is relevant, so an entity can have as many parents as needed.

Nevertheless, to sort rare diseases by medical specialty and to avoid counting a disease more than once, a monohierarchical view (a linearization) is needed, in which each disease belongs to exactly one medical specialty. The linearization selects, for every rare disease included in the Orphanet database, the one medical specialty that takes precedence. The linearization classes are as follows:

- Rare abdominal surgical disease
- Rare bone disease
- Rare cardiac disease
- Rare circulatory system disease
- Rare developmental defect during embryogenesis
- Rare disorder due to toxic effects
- Rare disorder potentially indicated for transplant or complication after transplantation
- Rare endocrine disease
- Rare gastroenterologic disease
- Rare genetic disease
- Rare gynecologic or obstetric disease
- Rare hematologic disease
- Rare hepatic disease
- Rare immune disease
- Rare inborn errors of metabolism
- Rare infectious disease
- Rare infertility
- Rare maxillo-facial surgical disease
- Rare neoplastic disease
- Rare neurologic disease
- Rare odontologic disease
- Rare ophthalmic disorder
- Rare otorhinolaryngologic disease
- Rare renal disease
- Rare respiratory disease
- Rare skin disease
- Rare surgical cardiac disease
- Rare surgical thoracic disease
- Rare systemic or rheumatologic disease
- Rare urogenital disease 
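
To make the counting problem concrete, the sketch below uses a small made-up table (the OrphaCodes 1 to 3 and their class memberships are hypothetical, not taken from Orphanet). It compares a per-class count over a polyhierarchy with a count over a linearization table that holds exactly one class per disorder.

```python
import pandas as pd

# Hypothetical toy data: one row per (disorder, classification) membership.
# A disorder may sit under several classifications (polyhierarchy).
memberships = pd.DataFrame({
    "OrphaCode": [1, 1, 2, 3, 3, 3],
    "Classification": [
        "Rare neurologic disease", "Rare genetic disease",
        "Rare skin disease",
        "Rare renal disease", "Rare genetic disease",
        "Rare developmental defect during embryogenesis",
    ],
})

# Hypothetical linearization table: exactly one preferred class per disorder.
linearization = pd.DataFrame({
    "OrphaCode": [1, 2, 3],
    "Linearization Class": [
        "Rare neurologic disease", "Rare skin disease", "Rare renal disease",
    ],
})

# Summing per-class counts over the polyhierarchy counts disorders 1 and 3 several times.
polyhierarchy_total = int(
    memberships.groupby("Classification")["OrphaCode"].nunique().sum()
)

# Summing per-class counts over the linearization counts each disorder once.
linearized_total = int(linearization["Linearization Class"].value_counts().sum())

print("Sum of per-class counts (polyhierarchy):", polyhierarchy_total)
print("Sum of per-class counts (linearization):", linearized_total)
```

The example only illustrates the effect on counts; how Orphanet decides which specialty takes precedence is not modelled here.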

In [2]:
import pandas as pd
import numpy as np
import csv
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:
# Read final ICD10 codes for linearization
rd_gard_icd10 = pd.read_csv('Final_ICD10_codes_for_N3C_mapping.csv', encoding='utf-16-le')
rd_gard_icd10 = rd_gard_icd10[['GardID','OrphaCode','name','code']]
rd_gard_icd10

Unnamed: 0,GardID,OrphaCode,name,code
0,7904,461,Recessive X-linked ichthyosis,Q80.1
1,871,848,Beta-thalassemia,D56.1
2,621,846,Alpha-thalassemia,D56.0
3,5696,778,Rett syndrome,F84.2
4,8378,731,Autosomal recessive polycystic kidney disease,Q61.1
...,...,...,...,...
352,19690,99827,Crimean-Congo hemorrhagic fever,A98.0
353,19688,99824,Lassa fever,A96.2
354,9444,99826,Marburg hemorrhagic fever,A98.3
355,22442,615943,Granuloma faciale,L92.2


In [51]:
rd_gard_snomed = pd.read_csv('Gard_Orpha_Snomed_file_for_RD_classification.txt', delimiter='\t')
rd_gard_snomed

Unnamed: 0,GardID,ORPHAcode,SourceName,SNOMED concept ID
0,69,319247,Hantavirus pulmonary syndrome,120639003
1,86,314597,Chudley-McCullough syndrome,773610007
2,73,101088,X-linked hyper-IgM syndrome,403835002
3,79,33067,"Metaphyseal chondrodysplasia, Jansen type",24629003
4,16,1125,"Ocular motor apraxia, Cogan type",405809000
...,...,...,...,...
6322,22369,592564,GNAO1-related developmental delay-seizures-mov...,1281842000
6323,22370,592570,TRAF7-associated heart defect-digital anomalie...,1208998007
6324,22371,592574,Menke-Hennekam syndrome,1260095004
6325,22388,595356,Localized dystrophic epidermolysis bullosa,254186008


In [6]:
# Orphanet rare disease linearization file for different classes of rare diseases
orphanet_linearization = pd.read_csv('ORPHAlinearisation_en_ES.csv', encoding='unicode_escape')
orphanet_linearization 

Unnamed: 0,OrphaCode,Name,Linearization Class
0,276413,10q22.3q23.3 microdeletion syndrome,Rare developmental defect during embryogenesis
1,276422,10q22.3q23.3 microduplication syndrome,Rare developmental defect during embryogenesis
2,300305,11p15.4 microduplication syndrome,Rare developmental defect during embryogenesis
3,444002,11q22.2q22.3 microdeletion syndrome,Rare developmental defect during embryogenesis
4,313884,12p12.1 microdeletion syndrome,Rare developmental defect during embryogenesis
...,...,...,...
7320,295187,Zygodactyly type 1,Rare developmental defect during embryogenesis
7321,295189,Zygodactyly type 2,Rare developmental defect during embryogenesis
7322,295191,Zygodactyly type 3,Rare developmental defect during embryogenesis
7323,295193,Zygodactyly type 4,Rare developmental defect during embryogenesis


In [7]:
## Linearization of ICD10 codes
df_merge_icd10 = pd.merge(rd_gard_icd10, orphanet_linearization, how = 'inner',  on = 'OrphaCode')
df_linearization_icd10 = df_merge_icd10[['GardID','OrphaCode','Name','code','Linearization Class']]
df_linearization_icd10

Unnamed: 0,GardID,OrphaCode,Name,code,Linearization Class
0,7904,461,Recessive X-linked ichthyosis,Q80.1,Rare skin disease
1,871,848,Beta-thalassemia,D56.1,Rare hematologic disease
2,621,846,Alpha-thalassemia,D56.0,Rare hematologic disease
3,5696,778,Rett syndrome,F84.2,Rare neurologic disease
4,8378,731,Autosomal recessive polycystic kidney disease,Q61.1,Rare renal disease
...,...,...,...,...,...
352,19690,99827,Crimean-Congo hemorrhagic fever,A98.0,Rare infectious disease
353,19688,99824,Lassa fever,A96.2,Rare infectious disease
354,9444,99826,Marburg hemorrhagic fever,A98.3,Rare infectious disease
355,22442,615943,Granuloma faciale,L92.2,Rare skin disease


In [8]:
# Count the linearization classes covered by the 357 rare diseases with ICD-10 codes. Only 24 of the 30 classes are represented; the remaining 6 classes have no ICD-10 codes.
len(df_linearization_icd10['Linearization Class'].unique())

24
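
The classes that have no ICD-10 codes could be listed with a set difference, as in the sketch below. It uses small stand-in tables: the two ICD-10 rows are copied from the output above, while the `Rare infertility` row with OrphaCode 0 is a hypothetical placeholder. In the notebook, `orphanet_linearization` and `df_linearization_icd10` would take the place of the `_demo` frames.

```python
import pandas as pd

# Stand-in for orphanet_linearization (the last row is a hypothetical placeholder).
orphanet_linearization_demo = pd.DataFrame({
    "OrphaCode": [461, 731, 0],
    "Linearization Class": [
        "Rare skin disease", "Rare renal disease", "Rare infertility",
    ],
})

# Stand-in for df_linearization_icd10 (rows copied from the output above).
df_linearization_icd10_demo = pd.DataFrame({
    "OrphaCode": [461, 731],
    "code": ["Q80.1", "Q61.1"],
    "Linearization Class": ["Rare skin disease", "Rare renal disease"],
})

# Classes present in the linearization file but absent from the ICD-10 table.
all_classes = set(orphanet_linearization_demo["Linearization Class"].dropna())
icd10_classes = set(df_linearization_icd10_demo["Linearization Class"].dropna())
classes_without_icd10 = sorted(all_classes - icd10_classes)

print(classes_without_icd10)
```

This compares against the classes that occur in the linearization file, which may be fewer than the 30 classes listed at the top if a class has no entries in the file.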

In [9]:
## Linearization of SNOMED-CT codes: merge the Orphanet rare disease linearization file with the GARD-Orpha-SNOMED-CT file.
df_merge = pd.merge(rd_gard_snomed, orphanet_linearization, how = 'inner',  left_on = 'ORPHAcode', right_on = 'OrphaCode')
df_linearization_snct = df_merge[['GardID','OrphaCode','SNOMED concept ID','Name','Linearization Class']]
df_linearization_snct

Unnamed: 0,GardID,OrphaCode,SNOMED concept ID,Name,Linearization Class
0,69,319247,120639003,Hantavirus pulmonary syndrome,Rare infectious disease
1,86,314597,773610007,Chudley-McCullough syndrome,Rare developmental defect during embryogenesis
2,73,101088,403835002,X-linked hyper-IgM syndrome,Rare immune disease
3,79,33067,24629003,"Metaphyseal chondrodysplasia, Jansen type",Rare bone disease
4,16,1125,405809000,"Ocular motor apraxia, Cogan type",Rare ophthalmic disorder
...,...,...,...,...,...
6317,22369,592564,1281842000,GNAO1-related developmental delay-seizures-mov...,Rare neurologic disease
6318,22370,592570,1208998007,TRAF7-associated heart defect-digital anomalie...,Rare developmental defect during embryogenesis
6319,22371,592574,1260095004,Menke-Hennekam syndrome,Rare developmental defect during embryogenesis
6320,22388,595356,254186008,Localized dystrophic epidermolysis bullosa,Rare skin disease


In [33]:
# Find GARD-Orphanet-SNOMED-CT entries that have no linearization class. Merge the dataframes on 'ORPHAcode' and 'OrphaCode'.
merged_df = pd.merge(rd_gard_snomed, orphanet_linearization, left_on='ORPHAcode', right_on='OrphaCode', how='inner')

# Find the intersection of elements (rows that are present in both dataframes)
intersection_df = merged_df

# Find rows in rd_gard_snomed that are not present in orphanet_linearization
non_intersection_df = pd.merge(rd_gard_snomed, orphanet_linearization, left_on='ORPHAcode', right_on='OrphaCode', how='left', indicator=True)
non_intersection_df = non_intersection_df[non_intersection_df['_merge'] == 'left_only'].drop(columns=['_merge'])

# Display the results
print("Intersection of rd_gard_snomed and orphanet_linearization:", intersection_df.shape)

print("\nRows in rd_gard_snomed not present in orphanet_linearization:")
non_intersection_df = non_intersection_df[['GardID','ORPHAcode','SourceName','SNOMED concept ID']]
# Rename 'SourceName' to 'Name'
non_intersection_df = non_intersection_df.rename(columns={'SourceName': 'Name', 'ORPHAcode':'OrphaCode'})

# Select the required columns
non_intersection_df = non_intersection_df[['GardID', 'OrphaCode', 'Name', 'SNOMED concept ID']]

non_intersection_df

Intersection of rd_gard_snomed and orphanet_linearization: (6322, 7)

Rows in rd_gard_snomed not present in orphanet_linearization:


Unnamed: 0,GardID,OrphaCode,Name,SNOMED concept ID
2033,4598,3188,Congenital pulmonary veins atresia or stenosis,11614003
2698,7827,3389,Tuberculosis,56717001
5694,20951,268369,Spina bifida aperta,58557008
5696,20966,268813,Myelocystocele,203994003
6295,22317,573278,Split cord malformation,445308004
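
As an alternative to the left merge with `indicator=True`, the same unmatched rows could be selected with a boolean mask built from `Series.isin`. The sketch below uses two rows copied from the outputs above as stand-ins for `rd_gard_snomed` and `orphanet_linearization`; the `_demo` names are introduced here for illustration only.

```python
import pandas as pd

# Stand-in for rd_gard_snomed (rows copied from the outputs above).
snomed_demo = pd.DataFrame({
    "GardID": [69, 7827],
    "ORPHAcode": [319247, 3389],
    "SourceName": ["Hantavirus pulmonary syndrome", "Tuberculosis"],
    "SNOMED concept ID": [120639003, 56717001],
})

# Stand-in for orphanet_linearization.
linearization_demo = pd.DataFrame({
    "OrphaCode": [319247],
    "Name": ["Hantavirus pulmonary syndrome"],
    "Linearization Class": ["Rare infectious disease"],
})

# True where the ORPHAcode has a linearization class.
has_class = snomed_demo["ORPHAcode"].isin(linearization_demo["OrphaCode"])

# Keep only the rows without a class and align the column names.
unlinearized = snomed_demo.loc[~has_class].rename(
    columns={"SourceName": "Name", "ORPHAcode": "OrphaCode"}
)

print(unlinearized)
```

Unlike a merge, the mask never duplicates rows of the left table when the right table contains repeated keys.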


In [34]:
# Concatenate the two dataframes
concatenated_df = pd.concat([df_linearization_snct, non_intersection_df], ignore_index=False)
concatenated_df = concatenated_df[['GardID', 'Name', 'OrphaCode', 'Linearization Class', 'SNOMED concept ID']]
concatenated_df

Unnamed: 0,GardID,Name,OrphaCode,Linearization Class,SNOMED concept ID
0,69,Hantavirus pulmonary syndrome,319247,Rare infectious disease,120639003
1,86,Chudley-McCullough syndrome,314597,Rare developmental defect during embryogenesis,773610007
2,73,X-linked hyper-IgM syndrome,101088,Rare immune disease,403835002
3,79,"Metaphyseal chondrodysplasia, Jansen type",33067,Rare bone disease,24629003
4,16,"Ocular motor apraxia, Cogan type",1125,Rare ophthalmic disorder,405809000
...,...,...,...,...,...
2033,4598,Congenital pulmonary veins atresia or stenosis,3188,,11614003
2698,7827,Tuberculosis,3389,,56717001
5694,20951,Spina bifida aperta,268369,,58557008
5696,20966,Myelocystocele,268813,,203994003


In [37]:
# Perform an outer join on 'OrphaCode' to keep all rows from both dataframes
df_merged = pd.merge(concatenated_df, df_linearization_icd10[['GardID', 'OrphaCode', 'code', 'Linearization Class', 'Name']], 
                     on='OrphaCode', how='outer')


df_merged

Unnamed: 0,GardID_x,Name_x,OrphaCode,Linearization Class_x,SNOMED concept ID,GardID_y,code,Linearization Class_y,Name_y
0,6867.0,Long chain 3-hydroxyacyl-CoA dehydrogenase def...,5,Rare inborn errors of metabolism,7.260210e+08,,,,
1,10954.0,3-methylcrotonyl-CoA carboxylase deficiency,6,Rare inborn errors of metabolism,1.314400e+07,,,,
2,5666.0,3C syndrome,7,Rare developmental defect during embryogenesis,7.185560e+08,,,,
3,5674.0,"47,XYY syndrome",8,Rare developmental defect during embryogenesis,5.074901e+07,5674.0,Q98.5,Rare developmental defect during embryogenesis,"47,XYY syndrome"
4,7754.0,Tetrasomy X,9,Rare developmental defect during embryogenesis,1.056700e+07,,,,
...,...,...,...,...,...,...,...,...,...
6345,22388.0,Localized dystrophic epidermolysis bullosa,595356,Rare skin disease,2.541860e+08,,,,
6346,22391.0,IgG4-related systemic disease,596448,Rare systemic or rheumatologic disease,1.074327e+16,,,,
6347,,,600998,,,22426.0,Q43.7,Rare developmental defect during embryogenesis,Non-syndromic cloacal malformation
6348,,,615943,,,22442.0,L92.2,Rare skin disease,Granuloma faciale


In [41]:
# Perform an outer join on 'OrphaCode' to keep all rows from both dataframes
df_merged = pd.merge(concatenated_df, df_linearization_icd10[['GardID', 'OrphaCode', 'code', 'Linearization Class', 'Name']], 
                     on='OrphaCode', how='outer')

# Coalesce GardID: take GardID_x and fall back to GardID_y where it is missing
df_merged['GardID'] = df_merged['GardID_x'].fillna(df_merged['GardID_y'])

# Coalesce Name the same way
df_merged['Name'] = df_merged['Name_x'].fillna(df_merged['Name_y'])

# Coalesce Linearization Class the same way
df_merged['Linearization Class'] = df_merged['Linearization Class_x'].fillna(df_merged['Linearization Class_y'])

# Drop the '_x' and '_y' columns
df_merged.drop(columns=['GardID_x', 'GardID_y', 'Name_x', 'Name_y', 'Linearization Class_x', 'Linearization Class_y'], inplace=True)

df_merged = df_merged[['GardID','Name','OrphaCode','Linearization Class','SNOMED concept ID','code']]
# Display the merged dataframe
df_merged

Unnamed: 0,GardID,Name,OrphaCode,Linearization Class,SNOMED concept ID,code
0,6867.0,Long chain 3-hydroxyacyl-CoA dehydrogenase def...,5,Rare inborn errors of metabolism,7.260210e+08,
1,10954.0,3-methylcrotonyl-CoA carboxylase deficiency,6,Rare inborn errors of metabolism,1.314400e+07,
2,5666.0,3C syndrome,7,Rare developmental defect during embryogenesis,7.185560e+08,
3,5674.0,"47,XYY syndrome",8,Rare developmental defect during embryogenesis,5.074901e+07,Q98.5
4,7754.0,Tetrasomy X,9,Rare developmental defect during embryogenesis,1.056700e+07,
...,...,...,...,...,...,...
6345,22388.0,Localized dystrophic epidermolysis bullosa,595356,Rare skin disease,2.541860e+08,
6346,22391.0,IgG4-related systemic disease,596448,Rare systemic or rheumatologic disease,1.074327e+16,
6347,22426.0,Non-syndromic cloacal malformation,600998,Rare developmental defect during embryogenesis,,Q43.7
6348,22442.0,Granuloma faciale,615943,Rare skin disease,,L92.2
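
In the merged output, `GardID` and `SNOMED concept ID` appear as floats (for example `1.074327e+16`) because the outer join introduces missing values into integer columns. Identifiers above 2**53 cannot all be represented exactly as float64, so long concept IDs may be altered. One possible safeguard, sketched below, is to cast the ID column to the nullable `Int64` dtype before merging. The 17-digit ID and the OrphaCodes 1001 and 1002 are made up for the demonstration.

```python
import pandas as pd

# One row with a hypothetical 17-digit concept ID (odd and above 2**53 on purpose).
snomed_side = pd.DataFrame({
    "OrphaCode": [1001],
    "SNOMED concept ID": [10743270000000001],
})

# A row with no SNOMED match, so the outer join creates a missing value.
icd10_side = pd.DataFrame({"OrphaCode": [1002], "code": ["L92.2"]})

# Default behaviour: the int64 column is upcast to float64 to hold NaN.
as_float = pd.merge(snomed_side, icd10_side, on="OrphaCode", how="outer")

# Nullable integers: cast first so that missing values become <NA> instead.
snomed_nullable = snomed_side.astype({"SNOMED concept ID": "Int64"})
as_int = pd.merge(snomed_nullable, icd10_side, on="OrphaCode", how="outer")

print(as_float["SNOMED concept ID"].dtype, as_int["SNOMED concept ID"].dtype)
print(int(as_float["SNOMED concept ID"].dropna().iloc[0]))
print(int(as_int["SNOMED concept ID"].dropna().iloc[0]))
```

The same cast could be applied to `GardID` so that it is written to the output file as an integer rather than as `6867.0`.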


In [42]:
## Save the merged table as a tab-separated .txt file
df_merged.to_csv('GARD_ORPHA_SNOMED_ICD10_linearization_Suppl1.txt', sep='\t', index=False)

In [3]:
rd_gard_snct_icd10_combine = pd.read_csv('GARD_ORPHA_SNOMED_ICD10_linearization_Suppl1.txt', delimiter='\t')
rd_gard_snct_icd10_combine

Unnamed: 0,GardID,Name,OrphaCode,Linearization Class,SNOMED concept ID,code
0,6867.0,Long chain 3-hydroxyacyl-CoA dehydrogenase def...,5,Rare inborn errors of metabolism,7.260210e+08,
1,10954.0,3-methylcrotonyl-CoA carboxylase deficiency,6,Rare inborn errors of metabolism,1.314400e+07,
2,5666.0,3C syndrome,7,Rare developmental defect during embryogenesis,7.185560e+08,
3,5674.0,"47,XYY syndrome",8,Rare developmental defect during embryogenesis,5.074901e+07,Q98.5
4,7754.0,Tetrasomy X,9,Rare developmental defect during embryogenesis,1.056700e+07,
...,...,...,...,...,...,...
6345,22388.0,Localized dystrophic epidermolysis bullosa,595356,Rare skin disease,2.541860e+08,
6346,22391.0,IgG4-related systemic disease,596448,Rare systemic or rheumatologic disease,1.074327e+16,
6347,22426.0,Non-syndromic cloacal malformation,600998,Rare developmental defect during embryogenesis,,Q43.7
6348,22442.0,Granuloma faciale,615943,Rare skin disease,,L92.2


In [57]:
# Number of unique rare diseases (by GARD ID) in the combined GARD-Orphanet-SNOMED-CT-ICD-10 table
len(rd_gard_snct_icd10_combine['GardID'].unique())

6342

In [4]:
## Rare diseases with SNOMED-CT codes and their descendants, after removing groups of disorders and HPO-mapped entries
rd_snomed_descendants = pd.read_csv('SNOMED_CT_codes_with_descendents_final_code_for_N3C.txt', delimiter='\t')
rd_snomed_descendants

Unnamed: 0,Code,Name,HPO
0,770401007,10q22.3q23.3 microdeletion syndrome,"[0, [], null, []]"
1,782669004,10q22.3q23.3 microduplication syndrome,"[0, [], null, []]"
2,770794008,11p15.4 microduplication syndrome,"[0, [], null, []]"
3,1229882003,11q22.2q22.3 microdeletion syndrome,"[0, [], null, []]"
4,778007004,12p12.1 microdeletion syndrome,"[0, [], null, []]"
...,...,...,...
12076,16468121000119106,Zika virus infection during pregnancy,"[0, [], null, []]"
12077,699447001,Zimmermann-Laband syndrome,"[0, [], null, []]"
12078,716248001,Zlotogora Ogur syndrome,"[0, [], null, []]"
12079,15207003,Zoonotic form of cutaneous leishmaniasis,"[0, [], null, []]"


In [5]:
## Check how many rare diseases with GARD IDs are present in the OHDSI list: 5949 of 6327 are mapped; the rest were dropped due to HPO mapping.
len(set(rd_gard_snomed['SNOMED concept ID'].tolist()) & set(rd_snomed_descendants['Code'].tolist()))

5949

### ORPHANET linearization file of rare disease classes for SNOMED-CT & ICD10 code linearization

In [6]:
# Orphanet rare disease linearization file for different classes of rare diseases (from Orphanet)
orphanet_linearization = pd.read_csv('ORPHAlinearisation_en_ES.csv', encoding='unicode_escape')
orphanet_linearization 

Unnamed: 0,OrphaCode,Name,Linearization Class
0,276413,10q22.3q23.3 microdeletion syndrome,Rare developmental defect during embryogenesis
1,276422,10q22.3q23.3 microduplication syndrome,Rare developmental defect during embryogenesis
2,300305,11p15.4 microduplication syndrome,Rare developmental defect during embryogenesis
3,444002,11q22.2q22.3 microdeletion syndrome,Rare developmental defect during embryogenesis
4,313884,12p12.1 microdeletion syndrome,Rare developmental defect during embryogenesis
...,...,...,...
7320,295187,Zygodactyly type 1,Rare developmental defect during embryogenesis
7321,295189,Zygodactyly type 2,Rare developmental defect during embryogenesis
7322,295191,Zygodactyly type 3,Rare developmental defect during embryogenesis
7323,295193,Zygodactyly type 4,Rare developmental defect during embryogenesis


In [14]:
# Merge the dataframes on 'OrphaCode' and 'ORPHAcode'
merged_df = pd.merge(rd_gard_snomed, orphanet_linearization, left_on='ORPHAcode', right_on='OrphaCode', how='inner')

# Find the intersection of elements (rows that are present in both dataframes)
intersection_df = merged_df

# Find rows in rd_gard_snomed that are not present in orphanet_linearization
non_intersection_df = pd.merge(rd_gard_snomed, orphanet_linearization, left_on='ORPHAcode', right_on='OrphaCode', how='left', indicator=True)
non_intersection_df = non_intersection_df[non_intersection_df['_merge'] == 'left_only'].drop(columns=['_merge'])

# Display the results
print("Intersection of rd_gard_snomed and orphanet_linearization:", intersection_df.shape)

print("\nRows in rd_gard_snomed not present in orphanet_linearization:")
print(non_intersection_df)

Intersection of rd_gard_snomed and orphanet_linearization: (6322, 7)

Rows in rd_gard_snomed not present in orphanet_linearization:
      GardID  ORPHAcode                                      SourceName  \
2033    4598       3188  Congenital pulmonary veins atresia or stenosis   
2698    7827       3389                                    Tuberculosis   
5694   20951     268369                             Spina bifida aperta   
5696   20966     268813                                  Myelocystocele   
6295   22317     573278                         Split cord malformation   

      SNOMED concept ID  OrphaCode Name Linearization Class  
2033           11614003        NaN  NaN                 NaN  
2698           56717001        NaN  NaN                 NaN  
5694           58557008        NaN  NaN                 NaN  
5696          203994003        NaN  NaN                 NaN  
6295          445308004        NaN  NaN                 NaN  


In [18]:
# List of unique classes of rare diseases based on Orphanet linearization
unique_classes = orphanet_linearization['Linearization Class'].unique()

# Loop through each class and print the count
for rare_class in unique_classes:
    rare_diseases = orphanet_linearization[orphanet_linearization['Linearization Class'] == rare_class]
    print(f"# {rare_class}:", rare_diseases.shape[0])

# Rare developmental defect during embryogenesis: 2319
# Rare bone disease: 394
# Rare neurologic disease: 1208
# Rare neoplastic disease: 554
# Rare inborn errors of metabolism: 520
# Rare disorder due to toxic effects: 25
# Rare hematologic disease: 220
# Rare systemic or rheumatologic disease: 191
# Rare renal disease: 134
# Rare endocrine disease: 237
# Rare skin disease: 434
# Rare ophthalmic disorder: 257
# Rare immune disease: 183
# Rare infectious disease: 186
# Rare hepatic disease: 74
# Rare respiratory disease: 82
# Rare circulatory system disease: 21
# Rare abdominal surgical disease: 7
# Rare gastroenterologic disease: 95
# Rare odontologic disease: 24
# Rare gynecologic or obstetric disease: 18
# Rare surgical thoracic disease: 4
# Rare cardiac disease: 52
# Rare otorhinolaryngologic disease: 51
# Rare maxillo-facial surgical disease: 7
# Rare disorder potentially indicated for transplant or complication after transplantation: 2
# Rare genetic disease: 3
# Rare urogenital
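
The per-class counts printed by the loop above can also be obtained in one call with `Series.value_counts`. The sketch below uses a three-row stand-in for `orphanet_linearization`, with rows copied from the outputs above.

```python
import pandas as pd

# Stand-in for orphanet_linearization (rows copied from the outputs above).
linearization_demo = pd.DataFrame({
    "OrphaCode": [276413, 276422, 461],
    "Linearization Class": [
        "Rare developmental defect during embryogenesis",
        "Rare developmental defect during embryogenesis",
        "Rare skin disease",
    ],
})

# One count per class, sorted from the largest class to the smallest.
class_counts = linearization_demo["Linearization Class"].value_counts()

for rare_class, n in class_counts.items():
    print(f"# {rare_class}: {n}")
```

Note that `value_counts` orders classes by descending count, whereas the loop over `unique()` follows the order of first appearance in the file.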

### ICD-10 codes mapping with Orphanet RD linearization file

In [25]:
df_merge_icd10 = pd.merge(rd_gard_icd10, orphanet_linearization , how = 'inner',  on = 'OrphaCode')
df_linearization_icd10 = df_merge_icd10[['GardID','OrphaCode','Name','code','Linearization Class']]
df_linearization_icd10

Unnamed: 0,GardID,OrphaCode,Name,code,Linearization Class
0,7904,461,Recessive X-linked ichthyosis,Q80.1,Rare skin disease
1,871,848,Beta-thalassemia,D56.1,Rare hematologic disease
2,621,846,Alpha-thalassemia,D56.0,Rare hematologic disease
3,5696,778,Rett syndrome,F84.2,Rare neurologic disease
4,8378,731,Autosomal recessive polycystic kidney disease,Q61.1,Rare renal disease
...,...,...,...,...,...
352,19690,99827,Crimean-Congo hemorrhagic fever,A98.0,Rare infectious disease
353,19688,99824,Lassa fever,A96.2,Rare infectious disease
354,9444,99826,Marburg hemorrhagic fever,A98.3,Rare infectious disease
355,22442,615943,Granuloma faciale,L92.2,Rare skin disease


In [26]:
# Count the linearization classes covered by the 357 rare diseases with ICD-10 codes. Only 24 of the 30 classes are represented; the remaining 6 classes have no ICD-10 codes.
len(df_linearization_icd10['Linearization Class'].unique())

24

In [21]:
df_linearization_icd10.to_csv('GARD_ORPHACODE_ICD10_code_linearization.csv', index = False)

In [28]:
# List of unique classes of rare diseases
unique_classes = df_linearization_icd10['Linearization Class'].unique()

# Loop through each class and print the count and codes
for rare_class in unique_classes:
    rare_diseases = df_linearization_icd10[df_linearization_icd10['Linearization Class'] == rare_class]
    print(f"# {rare_class} : {rare_diseases['code'].tolist()}\n")

# Rare skin disease : ['Q80.1', 'L93.1', 'L10.0', 'M72.2', 'L13.0', 'L44.0', 'L40.1', 'G51.2', 'Q80.3', 'Q80.4', 'L81.3', 'L98.2', 'Q82.1', 'L57.1', 'L66.2', 'L66.3', 'L98.3', 'L72.2', 'L85.0', 'L68.1', 'L66.1', 'Q84.3', 'L43.1', 'M35.6', 'L92.1', 'L95.0', 'L12.3', 'L13.1', 'L95.1', 'L94.0', 'Q81.1', 'L10.1', 'E52', 'L51.2', 'L92.2']

# Rare hematologic disease : ['D56.1', 'D56.0', 'O14.2', 'D58.0', 'D56.2', 'D47.3', 'D68.1', 'D61.0', 'D45', 'D56.4', 'D67', 'D60.1', 'D66']

# Rare neurologic disease : ['F84.2', 'G10', 'F84.3', 'G60.1', 'G90.1', 'T88.3', 'G70.0', 'F84.1', 'G40.8', 'G23.1', 'P91.6', 'G23.0', 'I67.5', 'G37.1', 'G14', 'G23.3', 'G37.5', 'G24.1', 'G47.4', 'G93.2', 'T57.2', 'P94.0', 'G50.1', 'G96.0', 'G70.2', 'G36.0', 'G37.0', 'G73.1', 'G12.0', 'G21.3', 'G21.0', 'F80.3', 'G23.2', 'G90.5', 'G90.6']

# Rare renal disease : ['Q61.1', 'Q61.2', 'Q61.5']

# Rare systemic or rheumatologic disease : ['Q87.4', 'M35.2', 'M33.2', 'M31.0+', 'N08.5*', 'M30.1', 'M31.7', 'M31.3', 'M31.4', '

In [29]:
# List of unique classes of rare diseases
unique_classes = df_linearization_icd10['Linearization Class'].unique()

# List to store individual DataFrames
dfs = []

# Loop through each class and create a DataFrame for the codes and names
for rare_class in unique_classes:
    rare_diseases = df_linearization_icd10[df_linearization_icd10['Linearization Class'] == rare_class]
    codes_and_names_df = pd.DataFrame({
        'Name': rare_diseases['Name'],
        'Code': rare_diseases['code'],
        'RD_linearization': rare_class  # Add the rare_class name to the new column
    })
    
    # Append the DataFrame to the list
    dfs.append(codes_and_names_df)
    
    print(f"DataFrame for {rare_class}:\n{codes_and_names_df}\n")

# Concatenate all DataFrames in the list
final_df = pd.concat(dfs, ignore_index=True)

# Print the final concatenated DataFrame
print("Concatenated DataFrame with 'RD_linearization' column:\n", final_df)

DataFrame for Rare skin disease:
                                                  Name   Code  \
0                        Recessive X-linked ichthyosis  Q80.1   
11              Subacute cutaneous lupus erythematosus  L93.1   
62                                  Pemphigus vulgaris  L10.0   
67                                  Ledderhose disease  M72.2   
72                            Dermatitis herpetiformis  L13.0   
74                            Pityriasis rubra pilaris  L44.0   
99                      Generalized pustular psoriasis  L40.1   
100                      Melkersson-Rosenthal syndrome  G51.2   
101        Autosomal dominant epidermolytic ichthyosis  Q80.3   
102                               Harlequin ichthyosis  Q80.4   
103             Familial isolated café-au-lait macules  L81.3   
110                                     Sweet syndrome  L98.2   
117                              Xeroderma pigmentosum  Q82.1   
151                         Chronic actinic dermatitis  L

In [31]:
final_df.to_csv('ICD10_RD_classification.csv', index = False, encoding='utf-16-le')

### ORPHANET linearization of SNOMED-CT codes 

In [33]:
# Merge the Orphanet rare disease linearization file with the GARD-Orpha-SNOMED-CT file. The 5 rare diseases without a linearization class (listed above) are dropped by the inner join.
df_merge = pd.merge(rd_gard_snomed, orphanet_linearization, how = 'inner',  left_on = 'ORPHAcode', right_on = 'OrphaCode')
df_linearization_snct = df_merge[['GardID','OrphaCode','SNOMED concept ID','Name','Linearization Class']]
df_linearization_snct

Unnamed: 0,GardID,OrphaCode,SNOMED concept ID,Name,Linearization Class
0,69,319247,120639003,Hantavirus pulmonary syndrome,Rare infectious disease
1,86,314597,773610007,Chudley-McCullough syndrome,Rare developmental defect during embryogenesis
2,73,101088,403835002,X-linked hyper-IgM syndrome,Rare immune disease
3,79,33067,24629003,"Metaphyseal chondrodysplasia, Jansen type",Rare bone disease
4,16,1125,405809000,"Ocular motor apraxia, Cogan type",Rare ophthalmic disorder
...,...,...,...,...,...
6317,22369,592564,1281842000,GNAO1-related developmental delay-seizures-mov...,Rare neurologic disease
6318,22370,592570,1208998007,TRAF7-associated heart defect-digital anomalie...,Rare developmental defect during embryogenesis
6319,22371,592574,1260095004,Menke-Hennekam syndrome,Rare developmental defect during embryogenesis
6320,22388,595356,254186008,Localized dystrophic epidermolysis bullosa,Rare skin disease


In [34]:
df_linearization_snct.to_csv('GARD_ORPHACODE_SNOMEDCT_RD_linearization.csv',index = False, encoding='utf-16-le')

In [35]:
df_linearization_snct.to_csv('GARD_ORPHACODE_SNOMEDCT_RD_linearization.txt', sep='\t',index = False, encoding='utf-16-le')

In [36]:
# List of unique classes of rare diseases
unique_classes = df_linearization_snct['Linearization Class'].unique()

# For loop to iterate over the classes
for linearization_class in unique_classes:
    # Filter the dataframe for the current class
    class_df = df_linearization_snct[df_linearization_snct['Linearization Class'] == linearization_class]
    # Print the count of rows for the current class
    print(f'# {linearization_class} :', class_df.shape[0])

# Rare infectious disease : 169
# Rare developmental defect during embryogenesis : 2000
# Rare immune disease : 161
# Rare bone disease : 368
# Rare ophthalmic disorder : 236
# Rare inborn errors of metabolism : 424
# Rare endocrine disease : 195
# Rare neurologic disease : 1040
# Rare neoplastic disease : 488
# Rare skin disease : 391
# Rare renal disease : 90
# Rare odontologic disease : 24
# Rare systemic or rheumatologic disease : 162
# Rare gynecologic or obstetric disease : 12
# Rare hepatic disease : 55
# Rare hematologic disease : 189
# Rare circulatory system disease : 19
# Rare respiratory disease : 77
# Rare cardiac disease : 46
# Rare gastroenterologic disease : 74
# Rare otorhinolaryngologic disease : 46
# Rare abdominal surgical disease : 7
# Rare maxillo-facial surgical disease : 7
# Rare surgical thoracic disease : 2
# Rare disorder due to toxic effects : 23
# Rare urogenital disease : 7
# Rare infertility : 5
# Rare genetic disease : 3
# Rare disorder potentially indic

In [39]:
# List of unique classes of rare diseases
unique_classes = df_linearization_snct['Linearization Class'].unique()

# List to store individual DataFrames
dfs = []

# Loop through each class and create a DataFrame for the codes and names
for rare_class in unique_classes:
    rare_diseases = df_linearization_snct[df_linearization_snct['Linearization Class'] == rare_class]
    codes_and_names_df = pd.DataFrame({
        'GARD_ID': rare_diseases['GardID'],
        'OrphaCode': rare_diseases['OrphaCode'],
        'SNOMED_CT': rare_diseases['SNOMED concept ID'], 
        'Name': rare_diseases['Name'],
        'RD_linearization': rare_class  # Add the rare_class name to the new column
    })
    
    # Append the DataFrame to the list
    dfs.append(codes_and_names_df)
    
    print(f"DataFrame for {rare_class}:\n{codes_and_names_df}\n")

# Concatenate all DataFrames in the list
final_df = pd.concat(dfs, ignore_index=True)

# Print the final concatenated DataFrame
print("Concatenated DataFrame with 'RD_linearization' column:\n", final_df)


DataFrame for Rare infectious disease:
      GARD_ID  OrphaCode   SNOMED_CT  \
0          69     319247   120639003   
6          27      50839    79974007   
13        809       1223    57725006   
41        693       1070   442652006   
52        393        879    64612002   
...       ...        ...         ...   
6077    21880     449280  1217227009   
6094    21897     454836   772828001   
6254    22259     563684    59989004   
6255    22260     563687   726020009   
6256    22261     563690   819949002   

                                                   Name  \
0                         Hantavirus pulmonary syndrome   
6                                   Cat-scratch disease   
13                                        Balantidiasis   
41                                          Anisakiasis   
52                                            Tungiasis   
...                                                 ...   
6077                                     Scedosporiosis   
6094    

In [38]:
# List of unique classes of rare diseases
unique_classes = df_linearization_snct['Linearization Class'].unique()
# For loop to iterate over the classes
for linearization_class in unique_classes:
    # Filter the dataframe for the current class
    class_df = df_linearization_snct[df_linearization_snct['Linearization Class'] == linearization_class]
    # Print the concept IDs for the current class
    print(f'# {linearization_class} :', class_df['SNOMED concept ID'].tolist())
    print('\n')

# Rare infectious disease : [120639003, 79974007, 57725006, 442652006, 64612002, 38539003, 59925007, 414495006, 44917000, 26089000, 44250009, 417441005, 187151009, 19362000, 1231142001, 240626005, 716865000, 371423007, 1231140009, 10651001, 18116006, 27836007, 11817007, 186772009, 1214006, 714279000, 240849009, 42094007, 21954000, 47523006, 48113006, 23097003, 233850007, 716860005, 773738009, 63479002, 10087007, 392662004, 416925005, 61094002, 73328005, 1237418002, 21009004, 19265001, 61750000, 398565003, 77798004, 37109004, 74942003, 240820001, 699676006, 410039003, 88860002, 65110003, 36188001, 76902006, 712986001, 88264003, 65553006, 21061004, 75702008, 111864006, 63650001, 42386007, 20927009, 396334002, 186499007, 428638009, 52947006, 26726000, 80612004, 81004002, 4241002, 61462000, 29227009, 721764008, 22255007, 186788009, 14168008, 27031003, 77377001, 41545003, 16541001, 59051007, 231896005, 77503002, 60826002, 428111003, 1685005, 18504008, 4834000, 59277005, 266169003, 277869007

In [42]:
import os
import pandas as pd

# Define the path where the TXT files are located
txt_folder_path = r'C:\Users\....\RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files'

# List of TXT file names (these files were curated from OHDSI ATLAS to bring in the descendants of rare diseases)
txt_files = [
    'Rare_abdominal_surgical_disease.txt',
    'Rare_bone_disease.txt',
    'Rare_cardiac_disease.txt',
    'Rare_circulatory_system_disease.txt',
    'Rare_developmental_defect_during_embryogenesis.txt',
    'Rare_disorder_due_to_toxic_effects.txt',
    'Rare_endocrine_disease.txt',
    'Rare_gastroenterologic_disease.txt',
    'Rare_genetic_disease.txt',
    'Rare_gynecologic_or_obstetric_disease.txt',
    'Rare_hematologic_disease.txt',
    'Rare_hepatic_disease.txt',
    'Rare_immune_disease.txt',
    'Rare_inborn_errors_of_metabolism.txt',
    'Rare_infectious_disease.txt',
    'Rare_infertility_disease.txt',
    'Rare_maxillo_facial_surgical_disease.txt',
    'Rare_neoplastic_disease.txt',
    'Rare_neurologic_disease.txt',
    'Rare_odontologic_disease.txt',
    'Rare_ophthalmic_disorder.txt',
    'Rare_otorhinolaryngologic_disease.txt',
    'Rare_renal_disease.txt',
    'Rare_respiratory_disease.txt',
    'Rare_skin_disease.txt',
    'Rare_surgical_cardiac_disease.txt',
    'Rare_surgical_thoracic_disease.txt',
    'Rare_systemic_or_rheumatologic_disease.txt',
    'Rare_transplantation_disease.txt',
    'Rare_urogenital_disease.txt'
]

# Dictionary to store DataFrames for each TXT file
disease_data = {}

# Loop through the list of files and read them into a dictionary
for txt_file in txt_files:
    # Construct the full file path
    file_path = os.path.join(txt_folder_path, txt_file)
    
    # Read the TXT file into a DataFrame using UTF-8 encoding
    try:
        df = pd.read_csv(file_path, delimiter='\t', encoding='utf-8')  # Try using UTF-8 encoding
    except UnicodeDecodeError:
        print(f"UnicodeDecodeError: Unable to decode {txt_file} using UTF-8. Trying with ISO-8859-1.")
        df = pd.read_csv(file_path, delimiter='\t', encoding='ISO-8859-1')  # Fallback to ISO-8859-1
    
    # Remove the first null column (if it exists)
    if df.columns[0] == 'Unnamed: 0' or df.iloc[:, 0].isnull().all():  # Check if the first column is empty or unnamed
        df = df.drop(df.columns[0], axis=1)
    
    # Store the DataFrame in the dictionary with the disease name as the key
    disease_name = txt_file.split('.')[0]  # Extract the name from the file (without '.txt')
    disease_data[disease_name] = df
    
    # Print the disease name and its number of rows (df.shape[0])
    print(f"{disease_name} - Shape: {df.shape[0]}")

# Now, you can access the DataFrame for any disease from the dictionary
# For example, to access Rare infertility disease data:
rare_infertility_df = disease_data['Rare_infertility_disease']
print(rare_infertility_df.head())


Rare_abdominal_surgical_disease - Shape: 18
Rare_bone_disease - Shape: 495
Rare_cardiac_disease - Shape: 70
Rare_circulatory_system_disease - Shape: 61
Rare_developmental_defect_during_embryogenesis - Shape: 2960
Rare_disorder_due_to_toxic_effects - Shape: 123
Rare_endocrine_disease - Shape: 267
Rare_gastroenterologic_disease - Shape: 160
Rare_genetic_disease - Shape: 3
Rare_gynecologic_or_obstetric_disease - Shape: 31
Rare_hematologic_disease - Shape: 405
Rare_hepatic_disease - Shape: 88
Rare_immune_disease - Shape: 232
Rare_inborn_errors_of_metabolism - Shape: 545
Rare_infectious_disease - Shape: 1317
Rare_infertility_disease - Shape: 6
Rare_maxillo_facial_surgical_disease - Shape: 7
Rare_neoplastic_disease - Shape: 1487
Rare_neurologic_disease - Shape: 1958
Rare_odontologic_disease - Shape: 60
Rare_ophthalmic_disorder - Shape: 596
Rare_otorhinolaryngologic_disease - Shape: 75
Rare_renal_disease - Shape: 163
Rare_respiratory_disease - Shape: 205
Rare_skin_disease - Shape: 629
Rare_su
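
The read-and-clean loop above recurs in several later cells. As a minimal sketch, assuming the ATLAS exports are tab-separated and may start with an empty, unnamed column, the logic could be factored into two small helpers. The names `drop_empty_first_column` and `read_class_file` are hypothetical and not part of the original notebook; the sample row reuses one record shown later in the combined output.

```python
import io
import pandas as pd

def drop_empty_first_column(df):
    """Drop the first column when it is named 'Unnamed: 0' or is entirely null."""
    first = df.columns[0]
    if first == 'Unnamed: 0' or df.iloc[:, 0].isnull().all():
        return df.drop(columns=first)
    return df

def read_class_file(path):
    """Read one tab-separated class file, falling back to ISO-8859-1 if UTF-8 fails."""
    try:
        df = pd.read_csv(path, delimiter='\t', encoding='utf-8')
    except UnicodeDecodeError:
        df = pd.read_csv(path, delimiter='\t', encoding='ISO-8859-1')
    return drop_empty_first_column(df)

# Demo on an in-memory buffer instead of a real file (pandas accepts both).
sample = "\tId\tCode\tName\n\t4173174\t276526002\tChylous ascites of newborn\n"
demo = read_class_file(io.StringIO(sample))
print(demo.columns.tolist())
```

With helpers like these, each later cell would only need a loop over the file list and a call to `read_class_file`.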

In [15]:
snomed_ct_codes_drop_due_to_hpo = pd.read_csv('SNOMED_CT_codes_to_drop_due_to_HPO_mapping.txt', delimiter='\t', encoding='utf-8')
snomed_ct_codes_drop_due_to_hpo

Unnamed: 0,Code,Name,HPO
0,718556007,3C syndrome,"[1, [""HP:0004602""], null, [[""HP:0004602"", ""Cer..."
1,26132002,5-Oxoprolinase deficiency,"[1, [""HP:0040142""], null, [[""HP:0040142"", ""Red..."
2,274945004,AA amyloidosis,"[1, [""HP:4000041""], null, [[""HP:4000041"", ""AA ..."
3,268302006,Aberrant thyroid gland,"[1, [""HP:0100028""], null, [[""HP:0100028"", ""Ect..."
4,190787008,Abetalipoproteinemia,"[1, [""HP:0008181""], null, [[""HP:0008181"", ""Abe..."
...,...,...,...
483,51247001,Vibratory urticaria,"[1, [""HP:0410138""], null, [[""HP:0410138"", ""Vib..."
484,28055006,West syndrome,"[1, [""HP:0011097""], null, [[""HP:0011097"", ""Epi..."
485,36070007,Wiskott-Aldrich syndrome,"[1, [""HP:6000472""], null, [[""HP:6000472"", ""Dec..."
486,16541001,Yellow fever,"[1, [""HP:0034310""], null, [[""HP:0034310"", ""Pos..."


## HPO mapping of rare disease classes with descendants from OHDSI

In [44]:
import os
import pandas as pd

# Define the path where the TXT files are located
txt_folder_path = r'C:\Users\,...,\RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files'

# List of TXT file names (curated from OHDSI ATLAS to include the descendants of each rare disease class)
txt_files = [
    'Rare_abdominal_surgical_disease.txt',
    'Rare_bone_disease.txt',
    'Rare_cardiac_disease.txt',
    'Rare_circulatory_system_disease.txt',
    'Rare_developmental_defect_during_embryogenesis.txt',
    'Rare_disorder_due_to_toxic_effects.txt',
    'Rare_endocrine_disease.txt',
    'Rare_gastroenterologic_disease.txt',
    'Rare_genetic_disease.txt',
    'Rare_gynecologic_or_obstetric_disease.txt',
    'Rare_hematologic_disease.txt',
    'Rare_hepatic_disease.txt',
    'Rare_immune_disease.txt',
    'Rare_inborn_errors_of_metabolism.txt',
    'Rare_infectious_disease.txt',
    'Rare_infertility_disease.txt',
    'Rare_maxillo_facial_surgical_disease.txt',
    'Rare_neoplastic_disease.txt',
    'Rare_neurologic_disease.txt',
    'Rare_odontologic_disease.txt',
    'Rare_ophthalmic_disorder.txt',
    'Rare_otorhinolaryngologic_disease.txt',
    'Rare_renal_disease.txt',
    'Rare_respiratory_disease.txt',
    'Rare_skin_disease.txt',
    'Rare_surgical_cardiac_disease.txt',
    'Rare_surgical_thoracic_disease.txt',
    'Rare_systemic_or_rheumatologic_disease.txt',
    'Rare_transplantation_disease.txt',
    'Rare_urogenital_disease.txt'
]

# snomed_ct_codes_drop_due_to_hpo (loaded in an earlier cell) holds the SNOMED CT codes to match against.
# It has the columns 'Code' (integer SNOMED CT concept ID), 'Name', and 'HPO'.

# Loop through the list of files and process each file
for txt_file in txt_files:
    # Construct the full file path
    file_path = os.path.join(txt_folder_path, txt_file)
    
    # Read the TXT file into a DataFrame using UTF-8 encoding
    try:
        df = pd.read_csv(file_path, delimiter='\t', encoding='utf-8')  # Adjust encoding if needed
    except UnicodeDecodeError:
        print(f"UnicodeDecodeError: Unable to decode {txt_file} using UTF-8. Trying with ISO-8859-1.")
        df = pd.read_csv(file_path, delimiter='\t', encoding='ISO-8859-1')  # Fallback to ISO-8859-1
    
    # Drop the first null column if it exists
    if df.columns[0] == 'Unnamed: 0' or df.iloc[:, 0].isnull().all():  # Check if the first column is empty or unnamed
        df = df.drop(df.columns[0], axis=1)

    # Step 1: Get the list of codes from the snomed_ct_codes_drop_due_to_hpo DataFrame
    snomed_codes = snomed_ct_codes_drop_due_to_hpo['Code'].tolist()

    # Step 2: Drop rows from the current DataFrame where 'Code' matches any code in snomed_ct_codes_drop_due_to_hpo
    df_cleaned = df[~df['Code'].isin(snomed_codes)]  # The ~ negates the condition

    # Step 3: Save the cleaned DataFrame back to a new .txt file with _new suffix
    new_file_path = os.path.join(txt_folder_path, txt_file.split('.')[0] + '_new.txt')
    df_cleaned.to_csv(new_file_path, sep='\t', index=False, encoding='utf-8')  # Adjust encoding if necessary

    print(f"Processed and saved: {new_file_path}")


Processed and saved: C:\Users\,...,RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files\Rare_abdominal_surgical_disease_new.txt
Processed and saved: C:\Users\,...,RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files\Rare_bone_disease_new.txt
Processed and saved: C:\Users\,...,RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files\Rare_cardiac_disease_new.txt
Processed and saved: C:\Users\,...,RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files\Rare_circulatory_system_disease_new.txt
Processed and saved: C:\Users\,...,RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files\Rare_developmental_defect_during_embryogenesis_new.txt
Processed and saved: C:\Users\,...,\RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files\Rare_disorder_due_to_toxic_effects_new.txt
Processed and saved: C:\User
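
The later cells retype the full list of `_new.txt` names by hand. One possible alternative, sketched here with a hypothetical helper `hpo_filtered_name`, is to derive each output name from the base list. It uses `os.path.splitext`, which only strips the final extension, whereas `split('.')[0]` would also cut at any earlier dot in the name.

```python
import os

def hpo_filtered_name(txt_file, suffix='_new'):
    """Return e.g. 'Rare_bone_disease_new.txt' for 'Rare_bone_disease.txt'."""
    stem, ext = os.path.splitext(txt_file)
    return f"{stem}{suffix}{ext}"

# Two base names taken from the list above; the full list works the same way.
base_files = ['Rare_bone_disease.txt', 'Rare_cardiac_disease.txt']
new_files = [hpo_filtered_name(name) for name in base_files]
print(new_files)
```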

In [47]:
## Read each rare disease class file after removing HPO-mapped diseases

import os
import pandas as pd

# Define the path where the TXT files are located
txt_folder_path = r'C:\Users\,...,\RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files_without_hpo'

# List of TXT file names (curated from OHDSI ATLAS to include the descendants of each rare disease class)
txt_files = [
    'Rare_abdominal_surgical_disease_new.txt',
    'Rare_bone_disease_new.txt',
    'Rare_cardiac_disease_new.txt',
    'Rare_circulatory_system_disease_new.txt',
    'Rare_developmental_defect_during_embryogenesis_new.txt',
    'Rare_disorder_due_to_toxic_effects_new.txt',
    'Rare_endocrine_disease_new.txt',
    'Rare_gastroenterologic_disease_new.txt',
    'Rare_genetic_disease_new.txt',
    'Rare_gynecologic_or_obstetric_disease_new.txt',
    'Rare_hematologic_disease_new.txt',
    'Rare_hepatic_disease_new.txt',
    'Rare_immune_disease_new.txt',
    'Rare_inborn_errors_of_metabolism_new.txt',
    'Rare_infectious_disease_new.txt',
    'Rare_infertility_disease_new.txt',
    'Rare_maxillo_facial_surgical_disease_new.txt',
    'Rare_neoplastic_disease_new.txt',
    'Rare_neurologic_disease_new.txt',
    'Rare_odontologic_disease_new.txt',
    'Rare_ophthalmic_disorder_new.txt',
    'Rare_otorhinolaryngologic_disease_new.txt',
    'Rare_renal_disease_new.txt',
    'Rare_respiratory_disease_new.txt',
    'Rare_skin_disease_new.txt',
    'Rare_surgical_cardiac_disease_new.txt',
    'Rare_surgical_thoracic_disease_new.txt',
    'Rare_systemic_or_rheumatologic_disease_new.txt',
    'Rare_transplantation_disease_new.txt',
    'Rare_urogenital_disease_new.txt'
]

# Dictionary to store DataFrames for each TXT file
disease_data = {}

# Loop through the list of files and read them into a dictionary
for txt_file in txt_files:
    # Construct the full file path
    file_path = os.path.join(txt_folder_path, txt_file)
    
    # Read the TXT file into a DataFrame using UTF-8 encoding
    try:
        df = pd.read_csv(file_path, delimiter='\t', encoding='utf-8')  # Try using UTF-8 encoding
    except UnicodeDecodeError:
        print(f"UnicodeDecodeError: Unable to decode {txt_file} using UTF-8. Trying with ISO-8859-1.")
        df = pd.read_csv(file_path, delimiter='\t', encoding='ISO-8859-1')  # Fallback to ISO-8859-1
    
    # Remove the first null column (if it exists)
    if df.columns[0] == 'Unnamed: 0' or df.iloc[:, 0].isnull().all():  # Check if the first column is empty or unnamed
        df = df.drop(df.columns[0], axis=1)
    
    # Store the DataFrame in the dictionary with the disease name as the key
    disease_name = txt_file.split('.')[0]  # Extract the name from the file (without '.txt')
    disease_data[disease_name] = df
    
    # Print the disease name and its number of rows (df.shape[0])
    print(f"{disease_name} - Shape: {df.shape[0]}")

# Now, you can access the DataFrame for any disease from the dictionary
# For example, to access Rare infertility disease data:
#rare_infertility_df = disease_data['Rare_infertility_disease']
#print(rare_infertility_df.head())


Rare_abdominal_surgical_disease_new - Shape: 15
Rare_bone_disease_new - Shape: 483
Rare_cardiac_disease_new - Shape: 63
Rare_circulatory_system_disease_new - Shape: 60
Rare_developmental_defect_during_embryogenesis_new - Shape: 2765
Rare_disorder_due_to_toxic_effects_new - Shape: 123
Rare_endocrine_disease_new - Shape: 246
Rare_gastroenterologic_disease_new - Shape: 155
Rare_genetic_disease_new - Shape: 3
Rare_gynecologic_or_obstetric_disease_new - Shape: 28
Rare_hematologic_disease_new - Shape: 396
Rare_hepatic_disease_new - Shape: 81
Rare_immune_disease_new - Shape: 224
Rare_inborn_errors_of_metabolism_new - Shape: 531
Rare_infectious_disease_new - Shape: 1295
Rare_infertility_disease_new - Shape: 6
Rare_maxillo_facial_surgical_disease_new - Shape: 7
Rare_neoplastic_disease_new - Shape: 1435
Rare_neurologic_disease_new - Shape: 1942
Rare_odontologic_disease_new - Shape: 51
Rare_ophthalmic_disorder_new - Shape: 557
Rare_otorhinolaryngologic_disease_new - Shape: 73
Rare_renal_disease_n

In [21]:
import os
import pandas as pd

# Define the path where the TXT files are located
txt_folder_path = r'C:\Users\,...,\RD_Cohort_Paper_Feb_2025\Orphanet_Linearization_ICD10_SNOMED_codes\RD_class_OHDSI_files_without_hpo'

# List of TXT file names (curated from OHDSI ATLAS to include the descendants of each rare disease class)
txt_files = [
    'Rare_abdominal_surgical_disease_new.txt',
    'Rare_bone_disease_new.txt',
    'Rare_cardiac_disease_new.txt',
    'Rare_circulatory_system_disease_new.txt',
    'Rare_developmental_defect_during_embryogenesis_new.txt',
    'Rare_disorder_due_to_toxic_effects_new.txt',
    'Rare_endocrine_disease_new.txt',
    'Rare_gastroenterologic_disease_new.txt',
    'Rare_genetic_disease_new.txt',
    'Rare_gynecologic_or_obstetric_disease_new.txt',
    'Rare_hematologic_disease_new.txt',
    'Rare_hepatic_disease_new.txt',
    'Rare_immune_disease_new.txt',
    'Rare_inborn_errors_of_metabolism_new.txt',
    'Rare_infectious_disease_new.txt',
    'Rare_infertility_disease_new.txt',
    'Rare_maxillo_facial_surgical_disease_new.txt',
    'Rare_neoplastic_disease_new.txt',
    'Rare_neurologic_disease_new.txt',
    'Rare_odontologic_disease_new.txt',
    'Rare_ophthalmic_disorder_new.txt',
    'Rare_otorhinolaryngologic_disease_new.txt',
    'Rare_renal_disease_new.txt',
    'Rare_respiratory_disease_new.txt',
    'Rare_skin_disease_new.txt',
    'Rare_surgical_cardiac_disease_new.txt',
    'Rare_surgical_thoracic_disease_new.txt',
    'Rare_systemic_or_rheumatologic_disease_new.txt',
    'Rare_transplantation_disease_new.txt',
    'Rare_urogenital_disease_new.txt'
]

# List to store DataFrames
df_list = []

# Loop through the list of files and read them into a list of DataFrames
for txt_file in txt_files:
    # Construct the full file path
    file_path = os.path.join(txt_folder_path, txt_file)
    
    # Read the TXT file into a DataFrame using UTF-8 encoding
    try:
        df = pd.read_csv(file_path, delimiter='\t', encoding='utf-8')  # Try using UTF-8 encoding
    except UnicodeDecodeError:
        print(f"UnicodeDecodeError: Unable to decode {txt_file} using UTF-8. Trying with ISO-8859-1.")
        df = pd.read_csv(file_path, delimiter='\t', encoding='ISO-8859-1')  # Fallback to ISO-8859-1
    
    # Remove the first null column (if it exists)
    if df.columns[0] == 'Unnamed: 0' or df.iloc[:, 0].isnull().all():  # Check if the first column is empty or unnamed
        df = df.drop(df.columns[0], axis=1)
    
    # Add a new column 'disease_linearization' to store the disease name (file name)
    df['disease_linearization'] = txt_file.split('.')[0]  # Extract the name from the file (without '.txt')
    
    # Append the DataFrame to the list
    df_list.append(df)
    
    # Print the file name and its number of rows (df.shape[0])
    print(f"{txt_file} - Shape: {df.shape[0]}")

# Concatenate all the DataFrames into one single DataFrame
combined_df = pd.concat(df_list, ignore_index=True)

# Print the final combined DataFrame shape and some preview data
print(f"Combined DataFrame - Shape: {combined_df.shape}")
print(combined_df.head())

# Optionally, you can save the combined DataFrame to a file
#combined_df.to_csv('combined_disease_data.txt', sep='\t', index=False, encoding='utf-8')
combined_df

Rare_abdominal_surgical_disease_new.txt - Shape: 15
Rare_bone_disease_new.txt - Shape: 483
Rare_cardiac_disease_new.txt - Shape: 63
Rare_circulatory_system_disease_new.txt - Shape: 60
Rare_developmental_defect_during_embryogenesis_new.txt - Shape: 2765
Rare_disorder_due_to_toxic_effects_new.txt - Shape: 123
Rare_endocrine_disease_new.txt - Shape: 246
Rare_gastroenterologic_disease_new.txt - Shape: 155
Rare_genetic_disease_new.txt - Shape: 3
Rare_gynecologic_or_obstetric_disease_new.txt - Shape: 28
Rare_hematologic_disease_new.txt - Shape: 396
Rare_hepatic_disease_new.txt - Shape: 81
Rare_immune_disease_new.txt - Shape: 224
Rare_inborn_errors_of_metabolism_new.txt - Shape: 531
Rare_infectious_disease_new.txt - Shape: 1295
Rare_infertility_disease_new.txt - Shape: 6
Rare_maxillo_facial_surgical_disease_new.txt - Shape: 7
Rare_neoplastic_disease_new.txt - Shape: 1435
Rare_neurologic_disease_new.txt - Shape: 1942
Rare_odontologic_disease_new.txt - Shape: 51
Rare_ophthalmic_disorder_new.txt

Unnamed: 0,Id,Code,Name,Class,Standard Concept Caption,Valid Start Date,Valid End Date,RC,DRC,PC,DPC,Domain,Vocabulary,Ancestors,disease_linearization
0,36715929.0,721719006,Adenoma of ampulla of Vater,Disorder,Standard,01/30/2017,12/30/2099,13370,13370,1050,1050,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_abdominal_surgical_disease_new
1,37016243.0,208061000119101,Adenoma of pancreas,Disorder,Standard,01/30/2016,12/30/2099,30,20730,20,2160,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_abdominal_surgical_disease_new
2,4173174.0,276526002,Chylous ascites of newborn,Disorder,Standard,01/30/2002,12/30/2099,10,10,0,0,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_abdominal_surgical_disease_new
3,4342870.0,235967003,Cystadenoma of pancreas,Disorder,Standard,01/30/2002,12/30/2099,200,2730,70,380,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_abdominal_surgical_disease_new
4,4342886.0,236015007,Drug-induced retroperitoneal fibrosis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_abdominal_surgical_disease_new
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12143,37018952.0,367661000119102,Hematuria co-occurrent and due to chronic inte...,Disorder,Standard,01/30/2016,12/30/2099,275800,275800,86600,86600,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_urogenital_disease_new
12144,40479626.0,441575009,Ischemic priapism,Disorder,Standard,07/30/2009,12/30/2099,40,100,20,40,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_urogenital_disease_new
12145,4111154.0,253904001,Megacystis-megaureter syndrome,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_urogenital_disease_new
12146,4179391.0,429233001,Nonneurogenic neurogenic bladder dysfunction,Disorder,Standard,01/30/2008,12/30/2099,1100,1100,40,40,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ...",Rare_urogenital_disease_new


In [57]:
len(combined_df['Code'].unique())

11784

In [59]:
combined_df.to_csv('rd_linearization_concatenated_files.csv', index=False, encoding='utf-16-le')

In [20]:
## Five rare diseases without a linearization class (330 SNOMED CT codes in total, including descendants from OHDSI), loaded for HPO mapping
non_linearized_snomed_descendants = pd.read_csv('non_lonearized_snomed_codes_with_its_descendants.txt', delimiter='\t')

non_linearized_snomed_descendants = non_linearized_snomed_descendants.drop(non_linearized_snomed_descendants.columns[0], axis=1)
non_linearized_snomed_descendants

Unnamed: 0,Id,Code,Name,Class,Standard Concept Caption,Valid Start Date,Valid End Date,RC,DRC,PC,DPC,Domain,Vocabulary,Ancestors
0,4141802,427099000,Active tuberculosis,Disorder,Standard,07/30/2007,12/30/2099,110,110,60,60,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
1,4237741,58374007,Acute diffuse tuberculosis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
2,4345213,240381007,Acute miliary cutaneous tuberculosis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
3,435734,186276006,Acute miliary tuberculosis,Disorder,Standard,01/30/2002,12/30/2099,13440,18410,3030,4110,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
4,4087419,186278007,Acute miliary tuberculosis of multiple sites,Disorder,Standard,01/30/2002,12/30/2099,2660,2660,420,420,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,4121541,25738004,Tuberculous synovitis,Disorder,Standard,01/30/2002,12/30/2099,10,10,10,10,Condition,SNOMED,
326,4028666,13570003,Tuberculous tenosynovitis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,
327,36713392,717697005,Tuberculous ulceration of vulva,Disorder,Standard,01/30/2017,12/30/2099,0,0,0,0,Condition,SNOMED,
328,4126283,236684001,Tuberculous urethritis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,


In [22]:
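## Find the SNOMED CT codes that have an HPO mapping so they can be dropped (the following 8 codes among the descendants have one)
# Note: despite its name, merged_df_without_hpo holds the rows that DO have an HPO mapping (inner merge)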
merged_df_without_hpo = pd.merge(non_linearized_snomed_descendants, snomed_ct_codes_drop_due_to_hpo, on='Code', how='inner')
merged_df_without_hpo

Unnamed: 0,Id,Code,Name_x,Class,Standard Concept Caption,Valid Start Date,Valid End Date,RC,DRC,PC,DPC,Domain,Vocabulary,Ancestors,Name_y,HPO
0,4100564,253120005,Lipomeningocele,Disorder,Standard,01/30/2002,12/30/2099,840,840,100,100,Condition,SNOMED,,Lipomeningocele,"[1, [""HP:0030710""], null, [[""HP:0030710"", ""Lip..."
1,4212197,414667000,Meningomyelocele,Disorder,Standard,01/30/2005,12/30/2099,26770,27790,690,950,Condition,SNOMED,,Meningomyelocele,"[1, [""HP:0002475""], null, [[""HP:0002475"", ""Mye..."
2,4068958,203994003,Myelocystocele,Disorder,Standard,01/30/2002,12/30/2099,40,50,20,20,Condition,SNOMED,,Myelocystocele,"[1, [""HP:0030709""], null, [[""HP:0030709"", ""Mye..."
3,253954,154283005,Pulmonary tuberculosis,Disorder,Standard,07/30/2003,12/30/2099,1102290,1344780,335830,413470,Condition,SNOMED,,Pulmonary tuberculosis,"[1, [""HP:0032262""], null, [[""HP:0032262"", ""Pul..."
4,4264916,61819007,Rachischisis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,,Rachischisis,"[1, [""HP:0003316""], null, [[""HP:0003316"", ""But..."
5,4034847,15202009,Tuberculoma,Disorder,Standard,01/30/2002,12/30/2099,10810,19520,1600,4040,Condition,SNOMED,,Tuberculoma,"[1, [""HP:6000585""], null, [[""HP:6000585"", ""Bra..."
6,434557,56717001,Tuberculosis,Disorder,Standard,01/30/2002,12/30/2099,50490,3193320,11660,1001260,Condition,SNOMED,,Tuberculosis,"[6, [""HP:0032262"", ""HP:0032271"", ""HP:6000799"",..."
7,4310580,423997002,"Tuberculosis, extrapulmonary",Disorder,Standard,01/30/2007,12/30/2099,50,50,10,10,Condition,SNOMED,,"Tuberculosis, extrapulmonary","[1, [""HP:0032271""], null, [[""HP:0032271"", ""Ext..."


In [31]:
# Collect the HPO-mapped codes to exclude (a plain list is sufficient for isin)
code_list = merged_df_without_hpo['Code'].tolist()

# Select rows in non_linearized_snomed_descendants where 'Code' is not in the code_list
filtered_df = non_linearized_snomed_descendants[~non_linearized_snomed_descendants['Code'].isin(code_list)]
filtered_df

Unnamed: 0,Id,Code,Name,Class,Standard Concept Caption,Valid Start Date,Valid End Date,RC,DRC,PC,DPC,Domain,Vocabulary,Ancestors
0,4141802,427099000,Active tuberculosis,Disorder,Standard,07/30/2007,12/30/2099,110,110,60,60,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
1,4237741,58374007,Acute diffuse tuberculosis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
2,4345213,240381007,Acute miliary cutaneous tuberculosis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
3,435734,186276006,Acute miliary tuberculosis,Disorder,Standard,01/30/2002,12/30/2099,13440,18410,3030,4110,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
4,4087419,186278007,Acute miliary tuberculosis of multiple sites,Disorder,Standard,01/30/2002,12/30/2099,2660,2660,420,420,Condition,SNOMED,"$parents[1].showAncestorsModal(d.CONCEPT_ID), ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,4121541,25738004,Tuberculous synovitis,Disorder,Standard,01/30/2002,12/30/2099,10,10,10,10,Condition,SNOMED,
326,4028666,13570003,Tuberculous tenosynovitis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,
327,36713392,717697005,Tuberculous ulceration of vulva,Disorder,Standard,01/30/2017,12/30/2099,0,0,0,0,Condition,SNOMED,
328,4126283,236684001,Tuberculous urethritis,Disorder,Standard,01/30/2002,12/30/2099,0,0,0,0,Condition,SNOMED,
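
The two steps above (inner merge, then exclusion) amount to an anti-join on `Code`. As a hedged sketch on toy frames that reuse a few codes from the outputs above, the same result can be obtained in one step by filtering directly against the codes of the HPO table. The helper name `drop_codes` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for non_linearized_snomed_descendants and snomed_ct_codes_drop_due_to_hpo.
descendants = pd.DataFrame({
    'Code': [427099000, 56717001, 15202009, 25738004],
    'Name': ['Active tuberculosis', 'Tuberculosis', 'Tuberculoma', 'Tuberculous synovitis'],
})
hpo_mapped = pd.DataFrame({
    'Code': [56717001, 15202009, 28055006],
    'Name': ['Tuberculosis', 'Tuberculoma', 'West syndrome'],
})

def drop_codes(df, codes_to_drop, column='Code'):
    """Keep only the rows whose code is absent from codes_to_drop (original index preserved)."""
    return df[~df[column].isin(set(codes_to_drop))]

kept = drop_codes(descendants, hpo_mapped['Code'])
print(kept)
```

Codes that appear only in the HPO table (here 28055006) are simply ignored, so no intermediate merge is needed.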


In [41]:
# Concatenate the linearized codes with the descendants of the five non-linearized SNOMED CT codes
df_final = pd.concat([combined_df,filtered_df], ignore_index=True)
df_final = df_final[['Name', 'Code', 'disease_linearization']]
df_final

Unnamed: 0,Name,Code,disease_linearization
0,Adenoma of ampulla of Vater,721719006,Rare_abdominal_surgical_disease_new
1,Adenoma of pancreas,208061000119101,Rare_abdominal_surgical_disease_new
2,Chylous ascites of newborn,276526002,Rare_abdominal_surgical_disease_new
3,Cystadenoma of pancreas,235967003,Rare_abdominal_surgical_disease_new
4,Drug-induced retroperitoneal fibrosis,236015007,Rare_abdominal_surgical_disease_new
...,...,...,...
12465,Tuberculous synovitis,25738004,
12466,Tuberculous tenosynovitis,13570003,
12467,Tuberculous ulceration of vulva,717697005,
12468,Tuberculous urethritis,236684001,
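
The `disease_linearization` column holds file-name stems such as `Rare_skin_disease_new`, and it is empty for the non-linearized rows. A minimal sketch of turning these labels back into readable class names follows. The function `label_to_class_name` and the override table are assumptions: three stems do not match the wording of the Orphanet class list, and the mapping chosen for them here is a guess based on that list.

```python
# Assumed overrides for stems whose wording differs from the Orphanet class list.
CLASS_NAME_OVERRIDES = {
    'Rare_infertility_disease': 'Rare infertility',
    'Rare_maxillo_facial_surgical_disease': 'Rare maxillo-facial surgical disease',
    'Rare_transplantation_disease': (
        'Rare disorder potentially indicated for transplant '
        'or complication after transplantation'
    ),
}

def label_to_class_name(label, suffix='_new'):
    """Convert a file-name stem to a class name; non-string (missing) labels are not linearized."""
    if not isinstance(label, str):
        return 'Not linearized'
    stem = label[:-len(suffix)] if label.endswith(suffix) else label
    if stem in CLASS_NAME_OVERRIDES:
        return CLASS_NAME_OVERRIDES[stem]
    return stem.replace('_', ' ')

print(label_to_class_name('Rare_skin_disease_new'))
print(label_to_class_name(float('nan')))
```

Applied to the frame, this would be something like `df_final['disease_linearization'].map(label_to_class_name)`.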


In [42]:
len(df_final['Code'].unique())

12081

In [43]:
df_final.to_csv('SNOMED_CT_codes_with_descendents_with_without_linearization.txt', sep='\t', index=False)

In [3]:
df_final = pd.read_csv('SNOMED_CT_codes_with_descendents_with_without_linearization.txt', delimiter='\t')
df_final

Unnamed: 0,Name,Code,disease_linearization
0,Adenoma of ampulla of Vater,721719006,Rare_abdominal_surgical_disease_new
1,Adenoma of pancreas,208061000119101,Rare_abdominal_surgical_disease_new
2,Chylous ascites of newborn,276526002,Rare_abdominal_surgical_disease_new
3,Cystadenoma of pancreas,235967003,Rare_abdominal_surgical_disease_new
4,Drug-induced retroperitoneal fibrosis,236015007,Rare_abdominal_surgical_disease_new
...,...,...,...
12465,Tuberculous synovitis,25738004,
12466,Tuberculous tenosynovitis,13570003,
12467,Tuberculous ulceration of vulva,717697005,
12468,Tuberculous urethritis,236684001,


In [51]:
# Select rows where 'Code' is duplicated (keep all duplicates)
df_duplicates = df_final[df_final.duplicated('Code', keep=False)]
print('Unique rare disease linearized in multiple classes',len(df_duplicates['Code'].unique()))
df_duplicates

Unique rare disease linearized in multiple classes 377


Unnamed: 0,Name,Code,disease_linearization
8,Immunoglobulin G4 related aortitis,1187508009,Rare_abdominal_surgical_disease_new
83,"Brachydactyly, mesomelia, intellectual disabil...",765761009,Rare_bone_disease_new
97,Chondrodysplasia with disorder of sex developm...,720851007,Rare_bone_disease_new
99,Chondroectodermal dysplasia,62501005,Rare_bone_disease_new
100,Chondroectodermal dysplasia with night blindne...,763134002,Rare_bone_disease_new
...,...,...,...
12329,Tuberculosis of endocardium,52987001,
12412,Tuberculous Addison's disease,186270000,
12415,Tuberculous arachnoiditis,447253004,
12430,Tuberculous empyema,14527007,


In [10]:
df_linearization_icd10

Unnamed: 0,GardID,OrphaCode,Name,code,Linearization Class
0,7904,461,Recessive X-linked ichthyosis,Q80.1,Rare skin disease
1,871,848,Beta-thalassemia,D56.1,Rare hematologic disease
2,621,846,Alpha-thalassemia,D56.0,Rare hematologic disease
3,5696,778,Rett syndrome,F84.2,Rare neurologic disease
4,8378,731,Autosomal recessive polycystic kidney disease,Q61.1,Rare renal disease
...,...,...,...,...,...
352,19690,99827,Crimean-Congo hemorrhagic fever,A98.0,Rare infectious disease
353,19688,99824,Lassa fever,A96.2,Rare infectious disease
354,9444,99826,Marburg hemorrhagic fever,A98.3,Rare infectious disease
355,22442,615943,Granuloma faciale,L92.2,Rare skin disease


In [30]:
## Remove incidence-filter codes from the SNOMED CT and ICD-10 linearization files
icd10_incidence_filter_code = ['Q21.1','H90.5','M35.3','G50.1','L81.3','E85.4','A31.0']
print('# of diseases dropped due to incidence filter from icd10 codes', len(icd10_incidence_filter_code))
snomedct_incidence_filter_code = [62106007,58797008,23502006,62564004,65323003,209827006,110030002,312839008,239887007,405754008,7297005,70273001,262693007,127299008,31712002,41720003,194997002,765330003]
print('# of diseases dropped due to incidence filter from snomed-ct codes', len(snomedct_incidence_filter_code))

# of diseases dropped due to incidence filter from icd10 codes 7
# of diseases dropped due to incidence filter from snomed-ct codes 18


In [31]:
# Select ICD-10 codes after discarding the incidence-filter ICD-10 codes
df_icd10 = df_linearization_icd10[~df_linearization_icd10['code'].isin(icd10_incidence_filter_code)]
df_icd10

Unnamed: 0,GardID,OrphaCode,Name,code,Linearization Class
0,7904,461,Recessive X-linked ichthyosis,Q80.1,Rare skin disease
1,871,848,Beta-thalassemia,D56.1,Rare hematologic disease
2,621,846,Alpha-thalassemia,D56.0,Rare hematologic disease
3,5696,778,Rett syndrome,F84.2,Rare neurologic disease
4,8378,731,Autosomal recessive polycystic kidney disease,Q61.1,Rare renal disease
...,...,...,...,...,...
352,19690,99827,Crimean-Congo hemorrhagic fever,A98.0,Rare infectious disease
353,19688,99824,Lassa fever,A96.2,Rare infectious disease
354,9444,99826,Marburg hemorrhagic fever,A98.3,Rare infectious disease
355,22442,615943,Granuloma faciale,L92.2,Rare skin disease


In [33]:
# ICD-10 codes to map in the N3C enclave
print(df_icd10['code'].tolist() )

['Q80.1', 'D56.1', 'D56.0', 'F84.2', 'Q61.1', 'Q87.4', 'G10', 'D82.1', 'Q99.2', 'E76.1', 'E76.0', 'L93.1', 'Q77.4', 'Q16.3', 'M61.1', 'E79.1', 'Q77.5', 'I78.0', 'E80.5', 'K22.0', 'C45.1', 'F84.3', 'Q98.5', 'D47.5', 'G60.1', 'Q82.3', 'G90.1', 'Q23.4', 'Q86.0', 'Q12.1', 'Q77.1', 'P35.0', 'Q83.0', 'Q83.1', 'Q31.5', 'Q31.0', 'C57.0', 'Q51.0', 'Q51.5', 'P37.1', 'P35.8', 'C82.6', 'M35.2', 'M33.2', 'Q78.0', 'T88.3', 'D81.5', 'G70.0', 'Q85.1', 'M31.0+', 'N08.5*', 'M30.1', 'M31.7', 'M31.3', 'B75', 'E71.0', 'Q61.2', 'A54.3+', 'H13.1*', 'M31.4', 'A48.1', 'L10.0', 'F84.1', 'M30.3', 'Q99.0', 'E80.3', 'M72.2', 'E24.1', 'G40.8', 'K75.4', 'K74.3', 'L13.0', 'A05.1', 'L44.0', 'Q31.3', 'M91.1', 'G23.1', 'Q99.1', 'J63.2', 'Q97.0', 'D82.2', 'Q77.0', 'K03.5', 'P91.6', 'M11.1', 'Q16.1', 'Q01.1', 'Q31.1', 'Q78.3', 'Q06.2', 'U04.9', 'Q86.1', 'P35.1', 'D76.2', 'G23.0', 'Q72.4', 'O14.2', 'Q78.4', 'L40.1', 'G51.2', 'Q80.3', 'Q80.4', 'O01.1', 'O01.0', 'I67.5', 'E31.0', 'Q06.0', 'Q44.6', 'L98.2', 'Q04.1', 'Q04.4', 
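
Some codes in this list end in `+` or `*` (for example `M31.0+` and `N08.5*`), which look like ICD-10 dagger and asterisk dual-classification markers. Whether the N3C lookup expects them removed is an assumption. If it does, a small normalization step such as the hypothetical `strip_dual_coding_marker` below could be applied before mapping.

```python
def strip_dual_coding_marker(code):
    """Remove a trailing '+' (dagger) or '*' (asterisk) marker from an ICD-10 code."""
    return code.rstrip('+*')

# A few codes copied from the printed list above.
codes = ['M31.0+', 'N08.5*', 'A54.3+', 'H13.1*', 'Q80.1', 'G10']
normalized = [strip_dual_coding_marker(c) for c in codes]
print(normalized)
```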

In [34]:
# List of unique classes of rare diseases
unique_classes = df_icd10['Linearization Class'].unique()

# Loop through each class and print its ICD-10 codes
for rare_class in unique_classes:
    rare_diseases = df_icd10[df_icd10['Linearization Class'] == rare_class]
    print(f"# {rare_class} : {rare_diseases['code'].tolist()}\n")

# Rare skin disease : ['Q80.1', 'L93.1', 'L10.0', 'M72.2', 'L13.0', 'L44.0', 'L40.1', 'G51.2', 'Q80.3', 'Q80.4', 'L98.2', 'Q82.1', 'L57.1', 'L66.2', 'L66.3', 'L98.3', 'L72.2', 'L85.0', 'L68.1', 'L66.1', 'Q84.3', 'L43.1', 'M35.6', 'L92.1', 'L95.0', 'L12.3', 'L13.1', 'L95.1', 'L94.0', 'Q81.1', 'L10.1', 'E52', 'L51.2', 'L92.2']

# Rare hematologic disease : ['D56.1', 'D56.0', 'O14.2', 'D58.0', 'D56.2', 'D47.3', 'D68.1', 'D61.0', 'D45', 'D56.4', 'D67', 'D60.1', 'D66']

# Rare neurologic disease : ['F84.2', 'G10', 'F84.3', 'G60.1', 'G90.1', 'T88.3', 'G70.0', 'F84.1', 'G40.8', 'G23.1', 'P91.6', 'G23.0', 'I67.5', 'G37.1', 'G14', 'G23.3', 'G37.5', 'G24.1', 'G47.4', 'G93.2', 'T57.2', 'P94.0', 'G96.0', 'G70.2', 'G36.0', 'G37.0', 'G73.1', 'G12.0', 'G21.3', 'G21.0', 'F80.3', 'G23.2', 'G90.5', 'G90.6']

# Rare renal disease : ['Q61.1', 'Q61.2', 'Q61.5']

# Rare systemic or rheumatologic disease : ['Q87.4', 'M35.2', 'M33.2', 'M31.0+', 'N08.5*', 'M30.1', 'M31.7', 'M31.3', 'M31.4', 'M30.3', 'M11.1', '

In [35]:
# Select SNOMED-CT codes after discarding those caught by the incidence filter (I.F.); 18 SNOMED-CT codes are dropped
df_snct = df_final[~df_final['Code'].isin(snomedct_incidence_filter_code)]
df_snct

Unnamed: 0,Name,Code,disease_linearization
0,Adenoma of ampulla of Vater,721719006,Rare_abdominal_surgical_disease_new
1,Adenoma of pancreas,208061000119101,Rare_abdominal_surgical_disease_new
2,Chylous ascites of newborn,276526002,Rare_abdominal_surgical_disease_new
3,Cystadenoma of pancreas,235967003,Rare_abdominal_surgical_disease_new
4,Drug-induced retroperitoneal fibrosis,236015007,Rare_abdominal_surgical_disease_new
...,...,...,...
12465,Tuberculous synovitis,25738004,
12466,Tuberculous tenosynovitis,13570003,
12467,Tuberculous ulceration of vulva,717697005,
12468,Tuberculous urethritis,236684001,
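The filter above relies on `isin`, which compares values strictly. pandas does not treat the integer `25738004` and the string `'25738004'` as equal, so the filter would drop nothing if `Code` and `snomedct_incidence_filter_code` had different types. The following is a minimal sketch on toy data, assuming the filter list might be stored as strings. The names `df_final_demo`, `incidence_filter_demo`, and `df_snct_demo` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_final (the real frame has roughly 12k rows).
df_final_demo = pd.DataFrame({
    "Name": ["Adenoma of pancreas", "Tuberculous synovitis", "Cystadenoma of pancreas"],
    "Code": [208061000119101, 25738004, 235967003],
})

# Hypothetical incidence-filter list, deliberately stored as strings here.
incidence_filter_demo = ["25738004"]

# Normalise both sides to str so isin() compares like with like.
mask = df_final_demo["Code"].astype(str).isin([str(c) for c in incidence_filter_demo])
df_snct_demo = df_final_demo[~mask]

# mask.sum() reports how many rows the filter actually removed.
print(f"dropped {mask.sum()} row(s); remaining codes: {df_snct_demo['Code'].tolist()}")
```

Checking `mask.sum()` against the expected number of filtered codes is a cheap way to confirm the filter matched anything at all.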


In [39]:
# List the SNOMED-CT codes (as strings) that remain after dropping RDs caught by the incidence filter
snct_without_incidence_filter = [str(x) for x in df_snct['Code'].tolist()]

# Print the result
print(snct_without_incidence_filter)


['721719006', '208061000119101', '276526002', '235967003', '236015007', '254615000', '1228875006', '197808006', '1187508009', '1196885009', '473418001', '236017004', '690751000119102', '86422009', '690761000119100', '719046005', '702342007', '724147004', '2391001', '42725006', '14870002', '254061001', '86268005', '720416007', '66758006', '389167007', '718559000', '254090007', '389162001', '20756002', '16583001000119104', '302211000119104', '1076951000119102', '1076971000119106', '302181000119103', '302221000119106', '302171000119101', '302191000119100', '1076931000119108', '16583041000119102', '302291000119108', '1076941000119104', '1076961000119100', '302261000119101', '302301000119109', '302251000119103', '302271000119107', '1076921000119105', '75930001', '720984008', '441809006', '389263004', '725141006', '254055004', '725142004', '717264003', '237889002', '1229999001', '725165009', '1264041000', '725050005', '783789002', '715487005', '90505000', '725166005', '782782004', '122997900

In [40]:
len(snct_without_incidence_filter)

12452

### Rare disease classes with SNOMED-CT Codes

In [41]:
# List of unique classes of rare diseases
unique_classes_snct = df_snct['disease_linearization'].unique()

# Loop through each class and print its SNOMED-CT codes
for rare_class in unique_classes_snct:
    rare_diseases = df_snct[df_snct['disease_linearization'] == rare_class]
    snct_code_rd_class = [str(x) for x in rare_diseases['Code'].tolist()]
    print(f"# {rare_class} : {snct_code_rd_class}\n")

# Rare_abdominal_surgical_disease_new : ['721719006', '208061000119101', '276526002', '235967003', '236015007', '254615000', '1228875006', '197808006', '1187508009', '1196885009', '473418001', '236017004', '690751000119102', '86422009', '690761000119100']

# Rare_bone_disease_new : ['719046005', '702342007', '724147004', '2391001', '42725006', '14870002', '254061001', '86268005', '720416007', '66758006', '389167007', '718559000', '254090007', '389162001', '20756002', '16583001000119104', '302211000119104', '1076951000119102', '1076971000119106', '302181000119103', '302221000119106', '302171000119101', '302191000119100', '1076931000119108', '16583041000119102', '302291000119108', '1076941000119104', '1076961000119100', '302261000119101', '302301000119109', '302251000119103', '302271000119107', '1076921000119105', '75930001', '720984008', '441809006', '389263004', '725141006', '254055004', '725142004', '717264003', '237889002', '1229999001', '725165009', '1264041000', '725050005', '78378
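One caveat with the loop above: the `df_snct` preview shows rows with an empty `disease_linearization` (for example rows 12465 to 12468). If those values are NaN, `unique()` returns NaN as one of the classes, but the comparison `== rare_class` is never true for NaN, so those codes would print as an empty list. The sketch below shows one possible alternative on toy data, using `groupby(..., dropna=False)` (available in pandas 1.1 and later) to keep unclassified rows as their own group and to report a count per class. The names `df_demo` and `codes_by_class` and the label `"Unclassified"` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_snct; two rows have no linearization class.
df_demo = pd.DataFrame({
    "Code": [721719006, 719046005, 25738004, 13570003],
    "disease_linearization": [
        "Rare_abdominal_surgical_disease_new",
        "Rare_bone_disease_new",
        None,
        None,
    ],
})

# dropna=False keeps rows with a missing class as a separate group.
codes_by_class = {}
for rare_class, group in df_demo.groupby("disease_linearization", dropna=False):
    label = "Unclassified" if pd.isna(rare_class) else rare_class
    codes_by_class[label] = [str(c) for c in group["Code"]]

# Print the class, its code count, and the codes themselves.
for label, codes in codes_by_class.items():
    print(f"# {label} ({len(codes)}) : {codes}\n")
```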

In [44]:
len(df_final['Code'].unique())

12081

In [48]:
len(df_snct['Code'].unique())

12063
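The counts above can be reconciled by simple arithmetic. `df_final` has 12081 unique codes and `df_snct` has 12063, a difference of 18, which matches the 18 codes dropped by the incidence filter. The list `snct_without_incidence_filter` has 12452 entries against 12063 unique codes, so 389 entries repeat a code that already appears earlier in the list. The toy sketch below shows how such repeats could be counted. The series `codes_demo` is hypothetical, and the reason for the repeats is not established by the notebook.

```python
import pandas as pd

# Toy stand-in for df_snct['Code'] with some repeated codes.
codes_demo = pd.Series([721719006, 719046005, 719046005, 25738004, 25738004, 25738004])

n_rows = len(codes_demo)                    # all entries, repeats included
n_unique = codes_demo.nunique()             # distinct codes
n_repeats = codes_demo.duplicated().sum()   # entries after the first occurrence of a code

print(n_rows, n_unique, n_repeats)

# Codes that occur more than once, with their frequencies.
counts = codes_demo.value_counts()
print(counts[counts > 1])
```

Inspecting the repeated codes, together with their `disease_linearization` values, would show whether they are exact duplicate rows or repeats of another kind.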