# Exploring a SNOMED-CT UK Extension Release

A quick tutorial for opening the SNOMED CT UK edition and drug extension with Python.

In [1]:
import pandas as pd
import numpy as np
import json
import re
import os
import hashlib

## Loading the SNOMED UK extension release files

In [2]:
snomed_dir = os.path.join(os.getcwd(),'SNOMED_UK')

The version used here is the 20200930 release of the SNOMED CT UK extension.

Use the Snapshot release here rather than Full: Full contains all historical concepts since 2014, and Delta contains only the differences from the previous version.
https://confluence.ihtsdotools.org/display/DOCGLOSS/Snapshot+release
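
To make the difference between the release types concrete, here is a minimal sketch of how a Snapshot-style view could be derived from Full-style rows: for each `id`, keep the row with the latest `effectiveTime`. The toy rows and the helper name `snapshot_from_full` are assumptions made for illustration and are not part of the release.

```python
import pandas as pd

def snapshot_from_full(full_df):
    """Keep the most recent version (highest effectiveTime) of each component id."""
    # effectiveTime is a YYYYMMDD string, so lexical order equals chronological order
    ordered = full_df.sort_values(['id', 'effectiveTime'])
    return ordered.drop_duplicates('id', keep='last').reset_index(drop=True)

# Toy Full-style rows (invented for illustration)
full = pd.DataFrame({
    'id': ['100', '100', '200'],
    'effectiveTime': ['20020131', '20190731', '20170731'],
    'active': ['1', '0', '1'],
})
snapshot = snapshot_from_full(full)
print(snapshot)
```

In practice the Snapshot files shipped with the release already contain this view, so the sketch only illustrates why they are smaller than Full.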

In [3]:
base_term = f'{snomed_dir}/uk_sct2cl_30.2.0_20200930000001/'
int_terminology = base_term + 'SnomedCT_InternationalRF2_PRODUCTION_20190731T120000Z/Snapshot/Terminology'
uk_ext_terminology = base_term + 'SnomedCT_UKClinicalRF2_PRODUCTION_20200930T000001Z/Snapshot/Terminology'

In [4]:
def parse_file(filename, first_row_header=True, columns=None):
    with open(filename, encoding='utf-8') as f:
        entities = [[n.strip() for n in line.split('\t')] for line in f]
        return pd.DataFrame(entities[1:], columns=entities[0] if first_row_header else columns)
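
As an alternative, the same tab-separated files can probably be read with `pd.read_csv`. The sketch below is an assumption about suitable options rather than the loader used in this notebook: `parse_file_pandas` is a hypothetical name, and unlike `parse_file` it does not strip whitespace from each field.

```python
import csv
import os
import tempfile
import pandas as pd

def parse_file_pandas(filename):
    """Read a tab-separated RF2 file, keeping every column as a string."""
    return pd.read_csv(
        filename,
        sep='\t',
        dtype=str,               # keep SCTIDs as strings, not integers
        quoting=csv.QUOTE_NONE,  # terms may contain quote characters
        keep_default_na=False,   # do not turn empty fields or 'NA' into NaN
        encoding='utf-8',
    )

# Toy file (invented rows) to show the call
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_concepts.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('id\teffectiveTime\tactive\n')
        f.write('101009\t20020131\t1\n')
    toy = parse_file_pandas(path)
print(toy)
```

Reading everything as strings matters because the later filters compare against string values such as `'1'`.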

## SNOMED CT Design

### SNOMED CT Components
SNOMED CT is a clinical terminology containing concepts with unique meanings and formal logic based definitions organised into hierarchies.
For further information please see: https://confluence.ihtsdotools.org/display/DOCSTART/4.+SNOMED+CT+Basics

SNOMED CT content is represented by three main types of component:
- __Concepts__, which represent clinical meanings and are organised into hierarchies.
- __Descriptions__, which link appropriate human-readable terms to concepts.
- __Relationships__, which link each concept to other related concepts.

__NOTE:__ SNOMED-CT (UK Ed.) is an extension to the Int Ed. Both sets of files (Int. and the UK Ext.) are released as part of one 'UK Release'.

Load and merge the active concepts from the International and UK Extension __Concept snapshot__ files.

#### __Table 4.2.1-1:__ Concept file - Detailed Specification

|Field|Data type|Purpose|Mutable|Part of Primary Key|
|:-----|:-----|:-----|:-----|:-----|
|id|SCTID|Uniquely identifies the concept.|NO|YES (Full/Snapshot)|
|effectiveTime|Time|Specifies the inclusive date at which the component version's state became the then current valid state of the component.|YES|YES (Full)<br>Optional (Snapshot)|
|active|Boolean|Specifies whether the concept was active or inactive from the nominal release date specified by the effectiveTime.|YES|NO|
|moduleId|SCTID|Identifies the concept version's module. Set to a descendant of 900000000000443000 (Module) within the metadata hierarchy.|YES|NO|
|definitionStatusId|SCTID|Specifies if the concept version is primitive or defined. Set to a descendant of 900000000000444006 (Definition status) in the metadata hierarchy.|YES|NO|

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT

In [5]:
int_terms = parse_file(f'{int_terminology}/sct2_Concept_Snapshot_INT_20190731.txt')
uk_terms = parse_file(f'{uk_ext_terminology}/sct2_Concept_Snapshot_GB1000000_20200930.txt')
terms = pd.concat([int_terms, uk_terms])
active_terms = terms[terms.active == '1'] # active concepts are represented with 1

In [6]:
# Every concept has a unique concept identifier: active_terms['id'] 
active_terms.describe()

Unnamed: 0,id,effectiveTime,active,moduleId,definitionStatusId
count,380165,380165,380165,380165,380165
unique,380165,72,1,5,2
top,429729007,20020131,1,900000000000207008,900000000000074008
freq,1,172004,380165,349097,268381
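
The two values of `definitionStatusId` are easier to read once mapped to labels. The sketch below assumes the two metadata identifiers 900000000000074008 (primitive) and 900000000000073002 (defined); the toy rows are invented, and in the notebook the same mapping would be applied to `active_terms`.

```python
import pandas as pd

# Definition status concepts from the SNOMED CT metadata hierarchy
DEFINITION_STATUS = {
    '900000000000074008': 'primitive',
    '900000000000073002': 'defined',
}

# Toy stand-in for active_terms (invented identifiers)
toy_terms = pd.DataFrame({
    'id': ['1', '2', '3'],
    'definitionStatusId': ['900000000000074008',
                           '900000000000074008',
                           '900000000000073002'],
})
toy_terms['definition_status'] = toy_terms['definitionStatusId'].map(DEFINITION_STATUS)
status_counts = toy_terms['definition_status'].value_counts()
print(status_counts)
```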


Load and merge the active descriptions from the International and UK Extension __Description snapshot__ files.

#### __Table 4.2.2-1:__ Description file - Detailed Specification

|Field|Data type|Purpose|Mutable|Part of Primary Key|
|:-----|:-----|:-----|:-----|:-----|
|id|SCTID|Uniquely identifies the description.|NO|YES (Full/Snapshot)|
|effectiveTime|Time|Specifies the inclusive date at which the component version's state became the then current valid state of the component.|YES|YES (Full)<br>Optional (Snapshot)|
|active|Boolean|Specifies whether the state of the description was active or inactive from the nominal release date specified by the effectiveTime.|YES|NO|
|moduleId|SCTID|Identifies the description version's module. Set to a child of 900000000000443000\|Module\| within the metadata hierarchy.|YES|NO|
|conceptId|SCTID|Identifies the concept to which this description applies. Set to the identifier of a concept in the 138875005 \|SNOMED CT Concept\| hierarchy within the Concept. Note that a specific version of a description is not directly bound to a specific version of the concept to which it applies. Which version of a description applies to a concept depends on its effectiveTime and the point in time at which it is accessed.|NO|NO|
|languageCode|String|Specifies the language of the description text using the two character ISO-639-1 code. Note that this specifies a language level only, not a dialect or country code.|NO|NO|
|typeId|SCTID|Identifies whether the description is a fully specified name, a synonym or another description type. This field is set to a child of 900000000000446008\|Description type\| in the Metadata hierarchy.|NO|NO|
|term|String|The description version's text value, represented in UTF-8 encoding.|YES|NO|
|caseSignificanceId|SCTID|Identifies the concept enumeration value that represents the case significance of this description version. For example, the term may be completely case sensitive, case insensitive or initial letter case insensitive. This field will be set to a child of 900000000000447004\|Case significance\| within the metadata hierarchy.|YES|NO|

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT

In [7]:
int_desc = parse_file(f'{int_terminology}/sct2_Description_Snapshot-en_INT_20190731.txt')
uk_desc = parse_file(f'{uk_ext_terminology}/sct2_Description_Snapshot-en_GB1000000_20200930.txt')
descs = pd.concat([int_desc, uk_desc])
active_descs = descs[descs.active == '1']

In [8]:
active_descs.head()

Unnamed: 0,id,effectiveTime,active,moduleId,conceptId,languageCode,typeId,term,caseSignificanceId
0,101013,20170731,1,900000000000207008,126813005,en,900000000000013009,Neoplasm of anterior aspect of epiglottis,900000000000448009
1,102018,20170731,1,900000000000207008,126814004,en,900000000000013009,Neoplasm of junctional region of epiglottis,900000000000448009
2,103011,20170731,1,900000000000207008,126815003,en,900000000000013009,Neoplasm of lateral wall of oropharynx,900000000000448009
3,104017,20170731,1,900000000000207008,126816002,en,900000000000013009,Neoplasm of posterior wall of oropharynx,900000000000448009
4,105016,20170731,1,900000000000207008,126817006,en,900000000000013009,Neoplasm of esophagus,900000000000448009


Load and merge the active relationships from the International and UK Extension __Relationship snapshot__ files.

#### __Table 4.2.3-1:__ Relationship file - Detailed specification

|Field|Data type|Purpose|Mutable|Part of Primary Key|
|:-----|:-----|:-----|:-----|:-----|
|id|SCTID|Uniquely identifies the relationship.|NO|YES (Full/Snapshot)|
|effectiveTime|Time|Specifies the inclusive date at which the component version's state became the then current valid state of the component.|YES|YES (Full)<br>Optional (Snapshot)|
|active|Boolean|Specifies whether the state of the relationship was active or inactive from the nominal release date specified by the effectiveTime field.|YES|NO|
|moduleId|SCTID|Identifies the relationship version's module. Set to a child of 900000000000443000\|Module\| within the metadata hierarchy.|YES|NO|
|sourceId|SCTID|Identifies the source concept of the relationship version. That is the concept defined by this relationship. Set to the identifier of a concept.|NO|NO|
|destinationId|SCTID|Identifies the concept that is the destination of the relationship version.<br>That is the concept representing the value of the attribute represented by the typeId column.<br>Set to the identifier of a concept.<br>Note that the values that can be applied to particular attributes are formally defined by the SNOMED CT Machine Readable Concept Model.|NO|NO|
|relationshipGroup|Integer|Groups together relationship versions that are part of a logically associated relationshipGroup. All active Relationship records with the same relationshipGroup number and sourceId are grouped in this way.|YES|NO|
|typeId|SCTID|Identifies the concept that represents the defining attribute (or relationship type) represented by this relationship version.<br><br>That is the concept representing the value of the attribute represented by the typeId column.<br><br>Set to the identifier of a concept. The concept identified must be either 116680003\|Is a\| or a subtype of 410662002\|Concept model attribute\|. The concepts that can be used in the typeId column are formally defined as follows:<br>116680003\|is a\| OR < 410662002\|concept model attribute\|<br><br>__Note__ that the attributes that can be applied to particular concepts are formally defined by the SNOMED CT Machine Readable Concept Model.|NO|NO|
|characteristicTypeId|SCTID|A concept enumeration value that identifies the characteristic type of the relationship version (i.e. whether the relationship version is defining, qualifying, etc.) This field is set to a descendant of 900000000000449001\|Characteristic type\| in the metadata hierarchy.|YES|NO|
|modifierId|SCTID|A concept enumeration value that identifies the type of Description Logic(DL) restriction (some, all, etc.). Set to a child of 900000000000450001\|Modifier\| in the metadata hierarchy.<br> __Note__ Currently the only value used in this column is 900000000000451002\|Some\| and thus in practical terms this column can be ignored.|YES|NO|

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT

In [9]:
int_relat = parse_file(f'{int_terminology}/sct2_Relationship_Snapshot_INT_20190731.txt')
uk_relat = parse_file(f'{uk_ext_terminology}/sct2_Relationship_Snapshot_GB1000000_20200930.txt')
relat = pd.concat([int_relat, uk_relat])
active_relat = relat[relat.active == '1']

In [10]:
active_relat.head()

Unnamed: 0,id,effectiveTime,active,moduleId,sourceId,destinationId,relationshipGroup,typeId,characteristicTypeId,modifierId
1,101021,20020131,1,900000000000207008,10000006,29857009,0,116680003,900000000000011006,900000000000451002
2,102025,20020131,1,900000000000207008,10000006,9972008,0,116680003,900000000000011006,900000000000451002
13,114022,20020131,1,900000000000207008,134035007,84371003,0,116680003,900000000000011006,900000000000451002
26,127021,20020131,1,900000000000207008,134136005,57250008,0,116680003,900000000000011006,900000000000451002
29,130025,20020131,1,900000000000207008,10002003,116175006,0,116680003,900000000000011006,900000000000451002


## SNOMED CT Concept Model

<img src="img/Association Between Files from 2019.png">

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT

Find the fully specified name, Synonym or Definition of a SNOMED concept

__Description type__

|Type id|Term|
|:---:|:---|
|900000000000003001|Fully specified name|
|900000000000013009|Synonym|
|900000000000550004|Definition|


Create a DataFrame which contains only the active SNOMED codes and their fully specified names.

In [11]:
active_with_desc = pd.merge(active_terms, active_descs[active_descs['typeId'] == '900000000000003001'], left_on=['id'], right_on=['conceptId'], how='inner')
active_with_desc.describe()


Unnamed: 0,id_x,effectiveTime_x,active_x,moduleId_x,definitionStatusId,id_y,effectiveTime_y,active_y,moduleId_y,conceptId,languageCode,typeId,term,caseSignificanceId
count,380165,380165,380165,380165,380165,380165,380165,380165,380165,380165,380165,380165,380165,380165
unique,380165,72,1,5,2,380165,70,1,5,380165,1,1,380165,3
top,429729007,20020131,1,900000000000207008,900000000000074008,666482010,20170731,1,900000000000207008,429729007,en,900000000000003001,Cocarboxylase tetrahydrate (substance),900000000000448009
freq,1,172004,380165,349097,268381,1,241600,380165,349098,1,380165,380165,1,280125


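### Inspect SNOMED for duplicate entries
A concept can end up with more than one active fully specified name; 3 such concepts have been seen with this approach. The output below is empty for this release, so the de-duplication that follows acts only as a safeguard.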

In [12]:
# Inspect the duplicates
active_with_desc[active_with_desc.duplicated(['id_x'], keep='first')]

Unnamed: 0,id_x,effectiveTime_x,active_x,moduleId_x,definitionStatusId,id_y,effectiveTime_y,active_y,moduleId_y,conceptId,languageCode,typeId,term,caseSignificanceId


In [13]:
# drop duplicates
active_with_desc = active_with_desc.drop_duplicates(['id_x'], keep='first')
assert len(active_with_desc) == len(active_terms)

## Create the Semantic tags
Extract the semantic tag from each fully specified name; it indicates the top-level hierarchy to which the concept belongs.
tui -> type unique identifier

In [14]:
active_with_desc['semantic_tag'] = active_with_desc['term'].str.extract(r"\((\w+\s?.?\s?\w+.?\w+.?\w+.?)\)$")

In [15]:
active_with_desc[active_with_desc['semantic_tag'].isnull()].values

array([], shape=(0, 15), dtype=object)
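
The pattern above is fairly intricate. As a hedged alternative, the sketch below captures whatever sits inside the final pair of parentheses; it is an assumption that this simpler pattern covers every fully specified name in a release, so the null check above would still be worth running. The sample terms are taken from this notebook.

```python
import pandas as pd

sample_terms = pd.Series([
    'Quilonia ethiopica (organism)',
    'Hemoglobin Okaloosa (substance)',
    'Lower respiratory tract infection (disorder)',
    'SNOMED CT Concept (SNOMED RT+CTV3)',
])
# Capture the text inside the last pair of parentheses at the end of the term
sample_tags = sample_terms.str.extract(r'\(([^()]+)\)$', expand=False)
print(sample_tags.tolist())
```

With `expand=False` the result is a Series, which can be assigned directly to a `semantic_tag` column.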

In [16]:
# List the unique semantic tags
active_with_desc['semantic_tag'].unique()

array(['organism', 'substance', 'procedure', 'body structure', 'disorder',
       'occupation', 'finding', 'qualifier value',
       'morphologic abnormality', 'cell structure', 'physical object',
       'regime/therapy', 'product', 'medicinal product', 'cell', 'person',
       'ethnic group', 'environment', 'observable entity', 'event',
       'religion/philosophy', 'attribute', 'physical force', 'situation',
       'medicinal product form', 'navigational concept', 'clinical drug',
       'social concept', 'tumor staging', 'specimen', 'basic dose form',
       'life style', 'dose form', 'linkage concept', 'staging scale',
       'record artifact', 'assessment scale', 'SNOMED RT+CTV3',
       'geographic location', 'environment / location',
       'inactive concept', 'special concept', 'namespace concept',
       'racial group', 'link assertion', 'foundation metadata concept',
       'core metadata concept', 'disposition', 'unit of presentation',
       'OWL metadata concept', 'number'

Explore what each tui contains:

In [32]:
active_with_desc[active_with_desc['semantic_tag'] == 'product name']

Unnamed: 0,id_x,effectiveTime_x,active_x,moduleId_x,definitionStatusId,id_y,effectiveTime_y,active_y,moduleId_y,conceptId,languageCode,typeId,term,caseSignificanceId,semantic_tag
329585,774167006,20190131,1,900000000000207008,900000000000074008,3728216019,20190131,1,900000000000207008,774167006,en,900000000000003001,Product name (product name),900000000000448009,product name


# Create the input required for a MedCAT concept database
If you also have a drug extension, skip ahead to the drug extension section.

#### Create a MedCAT concept database including all synonyms

In [74]:
_ = pd.merge(active_terms, active_descs, left_on=['id'], right_on=['conceptId'], how='inner')
active_with_primary_desc = _[_['typeId'] == '900000000000003001']
active_with_primary_desc = active_with_primary_desc.drop_duplicates(['id_x'], keep='first')
active_with_synonym_desc = _[_['typeId'] == '900000000000013009']
active_with_all_desc = pd.concat([active_with_primary_desc, active_with_synonym_desc])

In [75]:
# Check that the number of fully specified names equals the number of active concepts
assert len(active_with_all_desc[active_with_all_desc['typeId'] == '900000000000003001']) == len(active_terms)

In [79]:
snomed_cdb_df = pd.merge(active_with_all_desc, active_with_desc, left_on=['id_x'], right_on=['conceptId'], how='inner')

In [77]:
# clean up the merge and rename the columns to fit the medcat Concept database criteria
snomed_cdb_df = snomed_cdb_df.loc[:, ['id_x_x','term_x','typeId_x','semantic_tag']]
snomed_cdb_df.columns = ['cui', 'name', 'name_status', 'semantic_tag']
snomed_cdb_df['ontologies'] = 'SNOMED-CT'
snomed_cdb_df['name_status'] = snomed_cdb_df['name_status'].replace(['900000000000003001', '900000000000013009'], ['P','A'])
snomed_cdb_df.head()


Unnamed: 0,cui,name,name_status,semantic_tag,ontologies
0,101009,Quilonia ethiopica (organism),P,organism,SNOMED-CT
1,101009,Quilonia ethiopica,A,organism,SNOMED-CT
2,102002,Hemoglobin Okaloosa (substance),P,substance,SNOMED-CT
3,102002,Hemoglobin Okaloosa,A,substance,SNOMED-CT
4,102002,"Hb 48(CD7), Leu-arg",A,substance,SNOMED-CT


There are 58 semantic tag categories in total in the SNOMED taxonomy:
- There is one root concept.
- There are 19 top-level terms, shown in bold in the table below.
- There are 38 sub-terms.

Note that newer releases are likely to contain additional categories.

Each semantic tag is given a __type_id__, which is simply a hash of the tag (computed after the table below).



### Specifying top-level terms and Semantic Tags

|Top level code |Semantic Tag|
|:---:|:---|
|__Root code__|__SNOMED RT+CTV3__|
|||
|__Y__|__Body structure (body structure)__|
|N|morphologic abnormality|
|N|cell structure|
|N|cell|
|||
|__Y__|__Clinical finding (finding)__|
|N|disorder|
|||
|__Y__|__Environment or geographical location (environment / location)__|
|N|environment|
|N|geographic location|
|||
|__Y__|__Event (event)__|
|||
|__Y__|__Observable entity (observable entity)__|
|||
|__Y__|__Organism (organism)__|
|||
|__Y__|__Pharmaceutical / biologic product (product)__|
|N|medicinal product|
|N|medicinal product form|
|N|clinical drug|
|||
|__Y__|__Physical force (physical force)__|
|||
|__Y__|__Physical object (physical object)__|
|||
|__Y__|__Procedure (procedure)__|
|N|regime/therapy|
|||
|__Y__|__Qualifier value (qualifier value)__|
|N|administration method|
|N|disposition|
|N|intended site|
|N|number|
|N|release characteristic|
|N|transformation|
|N|basic dose form|
|N|dose form|
|N|role|
|N|state of matter|
|N|unit of presentation|
|||
|__Y__|__Record artifact (record artifact)__|
|||
|__Y__|__Situation with explicit context (situation)__|
|||
|__Y__|__SNOMED CT Model Component (metadata)__|
|N|core metadata concept|
|N|foundation metadata concept|
|N|linkage concept|
|N|attribute|
|N|link assertion|
|N|namespace concept|
|N|OWL metadata concept|
|||
|__Y__|__Social context (social concept)__|
|N|life style|
|N|racial group|
|N|ethnic group|
|N|occupation|
|N|person|
|N|religion/philosophy|
|||
|__Y__|__Special concept (special concept)__|
|N|inactive concept|
|N|navigational concept|
|||
|__Y__|__Specimen (specimen)__|
|||
|__Y__|__Staging and scales (staging scale)__|
|N|assessment scale|
|N|tumor staging|
|||
|__Y__|__Substance (substance)__|
|||


In [23]:
# List of all Semantic Tags
terms_list = snomed_cdb_df['semantic_tag'].unique().tolist()
terms_list.sort()
print(terms_list)

['OWL metadata concept', 'SNOMED RT+CTV3', 'administration method', 'assessment scale', 'attribute', 'basic dose form', 'body structure', 'cell', 'cell structure', 'clinical drug', 'core metadata concept', 'disorder', 'disposition', 'dose form', 'environment', 'environment / location', 'ethnic group', 'event', 'finding', 'foundation metadata concept', 'geographic location', 'inactive concept', 'intended site', 'life style', 'link assertion', 'linkage concept', 'medicinal product', 'medicinal product form', 'metadata', 'morphologic abnormality', 'namespace concept', 'navigational concept', 'number', 'observable entity', 'occupation', 'organism', 'person', 'physical force', 'physical object', 'procedure', 'product', 'product name', 'qualifier value', 'racial group', 'record artifact', 'regime/therapy', 'release characteristic', 'religion/philosophy', 'role', 'situation', 'social concept', 'special concept', 'specimen', 'staging scale', 'state of matter', 'substance', 'supplier', 'transfo

In [54]:
# Hash the semantic tag to get a code of at most 8 digits
snomed_cdb_df['type_ids'] = snomed_cdb_df['semantic_tag'].apply(
    lambda x: int(hashlib.sha256(x.encode('utf-8')).hexdigest(), 16) % 10**8)
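
The hashing step can be tried on its own. The sketch below wraps the same expression in a hypothetical helper, `tag_to_type_id`, and adds a clash check, since reducing a hash modulo 10**8 means two different tags could in principle receive the same type id.

```python
import hashlib

def tag_to_type_id(tag):
    """Map a semantic tag to an integer below 10**8 via SHA-256."""
    digest = hashlib.sha256(tag.encode('utf-8')).hexdigest()
    return int(digest, 16) % 10**8

sample_tag_list = ['organism', 'substance', 'disorder', 'finding', 'procedure']
sample_type_ids = {tag: tag_to_type_id(tag) for tag in sample_tag_list}

# Distinct tags could hash to the same type id, so check for clashes
has_collision = len(set(sample_type_ids.values())) != len(sample_type_ids)
print(sample_type_ids, has_collision)
```

Because the value is an integer, an id with leading zeros is stored with fewer than 8 digits.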


In [56]:
snomed_cdb_df.head()

Unnamed: 0,cui,name,name_status,semantic_tag,onto,type_ids
0,101009,Quilonia ethiopica (organism),P,organism,SNOMED-CT,81102976
1,101009,Quilonia ethiopica,A,organism,SNOMED-CT,81102976
2,102002,Hemoglobin Okaloosa (substance),P,substance,SNOMED-CT,91187746
3,102002,Hemoglobin Okaloosa,A,substance,SNOMED-CT,91187746
4,102002,"Hb 48(CD7), Leu-arg",A,substance,SNOMED-CT,91187746


#### Saving your df to CSV

In [None]:
# Write the clinical terms to csv
snomed_cdb_df.to_csv('snomed_cdb_csv_SNOMED-CT-UK_Release_20200401.csv')

In [66]:
del snomed_cdb_df

# If there is a drug extension use below:

In [14]:
drug_extension = f'{snomed_dir}/SnomedCT_UKDrugRF2_PRODUCTION_20200930T000001Z/Snapshot/Terminology'
drug_terms = parse_file(f'{drug_extension}/sct2_Concept_Snapshot_GB1000001_20200930.txt')
active_drug_terms = drug_terms[drug_terms.active == '1']
drug_descriptions = parse_file(f'{drug_extension}/sct2_Description_Snapshot-en_GB1000001_20200930.txt')
active_drug_desc = drug_descriptions[drug_descriptions.active == '1']
# Merge in clinical snomed terminology
all_terms = pd.concat([active_terms, active_drug_terms])
all_descs = pd.concat([active_descs, active_drug_desc])

In [15]:
active_with_desc_drug_ext = pd.merge(all_terms, all_descs[all_descs['typeId'] == '900000000000003001'], left_on=['id'], right_on=['conceptId'], how='inner')
active_with_desc_drug_ext[active_with_desc_drug_ext.duplicated(['id_x'], keep='first')]
active_with_desc_drug_ext = active_with_desc_drug_ext.drop_duplicates(['id_x'], keep='first')
assert len(active_with_desc_drug_ext) == len(all_terms)
active_with_desc_drug_ext['semantic_tag'] = active_with_desc_drug_ext['term'].str.extract(r"\((\w+\s?.?\s?\w+.?\w+.?\w+.?)\)$")


In [16]:
_ = pd.merge(all_terms, all_descs, left_on=['id'], right_on=['conceptId'], how='inner')
active_with_primary_desc = _[_['typeId'] == '900000000000003001']
active_with_primary_desc = active_with_primary_desc.drop_duplicates(['id_x'], keep='first')
active_with_synonym_desc = _[_['typeId'] == '900000000000013009']
active_with_all_desc = pd.concat([active_with_primary_desc, active_with_synonym_desc])

In [17]:
snomed_cdb_df = pd.merge(active_with_all_desc, active_with_desc_drug_ext, left_on=['id_x'], right_on=['conceptId'], how='inner')

In [18]:
# clean up the merge and rename the columns to fit the medcat Concept database criteria
snomed_cdb_df = snomed_cdb_df[['id_x_x','term_x','typeId_x','semantic_tag']]
snomed_cdb_df.columns = ['cui', 'name', 'name_status', 'semantic_tag']
snomed_cdb_df['ontologies'] = 'SNOMED-CT'
snomed_cdb_df['name_status'] = snomed_cdb_df['name_status'].replace(['900000000000003001', '900000000000013009'], ["P","A"])


In [19]:
# Hash the semantic tag to get a code of at most 8 digits
snomed_cdb_df['type_ids'] = snomed_cdb_df['semantic_tag'].apply(
    lambda x: int(hashlib.sha256(x.encode('utf-8')).hexdigest(), 16) % 10**8)

In [20]:
snomed_cdb_df.head()

Unnamed: 0,cui,name,name_status,semantic_tag,ontologies,type_ids
0,101009,Quilonia ethiopica (organism),P,organism,SNOMED-CT,81102976
1,101009,Quilonia ethiopica,A,organism,SNOMED-CT,81102976
2,102002,Hemoglobin Okaloosa (substance),P,substance,SNOMED-CT,91187746
3,102002,Hemoglobin Okaloosa,A,substance,SNOMED-CT,91187746
4,102002,"Hb 48(CD7), Leu-arg",A,substance,SNOMED-CT,91187746


In [21]:
# write the clinical terms plus drug extension to csv:
file_name = input("Enter file name:")
snomed_cdb_df.to_csv(file_name+'.csv') #snomed_cdb_csv_SNOMED-CT-full_UK_drug_ext_Release_20200228

Enter file name:snomed_cdb_csv_SNOMED-CT-full_UK_drug_ext_Release_20211001


In [23]:
# Functions for finding the concept name and all synonyms for a SNOMED concept

def find_name(snomedcode):
    """
    Returns the fully specified name for a SNOMED code (empty string if not found)
    """
    df = snomed_cdb_df[(snomed_cdb_df['cui'] == snomedcode) & (snomed_cdb_df['name_status'] == 'P')]
    concept_name = df['name'].values
    return ''.join(concept_name)

def find_syn(snomedcode):
    """
    Returns all synonyms for a SNOMED code, joined by '; '. The concept name is not included
    """
    df = snomed_cdb_df[(snomed_cdb_df['cui'] == snomedcode) & (snomed_cdb_df['name_status'] == 'A')]
    synonym = df['name'].tolist()
    return '; '.join(synonym)

In [24]:
print(find_name("50417007"))
print(find_syn("50417007"))

Lower respiratory tract infection (disorder)
Lower respiratory tract infection; Lower respiratory infection; Chest cold; LRTI - Lower respiratory tract infection
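
Each call to `find_name` or `find_syn` scans the whole DataFrame, which becomes slow when many codes are looked up. A possible alternative, sketched here on a toy stand-in for `snomed_cdb_df`, is to build two dictionaries once; the names `cui2name` and `cui2syn` are hypothetical.

```python
import pandas as pd

# Toy stand-in for snomed_cdb_df, using rows shown earlier in this notebook
toy_cdb = pd.DataFrame({
    'cui': ['101009', '101009', '102002', '102002', '102002'],
    'name': ['Quilonia ethiopica (organism)', 'Quilonia ethiopica',
             'Hemoglobin Okaloosa (substance)', 'Hemoglobin Okaloosa',
             'Hb 48(CD7), Leu-arg'],
    'name_status': ['P', 'A', 'P', 'A', 'A'],
})

primary = toy_cdb[toy_cdb['name_status'] == 'P']
synonyms = toy_cdb[toy_cdb['name_status'] == 'A']

# One fully specified name per cui, and a list of synonyms per cui
cui2name = dict(zip(primary['cui'], primary['name']))
cui2syn = synonyms.groupby('cui')['name'].apply(list).to_dict()

print(cui2name['102002'])
print('; '.join(cui2syn.get('102002', [])))
```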


## Exploring SNOMED relationships

### Root and top-level Concepts
All concepts descend from the root concept 138875005 |SNOMED CT Concept (SNOMED RT+CTV3)|


#### Table 3: Top Level Concepts
These concepts are all direct children of the root concept: 138875005, (SNOMED CT Concept (SNOMED RT+CTV3))<br>They are all linked to it via the relationship typeId: 116680003, (is a)
<br>A full list of relationship types can be found among the child concepts of: 106237007, (linkage concept)



|SCTID|Semantic Tag|
|:---:|:---:|
|123037004 |Body structure|
|404684003 |Clinical finding|
|272379006 |Event|
|308916002 |Environment or geographical location|
|363787002 |Observable entity|
|410607006 |Organism|
|373873005 |Pharmaceutical / biologic product|
|78621006 |Physical force|
|260787004 |Physical object|
|71388002 |Procedure|
|362981000 |Qualifier value|
|419891008 |Record artifact|
|243796009 |Situation with explicit context|
|900000000000441003 |SNOMED CT Model Component (metadata)|
|48176007 |Social context|
|370115009 |Special concept|
|123038009 |Specimen|
|254291000 |Staging and scales|
|105590001 |Substance|


Taken from the Technical Implementation Guide (4.1), Table 4.1-3: https://confluence.ihtsdotools.org/display/DOCTIG

## Creating the relationship dictionaries

Parent-to-children structure:
pt2ch = {'\<cui_for_pt\>': \[\<list of cuis for children\>\], …}

In [None]:
# Merge relationship files
drug_ext_relat = parse_file(f'{drug_extension}/sct2_Relationship_Snapshot_GB1000001_20200930.txt')
active_drug_ext_relat = drug_ext_relat[drug_ext_relat.active == '1']
all_relat = pd.concat([active_relat, active_drug_ext_relat])
all_relat[['sourceId','destinationId','typeId']] = 'S-' + all_relat[['sourceId','destinationId','typeId']].astype(str)

In [None]:
all_relat.head()

In [None]:
# write the relationship terms plus drug extension relationships to csv:
file_name = input("Enter file name:")
all_relat.to_csv(file_name+'.csv') #snomed_rela_csv_SNOMED-CT-full_UK_drug_ext_Release_20200228

In [None]:
# Find all types of relationships
rel = all_relat['typeId'].unique()
for type_id in rel:
    # find_name expects the bare SCTID, so drop the 'S-' prefix
    print(find_name(type_id[2:]), type_id)

# Parents and Children (IS A)
Subtype relationship 116680003|Is a (attribute)| relates a Concept to its immediate supertype Concepts.

In [None]:
# Parent to Children dictionary
pt2ch = {key: [] for key in all_relat["destinationId"].unique()}
# Only 'Is a' relationships define the hierarchy
isa_rows = all_relat[all_relat['typeId'] == "S-116680003"]
for parent, child in zip(isa_rows['destinationId'], isa_rows['sourceId']):
    pt2ch[parent].append(child)

In [None]:
# Children to Parent dictionary
ch2pt = {key: [] for key in all_relat["sourceId"].unique()}
# Only 'Is a' relationships define the hierarchy
isa_rows = all_relat[all_relat['typeId'] == "S-116680003"]
for child, parent in zip(isa_rows['sourceId'], isa_rows['destinationId']):
    ch2pt[child].append(parent)


In [None]:
# Write the 'isa' relationships to file
with open('isa_rela_pt2ch.txt', 'w') as outfile:
    json.dump(pt2ch, outfile)
with open('isa_rela_ch2pt.txt', 'w') as outfile:
    json.dump(ch2pt, outfile)

In [None]:
# Load the 'isa' relationships back into dictionaries
with open('isa_rela_pt2ch.txt') as json_file:
    pt2ch = json.load(json_file)
with open('isa_rela_ch2pt.txt') as json_file:
    ch2pt = json.load(json_file)
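
The parent-to-children dictionary only stores immediate children. To collect every descendant of a concept, the links can be followed breadth-first, as in this sketch. The helper `descendants` and the toy identifiers are assumptions made for illustration; with the real data it would be called as `descendants('S-404684003', pt2ch)`.

```python
from collections import deque

def descendants(cui, pt2ch):
    """Return every concept reachable from `cui` by following parent-to-child links."""
    seen = set()
    queue = deque([cui])
    while queue:
        current = queue.popleft()
        for child in pt2ch.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Toy hierarchy with invented identifiers; 'S-4' has two parents
toy_pt2ch = {'S-1': ['S-2', 'S-3'], 'S-2': ['S-4'], 'S-3': ['S-4'], 'S-4': []}
toy_descendants = descendants('S-1', toy_pt2ch)
print(sorted(toy_descendants))
```

The `seen` set matters because a concept can have several parents, so the same child may be reached along more than one path.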

In [None]:
# Check whether the top-level concepts are the same in the SNOMED UK Extension
top_level_concepts = all_relat[all_relat['destinationId'] == 'S-138875005'].copy()
# find_name expects the bare SCTID, so drop the 'S-' prefix
top_level_concepts['conceptname'] = top_level_concepts['sourceId'].str[2:].apply(find_name)
top_level_concepts[['sourceId', 'conceptname']].reset_index()
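
Rather than comparing the result with Table 3 by eye, the check can be written as a set comparison. This is only a sketch: `compare_top_level` is a hypothetical helper, the expected identifiers are copied from Table 3 above, and the input here is a toy list rather than the real `top_level_concepts['sourceId']` column.

```python
# Top-level SCTIDs copied from Table 3 above
EXPECTED_TOP_LEVEL = {
    '123037004', '404684003', '272379006', '308916002', '363787002',
    '410607006', '373873005', '78621006', '260787004', '71388002',
    '362981000', '419891008', '243796009', '900000000000441003',
    '48176007', '370115009', '123038009', '254291000', '105590001',
}

def compare_top_level(found_ids, prefix='S-'):
    """Report which expected top-level concepts are missing and which found ones are new."""
    found = {i[len(prefix):] if i.startswith(prefix) else i for i in found_ids}
    return {'missing': EXPECTED_TOP_LEVEL - found,
            'unexpected': found - EXPECTED_TOP_LEVEL}

# Toy input: every expected concept except one, plus one invented identifier
toy_found = ['S-' + i for i in EXPECTED_TOP_LEVEL if i != '105590001'] + ['S-999']
report = compare_top_level(toy_found)
print(report)
```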

### Siblings

In [None]:
isa_rel = all_relat[all_relat['typeId'] == 'S-116680003']

In [None]:
# Find siblings
# cui to siblings dictionary
cui2sib_dic = dict([(key, set()) for key in isa_rel['sourceId'].unique()])

In [None]:
isa_rel.head()

In [None]:
# Find siblings function
def cui2sib(snomed):
    x = set()
    for a in ch2pt[snomed]:
        for b in pt2ch[a]:
            x.add(b)
    return x

In [None]:
%%timeit
for key in cui2sib_dic:
    cui2sib_dic[key].update(cui2sib(key))

In [None]:
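%%time
# Alternative: build the sibling dictionary directly, with a progress bar
from tqdm import tqdm
# Assumption: unique_snomed is not defined in this notebook, so take it to be the is-a source concepts
unique_snomed = isa_rel['sourceId'].unique()
cui2sib_dic = dict()
for key in tqdm(unique_snomed, total=len(unique_snomed)):
    cui2sib_dic[key] = cui2sib(key)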

In [None]:
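with open('isa_rela_cui2sib.txt', 'w') as outfile:
    # Dump the dictionary, not the function; sets are not JSON serialisable, so convert them to lists
    json.dump({key: sorted(sibs) for key, sibs in cui2sib_dic.items()}, outfile)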

In [None]:
cui2sib('S-404684003')

# ICD-10 / OPCS-4 linkages with SNOMED-CT


Note the mapping priority

In [25]:
refset_terminology = f'{base_term}/SnomedCT_UKClinicalRF2_PRODUCTION_20200930T000001Z/Snapshot/Refset/Map'

In [26]:
mappings = parse_file(f'{refset_terminology}/der2_iisssciRefset_ExtendedMapSnapshot_GB1000000_20200930.txt')
mappings = mappings[mappings.active == '1']

In [27]:
mappings.mapPriority = mappings.mapPriority.astype(int)

In [None]:
icd10_refset_id = '999002271000000101'
opcs4_refset_id = '999002741000000101'

In [None]:
%%time
cui2mappings = dict()
for cui in snomed_cdb_df.cui.unique():
    cui_map = mappings[mappings.referencedComponentId == cui].loc[:, ['mapPriority', 'mapAdvice', 'mapTarget', 'refsetId']]
    if cui_map.shape[0] > 0:
        cui2mappings[cui] = cui_map.sort_values('mapPriority')
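
Filtering the whole `mappings` DataFrame once per concept is slow for a few hundred thousand concepts. One possible alternative is a single `groupby` pass, sketched below on invented rows; `known_cuis` stands in for `snomed_cdb_df.cui.unique()` and the other names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the mappings DataFrame (invented rows)
toy_mappings = pd.DataFrame({
    'referencedComponentId': ['111', '111', '222'],
    'mapPriority': [2, 1, 1],
    'mapAdvice': ['ALWAYS B', 'ALWAYS A', 'ALWAYS C'],
    'mapTarget': ['B00', 'A00', 'C00'],
    'refsetId': ['r1', 'r1', 'r2'],
})
known_cuis = {'111', '222', '333'}  # stands in for snomed_cdb_df.cui.unique()

cols = ['mapPriority', 'mapAdvice', 'mapTarget', 'refsetId']
# One pass over the data: each group holds the rows for a single concept
toy_cui2mappings = {
    cui: group[cols].sort_values('mapPriority')
    for cui, group in toy_mappings.groupby('referencedComponentId')
    if cui in known_cuis
}
print(toy_cui2mappings['111'])
```

Concepts without any mapping, such as `'333'` here, simply do not appear in the result, which matches the behaviour of the loop above.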

In [None]:
opcs_mappings = {}
icd10_mappings = {}
# Use cui_map as the loop variable so the mappings DataFrame is not overwritten
for cui, cui_map in cui2mappings.items():
    icd10_codes = cui_map[cui_map.refsetId == icd10_refset_id]
    if icd10_codes.shape[0] > 0:
        icd10_mappings[cui] = icd10_codes
    opcs_codes = cui_map[cui_map.refsetId == opcs4_refset_id]
    if opcs_codes.shape[0] > 0:
        opcs_mappings[cui] = opcs_codes

In [None]:
import pickle
# Use context managers so the files are closed after writing
with open('20200930_opcs_mappings_full.pickle', 'wb') as f:
    pickle.dump(opcs_mappings, f)
with open('20200930_icd10_mappings_full.pickle', 'wb') as f:
    pickle.dump(icd10_mappings, f)

In [None]:
# condense mappings to a simple dict representation

In [None]:
def condense_mapping(cui2mappings):
    mapping_condensed = {}
    for cui, mappings in cui2mappings.items():
        # Raw strings avoid invalid-escape warnings; the pattern inserts a dot after the third character
        mapping_condensed[cui] = mappings.mapTarget.replace(r'(\w\d\d)(\d*)', r'\1.\2', regex=True).tolist()
    return mapping_condensed
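
The effect of the pattern is easiest to see on single codes. The sketch below applies the same expression with `re.sub` and, as a labelled assumption, shows an anchored variant (`add_dot`, a hypothetical helper) that leaves three-character codes without a trailing dot. The sample codes are illustrative only.

```python
import re

def original_style(code):
    """Same pattern as condense_mapping: a dot is inserted after the third character."""
    return re.sub(r'(\w\d\d)(\d*)', r'\1.\2', code)

def add_dot(code):
    """Assumed variant: only insert the dot when something follows the third character."""
    return re.sub(r'^(\w\d\d)(.+)$', r'\1.\2', code)

for sample_code in ['A010', 'J22X', 'J22']:
    print(sample_code, original_style(sample_code), add_dot(sample_code))
```

Whether the trailing dot on three-character codes matters depends on how the codes are matched against the ICD-10 and OPCS-4 description files.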

In [None]:
icd10_mapping_condensed = condense_mapping(icd10_mappings)

In [None]:
opcs_mapping_condensed = condense_mapping(opcs_mappings)

In [None]:
pickle.dump(icd10_mapping_condensed, open('icd10_mapping_condensed.pickle', 'wb'))
pickle.dump(opcs_mapping_condensed, open('opcs_mapping_condensed.pickle', 'wb'))

In [None]:
cui2mappings['1240751000000100']

In [None]:
# Keys are bare SCTIDs (no 'S-' prefix), as in snomed_cdb_df.cui
opcs_mapping_condensed['104001']

### ICD-10 and OPCS-4 code to descriptions
Links to the files on NHS TRUD:


__icd:__ https://isd.digital.nhs.uk/trud3/user/authenticated/group/0/pack/28

__opcs:__ https://isd.digital.nhs.uk/trud3/user/authenticated/group/0/pack/10

#### ICD10 code2desc

In [None]:
icd_path = r"C:\Users\k1767582\Desktop\icd_df_10.5.0_20151102000001\ICD10_Edition5_20160401\Content"
icd10_mapping_detail = parse_file(f'{icd_path}/ICD10_Edition5_CodesAndTitlesAndMetadata_GB_20160401.txt')
icd10_mapping_detail['full_description'] = icd10_mapping_detail.DESCRIPTION +  icd10_mapping_detail.MODIFIER_4 + icd10_mapping_detail.MODIFIER_5

In [None]:
icd10_uk_codes = {c: desc for c, desc in zip(icd10_mapping_detail.CODE, icd10_mapping_detail.full_description)}
pickle.dump(icd10_uk_codes, open('icd10_uk_code2desc.pickle', 'wb'))

#### OPCS code2desc

In [None]:
opcs_filename = r'C:\Users\k1767582\Desktop\nhs_opcs4df_9.0.0_20191104000001\OPCS49 CodesAndTitles Nov 2019 V1.0.txt'
opcs_desc_df = parse_file(opcs_filename, first_row_header=False, columns=['code', 'desc'])
opcs_desc_df = {code: desc for code, desc in zip(opcs_desc_df.code, opcs_desc_df.desc)}
pickle.dump(opcs_desc_df, open('opcs_code2desc.pickle', 'wb'))