# Exploring a SNOMED CT UK Extension Release

In [1]:
import pandas as pd
import numpy as np
import json
import re

## Loading the SNOMED UK extension release files

In [2]:
snomed_dir = r'C:\Users\k1767582\Desktop\SNOMED' # /Users/shek/Desktop/medcat/SNOMED_UK/20191001_SNOMED_UK

Use the Snapshot release here rather than Full: Full contains every historical version of each concept since 2014, while Delta contains only the differences from the previous version.
https://confluence.ihtsdotools.org/display/DOCGLOSS/Snapshot+release
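
To make the difference concrete, the sketch below derives a Snapshot-style view from a toy Full-style table by keeping, for each `id`, the row with the latest `effectiveTime`. The rows are invented for illustration, and this only approximates how the release types relate; it is not the official build procedure.

```python
import pandas as pd

# Toy "Full" table: every historical version of two components (invented data)
full = pd.DataFrame(
    [
        ("100", "20140131", "1"),
        ("100", "20180731", "0"),  # component 100 was later inactivated
        ("200", "20160131", "1"),
    ],
    columns=["id", "effectiveTime", "active"],
)

# effectiveTime is YYYYMMDD, so string order equals chronological order
snapshot = (
    full.sort_values("effectiveTime")
        .drop_duplicates("id", keep="last")
        .sort_values("id")
        .reset_index(drop=True)
)
print(snapshot)
```

Each `id` appears once in the result, carrying only its most recent state.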

In [3]:
base_term = f'{snomed_dir}/uk_sct2cl_29.0.0_20200401000001/'
int_terminology = base_term + 'SnomedCT_InternationalRF2_PRODUCTION_20180731T120000Z/Snapshot/Terminology'
uk_ext_terminology = base_term + 'SnomedCT_UKClinicalRF2_PRODUCTION_20200401T000001Z/Snapshot/Terminology' #'SnomedCT_UKClinicalRF2_PRODUCTION_20191001T000001Z/Snapshot/Terminology'

In [4]:
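def parse_file(filename, first_row_header=True, columns=None):
    """Read a tab-separated RF2 file into a DataFrame of strings."""
    with open(filename, encoding='utf-8') as f:
        entities = [[n.strip() for n in line.split('\t')] for line in f]
    if first_row_header:
        return pd.DataFrame(entities[1:], columns=entities[0])
    # No header row: keep every line as data and use the supplied column names
    return pd.DataFrame(entities, columns=columns)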
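
As an alternative, the same files could probably be read with `pd.read_csv`. The sketch below is a hypothetical `parse_file_pandas` helper (not part of the original notebook), shown on a tiny invented file. It assumes the file has a header row and that every column should stay a string.

```python
import csv
import os
import tempfile

import pandas as pd

def parse_file_pandas(filename):
    """Hypothetical alternative to parse_file built on pandas.read_csv."""
    return pd.read_csv(
        filename,
        sep='\t',
        dtype=str,               # keep SCTIDs and flags as strings
        quoting=csv.QUOTE_NONE,  # treat quote characters in terms as literal text
        keep_default_na=False,   # do not turn terms such as "NA" into NaN
        encoding='utf-8',
    )

# Tiny invented file to show the call
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_concepts.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('id\tactive\n101009\t1\n102002\t0\n')
    toy = parse_file_pandas(path)
print(toy)
```

Reading everything as strings keeps comparisons such as `active == '1'` working exactly as in the rest of the notebook.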

## SNOMED CT Design

### SNOMED CT Components
SNOMED CT is a clinical terminology containing concepts with unique meanings and formal logic based definitions organised into hierarchies.
For further information please see: https://confluence.ihtsdotools.org/display/DOCSTART/4.+SNOMED+CT+Basics

SNOMED CT content is represented by three main types of components:
- __Concepts__ representing clinical meanings that are organised into hierarchies.
- __Descriptions__ which link appropriate human readable terms to concepts
- __Relationships__ which link each concept to other related concepts

__NOTE:__ The SNOMED CT UK Edition is an extension of the International Edition. Both sets of files (International and UK Extension) are released together as part of one 'UK Release'.

Load and merge the active concepts from the International and UK Extension __Concept snapshot__ files

#### __Table 4.2.1-1:__ Concept file - Detailed Specification

|Field|Data type|Purpose|Mutable|Part of Primary Key|
|:-----|:-----|:-----|:-----|:-----|
|id|SCTID|Uniquely identifies the concept.|NO|YES (Full/Snapshot)|
|effectiveTime|Time|Specifies the inclusive date at which the component version's state became the then current valid state of the component.|YES|YES (Full)<br>Optional (Snapshot)|
|active|Boolean|Specifies whether the concept was active or inactive from the nominal release date specified by the effectiveTime.|YES|NO|
|moduleId|SCTID|Identifies the concept version's module. Set to a descendant of 900000000000443000(Module) within the metadata hierarchy.|YES|NO|
|definitionStatusId|SCTID|Specifies if the concept version is primitive or defined. Set to a descendant of 900000000000444006(Definition status)in the metadata hierarchy.|YES|NO|

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT
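
The `id` values are not opaque: as far as I understand the SCTID layout, the last digit is a check digit and the two digits before it form a partition identifier that says whether the component is a concept, description or relationship and whether it comes from an extension namespace. The helper below (`sctid_parts`, a name made up for this sketch) splits an id on that assumption. It does not validate the check digit.

```python
def sctid_parts(sctid):
    """Split an SCTID string into component type, namespace, partition and check digit."""
    s = str(sctid)
    partition = s[-3:-1]                 # two digits before the check digit
    is_extension = partition[0] == '1'   # '1x' = long format with a namespace
    component = {'0': 'concept', '1': 'description', '2': 'relationship'}
    return {
        'component': component.get(partition[1]),
        'namespace': s[-10:-3] if is_extension else None,
        'partition': partition,
        'check_digit': s[-1],
    }

print(sctid_parts('287525003'))        # an International concept id
print(sctid_parts('298641000000100'))  # a UK extension concept id
```

On the UK extension id the namespace comes out as `1000000`, matching the `GB1000000` part of the release file names.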

In [5]:
int_terms = parse_file(f'{int_terminology}/sct2_Concept_Snapshot_INT_20180731.txt')
uk_terms = parse_file(f'{uk_ext_terminology}/sct2_Concept_Snapshot_GB1000000_20200401.txt')
terms = pd.concat([int_terms, uk_terms])
active_terms = terms[terms.active == '1'] # active concepts are represented with 1

In [6]:
# Every concept has a unique concept identifier: active_terms['id'] 
active_terms.describe()

Unnamed: 0,id,effectiveTime,active,moduleId,definitionStatusId
count,369816,369816,369816,369816,369816
unique,369816,65,1,5,2
top,287525003,20020131,1,900000000000207008,900000000000074008
freq,1,176398,369816,338954,273413


Load and merge the active descriptions from the International and UK Extension __Description snapshot__ files

#### __Table 4.2.2-1:__ Description file - Detailed Specification

|Field|Data type|Purpose|Mutable|Part of Primary Key|
|:-----|:-----|:-----|:-----|:-----|
|id|SCTID|Uniquely identifies the description.|NO|YES (Full/Snapshot)|
|effectiveTime|Time|Specifies the inclusive date at which the component version's state became the then current valid state of the component.|YES|YES (Full)<br>Optional (Snapshot)|
|active|Boolean|Specifies whether the state of the description was active or inactive from the nominal release date specified by the effectiveTime.|YES|NO|
|moduleId|SCTID|Identifies the description version's module. Set to a child of 900000000000443000\|Module\| within the metadata hierarchy.|YES|NO|
|conceptId|SCTID|Identifies the concept to which this description applies. Set to the identifier of a concept in the 138875005 \|SNOMED CT Concept\| hierarchy within the Concept. Note that a specific version of a description is not directly bound to a specific version of the concept to which it applies. Which version of a description applies to a concept depends on its effectiveTime and the point in time at which it is accessed.|NO|NO|
|languageCode|String|Specifies the language of the description text using the two character ISO-639-1 code. Note that this specifies a language level only, not a dialect or country code.|NO|NO|
|typeId|SCTID|Identifies whether the description is a fully specified name, a synonym or another description type. This field is set to a child of 900000000000446008\|Description type\| in the Metadata hierarchy.|NO|NO|
|term|String|The description version's text value, represented in UTF-8 encoding.|YES|NO|
|caseSignificanceId|SCTID|Identifies the concept enumeration value that represents the case significance of this description version. For example, the term may be completely case sensitive, case insensitive or initial letter case insensitive. This field will be set to a child of 900000000000447004\|Case significance\| within the metadata hierarchy.|YES|NO|

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT
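
The `caseSignificanceId` column matters when terms are later matched against free text. The sketch below shows one way it might be used to build a matching key. The three identifiers are the case significance values that appear in this release's description files, the labels in the comments are my reading of them, and `normalise_term` is a hypothetical helper.

```python
CASE_INSENSITIVE = '900000000000448009'          # entire term case insensitive
CASE_SENSITIVE = '900000000000017005'            # entire term case sensitive
INITIAL_CHAR_INSENSITIVE = '900000000000020002'  # only initial character case insensitive

def normalise_term(term, case_significance_id):
    """Return a matching key that respects the description's case significance."""
    if case_significance_id == CASE_INSENSITIVE:
        return term.lower()
    if case_significance_id == INITIAL_CHAR_INSENSITIVE:
        return term[:1].lower() + term[1:]
    return term  # case sensitive (or unknown): leave untouched

print(normalise_term('Neoplasm of esophagus', CASE_INSENSITIVE))
print(normalise_term('GP82 - sent to Health Board (finding)', CASE_SENSITIVE))
```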

In [7]:
int_desc = parse_file(f'{int_terminology}/sct2_Description_Snapshot-en_INT_20180731.txt')
uk_desc = parse_file(f'{uk_ext_terminology}/sct2_Description_Snapshot-en_GB1000000_20200401.txt')
descs = pd.concat([int_desc, uk_desc])
active_descs = descs[descs.active == '1']

In [8]:
active_descs.head()

Unnamed: 0,id,effectiveTime,active,moduleId,conceptId,languageCode,typeId,term,caseSignificanceId
0,101013,20170731,1,900000000000207008,126813005,en,900000000000013009,Neoplasm of anterior aspect of epiglottis,900000000000448009
1,102018,20170731,1,900000000000207008,126814004,en,900000000000013009,Neoplasm of junctional region of epiglottis,900000000000448009
2,103011,20170731,1,900000000000207008,126815003,en,900000000000013009,Neoplasm of lateral wall of oropharynx,900000000000448009
3,104017,20170731,1,900000000000207008,126816002,en,900000000000013009,Neoplasm of posterior wall of oropharynx,900000000000448009
4,105016,20170731,1,900000000000207008,126817006,en,900000000000013009,Neoplasm of esophagus,900000000000448009


Load and merge the active relationships from the International and UK Extension __Relationship snapshot__ files

#### __Table 4.2.3-1:__ Relationship file - Detailed specification

|Field|Data type|Purpose|Mutable|Part of Primary Key|
|:-----|:-----|:-----|:-----|:-----|
|id|SCTID|Uniquely identifies the relationship.|NO|YES(Full/Snapshot)|
|effectiveTime|Time|Specifies the inclusive date at which the component version's state became the then current valid state of the component.|YES|YES(Full) Optional(Snapshot)|
|active|Boolean|Specifies whether the state of the relationship was active or inactive from the nominal release date specified by the effectiveTime field.|YES|NO|
|moduleId|SCTID|Identifies the relationship version's module. Set to a child of 900000000000443000\|Module\| within the metadata hierarchy.|YES|NO|
|sourceId|SCTID|Identifies the source concept of the relationship version. That is the concept defined by this relationship. Set to the identifier of a concept.|NO|NO|
|destinationId|SCTID|Identifies the concept that is the destination of the relationship version.<br>That is the concept representing the value of the attribute represented by the typeId column.<br>Set to the identifier of a concept.<br>Note that the values that can be applied to particular attributes are formally defined by the SNOMED CT Machine Readable Concept Model.|NO|NO|
|relationshipGroup|Integer|Groups together relationship versions that are part of a logically associated relationshipGroup. All active Relationship records with the same relationshipGroup number and sourceId are grouped in this way.|YES|NO|
|typeId|SCTID|Identifies the concept that represents the defining attribute (or relationship type) represented by this relationship version.<br><br>That is the concept representing the value of the attribute represented by the typeId column. <br><br>Set to the identifier of a concept. The concept identified must be either 116680003\|Is a\| or a subtype of 410662002\|Concept model attribute\|. The concepts that can be used in the typeId column are formally defined as follows:<br>116680003\|is a\| OR < 410662002\|concept model attribute\|<br><br>__Note__ that the attributes that can be applied to particular concepts are formally defined by the SNOMED CT Machine Readable Concept Model.|NO|NO|
|characteristicTypeId|SCTID|A concept enumeration value that identifies the characteristic type of the relationship version (i.e. whether the relationship version is defining, qualifying, etc.) This field is set to a descendant of 900000000000449001\|Characteristic type\|in the metadata hierarchy.|YES|NO|
|modifierId|SCTID|A concept enumeration value that identifies the type of Description Logic(DL) restriction (some, all, etc.). Set to a child of 900000000000450001\|Modifier\| in the metadata hierarchy.<br> __Note__ Currently the only value used in this column is 900000000000451002\|Some\| and thus in practical terms this column can be ignored.|YES|NO|

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT

In [9]:
int_relat = parse_file(f'{int_terminology}/sct2_Relationship_Snapshot_INT_20180731.txt')
uk_relat = parse_file(f'{uk_ext_terminology}/sct2_Relationship_Snapshot_GB1000000_20200401.txt')
relat = pd.concat([int_relat, uk_relat])
active_relat = relat[relat.active == '1']

In [10]:
active_relat.head()

Unnamed: 0,id,effectiveTime,active,moduleId,sourceId,destinationId,relationshipGroup,typeId,characteristicTypeId,modifierId
1,101021,20020131,1,900000000000207008,10000006,29857009,0,116680003,900000000000011006,900000000000451002
2,102025,20020131,1,900000000000207008,10000006,9972008,0,116680003,900000000000011006,900000000000451002
13,114022,20020131,1,900000000000207008,134035007,84371003,0,116680003,900000000000011006,900000000000451002
26,127021,20020131,1,900000000000207008,134136005,57250008,0,116680003,900000000000011006,900000000000451002
29,130025,20020131,1,900000000000207008,10002003,116175006,0,116680003,900000000000011006,900000000000451002
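
Rows whose `typeId` is 116680003 (Is a) define the hierarchy. The sketch below builds a child-to-parents map and walks it upwards. The first two rows mirror the output above, the third is invented to add a second level, and `ancestors` is a hypothetical helper name.

```python
from collections import defaultdict

IS_A = '116680003'

# (sourceId, destinationId, typeId)
rows = [
    ('10000006', '29857009', IS_A),
    ('10000006', '9972008', IS_A),
    ('29857009', 'HYPOTHETICAL_GRANDPARENT', IS_A),  # invented row
]

# Map each concept to its direct parents
parents = defaultdict(set)
for source, destination, type_id in rows:
    if type_id == IS_A:
        parents[source].add(destination)

def ancestors(concept_id):
    """Collect every concept reachable by following Is-a links upwards."""
    seen, stack = set(), [concept_id]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors('10000006')))
```

On the real data the same map could be filled by iterating over the `sourceId`, `destinationId` and `typeId` columns of `active_relat`.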


## SNOMED CT Concept Model

<img src="img/Association Between Files from 2019.png">

Taken from: https://confluence.ihtsdotools.org/display/DOCRELFMT

Find the fully specified name, Synonym or Definition of a SNOMED concept

__Description type__

|Type id|Term|
|:---:|:---|
|900000000000003001|Fully specified name|
|900000000000013009|Synonym|
|900000000000550004|Definition|
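
As a small illustration of the table above, the sketch below groups a concept's descriptions by type label. The rows are a hand-copied sample of description terms for concept 102002, and `DESCRIPTION_TYPES` is a name introduced here for the sketch.

```python
DESCRIPTION_TYPES = {
    '900000000000003001': 'Fully specified name',
    '900000000000013009': 'Synonym',
    '900000000000550004': 'Definition',
}

# (conceptId, typeId, term)
descriptions = [
    ('102002', '900000000000003001', 'Hemoglobin Okaloosa (substance)'),
    ('102002', '900000000000013009', 'Hemoglobin Okaloosa'),
    ('102002', '900000000000013009', 'Haemoglobin Okaloosa'),
]

# Collect the terms under a human-readable type label
by_type = {}
for concept_id, type_id, term in descriptions:
    by_type.setdefault(DESCRIPTION_TYPES[type_id], []).append(term)

for label, terms_for_label in by_type.items():
    print(label, '->', terms_for_label)
```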


Create a DataFrame which contains only the active SNOMED codes and their fully specified names

In [11]:
active_with_desc = pd.merge(active_terms, active_descs[active_descs['typeId'] == '900000000000003001'], left_on=['id'], right_on=['conceptId'], how='inner')
active_with_desc.describe()

Unnamed: 0,id_x,effectiveTime_x,active_x,moduleId_x,definitionStatusId,id_y,effectiveTime_y,active_y,moduleId_y,conceptId,languageCode,typeId,term,caseSignificanceId
count,369819,369819,369819,369819,369819,369819,369819,369819,369819,369819,369819,369819,369819,369819
unique,369816,65,1,5,2,369819,63,1,5,369816,1,1,369819,3
top,298641000000100,20020131,1,900000000000207008,900000000000074008,563208019,20170731,1,900000000000207008,298641000000100,en,900000000000003001,Skin of part of side of face (body structure),900000000000448009
freq,2,176398,369819,338954,273416,1,249335,369819,338954,2,369819,369819,1,268467


The merged frame has 369819 rows but only 369816 unique concept ids, so three concepts each have two active fully specified names.

In [12]:
# Inspect the duplicates
active_with_desc[active_with_desc.duplicated(['id_x'], keep='first')]

Unnamed: 0,id_x,effectiveTime_x,active_x,moduleId_x,definitionStatusId,id_y,effectiveTime_y,active_y,moduleId_y,conceptId,languageCode,typeId,term,caseSignificanceId
352303,22711000000107,20040131,1,999000011000000103,900000000000074008,47671000000114,20101001,1,999000011000000103,22711000000107,en,900000000000003001,GP82 - sent to Health Board (finding),900000000000017005
353719,298641000000100,20071001,1,999000011000000103,900000000000074008,527611000000119,20071001,1,999000011000000103,298641000000100,en,900000000000003001,Antigen specific effector T cell measurement (...,900000000000020002
354108,321411000000108,20080401,1,999000011000000103,900000000000074008,618321000000116,20080401,1,999000011000000103,321411000000108,en,900000000000003001,Foetus with cardiovascular abnormality (disorder),900000000000020002


In [13]:
# drop duplicates
active_with_desc = active_with_desc.drop_duplicates(['id_x'], keep='first')
assert len(active_with_desc) == len(active_terms)
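
`keep='first'` keeps whichever duplicate happens to come first in the merged frame. If the most recently published name is preferred, one option is to sort by the description's `effectiveTime` before dropping duplicates. The sketch below uses invented rows shaped like `active_with_desc`; whether "latest" is the right choice for a given release is an assumption to check.

```python
import pandas as pd

# Invented rows: concept 111 has two fully specified name rows
toy = pd.DataFrame(
    [
        ('111', '20040131', 'Old name (finding)'),
        ('111', '20101001', 'New name (finding)'),
        ('222', '20170731', 'Other concept (disorder)'),
    ],
    columns=['id_x', 'effectiveTime_y', 'term'],
)

# Which concepts have more than one fully specified name?
counts = toy.groupby('id_x').size()
print(counts[counts > 1])

# Keep the most recently published description for each concept
latest = (
    toy.sort_values('effectiveTime_y')
       .drop_duplicates('id_x', keep='last')
       .sort_values('id_x')
       .reset_index(drop=True)
)
print(latest)
```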

Extract the semantic tag of each concept, i.e. the hierarchy it belongs to, from the parentheses at the end of its fully specified name:
tui -> term unique identifier

In [14]:
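def find_tui(concept_name):
    # re.search, not re.match: the semantic tag sits at the end of the term
    match = re.search(r"\((\w+\s?.?\s?\w+.?\w+.?\w+.?)\)$", concept_name)
    return match.group(1) if match else None
active_with_desc['tui'] = active_with_desc['term'].str.extract(r"\((\w+\s?.?\s?\w+.?\w+.?\w+.?)\)$")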
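
The pattern above is fairly intricate. A simpler variant, offered only as a sketch and not checked against the full release, captures whatever sits inside the final pair of parentheses. `SEMANTIC_TAG` and `semantic_tag` are names made up for this example.

```python
import re

# Assumed simplification: capture the text inside the last "(...)" of the term
SEMANTIC_TAG = re.compile(r"\(([^()]+)\)$")

def semantic_tag(fsn):
    """Return the trailing parenthesised tag of a fully specified name, or None."""
    match = SEMANTIC_TAG.search(fsn)
    return match.group(1) if match else None

for fsn in ['Quilonia ethiopica (organism)',
            '0.088 (number)',
            'Structure of posterior carpal region (body structure)',
            'No tag here']:
    print(fsn, '->', semantic_tag(fsn))
```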

In [15]:
active_with_desc.describe()

Unnamed: 0,id_x,effectiveTime_x,active_x,moduleId_x,definitionStatusId,id_y,effectiveTime_y,active_y,moduleId_y,conceptId,languageCode,typeId,term,caseSignificanceId,tui
count,369816,369816,369816,369816,369816,369816,369816,369816,369816,369816,369816,369816,369816,369816,369816
unique,369816,65,1,5,2,369816,63,1,5,369816,1,1,369816,3,58
top,287525003,20020131,1,900000000000207008,900000000000074008,563208019,20170731,1,900000000000207008,287525003,en,900000000000003001,Skin of part of side of face (body structure),900000000000448009,disorder
freq,1,176398,369816,338954,273413,1,249335,369816,338954,1,369816,369816,1,268467,77093


In [16]:
active_with_desc[active_with_desc['tui'].isnull()].values

array([], shape=(0, 15), dtype=object)

In [17]:
# The unique semantic tags found
active_with_desc['tui'].unique()

array(['organism', 'substance', 'procedure', 'body structure', 'disorder',
       'occupation', 'finding', 'qualifier value',
       'morphologic abnormality', 'cell structure', 'physical object',
       'regime/therapy', 'product', 'medicinal product', 'cell', 'person',
       'ethnic group', 'environment', 'observable entity', 'event',
       'religion/philosophy', 'attribute', 'physical force', 'situation',
       'medicinal product form', 'navigational concept', 'clinical drug',
       'social concept', 'tumor staging', 'specimen', 'basic dose form',
       'life style', 'dose form', 'linkage concept', 'staging scale',
       'record artifact', 'assessment scale', 'SNOMED RT+CTV3',
       'geographic location', 'environment / location',
       'inactive concept', 'special concept', 'namespace concept',
       'racial group', 'link assertion', 'foundation metadata concept',
       'core metadata concept', 'disposition', 'unit of presentation',
       'OWL metadata concept', 'number'

Explore what each tui contains:

In [18]:
active_with_desc[active_with_desc['tui'] == 'number']

Unnamed: 0,id_x,effectiveTime_x,active_x,moduleId_x,definitionStatusId,id_y,effectiveTime_y,active_y,moduleId_y,conceptId,languageCode,typeId,term,caseSignificanceId,tui
323114,734048000,20170731,1,900000000000207008,900000000000074008,3482145016,20170731,1,900000000000207008,734048000,en,900000000000003001,0.088 (number),900000000000448009,number


### Create the input required for a MedCAT concept database

In [19]:
# 'tty' and 'tui_code' do not exist yet; reindex adds them as empty (NaN) columns
snomed_cdb_active_only = active_with_desc.reindex(columns=['id_x', 'term', 'tty', 'tui_code', 'tui'])
snomed_cdb_active_only.columns = ['cui', 'str', 'tty', 'tui', 'sty']
snomed_cdb_active_only['cui'] = snomed_cdb_active_only.cui.apply(lambda code: f'S-{code}')
snomed_cdb_active_only['onto'] = 'SNOMED-CT'


In [20]:
snomed_cdb_active_only # just for active concepts

Unnamed: 0,cui,str,tty,tui,sty,onto
0,S-101009,Quilonia ethiopica (organism),,,organism,SNOMED-CT
1,S-102002,Hemoglobin Okaloosa (substance),,,substance,SNOMED-CT
2,S-103007,Squirrel fibroma virus (organism),,,organism,SNOMED-CT
3,S-104001,Excision of lesion of patella (procedure),,,procedure,SNOMED-CT
4,S-106004,Structure of posterior carpal region (body str...,,,body structure,SNOMED-CT
5,S-107008,Structure of fetal part of placenta (body stru...,,,body structure,SNOMED-CT
6,S-108003,Entire condylar emissary vein (body structure),,,body structure,SNOMED-CT
7,S-109006,Anxiety disorder of childhood OR adolescence (...,,,disorder,SNOMED-CT
8,S-110001,Structure of visceral layer of Bowman's capsul...,,,body structure,SNOMED-CT
9,S-111002,Parathyroid structure (body structure),,,body structure,SNOMED-CT


#### Create a MedCAT concept database including all synonyms

In [21]:
_ = pd.merge(active_terms, active_descs, left_on=['id'], right_on=['conceptId'], how='inner')
active_with_primary_desc = _[_['typeId'] == '900000000000003001']
active_with_primary_desc = active_with_primary_desc.drop_duplicates(['id_x'], keep='first')
active_with_synonym_desc = _[_['typeId'] == '900000000000013009']
active_with_all_desc = pd.concat([active_with_primary_desc, active_with_synonym_desc])

In [22]:
# Check that there is exactly one fully specified name row per active concept
assert len(active_with_all_desc[active_with_all_desc['typeId'] == '900000000000003001']) == len(active_terms)

In [23]:
snomed_cdb_df = pd.merge(active_with_all_desc, active_with_desc, left_on=['id_x'], right_on=['conceptId'], how='inner')

In [24]:
# clean up the merge and rename the columns to fit the medcat Concept database criteria
snomed_cdb_df = snomed_cdb_df.loc[:, ['id_x_x','term_x','typeId_x','tui']]
snomed_cdb_df.columns = ['cui', 'str', 'tty', 'sty']
snomed_cdb_df['onto'] = 'SNOMED-CT'
snomed_cdb_df['tty'] = snomed_cdb_df['tty'].replace(['900000000000003001', '900000000000013009'], [1,0])
snomed_cdb_df['cui'] = 'S-' + snomed_cdb_df['cui'].astype(str)
snomed_cdb_df

Unnamed: 0,cui,str,tty,sty,onto
0,S-101009,Quilonia ethiopica (organism),1,organism,SNOMED-CT
1,S-101009,Quilonia ethiopica,0,organism,SNOMED-CT
2,S-102002,Hemoglobin Okaloosa (substance),1,substance,SNOMED-CT
3,S-102002,Hemoglobin Okaloosa,0,substance,SNOMED-CT
4,S-102002,"Hb 48(CD7), Leu-arg",0,substance,SNOMED-CT
5,S-102002,Haemoglobin Okaloosa,0,substance,SNOMED-CT
6,S-103007,Squirrel fibroma virus (organism),1,organism,SNOMED-CT
7,S-103007,Squirrel fibroma virus,0,organism,SNOMED-CT
8,S-104001,Excision of lesion of patella (procedure),1,procedure,SNOMED-CT
9,S-104001,Excision of lesion of patella,0,procedure,SNOMED-CT


There are 58 semantic tag categories in total in the SNOMED taxonomy:
- There is one root concept.
- There are 19 top-level terms (shown in bold in the table below).
- There are 38 sub-terms.

Each semantic tag is given a __term unique identifier (TUI)__, structured as follows:
T-{##}{#1}{#2}{#3}
- {T-}  -> Common to all codes
- {##}  -> Top-level term, numbered in alphabetical order
- {#1}  -> First level term group
- {#2}  -> Second level term group
- {#3}  -> Third level term group
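
Read this way, a code's parent can be found by zeroing its last non-zero group digit. The sketch below assumes exactly that reading of the scheme. `tui_parent` is a hypothetical helper and not part of MedCAT.

```python
def tui_parent(tui):
    """Return the parent code implied by the digit layout, or None for the root."""
    digits = list(tui[2:])           # e.g. 'T-07111' -> ['0', '7', '1', '1', '1']
    if digits == list('00000'):
        return None                  # the root has no parent
    for pos in (4, 3, 2):            # third, second, then first level group
        if digits[pos] != '0':
            digits[pos] = '0'
            return 'T-' + ''.join(digits)
    return 'T-00000'                 # a top-level term hangs off the root

# Walk from a third-level code up to the root
code = 'T-07111'
while code is not None:
    print(code)
    code = tui_parent(code)
```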


### Specifying top-level terms and semantic tags

|Top level (Y/N)|TUI|Semantic Tag|
|:---:|:---:|:---|
|__Root code__|__T-00000__|__SNOMED RT+CTV3__|
||||
|__Y__|__T-01000__|__Body structure (body structure)__|
|N|T-01100|morphologic abnormality|
|N|T-01200|cell structure|
|N|T-01210|cell|
||||
|__Y__|__T-02000__|__Clinical finding (finding)__|
|N|T-02100|disorder|
||||
|__Y__|__T-03000__|__Environment or geographical location (environment / location)__|
|N|T-03100|environment|
|N|T-03200|geographic location|
||||
|__Y__|__T-04000__|__Event (event)__|
||||
|__Y__|__T-05000__|__Observable entity (observable entity)__|
||||
|__Y__|__T-06000__|__Organism (organism)__|
||||
|__Y__|__T-07000__|__Pharmaceutical / biologic product (product)__|
|N|T-07100|medicinal product|
|N|T-07110|medicinal product form|
|N|T-07111|clinical drug|
||||
|__Y__|__T-08000__|__Physical force (physical force)__|
||||
|__Y__|__T-09000__|__Physical object (physical object)__|
||||
|__Y__|__T-10000__|__Procedure (procedure)__|
|N|T-10100|regime/therapy|
||||
|__Y__|__T-11000__|__Qualifier value (qualifier value)__|
|N|T-11100|administration method|
|N|T-11200|disposition|
|N|T-11300|intended site|
|N|T-11010|number|
|N|T-11400|release characteristic|
|N|T-11500|transformation|
|N|T-11020|basic dose form|
|N|T-11030|dose form|
|N|T-11600|role|
|N|T-11700|state of matter|
|N|T-11040|unit of presentation|
||||
|__Y__|__T-12000__|__Record artifact (record artifact)__|
||||
|__Y__|__T-13000__|__Situation with explicit context (situation)__|
||||
|__Y__|__T-14000__|__SNOMED CT Model Component (metadata)__|
|N|T-14100|core metadata concept|
|N|T-14200|foundation metadata concept|
|N|T-14300|linkage concept|
|N|T-14310|attribute|
|N|T-14320|link assertion|
|N|T-14400|namespace concept|
|N|T-14500|OWL metadata concept|
||||
|__Y__|__T-15000__|__Social context (social concept)__|
|N|T-15100|life style|
|N|T-15010|racial group|
|N|T-15020|ethnic group|
|N|T-15200|occupation|
|N|T-15300|person|
|N|T-15400|religion/philosophy|
||||
|__Y__|__T-16000__|__Special concept (special concept)__|
|N|T-16100|inactive concept|
|N|T-16200|navigational concept|
||||
|__Y__|__T-17000__|__Specimen (specimen)__|
||||
|__Y__|__T-18000__|__Staging and scales (staging scale)__|
|N|T-18100|assessment scale|
|N|T-18200|tumor staging|
||||
|__Y__|__T-19000__|__Substance (substance)__|
||||


In [25]:
# List of all Semantic Tags
terms_list = snomed_cdb_df['sty'].unique().tolist()
terms_list.sort()
print(terms_list)

['OWL metadata concept', 'SNOMED RT+CTV3', 'administration method', 'assessment scale', 'attribute', 'basic dose form', 'body structure', 'cell', 'cell structure', 'clinical drug', 'core metadata concept', 'disorder', 'disposition', 'dose form', 'environment', 'environment / location', 'ethnic group', 'event', 'finding', 'foundation metadata concept', 'geographic location', 'inactive concept', 'intended site', 'life style', 'link assertion', 'linkage concept', 'medicinal product', 'medicinal product form', 'metadata', 'morphologic abnormality', 'namespace concept', 'navigational concept', 'number', 'observable entity', 'occupation', 'organism', 'person', 'physical force', 'physical object', 'procedure', 'product', 'qualifier value', 'racial group', 'record artifact', 'regime/therapy', 'release characteristic', 'religion/philosophy', 'role', 'situation', 'social concept', 'special concept', 'specimen', 'staging scale', 'state of matter', 'substance', 'transformation', 'tumor staging', '

In [26]:
terms_dict = {
    "T-00000":"SNOMED RT+CTV3",
    "T-01000":"body structure",
    "T-01100":"morphologic abnormality",
    "T-01200":"cell structure",
    "T-01210":"cell",
    "T-02000":"finding",
    "T-02100":"disorder",
    "T-03000":"environment / location",
    "T-03100":"environment",
    "T-03200":"geographic location",
    "T-04000":"event",
    "T-05000":"observable entity",
    "T-06000":"organism",
    "T-07000":"product",
    "T-07100":"medicinal product",
    "T-07110":"medicinal product form",
    "T-07111":"clinical drug",
    "T-08000":"physical force",
    "T-09000":"physical object",
    "T-10000":"procedure",
    "T-10100":"regime/therapy",
    "T-11000":"qualifier value",
    "T-11100":"administration method",
    "T-11200":"disposition",
    "T-11300":"intended site",
    "T-11010":"number",
    "T-11400":"release characteristic",
    "T-11500":"transformation",
    "T-11020":"basic dose form",
    "T-11030":"dose form",
    "T-11600":"role",
    "T-11700":"state of matter",
    "T-11040":"unit of presentation",
    "T-12000":"record artifact",
    "T-13000":"situation",
    "T-14000":"metadata",
    "T-14100":"core metadata concept",
    "T-14200":"foundation metadata concept",
    "T-14300":"linkage concept",
    "T-14310":"attribute",
    "T-14320":"link assertion",
    "T-14400":"namespace concept",
    "T-14500":"OWL metadata concept",
    "T-15000":"social concept",
    "T-15100":"life style",
    "T-15010":"racial group",
    "T-15020":"ethnic group",
    "T-15200":"occupation",
    "T-15300":"person",
    "T-15400":"religion/philosophy",
    "T-16000":"special concept",
    "T-16100":"inactive concept",
    "T-16200":"navigational concept",
    "T-17000":"specimen",
    "T-18000":"staging scale",
    "T-18100":"assessment scale",
    "T-18200":"tumor staging",
    "T-19000":"substance",
}

In [27]:
# Test if the TUIs are correct for the version of snomed
assert len(terms_list) == len(terms_dict) # check if there is the same number of groups
for i in terms_list:
    assert i in terms_dict.values() # check if the terms are identical
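
When these assertions fail on a newer release, a bare `AssertionError` does not say which tag caused it. A set difference reports the offending tags directly. The sketch below uses invented tag lists, and `compare_tags` is a hypothetical helper.

```python
def compare_tags(found, expected):
    """Return (missing_from_mapping, unused_in_mapping) as sorted lists."""
    found, expected = set(found), set(expected)
    return sorted(found - expected), sorted(expected - found)

# Invented values for illustration
found_tags = ['disorder', 'finding', 'brand new tag']
expected_tags = ['disorder', 'finding', 'role']

missing, unused = compare_tags(found_tags, expected_tags)
print('In the data but not in the mapping:', missing)
print('In the mapping but not in the data:', unused)
```

With the notebook's own variables the call would be `compare_tags(terms_list, terms_dict.values())`.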

In [28]:
# Add tui codes
dict2 = {v : k for k, v in terms_dict.items()}
snomed_cdb_df["tui"] = snomed_cdb_df["sty"].map(dict2)
snomed_cdb_df[['cui', 'str','onto','tty','tui','sty']]

Unnamed: 0,cui,str,onto,tty,tui,sty
0,S-101009,Quilonia ethiopica (organism),SNOMED-CT,1,T-06000,organism
1,S-101009,Quilonia ethiopica,SNOMED-CT,0,T-06000,organism
2,S-102002,Hemoglobin Okaloosa (substance),SNOMED-CT,1,T-19000,substance
3,S-102002,Hemoglobin Okaloosa,SNOMED-CT,0,T-19000,substance
4,S-102002,"Hb 48(CD7), Leu-arg",SNOMED-CT,0,T-19000,substance
5,S-102002,Haemoglobin Okaloosa,SNOMED-CT,0,T-19000,substance
6,S-103007,Squirrel fibroma virus (organism),SNOMED-CT,1,T-06000,organism
7,S-103007,Squirrel fibroma virus,SNOMED-CT,0,T-06000,organism
8,S-104001,Excision of lesion of patella (procedure),SNOMED-CT,1,T-10000,procedure
9,S-104001,Excision of lesion of patella,SNOMED-CT,0,T-10000,procedure


#### Saving the DataFrame to CSV

In [None]:
# Write the clinical terms to csv
snomed_cdb_df.to_csv('snomed_cdb_csv_SNOMED-CT-UK_Release_20200401.csv')

In [29]:
# Test dataset for presence of COVID-19 concepts.
a = snomed_cdb_df[snomed_cdb_df['str'].str.contains("novel coronavirus")]
a

Unnamed: 0,cui,str,tty,sty,onto,tui
920844,S-1240381000000105,2019-nCoV (novel coronavirus),0,organism,SNOMED-CT,T-06000
920847,S-1240391000000107,Antigen of 2019-nCoV (novel coronavirus),0,substance,SNOMED-CT,T-19000
920850,S-1240401000000105,Antibody to 2019-nCoV (novel coronavirus),0,substance,SNOMED-CT,T-19000
920853,S-1240411000000107,Ribonucleic acid of 2019-nCoV (novel coronavirus),0,substance,SNOMED-CT,T-19000
920856,S-1240421000000101,Serotype 2019-nCoV (novel coronavirus),0,qualifier value,SNOMED-CT,T-11000
920859,S-1240431000000104,Exposure to 2019-nCoV (novel coronavirus) infe...,0,event,SNOMED-CT,T-04000
920862,S-1240441000000108,Close exposure to 2019-nCoV (novel coronavirus...,0,event,SNOMED-CT,T-04000
920865,S-1240451000000106,Telephone consultation for suspected 2019-nCoV...,0,procedure,SNOMED-CT,T-10000
920868,S-1240461000000109,Measurement of 2019-nCoV (novel coronavirus) a...,0,procedure,SNOMED-CT,T-10000
920871,S-1240471000000102,Measurement of 2019-nCoV (novel coronavirus) a...,0,procedure,SNOMED-CT,T-10000


# To add the UK drug extension

In [30]:
drug_extension = f'{snomed_dir}/uk_sct2dr_28.7.0_20200325000001/SnomedCT_UKDrugRF2_PRODUCTION_20200318T000001Z/Snapshot/Terminology'
drug_terms = parse_file(f'{drug_extension}/sct2_Concept_Snapshot_GB1000001_20200318.txt')
active_drug_terms = drug_terms[drug_terms.active == '1']
drug_descriptions = parse_file(f'{drug_extension}/sct2_Description_Snapshot-en_GB1000001_20200318.txt')
active_drug_desc = drug_descriptions[drug_descriptions.active == '1']
all_terms = pd.concat([active_terms, active_drug_terms])
all_descs = pd.concat([active_descs, active_drug_desc])

In [31]:
active_with_desc_drug_ext = pd.merge(all_terms, all_descs[all_descs['typeId'] == '900000000000003001'], left_on=['id'], right_on=['conceptId'], how='inner')
active_with_desc_drug_ext[active_with_desc_drug_ext.duplicated(['id_x'], keep='first')]
active_with_desc_drug_ext = active_with_desc_drug_ext.drop_duplicates(['id_x'], keep='first')
assert len(active_with_desc_drug_ext) == len(all_terms)
active_with_desc_drug_ext['tui'] = active_with_desc_drug_ext['term'].str.extract(r"\((\w+\s?.?\s?\w+.?\w+.?\w+.?)\)$")
# 'tty' and 'tui_code' do not exist yet; reindex adds them as empty (NaN) columns
snomed_cdb_active_only = active_with_desc_drug_ext.reindex(columns=['id_x', 'term', 'tty', 'tui_code', 'tui'])
snomed_cdb_active_only.columns = ['cui', 'str', 'tty', 'tui', 'sty']
snomed_cdb_active_only['cui'] = snomed_cdb_active_only.cui.apply(lambda code: f'S-{code}')
snomed_cdb_active_only['onto'] = 'SNOMED-CT'


In [32]:
_ = pd.merge(all_terms, all_descs, left_on=['id'], right_on=['conceptId'], how='inner')
active_with_primary_desc = _[_['typeId'] == '900000000000003001']
active_with_primary_desc = active_with_primary_desc.drop_duplicates(['id_x'], keep='first')
active_with_synonym_desc = _[_['typeId'] == '900000000000013009']
active_with_all_desc = pd.concat([active_with_primary_desc, active_with_synonym_desc])

In [33]:
snomed_cdb_df = pd.merge(active_with_all_desc, active_with_desc_drug_ext, left_on=['id_x'], right_on=['conceptId'], how='inner')

In [34]:
# clean up the merge and rename the columns to fit the medcat Concept database criteria
snomed_cdb_df = snomed_cdb_df.loc[:, ['id_x_x','term_x','typeId_x','tui']]
snomed_cdb_df.columns = ['cui', 'str', 'tty', 'sty']
snomed_cdb_df['onto'] = 'SNOMED-CT'
snomed_cdb_df['tty'] = snomed_cdb_df['tty'].replace(['900000000000003001', '900000000000013009'], [1,0])
snomed_cdb_df['cui'] = 'S-' + snomed_cdb_df['cui'].astype(str)
snomed_cdb_df

Unnamed: 0,cui,str,tty,sty,onto
0,S-101009,Quilonia ethiopica (organism),1,organism,SNOMED-CT
1,S-101009,Quilonia ethiopica,0,organism,SNOMED-CT
2,S-102002,Hemoglobin Okaloosa (substance),1,substance,SNOMED-CT
3,S-102002,Hemoglobin Okaloosa,0,substance,SNOMED-CT
4,S-102002,"Hb 48(CD7), Leu-arg",0,substance,SNOMED-CT
5,S-102002,Haemoglobin Okaloosa,0,substance,SNOMED-CT
6,S-103007,Squirrel fibroma virus (organism),1,organism,SNOMED-CT
7,S-103007,Squirrel fibroma virus,0,organism,SNOMED-CT
8,S-104001,Excision of lesion of patella (procedure),1,procedure,SNOMED-CT
9,S-104001,Excision of lesion of patella,0,procedure,SNOMED-CT


In [35]:
terms_list = snomed_cdb_df['sty'].unique().tolist()
# Test if the TUIs are correct for the version of snomed
assert len(terms_list) == len(terms_dict) # check if there is the same number of groups
for i in terms_list:
    assert i in terms_dict.values() # check if the terms are identical
# Add term codes
dict2 = {v : k for k, v in terms_dict.items()}
snomed_cdb_df["tui"] = snomed_cdb_df["sty"].map(dict2)
snomed_cdb_df[['cui', 'str','onto','tty','tui','sty']]

Unnamed: 0,cui,str,onto,tty,tui,sty
0,S-101009,Quilonia ethiopica (organism),SNOMED-CT,1,T-06000,organism
1,S-101009,Quilonia ethiopica,SNOMED-CT,0,T-06000,organism
2,S-102002,Hemoglobin Okaloosa (substance),SNOMED-CT,1,T-19000,substance
3,S-102002,Hemoglobin Okaloosa,SNOMED-CT,0,T-19000,substance
4,S-102002,"Hb 48(CD7), Leu-arg",SNOMED-CT,0,T-19000,substance
5,S-102002,Haemoglobin Okaloosa,SNOMED-CT,0,T-19000,substance
6,S-103007,Squirrel fibroma virus (organism),SNOMED-CT,1,T-06000,organism
7,S-103007,Squirrel fibroma virus,SNOMED-CT,0,T-06000,organism
8,S-104001,Excision of lesion of patella (procedure),SNOMED-CT,1,T-10000,procedure
9,S-104001,Excision of lesion of patella,SNOMED-CT,0,T-10000,procedure


In [None]:
# write the clinical terms plus drug extension to csv:
file_name = input("Enter file name:")
snomed_cdb_df.to_csv(file_name+'.csv') #snomed_cdb_csv_SNOMED-CT-full_UK_drug_ext_Release_20200228

In [36]:
# TUIs relevant for most projects
tuisd = ['T-02000', 'T-02100', 'T-07000', 'T-07100', 'T-07110', 'T-07111', 'T-10000', 'T-19000']
for tui in tuisd:
    print(terms_dict[tui], tui)

finding T-02000
disorder T-02100
product T-07000
medicinal product T-07100
medicinal product form T-07110
clinical drug T-07111
procedure T-10000
substance T-19000


In [37]:
# Check the DataFrame for COVID-19 terms
a = snomed_cdb_df[snomed_cdb_df['str'].str.contains("novel coronavirus")]
# DataFrame.append was removed in pandas 2.0; selecting the columns directly gives the same rows
b = a[["str", "cui"]]
# To display all:
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):
#    display(b)
print(f"There are: {len(b)} concepts which contain 'novel coronavirus' ")

There are: 38 concepts which contain 'novel coronavirus' 


In [46]:
# Functions for finding the concept name and all synonyms for a SNOMED concept

def find_name(snomedcode):
    """
    Returns the Fully Specified Name (tty == 1) for a prefixed SNOMED code, e.g. "S-50417007"
    """
    df = snomed_cdb_df[(snomed_cdb_df['cui'] == snomedcode) & (snomed_cdb_df['tty'] == 1)]
    concept_name = df['str'].values
    return ''.join(concept_name)

def find_syn(snomedcode):
    """
    Returns all synonyms (tty == 0) for a prefixed SNOMED code, joined by "; ".
    The Fully Specified Name is not included.
    """
    df = snomed_cdb_df[(snomed_cdb_df['cui'] == snomedcode) & (snomed_cdb_df['tty'] == 0)]
    synonym = df['str'].tolist()
    return '; '.join(synonym)

In [47]:
print(find_name("S-50417007"))
print(find_syn("S-50417007"))

Lower respiratory tract infection (disorder)
Lower respiratory tract infection; Lower respiratory infection; Chest cold; LRTI - Lower respiratory tract infection
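Both functions filter the whole DataFrame on every call, which becomes slow when they are applied to many codes. One possible alternative is to build lookup dictionaries once. The sketch below uses a toy stand-in for `snomed_cdb_df` (rows copied from the output shown earlier); `toy_df`, `cui2name` and `cui2syn` are illustrative names, not part of the notebook. Note that if a concept had more than one `tty == 1` row, `to_dict()` would keep only the last, whereas `find_name` joins them.

```python
import pandas as pd

# Toy stand-in for snomed_cdb_df
toy_df = pd.DataFrame({
    'cui': ['S-102002', 'S-102002', 'S-102002', 'S-104001'],
    'str': ['Hemoglobin Okaloosa (substance)', 'Hemoglobin Okaloosa',
            'Haemoglobin Okaloosa', 'Excision of lesion of patella (procedure)'],
    'tty': [1, 0, 0, 1],
})

# cui -> fully specified name (tty == 1)
cui2name = toy_df[toy_df['tty'] == 1].set_index('cui')['str'].to_dict()
# cui -> list of synonyms (tty == 0)
cui2syn = toy_df[toy_df['tty'] == 0].groupby('cui')['str'].apply(list).to_dict()

print(cui2name['S-102002'])
print('; '.join(cui2syn.get('S-102002', [])))
```

Concepts without synonyms are simply absent from `cui2syn`, hence the `.get(..., [])`.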


## Exploring SNOMED relationships

### Root and top-level Concepts
All concepts descend from the root concept 138875005 |SNOMED CT Concept (SNOMED RT+CTV3)|


#### Table 3: Top Level Concepts
These concepts are all direct children of the root concept: 138875005, (SNOMED CT Concept (SNOMED RT+CTV3))<br>They are all linked to it via the relationship typeId: 116680003, (is a)
<br>A full list of relationship types can be found among the descendants of: 106237007, (linkage concept)



|SCTID|Semantic Tag|
|:---:|:---:|
|123037004 |Body structure|
|404684003 |Clinical finding|
|272379006 |Event|
|308916002 |Environment or geographical location|
|363787002 |Observable entity|
|410607006 |Organism|
|373873005 |Pharmaceutical / biologic product|
|78621006 |Physical force|
|260787004 |Physical object|
|71388002 |Procedure|
|362981000 |Qualifier value|
|419891008 |Record artifact|
|243796009 |Situation with explicit context|
|900000000000441003 |SNOMED CT Model Component (metadata)|
|48176007 |Social context|
|370115009 |Special concept|
|123038009 |Specimen|
|254291000 |Staging and scales|
|105590001 |Substance|


Taken from the Technical Implementation Guide (4.1), Table 4.1-3: https://confluence.ihtsdotools.org/display/DOCTIG

## Creating the relationship dictionaries

Parent-to-children structure:
pt2ch = {'\<cui_for_pt\>': \[\<list of cuis for children\>\], …}
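As a toy illustration of this structure, the sketch below builds a parent-to-children dictionary (and its inverse) from a handful of made-up 'Is a' pairs. The codes `S-A` to `S-D` and the names `isa_pairs`, `pt2ch_toy` and `ch2pt_toy` are placeholders for illustration only, not real SNOMED CT concepts.

```python
from collections import defaultdict

# Hypothetical (child, parent) pairs standing in for (sourceId, destinationId)
isa_pairs = [('S-B', 'S-A'), ('S-C', 'S-A'), ('S-D', 'S-B')]

pt2ch_toy = defaultdict(list)  # parent -> list of children
ch2pt_toy = defaultdict(list)  # child  -> list of parents
for child, parent in isa_pairs:
    pt2ch_toy[parent].append(child)
    ch2pt_toy[child].append(parent)

print(dict(pt2ch_toy))
print(dict(ch2pt_toy))
```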

In [48]:
# Merge relationship files
drug_ext_relat = parse_file(f'{drug_extension}/sct2_Relationship_Snapshot_GB1000001_20200318.txt')
active_drug_ext_relat = drug_ext_relat[drug_ext_relat.active == '1']
all_relat = pd.concat([active_relat, active_drug_ext_relat])
all_relat[['sourceId','destinationId','typeId']] = 'S-' + all_relat[['sourceId','destinationId','typeId']].astype(str)

In [49]:
all_relat.head()

Unnamed: 0,id,effectiveTime,active,moduleId,sourceId,destinationId,relationshipGroup,typeId,characteristicTypeId,modifierId
1,101021,20020131,1,900000000000207008,S-10000006,S-29857009,0,S-116680003,900000000000011006,900000000000451002
2,102025,20020131,1,900000000000207008,S-10000006,S-9972008,0,S-116680003,900000000000011006,900000000000451002
13,114022,20020131,1,900000000000207008,S-134035007,S-84371003,0,S-116680003,900000000000011006,900000000000451002
26,127021,20020131,1,900000000000207008,S-134136005,S-57250008,0,S-116680003,900000000000011006,900000000000451002
29,130025,20020131,1,900000000000207008,S-10002003,S-116175006,0,S-116680003,900000000000011006,900000000000451002


In [None]:
# write the relationship terms plus drug extension relationships to csv:
file_name = input("Enter file name:")
all_relat.to_csv(file_name+'.csv') #snomed_rela_csv_SNOMED-CT-full_UK_drug_ext_Release_20200228

In [50]:
# Find all types of relationships
rel = all_relat['typeId'].unique()
for type_id in rel:
    print(find_name(type_id), type_id)

Is a (attribute) S-116680003
Finding site (attribute) S-363698007
Part of (attribute) S-123005000
Has intent (attribute) S-363703001
Method (attribute) S-260686004
Interprets (attribute) S-363714003
Causative agent (attribute) S-246075003
Procedure site (attribute) S-363704007
Associated morphology (attribute) S-116676008
Laterality (attribute) S-272741003
Occurrence (attribute) S-246454002
Direct device (attribute) S-363699004
Direct morphology (attribute) S-363700003
Access (attribute) S-260507000
Revision status (attribute) S-246513007
Priority (attribute) S-260870009
Direct substance (attribute) S-363701004
Has focus (attribute) S-363702006
Associated finding (attribute) S-246090004
Component (attribute) S-246093002
Has interpretation (attribute) S-363713009
Has specimen (attribute) S-116686009
Indirect morphology (attribute) S-363709002
Recipient category (attribute) S-370131001
Pathological process (attribute) S-370135005
Has active ingredient (attribute) S-127489000
Specimen sou

### Parents and Children
The subtype relationship 116680003 |Is a (attribute)| relates a concept to its immediate supertype concepts.

In [None]:
# Parent to Children dictionary
# Every destinationId gets a key; only 'Is a' rows contribute children
pt2ch = {key: [] for key in all_relat["destinationId"].unique()}
isa_rows = all_relat[all_relat['typeId'] == "S-116680003"]
for parent, child in zip(isa_rows['destinationId'], isa_rows['sourceId']):
    pt2ch[parent].append(child)

In [None]:
# Children to Parent dictionary
# Every sourceId gets a key; only 'Is a' rows contribute parents
ch2pt = {key: [] for key in all_relat["sourceId"].unique()}
isa_rows = all_relat[all_relat['typeId'] == "S-116680003"]
for child, parent in zip(isa_rows['sourceId'], isa_rows['destinationId']):
    ch2pt[child].append(parent)
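Because the dictionaries store only immediate parents and children, collecting every ancestor of a concept needs a traversal. The following is a minimal sketch, assuming a dictionary shaped like `ch2pt`; the function `all_ancestors` and the toy codes are hypothetical and not part of the notebook.

```python
def all_ancestors(cui, ch2pt):
    """Return every ancestor of `cui` by following 'Is a' links upwards."""
    seen = set()
    stack = list(ch2pt.get(cui, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            # queue the parent's own parents
            stack.extend(ch2pt.get(parent, []))
    return seen

# Toy hierarchy: S-D has two parents (S-B, S-C), which share the parent S-A
toy_ch2pt = {'S-D': ['S-B', 'S-C'], 'S-B': ['S-A'], 'S-C': ['S-A'], 'S-A': []}
print(sorted(all_ancestors('S-D', toy_ch2pt)))
```

The `seen` set keeps shared ancestors from being visited twice, which matters because SNOMED CT concepts can have several parents.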


In [None]:
# Write the 'isa' relationships to file
with open('isa_rela_pt2ch.txt', 'w') as outfile:
    json.dump(pt2ch, outfile)
with open('isa_rela_ch2pt.txt', 'w') as outfile:
    json.dump(ch2pt, outfile)

In [None]:
# Load the 'isa' relationships back into dictionaries
with open('isa_rela_pt2ch.txt') as json_file:
    pt2ch = json.load(json_file)
with open('isa_rela_ch2pt.txt') as json_file:
    ch2pt = json.load(json_file)

In [51]:
# Check if the top-level concepts are the same in the SNOMED UK Extension
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
top_level_concepts = all_relat[all_relat['destinationId']=='S-138875005'].copy()
top_level_concepts['conceptname'] = top_level_concepts['sourceId'].apply(find_name)
top_level_concepts[['sourceId', 'conceptname']].reset_index()


Unnamed: 0,index,sourceId,conceptname
0,143166,S-254291000,Staging and scales (staging scale)
1,143194,S-260787004,Physical object (physical object)
2,143274,S-272379006,Event (event)
3,143613,S-308916002,Environment or geographical location (environm...
4,143811,S-123037004,Body structure (body structure)
5,143812,S-123038009,Specimen (specimen)
6,144417,S-48176007,Social context (social concept)
7,144501,S-71388002,Procedure (procedure)
8,144577,S-78621006,Physical force (physical force)
9,144660,S-362981000,Qualifier value (qualifier value)


### Siblings

In [52]:
isa_rel = all_relat[all_relat['typeId'] == 'S-116680003']

In [None]:
# Find siblings
# cui to siblings dictionary
cui2sib_dic = dict([(key, set()) for key in isa_rel['sourceId'].unique()])

In [None]:
isa_rel.head()

In [None]:
# Find siblings function
def cui2sib(snomed):
    """Return all concepts that share a parent with `snomed` (the set includes `snomed` itself)."""
    x = set()
    for a in ch2pt[snomed]:
        for b in pt2ch[a]:
            x.add(b)
    return x

In [None]:
%%timeit
for key in cui2sib_dic:
    cui2sib_dic[key].update(cui2sib(key))

In [None]:
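%%time
from tqdm import tqdm  # progress bar, in case it was not imported earlier

# Assumption: unique_snomed is the set of child concepts in the 'Is a' relationships
unique_snomed = isa_rel['sourceId'].unique()
cui2sib_dic = dict()
for key in tqdm(unique_snomed, total=len(unique_snomed)):
    cui2sib_dic[key] = cui2sib(key)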

In [None]:
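with open('isa_rela_cui2sib.txt', 'w') as outfile:
    # Dump the dictionary (not the cui2sib function); sets are not JSON serialisable, so convert to lists
    json.dump({cui: sorted(sibs) for cui, sibs in cui2sib_dic.items()}, outfile)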

In [None]:
cui2sib('S-404684003')
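Since a concept is always among the children of its own parents, the set returned above contains the queried concept as well. If that is not wanted, a variant could discard it. This is a sketch on toy dictionaries; `siblings_excluding_self` and the codes are hypothetical names for illustration.

```python
def siblings_excluding_self(cui, ch2pt, pt2ch):
    """Return the concepts sharing at least one parent with `cui`, without `cui` itself."""
    sibs = set()
    for parent in ch2pt.get(cui, []):
        sibs.update(pt2ch.get(parent, []))
    sibs.discard(cui)
    return sibs

# Toy hierarchy: S-B and S-C are both children of S-A
toy_ch2pt = {'S-B': ['S-A'], 'S-C': ['S-A']}
toy_pt2ch = {'S-A': ['S-B', 'S-C']}
print(siblings_excluding_self('S-B', toy_ch2pt, toy_pt2ch))
```

Using `.get(..., [])` also means a concept with no recorded parents, such as the root, yields an empty set rather than a `KeyError`.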

# ICD-10 / OPCS-4 linkages with SNOMED-CT


In [56]:
refset_terminology = f'{base_term}/SnomedCT_UKClinicalRF2_PRODUCTION_20200401T000001Z/Snapshot/Refset/Map'

In [57]:
mappings = parse_file(f'{refset_terminology}/der2_iisssciRefset_ExtendedMapSnapshot_GB1000000_20200401.txt')
mappings = mappings[mappings.active == '1']
mappings.referencedComponentId = mappings.referencedComponentId.apply(lambda s: f'S-{s}')

In [58]:
mappings.mapPriority = mappings.mapPriority.astype(int)

In [59]:
icd10_refset_id = '999002271000000101'
opcs4_refset_id = '999002741000000101'

In [61]:
%%time
cui2mappings = dict()
for cui in snomed_cdb_df.cui.unique():
    cui_map = mappings[mappings.referencedComponentId == cui].loc[:, ['mapPriority', 'mapAdvice', 'mapTarget', 'refsetId']]
    if cui_map.shape[0] > 0:
        cui2mappings[cui] = cui_map.sort_values('mapPriority')

Wall time: 9h 22min 25s
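Most of that run time comes from filtering the full `mappings` DataFrame once per concept. A single `groupby` pass is one way to build the same kind of dictionary much faster. The sketch below runs on a toy stand-in; `toy_mappings`, `wanted_cuis` and all values in it are placeholders, not data from the release.

```python
import pandas as pd

# Toy stand-in for the `mappings` DataFrame
toy_mappings = pd.DataFrame({
    'referencedComponentId': ['S-1', 'S-1', 'S-2'],
    'mapPriority': [2, 1, 1],
    'mapAdvice': ['advice b', 'advice a', 'advice c'],
    'mapTarget': ['B00', 'A00', 'C00'],
    'refsetId': ['r1', 'r1', 'r2'],
})
wanted_cuis = {'S-1', 'S-2', 'S-3'}  # stands in for snomed_cdb_df.cui.unique()

cols = ['mapPriority', 'mapAdvice', 'mapTarget', 'refsetId']
# One pass over the data: each group holds all the rows for one concept
toy_cui2mappings = {
    cui: group[cols].sort_values('mapPriority')
    for cui, group in toy_mappings.groupby('referencedComponentId')
    if cui in wanted_cuis
}
print(toy_cui2mappings['S-1'])
```

Concepts without any mapping rows (here `S-3`) never form a group, so they are left out just as in the loop above.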


In [62]:
opcs_mappings = {}
icd10_mappings = {}
# cui_map (not `mappings`) as the loop variable, so the `mappings` DataFrame is not overwritten
for cui, cui_map in cui2mappings.items():
    icd10_codes = cui_map[cui_map.refsetId == icd10_refset_id]
    if icd10_codes.shape[0] > 0:
        icd10_mappings[cui] = icd10_codes
    opcs_codes = cui_map[cui_map.refsetId == opcs4_refset_id]
    if opcs_codes.shape[0] > 0:
        opcs_mappings[cui] = opcs_codes

In [64]:
import pickle
with open('opcs_mappings_full.pickle', 'wb') as f:
    pickle.dump(opcs_mappings, f)
with open('icd10_mappings_full.pickle', 'wb') as f:
    pickle.dump(icd10_mappings, f)

In [None]:
# condense mappings to a simple dict representation

In [65]:
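def condense_mapping(cui2mappings):
    """Reduce each concept's mapping table to a list of dotted codes, e.g. 'J181' -> 'J18.1'."""
    mapping_condensed = {}
    for cui, cui_map in cui2mappings.items():
        # raw string so that \w and \d reach the regex engine without invalid-escape warnings
        mapping_condensed[cui] = cui_map.mapTarget.replace(r'(\w\d\d)(\d*)', r'\1.\2', regex=True).tolist()
    return mapping_condensed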
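To see what the substitution does to individual codes, the same pattern can be tried with the standard `re` module on a few made-up targets. The sample codes below are illustrative and not taken from the map file. Because `\d*` also matches an empty string, a bare three-character code gains a trailing dot; the `(\d+)` variant shown for comparison is only a suggestion, not the notebook's behaviour.

```python
import re

pattern = r'(\w\d\d)(\d*)'          # the pattern used in condense_mapping
stricter = r'(\w\d\d)(\d+)'         # hypothetical variant: requires at least one trailing digit

for code in ['J181', 'X403', 'A12']:
    dotted = re.sub(pattern, r'\1.\2', code)
    dotted_strict = re.sub(stricter, r'\1.\2', code)
    print(code, '->', dotted, '|', dotted_strict)
```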

In [66]:
icd10_mapping_condensed = condense_mapping(icd10_mappings)

In [67]:
opcs_mapping_condensed = condense_mapping(opcs_mappings)

In [68]:
with open('icd10_mapping_condensed.pickle', 'wb') as f:
    pickle.dump(icd10_mapping_condensed, f)
with open('opcs_mapping_condensed.pickle', 'wb') as f:
    pickle.dump(opcs_mapping_condensed, f)