
### First we need to install MedCAT

In [None]:
! pip install --upgrade medcat
# Get the scispacy model
! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_md-0.3.0.tar.gz

Requirement already up-to-date: medcat in /usr/local/lib/python3.7/dist-packages (1.0.30)
Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_md-0.3.0.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_md-0.3.0.tar.gz
Building wheels for collected packages: en-core-sci-md
  Building wheel for en-core-sci-md (setup.py) ... done
  Created wheel for en-core-sci-md: filename=en_core_sci_md-0.3.0-cp37-none-any.whl size=79931147 sha256=ca242970f090161477a46aa480004cf8e22a2f10897ef05a19eca15a54d1cf88
  Stored in directory: /root/.cache/pip/wheels/7e/1b/90/364b1e3c8f8c21241876892748fd737a6b3698f2632a9429ac
Successfully built en-core-sci-md


**If you are running on Colab, restart the runtime now; this is sometimes necessary after installing models.**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.config import Config
from medcat.cdb_maker import CDBMaker
from medcat.cat import CAT

In [None]:
!mkdir -p data
DATA_DIR = "./data/"

In [None]:
!wget https://raw.githubusercontent.com/CogStack/MedCAT/develop/tutorial/data/cdb_simple.csv -P ./data/
!wget https://raw.githubusercontent.com/CogStack/MedCAT/develop/tutorial/data/cdb_advanced.csv -P ./data/
!wget https://raw.githubusercontent.com/CogStack/MedCAT/master/tutorial/data/vocab_data.txt -P ./data/

--2021-06-08 15:21:12--  https://raw.githubusercontent.com/CogStack/MedCAT/develop/tutorial/data/cdb_simple.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50 [text/plain]
Saving to: ‘./data/cdb_simple.csv’


2021-06-08 15:21:12 (2.55 MB/s) - ‘./data/cdb_simple.csv’ saved [50/50]

--2021-06-08 15:21:12--  https://raw.githubusercontent.com/CogStack/MedCAT/develop/tutorial/data/cdb_advanced.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 150 [text/plain]
Saving to: ‘./data/cdb_advanced.csv’


2021-06-08 15:21:12 (6.17 MB/s) 

# Building a Vocabulary

The first of the two models required to run MedCAT is a Vocabulary model (Vocab). It is used for two things: (1) spell checking and (2) word embeddings.

The Vocab is very simple and you can easily build it from a file that is structured as below:
```
<token>\t<word_count>\t<vector_embedding_separated_by_spaces>
```
`token` - Usually a word, or a subword if you are using Byte Pair Encoding or something similar.

`word_count` - The count for this word in your dataset or in any large dataset (Wikipedia also works nicely).

`vector_embedding_separated_by_spaces` - A precalculated vector embedding; it can come from Word2Vec or BERT, for example.

---
An example with a 3-dimensional embedding would be:
```
house	34444	 0.3232 0.123213 1.231231
dog	14444	0.76762 0.76767 1.45454
.
.
.
```
The file is essentially a TSV, but it must not have a header row.
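As a rough illustration of the format above, the following sketch writes such a file and parses it back using only the standard library. The helper names `write_vocab_file` and `read_vocab_file` and the file name `my_vocab_data.txt` are hypothetical and are not part of MedCAT; the sketch only demonstrates the tab-separated layout with no header row.

```python
import os
import tempfile


def write_vocab_file(path, entries):
    """Write (token, count, vector) triples as tab-separated lines, without a header."""
    with open(path, "w", encoding="utf-8") as f:
        for token, count, vector in entries:
            vec_str = " ".join(str(x) for x in vector)
            f.write(f"{token}\t{count}\t{vec_str}\n")


def read_vocab_file(path):
    """Parse the file back into a dict of {token: (count, vector)}."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            token, count, vec_str = line.split("\t")
            vocab[token] = (int(count), [float(x) for x in vec_str.split()])
    return vocab


# The two words from the example above (hypothetical output file name)
entries = [
    ("house", 34444, [0.3232, 0.123213, 1.231231]),
    ("dog", 14444, [0.76762, 0.76767, 1.45454]),
]
path = os.path.join(tempfile.mkdtemp(), "my_vocab_data.txt")
write_vocab_file(path, entries)
parsed = read_vocab_file(path)
print(parsed["house"])
```

A file produced this way should have the same shape as the `vocab_data.txt` file used in the next cell.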

---

**NOTE**: If spelling is important for your use-case, take care that there are no misspelt words in the Vocab.

In [None]:
# Let's have a look at an example. I've created a small vocabulary file with only 2 words (the ones from above).
# Let's try to create a vocabulary from these two words.

vocab = Vocab()
vocab.add_words(DATA_DIR +'vocab_data.txt', replace=True)

**And that is everything; with this we have built our vocab and no further training is necessary.**

---

A couple of useful functions of the vocab are presented below

In [None]:
# To see the words in the vocab
vocab.vocab.keys()

dict_keys(['house', 'dog'])

In [None]:
# If you want to add words manually (one by one) use:
vocab.add_word("test", cnt=31, vec=[1.42, 1.44, 1.55], replace=True)
vocab.vocab.keys()

dict_keys(['house', 'dog', 'test'])

In [None]:
# To get the vector of a word use:
vocab.vec("house")

array([0.3232  , 0.123213, 1.231231])

In [None]:
# Or to get the count
vocab['house']

34444

In [None]:
# To check if a word is in the vocab:
"house" in vocab

True

### Before we save the vocab model, we need to create the unigram table for negative sampling

In [None]:
# This is necessary after each change of the vocabulary (when we add new words)
vocab.make_unigram_table()
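To give some intuition for what a unigram table is, here is a minimal sketch of the classic word2vec-style construction: each word fills a share of a lookup table proportional to its count raised to the power 0.75, and negative samples are drawn uniformly from that table. The function `build_unigram_table`, the table size, and the exponent are assumptions made for illustration; MedCAT's internal implementation may differ in its details.

```python
import random


def build_unigram_table(counts, table_size=1000, power=0.75):
    """Return a list in which each token appears in proportion to count ** power."""
    weights = {token: count ** power for token, count in counts.items()}
    total = sum(weights.values())
    table = []
    for token, weight in weights.items():
        # Number of slots this token occupies in the table
        table.extend([token] * round(table_size * weight / total))
    return table


# Counts of the three words currently in our vocab
counts = {"house": 34444, "dog": 14444, "test": 31}
table = build_unigram_table(counts)

# Drawing negative samples is then a uniform choice from the table
rng = random.Random(0)
negatives = [rng.choice(table) for _ in range(5)]
print(negatives)
```

Raising the counts to a power below 1 gives rare words a somewhat larger share of the table than their raw frequency would, which is the usual motivation for this scheme.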

### Save the model

In [None]:
vocab.save(DATA_DIR + "vocab.dat")

### Load the model

In [None]:
vocab = Vocab.load(DATA_DIR + "vocab.dat")

# Building the Concept Database (CDB)

The second model we are going to need when using MedCAT is the Concept Database (CDB). This database holds a list of all concepts that we would like to detect and link to. For a lot of medical use-cases we would use giant databases like the UMLS or SNOMED CT. However, MedCAT can be used with any database no matter how big/small it is. 

To prepare the CDB we start off with a CSV with the following structure:
```
cui,name
1,kidney failure
7,CoVid 2
7,coronavirus
```
This is the most basic version of the CSV file; it has only two columns:

`cui` - The concept unique identifier, this is simply an `ID` in your database.

`name` - String/Name of that concept. It is important to write all possible names and abbreviations for a concept of interest.

If you have a concept that can be recognised through multiple different names (like the one above with cui=7), you can simply add multiple rows with the same concept ID and MedCAT will merge that during the build phase.
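The effect of repeating a concept ID across rows can be sketched with the standard library alone. This is only an illustration of the grouping idea, not MedCAT's build code; the variable `cui_to_names` is a hypothetical name.

```python
import csv
import io
from collections import defaultdict

# The basic CSV from above, held in a string for the sake of the example
csv_text = """cui,name
1,kidney failure
7,CoVid 2
7,coronavirus
"""

# Group all names that share the same concept ID
cui_to_names = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    cui_to_names[row["cui"]].append(row["name"])

print(dict(cui_to_names))
```

Both rows with `cui=7` end up under a single concept, which is conceptually what happens to them during the build phase.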

## The Full CSV Specification
```
cui,name,ontologies,name_status,type_ids,description
1,Kidney Failure,SNOMED,P,T047,kidneys stop working
.
.
.
```
The rest of the fields are optional; each can be included or left out of your CSV:

`ontologies` - Source ontology, e.g. HPO, SNOMED, HPC,...

`name_status` - Term type, e.g. P - Primary Name. Primary names are important, and I would always recommend adding this field when creating your CDB. It helps distinguish the primary name from synonyms.

`type_ids` - Semantic type identifier - e.g. T047 (taken from UMLS). A list of all semantic types can be found [here](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt).


`description` - Description of this concept

***Note***: If one concept has multiple names, you can also put them in a single row, separated by the pipe symbol (`|`).
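The following sketch writes a small CSV with the full set of columns and reads it back, splitting pipe-separated names. It uses only the standard `csv` module; the rows, the `KF` abbreviation, and the variable names are made up for illustration and are not taken from the tutorial data files.

```python
import csv
import io

FIELDS = ["cui", "name", "ontologies", "name_status", "type_ids", "description"]

rows = [
    # One row carrying two names for the same concept, separated by a pipe
    {"cui": "1", "name": "Kidney Failure|KF", "ontologies": "SNOMED",
     "name_status": "P", "type_ids": "T047", "description": "kidneys stop working"},
    # Optional fields left empty
    {"cui": "7", "name": "coronavirus"},
]

# Write the rows to an in-memory CSV; missing optional fields become empty strings
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back and split the pipe-separated names
parsed = list(csv.DictReader(io.StringIO(csv_text)))
names_for_cui_1 = parsed[0]["name"].split("|")
print(names_for_cui_1)
```

Writing to a real file instead of a `StringIO` buffer would give a CSV with the same columns as the full specification shown above.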

In [None]:
cdb_simple = pd.read_csv(DATA_DIR + 'cdb_simple.csv')


In [None]:
cdb_simple

Unnamed: 0,cui,name
0,1,kidney failure
1,7,CoVid 2
2,7,coronavirus


Let's try building our own concept database from a simple CSV.

In [None]:
# First initialise the default configuration
config = Config()
config.general['spacy_model'] = 'en_core_sci_md'
maker = CDBMaker(config)

In [None]:
# Create a list of the CSV files that will be used to build our CDB
csv_path = [DATA_DIR + 'cdb_advanced.csv', DATA_DIR + 'cdb_simple.csv']

# Create your CDB
cdb = maker.prepare_csvs(csv_path, full_build=True)

Started importing concepts from: ./data/cdb_advanced.csv
Current progress: 0% at 0.000s per 0 rows
Current progress: 50% at 0.021s per 0 rows
Started importing concepts from: ./data/cdb_simple.csv
Current progress: 0% at 0.000s per 0 rows
Current progress: 33% at 0.007s per 0 rows
Current progress: 67% at 0.007s per 0 rows


**That is all; nothing else is necessary to build the CDB.**

---

Some useful functions of the cdb are below

In [None]:
# To display all names and cui in the db
print(cdb.name2cuis)

{'kidney~failure': ['1'], 'failure~of~kidneys': ['1'], 'failure~of~kidney': ['1'], 'kf': ['1'], 'k~.~failure': ['1'], 'covid~2': ['7'], 'coronavirus': ['7']}


In [None]:
# To display all unique cuis and corresponding names in the db 
print(cdb.cui2names)

{'1': {'k~.~failure', 'failure~of~kidneys', 'kidney~failure', 'kf', 'failure~of~kidney'}, '7': {'covid~2', 'coronavirus'}}
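The names printed above are lowercased and have their tokens joined with `~`. The sketch below approximates that form with a simple regular-expression tokeniser and shows how the two mappings relate to each other. This is an assumption-laden approximation: the real build tokenises names with the configured spaCy model and may apply further normalisation, and `approximate_normalise` is a hypothetical helper, not a MedCAT function.

```python
import re
from collections import defaultdict


def approximate_normalise(name, separator="~"):
    """Lowercase a name and join its word and punctuation tokens with the separator."""
    tokens = re.findall(r"\w+|[^\w\s]", name.lower())
    return separator.join(tokens)


# A few raw names per concept ID (illustrative input)
raw_names = {
    "1": ["Kidney Failure", "K. Failure"],
    "7": ["CoVid 2", "coronavirus"],
}

# Build a name -> cuis mapping and its inverse, cui -> names
name2cuis = defaultdict(list)
cui2names = defaultdict(set)
for cui, names in raw_names.items():
    for name in names:
        normalised = approximate_normalise(name)
        name2cuis[normalised].append(cui)
        cui2names[cui].add(normalised)

print(dict(name2cuis))
print(dict(cui2names))
```

Keeping both directions makes it cheap to look up the candidate concepts for a detected name and, conversely, all known names for a concept.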


In [None]:
# To display cui to preferred name
print(cdb.cui2preferred_name)


{'1': 'Kidney Failure'}


In [None]:
# We have a link from cui to type ids
print(cdb.cui2type_ids)


{'1': {'T047'}, '7': set()}


### Save the model

In [None]:
cdb.save(DATA_DIR + "cdb.dat")

### Load the model

In [None]:
cdb = CDB.load(DATA_DIR + "cdb.dat")

# End

This is everything you need to create your own MedCAT models. In the next notebook you will see how to train and use these models to annotate documents. 