# Dataset Download

Download the archive from the link below, unzip it, and place the resulting folder in the parent directory (`../`) under the name `datasets`.
https://download-directory.github.io/?url=https%3A%2F%2Fgithub.com%2Fncbi%2FBioConceptVec%2Ftree%2Fmaster%2Fdatasets

The archive contains four directories:

- `drug_drug_interactions`
- `drug_gene_interactions`
- `gene_gene_interactions`
- `protein_protein_interactions`
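
A quick sanity check after unzipping can catch a misplaced folder early. The sketch below assumes the archive was extracted to `../datasets`; the helper name `missing_dataset_dirs` is hypothetical and not part of BioConceptVec.

```python
from pathlib import Path

# Directory names listed above.
EXPECTED_DIRS = [
    "drug_drug_interactions",
    "drug_gene_interactions",
    "gene_gene_interactions",
    "protein_protein_interactions",
]

def missing_dataset_dirs(root="../datasets"):
    """Return the expected sub-directories that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]

missing = missing_dataset_dirs()
print("Missing directories:", missing if missing else "none")
```

An empty result suggests the folder is where the later cells expect it.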

Download embeddings:
1. **BioConceptVec cbow:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_word2vec_cbow.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_cbow.json) (798MB).
2. **BioConceptVec skip-gram:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_word2vec_skipgram.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_skip.json) (812MB).
3. **BioConceptVec glove:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_glove.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_glove.json) (835MB).
4. **BioConceptVec fastText:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_fasttext.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_fast.json) (813MB).

In [1]:
# Create the embeddings directory in the parent directory (-p avoids an error if it already exists)
!mkdir -p ../embeddings

Download the embeddings and place them in the `embeddings` directory.
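
The `wget` cells below fetch one file into the working directory and then move it. As an alternative, a minimal Python sketch that downloads straight into `../embeddings` and skips files that are already present might look like this. The helper name `fetch` and the `.part` suffix are hypothetical choices; only the URL in the commented example comes from the list above.

```python
import urllib.request
from pathlib import Path

def fetch(url, dest_dir="../embeddings"):
    """Download `url` into `dest_dir` unless the file is already there."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Write to a temporary name first so an interrupted download
        # is not mistaken for a complete file on the next run.
        partial = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, partial)
        partial.replace(target)
    return target

# Example (commented out because the file is large):
# fetch("https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_glove.json")
```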

In [2]:
# Download the BioConceptVec GloVe embedding
!wget https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_glove.bin

--2023-05-28 18:45:37--  https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_glove.bin
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 165.112.9.228
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2599454842 (2.4G) [application/octet-stream]
Saving to: ‘bioconceptvec_glove.bin’



In [None]:
!mv bioconceptvec_glove.bin ../embeddings

Download the concept-only file.

In [3]:
# Download the concept-only BioConceptVec GloVe embeddings
!wget https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_glove.json

--2023-05-28 18:49:17--  https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_glove.json
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.228, 130.14.250.13
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 875517426 (835M) [application/json]
Saving to: ‘concept_glove.json’


2023-05-28 18:51:14 (7.14 MB/s) - ‘concept_glove.json’ saved [875517426/875517426]



In [4]:
!mv concept_glove.json ../embeddings

What does `concept-only` mean?
- https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_glove.json
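
One way to explore that question is to open the file and look at its keys and vector lengths. The sketch below assumes the concept-only JSON is a single object mapping identifiers to lists of floats; that layout is an assumption and should be checked against the actual file. The function names `load_concept_vectors` and `summarize` are hypothetical.

```python
import json

def load_concept_vectors(path="../embeddings/concept_glove.json"):
    """Load the concept-only file; assumed layout: {identifier: [float, ...]}."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def summarize(vectors, n=5):
    """Return (entry count, length of the first vector, first n keys)."""
    keys = list(vectors)
    dim = len(vectors[keys[0]]) if keys else 0
    return len(keys), dim, keys[:n]

# Example (commented out because the file is several hundred MB):
# vectors = load_concept_vectors()
# print(summarize(vectors))
```

Inspecting the first few keys should show whether the file holds only concept identifiers rather than ordinary words.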

# PubTator Mapping

In [8]:
# Download the PubTator mappings from https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.gz (extracted and moved to ../datasets in the next cells)
!wget https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.gz

--2023-05-29 11:35:08--  https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.230, 130.14.250.7
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.230|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5707678655 (5.3G) [application/x-gzip]
Saving to: ‘bioconcepts2pubtatorcentral.gz’


2023-05-29 11:44:14 (9.98 MB/s) - ‘bioconcepts2pubtatorcentral.gz’ saved [5707678655/5707678655]



In [9]:
# Extract the file
!gunzip bioconcepts2pubtatorcentral.gz

In [10]:
# Move the file to datasets
!mv bioconcepts2pubtatorcentral ../datasets/bioconcepts2pubtatorcentral
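
Because the extracted file is several gigabytes, reading it line by line is safer than loading it whole. The sketch below streams lines and counts rows per concept type. It assumes tab-separated rows with the concept type in the second column; that layout is an assumption and should be checked against the first few lines of the file. `count_concept_types` is a hypothetical helper.

```python
from collections import Counter

def count_concept_types(lines, type_column=1):
    """Count rows per value of `type_column` in an iterable of TSV lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) > type_column:  # skip lines with too few columns
            counts[parts[type_column]] += 1
    return counts

# Example (commented out because the file is large):
# with open("../datasets/bioconcepts2pubtatorcentral", "rt", encoding="utf-8") as f:
#     print(count_concept_types(f).most_common())
```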

In [None]:
import gzip
from functools import lru_cache

# Load the gene-to-PubTator mapping into a dictionary.
# The result is cached so the file is parsed only once per session.
# Note: gene2pubtator.gz is not downloaded by the cells above; it is expected
# in the working directory, with the gene ID in column 2 and the name in column 3.
@lru_cache(maxsize=None)
def load_gene_mapping(path="gene2pubtator.gz"):
    gene_mapping = {}
    with gzip.open(path, "rt") as mapping_file:
        for line in mapping_file:
            parts = line.strip().split("\t")
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            gene_id = parts[1]
            gene_name = parts[2]
            gene_mapping[gene_id] = gene_name
    return gene_mapping

# Convert PubTator ID to gene name (None if the ID is unknown)
def pubtator_id_to_gene_name(pubtator_id):
    return load_gene_mapping().get(pubtator_id)

# Convert gene name to PubTator ID (case-insensitive; None if not found)
def gene_name_to_pubtator_id(gene_name):
    target = gene_name.lower()
    for gene_id, name in load_gene_mapping().items():
        if name.lower() == target:
            return gene_id
    return None

# Example usage
pubtator_id = "12345"
gene_name = pubtator_id_to_gene_name(pubtator_id)
print("PubTator ID:", pubtator_id)
print("Gene Name:", gene_name)

gene = "BRCA1"
pubtator_id = gene_name_to_pubtator_id(gene)
print("Gene Name:", gene)
print("PubTator ID:", pubtator_id)
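
Looking up an ID by name scans the whole dictionary on every call. If many names need resolving, a reverse index built once may be faster. This is a minimal sketch, assuming a `{gene_id: gene_name}` dictionary like the one returned by `load_gene_mapping`; `build_name_index` is a hypothetical helper and the toy data is invented for illustration.

```python
def build_name_index(gene_mapping):
    """Map lower-cased gene names to IDs; the first ID seen for a name wins."""
    index = {}
    for gene_id, name in gene_mapping.items():
        index.setdefault(name.lower(), gene_id)
    return index

# Toy data for illustration only; these pairs are not taken from the real file.
toy_mapping = {"1": "GeneA", "2": "geneb", "3": "GENEA"}
index = build_name_index(toy_mapping)
print(index.get("genea"))    # ID of the first entry whose name matches
print(index.get("unknown"))  # None when the name is absent
```

Keeping the first ID for a repeated name mirrors the behaviour of `gene_name_to_pubtator_id`, which returns the first match it encounters.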