# Anatomical terms

Most preprocessing was carried out in the neuro-knowledge-engine repo. Here, we perform additional preprocessing of the anatomical terms so that we can compare their occurrences across full texts, abstracts, and coordinate data.

## Load the terms

In [1]:
import pandas as pd

In [2]:
anat = pd.read_csv("../lexicon/lexicon_brain.csv", index_col=None)
anat.head()

Unnamed: 0,HARVARD_OXFORD,TERMS,SOURCE,TYPE
0,accumbens,accumbens,Harvard-Oxford,term
1,accumbens,acb,NeuroNames,acronym
2,accumbens,nucleus accumbens,NeuroNames,term
3,accumbens,accumbens nucleus,NeuroNames,term
4,accumbens,nucleus accumbens septi,NeuroNames,term


## Preprocess the terms

In [3]:
import preproc

In [4]:
terms = []
for term in anat["TERMS"]:
    term = preproc.preprocess_text(term)
    terms.append(term)
terms[:5]

['accumbens',
 'acb',
 'nucleus accumbens',
 'accumbens nucleus',
 'nucleus accumbens septi']

## Identify n-grams

In [5]:
ngrams = [term for term in terms if " " in term]
len(ngrams)

211

# PubMed corpus

This corpus will be used to train GloVe embeddings.

## Load PMIDs

In [6]:
import os, shutil

In [7]:
with open("../../../pubmed/query_190428/pmids.txt") as infile:
    pubmed_pmids = [int(pmid.strip()) for pmid in infile]
text_pmids = [int(fname.replace(".txt", "")) for fname in os.listdir("../../../nlp/corpus") if not fname.startswith(".")]
# Keep only the PubMed query results that have a text file in the corpus
pubmed_pmids = list(set(pubmed_pmids).intersection(text_pmids))
len(pubmed_pmids)

20504

In [8]:
df = pd.read_csv("../metadata/metadata.csv", encoding="latin-1")
coord_pmids = [int(pmid) for pmid in df["PMID"]]
len(coord_pmids)

18155

In [9]:
pmids = list(set(pubmed_pmids).union(set(coord_pmids)))
len(pmids)

29828

In [10]:
path = "../../../nlp/corpus"
for pmid in pmids:
    shutil.copyfile("{}/{}.txt".format(path, pmid), "pubmed/{}.txt".format(pmid))
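Note that only the PubMed query PMIDs were intersected with the available text files; the coordinate PMIDs entered the union without that check, and `shutil.copyfile` does not create the destination directory. A minimal sketch of a more defensive copy step follows. The helper name `copy_corpus` is hypothetical and not part of this notebook or of `preproc`.

```python
import os
import shutil

def copy_corpus(pmids, src_dir, dst_dir):
    """Copy <pmid>.txt from src_dir to dst_dir; return PMIDs with no source file."""
    # shutil.copyfile does not create missing directories, so do it here
    os.makedirs(dst_dir, exist_ok=True)
    missing = []
    for pmid in pmids:
        src = os.path.join(src_dir, "{}.txt".format(pmid))
        if os.path.isfile(src):
            shutil.copyfile(src, os.path.join(dst_dir, "{}.txt".format(pmid)))
        else:
            # Record the gap instead of raising FileNotFoundError mid-loop
            missing.append(pmid)
    return missing
```

In this notebook it could be called as `missing = copy_corpus(pmids, "../../../nlp/corpus", "pubmed")`; an empty `missing` list would confirm that every PMID in the union has a text file.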

## Consolidate n-grams

In [11]:
preproc.run_preproc("pubmed", pmids, ngrams=ngrams, 
                    preproc_texts=False, preproc_ngrams=True)

## Concatenate files

In [12]:
with open("corpus_bias.txt", "w") as outfile:
    for pmid in pmids:
        # Close each input file after reading to avoid leaking file handles
        with open("pubmed/{}.txt".format(pmid), "r") as infile:
            outfile.write(infile.read())

# Full text corpus

In [13]:
pmids = df["PMID"].astype(int)
len(pmids)

18155

In [14]:
path = "../../../nlp/corpus"
for pmid in pmids:
    shutil.copyfile("{}/{}.txt".format(path, pmid), "fulltexts/{}.txt".format(pmid))

In [15]:
preproc.run_preproc("fulltexts", pmids, ngrams=ngrams, preproc_texts=False, preproc_ngrams=True)

# Abstract corpus

The abstracts have not previously been preprocessed, so we will carry out all of the steps that were applied to the PubMed and full text corpora:

1. Lowercasing
2. Removal of symbols
3. Lemmatization
4. Consolidation of n-grams (psychological and neuroanatomical)
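The first three steps can be sketched on a single string as follows. This is an illustration only, not the implementation in `preproc`: the function name `basic_preprocess` and the lemma table `TOY_LEMMAS` are hypothetical, and a dictionary lookup stands in for a real lemmatizer. The sketch also drops all punctuation, whereas the corpus files evidently keep ` . ` as a sentence separator.

```python
import re

# Toy lemma table, purely illustrative; a real pipeline would use a lemmatizer
TOY_LEMMAS = {"nuclei": "nucleus", "cortices": "cortex", "regions": "region"}

def basic_preprocess(text, lemmas=TOY_LEMMAS):
    text = text.lower()                                # 1. lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # 2. replace symbols with spaces
    tokens = text.split()                              # also collapses repeated whitespace
    tokens = [lemmas.get(tok, tok) for tok in tokens]  # 3. lookup-based lemmatization
    return " ".join(tokens)

print(basic_preprocess("Nuclei of the Amygdala (left/right) & cortices"))
# nucleus of the amygdala left right cortex
```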

## Basic preprocessing

In [16]:
def load_ngrams(path):
    # Lexicon files join n-grams with underscores; convert them back to spaces
    with open(path, "r") as infile:
        return [word.strip().replace("_", " ") for word in infile if "_" in word]

ngrams_anat = ngrams
ngrams_rdoc = load_ngrams("../lexicon/lexicon_rdoc.txt")
ngrams_cogneuro = load_ngrams("../lexicon/lexicon_cogneuro.txt")
ngrams_dsm = load_ngrams("../lexicon/lexicon_dsm.txt")
ngrams_psych = load_ngrams("../lexicon/lexicon_psychiatry.txt")
ngrams = list(set(ngrams_anat + ngrams_rdoc + ngrams_cogneuro + ngrams_dsm + ngrams_psych))
# Sort so that n-grams with the most words come first
ngrams.sort(key=lambda x: x.count(" "), reverse=True)

In [17]:
preproc.run_preproc("abstracts", pmids, ngrams=ngrams, preproc_texts=True, preproc_ngrams=True)
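Sorting the n-grams so that those with the most words come first matters whenever one n-gram contains another, as with "nucleus accumbens septi" and "nucleus accumbens". The sketch below shows the idea, assuming that consolidation means joining the words of each n-gram with underscores (the convention used in the lexicon files). The function `consolidate_ngrams` is hypothetical and is not the code behind `preproc.run_preproc`.

```python
import re

def consolidate_ngrams(text, ngrams):
    # Longest n-grams first, so a shorter n-gram cannot split a longer match
    ordered = sorted(ngrams, key=lambda x: x.count(" "), reverse=True)
    for ngram in ordered:
        pattern = r"\b{}\b".format(re.escape(ngram))
        text = re.sub(pattern, ngram.replace(" ", "_"), text)
    return text

text = "activation in the nucleus accumbens septi and nucleus accumbens"
print(consolidate_ngrams(text, ["nucleus accumbens", "nucleus accumbens septi"]))
# activation in the nucleus_accumbens_septi and nucleus_accumbens
```

If the shorter n-gram were applied first, the text would contain `nucleus_accumbens septi`, and the three-word term would no longer match.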

## Removal of metadata

In [4]:
import os

In [14]:
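abstracts = [fname for fname in os.listdir("abstracts") if not fname.startswith(".")]
for abstract in abstracts:
    file_path = "abstracts/{}".format(abstract)
    with open(file_path, "r") as infile:
        text = infile.read()
    # Drop the first three " . "-delimited segments, which precede the abstract body
    text = " . ".join(text.split(" . ")[3:])
    # Truncate at the first occurrence of each trailing metadata marker
    text = text.split(" pmid ")[0].split(" pmcid ")[0].split(" doi ")[0].split(" copyright ")[0]
    with open(file_path, "w") as outfile:
        outfile.write(text)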
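To make the effect of the two string operations concrete, here is the same logic applied to a made-up string. The toy text and the function name `strip_metadata` are invented for illustration and are not a claim about the actual layout of the abstract files.

```python
def strip_metadata(text, n_header=3):
    # Drop the leading " . "-delimited segments (header metadata)
    body = " . ".join(text.split(" . ")[n_header:])
    # Cut everything from the first trailing marker onward
    for marker in (" pmid ", " pmcid ", " doi ", " copyright "):
        body = body.split(marker)[0]
    return body

# Fabricated example: three header segments, two body sentences, one trailer
toy = ("1 . journal name 2019 . a made up title . "
       "we measured activity . results were mixed . pmid 00000000")
print(strip_metadata(toy))
# we measured activity . results were mixed .
```

One consequence of this approach is that an abstract whose body contains one of the marker words as an ordinary token (for example " copyright ") would be truncated at that point.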