<div style="font-size:30px" align="center"> <b> Training Word2Vec on Biomedical Abstracts in PubMed </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon L. Kramer - University of Virginia's Biocomplexity Institute </b> </div>

<br>

This notebook borrows from several resources (listed at the end) to train Word2Vec models on a subset of the PubMed database as of January 2021. Overall, I am interested in testing whether diversity and racial terms are becoming more closely related over time. To do this, I train one model on abstracts from 1990-1995 and a second model on a random sample of abstracts from 2015-2020.

#### Import packages and ingest data 

Let's load our packages, connect to the database, and pull the abstracts for both periods.

In [3]:
# load packages
import os
import psycopg2 as pg
import pandas.io.sql as psql
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import Word
from gensim.models import Word2Vec
import multiprocessing 

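# set cores (leave one free, but never drop below one worker)
cores_available = max(1, multiprocessing.cpu_count() - 1)
# grab stop words as a set, for fast membership tests during cleaning
stop = set(stopwords.words('english'))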

# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

early_query = '''SELECT fk_pmid, year, abstract, publication 
                    FROM pubmed_2021.biomedical_abstracts 
                    WHERE year <= 1995'''
later_query = '''SELECT fk_pmid, year, abstract, publication 
                    FROM pubmed_2021.biomedical_abstracts 
                    WHERE year >= 2015'''

# convert each query result to a dataframe and preview the earlier data (there are no missing values)
pubmed_earlier = pd.read_sql_query(early_query, con=connection)
pubmed_later = pd.read_sql_query(later_query, con=connection)
pubmed_earlier.head()

Unnamed: 0,fk_pmid,year,abstract,publication
0,1279136,1992,The amyloid protein precursor (APP) of Alzheim...,The Journal of neuroscience : the official jou...
1,1279137,1992,Neurons of the medial pontine reticular format...,The Journal of neuroscience : the official jou...
2,1279138,1992,Embryonic striatal grafts develop a modular or...,The Journal of neuroscience : the official jou...
3,1279139,1992,The degree of parallel processing in frontal c...,The Journal of neuroscience : the official jou...
4,1279139,1992,The degree of parallel processing in frontal c...,The Journal of neuroscience : the official jou...


#### Matching the Sample Sizes 

Because the 2015-2020 data set is larger than the 1990-1995 data set, we draw a random sample of the later abstracts so that both models are trained on the same number of abstracts.

In [4]:
abstracts_to_sample = pubmed_earlier.count().year
pubmed_later = pubmed_later.sample(n=abstracts_to_sample, random_state=1)
pubmed_later.count().year

328075
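The preview of the earlier data shows the same `fk_pmid` (1279139) on two rows, so the counts above may include repeated abstracts. Below is a minimal sketch of matching the sample sizes after dropping repeats. It assumes that rows sharing an `fk_pmid` are duplicates, which the source does not confirm. The helper name `match_sample_sizes` and the toy data frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

def match_sample_sizes(earlier, later, key="fk_pmid", seed=1):
    """Drop repeated keys, then downsample the larger frame to the smaller one's size."""
    # assumption: rows with the same key are duplicates of one abstract
    earlier = earlier.drop_duplicates(subset=key)
    later = later.drop_duplicates(subset=key)
    n = min(len(earlier), len(later))
    # only sample a frame when it is larger than the target size
    if len(earlier) > n:
        earlier = earlier.sample(n=n, random_state=seed)
    if len(later) > n:
        later = later.sample(n=n, random_state=seed)
    return earlier, later

# toy data standing in for pubmed_earlier and pubmed_later
earlier_toy = pd.DataFrame({"fk_pmid": [1, 2, 3, 3], "year": [1992] * 4})
later_toy = pd.DataFrame({"fk_pmid": [10, 11, 12, 13, 14, 14], "year": [2015] * 6})

earlier_matched, later_matched = match_sample_sizes(earlier_toy, later_toy)
print(len(earlier_matched), len(later_matched))
```

Fixing `random_state`, as the cell above does, keeps the sample reproducible across runs.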

#### Cleaning the text data

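Convert all text to lower case, remove punctuation, standalone numbers, and stop words, and finally lemmatize each remaining word. Two details are worth knowing: only tokens made up entirely of digits are dropped, so tokens that mix letters and digits survive, and punctuation is deleted rather than replaced with a space, so terms joined by punctuation are merged into one token (see `cortexbasal` in the output below).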

In [6]:
# lower-case the text
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract'].str.lower()
# delete punctuation (anything that is not a word character or whitespace)
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
# drop tokens that consist only of digits
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda text: ' '.join(tok for tok in text.split() if not tok.isdigit()))
# drop English stop words
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda text: ' '.join(tok for tok in text.split() if tok not in stop))
# lemmatize each remaining token with TextBlob
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda text: ' '.join(Word(tok).lemmatize() for tok in text.split()))
pubmed_earlier.head()

Unnamed: 0,fk_pmid,year,abstract,publication,abstract_clean
0,1279136,1992,The amyloid protein precursor (APP) of Alzheim...,The Journal of neuroscience : the official jou...,amyloid protein precursor app alzheimers disea...
1,1279137,1992,Neurons of the medial pontine reticular format...,The Journal of neuroscience : the official jou...,neuron medial pontine reticular formation mprf...
2,1279138,1992,Embryonic striatal grafts develop a modular or...,The Journal of neuroscience : the official jou...,embryonic striatal graft develop modular organ...
3,1279139,1992,The degree of parallel processing in frontal c...,The Journal of neuroscience : the official jou...,degree parallel processing frontal cortexbasal...
4,1279139,1992,The degree of parallel processing in frontal c...,The Journal of neuroscience : the official jou...,degree parallel processing frontal cortexbasal...


In [7]:
# apply the same cleaning steps to the later data: lower-case the text
pubmed_later['abstract_clean'] = pubmed_later['abstract'].str.lower()
# delete punctuation (anything that is not a word character or whitespace)
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
# drop tokens that consist only of digits
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda text: ' '.join(tok for tok in text.split() if not tok.isdigit()))
# drop English stop words
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda text: ' '.join(tok for tok in text.split() if tok not in stop))
# lemmatize each remaining token with TextBlob
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda text: ' '.join(Word(tok).lemmatize() for tok in text.split()))
pubmed_later.head()

Unnamed: 0,fk_pmid,year,abstract,publication,abstract_clean
12603,25644602,2015,Binding of the Hedgehog (Hh) protein signal to...,Genes & development,binding hedgehog hh protein signal receptor pa...
103191,26273593,2015,This study presents clinical outcomes of prima...,BioMed research international,study present clinical outcome primary cleft p...
671651,30481191,2018,Administrating antibiotics to young piglets ma...,PloS one,administrating antibiotic young piglet may sho...
65615,26018967,2015,BACKGROUND:\nNeuroblastoma (NB) is the most co...,PloS one,background neuroblastoma nb common cancer infa...
7328,25601974,2015,OBJECTIVE:\nExamination of regional care patte...,Pediatrics,objective examination regional care pattern an...
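The two cleaning cells repeat the same steps, so one way to keep them in sync is to wrap the steps in a single function. The sketch below is a hedged illustration, not the notebook's method: `clean_abstract` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list so the example runs without any downloads, and lemmatization is left as an optional callable that is off by default.

```python
import re

# tiny illustrative stop-word set; the notebook uses NLTK's full English list
TOY_STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}

def clean_abstract(text, stop_words=TOY_STOP_WORDS, lemmatize=None):
    """Lower-case, delete punctuation, drop digit-only tokens and stop words."""
    # same pattern as the notebook: delete anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]+", "", text.lower())
    # keep tokens that are not purely digits and are not stop words
    tokens = [tok for tok in text.split() if not tok.isdigit()]
    tokens = [tok for tok in tokens if tok not in stop_words]
    # optional lemmatizer: any callable mapping one token to its lemma
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)

example = clean_abstract("The amyloid protein precursor (APP) of Alzheimer's disease, 42 kDa.")
print(example)  # amyloid protein precursor app alzheimers disease kda
```

In the notebook, the same function could be applied to both data frames, for example `pubmed_later['abstract'].apply(clean_abstract, stop_words=stop, lemmatize=lambda tok: Word(tok).lemmatize())`.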


#### Training the Word2Vec Models 

Now, let's train one Word2Vec model per period and save each model to disk so that we can visualize them later.

In [None]:
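# build one token list per abstract, then train the model on the earlier data
# note: `size` and `iter` are the gensim 3.x argument names; gensim 4.x renamed them to `vector_size` and `epochs`
earlier_list = [abstract.split() for abstract in pubmed_earlier['abstract_clean']]
earlier_model = Word2Vec(earlier_list, min_count=5, size=512, window=5, iter=5, workers=cores_available)

# save two copies of the model
# note: .save() writes gensim's native format whatever the file extension,
# so both files are read back with Word2Vec.load()
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
earlier_model.save("word2vec_1990_95.model")
earlier_model.save("word2vec_1990_95.bin")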

In [None]:
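# build one token list per abstract, then train the model on the later data
# note: `size` and `iter` are the gensim 3.x argument names; gensim 4.x renamed them to `vector_size` and `epochs`
later_list = [abstract.split() for abstract in pubmed_later['abstract_clean']]
later_model = Word2Vec(later_list, min_count=5, size=512, window=5, iter=5, workers=cores_available)

# save two copies of the model
# note: .save() writes gensim's native format whatever the file extension,
# so both files are read back with Word2Vec.load()
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
later_model.save("word2vec_2015_20.model")
later_model.save("word2vec_2015_20.bin")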
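With both models trained, the question posed at the top of the notebook can be approached by comparing the cosine similarity of a term pair within each model. Vectors from two separately trained models do not share a coordinate system, so the comparison is between within-model similarities, not between raw vectors. The sketch below is a minimal illustration under stated assumptions: `similarity_shift` is a hypothetical helper, the term pair is only an example, and the 3-dimensional vectors are invented to exercise the function, not results.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_shift(earlier_vectors, later_vectors, term_a, term_b):
    """Change in within-model similarity of a term pair (later minus earlier)."""
    before = cosine_similarity(earlier_vectors[term_a], earlier_vectors[term_b])
    after = cosine_similarity(later_vectors[term_a], later_vectors[term_b])
    return after - before

# toy vectors, invented purely for illustration
toy_earlier = {"diversity": [1.0, 0.0, 0.0], "race": [0.0, 1.0, 0.0]}
toy_later = {"diversity": [1.0, 0.0, 0.0], "race": [1.0, 0.0, 0.0]}

shift = similarity_shift(toy_earlier, toy_later, "diversity", "race")
print(shift)  # positive means the pair moved closer together
```

With the trained gensim models, `earlier_model.wv` and `later_model.wv` support the same word lookup, and `wv.similarity(term_a, term_b)` returns the cosine similarity directly. A term that falls below `min_count` in either period is absent from that model's vocabulary and needs to be handled before comparing.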

#### Resources on Lemmatizing and Word2Vec

[Guru99's Tutorial on Word Embeddings](https://www.guru99.com/word-embedding-word2vec.html)

[Stack Overflow Post on Lemmatizing in Pandas](https://stackoverflow.com/questions/47557563/lemmatization-of-all-pandas-cells)