<div style="font-size:30px" align="center"> <b> Training Word2Vec on Biomedical Abstracts in PubMed </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon L. Kramer - University of Virginia's Biocomplexity Institute </b> </div>

<br>

This notebook borrows from several resources to train Word2Vec models on a subset of the PubMed database taken from January 2021. Overall, I am interested in testing whether diversity and racial terms are becoming more closely related over time. To do this, I train one model on the 1990-2000 data and a second model on a random sample of the 2010-2020 data.

#### Import packages and ingest data 

Let's load our packages, connect to the database, and pull both sets of abstracts.

In [1]:
# load packages
import os
import psycopg2 as pg
import pandas.io.sql as psql
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import Word
from gensim.models import Word2Vec
import multiprocessing 

# set cores (at least one worker), grab stop words as a set for fast lookups
cores_available = max(1, multiprocessing.cpu_count() - 1)
stop = set(stopwords.words('english'))

# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

early_query = '''SELECT fk_pmid, year, abstract, publication 
                    FROM pubmed_2021.biomedical_human_abstracts 
                    WHERE year <= 2000'''
later_query = '''SELECT fk_pmid, year, abstract, publication 
                    FROM pubmed_2021.biomedical_human_abstracts 
                    WHERE year >= 2010'''

# convert to dataframes and preview the earlier data (no abstracts are missing)
pubmed_earlier = pd.read_sql_query(early_query, con=connection)
pubmed_later = pd.read_sql_query(later_query, con=connection)
pubmed_earlier.head()

| | fk_pmid | year | abstract | publication |
|---|---|---|---|---|
| 0 | 1279136 | 1992 | The amyloid protein precursor (APP) of Alzheim... | The Journal of neuroscience : the official jou... |
| 1 | 1279153 | 1992 | Bone marrow suppression is the major dose-limi... | The Journal of pediatrics |
| 2 | 1279211 | 1992 | A total of 41 patients with stage 1 malignant ... | The Journal of urology |
| 3 | 1279212 | 1992 | Prostatic blood flow was measured with 15oxyge... | The Journal of urology |
| 4 | 1279219 | 1992 | We correlated the American Urological Associat... | The Journal of urology |


#### Matching the Sample Sizes 

Because the 2010-2020 data is larger than the 1990-2000 data, we draw a random sample from the later period so that both corpora contain the same number of abstracts for the comparison.

In [2]:
abstracts_to_sample = pubmed_earlier.count().year
pubmed_later = pubmed_later.sample(n=abstracts_to_sample, random_state=1)
pubmed_later.count().year

475077
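
The same downsampling idea can be sketched without the database. This is only an illustration: the names `earlier_ids` and `later_ids` are hypothetical stand-ins for the two sets of abstracts, and it uses the standard library's `random.Random.sample` rather than `DataFrame.sample`, so it will not reproduce the rows the notebook actually draws.

```python
import random

# Hypothetical stand-ins for the two corpora: plain IDs, not real PubMed data.
earlier_ids = list(range(1_000))
later_ids = list(range(10_000, 15_000))

# Seed a private generator so the draw is reproducible, in the spirit of random_state=1.
rng = random.Random(1)

# Sample without replacement down to the size of the smaller corpus.
later_sample = rng.sample(later_ids, k=len(earlier_ids))

print(len(earlier_ids), len(later_sample))
```

Fixing the seed matters here because the later model is trained on whatever subset is drawn, so an unseeded sample would make the embeddings harder to reproduce.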

#### Cleaning the text data

Convert all text to lower case, remove punctuation, drop tokens made up entirely of digits, remove stop words, and finally lemmatize. Digits embedded in a longer token (for example, `15oxygenwater`) are kept.

In [3]:
# lower-case the text
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract'].str.lower()
# strip punctuation (anything that is not a word character or whitespace)
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
# drop tokens made up entirely of digits
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if not w.isdigit()))
# drop stop words
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
# lemmatize each remaining token
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))
pubmed_earlier.head()

| | fk_pmid | year | abstract | publication | abstract_clean |
|---|---|---|---|---|---|
| 0 | 1279136 | 1992 | The amyloid protein precursor (APP) of Alzheim... | The Journal of neuroscience : the official jou... | amyloid protein precursor app alzheimers disea... |
| 1 | 1279153 | 1992 | Bone marrow suppression is the major dose-limi... | The Journal of pediatrics | bone marrow suppression major doselimiting tox... |
| 2 | 1279211 | 1992 | A total of 41 patients with stage 1 malignant ... | The Journal of urology | total patient stage malignant teratoma testis ... |
| 3 | 1279212 | 1992 | Prostatic blood flow was measured with 15oxyge... | The Journal of urology | prostatic blood flow measured 15oxygenwater po... |
| 4 | 1279219 | 1992 | We correlated the American Urological Associat... | The Journal of urology | correlated american urological association aua... |


In [4]:
# same cleaning steps as above, applied to the later sample
pubmed_later['abstract_clean'] = pubmed_later['abstract'].str.lower()
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if not w.isdigit()))
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))
pubmed_later.head()

| | fk_pmid | year | abstract | publication | abstract_clean |
|---|---|---|---|---|---|
| 744879 | 27283331 | 2016 | OBJECTIVE:\nWe examined whether measures of vi... | Annals of the rheumatic diseases | objective examined whether measure vitamin ass... |
| 538102 | 25540455 | 2014 | Most eukaryotic lineages are microbial, and ma... | Systematic biology | eukaryotic lineage microbial many recently sam... |
| 650860 | 26484889 | 2015 | UNASSIGNED:\nThis study examined the prevalenc... | PloS one | unassigned study examined prevalence correlate... |
| 556065 | 25698436 | 2015 | OBJECTIVE:\nThe authors sought to clarify the ... | The American journal of psychiatry | objective author sought clarify source parento... |
| 1252066 | 31626126 | 2019 | Studies have found that the measurement of bod... | Medicine | study found measurement body composition used ... |
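
Because the two cleaning cells repeat the same five steps, one option is to wrap them in a single function so the earlier and later data cannot drift apart. The sketch below is an assumption about how that might look, not the notebook's code: `clean_abstract` and `toy_stop_words` are hypothetical names, the stop list is a tiny made-up one, and the lemmatizer defaults to a no-op so the example runs without NLTK or TextBlob.

```python
import re

def clean_abstract(text, stop_words, lemmatize=lambda word: word):
    """Lowercase, strip punctuation, drop digit-only tokens and stop words, then lemmatize."""
    text = text.lower()
    # Remove every run of characters that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]+", "", text)
    # Drop tokens made up entirely of digits; mixed tokens such as "15owater" survive.
    tokens = [tok for tok in text.split() if not tok.isdigit()]
    tokens = [tok for tok in tokens if tok not in stop_words]
    return " ".join(lemmatize(tok) for tok in tokens)

# Tiny hypothetical stop list for illustration; the notebook uses NLTK's English list.
toy_stop_words = {"a", "of", "with", "the"}

print(clean_abstract("A total of 41 patients with stage 1 tumors.", toy_stop_words))
# prints: total patients stage tumors
```

In the notebook, something like `pubmed_earlier['abstract'].apply(clean_abstract, stop_words=stop, lemmatize=lambda w: Word(w).lemmatize())` should give the same result as the step-by-step cells, assuming no abstracts are missing.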


#### Training the Word2Vec Models 

Now, let's train the two Word2Vec models and save each one to disk (under both a `.model` and a `.bin` file name) so we can visualize them later.
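
Before training, it may help to see what the `min_count` setting used in the next cell does: tokens whose total frequency in the corpus falls below the threshold are left out of the vocabulary. The sketch below illustrates only that frequency cutoff on made-up sentences. The variable names and the threshold of 2 are assumptions for the example, and gensim's own vocabulary building involves more than this.

```python
from collections import Counter

# Hypothetical tokenized abstracts; real input would be the cleaned PubMed text.
sentences = [
    "bone marrow suppression".split(),
    "bone density study".split(),
    "marrow transplant study".split(),
]

min_count = 2  # toy threshold; the notebook uses min_count=5

# Count every token across the whole corpus.
counts = Counter(token for sentence in sentences for token in sentence)

# Keep only tokens that occur at least min_count times.
vocabulary = sorted(token for token, n in counts.items() if n >= min_count)

print(vocabulary)
# prints: ['bone', 'marrow', 'study']
```

One consequence for the comparison is that a term has a vector in a given period only if it clears the cutoff in that period's corpus.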

In [5]:
# tokenize the earlier abstracts and train the model
# (note: gensim 4.x renamed size -> vector_size and iter -> epochs)
earlier_list = [abstract.split() for abstract in pubmed_earlier['abstract_clean']]
earlier_model = Word2Vec(earlier_list, min_count=5, size=512, window=5, iter=5, workers=cores_available)

# save() writes gensim's native format under both file names
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
earlier_model.save("word2vec_1990_2000.model")
earlier_model.save("word2vec_1990_2000.bin")

# tokenize the later abstracts and train the model
later_list = [abstract.split() for abstract in pubmed_later['abstract_clean']]
later_model = Word2Vec(later_list, min_count=5, size=512, window=5, iter=5, workers=cores_available)

# still in the word_embeddings directory from the chdir above
later_model.save("word2vec_2010_2020.model")
later_model.save("word2vec_2010_2020.bin")
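
Once both models exist, one common way to ask whether two terms have moved closer together is to compare their cosine similarity in each period. The sketch below shows that calculation in plain Python. The vectors are invented three-dimensional toys, not output from the trained models, and the words "diversity" and "race" are example terms chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, NOT output from the trained models.
toy_earlier = {"diversity": [1.0, 0.0, 0.0], "race": [0.0, 1.0, 0.0]}
toy_later = {"diversity": [1.0, 1.0, 0.0], "race": [0.0, 1.0, 0.0]}

for label, vectors in [("earlier", toy_earlier), ("later", toy_later)]:
    sim = cosine_similarity(vectors["diversity"], vectors["race"])
    print(label, round(sim, 3))
```

With the trained models, gensim's `earlier_model.wv.similarity("diversity", "race")` should return the same kind of score, provided both words survived the `min_count` cutoff in that period.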

#### References 

[Guru99's Tutorial on Word Embeddings](https://www.guru99.com/word-embedding-word2vec.html)

[Stack Overflow Post on Lemmatizing in Pandas](https://stackoverflow.com/questions/47557563/lemmatization-of-all-pandas-cells)