<div style="font-size:30px" align="center"> <b> Training Word2Vec on Biomedical Abstracts in PubMed </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon Kramer - University of Virginia's Biocomplexity Institute </b> </div>

<br>

This notebook borrows from several resources to train Word2Vec models on a subset of the PubMed database taken from January 2021. Overall, I am interested in testing whether diversity and racial terms are becoming more closely related over time. To do this, I train one model on abstracts from 1990-2000 and a second model on abstracts from 2010-2020.

#### Import packages and ingest data 

Let's load all of our packages.

In [7]:
# load packages
import os
import psycopg2 as pg
import pandas.io.sql as psql
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import Word
from gensim.models import Word2Vec
import multiprocessing 

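# set cores (leave one free, but never drop below one worker), grab stop words
cores_available = max(1, multiprocessing.cpu_count() - 1)
# a set makes the per-word stop-word lookup much faster than a list
stop = set(stopwords.words('english'))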

#os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
#pubmed_earlier = pd.read_csv("cleaned_1990_2000_0821.csv")

#### Matching the Sample Sizes 

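Because the 2010-2020 data is larger than the 1990-2000 data, we want to draw a random sample of the later abstracts so that both corpora are the same size for comparison. Note that the sampling step is currently commented out in the code, so the later model is trained on the full set of 2010-2020 abstracts. The next cell loads and cleans the 1990-2000 abstracts.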
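
The commented-out lines in the later cell draw the sample with `pubmed_later.sample(n=abstracts_to_sample, random_state=1)`. As a rough, pandas-free sketch of the same idea, the snippet below downsamples a larger list to the size of a smaller one with a fixed seed. The `downsample` helper and the toy lists are hypothetical stand-ins, not part of the original notebook.

```python
import random

def downsample(records, target_size, seed=1):
    """Return a reproducible random sample of `target_size` records (no replacement)."""
    if target_size > len(records):
        raise ValueError("target_size exceeds the number of available records")
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(records, target_size)

# Hypothetical stand-ins for the two corpora (not real PubMed data).
earlier_abstracts = [f"earlier abstract {i}" for i in range(100)]
later_abstracts = [f"later abstract {i}" for i in range(1000)]

# Shrink the later corpus to match the size of the earlier one.
later_sample = downsample(later_abstracts, target_size=len(earlier_abstracts))
print(len(earlier_abstracts), len(later_sample))
```

Fixing the seed means the same abstracts are drawn on every run, which keeps the comparison between the two periods reproducible.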

In [13]:
# load the 1990-2000 abstracts
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
pubmed_earlier = pd.read_csv("cleaned_1990_2000_0821.csv")
# lower-case, split hyphenated words, and strip punctuation
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract'].str.lower()
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].str.replace('-', ' ', regex=False)
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
# drop standalone numbers and stop words, then lemmatize each remaining word
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if not w.isdigit()))
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))
pubmed_earlier.head()

| Unnamed: 0 | fk_pmid | year | abstract | human | nonhuman | abstract_clean |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1655949 | 1991 | advanced glycosylation endproducts (ages) are ... | 2 | 1 | advanced glycosylation endproducts age derived... |
| 1 | 9283572 | 1997 | we conducted a morphologic and anatomic study ... | 0 | 0 | conducted morphologic anatomic study alar cart... |
| 2 | 10559607 | 1999 | many men with early stage prostate cancer suff... | 3 | 0 | many men early stage prostate cancer suffer re... |
| 3 | 9117021 | 1997 | we investigated whether exposure to a low leve... | 2 | 0 | investigated whether exposure low level microg... |
| 4 | 8863225 | 1996 | thirty-eight children (2 months to 26 yearsofa... | 7 | 0 | thirty eight child month yearsofage underwent ... |


#### Cleaning the text data

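Convert all text to lower case, replace hyphens with spaces, remove punctuation, standalone numbers, and stop words, and finally lemmatize each word. The cell below applies these steps to the 2010-2020 abstracts, after dropping rows with a missing abstract.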
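
To make the order of these steps concrete, here is a minimal sketch that applies them to a single string using only the standard library. It is illustrative rather than the notebook's actual pipeline: `clean_abstract` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's English list, and the lemmatization step is left out because it needs TextBlob.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"we", "a", "to", "the", "of", "and", "with", "in"}

def clean_abstract(text, stop_words=TOY_STOP_WORDS):
    """Apply the cleaning steps to one abstract (lemmatization omitted)."""
    text = text.lower()                     # lower-case everything
    text = text.replace("-", " ")           # split hyphenated words
    text = re.sub(r"[^\w\s]+", "", text)    # strip punctuation
    tokens = text.split()
    tokens = [t for t in tokens if not t.isdigit()]      # drop standalone numbers
    tokens = [t for t in tokens if t not in stop_words]  # drop stop words
    return " ".join(tokens)

print(clean_abstract("We investigated 38 early-stage patients (aged 2 to 26)."))
# prints: investigated early stage patients aged
```

Replacing hyphens before stripping punctuation matters: done the other way around, "early-stage" would collapse into the single token "earlystage".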

In [16]:
# load the 2010-2020 abstracts and drop rows with a missing abstract
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
pubmed_later = pd.read_csv("cleaned_2010_2020_0821.csv")
pubmed_later = pubmed_later[pubmed_later['abstract'].notnull()]
# optional downsampling to the size of the earlier data (currently disabled)
#abstracts_to_sample = pubmed_earlier.count().year
#pubmed_later = pubmed_later.sample(n=abstracts_to_sample, random_state=1)
# lower-case, split hyphenated words, and strip punctuation
pubmed_later['abstract_clean'] = pubmed_later['abstract'].str.lower()
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].str.replace('-', ' ', regex=False)
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
# drop standalone numbers and stop words, then lemmatize each remaining word
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if not w.isdigit()))
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))
pubmed_later.head()
pubmed_later.count()

fk_pmid           1131646
year              1131646
abstract          1131646
human             1131646
nonhuman          1131646
abstract_clean    1131646
dtype: int64

#### Training the Word2Vec Models 

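Now, let's train the two Word2Vec models and save them to disk so that we can visualize them later. Both models use the same settings: 512-dimensional vectors, a context window of 5, a minimum word count of 5, and 5 training epochs.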

In [14]:
# run the model on the earlier data
# each abstract becomes a list of tokens, which is the input Word2Vec expects
earlier_list = [abstract.split() for abstract in pubmed_earlier['abstract_clean']]
earlier_model = Word2Vec(earlier_list, min_count=5, vector_size=512, window=5, epochs=5, seed=123, workers=cores_available)

# both files are written in gensim's native format (reload with Word2Vec.load)
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
earlier_model.save("word2vec_1990_2000_0821.model")
earlier_model.save("word2vec_1990_2000_0821.bin")

In [17]:
# run the model on the later data
# each abstract becomes a list of tokens, which is the input Word2Vec expects
later_list = [abstract.split() for abstract in pubmed_later['abstract_clean']]
later_model = Word2Vec(later_list, min_count=5, vector_size=512, window=5, epochs=5, seed=123, workers=cores_available)

# both files are written in gensim's native format (reload with Word2Vec.load)
os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
later_model.save("word2vec_2010_2020_0821.model")
later_model.save("word2vec_2010_2020_0821.bin")
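
Once both models are trained, the question of whether two terms have moved closer together comes down to comparing their cosine similarity within each model. In gensim this is what `model.wv.similarity(word_a, word_b)` reports. Because the two models are trained independently, their raw vectors are not directly comparable, but within-model similarities are. The sketch below shows the calculation on made-up three-dimensional vectors. The toy dictionaries and their values are purely hypothetical, and it is an assumption that words such as "diversity" and "racial" survive the `min_count` filter in both vocabularies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors for illustration only; the trained
# models produce 512-dimensional vectors and these numbers are made up.
toy_earlier = {"diversity": [1.0, 0.0, 0.0], "racial": [0.0, 1.0, 0.0]}
toy_later = {"diversity": [1.0, 0.0, 0.0], "racial": [1.0, 1.0, 0.0]}

# Similarity is computed within each period, then the two values are compared.
sim_earlier = cosine_similarity(toy_earlier["diversity"], toy_earlier["racial"])
sim_later = cosine_similarity(toy_later["diversity"], toy_later["racial"])
print(f"earlier: {sim_earlier:.3f}, later: {sim_later:.3f}, change: {sim_later - sim_earlier:+.3f}")
```

A positive change would suggest that the two terms appear in more similar contexts in the later period than in the earlier one.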

#### References 

[Guru99's Tutorial on Word Embeddings](https://www.guru99.com/word-embedding-word2vec.html)

[Stack Overflow Post on Lemmatizing in Pandas](https://stackoverflow.com/questions/47557563/lemmatization-of-all-pandas-cells)