<div style="font-size:30px" align="center"> <b> Training Word2Vec on Biomedical Abstracts in PubMed </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon Kramer - University of Virginia's Biocomplexity Institute </b> </div>

<br>

This notebook borrows from several resources to train Word2Vec models on a subset of the PubMed database taken from January 2021. Overall, I am interested in testing whether diversity and racial terms are becoming more closely related over time. To do this, I train one model on 1990-2000 data and a second model on a random sample of 2010-2020 data. 

#### Import packages and ingest data 

Let's load all of our packages, connect to the database, and pull the abstracts for each period.

In [2]:
# load packages
import os
import psycopg2 as pg
import pandas.io.sql as psql
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import Word
from gensim.models import Word2Vec
import multiprocessing 

# set cores (keeping at least one worker), grab stop words as a set for fast lookups
cores_available = max(1, multiprocessing.cpu_count() - 1)
stop = set(stopwords.words('english'))

# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

early_query = '''SELECT fk_pmid, year, abstract, publication 
                    FROM pubmed_2021.human_abstracts_0721 
                    WHERE year <= 2000'''
later_query = '''SELECT fk_pmid, year, abstract, publication 
                    FROM pubmed_2021.human_abstracts_0721 
                    WHERE year >= 2010'''

# read each query into a dataframe and preview the earlier data
pubmed_earlier = pd.read_sql_query(early_query, con=connection)
pubmed_later = pd.read_sql_query(later_query, con=connection)
pubmed_earlier.head()

Unnamed: 0,fk_pmid,year,abstract,publication
0,8956565,1996,OBJECTIVE:\nDetermination of skeletal or bone ...,AJR. American journal of roentgenology
1,9651461,1998,Guidelines of psychosexual management for infa...,Pediatrics
2,10414842,1999,We estimated risk of suicide in adults in New ...,Social science & medicine (1982)
3,7723951,1995,African-Americans have an unexplained increase...,Neurology
4,7974531,1994,OBJECTIVE:\nAlthough US blacks are known to ha...,Stroke


#### Matching the Sample Sizes 

Since the 2010-2020 data is larger than the 1990-2000 data, we want to take a random sample of the later data to make the sample sizes the same for comparison later.

In [3]:
abstracts_to_sample = pubmed_earlier.count().year
pubmed_later = pubmed_later.sample(n=abstracts_to_sample, random_state=1)
pubmed_later.count().year

369100
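
The matching step can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual data: the two small dataframes and the names `earlier`, `later`, and `n` are invented for the example. One detail worth knowing is that `DataFrame.sample(n=...)` raises an error when `n` exceeds the number of rows (unless `replace=True`), so taking the smaller of the two lengths keeps the step safe whichever period turns out to be larger.

```python
import pandas as pd

# toy stand-ins for the two periods (hypothetical values, for illustration only)
earlier = pd.DataFrame({'year': [1994, 1995, 1996]})
later = pd.DataFrame({'year': [2012, 2013, 2015, 2016, 2019]})

# sample both periods down to the size of the smaller one
n = min(len(earlier), len(later))
earlier_matched = earlier.sample(n=n, random_state=1)
later_matched = later.sample(n=n, random_state=1)

print(len(earlier_matched), len(later_matched))
```

Fixing `random_state` makes the draw reproducible, which matters here because the later model is trained on whichever abstracts happen to be sampled.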

#### Cleaning the text data

Convert all text to lower case, remove punctuation, drop digit-only tokens and stop words, and finally lemmatize. 

In [5]:
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract'].str.lower()
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if not w.isdigit()))
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
pubmed_earlier['abstract_clean'] = pubmed_earlier['abstract_clean'].apply(lambda x:' '.join([Word(word).lemmatize() for word in x.split()]))
pubmed_earlier.head()

Unnamed: 0,fk_pmid,year,abstract,publication,abstract_clean
0,8956565,1996,OBJECTIVE:\nDetermination of skeletal or bone ...,AJR. American journal of roentgenology,objective determination skeletal bone age ofte...
1,9651461,1998,Guidelines of psychosexual management for infa...,Pediatrics,guideline psychosexual management infant born ...
2,10414842,1999,We estimated risk of suicide in adults in New ...,Social science & medicine (1982),estimated risk suicide adult new south wale ns...
3,7723951,1995,African-Americans have an unexplained increase...,Neurology,africanamericans unexplained increased inciden...
4,7974531,1994,OBJECTIVE:\nAlthough US blacks are known to ha...,Stroke,objective although u black known excess stroke...


In [6]:
pubmed_later['abstract_clean'] = pubmed_later['abstract'].str.lower()
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].str.replace(r'[^\w\s]+', '', regex=True)
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if not w.isdigit()))
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
pubmed_later['abstract_clean'] = pubmed_later['abstract_clean'].apply(lambda x:' '.join([Word(word).lemmatize() for word in x.split()]))
pubmed_later.head()

Unnamed: 0,fk_pmid,year,abstract,publication,abstract_clean
219363,30850025,2019,BACKGROUND:\nFemale life expectancy and mortal...,BMC public health,background female life expectancy mortality ra...
256520,22629321,2012,Research has shown that people are able to jud...,PloS one,research shown people able judge sexual orient...
133216,25870227,2015,Plasmodium falciparum merozoites use diverse a...,Infection and immunity,plasmodium falciparum merozoite use diverse al...
679703,23805296,2013,NMDA receptors are activated after binding of ...,PloS one,nmda receptor activated binding agonist glutam...
614039,26845760,2016,BACKGROUND:\nThe existence of partial volume e...,PloS one,background existence partial volume effect bra...
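
Because the same five steps are applied to both dataframes, they could be wrapped in one helper so the two periods are guaranteed to be cleaned identically. The sketch below is one possible way to do that, not the code that produced the output above. The function name `clean_abstract`, the tiny `demo_stop` list, and the default identity lemmatizer are all assumptions made so the example runs without the NLTK or TextBlob corpora; in the notebook one would pass the NLTK stop words and something like `lambda w: Word(w).lemmatize()` instead.

```python
import re

def clean_abstract(text, stop_words, lemmatize=lambda w: w):
    """Lowercase, strip punctuation, drop digit-only tokens and stop words, then lemmatize."""
    # same pattern as the notebook: remove anything that is not a word character or whitespace
    text = re.sub(r'[^\w\s]+', '', text.lower())
    tokens = [t for t in text.split() if not t.isdigit() and t not in stop_words]
    return ' '.join(lemmatize(t) for t in tokens)

# hypothetical mini stop list, standing in for stopwords.words('english')
demo_stop = {'the', 'of', 'in', 'are', 'to', 'have', 'an'}

cleaned = clean_abstract("African-Americans have an unexplained increase in 25 strokes.", demo_stop)
print(cleaned)
```

Note that stripping punctuation fuses hyphenated terms into a single token (`african-americans` becomes `africanamericans`), which is also visible in the cleaned output above and is worth keeping in mind when looking up racial terms in the trained models.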


#### Training the Word2Vec Models 

Now, let's train a Word2Vec model on each period and save both models to disk so they can be visualized later. 

In [7]:
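# run the model on the earlier data
# each abstract becomes a list of tokens, which is the input format Word2Vec expects
earlier_list = [abstract.split() for abstract in pubmed_earlier['abstract_clean']]
# note: size= and iter= are the gensim 3.x argument names; gensim 4.x renamed them to vector_size= and epochs=
earlier_model = Word2Vec(earlier_list, min_count=5, size=512, window=5, iter=5, seed=123, workers=cores_available)

os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
# .save() writes gensim's native format whatever the file extension, so both files load with Word2Vec.load()
earlier_model.save("word2vec_1990_2000_0721.model")
earlier_model.save("word2vec_1990_2000_0721.bin")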

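# run the model on the later data
# each abstract becomes a list of tokens, which is the input format Word2Vec expects
later_list = [abstract.split() for abstract in pubmed_later['abstract_clean']]
# note: size= and iter= are the gensim 3.x argument names; gensim 4.x renamed them to vector_size= and epochs=
later_model = Word2Vec(later_list, min_count=5, size=512, window=5, iter=5, seed=123, workers=cores_available)

os.chdir("/sfs/qumulo/qhome/kb7hp/git/diversity/data/word_embeddings/")
# .save() writes gensim's native format whatever the file extension, so both files load with Word2Vec.load()
later_model.save("word2vec_2010_2020_0721.model")
later_model.save("word2vec_2010_2020_0721.bin")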
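
The question motivating this notebook, whether diversity and racial terms move closer together over time, comes down to comparing the cosine similarity of the same word pair in each model. The sketch below shows that calculation on made-up 3-dimensional vectors so it runs without the trained models. The vectors, the dictionary names `early` and `later`, and the helper `cosine_similarity` are all hypothetical and say nothing about the real embeddings. With the saved models, the equivalent lookup would be something like `earlier_model.wv.similarity('diversity', 'racial')`, assuming both words survived the `min_count` cutoff.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# invented toy vectors, for illustration only (not model output)
early = {'diversity': [1.0, 0.0, 0.0], 'racial': [0.0, 1.0, 0.0]}
later = {'diversity': [1.0, 1.0, 0.0], 'racial': [0.0, 1.0, 0.0]}

early_sim = cosine_similarity(early['diversity'], early['racial'])
later_sim = cosine_similarity(later['diversity'], later['racial'])

# a positive change would mean the pair is closer in the later period
change = later_sim - early_sim
print(round(early_sim, 3), round(later_sim, 3), round(change, 3))
```

Since the two models are trained separately, their vector spaces are not aligned with each other; comparing within-model similarities, as here, avoids comparing raw vectors across models.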

#### References 

[Guru99's Tutorial on Word Embeddings](https://www.guru99.com/word-embedding-word2vec.html)

[Stack Overflow Post on Lemmatizing in Pandas](https://stackoverflow.com/questions/47557563/lemmatization-of-all-pandas-cells)