## word2vec in code
Packages used:
1. `gensim.word2vec`: https://radimrehurek.com/gensim/models/word2vec.html The Gensim documentation is very nicely written.

2. `pymed`: https://github.com/gijswobben/pymed

I'd like to build a word2vec model on the keywords of PubMed articles, using the first 5000 results of the query `Neonicotinoids`.

In [1]:
import pandas as pd
from pymed import PubMed
import numpy as np
from gensim.models import Word2Vec
import inspect
# print(inspect.getsource(Word2Vec))

### Pubmed API call for querying keywords of neonicotinoids associated articles

The code below is adapted from https://stackoverflow.com/questions/72006411/pubmed-fetch-article-details-to-a-daframe

In [2]:
search_term = 'Neonicotinoids'
max_results = 5000

def pubmed_searcher(search_term, max_results):
    '''
    Search max_results # of Pubmed articles with the query (search_term)
    '''
    pubmed = PubMed(tool="PubMedSearcher", email="myemail@ccc.com")

    # Query PubMed with the search term passed to the function
    results = pubmed.query(search_term, max_results)
    articleList = []
    articleInfo = []

    for article in results:
        # Each result is either a PubMedBookArticle or a PubMedArticle;
        # convert it to a dictionary with its toDict() method.
        articleDict = article.toDict()
        articleList.append(articleDict)

    # Build a list of dict records holding the article details fetched from the PubMed API
    for article in articleList:
        # article['pubmed_id'] sometimes holds several IDs separated by newlines;
        # take the first one, which is the article's own PubMed ID.
        pubmedId = article['pubmed_id'].partition('\n')[0]

        # Keep only articles with at least one keyword and append their info to the list
        if 'keywords' in article.keys() and len(article['keywords']) != 0:
            articleInfo.append({u'pubmed_id':pubmedId,
                                u'publication_date':article['publication_date'], 
                                u'authors':article['authors'],
                                u'keywords':article['keywords']})
            
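    # Print the available keys once (skipped if the query returned no articles,
    # in which case `article` would be undefined)
    if articleList:
        print('available keys from pubmed API: ' + str(articleList[-1].keys()))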
    
    df=pd.json_normalize(articleInfo)
    
    return df
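
The filtering step in `pubmed_searcher` can be tried offline, without calling the API. The sketch below applies the same two rules (take the first line of `pubmed_id`, keep only records with a non-empty `keywords` list) to a few hand-made dictionaries. The helper name `keep_articles_with_keywords` and the sample records are hypothetical and not part of pymed.

```python
def keep_articles_with_keywords(articles):
    """Return simplified records for articles that have at least one keyword."""
    kept = []
    for article in articles:
        # Take the first line of the ID field, mirroring partition('\n')[0]
        pubmed_id = article['pubmed_id'].partition('\n')[0]
        # Treat a missing or None 'keywords' entry as an empty list
        keywords = article.get('keywords') or []
        if keywords:
            kept.append({'pubmed_id': pubmed_id, 'keywords': keywords})
    return kept


# Hand-made records for illustration only (not real API output)
sample_articles = [
    {'pubmed_id': '111\n222\n333', 'keywords': ['Imidacloprid', 'Honey bee']},
    {'pubmed_id': '444', 'keywords': []},  # empty keyword list: dropped
    {'pubmed_id': '555'},                  # no 'keywords' key: dropped
]

filtered = keep_articles_with_keywords(sample_articles)
print(filtered)
```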


In [3]:
# Example output
df = pubmed_searcher(search_term, max_results)
df.head()

available keys from pubmed API: dict_keys(['pubmed_id', 'title', 'abstract', 'keywords', 'journal', 'publication_date', 'authors', 'methods', 'conclusions', 'results', 'copyrights', 'doi', 'xml'])


Unnamed: 0,pubmed_id,publication_date,authors,keywords
0,36149570,2022-09-24,"[{'lastname': 'Zhou', 'firstname': 'Hong-Xia',...","[Floral nectar, Neonicotinoid, Pollinators, Sa..."
1,36144866,2022-09-24,"[{'lastname': 'Lu', 'firstname': 'Xingxing', '...","[Flupyrimin derivatives, low bee-toxicity, mol..."
2,36140100,2022-09-24,"[{'lastname': 'Jiao', 'firstname': 'Shasha', '...","[broad-specific mAb, immunochromatography, neo..."
3,36127060,2022-09-21,"[{'lastname': 'Zhang', 'firstname': 'Bai-Zhong...","[Imidacloprid resistance, Sitobion miscanthi, ..."
4,36127049,2022-09-21,"[{'lastname': 'Mezei', 'firstname': 'Imre', 'i...","[Green peach aphid, Insecticide and neonicotin..."


The keywords mix upper- and lower-case terms, so a simple pre-processing step lowercases every keyword.

In [4]:
df['keywords'] = df['keywords'].apply(lambda x: [word.lower() if word is not None else word for word in x])
# df['keywords'] = df['keywords'].apply(lambda x: [word.split(' ') if word is not None else word for word in x])
# df['keywords'] = df['keywords'].apply(lambda x: sum([word if word is not None else [] for word in x], []))
# df['keywords'] = df['keywords'].apply(lambda x: [word if word is not '' else word for word in x])
df.head()

Unnamed: 0,pubmed_id,publication_date,authors,keywords
0,36149570,2022-09-24,"[{'lastname': 'Zhou', 'firstname': 'Hong-Xia',...","[floral nectar, neonicotinoid, pollinators, sa..."
1,36144866,2022-09-24,"[{'lastname': 'Lu', 'firstname': 'Xingxing', '...","[flupyrimin derivatives, low bee-toxicity, mol..."
2,36140100,2022-09-24,"[{'lastname': 'Jiao', 'firstname': 'Shasha', '...","[broad-specific mab, immunochromatography, neo..."
3,36127060,2022-09-21,"[{'lastname': 'Zhang', 'firstname': 'Bai-Zhong...","[imidacloprid resistance, sitobion miscanthi, ..."
4,36127049,2022-09-21,"[{'lastname': 'Mezei', 'firstname': 'Imre', 'i...","[green peach aphid, insecticide and neonicotin..."
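
The lowercasing lambda keeps `None` entries in place. A slightly stricter variant, sketched here as an assumption rather than what the notebook ran, also strips surrounding whitespace and drops `None` and empty strings so that only real keyword strings reach the model. The function name `normalize_keywords` is hypothetical.

```python
def normalize_keywords(keywords):
    """Lowercase and strip each keyword, skipping None and empty entries."""
    cleaned = []
    for word in keywords:
        if word is None:
            continue
        word = word.strip().lower()
        if word:
            cleaned.append(word)
    return cleaned


example = ['Floral nectar', None, ' Neonicotinoid ', '']
print(normalize_keywords(example))
```

It could be applied to the DataFrame with `df['keywords'].apply(normalize_keywords)`.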


Build a baseline word2vec model with the keyword lists as the `sentences`. By default (`sg=0`), Gensim trains a CBOW model:

In [5]:
cbow_model = Word2Vec(sentences=df['keywords'], vector_size=100, window=5, min_count=1, workers=4)
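
Because each "sentence" is a list of keyword strings, a multi-word keyword such as `oxidative stress` is treated as a single token. To make the difference between CBOW and skip-gram (`sg=1`) concrete, here is a simplified sketch of the training examples each one forms from a single keyword list. It only illustrates the idea: the helper names are hypothetical, and details of the real training procedure such as subsampling and negative sampling are left out.

```python
def context_window(tokens, position, window):
    """Return the tokens within `window` positions of tokens[position], excluding itself."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: the whole context is used together to predict the centre token."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skip_gram_examples(tokens, window):
    """Skip-gram: the centre token predicts each context token separately."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


keywords = ['floral nectar', 'neonicotinoid', 'pollinators']
print(cbow_examples(keywords, window=1))
print(skip_gram_examples(keywords, window=1))
```

CBOW produces one example per token, while skip-gram produces one example per (centre, context) pair, which is why skip-gram sees more training examples from the same corpus.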

In [6]:
# print out number of unique keywords in this corpus
len(cbow_model.wv)

5746

In [7]:
# print out the first 11 keywords (indices 0 to 10) and their index in the vocabulary
[(key, value) for key, value in cbow_model.wv.key_to_index.items() if value < 11 ]

[('imidacloprid', 0),
 ('neonicotinoids', 1),
 ('neonicotinoid', 2),
 ('thiamethoxam', 3),
 ('pesticides', 4),
 ('acetamiprid', 5),
 ('clothianidin', 6),
 ('insecticide', 7),
 ('pesticide', 8),
 ('oxidative stress', 9),
 ('risk assessment', 10)]
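
The indices in `key_to_index` follow keyword frequency: with Gensim's default `sorted_vocab=1`, more frequent tokens receive lower indices. A rough equivalent of that ranking can be sketched with `collections.Counter` on a toy corpus. The corpus below is made up for illustration, and the order of tied keywords may differ from Gensim's.

```python
from collections import Counter

# Toy keyword lists for illustration only
toy_corpus = [
    ['imidacloprid', 'neonicotinoids', 'honey bee'],
    ['imidacloprid', 'thiamethoxam'],
    ['imidacloprid', 'neonicotinoids'],
]

# Count how often each keyword occurs across all lists
counts = Counter(word for keywords in toy_corpus for word in keywords)

# most_common() returns (keyword, count) pairs sorted by descending count
ranked = [word for word, _ in counts.most_common()]
toy_key_to_index = {word: index for index, word in enumerate(ranked)}
print(toy_key_to_index)
```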

In [8]:
vector = cbow_model.wv['imidacloprid']  # get numpy vector of a word
vector

array([-5.47558814e-03,  1.15996161e-02,  8.38345941e-03,  1.31329158e-02,
       -4.02463367e-03, -2.19030082e-02,  1.20871393e-02,  3.46199572e-02,
       -1.87183581e-02, -9.02592205e-03,  3.06077651e-03, -2.20779702e-02,
       -7.84096401e-03,  1.28989788e-02,  1.85582798e-03, -1.42069990e-02,
        1.23926550e-02, -5.73959900e-03, -1.03785265e-02, -2.89911795e-02,
        1.37930559e-02,  7.02149328e-03,  1.91754494e-02, -3.29084788e-03,
        9.85193159e-03, -8.92621116e-04, -5.58804395e-03,  4.93477704e-03,
       -2.02950183e-02, -5.34288818e-04,  5.27832704e-03,  1.48142877e-04,
        9.77971964e-03, -2.45441962e-02, -8.63257516e-03,  5.86847309e-03,
        1.13223922e-02, -1.13216983e-02, -2.00672285e-03, -1.98060852e-02,
       -7.65720056e-03, -8.60788301e-03, -1.88366678e-02, -1.27569761e-03,
        7.35833030e-03, -7.94641674e-03, -1.89480484e-02,  9.02668014e-03,
        1.03715099e-02,  1.78529955e-02,  7.08234757e-06, -6.91294624e-03,
       -8.86891875e-03, -

`neonicotinoids` came up as the keyword most closely related to `imidacloprid`, which makes sense because imidacloprid is a chemical compound in the neonicotinoid family.

In [9]:
sims = cbow_model.wv.most_similar('imidacloprid', topn=10)  # get other similar words
sims


In [None]:
skip_gram_model = Word2Vec(sentences=df['keywords'], vector_size=100, window=5, min_count=1, workers=4, sg=1)

In [None]:
vector = skip_gram_model.wv['imidacloprid']  # get numpy vector of a word
vector

In [None]:
sims = skip_gram_model.wv.most_similar('imidacloprid', topn=10)  # get other similar words
sims
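
The `most_similar` calls above rank keywords by the cosine similarity between their vectors and the vector of the query keyword. A minimal pure-Python sketch of that ranking, using made-up two-dimensional vectors in place of the trained 100-dimensional ones (the names `cosine_similarity`, `rank_by_similarity` and `toy_vectors` are hypothetical):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, vectors, topn=10):
    """Return the topn (keyword, similarity) pairs closest to `target`."""
    scores = [(key, cosine_similarity(vectors[target], vector))
              for key, vector in vectors.items() if key != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors for illustration only
toy_vectors = {
    'imidacloprid': [1.0, 0.0],
    'neonicotinoids': [0.9, 0.1],
    'risk assessment': [0.0, 1.0],
}

ranking = rank_by_similarity('imidacloprid', toy_vectors, topn=2)
print(ranking)
```

Comparing the top-10 lists from `cbow_model` and `skip_gram_model` for the same keyword is one simple way to see how the two training modes differ on this corpus.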