## word2vec in code
Packages used:
1. `gensim.models.word2vec`: https://radimrehurek.com/gensim/models/word2vec.html (the Gensim documentation is very well written).

2. `pymed`: https://github.com/gijswobben/pymed

I'd like to train a word2vec model on the keywords of PubMed articles, using the first 5000 results of the query `Neonicotinoids`.

In [1]:
import pandas as pd
from pymed import PubMed
import numpy as np
from gensim.models import Word2Vec
import inspect
# print(inspect.getsource(Word2Vec))

### PubMed API call to query the keywords of neonicotinoid-associated articles

The code below is adapted from https://stackoverflow.com/questions/72006411/pubmed-fetch-article-details-to-a-daframe

In [2]:
search_term = 'Neonicotinoids'
max_results = 5000

def pubmed_searcher(search_term, max_results):
    '''
    Query PubMed for search_term and return a DataFrame with up to
    max_results articles, keeping only those that list keywords.
    '''
    pubmed = PubMed(tool="PubMedSearcher", email="myemail@ccc.com")
    results = pubmed.query(search_term, max_results)

    # Each result is a PubMedArticle or PubMedBookArticle; convert it to a dict.
    articleList = [article.toDict() for article in results]

    # Collect one record per article that has a non-empty keyword list.
    articleInfo = []
    for article in articleList:
        # article['pubmed_id'] sometimes holds several newline-separated IDs;
        # the first one is the ID of the article itself.
        pubmedId = article['pubmed_id'].partition('\n')[0]

        # Skip records with no 'keywords' key or an empty keyword list.
        if article.get('keywords'):
            articleInfo.append({'pubmed_id': pubmedId,
                                'publication_date': article['publication_date'],
                                'authors': article['authors'],
                                'keywords': article['keywords']})

    # Show which fields the API returned (taken from the last article, if any).
    if articleList:
        print('available keys from pubmed API: ' + str(articleList[-1].keys()))

    return pd.json_normalize(articleInfo)
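
The keyword filter inside `pubmed_searcher` can be checked offline, without calling the API. The sketch below pulls that step into a hypothetical helper, `extract_keyword_records` (not part of `pymed`), and runs it on hand-made dictionaries shaped like the output of `article.toDict()`. The sample IDs and keywords are made up for illustration.

```python
def extract_keyword_records(article_dicts):
    """Keep only articles with a non-empty keyword list (hypothetical helper)."""
    records = []
    for article in article_dicts:
        # Keep the first ID if several are joined by newlines.
        pubmed_id = article['pubmed_id'].partition('\n')[0]
        if article.get('keywords'):
            records.append({'pubmed_id': pubmed_id,
                            'keywords': article['keywords']})
    return records


# Made-up sample input, not real PubMed records.
sample = [
    {'pubmed_id': '111\n222\n333', 'keywords': ['Imidacloprid', 'Honey bee']},
    {'pubmed_id': '444', 'keywords': []},   # empty keyword list: dropped
    {'pubmed_id': '555'},                   # no 'keywords' key: dropped
]
records = extract_keyword_records(sample)
print(records)
# [{'pubmed_id': '111', 'keywords': ['Imidacloprid', 'Honey bee']}]
```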


In [3]:
# Example output
df = pubmed_searcher(search_term, max_results)
df.head()

available keys from pubmed API: dict_keys(['pubmed_id', 'title', 'abstract', 'keywords', 'journal', 'publication_date', 'authors', 'methods', 'conclusions', 'results', 'copyrights', 'doi', 'xml'])


| | pubmed_id | publication_date | authors | keywords |
|---|---|---|---|---|
| 0 | 36149570 | 2022-09-24 | [{'lastname': 'Zhou', 'firstname': 'Hong-Xia',... | [Floral nectar, Neonicotinoid, Pollinators, Sa... |
| 1 | 36144866 | 2022-09-24 | [{'lastname': 'Lu', 'firstname': 'Xingxing', '... | [Flupyrimin derivatives, low bee-toxicity, mol... |
| 2 | 36140100 | 2022-09-24 | [{'lastname': 'Jiao', 'firstname': 'Shasha', '... | [broad-specific mAb, immunochromatography, neo... |
| 3 | 36127060 | 2022-09-21 | [{'lastname': 'Zhang', 'firstname': 'Bai-Zhong... | [Imidacloprid resistance, Sitobion miscanthi, ... |
| 4 | 36127049 | 2022-09-21 | [{'lastname': 'Mezei', 'firstname': 'Imre', 'i... | [Green peach aphid, Insecticide and neonicotin... |


The keywords mix upper- and lower-case terms, so a simple pre-processing step lowercases every keyword.

In [4]:
df['keywords'] = df['keywords'].apply(lambda x: [word.lower() if word is not None else word for word in x])
# df['keywords'] = df['keywords'].apply(lambda x: [word.split(' ') if word is not None else word for word in x])
# df['keywords'] = df['keywords'].apply(lambda x: sum([word if word is not None else [] for word in x], []))
# df['keywords'] = df['keywords'].apply(lambda x: [word if word is not '' else word for word in x])
df.head()

| | pubmed_id | publication_date | authors | keywords |
|---|---|---|---|---|
| 0 | 36149570 | 2022-09-24 | [{'lastname': 'Zhou', 'firstname': 'Hong-Xia',... | [floral nectar, neonicotinoid, pollinators, sa... |
| 1 | 36144866 | 2022-09-24 | [{'lastname': 'Lu', 'firstname': 'Xingxing', '... | [flupyrimin derivatives, low bee-toxicity, mol... |
| 2 | 36140100 | 2022-09-24 | [{'lastname': 'Jiao', 'firstname': 'Shasha', '... | [broad-specific mab, immunochromatography, neo... |
| 3 | 36127060 | 2022-09-21 | [{'lastname': 'Zhang', 'firstname': 'Bai-Zhong... | [imidacloprid resistance, sitobion miscanthi, ... |
| 4 | 36127049 | 2022-09-21 | [{'lastname': 'Mezei', 'firstname': 'Imre', 'i... | [green peach aphid, insecticide and neonicotin... |
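
Lowercasing alone leaves `None` entries and stray whitespace in place. Here is a minimal sketch of a slightly stricter cleaner, assuming you also want to drop `None` and empty strings. The `clean_keywords` name and the sample list are illustrative and not part of the original notebook.

```python
def clean_keywords(keywords):
    """Lowercase and strip each keyword; drop None and empty entries."""
    cleaned = []
    for word in keywords:
        if word is None:
            continue
        word = word.strip().lower()
        if word:
            cleaned.append(word)
    return cleaned


sample = ['Floral nectar', None, ' Neonicotinoid ', '', 'Pollinators']
print(clean_keywords(sample))
# ['floral nectar', 'neonicotinoid', 'pollinators']
```

On the DataFrame this could be applied with `df['keywords'].apply(clean_keywords)`.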


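Build a baseline word2vec model with the keyword lists as the `sentences`: each article's keyword list is one sentence, and each keyword phrase is one token. By default (`sg=0`), gensim trains a CBOW model: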

In [5]:
cbow_model = Word2Vec(sentences=df['keywords'], vector_size=100, window=5, min_count=1, workers=4)
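
To make the `window` argument concrete: CBOW predicts each token from the tokens around it, up to `window` positions on either side. The sketch below lists those (context, target) pairs for one keyword list. It is a simplification that ignores gensim internals such as random window shrinking and subsampling, and `cbow_examples` is a hypothetical helper, not a gensim function.

```python
def cbow_examples(sentence, window):
    """Return (context_tokens, target_token) pairs for one sentence."""
    examples = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples


keywords = ['floral nectar', 'neonicotinoid', 'pollinators']
for context, target in cbow_examples(keywords, window=5):
    print(context, '->', target)
# ['neonicotinoid', 'pollinators'] -> floral nectar
# ['floral nectar', 'pollinators'] -> neonicotinoid
# ['floral nectar', 'neonicotinoid'] -> pollinators
```

For a list of six or fewer keywords and `window=5`, every other keyword of the article falls inside the context, so the order of the keywords matters little.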

In [6]:
# print out number of unique keywords in this corpus
len(cbow_model.wv)

5746

In [7]:
# print the 11 most frequent keywords (indices 0-10) and their indices in the vocabulary
[(key, value) for key, value in cbow_model.wv.key_to_index.items() if value < 11]

[('imidacloprid', 0),
 ('neonicotinoids', 1),
 ('neonicotinoid', 2),
 ('thiamethoxam', 3),
 ('pesticides', 4),
 ('acetamiprid', 5),
 ('clothianidin', 6),
 ('insecticide', 7),
 ('pesticide', 8),
 ('oxidative stress', 9),
 ('risk assessment', 10)]
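
By default gensim sorts the vocabulary by descending frequency (`sorted_vocab=1`), so index 0 is the most common keyword. The sketch below reproduces that kind of ordering with `collections.Counter` on a made-up toy corpus. How ties between equally frequent keywords are broken may differ from gensim.

```python
from collections import Counter

# Made-up toy corpus: one keyword list per article.
toy_corpus = [
    ['imidacloprid', 'neonicotinoids', 'honey bee'],
    ['imidacloprid', 'thiamethoxam'],
    ['imidacloprid', 'neonicotinoids'],
]

counts = Counter(word for keywords in toy_corpus for word in keywords)
# most_common() orders by descending count, like a frequency-sorted vocabulary.
key_to_index = {word: i for i, (word, _) in enumerate(counts.most_common())}
print(key_to_index['imidacloprid'], key_to_index['neonicotinoids'])
# 0 1
```

In gensim 4 the actual count of a keyword should be available via `cbow_model.wv.get_vecattr('imidacloprid', 'count')`.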

In [8]:
vector = cbow_model.wv['imidacloprid']  # get numpy vector of a word
vector

array([-5.40593034e-03,  1.15888966e-02,  8.43298808e-03,  1.31010190e-02,
       -4.03585657e-03, -2.18122415e-02,  1.20129082e-02,  3.46110687e-02,
       -1.87167116e-02, -9.20154154e-03,  3.12844943e-03, -2.18708999e-02,
       -7.95106310e-03,  1.29893012e-02,  1.78413186e-03, -1.42300595e-02,
        1.24159968e-02, -5.62345563e-03, -1.03669940e-02, -2.89037786e-02,
        1.36500588e-02,  6.95272116e-03,  1.91477295e-02, -3.36500071e-03,
        9.90599673e-03, -9.25304368e-04, -5.56671061e-03,  4.97595686e-03,
       -2.01655030e-02, -5.28744538e-04,  5.34750288e-03,  1.04849823e-04,
        9.88434535e-03, -2.45723110e-02, -8.71593598e-03,  5.79477614e-03,
        1.12880385e-02, -1.13384482e-02, -2.05988670e-03, -1.97763313e-02,
       -7.68292416e-03, -8.54059402e-03, -1.87690835e-02, -1.27834384e-03,
        7.36738276e-03, -7.90740084e-03, -1.90359708e-02,  9.00849141e-03,
        1.04246465e-02,  1.78031195e-02,  9.97503157e-05, -6.99138548e-03,
       -8.93480144e-03, -

`neonicotinoids` comes out as the keyword most similar to `imidacloprid`, closely followed by the singular `neonicotinoid`. This makes sense, since imidacloprid is a compound in the neonicotinoid family.

In [9]:
sims = cbow_model.wv.most_similar('imidacloprid', topn=10)  # get other similar words
sims

[('neonicotinoids', 0.7829824686050415),
 ('neonicotinoid', 0.7474757432937622),
 ('pesticides', 0.6899251937866211),
 ('insecticides', 0.6350824236869812),
 ('insecticide', 0.6348559260368347),
 ('clothianidin', 0.6332234740257263),
 ('pesticide', 0.6209536194801331),
 ('neonicotinoid insecticides', 0.6012749671936035),
 ('acetamiprid', 0.5999992489814758),
 ('oxidative stress', 0.5961906313896179)]
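
The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that measure in plain Python, using made-up 3-dimensional vectors rather than the model's 100-dimensional ones:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only.
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # same direction as u: similarity close to 1
w = [0.0, 0.0, 3.0]   # orthogonal to u: similarity 0

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

With the trained model, `cbow_model.wv.similarity('imidacloprid', 'neonicotinoids')` should return the same value as the first entry of `sims`.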

For comparison, train a skip-gram model with the same hyperparameters by setting `sg=1`:

In [10]:
skip_gram_model = Word2Vec(sentences=df['keywords'], vector_size=100, window=5, min_count=1, workers=4, sg=1)
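
In skip-gram mode (`sg=1`) the prediction is reversed relative to CBOW: each keyword is used to predict its neighbors one at a time. Below is a simplified sketch of the resulting (center, context) training pairs. It ignores gensim internals such as window shrinking and negative sampling, and `skip_gram_pairs` is a hypothetical helper.

```python
def skip_gram_pairs(sentence, window):
    """Return (center_token, context_token) pairs for one sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


keywords = ['imidacloprid', 'honey bee', 'risk assessment']
pairs = skip_gram_pairs(keywords, window=5)
print(len(pairs))
# 6
print(pairs[0])
# ('imidacloprid', 'honey bee')
```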

In [11]:
vector = skip_gram_model.wv['imidacloprid']  # get numpy vector of a word
vector

array([-0.05299078,  0.07829338,  0.07447132,  0.02108439,  0.00597955,
       -0.06380751,  0.0301357 ,  0.14573403, -0.09605741, -0.02828232,
       -0.02943823, -0.12244009, -0.00080855,  0.04335691,  0.05463602,
       -0.0572549 ,  0.06403438, -0.02601936,  0.0023799 , -0.13929603,
        0.05880679,  0.02052441,  0.10127067, -0.04167465,  0.02516998,
        0.00175928, -0.05085333, -0.01591752, -0.05303929, -0.00647691,
        0.05200607,  0.00078895,  0.00399932, -0.09744287, -0.00562432,
        0.02841873,  0.01386285, -0.01645965,  0.00830765, -0.095583  ,
        0.00058756, -0.07347596, -0.06742942,  0.00739888,  0.02853376,
       -0.04295408, -0.09183193,  0.00756953,  0.03483294,  0.04236066,
        0.04897863, -0.05266525, -0.05966743, -0.02183077, -0.02638467,
        0.04107733,  0.05603094, -0.01519084, -0.05776702,  0.01266443,
        0.00205501,  0.01886354,  0.00732932,  0.00023657, -0.05240031,
        0.08751692,  0.04117076,  0.06144652, -0.0452157 ,  0.08

In [12]:
sims = skip_gram_model.wv.most_similar('imidacloprid', topn=10)  # get other similar words
sims

[('neonicotinoids', 0.9888986349105835),
 ('neonicotinoid', 0.9860737919807434),
 ('pesticides', 0.9816869497299194),
 ('pesticide', 0.9788960218429565),
 ('oxidative stress', 0.978755533695221),
 ('insecticide', 0.9768159985542297),
 ('insecticides', 0.9750504493713379),
 ('neonicotinoid insecticides', 0.974653422832489),
 ('clothianidin', 0.9737851619720459),
 ('thiamethoxam', 0.9717853665351868)]
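
One quick way to compare the two models is to check how much their top-10 neighbor lists overlap. The sketch below hard-codes the keywords from the two `most_similar` outputs above and computes the shared set and the Jaccard index. With the live models you would build the lists from `sims` instead.

```python
# Top-10 neighbors of 'imidacloprid', copied from the outputs above.
cbow_top = ['neonicotinoids', 'neonicotinoid', 'pesticides', 'insecticides',
            'insecticide', 'clothianidin', 'pesticide',
            'neonicotinoid insecticides', 'acetamiprid', 'oxidative stress']
skip_gram_top = ['neonicotinoids', 'neonicotinoid', 'pesticides', 'pesticide',
                 'oxidative stress', 'insecticide', 'insecticides',
                 'neonicotinoid insecticides', 'clothianidin', 'thiamethoxam']

shared = set(cbow_top) & set(skip_gram_top)
only_cbow = set(cbow_top) - shared
only_skip_gram = set(skip_gram_top) - shared
jaccard = len(shared) / len(set(cbow_top) | set(skip_gram_top))

print(len(shared), only_cbow, only_skip_gram, round(jaccard, 3))
# 9 {'acetamiprid'} {'thiamethoxam'} 0.818
```

The two models agree on nine of the ten neighbors. The skip-gram similarities are all above 0.97, so its scores separate the neighbors far less sharply than the CBOW scores, which range from roughly 0.60 to 0.78.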