## word2vec in code
Packages used:
1. `gensim.word2vec`: https://radimrehurek.com/gensim/models/word2vec.html The Gensim documentation is very nicely written.

2. `pymed`: https://github.com/gijswobben/pymed

I'd like to build a word2vec model on the keywords of PubMed articles, using the first 5000 results for the query `Neonicotinoids`.

In [129]:
import pandas as pd
from pymed import PubMed
import numpy as np
from gensim.models import Word2Vec
import inspect
# print(inspect.getsource(Word2Vec))

### Pubmed API call for querying keywords of neonicotinoids associated articles

The code below is adapted from https://stackoverflow.com/questions/72006411/pubmed-fetch-article-details-to-a-daframe

In [67]:
search_term = 'Neonicotinoids'
max_results = 5000

def pubmed_searcher(search_term, max_results):
    '''
    Query PubMed for up to max_results articles matching search_term and
    return a DataFrame of the articles that have at least one keyword.
    '''
    pubmed = PubMed(tool="PubMedSearcher", email="myemail@ccc.com")

    results = pubmed.query(search_term, max_results=max_results)
    articleList = []
    articleInfo = []

    for article in results:
        # Each result is either a PubMedBookArticle or a PubMedArticle;
        # convert it to a dictionary with its toDict() method.
        articleDict = article.toDict()
        articleList.append(articleDict)

    # Build a list of dict records holding the details we keep for each article
    for article in articleList:

        # article['pubmed_id'] sometimes holds several IDs separated by newlines;
        # the first one is the article's own PubMed ID.
        pubmedId = article['pubmed_id'].partition('\n')[0]

        # Keep only articles that come with a non-empty keyword list
        if article.get('keywords'):
            articleInfo.append({'pubmed_id': pubmedId,
                                'publication_date': article['publication_date'],
                                'authors': article['authors'],
                                'keywords': article['keywords']})

    # Show which fields the API returned (guarded in case the query found nothing)
    if articleList:
        print('available keys from pubmed API: ' + str(articleList[-1].keys()))

    df = pd.json_normalize(articleInfo)

    return df


In [83]:
# Example output
df = pubmed_searcher(search_term, max_results)
df.head()

available keys from pubmed API: dict_keys(['pubmed_id', 'title', 'abstract', 'keywords', 'journal', 'publication_date', 'authors', 'methods', 'conclusions', 'results', 'copyrights', 'doi', 'xml'])


Unnamed: 0,pubmed_id,publication_date,authors,keywords
0,36149570,2022-09-24,"[{'lastname': 'Zhou', 'firstname': 'Hong-Xia',...","[Floral nectar, Neonicotinoid, Pollinators, Sa..."
1,36144866,2022-09-24,"[{'lastname': 'Lu', 'firstname': 'Xingxing', '...","[Flupyrimin derivatives, low bee-toxicity, mol..."
2,36140100,2022-09-24,"[{'lastname': 'Jiao', 'firstname': 'Shasha', '...","[broad-specific mAb, immunochromatography, neo..."
3,36127060,2022-09-21,"[{'lastname': 'Zhang', 'firstname': 'Bai-Zhong...","[Imidacloprid resistance, Sitobion miscanthi, ..."
4,36127049,2022-09-21,"[{'lastname': 'Mezei', 'firstname': 'Imre', 'i...","[Green peach aphid, Insecticide and neonicotin..."


The keywords appear in a mix of upper and lower case, so as a simple pre-processing step we lowercase every keyword.

In [90]:
df['keywords'] = df['keywords'].apply(lambda x: [word.lower() if word is not None else word for word in x])
df.head()

Unnamed: 0,pubmed_id,publication_date,authors,keywords
0,36149570,2022-09-24,"[{'lastname': 'Zhou', 'firstname': 'Hong-Xia',...","[floral nectar, neonicotinoid, pollinators, sa..."
1,36144866,2022-09-24,"[{'lastname': 'Lu', 'firstname': 'Xingxing', '...","[flupyrimin derivatives, low bee-toxicity, mol..."
2,36140100,2022-09-24,"[{'lastname': 'Jiao', 'firstname': 'Shasha', '...","[broad-specific mab, immunochromatography, neo..."
3,36127060,2022-09-21,"[{'lastname': 'Zhang', 'firstname': 'Bai-Zhong...","[imidacloprid resistance, sitobion miscanthi, ..."
4,36127049,2022-09-21,"[{'lastname': 'Mezei', 'firstname': 'Imre', 'i...","[green peach aphid, insecticide and neonicotin..."
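The lowercasing step can also be written as a small named function, which is easier to extend than a lambda. The sketch below is only an illustration: `normalize_keywords` is a hypothetical helper that is not part of the original notebook, and unlike the lambda above it assumes that `None` entries and surrounding whitespace should be dropped rather than kept.

```python
def normalize_keywords(keywords):
    """Lowercase and strip each keyword, skipping None and empty entries.

    Hypothetical helper for illustration; not from the original notebook.
    """
    cleaned = []
    for keyword in keywords:
        if keyword is None:
            continue  # assumption: missing keywords carry no information
        keyword = keyword.strip().lower()
        if keyword:
            cleaned.append(keyword)
    return cleaned


example = ["Floral nectar", " Neonicotinoid ", None, ""]
print(normalize_keywords(example))  # ['floral nectar', 'neonicotinoid']
```

If this behaviour is what you want, it could be applied with `df['keywords'].apply(normalize_keywords)`.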


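Build a baseline word2vec model, passing each article's keyword list as one of the `sentences`, so every keyword (including multi-word ones) is treated as a single token. With the default `sg=0`, Gensim trains a CBOW model: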

In [120]:
cbow_model = Word2Vec(sentences=df['keywords'], vector_size=100, window=5, min_count=1, workers=4)

In [121]:
# print out number of unique keywords in this corpus
len(cbow_model.wv)

5746

In [122]:
# print the 11 most frequent keywords (indices 0-10) and their positions in the vocabulary;
# by default Gensim sorts the vocabulary by descending frequency
[(key, value) for key, value in cbow_model.wv.key_to_index.items() if value < 11]

[('imidacloprid', 0),
 ('neonicotinoids', 1),
 ('neonicotinoid', 2),
 ('thiamethoxam', 3),
 ('pesticides', 4),
 ('acetamiprid', 5),
 ('clothianidin', 6),
 ('insecticide', 7),
 ('pesticide', 8),
 ('oxidative stress', 9),
 ('risk assessment', 10)]
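Because Gensim orders the vocabulary by descending frequency by default, low indices in `key_to_index` correspond to the most common keywords. One way to cross-check that ordering is to count keywords directly. The sketch below does this on made-up lists (`toy_keywords` is an assumption for illustration, not the PubMed data).

```python
from collections import Counter

# Made-up keyword lists standing in for df['keywords'].
toy_keywords = [
    ["imidacloprid", "thiamethoxam", "honey bee"],
    ["imidacloprid", "thiamethoxam"],
    ["imidacloprid"],
]

# Flatten the per-article lists and count how often each keyword occurs.
counts = Counter(keyword for keywords in toy_keywords for keyword in keywords)

for keyword, count in counts.most_common(3):
    print(keyword, count)
# imidacloprid 3
# thiamethoxam 2
# honey bee 1
```

On the real data, `Counter` and Gensim may order keywords with equal counts differently, so only the frequency ranking itself is expected to agree.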

In [123]:
vector = cbow_model.wv['imidacloprid']  # get numpy vector of a word
vector

array([-5.37396967e-03,  1.15496507e-02,  8.31562281e-03,  1.31636309e-02,
       -4.04892303e-03, -2.18364391e-02,  1.21429665e-02,  3.45299169e-02,
       -1.87110025e-02, -8.97512678e-03,  3.01831122e-03, -2.21252199e-02,
       -7.83312786e-03,  1.29188178e-02,  1.86460989e-03, -1.42153082e-02,
        1.23566343e-02, -5.74537599e-03, -1.03757735e-02, -2.89883446e-02,
        1.38254678e-02,  7.03370571e-03,  1.92135628e-02, -3.27003119e-03,
        9.90342069e-03, -9.10659030e-04, -5.58727281e-03,  4.98251989e-03,
       -2.04840060e-02, -5.40945039e-04,  5.26142027e-03,  1.53016634e-04,
        9.84102208e-03, -2.45484244e-02, -8.71882495e-03,  5.75637119e-03,
        1.13511411e-02, -1.12676071e-02, -1.99815654e-03, -1.98061727e-02,
       -7.65244802e-03, -8.65007006e-03, -1.88368708e-02, -1.25759922e-03,
        7.34956563e-03, -7.94593710e-03, -1.89446546e-02,  9.05300584e-03,
        1.03714652e-02,  1.78687852e-02, -1.29648324e-05, -6.96580065e-03,
       -8.87754653e-03, -

In the similarity query below, `neonicotinoids` and `neonicotinoid` come out as the keywords most similar to `imidacloprid`, which makes sense because imidacloprid is a compound in the neonicotinoid family.

In [124]:
sims = cbow_model.wv.most_similar('imidacloprid', topn=10)  # get other similar words
sims

[('neonicotinoids', 0.7831562757492065),
 ('neonicotinoid', 0.7481412291526794),
 ('pesticides', 0.6898139119148254),
 ('insecticide', 0.6359738111495972),
 ('insecticides', 0.6348689198493958),
 ('clothianidin', 0.6328756213188171),
 ('pesticide', 0.6214989423751831),
 ('neonicotinoid insecticides', 0.6010925769805908),
 ('acetamiprid', 0.6006953120231628),
 ('oxidative stress', 0.5956632494926453)]
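The scores returned by `most_similar` for a single query word are cosine similarities between word vectors. As a minimal sketch of what that number means, the function below computes cosine similarity with NumPy (`cosine_similarity` is a helper name introduced here for illustration).

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Applied to the model, `cosine_similarity(cbow_model.wv['imidacloprid'], cbow_model.wv['neonicotinoids'])` should agree, up to floating-point rounding, with `cbow_model.wv.similarity('imidacloprid', 'neonicotinoids')`.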

For comparison, train a skip-gram model on the same keywords by setting `sg=1`:

In [125]:
skip_gram_model = Word2Vec(sentences=df['keywords'], vector_size=100, window=5, min_count=1, workers=4, sg=1)

In [127]:
vector = skip_gram_model.wv['imidacloprid']  # get numpy vector of a word
vector

array([-0.05417468,  0.07851605,  0.07388604,  0.0212368 ,  0.00612038,
       -0.06378842,  0.03000311,  0.14596061, -0.09644782, -0.02864049,
       -0.02910103, -0.1225659 , -0.0005069 ,  0.04263362,  0.05482033,
       -0.05720612,  0.06408644, -0.02621115,  0.00274776, -0.13921742,
        0.05886362,  0.02096497,  0.10082798, -0.04196379,  0.02531511,
        0.00195682, -0.05104769, -0.01590193, -0.05362089, -0.00608162,
        0.05183888,  0.00088851,  0.00391552, -0.09756049, -0.00478361,
        0.02779034,  0.01438181, -0.01623379,  0.00868666, -0.09539924,
        0.00047635, -0.07378464, -0.06715459,  0.0074744 ,  0.02881611,
       -0.04243603, -0.09174044,  0.00770856,  0.0350023 ,  0.04214233,
        0.0482609 , -0.05280864, -0.0599121 , -0.02204715, -0.02581426,
        0.04092196,  0.05552968, -0.01496538, -0.05778163,  0.01301325,
        0.00144699,  0.01876798,  0.00766746,  0.00017886, -0.05237012,
        0.08727906,  0.04093103,  0.06111833, -0.0450944 ,  0.08

In [128]:
sims = skip_gram_model.wv.most_similar('imidacloprid', topn=10)  # get other similar words
sims

[('neonicotinoids', 0.9889056086540222),
 ('neonicotinoid', 0.9860917925834656),
 ('pesticides', 0.9817160964012146),
 ('pesticide', 0.9788943529129028),
 ('oxidative stress', 0.9787634015083313),
 ('insecticide', 0.9768016934394836),
 ('insecticides', 0.9750645160675049),
 ('neonicotinoid insecticides', 0.9746518731117249),
 ('clothianidin', 0.9737898707389832),
 ('thiamethoxam', 0.9717805981636047)]
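The skip-gram scores are all above 0.97, while the CBOW scores range from about 0.60 to 0.78, so the absolute values are not directly comparable between the two models. Comparing which keywords appear in each top-10 list is more informative. The sketch below does that using the two lists copied from the outputs above (the variable names are introduced here for illustration).

```python
# Top-10 neighbours of 'imidacloprid', copied from the two outputs above.
cbow_top10 = [
    "neonicotinoids", "neonicotinoid", "pesticides", "insecticide",
    "insecticides", "clothianidin", "pesticide",
    "neonicotinoid insecticides", "acetamiprid", "oxidative stress",
]
skip_gram_top10 = [
    "neonicotinoids", "neonicotinoid", "pesticides", "pesticide",
    "oxidative stress", "insecticide", "insecticides",
    "neonicotinoid insecticides", "clothianidin", "thiamethoxam",
]

# Keywords found by both models, and those unique to each one.
shared = [w for w in cbow_top10 if w in skip_gram_top10]
only_cbow = [w for w in cbow_top10 if w not in skip_gram_top10]
only_skip_gram = [w for w in skip_gram_top10 if w not in cbow_top10]

print(len(shared))      # 9
print(only_cbow)        # ['acetamiprid']
print(only_skip_gram)   # ['thiamethoxam']
```

Nine of the ten neighbours are shared; the two models differ only in whether `acetamiprid` or `thiamethoxam` makes the list, and both are themselves neonicotinoid compounds.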