## word2vec in code
Packages used:
1. `gensim.models.word2vec`: https://radimrehurek.com/gensim/models/word2vec.html (the Gensim documentation is clear and well written).

2. `pymed`: https://github.com/gijswobben/pymed

### The code below is adapted from https://stackoverflow.com/questions/72006411/pubmed-fetch-article-details-to-a-daframe

In [40]:
import pandas as pd
from pymed import PubMed
import numpy as np

search_term = 'Neonicotinoids'
max_results = 5000

def pubmed_searcher(search_term, max_results):
    """Query PubMed and return a DataFrame of the articles that list keywords."""
    # pymed asks for a tool name and a contact email; replace the placeholder email.
    pubmed = PubMed(tool="PubMedSearcher", email="myemail@ccc.com")

    results = pubmed.query(search_term, max_results=max_results)

    # Each result is a PubMedArticle or a PubMedBookArticle; convert it to a plain dict.
    articleList = [article.toDict() for article in results]

    # Build one record per article that has a non-empty keyword list.
    articleInfo = []
    for article in articleList:
        # 'pubmed_id' sometimes holds several IDs separated by newlines;
        # the first one is the article's own ID.
        pubmedId = article['pubmed_id'].partition('\n')[0]

        if article.get('keywords'):
            articleInfo.append({'pubmed_id': pubmedId,
                                'publication_date': article['publication_date'],
                                'authors': article['authors'],
                                'keywords': article['keywords']})

    # Show which fields the PubMed API returned (taken from the last article).
    if articleList:
        print(articleList[-1].keys())
    return pd.json_normalize(articleInfo)


df = pubmed_searcher(search_term, max_results)


dict_keys(['pubmed_id', 'title', 'abstract', 'keywords', 'journal', 'publication_date', 'authors', 'methods', 'conclusions', 'results', 'copyrights', 'doi', 'xml'])


In [42]:
df.head()

| | pubmed_id | publication_date | authors | keywords |
|---|---|---|---|---|
| 0 | 36149570 | 2022-09-24 | [{'lastname': 'Zhou', 'firstname': 'Hong-Xia',... | [Floral nectar, Neonicotinoid, Pollinators, Sa... |
| 1 | 36144866 | 2022-09-24 | [{'lastname': 'Lu', 'firstname': 'Xingxing', '... | [Flupyrimin derivatives, low bee-toxicity, mol... |
| 2 | 36140100 | 2022-09-24 | [{'lastname': 'Jiao', 'firstname': 'Shasha', '... | [broad-specific mAb, immunochromatography, neo... |
| 3 | 36127060 | 2022-09-21 | [{'lastname': 'Zhang', 'firstname': 'Bai-Zhong... | [Imidacloprid resistance, Sitobion miscanthi, ... |
| 4 | 36127049 | 2022-09-21 | [{'lastname': 'Mezei', 'firstname': 'Imre', 'i... | [Green peach aphid, Insecticide and neonicotin... |


In [7]:
df['authors'].iloc[0][0].keys()

dict_keys(['lastname', 'firstname', 'initials', 'affiliation'])
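
Each entry in `authors` is a list of dicts with the four keys shown above. The following is a minimal sketch of flattening one such list into display names, assuming that individual fields may be missing or `None`. The helper name `author_names` and the sample records are hypothetical and not part of pymed.

```python
def author_names(authors):
    """Return 'Firstname Lastname' strings for a list of pymed-style author dicts."""
    names = []
    for author in authors or []:
        # Either field may be None or absent, so fall back to an empty string.
        first = author.get('firstname') or ''
        last = author.get('lastname') or ''
        full = f"{first} {last}".strip()
        if full:
            names.append(full)
    return names


# Hypothetical sample in the same shape as df['authors'].iloc[0]
sample_authors = [
    {'lastname': 'Doe', 'firstname': 'Jane', 'initials': 'J', 'affiliation': None},
    {'lastname': 'Roe', 'firstname': None, 'initials': None, 'affiliation': None},
]
print(author_names(sample_authors))
```

Applied to the whole column, this could look like `df['authors'].apply(author_names)`.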

In [43]:
len(df)

2462

In [45]:
from gensim.models import Word2Vec

In [46]:
model = Word2Vec(sentences=df['keywords'], vector_size=100, window=5, min_count=1, workers=4)
model

<gensim.models.word2vec.Word2Vec at 0x7f9c8657f1d0>
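
Word2Vec uses the tokens exactly as given, so each keyword string (including multi-word phrases) is one token, and differently cased spellings of the same keyword get separate vectors. One possible preprocessing step, sketched here as an assumption and not as part of the original notebook, is to lowercase and trim the keywords before training. The helper name `normalize_keywords` and the sample list are hypothetical.

```python
def normalize_keywords(keywords):
    """Lowercase and strip each keyword; drop empty or non-string entries."""
    cleaned = []
    for kw in keywords:
        if isinstance(kw, str):
            kw = kw.strip().lower()
            if kw:
                cleaned.append(kw)
    return cleaned


# Hypothetical keyword list with mixed case, padding, and missing values
sample = ['Neonicotinoid', ' neonicotinoid ', None, 'Floral nectar', '']
print(normalize_keywords(sample))
```

The cleaned lists could then be passed as `sentences=df['keywords'].apply(normalize_keywords).tolist()`.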

In [47]:
vector = model.wv['imidacloprid']  # get numpy vector of a word

In [48]:
vector

array([-0.01010708,  0.01351375,  0.00074349, -0.00057339,  0.00299313,
       -0.01134256,  0.00504032,  0.01545781,  0.00340838, -0.0085745 ,
        0.00796269, -0.00192529,  0.00457482, -0.00505163,  0.01059754,
       -0.00652867,  0.01281002, -0.00975233, -0.00779032, -0.00096461,
        0.00579988, -0.0010256 ,  0.01431772,  0.00701831, -0.01109368,
        0.0032349 ,  0.00274159,  0.00297145, -0.0008881 ,  0.00190335,
        0.00524809, -0.00367374, -0.00656906, -0.00827771,  0.00251864,
        0.00961471,  0.01104125, -0.00836937, -0.01288685,  0.00510887,
        0.0042891 ,  0.00030601,  0.00494902, -0.00254613,  0.01097928,
       -0.0001704 , -0.00224076, -0.00410447, -0.00095617,  0.00086192,
        0.00582085, -0.00158261, -0.01212031, -0.01195711, -0.00917481,
        0.00231709,  0.00453065,  0.01020761,  0.00395687, -0.00300088,
        0.00176286,  0.00472081,  0.00715456,  0.00206285,  0.00320365,
       -0.00095005, -0.00018021,  0.01084903, -0.00635162, -0.00


In [50]:
sims = model.wv.most_similar('imidacloprid', topn=10)  # get other similar words
sims

[('Chlorpyrifos-E', 0.38085365295410156),
 ('Bound residue', 0.3651526868343353),
 ('Pesticides', 0.3536563813686371),
 ('Oxidative stress', 0.3431783616542816),
 ('Detection', 0.3222809135913849),
 ('Neonicotinoids', 0.3202517330646515),
 ('irradiation synthesis', 0.3180220425128937),
 ('high mobility group box protein 1', 0.3162441551685333),
 ('Molecular ecotoxicology', 0.3143049478530884),
 ('Colias philodice', 0.3126586377620697)]
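
The scores returned by `most_similar` are cosine similarities between the query word's vector and the other vectors in the vocabulary. The ranking can be sketched in plain Python on toy data; the two-dimensional vectors and the function names below are illustrative assumptions, not Gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-dimensional "embeddings" (hypothetical values)
toy_vectors = {
    'a': [1.0, 0.0],
    'b': [1.0, 1.0],
    'c': [0.0, 1.0],
    'd': [-1.0, 0.0],
}
top_two = most_similar('a', toy_vectors, topn=2)
print(top_two)
```

Cosine similarity ranges from -1 to 1, so the values around 0.3 in the output above indicate only weak similarity between `'imidacloprid'` and its nearest neighbours in this model.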