<h1><center> MANAGEMENT AND ANALYSIS OF PHYSICS DATASET (mod.B)  </center></h1>

<h1><center> Final project :  Analysis of Covid-19 papers  </center></h1>


<h2><center> Date : 23/09/2020 , University of Padua </center></h2>

<h3><center> Authors : Camilla Quaglia (1242830) , Edoardo Antonaci (1234431) and Walter Zigliotto  (1230665) </center></h3>

# Third part

### Get the embedding for the title of the papers

[Word embedding](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa) is a technique that represents words as numerical vectors. The main idea is that words with similar meanings, which tend to appear in similar contexts, are mapped to nearby points in the vector space, while words with different meanings are mapped far apart. This encoding also reduces the dimension of the vector needed to represent a word, compared with a sparse encoding such as one-hot. Several methods can be used for this purpose.

[Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa) is one of the most popular techniques for learning word embeddings with a shallow neural network. It can be trained with two architectures: Skip Gram and Continuous Bag of Words (CBOW).

The pre-trained model we load is [wiki.en.vec](https://fasttext.cc/docs/en/pretrained-vectors.html), a set of fastText vectors developed by [Bojanowski et al. (2016)](https://arxiv.org/abs/1607.04606). The model is based on the Skip Gram technique, and the result is distributed as a large dictionary in the format **key : vector**. The vector dimension is 300.
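A `.vec` file is plain text: the first line holds the vocabulary size and the vector dimension, and each following line holds a word followed by its components. The sketch below shows how such a file maps to a **key : vector** dictionary. It uses a tiny made-up sample (three words, four dimensions) instead of the real model, and `read_vec` is a hypothetical helper written for illustration.

```python
import io

# Hypothetical miniature file in the word2vec text format:
# header "<vocabulary size> <dimension>", then one word per line.
SAMPLE_VEC = """3 4
virus 0.1 0.2 0.3 0.4
infection 0.1 0.2 0.2 0.5
table -0.3 0.0 0.9 -0.1
"""

def read_vec(handle):
    """Parse the word2vec text format into a {word: [float, ...]} dictionary."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == n_words, "header and body disagree"
    return vectors

toy_vectors = read_vec(io.StringIO(SAMPLE_VEC))
print(len(toy_vectors), len(toy_vectors["virus"]))
```

In practice `KeyedVectors.load_word2vec_format` performs this parsing. Its `limit` argument reads only the first N vectors, which may shorten the loading time at the cost of a smaller vocabulary.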


##### Skip Gram neural network model


<img src="skip_gram.png" width="350">
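To make the Skip Gram idea concrete: the network is trained to predict the context words around each center word. The sketch below only builds the (center, context) training pairs for a symmetric window. The function name and the `window` parameter are illustrative assumptions, and real implementations add further steps (for example negative sampling, and character n-grams in fastText) that are omitted here.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (center, context) training pairs for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

example_tokens = "respiratory viral infections in children".split()
for pair in skip_gram_pairs(example_tokens, window=1):
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbours, so the five-word example produces eight pairs.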

The aim of the assignment is to:

* Load the pre-trained model.
* Create a DataFrame or a Bag composed of "paper-id" and "title-embedding".
* Compute the cosine similarity between each pair of papers and identify the pairs of papers with the highest cosine similarity score.

#### LIBRARIES

[Gensim](https://pypi.org/project/gensim/) is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) communities. In particular, [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) is used here to load the pre-trained word vectors.

In [1]:
from dask.distributed import Client
from dask import delayed

import time
import json
import os

import dask.dataframe as dd
import dask.bag as db

import re
import nltk
from collections import Counter
import numpy as np
import pandas as pd


# To load the model
import gensim
from gensim.models import KeyedVectors

#### LOAD THE MODEL

In [2]:
init = time.time()

trained_word_dict = KeyedVectors.load_word2vec_format('wiki.en.vec')

end = time.time()

print("Time needed to load the model: ", np.round((end - init)/60), " m.")

Time needed to load the model:  25.0  m.


Example of a lookup in the loaded model: the vector associated with the word "covid".

In [3]:
trained_word_dict['covid']

array([ 4.1562e-02,  4.4748e-01,  3.4775e-01,  2.8584e-01, -2.0439e-02,
        5.2691e-02, -2.1214e-01, -5.1576e-01, -1.3696e-02,  2.9796e-01,
        7.6661e-02, -1.6682e-01, -2.0146e-02,  3.9696e-02, -5.2283e-01,
        3.2844e-02,  3.1001e-01,  1.9401e-01, -1.4449e-01,  3.3345e-01,
        1.2913e-01,  6.9068e-01, -4.4541e-02, -6.1467e-02,  1.3640e-01,
        6.7444e-02, -1.6460e-01, -1.8028e-01,  2.4397e-01, -1.8775e-02,
       -4.1462e-01,  2.1488e-01,  4.2596e-02,  2.3271e-01, -3.7913e-01,
        7.2442e-01,  1.9139e-01,  2.9194e-01,  5.9075e-02,  2.1271e-01,
       -1.5846e-01, -2.9983e-01, -2.8076e-02, -1.9447e-02,  1.0548e-01,
        2.3515e-01,  1.4341e-01,  7.0788e-02,  6.1037e-02, -1.9155e-01,
       -1.7390e-02,  6.0412e-03, -5.1979e-04, -2.9971e-01,  1.9084e-01,
       -1.4681e-01,  3.1331e-02, -4.5124e-01,  1.9161e-01, -1.9114e-01,
        2.9304e-01,  2.4916e-01,  3.5672e-01, -2.3828e-01, -2.1532e-01,
       -2.8428e-03,  4.2796e-02, -2.7403e-01,  2.8116e-01, -3.22

#### CREATE A BAG COMPOSED BY: "PAPER-ID" AND "TITLE-EMBEDDING".

In [4]:
client = Client()
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:52423, Dashboard: http://127.0.0.1:8787/status | Workers: 4, Cores: 8, Memory: 8.46 GB |


Load the preprocessed data, stored as one-line JSON records.

In [5]:
filename = os.path.join('data', 'papers_in_one_line_json', '*.json')
lines = db.read_text(filename)
js = lines.map(json.loads).repartition(20) 

In [6]:
def flatten(rec):
    return {"paper_id": rec['paper_id'], "title": rec['metadata']['title']}

`map` applies the `flatten` function to each record in `js`, and `filter` discards the records with an empty title.

In [7]:
resume_data = js.map(flatten).filter(lambda rec: rec['title'] != '').compute()
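The same map-then-filter logic can be checked on a few in-memory records before running it on the full collection. The sketch below uses plain Python instead of a Dask bag, and the three records are made up to mimic the fields that `flatten` accesses.

```python
# Toy records mimicking the structure accessed by flatten(); the values are made up.
records = [
    {"paper_id": "a1", "metadata": {"title": "Proteomics of viruses"}},
    {"paper_id": "b2", "metadata": {"title": ""}},
    {"paper_id": "c3", "metadata": {"title": "Original Article"}},
]

def flatten_record(rec):
    """Keep only the paper id and the title of a record."""
    return {"paper_id": rec["paper_id"], "title": rec["metadata"]["title"]}

flat = map(flatten_record, records)                  # lazy, like Bag.map
kept = [rec for rec in flat if rec["title"] != ""]   # like Bag.filter + compute
print([rec["paper_id"] for rec in kept])
```

As with the bag, nothing is evaluated until the results are collected. In the notebook that happens at `.compute()`, which gathers every record into a local Python list.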

In [8]:
print("Resume data type is: ", type(resume_data), ".")

Resume data type is:  <class 'list'> .


In [9]:
def funct_to_vec(title):
    """Convert the title of a paper record into a list of word vectors."""

    # title selection
    title_text = title['title']

    # split the lower-cased title into words
    split_title = title_text.lower().split()

    # list of word vectors
    vec_words = []

    for word in split_title:
        # words missing from the model vocabulary raise KeyError and are skipped
        try:
            vec_words.append(trained_word_dict[word])
        except KeyError:
            pass

    # return a dictionary with the paper id, the title and the list of word vectors
    return {"paper_id": title["paper_id"], "title": title['title'], "vec_title": vec_words}
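`funct_to_vec` returns one vector per recognized word, so titles of different lengths yield lists of different sizes. A common alternative, which is not the approach used in this notebook, is to average the word vectors into a single fixed-size title embedding. The sketch below shows the idea with a hypothetical three-dimensional toy dictionary. `mean_title_vector` and `toy_word_dict` are names introduced only for illustration.

```python
toy_word_dict = {  # made-up 3-dimensional vectors, for illustration only
    "viral": [1.0, 0.0, 2.0],
    "infections": [3.0, 2.0, 0.0],
}

def mean_title_vector(title, word_dict):
    """Average the vectors of the known words; return None if no word is known."""
    known = [word_dict[w] for w in title.lower().split() if w in word_dict]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

print(mean_title_vector("Respiratory viral infections", toy_word_dict))
# → [2.0, 1.0, 1.0]
```

With a fixed-size embedding, two titles can be compared directly, whatever their number of words.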


Apply the function to each record and collect the results in a list.

In [10]:
dict_list = []

for res in resume_data:
    dict_list.append(funct_to_vec(res))
    

Convert the list into a Dask bag.

In [11]:
dict_bag = db.from_sequence(dict_list)

In [12]:
dict_bag.take(5)

({'paper_id': '000a0fc8bbef80410199e690191dc3076a290117',
  'title': 'PfSWIB, a potential chromatin regulator for var gene regulation and parasite development in Plasmodium falciparum',
  'vec_title': [array([ 0.11559   ,  0.30192   , -0.11465   ,  0.01001   , -0.032187  ,
          -0.10755   ,  0.060674  , -0.10477   ,  0.17488   ,  0.0081116 ,
          -0.02263   ,  0.065401  ,  0.1133    ,  0.054737  , -0.06209   ,
          -0.029822  , -0.16608   ,  0.12224   ,  0.045251  ,  0.2134    ,
           0.027965  , -0.031319  , -0.25392   , -0.20146   , -0.19688   ,
          -0.015251  , -0.27038   ,  0.10511   ,  0.074226  ,  0.01554   ,
          -0.014038  ,  0.16516   , -0.17375   , -0.016743  ,  0.013919  ,
           0.01119   , -0.12599   , -0.11975   ,  0.079578  , -0.037088  ,
          -0.071665  , -0.085153  , -0.1117    ,  0.020142  , -0.161     ,
           0.0019132 ,  0.13843   ,  0.15445   , -0.026397  , -0.014582  ,
           0.00060368, -0.19382   ,  0.11267   ,  0

#### COMPUTE THE COSINE SIMILARITY

Cosine similarity is the cosine of the angle between two vectors. If the cosine is equal to 1, the vectors have the same orientation (similar vectors); if it is equal to 0, the vectors are orthogonal (unrelated vectors); if it is equal to -1, the vectors have opposite orientations.
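For two vectors `a` and `b` of equal length, the cosine of the angle is the dot product divided by the product of the norms: cos θ = a·b / (‖a‖ ‖b‖). The sketch below is a minimal standard-library illustration of this formula on two-dimensional toy vectors. `cosine_of_angle` is a hypothetical helper, separate from the functions used in this notebook.

```python
import math

def cosine_of_angle(a, b):
    """cos(theta) = a.b / (|a| |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # undefined for a zero vector
    return dot / (norm_a * norm_b)

print(cosine_of_angle([1.0, 0.0], [2.0, 0.0]))   # same direction → 1.0
print(cosine_of_angle([1.0, 0.0], [0.0, 3.0]))   # orthogonal → 0.0
print(cosine_of_angle([1.0, 0.0], [-1.0, 0.0]))  # opposite direction → -1.0
```

The value depends only on the direction of the vectors and not on their length, which is why `[1.0, 0.0]` and `[2.0, 0.0]` score exactly 1.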

In [13]:
def cosine(word_vec1, word_vec2):

    # number of word vectors compared: the shorter title sets the length
    m = min(len(word_vec1['vec_title']), len(word_vec2['vec_title']))

    # norms computed over all the word vectors of each title
    word_vec1_norm = np.linalg.norm(word_vec1['vec_title'])
    word_vec2_norm = np.linalg.norm(word_vec2['vec_title'])

    # dot product of the first m word vectors of each title (flattened by np.vdot);
    # a title with no recognized word has zero norm and yields NaN
    cos_value = np.vdot(word_vec1['vec_title'][:m], word_vec2['vec_title'][:m])/(word_vec1_norm*word_vec2_norm)

    return {
        'ID 1': word_vec1['paper_id'],
        'ID 2': word_vec2['paper_id'],
        'Title 1': word_vec1['title'],
        'Title 2': word_vec2['title'],
        'Title 1 vec': word_vec1['vec_title'],
        'Title 2 vec': word_vec2['vec_title'],
        'Cosine': cos_value}

In [14]:
def cosine_similarity(dictionary):

    # list of result dictionaries, one per pair of papers
    cosine_res = []

    for i in range(len(dictionary)):
        # only j < i: a paper compared with itself gives 1,
        # and the pairs (i, j) and (j, i) give the same value
        for j in range(i):
            cosine_res.append(cosine(dictionary[i], dictionary[j]))

    return pd.DataFrame(cosine_res)
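Comparing every paper with every other paper produces n(n-1)/2 rows, so the size of the result grows quadratically with the number of titles. The sketch below checks this count with `itertools.combinations` on a few hypothetical identifiers. It enumerates the same unordered pairs as the double loop, although in a different order.

```python
from itertools import combinations

def count_pairs(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

paper_ids = ["p0", "p1", "p2", "p3"]  # hypothetical identifiers
pairs = list(combinations(paper_ids, 2))
print(len(pairs), count_pairs(len(paper_ids)))  # → 6 6
print(count_pairs(1000))                        # → 499500
```

Since each row also stores the word vectors of both titles, the memory needed by the resulting DataFrame can grow quickly as the number of papers increases.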

In [15]:
# COSINE SIMILARITY EVALUATION
cosine_df = cosine_similarity(dict_list)
cosine_df.head()



| | Cosine | ID 1 | ID 2 | Title 1 | Title 1 vec | Title 2 | Title 2 vec |
|---|---|---|---|---|---|---|---|
| 0 | 0.276407 | 000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a | 000a0fc8bbef80410199e690191dc3076a290117 | Correlation between antimicrobial consumption ... | [[0.12417, 0.15876, -0.35509, 0.84948, -0.1025... | PfSWIB, a potential chromatin regulator for va... | [[0.11559, 0.30192, -0.11465, 0.01001, -0.0321... |
| 1 | 0.174946 | 000b0174f992cb326a891f756d4ae5531f2845f7 | 000a0fc8bbef80410199e690191dc3076a290117 | Full Title: A systematic review of MERS-CoV (M... | [[-0.15856, 0.075777, -0.11876, 0.39696, 0.249... | PfSWIB, a potential chromatin regulator for va... | [[0.11559, 0.30192, -0.11465, 0.01001, -0.0321... |
| 2 | 0.239066 | 000b0174f992cb326a891f756d4ae5531f2845f7 | 000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a | Full Title: A systematic review of MERS-CoV (M... | [[-0.15856, 0.075777, -0.11876, 0.39696, 0.249... | Correlation between antimicrobial consumption ... | [[0.12417, 0.15876, -0.35509, 0.84948, -0.1025... |
| 3 | 0.218494 | 000b7d1517ceebb34e1e3e817695b6de03e2fa78 | 000a0fc8bbef80410199e690191dc3076a290117 | Supplementary Information An eco-epidemiologic... | [[-0.070217, -0.025447, 0.081508, 0.18712, -0.... | PfSWIB, a potential chromatin regulator for va... | [[0.11559, 0.30192, -0.11465, 0.01001, -0.0321... |
| 4 | 0.166431 | 000b7d1517ceebb34e1e3e817695b6de03e2fa78 | 000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a | Supplementary Information An eco-epidemiologic... | [[-0.070217, -0.025447, 0.081508, 0.18712, -0.... | Correlation between antimicrobial consumption ... | [[0.12417, 0.15876, -0.35509, 0.84948, -0.1025... |


Remove the NaN values and sort the DataFrame by cosine similarity in descending order.

In [16]:
order_cosine_df = cosine_df.dropna().sort_values(by='Cosine', ascending = False)

The 20 pairs of papers with the highest cosine similarity score.

In [17]:
order_cosine_df[['Title 1', 'Title 2', 'Cosine']].head(20)

| | Title 1 | Title 2 | Cosine |
|---|---|---|---|
| 214475 | Original Article | Original Article | 1.0 |
| 134028 | To appear in: Public Health | To appear in: Public Health in Practice | 0.850842 |
| 281705 | Respiratory viral infections | RESPIRATORY VIRAL INFECTION AND ASTHMA | 0.838242 |
| 130135 | Role of CD25 þ CD4 þ T cells in acute and pers... | Role of CD25 + CD4 + T cells in acute and pers... | 0.824387 |
| 109445 | CYSTEINE PROTEASES | Membrane-anchored serine proteases in health a... | 0.717121 |
| 186612 | WORKSHEET for EvidenceBased Review of Science ... | WORKSHEET for Evidence-Based Review of Science... | 0.697213 |
| 15261 | Preparation of Recombinant Viral Glycoproteins... | Production of complex viral glycoproteins in p... | 0.641119 |
| 177805 | Bioinformati 32. Bioinformatics and Nanotechno... | Proteomics of viruses | 0.635318 |
| 395896 | Antibody therapies for the treatment of COVID-19 | 368 ANTIVIRAL THERAPY (NON-HIV) | 0.628243 |
| 391920 | COVID 19: a clue from innate immunity | Kawasaki disease: a matter of innate immunity | 0.626963 |
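A full sort is not strictly required to obtain only the best few pairs. The sketch below keeps the k largest scores with the standard library, after discarding undefined values. The rows are made up for illustration and are not results from this notebook.

```python
import heapq
import math

# Made-up (title 1, title 2, cosine) rows, including an undefined score.
rows = [
    ("A", "B", 0.42),
    ("A", "C", float("nan")),
    ("B", "C", 0.91),
    ("A", "D", 0.17),
]

# Drop the NaN scores first, then keep the two largest cosine values.
valid = [row for row in rows if not math.isnan(row[2])]
top2 = heapq.nlargest(2, valid, key=lambda row: row[2])
print(top2)  # → [('B', 'C', 0.91), ('A', 'B', 0.42)]
```

pandas offers `DataFrame.nlargest` for the same purpose when the scores are already in a DataFrame.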


In [18]:
client.close()