# <center> Word embeddings training </center> 

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 10px">
    <ol>
        <li><a href="#download_data">Importing Needed packages, download_data and preprocess</a></li>
        <li><a href="#skipgram">Word2vec: skipgram</a></li>
        <li><a href="#CBOW">Word2vec: cbow</a></li>
        <li><a href="#FastText">fasttext</a></li>
    </ol>
</div>
<br>
<hr>


<h1 id='download_data'>1.Importing Needed packages, download_data and preprocess</h1>

Lets load required libraries

In [1]:
import gensim
from gensim.models import Word2Vec
from gensim.models.fasttext import FastText
import warnings
import string
import numpy as np
import os
from random import shuffle
import re
import urllib.request
import zipfile
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import pandas as pd
from IPython.display import display_html
warnings.filterwarnings("ignore")

## About the dataset

* **The QUAERO French Medical Corpus:**\
The QUAERO French Medical Corpus has been initially developed as a resource for named entity recognition and normalization <a href="http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-BioTxtM2014%20Proceedings.pdf#page=33">[1]</a>. It was then improved with the purpose of creating a gold standard set of normalized entities for French biomedical text, that was used in the CLEF eHealth evaluation lab <a href="https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxjbGVmZWhlYWx0aDIwMTV8Z3g6NmJmNjQ0YWNlN2MwMTU2MA">[2]</a> <a href="http://ceur-ws.org/Vol-1609/16090028.pdf">[3]</a>.\
It is a complete corpus, tokenized and with one sentence per line.

* **The QUAERO French Press Corpus:**\
It is a complete corpus, tokenized and with one sentence per line.

## Download and Unzipping data 

In [2]:
#download the data
urllib.request.urlretrieve("https://perso.limsi.fr/neveol/TP_ISD2020.zip", filename="TP_ISD2020.zip")
with zipfile.ZipFile("TP_ISD2020.zip", 'r') as zip_ref:
    zip_ref.extractall("TP_ISD2020")

In [3]:
data1 = open('TP_ISD2020/QUAERO_FrenchMed/QUAERO_FrenchMed_traindev.ospl', encoding="utf8")
data2 = open('TP_ISD2020/QUAERO_FrenchPress/QUAERO_FrenchPress_traindev.ospl', encoding="utf8")

In [4]:
file_str1 = data1.read()
file_str2 = data2.read()

## Preprocess

In [5]:
def preprocess(str_val):
    str_val = re.sub("\r", " ", str_val)
    str_val = re.sub("\d+", " ", str_val)
    str_val = re.sub("\n", " ", str_val)
    str_val = re.sub("\uf0b7", "", str_val)

    for punc in string.punctuation +'’' :
        if punc != '.':
            str_val= str_val.replace(punc," ")
    sentences = str_val.split(".")
    
    filtered_sentences=[]
    for sentence in sentences:
        if len(sentence)>1:
            filtered_sentences.append(sentence.split())
    return filtered_sentences


In [9]:
filtered_sentences1=preprocess(file_str1)
filtered_sentences2=preprocess(file_str2)

In [10]:
print(filtered_sentences1[:1])

[['EMEA', 'H', 'C', 'PRIALT', 'Qu', 'est', 'ce', 'que', 'Prialt', 'Prialt', 'est', 'une', 'solution', 'pour', 'perfusion', 'contenant', 'le', 'principe', 'actif', 'ziconotide', 'à', 'des', 'concentrations', 'de', 'ou', 'microgrammes', 'par', 'millilitre']]


In [11]:
print(filtered_sentences2[:1])

[['Patricia', 'Martin', 'que', 'voici', 'que', 'voilà', 'oh', 'bonjour', 'Nicolas', 'Stoufflet']]


<h2 id="skipgram">Word2vec: skipgram</h2>

In [12]:
model_skipgram1 = Word2Vec(min_count=1,sg=1, size=100, window=10)
# sg=1 means skipgram, else CBOW
model_skipgram1.build_vocab(filtered_sentences1)  # The QUAERO French Medical Corpus
%time model_skipgram1.train(filtered_sentences1, total_examples=model_skipgram1.corpus_count, epochs=100)

CPU times: user 34.6 s, sys: 86.1 ms, total: 34.7 s
Wall time: 14.6 s


(3215659, 4168400)

In [13]:
model_skipgram2 = Word2Vec(min_count=1,sg=1, size=100, window=10)
# sg=1 means skipgram, else CBOW
model_skipgram2.build_vocab(filtered_sentences2) # The QUAERO French Press Corpus
%time model_skipgram2.train(filtered_sentences1, total_examples=model_skipgram2.corpus_count, epochs=100)

CPU times: user 30.7 s, sys: 91.4 ms, total: 30.8 s
Wall time: 11.8 s


(2200900, 4168400)

<h2 id="CBOW">Word2vec: CBOW</h2>

In [14]:
model_CBOW1 = Word2Vec(min_count=1,sg=0, workers=4, size=100, window=10) #sg=0 -> CBOW, sg here for skip-gram
model_CBOW1.build_vocab(filtered_sentences1) # The QUAERO French Medical Corpus
%time model_CBOW1.train(filtered_sentences1, total_examples=model_CBOW1.corpus_count, epochs=100)

CPU times: user 13.9 s, sys: 96.9 ms, total: 14 s
Wall time: 4.75 s


(3215502, 4168400)

In [15]:
model_CBOW2 = Word2Vec(min_count=1,sg=0, workers=4, size=100, window=10) #sg=0 -> CBOW, sg here for skip-gram
model_CBOW2.build_vocab(filtered_sentences2) # The QUAERO French Press Corpus
%time model_CBOW2.train(filtered_sentences2, total_examples=model_CBOW2.corpus_count, epochs=100)

CPU times: user 7min 48s, sys: 905 ms, total: 7min 49s
Wall time: 2min 2s


(81673301, 112056900)

<h2 id="FastText">FastText</h2>

In [16]:
embedding_size = 60
window_size = 40
min_word = 1
down_sampling = 1e-2
%time model_fastText1 = FastText(filtered_sentences1, size=embedding_size, window=window_size, min_count=min_word, sample=down_sampling,sg=0, iter=10)

CPU times: user 30.6 s, sys: 749 ms, total: 31.4 s
Wall time: 15.3 s


In [17]:
%time model_fastText2 = FastText(filtered_sentences2, size=embedding_size, window=window_size, min_count=min_word, sample=down_sampling,sg=0, iter=10)

CPU times: user 14min 21s, sys: 2.53 s, total: 14min 24s
Wall time: 4min 59s


In [18]:
# save only the word vectors
model_skipgram1.wv.save("skipgram_vector_Medical.bin")
model_skipgram2.wv.save("skipgram_vector_Press.bin")
model_CBOW1.wv.save("cbow_vector_Medical.bin")
model_CBOW2.wv.save("cbow_vector__Press.bin")
model_fastText1.wv.save("subword_vector_Medical.bin")
model_fastText2.wv.save("subword_vector__Press.bin")

In [23]:
def display_html_table(html_str):
    """Change the look and display style of table"""
    
    display_html(html_str.replace('table','table style="padding:20px;display:inline;color:navy;font-size:1.1em"'),raw=True)
    
def display_side_by_side(*args):
    html_str=''
    
    for df in args:
        html_str+=df.to_html()    
    
    display_html_table(html_str)
 
def display_similar(positive:list,topn=10):
    """get similar concepts from 3 different models"""
    
    topn_cbow=model_CBOW1.wv.most_similar(positive=w1, topn=topn)
    topn_skipgram=model_skipgram1.wv.most_similar(positive=w1, topn=topn)
    topn_fastText1=model_fastText1.wv.most_similar(positive=w1, topn=topn)
    
    display_side_by_side(
                     pd.DataFrame(topn_cbow,columns=['cbow','cosine_sim']),
                     pd.DataFrame(topn_skipgram,columns=['skipgram','cosine_sim']),
                     pd.DataFrame(topn_fastText1,columns=['fastText','cosine_sim']))

In [31]:
w1=['patient']
display_similar(w1,topn=10)

Unnamed: 0,cbow,cosine_sim
0,risque,0.659704
1,carte,0.652916
2,éviter,0.630665
3,LEMP,0.629352
4,symptômes,0.611607
5,donc,0.609024
6,détecter,0.597967
7,délai,0.595042
8,qu,0.587571
9,alerte,0.585716

Unnamed: 0,skipgram,cosine_sim
0,stimulateur,0.607457
1,Paragangliome,0.598704
2,repos,0.586261
3,Mononucléose,0.582444
4,souffre,0.57958
5,encourus,0.572705
6,certitude,0.571104
7,Montrez,0.569867
8,gériatriques,0.561261
9,rencontrés,0.551481

Unnamed: 0,fastText,cosine_sim
0,patiente,0.993348
1,Patient,0.989609
2,pays,0.987813
3,doigts,0.98412
4,ont,0.982365
5,aient,0.98231
6,doivent,0.979921
7,étaient,0.979289
8,avaient,0.978384
9,traités,0.977978


In [32]:
w1=['traitement']
display_similar(w1,topn=10)

Unnamed: 0,cbow,cosine_sim
0,risque,0.605526
1,VIH,0.587713
2,infectés,0.565611
3,SEP,0.558821
4,rapport,0.558474
5,maladie,0.55274
6,patients,0.550472
7,médecin,0.544888
8,début,0.538424
9,préalablement,0.536745

Unnamed: 0,skipgram,cosine_sim
0,semaines,0.478879
1,début,0.458008
2,FK,0.44962
3,Tacrolimus,0.447027
4,concomitant,0.443106
5,mois,0.441773
6,lépromateuse,0.436572
7,Pendant,0.434478
8,instauration,0.425523
9,primordial,0.424809

Unnamed: 0,fastText,cosine_sim
0,Traitement,0.998823
1,Taaitement,0.997338
2,trait,0.995208
3,Allaitement,0.99107
4,évitement,0.990766
5,traitment,0.990556
6,allaitement,0.989832
7,traitements,0.987738
8,traite,0.986706
9,étroitement,0.985531


In [35]:
w1=['maladie']
display_similar(w1,topn=10)

Unnamed: 0,cbow,cosine_sim
0,Parkinson,0.798033
1,liée,0.668104
2,avancé,0.635994
3,SIDA,0.632543
4,atteint,0.630167
5,affection,0.599116
6,SEP,0.598122
7,Recklinghausen,0.597324
8,avancée,0.592628
9,infection,0.592554

Unnamed: 0,skipgram,cosine_sim
0,Parkinson,0.743655
1,AINS,0.681906
2,Inflammation,0.668791
3,vraie,0.632421
4,Hodgkin,0.618639
5,Hirsprung,0.616744
6,Cushing,0.615372
7,Basedow,0.608064
8,constituée,0.607129
9,mouton,0.606919

Unnamed: 0,fastText,cosine_sim
0,prócoce,0.997969
1,Maladie,0.997834
2,obtenue,0.997395
3,probabilité,0.997204
4,longue,0.997174
5,matière,0.997148
6,magot,0.996791
7,rétractile,0.996657
8,nombre,0.996514
9,unique,0.996456


In [36]:
solutionw1=['solution']
display_similar(w1,topn=10)

Unnamed: 0,cbow,cosine_sim
0,Parkinson,0.798033
1,liée,0.668104
2,avancé,0.635994
3,SIDA,0.632543
4,atteint,0.630167
5,affection,0.599116
6,SEP,0.598122
7,Recklinghausen,0.597324
8,avancée,0.592628
9,infection,0.592554

Unnamed: 0,skipgram,cosine_sim
0,Parkinson,0.743655
1,AINS,0.681906
2,Inflammation,0.668791
3,vraie,0.632421
4,Hodgkin,0.618639
5,Hirsprung,0.616744
6,Cushing,0.615372
7,Basedow,0.608064
8,constituée,0.607129
9,mouton,0.606919

Unnamed: 0,fastText,cosine_sim
0,prócoce,0.997969
1,Maladie,0.997834
2,obtenue,0.997395
3,probabilité,0.997204
4,longue,0.997174
5,matière,0.997148
6,magot,0.996791
7,rétractile,0.996657
8,nombre,0.996514
9,unique,0.996456


In [37]:
w1=['jaune']
display_similar(w1,topn=10)

Unnamed: 0,cbow,cosine_sim
0,pâle,0.92885
1,Fabr,0.863825
2,flavicollis,0.845411
3,éthylcellulose,0.84425
4,dioxyde,0.830397
5,Calotermes,0.827315
6,fer,0.819271
7,oxyde,0.816628
8,Méthylhydroxypropylcellulose,0.81268
9,Ethylcellulose,0.804589

Unnamed: 0,skipgram,cosine_sim
0,pâle,0.757826
1,orange,0.738929
2,hexagonaux,0.714933
3,flavicollis,0.698671
4,Calotermes,0.693252
5,Urines,0.69214
6,navet,0.690432
7,Fabr,0.689275
8,éthylcellulose,0.686687
9,replicase,0.686608

Unnamed: 0,fastText,cosine_sim
0,soir,0.994354
1,flou,0.993687
2,soigné,0.992824
3,fois,0.992518
4,jeune,0.991764
5,croisé,0.991223
6,conduit,0.990327
7,congeler,0.990284
8,conduira,0.990013
9,soin,0.989671
