# ETNLP: exploring different word embeddings
Given many word embedding models, how do you know which one to use? Is it possible to run preliminary evaluations to predict which models suit a given downstream task? ETNLP is a convenient tool for this purpose.

- Read more in the paper by Xuan-Son Vu, Thanh Vu, Son N. Tran, and Lili Jiang: https://arxiv.org/abs/1903.04433

## There are some TODOs in this Notebook:
- TODO#1: (as always) read the code and comments from beginning to end.
- TODO#2: extract new set of embeddings based on new documents.
- TODO#3: visualize to see the extracted embeddings.

In [1]:
%load_ext autoreload
%autoreload 2
import sys
import time
import os
import etnlp_api
import nltk



In [2]:
# Include *.py files from other folders
module_path = os.path.abspath(os.path.join('../../'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [3]:
from pythonlibs.rnn.data import dataset_sentiment_doc

In [14]:
text = dataset_sentiment_doc()

In [15]:
text

"All good so far. I was incredibly hesitant to buy a refurbished phone on Amazon after reading so many negative reviews on various products. I spent a lot of time researching my options and figured this one was probably my safest bet. Phone came in perfect condition (seriously, it looks brand new) and so far all the functions seem great. Set up was easy, unlocked, ready to set my fingerprint and everything. It's fast, sleek, and beautiful. Camera and audio are also great.\nDEFECTIVE BATTERY - The phone was defective the day it arrived. The battery does not hold a charge consistently or indicate how long it will actually last. This defect is sporadic—so sometimes it works and sometimes it doesn't. I thought I was doing something wrong at first so, after a few weeks of frustration, I did a test to see: tracking how long I charged and how long it lasted. This confirmed it was not my fault or a defective charger and that the phone I purchased from Electonic Deals was the problem."

In [18]:
sent_arr = nltk.sent_tokenize(text)

In [19]:
for sen in sent_arr:
    print(sen)

All good so far.
I was incredibly hesitant to buy a refurbished phone on Amazon after reading so many negative reviews on various products.
I spent a lot of time researching my options and figured this one was probably my safest bet.
Phone came in perfect condition (seriously, it looks brand new) and so far all the functions seem great.
Set up was easy, unlocked, ready to set my fingerprint and everything.
It's fast, sleek, and beautiful.
Camera and audio are also great.
DEFECTIVE BATTERY - The phone was defective the day it arrived.
The battery does not hold a charge consistently or indicate how long it will actually last.
This defect is sporadic—so sometimes it works and sometimes it doesn't.
I thought I was doing something wrong at first so, after a few weeks of frustration, I did a test to see: tracking how long I charged and how long it lasted.
This confirmed it was not my fault or a defective charger and that the phone I purchased from Electonic Deals was the problem.


In [20]:
# 1. Vocab list to extract embeddings
from collections import Counter
word_counter = Counter()
for s in sent_arr:
    word_counter.update(s.split())

In [25]:
print(len(word_counter))

120


In [26]:
print(list(word_counter.keys())[:10])

['All', 'good', 'so', 'far.', 'I', 'was', 'incredibly', 'hesitant', 'to', 'buy']
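
Note that the keys include tokens such as 'far.', because `str.split()` keeps punctuation attached to the word; such tokens may not match entries in a pretrained vocabulary. A minimal sketch of one possible normalization step, assuming that stripping leading and trailing ASCII punctuation is acceptable for your task. The helper names `normalize_token` and `count_words` are hypothetical and not part of ETNLP:

```python
import string
from collections import Counter


def normalize_token(token):
    """Strip leading/trailing ASCII punctuation; keep case unchanged."""
    return token.strip(string.punctuation)


def count_words(sentences):
    """Count whitespace-separated tokens after normalization."""
    counter = Counter()
    for sentence in sentences:
        for token in sentence.split():
            word = normalize_token(token)
            if word:  # skip tokens that were pure punctuation, e.g. "-"
                counter[word] += 1
    return counter


demo = ["All good so far.", "DEFECTIVE BATTERY - The phone was defective."]
demo_counter = count_words(demo)
print(sorted(demo_counter))
```

Whether to also lowercase the tokens depends on the embedding model: some pretrained vocabularies are case-sensitive, so this sketch leaves case untouched.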


In [28]:
!pwd

/mnt/data/OProjects/DeepLearning/HPC2N/TFnDeepLearning/codes/jupyterlabs/day1


In [36]:
vocab_file = "../../../data/etnlp/senti_vocab.txt"

In [37]:
!ls $vocab_file

../../../data/etnlp/senti_vocab.txt


In [38]:
# 2. Get embeddings for these words only.
# Write the vocab file first, one word per line:
with open(vocab_file, "w", encoding="utf-8") as fwriter:
    for word in word_counter.keys():
        fwriter.write(word + "\n")

In [40]:
# Load embedding models and write down:
from etnlp_api import embedding_config
from etnlp_api import embedding_extractor

# You don't need to run this part; it only demonstrates the whole process.
def do_extraction():
    emb1 = "/mnt/data/PretrainedEmbeddings/EN/NER_fasttext_wiki_subword.vec"
    emb2 = "/mnt/data/PretrainedEmbeddings/EN/NER_fasttext_wiki.vec"
    emb3 = "/mnt/data/PretrainedEmbeddings/EN/NER_glove.840B.300d.vec"
    emb4 = "/mnt/data/PretrainedEmbeddings/EN/NER_W2V_google_news.vec"
    C2V = None
    out1 = "../../../data/etnlp/Tiny_fasttext_wiki_subword.vec"
    out2 = "../../../data/etnlp/Tiny_fasttext_wiki.vec"
    out3 = "../../../data/etnlp/Tiny_glove.840B.300d.vec"
    out4 = "../../../data/etnlp/Tiny_W2V_google_news.vec"

    VOCAB_FILE = vocab_file
    # OUTPUT_FORMAT=".txt;.npz;.gz"
    OUTPUT_FORMAT = ".txt"
    # embedding_config
    embedding_config.do_normalize_emb = False

    emb_files = [emb1, emb2, emb3, emb4]
    out_files = [out1, out2, out3, out4]

    for emb_file, out_file in zip(emb_files, out_files):
        embedding_extractor.extract_embedding_for_vocab_file(
            emb_file, VOCAB_FILE, C2V, out_file, OUTPUT_FORMAT)
    print("DONE")

02. Extracting word embeddings ...
Reading embedding file (may take a while)
model_paths_list =  ['/mnt/data/PretrainedEmbeddings/EN/NER_fasttext_wiki_subword.vec']
model_formats_list =  ['word2vec']
- At line 0
- done. Found 85 vectors for 120 words
Done
02. Extracting word embeddings ...
Reading embedding file (may take a while)
model_paths_list =  ['/mnt/data/PretrainedEmbeddings/EN/NER_fasttext_wiki.vec']
model_formats_list =  ['word2vec']
- At line 0
- done. Found 85 vectors for 120 words
Done
02. Extracting word embeddings ...
Reading embedding file (may take a while)
model_paths_list =  ['/mnt/data/PretrainedEmbeddings/EN/NER_glove.840B.300d.vec']
model_formats_list =  ['word2vec']
- At line 0
- done. Found 85 vectors for 120 words
Done
02. Extracting word embeddings ...
Reading embedding file (may take a while)
model_paths_list =  ['/mnt/data/PretrainedEmbeddings/EN/NER_W2V_google_news.vec']
model_formats_list =  ['word2vec']
- At line 0
- done. Found 80 vectors for 120 words
Done
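
The log reports how many vocabulary words were found in each model (85 or 80 out of 120). A rough way to check such coverage yourself is sketched below, assuming the extracted file is in word2vec text format: an optional `<count> <dim>` header line, then one `word v1 v2 ...` line per word. The function names `read_vec_words` and `coverage` are hypothetical, and this is not ETNLP's implementation:

```python
import os
import tempfile


def read_vec_words(path):
    """Return the set of words in a word2vec-style text file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            parts = line.split()
            # Skip an optional "<count> <dim>" header on the first line.
            if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            if parts:
                words.add(parts[0])
    return words


def coverage(vocab, vec_path):
    """Return (found, total, missing words) for a vocabulary."""
    found = read_vec_words(vec_path)
    missing = sorted(w for w in vocab if w not in found)
    return len(vocab) - len(missing), len(vocab), missing


# Tiny self-contained demo with a temporary file (made-up vectors).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("2 3\ngood 0.1 0.2 0.3\nphone 0.4 0.5 0.6\n")
    n_found, n_total, missing = coverage({"good", "phone", "far."}, demo_path)

print("Found %d vectors for %d words; missing: %s" % (n_found, n_total, missing))
```

Listing the missing words is usually more informative than the count alone, since it shows whether the gaps come from punctuation, casing, or genuinely rare words.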

# 2. Visualization

In [2]:
!pwd

/mnt/data/OProjects/DeepLearning/HPC2N/TFnDeepLearning/codes/jupyterlabs/day1


In [4]:
!ls "../../../data/etnlp/"

conll2003_ner_vocab.txt  Ti_ft_wi_sub.vec  Ti_gl.vec
senti_vocab.txt		 Ti_ft_wi.vec	   Ti_W2V.vec


In [5]:
# from etnlp_api import embedding_visualizer
out1 = "../../../data/etnlp/Ti_ft_wi_sub.vec"
out2 = "../../../data/etnlp/Ti_ft_wi.vec"
out3 = "../../../data/etnlp/Ti_gl.vec"
out4 = "../../../data/etnlp/Ti_W2V.vec"

INPUT_FILES = "%s;%s;%s;%s"%(out1, out2, out3, out4)

# etnlp_api.embedding_visualizer.visualize_multiple_embeddings(INPUT_FILES)
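
`INPUT_FILES` is a single string with the four paths separated by semicolons. A small sketch of building that string from a list and flagging paths that do not exist before a visualizer is started; the helper name `build_input_files` is hypothetical and not part of ETNLP:

```python
import os


def build_input_files(paths, sep=";"):
    """Join paths into one separator-delimited string; report missing files."""
    missing = [p for p in paths if not os.path.isfile(p)]
    return sep.join(paths), missing


paths = [
    "../../../data/etnlp/Ti_ft_wi_sub.vec",
    "../../../data/etnlp/Ti_ft_wi.vec",
    "../../../data/etnlp/Ti_gl.vec",
    "../../../data/etnlp/Ti_W2V.vec",
]
input_files, missing_files = build_input_files(paths)
if missing_files:
    print("Missing files:", missing_files)
```

Building the string from a list avoids editing the format string whenever an embedding file is added or removed.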

## 2.1 Interactive visualization:
Run this in a Terminal tab; it will not work via !<command> in this Notebook.
- 1. activate the environment
- 2. sh scripts/03.run_etnlp_visualizer_iter_local.sh

## 2.2 Side-by-side visualization:
Run this in a Terminal tab; it will not work via !<command> in this Notebook.
- 1. activate the environment
- 2. sh scripts/04.run_etnlp_visualizer_sbs.sh

# 3. Evaluation
- See more on the GitHub page: https://github.com/vietnlp/etnlp
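
As a quick sanity check before a full evaluation, you can compare which words each embedding model places close together. The sketch below is not ETNLP's evaluator; it is a plain-Python illustration of cosine-similarity nearest neighbours, with hypothetical helper names (`cosine`, `nearest`) and made-up toy vectors:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=3):
    """Return the k words most similar to `word` by cosine similarity."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy 3-dimensional vectors, made up for illustration only.
toy = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "battery": [0.0, 0.1, 0.9],
}
print(nearest("good", toy, k=2))
```

Running the same query against each extracted embedding file and comparing the neighbour lists gives a first impression of how the models differ.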

# 4. TODOs:
- TODO#2: extract new set of embeddings based on new documents.
- TODO#3: visualize to see the extracted embeddings.

In [1]:
# Please go to bbc.com or any other website, copy a piece of text
# (not too long, not too short), and paste it here.
new_text = """
<PASTE YOUR TEXT HERE>
"""

# Next: please replicate the whole process again with this new text.

In [6]:
in1 = "../../../data/etnlp/Ti_ft_wi_sub.vec"
in2 = "../../../data/etnlp/Ti_ft_wi.vec"
in3 = "../../../data/etnlp/Ti_gl.vec"
in4 = "../../../data/etnlp/Ti_W2V.vec"

# TODO HERE: do the same as in the do_extraction() function, but with the 4 input embeddings above.
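
If you want to see the idea behind the extraction step in isolation, the sketch below filters a word2vec-format text file down to a given vocabulary. It is a plain-Python stand-in under stated assumptions (word2vec text format, whitespace-separated values), not ETNLP's `extract_embedding_for_vocab_file`; the name `filter_vec_file` is hypothetical:

```python
import os
import tempfile


def filter_vec_file(in_path, vocab, out_path):
    """Copy vectors for words in `vocab` from in_path to out_path.

    Assumes word2vec text format and writes a "<count> <dim>" header.
    Returns the number of vectors written.
    """
    kept = []
    with open(in_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue  # skip the header of the input file
            if parts[0] in vocab:
                kept.append(parts)
    dim = len(kept[0]) - 1 if kept else 0
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("%d %d\n" % (len(kept), dim))
        for parts in kept:
            f.write(" ".join(parts) + "\n")
    return len(kept)


# Self-contained demo on a temporary file with made-up vectors.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.vec")
    dst = os.path.join(tmp, "out.vec")
    with open(src, "w", encoding="utf-8") as f:
        f.write("3 2\ngood 0.1 0.2\nphone 0.3 0.4\nbattery 0.5 0.6\n")
    n_written = filter_vec_file(src, {"good", "battery", "far."}, dst)
    with open(dst, encoding="utf-8") as f:
        out_lines = f.read().splitlines()

print(n_written, out_lines)
```

For the TODO itself, the vocabulary would come from your new text and the input paths would be `in1` to `in4`.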

# Conclusions: after this Notebook, you know:
- How to extract new embeddings for a new downstream task.
- How to visualize different embeddings to explore them interactively or compare them side by side.
- How to run preliminary evaluations to better judge which embeddings to use.