
Bow to Bert

07/10/2023

NOTE: The code here has been updated to work with current versions of TensorFlow/PyTorch driven through the transformers module. The updated code lives in its own repo, context-aware-word-vectors.

==========================================================================

Evolution of word vectors from long, sparse, and one-hot to short, dense, and context sensitive

This is the source code to go along with the blog article

Bow to Bert

Context sensitive embeddings with BERT

Figure 3. BERT embeddings are contextual. Each row shows three sentences. The sentence in the middle expresses the same context as the sentence on its right, but a different one from the sentence on its left. All three sentences in the row have a word in common. The numbers show the computed cosine similarity between the indicated word pairs. The BERT embedding for the word in the middle is more similar to that of the same word on the right than to that of the word on the left.

Summary

Word vectors have evolved over the years to know the difference between "record the play" vs "play the record". They have evolved from a one-hot world where every word was orthogonal to every other word, to a place where word vectors morph to suit the context. Slapping a BoW on word vectors is the usual way to build a document vector for tasks such as classification. But BERT does not need a BoW, as the vector shooting out of the top [CLS] token is already fine-tuned for the specific classification objective...

Dependencies

tensorflow
numpy

To reproduce the results in the post

Download: crawl-300d-2M-subword.vec

Download: BERT-Base, Uncased

Edit the script getBertWordVectors.sh and update the path accordingly:
	BERT_BASE_DIR="$PRE_TRAINED_HOME/bert/uncased_L-12_H-768_A-12"
Edit the scripts fasttext_sentence_similarity.py & fasttext_word_similarity.py and update the path accordingly:
	filename = os.environ["PRE_TRAINED_HOME"] + '/fasttext/crawl-300d-2M-subword.vec'

Get BERT word embeddings for the words/sentences in bert_sentences.txt

./getBertWordVectors.sh
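
The script above works off the TensorFlow BERT-Base checkpoint downloaded earlier. As a rough illustration of what this step produces, here is a minimal sketch using the Hugging Face transformers library (the route the updated context-aware-word-vectors repo takes); the model name and the choice of the last hidden layer are assumptions for illustration, not the script's exact settings.

	import torch
	from transformers import BertTokenizer, BertModel

	tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
	model = BertModel.from_pretrained("bert-base-uncased")
	model.eval()

	sentence = "arms bend at the elbow"            # a line from bert_sentences.txt
	inputs = tokenizer(sentence, return_tensors="pt")
	with torch.no_grad():
	    outputs = model(**inputs)

	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
	vectors = outputs.last_hidden_state[0]         # one 768-d vector per token
	for tok, vec in zip(tokens, vectors):
	    print(tok, vec.shape)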

Process BERT embeddings to compute cosine similarity for context sensitive words

pipenv run python bert_similarity.py

to get output like:

arms bend at the elbow  <=>  germany sells arms to saudi arabia 			<=>  0.482
arms bend at the elbow  <=>  wave your arms around 						<=>  0.615
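
The numbers above are cosine similarities between the contextual vectors of the shared word ("arms") taken from each pair of sentences. A minimal sketch of that comparison, with illustrative variable names rather than the script's own:

	import numpy as np

	def cosine_similarity(a, b):
	    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

	# arms_s1, arms_s2, arms_s3: the 768-d vectors for "arms" extracted from
	# "arms bend at the elbow", "germany sells arms to saudi arabia", and
	# "wave your arms around" (see the extraction sketch above)
	# print(cosine_similarity(arms_s1, arms_s2))    # ~0.48 in the post
	# print(cosine_similarity(arms_s1, arms_s3))    # ~0.62 in the post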

Fasttext word similarity

pipenv run python ./fasttext_word_similarity.py holiday vacation paper

to get output like:

Cosine Similarity: holiday & vacation : 0.7388389
Cosine Similarity: holiday & paper : 0.2716892
Cosine Similarity: vacation & paper : 0.27176374
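
Here the comparison is between the static fastText vectors of whole words. A minimal sketch of that computation, assuming the standard fastText text format (a header line followed by one "word v1 ... v300" line per word); the helper names are illustrative:

	import os
	import numpy as np

	def load_vectors(path, wanted):
	    vectors = {}
	    with open(path, encoding="utf-8", errors="ignore") as f:
	        next(f)                                # skip the "count dim" header line
	        for line in f:
	            parts = line.rstrip().split(" ")
	            if parts[0] in wanted:
	                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
	    return vectors

	def cosine_similarity(a, b):
	    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

	path = os.environ["PRE_TRAINED_HOME"] + '/fasttext/crawl-300d-2M-subword.vec'
	vecs = load_vectors(path, {"holiday", "vacation", "paper"})
	print("holiday & vacation :", cosine_similarity(vecs["holiday"], vecs["vacation"]))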

Fasttext sentence similarity

pipenv run python ./fasttext_sentence_similarity.py

to get output like:

words not found in fasttext.. 0

Cosine Similarity: enjoy your holiday & have a fun vacation : 0.72311985
Cosine Similarity: enjoy your holiday & study the paper : 0.5743288
Cosine Similarity: have a fun vacation & study the paper : 0.51478416
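
Each sentence vector here is presumably a BoW combination of the fastText vectors of its words (the "slapping a BoW on word vectors" from the summary); whether the words are averaged or summed makes no difference to the cosine values. A minimal sketch assuming a plain average, reusing load_vectors, cosine_similarity, and path from the word-similarity sketch above:

	import numpy as np

	def sentence_vector(sentence, vectors):
	    words = [w for w in sentence.split() if w in vectors]   # drop out-of-vocabulary words
	    return np.mean([vectors[w] for w in words], axis=0)

	s1, s2 = "enjoy your holiday", "have a fun vacation"
	vecs = load_vectors(path, set(s1.split()) | set(s2.split()))
	print(cosine_similarity(sentence_vector(s1, vecs), sentence_vector(s2, vecs)))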
