In [1]:
import httrees.embeddings as emb
import pandas as pd
from gensim.utils import effective_n_jobs
from gensim.models import Word2Vec, fasttext

12/07/2021 11:04:31 AM adding document #0 to Dictionary(0 unique tokens: [])
12/07/2021 11:04:31 AM built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
12/07/2021 11:04:31 AM Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-12-07T11:04:31.195503', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


# Fine-tuning embeddings: `httrees.embeddings`

In this example we'll walk through fine-tuning pre-trained embeddings (FastText and word2vec, both loaded through `gensim`) on our own data.

We'll use the Amazon reviews data again ([see the previous documentation](https://github.com/bllguo/CourseProject/blob/main/docs.ipynb)), but only the review text itself. I prepared a separate `.csv` file containing only the reviews:

In [2]:
df = pd.read_csv('data/docs.csv')
df.head()

| Unnamed: 0 | Text |
|---|---|
| 0 | The description and photo on this product need... |
| 1 | This was a great book!!!! It is well thought t... |
| 2 | I am a first year teacher, teaching 5th grade.... |
| 3 | I got the book at my bookfair at school lookin... |
| 4 | Hi! I'm Martine Redman and I created this puzz... |


The text could already be passed to a model for training as-is, but `httrees.embeddings` also provides a `StreamCorpus` class that streams documents to the model one at a time instead of loading the entire corpus into memory. This is useful for larger datasets.

In [3]:
corpus = emb.StreamCorpus('data/docs.csv', encoding='utf-8-sig', skip_rows=1)
for doc in corpus:
    print(doc)
    break

['the', 'description', 'and', 'photo', 'on', 'this', 'product', 'needs', 'to', 'be', 'changed', 'to', 'indicate', 'this', 'product', 'is', 'the', 'buffalos', 'version', 'of', 'this', 'beef', 'jerky.']
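
To make the streaming idea concrete, here is a minimal sketch of a corpus class that reads a CSV file lazily. It is not the `httrees` implementation: the class name `SimpleStreamCorpus` and the `text_column` parameter are assumptions made for illustration, and the lowercase, whitespace-only tokenization is inferred from the output above.

```python
import csv


class SimpleStreamCorpus:
    """Hypothetical streaming corpus; a sketch, not httrees.embeddings.StreamCorpus."""

    def __init__(self, path, text_column=1, encoding='utf-8-sig', skip_rows=1):
        self.path = path
        self.text_column = text_column  # assumed: index of the column holding the text
        self.encoding = encoding
        self.skip_rows = skip_rows      # number of leading rows (e.g. the header) to skip

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated more
        # than once (gensim scans the data for the vocabulary and then again
        # for each training epoch).
        with open(self.path, newline='', encoding=self.encoding) as handle:
            for row_number, row in enumerate(csv.reader(handle)):
                if row_number < self.skip_rows:
                    continue
                if len(row) <= self.text_column:
                    continue  # ignore rows without a text field
                # Lowercase and split on whitespace; punctuation stays attached.
                yield row[self.text_column].lower().split()
```

Only one row is held in memory at a time, and because `__iter__` starts from the top of the file on each call, the object is a restartable iterable rather than a one-shot generator.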


Two wrappers are provided around `gensim` embedding models, `EmbeddingTuner` and the more specific `Word2VecTuner`.
`EmbeddingTuner` is minimalistic. As an example, we'll wrap a `FastText` model and do some additional training. We can obtain pretrained `FastText` models [from Facebook](https://fasttext.cc/docs/en/crawl-vectors.html).

Note: more extensive examples are available from `gensim`'s [documentation](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model).

In [4]:
model = fasttext.load_facebook_model('cc.en.300.bin')
tuner = emb.EmbeddingTuner(model)

12/07/2021 11:04:31 AM loading 2000000 words for fastText model from cc.en.300.bin
12/07/2021 11:04:51 AM FastText lifecycle event {'params': 'FastText(vocab=0, vector_size=300, alpha=0.025)', 'datetime': '2021-12-07T11:04:51.968603', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
12/07/2021 11:04:51 AM Updating model with new vocabulary
12/07/2021 11:05:00 AM FastText lifecycle event {'msg': 'added 2000000 new unique words (100.0%% of original 2000000) and increased the count of 0 pre-existing words (0.0%% of original 2000000)', 'datetime': '2021-12-07T11:05:00.863008', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}
12/07/2021 11:05:10 AM deleting the raw counts dictionary of 2000000 items
12/07/2021 11:05:10 AM sample=1e

Now we call `train`, which updates the model's vocabulary from our corpus and then continues training on our own data.

In [5]:
tuner.train(corpus, epochs=1)

12/07/2021 11:06:08 AM collecting all words and their counts
12/07/2021 11:06:08 AM PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
12/07/2021 11:06:09 AM PROGRESS: at sentence #10000, processed 900659 words, keeping 59826 word types
12/07/2021 11:06:09 AM PROGRESS: at sentence #20000, processed 1679397 words, keeping 90519 word types
12/07/2021 11:06:09 AM PROGRESS: at sentence #30000, processed 2490190 words, keeping 116946 word types
12/07/2021 11:06:09 AM collected 140947 word types from a corpus of 3294344 raw words and 40000 sentences
12/07/2021 11:06:09 AM Updating model with new vocabulary
12/07/2021 11:06:16 AM FastText lifecycle event {'msg': 'added 6106 new unique words (4.332124841252385%% of original 140947) and increased the count of 16640 pre-existing words (11.805856101939026%% of original 140947)', 'datetime': '2021-12-07T11:06:16.117852', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)

Sometimes the full model is not available. For `word2vec`, for example, `gensim` provides only the `KeyedVectors`. We don't have the model's internal weights, but we can still use these vectors as a starting point for training.

We can load embeddings into the `Word2VecTuner` either from `KeyedVectors` or directly from `gensim`'s data repository; here we'll use the latter:

In [6]:
model = emb.Word2VecTuner(Word2Vec(window=3, 
                                   min_count=1, 
                                   # leave one core free, but never drop to 0 workers
                                   workers=max(1, effective_n_jobs(-1) - 1), 
                                   vector_size=300))

12/07/2021 11:07:41 AM Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.025)', 'datetime': '2021-12-07T11:07:41.099668', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


In [7]:
model.load_embeddings(gensim_model='word2vec-google-news-300')

12/07/2021 11:07:41 AM Downloading word2vec-google-news-300...
12/07/2021 11:07:42 AM loading projection weights from C:\Users\bllgu/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz
12/07/2021 11:08:20 AM KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from C:\\Users\\bllgu/gensim-data\\word2vec-google-news-300\\word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2021-12-07T11:08:20.547289', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'load_word2vec_format'}
12/07/2021 11:08:20 AM Download complete.
12/07/2021 11:08:20 AM Updating model vocabulary...
12/07/2021 11:08:22 AM collecting all words and their counts
12/07/2021 11:08:22 AM PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
12/07/2021 11:08:22 AM PROGRESS: at sentence #10000, processed 10000 words, keeping 10000 wo

Once we initialize `Word2Vec` with the pre-trained embeddings, we can do additional training.
Here we add the handy `EpochSaver` callback, which saves the model after each epoch.

In [8]:
epoch_saver = emb.EpochSaver('w2v')
model.train(corpus, 
            epochs=1, 
            callbacks=[epoch_saver])

12/07/2021 11:27:32 AM collecting all words and their counts
12/07/2021 11:27:32 AM PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
12/07/2021 11:27:32 AM PROGRESS: at sentence #10000, processed 900659 words, keeping 59826 word types
12/07/2021 11:27:32 AM PROGRESS: at sentence #20000, processed 1679397 words, keeping 90519 word types
12/07/2021 11:27:33 AM PROGRESS: at sentence #30000, processed 2490190 words, keeping 116946 word types
12/07/2021 11:27:33 AM collected 140947 word types from a corpus of 3294344 raw words and 40000 sentences
12/07/2021 11:27:33 AM Updating model with new vocabulary
12/07/2021 11:27:42 AM Word2Vec lifecycle event {'msg': 'added 111226 new unique words (78.91335040830951%% of original 140947) and increased the count of 29721 pre-existing words (21.086649591690495%% of original 140947)', 'datetime': '2021-12-07T11:27:42.874820', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD6

Now we can save our final model to disk.

In [9]:
model.model.save('tuned_gigaword.model')

12/07/2021 11:28:47 AM Word2Vec lifecycle event {'fname_or_handle': 'tuned_gigaword.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-12-07T11:28:47.912551', 'gensim': '4.1.2', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'saving'}
12/07/2021 11:28:47 AM storing np array 'vectors' to tuned_gigaword.model.wv.vectors.npy
12/07/2021 11:29:05 AM storing np array 'syn1neg' to tuned_gigaword.model.syn1neg.npy
12/07/2021 11:29:23 AM not storing attribute cum_table
12/07/2021 11:29:25 AM saved tuned_gigaword.model


Or, we can save just the embeddings. Here they are pickled for later use.

In [10]:
import pickle
new_word_vectors = {k: model.model.wv.get_vector(k) for k in model.model.wv.index_to_key}
with open('tuned_embeddings_.pkl', 'wb') as handle:
    pickle.dump(new_word_vectors, handle)
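
To use the pickled embeddings later, they can be loaded back into a plain dictionary and compared directly. The following is a small sketch; the helper names `load_embeddings` and `cosine_similarity` are hypothetical and not part of `httrees`.

```python
import math
import pickle


def load_embeddings(path):
    """Load a {word: vector} dictionary from a pickle file.

    Only unpickle files you trust: pickle can execute arbitrary code on load.
    """
    with open(path, 'rb') as handle:
        return pickle.load(handle)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists or arrays)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)
```

For example, `vectors = load_embeddings('tuned_embeddings_.pkl')` followed by `cosine_similarity(vectors[a], vectors[b])` for two words `a` and `b` that are in the vocabulary. Because the saved values are NumPy arrays, NumPy must be installed when the file is loaded.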