### fse

All fse models take an iterable or generator that yields tuples with two fields: the words of a sentence and an index. The index is required for multi-threaded processing, because sentences might not be processed sequentially. It determines which row of the sentence vector matrix the sentence belongs to.


In [140]:
import logging
logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)

In [141]:
from fse import SplitIndexedList

sentences_a = ["Hello there", "how are you?"]
sentences_b = ["today is a good day", "Lorem ipsum"]

s = SplitIndexedList(sentences_a, sentences_b)
print(len(s))
s[0]

4


(['Hello', 'there'], 0)

In [142]:
s.items

['Hello there', 'how are you?', 'today is a good day', 'Lorem ipsum']
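
To illustrate why the index matters, here is a minimal plain-Python sketch that does not use fse at all. The helper name `indexed` and the word-count "vector" are assumptions made for illustration only. The items are shuffled to mimic workers finishing out of order, and each result is still written to the correct row.

```python
import random

def indexed(sentences):
    """Yield (words, index) tuples, mirroring the input format described above."""
    for index, sentence in enumerate(sentences):
        yield sentence.split(), index

raw = ["Hello there", "how are you?", "today is a good day", "Lorem ipsum"]
items = list(indexed(raw))

# Simulate sentences being processed in an arbitrary order.
random.seed(0)
random.shuffle(items)

# Each result goes to the row named by its index, so arrival order is irrelevant.
# The stored value is a stand-in (the word count), not a real embedding.
rows = [None] * len(raw)
for words, index in items:
    rows[index] = len(words)

print(rows)  # [2, 3, 5, 2]
```

Because every tuple carries its own row number, the output is the same regardless of the processing order.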

## Preparing data

In [50]:
from fse import Vectors
import gensim.downloader as api
data = api.load("quora-duplicate-questions")
glove = Vectors.from_pretrained("glove-wiki-gigaword-100")

2023-04-13 11:28:18,819 : MainThread : INFO : loading Vectors object from /Users/dbuchaca/.cache/huggingface/hub/models--fse--glove-wiki-gigaword-100/snapshots/3282d5e7c5e979c2411ba9703d63a46243a2047e/glove-wiki-gigaword-100.model
2023-04-13 11:28:19,435 : MainThread : INFO : loading vectors from /Users/dbuchaca/.cache/huggingface/hub/models--fse--glove-wiki-gigaword-100/snapshots/3282d5e7c5e979c2411ba9703d63a46243a2047e/glove-wiki-gigaword-100.model.vectors.npy with mmap=None
2023-04-13 11:28:19,476 : MainThread : INFO : setting ignored attribute vectors_norm to None
2023-04-13 11:28:21,247 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/Users/dbuchaca/.cache/huggingface/hub/models--fse--glove-wiki-gigaword-100/snapshots/3282d5e7c5e979c2411ba9703d63a46243a2047e/glove-wiki-gigaword-100.model', 'datetime': '2023-04-13T11:28:21.247560', 'gensim': '4.1.2', 'python': '3.9.12 (main, Apr  5 2022, 01:53:17) \n[Clang 12.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event

In [56]:
from fse import IndexedList

sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for _ in range(8):
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())

# Build the indexed list once, after all sentences have been collected.
s = IndexedList(sentences)
print(len(s))

6468640


In [65]:
sentences[1]

['What',
 'is',
 'the',
 'step',
 'by',
 'step',
 'guide',
 'to',
 'invest',
 'in',
 'share',
 'market?']

## Training Fse

To train an fse model you need pretrained word embeddings. Let us use the GloVe vectors loaded above.

In [66]:
from fse.models import uSIF
model = uSIF(glove, workers=1, lang_freq="en")

2023-04-13 11:32:31,863 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


In [67]:
model.train(s)

2023-04-13 11:32:36,376 : MainThread : INFO : scanning all indexed sentences and their word counts
2023-04-13 11:32:41,378 : MainThread : INFO : SCANNING : finished 5553011 sentences with 61384762 words
2023-04-13 11:32:42,479 : MainThread : INFO : finished scanning 6468640 sentences with an average length of 11 and 71556728 total words
2023-04-13 11:32:42,723 : MainThread : INFO : estimated memory for 6468640 sentences with 100 dimensions and 400000 vocabulary: 2621 MB (2 GB)
2023-04-13 11:32:42,723 : MainThread : INFO : initializing sentence vectors for 6468640 sentences
2023-04-13 11:32:58,213 : MainThread : INFO : pre-computing uSIF weights for 400000 words
2023-04-13 11:32:59,254 : MainThread : INFO : begin training
2023-04-13 11:33:04,261 : MainThread : INFO : PROGRESS : finished 42.55% with 2752619 sentences and 20942916 words, 550523 sentences/s
2023-04-13 11:33:09,262 : MainThread : INFO : PROGRESS : finished 84.86% with 5489164 sentences and 41766810 words, 547309 sentences/s

(6468624, 49255184)
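
To make the role of the pretrained word embeddings concrete, the following is a toy sketch of the simplest way to build a sentence vector: averaging the vectors of its words. It is not fse's implementation. The three-dimensional vectors are made up, and `average_embedding` is a hypothetical helper. The logs in this notebook also mention pre-computed uSIF weights and the removal of principal components; both steps are left out here.

```python
# Toy 3-dimensional "pretrained" embeddings; the values are made up for illustration.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "say": [0.0, 0.0, 1.0],
}

def average_embedding(words, vectors):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        # No known word: fall back to a zero vector (a choice made for this sketch).
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# "meow" is not in the toy vocabulary, so only "cat" and "say" contribute.
print(average_embedding(["cat", "say", "meow"], toy_vectors))  # [0.5, 0.0, 0.5]
```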

## Inspecting learned vectors for the different sentences

After training, we can inspect the sentence vectors in **`model.sv`**, which holds one row per training sentence.




In [116]:
print(f'There are len(model.sv)={len(model.sv)} vectors')
print(f'The vector embedding size is len(model.sv[0])={len(model.sv[0])}')
print(f'Note len(model.sv)=len(s)={len(s)} which is different than len(glove.vectors)={len(glove.vectors)}')

There are len(model.sv)=6468640 vectors
The vector embedding size is len(model.sv[0])=100
Note len(model.sv)=len(s)=6468640 which is different than len(glove.vectors)=400000


In [95]:
len(model.sv)

6468640

In [91]:
model.sv

<fse.models.sentencevectors.SentenceVectors at 0x7f87316f13d0>

To compute the similarity or distance between two sentences from the training set, pass their indices:

In [117]:
print(model.sv.similarity(0,1).round(3))
print(model.sv.distance(0,1).round(3))

0.964
0.036
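
The two values printed above sum to 1, which is what one would expect from a cosine similarity and its complementary distance. Assuming that is what these methods compute, the relationship can be sketched in plain Python. The function names below are illustrative and are not part of fse.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance defined as one minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

u = [1.0, 0.0]
v = [1.0, 1.0]
print(round(cosine_similarity(u, v), 3))  # 0.707
print(round(cosine_distance(u, v), 3))    # 0.293
```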


In [136]:
print(model.sv.similarity(1000,10).round(3))

0.168


In [135]:
sentences[1000]

['How',
 'is',
 'the',
 'new',
 'Harry',
 'Potter',
 'book',
 "'Harry",
 'Potter',
 'and',
 'the',
 'Cursed',
 "Child'?"]

In [133]:
sentences[9]

['What',
 'is',
 'the',
 'step',
 'by',
 'step',
 'guide',
 'to',
 'invest',
 'in',
 'share',
 'market?']

## Inference with the model

One can generate an embedding for a sentence as follows:

In [81]:
tmp = ("Hello my friends".split(), 0)
model.infer([tmp])

2023-04-13 11:40:14,326 : MainThread : INFO : scanning all indexed sentences and their word counts
2023-04-13 11:40:14,328 : MainThread : INFO : finished scanning 1 sentences with an average length of 3 and 3 total words
2023-04-13 11:40:14,333 : MainThread : INFO : removing 5 principal components took 0s


array([[ 2.52873838e-01, -2.85084248e-02,  2.71090940e-02,
        -2.78804988e-01, -7.40084723e-02,  4.57522571e-01,
        -1.05381116e-01,  2.74592582e-02, -6.45715743e-02,
        -3.40565652e-01, -1.88027695e-03, -7.27266222e-02,
         1.93650678e-01,  1.54085010e-01, -1.17584080e-01,
        -2.86389828e-01,  9.37029570e-02, -1.55728608e-01,
        -3.68186563e-01,  3.55130613e-01, -1.01584151e-01,
         2.67165512e-01, -3.59775722e-02, -1.73546225e-01,
         1.11245878e-01,  9.16430578e-02, -2.18638271e-01,
        -5.78938574e-02,  4.64368463e-01,  1.15901031e-01,
         2.43736461e-01,  2.93561935e-01,  3.84000361e-01,
         1.23893097e-01,  1.68842077e-03,  2.47208923e-01,
         1.76382944e-01,  6.20462634e-02,  2.72806257e-01,
        -1.29266381e-01, -1.28560856e-01,  1.32527083e-01,
         2.21165240e-01, -1.13869853e-01, -1.39036745e-01,
        -1.13875166e-01, -4.00712490e-01,  3.18430007e-01,
         3.94267976e-01, -1.02989197e-01, -1.09559409e-0

The model can also embed a batch of sentences at once, returning a matrix with one row per sentence:

In [83]:
batch = [("Hello my friends".split(), 0),
         ("I loved the old playstation games".split(), 1),
         ("I liked oldschool nintendo videogames".split(),2)]

model.infer(batch).shape

2023-04-13 12:01:27,807 : MainThread : INFO : scanning all indexed sentences and their word counts
2023-04-13 12:01:27,814 : MainThread : INFO : finished scanning 3 sentences with an average length of 4 and 14 total words
2023-04-13 12:01:27,821 : MainThread : INFO : removing 5 principal components took 0s


(3, 100)

## Querying the model

In [40]:
import fse
from fse import Vectors, Average, IndexedList

vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
model = Average(vecs)

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model.train(IndexedList(sentences))
model.sv.similarity(0,1)

0.8598846
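
The cell above compares two known rows. A query more generally ranks every row of the sentence matrix against a given vector. The following is a library-independent sketch of that idea, assuming cosine similarity as the score. The names `cosine` and `most_similar` and the two-dimensional toy matrix are assumptions for illustration; this is not fse's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar(query, matrix, topn=2):
    """Return (row_index, similarity) pairs, best match first."""
    scored = [(i, cosine(query, row)) for i, row in enumerate(matrix)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy sentence matrix: one made-up vector per sentence.
sentence_matrix = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]

best = most_similar([1.0, 0.0], sentence_matrix, topn=2)
print([i for i, _ in best])  # [0, 1]
```

The returned row indices can then be mapped back to the original sentences, which is where keeping the index next to each sentence pays off.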

In [39]:
?fse.SIF.train

In [12]:
from gensim.models import Word2Vec
sentences = [['cat', 'say', 'meow'], ['dog', 'say', 'woof']]
model = Word2Vec(sentences, min_count=1)

In [9]:
from fse.models import sentencevectors
#se = Sentence2Vec(model)
#sentences_emb = se.train(sentences)

In [11]:
#?sentencevectors