# Exploring sentence-, document-, and character-level embeddings


Previously, we covered word embeddings and distance measurements for text, with an emphasis on word order and semantics. Here we extend those ideas to documents and sentences. Doc2Vec learns embeddings for paragraphs and can be applied to sentences as well. We then examine fastText, which builds word representations from character n-grams, before turning to Sent2Vec, which builds sentence embeddings from n-grams. The section closes with an introduction to the Universal Sentence Encoder (USE).

## Venturing into Doc2Vec

Doc2Vec is a natural language processing algorithm that extends word embeddings to entire documents or paragraphs. It builds on the popular Word2Vec model, which learns word embeddings by predicting a word from its surrounding context (or the context from the word) within a fixed window of text. Doc2Vec, introduced by Le and Mikolov in 2014, works similarly but learns fixed-length vectors (embeddings) for entire documents or paragraphs. These document embeddings capture the semantic meaning and context of the whole document, enabling downstream tasks such as document classification, clustering, and similarity retrieval.

### Building paragraph vectors using Doc2Vec
Doc2Vec represents each document as a dense, continuous-valued vector, often called a paragraph vector. The cells below build such a model with gensim on its small built-in `common_texts` corpus and then vary the main training parameters.

#### Building a Doc2Vec model


In [33]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [34]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [35]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
documents

[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]),
 TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]),
 TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]),
 TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]),
 TaggedDocument(words=['user', 'response', 'time'], tags=[4]),
 TaggedDocument(words=['trees'], tags=[5]),
 TaggedDocument(words=['graph', 'trees'], tags=[6]),
 TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]),
 TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]

In [36]:
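# Passing `documents` to the constructor builds the vocabulary and trains for `epochs` passes.
model = Doc2Vec(documents, vector_size=5, min_count=1, workers=4, epochs=40)
# The explicit call below runs a further 40 epochs on top of that; it is not required.
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)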

In [37]:
model.vector_size

5

In [38]:
len(model.dv)

9

In [39]:
len(model.wv)


12

In [40]:
words = list(model.wv.key_to_index.keys())
words

['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human']

In [41]:
vector=model.infer_vector(['user', 'interface', 'for','computer'])
vector

array([ 0.0027331 , -0.00960313,  0.05049047, -0.02768766,  0.06817956],
      dtype=float32)

#### Changing vector size and min_count

In [42]:
model = Doc2Vec(documents, min_count=3, epochs=40, vector_size=50)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [43]:
len(model.wv)

4

In [44]:
word=list(model.wv.key_to_index.keys())
word

['system', 'graph', 'trees', 'user']

In [45]:
vector=model.infer_vector(['user', 'interface', 'for','computer'])
vector


array([ 0.00050749, -0.00069027,  0.00566557, -0.00225846,  0.00671367,
        0.00243411, -0.00819021, -0.00957174, -0.00364764, -0.0047829 ,
       -0.00785492, -0.00024565, -0.00118925, -0.00234127,  0.00762444,
        0.0054061 ,  0.00044488,  0.00571045,  0.00824073, -0.00513679,
       -0.00990448, -0.00757426,  0.00459863, -0.00892727,  0.00692203,
       -0.00593891,  0.00920657,  0.0025982 , -0.00669509,  0.00575561,
       -0.00659957,  0.00393462, -0.00810801, -0.00737172,  0.00671644,
        0.00493415,  0.004701  ,  0.00752889,  0.00385293,  0.00540007,
       -0.00541467, -0.0051679 , -0.00673952,  0.00482542,  0.00345801,
       -0.00858704, -0.0058589 , -0.00026754, -0.00943008, -0.00538201],
      dtype=float32)

As we can see, the vector size is now 50 and only 4 terms remain in the vocabulary.
This is because min_count was raised to 3, so only terms that occur at least
3 times in the corpus are kept in the vocabulary.
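
The effect of `min_count` can be reproduced without gensim by counting term frequencies directly. The sketch below only approximates the vocabulary-pruning step: the helper name `prune_vocab` is hypothetical (it is not a gensim function), and the corpus is copied from the `common_texts` output shown earlier.

```python
from collections import Counter

# The toy corpus, copied from the `common_texts` output above.
corpus = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]


def prune_vocab(texts, min_count):
    """Hypothetical helper: keep terms whose corpus frequency is at least min_count."""
    counts = Counter(word for doc in texts for word in doc)
    return {word: freq for word, freq in counts.items() if freq >= min_count}


vocab_1 = prune_vocab(corpus, min_count=1)
vocab_3 = prune_vocab(corpus, min_count=3)
print(len(vocab_1))     # 12
print(sorted(vocab_3))  # ['graph', 'system', 'trees', 'user']
```

The four surviving terms match the vocabulary reported by the model above, although gensim orders them by frequency rather than alphabetically.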

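Doc2Vec supports two training approaches, and we will test both:


1.   PV-DM (Distributed Memory version of Paragraph Vector)
2.   PV-DBOW (Distributed Bag of Words version of Paragraph Vector)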



In [46]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Setting dm to 1, as above, selects the distributed memory approach (PV-DM). Setting dm to 0 instead builds the Doc2Vec model with the distributed bag-of-words approach (PV-DBOW):

In [47]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=0)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
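
The difference between the two approaches is easiest to see in the prediction tasks they set up. The sketch below is a simplified illustration rather than gensim's implementation: the generator names are hypothetical, and real training also involves details such as negative sampling and randomly reduced windows that are left out here.

```python
def pv_dm_examples(doc_tag, words, window):
    """PV-DM: the document tag plus nearby context words predict the target word."""
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        yield (doc_tag, context), target


def pv_dbow_examples(doc_tag, words):
    """PV-DBOW: the document tag alone predicts each word of the document."""
    for target in words:
        yield doc_tag, target


doc_tag, words = 2, ['eps', 'user', 'interface', 'system']
dm_examples = list(pv_dm_examples(doc_tag, words, window=1))
dbow_examples = list(pv_dbow_examples(doc_tag, words))

for inputs, target in dm_examples:
    print('PV-DM  ', inputs, '->', target)
for inputs, target in dbow_examples:
    print('PV-DBOW', inputs, '->', target)
```

PV-DM uses word order through the context window, while PV-DBOW ignores it and relies on the document vector alone.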

The dm_concat parameter is used in the PV-DM approach. Its value, when set to 1,
indicates to the algorithm that the context vectors should be concatenated while trying to
predict the target word. This, of course, leads to building a larger model since multiple
word embeddings get concatenated.

In [48]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1,
                window=2, min_alpha=0.005, dm_concat=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
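
To see why concatenation leads to a larger model, compare the size of the input passed to the prediction layer under each setting. The sketch below uses random placeholder vectors rather than trained embeddings and assumes one document vector plus `window` words on each side of the target. It only illustrates the dimensions involved, not gensim's internals.

```python
import random

random.seed(0)
vector_size, window = 50, 2


def random_vector(size):
    """Placeholder for a trained embedding."""
    return [random.uniform(-1.0, 1.0) for _ in range(size)]


doc_vector = random_vector(vector_size)
context_vectors = [random_vector(vector_size) for _ in range(2 * window)]
inputs = [doc_vector] + context_vectors

# Averaging keeps the input at vector_size dimensions.
averaged = [sum(column) / len(inputs) for column in zip(*inputs)]

# Concatenation stacks all input vectors end to end.
concatenated = [value for vector in inputs for value in vector]

print(len(averaged), len(concatenated))  # 50 250
```

With these settings the concatenated input is five times wider, and the weights that consume it grow accordingly.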


The window parameter sets the maximum distance between the word being predicted
and the context words used to predict it, just as in the Word2Vec approach.

In [49]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=0)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Now, let's look at the learning rate and how to control it.

For Doc2Vec, the initial learning rate is specified with the alpha parameter. The
min_alpha parameter specifies the value the learning rate should drop to over the
course of training.

In [50]:
model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40,
                alpha=0.3, min_alpha=0.05, window=2, dm=1)
model.train(documents, epochs=model.epochs, total_examples=model.corpus_count)
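
The drop from alpha to min_alpha can be pictured as a linear interpolation over the course of training. The sketch below assumes a linear decay evaluated once per epoch; the function name `learning_rate` is hypothetical, and gensim updates the rate at a finer granularity internally.

```python
def learning_rate(progress, alpha=0.3, min_alpha=0.05):
    """Linearly interpolate from alpha (progress 0.0) to min_alpha (progress 1.0)."""
    return alpha - (alpha - min_alpha) * progress


epochs = 40
schedule = [learning_rate(epoch / epochs) for epoch in range(epochs + 1)]

# Learning rate at the start, midpoint, and end of training.
print(round(schedule[0], 3), round(schedule[epochs // 2], 3), round(schedule[-1], 3))
# 0.3 0.175 0.05
```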


## Exploring fastText

fastText represents each word as a bag of character n-grams. Let's look at the two- and three-character n-grams for the word language:

la, lan, an, ang, ng, ngu, gu, gua, ua, uag, ag, age, ge

fastText shares parameters among words that have overlapping n-grams. It captures
morphological information from these sub-words to build an embedding for the word itself.
Also, when a word is missing from the training vocabulary or occurs rarely, we can still
obtain a representation for it if its n-grams appear as part of other words.
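
The n-gram list above can be generated with a few lines of plain Python. This is a minimal sketch with a hypothetical helper name, `char_ngrams`. The fastText implementation additionally wraps each word in `<` and `>` boundary markers before extracting n-grams, which is left out here so that the output matches the list above.

```python
def char_ngrams(word, min_n=2, max_n=3):
    """Return the character n-grams of `word`, ordered by start position."""
    ngrams = []
    for start in range(len(word)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(word):
                ngrams.append(word[start:start + n])
    return ngrams


language_ngrams = char_ngrams('language')
print(', '.join(language_ngrams))
# la, lan, an, ang, ng, ngu, gu, gua, ua, uag, ag, age, ge
```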


Why n-grams are useful:

Sharing similarities: The model can find connections between words that share n-gram pieces. For example, "language" and "angle" both contain the n-grams "an", "ng", and "ang", so part of their representation is shared.

Understanding unfamiliar words: If the model encounters a word it has not seen before (like "lingual"), it can still make some sense of it, because "lingual" shares n-grams ("ng", "gu", "ua") with words it already knows, such as "language".

Basically, n-grams help the machine learn word meanings by looking at smaller building blocks that can appear in many different words. This is especially useful for rare words or for languages with complex morphology (where words are built from smaller meaningful parts).
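
The overlap between words can be checked directly with set intersection. The sketch below uses a hypothetical helper, `ngram_set`, over two- and three-character n-grams without boundary markers, and compares "language" with the two example words.

```python
def ngram_set(word, min_n=2, max_n=3):
    """Return the set of character n-grams of `word` for lengths min_n..max_n."""
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }


known = 'language'
shared = {other: sorted(ngram_set(known) & ngram_set(other))
          for other in ('angle', 'lingual')}

for other, ngrams in shared.items():
    print(other, ngrams)
# angle ['an', 'ang', 'ng']
# lingual ['gu', 'gua', 'ng', 'ngu', 'ua']
```

Shared spelling does not guarantee related meaning, but every shared n-gram is a parameter that both words contribute to and draw from during training.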

**Building a fastText model**

In [51]:
from gensim.models import FastText
from gensim.test.utils import common_texts

In [52]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [53]:
model = FastText(vector_size=5, window=3, min_count=1)

model.build_vocab(common_texts)
model.train(common_texts, total_examples=len(common_texts), epochs=10)

(36, 290)

In [54]:
vocab=list(model.wv.key_to_index.keys())
vocab

['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human']

In [55]:
model.wv['human']

array([-0.03166137,  0.02326731,  0.01241683,  0.00036033,  0.02841445],
      dtype=float32)

In [56]:
model.wv.most_similar(positive=['computer','interface'],negative=['human'])

[('user', 0.7968785762786865),
 ('system', 0.17462188005447388),
 ('response', 0.104334257543087),
 ('survey', 0.009604760445654392),
 ('trees', -0.07640466839075089),
 ('time', -0.1330047994852066),
 ('minors', -0.13927175104618073),
 ('eps', -0.24093686044216156),
 ('graph', -0.291752427816391)]

fastText builds word representations from character n-grams, and the min_n and
max_n parameters set the minimum and maximum lengths of those n-grams.

In [57]:
model=FastText(vector_size=5,window=3,min_count=1, min_n=1, max_n=5)
model.build_vocab(common_texts)
model.train(common_texts,total_examples=len(common_texts), epochs=10)

(36, 290)

In [58]:
model.wv['rubber']

array([ 0.01833104, -0.02146881,  0.00600105, -0.03445042, -0.0165866 ],
      dtype=float32)
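
The word rubber never appears in `common_texts`, yet the lookup above still returns a vector. A rough sketch of the idea follows: the word is split into character n-grams, each n-gram is hashed into a table of bucket vectors, and the selected vectors are averaged. Everything here is an assumption for illustration: the bucket vectors are random placeholders rather than trained weights, `NUM_BUCKETS` is arbitrary, and CRC32 stands in for the hash function that fastText actually uses.

```python
import random
import zlib

VECTOR_SIZE = 5
NUM_BUCKETS = 1000  # hypothetical table size, chosen only for this sketch

# Placeholder bucket vectors; in a trained model these are learned.
rng = random.Random(0)
bucket_vectors = [
    [rng.uniform(-0.05, 0.05) for _ in range(VECTOR_SIZE)]
    for _ in range(NUM_BUCKETS)
]


def char_ngrams(word, min_n=1, max_n=5):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f'<{word}>'
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def oov_vector(word):
    """Average the bucket vectors selected by hashing each n-gram of the word."""
    rows = [
        bucket_vectors[zlib.crc32(ngram.encode('utf-8')) % NUM_BUCKETS]
        for ngram in char_ngrams(word)
    ]
    return [sum(column) / len(rows) for column in zip(*rows)]


vector = oov_vector('rubber')
print(len(vector))  # 5
```

Because the vector is assembled from sub-word pieces, any string made of characters seen in training gets a representation, although its quality depends on how many of its n-grams were actually trained.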

In [59]:
model.wv.most_similar(positive=['computer','human'],negative=['rubber'])

[('trees', 0.795038104057312),
 ('eps', 0.7793108820915222),
 ('minors', 0.2440604716539383),
 ('time', 0.1623203009366989),
 ('user', -0.04820726439356804),
 ('graph', -0.15672056376934052),
 ('survey', -0.20417772233486176),
 ('interface', -0.3921482563018799),
 ('response', -0.6897355914115906),
 ('system', -0.8435077667236328)]

#### Extending the built model to incorporate words from new sentences

In [60]:
sentences_to_be_added = [["I", "am", "learning", "Natural", "Language", "Processing"],
                         ["Natural", "Language", "Processing"]]

In [61]:
model.build_vocab(sentences_to_be_added, update=True)
# Continue training on the new sentences so that the added words are actually trained.
model.train(sentences_to_be_added, total_examples=len(sentences_to_be_added), epochs=10)

In [62]:
vocab=list(model.wv.key_to_index.keys())
vocab

['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human',
 'I',
 'am',
 'learning',
 'Natural',
 'Language',
 'Processing']