# 1. Distributional semantics: bee and honey vs. bee and bumblebee

Why do we need this? Well, for example, we need this in search. So let's say we want to do some ranking. For example, we have some keywords, and then we have some candidates to rank. Then we can just compute these kinds of similarities between our query and our candidates, and get the most similar results at the top.

And actually there are numerous applications of these techniques. For example, you can also think about ontology learning. What it means is that sometimes you need to represent the hierarchical structure of some area. You need to know that there are some concepts, and there are some examples of these concepts. For example, you might want to know that, I don't know, there are plumbers, and that they can fix a tap or a faucet. And you need to know that tap and faucet are similar words that represent the same concept.



![ws_1.png](pics/ws_1.png)

This can be also done by distributional semantics, and this is what we are going to cover right now. Okay, so for example we want to understand that bee and bumblebee are similar. How can we get that? Let us start with counting some word co-occurrences. So we can just decide that we are interested in the words that co-occur in a small sliding window. For example, in a window of size ten. And if the words co-occur we say plus 1 for this counter, and get these green counters in the slide.
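The counting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `cooccurrence_counts`, the toy sentence, and the choice to read "window" as "up to `window` tokens to the right of each position" are assumptions, not something fixed by the slide.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=10):
    """Count how often two different words appear within `window` tokens of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Only look to the right, so each pair of positions is counted exactly once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

# Toy corpus, made up for illustration.
tokens = "the bee makes honey and the bumblebee makes honey too".split()
counts = cooccurrence_counts(tokens, window=3)

print(counts[("bee", "honey")])      # bee and honey share a window
print(counts[("bee", "bumblebee")])  # bee and bumblebee never do in this toy text
```

Keys are sorted word pairs, so the counter is symmetric by construction.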

So this way we will understand that bee and honey are related. They are called syntagmatic associates because they often co-occur in the same contexts. However, if we get back to our example, where we want to understand that tap and faucet are similar, that's not what we need. We need second-order co-occurrence, which means that the two words co-occur with similar words in their contexts. For example, we can compute a long, sparse vector for bee, where the cells say what the most popular neighbors of this word are. And we will also compute the same kind of vector for bumblebee. After that, we compute the similarity between these two vectors. This way we will understand that bee and bumblebee can be used interchangeably in the language, and this means that they are similar, right? Such words are usually called paradigmatic parallels, and this is the type of co-occurrence that we usually need.
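To make the second-order idea concrete, here is a minimal sketch. The three-sentence corpus, the symmetric window of two tokens, and the helper names `context_vectors` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def context_vectors(tokens, window=2):
    """Map every word to a sparse vector (a Counter) of its neighbours."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like objects."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, made up for illustration.
corpus = ("the bee collects sweet nectar . "
          "the bumblebee collects sweet nectar . "
          "the plumber fixes leaky faucet .").split()
vecs = context_vectors(corpus, window=2)

# bee and bumblebee never appear next to each other, but their neighbours overlap.
print(cosine(vecs["bee"], vecs["bumblebee"]))
print(cosine(vecs["bee"], vecs["plumber"]))
```

In this toy text, bumblebee is not among the neighbours of bee at all, and still the two context vectors are much closer to each other than the vectors of bee and plumber.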

Now let us go into a little more detail, first on how to compute those green counts. Okay, so as I have already said, you can compute just word co-occurrences. But they can be biased because of too popular words in the vocabulary, like stop words, and then the estimates will be rather noisy, right? So you need some way to penalize too popular words. One way to do this would be **Pointwise Mutual Information**. It says that you should put the individual counts of the words into the denominator.

This way you will understand whether these two words co-occur by chance or not. So if you look at the first formula, you see that in the numerator you have the joint probability of the words. And in the denominator you have what the joint probability would be if the two random variables were independent, right? If the words were independent, the joint probability would factorize into the product of the two individual probabilities. So in case of independent words, you will get 1 for this fraction.

And in case of dependent words that occur together too often, you will get something greater than 1. So this is the intuition of PMI: it tells you whether the words co-occur by chance or are really related.
Now do you see any more problems with this measure?

Well actually, there are some. So when you see some counts and a logarithm applied to them, you should have a bad feeling that you are going to have 0 somewhere. And indeed, you can have words that never co-occur, or words that co-occur really rarely, and then you get minus infinity or very low values from your logarithm. So a good idea is just to say, let us take the maximum of the PMI and 0. This way we will get rid of those minus infinity values, and we will get a nice Positive Pointwise Mutual Information.
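As a small sketch of the two formulas, PMI is $\log \frac{p(u, v)}{p(u)\,p(v)}$ and the positive version is $\max(0, PMI)$. The function below estimates all three probabilities by dividing raw counts by the same total, which is a simplifying assumption; the name `ppmi` and the numbers in the usage lines are made up.

```python
import math

def ppmi(pair_count, count_u, count_v, total):
    """Positive PMI from raw counts; 0.0 when the pair never co-occurs."""
    if pair_count == 0:
        return 0.0  # log(0) would be minus infinity, and max(-inf, 0) is 0
    p_uv = pair_count / total
    p_u = count_u / total
    p_v = count_v / total
    pmi = math.log(p_uv / (p_u * p_v))
    return max(pmi, 0.0)

print(ppmi(5, 10, 20, 1000))     # co-occur far more often than chance: positive
print(ppmi(1, 100, 200, 1000))   # co-occur less often than chance: clipped to 0
print(ppmi(0, 10, 20, 1000))     # never co-occur: 0 instead of minus infinity
```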

![ws_1.png](pics/ws_2.png)

![ws_1.png](pics/ws_3.png)

This is the measure that is usually used, and the idea behind all these measures is actually just the distributional hypothesis. It says that you can know a word by the company it keeps. So the meaning of the word is somehow defined by the contexts of this word.


Well if you want to measure a cosine similarity between these long, sparse vectors, maybe it's not a good idea. So it is long, it is noisy, it is too sparse. Let us try to do some dimensionality reduction.

![ws_1.png](pics/ws_4.png)

You have lots of different options how to do this factorization, and we will get into them later. What we need to know now is that we are going to compare now the rows of v matrix instead of the original sparse rows of X matrix.

This way we will get some measure of whether the words are similar, and this will be the output of our model. So far we have looked into how our words occur with other words from a sliding window. So we had some contexts, which would be words from a sliding window. 

![ws_1.png](pics/ws_5.png)

However, we can have some more complicated notion of context. For example, you can have syntactic parses. Then you will know that there are syntactic dependencies between the words, and you can see that some word has co-occurred with another word with this particular type of relationship between them, right? So for example, Australian has co-occurred with scientist as a modifier. So in this model, you will say that your contexts are a word plus the type of the relationship. And in this case you will not have a square matrix, but a matrix of words by contexts, where the vocabulary of contexts consists of those word-plus-relation units, okay? This is actually a better idea, because syntax can really help you to understand what is important in the local context and what is not, that is, what is just some random co-occurrence that is nearby but not meaningful.
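Here is a rough sketch of how such a words-by-contexts matrix could be assembled. The dependency triples are hand-written stand-ins for parser output, and both the relation labels and the `-inv` suffix for the reverse direction are conventions assumed for this example only.

```python
from collections import Counter

# Hypothetical (head, relation, dependent) triples, as a dependency parser might emit them.
triples = [
    ("scientist", "amod", "Australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
    ("scientist", "amod", "young"),
]

counts = Counter()
for head, rel, dep in triples:
    counts[(head, f"{dep}/{rel}")] += 1       # the head sees its dependent plus the relation
    counts[(dep, f"{head}/{rel}-inv")] += 1   # the dependent sees its head plus the inverse relation

words = sorted({word for word, _ in counts})
contexts = sorted({context for _, context in counts})

# Rows are words, columns are word-plus-relation contexts, so the matrix is not square.
matrix = [[counts[(w, c)] for c in contexts] for w in words]
print(len(words), "words x", len(contexts), "contexts")
```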

# 2. Explicit and implicit matrix factorization

In the previous chapter, you could see that to build models of distributional semantics we need some kind of matrix factorization. And now we are going to cover different ways to perform matrix factorization in more detail.

Let us start with an approach based on **singular value decomposition**. So this is just a fact from linear algebra that says that every **matrix can be factorized into three matrices**. The left and the right matrices will contain the so-called **left and right singular vectors** (the eigenvectors of $XX^T$ and $X^TX$). And the matrix in the middle will be **diagonal, and the values on the diagonal, the singular values, are the square roots of the eigenvalues of $X^TX$**.

![mf_1.png](pics/mf_1.png)

Now importantly, those values on the diagonal will be sorted in the decreasing order. And how many of them do we have there? Well, the number of those values on the diagonal will be the number of non-zero eigenvalues of X transposed X. So it is related to the rank of that matrix.

Awesome, so this is just something that we know, but how can we use this in our task?

The idea is as follows. Once those values are sorted, what if we keep just the first k components? Probably, if we keep just these blue regions of every matrix, we will get some nice approximation of the original X matrix.

![mf_1.png](pics/mf_2.png)

Now, it sounds rather hand-wavey, right? What is a nice approximation? Actually, there is a very accurate answer to this question.

So the blue part is always the best approximation of rank k for the X matrix, in terms of the loss that is written at the bottom of this slide.

Now do you remember that, actually, in the previous chapter, we were going to factorize our matrix into two matrices, not into three matrices, okay.

Well, what can we do about it? We can just use some heuristics, actually. One idea would be to say, okay, let us take the first and the second matrix and put their product into the $\Phi$ matrix, and then take the last one and put it into $\Theta$. Another option would be to say that the diagonal matrix in between should be split fairly between the two matrices $\Phi$ and $\Theta$. So we apply the square root and say that one square root goes to the left and another square root goes to the right.
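Both heuristics are easy to try out with NumPy. In this sketch the matrix `X` is just random numbers standing in for a PMI matrix, and the variable names (`Phi_1`, `Theta_1`, and so on) are chosen for the example; only `numpy.linalg.svd` itself is the real library call.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))  # stand-in for a words-by-contexts PMI matrix

# Thin SVD: X == U @ diag(s) @ Vt, with singular values s sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
X_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation (the "blue part")

# Heuristic 1: the diagonal matrix goes entirely to the left factor.
Phi_1, Theta_1 = U_k * s_k, Vt_k.T

# Heuristic 2: split the diagonal matrix, one square root on each side.
Phi_2, Theta_2 = U_k * np.sqrt(s_k), Vt_k.T * np.sqrt(s_k)

# Both factorizations reproduce the same rank-k approximation.
print(np.allclose(Phi_1 @ Theta_1.T, X_k), np.allclose(Phi_2 @ Theta_2.T, X_k))

# The Frobenius error equals the norm of the discarded singular values.
print(np.linalg.norm(X - X_k), np.sqrt(np.sum(s[k:] ** 2)))
```

The rows of `Phi_2` and `Theta_2` are then the $k$-dimensional word and context vectors.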

![mf_1.png](pics/mf_3.png)

Okay, so you can see that SVD provides some ways to build the $\Phi$ and $\Theta$ matrices. Now let us summarize again what we have just realized. We were going to build models of distributional semantics, which means we have some word co-occurrence matrix filled with $PMI$ values or with some co-occurrence counts. And we were going to represent it with a factorization into $\Phi$ and $\Theta$ matrices. And then we could see that **SVD** can provide this for us. The rows of these matrices, the $\phi_u$ and $\theta_v$ vectors, will be our embeddings, the word vectors that we are interested in. Also, if you take the inner product of $\phi_u$ and $\theta_v$, you will get a value approximately equal to, let's say, the $PMI$ of $u$ and $v$ in the left-hand side matrix, right? So I just want you to remember this notation.

![mf_1.png](pics/mf_4.png)

We have the $\Phi$ and $\Theta$ matrices, and we have the $\phi_u$ and $\theta_v$ vectors. And the matrix way of thinking about it corresponds to the element-wise way of thinking: every element, like a $PMI$ value, is approximately equal to the dot product of $\phi_u$ and $\theta_v$. Okay, awesome. So far we have been using the squared loss, because SVD deals with the squared loss, but maybe this is not perfect and we can do better.

So the next model, which is called GloVe (global vectors), tries to do better.

![mf_1.png](pics/mf_5.png)

So don't be scared. Let us break it down. You can see that it is still a squared loss, but it is weighted. The $f$ function provides the weights, and you can see that $f$ is increasing. So if the word co-occurrence counts are higher, then this particular element of the matrix is more important to approximate exactly.

Okay, now I have a question for you. Why does this green line for the $f$ function stop increasing at some point and just continue as a constant?

![mf_1.png](pics/mf_6.png)

Well, you might remember about stop words that are not important. So starting from some moment, the words should not get bigger weights here just because this is somehow noisy.
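A weighting function with this shape can be sketched as follows. The power-law form and the defaults `x_max=100` and `alpha=0.75` are the ones reported in the GloVe paper rather than anything stated on the slide, so treat them as assumptions here.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight f(x) for a co-occurrence count x: increasing up to x_max, then constant."""
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in (1, 10, 50, 100, 10_000):
    print(count, glove_weight(count))
```

Very frequent pairs, which often involve stop words, all get the same weight of 1 and cannot dominate the loss any further.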



Awesome, now let us look into the brackets and see that we have a green part and the red part there. 

So the red part is our **original matrix**. We used to have word co-occurrences or PMI values there, now we have logarithms of the co-occurrence counts. Okay, why not?

Now what is the green part? Usually we would have just the inner product of $\phi_u$ and $\theta_v$, which would correspond to our matrix factorization task. Now it's almost the same. We just have those $b$ terms, which are bias terms. We say, well, maybe we should tune them as well and have some more freedom in our model. But this is, again, not so important.

Now, how do we train this model? In case of SVD, we have an exact recipe from linear algebra for how to build those three matrices. Now we just have some loss that we want to minimize. What do we do with a loss that we want to minimize?

Well, we can try to do **stochastic gradient descent**. In this case, we can treat every element of the matrix as an example. So we take an element, perform one step of gradient descent, update the parameters, take another element, and proceed in this way. Finally, we will get our global vectors model trained, and we'll obtain those $\phi_u$ and $\theta_v$ vectors that can be used as word embeddings. And actually, this model is a rather recent and very nice way to produce word embeddings, so please keep it in mind.
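The training loop described above might look roughly like this. It is a sketch with plain SGD on a tiny hand-made count matrix: the function name `train_glove`, the learning rate, the dimensionality, and the counts are all assumptions, and the weighting defaults are taken from the GloVe paper (which, as far as I know, trains with AdaGrad rather than plain SGD).

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight f(x): increasing with the count, constant after x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def train_glove(cooc, vocab_size, dim=4, lr=0.05, epochs=200, seed=0):
    """Plain SGD over the non-zero cells of a co-occurrence matrix."""
    rng = np.random.default_rng(seed)
    Phi = rng.normal(scale=0.1, size=(vocab_size, dim))
    Theta = rng.normal(scale=0.1, size=(vocab_size, dim))
    b_phi = np.zeros(vocab_size)
    b_theta = np.zeros(vocab_size)
    cells = list(cooc.items())
    epoch_losses = []
    for _ in range(epochs):
        total = 0.0
        for idx in rng.permutation(len(cells)):  # every matrix element is one example
            (u, v), x = cells[idx]
            diff = Phi[u] @ Theta[v] + b_phi[u] + b_theta[v] - np.log(x)
            weight = glove_weight(x)
            total += weight * diff ** 2
            grad = 2.0 * weight * diff           # derivative of the weighted squared error
            phi_u_old = Phi[u].copy()
            Phi[u] -= lr * grad * Theta[v]
            Theta[v] -= lr * grad * phi_u_old
            b_phi[u] -= lr * grad
            b_theta[v] -= lr * grad
        epoch_losses.append(total)
    return Phi, Theta, epoch_losses

# Tiny symmetric co-occurrence counts for a three-word vocabulary (made up).
cooc = {(0, 1): 10.0, (1, 0): 10.0, (0, 2): 2.0, (2, 0): 2.0, (1, 2): 5.0, (2, 1): 5.0}
Phi, Theta, losses = train_glove(cooc, vocab_size=3)
print(losses[0], losses[-1])
```

Zero counts are simply left out of `cooc`, since their logarithm is undefined and their weight would be zero anyway.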

Now, let us forget about matrix factorization for a few slides. Let us think about what would be other approaches to build word embeddings. Another approach to think about the task would be language modeling. So just a recap, language modelling is about probabilities of some words given some other words. 

![mf_1.png](pics/mf_7.png)

So in this case, in case of so called skip-gram model, we are going to produce the probability of context given some focus word. For example, the context words can come from some window of a fixed size.

Now, we assumed that this window is represented as a bag of words. So that's why we'll just go to this product and have probabilities of one word given another word.

Now how would we produce these probabilities? You can see the formula at the bottom of the slide, and you might recognize that it is softmax.

Okay, so it means that it will indeed be normalized as we want, as a correct probability distribution over the vocabulary. Now what is inside? Again, inner products between $\phi_u$ and $\theta_v$. These inner products correspond to some similarity between the two vectors. We take these similarities, normalize them using softmax, and get probabilities.
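As a minimal sketch of this softmax, assuming the slide's formula is $p(v \mid u) = \exp(\phi_u^T \theta_v) / \sum_{v'} \exp(\phi_u^T \theta_{v'})$, the code below computes the whole distribution for one focus word. The tiny two-dimensional vectors are made up.

```python
import math

def softmax_probs(phi_u, Theta):
    """p(v | u) for every context word v, from inner products with phi_u."""
    scores = [sum(a * b for a, b in zip(phi_u, theta_v)) for theta_v in Theta]
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(score - shift) for score in scores]
    normalizer = sum(exps)  # this sum runs over the whole vocabulary
    return [e / normalizer for e in exps]

phi_u = [1.0, 0.0]
Theta = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one row per context word
probs = softmax_probs(phi_u, Theta)
print(probs, sum(probs))
```

The normalizer touches every word in the vocabulary for every single probability, which is where the cost comes from.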

Okay, so this model is called **Skip-gram**, it is very important, but unfortunately, it is rather slow. Because softmax computation is slow especially if you have a big vocabulary. What can we do about it?

Actually we can just say let us reformulate the whole procedure like that.

![mf_1.png](pics/mf_8.png)

You can see that we have some green and red parts of this slide. The green part corresponds to positive examples.

The positive examples are about those pairs of words, $u$ and $v$, that co-occurred together.

We just take them from data, right? For such a pair of words that co-occur, we want to predict yes. And there are some other pairs of words that do not co-occur, which we just sample randomly, and for them we want to predict no. Now how do we model these predictions? We model them with the sigmoid function. So you'll see that now the sigmoid function is applied to inner products.

It will give us the probabilities of yes and no.

Okay, now maybe you are somehow scared by the mathematical expectation that you see here. Actually, we do not compute any mathematical expectation, though we write it in this theoretical way. What we do is just sample. So $k$ corresponds to the number of samples: we sample $k$ words from the vocabulary for every given $u$ word, and we use all those samples in this term. So you can just forget about the expectation for now.
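For one positive pair, the sampled version of this objective can be sketched like this. The loss below is written as a negative log-likelihood to minimize, negatives are drawn uniformly (word2vec itself uses a smoothed unigram distribution), and the vectors are hand-made, so all of these are assumptions for the example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_loss(phi_u, theta_pos, theta_negs):
    """Negative log-likelihood for one observed pair and its sampled negatives."""
    loss = -math.log(sigmoid(dot(phi_u, theta_pos)))       # we want "yes" here
    for theta_neg in theta_negs:
        loss -= math.log(sigmoid(-dot(phi_u, theta_neg)))  # we want "no" here
    return loss

Theta = [[3.0, 0.0], [-3.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]  # context vectors, made up
phi_u = [1.0, 0.0]

k = 2
rng = random.Random(0)
# Uniform sampling may occasionally pick the positive word itself; real code would handle that.
negative_ids = rng.choices(range(len(Theta)), k=k)
loss = sgns_loss(phi_u, Theta[0], [Theta[i] for i in negative_ids])
print(negative_ids, loss)
```

Note that nothing in `sgns_loss` sums over the vocabulary: the cost per pair depends only on `k`.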

What do we get by this model? Well, again we build those embeddings for words, and now we do not need to normalize anything by the size of the vocabulary, and this is nice. So this model, **skip-gram negative sampling**, is very efficient and used a lot.


The final slide for you is somehow to understand that this skip-gram negative sampling model is still related to matrix factorization approach that we have discussed.

![mf_1.png](pics/mf_9.png)

What if we take the derivatives of this loss? So we say that, what would be the best value for the inner product?

If you do this, you will understand that the best value for this inner product is a shifted PMI value, like here. So you'll see that it is PMI value minus logarithm of $k$, where $k$ is the number of negative samples.
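This fact can be checked numerically for a single cell. The sketch assumes the per-pair objective $\#(u,v)\log\sigma(x) + k\,\frac{\#(u)\,\#(v)}{|D|}\log\sigma(-x)$, where $x$ is the inner product and negatives are drawn in proportion to word frequency; the counts are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_objective(x, n_uv, n_u, n_v, total, k):
    """SGNS objective restricted to one (u, v) cell, as a function of the inner product x."""
    expected_negatives = k * n_u * n_v / total
    return n_uv * math.log(sigmoid(x)) + expected_negatives * math.log(sigmoid(-x))

n_uv, n_u, n_v, total, k = 50, 200, 300, 10_000, 5  # made-up counts

pmi = math.log(n_uv * total / (n_u * n_v))
shifted = pmi - math.log(k)

# Brute-force search for the best inner product on a fine grid.
grid = [i / 1000 for i in range(-5000, 5001)]
best_x = max(grid, key=lambda x: local_objective(x, n_uv, n_u, n_v, total, k))

print(best_x, shifted)
```

The grid maximum lands on the shifted PMI value up to the grid resolution.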

![mf_1.png](pics/mf_10.png)

This is just a nice fact. It was published in a recent paper, and it says that even though in the skip-gram negative sampling model we do not think about any matrices, **you can still interpret it as some implicit matrix factorization of shifted PMI values into our usual two matrices**, $\Phi$ and $\Theta$. This is a rather important fact, because it says that more recent models are still very similar to earlier models, like SVD of PMI values. The only things that are different are that we have shifted PMI and some other loss. Okay, this is all for the mathematical part of the methods, and in the next chapter, we will see how to use these models.



# 3. Word2vec and doc2vec (and how to evaluate them)

In the previous chapter, we got all the necessary background to see what is inside **word2vec** and **doc2vec**. These two models are rather famous, so we will see how to use them in some tasks.

Okay, let us get started with word2vec. So it is just some software package that has several different variants.

![wtv_1.png](pics/wtv_1.png)

One variant would be **continuous bag-of-words**. It means that we try to predict one word given the context words. Another option would be to do vice versa and predict context words given some word, and this one is called **skip-gram**. Now, softmax computation is usually too slow, so producing those probabilities directly is not efficient.

So there are some ways to avoid that, and one way would be **negative sampling**. You might remember from the previous chapter that we have already discussed the **skip-gram negative sampling model**. And this is one of the architectures of this word2vec program.

This is open source, so you can just find the code there. 

Okay, now how do we use these models? 
![wtv_2.png](pics/wtv_2.png)

One task would be to produce some meaningful similarities between words. So you remember that we could build **word embeddings**, some vectors that represent the meaning of the word. Now if we just apply cosine similarity to those vectors, we will get some measure of similarity between the words.

How can we test this model? How can we see that actually those similarity measures are good and somehow meaningful?

Well this is actually a very complicated question, but we can **use some human judgements**. 

So what we see is that there are some data sets provided by linguists that look like the first table in this slide. For example, they say that tiger and tiger are super similar. And media and radio are also similar, but not to that extent, and so on. So you have some ranked list of word pairs with their similarities. 

Now you can produce similarities for the same word pairs with your model, as in the table on the right, and then just compare these two ranked lists, let's say with **Spearman's correlation**. Then you will see whether your model agrees with the assessors. Obviously, using human judgements is not always the best way. **It would be better to use some extrinsic evaluation**. For example, you could build a ranking system and then apply word similarities there, compute the quality of the ranking system, and use this to evaluate your model.
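The comparison of the two ranked lists can be sketched without any libraries: Spearman's correlation is the Pearson correlation of the ranks. The word pairs and all the scores below are illustrative numbers, not taken from any real dataset or model.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# One score per word pair: assessors on the left, cosine similarities on the right.
human = [10.0, 7.4, 3.1, 0.5]
model = [0.99, 0.55, 0.60, 0.05]
print(spearman(human, model))
```

In practice one would more likely call a library routine such as `scipy.stats.spearmanr`; the hand-written version is here only to show what is being computed.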

![wtv_2.png](pics/wtv_3.png)

Okay, anyways, let us come to the next task. The next task is rather appealing. So if you have not heard about it, look, we have some vectors for the words. For example, we have a vector for king, then we can apply some arithmetic operations over these vectors. Like king minus man plus woman, we get some other vector, and the closest word for this vector will be queen, you see? So we can somehow understand relations between the words. We can understand that man to woman is related in the same way as king to queen.

You can think about some other analogy like, for example, Moscow minus Russia plus France will be equal to Paris, and something like that. When I say equal, I mean that the cosine similarity gets its maximum for the target word. This task became very famous after the recent papers. However, it has been studied a lot in cognitive science, where it is called **relational similarity**. In contrast, the similarity that we have been discussing up to this moment is called **attributional similarity**.

Now how do you evaluate the word analogy task? Again, we usually rely on human judgments. So there are some datasets that say that man relates to woman the same way as king relates to queen, and so on, for many, many examples. And then we try to predict the last word and compute the accuracy of these predictions.
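The prediction and the accuracy computation can be sketched as below. The four-dimensional vectors are hand-made so that the arithmetic works out perfectly, which real embeddings will not do, and the helper `solve_analogy` (including its choice to skip the three query words) is an assumption about the usual protocol rather than a specific library's behaviour.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(a, a_prime, b, vectors):
    """Word closest (by cosine) to b - a + a', skipping the three query words."""
    target = [vb - va + vap
              for va, vap, vb in zip(vectors[a], vectors[a_prime], vectors[b])]
    candidates = [w for w in vectors if w not in (a, a_prime, b)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Idealized toy embeddings, made up for illustration.
vectors = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

# Each item reads: a is to a' as b is to b'.
dataset = [("man", "woman", "king", "queen"), ("king", "queen", "man", "woman")]
hits = sum(solve_analogy(a, ap, b, vectors) == bp for a, ap, b, bp in dataset)
accuracy = hits / len(dataset)
print(accuracy)
```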

Awesome, now let us see how different models perform on these two tasks.

![wtv_2.png](pics/wtv_4.png)

So let us try to remember what every model is about. The first row is about PMI: we can compute PMI values between the words and just use the long sparse vectors of PMI values as word embeddings. Second, we can apply SVD to the PMI matrix and get dense, low-dimensional vectors. Then we can use the skip-gram negative sampling model, which we have discussed a lot in the previous chapter. Now, do you remember what GloVe is?

![wtv_2.png](pics/wtv_5.png)

Well, GloVe was also covered in the previous chapter. And it was about **matrix factorization with respect to weighted squared loss**. So you might remember this green $f$ function that was increasing and at some point just went constant, so as not to be overwhelmed by too frequent words.

Okay, so we have four methods, **different ways to perform matrix factorization**, maybe implicit matrix factorization, and obtain **word embeddings**. And you can see that they actually perform really similarly. The different columns here correspond to different datasets of word pairs.

So you can see that the bold best values are spread around this table. So very old methods like **SVD** are not much worse than very recent methods like **skip-gram negative sampling**.

Now what happens with the word analogy task? There are also two datasets here, one from Google and another from Microsoft Research.

And one takeaway would be that the quality is very nice. So for the Google dataset it is about 70% accuracy, which means that in 70% of cases we can guess the right word. For example, we can guess that king minus man plus woman is equal to queen.

This is awesome, but actually we'll see some problems with that in the next chapter.

Okay, now let us come to **paragraph2vec** or **doc2vec**. Actually, these two names are about the same model. The name **paragraph2vec** comes from the paper, and the name **doc2vec** comes from the `gensim` library, where it is implemented.

![wtv_2.png](pics/wtv_6.png)


You remember that in word2vec we had two architectures: we either produce contexts given some focus word or, vice versa, the focus word given some contexts. Now we can also have some document ID. So we will treat the document the same way as we treated words: we will have some ID in a fixed vocabulary of documents, and then we will build embeddings for the documents. Now there are, again, two architectures.

The first architecture, **DM** (distributed memory), is about **providing the probabilities of focus words, given everything we have**, that is, the context words and the document. And the **DBOW** (distributed bag of words) architecture is about **providing the probability of the context given the document**. So the last one is somehow similar to the skip-gram model, right? But instead of the focus word, we condition on the **document**. Now, how can we use this model? Well, we can use it to provide some document similarities and apply them, for example, for ranking again. How can we test that the document similarities provided by our model are good?

![wtv_2.png](pics/wtv_7.png)

Well, we need some test set again. The dataset released with the paper at the bottom of the slide provides triplets of arXiv papers. We have some paper, then another paper which is known to be similar, and then a third paper which is dissimilar. So the task is to predict which one is the dissimilar one. If the model can do this, then the model provides good estimates of document similarities. And so we'll just compute the accuracy of this prediction task, okay? Now I want just to sum up everything that we have covered.
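The triplet evaluation can be sketched as follows. The document IDs and their two-dimensional vectors are made up (real document embeddings would come from a trained model), and `triplet_accuracy` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def triplet_accuracy(triplets, doc_vectors):
    """Share of triplets where the known-similar paper is closer than the dissimilar one."""
    correct = 0
    for anchor, similar, dissimilar in triplets:
        sim_to_similar = cosine(doc_vectors[anchor], doc_vectors[similar])
        sim_to_dissimilar = cosine(doc_vectors[anchor], doc_vectors[dissimilar])
        if sim_to_similar > sim_to_dissimilar:
            correct += 1
    return correct / len(triplets)

# Made-up document embeddings; "nlp_3" is deliberately a bad one.
doc_vectors = {
    "nlp_1": [1.0, 0.1],
    "nlp_2": [0.9, 0.2],
    "nlp_3": [0.0, 1.0],
    "physics_1": [0.1, 1.0],
    "physics_2": [0.2, 0.9],
}

# (paper, similar paper, dissimilar paper)
triplets = [
    ("nlp_1", "nlp_2", "physics_1"),
    ("physics_1", "physics_2", "nlp_1"),
    ("nlp_1", "nlp_3", "physics_2"),
]
print(triplet_accuracy(triplets, doc_vectors))
```

The third triplet fails because the vector for `nlp_3` sits among the physics papers, so the accuracy drops below 1.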

![wtv_2.png](pics/wtv_9.png)

So there are models called word2vec and doc2vec, which are actually not even models, but rather implementations of different architectures.

You can find them, for example, in the `gensim` library and play with them. There are different ways to use these models, and for every usage we need some dataset to evaluate whether the provided word similarities or document similarities are good enough. Some other ways to evaluate these models would be to see whether each component of the vector is interpretable in some way, or to look into the geometry of the embedding space. This might be more complicated, so we are not going into the details of these ways, and maybe it is also not so needed. The one takeaway that you really need to understand is that count-based methods, like SVD applied to PMI matrices, are not so different from prediction-based methods like word2vec. So there is no magic behind them, and in the next chapter, we will actually see some problems behind them.


# 4. Word analogies without magic: king - man + woman != queen

In the previous chapter, you have seen that word2vec is smooth and works nicely for lots of different tasks. However, in this chapter we will raise some doubts, and we will see that, especially for the word analogy task, everything is not so smooth.

![wa_1.png](pics/wa_1.png)

Just to recap, the **word2vec** model is trained in an unsupervised manner. It means that the model just sees, let's say, Wikipedia, and uses it to obtain word vectors.

Now the word vectors that are obtained have some nice properties. For example, if you take the vector for king, subtract the vector for man and add the vector for woman, you will get some vector, and the closest word to this one will be queen. And this is awesome, right? So it looks like the model could understand some meaning of the language, even though we did not have this in the data explicitly.

But, well, let us look into this in more detail. How is this search for the closest word performed?
![wa_1.png](pics/wa_2.png)

So you see that we have this arithmetic expression, and then we maximize the cosine similarity between the result of the expression and all the candidates in our space. But we exclude three candidates from this search. So we say that our source words, let's say king, man and woman, do not participate in this search.

And well, you know what, this is a rather important trick that is usually omitted in descriptions of the word2vec model.

However, let us see what would happen if we looked into this honestly and if we performed the maximization in the whole space.

The picture shows what would be the closest neighbor to the arithmetic expression in case of this honest search. The color shows the ratio. The names on the left correspond to different categories of analogies. It is not so important for now, so let us look into the last one, which is called Encyclopedia. The example about king will fall into this one. So what we see is that when we do king minus man plus woman, we get some vector, and in most cases it will be close to the b vector, which is king here. Also, in some cases it can be close to the a prime vector, which is woman here. But never to the b prime vector, which is our target queen vector. So you see that in, let's say, 80% or 90% of different analogies, we find the vector which is close to the b vector instead of the target b prime vector.

Well, you know, it somehow ruins a little bit the picture that word2vec understands our language. Now I want to dig a little bit deeper into it. How can it be that when we exclude the a, a prime and b vectors, we actually find the b prime vector, but if we do not exclude them, we end up with the b vector? I think that this picture can shed some light.

![wa_1.png](pics/wa_3.png)

The thing is that the shift vector a prime minus a seems to be close to 0. So this plus woman minus man is close to 0. It means that when we apply our arithmetic expression and try to find the closest neighbor, well, the closest neighbor is actually b. But once b is excluded, the next closest neighbor is indeed b prime. And we say that, okay, king is excluded and queen is found.

Okay, so maybe we can just use much simpler methods to do this. I mean, can we just take the nearest neighbor of b and not apply any arithmetic operations at all? Well, some people tried that for one particular category of analogies, the plural category, where apple to apples is the same as orange to oranges.

**Just the strategy of taking the closest neighbor of b results in 70% accuracy**. So you see this is a really high accuracy, very similar to what we could see for word2vec in the previous chapter. And just by a very dumb approach.
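The three search strategies discussed above can be compared side by side on toy data. The vectors are hand-made specifically so that the woman minus man shift is small, which is the situation described in the text, so this only illustrates the mechanism and is not evidence about real embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, vectors, exclude=()):
    """Word whose vector has the highest cosine similarity to `query`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Made-up embeddings: man/woman are close to each other, and so are king/queen.
vectors = {
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.9, 0.2, 0.0],
    "king":  [1.0, 0.2, 0.0, 0.0],
    "queen": [0.9, 0.2, 0.1, 0.4],
    "apple": [0.0, 0.0, 0.1, 1.0],
}

a, a_prime, b = "man", "woman", "king"
query = [vb - va + vap
         for va, vap, vb in zip(vectors[a], vectors[a_prime], vectors[b])]

with_exclusion = nearest(query, vectors, exclude=(a, a_prime, b))  # the usual protocol
honest = nearest(query, vectors)                                   # search the whole space
baseline = nearest(vectors[b], vectors, exclude=(b,))              # no arithmetic at all

print(with_exclusion, honest, baseline)
```

With these vectors the honest search returns king itself, while the excluded search and the plain nearest-neighbor baseline both return queen.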

This is another visualization of the same idea.

![wa_1.png](pics/wa_4.png)

So this comes from a recent paper, and it says: let us split our word analogy examples into several buckets. Those analogy examples where the b and b prime vectors are similar go to the right, and those examples where the b and b prime vectors are not so similar go to the left.

Now the blue bars in this slide show the accuracy of word2vec for every bucket.

So you can easily see that the blue bars are high on the right and low on the left. Which means that **word2vec works really nicely in those analogies where b and b prime are similar**. And it **works poorly for those more complicated tasks where they are not similar**.

Now let us see what those more complicated tasks are. So let us study what types of analogies are covered in this diagram. There are actually four main types of analogies.

![wa_1.png](pics/wa_5.png)

For example you can find actor and actress in the very bottom line. This is kind of the same thing as our king and queen example. But we have much more here. 

1. We have some morphological examples. We have inflectional morphology which means that we can just change the form of the word like orange to oranges is the same as apple to apples. 

2. We can have derivational morphology, which can also change the part of speech, like bake to baker is the same as play to player. 

3. Now we have lots of different semantic analogies. For example, we have hypernyms there. This would be, for example, peach to fruit is the same as cucumber to vegetable.

4. We have many more, for example, the nice one is about colors like blood is red and sky is blue. 

And there are many different options, and it is not so easy to build this dataset, so we need some **linguistic expertise**. Anyway, once we have this, we can look into how word2vec performs for different analogies, and we can compare word2vec with a very simple baseline. The baseline would be just to take the closest neighbor to one of the query words.

So here we go.
![wa_1.png](pics/wa_6.png)

Each line here corresponds to some analogy example. For example, one line could correspond to apple to apples is the same as orange to oranges. Now the left point of every line is the performance of the baseline, and the right point of every line is the performance of word2vec. So it means that horizontal lines show you that word2vec is not better than the baseline. When the line has a high slope, it means that word2vec does a good job. So you see that for **inflectional morphology, which is an easier task, word2vec performs well, and for derivational morphology all the lines are horizontal**.
![wa_1.png](pics/wa_6.png)

Now what happens with semantic analogies? Well, this is a nice picture. The thing on the left is about different types of analogies, and most of them have a horizontal slope, which means that word2vec doesn't work for them. But two lines, the red lines, have a high slope, and those two are the examples about genders, like man to woman is as king to queen is as actor to actress and so on. And the picture on the right is about some named entities, and the three red lines are about countries and capitals, the examples that are really popular in word2vec descriptions. For example, Moscow to Russia is the same as Paris to France.

So you know what? **Those very famous examples are kind of the only ones that actually work with word2vec.** I mean, they are not the only ones, but it looks like things are generally worse for other, randomly chosen tasks.

**Okay, so the takeaway of this insight would be that you should be very careful about hype that you see around.** So it is always nice to dig into some details, like how is the evaluation performed? What would happen with slightly different tasks? And then see whether the model still provides good solutions or not.

**So to me it looks like word2vec works nicely for word similarity tasks**. For example, if you have some application where you need to understand that tap and faucet are really similar and should be placed into one category, then word2vec is your choice.

But you shouldn't be blinded, and you shouldn't think that it somehow solves the language or provides solutions for the word analogy task in all cases. So it works sometimes, but not always. And this is a nice question for further research.

Okay, in the next chapter, we'll talk about some extensions of techniques like word2vec. We will see what the current state-of-the-art approaches are and what open-source implementations you can use in your applications. So stay tuned, and we will get some practical advice on what models to build in your cases.


# 5. Why words? From character to sentence embeddings

Hey. We have just covered lots of ways to build **word embeddings**. But, you know what? Why words? I mean, sometimes we need representations for sentences, and it's not so obvious so far how to get that, and in some cases, **we need to go to sub-word level**. For example, we might have a language with rich morphology and then it would be nice to somehow use this morphology in our models. 

Actually, linguistics can be really helpful. So, we will see a couple of examples right here. Let us start with morphology. 
![mor_1.png](pics/mor_1.png)
So, for example, in English you can say mature or immature, and then relevant, irrelevant and so on. So, you know that there are some **prefixes** that can change the meaning of the word to the **opposite one**. So we will have antonyms. Now, on the other hand, you can understand that there are **some suffixes that do not change the semantics of the words a lot**. For example, I don't know, bake and baker are still about similar concepts in some sense. So the idea of the model proposed in the paper at the bottom of the slide is the following.

Let us try to put the words that have some **not important morphological changes together in the space**, and on the **contrary**, for words that have some **prefixes that change the meaning of the word completely**, let us put their embeddings as far apart as possible.

So you see, you try to put some words closer together and some words farther apart.

You can do this in many ways. Let's say it would be some **regularization** of your model, or you will have some loss **and then add some other losses** to make sure that you have these additional constraints. Okay?
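
One way to picture such an extra loss is the sketch below. It is not the objective from the paper on the slide. It is a generic pull-together / push-apart penalty added to some base loss, and the function names, the `margin`, the `weight`, and the toy vectors are all hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def morphology_penalty(vectors, similar_pairs, opposite_pairs, margin=1.0):
    """Extra loss term that encodes the morphological constraints."""
    # Pull together pairs whose morphology barely changes the meaning.
    pull = sum(squared_distance(vectors[a], vectors[b])
               for a, b in similar_pairs)
    # Push apart pairs whose prefix flips the meaning; pairs that are
    # already at least `margin` apart contribute nothing.
    push = sum(max(0.0, margin - squared_distance(vectors[a], vectors[b]))
               for a, b in opposite_pairs)
    return pull + push

def total_loss(base_loss, vectors, similar_pairs, opposite_pairs, weight=0.1):
    """Base embedding loss plus the weighted morphological regularizer."""
    penalty = morphology_penalty(vectors, similar_pairs, opposite_pairs)
    return base_loss + weight * penalty

# Hypothetical 2-dimensional embeddings.
toy = {
    "bake": [0.0, 0.0], "baker": [0.0, 1.0],
    "mature": [1.0, 0.0], "immature": [1.0, 0.5],
}
penalty = morphology_penalty(toy, [("bake", "baker")], [("mature", "immature")])
print(penalty)  # 1.0 for the pull term plus 0.75 for the push term
```

During training the gradient of this penalty would move bake and baker closer and mature and immature apart, while the base loss keeps doing its usual job.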

This idea is nice, but sometimes we don't have linguists to tell us what the morphological patterns in the language are. What can we do then?

Well, in this case, we can try just to have a more brute-force approach.

![mor_2.png](pics/mor_2.png)

This would be to go to character n-grams. This is the **FastText** model proposed by Facebook Research, and it is really famous just because it has a good implementation and you can play with it. So, the idea is as follows. **Let us represent a word by a set of character n-grams, and also let us put the word itself into this set as well.**

For example, for character n-grams of size three, we'll have the example on this slide.
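
Here is a small sketch of how such a set could be built. The `<` and `>` word-boundary markers follow the convention used by FastText. The function name `char_ngrams` and its parameters are made up for this example, and the word on the slide may differ from the one used here.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Set of character n-grams of a word, plus the whole word itself."""
    marked = "<" + word + ">"          # mark the word boundaries
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)                  # the word itself is also a unit
    return grams

print(sorted(char_ngrams("where")))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```

The boundary markers let the model tell a prefix like `<wh` from the same letters in the middle of a word.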

Usually, we will have several values of n, like n from three to six. And we will have n-grams of different lengths in the same set, and this set will represent the word. Now, how can we use this? Well, if you remember, in skip-gram negative sampling we had some similarity between two words, and we could represent it just by the dot product of these words' vectors. Now, what if the words are represented not by vectors but by sets of vectors?

Well, we can sum. So, we can say that now we have a sum over all those character n-grams, and every character n-gram is represented by a vector.
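
As a rough sketch, the score between a word and a context word then looks like this. The n-gram vectors and the context vector are hypothetical toy numbers, and the function names are made up for this example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def subword_score(word_ngrams, ngram_vectors, context_vector):
    """Sum over the word's n-grams of (n-gram vector . context vector)."""
    return sum(dot(ngram_vectors[g], context_vector) for g in word_ngrams)

# Hypothetical 2-dimensional vectors for the n-grams of "bee".
ngram_vectors = {
    "<be":   [1.0, 0.0],
    "bee":   [0.5, 0.5],
    "ee>":   [0.0, 1.0],
    "<bee>": [1.0, 1.0],
}
context_vector = [2.0, 1.0]  # e.g. the context vector of "honey"

print(subword_score(list(ngram_vectors), ngram_vectors, context_vector))  # 7.5
```

Because the dot product is linear, this is the same as first summing the n-gram vectors into one word vector and then taking a single dot product with the context vector.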

Awesome. So, I think the idea is great, and it works well for languages with **rich morphology**. So, the FastText model provides a nice way to represent sub-word units.

![mor_2.png](pics/mor_3.png)

Now, what if we need to go to another level and represent sentences? Do you have any ideas how to build sentence embeddings? There are some ideas summarized in this slide. So, the simplest idea would be: what if we just take pre-trained vectors, let's say from a word2vec model, and average them to obtain the embedding of the sentence? Well, you might also have some weighted average, for example with TF-IDF weights.
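
Both variants fit in a few lines. This is a sketch under assumptions: the word vectors are toy numbers standing in for pre-trained ones, the IDF formula (log of inverse document frequency plus one) is just one common variant, and the function names are made up. Term frequency is handled implicitly, because a word that occurs twice in a sentence is added twice.

```python
import math

def average_embedding(tokens, vectors, weights=None):
    """(Weighted) average of the vectors of the known tokens, or None."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return None
    weights = weights or {}
    dim = len(vectors[known[0]])
    total = [0.0] * dim
    weight_sum = 0.0
    for t in known:
        w = weights.get(t, 1.0)        # unweighted tokens count as 1
        weight_sum += w
        for i in range(dim):
            total[i] += w * vectors[t][i]
    return [x / weight_sum for x in total]

def idf_weights(corpus):
    """IDF weight per token; `corpus` is a list of tokenized sentences."""
    doc_freq = {}
    for doc in corpus:
        for t in set(doc):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    return {t: math.log(len(corpus) / df) + 1.0 for t, df in doc_freq.items()}

# Hypothetical stand-ins for pre-trained word vectors.
vectors = {"the": [0.0, 0.0], "tap": [2.0, 0.0], "leaks": [0.0, 2.0]}
corpus = [["the", "tap", "leaks"], ["the", "faucet", "drips"]]

plain = average_embedding(corpus[0], vectors)
weighted = average_embedding(corpus[0], vectors, idf_weights(corpus))
print(plain, weighted)
```

In the weighted version the frequent word "the" gets a smaller share, so the sentence vector moves toward the content words.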

But, you know what? It might not be the best approach, because those pre-trained vectors are trained with some other objectives, and they might not suit our task well.

So, another idea would be somehow to represent the sentences as a sum of sub-sentence units. Let's have a closer look. 

First, we are going to represent the **similarity between a word and a sentence**, and our training data will be those words that occur in some sentences. So a positive pair will be a word that occurred in a sentence. A negative example will be some word that occurs in some other sentence. So we assume that they are not similar. Now, how do we model this similarity?

Again, we have a **sum over sub-units**. These units will be word n-grams, and a bag of word n-grams will represent a sentence. Awesome. So, you see that this model is very similar to the FastText model, but instead of having **character n-grams to represent words, you have word n-grams to represent sentences**.

Also, another minor difference is that now you have an **average, not a sum**. We have this one divided by the size of the set, but this is not so important.
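
A small sketch of this sentence-to-word score follows. Everything in it is an assumption for illustration: the use of unigrams and bigrams only, the separate tables for n-gram vectors and word vectors (mirroring the input and output vectors of skip-gram), the toy numbers, and the function names.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def word_ngrams(tokens, n_max=2):
    """Bag of word n-grams (n = 1..n_max) of a tokenized sentence."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def sentence_word_score(tokens, word, ngram_vectors, word_vectors, n_max=2):
    """Average over the sentence's n-grams of (n-gram vector . word vector)."""
    grams = [g for g in word_ngrams(tokens, n_max) if g in ngram_vectors]
    if not grams:
        return 0.0
    total = sum(dot(ngram_vectors[g], word_vectors[word]) for g in grams)
    return total / len(grams)          # average, not sum

# Hypothetical 2-dimensional vectors.
ngram_vectors = {
    "fix": [1.0, 0.0], "the": [0.0, 0.0], "tap": [0.0, 1.0],
    "fix the": [1.0, 0.0], "the tap": [0.0, 1.0],
}
word_vectors = {"tap": [1.0, 1.0], "honey": [-1.0, -1.0]}

sentence = ["fix", "the", "tap"]
print(sentence_word_score(sentence, "tap", ngram_vectors, word_vectors))    # positive pair
print(sentence_word_score(sentence, "honey", ngram_vectors, word_vectors))  # negative pair
```

Training would push the first score up and the second one down, exactly as in negative sampling for words.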

![mor_2.png](pics/mor_4.png)

So you see that at different levels of the language we can have some similar ideas. What if we build some general approach for all these levels? An attempt to build this general approach is found in a very recent paper which is called **StarSpace**.
![mor_2.png](pics/mor_5.png)

So, the idea here is that we have some entities represented by features. For example, **words represented by character n-grams** or **sentences represented by word n-grams**. But you can go further. For example, we can think about recommendation systems. There we have users, and they are represented as a bag of items that they like, for example as a bag of movies. So we'll learn how to embed users and movies in the same space.

Another example would be a document classification problem. There you have documents as bags of words, and you have labels, for example sentiment labels, and these are rather simple entities that are represented by a singleton feature, the label itself. So, in this application, you will try to learn to produce correct labels for documents. So, you'll say that the similarity measure between a document and a label should be high if this label can be found in the supervised data for this document, and low otherwise. So you build the model and you get the embeddings for labels, documents, and words in the same space.
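
Once such embeddings are trained, prediction is just a similarity lookup. The sketch below assumes already-trained toy vectors, a document embedding that is the plain sum of its word vectors, and a dot product as the similarity. None of these names or numbers come from StarSpace itself.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def embed_document(tokens, word_vectors):
    """Document as a bag of words: sum of the vectors of its known words."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for t in tokens:
        if t in word_vectors:
            for i in range(dim):
                total[i] += word_vectors[t][i]
    return total

def predict_label(tokens, word_vectors, label_vectors):
    """Label whose embedding is most similar to the document embedding."""
    doc = embed_document(tokens, word_vectors)
    return max(label_vectors, key=lambda label: dot(doc, label_vectors[label]))

# Hypothetical embeddings of words and labels living in the same space.
word_vectors = {"great": [1.0, 0.0], "awful": [0.0, 1.0], "movie": [0.1, 0.1]}
label_vectors = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}

print(predict_label(["great", "movie"], word_vectors, label_vectors))  # positive
print(predict_label(["awful", "movie"], word_vectors, label_vectors))  # negative
```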

Now, you can read about all those applications on the GitHub page. But I want to cover in more detail just one application. And this will be, again, sentence embeddings.

![mor_2.png](pics/mor_6.png)

So let's say we have some supervised data about similar sentences. So we know that some groups of sentences are duplicates and they are similar. Let us put each group into one line of the file. Let us have tabs between the sentences and spaces between the words. Now, in this format, we can feed this data to StarSpace and say that we need to train the embeddings for words and sentences. Then, what happens next? Well, the similar sentences are a good source of positive examples. We will just take two sentences from the same line and use them as a positive example in our model. Now, dissimilar sentences can be just sampled at random. For example, we take a sentence from one line and just a random sentence from another line and say that they are a negative example. Then we train this kind of word2vec-like model, in some sense, and obtain the embeddings for all our entities.
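
The data format and the sampling scheme can be sketched as follows. This is only an illustration of the idea, not StarSpace's actual code: the function names, the example sentences, and the use of a seeded `random.Random` are assumptions.

```python
import random

def parse_groups(text):
    """One line = one group of similar sentences, separated by tabs;
    words inside a sentence are separated by spaces."""
    groups = []
    for line in text.splitlines():
        if not line.strip():
            continue
        sentences = [s.split() for s in line.split("\t") if s.strip()]
        groups.append(sentences)
    return groups

def sample_triple(groups, rng):
    """Return (anchor, positive, negative) tokenized sentences."""
    # Positive pair: two different sentences from the same line.
    eligible = [i for i, g in enumerate(groups) if len(g) >= 2]
    i = rng.choice(eligible)
    anchor, positive = rng.sample(groups[i], 2)
    # Negative: a random sentence from some other line.
    j = rng.choice([k for k in range(len(groups)) if k != i])
    negative = rng.choice(groups[j])
    return anchor, positive, negative

# Hypothetical training file content.
data = (
    "how to fix a tap\trepair a leaking faucet\n"
    "best honey brands\twhich honey is good\thoney recommendations\n"
)
groups = parse_groups(data)
anchor, positive, negative = sample_triple(groups, random.Random(0))
print(anchor, positive, negative)
```

Each sampled triple then feeds a loss that raises the similarity of the anchor and the positive sentence and lowers the similarity of the anchor and the negative one.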

![mor_2.png](pics/mor_7.png)

Awesome. So, the last thing that I want to cover is deep learning approaches to building sentence representations. Actually, everything up to this point was rather shallow networks. So if we speak about deep learning, we could have three main trends here.

One trend would be, obviously, recurrent neural networks, which are popular in NLP. Another would be convolutional neural networks, which are actually much faster than recurrent neural networks, so it seems like a super promising approach. And the third one would be recursive neural networks, or so-called Tree-LSTMs, or Directed Acyclic Graph LSTMs (DAG-LSTMs). These kinds of models use the syntax of the language to build some hierarchical representations. These are rather awesome approaches. We will not have too much time to cover them, but you just need to know that syntax can help us to build the representation of the sentence.


Now, the take-away of this slide would be that linguistics, for example morphology and syntax, can help us in many, many tasks.
![mor_2.png](pics/mor_7.png)

The last architecture that I want to cover in this chapter is called skip-thought vectors. And it is based on recurrent neural networks. So the idea is as follows. You have some sentence, and you want to predict the next sentence. You encode your sentence with a recurrent neural network and get some hidden representation. It is called a thought vector. Now, once you have this, you try to generate the next sentence with a language model. So you already know that there are neural language models. **Now it is a conditional neural language model conditioned on this thought vector**. And the great thing is that this thought vector is going to represent the meaning of the sentence, and it can be used as the embedding.

Actually, this architecture is called **encoder-decoder architecture**. 
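
To give a feeling for the encoder half only, here is a very small sketch. It is not the skip-thought implementation: a plain tanh recurrent cell with fixed, hand-picked weights stands in for the trained recurrent network, the decoder is left out, and all names and numbers are hypothetical.

```python
import math

def consecutive_pairs(sentences):
    """Training pairs (sentence, next sentence) from an ordered text."""
    return list(zip(sentences, sentences[1:]))

def rnn_encode(token_vectors, w_in, w_rec):
    """Fold a sequence of token vectors into one hidden 'thought vector'
    with a plain recurrent cell: h = tanh(W_in x + W_rec h)."""
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    for x in token_vectors:
        h = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(len(x)))
                + sum(w_rec[i][k] * h[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h

# Hypothetical weights: 2 input dimensions, 2 hidden dimensions.
w_in = [[0.5, -0.3], [0.1, 0.8]]
w_rec = [[0.2, 0.0], [0.0, 0.2]]

pairs = consecutive_pairs(["I got back home.", "The tap was leaking.", "I called a plumber."])
thought = rnn_encode([[1.0, 0.0], [0.0, 1.0]], w_in, w_rec)
print(pairs[0])
print(thought)
```

In the full model a decoder, that is, a language model conditioned on `thought`, would be trained to generate the second sentence of each pair, and after training `thought` is what you keep as the sentence embedding.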


# 6. Quiz - Word and sentence embeddings 

![wv_quiz_1.png](pics/wv_quiz_1.png)
![wv_quiz_1.png](pics/wv_quiz_2.png)
![wv_quiz_1.png](pics/wv_quiz_3.png)
![wv_quiz_1.png](pics/wv_quiz_4.png)
![wv_quiz_1.png](pics/wv_quiz_5.png)
![wv_quiz_1.png](pics/wv_quiz_1_answer.png)


