## How to create a vector representation from your concordances by hand

In the previous lab assignment you created two concordances for two meanings of a target word (e.g. “bass”, “mouse”, “knight”) so that you could analyse the context. In this notebook, we explain how you can manually create a vector representation of this context and use it to calculate the similarity between the two meanings.

A vector is an array of positions where each possible context word owns a fixed position. For example, the first position is for the word "a", the second for the word "be", etc. The array can have a position for every word (its length is then the size of the full vocabulary), but it is usually restricted to the more frequent words. If we want to compare the vectors for two meanings, we have to make sure that the two vectors have equal length and that each position represents the same context word. The only difference between the vectors is the frequency count at each position, which is derived from the concordance.

Before analysing a corpus, all positions are set to zero. Next, all corpus sentences are checked for the occurrence of the target word. Sentences with the target word are analysed for the context words. If a word is observed in the left or right context of the target word, the position of that context word in the vector array is set to one (or incremented by one). After processing a complete corpus, the vector array has scores that represent the typical context of the target word.
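
The counting procedure above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `context_counts`, the `window` parameter, the lowercasing, the whitespace tokenisation and the tiny three-sentence corpus are all assumptions made for this sketch, not part of the assignment.

```python
from collections import Counter

def context_counts(sentences, target, window=1):
    """Count the words seen within `window` positions left or right of `target`."""
    counts = Counter()  # every word implicitly starts at zero
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenisation (assumption)
        for i, token in enumerate(tokens):
            if token != target:
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)  # increment each observed context word by one
    return counts

# Made-up corpus, for illustration only.
corpus = ["I play bass in a band", "She reads a book", "They play bass in a club"]
counts = context_counts(corpus, "bass")
print(counts["play"], counts["in"], counts["book"])  # 2 2 0
```

A `Counter` only stores the words that were actually observed; turning it into a fixed-length vector requires a shared, ordered list of context words.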

### An example
Assume we have the following two concordances for the meanings of “bass”:

(a)   I play  bass   in a band	[meaning 1]
(b) I catch   bass   in a lake	[meaning 2]

The context words one position to the left and one position to the right of the target word are: "play", "in", "catch", "in". Removing duplicates and sorting the words alphabetically gives: "catch", "in", "play". This is a list of 3 unique words, so the shared vector for both meanings needs 3 positions to store the counts for these words.

An empty vector representing a meaning then looks as follows: [0, 0, 0]. The first position is for counting "catch" as a context, the second is for counting "in" and the third for counting "play". Note that we do NOT make a difference between left and right context here.

To represent (a) and (b) we need two vectors counting the context words:

[0, 1, 1], which means "catch" occurs zero times, "in" occurs once and "play" occurs once
[1, 1, 0], which means "catch" occurs once, "in" occurs once and "play" occurs zero times
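
As a check on the manual steps, the following sketch builds the shared vocabulary and the two count vectors for the example sentences. The helper name `neighbours` and the dictionary layout are hypothetical choices for this illustration, and the sentences are split on whitespace only.

```python
concordances = {
    "meaning 1": ["I play bass in a band"],
    "meaning 2": ["I catch bass in a lake"],
}
target = "bass"

def neighbours(sentence, target):
    """Return the words directly left and right of each occurrence of `target`."""
    tokens = sentence.split()
    found = []
    for i, token in enumerate(tokens):
        if token == target:
            if i > 0:
                found.append(tokens[i - 1])
            if i + 1 < len(tokens):
                found.append(tokens[i + 1])
    return found

# Context words per meaning, left and right context taken together.
context = {
    meaning: [word for sentence in sentences for word in neighbours(sentence, target)]
    for meaning, sentences in concordances.items()
}

# One shared, alphabetically ordered vocabulary fixes the position of every word.
vocabulary = sorted({word for words in context.values() for word in words})

# Count each vocabulary word per meaning, in vocabulary order.
vectors = {
    meaning: [words.count(word) for word in vocabulary]
    for meaning, words in context.items()
}

print(vocabulary)            # ['catch', 'in', 'play']
print(vectors["meaning 1"])  # [0, 1, 1]
print(vectors["meaning 2"])  # [1, 1, 0]
```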

To calculate how similar the vectors are, we calculate the so-called cosine similarity. To simplify this, we first normalise the counts by dividing them by the length of the vector. Note that "length" here means the Euclidean length (the square root of the sum of the squared counts), not the number of positions. Both vectors have length √(0² + 1² + 1²) = √2 ≈ 1.41. This results in:

[0/√2, 1/√2, 1/√2]
[1/√2, 1/√2, 0/√2]

The cosine similarity is then the dot product of the two normalised vectors, that is, the sum of the products of the values at each position:

cosine similarity = 0/√2 × 1/√2 + 1/√2 × 1/√2 + 1/√2 × 0/√2 = 0 + 1/2 + 0 = 0.5

If the contexts yielded exactly the same counts, for example [0, 1, 1] for both meanings, we would get 0 + 1/2 + 1/2 = 1. If the two meanings share no context words at all, we would get 0 + 0 + 0 = 0. Because counts are never negative, we thus always get a value between 0 (totally different) and 1 (totally the same). The 0.5 in our example shows that the two contexts are exactly halfway: they share one of their two context words ("in") and differ in the other.
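
For checking a manual calculation, here is a minimal sketch of cosine similarity in plain Python. It uses the standard definition, in which each vector is divided by its Euclidean length (the square root of the sum of its squared counts) before the dot product is taken. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    if length_u == 0 or length_v == 0:
        return 0.0  # assumption: an empty context is treated as sharing nothing
    return dot / (length_u * length_v)

vector_a = [0, 1, 1]  # "catch", "in", "play" for meaning 1
vector_b = [1, 1, 0]  # "catch", "in", "play" for meaning 2
print(round(cosine_similarity(vector_a, vector_b), 2))  # 0.5
```

Dividing by both lengths at the end gives the same result as normalising each vector first and then taking the dot product.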

Instead of absolute frequencies, you can also just mark occurrence (1) or non-occurrence (0), or use relative frequencies (the count divided by the number of sentences considered). These variants make no difference for our example, because we have frequencies of 1 and 0 only and only one example sentence per vector representation.
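
The weighting variants can be illustrated as follows. The function names `binary` and `relative` and the counts in the example are made up for this sketch; they do not come from the concordances above.

```python
def binary(counts):
    """Mark occurrence (1) or non-occurrence (0), ignoring how often a word occurs."""
    return [1 if count > 0 else 0 for count in counts]

def relative(counts, n_sentences):
    """Divide each absolute count by the number of sentences considered."""
    return [count / n_sentences for count in counts]

# Made-up absolute counts, imagined as collected from 4 concordance sentences.
absolute = [0, 3, 1]
print(binary(absolute))       # [0, 1, 1]
print(relative(absolute, 4))  # [0.0, 0.75, 0.25]
```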

What would be the effect of using absolute frequencies when you use a large collection of texts?
What would be the effect of using absolute frequencies when the size of the texts varies across the corpora for each meaning?


**Exercise**:
Apply the above strategy to your own target word and concordances. Choose how you select the context features (left 1, 2, 3, etc., right 1, 2, 3, etc., content words only, etc.) and choose a way of weighting the counts. Motivate your choices and comment on the result. Manually calculate the similarity score for the two meanings, using the vector similarity based on the normalised dot product.