# pmi-embeddings

This is a tutorial on how to use pmi-embeddings.py. It does not tell you step by step how to write your own script for producing word embeddings, but it does guide you through the basic idea and the parameters of the tool.

For command-line use, refer to [this document](https://docs.google.com/document/d/1TjVWqrhalCDjkOQf-JLk1jmC6N83MWGUIEVjbJpm9Es).



In [1]:
import make_embeddings as embs

## Defining the basic settings

First, we need to initialize the Embeddings object. This object requires at least the name of our corpus file, which must be in the WPL (word-per-line) format. In this tutorial, you can use the [akkadian.zip](https://github.com/asahala/pmi-embeddings/tree/main/corpora) test corpus. To use it, unzip it first and change the file_name path below, e.g. to ```lex/akkadian.txt```.

In [2]:
file_name = "akk_corpus.wpl"

The Embeddings object can also take several other parameters, summarized below:

| Parameter | Type | What it does |
| :- | -: | :- |
| **chunk_size** | int | Defines how many words of the corpus are processed at a time. A lower value decreases memory usage but increases the runtime. |
| **window_size** | int | Defines how many preceding and following words are taken as the context. |
| **min_count** | int | Words with a frequency lower than the set value are ignored in the co-occurrence matrix. |
| **subsampling_rate** | float | Rate for Word2vec-style subsampling. Subsampling randomly removes words that have a frequency higher than some given threshold. A typical value is 0.00001, but in small corpora (< 10M words) subsampling is generally not very useful. |
| **k_factor** | int | Defines the magnitude of Context Similarity Weighting, that is, a method of downsampling duplication and repetition in the corpus. A value of 0 turns it off. Useful values are typically between 1 and 3. |
| **dynamic_window** | bool | Dynamic window gives less importance to co-occurrences if the words are far apart. |
| **window_scaling** | bool | Window scaling adjusts the co-occurrence frequencies based on the window size. It ensures that a word cannot co-occur with another word several times within the same window. |
| **dirty_stopwords** | bool | If set to true, all stopwords are deleted without placeholders (this shortens the distance between relevant words). |
| **verbose** | bool | If set to true, the script will output processing information. |
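
To make the subsampling_rate parameter more concrete, here is a minimal sketch of Word2vec-style subsampling. It uses the discard probability $1 - \sqrt{t / f(w)}$ from the Word2vec paper, where $t$ is the rate and $f(w)$ is the relative frequency of the word. The function name `subsample` and its arguments are hypothetical, and the tool itself may implement subsampling differently.

```python
import math
import random
from collections import Counter

def subsample(tokens, rate, seed=0):
    """Randomly drop frequent tokens (hypothetical helper, not the tool's API)."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for token in tokens:
        freq = counts[token] / total
        # Words at or below the rate threshold get a discard probability of 0
        p_discard = max(0.0, 1.0 - math.sqrt(rate / freq))
        if rng.random() >= p_discard:
            kept.append(token)
    return kept
```

With this formula only words whose relative frequency exceeds the rate can be dropped, and the more frequent a word is, the more of its occurrences are removed.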





In [3]:
chunk_size = 400000
parameters = {
    "window_size": 3,
    "min_count": 1,
    "subsampling_rate": None,
    "k_factor": 3,
    "dynamic_window": True,
    "window_scaling": False,
    "dirty_stopwords": False,
    "verbose": True
}

With all our parameters set, we can initialize our Embeddings object.

In [4]:
embeddings = embs.Cooc(file_name, chunk_size, **parameters)

Next, we want to build a co-occurrence matrix from our corpus. This matrix is of size w×w, where w is the number of unique words in our corpus. Each cell of the matrix records how many times a pair of words co-occurs within the context window.
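
The idea behind the counting step can be sketched in a few lines. This is only an illustration and not the tool's implementation: the function `count_cooccurrences` is hypothetical, it stores only the non-zero cells in a dictionary, and the linear distance weighting used for the dynamic window is an assumption.

```python
from collections import defaultdict

def count_cooccurrences(tokens, window_size=3, dynamic_window=False):
    """Count co-occurrences within +/- window_size tokens (hypothetical helper)."""
    cooc = defaultdict(float)
    for i, word in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j == i:
                continue
            distance = abs(i - j)
            if dynamic_window:
                # Assumed weighting: adjacent words count 1.0, farther words less
                weight = (window_size - distance + 1) / window_size
            else:
                weight = 1.0
            cooc[(word, tokens[j])] += weight
    return cooc

tokens = "the king built the temple".split()
counts = count_cooccurrences(tokens, window_size=1)
print(counts[("king", "the")])  # 1.0
```

Because the window is symmetric, the count for (a, b) always equals the count for (b, a).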

In [5]:
embeddings.count_cooc()

> ------------------------------------------------
> Reading akk_corpus.wpl...
   Corpus statistics:
      spans       6954
      tokens      926074
      types       5968
      lacunae     292127
      stopwords   229210
      frag. rate: 0.32
    (0.62 seconds)
> ------------------------------------------------
> Extracting bigrams...
> Matrix size: 35617024 (1947 kB)
> Non-zero elements: 243419
    (3.11 seconds)
> ------------------------------------------------
> Calculating context similarities...
    (3.23 seconds)
> ------------------------------------------------


Now we want to calculate a PMI (pointwise mutual information) matrix from our co-occurrence matrix. PMI is an association measure that compares how often two words co-occur with how often they would be expected to co-occur if they were statistically independent. Word pairs that co-occur more often than expected get a score > 0, and pairs that co-occur less often than expected (words that seem to repel each other) get a score < 0. If the co-occurrence is statistically independent, the score is 0.
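
As a worked example, PMI for a single word pair can be computed from raw counts as $\log_{2} \frac{P(a,b)}{P(a)P(b)}$. The sketch below uses a hypothetical helper `pmi` and made-up counts; it only illustrates the measure and says nothing about how the tool computes it internally.

```python
import math

def pmi(joint_count, count_a, count_b, total):
    """Base-2 PMI from raw counts (hypothetical helper).

    P(a,b) / (P(a) * P(b)) simplifies to joint * total / (count_a * count_b).
    """
    return math.log2(joint_count * total / (count_a * count_b))

# Two words that each occur 100 times in a 1000-token toy corpus
print(pmi(10, 100, 100, 1000))  # 0.0  (exactly as often as chance predicts)
print(pmi(40, 100, 100, 1000))  # 2.0  (four times more often than chance)
print(pmi(5, 100, 100, 1000))   # -1.0 (half as often as chance)
```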

In word embeddings, the PMI values are often shifted and cut. Shifted PMI generally subtracts a small constant from the PMI scores to get rid of weak associations that are very close to independence. The scores can also be cut, which removes all scores below a certain threshold. This threshold is typically set to 0 (remove all repulsive co-occurrences), which is known as PPMI or Positive PMI.
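
Shifting and cutting can be sketched for a single score as follows. The helper `shift_and_cut` and its argument names are hypothetical, and removed scores are represented here as 0; the exact formulas the tool applies depend on its own parameters.

```python
def shift_and_cut(score, shift=0.0, cutoff=0.0):
    """Subtract a constant from a PMI score, then zero out scores below the cutoff."""
    shifted = score - shift
    return shifted if shifted >= cutoff else 0.0

print(shift_and_cut(-1.2))            # 0.0 (PPMI removes the negative score)
print(shift_and_cut(2.0))             # 2.0 (positive scores are kept)
print(shift_and_cut(2.0, shift=1.0))  # 1.0 (shifted, still above the cutoff)
print(shift_and_cut(0.5, shift=1.0))  # 0.0 (a weak association drops out)
```

With the default arguments the function behaves like PPMI; a positive shift additionally discards weak positive associations.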

| Parameter | Type | What it does |
| :- | -: | :- |
| **threshold** | int | Defines the shift value _s_ for Shifted PMI. See _shift_type_ for more information. |
| **shift_type** | int | Defines which formula is used for Shifted PMI. Value 0 cuts the PMI scores at _-s_. Value 1 shifts the PMI towards negative by $\log_{2} s$. |
| **lambda_** | float | Adds smoothing to the co-occurrence matrix before calculating PMI as in Jungmaier et al. (2020). This method reduces PMI's bias toward rare words. Small values such as 0.0001 have been reported to work well. |
| **alpha** | float | Context Distribution Smoothing (Levy et al. 2015). This also compensates for PMI's bias toward rare words. A value of 0.75 has been reported to work well. |
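
To illustrate what the alpha parameter does, here is a minimal sketch of Context Distribution Smoothing: context counts are raised to the power alpha and renormalized before they enter the PMI denominator. The function name and the toy counts are hypothetical, and the tool may apply the smoothing differently in detail.

```python
def smoothed_context_probs(context_counts, alpha=0.75):
    """Raise context counts to the power alpha and renormalize (hypothetical helper)."""
    powered = {c: n ** alpha for c, n in context_counts.items()}
    total = sum(powered.values())
    return {c: v / total for c, v in powered.items()}

counts = {"frequent": 900, "rare": 100}
raw = {c: n / sum(counts.values()) for c, n in counts.items()}
smoothed = smoothed_context_probs(counts)

# The rare context gets a larger probability after smoothing,
# which lowers the PMI of word pairs that involve it.
print(raw["rare"], smoothed["rare"])
```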

We can again define the parameters neatly as a dictionary.

In [6]:
pmi_parameters = {
    'shift_type': 0,
    'alpha': None,
    'lambda_': None, 
    'threshold': 5
    }

Next we calculate the PMI matrix.

In [7]:
embeddings.calculate_pmi(**pmi_parameters)

> Calculating PMI...
    (0.03 seconds)
> ------------------------------------------------


Now we have a huge (Shifted) PMI matrix that is of the same size as our previous co-occurrence matrix. Most of its values are zero, because most of the words never co-occur with each other in our corpus, especially if we are using a moderately small window size (e.g. 1-3), which is often recommended.
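
How sparse the matrix is can be checked with the figures printed by the co-occurrence step earlier in this tutorial (5968 types, 243419 non-zero elements). The number of non-zero cells in the PMI matrix itself may differ after shifting, so treat this as a rough estimate.

```python
types = 5968        # unique words reported in the corpus statistics
non_zero = 243419   # non-zero elements reported for the co-occurrence matrix

cells = types * types
density = non_zero / cells

print(cells)              # 35617024
print(f"{density:.2%}")   # 0.68%
```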

The next step is to factorize the PMI matrix using a method called Truncated [Singular Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD). This yields a dense matrix that keeps only the dimensions that best describe the co-occurrences in our corpus. One dimension of this matrix still equals the number of unique words in our corpus, but the other dimension is set to a fixed size that we choose, often something between 50 and 300.

Lower dimensionality often results in better generalization, but this may not always be true. The best value is often found by trial and error and depends on the task we want to use our word embeddings for.

Let's assume that we have a toy PMI matrix that would look like this:

| | puppy | small | pet | wood | stone | ... |
|--|--|--|--|--|--|--|
|dog|5.5|1.4 |4.0 |0.02 |0.0 | ... |
|cat|0.4|1.8 |3.9 |0.0 | 0.01| ... |
|house|0.3 |1.5 | 0.6| 2.1| 2.5|...|
|castle|0.0| 0.3| 0.0| 0.9| 4.0|...|

The SVD matrix truncated into two dimensions could look like this:

| | feature1 | feature2 |
|--|--|--|
|dog|1.6|0.01|
|cat|1.4|0.01|
|house|0.1|1.8 |
|castle|0.0|1.9|

As we can see, the vectors (rows) for cat and dog look quite similar, but not very similar to those of house and castle. For the animals, the first value is high and the second is low, while for the buildings it is the opposite.
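
The same experiment can be run on the toy PMI matrix with NumPy. This is a sketch under assumptions: the elided "..." columns are simply dropped, the rows are scaled by the singular values (other weightings exist), and the resulting numbers will not match the illustrative table above. It is not the tool's implementation; for a large sparse matrix a sparse solver such as `scipy.sparse.linalg.svds` would be the usual choice.

```python
import numpy as np

words = ["dog", "cat", "house", "castle"]
# Rows of the toy PMI matrix; columns: puppy, small, pet, wood, stone
pmi_matrix = np.array([
    [5.5, 1.4, 4.0, 0.02, 0.0],
    [0.4, 1.8, 3.9, 0.0, 0.01],
    [0.3, 1.5, 0.6, 2.1, 2.5],
    [0.0, 0.3, 0.0, 0.9, 4.0],
])

def truncated_svd(matrix, dimensions):
    """Keep only the `dimensions` largest singular values and their left vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dimensions] * s[:dimensions]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = dict(zip(words, truncated_svd(pmi_matrix, 2)))
print(cosine(vectors["dog"], vectors["cat"]))     # animals: high similarity
print(cosine(vectors["dog"], vectors["castle"]))  # animal vs. building: low
```

The signs of the individual features are arbitrary in SVD, which is why the comparison uses cosine similarity rather than the raw values.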

Now, we will set our dimensionality and run the matrix factorization.

In [8]:
dimensions = 60
embeddings.factorize(dimensions)

> SVD...
> Normalizing vectors...
    (0.76 seconds)
> ------------------------------------------------


Finally, we want to save our word embeddings in a Word2vec-compatible format. This can be done by calling the save_vectors() method of our Embeddings object and passing it a vector file name.

In [9]:
vector_file = 'akkadian.vec'

embeddings.save_vectors(vector_file)

> Filtering zero-vectors...
> Saving 4481 non-zero vectors (1487 discarded)... 
    (0.57 seconds)
> ------------------------------------------------
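
To check the result, the saved file can be read back into Python. The sketch below assumes that the file uses the plain-text Word2vec layout: a header line with the vocabulary size and the number of dimensions, followed by one line per word with space-separated values. The function `load_word2vec_text` is a hypothetical helper, not part of the tool.

```python
def load_word2vec_text(path):
    """Read a Word2vec-style text file into a {word: vector} dict (assumed layout)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<vocabulary size> <dimensions>"
        _, dimensions = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dimensions + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

If gensim is installed, `gensim.models.KeyedVectors.load_word2vec_format('akkadian.vec', binary=False)` should read the same layout and additionally provides similarity queries.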
