# Using Power Means in SentEval - project


## Tool setup

In [0]:
!git clone https://github.com/facebookresearch/SentEval \
&& cd SentEval \
&& pip install . \
&& cd ./data/downstream/ \
&& ./get_transfer_data.bash

!cd /content/ \
 && curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip \
 && unzip glove.840B.300d.zip \
 && rm glove.840B.300d.zip

Cloning into 'SentEval'...
remote: Enumerating objects: 687, done.
remote: Total 687 (delta 0), reused 0 (delta 0), pack-reused 687
Receiving objects: 100% (687/687), 33.25 MiB | 19.79 MiB/s, done.
Resolving deltas: 100% (431/431), done.
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
^C
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   315    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0   352    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  3 2075M    3 79.5M    0     0  53.4M      0  0:00:38  0:00:01  0:00:37 68.5M^C


## Theoretical background

A short overview of the data we work with.

### Power Means

State the formulas directly and describe them: what they compute, why, and how.

Power mean of the values $x_1, \dots, x_n$ with exponent $p \neq 0$:
$$
\left(\frac{x_1^p + \dots + x_n^p}{n}\right)^{1/p}
$$
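
To make the formula concrete, here is a minimal sketch that evaluates it for a few exponents on positive inputs. The helper name `power_mean` and the sample values are ours, chosen for illustration only. With $p = 1$ the formula reduces to the arithmetic mean, and the limits $p \to -\infty$ and $p \to +\infty$ give the minimum and maximum, which the sketch handles as special cases.

```python
import numpy as np


def power_mean(x, p):
    """Power mean of positive values x with exponent p (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if p == np.inf:       # limit p -> +inf is the maximum
        return x.max()
    if p == -np.inf:      # limit p -> -inf is the minimum
        return x.min()
    return np.mean(x ** p) ** (1.0 / p)


values = [1.0, 2.0, 3.0, 4.0]
for p in (-np.inf, 1, 2, 3, np.inf):
    print(p, power_mean(values, p))
```

For these values the result grows with `p`, from the minimum (1.0) through the arithmetic mean (2.5) up to the maximum (4.0).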



### SentEval overview

About the library itself: how it compares sentence embeddings and on what data. Briefly, what the transfer tasks consist of.

## Implementation

In [0]:
from __future__ import absolute_import, division, unicode_literals

import sys
import io
import numpy as np
import logging
import senteval

In [0]:
PATH_TO_SENTEVAL = '../'
PATH_TO_DATA = '/content/SentEval/data'
PATH_TO_VEC = '/content/glove.840B.300d.txt'

In [0]:
def gen_mean(values: list, p: float):
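    '''
    Compute the element-wise power mean of a list of vectors.
    :param values: list of equal-length vectors (e.g. word embeddings)
    :param p: exponent of the power mean, must be non-zero
    :return: complex array holding the power mean of each dimension
    '''
    p = float(p)
    # complex dtype keeps fractional powers of negative values from turning into NaN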
    return np.power(
        np.mean(
            np.power(
                np.array(values, dtype=complex),
                p),
            axis=0),
        1 / p
    )


def get_sentence_embedding(sentence: list, embeddings_vectors: dict, embeddings_dimensionality: int):
    '''
    Turn a tokenized sentence into a single vector using the power mean.
    :param sentence: list of tokens, one sentence from the batch
    :param embeddings_vectors: dict mapping word to embedding vector for the transfer task
    :param embeddings_dimensionality: dimension of the word embedding vectors
    :return: sentence embedding (zero vector if no token has an embedding)
    '''
    word_embeddings = []
    p = 2.0  # exponent of the power mean
    for tok in sentence:
        if tok in embeddings_vectors:
            word_embeddings.append(embeddings_vectors[tok])

    if not word_embeddings:
        return np.zeros(embeddings_dimensionality)
    else:
        # return a 1-D array in both branches; np.vstack in batcher stacks them
        return gen_mean(word_embeddings, p).real
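
The `complex` dtype in `gen_mean` only matters when the inner mean can be negative, which happens for odd or fractional `p` with negative inputs. The sketch below uses a toy value (not taken from the experiments) to show what each variant returns in that case. It is an illustration of the arithmetic, not a statement about which behaviour is preferable.

```python
import numpy as np

p = 3.0
values = np.array([-8.0])  # toy input, chosen so the mean of cubes is negative

# Real arithmetic: a negative base with a fractional exponent yields NaN.
with np.errstate(invalid='ignore'):
    real_result = np.power(np.mean(np.power(values, p)), 1 / p)

# Complex arithmetic, as in gen_mean: the principal complex root is returned.
complex_result = np.power(np.mean(np.power(values.astype(complex), p)), 1 / p)

print(real_result)          # nan
print(complex_result.real)  # close to 4.0, not -8.0
```

Taking `.real` of the principal root therefore avoids NaN but does not recover the real cube root. For `p = 2.0`, as used here, the mean of squares is never negative, so the issue does not arise.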


In [0]:
def create_dictionary(sentences: list, threshold=0):
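    '''
    Build the vocabulary: map id to word and word to id.
    :param sentences: all tokenized sentences for the transfer task
    :param threshold: minimum word count needed to keep a word (0 keeps every word)
    :return: (id2word, word2id), ordered by decreasing word count
    '''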
    words = {}
    for s in sentences:
        for word in s:
            words[word] = words.get(word, 0) + 1

    if threshold > 0:
        newwords = {}
        for word in words:
            if words[word] >= threshold:
                newwords[word] = words[word]
        words = newwords
    words['<s>'] = 1e9 + 4
    words['</s>'] = 1e9 + 3
    words['<p>'] = 1e9 + 2

    sorted_words = sorted(words.items(), key=lambda x: -x[1])  # inverse sort
    id2word = []
    word2id = {}
    for i, (w, _) in enumerate(sorted_words):
        id2word.append(w)
        word2id[w] = i

    return id2word, word2id


In [0]:
def get_wordvec(path_to_vec: str, word2id):
    '''
    Load the vectors of all dictionary words found in the embeddings file.
    :param path_to_vec: path to the word embeddings file (GloVe)
    :param word2id: dict mapping word to word id
    :return: dict mapping word to its embedding vector
    '''
    word_vec = {}

    with io.open(path_to_vec, 'r', encoding='utf-8') as f:
        for line in f:
            word, vec = line.split(' ', 1)
            if word in word2id:
                word_vec[word] = np.fromstring(vec, sep=' ')

    logging.info('Found {0} words with word vectors, out of \
        {1} words'.format(len(word_vec), len(word2id)))
    return word_vec
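
`get_wordvec` relies on the GloVe text format, where each line holds a word followed by its space-separated vector components. A minimal sketch of that parsing step on an in-memory file follows. The two lines and the 3-dimensional vectors are made up for illustration, and `np.array(..., dtype=float)` stands in for the `np.fromstring` call above.

```python
import io

import numpy as np

# Two made-up lines in the "word v1 v2 ... vn" format (real GloVe vectors have 300 values).
fake_file = io.StringIO("hello 0.1 -0.2 0.3\nworld 1.0 2.0 3.0\n")
word2id = {'world': 0}  # only words from the task dictionary are kept

word_vec = {}
for line in fake_file:
    # split off the word; the rest of the line is the vector
    word, vec = line.rstrip().split(' ', 1)
    if word in word2id:
        word_vec[word] = np.array(vec.split(), dtype=float)

print(word_vec)
```

Filtering by `word2id` while reading keeps memory use proportional to the task vocabulary rather than to the full embeddings file.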

In [0]:
def prepare(params, samples):
    '''
    Build the dictionary and load the word vectors for its words.
    :param params: SentEval parameters, extended here with our own fields (word vectors, vector dimension)
    :param samples: list of all sentences from the transfer task
    :return: None
    '''
    _, params.word2id = create_dictionary(samples)
    params.word_vec = get_wordvec(PATH_TO_VEC, params.word2id)
    params.wvec_dim = 300
    return


In [0]:
def batcher(params, batch):
    '''
    Compute power-mean sentence embeddings for a batch.
    :param params: SentEval params and prepared word vectors
    :param batch: subset of tokenized sentences from the transfer task
    :return: array of embeddings, one row per sentence
    '''
    batch = [sent if sent != [] else ['.'] for sent in batch]
    embeddings = []

    for sent in batch:
        sentvec = get_sentence_embedding(sent, params.word_vec, params.wvec_dim)
        embeddings.append(sentvec)

    embeddings = np.vstack(embeddings)
    return embeddings
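
The contract between SentEval and `batcher` is easiest to see on a toy input: a batch of tokenized sentences goes in, and an array with one row per sentence comes out. The sketch below is a self-contained stand-in, not the notebook's code. `toy_batcher`, the 3-dimensional vectors, and the use of `SimpleNamespace` in place of SentEval's params object are all assumptions made for illustration.

```python
from types import SimpleNamespace

import numpy as np

# Made-up 3-dimensional word vectors (not GloVe values).
toy_vectors = {
    'good': np.array([1.0, 0.0, 2.0]),
    'movie': np.array([3.0, 4.0, 0.0]),
}
params = SimpleNamespace(word_vec=toy_vectors, wvec_dim=3)


def toy_batcher(params, batch, p=2.0):
    """Hypothetical stand-in for batcher: one power-mean vector per sentence."""
    rows = []
    for sent in batch:
        vecs = [params.word_vec[t] for t in sent if t in params.word_vec]
        if not vecs:
            # no known token: fall back to the zero vector
            rows.append(np.zeros(params.wvec_dim))
        else:
            rows.append(np.mean(np.power(vecs, p), axis=0) ** (1.0 / p))
    return np.vstack(rows)


batch = [['good', 'movie'], ['unknown', 'tokens'], []]
embeddings = toy_batcher(params, batch)
print(embeddings.shape)  # (3, 3)
```

Sentences with no known tokens, including empty ones, map to the zero vector, so the output always has as many rows as the batch has sentences.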

In [0]:
params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params_senteval['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                 'tenacity': 3, 'epoch_size': 2}



*   `task_path` - path to the folder with the _transfer tasks_ data
*   `usepytorch` - whether the `PyTorch` library is used for the computations
*   `kfold` - k-fold validation for the **MR** and **CR** tasks
*   `classifier`:
    * `nhid` - number of hidden units (`0` means *LogisticRegression*)
    * `optim` - optimizer
    * `batch_size` - batch size
    * `tenacity` - how many times `dev_acc` may fail to increase before training stops
    * `epoch_size` - each epoch consists of `epoch_size` passes over the training set
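
For comparison, the SentEval README, as far as we recall it, presents the values used above as a fast prototyping setup and suggests a slower default configuration for reporting results. The sketch below reproduces that default from memory, so the exact values should be checked against the repository before relying on them. `params_default` is a name introduced here, and `PATH_TO_DATA` is redefined only so the snippet runs on its own.

```python
PATH_TO_DATA = '/content/SentEval/data'

# Slower default setup (assumed from the SentEval README, please verify):
# more folds, smaller batches, more patience, longer epochs.
params_default = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_default['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                'tenacity': 5, 'epoch_size': 4}

print(params_default)
```

The prototyping setup trades some accuracy for speed, which is worth keeping in mind when comparing the numbers below with published results.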



In [0]:
logging.basicConfig(format='%(asctime)s : %(message)s', level=logging.DEBUG)

se = senteval.engine.SE(params_senteval, batcher, prepare)
transfer_tasks = ['MR', 'CR', 'SST2']
results = se.eval(transfer_tasks)
print(results)

2019-09-09 21:04:40,513 : ***** Transfer task : MR *****


2019-09-09 21:04:47,501 : Found 18490 words with word vectors, out of         20328 words
2019-09-09 21:04:47,513 : Generating sentence embeddings
2019-09-09 21:04:48,898 : Generated sentence embeddings
2019-09-09 21:04:48,900 : Training pytorch-MLP-nhid0-rmsprop-bs128 with (inner) 5-fold cross-validation
2019-09-09 21:05:09,760 : Best param found at split 1: l2reg = 0.0001                 with score 68.07
2019-09-09 21:05:22,272 : Best param found at split 2: l2reg = 0.0001                 with score 61.86
2019-09-09 21:05:36,098 : Best param found at split 3: l2reg = 0.001                 with score 68.72
2019-09-09 21:05:48,385 : Best param found at split 4: l2reg = 0.001                 with score 67.26
2019-09-09 21:06:01,411 : Best param found at split 5: l2reg = 1e-05                 with score 66.62
2019-09-09 21:06:01,917 : Dev acc : 66.51 Test acc : 66.95

2019-09-09 21:06:01,923 : ***** Transfer task : CR *****


201

{'MR': {'devacc': 66.51, 'acc': 66.95, 'ndev': 10662, 'ntest': 10662}, 'CR': {'devacc': 67.26, 'acc': 68.13, 'ndev': 3775, 'ntest': 3775}, 'SST2': {'devacc': 72.13, 'acc': 71.61, 'ndev': 872, 'ntest': 1821}}
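
The dictionary returned by `se.eval` is hard to read in raw form. A small sketch that prints it as a table sorted by test accuracy follows; the numbers are copied from the output above, and the column layout is our own choice.

```python
# Results copied verbatim from the run above.
results = {
    'MR': {'devacc': 66.51, 'acc': 66.95, 'ndev': 10662, 'ntest': 10662},
    'CR': {'devacc': 67.26, 'acc': 68.13, 'ndev': 3775, 'ntest': 3775},
    'SST2': {'devacc': 72.13, 'acc': 71.61, 'ndev': 872, 'ntest': 1821},
}

# Sort tasks by test accuracy, best first.
rows = sorted(results.items(), key=lambda kv: kv[1]['acc'], reverse=True)

print('{:<6}{:>10}{:>10}{:>8}'.format('task', 'dev acc', 'test acc', 'ntest'))
for task, r in rows:
    print('{:<6}{:>10.2f}{:>10.2f}{:>8d}'.format(task, r['devacc'], r['acc'], r['ntest']))
```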


## Evaluation

Compare the baseline with the other sentence embeddings mentioned in Conneau, on the same transfer tasks.

Bar chart or box plot(?) to compare the results.

## Conclusions