# Koehn

In this notebook I replicate Koehn (2015): _What's in an embedding? Analyzing word embeddings through multilingual evaluation_. The paper proposes to i) evaluate an embedding method on more than one language, and ii) evaluate an embedding model by how well its embeddings capture syntactic features. Koehn finds that most methods perform similarly on this task, but that dependency-based embeddings perform better, particularly when the dimensionality is decreased. Overall, the aim is an evaluation method that tells you something about the structure of the learnt representations.

Embedding models tested:
- cbow
- skip-gram
- glove
- dep
- cca
- brown

The proposed task is classification of various syntactic features, using an L2-regularized linear classifier. Koehn compares against a majority baseline and an upper bound that assigns the most probable class. He looked at the following features; some of them apply only to a subset of the lexicon.
- pos
- headpos (the pos of the word's head)
- label (the word's dependency label)
- gender
- case
- number
- tense
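
The two reference points can be sketched in plain Python. This is a minimal illustration, not Koehn's implementation. It assumes that "most probable class" means each word's most frequent label in the training data, and that unseen words fall back to the overall majority label. The function names and the toy (word, pos) pairs are hypothetical.

```python
from collections import Counter, defaultdict


def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


def per_word_upper_bound(train, test):
    """Accuracy of predicting each word's most frequent training label.

    Assumption: words unseen in training get the overall majority label.
    """
    by_word = defaultdict(Counter)
    for word, label in train:
        by_word[word][label] += 1
    fallback = Counter(label for _, label in train).most_common(1)[0][0]

    correct = 0
    for word, label in test:
        if word in by_word:
            pred = by_word[word].most_common(1)[0][0]
        else:
            pred = fallback
        correct += pred == label
    return correct / len(test)


# Toy (word, pos) pairs, invented for illustration only.
train = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"), ("the", "DET"),
         ("run", "NOUN"), ("run", "VERB"), ("run", "VERB")]
test = [("the", "DET"), ("run", "VERB"), ("dog", "NOUN"), ("cat", "NOUN")]

baseline_acc = majority_baseline([y for _, y in train], [y for _, y in test])
upper_acc = per_word_upper_bound(train, test)
print(baseline_acc, upper_acc)
```

A classifier trained on embeddings should land somewhere between these two numbers. How close it gets to the upper bound indicates how much of the feature is recoverable from the embedding alone.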

He tested on the following languages:
- Basque
- English
- French
- German
- Hungarian
- Polish
- Swedish

Word embeddings were trained on automatically PoS-tagged and dependency-parsed data using existing models. The parses are needed so that the dependency-based embeddings can be trained. The evaluation is on hand-labelled data. English training data is a subset of Wikipedia; English test data comes from PTB. For all other languages, both the training and test data come from a shared task on parsing morphologically rich languages. Koehn trained embeddings with window sizes 5 and 11 and dimensionalities 10, 100, and 200.

Dependency-based embeddings perform the best on almost all tasks. They even do well when the dimensionality is reduced to 10, while other methods perform poorly in this case.

I'll need:
- models
- learnt representations
- automatically labeled data
- hand-labeled data

In [1]:
%matplotlib inline
import os
import csv
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

data_path = '../../data'
tmp_path = '../../tmp'


## Models

## Learnt representations

### GloVe

In [2]:
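size = 50  # dimensionality of the GloVe vectors to load
fname = 'embeddings/glove.6B.{}d.txt'.format(size)
glove_path = os.path.join(data_path, fname)
# Each line holds a word followed by its vector, separated by spaces.
# QUOTE_NONE keeps quote characters as literal tokens; .T puts one word per column.
glove = pd.read_csv(glove_path, sep=' ', header=None, index_col=0, quoting=csv.QUOTE_NONE).T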
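
To make the resulting layout concrete, here is a minimal sketch that applies the same `read_csv` call to a tiny in-memory sample. The two words and their three-dimensional vectors are made up for illustration and do not come from the GloVe file. `toy_text` and `toy_glove` are hypothetical names.

```python
import csv
import io

import pandas as pd

# Two made-up "word v1 v2 v3" lines in the space-separated GloVe text layout.
toy_text = "the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n"

toy_glove = pd.read_csv(
    io.StringIO(toy_text),
    sep=' ',
    header=None,
    index_col=0,             # first field is the word
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
).T                          # transpose: rows are dimensions, columns are words

print(toy_glove.shape)    # (number of dimensions, number of words)
print(toy_glove['cat'])   # the vector for one word, as a Series
```

After the transpose, a word's vector is retrieved by column lookup, so `glove['cat']` on the real table would return a Series of length `size`, assuming the word is in the vocabulary.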