This repository contains the Python implementation of the paper *Beyond Context: A New Perspective for Word Embeddings*.
FeVER stands for Feature embeddings for Vector Representation.
The input of this implementation is not raw text. Instead, the system takes its input in a multi-label data format, so the text must first be preprocessed into a suitable format, as described in Training files.
- pytorch 1.1.0
- numpy
- ExAssist
- Go to the directory of the project.
- Install the code in development mode.
```
python setup.py develop
```
This Python implementation uses ExAssist to track each experiment.
Every time you run an experiment, all the output files (including experiment settings and details) are saved in the `Experiments` directory. If the `Experiments` directory does not exist, it will be created.
In this subsection, a small example shows how to use this repository. The behavior of our code is controlled by a config file. After installation, you can run our code directly:

```
python FeVER/main.py example/config.ini
```
The `config.ini` file contains all the configuration for a run.
A toy dataset is stored in the `example` directory. Different files in this directory have different purposes.
To train the feature embeddings based on multi-label classification, you need to prepare three files:
`context_feature_training.txt`: This file contains all training data in multi-label format. Each example consists of the indices of the predicted words and the features of their context (the output of the psi function in the paper). Each word is mapped to an index by the `vocabulary.txt` file. A file contains the following content:

```
2 4 3
idx1,idx2 feat1:1.0 feat2:1.0
idx1,idx3 feat3:1.0 feat4:1.0
```
In this file, the first line gives the number of training examples, features, and labels. For example, this tiny file has 2 training examples, 4 features in total, and at most 3 words.
Every line after the first is a training example. The second line means that words `idx1` and `idx2` appear in the same context, and this context has features `feat1` and `feat2`. The third line means that words `idx1` and `idx3` appear in the same context, and this context has features `feat3` and `feat4`.
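To make the format concrete, here is a minimal sketch of a parser for such multi-label files (the function name `parse_multilabel_file` is ours, not part of FeVER; it assumes labels are comma-separated and features are `name:value` pairs, as in the example above):

```python
def parse_multilabel_file(path):
    """Read a multi-label file in the format described above.

    Returns (n_examples, n_features, n_labels, examples); each example
    is a (labels, features) pair, where labels is a list of word
    indices and features maps a feature name to its value.
    """
    with open(path) as f:
        # First line: number of examples, features, and labels.
        n_examples, n_features, n_labels = (int(x) for x in f.readline().split())
        examples = []
        for line in f:
            tokens = line.split()
            if not tokens:  # skip blank lines
                continue
            labels = tokens[0].split(",")
            features = {}
            for tok in tokens[1:]:
                # rsplit guards against feature names containing ':'
                name, value = tok.rsplit(":", 1)
                features[name] = float(value)
            examples.append((labels, features))
    return n_examples, n_features, n_labels, examples
```

The same sketch applies to any file in this format, since the context and word feature files share it.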
`label_feature_training.txt`: This file contains the word features, in the same format as the `context_feature_training.txt` file. Each line in the file represents a word in the vocabulary and its features (the output of the phi function in the paper). Suppose we have a tiny file that looks like this:

```
3 4 3
idx1 feat1:1.0 feat2:1.0
idx2 feat3:1.0 feat4:1.0
idx3 feat4:1.0 feat5:1.0
```
In this file, there are 3 words and 4 features. The second line means word `idx1` has features `feat1` and `feat2`.
`frequency.txt`: This file contains the frequency of each word in the context. Each line in this file corresponds to the same line in the word feature file.
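Assuming `frequency.txt` stores one number per line, aligned with the lines of the word feature file (our reading of the format, not confirmed by the paper), it could be loaded like this:

```python
def load_frequencies(path):
    """Load word frequencies, one value per line.

    Assumption: each line holds a single number whose position matches
    the corresponding line of the word feature file.
    """
    with open(path) as f:
        return [float(line) for line in f if line.strip()]
```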
After training, the model needs feature files to extract the word embeddings. Note that a different vocabulary can be used here, as long as we can extract features for the words in that vocabulary. The feature files are in the same format as the `label_feature_training.txt` file.