Aishwarya Kamath @ashkamath: https://github.com/ashkamath
Jonas Pfeiffer @JoPfeiff: https://github.com/JoPfeiff
We are trying to define a framework for NLP tasks that easily maps any kind of word embedding data set to any kind of text data set. The framework should decrease the amount of additional code needed to work on different NLP tasks.
We have found that for many NLP tasks similar preprocessing steps are needed.
This entails
- tokenizing the text
- replacing words with embeddings (pretrained or newly learnt)
- bucketizing sentences based on their length
- padding sentences to a specific length
- replacing unseen words with `<UNK>`
- creating a generator that loops through the sentences
We therefore want to create a framework that provides these common functionalities out-of-the-box to be able to focus on the core task of the project faster.
Currently the framework has the capabilities described below.
The main class to be called is `DataLoader`, which can be found in `data_loading/data_loader.py`. The `DataLoader` class maps embeddings to text data sets. This code needs to be edited to be able to accept different kinds of embedding and text data sets.
In order to combine Fast-Text embeddings with the SNLI data set, we can call the `DataLoader` as follows:
```python
dl = DataLoader(data_set='SNLI', embedding_loading='in_dict', embeddings_initial='FastText-Wiki',
                embedding_params={}, K_embeddings=float('inf'))
gen = dl.get_generator(data_set='train', batch_size=64, drop_last=True)
data, batch = gen.next()
```
The generator loops through the defined `data_set` once, so a new generator has to be created for each epoch. `drop_last=True` means that the last batch is dropped if it is not full.
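Putting this together, a training loop over several epochs could look roughly like the following sketch. It only relies on the calls shown above; how the generator signals exhaustion is an assumption (here: `StopIteration`), and the training step is a placeholder.

```python
num_epochs = 10  # hypothetical value

for epoch in range(num_epochs):
    # A generator only loops through the data once, so create a new one per epoch.
    gen = dl.get_generator(data_set='train', batch_size=64, drop_last=True)
    while True:
        try:
            data, batch = gen.next()  # next(gen) in Python 3
        except StopIteration:
            break
        # train_step(data, batch)  # placeholder for the actual model update
```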
This class loads embeddings based on a defined strategy. Currently two versions are implemented:
- `embedding_loading='top_k'` loads the first `K` embeddings from file, assuming that they are sorted with the most frequent words at the top. If all embeddings should be loaded, set `K_embeddings=float('inf')`.
- `embedding_loading='in_dict'` preloads all embeddings and then selects only those embeddings that occur in the text data set.
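For illustration, the two strategies correspond to `DataLoader` calls like the following sketch; the value `100000` is an arbitrary example.

```python
# Load only the 100,000 most frequent embeddings from file.
dl_top_k = DataLoader(data_set='SNLI', embedding_loading='top_k', embeddings_initial='FastText-Wiki',
                      embedding_params={}, K_embeddings=100000)

# Preload all embeddings, then keep only those that occur in the text data set.
dl_in_dict = DataLoader(data_set='SNLI', embedding_loading='in_dict', embeddings_initial='FastText-Wiki',
                        embedding_params={}, K_embeddings=float('inf'))
```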
The class also makes it possible to store all the loaded data on disk in a pickle file and to load it back into the object. This can be done with `dl.dump()` and `dl.load()`. However, these functions only dump what has currently been loaded into the object. To load everything at the start and then dump it to file, call `dl.get_all_and_dump()`. This function also automatically bucketizes all the sentences based on the defined bucketizing strategy.
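A possible workflow, assuming the pickle file location is handled internally by the object, is sketched below.

```python
# First run: load and bucketize everything, then dump it to a pickle file.
dl = DataLoader(data_set='SNLI', embedding_loading='in_dict', embeddings_initial='FastText-Wiki',
                embedding_params={}, K_embeddings=float('inf'))
dl.get_all_and_dump()

# Later runs: restore the previously dumped state instead of re-parsing the raw data.
dl = DataLoader(data_set='SNLI', embedding_loading='in_dict', embeddings_initial='FastText-Wiki',
                embedding_params={}, K_embeddings=float('inf'))
dl.load()
```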
The current tokenizer is based on spaCy (spacy.io) and can easily be replaced in the function `tokenize()` in `data_loading/data_utils.py`.
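As a sketch, a replacement tokenizer only has to map a raw string to a list of token strings; the exact signature of `tokenize()` in `data_utils.py` should be checked in the code, and the use of a blank spaCy pipeline here is just an assumption.

```python
import spacy

_nlp = spacy.blank('en')  # tokenizer-only pipeline, no model download required

def tokenize(text):
    # Swap in any other tokenizer here, as long as it returns a list of token strings.
    return [token.text for token in _nlp(text)]
```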
Currently a set of out-of-the-box embedding and text data sets has been implemented. These are described below.
The core class is `Embeddings`, which can be found in `embeddings/embeddings.py`. However, it should only be used as the superclass for the specialized embeddings. New embedding classes inherit from it (e.g. `class FastTextEmbeddings(Embeddings)` in `embeddings/fasttext_embeddings.py`). Only if the embeddings are to be initialized randomly should the core `Embeddings` class be used directly.
A generic path-based embedding class is implemented that can process any kind of embeddings stored as text files in the structure `<word>\t<float>\t<float>\t...\t<float>\n`. This object is used if the parameter `embeddings_initial='Path'` is passed when creating the `DataLoader` object. The parameters, e.g. where the embedding data is stored, are passed as a dictionary:

```python
embedding_params = {'path': '../data/embeddings/bow2.words', 'name': 'bow2'}
```
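Combining this with the call from the beginning of the README, loading such a file could look like this sketch (the path and name are the example values from above):

```python
embedding_params = {'path': '../data/embeddings/bow2.words', 'name': 'bow2'}
dl = DataLoader(data_set='SNLI', embedding_loading='in_dict', embeddings_initial='Path',
                embedding_params=embedding_params, K_embeddings=float('inf'))
```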
The pre-implemented word embeddings are:
- Fast-Text: https://fasttext.cc/docs/en/english-vectors.html
  - `embeddings_initial='FastText-Crawl'`
  - `embeddings_initial='FastText-Wiki'`
- Glove-Embeddings: https://nlp.stanford.edu/projects/glove/
  - `embeddings_initial='Glove-Twitter-25'`
  - `embeddings_initial='Glove-Twitter-50'`
  - `embeddings_initial='Glove-Twitter-100'`
  - `embeddings_initial='Glove-Twitter-200'`
  - `embeddings_initial='Glove-Common-42B-300'`
  - `embeddings_initial='Glove-Common-840B-300'`
  - `embeddings_initial='Glove-Wiki-50'`
  - `embeddings_initial='Glove-Wiki-100'`
  - `embeddings_initial='Glove-Wiki-200'`
  - `embeddings_initial='Glove-Wiki-300'`
- Lear-Embeddings:
  - `embeddings_initial='Lear'`
- Polyglot-Embeddings: http://bit.ly/19bSoAS
  - `embeddings_initial='Polyglot'`
To implement a new embedding class, it should inherit from the class `Embeddings`, which can be found in `embeddings/embeddings.py`. This class already implements all the basic functionality that is needed for most embedding data. The new embedding class only needs two functions, which are data set dependent:
- `load_top_k(self, K, preload=False)`
  This loads the top `K` embeddings from file, with the assumption that the embeddings are ordered by frequency. The following points are important for implementing this function:
  - Adding a term should be done using `self.add_term(term, preload=preload)`.
  - If special embeddings (`<UNK>`, `<PAD>`, `<START>`, `<END>`) need to be added, this is to be done using `special_embeddings = self.add_special_embeddings(len(embeddings[0]), preload=preload)`.
  - The function should return the embeddings as a `np.array()`: `return np.array(embeddings)`.
- `get_name(self)`
  This should return the name of the embeddings, e.g. `'FastText-Wiki'`.
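A minimal sketch of such a class is shown below. The file path, the file format (one `<word>\t<float>...\t<float>` line per word) and the way the special embeddings are appended to the returned array are assumptions; only `add_term`, `add_special_embeddings`, `load_top_k` and `get_name` come from the description above.

```python
import numpy as np

from embeddings.embeddings import Embeddings

MY_EMBEDDING_FILE = '../data/embeddings/my_vectors.txt'  # hypothetical path


class MyEmbeddings(Embeddings):
    """Sketch of a new embedding class for tab-separated vector files."""

    def load_top_k(self, K, preload=False):
        embeddings = []
        with open(MY_EMBEDDING_FILE, 'r') as f:
            for i, line in enumerate(f):
                if i >= K:
                    break  # file is assumed to be sorted by word frequency
                parts = line.rstrip('\n').split('\t')
                self.add_term(parts[0], preload=preload)
                embeddings.append([float(x) for x in parts[1:]])
        # Register <UNK>, <PAD>, <START>, <END>; appending the returned vectors
        # to the embedding list is an assumption about the base class.
        special_embeddings = self.add_special_embeddings(len(embeddings[0]), preload=preload)
        embeddings.extend(special_embeddings)
        return np.array(embeddings)

    def get_name(self):
        return 'MyEmbeddings'
```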
The new functionality needs to be added to `DataLoader` in `data_loading/data_loader.py`. The defined object needs to be callable in `DataLoader` using a name. Two new lines need to be added here:
```python
if self.embeddings_initial in FASTTEXT_NAMES:
    self.embedding = FastTextEmbeddings(self.embeddings_initial)
elif self.embeddings_initial in POLYGLOT_NAMES:
    self.embedding = PolyglotEmbeddings()
elif self.embeddings_initial in LEAR_NAMES:
    self.embedding = LearEmbeddings()
elif self.embeddings_initial in GLOVE_NAMES:
    self.embedding = GloveEmbeddings(self.embeddings_initial)
elif self.embeddings_initial == "Path":
    self.embedding = PathEmbeddings(self.embedding_params)
else:
    raise Exception("No valid embedding was set")
```
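For the hypothetical `MyEmbeddings` class sketched above, the two new lines could look like this (the name constant `MY_EMBEDDING_NAMES` is an assumption):

```python
elif self.embeddings_initial in MY_EMBEDDING_NAMES:  # e.g. MY_EMBEDDING_NAMES = ['MyEmbeddings']
    self.embedding = MyEmbeddings()
```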
For reference, please look at `embeddings/fasttext_embeddings.py`.
The text data set implements a bucketized loading structure. That means that sentences are bucketized based on their length (conditioned on words in the dictionary) and stored in memory. A generator is available that loops through the data points randomly by first sampling a bucket and then sampling data points from that bucket.
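Conceptually, the sampling works as in the following sketch, which uses the bucket format described further below; the real generator additionally tracks each bucket's `position` and shuffles within buckets.

```python
import random

def sample_batch(bucketized, batch_size):
    # Illustration only: pick a bucket at random, then draw a batch from that bucket.
    bucket_name = random.choice(list(bucketized.keys()))
    bucket = bucketized[bucket_name]
    return random.sample(bucket['data'], min(batch_size, bucket['length']))
```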
Out-of-the-box text data sets are:
- SNLI data set: `data_set='SNLI'` (https://nlp.stanford.edu/projects/snli/snli_1.0.zip)
- Billion Word Benchmark data set: `data_set='BillionWords'`
New text data sets are to inherit from the class `TextData`, which can be found in `text/text_data.py`. This class has out-of-the-box functionality for loading and storing data, and a generator function is also defined here, which samples from the bucketized sentences.
The new class needs two functions:
- `loading()`
  The data is loaded to memory and extracted into sentences. Important factors of this function:
  - Data is to be stored in `self.data_set`. This variable has been initialized in `TextData` and is a dictionary.
  - Raw data is to be stored in `self.data_set['train']`, `self.data_set['dev']` and `self.data_set['test']`.
  - Each data point is to be stored as a dictionary element (`{}`) and appended to the corresponding list (e.g. `self.data_set['train'] = []`).
  - Parsing a sentence is to be done using `elem['sentence'] = self.embeddings.encode_sentence(string_sentence, initialize=initialize_term, count_up=True)`, for which `string_sentence` is the raw text format of the sentence to be encoded. The `tokenizer` defined in `data_utils` will tokenize this string.
- `bucketize_data()`
  This function bucketizes the data into defined buckets based on the length that `self.embeddings.encode_sentence()` has returned. The data needs to be stored in the following format, which is to be predefined for each bucket:

  ```python
  bucketized[bucket_name] = {}
  bucketized[bucket_name]['data'] = []
  bucketized[bucket_name]['buckets'] = [bucket_size]
  bucketized[bucket_name]['length'] = 0
  bucketized[bucket_name]['position'] = 0
  ```

  - `data` includes all data points in element dictionary format.
  - `buckets` is the bucket size (or bucket sizes for SNLI).
  - `length` is the number of data points in this bucket.
  - `position` is to be initialized to 0 and will be counted up in the generator (refer to the logic in the generator to understand this).

  Each data point is to be processed using `pad_positions(elem['sentence_positions'], PAD_position, b1)`. Each data point in each bucket needs the following information:

  ```python
  elem['sentence_length'] = int
  elem['sentence_positions'] = list
  ```

  - `sentence_length` is the actual length of the sentence before padding.
  - `sentence_positions` is a list of indexes including the padded words.
Please refer to `text/SNLI_data_loading.py` and `text/billion_words.py` for more information.
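Putting both functions together, a new text data set class could look roughly like the sketch below. The raw file paths, the bucket sizes, the return value of `encode_sentence()` (assumed here to be a list of vocabulary positions), the `PAD` position and the way the bucketized data is returned are all assumptions; only the attribute and function names mentioned above come from the framework.

```python
from text.text_data import TextData
from data_loading.data_utils import pad_positions

PAD_position = 0             # assumption: vocabulary position of <PAD>
BUCKET_SIZES = [10, 20, 40]  # hypothetical maximum sentence lengths per bucket


class MyTextData(TextData):
    """Sketch of a new text data set with one sentence per line in each split file."""

    def loading(self, initialize_term=True):
        for split, path in [('train', 'my_train.txt'),
                            ('dev', 'my_dev.txt'),
                            ('test', 'my_test.txt')]:
            self.data_set[split] = []
            with open(path, 'r') as f:
                for line in f:
                    elem = {}
                    # encode_sentence() tokenizes the raw string and maps it to positions.
                    elem['sentence'] = self.embeddings.encode_sentence(
                        line.strip(), initialize=initialize_term, count_up=True)
                    self.data_set[split].append(elem)

    def bucketize_data(self, split='train'):
        bucketized = {}
        for bucket_size in BUCKET_SIZES:
            bucket_name = str(bucket_size)
            bucketized[bucket_name] = {}
            bucketized[bucket_name]['data'] = []
            bucketized[bucket_name]['buckets'] = [bucket_size]
            bucketized[bucket_name]['length'] = 0
            bucketized[bucket_name]['position'] = 0

        for elem in self.data_set[split]:
            positions = elem['sentence']  # assumed to be a list of word positions
            elem['sentence_length'] = len(positions)
            elem['sentence_positions'] = positions
            # Put the sentence into the smallest bucket it fits into.
            for bucket_size in BUCKET_SIZES:
                if elem['sentence_length'] <= bucket_size:
                    bucket_name = str(bucket_size)
                    elem['sentence_positions'] = pad_positions(
                        elem['sentence_positions'], PAD_position, bucket_size)
                    bucketized[bucket_name]['data'].append(elem)
                    bucketized[bucket_name]['length'] += 1
                    break
        return bucketized  # the real classes presumably store this on the object
```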
The capabilities for the new text data set need to be added to `DataLoader` and made callable here:
```python
if data_set == "SNLI":
    self.labels = {'neutral': 0, 'entailment': 1, 'contradiction': 2, '-': 3}
    self.data_ob = SNLIData(self.labels, data_params=param_dict, bucket_params=bucket_params,
                            embeddings=self.embedding)
    self.load_class_data = self.data_ob.load_snli
    self.generator = self.data_ob.generator
elif data_set == 'billion_words':
    self.data_ob = BillionWordsData(embeddings=self.embedding, data_params=param_dict)
    self.load_class_data = self.data_ob.load_billion_words
    self.generator = self.data_ob.generator
else:
    raise Exception("No valid data_set set was set")
```
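For the hypothetical `MyTextData` class from the sketch above, a new branch could look like this (the data set name string and the loading function are assumptions):

```python
elif data_set == 'my_text_data':
    self.data_ob = MyTextData(embeddings=self.embedding, data_params=param_dict)
    self.load_class_data = self.data_ob.loading
    self.generator = self.data_ob.generator
```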