Here we implement a data pre-processing pipeline (which we built to use as part of a larger codebase) that abstracts over the different types of embeddings and downstream task data sets and provides easy-to-call functions which can be used while building and training models. We hope it will be useful to you as well!

Contributors

Aishwarya Kamath @ashkamath: https://github.com/ashkamath
Jonas Pfeiffer @JoPfeiff: https://github.com/JoPfeiff

Framework for NLP Text Data

We are trying to define a framework for NLP tasks that easily maps any kind of word embedding data set to any kind of text data set. The framework should decrease the amount of additional code needed to work on different NLP tasks.
We have found that many NLP tasks require similar preprocessing steps.
These include:

  • tokenizing the text
  • replacing words with embeddings (pretrained or newly learnt)
  • bucketizing sentences based on their length
  • padding sentences to a specific length
  • replacing unseen words with <UNK>
  • creating a generator that loops through the sentences

We therefore want to create a framework that provides these common functionalities out of the box, so that we can focus on the core task of a project sooner.

Currently the framework has the following capabilities:

DataLoader

This is the main class to be called and can be found in data_loading/data_loader.py.
The DataLoader class maps embeddings to text data sets. This code needs to be edited to accept new kinds of embedding and text data sets.
To combine FastText embeddings with the SNLI data set, we can call the DataLoader like this:

dl = DataLoader(data_set='SNLI', embedding_loading='in_dict', embeddings_initial='FastText-Wiki', 
                embedding_params={}, K_embeddings=float('inf'))
gen = dl.get_generator(data_set='train', batch_size=64, drop_last=True)
data, batch = next(gen)

The generator loops through the defined data_set once, so a new generator has to be created for each epoch. drop_last=True means that an incomplete final batch is dropped instead of being yielded.
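
As an illustration, a training loop over several epochs might look like the sketch below. It assumes the generator yields (data, batch) tuples and stops after one pass, as described above; num_epochs and train_step are placeholders for your own code.

num_epochs = 10
for epoch in range(num_epochs):
    # a fresh generator is needed for every epoch
    gen = dl.get_generator(data_set='train', batch_size=64, drop_last=True)
    for data, batch in gen:
        train_step(data, batch)  # placeholder for the actual model update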

This class loads embeddings based on a defined strategy. Currently two versions are implemented:

  • embedding_loading='top_k'
    loads the first K embeddings from file, assuming they are sorted with the most frequent at the top. To load all embeddings, set K_embeddings=float('inf') (see the example after this list).
  • embedding_loading='in_dict'
    preloads all embeddings and then keeps only those that occur in the text data set.
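
For instance, to keep only the 100,000 most frequent FastText vectors instead of filtering by the data set's dictionary, a call might look like this (the value of K_embeddings is only an illustration):

dl = DataLoader(data_set='SNLI', embedding_loading='top_k', embeddings_initial='FastText-Wiki',
                embedding_params={}, K_embeddings=100000)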

The class also makes it possible to store all the loaded data on disk in a pickle file and load it back into the object. This can be done by

dl.load()
dl.dump()

However, these functions only dump what has currently been loaded into the object. To load everything at the start and then dump it to file, call

dl.get_all_and_dump()

This function also automatically bucketizes all the sentences based on the defined bucketizing strategy.

The current tokenizer is based on spaCy.io and can easily be replaced in data_loading/data_utils.py in the function tokenize().
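
As a rough illustration of what such a replacement could look like (the exact signature of tokenize() in data_loading/data_utils.py may differ, and the spaCy model name below is an assumption):

import spacy

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])

def tokenize(text):
    # spaCy-based tokenization, roughly what the framework does today
    return [token.text for token in nlp(text)]

def tokenize_whitespace(text):
    # a trivial drop-in alternative: split on whitespace instead of using spaCy
    return text.split()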

Currently, a set of out-of-the-box embedding and text data sets has been implemented. These are:

Embeddings

The core class is Embeddings, which can be found in embeddings/embeddings.py. However, it should only be used as the super class for specialized embeddings; new embedding classes inherit from it (e.g. class FastTextEmbeddings(Embeddings) in embeddings/fasttext_embeddings.py). The core Embeddings class should only be instantiated directly if the embeddings are to be initialized randomly.

A generic path-based embedding class is implemented that can process any kind of embeddings stored as a text document with the structure <word>\t<float>\t<float>\t...\t<float>\n. This object is used when the parameter embeddings_initial='Path' is passed while creating the data_loading object. The parameters, e.g. where the embedding data is stored, are passed as a dictionary:
embedding_params = {'path':'../data/embeddings/bow2.words', 'name':'bow2'}
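
Putting the two together, loading such path-based embeddings could look like this (the path and name above are the example values; the rest follows the DataLoader call shown earlier):

embedding_params = {'path': '../data/embeddings/bow2.words', 'name': 'bow2'}
dl = DataLoader(data_set='SNLI', embedding_loading='in_dict', embeddings_initial='Path',
                embedding_params=embedding_params, K_embeddings=float('inf'))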

The pre-implemented word embeddings are:

  • Fast-Text:
    https://fasttext.cc/docs/en/english-vectors.html
    • embeddings_initial='FastText-Crawl'
    • embeddings_initial='FastText-Wiki'
  • Glove-Embeddings:
    https://nlp.stanford.edu/projects/glove/
    • embeddings_initial='Glove-Twitter-25'
    • embeddings_initial='Glove-Twitter-50'
    • embeddings_initial='Glove-Twitter-100'
    • embeddings_initial='Glove-Twitter-200'
    • embeddings_initial='Glove-Common-42B-300'
    • embeddings_initial='Glove-Common-840B-300'
    • embeddings_initial='Glove-Wiki-50'
    • embeddings_initial='Glove-Wiki-100'
    • embeddings_initial='Glove-Wiki-200'
    • embeddings_initial='Glove-Wiki-300'
  • Lear-Embeddings:
    • embeddings_initial='Lear'
  • Polyglot-Embeddings:
    http://bit.ly/19bSoAS
    • embeddings_initial='Polyglot'

Implementing new Embedding classes

To implement a new Embedding class, it should inherit from the class Embeddings, which can be found in embeddings/embeddings.py. This class already implements all the basic functionality needed for most embedding data. The new Embedding class only needs two functions, which are data set dependent (a minimal sketch follows the list below). These are:

  • load_top_k(self, K, preload=False)
    This loads the top K embeddings from file, under the assumption that the embeddings are ordered by frequency. The following points are important when implementing this function:

    • Adding the term should be done using
      self.add_term(term, preload=preload)
    • If special embeddings (<UNK>, <PAD>, <START>, <END>) need to be added, this is to be done using
      special_embeddings = self.add_special_embeddings(len(embeddings[0]), preload=preload)
    • The function should return the embeddings as a np.array()
      return np.array(embeddings)
  • get_name(self)
    This should return the name of the embeddings e.g. 'FastText-Wiki'
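
As a minimal sketch (not the actual implementation: the file format, the constructor, and whether the special embeddings have to be appended to the returned matrix are assumptions), a new class could look roughly like this:

import numpy as np
from embeddings.embeddings import Embeddings


class MyEmbeddings(Embeddings):
    # Hypothetical embedding class for a whitespace-separated vector file.

    def __init__(self, path='../data/embeddings/my_vectors.txt'):
        super(MyEmbeddings, self).__init__()  # assumed constructor signature
        self.path = path                      # assumed location of the embedding file

    def load_top_k(self, K, preload=False):
        # Read the first K lines, assuming one "<word> <float> ... <float>" per line,
        # ordered from most to least frequent.
        embeddings = []
        with open(self.path, 'r') as f:
            for i, line in enumerate(f):
                if i >= K:
                    break
                parts = line.rstrip().split(' ')
                self.add_term(parts[0], preload=preload)
                embeddings.append([float(x) for x in parts[1:]])
        # add <UNK>, <PAD>, <START>, <END>; appending the returned vectors is an assumption
        special_embeddings = self.add_special_embeddings(len(embeddings[0]), preload=preload)
        if special_embeddings is not None:
            embeddings.extend(special_embeddings)
        return np.array(embeddings)

    def get_name(self):
        return 'MyEmbeddings'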

The new class then needs to be registered in DataLoader in data_loading/data_loader.py so that it can be selected by name. Two new lines need to be added to:

            if self.embeddings_initial in FASTTEXT_NAMES:
                self.embedding = FastTextEmbeddings(self.embeddings_initial)
            elif self.embeddings_initial in POLYGLOT_NAMES:
                self.embedding = PolyglotEmbeddings()
            elif self.embeddings_initial in LEAR_NAMES:
                self.embedding = LearEmbeddings()
            elif self.embeddings_initial in GLOVE_NAMES:
                self.embedding = GloveEmbeddings(self.embeddings_initial)
            elif self.embeddings_initial == "Path":
                self.embedding = PathEmbeddings(self.embedding_params)
            else:
                raise Exception("No valid embedding was set")

For reference please look at embeddings/fasttext_embeddings.py

Text Data Sets

The text data sets implement a bucketized loading structure. This means that sentences are bucketized based on their length (conditioned on the words in the dictionary) and stored in memory.
A generator is provided that loops through the data points in random order by first sampling a bucket and then sampling a data point from that bucket.
Out-of-the-box text data sets are SNLI (data_set='SNLI') and the Billion Word corpus (data_set='billion_words').

Implementing New Text Data Sets

New text data sets are to inherit from the class TextData, which can be found in text/text_data.py. This class provides out-of-the-box functionality such as loading and storing data; the generator function, which samples from the bucketized sentences, is also defined here.
The new class needs two functions (a simplified sketch follows after this list):

  • loading()
    The data is loaded to memory and extracted into sentences.
    Important factors of this function:
    • data is to be stored in self.data_set
      This variable has been initialized in TextData and is a dictionary.
    • raw data is to be stored in self.data_set['train'], self.data_set['dev'] and self.data_set['test']
    • each data point is to be stored as a dictionary ({}) and appended to the corresponding list (e.g. self.data_set['train'] = [])
    • parsing a sentence is to be done using
    elem['sentence'] = self.embeddings.encode_sentence(string_sentence,
                                                       initialize=initialize_term,
                                                       count_up=True)
    
    where string_sentence is the raw text of the sentence to be encoded. The tokenizer defined in data_utils will tokenize this string.
  • bucketize_data()
    This function bucketizes the data into the defined buckets based on the length returned by self.embeddings.encode_sentence(). The data needs to be stored in the format:
            bucketized[bucket_name] = {}
            bucketized[bucket_name]['data'] = []
            bucketized[bucket_name]['buckets'] = [bucket_size]
            bucketized[bucket_name]['length'] = 0
            bucketized[bucket_name]['position'] = 0
    
    This is to be predefined for each bucket.
    • data includes all data points in element dictionary format
    • buckets is the bucket size (or bucket sizes for SNLI)
    • length is the number of data points in this bucket
    • position is to be initialized to 0 and will be counted up in the generator (refer to the generator logic to understand this)
      Each data point is to be processed using pad_positions(elem['sentence_positions'], PAD_position, b1)
      Each data point in each bucket needs the following information:
    elem['sentence_length'] = int
    elem['sentence_positions'] = list
    
    where
    • sentence_length is the actual length of the sentence before padding
    • sentence_positions is a list of indices including the padded words

Please refer to text/SNLI_data_loading.py and text/billion_words.py for more information
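
To make these pieces concrete, below is a rough, simplified sketch of a new single-sentence data set class. It is not the actual implementation: the file paths, the constructor signature, the import location of pad_positions, how the <PAD> position is obtained, where the bucketized dictionary is attached, and the assumption that encode_sentence() returns the list of word positions are all assumptions; the real examples referenced above are authoritative.

from text.text_data import TextData
from data_loading.data_utils import pad_positions  # assumed import location


class MyTextData(TextData):
    # Hypothetical data set: one raw text file per split, one sentence per line.

    def __init__(self, embeddings, data_params=None, bucket_params=(10, 20, 50)):
        super(MyTextData, self).__init__()         # assumed constructor signature
        self.embeddings = embeddings
        self.data_params = data_params or {}
        self.bucket_params = list(bucket_params)    # assumed: maximum sentence length per bucket

    def loading(self, initialize_term=True):
        # load raw text into self.data_set['train'], self.data_set['dev'], self.data_set['test']
        for split in ['train', 'dev', 'test']:
            self.data_set[split] = []
            with open('../data/my_corpus/%s.txt' % split) as f:   # assumed paths
                for line in f:
                    elem = {}
                    # encode_sentence tokenizes the raw string and returns word positions
                    elem['sentence'] = self.embeddings.encode_sentence(line.strip(),
                                                                       initialize=initialize_term,
                                                                       count_up=True)
                    self.data_set[split].append(elem)

    def bucketize_data(self, split='train', PAD_position=0):
        # PAD_position is passed in here; how the real classes obtain it is not shown in this sketch
        bucketized = {}
        for bucket_size in self.bucket_params:
            bucket_name = 'bucket_%d' % bucket_size
            bucketized[bucket_name] = {}
            bucketized[bucket_name]['data'] = []
            bucketized[bucket_name]['buckets'] = [bucket_size]
            bucketized[bucket_name]['length'] = 0
            bucketized[bucket_name]['position'] = 0

        for elem in self.data_set[split]:
            positions = elem['sentence']
            elem['sentence_length'] = len(positions)
            # smallest bucket the sentence fits into; overly long sentences are skipped here
            fitting = [b for b in self.bucket_params if len(positions) <= b]
            if not fitting:
                continue
            b1 = fitting[0]
            elem['sentence_positions'] = pad_positions(positions, PAD_position, b1)
            bucket_name = 'bucket_%d' % b1
            bucketized[bucket_name]['data'].append(elem)
            bucketized[bucket_name]['length'] += 1
        return bucketized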

The new text data set then needs to be registered and made callable in DataLoader here:

        if data_set == "SNLI":
           self.labels = {'neutral': 0, 'entailment': 1, 'contradiction': 2, '-': 3}
           self.data_ob = SNLIData(self.labels, data_params=param_dict, bucket_params=bucket_params,
                                   embeddings=self.embedding)
           self.load_class_data = self.data_ob.load_snli
           self.generator = self.data_ob.generator
       elif data_set == 'billion_words':
           self.data_ob = BillionWordsData(embeddings=self.embedding, data_params=param_dict)
           self.load_class_data = self.data_ob.load_billion_words
           self.generator = self.data_ob.generator
       else:
           raise Exception("No valid data_set set was set")
