
Commit

Initial commit
nreimers committed Jul 13, 2017
1 parent 55f1afb commit 27d598d
Showing 37 changed files with 1,532,452 additions and 0 deletions.
29 changes: 29 additions & 0 deletions .gitignore
@@ -0,0 +1,29 @@
*.pkl
*.pyc
*~

.settings
.pydevproject
.project


results_lichtenberg/
pkl/
alt/
results/

corpora/
tmp/
models/
lichtenberg_scripts/SingleTask
lichtenberg_scripts/MultiTask

/levy_deps.words

data/ace
data/conll2003_ner
data/ecb+
data/tac2015
data/tempeval3
data/wsj_pos
data/removeComments.py
76 changes: 76 additions & 0 deletions Pretrained_Models.md
@@ -0,0 +1,76 @@
# Pretrained Sequence Tagging Models
In the following, some pre-trained models are provided for several common sequence tagging tasks. These models can be used by executing:
```
python RunModel.py modelname.h5 input.txt
```

For the English models, we used the word embeddings by [Levy et al.](https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/). For the German models, we used the word embeddings by [Reimers et al.](https://www.ukp.tu-darmstadt.de/research/ukp-in-challenges/germeval-2014/).


## POS
We trained POS taggers on the [Universal Dependencies](http://universaldependencies.org/) v1.3 datasets:

| Language | Development (Accuracy) | Test (Accuracy) |
|----------|:-----------:|:----:|
|[English (UD)](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/EN_UD_POS.h5) | 95.47% | 95.55% |
|[German (UD)](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/DE_UD_POS.h5) | 94.86% | 93.99% |

Further, we trained models on the Wall Street Journal:

| Language | Development (Accuracy) | Test (Accuracy) |
|----------|:-----------:|:----:|
|[English (WSJ)](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/EN_WSJ_POS.h5) | 97.18% | 97.21% |

The reported performance is accuracy.


## Chunking
Trained on the [CoNLL 2000 chunking dataset](http://www.cnts.ua.ac.be/conll2000/chunking/). Performance is the F1-score.

| Language | Development (F1) | Test (F1) |
|----------|:-----------:|:----:|
|[English (CoNLL 2000)](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/EN_Chunking.h5) | 95.40% | 94.70% |


## NER
Trained on [CoNLL 2003](http://www.cnts.ua.ac.be/conll2003/ner/) and [GermEval 2014](https://sites.google.com/site/germeval2014ner/).

| Language | Development (F1) | Test (F1) |
|----------|:-----------:|:----:|
|[English (CoNLL 2003)](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/EN_NER.h5) | 94.29% | 90.87% |
|[German (CoNLL 2003)](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/DE_NER_CoNLL.h5) | 80.80% | 77.49% |
|[German (GermEval 2014)](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/DE_NER_GermEval.h5) | 80.85% | 80.00% |


## Entities
Trained on [ACE 2005](https://catalog.ldc.upenn.edu/LDC2006T06).

| Language | Development (F1) | Test (F1) |
|----------|:-----------:|:----:|
|[English](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/EN_Entities.h5) | 82.46% | 85.78% |


## Events
Trained on [TempEval3](https://www.cs.york.ac.uk/semeval-2013/task1/).

| Language | Development (F1) | Test (F1) |
|----------|:-----------:|:----:|
|[English](https://public.ukp.informatik.tu-darmstadt.de/reimers/2017_SequenceTaggingModels/EN_Events.h5) |- | 82.28% |


## Parameters
The following parameters and configurations were used for the pretrained models.

```
English NER:
Glove 6B 100 embeddings with params = {'dropout': [0.25, 0.25], 'classifier': 'CRF', 'LSTM-Size': [100,75], 'optimizer': 'nadam', 'charEmbeddings': 'CNN', 'miniBatchSize': 32}
German NER (CoNLL 2003 and GermEval 2014):
Reimers et al., 2014, GermEval embeddings with params = {'dropout': [0.25, 0.25], 'classifier': 'CRF', 'LSTM-Size': [100,75], 'optimizer': 'nadam', 'charEmbeddings': 'CNN', 'miniBatchSize': 32}
Entities:
Glove 6B 100 embeddings, params = {'dropout': [0.25, 0.25], 'classifier': 'CRF', 'LSTM-Size': [100,75], 'optimizer': 'nadam', 'charEmbeddings': 'CNN', 'miniBatchSize': 32}
```


124 changes: 124 additions & 0 deletions README.md
@@ -0,0 +1,124 @@
This repository contains an LSTM-CRF implementation for sequence tagging, e.g. POS tagging, chunking, or named entity recognition. The implementation is based on Keras 1.x and can be run with Theano or TensorFlow as backend.

The hyperparameters of this network can easily be configured, so that you can reproduce the systems proposed by [Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/abs/1508.01991), [Ma and Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354), and [Lample et al., Neural Architectures for Named Entity Recognition](https://arxiv.org/abs/1603.01360).

The implementation was optimized for **performance** using a smart shuffling of the training data that groups sentences of the same length together. This increases training speed several-fold in comparison to the implementations by Ma and Hovy or Lample et al.

The training of the network is simple and can easily be extended to new datasets and languages. For example, see [Train_POS.py](Train_POS.py).

Pretrained models can be stored and loaded for inference. Simply execute `python RunModel.py models/modelname.h5 input.txt`. Pretrained models for several sequence tagging tasks using this LSTM-CRF implementation are provided in [Pretrained Models](Pretrained_Models.md).

This implementation can be used for **Multi-Task Learning**, i.e. learning several tasks with non-overlapping datasets simultaneously. The file [Train_MultiTask.py](Train_MultiTask.py) shows an example of how the LSTM-CRF network can be used to learn POS tagging and chunking simultaneously. The number of tasks is not limited. Tasks can be supervised at the same output level or at different levels, for example to re-implement the approach by [Sogaard and Goldberg, Deep multi-task learning with low level tasks supervised at lower layers](http://anthology.aclweb.org/P16-2038).


# Setup
First clone or download the source code.

Set up a virtual environment (optional):
```
virtualenv foldername/.env
source foldername/.env/bin/activate
```
Install the requirements:
```
cd foldername
pip install -r requirements.txt
```

If everything works well, you can run `python Train_POS.py` to train a deep POS tagger for the POS tagset from Universal Dependencies.


# Training
Training new models is simple. Look at `Train_POS.py` and `Train_Chunking.py` for examples.

Place new datasets in the folder `data`. The system expects three files, `train.txt`, `dev.txt` and `test.txt`, in CoNLL format, i.e. each token is on its own line, columns are separated by whitespace (either a space or a tab), and sentences are separated by an empty line.

For an example, look at `data/conll2000_chunking/train.txt`. Files with multiple columns, like `data/unidep_pos/train.txt`, are no problem, as we will specify later which columns should be used for training.
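
As an illustration, a chunking file in this format could look like the following excerpt, with three whitespace-separated columns containing the token, the POS tag, and the BIO chunk tag (the sentences and tags here are made up for illustration and are not taken from the dataset):
```
The DT B-NP
dog NN I-NP
barks VBZ B-VP
. . O

It PRP B-NP
sleeps VBZ B-VP
. . O
```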

To train an LSTM network, you must adapt the following lines of code (`Train_POS.py`):
```
datasetName = 'unidep_pos'
dataColumns = {1:'tokens', 3:'POS'} #Tab separated columns, column 1 contains the token, 3 the universal POS tag
labelKey = 'POS'
embeddingsPath = 'levy_deps.words' #Word embeddings by Levy et al: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
```

`datasetName` defines the name of the dataset; here it will use the data in the folder `data/unidep_pos`. `dataColumns` specifies the columns that should be read from the CoNLL file, counting from 0: in this case, column 1 contains the tokens and column 3 the universal POS tag. Note that we must always specify a 'tokens' column. The other columns can be named arbitrarily.

`labelKey` specifies which column serves as the label; in this case we want to perform POS tagging. The name must match one of the names specified in the `dataColumns` dictionary.

`embeddingsPath` contains the path to pre-trained word embeddings. The format must be text-based, i.e. each line contains the embedding for one word: the first column in that line is the word, followed by the dense vector. Our script will automatically download the embeddings by Levy et al. if they are not present.
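
As an illustration, here is a minimal, hypothetical sketch of how such a text-based embeddings file could be parsed; the actual loading is handled by `perpareDataset` from `util/preprocessing.py` and may differ in its details:
```
# Sketch only: read a text-based embeddings file (one word per line, followed by its vector).
import numpy as np

def load_text_embeddings(path):
    word2vec = {}
    with open(path, 'r') as fIn:
        for line in fIn:
            parts = line.rstrip().split(' ')
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            word = parts[0]
            vector = np.array([float(x) for x in parts[1:]], dtype='float32')
            word2vec[word] = vector
    return word2vec

# embeddings = load_text_embeddings('levy_deps.words')
```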


If you want to perform chunking instead of POS-tagging, simply change the first lines (`Train_Chunking.py`):
```
datasetName = 'conll2000_chunking'
dataColumns = {0:'tokens', 1:'POS', 2:'chunk_BIO'} #Tab separated columns, column 0 contains the token, 1 the POS, 2 the chunk information using a BIO encoding
labelKey = 'chunk_BIO'
```

**Note:** By appending *_BIO* to a column name, we indicate that this column is BIO encoded. The system will then compute the F1-score instead of the accuracy.

# Running a stored model
If enabled during the training process, models are stored in the `models` folder. These models can be loaded and used to tag new data. An example is implemented in `RunModel.py`:

```
python RunModel.py models/modelname.h5 input.txt
```

This script reads the model `models/modelname.h5` as well as the text file `input.txt`. The text is split into sentences and tokenized using NLTK. The tagged output is written in CoNLL format to standard output.
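
For an NER model, the output could look roughly like this (the tags are purely illustrative), with one token and its tag per line and an empty line between sentences:
```
John	B-PER
lives	O
in	O
Berlin	B-LOC
.	O

He	O
works	O
there	O
.	O
```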


# Multi-Task-Learning
The class `neuralnets/MultiTaskLSTM.py` implements a Multi-Task-Learning setup using LSTM. The code and parameters are similar to the Single-Task setup.

The file `Train_MultiTask.py` contains an example of how to run the code. There, we define which datasets should be used:
```
posName = 'unidep_pos'
posColumns = {1:'tokens', 3:'POS'}
chunkingName = 'conll2000_chunking'
chunkingColumns = {0:'tokens', 1:'POS', 2:'chunk_BIO'}
datasetFiles = [
    (posName, posColumns),
    (chunkingName, chunkingColumns)
]
#....
datasetTuples = {
    'POS': (posData, 'POS', True),
    'Chunking': (chunkingData, 'chunk_BIO', True)
}
```

As before, we define the dataset names together with their column names and store this information in the `datasetFiles` array. The dictionary `datasetTuples` contains the preprocessed datasets (`posData` and `chunkingData`) and the column we would like to use as the label (`POS` and `chunk_BIO`). The boolean parameter defines whether this dataset should be evaluated; if it is set to `False`, no performance scores will be printed for this dataset.


# LSTM-Hyperparameters
The parameters of the LSTM-CRF network can be configured by passing a parameter dictionary to the BiLSTM constructor: `BiLSTM(params)`. A short configuration sketch follows the parameter list below.

The following parameters exist:
* **miniBatchSize**: Size (Nr. of sentences) for mini-batch training. Default value: 32
* **dropout**: Set to 0, for no dropout. For naive dropout, set it to a real value between 0 and 1. For variational dropout, set it to a two-dimensional tuple or list, with the first entry corresponding to output dropout and the second entry to the recurrent dropout. Default value: [0.25, 0.25]
* **classifier**: Set to `Softmax` to use a softmax classifier or to `CRF` to use a CRF-classifier as the last layer of the network. Default value: `Softmax`
* **LSTM-Size**: List of integers with the number of recurrent units for the stacked LSTM-network. The list [100,75,50] would create 3 stacked BiLSTM-layers with 100, 75, and 50 recurrent units. Default value: [100]
* **optimizer**: Available optimizers: SGD, AdaGrad, AdaDelta, RMSProp, Adam, Nadam. Default value: `nadam`
* **earlyStopping**: Early stopping after a certain number of epochs if no improvement on the development set was achieved. Default value: 5
* **addFeatureDimensions**: Dimension for additional features that are passed to the network. Default value: 10
* **charEmbeddings**: Available options: [None, 'CNN', 'LSTM']. If set to `None`, no character-based representations will be used. With `CNN`, the approach by [Ma & Hovy](https://arxiv.org/abs/1603.01354) using a CNN will be used. With `LSTM`, an LSTM network will be used to derive the character-based representation ([Lample et al.](https://arxiv.org/abs/1603.01360)). Default value: `None`
* **charEmbeddingsSize**: The dimension for characters, if the character-based representation is enabled. Default value: 30
* **charFilterSize**: If the CNN approach is used, this parameter defines the filter size, i.e. the output dimension of the convolution. Default: 30
* **charFilterLength**: If the CNN approach is used, this parameter defines the filter length. Default: 3
* **charLSTMSize**: If the LSTM approach is used, this parameter defines the size of the recurrent units. Default: 25
* **clipvalue**: If non-zero, the gradient will be clipped to this value. Default: 0
* **clipnorm**: If non-zero, the gradient norm will be clipped to this value. Default: 1
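
As a minimal sketch, a configuration similar to the one used for the pretrained NER models (see [Pretrained Models](Pretrained_Models.md)) could be passed to the constructor as follows; the concrete values are examples taken from the training scripts, not tuned recommendations:
```
from neuralnets.BiLSTM import BiLSTM

# Sketch: a parameter dictionary similar to the ones used in the training scripts
params = {'miniBatchSize': 32,
          'dropout': [0.25, 0.25],   # variational dropout: [output dropout, recurrent dropout]
          'classifier': 'CRF',       # CRF output layer instead of softmax
          'LSTM-Size': [100, 75],    # two stacked BiLSTM layers with 100 and 75 recurrent units
          'optimizer': 'nadam',
          'charEmbeddings': 'CNN',   # CNN-based character representation (Ma & Hovy)
          'earlyStopping': 5}

model = BiLSTM(params)
```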

For the `MultiTaskLSTM.py` network, the following additional parameter exists:
* **customClassifier**: A dictionary that maps each dataset to an individual classifier. For example, the POS task could use a Softmax-classifier, while the Chunking dataset is trained with a CRF-classifier.



42 changes: 42 additions & 0 deletions RunModel.py
@@ -0,0 +1,42 @@
#!/usr/bin/python
#Usage: python RunModel.py modelPath inputPath
from __future__ import print_function
import nltk
from util.preprocessing import addCharInformation, createMatrices, addCasingInformation
from neuralnets.BiLSTM import BiLSTM
import sys

if len(sys.argv) < 3:
print("Usage: python RunModel.py modelPath inputPath")
exit()

modelPath = sys.argv[1]
inputPath = sys.argv[2]

with open(inputPath, 'r') as f:
    text = f.read()


# :: Load the model ::
lstmModel = BiLSTM()
lstmModel.loadModel(modelPath)


# :: Prepare the input ::
sentences = [{'tokens': nltk.word_tokenize(sent)} for sent in nltk.sent_tokenize(text)]
addCharInformation(sentences)
addCasingInformation(sentences)

dataMatrix = createMatrices(sentences, lstmModel.mappings)

# :: Tag the input ::
tags = lstmModel.tagSentences(dataMatrix)


# :: Output to stdout ::
for sentenceIdx in range(len(sentences)):
    tokens = sentences[sentenceIdx]['tokens']
    tokenTags = tags[sentenceIdx]
    for tokenIdx in range(len(tokens)):
        print("%s\t%s" % (tokens[tokenIdx], tokenTags[tokenIdx]))
    print("")
83 changes: 83 additions & 0 deletions Train_Chunking.py
@@ -0,0 +1,83 @@
from __future__ import print_function
import os
import logging
import sys
from neuralnets.BiLSTM import BiLSTM
from util.preprocessing import perpareDataset, loadDatasetPickle

# :: Change into the working dir of the script ::
abspath = os.path.abspath(__file__)
dname = os.path.dirname(abspath)
os.chdir(dname)

# :: Logging level ::
loggingLevel = logging.INFO
logger = logging.getLogger()
logger.setLevel(loggingLevel)

ch = logging.StreamHandler(sys.stdout)
ch.setLevel(loggingLevel)
formatter = logging.Formatter('%(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)




######################################################
#
# Data preprocessing
#
######################################################


# :: Train / Dev / Test-Files ::
datasetName = 'conll2000_chunking'
dataColumns = {0:'tokens', 1:'POS', 2:'chunk_BIO'} #Tab separated columns, column 0 contains the token, 1 the POS, 2 the chunk information using a BIO encoding
labelKey = 'chunk_BIO'

embeddingsPath = 'levy_deps.words' #Word embeddings by Levy et al: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

#Parameters of the network
params = {'dropout': [0.25, 0.25], 'classifier': 'CRF', 'LSTM-Size': [50], 'optimizer': 'nadam', 'charEmbeddings': None, 'miniBatchSize': 32}




frequencyThresholdUnknownTokens = 50 #If a token that is not in the pre-trained embeddings file appears at least 50 times in the train.txt, then a new embedding is generated for this word

datasetFiles = [
    (datasetName, dataColumns),
]


# :: Prepares the dataset to be used with the LSTM-network. Creates and stores cPickle files in the pkl/ folder ::
pickleFile = perpareDataset(embeddingsPath, datasetFiles)


######################################################
#
# The training of the network starts here
#
######################################################

#Load the embeddings and the dataset
embeddings, word2Idx, datasets = loadDatasetPickle(pickleFile)
data = datasets[datasetName]


print("Dataset:", datasetName)
print(data['mappings'].keys())
print("Label key: ", labelKey)
print("Train Sentences:", len(data['trainMatrix']))
print("Dev Sentences:", len(data['devMatrix']))
print("Test Sentences:", len(data['testMatrix']))

model = BiLSTM(params)
model.setMappings(embeddings, data['mappings'])
model.setTrainDataset(data, labelKey)
model.verboseBuild = True
model.modelSavePath = "models/%s/%s/[DevScore]_[TestScore]_[Epoch].h5" % (datasetName, labelKey) #Enable this line to save the model to the disk
model.evaluate(50)

