A word tagging project created during my master's thesis with the AUEB NLP Group. In this project, various machine learning methods are tested and compared to check their accuracy when performing part-of-speech tagging on Greek words. Nevertheless, the code was developed in such a way that it allows 1-of-K classification of words regardless of language. This means that you can use this software for other tasks of this type, such as Named Entity Recognition.
More info regarding the master's thesis and the NLP Group of AUEB can be found at: http://nlp.cs.aueb.gr/
An addition to the word tagging project is a tokenizer implemented with song tokenization in mind.
- Unix or Windows Computer and admin rights to it
- Java 8+
- 8 GB of memory; otherwise you have to tweak the code to be stream-based, which I don’t recommend, since it may be much slower
- Nvidia GPU with 4+ GB of memory and a lot of CUDA cores if you are planning to use large datasets and deeplearning4j
- Very good knowledge of Java (8)
- Average knowledge of the Unix or Windows shell
- Basic knowledge of Maven and packaging
- If you plan to mess around with native code, good knowledge of C, C++ and/or CUDA.
- Install Maven on your system
- Install DeepLearning4j (at the time of writing, Snapshot-3.9).
- Have some training data in the CONLL format.
- Have some test data in the CONLL format.
- If you want to classify new data, provide it in the CONLL format as well.
- SVMs work only on Windows. If you need them on Unix, either wrap Stanford’s SVMFactory class like I did or use another package.
- To use the tokenizer on a song, the song lyrics must be stored in a directory. The lyrics must follow the common lyric format (each lyric on its own line, an empty line between verses, repetition marks for the lyrics to be repeated, etc.)
- Create a new class with a main method.
- Create a new ExperiementSetup.Builder and build it after setting the appropriate variables for the experiment.
- Use the Javadoc
- The tokenizer can be used by itself through the main method it includes, or it can be used by another class. In the first case, just call its main method (after setting the file path). In the second case, create a SongTokenizer object, read the file into a String with the SongHelper class, and call the tokenize method (from the SongTokenizer class) with that String as input.
- All the possible categories, provided in a file, separated by a newline (\n). Example:
Category1
Category2
Category3
…
eof (end of file)
- Word-category files as wordsets (trainset, testset, etc.). Each word-category pair is on its own line, with the word and the category separated by any preferred delimiter (space and tab are recommended). You have to set the delimiter in code; otherwise the space is used. Pairs are separated from each other by a newline (\n). Example:
word1 Category1
word2 Category3
word3 Category1
…
eof
- Embedding file. This file contains pairs of a word and a double-valued vector. The token always comes first, followed by the double values of the vector. The default delimiter is space: ” ”. Each pair is separated by a newline. Example:
word1 1.0 2.0 3.0 4.0 5.0
word2 2.0 3.0 4.0 5.0 6.0
…
eof
- Gazetteers or dictionaries containing word-category pairs can be used to classify known words/tokens directly. Pairs are separated from each other by a newline. In this case the word-category separator is always space. Example:
word1 cat1
word2 cat2
…
eof
- Special categories and words. The pair %newarticle% null is used to separate word sequences from each other. This means that sequence classifiers (such as CRF and LSTM), as well as the Feature Builder, will not look past that pair in either direction. SO AVOID USING null AS A CATEGORY and %newarticle% as a word/token.
- For the tokenizer, the only input is the path to the file with the song lyrics (it has to be set either in SongTokenizer.main or when invoking SongHelper.readSongFile, which stores the file content in a String).
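The wordset and embedding formats described above can be read with a few lines of Java. The following is a minimal sketch and not the project's own loader: the class DataFormatSketch and its methods are hypothetical names introduced only for illustration. It assumes that each wordset line holds exactly one word and one category, that the %newarticle% null pair only marks a boundary and is not kept, and that embedding values are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project: it only illustrates the file formats.
final class DataFormatSketch {

    static final String BOUNDARY_WORD = "%newarticle%";
    static final String BOUNDARY_CATEGORY = "null";

    /**
     * Parses word-category lines into sequences of {word, category} pairs.
     * A "%newarticle% null" pair closes the current sequence and is not kept.
     */
    static List<List<String[]>> parseWordset(List<String> lines, String delimiter) {
        List<List<String[]>> sequences = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // tolerate stray blank lines
            }
            // The delimiter is treated as a literal string, not as a regex.
            String[] pair = line.trim().split(Pattern.quote(delimiter), 2);
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected word and category: " + line);
            }
            if (BOUNDARY_WORD.equals(pair[0]) && BOUNDARY_CATEGORY.equals(pair[1])) {
                if (!current.isEmpty()) {
                    sequences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(pair);
            }
        }
        if (!current.isEmpty()) {
            sequences.add(current);
        }
        return sequences;
    }

    /** Parses embedding lines of the form "token v1 v2 ... vn" (space-delimited). */
    static Map<String, double[]> parseEmbeddings(List<String> lines) {
        Map<String, double[]> embeddings = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.trim().split(" ");
            double[] vector = new double[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                vector[i - 1] = Double.parseDouble(fields[i]);
            }
            embeddings.put(fields[0], vector);
        }
        return embeddings;
    }

    public static void main(String[] args) {
        List<String> wordset = Arrays.asList(
                "word1 Category1", "word2 Category3", "%newarticle% null", "word3 Category1");
        System.out.println(parseWordset(wordset, " ").size() + " sequences");

        List<String> embeddingLines = Arrays.asList("word1 1.0 2.0 3.0", "word2 2.0 3.0 4.0");
        System.out.println(Arrays.toString(parseEmbeddings(embeddingLines).get("word2")));
    }
}
```

Splitting the wordset into sequences at the boundary pair mirrors the rule that sequence classifiers must not look across it. A gazetteer file could be read with the same parseWordset call using a space as the delimiter.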
After a classifier provides a classified wordset, it can be evaluated against the original. This happens through the evaluation class. A metatagger with programmed rules is also provided. See the code and the Javadoc for more information, since those 2 classes depend on what you plan to do and you should change/reimplement them if needed.
For evaluation, several deep learning packages have been used. The LSTM in the thesis text belongs to Wang Ling. I have made some changes to its code, but due to incompatibility with current dl4j libraries I’ve decided to keep it as a separate project. You can ask me for the code at asikis.thomas@gmail.com.
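As an illustration of what evaluating a classified wordset against the original can look like, here is a minimal sketch of per-token accuracy. It is not the project's evaluation class: AccuracySketch and its accuracy method are hypothetical names, and the choice to skip entries whose gold category is null (the sequence boundary pair) is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the project's evaluation class.
final class AccuracySketch {

    /**
     * Returns the fraction of tokens whose predicted category equals the gold category.
     * Entries whose gold category is "null" are sequence boundaries and are not counted.
     */
    static double accuracy(List<String> gold, List<String> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("Gold and predicted lists differ in length");
        }
        int counted = 0;
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if ("null".equals(gold.get(i))) {
                continue; // boundary pair, not a real token
            }
            counted++;
            if (gold.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return counted == 0 ? 0.0 : (double) correct / counted;
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("noun", "verb", "null", "noun", "adjective");
        List<String> predicted = Arrays.asList("noun", "noun", "null", "noun", "adjective");
        System.out.println(accuracy(gold, predicted)); // prints 0.75
    }
}
```

Per-category precision and recall would follow the same pattern, with one counter per category instead of a single pair of counters.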
For the implementation of other deep nets, currently the computation graph of dl4j is used. See more about it here: http://deeplearning4j.org/compgraph
SVMs have only been implemented for Windows 64-bit by wrapping the Windows Factory of Stanford CoreNLP and providing the appropriate binaries. For other platforms you might have to do it yourself.
For now I am providing some small data files in Greek, just for testing purposes. Since the size of the experimental data is more than 500MB zipped, email me at asikis.thomas@gmail.com and I will provide you with a download link.
The tokenizer, when used as a standalone tool via its main method, prints out the tokenized song. If it is used by another class, the tokenize method returns a List<String> containing the tokenized song, which can be used as necessary.
The tokens are:
- |<new verse>|, which marks a new verse
- |<new lyric>|, which marks a new lyric
- |<repetition start>|, which marks the start of a repetition*
- |<repetition end>|, which marks the end of a repetition*
- |<word>|, which marks a new word (in the context of song tokenization, this token is unnecessary and therefore discarded from the List<String>)
* When the two repetition tokens are placed one after another in the list, it means that only the following lyric is repeated.
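A class that consumes the returned List<String> will typically want to regroup it into verses and lyrics. The sketch below shows one way to do that. It is only an illustration under stated assumptions: TokenGroupingSketch is a hypothetical class, the marker tokens are assumed to appear in the list exactly as written above, each marker is assumed to precede the content it introduces, and repetition markers are simply skipped rather than expanded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical consumer of the token list, not part of the tokenizer itself.
final class TokenGroupingSketch {

    static final String NEW_VERSE = "|<new verse>|";
    static final String NEW_LYRIC = "|<new lyric>|";
    static final String REPETITION_START = "|<repetition start>|";
    static final String REPETITION_END = "|<repetition end>|";

    /** Rebuilds verses (each a list of lyric lines) from a flat token list. */
    static List<List<String>> groupIntoVerses(List<String> tokens) {
        List<List<String>> verses = new ArrayList<>();
        List<String> verse = null;
        StringBuilder lyric = null;
        for (String token : tokens) {
            if (NEW_VERSE.equals(token) || NEW_LYRIC.equals(token)) {
                // Close the lyric collected so far, if any.
                if (lyric != null && lyric.length() > 0) {
                    verse.add(lyric.toString());
                }
                lyric = null;
                if (NEW_VERSE.equals(token) || verse == null) {
                    verse = new ArrayList<>();
                    verses.add(verse);
                }
                if (NEW_LYRIC.equals(token)) {
                    lyric = new StringBuilder();
                }
            } else if (REPETITION_START.equals(token) || REPETITION_END.equals(token)) {
                continue; // assumption: repetitions are not expanded in this sketch
            } else {
                // A plain word: make sure there is a verse and a lyric to append to.
                if (verse == null) {
                    verse = new ArrayList<>();
                    verses.add(verse);
                }
                if (lyric == null) {
                    lyric = new StringBuilder();
                }
                if (lyric.length() > 0) {
                    lyric.append(' ');
                }
                lyric.append(token);
            }
        }
        if (lyric != null && lyric.length() > 0) {
            verse.add(lyric.toString());
        }
        return verses;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                NEW_VERSE, NEW_LYRIC, "la", "la", NEW_LYRIC, "di", "da",
                NEW_VERSE, NEW_LYRIC, "hey");
        System.out.println(groupIntoVerses(tokens)); // prints [[la la, di da], [hey]]
    }
}
```

If the real tokenizer places markers after the content they describe, the same loop applies with the flush step moved accordingly.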