# Preprocess corpora
##### See [preprocessing.py](https://github.com/devmount/GermanWordEmbeddings/blob/master/preprocessing.py) from [GermanWordEmbeddings](https://devmount.github.io/GermanWordEmbeddings/)

The following code gives an example of how to use the preprocessing script to filter and transform a given corpus. The script requires [gensim](https://radimrehurek.com/gensim/install.html) and [NLTK](http://www.nltk.org/install.html).

### General usage
The script prints its usage when called with the `-h` or `--help` flag:

In [1]:
%%bash
python preprocessing.py --help

usage: preprocessing.py [-h] [-p] [-s] [-u] [-b] raw target

Script for preprocessing public corpora

positional arguments:
  raw                source file with raw data for corpus creation
  target             target file name to store corpus in

optional arguments:
  -h, --help         show this help message and exit
  -p, --punctuation  remove punctuation tokens
  -s, --stopwords    remove stop word tokens
  -u, --umlauts      replace german umlauts with their respective digraphs
  -b, --bigram       detect and process common bigram phrases
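
The help text above has the shape that Python's `argparse` module produces. As a minimal sketch, assuming the script builds its command line with `argparse` (the function name `build_parser` is hypothetical and not taken from the script), an equivalent parser could look like this:

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help output."""
    parser = argparse.ArgumentParser(
        prog="preprocessing.py",
        description="Script for preprocessing public corpora",
    )
    parser.add_argument("raw", help="source file with raw data for corpus creation")
    parser.add_argument("target", help="target file name to store corpus in")
    parser.add_argument("-p", "--punctuation", action="store_true",
                        help="remove punctuation tokens")
    parser.add_argument("-s", "--stopwords", action="store_true",
                        help="remove stop word tokens")
    parser.add_argument("-u", "--umlauts", action="store_true",
                        help="replace german umlauts with their respective digraphs")
    parser.add_argument("-b", "--bigram", action="store_true",
                        help="detect and process common bigram phrases")
    return parser


# Short flags can be combined, as in the examples of this notebook (-ps, -psu, -psub).
args = build_parser().parse_args(["-ps", "corpus/corpus.raw", "corpus/corpus.nostop"])
print(args.punctuation, args.stopwords, args.umlauts, args.bigram)
```

Because every option is a boolean switch, combined forms such as `-psub` simply turn on several processing steps at once.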


### Raw corpus example

To show the results of the preprocessing script, a corpus of 1,000 sentences from German news data is used. Here are the first five sentences of this unprocessed corpus:

In [2]:
%%bash
head -n 5 corpus/corpus.raw

Zwei Männer fuhren unter Drogeneinfluss.
Das war im Juli dieses Jahres.
Krimi-Serie: Im Tatort verteilt der Entführer 500.000 Euro am Alex.
Er sagte aber nicht, für wie lange.
Insgesamt investierte die Baugenossenschaft dafür 2,4 Millionen Euro.


### Remove punctuation

Punctuation is removed with the `-p` or `--punctuation` flag. Note that the dot in *500.000 Euro* (a thousands separator) is removed, but the comma in *2,4 Millionen Euro* (a decimal separator) is kept, so the value of the numbers is preserved. The hyphen in compound words (*Krimi-Serie*) is not removed either.

In [3]:
%%bash
python preprocessing.py -p corpus/corpus.raw corpus/corpus.nopunct
head -n 5 corpus/corpus.nopunct

Zwei Männer fuhren unter Drogeneinfluss
Das war im Juli dieses Jahres
Krimi-Serie Im Tatort verteilt der Entführer 500000 Euro am Alex
Er sagte aber nicht für wie lange
Insgesamt investierte die Baugenossenschaft dafür 2,4 Millionen Euro


2015-07-11 19:18:52,477 : INFO : preprocessing 1000 sentences
2015-07-11 19:18:52,860 : INFO : preprocessing of 1000 sentences finished!
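
The behaviour seen above can be approximated in two steps: drop tokens that consist only of punctuation, then strip selected punctuation characters from inside the remaining tokens. The following is a minimal sketch under stated assumptions. The regex tokenizer, the token set `PUNCT_TOKENS` and the character class `INNER_PUNCT` are hypothetical and were chosen only to reproduce the sample output; the script itself may tokenize differently (for example with NLTK).

```python
import re

# Hypothetical tokenizer: words may contain '.', ',' and '-' internally,
# every other non-space symbol becomes its own token.
TOKEN = re.compile(r"\w[\w.,\-]*\w|\w|[^\w\s]")

# Hypothetical list of tokens that are dropped entirely.
PUNCT_TOKENS = {".", ",", ";", ":", "?", "!", "(", ")", '"', "'", "-"}

# Hypothetical set of characters stripped from inside tokens.
# The comma and the hyphen are deliberately absent.
INNER_PUNCT = re.compile(r"[.:;?!()]")


def remove_punctuation(sentence):
    """Return the sentence without punctuation tokens and inner dots."""
    tokens = TOKEN.findall(sentence)
    tokens = [t for t in tokens if t not in PUNCT_TOKENS]
    tokens = [INNER_PUNCT.sub("", t) for t in tokens]
    return " ".join(t for t in tokens if t)


print(remove_punctuation(
    "Krimi-Serie: Im Tatort verteilt der Entführer 500.000 Euro am Alex."
))
```

Leaving the comma out of `INNER_PUNCT` is what keeps *2,4* intact, while the dot in *500.000* disappears.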


### Remove stopwords

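Stopwords are removed with the `-s` or `--stopwords` flag, using a list of German stopwords from NLTK. The example combines it with `-p`. Words like *im*, *aber*, *nicht* and *wie* are removed from the example. Capitalized words such as *Im* and *Das* remain in the output, which suggests that the matching is case-sensitive.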

In [4]:
%%bash
python preprocessing.py -ps corpus/corpus.raw corpus/corpus.nostop
head -n 5 corpus/corpus.nostop

Zwei Männer fuhren Drogeneinfluss
Das Juli Jahres
Krimi-Serie Im Tatort verteilt Entführer 500000 Euro Alex
Er sagte lange
Insgesamt investierte Baugenossenschaft dafür 2,4 Millionen Euro


2015-07-11 19:18:53,861 : INFO : preprocessing 1000 sentences
2015-07-11 19:18:54,285 : INFO : preprocessing of 1000 sentences finished!
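
Stopword removal amounts to a membership test per token. The sketch below uses a small hand-picked set, `SAMPLE_STOPWORDS`, which is an assumption made to keep the example self-contained; with NLTK installed and the `stopwords` corpus downloaded, the full list is available as `nltk.corpus.stopwords.words("german")`. The comparison is case-sensitive here, which matches the sample output where *Im* survives.

```python
# Hypothetical subset of the German stopword list, for illustration only.
# With NLTK: stop_words = set(nltk.corpus.stopwords.words("german"))
SAMPLE_STOPWORDS = {
    "unter", "war", "im", "dieses", "der", "die",
    "am", "aber", "nicht", "für", "wie",
}


def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Keep only tokens that are not in the stopword set (case-sensitive)."""
    return [t for t in tokens if t not in stop_words]


sentence = "Krimi-Serie Im Tatort verteilt der Entführer 500000 Euro am Alex"
print(" ".join(remove_stopwords(sentence.split())))
```

Lowercasing each token before the lookup would also remove sentence-initial stopwords, at the cost of diverging from the output shown in this notebook.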


### Transform umlauts

German umlauts are transformed with the `-u` or `--umlauts` flag, so that they become their respective digraph representation, e.g. *ä* -> *ae*. The letter *ß* is likewise replaced by *ss*.

In [5]:
%%bash
python preprocessing.py -psu corpus/corpus.raw corpus/corpus.nouml
head -n 5 corpus/corpus.nouml

Zwei Maenner fuhren Drogeneinfluss
Das Juli Jahres
Krimi-Serie Im Tatort verteilt Entfuehrer 500000 Euro Alex
Er sagte lange
Insgesamt investierte Baugenossenschaft dafuer 2,4 Millionen Euro


2015-07-11 19:18:55,313 : INFO : preprocessing 1000 sentences
2015-07-11 19:18:55,738 : INFO : preprocessing of 1000 sentences finished!
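
The replacement itself is a fixed character mapping. A minimal sketch follows; the name `replace_umlauts` and the treatment of uppercase umlauts (*Ä* -> *Ae* and so on) are assumptions, since the example output only shows lowercase cases.

```python
import unicodedata

# Mapping from umlauts and sharp s to their digraphs.
# The uppercase entries are an assumption, not confirmed by the sample output.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def replace_umlauts(text):
    """Replace German umlauts and sharp s with their digraphs."""
    # Normalize first so that decomposed forms (a + combining diaeresis) also match.
    text = unicodedata.normalize("NFC", text)
    return text.translate(UMLAUT_MAP)


print(replace_umlauts("Zwei Männer fuhren Drogeneinfluss"))
```

Using `str.translate` handles all replacements in a single pass over the text instead of chaining several `str.replace` calls.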


### Bigram phrases

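With the `-b` or `--bigram` flag, frequent word pairs are detected and joined into single tokens. In the example corpus the words *Millionen* and *Euro* are merged into the token *Millionen_Euro*. As the commands in the following cell show, the bigram corpus is read from a separate file whose name is the target name with the suffix `.bigram` appended.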

In [6]:
%%bash
python preprocessing.py -psub corpus/corpus.raw corpus/corpus.psu
head -n 5 corpus/corpus.psu.bigram

Zwei Maenner fuhren Drogeneinfluss
Das Juli Jahres
Krimi-Serie Im Tatort verteilt Entfuehrer 500000 Euro Alex
Er sagte lange
Insgesamt investierte Baugenossenschaft dafuer 2,4 Millionen_Euro


2015-07-11 19:18:56,718 : INFO : preprocessing 1000 sentences
2015-07-11 19:18:57,144 : INFO : preprocessing of 1000 sentences finished!
2015-07-11 19:18:57,144 : INFO : train bigram phrase detector
2015-07-11 19:18:57,144 : INFO : collecting all words and their counts
2015-07-11 19:18:57,144 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-07-11 19:18:57,181 : INFO : collected 13779 word types from a corpus of 8044 words (unigram + bigrams) and 995 sentences
2015-07-11 19:18:57,181 : INFO : merging 13779 counts into Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2015-07-11 19:18:57,186 : INFO : merged Phrases<13779 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2015-07-11 19:18:57,187 : INFO : transform corpus to bigram phrases
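
The log shows that the detection is done by gensim's `Phrases` model with `min_count=5` and `threshold=10.0`. To illustrate the idea without depending on gensim, here is a simplified pure-Python sketch. It is not gensim's implementation: the function names are hypothetical, and the score `(count(a, b) - min_count) * V / (count(a) * count(b))` with `V` taken as the number of distinct unigrams is an assumption modelled on the commonly used phrase-detection score.

```python
from collections import Counter


def train_bigrams(sentences, min_count=5, threshold=10.0):
    """Return the set of adjacent word pairs whose score exceeds the threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # assumption: distinct unigrams only
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # pair is too rare to be considered
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def merge_bigrams(tokens, phrases, delimiter="_"):
    """Join detected pairs into one token, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus (made up): "Millionen Euro" occurs ten times among many rare words.
toy = [[f"w{i}x{j}" for j in range(30)] + ["Millionen", "Euro"] for i in range(10)]
phrases = train_bigrams(toy)
print(merge_bigrams(["dafuer", "2,4", "Millionen", "Euro"], phrases))
```

A pair only qualifies if it occurs at least `min_count` times and its two words appear together far more often than their individual frequencies would suggest, which is why a small corpus like the 1,000-sentence example yields few merged tokens.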
