**Topic Modeling for German Legal Text**

In this notebook we demonstrate the use of Gensim for basic topic modeling of German legal text. A similar approach serves as the backbone of our study, "[ERST: Leveraging Topic Features for Context-Aware Legal Reference Linking](https://jurix2019.oeg-upm.net/)".

We provide a resources folder with supporting content.

# Initialization 

In [1]:
!pip install nltk germalemma joblib pyLDAvis gensim




In [2]:
import nltk.data
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

First we import the relevant libraries and define the stopwords to be removed. The entries below are examples of stopwords we remove; the list needs to be carefully adapted to the dataset.



In [0]:
import os
import pickle
import re

import nltk
from germalemma import GermaLemma
from joblib import Parallel, delayed

# Start from NLTK's German stopword list and extend it with corpus-specific noise
# (English function words, URL fragments, and a tokenization artifact)
stopwords = nltk.corpus.stopwords.words('german')
stopwords.extend(["the", "of", "and", "http", "https", "their", "werden.en"])
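
Adapting the stopword list usually starts with checking which tokens occur in nearly every document. The following is a minimal sketch using only the standard library; the helper name `stopword_candidates`, the `min_doc_ratio` threshold and the toy documents are assumptions made for illustration and are not part of our pipeline.

```python
from collections import Counter

def stopword_candidates(token_docs, min_doc_ratio=0.9):
    """Return tokens that occur in at least `min_doc_ratio` of the documents."""
    doc_freq = Counter()
    for tokens in token_docs:
        # Count each token at most once per document (document frequency)
        doc_freq.update(set(t.lower() for t in tokens))
    threshold = min_doc_ratio * len(token_docs)
    return sorted(t for t, n in doc_freq.items() if n >= threshold)

# Toy documents (hypothetical), already split into tokens
toy_docs = [
    ["Der", "Kläger", "verlangt", "Schadensersatz"],
    ["Der", "Beklagte", "bestreitet", "die", "Forderung"],
    ["Der", "Senat", "weist", "die", "Revision", "zurück"],
]
candidates = stopword_candidates(toy_docs, min_doc_ratio=0.6)
print(candidates)
```

Candidates found this way still need a manual review, since a frequent legal term can be exactly the signal a topic model should keep.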


# Pre-processing

Next we walk through the documents inside a folder, saving the content and the filename of each.

Please adapt the *path* variable to suit your data location.

In some cases the encoding might also need adaptation.

In [4]:
path="examples"
docs=[]
filenames=[]

#We walk through the path (including sub-folders) and read each document
for r, d, f in os.walk(path):
  for n in f:
    #Join with the current folder r, so files in sub-folders are found as well
    with open(os.path.join(r, n), 'r', encoding='utf-8') as fi:
      docs.append(fi.read())
    filenames.append(n)

print("Files found:\n"+','.join(sorted(filenames)))


Files found:
example.txt
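
If some files are not UTF-8 encoded, one option is to try a short list of encodings in order. The sketch below assumes Windows-1252 and Latin-1 as plausible fallbacks for older German text; the helper name `read_text`, the encoding list and the sample file are illustrative assumptions.

```python
import os
import tempfile

def read_text(file_path, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn and return (text, encoding_used)."""
    for enc in encodings:
        try:
            with open(file_path, "r", encoding=enc) as fh:
                return fh.read(), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
    raise ValueError("None of the encodings could decode " + file_path)

# Demonstration with a temporary file written in Latin-1
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="latin-1") as fh:
        fh.write("Gebühr für Prospekthaftung")
    text, used = read_text(sample_path)

print(used, text)
```

Latin-1 can decode any byte sequence, so it acts as a catch-all at the end of the list; a wrong guess shows up as garbled umlauts rather than as an exception.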


We parallelize the sentence-level tokenization for a small performance improvement.


In [5]:
def tokenize(d):
  result=list(tokenizer.tokenize(d))
  return result

tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
doc_sentences=Parallel(n_jobs=10)(delayed(tokenize)(d) for d in docs)
print("Done with tokenization")

Done with tokenization


*Note:* We use a classifier-based POS tagger, pickled and provided as a linked resource.

In [0]:
from ClassifierBasedGermanTagger import ClassifierBasedGermanTagger #Provided in the resource folder
with open('nltk_german_classifier_data.pickle', 'rb') as f:
    tagger = pickle.load(f)

Next we illustrate a parallelized version of tagging. Here a timeout can be defined, together with the behavior for handling it.

At this stage, characters that would produce inaccurate splits can also be removed.

The result of the tagging will look something like: 
*('Prospekthaftung', 'NE'), ('und', 'KON')...('geschlossenen', 'ADJA'), ('Fonds.', 'NN')...*


In [7]:
verbose=False
timeout_duration=10
default=None #default result when a timeout takes place

def timeout(func, args=(), kwargs=None, timeout_duration=1, default=None):
    import signal
    class TimeoutError(Exception):
        pass
    def handler(signum, frame):
        raise TimeoutError()

    # set the timeout handler (SIGALRM is only available on Unix)
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout_duration)
    try:
        result = func(*args, **(kwargs or {}))
    except TimeoutError as exc:
        print("Timeout!")
        result = default
    finally:
        signal.alarm(0)  # cancel the pending alarm
    return result

def tag_this(doc,number,total):
  tags_doc=list()
  s_count=0
  for sentence in doc:
    s_count+=1
    if (verbose and s_count%50==0):
      print(str(s_count)+" of "+str(len(doc))+"; doc #"+str(number)+" of "+str(total))
    #We illustrate in the next line the replacement of common split characters that might cause inaccurate results.
    split_sentence=[x.replace("]", "").replace("[", "").replace("^", "").replace(",", "").replace(":", "").replace(";", "") for x in str(sentence).replace("\\n", " ").replace("\n", " ").replace("„","„ ").replace("/", " ").replace("“"," “").replace("("," (").replace(")",") ").replace("?"," ? ").replace("…"," …").split(" ")]
    #args must be a tuple, hence the trailing comma
    result=timeout(tagger.tag, args=(split_sentence,), timeout_duration=timeout_duration, default=default)
    if result is not default:#Special handling can be added here...
      tags_doc.append(result)
  return tags_doc

#Now we do part of speech tagging
tagged=Parallel(n_jobs=5)(delayed(tag_this)(doc_sentences[i],i,len(doc_sentences)) for i in range(0,len(doc_sentences)))
print("Done")



Done
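
The timeout mechanism used above can be tried in isolation. The following is a minimal sketch with assumed names (`run_with_timeout`, `TaggingTimeout`); it uses `signal.setitimer` so that fractional seconds are possible, and like the cell above it only works on Unix and in the main thread of a process.

```python
import signal
import time

class TaggingTimeout(Exception):
    """Raised by the alarm handler when the time budget is used up."""

def run_with_timeout(func, args=(), kwargs=None, seconds=1.0, default=None):
    """Run func(*args, **kwargs); return `default` if it takes longer than `seconds`."""
    def handler(signum, frame):
        raise TaggingTimeout()

    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(*args, **(kwargs or {}))
    except TaggingTimeout:
        return default
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)   # restore the previous handler

# A fast call finishes normally; a slow call is cut off and yields the default
fast = run_with_timeout(sum, args=([1, 2, 3],), seconds=1.0)
slow = run_with_timeout(time.sleep, args=(2,), seconds=0.2, default="timed out")
print(fast, slow)
```

Returning a default instead of raising keeps the parallel loop running; sentences that time out are simply skipped unless special handling is added.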


After tagging we can carry out lemmatization.

The result will look something like: 
*'Prospekthaftung', 'und',..., 'geschlossen','Fonds.'...*

In [8]:
lemmatizer=GermaLemma()
roman_numeral=re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def keep_token(word):
  #Drop short tokens, stopwords, short abbreviations, tokens starting with a digit or a non-word character, and Roman numerals
  lower=word.lower()
  return (len(word)>2 and lower not in stopwords
          and not re.match(r".{1,3}\.", lower)
          and not re.match(r"[0-9]", lower)
          and not re.match(r"\W", lower)
          and not roman_numeral.match(word.upper()))

def clean_token(word):
  #Lowercase and strip surrounding quotes, one trailing period, and parentheses
  word=word.lower().strip('\"').strip('„').strip("“")
  if word.endswith('.'):
    word=word[:-1]
  return word.strip(')').strip('(')

lemmatized=[]
for doc in tagged:
  lemmas_doc=list()
  for sentence in doc:
    for word, tag in sentence:
      if not keep_token(word):
        continue
      cleaned=clean_token(word)
      if not str(tag).startswith(("N", "V", "ADJ", "ADV")):
        #GermaLemma only handles nouns, verbs, adjectives and adverbs; other tokens are kept as they are
        lemmas_doc.append(cleaned)
        continue
      try:
        lemma=lemmatizer.find_lemma(cleaned, str(tag))
      except Exception:
        print(str((word, tag))+ ", gave an error!")
        continue
      if lemma not in stopwords:
        lemmas_doc.append(lemma.lower())
  lemmatized.append(lemmas_doc)
print("Done")

Done
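
To see what the token filter in the lemmatization step lets through, the same conditions can be run on a handful of tokens. This is a sketch with a tiny hypothetical stopword set (`TOY_STOPWORDS`) in place of the NLTK list, and a helper name (`keep`) chosen for illustration.

```python
import re

ROMAN = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
TOY_STOPWORDS = {"und", "der", "die"}  # hypothetical; the notebook uses the NLTK list

def keep(token):
    """Apply the same filter conditions as the lemmatization cell."""
    lower = token.lower()
    return bool(
        len(token) > 2
        and lower not in TOY_STOPWORDS
        and not re.match(r".{1,3}\.", lower)  # short abbreviations such as "abs."
        and not re.match(r"[0-9]", lower)     # tokens starting with a digit
        and not re.match(r"\W", lower)        # tokens starting with punctuation
        and not ROMAN.match(token.upper())    # Roman numerals such as "XII"
    )

tokens = ["Prospekthaftung", "und", "Abs.", "2019", "§", "XII", "(vgl.", "Fonds.", "BGB"]
kept = [t for t in tokens if keep(t)]
print(kept)
```

Note that the Roman numeral check is case-insensitive, so ordinary words that happen to be valid numerals (for example "mix") are dropped as well; whether that is acceptable depends on the corpus.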


Next we save the result in a CSV file, such that each line is a document. This concludes the pre-processing stage.

In [0]:
#Write one quoted line per document, with the lemmas separated by spaces
with open("corpus_lemmatized.csv", "w", encoding="utf-8") as f:
  for doc in lemmatized:
    f.write("\"" + " ".join(doc) + "\"\n")
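
A file in this one-document-per-line layout can be read back into token lists. The sketch below uses the standard `csv` module for both directions so that quoting is handled consistently; the file name `corpus_demo.csv` and the toy data are assumptions for illustration.

```python
import csv
import os
import tempfile

# Toy stand-in for the `lemmatized` list of token lists
lemmatized_demo = [
    ["prospekthaftung", "fonds"],
    ["kläger", "schadensersatz", "verlangen"],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "corpus_demo.csv")
    # Write: one quoted field per line, tokens joined by single spaces
    with open(csv_path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, quoting=csv.QUOTE_ALL)
        for doc in lemmatized_demo:
            writer.writerow([" ".join(doc)])
    # Read back: split each field on whitespace to recover the token lists
    with open(csv_path, "r", encoding="utf-8", newline="") as fh:
        restored = [row[0].split() for row in csv.reader(fh)]

print(restored == lemmatized_demo)
```

Storing the lemmatized corpus this way lets the topic modeling section be rerun without repeating tagging and lemmatization.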

# Topic Modeling with Gensim

In this section we illustrate the process of topic modeling with [Gensim](https://radimrehurek.com/gensim/).

In [0]:
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models.ldamulticore import LdaMulticore

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # required for prepare(); newer pyLDAvis releases rename this module to pyLDAvis.gensim_models

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [11]:
#lemmatized is the in-memory list of token lists produced during pre-processing (one list per document)

# Create dictionary and corpus
id2word = corpora.Dictionary(lemmatized)
corpus = [id2word.doc2bow(text) for text in lemmatized]

lda_model = LdaMulticore(corpus=corpus, id2word=id2word,
                         num_topics=30, random_state=100,
                         #update_every=1,
                         chunksize=100, passes=10, #alpha='auto',
                         eta='auto', iterations=200, per_word_topics=True, workers=10)


print("Done")
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # per-word likelihood bound; Gensim derives perplexity as 2^(-bound), so a higher (less negative) bound is better.

coherence_model_lda = CoherenceModel(model=lda_model, texts=lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Done

Perplexity:  -5.560813559727236

Coherence Score:  0.9999999999999998
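
The value printed as "Perplexity" is the per-word likelihood bound returned by `log_perplexity`, not the perplexity itself. Gensim logs the perplexity as 2^(-bound), which can be reproduced by hand. A minimal sketch using the bound from the output above (the variable names are chosen for illustration):

```python
# Per-word likelihood bound as printed by lda_model.log_perplexity(corpus)
bound = -5.560813559727236

# Perplexity as logged by Gensim: 2 ** (-bound); lower is better
perplexity = 2 ** (-bound)
print(round(perplexity, 1))
```

For this single-document example the perplexity comes out at about 47. With only one document, neither this value nor the coherence score of almost 1.0 says much about model quality.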


In [12]:
import pickle
#Use context managers so that each file is flushed and closed after writing
with open("lda_model.p", 'wb') as f:
    pickle.dump(lda_model, f)
print("Dumped model to file: lda_model.p")

with open("id2word.p", 'wb') as f:
    pickle.dump(id2word, f)
print("Dumped id2word to file: id2word.p")

with open("corpus.p", 'wb') as f:
    pickle.dump(corpus, f)
print("Dumped corpus to file: corpus.p")

Dumped model to file: lda_model.p
Dumped id2word to file: id2word.p
Dumped corpus to file: corpus.p


In [0]:
with open('lda_model.p', 'rb') as f:
    lda_model = pickle.load(f)

with open('id2word.p', 'rb') as f:
    id2word = pickle.load(f)

with open('corpus.p', 'rb') as f:
    corpus = pickle.load(f)

Next we use the LDA visualization libraries ([LDAvis](https://github.com/cpsievert/LDAvis), [pyLDAvis](https://github.com/bmabey/pyLDAvis)) for a simple visualization of the topics and keywords.

*Please note:*


*   Deprecation warnings might show.
*   With the simple example provided, it is possible to run into an error related to missing data, which produces complex data types that cannot be serialized to the JSON format used by the LDAvis library.
*   Running this notebook on a configuration different from the one that produced the pickled data might lead to data type errors.
*   We share example pickled files from our study [here](https://drive.google.com/open?id=1RUTGL_kx3oW4kT-gbV5zrVX3b7KBCMwp), and we include as resources the generated topic visualizations for our dataset.
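
The JSON serialization problem mentioned in the notes can be reproduced without any model. The sketch below only illustrates the class of error and one possible workaround, not the internals of the visualization library; the `to_json_safe` helper, the toy `coords` dictionary and the choice to drop the imaginary part are assumptions.

```python
import json

def to_json_safe(value):
    """Fallback for json.dumps: keep only the real part of complex numbers."""
    if isinstance(value, complex):
        return value.real
    raise TypeError("Cannot serialize " + type(value).__name__)

# Toy stand-in for coordinates that ended up as complex numbers
coords = {"x": complex(0.5, 0.0), "y": 0.25}

try:
    json.dumps(coords)
    failed = False
except TypeError as err:
    failed = True
    print("Plain json.dumps fails:", err)

# With the fallback, the same data can be serialized
encoded = json.dumps(coords, default=to_json_safe)
print(encoded)
```

Dropping the imaginary part hides the symptom only; with a very small corpus, the more robust fix is to train on more documents.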

 


In [0]:
visualisation = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
pyLDAvis.save_html(visualisation, "visualization.html")
print("Done")


For topic modeling with a workflow similar to the one described in this notebook, but using new datasets, considerable further study would be required (e.g. as suggested [here](https://github.com/trinker/topicmodels_learning)).