# Wikipedia Corpus

Corpus from: https://dumps.wikimedia.org/dewiki/20200820/

Sentences for comparison from: https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

In [1]:
# imports
import xml.etree.ElementTree as ET
from collections import Counter
import os
import pprint
import gensim
from gensim import corpora
from gensim import models
from gensim import similarities
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models import LdaMulticore
import nltk
from nltk.corpus import stopwords
from smart_open import open 
import spacy
import de_core_news_md

In [2]:
number_objects = 2000
number_corpus_object = 500

## Preprocessing:

1. Load the spaCy language model "de_core_news_md".
2. Define the function "preprocess_text", which processes the input "text".
3. Call the language model on "text". The model converts "text" into tokens -> "prep_text".
4. Loop over the tokens of "prep_text" and collect the results in "prep_tokens":
    4.1 Skip stop words and punctuation
    4.2 Lemmatize the remaining tokens
    4.3 Convert them to lowercase
    4.4 Keep only alphabetic tokens
5. Return the tokens of the input "text" without stop words, lemmatized and in lowercase.

In [3]:
# load the language model from spacy
spacy_data = de_core_news_md.load()

def preprocess_text(text):
    # load and tokenize text
    prep_text = spacy_data(text)
    # list for tokens
    prep_tokens = []
    # for every token in text
    for token in prep_text:
        # skip stop words and punctuation
        if token.pos_ != 'PUNCT' and not token.is_stop:
            # lemmatize and transform to lowercase
            lemma_token = token.lemma_.lower()
            # keep alphabetic tokens and the '-PRON-' placeholder lemma (already lowercased here)
            if lemma_token.isalpha() or lemma_token == '-pron-':
                prep_tokens.append(lemma_token)
    # return preprocessed text 
    return prep_tokens
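
The filtering steps can be tried without loading a spaCy model. The following is a minimal sketch under stated assumptions: `FakeToken` and `filter_tokens` are hypothetical names, and the token attributes are written by hand rather than produced by "de_core_news_md". Only the attribute names `lemma_`, `pos_` and `is_stop` are taken from the function above.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes used by the filter.
FakeToken = namedtuple("FakeToken", ["lemma_", "pos_", "is_stop"])


def filter_tokens(tokens):
    """Keep lowercased lemmas of tokens that are neither stop words nor punctuation."""
    kept = []
    for token in tokens:
        if token.pos_ == "PUNCT" or token.is_stop:
            continue
        lemma = token.lemma_.lower()
        if lemma.isalpha():  # drops numbers and mixed tokens such as "2020"
            kept.append(lemma)
    return kept


# Hand-written analysis of "Die Katzen schliefen 2020 ." (assumed, not real model output).
sentence = [
    FakeToken("der", "DET", True),
    FakeToken("Katze", "NOUN", False),
    FakeToken("schlafen", "VERB", False),
    FakeToken("2020", "NUM", False),
    FakeToken(".", "PUNCT", False),
]
print(filter_tokens(sentence))
```

Only the noun and the verb survive: the article is a stop word, the year is not alphabetic, and the full stop is punctuation.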

## Build the corpus

Create a corpus from the text contents of the XML file:

First test:

Print text from \<text>:

In [4]:
xml_file = "/Volumes/MAINZ_BB/dewiki-20200820-pages-articles-multistream.xml"

Object stream test

1. Create the class "MyCorpus"
2. The method "\_\_iter\_\_" takes "self" as input and works as a generator
3. A loop runs over all elements of the XML file and reports "start" and "end" events.
    3.1 Each document is the text between the "text" tags of the XML file
    3.2 Yield every object found from the generator
    3.3 Clear the processed object

Now define the corpus:

1. Create the class "MyCorpus_small"
2. The method "\_\_iter\_\_" takes "self" as input and works as a generator
3. A loop runs over all elements of the XML file and reports "start" and "end" events.
    3.1 Each document is the text between the "text" tags of the XML file
    3.2 Process only the first number_corpus_object objects
    3.3 Yield every object found from the generator
    3.4 Increment the index
    3.5 Clear the processed object

In [5]:
# Define a smaller corpus containing only the first number_corpus_object documents:
class MyCorpus_small:
    def __iter__(self):
        index = 0
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):
            if index < number_corpus_object:
                # Each document is the content of a <text> element in the XML file
                if event == 'end' and "text" in elem.tag:
                    # Transform the document into a bag-of-words vector
                    yield dictionary.doc2bow(preprocess_text(elem.text))
                    index+=1
                    elem.clear()
            else:
                break
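
The streaming pattern can be tried on a small in-memory XML snippet instead of the full dump. This is a minimal sketch, not the notebook's corpus class: `SAMPLE_XML` and `iter_texts` are hypothetical, the namespace is assumed to resemble the one in the MediaWiki export, and the sketch yields raw text rather than bag-of-words vectors.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; the namespace is an assumption about the export format.
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>Erster Artikel.</text></revision></page>
  <page><title>B</title><revision><text>Zweiter Artikel.</text></revision></page>
  <page><title>C</title><revision><text>Dritter Artikel.</text></revision></page>
</mediawiki>"""


def iter_texts(source, limit):
    """Yield the content of at most `limit` <text> elements from `source`."""
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if count >= limit:
            break
        # Namespaced tags look like "{uri}text", so compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""  # empty <text> elements have text None
            count += 1
            elem.clear()  # release the content of the processed element


first_two = list(iter_texts(io.BytesIO(SAMPLE_XML), limit=2))
print(first_two)
```

Because the generator stops as soon as the limit is reached, the rest of the file is never parsed, which is the reason for streaming a dump of this size in the first place.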

In [6]:
corpus_small = MyCorpus_small()

---

Get the texts from the XML file.
The variable number_objects sets how many articles are loaded.

In [7]:
%%time
text_ids = {}
texts = []
index = 0
for event, elem in ET.iterparse(xml_file, events = ("start", "end")):        
    if index < number_objects:
        if event == 'end' and "text" in elem.tag:
            text_ids[index]=str(elem.text)
            index += 1  
            texts.append(str(elem.text))
            elem.clear()
    else:
        break


CPU times: user 451 ms, sys: 30.6 ms, total: 481 ms
Wall time: 481 ms


---

## Build the Dictionary

def build_dictionary(xml_file):
    # start with an empty dictionary and add one document at a time
    dictionary = Dictionary()
    index = 0
    for event, elem in ET.iterparse(xml_file, events = ("start", "end")):
        if index < number_objects:
            if event == "end" and "text" in elem.tag:
                dictionary.add_documents([preprocess_text(elem.text)])
                index += 1
                elem.clear()
        else:
            break
    return dictionary

In [8]:
#load the dictionary
dictionary = Dictionary.load('data/wiki.dict')

In [9]:
print(dictionary)

Dictionary(20309 unique tokens: ['abc', 'abkehr', 'ablehnen', 'abrufen', 'abschluss']...)
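
To illustrate what the dictionary is used for, the sketch below mimics the idea behind a token-to-id mapping and a bag-of-words conversion with the standard library only. It is a stand-in under stated assumptions, not gensim's implementation: `build_token2id` and `doc_to_bow` are hypothetical helpers, and the ids gensim assigns may differ from the first-appearance order used here.

```python
from collections import Counter


def build_token2id(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Two tiny, already preprocessed documents (assumed example data).
docs = [["katze", "hund", "katze"], ["hund", "maus"]]
token2id = build_token2id(docs)

# "vogel" is not in the mapping and is therefore dropped.
bow = doc_to_bow(["katze", "katze", "vogel", "maus"], token2id)
print(token2id)
print(bow)
```

Each document becomes a short list of (id, count) pairs, which is the sparse representation the LDA model is trained on.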


---

## Similarity with LDA (Latent Dirichlet Allocation)

### Train the LDA model

Parameters:
* corpus: the corpus
* num_topics: number of topics to be extracted from the training corpus
* id2word: mapping from token ids to words, here the dictionary
* workers: number of worker processes used for training

Training does not yet work with the full streamed corpus. As a (hopefully) temporary workaround, test_corpus contains the first 200 documents, manually added to a list.

In [10]:
%%time
lda = LdaMulticore(corpus_small, num_topics=number_objects, id2word=dictionary)

CPU times: user 11min 29s, sys: 1min 9s, total: 12min 39s
Wall time: 11min 46s


First experiments have shown that 10 topics are too few; 100 topics gave a better distinction between the articles.
__Further fine-tuning is needed here.__

In [11]:
# save the trained LDA model
lda.save("data/index_wiki.txt")

In [12]:
# load the trained LDA model
lda = LdaModel.load("data/index_wiki.txt")

In [13]:
%%time
# the LDA vectors have one dimension per topic, so num_features is the number of topics
corpus_index = similarities.MatrixSimilarity(list(lda[corpus_small]), num_features=lda.num_topics)

CPU times: user 6min 39s, sys: 6min 41s, total: 13min 20s
Wall time: 5min 45s


In [14]:
test_document = texts[23]

In [15]:
test_doc_raw = test_document
test_vec = dictionary.doc2bow(preprocess_text(test_doc_raw))
#print(test_vec)
# convert to lda space
test_vec_lda = lda[test_vec]
#print(test_vec_lda)

In [16]:
sims = corpus_index[test_vec_lda]
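
Each entry of sims is the cosine similarity between the query vector and one indexed document. The sketch below shows that measure on sparse (index, weight) vectors using only the standard library. It is a simplified illustration with hypothetical topic distributions, not the code path gensim uses internally.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (index, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(index, 0.0) for index, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical topic distributions: (topic_id, probability) pairs.
query = [(0, 0.8), (3, 0.2)]
indexed_docs = [
    [(0, 0.8), (3, 0.2)],  # identical to the query
    [(1, 1.0)],            # no topic in common
    [(0, 0.5), (3, 0.5)],  # partial overlap
]
scores = [cosine_similarity(query, doc) for doc in indexed_docs]
print([round(s, 2) for s in scores])
```

A document with the same topic mixture scores close to 1, one without any shared topic scores 0, so a threshold such as 0.75 selects only documents whose topic distribution is close to that of the query.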

<h2><b>Result of the plagiarism check</b></h2>

In [17]:
hits = 0
# report every indexed document whose similarity reaches the threshold
for doc_id, score in enumerate(sims):
    if score >= 0.75:
        hits += 1
        print("Similarity of ","%.2f" %(score*100),"%","\n","Document ID:",doc_id,texts[doc_id],"\n", "------------------------------------")
print(hits, "plagiarism cases found")

Similarity of  100.00 % 
 Document ID: 8 __NOTOC__

{{SubTOC|Titel=Liste von Autoren|Index=Liste von Autoren}}

== Aa ==
: [[Bertus Aafjes]] (1914–1993), NL
: [[Jeppe Aakjær]] (1866–1930), DK
: [[Johannes Aal]] (um 1500–1551), CH
: [[Hans Aanrud]] (1863–1953), NO
: [[Emil Aarestrup]] (1800–1856), DK
: [[Soazig Aaron]] (* 1949), FR
: [[Ivar Aasen]] (1813–1896), NO

== Ab ==
: [[Petrus Abaelardus]] (1079–1142), FR
: [[Sait Faik Abasıyanık]] (1906–1954), TR
: [[Lynn Abbey]] (* 1948), US
: [[Jacob Abbott]] (1803–1879), US
: [[John Stevens Cabot Abbott]] (1805–1877), US
: [[Abdullah bin Abdul Kadir]] (1795–1852), MAL
: [[Abe Kōbō]] (1924–1993), JP
: [[Rebecca Abe]] (* 1967), D
: [[Hans Karl Abel]] (1876–1951), D
: [[Curt Abel-Musgrave]] (1860–1938), D, GB, USA
: [[Matthias Abele von und zu Lilienberg]] (1616/1618–1677), AT
: [[Joe Abercrombie]] (* 1974), GB
: [[Walter Abish]] (* 1931), US
: [[Hermann Able]] (1930–2013), D
: [[Dan Abnett]] (* 1965), GB
: [[Abraham a Sancta Clara]] (164