# Wikipedia Corpus

Corpus from: https://dumps.wikimedia.org/dewiki/20200820/

Sentences for comparison from: https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

In [1]:
# imports
import xml.etree.ElementTree as ET
from collections import Counter
import os
import pprint
import gensim
from gensim import corpora
from gensim import models
from gensim import similarities
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models import LdaMulticore
import nltk
from nltk.corpus import stopwords
from smart_open import open 
import spacy
import de_core_news_md


from functions import *

### Global Variables

In [2]:
# the XML-file
xml_file = "data/wiki_corpus/dewiki-20200820-pages-articles-multistream.xml"

# number of documents to parse 
num_documents = 200


## Preprocessing

To look up the title of an article later on, we first store the titles in a dictionary keyed by document ID:

In [3]:
title_ids = get_titles(xml_file)
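
`get_titles` comes from the local `functions` module and is not shown in this notebook. The following is only a minimal sketch of how such a lookup could be built, assuming it maps the running index of each `<title>` element to its text. The helper name `collect_titles` and the toy XML are hypothetical; tags in the real dump are namespace-qualified, which is why the check uses `endswith`.

```python
import io
import xml.etree.ElementTree as ET

def collect_titles(source, limit=None):
    """Map the running index of each <title> element to its text (hypothetical helper)."""
    titles = {}
    for _, elem in ET.iterparse(source, events=("end",)):
        # tags in the real dump look like "{namespace}title"
        if elem.tag.endswith("title"):
            titles[len(titles)] = elem.text
            if limit is not None and len(titles) >= limit:
                break
        # free the finished node so memory use stays flat
        elem.clear()
    return titles

# toy stand-in for the Wikipedia dump
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>Ang Lee</title><revision><text>...</text></revision></page>"
    b"<page><title>Anime</title><revision><text>...</text></revision></page>"
    b"</mediawiki>"
)
print(collect_titles(sample))
```

The sketch relies on every page carrying exactly one title and one text node, so that title indices line up with the document indices used in the similarity check.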

## Build the corpus

Create a corpus from the text contents of the XML file.

1. The corpus is defined as a class, so it can be iterated whenever it is needed.
2. It loops through the XML file, looking for closing "text" tags.
3. It yields the text content of each of these nodes in preprocessed, bag-of-words form.
4. It then clears the current node from memory.

In [9]:
# Define the corpus as an iterable object that streams the XML file
class MyCorpus:
    def __iter__(self):
        # parse the XML tree incrementally; only closing tags are of interest
        for event, elem in ET.iterparse(xml_file, events=("end",)):
            # each document is the content of a <text> element in the XML file
            if "text" in elem.tag:
                # transform the document into a bag-of-words vector
                yield dictionary.doc2bow(preprocess_text(elem.text))
                # clear the node to free memory
                elem.clear()

Initialize the corpus. Nothing is loaded into memory at this point, because documents are only streamed when the corpus is iterated. This step is not needed when working with the smaller corpus.

In [4]:
corpus = MyCorpus()

The whole corpus is too big for this experiment and takes too long to parse. For our proof-of-concept approach we therefore define a second corpus class that only iterates over the first `num_documents` documents (text nodes) in the XML tree:

In [4]:
# Define a smaller corpus, containing only the first num_documents documents:
class MyCorpus_small:
    def __iter__(self):
        index = 0
        # parse the XML tree incrementally; only closing tags are of interest
        for event, elem in ET.iterparse(xml_file, events=("end",)):
            # stop once enough documents have been yielded
            if index >= num_documents:
                break
            # each document is the content of a <text> element in the XML file
            if "text" in elem.tag:
                # transform the document into a bag-of-words vector
                yield dictionary.doc2bow(preprocess_text(elem.text))
                index += 1
                # clear the node to free memory
                elem.clear()
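
As an alternative to counting inside the loop, the early exit could also be expressed with `itertools.islice` over a plain generator. This is a sketch on toy in-memory XML, not the notebook's implementation: `iter_texts` is a hypothetical name, and in the notebook each text would additionally pass through `preprocess_text` and `dictionary.doc2bow`.

```python
import io
import itertools
import xml.etree.ElementTree as ET

def iter_texts(source):
    """Yield the content of every <text> element, freeing each node afterwards."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag.endswith("text"):
            # empty text nodes have elem.text == None
            yield elem.text or ""
            elem.clear()

# toy stand-in for the Wikipedia dump
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><text>eins</text></page>"
    b"<page><text>zwei</text></page>"
    b"<page><text>drei</text></page>"
    b"</mediawiki>"
)

# take only the first two documents; the third is never yielded
first_two = list(itertools.islice(iter_texts(sample), 2))
print(first_two)
```

With this pattern a single generator serves both the full and the truncated corpus, at the cost of wrapping it in a class again if the corpus has to be iterated more than once.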

Initialize the smaller corpus, again without loading it into memory:

In [5]:
corpus_small = MyCorpus_small()

---

## Build the Dictionary

To work with the corpus in vector form, we need to build a dictionary that maps every token to an integer ID.

This only needs to be done once, since the resulting dictionary can be saved and loaded again in later runs.

__DO NOT RUN THE FOLLOWING CODE IF THE DICTIONARY CAN BE LOADED FROM A FILE__

In [42]:
%%time
# build the dictionary:
dictionary = build_dictionary(xml_file)

Wall time: 2min 12s


In [43]:
%%time
# remove words that appear only once
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(once_ids)
# remove gaps in id sequence after words that were removed
dictionary.compactify()

Wall time: 43 ms
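
To make the filtering step concrete, here is a plain-Python illustration of the same idea on toy data: count in how many documents each token occurs, drop tokens seen in only one document, and renumber the remaining IDs without gaps. It is not gensim's implementation, and the toy documents and the helper `to_bow` are assumptions made for this sketch.

```python
from collections import Counter

# toy, already preprocessed documents
docs = [
    ["specht", "vogel", "wald"],
    ["vogel", "wald", "gefieder"],
    ["specht", "vogel"],
]

# assign ids in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# document frequency: number of documents each token occurs in
doc_freq = Counter(token for doc in docs for token in set(doc))

# drop tokens that occur in a single document, then renumber without gaps
kept = [token for token in token2id if doc_freq[token] > 1]
token2id = {token: new_id for new_id, token in enumerate(kept)}

def to_bow(doc):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

print(token2id)
print(to_bow(["vogel", "vogel", "gefieder", "wald"]))
```

Tokens that were filtered out ("gefieder" here) simply vanish from later bag-of-words vectors, which is the intended effect of removing one-off words.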


In [44]:
#save the dictionary
dictionary.save('data/wiki_200_new.dict')

__CONTINUE HERE TO LOAD THE DICTIONARY__

In [6]:
#load the dictionary
dictionary = Dictionary.load('data/wiki_200_new.dict')

In [6]:
# check if the dictionary has been loaded 
print(dictionary)

Dictionary(20308 unique tokens: ['abc', 'abkehr', 'ablehnen', 'abrufen', 'abschluss']...)


---

## Similarity with LDA (Latent Dirichlet Allocation)

### Train the LDA model

Parameters:
* corpus: the corpus
* num_topics: number of topics to be extracted from the training corpus
* id2word: mapping from ID to word, i.e. the dictionary
* workers: number of worker processes used for training (not set below, so gensim's default applies)

Like the dictionary, the trained model can be saved to disk and loaded again.

In [9]:
%%time
lda = LdaMulticore(corpus_small, num_topics=200, id2word=dictionary)

Wall time: 4min 30s


First experiments have shown that 10 topics are too few; 100 topics (gensim's default for `num_topics`) resulted in a better distinction between the different articles. The model above is trained with 200 topics.
__Further fine-tuning is needed here.__

In [10]:
#save the trained model
lda.save("data/lda_model_200_t200.txt")

In [7]:
#load the trained model
lda = LdaModel.load("data/lda_model_200_t200.txt")

Index the corpus with the trained model:

In [8]:
%%time
# the indexed vectors live in LDA topic space, so one feature per topic is sufficient
corpus_index = similarities.MatrixSimilarity(list(lda[corpus_small]), num_features=lda.num_topics)

Wall time: 2min 14s


In [31]:
#save the index
corpus_index.save("data/lda_index_200_t200.txt")

In [9]:
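#load the index from disk (load is a classmethod and returns the index, so assign the result)
corpus_index = similarities.MatrixSimilarity.load("data/lda_index_200_t200.txt")
corpus_index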

<gensim.similarities.docsim.MatrixSimilarity at 0x18173db0130>

## Similarity Check

Now that we have an LDA model and an index, we can check the similarity of an input document against all documents in our corpus.
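
`MatrixSimilarity` scores documents by cosine similarity. As a rough plain-Python illustration of what that means for sparse topic vectors (the topic IDs and weights below are made up, and `cosine` is a hypothetical helper, not part of gensim):

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (topic_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# hypothetical topic distributions
query = [(3, 0.8), (17, 0.2)]
corpus_topics = [
    [(3, 0.6), (17, 0.4)],  # shares both topics with the query
    [(5, 1.0)],             # shares no topic with the query
]

scores = [cosine(query, doc) for doc in corpus_topics]
print(scores)
```

A document that shares no topic with the query scores exactly 0.0, which is why many entries in the similarity list further down are zero.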

In [11]:
# read the document to use in the similarity check;
# 'utf-8-sig' strips a leading byte order mark if the file has one
with open('data/wikibeispiele.txt', encoding='utf-8-sig') as f:
    test_document = f.read()

In [12]:
print(test_document)

﻿Der Kleinspecht (Dryobates minor, Syn.: Dendrocopos minor) ist eine Vogelart aus der Gattung der Buntspechte (Dendrocopos). Diese gehören zur Unterfamilie der Echten Spechte in der Familie der Spechte (Picidae).
Die Art zählt mit einer Körperlänge von rund 15 cm zu den kleinsten Echten Spechten. Sie ist in 11 Unterarten über die gesamte westliche und nördliche Paläarktis bis an die asiatische Pazifikküste verbreitet.
Der Kleinspecht ist ein typischer Vertreter der Buntspechte mit schwarz-weiß kontrastierendem Gefieder, trotzdem ist er in der West- und Zentralpaläarktis auf Grund seiner Kleinheit unverwechselbar
Beide Geschlechter des Kleinspechtes sind fast während des gesamten Jahres sehr ruffreudig
Der Höhepunkt der gesanglichen Aktivität liegt jedoch im Spätwinter und im zeitigen Frühjahr
Die dichteste Verbreitung liegt in der planaren und collinen Stufe. Bedeutend seltener brüten Kleinspechte in Mitteleuropa in höhergelegenen Gebieten.
Er bevorzugt Waldgebiete und Gehölze mit eine

In [13]:
# transform the document to vector space
test_vec = dictionary.doc2bow(preprocess_text(test_document))
# convert to lda space
test_vec_lda = lda[test_vec]

In [14]:
# get the similarities
sims = corpus_index[test_vec_lda]

In [15]:
# show the similarity score for every document ID
print(list(enumerate(sims)))

[(0, 0.0), (1, 0.0), (2, 0.57573116), (3, 0.0), (4, 0.0), (5, 0.032051582), (6, 0.0), (7, 0.0), (8, 0.048284315), (9, 0.06864172), (10, 0.068411574), (11, 0.019717814), (12, 0.070938125), (13, 0.072628155), (14, 0.0), (15, 0.028425444), (16, 0.21034276), (17, 0.017979994), (18, 0.0), (19, 0.09790744), (20, 0.06591607), (21, 0.056166682), (22, 0.01733437), (23, 0.051073823), (24, 0.0035758216), (25, 0.072429344), (26, 0.0), (27, 0.072628155), (28, 0.019298662), (29, 0.072628155), (30, 0.04496612), (31, 0.037945636), (32, 0.036152083), (33, 0.0), (34, 0.45968515), (35, 0.0), (36, 0.10352524), (37, 0.014574563), (38, 0.26127774), (39, 0.0), (40, 0.0), (41, 0.56370336), (42, 0.0011630558), (43, 0.08237219), (44, 0.0), (45, 0.6886529), (46, 0.0), (47, 0.5062068), (48, 0.0), (49, 0.0), (50, 0.1388013), (51, 0.56422883), (52, 0.0), (53, 0.5511428), (54, 0.0), (55, 0.0), (56, 0.0013991656), (57, 0.041555025), (58, 0.09268699), (59, 0.3270393), (60, 0.0010350089), (61, 0.3014194), (62, 0.652655

## Results

In [17]:
hits = 0
# report every document whose similarity score reaches the 0.5 threshold
for doc_id, score in enumerate(sims):
    if score >= 0.5:
        hits += 1
        title = title_ids.get(doc_id)
        print("Similarity Score: ", score, "\n", "Document ID:", doc_id, "\n", "Title:", title, "\n", "------------------------------------")
print(hits, "cases of possible plagiarism detected.")

Similarity Score:  0.57573116 
 Document ID: 2 
 Title: Ang Lee 
 ------------------------------------
Similarity Score:  0.56370336 
 Document ID: 41 
 Title: Alfred Hitchcock 
 ------------------------------------
Similarity Score:  0.6886529 
 Document ID: 45 
 Title: Anime 
 ------------------------------------
Similarity Score:  0.5062068 
 Document ID: 47 
 Title: Al Pacino 
 ------------------------------------
Similarity Score:  0.56422883 
 Document ID: 51 
 Title: Antimon 
 ------------------------------------
Similarity Score:  0.5511428 
 Document ID: 53 
 Title: Arsen 
 ------------------------------------
Similarity Score:  0.6526557 
 Document ID: 62 
 Title: Antike 
 ------------------------------------
Similarity Score:  0.51825005 
 Document ID: 77 
 Title: Außenbandruptur des oberen Sprunggelenkes 
 ------------------------------------
Similarity Score:  0.6429161 
 Document ID: 84 
 Title: Aristoteles 
 ------------------------------------
Similarity Score:  0.50620