# Wikipedia Corpus

Corpus from: https://dumps.wikimedia.org/dewiki/20200820/

Sentences for comparison from: https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

In [3]:
#imports
from xml.etree.ElementTree import *
import xml.etree.ElementTree as ET
from collections import Counter
import os
import pprint
import gensim
from gensim import corpora
from gensim import models
from gensim import similarities
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models import LdaMulticore
import nltk
from nltk.corpus import stopwords
from smart_open import open 
import spacy
import de_core_news_md
import pickle
import numpy as np

from ipywidgets import FileUpload
from IPython.display import display
from IPython.core.display import display, HTML


from functions import *

### Global Variables

In [2]:
# the XML-file
xml_file = "/data/dewiki-20200820-pages-articles-multistream.xml"

# number of documents to parse 
num_documents = 200

# similarity threshold, when does a document count as plagiarism
sim_threshold = 0.2

## Preprocessing

To be able to return the title of a given article later on, we need to store those in a dictionary:

In [3]:
# load the index from disk
title_ids = pickle.load(open("data/title_ids200.pickle", "rb"))

## Build the corpus

Create a corpus from the text contents of the XML file.

1. Corpus is defined as a class object, so it can be called when needed.
2. Loops through the XML-file, searching for closing "text" tags.
3. Returns the text contents from these nodes in preprocessed form.
4. Then clears the current node from memory

Initialize the corpus, without loading it into memory, this is not needed when working with the smaller corpus.

The whole corpus is too big for this experiment and takes too long to parse through. For our proof-of-concept approach we therefore propose a function which only loops through the first i documents (text nodes) in the XML tree:

In [4]:
# Define a smaller corpus, containing only the first i documents:
class MyCorpus_small:
    def __iter__(self):
        index = 0
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):
            if index < num_documents:
                # Each document is represented as an object between <text> tags in the xml file
                if event == 'end' and "text" in elem.tag:
                    # Transfom the corpus to vectors
                    yield dictionary.doc2bow(preprocess_text(elem.text))
                    index+=1
                    # clear the node
                    elem.clear()
            else:
                break    

Initialize the smaller corpus, again without loading it into memory:

In [5]:
corpus_small = MyCorpus_small()

---

## Build the Dictionary

To further work with the corpus in vector form, we need to build a dictionary. 

This function needs to be called only once, since we are able to save the dictionary created by it and load it in future use.

__DO NOT RUN THE FOLLOWING CODE IF THE DICTIONARY CAN BE LOADED FROM A FILE__

__CONTINUE HERE TO LOAD THE DICTIONARY__

In [6]:
#load the dictionary
dictionary = Dictionary.load('data/wiki_200_new.dict')

In [7]:
# check if the dictionary has been loaded 
print(dictionary)

Dictionary(20308 unique tokens: ['abc', 'abkehr', 'ablehnen', 'abrufen', 'abschluss']...)


---

## Similarity with LDA (Latent Dirichlet Allocation)

### Train the LDA model

Parameters:
* corpus: the corpus
* num_topics: topics to be extracted from the training corpus
* id2word: id to word mapping, the dictionary
* workers: number of cpu cores used

The trained model can be stored and loaded, as same as the dictionary before.

In [6]:
# define document to use in similarity check
test_document = open('beispieltexte/AlPacino.txt', encoding='utf-8')
document_name = '"'+os.path.basename(test_document.name)+'"'
test_document = test_document.read()

In [7]:
len(test_document)

1632

In [10]:
#pseudo funktion zum errechnen der richtigen topic zahl 
topics = int(len(test_document)/(23000/len(test_document)))

In [11]:
topics

1089

In [12]:
%%time
model = LdaMulticore(corpus_small, num_topics=topics, id2word=dictionary)

CPU times: user 4min 57s, sys: 24.9 s, total: 5min 22s
Wall time: 5min 5s


In [13]:
print(model)

LdaModel(num_terms=20308, num_topics=1089, decay=0.5, chunksize=2000)


First experiments have shown that a topic number of 10 (default) is too low. 100 resulted in better disctinction between the different articles.
__Further fine tuning needed here__

Index the corpus with the trained model:

In [14]:
%%time
corpus_index = similarities.MatrixSimilarity(list(model[corpus_small]), num_features=len(dictionary))

CPU times: user 2min 59s, sys: 3min 4s, total: 6min 4s
Wall time: 2min 30s


## Similarity Check

Now that we have a LDA model and an index we can check the similarity of an input document against all documents in our corpus.

In [15]:
print(test_document)

Gegenüber seinen Rivalen Aldi Süd und Lidl drohte Aldi Nord immer weiter zurückzufallen. „Viele Jahre hatte der Discounter zu wenig in die Modernisierung investiert und zu spät auf neue Trends reagiert.“[30] Daher führte die Aldi-Nord-Gruppe mit dem Aldi Nord Instore Konzept (ANIKo) ebenfalls ein „Modernisierungsprogramm“ durch. Insgesamt wurde die bis Anfang 2019 vorgesehene Umgestaltung der 2250 Aldi-Nord-Filialen auf 5,2 Milliarden Euro kalkuliert. Im Anschluss erfolgte die Umgestaltung der rund 2400 Märkte im europäischen Ausland. Da keine expliziten Kostensenkungsprogramme bekannt sind, werden nach Beobachtern die Umbaukosten nur durch den später erwarteten höheren Umsatz wieder ausgeglichen.[31] Das Standardsortiment von Aldi Nord lag im Geschäftsjahr 2017 bei rund 1400 Artikeln. Der Test mit diesen Albrecht-Supermärkten scheiterte, da er weder in den Ladengrößen noch in der Sortimentsvielfalt der inzwischen davongeeilten Vollsortimenter-Konkurrenz ebenbürtig war. Diese noch unte

In [16]:
# transform the document to vector space
test_vec = dictionary.doc2bow(preprocess_text(test_document))
# convert to lda space
test_vec_lda = model[test_vec]

In [17]:
# get the similarities
sims = corpus_index[test_vec_lda]

## Results

In [18]:
# creates result tags for html output
result_html = ""
cr_level =""
hits = 0
for ids in list(enumerate(sims)):
    if ids[1] >= sim_threshold and "Liste von Autoren" not in title_ids.get(ids[0]):
        hits += 1
        title = title_ids.get(ids[0])

        if ids[1] < 0.5:
            cr_level="zero"
        if ids[1] >= 0.6:
            cr_level="low"
        if ids[1] >= 0.7:
            cr_level="medium"
        if ids[1] >= 0.8:
            cr_level="higher"
        if ids[1] >= 0.9:
            cr_level="high"
        result_html = result_html+" <tr class='"+cr_level+"'><td><a href='https://de.wikipedia.org/wiki/"+title+"'>"+title+"</a></td> "+"<td>"+str(round(ids[1],2))+"</td> "+"<td>"+str(ids[0])+"</td> </tr> "

In [19]:
# html output of all results
display(HTML("""
<style>
.r_table {
  font-family: Arial;
  border-collapse: collapse;
  width: 100%;}
  
.r_table th {border: 1px solid #ddd;padding: 8px;}

.r_table th {
  font-size: 16px;
  padding-top: 12px;
  padding-bottom: 12px;
  text-align: left;
  background-color: steelblue;
  color: white;
  border: 1px solid #ddd;}
  
.r_table td {border: 1px solid #ddd;font-size: 14px; text-align:left;}

.high td{background-color: #F8E0E0;}
.higher td{background-color: #F8ECE0;}
.medium td{background-color: #F7F8E0;}
.low td{background-color: #E0F8E0;}
.zero td{background-color: white;}
</style>

<h3> The tested input """+document_name+""" has the following similarity results </h3> 
<table class="r_table">
  <tr>
    <th>Document Title</th>
    <th>Similarity Score</th> 
    <th>Document-ID</th>
  </tr>
  """+result_html+"""

</table>
<h4>"""+str(hits)+""" wikipedia documents with higher similarity found</h4> """))

Document Title,Similarity Score,Document-ID
Al Pacino,0.61,47
Angelina Jolie,0.42,82
Angela Merkel,0.52,89
Adolf Hitler,0.38,122
Ampelkoalition,0.21,126
Aldi,0.48,180


In [20]:
hit_ids = {}
hit_title =[]
for ids in list(enumerate(sims)):
    if ids[1] >= sim_threshold and "Liste von Autoren" not in title_ids.get(ids[0]):
        hit_ids[ids[0]] = ids[1]
        hit_title.append(title_ids.get(ids[0]))
hit_ids
hit_title

['Al Pacino',
 'Angelina Jolie',
 'Angela Merkel',
 'Adolf Hitler',
 'Ampelkoalition',
 'Aldi']

## Sentence Similarity

The next step would be to define all documents that were found to have a specific similarity score as a new corpus. Then we can check the similarty score for each sentence from the input document in relation to the sentences from the "new" corpus.

### Build new dictionary

In [21]:
%%time
index = 0
first_elem = True
# loop through all nodes
for event, elem in ET.iterparse(xml_file, events = ("start", "end")):        
    if index < num_documents:
        # check if current node contains a document
        if event == "end" and "text" in elem.tag:
            if index in hit_ids.keys():
                # preprocess the text
                text = preprocess_text(elem.text)
                # if this is the first document found, create a new dictionary with it
                if first_elem:
                    dictionary_hits = Dictionary([text])
                    first_elem = False
                    index += 1
                # all documents after the first one get appended to the dictionary
                else:
                    dictionary_hits.add_documents([text])
                    index += 1
                # clear the node
                elem.clear()
                
            else:
                index += 1
                elem.clear()
    else:
        break

CPU times: user 15.2 s, sys: 1.97 s, total: 17.1 s
Wall time: 16.1 s


In [22]:
len(dictionary_hits)

11651

In [23]:
# Define a smaller corpus, containing only the first i documents:
class MyCorpus_small_hits:
    def __iter__(self):
        index = 0
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):
            if index < num_documents:
                if index in hit_ids.keys():
                    # Each document is represented as an object between <text> tags in the xml file
                    if event == 'end' and "text" in elem.tag:
                        # Transfom the corpus to vectors
                        yield dictionary_hits.doc2bow(preprocess_text(elem.text))
                        index+=1
                        # clear the node
                        elem.clear()
                else:
                    index+=1  
            else:
                break  

In [24]:
corpus_small_hits = MyCorpus_small_hits()

In [85]:
#pseudo funktion zum errechnen der richtigen topic zahl 
topics_hit = int(len(test_document)/(56000/len(test_document)))

In [86]:
topics_hit

447

In [87]:
%%time
hit_model = LdaMulticore(corpus_small_hits, num_topics=topics_hit, id2word=dictionary_hits)

CPU times: user 4.99 s, sys: 981 ms, total: 5.98 s
Wall time: 5.55 s


In [88]:
print(termsim_index)

<gensim.models.keyedvectors.WordEmbeddingSimilarityIndex object at 0x153de9d30>


In [89]:
%%time
corpus_hit_index = similarities.MatrixSimilarity(list(hit_model[corpus_small_hits]), num_features=len(dictionary_hits))

CPU times: user 2.44 s, sys: 3.09 s, total: 5.53 s
Wall time: 2.06 s


In [90]:
print(corpus_hit_index)

MatrixSimilarity<6 docs, 11651 features>


In [91]:
#use nltk tokenize to slice sentences
from nltk import tokenize

#slice test document to sentences
test_doc_raw_slice = []
for split in tokenize.sent_tokenize(test_document):
    test_doc_raw_slice.append(preprocess_text(str(split)))

test_doc_raw_sentence = []
for split in tokenize.sent_tokenize(test_document):
    test_doc_raw_sentence.append(str(split))

In [92]:
sim_hits = []
for sentence in test_doc_raw_slice:
    # test doc Sätze vs hit_corpus 
    test_vec = dictionary_hits.doc2bow(sentence)
    # convert to lda space
    test_vec_lda = hit_model[test_vec]
    sim_hits.append(corpus_hit_index[test_vec_lda])

In [93]:
for elm in list(enumerate(sim_hits)):
    title = hit_title[np.argmax(elm[1])]
    
    if elm[1][np.argmax(elm[1])] > 0.80:
        print(test_doc_raw_sentence[elm[0]])
        print("aus Dokument: ", title)
        print("Übereinstimmung: ", elm[1][np.argmax(elm[1])])
        print("  ")
        print("Mehr Infos:")
        print(str(elm[1]).replace("         ", " ").replace("        ", ""))
        print("max: ", elm[1][np.argmax(elm[1])], "position: ", np.argmax(elm[1]))
        print("----------------------------------------------")

Im Anschluss erfolgte die Umgestaltung der rund 2400 Märkte im europäischen Ausland.
aus Dokument:  Angela Merkel
Übereinstimmung:  1.0
  
Mehr Infos:
[0. 0.0146749  1. 1. 0.02761932 0.]
max:  1.0 position:  2
----------------------------------------------
Diese noch unter dem roten Albrecht-Logo getesteten Märkte wurden bald wieder geschlossen bzw.
aus Dokument:  Ampelkoalition
Übereinstimmung:  0.99920845
  
Mehr Infos:
[0. 0. 0. 0. 0.99920845 0.]
max:  0.99920845 position:  4
----------------------------------------------
konnten kurze Zeit später nach Umgestaltung auf Aldi-Discount genutzt werden.
aus Dokument:  Angela Merkel
Übereinstimmung:  1.0
  
Mehr Infos:
[0. 0.0146749  1. 1. 0.02761932 0.]
max:  1.0 position:  2
----------------------------------------------
Die beiden Unternehmensgruppen sind freundschaftlich verbunden und koordinieren im Aldi-Unternehmensausschuss gemeinsam ihre Geschäftspolitik.
aus Dokument:  Angelina Jolie
Übereinstimmung:  0.99956226
  
Mehr Infos:
[0

In [94]:
# creates result tags for html output
hit_result_html = ""
hit_vis = []
hits = 0
for elm in list(enumerate(sim_hits)):
    title = hit_title[np.argmax(elm[1])]
    if elm[1][np.argmax(elm[1])] < 0.60:
        cr_level="zero"
    if elm[1][np.argmax(elm[1])] >= 0.93:
        cr_level="low"
    if elm[1][np.argmax(elm[1])] >= 0.95:
        cr_level="medium"
    if elm[1][np.argmax(elm[1])] >= 0.98:
        cr_level="higher"
    if elm[1][np.argmax(elm[1])] >= 0.99:
        cr_level="high"
    
    if cr_level=="zero":
        hit_result_html = hit_result_html+" <t class='"+cr_level+"'>"+test_doc_raw_sentence[elm[0]]+"</t> "
    else:
        hit_result_html = hit_result_html+" <t class='"+cr_level+"'>"+test_doc_raw_sentence[elm[0]]+"<b> <a href='https://de.wikipedia.org/wiki/"+title+"'>"+title+"</a></b></t>"

In [95]:
# html output of all results
display(HTML("""
<style>
.high {background-color: #F8E0E0;}
.higher {background-color: #F8ECE0;}
.medium {background-color: #F7F8E0;}
.low {background-color: #E0F8E0;}
.zero {background-color: white;}
</style>

  """+hit_result_html+""))