# Wikipedia Corpus

Corpus from: https://dumps.wikimedia.org/dewiki/20200820/

Sentences for comparison from: https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

In [131]:
# imports
from xml.etree.ElementTree import *
import xml.etree.ElementTree as ET
from collections import Counter
import os
import pprint
import gensim
from gensim import corpora
from gensim import models
from gensim import similarities
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models import LdaMulticore
import nltk
from nltk.corpus import stopwords
from smart_open import open 
import spacy
import de_core_news_md
import pickle

from ipywidgets import FileUpload
from IPython.display import display
from IPython.core.display import display, HTML

from functions import *

### Global Variables

In [5]:
# the XML-file
xml_file = "/Volumes/SSD/dewiki-20200820-pages-articles-multistream.xml"

# number of documents to parse 
num_documents = 200

## Preprocessing

To be able to return the title of a given article later on, we need to store those in a dictionary:

In [6]:
title_ids = get_titles(xml_file)

## Build the corpus

Create a corpus from the text contents of the XML file.

1. Corpus is defined as a class object, so it can be called when needed.
2. Loops through the XML-file, searching for closing "text" tags.
3. Returns the text contents from these nodes in preprocessed form.
4. Then clears the current node from memory

In [7]:
# Define the corpus as an object
class MyCorpus:
    def __iter__(self):
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):            
            # Each document is represented as an object between <text> tags in the xml file
            if event == 'end' and "text" in elem.tag:
                # Transfom the corpus to vectors
                yield dictionary.doc2bow(preprocess_text(elem.text))
                # clear the node
                elem.clear()                

Initialize the corpus, without loading it into memory, this is not needed when working with the smaller corpus.

In [8]:
corpus = MyCorpus()

The whole corpus is too big for this experiment and takes too long to parse through. For our proof-of-concept approach we therefore propose a function which only loops through the first i documents (text nodes) in the XML tree:

In [9]:
# Define a smaller corpus, containing only the first i documents:
class MyCorpus_small:
    def __iter__(self):
        index = 0
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):
            if index < num_documents:
                # Each document is represented as an object between <text> tags in the xml file
                if event == 'end' and "text" in elem.tag:
                    # Transfom the corpus to vectors
                    yield dictionary.doc2bow(preprocess_text(elem.text))
                    index+=1
                    # clear the node
                    elem.clear()
            else:
                break    

Initialize the smaller corpus, again without loading it into memory:

In [10]:
corpus_small = MyCorpus_small()

---

## Build the Dictionary

To further work with the corpus in vector form, we need to build a dictionary. 

This function needs to be called only once, since we are able to save the dictionary created by it and load it in future use.

__DO NOT RUN THE FOLLOWING CODE IF THE DICTIONARY CAN BE LOADED FROM A FILE__

__CONTINUE HERE TO LOAD THE DICTIONARY__

In [11]:
# load the dictionary
dictionary = Dictionary.load('data/wiki_200_new.dict')

In [12]:
# check if the dictionary has been loaded 
print(dictionary)

Dictionary(20308 unique tokens: ['abc', 'abkehr', 'ablehnen', 'abrufen', 'abschluss']...)


---

## Similarity with LDA (Latent Dirichlet Allocation)

### Train the LDA model

Parameters:
* corpus: the corpus
* num_topics: topics to be extracted from the training corpus
* id2word: id to word mapping, the dictionary
* workers: number of cpu cores used

The trained model can be stored and loaded, as same as the dictionary before.

First experiments have shown that a topic number of 10 (default) is too low. 100 resulted in better disctinction between the different articles.
__Further fine tuning needed here__

In [13]:
# load the trained model
lda = LdaModel.load("data/lda_model_200_t300.txt")

Index the corpus with the trained model:

In [142]:
# load the index from disk
corpus_index = pickle.load(open("data/corpus_index.pickle", "rb"))

## Similarity Check

Now that we have a LDA model and an index we can check the similarity of an input document against all documents in our corpus.

In [278]:
# upload text file for testing
upload = FileUpload(accept='.txt', multiple=False)
print("upload the text you want to check for plagiarism")
display(upload)

upload the text you want to check for plagiarism


FileUpload(value={}, accept='.txt', description='Upload')

In [287]:
if len(upload.value) == 0:
    print("no file uploaded! try again")
else :
    print("File uploaded")


File uploaded


In [288]:
# save text file for testing
with open('import/tobetested.txt', 'wb') as output_file: 
    for uploaded_filename in upload.value:
        content = upload.value[uploaded_filename]['content']   
        output_file.write(content) 

In [289]:
# define document to use in similarity check
test_document = open('import/tobetested.txt', encoding='utf-8')
document_name = '"'+os.path.basename(test_document.name)+'"'
test_document = test_document.read()

# delete imported data after correct usage 
os.remove("import/tobetested.txt")

In [290]:
# transform the document to vector space
test_vec = dictionary.doc2bow(preprocess_text(test_document))
# convert to lda space
test_vec_lda = lda[test_vec]

In [291]:
# get the similarities
sims = corpus_index[test_vec_lda]

## Results

In [292]:
# creates result tags for html output
result_html = ""
vis = []
hits = 0
for ids in list(enumerate(sims)):
    if ids[1] >= 0.1:
        hits += 1
        title = title_ids.get(ids[0])
        if ids[1] < 0.4:
            cr_level="zero"
        if ids[1] >= 0.4:
            cr_level="low"
        if ids[1] >= 0.5:
            cr_level="medium"
        if ids[1] >= 0.6:
            cr_level="higher"
        if ids[1] >= 0.75:
            cr_level="high"
        result_html = result_html+" <tr class='"+cr_level+"'><td><a href='https://de.wikipedia.org/wiki/"+title+"'>"+title+"</a></td> "+"<td>"+str(round(ids[1],2))+"</td> "+"<td>"+str(ids[0])+"</td> </tr> "
        vis.append(cr_level)

In [293]:
#werte für visualisierung
prczero = 0
prclow = 0
prcmed = 0
prcher = 0
prchig = 0

for elm in vis:
    if elm=="zero": 
        prczero+=1
    if elm=="low": 
        prclow+=1
    if elm=="medium": 
        prcmed+=1
    if elm=="higher": 
        prcher+=1
    if elm=="high": 
        prchig+=1

In [294]:
# html output of all results
display(HTML("""
<style>
.r_table {
  font-family: Arial;
  border-collapse: collapse;
  width: 100%;}
  
.r_table th {border: 1px solid #ddd;padding: 8px;}

.r_table th {
  font-size: 16px;
  padding-top: 12px;
  padding-bottom: 12px;
  text-align: left;
  background-color: steelblue;
  color: white;
  border: 1px solid #ddd;}
  
.r_table td {border: 1px solid #ddd;font-size: 14px; text-align:left;}

.high td{background-color: #F8E0E0;}
.higher td{background-color: #F8ECE0;}
.medium td{background-color: #F7F8E0;}
.low td{background-color: #E0F8E0;}
.zero td{background-color: white;}
</style>

<h3> The tested input """+document_name+""" has the following similarity results </h3> 
<table class="r_table">
  <tr>
    <th>Document Title</th>
    <th>Similarity Score</th> 
    <th>Document-ID</th>
  </tr>
  """+result_html+"""

</table>
<h4>"""+str(hits)+""" wikipedia documents with higher similarity found</h4> """))

Document Title,Similarity Score,Document-ID
Actinium,0.11,1
Ang Lee,0.2,2
Anschluss (Luhmann),0.65,3
Anschlussfähigkeit,0.65,4
Liste von Autoren/J,0.14,13
US-amerikanischer Film,0.24,34
Alfred Hitchcock,0.25,41
Anime,0.61,45
Al Pacino,0.29,47
Alkohole,0.16,48


In [267]:
display(HTML("""
<style>
#piechart {
  position: relative;
  width: 250px;
  height: 250px;
  margin-left: 100px;
  margin-top: 100px;
}
 
.piece {
  position: absolute;
  width: 250px;
  height: 250px;
  clip: rect(0px, 250px, 250px, 125px);
  border-radius: 125px;
  transition: all 0.8s ease-out;
}

.piece-inner {
    position: absolute;
	width: 250px;
	height: 250px;
	clip: rect(0px, 125px, 250px, 0px);
	border-radius: 125px;
	-webkit-backface-visibility: hidden;
    transition: all 0.8s ease-out;
}
/* Spezifische Einstellungen */
#piece1 > .piece-inner {
    background: green; 
}
#piece2 > .piece-inner {
    background: yellow; 
}
#piece3 > .piece-inner {
    background: orange;
}
#piece4 > .piece-inner {
    background: red;
}
</style>

<script>
var wert1 = """+str(prclow/len(vis)*1200)+""";
var wert2 = """+str(prcmed/len(vis)*1200)+""";
var wert3 = """+str(prcher/len(vis)*1200)+""";
var wert4 = """+str(prchig/len(vis)*1200)+""";

document.querySelector("#piece1").style.webkitTransform = "rotate(0deg)";
document.querySelector("#piece1").style.transform = "rotate(0deg)";
document.querySelector("#piece1 > .piece-inner").style.webkitTransform =
    "rotate(" + wert1 + "deg)";
document.querySelector("#piece1 > .piece-inner").style.transform = "rotate(" + wert1 + "deg)";

document.querySelector("#piece2").style.webkitTransform = "rotate(" + wert1 + "deg)";
document.querySelector("#piece2").style.transform = "rotate(" + wert1 + "deg)";
document.querySelector("#piece2 > .piece-inner").style.webkitTransform = "rotate(" + wert2 + "deg)";
document.querySelector("#piece2 > .piece-inner").style.transform = "rotate(" + wert2 + "deg)";

document.querySelector("#piece3").style.webkitTransform = "rotate(" + (wert1 + wert2) + "deg)";
document.querySelector("#piece3").style.transform = "rotate(" + (wert1 + wert2) + "deg)";
document.querySelector("#piece3 > .piece-inner").style.webkitTransform = "rotate(" + wert3 + "deg)";
document.querySelector("#piece3 > .piece-inner").style.transform = "rotate(" + wert3 + "deg)";

document.querySelector("#piece4").style.webkitTransform = "rotate(" + (wert1 + wert2 + wert3) + "deg)";
document.querySelector("#piece4").style.transform = "rotate(" + (wert1 + wert2 + wert3) + "deg)";
document.querySelector("#piece4 > .piece-inner").style.webkitTransform = "rotate(" + wert4 + "deg)";
document.querySelector("#piece4 > .piece-inner").style.transform = "rotate(" + wert4 + "deg)";
</script>

<div id="piechart">
  <div id="piece1" class="piece">
      <div class="piece-inner"></div>      
  </div>
  <div id="piece2" class="piece">
      <div class="piece-inner"></div>
  </div>
  <div id="piece3" class="piece">
      <div class="piece-inner"></div>
  </div>
  <div id="piece4" class="piece">
      <div class="piece-inner"></div>
  </div>
</div>
"""))

# Sentence similarity

In [186]:
%%time
text_ids = {}
texts = []
index = 0
for event, elem in ET.iterparse(xml_file, events = ("start", "end")):        
    if index < 200:
        if event == 'end' and "text" in elem.tag:
            text_ids[index]=str(elem.text)
            index += 1  
            texts.append(str(elem.text))
            elem.clear()
    else:
        break


CPU times: user 72.2 ms, sys: 12.1 ms, total: 84.3 ms
Wall time: 87 ms


In [187]:
# define hit-corpus
# takes all plagiarism documents with similarity over 0.75
# split into sentences with spacy
class MyCorpus_hits:
    def __iter__(self):
          for ids in list(enumerate(sims)):
            if ids[1] >= 0.7:
                for split in spacy_data(texts[ids[0]]).sents:
                    yield dictionary.doc2bow(preprocess_text(str(split)))
                    elem.clear()

In [188]:
hit_corpus = MyCorpus_hits()

In [189]:
%%time
hit_lda = LdaMulticore(hit_corpus, num_topics=300, id2word=dictionary)

CPU times: user 18.3 s, sys: 425 ms, total: 18.7 s
Wall time: 19.1 s


In [190]:
%%time
corpus_hit_index = similarities.MatrixSimilarity(list(hit_lda[hit_corpus]), num_features=len(dictionary))

CPU times: user 11.3 s, sys: 85.3 ms, total: 11.4 s
Wall time: 11.4 s


In [191]:
#slice test document to sentences
test_doc_raw_slice = []
for split in spacy_data(test_document).sents:
    test_doc_raw_slice.append(preprocess_text(str(split)))

In [197]:
import numpy as np
for sentence in test_doc_raw_slice:
    # test doc Sätze vs hit_corpus 
    test_vec = dictionary.doc2bow(sentence)
    # convert to lda space
    test_vec_lda = lda[test_vec]
    sims_hits = corpus_hit_index[test_vec_lda]
    print(np.amax(sims_hits))
    print(test_vec)

0.0
[(468, 1), (6650, 1), (18420, 1)]
0.0
[(702, 1), (8332, 1), (10272, 1), (10988, 1), (16013, 1), (18433, 1), (18465, 1)]
0.0
[(500, 1), (707, 1), (766, 1), (1387, 1), (6650, 1), (10986, 1)]
0.0
[(1447, 1), (1695, 1)]
0.0
[(702, 1), (1554, 1), (4886, 1), (10047, 1), (10543, 1), (12114, 1)]
0.0
[(217, 1), (468, 1), (1554, 1), (6650, 1), (10988, 1), (11809, 1), (18371, 1)]
0.0
[(696, 1)]
0.0
[(217, 1), (18421, 1)]
0.0
[(1078, 1), (2785, 1), (5543, 1), (10657, 1), (11809, 1), (18464, 1)]
0.0
[(4571, 1), (5224, 1), (8426, 1)]
0.0
[(9357, 1), (13601, 1), (18379, 1), (18380, 1), (18421, 1)]
0.0
[(1268, 1), (12437, 1), (17562, 1), (18396, 1), (18401, 1), (18406, 1)]
0.0
[(529, 1), (18396, 1)]


In [295]:
# html output of all results
display(HTML("""
<style>

.high {background-color: #F8E0E0;}
.higher {background-color: #F8ECE0;}
.medium {background-color: #F7F8E0;}
.low {background-color: #E0F8E0;}
</style style>

<h3>color schema
<span class='high'>1.00-0.70</span>&nbsp;&nbsp;
<span class='higher'>0.69-0.60</span>&nbsp;&nbsp; 
<span class='medium'>0.59-0.50</span>&nbsp;&nbsp; 
<span class='low'>0.49-0.40</span>&nbsp;&nbsp; 
<br><br></h3>
<div style="font-size:12pt;line-height:150%;">
<span class='medium'>Der Kleinspecht (Dryobates minor, Syn.: Dendrocopos minor) ist eine Vogelart aus der Gattung der Buntspechte (Dendrocopos). Diese gehören zur Unterfamilie der Echten Spechte in der Familie der Spechte (Picidae).
Die Art zählt mit einer Körperlänge von rund 15 cm zu den kleinsten Echten Spechten.<b> (Aristoteles)</b></span> Sie ist in 11 Unterarten über die gesamte westliche und nördliche Paläarktis bis an die asiatische Pazifikküste verbreitet.
<span class='higher'>Der Kleinspecht ist ein typischer Vertreter der Buntspechte mit schwarz-weiß kontrastierendem Gefieder, trotzdem ist er in der West- und Zentralpaläarktis auf Grund seiner Kleinheit unverwechselbar
Beide Geschlechter des Kleinspechtes sind fast während des gesamten Jahres sehr ruffreudig<b> (Abraham Lincoln)</b></span>
Der Höhepunkt der gesanglichen Aktivität liegt jedoch im Spätwinter und im zeitigen Frühjahr
Die dichteste Verbreitung liegt in der planaren und collinen Stufe. Bedeutend seltener brüten Kleinspechte in Mitteleuropa in höhergelegenen Gebieten.
Er bevorzugt Waldgebiete und Gehölze mit einem guten Bestand an alten, grobborkigen Laubbäumen. 
Die Nahrung des Kleinspechtes besteht fast während des gesamten Jahres aus kleinen baumbewohnenden Insekten
<span class='medium'>Wie alle Spechte ist auch der Kleinspecht tagaktiv; seine Aktivität beginnt kurz vor Sonnenaufgang und endet kurz nach Sonnenuntergang
Alfredo James „Al“ Pacino (* 25. April 1940 in New York) ist ein US-amerikanischer Schauspieler, Filmregisseur und Filmproduzent.<b> (Abraham Lincoln)</b></span> Er gilt für viele Kritiker und Zuschauer als einer der herausragenden Charakterdarsteller des zeitgenössischen amerikanischen Films und Theaters. So ist er seit den 1970er Jahren in zahlreichen Filmklassikern zu sehen.
m Laufe seiner Karriere wurde er unter anderem mit dem Oscar, dem Golden Globe Award, dem Tony Award und der National Medal of Arts ausgezeichnet. <span class='high'>Seine bekanntesten Rollen sind die des Michael Corleone in der von Francis Ford Coppola inszenierten Der Pate-Trilogie und als Gangster Tony Montana in Scarface.<b> (Aldi)</b></span> 
</div>
"""))