# Wikipedia Corpus

Corpus from: https://dumps.wikimedia.org/dewiki/20200820/

Sentences for comparison from: https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

In [1]:
#imports
from xml.etree.ElementTree import *
import xml.etree.ElementTree as ET
from collections import Counter
import os
import pprint
import gensim
from gensim import corpora
from gensim import models
from gensim import similarities
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models import LdaMulticore
import nltk
from nltk.corpus import stopwords
from smart_open import open 
import spacy
import de_core_news_md
import pickle
import numpy as np

from ipywidgets import FileUpload
from IPython.display import display
from IPython.core.display import display, HTML


from functions import *

### Global Variables

In [24]:
# the XML-file
xml_file = "data/wiki_corpus/dewiki-20200820-pages-articles-multistream.xml"

# number of documents to parse 
num_documents = 200

# similarity threshold, when does a document count as plagiarism
sim_threshold = 0.7

## Preprocessing

To be able to return the title of a given article later on, we need to store those in a dictionary:

In [3]:
title_ids = get_titles(xml_file, num_documents)

## Build the corpus

Create a corpus from the text contents of the XML file.

1. Corpus is defined as a class object, so it can be called when needed.
2. Loops through the XML-file, searching for closing "text" tags.
3. Returns the text contents from these nodes in preprocessed form.
4. Then clears the current node from memory

In [4]:
# Define the corpus as an object
class MyCorpus:
    def __iter__(self):
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):            
            # Each document is represented as an object between <text> tags in the xml file
            if event == 'end' and "text" in elem.tag:
                # Transfom the corpus to vectors
                yield dictionary.doc2bow(preprocess_text(elem.text))
                # clear the node
                elem.clear()                

Initialize the corpus, without loading it into memory, this is not needed when working with the smaller corpus.

In [5]:
corpus = MyCorpus()

The whole corpus is too big for this experiment and takes too long to parse through. For our proof-of-concept approach we therefore propose a function which only loops through the first i documents (text nodes) in the XML tree:

In [6]:
# Define a smaller corpus, containing only the first i documents:
class MyCorpus_small:
    def __iter__(self):
        index = 0
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):
            if index < num_documents:
                # Each document is represented as an object between <text> tags in the xml file
                if event == 'end' and "text" in elem.tag:
                    # Transfom the corpus to vectors
                    yield dictionary.doc2bow(preprocess_text(elem.text))
                    index+=1
                    # clear the node
                    elem.clear()
            else:
                break    

Initialize the smaller corpus, again without loading it into memory:

In [7]:
corpus_small = MyCorpus_small()

---

## Build the Dictionary

To further work with the corpus in vector form, we need to build a dictionary. 

This function needs to be called only once, since we are able to save the dictionary created by it and load it in future use.

__DO NOT RUN THE FOLLOWING CODE IF THE DICTIONARY CAN BE LOADED FROM A FILE__

In [34]:
%%time
# build the dictionary:
dictionary = build_dictionary(xml_file, num_documents)

Wall time: 2min 13s


In [35]:
%%time
# remove words that appear only once
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(once_ids)
# remove gaps in id sequence after words that were removed
dictionary.compactify()

Wall time: 35 ms


In [36]:
#save the dictionary
dictionary.save('data/wiki_200_new.dict')

__CONTINUE HERE TO LOAD THE DICTIONARY__

In [9]:
#load the dictionary
dictionary = Dictionary.load('data/wiki_200_new.dict')

In [10]:
# check if the dictionary has been loaded 
print(dictionary)

Dictionary(20308 unique tokens: ['abc', 'abkehr', 'ablehnen', 'abrufen', 'abschluss']...)


---

## Similarity with LDA (Latent Dirichlet Allocation)

### Train the LDA model

Parameters:
* corpus: the corpus
* num_topics: topics to be extracted from the training corpus
* id2word: id to word mapping, the dictionary
* workers: number of cpu cores used

The trained model can be stored and loaded, as same as the dictionary before.

In [11]:
%%time
lda = LdaMulticore(corpus_small, num_topics=200, id2word=dictionary)

Wall time: 4min 24s


First experiments have shown that a topic number of 10 (default) is too low. 100 resulted in better disctinction between the different articles.
__Further fine tuning needed here__

In [12]:
#save the trained model
lda.save("data/lda_model_200_t200.txt")

In [13]:
#load the trained model
lda = LdaModel.load("data/lda_model_200_t200.txt")

Index the corpus with the trained model:

In [14]:
%%time
corpus_index = similarities.MatrixSimilarity(list(lda[corpus_small]), num_features=len(dictionary))

Wall time: 2min 11s


In [16]:
#save the index
pickle_out = open("data/lda_index_200_t200.pickle", "wb")
pickle.dump(corpus_index, pickle_out)
pickle_out.close()

In [17]:
# load the index from disk
corpus_index = pickle.load(open("data/lda_index_200_t200.pickle", "rb"))

## Similarity Check

Now that we have a LDA model and an index we can check the similarity of an input document against all documents in our corpus.

In [18]:
# define document to use in similarity check
test_document = open('beispieltexte/wikibeispiele.txt', encoding='utf-8')
test_document = test_document.read()

In [19]:
print(test_document)

Pacino wurde durch den späteren Filmproduzenten Martin Bregman bei einem Off-Broadway-Auftritt entdeckt.[6] 1969 wirkte er in seiner ersten Hollywood-Produktion Ich, Natalie mit. 1971 erhielt er neben Kitty Winn eine Rolle in dem Film Panik im Needle Park, die ihm den Weg für die Rolle des Michael Corleone in Francis Ford Coppolas Der Pate (1972) ebnete und ihm 1973 seine erste Oscar-Nominierung einbrachte.

Nach Hundstage wurde es stiller um Pacino. Erst in den 1980er Jahren brachte er sich durch Filme wie Brian De Palmas Scarface (1983) und Sea of Love – Melodie des Todes (1989) wieder ins Gespräch. Nach einer erneuten Zusammenarbeit mit Coppola in Der Pate III (1990) folgte der Thriller Heat (1995) mit Schauspielkollege Robert De Niro. Die männliche Hauptrolle in dem Film Pretty Woman lehnte er ab.

Seine Darstellung des AIDS-kranken Schwulenhassers Roy Cohn in der Miniserie Engel in Amerika (2003) brachte ihm zahlreiche Preise ein und wurde von der Kritik hoch gelobt.

Pacino ist d

In [20]:
# transform the document to vector space
test_vec = dictionary.doc2bow(preprocess_text(test_document))
# convert to lda space
test_vec_lda = lda[test_vec]

In [21]:
# get the similarities
sims = corpus_index[test_vec_lda]

In [22]:
sims = corpus_index[test_vec_lda]
print(list(enumerate(sims)))

[(0, 0.01008058), (1, 0.03768393), (2, 0.0), (3, 0.8155303), (4, 0.8115323), (5, 0.048745703), (6, 0.0), (7, 0.12518354), (8, 0.12518354), (9, 0.12518354), (10, 0.12518354), (11, 0.07066945), (12, 0.1100373), (13, 0.084560685), (14, 0.10474379), (15, 0.12518354), (16, 0.09474901), (17, 0.12518354), (18, 0.12518354), (19, 0.12518354), (20, 0.10801512), (21, 0.09821261), (22, 0.0), (23, 0.12518354), (24, 0.12518354), (25, 0.12518354), (26, 0.042796742), (27, 0.0), (28, 0.0), (29, 0.08997903), (30, 0.07212468), (31, 0.050159145), (32, 0.11552935), (33, 0.04233185), (34, 0.22099994), (35, 0.0029901767), (36, 0.0029772343), (37, 0.0021655203), (38, 0.13527161), (39, 0.0), (40, 0.0), (41, 0.24710418), (42, 0.013038034), (43, 0.0), (44, 0.0), (45, 0.60376626), (46, 0.0), (47, 0.92531735), (48, 0.0055418247), (49, 0.0047619), (50, 0.083487034), (51, 0.014006715), (52, 0.0), (53, 0.11487424), (54, 0.0028934155), (55, 0.0), (56, 0.04766563), (57, 0.01629321), (58, 0.4882217), (59, 0.011836447), 

## Results

In [25]:
hits = 0
for ids in list(enumerate(sims)):
    if ids[1] >= sim_threshold:
        hits += 1
        title = title_ids.get(ids[0])
        print("Similarity Score: ",ids[1],"\n","Document ID:",ids[0],"\n","Title:", title,"\n", "------------------------------------")
print(hits, "cases of possible plagiarism detected.")

Similarity Score:  0.8155303 
 Document ID: 3 
 Title: Anschluss (Luhmann) 
 ------------------------------------
Similarity Score:  0.8115323 
 Document ID: 4 
 Title: Anschlussfähigkeit 
 ------------------------------------
Similarity Score:  0.92531735 
 Document ID: 47 
 Title: Al Pacino 
 ------------------------------------
Similarity Score:  0.9236614 
 Document ID: 90 
 Title: Alicia Silverstone 
 ------------------------------------
Similarity Score:  0.9236614 
 Document ID: 119 
 Title: Atlantik 
 ------------------------------------
5 cases of possible plagiarism detected.


In [39]:
hit_ids = {}
for ids in list(enumerate(sims)):
    if ids[1] >= sim_threshold:
        hit_ids[ids[0]] = ids[1]

In [40]:
hit_ids

{3: 0.8155303, 4: 0.8115323, 47: 0.92531735, 90: 0.9236614, 119: 0.9236614}

## Sentence Similarity

The next step would be to define all documents that were found to have a specific similarity score as a new corpus. Then we can check the similarty score for each sentence from the input document in relation to the sentences from the "new" corpus.

### Build new dictionary

In [52]:
%%time
index = 0
first_elem = True
# loop through all nodes
for event, elem in ET.iterparse(xml_file, events = ("start", "end")):        
    if index < num_documents:
        # check if current node contains a document
        if event == "end" and "text" in elem.tag:
            if index in hit_ids.keys():
                # preprocess the text
                text = preprocess_text(elem.text)
                # if this is the first document found, create a new dictionary with it
                if first_elem:
                    dictionary_hits = Dictionary([text])
                    first_elem = False
                    index += 1
                # all documents after the first one get appended to the dictionary
                else:
                    dictionary_hits.add_documents([text])
                    index += 1
                # clear the node
                elem.clear()
                
            else:
                index += 1
                elem.clear()
    else:
        break


Wall time: 930 ms


In [42]:
print(dictionary_hits)

Dictionary(1168 unique tokens: ['abrücken', 'akt', 'akzeptieren', 'aneinander', 'anregung']...)


In [90]:
# Define a smaller corpus, containing only the first i documents:
class MyCorpus_small_hits:
    def __iter__(self):
        index = 0
        # define the XML tree
        for event, elem in ET.iterparse(xml_file, events = ("start", "end")):
            if index < num_documents:
                # Each document is represented as an object between <text> tags in the xml file
                if event == 'end' and "text" in elem.tag:
                    if index in hit_ids.keys():
                        yield dictionary_hits.doc2bow(preprocess_text(str(elem.text)))
                    index+=1
                    # clear the node
                    elem.clear()
                else:
                    index+=1
                    elem.clear()
            else:
                break    

In [91]:
corpus_small_hits = MyCorpus_small_hits()

In [92]:
%%time
hit_lda = LdaMulticore(corpus_small_hits, num_topics=100, id2word=dictionary_hits)

Wall time: 9.01 ms


In [93]:
print(hit_lda)

LdaModel(num_terms=1168, num_topics=100, decay=0.5, chunksize=2000)


In [94]:
%%time
corpus_hit_index = similarities.MatrixSimilarity(list(hit_lda[corpus_small_hits]), num_features=len(dictionary_hits))

Wall time: 1 ms


In [96]:
print(corpus_hit_index)

MatrixSimilarity<0 docs, 1168 features>


In [97]:
#slice test document to sentences
test_doc_raw_slice = []
for split in spacy_data(test_document).sents:
    test_doc_raw_slice.append(preprocess_text(str(split)))

In [98]:
sim_hits = []
for sentence in test_doc_raw_slice:
    # test doc Sätze vs hit_corpus 
    test_vec = dictionary_hits.doc2bow(sentence)
    # convert to lda space
    test_vec_lda = hit_lda[test_vec]
    sim_hits.append(corpus_hit_index[test_vec_lda])

In [99]:
sim_hits

[array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 array([], dtype=float32),
 

In [100]:
for ids in list(enumerate(sim_hits)):
    title = title_ids.get(ids[0])
    print("Similarity Score: ",ids[1],"\n","Document ID:",ids[0],"\n","Title:", title,"\n", "------------------------------------")
print(hits, "cases of possible plagiarism detected.")

Similarity Score:  [] 
 Document ID: 0 
 Title: Alan Smithee 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 1 
 Title: Actinium 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 2 
 Title: Ang Lee 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 3 
 Title: Anschluss (Luhmann) 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 4 
 Title: Anschlussfähigkeit 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 5 
 Title: Aussagenlogik 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 6 
 Title: Autopoiese 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 7 
 Title: A.A. 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 8 
 Title: Liste von Autoren/A 
 ------------------------------------
Similarity Score:  [] 
 Document ID: 9 
 Title: Liste von Autoren/H 
 ----------------------------

In [86]:
#hits for all sentences
sims_hits = corpus_hit_index[test_vec_lda]
print(list(enumerate(sims_hits)))

[(0, 0.0), (1, 0.0), (2, 0.18720087), (3, 0.0)]
