**used libraries:**
- pandas
- glob
- os
- pyunpack
- shutil
- numpy

**make a new folder in data for preprocessing**

In [66]:
import os
processing_path = "../../data/preprocessing/"
os.makedirs(processing_path, exist_ok=True)

**download and unzip wikiextractor**
- **wikiextractor must be cited in the paper; see its GitHub page for citation information**

In [67]:
!wget https://github.com/attardi/wikiextractor/archive/master.zip -P {processing_path}
unzip_path_extractor = processing_path + "master.zip"
!unzip {unzip_path_extractor} -d {processing_path}

--2023-04-19 09:53:43--  https://github.com/attardi/wikiextractor/archive/master.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master [following]
--2023-04-19 09:53:44--  https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: '../../data/preprocessing/master.zip'

master.zip              [ <=>                ]  48.29K  --.-KB/s    in 0.03s   

2023-04-19 09:53:44 (1.60 MB/s) - '../../data/preprocessing/master.zip' saved [49444]

Archive:  ..
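
The download-and-unzip step can also be done without shell commands. A minimal sketch using only the standard library; the helper names `download` and `unzip_to` are illustrative and not part of this notebook:

```python
import os
import urllib.request
import zipfile


def download(url, target_dir):
    """Download url into target_dir and return the local file path."""
    os.makedirs(target_dir, exist_ok=True)
    local_path = os.path.join(target_dir, url.split("/")[-1])
    urllib.request.urlretrieve(url, local_path)
    return local_path


def unzip_to(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Under these assumptions, `unzip_to(download("https://github.com/attardi/wikiextractor/archive/master.zip", processing_path), processing_path)` would stand in for the two `!` commands above.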

**download the data dump**

In [68]:
# elder scrolls: https://s3.amazonaws.com/wikia_xml_dumps/e/el/elderscrolls_pages_current.xml.7z
# wiki/Special:Statistics

download_link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"
filename = download_link.split("/")[-1][:-3]

!wget  {download_link} -P {processing_path}

--2023-04-19 09:53:44--  https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.59.168, 52.217.94.174, 52.217.169.136, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.59.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44399418 (42M) [application/x-7z-compressed]
Saving to: '../../data/preprocessing/harrypotter_pages_current.xml.7z'


2023-04-19 09:53:47 (16.9 MB/s) - '../../data/preprocessing/harrypotter_pages_current.xml.7z' saved [44399418/44399418]
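
The `[:-3]` slice above silently assumes that the link ends in `.7z`. A small sketch that derives the same name but fails loudly for other links; `dump_name_from_url` is a hypothetical helper, not something the notebook defines:

```python
import posixpath
from urllib.parse import urlparse


def dump_name_from_url(url, archive_suffix=".7z"):
    """Return the file name of the unpacked dump for a download URL."""
    archive_name = posixpath.basename(urlparse(url).path)
    if not archive_name.endswith(archive_suffix):
        raise ValueError(f"expected a {archive_suffix} archive, got {archive_name!r}")
    return archive_name[: -len(archive_suffix)]


link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"
print(dump_name_from_url(link))  # harrypotter_pages_current.xml
```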



**unpack the data dump**

In [69]:
from pyunpack import Archive

Archive(processing_path + filename + ".7z").extractall(processing_path)
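
To my knowledge `pyunpack` hands `.7z` archives to `patool`, which needs an external 7-Zip program. If the extraction fails, a quick check like the following may help; the list of executable names is an assumption and may differ per system:

```python
import shutil


def find_7z_backend(candidates=("7z", "7za", "7zr")):
    """Return the path of the first 7-Zip executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


backend = find_7z_backend()
if backend is None:
    print("no 7-Zip executable found; install one before extracting")
else:
    print("using", backend)
```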

**use wikiextractor to clean the data**
- cleaned data will be saved as JSON (one article per line) in `../../data/preprocessing/text`

In [70]:
path = processing_path + filename
cleaned_path = processing_path + "text"
# unlike !mkdir, this does not fail when the folder already exists
os.makedirs(cleaned_path, exist_ok=True)
!python3 -m wikiextractor.WikiExtractor --json -o {cleaned_path} {path}

INFO: Preprocessing '../../data/preprocessing/harrypotter_pages_current.xml' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Loaded 1840 templates in 3.6s
INFO: Starting page extraction from ../../data/preprocessing/harrypotter_pages_current.xml.
INFO: Using 7 extract processes.
INFO: Finished 7-process extraction of 29040 articles in 11.7s (2481.2 art/s)
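
Each line of the extracted files holds one article as a JSON object. A minimal sketch of reading such a line; the sample values are made up, and only the field names are taken from the columns shown further down:

```python
import json

# Hypothetical line in the shape of the --json output (values are made up)
sample_line = '{"id": "1", "revid": "2", "url": "http://example.org/wiki?curid=1", "title": "Example", "text": "First paragraph.\\nHistory.\\nSecond paragraph."}'

record = json.loads(sample_line)
print(sorted(record))              # field names
print(record["text"].split("\n"))  # body and heading lines alternate
```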


**create one dataframe from all data files**

In [71]:
import glob
import pandas as pd
pd.set_option('display.max_colwidth', 200)

# default output directory is ../../data/preprocessing/text

# read every file below cleaned_path; collect the frames and concatenate once
# (calling pd.concat inside the loop copies the growing dataframe every time)
frames = []
for x in os.walk(cleaned_path):
    for y in glob.glob(os.path.join(x[0], '**')):
        if not os.path.isdir(y):
            frames.append(pd.read_json(y, lines=True))

df = pd.concat(frames, ignore_index=True, sort=False)

df


Unnamed: 0,id,revid,url,title,text
0,21657,35050701,http://harrypotter.fandom.com/wiki?curid=21657,Jelly-Brain Jinx,The Jelly-Brain Jinx was a jinx that reduced one's mental abilities.\nHistory.\nDuring the September 1999 riot that took place in the middle of the Puddlemere United versus Holyhead Harpies Quiddi...
1,21660,39675822,http://harrypotter.fandom.com/wiki?curid=21660,Intruder Charm,"The Intruder Charm (""incantation unknown"") was a charm which detected intruders and sounded an alarm, the magical-equivalent to a burglar alarm.\nHistory.\nHorace Slughorn used it on a Muggle-owne..."
2,21664,35050701,http://harrypotter.fandom.com/wiki?curid=21664,Bewitched tea set,"A bewitched tea set was owned by an old witch, until she died and the tea set was sold to a Muggle antique shop. It was subsequently purchased by a Muggle woman who used it for a tea party, but th..."
3,21665,49997385,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt,A model of a Firebolt was a small working model of the world-class Firebolt. Harry Potter received one of these at Christmas from Nymphadora Tonks on 25 December 1995. Harry missed his real Firebo...
4,21666,39675822,http://harrypotter.fandom.com/wiki?curid=21666,Snuffbox,"A snuffbox is a decorative box originally intended to hold snuff, a form of powdered tobacco inhaled through the nose.\nHistory.\nAt the end of the 1991–1992 school year at Hogwarts School of Witc..."
...,...,...,...,...,...
29034,8213,23825763,http://harrypotter.fandom.com/wiki?curid=8213,Moke,The Moke was a magical lizard with silver-green skin that was native to the Great Britain and Ireland.\nDescription.\nIt could reach up to ten inches in length. The Moke had never been noticed by ...
29035,8214,35418795,http://harrypotter.fandom.com/wiki?curid=8214,Mooncalf,"The Mooncalf was a shy magical beast that only came out of its burrow during a full moon.\nDescription.\nThe Mooncalf had smooth, pale grey skin, and four spindly legs that ended in large flat web..."
29036,8215,35050701,http://harrypotter.fandom.com/wiki?curid=8215,Mooncalf dung,"Mooncalf dung could only be harvested when the Mooncalf emerged from its burrow during the full moon.\nDescription.\nIf collected before the sun rose, it would make any magical plant it was spread..."
29037,8218,35050701,http://harrypotter.fandom.com/wiki?curid=8218,Murtlap,"The Murtlap was a magical marine beast resembling a rat with a growth on its back resembling a sea anemone, found on the coastal areas of Britain.\nNature.\nThe favoured prey of the Murtlap were c..."
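
The same collection step can be sketched without pandas, which makes it easy to inspect the raw records first. `read_json_lines` is an illustrative helper; it visits files in sorted path order, so the row order may differ from the dataframe above:

```python
import json
from pathlib import Path


def read_json_lines(root):
    """Collect one dict per non-empty line from every file below root."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                records.extend(json.loads(line) for line in handle if line.strip())
    return records
```

`pd.DataFrame(read_json_lines(cleaned_path))` would then build the frame in one step, although column dtypes may differ from what `pd.read_json` infers.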


**Some wikis contain redirect pages that have no text or still contain escaped markup (`&lt`). Drop them and reset the index**

In [72]:
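# drop pages without text
df = df[df.text != ""]
# drop pages that still contain escaped markup; the old index is kept as an "index" column
df = df[~df.text.str.contains("&lt")].reset_index()
df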

Unnamed: 0,index,id,revid,url,title,text
0,0,21657,35050701,http://harrypotter.fandom.com/wiki?curid=21657,Jelly-Brain Jinx,The Jelly-Brain Jinx was a jinx that reduced one's mental abilities.\nHistory.\nDuring the September 1999 riot that took place in the middle of the Puddlemere United versus Holyhead Harpies Quiddi...
1,1,21660,39675822,http://harrypotter.fandom.com/wiki?curid=21660,Intruder Charm,"The Intruder Charm (""incantation unknown"") was a charm which detected intruders and sounded an alarm, the magical-equivalent to a burglar alarm.\nHistory.\nHorace Slughorn used it on a Muggle-owne..."
2,2,21664,35050701,http://harrypotter.fandom.com/wiki?curid=21664,Bewitched tea set,"A bewitched tea set was owned by an old witch, until she died and the tea set was sold to a Muggle antique shop. It was subsequently purchased by a Muggle woman who used it for a tea party, but th..."
3,3,21665,49997385,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt,A model of a Firebolt was a small working model of the world-class Firebolt. Harry Potter received one of these at Christmas from Nymphadora Tonks on 25 December 1995. Harry missed his real Firebo...
4,4,21666,39675822,http://harrypotter.fandom.com/wiki?curid=21666,Snuffbox,"A snuffbox is a decorative box originally intended to hold snuff, a form of powdered tobacco inhaled through the nose.\nHistory.\nAt the end of the 1991–1992 school year at Hogwarts School of Witc..."
...,...,...,...,...,...,...
17741,29034,8213,23825763,http://harrypotter.fandom.com/wiki?curid=8213,Moke,The Moke was a magical lizard with silver-green skin that was native to the Great Britain and Ireland.\nDescription.\nIt could reach up to ten inches in length. The Moke had never been noticed by ...
17742,29035,8214,35418795,http://harrypotter.fandom.com/wiki?curid=8214,Mooncalf,"The Mooncalf was a shy magical beast that only came out of its burrow during a full moon.\nDescription.\nThe Mooncalf had smooth, pale grey skin, and four spindly legs that ended in large flat web..."
17743,29036,8215,35050701,http://harrypotter.fandom.com/wiki?curid=8215,Mooncalf dung,"Mooncalf dung could only be harvested when the Mooncalf emerged from its burrow during the full moon.\nDescription.\nIf collected before the sun rose, it would make any magical plant it was spread..."
17744,29037,8218,35050701,http://harrypotter.fandom.com/wiki?curid=8218,Murtlap,"The Murtlap was a magical marine beast resembling a rat with a growth on its back resembling a sea anemone, found on the coastal areas of Britain.\nNature.\nThe favoured prey of the Murtlap were c..."
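
To see how many pages each of the two filters removes, the same rules can be applied to plain records. A sketch with made-up data; `filter_records` is a hypothetical helper:

```python
def filter_records(records):
    """Split records into kept ones and counts of dropped ones per reason."""
    kept = []
    dropped = {"empty": 0, "escaped_markup": 0}
    for rec in records:
        text = rec.get("text", "")
        if text == "":
            dropped["empty"] += 1
        elif "&lt" in text:
            dropped["escaped_markup"] += 1
        else:
            kept.append(rec)
    return kept, dropped


# Made-up records for illustration
demo = [
    {"title": "A", "text": "Some article text."},
    {"title": "B", "text": ""},
    {"title": "C", "text": "&lt;div&gt;leftover markup"},
]
kept, dropped = filter_records(demo)
print([r["title"] for r in kept], dropped)
```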


**Look at some example texts**

In [73]:
import numpy as np

for i in np.random.randint(len(df), size= 10):
    print(str(i) + " - "+ df.iloc[i]["title"] +  ": ")
    print(df.iloc[i]["text"])
    print("------------------------------------------")

12222 - Matagot at the Magical Creatures Reserve: 
This Matagot lived in the Magical Creatures Reserve in the 1980s.
Biography.
During 1988–1989 school year, it was revealed that Mrs Norris befriended another Matagot. Later, this Matagot played with them and Kneil in Hagrid's garden.
------------------------------------------
6328 - Unidentified Merwoman at the Great Lake: 
This merwoman was an individual Merperson that lived in the Great Lake Merpeople colony.
Biography.
The merwoman fought Jacob's sibling when they entered the Black Lake for the first time. When the Weird Sisters were holding a concert near the Lake, she surfaced and listened in joy. When Jacob's sibling entered the Lake for a second time, she led them into the Merpeople village and introduced them to the Merqueen.
------------------------------------------
14973 - Second-floor hall: 
This hall was located on the second floor of Hogwarts Castle.
Description.
Inside, one could find several display cases, as well as a 
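
`np.random.randint` draws with replacement, so the same row can show up twice, and the selection changes on every run. A sketch of a reproducible alternative without duplicates; `sample_positions` and its default seed are assumptions for illustration:

```python
import random


def sample_positions(n_rows, k=10, seed=0):
    """Return up to k distinct row positions in [0, n_rows), fixed for a given seed."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), min(k, n_rows))
```

Looping over `sample_positions(len(df))` in place of the `randint` call would print the same ten articles each time. With pandas alone, `df.sample(n=10, random_state=0)` has a similar effect.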

**delete unnecessary data and save dataframe as .pickle file**
- dataframe can be read with `pd.read_pickle('../../data/dataframes/-filename-.pickle')` 

In [74]:
import shutil
shutil.rmtree(processing_path)
os.makedirs(processing_path, exist_ok=True)

In [75]:
saving_path = "../../data/dataframes/"
os.makedirs(saving_path, exist_ok=True)
df.to_pickle(saving_path +filename[:-4] +'.pickle')
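
The `filename[:-4]` slice relies on the dump name ending in `.xml`. A sketch that strips whatever extension is present; `pickle_path` and its `tag` parameter are hypothetical names:

```python
import os


def pickle_path(saving_dir, dump_filename, tag=""):
    """Build '<saving_dir>/<dump name without extension><tag>.pickle'."""
    stem, _extension = os.path.splitext(dump_filename)
    return os.path.join(saving_dir, stem + tag + ".pickle")


print(pickle_path("../../data/dataframes/", "harrypotter_pages_current.xml"))
# ../../data/dataframes/harrypotter_pages_current.pickle
```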

**Process dataframe**

In [76]:
import pandas as pd
df2 = pd.read_pickle("../../data/dataframes/harrypotter_pages_current.pickle")
pd.set_option('display.max_colwidth', 300)
df2

Unnamed: 0,index,id,revid,url,title,text
0,0,21657,35050701,http://harrypotter.fandom.com/wiki?curid=21657,Jelly-Brain Jinx,"The Jelly-Brain Jinx was a jinx that reduced one's mental abilities.\nHistory.\nDuring the September 1999 riot that took place in the middle of the Puddlemere United versus Holyhead Harpies Quidditch game, many of the Harpy supporters were using this jinx.\nThis spell might also have been the sp..."
1,1,21660,39675822,http://harrypotter.fandom.com/wiki?curid=21660,Intruder Charm,"The Intruder Charm (""incantation unknown"") was a charm which detected intruders and sounded an alarm, the magical-equivalent to a burglar alarm.\nHistory.\nHorace Slughorn used it on a Muggle-owned house he stayed in temporarily in 1996 in Budleigh Babberton, but did not hear it go off when Albu..."
2,2,21664,35050701,http://harrypotter.fandom.com/wiki?curid=21664,Bewitched tea set,"A bewitched tea set was owned by an old witch, until she died and the tea set was sold to a Muggle antique shop. It was subsequently purchased by a Muggle woman who used it for a tea party, but the bewitched tea set caused it to end in disaster and several injuries; the teapot went berserk and s..."
3,3,21665,49997385,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt,A model of a Firebolt was a small working model of the world-class Firebolt. Harry Potter received one of these at Christmas from Nymphadora Tonks on 25 December 1995. Harry missed his real Firebolt which he got back at the end of the year while watching it zoom around his bedroom at 12 Grimmaul...
4,4,21666,39675822,http://harrypotter.fandom.com/wiki?curid=21666,Snuffbox,"A snuffbox is a decorative box originally intended to hold snuff, a form of powdered tobacco inhaled through the nose.\nHistory.\nAt the end of the 1991–1992 school year at Hogwarts School of Witchcraft and Wizardry, first-years had to transform a mouse into a snuffbox for their Transfiguration ..."
...,...,...,...,...,...,...
17741,29034,8213,23825763,http://harrypotter.fandom.com/wiki?curid=8213,Moke,The Moke was a magical lizard with silver-green skin that was native to the Great Britain and Ireland.\nDescription.\nIt could reach up to ten inches in length. The Moke had never been noticed by Muggles since it had the ability to shrink at will.\nMokeskin was highly prized in the making of pur...
17742,29035,8214,35418795,http://harrypotter.fandom.com/wiki?curid=8214,Mooncalf,"The Mooncalf was a shy magical beast that only came out of its burrow during a full moon.\nDescription.\nThe Mooncalf had smooth, pale grey skin, and four spindly legs that ended in large flat webbed feet. The Mooncalf also had a very long neck and bulging blue eyes that sat on the top of its he..."
17743,29036,8215,35050701,http://harrypotter.fandom.com/wiki?curid=8215,Mooncalf dung,"Mooncalf dung could only be harvested when the Mooncalf emerged from its burrow during the full moon.\nDescription.\nIf collected before the sun rose, it would make any magical plant it was spread on grow fast and strong.\nTaking mooncalf dung from a field owned by another witch or wizard withou..."
17744,29037,8218,35050701,http://harrypotter.fandom.com/wiki?curid=8218,Murtlap,"The Murtlap was a magical marine beast resembling a rat with a growth on its back resembling a sea anemone, found on the coastal areas of Britain.\nNature.\nThe favoured prey of the Murtlap were crustaceans, though they also went for the feet of any human foolish enough to step on them.\nUses.\n..."


**Functions to split each article into one row per section**

In [82]:
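def split_text(df_prev, url, title, text):
    # lines alternate between body text and section headings ("History.", ...);
    # prepending the title turns the first pair into (title, first paragraph)
    split = text.split("\n")
    arr = [title + ": "] + split
    # odd length means the headings and bodies do not pair up; trim the tail
    if len(arr) % 2 != 0:
        arr = arr[:-2]

    arr1 = arr[0::2]  # headings
    arr2 = arr[1::2]  # bodies

    # hard limit for incorrectly formatted texts: skip "headings" of 30 or more characters
    limit = 30

    res = [x.replace(".", ": ") + y for x,y in zip(arr1, arr2) if len(x) < limit]
    url_arr = [url] * len(res)
    dict_list = {'URL':url_arr,'text':res}
    df = pd.DataFrame(dict_list)
    # new rows are placed in front of the rows collected so far
    df = pd.concat([df, df_prev], ignore_index=True, sort=False)

    return df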



In [83]:
def create_cleaned_df(df):
    df_res = pd.DataFrame()
    for i in range(len(df)):
        df_res = split_text(df_res, df["url"].iloc[i], df["title"].iloc[i], df["text"].iloc[i])

    return df_res
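
A worked example of the pairing logic on a made-up article, without pandas. `split_sections` mirrors the steps of `split_text` but returns plain strings; it is a sketch for illustration only:

```python
def split_sections(title, text, limit=30):
    """Pair each heading line with the line that follows it, as split_text does."""
    lines = [title + ": "] + text.split("\n")
    if len(lines) % 2 != 0:
        lines = lines[:-2]  # same trimming as split_text
    headings = lines[0::2]
    bodies = lines[1::2]
    return [h.replace(".", ": ") + b for h, b in zip(headings, bodies) if len(h) < limit]


# Made-up article in the "body / Heading. / body" layout seen above
demo = "A Moke was a lizard.\nDescription.\nIt could shrink at will."
for row in split_sections("Moke", demo):
    print(row)
# Moke: A Moke was a lizard.
# Description: It could shrink at will.
```

Note that the length limit also applies to the title row: a title of 28 or more characters (30 including the appended `": "`) is skipped, while the following sections are kept.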


**clean dataset**

In [84]:
df_cleaned = create_cleaned_df(df2)
df_cleaned

Unnamed: 0,URL,text
0,http://harrypotter.fandom.com/wiki?curid=8220,"Advanced Rune Translation: Advanced Rune Translation was a book about Rune Translation by Yuri Blishen. It was a required textbook for Study of Ancient Runes, an elective course at Hogwarts School of Witchcraft and Wizardry."
1,http://harrypotter.fandom.com/wiki?curid=8220,History: Hermione Granger was reading a copy of this book after her trip to Diagon Alley before the start of her sixth year at Hogwarts.
2,http://harrypotter.fandom.com/wiki?curid=8218,"Murtlap: The Murtlap was a magical marine beast resembling a rat with a growth on its back resembling a sea anemone, found on the coastal areas of Britain."
3,http://harrypotter.fandom.com/wiki?curid=8218,"Nature: The favoured prey of the Murtlap were crustaceans, though they also went for the feet of any human foolish enough to step on them."
4,http://harrypotter.fandom.com/wiki?curid=8218,"Uses: The growth on the Murtlap's back may be pickled and eaten to improve one's resistance to jinxes, although eating an excess of pickled murtlap may cause one to grow unsightly purple ear hair. Murtlap Essence was a home remedy for cuts and abrasions. Murtlap tentacles were included in Murtla..."
...,...,...
24764,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt: A model of a Firebolt was a small working model of the world-class Firebolt. Harry Potter received one of these at Christmas from Nymphadora Tonks on 25 December 1995. Harry missed his real Firebolt which he got back at the end of the year while watching it zoom around his b...
24765,http://harrypotter.fandom.com/wiki?curid=21664,"Bewitched tea set: A bewitched tea set was owned by an old witch, until she died and the tea set was sold to a Muggle antique shop. It was subsequently purchased by a Muggle woman who used it for a tea party, but the bewitched tea set caused it to end in disaster and several injuries; the teapot..."
24766,http://harrypotter.fandom.com/wiki?curid=21660,"Intruder Charm: The Intruder Charm (""incantation unknown"") was a charm which detected intruders and sounded an alarm, the magical-equivalent to a burglar alarm."
24767,http://harrypotter.fandom.com/wiki?curid=21660,"History: Horace Slughorn used it on a Muggle-owned house he stayed in temporarily in 1996 in Budleigh Babberton, but did not hear it go off when Albus Dumbledore and Harry Potter arrived because he was in the bath."
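
Rows that come from the same article share a URL, so the sections can be regrouped per article when needed. A sketch on made-up rows; `group_by_url` is a hypothetical helper:

```python
from collections import defaultdict


def group_by_url(rows):
    """Map each URL to the list of section texts that came from it, in row order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["URL"]].append(row["text"])
    return dict(grouped)


# Made-up rows in the shape of df_cleaned
demo_rows = [
    {"URL": "http://example.org/wiki?curid=1", "text": "Example: First paragraph."},
    {"URL": "http://example.org/wiki?curid=1", "text": "History: Second paragraph."},
    {"URL": "http://example.org/wiki?curid=2", "text": "Other: Only paragraph."},
]
print(group_by_url(demo_rows))
```

On the dataframe itself, `df_cleaned.groupby("URL")["text"].apply(list)` gives a comparable result.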


In [85]:
for i in np.random.randint(len(df_cleaned), size= 10):
    print(str(i) + " - " +  ": ")
    print(df_cleaned.iloc[i]["text"])
    print("------------------------------------------")

16785 - : 
History: Despite its small size, Moldova is renown for consistently producing excellent Quidditch teams: their national squad won the 2010 Quidditch World Cup. During the qualification period of the 2014 Quidditch World Cup, Moldova suffered an outbreak of Dragon Pox that affected the national team's training camp, thus making the country unable to compete in that year's Cup.
------------------------------------------
7238 - : 
Brain Room: The Brain Room was a room in the Department of Mysteries. It was located on the ninth level of the Ministry of Magic in London, England. The concept of thought and all related subjects were studied in this chamber. 
------------------------------------------
15992 - : 
Morgan Walters: Morgan Walters is a British actor who played a Watchman in "Fantastic Beasts and Where to Find Them".
------------------------------------------
14301 - : 
Physical appearance: Alice was a round-faced woman with short hair, and her son Neville resembled her q

**save cleaned dataset**

In [86]:
saving_path = "../../data/dataframes/"
os.makedirs(saving_path, exist_ok=True)
df_cleaned.to_pickle(saving_path +filename[:-4] + "_cleaned"+'.pickle')