**used libraries:**
- pandas
- glob
- os 
- pyunpack
- shutil
- numpy

**make a new folder in data for preprocessing**

In [31]:
import os

processing_path = "../../data/preprocessing/"
os.makedirs(processing_path, exist_ok=True)

**download the data dump**

In [32]:
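# example for another wiki (Elder Scrolls): https://s3.amazonaws.com/wikia_xml_dumps/e/el/elderscrolls_pages_current.xml
# the dump link of a wiki is usually listed on its wiki/Special:Statistics page

download_link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"
# name of the extracted file: last URL component without the ".7z" suffix
filename = download_link.split("/")[-1][:-3]

!wget {download_link} -P {processing_path}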

--2023-04-15 11:32:45--  https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.225.96, 52.216.54.24, 52.217.140.56, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.225.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44399418 (42M) [application/x-7z-compressed]
Saving to: '../../data/preprocessing/harrypotter_pages_current.xml.7z'


2023-04-15 11:32:49 (13.7 MB/s) - '../../data/preprocessing/harrypotter_pages_current.xml.7z' saved [44399418/44399418]
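
The slicing in the cell above only works when the link ends in `.7z`. As a minimal sketch, the same derivation can be written so that it fails loudly for a link with a different extension. The helper name `dump_filename` is hypothetical and not part of the notebook.

```python
import posixpath
from urllib.parse import urlparse


def dump_filename(download_link: str, archive_suffix: str = ".7z") -> str:
    """Return the last path component of the link without the archive suffix."""
    archive_name = posixpath.basename(urlparse(download_link).path)
    if not archive_name.endswith(archive_suffix):
        raise ValueError(f"expected a {archive_suffix} archive, got {archive_name!r}")
    return archive_name[: -len(archive_suffix)]


link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"
print(dump_filename(link))
```

The result can be assigned to `filename` in place of the `[:-3]` slice; the rest of the notebook would not need to change.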



**download and unzip wikiextractor**
- **wikiextractor has to be cited in the paper! For citation information, see its GitHub page.**

In [33]:
!wget https://github.com/attardi/wikiextractor/archive/master.zip -P {processing_path}
!unzip {processing_path}master.zip -d {processing_path}

--2023-04-15 11:32:49--  https://github.com/attardi/wikiextractor/archive/master.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master [following]
--2023-04-15 11:32:50--  https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.121.10
Connecting to codeload.github.com (codeload.github.com)|140.82.121.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: '../../data/preprocessing/master.zip'

master.zip              [ <=>                ]  48.29K  --.-KB/s    in 0.04s   

2023-04-15 11:32:50 (1.35 MB/s) - '../../data/preprocessing/master.zip' saved [49444]

Archive:  

**unpack the data dump**

In [34]:
from pyunpack import Archive

Archive(processing_path + filename + ".7z").extractall(processing_path)

**use wikiextractor to clean the data**
- cleaned data will be saved as JSON (one object per line) in `../../data/preprocessing/text`

In [35]:
path = processing_path + filename
cleaned_path = processing_path + "text"
!mkdir -p {cleaned_path}
!python3 -m wikiextractor.WikiExtractor --json -o {cleaned_path} {path}

INFO: Preprocessing '../../data/preprocessing/harrypotter_pages_current.xml' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Loaded 1840 templates in 3.7s
INFO: Starting page extraction from ../../data/preprocessing/harrypotter_pages_current.xml.
INFO: Using 7 extract processes.
INFO: Finished 7-process extraction of 29040 articles in 11.1s (2611.1 art/s)
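
With `--json`, each output file holds one JSON object per line; the fields that show up later in this notebook are `id`, `revid`, `url`, `title` and `text`. The sketch below reads such files with the standard library only. The directory layout it creates (`AA/wiki_00`) and the sample records are assumptions made for illustration, and `iter_articles` is a hypothetical helper, not part of wikiextractor.

```python
import json
import os
import tempfile


def iter_articles(root):
    """Yield one dict per non-empty JSON line found in any file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), encoding="utf-8") as handle:
                for line in handle:
                    if line.strip():
                        yield json.loads(line)


# Build a tiny stand-in for the extractor output (illustrative data only).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "AA"))
records = [
    {"id": "1", "revid": "10", "url": "http://example.org/wiki?curid=1",
     "title": "First", "text": "Some text."},
    {"id": "2", "revid": "11", "url": "http://example.org/wiki?curid=2",
     "title": "Second", "text": ""},
]
with open(os.path.join(demo_root, "AA", "wiki_00"), "w", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record) + "\n")

articles = list(iter_articles(demo_root))
print([article["title"] for article in articles])
```

The second record has an empty `text`, which is the kind of page that is dropped further down in the notebook.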


**create one dataframe from all data files**

In [36]:
import glob
import os
import pandas as pd

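# default output directory is ../../data/preprocessing/text

# collect one dataframe per output file and concatenate once at the end,
# instead of copying the growing dataframe on every iteration
frames = []
for x in os.walk(cleaned_path):
    for y in glob.glob(os.path.join(x[0], '**')):
        if not os.path.isdir(y):
            frames.append(pd.read_json(y, lines=True))

df = pd.concat(frames, ignore_index=True, sort=False)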

df


| | id | revid | url | title | text |
|---|---|---|---|---|---|
| 0 | 21657 | 35050701 | http://harrypotter.fandom.com/wiki?curid=21657 | Jelly-Brain Jinx | The Jelly-Brain Jinx was a jinx that reduced o... |
| 1 | 21660 | 39675822 | http://harrypotter.fandom.com/wiki?curid=21660 | Intruder Charm | The Intruder Charm ("incantation unknown") was... |
| 2 | 21664 | 35050701 | http://harrypotter.fandom.com/wiki?curid=21664 | Bewitched tea set | A bewitched tea set was owned by an old witch,... |
| 3 | 21665 | 49997385 | http://harrypotter.fandom.com/wiki?curid=21665 | Model of a Firebolt | A model of a Firebolt was a small working mode... |
| 4 | 21666 | 39675822 | http://harrypotter.fandom.com/wiki?curid=21666 | Snuffbox | A snuffbox is a decorative box originally inte... |
| ... | ... | ... | ... | ... | ... |
| 29034 | 8213 | 23825763 | http://harrypotter.fandom.com/wiki?curid=8213 | Moke | The Moke was a magical lizard with silver-gree... |
| 29035 | 8214 | 35418795 | http://harrypotter.fandom.com/wiki?curid=8214 | Mooncalf | The Mooncalf was a shy magical beast that only... |
| 29036 | 8215 | 35050701 | http://harrypotter.fandom.com/wiki?curid=8215 | Mooncalf dung | Mooncalf dung could only be harvested when the... |
| 29037 | 8218 | 35050701 | http://harrypotter.fandom.com/wiki?curid=8218 | Murtlap | The Murtlap was a magical marine beast resembl... |


**Some wikis contain redirect pages with no text, or pages whose text still contains escaped markup (`&lt`). Drop them and reset the index.**

In [37]:
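# drop pages without any text (e.g. redirects)
df = df[df.text != ""]
# drop pages that still contain escaped markup; reset_index() keeps the old index as column "index"
df = df[~df.text.str.contains("&lt")].reset_index()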
df

| | index | id | revid | url | title | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 21657 | 35050701 | http://harrypotter.fandom.com/wiki?curid=21657 | Jelly-Brain Jinx | The Jelly-Brain Jinx was a jinx that reduced o... |
| 1 | 1 | 21660 | 39675822 | http://harrypotter.fandom.com/wiki?curid=21660 | Intruder Charm | The Intruder Charm ("incantation unknown") was... |
| 2 | 2 | 21664 | 35050701 | http://harrypotter.fandom.com/wiki?curid=21664 | Bewitched tea set | A bewitched tea set was owned by an old witch,... |
| 3 | 3 | 21665 | 49997385 | http://harrypotter.fandom.com/wiki?curid=21665 | Model of a Firebolt | A model of a Firebolt was a small working mode... |
| 4 | 4 | 21666 | 39675822 | http://harrypotter.fandom.com/wiki?curid=21666 | Snuffbox | A snuffbox is a decorative box originally inte... |
| ... | ... | ... | ... | ... | ... | ... |
| 17741 | 29034 | 8213 | 23825763 | http://harrypotter.fandom.com/wiki?curid=8213 | Moke | The Moke was a magical lizard with silver-gree... |
| 17742 | 29035 | 8214 | 35418795 | http://harrypotter.fandom.com/wiki?curid=8214 | Mooncalf | The Mooncalf was a shy magical beast that only... |
| 17743 | 29036 | 8215 | 35050701 | http://harrypotter.fandom.com/wiki?curid=8215 | Mooncalf dung | Mooncalf dung could only be harvested when the... |
| 17744 | 29037 | 8218 | 35050701 | http://harrypotter.fandom.com/wiki?curid=8218 | Murtlap | The Murtlap was a magical marine beast resembl... |
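
The extra `index` column in the output above comes from `reset_index()`, which keeps the old row labels as a regular column. If that column is not needed, `drop=True` discards it. The following is a minimal sketch of the same two filters on made-up toy data; the helper name `drop_unusable` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the extracted articles (illustrative data only).
toy = pd.DataFrame({
    "title": ["Moke", "Redirect page", "Broken page", "Murtlap"],
    "text": [
        "The Moke was a magical lizard.",
        "",
        "&lt;div&gt;leftover markup",
        "The Murtlap was a marine beast.",
    ],
})


def drop_unusable(frame: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with empty text or escaped markup and renumber the rows."""
    frame = frame[frame.text != ""]
    # regex=False treats "&lt" as a plain substring rather than a pattern.
    keep = ~frame.text.str.contains("&lt", regex=False)
    # drop=True discards the old row labels instead of adding an "index" column.
    return frame[keep].reset_index(drop=True)


cleaned = drop_unusable(toy)
print(cleaned)
```

Keeping the old index, as the notebook does, can still be useful when rows need to be traced back to the unfiltered dataframe.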


**Look at some example texts**

In [38]:
import numpy as np

# print 10 randomly chosen articles (indices are drawn with replacement)
for i in np.random.randint(len(df), size=10):
    print(str(i) + ": ")
    print(df.iloc[i]["text"])
    print("------------------------------------------")

3638: 
Abram Welsh is a British actor who played an Engine Driver in .
------------------------------------------
9239: 
Alicia Spinnet (b. 1977/1978) was a witch and Gryffindor student at Hogwarts School of Witchcraft and Wizardry from 1989-1996. She played as a reserve Chaser and later Chaser on the Gryffindor Quidditch team. During her Hogwarts years, she became close friends with Hermione Granger, Cho Chang, Katie Bell, and Angelina Johnson. In her seventh year, she joined Dumbledore's Army, an organisation taught and led by Harry Potter. In 1998, she returned at Hogwarts in order to fight in the Battle of Hogwarts against Lord Voldemort and his Death Eaters.
Biography.
Hogwarts years.
Early years.
Alicia Spinnet attended Hogwarts School of Witchcraft and Wizardry from 1989 to 1996, and was sorted into Gryffindor House. There, she became best friends with fellow Gryffindors Katie Bell, Angelina Johnson, Lee Jordan, Oliver Wood, and the twins Fred and George Weasley who all shared a

**delete the intermediate data and save the dataframe as a .pickle file**
- the dataframe can be read back with `pd.read_pickle('../../data/preprocessing/-filename-.pickle')`

In [39]:
import shutil
# removes the download, the extracted dump and the wikiextractor output; df stays in memory
shutil.rmtree(processing_path)

In [40]:
os.makedirs(processing_path, exist_ok=True)
# filename[:-4] strips the ".xml" suffix, giving e.g. harrypotter_pages_current.pickle
df.to_pickle(processing_path + filename[:-4] + '.pickle')
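
To check that the saved file can be read back unchanged, a round trip can be sketched as follows. It writes to a temporary directory rather than to `../../data/preprocessing/`, and the one-row dataframe is made up for illustration.

```python
import os
import tempfile

import pandas as pd

filename = "harrypotter_pages_current.xml"  # as derived from the download link
stem = filename[:-4]                        # strip the ".xml" suffix

# Illustrative data only; the notebook would use the cleaned dataframe here.
frame = pd.DataFrame({"title": ["Moke"], "text": ["The Moke was a magical lizard."]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, stem + ".pickle")
    frame.to_pickle(pickle_path)
    restored = pd.read_pickle(pickle_path)

print(os.path.basename(pickle_path))
print(restored.equals(frame))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.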