**used libraries:**
- pandas
- glob
- os 
- pyunpack
- shutil
- numpy

**make a new folder in data for preprocessing**

In [81]:
mkdir ../../data/preprocessing/
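
The `mkdir` call above fails when the folder already exists, which happens as soon as the notebook is re-run. Below is a minimal sketch of an idempotent alternative in Python. The helper name `ensure_dir` is hypothetical, and the demo runs in a temporary directory so it does not touch the real data folder.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demo in a throwaway directory. In the notebook the argument would be
# "../../data/preprocessing/" (assumption: same relative layout as above).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "preprocessing")
    ensure_dir(target)
    ensure_dir(target)  # second call is a no-op instead of an error
    created = os.path.isdir(target)
```

In the notebook itself, `ensure_dir("../../data/preprocessing/")` would replace the shell command.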

**download the data dump**

In [82]:
# elder scrolls: https://s3.amazonaws.com/wikia_xml_dumps/e/el/elderscrolls_pages_current.xml
# wiki/Special:Statistics

download_link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"
# archive name without the trailing ".7z", i.e. the name of the extracted XML file
filename = download_link.split("/")[-1][:-3]

!wget {download_link} -P ../../data/preprocessing/

--2023-04-15 11:12:32--  https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.205.96, 52.217.114.56, 52.216.94.157, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.205.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44399418 (42M) [application/x-7z-compressed]
Saving to: ‘../../data/preprocessing/harrypotter_pages_current.xml.7z’


2023-04-15 11:12:36 (12.4 MB/s) - ‘../../data/preprocessing/harrypotter_pages_current.xml.7z’ saved [44399418/44399418]
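
The slice `[:-3]` in the download cell silently assumes the link ends in `.7z`. The sketch below derives the same `filename` but checks the suffix explicitly, so a link to an uncompressed dump does not lose its last three characters. The variable names `archive_name` and `extension` are introduced here for illustration only.

```python
import os

download_link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"

# Last path component of the URL, e.g. "harrypotter_pages_current.xml.7z"
archive_name = download_link.rsplit("/", 1)[-1]

# splitext removes only the final suffix, leaving the ".xml" part intact
filename, extension = os.path.splitext(archive_name)

if extension != ".7z":
    raise ValueError(f"expected a .7z archive, got {archive_name!r}")
```

For the Harry Potter link this yields the same `filename` as the slicing approach.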



**clone and unzip wikiextractor**
- **wikiextractor must be cited in the paper; see its GitHub page for citation information**

In [83]:
!wget https://github.com/attardi/wikiextractor/archive/master.zip -P ../../data/preprocessing/
!unzip ../../data/preprocessing/master.zip -d ../../data/preprocessing/

--2023-04-15 11:12:37--  https://github.com/attardi/wikiextractor/archive/master.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master [following]
--2023-04-15 11:12:38--  https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘../../data/preprocessing/master.zip’

master.zip              [ <=>                ]  48.29K  --.-KB/s    in 0.03s   

2023-04-15 11:12:38 (1.60 MB/s) - ‘../../data/preprocessing/master.zip’ saved [49444]

Archive:  ..

**unpack the data dump**

In [84]:
from pyunpack import Archive

Archive('../../data/preprocessing/' + filename + ".7z").extractall("../../data/preprocessing/")

**use wikiextractor to clean the data**
- the cleaned data is saved as JSON (one article per line) in `../../data/preprocessing/text`

In [85]:
path = "../../data/preprocessing/" + filename
!mkdir ../../data/preprocessing/text
!python3 -m wikiextractor.WikiExtractor --json -o ../../data/preprocessing/text {path}

INFO: Preprocessing '../../data/preprocessing/harrypotter_pages_current.xml' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Loaded 1840 templates in 3.7s
INFO: Starting page extraction from ../../data/preprocessing/harrypotter_pages_current.xml.
INFO: Using 7 extract processes.
INFO: Finished 7-process extraction of 29040 articles in 11.2s (2588.0 art/s)
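
Each line of the extracted files holds one JSON object per article. The sketch below parses a single such line to show the record shape; the field names are the ones that appear as dataframe columns later in this notebook. The sample line is hypothetical: its `text` value is a placeholder, and the value types are assumptions rather than guaranteed wikiextractor output.

```python
import json

# Hypothetical one-line sample; "text" is a placeholder, not real wiki content.
sample_line = json.dumps({
    "id": "21657",
    "revid": "35050701",
    "url": "http://harrypotter.fandom.com/wiki?curid=21657",
    "title": "Jelly-Brain Jinx",
    "text": "placeholder article text",
})

# One line of the output file corresponds to one article record
record = json.loads(sample_line)
fields = sorted(record)
```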


**create one dataframe from all data files**

In [86]:
import glob
import os
import pandas as pd

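# default output directory is ../../data/preprocessing/text
text_dir = "../../data/preprocessing/text"

# collect every extracted file below text_dir, including files in subfolders
paths = [p for p in glob.glob(os.path.join(text_dir, "**", "*"), recursive=True)
         if os.path.isfile(p)]

# read each JSON-lines file and concatenate once at the end; this avoids
# copying a growing dataframe on every iteration
df = pd.concat([pd.read_json(p, lines=True) for p in paths], ignore_index=True, sort=False)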

df


Unnamed: 0,id,revid,url,title,text
0,21657,35050701,http://harrypotter.fandom.com/wiki?curid=21657,Jelly-Brain Jinx,The Jelly-Brain Jinx was a jinx that reduced o...
1,21660,39675822,http://harrypotter.fandom.com/wiki?curid=21660,Intruder Charm,"The Intruder Charm (""incantation unknown"") was..."
2,21664,35050701,http://harrypotter.fandom.com/wiki?curid=21664,Bewitched tea set,"A bewitched tea set was owned by an old witch,..."
3,21665,49997385,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt,A model of a Firebolt was a small working mode...
4,21666,39675822,http://harrypotter.fandom.com/wiki?curid=21666,Snuffbox,A snuffbox is a decorative box originally inte...
...,...,...,...,...,...
29034,8213,23825763,http://harrypotter.fandom.com/wiki?curid=8213,Moke,The Moke was a magical lizard with silver-gree...
29035,8214,35418795,http://harrypotter.fandom.com/wiki?curid=8214,Mooncalf,The Mooncalf was a shy magical beast that only...
29036,8215,35050701,http://harrypotter.fandom.com/wiki?curid=8215,Mooncalf dung,Mooncalf dung could only be harvested when the...
29037,8218,35050701,http://harrypotter.fandom.com/wiki?curid=8218,Murtlap,The Murtlap was a magical marine beast resembl...


**Some wikis contain redirect pages that have no text or a malformed structure. Drop them and reset the index.**

In [87]:
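# drop pages without any text (e.g. redirects)
df = df[df["text"] != ""]
# drop pages that still contain escaped markup ("&lt");
# reset_index() keeps the old index as an "index" column
df = df[~df["text"].str.contains("&lt", regex=False)].reset_index()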
df

Unnamed: 0,index,id,revid,url,title,text
0,0,21657,35050701,http://harrypotter.fandom.com/wiki?curid=21657,Jelly-Brain Jinx,The Jelly-Brain Jinx was a jinx that reduced o...
1,1,21660,39675822,http://harrypotter.fandom.com/wiki?curid=21660,Intruder Charm,"The Intruder Charm (""incantation unknown"") was..."
2,2,21664,35050701,http://harrypotter.fandom.com/wiki?curid=21664,Bewitched tea set,"A bewitched tea set was owned by an old witch,..."
3,3,21665,49997385,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt,A model of a Firebolt was a small working mode...
4,4,21666,39675822,http://harrypotter.fandom.com/wiki?curid=21666,Snuffbox,A snuffbox is a decorative box originally inte...
...,...,...,...,...,...,...
17741,29034,8213,23825763,http://harrypotter.fandom.com/wiki?curid=8213,Moke,The Moke was a magical lizard with silver-gree...
17742,29035,8214,35418795,http://harrypotter.fandom.com/wiki?curid=8214,Mooncalf,The Mooncalf was a shy magical beast that only...
17743,29036,8215,35050701,http://harrypotter.fandom.com/wiki?curid=8215,Mooncalf dung,Mooncalf dung could only be harvested when the...
17744,29037,8218,35050701,http://harrypotter.fandom.com/wiki?curid=8218,Murtlap,The Murtlap was a magical marine beast resembl...
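
The two filters above can be checked on a small toy dataframe. This is a sketch with made-up rows, not data from the dump. Unlike the cell above, it passes `drop=True` to `reset_index` so that the old index is discarded instead of being kept as an extra `index` column.

```python
import pandas as pd

# Made-up rows: one normal page, one empty redirect, one with escaped markup
toy = pd.DataFrame({
    "title": ["Kept page", "Empty redirect", "Markup residue"],
    "text": ["Some article text.", "", "&lt;div&gt;leftover markup"],
})

has_text = toy["text"] != ""
has_markup = toy["text"].str.contains("&lt", regex=False)

# Keep rows that have text and no escaped markup, then renumber from 0
cleaned = toy[has_text & ~has_markup].reset_index(drop=True)
kept_titles = cleaned["title"].tolist()
```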


**Look at some example texts**

In [88]:
import numpy as np
for i in np.random.randint(len(df), size= 10):
    print(str(i) + ": ")
    print(df.iloc[i]["text"])
    print("------------------------------------------")

10449: 
Naomi Kusumi (born 17 June, 1954) is a Japanese actor who voiced Vernon Dursley in the Japanese dubs of the film adaptations of , , , and .
------------------------------------------
4612: 
A Brown Bear is a species of bear and one possible corporeal form of the Patronus Charm.
------------------------------------------
10241: 
StarKid Productions, also known as Team StarKid, StarKidPotter or simply StarKid, is an American based theatre production company founded by Darren Criss, Brian Holden, Nick Lang, and Matt Lang at the University of Michigan in 2009. It is currently based in Chicago, IL. They have produced several musicals, including "A Very Potter Musical" and "A Very Potter Sequel", both of which parody the "Harry Potter" series. A third instalment was performed at LeakyCon 2012 on August 11, 2012 titled "A Very Potter 3D: A Very Potter Senior Year. "Non-Harry Potter productions of theirs include "Little White Lie", "Me and My Dick", "Starship", "Holy Musical B@man, Twi
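
`np.random.randint` draws with replacement, so the same article can be printed twice, and the selection changes on every run. If a repeatable sample is preferred, `DataFrame.sample` with a fixed `random_state` is one option. The sketch below uses a toy dataframe with placeholder texts; the seed value `0` is arbitrary.

```python
import pandas as pd

# Toy stand-in for the article dataframe
toy = pd.DataFrame({"text": [f"article {i}" for i in range(100)]})

# Ten distinct rows; the same rows are returned on every run with this seed
sample = toy.sample(n=10, random_state=0)

for i, text in sample["text"].items():
    print(f"{i}: ")
    print(text)
    print("------------------------------------------")
```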

**delete the intermediate files and save the dataframe as a .pickle file**
- the dataframe can be read back with `pd.read_pickle('../../data/preprocessing/preprocessed_data.pickle')`

In [80]:
import shutil
shutil.rmtree('../../data/preprocessing')

In [36]:
!mkdir ../../data/preprocessing
df.to_pickle('../../data/preprocessing/preprocessed_data.pickle')
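
A quick round trip can confirm that the pickle restores the dataframe unchanged. The sketch below does this with a toy dataframe in a temporary directory; the rows are placeholders, and only the file name `preprocessed_data.pickle` is taken from the cell above.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with placeholder texts
toy = pd.DataFrame({"title": ["Moke", "Mooncalf"],
                    "text": ["first text", "second text"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "preprocessing")
    os.makedirs(out_dir, exist_ok=True)
    pickle_path = os.path.join(out_dir, "preprocessed_data.pickle")

    # Write the dataframe and read it back while the directory still exists
    toy.to_pickle(pickle_path)
    restored = pd.read_pickle(pickle_path)

round_trip_ok = restored.equals(toy)
```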