**used libraries:**
- pandas
- glob
- os 
- pyunpack
- shutil
- numpy

**make a new folder in data for preprocessing**

In [1]:
import os
processing_path = "../../data/preprocessing/"
os.makedirs(processing_path, exist_ok=True)

**download and unzip wikiextractor**
- **wikiextractor has to be cited in the paper; see its GitHub page for citation information**

In [2]:
!wget https://github.com/attardi/wikiextractor/archive/master.zip -P {processing_path}
unzip_path_extractor = processing_path + "master.zip"
!unzip {unzip_path_extractor} -d {processing_path}

--2023-04-18 16:54:08--  https://github.com/attardi/wikiextractor/archive/master.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master [following]
--2023-04-18 16:54:08--  https://codeload.github.com/attardi/wikiextractor/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘../../data/preprocessing/master.zip’

master.zip              [ <=>                ]  48.29K  --.-KB/s    in 0.03s   

2023-04-18 16:54:08 (1.38 MB/s) - ‘../../data/preprocessing/master.zip’ saved [49444]

Archive:  ..
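
The unpacking step can also be done without shell commands. The following is a minimal sketch using only the standard library, assuming the archive is already on disk; the helper name `unzip_archive` is illustrative and not part of the notebook.

```python
import os
import zipfile


def unzip_archive(archive_path, target_dir):
    """Extract a zip archive into target_dir and return the member names."""
    # create the target directory if it does not exist yet
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

In the notebook this could be called as `unzip_archive(processing_path + "master.zip", processing_path)`.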

**download the data dump**

In [3]:
# alternative dump (Elder Scrolls wiki): https://s3.amazonaws.com/wikia_xml_dumps/e/el/elderscrolls_pages_current.xml.7z
# wiki statistics page: wiki/Special:Statistics

download_link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"
# name of the dump without the ".7z" suffix, here "harrypotter_pages_current.xml"
filename = download_link.split("/")[-1][:-3]

!wget {download_link} -P {processing_path}

--2023-04-18 16:54:09--  https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.138.112, 52.217.88.238, 52.216.20.189, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.138.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44399418 (42M) [application/x-7z-compressed]
Saving to: ‘../../data/preprocessing/harrypotter_pages_current.xml.7z’


2023-04-18 16:54:12 (17.7 MB/s) - ‘../../data/preprocessing/harrypotter_pages_current.xml.7z’ saved [44399418/44399418]
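
The notebook derives file names from the download link by slicing off fixed-length suffixes. A sketch of the same derivation with `os.path` helpers, which avoids hard-coded suffix lengths; the variable names `archive_name`, `xml_name` and `wiki_name` are illustrative only.

```python
import os

download_link = "https://s3.amazonaws.com/wikia_xml_dumps/h/ha/harrypotter_pages_current.xml.7z"

# last path component of the URL, including both suffixes
archive_name = os.path.basename(download_link)
# strip the final suffix (".7z"); this matches `filename` in the notebook
xml_name, archive_ext = os.path.splitext(archive_name)
# strip the remaining ".xml" suffix; this matches `filename[:-4]`
wiki_name = os.path.splitext(xml_name)[0]

print(archive_name, xml_name, wiki_name)
```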



**unpack the data dump**

In [4]:
from pyunpack import Archive

Archive(processing_path + filename + ".7z").extractall(processing_path)
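
As far as we know, `pyunpack` relies on external tools to unpack `.7z` archives, so extraction can fail when no 7-Zip program is installed. A small pre-flight check, sketched under that assumption; the function name and the list of candidate executable names are our own choices, not part of `pyunpack`.

```python
import shutil


def find_7z_backend(candidates=("7z", "7za", "7zr", "7zz")):
    """Return the path of the first 7-Zip executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None
```

If `find_7z_backend()` returns `None`, installing a 7-Zip package before running the cell above is probably necessary.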

**use wikiextractor to clean the data**
- cleaned data is saved as JSON (one article per line) in `../../data/preprocessing/text`

In [5]:
path = processing_path + filename
cleaned_path = processing_path + "text"
os.makedirs(cleaned_path, exist_ok=True)  # does not fail if the folder already exists
!python3 -m wikiextractor.WikiExtractor --json -o {cleaned_path} {path}

INFO: Preprocessing '../../data/preprocessing/harrypotter_pages_current.xml' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Loaded 1840 templates in 3.6s
INFO: Starting page extraction from ../../data/preprocessing/harrypotter_pages_current.xml.
INFO: Using 7 extract processes.
INFO: Finished 7-process extraction of 29040 articles in 12.3s (2368.4 art/s)
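
With `--json`, each output file holds one JSON object per line; the dataframe built later in this notebook shows the keys `id`, `revid`, `url`, `title` and `text`. A minimal sketch for inspecting a single output file without pandas; the helper name `read_json_lines` is hypothetical.

```python
import json


def read_json_lines(path):
    """Parse a file that holds one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Calling it on one file below `cleaned_path` and printing `records[0]["title"]` is a quick way to confirm that the extraction worked.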


**create one dataframe from all data files**

In [6]:
import glob
import pandas as pd

# default output directory is ../../data/preprocessing/text
# read every output file into its own dataframe, then concatenate once
frames = []
for dirpath, _dirnames, _filenames in os.walk(cleaned_path):
    for path in glob.glob(os.path.join(dirpath, "*")):
        if not os.path.isdir(path):
            frames.append(pd.read_json(path, lines=True))

df = pd.concat(frames, ignore_index=True, sort=False)

df


Unnamed: 0,id,revid,url,title,text
0,21657,35050701,http://harrypotter.fandom.com/wiki?curid=21657,Jelly-Brain Jinx,The Jelly-Brain Jinx was a jinx that reduced o...
1,21660,39675822,http://harrypotter.fandom.com/wiki?curid=21660,Intruder Charm,"The Intruder Charm (""incantation unknown"") was..."
2,21664,35050701,http://harrypotter.fandom.com/wiki?curid=21664,Bewitched tea set,"A bewitched tea set was owned by an old witch,..."
3,21665,49997385,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt,A model of a Firebolt was a small working mode...
4,21666,39675822,http://harrypotter.fandom.com/wiki?curid=21666,Snuffbox,A snuffbox is a decorative box originally inte...
...,...,...,...,...,...
29034,8213,23825763,http://harrypotter.fandom.com/wiki?curid=8213,Moke,The Moke was a magical lizard with silver-gree...
29035,8214,35418795,http://harrypotter.fandom.com/wiki?curid=8214,Mooncalf,The Mooncalf was a shy magical beast that only...
29036,8215,35050701,http://harrypotter.fandom.com/wiki?curid=8215,Mooncalf dung,Mooncalf dung could only be harvested when the...
29037,8218,35050701,http://harrypotter.fandom.com/wiki?curid=8218,Murtlap,The Murtlap was a magical marine beast resembl...


**Some wikis contain redirect pages that have no text or contain leftover escaped markup (`&lt`). Drop them and reset the index**

In [7]:
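# drop pages without any text
df = df[df.text != ""]
# drop pages containing escaped markup; rows with missing text are dropped too
df = df[~df.text.str.contains("&lt", regex=False, na=True)].reset_index()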
df

Unnamed: 0,index,id,revid,url,title,text
0,0,21657,35050701,http://harrypotter.fandom.com/wiki?curid=21657,Jelly-Brain Jinx,The Jelly-Brain Jinx was a jinx that reduced o...
1,1,21660,39675822,http://harrypotter.fandom.com/wiki?curid=21660,Intruder Charm,"The Intruder Charm (""incantation unknown"") was..."
2,2,21664,35050701,http://harrypotter.fandom.com/wiki?curid=21664,Bewitched tea set,"A bewitched tea set was owned by an old witch,..."
3,3,21665,49997385,http://harrypotter.fandom.com/wiki?curid=21665,Model of a Firebolt,A model of a Firebolt was a small working mode...
4,4,21666,39675822,http://harrypotter.fandom.com/wiki?curid=21666,Snuffbox,A snuffbox is a decorative box originally inte...
...,...,...,...,...,...,...
17741,29034,8213,23825763,http://harrypotter.fandom.com/wiki?curid=8213,Moke,The Moke was a magical lizard with silver-gree...
17742,29035,8214,35418795,http://harrypotter.fandom.com/wiki?curid=8214,Mooncalf,The Mooncalf was a shy magical beast that only...
17743,29036,8215,35050701,http://harrypotter.fandom.com/wiki?curid=8215,Mooncalf dung,Mooncalf dung could only be harvested when the...
17744,29037,8218,35050701,http://harrypotter.fandom.com/wiki?curid=8218,Murtlap,The Murtlap was a magical marine beast resembl...
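
To make the effect of the two filters concrete, here is a sketch on a tiny dataframe. The three rows are invented for illustration and are not taken from the dump; unlike the notebook cell, the sketch uses `reset_index(drop=True)`, so the old index is not kept as a column.

```python
import pandas as pd

# toy data: one regular page, one empty redirect, one page with escaped markup
toy = pd.DataFrame({
    "title": ["Moke", "Empty redirect", "Broken markup"],
    "text": ["The Moke was a magical lizard.", "", "&lt;div&gt;leftover"],
})

# keep rows that have text and do not contain the escaped "<" entity
keep = (toy["text"] != "") & ~toy["text"].str.contains("&lt", regex=False)
filtered = toy[keep].reset_index(drop=True)

print(filtered)
```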


**Look at some example texts**

In [8]:
import numpy as np
# print 10 randomly chosen texts (positions are drawn with replacement)
for i in np.random.randint(len(df), size=10):
    print(f"{i}: ")
    print(df.iloc[i]["text"])
    print("------------------------------------------")

9528: 
This female (c. 1994) was a woman employed by the Smeltings Academy of Great Britain sometime in or, possibly, before 1994.
Biography.
Sometime in or before 1976 this woman was born to a Muggle family. She eventually was hired by the Smeltings Academy of Great Britain, and worked for the school as a nurse.
In 1994, she sent a letter to the Dursley family about Dudley's considerable weight. The letter informed them that the school no longer had knickerbockers large enough to fit him, and included a diet sheet comprised mostly of fruits and vegetables.
Personality and traits.
Little is known of this woman's personality; however, it seemed that she was a fairly caring person as she not only was hired by a school, but was troubled enough by the weight of Dudley Dursley that she sent a letter and a diet sheet to the Dursley home to try and get the family to help Dudley lose weight.
------------------------------------------
2785: 
Brogans are boys' ankle-high shoes originating from S
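
`np.random.randint` draws with replacement, so the same article can appear more than once among the examples. If distinct examples are preferred, one option is to sample positions without replacement. A sketch, assuming NumPy's `Generator` API is available; the helper name and the `seed` parameter are our additions.

```python
import numpy as np


def sample_row_positions(n_rows, k=10, seed=None):
    """Draw up to k distinct row positions from range(n_rows)."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no position is drawn twice
    return rng.choice(n_rows, size=min(k, n_rows), replace=False)
```

The result can be used in place of the `randint` call, for example `for i in sample_row_positions(len(df)):`.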

**delete intermediate files and save the dataframe as a .pickle file**
- the dataframe can be read with `pd.read_pickle('../../data/dataframes/-filename-.pickle')`

In [9]:
import shutil
shutil.rmtree(processing_path)
os.makedirs(processing_path, exist_ok=True)

In [10]:
saving_path = "../../data/dataframes/"
os.makedirs(saving_path, exist_ok=True)
# filename[:-4] strips the ".xml" suffix, here giving "harrypotter_pages_current"
df.to_pickle(saving_path + filename[:-4] + '.pickle')
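
A quick way to confirm that the saved file can be read back is a round trip through `to_pickle` and `read_pickle`. The sketch below uses a one-row dataframe invented for illustration and a temporary directory, so it does not touch the notebook's data folders.

```python
import os
import tempfile

import pandas as pd

# stand-in dataframe with the same kind of columns as the notebook's df
frame = pd.DataFrame({"title": ["Moke"], "text": ["The Moke was a magical lizard."]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example.pickle")
    frame.to_pickle(pickle_path)
    restored = pd.read_pickle(pickle_path)

# True when columns, index and values survived the round trip
print(restored.equals(frame))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.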