# **Prepare input text (HUWIKI) for Huggingface Transformer model (GPT2/Reformer/TransformerXL/...) training**

## **To be run on Google Colab**

## **Load a wiki dump**

#### We have already downloaded the huwiki dump of 2020-05-20; here we just copy it from a Google Cloud bucket. It consists of six compressed XML files, about 0.9 GB in total.

In [None]:
# authorize access to bucket from colab
from google.colab import auth
auth.authenticate_user()

# create folder for storing xml dump files
!mkdir hunwiki
# copy dump files from a bucket 
!gsutil -m cp  gs://hungpt2-wikipedia/huwiki-20200520-dump/*bz2 ./hunwiki/

Copying gs://hungpt2-wikipedia/huwiki-20200520-dump/huwiki-20200520-pages-articles-multistream2.xml-p58602p198203.bz2...
/ [0/6 files][    0.0 B/880.0 MiB]   0% Done                                    Copying gs://hungpt2-wikipedia/huwiki-20200520-dump/huwiki-20200520-pages-articles-multistream1.xml-p1p58601.bz2...
/ [0/6 files][    0.0 B/880.0 MiB]   0% Done                                    Copying gs://hungpt2-wikipedia/huwiki-20200520-dump/huwiki-20200520-pages-articles-multistream4.xml-p406075p692318.bz2...
/ [0/6 files][    0.0 B/880.0 MiB]   0% Done                                    Copying gs://hungpt2-wikipedia/huwiki-20200520-dump/huwiki-20200520-pages-articles-multistream3.xml-p198204p406074.bz2...
/ [0/6 files][    0.0 B/880.0 MiB]   0% Done                                    Copying gs://hungpt2-wikipedia/huwiki-20200520-dump/huwiki-20200520-pages-articles-multistream5.xml-p692319p1116438.bz2...
Copying gs://hungpt2-wikipedia/huwiki-20200520-dump/huwiki-20200520-page
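
Before running the extractor, it can help to confirm that a copied file decompresses and looks like a MediaWiki XML dump. The sketch below is one way to do that with the standard `bz2` module; the helper name `peek_bz2` is hypothetical and not part of the original notebook.

```python
import bz2
from itertools import islice


def peek_bz2(path, n_lines=5):
    """Return the first n_lines lines of a bz2-compressed text file."""
    # bz2.open in text mode decompresses lazily, so only the start of
    # the archive is read, not the whole dump.
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n_lines)]
```

For example, `peek_bz2(path)` on any of the files under `./hunwiki/` should return the opening XML lines of that dump.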

## **Preprocess raw XMLs**

#### WikiExtractor.py (https://github.com/attardi/wikiextractor) is a Python script that extracts and cleans text from a Wikipedia database dump. It stores output in text files of similar size in a given directory. <br> Each file will contain several documents in the format:
>\<doc id=" " revid=" " url="" title=" "\>
><br>...</br>
>\</doc\>

#### We feed each XML file to the extractor script in a loop. To keep the output files from being overwritten, the text files from each XML are saved into a separate subdirectory under /content/full_wiki_extract/ (e.g. ".../full_wiki_extract/xml0/")

In [None]:
# install from git
!git clone https://github.com/attardi/wikiextractor.git

Cloning into 'wikiextractor'...
remote: Enumerating objects: 613, done.
remote: Total 613 (delta 0), reused 0 (delta 0), pack-reused 613
Receiving objects: 100% (613/613), 1.24 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (352/352), done.


In [None]:
# create target dir
!mkdir full_wiki_extract

import time 
import os
import glob

# list xml dump files
dumpFiles = glob.glob('/content/hunwiki/*xml*bz2')
print('XML dump files to process:')
print(dumpFiles)
# create dirs for the output from each dump
outputDirs = []
for i in range(len(dumpFiles)):
  outputDirs.append('/content/full_wiki_extract/xml'+str(i))
  os.environ['SUBDIR'] = outputDirs[i]
  !mkdir $SUBDIR
print('Output dirs for preprocessing:')
print(outputDirs)

# process each dump file, save its outputs to a separate dir, measure elapsed time:
for idx, dumpFile in enumerate(dumpFiles):
  print('Processing ' + dumpFile)
  print('Output dir is ' + outputDirs[idx])
  start = time.time()
  # we pass input name and output dir as env vars to the wikiextractor script
  os.environ['DUMPFILE'] = dumpFile
  os.environ['OUTPUTDIR'] = outputDirs[idx]
  # invoke wikiextractor script
  !python wikiextractor/WikiExtractor.py $DUMPFILE --processes 4 --bytes=25M  --filter_disambig_pages --output=$OUTPUTDIR --min_text_length 100 -q
  end = time.time()
  print(f'Elapsed time {end - start}')

XML dump files to process:
['/content/hunwiki/huwiki-20200520-pages-articles-multistream2.xml-p58602p198203.bz2', '/content/hunwiki/huwiki-20200520-pages-articles-multistream4.xml-p406075p692318.bz2', '/content/hunwiki/huwiki-20200520-pages-articles-multistream5.xml-p692319p1116438.bz2', '/content/hunwiki/huwiki-20200520-pages-articles-multistream6.xml-p1116439p1705558.bz2', '/content/hunwiki/huwiki-20200520-pages-articles-multistream3.xml-p198204p406074.bz2', '/content/hunwiki/huwiki-20200520-pages-articles-multistream1.xml-p1p58601.bz2']
Output dirs for preprocessing:
['/content/full_wiki_extract/xml0', '/content/full_wiki_extract/xml1', '/content/full_wiki_extract/xml2', '/content/full_wiki_extract/xml3', '/content/full_wiki_extract/xml4', '/content/full_wiki_extract/xml5']
Processing /content/hunwiki/huwiki-20200520-pages-articles-multistream2.xml-p58602p198203.bz2
Output dir is /content/full_wiki_extract/xml0
Elapsed time 292.69970703125
Processing /content/hunwiki/huwiki-20200520
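
Outside Colab the `!` shell magic and the environment-variable hand-off are not available. A minimal sketch of the same call through `subprocess` is shown below, assuming the same script path and the same flags as in the cell above; the helper names `build_extractor_cmd` and `run_extractor` are hypothetical.

```python
import subprocess
import sys


def build_extractor_cmd(dump_file, output_dir,
                        script="wikiextractor/WikiExtractor.py"):
    """Build the argument list for one WikiExtractor run.

    The flags are copied from the notebook cell above; nothing new is added.
    """
    return [
        sys.executable, script, dump_file,
        "--processes", "4",
        "--bytes=25M",
        "--filter_disambig_pages",
        f"--output={output_dir}",
        "--min_text_length", "100",
        "-q",
    ]


def run_extractor(dump_file, output_dir):
    """Run the extractor and raise CalledProcessError if it fails."""
    subprocess.run(build_extractor_cmd(dump_file, output_dir), check=True)
```

Passing the arguments as a list avoids shell quoting issues with unusual file names, and `check=True` makes a failed extraction stop the loop instead of passing silently.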


#### Collect all text files under /content/full_wiki_extract and prefix their names with the number of the source XML:

In [None]:
# rename files according to their origin xml + move them under /content/full_wiki_extract/
for idx, outputDir in enumerate(outputDirs):
  # file names
  fileNames = os.listdir(outputDir+'/AA')
  # xml number from the dumpFiles list
  xmlNo = os.path.split(dumpFiles[idx])[1].split('.')[0][-1]  # last digit before the ".xml" part in the filename
  # new file names
  newFileNames = ['xml'+str(xmlNo)+'_'+f for f in fileNames]
  # new paths
  newPaths = ['/content/full_wiki_extract/'+newName for newName in newFileNames]
  # move files
  for fileIdx in range(len(fileNames)):
    os.rename(outputDir+'/AA/'+fileNames[fileIdx], newPaths[fileIdx])
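
The indexing above takes only the last character before ".xml", which is enough for the six files here (streams 1 to 6) but would truncate a two-digit stream number. A sketch of a more tolerant variant, assuming the file names keep the `multistream<N>.xml` pattern seen in the listing above (the helper name `stream_number` is hypothetical):

```python
import os
import re


def stream_number(dump_path):
    """Extract the multistream number from a dump file name."""
    name = os.path.basename(dump_path)
    # Capture all digits between "multistream" and ".xml".
    match = re.search(r"multistream(\d+)\.xml", name)
    if match is None:
        raise ValueError(f"no stream number found in {name!r}")
    return int(match.group(1))
```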


#### Check the results and save them to a Google Cloud bucket

In [None]:
# let's see what we have
print(os.listdir('/content/full_wiki_extract'))

# copy all txt files to google cloud bucket
!gsutil cp /content/full_wiki_extract/*wiki* gs://hungpt2-wikipedia/full_wiki_extract/

['xml2_wiki_03', 'xml2_wiki_00', 'xml3_wiki_01', 'xml4_wiki_00', 'xml6_wiki_02', 'xml5_wiki_01', 'xml3_wiki_00', 'xml4_wiki_04', 'xml2', 'xml5', 'xml6_wiki_06', 'xml4_wiki_01', 'xml1_wiki_03', 'xml6_wiki_00', 'xml4_wiki_02', 'xml3_wiki_04', 'xml6_wiki_01', 'xml5_wiki_02', 'xml6_wiki_07', 'xml4', 'xml5_wiki_03', 'xml3', 'xml1_wiki_04', 'xml2_wiki_04', 'xml1_wiki_01', 'xml5_wiki_05', 'xml6_wiki_08', 'xml4_wiki_03', 'xml6_wiki_05', 'xml0', 'xml2_wiki_02', 'xml6_wiki_03', 'xml6_wiki_04', 'xml1_wiki_00', 'xml3_wiki_03', 'xml5_wiki_04', 'xml6_wiki_09', 'xml5_wiki_06', 'xml2_wiki_01', 'xml1_wiki_02', 'xml5_wiki_07', 'xml1', 'xml5_wiki_00', 'xml3_wiki_02']
Copying file:///content/full_wiki_extract/xml1_wiki_00 [Content-Type=application/octet-stream]...
Copying file:///content/full_wiki_extract/xml1_wiki_01 [Content-Type=application/octet-stream]...
Copying file:///content/full_wiki_extract/xml1_wiki_02 [Content-Type=application/octet-stream]...
Copying file:///content/full_wiki_extract/xml1_wi
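
To inspect what the extractor produced, the `<doc ...>` records described earlier can be parsed into small dictionaries. This is a rough sketch and not part of the original notebook; it assumes every record starts with a `<doc id="..." ... title="...">` line and ends with `</doc>`, and that attribute values contain no double quotes. The names `DOC_RE` and `parse_docs` are hypothetical.

```python
import re

# One record: <doc id="..." ... title="..."> newline, body, </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)"[^>]*title="(?P<title>[^"]*)"[^>]*>\n'
    r'(?P<body>.*?)</doc>',
    re.DOTALL,
)


def parse_docs(raw):
    """Split extractor output into a list of {id, title, text} dicts."""
    return [
        {
            "id": m.group("id"),
            "title": m.group("title"),
            "text": m.group("body").strip(),
        }
        for m in DOC_RE.finditer(raw)
    ]
```

Reading one extracted file and printing `[d["title"] for d in parse_docs(raw)][:10]` gives a quick view of which articles it contains.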

## **Clean text**
#### Text files are cleaned of unnecessary tags and symbols, then transformed into a "one wiki article - one line" format. We save these final files to the Google Cloud bucket as well

In [None]:
# dir to store cleaned text
!mkdir full_wiki_cleaned

# get list of extracted wiki text files
extractedFiles = glob.glob('/content/full_wiki_extract/*wiki*')

for file in extractedFiles:
  # read one extracted file as UTF-8
  with open(file, encoding='utf-8') as f:
    wikitext = f.read()
  # split text at article ends (</doc> tag)
  wikitext = wikitext.split('</doc>')
  # for each article, drop the leading lines (<doc ...> header and title) and join the rest with spaces
  wikitext = [' '.join(text.split('\n')[3:]) for text in wikitext]
  # join the article texts into one string, one article per line
  wikitext = '\n'.join(wikitext)
  # save cleaned text to the "full_wiki_cleaned" dir
  filename = '/content/full_wiki_cleaned/'+os.path.split(file)[1]
  with open(filename, 'w', encoding='utf-8') as f:
    f.write(wikitext)
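
The same transformation can be written as a small function that is easy to try on a sample string. The sketch below is a variant rather than a copy of the cell above: it also strips the blank pieces left around each article, so its output has no leading spaces and no empty last line. The function name `articles_to_lines` is hypothetical.

```python
def articles_to_lines(raw):
    """Turn extractor output into one article per line."""
    lines = []
    for chunk in raw.split("</doc>"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty remainder after the last </doc>
        # Drop the <doc ...> header line and the title line that follows it.
        body = chunk.split("\n")[2:]
        # Join the non-empty lines of the article with single spaces.
        text = " ".join(part.strip() for part in body if part.strip())
        if text:
            lines.append(text)
    return "\n".join(lines)
```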

#### Save cleaned files to bucket

In [None]:
# let's see what we have
print(os.listdir('/content/full_wiki_cleaned'))

# copy all txt files to google cloud bucket
!gsutil cp /content/full_wiki_cleaned/*wiki* gs://hungpt2-wikipedia/full_wiki_cleaned/

['xml2_wiki_03', 'xml2_wiki_00', 'xml3_wiki_01', 'xml4_wiki_00', 'xml6_wiki_02', 'xml5_wiki_01', 'xml3_wiki_00', 'xml4_wiki_04', 'xml6_wiki_06', 'xml4_wiki_01', 'xml1_wiki_03', 'xml6_wiki_00', 'xml4_wiki_02', 'xml3_wiki_04', 'xml6_wiki_01', 'xml5_wiki_02', 'xml6_wiki_07', 'xml5_wiki_03', 'xml1_wiki_04', 'xml2_wiki_04', 'xml1_wiki_01', 'xml5_wiki_05', 'xml6_wiki_08', 'xml4_wiki_03', 'xml6_wiki_05', 'xml2_wiki_02', 'xml6_wiki_03', 'xml6_wiki_04', 'xml1_wiki_00', 'xml3_wiki_03', 'xml5_wiki_04', 'xml6_wiki_09', 'xml5_wiki_06', 'xml2_wiki_01', 'xml1_wiki_02', 'xml5_wiki_07', 'xml5_wiki_00', 'xml3_wiki_02']
Copying file:///content/full_wiki_cleaned/xml1_wiki_00 [Content-Type=application/octet-stream]...
Copying file:///content/full_wiki_cleaned/xml1_wiki_01 [Content-Type=application/octet-stream]...
Copying file:///content/full_wiki_cleaned/xml1_wiki_02 [Content-Type=application/octet-stream]...
Copying file:///content/full_wiki_cleaned/xml1_wiki_03 [Content-Type=application/octet-stream]...
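
Since the cleaned files hold one article per line, counting non-empty lines gives a rough article count per file, which is a cheap sanity check before tokenization. A minimal sketch, assuming the `*wiki*` naming used above (the helper name `count_articles` is hypothetical):

```python
import glob
import os


def count_articles(cleaned_dir):
    """Map each cleaned file name to its number of non-empty lines."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(cleaned_dir, "*wiki*"))):
        with open(path, encoding="utf-8") as f:
            counts[os.path.basename(path)] = sum(1 for line in f if line.strip())
    return counts
```

Calling `count_articles('/content/full_wiki_cleaned')` and summing the values gives the total number of articles that will go into training.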

## **The End** - the next part (tokenization) goes into another notebook