# WikiFileReader purpose

After expanding a Wikipedia archive with the Wiki-Parser, the resulting plain-text file is too large for ordinary whole-file processing. This notebook defines a WikiFile class that reads the articles one at a time, and a function that computes a keyword flag-code for each article, for downstream filtering.

- Use the '#Article:' marker as the article delimiter
- Check each article for the keywords (case-sensitive): Jesus, Christ, Christianity, God, salvation, theology
- Create filtered files

The filtered file (300MB gzipped) can be expanded and read by our WikiFile reader since it is still in WikiFile format.


## WikiFile reader class

In [62]:
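class WikiFile(object):
    """Sequential reader for a plain-text file in which every article
    starts with a line beginning with '#Article:'."""

    ARTICLE_TAG = '#Article:'

    def __init__(self, path):
        self._articlepath = path
        self._stream = open(path, "r", errors='ignore')
        self._EOF_reached = False
        # Skip (and report) any lines that precede the first article header.
        # readline() returns '' at end of file, so stop there instead of looping forever.
        self._buffer = self._stream.readline()
        while self._buffer != '' and not self._buffer.startswith(self.ARTICLE_TAG):
            print('initial extra lines:')
            print(self._buffer)
            self._buffer = self._stream.readline()
        if self._buffer == '':
            # The file contains no article header at all.
            self._buffer = None
            self._EOF_reached = True

    def getNext(self):
        """Return the next article (header line plus body), or None at end of file."""
        # State machine: self._buffer holds the article being collected;
        # read ahead line by line until the header of the next article.
        if self._EOF_reached:
            return None
        nextline = self._stream.readline()
        while nextline != '':  # readline() returns '' (never None) at end of file
            if nextline.startswith(self.ARTICLE_TAG):
                output = self._buffer
                self._buffer = nextline
                return output
            self._buffer += nextline
            nextline = self._stream.readline()
        # End of file reached: return the last article.
        self._EOF_reached = True
        output = self._buffer
        self._buffer = None
        return output

    def getStream(self):
        return self._stream

    def close(self):
        self._stream.close()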
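
The same read-ahead idea can also be written as a generator, which allows a plain `for` loop and closes the file automatically. This is only a sketch, not part of the original notebook: the name `iter_articles` and the made-up sample file are assumptions for illustration.

```python
import os
import tempfile

ARTICLE_TAG = '#Article:'

def iter_articles(path, tag=ARTICLE_TAG):
    """Yield one article at a time (header line plus body) from a WikiFile-format file.

    Hypothetical helper; it mirrors WikiFile.getNext() but as a generator.
    """
    buffer = []
    with open(path, 'r', errors='ignore') as stream:
        for line in stream:
            if line.startswith(tag):
                if buffer:
                    # a new header closes the article collected so far
                    yield ''.join(buffer)
                buffer = [line]
            elif buffer:
                buffer.append(line)
            # lines before the first header are silently skipped
        if buffer:
            yield ''.join(buffer)

# Tiny demonstration on a made-up two-article file.
sample = "stray line\n#Article: A\nfirst body\n#Article: B\nsecond body\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
demo_articles = list(iter_articles(tmp.name))
os.remove(tmp.name)
print(len(demo_articles))
```

Unlike WikiFile, this version does not print the skipped leading lines; that choice is arbitrary and easy to change.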

### Example usage

In [25]:
articleStream = WikiFile("H:/articles_in_plain_text.txt")
# look at first article
article = articleStream.getNext()
articleStream.close()
article

initial extra lines:


initial extra lines:




### Article Flag-Code

In [29]:
patterns = ['Jesus', 'Christ ', 'Christianity', 'God', 'salvation', 'theology']
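def patternCode(text):
    # Start from a leading 1 so every code has exactly 7 binary digits,
    # then append one bit per keyword (1 = keyword found in the text).
    res = 1
    for patt in patterns:
        res = res << 1
        if patt in text:
            res += 1
    return "{0:b}".format(res)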

patternCode(article)

'1000000'

In [11]:
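# create the valid keys for error checking:
# patternCode always emits a leading 1 followed by 6 keyword bits,
# so the valid codes are the binary forms of 64..127
valid_keys = {}
for i in range(2**6, 2**7):
    str_i="{0:b}".format(i)
    valid_keys[str_i]=1
"1001001" in valid_keys    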

True

## Construct Christian article archive (300MB) and title file

In [37]:
DEBUG = False #True
titlepath = "H:/article_titles.txt"
articlepath = "H:/articles_in_plain_text.txt"

# Skip two codes: '1000000' (no keyword found) and '1000100' (only 'God' found).
# The '2' in the output file names is a mnemonic for these two skipped codes.
code_skip_list = ['1000000','1000100']

# output paths
titleAllOutPath = 'F:/wiki/wiki_titles_all_code.txt'
titleSelectOutPath = 'F:/wiki/wiki_titles_christian2_code.txt'
articleSelectOutPath = 'F:/wiki/wiki_articles_christian2.txt'

titleAllStream = open(titleAllOutPath, 'w', newline='') #to avoid CR
titleSelectStream = open(titleSelectOutPath, 'w', newline='') #to avoid CR
articleSelectStream = open(articleSelectOutPath, 'w', newline='') #to avoid CR

articleStream = WikiFile(articlepath)
article_tag = '#Article: '

with open(titlepath, 'r', errors='ignore') as titleStream:
    for cntT, title in enumerate(titleStream):
        if DEBUG and cntT > 100:
            break        
        
        title = title.strip() 
        
        #progress
        if cntT%10000 == 0:
            print("processing Line {}: {}|||".format(cntT, title))
        
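        # read article
        article = articleStream.getNext()
        if article is None:
            # the article file ended before the title file
            print('ERROR no article: line={} title={}'.format(cntT,title))
            break
        if not article.startswith(article_tag):
            print('ERROR header: line={} title={} article={}'.format(cntT,title,article))
            break
        if not article.startswith(article_tag+title):
            #print('ERROR title mismatch {}: {}=/={}'.format(cntT,title,article[:(len(title)+10)]))
            # error always due to a different DASH character, so we can just fix it
            # by taking the title from the article header (same length as the listed title)
            title = article[len(article_tag):(len(title)+len(article_tag))]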
            if DEBUG:
                print('fixed title {}: {}'.format(cntT,title))

        pattCode = patternCode(article)
        titleAllStream.write("{}:{}\n".format(pattCode,title))
        if pattCode not in code_skip_list:
            titleSelectStream.write("{}:{}\n".format(pattCode,title))
            articleSelectStream.write(article)
            
articleStream.close()
titleAllStream.close()
titleSelectStream.close()
articleSelectStream.close()

initial extra lines:


initial extra lines:


processing Line 0: A|||
processing Line 10000: Political philosophy|||
processing Line 20000: Terra Australis|||
processing Line 30000: Distributary|||
processing Line 40000: Troup County, Georgia|||
processing Line 50000: Winchester, Indiana|||
processing Line 60000: Harlowton, Montana|||
processing Line 70000: Marion, South Carolina|||
processing Line 80000: Thomas Chippendale|||
processing Line 90000: Ellison Onizuka|||
processing Line 100000: Knettishall Heath|||
processing Line 110000: Esophageal cancer|||
processing Line 120000: Welwyn Wilton Katz|||
processing Line 130000: Third Battle of Kharkov|||
processing Line 140000: Taiga flycatcher|||
processing Line 150000: President of the Gambia|||
processing Line 160000: Patriotic Union (Liechtenstein)|||
processing Line 170000: DFS 346|||
processing Line 180000: Frogtie|||
processing Line 190000: Knife River|||
processing Line 200000: Boilermaker Road Race|||
processing Line 210000: The 

processing Line 1770000: Samuel Young (New York)|||
processing Line 1780000: Matrjoschka|||
processing Line 1790000: Slieve na Calliagh|||
processing Line 1800000: Vanguardia de la Ciencia|||
processing Line 1810000: DCPS (gene)|||
processing Line 1820000: Railleu|||
processing Line 1830000: Gmina Czarnocin, Świętokrzyskie Voivodeship|||
processing Line 1840000: SLC47A1|||
processing Line 1850000: Chekhovskaya|||
processing Line 1860000: Battle of Morotai|||
processing Line 1870000: Dessie Dolan|||
processing Line 1880000: Natasha Spender|||
processing Line 1890000: Fargovo|||
processing Line 1900000: Nathan Cross|||
processing Line 1910000: Oinville-sur-Montcient|||
processing Line 1920000: Food Programme|||
processing Line 1930000: Raja Badhe|||
processing Line 1940000: Trapezoaia River|||
processing Line 1950000: Guadiana Valley Natural Park|||
processing Line 1960000: Lamborghini Alar|||
processing Line 1970000: Migrant hostels of South Australia|||
processing Line 1980000: Konstan

processing Line 3490000: Diplomacy (1916 film)|||
processing Line 3500000: François Émile Michel|||
processing Line 3510000: Sententia fidei proxima|||
processing Line 3520000: Cerro Overo|||
processing Line 3530000: Coup of 25 November 1975|||
processing Line 3540000: 2012–13 Valencia CF season|||
processing Line 3550000: HAMP domain|||
processing Line 3560000: Auction Room|||
processing Line 3570000: Vale Recreation F.C.|||
processing Line 3580000: Tom Pietermaat|||
processing Line 3590000: Volleyball at the 1998 Central American and Caribbean Games|||
processing Line 3600000: Sachithra Serasinghe|||
processing Line 3610000: Baqerabad, Davarzan|||
processing Line 3620000: Baillieu Library|||
processing Line 3630000: Hamidiyeh, Semnan|||
processing Line 3640000: Pürevjavyn Önörbat|||
processing Line 3650000: Tomasz Foszmańczyk|||
processing Line 3660000: Athletics at the 2007 Summer Universiade – Men's 400 metres|||
processing Line 3670000: Beyeria subtecta|||
processing Line 3680000:
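
Each line of the title files has the form code:title, so the number of articles per flag-code can be tallied afterwards. A minimal sketch, assuming that line format; `tally_codes` and the sample lines (including their codes) are made up for illustration.

```python
from collections import Counter

def tally_codes(lines):
    """Count how often each flag-code occurs in 'code:title' lines."""
    counts = Counter()
    for line in lines:
        # split on the first ':' only, because titles may contain colons
        code, _, _title = line.rstrip('\n').partition(':')
        counts[code] += 1
    return counts

# Made-up sample lines in the format written by the loop above.
sample_lines = [
    '1000000:Example title one\n',
    '1000100:Example title two\n',
    '1000000:Example: a title with a colon\n',
]
code_counts = tally_codes(sample_lines)
print(code_counts.most_common())
```

For the real file, the same function can be fed an open file handle, for example `tally_codes(open(titleAllOutPath, errors='ignore'))`.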

### gzip the large archive
On Windows, right-click the archive wiki_articles_christian2.txt and choose [7-Zip](https://www.7-zip.org/) => Add to archive => select gzip as the archive format.

Output file: wiki_articles_christian2.txt.gz
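
As an alternative to the 7-Zip menu, the same step can be scripted with Python's standard `gzip` and `shutil` modules. This is a sketch rather than the method used for the original archive; the helper names `gzip_file` and `gunzip_file` and the demo file are assumptions.

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(src_path, dst_path=None):
    """Compress src_path to dst_path (default: src_path + '.gz')."""
    if dst_path is None:
        dst_path = src_path + '.gz'
    # copy in chunks so the large archive never has to fit in memory
    with open(src_path, 'rb') as src, gzip.open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

def gunzip_file(src_path, dst_path):
    """Expand a .gz file back to plain text so WikiFile can read it."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

# Round trip on a small made-up file.
workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, 'demo.txt')
with open(plain, 'wb') as f:
    f.write(b'#Article: Demo\nbody\n')
packed = gzip_file(plain)
restored = gunzip_file(packed, os.path.join(workdir, 'restored.txt'))
with open(restored, 'rb') as f:
    roundtrip = f.read()
shutil.rmtree(workdir)
```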

### Construct tail file of the final 90 articles

In [56]:
# create file to investigate the final part of the large article file
# 101890 titles in christian2
# so we will skip the first 101.8K articles and write the rest to file

articlepath = 'F:/wiki/wiki_articles_christian2.txt'
articleStream = WikiFile(articlepath)

articleTailpath = 'F:/wiki/wiki_articles_christian2_tail.txt'

for i in range(101800):
    articleStream.getNext()
    if i % 10000 == 0:
        print(i)

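with open(articleTailpath, 'w', newline='') as tailStream:
    # Write the remaining articles through getNext(): the reader has already
    # consumed the header line of the next article into its look-ahead buffer,
    # so copying the raw stream would drop that header.
    article = articleStream.getNext()
    while article is not None:
        tailStream.write(article)
        article = articleStream.getNext()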
    
articleStream.close()

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000


### Example usage of WikiFile reading

In [69]:
ct = 0
articleTailPath = 'F:/wiki/wiki_articles_christian2_tail.txt'
articleStream = WikiFile(articleTailPath)
article = articleStream.getNext()

while (article is not None):
    ct+=1
    article = articleStream.getNext()
print('Article Count = {} for {}'.format(ct, articleTailPath))
articleStream.close()

Article Count = 90 for F:/wiki/wiki_articles_christian2_tail.txt


In [67]:
article is None

True

## Article text cleaning

In [None]:
articlepath = 'F:/wiki/wiki_articles_christian2_tail.txt'
articleStream = WikiFile(articlepath)
a1 = articleStream.getNext()
articleStream.close()
# to see the article:
# print(a1)

## Rules for cleaning

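Remove the following patterns at the beginning of lines:
- '#Subtitle level ... :' (the marker is removed, the subtitle text is kept)
- '#Article:' (the marker is removed, the title is kept)
- '#Type: ...' (the whole line is dropped)

Afterwards,
- empty lines are dropped
- \n becomes space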

TODO: tokenizer, remove possible numerical patterns (or consolidate into classes)


In [76]:
import re
patt1 = re.compile(r"^\#Article: ")
patt2 = re.compile(r"^\#Subtitle.*?: ")
patt3 = '#Type:'
patt4 = '#'
def clean_article(art):
    if art is None:
        return None
    lines = art.split("\n")
    modlines = []
    for line in lines:
        if line.startswith(patt3):
            continue
        if line.startswith(patt4):
            line = patt1.sub("", line)
            line = patt2.sub("", line)            
        line = line.strip()
        if len(line) == 0:
            continue
        modlines.append(line)
    return ' '.join(modlines)


In [None]:
# to see the effect of clean_article:
# print(clean_article(a1))
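
For the TODO above, one possible starting point is a regex tokenizer that first consolidates numbers into a single class token. This is only a sketch under assumptions: the `<NUM>` token, the two patterns and the name `tokenize` are illustrative choices, not decisions made in this notebook.

```python
import re

# Digits with optional thousands/decimal separators, e.g. 3, 1,234 or 3.14
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# Either the number class token or a word (with an optional apostrophe part)
TOKEN = re.compile(r"<NUM>|\w+(?:'\w+)?")

def tokenize(text):
    """Replace numeric patterns by '<NUM>' and split the text into tokens."""
    text = NUMBER.sub('<NUM>', text)
    return TOKEN.findall(text)

tokens = tokenize("In 1,234 AD the council met 3 times.")
print(tokens)
```

Punctuation is dropped by this tokenizer; whether that is desirable depends on the downstream model.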