# Processing Kurdish Wikipedia dumps into a clean, concatenated corpus

## Dependencies

* `os` for working with file paths and directories
* `sys` for printing progress
* `re` for regular expressions
* `bs4` (Beautiful Soup) for parsing XML, with `lxml` as the parser backend

In [1]:
import os
import sys
import re

from bs4 import BeautifulSoup

Define a few important paths

In [129]:
wiki_out_path = os.path.join('..', 'wikipedia', 'kurmanji', 'output', 'AA')
output_file = os.path.join('..', 'wikipedia', 'kurmanji', 'concated', 'text8ku.txt')

Some useful variables
- vocab: to keep track of the characters in the documents
- cnt: to count the processed documents

In [130]:
vocab = set()
cnt = 0

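Load the allowed characters from `char_vocab.txt`. The `non_vocab` pattern below is built from this list, and the characters might also be needed for a future project.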

In [136]:
with open(os.path.join('..', 'char_vocab.txt'), 'r', encoding='utf-8') as f:
    lines = f.readlines()
# Each line is "<index> <character>"; keep only the character
chars = [x.split()[1] for x in lines]
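
The loading step above assumes every line of `char_vocab.txt` holds an index and a character separated by whitespace, which is the format produced by the commented-out writing cell later in this notebook. A minimal round-trip sketch of that format follows; the `sample_chars` list and the temporary file are hypothetical stand-ins, not the real vocabulary.

```python
import os
import tempfile

# Hypothetical subset of the alphabet, used only for illustration
sample_chars = ['a', 'b', 'ç', 'ê']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'char_vocab.txt')

    # Write one "<index> <character>" pair per line
    with open(path, 'w', encoding='utf-8') as f:
        for index, character in enumerate(sorted(sample_chars)):
            f.write('{} {}\n'.format(index, character))

    # Read the file back and keep only the second field of each line
    with open(path, 'r', encoding='utf-8') as f:
        loaded = [line.split()[1] for line in f]

print(loaded)
```

Because `split()` separates on whitespace, a space character could not be stored in this format; that is fine here, since spaces are handled separately by the `spaces` pattern.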

Precompiled regex patterns to clean up the data and produce one big [text8](http://mattmahoney.net/dc/textdata.html)-style file

In [131]:
brackets = re.compile(r'[()]')
spaces = re.compile(r'\s+')
# Escape the characters so that none is read as class syntax (e.g. `-` or `]`)
non_vocab = re.compile('[^{}]'.format(re.escape(''.join(chars))))
# non_char = re.compile('[0-9:/";`´°.,\-%\'²ı!#|\[\]_’*]') # List got too long
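
To make the effect of these patterns concrete, here is a small sketch that applies the same two substitutions to one sample string. The `sample_chars` alphabet is a hypothetical lowercase stand-in for the list loaded from `char_vocab.txt`, and `clean` is a helper name introduced only for this example.

```python
import re

# Hypothetical stand-in for the characters loaded from char_vocab.txt
sample_chars = list("abcçdeêfghiîjklmnopqrsştuûvwxyz")

# Same construction as above: anything outside the alphabet matches
non_vocab_demo = re.compile('[^{}]'.format(re.escape(''.join(sample_chars))))
spaces_demo = re.compile(r'\s+')


def clean(text):
    """Replace out-of-vocabulary characters, then collapse whitespace."""
    text = non_vocab_demo.sub(' ', text)
    return spaces_demo.sub(' ', text)


print(clean("ez diçim (2019)\nmalê"))  # ez diçim malê
```

Digits, brackets and the newline are each replaced by a space first, and the second substitution then collapses the resulting run into a single space.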

One pass through all documents to parse xml, clean unwanted characters, and write them to the corpus

In [133]:
# Open the corpus once instead of reopening it for every document
with open(output_file, 'a+', encoding='utf-8') as out:
    for wiki_extracted_partition_file in os.listdir(wiki_out_path):
        with open(os.path.join(wiki_out_path, wiki_extracted_partition_file), 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), "lxml")
        for doc in soup.find_all('doc'):
            page = doc.get_text()
            page = non_vocab.sub(' ', page)  # replace out-of-vocabulary characters
            page = spaces.sub(' ', page)     # collapse runs of whitespace
            vocab.update(page)
            out.write(page)
            out.write(' ')
            cnt += 1
            sys.stdout.write('\rdocs %d' % cnt)
            sys.stdout.flush()

docs 23915

This was a tricky part: I had to hand-pick the characters and write them to a file. The list is verified against the [Wikipedia Kurmanji Alphabets](https://en.wikipedia.org/wiki/Kurdish_alphabets) page. The process was a bit messy, so I left only the writing part here. If other characters are needed, the section below will be updated. I have commented it out so that it isn't run by mistake.

In [83]:
# with open(os.path.join('..', 'char_vocab.txt'), 'a+') as f:
#     for index, character in enumerate(sorted(chars)):
#         f.write('{} {}\n'.format(index, character))

> Note: This method of removing non-vocab characters leaves many incomplete words. For example, if a word contains a Turkish letter, only that letter is removed and the rest of the word stays in the corpus. For now, my solution is to discard words that occur only once in the dataset when I use it, but I could develop a different regex that removes the surrounding characters as well.
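
Both ideas from the note can be sketched in a few lines. This is only one possible approach under stated assumptions, not the method used to build the corpus: `sample_chars` is a hypothetical lowercase alphabet, and `drop_incomplete_words` and `drop_singletons` are helper names introduced for this example.

```python
import re
from collections import Counter

# Hypothetical stand-in for the characters loaded from char_vocab.txt
sample_chars = list("abcçdeêfghiîjklmnopqrsştuûvwxyz")

# Matches a word made up entirely of in-vocabulary characters
valid_word = re.compile('[{}]+'.format(re.escape(''.join(sample_chars))))


def drop_incomplete_words(text):
    """Drop a whole word if any of its letters is outside the vocabulary."""
    kept = []
    for token in text.split():
        # Strip digits and punctuation, keeping only the letters
        letters = ''.join(ch for ch in token if ch.isalpha())
        if letters and valid_word.fullmatch(letters):
            kept.append(letters)
    return ' '.join(kept)


def drop_singletons(text):
    """Drop words that occur only once in the text."""
    words = text.split()
    counts = Counter(words)
    return ' '.join(w for w in words if counts[w] > 1)


print(drop_incomplete_words("ez diçim, ışık (malê) 2019"))  # ez diçim malê
print(drop_singletons("av av şk nan nan"))                  # av av nan nan
```

One design difference to be aware of: the first function joins the letters on either side of in-word punctuation such as an apostrophe, whereas the character-level substitution used in the main pass would split the word at that point.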