# Processing Kurdish Wikipedia dumps into a clean, concatenated word corpus

## Dependencies

* `os` for building file paths and listing directories
* `sys` for printing progress
* `re` for regular expressions
* `bs4` (Beautiful Soup) for parsing XML, used here with the `lxml` parser

In [1]:
import os
import sys
import re

from bs4 import BeautifulSoup

Define a few important paths

In [4]:
wiki_out_path = os.path.join('..', 'data', 'wikipedia', 'sorani', 'output', 'AA')
output_file = os.path.join('..', 'data', 'wikipedia', 'sorani', 'concated', 'text8.txt')

Some useful variables
- vocab: to keep track of the characters in the documents
- cnt: to count the processed documents

The character list loaded here may also be needed for a future project

In [136]:
# Each line is expected to look like "<index> <character>"
with open(os.path.join('..', 'char_vocab.txt'), 'r', encoding='utf-8') as f:
    lines = f.readlines()
chars = [x.split()[1] for x in lines]
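
The list comprehension above raises an `IndexError` on a blank or malformed line. A minimal sketch of a more tolerant parser follows, assuming the file holds one `index character` pair per line (the format written later in this notebook). The names `parse_char_vocab`, `demo_lines`, and `demo_chars` are hypothetical and only serve the illustration.

```python
def parse_char_vocab(lines):
    """Return the character column of 'index character' lines.

    Lines that do not split into exactly two fields are skipped.
    """
    parsed = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            parsed.append(fields[1])
    return parsed


# Hypothetical file content, including one blank line that gets skipped.
demo_lines = ['0 ئ\n', '1 ا\n', '\n', '2 ب\n']
demo_chars = parse_char_vocab(demo_lines)
```

With a real file, `parse_char_vocab(f.readlines())` would stand in for the comprehension.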

Precompiled regex patterns to clean up the data and produce one big [text8](http://mattmahoney.net/dc/textdata.html) style file

In [206]:
# Raw strings keep the regex escapes (\s, \(, \-) from being read as string escapes
brackets = re.compile(r'[\(\)]')
spaces = re.compile(r'\s+')
non_char = re.compile(r'[0-9:/";`´°.,\-&%\'²ı!#|\[\]_’*]')  # List got too long
# non_vocab = re.compile(r'\s\S*[^{}]\S*\s'.format(''.join(chars)))
non_vocab = re.compile('[^{}]+'.format(''.join(chars)))
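
To make the effect of these patterns concrete, here is a small sketch that applies the same three substitutions to one sample string. It assumes a tiny stand-in alphabet built from a single word instead of the real `chars` list, and every `demo_`-prefixed name is hypothetical so that nothing in the notebook is overwritten. `re.escape` is an addition not present in the original pattern; it only matters if the alphabet ever contains a character that is special inside `[...]`.

```python
import re

# Assumption: a stand-in alphabet made of the letters of one word,
# instead of the list loaded from char_vocab.txt.
word = 'کوردستان'
demo_chars = sorted(set(word))

demo_spaces = re.compile(r'\s+')
demo_non_char = re.compile(r'[0-9:/";`´°.,\-&%\'²ı!#|\[\]_’*]')
demo_non_vocab = re.compile('[^{}]+'.format(re.escape(''.join(demo_chars))))


def demo_clean(text):
    """Apply the three substitutions in the same order as the notebook."""
    text = demo_non_char.sub(' ', text)   # digits and punctuation -> space
    text = demo_non_vocab.sub(' ', text)  # anything outside the alphabet -> space
    return demo_spaces.sub(' ', text)     # collapse whitespace runs


# Latin letters, brackets, digits and punctuation all collapse to one space.
cleaned = demo_clean(word + ' (Kurdistan), 2019!')
```

Because whitespace is not part of the alphabet, the non-vocabulary pattern also swallows spaces and replaces each run with a single space.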

In [201]:
vocab = set()
cnt = 0

One pass through all documents to parse the XML, replace unwanted characters with spaces, and append the cleaned text to the corpus

In [202]:
with open(output_file, 'a', encoding='utf-8') as out:  # open the corpus once, not once per document
    for wiki_extracted_partition_file in os.listdir(wiki_out_path):
        with open(os.path.join(wiki_out_path, wiki_extracted_partition_file), 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'lxml')
        for doc in soup.find_all('doc'):
            page = doc.get_text()
            page = non_char.sub(' ', page)   # digits and punctuation
            page = non_vocab.sub(' ', page)  # anything outside the alphabet
            page = spaces.sub(' ', page)     # collapse whitespace runs
            vocab.update(page)
            out.write(page)
            out.write(' ')
            cnt += 1
            sys.stdout.write('\rdocs {} vocab {}'.format(cnt, len(vocab)))

docs 21906 vocab 34

In [208]:
' '.join(sorted(vocab))

'  ئ ا ب ت ج ح خ د ر ز س ش ع غ ف ق ل م ن و پ چ ڕ ژ ڤ ک گ ڵ ھ ۆ ی ێ ە'

This was the tricky part: I had to hand-pick the characters and write them to a file. The list was verified against the Wikipedia [Kurdish alphabets](https://en.wikipedia.org/wiki/Kurdish_alphabets) page. The process was a bit messy, so I left only the writing step here. If other characters are needed, the section below will be updated. I have commented it out so that it won't be run by mistake.

In [77]:
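# Commented out on purpose: 'a+' appends, so re-running would duplicate entries.
# with open(os.path.join('..', 'char_vocab_sorani.txt'), 'a+', encoding='utf-8') as f:
#     for index, character in enumerate(sorted(chars)):
#         f.write('{} {}\n'.format(index, character))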

> Note: This method of removing non-vocab characters leaves many incomplete words. For example, if a word contains a Turkish letter, only that letter is replaced by a space, which splits the word into fragments. For now, my solution is to discard words that occur only once in the dataset when I use it, but a different regex could remove the surrounding characters as well.
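
One possible way to remove the surrounding characters as well is to filter whole tokens instead of single characters. The sketch below is not part of the original pipeline: it keeps a whitespace-separated token only if every character belongs to the alphabet. The names `drop_partial_words`, `demo_alphabet`, and `mixed` are hypothetical, and the alphabet is again a stand-in built from one word.

```python
import re


def drop_partial_words(text, alphabet):
    """Keep only the tokens that consist entirely of alphabet characters."""
    whole_word = re.compile('[{}]+'.format(re.escape(''.join(alphabet))))
    return ' '.join(tok for tok in text.split() if whole_word.fullmatch(tok))


word = 'کوردستان'
demo_alphabet = sorted(set(word))           # assumption: stand-in alphabet
mixed = word[:4] + 'ı' + word[4:]           # same word with a Turkish letter inside

# The clean word survives; the mixed token and the Latin token are dropped whole.
kept = drop_partial_words(word + ' ' + mixed + ' Kurd', demo_alphabet)
```

Under this approach the filter would have to run after punctuation has been replaced by spaces, since a token such as a word with a trailing comma would otherwise be dropped too.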

# Word dictionary

To keep the sections efficient and independent, I chose to separate the vocabulary construction.

In [215]:
from collections import Counter

In [227]:
with open(output_file, 'r', encoding='utf-8') as f:
    words = f.read().split()   # split() already returns a list of tokens
word_corpus = set(words)       # unique words
word_histo = Counter(words)    # word -> number of occurrences

In [228]:
len(word_corpus)

204821

In [229]:
len(words)

2488511

In [226]:
word_histo.most_common()

[('لە', 133780),
 ('و', 129100),
 ('بە', 54650),
 ('کە', 43960),
 ('ی', 29052),
 ('بۆ', 28805),
 ('ساڵی', 15979),
 ('لەگەڵ', 15605),
 ('ئەم', 15175),
 ('ئەو', 14487),
 ('دا', 13668),
 ('بوو', 13425),
 ('بووە', 10031),
 ('ل', 8108),
 ('یان', 7628),
 ('دوای', 7424),
 ('وە', 7192),
 ('ب', 7148),
 ('گوندی', 6791),
 ('بەڵام', 6666),
 ('ژمارەی', 6430),
 ('وەک', 6143),
 ('زۆر', 5735),
 ('کرد', 5591),
 ('خۆی', 5583),
 ('کوردستان', 5430),
 ('لەسەر', 5373),
 ('سەر', 5245),
 ('ناوی', 4977),
 ('شاری', 4826),
 ('تا', 4802),
 ('د', 4764),
 ('کەس', 4699),
 ('ئەنجومەنی', 4512),
 ('ئەوەی', 4457),
 ('ھەیە', 4389),
 ('دوو', 4377),
 ('بێ', 4356),
 ('ھەر', 4318),
 ('ر', 4274),
 ('لەو', 4272),
 ('پارێزگای', 4201),
 ('بڕیارنامەی', 4067),
 ('پەسندکرا', 4053),
 ('م', 3805),
 ('کردووە', 3804),
 ('یەکەم', 3681),
 ('دەکات', 3621),
 ('بوون', 3524),
 ('ھەروەھا', 3515),
 ('لایەن', 3439),
 ('ا', 3427),
 ('ئینگلیزی', 3368),
 ('ک', 3301),
 ('پێی', 3232),
 ('ئ', 3187),
 ('یەکێک', 3107),
 ('چەند', 3102),
 ('زمانی', 3065)
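
The earlier note mentions discarding words that occur only once. A minimal sketch of that filter, assuming a token list like `words` and using a hypothetical helper `drop_rare_words` with a hypothetical `min_count` parameter, could look like this:

```python
from collections import Counter


def drop_rare_words(tokens, min_count=2):
    """Return the tokens whose corpus frequency is at least min_count."""
    counts = Counter(tokens)
    return [tok for tok in tokens if counts[tok] >= min_count]


# Hypothetical toy corpus: one word occurs three times, one twice, one once.
demo_tokens = ['لە', 'و', 'لە', 'بە', 'و', 'لە']
frequent = drop_rare_words(demo_tokens)
```

On the real corpus the call would be `drop_rare_words(words)`. The single-letter entries in the histogram above, which are probably fragments left by the character-level cleaning, occur thousands of times and would not be caught by this filter.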