# Bible Authorship
Authors: <a href="mailto:razmalkau@gmail.com">Raz Malka</a> and <a href="mailto:shoham39@gmail.com">Shoham Yamin</a>
under the supervision of <a href="mailto:vlvolkov@braude.ac.il">Prof. Zeev Volkovich</a> and <a href="mailto:r_avros@braude.ac.il">Dr. Renata Avros</a>.\
Source:<br> https://github.com/ShohamYamin/BibleAuthorship/

# 2. Preprocessing and Dividing

### 2.1 - General
In the previous task, <mark>Data Preparation</mark>, we prepared 39 books in plain text format, which we now preprocess and divide.\
The preprocessing task demands the following steps:

1. Remove punctuation marks and diacritics
2. Tokenization

The dividing task demands the division of tokenized books into chunks (or rather, tweets) of predefined size.\
\
Let us import the required modules for this notebook:

In [1]:
%load_ext autoreload
%autoreload 2

import aaib_util as util
import json
import numpy as np
import unicodedata

### 2.2 - Load Texts from Files
Next, we load the text of each book from its file:

In [2]:
book_data = []
for book in util.books:
    # Read each book as UTF-16 LE text; the context manager closes the file handle.
    with open(util.file_path + 'txt\\' + book + '.txt', 'r', encoding='utf_16_le') as fp:
        book_data.append(fp.read())
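
The path above hard-codes Windows separators, and the `utf_16_le` codec does not strip a leading byte-order mark. Below is a minimal sketch of a more portable loader, assuming the same `txt` subfolder layout. The helper `read_book` and the demo file name are hypothetical and not part of `aaib_util`.

```python
import tempfile
from pathlib import Path

def read_book(folder, name, encoding="utf_16_le"):
    """Read <folder>/txt/<name>.txt and drop a leading byte-order mark, if any."""
    path = Path(folder) / "txt" / f"{name}.txt"
    with open(path, "r", encoding=encoding) as fp:
        text = fp.read()
    # The utf_16_le codec keeps a BOM as the character U+FEFF, so strip it explicitly.
    return text.lstrip("\ufeff")

# Demo on a throwaway file (hypothetical name), written with a BOM on purpose.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "txt").mkdir()
    (Path(tmp) / "txt" / "demo.txt").write_text("\ufeffשלום עולם", encoding="utf_16_le")
    demo_text = read_book(tmp, "demo")

print(repr(demo_text))
```

Whether the project's files actually start with a byte-order mark is not stated here. The strip is harmless if they do not.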

### 2.3 - Remove Punctuation Marks and Diacritics
Before tokenizing the text, we strip excess characters and symbols.\
The marks <mark><b>, . : ;</b></mark> are removed, as well as diacritics (niqqud, the Hebrew vowel points, and cantillation marks, the Hebrew accents) and rogue single letters.\
For this task we chose the <mark><i>unicodedata</i></mark> module, which decomposes accented characters so that their combining marks can be dropped, leaving the unaccented form.

In [3]:
def remove_cantillation(text):
    return ''.join(c for c in unicodedata.normalize('NFD', text)
            if unicodedata.category(c) != 'Mn')

def remove_punctuation_and_niqqud(text):
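    text = text.replace('\u05be', ' ') # Special case: the Hebrew hyphen (maqaf) joins words, so it becomes a space
    # Drop code points U+05B0..U+05C7 (niqqud and related Hebrew marks) and the ASCII marks , . : ;
    return ''.join(c for c in text if not (0x05B0 <= ord(c) <= 0x05C7 or c in ',.:;'))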

def remove_single_letters(text):
    for i in range(27):                # 22 letters + 5 final forms occupy U+05D0..U+05EA
        text = text.replace(' ' + chr(ord('א') + i) + ' ', ' ')
    return text

for i in range(len(util.books)):
    book_data[i] = remove_punctuation_and_niqqud(book_data[i])
    book_data[i] = remove_cantillation(book_data[i])
    book_data[i] = remove_single_letters(book_data[i])
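
To see which step removes what, the sketch below applies the same Unicode category filter to a single pointed word and then inspects the categories of a few relevant code points. The sample word and the helper name `strip_marks` are illustrative assumptions, not part of the notebook.

```python
import unicodedata

# A pointed form of the word "בראשית": six letters, niqqud, and one cantillation mark (U+0596).
sample = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05b4\u05c1\u0596\u05d9\u05ea"

def strip_marks(text):
    # Decompose, then drop every nonspacing mark (Unicode category 'Mn').
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

stripped = strip_marks(sample)
print(stripped, len(sample), len(stripped))

# Niqqud and cantillation marks are 'Mn', but maqaf and sof pasuq are punctuation,
# so a category filter alone would leave them in place.
for ch in ("\u05b0", "\u0596", "\u05be", "\u05c3"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

This is presumably why the maqaf (U+05BE) and the U+05B0 to U+05C7 range are handled explicitly before the category-based pass.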

### 2.4 - Tokenization
Tokenization is the task of breaking a text into tokens; in our case, words. With punctuation already removed, splitting on whitespace is sufficient.

In [4]:
tokenized_books = []
for i in range(len(util.books)):
    tokenized_books.append(book_data[i].split())
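
As a small illustration of why `split()` is called without arguments, the sketch below compares it with `split(' ')` on a made-up string containing repeated spaces, a tab, and a newline. The sample text is an assumption chosen for the demonstration.

```python
# Made-up sample with leading/trailing spaces, a newline, and a tab.
text = "  בראשית ברא\nאלהים \t את  "

# With no argument, split() treats any run of whitespace as one separator
# and never yields empty strings.
tokens = text.split()

# Splitting on a single space keeps empty strings and leaves '\n' and '\t' in place.
naive = text.split(" ")

print(len(tokens), tokens)
print(len(naive), naive)
```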

### 2.5 - Dividing
Each tokenized book is split into consecutive chunks of <mark>chunk_size</mark> tokens. If the last chunk is shorter, it is padded with empty strings so that all chunks have equal length.

In [5]:
chunk_size = 128
divided_books = []

def pad_last_element(l, n):
    # Pad the final chunk with empty strings up to length n; an empty list is returned unchanged.
    if l and len(l[-1]) < n:
        l[-1].extend([''] * (n - len(l[-1])))
    return l

def chunks(l, n):
    return [l[i:i+n] for i in range(0, len(l), n)]

for i in range(len(util.books)):
    divided_books.append(chunks(tokenized_books[i], chunk_size))
    divided_books[i] = pad_last_element(divided_books[i], chunk_size)
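
The effect of chunking and padding is easiest to see on a toy input. The following sketch uses ten placeholder tokens and a chunk size of 4 instead of 128. The names `pad_last` and `divided` are illustrative and stand in for the notebook's functions.

```python
def chunks(tokens, size):
    # Consecutive slices of at most `size` tokens each.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def pad_last(chunked, size, pad=""):
    # Fill the final chunk with `pad` so that every chunk has exactly `size` items.
    if chunked and len(chunked[-1]) < size:
        chunked[-1] = chunked[-1] + [pad] * (size - len(chunked[-1]))
    return chunked

tokens = [f"w{i}" for i in range(10)]      # placeholder tokens w0..w9
divided = pad_last(chunks(tokens, 4), 4)

for chunk in divided:
    print(chunk)
```

Ten tokens give two full chunks and a third chunk holding two tokens followed by two empty strings. A book with no tokens yields an empty list.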

### 2.6 - Saving
The <mark><i>json</i></mark> module lets us save the preprocessed and divided books to files as follows:

In [6]:
for i in range(len(util.books)):
    with open(util.file_path + "json\\" + util.books[i] + ".json", "w") as fp:
        json.dump(divided_books[i], fp)
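
By default `json.dump` escapes non-ASCII characters as `\uXXXX`, so the saved files round-trip correctly but are hard to read by eye. A minimal sketch of a human-readable alternative follows, assuming UTF-8 output is acceptable for later tasks. The sample data and temporary path are made up for the demonstration.

```python
import json
import os
import tempfile

# Made-up stand-in for one divided book: two chunks of two tokens, the last one padded.
divided = [["בראשית", "ברא"], ["אלהים", ""]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    # ensure_ascii=False writes the Hebrew text as-is, so the file encoding must be explicit.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(divided, fp, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as fp:
        restored = json.load(fp)

print(restored == divided)
```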

### Extra - Future Loading and Book Shapes
These files will be used in later tasks. They can be loaded back as follows:

In [14]:
for i in range(len(util.books)):
    with open(util.file_path + "json\\" + util.books[i] + ".json", "r") as fp:
        book = json.load(fp)
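
The heading mentions book shapes, but the cell above only loads each book. One possible way to inspect a loaded book is sketched below in plain Python. The helpers `book_shape` and `count_tokens` and the toy data are assumptions, not the authors' code. For a padded book, `np.array(book).shape` should report the same pair.

```python
def book_shape(book):
    """Return (number of chunks, chunk length) for a rectangular list of chunks."""
    n_chunks = len(book)
    chunk_len = len(book[0]) if book else 0
    if any(len(chunk) != chunk_len for chunk in book):
        raise ValueError("chunks have unequal lengths")
    return n_chunks, chunk_len

def count_tokens(book):
    """Count real tokens, ignoring the empty-string padding."""
    return sum(1 for chunk in book for token in chunk if token != "")

# Toy stand-in for a loaded book: two chunks of three tokens, the last one padded.
book = [["a", "b", "c"], ["d", "", ""]]
shape = book_shape(book)
n_tokens = count_tokens(book)
print(shape, n_tokens)
```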