# Preprocessing the Data

The data for this project comes from Lotsawa House. I'm beginning with their "Words of the Buddha" collection. The translations come as bilingual PDFs, which need to be converted to plain-text (.txt) files before they can be used.

## Setup

We want to be able to use and reuse this code, so I'll first set up some variables that can be easily changed.

In [24]:
# text_name is the name of the text to be processed without the file type suffix

text_name = 'jamyang'

## Converting PDFs to txt

In [25]:
from PyPDF2 import PdfReader

path = 'data/' + text_name + '/'

reader = PdfReader(path + text_name + '.pdf')

num_pages = len(reader.pages)

text = []

for page in reader.pages:
    text.append(page.extract_text())

# Write one string with f.write; an explicit encoding keeps Tibetan script intact
with open(path + 'phase1.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(text))

Now that a txt file has been created, we need to remove the lines that are not useful to us, such as page numbers and lines of Tibetan script. The cell below keeps only ASCII letters and spaces on each line and drops any line that is left empty.

In [26]:
import re

text = []

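with open(path + 'phase1.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Drop everything except ASCII letters and spaces (this also removes the newline)
        new_line = re.sub(r'[^a-zA-Z ]', '', line)
        # Skip lines that are empty or whitespace-only after filtering
        if new_line.strip():
            text.append(new_line)

with open(path + 'phase2.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(text))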
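To make the effect of the filter concrete, here is a minimal sketch that applies the same regular expression to a few made-up lines. The helper name `clean_line` and the sample strings are hypothetical and not taken from the Lotsawa House files. Note that the pattern also deletes accented letters such as "é", so phonetic Tibetan written with diacritics loses those characters.

```python
import re

def clean_line(line):
    """Keep only ASCII letters and spaces; return None if nothing is left."""
    stripped = re.sub(r'[^a-zA-Z ]', '', line)
    return stripped if stripped.strip() else None

# Made-up sample lines, for illustration only
sample = [
    "12\n",                               # a page number
    "\u0f56\u0f45\u0f7c\u0f58\u0f0b\n",   # a few Tibetan script characters
    "chomdend\u00e9 d\u00e9 zhin\n",      # phonetics with an accented e (U+00E9)
    "Homage to the Buddha!\n",            # English with punctuation
]

cleaned = [c for c in map(clean_line, sample) if c is not None]
print(cleaned)
```

The page number and the Tibetan script line disappear entirely, while the two remaining lines lose their accents and punctuation.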

Now that we've whittled the text down, we can arrange it into Tibetan and English sentence pairs. Lotsawa House translations are conveniently laid out on alternating lines: first the Tibetan, then its English translation.

In [27]:
from spacy_langdetect import LanguageDetector
import spacy

def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp_model = spacy.load("en_core_web_md")
spacy.language.Language.factory("language_detector", func=get_lang_detector)
nlp_model.add_pipe('language_detector', last=True)

pairs = []

def detect(text):
    doc = nlp_model(text)
    detect_language = doc._.language
    lang = detect_language['language']
    return lang

with open(path + 'phase2.txt', 'r', encoding='utf-8') as f:
    # Remove newlines up front so every pair ends with exactly one
    text = [line.rstrip('\n') for line in f]

# Keep each non-English line that is immediately followed by an English line
for i in range(len(text) - 1):
    if detect(text[i]) != "en" and detect(text[i + 1]) == "en":
        pairs.append(text[i] + ',' + text[i + 1] + '\n')

with open(path + 'pairs.txt', 'w', encoding='utf-8') as f:
    f.writelines(pairs)
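The pairing rule can be tried without loading spaCy. The sketch below is only an illustration: `pair_lines` is a hypothetical helper, and `toy_detect` is a crude stand-in for the language detector that treats a line as English if it contains one of a few common English words. It is not how `spacy_langdetect` works.

```python
def pair_lines(lines, detect):
    """Pair each non-English line with the English line directly after it."""
    pairs = []
    for current, following in zip(lines, lines[1:]):
        if detect(current) != "en" and detect(following) == "en":
            pairs.append(current.strip() + "," + following.strip())
    return pairs

# Hypothetical stand-in for the real detector, for illustration only
ENGLISH_WORDS = {"the", "to", "of", "and", "homage"}

def toy_detect(line):
    words = line.lower().split()
    return "en" if any(w in ENGLISH_WORDS for w in words) else "other"

# Made-up lines: one phonetic line, its translation, and a stray English line
lines = ["sangye la chaktsal lo", "Homage to the Buddha", "The End"]
pairs = pair_lines(lines, toy_detect)
print(pairs)
```

Because a pair is only formed when a non-English line is directly followed by an English one, the stray English line at the end is not paired with anything.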

The code below reads the pairs.txt file from every folder under data/pre-processed/ and concatenates the results into a single file, data/all-pairs.txt.

In [31]:
import os

root = 'data/pre-processed/'

# Append mode: delete data/all-pairs.txt before re-running to avoid duplicate pairs
with open('data/all-pairs.txt', 'a', encoding='utf-8') as g:
    for folder in os.listdir(root):
        with open(os.path.join(root, folder, 'pairs.txt'), 'r', encoding='utf-8') as f:
            g.writelines(f.readlines())
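As a variation on the cell above, the following sketch wraps the concatenation in a function that can be run more than once. It differs from the notebook code in three ways: it overwrites the output instead of appending, it skips folders that have no pairs.txt, and it makes sure every pair ends with a newline. The name `concat_pairs` and its parameters are hypothetical.

```python
from pathlib import Path

def concat_pairs(root, out_file):
    """Write every non-empty line of each <root>/<folder>/pairs.txt to out_file.

    Returns the number of pairs written. The output is overwritten,
    so re-running does not create duplicates.
    """
    root = Path(root)
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        # Sorted so the output order is the same on every run
        for folder in sorted(p for p in root.iterdir() if p.is_dir()):
            pairs_file = folder / "pairs.txt"
            if not pairs_file.exists():
                continue  # folder has not been processed yet
            for line in pairs_file.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line + "\n")
                    count += 1
    return count
```

A call such as `concat_pairs('data/pre-processed', 'data/all-pairs.txt')` would mirror the paths used in the notebook.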