# Preprocessing the Data

The purpose of this notebook is to document the preprocessing applied to the PDFs scraped from [Lotsawa House](https://www.lotsawahouse.org/).

## Converting PDFs to txt

The translations come as bilingual PDFs, which need to be converted to a usable .txt format.

In [1]:
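import os

from PyPDF2 import PdfReader

path = '/home/j/Documents/Projects/Iron-Bridge/lotsawa/data/lotsawahouse/topic-pdfs'
out_path = '/home/j/Documents/Projects/Iron-Bridge/lotsawa/data/lotsawahouse/topic-txts'

def pdf_to_txt(file):
    """Return the extracted text of each page in the PDF as a list of strings."""
    reader = PdfReader(file)

    text = []

    for page in reader.pages:
        # Fall back to '' if a page yields no extractable text
        text.append(page.extract_text() or '')

    return text

for file in os.listdir(path):
    # Skip anything in the directory that is not a PDF
    if not file.lower().endswith('.pdf'):
        continue
    text = pdf_to_txt(os.path.join(path, file))
    # Swap the .pdf extension for .txt
    out_file = os.path.join(out_path, os.path.splitext(file)[0] + '.txt')
    with open(out_file, 'w', encoding='utf-8') as f:
        f.write('\n'.join(text))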
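
Deriving the output filename from the PDF name can also be done with `pathlib`, which replaces only the final extension regardless of its length. This is a minimal sketch, not part of the original notebook; the helper name `txt_path_for` and the example filenames are hypothetical.

```python
from pathlib import Path

def txt_path_for(pdf_file, out_dir):
    """Map a PDF filename to a .txt path with the same stem inside out_dir."""
    # with_suffix swaps only the last extension, so 'a.b.pdf' becomes 'a.b.txt'
    return Path(out_dir) / Path(pdf_file).with_suffix('.txt').name

# Hypothetical filename and output directory, for illustration only
print(txt_path_for('refuge.pdf', 'topic-txts'))
```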

## Clean txt

Now that the txt files have been created, we need to remove lines that are not useful to us. This includes page numbers, Tibetan script lines, etc. Within the lines that remain, everything except ASCII letters and spaces is stripped as well.

In [4]:
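import os
import re

def clean_txt(file):
    """Rewrite the file in place, keeping only ASCII letters and spaces."""
    text = []

    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            # Drop everything except ASCII letters and spaces (digits,
            # punctuation, Tibetan script and the trailing newline all go)
            new_line = re.sub(r'[^a-zA-Z ]', '', line)
            # Keep the line only if something other than spaces is left
            if new_line.strip():
                text.append(new_line)

    with open(file, 'w', encoding='utf-8') as f:
        f.write('\n'.join(text))


path = '/home/j/Documents/Projects/Iron-Bridge/lotsawa/data/lotsawahouse/topic-txts'

for file in os.listdir(path):
    clean_txt(os.path.join(path, file))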
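
To see what this filter keeps and drops, here is a small sketch that applies the same regex to a few in-memory lines. The helper name `clean_lines` and the sample strings are made up for illustration; they are not taken from the Lotsawa House files.

```python
import re

def clean_lines(lines):
    """Apply the same filter as clean_txt to an in-memory list of lines."""
    kept = []
    for line in lines:
        new_line = re.sub(r'[^a-zA-Z ]', '', line)
        if new_line.strip():
            kept.append(new_line)
    return kept

# Made-up sample lines, for illustration only
sample = [
    "12\n",                        # page number: nothing survives, line dropped
    "བཀྲ་ཤིས་བདེ་ལེགས།\n",           # Tibetan script: nothing survives, line dropped
    "May all beings be happy!\n",  # English: punctuation and newline stripped
    "   \n",                       # whitespace only: line dropped
]

print(clean_lines(sample))  # → ['May all beings be happy']
```

Because the character class is ASCII-only, accented letters such as `ā` are removed from the lines that are kept.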

## Split text into sentence pairs

Now that we've whittled the text down, we can split it into Tibetan and English sentence pairs. Lotsawa House translations are conveniently laid out line by line: first the Tibetan, then the English translation on the following line.

In [1]:
import os

import spacy
from spacy_language_detection import LanguageDetector

def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp_model = spacy.load("en_core_web_md")
spacy.language.Language.factory("language_detector", func=get_lang_detector)
nlp_model.add_pipe('language_detector', last=True)

def detect(text):
    """Return the language code detected for the text, e.g. 'en'."""
    doc = nlp_model(text)
    return doc._.language['language']

def separate_pairs(file):
    # Collect pairs per file; a shared global list would write the pairs
    # from earlier files again on every call
    pairs = []

    with open(file, 'r', encoding='utf-8') as f:
        text = f.read().splitlines()

    for i in range(len(text) - 1):
        # A non-English line directly followed by an English line is a pair
        if detect(text[i]) != "en" and detect(text[i + 1]) == "en":
            pairs.append(text[i] + ',' + text[i + 1] + '\n')

    with open('/home/j/Documents/Projects/Iron-Bridge/lotsawa/data/lotsawahouse/all-lotsawahouse-pairs.txt', 'a', encoding='utf-8') as f:
        f.writelines(pairs)

path = '/home/j/Documents/Projects/Iron-Bridge/lotsawa/data/lotsawahouse/topic-txts/'

for file in os.listdir(path):
    separate_pairs(os.path.join(path, file))
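
The pairing rule itself (a non-English line immediately followed by an English line) can be tried out without loading spaCy. The sketch below is an illustration under stated assumptions: `pair_adjacent_lines` is a hypothetical helper, `toy_detect` is a crude stand-in for the real language detector, and the sample lines are made up.

```python
def pair_adjacent_lines(lines, detect):
    """Return (non_english, english) tuples for adjacent line pairs."""
    pairs = []
    for current, following in zip(lines, lines[1:]):
        if detect(current) != "en" and detect(following) == "en":
            pairs.append((current, following))
    return pairs

def toy_detect(line):
    """Hypothetical stand-in for the spaCy detector: a line counts as
    English only if it contains one of a few common English words."""
    english_words = {"the", "and", "of", "to", "in", "i"}
    return "en" if english_words & set(line.lower().split()) else "xx"

# Made-up sample lines, for illustration only
lines = [
    "sangye cho dang tsok kyi chok nam la",
    "In the Buddha the Dharma and the Supreme Assembly",
    "changchub bardu dak ni kyab su chi",
    "I take refuge until enlightenment",
]

for tibetan, english in pair_adjacent_lines(lines, toy_detect):
    print(tibetan + ',' + english)
```

Passing the detector in as an argument keeps the pairing logic testable on its own; the notebook's `detect` function could be supplied in place of `toy_detect`.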