# 'billingsmoore/tagged-tibetan-to-english-translation-dataset' Collection

This notebook documents the process used to scrape the dataset ['billingsmoore/tagged-tibetan-to-english-translation-dataset'](https://huggingface.co/datasets/billingsmoore/tagged-tibetan-to-english-translation-dataset).

## Scraping

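The code below was used to scrape the data from [Lotsawa House](https://www.lotsawahouse.org). The first cell extracts a single page and prints the result; the second cell applies the same extraction to every text linked from the topics index.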

In [6]:
import requests
from bs4 import BeautifulSoup

# get html for title
URL = 'https://www.lotsawahouse.org/tibetan-masters/dodrupchen-III/sukhavati-aspiration'
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

# extract the tag names and join them into one string, each prefixed with '|'
cats = soup.find('div', {'class': 'categories'})
tags = cats.find_all('a', {'class': 'tag-circle'})
tags = [tag.contents[0] for tag in tags]
tag_str = ''.join('|' + tag for tag in tags)

print(tags)

# extract the Tibetan, phonetic, and English lines from the html
maintext = soup.find('div', {'id': "maintext"})
tib = maintext.find_all('p', {'class': 'TibetanVerse'})
phon = maintext.find_all('p', {'class': 'EnglishPhonetics'})
en = maintext.find_all('p', {'class': 'EnglishText'})

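if len(tib) == len(phon): # keep the page only if the Tibetan and phonetic line counts match

    # prep pairs from text: drop non-breaking spaces from the Tibetan and commas from the other fields
    # (zip stops at the shortest list, so a page with fewer English lines yields fewer rows)
    tib = [elt.contents[0].replace('\xa0', '') for elt in tib]
    phon = [elt.contents[0].replace(',', '') for elt in phon]
    en = [elt.contents[0].replace(',', '') for elt in en]
    pairs = [(tib_elt + ',' + phon_elt + ',' + en_elt + ',' + tag_str + '\n') for tib_elt, phon_elt, en_elt in zip(tib, phon, en)]

    # print only when pairs was actually built, so a rejected page does not raise a NameError
    print(pairs)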

['Amitābha', 'Aspiration Prayers', 'Sukhāvatī Aspirations', 'Tibetan Masters', 'Dodrupchen Jigme Tenpe Nyima']
['བཅོམ་ལྡན་འདས་འོད་དཔག་མེད་ལ་ཕྱག་འཚལ་ལོ།།,chomdendé öpakmé la chaktsal lo,Homage to the bhagavan Amitābha!,|Amitābha|Aspiration Prayers|Sukhāvatī Aspirations|Tibetan Masters|Dodrupchen Jigme Tenpe Nyima\n', 'ཡང་ཡང་དྲན་ནོ་ཞིང་ཁམས་བདེ་བ་ཅན།།,yang yang dren no zhingkham dewachen,Again and again I reflect on the Sukhāvatī realm.,|Amitābha|Aspiration Prayers|Sukhāvatī Aspirations|Tibetan Masters|Dodrupchen Jigme Tenpe Nyima\n', 'སྙིང་ནས་དྲན་ནོ་འདྲེན་པ་འོད་དཔག་མེད།།,nying né dren no drenpa öpakmé,With all my heart I recollect the guide Amitābha.,|Amitābha|Aspiration Prayers|Sukhāvatī Aspirations|Tibetan Masters|Dodrupchen Jigme Tenpe Nyima\n', 'རྩེ་གཅིག་དྲན་ནོ་རྒྱལ་སྲས་རྒྱ་མཚོའི་འཁོར།།,tsechik dren no gyalsé gyatsö khor,Single-pointedly I recall the oceanic retinue of bodhisattvas.,|Amitābha|Aspiration Prayers|Sukhāvatī Aspirations|Tibetan Masters|Dodrupchen Jigme Tenpe Nyima\n', 'བ
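Each row printed above has four comma-separated fields (Tibetan, phonetics, English, tags), and the tag field is a run of names each prefixed with `|`. As a minimal sketch of how such a row could be read back, the hypothetical helper `parse_row` below splits on the first three commas only. It assumes the Tibetan line itself contains no comma, which the scraping code does not enforce; the function name and dictionary keys are illustrative and are not taken from the dataset.

```python
def parse_row(row: str) -> dict:
    """Split one scraped row into its four fields (hypothetical helper)."""
    # Split on the first three commas only, so commas inside a tag name survive.
    # Assumption: the Tibetan field contains no comma of its own.
    tibetan, phonetics, english, tag_str = row.rstrip('\n').split(',', 3)
    # tag_str looks like '|A|B|C'; the leading '|' yields an empty first item.
    tags = [tag for tag in tag_str.split('|') if tag]
    return {
        'tibetan': tibetan,
        'phonetics': phonetics,
        'english': english,
        'tags': tags,
    }


# First row of the output above
row = ('བཅོམ་ལྡན་འདས་འོད་དཔག་མེད་ལ་ཕྱག་འཚལ་ལོ།།,chomdendé öpakmé la chaktsal lo,'
       'Homage to the bhagavan Amitābha!,'
       '|Amitābha|Aspiration Prayers|Sukhāvatī Aspirations|Tibetan Masters|'
       'Dodrupchen Jigme Tenpe Nyima\n')
parsed = parse_row(row)
print(parsed['english'])
print(parsed['tags'])
```

Splitting with a limit of three keeps the whole tag string intact even when a tag name contains a comma, since only the phonetic and English fields have their commas removed during scraping.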

In [None]:
import requests
from bs4 import BeautifulSoup

URL = "https://www.lotsawahouse.org/topics/"

page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
series = soup.find_all('a', {'class':'index-entry'})

for serie in series:
    URL = 'https://www.lotsawahouse.org/'+serie['href']
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")

    # get the link to each text listed on this topic page
    titles = soup.find_all('a', {'class':'title'})

    for title in titles:
        try:
            # get html for title
            URL = 'https://www.lotsawahouse.org/' + title['href']
            page = requests.get(URL)

            soup = BeautifulSoup(page.content, "html.parser")

            # extract the tag names and join them into one string, each prefixed with '|'
            cats = soup.find('div', {'class': 'categories'})
            tags = cats.find_all('a', {'class': 'tag-circle'})
            tags = [tag.contents[0] for tag in tags]
            tag_str = ''.join('|' + tag for tag in tags)

            print(tags)

            # extract the Tibetan, phonetic, and English lines from the html
            maintext = soup.find('div', {'id': "maintext"})
            tib = maintext.find_all('p', {'class': 'TibetanVerse'})
            phon = maintext.find_all('p', {'class': 'EnglishPhonetics'})
            en = maintext.find_all('p', {'class': 'EnglishText'})

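            if len(tib) == len(phon): # keep the page only if the Tibetan and phonetic line counts match

                # prep "pairs" from text: drop non-breaking spaces from the Tibetan and commas from the other fields
                # (zip stops at the shortest list, so a page with fewer English lines yields fewer rows)
                tib = [elt.contents[0].replace('\xa0', '') for elt in tib]
                phon = [elt.contents[0].replace(',', '') for elt in phon]
                en = [elt.contents[0].replace(',', '') for elt in en]
                pairs = [(tib_elt + ',' + phon_elt + ',' + en_elt + ',' + tag_str + '\n') for tib_elt, phon_elt, en_elt in zip(tib, phon, en)]

                # print inside the if, so a rejected page does not reprint the previous page's pairs
                print(pairs)

        except Exception as err:
            # skip pages that lack the expected structure, but report which one failed;
            # catching Exception rather than everything lets Ctrl+C still stop the crawl
            print('skipped', title.get('href'), err)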
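
The loop above prints each page's rows rather than writing them anywhere. As a minimal sketch of one way the rows could be collected into a single file, the example below uses Python's standard `csv` module. This is an assumption about a possible saving step, not the method used to build the dataset: the function name `write_rows` and the column names in `HEADER` are hypothetical.

```python
import csv
import io

# Hypothetical column names; the published dataset may use different ones.
HEADER = ['tibetan', 'phonetics', 'english', 'tags']


def write_rows(rows, handle):
    """Write (tibetan, phonetics, english, tag_str) tuples to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    # csv.writer quotes any field containing a comma, so the text is kept as-is
    writer.writerows(rows)


# Illustrative rows; the second English line keeps its comma.
rows = [
    ('tib-1', 'yang yang dren no zhingkham dewachen',
     'Again and again I reflect on the Sukhāvatī realm.', '|Amitābha'),
    ('tib-2', 'phon-2', 'A line, with a comma.', '|Amitābha|Aspiration Prayers'),
]

buffer = io.StringIO()  # stands in for a file opened for writing
write_rows(rows, buffer)
csv_text = buffer.getvalue()

# Read the text back to confirm the fields round-trip unchanged
read_back = list(csv.reader(io.StringIO(csv_text)))
print(read_back[2])
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''`, and `encoding='utf-8'` keeps the Tibetan text intact, for example `open('pairs.csv', 'w', newline='', encoding='utf-8')` with a filename of your choosing.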