This notebook parses the text of the AODA (Accessibility for Ontarians with Disabilities Act, 2005) and its Regulation from the links below, using `requests`, `BeautifulSoup`, and some custom functions:

* Act: https://www.ontario.ca/laws/statute/05a11#BK11
* Regulation: https://www.ontario.ca/laws/regulation/110191/v5

In [1]:
import re
import requests
from bs4.element import Tag
from bs4 import BeautifulSoup
from utils import get_headnotes, LAW_REGEX

In [2]:
aoda_url = 'https://www.ontario.ca/laws/statute/05a11#BK11'
reg_url = 'https://www.ontario.ca/laws/regulation/110191/v5'

In [3]:
r = requests.get(aoda_url)
soup = BeautifulSoup(r.text, 'html.parser')
aoda_html = list(soup.children)[-2]

In [4]:
r = requests.get(reg_url)
soup = BeautifulSoup(r.text, 'html.parser')
reg_html = list(soup.children)[-2]

Next, we handle the legal citations (references to statutes and regulations) that trail many provisions in the text.

Examples:
* further defining the persons or organizations that are part of the industry, sector of the economy or class specified by the Minister under clause (a).  `2005, c. 11, s. 8 (2).`
* Except as otherwise provided in this Regulation, this Regulation applies to the Government of Ontario, the Legislative Assembly, every designated public sector organization and to every other person or organization that provides goods, services or facilities to the public or other third parties and that has at least one employee in Ontario.  `O. Reg. 191/11, s. 1 (3).`

The punctuation in these citations can make tokenization harder, so we swap each citation for an ID of the form `REF{n}` (`REF1`, `REF2`, and so on).

**LAW_REGEX**: `(O. ?Reg. ?[0-9]+/[0-9]+)(,?\.? ?[sS](chedule)?\.? [0-9]+)+( \([0-9]+\))?\.|([0-9]{4},) (c. [0-9]+)(, Sched. [A-Z]+)?(, s. [0-9]+)?( \([0-9]+\))?\.`

In [5]:
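# Collect every distinct citation found in either document.
# '\xa0' (non-breaking space) is normalized so the regex's plain spaces match.
texts = [aoda_html.text, reg_html.text]
refs = {m.group() for text in texts
        for m in re.finditer(LAW_REGEX, text.replace('\xa0', ' '))}
# Sort before numbering so the IDs are reproducible across runs.
law_to_ref = {ref: 'REF{}'.format(i) for i, ref in enumerate(sorted(refs), start=1)}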

In [6]:
len(law_to_ref)

491
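
The swap itself can be sketched as follows. This is a minimal illustration, assuming the pattern shown above is the value of `LAW_REGEX` in `utils` (the real constant is imported, not defined in this notebook). The helper name `swap_references` is hypothetical and is not part of `utils`.

```python
import re

# Assumption: copied from the LAW_REGEX pattern shown in this notebook;
# the authoritative value lives in utils.py.
LAW_REGEX = (
    r"(O. ?Reg. ?[0-9]+/[0-9]+)(,?\.? ?[sS](chedule)?\.? [0-9]+)+( \([0-9]+\))?\."
    r"|([0-9]{4},) (c. [0-9]+)(, Sched. [A-Z]+)?(, s. [0-9]+)?( \([0-9]+\))?\."
)


def swap_references(text, law_to_ref):
    """Replace each legal citation in `text` with its ID from `law_to_ref`.

    Hypothetical helper: citations missing from the mapping get the next
    free ID, so the mapping grows as new citations are seen.
    """
    def _to_id(match):
        ref = match.group()
        if ref not in law_to_ref:
            law_to_ref[ref] = 'REF{}'.format(len(law_to_ref) + 1)
        return law_to_ref[ref]

    # Normalize non-breaking spaces first, as the notebook does.
    return re.sub(LAW_REGEX, _to_id, text.replace('\xa0', ' '))


sample = "specified by the Minister under clause (a).  2005, c. 11, s. 8 (2)."
mapping = {}
swapped = swap_references(sample, mapping)
print(swapped)  # specified by the Minister under clause (a).  REF1
```

Passing the mapping in explicitly keeps the IDs consistent when the same citation appears in both the Act and the Regulation.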

In [7]:
headnotes_aoda = get_headnotes(aoda_html, 'AODA')

In [8]:
headnotes_reg = get_headnotes(reg_html, 'REG')

In [9]:
aoda = {**headnotes_aoda, **headnotes_reg}

In [10]:
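import spacy
# 'en' is a shortcut link from spaCy 2.x; on spaCy 3+ load a named model
# such as 'en_core_web_sm' instead.
nlp = spacy.load('en')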

In [11]:
doc = nlp(aoda['AODA Purpose'])

In [12]:
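sentences = []

# Split each section into sentences with spaCy, keeping track of the source section.
for section, text in aoda.items():
    for sent in nlp(text).sents:
        sentences.append({
            'section': section,
            'text': sent.text,  # store the plain string, not the spaCy Span
        })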

In [13]:
len(sentences)

1186

In [14]:
import pandas as pd

In [15]:
# pd.DataFrame(sentences).reset_index().to_csv('../data/sentences.csv', index=False)