## Summary

This notebook is the runnable Python version of `clean_tags.py`.

It strips the HTML tags from the webpages stored in the target directory
and saves the cleaned text to a JSON file.

The email addresses extracted from each page are saved to a second JSON file, which is later used to create the synthetic email dataset.

## Import

In [None]:
import collections
import glob
import json
import os
import re
from pathlib import Path

from bs4 import BeautifulSoup
from tqdm import tqdm

## Clean tags

In [None]:
input = "<scraped_webpages_path>"

In [None]:
output = Path("cleaned_texts")

In [None]:
# create the output directory if it does not exist yet
output.mkdir(parents=True, exist_ok=True)

In [None]:
def remove_tags_and_extract_emails(html):
    # parse html content
    soup = BeautifulSoup(html, "html.parser")

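    # extract emails from the full markup, before <script>/<style> are removed;
    # anything shaped like name@host.tld matches, so non-email strings can slip in
    pattern = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
    emails = re.findall(pattern, str(soup))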

    for data in soup(['style', 'script']):
        # Remove style and script part
        data.decompose()

    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings), emails
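
As a quick check of the email pattern used above, the sketch below runs it on a small hand-written HTML string using only the standard library. The sample markup and the address in it are made up for illustration. Because the search runs over the raw markup, a package reference inside a script URL is captured alongside the real address.

```python
import re

# Same pattern as in remove_tags_and_extract_emails, written as a raw string.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

# Hypothetical sample markup, not taken from the scraped pages.
sample_html = (
    '<a href="mailto:jane.doe@example.org">Email</a>'
    '<script src="https://cdn.example.com/npm/bootstrap@4.6.0/js/b.min.js"></script>'
)

matches = EMAIL_PATTERN.findall(sample_html)
print(matches)  # ['jane.doe@example.org', 'bootstrap@4.6.0']
```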

In [None]:
# cleaned text per politician id (a plain dict is enough; no default factory is needed)
pure_texts = {}
# extracted email candidates per politician id
extracted_emails = {}

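# iterate over the scraped pages; the file name before the first "." is the politician id
for file in tqdm(glob.glob(f"{input}/*.html")):
    # os.path.basename handles both "/" and "\\" path separators
    politician_id = os.path.basename(file).split(".")[0]
    with open(file, "r") as f:
        html = f.read()
    texts, emails = remove_tags_and_extract_emails(html)
    pure_texts[politician_id] = texts
    extracted_emails[politician_id] = emails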

100%|██████████| 1469/1469 [03:38<00:00,  6.71it/s]


In [None]:
# save the cleaned webpages
with open(output / "pure_texts.json", "w") as f:
    json.dump(pure_texts, f)

In [None]:
# save the extracted emails
with open(output / 'extracted_emails.json', "w") as f:
    json.dump(extracted_emails, f)
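
To confirm that a saved file can be read back, a round-trip check along the following lines may help. This is a minimal sketch: `sample_texts` is a hypothetical miniature of `pure_texts`, and a temporary directory stands in for the real output folder. Passing `ensure_ascii=False` is an optional choice (the cells above use the default), which keeps accented characters readable in the file instead of writing them as `\uXXXX` escapes.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical miniature of pure_texts; the real dictionary is built in the loop above.
sample_texts = {"339820": "Visit Parliament Visit Français"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pure_texts.json"
    # ensure_ascii=False writes "ç" as-is rather than as a \u00e7 escape.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_texts, f, ensure_ascii=False)
    raw_file = path.read_text(encoding="utf-8")
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == sample_texts)  # True
```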

In [None]:
list(pure_texts.items())[0]

('339820',
 "Senator Percy Mockler Opens in a new window Parliament of Canada Visit Parliament Visit Français Fr Search Contact Us Facebook Twitter Instragram YouTube Linked in Search About the Senate About the Senate - Home Senate of Canada Building Publications Photo Gallery Art & Architecture Transparency & Accountability Careers Procedural References Administration & Support Accessibility at the Senate Parliamentary Diplomacy Visit the Senate Everything you need to know to plan your trip. eNewsletter Learn how the Senate represents you by subscribing to our eNewsletter. Page Program Learn about the important role these young people play in the Senate. SENgage Senators engaging youth. Senators In the Chamber In the Chamber - Home Order Paper and Notice Paper Journals of the Senate Debates of the Senate (Hansard) Votes Procedural References LEGISinfo Watch & Listen Bills Before Parliament See what bills are being debated on Parliament Hill. Speaker of the Senate Learn about the Speak

In [None]:
list(extracted_emails.items())[0]

('339820',
 ['percy.mockler@sen.parl.gc.ca',
  'percy.mockler@sen.parl.gc.ca',
  'popper.js@1.16.1',
  'bootstrap@4.6.0'])
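
The output above contains a duplicate address and two strings (`popper.js@1.16.1`, `bootstrap@4.6.0`) that look like package versions rather than emails. One possible post-processing step, which is not part of the original `clean_tags.py`, is to deduplicate the matches and drop those whose last domain label is not purely alphabetic. The helper name `filter_emails` is hypothetical, and the rule is only a heuristic: it would also drop a match that ends in a stray `.` or `-`.

```python
def filter_emails(candidates):
    """Deduplicate matches and drop ones whose last domain label is not alphabetic."""
    seen = set()
    kept = []
    for candidate in candidates:
        address = candidate.lower()
        domain = address.rsplit("@", 1)[-1]
        last_label = domain.rsplit(".", 1)[-1]
        # Heuristic: common top-level domains are letters only; version numbers are not.
        if not last_label.isalpha():
            continue
        if address not in seen:
            seen.add(address)
            kept.append(address)
    return kept


raw = [
    "percy.mockler@sen.parl.gc.ca",
    "percy.mockler@sen.parl.gc.ca",
    "popper.js@1.16.1",
    "bootstrap@4.6.0",
]
cleaned = filter_emails(raw)
print(cleaned)  # ['percy.mockler@sen.parl.gc.ca']
```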