## Summary

This notebook is the runnable Python version of `clean_tags.py`.

It strips the HTML tags from the webpages stored in the target directory
and saves the cleaned text to a JSON file.

The email addresses extracted along the way are saved to a second JSON file, which is later used to create the reformatted email dataset.

## Import

In [8]:
import os
import re
import json
import collections
import glob
from tqdm import tqdm
from bs4 import BeautifulSoup
from pathlib import Path

## Clean tags

In [10]:
input = "/content/drive/MyDrive/test_results_12062022/scraped_pages/"

In [9]:
output = Path("./")

In [5]:
def remove_tags_and_extract_emails(html):
    # parse html content
    soup = BeautifulSoup(html, "html.parser")

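    # extract emails; a raw string keeps "\." from being read as a string escape.
    # The pattern runs over the full markup, before scripts are removed, so
    # non-email strings such as package version specifiers can also match.
    pattern = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
    emails = re.findall(pattern, str(soup))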
  
    for data in soup(['style', 'script']):
        # Remove style and script part
        data.decompose()
  
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings), emails
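
The email pattern used in the function above can be tried on its own with only the standard library. The snippet below is a minimal sketch: the HTML string, the address `jane.doe@example.org`, and the CDN URL are made up for illustration. It shows that the pattern matches genuine addresses but also version specifiers of the form `name@x.y.z` that appear in script URLs.

```python
import re

# Same pattern as in remove_tags_and_extract_emails, written as a raw string.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

# Hypothetical page fragment; the address and the URL are illustrative only.
sample_html = (
    '<a href="mailto:jane.doe@example.org">Email</a>'
    '<script src="https://cdn.example.net/npm/lozad@1.9.0/dist/lozad.min.js"></script>'
)

matches = EMAIL_PATTERN.findall(sample_html)
print(matches)  # ['jane.doe@example.org', 'lozad@1.9.0']
```

Because matching runs on the full markup rather than on the visible text, results like the second one are to be expected and may need filtering downstream.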

In [7]:
# cleaned page text, keyed by politician ID
# (a defaultdict without a default factory behaves like a plain dict)
cleaned_webpages = {}
# extracted email addresses, keyed by politician ID
extracted_emails = {}

# clean the webpages
for file in tqdm(glob.glob(f"{input}/*.html")):
    politician_id = file.split("/")[-1].split(".")[0]
    with open(file, "r") as f:
        html = f.read()
        content, emails = remove_tags_and_extract_emails(html)
        cleaned_webpages[politician_id] = content
        extracted_emails[politician_id] = emails

100%|██████████| 1181/1181 [03:24<00:00,  5.78it/s]


In [11]:
# save the cleaned webpages
with open(output/"pure_text.json", "w") as f:
    json.dump(cleaned_webpages, f)

In [13]:
# save the extracted emails
with open(output/'extracted_emails.json', "w") as f:
    json.dump(extracted_emails, f)
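
A quick way to confirm that a saved file can be read back is a round trip through `json.load`. The sketch below uses a temporary directory and a one-entry sample dictionary instead of the real output files, so the names `sample_emails` and `reloaded` are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Illustrative data only; the real dictionaries come from the cleaning loop.
sample_emails = {"359968": ["eric.garcia@thorntonco.gov"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "extracted_emails.json"
    # Write the dictionary the same way the notebook does.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_emails, f)
    # Read it back and compare with the in-memory version.
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == sample_emails)
```

The comparison holds here because the keys are already strings; JSON object keys are always strings, so integer IDs would come back as `str` after reloading.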

In [14]:
list(cleaned_webpages.items())[0]

('359968',
 "Ward 1 Members You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page. Menu Business Business in Thornton Economic development, retail operations, and commercial and residential building all work as a single entity here in Thornton. Businesses in Thornton There is a tremendous amount of opportunity within the City of Thornton for existing businesses to thrive and grow. Our elected officials and City staff dedicate a substantial amount of our efforts to creating an atmosphere that provides businesses located in Thornton with every possible advantage. Business Licenses Starting a business? Licensing is an important step toward doing business in the city. This page presents licensing information and all of the forms needed to get started in one place. All persons engaged in business in the City are required to have a City Sales and Use Tax Business License. Contracts & Purchasing Information on the city's Contrac

In [14]:
list(extracted_emails.items())[0]

('359968',
 ['eric.garcia@thorntonco.gov',
  'kathy.henson@thorntonco.gov',
  'eric.garcia@thorntonco.gov',
  'eric.garcia@thorntonco.gov',
  'kathy.henson@thorntonco.gov',
  'kathy.henson@thorntonco.gov',
  'lozad@1.9.0'])
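
The output above contains repeated addresses and one false positive (`lozad@1.9.0`). One possible clean-up step, which is not part of `clean_tags.py`, is sketched below. The helper `dedupe_emails` is hypothetical, and the filter rests on the assumption that a plausible address ends in an alphabetic top-level domain.

```python
import re

# Assumption: a plausible email address ends in a dot followed by letters only.
TLD_PATTERN = re.compile(r"\.[a-zA-Z]{2,}$")

def dedupe_emails(emails):
    """Drop case-insensitive duplicates and entries without an alphabetic TLD,
    keeping the order of first appearance. Hypothetical helper for illustration."""
    seen = set()
    unique = []
    for email in emails:
        key = email.lower()
        if key in seen or not TLD_PATTERN.search(email):
            continue
        seen.add(key)
        unique.append(email)
    return unique

# Values copied from the notebook output for politician '359968'.
sample = [
    'eric.garcia@thorntonco.gov',
    'kathy.henson@thorntonco.gov',
    'eric.garcia@thorntonco.gov',
    'eric.garcia@thorntonco.gov',
    'kathy.henson@thorntonco.gov',
    'kathy.henson@thorntonco.gov',
    'lozad@1.9.0',
]

cleaned = dedupe_emails(sample)
print(cleaned)  # ['eric.garcia@thorntonco.gov', 'kathy.henson@thorntonco.gov']
```

The TLD check is a heuristic and will not catch every non-email match; it only removes strings whose final label is numeric or shorter than two letters.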