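This notebook visits each query URL and extracts the two BoardDocs page elements (`SiteTitle1` and `SiteTitle2`) that should correspond to the school district name and address. The output is a sample of 1,000 successfully scraped rows in which the address field contains a 5-digit ZIP code.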

Input
- `boarddocs_url_cleaned.csv`

Output
- `sample_deliverable_1.csv`

In [148]:
import pandas as pd
import numpy as np

input_filename = "boarddocs_url_cleaned.csv"
boarddocs_df = pd.read_csv(input_filename)

In [149]:
unique_urls = list(set(boarddocs_df["query_url"]))
f"Number of unique URLs: {len(unique_urls)}"

'Number of unique URLs: 3906'
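
The scraping log further down reports `Invalid URL 'nan'`, which suggests that at least one `query_url` value is missing and ends up in `unique_urls` as a float NaN. A minimal sketch of one way to filter such entries before scraping, assuming every usable URL is a string that starts with `http://` or `https://`. The helper name `keep_valid_urls` is hypothetical and not part of the original notebook.

```python
def keep_valid_urls(urls):
    """Return only the entries that are strings with an http(s) scheme.

    The result is sorted so the order is stable between runs.
    """
    return sorted(
        u for u in urls
        if isinstance(u, str) and u.startswith(("http://", "https://"))
    )

# float("nan") stands in for a missing query_url value as read by pandas
candidates = {"https://go.boarddocs.com/mi/sjs/Board.nsf/Public", float("nan")}
print(keep_valid_urls(candidates))
```

Applying this to `unique_urls` would reduce the count reported above by the number of non-URL entries.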

In [150]:
# optional filter if we want to test out the scraping
# unique_urls = np.random.choice(unique_urls,size=(1),replace=False)

In [151]:
import requests
from bs4 import BeautifulSoup

def get_site_titles(url, timeout=10):
    try:
        # Add a timeout to the request
        response = requests.get(url, timeout=timeout)

        # Check if the response status code is 200 (OK)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract the site titles
            site_title1 = soup.find(id='SiteTitle1')
            site_title2 = soup.find(id='SiteTitle2')
            navigation_link = soup.find(id='btn-home-nav')
            return (
                site_title1.text if site_title1 else None,
                site_title2.text if site_title2 else None,
                navigation_link.get("href") if navigation_link else None,
            )
        else:
            # Handle non-200 responses
            return None, None, None
    except requests.exceptions.Timeout:
        # Handle timeout specifically
        print(f"Timeout occurred while fetching {url}")
        return None, None, None
    except requests.exceptions.RequestException as e:
        # Handle other request exceptions
        print(f"Error occurred while fetching {url}: {e}")
        return None, None, None

In [152]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# Function to fetch site titles for a given URL
def fetch_site_titles(url):
    return url, get_site_titles(url)

# Store the results in a dict
results = dict()
with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_url = {executor.submit(fetch_site_titles, url): url for url in unique_urls}
    for future in tqdm(as_completed(future_to_url), total=len(unique_urls)):
        url, titles = future.result()
        results[url] = titles


 33%|███▎      | 1293/3906 [01:38<03:04, 14.17it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/yccd/Board.nsf/Public


 34%|███▍      | 1319/3906 [01:41<03:10, 13.61it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/pasadena/Board.nsf/Public


 86%|████████▌ | 3364/3906 [04:20<00:46, 11.58it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/scccd/Board.nsf/Public


 91%|█████████ | 3556/3906 [04:37<00:27, 12.82it/s]

Error occurred while fetching nan: Invalid URL 'nan': No scheme supplied. Perhaps you meant https://nan?


 99%|█████████▉| 3862/3906 [05:03<00:04,  9.56it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/vcccd/Board.nsf/Public


100%|██████████| 3906/3906 [05:13<00:00, 12.47it/s]


In [153]:
# inspect the results
list(results.items())[:5]

[('https://go.boarddocs.com/mi/sjs/Board.nsf/Public',
  ('Board Policies and Guidelines',
   'St. Joseph Public Schools',
   'https://www.sjschools.org/')),
 ('https://go.boarddocs.com/pa/cali/Board.nsf/Public',
  ('School Board Policy Manual', '', 'www.calsd.org')),
 ('https://go.boarddocs.com/oh/mapleheights/Board.nsf/Public',
  ('Maple Heights City Schools',
   '5740 Lawn Avenue | Maple Heights, OH 44137 | 216-587-6100',
   'http://www.mapleschools.com')),
 ('https://go.boarddocs.com/oh/rlsd/Board.nsf/Public',
  ('585 Riverside Drive | Painesville, Ohio 44077 | 440.352.0668 | f 440.639.1959',
   'Riverside Local School District ',
   'https://www.riversidelocalschools.com/')),
 ('https://go.boarddocs.com/pa/shun/Board.nsf/Public',
  ('School Board Policy Manual',
   'Southern Huntingdon County School District',
   'http://www.shcsd.org'))]
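
The third element is not always a full URL: `www.calsd.org` in the output above has no scheme. A sketch of one possible normalization, assuming a scheme-less value should default to `https://` (an assumption, since some sites may only serve `http`). The helper `normalize_website` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit

def normalize_website(href, default_scheme="https"):
    """Prefix a scheme when href lacks http(s); return None for empty values."""
    if not href or not href.strip():
        return None
    href = href.strip()
    # Values that already carry an http(s) scheme are returned unchanged
    if urlsplit(href).scheme in ("http", "https"):
        return href
    return f"{default_scheme}://{href}"

print(normalize_website("www.calsd.org"))
print(normalize_website("http://www.shcsd.org"))
```

Non-web values such as `mailto:` links are not handled by this sketch and would need a separate check.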

In [154]:
len(results)

3906

In [169]:
# rerun for those with errors
# keep rerunning this cell until there are no more timeouts
# feel free to change the timeout

error_urls = [url for url, result in results.items() if result == (None, None, None)]
for url in tqdm(error_urls):
    results[url] = get_site_titles(url, timeout=40)

 62%|██████▎   | 5/8 [00:38<00:28,  9.40s/it]

Error occurred while fetching nan: Invalid URL 'nan': No scheme supplied. Perhaps you meant https://nan?


100%|██████████| 8/8 [00:52<00:00,  6.53s/it]
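
The manual rerun above can also be written as a bounded retry, so the cell does not have to be re-executed by hand. A sketch, assuming a `fetch` callable with the same return shape as `get_site_titles` (a 3-tuple that is all `None` on failure). The names `retry_failed`, `max_rounds`, and the stub `flaky_fetch` are hypothetical.

```python
FAILED = (None, None, None)

def retry_failed(results, fetch, max_rounds=3):
    """Re-fetch every URL whose result is FAILED, for at most max_rounds passes.

    Mutates and returns `results`. Stops early once nothing is left to retry.
    """
    for _ in range(max_rounds):
        failed = [url for url, res in results.items() if res == FAILED]
        if not failed:
            break
        for url in failed:
            results[url] = fetch(url)
    return results

# Stub standing in for get_site_titles: fails on the first call, then succeeds
calls = {}
def flaky_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    return FAILED if calls[url] < 2 else ("Title", "District", "https://example.org")

demo = {"https://example.org/a": FAILED, "https://example.org/b": ("t1", "t2", "h")}
retry_failed(demo, flaky_fetch)
print(demo["https://example.org/a"])
```

With the notebook's own function, the call might look like `retry_failed(results, lambda u: get_site_titles(u, timeout=40))`. Permanent failures such as 404 responses stay as `(None, None, None)` after the last round.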


In [170]:
error_urls = [url for url, result in results.items() if result == (None, None, None)]
error_urls

['https://go.boarddocs.com/il/tfd215/Board.nsf/Public',
 'https://go.boarddocs.com/mi/oxf/Board.nsf/Public',
 nan,
 'https://go.boarddocs.com/mi/wpas/Board.nsf/Public']

In [171]:
for url in error_urls:
    try:
        # a timeout keeps a stalled connection from hanging the cell
        response = requests.get(url, timeout=40)
        print(f"{response.status_code} {url}")
    except Exception as e:
        print(f"{e} {url}")

404 https://go.boarddocs.com/il/tfd215/Board.nsf/Public
404 https://go.boarddocs.com/mi/oxf/Board.nsf/Public
Invalid URL 'nan': No scheme supplied. Perhaps you meant https://nan? nan
404 https://go.boarddocs.com/mi/wpas/Board.nsf/Public


In [172]:
# I searched Google for the original query strings together with "boarddocs".
# Google does return the 404 links listed above, except for il/tfd215.
# For that row, the query string returned this URL instead:
# "https://go.boarddocs.com/co/adams12/Board.nsf/Public"
# Original data for that row:
#   Thornton School District
#   NEW HAMPSHIRE
# Either way, both URLs are wrong for this row,
# and the adams12 URL is already assigned to another school in the dataset.

"https://go.boarddocs.com/co/adams12/Board.nsf/Public" in unique_urls

True

In [174]:
# create a sample for Tom first
# get those with postcodes

# Create a DataFrame from the results dictionary
results_df = pd.DataFrame.from_dict(results, orient='index', columns=['title_1', 'title_2', 'home_website']).reset_index()
results_df.rename(columns={'index': 'URL'}, inplace=True)
results_df.head()

Unnamed: 0,URL,title_1,title_2,home_website
0,https://go.boarddocs.com/mi/sjs/Board.nsf/Public,Board Policies and Guidelines,St. Joseph Public Schools,https://www.sjschools.org/
1,https://go.boarddocs.com/pa/cali/Board.nsf/Public,School Board Policy Manual,,www.calsd.org
2,https://go.boarddocs.com/oh/mapleheights/Board...,Maple Heights City Schools,"5740 Lawn Avenue | Maple Heights, OH 44137 | 2...",http://www.mapleschools.com
3,https://go.boarddocs.com/oh/rlsd/Board.nsf/Public,"585 Riverside Drive | Painesville, Ohio 44077 ...",Riverside Local School District,https://www.riversidelocalschools.com/
4,https://go.boarddocs.com/pa/shun/Board.nsf/Public,School Board Policy Manual,Southern Huntingdon County School District,http://www.shcsd.org


In [175]:
# save all the results from scraping
results_df.to_csv('prelim_results.csv', index=False)

In [176]:
# Define a regex pattern to match a 5-digit zip code
zip_code_pattern = r'\b\d{5}\b'

# Find rows where title_1 has a 5-digit zip code
title1_zip_count = results_df['title_1'].str.contains(zip_code_pattern, na=False).sum()

# Find rows where title_2 has a 5-digit zip code
title2_zip_count = results_df['title_2'].str.contains(zip_code_pattern, na=False).sum()

title1_zip_count, title2_zip_count

(np.int64(1440), np.int64(1442))
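
The sample output earlier shows the address in `title_2` for Maple Heights and in `title_1` for Riverside, so the filter on `title_1` in the next cell skips rows of the first kind. A sketch of one way to assign the two fields by content instead of position, assuming the field that contains a 5-digit ZIP code is the address. The helper `split_district_address` is hypothetical and not part of the original notebook.

```python
import re

ZIP_RE = re.compile(r"\b\d{5}\b")  # same pattern as zip_code_pattern

def split_district_address(title_1, title_2):
    """Return (school_district, address), or None when the pair is ambiguous.

    Ambiguous means that neither title or both titles contain a 5-digit ZIP code.
    """
    t1, t2 = (title_1 or "").strip(), (title_2 or "").strip()
    zip_in_1, zip_in_2 = bool(ZIP_RE.search(t1)), bool(ZIP_RE.search(t2))
    if zip_in_1 == zip_in_2:
        return None
    return (t2, t1) if zip_in_1 else (t1, t2)

print(split_district_address(
    "Maple Heights City Schools",
    "5740 Lawn Avenue | Maple Heights, OH 44137 | 216-587-6100",
))
```

Rows that return `None` (for example a policy-manual title paired with an empty string) would still need manual review.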

In [177]:
# Filter rows where title_1 has a 5-digit zip code
rows_with_zip_in_title1 = results_df[results_df['title_1'].str.contains(zip_code_pattern, na=False)]
rows_with_zip_in_title1

Unnamed: 0,URL,title_1,title_2,home_website
3,https://go.boarddocs.com/oh/rlsd/Board.nsf/Public,"585 Riverside Drive | Painesville, Ohio 44077 ...",Riverside Local School District,https://www.riversidelocalschools.com/
5,https://go.boarddocs.com/de/sussexvt/Board.nsf...,17099 County Seat Hwy | Georgetown DE 19947 | ...,Sussex Technical School District,http://www.sussexvt.org
7,https://go.boarddocs.com/ca/mbusd/Board.nsf/Pu...,"325 S. Peck Ave | Manhattan Beach, CA 90266 | ...",Manhattan Beach Unified School District,http://www.mbusd.org
12,https://go.boarddocs.com/oh/pcc/Board.nsf/Public,"9301 Buck Rd. | Perrysburg, Ohio 43551 | High ...",Penta Career Center,http://www.pentacareercenter.org/
14,https://go.boarddocs.com/wa/msdwa/Board.nsf/Pu...,"214 West Laurel Road | Bellingham, WA 98226 | ...",Meridian School District,http://www.meridian.wednet.edu/
...,...,...,...,...
3895,https://go.boarddocs.com/pa/iu05/Board.nsf/Public,"252 Waterford St. | Edinboro, PA 16412 | (814...",Northwest Tri-County Intermediate Unit 5,http://www.iu5.org/
3896,https://go.boarddocs.com/oh/mevs/Board.nsf/Public,"212 Chestnut Street, Marysville, Ohio 43040 | ...",Marysville Exempted Village Schools,http://www.marysville.k12.oh.us/site/
3897,https://go.boarddocs.com/co/jeffco/Board.nsf/P...,1829 Denver West Drive | Golden. CO 80401 | (3...,Jeffco Public Schools Board of Education,http://www.jeffcopublicschools.org/
3902,https://go.boarddocs.com/il/cowil/Board.nsf/Pu...,"100 N. Martin Luther King Jr. Ave. | Waukegan,...",City of Waukegan,http://www.waukeganil.gov/


In [178]:
# rename() without inplace=True returns a new DataFrame, which avoids the
# SettingWithCopyWarning raised when renaming a filtered slice in place
rows_with_zip_in_title1 = rows_with_zip_in_title1.rename(columns={'title_1': 'address', 'title_2': 'school_district'})
rows_with_zip_in_title1.head()

Unnamed: 0,URL,address,school_district,home_website
3,https://go.boarddocs.com/oh/rlsd/Board.nsf/Public,"585 Riverside Drive | Painesville, Ohio 44077 ...",Riverside Local School District,https://www.riversidelocalschools.com/
5,https://go.boarddocs.com/de/sussexvt/Board.nsf...,17099 County Seat Hwy | Georgetown DE 19947 | ...,Sussex Technical School District,http://www.sussexvt.org
7,https://go.boarddocs.com/ca/mbusd/Board.nsf/Pu...,"325 S. Peck Ave | Manhattan Beach, CA 90266 | ...",Manhattan Beach Unified School District,http://www.mbusd.org
12,https://go.boarddocs.com/oh/pcc/Board.nsf/Public,"9301 Buck Rd. | Perrysburg, Ohio 43551 | High ...",Penta Career Center,http://www.pentacareercenter.org/
14,https://go.boarddocs.com/wa/msdwa/Board.nsf/Pu...,"214 West Laurel Road | Bellingham, WA 98226 | ...",Meridian School District,http://www.meridian.wednet.edu/


In [179]:
rows_with_zip_in_title1.shape

(1440, 4)

In [180]:
rows_with_zip_in_title1 = rows_with_zip_in_title1.head(1000)

In [181]:
rows_with_zip_in_title1.shape

(1000, 4)

In [182]:
rows_with_zip_in_title1.to_csv("sample_deliverable_1.csv", index=False)
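
The `address` strings in the sample are pipe-delimited, and the first 5-digit token is not always the ZIP code: `17099 County Seat Hwy | Georgetown DE 19947 | ...` starts with a 5-digit street number. A sketch of extracting the ZIP code for downstream matching, assuming the ZIP (optionally in ZIP+4 form) ends one of the `|`-separated segments. The helper `extract_zip` is hypothetical and not part of the original notebook.

```python
import re

# Assumption: the ZIP code (optionally ZIP+4) ends one of the "|"-separated segments
ZIP_AT_END = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")

def extract_zip(address):
    """Return the first 5-digit ZIP that ends a pipe-delimited segment, else None."""
    for segment in address.split("|"):
        match = ZIP_AT_END.search(segment.strip())
        if match:
            return match.group(1)
    return None

print(extract_zip("5740 Lawn Avenue | Maple Heights, OH 44137 | 216-587-6100"))
print(extract_zip("17099 County Seat Hwy | Georgetown DE 19947"))
```

Addresses that do not follow this layout return `None` and would need a different rule.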