This notebook visits each query URL and extracts the two BoardDocs page elements (`SiteTitle1` and `SiteTitle2`) that should correspond to the school district name and address. The output is a subset of 1,000 successfully scraped rows.

Input
- `boarddocs_url_cleaned.csv`

Output
- `sample_deliverable_1.csv`

In [18]:
import pandas as pd

input_filename = "boarddocs_url_cleaned.csv"
boarddocs_df = pd.read_csv(input_filename)

In [19]:
unique_urls = list(set(boarddocs_df["query_url"]))
f"Number of unique URLs: {len(unique_urls)}"

'Number of unique URLs: 3523'
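
The `query_url` column evidently contains at least one missing value, since the fetch log further down reports an invalid URL `nan`. A minimal sketch of de-duplicating while dropping missing entries follows. `demo_df` and its example.org URLs are made-up illustration data, not the real input file.

```python
import pandas as pd

# Hypothetical stand-in for boarddocs_df; the real CSV is not reproduced here.
demo_df = pd.DataFrame({"query_url": [
    "https://example.org/a/Board.nsf/Public",
    "https://example.org/a/Board.nsf/Public",
    None,
    "https://example.org/b/Board.nsf/Public",
]})

# dropna() removes missing values before unique() de-duplicates.
# unique() keeps first-seen order, whereas set() gives no ordering guarantee.
clean_urls = demo_df["query_url"].dropna().unique().tolist()
print(clean_urls)
```

Applied to the real column, this would yield one fewer URL than the count above if `nan` is the only missing value.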

In [20]:
import requests
from bs4 import BeautifulSoup

def get_site_titles(url, timeout=10):
    try:
        # Add a timeout to the request
        response = requests.get(url, timeout=timeout)

        # Check if the response status code is 200 (OK)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract the site titles
            site_title1 = soup.find(id='SiteTitle1')
            site_title2 = soup.find(id='SiteTitle2')
            return (
                site_title1.text if site_title1 else None,
                site_title2.text if site_title2 else None
            )
        else:
            # Handle non-200 responses
            return None, None
    except requests.exceptions.Timeout:
        # Handle timeout specifically
        print(f"Timeout occurred while fetching {url}")
        return None, None
    except requests.exceptions.RequestException as e:
        # Handle other request exceptions
        print(f"Error occurred while fetching {url}: {e}")
        return None, None
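
The parsing step of `get_site_titles` can be checked without any network access. The sketch below isolates it in a hypothetical helper, `parse_site_titles`, and runs it on a made-up HTML snippet; the real BoardDocs markup may differ. It uses `get_text(strip=True)` instead of `.text` so that surrounding whitespace is trimmed, which is an optional change rather than something the notebook does.

```python
from bs4 import BeautifulSoup

def parse_site_titles(html):
    """Return the text of the SiteTitle1 and SiteTitle2 elements, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    title1 = soup.find(id="SiteTitle1")
    title2 = soup.find(id="SiteTitle2")
    return (
        title1.get_text(strip=True) if title1 else None,
        title2.get_text(strip=True) if title2 else None,
    )

# Made-up markup for illustration only.
sample_html = """
<div id="SiteTitle1"> Example School District </div>
<div id="SiteTitle2">1 Main Street | Springfield, XX 00000</div>
"""
print(parse_site_titles(sample_html))
print(parse_site_titles("<p>no titles here</p>"))
```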

In [21]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# Function to fetch site titles for a given URL
def fetch_site_titles(url):
    return url, get_site_titles(url)

# Store the results in a dict
results = dict()
with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_url = {executor.submit(fetch_site_titles, url): url for url in unique_urls}
    for future in tqdm(as_completed(future_to_url), total=len(unique_urls)):
        url, titles = future.result()
        results[url] = titles


  4%|▍         | 134/3523 [00:09<04:54, 11.50it/s]

Error occurred while fetching nan: Invalid URL 'nan': No scheme supplied. Perhaps you meant https://nan?


 46%|████▌     | 1619/3523 [01:58<02:17, 13.84it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/cabrillo/Board.nsf/Public


 52%|█████▏    | 1834/3523 [02:14<02:25, 11.64it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/yccd/Board.nsf/Public


 55%|█████▍    | 1937/3523 [02:24<01:57, 13.53it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/gjccd/Board.nsf/Public


 55%|█████▌    | 1955/3523 [02:28<05:48,  4.50it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/vcccd/Board.nsf/Public


 57%|█████▋    | 2023/3523 [02:34<01:55, 12.94it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/pasadena/Board.nsf/Public


 59%|█████▉    | 2071/3523 [02:41<02:15, 10.70it/s]

Timeout occurred while fetching https://go.boarddocs.com/ca/scccd/Board.nsf/Public


 77%|███████▋  | 2696/3523 [03:34<01:14, 11.07it/s]

Timeout occurred while fetching https://go.boarddocs.com/md/stmarysco/Board.nsf/Public


100%|██████████| 3523/3523 [04:47<00:00, 12.27it/s]
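
`get_site_titles` only catches `requests` exceptions, so any other exception raised inside a worker would surface at `future.result()` and stop the loop. The sketch below shows one way to guard against that: look the URL up through `future_to_url` and catch failures per future. `fake_fetch` and `fetch_all` are hypothetical names, and `fake_fetch` is a stand-in that makes no network calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(url):
    # Hypothetical stand-in for get_site_titles; simulates one failing URL.
    if url == "bad":
        raise ValueError("simulated failure")
    return (f"title for {url}", None)

def fetch_all(urls, fetch, max_workers=4):
    """Run fetch(url) concurrently and collect one result per URL."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]  # recover the URL even if the call failed
            try:
                results[url] = future.result()
            except Exception as exc:
                print(f"{url} failed: {exc}")
                results[url] = (None, None)
    return results

demo_results = fetch_all(["a", "b", "bad"], fake_fetch)
print(demo_results)
```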


In [22]:
len(results)

3523

In [23]:
# Retry the URLs that returned no titles.
# Rerun this cell until no timeouts remain;
# increase the timeout if needed.

error_urls = [url for url, result in results.items() if result == (None, None)]
for url in tqdm(error_urls):
    results[url] = get_site_titles(url, timeout=40)

 27%|██▋       | 3/11 [00:00<00:00, 25.81it/s]

Error occurred while fetching nan: Invalid URL 'nan': No scheme supplied. Perhaps you meant https://nan?


100%|██████████| 11/11 [01:11<00:00,  6.46s/it]
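
Rerunning the cell by hand works, but the same idea can be written as a bounded loop. The sketch below is one possible version under stated assumptions: `retry_failures` is a hypothetical helper, the timeout schedule is arbitrary, and `fake_fetch` simulates a slow site instead of calling `get_site_titles`. It skips non-string keys such as `nan`, which can never succeed. Note that a page that genuinely lacks both elements also yields `(None, None)` and would be retried needlessly.

```python
def retry_failures(results, fetch, timeouts=(20, 40, 80)):
    """Re-fetch entries whose value is (None, None), once per timeout value."""
    for timeout in timeouts:
        pending = [
            url for url, titles in results.items()
            if titles == (None, None) and isinstance(url, str)
        ]
        if not pending:
            break
        for url in pending:
            results[url] = fetch(url, timeout=timeout)
    return results

calls = []

def fake_fetch(url, timeout):
    # Hypothetical stand-in: "slow" succeeds only with a generous timeout.
    calls.append((url, timeout))
    if url == "slow" and timeout >= 40:
        return ("Slow District", None)
    return (None, None)

demo = {
    "ok": ("A", "B"),
    "slow": (None, None),
    "gone": (None, None),
    float("nan"): (None, None),
}
retry_failures(demo, fake_fetch)
print(demo["slow"], demo["gone"], len(calls))
```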


In [24]:
error_urls = [url for url, result in results.items() if result == (None, None)]
error_urls

[nan,
 'https://go.boarddocs.com/mi/wpas/Board.nsf/Public',
 'https://go.boarddocs.com/il/tfd215/Board.nsf/Public',
 'https://go.boarddocs.com/mi/oxf/Board.nsf/Public']

In [27]:
for url in error_urls:
    try:
        # Set a timeout so a stalled connection cannot hang the loop
        response = requests.get(url, timeout=40)
        print(f"{response.status_code} {url}")
    except Exception as e:
        print(f"{e} {url}")

Invalid URL 'nan': No scheme supplied. Perhaps you meant https://nan? nan
404 https://go.boarddocs.com/mi/wpas/Board.nsf/Public
404 https://go.boarddocs.com/il/tfd215/Board.nsf/Public
404 https://go.boarddocs.com/mi/oxf/Board.nsf/Public


In [28]:
# I looked into re-running the original queries with "boarddocs" on Google.
# The 404 links above do appear in the Google results,
# except for il/tfd215.
# For that row the query returned this URL instead:
# "https://go.boarddocs.com/co/adams12/Board.nsf/Public"
# Original data for that row:
#   Thornton School District
#   NEW HAMPSHIRE
# Either way, both URLs are wrong for this row,
# and the adams12 URL is already assigned to another school (checked below).

"https://go.boarddocs.com/co/adams12/Board.nsf/Public" in unique_urls

True

In [36]:
# Build a first sample for Tom,
# keeping only the rows that contain a ZIP code

import re

# Create a DataFrame from the results dictionary
results_df = pd.DataFrame.from_dict(results, orient='index', columns=['Title1', 'Title2']).reset_index()
results_df.rename(columns={'index': 'URL'}, inplace=True)
results_df.head()

Unnamed: 0,URL,Title1,Title2
0,https://go.boarddocs.com/in/elkh/Board.nsf/Public,Elkhart Community Schools,NEOLA Board Policies
1,https://go.boarddocs.com/oh/copleyfairlawn/Boa...,Copley-Fairlawn City Schools,"3797 Ridgewood Road | Copley, OH 44321 | 330-6..."
2,https://go.boarddocs.com/il/msd60/Board.nsf/Pu...,eGovernance Site,"Meetings, Agendas and Information"
3,https://go.boarddocs.com/wi/cbus/Board.nsf/Public,Columbus School District,"200 W. School Street, Columbus WI 53925 | (92..."
4,https://go.boarddocs.com/fl/taylor/Board.nsf/P...,"318 North Clark Street | Perry, FL 32347 | 850...",Taylor County School District


In [37]:
results_df.to_csv('prelim_results.csv', index=False)

In [45]:
# Regex for a standalone 5-digit number, used as a proxy for a US ZIP code
zip_code_pattern = r'\b\d{5}\b'

# Find rows where Title1 has a 5-digit zip code
title1_zip_count = results_df['Title1'].str.contains(zip_code_pattern, na=False).sum()

# Find rows where Title2 has a 5-digit zip code
title2_zip_count = results_df['Title2'].str.contains(zip_code_pattern, na=False).sum()

title1_zip_count, title2_zip_count

(np.int64(1331), np.int64(1301))
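
The `\b\d{5}\b` pattern matches any standalone five-digit number, so a five-digit street number such as the `58060` in `58060 Plaquemine Street` also counts as a ZIP code. The sketch below shows one stricter heuristic: require the five digits to follow a word, as they do after a state name or abbreviation. `ZIP_AFTER_WORD` and `extract_zip` are hypothetical names, the sample strings are shortened from the tables in this notebook, and the heuristic can still misfire on a five-digit street number that appears mid-string.

```python
import re

# Five digits (optionally ZIP+4) preceded by a letter, optional "." or ",", and whitespace.
ZIP_AFTER_WORD = re.compile(r"[A-Za-z]\.?,?\s+(\d{5})(?:-\d{4})?\b")

def extract_zip(text):
    """Return the first ZIP-like number that follows a word, or None."""
    if not isinstance(text, str):
        return None  # covers None and NaN
    match = ZIP_AFTER_WORD.search(text)
    return match.group(1) if match else None

samples = [
    "58060 Plaquemine Street | Plaquemine, LA 70764",
    "P.O. Box 108 | Glennallen, AK. 99588",
    "26669 Butternut Ridge Rd",
    None,
]
for sample in samples:
    print(repr(sample), "->", extract_zip(sample))
```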

In [60]:
# Filter rows where Title1 has a 5-digit zip code
rows_with_zip_in_title1 = results_df[results_df['Title1'].str.contains(zip_code_pattern, na=False)]
rows_with_zip_in_title1

Unnamed: 0,URL,Title1,Title2
4,https://go.boarddocs.com/fl/taylor/Board.nsf/P...,"318 North Clark Street | Perry, FL 32347 | 850...",Taylor County School District
6,https://go.boarddocs.com/nj/colps/Board.nsf/Pu...,"100 Lees Ave | Collingswood, NJ 08108 | 856-96...",Collingswood Public Schools
7,https://go.boarddocs.com/nc/wcpsnc/Board.nsf/P...,"2001 E. Royall Ave | Goldsboro, NC 27534 | 919...",Wayne County Public Schools
9,https://go.boarddocs.com/la/ipsb/Board.nsf/Public,"58060 Plaquemine Street | Plaquemine, LA 70764...",Iberville Parish Schools
12,https://go.boarddocs.com/ca/kccd/Board.nsf/Public,"2100 Chester Ave | Bakersfield, CA 93301 | 661...",Kern Community College District
...,...,...,...
3505,https://go.boarddocs.com/oh/nocs/Board.nsf/Public,26669 Butternut Ridge Rd | North Olmsted OH 44...,North Olmsted City Schools
3506,https://go.boarddocs.com/va/corva/Board.nsf/Pu...,"Rappahannock County, Virginia - 3 Library Road...",
3515,https://go.boarddocs.com/oh/mevs/Board.nsf/Public,"212 Chestnut Street, Marysville, Ohio 43040 | ...",Marysville Exempted Village Schools
3516,https://go.boarddocs.com/ak/akcrsd/Board.nsf/P...,"P.O. Box 108 | Glennallen, AK. 99588 | Ph: (90...",Copper River School District
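
The two elements are not in a fixed order: some pages put the address in `Title1`, others in `Title2`, and the counts above suggest roughly as many of each. Filtering on `Title1` alone therefore leaves out the second group. The sketch below shows one way to fold both orders into `address` and `school_district` columns. `normalize_titles` is a hypothetical helper, the example.org URLs are placeholders, and rows where both or neither field looks like an address are dropped as ambiguous.

```python
import pandas as pd

zip_code_pattern = r"\b\d{5}\b"

def normalize_titles(df):
    """Return URL, address, school_district for rows with exactly one address-like field."""
    zip_in_1 = df["Title1"].str.contains(zip_code_pattern, na=False)
    zip_in_2 = df["Title2"].str.contains(zip_code_pattern, na=False)
    only_1 = zip_in_1 & ~zip_in_2
    only_2 = zip_in_2 & ~zip_in_1
    out = df.loc[only_1 | only_2, ["URL"]].copy()
    # where() keeps the first series where the mask is True, else takes the second.
    out["address"] = df["Title1"].where(only_1, df["Title2"])
    out["school_district"] = df["Title2"].where(only_1, df["Title1"])
    return out

demo_df = pd.DataFrame({
    "URL": ["https://example.org/a", "https://example.org/b",
            "https://example.org/c", "https://example.org/d"],
    "Title1": ["Copley-Fairlawn City Schools",
               "318 North Clark Street | Perry, FL 32347",
               "eGovernance Site", None],
    "Title2": ["3797 Ridgewood Road | Copley, OH 44321",
               "Taylor County School District",
               "Meetings, Agendas and Information", None],
})
normalized = normalize_titles(demo_df)
print(normalized)
```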


In [61]:
# rename() returns a new DataFrame; assigning the result avoids the
# SettingWithCopyWarning that inplace=True triggers on a filtered slice
rows_with_zip_in_title1 = rows_with_zip_in_title1.rename(columns={'Title1': 'address', 'Title2': 'school_district'})
rows_with_zip_in_title1.head()


Unnamed: 0,URL,address,school_district
4,https://go.boarddocs.com/fl/taylor/Board.nsf/P...,"318 North Clark Street | Perry, FL 32347 | 850...",Taylor County School District
6,https://go.boarddocs.com/nj/colps/Board.nsf/Pu...,"100 Lees Ave | Collingswood, NJ 08108 | 856-96...",Collingswood Public Schools
7,https://go.boarddocs.com/nc/wcpsnc/Board.nsf/P...,"2001 E. Royall Ave | Goldsboro, NC 27534 | 919...",Wayne County Public Schools
9,https://go.boarddocs.com/la/ipsb/Board.nsf/Public,"58060 Plaquemine Street | Plaquemine, LA 70764...",Iberville Parish Schools
12,https://go.boarddocs.com/ca/kccd/Board.nsf/Public,"2100 Chester Ave | Bakersfield, CA 93301 | 661...",Kern Community College District


In [64]:
rows_with_zip_in_title1.shape

(1000, 3)

In [63]:
# Keep the first 1,000 rows for the sample deliverable
rows_with_zip_in_title1 = rows_with_zip_in_title1.head(1000)

In [65]:
rows_with_zip_in_title1.shape

(1000, 3)

In [66]:
rows_with_zip_in_title1.to_csv("sample_deliverable_1.csv", index=False)