This notebook searches Google for the BoardDocs website of each school district. It works iteratively on the CSV marked as "working", so each run picks up the districts that earlier runs missed because of errors.

Input:
- `lea_boarddocs_queries.csv` (read only when no working copy exists yet)
- `working_school_districts_with_boarddocs_scraped.csv`

Output:
- `working_school_districts_with_boarddocs_scraped.csv`
- `school_districts_with_boarddocs_scraped.csv`

In [523]:
import pandas as pd
import requests
from tqdm import tqdm
from dotenv import load_dotenv
import os
import random

# Load environment variables from .env file
load_dotenv()

# Get API credentials
google_api_key = os.getenv("GOOGLE_API_KEY")
google_cse_id = os.getenv("GOOGLE_CSE_ID")

if not google_api_key or not google_cse_id:
    raise ValueError("API Key or CSE ID not found. Ensure they are set in the environment.")


# Function to perform Google API search
def google_search(query, user_id=0):
    url = "https://customsearch.googleapis.com/customsearch/v1"
    
    params = {
        "key": google_api_key,
        "cx": google_cse_id,
        "q": query,
        "num": 1,  # Fetch only the top result
    }
    
    try:
        # A timeout keeps one stalled request from hanging the whole run
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        results = response.json()
        if "items" in results:
            return results["items"][0]["link"]
    except Exception as e:
        # The error text can include the request URL, which carries the
        # credentials, so redact them before printing
        error_string = str(e)
        error_string = error_string.replace(google_api_key, "REDACTED_GOOGLE_API_KEY")
        error_string = error_string.replace(google_cse_id, "REDACTED_GOOGLE_CSE_KEY")
        print(f"Error for query '{query}': {error_string}")
    return None
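
The search returns whatever the top result happens to be, so it may be worth checking that a link really points at a BoardDocs host before storing it. The following is a minimal sketch, not part of the original notebook: `is_boarddocs_url` is a hypothetical helper, and it assumes that valid links live on `boarddocs.com` or one of its subdomains.

```python
from urllib.parse import urlparse


def is_boarddocs_url(link):
    """Return True if link is on boarddocs.com or one of its subdomains.

    Hypothetical helper; the accepted host name is an assumption.
    """
    if not link:
        # google_search returns None when there is no result or an error
        return False
    host = (urlparse(link).hostname or "").lower()
    return host == "boarddocs.com" or host.endswith(".boarddocs.com")
```

It could be applied to the return value of `google_search` before the link is written to the `url` column.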

In [524]:
# Check if the file exists
working_filename = "working_school_districts_with_boarddocs_scraped.csv"

if not os.path.exists(working_filename):
    # Load the school districts CSV
    df = pd.read_csv("lea_boarddocs_queries.csv")
    # Add an empty column called 'url' to the dataframe
    df['url'] = ''
    df['url'] = df['url'].astype('str')
    # Write the dataframe to a CSV file
    df.to_csv(working_filename, index=False)

In [525]:
# Load the working copy
df = pd.read_csv(working_filename)

In [526]:
# Cast every column to object dtype
df = df.astype('object')
df.dtypes

LEA_NAME     object
LEAID        object
STATENAME    object
url          object
dtype: object

In [527]:
print(f"Total number of all school boards: {df.shape[0]}")

Total number of all school boards: 19637


In [528]:
# get the remaining ones
remaining_df = df[df['url'].isna()]
print(f"Number of remaining school boards to scrape: {remaining_df.shape[0]}")

Number of remaining school boards to scrape: 6195


In [529]:
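# Process a random batch of up to 1900 districts in this run;
# .copy() lets us assign the new 'url' column later without pandas warnings
remaining_df = remaining_df.sample(min(1900, len(remaining_df))).copy()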

In [None]:
# Prepare the queries
import time

queries = remaining_df["LEA_NAME"]

# Perform Google search for each school district with tqdm progress bar
results = []
sleep_flag = True  # set to False to disable throttling
for query in tqdm(queries, desc="Searching Google", unit="query"):
    # Sleep between requests so that we stay right at Google's rate limit
    if sleep_flag:
        # The limit is 100 queries per 60 seconds, i.e. 0.6 s per query.
        # Each request takes time as well, so sleeping 0.4 s is enough
        # as long as a request takes at least about 0.2 s.
        # This should be about 25% faster than sleeping the full 0.6 s.
        time.sleep(0.4)
    results.append(google_search(query))

# Add the results to the DataFrame
remaining_df["url"] = results
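
The fixed sleep relies on a guess about how long each request takes. One alternative is to measure the elapsed time and sleep only for the remainder of the interval. The following is a minimal sketch under that assumption, not the notebook's own method: `MinIntervalThrottle` is a hypothetical helper, and the `clock` and `sleep` parameters exist only so that it can be tested without real waiting.

```python
import time


class MinIntervalThrottle:
    """Keep successive wait() calls at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous wait() returned

    def wait(self):
        now = self._clock()
        if self._last is not None:
            # Sleep only for the part of the interval not already spent
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage in the search loop (100 queries per 60 seconds):
# throttle = MinIntervalThrottle(60 / 100)
# for query in queries:
#     throttle.wait()
#     results.append(google_search(query))
```

With this approach slow requests add no extra delay, while fast requests are still held back to the configured interval.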

In [531]:
# show a sample of results
remaining_df.sample(5)

      LEA_NAME                  LEAID    STATENAME  url
3251  MIRABILE C-1              2921030  MISSOURI   https://go.boarddocs.com/ny/crboces/Board.nsf/...
2781  Crawford AuSable Schools  2611030  MICHIGAN   https://www.boarddocs.com/mi/casd/Board.nsf/Pu...
1598  Illini Central CUSD 189   1700113  ILLINOIS
3374  Dawson H S                3008340  MONTANA    https://go.boarddocs.com/ga/musc/Board.nsf/fil...
3227  HAYTI R-II                2913800  MISSOURI   https://go.boarddocs.com/mo/foxc6/Board.nsf/fi...


In [532]:
# only keep the ones with non NA
remaining_df = remaining_df[~remaining_df['url'].isna()]
print(f"Number of new non-NA results: {remaining_df.shape[0]}")

Number of new non-NA results: 1421


In [533]:
# Remove the rows whose LEAID appears in remaining_df
df = df[~df["LEAID"].isin(remaining_df["LEAID"])]

# Concatenate remaining_df to df
df = pd.concat([df, remaining_df], ignore_index=True)

In [534]:
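# Fill any still-missing values in df from remaining_df. Note that update()
# aligns on index labels: df has a fresh 0..n-1 index after the concat, while
# remaining_df is indexed by LEAID here, so in practice this leaves df unchanged.
# The concat in the previous cell has already merged the new URLs.
df.update(remaining_df.set_index('LEAID'), overwrite=False)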
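
`DataFrame.update` matches rows by index label, so both frames need to share the same key as their index for it to have any effect. A minimal sketch of one alternative, filling only the missing `url` values by looking up each `LEAID`, is shown below on toy data. The frames `df_demo` and `new_demo` and all of their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df and remaining_df; all values are made up
df_demo = pd.DataFrame({
    "LEAID": [101, 102, 103],
    "url": ["https://example.org/a", None, None],
})
new_demo = pd.DataFrame({
    "LEAID": [103, 102],
    "url": ["https://example.org/c", "https://example.org/b"],
})

# Build a LEAID -> url lookup (LEAID must be unique in the new results)
lookup = new_demo.set_index("LEAID")["url"]

# Fill only the rows where url is still missing; existing urls are kept
df_demo["url"] = df_demo["url"].fillna(df_demo["LEAID"].map(lookup))
```

This keeps the original row order and column layout, which a drop-and-concat merge does not.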

In [535]:
# Save the results to the working CSV
df.to_csv(working_filename, index=False)

In [536]:
# percentage done
percentage_done = str(round((df[~df["url"].isna()].shape[0] / df.shape[0])*100,2)) + '%'
print(f"Percentage of total school boards scraped: {percentage_done}")

Percentage of total school boards scraped: 75.69%
