# <center>How to search keywords on municipal websites</center>

**Cite this tutorial as**:

Cai, Meng, Huang, Huiqing, & Decaminada, Travis. (2023). Local data at a national scale: Introducing a dataset of official municipal websites in the United States for text-based analytics. Environment and Planning B: Urban Analytics and City Science, 0(0). https://doi.org/10.1177/23998083231190961

This tutorial demonstrates how to systematically search self-defined keywords on municipal websites using the dataset UScityURL.csv and the [Google Custom Search API](https://developers.google.com/custom-search/v1/introduction), and how to save the search results as a csv file. Google offers 100 queries per day for free, and daily quotas reset at midnight Pacific Time. Additional queries cost $5 per 1,000 queries. Details about Google API pricing can be found [here](https://developers.google.com/custom-search/v1/overview).
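
To get a feel for what the quota means in practice, the arithmetic can be sketched as two small helpers. This is a rough estimate only: `estimate_cost` and `days_on_free_tier` are hypothetical functions written for this illustration, and they assume that the free 100 queries are used up on each day of the run and that paid queries are billed pro rata at $5 per 1,000.

```python
import math

def estimate_cost(n_queries, days=1, free_per_day=100, price_per_1000=5.0):
    """Rough cost in US dollars for n_queries spread over a number of days.

    Assumptions made for this sketch: the free daily quota is fully used
    on each day, and paid queries are billed pro rata.
    """
    free_queries = free_per_day * days
    paid_queries = max(0, n_queries - free_queries)
    return paid_queries / 1000 * price_per_1000

def days_on_free_tier(n_queries, free_per_day=100):
    """Number of days needed to run n_queries without paying."""
    return math.ceil(n_queries / free_per_day)

# one query for each of the 13,724 municipalities with a website
print(round(estimate_cost(13724), 2))
print(days_on_free_tier(13724))
```

Under these assumptions, one query per municipality with a website would take 138 days on the free quota, or cost roughly $68 if only one day of free quota is used. The real number of queries can be higher, because retrieving more than one page of results for a municipality takes one query per page.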

First, set up a Google API key and case ID following the [instructions](https://support.google.com/googleapi/answer/6158862?hl=en). The case ID is the search engine ID that the API takes as its `cx` parameter. Then, in the same folder where you run the code, save the API key in a txt file named `api_key.txt`. The txt file should contain only the key. Similarly, save the case ID in a txt file named `case_id.txt`. These unique identifiers authenticate and authorize access to the Google Custom Search API, so they should never be shared with others.

Next, install the Google API client library. The documentation is available [here](https://pypi.org/project/google-api-python-client/). After a successful installation, import all the libraries.

In [1]:
## remove the # in the next line to install the Google API client library
#!pip install google-api-python-client

In [2]:
# import libraries
import pandas as pd
from googleapiclient.discovery import build
import csv
import math
import time

In [3]:
# read in your API key and case ID
with open('api_key.txt', 'r') as f:
    api_key = f.read().strip()
with open('case_id.txt', 'r') as f:
    case_id = f.read().strip()

After the above preparations, we are ready for the search.

In [4]:
# load in the dataset and check it out
UScityURL = pd.read_csv("UScityURL.csv") # keep the data and the notebook in the same folder; otherwise change the path
UScityURL.head()

Unnamed: 0,GEOID,MUNICIPALITY,STATE,WEBSITE_AVAILABLE,WEBSITE_URL
0,1600000US3651000,New York,New York,1,https://www.nyc.gov/
1,1600000US0644000,Los Angeles,California,1,https://www.lacity.org/
2,1600000US1714000,Chicago,Illinois,1,https://www.chicago.gov/
3,1600000US4835000,Houston,Texas,1,http://www.houstontx.gov/
4,1600000US0455000,Phoenix,Arizona,1,https://www.phoenix.gov/


A code book `cookbook.txt` for the dataset is available in the repo. 

When the dataset was compiled (September 2022), 13,724 out of 19,518 municipalities (70%) had an official website. All the municipalities without official websites had populations below 6,000.
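
These figures can be recomputed from the `WEBSITE_AVAILABLE` column. The sketch below uses a small made-up DataFrame (`toy`) in place of `UScityURL` so that it runs on its own; `website_coverage` is a hypothetical helper, not part of the dataset or of pandas.

```python
import pandas as pd

def website_coverage(df):
    """Return (number with a website, total, share) from WEBSITE_AVAILABLE."""
    n_with_site = int(df["WEBSITE_AVAILABLE"].sum())
    n_total = len(df)
    return n_with_site, n_total, n_with_site / n_total

# made-up rows for illustration only; replace `toy` with UScityURL
toy = pd.DataFrame({
    "MUNICIPALITY": ["A", "B", "C", "D"],
    "WEBSITE_AVAILABLE": [1, 1, 1, 0],
})
n_with_site, n_total, share = website_coverage(toy)
print(f"{n_with_site} of {n_total} municipalities ({share:.0%}) have a website")
```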

In [5]:
UScityURL.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19518 entries, 0 to 19517
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   GEOID              19518 non-null  object
 1   MUNICIPALITY       19518 non-null  object
 2   STATE              19518 non-null  object
 3   WEBSITE_AVAILABLE  19518 non-null  int64 
 4   WEBSITE_URL        13724 non-null  object
dtypes: int64(1), object(4)
memory usage: 762.5+ KB


In [6]:
# a ready-to-use function for conducting systematic searches
def search(search_term, api_key, case_id):
    """
    Use Google Custom Search API to systematically search self-defined keywords.
    
    Arguments:
        search_term: search string. The maximum length is 2048 characters.
        api_key: your api key.
        case_id: your case id.
    Returns:
        len(link_list): the number of search results returned.
        title_list: the titles of returned search results in a list.
        link_list: the links of returned search results in a list.
        snippet_list: the snippets of returned search results in a list.
    """
    
    service = build("customsearch", "v1", developerKey=api_key)
    result = service.cse().list(q=search_term, cx=case_id).execute()
    # totalResults is an estimate reported by Google, not an exact count
    est_total_num = int(result["searchInformation"]["totalResults"])
    title_list = []
    link_list = []
    snippet_list = []

    def collect(response):
        # "items" is missing when a page is empty, and an item may lack a snippet
        for item in response.get("items", []):
            title_list.append(item["title"])
            link_list.append(item["link"])
            snippet_list.append(item.get("snippet", ""))

    collect(result)
    # each page holds 10 results; request at most 10 pages (100 results)
    total_page = min(math.ceil(est_total_num / 10), 10)
    for page in range(1, total_page):
        start = page * 10 + 1
        more_result = service.cse().list(q=search_term, cx=case_id, start=start).execute()
        if int(more_result["searchInformation"]["totalResults"]) == 0:
            break
        collect(more_result)
    return len(link_list), title_list, link_list, snippet_list
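
The paging arithmetic in `search` can be checked in isolation. The sketch below reproduces only the computation of the `start` offsets, without calling the API. `page_starts` is a hypothetical helper written for this illustration; it is not part of the Google client library.

```python
import math

def page_starts(est_total_num, page_size=10, max_pages=10):
    """Return the `start` index of each result page that would be requested."""
    total_page = min(math.ceil(est_total_num / page_size), max_pages)
    return [page * page_size + 1 for page in range(total_page)]

print(page_starts(0))     # []
print(page_starts(5))     # [1]
print(page_starts(34))    # [1, 11, 21, 31]
print(page_starts(5000))  # [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```

Because the function stops after 10 pages of 10 results, `search` returns at most 100 results per municipality, however large the estimated total is.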

In [7]:
# define your search keyword
keyword = "inequity"
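
The keyword is later combined with a `site:` operator so that each query is restricted to one municipal website. This tutorial passes the full `WEBSITE_URL` to `site:`. As a variation, the sketch below strips the URL scheme and trailing slash first and can wrap a multi-word keyword in quotes for an exact-phrase search. `build_search_term` and its `exact_phrase` parameter are hypothetical names introduced here, and whether this normalization changes the results for a given site is an assumption to verify.

```python
from urllib.parse import urlparse

def build_search_term(keyword, website_url, exact_phrase=False):
    """Combine a keyword with a site: restriction for one municipal website."""
    parts = urlparse(website_url)
    # keep host and path, e.g. "www.nyc.gov" from "https://www.nyc.gov/"
    target = (parts.netloc + parts.path).rstrip("/") or website_url
    if exact_phrase:
        keyword = f'"{keyword}"'
    return f"{keyword} site:{target}"

print(build_search_term("inequity", "https://www.nyc.gov/"))
print(build_search_term("heat equity", "http://www.houstontx.gov/", exact_phrase=True))
```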

In [8]:
# give your output file a name
output_file = "output_example.csv"

In [9]:
# filter out municipalities without websites
source = UScityURL[UScityURL.WEBSITE_AVAILABLE==1].reset_index(drop=True)

In [10]:
# conduct the search and append the results to a csv file
with open(output_file, "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for i in range(0, 5): # run 5 cities as an example
        search_term = keyword + ' site:' + source.WEBSITE_URL.iloc[i]
        total, title, link, snippet = search(search_term, api_key, case_id)
        writer.writerow([source.GEOID.iloc[i], total, title, link, snippet])
        time.sleep(1) # to avoid too many requests error
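
The one-second pause lowers the chance of a "too many requests" error but does not handle one if it occurs. A minimal sketch of retrying with exponential backoff follows. `call_with_retries` is a hypothetical helper, and the `flaky` function only simulates a failing API call so that the example runs without credentials.

```python
import time

def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 1, 2, 4, ... and retry."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# stand-in for an API call that fails twice before succeeding
calls = {"n": 0}
def flaky(term):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return f"results for {term}"

waits = []  # record the waits instead of actually sleeping
print(call_with_retries(flaky, "inequity", sleep=waits.append))
print(waits)
```

With the real API, one option would be `call_with_retries(search, search_term, api_key, case_id, retry_on=(HttpError,))`, where `HttpError` is imported from `googleapiclient.errors`, so that only HTTP errors from the API trigger a retry.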

In [14]:
# check the output file
output = pd.read_csv(output_file, names=["GEOID","total","title","link","snippet"])

In [15]:
output

Unnamed: 0,GEOID,total,title,link,snippet
0,1600000US3651000,100,"[""Addressing New York City's Smoking Inequitie...",['https://www.nyc.gov/assets/doh/downloads/pdf...,"['Oct 1, 2022 ... These exposures contribute t..."
1,1600000US0644000,5,"['L.A. Controller Releases Report, ""Diversity ...",['https://www.lacity.org/highlights/la-control...,"['Feb 5, 2021 ... L.A. Controller Ron Galperin..."
2,1600000US1714000,100,['Exploring Root Causes of Health Inequities i...,['https://www.chicago.gov/content/dam/city/dep...,"['Jun 19, 2019 ... What underlying social and ..."
3,1600000US4835000,76,"['Health Disparities Summary 2019', 'Community...",['https://www.houstontx.gov/health/chs/documen...,"['Nov 19, 2019 ... The Houston Health Departme..."
4,1600000US0455000,34,['Heat Equity Policy: Inequities in Extreme He...,['https://www.phoenix.gov/oepsite/Documents/De...,"['Oct 19, 2021 ... Heat Equity Policy: Inequit..."


The returned search results pinpoint where a certain topic is discussed. They are suitable for further text-based analysis.
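
Because each list is written to the csv file as its text representation, `pd.read_csv` returns the `title`, `link`, and `snippet` columns as strings rather than lists. Below is a minimal sketch of converting them back with the standard library's `ast.literal_eval`, using a made-up one-row DataFrame (`demo`) in place of `output`. It assumes pandas 1.3 or later, which allows `explode` on several columns at once.

```python
import ast
import pandas as pd

# made-up row for illustration; in the tutorial this would be `output`
demo = pd.DataFrame({
    "GEOID": ["1600000US0000000"],
    "total": [2],
    "title": ["['First page', 'Second page']"],
    "link": ["['https://example.org/a', 'https://example.org/b']"],
})

for column in ["title", "link"]:
    # literal_eval safely parses the text of a Python list back into a list
    demo[column] = demo[column].apply(ast.literal_eval)

# one row per search result instead of one row per municipality
long_format = demo.explode(["title", "link"], ignore_index=True)
print(long_format)
```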

In [17]:
# to check out the titles of the returned search results
print(output.title.iloc[0])

["Addressing New York City's Smoking Inequities", 'Health Department Releases New Data on Smoking Inequities in NYC', 'Mayor Adams Takes Action to Reduce Maternal and Infant Health ...', 'The COVID-19 Pandemic Magnified Inequities and Structural ...', 'Aging in New York Fund - NYC Aging', 'Race to Justice Communications Tips | NYC.gov', 'About - CGE', "Mayor Adams Commits to Making New York City Future of Women's ...", 'Language Use Guide', 'Inequities in Experiences of the COVID-19 Pandemic, New York City', 'LinkNYC Link5G Design Proposal', 'Health Disparities', 'Office of Equity Strategies - ACS', 'Nutrition Services - NYC Health', 'Future of Workers Task Force', 'About COVID-19 Data - NYC Health', "Addressing New York City's Smoking Inequities", 'TCNY 2020 Neighborhood Health Initiative - NYC Health', 'What We Do - CGE', 'Development of a PrEP Equity Index to Set Local Targets for PrEP ...', 'Race to Justice - NYC Health', '2021 Testimony', 'Health Department Releases Monkeypox Vacc