# Introduction to web-scraping

## Outline

* [Making `Requests`](#request)
* [Parsing HTML](#parsing)
    * [Pretty parsing with `BeautifulSoup`](#BS)
    * [Getting human-readable text](#readable)
* [URL collection with automated Google search](#URLs)
    * [Scraping school URLs](#school_URLs)
    * [Scraping URLs using an exclusion list](#exclusionlist)


**__________________________________**


## Import packages

In [693]:
# for parsing
from bs4 import BeautifulSoup # essential package for parsing in Python
import requests # for web requests

# for automated URL collection
# !pip install google # needs running only once
from googlesearch import search # automated Google search package

## Making `Requests` <a id='request'></a>

The first step in web-scraping is getting the HTML of the website we want to scrape. The [requests](http://docs.python-requests.org/en/master/) library is the easiest way to do this in Python.

In [694]:
url = 'https://en.wikipedia.org/wiki/Abenaki'

response = requests.get(url)

Great, it looks like everything worked! Let's see our beautiful HTML:

In [695]:
response

<Response [200]>

What the `requests.get` function returned (and the thing in our `response` variable) was a Response object. It itself isn't the HTML that we wanted, but rather a collection of metadata about the request/response interaction between your computer and the Wikipedia server.

For example, it knows whether the response was successful or not (`response.ok`), how long the whole interaction took (`response.elapsed`), what time the request took place (`response.headers['Date']`) and a whole bunch of other metadata.

In [696]:
response.ok

True

In [697]:
response.headers['Date']

'Wed, 15 Feb 2023 20:39:01 GMT'

Of course, what we really care about is the HTML content. We can get that from the `Response` object with `response.text`. What we get back is a string of HTML, exactly the contents of the HTML file at the URL that we requested.

In [698]:
html = response.text
print(html[:1000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Abenaki - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=document.cookie.match(/(?:^|;
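
Requests can fail or hang, so in a longer scraping script it may help to wrap `requests.get` in a small helper. The sketch below is one possible approach rather than part of the workflow above: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML at `url` as a string, or None if the request fails.

    `fetch_html` is a hypothetical helper; the timeout value is arbitrary.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
    except requests.exceptions.RequestException as error:
        print(f'Request failed: {error}')
        return None
    return response.text
```

For example, `fetch_html('https://en.wikipedia.org/wiki/Abenaki')` should return the same string as `response.text` above when the request succeeds.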

### Challenge

Get the HTML for [this claim review by fact checking site PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/). 
Print out the first 1000 characters and compare it to the HTML you see when you view the source HTML in your browser.

In [699]:
# your solution here
biden_url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'

biden_response = requests.get(biden_url)

biden_html = biden_response.text
print(biden_html[:1000])


<!DOCTYPE html>
<html lang="en-US" dir="ltr">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>PolitiFact | Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t.</title>
<meta name="description" content="Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal. Senate Minority Leader " />
<meta property="og:url" content="https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/" />
<meta property="og:image" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />
<meta property="og:image:secure_url" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />
<meta property="og:title" content="PolitiFact - Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t." />
<meta propert

# Parsing HTML <a id='parsing'></a>

After getting HTML (e.g., with `Requests`), the second step in web-scraping is parsing HTML. This is where things can get a little tricky.

Let's start by looking more closely at HTML. Use your browser developer tools (e.g., in Chrome, right click > `Inspect`) to inspect the HTML of [the page listing all courses in Quantitative Social Science in 2022-23 at Dartmouth College](https://dartmouth.smartcatalogiq.com/en/current/orc/Departments-Programs-Undergraduate/Quantitative-Social-Science/QSS-Quantitative-Social-Sciences), and find the HTML of the hyperlink to each specific course page. There's a lot of other stuff in the file that we don't care too much about. You could try `Ctrl-F`ing for the name of a course you see on the webpage (you might have to then scroll around to find the highlighted code).

You should see listings like these ones:

```html
<ul class="sc-child-item-links">
    <li>
        <a href="/en/current/orc/Departments-Programs-Undergraduate/Quantitative-Social-Science/QSS-Quantitative-Social-Sciences/QSS-15">
            QSS&nbsp;15&nbsp;Introduction to Data Analysis
        </a>
    </li>
    <li>
        <a href="/en/current/orc/Departments-Programs-Undergraduate/Quantitative-Social-Science/QSS-Quantitative-Social-Sciences/QSS-17">
            QSS&nbsp;17&nbsp;Data Visualization
        </a>
    </li>
    ...
```

This is HTML. HTML uses "tags", code that surrounds the raw text which indicates the structure of the content. The tags are enclosed in `<` and `>` symbols. The `<li>` says "this is a new thing in a list", and `</li>` says "that's the end of that new thing in the list". Similarly, the `<a ...>` and the `</a>` say, "everything between us is a hyperlink". 

In this HTML file, each course title is listed with `<li>...</li>` and is also linked to its own page using `<a>...</a>`. In our browser, if we click on the name of the course, it takes us to detailed information for that class, including the Instructor and Pre-Requisites. You'll see that inside the `<a>` bit, there's a `href=...`. That tells us the (relative) location of the page it's linked to.
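
Because the `href` values are relative, they need to be combined with the address of the page they came from before they can be requested. One way to do this, sketched here with the standard library's `urljoin` (the variable names are just for illustration):

```python
from urllib.parse import urljoin

# URL of the course listing page linked above
page_url = ('https://dartmouth.smartcatalogiq.com/en/current/orc/'
            'Departments-Programs-Undergraduate/Quantitative-Social-Science/'
            'QSS-Quantitative-Social-Sciences')

# A relative href copied from the HTML listing above
relative_href = ('/en/current/orc/Departments-Programs-Undergraduate/'
                 'Quantitative-Social-Science/QSS-Quantitative-Social-Sciences/QSS-15')

# Resolve the relative link against the page it came from
course_url = urljoin(page_url, relative_href)
print(course_url)
```

Since the `href` starts with `/`, the result is simply the site's scheme and domain followed by the `href`.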

## Pretty parsing with `BeautifulSoup` <a id='BS'></a>

Armed with this knowledge of HTML, let's try getting the HTML and parsing a webpage. We will use `requests` to get the HTML and its text, then `BeautifulSoup` to parse the result. (Check out [the `BeautifulSoup` docs](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) for lots of tips and tricks!)

In [700]:
# Define URL to scrape: a fact-checking page
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'

# Scrape HTML
html = requests.get(url)

# Convert HTML into soup object
soup = BeautifulSoup(html.text, 'html.parser') # name the parser explicitly; 'lxml' is faster if installed

# See pretty formatting in soup object
print(soup.prettify()[:1200])

<!DOCTYPE html>
<html dir="ltr" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   PolitiFact | Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t.
  </title>
  <meta content="Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal. Senate Minority Leader " name="description"/>
  <meta content="https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/" property="og:url"/>
  <meta content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" property="og:image"/>
  <meta content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" property="og:image:secure_url"/>
  <meta content="PolitiFact - Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t." property="og

Pretty! Especially compared to the kind of raw output of `requests.get().text` we saw above. But this is just the beginning of what `BeautifulSoup` can do. It can also find specific tags, like paragraphs (via `<p>`), headings (via `<h1>`, `<h2>`, etc.), and hyperlinks (via `<a>` and their `href` attributes).

In most cases, the `<p>` tag is the most useful for extracting readable text from a webpage. Let's get the first 10 paragraph tags from this claim review page.

In [701]:
for paragraph in soup.find_all('p')[:10]: # first 10 paragraphs via <p> tag
    print(paragraph)

<p>
Our only agenda is to publish the truth so you can be an informed participant in democracy.
<br/>We need your help.
</p>
<p>
<a class="m-disruptor-content__link" href="/membership/">More Info</a>
</p>
<p class="c-image__caption-inner copy-xs">
The White House infrastructure plan has $111 billion to improve water and sewer systems. (Shutterstock)
</p>
<p>The White House infrastructure plan would cost about $2.3 trillion. A Green New Deal-type plan would cost $9.5 trillion.</p>
<p>The Green New Deal included broader social economic goals, such as a guaranteed livable wage, affordable higher education and universal health care.</p>
<p>Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal.</p>
<p>Senate Minority Leader Mitch McConnell said that as written, the $2.3 trillion American Jobs Plan released March 31 was a nonstarter. The conservative PAC Citizens United put Biden’s plan in the same boat as the <a href="https://www.congress.gov/bill/1

### Challenge

Find all the links in the above claim review page using the `<a>` tags and their `href` attributes. Print every 10th link. What do you notice about where these links point?

In [702]:
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
html = requests.get(url)
soup = BeautifulSoup(html.text)

# Your solution here
for link in soup.find_all('a')[::10]:
    print(link.get('href'))

/
/pennsylvania/
/health-check/
/personalities/
/personalities/sean-hannity/
/factchecks/list/?ruling=pants-fire
/corrections-and-updates/
/infrastructure/
https://twitter.com/Citizens_United/status/1377308915227107336?ref_src=twsrc%5Etfw
None
https://www.usda.gov/media/press-releases/2021/03/10/fact-sheet-united-states-department-agriculture-provisions-hr-1319
#
#
#
/personalities/facebook-posts/
/personalities/joe-biden/
/personalities/mitch-mcconnell/
/north-carolina/
/who-pays-for-politifact/
/copyright/
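
The `None` and `#` entries in the output come from `<a>` tags that have no `href` attribute or that only act as placeholders. A minimal sketch of one way to skip them, using a made-up HTML snippet rather than the PolitiFact page:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not taken from the PolitiFact page
sample_html = '''
<a href="/personalities/">People</a>
<a name="anchor-without-href">No link here</a>
<a href="#">Menu</a>
<a href="https://www.usda.gov/">USDA</a>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only <a> tags that actually have an href attribute
hrefs = [a['href'] for a in sample_soup.find_all('a', href=True)]

# Drop placeholder links that point nowhere
real_links = [h for h in hrefs if h != '#']
print(real_links)
```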


Sometimes these tags aren't very useful--in fact, they can get in the way of extracting only visible or human-readable text from the HTML. This too can be accomplished with `BeautifulSoup`!

## Getting human-readable text <a id='readable'></a>

Occasionally we want to learn about websites via their tags: What the headers say, which paragraph comes first, where the links or images are, etc. Other times tags (such as scripts or styles) only introduce extraneous characters and nonsense words, and we want to ignore the tags themselves or even the text they enclose. 

The simplest way to do this is with the `get_text()` method in `BeautifulSoup`, which returns all the text in a document or beneath a tag, as a single Unicode string. You might have noticed that the `<p>` tags got in the way in our above example. Let's try that again and this time, we will remove the tags.

In [703]:
for paragraph in soup.find_all('p')[:10]: # first 10 paragraphs via <p> tag
    print(paragraph.get_text().strip()) # extract text and strip leading/trailing whitespace

Our only agenda is to publish the truth so you can be an informed participant in democracy.
We need your help.
More Info
The White House infrastructure plan has $111 billion to improve water and sewer systems. (Shutterstock)
The White House infrastructure plan would cost about $2.3 trillion. A Green New Deal-type plan would cost $9.5 trillion.
The Green New Deal included broader social economic goals, such as a guaranteed livable wage, affordable higher education and universal health care.
Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal.
Senate Minority Leader Mitch McConnell said that as written, the $2.3 trillion American Jobs Plan released March 31 was a nonstarter. The conservative PAC Citizens United put Biden’s plan in the same boat as the Green New Deal, a sweeping environmental and social justice agenda that Republicans have condemned.
"Does this sound like an infrastructure bill to you?" the group tweeted March 31, with a link to

It's also easy to access the first element in the soup object matching a given tag, like so:

In [704]:
soup.p.get_text() # Get text of first paragraph

'\nOur only agenda is to publish the truth so you can be an informed participant in democracy.\nWe need your help.\n'

Another useful method is `extract()`, which can be used to surgically remove a tag or string from the soup tree, storing it for safe keeping. Let's extract the first 5 links:

In [705]:
extracted = [] # initialize list of extracted links

for link in soup.find_all('a')[:5]: # get first five <a> tags
    extracted.append(link.extract()) # extract the link
    
print('Extracted links:')
print(extracted)
print()

# What are the first five links now that the previous five were removed? 
print('Remaining links:')
for link in soup.find_all('a')[:5]: 
    print(link)

Extracted links:
[<a href="/">
<span class="m-branding">
<span class="m-branding__logo">
<svg class="c-icon">
<use xlink:href="#svg_logo-plain"></use>
</svg>
</span>
<span class="m-branding__subline">
<span class="m-branding__claim">
The Poynter Institute
</span>
</span>
</span>
</a>, <a class="c-burger" data-menu-toggle="" href="#">
<span class="c-burger__lines"></span>
<span class="c-burger__value">Menu</span>
</a>, <a class="c-button c-button--small show-for-large" href="/membership/">
Donate
</a>, <a href="/california/">
California
</a>, <a href="/florida/">
Florida
</a>]

Remaining links:
<a href="/illinois/">
Illinois
</a>
<a href="/iowa/">
 Iowa
</a>
<a href="/missouri/">
Missouri
</a>
<a href="/new-york/">
New York
</a>
<a href="/north-carolina/">
North Carolina
</a>


What if we don't want to keep the tag at all? In this case, we would use `decompose()`, which obliterates a useless tag (and frees up memory). Unlike with `extract()`, with `decompose()` you don't need to assign the junk tag to anything to clear it--the method does this automatically. 

Let's try the above code again, this time with `decompose()` and `get_text()` to clean up the display.

In [706]:
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
html = requests.get(url)
soup = BeautifulSoup(html.text)

for link in soup.find_all('a')[:5]: # get first five <a> tags
    link.decompose() # obliterate this link
    
# What are the first five links now that the previous five were removed? 
print('Remaining links:')
for link in soup.find_all('a')[:5]: 
    print(link.get_text().strip()) # get text and clean spacing

Remaining links:
Illinois
Iowa
Missouri
New York
North Carolina
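
To make the difference between `extract()` and `decompose()` concrete, here is a small sketch on a made-up HTML snippet (the variable names are for illustration only):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
sample_soup = BeautifulSoup(
    '<div><a href="/a/">First</a><a href="/b/">Second</a><p>Body text</p></div>',
    'html.parser')

first_link = sample_soup.find('a')
removed = first_link.extract()       # removed from the tree, but returned to us
print(removed.get_text())            # the extracted tag is still usable

second_link = sample_soup.find('a')  # now the first remaining <a> tag
result = second_link.decompose()     # destroyed in place; nothing is returned

remaining_links = sample_soup.find_all('a')
print(result, len(remaining_links), sample_soup.get_text())
```

After both calls, no `<a>` tags remain in the soup, but only the extracted one is still available for later use.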


Not all websites use the `<p>` tag to indicate the important, human-readable text. Sometimes we need to approach HTML parsing from the other end: By finding and removing all non-informative tags. Let's use `BeautifulSoup` to build such a method. 

### Challenge

Use `decompose()` to remove from the soup all tags showing anything other than human-readable text. Below is a list of such junk tags to use as an exclusion list. 

```
"b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
"samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
"title", "[document]", "script", "style", "meta", "noscript"
```

_Hint:_ Iterate over these tags to identify each one in the soup and remove it.

In [707]:
# Your solution here
remove_tags = ["b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
"samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
"title", "[document]", "script", "style", "meta", "noscript"]

print("meta tags pre removal:")
print(soup.find_all('meta')[:5])

for tag_type in remove_tags:
    for tag in soup.find_all(tag_type):
        tag.decompose()
        
print("\nmeta tags post removal:")
print(soup.find_all('meta'))

print("\nSoup post removal:")
print(soup.get_text(strip=True)[:1000])
    

meta tags pre removal:
[<meta charset="utf-8"/>, <meta content="ie=edge" http-equiv="x-ua-compatible"/>, <meta content="width=device-width, initial-scale=1" name="viewport"/>, <meta content="Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal. Senate Minority Leader " name="description"/>, <meta content="https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/" property="og:url"/>]

meta tags post removal:
[]

Soup post removal:
State EditionsIllinoisIowaMissouriNew YorkNorth CarolinaPennsylvaniaTexasVirginiaWest VirginiaVermontWisconsinMichiganIssuesAll IssuesOnline hoaxesCoronavirusHealth CareImmigrationExtremismTaxesMarijuanaEnvironmentCrimeGunsForeign PolicyLGBTQPeopleAll PeopleJoe BidenKamala HarrisCharles SchumerMitch McConnellBernie SandersNancy PelosiDonald TrumpMediaPunditFactTucker CarlsonSean HannityRachel MaddowBloggersPolitiFact VideosCampaigns2020 ElectionsTruth-o-Meter

You might have noticed that word boundaries get clobbered in the output above. This is because we called `get_text()` with `strip=True`, which tells `BeautifulSoup` to strip whitespace (of any kind) from the beginning and end of each bit of text before joining the pieces together with nothing in between. The default, `strip=False`, keeps word boundaries intact but leaves lots of extra whitespace--usually newlines--which takes some regular expressions to clean up.

### Challenge

Using the above tags exclusion list and `decompose()` as before, this time use the `strip=False` parameter when calling `get_text()` to avoid combining words across whitespace boundaries. Instead, use regular expressions to clean up extra whitespaces.

In [708]:
# Your solution here    
import re

text_nostrip = soup.get_text(strip=False)

regex = r'\s+'
text_processed = re.sub(regex, ' ', text_nostrip[:1000])

print(text_processed)

 State Editions Illinois Iowa Missouri New York North Carolina Pennsylvania Texas Virginia West Virginia Vermont Wisconsin Michigan Issues All Issues Online hoaxes Coronavirus Health Care Immigration Extremism Taxes Marijuana Environment Crime Guns Foreign Policy LGBTQ People All People Joe Biden Kamala Harris Charles Schumer Mitch McConnell Bernie Sanders Nancy Pelosi Donald Trump Media PunditFact Tucker Carlson Sean Hannity Rachel Maddow Bloggers PolitiFact Videos Campaigns 2020 Elections Truth-o-Meter True Mostly True Half True Mostly False False Pants on Fire Promises Biden Promise Tracker Trump-O-Meter Obameter Latest Promises About Us Our Process Our Staff Who pays for
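
As an alternative to cleaning up with regular expressions, `get_text()` also accepts a `separator` argument that is placed between the stripped pieces of text. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
sample_soup = BeautifulSoup(
    '<ul><li> Iowa </li><li> New York </li></ul>', 'html.parser')

squashed = sample_soup.get_text(strip=True)               # pieces joined with nothing
spaced = sample_soup.get_text(separator=' ', strip=True)  # one space between pieces
print(repr(squashed))
print(repr(spaced))
```

Note that the separator only goes between separate bits of text; whitespace inside a single bit (like the space in "New York") is left as it is.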


### Extra Challenge

You might have noticed that when we scraped HTML above from [this claim review by PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/), we got headers and tags like this:
```html
<p>Misinformation isn't going away just because it's a new year. Support trusted, factual information with a tax deductible contribution to PolitiFact.</p>
<p>
<a class="m-disruptor-content__link" href="/membership/">More Info</a>
</p>
<p class="c-image__caption-inner copy-xs">
The White House infrastructure plan has $111 billion to improve water and sewer systems. (Shutterstock)
</p>
```
Use what you now know about identifying HTML, removing tags, and cleaning spacing to scrape a clean explanation from the body of this article. 

_Hint:_ Use your browser to inspect this website's HTML and identify any unique types and/or classes that enclose the explanation (and nothing else).

In [709]:
# Your solution here

Compare the output from this focused, site-specific scraping approach with that from the exclusion list method above. <br/>
**Which method gives the cleaner output? Which method is more extensible?**
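
As a rough illustration of the site-specific approach, the sketch below narrows the search to a single container before collecting paragraph text. The HTML is made up and the class name `story-body` is hypothetical; inspect the real page to find the element that actually wraps the explanation.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the tag and class names here are hypothetical
sample_html = '''
<nav><p>Menu item</p></nav>
<article class="story-body">
  <p>First paragraph of the explanation.</p>
  <p>Second   paragraph, with a <a href="/source/">link</a>.</p>
</article>
<footer><p>Copyright notice</p></footer>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Narrow the search to the one container we care about
body = sample_soup.find('article', class_='story-body')

# Collect text from paragraphs inside that container only,
# collapsing runs of whitespace into single spaces
paragraphs = [' '.join(p.get_text().split()) for p in body.find_all('p')]
print(paragraphs)
```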

# URL collection with automated Google search<a id='URLs'></a>

If you want to crawl and/or scrape an online community of websites, there's a good chance you may find yourself needing to collect their URLs. If you're lucky, you have comprehensive metadata describing these entities, something like their name and physical address. Your next step in this scenario would be to automate a Google search to collect the best URL matching each entity. 

How can you scrape URLs from Google? There are two fairly easy ways.

First, the **Google Places API**, which is the best option to do this at scale. You would need to apply for an API key from Google: go to the [Google cloud console](https://console.cloud.google.com/), create a project, and request an API key for each service you want to use. Approval may take a few days, but once done there is a [handy Python wrapper](https://github.com/slimkrazy/python-google-places) to make this easy to use in Python. See [Google Web Services](https://developers.google.com/places/web-service/) for general documentation and [Google Developers](https://developers.google.com/places/web-service/details) for details on Place Details requests.

The second option is **automated Google search**, which is not nearly as reliable and may get you blocked if used repeatedly. This method tends to get lots of false positives and third-party website aggregators (e.g., yellowpages.com, trulia.com), so using an exclusion list to manually filter results is a good idea. Check out [the source code](https://github.com/MarioVilas/googlesearch) and [documentation](https://python-googlesearch.readthedocs.io/en/latest/). _Thanks Mario Vilas for this package!_

Because this second option is free and has no waiting period, we will practice using it (politely) here. If you want to pursue the first option further, there is template code for running the Google Places API at the bottom of this notebook.

_Note_: Remember what I said about following the Terms of Service for APIs? You might find real gems in there--like this extract from the [Google Maps Platform Terms of Service](https://developers.google.com/terms/) that prohibits scraping data you intend to store:

```
3.2.3 Restrictions Against Misusing the Services.

(a)  No Scraping. Customer will not export, extract, or otherwise scrape Google Maps Content for use outside the Services. For example, Customer will not: (i) pre-fetch, index, store, reshare, or rehost Google Maps Content outside the services; (ii) bulk download Google Maps tiles, Street View images, geocodes, directions, distance matrix results, roads information, places information, elevation values, and time zone details; (iii) copy and save business names, addresses, or user reviews; or (iv) use Google Maps Content with text-to-speech services.
```

Keep this in mind should you consider using the Google Places API for URL scraping (as my template code below does): The same terms apply, so be nice!

## Scraping school URLs<a id='school_URLs'></a>

To see how this works, let's start by searching for the best URL for a charter school in Washington, D.C. Assume we have the name and address of the school.

To prevent overwhelming Google search with rapid requests--and likely getting our IP address blocked by Google as a result--let's search only for the first 10 results and include a 2.5-second pause in between each request.

In [710]:
# Define metadata for a single entity: a DC charter school
school_name = 'Capital City Public Charter School'
school_address = '100 Peabody Street NW, Washington, DC 20011'

# Search for first 10 Google results using joined metadata, show each one
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=2.5):
    print(url)

https://www.ccpcs.org/
https://www.ccpcs.org/about/our-staff
https://www.ccpcs.org/about/our-staff/join-our-team
https://www.ccpcs.org/current-families/calendar
https://www.ccpcs.org/about/our-staff/high-school
https://www.niche.com/k12/capital-city-public-charter-school-washington-dc/
https://www.myschooldc.org/schools/profile/143
https://www.myschooldc.org/schools/profile/142
https://www.usnews.com/education/k12/district-of-columbia/capital-city-pcs-lower-school-226373
https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?Search=1&ID=110003500475


This is a pretty strong result: the first five matches share the domain of https://www.ccpcs.org/, so this is probably the best match. We identified a URL without even visiting any websites!

Notice that results 6-10 are about the right school, but they don't point to its genuine website--with all its descriptive language, images, and subpages. Even in this case with a strong topline result, we can already get a feel for which websites will pollute our automated searches: niche.com, usnews.com, and nces.ed.gov (along with frequent offenders like Facebook and greatschools.org) are a good start to making an exclusion list to filter the results. 
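
The "most results share one domain" reasoning can also be checked in code. Here is a minimal sketch that counts hostnames in the ten results printed above, using only the standard library (the variable names are for illustration):

```python
from collections import Counter
from urllib.parse import urlparse

# The ten search results printed above
results = [
    'https://www.ccpcs.org/',
    'https://www.ccpcs.org/about/our-staff',
    'https://www.ccpcs.org/about/our-staff/join-our-team',
    'https://www.ccpcs.org/current-families/calendar',
    'https://www.ccpcs.org/about/our-staff/high-school',
    'https://www.niche.com/k12/capital-city-public-charter-school-washington-dc/',
    'https://www.myschooldc.org/schools/profile/143',
    'https://www.myschooldc.org/schools/profile/142',
    'https://www.usnews.com/education/k12/district-of-columbia/capital-city-pcs-lower-school-226373',
    'https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?Search=1&ID=110003500475',
]

# Count how often each hostname appears among the results
domain_counts = Counter(urlparse(url).hostname for url in results)
top_domain, top_count = domain_counts.most_common(1)[0]
print(top_domain, top_count)
```

A domain that dominates the results is a reasonable candidate for the genuine site, though a frequent third-party aggregator could dominate too, so this is a heuristic rather than a guarantee.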

Now let's try something harder to find.

### Challenge

Collect the first 10 results from Google for Dr. David C. Walker Intermediate School located at 6500 Ih 35 N Ste C, San Antonio, TX 78218. What do you notice about the results? How do they compare to the previous set of results?

In [711]:
# Your solution here
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'

# Search for first 10 Google results using joined metadata, show each one
for url in search(school_name + ' ' + school_address, \
                  stop=10, pause=2.5):
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
https://www.mapquest.com/us/texas/dr-david-c-walker-int-475545773
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
http://www.trueschools.com/schools/texas/san-antonio/dr-david-c-walker-intermediate/
https://www.hisawyer.com/listings/providers/126444-dr-david-c-walker-elementary
https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html


These results are much less clear and organized: Nearly every one points to a different site, and all of them are third parties. Why would this be the case? 

This school site probably has poor SEO (search engine optimization).

## Scraping URLs using an exclusion list<a id='exclusionlist'></a>

To provide cleaner search results, let's filter out the third-party websites from the previous two examples. 

Many of these websites can show up with either 'http' or 'https', often with or without a 'www', but usually have a consistent top-level domain (e.g., 'com'). Exact string matching would fail to capture matches across these variations. Regular expressions could do this, but for now let's just filter out those search results that contain the core of any domain name in the exclusion list (e.g., niche.com). 
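
One caveat with plain substring matching is that it can over-match: 'har.com' is also a substring of an unrelated domain such as 'char.com'. A stricter alternative, sketched here under the assumption that matching on the hostname is what we want, compares against the parsed hostname instead. `is_excluded` is a hypothetical helper name, and `www.char.com` is a made-up URL used only to show the difference.

```python
from urllib.parse import urlparse

def is_excluded(url, exclusions):
    """Return True if the URL's hostname is an excluded domain or a subdomain of one.

    `is_excluded` is a hypothetical helper, not part of any library.
    """
    host = (urlparse(url).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain)
               for domain in exclusions)

sample_exclusions = ['niche.com', 'usnews.com', 'har.com']

niche_url = 'https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/'
made_up_url = 'https://www.char.com/'  # made-up URL for illustration

print(is_excluded(niche_url, sample_exclusions))        # hostname check
print(is_excluded(made_up_url, sample_exclusions))      # hostname check
print(any(d in made_up_url for d in sample_exclusions)) # substring check
```

The hostname check flags the niche.com result but not the made-up URL, whereas the substring check flags both.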

Let's get the first result for the previous school (Dr. David C. Walker Intermediate School) that doesn't match any domains in the exclusion list. 

In [712]:
# Define excluded domains to filter out: third-party domains/false positives that we DON'T want to scrape 
exclusions = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com']

# Define search metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'
#school_name = "River City Scholars Charter Academy"
#school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

# Collect search results
urls = search(school_name + ' ' + school_address, \
              stop=20, pause=5.0) # Expand search range to help avoid excluded domains
print("Successfully collected Google search results.")

# Initialize exclusion list match counter: How many excluded domains has this search encountered?
excluded_num = 0 
good_url = None # stays None if every search result matches the exclusion list

# Loop through google search output to find first good result:
for url in urls:
    if any(domain in url for domain in exclusions):
        print(f'Bad site detected: {url}') 
        excluded_num += 1 # Add one to exclusions list match counter
    else:
        good_url = url
        print("Success! URL obtained by Google search with " + str(excluded_num) + " bad URLs avoided.")
        break # Exit for loop after first good url is found
        
print(f'Quality URL: {good_url}')

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-elementary-206298
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Success! URL obtained by Google search with 4 bad URLs avoided.
Quality URL: https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037


What do you think of [the "quality" URL we landed on](https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037)? What does this mean about our exclusion list?

### Challenge

Improve our automated searching to try to get the genuine URL of Dr. David C. Walker Intermediate School. <br/>
_Hint_: You could try (A) adding more domains to the exclusion list OR (B) running the same search but collecting more URLs.

In [713]:
# Your solution here
exclusions = ['facebook.com', 'greatschools.org', 'niche.com', 'har.com', 'usnews.com', 'publicschoolreview.com', 
             'nces.ed.gov', 'dnb.com', 'schooldigger.com', 'elementaryschools.org', 'closelocation.com']

school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'
#school_name = "River City Scholars Charter Academy"
#school_address = "944 Evergreen Street, Grand Rapids, MI 49507"

urls = search(school_name + ' ' + school_address, \
              stop=20, pause=2.5) 
print("Successfully collected Google search results.")

excluded_num = 0 
good_urls = []

# Loop through google search output to find all good results:
for url in urls:
    if any(domain in url for domain in exclusions):
        excluded_num += 1 # Add one to exclusions list match counter
    else:
        good_url = url
        good_urls.append(good_url)
        
good_urls_string = "\n".join(good_urls)
print("Quality URLS:")
print(good_urls_string)

Successfully collected Google search results.
Quality URLS:
https://www.mapquest.com/us/texas/dr-david-c-walker-intermediate-school-438581037
https://www.mapquest.com/us/texas/dr-david-c-walker-int-475545773
http://www.trueschools.com/schools/texas/san-antonio/dr-david-c-walker-intermediate/
https://www.hisawyer.com/listings/providers/126444-dr-david-c-walker-elementary
https://www.citydirectory.us/school-dr-david-c-walker-elementary-san-antonio-tx.html
https://www.homefacts.com/schools/Texas/Bexar-County/San-Antonio/Dr-David-C-Walker-El.html
https://wheretoteach.com/dr-david-c-walker-elementary-22789
https://www.century21.com/schools/78218-san-antonio-tx-schools/dr-david-c-walker-intermediate-school/O10775139-LZ78218
https://www.donorschoose.org/schools/texas/school-of-excellence-in-education/dr-david-walker-elementary-school/95612
https://www.k12jobspot.com/District/1117/Schools
http://www.localschooldirectory.com/public-school/345352660/TX
http://www.loresult.com/us-zip-codes/united