# QF 625 Introduction to Programming
## Lesson 07 | An Introduction to Web Scraping | `RE`view 

### A task :)

> We will be parsing information collected from the Internet, specifically from `Wikipedia`.  

> Let's take a look at the HTML source code that powers the page about Pennsylvania:

- Open up https://en.wikipedia.org/wiki/Pennsylvania in your browser

- Right click and select "Inspect" or "Inspect Element"

- Alternatively:

  - _Chrome_ -- View > Developer > View Source
  - _Safari_ -- Develop > Show Web Inspector 
  - _Firefox_ -- Tools > Web Developer > Inspector

> The same HTML code we have been exploring is used to produce the structure of just about every webpage you visit.

> Note that we will be scraping Wikipedia for learning purposes, but you can simply download its content instead.  Check [this](https://en.wikipedia.org/wiki/Wikipedia:Database_download) out to learn more.

### Introduction to `requests`

> The module `requests` allows us to retrieve information from the web.  

> We need to import this package.

In [1]:
import requests

> Let's use the `.get()` method to retrieve a page's HTML.

In [2]:
url = "https://en.wikipedia.org/wiki/Pennsylvania"

response = requests.get(url)

response

<Response [200]>

> So, what's in a response?

> This object gives us a few important things:

- `response.text` -- the returned HTML (if any)

- `response.status_code` -- a [code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) to tell you if your request was successful :)

##### 200 = success

In [3]:
response.status_code

200
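> The status code can also be interpreted in code rather than by eye.  The sketch below uses the standard library's `http.HTTPStatus` to turn a numeric code into a readable label.  The helper name `describe_status` and the `CATEGORIES` table are made up for this illustration; they are not part of `requests`.

```python
from http import HTTPStatus

# First digit of the code -> broad meaning (illustrative lookup table)
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return f"{code} {status.phrase} ({CATEGORIES[code // 100]})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

> In the notebook you could pass `response.status_code` to such a helper.  `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.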

> Let's print the first 200 characters of the HTML.

In [4]:
print(response.text[:200])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Pennsylvania - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wg


In [5]:
page = response.text

In [6]:
type(page)

str

### Using `requests` with `BeautifulSoup`

> Now that we have the HTML, we use `BeautifulSoup` to understand its structure.

In [9]:
from bs4 import BeautifulSoup as bs

In [10]:
soup = bs(page)
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Pennsylvania - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"5424c8bc-d146-46f6-a062-ec835effa5d9","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Pennsylvania","wgTitle":"Pennsylvania","wgCurRevisionId":976204200,"wgRevisionId":976204200,"wgArticleId":23332,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from July 2010","Webarchive template wayback links","Articles with dead external links from March 20

`BeautifulSoup` has now parsed through the HTML about Pennsylvania, so we can look for things like the header tag:

In [11]:
soup.find("h1")

<h1 class="firstHeading" id="firstHeading" lang="en">Pennsylvania</h1>

In [12]:
soup.find("h1").text

'Pennsylvania'

To extract information from the web, you will alternate between: 
- Inspecting the HTML in your browser
- Using `BeautifulSoup` to find information

### Quick Exercises

**Disambiguation link**

> Wikipedia often provides a disambiguation link to a list of additional topics that could reference the same term.  

> For example, searching for "Pennsylvania" directs to this article about the state, but "Pennsylvania" may instead refer to the railroad, a ship, or a music album.

> Let's try to extract this disambiguation link by inspecting the source code and then using `BeautifulSoup`.  

In [13]:
soup.find(class_ = "mw-disambig")

<a class="mw-disambig" href="/wiki/Pennsylvania_(disambiguation)" title="Pennsylvania (disambiguation)">Pennsylvania (disambiguation)</a>

In [14]:
soup.find(class_ = "mw-disambig")["href"]

'/wiki/Pennsylvania_(disambiguation)'

**Longitude, Latitude**
> How would you retrieve Pennsylvania's longitude and latitude coordinates from this page?

*Hint: Right click the coordinates directly and then select inspect.*

In [15]:
soup.find(class_ = "geo-dec")

<span class="geo-dec" title="Maps, aerial photos, and other data for this location">41°N 77.5°W</span>

In [16]:
soup.find(class_ = "geo-dec").text

'41°N 77.5°W'

In [17]:
soup.find("span", class_ = "geo-dec").text

'41°N 77.5°W'

### Advancing Further

#### Chaining commands

> You might want to first isolate a division of the HTML and then look for tags within the division.  

> The returned element(s) of any `find()` or `find_all()` command are themselves `BeautifulSoup` elements.  

> This means we can continue searching for information within them.

> Let's take a look at the first table on the page.

*Note: Adding `.prettify()` below just prints each HTML element on its own line.*

In [18]:
print(soup.find("table").prettify())

<table class="infobox geography vcard" style="width:22em;width:23em">
 <tbody>
  <tr>
   <th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:1.25em; white-space:nowrap">
    <div class="fn org" style="display:inline">
     Pennsylvania
    </div>
   </th>
  </tr>
  <tr>
   <td colspan="2" style="text-align:center;background-color:#cddeff; font-weight:bold;">
    <div class="category">
     <a href="/wiki/U.S._state" title="U.S. state">
      State
     </a>
    </div>
   </td>
  </tr>
  <tr class="mergedtoprow">
   <td colspan="2" style="text-align:center;font-weight:bold;">
    Commonwealth of Pennsylvania
   </td>
  </tr>
  <tr class="mergedtoprow">
   <td class="maptable" colspan="2" style="text-align:center">
    <div style="display:table; width:100%; background:none;">
     <div style="display:table-row">
      <div style="display:table-cell;vertical-align:middle; text-align:center;">
       <a class="image" href="/wiki/File:Flag_of_Pennsylvania.svg"

In [19]:
first_table = soup.find("table")

In [20]:
type(first_table)

bs4.element.Tag

> Now find the header text of this table and the text of the first data cell.

In [21]:
first_table.find("th").text

'Pennsylvania'

In [22]:
first_table.find("td").text

'State'

> Also note that instead of saving the table as its own Python variable, you could just chain these searches together.

In [23]:
state = soup.find("table").find("th").text
state == "Pennsylvania"

True

In [24]:
for r in soup.find("table").find_all("tr")[:20]:
    print(r.text)

Pennsylvania
State
Commonwealth of Pennsylvania

FlagSeal
Nickname(s): Keystone State;[1] Quaker State
Motto(s): Virtue, Liberty and Independence
Anthem: "Pennsylvania"
Map of the United States with Pennsylvania highlighted
CountryUnited States
Before statehoodProvince of Pennsylvania
Admitted to the UnionDecember 12, 1787 (2nd)
CapitalHarrisburg
Largest cityPhiladelphia
Largest metroDelaware Valley
Government
 • GovernorTom Wolf (D)
 • Lieutenant GovernorJohn Fetterman (D)
LegislatureGeneral Assembly
 • Upper houseState Senate
 • Lower houseHouse of Representatives


> You can continue chaining down through as much of the HTML DOM as you'd like!

In [25]:
(soup
 .find("div", id = "content")
 .find("div", id = "bodyContent")
 .find("div", id = "mw-content-text")
 .find("div", role = "note")
).text

'This article is about the U.S. state. For other uses, see Pennsylvania (disambiguation).'

### Locating information by position

> We just saw that basic facts about Pennsylvania can be found within the first table of this page.  Now let's get more specific.

> How can we extract the date Pennsylvania was admitted to the union?

In [26]:
soup.find("table").find_all("td")[9]

<td>December 12, 1787 (2nd)</td>

> It's the tenth element in this list, but `what happens to this code if someone edits` the Wikipedia table to include additional information?

> Sometimes it is better or necessary to find information by text matching, but be careful -- this needs to be an exact match!

In [27]:
soup.find(text = "Admitted")

In [28]:
soup.find(text = "Admitted to the Union")

'Admitted to the Union'

> Alternatively, we could use [regular expressions](https://docs.python.org/3/library/re.html).

In [29]:
import re

In [30]:
admitted_regex = re.compile("Admitted")
soup.find(text = admitted_regex)

'Admitted to the Union'
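> Why does `"Admitted"` find nothing while the compiled pattern succeeds?  A plain string must equal the whole text, whereas a compiled regular expression is applied with its `search()` method and so matches anywhere inside the text.  The sketch below reproduces that difference on ordinary Python strings; the `labels` list is sample data copied from the table rows above.

```python
import re

labels = ["Country", "Before statehood", "Admitted to the Union", "Capital"]

# Exact matching: the whole string must be equal to the query
exact = [label for label in labels if label == "Admitted"]

# Regex matching: search() succeeds if the pattern occurs anywhere in the string
pattern = re.compile("Admitted")
partial = [label for label in labels if pattern.search(label)]

print(exact)    # []
print(partial)  # ['Admitted to the Union']
```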

> This looks like a string, but it's actually a `BeautifulSoup` element.

In [31]:
admitted = soup.find(text = "Admitted to the Union")

In [32]:
type(admitted)

bs4.element.NavigableString

> So we can use it to traverse the DOM.  

> Here, we will find the next element in the tree.

In [33]:
admitted.next

<td>December 12, 1787 (2nd)</td>

In [34]:
admitted.next.text

'December 12, 1787 (2nd)'

> For some cases it's much easier to find one element and then move up, down, or sideways within the DOM.  

> `BeautifulSoup` also allows you to look for `.parent`, `.children`, `.next_sibling`, `.previous_sibling`, [etc.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree)

In [35]:
admitted.parent

<a href="/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union#List_of_U.S._states" title="List of U.S. states by date of admission to the Union">Admitted to the Union</a>

In [36]:
admitted.parent.parent

<th scope="row"><a href="/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union#List_of_U.S._states" title="List of U.S. states by date of admission to the Union">Admitted to the Union</a></th>

**Tip**: FYI, any "plural" attribute such as `children`, `next_siblings`, or `previous_siblings` returns an iterator rather than a list.  Just loop over the result or convert it to a list.

### Quick Exercises

**Capital City**

> Write code to extract the capital of Pennsylvania from the main table without using list positions.

In [37]:
soup.find(text = "Capital")

'Capital'

In [38]:
soup.find(text = "Capital").next

<td><a href="/wiki/Harrisburg,_Pennsylvania" title="Harrisburg, Pennsylvania">Harrisburg</a></td>

In [39]:
soup.find(text = "Capital").next.text

'Harrisburg'
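> `.next` moves to the very next node in the document, which happens to be the `<td>` here.  If whitespace or another tag sat between the label and its value, `.next` would return that instead.  A slightly more defensive sketch uses `find_next("td")`.  The helper name `value_for_label` and the small HTML sample are made up for illustration, and the sketch assumes rows shaped like `<th>Label</th><td>Value</td>`.  It also uses `string=`, the newer name for the `text=` argument.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the infobox (made-up sample, not the live page)
html = """
<table>
  <tr><th>Capital</th><td><a href="/wiki/Harrisburg">Harrisburg</a></td></tr>
  <tr><th>Largest city</th>
      <td>Philadelphia</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")  # name the parser explicitly

def value_for_label(soup, label):
    """Return the text of the first <td> after the given label, or None."""
    label_node = soup.find(string=label)
    if label_node is None:
        return None  # label not present on this page
    cell = label_node.find_next("td")  # skips whitespace and other nodes in between
    return cell.get_text(strip=True) if cell else None

print(value_for_label(soup_demo, "Capital"))       # Harrisburg
print(value_for_label(soup_demo, "Largest city"))  # Philadelphia
print(value_for_label(soup_demo, "Governor"))      # None
```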

**Reference `Links`**

> Print out the text of the first three references (at the bottom of the page).  For an added bonus: can you also print all the external links from these three references?

In [40]:
soup.find(class_ = "references")

<ol class="references">
<li id="cite_note-1"><span class="mw-cite-backlink"><b><a href="#cite_ref-1">^</a></b></span> <span class="reference-text"><cite class="citation web cs1"><a class="external text" href="http://www.portal.state.pa.us/portal/server.pt/community/things/4280/symbols_of_pennsylvania/478690" rel="nofollow">"Symbols of Pennsylvania"</a>. Portal.state.pa.us. <a class="external text" href="https://web.archive.org/web/20071014215922/http://www.phmc.state.pa.us/bah/pahist/symbols.asp?secid=31" rel="nofollow">Archived</a> from the original on October 14, 2007<span class="reference-accessdate">. Retrieved <span class="nowrap">May 4,</span> 2014</span>.</cite><span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=unknown&amp;rft.btitle=Symbols+of+Pennsylvania&amp;rft.pub=Portal.state.pa.us&amp;rft_id=http%3A%2F%2Fwww.portal.state.pa.us%2Fportal%2Fserver.pt%2Fcommunity%2Fthings%2F4280%2Fsymbols_of_pennsylvania%2F478690&a

In [45]:
soup.find(class_ = "references").find_all("cite")[:10]

[<cite class="citation web cs1"><a class="external text" href="http://www.portal.state.pa.us/portal/server.pt/community/things/4280/symbols_of_pennsylvania/478690" rel="nofollow">"Symbols of Pennsylvania"</a>. Portal.state.pa.us. <a class="external text" href="https://web.archive.org/web/20071014215922/http://www.phmc.state.pa.us/bah/pahist/symbols.asp?secid=31" rel="nofollow">Archived</a> from the original on October 14, 2007<span class="reference-accessdate">. Retrieved <span class="nowrap">May 4,</span> 2014</span>.</cite>,
 <cite class="citation web cs1"><a class="external text" href="https://web.archive.org/web/20111015012701/http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html" rel="nofollow">"Elevations and Distances in the United States"</a>. <a href="/wiki/United_States_Geological_Survey" title="United States Geological Survey">United States Geological Survey</a>. 2001. Archived from <a class="external text" href="http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadi

In [41]:
for reference in soup.find(class_ = "references").find_all("cite")[:3]:
    print(reference.text)

"Symbols of Pennsylvania". Portal.state.pa.us. Archived from the original on October 14, 2007. Retrieved May 4, 2014.
"Elevations and Distances in the United States". United States Geological Survey. 2001. Archived from the original on October 15, 2011. Retrieved October 24, 2011.
"Median Annual Household Income". The Henry J. Kaiser Family Foundation. Archived from the original on December 20, 2016. Retrieved December 9, 2016.


In [42]:
reference3 = soup.find(class_ = "references").find_all("cite")[:3]

for reference in reference3:
    for link in reference.find_all("a"):
        print(link["href"])

http://www.portal.state.pa.us/portal/server.pt/community/things/4280/symbols_of_pennsylvania/478690
https://web.archive.org/web/20071014215922/http://www.phmc.state.pa.us/bah/pahist/symbols.asp?secid=31
https://web.archive.org/web/20111015012701/http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
/wiki/United_States_Geological_Survey
http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
https://web.archive.org/web/20161220091007/http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0
http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0


In [43]:
reference3 = soup.find(class_ = "references").find_all("cite")[:3]

for reference in reference3:
    for link in reference.find_all("a", class_ = "external"):
        print(link["href"])

http://www.portal.state.pa.us/portal/server.pt/community/things/4280/symbols_of_pennsylvania/478690
https://web.archive.org/web/20071014215922/http://www.phmc.state.pa.us/bah/pahist/symbols.asp?secid=31
https://web.archive.org/web/20111015012701/http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
http://egsc.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
https://web.archive.org/web/20161220091007/http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0
http://kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0


### Data Preparation

> Now that we know how to gather information from the web, what do we do with it?

This data can be
- aggregated to look for trends
- visualized to understand patterns
- leveraged with machine learning algorithms


But first we need to 
- convert several strings into numerical or datetime values
- collect and store data from multiple pages (next section)

**Tip**: Most web scraping projects rely on multiple pages of information, each of which serves as one data observation.  In this case, we might collect data about Pennsylvania and then collect the same kinds of information for all 50 US states before analyzing or visualizing the data.

### Data processing

#### Date Admitted

> Previously, we collected the date that Pennsylvania was admitted to the union.

In [44]:
date_admitted_text = admitted.next.text
date_admitted_text

'December 12, 1787 (2nd)'

> We need Python to recognize this as a date for further analyses.  

> Let's narrow down to just the date part of the string.

In [45]:
date_admitted_list = date_admitted_text.split(" ")[:-1]
date_admitted_list

['December', '12,', '1787']

In [46]:
date_admitted_str = " ".join(date_admitted_list)
date_admitted_str

'December 12, 1787'

> Then, let's convert this string into a datetime data type.

In [47]:
import dateutil.parser

In [48]:
date_admitted = dateutil.parser.parse(date_admitted_str)
date_admitted

datetime.datetime(1787, 12, 12, 0, 0)

In [49]:
type(date_admitted)

datetime.datetime

In [50]:
date_admitted.year

1787
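> If you would rather not depend on `dateutil`, the same conversion can be sketched with the standard library alone.  This is only an alternative under stated assumptions: the text starts with an English "Month day, year" date, and anything after it (such as `(2nd)`) can be discarded.  The function name `parse_admission_date` is hypothetical.

```python
import re
from datetime import datetime

def parse_admission_date(text):
    """Parse strings like 'December 12, 1787 (2nd)' into a datetime."""
    # Keep only the leading 'Month day, year' part and drop any suffix
    match = re.match(r"[A-Za-z]+ \d{1,2}, \d{4}", text)
    if match is None:
        raise ValueError(f"Unrecognized date format: {text!r}")
    # %B = full month name, %d = day of month, %Y = four-digit year
    return datetime.strptime(match[0], "%B %d, %Y")

print(parse_admission_date("December 12, 1787 (2nd)"))  # 1787-12-12 00:00:00
```

> Unlike `dateutil.parser.parse`, `strptime` accepts exactly one format, so it fails loudly when a page formats its date differently.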

#### Population and Area

> Another quantity that might be useful if we want to compare Pennsylvania to other US states is population.  

> Let's look for the word "Total" and use the same trick we tried before.





In [51]:
soup.find(text = re.compile("Total"))

'\xa0•\xa0Total'

In [52]:
soup.find(text = re.compile("Total")).next

<td>46,055 sq mi (119,283 km<sup>2</sup>)</td>

> That's not the population!  

> Looks like total area is also next to a "Total" label.  

> Let's save that and come back to it later.

In [53]:
area_text = soup.find(text = re.compile("Total")).next.text

> How might we explicitly look for the population total?

In [54]:
soup.find(text = "Population")

'Population'

In [55]:
soup.find(text = "Population").parent

<th colspan="2" style="text-align:center;text-align:left">Population<div style="font-weight:normal;display:inline;"><span class="nowrap"> </span>(2019)</div></th>

In [56]:
soup.find(text = "Population").parent.parent

<tr class="mergedtoprow"><th colspan="2" style="text-align:center;text-align:left">Population<div style="font-weight:normal;display:inline;"><span class="nowrap"> </span>(2019)</div></th></tr>

In [57]:
soup.find(text = "Population").parent.parent.next_sibling

<tr class="mergedrow"><th scope="row"> • Total</th><td>12,801,989</td></tr>

In [58]:
soup.find(text = "Population").parent.parent.next_sibling.find("td")

<td>12,801,989</td>

In [59]:
population_text = soup.find(text = "Population").parent.parent.next_sibling.find("td").text
population_text

'12,801,989'

> Sometimes you need to continue traversing the DOM until you find the information you need!

> Now let's convert that string into an integer.

In [60]:
population = int(population_text.replace(",", ""))
population

12801989

> Often it's useful to write functions to help you clean up your data.  

> Let's do that now so we can reuse these steps.

In [61]:
# This is a gift.

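def to_int(number_str):
    # Keep only the leading run of digits, commas, and dollar signs
    number_str = re.match(r"[\d,$]+", number_str)[0]
    # Strip the currency symbol and thousands separators before converting
    number_str = number_str.replace("$", "").replace(",", "")
    return int(number_str)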

> Now we can use our `to_int` function to clean up the area text we found previously.  

> This text actually contains special spaces so we will use regular expressions (regex) to capture just the first digits in the `to_int` function.

In [62]:
area_text

'46,055\xa0sq\xa0mi (119,283\xa0km2)'

In [63]:
area = to_int(area_text)
area

46055
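> The pattern stops at the first character that is not a digit, comma, or dollar sign, which is why the non-breaking space `\xa0` ends the match.  The variant below is a sketch, not a replacement for `to_int`: it makes the same idea explicit with a capture group and returns `None` when the text does not start with a number.  The name `leading_int` is made up for this example.

```python
import re

def leading_int(text):
    """Return the number at the start of `text` as an int, or None if there is none."""
    # Optional dollar sign, then a digit followed by more digits or commas
    match = re.match(r"\$?(\d[\d,]*)", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

print(leading_int("46,055\xa0sq\xa0mi (119,283\xa0km2)"))  # 46055
print(leading_int("$59,195[4]"))                            # 59195
print(leading_int("unknown"))                               # None
```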

In [64]:
# This is the second gift.

def to_date(date_str):
    # Keep the leading letters, digits, spaces, and commas (drops e.g. "(2nd)")
    date_str = re.match(r"[\w\s,]+", date_str)[0]
    return dateutil.parser.parse(date_str)

### Data storage

> Now let's put all the information we have about Pennsylvania together.

In [65]:
penn_dict = {
    "state" : state,
    "date_admitted" : date_admitted,
    "population" : population,
    "area_sq_mi" : area
}

In [66]:
penn_dict

{'state': 'Pennsylvania',
 'date_admitted': datetime.datetime(1787, 12, 12, 0, 0),
 'population': 12801989,
 'area_sq_mi': 46055}

> Once we have this information in dictionary form, we can build a `pandas` dataframe with it and eventually perform further analyses or save it to our computer.

In [67]:
import pandas as pd

In [68]:
penn_info = [penn_dict]

In [69]:
penn_df = pd.DataFrame(penn_info)

In [70]:
penn_df

| | state | date_admitted | population | area_sq_mi |
|---|---|---|---|---|
| 0 | Pennsylvania | 1787-12-12 | 12801989 | 46055 |


### Quick Exercises

**Median Household Income**

> Get the median household income for the state of Pennsylvania as a text string and then as an integer.

In [71]:
soup.find(text = "Median household income")

'Median household income'

In [72]:
soup.find(text = "Median household income").next

<div style="font-weight:normal;display:inline;"></div>

In [73]:
soup.find(text = "Median household income").next.next

<td>$59,195<sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[4]</a></sup></td>

In [74]:
mhi_text = soup.find(text = "Median household income").next.next.text

In [75]:
mhi = to_int(mhi_text)
mhi

59195

**Median Household Income - Part 2** 

> Build a dataframe named `state_df` that also includes median household income. 

(Hint: One way you can do this: add median household income to `penn_dict` and create `state_df` from it.)

In [76]:
penn_dict["median_household_income"] = mhi

In [77]:
state_df = pd.DataFrame([penn_dict])
state_df

| | state | date_admitted | population | area_sq_mi | median_household_income |
|---|---|---|---|---|---|
| 0 | Pennsylvania | 1787-12-12 | 12801989 | 46055 | 59195 |


### ***Pipeline*** Considerations

> Now that we can extract numerical data from this page about Pennsylvania, how would we build out a full analytic or data science project? 

> The next step is to systematically retrieve this information from the Wikipedia page of each US state.  First, let's build reusable functions to find the state's
- name
- date admitted
- population
- area
- median household income

***Note that all of this info can be found in the table on the right side of the page.***

In [78]:
# State Name

def get_name(table):
    raw_name = table.find("th").text
    # [A-Za-z] matches letters only; [A-z] would also match [ \ ] ^ _ and `
    return re.match(r"[A-Za-z\s]+", raw_name)[0]

In [79]:
# Date admitted to the Union

def get_date_admitted(table):
    raw_date = table.find(text = "Admitted to the Union").next.text
    return to_date(raw_date) # YAY!

In [80]:
# Population

def get_population(table):
    raw_population = table.find(text = "Population").parent.parent.next_sibling.find("td").text
    return to_int(raw_population)

In [81]:
# Area

def get_area(table):
    raw_area = table.find(text = re.compile("Total")).next.text
    return to_int(raw_area)

In [82]:
# Income

def get_income(table):
    raw_income = table.find(text = "Median household income").next.next.text
    return to_int(raw_income)

> These functions will extract information from any Wikipedia state table we pass into them.  

> For example, let's try parsing the page for New York.

In [83]:
ny_url = "https://en.wikipedia.org/wiki/New_York_(state)"

ny_page = requests.get(ny_url).text

ny_soup = bs(ny_page)

In [84]:
ny_table = ny_soup.find("table")

In [85]:
get_name(ny_table)

'New York'

In [86]:
get_date_admitted(ny_table)

datetime.datetime(1788, 7, 26, 0, 0)

In [87]:
get_population(ny_table)

19453561

In [88]:
get_area(ny_table)

54555

In [89]:
get_income(ny_table)

64894

### Yes, we want to scale this up :)

Let's also make a function to gather all five values from a given state Wiki page and return the information as a dictionary.

In [90]:
def parse_url(url):
    page = requests.get(url).text
    return bs(page)

In [91]:
def get_state_info(state_url):
    
    # Parse the page and extract the main table
    state_soup = parse_url(state_url)
    state_table = state_soup.find("table")
    
    state_info = {}
    
    # Let's extract info with our defined functions above
    state_info["state"] = get_name(state_table)
    state_info["date_admitted"] = get_date_admitted(state_table)
    state_info["population"] = get_population(state_table)
    state_info["area"] = get_area(state_table)
    state_info["median_household_income"] = get_income(state_table)
    
    return state_info

In [92]:
test_url = "https://en.wikipedia.org/wiki/Alabama"

alabama_info = get_state_info(test_url)

alabama_info

{'state': 'Alabama',
 'date_admitted': datetime.datetime(1819, 12, 14, 0, 0),
 'population': 4903185,
 'area': 52419,
 'median_household_income': 48123}

### Lists of links

> The next step in our process will require us to use our `get_state_info()` function on the URLs of each of the 50 US states.  

> But how do we know which URLs to visit?  

> We might be able to guess that the page for Rhode Island is https://en.wikipedia.org/wiki/Rhode_Island but not all pages follow this convention.

In [93]:
ny_url

'https://en.wikipedia.org/wiki/New_York_(state)'

> Instead of guessing, let's first gather these links from this "[List of States and Territories of the United States](https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States)" article. 

> Click on this link and inspect the page to develop a plan for doing this.

> It looks like all of the states are listed in the first table of the page.  

> Each state name and link is contained within a table header tag (`th`) that has the additional property `scope`="row".

In [94]:
state_list_url = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"

list_page = requests.get(state_list_url).text

list_soup = bs(list_page)

In [95]:
list_soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of states and territories of the United States - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"0fbcafc8-998b-4aa0-aec0-d968fd1ea86a","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_states_and_territories_of_the_United_States","wgTitle":"List of states and territories of the United States","wgCurRevisionId":975758572,"wgRevisionId":975758572,"wgArticleId":12610470,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: unfit url","CS1 Spanish-language sources (e

In [96]:
state_rows = list_soup.find_all("table")[0].find_all("th", scope = "row")
state_rows[:10]

[<th scope="row"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="400" data-file-width="600" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/23px-Flag_of_Alabama.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/35px-Flag_of_Alabama.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Alabama.svg/45px-Flag_of_Alabama.svg.png 2x" width="23"/> </span><a href="/wiki/Alabama" title="Alabama">Alabama</a>
 </th>,
 <th scope="row"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="1000" data-file-width="1416" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Flag_of_Alaska.svg/21px-Flag_of_Alaska.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Flag_of_Alaska.svg/33px-Flag_of_Alaska.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Flag_of_Alaska.svg/43px-Flag_

> Now we just need to extract the links from the `a` tags.

In [97]:
state_rows[0].find("a") # anchor tag

<a href="/wiki/Alabama" title="Alabama">Alabama</a>

In [98]:
state_rows[0].find("a")["href"]

'/wiki/Alabama'

In [99]:
state_links = [row.find("a")["href"] for row in state_rows]
state_links[:15]

['/wiki/Alabama',
 '/wiki/Alaska',
 '/wiki/Arizona',
 '/wiki/Arkansas',
 '/wiki/California',
 '/wiki/Colorado',
 '/wiki/Connecticut',
 '/wiki/Delaware',
 '/wiki/Florida',
 '/wiki/Georgia_(U.S._state)',
 '/wiki/Hawaii',
 '/wiki/Idaho',
 '/wiki/Illinois',
 '/wiki/Indiana',
 '/wiki/Iowa']

> Each of these links points to a place within Wikipedia, but if we want the full URLs, we have to prepend 'https://en.wikipedia.org' to each of them.

In [100]:
base_url = "https://en.wikipedia.org" # This is what we need to prepend to each link.

state_urls = [base_url + link for link in state_links]

state_urls[:7]

['https://en.wikipedia.org/wiki/Alabama',
 'https://en.wikipedia.org/wiki/Alaska',
 'https://en.wikipedia.org/wiki/Arizona',
 'https://en.wikipedia.org/wiki/Arkansas',
 'https://en.wikipedia.org/wiki/California',
 'https://en.wikipedia.org/wiki/Colorado',
 'https://en.wikipedia.org/wiki/Connecticut']
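> String concatenation works here because every link starts with `/wiki/`.  As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a link against a base URL and leaves links that are already absolute untouched.  The `links` list below is sample data, and the third entry is included only to show the absolute-URL case.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org"

links = [
    "/wiki/Alabama",                       # relative link, as scraped above
    "/wiki/Georgia_(U.S._state)",          # parentheses are kept as-is
    "https://en.wikipedia.org/wiki/Utah",  # already absolute: returned unchanged
]

full_urls = [urljoin(base_url, link) for link in links]

for full_url in full_urls:
    print(full_url)
```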

In [101]:
state_urls[-7:]

['https://en.wikipedia.org/wiki/Utah',
 'https://en.wikipedia.org/wiki/Vermont',
 'https://en.wikipedia.org/wiki/Virginia',
 'https://en.wikipedia.org/wiki/Washington_(state)',
 'https://en.wikipedia.org/wiki/West_Virginia',
 'https://en.wikipedia.org/wiki/Wisconsin',
 'https://en.wikipedia.org/wiki/Wyoming']

In [102]:
len(state_urls) == 50

True

### Handling missing values

> We will eventually be cycling through these state links to collect and store information about every state.  

> But what happens when certain information is unavailable?  

> That is, what if the Georgia page is missing area information or the median household income isn't listed for Nevada?

> We can make our code more robust by including instructions for handling missing information.  

> ***One way to do this is to include `try`/`except` statements.***

> If you haven't seen them before, `try`/`except` pairs are used to let Python know how to handle errors.

In [103]:
def square(x):
    return x*x

In [104]:
square("hello") # This won't work!

TypeError: can't multiply sequence by non-int of type 'str'

In [105]:
def square_robust(x):
    try:
        return x*x
    except TypeError:
        return "This won't work!"

In [106]:
square_robust("hello")

"This won't work!"

> In the case of web scraping, we may be scraping information from many, many pages.  

> If any of the information we want can't be found, we usually don't want Python to exit the program with an error.  

> We typically prefer that Python continue the scraping but just fill in that particular piece of information with a missing value like `None`.

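```
def my_scraper(page):
    # `page` is assumed to be a parsed BeautifulSoup object
    try:
        # Perform some parsing, e.g. read the text of the first header
        my_scraped_value = page.find("h1").text
        return my_scraped_value
    except Exception:
        # The information is missing or malformed: record a missing value
        return None
```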
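> A bare `except:` also swallows things you usually want to see, such as a `KeyboardInterrupt` when you try to stop a long scrape.  The sketch below narrows the handler to errors that parsing typically produces: `AttributeError` when `.text` is read from `None`, `TypeError` when a failed `re.match` result is indexed, `ValueError` from `int()` or date parsing, and `IndexError`/`KeyError` from missing positions or attributes.  The helper name `safe_extract` is hypothetical.

```python
def safe_extract(func, *args, default=None):
    """Call func(*args); return `default` if a typical parsing error occurs."""
    try:
        return func(*args)
    except (AttributeError, TypeError, ValueError, IndexError, KeyError):
        return default

print(safe_extract(int, "12801989"))               # 12801989
print(safe_extract(int, "not a number"))           # None
print(safe_extract(lambda tag: tag["href"], {}))   # None
```

> Any other exception still surfaces, which makes genuine bugs easier to notice.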

### A Robust Scraper :)

> Let's update the collection of our state info to be `robust to handling missing values`.

In [107]:
def get_state_info_robust(state_url):
    
    # If the page cannot be downloaded or parsed,
    # print out the URL and exit the function.
    # Note: find() returns None (without raising) when a page has no table;
    # in that case every field below is filled with None.
    
    try:
        state_soup = parse_url(state_url)
        state_table = state_soup.find("table")
    except Exception:
        print(f"Cannot parse table: {state_url}") # We want to identify the source of a potential error
        return None
    
    state_info = {}

    # Let's extract info with our defined functions
    
    values = ["state",
              "date_admitted",
              "population",
              "area",
              "median_household_income"]
    
    functions = [get_name,
                 get_date_admitted,
                 get_population,
                 get_area,
                 get_income]

    # If any value cannot be found, we will fill value with None
    
    for v, f in zip(values, functions):
        try:
            state_info[v] = f(state_table)
        except Exception:
            state_info[v] = None
    
    return state_info

In [108]:
california_url = "https://en.wikipedia.org/wiki/California"

ca_dict = get_state_info_robust(california_url)

ca_dict

{'state': 'California',
 'date_admitted': datetime.datetime(1850, 9, 9, 0, 0),
 'population': 39512223,
 'area': 163696,
 'median_household_income': 71228}

##### What if the target URL does NOT contain any table to extract?

In [109]:
get_state_info_robust("http://www.talktoroh.com/")

{'state': None,
 'date_admitted': None,
 'population': None,
 'area': None,
 'median_household_income': None}
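> Note that a dictionary whose values are all `None` is still "truthy" in Python, so a check such as `if state_info:` does not filter it out.  One possible way to detect such empty records is sketched below.  The helper name `has_data` and the sample records are made up for illustration.

```python
def has_data(record):
    """Return True if `record` is a dict with at least one non-missing value."""
    return bool(record) and any(value is not None for value in record.values())

# Made-up sample records
all_missing = {"state": None, "date_admitted": None, "population": None}
partly_filled = {"state": "Example", "date_admitted": None, "population": 123}

print(bool(all_missing))        # True  (a non-empty dict is always truthy)
print(has_data(all_missing))    # False
print(has_data(partly_filled))  # True
```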

### Adding pauses

> We have just one final consideration before we cycle through the state links to scrape information.  

> Web scraping at a fast rate--that is, many pages per second--is frowned upon by many websites, Wikipedia included.  

> We will add in artificial pauses so we don't overwhelm the Wikipedia server.

In [110]:
import time

In [111]:
time.sleep(2)

In [112]:
a = 3

print(f"Pausing for {a} seconds")
time.sleep(a)

Pausing for 3 seconds
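> The pause can be wrapped into a small helper so that it is never forgotten inside a loop.  This is a sketch under assumptions: `fetch_politely` is a hypothetical name, and both the `fetch` and `sleep` functions are passed in so the demo below runs instantly without downloading anything.

```python
import time

def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function and a recorded "sleep"
pauses = []
pages = fetch_politely(["a", "b", "c"], fetch=str.upper, delay=1.0, sleep=pauses.append)

print(pages)   # ['A', 'B', 'C']
print(pauses)  # [1.0, 1.0]
```

> In real use you would leave `sleep` at its default and pass a function that downloads and parses a page as `fetch`.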


### Be a Responsible Web Scraper :)

> To `responsibly` scrape websites, you should know what the site's rate limit is and respect it!  

> Many sites publish their crawling rules, sometimes including a requested delay between requests, in their `robots.txt` file.  

> Wikipedia requests [at least a one second pause per page request](https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler).  

> We will pause 1 second between each page scrape, so we will only **collect information for 10 US states** for now.

> **WARNING!!** Not respecting a site's limits can get your IP address blocked from the site.  Don't get yourself blocked from Wikipedia!
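
> Python's standard library can read `robots.txt` rules for you. The sketch below parses a made-up `robots.txt` (the rules and the `example.com` URLs are assumptions for illustration, not Wikipedia's real file) and asks which pages may be fetched and what delay is requested.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
sample_robots = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

delay = parser.crawl_delay("*")  # None when the file declares no Crawl-delay
allowed_public = parser.can_fetch("*", "https://example.com/wiki/Pennsylvania")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

print(f"Requested delay: {delay} seconds")
print(f"Public page allowed: {allowed_public}")
print(f"Private page allowed: {allowed_private}")
```

> For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead of parsing a string. Not every site declares a crawl delay, so a fallback pause of your own is still a good idea.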

### Data storage revisited

> We now have a function to extract information for each state as a dictionary.  We can convert this information into a `pandas` dataframe and store it to an Excel or .csv file if we pass in a list of dictionaries, all with the same keys.

In [None]:
pd.DataFrame([penn_dict, state_dict])

##### `Here's a step-by-step guide.` Now let's build out our full pipeline:

1. Gather a list of links to each state. (`DONE`)
2. For each state link, gather state information as a dictionary.
3. Append each state dictionary to a list.
4. Convert list of dictionaries to dataframe.
5. Save dataframe as a .csv file.

In [114]:
state_info_list = []

for link in state_urls[:10]:  # only the first 10 states for now
    
    # Step 2 is here
    state_info = get_state_info_robust(link)
    
    # Step 3 is here
    if state_info:
        state_info_list.append(state_info)
    
    # Be a responsible scraper!
    time.sleep(1)

In [115]:
# Step 4 is here
state_df = pd.DataFrame(state_info_list)
state_df

| | state | date_admitted | population | area | median_household_income |
|---|---|---|---|---|---|
| 0 | Alabama | 1819-12-14 | 4903185.0 | 52419.0 | 48123.0 |
| 1 | AlaskaAlax | 1959-01-03 | 710249.0 | 663268.0 | 73181.0 |
| 2 | Arizona | 1912-02-14 | 7278717.0 | 113990.0 | 56581.0 |
| 3 | Arkansas | 1836-06-15 | 3017804.0 | 53179.0 | 45869.0 |
| 4 | California | 1850-09-09 | 39512223.0 | 163696.0 | 71228.0 |
| 5 | Colorado | 1876-08-01 | 5758736.0 | 104094.0 | 69117.0 |
| 6 | Connecticut | 1788-01-09 | 3565287.0 | 5567.0 | 76106.0 |
| 7 | Delaware | 1787-12-07 | 982895.0 | 1982.0 | 62852.0 |
| 8 | Florida | 1845-03-03 | 21477737.0 | 65757.0 | 53267.0 |
| 9 | Georgia | 1788-01-02 | 10617423.0 | 59425.0 | 56183.0 |


In [116]:
# Step 5 is here.
state_df.to_csv("state_dataset.csv", index = False)

In [118]:
state_dataset = pd.read_csv("state_dataset.csv")
state_dataset == state_df

| | state | date_admitted | population | area | median_household_income |
|---|---|---|---|---|---|
| 0 | True | False | True | True | True |
| 1 | True | False | True | True | True |
| 2 | True | False | True | True | True |
| 3 | True | False | True | True | True |
| 4 | True | False | True | True | True |
| 5 | True | False | True | True | True |
| 6 | True | False | True | True | True |
| 7 | True | False | True | True | True |
| 8 | True | False | True | True | True |
| 9 | True | False | True | True | True |
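
> The `False` values in the `date_admitted` column most likely reflect a change of type rather than lost data: the dataframe holds datetime values, while a CSV file stores only text, so the dates come back as strings unless they are parsed again. The standard-library sketch below reproduces the effect on a single row (California's admission date, taken from the output earlier in this lesson).

```python
import csv
import io
from datetime import datetime

row = {"state": "California", "date_admitted": datetime(1850, 9, 9)}

# Write the row to an in-memory CSV "file"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["state", "date_admitted"])
writer.writeheader()
writer.writerow(row)

# Read it back: every value is now a string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))

same_before = loaded["date_admitted"] == row["date_admitted"]

# Parse the text back into a datetime before comparing
parsed = datetime.strptime(loaded["date_admitted"], "%Y-%m-%d %H:%M:%S")
same_after = parsed == row["date_admitted"]

print(f"Equal straight after loading: {same_before}")
print(f"Equal after parsing the date: {same_after}")
```

> In `pandas`, passing `parse_dates=["date_admitted"]` to `pd.read_csv()` is one way to restore the datetime type while loading.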


> Now that you have the state information in .csv format, you can analyze it with any tool you know how to use: namely, ***`Python`*** :)

##### Yay!


> Note that if you DID gather information for all 50 states, you would find one missing value: Kansas's date admitted. 


> After visiting/inspecting the Wiki page for [Kansas](https://en.wikipedia.org/wiki/Kansas), do you see why? 

> How could you fix the data extraction function to account for this issue?

### The Importance of `Pattern Recognition`: `Systematically named pages`

In [None]:
# A curious case: Tripadvisor review pages

# Page 1
https://www.tripadvisor.com.sg/Hotel_Review-g294265-d1845693-Reviews-or1-The_Fullerton_Bay_Hotel_Singapore-Singapore.html#REVIEWS

# Page 2
https://www.tripadvisor.com.sg/Hotel_Review-g294265-d1845693-Reviews-or5-The_Fullerton_Bay_Hotel_Singapore-Singapore.html#REVIEWS

# Page 3
https://www.tripadvisor.com.sg/Hotel_Review-g294265-d1845693-Reviews-or10-The_Fullerton_Bay_Hotel_Singapore-Singapore.html#REVIEWS

# Page 559    
https://www.tripadvisor.com.sg/Hotel_Review-g294265-d1845693-Reviews-or2790-The_Fullerton_Bay_Hotel_Singapore-Singapore.html#REVIEWS


In [None]:
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2018
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2019

> For many web scraping projects, you will begin by collecting links from a links page.  Other times you might be able to devise a pattern in the URLs that you can exploit.
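
> As an illustration, the Tripadvisor URLs listed above differ only in the `orN` segment. Pages 2, 3, and 559 are consistent with an offset of 5 × (page − 1); page 1 is shown with `or1`, so the rule is treated as an assumption here and the sketch starts from page 2. `review_page_url` is a hypothetical helper, and no request is sent.

```python
TEMPLATE = (
    "https://www.tripadvisor.com.sg/Hotel_Review-g294265-d1845693-Reviews-"
    "or{offset}-The_Fullerton_Bay_Hotel_Singapore-Singapore.html#REVIEWS"
)


def review_page_url(page, per_page=5):
    """Hypothetical helper: build the URL of a review page (page >= 2)."""
    offset = per_page * (page - 1)  # assumed rule, inferred from the examples
    return TEMPLATE.format(offset=offset)


for page in (2, 3, 559):
    print(review_page_url(page))
```

> Once a pattern like this is confirmed on a few pages by hand, a loop over page numbers can replace a separate link-collection step.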

**Example: Billboard Year-End Hot 100**

> The _Billboard_ Hot 100 chart is well known for tracking the success of music singles within the US.  

> At the end of each year, Billboard compiles a list of the top 100 performing songs throughout the year based on the information from Hot 100 charts.  

> Wikipedia displays this information as an article here: https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2019

> How might we compile a list of the most popular song for each year since 2010?

```python

# Please copy the lines of code below that I wrote for you :)

top_hits = []

for year in range(2010, 2020):

    #Build URL for each year
    base_url = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_"
    url = base_url + str(year)
    print(url)

    page = requests.get(url).text
    soup = bs(page)

    #Grab top hit text and link
    top_hit = soup.find("table", class_="wikitable").find("td")
    top_hit_text = top_hit.text
    try:
        top_hit_link = top_hit.find("a")["href"]
    except (TypeError, KeyError):  # no <a> tag in the cell, or no href on it
        top_hit_link = None

    #Store results as list of tuples
    top_hits.append((year, top_hit_text, top_hit_link))

    #Be sure to pause
    time.sleep(1)    
```

In [119]:
top_hits = []

for year in range(2010, 2020):

    #Build URL for each year
    base_url = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_"
    url = base_url + str(year)
    print(url)

    page = requests.get(url).text
    soup = bs(page)

    #Grab top hit text and link
    top_hit = soup.find("table", class_="wikitable").find("td")
    top_hit_text = top_hit.text
    try:
        top_hit_link = top_hit.find("a")["href"]
    except (TypeError, KeyError):  # no <a> tag in the cell, or no href on it
        top_hit_link = None

    #Store results as list of tuples
    top_hits.append((year, top_hit_text, top_hit_link))

    #Be sure to pause
    time.sleep(1)  

https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2010
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2011
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2012
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2013
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2014
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2015
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2016
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2017
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2018
https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2019


In [120]:
top_hits

[(2010, '"Tik Tok"', '/wiki/Tik_Tok_(song)'),
 (2011, '"Rolling in the Deep"', '/wiki/Rolling_in_the_Deep'),
 (2012,
  '"Somebody That I Used to Know"',
  '/wiki/Somebody_That_I_Used_to_Know'),
 (2013, '"Thrift Shop"', '/wiki/Thrift_Shop'),
 (2014, '"Happy"', '/wiki/Happy_(Pharrell_Williams_song)'),
 (2015, '"Uptown Funk"', '/wiki/Uptown_Funk'),
 (2016, '"Love Yourself"', '/wiki/Love_Yourself'),
 (2017, '"Shape of You"', '/wiki/Shape_of_You'),
 (2018, '"God\'s Plan"', '/wiki/God%27s_Plan_(song)'),
 (2019, '"Old Town Road"', '/wiki/Old_Town_Road')]
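
> The links collected above are relative (`/wiki/...`), so they cannot be passed to `requests.get()` as they are. A minimal sketch of turning them into full URLs with `urljoin` from the standard library; `to_absolute` is a hypothetical helper, and the sample tuples are copied from the output above.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2019"


def to_absolute(link, base=BASE):
    """Return a full URL, or None when no link was found for the song."""
    return urljoin(base, link) if link else None


sample_hits = [
    (2010, '"Tik Tok"', "/wiki/Tik_Tok_(song)"),
    (2018, '"God\'s Plan"', "/wiki/God%27s_Plan_(song)"),
    (2019, '"Old Town Road"', "/wiki/Old_Town_Road"),
]

full_hits = [(year, title, to_absolute(link)) for year, title, link in sample_hits]

for hit in full_hits:
    print(hit)
```

> Returning `None` for a missing link keeps the tuples the same shape, which matters when they are later converted to a dataframe.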

### Try to Run Your Mini Projects

> To practice putting together all of the skills you have learned today, you might want to work on your very own mini project.  

##### Things to consider...
Some project advice before getting started:
- **Start small and scale up.**  Make sure your code is working on one page before you try to request information from a ton of links.
- **Think through data storage before scaling up.** How will you store the information so you can perform analyses on the data you collect?
- **Safeguard against missing values.** It is so annoying when a scraping loop breaks on the last link and all other information is lost...
- **Pause for 1+ seconds between requests.** Let's try to not get banned from Wikipedia!
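
> One way to act on the advice about missing values and data storage is to write each record to disk as soon as it is scraped, so a failure on a late link does not discard earlier results. The sketch below uses the standard `csv` module; `fake_scrape`, the link names, and the file name are placeholders invented for this example.

```python
import csv
import os
import tempfile

FIELDS = ["state", "population"]


def fake_scrape(link):
    # Stand-in for a real scraping function; fails on one link on purpose
    if link == "bad-link":
        raise ValueError("page layout not recognised")
    return {"state": link.title(), "population": None}


links = ["alabama", "bad-link", "alaska"]
failed = []

path = os.path.join(tempfile.mkdtemp(), "partial_results.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for link in links:
        try:
            record = fake_scrape(link)
        except Exception as err:
            failed.append((link, str(err)))  # remember the failure and keep going
            continue
        writer.writerow(record)
        f.flush()  # push the row to disk right away

with open(path, newline="") as f:
    saved = list(csv.DictReader(f))

print(f"Saved {len(saved)} rows; {len(failed)} link(s) failed: {failed}")
```

> Keeping a list of failed links makes it easy to retry only those pages later instead of repeating the whole scrape.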

> `Thank you for working with the script :)`

In [None]:
exit()