## Build Your First Web Scraper

A practical, built-in package for web scraping is Python's [urllib](https://docs.python.org/3/library/internet.html), which provides tools for working with URLs. For this session, we'll use the `urlopen()` function, found in the `urllib.request` module, to open a URL within a program.

In [None]:
# import urlopen 
from urllib.request import urlopen

In [None]:
#open the following URL page
url = "http://olympus.realpython.org/profiles/aphrodite"

Pass `url` to `urlopen()`. The output is an `HTTPResponse` object, which we'll use in the next step to extract the HTML from the page.

In [None]:
#pass (url) to urlopen and name this value page
page = urlopen(url)
page

<http.client.HTTPResponse at 0x7fb67a08a3d0>

To extract the HTML from the page, call the `HTTPResponse` object's `.read()` method, which returns bytes, then use `.decode()` to convert the bytes to a string using UTF-8.

In [None]:
#use .read() to get bytes then change to string using .decode() and UTF-8
html_bytes = page.read()
html = html_bytes.decode("utf-8")

#the output that is printed is the HTML code from the website url above
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
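
The open, read, and decode steps above can be wrapped in one small helper. The following is a minimal sketch, not part of the original notebook: the function name `fetch_html` and its `default_charset` parameter are hypothetical, and it assumes the server either declares its encoding in the `Content-Type` header or serves UTF-8.

```python
from urllib.request import urlopen


def fetch_html(url, default_charset="utf-8"):
    """Open `url`, read the response bytes, and decode them to a string.

    `fetch_html` is a hypothetical helper name used for illustration.
    """
    # The with-statement closes the connection once the body has been read.
    with urlopen(url) as response:
        # Use the charset declared by the server if there is one,
        # otherwise fall back to the default (an assumption).
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

Calling `fetch_html("http://olympus.realpython.org/profiles/aphrodite")` should return the same HTML string that was printed above.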



## Extract Text from HTML With String Methods

Now that we have set it up so that we can extract the HTML from the page, let's look at how to extract information from a webpage's HTML. *One way* is to use string methods such as `.find()` to search through the text to find specific tags (e.g, `<title>` tag) and extract that information. 

The `.find()` method returns the index (i.e., position) of the first occurrence of a substring, which gives us the index where a given tag opens, in our case `<title>`.

Once we have the index of the first character of the title and the index of the first character of the closing `</title>` tag, we can use a string slice to extract the title.

In [None]:
#pass the string "<title>" to .find()
title_index = html.find("<title>")
title_index

14

Since we are looking for the index of the title not the tag, we should add the length of the string `<title>` to `title_index`. 

In [None]:
# add len("<title>") to get the start index of the title
start_index = title_index + len("<title>")
start_index

21

To get the index of the closing `</title>` tag, we'll pass that string to `.find()`.

In [None]:
#pass "</title>" to get the index of the closing tag
end_index = html.find("</title>")
end_index

39

Now that we have the start and end indexes, we can extract the title by slicing the `html` string.

In [None]:
#extract the title 
title = html[start_index:end_index]
title

'Profile: Aphrodite'

**Note**: Not all HTML will be this straightforward. Sometimes the HTML syntax will be less predictable.

### Example: Extracting a title from messier html. 

Let's try to extract the title from another profile page. 

~~~python
url = "http://olympus.realpython.org/profiles/poseidon"
~~~

In [None]:
url = "http://olympus.realpython.org/profiles/poseidon"
page = urlopen(url)
html = page.read().decode("utf-8")
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title

'\n<head>\n<title >Profile: Poseidon'

The opening `<title>` tag on the /profiles/poseidon page has an extra space before the closing angle bracket (`>`), so it appears as `<title >`. As a result, `html.find("<title>")` returns -1 because the exact substring `"<title>"` doesn't exist.

When -1 is added to `len("<title>")`, which is 7, `start_index` is assigned the value 6. The character at index 6 is the newline (`\n`) right before the opening angle bracket (`<`) of the `<head>` tag. The slice `html[start_index:end_index]` therefore returns all the HTML starting with that newline and ending just before the `</title>` tag.
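
The arithmetic described above can be reproduced without a network request. This sketch uses a shortened stand-in for the Poseidon page (an assumption made for illustration) and adds a hypothetical helper, `extract_title`, that checks for -1 before slicing.

```python
# Abbreviated stand-in for the Poseidon page's HTML (an assumption for illustration).
html = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>\n</html>"

title_index = html.find("<title>")          # the exact substring is absent
start_index = title_index + len("<title>")  # -1 + 7
end_index = html.find("</title>")

print(title_index)                        # -1
print(start_index)                        # 6
print(repr(html[start_index]))            # '\n'
print(repr(html[start_index:end_index]))  # '\n<head>\n<title >Profile: Poseidon'


def extract_title(html):
    """Return the text between <title> and </title>, or None if a tag is missing.

    `extract_title` is a hypothetical helper, not part of the original notebook.
    """
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end]


print(extract_title(html))  # None, instead of a misleading slice
```

Checking for -1 makes the failure explicit, but it still cannot find the title inside `<title >`, which is why a more flexible approach is needed.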

Since this will not be the last time we encounter messy html, we need a more reliable way to extract text from HTML.

## Regular Expressions (RegExes)

#### What are RegExes? 
 
A regular expression (regex) is a sequence of characters that defines a search pattern. Some of those characters, called metacharacters, have special meanings. You can use regexes to search for, match, and find text within a string.

Regexes are also commonly used in command line tools. Python supports regexes through the built-in `re` module. Visit the [regex documentation](https://docs.python.org/3/library/re.html) to find more information.

In [None]:
# import the re module
import re

Each regex metacharacter has a special meaning. For example, `*` means **zero or more instances**. When `*` is applied, the pattern matches zero or more instances of whatever comes ***before*** the `*`.

Use `re.findall()` to find text that matches a pattern containing `*`. The first argument, `"ab*c"`, is the regex pattern, and the second argument, `"ac"`, is the test string.

The pattern will match any part of the string that begins with ***a***, ends with ***c***, and has zero or more instances of ***b*** between the ***a*** and ***c***.

The `re.findall()` will return a list of all matches. 

In [None]:
#apply '*' to match zero or more instances
re.findall("ab*c", "ac") #try changing the test string to "abcd"


['ac']
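
To see how `*` behaves with different inputs, the sketch below runs the same pattern against several test strings. The test strings are chosen for illustration only.

```python
import re

test_strings = ["ac", "abc", "abbbc", "abcd", "abdc"]

# Map each test string to the list of matches for the pattern "ab*c".
results = {s: re.findall("ab*c", s) for s in test_strings}

for test_string, matches in results.items():
    print(test_string, matches)

# ac ['ac']
# abc ['abc']
# abbbc ['abbbc']
# abcd ['abc']
# abdc []
```

Note that `"abcd"` still produces a match because `findall()` looks for the pattern anywhere in the string, while `"abdc"` does not because a `d` sits between the `b` and the `c`.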

**Note:** A regex only returns text that matches the pattern exactly, which means ***it is*** case sensitive. The following example illustrates this by returning an empty list.

In [None]:
#let's try to match case
re.findall("ab*c", "ABC")

[]

If you want your pattern to match regardless of case, you can pass `re.IGNORECASE` as a third argument.

In [None]:
#add re.ignorecase argument to ignore case when finding matches
re.findall("ab*c", "ABC", re.IGNORECASE)

['ABC']

Another regex metacharacter is `.`, which matches **any single character**. For example, you can find all substrings that contain the letters "a" and "c" separated by a ***single*** character, as in this example.

In [None]:
#apply '.' to match any single character 
re.findall("a.c", "abc") # try changing the test string to "abbc"


['abc']
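
The difference between `.` on its own and `.` combined with `*` is easy to miss. The following sketch compares the two patterns on a few illustrative test strings.

```python
import re

# "a.c" requires exactly one character between "a" and "c".
single_abc = re.findall("a.c", "abc")
single_abbc = re.findall("a.c", "abbc")
single_ac = re.findall("a.c", "ac")

# "a.*c" allows zero or more characters between "a" and "c".
any_abbc = re.findall("a.*c", "abbc")
any_ac = re.findall("a.*c", "ac")

print(single_abc)   # ['abc']
print(single_abbc)  # []
print(single_ac)    # []
print(any_abbc)     # ['abbc']
print(any_ac)       # ['ac']
```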

When you use `re.search()` to look for a pattern inside a string, it returns a match object (or `None` if nothing matches). Calling `.group()` on the match object returns the first, most inclusive match.

In [None]:
#use '.group()' to get the first and most inclusive match
match_results = re.search("ab*c", "ABC", re.IGNORECASE)
match_results.group()

'ABC'
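
Because `re.search()` returns `None` when nothing matches, calling `.group()` directly on the result can raise an `AttributeError`. A minimal sketch of a guard follows; the helper name `first_match` is hypothetical.

```python
import re


def first_match(pattern, text, flags=0):
    """Return the first match of `pattern` in `text`, or None if there is none.

    `first_match` is a hypothetical helper used for illustration.
    """
    match = re.search(pattern, text, flags)
    # Only call .group() when a match object was actually returned.
    return match.group() if match else None


print(first_match("ab*c", "ABC"))                 # None (case sensitive)
print(first_match("ab*c", "ABC", re.IGNORECASE))  # ABC
```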

A second helpful function for working with text is `re.sub()`, short for substitute. `re.sub()` replaces the text in a string that matches a regular expression with new text, so you can think of it as a find-and-replace.

In the following example the arguments passed are: 1. the regular expression, 2. the replacement text, and 3. the string. 

You'll notice the regex includes a `?`. The `*?` quantifier works the same way as `*`, except that it is non-greedy: it matches the shortest possible *string of text*. So it finds two matches, `<replaced>` and `<tags>`, and substitutes "ELEPHANTS" for both.
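
To make the difference between `*` and `*?` concrete, this sketch applies both patterns to the same sentence. With the greedy `<.*>`, the match runs from the first `<` to the last `>`, so everything in between is replaced at once.

```python
import re

sentence = "Everything is <replaced> if it's in <tags>."

# Greedy: one long match from the first "<" to the last ">".
greedy = re.sub("<.*>", "ELEPHANTS", sentence)

# Non-greedy: two short matches, "<replaced>" and "<tags>".
non_greedy = re.sub("<.*?>", "ELEPHANTS", sentence)

print(greedy)      # Everything is ELEPHANTS.
print(non_greedy)  # Everything is ELEPHANTS if it's in ELEPHANTS.
```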

In [None]:
#match and substitute Elephants 
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*?>", "ELEPHANTS", string) #try removing the question mark
string

"Everything is ELEPHANTS if it's in ELEPHANTS."

## Exercises

### Exercise One: Extract Text Using RegEx

Parse out the title from this [profile](http://olympus.realpython.org/profiles/dionysus) page which has the following HTML:

~~~ HTML 
<TITLE >Profile: Dionysus</title  / >

~~~

### Solution:

In [None]:
# regex_soup.py

import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags

print(title)

Profile: Dionysus
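
An alternative to stripping the tags with `re.sub()` is to capture the title text directly with a group. This is a sketch of a different technique than the solution above, and it uses a shortened stand-in for the page's HTML (an assumption) so it runs without a network request.

```python
import re

# Abbreviated stand-in for the Dionysus page's HTML (an assumption for illustration).
html = "<html>\n<head>\n<TITLE >Profile: Dionysus</title  / >\n</head>\n</html>"

# The parentheses capture only the text between the opening and closing tags.
pattern = "<title.*?>(.*?)</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)

# group(1) is the first captured group; group() would return the whole match.
title = match_results.group(1) if match_results else None
print(title)  # Profile: Dionysus
```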


### Exercise Two: Scrape Data From a Website

Write a program that grabs the full HTML from the following URL:
~~~ python
url = "http://olympus.realpython.org/profiles/dionysus"
~~~  

Then use .find() to display the text following Name: and Favorite Color: (not including any leading spaces or trailing HTML tags that might appear on the same line).

### Solution:

First, import the `urlopen` function from the `urllib.request` module:

~~~Python
from urllib.request import urlopen
~~~

Then open the URL and use the .read() method of the HTTPResponse object returned by urlopen() to read the page’s HTML:

~~~Python
url = "http://olympus.realpython.org/profiles/dionysus"
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")
~~~

The .read() method returns a byte string, so you use .decode() to decode the bytes using the UTF-8 encoding.

Now that you have the HTML source of the web page as a string assigned to the html_text variable, you can extract Dionysus’s name and favorite color from his profile. The structure of the HTML for Dionysus’s profile is the same as for Aphrodite’s profile, which you saw earlier.

You can get the name by finding the string "Name:" in the text and extracting everything that comes after the first occurrence of the string and before the next HTML tag. That is, you need to extract everything after the colon (:) and before the first angle bracket (<). You can use the same technique to extract the favorite color.

The following [for loop](https://realpython.com/python-for-loop/) extracts this text for both the name and favorite color:

~~~ Python
for string in ["Name: ", "Favorite Color:"]:
    string_start_idx = html_text.find(string)
    text_start_idx = string_start_idx + len(string)

    next_html_tag_offset = html_text[text_start_idx:].find("<")
    text_end_idx = text_start_idx + next_html_tag_offset

    raw_text = html_text[text_start_idx : text_end_idx]
    clean_text = raw_text.strip(" \r\n\t")
    print(clean_text)
~~~


It looks like there’s a lot going on in this for loop, but it’s just a little bit of arithmetic to calculate the right indices for extracting the desired text. Go ahead and break it down:

1. You use html_text.find() to find the starting index of the string, either "Name:" or "Favorite Color:", and then assign the index to string_start_idx.

2. Since the text to extract starts just after the colon in "Name:" or "Favorite Color:", you get the index of the character immediately after the colon by adding the length of the string to string_start_idx, and then assign the result to text_start_idx.

3. You calculate the ending index of the text to extract by determining the index of the first angle bracket (<) relative to text_start_idx and assign this value to next_html_tag_offset. Then you add that value to text_start_idx and assign the result to text_end_idx.

4. You extract the text by slicing html_text from text_start_idx to text_end_idx and assign this string to raw_text.

5. You remove any whitespace from the beginning and end of raw_text using .strip() and assign the result to clean_text.

At the end of the loop, you use print() to display the extracted text. The final output looks like this:

~~~shell
Dionysus
Wine
~~~
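
The same steps can also be packaged as a function. The sketch below is a variation on the loop above, not the original solution: it passes a start position to `.find()` instead of slicing first, and it returns `None` when the label is missing. The name `text_after` is hypothetical, and the HTML is a shortened stand-in for the Dionysus page.

```python
def text_after(label, html_text):
    """Return the text between `label` and the next "<", stripped of whitespace.

    `text_after` is a hypothetical helper used for illustration.
    """
    label_start = html_text.find(label)
    if label_start == -1:
        return None  # the label does not appear in the HTML
    text_start = label_start + len(label)
    # Search for the next tag starting from text_start, so no offset math is needed.
    text_end = html_text.find("<", text_start)
    if text_end == -1:
        text_end = len(html_text)
    return html_text[text_start:text_end].strip()


# Abbreviated stand-in for the Dionysus page's HTML (an assumption for illustration).
sample_html = (
    "<h2>Name: Dionysus</h2>\n"
    "Favorite animal: Leopard <br>\n"
    "<br>\n"
    "Favorite Color: Wine\n"
    "</center>"
)

for label in ["Name:", "Favorite Color:"]:
    print(text_after(label, sample_html))
```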

# Use an HTML Parser for Web Scraping in Python

## Install Beautiful Soup

Install the latest version of Beautiful Soup by running the cell below:

In [None]:
! pip install beautifulsoup4



## Create a `BeautifulSoup` Object

Now import `BeautifulSoup` from `bs4`. Import `urlopen` again, if needed.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

Open a URL and read the HTML

In [None]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

In [None]:
print(html)

<html>
<head>
<TITLE >Profile: Dionysus</title  / >
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/dionysus.jpg" />
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"><br><br>
Hometown: Mount Olympus
<br><br>
Favorite animal: Leopard <br>
<br>
Favorite Color: Wine
</center>
</body>
</html>



Now create a `BeautifulSoup` object using `html`

In [None]:
soup = BeautifulSoup(html, "html.parser")
print(soup)

<html>
<head>
<title>Profile: Dionysus</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/><br/><br/>
Hometown: Mount Olympus
<br/><br/>
Favorite animal: Leopard <br/>
<br/>
Favorite Color: Wine
</center>
</body>
</html>



The second argument, "`html.parser`" tells the object which parser to use to interpret the supplied HTML. 
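
`html.parser` refers to the HTML parser in Python's standard library. As a rough illustration of what a parser does with tags and attributes, the sketch below uses the standard library's `HTMLParser` class directly to collect the `src` attribute of every `<img>` tag. This is not how Beautiful Soup is implemented; the class name `ImageSrcCollector` and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of each <img> tag (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # `attrs` is a list of (name, value) pairs for the tag's attributes.
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


# Abbreviated stand-in for the Dionysus page's HTML (an assumption for illustration).
sample_html = (
    '<img src="/static/dionysus.jpg" />\n'
    "<h2>Name: Dionysus</h2>\n"
    '<img src="/static/grapes.png"><br><br>'
)

collector = ImageSrcCollector()
collector.feed(sample_html)
print(collector.sources)  # ['/static/dionysus.jpg', '/static/grapes.png']
```

Beautiful Soup builds a searchable tree on top of this kind of event stream, which is what makes the methods in the next section possible.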

## Use a `BeautifulSoup` Object

Use the `get_text()` method to extract all the text and remove HTML tags.

In [None]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






Now we can retrieve data from the soup object with the HTML tags.

We can search for every instance of a tag using `find_all()`

In [None]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This returns a list of all `<img>` tags where the `src` attribute shows the path to the images on this page. 

Unpack the list into separate variables:

In [None]:
image1, image2 = soup.find_all("img")

Each object has a `name` property that tells the type of HTML tag

In [None]:
image1.name

'img'

Access the HTML attributes of the object by calling it in square brackets, similar to keys in a dictionary:

In [None]:
image1["src"]

'/static/dionysus.jpg'

In [None]:
image2["src"]

'/static/grapes.png'

Other HTML tags may have multiple attributes that you can access

You can retrieve certain tags directly from the soup object

In [None]:
soup.title

<title>Profile: Dionysus</title>

There are multiple ways to retrieve the string between the tags:

In [None]:
print(soup.title.string)
print(soup.title.get_text())

Profile: Dionysus
Profile: Dionysus


You can also use Beautiful Soup to search for tags where the attributes match certain values. Going back to the example with `<img>` tags, we could search for the tag that specifically has a `src` attribute equal to `/static/dionysus.jpg`:

In [None]:
soup.find_all("img", src="/static/dionysus.jpg")

[<img src="/static/dionysus.jpg"/>]

This is a very specific example. Normally you will want to spend time looking through the complicated HTML structures of web page sources. Typically you will be able to identify which tags, and attributes, contain the data you are looking to extract.

## Exercise

Read the HTML from the page at the URL: `http://olympus.realpython.org/profiles`

Use Beautiful Soup to extract tags with the name `a` and retrieve the values from the `href` attribute. What do you think these items represent?

### Solution

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [None]:
url = "http://olympus.realpython.org/profiles"

html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")
print(html_text)

<html>
<head>
<title>All Profiles</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<h1>All Profiles:</h1>
<br><br>
<h2>
<a href="/profiles/aphrodite">Aphrodite</a>
<br><br>
<a href="/profiles/poseidon">Poseidon</a>
<br><br>
<a href="/profiles/dionysus">Dionysus</a>
</h2>
</center>
</body>
</html>



In [None]:
soup = BeautifulSoup(html_text, "html.parser")
print(soup.get_text())



All Profiles




All Profiles:


Aphrodite

Poseidon

Dionysus







In [None]:
soup.find_all("a")

[<a href="/profiles/aphrodite">Aphrodite</a>,
 <a href="/profiles/poseidon">Poseidon</a>,
 <a href="/profiles/dionysus">Dionysus</a>]

In [None]:
for link in soup.find_all("a"):
    print(link["href"])

/profiles/aphrodite
/profiles/poseidon
/profiles/dionysus
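
The `href` values above are relative paths, so they need to be combined with the site's address before they can be requested. A minimal sketch using `urljoin()` from the standard library follows; the list of paths is copied from the output above.

```python
from urllib.parse import urljoin

base_url = "http://olympus.realpython.org/profiles"
hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

# Resolve each relative href against the page it was found on.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for link_url in absolute_urls:
    print(link_url)

# http://olympus.realpython.org/profiles/aphrodite
# http://olympus.realpython.org/profiles/poseidon
# http://olympus.realpython.org/profiles/dionysus
```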


# Interact with HTML Forms

`urllib` and `BeautifulSoup` work very well to request data from a static web page. However, you may need to interact with a web page, like submitting a form to access content. 

In this section we will use `MechanicalSoup`, which provides a headless browser: a web browser with no graphical interface that you control programmatically with Python.

## Install `MechanicalSoup`:

In [None]:
! pip install MechanicalSoup

Collecting MechanicalSoup
  Downloading MechanicalSoup-1.2.0-py3-none-any.whl (19 kB)
Installing collected packages: MechanicalSoup
Successfully installed MechanicalSoup-1.2.0


## Create a Browser Object

In [None]:
import mechanicalsoup

In [None]:
browser = mechanicalsoup.Browser()

Use `browser` to request a page with a URL

In [None]:
url = "http://olympus.realpython.org/login"
page = browser.get(url)

`page` is a Response object that shows the status code. 200 represents a successful request

In [None]:
page

<Response [200]>

`page` has a `.soup` attribute, so we can inspect the HTML

In [None]:
page.soup

<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<h2>Please log in to access Mount Olympus:</h2>
<br/><br/>
<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>
</center>
</body>
</html>

Notice that there is a `<form>` element with `<input>` elements for login info

In [None]:
page.soup.form

<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>

Open this [page](http://olympus.realpython.org/login) and you will see that this is a login page

To advance into the site, you have to provide the correct login information:

|Username|Password|
|--------|--------|
|zeus|ThunderDude|

With this information known, we can use `MechanicalSoup` to fill out and submit this form!

## Submit a Form With MechanicalSoup

First open the page and extract the HTML

In [None]:
login_page = browser.get(url)
login_html = login_page.soup

Then grab the `<form>` element from the HTML

In [None]:
form = login_html.form

In [None]:
form

<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>

For the next step, we can select each of the `<input>` elements for the username and password, define a new attribute called `"value"` and assign the corresponding values.

In [None]:
form.select('input')[0]["value"] = "zeus"
form.select('input')[1]["value"] = "ThunderDude"

Our values are now written into the HTML.

In [None]:
form

<form action="/login" method="post" name="login">
Username: <input name="user" type="text" value="zeus"/><br/>
Password: <input name="pwd" type="password" value="ThunderDude"/><br/><br/>
<input type="submit" value="Submit"/>
</form>

Now we can submit our updated form with `browser.submit()`. This only needs two arguments: the `form` object and the URL of the login page (i.e., `login_page.url`).

In [None]:
profiles_page = browser.submit(form, login_page.url)

Did it work? If so, `profiles_page.url` should be the following link: http://olympus.realpython.org/profiles

In [None]:
profiles_page.url

'http://olympus.realpython.org/profiles'
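
For context, an HTML form with `method="post"` and no `enctype` attribute is submitted as URL-encoded name/value pairs. The sketch below uses the standard library to show roughly what the target URL and request body look like for this login form. It illustrates the encoding only and is not MechanicalSoup's internal code.

```python
from urllib.parse import urlencode, urljoin

login_url = "http://olympus.realpython.org/login"
form_action = "/login"  # value of the form's action attribute
form_fields = {"user": "zeus", "pwd": "ThunderDude"}  # input names and values

# The action is resolved relative to the page that contains the form.
target_url = urljoin(login_url, form_action)

# The named inputs are encoded as name=value pairs joined by "&".
body = urlencode(form_fields)

print(target_url)  # http://olympus.realpython.org/login
print(body)        # user=zeus&pwd=ThunderDude
```

This is why the `name` attributes of the `<input>` elements matter: they become the keys the server reads.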

# Interact With Websites in Real Time

With the tools learned so far, you can automatically fetch real-time data from a website.

Navigate to the following [page](http://olympus.realpython.org/dice). This page displays the result of a simulated dice roll. Each time the page is refreshed, the dice is re-rolled.

Next, we will write a program that grabs the result of the dice roll after repeatedly refreshing the page. 

If we inspect the source HTML of this page, we will find a segment that looks like this:

```html
<h2 id="result">3</h2>
```

The text in this element may look different based on the dice roll. In our code, we will identify this element to scrape the result. 

In [None]:
import mechanicalsoup

browser = mechanicalsoup.Browser()
page = browser.get("http://olympus.realpython.org/dice")

Use the `select()` method to find the element with `id=result`.

In [None]:
tag_list = page.soup.select("#result")
tag_list

[<h2 id="result">6</h2>]

Next extract the text:

In [None]:
tag = tag_list[0]
result = tag.text

print(f"The result of your dice roll is: {result}")

The result of your dice roll is: 6


Next, we will roll the dice four times. However, it's best to not overload requests to a webpage. So after each roll of the dice (making a request to the URL), we will have the code wait for 5 seconds using `.sleep()` from the built-in Python module `time`.

In [None]:
import time

print("I'm about to wait for five seconds...")
time.sleep(5)
print("Done waiting!")

I'm about to wait for five seconds...
Done waiting!


Notice that the second print statement isn't executed until 5 seconds have passed.

Now let's make a for loop that retrieves the page (and the result of the dice roll) four times. After we retrieve the result, we will use `sleep()` to wait before the next iteration.

In [None]:
import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")
    time.sleep(5)

The result of your dice roll is: 5
The result of your dice roll is: 3
The result of your dice roll is: 1
The result of your dice roll is: 6


This is just a simple example to access data from a web page in real time. As you attempt to make more complex requests, be aware of the Terms of Use published with a website. It's possible to crash a server with an excessive number of rapid requests. Always be mindful and respectful of the Terms of Use when scraping websites.
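
One way to keep the waiting logic reusable is to separate it from the request itself. The following is a minimal sketch under that assumption: `poll` is a hypothetical helper that calls any function a fixed number of times and sleeps between calls, skipping the unnecessary wait after the last one.

```python
import time


def poll(fetch, times, delay_seconds):
    """Call `fetch` `times` times, sleeping `delay_seconds` between calls.

    `poll` is a hypothetical helper; `fetch` is any zero-argument function.
    """
    results = []
    for attempt in range(times):
        results.append(fetch())
        # No need to wait after the final request.
        if attempt < times - 1:
            time.sleep(delay_seconds)
    return results


# Stand-in for a real request so the example runs offline (an assumption).
fake_rolls = iter(["5", "3", "1", "6"])
rolls = poll(lambda: next(fake_rolls), times=4, delay_seconds=0)
print(rolls)  # ['5', '3', '1', '6']
```

With MechanicalSoup, `fetch` could be a function that requests the dice page and returns the text of the `#result` element, called with `delay_seconds=5`.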

# Web Requests

## Setup Socrata API Account

### State of Pennsylvania Socrata Account
To work with this session's data, you'll need to create an account with the state of Pennsylvania's data service. Visit [data.pa.gov](https://data.pa.gov/signup) and fill out the required information to create an account.



It is not unusual to find open data served directly on the web, without requiring any tokens or authorization. However, large datasets typically require this kind of controlled entry point because of the load they can place on data servers; smaller datasets are less burdensome and are frequently more accessible.

For data that is not served through an API, it is generally easier to work with services that serve the data directly in its native file format rather than wrapping it in HTML. The latter requires you to first parse the HTML or interpret the HTTP response.


As an example, take the details from crash incident reports CY 1997 from the Pennsylvania Department of Transportation ([data found here](https://dev.socrata.com/foundry/data.pa.gov/dc5b-gebx)). This dataset, which contains 3088272 rows, can be accessed directly at https://data.pa.gov/resource/dc5b-gebx.json. (FYI: Firefox has a nice built-in JSON viewer for .json files hosted on the web.)


Let's walk through making an HTTP request for this .json data and quickly transform it into a useful container so we can use it (e.g., a Pandas dataframe). We'll use Python's JSON module, which is a compact and easy-to-use way of turning JSON into Python's native object types, lists and dictionaries. This [table](https://docs.python.org/3/library/json.html#json-to-py-table) will let you see the output of a Python object based on the JSON input. **Note:** a JSON array of key-value objects will yield a Python list of dictionaries.
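
As a small illustration of that mapping, the sketch below parses a JSON array of two objects. The JSON text is a heavily shortened sample modeled on the crash dataset (an assumption; the real records have many more fields).

```python
import json

# Shortened sample modeled on the crash dataset (an assumption for illustration).
json_text = """
[
  {"crn": "1998140399", "county_name": "Bucks", "crash_year": "1998"},
  {"crn": "2018024122", "county_name": "ADAMS", "crash_year": "2018"}
]
"""

records = json.loads(json_text)

print(type(records).__name__)     # list
print(type(records[0]).__name__)  # dict
print(records[0]["county_name"])  # Bucks
```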

In [None]:
#!/usr/bin/env python

# make sure to install these packages before running:
!pip install pandas
!pip install sodapy
!pip install requests




In [None]:
import json
import requests

ci_data_json = requests.get('https://data.pa.gov/resource/dc5b-gebx.json')

ci_list_recs = json.loads(ci_data_json.text)

print(ci_list_recs[0])

{'crn': '1998140399', 'county': '09', 'county_name': 'Bucks', 'municipality': '09403', 'municipal_name': 'Doylestown Boro', 'district': '06', 'district_name': 'District 6-0', 'police_agcy': '09403', 'crash_year': '1998', 'crash_month': '12', 'day_of_week': '5', 'time_of_day': '0105', 'hour_of_day': '01', 'illumination': 'Unknown (expired)', 'road_condition': 'Unknown (expired)', 'collision_type': 'Other or Unknown', 'intersect_type': 'Mid-block', 'tcd_type': 'Not applicable', 'urban_rural': 'Urban', 'fatal_count': '0', 'injury_count': '0', 'person_count': '1', 'total_units': '3', 'sch_bus_ind': 'No', 'sch_zone_ind': 'No', 'arrival_tm': '9999', 'dispatch_tm': '9999', 'lane_closed': 'No', 'tcd_func_cd': 'No Controls', 'vehicle_count': '3', 'automobile_count': '0', 'motorcycle_count': '0', 'bus_count': '0', 'small_truck_count': '1', 'heavy_truck_count': '0', 'suv_count': '0', 'van_count': '0', 'bicycle_count': '0', 'maj_inj_count': '0', 'mod_inj_count': '0', 'min_inj_count': '0', 'tot_inj
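
Socrata endpoints generally accept query parameters such as `$limit` and simple `field=value` filters, which lets you request a small slice instead of the default page of rows. The sketch below only builds such a URL with the standard library. The specific parameter values are assumptions chosen for illustration, and the filter assumes `crash_year` is stored as text, as the record above suggests.

```python
from urllib.parse import urlencode

base_url = "https://data.pa.gov/resource/dc5b-gebx.json"

# Illustrative parameters: at most 5 rows where crash_year equals "2018".
params = {"$limit": 5, "crash_year": "2018"}

# safe="$" keeps the dollar sign readable instead of encoding it as %24.
query_url = f"{base_url}?{urlencode(params, safe='$')}"
print(query_url)
# https://data.pa.gov/resource/dc5b-gebx.json?$limit=5&crash_year=2018
```

With `requests`, the same dictionary can be passed as `requests.get(base_url, params=params)`, which builds the query string for you.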

### Pandas

If you haven't yet given a Pandas dataframe a try as a way to manage large arrays of information, give it a go. It is possible to go overboard: not everything needs to be put in a dataframe, especially when a Python list of lists or a dictionary will do. Take a look at how fast we can access subsets of the motor vehicle data.

In [None]:
import pandas as pd

ci_df = pd.DataFrame(ci_list_recs)
ci_df.head(10)

Unnamed: 0,crn,county,county_name,municipality,municipal_name,district,district_name,police_agcy,crash_year,crash_month,...,work_zone_loc,cons_zone_spd_lim,workers_pres,wz_close_detour,wz_flagger,wz_law_offcr_ind,wz_ln_closure,wz_moving,wz_other,wz_shlder_mdn
0,1998140399,9,Bucks,9403,Doylestown Boro,6,District 6-0,09403,1998,12,...,,,,,,,,,,
1,2018024122,1,ADAMS,1219,STRABAN,8,District 8-0,68H06,2018,2,...,,,,,,,,,,
2,2018022880,66,WYOMING,66231,WEST MANCHESTER,8,District 8-0,66231,2018,2,...,,,,,,,,,,
3,2018021506,15,CHESTER,15202,CHARLESTOWN,6,District 6-0,68J03,2018,1,...,,,,,,,,,,
4,2018025163,2,ALLEGHENY,2301,PITTSBURGH,11,District 11-0,02301,2018,2,...,,,,,,,,,,
5,2018029148,58,SUSQUEHANNA,58227,TIOGA,3,District 3-0,68F05,2018,3,...,,,,,,,,,,
6,2018023274,2,ALLEGHENY,2113,PENN HILLS,11,District 11-0,02113,2018,2,...,,,,,,,,,,
7,2018028792,67,YORK,67301,PHILADELPHIA,6,District 6-0,67301,2018,1,...,,,,,,,,,,
8,2018022129,46,MONTGOMERY,46104,LOWER MERION,6,District 6-0,68K01,2018,2,...,,,,,,,,,,
9,2018021892,39,LEHIGH,39210,UPPER MILFORD,5,District 5-0,68M05,2018,2,...,,,,,,,,,,
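
For a quick subset or count, a plain list of dictionaries is often enough. This sketch counts 2018 crashes per county using only the standard library. The records are a small hand-made sample modeled on the rows above (an assumption), not the full dataset.

```python
from collections import Counter

# Small hand-made sample modeled on the dataset's fields (an assumption).
sample_records = [
    {"county_name": "Bucks", "crash_year": "1998"},
    {"county_name": "ADAMS", "crash_year": "2018"},
    {"county_name": "ALLEGHENY", "crash_year": "2018"},
    {"county_name": "ALLEGHENY", "crash_year": "2018"},
]

# Keep only the 2018 records, then count them by county.
crashes_2018 = [rec for rec in sample_records if rec["crash_year"] == "2018"]
county_counts = Counter(rec["county_name"] for rec in crashes_2018)

print(len(crashes_2018))             # 3
print(county_counts.most_common(1))  # [('ALLEGHENY', 2)]
```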


### wget direct from web

GNU Wget is a free network utility that retrieves content from web servers. It supports downloading via HTTP, HTTPS, and FTP.

If you have wget installed on your system, you can use the command line utility wget directly in a Notebook cell as shown in the cell below. This cell can then be run at the start of your notebook as a way for you to retrieve the most up to date version of a dataset.

To install wget, visit http://www.gnu.org/software/wget/

In [None]:
!wget https://data.pa.gov/resource/dc5b-gebx.json

--2023-04-14 10:58:53--  https://data.pa.gov/resource/dc5b-gebx.json
Resolving data.pa.gov (data.pa.gov)... 52.206.140.199, 52.206.140.205, 52.206.68.26
Connecting to data.pa.gov (data.pa.gov)|52.206.140.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: 'dc5b-gebx.json.1'

dc5b-gebx.json.1        [  <=>               ]   4.17M  19.8MB/s    in 0.2s    

2023-04-14 10:58:54 (19.8 MB/s) - 'dc5b-gebx.json.1' saved [4373075]



In [None]:
#load the downloaded dataset (note: wget saves repeat downloads as dc5b-gebx.json.1, .2, ...)
with open('dc5b-gebx.json') as f:
    wget_json = json.load(f)
wget_json[0]

{'crn': '1998140399',
 'county': '09',
 'county_name': 'Bucks',
 'municipality': '09403',
 'municipal_name': 'Doylestown Boro',
 'district': '06',
 'district_name': 'District 6-0',
 'police_agcy': '09403',
 'crash_year': '1998',
 'crash_month': '12',
 'day_of_week': '5',
 'time_of_day': '0105',
 'hour_of_day': '01',
 'illumination': 'Unknown (expired)',
 'road_condition': 'Unknown (expired)',
 'collision_type': 'Other or Unknown',
 'intersect_type': 'Mid-block',
 'tcd_type': 'Not applicable',
 'urban_rural': 'Urban',
 'fatal_count': '0',
 'injury_count': '0',
 'person_count': '1',
 'total_units': '3',
 'sch_bus_ind': 'No',
 'sch_zone_ind': 'No',
 'arrival_tm': '9999',
 'dispatch_tm': '9999',
 'lane_closed': 'No',
 'tcd_func_cd': 'No Controls',
 'vehicle_count': '3',
 'automobile_count': '0',
 'motorcycle_count': '0',
 'bus_count': '0',
 'small_truck_count': '1',
 'heavy_truck_count': '0',
 'suv_count': '0',
 'van_count': '0',
 'bicycle_count': '0',
 'maj_inj_count': '0',
 'mod_inj_coun

# Simple Web API Requests


More robust data services, i.e., APIs, generally require us to register an "app" (the application that will access the data) and receive at least a token, and often a client secret as well. These credentials let the provider track downloads, enforce access limits, and so on.

We can think of these access points as involving one of two levels of authentication: a simple authentication involving signed requests (where a long term token is passed along with the request but no secondary per-request or limited-time token is needed), and a more complex, two or three step authentication process.

### Simple Authentication Example Using PA State Data Socrata

It is very helpful if an API comes with a pre-built library to interface with that server so that you don't have to handle signing requests in HTTP, managing tokens, etc. 

Fortunately, we have a nice workable pre-made library for working with this same PA State Socrata API portal, sodapy.

**Setting Up Your App on Your Socrata Account**

Once you have sodapy installed, you'll want to visit the API section of the website, which has its own record of the same dataset we visited above with some additional access information:

[PA State Data Socrata](https://dev.socrata.com/foundry/data.pa.gov/dc5b-gebx)

If you scroll down about halfway through this page, you'll see a large "Sign up for an app token!" button. Click on that to be taken to your API dashboard (alternatively, you can log in and navigate to [developer settings](https://data.pa.gov/profile/edit/developer_settings)).

In the second main section, select "Create New App Token."

Fill in some basic details for your "app" and once generated, copy down the App Token given to you.

We can now proceed to try out the sodapy library.

In [None]:
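import os
from sodapy import Socrata

# Read credentials from environment variables instead of hard-coding them.
# The variable names below are placeholders; set them to your own values.
client = Socrata('data.pa.gov',
                 os.environ['SOCRATA_APP_TOKEN'],
                 username=os.environ['SOCRATA_USERNAME'],
                 password=os.environ['SOCRATA_PASSWORD'])

results = client.get('dc5b-gebx', limit=2000)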

In [None]:
import pandas as pd

results_df = pd.DataFrame.from_records(results)
results_df.head(10)

Unnamed: 0,crn,county,county_name,municipality,municipal_name,district,district_name,police_agcy,crash_year,crash_month,...,cons_zone_spd_lim,workers_pres,wz_close_detour,wz_flagger,wz_law_offcr_ind,wz_ln_closure,wz_moving,wz_other,wz_shlder_mdn,roadway_cleared
0,1998140399,9,Bucks,9403,Doylestown Boro,6,District 6-0,09403,1998,12,...,,,,,,,,,,
1,2018024122,1,ADAMS,1219,STRABAN,8,District 8-0,68H06,2018,2,...,,,,,,,,,,
2,2018022880,66,WYOMING,66231,WEST MANCHESTER,8,District 8-0,66231,2018,2,...,,,,,,,,,,
3,2018021506,15,CHESTER,15202,CHARLESTOWN,6,District 6-0,68J03,2018,1,...,,,,,,,,,,
4,2018025163,2,ALLEGHENY,2301,PITTSBURGH,11,District 11-0,02301,2018,2,...,,,,,,,,,,
5,2018029148,58,SUSQUEHANNA,58227,TIOGA,3,District 3-0,68F05,2018,3,...,,,,,,,,,,
6,2018023274,2,ALLEGHENY,2113,PENN HILLS,11,District 11-0,02113,2018,2,...,,,,,,,,,,
7,2018028792,67,YORK,67301,PHILADELPHIA,6,District 6-0,67301,2018,1,...,,,,,,,,,,
8,2018022129,46,MONTGOMERY,46104,LOWER MERION,6,District 6-0,68K01,2018,2,...,,,,,,,,,,
9,2018021892,39,LEHIGH,39210,UPPER MILFORD,5,District 5-0,68M05,2018,2,...,,,,,,,,,,
