# Web Scraping Techniques

*Various techniques for scraping data from websites.*

In [17]:
import requests
from bs4 import BeautifulSoup

#### Checking that a page exists

HTTP response status codes fall into five classes, and the first digit indicates the class of the response:

* 1xx - informational
* 2xx - success
* 3xx - redirection
* 4xx - client error
* 5xx - server error

Thus, a single check that the status code is less than 400 catches all of the non-error responses (informational, success, and redirection).

Source: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
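As a small illustration of the first-digit rule, the sketch below maps a status code to its class. The function name `status_class` is made up for this example and is not part of `requests`.

```python
def status_class(status_code):
    """Map an HTTP status code to its class using the first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 isolates the first digit of a three-digit code.
    return classes.get(status_code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```

Anything outside the 100-599 range falls through to `"unknown"` rather than raising an error.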

In [10]:
from requests.exceptions import ConnectionError

def check_website(request_response):
    status_code = request_response.status_code
    
    if status_code < 400:
        print("Website exists (status code {})".format(status_code))
    else:
        print("Failed to retrieve website (status code {})".format(status_code))

urls = ["https://www.nasa.gov/", "http://thiswebsitebroke.com/"]

for url in urls:
    try:
        request_response = requests.get(url)  # GET downloads the full page; requests.head(url) would fetch only the headers
        check_website(request_response)
    except ConnectionError:
        print("Website does not exist")

Website exists (status code 200)
Website does not exist


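#### Displaying the content of a website's robots.txt file

The robots.txt file tells crawlers which paths on a site they may or may not request, typically as `User-agent` and `Disallow` rules. It expresses the owner's wishes and is worth checking before scraping.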

In [16]:
request_response = requests.get("https://en.wikipedia.org/robots.txt")
print(request_response.text[0:500]) # Preview a snippet of the robots.txt file

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapa
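Rules like these can also be checked programmatically. The sketch below uses the standard library's `urllib.robotparser` and feeds it only the two `MJ12bot` lines from the preview above, so it does not reflect the full file. The page URL and the `ExampleScraper` user agent are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt preview; the real file contains many more.
robots_lines = [
    "User-agent: MJ12bot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules already in memory; no network request

page = "https://en.wikipedia.org/wiki/Web_scraping"  # example URL (assumption)
mj12bot_allowed = parser.can_fetch("MJ12bot", page)
example_allowed = parser.can_fetch("ExampleScraper", page)  # hypothetical agent

print("MJ12bot allowed:", mj12bot_allowed)
print("ExampleScraper allowed:", example_allowed)
```

To evaluate a live file instead, `RobotFileParser` also provides `set_url()` and `read()`, which download and parse the rules in one step.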


#### Extracting specific content from a website

For example, we can extract the number of datasets listed on data.gov by locating the element that holds the result count.

In [19]:
url = "https://catalog.data.gov/dataset"

result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")

In [25]:
soup.find("div", {"class": "new-results"}).text.strip()

'250,032 datasets found'
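The extracted text is still a string. A minimal sketch for turning it into an integer, assuming the text keeps the "N datasets found" shape shown above (the helper name `parse_dataset_count` is made up for this example):

```python
import re


def parse_dataset_count(text):
    """Pull the first comma-grouped number out of a string and return it as an int."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError("no number found in {!r}".format(text))
    # Drop the thousands separators before converting.
    return int(match.group().replace(",", ""))


print(parse_dataset_count("250,032 datasets found"))
```

If the page layout changes and the text no longer contains a number, the function raises `ValueError` instead of returning a misleading value.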

#### Geocoding an address

Use an address lookup website such as latlong.net to convert a street or city address into latitude/longitude coordinates. Open Chrome DevTools, enter an address manually, and submit the request. Then inspect the request headers and form data that were sent, and copy the required parts to build the same request programmatically. The trick is figuring out which headers and form fields are actually required.
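One way to narrow down the required parts is to drop one field at a time and see whether the request still succeeds. The sketch below shows that idea with a stand-in function instead of a real request. `find_required_keys`, `fake_request_succeeds`, and the `extra` field are hypothetical names for illustration; in practice the callable would send the request and check the response.

```python
def find_required_keys(fields, request_succeeds):
    """Return the keys whose removal makes the request fail."""
    required = []
    for key in fields:
        # Build a copy of the fields with exactly one key left out.
        trimmed = {k: v for k, v in fields.items() if k != key}
        if not request_succeeds(trimmed):
            required.append(key)
    return required


# Stand-in for a real request: pretend the server needs "c1" and "action".
def fake_request_succeeds(data):
    return "c1" in data and "action" in data


form_data = {"c1": "Paris, France", "action": "gpcm", "extra": "unused"}
print(find_required_keys(form_data, fake_request_succeeds))
```

This tests one key at a time, so it will not detect cases where either of two fields is sufficient on its own. The same approach applies to headers.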

In [39]:
address = "Paris, France"

url = "https://www.latlong.net/_spm4.php"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
           "x-requested-with": "XMLHttpRequest"}
data = {"c1": address, "action": "gpcm"}

result = requests.post(url, headers=headers, data=data)

print(result)
print(result.text)

<Response [200]>
48.856613,2.352222
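The response body is a plain "lat,lng" string. A minimal sketch for converting it into numeric coordinates, assuming the response always has this comma-separated shape (the helper name `parse_lat_lng` is made up for this example):

```python
def parse_lat_lng(response_text):
    """Convert a 'lat,lng' string into a (lat, lng) tuple of floats."""
    lat_text, lng_text = response_text.strip().split(",")
    lat, lng = float(lat_text), float(lng_text)
    # Reject values that cannot be valid coordinates.
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        raise ValueError("coordinates out of range: {}".format(response_text))
    return lat, lng


lat, lng = parse_lat_lng("48.856613,2.352222")
print(lat, lng)
```

A response with a different shape, such as an error message, raises `ValueError` during unpacking or float conversion.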


#### Getting the name of the newest dataset on Kaggle

*In progress: the request below currently returns a 400 response, so nothing is extracted yet.*

In [54]:
url = "https://www.kaggle.com/requests/SearchDatasetsRequest"
headers = {"origin": "https://www.kaggle.com",
           "referer": "https://www.kaggle.com/datasets?sort=published",
           "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}

result = requests.post(url, headers=headers)
soup = BeautifulSoup(result.content, "html.parser")

In [55]:
result

<Response [400]>

In [44]:
soup.find("div", {"class": "datasets"})

In [45]:
soup

