# Web Scraping Techniques

*Various techniques for scraping data from websites.*

In [37]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import urllib

#### Checking that a page exists

HTTP response status codes come in five classes and the value of the first digit indicates the category of response:

* 1xx - informational
* 2xx - success
* 3xx - redirection
* 4xx - client error
* 5xx - server error

Thus, checking that the status code is below 400 catches all of the non-error responses (informational, success, and redirection).

Source: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
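
Because the class is determined by the first digit, it can be read off with integer division. The snippet below is only an illustrative sketch: the names `STATUS_CLASSES`, `status_class`, and `is_non_error` are hypothetical helpers, not part of `requests`.

```python
# Hypothetical helpers (not part of requests): classify a status code by its first digit.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(status_code):
    """Return the class name for an HTTP status code, or "unknown"."""
    return STATUS_CLASSES.get(status_code // 100, "unknown")

def is_non_error(status_code):
    """True for 1xx, 2xx, and 3xx codes, i.e. any valid code below 400."""
    return 100 <= status_code < 400

for code in (200, 301, 404, 503):
    print(code, status_class(code), is_non_error(code))
```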

In [5]:
from requests.exceptions import ConnectionError

def check_website(request_response):
    status_code = request_response.status_code
    
    if status_code < 400:
        print("Website exists (status code {})".format(status_code))
    else:
        print("Failed to retrieve website (status code {})".format(status_code))

urls = ["https://www.nasa.gov/", "http://thiswebsitebroke.com/"]

for url in urls:
    try:
        request_response = requests.get(url)  # downloads the full page; requests.head(url) would fetch only the headers
        check_website(request_response)
    except ConnectionError:
        print("Website does not exist")

Website exists (status code 200)
Website does not exist


#### Displaying the content of a website's robots.txt file

The robots.txt file contains rules, grouped by user agent, that tell crawlers which paths the site owner allows or disallows them to access.

In [6]:
request_response = requests.get("https://en.wikipedia.org/robots.txt")
print(request_response.text[0:500]) # Preview a snippet of the robots.txt file

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapa
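
Rules like these can be evaluated with the standard library's `urllib.robotparser`. The following is a minimal sketch under stated assumptions: the rule lines are a made-up excerpt in the style of the output above (not the real Wikipedia file), and `example.org`, the `MyScraper` user agent, and the `allowed` helper are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up excerpt modelled on the robots.txt output above (an assumption,
# not the real Wikipedia file).
rules = [
    "User-agent: MJ12bot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the lines directly, so no network request is made

def allowed(user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)

print(allowed("MJ12bot", "https://example.org/wiki/Page"))
print(allowed("MyScraper", "https://example.org/wiki/Page"))
print(allowed("MyScraper", "https://example.org/private/data"))
```

For a live site, calling `set_url()` followed by `read()` on the parser fetches and parses the file instead of using hard-coded lines.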


#### Extracting specific content from a website

For example, extracting the number of datasets on data.gov.

In [7]:
url = "https://catalog.data.gov/dataset"

result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")

In [8]:
soup.find("div", {"class": "new-results"}).text.strip()

'250,031 datasets found'
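
To use the count as a number, the extracted text still has to be converted. A small sketch of one way to do this, assuming the text always contains a comma-grouped integer as in the output above (`parse_count` is a hypothetical helper name):

```python
import re

def parse_count(text):
    """Return the first comma-grouped integer in text, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

print(parse_count("250,031 datasets found"))  # → 250031
```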

#### Geocoding an address

An address lookup site such as latlong.net can convert a street or city address into latitude/longitude coordinates. To find the request it sends, enter an address manually, open Chrome DevTools, and submit the form. Then inspect the request headers and form data in the Network tab and copy the required parts into a programmatic request. The trick is figuring out which headers and form fields are actually required.

In [9]:
address = "Paris, France"

url = "https://www.latlong.net/_spm4.php"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
           "x-requested-with": "XMLHttpRequest"}
data = {"c1": address, "action": "gpcm"}

result = requests.post(url, headers=headers, data=data)

print(result)
print(result.text)

<Response [200]>
48.856613,2.352222
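
The response body is plain text, so it still needs to be split into numeric coordinates. A minimal sketch, assuming the body always has the `lat,lng` shape shown above (`parse_latlng` is a hypothetical helper, and the range check is an added safeguard rather than something the site guarantees):

```python
def parse_latlng(text):
    """Parse a "lat,lng" string into a (lat, lng) tuple of floats."""
    lat_str, lng_str = text.strip().split(",")
    lat, lng = float(lat_str), float(lng_str)
    # Reject values that cannot be valid coordinates.
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        raise ValueError("coordinates out of range: {!r}".format(text))
    return lat, lng

lat, lng = parse_latlng("48.856613,2.352222")
print(lat, lng)
```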


#### Create a dataframe with names and urls for all datasets on FiveThirtyEight

This requires parsing a large table of (very interesting) datasets and extracting the appropriate URLs. Note that the slug-like dataset name is not sufficient to construct a URL. See https://data.fivethirtyeight.com/.

In [10]:
url = "https://data.fivethirtyeight.com/"

result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")

In [11]:
table = soup.find("table", {"id": "dataIndex"})
rows = table.find_all("tr", {"class": "article"})

data = []

for row in rows:
    fields = row.find_all("td", {"class": None})
    for field in fields:
        link = field.find("a", {"class": "article-title"})  # None if the cell has no title link
        if link is not None:
            data.append([field.text, link["href"]])

fivethirtyeight = pd.DataFrame(data, columns=["Dataset Name", "URL"])

In [12]:
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
fivethirtyeight.head()

|   | Dataset Name | URL |
|---|---|---|
| 0 | Tracking Congress In The Age Of Trump | https://projects.fivethirtyeight.com/congress-trump-score/ |
| 1 | 2019-20 NBA Predictions | https://projects.fivethirtyeight.com/2020-nba-predictions/ |
| 2 | 2019 MLB Predictions | https://projects.fivethirtyeight.com/2019-mlb-predictions/ |
| 3 | How Popular Is Donald Trump? | https://projects.fivethirtyeight.com/trump-approval-ratings/ |
| 4 | Latest Polls | https://projects.fivethirtyeight.com/polls/ |


#### Extract all heading tags from the front page of CBS News

You can pass a list of tag names to the BeautifulSoup ```find_all()``` method.

In [13]:
url = "https://www.cbsnews.com/"

result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")

In [14]:
h_tags = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
headers = [header.text.strip() for header in h_tags if header.text.strip() != ""]
    
header_df = pd.DataFrame(headers)
header_df.head()

|   | 0 |
|---|---|
| 0 | Top Stories |
| 1 | CBSN |
| 2 | Who stood out at the Democratic debate? |
| 3 | Paul Rudd talks new Netflix series |
| 4 | Pence, Pompeo to meet with Turkish leader |


#### Parse JSON from URL to get number of active users on US government sites

Parsing JSON is as simple as calling ```.json()``` on the retrieved response.

In [15]:
url = "https://analytics.usa.gov/data/live/realtime.json"

result = requests.get(url)

result.json()

{'name': 'realtime',
 'sampling': {},
 'query': {'metrics': ['rt:activeUsers'], 'max-results': 10000},
 'meta': {'name': 'Active Users Right Now',
  'description': 'Number of users currently visiting all sites.'},
 'data': [{'active_visitors': '322685'}],
 'totals': {},
 'taken_at': '2019-10-16T20:26:31.628Z'}

In [16]:
active_users = int(result.json()["data"][0]["active_visitors"])
active_users

322685
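
Chained indexing like this raises an error if the payload shape ever changes. Since `.json()` returns ordinary Python dicts and lists, a more defensive lookup can be sketched as follows. The sample payload is copied from the output above and trimmed to the relevant keys, and `get_active_visitors` is a hypothetical helper name.

```python
import json

# Sample payload copied from the output above, trimmed to the relevant keys.
payload = json.loads('{"name": "realtime", "data": [{"active_visitors": "322685"}]}')

def get_active_visitors(report):
    """Return the active visitor count as an int, or None if it is missing."""
    rows = report.get("data") or []
    if not rows:
        return None
    value = rows[0].get("active_visitors")
    return int(value) if value is not None else None

print(get_active_visitors(payload))  # → 322685
```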

#### Retrieve the number of followers for a given Twitter account

This one is a bit tricky. Since Twitter's markup doesn't provide unique element IDs to search for, we find the link whose href contains "followers" and read the follower count from its title attribute.

In [35]:
username = "GordonRamsay"
url = "https://twitter.com/" + username

result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")

In [36]:
num_followers = soup.find("a", href=re.compile("followers"))["title"]
num_followers = re.findall(r"\d", num_followers)  # raw string avoids an invalid-escape warning
num_followers = int("".join(num_followers))
num_followers

7256128