## Chapter 12. Scraping online data
#### Notebook for Python

Van Atteveldt, W., Trilling, D. & Arcila, C. (2022). <a href="https://cssbook.net" target="_blank">Computational Analysis of Communication</a>. Wiley.

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ccs-amsterdam/ccsbook/blob/master/chapter12/chapter_12_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

In [4]:
!pip3 install requests pandas geopandas geopy selenium lxml cssselect requests-oauthlib

Collecting selenium
  Using cached selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


In [3]:
# accessing APIs and URLs
import requests

# handling of JSON responses
import json
from pprint import pprint
from pandas import json_normalize

# general data handling
# note: you need to additionally install geopy
import geopandas as gpd 
import pandas as pd

# static web scraping
from urllib.request import urlopen
from lxml.html import parse, fromstring

# selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import (
    WebDriverWait)
from selenium.webdriver.support import (
    expected_conditions as EC)
from selenium.webdriver.common.by import By

import time

### APIs

In [5]:
r = requests.get("https://www.googleapis.com/"
                 "books/v1/volumes?q=python")
data = r.json()
print(data.keys())  # "items" seems most promising
pprint(data["items"][0]) # let's print the 1st one

dict_keys(['kind', 'totalItems', 'items'])
{'accessInfo': {'accessViewStatus': 'NONE',
                'country': 'NL',
                'embeddable': False,
                'epub': {'isAvailable': False},
                'pdf': {'isAvailable': False},
                'publicDomain': False,
                'quoteSharingAllowed': False,
                'textToSpeechPermission': 'ALLOWED',
                'viewability': 'NO_PAGES',
                'webReaderLink': 'http://play.google.com/books/reader?id=yijjwAEACAAJ&hl=&printsec=frontcover&source=gbs_api'},
 'etag': 'zp/xhhsKukU',
 'id': 'yijjwAEACAAJ',
 'kind': 'books#volume',
 'saleInfo': {'country': 'NL', 'isEbook': False, 'saleability': 'NOT_FOR_SALE'},
 'searchInfo': {'textSnippet': 'With this handbook, you&#39;ll learn how to '
                               'use: IPython and Jupyter: provide '
                               'computational environments for data scientists '
                               'using Python NumPy: include

In [6]:
d = json_normalize(data["items"])
d.head()

Unnamed: 0,kind,id,etag,selfLink,volumeInfo.title,volumeInfo.subtitle,volumeInfo.authors,volumeInfo.publisher,volumeInfo.publishedDate,volumeInfo.description,...,accessInfo.accessViewStatus,accessInfo.quoteSharingAllowed,searchInfo.textSnippet,saleInfo.listPrice.amount,saleInfo.listPrice.currencyCode,saleInfo.retailPrice.amount,saleInfo.retailPrice.currencyCode,saleInfo.buyLink,saleInfo.offers,accessInfo.pdf.acsTokenLink
0,books#volume,yijjwAEACAAJ,zp/xhhsKukU,https://www.googleapis.com/books/v1/volumes/yi...,Python Data Science Handbook,Essential Tools for Working with Data,"[Jacob T. Vanderplas, Jake VanderPlas]",O'Reilly Media,2016,"For many researchers, Python is a first-class ...",...,NONE,False,"With this handbook, you&#39;ll learn how to us...",,,,,,,
1,books#volume,9MS9BQAAQBAJ,+txFT+aZW0c,https://www.googleapis.com/books/v1/volumes/9M...,Black Hat Python,Python Programming for Hackers and Pentesters,[Justin Seitz],No Starch Press,2014-12-14,"In Black Hat Python, the latest from Justin Se...",...,SAMPLE,False,"In Black Hat Python, the latest from Justin Se...",,,,,,,
2,books#volume,4pgQfXQvekcC,INWaThnNbS4,https://www.googleapis.com/books/v1/volumes/4p...,Learning Python,Powerful Object-Oriented Programming,[Mark Lutz],"""O'Reilly Media, Inc.""",2013-06-12,"Get a comprehensive, in-depth introduction to ...",...,SAMPLE,False,"Get a comprehensive, in-depth introduction to ...",46.87,EUR,46.87,EUR,https://play.google.com/store/books/details?id...,"[{'finskyOfferType': 1, 'listPrice': {'amountI...",
3,books#volume,2ZggjwEACAAJ,fb9LNT4SJpE,https://www.googleapis.com/books/v1/volumes/2Z...,The Hitchhiker's Guide to Python,Best Practices for Development,"[Kenneth Reitz, Tanya Schlusser]",,2016-07-25,The Hitchhiker's Guide to Python takes the jou...,...,NONE,False,Ready to complete your trek from journeyman to...,,,,,,,
4,books#volume,BP_WAgAAQBAJ,HEj9dFVwArM,https://www.googleapis.com/books/v1/volumes/BP...,Learning Python with Raspberry Pi,,"[Alex Bradbury, Russel Winder, Ben Everard]",John Wiley & Sons,2014-03-10,Explains how to leverage the revolutionary Ras...,...,SAMPLE,False,This approachable book serves as an ideal reso...,,,,,,,
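
The dotted column names above (such as `volumeInfo.title`) come from flattening the nested JSON. The sketch below mimics that idea in plain Python to show what happens to one record. It is an illustration only: the `flatten` helper is hypothetical, lists are left untouched, and it is not how pandas implements `json_normalize`.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dictionaries, carrying the prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            # scalars and lists are kept as they are
            flat[full_key] = value
    return flat

# a trimmed-down record, loosely modelled on the API response above
item = {
    "kind": "books#volume",
    "volumeInfo": {"title": "Learning Python", "authors": ["Mark Lutz"]},
    "saleInfo": {"listPrice": {"amount": 46.87, "currencyCode": "EUR"}},
}
flat_item = flatten(item)
print(flat_item)
```

Each key of `flat_item` corresponds to one column in the data frame, which is why deeply nested fields end up with long names like `saleInfo.listPrice.amount`.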


In [7]:
allitems = []
i = 0
while True:
    r = requests.get("https://www.googleapis.com/"
        "books/v1/volumes?q=python&maxResults="
        f"40&startIndex={i}")
    data = r.json()
    if "items" not in data:
        print(f"Retrieved {len(allitems)}, "
              "it seems like that's it")
        break
    allitems.extend(data["items"])
    i += 40
d = json_normalize(allitems)

{'kind': 'books#volumes', 'totalItems': 382}
Retrieved 82, it seems like that's it
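
The loop above can be separated from the HTTP call, which makes the stopping rule easier to see and to try out offline. The following is a minimal sketch under that assumption: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for a function that would call `requests.get` with `startIndex` and `maxResults`.

```python
def fetch_all(fetch_page, page_size=40):
    """Collect items page by page until a page has no 'items'."""
    items = []
    start = 0
    while True:
        data = fetch_page(start, page_size)
        page = data.get("items", [])
        if not page:
            break
        items.extend(page)
        start += page_size
    return items

# Stand-in for the API: 82 fake records served in pages
FAKE_ITEMS = [{"id": n} for n in range(82)]

def fake_fetch(start, size):
    """Return one page; like the real API, omit 'items' when nothing is left."""
    page = FAKE_ITEMS[start:start + size]
    return {"items": page} if page else {"kind": "books#volumes"}

all_items = fetch_all(fake_fetch)
print(len(all_items))  # 82
```

Note that the number of retrieved items (82) can be lower than the reported `totalItems`; the loop simply stops when the server no longer returns an `items` key.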


### Scraping

In [2]:
tree=parse(urlopen(
    "https://cssbook.net/d/eat/index.html"))

# get the restaurant names via XPATH 
print([e.text_content().strip() for e in 
       tree.xpath("//h3")])

# get the restaurant names via CSS Selector
print([e.text_content().strip() for e in
       tree.getroot().cssselect("h3")])

['Pizzeria Roma', 'Trattoria Napoli', 'Curry King']
['Pizzeria Roma', 'Trattoria Napoli', 'Curry King']


In [5]:
# three ways of extracting text
print("Appending `/text()` to the XPATH gives you "
      "exactly the text that is in the element "
      "itself, including line-breaks that happen "
      "to be in the source code:" )
print(tree.xpath(
    "//div[@class='restaurant']/text()"))

print("\nUsing the `text` property of the "
      "elements in the list of elements that are "
      "matched by the XPATH expression gives you "
      "the text of the elements themselves "
      "without the line breaks: ")
print([e.text for e in tree.xpath(
    "//div[@class='restaurant']")])

print("\nUsing the `text_content()` method "
      "instead returns the text of the element "
      "*and the text of its children*:")
print([e.text_content() for e in tree.xpath(
    "//div[@class='restaurant']")])

print("\nThe same but using CSS Selectors (note "
      "the .getroot() method, because the "
      "selectors can only be applied to HTML "
      "elements, not to DOM trees): ")
print([e.text_content() for e in
       tree.getroot().cssselect(".restaurant")])

Appending `/text()` to the XPATH gives you exactly the text that is in the element itself, including line-breaks that happen to be in the source code:
[' ', '\n      ', '\n      ', '\n    ', ' ', '\n      ', '\n      ', '\n    ', ' ', '\n      ', '\n      ', '\n    ']

Using the `text` property of the elements in the list of elements that are matched by the XPATH expression gives you the text of the elements themselves without the line breaks: 
[' ', ' ', ' ']

Using the `text_content()` method instead returns the text of the element *and the text of its children*:
['  Pizzeria Roma \n       Here you can get ... ... \n       Read the full review here\n    ', '  Trattoria Napoli \n       Another restaurant ... ... \n       Read the full review here\n    ', '  Curry King \n       Some description. \n       Read the full review here\n    ']

The same but using CSS Selectors (note the .getroot() method, because the selectors can only be applied to HTML elements, not to DOM trees): 
['  Pizz
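
The difference between an element's own text and the text of its children can also be tried without downloading anything. The sketch below uses the standard library's `xml.etree.ElementTree` on a small made-up snippet (an assumption, loosely modelled on the restaurant page). Its `.text` property behaves like the one in `lxml`, and `"".join(e.itertext())` plays the role that `text_content()` plays there.

```python
import xml.etree.ElementTree as ET

# hypothetical, simplified version of one restaurant block
snippet = """<div class="restaurant">
  <h3>Pizzeria Roma</h3>
  <p>Here you can get ...</p>
</div>"""
div = ET.fromstring(snippet)

# only the text directly inside <div>, before its first child
own_text = div.text
# the text of <div> and of all its children, in document order
all_text = "".join(div.itertext())
# the text of one specific child
name = div.find("h3").text

print(repr(own_text))
print(repr(all_text))
print(name)
```

The first value contains only whitespace, which is why `.text` on the `div` elements above looks empty even though the restaurants clearly have names.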

In [51]:
linkelements = tree.xpath("//a")
linktexts = [e.text for e in linkelements]
links = [e.attrib["href"] for e in linkelements]

print(linktexts)
print(links)

['here', 'here', 'here']
['review0001.html', 'review0002.html', 'review0003.html']
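
The extracted links are relative to the page they were found on, so they need to be combined with that page's URL before they can be downloaded. A minimal sketch using `urllib.parse.urljoin` from the standard library (the variable names are chosen for illustration):

```python
from urllib.parse import urljoin

page_url = "https://cssbook.net/d/eat/index.html"
relative_links = ["review0001.html", "review0002.html",
                  "review0003.html"]

# urljoin resolves each link against the page it was found on
absolute_links = [urljoin(page_url, link)
                  for link in relative_links]
print(absolute_links)
```

Unlike plain string concatenation, `urljoin` also copes with links that are already absolute or that start with `/`.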


In [52]:
import requests
from lxml.html import fromstring
headers = {"User-Agent": "Mozilla/5.0 (Windows "
    "NT 10.0; Win64; x64; rv:60.0) "
    "Gecko/20100101 Firefox/60.0"}

htmlsource = requests.get(
    "https://cssbook.net/d/eat/index.html", 
    headers = headers).text
tree = fromstring(htmlsource)
print([e.text_content().strip() for e in 
       tree.xpath("//h3")])

['Pizzeria Roma', 'Trattoria Napoli', 'Curry King']


In [53]:
with open("test.html", mode="w", encoding="utf-8") as fo:
    fo.write(htmlsource)

In [54]:
baseurl="https://reviews.com/?page="
tenpages = [f"{baseurl}{i+1}" for i in range(10)]
print(tenpages)

['https://reviews.com/?page=1', 'https://reviews.com/?page=2', 'https://reviews.com/?page=3', 'https://reviews.com/?page=4', 'https://reviews.com/?page=5', 'https://reviews.com/?page=6', 'https://reviews.com/?page=7', 'https://reviews.com/?page=8', 'https://reviews.com/?page=9', 'https://reviews.com/?page=10']
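
When a URL needs more than one parameter, or values that contain spaces, building the query string by hand becomes error-prone. As a sketch, `urllib.parse.urlencode` can do the escaping; the `q` parameter and its value here are purely hypothetical.

```python
from urllib.parse import urlencode

base = "https://reviews.com/"
# 'q' is a made-up second parameter to show the escaping
page_urls = [
    f"{base}?{urlencode({'page': i, 'q': 'pizza napoli'})}"
    for i in range(1, 4)
]
print(page_urls)
```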


In [7]:
BASEURL = "https://cssbook.net/d/eat/"

def get_restaurants(url):
  """takes the URL of an overview page as input
  returns a list of (name, link) tuples"""
  tree = parse(urlopen(url))
  names = [e.text.strip() for e in 
    tree.xpath("//div[@class='restaurant']/h3")]
  links = [e.attrib["href"] for e in 
    tree.xpath("//div[@class='restaurant']//a")]
  return list(zip(names, links))

def get_reviews(url):
  """yields reviews on the specified page"""
  while True:
    print(f"Downloading {url}...")
    tree = parse(urlopen(url))
    names = [e.text.strip() for e in 
      tree.xpath("//div[@class='review']/h3")]
    texts = [e.text.strip() for e in 
      tree.xpath("//div[@class='review']/p")]
    ratings = [e.text.strip() for e in tree.xpath(
      "//div[@class='rating']")]
    for u,txt,rating in zip(names,texts,ratings):
      review = {}
      review["username"] = u.replace("wrote:","")
      review["reviewtext"] = txt
      review["rating"] = rating
      yield review
    bb=tree.xpath("//span[@class='backbutton']/a")
    if bb:
      print("Processing next page")
      url = BASEURL+bb[0].attrib["href"]
    else:
      print("No more pages found.")
      break
        
print("Retrieving all restaurants...")
links = get_restaurants(BASEURL+"index.html")
print(links)

with open("reviews.json", mode = "w") as f:
    for restaurant, link in links:
        print(f"Processing {restaurant}...")
        for r in get_reviews(BASEURL+link):
            r["restaurant"] = restaurant
            f.write(json.dumps(r))
            f.write("\n")
            
# You can process the results with pandas
# (using lines=True since it's one json per line)
df = pd.read_json("reviews.json", lines=True)
print(df)

Retrieving all restaurants...
[('Pizzeria Roma', 'review0001.html'), ('Trattoria Napoli', 'review0002.html'), ('Curry King', 'review0003.html')]
Processing Pizzeria Roma...
Downloading https://cssbook.net/d/eat/review0001.html...
No more pages found.
Processing Trattoria Napoli...
Downloading https://cssbook.net/d/eat/review0002.html...
No more pages found.
Processing Curry King...
Downloading https://cssbook.net/d/eat/review0003.html...
Processing next page
Downloading https://cssbook.net/d/eat/review0003-1.html...
Processing next page
Downloading https://cssbook.net/d/eat/review0003-2.html...
No more pages found.
          username                                         reviewtext  rating  \
0     gourmet2536   The best thing to do is ordering a full menu, ...  7.0/10   
1        foodie12                          The worst food I ever had!  1.0/10   
2    mrsdiningout             If nothing else is open, you can do it.  6.5/10   
3        foodie12                               Best 
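
The ratings are scraped as strings such as `7.0/10`, so they need to be converted before any calculation. A small sketch, assuming every rating follows the `score/maximum` pattern seen above (`parse_rating` is a hypothetical helper):

```python
def parse_rating(rating):
    """Split a string like '7.0/10' into (score, maximum) as floats."""
    score, maximum = rating.strip().split("/")
    return float(score), float(maximum)

ratings = ["7.0/10", "1.0/10", "6.5/10"]
scores = [parse_rating(r)[0] for r in ratings]
print(scores)  # [7.0, 1.0, 6.5]
```

On the data frame, the same function could be applied to the `rating` column, for instance with `df["rating"].apply(parse_rating)`.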

In [59]:
driver = webdriver.Firefox()
try:
    driver.implicitly_wait(10)
    driver.get("https://www.duckduckgo.com")
    # find_element(By.NAME, ...) works in Selenium 3.141
    # and 4; the older find_element_by_* helpers are
    # gone from recent Selenium 4 releases.
    # Also check out By.XPATH and By.CSS_SELECTOR
    element = driver.find_element(By.NAME, "q")
    element.send_keys("TinTin")
    element.send_keys(Keys.RETURN)
    driver.find_element(
        By.CSS_SELECTOR, "#links a").click()
    # let's be cautious and wait 10 seconds
    # so that everything is loaded
    time.sleep(10)
    driver.save_screenshot("screenshotTinTin.png")
finally:
    # whatever happens, close the browser
    driver.quit()

In [6]:
URL = "https://www.geenstijl.nl/5160019/page"

# circumvent cookie wall by setting a specific
# cookie: the key-value pair (cpc: 10)
client = requests.session()
r = client.get(URL)

cookies = client.cookies.items()
cookies.append(("cpc","10"))
response = client.get(URL,cookies=dict(cookies))
# end circumvention

tree = fromstring(response.text)
allcomments = [e.text_content().strip() for e in 
               tree.cssselect(".cmt-content")]
print(f"There are {len(allcomments)} comments.")

Een kudtkoekiewall. Omdat dat moet, van de kudtkoekiewet.
There are 318 comments.


In [61]:
r = requests.get(URL,cookies={"cpc": "10"})
tree = fromstring(r.text)
allcomments = [e.text_content().strip() for e in 
               tree.cssselect(".cmt-content")]
print(f"There are {len(allcomments)} comments.")

There are 318 comments.
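
Passing `cookies={"cpc": "10"}` boils down to sending one extra HTTP header with the request. The sketch below makes that visible with the standard library only. The request object is built but never sent, and the manual joining of key-value pairs is a simplification that assumes plain values without special characters.

```python
from urllib.request import Request

cookies = {"cpc": "10"}
# cookies travel as a single "Cookie" header: key=value pairs joined by "; "
cookie_header = "; ".join(f"{k}={v}" for k, v in cookies.items())

# build (but do not send) a request carrying that header
req = Request("https://www.geenstijl.nl/5160019/page",
              headers={"Cookie": cookie_header})
print(req.get_header("Cookie"))  # cpc=10
```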


### Authentication

In [10]:
requests.get("https://api.textrazor.com/account/",
  headers={"x-textrazor-key": "SECRET"}).json()

{'ok': False, 'time': 0, 'error': 'Your TextRazor API Key was invalid.'}
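
Hard-coding a key in a notebook makes it easy to publish it by accident. One common alternative is to read it from an environment variable. The following is a sketch under that assumption: the variable name `TEXTRAZOR_API_KEY` and the helper `textrazor_headers` are hypothetical, and the key is set in the code only so that the example runs.

```python
import os

def textrazor_headers(env_var="TEXTRAZOR_API_KEY"):
    """Build the authentication header from an environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable first")
    return {"x-textrazor-key": key}

# demo only: normally you would set this outside of Python
os.environ.setdefault("TEXTRAZOR_API_KEY", "SECRET")
headers = textrazor_headers()
print(list(headers))  # ['x-textrazor-key']
```

The resulting dictionary can be passed as `headers=headers` to `requests.get`, as in the cell above.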

In [9]:
from requests_oauthlib import OAuth2Session

client_id = 'xxxx'
client_secret = 'xxxx'
# no trailing slash, otherwise the URLs below contain "//"
base_url = "https://github.com/login/oauth"
auth_url = f"{base_url}/authorize"
token_url = f"{base_url}/access_token"

github = OAuth2Session(client_id)

url, state = github.authorization_url(auth_url)
print(f"Please go here and authorize {url}")

# Get auth. verifier code from callback url
resp = input("Paste the full redirect URL here:")

# Fetch the access token
github.fetch_token(token_url, 
        client_secret=client_secret,
        authorization_response=resp)

r = github.get("https://api.github.com/user")
print(r.content)

Please go here and authorize https://github.com/login/oauth/authorize?response_type=code&client_id=1d416a908fd48c411fae&state=awW61RpLrgLiuj6CAvq0rDqFi9fYEl


Paste the full redirect URL here: https://example.com/?code=8a63e07489f3285669a5&state=awW61RpLrgLiuj6CAvq0rDqFi9fYEl


b'{"login":"vanatteveldt","id":1736240,"node_id":"MDQ6VXNlcjE3MzYyNDA=","avatar_url":"https://avatars.githubusercontent.com/u/1736240?v=4","gravatar_id":"","url":"https://api.github.com/users/vanatteveldt","html_url":"https://github.com/vanatteveldt","followers_url":"https://api.github.com/users/vanatteveldt/followers","following_url":"https://api.github.com/users/vanatteveldt/following{/other_user}","gists_url":"https://api.github.com/users/vanatteveldt/gists{/gist_id}","starred_url":"https://api.github.com/users/vanatteveldt/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/vanatteveldt/subscriptions","organizations_url":"https://api.github.com/users/vanatteveldt/orgs","repos_url":"https://api.github.com/users/vanatteveldt/repos","events_url":"https://api.github.com/users/vanatteveldt/events{/privacy}","received_events_url":"https://api.github.com/users/vanatteveldt/received_events","type":"User","site_admin":false,"name":"Wouter van Atteveldt","company":"VU U
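
The redirect URL pasted above carries two things: the authorization `code` and the `state` value that was sent along in the first step. In the cell above, the full URL is handed to `fetch_token`, so none of this has to be done by hand. The sketch below only illustrates what the URL contains and why the state is compared; `parse_redirect` is a hypothetical helper, not part of `requests_oauthlib`.

```python
from urllib.parse import urlparse, parse_qs

def parse_redirect(redirect_url, expected_state):
    """Extract the authorization code after checking the state."""
    query = parse_qs(urlparse(redirect_url.strip()).query)
    state = query.get("state", [None])[0]
    if state != expected_state:
        # a different state means the redirect does not
        # belong to the request we started
        raise ValueError("State mismatch, do not continue")
    return query["code"][0]

redirect = ("https://example.com/?code=8a63e07489f3285669a5"
            "&state=awW61RpLrgLiuj6CAvq0rDqFi9fYEl")
code = parse_redirect(redirect,
                      "awW61RpLrgLiuj6CAvq0rDqFi9fYEl")
print(code)  # 8a63e07489f3285669a5
```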