# Introduction to Web Scraping and APIs with Python

## Fedor A. Dokshin

**Acknowledgements:** This introduction draws on material from two sources: Joshua Mausolf's tutorial for "Computing for the Social Sciences" at UChicago (http://cfss.uchicago.edu/fall2016/index.html) and George Berry's and Chris Cameron's tutorials on web scraping at Cornell. A big thanks to them for sharing these materials (Joshua under the [CC BY-NC 4.0 Creative Commons License](http://creativecommons.org/licenses/by-nc/4.0/))!

## Intro
The tutorial will start with an introduction to APIs (Application Programming Interfaces). APIs are the ideal way to access data on the web, because they are specifically designed to give you access to well-structured data. We'll practice getting some data from a Twitter API.

Often, however, you will want to access data that is not available through an API. In such cases you'll need to write a web scraping program. We'll start with an elementary case of web scraping, in which the data you want to acquire is embedded in the HTML of the requested web page. Next, we'll walk through a more complicated case, in which we automate a browser to visit and click through to specific web pages. This enables us to deal with web pages that use dynamic JavaScript code. Although this method is slower, it works on a broader set of websites.

## Using Python to work with APIs
APIs (Application Programming Interfaces) are a method for interacting with website content. *User interfaces* are what we see when we visit a website, and they are tailored to make it easier for a human to interact with the content. APIs are like user interfaces, but with *computer code* as the intended "user." APIs make it easy for computer code to access and interact with website content.

Large media companies (e.g., Google, Facebook, Twitter) and other content providers (e.g., government agencies, nonprofits, and individuals) create APIs to enable people to access content in a systematic way. ProgrammableWeb maintains a directory of over 12,000 different APIs. [Click here to explore the range of APIs that are available.](https://www.programmableweb.com/category/all/apis)


### API Documentation and Credentials
Each API is site-specific, but most come with extensive documentation and examples for developers. For instance, the Twitter API has extensive documentation, [which you can access here](https://developer.twitter.com/en/docs).

To begin working, you will typically have to register to get API credentials. These credentials are used to authenticate your access to the web content. We will not go into the details of authentication, but [you can read a bit more about it here](https://blog.restcase.com/restful-api-authentication-basics/). An important thing to note is that your credentials are unique and so you should treat them like you would a password.

### Python Requests
A key tool for working with many APIs from Python is the `requests` module. If it is not already installed, use `pip install requests` (or `conda install requests` if using Anaconda). The `requests` module is typically used to get the response from an API for a given URL. (We will also make use of `requests` for web scraping, below.)
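As a rough sketch of how `requests` is typically used with an API: you supply a base URL and a dictionary of query parameters, and `requests` assembles the final URL for you. The endpoint `https://api.example.com/v1/search`, its parameters, and the helper `fetch_json()` below are placeholders invented for illustration, not a real service, so the example only prepares the request rather than sending it.

```python
import requests

# Hypothetical endpoint and parameters, for illustration only
API_URL = "https://api.example.com/v1/search"
params = {"q": "python", "count": 5}

# Build the request without sending it, to see the URL requests would call
prepared = requests.Request("GET", API_URL, params=params).prepare()
print(prepared.url)

def fetch_json(url, params=None):
    """Send a GET request and return the parsed JSON body."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.json()
```

With a real API you would call something like `fetch_json(API_URL, params)` and work with the dictionary it returns.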

### JSON
Another key element is JSON (JavaScript Object Notation), a relatively simple data storage format. Many responses to API queries are returned in JSON format. Other common formats that you might encounter when using APIs include [XML](https://en.wikipedia.org/wiki/XML) and [YAML](https://en.wikipedia.org/wiki/YAML).
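To make the format concrete, here is a minimal sketch using Python's built-in `json` module. The JSON text is made up for illustration; real API responses are usually much larger but follow the same nesting of objects and arrays.

```python
import json

# A made-up JSON response, similar in shape to what many APIs return
raw = '{"id": 1, "user": {"name": "example_user"}, "tags": ["python", "api"]}'

data = json.loads(raw)           # JSON text -> Python dict
print(data["user"]["name"])      # nested objects become nested dicts
print(data["tags"][0])           # JSON arrays become Python lists

# Convert back to nicely formatted JSON text
print(json.dumps(data, indent=4, sort_keys=True))
```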

Fully covering the nuts and bolts of Python Requests and JSON for using APIs is beyond the current scope. However, two good tutorials are linked below for further exploration if desired:
  * [RealPython](https://realpython.com/blog/python/api-integration-in-python)
  * [DataQuest](https://www.dataquest.io/blog/python-api-tutorial/)

## Installing new packages
To follow along with this tutorial you should have a working version of Python 2.7 installed on your computer. I recommend the Anaconda distribution of Python, which comes with many packages commonly used in data science pre-installed. [This page provides step-by-step instructions for installing the Anaconda distribution of Python.](https://docs.anaconda.com/anaconda/install/)

This tutorial will make use of several additional packages that you may need to install. You can install these using the [Python package manager `pip`](https://packaging.python.org/installing/) or, if you're using Anaconda (and the package is available through them), the [`conda` package manager](https://conda.io/docs/user-guide/tasks/manage-pkgs.html).

* [TwitterAPI](https://github.com/geduldig/TwitterAPI) (a package for interacting with the Twitter API): 
```shell
pip install TwitterAPI
```

* [selenium](http://selenium-python.readthedocs.io/) (a package used to automate web browser interaction): 
```shell
pip install selenium
```

## Getting authentication credentials
Content providers that offer APIs want to be able to identify who is making content requests. There are a number of security reasons for this (the authentication article linked under "API Documentation and Credentials" above covers the basics). Practically, it means that most APIs will require you to obtain some sort of authentication credentials. You can think of these as your username and password for using the API.

### Setting up your authentication credentials for the Twitter API
To set up your credentials for the Twitter API you will **first need to create a Twitter account**. If you already have an account, you can use it.

Once you have an account, follow these steps to obtain your authentication credentials:
* Go to http://dev.twitter.com/apps/new and click on "Create New App."
* Fill in the Application Details:
  * Name your application (e.g., SSRM_2018)
  * Describe your application (e.g., "This app is for the SSRM 2018 API tutorial")
  * We don't have a website associated with our app, so let's just put in a placeholder (e.g., "https://ssrm2018.com")
  * Leave "Callback URL" blank.
* Once your app is created, click on "Keys and Access Tokens."
  * Your Consumer Key and Consumer Secret are listed on this page.
  * Click "Create my access token" to generate the Token and Token Secret.

Now you have the four pieces that you'll need for authentication: (1) Consumer Key, (2) Consumer Secret, (3) Access Token, and (4) Access Token Secret.

See [https://dev.twitter.com/docs/auth/oauth](https://dev.twitter.com/docs/auth/oauth) for more information on Twitter's OAuth implementation.

## Now for some Python code
### Let's first import our packages and authenticate our API

In [None]:
# Before we can use a package, we need to import it
# You should always import all packages you'll use at the top of your file
import re, textwrap, os, json
import pandas as pd
import numpy as np

# NOTE: Must have TwitterAPI Installed
from TwitterAPI import TwitterAPI
from TwitterAPI import TwitterPager


# NOTE: You should not include your authorization credentials in your code. 
# Treat them as sensitively as you would your Twitter password.
consumer_key = ""
consumer_secret = ""
access_token_key = ""
access_token_secret = ""

#API User Authorization
api = TwitterAPI(consumer_key, consumer_secret, access_token_key, access_token_secret)
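One common way to keep credentials out of your code is to read them from environment variables. The sketch below assumes you have set four variables in your shell; the variable names and the helper `load_credentials()` are hypothetical choices for illustration, not something the Twitter API or the TwitterAPI package requires.

```python
import os

# Hypothetical environment variable names; use whatever names you prefer
CREDENTIAL_VARS = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]

def load_credentials(environ=os.environ):
    """Return the four credentials in order, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return [environ[name] for name in CREDENTIAL_VARS]
```

With the variables set, `api = TwitterAPI(*load_credentials())` could stand in for the four hard-coded strings in the cell above.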

### For our first request, let's request one of Donald Trump's tweets by its ID.

In [None]:
tweetID = 991090373417152515 # enter ID of the tweet you want to request

# This is the TwitterAPI syntax for requesting information about a specific tweet.
# See the docs here: https://github.com/geduldig/TwitterAPI
r = api.request('statuses/show/:%d' % tweetID, {'tweet_mode': 'extended'})

#Let's print the returned json object
a_json = json.loads(r.text)
print json.dumps(a_json, indent=4, sort_keys=True) # this code just prints the json object in a nice format
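Once parsed, the tweet is an ordinary Python dictionary, so individual fields can be pulled out by key. The sketch below uses a made-up, heavily trimmed stand-in for `a_json` with only the `full_text` and `created_at` fields, and it assumes the timestamp carries the `+0000` offset shown in the sample string.

```python
from datetime import datetime

# A made-up, heavily trimmed stand-in for the parsed tweet (a_json above)
tweet = {
    "created_at": "Tue May 01 12:34:56 +0000 2018",
    "full_text": "An example tweet, not real data.",
}

text = tweet["full_text"]

# Parse the timestamp; the literal "+0000" in the pattern assumes a UTC offset
created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S +0000 %Y")

print(text)
print(created.strftime("%b %d, %Y"))  # date portion
print(created.strftime("%H:%M:%S"))   # time portion
```

Parsing into a `datetime` object makes it easier to sort or filter tweets by time later on.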

### Now that we vaguely know what we're working with, let's use Twitter's  API to request a bunch of tweets based on a search term
We'll first define some functions. Functions are short programs that execute a pre-specified bundle of commands. When you're going to execute the same set of steps many times, it is useful to bundle them into a function.

The functions below are all you need to collect existing tweets based on a search and write them to a CSV file.

Notice that each function makes use of an earlier one (i.e., `extract_tweet_info()` uses `remove_non_ascii_2()`, `counter()` uses `extract_tweet_info()`, and `collect_tweets_for_search_terms()` uses `counter()`).

In [None]:
## This function removes non-standard characters
def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]+', "'", text)

## This function extracts information from the returned JSON object
## and stores it in a Pandas DataFrame.
def extract_tweet_info(item, count, df, search_term):
	"""
	Utility to Prevent Code Duplication in Counter
	"""

	n = len(df.index)
	tweet_raw = item['full_text']
	tweet = remove_non_ascii_2(tweet_raw)

	#Clean up date and time
	date_raw = item['created_at'].split(' ')
	date = date_raw[1]+" "+date_raw[2]+", "+date_raw[5]
	time = date_raw[3]

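	#Add Row to Data Frame
	#(.loc is used because the older .ix indexer is deprecated and absent from newer pandas versions)
	df.loc[n] = 0
	df.loc[n, "DATE"] = date
	df.loc[n, "TIME"] = time
	df.loc[n, "COUNT"] = count
	df.loc[n, "SEARCH_TERM"] = search_term
	df.loc[n, "TWEET"] = tweet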

## This function counts off the number of tweets you want to collect and also deals 
## with Twitter's limits on how many results you get back at a time.
## Basically, it requests additional tweets and skips the already acquired tweets.
## For details, see: https://developer.twitter.com/en/docs/tweets/timelines/guides/working-with-timelines
def counter(search_term, df, limit):
	count = 0

	#Initialize Twitter Pager
	r = TwitterPager(api, 'search/tweets', {'q':search_term, 'count':100, 'tweet_mode': 'extended'})

	#Limit Option
	if limit is not None:
		print("requested tweets for search_term is limited to {} tweets".format(limit))

		for item in r.get_iterator(wait=6):

			if 'full_text' in item:
				if count < limit: #strict comparison, so we stop at exactly `limit` tweets
					
					count += 1
					print("collecting tweet {} of {}...".format(count, limit))
					
					#Extract Tweet Info
					extract_tweet_info(item, count, df, search_term)

				else:
					print("requested tweet limit reached...")
					print("ending query for search_term...")
					return

			elif 'message' in item and item['code'] == 88:
				print('SUSPEND, RATE LIMIT EXCEEDED: %s' % item['message'])
				break
			

	#No Limit
	else:

		for item in r.get_iterator(wait=6):
			
			if 'full_text' in item:

				count += 1
				print("collecting tweet {} of all available tweets...".format(count))

				#Extract Tweet Info
				extract_tweet_info(item, count, df, search_term)

			elif 'message' in item and item['code'] == 88:
				print('SUSPEND, RATE LIMIT EXCEEDED: %s' % item['message'])
				break
	 
    
## This function takes a search_term and collects tweets that used that search_term.
## "limit" sets the upper bound of tweets you wish to be collected.
## It then stores the collected tweets in a comma-separated text file (CSV).
def collect_tweets_for_search_terms(search_terms, limit, directory):

	for search_term in search_terms:

		print ("Collecting tweets for {} ...".format(search_term))

		#Setup Initial Data Frame
		header = ["DATE", "TIME", "COUNT", "SEARCH_TERM", "TWEET"]
		index = np.arange(0)
		df = pd.DataFrame(columns=header, index = index)

		#Count Tweets
		counter(search_term, df, limit)

		#Save the Results
		file_name = os.path.join(directory, search_term.replace('#', '')+"_Tweets.csv")
		print("saving results for {} to {}...".format(search_term, file_name))
		df.to_csv(file_name, encoding='utf-8')
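The comments on `counter()` mention requesting additional tweets while skipping the ones already acquired. The sketch below illustrates that paging idea with an in-memory list standing in for Twitter. `FAKE_TWEETS`, `fetch_page()`, and `collect()` are hypothetical names, and this is only an illustration of the general technique, not how `TwitterPager` is actually implemented.

```python
# Hypothetical stand-in for a search API: tweets with descending IDs
FAKE_TWEETS = [{"id": i, "full_text": "tweet %d" % i} for i in range(250, 0, -1)]

def fetch_page(max_id=None, count=100):
    """Return up to `count` tweets whose ID is <= max_id (newest first)."""
    matches = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return matches[:count]

def collect(limit=None):
    """Request page after page until `limit` tweets are collected or none remain."""
    collected = []
    max_id = None
    while limit is None or len(collected) < limit:
        page = fetch_page(max_id=max_id)
        if not page:
            break  # no more results
        collected.extend(page)
        # Ask for tweets strictly older than the oldest one seen so far
        max_id = page[-1]["id"] - 1
    return collected if limit is None else collected[:limit]

tweets = collect(limit=120)
print(len(tweets))
```

Each request asks only for tweets older than the oldest one already collected, which is why no tweet is fetched twice.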

### Now to collect our tweets!

To search for tweets and generate a CSV file with the results, we just need to run the final function `collect_tweets_for_search_terms()` and specify its three parameters: `search_terms`, `limit`, and `directory`.
* `search_terms`: a list object containing search terms (e.g., ['#goleafsgo', '#leafshockey', '#mapleleafs'] ).
* `limit`: an integer indicating the maximum number of tweets to get. If you do not want a limit, enter `None` (without quotes).
* `directory`: the path to a folder on your computer where you want to save the CSV of tweets.


In [None]:
search_terms = ["#InternationalWorkersDay"]
limit = 1000
directory = "" #e.g., "/Users/your_name/Desktop/"
collect_tweets_for_search_terms(search_terms, limit, directory)

# Basic web scraping
You may often find yourself wishing to collect data from a webpage, but there is no available API. In such cases you can use web scraping. Web scraping  automates the process of visiting a website and downloading the content you want.

Webpages are written in the markup language HTML. HTML is a complex syntax that enables the website creator to include all kinds of different elements on a webpage. You do not need to know much about HTML to scrape a website, however. HTML has a well-organized structure, which is easy to navigate to find just what you need.
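To give a feel for that structure, here is a small sketch that parses a made-up page with BeautifulSoup (the `bs4` package, which is also used in the cells below) and walks its nested tags. The HTML, names, and URLs are invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, just to show the nested tag structure
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>People</h1>
    <p class="person">Ada <a href="https://example.com/ada">profile</a></p>
    <p class="person">Grace <a href="https://example.com/grace">profile</a></p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title.get_text())           # text inside the <title> tag
for p in soup.find_all("p", class_="person"):
    link = p.find("a")                 # first <a> tag nested inside this <p>
    print(p.get_text())
    print(link["href"])
```

Tags nest inside one another, so finding the data you want usually comes down to naming the right tag and, where needed, its class or other attributes.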

Since the basic HTML sites of yore (i.e., the 90s), webpages have become increasingly dynamic and complex. JavaScript enables interactive pages and is now ubiquitous on the web, which adds a few steps to web scraping. When scraping pages with a lot of interactive JavaScript, you may need to use the Selenium WebDriver. Selenium is a tool for automating a browser (e.g., Chrome, Firefox, Internet Explorer). When using Selenium, you write a program that drives the browser to specific websites. The approach is very general: because it deploys a fully functional browser, you can access any content that you could reach by browsing manually. The tradeoff is that scraping with Selenium is significantly slower than more direct approaches (e.g., via requests, Scrapy, or urllib2).

## First, let's get text from a simple website using the `requests` package
We'll use the UofT Sociology Graduate Student directory as an example: http://sociology.utoronto.ca/people/grad-student-page/

In [None]:
import requests #for accessing websites
from bs4 import BeautifulSoup #for parsing HTML

site = requests.get("http://sociology.utoronto.ca/people/grad-student-page/")

soup = BeautifulSoup(site.text, "html.parser") #naming the parser avoids a warning and keeps results consistent across machines
print soup.prettify()

In [None]:
#each student's info is under a "p" tag, which indicates a "paragraph element" in html
paragraphs = soup.find_all('p') #grab all p tag elements (this will include some that are not students)
for i, paragraph in enumerate(paragraphs): #enumerate gives us the index and the element together
    print i, paragraph.get_text()

### Let's filter out the irrelevant paragraphs and keep just the students' information

In [None]:
for x in paragraphs[:105]: #The last student we're interested in is in paragraph 104
    
    info = x.get_text().split('\n') #split by new line character to get different elements
    
    if len(info) > 3: #keep only paragraphs that have all four fields printed below
        
        # We can store the extracted student info in a dictionary for further processing,
        # or write to CSV etc. Here, for purposes of illustration we just print them to screen.
        
        print "Name:", info[0]
        print "Thesis:", info[1]
        print "Committee:", info[2]
        print "Email:", info[3], '\n'
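The comment in the cell above mentions storing each student's information in a dictionary. A minimal sketch of that step follows, using made-up HTML in place of the real directory page (whose layout may differ). The field names in `fields` are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the directory page; the real page may differ
html = """<p>Jane Doe
Thesis: An Example Thesis
Committee: A. Smith, B. Jones
jane.doe@example.com</p>
<p>A paragraph that is not a student entry</p>"""

soup = BeautifulSoup(html, "html.parser")
fields = ["name", "thesis", "committee", "email"]

students = []
for p in soup.find_all("p"):
    info = p.get_text().split("\n")
    if len(info) == len(fields):   # skip paragraphs without all four lines
        students.append(dict(zip(fields, info)))

for student in students:
    print(student["name"])
    print(student["email"])
```

A list of dictionaries like `students` is a convenient intermediate form, since it can be handed to `pandas.DataFrame()` or written out row by row.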

### Now for a slightly more complicated example
Assume we want to extract the text of an organization's press releases. Many projects that use automated text analysis start with scraping large amounts of text from the web. For example, Chris Bail (2012) used over 1,000 press releases from dozens of advocacy organizations in his study of how these organizations influence public discourse.

In this example, we'll learn how to collect all press releases published by 350.org, an international environmental organization that advocates for action on climate change. [Here's the link to press releases published by 350.org.](https://350.org/press-release/)

We'll use Selenium to automate a web browser to click through to the text of each press release. We'll then collect the text of each press release and write it to a text file.

**Note: If you get a `WebDriverException` when executing the code below, you'll need to install Chromedriver. Go to your terminal and type `pip install chromedriver_installer`. You should then be able to use the webdriver.**

In [None]:
from bs4 import BeautifulSoup
import time, os
from selenium import webdriver

# Selenium webdriver is most useful when you're scraping pages with Javascript
driver = webdriver.Chrome()

# Website url where all press releases are stored
base_url_350 = "https://350.org/press-release/page/"

# Constants
# Number of pages of press releases on site
NUMBER_OF_PAGES_350 = 3 #There are 53 pages (as of this writing), but to save time we'll limit the job to just the first 3 

# This is where our output data will go
OUTPUT_DIRECTORY = '' #e.g., '/Users/your_name/Desktop/350_press_releases'

# We will keep track of URLs that we have already visited with this set
visited_urls = set()

# Main Body
"""
The intuition behind this is simple:
    1) We get the URLs of each press release from each page of press releases
    2) We then go through the press release URLs one by one and get press release text
    3) We write each press release to a text file
"""

# CRAWLING THROUGH WEBPAGES
for i in range(1, NUMBER_OF_PAGES_350+1):
    press_release_urls = [] #empty list to contain urls
    
    # READING WEBPAGES
    url = base_url_350 + str(i) #Concatenate url with index value
    driver.get(url)  #Get the webpage
    soup = BeautifulSoup(driver.page_source, "html.parser") #Convert it to a BS object - "soup"
    
    # FINDING INDIVIDUAL PRESS RELEASES
    for link in soup.find_all('a', href=True): #finds html objects containing hyperlinks
        candidate_link = link['href'] #gets the link URL as a string
        
        # Keep the link only if it has the proper base url in it, does not have '/page' or 'facebook' in it,
        # and is not just "https://350.org/press-release/" itself
        is_press_release = (
            "https://350.org/press-release/" in candidate_link
            and "/page" not in candidate_link
            and "facebook" not in candidate_link
            and candidate_link != "https://350.org/press-release/"
        )
        if is_press_release:
            print candidate_link
            press_release_urls.append(candidate_link)

    # PROCESSING PRESS RELEASES
    for pr_url in press_release_urls:
        # if it is not in the set of visited links
        if pr_url not in visited_urls:
            # add it to the set and visit it
            visited_urls.add(pr_url)
            time.sleep(1) #limit calls to 1 per second
            driver.get(pr_url)
            soup = BeautifulSoup(driver.page_source, "html.parser")
            content = soup.find_all('p')
            # print([x.getText() for x in content][-5:])
            print (
                "START OF NEW PRESS RELEASE WITH LENGTH {}!".format(len(content))
            )
            paragraphs = []
            #print content
            for c in content:
                c_text = c.getText()
                paragraphs.append(c_text)
            # we don't need the last element,
            # so we slice it off
            #print paragraphs
            trimmed_paragraphs = paragraphs[:-1]
            #print trimmed_paragraphs
            # we join them back together into a string
            press_release_text = "\n".join(trimmed_paragraphs)
            #print press_release_text + '\n\n\n'
                    
                
            # WRITING THE PRESS RELEASE TO A TEXT FILE
            file_name = pr_url.split('/')[-2]+'.txt' #we'll use the last part of the URL (which assumes the URL ends in '/') as the file name
            file_path = os.path.join(OUTPUT_DIRECTORY, file_name)
            with open( file_path, 'w') as f:
                f.write(pr_url + '\n\n')
                f.write(press_release_text.encode('utf8'))

# Close the browser once all pages have been processed
driver.quit()
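The link filter and the file-naming step in the cell above can also be pulled out into small functions, which makes them easy to check without opening a browser. The sketch below mirrors the same conditions. The function names `is_press_release_link()` and `slug_from_url()` and the candidate links are made up for illustration.

```python
BASE = "https://350.org/press-release/"

def is_press_release_link(link):
    """Return True for links to an individual press release."""
    return (
        BASE in link
        and "/page" not in link
        and "facebook" not in link
        and link != BASE
    )

def slug_from_url(url):
    """Use the last path segment of a URL ending in '/' as a file-name stem."""
    return url.split("/")[-2]

# Made-up candidate links for illustration
candidates = [
    "https://350.org/press-release/",
    "https://350.org/press-release/page/2/",
    "https://www.facebook.com/sharer.php?u=https://350.org/press-release/example-release/",
    "https://350.org/press-release/example-release/",
]
kept = [c for c in candidates if is_press_release_link(c)]
print(kept)
print(slug_from_url(kept[0]) + ".txt")
```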

### A few final notes
It is usually impractical to launch an actual browser window, so you will typically want to use a "headless" browser (a browser without a graphical interface). [See this guide to learn how to run your Chrome webdriver in headless mode](https://intoli.com/blog/running-selenium-with-headless-chrome/). This will also save you time, because the pages do not have to be rendered on a screen.

Selenium is very useful because of its flexibility and intuitiveness, but it is overkill for many jobs and will cost you a lot of time if you're scraping a very large number of websites. In cases where dynamic JavaScript is not an issue, you should use `requests` or check out [Scrapy](https://scrapy.org/), a very powerful and fast Python framework for extracting data from websites.