## ISRC Python Workshop 2: Web Scraping (Mar. 09th 2017)

### 1. Use off-the-shelf tools: _Do not reinvent the wheel!!!_

#### Use Application Programming Interfaces (APIs): Twitter API as an example

Many web services provide their own APIs. With these tools, we can get started right away by following their documentation and examples, without writing a scraper by hand. We will use the <a href="https://apps.twitter.com/" target="_blank">Twitter API</a> as an example.

First, we have to register for a Twitter Developer account and register an app. Let's go to https://dev.twitter.com/ and put an app together. <a href="https://python-twitter.readthedocs.io/en/latest/getting_started.html" target="_blank">Here</a>'s a quick start on how you can do this.

After we obtain *__consumer key__*, *__consumer secret__*, *__access token__*, and *__access token secret__*, we are ready to retrieve some data from Twitter!

In [1]:
## suppress warnings
import warnings
warnings.filterwarnings('ignore')

## read my app keys
## I saved my own keys in a text file.
## Alternatively, put your keys and secrets directly in the code,
## following these four lines of commented code:
## consumer_key = "your_consumer_key"        
## consumer_secret = "your_consumer_secret"
## access_token = "your_access_token"
## access_secret = "your_access_secret"

with open("./twitter_keys.csv", "r") as twitter_keys:
    keys = twitter_keys.read()
    consumer_key, consumer_secret, access_token, access_secret = \
        keys.split("\n")[:-1]

'''
consumer_key = "e2hTzk3cSLWwj7P2gqzKYe054"        
consumer_secret = "BJYoFI8Tm3CgLlCQs4tu6HfKfA2yeVyje18Uhz8z72sb0BpLrd"
access_token = "2740697738-gJ9TRAgG1RB9HBlzkgO5XUAAUipJXevFhrhLRaU"
access_secret = "kgnOc0sK2GFqrGwzninGPVcBRRampjbZg7LQYg7wh9cTx"
'''

## load the twitter package, which is a well-written Python package for the Twitter APIs
import twitter
api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,                  
                  access_token_key=access_token,
                  access_token_secret=access_secret)

## check status
print(api.VerifyCredentials())

{"created_at": "Sun Aug 17 22:58:25 +0000 2014", "description": "haha", "followers_count": 1, "friends_count": 8, "geo_enabled": true, "id": 2740697738, "lang": "en", "location": "Iowa, USA", "name": "Zhiya Zuo", "profile_background_color": "709397", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme6/bg.gif", "profile_banner_url": "https://pbs.twimg.com/profile_banners/2740697738/1408396835", "profile_image_url": "http://pbs.twimg.com/profile_images/501478704422211585/ZRxyBIQ4_normal.png", "profile_link_color": "FF3300", "profile_sidebar_fill_color": "A0C5C7", "profile_text_color": "333333", "screen_name": "zhiyzuo", "status": {"created_at": "Mon Oct 31 16:40:09 +0000 2016", "id": 793130450583638016, "id_str": "793130450583638016", "lang": "en", "source": "<a href=\"https://fabric.io\" rel=\"nofollow\">Created by Fabric for zhiya-zuos-projects2: com.example.zhiyzuo.twitterapp on android</a>", "text": "#fabric #myowntwitterapp fun app!"}, "statuses_count": 6, "ti
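The `keys.split("\n")[:-1]` step in the cell above only works when the key file ends with exactly one trailing newline. A slightly more defensive sketch is shown here. The helper name `load_twitter_keys` is hypothetical, and it assumes the same file layout as above: one value per line, in the order consumer key, consumer secret, access token, access secret.

```python
def load_twitter_keys(path):
    """Read four credentials, one per line, ignoring blank lines."""
    with open(path, "r") as key_file:
        # strip whitespace and skip empty lines so that a missing or
        # extra trailing newline does not change the result
        values = [line.strip() for line in key_file if line.strip()]
    if len(values) != 4:
        raise ValueError("expected 4 credentials, found %d" % len(values))
    return tuple(values)

# usage (assumes ./twitter_keys.csv exists):
# consumer_key, consumer_secret, access_token, access_secret = \
#     load_twitter_keys("./twitter_keys.csv")
```

Keeping the credentials in a separate file (and out of version control) also avoids pasting them into the notebook itself.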



In [3]:
## Try some simple tasks
# get statuses (screen_name is the account handle, not the display name)
statuses = api.GetUserTimeline(screen_name="zhiyzuo")
print([s.text for s in statuses])
statuses = api.GetUserTimeline(user_id="2740697738")
print([s.created_at for s in statuses])

## Get your friends
friends = api.GetFriends()
print([f.name for f in friends])

[u'#fabric #myowntwitterapp fun app!', u'test my fabric composer', u'Test pic http://t.co/C8CJbKg19b', u'Test url twitter api http://t.co/3eXsFUEZPo', u'This is a test tweet for the api.', u'Test tweet']
[u'Mon Oct 31 16:40:09 +0000 2016', u'Mon Oct 31 04:03:32 +0000 2016', u'Wed Dec 10 17:11:50 +0000 2014', u'Wed Dec 10 17:01:40 +0000 2014', u'Fri Dec 05 22:28:49 +0000 2014', u'Fri Dec 05 17:54:31 +0000 2014']
[u'qix', u'Overleaf', u'Bookbyte', u'Andrew Ng', u'Iowa Memorial Union', u'The Daily Iowan', u'University of Iowa', u'kevin Garnet']


In [4]:
## More interestingly, let's go get some tweets from Twitter
## See https://dev.twitter.com/rest/public/search for more information on how to construct a query
results = api.GetSearch(
    raw_query="q=uiowa&result_type=popular&since=2014-12-01&count=20&lang=en")
# How to set `lang` parameter -> https://dev.twitter.com/rest/reference/get/help/languages

# show all the text in the retrieved tweets, with user screen name highlighted
from IPython.display import clear_output
for tw in results:
    clear_output() # clear the previous output, so only the latest tweet stays visible
    print("%s tweeted by \033[41m%s\033[0m"%(tw.text, tw.user.screen_name))

Coming up, one man reveals an unpublished novel by Walt Whitman. Read the "Life and Adventures of Jack Engle" https://t.co/Jah3KgoPVg tweeted by npratc
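The `raw_query` string used above is an ordinary URL query string, so it can be assembled from name/value pairs instead of being typed by hand. The following is a minimal sketch using only the standard library; the variable name `search_params` is made up for illustration, and the values are the ones from the cell above.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# a list of pairs keeps the parameters in a fixed order
search_params = [
    ("q", "uiowa"),
    ("result_type", "popular"),
    ("since", "2014-12-01"),
    ("count", 20),
    ("lang", "en"),
]

# urlencode also escapes characters that are not allowed in a URL
raw_query = urlencode(search_params)
print(raw_query)
```

The resulting string can then be passed as `raw_query=raw_query` to `api.GetSearch`, as in the cell above.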


### 2. Create your own manual parsing programs 

#### Manually scraping

Sometimes a service has an API but no well-written package in the language you prefer (e.g., only Java but no Python libraries). Even worse, there may be no public API at all, and we have to design a scraper to retrieve the relevant information ourselves. In such cases, we can manually inspect the returned values and build our own wrapper functions.

#### Preliminary examples

Examples from <a href="https://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic_document" target="_blank">w3schools</a>.

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>

</body>
</html>
```

Save this code to your disk as `sample.html` (or any other name). We will use a great library called ___`Beautiful Soup`___ to read the contents from Python. You may also need to install `lxml`, a parser for markup formats such as HTML and XML.

In [5]:
## Do the following if you have not
# pip install beautifulsoup4 lxml

from bs4 import BeautifulSoup as Soup

with open("./sample.html", "r") as sample:
    sample_contents = sample.read()

## name the parser explicitly; otherwise Beautiful Soup picks one itself and warns
sample_soup = Soup(sample_contents, "lxml")
# by printing it, we can see the exact contents as shown above with proper indentation
print(sample_soup.prettify())

<html>
 <body>
  <h1>
   My First Heading
  </h1>
  <p>
   My 1st paragraph.
  </p>
  <p>
   My 2nd paragraph.
  </p>
  <p>
   My 3rd paragraph.
  </p>
 </body>
</html>



In [6]:
## Get the contents of interest: all the p's
# ** p means paragraph in html. Check more tag definitions on w3schools.com
p_tags = sample_soup.find_all("p")
for p in p_tags:
    print(p.text)

My 1st paragraph.
My 2nd paragraph.
My 3rd paragraph.
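To see roughly what `find_all("p")` has to do, here is the same extraction written with the standard library's `HTMLParser` instead of Beautiful Soup. This is only an illustrative sketch of a swapped-in technique, not how Beautiful Soup is implemented; the class name `ParagraphCollector` is made up, and the HTML is the w3schools sample from above.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>
</body>
</html>"""

class ParagraphCollector(HTMLParser):
    """Collect the text found inside every <p> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # text may arrive in several pieces, so append to the current paragraph
        if self.in_p:
            self.paragraphs[-1] += data

collector = ParagraphCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for text in collector.paragraphs:
    print(text)
```

Beautiful Soup saves us from writing this kind of bookkeeping by hand, which is why the rest of this section sticks with it.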


#### Locate your contents of interest in browser
As you can see, this is very straightforward. Let's use a real website for illustration. For example, if we are interested in company profiles, we can scrape them from Google Finance. We will use <a href="https://www.google.com/finance?q=NYSE%3AIBM&ei=Ij62WPHgGdSLmAHrja_wAQ" target="_blank">IBM's profile</a> as an example. Off-the-shelf packages may already exist for this task; we use it here only to introduce manual scraping.

To view the real structure of a web page, you can use the ___`developer tools`___ in your browser; the screenshot below shows what this looks like. If you move your mouse over an element on the page, the inspector shows the corresponding tag in the HTML source. You will find that the description text is located within a `p` tag.

<img src="http://i.imgur.com/IEl2uyG.png" width="1000">

In [7]:
import urllib2 
ibm_url = "https://www.google.com/finance?q=NYSE%3AIBM&ei=Ij62WPHgGdSLmAHrja_wAQ"
## read it as strings
ibm_page = urllib2.urlopen(ibm_url).read()
## convert it to a soup object
ibm_soup = Soup(ibm_page, "lxml")
## find the corresponding tag. Note that class_ has a trailing underscore
summary_tag = ibm_soup.find("div", class_="companySummary")
print("\033[43m%s\033[0m"%(summary_tag.text))

International Business Machines Corporation (IBM) is a technology company. The Company operates through five segments: Global Technology Services (GTS), Global Business Services (GBS), Software, Systems Hardware and Global Financing. The Company's GTS segment offers services, including strategic outsourcing, integrated technology services, cloud and technology support services (maintenance services). Its GBS segment provides consulting and systems integration, application management services and process services. The software segment consists primarily of middleware and operating systems software. The Systems Hardware segment provides clients with infrastructure technologies. The Company's Global Financing segment includes client financing, commercial financing, and remanufacturing and remarketing.


More from Reuters »


With this in mind, you can scrape almost any webpage of interest. Other formats such as <a href="http://www.json.org/" target="_blank">JSON</a> and <a href="https://www.w3.org/XML/" target="_blank">XML</a> are similar in many ways, with a few differences, and their basics are not difficult to learn!

***But keep in mind that you should act politely, with proper permission. To find out whether specific paths/contents are allowed to be scraped, you can check their ___`robots.txt`___. For example, <a href="https://www.google.com/robots.txt" target="_blank">here's</a> the permission information set by Google.***
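Checking `robots.txt` can also be done from Python with the standard library's `RobotFileParser`. The sketch below parses a small set of made-up rules so that it runs offline; the rules, the `example.com` URLs, and the helper name `is_allowed` are all assumptions for illustration.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser         # Python 2

# made-up rules for illustration; a real site serves its rules at /robots.txt
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES)
parser.modified()  # mark the rules as loaded (harmless if already marked)

def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://example.com/public/page"))
print(is_allowed("https://example.com/private/data"))
```

For a real site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the live file instead of the hard-coded list.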

### 3. Exercise: build our own Scopus scraper

We will use the <a href="https://dev.elsevier.com/" target="_blank">Scopus APIs</a> as an example (they seem to be actively building such tools to facilitate application development, but we can still write DIY wrappers that best fit our own usage). I built my own, called <a href="http://zhiyzuo.github.io/python-scopus/" target="_blank">python-scopus</a>.

#### A quick start

The first step will still be getting your API key for access to Elsevier's database. Go to https://dev.elsevier.com/ and request a key. Then we can write a simple query to try it out!

In [8]:
with open("./scopus_apikey.csv", "r") as scopus_key:
    key = scopus_key.read().strip("\n")
## check out http://api.elsevier.com/documentation/SCOPUSSearchAPI.wadl
## search tips: http://api.elsevier.com/documentation/search/SCOPUSSearchTips.htm
search_uri = "http://api.elsevier.com/content/search/scopus"

import requests
## notice how we supply the query parameters using requests package
## http://docs.python-requests.org/en/master/user/quickstart/#make-a-request
par = {'apikey': key, 'query': 'TITLE-ABS-KEY(protein structure)'}
r = requests.get(search_uri, params=par)
## click to see
print("XML: %s"%r.url)

## by default this is an XML response; we can ask Elsevier to return a JSON object
par_json = {'apikey': key, 'query': 'TITLE-ABS-KEY(protein structure)', 'httpAccept':'application/json'}
r_json = requests.get(search_uri, params=par_json)
## click to see
print("JSON: %s"%r_json.url)

XML: http://api.elsevier.com/content/search/scopus?query=TITLE-ABS-KEY%28protein+structure%29&apikey=f4ef78418fdf4df42cdd4108dea1d803
JSON: http://api.elsevier.com/content/search/scopus?query=TITLE-ABS-KEY%28protein+structure%29&httpAccept=application%2Fjson&apikey=f4ef78418fdf4df42cdd4108dea1d803


For parsing the JSON object, you can use the <a href="http://docs.python-guide.org/en/latest/scenarios/json/" target="_blank">***`JSON`***</a> library, which we will cover later.
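As a small preview, the standard `json` module turns JSON text into nested Python dictionaries and lists. The sample text below is hand-written and heavily trimmed; it only imitates the shape of a Scopus search response, reusing a few values that appear in the outputs of this notebook.

```python
import json

# hand-written, heavily trimmed sample shaped like a Scopus search response
sample_text = '''
{"search-results": {
    "opensearch:totalResults": "854559",
    "opensearch:startIndex": "0",
    "opensearch:itemsPerPage": "25",
    "entry": [
        {"dc:identifier": "SCOPUS_ID:85013218391"}
    ]
}}
'''

sample = json.loads(sample_text)      # str -> nested dict/list
results = sample["search-results"]
# the counts arrive as strings, so convert them before doing arithmetic
total = int(results["opensearch:totalResults"])
print(total, len(results["entry"]))
```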

So, we currently have something like this:

<img src="http://i.imgur.com/4NnM00t.png">

We can actually use the same trick for XML as we did for the HTML files: first download the response with the ___`urlopen`___ function in the ___`urllib2`___ library. However, ___`requests`___ makes all of this much easier for us:

In [9]:
## requests has a method that parses a JSON response into a Python dictionary
resp = r_json.json()
print(resp)

{u'search-results': {u'opensearch:Query': {u'@searchTerms': u'TITLE-ABS-KEY(protein structure)', u'@role': u'request', u'@startPage': u'0'}, u'opensearch:itemsPerPage': u'25', u'opensearch:totalResults': u'854559', u'link': [{u'@href': u'http://api.elsevier.com/content/search/scopus?start=0&count=25&query=TITLE-ABS-KEY%28protein+structure%29&apikey=f4ef78418fdf4df42cdd4108dea1d803', u'@ref': u'self', u'@_fa': u'true', u'@type': u'application/json'}, {u'@href': u'http://api.elsevier.com/content/search/scopus?start=0&count=25&query=TITLE-ABS-KEY%28protein+structure%29&apikey=f4ef78418fdf4df42cdd4108dea1d803', u'@ref': u'first', u'@_fa': u'true', u'@type': u'application/json'}, {u'@href': u'http://api.elsevier.com/content/search/scopus?start=25&count=25&query=TITLE-ABS-KEY%28protein+structure%29&apikey=f4ef78418fdf4df42cdd4108dea1d803', u'@ref': u'next', u'@_fa': u'true', u'@type': u'application/json'}, {u'@href': u'http://api.elsevier.com/content/search/scopus?start=4975&count=25&query=T

In [10]:
## by default it will return 25 results.
print("Showing %s items."%len(resp['search-results']['entry']))
## use start index to iterate thru all results
total_count = resp['search-results']['opensearch:totalResults']
start_index = resp['search-results']['opensearch:startIndex']
print("A total number of %s results found in Scopus; Start index is %s."%(total_count, start_index))
## Use a dictionary-style to access the values.
print("\033[43m%s\033[0m"%resp['search-results']['entry'][0])

Showing 25 items.
A total number of 854559 results found in Scopus; Start index is 0.
{u'prism:volume': u'32', u'prism:issueIdentifier': u'1', u'dc:title': u'Investigation of PDE5/PDE6 and PDE5/PDE11 selective potent tadalafil-like PDE5 inhibitors using combination of molecular modeling approaches, molecular fingerprint-based virtual screening protocols and structure-based pharmacophore development', u'affiliation': [{u'affiliation-country': u'Turkey', u'affiliation-city': u'Istanbul', u'@_fa': u'true', u'affilname': u'Istanbul Teknik Universitesi'}, {u'affiliation-country': u'Italy', u'affiliation-city': u'Pisa', u'@_fa': u'true', u'affilname': u'Universita di Pisa'}], u'prism:publicationName': u'Journal of enzyme inhibition and medicinal chemistry', u'pubmed-id': u'28150511', u'dc:identifier': u'SCOPUS_ID:85013218391', u'subtypeDescription': u'Article', u'citedby-count': u'0', u'source-id': u'17605', u'prism:eIssn': u'14756374', u'link': [{u'@href': u'http://api.elsevier.com/con

Suppose we only want the titles. We can write a simple wrapper function for code reuse:

In [36]:
def search_scopus(key, query, index=0, verbose=True):
    '''
        Search Scopus database using key as api key, with query.
        
        Parameters
        ----------
        key : string
            Elsevier api key. Get it here: https://dev.elsevier.com/index.html
        query : string
            Search query. See more details here: http://api.elsevier.com/documentation/search/SCOPUSSearchTips.htm
        index : int
            Start index. Will be used in search_scopus_plus function
        verbose : bool
            Verbose mode.
        
        Returns
        -------
        pandas DataFrame
            id column stores scopus id and title column stores titles.
            When index == 0, a tuple (DataFrame, total_count) is returned instead.
    '''
    
    import requests
    par = {'apikey': key, 'query': query, 'start': index}
    r = requests.get(search_uri, params=par)
    resp = r.json()
    ## print out some summaries
    total_count = int(resp['search-results']['opensearch:totalResults'])
    start_index = int(resp['search-results']['opensearch:startIndex'])
    entries = resp['search-results']['entry']
    if verbose:
        print("Going to: %s"%r.url)
        print("A total number of %s results found in Scopus; Showing %d; Start index is %s."%\
              (total_count, len(entries), start_index))
        print("--------------------------------------")  

    # iterate thru each entry
    # You can also do this in a list comprehension with one line of code
    # (I did not test the following code but it should work with minimal modification)
    # id_title_list = [(entry['dc:identifier'].split(":")[-1], entry["dc:title"]) for entry in entries]
    id_title_list = []
    for entry in entries:
        scopus_id = entry['dc:identifier'].split(":")[-1]
        id_title_list.append((scopus_id, entry["dc:title"]))
    
    ## use pd.DataFrame to better store the results
    import pandas as pd
    result_df = pd.DataFrame(id_title_list, columns=["id", "title"])
    ## the first page also reports the total count, so callers can plan further requests
    if index == 0:
        return result_df, total_count
    else:
        return result_df
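The one-line list comprehension mentioned in the comments of `search_scopus` can be checked without an API key by running it on hand-made entries. The two sample entries below reuse IDs and shortened titles from the outputs in this notebook; `sample_entries` is a made-up name.

```python
# hand-made entries using the same keys as the Scopus response
sample_entries = [
    {"dc:identifier": "SCOPUS_ID:85013218391",
     "dc:title": "Investigation of PDE5/PDE6 and PDE5/PDE11 sele..."},
    {"dc:identifier": "SCOPUS_ID:85013156560",
     "dc:title": "Probing the druggability of membrane-bound Rab..."},
]

# keep only the part after the colon as the id, paired with the title
id_title_list = [(entry["dc:identifier"].split(":")[-1], entry["dc:title"])
                 for entry in sample_entries]
print(id_title_list)
```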

In [37]:
result_df, total_count = search_scopus(key, 'TITLE-ABS-KEY(gender in science)')
# use IPython's display functions to show data frames more nicely
from IPython.display import display, HTML
display(result_df)
# OR: HTML(result_df.to_html())

Going to: http://api.elsevier.com/content/search/scopus?query=TITLE-ABS-KEY%28gender+in+science%29&apikey=f4ef78418fdf4df42cdd4108dea1d803&start=0
A total number of 35036 results found in Scopus; Showing 25; Start index is 0.
--------------------------------------


| | id | title |
|---|---|---|
| 0 | 85008225532 | Prevalence of extra roots in permanent mandibu... |
| 1 | 85007350529 | Who votes for public health? U.S. senator char... |
| 2 | 85008462870 | Joint prediction of occurrence of heart block ... |
| 3 | 84997218225 | Academic social networks and communication res... |
| 4 | 85008716292 | Crafting a smartphone repurchase decision maki... |
| 5 | 85012303065 | The influence of affective cues on positive em... |
| 6 | 85010440966 | Students’ profiles of ICT use: Identification,... |
| 7 | 85013115224 | Development of a game-design workshop to promo... |
| 8 | 84964031944 | Transforming soccer to achieve solidarity: ‘Go... |
| 9 | 85008237791 | Gender difference and employees’ cybersecurity... |


## * Conclusion

Hopefully you now see how the two types of web data retrieval work and will be able to scrape your own data! Note that in our last example of scraping Scopus data, we only got 25 results. By simply varying the start index, we can retrieve the rest as well! The following is a revised version of ***`search_scopus`*** that retrieves as many entries as requested, for your reference; it is part of my code for ***`python-scopus`***.

In [38]:
def search_scopus_plus(key, query, number=200):
    '''
        Search Scopus database using key as api key, with query.
        Reuse function search_scopus(key, query, index, verbose)
        
        Parameters
        ----------
        key : string
            Elsevier api key. Get it here: https://dev.elsevier.com/index.html
        query : string
            Search query. See more details here: http://api.elsevier.com/documentation/search/SCOPUSSearchTips.htm
        number : int
            The number of entries to return. By default it is 200.
        
        Returns
        -------
        pandas DataFrame
            id column stores scopus id and title column stores titles
    '''
    
    results_df, total_count = search_scopus(key, query, verbose=False)
    if type(number) is not int or number < 1 or number > total_count:
        raise ValueError("%s is not a valid input for the number of entries to return." %number)

    # `start` is a result offset, not a page number, so each new request
    # begins right after the rows retrieved so far
    import pandas as pd
    while len(results_df) < number:
        page_df = search_scopus(key, query, len(results_df), False)
        if len(page_df) == 0:
            # no more entries available
            break
        results_df = pd.concat([results_df, page_df], ignore_index=True)
    return results_df[:number]
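The paging idea behind `search_scopus_plus` can be tried without an API key. The `next` link in the response printed earlier uses `start=25&count=25`, which suggests that `start` is an offset into the result list rather than a page number. The sketch below assumes exactly that; `collect_results`, `fake_fetch`, and the 60 fake results are all made up for illustration.

```python
def collect_results(fetch_page, number, page_size=25):
    """Gather `number` items by requesting successive offsets."""
    items = []
    start = 0
    while len(items) < number:
        page = fetch_page(start, page_size)
        if not page:
            # the server has nothing more to give
            break
        items.extend(page)
        # advance by what was actually received, so pages never overlap
        start += len(page)
    return items[:number]

# stand-in for a real request: pretends the server holds 60 results
FAKE_RESULTS = ["result-%d" % i for i in range(60)]

def fake_fetch(start, count):
    return FAKE_RESULTS[start:start + count]

first_40 = collect_results(fake_fetch, 40)
print(len(first_40), first_40[0], first_40[-1])
```

Replacing `fake_fetch` with a function that calls the real API (for example, a thin wrapper around `search_scopus`) gives the same loop over live data.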

In [39]:
result_df_20 = search_scopus_plus(key, 'TITLE-ABS-KEY(protein structure)',20)
display(result_df_20)

| | id | title |
|---|---|---|
| 0 | 85013218391 | Investigation of PDE5/PDE6 and PDE5/PDE11 sele... |
| 1 | 85013156560 | Probing the druggability of membrane-bound Rab... |
| 2 | 85013140426 | Novel multitarget-directed tacrine derivatives... |
| 3 | 85013452709 | Complementary function of two transketolase is... |
| 4 | 85009219497 | Two NADPH: Protochlorophyllide Oxidoreductase ... |
| 5 | 85009863222 | Characterization of LhSorP5CS, a gene catalyzi... |
| 6 | 85012992155 | Formation of concentrated biopolymer particles... |
| 7 | 85007137238 | Structure-guided rational design of red fluore... |
| 8 | 85013018555 | Molecular forces involved in heat-induced fres... |
| 9 | 85012986098 | Effect of Maillard induced glycation on protei... |


In [41]:
result_df_40 = search_scopus_plus(key, 'TITLE-ABS-KEY(gender in science)',40)
display(result_df_40)

| | id | title |
|---|---|---|
| 0 | 85008225532 | Prevalence of extra roots in permanent mandibu... |
| 1 | 85007350529 | Who votes for public health? U.S. senator char... |
| 2 | 85008462870 | Joint prediction of occurrence of heart block ... |
| 3 | 84997218225 | Academic social networks and communication res... |
| 4 | 85008716292 | Crafting a smartphone repurchase decision maki... |
| 5 | 85012303065 | The influence of affective cues on positive em... |
| 6 | 85010440966 | Students’ profiles of ICT use: Identification,... |
| 7 | 85013115224 | Development of a game-design workshop to promo... |
| 8 | 84964031944 | Transforming soccer to achieve solidarity: ‘Go... |
| 9 | 85008237791 | Gender difference and employees’ cybersecurity... |


In [43]:
result_df_too_many = search_scopus_plus(key, 'TITLE-ABS-KEY(gender in science)', 100000)

ValueError: 100000 is not a valid input for the number of entries to return.