# Tutorial 7 A: Web Scraping 

This part covers scraping data from the web. Web data is typically exposed as HTML or XML pages, or through an API. <font color='blue'>**Web scraping**</font> is the practice of using libraries to **sift** through a web page and **gather** the data that you need in a format most useful to you, while preserving the structure of the data.

There are several ways to extract information from the web. Using an <font color='red'>**API**</font> is probably **the best way to extract data from a website**. Almost all large websites, such as Twitter, Facebook, Google and StackOverflow, provide APIs to access their data in a more structured manner. If you can get what you need through an API, it is almost always the **preferred approach over web scraping**. However, **not all** websites provide an API. In that case, we need to **scrape the HTML website to fetch the information**.

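Python libraries needed in this tutorial include
* <font color='blue'>**urllib.request**</font> (standard library)
* <font color='blue'>**beautifulsoup4**</font> (third-party, imported as `bs4`)
* <font color='blue'>**re**</font> (standard library)

The `lxml` parser passed to `BeautifulSoup` in some cells below is also a third-party package.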

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

## Task 1 Extract a list of links on a Wikipedia page.

Instead of retrieving all the links existing in a Wikipedia article, we are interested in extracting links that point to other article pages. If you look at the source code of the following page 
```
https://en.wikipedia.org/wiki/Kevin_Bacon
```
in your browser, you will find that all these links have **three things** in common:
* They are in the `div` with `id` set to `bodyContent`
* The URLs do not contain colons
* The URLs begin with ***/wiki/***

We can use these rules to construct our search through the HTML page. 

First, use the `urlopen()` function to open the Wikipedia page for "Kevin Bacon":

In [2]:
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")

Then, find and print all the links. In order to finish this task, you need to
* find the `div` whose `id = "bodyContent"`
* find all the link tags whose `href` starts with "/wiki/" and does not contain ":". For example
```html
 see <a href="/wiki/Kevin_Bacon_(disambiguation)" class="mw-disambig" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>
```

**Hint**: <font color='blue'>**RegEx**</font> is needed.
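
As a minimal sketch of the hint, the last two rules (starts with `/wiki/`, contains no colon) can be written as a single regular expression and checked without touching the network. The `samples` list is a set of hypothetical `href` values chosen for illustration, not data taken from the page.

```python
import re

# Starts with /wiki/ and has no colon anywhere after that prefix.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values (assumptions for illustration only).
samples = [
    "/wiki/Philadelphia",
    "/wiki/Kevin_Bacon_(disambiguation)",
    "/wiki/Category:American_male_film_actors",
    "//wikimediafoundation.org/",
    "#cite_note-1",
]

# Keep only the hrefs that look like links to other articles.
matches = [href for href in samples if ARTICLE_LINK.match(href)]
print(matches)
```

The negative lookahead `(?!:)` is applied before every character, so a colon at any position after the prefix rejects the link.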

## Most-used methods in `BeautifulSoup`
To find the first matching result, use `find()`:
> `find(name, attrs, recursive, string, **kwargs)` 
> -- `name`: only consider tags with certain name  
> -- `attrs`: any number of attributes in a dict  
> -- `recursive`: examine all the descendants, default True  
> -- `string`: search for strings instead of tags, can be string, regex, list, func, True  
> -- `**kwargs`: Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes, **see Example**. 

Or find all matches and return them as a list with `find_all()` (legacy alias `findAll()`):
> `find_all(name, attrs, recursive, string, limit, **kwargs)`  
> -- `limit`: tells Beautiful Soup to stop gathering results after it’s found a certain number

**Example**: 
```Python
soup.findAll(id='someValue') #filter against each tag's 'id' attribute
soup.findAll(href=re.compile("pattern")) #use regex to filter against each tag's 'href'
soup.find(id=True) #if id attribute has a value
soup.find_all(attrs={"data-foo": "value"}) #put attributes in search into a dictionary
soup.find_all('a', limit=5) # return at most the first 5 matches
```
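
The calls above assume an existing `soup` object. As a self-contained sketch, the same filters can be tried on a small inline HTML fragment; the fragment and its attribute values are made up for illustration, and the standard-library `html.parser` backend is assumed so that no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment (an assumption for illustration only).
html = """
<div id="bodyContent">
  <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>
  <a href="/wiki/Category:Actors">Category</a>
  <a href="https://example.org/" data-foo="value">External</a>
  <p id="intro">No link here</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None); find_all() returns a list.
body = soup.find("div", {"id": "bodyContent"})
wiki_links = body.find_all("a", href=re.compile(r"^/wiki/"))
first_with_id = soup.find(id=True)                       # first tag that has an id
data_foo = soup.find_all(attrs={"data-foo": "value"})    # attribute name with a hyphen
limited = soup.find_all("a", limit=2)                    # stop after two matches

print([a["href"] for a in wiki_links])
print(first_with_id.name, data_foo[0].get_text(), len(limited))
```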

### Method Name Changes
The methods listed below are equivalent; each legacy (Beautiful Soup 3 style) name on the left maps to the current name on the right:  
`renderContents` >>> `encode_contents`  
`replaceWith` >>> `replace_with`  
`replaceWithChildren` >>> `unwrap`    
`findAll` >>> `find_all`  
`findAllNext` >>> `find_all_next`  
`findAllPrevious` >>> `find_all_previous`  
`findNext` >>> `find_next`  
`findNextSibling` >>> `find_next_sibling`  
`findNextSiblings` >>> `find_next_siblings`  
`findParent` >>> `find_parent`  
`findParents` >>> `find_parents`  
`findPrevious` >>> `find_previous`  
`findPreviousSibling` >>> `find_previous_sibling`  
`findPreviousSiblings` >>> `find_previous_siblings`  
`nextSibling` >>> `next_sibling`  
`previousSibling` >>> `previous_sibling`  

In [3]:
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "lxml")
ptn = "^(/wiki/)((?!:).)*$"
#print(bsObj.prettify())
#print(bsObj.title)
#print(bsObj.div) #the first div
#bsObj.find('div') # Or using find() method
#body = bsObj.find('body')
count = 0
for each in bsObj.find(id='bodyContent').findAll('a', href=re.compile(ptn)):
    if 'href' in each.attrs:
        count += 1
count

352

In [4]:
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
# using lxml parser
bsObj = BeautifulSoup(html, "lxml")
# using regex
ptn = "^(/wiki/)((?!:).)*$"
count=0
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile(ptn)):
    if 'href' in link.attrs:
        count+=1
        #print(link.attrs['href'])
count

352

## Task 2 Perform a random walk through a given webpage.
Assume that we want to find a random object in Wikipedia that is linked to "Kevin Bacon" through the so-called "<font color=red>**Six Degrees of Wikipedia**</font>". In other words, the task is to find **two subjects** linked by a chain containing **no more than six subjects** (including the two original subjects).

In [5]:
import datetime
import random
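# Seed with a float timestamp; recent Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())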

In [6]:
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile(ptn))

The random walk along the links proceeds as follows:
* Randomly choose a link from the list of retrieved links
* Print the article represented by the link
* Retrieve the list of links on that article
* Repeat the above steps until the number of retrieved articles reaches 5.

In [7]:
links = getLinks("/wiki/Kevin_Bacon")
count = 0
while len(links) > 0 and count < 5:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
    count = count + 1

/wiki/Primetime_Emmy_Awards
/wiki/Filmation
/wiki/Illumination_Entertainment
/wiki/Turner_Broadcasting_System
/wiki/Sony_Pictures_Studios
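
The same loop can be exercised offline, which is handy for checking the stopping logic without sending requests to Wikipedia. This is only a sketch: `LINK_GRAPH`, `get_links` and `random_walk` are hypothetical names, and the dictionary stands in for the pages that the real `getLinks()` would download.

```python
import random

# Hypothetical link graph standing in for Wikipedia (an assumption for illustration).
LINK_GRAPH = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": ["/wiki/A"],
}

def get_links(article_url):
    """Return the outgoing links of an article, or an empty list if unknown."""
    return LINK_GRAPH.get(article_url, [])

def random_walk(start, max_steps=5, rng=random):
    """Follow randomly chosen links until max_steps articles are visited or a dead end is hit."""
    path = []
    links = get_links(start)
    while links and len(path) < max_steps:
        article = rng.choice(links)   # pick one link uniformly at random
        path.append(article)
        links = get_links(article)    # continue from the chosen article
    return path

# A dedicated Random instance with a fixed seed makes the walk repeatable.
walk = random_walk("/wiki/A", max_steps=5, rng=random.Random(0))
print(walk)
```

Passing the random generator in as a parameter keeps the walk reproducible in tests while leaving the global `random` state untouched.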


## Task 3 Crawl the Entire Wikipedia website

The general approach to an exhaustive site crawl is to start with the root, i.e., the home page of a website. Here, we will start with
```
https://en.wikipedia.org/
```
by retrieving all the links that appear on the home page, and then traversing each link recursively. However, the number of links is going to be very large, and a link can appear in many Wikipedia articles. Thus, we need to avoid repeatedly crawling the same article or page. To do so, we keep a running set of visited pages for easy lookups and slightly update the `getLinks()` function.

In [8]:
pages = set()

Note: add a terminating condition in your code, for example,
```python
    len(pages) < 10
```
Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.

In [9]:
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    ptn = "^(/wiki/)"
    bsObj = BeautifulSoup(html, "html.parser")
    limit=10
    for link in bsObj.findAll("a", href=re.compile(ptn)):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < limit:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

In [10]:
getLinks("")

----------------
/wiki/Wikipedia
----------------
/wiki/Wikipedia:Protection_policy#semi
----------------
/wiki/Wikipedia:Requests_for_page_protection
----------------
/wiki/Wikipedia:Requests_for_permissions
----------------
/wiki/Wikipedia:Requesting_copyright_permission
----------------
/wiki/Wikipedia:User_access_levels
----------------
/wiki/Wikipedia:Requests_for_adminship
----------------
/wiki/Wikipedia:Protection_policy#extended
----------------
/wiki/Wikipedia:Lists_of_protected_pages
----------------
/wiki/Wikipedia:Protection_policy
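
The recursive `getLinks()` adds one stack frame per newly discovered page, and Python's default recursion limit is 1000 frames, so a much larger `limit` could eventually fail. As a sketch of an alternative (an iterative breadth-first variant, not the approach used in this tutorial), the same de-duplication idea works with a queue. The `SITE` dictionary and the `crawl` function are hypothetical stand-ins for real page downloads.

```python
from collections import deque

# Hypothetical site map: page -> links found on that page (assumption for illustration).
SITE = {
    "": ["/wiki/Wikipedia", "/wiki/Main_Page"],
    "/wiki/Wikipedia": ["/wiki/Main_Page", "/wiki/Encyclopedia"],
    "/wiki/Main_Page": ["/wiki/Wikipedia"],
    "/wiki/Encyclopedia": ["/wiki/Wikipedia", "/wiki/Knowledge"],
    "/wiki/Knowledge": [],
}

def crawl(start, limit=10):
    """Visit pages breadth-first, recording each new link once, up to `limit` links."""
    visited = set()          # fast membership test to avoid repeats
    order = []               # discovery order, kept for display
    queue = deque([start])
    while queue and len(visited) < limit:
        page = queue.popleft()
        for link in SITE.get(page, []):
            if link not in visited and len(visited) < limit:
                visited.add(link)
                order.append(link)
                queue.append(link)
    return order

print(crawl("", limit=3))
```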


## Task 4 Collect data across the Wikipedia site
One purpose of traversing all the links is to **extract data**. The best practice is to look at a few pages from the site and determine the patterns. By looking at a handful of Wikipedia pages, both **article** and **non-article** pages, the following patterns can be identified:
* All titles are under `h1` tags, and these are the only `h1` tags on the page. For example,
```html
<h1 id="firstHeading" class="firstHeading" lang="en">Kevin Bacon</h1>
```
```html
<h1 id="firstHeading" class="firstHeading" lang="en">Main Page</h1>	
```
* All body text lives under the `div#bodyContent` tag. However, if we want to get more specific and access just the first paragraph of text, we might be better off using `div#mw-content-text -> p`.
* **Edit** links occur only on article pages. If they occur, they will be found in the `li#ca-edit` tag, under `li#ca-edit -> span -> a`
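
The three patterns above can be tried offline on a small inline fragment. This is a sketch under assumptions: the HTML below is a hand-written imitation of a Wikipedia page, `extract_page_info` is a hypothetical helper, and CSS selectors via `select_one()` are used because they mirror the `div#mw-content-text -> p` notation.

```python
from bs4 import BeautifulSoup

# Hand-written imitation of a Wikipedia page (an assumption for illustration).
html = """
<html><body>
<h1 id="firstHeading" class="firstHeading" lang="en">Kevin Bacon</h1>
<ul><li id="ca-edit"><span><a href="/w/index.php?title=Kevin_Bacon&amp;action=edit">Edit</a></span></li></ul>
<div id="bodyContent"><div id="mw-content-text">
<p>First paragraph.</p><p>Second paragraph.</p>
</div></div>
</body></html>
"""

def extract_page_info(html_text):
    """Return title, first paragraph and edit link; missing parts become None."""
    soup = BeautifulSoup(html_text, "html.parser")
    title = soup.select_one("h1")
    first_p = soup.select_one("div#mw-content-text p")
    edit = soup.select_one("li#ca-edit span a")
    return {
        "title": title.get_text() if title else None,
        "first_paragraph": first_p.get_text() if first_p else None,
        "edit_link": edit["href"] if edit else None,
    }

info = extract_page_info(html)
print(info)
```

Returning `None` for a missing element is one design choice; catching `AttributeError` around chained `find()` calls is another way to handle pages that lack an edit link.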

Now, the task is to further modify the `getLinks()` function to print the title, the first paragraph and the edit link. The content from each page should be separated by 
```python
print("----------------\n"+newPage)
```

In [11]:
pages = set()

##### Please also add a terminating condition in your code, for example,
```python
    len(pages) < 5
```
Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.

In [12]:
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id ="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < 5:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

In [13]:
getLinks("") 

Main Page
<p><i><b><a href="/wiki/Banksia_aculeata" title="Banksia aculeata">Banksia aculeata</a></b></i>, the prickly banksia, is a plant of the family <a href="/wiki/Proteaceae" title="Proteaceae">Proteaceae</a> native to the <a href="/wiki/Stirling_Range" title="Stirling Range">Stirling Range</a> in the <a class="mw-redirect" href="/wiki/Southwest_Australia" title="Southwest Australia">southwest</a> of <a href="/wiki/Western_Australia" title="Western Australia">Western Australia</a>. A bushy shrub up to 2 m (7 ft) tall, it has fissured grey bark on its trunk and branches, and dense foliage and leaves with very prickly <a href="/wiki/Serration" title="Serration">serrated</a> margins. Its unusual pinkish, <a href="/wiki/Pendent" title="Pendent">pendent</a> (hanging) flower spikes, known as <a href="/wiki/Inflorescence" title="Inflorescence">inflorescences</a>, are generally hidden in the foliage and appear during the early summer. Unlike many other banksia species, it does not have a 

Wikipedia:Requests for permissions
<p><span class="sysop-show" id="coordinates"><a href="/wiki/Wikipedia:Requests_for_permissions/Administrator_instructions" title="Wikipedia:Requests for permissions/Administrator instructions">Administrator instructions</a></span></p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia:Requesting_copyright_permission
Wikipedia:Requesting copyright permission
<p>To use copyrighted material on Wikipedia, it is <i>not enough</i> that we have permission to use it on Wikipedia alone. That's because Wikipedia itself states all its material may be used by anyone, for any purpose. So we have to be sure all material is in fact licensed for that purpose, whoever provided it.</p>
This page is missing something! No worries though!


*** 
## Task 5 API access 
In addition to **HTML** format, data is commonly found on the web through public <font color='red'>**API**</font>s. We use the `requests` package (http://docs.python-requests.org) to call <font color='red'>**API**</font>s using Python. In the following example, we call a public <font color='red'>**API**</font> for collecting weather data. 


**You need to sign up for a free account to get your unique API key to use in the following code. Register at** http://api.openweathermap.org

In [24]:
# Use requests to retrieve the data from the API
import requests
# Write your own API key (APPID) here
APIkey = '087a4e7f39e92c1587403a2a526b572d'
url = 'http://api.openweathermap.org/data/2.5/forecast?id=524901&APPID=' + APIkey
response = requests.get(url)

The response object contains the result of the GET request. A successful request has a status code of 200. We then need to parse the response body as JSON to extract the information.

In [25]:
#Check the HTTP status code https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
print (response.status_code)

200
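
The query string in the request URL (`id=...&APPID=...`) can also be assembled from a dictionary instead of string concatenation, which takes care of escaping. The sketch below uses only the standard library; `build_forecast_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder, not a working key.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/forecast"

def build_forecast_url(city_id, api_key):
    """Build the forecast URL from its query parameters."""
    query = urlencode({"id": city_id, "APPID": api_key})
    return f"{BASE_URL}?{query}"

# 524901 is the city id used in this tutorial; the key is a placeholder.
url = build_forecast_url(524901, "YOUR_API_KEY")
print(url)
```

With `requests`, the same effect is available by passing the dictionary through the `params` argument of `requests.get()`.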


```
{
"cod":"200",
"message":0.0036,
"cnt":40,
"list":[
        {"dt":1485799200,
         "main":{
            "temp":261.45,
            "temp_min":259.086,
            "temp_max":261.45,
            "pressure":1023.48,
            "sea_level":1045.39,
            "grnd_level":1023.48,
            "humidity":79,
            "temp_kf":2.37},
         "weather":[{"id":800,"main":"Clear","description":"clear sky","icon":"02n"}],
         "clouds":{"all":8},
         "wind":{"speed":4.77,"deg":232.505},
         "snow":{},
         "sys":{"pod":"n"},
         "dt_txt":"2017-01-30 18:00:00"},
        {"dt":1485810000,
         "main": ...}
         ...
```

In [31]:
# response.content holds the raw response body as bytes
print (type(response.content))

<class 'bytes'>


In [32]:
# response.json() parses the JSON body into Python objects (here, a dict)
data = response.json()
print (type(data))

<class 'dict'>
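
The step from bytes to a dictionary can be reproduced with the standard `json` module, which helps when testing parsing code without calling the API. This is a rough sketch, not how `requests` is implemented internally: `parse_body` is a hypothetical helper and `sample_body` is a shortened, hand-written imitation of a forecast response.

```python
import json

def parse_body(status_code, body):
    """Decode a JSON body only if the request succeeded (HTTP 200)."""
    if status_code != 200:
        raise RuntimeError(f"request failed with HTTP status {status_code}")
    return json.loads(body)   # json.loads accepts bytes as well as str

# Shortened, hand-written stand-in for response.content (assumption for illustration).
sample_body = b'{"cod": "200", "cnt": 1, "list": [{"dt_txt": "2018-05-13 09:00:00"}]}'

data = parse_body(200, sample_body)
print(type(data), list(data.keys()))
```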


In [33]:
data.keys()

dict_keys(['cod', 'message', 'cnt', 'list', 'city'])

In [34]:
data

{'city': {'coord': {'lat': 55.7522, 'lon': 37.6156},
  'country': 'RU',
  'id': 524901,
  'name': 'Moscow'},
 'cnt': 37,
 'cod': '200',
 'list': [{'clouds': {'all': 0},
   'dt': 1526202000,
   'dt_txt': '2018-05-13 09:00:00',
   'main': {'grnd_level': 1016.14,
    'humidity': 36,
    'pressure': 1016.14,
    'sea_level': 1035.63,
    'temp': 297.42,
    'temp_kf': 0.62,
    'temp_max': 297.42,
    'temp_min': 296.803},
   'sys': {'pod': 'd'},
   'weather': [{'description': 'clear sky',
     'icon': '01d',
     'id': 800,
     'main': 'Clear'}],
   'wind': {'deg': 74.0031, 'speed': 2.67}},
  {'clouds': {'all': 0},
   'dt': 1526212800,
   'dt_txt': '2018-05-13 12:00:00',
   'main': {'grnd_level': 1015.7,
    'humidity': 30,
    'pressure': 1015.7,
    'sea_level': 1035.19,
    'temp': 298.28,
    'temp_kf': 0.46,
    'temp_max': 298.28,
    'temp_min': 297.82},
   'sys': {'pod': 'd'},
   'weather': [{'description': 'clear sky',
     'icon': '01d',
     'id': 800,
     'main': 'Clear'}],


The keys explain the structure of the fetched data. Try displaying the value of each key. In this example, the weather information is stored under the `'list'` key.

In [35]:
data['list'][15]

{'clouds': {'all': 0},
 'dt': 1526364000,
 'dt_txt': '2018-05-15 06:00:00',
 'main': {'grnd_level': 1014.46,
  'humidity': 36,
  'pressure': 1014.46,
  'sea_level': 1033.95,
  'temp': 297.216,
  'temp_kf': 0,
  'temp_max': 297.216,
  'temp_min': 297.216},
 'sys': {'pod': 'd'},
 'weather': [{'description': 'clear sky',
   'icon': '01d',
   'id': 800,
   'main': 'Clear'}],
 'wind': {'deg': 247.5, 'speed': 1.46}}

The next step is to create a `DataFrame` with the weather information, as demonstrated below. You can select a subset to display, or display the entire data.

In [37]:
from pandas import DataFrame
# data with the default column headers
weather_table_all= DataFrame(data['list'])

In [38]:
weather_table_all.head()

| | clouds | dt | dt_txt | main | rain | sys | weather | wind |
|---|---|---|---|---|---|---|---|---|
| 0 | {'all': 0} | 1526202000 | 2018-05-13 09:00:00 | {'temp': 297.42, 'temp_min': 296.803, 'temp_ma... | | {'pod': 'd'} | [{'id': 800, 'main': 'Clear', 'description': '... | {'speed': 2.67, 'deg': 74.0031} |
| 1 | {'all': 0} | 1526212800 | 2018-05-13 12:00:00 | {'temp': 298.28, 'temp_min': 297.82, 'temp_max... | | {'pod': 'd'} | [{'id': 800, 'main': 'Clear', 'description': '... | {'speed': 2.92, 'deg': 86.5007} |
| 2 | {'all': 0} | 1526223600 | 2018-05-13 15:00:00 | {'temp': 297.77, 'temp_min': 297.46, 'temp_max... | | {'pod': 'd'} | [{'id': 800, 'main': 'Clear', 'description': '... | {'speed': 2.61, 'deg': 87.5008} |
| 3 | {'all': 0} | 1526234400 | 2018-05-13 18:00:00 | {'temp': 292.23, 'temp_min': 292.079, 'temp_ma... | | {'pod': 'n'} | [{'id': 800, 'main': 'Clear', 'description': '... | {'speed': 2.77, 'deg': 55.0002} |
| 4 | {'all': 0} | 1526245200 | 2018-05-13 21:00:00 | {'temp': 288.266, 'temp_min': 288.266, 'temp_m... | | {'pod': 'n'} | [{'id': 800, 'main': 'Clear', 'description': '... | {'speed': 2.16, 'deg': 70.003} |


### Discussion: 

Further parsing is still required to get the table (`DataFrame`) into a flat shape. Now it's your turn: parse the weather data to generate a table.
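
As one possible starting point (a sketch, not the expected solution), each nested forecast entry can be turned into a single-level dictionary before it is handed to `DataFrame`. `flatten_record` is a hypothetical helper, the column naming scheme (`section_key`) is an assumption, and `sample` is a trimmed copy of the entry shown earlier.

```python
def flatten_record(record):
    """Flatten one forecast entry into a single-level dict."""
    flat = {"dt": record["dt"], "dt_txt": record["dt_txt"]}
    # Nested dicts become prefixed columns, e.g. main_temp, wind_speed.
    for section in ("main", "wind", "clouds"):
        for key, value in record.get(section, {}).items():
            flat[f"{section}_{key}"] = value
    # 'weather' is a list of dicts; keep the first description if present.
    weather = record.get("weather") or [{}]
    flat["weather_description"] = weather[0].get("description")
    return flat

# Trimmed copy of one entry from data['list'] shown earlier.
sample = {
    "clouds": {"all": 0},
    "dt": 1526364000,
    "dt_txt": "2018-05-15 06:00:00",
    "main": {"humidity": 36, "pressure": 1014.46, "temp": 297.216},
    "weather": [{"description": "clear sky", "icon": "01d", "id": 800, "main": "Clear"}],
    "wind": {"deg": 247.5, "speed": 1.46},
}

rows = [flatten_record(sample)]
print(rows[0])
```

A list of such flat rows can be passed directly to `DataFrame(rows)`; `pandas.json_normalize` is another option worth checking in the pandas documentation.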

*Please note that some materials used in this tutorial are partially from the book "Web Scraping with Python"*