## The ```requests``` module

We first demonstrate how to use the Python module ```requests``` to send a ```GET``` request to a web server. The same call can be used to make API calls or to download the ```html``` code of a web page; here, we use it for the latter. Note that ```requests``` is a third-party package rather than part of the Python standard library: it ships with distributions such as Anaconda and can otherwise be installed with ```pip install requests```.
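
To make the idea of a ```GET``` request concrete, here is a minimal sketch that builds one without sending it, so it runs without network access. The host ```api.example.org``` and the parameters ```q``` and ```page``` are hypothetical placeholders, not a real API.

```python
import requests

# Hypothetical endpoint and query parameters, used only for illustration.
url = "https://api.example.org/search"
params = {"q": "moodle", "page": 1}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request("GET", url, params=params).prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://api.example.org/search?q=moodle&page=1
```

Calling ```requests.get(url, params=params)``` builds the same request and sends it in one step.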

## Web scraping

There is a vast number of web pages written in ```HTML```, which makes the web a rich source of unstructured, textual data for NLP. Fetching an html page is straightforward. For example, to get the FAQ page at https://moodle.yorku.ca/students/faq/index.html, we call the ```get``` function of the ```requests``` module, which sends a ```GET``` request.

In [1]:
import requests
html = requests.get('https://moodle.yorku.ca/students/faq/index.html')
from bs4 import BeautifulSoup
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.title)

<title>Moodle</title>


The difficult part is extracting useful information from the fetched html files. ```HTML``` tags can be used to guide the extraction: study a few target pages, locate the useful information, recognize the pattern of tags immediately before and after it, and write a Python program that extracts whatever matches that pattern. Fortunately, there are Python modules that help with scraping html pages, for example ```BeautifulSoup``` and ```Scrapy```. Interested readers should consult the documentation of these modules.
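
To illustrate what tag-pattern extraction can look like without any helper library, here is a rough sketch that uses the standard-library ```html.parser``` module (a swapped-in technique, not ```BeautifulSoup``` or ```Scrapy```). The markup in ```SAMPLE_HTML``` and the class name ```CellLinkParser``` are hypothetical; the pattern it looks for is "a link inside a table cell".

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for a fetched page.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/people/ada">Ada Lovelace</a></td><td>Mathematics</td></tr>
  <tr><td><a href="/people/alan">Alan Turing</a></td><td>Computing</td></tr>
</table>
<p><a href="/contact">Contact us</a></p>
"""


class CellLinkParser(HTMLParser):
    """Collect (text, href) pairs for links that sit inside <td> cells."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0     # greater than 0 while inside a <td>
        self.in_link = False  # True while inside an <a> within a <td>
        self.href = None
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        # Text between <a> and </a> is the visible link text.
        if self.in_link:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append(("".join(self.text_parts).strip(), self.href))
            self.in_link = False
        elif tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


parser = CellLinkParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.links)
```

The "Contact us" link is skipped because it is not inside a table cell. Keeping track of this state by hand is exactly the bookkeeping that the scraping modules take over.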

In [1]:
import requests

# get the html
resp = requests.get('https://www.ola.org/en/members/current/ministers')

# check what the response contains
type(resp)
print(dir(resp))

print(resp.status_code)
# HTTP status codes:
# https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

print(resp.text)
# This is the html source
html_code = resp.text

# save it so that it can be viewed in an editor
with open('resp.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
200
<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.o

Open the saved file in a text editor and take a look at the html code.
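
The status code printed by the cell above, 200, signals success. As a small sketch of how to interpret other codes, the helper below (a hypothetical function named ```describe_status```) looks up the standard phrase with the standard-library ```http.HTTPStatus``` enum and adds the broad category given by the first digit.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not one of the codes known to the standard library.
        phrase = "Unknown"

    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 500):
    print(describe_status(code))
```

On a ```requests``` response, ```resp.ok``` is ```True``` when the status code is below 400, and ```resp.raise_for_status()``` raises an exception for 4xx and 5xx codes; both names appear in the ```dir(resp)``` listing above.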

In [2]:
from bs4 import BeautifulSoup
import csv

# name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(html_code, 'html.parser')

help(soup)

help(soup.find_all)

# Find all links
links = soup.find_all('a')
for link in links:
    print(link)


Help on BeautifulSoup in module bs4 object:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(n

Open the html file with a text editor and search for a member's name, for example "Calandra". You should see that all member names appear as the text of links (```<a>``` tags) inside table cells (```<td>```).

**Task**:
Use ```soup.find_all()``` to find all table cells, and within those cells, print all links.

**Question 1**: Does this extract all member names?

**Question 2**: Does this extract something that we do not want?

In [None]:
tables = soup.find_all('td')
for td in tables:
    links = td.find_all('a')
    for link in links:
        print(link)
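
When thinking about the two questions, it can help to experiment on a small piece of markup first. The sketch below uses a hypothetical snippet (not copied from the real page) in which table cells hold two kinds of links, and narrows the selection by testing the ```href``` prefix. The prefix ```/members/``` and all names in ```sample``` are assumptions made for illustration; the real page may need a different test.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the real page.
sample = """
<table>
  <tr>
    <td><a href="/members/jane-doe">Jane Doe</a></td>
    <td><a href="/ridings/springfield">Springfield</a></td>
  </tr>
  <tr>
    <td><a href="/members/john-roe">John Roe</a></td>
    <td><a href="/ridings/shelbyville">Shelbyville</a></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample, "html.parser")

# Every link inside a table cell, whatever it points to.
all_links = [
    a.get_text(strip=True)
    for td in sample_soup.find_all("td")
    for a in td.find_all("a")
]

# Keep only links whose href starts with the (assumed) member prefix.
member_links = [
    a.get_text(strip=True)
    for td in sample_soup.find_all("td")
    for a in td.find_all("a", href=True)
    if a["href"].startswith("/members/")
]

print(all_links)
print(member_links)
```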

**Task**: Save name and link to a csv file.

In [4]:
resp = requests.get('https://www.ola.org/en/members/current/ministers')
soup = BeautifulSoup(resp.text, 'html.parser')

In [5]:
# open a new file; the with block closes it automatically, and
# newline='' is what the csv module expects for files it writes to
with open("LAO_members.csv", 'w', newline='') as f:
    # start a csv writer object:
    writer = csv.writer(f, delimiter=',')
    # write a row, this writes the header
    writer.writerow(["Name", "Link"])

    # you can extract the link text with link.get_text()
    # (link.contents[0] also works when the link holds only text)
    # and you can get the actual link href with link.get('href')
    tables = soup.find_all('td')
    for td in tables:
        links = td.find_all('a')
        for link in links:
            name = link.get_text(strip=True)
            personal_url = link.get('href')
            writer.writerow([name, personal_url])
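
The ```href``` values written to the CSV may be relative paths rather than full URLs; whether that is the case for this page is an assumption to check in the saved file. If it is, ```urllib.parse.urljoin``` from the standard library can resolve them against the page address. The rows below are hypothetical and only mimic the Name/Link layout used above.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://www.ola.org/en/members/current/ministers"

# Hypothetical rows in the same Name/Link layout as the CSV above.
rows = [
    ("Jane Doe", "/en/members/all/jane-doe"),
    ("John Roe", "https://example.org/john-roe"),
]

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
demo_writer = csv.writer(buffer, lineterminator="\n")
demo_writer.writerow(["Name", "Link"])
for name, href in rows:
    # urljoin resolves relative paths and leaves absolute URLs untouched.
    demo_writer.writerow([name, urljoin(BASE_URL, href)])

print(buffer.getvalue())
```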