# Networked Programs II

## Retrieving web pages with urllib

In 9.2 Interest, I sent and received data over HTTP manually using the socket module. The code was not intuitive because of the HTTP protocol details, and parsing the response needed extra steps to clean up the data. The standard library package urllib makes it easier to retrieve data from the web. Because urllib treats a web page much like a file, it takes care of the HTTP protocol and strips out the header details. Code that uses urllib is much simpler and easier to read.


In [37]:
import urllib.request as request

res = request.urlopen('http://data.pr4e.org/romeo.txt')
for line in res:
    print(line)

b'But soft what light through yonder window breaks\n'
b'It is the east and Juliet is the sun\n'
b'Arise fair sun and kill the envious moon\n'
b'Who is already sick and pale with grief\n'


We can use the read() method to retrieve the whole response at once. The result is a single bytes object (note the `b` prefix), not a str.

In [38]:
res = request.urlopen('http://data.pr4e.org/romeo.txt')
print(res.read())

b'But soft what light through yonder window breaks\nIt is the east and Juliet is the sun\nArise fair sun and kill the envious moon\nWho is already sick and pale with grief\n'
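
Because read() returns bytes, the body usually has to be decoded before it can be used as text. The sketch below is one possible way to do that, not the only one: `fetch_text` and its `default_encoding` parameter are hypothetical names introduced here, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding='utf-8'):
    """Return the body at url as a str (hypothetical helper)."""
    # The with-statement closes the connection when the block ends.
    with urlopen(url) as res:
        # Use the charset from the Content-Type header if one is given;
        # otherwise fall back to default_encoding (an assumption).
        charset = res.headers.get_content_charset() or default_encoding
        return res.read().decode(charset)
```

Under these assumptions, calling `fetch_text('http://data.pr4e.org/romeo.txt')` should return the same four lines as above, but as one str instead of a bytes object.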


We can use the readline() method to read one line at a time. Each call to readline() returns the next line. Once all lines have been returned, every further call returns an empty bytes object (`b''`).

In [39]:
res = request.urlopen('http://data.pr4e.org/romeo.txt')
print(res.readline())
print(res.readline())
print(res.readline())
print(res.readline())
print(res.readline())

b'But soft what light through yonder window breaks\n'
b'It is the east and Juliet is the sun\n'
b'Arise fair sun and kill the envious moon\n'
b'Who is already sick and pale with grief\n'
b''
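
The empty result can serve as a stop signal, so we do not need to know the number of lines in advance. The following is a minimal sketch of that idea. `iter_lines` is a hypothetical helper, and the `io.BytesIO` object only stands in for the response returned by urlopen() so that the example runs without a network connection.

```python
import io


def iter_lines(res):
    """Yield lines from a file-like object until readline() returns b''."""
    while True:
        line = res.readline()
        if line == b'':
            # An empty bytes object means there is nothing left to read.
            break
        yield line


# Stand-in for a real response object (assumption for this sketch).
fake_response = io.BytesIO(b'But soft what light\nIt is the east\n')
for line in iter_lines(fake_response):
    print(line)
```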


We can use the readlines() method to retrieve all lines at once as a `list` of bytes objects.

In [40]:
import urllib.request as request
res = request.urlopen('http://data.pr4e.org/romeo.txt')
print(res.readlines())


[b'But soft what light through yonder window breaks\n', b'It is the east and Juliet is the sun\n', b'Arise fair sun and kill the envious moon\n', b'Who is already sick and pale with grief\n']


To work with text instead of bytes, we can call the decode() method on each line to convert it to a str. The strip() method removes the trailing newline, which would otherwise print as a blank line between the lines.

In [41]:
import urllib.request as request
res = request.urlopen('http://data.pr4e.org/romeo.txt')
for line in res:
    print(line.decode().strip())


But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
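
Once each line is a str, the usual string methods apply. As a rough illustration, the sketch below counts words in the decoded lines. `count_words` is a hypothetical helper, UTF-8 is an assumed encoding, and the `sample` list stands in for the response so that the example runs offline.

```python
from collections import Counter


def count_words(lines, encoding='utf-8'):
    """Count whitespace-separated words in an iterable of bytes lines."""
    counts = Counter()
    for raw in lines:
        # Decode bytes to str, then drop the trailing newline.
        text = raw.decode(encoding).strip()
        counts.update(text.split())
    return counts


# Stand-in for the lines of a real response (assumption for this sketch).
sample = [
    b'But soft what light through yonder window breaks\n',
    b'It is the east and Juliet is the sun\n',
]
counts = count_words(sample)
print(counts['is'], counts['the'], counts['Juliet'])
```

Since a response object can be iterated line by line, `count_words(res)` should work the same way on the result of urlopen().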


## Reading binary files using urllib

Similarly, urllib can be used to retrieve binary files such as JPG images and other image or video files. We can open the URL, retrieve the data using the read() method, and write it to a file opened in binary mode.

In [42]:
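import urllib.request as request

# Download the image as bytes; the with-statement closes the connection
with request.urlopen('http://data.pr4e.org/cover3.jpg') as res:
    image = res.read()

# 'wb' opens the file for writing in binary mode and closes it afterwards
with open('cover3.jpg', 'wb') as file:
    file.write(image)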
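
Calling read() without an argument holds the whole file in memory, which may be a problem for large files. One common alternative is to copy the data in fixed-size chunks. The sketch below assumes a hypothetical helper named `download` and an arbitrary chunk size of 100,000 bytes.

```python
from urllib.request import urlopen


def download(url, dest_path, chunk_size=100_000):
    """Copy the resource at url to dest_path; return the bytes written."""
    total = 0
    with urlopen(url) as res, open(dest_path, 'wb') as out:
        while True:
            # read(n) returns at most n bytes, and b'' at the end.
            chunk = res.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

For example, `download('http://data.pr4e.org/cover3.jpg', 'cover3.jpg')` should save the same image as the cell above while keeping only one chunk in memory at a time.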


## Web Scraping

Basically, a web scraper performs the following tasks:

- Retrieving HTML data from a domain name
- Parsing that data for target information
- Storing the target information
- Moving to another web page to repeat the process 
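
The four tasks above can be sketched as a small loop. This is only an illustration under several assumptions: it uses the standard library's `html.parser` for parsing, treats the page title as the "target information", stores results in a dict, and all names (`LinkAndTitleParser`, `fetch_html`, `scrape`, `max_pages`) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Collect the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)
        elif tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Task 1: retrieve the HTML of a page as text (UTF-8 assumed)."""
    with urlopen(url) as res:
        return res.read().decode('utf-8', errors='replace')


def scrape(start_url, fetch=fetch_html, max_pages=5):
    """Visit up to max_pages pages and return {url: title}."""
    results = {}
    queue = [start_url]
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in results:
            continue  # already visited
        parser = LinkAndTitleParser()
        parser.feed(fetch(url))              # task 1: retrieve
        parser.close()                       # task 2: parse
        results[url] = parser.title.strip()  # task 3: store
        # Task 4: queue the linked pages, resolving relative links.
        queue.extend(urljoin(url, link) for link in parser.links)
    return results
```

The `fetch` parameter exists so that the loop can be tried on canned pages before pointing it at a real site. A real scraper would also need to respect the site's terms and rate limits.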

Instead of retrieving files, we can download a web page itself as HTML.

In [43]:
from urllib.request import urlopen
web_link = 'https://openweathermap.org/'
html = urlopen(web_link)
print(html.read(50))

b"<!DOCTYPE html>\n<html lang='en'>\n    <head>\n      "


To parse data inside HTML tags, we can use BeautifulSoup. The beautifulsoup4 library provides useful tools for parsing HTML documents. For example, we can access a tag as an attribute of a BeautifulSoup object, as shown below.


In [44]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

web_link = 'https://openweathermap.org/'
html = urlopen(web_link)
# Name the parser explicitly so the result does not depend on what is installed
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.div.form)


<form action="/find" class="pull-right hidden" id="nav-search-form" method="get" role="search">
<div class="input-group">
<input class="form-control" id="q" name="q" placeholder="Search" type="text"/>
<span class="input-group-btn">
<button class="btn btn-default" type="submit"><i class="fa fa-search"></i></button>
</span>
</div>
</form>
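
Attribute access such as `bsObj.div.form` returns only the first matching tag. To search by tag name or attribute, the `find()` and `find_all()` methods can be used. The sketch below runs on a small inline HTML string instead of the live page. The markup is a simplified, made-up sample loosely based on the form above, and the `/city/...` links are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (assumption); not the real page source.
html = """
<html><body>
<form action="/find" id="nav-search-form" method="get">
<input id="q" name="q" placeholder="Search" type="text"/>
<button type="submit">Go</button>
</form>
<a href="/city/1">First</a>
<a href="/city/2">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first tag that matches the name and attributes.
form = soup.find('form', id='nav-search-form')
action = form.get('action')

# find_all() returns every matching tag as a list.
field_names = [tag.get('name') for tag in form.find_all('input')]
links = [a['href'] for a in soup.find_all('a')]

print(action, field_names, links)
```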


## References

Severance, C. R. (2009). Python for Everybody. http://do1.dr-chuck.com/pythonlearn/EN_us/pythonlearn.pdf  
Mitchell, R. (2015). Web Scraping with Python. Sebastopol, CA: O’Reilly Media, Inc.  
https://www.w3schools.com/python/default.asp  
