# Ch. 12 Ex. Networked Programs

## Notes From Chapter

### Hypertext Transfer Protocol

### World's simplest web browser

In [6]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# make a connection to port 80 on the server data.pr4e.org
mysock.connect(('data.pr4e.org',80))

# send the GET command followed by a blank line
# \r\n\r\n signifies nothing between two EOL sequences.
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
    
mysock.close()

HTTP/1.1 200 OK
Date: Sat, 30 Sep 2023 22:16:40 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


### Getting images over HTTP

In [9]:
import socket
import time

In [12]:
HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    #time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

5120 5120
5120 10240
5120 15360
2160 17520
5120 22640
720 23360
5120 28480
720 29200
5120 34320
720 35040
5120 40160
720 40880
5120 46000
720 46720
5120 51840
720 52560
5120 57680
720 58400
5120 63520
720 64240
5120 69360
720 70080
5120 75200
720 75920
5120 81040
720 81760
5120 86880
720 87600
5120 92720
720 93440
5120 98560
720 99280
5120 104400
720 105120
5120 110240
720 110960
5120 116080
5120 121200
1440 122640
5120 127760
2180 129940
5120 135060
3640 138700
5120 143820
720 144540
5120 149660
5120 154780
1440 156220
5120 161340
720 162060
5120 167180
5100 172280
5120 177400
5120 182520
5120 187640
2160 189800
5120 194920
5120 200040
5120 205160
3620 208780
5120 213900
720 214620
5120 219740
5120 224860
1440 226300
4308 230608
Header length 394
HTTP/1.1 200 OK
Date: Sat, 30 Sep 2023 22:27:01 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-a
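
The header length and byte counts printed above can be cross-checked against the `Content-Length` header. The following is a minimal sketch, assuming the whole response is already held in one bytes object; `split_response` is a hypothetical helper name and the sample response is made up for illustration, neither comes from the chapter.

```python
def split_response(raw):
    """Split a raw HTTP response into (status line, headers dict, body bytes)."""
    pos = raw.find(b"\r\n\r\n")
    if pos < 0:
        raise ValueError("no blank line found; the header is incomplete")
    header_text = raw[:pos].decode("iso-8859-1")
    body = raw[pos + 4:]  # skip the 4 bytes of \r\n\r\n
    lines = header_text.split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up response, used only to exercise the helper without a network call
sample = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\nContent-Type: text/plain\r\n\r\nhello"
status, headers, body = split_response(sample)
print(status)
print(int(headers["content-length"]) == len(body))
```

Applied to the run above: 230608 bytes were received, the header is 394 bytes long, and the separator takes 4 bytes, so the picture data is 230608 - 394 - 4 = 230210 bytes, which agrees with `Content-Length: 230210`.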

## Simplifying with urllib

In [13]:
import urllib.request

In [14]:
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

for line in fhand:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [18]:
#word counting using a url 'file'

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')


counts = dict()
for line in fhand:
    words = line.decode().lower().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    
print(counts)

{'but': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'it': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'juliet': 1, 'sun': 2, 'arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}


## Reading binary files using urllib

In [20]:
#for simpler requests

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

In [19]:
# for larger requests we add a buffer

import urllib.request, urllib.parse, urllib.error
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0

while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)
    
print(size, 'characters copied.')
fhand.close()


230210 characters copied.


## Parsing HTML

### Using RE

In [21]:
# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())


Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.13/
https://docs.python.org/3.12/
https://docs.python.org/3.11/
https://docs.python.org/3.10/
https://docs.python.org/3.9/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.6/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://devguide.python.org/
https://www.python.org/
https://devguide.python.org/docquality/#helping-with-documentation
https://docs.python.org/3.13/
https://docs.python.org/3.12/
https://docs.python.org/3.11/
https://docs.python.org/3.10/
https://docs.python.org/3.9/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.6/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.p

### Using BeautifulSoup

In [22]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

In [26]:
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Enter - https://docs.python.org
https://www.python.org/
download.html
https://docs.python.org/3.13/
https://docs.python.org/3.12/
https://docs.python.org/3.11/
https://docs.python.org/3.10/
https://docs.python.org/3.9/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.6/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://devguide.python.org/
genindex.html
py-modindex.html
https://www.python.org/
#

whatsnew/3.11.html
whatsnew/index.html
tutorial/index.html
library/index.html
reference/index.html
using/index.html
howto/index.html
installing/index.html
distributing/index.html
extending/index.html
c-api/index.html
faq/index.html
py-modindex.html
genindex.html
glossary.html
search.html
contents.html
bugs.html
https://devguide.python.org/docquality/#helping-wi
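
Unlike the regular expression, which only matches values starting with `http://` or `https://`, the BeautifulSoup version also returns relative links such as `download.html`. A minimal sketch of turning those into absolute URLs with `urllib.parse.urljoin` follows. The helper name `absolute_links` is hypothetical, and the base URL `https://docs.python.org/3/` is an assumption about where the page was actually served from.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping missing or empty values."""
    result = []
    for href in hrefs:
        if not href:  # tag.get('href', None) gives None when a tag has no href
            continue
        result.append(urljoin(base_url, href))
    return result


# Assumed base URL; the hrefs are a few of the values printed above
base = 'https://docs.python.org/3/'
hrefs = ['https://www.python.org/', 'download.html', None, 'whatsnew/index.html']
for link in absolute_links(base, hrefs):
    print(link)
```

In the loop over `tags`, the same idea would be to collect `tag.get('href', None)` into a list and pass it to the helper together with the URL that was opened.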

## Exercise 1:  
Change the socket program socket1.py to prompt the user for the URL so it can read any web page. You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call. Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL.


In [37]:
import socket

# Prompt for a URL such as http://data.pr4e.org/romeo.txt and split it on '/'
# so that url[2] is the host name: ['http:', '', 'data.pr4e.org', 'romeo.txt']
url = input('Enter - ').split('/')

try:
    host = url[2]
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # make a connection to port 80 on the host taken from the URL
    mysock.connect((host, 80))

    # send the GET command followed by a blank line
    # \r\n\r\n signifies nothing between two EOL sequences.
    cmd = f"GET {'/'.join(url)} HTTP/1.0\r\n\r\n".encode()
    mysock.sendall(cmd)

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print(data.decode(), end='')

    mysock.close()
except (IndexError, OSError):
    # IndexError: the input has no host part; OSError: unknown host or failed connection
    print('error: invalid url')

In [30]:
url[2]

'docs.python.org'

## Exercise 2: 
Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should retrieve the entire document and count the total number of characters and display the count of the number of characters at the end of the document.
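
One possible approach, given as a sketch rather than a reference solution: keep receiving until the server closes the connection, but stop printing once the limit is reached. The names `show_limited` and `recv_chunks` are hypothetical, the count includes the header because the socket receives it too, and each block is decoded with `errors='replace'` so a character split across two blocks does not raise an exception (the count can be slightly off for such a character).

```python
import socket


def show_limited(chunks, limit=3000):
    """Print at most `limit` characters from an iterable of byte blocks.

    Returns the total number of characters seen, printed or not.
    """
    shown = 0
    total = 0
    for data in chunks:
        text = data.decode(errors='replace')
        total += len(text)
        if shown < limit:
            piece = text[:limit - shown]
            print(piece, end='')
            shown += len(piece)
    print()
    return total


def recv_chunks(host, url, size=512):
    """Yield blocks from an HTTP/1.0 GET until the server closes the connection."""
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        mysock.connect((host, 80))
        mysock.sendall(f'GET {url} HTTP/1.0\r\n\r\n'.encode())
        while True:
            data = mysock.recv(size)
            if len(data) < 1:
                break
            yield data
    finally:
        mysock.close()


# Offline check with made-up blocks and a small limit
total = show_limited([b'abcde', b'fghij', b'klmno'], limit=7)
print(total, 'characters received')
```

Against a real server the call would look like `show_limited(recv_chunks('data.pr4e.org', 'http://data.pr4e.org/romeo.txt'))`.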


## Exercise 3: 
Use urllib to replicate the previous exercise of

1. retrieving the document from a URL,
2. displaying up to 3000 characters, and
3. counting the overall number of characters in the document.

Don’t worry about the headers for this exercise, simply show the first 3000 characters of the document contents.
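
The three steps can be sketched as follows. This is one possible shape for a solution, not the book's; `summarize` and `fetch_text` are hypothetical names, and the document is assumed small enough to read into memory at once.

```python
import urllib.request


def summarize(text, limit=3000):
    """Return (first `limit` characters, total number of characters)."""
    return text[:limit], len(text)


def fetch_text(url):
    """Read the whole document at `url` and decode it as text."""
    with urllib.request.urlopen(url) as fhand:
        return fhand.read().decode(errors='replace')


# Offline check with made-up text longer than the limit
preview, total = summarize('x' * 5000)
print(len(preview), total)
```

With a network connection, `preview, total = summarize(fetch_text('http://data.pr4e.org/romeo.txt'))` followed by `print(preview)` and `print(total)` would cover all three steps; urllib already strips the headers, so only the document contents are counted.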


## Exercise 4:  
Change the urllinks.py program to extract and count paragraph (p) tags from the retrieved HTML document and display the count of the paragraphs as the output of your program. Do not display the paragraph text, only count them. Test your program on several small web pages as well as some larger web pages.
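
A sketch of the counting step is shown below. To keep it runnable without extra installs it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`; with BeautifulSoup, as used earlier in these notes, the equivalent count would be `len(soup('p'))`. The names `ParagraphCounter` and `count_paragraphs` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Count opening <p> tags; the text inside them is ignored."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag == 'p':
            self.count += 1


def count_paragraphs(html_text):
    parser = ParagraphCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


# Made-up page with three paragraphs, one of them written in uppercase
sample = '<html><body><p>One</p><div><p>Two</p></div><P class="x">Three</P></body></html>'
print(count_paragraphs(sample))
```

For a real page, the string passed in would be the decoded result of `urllib.request.urlopen(url, context=ctx).read()`.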


## Exercise 5: 

(Advanced) Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that recv receives characters (newlines and all), not lines.
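
Because `recv` returns arbitrary blocks, the blank line that ends the header can be split across two of them. One way to handle this, sketched under the assumption that the blocks arrive as an iterable of bytes, is to buffer data until `\r\n\r\n` has been seen and pass everything after it through unchanged. `body_chunks` is a hypothetical helper name and the blocks below are made up.

```python
def body_chunks(chunks):
    """Yield only the bytes that follow the first blank line (CRLF CRLF)."""
    buffer = b''
    in_body = False
    for data in chunks:
        if in_body:
            yield data
            continue
        # Still inside the header: accumulate until the terminator appears
        buffer += data
        pos = buffer.find(b'\r\n\r\n')
        if pos >= 0:
            in_body = True
            rest = buffer[pos + 4:]
            if rest:
                yield rest


# Made-up response whose header terminator is split across two blocks
blocks = [b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r',
          b'\n\r\nBut soft ',
          b'what light']
body = b''.join(body_chunks(blocks))
print(body.decode())
```

In the socket program, the `while True` loop around `mysock.recv(512)` would supply the blocks, and each yielded piece would be decoded and printed.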