# Python for Everybody
## C3: Using Python to Access Web Data

### Week 2 Exercises: Regular Expressions
These exercises are from: https://www.py4e.com/html3/11-regex

**Exercise 1:** Write a simple program to simulate the operation of the grep command on Unix. Ask the user to enter a regular expression and count the number of lines that matched the regular expression:
```
$ python grep.py
Enter a regular expression: ^Author
mbox.txt had 1798 lines that matched ^Author

$ python grep.py
Enter a regular expression: ^X-
mbox.txt had 14368 lines that matched ^X-

$ python grep.py
Enter a regular expression: java$
mbox.txt had 4218 lines that matched java$
```

In [8]:
# import regular expression library
import re

# prompt user to enter regular expression
regex = str(input('Enter a regular expression:'))

# open file
file = open('mbox.txt', 'r')

# counter variable
count = 0

# loop through file, strip out new line character
for line in file:
    line = line.rstrip()
    # regex that searches for lines matching regular expression
    match = re.search(regex, line)

    # counts the number of lines matching regular expression
    if match:
        count = count + 1

# print total lines that matched regular expression
print('mbox.txt had', count, 'lines that matched', regex)

mbox.txt had 4218 lines that matched java$
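The counting loop above can also be packaged as a function, which makes it easier to try on a few lines without opening `mbox.txt`. This is a minimal sketch, not the course's reference solution: the name `count_matches` and the sample lines are made up for illustration. It also catches `re.error`, which `re.compile` raises when the user types an invalid expression.

```python
import re

def count_matches(pattern, lines):
    """Count the lines in which `pattern` matches somewhere (like grep -c)."""
    try:
        compiled = re.compile(pattern)
    except re.error as err:
        # An invalid expression such as '[' raises re.error
        print('Invalid regular expression:', err)
        return 0
    return sum(1 for line in lines if compiled.search(line.rstrip()))

# Made-up sample lines standing in for mbox.txt
sample = [
    'Author: someone@example.com\n',
    'X-Sieve: CMU Sieve 2.3\n',
    'I like java\n',
    'Author: other@example.com\n',
]
print(count_matches('^Author', sample))  # 2
print(count_matches('java$', sample))    # 1
```

Compiling the pattern once before the loop avoids re-parsing it for every line, which may matter on a file as large as `mbox.txt`.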


**Exercise 2:** Write a program to look for lines of the form:
```
New Revision: 39772
```

In [10]:
# import regular expression library
import re

# open file
file = open('mbox.txt', 'r')

# loop through file
for line in file:
    line = line.rstrip()

    # find lines that start with New Revision:
    rev = re.findall('^New Revision: [0-9]+', line)
    
    # skip lines that do not contain exactly one match; print the match otherwise
    if len(rev) != 1: continue
    print(rev)


Extract the number from each of the lines using a regular expression and the findall() method. Compute the average of the numbers and print out the average as an integer.
```
Enter file:mbox.txt
38549

Enter file:mbox-short.txt
39756
```

In [7]:
# import regular expression library
import re

# prompt user for file name
inp = input('Enter file name:')

# open file
file = open(inp, 'r')

revnum = list()
count = 0

# loop through file
for line in file:
    line = line.rstrip()

    # find lines that start with New Revision:
    rev = re.findall('^New Revision: ([0-9]+)', line)

    for val in rev:
        revnum = revnum + [float(val)]
        count = count + 1

# print(sum(revnum))
# print(count)

# print the average as an integer, as the exercise asks
print(int(sum(revnum) / count))


39756
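The extraction step can be tried on a handful of lines before running it over a whole mailbox file. The sketch below is illustrative only: `average_revision` and the sample lines are hypothetical names and data. Because the pattern contains one parenthesized group, `findall()` returns just the digits rather than the whole line.

```python
import re

def average_revision(lines):
    """Return the integer average of the numbers on 'New Revision:' lines."""
    numbers = []
    for line in lines:
        # findall returns only the captured group, i.e. the digits
        for val in re.findall('^New Revision: ([0-9]+)', line):
            numbers.append(int(val))
    if not numbers:
        # no matching lines, so there is nothing to average
        return None
    return sum(numbers) // len(numbers)

# Made-up sample lines
sample = ['New Revision: 39772', 'Subject: hello', 'New Revision: 39771']
print(average_revision(sample))  # 39771
```

Returning `None` when no line matches avoids the division by zero that an empty file would otherwise cause.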


_____

### Week 3 & 4 Exercises: Networks and Sockets & Programs that Surf the Web
These exercises are from: https://www.py4e.com/html3/12-network

**Exercise 1:** Change the socket program **socket1.py** to prompt the user for the URL so it can read any web page. You can use **split('/')** to break the URL into its component parts so you can extract the host name for the socket **connect** call. Add error checking using **try** and **except** to handle the condition where the user enters an improperly formatted or non-existent URL.

Test URL: http://data.pr4e.org/romeo.txt

In [28]:
# import socket library
import socket

# create a variable for port number
port_num = 0

while True:
    # prompt user to enter URL
    user_url = input('Enter URL including http:// or https://:')
    # lower case url
    user_url = user_url.lower()
    # display the url user has entered
    print('You entered:', user_url)

    try:
        # check to see if url begins with http:// or https://
        if user_url.startswith('http://') or user_url.startswith('https://'):
            # split url on / and grab, print hostname
            hostname = user_url.split('/')[2]
            print('URL host is:', hostname)
            
            # assign port number based on http: or https:
            if 'http:' in user_url.split('/')[0]:
                port_num = 80
                print('URL port is:', port_num)
            elif 'https:' in user_url.split('/')[0]:
                port_num = 443
                print('URL port is:', port_num)
            break
        else:
            # prompt user to re-enter url if it doesn't meet the above conditions
            print('Incorrect URL. Please enter full URL including http:// or https://')    
    except:
        print('Incorrect URL. Please enter full URL including http:// or https://')
        exit()

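# make socket connection; a non-existent host raises an OSError subclass
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    mysock.connect((hostname, port_num))
except OSError as err:
    print('Could not connect to', hostname, '-', err)
    exit()
c = 'GET ' + user_url + ' HTTP/1.0\r\n\r\n'
cmd = c.encode()
mysock.send(cmd)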

# print response
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')

mysock.close()

You entered: http://data.pr4e.org/romeo.txt
URL host is: data.pr4e.org
URL port is: 80
HTTP/1.1 200 OK
Date: Wed, 01 Jun 2022 07:03:09 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
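The exercise suggests `split('/')` for pulling the host name out of the URL. As an alternative sketch, the standard library's `urllib.parse.urlsplit` does the same job and also handles an explicit port such as `:8080`. The helper name `host_and_port` and the `DEFAULT_PORTS` table are assumptions made for this example, not part of the exercise.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes the program accepts
DEFAULT_PORTS = {'http': 80, 'https': 443}

def host_and_port(url):
    """Return (hostname, port) for an http:// or https:// URL."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError('Expected a URL starting with http:// or https://')
    # use the port written in the URL if there is one, else the scheme default
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port

print(host_and_port('http://data.pr4e.org/romeo.txt'))  # ('data.pr4e.org', 80)
```

Note that connecting a plain socket to port 443 is not enough for an `https://` URL, because HTTPS expects the connection to be wrapped in TLS (for example with the `ssl` module) before the `GET` request is sent.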


**Exercise 2:** Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown the number of characters specified by the user.

The program should also retrieve the entire document and count the total number of characters and display the count of the number of characters at the end of the document.

**(Advanced)** Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that recv receives characters (newlines and all), not lines.

In [75]:
# import socket library
import socket

# create a variable for port number
port_num = 0

while True:
    # prompt user to enter URL
    user_url = input('Enter URL including http:// or https://:')
    # lower case url
    user_url = user_url.lower()
    # display the url user has entered
    print('You entered:', user_url)

    # url validation
    try:
        # check to see if url begins with http:// or https://
        if user_url.startswith('http://') or user_url.startswith('https://'):
            # split url on / and grab, print hostname
            hostname = user_url.split('/')[2]
            print('URL host is:', hostname)
            
            # assign port number based on http: or https:
            if 'http:' in user_url.split('/')[0]:
                port_num = 80
                print('URL port is:', port_num)
            elif 'https:' in user_url.split('/')[0]:
                port_num = 443
                print('URL port is:', port_num)
            break
        else:
            # prompt user to re-enter url if it doesn't meet the above conditions
            print('Incorrect URL. Please enter full URL including http:// or https://')    
    except:
        print('Incorrect URL. Please enter full URL including http:// or https://')
        exit()

while True:
    # prompt user to enter character limit to display text
    climit = input('Enter character limit of text to be displayed:')
    # display the character limit user has entered
    print('You entered character limit of:', climit)

    # validate input
    try:
        char_limit = int(climit)
        break
    except:
        print('Please enter a number for character limit')

# make socket connection
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((hostname, port_num))
c = 'GET ' + user_url + ' HTTP/1.0\r\n\r\n'
cmd = c.encode()
mysock.send(cmd)

# collect the whole response; recv returns arbitrary chunks, not lines
response = b''
while True:
    data = mysock.recv(3000)
    if len(data) < 1:
        break
    response = response + data

mysock.close()

# decode response data
response = response.decode()

# find position of the blank line that ends the headers
pos = response.find("\r\n\r\n")

# save text only after the headers
text = response[pos+4:]

# print text up to the user-entered character limit
print('\r\n')
print('First', char_limit, 'characters of text is as follows:')
print(text[:char_limit])

# total character count
print('\r\n')
print('Total character count of text excluding header is:', len(text))

You entered: http://data.pr4e.org/words.txt
URL host is: data.pr4e.org
URL port is: 80
You entered character limit of: 500


First 500 characters of text is as follows:
Writing programs or programming is a very creative
and rewarding activity  You can write programs for
many reasons ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem  This book assumes that
{\em everyone} needs to know how to program and that once
you know how to program, you will figure out what you want
to do with your newfound skills

We are surrounded in our daily lives with computers ranging
from laptops to cell 


Total character count of text excluding header is: 1171
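Since `recv` hands back arbitrary chunks, the blank line that ends the headers can itself be split between two chunks. The sketch below shows one way to cope with that: join the chunks first, then split once on the first `\r\n\r\n`. The function name `split_response` and the hand-written chunks are hypothetical stand-ins for real socket data.

```python
def split_response(chunks):
    """Join raw chunks and split them into (headers, body) text."""
    raw = b''.join(chunks)
    # partition splits on the first blank line only
    headers, separator, body = raw.partition(b'\r\n\r\n')
    if not separator:
        # no blank line found: treat everything as body
        return '', raw.decode()
    return headers.decode(), body.decode()

# Hand-written chunks; the \r\n\r\n separator is deliberately split in two
chunks = [
    b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r',
    b'\n\r\nBut soft what light',
    b' through yonder window breaks\n',
]
headers, body = split_response(chunks)
print(body[:8])   # But soft
print(len(body))
```

Searching each chunk separately would miss the separator in this example, which is why the chunks are joined before looking for it.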


**Exercise 3:** Use **urllib** to replicate the previous exercise of (1) retrieving the document from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number of characters in the document. Don’t worry about the headers for this exercise, simply show the first 3000 characters of the document contents.

In [74]:
# import library
import urllib.request, urllib.parse, urllib.error

while True:
    # prompt user to enter URL
    user_url = input('Enter URL including http:// or https://:')
    # lower case url
    user_url = user_url.lower()
    # display the url user has entered
    print('You entered:', user_url)

    # url validation
    try:
        # check to see if url begins with http:// or https://
        if user_url.startswith('http://') or user_url.startswith('https://'):
            hostname = user_url.split('/')[2]
            print('URL host is:', hostname)
            break
        else:
            # prompt user to re-enter url if it doesn't meet the above conditions
            print('Incorrect URL. Please enter full URL including http:// or https://')    
    except:
        print('Incorrect URL. Please enter full URL including http:// or https://')
        exit()

while True:
    # prompt user to enter character limit to display text
    climit = input('Enter character limit of text to be displayed:')
    # display the character limit user has entered
    print('You entered character limit of:', climit)
    
    # validate input
    try:
        char_limit = int(climit)
        break
    except:
        print('Please enter a number for character limit')

# make connection
fhand = urllib.request.urlopen(user_url)

# get response and decode the bytes to text
response = fhand.read().decode()

# print text up to the user-entered character limit
print('\r\n')
print('First', char_limit, 'characters of text is as follows:')
print(response[:char_limit].strip())

# total character count
print('\r\n')
print('Total character count of text excluding header is:', len(response))


You entered: http://data.pr4e.org/words.txt
URL host is: data.pr4e.org
You entered character limit of: 500


First 500 characters of text is as follows:
Writing programs or programming is a very creative
and rewarding activity  You can write programs for
many reasons ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem  This book assumes that
{\em everyone} needs to know how to program and that once
you know how to program, you will figure out what you want
to do with your newfound skills

We are surrounded in our daily lives with computers ranging
from laptops to cell


Total character count of text excluding header is: 1171
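For larger documents it may be preferable to read the response in chunks rather than all at once. The following is a rough sketch under a few assumptions: the document is UTF-8 text, `preview_and_count` is a made-up helper name, and an `io.BytesIO` object stands in for the file-like object that `urllib.request.urlopen` returns, so the example runs without a network connection.

```python
import codecs
import io

def preview_and_count(fhand, limit, chunk_size=1024):
    """Read fhand in chunks; return (first `limit` characters, total characters)."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    preview = ''
    total = 0
    while True:
        chunk = fhand.read(chunk_size)
        if len(chunk) < 1:
            break
        # the incremental decoder copes with a character split across chunks
        text = decoder.decode(chunk)
        total = total + len(text)
        if len(preview) < limit:
            preview = preview + text[:limit - len(preview)]
    return preview, total

# Stand-in for a urlopen() response: 1000 repetitions of a 5-character string
fake_response = io.BytesIO(('caf\u00e9 ' * 1000).encode())
preview, total = preview_and_count(fake_response, 10, chunk_size=7)
print(total)  # 5000
```

Counting decoded characters rather than bytes keeps the total correct for text that contains non-ASCII characters.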


**Exercise 4:** Change the **urllinks.py** program to extract and count paragraph (p) tags from the retrieved HTML document and display the count of the paragraphs as the output of your program. Do not display the paragraph text, only count them. Test your program on several small web pages as well as some larger web pages.

In [73]:
# import libraries
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# ignore ssl certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    # prompt user to enter URL
    user_url = input('Enter URL including http:// or https://:')
    # lower case url
    user_url = user_url.lower()
    # display the url user has entered
    print('You entered:', user_url)

    # url validation
    try:
        # check to see if url begins with http:// or https://
        if user_url.startswith('http://') or user_url.startswith('https://'):
            hostname = user_url.split('/')[2]
            print('URL host is:', hostname)
            break
        else:
            # prompt user to re-enter url if it doesn't meet the above conditions
            print('Incorrect URL. Please enter full URL including http:// or https://')    
    except:
        print('Incorrect URL. Please enter full URL including http:// or https://')
        exit()

html = urllib.request.urlopen(user_url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve and print count of all of the paragraph tags
tags = soup.find_all('p')
print('\r')
print('There are', len(tags), 'paragraph tags on', user_url)


You entered: https://www.geeksforgeeks.org/count-the-number-of-paragraph-tag-using-beautifulsoup/
URL host is: www.geeksforgeeks.org

There are 29 paragraph tags on https://www.geeksforgeeks.org/count-the-number-of-paragraph-tag-using-beautifulsoup/
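To check the counting logic without fetching a page or installing BeautifulSoup, the same idea can be sketched with the standard library's `html.parser` module. This swaps in a different technique from the one the exercise uses; the class name `ParagraphCounter` and the HTML snippet are invented for illustration.

```python
from html.parser import HTMLParser

class ParagraphCounter(HTMLParser):
    """Count opening <p> tags as the document is parsed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        if tag == 'p':
            self.count = self.count + 1

# Made-up document with three paragraphs, one written in upper case
html = '<html><body><p>One</p><div><p class="x">Two</p></div><P>Three</P></body></html>'
counter = ParagraphCounter()
counter.feed(html)
print(counter.count)  # 3
```

On real pages the two approaches may not always agree, since parsers differ in how they treat malformed markup.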


____

### Week 5 Exercises: 

These exercises are from: 

______

### Week 6 Exercises: 

These exercises are from: 

____