# Web operations are easy

## Socket

Python has built-in support for TCP sockets. Common application protocols that run over TCP, with their usual ports:
- HTTP (80)
- HTTPS (443)
- FTP (21) - File Transfer
- SMTP (25) - Mail
- IMAP (143/220/993) - Mail Retrieval

Working at the raw socket level: send an HTTP request by hand and read the response.

In [None]:
import socket

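# Open a TCP connection to the web server on port 80 (HTTP)
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

# Send the request; the blank line (\r\n\r\n) marks the end of the request headers
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.sendall(cmd)

# Read up to 512 bytes at a time until the server closes the connection
while True:
    data = mysock.recv(512)
    if not data:
        break
    print(data.decode(), end='')
mysock.close()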
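
The raw socket returns the HTTP status line and headers followed by the body, all in one byte stream. As a minimal sketch of how the two parts can be separated, the helper below splits on the first blank line. The function name `split_http_response` and the sample response are made up for illustration; no network access is needed.

```python
def split_http_response(raw):
    """Split a raw HTTP/1.x response into (status_line, headers, body)."""
    # Headers and body are separated by the first blank line
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        # Header names are case-insensitive, so store them in lower case
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hypothetical response bytes, standing in for what recv() would return
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'Content-Length: 5\r\n'
          b'\r\n'
          b'hello')

status_line, headers, body = split_http_response(sample)
print(status_line)
print(headers['content-type'])
print(body.decode())
```

This sketch ignores details such as repeated or folded headers, which is part of why a library is usually the better choice.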

## Using urllib in Python

Since HTTP is so common, Python has a library that does all the socket work for us and makes web pages look like a file.

In [None]:
import urllib.request, urllib.parse, urllib.error

response = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in response:
    print(line.decode().strip())

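Because the response behaves like a file, we can loop over its lines and process them, for example to count words: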

In [None]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

print(counts)
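
The same counting loop works on anything that yields lines of bytes, which is what "like a file" means in practice. The sketch below assumes an in-memory `io.BytesIO` object as a stand-in for the `urlopen` response and uses `collections.Counter` instead of a plain dict; the helper name `count_words` and the sample text are made up for illustration.

```python
import io
from collections import Counter


def count_words(lines):
    """Count words in any iterable of byte lines (a file, a response, ...)."""
    counts = Counter()
    for line in lines:
        counts.update(line.decode().split())
    return counts


# Hypothetical stand-in for the object returned by urllib.request.urlopen()
fake_response = io.BytesIO(b'the quick fox\nthe lazy dog\n')

counts = count_words(fake_response)
print(counts.most_common(1))
```

Passing the real response object to `count_words` should work the same way, since it is also iterated line by line.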

Reading a whole web page at once is just as easy:

In [None]:
import urllib.request, urllib.parse, urllib.error

with urllib.request.urlopen('http://python.org/') as response:
    # The server may not declare a charset; fall back to UTF-8 in that case
    charset = response.info().get_content_charset() or 'utf-8'
    html = response.read().decode(charset)

print(html)


## The First Lines of Code at Google

Following links in a simple way: this example fetches a single page and searches it for links. To crawl further, repeat the same search on each link found.

In [None]:
import urllib.request, urllib.parse, urllib.error
import re

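# Raw string so the backslashes reach the regex engine unchanged
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

with urllib.request.urlopen('http://python.org/') as response:
    # The server may not declare a charset; fall back to UTF-8 in that case
    charset = response.info().get_content_charset() or 'utf-8'
    html = response.read().decode(charset)

    # Find every URL that appears in the page text
    urls = re.findall(url_pattern, html)
    print(urls)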
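
Searching each found link in turn can be sketched as a breadth-first loop with a queue of pages to visit and a set of pages already seen. This is only an illustration under stated assumptions: `crawl`, `fake_fetch`, and the in-memory `PAGES` dictionary are hypothetical names, and the link pattern is a simplified one rather than the pattern used in the cell above.

```python
import re
from collections import deque

# Simplified pattern: an http(s) URL up to the next quote, bracket or space
LINK_PATTERN = re.compile(r'https?://[^\s"\'<>]+')


def crawl(start_url, fetch, max_pages=10):
    """Follow links breadth-first; fetch(url) must return the page text."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        for link in LINK_PATTERN.findall(html):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "web" so the sketch runs without a network
PAGES = {
    'http://a.example/': '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    'http://b.example/': '<a href="http://a.example/">a</a> <a href="http://c.example/">c</a>',
    'http://c.example/': 'no links here',
}


def fake_fetch(url):
    return PAGES[url]


order = crawl('http://a.example/', fake_fetch)
print(order)
```

For real pages, `fetch` could wrap `urllib.request.urlopen(url).read().decode(...)`. The `max_pages` limit keeps the loop from running indefinitely on a large site.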


## Web Scraping - Crawling

Web scraping is when a program or script pretends to be a browser: it retrieves web pages, looks at them, extracts information, and then moves on to more web pages.

Search engines scrape web pages - we call this "spidering the web" or "web crawling".

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. [Reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)

In [None]:
import sys
!{sys.executable} -m pip install BeautifulSoup4


In [None]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter url: ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
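
The `href` values printed above are often relative (`about.html`, `/contact`), missing (`None`), or not web links at all (`mailto:`). A minimal sketch of cleaning them up with the standard library's `urllib.parse.urljoin` follows; the base URL and the list of hrefs are made-up examples, not output from a real page.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical URL of the page the links were found on
base = 'http://example.com/docs/index.html'

# Hypothetical href values, as tag.get('href', None) might return them
hrefs = ['about.html', '/contact', '../img/logo.png',
         'http://other.example/page', 'mailto:someone@example.com', None]

absolute = []
for href in hrefs:
    if href is None:
        continue  # anchor tag without an href attribute
    full = urljoin(base, href)  # resolve relative links against the page URL
    if urlparse(full).scheme in ('http', 'https'):
        absolute.append(full)  # keep only links a crawler could fetch

print(absolute)
```

Resolving links this way gives a crawler complete URLs that it can pass straight back to `urlopen`.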


In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

In [None]:
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id="link3"))