# Welcome to the Dark Art of Coding:
## Introduction to Python
Gathering data from the web

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Use and understand the basics of the `urllib` module
* Use and understand the basics of the `Beautiful Soup` library

# Networks
---

For this session, we want to cover several issues related to network communications, so that folks have a better sense of what happens when a request to a website is made. We will not go into any depth; this is predominantly to ensure that we are on the same page in terms of vocabulary, etc.

We'll cover the Transmission Control Protocol (TCP), port numbers used by that protocol and the Hypertext Transfer Protocol (HTTP). 

## Transmission Control Protocol (TCP)

TCP is a protocol that is commonly used to send data across a network. The protocol is based on a standard published by the Internet Engineering Task Force and can be [found here](https://tools.ietf.org/html/rfc793). TCP is one component of the TCP/IP stack:

|Layer|Example technologies at each layer (not all-inclusive)|
|:---|:---|
|Application|Browser, FTP, email, telnet communications|
|Transport|TCP, UDP port numbers|
|Internet|IP address|
|Network access (link)|MAC address|

TCP has these general properties (not an all-inclusive list):

* Relies upon some built-in mechanisms to help increase reliability (error checking, etc.)
* Creates connections between two devices (it is referred to as a connection-oriented protocol)
* Uses checks to ensure that all data has been correctly received. If not, it can request that missing data be resent
* Uses sequence numbering to ensure that packets can be put into a specific order
* Between the reliability checks and the organization/ordering of packets, it is very effective for sending files (like web pages)
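
To make the connection-oriented idea concrete, here is a minimal sketch using Python's standard `socket` module. It is not part of the original session: the `echo_once` helper and the message text are made up for illustration, and both ends of the connection run on the local machine so no Internet access is needed.

```python
import socket
import threading


def echo_once(server_sock):
    """Hypothetical helper: accept one connection and echo back what arrives."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


# SOCK_STREAM selects TCP. Binding to port 0 asks the operating system
# to pick any free port for us.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

worker = threading.Thread(target=echo_once, args=(server,))
worker.start()

# create_connection() establishes the TCP connection to the listener.
with socket.create_connection((host, port), timeout=5) as client:
    client.sendall(b"hello over TCP")
    reply = client.recv(1024)

worker.join()
server.close()

print(reply)
```

The client never deals with sequence numbers or retransmission directly. TCP handles those details underneath the `sendall()` and `recv()` calls.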


## Port numbers

The TCP protocol incorporates the use of port numbers:

* Any given computer may have multiple pieces of software running that are willing to accept incoming traffic
* Each of these applications will request that the computer listen on a specific port number (or numbers)
* When the computer receives traffic destined to a specific port, it will direct that traffic to the right application.
* Multiple open ports allow multiple applications on the same computer to talk without interfering with each other
* Historically, certain higher-level protocols have been given default TCP port numbers (see the table below)
* There are 65,536 port numbers (0 through 65535)
* Ports between 0 and 1023 are referred to as **well-known ports**
* Ports between 1024 and 49151 are the **registered ports**. They are assigned by the Internet Assigned Numbers Authority (IANA)
* Ports between 49152 and 65535 are referred to as **ephemeral ports**. They are often opened as needed, so that the computer can assign a temporary source port number to the TCP packets that it sends when it starts an outgoing connection.

Task | Port
:----|:----
Telnet | 23
SSH | 22
HTTP | 80
HTTPS | 443
SMTP (E-mail) | 25
DNS (Domain Name) | 53
FTP (File Transfer) | 21
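
The port ranges listed above can be expressed as a small function. This is only an illustrative sketch: `port_category` is a made-up helper name, not something provided by Python or by any networking library.

```python
def port_category(port):
    """Classify a port number using the ranges described above.

    Hypothetical helper written for this lesson.
    """
    if not 0 <= port <= 65535:
        raise ValueError(f"{port} is not a valid port number")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "ephemeral"


# Ports taken from the table above, plus the port used later for a
# local test server.
for name, port in [("SSH", 22), ("HTTP", 80), ("HTTPS", 443), ("local test server", 8999)]:
    print(name, port, port_category(port))
```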

## Hypertext Transfer Protocol (HTTP)

HTTP is one of many common protocols that may be sent using TCP.

* HTTP is the standard protocol for most web applications on the Internet
* Invented to retrieve HTML, images, documents, etc.
* Basic concept:
    * Make a connection
    * Request a document
    * Retrieve the document and display it
    * Close the connection

HTTP uses Uniform Resource Locators (URLs) to identify resources such as documents. A URL has several components:

* The URL indicates the protocol, generally HTTP (but it could be others)
* It lists the host (server) that hosts the document
* The name and path to the specific document

http://  | www.example.com/ | index.html
:--------|:-------------|:----------------
Protocol | Host         | Document


http://  | localhost:8000/ | index.html
:--------|:-------------|:----------------
Protocol | Host:Port         | Document
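
Python can split a URL into these same components. The sketch below uses `urlsplit` from the standard library's `urllib.parse` module on the two example URLs from the tables above. Note that `urlsplit` calls the protocol part the `scheme`.

```python
from urllib.parse import urlsplit

for url in ("http://www.example.com/index.html",
            "http://localhost:8000/index.html"):
    parts = urlsplit(url)

    # .port is None when the URL does not name a port explicitly
    print(parts.scheme, parts.hostname, parts.port, parts.path)
```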

# How does this URL fit into the scheme of things?

* Browser attempts to connect to `http://www.example.com`
    * if no port is given, then it attempts to connect on port 80
    * if a port is given, then it attempts to connect on that port number
* It issues a request for a document called `index.html`
* The server responds and sends the HTML document
* Browser renders the HTML document
* Browser closes the connection when it is no longer needed
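
The steps above can be sketched by hand with a raw TCP socket. This is a rough illustration only, since a real browser does far more. To keep it runnable without Internet access, the sketch starts its own tiny server on the local machine. `QuietHandler` and the page body are made up for the example.

```python
import http.server
import socket
import threading


class QuietHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output free of server log lines


# Start the stand-in server on a free port chosen by the operating system.
server = http.server.HTTPServer(("127.0.0.1", 0), QuietHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# An HTTP request is plain text: a request line, headers, then a blank line.
request = (
    "GET /index.html HTTP/1.0\r\n"
    f"Host: 127.0.0.1:{port}\r\n"
    "\r\n"
).encode("ascii")

# Make a connection, request the document, then read until the server closes.
with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
    sock.sendall(request)
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)

server.shutdown()
server.server_close()

response = b"".join(chunks).decode("ascii")
status_line = response.split("\r\n", 1)[0]
print(status_line)
```

Printing `response` instead of `status_line` shows the headers the server sent, followed by a blank line and the HTML document itself.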

# HTTP requests in Python using urllib
---

First, we have to import the `request` module from the `urllib` package.

This package has several modules, including the `urllib.request` module.

In [None]:
import urllib.request

Now that we have imported a new module that we have never used, we should explore it. Try to find out what functions and attributes are present, using `urllib.request.<tab complete>`

In [None]:
urllib.request.

In [None]:
# urllib allows us to open web pages just like opening files by using the urlopen() function.
# The following command creates an http.client.HTTPResponse object that
#     gives us access to a number of attributes and behaviors
#     related to the data retrieved

file = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/11757/pg11757.txt')

Before we go any further, let's take a look at this object we created called `file`, again using `file.<tab complete>`

In [None]:
file.

In [None]:
# A common technique is to use a for loop to cycle through every
# line and print out the data one line at a time
# In this case, the data is read in as BYTES instead of as a STRING,
#     which will require us to call the decode method on each line.

for line in file:
    # We convert each line from bytes to strings using the
    #     .decode() method.
    print(line.decode().strip())

In [None]:
# Much like other files we have looked at, we can 
# read and evaluate the text in web-based text files, 
# i.e. we can do tasks like counting words

file = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/11757/pg11757.txt')

In [None]:
count = {}

for line in file:
    
    # Again, we take the line and use .decode() to convert
    #     the data to a string
    #     Then we strip the newline
    #     Then we split it on spaces
    words = line.decode().strip().split()
    
    # We cycle through the words one at a time
    for word in words:
        
        # If a key for the word already exists .get() grabs the value otherwise it automatically returns 0
        count[word] = count.get(word, 0) + 1

In [None]:
count
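
As an alternative to the dictionary and `.get()` pattern, the standard library's `collections.Counter` does the same bookkeeping. The sketch below is an assumption-laden illustration: `sample_lines` is a made-up stand-in for the lines of bytes that the response object yields, so the cell runs without a network connection.

```python
from collections import Counter

# Hypothetical stand-in for the lines yielded by the response object:
# each item is a bytes object ending in a newline.
sample_lines = [
    b"the quick brown fox\n",
    b"jumps over the lazy dog\n",
    b"the end\n",
]

word_counts = Counter()

for line in sample_lines:
    # .split() with no arguments also discards the trailing newline
    word_counts.update(line.decode().split())

print(word_counts.most_common(1))
```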

# Unicode and Python text

* Internally, within Python 3+, all Python strings are Unicode
* When we talk to a network, we usually have to encode and decode our data (generally using `utf-8`)
* When we receive data from a web server, we typically receive it as a `bytes` object, which we then pass through a `.decode()` method to get a string
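
A short sketch of the round trip between `str` and `bytes`. The sample word is chosen only because it contains one non-ASCII character, which makes the difference between characters and bytes visible.

```python
text = "caf\u00e9"              # the word "café", a Unicode string
data = text.encode("utf-8")     # str -> bytes, ready to send over a network

print(type(text), len(text))    # 4 characters
print(type(data), len(data))    # 5 bytes, because é needs two bytes in UTF-8
print(data)

round_trip = data.decode("utf-8")   # bytes -> str, as we do with web data
print(round_trip == text)
```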


In [None]:
# Poor man's debugging...
# I find this to be one of the most useful lines of code to a 
#     new Pythonista

print(type(line), line)

In [None]:
# Let us look at the difference between outputting:
#     a bytes object vs.
#     a string

print(line)
clean_line = line.decode()

print(type(clean_line), clean_line)

# Reading web pages
---

In [None]:
# Our earlier examples were fairly straightforward, since we 
#     retrieved text files. Most of the web is not 
#     straight text files, it is composed of 
#     Hyper Text Markup Language (HTML)

# We request a page using urllib.request.urlopen()

page = urllib.request.urlopen('http://www.example.com/index.html')

for line in page:
    print(line.decode().strip())

# Beautiful soup
---

While it is possible to read and parse web pages using only `urllib`, a third-party library, `Beautiful Soup`, is commonly used alongside `urllib`. `Beautiful Soup`:

* Makes reading and parsing web pages a lot easier
* Allows you to extract tags of only certain types
* Lets you find certain tags based on their relationship in the tag hierarchy
* Makes getting hyperlinks a whole lot easier

## On the command line

If you want to install Beautiful Soup on your own system, make sure you activate your desired conda virtual environment, then use `conda` to install Beautiful Soup in that environment:

```bash
$ conda activate myenv
(myenv)$ conda install beautifulsoup4
```


In [None]:
# Import the necessary modules

from bs4 import BeautifulSoup
import urllib.request

In [None]:
# Get the html text from the HTTPResponse object
# Notice the .read() method that we daisy chained on the end.
# What are pros and cons of this approach?

htmlText = urllib.request.urlopen('http://www.unicode.org/').read()

In [None]:
print(type(htmlText), htmlText)

In [None]:
# Use bs4 to create a soup object from our html text
# Provide an argument to identify which type of parser to
#     use, in this case, an html parser

soup = BeautifulSoup(htmlText, 'html.parser')

In [None]:
print(type(soup), soup)

In [None]:
tag = soup.title
dir(tag)

In [None]:
# The soup object allows you to retrieve specific types of tags, in this
#     case, anchor tags (identified using an 'a'). Anchor tags are used for links.

tags = soup('a') 

In [None]:
# Let's cycle through the tags and get the 'href' data portion. This is the data that contains the link itself

for tag in tags:
    print(tag.get('href', None))
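
Many of the `href` values printed by a loop like this are relative links, and some tags have no `href` at all (which is why `None` can appear). As a sketch of one way to turn them into full URLs, the standard library's `urllib.parse.urljoin` combines each link with the address of the page it came from. The `sample_hrefs` list below is made up for illustration and is not taken from the real page.

```python
from urllib.parse import urljoin

base_url = "http://www.unicode.org/"

# Hypothetical href values of the kinds a page commonly contains
sample_hrefs = [
    "/reports/index.html",        # relative to the site root
    "about/contact.html",         # relative to the current page
    "https://www.example.com/",   # already absolute
    None,                         # an anchor tag without an href
]

absolute_links = []

for href in sample_hrefs:
    if href is None:
        continue  # nothing to resolve
    absolute_links.append(urljoin(base_url, href))

for link in absolute_links:
    print(link)
```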

# Using documentation
---

Let's explore the documentation for a third party library.

The documentation for Beautiful Soup has a number of nice attributes that can get you started fairly quickly, so let's use the documentation to enhance our knowledge of the subject.

[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

# Web scraping
---

## What is web scraping?

Web scraping is a technique used to retrieve data from the web OR from similar networks (intranets, etc).

* Web scrapers simulate the behavior of a browser
* They look at the data from specific site(s)
* They extract specific information you need from it
* Typically this is done over and over again across multiple sites

## Why web scrape?

* Get data from sites that don't provide mechanisms to export the data
* Collect information on sites to build a search engine database
* Monitor sites for changes
* Collect social network data
    * Who is connected to or communicates with whom?
    * What is being said?

# Miscellaneous:

In [None]:
# source:
# http://www.jabberwocky.com/carroll/jabber/jabberwocky.html

The following command will run an HTTP server on your local computer...

Run this from the command line.

This allows you to test tools like Beautiful Soup even if you are not connected to the Internet.

WARNING: **Do not use this web server in a production environment**. It does not have hardened security settings, etc. It should be considered more of a testing/prototyping/experimental tool.

```bash
$ python -m http.server 8999 --bind 127.0.0.1
```

In [None]:
page = urllib.request.urlopen('http://localhost:8999/jabberwocky.html')

In [None]:
page

In [None]:
text = page.read()
print(text)
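
For anyone who prefers to stay inside Python, the whole experiment can also be sketched in one cell: start a local server from code, fetch a page from it with `urlopen`, then shut the server down. This is an illustrative sketch under a few assumptions: the page content is a one-line stand-in rather than the real `jabberwocky.html`, the server uses a free port picked by the operating system instead of 8999, and `utf-8` is assumed when the server does not report a character set.

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request

with tempfile.TemporaryDirectory() as folder:
    # Stand-in page content, written to a temporary folder
    pathlib.Path(folder, "jabberwocky.html").write_text(
        "<html><body><p>'Twas brillig</p></body></html>", encoding="utf-8"
    )

    # Serve files from the temporary folder on a free local port
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=folder
    )
    server = http.server.HTTPServer(("127.0.0.1", 0), handler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # The with statement closes the connection for us when we are done
    url = f"http://127.0.0.1:{port}/jabberwocky.html"
    with urllib.request.urlopen(url) as page:
        status = page.status
        charset = page.headers.get_content_charset() or "utf-8"
        html_text = page.read().decode(charset)

    server.shutdown()
    server.server_close()

print(status)
print(html_text)
```

The same warning applies here: this server is for testing only.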