# Web & Scraping

Requires: urllib, requests, socket, re, lxml, io, bs4, sqlite3, scrapy, pandas, sqlalchemy.

Plan:

* HTML 101
* HTTP + REST 101
* Web scraping

In [2]:
#make sure they are all installed
import urllib, requests, socket, re, lxml, io, bs4, sqlite3, scrapy, pandas, sqlalchemy 


# Internet 101

What happens when you type in a path in the address bar and press Enter?

URL:  
```<scheme>:[//[<login>[:<passwd>]@]<host>[:<port>]][/<URL‐path>][?<parameters>][#<fragment>]```

Example:  
`https://pp.userapi.com/c834101/v834101778/13450d/9yxFBjsPxN8.jpg`
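
The parts of the URL template above can be pulled apart with the standard library's `urllib.parse`. This is a minimal sketch; the second URL (with `user:secret`, port `8443`, a query and a fragment) is made up for illustration and is not a real endpoint.

```python
from urllib.parse import urlsplit, parse_qs

# The example URL from the text: no login, port, parameters or fragment.
image = urlsplit('https://pp.userapi.com/c834101/v834101778/13450d/9yxFBjsPxN8.jpg')
print(image.scheme)    # https
print(image.hostname)  # pp.userapi.com
print(image.port)      # None: the port is not written, so the scheme's default applies
print(image.path)      # /c834101/v834101778/13450d/9yxFBjsPxN8.jpg

# A hypothetical URL that uses every optional component of the template.
full = urlsplit('https://user:secret@example.com:8443/search?q=python&page=2#results')
print(full.username, full.password)  # user secret
print(full.port)                     # 8443
print(parse_qs(full.query))          # {'q': ['python'], 'page': ['2']}
print(full.fragment)                 # results
```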

You are a **client**

1. **DNS**: host name from the URL -> IP address of the **server** (87.240.129.71); the port (443) follows from the `https` scheme unless the URL sets it explicitly
2. **HTTP(S)**: IP + URL-path + parameters -> **resource** (text (e.g. *HTML-document*), picture, sound, etc.)
3. **Browser engine** tries to parse and **draw** your document
4. *Usually* the document contains links to other resources, each with its own URI. For each of them we go back to step 1
5. Concurrently and after that the **JavaScript**-code is being executed

# HTML 101

HyperText Markup Language

HTML is a markup language used for web-pages.

It is a way to give text a structure that browsers can understand.

The structure is defined by nested tags, which determine how the text will be displayed (rendered).

This is a tag: `<tag>`. Tags can be opening (`<tag>`) or closing (`</tag>`).

Example of HTML-markup:

```html
<!DOCTYPE html>
<html>
   <head>
      <meta charset="utf-8" />
      <title>HTML Document</title>
   </head>
   <body>
      <p> <!-- p is a paragraph, and in such strange brackets is a comment -->
         <b>
            This text will be bold, <i>and this one will be also italic</i>.
         </b>
      </p>
   </body>
</html>
```

This is how it looks after rendering:

<!DOCTYPE html>
<html>
   <head>
      <meta charset="utf-8" />
      <title>HTML Document</title>
   </head>
   <body>
      <p> <!-- p is a paragraph, and in such strange brackets is a comment -->
         <b>
            This text will be bold, <i>and this one will be also italic</i>.
         </b>
      </p>
   </body>
</html>

You can see that HTML markup has a tree structure:<br>
each tag is a node of the tree and contains zero or more nested tags; a tag with no nested tags is a leaf, any other tag is an internal node.

So, to get some information from HTML, you can use its structure.

We will deal with this a bit later.
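
To make the tree visible, here is a small sketch that prints every opening tag indented by its nesting depth. It uses the standard library's `html.parser` rather than the libraries introduced later; the `TreePrinter` class and the `VOID_TAGS` set are names invented for this illustration.

```python
from html.parser import HTMLParser

# Tags that never contain other tags and have no closing tag.
VOID_TAGS = {'meta', 'br', 'img', 'hr', 'input', 'link'}

class TreePrinter(HTMLParser):
    """Collects every opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1      # children of this tag are one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1      # back to the parent's level

document = """<!DOCTYPE html>
<html>
   <head>
      <meta charset="utf-8" />
      <title>HTML Document</title>
   </head>
   <body>
      <p><b>This text will be bold, <i>and this one will be also italic</i>.</b></p>
   </body>
</html>"""

printer = TreePrinter()
printer.feed(document)
print('\n'.join(printer.lines))
```

The sketch assumes well-nested markup like the example above; real pages often omit closing tags, which is one reason to prefer a dedicated parser.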

### P.S. 
In fact, the XML format (another markup language) has the same tree structure.

Formally, HTML is not a subset of XML: both have their roots in SGML, and only XHTML is HTML written to XML's stricter rules.

### P.P.S.

Tags also have **attributes**:

```html
<a href="http://example.com">Hyperlink to example.com</a>
```
This piece renders as:
<a href="http://example.com">Hyperlink to  example.com</a>

### P.P.P.S.

And this text (certainly having a structure) was written using another markup language - **Markdown**.

## HTTP 101

HyperText Transfer Protocol

* **Protocol** -- a set of logical agreements on the interaction of programs, a standard;
* **Transfer** -- we live on the Internet, i.e. the world is not limited to our computer => information must be transmitted;
* **Hypertext** -- HTML, for instance.

The main idea: all computers are divided into **servers** (those that store information) and **clients** (those that request it).

In fact, in HTTP we transmit not only hypertext, but **resources** - a philosophical generalization, which includes hypertext, pictures, music, etc. But the standard is suitable for all of this.

Each resource has its own address - **URI** (Uniform Resource Identifier) - an analogue of a file path on your computer.

```<scheme>:[//[<login>[:<passwd>]@]<host>[:<port>]][/<URL‐path>][?<parameters>][#<fragment>]```

We omit the low-level side of the issue: the TCP/IP stack, sockets, ports, etc.

## HTTP structure

To receive data, the client sends a request to the server. The request consists of 3 parts:
* **Starting line** defines the type of message;
* **Headers** characterize the message body, transmission parameters and other information;
* **Message body** (may be empty, as in a typical GET request).

The response message from the server has the same structure.
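
The three parts can be separated by hand, since the headers end at the first empty line. Below is a minimal sketch; `parse_http_message` is a hypothetical helper, and both messages are hand-written examples rather than captured traffic.

```python
def parse_http_message(raw: bytes):
    """Split a raw HTTP message into (starting line, headers dict, body bytes)."""
    head, _, body = raw.partition(b'\r\n\r\n')       # the empty line ends the headers
    start_line, *header_lines = head.decode('iso-8859-1').split('\r\n')
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return start_line, headers, body

raw_request = b'GET /index.html HTTP/1.1\r\nHost: example.com\r\nAccept: */*\r\n\r\n'
raw_response = (b'HTTP/1.1 200 OK\r\n'
                b'Content-Type: text/html; charset=UTF-8\r\n'
                b'Content-Length: 14\r\n'
                b'\r\n'
                b'<h1>Hello</h1>')

print(parse_http_message(raw_request))
print(parse_http_message(raw_response))
```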

To send these requests by hand, rather than through the browser, you can use the `curl` utility. Let's use it.

In [3]:
!curl example.com/index.html 
#!curl example.com/index.html -v

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h

## HTTP methods

The HTTP method defines the operation that we want to perform on the resource.

The most frequent methods: GET, POST, PUT, DELETE (but there are others).

* **GET** -- request for resource content. <br> GET requests are safe (they do not change the resource) and idempotent, so their responses can be cached.
* **POST** -- transferring data to a resource (for example, when sending a comment on a forum or entering a password on a site). <br>Not idempotent => when the same comment is sent to the forum twice, it will appear there twice. <br> Responses are usually not cached.
* **PUT** -- transferring data to a specific URI (creating the resource there or replacing an existing one). Idempotent. Not cached.
* **DELETE** -- resource deletion. Idempotent.
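
Idempotent means that repeating the same request leaves the server in the same state as sending it once. The sketch below imitates this with an in-memory object; `ToyForum` and its methods are invented for illustration and do not perform any real HTTP.

```python
class ToyForum:
    """In-memory stand-in for a server; not a real HTTP implementation."""

    def __init__(self):
        self.comments = {}   # comment id -> text
        self.next_id = 1

    def post(self, text):
        """Like POST /comments: every call creates a new resource."""
        comment_id = self.next_id
        self.comments[comment_id] = text
        self.next_id += 1
        return comment_id

    def put(self, comment_id, text):
        """Like PUT /comments/<id>: create or replace the resource at a known URI."""
        self.comments[comment_id] = text

    def delete(self, comment_id):
        """Like DELETE /comments/<id>: deleting twice leaves the same state."""
        self.comments.pop(comment_id, None)

forum = ToyForum()
forum.post('Nice lecture!')
forum.post('Nice lecture!')       # repeated POST -> a duplicate comment
print(forum.comments)             # {1: 'Nice lecture!', 2: 'Nice lecture!'}

forum.put(1, 'Edited comment')
forum.put(1, 'Edited comment')    # repeated PUT -> same state as after the first one
forum.delete(2)
forum.delete(2)                   # repeated DELETE -> the comment is still just gone
print(forum.comments)             # {1: 'Edited comment'}
```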

## HTTP codes

The response message will return an HTTP response code that determines the result of the operation.

The most frequent codes: `200 OK`, `400 Bad Request`, `404 Not Found`, `500 Internal Server Error`.

Overview:
* **1xx** -- Informational. For example, `102 Processing` (the request is still being processed);
* **2xx** -- Success. Everything is fine, the request worked out and did not break anything. At least for now.
* **3xx** -- Redirection. Redirect to another resource/page.
* **4xx** -- Client error. The request itself is wrong (invalid request data or wrong path).
* **5xx** -- Server error. Something failed on the server (there they divided it by zero, for example).
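
The standard library knows the registered codes, so the class of a code can be derived from its first digit. A small sketch; the `CLASSES` mapping and the `describe` function are names made up here.

```python
from http import HTTPStatus

CLASSES = {1: 'Informational', 2: 'Success', 3: 'Redirection',
           4: 'Client error', 5: 'Server error'}

def describe(code: int) -> str:
    """Return e.g. '404 Not Found (Client error)' for a standard status code."""
    status = HTTPStatus(code)          # raises ValueError for an unknown code
    return f'{code} {status.phrase} ({CLASSES[code // 100]})'

for code in (102, 200, 301, 400, 404, 500):
    print(describe(code))
```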

## Example

Let's look again at the request to example.com/index.html.

Service information from `curl`:

```
*   Trying 2606:2800:220:1:248:1893:25c8:1946...
*   Trying 93.184.216.34...
* Connected to example.com (2606:2800:220:1:248:1893:25c8:1946) port 80 (#0)
```
Client request:
```
> GET /index.html HTTP/1.1    # Starting line: GET method, URI -- /index.html, protocol version -- HTTP/1.1
> Host: example.com           # Headers
> User-Agent: curl/7.47.0
> Accept: */*
>                             # Empty body
```

Server response:
```
< HTTP/1.1 200 OK                          # Starting line: protocol version and the message code
< Cache-Control: max-age=604800            # Headers
< Content-Type: text/html; charset=UTF-8
< Date: Tue, 19 Mar 2019 21:25:21 GMT
< Etag: "1541025663+gzip+ident"
< Expires: Tue, 26 Mar 2019 21:25:21 GMT
< Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
< Server: ECS (bsa/EB1E)
< Vary: Accept-Encoding
< X-Cache: HIT
< Content-Length: 1270
<                                          # Empty line - required by the standard
<!doctype html>                            # Response body -- HTML-document
<html>
<head>
    <title>Example Domain</title>
. . .
```


## REST

 <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">Representational State Transfer</a>

An architectural style: a set of conventions on top of HTTP that describe how to use methods, codes, URIs, etc. consistently.

You can think of it as an analog of PEP 8 for Python: a style guide rather than a protocol.

# Web scraping

* Getting HTML
* Parsing HTML

## Getting HTML

Alternative libraries:

* urllib 
* requests (de-facto standard)
* socket (low-level)



### urllib 

In [4]:
import urllib.request

In [5]:
response = urllib.request.urlopen('http://example.com/')
html = response.read()
print(html)

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    

In [6]:
html = html.decode('utf-8')
print(html)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

Let's save this HTML to a file for later.

In [7]:
with open('example.com.txt', 'w', encoding='utf-8') as f:
    f.write(html)

What else is there besides the HTML?

In [8]:
print(dir(response))

['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 

Everything that is in the HTTP message.

In [15]:
print(response.url)
print(response.msg)
print(response.code)
print('Headers: \n{}'.format(response.headers))
# . . .

http://example.com/
OK
200
Headers: 
Accept-Ranges: bytes
Age: 584316
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Mon, 19 Dec 2022 11:24:52 GMT
Etag: "3147526947+ident"
Expires: Mon, 26 Dec 2022 11:24:52 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (nyb/1D04)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1256
Connection: close
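
The request side can be inspected in the same way before anything is sent. The sketch below builds a `urllib.request.Request` with a custom `User-Agent` header; the client name `my-course-scraper/0.1` is a made-up value, and the actual sending is left commented out so the example runs offline.

```python
import urllib.request

# Build the request object without sending it.
req = urllib.request.Request(
    'http://example.com/',
    headers={'User-Agent': 'my-course-scraper/0.1'},  # hypothetical client name
)
print(req.get_method())               # GET (it becomes POST when data= is passed)
print(req.host)                       # example.com
print(req.get_header('User-agent'))   # my-course-scraper/0.1

# Sending it would look like this:
# with urllib.request.urlopen(req, timeout=10) as response:
#     html = response.read().decode('utf-8')
```

Note that `Request` stores header names capitalized as `User-agent`, which is why that spelling is used for the lookup.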




### requests
HTTP for Humans

In [16]:
import requests

In [18]:
response = requests.get('http://example.com')
response

<Response [200]>

In [19]:
print(dir(response))

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


In [20]:
print(response.url)
print(response.connection)
print(response.headers)
print(response.ok)
print(response.status_code)
print(response.encoding)
print(response.links)

http://example.com/
<requests.adapters.HTTPAdapter object at 0x7f5a6ff6b9a0>
{'Content-Encoding': 'gzip', 'Age': '505265', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Mon, 19 Dec 2022 11:27:01 GMT', 'Etag': '"3147526947+gzip"', 'Expires': 'Mon, 26 Dec 2022 11:27:01 GMT', 'Last-Modified': 'Thu, 17 Oct 2019 07:18:26 GMT', 'Server': 'ECS (nyb/1D0D)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '648'}
True
200
UTF-8
{}


In [21]:
print(response.text)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

### socket (low-level, if assembler is not enough for you)

In [22]:
import socket

In [23]:
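sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # IPv4, TCP
host = socket.gethostbyname('www.example.com')            # DNS lookup: host name -> IP address
port = 80                                                 # default HTTP port

sock.connect((host, port))
# Starting line, one header, and the empty line that ends the request
sock.sendall(b'GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n')

# One recv call returns at most 4096 bytes; a longer response needs a loop
val = sock.recv(4096)
print(val.decode('utf-8'))
#help(sock.recv)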

HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 212205
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Mon, 19 Dec 2022 11:27:15 GMT
Etag: "3147526947+gzip"
Expires: Mon, 26 Dec 2022 11:27:15 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (nyb/1D1B)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1256

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-rad

In [24]:
# Split off the HTTP headers
val = val.split(b'\r\n\r\n',1)[1]
print(val.decode('utf-8'))

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai
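
A single `recv(4096)` happened to be enough here, but in general a response arrives in several chunks. The sketch below reads until the peer closes the connection; `recv_all` is a hypothetical helper, and a local `socket.socketpair()` stands in for the client and the server so the example runs without network access.

```python
import socket

def recv_all(sock, bufsize=512):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)   # returns at most bufsize bytes
        if not chunk:                # b'' means the peer closed the connection
            break
        chunks.append(chunk)
    return b''.join(chunks)

# Offline demonstration: a connected pair of local sockets plays client and server.
server_side, client_side = socket.socketpair()
payload = b'HTTP/1.1 200 OK\r\nContent-Length: 3000\r\n\r\n' + b'x' * 3000
server_side.sendall(payload)
server_side.close()                  # like a server that closes after responding

received = recv_all(client_side)
client_side.close()
print(len(received) == len(payload))  # True
```

With a real HTTP/1.1 server the connection usually stays open, so one would either send a `Connection: close` header or stop reading after `Content-Length` bytes of body.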

## HTML Parsing

Alternative libraries:
* re
* lxml
* BeautifulSoup

In [25]:
with open('example.com.txt', 'r', encoding='utf-8') as f:
    html = f.read()
print(html)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

### re

Let's forget that HTML is a tree, and try to parse it as a plain string.

In [26]:
import re

In [27]:
h1 = re.findall(r'<h1>[\w ]+</h1>', html)
print(h1)
h1 = re.findall(r'<h1>([\w ]+)</h1>', html)
print(h1)

['<h1>Example Domain</h1>']
['Example Domain']


In [28]:
paragraphs = re.findall(r'<p>(.*)</p>', html)
paragraphs

['<a href="https://www.iana.org/domains/example">More information...</a>']

Where is the other paragraph? By default `.` does not match a newline, and the first paragraph spans two lines.


In [29]:
paragraphs = re.findall(r'<p>([\w\W]*)</p>', html) 
paragraphs

['This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href="https://www.iana.org/domains/example">More information...</a>']

Again not what we need: `*` is greedy, so the match runs from the first `<p>` to the last `</p>`.

In [30]:
paragraphs = re.findall(r'<p>([\w\W]*?)</p>', html) # * - greedy; *? - lazy (non-greedy) 
paragraphs

['This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.',
 '<a href="https://www.iana.org/domains/example">More information...</a>']

Tough, huh?
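
The three attempts differ in two independent things: whether `.` may cross a newline, and whether `*` is greedy. The sketch below repeats them on a tiny made-up string (`sample` is not the real page) where the difference is easy to see.

```python
import re

# A small stand-in for the page: the first paragraph spans two lines.
sample = ('<div>\n'
          '<p>First\nparagraph.</p>\n'
          '<p><a href="#">Second</a></p>\n'
          '</div>')

per_line = re.findall(r'<p>(.*)</p>', sample)            # '.' stops at '\n'
greedy = re.findall(r'<p>(.*)</p>', sample, re.DOTALL)   # '.' matches '\n', '*' takes as much as it can
lazy = re.findall(r'<p>(.*?)</p>', sample, re.DOTALL)    # '*?' takes as little as it can

print(per_line)  # ['<a href="#">Second</a>']
print(greedy)    # ['First\nparagraph.</p>\n<p><a href="#">Second</a>']
print(lazy)      # ['First\nparagraph.', '<a href="#">Second</a>']
```

`re.DOTALL` has the same effect as the `[\w\W]` trick used above.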

### lxml

Let's use a library that knows about the structure of XML (and HTML).

In [31]:
from lxml import etree
from io import StringIO, BytesIO

In [32]:
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
tree

<lxml.etree._ElementTree at 0x7f5abe406c00>

In [33]:
print(dir(tree))

['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_setroot', 'docinfo', 'find', 'findall', 'findtext', 'getelementpath', 'getiterator', 'getpath', 'getroot', 'iter', 'iterfind', 'parse', 'parser', 'relaxng', 'write', 'write_c14n', 'xinclude', 'xmlschema', 'xpath', 'xslt']


In [34]:
print(tree.getroot())
print(etree.tostring(tree.getroot(), pretty_print=True, method="html"))

<Element html at 0x7f5abe330880>
b'<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8">\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<

In [35]:
paragraphs = tree.xpath('//p')
for p in paragraphs:
    print(p.text)  

This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
None
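
`p.text` is `None` for the second paragraph because its text sits inside the nested `<a>` tag: `.text` only holds the text that comes before the first child. To collect all text below a node, `itertext()` can be used. The sketch uses the standard library's `xml.etree.ElementTree` on a shortened, well-formed snippet as a stand-in; lxml elements provide `itertext()` as well.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed stand-in for the body of the page.
snippet = ('<div>'
           '<p>This domain is for use in illustrative examples.</p>'
           '<p><a href="https://www.iana.org/domains/example">More information...</a></p>'
           '</div>')
root = ET.fromstring(snippet)

direct = [p.text for p in root.iter('p')]                # text before the first child only
full = [''.join(p.itertext()) for p in root.iter('p')]   # all text below the node

print(direct)
print(full)
```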


In [36]:
for p in paragraphs:
    print(etree.tostring(p, pretty_print=True, method='html'))  

b'<p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    \n'
b'<p><a href="https://www.iana.org/domains/example">More information...</a></p>\n\n'


In [37]:
hrefs = tree.xpath('//a')
for href in hrefs:
    print(href.text)  
    print(href.attrib)  

More information...
{'href': 'https://www.iana.org/domains/example'}


In [38]:
specific_hrefs = tree.xpath('//a[@href="http://www.non-existing-domain.org/"]')
specific_hrefs

[]

### BeautifulSoup

Another option that works well for parsing HTML.

In [39]:
# try:
from bs4 import BeautifulSoup
# except:
#     !pip install bs4 

In [40]:
soup = BeautifulSoup(html, 'lxml')

In [41]:
paragraphs = soup.find_all('p')
paragraphs

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

In [42]:
hrefs = soup.find_all('a')
hrefs

[<a href="https://www.iana.org/domains/example">More information...</a>]

In [43]:
hrefs = soup.find_all('a', href='https://www.iana.org/domains/example')
hrefs

[<a href="https://www.iana.org/domains/example">More information...</a>]

In [44]:
hrefs = soup.find_all('a', href='http://www.other-website.org/domains/example')
hrefs

[]

## Performance

In [45]:
%timeit re.findall(r'<p>.*?</p>', html, re.DOTALL)
%timeit tree.xpath('//p')
%timeit soup.find_all('p')

2.13 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
5.07 µs ± 6.41 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.41 µs ± 15 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


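Moral: regex is fast. Note, though, that `tree` and `soup` were built beforehand, so the timings above cover only the search, not the parsing.

BeautifulSoup is a bit slower than lxml because, for convenience, it converts the document into its own internal representation (the "soup").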

## If there is enough time and you want some really lively practice: scrapy


We will scrape a site with an archive of cryptocurrency rates: https://coinmarketcap.com/.

We want to open the [historical snapshots](https://coinmarketcap.com/historical/) and download the full table for each week.

### Important!

You do not need to understand everything that happens next.

Moral: there are frameworks that take full responsibility for receiving and parsing HTML.

They also know how to parallelize the process, insert random delays (so that the servers do not take you for a robot and block your requests) and conveniently export the collected data (for example, to a database). And much more. <br> All of this is configured in one place: the project's `settings.py` (the `scrapy.cfg` file just points to it).

You just have to indicate which pieces of text you need to grab. And which database to put them into. Well, almost. :)

In [1]:
try:
    import scrapy
except ImportError:
    !pip install scrapy

In [2]:
!rm -rf coinmarketcap

In [3]:
!scrapy startproject coinmarketcap

New Scrapy project 'coinmarketcap', using template directory '/usr/local/lib/python3.8/dist-packages/scrapy/templates/project', created in:
    /mnt/storage/lbpe/edu/iimcb-python/IIMCB-Python-2022/Lecture10/coinmarketcap

You can start your first spider with:
    cd coinmarketcap
    scrapy genspider example example.com


What's inside?

In [4]:
!cd coinmarketcap; ls -R

.:
coinmarketcap  scrapy.cfg

./coinmarketcap:
__init__.py  items.py  middlewares.py  pipelines.py  settings.py  spiders

./coinmarketcap/spiders:
__init__.py


A helper function that writes the source of a notebook cell into a file of the scrapy project:

In [5]:
def dump_to(path):
    with open(path, 'w') as f:
        f.write(_i)  # _i holds the source of the previously executed cell in IPython

### Attention!

The following cells will fail when run in the notebook. <br> We only need their source to be in Input, so that `dump_to` can then write it to the desired file.

### Item: first we’ll determine what we want to collect

In [6]:
# -*- coding:utf8 -*-

import scrapy


class CurrencyItem(scrapy.Item):
    date = scrapy.Field()
    name = scrapy.Field()
    symbol = scrapy.Field()
    market_cap = scrapy.Field()
    price = scrapy.Field()

In [7]:
dump_to('./coinmarketcap/coinmarketcap/items.py')

### Spider: the one who collects Items

In [8]:
# -*- coding:utf8 -*-

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.loader import ItemLoader
from scrapy.selector import Selector
from coinmarketcap.items import CurrencyItem

class CurrencyLoader(ItemLoader):
    pass

class WeeklySpider(CrawlSpider):
    name = 'weekly'
    allowed_domains = ['coinmarketcap.com']
    start_urls = ['https://coinmarketcap.com/historical/']
    only_2019_april_regex = '/201904[0-9]{2}' # full history parsing takes ~4 hrs 

    rules = (
        Rule(LinkExtractor(allow=(only_2019_april_regex, )), callback='parse_weekly_report', follow=False),
    )

    def parse_weekly_report(self, response):
        
        hxs = Selector(response)
        # Query the whole page once per column: an XPath that starts with '//' is
        # evaluated from the document root even when it is called on a table row.
        item_names = hxs.xpath('//td[@class="cmc-table__cell cmc-table__cell--sticky cmc-table__cell--sortable cmc-table__cell--left cmc-table__cell--sort-by__name"]//div//a/text()').extract()
        item_symbols = hxs.xpath('//td[@class="cmc-table__cell cmc-table__cell--sortable cmc-table__cell--left cmc-table__cell--sort-by__symbol"]//div/text()').extract()
        item_caps = hxs.xpath('//td[@class="cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__market-cap"]//div/text()').extract()
        item_prices = hxs.xpath('//td[@class="cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__price"]//div/text()').extract()
        print(response.request.url)

        # The snapshot date is the last path segment of the URL (YYYYMMDD)
        date = response.request.url.split('/')[-2]

        # zip stops at the shortest column, so a page with fewer rows does not raise IndexError
        for name, symbol, cap, price in zip(item_names, item_symbols, item_caps, item_prices):

            item = CurrencyItem()
            item['date'] = date
            item['name'] = name
            item['symbol'] = symbol
            item['market_cap'] = cap
            item['price'] = price

            yield item
          

ModuleNotFoundError: No module named 'coinmarketcap.items'

In [9]:
dump_to('./coinmarketcap/coinmarketcap/spiders/weekly.py')
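
Two small pieces of the spider can be tried without scrapy: the `allow` pattern `/201904[0-9]{2}` that selects which links to follow, and the extraction of the date from the URL. The sketch assumes the pattern is applied to each URL like `re.search`; the URLs in `links` are hypothetical examples of what the archive page might link to.

```python
import re
from datetime import datetime

pattern = re.compile(r'/201904[0-9]{2}')

# Hypothetical links of the kind found on the archive page.
links = [
    'https://coinmarketcap.com/historical/20190407/',
    'https://coinmarketcap.com/historical/20190428/',
    'https://coinmarketcap.com/historical/20180401/',
    'https://coinmarketcap.com/about/',
]

# Only snapshots from April 2019 pass the filter.
followed = [url for url in links if pattern.search(url)]

# The date is the last non-empty path segment, formatted as YYYYMMDD.
dates = [datetime.strptime(url.split('/')[-2], '%Y%m%d').date() for url in followed]
print(dates)   # [datetime.date(2019, 4, 7), datetime.date(2019, 4, 28)]
```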

### Pipeline: for instance, export to a database


In [10]:
# -*- coding: utf-8 -*-

import os, logging
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine, Table, Column, Integer, String, Date, MetaData, ForeignKey
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool
from scrapy.exceptions import DropItem
from scrapy import signals
from coinmarketcap.items import CurrencyItem
import pandas as pd
from collections import Counter

logger = logging.getLogger(__name__)

DeclarativeBase = declarative_base()

class Currency(DeclarativeBase):
    __tablename__ = 'currency'
    __table_args__ = {'sqlite_autoincrement': True}

    id = Column('id', Integer, primary_key=True)
    date = Column('date', Date)
    name = Column('name', String)
    symbol = Column('symbol', String)
    market_cap = Column('market_cap', String)
    price = Column('price', String)

    def __init__(self, item):
        self.date = pd.to_datetime(item['date'], format='%Y%m%d')
        self.name = item['name']
        self.symbol = item['symbol']
        self.market_cap = item['market_cap']
        self.price = item['price']

    def __repr__(self):
        return "<Currency({0}, {1}, {2})>".format(self.id, self.symbol, self.market_cap)


class SqlitePipeline(object):
    def __init__(self, settings):
        self.database = settings.get('DATABASE')
        self.sessions = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls(crawler.settings)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        return pipeline

    def create_engine(self):
        # URL.create is the supported constructor in SQLAlchemy 1.4 and later
        engine = create_engine(URL.create(**self.database), poolclass=NullPool)
        return engine

    def create_tables(self, engine):
        DeclarativeBase.metadata.create_all(engine, checkfirst=True)

    def create_session(self, engine):
        session = sessionmaker(bind=engine)()
        return session

    def spider_opened(self, spider):
        engine = self.create_engine()
        self.create_tables(engine)
        session = self.create_session(engine)
        self.sessions[spider] = session

    def spider_closed(self, spider):
        session = self.sessions.pop(spider)
        session.close()

    def process_item(self, item, spider):
        session = self.sessions[spider]
        currency = Currency(item)
        # Compare against the parsed date: the column stores dates, not the raw 'YYYYMMDD' string
        link_exists = session.query(Currency).filter_by(symbol=item['symbol'], date=currency.date).first() is not None
        
        if link_exists:
            logger.info('Item {} is in db'.format(currency))
            return item

        try:
            session.add(currency)
            session.commit()
            logger.info('Item {} stored in db'.format(currency))
        except Exception:
            logger.error('Failed to add {} to db'.format(currency))
            session.rollback()
            raise  # re-raise with the original traceback

        return item

ModuleNotFoundError: No module named 'coinmarketcap.items'

In [11]:
dump_to('./coinmarketcap/coinmarketcap/pipelines.py')
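
The core of the pipeline is "insert the row unless one with the same symbol and date already exists". Here is the same idea as a minimal sketch with the standard library's `sqlite3` instead of SQLAlchemy; the `store` function and the item values are made up for illustration, and the database lives in memory.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway database for the sketch
conn.execute('''CREATE TABLE currency (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    date TEXT, name TEXT, symbol TEXT,
                    market_cap TEXT, price TEXT)''')

def store(conn, item):
    """Insert item unless a row with the same symbol and date exists.

    Returns True if a row was inserted, False if it was already there.
    """
    exists = conn.execute(
        'SELECT 1 FROM currency WHERE symbol = ? AND date = ?',
        (item['symbol'], item['date']),
    ).fetchone() is not None
    if exists:
        return False
    conn.execute(
        'INSERT INTO currency (date, name, symbol, market_cap, price) '
        'VALUES (:date, :name, :symbol, :market_cap, :price)',
        item,
    )
    conn.commit()
    return True

# Made-up values, for illustration only.
item = {'date': '2019-04-07', 'name': 'Bitcoin', 'symbol': 'BTC',
        'market_cap': '$1', 'price': '$1'}
print(store(conn, item))   # True: the row is new
print(store(conn, item))   # False: the same symbol and date are already stored
```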

### Settings: scrapy settings

In [12]:
# -*- coding: utf-8 -*-

# Scrapy settings for coinmarketcap project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'coinmarketcap'

SPIDER_MODULES = ['coinmarketcap.spiders']
NEWSPIDER_MODULE = 'coinmarketcap.spiders'

DATABASE = {
    'drivername': 'sqlite',
    # 'host': 'localhost',
    # 'port': '5432',
    # 'username': 'YOUR_USERNAME',
    # 'password': 'YOUR_PASSWORD',
    'database': 'weekly.sqlite'
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = BOT_NAME

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'coinmarketcap.middlewares.CoinmarketcapSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'coinmarketcap.middlewares.CoinmarketcapDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'coinmarketcap.pipelines.SqlitePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = False
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

LOG_FILE = 'crawling.log'

In [13]:
dump_to('./coinmarketcap/coinmarketcap/settings.py')
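
The pipeline turns the `DATABASE` dictionary into a SQLAlchemy connection URL. To see roughly what that URL looks like, here is a simplified sketch that formats the same keys by hand. `render_db_url` is a hypothetical helper, not part of SQLAlchemy or Scrapy, and it ignores the escaping of special characters and query options that the real `URL` class handles. The `postgresql` driver name and the `weekly` database name in the second example are assumptions for illustration.

```python
def render_db_url(cfg):
    """Render a DATABASE-style dict as a connection URL string (simplified)."""
    auth = ''
    if cfg.get('username'):
        auth = cfg['username']
        if cfg.get('password'):
            auth += ':' + cfg['password']
        auth += '@'
    host = cfg.get('host', '')
    port = ':' + str(cfg['port']) if cfg.get('port') else ''
    return '{}://{}{}{}/{}'.format(
        cfg['drivername'], auth, host, port, cfg.get('database', '')
    )


# The configuration used in settings.py above
sqlite_url = render_db_url({'drivername': 'sqlite', 'database': 'weekly.sqlite'})

# The commented-out keys, filled in for a server-based database (assumed driver)
pg_url = render_db_url({
    'drivername': 'postgresql',
    'host': 'localhost',
    'port': '5432',
    'username': 'YOUR_USERNAME',
    'password': 'YOUR_PASSWORD',
    'database': 'weekly',
})

print(sqlite_url)
print(pg_url)
```

The three slashes in the SQLite URL are an empty host followed by a relative file path, so the database file is created in the directory from which the crawl is started.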

Let's look at the structure of the Scrapy project one more time:

In [14]:
!cd coinmarketcap; ls -R

.:
coinmarketcap  scrapy.cfg

./coinmarketcap:
__init__.py  items.py  middlewares.py  pipelines.py  settings.py  spiders

./coinmarketcap/spiders:
__init__.py  weekly.py


### That's all! 

Let's run the spider! 

We crawl only the April 2019 data.


In [15]:
%%timeit -n 1 -r 1
!cd coinmarketcap ; scrapy crawl weekly

https://coinmarketcap.com/historical/20190407/
https://coinmarketcap.com/historical/20190428/
https://coinmarketcap.com/historical/20190421/
https://coinmarketcap.com/historical/20190414/
^C
44.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


Let's see what we've got:

In [16]:
import sqlite3
import pandas as pd

connection = sqlite3.connect('./coinmarketcap/weekly.sqlite')

df = pd.read_sql_query("SELECT * FROM currency", connection)
print(df.shape)
df.head()

(348, 6)


|   | id | date | name | symbol | market_cap | price |
|---|---|---|---|---|---|---|
| 0 | 1 | 2019-04-07 | BTC | BTC | $91,674,230,185.93 | $5,198.90 |
| 1 | 2 | 2019-04-07 | Bitcoin | ETH | $18,424,576,820.42 | $174.53 |
| 2 | 3 | 2019-04-07 | ETH | XRP | $15,021,731,304.72 | $0.3599 |
| 3 | 4 | 2019-04-28 | BTC | BTC | $93,391,244,394.89 | $5,285.14 |
| 4 | 5 | 2019-04-07 | Ethereum | BCH | $5,662,007,844.39 | $319.60 |
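
The `market_cap` and `price` columns are stored as text such as `$5,198.90`, so they cannot be sorted or summed numerically as they are. Below is a minimal sketch of a converter, assuming every value follows the `$1,234.56` pattern seen in the output above. `parse_usd` is a hypothetical helper, not part of the project.

```python
from decimal import Decimal


def parse_usd(text):
    """Convert a string like '$5,198.90' to a Decimal."""
    return Decimal(text.strip().lstrip("$").replace(",", ""))


prices = ["$5,198.90", "$174.53", "$0.3599"]  # values taken from the output above
parsed = [parse_usd(p) for p in prices]
print(parsed)
```

With the DataFrame from the previous cell, the same function could be applied to a whole column, for example `df['price'].map(parse_usd)`.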
