# Essential skill for the Internet crawling

## Regular expressions

Regular expressions (aka regex, regexp) are used to search for patterns. Machine-readable languages often have regualar structure (not always), or at least are non-ambiguous.

Obvious way is, of course, to let machine parse the document and then process the result (as in the previous lab). But this often result in additinal depenencies and significant memory and time overhead (which is ok for a single document, but won't work for millions).

### Simple examples

In [4]:
import re
string = "we have only 5 do11ars. This amount of $ is small. How should we sur-vive?"

# all alphanumerics
pattern = r'[A-Za-z\d]+'
print(pattern, end=": ")
print(re.findall(pattern, string))
print()

# all alphanumerics but also with hyphen
pattern = r'\w+'
print(pattern, end=": ")
print(re.findall(pattern, string))
print()

# the same but using explicit character enumeration
# pattern = ...
# print(pattern, end=": ")
# print(re.findall(pattern, string))
# print()

# any symbol
pattern = r'.'
print(pattern, end=": ")
print(re.findall(pattern, string))
print()

# non-spaces, not the same as \w!
pattern = '\S+'
print(pattern, end=": ")
print(re.findall(pattern, string))
print()


# discuss this pattern. Which elements are used here?
pattern = "\W+[a-z]+\-[a-z]+.$"
print(pattern, end=": ")
print(re.findall(pattern, string))

[A-Za-z\d]+: ['we', 'have', 'only', '5', 'do11ars', 'This', 'amount', 'of', 'is', 'small', 'How', 'should', 'we', 'sur', 'vive']

\w+: ['we', 'have', 'only', '5', 'do11ars', 'This', 'amount', 'of', 'is', 'small', 'How', 'should', 'we', 'sur', 'vive']

.: ['w', 'e', ' ', 'h', 'a', 'v', 'e', ' ', 'o', 'n', 'l', 'y', ' ', '5', ' ', 'd', 'o', '1', '1', 'a', 'r', 's', '.', ' ', 'T', 'h', 'i', 's', ' ', 'a', 'm', 'o', 'u', 'n', 't', ' ', 'o', 'f', ' ', '$', ' ', 'i', 's', ' ', 's', 'm', 'a', 'l', 'l', '.', ' ', 'H', 'o', 'w', ' ', 's', 'h', 'o', 'u', 'l', 'd', ' ', 'w', 'e', ' ', 's', 'u', 'r', '-', 'v', 'i', 'v', 'e', '?']

\S+: ['we', 'have', 'only', '5', 'do11ars.', 'This', 'amount', 'of', '$', 'is', 'small.', 'How', 'should', 'we', 'sur-vive?']

\W+[a-z]+\-[a-z]+.$: [' sur-vive?']


### Find URLs/URIs vs parse the doc

Instead of building DOM model and extracting `href` and `src` attributes, you may rely on the structure of the url itself. Extact all URLs from [the page](https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd) with regexp. You major tool is [re.findall(...)](https://docs.python.org/3/library/re.html#). You may also be interested in compiled regular rexpression (if you reuse one).

In [5]:
import re
import requests

url = "https://math.stackexchange.com/questions/"\
        "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(url).text

# my inspiration - 
# I took some example URL regexp from the internet, 
# specifically from here:
# https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
expressions = [
    "(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?",
    "(www|http:|https:)+[^\s]+[\w]",
    "https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)",
    "[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)?",
    "(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})",
    "(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?",
    "https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/=]*)",
]

for expression in expressions:
    print()
    pattern = re.compile(expression)
    urls = pattern.findall(text)
    print(expression)
    print(urls[:10])


(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?
[('', '', 'DOCTYPE', '', '', '', ''), ('', '', 'html', '', '', '', ''), ('', '', 'html', '', '', '', ''), ('', '', 'itemscope', '', '', '', ''), ('', '', 'itemtype', '', '', '', ''), ('https', '//', 'schema.org', '', 'QAPage" class="html__responsive " lang="en">\r\n\r\n    <head>\r\n\r\n        <title>linear algebra - Understanding the singular value decomposition (SVD) - Mathematics Stack Exchange</title>\r\n        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/math/Img/favicon.ico', 'v=92addaa54d18">\r\n        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed">\r\n        <link rel="image_src" href="https://cdn.sstatic.net/Sites/math/Img/apple-touch-icon.png?v=0ae50baa40ed"> \r\n        <link rel="search" type="application/opensearchdescription+xml" title="Mathematics Stack Exchange" href="/opensearch.xml">\r\n    

Was this success? 

Compose your own minimalistic:

In [6]:
import re
import requests

# url = "https://math.stackexchange.com/questions/"\
#         "411486/understanding-the-singular-value-decomposition-svd"

# text = requests.get(url).text

# protocol = "https?://"
# domain = ...
# path = "[/\w\-\.]*"
# args =  "...
# hashtail = "(?:#[\w$%-_;]+)?"

# expression = protocol + domain + path + args + hashtail
# pattern = re.compile(expression)
# regexp_urls = pattern.findall(text)
# print(regexp_urls[:20])

# Streams and files

When you deal with the big files you should take care about the RAM. Today 1GB won't suprise anyone on the desktop, but server machines, which implement crawlers, may be optimized for the resource.

Using streams instead of RAM-cached files is a good strategy.

- Look for solution here: https://stackoverflow.com/a/16696317
- Look for the sample big file here: http://xcal1.vodafone.co.uk/
- Read about python memory measurement here: https://pythonspeed.com/articles/measuring-memory-python/

In [7]:
import psutil, gc 

def get_mem():
    return psutil.Process().memory_info().rss

In [8]:
large_file_url = "http://212.183.159.230/100MB.zip"

First, download the file as you would do it simple way:

In [9]:
gc.collect()
print("Resident set size:", get_mem())
data = requests.get(large_file_url).content
print("Resident set size:", get_mem())

with open('100-RAM', 'wb') as f:
    f.write(data)

print("Resident set size:", get_mem())
data = None
gc.collect()
print("Resident set size:", get_mem())

Resident set size: 86159360
Resident set size: 191754240
Resident set size: 191754240
Resident set size: 86892544


And then use the streaming mode of the `requests` library.

In [10]:
import gc
import requests
import shutil
import os

def download_file(url, destination, chunk_size=1024):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        
        if not os.path.exists(destination):
            file_name = os.path.join(os.getcwd(), destination)
        else:
            file_name = destination 
        
        with open(file_name, 'wb') as f:
            for c in r.iter_content(chunk_size=chunk_size):
                f.write(c)
        return file_name


gc.collect()
print("Resident set size:", get_mem())
download_file(large_file_url, "100-stream")
print("Resident set size:", get_mem())

Resident set size: 86056960
Resident set size: 85995520


# BeautifulSoup

Plain text HTML is a mixture of content, markup, and code. Extracting structure, or URLs, or plain text might be tricky with regular expressions. 

Building a DOM model is slow, but may save a lot of code and keep you from mistakes.

## Extract all sentences
For indexing and semantic analysis we use different granularity. Often sentence is a good choice. 

In [13]:
from bs4 import BeautifulSoup
from bs4.element import Comment
from nltk import tokenize

doc_url = "https://math.stackexchange.com/questions/"\
        "411486/understanding-the-singular-value-decomposition-svd"

text = requests.get(doc_url).text
dom = BeautifulSoup(text)
paragraphs = [p.strip() for p in dom.text.split('\n') if p.strip()]

sents = []
for p in paragraphs:
    pass 
    
print(sents[90:100])

[]


# Extract URLs from nodes

Be careful with relative links. How would you process them?

In [17]:
import urllib.parse

all_hrefs = dom.find_all('a', href=True)
all_urls = set()

for a in all_hrefs:
    print(a['href'])
    break
    # url = ...
    # all_urls.add(url)

all_urls = list(all_urls)
all_urls[:10]

#


[]

Discuss the next result:

In [None]:
print("|DOM ∩ REGX| =", len(set(all_urls) & set(regexp_urls)))
print("|DOM \ REGX| =", len(set(all_urls) - set(regexp_urls)))
print("|REGX \ DOM| =", len(set(regexp_urls) - set(all_urls)))

# Unique file name

Please, never try to convert a domain (`google.com`), or a path component (`/index.php`) into a filename. They are not unique!

Also, better not to try to substitute sensitive symbols of the full URL (`/:`...) -- you will definitely forget one. Also, you may easily overflow file name.

Nice way is to use hash strings with fixed length and character set. Compute hash strings from the previous list.

In [None]:
import hashlib

for url in all_urls[:20]:
    s = ...
    print(s, url[:15] + "..." + url[-15:])

580c896703b639c47da3cf30be6a78fb5a831302 https://mathema...-less-variables
42af2a47767f85259c15acc787b9e012eea6546a https://space.s...weight-estimate
0440787fda1ce3524156f5f2e8125236f56c7938 https://math.st...f-rank-1-matrix
537f99069a4a9ea7fade70e8965ce649761c35d3 https://stackov...o/company/press
7213bb1c3dc14ce0556ab4ae331d16e3d19fb7f1 https://math.st...x-decomposition
4814f9498243604b1389d9b29748f0409259871c https://try.sta...utm_content=cta
07ebb6201d0237b1221669f4fe8bc3a1059801cf https://linkedi.../stack-overflow
94cca9af043d0cd0180df9a5ccd6994894b5a32e https://www.ins...hestackoverflow
79b30ae1372efe609c895a9486889e7a21f3ecf1 https://math.st...he-principal-co
52268fea4169df93b9c8f0eb65acb2e8fb80ce26 https://math.st...change.com/tags
bc68302b130f462c04f1f42ba3595f153393e835 https://math.st...11486/revisions
1b693522de70d4d61b2df723b69f98703983b31a https://mathove...absolute-values
ae60f62b3195c9be30f748ceb6bba04c62b0cc83 https://math.st...ions/tagged/svd
a9b97ece3afe2c0c36f997f52