# The Basics of Crawling Sites and Scraping Web Pages

The two most fundamental tools for anyone looking to crawl websites or scrape them for data are:
- [Requests](http://docs.python-requests.org/en/master/) for making HTTP requests
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for turning the responses into something usable

In [1]:
import requests
from bs4 import BeautifulSoup

For a quick look at how these two packages work, let's start with the home page of my website on GitHub.

In [2]:
url = "https://agbs2k8.github.io/"

I'm also going to send a custom request header with my GET request for the page.  I found it on someone else's page years ago and have used it ever since.  I'm no longer sure where I got it, but thank you to whoever originally posted it!

This has helped me avoid getting caught by some of the anti-scraping and anti-crawling measures I've encountered over the years.
___
## *Legal Disclaimer!*

Please be aware of any applicable laws wherever you are and wherever a site is hosted.  Be sure to follow the site's ```robots.txt``` and, more generally, DON'T BE EVIL!  Remember, you can send a huge number of HTTP requests very quickly with tools like this, and you could noticeably degrade someone's site.  It is worth reading some other references before crawling someone else's website, e.g. [Wikipedia](https://en.wikipedia.org/wiki/Web_crawler).
___
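As a rough illustration of what following ```robots.txt``` can look like in code, here is a minimal sketch using the standard library's ```urllib.robotparser```.  The robots.txt content, the ```example.com``` URLs and the crawler name ```example-crawler``` are all made up for the example, and the rules are parsed from a string so it runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-crawler"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether our (hypothetical) crawler may fetch each URL.
allowed_home = parser.can_fetch(USER_AGENT, "https://example.com/index.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")

# Seconds the site asks crawlers to wait between requests (None if not specified).
delay = parser.crawl_delay(USER_AGENT)

print(allowed_home, allowed_private, delay)
```

Against a live site you would instead call ```parser.set_url(...)``` with the site's robots.txt address and then ```parser.read()``` before asking ```can_fetch()```.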

In [3]:
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

Now that's out of the way, let's make our HTTP request and store what we get back as ```r```.  If you are not familiar with how the internet works (DNS, etc.), you might want to do some research.  Thankfully, the Requests package does all of the heavy lifting for us!

In [4]:
r = requests.get(url, headers = hdr)

Did it work?  You can check the status code and see what response you got.  If you are not familiar with HTTP response codes, you might look at [Wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

In [5]:
print(r.status_code)

200
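If you want a quick reminder of what a numeric code means without leaving Python, the standard library's ```http.HTTPStatus``` enum carries the standard reason phrases.  This is only a small sketch; the helper name ```describe_status``` is made up for the example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return 'code phrase' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not one of the codes the standard library knows about
        return f"{code} (unrecognised status code)"
    return f"{status.value} {status.phrase}"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

On the Requests side, ```r.ok``` is a quick boolean check and ```r.raise_for_status()``` raises an exception for 4xx and 5xx responses, which can be handy inside a crawler loop.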


So, a 200 response is "OK" and it means we should have gotten the page back from the server.  Is it something we can look at and have it make sense?  I like to check the encoding to see if it is something usable:

In [6]:
print(r.encoding)

utf-8


UTF-8 is a Unicode encoding, perfect! We know how to deal with that.
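To see why the encoding matters, here is a small sketch that decodes the same bytes two ways.  The bytes are the copyright notice that shows up in this page's raw HTML; decoding them with the wrong codec does not raise an error, it just quietly produces the wrong characters.

```python
# Raw bytes as they arrive over the wire (the copyright notice from the page source)
raw = b"\xc2\xa92018 AJ Wilson"

as_utf8 = raw.decode("utf-8")         # the encoding the server reported
as_latin1 = raw.decode("iso-8859-1")  # a wrong guess, shown for comparison

print(as_utf8)    # ©2018 AJ Wilson
print(as_latin1)  # Â©2018 AJ Wilson
```

In Requests, ```r.content``` holds the raw bytes while ```r.text``` is the same data decoded using ```r.encoding```, so a wrong ```r.encoding``` leads to garbled ```r.text``` in exactly this way.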

While I'm an ethical web-scraper, I do want to know how they might be tracking me.  Are they using cookies?

In [7]:
print([x for x in r.cookies])
print(type(r.cookies))

[]
<class 'requests.cookies.RequestsCookieJar'>


The cookie jar is empty (and that's an awesome name to use).

So what is in the response anyway?

In [8]:
print(r.content[:1000])

b'<!DOCTYPE HTML>\r\n<!--\r\n\tTEMPLATE: Story by HTML5 UP | html5up.net | @ajlkn | Free for personal and commercial use under the CCA 3.0 license (html5up.net/license)\r\n\tSITE CONTENT: \xc2\xa92018 AJ Wilson - All Rights Reserved\r\n-->\r\n<html>\r\n\t<head>\r\n\t\t<!-- Google Tag Manager -->\r\n\t\t<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\r\n\t\tnew Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\r\n\t\tj=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\r\n\t\t\'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\r\n\t  })(window,document,\'script\',\'dataLayer\',\'GTM-NSXB5GC\');</script>\r\n  \t<!-- End Google Tag Manager -->\r\n\t\t<title>AJ Wilson</title>\r\n\t\t<meta charset="utf-8" />\r\n\t\t<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />\r\n\t\t<link rel="stylesheet" href="assets/css/main.css" />\r\n\t\t<noscript><link rel="styl

Yikes, that doesn't look friendly.  This is where BeautifulSoup comes in.  It will parse these raw bytes into a BeautifulSoup object that we can work with more easily.

In [9]:
soup = BeautifulSoup(r.content,'html.parser')
type(soup)

bs4.BeautifulSoup

So what does our response look like now? I'm going to use ```find_all()```, a method we'll get to in a second, to limit how much of the page we print.  I'll only pull the ```<head>``` tag because we don't need to see the entire page right now!

In [10]:
print(soup.find_all('head')[0])

<head>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
		new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
		j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
		'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
	  })(window,document,'script','dataLayer','GTM-NSXB5GC');</script>
<!-- End Google Tag Manager -->
<title>AJ Wilson</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, user-scalable=no" name="viewport"/>
<link href="assets/css/main.css" rel="stylesheet"/>
<noscript><link href="assets/css/noscript.css" rel="stylesheet"/></noscript>
</head>


That is much better.  It now looks like what we would expect to see if we inspected the page with our browser's developer tools.  Now we have something that is ready to work with to find specific parts within the page.  

BeautifulSoup comes with a fantastic set of tools for searching through HTML tags and finding specific pieces of a page.  I'm going to use the BeautifulSoup ```find_all()``` method to grab all ```<a>``` (anchor) tags, and I'll add the parameter ```href=True``` to select only anchor tags that have an href defined.  I could pass a regex rather than ```True``` to match specific href values.  I'll pack all of this into a list comprehension (because they're awesome) and filter out any link that doesn't start with ```http``` (to remove my in-page links).  Then I'll put it into a ```set()``` to drop any duplicates, and here we are:

In [11]:
links = set([x['href'] for x in soup.find_all("a", href=True) if x['href'][:4]=='http'])
for x in links:
    print(x)

https://anaconda.org/anaconda/python
https://plot.ly
https://www.javascript.com/
https://en.wikipedia.org/wiki/Transact-SQL
https://www.r-project.org/
https://www.tensorflow.org
http://www.numpy.org
http://ggplot.yhathq.com/
https://keras.io
https://developer.salesforce.com/page/Apex
https://d3js.org/
http://hadoop.apache.org/
https://en.wikipedia.org/wiki/Java_(programming_language)
http://scikit-learn.org
https://www.mongodb.com/
https://www.raspberrypi.org/
https://html5up.net
https://github.com/agbs2k8/ML_Example_Notebooks/blob/master/Linear_Regression.ipynb
https://seaborn.pydata.org
https://www.python.org/
https://pandas.pydata.org/
https://www.sqlite.org
https://neo4j.com/
https://matplotlib.org
https://www.postgresql.org/
http://docs.python-requests.org
https://www.scala-lang.org/


That was easy! If I'm crawling a site, I now have a list of URLs to add to my list of pages to investigate.  
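One side effect of ```set()``` is that the links come back in arbitrary order.  If you would rather keep them in the order they appear on the page, a small sketch of an alternative is to de-duplicate through dictionary keys, which are unique and keep insertion order (Python 3.7+).  The ```hrefs``` list below is a hand-made stand-in for the values pulled out of the page, and ```#top``` is a hypothetical in-page link.

```python
# Stand-in for the href values found on a page, in page order, with repeats
hrefs = [
    "https://www.python.org/",
    "https://pandas.pydata.org/",
    "https://www.python.org/",
    "#top",  # hypothetical in-page link, filtered out below
    "https://keras.io",
    "https://pandas.pydata.org/",
]

# Keep only http(s) links, then drop duplicates while preserving first-seen order
unique_links = list(
    dict.fromkeys(h for h in hrefs if h.startswith(("http://", "https://")))
)

print(unique_links)
```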

This obviously isn't a perfect way to crawl a web page, however.  It only shows me the links that the developers deliberately placed in anchor tags.  What about links to the assets the page uses (JS, CSS, images)?  Those won't be found in a tag with the structure ```<a href='...'>```, and that was all I looked for.

What other tags had a hypertext reference (href) in them? We can call the ```find_all()``` method on our BeautifulSoup object named ```soup``` without naming a tag.  I'll drop anything that starts with ```#``` so that I drop the in-page links.

In [12]:
links = list(set([x['href'] for x in soup.find_all(href=True) if not str(x['href']).startswith('#')]))
for x in links:
    print(x)

assets/css/noscript.css
https://anaconda.org/anaconda/python
https://plot.ly
images/gallery/fulls/04.jpg
https://www.javascript.com/
scikit-learn.org
https://en.wikipedia.org/wiki/Transact-SQL
images/gallery/fulls/02.jpg
https://www.r-project.org/
https://www.tensorflow.org
http://www.numpy.org
https://keras.io
http://ggplot.yhathq.com/
https://developer.salesforce.com/page/Apex
https://d3js.org/
http://hadoop.apache.org/
https://en.wikipedia.org/wiki/Java_(programming_language)
https://www.mongodb.com/
assets/css/main.css
http://docs.python-requests.org
https://www.raspberrypi.org/
mailto:agbs2k8@yahoo.com
consulting.html
https://html5up.net
https://github.com/agbs2k8/ML_Example_Notebooks/blob/master/Linear_Regression.ipynb
https://seaborn.pydata.org
www.linkedin.com/in/agbs2k8
https://www.python.org/
https://pandas.pydata.org/
https://www.sqlite.org
images/gallery/fulls/03.jpg
https://neo4j.com/
https://matplotlib.org
https://www.postgresql.org/
http://scikit-learn.org
https://www.sc

As you can see, we picked up a few more links.  The new ones are links to other pages within my website, a ```mailto``` to contact me from the site, links to the images and style-sheets that are part of the page when it renders in a browser, and a couple of links written without a scheme (```scikit-learn.org``` and ```www.linkedin.com/in/agbs2k8```).  This helps make my crawler much better, since now it can start identifying intra-site links and assets to build out a more complete crawl.

If I'm going to start crawling a web site, I'll need a way to parse and classify each of these URLs as I encounter them.  I believe it is important to know whether a link points to another HTML page on the same site, to an asset, or to an outside site.  We could spend hours (or perhaps even days) building a way to parse our URLs, but why reinvent the wheel when someone else has undoubtedly done it for us?

Meet [urllib](https://docs.python.org/3.6/library/urllib.html), part of the Python Standard Library, which has an excellent parsing function for URLs.  

In [13]:
from urllib.parse import urlparse
# links was built from a set, so its order (and therefore links[3]) may differ between runs
print(urlparse(links[3]))

ParseResult(scheme='', netloc='', path='images/gallery/fulls/04.jpg', params='', query='', fragment='')


That was easy! One simple call gives us most of what we are likely to need.  This can help our crawler understand what information is in the link itself before ever sending an HTTP request.
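Building on that, here is a rough sketch of how the parsed pieces could be used to sort a link into "page", "asset", "external" or "other".  The function name ```classify_link``` and the list of asset extensions are assumptions made for the example, not part of urllib, and a real crawler would likely need a longer list.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Assumed set of asset extensions; adjust for the site being crawled.
ASSET_EXTENSIONS = {".css", ".js", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".ico", ".pdf"}


def classify_link(link: str, page_url: str) -> str:
    """Classify a link found on page_url as 'page', 'asset', 'external' or 'other'."""
    # Resolve relative links against the page they were found on
    absolute = urlparse(urljoin(page_url, link))
    if absolute.scheme not in ("http", "https"):
        return "other"  # mailto:, tel:, javascript:, ...
    if absolute.netloc.lower() != urlparse(page_url).netloc.lower():
        return "external"
    extension = posixpath.splitext(absolute.path)[1].lower()
    return "asset" if extension in ASSET_EXTENSIONS else "page"


page = "https://agbs2k8.github.io/"
for link in ["images/gallery/fulls/04.jpg", "consulting.html",
             "https://keras.io", "mailto:agbs2k8@yahoo.com"]:
    print(f"{classify_link(link, page):8} {link}")
```

Note that a scheme-less link such as ```www.linkedin.com/in/agbs2k8``` is resolved as a relative path on the current site, which is also how a browser would treat it, so this sketch would label it "page".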

# A super-simple Web Crawler

So now that we have seen how to use Requests & BeautifulSoup to make and parse the results of our GET requests, and use urllib to parse the links we find, we can put together a very simple function to crawl a website.  I've tried to clearly document everything going on in the function so it is easy to follow. 
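Before the real thing, it may help to see the traversal on its own.  The sketch below walks a tiny made-up site held in a dictionary, so there are no HTTP requests involved.  It is not the function that follows, just the same first-in-first-out idea, here using a ```collections.deque``` for the queue and a ```set``` to remember which links have already been queued.  ```TOY_SITE``` and ```crawl_order``` are names invented for the example.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
TOY_SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/contact"],
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": ["/blog"],
    "/contact": [],
}


def crawl_order(start: str, page_limit: int = 100) -> list:
    """Return pages in the order a breadth-first crawl would visit them."""
    queue = deque([start])  # pages waiting to be visited (FIFO)
    seen = {start}          # everything ever queued, so nothing is queued twice
    visited = []
    while queue and len(visited) < page_limit:
        page = queue.popleft()
        visited.append(page)
        for link in TOY_SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl_order("/"))
print(crawl_order("/", page_limit=3))
```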

In [14]:
from urllib.parse import urljoin


def crawl_site(start_from:str, multi_domain=False, page_limit=100) -> dict:
    '''
    A function to crawl websites, traversing any href in each page
    :param start_from: string URI to begin crawling from
    :param multi_domain: False (default) means stay in original domain, True = crawl any encountered domain
    :param page_limit: integer max number of web pages to crawl
    :return: a dictionary of pages crawled and their attributes
    TODO: add depth limit - an integer folder depth to crawl through. will not crawl any folders deeper than this value
    '''
    # I'm going to pull in and use the same header as I introduced before
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
    # make sure the passed parameter is all lower case
    start_from = str(start_from).lower()
    # let's create a list of links I'm going to work from, starting with the original passed URI
    links_to_crawl = [start_from]
    # Check the domain we were given so we can limit our search to that domain
    start_domain = urlparse(start_from).netloc
    # Create a dictionary to store all of the information we pull as we crawl
    results = {}
    
    # while loop because we don't know how many iterations we will ultimately do 
    # Limits are 1. if we run out of links, and 2. if we hit our page limit
    while len(links_to_crawl) > 0 and len(results) < page_limit:
        # FIFO - pull links from the list of those I'm working on
        current_url = links_to_crawl.pop(0)
        # Create a temporary dictionary of results for this URI:
        current_result = {}
        # use urllib.parse.urlparse() to parse the parts of the latest URL
        parsed_url = urlparse(current_url)
        # Add the information out of the parsed URI to our results for this link
        current_result.update({'scheme': parsed_url.scheme, 'domain': parsed_url.netloc, 'path': parsed_url.path})
        
        # request the page that we are looking at
        current_request = requests.get(current_url, headers = hdr)
        soup = BeautifulSoup(current_request.content,'html.parser')
        
        # look at how many links were found on the page
        links_found = [str(x['href']).lower() for x in soup.find_all(href=True) if not str(x['href']).startswith('#') ]
        
        # Add the data from that request to our site-result
        current_result.update({'status': current_request.status_code, 
                               'encoding': current_request.encoding,
                               'links_found':len(set(links_found))
                              })
        # Add the site-result to the master results dictionary    
        results[current_url] = current_result
        
        # If it is a URL outside of the domain I'm working with, and we're only doing 1x domain do nothing else
        if parsed_url.netloc != start_domain and not multi_domain:
            continue
        # otherwise, parse the page for additional links
        else:
            while len(links_found) > 0:
                link = links_found.pop()
                # standardize the link
                new_link_parsed = urlparse(link)
                # look for site-internal links without a netloc
                
                if new_link_parsed.netloc == '':
                    link = urljoin(start_from, link) # use our starting link as the basis for it
                
                if new_link_parsed.scheme in (['http','https']) and link not in links_to_crawl and link not in results.keys():
                    links_to_crawl.append(link)
        
    return results
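One thing to watch in ```crawl_site``` as written: the scheme check runs on ```new_link_parsed```, which was computed before ```urljoin```, so a relative link such as ```consulting.html``` still has an empty scheme at that point and is never queued.  Relative links are also joined against ```start_from``` rather than the page they were found on, and lowercasing whole URLs can break links on servers with case-sensitive paths.  Below is a minimal sketch of one way around this, assuming a hypothetical helper named ```normalize_link``` that resolves the link first and checks the scheme afterwards.  It also strips any ```#fragment``` so the same page is not queued twice.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(link: str, page_url: str):
    """Return an absolute http(s) URL for link, or None if it should be skipped."""
    # Resolve against the page the link was found on, then drop any #fragment
    absolute, _fragment = urldefrag(urljoin(page_url, link.strip()))
    # Check the scheme of the resolved URL, not of the original (possibly relative) link
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


page = "https://agbs2k8.github.io/"
print(normalize_link("consulting.html", page))
print(normalize_link("https://keras.io", page))
print(normalize_link("mailto:agbs2k8@yahoo.com", page))
# "#contact" is a made-up fragment, used only to show that it gets stripped
print(normalize_link("consulting.html#contact", page))
```

Inside the crawler, the inner loop could call a helper like this with ```current_url``` as the second argument and only queue the result when it is not ```None```.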

Let's go ahead and run the crawler and see what we got back. 

In [15]:
crawl_results = crawl_site('https://agbs2k8.github.io', multi_domain=False, page_limit=50)
print(len(crawl_results))

28


We'll now also import the wonderful [Pandas](https://pandas.pydata.org/) package here so we can put the results into a nice table. 

In [16]:
import pandas as pd
pd.DataFrame.from_dict(crawl_results, orient='index').reset_index().head()

|   | index | scheme | domain | path | status | encoding | links_found |
|---|-------|--------|--------|------|--------|----------|-------------|
| 0 | http://docs.python-requests.org | http | docs.python-requests.org |  | 200 | ISO-8859-1 | 175 |
| 1 | http://ggplot.yhathq.com/ | http | ggplot.yhathq.com | / | 200 | ISO-8859-1 | 12 |
| 2 | http://hadoop.apache.org/ | http | hadoop.apache.org | / | 200 | ISO-8859-1 | 112 |
| 3 | http://scikit-learn.org | http | scikit-learn.org |  | 200 | utf-8 | 9 |
| 4 | http://www.numpy.org | http | www.numpy.org |  | 200 | utf-8 | 23 |


As you can see, this came together quickly.  Python has some fantastic packages available for web requests and handling HTML, which keeps the code short.

I plan to touch on this again in the future, where I'll look at ways to make the crawler faster.