# Libraries and Tools

Below is an overview of the libraries and other tools used for web scraping.

# [Urllib](https://docs.python.org/3/library/urllib.html)

Urllib is part of the Python standard library. It has several modules used for handling URLs (Uniform Resource Locators). 

* [`urllib.request`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request) defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.
* [`urllib.error`](https://docs.python.org/3/library/urllib.error.html#module-urllib.error) defines the exception classes raised by `urllib.request`.
* [`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse) defines a standard interface to break URL strings into components (addressing scheme, network location, path, etc.) and to build URLs from the components.
* [`urllib.robotparser`](https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser) provides a `RobotFileParser` class that answers questions about whether or not a particular user agent can fetch a URL, based on the site's [`robots.txt`](https://www.robotstxt.org/) file.
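
As a rough illustration of the modules above, the sketch below splits a URL, rebuilds it with a different query string, and checks a URL against some `robots.txt` rules. The rules are a made-up example parsed from memory, and `my-scraper` is a hypothetical user agent name; against a real site you would call `set_url()` with the site's `robots.txt` address followed by `read()`.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Break a URL into its components.
parts = urlsplit("http://quotes.toscrape.com/tag/life/?page=2")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Build a new URL from the same components with a freshly encoded query.
query = urlencode({"page": 3})
next_url = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
print(next_url)

# Parse example robots.txt rules from a list of lines (no network access).
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
blocked = robots.can_fetch("my-scraper", "http://example.com/private/page")
allowed = robots.can_fetch("my-scraper", "http://example.com/public/page")
print(blocked, allowed)
```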

# [Requests](https://requests.readthedocs.io/en/master/)

Requests is a library that makes it easier to deal with HTTP requests. It is easy to get a web page with the `requests.get(url)` function. Requests can also generate POST requests to submit form data, handle sessions and cookies, handle SSL verification, and more.
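
A minimal sketch of these features follows. The helper names `fetch_page` and `fetch_with_session` are hypothetical, and the form fields are placeholders (the sandbox login form listed at the end of this page also expects a CSRF token). The POST request is only prepared, not sent, so that the encoded form body can be inspected without touching the network.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP error status codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_with_session(urls, timeout=10):
    """Fetch several pages over one Session so cookies persist between them."""
    with requests.Session() as session:
        return [session.get(url, timeout=timeout).text for url in urls]


# Prepare (but do not send) a POST request to see how form data is encoded.
login = requests.Request(
    "POST",
    "http://quotes.toscrape.com/login",
    data={"username": "user", "password": "pass"},
).prepare()
print(login.method, login.url)
print(login.headers["Content-Type"])
print(login.body)
```

Calling `fetch_page("http://quotes.toscrape.com/")` would return the HTML of the sandbox front page. Passing an explicit `timeout` is a deliberate choice here, since Requests does not time out by default.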

[Requests Toolbelt](https://toolbelt.readthedocs.io/en/latest/) is a library built to extend the Requests library. It is developed by the Requests core developers and contains tools that you might need but that don't fit into the Requests library itself.

# [LXML](https://lxml.de/)

LXML is a fast library for parsing XML documents (including HTML). It's built on top of C libraries [libxml2](http://xmlsoft.org/) and [libxslt](http://xmlsoft.org/XSLT/).
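
As an illustration, the sketch below parses a small HTML string with `lxml.html` and pulls values out with XPath. The markup is invented for the example and only loosely mimics a quotes page.

```python
from lxml import html

# A small, made-up document standing in for a downloaded page.
page = html.fromstring("""
<html><body>
  <div class="quote"><span class="text">First quote</span><small class="author">Ann</small></div>
  <div class="quote"><span class="text">Second quote</span><small class="author">Bob</small></div>
</body></html>
""")

# XPath expressions ending in text() return plain strings.
texts = page.xpath('//div[@class="quote"]/span[@class="text"]/text()')
authors = page.xpath('//small[@class="author"]/text()')
print(list(zip(texts, authors)))
```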

# [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)

BeautifulSoup is a parsing library that can use different parsers (such as LXML). It is easy to learn and allows you to quickly find specific elements on a page. BeautifulSoup can infer structure and fix broken HTML and XML tags. It only parses static content, so it cannot handle websites whose content is generated by JavaScript. However, it is often easy to inspect such a page and find the API calls it makes; those responses can then be fetched with Requests and digested with other libraries like [json](https://docs.python.org/3/library/json.html).
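
The sketch below shows both ideas on invented data: BeautifulSoup reading markup whose last tags are never closed, and `json` decoding a string shaped like an API response. The HTML, the JSON payload, and its field names are assumptions made for the example, and Python's built-in `html.parser` is used so that no extra parser is required.

```python
import json

from bs4 import BeautifulSoup

# Made-up markup; the final <small> and <div> are deliberately left unclosed.
html_doc = """
<div class="quote"><span class="text">First quote</span> by <small class="author">Ann</small></div>
<div class="quote"><span class="text">Second quote</span> by <small class="author">Bob
"""

soup = BeautifulSoup(html_doc, "html.parser")

quotes = []
for quote in soup.select("div.quote"):
    text = quote.find("span", class_="text").get_text(strip=True)
    author = quote.find("small", class_="author").get_text(strip=True)
    quotes.append((text, author))
print(quotes)

# A hypothetical API response body, as it might be returned by Requests.
api_body = '{"quotes": [{"text": "First quote", "author": {"name": "Ann"}}]}'
data = json.loads(api_body)
print(data["quotes"][0]["author"]["name"])
```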

# [Selenium](https://selenium.dev/selenium/docs/api/py/)

Selenium is a web driver designed for rendering and testing webpages. It allows you to programmatically interact with web pages, which means you are also able to extract content from them. The advantage of Selenium is that it executes JavaScript and renders all the content on the page. However, this slows down the scraping process and might not be suitable for large-scale projects.

To run Selenium, you need to have a web browser installed. Additionally, you must download the appropriate web driver and place it on your system `PATH` so that Selenium can control the browser. The web drivers for common browsers can be found at the links below.

* [Chrome](https://chromedriver.chromium.org/downloads)
* [Firefox](https://github.com/mozilla/geckodriver/releases)
* [Opera](https://github.com/operasoftware/operachromiumdriver/releases)
* [Internet Explorer](https://github.com/SeleniumHQ/selenium/wiki/InternetExplorerDriver)
* [Edge](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)
* [Safari](https://developer.apple.com/documentation/webkit/testing_with_webdriver_in_safari)


# [Scrapy](https://docs.scrapy.org/en/latest/)

Scrapy is a web scraping framework designed for building spiders that crawl and process webpages. It covers in one place the jobs that Requests (fetching pages) and BeautifulSoup (parsing them) do separately, which allows you to quickly build scraping pipelines. Scrapy is built on top of [Twisted](https://twistedmatrix.com/trac/), which allows asynchronous requests to speed up the process.

Like BeautifulSoup, Scrapy does not inherently render JavaScript. However, the creators of Scrapy have developed a lightweight headless browser called Splash, designed specifically for web scraping.

# Useful Sites and References    

## [Web Scraping Sandbox](http://toscrape.com/)
* [Fake Bookstore](http://books.toscrape.com/)
* Real Quotes. There are several different configurations of this page to practice different scraping techniques.
    * [Default: Microdata and pagination](http://quotes.toscrape.com/)
    * [Scroll: Infinite scrolling pagination](http://quotes.toscrape.com/scroll)
    * [JavaScript: JavaScript generated content](http://quotes.toscrape.com/js)
    * [Tableful: A table based messed-up layout](http://quotes.toscrape.com/tableful)
    * [Login: Login with CSRF token (any user/passwd works)](http://quotes.toscrape.com/login)
    * [ViewState: An AJAX based filter form with ViewStates](http://quotes.toscrape.com/search.aspx)
    * [Random: A single random quote](http://quotes.toscrape.com/random)