# Libraries and Tools

Below is an overview of the libraries and other tools used for web scraping.

# [Urllib](https://docs.python.org/3/library/urllib.html)

Urllib is part of the Python standard library. It has several modules used for handling URLs (Uniform Resource Locators). 

* [`urllib.request`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request) defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.
* [`urllib.error`](https://docs.python.org/3/library/urllib.error.html#module-urllib.error) defines the exception classes for exceptions raised by urllib.request.
* [`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse) defines a standard interface to break URL strings into components (addressing scheme, network location, path, etc.) and to build URLs from the components.
* [`urllib.robotparser`](https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser) uses a `RobotFileParser` class to answer questions about whether or not a particular user agent can fetch a URL based on the [`robots.txt`](https://www.robotstxt.org/) file.
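
As a minimal sketch of how these modules fit together, the example below splits and joins URLs with `urllib.parse` and checks them against robots.txt rules with `urllib.robotparser`. The rules and the `my-scraper` user agent are made up for illustration and are passed to `parse()` so that nothing is fetched; against a real site you would typically call `set_url()` with the location of its `robots.txt` and then `read()`.

```python
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, supplied inline so the example runs offline.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

url = "http://books.toscrape.com/catalogue/page-2.html"

# Break the URL into its components.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)

# Resolve a relative link against the page it was found on.
next_url = urljoin(url, "page-3.html")

# Ask whether the (hypothetical) user agent may fetch each URL.
allowed = parser.can_fetch("my-scraper", next_url)
blocked = parser.can_fetch(
    "my-scraper", "http://books.toscrape.com/private/data.html"
)
print(next_url, allowed, blocked)
```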

# [Requests](https://requests.readthedocs.io/en/master/)

Requests is a library that makes it easier to deal with HTTP requests. It is easy to get a web page with the `requests.get(url)` function. Requests can also generate POST requests to submit form data, handle sessions and cookies, handle SSL verification, and more.
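
The sketch below illustrates both request types. The `fetch` helper is a hypothetical wrapper around `requests.get`, and the form field names are placeholders. The POST request is only prepared, not sent, so you can inspect how the form data would be encoded without touching the network.

```python
import requests


def fetch(url, params=None, timeout=10):
    """GET a page and return its body text, raising on HTTP error statuses."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


# Build a POST request without sending it. The field names are placeholders.
prepared = requests.Request(
    "POST",
    "https://httpbin.org/post",
    data={"username": "user", "password": "pass"},
).prepare()

print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])
print(prepared.body)
```

To actually send the prepared request, pass it to `requests.Session().send(prepared)`. A `Session` also keeps cookies between calls, which is what makes login flows work.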

[Requests Toolbelt](https://toolbelt.readthedocs.io/en/latest/) is a library built to extend the Requests library. It is developed by the Requests core developers and contains tools that you might need but that don't fit into the Requests library itself.

# [LXML](https://lxml.de/)

LXML is a fast library for parsing XML and HTML documents. It's built on top of the C libraries [libxml2](http://xmlsoft.org/) and [libxslt](http://xmlsoft.org/XSLT/).
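
A short sketch of parsing HTML with LXML and querying it with XPath follows. The markup is an invented sample, loosely modelled on a quotes page, so that the example runs without a network request.

```python
from lxml import html

# Inline sample markup (hypothetical) so no page has to be downloaded.
PAGE = """
<html><body>
  <div class="quote"><span class="text">First quote</span><small class="author">Ann</small></div>
  <div class="quote"><span class="text">Second quote</span><small class="author">Bob</small></div>
</body></html>
"""

tree = html.fromstring(PAGE)

# An XPath expression ending in text() returns the text nodes directly.
texts = tree.xpath('//div[@class="quote"]/span[@class="text"]/text()')

# Otherwise XPath returns elements, whose text can be read afterwards.
authors = [el.text_content() for el in tree.xpath('//small[@class="author"]')]

print(list(zip(texts, authors)))
```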

# [cssselect](https://github.com/scrapy/cssselect)

To use CSS selectors in LXML, you need to install cssselect. Developed by the team behind Scrapy, it translates [CSS3 selectors](https://www.w3.org/TR/selectors-3/) to [XPath 1.0](https://www.w3.org/TR/1999/REC-xpath-19991116/) expressions.

# [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)

BeautifulSoup is a parsing library that can use different parsers (such as LXML). It is easy to learn and allows you to quickly find specific elements on a page. BeautifulSoup can infer structure and fix broken HTML and XML tags. It only parses static content, so it cannot handle pages whose content is generated by JavaScript. However, it is often easy to inspect such a page, find the API calls it makes, and consume their responses directly with Requests and other libraries like [json](https://docs.python.org/3/library/json.html).
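
The sketch below shows both approaches on invented data: BeautifulSoup parsing a fragment with an unclosed tag, and `json` decoding a payload of the kind a page's background API call might return. The markup, the payload, and its field names are all hypothetical.

```python
import json

from bs4 import BeautifulSoup

# Sample markup with an unclosed <p> tag, supplied inline for illustration.
HTML = '<div class="quote"><span class="text">Hello</span><p>unclosed</div>'

# "html.parser" ships with Python; "lxml" can be used instead if installed.
soup = BeautifulSoup(HTML, "html.parser")

# Find a single element by tag name and class.
text = soup.find("span", class_="text").get_text()

# Or select elements with a CSS selector.
all_text = [tag.get_text() for tag in soup.select("div.quote span.text")]

# The serialized tree has the missing closing tag filled in.
repaired = str(soup)

# A hypothetical JSON payload, as an API endpoint might return it.
payload = '{"quotes": [{"text": "Hello", "author": "Ann"}]}'
authors = [quote["author"] for quote in json.loads(payload)["quotes"]]

print(text, all_text, authors)
```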

# [Selenium](https://selenium.dev/selenium/docs/api/py/)

Selenium is a web driver designed for rendering and testing web pages. It allows you to programmatically interact with web pages, which means you can also extract content from them. The advantage of Selenium is that it executes JavaScript and renders all the content on the page. However, this slows down the scraping process and might not be suitable for large-scale projects.

To run Selenium, you need to have a web browser installed. Additionally, you must download the appropriate web driver and place it on your `PATH` so that Selenium can control the browser. The web drivers for common browsers can be found at the links below.

* [Chrome](https://chromedriver.chromium.org/downloads)
* [Firefox](https://github.com/mozilla/geckodriver/releases)
* [Opera](https://github.com/operasoftware/operachromiumdriver/releases)
* [Internet Explorer](https://github.com/SeleniumHQ/selenium/wiki/InternetExplorerDriver)
* [Edge](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)
* [Safari](https://developer.apple.com/documentation/webkit/testing_with_webdriver_in_safari)


# [Scrapy](https://docs.scrapy.org/en/latest/)

Scrapy is a web scraping framework designed for building spiders that crawl and process web pages. It covers in one place the jobs that Requests and BeautifulSoup do separately (fetching pages and parsing them), which allows you to quickly build scraping pipelines. Scrapy is built on top of [Twisted](https://twistedmatrix.com/trac/), which allows asynchronous requests to speed up the process.

Like BeautifulSoup, Scrapy does not inherently render JavaScript. However, the creators of Scrapy have developed a lightweight headless browser called Splash, designed specifically for web scraping.

# Useful Sites and References    

## [Web Scraping Sandbox](http://toscrape.com/)
* [Fake Bookstore](http://books.toscrape.com/)
* Real Quotes. There are several different configurations of this page to practice different scraping techniques.
    * [Default: Microdata and pagination](http://quotes.toscrape.com/)
    * [Scroll: Infinite scrolling pagination](http://quotes.toscrape.com/scroll)
    * [JavaScript: JavaScript generated content](http://quotes.toscrape.com/js)
    * [Tableful: A table based messed-up layout](http://quotes.toscrape.com/tableful)
    * [Login: Login with CSRF token (any user/passwd works)](http://quotes.toscrape.com/login)
    * [ViewState: An AJAX based filter form with ViewStates](http://quotes.toscrape.com/search.aspx)
    * [Random: A single random quote](http://quotes.toscrape.com/random)

## [Real Python](https://realpython.com/)
* [Python Web Scraping Tutorials](https://realpython.com/tutorials/web-scraping/)
    * [Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
    * [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)
    * [Modern Web Automation With Python and Selenium](https://realpython.com/modern-web-automation-with-python-and-selenium/)
    * [Web Scraping with Scrapy and MongoDB](https://realpython.com/web-scraping-with-scrapy-and-mongodb/)
    * [Web Scraping and Crawling with Scrapy and MongoDB](https://realpython.com/web-scraping-and-crawling-with-scrapy-and-mongodb/)
* [Python’s Requests Library](https://realpython.com/python-requests/)

## [ScrapingHub](https://scrapinghub.com/)
* Info
    * [How to Build a Web Scraper: Python Web Scraping Libraries & Frameworks Explained](https://info.scrapinghub.com/web-scraping-guide/python-web-scraping-libraries-and-frameworks)
    * [The Beginners Guide to Web Scraping](https://info.scrapinghub.com/web-scraping-guide/beginners-guide-to-web-scraping)
* [Blog](https://blog.scrapinghub.com/)
    * 2019-08-22 [Learn how to configure and utilize proxies with Python Requests module](https://blog.scrapinghub.com/python-requests-proxy)
    * 2019-08-08 [How to set up a custom proxy in Scrapy?](https://blog.scrapinghub.com/scrapy-proxy)
    * 2018-12-13 [Do What is Right Not What is Easy!](https://blog.scrapinghub.com/gdpr-web-scraping-iiap-europe-data-protection-congress)
    * 2017-04-19 [Deploy your Scrapy Spiders from GitHub](https://blog.scrapinghub.com/2017/04/19/deploy-your-scrapy-spiders-from-github)
    * 2016-10-27 [An Introduction to XPath: How to Get Started](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples)
    * 2016-09-28 [How to Run Python Scripts in Scrapy Cloud](https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud)
    * 2016-08-25 [How to Crawl the Web Politely with Scrapy](https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy)
    * 2016-06-22 [Scraping Infinite Scrolling Pages](https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016)
    * 2016-05-18 [How to Debug your Scrapy Spiders](https://blog.scrapinghub.com/2016/05/18/scrapy-tips-from-the-pros-may-2016-edition)
    * 2016-02-29 [Splash 2.0 Is Here with Qt 5 and Python 3](https://blog.scrapinghub.com/2016/02/29/splash-2-0-here-with-qt-5-and-python-3)
    * 2012-10-26 [How to Fill Login Forms Automatically](https://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically)

## [ScrapeHero](https://www.scrapehero.com)
* [Tutorials](https://www.scrapehero.com/web-scraping-tutorials/)
    * [What is web scraping – Part 1 – Beginner’s guide](https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/)
    * [Beginners guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup](https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-2-build-a-scraper-for-reddit/)
    * [Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data](https://www.scrapehero.com/web-scraping-tutorial-for-beginners-part-3-navigating-and-extracting-data/)

## W3C Specifications

* CSS
    * [Selectors Level 4](https://www.w3.org/TR/selectors-4/)
    * [Selectors Level 3](https://www.w3.org/TR/selectors-3/)
* XPath
    * [XML Path Language 3.1](https://www.w3.org/TR/2017/REC-xpath-31-20170321/)
    * [XML Path Language 3.0](http://www.w3.org/TR/2014/REC-xpath-30-20140408/)
    * [XML Path Language 2.0](http://www.w3.org/TR/2010/REC-xpath20-20101214/)
    * [XML Path Language 1.0](http://www.w3.org/TR/1999/REC-xpath-19991116/)

## [WhatIsMyBrowser](https://www.whatismybrowser.com/)

* [Database of User Agent Strings](https://developers.whatismybrowser.com/useragents/explore/)

# Other Interesting Links

## Is Scraping Legal?

* [Is Web Scraping Illegal? Depends on What the Meaning of the Word Is](https://www.imperva.com/blog/is-web-scraping-illegal/)
* [Better Online Ticket Sales (BOTS) Act](https://www.congress.gov/bill/114th-congress/senate-bill/3183)
* [QVC Can't Stop Web Scraping](https://www.forbes.com/sites/ericgoldman/2015/03/24/qvc-cant-stop-web-scraping/#16f010903ca3)

## Testing Out Response Headers

* [httpstat.us](https://httpstat.us/): A super simple service for generating different HTTP status codes
    * https://httpstat.us/200 will generate an `OK` response
    * https://httpstat.us/404 will generate a `NOT FOUND` response
* [httpbin.org](http://httpbin.org/): Developed by Kenneth Reitz, the developer of Requests. Test out different HTTP verbs and other parts of a request.