# Web Scraping

Our goal is to extract information from the web, sometimes from more than one page.

To do this, there are two terms you need to know:

- **Crawling** refers to following links on pages to reach other pages.

- **Scraping** refers to extracting information/data from one or more of those pages.
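To make the idea of crawling a little more concrete, here is a minimal sketch that collects the links on a page using only Python's built-in `html.parser` (not Scrapy). The page content in `EXAMPLE_HTML` and the `LinkCollector` class are made up for this illustration; a real crawler would go on to download each collected link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as "/page/2/" into full URLs.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical page content, invented for this example.
EXAMPLE_HTML = """
<html><body>
  <p>Some text we might scrape.</p>
  <a href="/page/2/">Next</a>
  <a href="https://example.com/about">About</a>
</body></html>
"""

collector = LinkCollector("http://quotes.toscrape.com")
collector.feed(EXAMPLE_HTML)
links = collector.links
print(links)
```

Each entry in `links` is a page a crawler could visit next.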

As we learned previously, web pages are specified using HTML, and thus scraping means we are downloading this HTML.

For example, consider a simple web page like what we've seen before.

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Webpage Title</title>
  </head>
  <body>
    <p>All that's on this page is a single html "paragraph" (i.e., sentence).</p>
  </body>
</html>
```

Scraping this web page is **nothing more** than automatically downloading the above HTML.

Once it's downloaded, we can extract only the bits we want. In this simple example, the title and/or the sentence are our only options.
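As a rough illustration of "extracting only the bits we want", the sketch below pulls the title and the sentence out of the page above using Python's built-in `html.parser`. This is not how Scrapy does it (Scrapy is introduced later); the `TextByTag` class is a hypothetical helper that is only meant for tiny, well-behaved pages like this one.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Record the text found directly inside each kind of tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags we are currently inside
        self.text = {}       # tag name -> list of text snippets

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Skip the whitespace between tags; keep real text.
        if self.open_tags and data.strip():
            self.text.setdefault(self.open_tags[-1], []).append(data.strip())


PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>Webpage Title</title>
  </head>
  <body>
    <p>All that's on this page is a single html "paragraph" (i.e., sentence).</p>
  </body>
</html>
"""

parser = TextByTag()
parser.feed(PAGE)
parser.close()

print(parser.text["title"])
print(parser.text["p"])
```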

In most cases, the information we want to extract is a very specific, tiny part of a page. Luckily, nearly all websites you encounter will include easy ways to identify this specific content. For example, consider the page below.

```html
<!DOCTYPE html>
<html>
  <body>
    <h1>John Smith</h1>
    <p>Age: 25</p>
  </body>
</html>
```

The page shows the name of someone and their age using two different HTML elements. The `<h1>` element shows "John Smith" in large bold letters (called a heading). However, you don't have to know what an `<h1>` element is to know you want to grab the "John Smith" text. You can extract the text from any element whatsoever, e.g., `<anytag>text</anytag>`. You could also extract the "Age: 25" text from the paragraph `<p>` element.

The other common distinction will be as follows:

```html
<!DOCTYPE html>
<html>
  <body>
    <p class="name">John Smith</p>
    <p class="age">Age: 25</p>
  </body>
</html>
```

Now we have two paragraph tags. We could just grab both. However, they are also distinguished. One of the elements has `class="name"` inside of the start tag. The other one has `class="age"`. These are called HTML **attributes** and usually have the form `<tag attribute="value"></tag>`. You can use these attributes to select the content you want to extract, such as all paragraph elements that have their class attribute set to "name". **You do not have to know the difference between attributes** in order to use them to select content.
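To show what selecting by attribute amounts to, here is a small sketch that keeps only the text of `<p>` elements whose class is "name". It uses Python's built-in `html.parser` rather than Scrapy, and the `TextByClass` helper is hypothetical: it only handles simple elements with no other tags nested inside them.

```python
from html.parser import HTMLParser


class TextByClass(HTMLParser):
    """Collect text from elements with a given tag and class attribute."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.inside_match = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # An element can carry several classes separated by spaces.
        classes = (dict(attrs).get("class") or "").split()
        self.inside_match = tag == self.tag and self.class_name in classes

    def handle_endtag(self, tag):
        self.inside_match = False

    def handle_data(self, data):
        if self.inside_match and data.strip():
            self.matches.append(data.strip())


PAGE = """
<body>
  <p class="name">John Smith</p>
  <p class="age">Age: 25</p>
</body>
"""

names = TextByClass("p", "name")
names.feed(PAGE)
print(names.matches)

ages = TextByClass("p", "age")
ages.feed(PAGE)
print(ages.matches)
```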

That's all there is to scraping in theory. We will explore these ideas in practice below.

### Scrapy

**Scrapy** is an easy web scraping framework for Python that allows us to scrape (and crawl) web pages.

To use it, we first have to install it.

In [5]:
# install scrapy
!pip install -q scrapy


Scraping web pages with **Scrapy** requires first creating a Scrapy "project".

We can create a project called "tutorial" using the command:

In [24]:
!scrapy startproject tutorial

New Scrapy project 'tutorial', using template directory '/Users/crosbynash/Documents/GitHub/ds497/.venv/lib/python3.9/site-packages/scrapy/templates/project', created in:
    /Users/crosbynash/Documents/GitHub/ds497/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com


The project Scrapy creates is a folder.

Make sure you can see this folder in the "Explorer" tab on the right in this codespace.

We can also see this folder by running the following command to show the current folders and files we can access:

In [7]:
%ls

exercise-1.html  scraping.ipynb   tutorial/        web_experiments/


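The output of the above command lists the files and folders in the current directory (folders are shown in blue in the notebook and end with `/`). You may see other files and folders besides the two described here.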

The file `scraping.ipynb` is the current notebook we are working with.

The `tutorial` folder is the project folder created by Scrapy. It contains a bunch of other files and folders that Scrapy creates by default.

(You do **NOT** have to know what they all mean!)

To see its contents, use the command: 

In [8]:
!tree tutorial

zsh:1: command not found: tree


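(If the `tree` command is not installed, as in the output above, you can browse the same folder in the Explorer tab instead.) In the `tree` output, all folders are shown in blue.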

The very first, top folder is the project tutorial folder we originally created. Inside that folder is one file (scrapy.cfg) and one folder, which confusingly is also called "tutorial". Make sure not to mix up the two tutorial folders (i.e., one is inside of the other). 

Inside the second tutorial folder (i.e., tutorial/tutorial) are a bunch of other files and folders. Ignore them for now.
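If the `tree` command is not available on your machine, the sketch below prints a similar listing using only Python's built-in `os` module. The `tree_lines` function is a hypothetical helper written for this example, not part of Scrapy or Jupyter.

```python
import os


def tree_lines(root):
    """Return an indented listing of every folder and file under root."""
    lines = []
    root = os.path.normpath(root)
    base_depth = root.count(os.sep)
    for folder, subfolders, files in os.walk(root):
        subfolders.sort()  # visit subfolders in alphabetical order
        depth = folder.count(os.sep) - base_depth
        lines.append("    " * depth + os.path.basename(folder) + "/")
        for name in sorted(files):
            lines.append("    " * (depth + 1) + name)
    return lines


# Only print the listing if the project folder exists where we are working.
if os.path.isdir("tutorial"):
    print("\n".join(tree_lines("tutorial")))
```

Folder names end with `/`, and everything indented beneath a folder is inside it.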

The only thing we need to know about any of these files and folders is that we need to be "working" inside of the top-level `tutorial` project folder.

To tell Python we want to "work" inside of this folder, use the following command:

In [25]:
# set project folder as the working directory
%cd tutorial/

/Users/crosbynash/Documents/GitHub/ds497/tutorial




You can see your current **working directory** anytime by running the `%pwd` command.

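In the codespace, the output should be `'/workspaces/lab_scraping/tutorial'`. (On your own machine the beginning of the path will differ, but it should still end with a single `tutorial`.) If it doesn't, then you're in the wrong directory, and Scrapy won't work correctly.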

For example, if you see `'/workspaces/lab_scraping/tutorial/tutorial'`, you've gone too deep!

In [26]:
# check our current working directory
%pwd

'/Users/crosbynash/Documents/GitHub/ds497/tutorial'
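One way to double-check that you are in the right place, sketched below under the assumption that the top-level project folder is the one that directly contains `scrapy.cfg` (as described above), is to test for that file. The `is_project_root` function is a hypothetical helper for this example.

```python
from pathlib import Path


def is_project_root(folder):
    """True if folder directly contains scrapy.cfg (the top-level project folder)."""
    return (Path(folder) / "scrapy.cfg").is_file()


print(Path.cwd())                   # same information as %pwd
print(is_project_root(Path.cwd()))  # True only in the top-level project folder
```

If the second line prints `False`, you are either not inside the project yet or one level too deep.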

Now if we run `%ls` again as we did before we changed our working directory, we get a different list of files. That's because we are no longer working in the folder containing this notebook (`scraping.ipynb`). We are now working inside `tutorial`.

In [27]:
%ls

scrapy.cfg  tutorial/


### Scraping a demo website

Our current goal will be to use this project to scrape a simple demo website that contains pages of quotes: http://quotes.toscrape.com.

Click on the website to see what it contains.

To turn our project into a functional scraper, we only need **one file** with a **few lines of code**.

Basic scrapers in Scrapy are supplied with a single file defining what is called a "spider" as a Python class. The spider contains information about which website we want to scrape and what information we want to scrape from it.

Spiders in Scrapy are defined in a Python `.py` file inside the `my_project_name/my_project_name/spiders/` folder.

Because we are already working in the top-level `tutorial` folder, we can create a file for our spider in `tutorial/spiders/`. That is, the full path of that folder will be `/workspaces/lab_scraping/tutorial/tutorial/spiders/`.

To create this file, we can use the `%%writefile` command to create a placeholder file called `quotes_spider.py` inside of `tutorial/spiders/`. For now, it contains only a single dummy comment line.

In [28]:
%%writefile tutorial/spiders/quotes_spider.py
# this dummy line will go inside the file

Writing tutorial/spiders/quotes_spider.py


You can see the `quotes_spider.py` file on the right tab inside of tutorial > tutorial > spiders.

We can also list the files inside of the `spiders` folder to see that the file now exists:

In [29]:
%ls tutorial/spiders

__init__.py       quotes_spider.py


Click on the file in the right explorer tab. You can see what's inside. Then close it.

Now, let's fill the file with some code to do some scraping.

We will call `%%writefile` again and create (overwrite) the same file again, but this time we will include the scraping code.

Everything that goes underneath the `%%writefile` line will be written to that file.

In [30]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        print(response.body)

Overwriting tutorial/spiders/quotes_spider.py


Click on the file again. You can see the above is now inside. Close it again.

That's it -- That's a fully functional scraper! As we will see, currently it just downloads the entire quote web page, but we'll improve it later.

Most of this code **will never change**, and thus you can always copy/paste to re-use. It contains the basic form used to define spiders:

First, we import Scrapy. We will always do this.

Then a **single** "class" is defined with a name we choose. In this case, `QuotesSpider` is our way of reminding ourselves that this spider is meant to scrape quotes.

The entire line `class QuotesSpider(scrapy.Spider):` will **always be the same**, other than the name `QuotesSpider`.

Inside of our new `QuotesSpider` class is the information needed to scrape data from the website. 
- The `name` variable is another way of naming the spider. We will reference this name ("quotes") later when telling Scrapy to deploy our spider to do some scraping.
- The `start_urls` variable is a list of webpage addresses from which we want to scrape content. As we said earlier, we want to scrape quotes from http://quotes.toscrape.com.

The final major component of our spider is the definition of a method (function) called `parse`. **All spiders contain this function** and the **inputs will always be** `self` and `response`. 

`response` is a special kind of Python object that always contains the **entire downloaded web page** of interest (i.e., http://quotes.toscrape.com). One simple way to inspect this data is to reference `response.body`, which contains the raw HTML for the webpage of interest. The `parse` function above simply prints out all of that HTML: `print(response.body)`.

To run this spider and see the scraped data, we can run the command below. All we are doing is telling Scrapy to use the `quotes` spider we just created.

In [31]:
!scrapy crawl quotes -L ERROR

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n    \n    \n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinki

That's it. You just scraped the page. The raw HTML of the page is shown in the output above.

The above output is not easy to read (e.g., `b'<!DOCTYPE html>...`). It's the entire webpage rather than just the content we want. One option would be to just save the output as is and then worry about extracting the subset of content we care about at a later time. However, Scrapy makes it easier to filter out only the data we want.
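Part of the reason the output is hard to read is that `response.body` is a Python `bytes` object, which is why it prints with a leading `b` and escape codes such as `\xe2\x80\x9c`. The sketch below, using a short fragment copied from the output above, shows how decoding bytes as UTF-8 (the encoding the page declares in its `<meta charset="UTF-8">` tag) gives an ordinary string. Scrapy responses also provide an already-decoded `response.text`, though we won't need it here.

```python
# A short fragment of the raw bytes from the spider output.
raw = b'<span class="text">\xe2\x80\x9cThe world as we have created it'

# Bytes print with a leading b and escape codes for non-ASCII characters.
print(raw)

# Decoding the bytes as UTF-8 gives an ordinary, readable string.
decoded = raw.decode("utf-8")
print(decoded)
```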

Scan your eyes along the output above until you see: 
```html
<title>Quotes to Scrape</title>
```
This is the title of the webpage surrounded by `<title>` tags. Before trying to extract quotes, let's try to extract this title as an exercise.

Rather than looking at the entire `response.body`, we can use what are called **CSS Selectors** to focus in on specific HTML elements such as the `<title>` tag.

For example, the CSS selector for the title element in the HTML document is just `"title"`.

To apply this selector (i.e., to filter the HTML we want to extract), use the `response.css()` method with the selector as the input:

In [32]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        print(response.css("title"))

Overwriting tutorial/spiders/quotes_spider.py


In [33]:
!scrapy crawl quotes -L ERROR

[<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]


The result seems to contain what we want (`<title>Quotes to Scrape</title>`) at the end, but there's some other stuff too.

The reason is that `response.css("title")` does not return raw text (we'll fix that in a second).

Instead, it returns a list of "Selector" Python objects (one for each matching element) that can be used to further select content. Forget that for now.

For now, we can just call the `get` method (function) on this result to return the raw HTML of the first match, which is what we care about:

In [34]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        print(response.css("title").get())

Overwriting tutorial/spiders/quotes_spider.py


In [35]:
!scrapy crawl quotes -L ERROR

<title>Quotes to Scrape</title>


Now we have the HTML we want!

However, technically, we still don't really need the `<title>` tags.

Using selectors, we can select not just the title tag, but the text inside those tags.

To do this, we use a selector with the format `tag::text`, in our case: `title::text`:

In [40]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        print(response.css("title::text").get()) # FILL ME IN!

Overwriting tutorial/spiders/quotes_spider.py


Now when we run our spider we will get just the title text:

In [41]:
!scrapy crawl quotes -L ERROR

Quotes to Scrape


We did it!

**Let's try to extract a quote now.**

Looking through the full raw HTML page for what we want is too tedious. Let's simplify our lives:

Open up Chrome, or another tab in Chrome if you are already using it for this notebook.

Then, navigate to http://quotes.toscrape.com. 

Look at the first quote on the page:

> “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

Right click on this quote (on Windows) or control+click (on Mac). Click "Inspect".

On the right of your screen, you will see the page's HTML, with the following highlighted:

```html
<span class="text" itemprop="text">...</span>
```

Click the small gray pointer / triangle icon to the left of that line. It will expand to show its full contents:

```html
<span class="text" itemprop="text">
    “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
</span>
```

What we see is that quotes are inside `<span>` elements with an attribute: `class="text"`.

**It does not matter** that we have not seen span elements before. Remember, you can select the text between **any** tags in the document.

If you right click on other quotes, you will see that they also have `class="text"`.

Thus, rather than call `response.css("span").get()`, which could end up grabbing text from other span elements on the page that don't contain quotes, we can make sure to only grab the span elements where `class="text"`.

To do this, we use the selector format `tag.class`, in our case: `span.text`:

```python
    response.css("span.text").get()
```

In [45]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        scraped = response.css("span.text").get() # FILL ME IN!
        print(scraped)

Overwriting tutorial/spiders/quotes_spider.py


Running the spider now gives us a quote just as we wanted, though it's wrapped in HTML we don't need.

In [43]:
!scrapy crawl quotes -L ERROR

<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>


Just as before, we can add `::text` to the end of our selector to get just the text. The overall selector format is `tag.class::text`.

Thus, the `span.text::text` selector selects the text (the quote) inside span elements that have a class of "text".

Note the difference between adding `.text`, which selects from a particular class of elements, and `::text` which selects just the text between the selected tags.

Why are both called text? `::text` will never change. `.text` corresponds to what the web page creator decided to call that class of elements. It's their fault :)

Were they to have named the class "quote", the selector would have been: `span.quote::text`.

In [46]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        scraped = response.css("span.text::text").get()
        print(scraped)

Overwriting tutorial/spiders/quotes_spider.py


In [47]:
!scrapy crawl quotes -L ERROR

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”


If you look back at the webpage in your browser, you'll notice there are **multiple quotes** on each page (all inside `<span class="text">` elements). 

To retrieve them all, we will use the same selector, but instead of calling `.get()`, we will simply call `.getall()`.

Then we will include a simple loop to print out all of them.

In [49]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        scraped = response.css("span.text::text").getall() # FILL ME IN!
        for quote in scraped:
            print(quote)

Overwriting tutorial/spiders/quotes_spider.py


In [50]:
!scrapy crawl quotes -L ERROR

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


Since we don't need the “” symbols, we can drop them:

In [51]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        scraped = response.css("span.text::text").getall()
        for quote in scraped:
            print(quote[1:-1]) # ONLY CHANGE

Overwriting tutorial/spiders/quotes_spider.py


In [52]:
!scrapy crawl quotes -L ERROR

The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.
It is our choices, Harry, that show what we truly are, far more than our abilities.
There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.
The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.
Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.
Try not to become a man of success. Rather become a man of value.
It is better to be hated for what you are than to be loved for what you are not.
I have not failed. I've just found 10,000 ways that won't work.
A woman is like a tea bag; you never know how strong it is until it's in hot water.
A day without sunshine is like, you know, night.
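The `[1:-1]` slice used above keeps everything except the first and last character of each string. The sketch below compares it with `str.strip`, which removes only the characters you name from both ends. The `plain` string is a made-up example showing what happens when a string has no curly quotes to begin with.

```python
quote = "“A day without sunshine is like, you know, night.”"

# Slicing [1:-1] drops the first and last character, whatever they are.
sliced = quote[1:-1]

# strip("“”") removes only curly quotes, and only from the two ends.
stripped = quote.strip("“”")

print(sliced)
print(stripped)

# Hypothetical string without curly quotes.
plain = "No curly quotes here"
print(plain[1:-1])         # slicing cuts off real letters
print(plain.strip("“”"))   # strip leaves the text untouched
```

On this page every quote is wrapped in curly quotes, so both approaches give the same result; `strip` is simply the safer habit if some strings might not be.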


Now, let's say we want to save our results to a data file that looks like the one below, with a single column called "quote" and one row for each quote we scrape:

In [54]:
import pandas as pd
pd.DataFrame([
    {'quote': 'the first quote'},
    {'quote': 'the second quote'},
    {'quote': 'the third quote'},
    {'quote': 'etc...'},
])

|   | quote |
|---|---|
| 0 | the first quote |
| 1 | the second quote |
| 2 | the third quote |
| 3 | etc... |


**To save results to a file**, Scrapy requires that each desired row of saved data be provided as a Python dictionary, e.g., `{'quote': 'the first quote'}`, and handed over using the `yield` keyword inside of a loop.

Put more simply, while before we just called `print(quote)` inside of the loop to see each quote, we will now call `yield {'quote': quote}` inside of the loop to save each quote as a row of data.

The code below makes just **two** simple changes:
1. Put each quote in a dictionary: `{'quote': quote[1:-1]}`
2. Yield each quote in the loop: `yield row_of_data`

In [55]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        scraped = response.css("span.text::text").getall()
        for quote in scraped:
            row_of_data = {'quote': quote[1:-1]} # NEW!
            print(row_of_data)
            yield row_of_data  # NEW!

Overwriting tutorial/spiders/quotes_spider.py
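If `yield` is new to you, the sketch below shows the idea without Scrapy: a function that yields one dictionary per quote, and a few lines that turn those dictionaries into CSV text with Python's built-in `csv` module. This only mimics what happens when Scrapy saves results; the stand-alone `parse` function and the quote strings here are hypothetical stand-ins.

```python
import csv
import io


def parse(quotes):
    """Yield one dictionary (one row of data) per quote."""
    for quote in quotes:
        yield {"quote": quote}


# Hypothetical stand-in for the scraped strings.
rows = list(parse(["the first quote", "the second quote"]))
print(rows)

# Write the rows as CSV text, roughly what ends up in a .csv file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["quote"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The dictionary key ("quote") becomes the column name, and each yielded dictionary becomes one row.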


Now when we run our spider, we need to add `-O quotes.csv` to our command to tell Scrapy to output results to a data file called `quotes.csv`.

In [56]:
!scrapy crawl quotes -L ERROR -O quotes.csv

{'quote': 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'}
{'quote': 'It is our choices, Harry, that show what we truly are, far more than our abilities.'}
{'quote': 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.'}
{'quote': 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.'}
{'quote': "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."}
{'quote': 'Try not to become a man of success. Rather become a man of value.'}
{'quote': 'It is better to be hated for what you are than to be loved for what you are not.'}
{'quote': "I have not failed. I've just found 10,000 ways that won't work."}
{'quote': "A woman is like a tea bag; you never know how strong it is until it's in hot water."}
{'quote': 'A day without sunshine is like, you 

We can read in this file with pandas and show that it has the structure we wanted:

In [57]:
import pandas as pd
pd.read_csv('quotes.csv')

|   | quote |
|---|---|
| 0 | The world as we have created it is a process o... |
| 1 | It is our choices, Harry, that show what we tr... |
| 2 | There are only two ways to live your life. One... |
| 3 | The person, be it gentleman or lady, who has n... |
| 4 | Imperfection is beauty, madness is genius and ... |
| 5 | Try not to become a man of success. Rather bec... |
| 6 | It is better to be hated for what you are than... |
| 7 | I have not failed. I've just found 10,000 ways... |
| 8 | A woman is like a tea bag; you never know how ... |
| 9 | A day without sunshine is like, you know, night. |


We could extract authors along with each quote as well.

If we again right-click and choose "Inspect" in Chrome, this time on the author text "Albert Einstein" for the first quote, we see:

```html
<small class="author">Albert Einstein</small>
```

To extract both:

In [60]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        quotes = response.css("span.text::text").getall()

        authors = response.css("small.author::text").getall() # FILL ME IN!

        print(quotes)
        print(authors)

Overwriting tutorial/spiders/quotes_spider.py


In [61]:
!scrapy crawl quotes -L ERROR

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Aus
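With two separate lists like these, one way to pair each quote with its author is Python's built-in `zip`. The sketch below uses made-up values and also shows the weakness of this approach: it relies on both lists having the same length and order.

```python
# Hypothetical values standing in for the two scraped lists.
quotes = ["quote one", "quote two", "quote three"]
authors = ["Author A", "Author B", "Author C"]

# Pair the first quote with the first author, and so on.
rows = [{"author": a, "quote": q} for q, a in zip(quotes, authors)]
print(rows)

# If the author of "quote two" were missing from the page, the lists
# would no longer line up: zip pairs "quote two" with the wrong author
# and silently drops "quote three".
short_authors = ["Author A", "Author C"]
misaligned = list(zip(quotes, short_authors))
print(misaligned)
```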

A better way to do the above is to understand more about how the page is organized.

Go back to the page and "Inspect" the white space just a pixel or so away from the thin black line above the quote text.

You will see that each quote and related info is grouped together like this:

```html
<div class="quote">
    <span class="text">Quote text.</span>
    <span>
        by <small class="author">Author Name</small>
        <a href="/author/Author-Name">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        ...
    </div>
</div>
```

Without knowing every detail, we can see that all of the info for one quote is organized inside of one `<div class="quote">` parent element. Inside is the quote, the author name, and other things like tags and links (`<a>`).

We can access all information associated with a single quote using one simple selector: `'div.quote'`:

In [62]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        quotes = response.css("div.quote").getall() # FILL ME IN!

        for quote in quotes:
            print(quote)

Overwriting tutorial/spiders/quotes_spider.py


In [63]:
!scrapy crawl quotes -L ERROR

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> 
            
            <a class="tag" href="/tag/change/page/1/">change</a>
            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            
            <a class="tag" href="/tag/world/page/1/">world</a>
            
        </div>
    </div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="tex

A common shorthand in Scrapy to do the same as the above (printing out all div quote elements) is below.

That is, we are looping through all of the quotes in `response.css('div.quote')`, and only calling `.get()` on the individual quotes.

In [64]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        for quote in response.css('div.quote'):
            print(quote.get())

Overwriting tutorial/spiders/quotes_spider.py


In [65]:
!scrapy crawl quotes -L ERROR

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> 
            
            <a class="tag" href="/tag/change/page/1/">change</a>
            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            
            <a class="tag" href="/tag/world/page/1/">world</a>
            
        </div>
    </div>
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="tex

The nice thing about this approach is that it allows us to grab information at the level of a single quote rather than at the level of the entire document all at once.

For example, below we have the same loop, but we call `.css('span.text::text').get()` on each individual quote.

This is why we loop through `response.css('div.quote')` and not `response.css('div.quote').get()`. Looping over the former gives us the previously mentioned "Selector" Scrapy objects, which can be further filtered using `.css()`, whereas the latter cannot because we forced it to become a raw text string via `.get()`. (You don't have to use this method if it makes less sense to you.)

Check the output and see if it makes sense why it looks that way.

In [66]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        for quote in response.css('div.quote'):
            text = quote.css('span.text::text').get()
            print(text)

Overwriting tutorial/spiders/quotes_spider.py


In [67]:
!scrapy crawl quotes -L ERROR

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
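To see why searching inside each quote can be safer than searching the whole page, here is a sketch using Python's built-in `xml.etree.ElementTree` instead of Scrapy. The markup in `PAGE` is simplified and made up, and the second quote deliberately has no author. Note that ElementTree only accepts well-formed markup, which real web pages often are not, so treat this as an illustration of the idea rather than a scraping tool.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup in the same shape as the quotes page.
# The second quote deliberately has no author element.
PAGE = """
<body>
  <div class="quote">
    <span class="text">Quote one.</span>
    <span>by <small class="author">Author A</small></span>
  </div>
  <div class="quote">
    <span class="text">Quote two.</span>
  </div>
</body>
"""

root = ET.fromstring(PAGE)

rows = []
for quote in root.findall(".//div[@class='quote']"):
    # Each search starts from this one quote, not from the whole page.
    rows.append({
        "author": quote.findtext(".//small[@class='author']"),
        "quote": quote.findtext("span[@class='text']"),
    })

print(rows)
```

Because each lookup is limited to one quote, a missing author shows up as `None` for that quote instead of shifting every later author up by one.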


Now we can use this idea to extract each quote as a set of properties. We can store them in a dictionary, `yield` it so that Scrapy saves it to a file as before, and then read that file with pandas.

In [68]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        for quote in response.css('div.quote'): # FILL ME IN!
            scraped = {
                'author': quote.css('small.author::text').get(),
                'quote': quote.css('span.text::text').get(),
            }
            print(scraped)
            yield scraped

Overwriting tutorial/spiders/quotes_spider.py


In [69]:
!scrapy crawl quotes -L ERROR -O quotes.csv

{'author': 'Albert Einstein', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'author': 'J.K. Rowling', 'quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
{'author': 'Albert Einstein', 'quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'}
{'author': 'Jane Austen', 'quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}
{'author': 'Marilyn Monroe', 'quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"}
{'author': 'Albert Einstein', 'quote': '“Try not to become a man of success. Rather become a man of value.”'}
{'author': 'André Gide', 'quote': '“It is better to be hated for what you are than to be loved for what you are not.”'}
{'author': 'Thoma

In [70]:
pd.read_csv('quotes.csv')

| | author | quote |
|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | “There are only two ways to live your life. On... |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... |
| 5 | Albert Einstein | “Try not to become a man of success. Rather be... |
| 6 | André Gide | “It is better to be hated for what you are tha... |
| 7 | Thomas A. Edison | “I have not failed. I've just found 10,000 way... |
| 8 | Eleanor Roosevelt | “A woman is like a tea bag; you never know how... |
| 9 | Steve Martin | “A day without sunshine is like, you know, nig... |


We now have all of the quotes on one page, but what about the other ones?

To access the other pages, we can use Scrapy to **crawl** them.

On this website, additional pages of quotes are accessed using the "Next" button at the very bottom of the page.

Inspect it in Chrome, and you will see:

```html
<li class="next">
  <a href="/page/2/">Next</a>
</li>
```

The link to the next page is `"/page/2/"` (as in `http://quotes.toscrape.com/page/2/`).
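
A relative link like this is resolved against the URL of the page it appears on. As a minimal sketch of that resolution, using the standard library's `urljoin` (the variable names are just for illustration):

```python
from urllib.parse import urljoin

page_url = 'http://quotes.toscrape.com'
href = '/page/2/'

# Combine the current page's URL with the relative link.
next_url = urljoin(page_url, href)
print(next_url)  # http://quotes.toscrape.com/page/2/

# The same rule applies from any later page.
later_url = urljoin('http://quotes.toscrape.com/page/2/', '/page/3/')
print(later_url)  # http://quotes.toscrape.com/page/3/
```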

Notice that this link is not the text of the element ("Next").

Instead, we want the value of the `href` attribute, which is `"/page/2/"`. `href` holds the URL that the link points to (here a relative one, but it could equally be a full URL like `https://google.com`). We'll talk about how to grab attributes in a second.

First, notice that the `<a>` element doesn't have a class to help select it. However, it is the only `<a>` inside the `<li class="next">` element.

We can access it using the selector format `'parent child'`, which simply separates the two with a space (the space matches any descendant, not only direct children), such as `li.next a`:

In [71]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        next_page = response.css('li.next a').get()
        print(next_page)

Overwriting tutorial/spiders/quotes_spider.py


In [72]:
!scrapy crawl quotes -L ERROR

<a href="/page/2/">Next <span aria-hidden="true">→</span></a>


Now that we have the `<a>` element, we want its `href` attribute value.

Just like we used `::text` to select text, we can use `::attr(href)` to grab an attribute (href in particular):

In [73]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        next_page = response.css('li.next a::attr(href)').get()
        print(next_page)

Overwriting tutorial/spiders/quotes_spider.py


In [74]:
!scrapy crawl quotes -L ERROR

/page/2/


Now all we have to do is tell Scrapy to follow this link.

We always do this with the same exact block of code, shown below.

Note that every time `yield response.follow(url, self.parse)` is called, `parse` is called again on the new page. That is, your scraping code (e.g., `response.css('li.next a::attr(href)').get()`) will be called again on each new page that is found.

In [75]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        url = response.css('li.next a::attr(href)').get()
        print(url)

        if url is not None:
            yield response.follow(url, self.parse)

Overwriting tutorial/spiders/quotes_spider.py


In [76]:
!scrapy crawl quotes -L ERROR

/page/2/
/page/3/
/page/4/
/page/5/
/page/6/
/page/7/
/page/8/
/page/9/
/page/10/
None


You can see that Scrapy crawled through 9 additional "next" pages until there were no more.

`if url is not None:` is the stopping condition: on the last page there is no "Next" link, so `url` is `None` and nothing more is followed.

To scrape, say, just the first quote from each page, we can use:
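
To make the "parse is called again on each new page" idea concrete, here is a toy model in plain Python. It is not how Scrapy is implemented: `FAKE_SITE`, `parse`, and `crawl` are hypothetical stand-ins, and the fake site has only three pages.

```python
# Hypothetical stand-in for the website: each page maps to the href of
# its "Next" link, or None on the last page.
FAKE_SITE = {
    '/page/1/': '/page/2/',
    '/page/2/': '/page/3/',
    '/page/3/': None,
}

def parse(page_url):
    """Mimic a spider's parse method for one page."""
    url = FAKE_SITE[page_url]
    print(url)
    if url is not None:
        yield url  # stands in for `yield response.follow(url, self.parse)`

def crawl(start_url):
    """Call parse on the start page and on every link that parse yields."""
    visited = []
    pending = [start_url]
    while pending:
        current = pending.pop(0)
        visited.append(current)
        pending.extend(parse(current))  # consumes the generator
    return visited

visited = crawl('/page/1/')
# prints /page/2/, /page/3/, None (one per line)
```

The final `None` is printed by the last page, where the `if` check fails and the crawl ends.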

In [77]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        # scrape something on the current page
        quote_text = response.css('span.text::text').get()  # .get() returns only the first match
        print(quote_text)

        # go to next page
        url = response.css('li.next a::attr(href)').get()  # None when there is no "Next" link
        if url is not None:
            yield response.follow(url, self.parse)

Overwriting tutorial/spiders/quotes_spider.py


In [78]:
!scrapy crawl quotes -L ERROR

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in your

## Additional Exercises

Below are some additional exercises. If you run out of time, use them as practice questions for later studying.

**Exercise 1:**

Write your own spider in the below cell that prints out the sets of tags (e.g., 'inspirational', 'life') associated with each quote (quote div element).

Your output should look like this:

```python
['change', 'deep-thoughts', 'thinking', 'world']
['abilities', 'choices']
['inspirational', 'life', 'live', 'miracle', 'miracles']
['aliteracy', 'books', 'classic', 'humor']
['be-yourself', 'inspirational']
['adulthood', 'success', 'value']
['life', 'love']
['edison', 'failure', 'inspirational', 'paraphrased']
['misattributed-eleanor-roosevelt']
['humor', 'obvious', 'simile']
```

In [81]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        for quote in response.css('div.quote'):
            tag = quote.css('a.tag::text').getall()
            print(tag)

Overwriting tutorial/spiders/quotes_spider.py


In [82]:
!scrapy crawl quotes -L ERROR

['change', 'deep-thoughts', 'thinking', 'world']
['abilities', 'choices']
['inspirational', 'life', 'live', 'miracle', 'miracles']
['aliteracy', 'books', 'classic', 'humor']
['be-yourself', 'inspirational']
['adulthood', 'success', 'value']
['life', 'love']
['edison', 'failure', 'inspirational', 'paraphrased']
['misattributed-eleanor-roosevelt']
['humor', 'obvious', 'simile']


**Exercise 2:**

Write a spider that:
1. goes through each quote div on all 10 pages,
2. extracts the quote text, author, and tags for each, and
3. outputs the results to quotes.csv.

Then, load and visualize the data (table) with pandas as we did earlier.

Your output should look something like this:

```python
{'text': '“The world as we...”',   'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts']}
{'text': '“It is our choices...”', 'author': 'J.K. Rowling',    'tags': ['abilities', 'choices']}
{'text': '“There are only two..”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life']}
...
```

In [95]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

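        for quote in response.css('div.quote'):  # one Selector per quote block
            scraped = {
                'author': quote.css('small.author::text').get(),
                # [1:-1] drops the curly quotation marks around the text
                'quote': quote.css('span.text::text').get()[1:-1],
                # .getall() returns every tag as a list
                'tags': quote.css('a.tag::text').getall()
            }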
            print(scraped)
            yield scraped

Overwriting tutorial/spiders/quotes_spider.py


In [96]:
# run your spider HERE
!scrapy crawl quotes -L ERROR -O quotes.csv
pd.read_csv('quotes.csv')

{'author': 'Albert Einstein', 'quote': 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'author': 'J.K. Rowling', 'quote': 'It is our choices, Harry, that show what we truly are, far more than our abilities.', 'tags': ['abilities', 'choices']}
{'author': 'Albert Einstein', 'quote': 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'author': 'Jane Austen', 'quote': 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'author': 'Marilyn Monroe', 'quote': "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.", 'tags': ['be-yourself', 'inspirational']}
{'aut

| | author | quote | tags |
|---|---|---|---|
| 0 | Albert Einstein | The world as we have created it is a process o... | change,deep-thoughts,thinking,world |
| 1 | J.K. Rowling | It is our choices, Harry, that show what we tr... | abilities,choices |
| 2 | Albert Einstein | There are only two ways to live your life. One... | inspirational,life,live,miracle,miracles |
| 3 | Jane Austen | The person, be it gentleman or lady, who has n... | aliteracy,books,classic,humor |
| 4 | Marilyn Monroe | Imperfection is beauty, madness is genius and ... | be-yourself,inspirational |
| 5 | Albert Einstein | Try not to become a man of success. Rather bec... | adulthood,success,value |
| 6 | André Gide | It is better to be hated for what you are than... | life,love |
| 7 | Thomas A. Edison | I have not failed. I've just found 10,000 ways... | edison,failure,inspirational,paraphrased |
| 8 | Eleanor Roosevelt | A woman is like a tea bag; you never know how ... | misattributed-eleanor-roosevelt |
| 9 | Steve Martin | A day without sunshine is like, you know, night. | humor,obvious,simile |


**Exercise 3:**

Rather than moving through pages by following links, you can alternatively provide a list of pages for Scrapy to scrape.

Recall that we originally supplied the starting page to Scrapy:

```python
start_urls = ['http://quotes.toscrape.com']
```

You can use this same list to manually supply as many pages to scrape as you want. Scrapy will automatically apply the parse function to all web pages in this list.

```python
start_urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
]
```

Write a spider that uses this method to extract only the first author from all 10 pages.

Note that the results may come back out of order, since Scrapy sends these requests concurrently rather than one at a time, e.g.:

```
Albert Einstein
Alfred Tennyson
Jane Austen
Pablo Neruda
Dr. Seuss
George R.R. Martin
Charles Bukowski
Albert Einstein
Marilyn Monroe
J.K. Rowling
```

In [115]:
%%writefile tutorial/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/',
                  'http://quotes.toscrape.com/page/3/',
                  'http://quotes.toscrape.com/page/4/',
                  'http://quotes.toscrape.com/page/5/',
                  'http://quotes.toscrape.com/page/6/',
                  'http://quotes.toscrape.com/page/7/',
                  'http://quotes.toscrape.com/page/8/',
                  'http://quotes.toscrape.com/page/9/',
                  'http://quotes.toscrape.com/page/10/']

    def parse(self, response):

        author = response.css('small.author::text').get()  # first author on the page
        print(author)

Overwriting tutorial/spiders/quotes_spider.py


In [116]:
!scrapy crawl quotes -L ERROR

Albert Einstein
George R.R. Martin
Jane Austen
Dr. Seuss
Pablo Neruda
Marilyn Monroe
Charles Bukowski
Albert Einstein
J.K. Rowling
Alfred Tennyson
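
Typing out all 10 URLs by hand works, but the list can also be generated. A minimal sketch, assuming the pages keep following the `/page/N/` pattern seen above:

```python
# Build the URLs for pages 1 through 10 instead of listing them manually.
start_urls = [f'http://quotes.toscrape.com/page/{n}/' for n in range(1, 11)]

print(len(start_urls))  # 10
print(start_urls[0])    # http://quotes.toscrape.com/page/1/
print(start_urls[-1])   # http://quotes.toscrape.com/page/10/
```

The same `start_urls = [...]` line can replace the hand-written list inside the spider class.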
