# Scrapy Selectors 👉

## Response Objects

The Scrapy ```Response``` object is a crucial component in the Scrapy framework, representing the HTTP response received after making a request. It encapsulates the content of the page, along with various attributes and methods that facilitate data extraction and manipulation.

## Key Attributes of the Response Object


*    ```url```: The URL of the response. This is particularly useful for logging and debugging purposes.
*    ```status```: The HTTP status code of the response, which can help determine if the request was successful (e.g., 200) or if there were issues (e.g., 404, 500).
*    ```body```: The raw bytes of the response content. This can be used to access the content directly, especially when dealing with binary data.
*    ```text```: A convenience property that decodes the response body to a string using the response's encoding.
*    ```headers```: A dictionary-like object containing the response headers, which can provide additional context about the response.


## Common Methods

Text-based responses (```TextResponse``` and its subclasses, such as ```HtmlResponse``` and ```XmlResponse```) also provide several methods that enhance their functionality:

*    ```css()```: Selects elements from the response using CSS selectors. It returns a ```SelectorList```, which can be further processed to extract data.
*    ```xpath()```: Similar to ```css()```, but uses XPath expressions to select elements. This is particularly useful for XML or HTML documents where XPath is preferred.
*    ```json()```: If the response content is in JSON format, this method parses it and returns the resulting Python object (typically a dictionary or a list).



You can create an ```HtmlResponse``` object by passing the URL and the HTML content as follows:

In [1]:
from scrapy.http import HtmlResponse

url = 'http://example.com'
html_content = '<html><body><h1>Hello, World!</h1></body></html>'
response = HtmlResponse(url=url, body=html_content, encoding='utf-8')

This code snippet initializes an HtmlResponse object with the specified URL and HTML body. 
The encoding parameter ensures that the response is correctly interpreted.

Once you have an ```HtmlResponse``` object, you can access the response body using the ```.body``` attribute. This attribute holds the raw HTML content as bytes (note the ```b``` prefix in the output), which can be useful for debugging or logging purposes:

In [10]:
print(response.body)

b'<html><body><h1>Hello, World!</h1></body></html>'


The ```HtmlResponse``` object provides a ```.selector``` attribute, which is an instance of ```scrapy.Selector```. This allows you to use **XPath** or **CSS** selectors to extract data from the HTML:

### Example of Using XPath

To extract the text from the \<h1\> tag, you can use the following ```XPath``` expression:

In [11]:
h1_text = response.selector.xpath('//h1/text()').get()
print(h1_text)  # Output: Hello, World!


Hello, World!


### Example of Using CSS

Alternatively, you can achieve the same result using **CSS** selectors. 

In [12]:
h1_text_css = response.css('h1::text').get()
print(h1_text_css)  # Output: Hello, World!

Hello, World!


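```response.css()``` is a shortcut for ```response.selector.css()```, so calling the method on the selector directly gives the same result:

In [14]:
h1_text_css = response.selector.css('h1::text').get()
print(h1_text_css)  # Output: Hello, World!

Hello, World!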


# More selector examples

In [22]:
# Triple quotes (""") create multiline strings, which can also serve as comments

"""This is a multiline
comment. It is very helpful for writing
lengthy function descriptions. The
next example shows how to initialize
a string with multiple lines."""


html_content = """
<!DOCTYPE html>

<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' alt='image1'/></a>
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' alt='image2'/></a>
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' alt='image3'/></a>
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' alt='image4'/></a>
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' alt='image5'/></a>
    </div>
  </body>
</html>
"""

So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:

In [23]:
url = 'http://example.com'
response = HtmlResponse(url=url, body=html_content, encoding='utf-8')
response.xpath("//title/text()")

[<Selector query='//title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector ```.get()``` or ```.getall()``` methods, as follows:

In [24]:
response.xpath("//title/text()").getall()

['Example website']

In [25]:
response.xpath("//title/text()").get()

'Example website'

```.get()``` always returns a single result: if there are several matches, the content of the first match is returned; 
if there are no matches, ```None``` is returned. 

```.getall()``` returns a list with all results.

Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:

In [26]:
response.css("title::text").get()

'Example website'

As you can see, ```.xpath()``` and ```.css()``` methods return a ```SelectorList``` instance, which is a list of new selectors. 
This API can be used for quickly selecting nested data:

In [27]:
response.css("img").xpath("@src").getall()

['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

If you want to extract only the first matched element, you can call the selector's ```.get()``` method:

In [28]:
response.xpath('//div[@id="images"]/a/text()').get()

'Name: My image 1 '

It returns ```None``` if no element was found:

In [29]:
response.xpath('//div[@id="not-exists"]/text()').get() is None

True

Instead of using an XPath such as ```'@src'```, you can query attributes through the ```.attrib``` property of a Selector:

In [30]:
[img.attrib["src"] for img in response.css("img")]

['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [31]:
response.css("base").attrib["href"]

'http://example.com/'

## Nesting selectors

The selection methods (```.xpath()``` or ```.css()```) return a list of selectors of the same type, so you can call the selection methods for those selectors too. Here’s an example:

In [32]:
links = response.xpath('//a[contains(@href, "image")]')

links.getall()

In [36]:
enumerate(links) #The enumerate() function in Python adds a counter to an iterable, allowing for easy access to both index and element pairs

<enumerate at 0x10bf4f2c0>

In [37]:
for i in enumerate(links):
    print(i)

(0, <Selector query='//a[contains(@href, "image")]' data='<a href="image1.html">Name: My image ...'>)
(1, <Selector query='//a[contains(@href, "image")]' data='<a href="image2.html">Name: My image ...'>)
(2, <Selector query='//a[contains(@href, "image")]' data='<a href="image3.html">Name: My image ...'>)
(3, <Selector query='//a[contains(@href, "image")]' data='<a href="image4.html">Name: My image ...'>)
(4, <Selector query='//a[contains(@href, "image")]' data='<a href="image5.html">Name: My image ...'>)


In [None]:
for index, link in enumerate(links):
    href_xpath = link.xpath("@href").get()
    img_xpath = link.xpath("img/@src").get()
    print(f"Link number {index} points to url {href_xpath!r} and image {img_xpath!r}")

## Selecting element attributes

### HTML attributes

HTML attributes provide additional information about HTML elements.


*    All HTML elements can have attributes
*    Attributes are always specified in the start tag
*    Attributes usually come in name/value pairs like: name="value"


The \<a\> tag defines a hyperlink. The ```href``` attribute specifies the URL of the page the link goes to:

In [None]:
# <a href="https://www.w3schools.com">Visit W3Schools</a> 

There are several ways to get a value of an attribute. First, one can use XPath syntax:

In [43]:
response.xpath("//a/@href").getall()

['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

XPath syntax has a few advantages: it is a standard XPath feature, and ```@attributes``` can be used in other parts of an XPath expression - e.g. it is possible to filter by attribute value.

Scrapy also provides an extension to CSS selectors (```::attr(...)```) that lets you get attribute values:

In [42]:
response.css("a::attr(href)").getall()

['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In addition, Selector has an ```.attrib``` property. You can use it if you prefer to look up attributes in Python code, without using XPath or CSS extensions:

In [44]:
[a.attrib["href"] for a in response.css("a")]

['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

## Using selectors with regular expressions

Selector also has a ```.re()``` method for extracting data using regular expressions. 
However, unlike using ```.xpath()``` or ```.css()``` methods, ```.re()``` returns a list of strings. So you can’t construct nested ```.re()``` calls.

Here’s an example used to extract image names from the HTML code above:

In [45]:
response.xpath('//a[contains(@href, "image")]/text()').re(r"Name:\s*(.*)")

['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

There is also a helper for ```.re()``` that plays the same role as ```.get()```, named ```.re_first()```. Use it to extract just the first matching string:

In [46]:
response.xpath('//a[contains(@href, "image")]/text()').re_first(r"Name:\s*(.*)")

'My image 1 '

# Example Usage with Scrapy Spider

In [2]:
import scrapy 
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    custom_settings = {
        'FEEDS': {
            'kaveri_QUOTES2.csv': {  # output file for the extracted data
                'format': 'csv',  # export format; JSON and XML are also supported
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get() 
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

process = CrawlerProcess() #define the crawler
process.crawl(QuotesSpider) #attach the spider to the crawler
process.start()

2025-01-16 15:57:13 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-16 15:57:13 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-16 15:57:13 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-16 15:57:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-16 15:57:13 [scrapy.extensions.telnet] INFO: Telnet Password: d6919cd26c4428f6
2025-01-16 15:57:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-16 15:57:13 [scrapy.crawler] INFO: Overridden 

ReactorNotRestartable: 

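In this example, the ```parse``` method uses the ```css()``` method of the ```Response``` object to select every ```div.quote``` element and yields a dictionary with each quote's text, author, and tags. It then looks for the next-page link and, if one exists, schedules a request for it with the same callback, so the spider walks through all pages.

The ```ReactorNotRestartable``` error at the end of the output is raised because the Twisted reactor behind ```CrawlerProcess``` cannot be started twice in the same Python process. In a notebook this typically means the cell was run more than once; restarting the kernel before re-running the cell avoids the error.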

## Key Features of Scrapy Spiders

*    **Asynchronous Processing**: Scrapy spiders operate asynchronously, allowing them to handle multiple requests simultaneously, which significantly speeds up the crawling process.
*    **Selectors**: Scrapy provides powerful selectors based on XPath and CSS, enabling spiders to extract data from HTML documents efficiently.
*    **Middleware**: You can customize the behavior of your spider by using middleware, which allows you to process requests and responses globally.


## Summary of Key Features

*    Selectors: The HtmlResponse class allows you to use both XPath and CSS selectors to navigate and extract data from the HTML document.
*    Convenience: The response object in Scrapy automatically provides access to the selector, making it easy to work with HTML content without manual instantiation.
*    Encoding: Proper encoding is crucial for accurately processing the HTML content, especially when dealing with non-ASCII characters.

By understanding how to create and utilize HtmlResponse objects, you can effectively scrape and manipulate HTML data in your Scrapy projects.

## Best Practices for Writing Spiders

*    Keep it Simple: Start with a simple spider and gradually add complexity as needed. This helps in debugging and maintaining the code.
*    Respect ```robots.txt```: Always check the robots.txt file of the website to ensure that your spider complies with the site's crawling policies. Scrapy's ```ROBOTSTXT_OBEY``` setting makes the crawler honor these rules.
*    Use Item Loaders: For more complex data extraction, consider using item loaders to clean and validate the data before yielding it.


# Conclusion

Scrapy spiders are versatile and powerful tools for web scraping. By understanding their structure and capabilities, you can effectively extract data from various sources. For more detailed information, refer to the official Scrapy Spiders Documentation.