# Selector
When scraping web pages, the most common task is extracting data from the HTML source. There are several ways to identify where that data sits in the HTML; this section covers XPath selectors, CSS selectors, and pyquery.
A local sample HTML file is loaded below to show how each method works.

In [1]:
from scrapy import Selector

with open('Sample Webpage/sample5.html') as f:
    text = f.read()
sel = Selector(text=text)

Structure of Sample HTML

|-- __body__ 
>|-- __ul__ 
>>|-- __li__ class = "top" 
>>>|-- __div__ 
>>>>|-- _div of li_ 

>>|-- __li__ class = "top" 
>>>|-- __div__ 
>>>>|-- __div__ 
>>>>>|-- _div of li's div_ 

>>|-- __li__  
>>>|-- **p**  
>>>>|-- _p of li_ 

>>|-- __li__  
>>>|-- __a__  
>>>>|-- **div** id = "li_a_div" 
>>>>>|-- _div of li's a_ 

## XPath Selector
XPath is a major element of the [XSLT](https://en.wikipedia.org/wiki/XSLT) standard. It is used to navigate through the elements and attributes of an XML document, and the `Selector` used here applies it to HTML as well.

In [2]:
ul_s_lis = sel.xpath("/html/body/ul/li")
for ul_s_li in ul_s_lis:
    print(ul_s_li)
    print('context:')
    print(ul_s_li.extract())
    print('\n')

<Selector xpath='/html/body/ul/li' data='<li class="top">\n            <div>div of'>
context:
<li class="top">
            <div>div of li</div>
        </li>


<Selector xpath='/html/body/ul/li' data='<li class="top">\n            <div>\n     '>
context:
<li class="top">
            <div>
            <div>div of li's div</div>
            </div>
        </li>


<Selector xpath='/html/body/ul/li' data='<li>\n        <p>p of li</p>    \n        '>
context:
<li>
        <p>p of li</p>    
        </li>


<Selector xpath='/html/body/ul/li' data='<li>\n        <a>\n        <div id="li_a_d'>
context:
<li>
        <a>
        <div id="li_a_div">div of li's a    
            
        
    
</div></a></li>




In [3]:
lis = sel.xpath("//li")
for li in lis:
    print(li)

<Selector xpath='//li' data='<li class="top">\n            <div>div of'>
<Selector xpath='//li' data='<li class="top">\n            <div>\n     '>
<Selector xpath='//li' data='<li>\n        <p>p of li</p>    \n        '>
<Selector xpath='//li' data='<li>\n        <a>\n        <div id="li_a_d'>


In [4]:
sel.xpath("//li")[2].xpath('./p')[0]  # the first <p> element inside the third <li>

<Selector xpath='./p' data='<p>p of li</p>'>

In [5]:
sel.xpath("/html/body/ul/li/a/div/text()").extract_first()

"div of li's a    \n            \n        \n    \n"

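More about XPath can be found in the [XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp).

## CSS Selector
In CSS, selectors are patterns used to select the element(s) you want to style. The same patterns can be used to select elements for extraction; as the outputs below show, each CSS query is translated into an XPath expression internally.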

In [6]:
classes = sel.css('.top')
for clas in classes:
    print(clas)
    print('context:')
    print(clas.extract())
    print('\n')

<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' top ')]" data='<li class="top">\n            <div>div of'>
context:
<li class="top">
            <div>div of li</div>
        </li>


<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' top ')]" data='<li class="top">\n            <div>\n     '>
context:
<li class="top">
            <div>
            <div>div of li's div</div>
            </div>
        </li>




In [7]:
ids = sel.css('#li_a_div')
for ID in ids:
    print(ID)
    print("context:")
    print(ID.extract())

<Selector xpath="descendant-or-self::*[@id = 'li_a_div']" data='<div id="li_a_div">div of li\'s a    \n   '>
context:
<div id="li_a_div">div of li's a    
            
        
    
</div>


In [8]:
divs_in_li = sel.css("li div")
for div_in_li in divs_in_li:
    print(div_in_li)

<Selector xpath='descendant-or-self::li/descendant-or-self::*/div' data='<div>div of li</div>'>
<Selector xpath='descendant-or-self::li/descendant-or-self::*/div' data="<div>\n            <div>div of li's div</">
<Selector xpath='descendant-or-self::li/descendant-or-self::*/div' data="<div>div of li's div</div>">
<Selector xpath='descendant-or-self::li/descendant-or-self::*/div' data='<div id="li_a_div">div of li\'s a    \n   '>


In [9]:
li_s_divs = sel.css("li>div")
for li_s_div in li_s_divs:
    print(li_s_div)

<Selector xpath='descendant-or-self::li/div' data='<div>div of li</div>'>
<Selector xpath='descendant-or-self::li/div' data="<div>\n            <div>div of li's div</">


In [10]:
li_and_divs = sel.css("li,div")
for li_and_div in li_and_divs:
    print(li_and_div)

<Selector xpath='descendant-or-self::li | descendant-or-self::div' data='<li class="top">\n            <div>div of'>
<Selector xpath='descendant-or-self::li | descendant-or-self::div' data='<div>div of li</div>'>
<Selector xpath='descendant-or-self::li | descendant-or-self::div' data='<li class="top">\n            <div>\n     '>
<Selector xpath='descendant-or-self::li | descendant-or-self::div' data="<div>\n            <div>div of li's div</">
<Selector xpath='descendant-or-self::li | descendant-or-self::div' data="<div>div of li's div</div>">
<Selector xpath='descendant-or-self::li | descendant-or-self::div' data='<li>\n        <p>p of li</p>    \n        '>
<Selector xpath='descendant-or-self::li | descendant-or-self::div' data='<li>\n        <a>\n        <div id="li_a_d'>
<Selector xpath='descendant-or-self::li | descendant-or-self::div' data='<div id="li_a_div">div of li\'s a    \n   '>


In [11]:
ALL = sel.css('*')
for item in ALL:
    print(item)

<Selector xpath='descendant-or-self::*' data='<html lang="en-US">\n<body>\n    <ul>\n\n   '>
<Selector xpath='descendant-or-self::*' data='<body>\n    <ul>\n\n        <li class="top"'>
<Selector xpath='descendant-or-self::*' data='<ul>\n\n        <li class="top">\n         '>
<Selector xpath='descendant-or-self::*' data='<li class="top">\n            <div>div of'>
<Selector xpath='descendant-or-self::*' data='<div>div of li</div>'>
<Selector xpath='descendant-or-self::*' data='<li class="top">\n            <div>\n     '>
<Selector xpath='descendant-or-self::*' data="<div>\n            <div>div of li's div</">
<Selector xpath='descendant-or-self::*' data="<div>div of li's div</div>">
<Selector xpath='descendant-or-self::*' data='<li>\n        <p>p of li</p>    \n        '>
<Selector xpath='descendant-or-self::*' data='<p>p of li</p>'>
<Selector xpath='descendant-or-self::*' data='<li>\n        <a>\n        <div id="li_a_d'>
<Selector xpath='descendant-or-self::*' data='<a>\n        <div

A fuller reference for CSS selectors is available in the [CSS Selectors Reference](https://www.w3schools.com/cssref/css_selectors.asp).

## pyquery
pyquery lets you run [jQuery](https://www.w3schools.com/jquery/)-style queries on XML and HTML documents from Python. jQuery itself is a JavaScript library.

In [12]:
from pyquery import PyQuery

jpy = PyQuery(text)

In [13]:
print(jpy('.top').text())
print(jpy('.top')[0].text_content())

div of li div of li's div

            div of li
        


In [14]:
print(jpy('.top').attr('class'))

top


In [15]:
items = jpy('li')
for item in items.items():
    print(item.text())

div of li
div of li's div
p of li
div of li's a


In [16]:
for item in items.items():
    print(item.attr('class'))

top
top
None
None
