# Cracking The Code: How to Read HTML, CSS, and XPath

In [1]:
import lxml.html

## Intro to HTML

In [2]:
sample_html = """
<html>
<head>
    <title>Web Scraping Tools</title>
</head>
<body>
    <div id="header">
        <h1>Tools</h1>
    </div>
    <div id="content">
        <p>Web scraping is awesome, you should check out these libraries.</p>
        <ul>
            <li class="data-getter">urllib</li>
            <li class="data-getter">Requests</li>
            <li class="data-parser">lxml</li>
            <li class="data-parser">BeautifulSoup</li>
            <li class="website-tester">Selenium</li>
            <li class="scraping-framework">Scrapy</li>
        </ul>
    </div>
</body>
</html>
"""
html = lxml.html.fromstring(sample_html)
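
To see what `lxml.html.fromstring` produces, it can help to walk the parsed tree. The following is a minimal sketch on a shortened copy of the sample markup; the names `snippet`, `tree`, and `tags` are illustrative and not part of the original notebook.

```python
import lxml.html

# A shortened copy of the sample document (illustrative only).
snippet = """
<html>
<head><title>Web Scraping Tools</title></head>
<body>
    <div id="header"><h1>Tools</h1></div>
    <div id="content">
        <ul>
            <li class="data-getter">Requests</li>
            <li class="data-parser">lxml</li>
        </ul>
    </div>
</body>
</html>
"""
tree = lxml.html.fromstring(snippet)

# iter() walks every element in document order; each element exposes
# its tag name and a dict-like mapping of its attributes.
tags = [(elem.tag, dict(elem.attrib)) for elem in tree.iter()]
for tag, attrs in tags:
    print(tag, attrs)
```

Each printed line pairs a tag name with its attributes, which is the same structure that the CSS and XPath selectors in the following sections query.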

Here is what the HTML in `sample_html` looks like when a browser renders it.

---

<html>
<head>
    <title>Web Scraping Tools</title>
</head>
<body>
    <div id="header">
        <h1>Tools</h1>
    </div>
    <div id="content">
        <p>Web scraping is awesome, you should check out these libraries.</p>
        <ul>
            <li class="data-getter">urllib</li>
            <li class="data-getter">Requests</li>
            <li class="data-parser">lxml</li>
            <li class="data-parser">BeautifulSoup</li>
            <li class="website-tester">Selenium</li>
            <li class="scraping-framework">Scrapy</li>
        </ul>
    </div>
</body>
</html>

---

## Intro to CSS Selectors

For more, look at the [CSS Selectors Level 4](https://www.w3.org/TR/selectors-4/) documentation.

In [3]:
def css_select(selector):
    # Print the leading text of every element matched by the CSS selector.
    for elem in html.cssselect(selector):
        print(elem.text)

In [4]:
css_select('h1')

Tools


In [5]:
css_select('p')

Web scraping is awesome, you should check out these libraries.


In [6]:
css_select('li')

urllib
Requests
lxml
BeautifulSoup
Selenium
Scrapy


In [7]:
css_select('li.data-parser')

lxml
BeautifulSoup


In [8]:
css_select('.data-parser')

lxml
BeautifulSoup


In [9]:
# get() reads an attribute of the root <html> element, not an element by id, so this prints None.
print(html.get('content'))

None
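
The `None` result is expected: `get()` returns the value of an attribute on the element it is called on, and the root `<html>` element has no `content` attribute. To fetch the element whose id is `content`, one option is `get_element_by_id()` from `lxml.html`. The sketch below uses a small stand-in document; the `lang` attribute and the variable names are assumptions added for illustration.

```python
import lxml.html

# Stand-in document; the lang attribute is added only for this example.
doc = lxml.html.fromstring("""
<html lang="en"><body>
  <div id="header"><h1>Tools</h1></div>
  <div id="content"><p>Web scraping is awesome.</p></div>
</body></html>
""")

# get() looks up an attribute on the element itself.
lang = doc.get('lang')        # attribute exists on <html>
missing = doc.get('content')  # no such attribute, so None

# get_element_by_id() searches the tree for a matching id attribute.
content = doc.get_element_by_id('content')
content_id = content.get('id')
content_text = content.text_content().strip()

print(lang, missing, content_id, content_text)
```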


## Intro to XPath

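For more, look at the [XML Path Language (XPath) 3.1](https://www.w3.org/TR/2017/REC-xpath-31-20170321/) documentation. Note that lxml evaluates XPath 1.0 expressions, so features added in later versions of the specification are not available here.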

In [10]:
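def xpath_select(selector):
    # Append //text() so the query returns every descendant text node,
    # including whitespace-only nodes between tags.
    for elem in html.xpath(selector + '//text()'):
        print(elem)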
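
Because this helper appends `//text()`, it returns whitespace-only text nodes (the indentation between tags) along with the visible text. A minimal sketch of two ways to drop that whitespace follows, assuming a small stand-in document; `raw`, `cleaned`, and `normalized` are illustrative names.

```python
import lxml.html

# Stand-in document with indentation between the tags.
doc = lxml.html.fromstring("""
<html><body>
  <div id="header">
    <h1>Tools</h1>
  </div>
</body></html>
""")

# Every descendant text node, including the indentation around <h1>.
raw = doc.xpath("//*[@id='header']//text()")

# Option 1: filter out whitespace-only nodes in Python.
cleaned = [t.strip() for t in raw if t.strip()]

# Option 2: let XPath collapse whitespace; normalize-space() returns
# a single string built from the first matching node.
normalized = doc.xpath("normalize-space(//*[@id='header'])")

print(raw)
print(cleaned)
print(normalized)
```

Filtering in Python keeps one entry per text node, while `normalize-space()` returns a single string, so the better choice depends on whether the node boundaries matter.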

In [11]:
xpath_select("//h1")

Tools


In [12]:
xpath_select("//p")

Web scraping is awesome, you should check out these libraries.


In [13]:
xpath_select("//li")

urllib
Requests
lxml
BeautifulSoup
Selenium
Scrapy


In [14]:
xpath_select("//li[@class='data-parser']")

lxml
BeautifulSoup


In [15]:
xpath_select("//*[@class='data-parser']")

lxml
BeautifulSoup
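
One difference from the CSS version is worth noting: `@class='data-parser'` compares the whole attribute string, while the CSS selector `.data-parser` matches any element that has that class among several. The sketch below shows a common XPath 1.0 idiom for matching a single class token. The extra `popular` class is a hypothetical addition to the sample data, included only to expose the difference.

```python
import lxml.html

# The second <li> carries two classes (hypothetical variation of the sample).
doc = lxml.html.fromstring("""
<html><body><ul>
  <li class="data-parser">lxml</li>
  <li class="data-parser popular">BeautifulSoup</li>
  <li class="website-tester">Selenium</li>
</ul></body></html>
""")

# Exact string comparison: misses the element with two classes.
exact = doc.xpath("//li[@class='data-parser']/text()")

# Token match: pad the class list with spaces and look for ' data-parser '.
token = doc.xpath(
    "//li[contains(concat(' ', normalize-space(@class), ' '), "
    "' data-parser ')]/text()"
)

print(exact)
print(token)
```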


In [16]:
xpath_select("//*[@id='header']")


        
Tools

    


In [17]:
xpath_select("//*[@id='content']")


        
Web scraping is awesome, you should check out these libraries.

        

            
urllib

            
Requests

            
lxml

            
BeautifulSoup

            
Selenium

            
Scrapy

        

    
