# Cracking the Code: How to Read HTML, CSS, and XPath

In [1]:
import lxml.html
from bs4 import BeautifulSoup

## Intro to HTML

In [2]:
sample_html = """
<html>
<head>
    <title>Web Scraping Tools</title>
</head>
<body>
    <div id="header">
        <h1>Tools</h1>
    </div>
    <div id="content">
        <p>Web scraping is awesome, you should check out these libraries. Learn more at <a href="http://www.pyrva.org/">PyRVA!</a></p>
        <ul>
            <li class="data-getter">urllib</li>
            <li class="data-getter">Requests</li>
            <li class="data-parser">lxml</li>
            <li class="data-parser">BeautifulSoup</li>
            <li class="website-tester">Selenium</li>
            <li class="scraping-framework">Scrapy</li>
        </ul>
    </div>
</body>
</html>
"""
html = lxml.html.fromstring(sample_html)
soup = BeautifulSoup(sample_html, 'html.parser')

Here is what the HTML in `sample_html` looks like when a browser renders it.

---

<html>
<head>
    <title>Web Scraping Tools</title>
</head>
<body>
    <div id="header">
        <h1>Tools</h1>
    </div>
    <div id="content">
        <p>Web scraping is awesome, you should check out these libraries. Learn more at <a href="http://www.pyrva.org/">PyRVA!</a></p>
        <ul>
            <li class="data-getter">urllib</li>
            <li class="data-getter">Requests</li>
            <li class="data-parser">lxml</li>
            <li class="data-parser">BeautifulSoup</li>
            <li class="website-tester">Selenium</li>
            <li class="scraping-framework">Scrapy</li>
        </ul>
    </div>
</body>
</html>

---
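
To connect the markup to what the parsers actually work with, here is a minimal sketch that parses a trimmed-down copy of the sample and inspects one element. The `snippet` string and the variable names are illustrative assumptions, not part of the original notebook.

```python
import lxml.html

# A trimmed-down copy of the sample markup (assumed for illustration).
snippet = """
<div id="content">
    <p>Learn more at <a href="http://www.pyrva.org/">PyRVA!</a></p>
    <ul>
        <li class="data-getter">Requests</li>
        <li class="data-parser">lxml</li>
    </ul>
</div>
"""

# A fragment with a single top-level element parses to that element.
root = lxml.html.fromstring(snippet)

print(root.tag)           # div
print(dict(root.attrib))  # {'id': 'content'}

# Iterating over an element yields its direct child elements.
children = [child.tag for child in root]
print(children)           # ['p', 'ul']
```

Every element has a tag name, a dictionary of attributes, and an ordered list of children. CSS selectors and XPath expressions are two ways of describing positions in that tree.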

## Intro to CSS Selectors

For more, look at the [CSS Selectors Level 4](https://www.w3.org/TR/selectors-4/) documentation.

In [3]:
for elem in soup.select('h1'):
    print(elem.text)

Tools


In [4]:
for elem in soup.select('p'):
    print(elem.text)

Web scraping is awesome, you should check out these libraries. Learn more at PyRVA!


In [5]:
for elem in soup.select('li'):
    print(elem.text)

urllib
Requests
lxml
BeautifulSoup
Selenium
Scrapy


In [6]:
for elem in soup.select('li.data-parser'):
    print(elem.text)

lxml
BeautifulSoup


In [7]:
for elem in soup.select('.data-parser'):
    print(elem.text)

lxml
BeautifulSoup
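
`li.data-parser` and `.data-parser` return the same elements above only because every element carrying that class happens to be an `<li>`. A small sketch of where they diverge, using a hypothetical snippet in which a `<p>` shares the class:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> and two <li> elements share one class.
snippet = """
<p class="data-parser">Parsers turn markup into a tree.</p>
<ul>
    <li class="data-parser">lxml</li>
    <li class="data-parser">BeautifulSoup</li>
</ul>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# Tag plus class: only <li> elements with the class.
li_only = [e.text for e in demo_soup.select("li.data-parser")]
# Class alone: any element with the class, whatever its tag.
any_tag = [e.text for e in demo_soup.select(".data-parser")]

print(li_only)  # ['lxml', 'BeautifulSoup']
print(any_tag)  # ['Parsers turn markup into a tree.', 'lxml', 'BeautifulSoup']
```

Qualifying the class with a tag name is usually the safer choice when scraping pages you do not control.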


In [8]:
for elem in soup.select('#content'):
    print(elem.text)


Web scraping is awesome, you should check out these libraries. Learn more at PyRVA!

urllib
Requests
lxml
BeautifulSoup
Selenium
Scrapy




In [9]:
for elem in soup.select('a'):
    print(elem.text)
    print(elem['href'])

PyRVA!
http://www.pyrva.org/
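
The selectors so far match on a tag, a class, or an id. They can also be combined. The sketch below is one possible illustration on a shortened copy of the sample (the `snippet` and variable names are assumptions), showing the descendant and child combinators, an attribute selector, and `select_one`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup (assumed for illustration).
snippet = """
<div id="content">
    <p>Learn more at <a href="http://www.pyrva.org/">PyRVA!</a></p>
    <ul>
        <li class="data-getter">Requests</li>
        <li class="data-parser">lxml</li>
        <li class="website-tester">Selenium</li>
    </ul>
</div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# Descendant combinator: any <li> nested anywhere inside #content.
nested = [e.text for e in demo_soup.select("#content li")]
# Child combinator: <li> elements that are direct children of a <ul>.
direct = [e.text for e in demo_soup.select("ul > li")]
# Attribute selector: class attribute value starts with "data-".
data_tools = [e.text for e in demo_soup.select('li[class^="data-"]')]

# select_one returns the first match, or None when nothing matches.
link = demo_soup.select_one("a[href]")
missing = demo_soup.select_one("table")

print(nested)            # ['Requests', 'lxml', 'Selenium']
print(data_tools)        # ['Requests', 'lxml']
print(link.get("href"))  # http://www.pyrva.org/
print(missing)           # None
```

`elem.get('href')` returns `None` for a missing attribute, whereas `elem['href']` raises `KeyError`, so `get` is often the more forgiving option on messy pages.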


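## Intro to XPath

For more, look at the [XML Path Language (XPath) 3.1](https://www.w3.org/TR/2017/REC-xpath-31-20170321/) documentation. Note that lxml's `xpath()` method implements XPath 1.0 (via libxml2), so features introduced in later versions of the language are not available in the examples below.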

In [10]:
for elem in html.xpath('//h1/text()'):
    print(elem)

Tools


In [11]:
for elem in html.xpath('//p/text()'):
    print(elem)

Web scraping is awesome, you should check out these libraries. Learn more at 
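
Notice that "PyRVA!" is missing from the output: `//p/text()` selects only the text nodes that are direct children of `<p>`, and the link text belongs to the nested `<a>`. A brief sketch of two ways to get the full text, using a shortened, assumed copy of the paragraph:

```python
import lxml.html

# Shortened copy of the sample paragraph (assumed for illustration).
snippet = '<p>Learn more at <a href="http://www.pyrva.org/">PyRVA!</a></p>'
p = lxml.html.fromstring(snippet)

# text() returns only the direct text children of <p>.
direct_text = p.xpath("text()")
# string() concatenates the text of the element and all its descendants.
full_text = p.xpath("string()")
# text_content() is the lxml.html method that does the same thing.
also_full = p.text_content()

print(direct_text)  # ['Learn more at ']
print(full_text)    # Learn more at PyRVA!
```

This is also why the BeautifulSoup `elem.text` output for `p` included the link text: `.text` gathers text from all descendants.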


In [12]:
for elem in html.xpath('//li/text()'):
    print(elem)

urllib
Requests
lxml
BeautifulSoup
Selenium
Scrapy


In [13]:
for elem in html.xpath("//li[@class='data-parser']/text()"):
    print(elem)

lxml
BeautifulSoup


In [14]:
for elem in html.xpath("//*[@class='data-parser']/text()"):
    print(elem)

lxml
BeautifulSoup
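
One difference from CSS is worth keeping in mind: `@class='data-parser'` compares the whole attribute string, while the CSS selector `.data-parser` matches one class among several. The sketch below shows a common XPath 1.0 idiom for token matching. The markup is hypothetical, including the made-up `featured` class and `OldParser` entry.

```python
import lxml.html

# Hypothetical markup: one element has two classes, and another has a
# class name that merely starts with "data-parser".
snippet = """
<ul>
    <li class="data-parser">lxml</li>
    <li class="data-parser featured">BeautifulSoup</li>
    <li class="data-parser-legacy">OldParser</li>
</ul>
"""
ul = lxml.html.fromstring(snippet)

# Exact string comparison: misses the element with two classes.
exact = ul.xpath(".//li[@class='data-parser']/text()")

# Token match: pad the attribute with spaces, then look for ' data-parser '.
token_expr = (
    ".//li[contains(concat(' ', normalize-space(@class), ' '), "
    "' data-parser ')]/text()"
)
token = ul.xpath(token_expr)

print(exact)  # ['lxml']
print(token)  # ['lxml', 'BeautifulSoup']
```

A plain `contains(@class, 'data-parser')` would also match `data-parser-legacy`, which is why the padded form is generally preferred.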


In [15]:
for elem in html.xpath("//*[@id='content']//text()"):
    print(elem)


        
Web scraping is awesome, you should check out these libraries. Learn more at 
PyRVA!

        

            
urllib

            
Requests

            
lxml

            
BeautifulSoup

            
Selenium

            
Scrapy

        

    


In [16]:
for elem in html.xpath("//a/@href"):
    print(elem)

http://www.pyrva.org/
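
The `//text()` query against `#content` earlier returned many whitespace-only strings, because the indentation between tags is itself made of text nodes. Here is a minimal sketch of two ways to filter them out, on a shortened, assumed copy of the sample:

```python
import lxml.html

# Shortened copy of the sample markup (assumed for illustration).
snippet = """
<div id="content">
    <p>Learn more at <a href="http://www.pyrva.org/">PyRVA!</a></p>
    <ul>
        <li>urllib</li>
        <li>Requests</li>
    </ul>
</div>
"""
content = lxml.html.fromstring(snippet)

# Option 1: filter in Python, dropping blank nodes and trimming the rest.
raw = content.xpath(".//text()")
cleaned = [t.strip() for t in raw if t.strip()]

# Option 2: filter in XPath, keeping only nodes with non-blank content.
# The surviving strings are not trimmed.
non_blank = content.xpath(".//text()[normalize-space()]")

print(cleaned)         # ['Learn more at', 'PyRVA!', 'urllib', 'Requests']
print(len(non_blank))  # 4
```

Filtering in Python tends to be easier to read, while the XPath predicate keeps the whole rule inside a single expression.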
