# Introduction

`cssselect('div.content')` : find all `div` elements whose `class` attribute includes "content".

`find_class('nav')` : find all elements, at any depth within this element, whose `class` attribute includes the given class name.

`text_content()` : return the text of this element and all of its descendants, with the markup removed.

`clean_html()` : strip unwanted or potentially unsafe content (such as scripts and embedded objects) from a page.

In [1]:
from lxml import html
from urllib.request import urlopen

In [2]:
source = urlopen("http://www.google.com").read()

In [3]:
tree = html.document_fromstring(source)

In [4]:
tree?

`(elem).iterlinks()` : an iterator over every link in this element and its descendants, yielding `(element, attribute, link, pos)` tuples.

`get` : read the value of an attribute, as in BeautifulSoup (BS).

`iterdescendants()` : an iterator over all descendants of this element.

`(elem).attrib` : the key-value pairs of the element's attributes.
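As a rough illustration of these four members, here is a sketch run against a small inline document rather than a live page. The markup and the `menu` id are invented for the example.

```python
from lxml import html

# Hypothetical markup; the id and URLs are made up for illustration.
SAMPLE = """
<html><body>
  <div id="menu">
    <a class="gb1" href="/maps">Maps</a>
    <img src="/logo.png">
  </div>
</body></html>
"""

tree = html.document_fromstring(SAMPLE)
menu = tree.get_element_by_id("menu")

# iterdescendants(): every element below `menu`, in document order.
tags = [el.tag for el in menu.iterdescendants()]

# find() is the ElementTree lookup for a direct child with this tag.
anchor = menu.find("a")

# get(): one attribute value, or None when the attribute is absent.
href = anchor.get("href")
missing = anchor.get("title")

# attrib: a dict-like view of all attributes on the element.
attributes = dict(anchor.attrib)

# iterlinks(): (element, attribute, link, pos) for each link found.
links = [(el.tag, attr, url) for el, attr, url, pos in menu.iterlinks()]

print(tags, href, missing, attributes, links)
```

`iterlinks()` reports the `src` of the image as well as the `href` of the anchor, which is why its output on a real page contains more than `<a>` tags.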

In [5]:
tree.cssselect("p.calendar_EventTitle")

[]

In [8]:
tree.iterlinks()

<generator object HtmlMixin.iterlinks at 0x0000021437200660>

In [10]:
[l for l in tree.iterlinks()]

[(<Element style at 0x21434122540>, None, '/images/nav_logo229.png', 847),
 (<Element a at 0x21437206ea0>,
  'href',
  'http://www.google.co.in/imghp?hl=en&tab=wi',
  0),
 (<Element a at 0x21437206f40>,
  'href',
  'http://maps.google.co.in/maps?hl=en&tab=wl',
  0),
 (<Element a at 0x21437206a40>,
  'href',
  'https://play.google.com/?hl=en&tab=w8',
  0),
 (<Element a at 0x2143424a4f0>,
  'href',
  'http://www.youtube.com/?gl=IN&tab=w1',
  0),
 (<Element a at 0x2143424a680>, 'href', 'https://news.google.com/?tab=wn', 0),
 (<Element a at 0x2143424a810>,
  'href',
  'https://mail.google.com/mail/?tab=wm',
  0),
 (<Element a at 0x2143424a590>, 'href', 'https://drive.google.com/?tab=wo', 0),
 (<Element a at 0x2143424a770>,
  'href',
  'https://www.google.co.in/intl/en/about/products?tab=wh',
  0),
 (<Element a at 0x2143424a9a0>,
  'href',
  'http://www.google.co.in/history/optout?hl=en',
  0),
 (<Element a at 0x2143424a540>, 'href', '/preferences?hl=en', 0),
 (<Element a at 0x2143424a950>,

In [19]:
links = list(tree.iterlinks())
# Each tuple is (element, attribute, link, pos); index 2 holds the URL.
for link in links:
    print(link[2])

/images/nav_logo229.png
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/&ec=GAZAAQ
/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png
/search
/advanced_search?hl=en-IN&authuser=0
http://www.google.com/setprefs?sig=0_3nXpbf3fTpku6RQSDuYJ3Ir5X6A%3D&hl=hi&source=homepage&sa=X&ved=0ahUKEwi50a2Vi8DsAhWjyDgGHeQBCl8Q2ZgBCAU
http://www.google.com/setprefs?sig=0_3nXpbf3fTpku6RQSDuYJ3Ir5X6A%3D&hl=bn&source=homepage&sa=X&ved=0ahUKEwi50a2Vi8DsAhWjyDgGHeQBCl8Q2ZgBCAY
http://www.google.com/setprefs?sig=0_3nXpbf3fTpku6RQSDuYJ3Ir5X6A%3D&hl=te&source=homepage

In [29]:
tree.cssselect("a")[1].attrib

{'class': 'gb1', 'href': 'http://maps.google.co.in/maps?hl=en&tab=wl'}

In [31]:
source = urlopen("http://www.python.org").read()
tree = html.document_fromstring(source)
tree

<Element html at 0x214363efa40>

##### How many paragraphs are there on the page?

In [40]:
len(tree.cssselect("p"))

23

##### What is the text content of the div with the class "shrubbery"? What are the links in that same div?

In [49]:
for text in tree.cssselect("div.shrubbery"):
    print(text.text_content())


                        
                            Latest News
                            More
                            
                            
                                
                                
                                
2020-10-05
 Python 3.9.0 is now available, and you can already test 3.10.0a1!
                                
                                
2020-10-02
 Python 3.5 is no longer supported
                                
                                
2020-10-02
 Join the Python Developers Survey 2020: Share and learn about the community
                                
                                
2020-09-24
 Python 3.8.6 is now available
                                
                                
2020-09-22
 The Python Software Foundation re-opens its Grants Program!
                                
                            
                        

                        
                            Upcoming Even

##### What if we want all the anchor tags?

In [50]:
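# Note: iterlinks() is called on the whole tree here, so this prints every link
# in the document (stylesheets, scripts, images, anchors), once per matching div.
# Calling iterlinks() on the div itself would restrict the result to that div.
for div in tree.cssselect("div.shrubbery"):
    print([l for l in tree.iterlinks()])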

[(<Element link at 0x21436669270>, 'href', '//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js', 0), (<Element script at 0x214366966d0>, 'src', '/static/js/libs/modernizr.js', 0), (<Element link at 0x214366695e0>, 'href', '/static/stylesheets/style.30afed881237.css', 0), (<Element link at 0x2143660c1d0>, 'href', '/static/stylesheets/mq.eef77a5d2257.css', 0), (<Element link at 0x21436647770>, 'href', '/static/favicon.ico', 0), (<Element link at 0x21436647e50>, 'href', '/static/apple-touch-icon-144x144-precomposed.png', 0), (<Element link at 0x21436647720>, 'href', '/static/apple-touch-icon-114x114-precomposed.png', 0), (<Element link at 0x214366479a0>, 'href', '/static/apple-touch-icon-72x72-precomposed.png', 0), (<Element link at 0x21436647a90>, 'href', '/static/apple-touch-icon-precomposed.png', 0), (<Element link at 0x21436647900>, 'href', '/static/apple-touch-icon-precomposed.png', 0), (<Element link at 0x21436647ea0>, 'href', '/static/humans.txt', 0), (<Element link at 0x21
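The output above begins with `<link>` and `<script>` elements from the page head, because `iterlinks()` was called on the whole tree. A minimal sketch of scoping the search to one div, and then to anchor tags only, might look like this. The sample markup is invented, and `find_class("shrubbery")` is used in place of `cssselect("div.shrubbery")` so that the example does not need the `cssselect` package.

```python
from lxml import html

# Hypothetical page with links both inside and outside the target div.
SAMPLE = """
<html><head><link rel="stylesheet" href="/static/style.css"></head>
<body>
  <a href="/about">About</a>
  <div class="shrubbery">
    <a href="/news/1">First item</a>
    <img src="/static/pic.png">
    <a href="/news/2">Second item</a>
  </div>
</body></html>
"""

tree = html.document_fromstring(SAMPLE)

# Every link in the document, including the stylesheet in <head>.
all_links = [url for _, _, url, _ in tree.iterlinks()]

for div in tree.find_class("shrubbery"):
    # Called on the div, iterlinks() only walks that div's subtree.
    div_links = [url for _, _, url, _ in div.iterlinks()]
    # iter("a") keeps anchor tags only, dropping the image.
    div_anchors = [a.get("href") for a in div.iter("a")]
    print(div_links)
    print(div_anchors)
```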

##### What is the text in the code elements?

In [52]:
for code in tree.cssselect("code"):
    print(code.text_content())

# Python 3: Fibonacci series up to n
>>> def fib(n):
>>>     a, b = 0, 1
>>>     while a < n:
>>>         print(a, end=' ')
>>>         a, b = b, a+b
>>>     print()
>>> fib(1000)
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
# Python 3: List comprehensions
>>> fruits = ['Banana', 'Apple', 'Lime']
>>> loud_fruits = [fruit.upper() for fruit in fruits]
>>> print(loud_fruits)
['BANANA', 'APPLE', 'LIME']

# List and the enumerate function
>>> list(enumerate(fruits))
[(0, 'Banana'), (1, 'Apple'), (2, 'Lime')]
# Python 3: Simple arithmetic
>>> 1 / 2
0.5
>>> 2 ** 3
8
>>> 17 / 3  # classic division returns a float
5.666666666666667
>>> 17 // 3  # floor division
5
+
-
*
/
()
# Python 3: Simple output (with Unicode)
>>> print("Hello, I'm Python!")
Hello, I'm Python!

# Input, assignment
>>> name = input('What is your name?\n')
>>> print('Hi, %s.' % name)
What is your name?
Python
Hi, Python.
# For loop on a list
>>> numbers = [2, 4, 6, 8]
>>> product = 1
>>

In [53]:
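# OR, using .text: this returns only the text before an element's first child,
# so elements that begin with a nested tag give None.
[t.text for t in tree.cssselect("code")]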

[None,
 None,
 None,
 '+',
 '-',
 '*',
 '/',
 '()',
 None,
 None,
 'if',
 'for',
 'while',
 'range']
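The `None` entries appear because `.text` holds only the text that comes before an element's first child, while `text_content()` gathers text from the whole subtree. A small sketch of the difference, using invented markup and `tree.iter("code")` (which selects the same elements as `tree.cssselect("code")` without needing the `cssselect` package):

```python
from lxml import html

# Hypothetical markup: the second <code> starts with a nested <span>.
SAMPLE = """
<html><body>
  <p><code>range</code></p>
  <p><code><span class="comment"># a comment</span> print(1)</code></p>
</body></html>
"""

tree = html.document_fromstring(SAMPLE)
codes = list(tree.iter("code"))

# .text: text before the first child element, or None if there is none.
texts = [c.text for c in codes]

# text_content(): all text in the element and its descendants.
full_texts = [c.text_content() for c in codes]

print(texts)
print(full_texts)
```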

##### Are there any forms?

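In [59]:
# Caution: this only checks whether the word "form" appears in the page text,
# not whether the page contains a <form> element.
"form" in tree.text_content()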

True

In [60]:
# OR
tree.forms

[<Element form at 0x21436696b30>]

In [61]:
# OR
tree.cssselect("form")

[<Element form at 0x21436696b30>]
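Of the three checks, the text search answers a different question from the other two: it looks for the word "form" in the visible text, not for a `<form>` element. The sketch below, built on two made-up documents, shows the checks disagreeing and then reads a few properties from a form found via `.forms`.

```python
from lxml import html

# Hypothetical documents: one with a real form, one that only mentions the word.
WITH_FORM = """
<html><body>
  <form action="/search" method="get">
    <input type="text" name="q" value="lxml">
    <input type="hidden" name="lang" value="en">
  </form>
</body></html>
"""
NO_FORM = "<html><body><p>Please fill in the form on paper.</p></body></html>"

with_form = html.document_fromstring(WITH_FORM)
no_form = html.document_fromstring(NO_FORM)

# The text search reports a form where there is none, and misses the real one.
word_in_no_form = "form" in no_form.text_content()
word_in_with_form = "form" in with_form.text_content()

# .forms looks for actual <form> elements.
count_no_form = len(no_form.forms)
count_with_form = len(with_form.forms)

# A form element exposes its action, method and named fields.
form = with_form.forms[0]
field_names = sorted(form.fields.keys())

print(word_in_no_form, word_in_with_form, count_no_form, count_with_form)
print(form.action, form.method, field_names)
```

For detecting forms, `tree.forms` or `tree.cssselect("form")` is therefore the safer choice.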