# Beautiful Soup

Beautiful Soup is a library for pulling data out of HTML and XML files. It provides ways of navigating, searching, and modifying parse trees.

In [1]:
from bs4 import BeautifulSoup

In [2]:
import requests

In [3]:
url = "http://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

In [4]:
soup = BeautifulSoup(demo, "html.parser")

In [5]:
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>


ML stands for Markup Language.

Beautiful Soup parses markup languages such as HTML and XML.

An HTML document corresponds to a tag tree, and the tag tree corresponds to a BeautifulSoup object.
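As a minimal sketch of that correspondence, the snippet below parses a short inline HTML string rather than fetching the demo page, so it runs without network access. The `html` string is a trimmed, hypothetical stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for the demo page (no network call needed).
html = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body><p class="title"><b>The demo python introduces several python courses.</b></p>'
    '</body></html>'
)

soup = BeautifulSoup(html, "html.parser")

# The BeautifulSoup object is the root of the tag tree.
root_name = soup.name            # special name of the root node
title_text = soup.title.string   # text enclosed by <title>
first_p_class = soup.p["class"]  # class is multi-valued, so it is a list

print(root_name, title_text, first_p_class)
```

Accessing `soup.title` or `soup.p` returns the first matching tag found in the tree.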

In [12]:
soup.__dict__.keys()

dict_keys(['element_classes', 'builder', 'is_xml', 'known_xml', '_namespaces', 'parse_only', 'markup', 'original_encoding', 'declared_html_encoding', 'contains_replacement_characters', 'parser_class', 'name', 'namespace', 'prefix', 'attrs', 'contents', 'parent', 'previous_element', 'next_element', 'next_sibling', 'previous_sibling', 'hidden', 'can_be_empty_element', 'cdata_list_attributes', 'preserve_whitespace_tags', 'interesting_string_types', 'current_data', 'currentTag', 'tagStack', 'open_tag_counter', 'preserve_whitespace_tag_stack', 'string_container_stack', '_most_recent_element'])

In [18]:
print(soup.a)

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>


The soup object holds the tags in a tree structure.

In [22]:
soup.a.name

'a'

In [23]:
soup.a.parent.name

'p'

In [24]:
soup.a.parent.parent.name

'body'

In [25]:
tag = soup.a

A tag is a node in the tree that holds attributes, available as a dictionary through attrs.

In [26]:
tag.attrs

{'href': 'http://www.icourse163.org/course/BIT-268001',
 'class': ['py1'],
 'id': 'link1'}
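The attribute dictionary can also be read directly from the tag. The sketch below assumes a single inline `<a>` tag; the second class value `external` is a hypothetical addition that shows why `class` comes back as a list.

```python
from bs4 import BeautifulSoup

# Inline tag for illustration; the "external" class is a made-up extra value.
html = '<a class="py1 external" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]          # dictionary-style access; raises KeyError if absent
classes = tag["class"]      # multi-valued attribute, returned as a list of strings
missing = tag.get("title")  # .get() returns None instead of raising
has_id = tag.has_attr("id") # explicit membership test

print(href, classes, missing, has_id)
```

Using `.get()` is usually the safer choice when an attribute may be missing on some tags.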

In [27]:
type(tag)

bs4.element.Tag

In [28]:
type(soup)

bs4.BeautifulSoup

The string is the text enclosed by a tag.

In [29]:
soup.a.string

'Basic Python'
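One caveat worth illustrating: `.string` only returns a value when the tag has a single child. The sketch below uses a shortened, hypothetical version of the course paragraph to contrast `.string` with `get_text()`.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the "course" paragraph.
html = '<p class="course">Learn <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

single = soup.a.string          # <a> has exactly one child, so this is its text
mixed = soup.p.string           # <p> has several children, so this is None
full_text = soup.p.get_text()   # concatenates every string under <p>

print(repr(single), repr(mixed), repr(full_text))
```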

A comment's string is returned without the comment markers, so the only way to tell it apart from ordinary text is to check its type.

In [30]:
newsoup = BeautifulSoup("<b><!--this is a comment--></b><p>This is not a comment</p>","html.parser")

In [31]:
newsoup.b.string

'this is a comment'

In [32]:
type(newsoup.b.string)

bs4.element.Comment

In [33]:
newsoup.p.string

'This is not a comment'

In [34]:
type(newsoup.p.string)

bs4.element.NavigableString
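The type check can be wrapped in a small helper. This is a sketch only: `visible_string` is a hypothetical name, not part of the bs4 API.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--this is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def visible_string(tag):
    """Return the tag's string as plain text, or None if it is a comment or absent."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

print(visible_string(newsoup.b))  # the <b> tag holds only a comment
print(visible_string(newsoup.p))  # the <p> tag holds ordinary text
```

`Comment` is a subclass of `NavigableString`, so the check has to test for `Comment` specifically.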

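Wrap-up: a tag is reached with ```soup.<tag>``` and exposes:

-   ```.name```: the tag's name
-   ```.attrs```: its attributes as a dictionary
-   ```.string```: the text it encloses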

## Navigation in the tag tree

Downward navigation

In [42]:
soup.head

<head><title>This is a python demo page</title></head>

contents holds a list of the tag's direct children.

In [44]:
soup.head.contents

[<title>This is a python demo page</title>]

In [45]:
soup.body.contents

['\n',
 <p class="title"><b>The demo python introduces several python courses.</b></p>,
 '\n',
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>,
 '\n']
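The newline strings in that list are children too, which often matters when indexing. A minimal sketch, assuming a hypothetical two-paragraph body, that keeps only the child tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical body with newlines between the paragraphs, as in the demo page.
html = '<body>\n<p class="title">A</p>\n<p class="course">B</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")

all_children = soup.body.contents  # includes the '\n' strings
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

print(len(all_children), [t["class"][0] for t in tag_children])
```

`.children` iterates over the same nodes as `.contents` but does not build a list.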

In [49]:
soup.head.descendants

<generator object Tag.descendants at 0x0000018C3EDCDD48>

In [50]:
for descendant in soup.head.descendants:
    print(descendant)

<title>This is a python demo page</title>
This is a python demo page


In [51]:
for descendant in soup.body.descendants:
    print(descendant)



<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.




In [55]:
type(soup.head.contents[0])

bs4.element.Tag

In [56]:
soup.head.contents[0].contents

['This is a python demo page']
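To make the difference between direct children and all descendants concrete, the sketch below walks a small hypothetical tree and separates tags from strings. Descendants are visited depth-first, in document order.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Small hypothetical tree: two paragraphs, one with a nested link.
html = '<body><p class="title"><b>Intro</b></p><p class="course">Text <a id="link1">Basic</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

direct = len(soup.body.contents)  # only the two <p> tags
tag_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
strings = [str(d) for d in soup.body.descendants if isinstance(d, NavigableString)]

print(direct, tag_names, strings)
```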

Upward navigation

In [58]:
for ancestor in soup.body.parents:
    print(ancestor)

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870
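Printing each ancestor in full is hard to read, so a sketch that prints only the names may be clearer. It assumes a hypothetical one-link document. The last ancestor is the BeautifulSoup object itself, which is why the loop above prints the whole document twice.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with a single nested link.
html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents walks from the immediate parent up to the root object.
ancestor_names = [parent.name for parent in soup.a.parents]
root_parent = soup.parent  # the root has no parent

print(ancestor_names, root_parent)
```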

Horizontal navigation

Horizontal navigation moves between nodes at the same level, i.e. siblings that share the same parent.

In [60]:
soup.p.next_sibling

'\n'
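Since the nearest sibling is often a whitespace string, as above, reaching the next tag takes one more step. A minimal sketch, assuming a hypothetical two-paragraph body:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical body with newlines between sibling paragraphs.
html = '<body>\n<p class="title">A</p>\n<p class="course">B</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

raw_next = first_p.next_sibling            # the '\n' string, not a tag
next_tag = first_p.find_next_sibling("p")  # skips strings, returns the next <p>
tag_siblings = [s for s in first_p.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), next_tag["class"], len(tag_siblings))
```

`previous_sibling`, `previous_siblings`, and `find_previous_sibling` work the same way in the other direction.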

prettify returns the HTML source reformatted with each tag and string on its own line, indented by nesting depth.

In [63]:
print(soup.p)

<p class="title"><b>The demo python introduces several python courses.</b></p>


In [64]:
print(soup.p.prettify())

<p class="title">
 <b>
  The demo python introduces several python courses.
 </b>
</p>
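The two renderings can also be compared in code. The sketch below parses the title paragraph from an inline string and checks that prettifying changes only the whitespace, not the content.

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(html, "html.parser")

compact = str(soup.p)        # single-line form, as printed by print(soup.p)
pretty = soup.p.prettify()   # multi-line, indented form

# Re-parse the pretty form and strip the added whitespace around the text.
reparsed_text = BeautifulSoup(pretty, "html.parser").b.get_text().strip()

print(compact)
print(pretty)
```

Because prettify adds whitespace around strings, it is better suited to inspection than to producing output that will be parsed again.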

