### Create Objects with Beautiful Soup

- based on vprusso tutorial: https://www.youtube.com/watch?v=oDtLJEc5Ako

- github: https://github.com/vprusso/youtube_tutorials/blob/master/web_scraping_and_automation/beautiful_soup/beautiful_soup_objects.py


- objects reviewed: Tag, NavigableString, BeautifulSoup, and Comment
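
As a quick orientation, the four object types listed above can be seen side by side in a minimal sketch. The `snippet` string is a hypothetical example, not the tutorial's `html_doc`, and the built-in `html.parser` is used here only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical snippet (not the tutorial's html_doc)
snippet = "<p>Hello<!-- a hidden note --></p>"
soup = BeautifulSoup(snippet, "html.parser")

tag = soup.p                # a Tag
text = tag.contents[0]      # a NavigableString (the text inside the tag)
comment = tag.contents[1]   # a Comment (a special kind of NavigableString)

print(type(soup).__name__)
print(type(tag).__name__)
print(type(text).__name__)
print(type(comment).__name__)
```

`Comment` is a subclass of `NavigableString`, so `isinstance(comment, NavigableString)` is also true.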

In [4]:
from bs4 import BeautifulSoup

import os

In [2]:
# This tutorial uses the following html

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; their names:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<b class="boldest">Extremely bold</b>
<blockquote class="boldest">Extremely bold</blockquote>
<b id="1">Test 1</b>
<b another-attribute="1" id="verybold">Test 2</b>
"""

In [6]:
# write html_doc to a file, then use Beautiful Soup to create a soup object
# the file is saved in the working directory (check the path with os.getcwd())

with open('index.html', 'w') as f:
    f.write(html_doc)

bs = BeautifulSoup(html_doc, "lxml")

In [11]:
# view the bs object
# there are 2 options below: print(bs) shows the markup as parsed,
#    while prettify() adds line breaks and indentation
# prettify makes it easier to identify tags like a-tags, headers, etc

#print(bs)
#print(bs.prettify())
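
# The difference between the two options is easier to see on a small input.
# This is a sketch on a hypothetical one-line snippet (not html_doc), using
# the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not the tutorial's html_doc)
snippet = "<p class='title'><b>Short story</b></p>"
soup = BeautifulSoup(snippet, "html.parser")

compact = str(soup)        # markup on a single line
pretty = soup.prettify()   # one tag or string per line, indented by depth

print(compact)
print(pretty)
```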

### Print Tags

In [16]:
# print first occurrence of content wrapped in bold tags and p tags

print(bs.b)
print(bs.p)

# alternatively, we can use the find function to do the same thing
print(bs.find('b'))  #note the quotes around b
print(bs.find('p'))

<b>The Dormouse's story</b>
<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
<p class="title"><b>The Dormouse's story</b></p>
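
Both access styles behave the same way when the tag does not exist: they return `None` rather than raising an error. The sketch below assumes a hypothetical snippet and the built-in `html.parser`; the guard at the end is one possible way to avoid an `AttributeError` when chaining.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not the tutorial's html_doc)
snippet = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(snippet, "html.parser")

missing_attr = soup.table          # attribute-style access
missing_find = soup.find("table")  # find() call

print(missing_attr, missing_find)

# Guard before chaining, because None has no .name or .get_text()
table = soup.find("table")
table_text = table.get_text() if table is not None else ""
print(repr(table_text))
```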


In [32]:
# print all the content wrapped in a specified tag
# produces a list

print(bs.find_all('p'))
print(bs.find_all('b'))

print(len(bs.find_all('p')))  # number of elements in the list

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; their names:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
[<b>The Dormouse's story</b>, <b class="boldest">Extremely bold</b>, <b id="1">Test 1</b>, <b another-attribute="1" id="verybold">Test 2</b>]
3
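
`find_all` also accepts filters, which helps when a tag name alone matches too much. The following is a minimal sketch on a hypothetical subset of the tutorial's markup, parsed with the built-in `html.parser`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on a few lines of html_doc
snippet = """
<b class="boldest">Extremely bold</b>
<blockquote class="boldest">Extremely bold</blockquote>
<b id="1">Test 1</b>
"""
soup = BeautifulSoup(snippet, "html.parser")

bold_boldest = soup.find_all("b", class_="boldest")  # tag name plus class
any_boldest = soup.find_all(class_="boldest")        # any tag with that class
by_id = soup.find_all(id="1")                        # match on the id attribute
first_only = soup.find_all("b", limit=1)             # stop after one match

print(len(bold_boldest), len(any_boldest), len(by_id), len(first_only))
```

The keyword is `class_` with a trailing underscore because `class` is a reserved word in Python.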


### Tag Names

In [21]:
# print the name of a tag
# in this case, the tag name is 'b'.  Duh

print(bs.b.name)

# we can alter the name of the tag

tag1 = bs.b
print(tag1)
tag1.name = 'bold_type'
print(tag1)

b
<b>The Dormouse's story</b>
<bold_type>The Dormouse's story</bold_type>
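
One consequence of renaming is worth keeping in mind: the change is made to the parsed tree itself, so later searches by the old name no longer match the renamed tag. A minimal sketch, assuming a hypothetical two-tag snippet and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not the tutorial's html_doc)
snippet = "<p><b>first</b> and <b>second</b></p>"
soup = BeautifulSoup(snippet, "html.parser")

before = len(soup.find_all("b"))       # both <b> tags match
soup.b.name = "bold_type"              # rename the first <b> in place
after = len(soup.find_all("b"))        # only the untouched <b> matches now
renamed = soup.find_all("bold_type")   # the renamed tag is found by its new name

print(before, after, len(renamed))
print(soup)
```

This is why list indexes into `find_all('b')` shift by one after the first `<b>` has been renamed.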


### Attributes

In [None]:
# find all the bold elements (see above bs.find_all('b')) and
#    give me the second element of the list (index 1)
# note: the first <b> was renamed to bold_type above, so it is no longer in this list

tag1 = bs.find_all('b')[1]
print(tag1)

# this tag has an attribute called 'id'
# using array notation [] we can get the contents of id attribute
print(tag1['id'])

In [35]:
# similar example, but printing multiple attributes 
# in this example 'another-attribute' is an attribute in the html file

tag2 = bs.find_all('b')[2]
print(tag2)

print(tag2['id'])
print(tag2['another-attribute'])

<b another-attribute="1" id="verybold">Test 2</b>
verybold
1
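
Square-bracket access raises a `KeyError` when the attribute is missing, just like a dictionary. The sketch below shows the safer `get` alternative, plus the fact that `class` comes back as a list. It assumes a hypothetical snippet and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not the tutorial's html_doc)
snippet = '<b class="boldest extra" id="verybold">Test 2</b>'
soup = BeautifulSoup(snippet, "html.parser")
tag = soup.b

print(tag["id"])                 # existing attribute
print(tag.get("title"))          # missing attribute, no exception
print(tag.get("title", "n/a"))   # missing attribute with a fallback value
print(tag["class"])              # class is multi-valued, so this is a list
print(tag.has_attr("id"))

try:
    tag["title"]                 # bracket access on a missing attribute
    raised = False
except KeyError:
    raised = True
print(raised)
```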


In [41]:
# See all the attributes that the bold (b) tag has
# .attrs returns a dictionary mapping each attribute name to its value

print(tag2.attrs)

{'another-attribute': '1', 'id': 'verybold'}


In [42]:
# Change the value of an attribute
# In the example above, the value for 'another-attribute'
#    is 1.  Here, we change it to something else

tag2['another-attribute'] = 99

print(tag2) # new attribute values

<b another-attribute="99" id="verybold">Test 2</b>


In [44]:
# Removing Attributes
# attributes are mutable, so in addition to changing
#    a value, we can delete an attribute entirely

del tag2['id']
del tag2['another-attribute']

print(tag2)

<b>Test 2</b>


### Strings within Tags

In [47]:
# String data can be found within tags

tag3 = bs.find_all('b')[2]
print(tag3)
print(tag3.string)

<b>Test 2</b>
Test 2


In [48]:
# Strings cannot be edited in place, but one string can be
#    swapped for another with replace_with()

tag3.string.replace_with('Changed String Values')
print(tag3.string)

Changed String Values
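
A caveat about `.string`: it only returns text when the tag has a single child. For a tag with mixed content it returns `None`, and `get_text()` is the usual way to collect all the text. The following sketch assumes a hypothetical snippet and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not the tutorial's html_doc)
snippet = "<p>Once upon a time <a>Elsie</a> lived in a well.</p>"
soup = BeautifulSoup(snippet, "html.parser")

link_text = soup.a.string       # <a> has one child, so this is its text
para_text = soup.p.string       # <p> has three children, so this is None
all_text = soup.p.get_text()    # concatenates every string under <p>

print(link_text)
print(para_text)
print(all_text)

# replace_with swaps the string inside the tree
soup.a.string.replace_with("Lacie")
print(soup.a.string)
print(soup.p.get_text())
```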
