# Beautiful Soup Tutorial

### Install Beautiful Soup

In [5]:
!pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable


In [2]:
from bs4 import BeautifulSoup as bs

## Getting a page from the internet  
Fetch the page with `requests`, then take the actual HTML content from the response object using its **`.text` attribute** and pass it to Beautiful Soup.

In [29]:
import requests
url = 'https://raw.githubusercontent.com/techwithtim/Beautiful-Soup-Tutorial/main/index.html'
html = requests.get(url)
html_text = html.text
soup = bs(html_text, 'html.parser')

In [8]:
# Alternatively, parse a local copy of the file:
# with open('index.html', 'r') as file:
#     content = bs(file, 'html.parser')
# # prettify() returns the document as an indented HTML string
# print(content.prettify())


To access an HTML tag in the document, access it as an attribute of the BeautifulSoup object; this returns the first tag with that name.  
Use the `.string` attribute to access its text content.

In [9]:
title = soup.title
# the whole html tag
print(title)
# only the content inside the html tag
print(title.string)

<title>Your Title Here</title>
Your Title Here
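
One caveat: `.string` only returns text when the tag has a single child. The sketch below uses a made-up HTML snippet (not the tutorial file) to show that behaviour next to `.get_text()`, which joins all the text found inside a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration
demo = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
paragraph = demo.p

# <p> has two children (a string and a <b> tag), so .string is None
print(paragraph.string)
# <b> has exactly one child string, so .string returns it
print(paragraph.b.string)
# .get_text() concatenates every string found inside the tag
print(paragraph.get_text())
```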


### Change the content of an html tag

In [10]:
print('Old title tag content:')
print(title.string)
title.string = 'A new title for the html file'
print('Now the <title> tag has new content:')
print(title.string)

Old title tag content:
Your Title Here
Now the <title> tag has new content:
A new title for the html file


With the `.find()` method you can get the first occurrence of a specific tag by passing the tag name as an argument.

In [11]:
# get the first <a> tag
first_link = soup.find('a')
print(first_link)

<a href="http://somegreatsite.com">Link Name</a>
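
When nothing matches, `.find()` returns `None`, so it is usually worth checking the result before using it. A minimal sketch with an invented snippet that contains no `<table>` tag:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet without any <table> tag
demo = BeautifulSoup('<p>No links or tables here</p>', 'html.parser')

missing = demo.find('table')
print(missing)

# Guard against None before touching attributes of the result
if missing is None:
    table_text = ''
else:
    table_text = missing.get_text()

# find_all() never returns None: with no matches it gives an empty list
no_tables = demo.find_all('table')
print(len(no_tables))
```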


Use `.find_all(tag name)` to get all tags of that type in the HTML file.

In [12]:
links = soup.find_all('a')
print(*links, sep='\n'+'*'*15 + '\n')

<a href="http://somegreatsite.com">Link Name</a>
***************
<a href="mailto:support@yourcompany.com">

support@yourcompany.com</a>


The result of the `.find_all()` method is a list-like `ResultSet` of tags. Use indexing or slicing to get only some of them.

In [13]:
p = soup.find_all('p')
print(f'There are {len(p)} <p> tags.')
print('Here they are:')
for i in p:
    print(i)

print('\n\nThe first one is:')
print(p[0])

There are 2 <p> tags.
Here they are:
<p> This is a new paragraph!

<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
<hr/>
</p></p>
<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
<hr/>
</p>


The first one is:
<p> This is a new paragraph!

<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
<hr/>
</p></p>


### Modifying the attributes of an html tag  
Tag attributes can be accessed as if the tag were a dictionary, where each key is an attribute name and each value is that attribute's value. The whole dictionary is available through the tag's `.attrs` attribute.

In [14]:
first_link = soup.find('a')
print(first_link)
# attributes of a tag
print(first_link.attrs)

<a href="http://somegreatsite.com">Link Name</a>
{'href': 'http://somegreatsite.com'}


Let's change its `href` attribute:

In [15]:
first_link['href'] = 'www.google.es'
print(first_link)

<a href="www.google.es">Link Name</a>


### Adding an html attribute to a tag  
It is as simple as adding a new `key:value` pair to the `.attrs` dictionary

In [16]:
first_link['target'] = '_blank'
print(first_link)

<a href="www.google.es" target="_blank">Link Name</a>
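
Reading a missing attribute with square brackets raises a `KeyError`, just like a dictionary. The following sketch, built on a one-line snippet copied from the output above, shows the safer `.get()` lookup and how an attribute can be removed again with `del`.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<a href="http://somegreatsite.com">Link Name</a>', 'html.parser')
link = demo.a

# .get() returns None instead of raising KeyError for a missing attribute
print(link.get('target'))

has_target_before = link.has_attr('target')
link['target'] = '_blank'          # add the attribute
has_target_after = link.has_attr('target')

# del removes the attribute again, as with a dictionary key
del link['target']
print(link)
```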


### Finding multiple types of tag  
Pass a list of tag names as the argument of the `.find` or `.find_all` methods. `.find_all` returns a list of all matching tags of any of those types, while `.find` returns only the first match.

#### First, load a new HTML file

In [17]:
with open('index2.html', 'r') as file:
    soup2 = bs(file, 'html.parser')

In [18]:
tags = soup2.find_all(['a', 'p'])
for tag in tags:
    print(tag)

<p>W3docs provides free learning materials for programming languages
          like HTML, CSS, Java Script, PHP etc.</p>
<a class="btn-item" href="https://www.w3docs.com/learn-html.html">Learn
            HTML</a>
<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>
<a href="https://www.w3docs.com/privacy-policy">Privacy Poalicy for
              W3Docs.</a>


### Find a tag with combination of tag name and text inside the tag

In [19]:
options = soup2.find_all('option')
print(*options, sep='\n')
print('\n' + '-'*15)
print('Only this one option is for "Undergraduate":')
# string= is the current name of the older text= argument
print(soup2.find_all('option', string='Undergraduate'))

<option selected="" value="course-type">Course type*</option>
<option value="short-courses">Short courses</option>
<option value="featured-courses">Featured courses</option>
<option value="undergraduate">Undergraduate</option>
<option value="diploma">Diploma</option>
<option value="certificate">Certificate</option>
<option value="masters-degree">Masters degree</option>
<option value="postgraduate">Postgraduate</option>

---------------
Only this one option is for "Undergraduate":
[<option value="undergraduate">Undergraduate</option>]


### Search for specific attribute value inside the tag  
Pass the attribute to `.find_all` as a keyword argument: `attribute_name=value`

In [20]:
links = soup2.find_all('a')
print(*links, sep='\n')
print('\n\nOnly this one points to the quiz site:')
print(soup2.find_all(href = "https://www.w3docs.com/quiz/#"))


<a class="btn-item" href="https://www.w3docs.com/learn-html.html">Learn
            HTML</a>
<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>
<a href="https://www.w3docs.com/privacy-policy">Privacy Poalicy for
              W3Docs.</a>


Only this one points to the quiz site:
[<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>]
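
Keyword arguments do not work for attribute names that are not valid Python identifiers, such as `data-*` attributes. In that case the `attrs` dictionary can be used instead. The sketch below runs on an invented snippet (the `data-kind` attribute and the URLs are assumptions for illustration) and also shows `href=True`, which matches any tag that has the attribute at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
demo_html = '''
<a href="https://example.com/a" data-kind="quiz">Quiz</a>
<a data-kind="doc">No link target</a>
<a href="https://example.com/b">Plain</a>
'''
demo = BeautifulSoup(demo_html, 'html.parser')

# 'data-kind' contains a hyphen, so it has to go through the attrs dict
quiz_links = demo.find_all('a', attrs={'data-kind': 'quiz'})
print(quiz_links)

# href=True keeps only the tags that define an href, whatever its value
with_href = demo.find_all('a', href=True)
print([a['href'] for a in with_href])
```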


### Search by class attribute  
The `class_` argument must be used. The trailing **`_`** underscore is required because `class` is a reserved keyword in Python and cannot be used as an argument name.

In [21]:
links_btn = soup2.find_all('a', class_='btn-item')
print(*links_btn, sep='\n')

<a class="btn-item" href="https://www.w3docs.com/learn-html.html">Learn
            HTML</a>
<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>
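
Two related details, sketched on an invented snippet: `class_` matches a tag if any one of its CSS classes equals the given value, and the same search can be written as a CSS selector with `.select()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
demo_html = '''
<a class="btn-item primary" href="/one">One</a>
<a class="btn-item" href="/two">Two</a>
<a class="other" href="/three">Three</a>
'''
demo = BeautifulSoup(demo_html, 'html.parser')

# The first link has two classes; class_ still matches on 'btn-item'
by_class = demo.find_all('a', class_='btn-item')
print([a.get_text() for a in by_class])

# Equivalent search expressed as a CSS selector
by_selector = demo.select('a.btn-item')
print([a.get_text() for a in by_selector])
```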


### Search using regular expressions

In [22]:
import re
price_tags = soup2.find_all(string=re.compile(r'\$.*'))
print(price_tags)
print('Cleaning the result...')
for i in price_tags:
    print(i.strip())

['\n        $2345\n      ', '\n        $123\n        ']
Cleaning the result...
$2345
$123
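
A search on text returns the matching strings, not the tags that hold them. If the enclosing tag is needed, it can be reached through `.parent`. A small sketch on made-up markup:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
demo_html = '<div><span class="price"> $2345 </span><p>Total: $123</p><p>No price</p></div>'
demo = BeautifulSoup(demo_html, 'html.parser')

# Raw string so that \$ and \d reach the regex engine unchanged
price_strings = demo.find_all(string=re.compile(r'\$\d+'))

for text in price_strings:
    # .parent is the tag that directly contains the matched string
    print(text.parent.name, '->', text.strip())
```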


### Limiting the number of results of find_all  
Use the `limit` keyword argument of the `.find_all()` method.

In [47]:
options = soup2.find_all('option')
for i in options:
    print(i.string)


Course type*
Short courses
Featured courses
Undergraduate
Diploma
Certificate
Masters degree
Postgraduate


Limit the results to the first 3 occurrences:

In [49]:

options = soup2.find_all('option', limit=3)
for i in options:
    print(i.string)

Course type*
Short courses
Featured courses
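
As a side note, `.find()` behaves like `.find_all()` with `limit=1`, except that it returns the tag itself rather than a one-element list. A short sketch on an invented `<select>` element:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
demo_html = '<select><option>A</option><option>B</option><option>C</option></select>'
demo = BeautifulSoup(demo_html, 'html.parser')

first_two = demo.find_all('option', limit=2)
print([option.string for option in first_two])

# Same first tag, reached in two ways
first_via_limit = demo.find_all('option', limit=1)
first_via_find = demo.find('option')
print(first_via_limit[0] is first_via_find)
```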


### Saving the changes to a modified html file  
`str(soup)` gives the HTML markup of the BeautifulSoup object as a plain string, which can then be written to a file.

In [50]:
with open('index.modified.html', 'w') as file:
    file.write(str(soup2))
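
If the page contains non-ASCII characters, it may be safer to state the file encoding explicitly rather than rely on the platform default. The sketch below writes an invented document to a temporary directory (the file name and content are assumptions) and parses it back to check the round trip.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical document containing a non-ASCII character
demo = BeautifulSoup('<html><body><p>Precio: 5 €</p></body></html>', 'html.parser')

out_path = os.path.join(tempfile.mkdtemp(), 'demo.modified.html')

# Explicit encoding avoids depending on the operating system's default
with open(out_path, 'w', encoding='utf-8') as file:
    file.write(str(demo))

# Read the file back and parse it again
with open(out_path, 'r', encoding='utf-8') as file:
    reloaded = BeautifulSoup(file, 'html.parser')

print(reloaded.p.string)
```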

## Navigating the DOM tree of a webpage

In [110]:
with open('index.html', 'r') as file:
    soup3 = bs(file, 'html.parser')
# print the parsed document; soup3.prettify() would return an indented version
print(soup3)

<html>
<head>
<title>Beautiful Soup HTML Sample</title>
</head>
<body bgcolor="FFFFFF">
<center><img align="BOTTOM" src="clouds.jpg"/> </center>
<hr/>
<a href="http://somegreatsite.com">Link Name</a> is a link to another nifty site
  
  <h1>THIS IS AN H1 HEADER</h1>
<h2>This is a Medium Header</h2>
  
  Send me mail at <a href="mailto:support@yourcompany.com">
  
  support@yourcompany.com</a>.
  
  <p> This is a new paragraph!</p>
<p><b color="red">This is a new paragraph!</b><br/>
<b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
</p>
<table><tr>
<td>Cell 1, row 1</td>
<td>Cell 2, row 1</td>
<td>Cell 3, row 1</td>
<td>Cell 4, row 1</td>
</tr>
<tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr></table>
<hr/>
</body>
</html>

### Get the children tags of an html element  
Use the `.contents` attribute of a tag.  
Find the table and then list its child nodes.

In [121]:
table = soup3.find('table')
table_content = table.contents
print(table_content)

[<tr>
<td>Cell 1, row 1</td>
<td>Cell 2, row 1</td>
<td>Cell 3, row 1</td>
<td>Cell 4, row 1</td>
</tr>, '\n', <tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr>]


#### `.next_siblings` is a generator over all the following siblings of a tag (tags and text nodes)  
The generator content can be transformed into a list using `list(html_tag.next_siblings)`

In [129]:
_1st_row_siblings = table_content[0].next_siblings
print(type(_1st_row_siblings))
for row in _1st_row_siblings:
    if row != '\n':
        print(row)

<class 'generator'>
<tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr>


#### The next sibling of a node can be accessed using `.next_sibling`. Likewise, the previous one using `.previous_sibling`

In [125]:
# second element of the list: the newline between the two rows
print('The second element of table_content is a line break')
print(table_content[1])
print('\n' + '*'*15)
# its next sibling is the second row of the table
print(table_content[1].next_sibling)
print('\n' + '*'*15)
# its previous sibling is the first row
previous = table_content[1].previous_sibling
print(previous)

The second element of table_content is a line break



***************
<tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr>

***************
<tr>
<td>Cell 1, row 1</td>
<td>Cell 2, row 1</td>
<td>Cell 3, row 1</td>
<td>Cell 4, row 1</td>
</tr>
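
Because whitespace between tags counts as a sibling, `.next_sibling` often returns a newline string rather than a tag. The `find_next_sibling()` and `find_previous_sibling()` methods skip over such text nodes. A minimal sketch on an invented two-row table:

```python
from bs4 import BeautifulSoup

# Hypothetical table with a newline between the two rows
demo_html = '<table><tr><td>r1</td></tr>\n<tr><td>r2</td></tr></table>'
demo = BeautifulSoup(demo_html, 'html.parser')

first_row = demo.find('tr')

# The immediate sibling is the newline text node, not the second row
print(repr(first_row.next_sibling))

# find_next_sibling('tr') jumps straight to the next <tr> tag
second_row = first_row.find_next_sibling('tr')
print(second_row)

# and find_previous_sibling('tr') goes back the same way
back_again = second_row.find_previous_sibling('tr')
print(back_again is first_row)
```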


### Parent of a tag: `.parent`. Tag name of the parent tag: `.parent.name`

In [143]:
first_row = table.contents[0]
print('The parent of the row is the whole table')
print(first_row.parent)
print(f'Get the parent tag name: {first_row.parent.name}')

The parent of the row is the whole table
<table><tr>
<td>Cell 1, row 1</td>
<td>Cell 2, row 1</td>
<td>Cell 3, row 1</td>
<td>Cell 4, row 1</td>
</tr>
<tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr></table>
Get the parent tag name: table
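
To go further up than the direct parent, `.parents` iterates over every ancestor and `find_parent()` looks for the closest ancestor with a given name. A sketch on a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration
demo_html = '<html><body><table><tr><td>x</td></tr></table></body></html>'
demo = BeautifulSoup(demo_html, 'html.parser')

cell = demo.find('td')

# Walk from the cell up to the root; the BeautifulSoup object itself
# is the last ancestor and is named '[document]'
ancestor_names = [parent.name for parent in cell.parents]
print(ancestor_names)

# Closest enclosing <table>, however many levels up it sits
closest_table = cell.find_parent('table')
print(closest_table.name)
```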


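### To get the content of a tag, three different attributes can be used: `.contents`, `.descendants` and `.children`  
`.contents` and `.children` give the direct children of a tag (as a list and as an iterator, respectively), while `.descendants` gives a generator that also walks recursively through the children's own children.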

In [157]:
print('.contents\tyields:', end=' ')
print(type(table.contents))
print('.descendants\tyields:', end=' ')
print(type(table.descendants))
print('.children\tyields:', end=' ')
print(type(table.children))

.contents	yields: <class 'list'>
.descendants	yields: <class 'generator'>
.children	yields: <class 'list_iterator'>


#### `.contents` and `.children` give the same nodes; `.descendants` additionally includes every nested tag and string

In [160]:
print(*table.contents)
print('*'*20)
print(*table.descendants)
print('*'*20)
print(*table.children)

<tr>
<td>Cell 1, row 1</td>
<td>Cell 2, row 1</td>
<td>Cell 3, row 1</td>
<td>Cell 4, row 1</td>
</tr> 
 <tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr>
********************
<tr>
<td>Cell 1, row 1</td>
<td>Cell 2, row 1</td>
<td>Cell 3, row 1</td>
<td>Cell 4, row 1</td>
</tr> 
 <td>Cell 1, row 1</td> Cell 1, row 1 
 <td>Cell 2, row 1</td> Cell 2, row 1 
 <td>Cell 3, row 1</td> Cell 3, row 1 
 <td>Cell 4, row 1</td> Cell 4, row 1 
 
 <tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr> 
 <td>Cell 1, row 2</td> Cell 1, row 2 
 <td>Cell 2, row 2</td> Cell 2, row 2 
 <td>Cell 3, row 2</td> Cell 3, row 2 
 <td>Cell 4, row 2</td> Cell 4, row 2 

********************
<tr>
<td>Cell 1, row 1</td>
<td>Cell 2, row 1</td>
<td>Cell 3, row 1</td>
<td>Cell 4, row 1</td>
</tr> 
 <tr>
<td>Cell 1, row 2</td>
<td>Cell 2, row 2</td>
<td>Cell 3, row 2</td>
<td>Cell 4, row 2</td>
</tr>
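
The difference is easier to see on a tiny example. The sketch below uses an invented one-row table without whitespace between tags, so that only tags and cell texts show up in the counts.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical one-row table, used only for illustration
demo = BeautifulSoup('<table><tr><td>A</td><td>B</td></tr></table>', 'html.parser')
table = demo.table

direct = table.contents            # list of direct children
children = list(table.children)    # same nodes, produced by an iterator
descendants = list(table.descendants)  # every nested node, depth first

# Label tags by name and text nodes by their text
labels = [node.name if isinstance(node, Tag) else str(node) for node in descendants]

print(len(direct), len(children), len(descendants))
print(labels)
```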


Look for cryptocurrency prices. They are located inside different types of HTML tags.

In [376]:
url = 'https://coinmarketcap.com/'
html = requests.get(url)
html = html.text
soup3 = bs(html, 'html.parser')
# the table with the cryptocurrencies and their data is located in a tbody html tag
table = soup3.tbody
rows = table.contents
# currency name and value are inside the 3rd and 4th td tags inside each tr tag
currencies = {}

# the 10 first currencies have their names and values inside p tags
for row in rows[:10]:
    td_name = row.contents[2]
    td_name_p_tags = td_name.find_all('p')
    coin_name, coin_symbol = td_name_p_tags[0].string, td_name_p_tags[1].string
    coin_value = row.contents[3].span.string
    # add the currency to the dictionary
    currencies[coin_name] = [coin_name, coin_symbol, coin_value]
    #print(f'Currency: {coin_name}, {coin_symbol}. Current value: {coin_value}')

# the rest of currencies have their names and values inside span tags
# currency name, symbol and value are located in 3rd td inside tr tag
# inside the td tag, they are inside span tags
for row in rows[10:]:
    td_name = row.contents[2]
    td_name_spans = td_name.find_all('span')
    coin_name, coin_symbol = td_name_spans[1].string, td_name_spans[2].string
    # <span>$<!-- -->92.85</span></td>
    # span with currency value has a comment inside
    coin_value = row.contents[3].span.contents
    # remove the comment in position 1 --> ['$', ' ', '0.18']
    coin_value.pop(1)
    # join the dollar sign and value
    coin_value  = ''.join(coin_value)
    currencies[coin_name] = [coin_name, coin_symbol, coin_value]
    #print(f'Currency: {coin_name}, {coin_symbol}. Current value: {coin_value}')

for coin, info in currencies.items():
    print(f'{coin} -- {info[2]}')


Bitcoin -- $48,735.45
Ethereum -- $4,059.56
Binance Coin -- $561.90
Tether -- $1.00
Solana -- $200.78
Cardano -- $1.41
USD Coin -- $0.9992
XRP -- $0.8328
Polkadot -- $29.09
Terra -- $71.10
Dogecoin -- $0.18
Avalanche -- $90.74
SHIBA INU -- $0.00
Crypto.com Coin -- $0.59
Binance USD -- $1.00
Polygon -- $1.95
Wrapped Bitcoin -- $48495.90
Litecoin -- $162.54
Uniswap -- $17.06
Algorand -- $1.64
Chainlink -- $20.69
Bitcoin Cash -- $468.76
TRON -- $0.09
TerraUSD -- $1.00
Decentraland -- $4.03
Stellar -- $0.29
Axie Infinity -- $116.12
Dai -- $1.00
Cosmos -- $27.22
VeChain -- $0.10
FTX Token -- $43.59
Elrond -- $304.70
Internet Computer -- $31.24
The Sandbox -- $5.89
Filecoin -- $40.77
Hedera -- $0.29
THETA -- $5.12
Ethereum Classic -- $38.89
Bitcoin BEP2 -- $48406.09
NEAR Protocol -- $7.83
Fantom -- $1.68
Tezos -- $4.11
Monero -- $197.79
Gala -- $0.50
Helium -- $34.13
The Graph -- $0.73
IOTA -- $1.18
Klaytn -- $1.27
UNUS SED LEO -- $3.34
Flow -- $9.85
EOS -- $3.11
Loopring -- $2.16
PancakeSwa