### Install beautifulsoup

In [5]:
!pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable


In [6]:
from bs4 import BeautifulSoup as bs

## Getting a URL from the internet  
Don't forget to extract the actual HTML content from the response object returned by `requests.get()`, using its **`.text` attribute**

In [7]:
import requests
url = 'https://raw.githubusercontent.com/techwithtim/Beautiful-Soup-Tutorial/main/index.html'
html = requests.get(url)
html_text = html.text
soup = bs(html_text, 'html.parser')
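
A slightly more defensive variant of the download step, as a sketch only. The helper name `fetch_soup` and the 10-second default timeout are assumptions made for illustration; `timeout` and `raise_for_status()` are standard parts of `requests`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and parse it (hypothetical helper, not part of bs4)."""
    # Without a timeout, requests.get() can wait indefinitely on a stalled server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_soup(url)` with the `url` defined above should give a soup equivalent to the one built in the cell.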

In [8]:
# Alternative: parse a local copy of the file instead of downloading it
# with open('index.html', 'r') as file:
#     content = bs(file, 'html.parser')
# # prettify() returns the document as an indented, human-readable HTML string
# print(content.prettify())


To access the first tag of a given type, use the tag name as an attribute of the BeautifulSoup object (for example `soup.title`).  
Use the `.string` attribute to get the text inside it

In [9]:
title = soup.title
# the whole html tag
print(title)
# only the content inside the html tag
print(title.string)

<title>Your Title Here</title>
Your Title Here
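
One caveat worth illustrating: `.string` only returns text when the tag has a single child. The sketch below uses a made-up HTML snippet and the illustrative name `soup_demo` to compare it with `.get_text()`, which joins all nested text.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = '<p id="one">Plain text</p><p id="two">Some <b>bold</b> text</p>'
soup_demo = BeautifulSoup(html, 'html.parser')

one = soup_demo.find(id='one')
two = soup_demo.find(id='two')

print(one.string)      # Plain text
print(two.string)      # None, because the tag has several children
print(two.get_text())  # Some bold text
```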


### Change the content of an html tag

In [10]:
print('Old title tag content:')
print(title.string)
title.string = 'A new title for the html file'
print('Now the <title> tag has new content:')
print(title.string)

Old title tag content:
Your Title Here
Now the <title> tag has new content:
A new title for the html file


With the `.find()` method you can get the first occurrence of a specific tag by passing the tag name as an argument

In [11]:
# get the first <a> tag
first_link = soup.find('a')
print(first_link)

<a href="http://somegreatsite.com">Link Name</a>
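
When nothing matches, `.find()` returns `None` and `.find_all()` returns an empty list, so it can be worth checking before using the result. A minimal sketch with a made-up document that contains no links:

```python
from bs4 import BeautifulSoup

# Made-up HTML with no <a> tag at all
soup_demo = BeautifulSoup('<p>No links here</p>', 'html.parser')

link = soup_demo.find('a')
# Guard against None before reading attributes from the tag
href = link.get('href') if link is not None else None

print(link)                     # None
print(href)                     # None
print(soup_demo.find_all('a'))  # []
```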


Use `.find_all(tag name)` to get all tags of that type in the HTML file

In [12]:
links = soup.find_all('a')
print(*links, sep='\n'+'*'*15 + '\n')

<a href="http://somegreatsite.com">Link Name</a>
***************
<a href="mailto:support@yourcompany.com">

support@yourcompany.com</a>


The result of the `.find_all()` method is a list-like collection of tags (a `ResultSet`). Use indexing or slicing to get only some of them

In [13]:
p = soup.find_all('p')
print(f'There are {len(p)} <p> tags.')
print('Here they are:')
for i in p:
    print(i)

print('\n\nThe first one is:')
print(p[0])

There are 2 <p> tags.
Here they are:
<p> This is a new paragraph!

<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
<hr/>
</p></p>
<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
<hr/>
</p>


The first one is:
<p> This is a new paragraph!

<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
<hr/>
</p></p>
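
The cell above uses indexing (`p[0]`). Slicing works the same way as on any Python list; the sketch below uses a made-up list of `<li>` tags to show both.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = '<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>'
soup_demo = BeautifulSoup(html, 'html.parser')

items = soup_demo.find_all('li')
first_two = items[:2]                            # slice: first two tags
last = items[-1]                                 # index: last tag
middle = [li.get_text() for li in items[1:3]]    # text of the 2nd and 3rd tags

print(first_two)  # [<li>a</li>, <li>b</li>]
print(last)       # <li>d</li>
print(middle)     # ['b', 'c']
```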


### Modifying the attributes of an html tag  
Tag attributes can be accessed as if the tag were a dictionary, where each key is an attribute name and each value is that attribute's value. The full dictionary is available through the tag's `.attrs` attribute

In [14]:
first_link = soup.find('a')
print(first_link)
# attributes of a tag
print(first_link.attrs)

<a href="http://somegreatsite.com">Link Name</a>
{'href': 'http://somegreatsite.com'}


Let's change its `href` attribute:

In [15]:
first_link['href'] = 'www.google.es'
print(first_link)

<a href="www.google.es">Link Name</a>


### Adding an html attribute to a tag  
It is as simple as adding a new `key:value` pair to the `.attrs` dictionary

In [16]:
first_link['target'] = '_blank'
print(first_link)

<a href="www.google.es" target="_blank">Link Name</a>
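
Because tags behave like dictionaries, attributes can also be read safely with `.get()` and removed with `del`. A short sketch on a made-up link (the URL and names are illustrative only):

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = '<a href="http://example.com" target="_blank">Link</a>'
link = BeautifulSoup(html, 'html.parser').a

# link['title'] would raise KeyError; .get() returns None or a default instead
print(link.get('title'))              # None
print(link.get('title', 'no title'))  # no title

# Remove an attribute, exactly like deleting a dictionary key
del link['target']
print(link.has_attr('target'))        # False
print(link)                           # <a href="http://example.com">Link</a>
```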


### Finding multiple types of tag  
Pass a list of tag names as the argument to `.find()` or `.find_all()`. `.find_all()` returns a list of all matching tags; `.find()` returns only the first match.

#### First, load a new HTML file

In [17]:
with open('index2.html', 'r') as file:
    soup2 = bs(file, 'html.parser')

In [18]:
tags = soup2.find_all(['a', 'p'])
for tag in tags:
    print(tag)

<p>W3docs provides free learning materials for programming languages
          like HTML, CSS, Java Script, PHP etc.</p>
<a class="btn-item" href="https://www.w3docs.com/learn-html.html">Learn
            HTML</a>
<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>
<a href="https://www.w3docs.com/privacy-policy">Privacy Poalicy for
              W3Docs.</a>
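
Matches come back in document order, not in the order of the names in the list. The sketch below uses made-up HTML to show this, and to show that `.find()` with a list returns a single tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = ('<div><p>first paragraph</p>'
        '<a href="#x">first link</a>'
        '<p>second paragraph</p></div>')
soup_demo = BeautifulSoup(html, 'html.parser')

# 'a' is listed first, but the <p> appears earlier in the document
first = soup_demo.find(['a', 'p'])
names = [tag.name for tag in soup_demo.find_all(['a', 'p'])]

print(first)  # <p>first paragraph</p>
print(names)  # ['p', 'a', 'p']
```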


### Find a tag by a combination of tag name and text inside the tag

In [19]:
options = soup2.find_all('option')
print(*options, sep='\n')
print('\n' + '-'*15)
print('Only this option is "Undergraduate":')
# `string=` is the current name of the older `text=` argument
print(soup2.find_all('option', string='Undergraduate'))

<option selected="" value="course-type">Course type*</option>
<option value="short-courses">Short courses</option>
<option value="featured-courses">Featured courses</option>
<option value="undergraduate">Undergraduate</option>
<option value="diploma">Diploma</option>
<option value="certificate">Certificate</option>
<option value="masters-degree">Masters degree</option>
<option value="postgraduate">Postgraduate</option>

---------------
Only this option is "Undergraduate":
[<option value="undergraduate">Undergraduate</option>]
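
A plain string only matches when it equals the tag's whole text. For partial matches, a compiled regular expression can be passed instead. A sketch on a made-up `<select>`, assuming a Beautiful Soup version that accepts the `string` argument (older releases call it `text`):

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = ('<select><option>Undergraduate</option>'
        '<option>Postgraduate</option>'
        '<option>Diploma</option></select>')
soup_demo = BeautifulSoup(html, 'html.parser')

exact = soup_demo.find_all('option', string='Undergraduate')
no_match = soup_demo.find_all('option', string='graduate')  # not the whole text
partial = soup_demo.find_all('option', string=re.compile('graduate'))

print(len(exact))                          # 1
print(no_match)                            # []
print([tag.get_text() for tag in partial]) # ['Undergraduate', 'Postgraduate']
```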


### Search for specific attribute value inside the tag  
Pass the attribute as a keyword argument to `.find_all()`: `attribute_name=value`

In [20]:
links = soup2.find_all('a')
print(*links, sep='\n')
print('\n\nOnly this one points to the quiz site:')
print(soup2.find_all(href="https://www.w3docs.com/quiz/#"))


<a class="btn-item" href="https://www.w3docs.com/learn-html.html">Learn
            HTML</a>
<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>
<a href="https://www.w3docs.com/privacy-policy">Privacy Poalicy for
              W3Docs.</a>


Only this one points to the quiz site:
[<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>]
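
Some attribute names cannot be written as keyword arguments, for example `data-role` (not a valid Python identifier) or `name` (which `.find_all()` already uses for the tag name). In those cases the `attrs` dictionary can be used. A sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = ('<div data-role="price">$10</div>'
        '<div data-role="name">Book</div>'
        '<input name="email">')
soup_demo = BeautifulSoup(html, 'html.parser')

# data-role contains a hyphen, so it has to go through attrs
price_divs = soup_demo.find_all(attrs={'data-role': 'price'})
# find_all(name='email') would look for <email> tags, so use attrs here too
email_inputs = soup_demo.find_all(attrs={'name': 'email'})

print(price_divs[0].get_text())  # $10
print(email_inputs[0].name)      # input
```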


### Search by class attribute  
Use the `class_` argument. The trailing underscore (**`_`**) is required because `class` is a reserved keyword in Python

In [21]:
links_btn = soup2.find_all('a', class_='btn-item')
print(*links_btn, sep='\n')

<a class="btn-item" href="https://www.w3docs.com/learn-html.html">Learn
            HTML</a>
<a class="btn-item" href="https://www.w3docs.com/quiz/#">Select Quiz</a>
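
`class_` matches a tag if the given class is any one of its classes, and the `class` attribute itself is stored as a list. The sketch below uses made-up HTML and also shows the equivalent CSS selector via `.select()`, which assumes the `soupsieve` package that normally ships alongside `beautifulsoup4`.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = ('<a class="btn-item primary" href="/a">A</a>'
        '<a class="btn-item" href="/b">B</a>'
        '<a class="nav" href="/c">C</a>')
soup_demo = BeautifulSoup(html, 'html.parser')

by_class = soup_demo.find_all('a', class_='btn-item')
by_css = soup_demo.select('a.btn-item')  # same query as a CSS selector

print([a.get_text() for a in by_class])  # ['A', 'B']
print(by_class[0]['class'])              # ['btn-item', 'primary']
print([a.get_text() for a in by_css])    # ['A', 'B']
```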


### Search using regular expressions

In [22]:
import re
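# Raw string so that \$ reaches the regex engine as an escaped dollar sign;
# `string=` is the current name of the older `text=` argument
price_tags = soup2.find_all(string=re.compile(r'\$.*'))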
print(price_tags)
print('Cleaning the result...')
for i in price_tags:
    print(i.strip())

['\n        $2345\n      ', '\n        $123\n        ']
Cleaning the result...
$2345
$123
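
Searching by string alone returns the text nodes, not the tags around them. The enclosing tag is available through `.parent`. A sketch on a made-up price list (the class names and values are illustrative):

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = ('<ul>'
        '<li class="item">Book <span class="price"> $2345 </span></li>'
        '<li class="item">Pen <span class="price"> $123 </span></li>'
        '</ul>')
soup_demo = BeautifulSoup(html, 'html.parser')

price_strings = soup_demo.find_all(string=re.compile(r'\$\d+'))

# Each match is a text node; .parent is the tag that contains it
parent_names = [s.parent.name for s in price_strings]
# Strip whitespace and the dollar sign to get numbers
prices = [int(s.strip().lstrip('$')) for s in price_strings]

print(parent_names)  # ['span', 'span']
print(prices)        # [2345, 123]
```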


### Limiting the number of results of find_all  
Pass `limit` as a keyword argument to the `.find_all()` method

In [47]:
options = soup2.find_all('option')
for i in options:
    print(i.string)


Course type*
Short courses
Featured courses
Undergraduate
Diploma
Certificate
Masters degree
Postgraduate


Limit the results to the first 3 occurrences:

In [49]:

options = soup2.find_all('option', limit=3)
for i in options:
    print(i.string)

Course type*
Short courses
Featured courses
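
Note that `limit=1` still returns a list (with one element), whereas `.find()` returns the tag itself. A small sketch on a made-up list:

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = '<ol><li>1</li><li>2</li><li>3</li><li>4</li></ol>'
soup_demo = BeautifulSoup(html, 'html.parser')

first_two = soup_demo.find_all('li', limit=2)
only_one = soup_demo.find_all('li', limit=1)  # a list with one tag
single = soup_demo.find('li')                 # the tag itself

print(first_two)  # [<li>1</li>, <li>2</li>]
print(only_one)   # [<li>1</li>]
print(single)     # <li>1</li>
```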


### Saving the changes to a modified html file  
`str(soup)` returns the HTML of a BeautifulSoup object as a plain string, which can then be written to a file

In [50]:
with open('index.modified.html', 'w') as file:
    file.write(str(soup2))
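
To check that a change really ends up in the saved file, the output can be parsed again. The sketch below writes to a temporary directory with an explicit UTF-8 encoding; the file name `demo.modified.html` and the HTML are made up for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration
html = '<html><head><title>Demo</title></head><body><p>Hi</p></body></html>'
soup_demo = BeautifulSoup(html, 'html.parser')
soup_demo.title.string = 'Edited'

# An explicit encoding avoids depending on the platform default
out_path = Path(tempfile.mkdtemp()) / 'demo.modified.html'
out_path.write_text(str(soup_demo), encoding='utf-8')

# Parse the saved file again to confirm the change was written
reloaded = BeautifulSoup(out_path.read_text(encoding='utf-8'), 'html.parser')
print(reloaded.title.string)  # Edited
```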