In [1]:
import requests

url = "https://info.cern.ch/hypertext/WWW/TheProject.html"
response = requests.get(url, timeout=10)

# Check that the request was successful before using the body,
# so html_content is never left undefined in later cells
if response.status_code == 200:
    html_content = response.text
else:
    raise RuntimeError(f"Failed to retrieve the webpage (status {response.status_code})")

In [2]:
# print(html_content)
len(html_content)

2217

## Parse the raw HTML

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

Here are some simple ways to navigate that data structure:

In [4]:
print("Full <title> element:               ", soup.title)
print("Text of the <title> element:        ", soup.title.string)
print("Tag name of the current element:    ", soup.title.name)
print("Tag name of the <title> parent:     ", soup.title.parent.name)


Full <title> element:                <title>The World Wide Web project</title>
Text of the <title> element:         The World Wide Web project
Tag name of the current element:     title
Tag name of the <title> parent:      header


In [5]:
# All links in the page
nb_links = len(soup.find_all('a'))
print(f"There are {nb_links} links on the page")

There are 25 links on the page


In [6]:
# Text from the page
print(soup.get_text())


The World Wide Web project



World Wide WebThe WorldWideWeb (W3) is a wide-area
hypermedia information retrieval
initiative aiming to give universal
access to a large universe of documents.
Everything there is online about
W3 is linked directly or indirectly
to this document, including an executive
summary of the project, Mailing lists
, Policy , November's  W3  news ,
Frequently Asked Questions .

What's out there?
 Pointers to the
world's online information, subjects
, W3 servers, etc.
Help
 on the browser you are using
Software Products
 A list of W3 project
components and their current state.
(e.g. Line Mode ,X11 Viola ,  NeXTStep
, Servers , Tools , Mail robot ,
Library )
Technical
 Details of protocols, formats,
program internals etc
Bibliography
 Paper documentation
on  W3 and references.
People
 A list of some people involved
in the project.
History
 A summary of the history
of the project.
How can I help ?
 If you would like
to support the web..
Getting code
 Getting the cod
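
The printed text above runs some strings together (for instance "World Wide WebThe WorldWideWeb"), because `get_text()` concatenates text nodes without any separator. Here is a minimal sketch of how the `separator` and `strip` arguments change that. The HTML snippet and the name `demo` are made up for illustration and are not the fetched page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, loosely modeled on the page above
html = (
    "<body><h1>World Wide Web</h1>"
    "The WorldWideWeb (W3) is a <a href='x.html'>wide-area</a> initiative."
    "</body>"
)
demo = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated as-is, so the heading fuses with the body text
fused = demo.get_text()

# separator puts a space between text nodes, strip trims whitespace around each one
spaced = demo.get_text(separator=" ", strip=True)

print(fused)
print(spaced)
```

The second call is usually the more readable one when the text is going to be searched or tokenized afterwards.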

If you need to select a DOM element by its tag (`<p>`, `<a>`, `<span>`, ...), you can simply write `soup.<tag>`. The caveat is that this returns only the first HTML element with that tag.

For example, to get the first link, access the `a` attribute of the BeautifulSoup object.

If you want all matching elements rather than just the first, use `soup.find_all('<tag>')`; `soup.<tag>` is shorthand for `soup.find('<tag>')`.

In [35]:
first_link = soup.a
print(first_link)

<a id="top"></a>


In [36]:
# The text of the link
print(first_link.text)
# The href of the link
print(first_link.get('href'))


None
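
The first `<a>` on this page is a bare anchor with no text and no `href`, which is why the cell prints an empty line and `None`. The sketch below shows the difference between `soup.a`, `find_all('a')`, and `find_all('a', href=True)`. It runs on a small made-up document, and the file names in it are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one bare anchor followed by two real links
html = """
<p>Intro with an <a id="top"></a>anchor.</p>
<p><a href="/first.html">First</a> and <a href="/second.html">Second</a></p>
"""
demo = BeautifulSoup(html, "html.parser")

first = demo.a                              # shorthand for demo.find("a")
all_links = demo.find_all("a")              # every <a>, in document order
with_href = demo.find_all("a", href=True)   # only <a> tags that have an href

print(first)
print(len(all_links))
print([a["href"] for a in with_href])
```

Filtering with `href=True` avoids the `None` values that `a.get('href')` returns for anchors like `<a id="top">`.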


This is a simple example. If you want to select the first element based on its id or class attributes, it is not much more difficult:

In [38]:
pagespace = soup.find(id="pagespace")
print(pagespace)
# class is a reserved keyword in Python, hence the '_'
athing = soup.find(class_="athing")
print(athing)

None
None
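
Both lookups print `None` because `find` returns `None` when nothing on the parsed page matches. The sketch below shows a hit and a miss side by side. Its HTML is a made-up fragment (the `hnmain` id and the row contents are assumptions for illustration), not markup taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one id and a repeated class
html = """
<table id="hnmain">
  <tr class="athing" id="1"><td class="title">First story</td></tr>
  <tr class="athing" id="2"><td class="title">Second story</td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

table = demo.find(id="hnmain")            # match on the id attribute
first_row = demo.find(class_="athing")    # first element carrying that class
missing = demo.find(id="pagespace")       # no such id here, so this is None

print(table.name)
print(first_row["id"])
print(missing)

# Guard against None before dereferencing the result
text = missing.get_text() if missing is not None else ""
```

Checking for `None` before calling methods on the result avoids an `AttributeError` when a page does not contain the expected element.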


In [39]:
from collections import Counter
all_hrefs = [a.get('href') for a in soup.find_all('a')]
top_3_links = Counter(all_hrefs).most_common(3)
print(top_3_links)

[('/wiki/ISBN', 22), ('/wiki/1555', 19), ('/wiki/1516', 17)]
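
When counting links it often helps to skip anchors without an `href` and to resolve relative paths first, so that the same target is not counted under two spellings. This is a minimal sketch under assumed inputs: `base_url` and the HTML fragment are hypothetical.

```python
from collections import Counter
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.org/docs/"  # hypothetical page address
html = """
<a id="top"></a>
<a href="/wiki/ISBN">ISBN</a>
<a href="/wiki/ISBN">ISBN again</a>
<a href="guide.html">Guide</a>
"""
demo = BeautifulSoup(html, "html.parser")

# href=True drops anchors that have no href attribute at all
hrefs = [a["href"] for a in demo.find_all("a", href=True)]

# Resolve each link against the page address before counting
absolute = [urljoin(base_url, h) for h in hrefs]
counts = Counter(absolute).most_common(2)
print(counts)
```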


### Dynamic element selection
So far we have always passed a static tag name. However, find_all is more versatile and also supports dynamic selections. For example, you can pass a function: find_all calls it for each element and includes the element only if the function returns True.

In the following code sample we define a function my_tag_selector, which takes a tag and returns True only for <a> tags that have a class attribute. Adding a membership test on tag.get("class") narrows the match to a single class, for example titlelink to keep only article links, provided the page actually uses that class.

In [42]:
import requests
from bs4 import BeautifulSoup

def my_tag_selector(tag):
    # Accept only "a" tags that carry a class attribute.
    # To narrow this to one class, append: and "titlelink" in tag.get("class")
    return tag.name == "a" and tag.has_attr("class")

response = requests.get("https://news.ycombinator.com/")
if response.status_code != 200:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise RuntimeError("Error fetching page")

soup = BeautifulSoup(response.content, 'html.parser')

print(soup.find_all(my_tag_selector))

[<a class="hnuser" href="user?id=Anon84">Anon84</a>, <a class="hnuser" href="user?id=ttfkam">ttfkam</a>, <a class="hnuser" href="user?id=WhyUVoteGarbage">WhyUVoteGarbage</a>, <a class="hnuser" href="user?id=rbanffy">rbanffy</a>, <a class="hnuser" href="user?id=zdw">zdw</a>, <a class="hnuser" href="user?id=thesephist">thesephist</a>, <a class="hnuser" href="user?id=luu">luu</a>, <a class="hnuser" href="user?id=Hooke">Hooke</a>, <a class="hnuser" href="user?id=tintinnabula">tintinnabula</a>, <a class="hnuser" href="user?id=writeslowly">writeslowly</a>, <a class="hnuser" href="user?id=kylegalbraith">kylegalbraith</a>, <a class="hnuser" href="user?id=rbanffy">rbanffy</a>, <a class="hnuser" href="user?id=tomalpha">tomalpha</a>, <a class="hnuser" href="user?id=pabs3">pabs3</a>, <a class="hnuser" href="user?id=kuba-orlik">kuba-orlik</a>, <a class="hnuser" href="user?id=kls0e">kls0e</a>, <a class="hnuser" href="user?id=ingve">ingve</a>, <a class="hnuser" href="user?id=surprisetalk">surprisetal
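
To see a function filter work without depending on a live page, here is a small offline sketch. The HTML fragment and the helper `a_with_class` are hypothetical. The helper builds a selector for a given class name, so the same logic can target `titlelink`, `hnuser`, or any other class.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three kinds of links
html = """
<a class="titlelink" href="https://example.org/story">A story</a>
<a class="hnuser" href="user?id=someone">someone</a>
<a href="/plain">no class at all</a>
"""
demo = BeautifulSoup(html, "html.parser")

def a_with_class(wanted):
    """Return a selector that accepts <a> tags carrying the class `wanted`."""
    def selector(tag):
        # tag.get("class", []) is a list of class names, or [] if there is none
        return tag.name == "a" and wanted in tag.get("class", [])
    return selector

story_links = demo.find_all(a_with_class("titlelink"))
any_class = demo.find_all(lambda tag: tag.name == "a" and tag.has_attr("class"))

print([a.get_text() for a in story_links])
print(len(any_class))
```

The second query mirrors a selector that only checks for the presence of a class attribute, which is why it also returns the `hnuser` link.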

find_all accepts more than plain strings as filters. Besides tag names and functions, it also takes regular expressions, lists, and the value True (which matches every tag). In addition to find_all, there are other methods for navigating the DOM tree, for example to select an element's following siblings or its parent.
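
The filter types and navigation helpers mentioned above can be sketched on a small made-up fragment. The HTML and variable names are illustrative assumptions, while `find_all`, `find_next_sibling`, and `.parent` are standard BeautifulSoup calls.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment to exercise the different filters
html = """
<div id="menu">
  <h1>Title</h1>
  <h2>Subtitle</h2>
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <a href="https://example.org/a">A</a>
  <a href="/local">Local</a>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# Regular expression: matched against the tag name
headings = demo.find_all(re.compile(r"^h[1-6]$"))

# List: any tag whose name is in the list
mixed = demo.find_all(["h1", "a"])

# Regular expression on an attribute value
external = demo.find_all("a", href=re.compile(r"^https?://"))

# Navigation: next sibling with a given tag name, and the parent element
first_p = demo.find("p")
next_p = first_p.find_next_sibling("p")
parent = first_p.parent

print([t.name for t in headings])
print([t.name for t in mixed])
print([a["href"] for a in external])
print(next_p.get_text(), parent["id"])
```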

BeautifulSoup is a great example of a library that is both easy to use and powerful. So far we have mostly talked about selecting and finding elements, but you can also change and update the whole DOM tree. We won't cover that in this article, however, because it's now time for CSS selectors.



### CSS selectors
Why learn about CSS selectors if BeautifulSoup already has a way to select elements based on their attributes?

#### Querying the DOM
Often, DOM elements do not have proper IDs or class names. Selecting them is still perfectly possible with the methods from the previous examples, but it can be rather verbose and require lots of manual steps.

For example, let's say that you want to extract the score of a post on the HN homepage, but you can't use class names or IDs in your code. Here is how you could do it:

In [None]:
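results = []
all_tr = soup.find_all('tr')
for tr in all_tr:
    # Keep only rows with exactly two children
    if len(tr.contents) == 2:
        # The first child must be empty and the second must have 13 child nodes
        if len(tr.contents[0].contents) == 0 and len(tr.contents[1].contents) == 13:
            # The first space-separated token of the cell text is the score
            points = tr.contents[1].text.split(' ')[0].strip()
            results.append(points)
print(results)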

As promised, rather verbose, isn't it?

This is exactly where CSS selectors shine. They let you collapse the loop and the nested ifs into a single expression.


In [45]:
all_results = soup.select('td:nth-child(2) > span:nth-child(1)')
results = [r.text.split(' ')[0].strip() for r in all_results]
print(results)

['Hacker', '218', '36', '89', '87', '41', '45', '57', '116', '46', '71', '100', '36', '3', '161', '201', '176', '206', '10', '8', '48', '165', '80', '19', '3', '17', '56', '302', '109', '39', '77']


The key here is td:nth-child(2) > span:nth-child(1). It selects a <span> that is the first child of its parent <td>, where that <td> must itself be the second child of its parent (<tr>). The following HTML illustrates a valid DOM excerpt for our selector.
```
<tr>
    <td>not the second child, are we?</td>
    <td>
        <span>HERE WE GO</span>
        <span>this time not the first span</span>
    </td>
</tr>
```
This is much clearer and simpler, right? Of course, this example artificially highlights the usefulness of the CSS selector. But after playing with the DOM for a while, you will fairly quickly realise how powerful CSS selectors are, especially when you cannot rely only on IDs or class names.
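
As a quick check, the selector can be run against the HTML excerpt itself. This is a minimal sketch: the excerpt is wrapped in a `<table>` element for well-formedness, and the name `demo` is illustrative.

```python
from bs4 import BeautifulSoup

# The excerpt from above, wrapped in a <table> so the markup is well-formed
html = """
<table>
<tr>
    <td>not the second child, are we?</td>
    <td>
        <span>HERE WE GO</span>
        <span>this time not the first span</span>
    </td>
</tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

selector = "td:nth-child(2) > span:nth-child(1)"

# select returns every match as a list
matches = demo.select(selector)
print([m.get_text() for m in matches])

# select_one returns only the first match, or None if there is none
first = demo.select_one(selector)
print(first.get_text())
```

Only the first `<span>` inside the second `<td>` matches. The text in the first `<td>` and the second `<span>` are both skipped.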



Continue: https://www.scrapingbee.com/blog/python-web-scraping-beautiful-soup/