There are existing modules for parsing HTML. These are not perfect either, but they are often far more robust than anything you or I would have the patience to write ;)

One very common solution in Python is **Beautiful Soup**, a free, open-source module for parsing and manipulating HTML. You have to install it yourself, but once installed, it can be imported like any other module. The package is called **bs4** and the relevant constructor is `BeautifulSoup()`. It parses the HTML and builds a document model: a tree-like representation of the HTML document from which you can easily extract elements. The following code illustrates this:

In [None]:
#import for reading urls
import urllib.request
#import for parsing html
from bs4 import BeautifulSoup
#non-local page this time
link = "https://habr.com/"
#connect to that page
f = urllib.request.urlopen(link)
#read it all in
myfile = f.read()
#build a document model
soup = BeautifulSoup(myfile,'html.parser')
#print the page verbatim
print(myfile)

Here we read in a web page and then parse it with `BeautifulSoup()`. We can then print a pretty version of it with `prettify()`, extract the text with `get_text()`, or find all instances of a tag with `find_all()`. Each tag found is itself a tree-like representation, so we can keep calling methods on it. In the example at hand, we call the `get()` method to extract the value of the `href` attribute of each `a` tag.

In [None]:
#pretty-print the html
print(soup.prettify())

In [None]:
#extract the text
print(soup.get_text())

In [None]:
#go through all the hyperlinks...
for link in soup.find_all('a'):
    #...and print them
    print(link.get('href'))
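
The same calls can be tried without a network connection. The following is a minimal sketch on a small, made-up HTML snippet (the markup and its URLs are illustrative assumptions, not taken from any real page). It shows that a tag returned by `find()` or `find_all()` can be searched further, and that `get()` returns `None` when the attribute is missing.

```python
from bs4 import BeautifulSoup

# A made-up snippet for illustration only
snippet = """
<html><body>
<div id="menu">
  <a href="/news">News</a>
  <a href="https://example.com/about">About</a>
  <a>No target</a>
</div>
<p>Hello, <b>world</b>!</p>
</body></html>
"""

soup = BeautifulSoup(snippet, 'html.parser')

# find() returns the first matching tag, which is itself a small tree
menu = soup.find('div', id='menu')

# so we can search inside it just as we searched the whole document
hrefs = [a.get('href') for a in menu.find_all('a')]
print(hrefs)

# get_text() also works on a single tag and includes the text of nested tags
paragraph_text = soup.find('p').get_text()
print(paragraph_text)
```

The last `a` tag has no `href`, so its entry in the list is `None` rather than a string; code that processes the links further should be prepared for that.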

In [None]:
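#import for fetching pages over http
import requests
#import for parsing html
from bs4 import BeautifulSoup

#download the page and take its body as text
result = requests.get('https://habr.com/')
html = result.text

#build a document model
soup = BeautifulSoup(html,'html.parser')

#go through all the articles marked with the class "post"
for post in soup.find_all('article', {'class': 'post'}):
    title = post.find('a', {'class': 'post__title_link'})
    text = post.find('div', {'class': 'post__text'})
    #find() returns None when nothing matches, so skip incomplete posts
    if title is None or text is None:
        continue
    print(title.get_text())
    print(text.prettify())

    print('-- '*10)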
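
The class names used in the cell above (`post`, `post__title_link`, `post__text`) describe one particular page layout, and a live site may change its markup at any time. To experiment with class-based filtering independently of the site, here is a minimal sketch on made-up markup that reuses those class names. The helper `extract_titles()` is a hypothetical name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup that reuses the class names from the cell above
snippet = """
<article class="post">
  <a class="post__title_link" href="/post/1">First post</a>
  <div class="post__text">Some text</div>
</article>
<article class="post">
  <div class="post__text">A post without a title link</div>
</article>
"""

def extract_titles(markup):
    """Return the title of every post, or None where the title link is missing."""
    soup = BeautifulSoup(markup, 'html.parser')
    titles = []
    # class_ is the keyword form of the {'class': ...} filter
    for post in soup.find_all('article', class_='post'):
        title_link = post.find('a', class_='post__title_link')
        # find() returns None when nothing matches, so check before calling methods
        titles.append(title_link.get_text() if title_link is not None else None)
    return titles

titles = extract_titles(snippet)
print(titles)
```

Keeping a `None` placeholder for the second article makes it visible that a post was found but its title was not, which is often a hint that the page structure differs from what the code expects.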

An alternative module for parsing HTML is **lxml**. Parsing HTML with lxml is quite easy: once the page has been transformed into a tree, we can use XPath expressions to extract data from it.

In [1]:
import requests
from lxml import html

response = requests.get('http://ya.ru')

# Convert the document body into an element tree (DOM)
parsed_body = html.fromstring(response.text)

# Run XPath queries against the element tree
print(parsed_body.xpath('//title/text()')[0])  # Get the page title
print(parsed_body.xpath('//a/@href'))          # Get the href attribute of every link

Яндекс
['https://mail.yandex.ru', '//yandex.ru']
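
The second link in the output above, `//yandex.ru`, has a host but no scheme, so it cannot be passed to `requests.get()` as it stands. One possible way to turn such values into full URLs is `urljoin()` from the standard library. This is a minimal sketch, assuming the base URL is the address requested in the cell above and the `href` values are the ones printed there.

```python
from urllib.parse import urljoin

# Base URL of the page that was fetched in the cell above
base_url = 'http://ya.ru'

# The href values as they appear in the output above
hrefs = ['https://mail.yandex.ru', '//yandex.ru']

# urljoin() leaves absolute URLs alone and fills in the missing parts of the others
absolute = [urljoin(base_url, href) for href in hrefs]
print(absolute)
```

The scheme for the second link is taken from the base URL, so the result depends on whether the page was fetched over `http` or `https`.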
