We import requests and bs4 to send an HTTP GET request to the site, then scrape the returned page and extract the content of interest.

In [7]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## Accessing the poems and authors on the page

In [8]:
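def get_poems_authors(url):
    """Scrape one listing page; return (next-page href or False, authors, poems, urls)."""
    r = requests.get(url)
    print(f"The url: {url} status code it's {r.status_code}")

    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(r.text, "html.parser")

    # Author names are in <span class="autor">.
    authors = [span.text for span in soup.find_all("span", class_="autor")]

    # Poems are in <p> tags whose class attribute matches the regex below.
    poems = [p.text for p in soup.find_all("p", class_=re.compile(r"frase\s.*"))]

    # The pager marks the current page with <span class="atual">;
    # the first <a> after it is taken as the link to the next page.
    current_page = soup.find("span", class_="atual")
    next_page = current_page.find_next("a") if current_page else None
    href_next_page = next_page.get("href", "") if next_page else ""

    # A link that does not point to a "poemas" page means there is no next page.
    if "poemas" not in href_next_page:
        href_next_page = False

    # Repeat the page URL once per author so the three lists stay aligned.
    urls = [url] * len(authors)

    return href_next_page, authors, poems, urls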

Because the site is paginated, we need to find the next page: the function above returns its href, which is then used to scrape that page in turn.
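
The pagination logic can be sketched independently of the network. This is a minimal sketch, not the notebook's own code: `crawl_pages`, `fake_fetch`, `FAKE_SITE`, and the `max_pages` cap are hypothetical, and `fetch` stands in for any function that returns the next href (or a falsy value) together with the items found on the page. It resolves the href with `urllib.parse.urljoin` and stores each page's items before deciding whether to stop, so the final page is kept as well.

```python
from urllib.parse import urljoin


def crawl_pages(start_url, fetch, max_pages=1000):
    """Follow next-page links from start_url and collect every page's items.

    fetch(url) must return (next_href, items); a falsy next_href ends the crawl.
    max_pages is a safety cap against pagination loops (an assumption).
    """
    collected = []
    url = start_url
    for _ in range(max_pages):
        next_href, items = fetch(url)
        collected.extend(items)  # keep this page's items before deciding to stop
        if not next_href:
            break
        url = urljoin(url, next_href)  # handles relative and absolute hrefs
    return collected


# Hypothetical stand-in for real HTTP requests: three fake pages.
FAKE_SITE = {
    "https://example.com/poemas/": ("/poemas/2/", ["poem 1", "poem 2"]),
    "https://example.com/poemas/2/": ("/poemas/3/", ["poem 3"]),
    "https://example.com/poemas/3/": (False, ["poem 4"]),
}


def fake_fetch(url):
    """Return the canned (next_href, items) pair for a fake page."""
    return FAKE_SITE[url]


all_poems = crawl_pages("https://example.com/poemas/", fake_fetch)
print(all_poems)
```

Passing the fetch function as an argument keeps the loop testable without touching the real site.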

## Building a dataset of authors and their poems

In [12]:
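main_page = "https://www.pensador.com"
poem_page = "https://www.pensador.com/poemas/"
href_exists = True

urls = []
authors = []
poems = []

while href_exists:
    href, author, poem, url = get_poems_authors(poem_page)

    href_exists = bool(href)

    # Stop when there is no next page. Because the break comes before the
    # appends, the results of that final page are not stored.
    if not href_exists:
        break

    # The href is site-relative, so prefix it with the site root.
    poem_page = main_page + href
    urls.append(url)
    authors.append(author)
    poems.append(poem)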

The url: https://www.pensador.com/poemas/ status code it's 200
The url: https://www.pensador.com/poemas/2/ status code it's 200
The url: https://www.pensador.com/poemas/3/ status code it's 200
The url: https://www.pensador.com/poemas/4/ status code it's 200
The url: https://www.pensador.com/poemas/5/ status code it's 200
The url: https://www.pensador.com/poemas/6/ status code it's 200
The url: https://www.pensador.com/poemas/7/ status code it's 200
The url: https://www.pensador.com/poemas/8/ status code it's 200
The url: https://www.pensador.com/poemas/9/ status code it's 200
The url: https://www.pensador.com/poemas/10/ status code it's 200
The url: https://www.pensador.com/poemas/11/ status code it's 200
The url: https://www.pensador.com/poemas/12/ status code it's 200
The url: https://www.pensador.com/poemas/13/ status code it's 200
The url: https://www.pensador.com/poemas/14/ status code it's 200
The url: https://www.pensador.com/poemas/15/ status code it's 200
The url: https://www.

In [13]:
len(authors), len(poems), len(urls)

(46, 46, 46)

In [14]:
def simple_list(_list):
    """Recursively flatten arbitrarily nested lists into one flat list."""
    if isinstance(_list, list):
        return [sub_elem for elem in _list for sub_elem in simple_list(elem)]
    else:
        return [_list]

In [16]:
len(simple_list(authors)), len(simple_list(poems)), len(simple_list(urls))

(920, 920, 920)
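
The jump from 46 to 920 comes from flattening: each of the 46 entries is itself a list holding one page's results. When the nesting is exactly one level deep, as it is here, the standard library's `itertools.chain.from_iterable` does the same job as a recursive helper. A minimal sketch with placeholder page contents (the `pages` values are made up, not scraped data):

```python
from itertools import chain

# Placeholder data: one inner list per scraped page.
pages = [["author A", "author B"], ["author C", "author D"]]

# Concatenate the inner lists into a single flat list.
flat = list(chain.from_iterable(pages))
print(len(pages), len(flat))
```

Unlike a recursive flattener, `chain.from_iterable` removes only one level of nesting and would split a bare string into characters, so it suits lists of lists only.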

In [17]:
authors = simple_list(authors)
poems = simple_list(poems)
urls = simple_list(urls)

In [19]:
data = pd.DataFrame(list(zip(authors, poems, urls)), columns=['Authors', 'Poems', 'Page'])
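
Because authors and poems are collected by separate selectors, the lists could in principle differ in length, and `zip` would then silently truncate to the shortest one. A small guard, sketched here with a hypothetical helper `aligned_rows` and placeholder values, makes such a mismatch visible before the DataFrame is built:

```python
def aligned_rows(authors, poems, urls):
    """Return (author, poem, url) tuples, refusing lists of unequal length."""
    lengths = (len(authors), len(poems), len(urls))
    if len(set(lengths)) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return list(zip(authors, poems, urls))


# Placeholder values, not scraped data.
rows = aligned_rows(
    ["author A", "author B"],
    ["poem 1", "poem 2"],
    ["https://example.com/poemas/"] * 2,
)
print(rows[0])

# A missing poem is reported instead of being silently dropped.
try:
    aligned_rows(["author A", "author B"], ["poem 1"], ["u", "u"])
except ValueError as err:
    print(err)
```

The checked `rows` could then be passed to `pd.DataFrame(rows, columns=[...])` in place of `list(zip(...))`.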

In [25]:
data['Authors'] = data['Authors'].apply(lambda st:st.replace('\n', '')).apply(str.strip)

In [26]:
data

Unnamed: 0,Authors,Poems,Page
0,Fernando Pessoa,O poeta é um fingidor.\nFinge tão completament...,https://www.pensador.com/poemas/
1,Fernando Pessoa,AUTOPSICOGRAFIA\n\nO poeta é um fingidor.\nFin...,https://www.pensador.com/poemas/
2,Mario Quintana,SIMULTANEIDADE\n\n- Eu amo o mundo! Eu detesto...,https://www.pensador.com/poemas/
3,Clarice Pacheco,Caderno de poesias\n\nCaderno de poesias\né um...,https://www.pensador.com/poemas/
4,Tom Jobim,"Ah, quem me dera ser poeta\nPra cantar em seu ...",https://www.pensador.com/poemas/
...,...,...,...
915,Álvaro de Campos,POEMA DE CANÇÃO SOBRE A ESPERANÇA\n\nI\n\nDá-m...,https://www.pensador.com/poemas/46/
916,Celia Piovesan,HOMENAGEM AO CADAVER DESCONHECIDO \nVOCÊ \n\n...,https://www.pensador.com/poemas/46/
917,Yalison Lillipuziano,OLHOS CASTANHOS\n\nUm brilho no seu olhar\nQue...,https://www.pensador.com/poemas/46/
918,Alfredo Cuervo Barrero,"É Proibido\n\nÉ proibido chorar sem aprender,\...",https://www.pensador.com/poemas/46/


In [27]:
data.to_csv('../data/authors_poems.csv')
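
The poems contain line breaks and commas, so it is worth confirming that such fields survive a CSV round trip. The sketch below uses the standard library's `csv` module rather than pandas, with placeholder rows and an in-memory buffer instead of a file (all values are made up). Fields containing newlines, commas, or quotes are quoted on write and restored on read.

```python
import csv
import io

# Placeholder rows that mimic the (author, poem, page) layout.
rows = [
    ("Author A", "First line\nSecond line", "https://example.com/poemas/"),
    ("Author B", 'Dash, comma and "quotes"', "https://example.com/poemas/"),
]

# newline="" leaves line endings untouched, as the csv module expects.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Authors", "Poems", "Page"])
writer.writerows(rows)

# Read the same text back and compare it with what was written.
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
restored = [tuple(row) for row in reader]
print(restored == rows)
```

In pandas, `to_csv` also writes the row index as an unnamed first column by default, which typically shows up as `Unnamed: 0` when the file is read back; passing `index=False` omits it.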