### Importing Libraries

In [36]:
import requests
from bs4 import BeautifulSoup as bs
import re

### Loading First Page

In [50]:
r = requests.get('https://keithgalli.github.io/web-scraping/example.html')

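# Name the parser explicitly so results do not depend on which parsers are installed
soup = bs(r.content, 'html.parser')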

### **Start Scraping with BeautifulSoup**

#### find and find_all

In [18]:
first_header = soup.find("h2")
first_header

<h2>A Header</h2>

In [19]:
headers = soup.find_all('h2')
headers

[<h2>A Header</h2>, <h2>Another header</h2>]

#### Pass in a list of elements to look for

In [21]:
first_header = soup.find(["h1","h2"])
first_header

<h1>HTML Webpage</h1>

In [22]:
headers = soup.find_all(["h1","h2"])
headers

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]

In [28]:
#all paragraphs
paragraph = soup.find_all('p')
paragraph

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

#### Pass in attributes to the find/find_all function

In [26]:
# only paragraphs whose id attribute is 'paragraph-id'
paragraph = soup.find_all('p', attrs={'id': 'paragraph-id'})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]
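
The `attrs` dictionary is not the only way to filter on attributes. As a minimal sketch, assuming a small inline snippet (the `sample_html` string and its `note` class are made up for illustration and are not part of the example page), the keyword forms `id=` and `class_=` select the same element:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the paragraphs on the example page
sample_html = """
<p>Plain paragraph</p>
<p id="paragraph-id" class="note"><b>Some bold text</b></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Two equivalent ways to filter on the id attribute
by_attrs = sample_soup.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = sample_soup.find_all("p", id="paragraph-id")

# class is a reserved word in Python, so BeautifulSoup spells it class_
by_class = sample_soup.find_all("p", class_="note")

print(by_attrs == by_keyword == by_class)
```

The `attrs` form remains the safer choice for attribute names that are not valid Python identifiers, such as `data-id`.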

#### Nest find/find_all calls

In [31]:
body = soup.find('body')
div = body.find('div')

In [32]:
div

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>

In [33]:
header = div.find('h1')
header

<h1>HTML Webpage</h1>

#### Search for specific strings in find / find_all calls

In [None]:
print(soup.prettify())

In [37]:
paragraphs = soup.find_all('p', string = re.compile('Some'))

In [38]:
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [39]:
headers = soup.find_all('h2', string = re.compile('(H|h)eader'))
headers

[<h2>A Header</h2>, <h2>Another header</h2>]
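
One caveat worth illustrating: when `string=` is combined with a tag name, the pattern is compared against the tag's `.string`, which is `None` whenever the tag has more than one child. The sketch below uses a made-up inline snippet (`sample_html` is an assumption, not the example page) to show a paragraph with mixed content being skipped, along with one possible workaround based on `get_text()`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> mixes a text node with an <a> tag
sample_html = """
<p>Link to more: <a href="/page.html">page</a></p>
<p><i>Some italicized text</i></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Two children (text + <a>) means .string is None for this paragraph
mixed = sample_soup.find("p")
print(mixed.string)

# string= therefore finds nothing for "Link" ...
by_string = sample_soup.find_all("p", string=re.compile("Link"))
print(by_string)

# ... while filtering on the full text content still matches
by_text = [p for p in sample_soup.find_all("p") if "Link" in p.get_text()]
print(by_text)
```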

#### select (CSS selector)

In [None]:
print(soup.body.prettify())

In [40]:
content = soup.select('p')
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [None]:
# all p elements that are inside a div

content = soup.select('div p')
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [47]:
# paragraphs that are later siblings of an h2 (use 'h2 + p' for the one immediately after)
paragraphs = soup.select('h2 ~ p')
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
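
The two sibling combinators are easy to confuse: `~` matches every later sibling, while `+` matches only the element immediately after. A small sketch on a made-up snippet (`sample_html` is an assumption for illustration) shows the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one h2 followed by two sibling paragraphs
sample_html = "<div><h2>Title</h2><p>first</p><p>second</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# General sibling combinator: every <p> that follows an <h2> at the same level
following = sample_soup.select("h2 ~ p")

# Adjacent sibling combinator: only a <p> that comes right after an <h2>
adjacent = sample_soup.select("h2 + p")

print([p.string for p in following])  # ['first', 'second']
print([p.string for p in adjacent])   # ['first']
```

On the example page both selectors happen to return the same two paragraphs, because each paragraph sits directly after its own `h2`.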

In [None]:
# Select the bold text inside the paragraph that has id=paragraph-id
bold_text = soup.select('p#paragraph-id b')
bold_text

[<b>Some bold text</b>]

In [53]:
# paragraphs that are direct children of <body>
paragraphs = soup.select('body > p')
print(paragraphs)

# italic elements inside those paragraphs
for paragraph in paragraphs:
    print(paragraph.select('i'))

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]


#### 🥣 Difference between `soup.find_all()` and `soup.select()`

Both are **BeautifulSoup** methods for finding HTML elements, but they differ in important ways.

When should you use each?

- **`find_all()`** → For simple HTML or attribute-based searches.
- **`select()`** → For complex selections that benefit from the flexibility of CSS selectors.

---

##### 🔍 `soup.find_all()`

**Characteristics:**
- Uses **keyword arguments** (such as `name`, `class_`, `id`, `attrs`)
- Supports structured, attribute-oriented searches
- More readable for specific searches

**Example:**
```python
soup.find_all('div', class_='product')
```

Returns every `<div>` whose `class` attribute includes `product`.

**Searching by multiple attributes:**
```python
soup.find_all('a', attrs={'href': True, 'title': 'Produto'})
```

---

##### 🎯 `soup.select()`

**Characteristics:**
- Uses **CSS selectors** (`.class`, `#id`, `element > child`, etc.)
- More concise and flexible for complex selections
- Ideal if you already know CSS

**Equivalent example:**
```python
soup.select('div.product')
```

**Advanced examples:**
```python
soup.select('div#main > ul li.active')
soup.select('a[href^="https://"]')
```
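
To make the two advanced selectors concrete, here is a minimal sketch that runs them against a made-up snippet. The `sample_html` markup, including the `main` id, the `active` class, and the `example.com` URL, is an assumption chosen to fit the selectors, not something taken from the example page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selectors
sample_html = """
<div id="main">
  <ul>
    <li class="active"><a href="https://example.com/a">External</a></li>
    <li><a href="/local">Local</a></li>
  </ul>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# <li class="active"> anywhere under a <ul> that is a direct child of div#main
active_items = sample_soup.select("div#main > ul li.active")

# <a> tags whose href starts with "https://"
secure_links = sample_soup.select('a[href^="https://"]')

print(len(active_items))                   # 1
print([a["href"] for a in secure_links])   # ['https://example.com/a']
```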


---

##### 🔁 Quick Comparison

| Aspect                | `find_all()`                      | `select()`                            |
|-----------------------|-----------------------------------|---------------------------------------|
| Syntax                | Pythonic, keyword arguments       | CSS-like (selector strings)           |
| More readable for     | Specific tags and attributes      | Complex selections with classes/IDs   |
| CSS selector support  | ❌ No                             | ✅ Yes                                |
| Return value          | List of `Tag` objects             | List of `Tag` objects                 |
| Multiple tags         | `soup.find_all(['p', 'a'])`       | `soup.select('p, a')`                 |
| Attribute search      | `attrs={'data-id': '123'}`        | `[data-id="123"]`                     |

---

##### 👨‍💻 Practical example with both:

In [49]:
html = '''
<div class="product"><a href="/item1">Item 1</a></div>
<div class="product"><a href="/item2">Item 2</a></div>
'''

# Use a separate name so the page loaded earlier stays available as `soup`
product_soup = bs(html, 'html.parser')

# Using find_all
items1 = product_soup.find_all('div', class_='product')

# Using select
items2 = product_soup.select('div.product')

print(items1 == items2)  # True

True


#### Get different properties of the HTML

In [59]:
header = soup.find('h2')
print(header.string)

A Header


In [60]:
div = soup.find('div')
print(div.prettify())

<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>



In [62]:
print(div.get_text())


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
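
The blank lines in the output above come from whitespace between tags. As a rough sketch, assuming a made-up snippet shaped like the page's first `div` (`sample_html` and `/page.html` are hypothetical), the `separator` and `strip` arguments of `get_text()` can tidy the result:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same shape as the div on the example page
sample_html = """
<div>
  <h1>HTML Webpage</h1>
  <p>Link to more: <a href="/page.html">page</a></p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_div = sample_soup.find("div")

# Default: text nodes are concatenated as-is, newlines and indentation included
raw_text = sample_div.get_text()

# strip=True trims each text node and drops empty ones; separator joins the rest
clean_text = sample_div.get_text(separator=" ", strip=True)
print(clean_text)  # HTML Webpage Link to more: page
```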



#### Get a specific property from an element

In [66]:
link = soup.find('a')
link

<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>

In [65]:
link['href']

'https://keithgalli.github.io/web-scraping/webpage.html'
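
Indexing with `tag['name']` raises `KeyError` when the attribute is absent. A short sketch on a made-up link (`sample_link` and `/page.html` are assumptions for illustration) shows `.get()` as a more forgiving alternative and `.attrs` for the full attribute dictionary:

```python
from bs4 import BeautifulSoup

# Hypothetical link with an href but no title attribute
sample_soup = BeautifulSoup('<a href="/page.html">page</a>', "html.parser")
sample_link = sample_soup.find("a")

# Dictionary-style access works for attributes that exist
print(sample_link["href"])        # /page.html

# .get() returns None (or a supplied default) instead of raising
print(sample_link.get("title"))   # None

# .attrs exposes every attribute as a plain dict
print(sample_link.attrs)          # {'href': '/page.html'}

# Direct indexing on a missing attribute raises KeyError
try:
    sample_link["title"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)             # True
```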

In [67]:
paragraphs = soup.select('p#paragraph-id')

In [68]:
paragraphs

[<p id="paragraph-id"><b>Some bold text</b></p>]

In [71]:
paragraphs[0]['id']

'paragraph-id'

#### Code Navigation

In [75]:
#Path Syntax
soup.body.div.h1.string

'HTML Webpage'

In [77]:
# Useful Terms: Parent, Sibling, Child
soup.body.find('div')

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>

In [78]:
soup.body.find('div').find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
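
The parent, sibling, and child terms mentioned above each map to a navigation attribute or method. A minimal sketch, assuming a made-up snippet loosely modeled on the example page (`sample_html` is hypothetical and written without whitespace between tags so that `.children` yields only tags):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div with two children, followed by two siblings
sample_html = (
    "<body><div><h1>Title</h1><p>Intro</p></div>"
    "<h2>A Header</h2><p>Text</p></body>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_div = sample_soup.find("div")

# Parent: the tag that directly contains the div
print(sample_div.parent.name)                            # body

# Children: the div's direct descendants
print([child.name for child in sample_div.children])     # ['h1', 'p']

# Siblings: elements at the same level, after or before a given tag
print(sample_div.find_next_sibling().name)                # h2
print(sample_soup.find("h2").find_previous_sibling().name)  # div
```

On real pages `.children` also yields the whitespace text nodes between tags, so filtering for tags (for example with `find_all(recursive=False)`) is often more convenient.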