# Web Scraping Interactive Exercise

In this notebook, you'll practice scraping a webpage using **Beautiful Soup**, extracting links and text, and saving results to a file.

Each step includes a **collapsible hint** to guide you. At the end, a collapsed solution is provided for self-checking.

## Step 1: Import Libraries

Import the libraries needed for web scraping.

<details>
<summary>Hint</summary>
You need `BeautifulSoup` from `bs4`, `urllib.request` to open URLs, and `re` for regular expressions.
Optionally, `IPython.display` can be used to display HTML.
</details>

In [None]:
# TODO: Import required libraries

## Step 2: Fetch and Parse Webpage

Open the URL `https://analytics.usa.gov` and parse it using BeautifulSoup.

<details>
<summary>Hint</summary>
Use `urllib.request.urlopen(url).read()` to get the HTML content.
Then create a BeautifulSoup object with `html.parser`.
Check the type of the object with `type()`.
</details>
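
The parsing half of this step can be tried without a network connection. The sketch below is a minimal illustration, assuming a small hand-written HTML string in place of the real page; `sample_html` and `sample_soup` are hypothetical names, not part of the exercise.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by urlopen(url).read()
sample_html = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse with Python's built-in parser, as the hint suggests
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup))         # the class of the parsed document object
print(sample_soup.title.string)  # the text inside <title>
```

The same call should work on the bytes fetched from the real URL; only the input changes.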

In [None]:
# TODO: Fetch and parse the webpage

## Step 3: Inspect HTML

Print the first 100 characters of the prettified HTML to get a sense of the structure.

<details>
<summary>Hint</summary>
Use `soup.prettify()` and slice the string `[:100]` to view the first 100 characters.
</details>
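
As a rough illustration of why slicing works here: `prettify()` returns an ordinary Python string, so any string slice applies. The snippet assumes a tiny made-up document (`demo` and `pretty` are hypothetical names) rather than the real page.

```python
from bs4 import BeautifulSoup

# A tiny made-up document, used only for illustration
demo = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

# prettify() returns a str with one tag or text node per line
pretty = demo.prettify()

# Because it is a plain string, slicing limits how much is printed
print(pretty[:20])
```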

In [None]:
# TODO: Print first 100 characters of prettified HTML

## Step 4: Extract All Links

Use a loop to find all `<a>` tags and print their `href` attributes.

<details>
<summary>Hint</summary>
Use `soup.find_all('a')` and `.get('href')` inside a for loop.
</details>
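
One detail worth seeing on a small scale: `.get('href')` returns `None` for an `<a>` tag that has no `href` attribute, so the printed output may contain `None` entries. The sketch below assumes a hand-written HTML fragment; `toy_html`, `toy_soup`, and `hrefs` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hand-written fragment with an absolute link, a relative link,
# and an anchor that has no href at all
toy_html = """
<a href="https://example.com/">Absolute</a>
<a href="/about">Relative</a>
<a name="top">No href</a>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# .get() returns None instead of raising when the attribute is missing
hrefs = [a.get("href") for a in toy_soup.find_all("a")]
print(hrefs)
```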

In [None]:
# TODO: Extract all links and print them

## Step 5: Extract Webpage Text

Retrieve the entire textual content of the page.

<details>
<summary>Hint</summary>
Use the `.get_text()` method of the BeautifulSoup object.
</details>
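
Called with no arguments, `.get_text()` concatenates text nodes exactly as they appear, which can run words together across tags. The sketch below, on a made-up fragment, compares that with the optional `separator` and `strip` arguments; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only
snippet = BeautifulSoup(
    "<div><h1>Title</h1><p>Some <b>bold</b> text.</p></div>", "html.parser"
)

# Default: text nodes are joined with nothing between them
raw_text = snippet.get_text()

# With a separator and strip=True, each piece is trimmed and joined by a space
tidy_text = snippet.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(tidy_text))
```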

In [None]:
# TODO: Print the full text of the webpage

## Step 6: Filter Links Starting with HTTP

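Use a regular expression to keep only links whose `href` starts with `http`. These are absolute URLs, which are mostly (but not necessarily) external links.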

<details>
<summary>Hint</summary>
Use `attrs={'href': re.compile('^http')}` in `find_all` to filter links starting with 'http'.
Loop through the results and print each link.
</details>
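
To see what the pattern `^http` accepts and rejects, it can be tried on plain strings first. According to the Beautiful Soup documentation, a compiled pattern passed as a filter is applied with its `search()` method, so the sketch below does the same. The list of hrefs is invented for illustration.

```python
import re

# The same pattern used in the hint; ^ anchors the match at the start
starts_with_http = re.compile(r"^http")

# Invented examples of href values a page might contain
candidate_hrefs = [
    "https://example.com/",
    "http://example.org/page",
    "/data/",
    "#top",
    "mailto:someone@example.com",
]

# Keep only the values the pattern matches
kept = [h for h in candidate_hrefs if starts_with_http.search(h)]
print(kept)
```

Relative paths, fragment links, and `mailto:` addresses are dropped, while both `http://` and `https://` URLs are kept.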

In [None]:
# TODO: Extract and print links starting with http

## Step 7: Save Links to a File

Write the extracted links to a text file `parsed_data.txt`.

<details>
<summary>Hint</summary>
Open a file in write mode (`'w'`) and iterate over the filtered links.
Convert each link to a string and write it to the file, followed by a newline (`\n`).
Either close the file when you are done, or open it in a `with` block, which closes it automatically.
</details>
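
The write-then-close pattern can be rehearsed on its own before using real links. This is a minimal sketch, assuming an invented list of strings and a temporary directory so that no file in the notebook folder is touched; `links_to_save`, `out_path`, and `saved` are hypothetical names.

```python
import os
import tempfile

# Invented stand-ins for the filtered links
links_to_save = ["https://example.com/", "http://example.org/page"]

# Write into a temporary directory rather than the working directory
out_path = os.path.join(tempfile.mkdtemp(), "demo_links.txt")

# The with block closes the file even if an error occurs while writing
with open(out_path, "w", encoding="utf-8") as out_file:
    for link in links_to_save:
        out_file.write(str(link) + "\n")

# Read the file back to confirm one link per line
with open(out_path, encoding="utf-8") as in_file:
    saved = in_file.read().splitlines()
print(saved)
```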

In [None]:
# TODO: Save links to parsed_data.txt

---
## Solutions (Collapsed)

<details>
<summary>Click to view solutions</summary>

```python
# Step 1: Import Libraries
from bs4 import BeautifulSoup
import urllib.request
import re
from IPython.display import HTML  # optional: only needed to render HTML in the notebook

# Step 2: Fetch and Parse Webpage
r = urllib.request.urlopen('https://analytics.usa.gov').read()
soup = BeautifulSoup(r, 'html.parser')
print(type(soup))  # a bare type(soup) only displays when it is the last line of a cell

# Step 3: Inspect HTML
print(soup.prettify()[:100])

# Step 4: Extract All Links
for link in soup.find_all('a'):
    print(link.get('href'))

# Step 5: Extract Webpage Text
print(soup.get_text())

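# Step 6: Filter Links Starting with HTTP
# Each match is a whole <a> tag; use link.get('href') for the URL alone
for link in soup.find_all('a', attrs={'href': re.compile(r'^http')}):
    print(link)

# Step 7: Save Links to a File
# The with block closes the file automatically; utf-8 handles non-ASCII link text
with open('parsed_data.txt', 'w', encoding='utf-8') as file:
    for link in soup.find_all('a', attrs={'href': re.compile(r'^http')}):
        file.write(str(link) + '\n')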
```
</details>