# Web Scraping in Practice

In this notebook, we will learn how to scrape web pages using **Beautiful Soup**, retrieve links, extract text, and save results to an external file. We'll demonstrate scraping data from [analytics.usa.gov](https://analytics.usa.gov).

## Step 1: Import Libraries

We'll start by importing the required libraries:
- `BeautifulSoup` from `bs4` for parsing HTML
- `urllib.request` to open URLs
- `re` for regular expressions
- `IPython.display` for displaying HTML content in the notebook (optional; the cells below do not use it)

In [None]:
from bs4 import BeautifulSoup
import urllib.request
import re
from IPython.display import HTML

## Step 2: Fetch and Parse Webpage

We'll open the URL with `urllib.request.urlopen` and read the response body. Then we'll parse it with `BeautifulSoup`, using Python's built-in `html.parser`.

We also check the type of the parsed object to ensure it is a `BeautifulSoup` object.

In [None]:
# Open the URL, read the response body, and close the connection afterwards
with urllib.request.urlopen('https://analytics.usa.gov') as response:
    r = response.read()

# Parse HTML content using BeautifulSoup
soup = BeautifulSoup(r, 'html.parser')

# Check type of the object
type(soup)
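
A network request can fail or hang, so it may help to wrap the fetch in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are all assumptions chosen for illustration.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails.

    `fetch_html` is a hypothetical helper; the timeout and the UTF-8
    fallback are illustrative defaults, not values from the notebook.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared in the Content-Type header, if any
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses
        print(f"Could not fetch {url}: {exc}")
        return None
```

With this helper, `html = fetch_html('https://analytics.usa.gov')` returns a string that can be passed to `BeautifulSoup(html, 'html.parser')` after checking that it is not `None`.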

## Step 3: Inspecting the Parsed HTML

We can print the HTML in a readable, indented format using `prettify()`. We print only the first 100 characters to keep the output short.

In [None]:
print(soup.prettify()[:100])

## Step 4: Extract All Links

We use `find_all('a')` to find all anchor (`<a>`) tags and then extract their `href` attributes using `.get('href')`.

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))
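
The loop above prints `None` for anchors that have no `href`, and it prints relative links exactly as they appear in the page. As a rough sketch of how both cases might be handled, the example below parses a small hand-written HTML snippet (an assumption made so it runs without network access) and resolves relative links against a base URL with `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample markup for illustration; not taken from the live site
base_url = "https://analytics.usa.gov"
sample_html = """
<a href="/data/">Data</a>
<a href="https://www.usa.gov">USA.gov</a>
<a name="top">Anchor without href</a>
"""

demo = BeautifulSoup(sample_html, "html.parser")

# .get('href') returns None when the attribute is missing
hrefs = [a.get("href") for a in demo.find_all("a")]
print(hrefs)  # ['/data/', 'https://www.usa.gov', None]

# Skip missing hrefs and turn relative links into absolute URLs
absolute_urls = [urljoin(base_url, h) for h in hrefs if h is not None]
print(absolute_urls)  # ['https://analytics.usa.gov/data/', 'https://www.usa.gov']
```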

## Step 5: Extract All Text

We can also get the entire textual content of the webpage using `.get_text()`. This is useful if we want the raw text without HTML tags.

In [None]:
print(soup.get_text())
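
Raw `get_text()` output often contains long runs of whitespace, and depending on the Beautiful Soup version it may also include the contents of `<script>` and `<style>` tags. One possible cleanup, sketched here on a small hand-written snippet (an assumption, so the example runs offline), is to remove those tags first and then pass `separator` and `strip` to `get_text()`.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup for illustration
sample_html = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><h1>Heading</h1><p>First paragraph.</p><script>var x = 1;</script></body></html>
"""

demo = BeautifulSoup(sample_html, "html.parser")

# Remove tags whose contents are code or styling rather than readable text
for tag in demo.find_all(["script", "style"]):
    tag.decompose()

# Join the remaining strings with a single space and strip whitespace from each
text = demo.get_text(separator=" ", strip=True)
print(text)  # Demo Heading First paragraph.
```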

## Step 6: Prettify and Inspect a Larger Portion

To see more of the page structure, we can print the first 1000 characters of the prettified HTML.

In [None]:
print(soup.prettify()[:1000])

## Step 7: Extract Links Matching a Pattern

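To keep only links whose `href` starts with `http`, we can pass `attrs={'href': re.compile('^http')}` to `find_all`. This selects absolute URLs (both `http` and `https`) and skips relative links. Note that an absolute URL is not necessarily external: it can still point back to the same site.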

In [None]:
for link in soup.find_all('a', attrs={'href': re.compile('^http')}):
    print(link)
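
If the goal is specifically external links, one option is to compare each link's hostname with the site's own hostname. The sketch below is an illustration under assumptions: the HTML snippet is hand-written, `base_host` is a name introduced here, and the stricter pattern `^https?://` is a substitute for the `^http` pattern used above.

```python
import re
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hand-written sample markup for illustration; not taken from the live site
sample_html = """
<a href="https://analytics.usa.gov/data/">Data</a>
<a href="/about">About</a>
<a href="https://www.gsa.gov">GSA</a>
<a name="top">No href</a>
"""
base_host = "analytics.usa.gov"  # hostname of the page being scraped

demo = BeautifulSoup(sample_html, "html.parser")

# Absolute links only: the href must start with http:// or https://
absolute = demo.find_all("a", href=re.compile(r"^https?://"))

# External links: absolute links whose hostname differs from the site's own
external = [a["href"] for a in absolute if urlparse(a["href"]).hostname != base_host]
print(external)  # ['https://www.gsa.gov']
```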

We can also inspect the type of an element returned by `find_all`. After the loop finishes, `link` still refers to the last matching anchor, which is a `Tag` object.

In [None]:
type(link)
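
To make the distinction between the collection and its elements concrete, here is a small sketch on a one-line hand-written snippet (an assumption, so it runs offline). `find_all` returns a list-like `ResultSet`, and each item in it is a `Tag` that supports attribute lookup and `get_text()`.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup for illustration
demo = BeautifulSoup('<a href="https://example.com">Example</a>', "html.parser")

results = demo.find_all("a")
first = results[0]

print(type(results).__name__)  # ResultSet
print(type(first).__name__)    # Tag

# A Tag gives access to its attributes and its text
print(first["href"], first.get_text())  # https://example.com Example
```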

## Step 8: Save Extracted Links to a File

We can write the filtered links to a text file for future use. Here, we convert each matching `<a>` tag to a string (the full tag, not just its URL) and write it to `parsed_data.txt`, one tag per line.

In [None]:
# The with statement flushes and closes the file automatically
with open('parsed_data.txt', 'w', encoding='utf-8') as file:
    for link in soup.find_all('a', attrs={'href': re.compile('^http')}):
        soup_link = str(link)
        print(soup_link)
        file.write(soup_link + '\n')  # One tag per line
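
If only the link text and URL are needed later, a structured format can be easier to reload than raw tags. The following sketch writes them to a CSV file with the standard `csv` module. The function name `save_links_csv` and the column names are assumptions made for illustration, not part of the original notebook.

```python
import csv

from bs4 import BeautifulSoup


def save_links_csv(html, path):
    """Write (text, href) pairs for every anchor with an href to a CSV file.

    `save_links_csv` is a hypothetical helper; it returns the number of
    links written.
    """
    demo = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually have an href attribute
    rows = [(a.get_text(strip=True), a["href"]) for a in demo.find_all("a", href=True)]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "href"])  # header row
        writer.writerows(rows)
    return len(rows)
```

For example, `save_links_csv(r, 'parsed_links.csv')` would process the page fetched in Step 2; the output filename here is only a placeholder.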

## Step 9: Check Current Directory

Because `parsed_data.txt` was opened with a relative path, it is saved in the current working directory. Use the `%pwd` magic command to see where that is.

In [None]:
%pwd
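
`%pwd` works only inside IPython or Jupyter. As a plain-Python alternative, this short sketch uses `pathlib` to build the absolute path of the output file and to check whether it exists.

```python
from pathlib import Path

# Resolve the relative filename against the current working directory
output_path = Path("parsed_data.txt").resolve()

print(output_path)           # absolute path of the output file
print(output_path.exists())  # True once the file has been written
```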

---
### Summary
- Fetched a webpage and parsed it with Beautiful Soup.
- Extracted all links and filtered them by pattern using a regular expression.
- Extracted the text content of the webpage.
- Saved the filtered links to an external file for future use.