# Parsing HTML with Beautiful Soup

Purpose: To install the Beautiful Soup module and run basic parsing functions.

## Step 1: Installation
To begin, you need to install Beautiful Soup. You can do this by running the following command in a code cell or the terminal:

In [3]:
# Uncomment ONE of the two lines below, then run the cell.
# pip install beautifulsoup4

# OR

# conda install -c anaconda beautifulsoup4


## Step 2: Importing the module
Once Beautiful Soup is installed, you can import it into your notebook using the following import statement:

In [None]:
from bs4 import BeautifulSoup
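
A quick way to confirm that the install worked is to print the installed version. This is a minimal check added for illustration, not one of the original steps; the variable name `installed_version` is arbitrary.

```python
import bs4

# bs4 exposes its version string as a module-level attribute
installed_version = bs4.__version__
print(f"Beautiful Soup version: {installed_version}")
```

If the import raises `ModuleNotFoundError`, the package is not installed in the environment the notebook kernel is using.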

## Step 3: Fetching HTML content
Next, fetch the HTML content of a web page. There are several ways to do this; this example uses the requests library to fetch a page by its URL:

In [None]:
import requests

# Specify the URL
# Note - if this page link breaks, select a different one from the PASDA data portal and paste it below.
quote_page = 'http://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=1203'

# Query the website and store the returned HTML in the variable 'html_content'
response = requests.get(quote_page)
html_content = response.text
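
A request can fail (broken link, timeout, server error), so it may help to wrap the fetch in a small function that reports the failure instead of passing an error page on to the parser. The sketch below is one possible approach; the function name `fetch_html` and the 10-second default timeout are assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for connection errors,
        # timeouts, malformed URLs, and the HTTPError raised above
        print(f"Request failed: {error}")
        return None
    return response.text
```

Calling `fetch_html(quote_page)` would then return either the HTML text or `None`, which the notebook can check before moving on to parsing.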

## Step 4: Creating a Beautiful Soup object
Now that you have the HTML content, you can create a Beautiful Soup object that will allow you to parse and manipulate the data. You can create a Beautiful Soup object by passing the HTML content and a parser of your choice. Here's an example using the built-in 'html.parser':

In [None]:
soup = BeautifulSoup(html_content, 'html.parser')
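
To see what the resulting object offers without depending on a live page, the same call can be tried on a small inline document. The markup below is hypothetical and exists only for illustration; the real PASDA page is structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page will differ
sample_html = """
<html>
  <head><title>Sample dataset</title></head>
  <body>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tag names are available as attributes and return the first match
page_title = sample_soup.title.string
first_paragraph = sample_soup.p.get_text()

# find_all() returns every matching tag as a list-like ResultSet
paragraph_count = len(sample_soup.find_all('p'))

print(page_title, first_paragraph, paragraph_count)
```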

## Step 5: Navigating the parsed data
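Once you have the Beautiful Soup object, you can search the parsed data. The `find()` method returns the first element that matches the given tag name and attributes, or `None` if nothing matches. The cell below looks up several fields by their `id` attribute and locates two links by their link text.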

In [None]:
# Find elements by the value of their id attribute:
titleField = soup.find(attrs={'id': 'Label1'})
dateField = soup.find(attrs={'id': 'Label2'})
publisherField = soup.find(attrs={'id': 'Label3'})
descriptionField = soup.find(attrs={'id': 'Label14'})

# Find links by their text. 'string' is the current name for the older
# 'text' argument, which recent Beautiful Soup releases treat as deprecated.
metadataLink = soup.find('a', href=True, string='Metadata')
downloadLink = soup.find('a', href=True, string='Download')
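
Because `find()` returns `None` when nothing matches, reading `.text` from a missing field raises an `AttributeError`. The sketch below shows one way to guard against that, and how a relative `href` could be resolved against the page URL. The markup, the `example.com` URL, and the helper name `text_or_default` are all hypothetical; whether the real page uses relative links is not something the original states.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup that mimics the id naming used above
sample_html = """
<span id="Label1">  Example Dataset Title </span>
<a href="/downloads/example.zip">Download</a>
"""
base_url = 'http://www.example.com/data/summary.aspx'  # placeholder URL

sample_soup = BeautifulSoup(sample_html, 'html.parser')


def text_or_default(element, default=''):
    """Return stripped text for a tag, or a default when find() gave None."""
    return element.get_text(strip=True) if element is not None else default


title_text = text_or_default(sample_soup.find(id='Label1'))
missing_text = text_or_default(sample_soup.find(id='Label99'), default='N/A')

# string= matches a tag whose only content is exactly this text
download_tag = sample_soup.find('a', href=True, string='Download')
# urljoin resolves a relative href against the page URL
download_url = urljoin(base_url, download_tag['href'])

print(title_text, missing_text, download_url)
```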

## Step 6: Putting it all together
The final cell extracts the text and link targets from the elements found in Step 5 and writes them to a CSV file:

In [None]:
import csv

# Extract the text and attributes from the fields
Title = titleField.text.strip()
Date = dateField.text.strip()
Publisher = publisherField.text.strip()
Description = descriptionField.text.strip()
Metadata = metadataLink['href']
Download = downloadLink['href']

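# Open the CSV file in append mode, so old data will not be erased
with open('output.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)

    # In append mode the file position starts at the end of the file,
    # so a position of 0 means the file is new or empty.
    # Write the header row only in that case, to avoid repeating it on every run.
    if csv_file.tell() == 0:
        writer.writerow(['Title', 'Date', 'Publisher', 'Description', 'Metadata', 'Download'])

    # Write the data row
    writer.writerow([Title, Date, Publisher, Description, Metadata, Download])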

Locate the new file called `output.csv` in the same folder as this Notebook. You should see a header row with the field names and a row with the parsed metadata.
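
The file can also be checked from Python rather than by opening it manually. The sketch below writes a small stand-in CSV to a temporary folder so that it runs on its own, then reads it back with `csv.DictReader`; the column subset and values are placeholders, not real parsed metadata. To inspect the notebook's actual output, point the reader at `output.csv` instead of `demo_path`.

```python
import csv
import os
import tempfile

# Write a small stand-in file so the example runs on its own;
# the values are placeholders, not real parsed metadata
demo_path = os.path.join(tempfile.mkdtemp(), 'output.csv')
with open(demo_path, 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Title', 'Date'])
    writer.writerow(['Example title', '2020'])

# DictReader uses the header row as the keys for each data row
with open(demo_path, newline='', encoding='utf-8') as csv_file:
    rows = list(csv.DictReader(csv_file))

for row in rows:
    print(row['Title'], row['Date'])
```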