# Understanding HTML and BeautifulSoup

This notebook walks you through HTML basics and how BeautifulSoup extracts data from it.

In [1]:
from bs4 import BeautifulSoup

## 1. Simplest Example: One Tag

Here's a tiny piece of HTML with just a heading:

In [2]:
# This is what raw HTML looks like to Python - just a string
html = "<h1>Software Engineer</h1>"

# Turn it into a BeautifulSoup object we can search
soup = BeautifulSoup(html, "html.parser")

# Find the h1 tag
heading = soup.find("h1")

print("The whole tag:", heading)
print("Just the text:", heading.text)

The whole tag: <h1>Software Engineer</h1>
Just the text: Software Engineer


### What happened?

1. We gave BeautifulSoup a string of HTML
2. `soup.find("h1")` searched for the `<h1>` tag
3. `.text` pulled out just the words inside (without the `<h1>` wrapper)

### 🧪 Experiment: Change "h1" to "p" and see what happens
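
If you try this experiment, you'll likely hit an error, because the HTML has no `<p>` tag. Here is a minimal sketch of what is going on and one way to guard against it. The variable names `paragraph` and `paragraph_text` are made up for this illustration:

```python
from bs4 import BeautifulSoup

html = "<h1>Software Engineer</h1>"
soup = BeautifulSoup(html, "html.parser")

# There is no <p> tag in this HTML, so find() returns None instead of a tag
paragraph = soup.find("p")
print("Result of find('p'):", paragraph)

# Calling .text on None raises an AttributeError, so check for None first
if paragraph is None:
    paragraph_text = "(no <p> tag found)"
else:
    paragraph_text = paragraph.text

print("Paragraph text:", paragraph_text)
```

The takeaway: `find()` gives you `None` when nothing matches, so it's worth checking before you reach for `.text`.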

## 2. Multiple Tags: Finding One Specific Thing

Real web pages have lots of tags. Let's make a fake job listing:

In [3]:
html = """
<div>
    <h1>Machine Learning Engineer</h1>
    <p>Company: TechCorp</p>
    <p>Location: San Francisco</p>
    <p>Salary: $150,000 - $200,000</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find just the job title
title = soup.find("h1")
print("Job Title:", title.text)

# Find the first paragraph
first_p = soup.find("p")
print("First paragraph:", first_p.text)

Job Title: Machine Learning Engineer
First paragraph: Company: TechCorp


### What happened?

- `find()` grabs the **first** match it sees
- Even though there are 3 `<p>` tags, `soup.find("p")` only grabbed "Company: TechCorp"

### 🧪 Experiment: What if you want the salary instead? Try finding it below:

In [4]:
# Your turn: How would you get the salary?
# Hint: You need to find ALL the <p> tags, not just the first one


## 3. Finding ALL Tags of One Type

Use `find_all()` to grab every match:

In [5]:
# `html` is still the job listing string from Section 2
soup = BeautifulSoup(html, "html.parser")

# Get ALL paragraphs
all_paragraphs = soup.find_all("p")

print("How many paragraphs?", len(all_paragraphs))
print()

# Loop through them
for p in all_paragraphs:
    print(p.text)

How many paragraphs? 3

Company: TechCorp
Location: San Francisco
Salary: $150,000 - $200,000


### What happened?

- `find_all()` returns a **list** of all matching tags (technically a `ResultSet`, which behaves like a list)
- You can loop through the list to process each one
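
Because the result behaves like a list, you can also index into it or build other data structures from it. The sketch below turns each paragraph into a dictionary entry. It assumes every paragraph follows the `Label: value` pattern, which is true for this fake listing but may not hold on a real page. The name `details` is just for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div>
    <h1>Machine Learning Engineer</h1>
    <p>Company: TechCorp</p>
    <p>Location: San Francisco</p>
    <p>Salary: $150,000 - $200,000</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
all_paragraphs = soup.find_all("p")

# Indexing works just like a normal list (0 = first item)
first_text = all_paragraphs[0].text
print("First paragraph:", first_text)

# Split each "Label: value" paragraph into a dictionary entry
# (assumes every paragraph contains ": " exactly as in this fake listing)
details = {}
for p in all_paragraphs:
    label, value = p.text.split(": ", 1)
    details[label] = value

print("Location:", details["Location"])
```

A dictionary like this lets you look things up by name instead of remembering which position each paragraph is in.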

### 🧪 Experiment: Print only the salary (the 3rd paragraph)

In [6]:
# Your turn: Get just the 3rd paragraph
# Hint: Lists use numbers starting at 0, so the 3rd item is index [2]


## 4. Tags with Classes: Finding Specific Elements

Real websites label tags with **classes** so different things can be styled differently.

Think of classes like name tags at a conference: there are lots of people, but you can find "all the speakers" by their name tag color.

In [7]:
html = """
<div>
    <a class="job-link" href="/job/123">AI Engineer</a>
    <a class="job-link" href="/job/456">Data Scientist</a>
    <a class="other-link" href="/about">About Us</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find all links with class="job-link"
job_links = soup.find_all("a", class_="job-link")

print("Found", len(job_links), "job links:")
for link in job_links:
    print("Text:", link.text)
    print("URL:", link["href"])  # Get the href attribute
    print()

Found 2 job links:
Text: AI Engineer
URL: /job/123

Text: Data Scientist
URL: /job/456



### What happened?

1. `find_all("a", class_="job-link")` finds ONLY the `<a>` tags with that specific class
2. The "About Us" link was ignored because it has `class="other-link"`
3. `link["href"]` grabs the URL from the tag (like getting a dictionary value)
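
Two details are worth knowing here. First, `link["href"]` raises a `KeyError` if the tag has no `href`, while `link.get("href")` returns `None` (or a default you choose). Second, a tag can have several classes at once, and `class_="job-link"` still matches it. The HTML below is a made-up variation of the example above to show both cases:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one link has two classes, the other has no href
html = """
<div>
    <a class="job-link featured" href="/job/123">AI Engineer</a>
    <a class="job-link">Data Scientist</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", class_="job-link")

urls = []
for link in links:
    # .get() returns the default instead of raising KeyError when href is missing
    url = link.get("href", "(no URL)")
    urls.append(url)
    print(link.text, "->", url)

# The class attribute comes back as a list, because a tag can have several classes
first_link_classes = links[0]["class"]
print("Classes on the first link:", first_link_classes)
```

Using `.get()` is a safer habit on real pages, where you can't be sure every tag has the attribute you expect.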

### 🧪 Experiment: What happens if you search for class="other-link"?

In [8]:
# Your turn: Find the "other-link" and print its text


## 5. Connecting to Your Job Scraper

Here's the actual line from your `example.ipynb` that finds job links:

In [9]:
# From your actual code:
# job_cards = soup.find_all("a", class_="jobcardStyle1")

# Let's simulate what that does:
fake_aijobs_html = """
<div>
    <a class="jobcardStyle1" href="https://aijobs.ai/job/123">ML Engineer at Google</a>
    <a class="jobcardStyle1" href="https://aijobs.ai/job/456">AI Researcher at OpenAI</a>
    <a class="jobcardStyle1" href="https://aijobs.ai/job/789">Data Scientist at Meta</a>
    <a class="navigation-link" href="/about">About</a>
</div>
"""

soup = BeautifulSoup(fake_aijobs_html, "html.parser")
job_cards = soup.find_all("a", class_="jobcardStyle1")

# Extract just the URLs (this is what your code does next)
job_urls = [a["href"] for a in job_cards]

print("Job URLs found:")
for url in job_urls:
    print(url)

Job URLs found:
https://aijobs.ai/job/123
https://aijobs.ai/job/456
https://aijobs.ai/job/789


### What happened?

1. Found all `<a>` tags with `class="jobcardStyle1"` (the job postings)
2. Ignored the "About" link because it has a different class
3. Used a **list comprehension** `[a["href"] for a in job_cards]` to extract just the URLs

This is exactly what your scraper does on the real aijobs.ai website!
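
On real pages, some links are relative (like `/job/123`) and some tags may be missing an `href` entirely. The sketch below wraps the same idea in a small helper that handles both cases. It is not the code from your scraper: the function name `extract_job_urls`, the `BASE_URL` constant, and the sample HTML are all assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, used only to turn relative links into full ones
BASE_URL = "https://aijobs.ai"


def extract_job_urls(html, base_url=BASE_URL, card_class="jobcardStyle1"):
    """Return full URLs for every <a> tag with the given class (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for a in soup.find_all("a", class_=card_class):
        href = a.get("href")
        if href:  # skip cards that have no link at all
            # urljoin leaves absolute URLs alone and completes relative ones
            urls.append(urljoin(base_url, href))
    return urls


# Made-up HTML mixing a relative link, an absolute link, and a card with no href
sample_html = """
<div>
    <a class="jobcardStyle1" href="/job/123">ML Engineer</a>
    <a class="jobcardStyle1" href="https://aijobs.ai/job/456">AI Researcher</a>
    <a class="jobcardStyle1">Card without a link</a>
    <a class="navigation-link" href="/about">About</a>
</div>
"""

job_urls = extract_job_urls(sample_html)
for url in job_urls:
    print(url)
```

Putting the logic in a function also makes it easy to test on small fake HTML strings like this one before pointing it at a live site.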

## 🏋️ Practice Exercise

Below is HTML for 3 job postings. Extract:
1. All job titles
2. All salaries
3. Just the location from the second job

In [10]:
practice_html = """
<div>
    <div class="job-card">
        <h2 class="job-title">Senior Data Engineer</h2>
        <p class="company">Stripe</p>
        <p class="location">Seattle, WA</p>
        <p class="salary">$180,000 - $220,000</p>
    </div>
    <div class="job-card">
        <h2 class="job-title">AI Research Scientist</h2>
        <p class="company">DeepMind</p>
        <p class="location">London, UK</p>
        <p class="salary">£120,000 - £180,000</p>
    </div>
    <div class="job-card">
        <h2 class="job-title">ML Platform Engineer</h2>
        <p class="company">Anthropic</p>
        <p class="location">San Francisco, CA</p>
        <p class="salary">$200,000 - $280,000</p>
    </div>
</div>
"""

soup = BeautifulSoup(practice_html, "html.parser")

# Your turn:
# 1. Find all job titles and print them


# 2. Find all salaries and print them


# 3. Find just the location from the 2nd job


## 💡 Solution (Try yourself first!)

In [11]:
# Solution 1: All job titles
titles = soup.find_all("h2", class_="job-title")
print("Job Titles:")
for title in titles:
    print("-", title.text)

print()

# Solution 2: All salaries
salaries = soup.find_all("p", class_="salary")
print("Salaries:")
for salary in salaries:
    print("-", salary.text)

print()

# Solution 3: Just the 2nd location
locations = soup.find_all("p", class_="location")
second_location = locations[1]  # Remember: 0 = first, 1 = second
print("2nd job location:", second_location.text)

Job Titles:
- Senior Data Engineer
- AI Research Scientist
- ML Platform Engineer

Salaries:
- $180,000 - $220,000
- £120,000 - £180,000
- $200,000 - $280,000

2nd job location: London, UK
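
The solution above builds three separate lists. Another possible approach is to loop over each job card and search inside it, so each job's details stay together. This is a sketch using a shortened copy of the practice HTML (two cards instead of three). The name `jobs` and the dictionary keys are made up for illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the practice HTML
cards_html = """
<div>
    <div class="job-card">
        <h2 class="job-title">Senior Data Engineer</h2>
        <p class="company">Stripe</p>
        <p class="location">Seattle, WA</p>
        <p class="salary">$180,000 - $220,000</p>
    </div>
    <div class="job-card">
        <h2 class="job-title">AI Research Scientist</h2>
        <p class="company">DeepMind</p>
        <p class="location">London, UK</p>
        <p class="salary">£120,000 - £180,000</p>
    </div>
</div>
"""

soup = BeautifulSoup(cards_html, "html.parser")

jobs = []
for card in soup.find_all("div", class_="job-card"):
    # Calling find() on a tag searches only inside that tag
    jobs.append({
        "title": card.find("h2", class_="job-title").text,
        "company": card.find("p", class_="company").text,
        "location": card.find("p", class_="location").text,
        "salary": card.find("p", class_="salary").text,
    })

for job in jobs:
    print(job["title"], "at", job["company"], "in", job["location"])
```

This works because `find()` and `find_all()` can be called on any tag, not just on `soup`. Note that this sketch assumes every card contains all four elements. On a real page you would want to check for `None` before reading `.text`.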


## Key Takeaways

1. **HTML** = labeled boxes of content (`<h1>`, `<p>`, `<a>`, etc.)
2. **BeautifulSoup** = tool to search through HTML and extract specific pieces
3. **`find()`** = get the first match
4. **`find_all()`** = get all matches (returns a list)
5. **Classes** = labels that help you find specific elements
6. **`.text`** = get just the words (no HTML tags)
7. **`tag["href"]`** = get an attribute (like a URL)

Your job scraper uses these same techniques to automatically extract data from hundreds of job postings!