# Web Scraping Basics

Web scraping is a way to **automatically collect information from websites**.  
Think of it as using a program to read a web page, just like your browser does, but instead of looking at it, the program grabs the data you want.

---

## The three basic steps

### a) Use `requests` to fetch pages

#### What is `requests`?

- `requests` is a **Python library** that lets your code download web pages from the internet.
- Imagine typing a website address in your browser and hitting enter.  
  `requests` lets your code do the **same thing**: it sends a request to the website and returns the page data.

#### Example

In [85]:
import requests

In [86]:
url = "https://techsabyte.com/"

In [87]:
url

'https://techsabyte.com/'

In [88]:
response = requests.get(url)

In [89]:
response.text

'<!doctype html>\n<html lang="en">\n  <head>\n    <meta charset="UTF-8" />\n    <!-- <link rel="icon" type="image/svg+xml" href="/vite.svg" /> -->\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\n    <title>TechsaByte</title>\n    <script type="module" crossorigin src="/assets/index-C1pstMX7.js"></script>\n    <link rel="stylesheet" crossorigin href="/assets/index-Dq-yZx2N.css">\n  </head>\n  <body>\n    <div id="root"></div>\n  </body>\n</html>\n'
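
The call above assumes the request succeeds. A slightly more defensive version might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original example; `timeout`, `raise_for_status()`, and `requests.exceptions.RequestException` are standard parts of `requests`.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, or None if the request fails.

    `fetch_html` is a hypothetical helper written for this tutorial.
    """
    try:
        # timeout stops the call from waiting forever on a slow server
        response = requests.get(url, timeout=timeout)
        # raise an exception for HTTP error codes such as 404 or 500
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # covers connection errors, timeouts, bad URLs, and HTTP errors
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html("https://techsabyte.com/")` should return the same HTML string as `response.text` above when the request succeeds. Note that this particular page contains only an empty `<div id="root">` in its body: the visible content is filled in by JavaScript in the browser, and `requests` does not run JavaScript.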

### b) Parse HTML with BeautifulSoup

#### What is “HTML”?

- **HTML** is the code websites use to tell browsers what to display (text, images, buttons, etc.).

#### What is “BeautifulSoup”?

- **BeautifulSoup** is another **Python tool** that helps you read and understand (parse) the messy HTML code from a web page, so you can easily find the data you need.

#### Example

In [90]:
from bs4 import BeautifulSoup

In [91]:
html_code = "<html><body><h1>Hello!</h1></body></html>"

In [92]:
html_code

'<html><body><h1>Hello!</h1></body></html>'

In [93]:
soup = BeautifulSoup(html_code, "html.parser")

In [94]:
soup.h1.text

'Hello!'
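
Beyond `soup.h1`, BeautifulSoup offers a few other common ways to locate elements. The sketch below uses a made-up HTML snippet (`sample_html`) and illustrative variable names to show `find`, `find_all`, `select_one` with a CSS selector, and attribute access.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
sample_html = """
<html><body>
  <h1 id="title">Hello!</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

demo_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag
title = demo_soup.find("h1").get_text()

# find_all() returns a list of every matching tag
paragraphs = [p.get_text() for p in demo_soup.find_all("p")]

# select_one() takes a CSS selector: here, a <p> with class "intro"
intro = demo_soup.select_one("p.intro").get_text()

# Tag attributes are read like dictionary keys
title_id = demo_soup.h1["id"]

print(title, paragraphs, intro, title_id)
```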

### c) Extract text, links, tables, images

With **BeautifulSoup**, you can pick out specific parts of a web page:

- **Text:** The actual words on the page
- **Links:** Website addresses (URLs) in the `href` attribute of `<a>` tags
- **Tables:** Data organized in rows and columns (`<table>`, `<tr>`, `<td>` tags)
- **Images:** Image URLs in the `src` attribute of `<img>` tags

In [95]:
# Get all text from the HTML
soup.get_text()

'Hello!'

In [96]:
# Extract all links (prints nothing here: html_code has no <a> tags)
for link in soup.find_all("a"):
    print(link.get("href"))

In [97]:
# Extract all images (prints nothing here: html_code has no <img> tags)
for img in soup.find_all("img"):
    print(img.get("src"))
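
Tables are listed above but not demonstrated, so here is a minimal sketch of one way to read them. The HTML snippet and the names `table_html`, `rows`, and `records` are made up for illustration, and the code assumes a simple table whose first row is the header.

```python
from bs4 import BeautifulSoup

# Made-up table used only for this illustration
table_html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.90</td></tr>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")

# Collect the cell text of each row (<tr>), header cells included
rows = []
for tr in table_soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# Treat the first row as the header and pair it with each data row
header, *data = rows
records = [dict(zip(header, row)) for row in data]

print(records)
```

All values come back as strings, so numeric columns such as `Price` would still need converting (for example with `float`) before doing arithmetic.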

## In summary:

- **`requests`** downloads web pages for you.
- **BeautifulSoup** helps you read and pull out the parts you want.
- You can then extract **text, links, tables, images**—whatever you need!

## Real-life example

In [98]:
import requests
from bs4 import BeautifulSoup

In [99]:
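url = "https://example.com/"
response = requests.get(url, timeout=10)  # give up after 10 seconds
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)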

In [100]:
soup = BeautifulSoup(response.text, "html.parser")

`"html.parser"`  
This argument tells BeautifulSoup which parser to use to read the HTML.  
`"html.parser"` is the parser from Python’s standard library, so it needs no extra installation and works well for most pages.  
(Other options are `"lxml"`, which is faster, and `"html5lib"`, which handles broken HTML the way a browser does; both must be installed separately. `"html.parser"` is a good choice for beginners.)
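
If you want the faster `lxml` parser when it is available but do not want your code to break when it is not, one possible approach is to fall back to `"html.parser"`. The helper name `make_soup` is hypothetical; `FeatureNotFound` is the exception BeautifulSoup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml if installed, otherwise with html.parser.

    `make_soup` is a hypothetical helper written for this tutorial.
    """
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the standard-library parser
        return BeautifulSoup(html, "html.parser")


demo = make_soup("<p>Hi</p>")
print(demo.p.get_text())
```

Different parsers can build slightly different trees from the same broken HTML, so it is usually safer to pick one parser and use it consistently across a project.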

In [101]:
# Extract all text
all_text = soup.get_text(separator="\n", strip=True)
all_text

'Example Domain\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...'

In [102]:
# Save the extracted text to a file
with open("texts.txt", "w", encoding="utf-8") as f:
    f.write(all_text)

In [103]:
# Extract all links
for a_tag in soup.find_all("a", href=True):
    print(a_tag['href'])

https://www.iana.org/domains/example


In [104]:
# Save the links to a file, one per line
with open("links.txt", "w", encoding="utf-8") as f:
    for a_tag in soup.find_all("a", href=True):
        f.write(a_tag['href'] + "\n")

In [105]:
# Extract all image URLs
for img_tag in soup.find_all("img", src=True):
    print(img_tag['src'])

In [106]:
# Save the image URLs to a file, one per line
with open("images.txt", "w", encoding="utf-8") as f:
    for img_tag in soup.find_all("img", src=True):
        f.write(img_tag['src'] + "\n")
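
Many pages use relative addresses such as `/about` or `../img/logo.png` in `href` and `src`, which are not usable on their own once saved to a file. A common way to handle this is `urljoin` from Python's standard library, sketched below. The base URL and the HTML snippet are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up page address and HTML used only for this illustration
base_url = "https://example.com/docs/"
page_html = (
    '<a href="/about">About</a>'
    '<a href="guide.html">Guide</a>'
    '<img src="../img/logo.png">'
)

page_soup = BeautifulSoup(page_html, "html.parser")

# urljoin combines the page address with each relative address
links = [urljoin(base_url, a["href"]) for a in page_soup.find_all("a", href=True)]
images = [urljoin(base_url, img["src"]) for img in page_soup.find_all("img", src=True)]

print(links)
print(images)
```

Addresses that are already absolute, like the `https://www.iana.org/domains/example` link found above, pass through `urljoin` unchanged.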