
# 🕸️ Web Scraping with Python - Beginner's Guide

**Created by:** Ashish Garg  
**Creation Date:** 10th July 2025  



## 📘 What is Web Scraping?

Web scraping is the process of automatically extracting information from websites.  
With Python, it's easy to fetch content from web pages and extract specific data using libraries like:

- `requests`: To send HTTP requests and receive responses
- `BeautifulSoup`: To parse and extract content from HTML/XML documents


In [None]:

# Install required packages (uncomment if not already installed)
# !pip install requests beautifulsoup4


In [None]:

import requests
from bs4 import BeautifulSoup


In [None]:

# Step 1: Fetch the webpage using requests

url = "https://quotes.toscrape.com/"
response = requests.get(url)

# Check the status code of the response
print(f"Status Code: {response.status_code}")

# View the first 500 characters of the HTML content
print(response.text[:500])



### 📥 Explanation

- `requests.get(url)`: Sends an HTTP GET request to the given URL.
- `response.status_code`: HTTP response status (200 = OK).
- `response.text`: Contains the raw HTML content of the page.

“We start by using the requests library. It sends a request to the webpage just like your browser does when you type a URL.
If the page loads correctly, it gives us a status code of 200, which means ‘OK’.
If it’s 404, that means ‘page not found’, and 500 means ‘server error’. Both are good to keep in mind when debugging.”
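
As a small illustration of the status codes mentioned above, here is a minimal sketch that turns a numeric code into a readable description. It uses Python's standard `http.HTTPStatus` enum; the helper name `describe_status` is hypothetical and not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable description of an HTTP status code.

    `describe_status` is a hypothetical helper written for this example.
    """
    try:
        phrase = HTTPStatus(code).phrase      # e.g. "OK", "Not Found"
    except ValueError:
        phrase = "Unknown status"             # code not defined in HTTPStatus

    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 500):
    print(describe_status(code))
```

In a real script you could pass `response.status_code` to this helper, or call `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.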

In [None]:

# Step 2: Parse the HTML using BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# Let's prettify the HTML (for visual understanding)
print(soup.prettify()[:500])



### 🧠 Parsing

- `BeautifulSoup(html, "html.parser")`: Parses HTML using Python’s built-in HTML parser.
- `soup.prettify()`: Returns formatted HTML (easier to read).
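
To see parsing in isolation, the sketch below runs BeautifulSoup on a tiny hand-written HTML string instead of a downloaded page. The HTML snippet is made up for illustration and is not taken from the quotes site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (hypothetical, for illustration only)
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, <b>world</b>!</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string        # text inside the <title> tag
first_paragraph = soup.p.get_text()   # text of the first <p>, nested tags included

print(page_title)
print(first_paragraph)
print(soup.prettify())                # indented view of the parsed tree
```

Working on a small string like this is a convenient way to experiment with selectors before pointing the code at a live page.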


In [None]:

# Step 3: Extract all quotes from the page

quotes = soup.find_all("span", class_="text")

print("Quotes found:")
for quote in quotes:
    print(quote.text)



### 🔍 Extracting Elements

- `soup.find_all(tag, class_=...)`: Finds all tags matching the criteria.
- `.text`: Extracts inner text from an HTML element.
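
One thing to watch for: `find()` returns `None` when nothing matches, so calling `.text` on the result can fail. The sketch below shows one possible way to guard against that, using a made-up HTML snippet and a hypothetical helper named `text_or_default`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second block has no author element
html = """
<div class="quote">
  <span class="text">First quote.</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_default(parent, tag, css_class, default="(unknown)"):
    """Return the stripped text of the first matching child, or a default.

    Hypothetical helper for this example; not part of BeautifulSoup.
    """
    element = parent.find(tag, class_=css_class)  # None if nothing matches
    if element is None:
        return default
    return element.get_text(strip=True)


results = []
for block in soup.find_all("div", class_="quote"):
    quote_text = text_or_default(block, "span", "text")
    author = text_or_default(block, "small", "author")
    results.append((quote_text, author))

print(results)
```

The loop keeps going even when a block is missing a field, which tends to matter on real pages where markup is not perfectly uniform.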


In [None]:

# Step 4: Extract quotes, authors, and tags together

for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text
    author = quote.find("small", class_="author").text
    tags = [tag.text for tag in quote.find_all("a", class_="tag")]
    print(f"{text} — {author} [{', '.join(tags)}]")


### Save to CSV (Optional)

In [None]:
import pandas as pd

data = []
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text
    author = quote.find("small", class_="author").text
    tags = ", ".join(tag.text for tag in quote.find_all("a", class_="tag"))
    data.append([text, author, tags])

df = pd.DataFrame(data, columns=["Quote", "Author", "Tags"])
df.to_csv("quotes.csv", index=False)
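
If pandas is not available, Python's built-in `csv` module can write the same kind of file. This is a swapped-in alternative to the pandas approach above, not a replacement for it. The sample rows are invented, and the output goes to an in-memory buffer so you can see exactly what the CSV text looks like.

```python
import csv
import io

# Hypothetical sample rows in the same [quote, author, tags] shape as above
rows = [
    ["A sample quote.", "Sample Author", "life, love"],
    ["Another quote.", "Other Author", "humor"],
]

buffer = io.StringIO()            # in-memory stand-in for a file
writer = csv.writer(buffer)
writer.writerow(["Quote", "Author", "Tags"])   # header row
writer.writerows(rows)                          # one line per quote

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the tags field containing a comma is wrapped in double quotes automatically. To write to disk instead, you could open a file with `open("quotes.csv", "w", newline="", encoding="utf-8")` and pass it to `csv.writer`.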


## ✅ Summary

In this notebook, we:
- Used `requests` to fetch HTML content.
- Parsed it using `BeautifulSoup`.
- Extracted quotes, authors, and tags from a test website.

You can try similar techniques on real-world sites (ethically and legally).


In [None]:
"""
🕸 Web Scraping Wikipedia IPL Results
📅 Created: 10 July 2025
👤 Author: Ashish Garg
"""

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Fetch the Wikipedia page
url = "https://en.m.wikipedia.org/wiki/List_of_Indian_Premier_League_seasons_and_results"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")

# Step 2: Locate the wikitables on the page
tables = soup.find_all("table", class_="wikitable")
if not tables:
    raise ValueError("No table with class 'wikitable' found; the page layout may have changed")

# The season results are usually in the first or second table
target_table = tables[0]  # May change in future; check manually if needed

# Step 3: Extract headers from the first row only, so that row-header
# cells (<th>) further down the table are not mistaken for column names
header_row = target_table.find("tr")
headers = [th.get_text(strip=True) for th in header_row.find_all("th")]

# Drop empty and duplicate headers, keeping the original order
clean_headers = []
seen = set()
for h in headers:
    if h and h not in seen:
        clean_headers.append(h)
        seen.add(h)

# Step 4: Extract table rows
data = []
rows = target_table.find_all("tr")[1:]  # Skip header row
for row in rows:
    cols = row.find_all(["td", "th"])
    cols = [ele.get_text(strip=True).replace('\xa0', ' ') for ele in cols]
    if len(cols) >= 5:
        cols = cols[:len(clean_headers)]                   # Trim extra columns
        cols += [""] * (len(clean_headers) - len(cols))    # Pad short rows (merged cells)
        data.append(cols)

# Step 5: Convert to DataFrame and save to CSV
df = pd.DataFrame(data, columns=clean_headers)
df.to_csv("ipl_seasons_results.csv", index=False)

# Display a preview
print(df.head())

👋 1. Opening & Greeting (1 min)

“Hi everyone! Good [morning/afternoon] and welcome to today’s learning session.
I’m really excited to walk you through something fun and practical — Web Scraping using Python.”

“Before we jump in, let me introduce myself quickly.”

⸻

🧑‍💻 2. Self-Introduction (1 min)

“I’m Ashish Garg. I love working with data and building small utilities to automate repetitive things — web scraping has always been one of my favorite tools to do that.”

“Today, my goal is to show you how anyone, even with very basic Python knowledge, can extract useful data from the internet.”

⸻

📚 3. What This Session Is About (2 mins)

“So, what exactly is web scraping?”
“It’s the process of writing code that can visit a website, read its content like a human would, and pick out specific information — automatically.”

“In today’s session, we’ll focus on scraping static websites, those that don’t require user interaction or JavaScript rendering.”

“It’s like building your own little robot to surf the web and bring back exactly what you need.”

“By the end of the session, you’ll be able to:”
	•	Fetch HTML content of a page
	•	Parse and extract specific data
	•	Save the results to a CSV file
	•	And yes, you’ll also see how we can apply this to a real-world use case like IPL stats from Wikipedia!

⸻
🔍 Slide 3: What is Web Scraping? (3 mins)

🗣️ “Web scraping is the process of automatically extracting data from websites using code.”

🗣️ “Let’s say you visit a site and see a list of books, or quotes, or match results. What if you wanted to download all of that into a spreadsheet?”

🗣️ “Instead of clicking and copying, your script does that — it sends a request to the site, reads the HTML, and picks out the useful data.”

🗣️ “This is super useful in things like market analysis, lead generation, trend tracking, even personal projects like compiling cricket stats.”

⚙️ 4. Tools & Libraries (2 mins)

“For this session, we’ll use just three libraries:”
	•	requests — to fetch the HTML content of a web page
	•	BeautifulSoup — to parse and extract data from the HTML
	•	pandas — to save the extracted data in table form (CSV)

“All three are well-documented, popular, and beginner-friendly.”

⸻

💻 5. Live Code Walkthrough (15 mins)

⸻

6. Real-World Example: Wikipedia IPL Results (7 mins)

“Let’s now take this a step further and scrape a real-world dataset — IPL season results from Wikipedia.”

“This page has a structured table, which makes it perfect for scraping.”

In [None]:

url = "https://en.m.wikipedia.org/wiki/List_of_Indian_Premier_League_seasons_and_results"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

tables = soup.find_all("table", class_="wikitable")
target_table = tables[0]

“We find all tables with class wikitable, which Wikipedia uses for its structured data tables.”

“Next, we extract the table header and each row’s data.”

In [None]:

# Extract header cells from the first row only
headers = [th.get_text(strip=True) for th in target_table.find("tr").find_all("th")]

# Extract rows, keeping only those whose cell count matches the header
# (rows with merged cells are skipped in this simplified version)
data = []
rows = target_table.find_all("tr")[1:]
for row in rows:
    cols = row.find_all(["td", "th"])
    cols = [col.get_text(strip=True) for col in cols]
    if len(cols) == len(headers):
        data.append(cols)

# Convert to DataFrame
df = pd.DataFrame(data, columns=headers)
df.to_csv("ipl_seasons_results.csv", index=False)
df.head()

“Done! We’ve now built a CSV dataset of all IPL finals, programmatically.”

⸻

📌 7. Final Notes and Best Practices (1 min)

“A few important reminders:”
	•	Always check the site’s robots.txt before scraping.
	•	Avoid scraping sensitive or copyrighted info.
	•	Add delays if you’re scraping multiple pages.
	•	Never overload servers; respect the site.
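
The robots.txt check from the list above can be sketched with Python's standard `urllib.robotparser`. The rules below are a made-up robots.txt written inline so the example runs without network access. The user agent name `my-scraper` and the helper `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the example runs offline
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)   # load the rules from the lines above


def is_allowed(url: str, user_agent: str = "my-scraper") -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/quotes/"))
print(is_allowed("https://example.com/private/data"))
```

For a live site you would instead call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()`, then use `can_fetch` the same way.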

🎤 8. Wrap-up and Q&A (1–2 mins)

“That wraps up the session!
I hope this helped demystify web scraping and gave you the confidence to try it out yourself.”

“I’m happy to take any questions now!”


🗣️ “A few important things before you start building your own scrapers…”
	•	Always check the site’s robots.txt
	•	Don’t send hundreds of requests in a burst; that can slow the site down or get your IP blocked
	•	Avoid scraping anything behind login pages
	•	Respect copyright — don’t scrape and reuse data for profit unless permitted

🗣️ “And don’t scrape personal data — stay safe and ethical.”
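
One simple way to avoid flooding a site is to pause between requests. The sketch below assumes a hypothetical helper `fetch_politely` that takes the actual download function as a parameter, so it can be tried out with a fake fetcher and no network access.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests.

    Hypothetical helper: `fetch` is any function that takes a URL and
    returns the page content; `sleep` can be swapped out for testing.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)   # wait between requests, not before the first
        pages.append(fetch(url))
    return pages


# Demo with a fake fetch function instead of real HTTP requests
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/page/1", "https://example.com/page/2"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).text`. A delay of one second or more is a common starting point, though the right value depends on the site.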

⸻

🏁 Wrap-up and Takeaways (1 min)

🗣️ “That’s the magic of web scraping — a little Python can go a long way!”

🗣️ “You now know how to collect real-world data without manual work.”

🗣️ “Try this out on your favorite websites. Look for patterns. Build datasets. Have fun!”

❓ Q&A (2–3 mins)

🗣️ “I’d love to hear your questions — happy to clarify or go deeper into any part!”