# 🌐 Web Scraping

## 🔹 What is Web Scraping?
**Web Scraping** is the process of **extracting information from websites** automatically using code.  
Instead of copying data manually, web scraping allows us to **collect large amounts of data** quickly and efficiently.  

💡 Example: Extracting football match results or player stats from [FBref](https://fbref.com) for analysis.

---

## 🔹 Why Use Web Scraping?
✔ Automates repetitive data collection  
✔ Collects data **not available via APIs**  
✔ Enables large-scale **sports, finance, e-commerce, research, and academic** projects  
✔ Converts **unstructured website data** into structured formats like CSV/Excel  

---

## 🔹 Key Components of Web Scraping
1. **HTTP Requests** → Connect to a webpage (`requests`)  
2. **HTML Parsing** → Extract useful data (`BeautifulSoup`, `lxml`)  
3. **Data Structuring** → Organize into tables (`pandas`)  
4. **Data Storage** → Save into CSV, Excel, or database  
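
The four components can be sketched end to end on a small inline HTML snippet. This is a minimal illustration, not a production scraper: the HTML, team names, and numbers are made up, and the request step is replaced by a hard-coded string so the sketch runs offline.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# 1. HTTP request: a real script would use requests.get(url).text here.
#    A hard-coded, made-up snippet stands in for the downloaded page.
html = """
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Team A</td><td>10</td></tr>
  <tr><td>Team B</td><td>7</td></tr>
</table>
"""

# 2. HTML parsing: walk the rows and pull the text out of each cell.
soup = BeautifulSoup(html, "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

# 3. Data structuring: the first row is the header, the rest are records.
df = pd.DataFrame(rows[1:], columns=rows[0])
df["Points"] = df["Points"].astype(int)

# 4. Data storage: write CSV to an in-memory buffer
#    (pass a filename to df.to_csv instead to save to disk).
buffer = StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```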

---

## 🔹 Tools & Libraries
- 🐍 **Python Libraries**
  - `requests` → Download HTML  
  - `BeautifulSoup` → Parse HTML tags  
  - `pandas.read_html` → Extract tables directly  
  - `Scrapy` → Advanced framework for large projects  
  - `Selenium` → Automate scraping from dynamic JavaScript pages  
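
As a rough illustration of the `pandas.read_html` shortcut, the sketch below parses a made-up table from an inline string. It assumes a parser backend such as `lxml` is installed, since `read_html` depends on one, and it wraps the HTML in `StringIO` because recent pandas versions expect a file-like object rather than a raw HTML string.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for a downloaded page.
html = """
<table>
  <thead>
    <tr><th>Team</th><th>Played</th><th>Points</th></tr>
  </thead>
  <tbody>
    <tr><td>Team A</td><td>4</td><td>10</td></tr>
    <tr><td>Team B</td><td>4</td><td>7</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
standings = tables[0]
print(standings)
```

When a page contains several tables, the index into `tables` has to be chosen by inspecting the results, which is one reason to fall back to `BeautifulSoup` for more targeted selection.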

---

## 🔹 Web Scraping Workflow
1. 🔗 **Find the target website** (e.g., player stats page)  
2. 🔎 **Inspect HTML structure** using browser dev tools  
3. 📝 **Write scraping script** (using `requests` + `BeautifulSoup`)  
4. 🧹 **Clean and structure data**  
5. 💾 **Save results** (CSV/Database)  
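
Steps 3 to 5 of the workflow could look roughly like the sketch below. The function names, the `User-Agent` string, and the assumption that the first `<table>` on the page has a single header row are all hypothetical; the real selectors come from step 2, inspecting the page in dev tools.

```python
import csv

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and fail loudly on HTTP errors."""
    # Hypothetical User-Agent; identify your own script and contact here.
    headers = {"User-Agent": "stats-tutorial-script/0.1"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text


def parse_rows(html: str) -> list[dict]:
    """Turn the first <table> into a list of {column: value} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    # Assumption: every <th> in the table belongs to one header row.
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Skip the header row and any malformed rows.
        if len(cells) == len(header):
            records.append(dict(zip(header, cells)))
    return records


def save_csv(records: list[dict], path: str) -> None:
    """Write the cleaned records to a CSV file."""
    if not records:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)


# Typical use (needs network access and a URL you are allowed to scrape):
# save_csv(parse_rows(fetch_html(url)), "stats.csv")
```

Keeping download, parsing, and saving in separate functions makes it possible to test the parsing logic on saved HTML without hitting the website again.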

---



# 🏆 Extracting Premier League 2024/25 Table (Step-by-Step)

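This guide pulls the 2024/25 standings from the official Premier League website and writes a clean CSV. The table is rendered by JavaScript, so the cells below load the page in headless Chrome with Selenium and parse the rendered HTML with BeautifulSoup.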

---

## 0) Install dependencies (once)

```bash
pip install requests pandas beautifulsoup4 lxml selenium webdriver-manager
```

In [None]:
# ## 🔹 Example: Extracting Football Stats (BeautifulSoup)

# ```python
# from io import StringIO
#
# import requests
# from bs4 import BeautifulSoup
# import pandas as pd

# # Step 1: Get the webpage (fail early on HTTP errors such as 403 or 429)
# url = "https://fbref.com/en/comps/9/Premier-League-Stats"
# response = requests.get(url, timeout=30)
# response.raise_for_status()

# # Step 2: Parse HTML
# soup = BeautifulSoup(response.text, "html.parser")

# # Step 3: Find the first table on the page
# table = soup.find("table")

# # Step 4: Convert to DataFrame
# # (recent pandas versions expect a file-like object, so wrap the HTML in StringIO)
# df = pd.read_html(StringIO(str(table)))[0]

# # Step 5: Save
# df.to_csv("premier_league_stats.csv", index=False)

# print(df.head())
# ```


In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import time

In [5]:
# 1️⃣ URL of EPL table
url = "https://www.premierleague.com/en/tables?competition=8&season=2024&round=L_1&matchweek=-1&ha=-1"


In [4]:
# 2️⃣ Setup headless Chrome
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

In [6]:
# 3️⃣ Open the page
driver.get(url)
time.sleep(5)  # wait for JavaScript to render

In [7]:
# 4️⃣ Get page source and parse with BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")