## Web Scraping With BeautifulSoup And Selenium
### What is Web Scraping?
Web scraping is like being a detective on the internet! It’s a way to collect information (like text, prices, or names) from websites by using a computer program. Instead of copying and pasting by hand, you let Python grab the data for you.

### BeautifulSoup
BeautifulSoup is a beginner-friendly Python library for pulling data out of web pages. It parses a page's HTML (the markup a website is written in) into a tree you can search, which makes it easy to find things like titles, paragraphs, or lists.
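
As a small illustration of that idea, the sketch below parses a made-up HTML snippet (the markup and its contents are invented for this example, not taken from a real site) and pulls out the title, the paragraphs, and the list items.

```python
from bs4 import BeautifulSoup

# A made-up page used only for illustration
html = """
<html>
  <head><title>Toy Shop</title></head>
  <body>
    <h1>Today's Deals</h1>
    <p>Robot kit: $19.99</p>
    <p>Puzzle box: $7.50</p>
    <ul>
      <li>Free shipping</li>
      <li>Gift wrap</li>
    </ul>
  </body>
</html>
"""

# Parse the markup with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# soup.title is the <title> tag; find_all returns every matching tag
page_title = soup.title.get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
list_items = [li.get_text(strip=True) for li in soup.find_all("li")]

print(page_title)
print(paragraphs)
print(list_items)
```

`find_all` always returns a list, so the same pattern works whether the page has one matching tag or fifty.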

### Why Use It?
- Saves time: Grab lots of data fast, like product prices or news headlines.
- Fun for beginners: You can explore websites and collect cool info.
- Useful: Get data for projects, like tracking toy prices or weather updates.

In [1]:
# import and configs
from bs4 import BeautifulSoup
import requests
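
For pages whose content is already in the HTML the server returns (no JavaScript rendering needed), `requests` plus BeautifulSoup is often enough. A minimal sketch, where `fetch_html` and `extract_title` are hypothetical helper names and the User-Agent string is an arbitrary placeholder:

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    headers = {"User-Agent": "Mozilla/5.0 (tutorial example)"}  # placeholder value
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def extract_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


# Offline check on an inline snippet; a real run would call fetch_html(url) first
print(extract_title("<html><head><title>Example page</title></head></html>"))
```

Keeping the download and the parsing in separate functions makes the parsing step easy to try on saved or hand-written HTML without touching the network.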

### Scrape data from Wikipedia using pandas

In [8]:
import pandas as pd

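# read_html returns a list with one DataFrame per <table> found on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average')

# The components table was at index 2 when this ran; the position can change if the page is edited
df = tables[2]
df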

| | Company | Exchange | Symbol | Industry | Date added | Notes | Index weighting |
|---|---|---|---|---|---|---|---|
| 0 | 3M | NYSE | MMM | Conglomerate | 1976-08-09 | As Minnesota Mining and Manufacturing | 2.17% |
| 1 | American Express | NYSE | AXP | Financial services | 1982-08-30 | | 4.31% |
| 2 | Amgen | NASDAQ | AMGN | Biopharmaceutical | 2020-08-31 | | 4.14% |
| 3 | Amazon | NASDAQ | AMZN | Retailing | 2024-02-26 | | 2.99% |
| 4 | Apple | NASDAQ | AAPL | Information technology | 2015-03-19 | | 2.92% |
| 5 | Boeing | NYSE | BA | Aerospace and defense | 1987-03-12 | | 3.03% |
| 6 | Caterpillar | NYSE | CAT | Construction and mining | 1991-05-06 | | 5.13% |
| 7 | Chevron | NYSE | CVX | Petroleum industry | 2008-02-19 | Also 1930-07-18 to 1999-11-01 | 2.01% |
| 8 | Cisco | NASDAQ | CSCO | Information technology | 2009-06-08 | | 0.92% |
| 9 | Coca-Cola | NYSE | KO | Drink industry | 1987-03-12 | Also 1932-05-26 to 1935-11-20 | 1.04% |
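
Picking a table by its position in the list (index 2 above) breaks as soon as the page gains or loses a table. One possible alternative is the `match` parameter of `pd.read_html`, which keeps only tables whose text matches a string or regex. The sketch below uses invented inline HTML rather than the live Wikipedia page, and it assumes an HTML parser that pandas supports (such as `lxml`) is installed.

```python
from io import StringIO

import pandas as pd

# Invented two-table page; only the second table mentions "Symbol"
html = """
<table>
  <tr><th>Note</th><th>Value</th></tr>
  <tr><td>placeholder</td><td>1</td></tr>
</table>
<table>
  <tr><th>Company</th><th>Symbol</th></tr>
  <tr><td>3M</td><td>MMM</td></tr>
  <tr><td>Apple</td><td>AAPL</td></tr>
</table>
"""

# match keeps only the tables that contain text matching the pattern
tables = pd.read_html(StringIO(html), match="Symbol")
components = tables[0]

print(len(tables))
print(components)
```

The result is still a list, so `tables[0]` is needed even when only one table matches.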


In [9]:
df.set_index('Symbol', inplace=True)

df.head()

| Symbol | Company | Exchange | Industry | Date added | Notes | Index weighting |
|---|---|---|---|---|---|---|
| MMM | 3M | NYSE | Conglomerate | 1976-08-09 | As Minnesota Mining and Manufacturing | 2.17% |
| AXP | American Express | NYSE | Financial services | 1982-08-30 | | 4.31% |
| AMGN | Amgen | NASDAQ | Biopharmaceutical | 2020-08-31 | | 4.14% |
| AMZN | Amazon | NASDAQ | Retailing | 2024-02-26 | | 2.99% |
| AAPL | Apple | NASDAQ | Information technology | 2015-03-19 | | 2.92% |


In [10]:
df.reset_index(inplace=True)
df.head()

| | Symbol | Company | Exchange | Industry | Date added | Notes | Index weighting |
|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | NYSE | Conglomerate | 1976-08-09 | As Minnesota Mining and Manufacturing | 2.17% |
| 1 | AXP | American Express | NYSE | Financial services | 1982-08-30 | | 4.31% |
| 2 | AMGN | Amgen | NASDAQ | Biopharmaceutical | 2020-08-31 | | 4.14% |
| 3 | AMZN | Amazon | NASDAQ | Retailing | 2024-02-26 | | 2.99% |
| 4 | AAPL | Apple | NASDAQ | Information technology | 2015-03-19 | | 2.92% |
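
Setting `Symbol` as the index is mostly useful for looking rows up by ticker. The sketch below shows that on a three-row frame built by hand from the rows above, using the non-`inplace` form of `set_index` so the original frame stays unchanged. It also converts the `Index weighting` strings to numbers, which is an extra step not performed in the cells above.

```python
import pandas as pd

# Three rows copied from the table above
df = pd.DataFrame(
    {
        "Company": ["3M", "American Express", "Apple"],
        "Symbol": ["MMM", "AXP", "AAPL"],
        "Index weighting": ["2.17%", "4.31%", "2.92%"],
    }
)

# Without inplace=True, set_index returns a new DataFrame
by_symbol = df.set_index("Symbol")

# Label-based lookup by ticker symbol
apple_name = by_symbol.loc["AAPL", "Company"]

# Turn "2.17%"-style strings into floats so they can be sorted or summed
weights = by_symbol["Index weighting"].str.rstrip("%").astype(float)

print(apple_name)
print(weights.sort_values(ascending=False).index.tolist())
```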


### Scrape headlines from the news website coindesk.com

In [11]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Set up Selenium with headless Chrome for efficiency
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

try:
    # URL for CoinDesk news
    url = "https://www.coindesk.com/"
    driver.get(url)

    # Wait for headlines to load (adjust timeout as needed)
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, "h3"))
        )
    except Exception as e:
        print(f"Timeout waiting for headlines: {e}")
        # Re-raise instead of calling exit(); the finally block below closes the browser
        raise

    # Parse page source with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Find news headlines (try a broader selector first)
    headlines = soup.find_all("h3")  # Fallback: all h3 tags
    # Alternative: Try a class-based selector (update after inspection)
    # headlines = soup.find_all("h3", {"class": "your-class-here"})

    # Debugging: Print number of headlines found
    print(f"Found {len(headlines)} headlines.")

    if not headlines:
        print("No headlines found. Check HTML structure or class name.")
        # Optional: Print part of the page source for debugging
        print("Sample HTML:", soup.prettify()[:1000])
        # No exit() needed: the loop below does nothing for an empty list,
        # and the finally block closes the browser

    # Extract up to 5 headlines
    for i, headline in enumerate(headlines[:5], 1):
        try:
            title = headline.text.strip()
            # Find the parent link (if it exists)
            parent_link = headline.find_parent("a")
            link = parent_link["href"] if parent_link and parent_link.get("href") else "No link found"
            # Handle relative links; strip the base URL's trailing slash to avoid "//"
            full_link = url.rstrip("/") + link if link.startswith("/") else link
            print(f"Headline {i}: {title}\nLink: {full_link}\n")
        except Exception as e:
            print(f"Error processing headline {i}: {str(e)}")

finally:
    # Clean up
    driver.quit()

Found 55 headlines.
Headline 1: Featured Stories
Link: No link found

Headline 2: Michael Saylor's Strategy Adds $18M of Bitcoin on Five-Year Anniversary of First Purchase
Link: https://www.coindesk.com/markets/2025/08/11/michael-saylor-s-strategy-adds-usd18m-in-bitcoin-on-five-year-anniversary-of-first-purchase

Headline 3: Michael Saylor's Strategy Adds $18M of Bitcoin on Five-Year Anniversary of First Purchase
Link: https://www.coindesk.com/markets/2025/08/11/michael-saylor-s-strategy-adds-usd18m-in-bitcoin-on-five-year-anniversary-of-first-purchase

Headline 4: Ether’s Rally Pulls Bitcoin Along: Crypto Daybook Americas
Link: https://www.coindesk.com/daybook-us/2025/08/11/ether-s-rally-pulls-bitcoin-along-crypto-daybook-americas

Headline 5: Bitcoin Bulls Take Another Shot at the Fibonacci Golden Ratio Above $122K as Inflation Data Looms
Link: https://www.coindesk.com/markets/2025/08/11/bitcoin-bulls-takes-another-shot-at-the-fibonacci-golden-ratio-above-usd122k-as-inflation-dat
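
The output above lists one headline twice, and links scraped from a page can be relative. When a base URL ends in `/` and the relative link starts with `/`, plain string concatenation also produces a doubled slash. A minimal clean-up sketch using only the standard library follows; `clean_headlines` is a hypothetical helper name, and the paths in `raw` are shortened, made-up stand-ins for real article links.

```python
from urllib.parse import urljoin


def clean_headlines(base_url, items):
    """Resolve relative links against base_url and drop exact repeats, keeping order."""
    seen = set()
    cleaned = []
    for title, link in items:
        # urljoin leaves absolute URLs alone and does not double the slash
        full_link = urljoin(base_url, link) if link else None
        key = (title, full_link)
        if key in seen:
            continue  # same title and link already collected
        seen.add(key)
        cleaned.append({"title": title, "link": full_link})
    return cleaned


# Made-up (title, link) pairs in the shape the scraper produces
raw = [
    ("Featured Stories", None),
    ("Story A", "/markets/story-a"),
    ("Story A", "/markets/story-a"),
    ("Story B", "https://www.coindesk.com/daybook-us/story-b"),
]

result = clean_headlines("https://www.coindesk.com/", raw)
for item in result:
    print(item)
```

Deduplicating on the `(title, link)` pair rather than on the link alone keeps a section label and an article apart even if they end up pointing at the same URL.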

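### A faster and more flexible way
This version looks for each headline's link in a parent `<a>` tag and, if there is none, falls back to the next `<a>` tag after the heading. The fallback can pair a section label with the following story's link, as happens with "Featured Stories" in the output.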

In [12]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Set up Selenium headless
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

try:
    url = "https://www.coindesk.com/"
    driver.get(url)

    # Wait until at least one <h3> element is present (raises TimeoutException after 15 s)
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h3"))
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Find all h3 headlines with links
    headlines_data = []
    for h3 in soup.select("h3"):
        title = h3.get_text(strip=True)
        link_tag = h3.find_parent("a") or h3.find_next("a")
        if link_tag and link_tag.get("href"):
            link = link_tag["href"]
            if link.startswith("/"):
                link = "https://www.coindesk.com" + link
            headlines_data.append({"title": title, "link": link})

    # Print first 5
    for i, item in enumerate(headlines_data[:5], 1):
        print(f"{i}. {item['title']}")
        print(f"   {item['link']}\n")

finally:
    driver.quit()


1. Featured Stories
   https://www.coindesk.com/markets/2025/08/11/michael-saylor-s-strategy-adds-usd18m-in-bitcoin-on-five-year-anniversary-of-first-purchase

2. Michael Saylor's Strategy Adds $18M of Bitcoin on Five-Year Anniversary of First Purchase
   https://www.coindesk.com/markets/2025/08/11/michael-saylor-s-strategy-adds-usd18m-in-bitcoin-on-five-year-anniversary-of-first-purchase

3. Michael Saylor's Strategy Adds $18M of Bitcoin on Five-Year Anniversary of First Purchase
   https://www.coindesk.com/markets/2025/08/11/michael-saylor-s-strategy-adds-usd18m-in-bitcoin-on-five-year-anniversary-of-first-purchase

4. Ether’s Rally Pulls Bitcoin Along: Crypto Daybook Americas
   https://www.coindesk.com/daybook-us/2025/08/11/ether-s-rally-pulls-bitcoin-along-crypto-daybook-americas

5. Bitcoin Bulls Take Another Shot at the Fibonacci Golden Ratio Above $122K as Inflation Data Looms
   https://www.coindesk.com/markets/2025/08/11/bitcoin-bulls-takes-another-shot-at-the-fibonacci-golde
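
To see why "Featured Stories" ends up with the first article's link in the output above, the sketch below runs the same `find_parent(...) or find_next(...)` lookup on a made-up snippet. The markup is an assumption about the general shape of such a page, not CoinDesk's actual HTML, and `loose_link` and `strict_link` are hypothetical helper names. The stricter variant only accepts a link that wraps the heading or sits inside it.

```python
from bs4 import BeautifulSoup

# Invented markup: a section label followed by two linked headlines
html = """
<section>
  <h3>Featured Stories</h3>
  <a href="/markets/story-one"><h3>Story One</h3></a>
  <h3><a href="/markets/story-two">Story Two</a></h3>
</section>
"""

soup = BeautifulSoup(html, "html.parser")


def loose_link(h3):
    """Same lookup as the cell above: enclosing <a>, else the next <a> in the document."""
    tag = h3.find_parent("a") or h3.find_next("a")
    return tag.get("href") if tag else None


def strict_link(h3):
    """Only accept an <a> that wraps the heading or sits inside it."""
    tag = h3.find_parent("a") or h3.find("a")
    return tag.get("href") if tag else None


for h3 in soup.find_all("h3"):
    print(h3.get_text(strip=True), loose_link(h3), strict_link(h3))
```

With the strict lookup, the label gets no link and can be filtered out the same way the scraper already skips headings without an `href`.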