# Data Scraping with Python

This notebook demonstrates how to scrape headlines, links, and short article snippets from BBC pages (the News homepage, the Sport and Business pages, and search results) using Python, `requests`, and `BeautifulSoup`.

**Explanation of the code:**
- `import requests` and `from bs4 import BeautifulSoup`: Import libraries for HTTP requests and HTML parsing.
- `news_url = "https://www.bbc.com/news"`: Set the URL for the BBC News homepage.
- `requests.get(news_url).content`: Fetch the HTML content of the page.
- `BeautifulSoup(..., "html.parser")`: Parse the HTML content.
- Try to find all `<h3>` tags with the class `gs-c-promo-heading__title` (common for headlines).
- If not found, try anchor tags with class `gs-c-promo-heading`.
- If still not found, fall back to all anchor tags with `/news/` in their `href` and non-empty text.
- For each headline found, print its text. Later cells also follow the headline's link to fetch a publication date and a short snippet of the article (if available).

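Trying several selectors makes the code more tolerant of changes in the BBC News HTML structure. The trade-off is that the broad last-resort fallback also matches navigation links, as the output of the first cell shows.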
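
The selector cascade described above can be sketched as one reusable function. This is a minimal sketch rather than the notebook's own code: the function name `find_headlines`, the `path_fragment` parameter, and the sample HTML are hypothetical, and the class names are the ones listed above, which the live site may no longer use.

```python
from bs4 import BeautifulSoup


def find_headlines(html, path_fragment="/news/"):
    """Return headline tags, trying the most specific selector first."""
    soup = BeautifulSoup(html, "html.parser")

    # 1. <h3> tags carrying the promo-title class
    found = soup.find_all("h3", class_="gs-c-promo-heading__title")
    if found:
        return found

    # 2. Anchor tags carrying the promo-heading class, with visible text
    found = [a for a in soup.select("a.gs-c-promo-heading") if a.get_text(strip=True)]
    if found:
        return found

    # 3. Last resort: any link whose href contains the path fragment
    return [
        a for a in soup.find_all("a", href=True)
        if path_fragment in a["href"] and a.get_text(strip=True)
    ]


# Hypothetical markup with no promo classes, so only step 3 can match
sample_html = """
<a href="/news/uk">UK</a>
<a href="/sport/football">Football</a>
<a href="/news/articles/abc123">Example story</a>
"""
titles = [tag.get_text(strip=True) for tag in find_headlines(sample_html)]
print(titles)
```

Passing `path_fragment="/business/"` would mirror the fallback used in the Business cell. Because step 3 accepts every link under that path, it returns section links as well as stories.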

# Get headlines

In [7]:
import requests
from bs4 import BeautifulSoup

# URL for BBC News homepage
news_url = "https://www.bbc.com/news"

# Fetch and parse the page
response = requests.get(news_url, timeout=10)  # timeout so the call cannot hang indefinitely
news_soup = BeautifulSoup(response.content, "html.parser")

# Try multiple selectors for headlines
headlines = news_soup.find_all("h3", class_="gs-c-promo-heading__title")


# If not found, try anchor tags with class 'gs-c-promo-heading'
if not headlines:
    promo_anchors = news_soup.select("a.gs-c-promo-heading")
    headlines = [a for a in promo_anchors if a.text.strip()]

# If still not found, fallback to all anchor tags with '/news/' in href and non-empty text
if not headlines:
    headlines = [
        a for a in news_soup.find_all("a", href=True)
        if "/news/" in a["href"] and a.text.strip()
    ]

if not headlines:
    print("No headlines found using known selectors.")
else:
    for idx, headline in enumerate(headlines, start=1):
        # Get the headline text
        headline_text = headline.text.strip()
        print(f"{idx}. {headline_text}")

1. Israel-Gaza War
2. War in Ukraine
3. US & Canada
4. UK
5. Africa
6. Asia
7. Australia
8. Europe
9. Latin America
10. Middle East
11. In Pictures
12. BBC InDepth
13. BBC Verify
14. Israel-Gaza War
15. War in Ukraine
16. US & Canada
17. UK
18. UK Politics
19. England
20. N. Ireland
21. N. Ireland Politics
22. Scotland
23. Scotland Politics
24. Wales
25. Wales Politics
26. Africa
27. Asia
28. China
29. India
30. Australia
31. Europe
32. Latin America
33. Middle East
34. In Pictures
35. BBC InDepth
36. BBC Verify
37. LIVEGaza worse than hell on Earth, International Red Cross chief tells BBC as aid centres close for dayThe president of the ICRC tells the BBC's international editor Jeremy Bowen that Palestinians have been stripped of human dignity.
38. South Korea's new president has a Trump-shaped crisis to avertLee Jae-myung secured a storming victory, but his honeymoon will barely last the day.13 hrs agoAsia
39. The Indian pilot set for a historic space journey on Axiom-4Shubhanshu Shu
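
The output above repeats several entries (for example, "Israel-Gaza War" appears at positions 1 and 14), presumably because the same link occurs more than once in the page. One way to drop repeats while keeping first-seen order is sketched below. The helper name `unique_in_order` is hypothetical, and the sample list is drawn from the output above.

```python
def unique_in_order(items):
    """Remove duplicates while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


headline_texts = [
    "Israel-Gaza War",
    "War in Ukraine",
    "US & Canada",
    "Israel-Gaza War",
    "War in Ukraine",
]
deduped = unique_in_order(headline_texts)
print(deduped)
```

In the scraping cells this could be applied to the stripped headline text (or to the resolved links) before printing.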

# Other pages

In [13]:
import requests
from bs4 import BeautifulSoup

# URL for BBC Sport page
news_url = "https://www.bbc.com/sport"

# Fetch and parse the page
response = requests.get(news_url, timeout=10)  # timeout so the call cannot hang indefinitely
news_soup = BeautifulSoup(response.content, "html.parser")

# Try multiple selectors for headlines
headlines = news_soup.find_all("h3", class_="gs-c-promo-heading__title")

# If not found, try anchor tags with class 'gs-c-promo-heading'
if not headlines:
    promo_anchors = news_soup.select("a.gs-c-promo-heading")
    headlines = [a for a in promo_anchors if a.text.strip()]

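# If still not found, fall back to all anchor tags with '/news/' in href and non-empty text.
# NOTE: the filter still uses '/news/', so on the Sport page it only matches links into the
# News section; change it to '/sport/' to target sport stories.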
if not headlines:
    headlines = [
        a for a in news_soup.find_all("a", href=True)
        if "/news/" in a["href"] and a.text.strip()
    ]

if not headlines:
    print("No headlines found using known selectors.")
else:
    for idx, headline in enumerate(headlines, start=1):
        # Get the headline text
        headline_text = headline.text.strip()
        print(f"{idx}. {headline_text}")

1. Where and how to watch BBC News
2. BBC World News: 24 hour news TV channel


In [2]:
import requests
from bs4 import BeautifulSoup

# URL for BBC Business page
business_url = "https://www.bbc.com/business"

# Fetch and parse the page
response = requests.get(business_url, timeout=10)  # timeout so the call cannot hang indefinitely
soup = BeautifulSoup(response.content, "html.parser")

# Try to find all headlines (BBC often uses h3 with class 'gs-c-promo-heading__title')
headlines = soup.find_all("h3", class_="gs-c-promo-heading__title")

# If not found, try anchor tags with class 'gs-c-promo-heading'
if not headlines:
    promo_anchors = soup.select("a.gs-c-promo-heading")
    headlines = [a for a in promo_anchors if a.text.strip()]

# If still not found, fallback to all anchor tags with '/business/' in href and non-empty text
if not headlines:
    headlines = [
        a for a in soup.find_all("a", href=True)
        if "/business/" in a["href"] and a.text.strip()
    ]

if not headlines:
    print("No business headlines found using known selectors.")
else:
    for idx, headline in enumerate(headlines, start=1):
        headline_text = headline.text.strip()
        print(f"{idx}. {headline_text}")

1. Executive Lounge
2. Technology of Business
3. Future of Business
4. Executive Lounge
5. Technology of Business
6. Future of Business
7. World of Business
8. NYSE Opening Bell
9. Executive Lounge
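
Every entry in the output above is a section label rather than a story, which suggests the last-resort fallback matched navigation links. A possible refinement is to keep only links that look like article pages. The sketch below assumes story URLs contain an `/articles/` segment followed by an ID, as in the `bbc.co.uk/news/articles/...` links that appear in the search results at the end of this notebook. That pattern is an assumption about the site, not a documented rule, and the helper name `looks_like_article` is hypothetical.

```python
import re

# Assumed shape of a story URL: ".../articles/<alphanumeric id>"
ARTICLE_PATH = re.compile(r"/articles/[a-z0-9]+")


def looks_like_article(href):
    """Return True if the href matches the assumed article pattern."""
    return bool(ARTICLE_PATH.search(href))


hrefs = [
    "https://www.bbc.com/news/war-in-ukraine",
    "https://www.bbc.co.uk/news/articles/c2e373yzndro",
    "/business/executive-lounge",  # hypothetical section link
]
article_links = [h for h in hrefs if looks_like_article(h)]
print(article_links)
```

A filter like this would be added as an extra condition in the fallback list comprehension. It trades recall for precision: stories published under a different URL scheme would be skipped.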


# Download BBC news headlines with links and snippets

In [8]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime

# URL for BBC News homepage
news_url = "https://www.bbc.com/news"

# Fetch and parse the page
response = requests.get(news_url, timeout=10)  # timeout so the call cannot hang indefinitely
news_soup = BeautifulSoup(response.content, "html.parser")

# Try multiple selectors for headlines
headlines = news_soup.find_all("h3", class_="gs-c-promo-heading__title")

# If not found, try anchor tags with class 'gs-c-promo-heading'
if not headlines:
    promo_anchors = news_soup.select("a.gs-c-promo-heading")
    headlines = [a for a in promo_anchors if a.text.strip()]

# If still not found, fallback to all anchor tags with '/news/' in href and non-empty text
if not headlines:
    headlines = [
        a for a in news_soup.find_all("a", href=True)
        if "/news/" in a["href"] and a.text.strip()
    ]

if not headlines:
    print("No headlines found using known selectors.")
else:
    for idx, headline in enumerate(headlines, start=1):
        # Get the headline text and link
        headline_text = headline.text.strip()
        # Try to get the URL from the parent anchor or from the tag itself
        link = None
        if headline.name == "a" and headline.has_attr("href"):
            link = headline["href"]
        else:
            parent_a = headline.find_parent("a", href=True)
            if parent_a:
                link = parent_a["href"]
        # Make sure the link is absolute
        if link and link.startswith("/"):
            link = "https://www.bbc.com" + link
        # Print headline
        print(f"{idx}. {headline_text}")
        if link:
            print(f"   Link: {link}")
            # Fetch the news article page
            try:
                article_resp = requests.get(link, timeout=10)
                article_soup = BeautifulSoup(article_resp.content, "html.parser")
                # Try to extract all paragraphs in the article body
                # BBC often uses <article> tag or role="main"
                article_tag = article_soup.find("article")
                if not article_tag:
                    article_tag = article_soup.find(attrs={"role": "main"})
                if article_tag:
                    paragraphs = article_tag.find_all("p")
                else:
                    paragraphs = article_soup.find_all("p")
                # Combine the text of all paragraphs
                article_text = " ".join([p.get_text(strip=True) for p in paragraphs])
                # Print a snippet (first 400 chars)
                snippet = article_text[:400] + ("..." if len(article_text) > 400 else "")
                # Try to extract date and time
                date_str = ""
                # Look for <time> tag with datetime attribute
                time_tag = article_soup.find("time")
                if not time_tag:
                    # Try to find meta tag with property 'article:published_time'
                    meta_time = article_soup.find("meta", attrs={"property": "article:published_time"})
                    if meta_time and meta_time.has_attr("content"):
                        date_str = meta_time["content"]
                if not date_str and time_tag and time_tag.has_attr("datetime"):
                    date_str = time_tag["datetime"]
                elif not date_str and time_tag:
                    date_str = time_tag.get_text(strip=True)
                if date_str:
                    try:
                        dt = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
                        date_str = dt.strftime("%Y-%m-%d %H:%M:%S %Z")
                    except ValueError:
                        # Not an ISO 8601 string (e.g. "13 hrs ago"); keep the raw text
                        pass
                print(f"   Date: {date_str if date_str else '(No date found)'}")
                print(f"   News: {snippet}")
            except Exception as e:
                print("   Date: (No date found)")
                print(f"   News: (Could not fetch article: {e})")
        else:
            print("   Link: (No link found)")
            print("   Date: (No date found)")
            print("   News: (No article found)")

1. Israel-Gaza War
   Link: https://www.bbc.com/news/topics/c2vdnvdg6xxt
   Date: (No date found)
   News: Aid distribution will be halted on Wednesday, with Gazans warned by the IDF not to travel to collection points. Civilians were fired at by tanks, drones and helicopters, civil defence officials say, while Israel says only that troops fired towards several suspects. Hamdi al-Najjar's wife and 11-year-old son Adam, who was severely injured, are the only survivors in their family. Reports say an inci...
2. War in Ukraine
   Link: https://www.bbc.com/news/war-in-ukraine
   Date: (No date found)
   News: Ukraine says it hit the bridge with underwater explosives - the crossing was built by Russia after it annexed the Crimean Peninsula. The two warring sides remain far apart on how to end the conflict after a second round of talks in Turkey. Footage shows attack drones homing in on their targets as they sit on the tarmac. The BBC's Paul Adams reflects on the impact of Ukraine's major dro
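
Both entries above report "(No date found)". When a timestamp is present, the cell normalizes it with `datetime.fromisoformat`; the sketch below isolates that step so its behavior is easier to see. The helper name `normalize_timestamp` and the sample timestamp are hypothetical. Replacing a trailing `Z` matters because `fromisoformat` only accepts it from Python 3.11 onward.

```python
from datetime import datetime


def normalize_timestamp(raw):
    """Format an ISO 8601 string as 'YYYY-MM-DD HH:MM:SS TZ'.

    Returns the input unchanged when it is not ISO 8601,
    e.g. relative text such as '13 hrs ago'.
    """
    try:
        # A trailing 'Z' (UTC) is rewritten as an explicit offset
        parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return raw
    # %Z is empty for naive datetimes, so strip the trailing space
    return parsed.strftime("%Y-%m-%d %H:%M:%S %Z").strip()


print(normalize_timestamp("2025-06-04T10:15:00.000Z"))
print(normalize_timestamp("13 hrs ago"))
```

Catching only `ValueError` keeps unrelated errors visible instead of silently swallowing them.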

# Saving News in Markdown file

In [3]:
# Save the headlines, links, dates, and snippets to a Markdown file
from datetime import datetime
import requests
from bs4 import BeautifulSoup

# URL for BBC News homepage
news_url = "https://www.bbc.com/news"

# Fetch and parse the page
response = requests.get(news_url, timeout=10)  # timeout so the call cannot hang indefinitely
news_soup = BeautifulSoup(response.content, "html.parser")

# Try multiple selectors for headlines
headlines = news_soup.find_all("h3", class_="gs-c-promo-heading__title")

# If not found, try anchor tags with class 'gs-c-promo-heading'
if not headlines:
    promo_anchors = news_soup.select("a.gs-c-promo-heading")
    headlines = [a for a in promo_anchors if a.text.strip()]

# If still not found, fallback to all anchor tags with '/news/' in href and non-empty text
if not headlines:
    headlines = [
        a for a in news_soup.find_all("a", href=True)
        if "/news/" in a["href"] and a.text.strip()
    ]

if not headlines:
    print("No headlines found using known selectors.")
else:
    # Get current date and time for filename
    now_str = datetime.now().strftime("%Y%m%d_%H%M%S")
    md_filename = f"bbc_headlines_{now_str}.md"

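    # Write as UTF-8 so non-ASCII characters in headlines do not depend on the platform default
    with open(md_filename, "w", encoding="utf-8") as f: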
        for idx, headline in enumerate(headlines, start=1):
            headline_text = headline.text.strip()
            link = None
            if headline.name == "a" and headline.has_attr("href"):
                link = headline["href"]
            else:
                parent_a = headline.find_parent("a", href=True)
                if parent_a:
                    link = parent_a["href"]
            if link and link.startswith("/"):
                link = "https://www.bbc.com" + link

            # Fetch the news article page to get date and snippet
            date_str = ""
            snippet = ""
            if link:
                try:
                    article_resp = requests.get(link, timeout=10)
                    article_soup = BeautifulSoup(article_resp.content, "html.parser")
                    # Extract paragraphs for snippet
                    article_tag = article_soup.find("article")
                    if not article_tag:
                        article_tag = article_soup.find(attrs={"role": "main"})
                    if article_tag:
                        paragraphs = article_tag.find_all("p")
                    else:
                        paragraphs = article_soup.find_all("p")
                    article_text = " ".join([p.get_text(strip=True) for p in paragraphs])
                    snippet = article_text[:400] + ("..." if len(article_text) > 400 else "")
                    # Extract date
                    time_tag = article_soup.find("time")
                    if not time_tag:
                        meta_time = article_soup.find("meta", attrs={"property": "article:published_time"})
                        if meta_time and meta_time.has_attr("content"):
                            date_str = meta_time["content"]
                    if not date_str and time_tag and time_tag.has_attr("datetime"):
                        date_str = time_tag["datetime"]
                    elif not date_str and time_tag:
                        date_str = time_tag.get_text(strip=True)
                    if date_str:
                        try:
                            dt = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
                            date_str = dt.strftime("%Y-%m-%d %H:%M:%S %Z")
                        except ValueError:
                            # Not an ISO 8601 string; keep the raw text
                            pass
                except Exception:
                    snippet = "(Could not fetch article)"
                    date_str = ""
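            # Use Markdown link syntax only when a URL was found (avoids writing "(None)")
            title = f"[{headline_text}]({link})" if link else headline_text
            f.write(f"{idx}. {title}\n")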
            f.write(f"   Date: {date_str if date_str else '(No date found)'}\n")
            f.write(f"   News: {snippet}\n")
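
The `f.write` calls above define the layout of each entry in the Markdown file. A minimal sketch of that layout as a standalone function follows. The name `format_entry` is hypothetical, and printing a plain title when no link was found is an assumption about the intended output.

```python
def format_entry(idx, title, link=None, date_str="", snippet=""):
    """Render one headline as a numbered Markdown list item."""
    # Use Markdown link syntax only when a URL is available
    heading = f"[{title}]({link})" if link else title
    return (
        f"{idx}. {heading}\n"
        f"   Date: {date_str or '(No date found)'}\n"
        f"   News: {snippet}\n"
    )


entry = format_entry(1, "War in Ukraine", "https://www.bbc.com/news/war-in-ukraine")
print(entry, end="")
```

Building the string separately from the file write makes the format easy to check without touching the network or the file system.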

# Pakistan-India War 2025 from BBC News

In [11]:
import requests
from bs4 import BeautifulSoup
import re

# Search query for BBC
query = "pakistan india war"
# query = input("Enter search query: ")
# URL for BBC search; passing the query via `params` lets requests URL-encode it
search_url = "https://www.bbc.co.uk/search"

response = requests.get(search_url, params={"q": query}, timeout=10)
soup = BeautifulSoup(response.content, "html.parser")

# Try to find all promo items (less dependent on class names)
results = []
pattern = re.compile(r"\bPakistan\b", re.IGNORECASE)
pattern2 = re.compile(r"\bIndia\b", re.IGNORECASE)

for item in soup.find_all(["article", "li"]):
    # Try to get headline and snippet
    headline_tag = item.find(["h1", "h2", "h3", "span"])
    snippet_tag = item.find("p")
    headline = headline_tag.get_text(strip=True) if headline_tag else ""
    snippet = snippet_tag.get_text(strip=True) if snippet_tag else ""
    # Check if both 'Pakistan' and 'India' are present in either headline or snippet
    if (pattern.search(headline) and pattern2.search(headline)) or \
       (pattern.search(snippet) and pattern2.search(snippet)):
        link_tag = item.find("a", href=True)
        link = link_tag["href"] if link_tag else ""
        if link and link.startswith("/"):
            link = "https://www.bbc.co.uk" + link
        results.append({
            "headline": headline,
            "snippet": snippet,
            "link": link
        })

if not results:
    print("No results found for 'Pakistan India war' on BBC.")
else:
    for idx, res in enumerate(results, 1):
        print(f"{idx}. {res['headline']}")
        print(f"   Link: {res['link']}")
        print(f"   Snippet: {res['snippet']}\n")

1. How real is the risk of nuclear war between India and Pakistan?
   Link: https://www.bbc.co.uk/news/articles/c2e373yzndro
   Snippet: How real is the risk of nuclear war between India and Pakistan?

2. The first drone war opens a new chapter in India-Pakistan conflict
   Link: https://www.bbc.co.uk/news/articles/cwy6w6507wqo
   Snippet: The first drone war opens a new chapter in India-Pakistan conflict

3. The World Tonight. Can Pakistan and India avoid war? Listen NowThe World TonightCan Pakistan and India avoid war?
   Link: https://www.bbc.co.uk/sounds/play/m002btyv
   Snippet: The World Tonight. Can Pakistan and India avoid war? Listen NowThe World Tonight

4. The Briefing Room. Are India and Pakistan on the brink of war over Kashmir? Listen NowThe Briefing RoomAre India and Pakistan on the brink of war over Kashmir?
   Link: https://www.bbc.co.uk/sounds/play/m002bj77
   Snippet: The Briefing Room. Are India and Pakistan on the brink of war over Kashmir? Listen NowThe Briefing R
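
The search cell builds its URL from the query and prefixes relative links with the site root. The standard library offers `urllib.parse.urlencode` and `urllib.parse.urljoin` for both jobs; the sketch below shows them on the same query. The helper names `build_search_url` and `absolutize` are hypothetical, and the relative path in the example is taken from one of the links in the output above.

```python
from urllib.parse import urlencode, urljoin

SEARCH_ENDPOINT = "https://www.bbc.co.uk/search"


def build_search_url(query):
    """Return the search URL with the query safely URL-encoded."""
    # urlencode turns spaces into '+' and escapes characters such as '&'
    query_string = urlencode({"q": query})
    return f"{SEARCH_ENDPOINT}?{query_string}"


def absolutize(href, base="https://www.bbc.co.uk"):
    """Resolve a possibly relative href against the site root."""
    # Absolute URLs pass through unchanged
    return urljoin(base, href)


print(build_search_url("pakistan india war"))
print(absolutize("/news/articles/c2e373yzndro"))
```

Unlike a plain `startswith("/")` check, `urljoin` also handles hrefs that are already absolute or that are relative without a leading slash.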