# Ambreen Abdul Raheem
## Data Analyst (Upwork Freelancer)
## Webscraping (Data scraping with Python)
## Collaboration with "Codanics"
####Codanics: https://codanics.com/
#### Email: ambreen.upwork.27@gmail.com
#### Linkedin: https://www.linkedin.com/in/ambreen-abdul-raheem-122509300/
#### GitHUb: https://github.com/ambreenraheem
##Date: 15-05-2025

This notebook demonstrates how to scrape news headlines and their descriptions from the BBC News homepage using Python, requests, and BeautifulSoup.
###### Explanation of the code:
- import request and from bs4 import BeautifulSoup: import libraries for HTTP requests and HTML parsing.
- news_url= "'https://www.bbc.com/news": Set URL for BBC News homepage.
- request.get(news_url).content: Fetch the HTML content of the page.
- BeautifulSoup(..., "html.parser"): Parse the HTML content.
- Try to find all 'h3' tags with the class gs-c-promo-heading__title(common for headlines).
- If not found,try anchor tags with class gc-c-promo-heading.
- If still not found, fallback to all anchor tags with /news/ in their href and non-empty text.
- For each headline foud, print the headline and its description (if available).
###### The code is robust to changes in the BBC News HTML structure by trying multiple selectors.


#### Imprt some Important Libraries

In [None]:
!pip install beautifulsoup4
from IPython import get_ipython
from IPython.display import display
import pandas as pd
import numpy as np
import requests
# The correct import is from the bs4 package
from bs4 import BeautifulSoup




### Get Heading for Headlines from BBC News

In [None]:
news_url = 'https://www.bbc.com/news'
response=requests.get(news_url)
news_soup=BeautifulSoup(response.content,'html.parser')
news_item=news_soup.find_all('div',class_='news-item')

In [None]:
headlines= news_soup.find_all('h3',class_="gc-c-promo-heading__title")

In [None]:
if not headlines:
  promo_anchors = news_soup.select('a.gc-c-promo-heading')
  headlines=[a for a in promo_anchors if a.text.strip()]

In [None]:
if not headlines:
  headlines =[
      a for a in news_soup.find_all("a", href=True)
      if "/news/" in a["href"] and a.text.strip()
  ]

In [None]:
if not headlines:
  print("No headlines found")
else:
  for idx, headlines in enumerate(headlines, start=1):
    headlines_text=headlines.text.strip()
    print(f"{idx}. {headlines_text}")

1. Israel-Gaza War
2. War in Ukraine
3. US & Canada
4. UK
5. Africa
6. Asia
7. Australia
8. Europe
9. Latin America
10. Middle East
11. In Pictures
12. BBC InDepth
13. BBC Verify
14. Israel-Gaza War
15. War in Ukraine
16. US & Canada
17. UK
18. UK Politics
19. England
20. N. Ireland
21. N. Ireland Politics
22. Scotland
23. Scotland Politics
24. Wales
25. Wales Politics
26. Africa
27. Asia
28. China
29. India
30. Australia
31. Europe
32. Latin America
33. Middle East
34. In Pictures
35. BBC InDepth
36. BBC Verify
37. New round of ceasefire talks after Israel launches major offensiveHamas says its negotiators have opened a new round of talks aimed at ending the war, and all issues are on the table.3 hrs agoMiddle East
38. Trump says he will call Putin to discuss stopping Ukraine 'bloodbath'The US president said the pair would speak on Monday, and he would later talk to Ukraine's President Zelensky.2 hrs agoWorld
39. Was Diddy a 'mastermind'? How ex Cassie's testimony builds the sex traff

# Download BBC news headlines with links and snippet

In [None]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime

news_url= "https://www.bbc.com/news"

response= requests.get(news_url)

news_soup= BeautifulSoup(response.content,'html.parser')

headlies= news_soup.find_all("h3", class_="gs-c-promo-heading__title")
if not headlines:
  promo_anchors= news_soup.select("a.gs-c-promo-heading")
  headlines= [a for a in promo_anchors if a.text.strip()]

if not headlines:
  headlines= [
      a for a in news_soup.find_all("a", href=True)
      if "/news/" in a["href"] and a.text.strip()
  ]

if not headlines:
  print("No headlines found")
else:
  for idx, headline in enumerate(headlines, start=1):
    headline_text= headline.text.strip()
    link=None
    if headline.name=="a" and headline.has_attr("href"):
      link= headline["href"]
    else:
      parent_a= headline.find_parent("a")
      if parent_a:
        link= parent_a["href"]
    if link and link.startswith("/"):
      link= "https://www.bbc.com" + link
    print(f"{idx}. {headline_text}")
    if link:
      print(f"   Link: {link}")
      try:
        article_response= requests.get(link)
        article_soup= BeautifulSoup(article_response.content,'html.parser')
        article_tag= article_soup.find("article")
        if not article_tag:
          article_tag=article_soup.find(attrs={"role":"main"})
        if article_tag:
            paragraphs=article_tag.find_all("p")
        else:
            paragraphs=article_soup.find_all("p")
        article_text=" ".join([p.get_text(strip=True) for p in paragraphs])
        snippet=article_text[:400]+("..." if len(article_text)>400 else "")
        date_str=""
        time_tag= article_soup.find("time")
        if not time_tag:
          meta_time= article_soup.find("meta",attrs={"property":"article:Published_time"})
          if meta_time and meta_time.has_attr("content"):
            date_str= meta_time["content"]
        if not date_str and time_tag and time_tag.has_attr("datetime"):
          date_str= time_tag["datetime"]
        elif not date_str and time_tag:
          date_str=time_tag.get_text(strip=True)
        if date_str:
          try:
            dt=datetime.fromisoformat(date_str.replace("z","+00:00"))
            date_str= dt.strftime("%Y-%m-%d %H:%M:%S %Z")
          except Exception:
            pass
        print(f"   Date: {date_str if date_str else '(No Date Found)'}")
        print(f"   News: {snippet}")
      except Exception as e:
        print(f"   Date: (No Date Found)")
        print(f"   News: (Could not fetch article: {e})")
    else:
      print("   Date: (No Date Found)")
      print("   Link: (No Link Found)")
      print("   News: (No articles Found)")

1. 10Of opium, fire temples, and sarees: A peek into the world of India's dwindling Parsis
   Link: https://www.bbc.com/news/articles/cvgv445jqr1o
   Date: 2025-05-16 23:32:14 UTC
   News: Tucked away in a lane in the southern end of India's financial capital, Mumbai, is a  museum dedicated to the followers of one of the world's oldest religions, Zoroastrianism. The Framji Dadabhoy Alpaiwalla Museum documents the history and legacy of the ancient Parsi community - a small ethnic group that's fast dwindling and resides largely in India. Now estimated at just 50,000 to 60,000, the Par...


# Saving News in Markdown File:

In [31]:
import requests
from bs4 import BeautifulSoup
import re

In [None]:
# # Search query for BBC News as user Input
# query= input("Enter your search query: ")
# # URL for BBC search
# search_url= f"https://www.bbc.com/search?q={query.replace(' ', '+')}"

# response=requests.get(search_url)
# soup=BeautifulSoup(response.content,"html.parser")

# # Try to find all promo items (less dependent on class names)
# results=[]
# pattern=re.compile(r"\bPakistan\b",re.IGNORECASE)
# pattern2=re.compile(r"\bIndia\b", re.IGNORECASE)

# for item in soup.find_all(["article","li"]):
#   #try to get headlines and snippets
#   headline_tag= item.find("h1","h2","h3","span")
#   snippet_tag= item.find("p")
#   headline=headline_tag.get_text(strip=True) if headline_tag else ""
#   snippet=snippet_tag.get_text(strip=True) if snippet_tag else ""

#   #check if both "Pakistan" and "India" are present in either headline or snippet
#   if (pattern.search(headline) and pattern2.search(headline)) or \
#      (pattern.search(snippet) and pattern2.search(snippet)):
#       link_tag= item.find("a",href=True)
#       link= link_tag["href"] if link_tag else ""
#       if link and link.startswith("/"):
#         link= "https://www.bbc.com" + link
#       results.append({
#           "headline":headline,
#           "snippet":snippet,
#           "link":link})

#   if not results:
#     print("No result Found for 'Pakistan India war' on BBC.")
#   else:
#     for idx,result in enumerate(results,1):
#       print(f"{idx}. {result['headline']}")
#       print(f"   Link: {result['link']}")
#       print(f"   Snippet: {result['snippet']}\n")