# BBC News Webscraping Project

Author: Muhammad Fouzan Akhter

The code for a web scraping project that targets BBC News is shown below. Any web scraping project should follow the policies of the website it targets. This project scrapes only publicly accessible data from BBC News and adheres to the platform's privacy standards.

In [None]:
#installing required packages:
!pip install beautifulsoup4
!pip install selenium
!pip install requests
!pip install tqdm
!pip install pandas

In [None]:
#importing required libraries:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
from datetime import datetime
from tqdm import tqdm
import pandas as pd

**This Project is coded in the Jupyter Notebook Environment**

The following loops extract article data from the BBC News Business section (`base_url`). Ensure an active internet connection before executing the code. Set `max_pages` to the maximum number of index pages to visit. For every article linked from those pages, the code extracts the Heading, Date, Author, Content, and Link, and stores the results in a pandas DataFrame. A tqdm progress bar shows how far the article scraping has progressed.

In [None]:
# Step 1: collect article links by paging through the index with Selenium.
driver = webdriver.Chrome()
base_url = 'https://www.bbc.com/news/business'
driver.get(base_url)
max_pages = 50  # maximum number of index pages to visit
collected_links = set()  # a set keeps the links unique
next_page_selector = "a.lx-pagination__btn[rel='next']"
current_page = 1
while current_page <= max_pages:
    try:
        # Wait until the "next" button is clickable before reading the page.
        next_button = WebDriverWait(driver, 30).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_selector))
        )
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        links = soup.find_all('a', class_='qa-story-cta-link')
        for link in links:
            href = link.get('href')
            if not href:
                continue  # skip anchors without a target
            # Store full URLs; relative paths are resolved against the BBC domain.
            full_link = href if href.startswith('http') else f'https://www.bbc.com{href}'
            collected_links.add(full_link)
        next_button.click()
        time.sleep(1)  # brief pause so the next page can start loading
        current_page += 1
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        break
driver.quit()
num_links_collected = len(collected_links)
print(f"Number of Unique Links Collected: {num_links_collected}")

# Step 2: download each article with requests and extract its fields.
article_data = []
for article_url in tqdm(collected_links, desc="Scraping Articles"):
    try:
        # A timeout keeps one unresponsive request from stalling the whole run.
        response = requests.get(article_url, timeout=30)
    except requests.RequestException as e:
        print(f"Failed to retrieve the article at URL: {article_url} ({e})")
        continue
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Match the heading by id only; the "ssrcss-..." class names look auto-generated and may change.
        heading = soup.find('h1', id='main-heading')
        heading_text = heading.text if heading else 'Heading not found'
        time_element = soup.find('time', {'data-testid': 'timestamp'})
        if time_element:
            datetime_value = time_element.get('datetime')
            try:
                dt_obj = datetime.strptime(datetime_value, '%Y-%m-%dT%H:%M:%S.%fZ')
                formatted_date = dt_obj.strftime('%b %d, %Y')
            except (TypeError, ValueError):
                # Missing attribute or a timestamp in an unexpected format
                formatted_date = 'Date not found'
        else:
            formatted_date = 'Date not found'
        author_element = soup.find('div', class_='ssrcss-68pt20-Text-TextContributorName e8mq1e96')
        author_name = author_element.text.strip() if author_element else 'Author not found'
        # The article body is split across several text-block divs; join them with newlines.
        text_blocks = soup.find_all('div', {'data-component': 'text-block'})
        content = "\n".join([text_block.get_text() for text_block in text_blocks])
        article_data.append({
            "Heading": heading_text,
            "Date": formatted_date,
            "Author": author_name,
            "Content": content,
            "Link": article_url,
        })
    else:
        print(f"Failed to retrieve the article at URL: {article_url}")
df = pd.DataFrame(article_data)
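
The date handling in the loop above expects the `datetime` attribute to look like `YYYY-MM-DDTHH:MM:SS.fffZ`. As a minimal sketch, the conversion can be isolated in a small helper and tried without any network access. The helper name `format_timestamp`, the second format without fractional seconds, and the sample timestamps are assumptions added for illustration, not part of the original project.

```python
from datetime import datetime

def format_timestamp(datetime_value):
    """Return a date such as 'Oct 05, 2023', or 'Date not found' if parsing fails."""
    # Formats tried in order. The second one (no fractional seconds) is an assumption.
    for fmt in ('%Y-%m-%dT%H:%M:%S.%fZ', '%Y-%m-%dT%H:%M:%SZ'):
        try:
            return datetime.strptime(datetime_value, fmt).strftime('%b %d, %Y')
        except (TypeError, ValueError):
            # TypeError: value is None; ValueError: string does not match this format
            continue
    return 'Date not found'

# Made-up timestamps, used only to exercise the helper.
print(format_timestamp('2023-10-05T14:30:00.000Z'))
print(format_timestamp('2023-10-05T14:30:00Z'))
print(format_timestamp(None))
```

Returning the same `'Date not found'` placeholder as the loop keeps the resulting column consistent whether the element is missing or its value cannot be parsed.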

**Viewing Scraped Data in Python Environment**

In [None]:
#displaying scraped data in python environment:
df
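
Because the scraping loop writes placeholder strings such as `'Heading not found'` when a selector does not match, it can be useful to count them before relying on the data. The following is a rough sketch, assuming a DataFrame with the same column names. `sample_df` and `count_placeholders` are hypothetical names, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the scraped DataFrame, using the same column names.
sample_df = pd.DataFrame([
    {"Heading": "Sample headline", "Date": "Oct 05, 2023", "Author": "Author not found"},
    {"Heading": "Heading not found", "Date": "Date not found", "Author": "Author not found"},
])

def count_placeholders(frame):
    """Count, per column, how many cells end with the 'not found' placeholder text."""
    counts = {}
    for column in frame.columns:
        is_placeholder = frame[column].astype(str).str.endswith('not found')
        counts[column] = int(is_placeholder.sum())
    return counts

print(count_placeholders(sample_df))
```

A high count in one column usually suggests that the corresponding selector no longer matches the page markup.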

**Exporting Scraped Data as a CSV File**

In [None]:
#exporting scraped data as CSV:
df.to_csv('Path on PC/bbc_data.csv', index=False)
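
`'Path on PC/bbc_data.csv'` is a placeholder to be replaced with a real location on your machine. One possible way to build that path is sketched below, assuming the target folder may not exist yet. The helper `export_csv`, the stand-in `demo_df`, and the temporary directory are hypothetical and only serve the demonstration.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_csv(frame, folder, filename='bbc_data.csv'):
    """Write frame to folder/filename, creating the folder if needed, and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    frame.to_csv(out_path, index=False, encoding='utf-8')
    return out_path

# Demonstration with a throwaway directory and a one-row stand-in DataFrame.
demo_df = pd.DataFrame([
    {"Heading": "Sample headline", "Link": "https://www.bbc.com/news/example"},
])
with tempfile.TemporaryDirectory() as tmp:
    saved = export_csv(demo_df, Path(tmp) / 'output')
    reloaded = pd.read_csv(saved)  # read the file back to confirm the export worked
    print(reloaded.shape)
```

Reading the file back with `pd.read_csv` is a quick check that the export produced the expected rows and columns.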
