# Business Recorder Webscraping Project

Author: Muhammad Fouzan Akhter

The code below scrapes news articles from Business Recorder. Any web scraping project should respect the policies of the website it targets. This project collects only publicly accessible data from Business Recorder while adhering to the platform's privacy standards.
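One way to check a site's crawling rules before scraping is to read its `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` and is only an illustration: the helper name `is_allowed`, the `SAMPLE_RULES` text, and the `example.com` URLs are made up and are not Business Recorder's actual rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only; NOT the real rules of any site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "https://www.example.com/business-finance"))
print(is_allowed(SAMPLE_RULES, "https://www.example.com/private/page"))
```

Against a live site, the same parser can load the rules itself with `set_url(...)` followed by `read()`, assuming the site publishes a `robots.txt` file at its root.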

In [None]:
#installing required packages:
!pip install selenium
!pip install beautifulsoup4
!pip install requests
!pip install pandas
!pip install tqdm

In [None]:
#importing required libraries:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from tqdm import tqdm

**This project was written in the Jupyter Notebook environment**

The code below launches a Chrome web driver and opens the designated Business Recorder page. It scrolls down automatically until the page height stops growing, then collects the article links. For every article, it extracts the heading, the datetime, the content type, the content, and the link. All of the extracted data is stored in a pandas data frame. Links whose data could not be retrieved because of HTTP error 403 are collected separately and, if you choose, displayed at the end of the run.

In [None]:
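url = 'https://www.brecorder.com/business-finance'
#launching Chrome and opening the listing page:
driver = webdriver.Chrome()
driver.get(url)
#scrolling to the bottom until the page height stops growing:
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    #increase the sleep time if the internet connection is slow:
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
page_source = driver.page_source
driver.quit()
#collecting article links from the fully loaded page (anchors without an href are ignored):
soup = BeautifulSoup(page_source, 'html.parser')
links = [a['href'] for a in soup.find_all('a', class_='story__link', href=True)]
#removing duplicate links while keeping their order on the page:
unique_links = list(dict.fromkeys(links))
#one single-row data frame per article, plus the links rejected with error 403:
df_list = []
skipped_links_403 = []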
for link in tqdm(unique_links, desc="Scraping Articles"):
    #a timeout keeps one unresponsive page from stalling the whole run:
    response = requests.get(link, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        heading_element = soup.find('a', class_='story__link')
        heading = heading_element.text.strip() if heading_element else None
        content_type_element = soup.find('span', class_='badge')
        content_type = content_type_element.text.strip() if content_type_element else None
        datetime_element = soup.find('span', class_='timestamp--date')
        datetime = datetime_element.text.strip() if datetime_element else None
        content_element = soup.find('div', class_='story__content')
        content_paragraphs = content_element.find_all('p') if content_element else []
        content = '\n'.join(paragraph.text.strip() for paragraph in content_paragraphs)
        df_list.append(pd.DataFrame({
            'Heading': [heading],
            'Datetime': [datetime],
            'Content Type': [content_type],
            'Content': [content],
            'Link': [link]
        }))
    elif response.status_code == 403:
        skipped_links_403.append(link)
    else:
        print(f"Error: Unable to fetch content from {link}. Status code {response.status_code}")
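#combining all rows; pd.concat raises an error on an empty list, so fall back to an empty data frame:
df = pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()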
if skipped_links_403:
    choice = input(f"\n{len(skipped_links_403)} links were skipped due to error 403. Do you want to display them? (y/n): ")
    if choice.lower() == 'y':
        print("\nSkipped Links:")
        print(skipped_links_403)
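
The fetching and status-code handling in the loop above can also be separated from the parsing. The sketch below is one possible shape and not the notebook's own method: `fetch_pages`, `HEADERS`, and the `delay` and `timeout` defaults are hypothetical names and values. Sending a descriptive User-Agent is only an assumption about what might reduce 403 responses, and it may make no difference on this site.

```python
import time

# Hypothetical header value; whether the site treats it differently is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0 (research scraper; contact: you@example.com)"}


def fetch_pages(links, get, delay=1.0, timeout=30):
    """Fetch each link with `get` and return (pages, skipped_403, other_errors)."""
    pages, skipped_403, other_errors = {}, [], {}
    for i, link in enumerate(links):
        if i:
            # Pause between requests to keep the load on the site low.
            time.sleep(delay)
        try:
            response = get(link, headers=HEADERS, timeout=timeout)
        except Exception as exc:
            # Broad on purpose in this sketch; with requests you could
            # narrow this to requests.RequestException.
            other_errors[link] = repr(exc)
            continue
        if response.status_code == 200:
            pages[link] = response.content
        elif response.status_code == 403:
            skipped_403.append(link)
        else:
            other_errors[link] = response.status_code
    return pages, skipped_403, other_errors
```

In the notebook this could be called as `pages, skipped_links_403, errors = fetch_pages(unique_links, requests.get)`, after which each value in `pages` can be passed to `BeautifulSoup` as before. Passing `get` as a parameter also lets the function be tried with a stand-in callable, without touching the network.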

**Viewing Scraped Data in Python Environment**

In [None]:
#displaying scraped data in python environment:
df
