# CNBC Webscraping Project

Author: Muhammad Fouzan Akhter

The code below implements a web scraping project that targets CNBC. Any scraping project should respect the policies of the website it collects from. This project scrapes only publicly accessible data from CNBC and aims to adhere to the platform's privacy standards.

In [None]:
#installing relevant packages:
!pip install selenium
!pip install beautifulsoup4
!pip install requests
!pip install pandas
!pip install tqdm

In [3]:
#importing relevant libraries:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
from tqdm import tqdm

**This project is written in a Jupyter Notebook environment.**

If a 'Max Try Exceeded: Reconnection Error' occurs, flush the DNS cache by entering the following command in your terminal or command prompt (the command shown is for Windows).

In [None]:
#run the following command on terminal to flush DNS if Max Try Exceeded: Reconnection Error occurs:
#ipconfig /flushdns
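
A DNS flush is a manual fix. As a complementary and hedged sketch, transient connection failures can also be retried from inside the notebook before giving up. The helper name `call_with_retries` and its parameters are assumptions made for illustration and are not part of the original project.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or `attempts` tries are used up.

    Hypothetical helper: waits `delay` seconds after the first failure and
    doubles the wait after each further failure (simple exponential backoff).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)
                delay *= 2
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

In the scraping loop this could wrap the request, for example `call_with_retries(lambda: requests.get(link, timeout=10), exceptions=(requests.exceptions.RequestException,))`, so that only network errors trigger a retry.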

The following loops retrieve each article's title, author, content type, date, content, and link from the page HTML. Make sure the internet connection is active before running the code. A tqdm progress bar shows how many articles have been scraped so far, and a pandas DataFrame holds all of the extracted data. The `max_links_to_collect` variable sets how many articles to scrape.

In [None]:
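# Listing page to scrape and the number of article links to gather.
url = 'https://www.cnbc.com/finance/'
max_links_to_collect = 40
# Open the listing page in a Selenium-controlled Chrome browser.
driver = webdriver.Chrome()
driver.get(url)
collected_links = set()  # a set drops duplicate links automatically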
while len(collected_links) < max_links_to_collect:
    try:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        article_links = [a['href'] for a in soup.find_all('a', class_='Card-title')]

        for link in article_links:
            collected_links.add(link)
            if len(collected_links) >= max_links_to_collect:
                break
        load_more_button = driver.find_element(By.CLASS_NAME, 'LoadMoreButton-loadMore')
        load_more_button.click()
        time.sleep(2)
    except Exception as e:
        print(f"Error: {e}")
        break
driver.quit()
dfs = []
with tqdm(total=len(collected_links), desc="Scraping Articles") as pbar_articles:
    for link in collected_links:
        response = requests.get(link)
        if response.status_code == 200:
            try:
                soup = BeautifulSoup(response.content, 'html.parser')
                article_name = soup.find('h1', class_='ArticleHeader-headline')
                article_name = article_name.text.strip() if article_name else None
                author = soup.find('a', class_='Author-authorName')
                author = author.text.strip() if author else None
                content_type = soup.find('a', class_='ArticleHeader-eyebrow')
                content_type = content_type.text.strip() if content_type else None
                date_time = soup.find('time', {'itemprop': 'datePublished'})
                date_time = date_time.text.strip() if date_time else None
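                content_elements = soup.find('div', class_='ArticleBody-articleBody')
                # Guard against pages without this container, which would otherwise raise AttributeError.
                content = ' '.join(element.text.strip() for element in content_elements.find_all(['p', 'h2'])) if content_elements else None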
                details = {
                    'Article Name': article_name,
                    'Author': author,
                    'Content Type': content_type,
                    'Date Time': date_time,
                    'Content': content,
                    'Link': link,
                }
                dfs.append(pd.DataFrame([details]))
                pbar_articles.update(1)
            except Exception as e:
                # Skip this article but keep scraping the remaining links.
                print(f"Error scraping article '{link}': {e}")
                continue
        else:
            print(f"Error: Unable to fetch content. Status code {response.status_code}")
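# pd.concat raises ValueError on an empty list, so fall back to an empty frame.
df = pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()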
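
The field extraction above repeats the same find-then-strip pattern for every field. As an illustrative sketch only, that pattern can be folded into a small helper and tried on an inline HTML string with no network access. The helper names `text_or_none` and `parse_article` and the sample HTML are assumptions for demonstration. The class names are the ones used in the code above and may stop matching whenever CNBC changes its markup.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name, **kwargs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, **kwargs)
    return tag.text.strip() if tag else None


def parse_article(html, link):
    """Extract the same six fields as the scraper from one article page."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', class_='ArticleBody-articleBody')
    # A missing body yields None instead of raising AttributeError.
    content = (
        ' '.join(el.text.strip() for el in body.find_all(['p', 'h2']))
        if body else None
    )
    return {
        'Article Name': text_or_none(soup, 'h1', class_='ArticleHeader-headline'),
        'Author': text_or_none(soup, 'a', class_='Author-authorName'),
        'Content Type': text_or_none(soup, 'a', class_='ArticleHeader-eyebrow'),
        'Date Time': text_or_none(soup, 'time', attrs={'itemprop': 'datePublished'}),
        'Content': content,
        'Link': link,
    }


# Made-up sample page (placeholder text, not real CNBC content).
sample_html = """
<h1 class="ArticleHeader-headline"> Sample headline </h1>
<a class="Author-authorName">Jane Doe</a>
<time itemprop="datePublished">Placeholder date</time>
<div class="ArticleBody-articleBody"><h2>Key point</h2><p>First paragraph.</p></div>
"""

record = parse_article(sample_html, 'https://example.com/sample-article')
print(record)
```

Because the sample page has no `ArticleHeader-eyebrow` link, `record['Content Type']` is `None`, which mirrors how the scraper records fields it cannot find.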

**Viewing Scraped Data in Python Environment**

In [None]:
#displaying scraped data in python environment:
df
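
Beyond displaying the frame, a few quick checks can confirm that the scrape produced what was expected. The sketch below builds a DataFrame directly from a list of dictionaries, one per article, which is an alternative to concatenating many one-row frames, and then counts missing values per column. The rows are placeholders made up for illustration and are not scraped results; `sample_df` is a hypothetical name chosen so that the real `df` is left untouched.

```python
import pandas as pd

# Placeholder records shaped like the scraper's output (not real data).
records = [
    {'Article Name': 'Sample A', 'Author': 'Jane Doe', 'Content Type': None,
     'Date Time': 'Placeholder date', 'Content': 'Text A',
     'Link': 'https://example.com/a'},
    {'Article Name': 'Sample B', 'Author': None, 'Content Type': None,
     'Date Time': 'Placeholder date', 'Content': 'Text B',
     'Link': 'https://example.com/b'},
]

# One DataFrame built directly from the list of dicts.
sample_df = pd.DataFrame(records)

# Number of rows and columns collected.
print(sample_df.shape)

# Missing values per column show which fields the selectors failed to find.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Drop duplicate articles, using the link as the unique key.
deduplicated = sample_df.drop_duplicates(subset='Link')
```

The same calls can be run on `df` from the scraping cell.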
