# Yahoo Finance Webscraping Project

Author: Muhammad Fouzan Akhter

This notebook contains the code for a web scraping project that targets Yahoo Finance. Any scraping project should respect the policies of the site it targets. This project collects only publicly accessible data from Yahoo Finance and adheres to the platform's privacy standards.

In [None]:
#installing required packages:
!pip install requests
!pip install beautifulsoup4
!pip install pandas
!pip install tqdm

In [None]:
#importing required libraries:
from google.colab import files
import requests
from bs4 import BeautifulSoup
import re
import json
import pandas as pd
from tqdm import tqdm

**This Project is coded in the Google Colab Environment**

Open the Yahoo Finance page and hold the space key to scroll down. Keep holding it until the page stops scrolling. Once you have reached the bottom, save the webpage as an HTML file, then upload that file using the cell below.

In [None]:
uploaded_files = files.upload()
html_filename = next(iter(uploaded_files), None)  # None if nothing was uploaded
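
`files.upload()` is only available inside Google Colab. Outside Colab, the saved page can be read from disk instead. The following is a minimal sketch under that assumption; the function name `load_saved_page` and the file name `yahoo_finance.html` are hypothetical and not part of the original project.

```python
from pathlib import Path


def load_saved_page(path):
    """Return the saved page's HTML as text, or None if the file is missing."""
    page = Path(path)
    if not page.is_file():
        return None
    # errors="replace" keeps going if the saved file has a few undecodable bytes
    return page.read_text(encoding="utf-8", errors="replace")


# Hypothetical file name; use whatever name the browser saved the page under.
html_content = load_saved_page("yahoo_finance.html")
if html_content is None:
    print("Saved page not found; check the path.")
```

The returned string can be passed to `BeautifulSoup` in the same way as the decoded upload.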

The code below first collects the article links from the uploaded HTML file, then requests each article page and extracts its title, author, datetime, read time, content, tags, and link. Because every article is fetched live, an active internet connection is required before running the cell. A tqdm progress bar shows how far the run has progressed. All extracted data is stored in a pandas DataFrame.

In [None]:
if html_filename:
    html_content = uploaded_files[html_filename].decode('utf-8')
    soup = BeautifulSoup(html_content, 'html.parser')
    linkslist = set()

    for link in tqdm(soup.find_all("a", class_="js-content-viewer"), desc="Collecting Links"):
        href = link.get("href")
        if href:
            linkslist.add(href)

    print("Extraction of Links Completed")
    print(f"{len(linkslist)} Unique Links Extracted")
    print("********************")
    data = []

    for link in tqdm(linkslist, desc="Article Scraper Running"):
        # A timeout keeps one unresponsive page from stalling the whole run
        response = requests.get(link, timeout=30)
        if response.status_code == 200:
            article_soup = BeautifulSoup(response.text, 'html.parser')
            title_element = article_soup.find('h1', attrs={'data-test-locator': 'headline'})
            extracted_title = title_element.text.strip() if title_element else "Title not found"
            author_element = article_soup.find('span', class_='caas-author-byline-collapse')
            author_name_element = author_element.find('a', class_='link') if author_element else None
            author_name = author_name_element.text.strip() if author_name_element else None
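            # Guard against a missing byline: author_element is None when the span is absent
            if author_name is None and author_element is not None: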
                author_name_element = author_element.find_all('span')
                if author_name_element:
                    author_name = author_name_element[0].text.strip()
            if author_name is None:
                author_text = str(author_element)
                author_match = re.search(r'<span class="caas-author-byline-collapse">(.*?)<\/span>', author_text)
                if author_match:
                    author_name = author_match.group(1).strip()
                else:
                    author_name = "Author not found"
            datetime_element = article_soup.find('time', class_='caas-attr-meta-time')
            datetime_value = datetime_element.get_text(strip=True) if datetime_element else "Date and Time not found"
            readtime_element = article_soup.find('span', class_='caas-attr-mins-read')
            readtime = readtime_element.text.strip() if readtime_element else "Read time not found"
            content_elements = article_soup.find_all('div', class_='caas-body')
            content = '\n'.join([p.get_text() for p in content_elements])
            json_ld_script = article_soup.find('script', {'type': 'application/ld+json'})
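            tags = []
            if json_ld_script and json_ld_script.string:
                try:
                    json_data = json.loads(json_ld_script.string)
                except json.JSONDecodeError:
                    json_data = None
                # Only a JSON object (dict) supports .get; skip lists or malformed payloads
                if isinstance(json_data, dict):
                    tags = json_data.get('keywords', [])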
            data.append({
                'Title': extracted_title,
                'Author': author_name,
                'Datetime': datetime_value,
                'ReadTime': readtime,
                'Content': content,
                'Tags': tags,
                'Link': link
            })

    df = pd.DataFrame(data)
else:
    print("No HTML file was uploaded. Please rerun the upload cell.")
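
The tag extraction step reads the `keywords` field from the article's JSON-LD script. The sketch below isolates that step in a standalone helper so it can be tried without fetching a page. The helper name `extract_keywords` is hypothetical, and the handling of list payloads and comma-separated keyword strings is an assumption about what such payloads may look like, not something confirmed for Yahoo Finance pages.

```python
import json


def extract_keywords(raw_json_ld):
    """Return a list of keyword strings from a JSON-LD payload, or [] on failure."""
    if not raw_json_ld:
        return []
    try:
        payload = json.loads(raw_json_ld)
    except json.JSONDecodeError:
        return []
    # Assumption: the payload may be a list of objects; use the first dict found.
    if isinstance(payload, list):
        payload = next((item for item in payload if isinstance(item, dict)), {})
    if not isinstance(payload, dict):
        return []
    keywords = payload.get("keywords", [])
    # Assumption: keywords may arrive as one comma-separated string.
    if isinstance(keywords, str):
        keywords = keywords.split(",")
    elif not isinstance(keywords, list):
        return []
    return [str(k).strip() for k in keywords if str(k).strip()]


print(extract_keywords('{"keywords": ["stocks", "earnings"]}'))  # ['stocks', 'earnings']
print(extract_keywords('{"keywords": "stocks, earnings"}'))      # ['stocks', 'earnings']
print(extract_keywords("not json"))                              # []
```

Returning an empty list on any failure keeps the `Tags` column consistent, so one malformed article does not stop the run.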

**Viewing Scraped Data in Python Environment**

In [None]:
#displaying scraped data in python environment:
df

**Downloading Scraped Data as a CSV File**

In [None]:
#downloading scraped data as CSV:
df.to_csv('my_data.csv', index=False)
files.download('my_data.csv')
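
One thing to watch in the export: the `Tags` column holds Python lists, and `to_csv` writes each list as its string representation, which reads back as plain text rather than a list. A possible workaround, sketched here on a made-up one-row DataFrame, is to join the tags into a delimited string before saving. The `"; "` separator and the sample row are assumptions for illustration only.

```python
import io

import pandas as pd

# Small stand-in for the scraped DataFrame; the row is made up for illustration.
sample = pd.DataFrame({
    "Title": ["Example headline"],
    "Tags": [["stocks", "earnings"]],
})

# Join each tag list into one delimited string so it survives the CSV round trip.
export = sample.copy()
export["Tags"] = export["Tags"].apply(
    lambda tags: "; ".join(tags) if isinstance(tags, list) else tags
)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
export.to_csv(buffer, index=False)

# Reading back and splitting on the same separator restores the list.
restored = pd.read_csv(io.StringIO(buffer.getvalue()))
restored["Tags"] = restored["Tags"].str.split("; ")
print(restored["Tags"][0])  # ['stocks', 'earnings']
```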
