# Web Scraping Project

In the age of information, the ability to gather and analyze real-time data is critical for making informed decisions in the financial markets.  
This project focuses on web scraping stock information from Yahoo Finance for a curated list of major companies, including Apple, Microsoft, Netflix, Google, Nike, Adidas, and Nvidia.  
By automating the data extraction process, we aim to collect key figures such as the current stock price and its absolute and percentage change.  
The scraped data is then stored in a JSON file, providing a structured and accessible format for future analysis and continuous data monitoring.  
This project not only demonstrates the power of web scraping for financial analysis but also lays the groundwork for building a comprehensive dataset that can be used for various applications, such as predictive modeling, trend analysis, and investment strategy development.

In [1]:
# LIBRARY

import pandas as pd
import numpy as np
import json
import os
import time
import datetime
import requests 
import re
from bs4 import BeautifulSoup 
from config import user_agent

# Send a browser-like User-Agent header so the website does not reject the request as automated traffic
headers = user_agent
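
The `config` module is not shown in this notebook. Below is a minimal sketch of what it might contain, assuming `user_agent` is a dictionary of HTTP headers, which is the form the `headers` argument of `requests.get` expects. The User-Agent string is a placeholder, not the value actually used here.

```python
# config.py (hypothetical contents; the real file is not part of this notebook)
# requests expects `headers` to be a mapping of header names to values.
user_agent = {
    "User-Agent": "Mozilla/5.0 (compatible; my-stock-scraper/0.1)"  # placeholder value
}
```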

In [2]:
# Add the link of the website which will be parsed
url = "https://finance.yahoo.com/quote/ADS.DE/"
# Retrieve the information of the website
response = requests.get(url, headers = headers)
# Check if the retrieval was successful (a status_code of 200 means everything is fine)
print(response.status_code)

# Parse the HTML of the page
soup = BeautifulSoup(response.text, "html.parser")
# Print the page title, which contains the company name
print(soup.title.text)

# Find the values of interest in the parsed page and assign each to a variable
price = soup.find("fin-streamer", {"data-field" : "regularMarketPrice"}).text
price_change_total = soup.find("fin-streamer", {"data-field" : "regularMarketChange"}).text
price_change_perc = soup.find("fin-streamer", {"data-field" : "regularMarketChangePercent"}).text

# Print the information
print(price, price_change_total, price_change_perc, sep=", ")

200
adidas AG (ADS.DE) Stock Price, News, Quote & History - Yahoo Finance
217.10, +4.30, (+2.02%)


This was a test for a single company. The same steps can be automated for several companies by creating a dictionary of all the companies whose stock data we want to retrieve.

In [3]:
# Define all of the companies we want to fetch data from

url_dict = {"apple" : "https://finance.yahoo.com/quote/AAPL/",
             "microsoft" : "https://finance.yahoo.com/quote/MSFT/",
             "netflix" : "https://finance.yahoo.com/quote/NFLX/",
             "google" : "https://finance.yahoo.com/quote/GOOG/",
             "nike" : "https://finance.yahoo.com/quote/NKE/",
             "adidas" : "https://finance.yahoo.com/quote/ADS.DE/",
             "nvidia" : "https://finance.yahoo.com/quote/NVDA"}

In [4]:
# Find the stock's current price and its absolute and percentage change compared to the previous day

def get_stock_info(url):
    # Fetch text from website
    r = requests.get(url, headers = headers)

    # Check if request was successful
    if r.status_code == 200:
        # Parse the website
        soup = BeautifulSoup(r.text, "html.parser")

        # Locate the elements holding the company name, the price, the absolute change and the percentage change
        company_name = soup.find("h1", {"class" : "yf-3a2v0c"}).text
        price = soup.find("fin-streamer", {"data-field" : "regularMarketPrice"}).text
        total_price_change = soup.find("fin-streamer", {"data-field" : "regularMarketChange"}).text
        perc_price_change = soup.find("fin-streamer", {"data-field" : "regularMarketChangePercent"}).text.strip("()")

        # Return all values of interest
        return {"Company" : company_name,
                "Price" : price, 
                "Total_Change" : total_price_change,
                "Percentage_Change" : perc_price_change}

    else:
        print(f"Failed to retrieve data from {url}")
        return None

# Iterating over the dictionary which contains the companies we want to fetch the stock data from
for company, url in url_dict.items():
    
    print(f"\nFetching data for {company.capitalize()}...")
    stock_info = get_stock_info(url)
    
    # stock_info is None when the request failed, so only print the values if data came back
    if stock_info:
        print(f"Company: {stock_info['Company']}")
        print(f"Price: {stock_info['Price']}")
        print(f"Total Change: {stock_info['Total_Change']}")
        print(f"Percentage Change: {stock_info['Percentage_Change']}")
    else:
        print(f"Failed to retrieve stock information for {company.capitalize()}.")

print("\nSuccessfully fetched data for each given company")


Fetching data for Apple...
Company: Apple Inc. (AAPL)
Price: 222.66
Total Change: +2.55
Percentage Change: +1.16%

Fetching data for Microsoft...
Company: Microsoft Corporation (MSFT)
Price: 423.04
Total Change: +8.84
Percentage Change: +2.13%

Fetching data for Netflix...
Company: Netflix, Inc. (NFLX)
Price: 681.47
Total Change: +7.85
Percentage Change: +1.17%

Fetching data for Google...
Company: Alphabet Inc. (GOOG)
Price: 152.15
Total Change: +2.14
Percentage Change: +1.43%

Fetching data for Nike...
Company: NIKE, Inc. (NKE)
Price: 78.40
Total Change: +0.31
Percentage Change: +0.40%

Fetching data for Adidas...
Company: adidas AG (ADS.DE)
Price: 217.10
Total Change: +4.30
Percentage Change: +2.02%

Fetching data for Nvidia...
Company: NVIDIA Corporation (NVDA)
Price: 116.91
Total Change: +8.81
Percentage Change: +8.15%

Successfully fetched data for each given company
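
Note that `soup.find(...)` returns `None` when no matching element exists, so calling `.text` on the result raises an `AttributeError` if the page markup changes. A generated-looking class name such as `yf-3a2v0c` may be especially brittle. Here is a minimal sketch of a defensive lookup, where `safe_text` is a hypothetical helper and the HTML snippet is a made-up stand-in for a real quote page:

```python
from bs4 import BeautifulSoup

def safe_text(soup, tag, attrs, default=None):
    """Return the stripped text of the first matching element, or `default`."""
    element = soup.find(tag, attrs)
    if element is None:
        return default
    return element.get_text(strip=True)

# Made-up snippet standing in for a downloaded quote page
html = '<fin-streamer data-field="regularMarketPrice">217.10</fin-streamer>'
soup = BeautifulSoup(html, "html.parser")

# The price element exists; the change element is deliberately missing
price = safe_text(soup, "fin-streamer", {"data-field": "regularMarketPrice"})
change = safe_text(soup, "fin-streamer", {"data-field": "regularMarketChange"})
print(price, change)
```

With a helper like this, a missing field shows up as `None` in the result instead of stopping the whole loop.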


Let's assume we want to retrieve the data for a portfolio a person owns, so that they automatically gain insight into the performance of their stocks.  
To do so, we can reuse the function implemented above to retrieve the data for the companies of interest, save the results to a JSON file, and later load them into a DataFrame for further examination.

In [5]:
# To make the function more general, we adjust get_stock_info so that the user only needs to pass the ticker symbol (e.g. AAPL for Apple)
# We also add a timestamp recording when the data was fetched
def stock_data(abbreviation):
    url = f"https://finance.yahoo.com/quote/{abbreviation}/"
    r = requests.get(url, headers = headers)

    # Check if request was successful
    if r.status_code == 200:
        # Parse the website
        soup = BeautifulSoup(r.text, "html.parser")

        # Create a timestamp
        time_stamp = datetime.datetime.today().strftime("%Y/%m/%d %H:%M:%S")


        # Locate the elements holding the company name, the price, the absolute change and the percentage change
        company_name = soup.find("h1", {"class" : "yf-3a2v0c"}).text
        price = soup.find("fin-streamer", {"data-field" : "regularMarketPrice"}).text
        total_price_change = soup.find("fin-streamer", {"data-field" : "regularMarketChange"}).text
        perc_price_change = soup.find("fin-streamer", {"data-field" : "regularMarketChangePercent"}).text.strip("()")

        # If a value is empty (e.g. because the page was refreshing), use placeholder values that can be filtered out during cleaning
        if not price or not total_price_change or not perc_price_change:
            price = "99999"
            total_price_change = "999"
            perc_price_change = "99.99"

        # Assign the values of interest to a dictionary
        return {"Time_Stamp" : time_stamp,
                "Company" : company_name,
                "Price" : price, 
                "Total_Change" : total_price_change,
                "Percentage_Change" : perc_price_change}
    else:
        print(f"Failed to retrieve data for {abbreviation}")
        return None

In [6]:
# Check if function works as intended
stock_data("NVDA")

{'Time_Stamp': '2024/09/12 12:54:35',
 'Company': 'NVIDIA Corporation (NVDA)',
 'Price': '116.91',
 'Total_Change': '+8.81',
 'Percentage_Change': '+8.15%'}
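
The values come back as strings (for example `'+8.15%'`). For numeric analysis they can be converted to floats. This is a minimal sketch, where `to_float` is a hypothetical helper and thousands separators are assumed to be commas:

```python
def to_float(text):
    """Convert a scraped string such as '+8.15%' or '1,234.50' to a float.

    Returns None when the text cannot be parsed.
    """
    cleaned = text.strip().strip("()").rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

# Values copied from the stock_data("NVDA") result above
record = {"Price": "116.91", "Total_Change": "+8.81", "Percentage_Change": "+8.15%"}
numeric = {key: to_float(value) for key, value in record.items()}
print(numeric)
```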

Now we can write the dictionary returned by `stock_data` to a JSON file (`portfolio_data.json`) to save the results.

In [7]:
# Dump the information to a json file to save them
def append_data_to_json(new_data, filename="portfolio_data.json"):
    data = []

    # Check whether the file exists and is not empty
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        with open(filename, "r") as file:
            try:
                data = json.load(file)
            except json.JSONDecodeError:
                print(f"Error decoding JSON from {filename}. The file might be corrupted or empty.")

    data.append(new_data)

    with open(filename, "w") as file:
        json.dump(data, file, indent=4)
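
`append_data_to_json` reads and rewrites the whole file on every call, which is fine for a small portfolio. If the file grows large, one alternative (not what this notebook uses) is the JSON Lines layout, where each record occupies one line and new records are appended without reading the old ones. The sketch below assumes that layout; `append_jsonl`, `read_jsonl` and the `.jsonl` file name are hypothetical names.

```python
import json
import os
import tempfile

def append_jsonl(record, filename):
    """Append one record as a single JSON line."""
    with open(filename, "a", encoding="utf-8") as file:
        file.write(json.dumps(record) + "\n")

def read_jsonl(filename):
    """Read all records back into a list of dictionaries."""
    with open(filename, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]

# Demo in a temporary directory so no real data file is touched
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "portfolio_data.jsonl")
    append_jsonl({"Company": "adidas AG (ADS.DE)", "Price": "217.10"}, path)
    append_jsonl({"Company": "NVIDIA Corporation (NVDA)", "Price": "116.91"}, path)
    records = read_jsonl(path)

print(len(records), records[-1]["Price"])
```

A file in this layout can also be loaded with `pd.read_json(path, lines=True)`.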

In [8]:
# First we create a list with the ticker symbols of the companies of interest
portfolio = ["AAPL", "MSFT", "NFLX", "GOOG", "NKE", "ADS.DE", "NVDA"]

# Then we fetch the data for each ticker and append it to the JSON file
for item in portfolio:
    stock_info = stock_data(item)
    if stock_info:
        append_data_to_json(stock_info)
        print(f"Appended data for: {item}")
    else:
        print(f"Failed to retrieve data for: {item}")

Appended data for: AAPL
Appended data for: MSFT
Appended data for: NFLX
Appended data for: GOOG
Appended data for: NKE
Appended data for: ADS.DE
Appended data for: NVDA
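
When the portfolio grows, it may be worth pausing between requests so the site is not hit in quick succession. The sketch below assumes a hypothetical `fetch_all` helper that takes the fetch function as a parameter (in this notebook that would be `stock_data`). The one-second default delay is an arbitrary choice.

```python
import time

def fetch_all(tickers, fetch, delay_seconds=1.0):
    """Call `fetch` for each ticker, sleeping between consecutive calls."""
    results = {}
    for index, ticker in enumerate(tickers):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every call except the first
        results[ticker] = fetch(ticker)
    return results

# Stand-in fetch function so the sketch runs without network access
def fake_fetch(ticker):
    return {"Company": ticker, "Price": "0.00"}

data = fetch_all(["AAPL", "MSFT"], fake_fetch, delay_seconds=0.01)
print(list(data))
```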


For further investigation we load the json file into a DataFrame and take a first look at it.

In [9]:
# Load the JSON file "portfolio_data.json" into a DataFrame
df = pd.read_json("portfolio_data.json")

# Remove the ticker symbol in parentheses from each company name
df["Company"] = df["Company"].apply(lambda x: re.sub(r"\s+\(.*\)$", "", x))

In [10]:
df.tail()

             Time_Stamp             Company   Price  Total_Change  Percentage_Change
79  2024/09/12 12:54:39       Netflix, Inc.  681.47          7.85             +1.17%
80  2024/09/12 12:54:40       Alphabet Inc.  152.15          2.14             +1.43%
81  2024/09/12 12:54:41          NIKE, Inc.    78.4          0.31             +0.40%
82  2024/09/12 12:54:42           adidas AG   217.1           4.3             +2.02%
83  2024/09/12 12:54:43  NVIDIA Corporation  116.91          8.81             +8.15%


# Summary
To summarize this small project, we built the Yahoo Finance quote URL from a stock's ticker symbol so that the values of interest can be fetched at any time.  
We used a dictionary of companies, and later a list of ticker symbols, as the basis for demonstrating the functions.  
We also added a timestamp that records the date and time at which the data was fetched.  
Finally, we appended the fetched data to a JSON file, which grows by one record per company each time the portfolio loop runs.  
The advantage of this type of storage is that the data can be loaded again at any time for analysis or prediction.
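
As a small illustration of such a follow-up analysis, the records can be grouped by company to see the price range observed so far. This is a minimal sketch using only the standard library. The sample records are made up and merely mimic the structure of the entries in `portfolio_data.json`.

```python
from collections import defaultdict

# Made-up records in the same shape as the entries in portfolio_data.json
records = [
    {"Time_Stamp": "2024/09/11 12:00:00", "Company": "adidas AG (ADS.DE)", "Price": "212.80"},
    {"Time_Stamp": "2024/09/12 12:54:42", "Company": "adidas AG (ADS.DE)", "Price": "217.10"},
    {"Time_Stamp": "2024/09/12 12:54:43", "Company": "NVIDIA Corporation (NVDA)", "Price": "116.91"},
]

# Collect all observed prices per company
prices_by_company = defaultdict(list)
for record in records:
    prices_by_company[record["Company"]].append(float(record["Price"]))

# Lowest and highest observed price per company
summary = {company: (min(prices), max(prices)) for company, prices in prices_by_company.items()}

for company, (low, high) in summary.items():
    print(f"{company}: low={low:.2f}, high={high:.2f}")
```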

In [11]:
# Python Version
!python --version

Python 3.11.7
