# College of Computing and Informatics, Drexel University
## INFO 323: Cloud Computing and Big Data
### Due: Wednesday, June 11, 2025
---

## Final Project Report

## Project Title: Public Company Overview (ETL Clients, DBT Warehouse, and App on GCP)

## Student(s): Vasyl Nesteryuk (vn357), Brett Musselman (bam429), and Steven Nguyen (sn924)

## Date: 11 June 2025
---

# Project Requirements

This final project assesses the knowledge the students have gained from the course. The project must be done on a cloud computing platform.

## Projects must apply cloud computing platforms and techniques for data science problems.

# 1. Problem Definition
---
*(Define the problem that will be solved in this cloud computing project.)*

Investment firms, retail investors, valuation professionals, and anyone studying an industry landscape need a quick way to get up to date on public company information. Fortunately, because of regulatory disclosure requirements and the value public companies place on publicizing their data, this information is widely available through APIs and software. The group plans to use several of those APIs to build a pipeline that retrieves and formats data based on a query of a public company's name or website. The last step is a presentation layer: a basic web app in which the user types in a company name or URL, applies other filtering conditions, and then views the resulting data and downloads it.

# 2. Data Set
---
*(Describe the origin, format, and characteristics of the data.)*

Yfinance, Polygon, and Financial Modeling Prep (FMP) are all APIs that serve financial and stock data on public companies; the data from these sources is time series data covering stock prices, financial performance, executive compensation, etc. Another source is Microlink, an API that takes in a URL and returns screenshots and text from the page. As of now, the full workflow for the Microlink URL grabs has not been added, but the rest of the pipeline is in place, which allows the user to analyze data for the 50 largest companies in the S&P and download CSVs of that data. The web app also has a request function that lets the user request information on other tickers. These requests go through a standard workflow in our custom API, which wraps all of the source systems, and they sometimes fail for tickers that are hard to access through that workflow.
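One way to keep a single hard-to-reach ticker from failing an entire request is to record failures per ticker and return whatever succeeded. The sketch below is only an illustration of that idea, not the project's actual code: `run_workflow`, `fetch`, and `fake_fetch` are hypothetical names, and the `fetch` callable stands in for whichever source client the custom API would call.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_workflow(
    tickers: Iterable[str],
    fetch: Callable[[str], List[dict]],
) -> Tuple[Dict[str, List[dict]], Dict[str, str]]:
    """Fetch data for each ticker, recording failures instead of raising.

    `fetch` is a hypothetical stand-in for a source client call.
    """
    results: Dict[str, List[dict]] = {}
    failures: Dict[str, str] = {}
    for raw in tickers:
        ticker = raw.strip().upper()
        if not ticker:
            continue  # skip blank entries in the request
        try:
            results[ticker] = fetch(ticker)
        except Exception as exc:  # one bad ticker should not fail the batch
            failures[ticker] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_fetch(ticker: str) -> List[dict]:
    """Hypothetical source client used only for this demonstration."""
    if ticker == "BAD":
        raise LookupError("ticker not found")
    return [{"ticker": ticker, "close": 100.0}]  # made-up value


results, failures = run_workflow(["aapl", " msft ", "bad", ""], fake_fetch)
print(sorted(results))  # tickers that loaded successfully
print(failures)         # tickers that failed, with the reason
```

Returning the failures alongside the results would let the web app tell the user which tickers could not be loaded while still showing the rest.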

# 3. Cloud Computing
---
*(Describe the cloud computing platforms and tools used in the project, and explain how they will be applied to address the problem.)*

The group will set up Python scripts to handle the extraction from the APIs, (potentially) BigQuery SQL scripts to transform and format the data, and Python for final analysis and formatting. We will host the extract and load scripts on Cloud Run (serverless compute that runs containers and can connect CI/CD to GitHub), use a Cloud Storage bucket to hold the raw data, use BigQuery + DBT to transform the data, and host the app scripts on either App Engine or Cloud Run, since both have large free tiers and are easy to get up and running. This breaks down into three pieces: an “extract + load” API that handles data collection requests, a DBT + BigQuery warehouse that handles the transformation step, and a web app that handles requests for new data and presents data from the warehouse.

The cloud computing aspects of the project are a Cloud Run service (serverless compute) running a Docker container with a custom API that wraps the source systems; DBT code that pushes SQL down to BigQuery, which transforms data held in our object storage in GCP; a Cloud Run Function scheduled to run the DBT project and refresh the BigQuery dataset every day; and a web app, run in a Docker container on App Engine in GCP, that reads the BigQuery dataset and calls the custom API.
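The scheduled refresh described above can be pictured as a small entry point that shells out to the dbt CLI. This is a minimal sketch under stated assumptions rather than the deployed function: `refresh_warehouse`, `build_dbt_command`, and the `prod` target name are hypothetical, and the command runner is injected so the example runs without dbt installed.

```python
import subprocess
from typing import Callable, List, Sequence


def build_dbt_command(project_dir: str, target: str) -> List[str]:
    """Assemble a `dbt build` call (runs models, tests, seeds, snapshots)."""
    return ["dbt", "build", "--project-dir", project_dir, "--target", target]


def run_subprocess(command: Sequence[str]) -> int:
    """Run the command and return its exit code (requires dbt on PATH)."""
    return subprocess.run(list(command), check=False).returncode


def refresh_warehouse(
    project_dir: str,
    target: str = "prod",  # hypothetical target name
    runner: Callable[[Sequence[str]], int] = run_subprocess,
) -> dict:
    """Entry point that a daily scheduler could call to rebuild the dataset."""
    command = build_dbt_command(project_dir, target)
    exit_code = runner(command)
    return {"command": command, "ok": exit_code == 0}


# Demonstration with a fake runner, so nothing is actually executed here.
recorded: List[List[str]] = []


def fake_runner(command: Sequence[str]) -> int:
    recorded.append(list(command))
    return 0


outcome = refresh_warehouse("dbt_functions", runner=fake_runner)
print(outcome)
```

Injecting the runner keeps the scheduling logic testable locally; in a deployed function the default `run_subprocess` would be used instead.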

# 4. Data Analysis using Cloud Computing
---
*(Use the selected cloud computing platforms and tools to cleanse, wrangle, transform, analyze, and model the data. Document each step, along with the results and findings, ensuring the entire process is reproducible by the TA and instructor.)*

GitHub: https://github.com/brettamusselman/public-company-overview?tab=readme-ov-file

App: https://public-company-overview.ue.r.appspot.com/

Our project focused on the data engineering side of the course. We created extract-and-load (EL) clients in Python and wrapped them in a FastAPI app running in a Docker container on Cloud Run, then built a DBT pipeline to dimensionally model the data, and finished with a web app to visualize and access the underlying warehouse and APIs. In the real world, we would have imported the final data model into BI software such as PowerBI or Looker, but we decided to go with an app since it is easier to share.

The first step of the project was to design Python clients to collect data from various source systems in the form of financial APIs. This step can be viewed in the src directory on GitHub. First, the group wrote wrappers around GCP's Secret Manager and Cloud Storage Python SDKs for ease of use, then wrote wrappers around Polygon, YFinance, FMP, Shodan, and Microlink. After that, a main.py file was written that calls those underlying modules to collect data and write it to the bucket. This file was then wrapped in a FastAPI app and deployed via Docker to Cloud Run. Together, these pieces form the base of the project: the extract + load client.
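The extract + load flow can be sketched as follows: fetch records from a source, serialize them to CSV, and write them to a bucket path. This is an illustrative sketch, not the code in the src directory: all function names, the object path layout, and the sample values are assumptions, and the `upload` callable stands in for a wrapper around the Cloud Storage client so the example runs with the standard library only.

```python
import csv
import io
from datetime import date
from typing import Callable, Dict, List


def records_to_csv(records: List[Dict[str, object]]) -> str:
    """Serialize a list of flat dictionaries to CSV text with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def object_path(source: str, ticker: str, run_date: date) -> str:
    """Build a bucket object name (hypothetical layout: source/ticker/date)."""
    return f"{source}/{ticker.upper()}/{run_date.isoformat()}.csv"


def extract_and_load(
    source: str,
    ticker: str,
    run_date: date,
    fetch: Callable[[str], List[Dict[str, object]]],
    upload: Callable[[str, str], None],
) -> str:
    """Fetch records for a ticker and write them to storage as CSV."""
    records = fetch(ticker)
    path = object_path(source, ticker, run_date)
    upload(path, records_to_csv(records))
    return path


# In-memory stand-ins for a source client and a bucket (assumptions).
fake_bucket: Dict[str, str] = {}


def fake_fetch(ticker: str) -> List[Dict[str, object]]:
    return [{"ticker": ticker.upper(), "close": 100.0}]  # made-up value


written = extract_and_load(
    "yfinance", "aapl", date(2025, 6, 11), fake_fetch, fake_bucket.__setitem__
)
print(written)
print(fake_bucket[written])
```

In a real deployment, `upload` could wrap `Blob.upload_from_string` from the google-cloud-storage library, and `extract_and_load` could be called from an API route handler.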

For the second step of the project, the group used DBT to orchestrate the construction of a BigQuery dataset with bronze, silver, and gold layers (also known as staging, intermediate, and marts). This part of the project can be viewed in the dbt_functions directory on GitHub. In this step, the code reads the underlying CSVs held in the bucket into external tables in staging, then loads and formats the tables and begins the dimensional modeling in the intermediate layer, and ends in the marts layer with dimension and fact tables. DBT (data build tool) is popular among analytics engineers because it is system agnostic and can talk to most SQL databases via connectors; the premise of the tool is to use Jinja for programmatic macros and to base each SQL model on a series of CTEs, which abstracts away some of the difficulty of formatting complex data for BI and AI/ML consumption. The DBT project was then deployed to a Cloud Run Function on a daily schedule to refresh the warehouse.
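The staging, intermediate, and marts progression can be illustrated on a small scale. In the sketch below, SQLite (from Python's standard library) stands in for BigQuery and plain SQL stands in for DBT's Jinja-templated models; the table names (`stg_prices`, `int_prices`, `dim_company`, `fct_daily_price`) and all values are made up for illustration and are not taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging (bronze): raw rows as they arrive from the extract step, all text.
conn.execute("CREATE TABLE stg_prices (ticker TEXT, price_date TEXT, close TEXT)")
conn.executemany(
    "INSERT INTO stg_prices VALUES (?, ?, ?)",
    [
        ("aapl", "2025-06-02", "100.0"),
        ("aapl", "2025-06-03", "102.0"),
        ("msft", "2025-06-02", "200.0"),
        ("msft", "2025-06-02", "200.0"),  # duplicate row to be removed
    ],
)

# Intermediate (silver): a chain of CTEs that fixes types, then deduplicates.
conn.execute(
    """
    CREATE VIEW int_prices AS
    WITH typed AS (
        SELECT UPPER(ticker) AS ticker,
               price_date,
               CAST(close AS REAL) AS close
        FROM stg_prices
    ),
    deduped AS (
        SELECT DISTINCT ticker, price_date, close FROM typed
    )
    SELECT ticker, price_date, close FROM deduped
    """
)

# Marts (gold): one dimension table and one fact table keyed to it.
conn.execute(
    "CREATE TABLE dim_company (company_key INTEGER PRIMARY KEY, ticker TEXT UNIQUE)"
)
conn.execute(
    "INSERT INTO dim_company (ticker) "
    "SELECT DISTINCT ticker FROM int_prices ORDER BY ticker"
)
conn.execute(
    """
    CREATE TABLE fct_daily_price AS
    SELECT d.company_key, p.price_date, p.close
    FROM int_prices AS p
    JOIN dim_company AS d ON d.ticker = p.ticker
    """
)

# A BI-style query against the marts layer.
rows = conn.execute(
    """
    SELECT d.ticker, COUNT(*) AS days, ROUND(AVG(f.close), 2) AS avg_close
    FROM fct_daily_price AS f
    JOIN dim_company AS d ON d.company_key = f.company_key
    GROUP BY d.ticker
    ORDER BY d.ticker
    """
).fetchall()
print(rows)
```

Each layer reads only from the layer before it, which is the same dependency structure DBT expresses between models.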

In the third and final part of the project, the group wrote a wrapper around the BigQuery SDK along with some base queries, and built an app with Flask + Dash (plus some CSS, HTML, and JavaScript) for visualization and interaction with the underlying data warehouse. The app provides an interface to visualize the data with filters, request a standard workflow for a list of tickers, and download a table of a financial time series. All of this code can be viewed in the app directory on GitHub. Once the code was finalized, it was deployed via a Docker container to GCP's App Engine, which also handles the irritating work of exposing a domain to the internet.
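A base query with user-supplied filters might be built as shown below. This is a hedged sketch, not the app's actual wrapper: `build_price_query`, the column names, and the table `my-project.marts.fct_daily_price` are hypothetical. It uses BigQuery's named-parameter syntax (`@name`) so that filter values from the UI are never spliced into the SQL text, and it validates the table name because identifiers cannot be passed as parameters.

```python
import re
from datetime import date
from typing import Any, Dict, Sequence, Tuple

# Accepts project.dataset.table identifiers only.
_TABLE_PATTERN = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_price_query(
    table: str,
    tickers: Sequence[str],
    start: date,
    end: date,
) -> Tuple[str, Dict[str, Any]]:
    """Return SQL text with named parameters plus the parameter values."""
    if not _TABLE_PATTERN.match(table):
        raise ValueError(f"unexpected table identifier: {table!r}")
    if not tickers:
        raise ValueError("at least one ticker is required")
    if start > end:
        raise ValueError("start date must not be after end date")
    sql = (
        f"SELECT ticker, price_date, close FROM `{table}` "
        "WHERE ticker IN UNNEST(@tickers) "
        "AND price_date BETWEEN @start AND @end "
        "ORDER BY ticker, price_date"
    )
    params = {
        "tickers": [t.strip().upper() for t in tickers],
        "start": start,
        "end": end,
    }
    return sql, params


sql, params = build_price_query(
    "my-project.marts.fct_daily_price",  # hypothetical table name
    ["aapl", "msft"],
    date(2025, 1, 1),
    date(2025, 3, 31),
)
print(sql)
print(params)
```

With the google-cloud-bigquery library, these values would typically be passed as `ArrayQueryParameter` and `ScalarQueryParameter` objects through `QueryJobConfig(query_parameters=...)`; that wiring is omitted here so the sketch runs without cloud credentials.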

# 5. Conclusion
---
*(Briefly describe what you have done using cloud computing techniques and what you discovered. Discuss any shortcomings of the process and results. Propose future work. **Finally, discuss the lessons learned from doing the project**.)*

The group received a large amount of practice working with cloud storage, serverless compute, containerization (through Docker), data warehousing, and app-specific compute. For cloud storage, we wrote a basic wrapper around the Google Cloud Storage client and used it within a base function in main.py to handle orchestration and writing from the various source-system clients to the project bucket. All of that code, along with the source-system clients, can be seen in the src directory on GitHub; it is held within a Docker container and hosted on Cloud Run serverless compute. That leads into the data warehousing step, which is designed in DBT to be run in batch by a Cloud Run Function every day. Finally, we finished with a web app that runs some basic queries against the data warehouse through a BigQuery client and displays the results through Dash; the app is also held in a Docker container and hosted on App Engine, which also provides a domain for the website.

Some of the shortcomings of the project stem from the data warehousing and presentation layers, which were difficult to set up because of the initial lack of data when the project started. After we narrowed down which source systems were key and generated some initial data, it became possible to finish those parts. The app for visualization was a unique opportunity, but, in the real world, that work would be done with a BI service like PowerBI, which abstracts away some of the complexities, makes it easier to set up interactive visuals, and handles the hosting components. The largest complexity was working with all of the various microservices that, when used together, create a unified project. While the group tended to use best practices, making sure all of the team members were working together and understood the current stage of the project was difficult.

For next steps, the group could spend more time retrieving data and fine-tuning the warehouse to what the target audience (those interested in public companies for investing or comparison purposes) would really like to have at their fingertips. From that research, the group could design a set of workflows and create a service that makes it easier for users to get what they need from a single source of information. The group could also pivot to using Looker, GCP's BI service, and move the source clients to DLT, the data loading tool that has been popularized for writing to object stores. One really interesting change would be adding either AutoML components for forecasting financial time series or an LLM API connection to automate some of the initial analysis or diligence that an investor may perform.

Through this project, the group learned a large amount about the various microservices used to set up a data pipeline and a data warehouse, deploy Python code and containers, and run an app online. The group also learned about different source APIs for financial data, learned how to use FastAPI, gained experience with Docker, used DBT to construct a data warehouse and talk to a cloud system, and learned how to use a data warehouse for a visualization app in Python.

# 6. References

---
(*Use the following requirements for writing your reports. DO NOT DELETE THE CELLS BELOW*)

# Project Requirements

This final project assesses the knowledge the students have gained from the course. Course outcomes include querying and exploring data using higher-level tools built on top of a cloud computing platform, applying practical tools for processing massive data sets, and building scalable big data analytical and predictive models.

**Marking will be focused on both presentation and content.**

## Written Presentation Requirements
The report will be judged on the basis of visual appearance, grammatical correctness, and quality of writing, as well as its contents. Please make sure that the text of your report is well-structured, using paragraphs, full sentences, and other features of well-written presentation.

## Technical Content of the Entire Project:
* Is the problem well defined and described thoroughly?
* Is the size and complexity of the data set used in this project commensurate with the course?
* Does the project use cloud computing techniques for exploratory data analysis?
* Does the project use cloud computing techniques for building analytical and predictive models?
* Does the project cover the key data science activities including data cleaning, data wrangling, visualization, model selection, feature engineering, and model evaluation?
* Does the report present the findings well and make clear conclusions?
* Overall, what is the rating of this project?