This repository was archived by the owner on Jun 13, 2023. It is now read-only.

Stock Data Pipeline Documentation

Christian edited this page Jun 13, 2023 · 11 revisions

1. Introduction

This document describes how this project was built. It is separated into several sections, each covering a specific topic related to the construction of the project.


File Structure

stock-data-pipeline/
├── .gitignore
├── config.py
├── list.py
├── load.py
├── main.py
└── README.md

Tools

Below is a table of the tools used in this project.

| Google Cloud   | APIs      | Scheduler | Visualization |
|----------------|-----------|-----------|---------------|
| BigQuery       | IEX Cloud | cron      | Web Browser   |
| Cloud Storage  |           |           |               |
| Compute Engine |           |           |               |

2. API - IEX Cloud

This is the API used in this project.

This API's lowest tier used to cost $20 a month, but the price has since risen to $50 a month for the lowest tier. An alternative can be found with Financial Modeling Prep.

  1. Visit the Legacy API docs and use the Quote endpoint.
  2. Metrics used for this project:
    • Ticker
    • Company
    • Price
    • Change
    • PE Ratio
  3. Any API and/or metric can be used; these are simply the ones chosen for this project.
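The Quote endpoint call and the metric extraction above can be sketched as follows. This is a hypothetical sketch: the base URL follows the legacy IEX Cloud docs, and the JSON field names (`symbol`, `companyName`, `latestPrice`, `change`, `peRatio`) are assumptions that should be verified against the Quote endpoint reference.

```python
import json
import urllib.parse
import urllib.request

# Legacy IEX Cloud base URL -- an assumption; check the Legacy API docs.
BASE_URL = "https://cloud.iexapis.com/stable"

def extract_metrics(quote: dict) -> dict:
    """Keep only the five metrics this project uses from a quote payload."""
    return {
        "Ticker": quote["symbol"],
        "Company": quote["companyName"],
        "Price": quote["latestPrice"],
        "Change": quote["change"],
        "PE_Ratio": quote["peRatio"],
    }

def fetch_quote(ticker: str, api_key: str) -> dict:
    """Call the Quote endpoint for one ticker and return the kept metrics."""
    params = urllib.parse.urlencode({"token": api_key})
    url = f"{BASE_URL}/stock/{ticker}/quote?{params}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return extract_metrics(json.load(response))
```

`extract_metrics` is kept separate from the network call so the field mapping can be tested without hitting the API.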

3. Google Cloud Storage

Google Cloud Storage is used to store the .csv file that will be later loaded to BigQuery.

  1. Create a bucket in Cloud Storage
  2. Note the bucket path: https://storage.googleapis.com/{bucket_name}
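The bucket can also be created programmatically with the google-cloud-storage client library. A minimal sketch, assuming a placeholder bucket name; the client import is deferred so the path helper can run without the library installed:

```python
def bucket_paths(bucket_name: str) -> dict:
    """Return the two common ways a Cloud Storage bucket is addressed."""
    return {
        "https": f"https://storage.googleapis.com/{bucket_name}",
        "gs": f"gs://{bucket_name}",
    }

def create_bucket(bucket_name: str, location: str = "US"):
    """Create the bucket (requires google-cloud-storage and credentials)."""
    # Deferred import so bucket_paths() works without the client library.
    from google.cloud import storage

    client = storage.Client()
    return client.create_bucket(bucket_name, location=location)
```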

4. Python Files

There are four Python files used in this project:

  • config.py
  • list.py
  • load.py
  • main.py
  • config.py holds the IEX Cloud API key, the Google Cloud Storage bucket URL, and the BigQuery table path:
api_key = "my_api_key_iex_cloud"
table_name = "gcp_project_id.dataset_name.table_name"
cloud_storage_bucket = "gs://my_storage_bucket/output.csv"
  • list.py holds a list of all the companies’ ticker symbols in a single Python list. To create the list, a simple web scraping script was built to scrape the company ticker symbols from a Wikipedia page, and the output was saved to a list in its own file. The script itself was not saved.
  • load.py is responsible for copying output.csv to the Google Cloud Storage bucket and for updating the BigQuery table with output.csv.
  • main.py is responsible for calling the API for data extraction, transforming the data, and then building output.csv.
    • Running main.py performs its extract and transform duties first, then runs the functions from load.py to upload to Cloud Storage and then to BigQuery:
if __name__ == "__main__":
    company_info()
    # Running the functions from load.py file.
    csv_file()
    bigquery_upload()

Note: main.py is the driver for the rest of the files and is the only file run during the cron job.
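A minimal sketch of what load.py's two functions might look like, assuming the google-cloud-storage and google-cloud-bigquery client libraries and the variable names from the config.py example above. The client imports are deferred so the URI helper can run standalone; the load options (header skip, schema autodetect, truncate-on-write) are assumptions, not the project's confirmed settings.

```python
def split_gcs_uri(uri: str) -> tuple:
    """Split a gs://bucket/object URI into (bucket, object) parts."""
    bucket, _, blob = uri.removeprefix("gs://").partition("/")
    return bucket, blob

def csv_file() -> None:
    """Copy the local output.csv into the Cloud Storage bucket."""
    from google.cloud import storage  # deferred import

    import config

    bucket_name, blob_name = split_gcs_uri(config.cloud_storage_bucket)
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename("output.csv")

def bigquery_upload() -> None:
    """Load the CSV from Cloud Storage into the BigQuery table."""
    from google.cloud import bigquery  # deferred import

    import config

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the CSV header row
        autodetect=True,       # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client = bigquery.Client()
    client.load_table_from_uri(
        config.cloud_storage_bucket, config.table_name, job_config=job_config
    ).result()  # wait for the load job to finish
```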


5. Virtual Machine

All processes take place through a virtual machine hosted on Google Cloud. This allows processes to run in the Cloud without having to manage a local system.

  1. Visit Compute Engine.
  2. Create an instance.
  3. Name the instance.
  4. Series: E2.
  5. Machine Type: e2-micro.
  6. Click on SSH to the right of the virtual machine name to enter the machine.
  7. Install git:
    • sudo apt install git-all
  8. After git is installed, clone the GitHub repository:
    • git clone https://github.com/digitalghost-dev/stock-data-pipeline.git
  9. The config.py file won't be downloaded, so it will have to be created manually.
    • In the virtual machine, run the command nano config.py to open the nano editor.
    • Then type in the code, filled in with your information:
api_key = "my_api_key_iex_cloud"
table_name = "gcp_project_id.dataset_name.table_name" # this will be created later
cloud_storage_bucket = "gs://my_storage_bucket/output.csv"
  10. Hit CTRL + X, then Y to confirm saving, then Enter to exit.

  11. View the available timezones:

    • timedatectl list-timezones
  12. Set the correct timezone:

    • sudo timedatectl set-timezone America/Los_Angeles
  13. Make main.py executable:

    • chmod +x main.py
  14. Run main.py.

    • The process will error out because BigQuery has not been set up yet, but the output.csv file will be created in the Cloud Storage bucket.

6. BigQuery

BigQuery will act as a data warehouse and will be the final storing place for the data.

  1. Visit BigQuery.
  2. Create a dataset by clicking the three dots next to your project ID. Any name and any region are fine.
  3. Click the three dots next to the dataset to create a table.
  4. Choose to bring in data from a Cloud Storage bucket.
  5. Choose the output.csv file that was created in the last section.
  6. main.py can be run again from the last section to load new data.
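After the table is created, a quick row-count query confirms the load worked. A sketch, assuming the google-cloud-bigquery client library and the table path from the config.py example; the import is deferred so the query builder can be used standalone:

```python
def row_count_query(table_name: str) -> str:
    """Build a verification query for a fully-qualified table path."""
    return f"SELECT COUNT(*) AS row_count FROM `{table_name}`"

def verify_load(table_name: str) -> int:
    """Run the row-count query (requires google-cloud-bigquery and credentials)."""
    from google.cloud import bigquery  # deferred import

    client = bigquery.Client()
    result = client.query(row_count_query(table_name)).result()
    return next(iter(result)).row_count
```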

7. Cron Job

Using crontab, the script can be run automatically on a specified schedule.

  1. SSH into the virtual machine.
  2. Type crontab -e.
  3. Add the following line to the crontab file.
    • */15 9-16 * * 1-5 python3 /home/<folder>/main.py
    • This runs main.py every 15 minutes during the 9 AM to 4 PM hours (last run at 4:45 PM), Monday through Friday.
  4. As long as the virtual machine is running, it will now run main.py at the specified interval.
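As a sanity check, the five fields of the cron expression above can be unpacked in a few lines (a quick sketch mirroring */15 9-16 * * 1-5):

```python
# Fields of "*/15 9-16 * * 1-5": minute, hour, day-of-month, month, day-of-week
minutes = list(range(0, 60, 15))   # */15 -> runs at :00, :15, :30, :45
hours = list(range(9, 17))         # 9-16 -> hours 9 AM through 4 PM (inclusive)
weekdays = list(range(1, 6))       # 1-5  -> Monday through Friday
runs_per_day = len(minutes) * len(hours)

print(minutes)       # [0, 15, 30, 45]
print(runs_per_day)  # 32 runs each weekday, the last at 4:45 PM
```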

8. Web Browser Visualization

Since the data is in BigQuery, it can be used with any visualization tool. This project returns the data as a table on a webpage using Flask and HTML.

In the Python file responsible for running the Flask application, create a function under a new route that pulls the data from the BigQuery table:

# Imports needed for this route (in addition to the existing Flask app object)
import pandas as pd
from flask import render_template
from google.cloud import bigquery

# Routing to the data-pipeline page
@app.route('/stock-data-pipeline')
def data_pipeline():

    bqclient = bigquery.Client()

    # SQL query
    query_string = """
    SELECT *
    FROM `cloud-data-infrastructure.stock_data_dataset.SP500`
    ORDER BY Ticker
    """

    # Show every row and format floats to two decimal places.
    pd.set_option('display.max_rows', None)
    pd.set_option('display.float_format', '{:.2f}'.format)

    df = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(
            create_bqstorage_client=True,
        )
    )

    table = df.to_html(index=False)

    return render_template(
        "stock_data_pipeline.html",
        table=table)  # pass the rendered HTML table to the template

HTML

Return the table in the HTML file. Note that df.to_html() already produces a complete <table> element, so it should not be wrapped in another <table> tag:

<h1>S&P 500</h1>
{{ table | safe }}