# Stock Data Pipeline Documentation

This document describes how this project was built. It is divided into sections, each covering a specific part of the project's construction.
```
stock-data-pipeline/
├── .gitignore
├── config.py
├── list.py
├── load.py
├── main.py
└── README.md
```
Below is a table of the tools used in this project.
| Google Cloud | APIs | Scheduler | Visualization |
|---|---|---|---|
| BigQuery | IEX Cloud | cron | Web Browser |
| Cloud Storage | | | |
| Compute Engine | | | |
IEX Cloud is the API used in this project. Its lowest tier used to cost $20 a month, but prices have since risen to $50 a month for the lowest tier. Financial Modeling Prep is an alternative.
- Visit the Legacy API docs and use the Quote endpoint.
- Metrics used for this project:
  - Ticker
  - Company
  - Price
  - Change
  - PE Ratio
- Any API and/or metric can be used. These are just the ones used in this project.
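The five metrics above map onto fields of a Quote response. As a rough sketch, the field names below (`symbol`, `companyName`, `latestPrice`, `change`, `peRatio`) are assumptions based on the legacy Quote endpoint and may differ by plan or API:

```python
# Pull the five project metrics out of a quote payload.
# Field names are assumptions based on the legacy IEX Cloud Quote endpoint.
def extract_metrics(quote: dict) -> dict:
    return {
        "Ticker": quote.get("symbol"),
        "Company": quote.get("companyName"),
        "Price": quote.get("latestPrice"),
        "Change": quote.get("change"),
        "PE Ratio": quote.get("peRatio"),
    }

# Example payload shaped like a Quote response (values are made up):
sample = {
    "symbol": "AAPL",
    "companyName": "Apple Inc",
    "latestPrice": 150.25,
    "change": -1.05,
    "peRatio": 25.4,
}
row = extract_metrics(sample)
```

Keeping the extraction in one small function makes it easy to swap in a different API (such as Financial Modeling Prep) by changing only the field names.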
Google Cloud Storage is used to store the `.csv` file that will later be loaded into BigQuery.
- Create a bucket in Cloud Storage.
- Note the bucket path: `http://storage.googleapis.com/{bucket_name}`
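A bucket has two common address forms: the HTTP path noted above and the `gs://` URI that tools like `gsutil` and BigQuery load jobs use. A minimal sketch, assuming a placeholder bucket name:

```python
# `my_storage_bucket` and `output.csv` are placeholders, not real names.
bucket_name = "my_storage_bucket"
object_name = "output.csv"

# HTTP form, as noted in the step above:
public_url = f"http://storage.googleapis.com/{bucket_name}"

# gs:// URI form, as used in config.py and by BigQuery load jobs:
gcs_uri = f"gs://{bucket_name}/{object_name}"
```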
There are four Python files used in this project: `config.py`, `list.py`, `load.py`, and `main.py`.
- `config.py` holds the IEX Cloud API key, the Google Cloud Storage bucket URL, and the BigQuery table path:

  ```python
  api_key = "my_api_key_iex_cloud"
  table_name = "gcp_project_id.dataset_name.table_name"
  cloud_storage_bucket = "gs://my_storage_bucket/output.csv"
  ```

- `list.py` holds all of the companies' ticker symbols in a single Python list. To create the list, a simple web scraping script was built to scrape the company ticker symbols from a Wikipedia page, and the output was saved to a list in its own file. The script itself was not saved.
- `load.py` is responsible for copying `output.csv` to the Google Cloud Storage bucket and for updating the BigQuery table with `output.csv`.
- `main.py` is responsible for calling the API for data extraction, transforming the data, and then building `output.csv`. Running `main.py` performs its extract and transform duties first, then runs the functions from `load.py` to upload to Cloud Storage and then to BigQuery:
  ```python
  if __name__ == "__main__":
      company_info()

      # Running the functions from the load.py file.
      csv_file()
      bigquery_upload()
  ```

  Note: `main.py` is the driver for the rest of the files and is the only file run during the cron job.
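The extract/transform half of the flow can be sketched with the standard library alone. The `company_info` below is a hypothetical stand-in that writes the five metrics to `output.csv`; the real implementation also calls the IEX Cloud API first:

```python
import csv

# Hypothetical sketch of the transform step: take quote dictionaries
# (already fetched from the API) and write them out as output.csv.
FIELDS = ["Ticker", "Company", "Price", "Change", "PE Ratio"]

def company_info(quotes, path="output.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()   # column names become the CSV header row
        writer.writerows(quotes)

# Example run with one made-up row:
company_info([
    {"Ticker": "AAPL", "Company": "Apple Inc", "Price": 150.25,
     "Change": -1.05, "PE Ratio": 25.4},
])
```

After this step, `load.py` only has to move the finished file: one function copies `output.csv` to the bucket, the other loads it into the BigQuery table.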
All processes take place through a virtual machine hosted on Google Cloud. This allows processes to run in the Cloud without having to manage a local system.
- Visit Compute Engine.
- Create an instance.
  - Name the instance.
  - Series: E2.
  - Machine Type: e2-micro.
- Click on SSH to the right of the virtual machine name to enter the machine.
- Install git:

  ```shell
  sudo apt install git-all
  ```
- After `git` is installed, clone the GitHub repository:

  ```shell
  git clone https://github.com/digitalghost-dev/stock-data-pipeline.git
  ```
- The `config.py` file won't be downloaded, so it will have to be created manually.
  - In the virtual machine, run the command `nano config.py` to open the nano editor.
  - Then type in the code filled with your information:

  ```python
  api_key = "my_api_key_iex_cloud"
  table_name = "gcp_project_id.dataset_name.table_name" # this will be created later
  cloud_storage_bucket = "gs://my_storage_bucket/output.csv"
  ```
- Press `CTRL + X` to exit nano, `Y` to confirm saving, then `Enter` to write the file.
- Set the correct timezone:

  ```shell
  sudo timedatectl set-timezone America/Los_Angeles
  ```

  - View available timezones with:

    ```shell
    timedatectl list-timezones
    ```
- Add execute permissions to `main.py`:

  ```shell
  chmod +x main.py
  ```
- Run `main.py`.
  - The process will error out because BigQuery has not been set up yet, but the `output.csv` file will be created in the Cloud Storage bucket.
BigQuery will act as a data warehouse and will be the final storing place for the data.
- Visit BigQuery.
- Create a dataset by clicking the three dots next to your project ID. Any name and region is fine.
- Click on the three dots next to the dataset to create a table.
  - Choose to bring data in from the Cloud Storage bucket.
  - Choose the `output.csv` file that was created in the last step.
- `main.py` can be run again from the last step to load new data.
Using crontab, the script can run automatically on a specified schedule.
- SSH into the virtual machine.
- Type `crontab -e`.
- Add the following line to the `crontab` file:

  ```shell
  */15 9-16 * * 1-5 python3 /home/<folder>/main.py
  ```

  - This runs the `main.py` file every 15 minutes from 9 AM through the 4 PM hour (9:00–16:45), Monday through Friday.
- As long as the virtual machine is running, it will now run `main.py` at the specified interval.
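The schedule string can be sanity-checked without waiting on cron. A small sketch that evaluates `*/15 9-16 * * 1-5` against a timestamp (simplified: it only handles the fields this particular entry uses, not general cron syntax):

```python
from datetime import datetime

def matches_schedule(ts: datetime) -> bool:
    """True when ts falls inside `*/15 9-16 * * 1-5` (simplified check)."""
    return (
        ts.minute % 15 == 0        # */15 -> minutes 0, 15, 30, 45
        and 9 <= ts.hour <= 16     # 9-16 -> 9 AM through the 4 PM hour
        and ts.weekday() < 5       # 1-5  -> Monday through Friday
    )

# A Wednesday at 9:15 AM matches; a Saturday at the same time does not.
```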
Since the data is in BigQuery, it can be used in any visualization tool. This project returns the data in a table on a webpage using Flask and HTML.

In the Python file responsible for running the Flask application, create a function under a new route that pulls the data from the BigQuery table:
```python
# Routing to the data-pipeline page.
@app.route('/stock-data-pipeline')
def data_pipeline():
    bqclient = bigquery.Client()

    # SQL query
    query_string = """
        SELECT *
        FROM `cloud-data-infrastructure.stock_data_dataset.SP500`
        ORDER BY Ticker
    """

    pd.set_option('display.max_rows', None)
    pd.set_option('display.float_format', '{:.2f}'.format)

    df = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(create_bqstorage_client=True)
    )

    table = df.to_html(index=False)

    # Return the dataframe as an HTML table.
    return render_template(
        "stock_data_pipeline.html",
        table=table)
```
Return the table in the HTML file:

```html
<h1>S&P 500</h1>
<!-- df.to_html() already produces a complete <table> element -->
{{ table | safe }}
```