The Poor Man's Data Pipeline
Tasked with writing a proof-of-concept data pipeline, I was overwhelmed by the options on the market. This simple data pipeline sits on Google Cloud Platform, captures events using a simple tracking pixel, processes and stores the data in near real time, and requires no ops. It was partially inspired by this project on Google's own site.
The ingest starts as an HTTP request for a 1x1 png image. Something like this: http://track.domain.com/pixel.png?user_id=507f1f77bcf86cd799439011&order_id=507f1f77bcf86cd799439011&type=click
This request is directed to Google Cloud's HTTP load balancer. The 1x1 png is served from the CDN, the request is logged, and the log message is published to a Pub/Sub topic. The entire log message ends up in Pub/Sub but the most important part is the URL since it contains the parameters we are interested in tracking.
A Google Cloud Function is subscribed to the Pub/Sub topic where the log messages are streaming. Every time a log message is published, the Cloud Function runs with that message as its input.
This function is simple. It parses the request URL, extracts the important parameters, and uploads the result to BigQuery.
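As a rough illustration of the parsing step, here is a minimal sketch. It assumes the Pub/Sub payload is the base64-encoded JSON log entry and that the request URL sits under httpRequest.requestUrl, as it does in HTTP load balancer logs. The function names and the sample entry are hypothetical, and the real logic lives in index.js.

```javascript
// Hypothetical whitelist; the real one is VALID_ATTRIBUTES in index.js.
const VALID_ATTRIBUTES = ['user_id', 'order_id', 'type'];

// Pub/Sub hands the function the log entry as base64-encoded JSON.
function decodeLogEntry(base64Data) {
  return JSON.parse(Buffer.from(base64Data, 'base64').toString('utf8'));
}

// Keep only the whitelisted query parameters from the logged request URL.
function extractAttributes(logEntry) {
  const url = new URL(logEntry.httpRequest.requestUrl);
  const row = {};
  for (const name of VALID_ATTRIBUTES) {
    if (url.searchParams.has(name)) {
      row[name] = url.searchParams.get(name);
    }
  }
  return row;
}

// Simulate a message: a made-up log entry, encoded the way Pub/Sub would send it.
const sampleEntry = {
  httpRequest: {
    requestUrl: 'http://track.domain.com/pixel.png?user_id=507f1f77bcf86cd799439011&type=click&debug=1',
  },
};
const encodedEntry = Buffer.from(JSON.stringify(sampleEntry)).toString('base64');

// The non-whitelisted "debug" parameter is dropped.
console.log(extractAttributes(decodeLogEntry(encodedEntry)));
```

Whitelisting the parameters keeps unexpected query strings from producing rows that do not fit the BigQuery schema.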
BigQuery stores all event data. The data is partitioned both by user id and date. When a query is performed against BigQuery, costs will remain low because each of the resulting tables is much smaller and quicker to query.
- A working Google Cloud Platform account that can enable services. You are responsible for whatever charges you incur.
Google Cloud Storage
Google Cloud Storage holds the pixel.png file for us, nothing else.
- In Google Cloud Storage, create a new bucket and remember the name.
- Click "edit bucket permissions" and create new read permissions for a user called "allUsers". It will look like this:
- Upload the pixel.png from this repository to your new bucket.
- Click the checkbox under "share publicly" so your image can be served publicly.
The Load Balancer sits in front of pixel.png and logs all requests made to it. This is the entrypoint for data collection.
- In Networking, click "Create Load Balancer".
- Choose HTTP(S) Load Balancing
- Set up a backend configuration to use a bucket, enabling Cloud CDN like this:
- Leave Host and path as well as the frontend as is unless you know you need further configuration. Here's my complete configuration:
- Get the public IP of your load balancer for later use.
Next up, you need to create a Pub/Sub topic that log messages can be published to.
- Go to Pub/Sub and create a topic
- Remember the name you chose
- That's it
The next thing you need is to export logs that come from your load balancer and publish them to your Pub/Sub topic.
- Go to Logging in the GCP menu.
- In the filter input, click the dropdown and choose "Convert to advanced filter":
- Add a filter to the input like this:
resource.type = http_load_balancer AND resource.labels.url_map_name = "[YOUR_LOAD_BALANCER_NAME]". This catches only logs that come from the load balancer you just set up.
- Click "Create Export" at the top of the page.
- Choose Cloud Pub/Sub as the Sink Service and the topic you recently created.
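To make the filter's effect concrete, the sketch below mirrors it as a plain JavaScript predicate over log entries. The load balancer name and the sample entries are hypothetical. The actual filtering is done by the logging export itself, not by your code.

```javascript
// Hypothetical load balancer name; substitute the name of your own.
const LOAD_BALANCER_NAME = 'pixel-lb';

// Mirrors the export filter:
//   resource.type = http_load_balancer AND resource.labels.url_map_name = "..."
function matchesExportFilter(logEntry) {
  const resource = logEntry.resource || {};
  const labels = resource.labels || {};
  return resource.type === 'http_load_balancer' &&
    labels.url_map_name === LOAD_BALANCER_NAME;
}

// Made-up entries: one from our load balancer, one from another, one from elsewhere.
const fromOurBalancer = {
  resource: { type: 'http_load_balancer', labels: { url_map_name: 'pixel-lb' } },
};
const fromOtherBalancer = {
  resource: { type: 'http_load_balancer', labels: { url_map_name: 'website-lb' } },
};
const fromCloudFunction = { resource: { type: 'cloud_function', labels: {} } };

console.log(matchesExportFilter(fromOurBalancer));   // true
console.log(matchesExportFilter(fromOtherBalancer)); // false
console.log(matchesExportFilter(fromCloudFunction)); // false
```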
To prepare for the next step, we need to create a BigQuery Table. This table will be used as a "template" for future tables that will be dynamically created.
- In the GCP menu, open the BigQuery console.
- Click Create new dataset and choose a name.
- Create a new table called 'pixel' and give it a desired schema. Here's what mine looks like:
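The schema used in this project is not reproduced in the text, so the following is only a hypothetical example, written as the JSON-style field list BigQuery accepts. The field names are borrowed from the sample tracking URL at the top of this document, and the helper is an illustration that checks every attribute you plan to track has a matching column.

```javascript
// Hypothetical schema for the 'pixel' template table (not the author's actual schema).
const exampleSchema = [
  { name: 'user_id', type: 'STRING', mode: 'NULLABLE' },
  { name: 'order_id', type: 'STRING', mode: 'NULLABLE' },
  { name: 'type', type: 'STRING', mode: 'NULLABLE' },
];

// Return the tracked attributes that have no matching column in the schema.
function missingFromSchema(trackedAttributes, schema) {
  const columns = new Set(schema.map((field) => field.name));
  return trackedAttributes.filter((name) => !columns.has(name));
}

console.log(missingFromSchema(['user_id', 'order_id', 'type'], exampleSchema)); // []
console.log(missingFromSchema(['user_id', 'campaign'], exampleSchema)); // [ 'campaign' ]
```

Running a check like this before deploying catches a tracked attribute that would otherwise be rejected at insert time.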
Next up, we need to parse each log message that comes in and send it on to BigQuery.
- Open up index.js from this repository.
- Set TABLE_NAME according to the BigQuery dataset and table you just created.
- Set VALID_ATTRIBUTES according to the data you want to track. Any attribute here must be in the BigQuery schema you just created.
- Spend a moment looking through the rest of the function. It's quite simple. The only non-intuitive part is the templateSuffix we pass to BigQuery. This tells BigQuery to create a new table using the schema of the table name we passed but to append the value of templateSuffix to the end, thereby partitioning our data by whatever criteria we build into that suffix.
- Deploy the function using the following command:
  gcloud beta functions deploy parse --stage-bucket [any-title] --trigger-topic [your-topic]
  stage-bucket simply tells GCP where to store your code. Just choose any unique name and GCP does the rest.
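To illustrate how a templateSuffix can encode the partitioning criteria, here is a minimal sketch that builds a suffix from a user id and a UTC date. The exact suffix format used in index.js may differ. The helper name, the separator, and the date format are assumptions made for this example.

```javascript
// Build a suffix such as "_507f1f77bcf86cd799439011_20180115":
// the user id followed by the UTC date as YYYYMMDD.
function buildTemplateSuffix(userId, date) {
  const day = date.toISOString().slice(0, 10).replace(/-/g, '');
  return `_${userId}_${day}`;
}

// Name of the template table created earlier.
const TEMPLATE_TABLE = 'pixel';

const exampleSuffix = buildTemplateSuffix(
  '507f1f77bcf86cd799439011',
  new Date(Date.UTC(2018, 0, 15))
);

// BigQuery would create (or reuse) a table named template table + suffix.
console.log(TEMPLATE_TABLE + exampleSuffix);
// pixel_507f1f77bcf86cd799439011_20180115

// Assumption: with the Node.js BigQuery client, the suffix would be passed as
//   table.insert(rows, { templateSuffix: exampleSuffix })
```

Because the suffix becomes part of a table name, it should stick to letters, digits, and underscores.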
At this point, you should have a working pipeline. Using the IP address of the load balancer, you can construct a URL like http://ip-address/pixel.png?param1=123&param2=foobar and send your users to it. It would be wise to set up DNS so you can use a better looking URL.
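Building the tracking URL by hand is error prone once values need escaping. The sketch below uses the standard URL API to do it. The helper name and the host are hypothetical, and the parameter names come from the example at the top of this document.

```javascript
// Build a pixel URL with properly encoded query parameters.
function buildPixelUrl(baseUrl, params) {
  const url = new URL('/pixel.png', baseUrl);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, String(value));
  }
  return url.toString();
}

const pixelUrl = buildPixelUrl('http://track.domain.com', {
  user_id: '507f1f77bcf86cd799439011',
  type: 'click',
});

console.log(pixelUrl);
// http://track.domain.com/pixel.png?user_id=507f1f77bcf86cd799439011&type=click
```

In a page, the resulting URL would typically be used as the src of a 1x1 image so the browser issues the request for you.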
GCP gives you insight into each piece of this pipeline, but the information is a bit scattered. There seems to be a basic "Stackdriver" logging panel that gives you some insight into your services. If you want more detail, you have to use the full Stackdriver product or use the API to gather metrics.
Use the logging panel to see raw logs from the Load Balancer, BigQuery, and the Cloud Function. You can also tail your cloud function logs manually from the command line using the command supplied in
Possible next systems to layer in:
- A more sophisticated processing layer. GCP Dataflow would integrate nicely.
- Storage of full logs. Generally, storing raw data is preferable. It would be fairly easy to batch up raw logs and store them in cloud storage so you can easily have raw backups of the data.
- Machine learning. Google's cloud tools for machine learning are best in class, so you will have an easy time integrating those into the flow.
Weaknesses of this approach
The nature of this setup necessitates streaming each log event through the entire stack. This can lead to higher costs. For example, loading data to BigQuery is free but streaming inserts cost $0.05 per GB.
Consistency. Streaming data into BigQuery is consistent, but not 100% so. Wording taken from BigQuery's documentation:
The app can tolerate a rare possibility that duplication might occur or that data might be temporarily unavailable.
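One way to reduce duplicates, which this project does not necessarily use, is to attach an insertId to each streamed row. BigQuery uses that id for best-effort de-duplication. The sketch below assumes the log entry's own insertId field is reused, so a redelivered Pub/Sub message produces the same id. The helper name and the sample entry are hypothetical.

```javascript
// Wrap a parsed row in the raw streaming-insert shape. Reusing the log entry's
// own insertId means a redelivered message carries the same id both times.
function toRawRow(logEntry, parsedRow) {
  return { insertId: logEntry.insertId, json: parsedRow };
}

// Made-up log entry that Pub/Sub happens to deliver twice.
const deliveredEntry = { insertId: 'abc123' };
const firstAttempt = toRawRow(deliveredEntry, { type: 'click' });
const secondAttempt = toRawRow(deliveredEntry, { type: 'click' });

// Both attempts share an id, so BigQuery can recognize the repeat.
console.log(firstAttempt.insertId === secondAttempt.insertId); // true

// Assumption: with the Node.js BigQuery client the rows would be sent as
//   table.insert([firstAttempt], { raw: true })
```

De-duplication on insertId is best effort only, so queries that need exact counts should still tolerate the occasional duplicate.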
Cost at volume. I can't speak to the pricing authoritatively, but I assume it does not work out in your favor if you will be pumping billions of events per month. If that's the case, you likely have the engineering staff to use self-hosted solutions. If not, this is still a good solution, but it will likely be pricey.
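For a rough sense of scale, the sketch below estimates only the streaming insert portion of the bill, using the $0.05 per GB figure quoted above. The event volume and the 1,000 bytes per event are hypothetical inputs, a GB is taken as 10^9 bytes, and the other services in the pipeline are not included.

```javascript
// Streaming insert price quoted earlier in this document (USD per GB).
const STREAMING_PRICE_PER_GB = 0.05;

// Estimate the monthly streaming insert cost in USD.
function estimateStreamingCost(eventsPerMonth, bytesPerEvent) {
  const gigabytes = (eventsPerMonth * bytesPerEvent) / 1e9;
  return gigabytes * STREAMING_PRICE_PER_GB;
}

// Hypothetical load: one billion events per month at 1,000 bytes each,
// which is 1,000 GB streamed.
const monthlyCost = estimateStreamingCost(1e9, 1000);
console.log(`$${monthlyCost.toFixed(2)} per month`); // $50.00 per month
```

Check current pricing before relying on the number, since the rate above is simply the one cited in this document.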
Please send any and all feedback by way of email (email@example.com), a GitHub issue, or ideally, a pull request ;)
This project is released under the MIT License