Near Real-Time Ecommerce User Activity Analysis

Tech stack: BigQuery, Spark, Hive, Kafka, Airflow, Google Cloud Storage, Python

As a Data Engineer for a fictional E-commerce startup, this project addresses the task of creating a data pipeline that transports the web logs generated by the servers into Google BigQuery (used as a data warehouse). This data warehouse helps analysts gain insights such as the number of product pages visited by users, the number of items added to the cart per hour, or any other ad-hoc query needed for business requirements. The project is built on Google Cloud Platform.

Real-Time Data Source

Since this is a fictional E-commerce startup, the real-time data stream is generated using this Python script. The script generates log entries simulating either a visit to a product page or the addition of a product to the shopping cart. If a log entry contains a PID value that is not null, it corresponds to a product added to the shopping cart. The generated web logs are published to a Kafka topic, which is then consumed by a Spark Streaming application.

A typical web-log entry published to the Kafka topic looks roughly like the example sketched below.
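The exact log format is defined by the generator script in data/; the snippet below is only a minimal sketch of what the producer side could look like, assuming JSON payloads, hypothetical field names (timestamp, user_id, page, pid), a placeholder topic name weblogs, and the kafka-python client.

```python
# Minimal sketch of the log generator publishing to Kafka.
# Field names, topic name, and broker address are illustrative assumptions,
# not the actual schema used by the script in data/.
import json
import random
import time
import uuid

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def make_log_entry():
    """Simulate either a product-page visit or an add-to-cart event."""
    added_to_cart = random.random() < 0.3
    return {
        "timestamp": int(time.time()),
        "user_id": str(uuid.uuid4()),
        "page": "/product/{}".format(random.randint(1, 100)),
        # A non-null PID means the product was added to the shopping cart.
        "pid": random.randint(1, 100) if added_to_cart else None,
    }

while True:
    producer.send("weblogs", make_log_entry())
    time.sleep(1)
```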


Data Flow Diagram


The following steps explain the flow of data:

  1. The Python script publishes the web logs to a Kafka topic hosted on a Kafka cluster (a Google Dataproc cluster).

  2. A Spark Streaming application reads the data stream from the Kafka topic and writes the DataFrame to a Google Cloud Storage (GCS) bucket. The data is written in Parquet format every minute (a minimal sketch of this step appears after this list).

  3. A Hive external table is created to index the raw data written in the previous step. Apache Airflow (Google Composer) is used to schedule a HiveQL query every hour that counts the page-visit entries and the add-to-cart entries. The output of this query is stored in another GCS bucket.

  4. The GCS bucket to which the output of the previous step is written is monitored by a Google Cloud Function. Whenever data lands in this bucket, the Cloud Function loads it into a BigQuery table **.
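As a rough sketch of step 2, the following shows how a Spark Structured Streaming job could read the JSON web logs from Kafka and write them to GCS as Parquet roughly once per minute. The topic name, schema, broker address, and bucket paths are assumptions for illustration; the actual streaming application in this repository may differ.

```python
# Sketch of a Spark Structured Streaming job: Kafka -> Parquet files on GCS.
# Topic name, schema, and GCS paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("weblog-stream").getOrCreate()

log_schema = StructType([
    StructField("timestamp", LongType()),
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("pid", LongType()),  # null => page visit, non-null => add to cart
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "weblogs")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into columns.
logs = (
    raw.select(from_json(col("value").cast("string"), log_schema).alias("log"))
    .select("log.*")
)

query = (
    logs.writeStream
    .format("parquet")
    .option("path", "gs://example-raw-weblogs/data/")              # placeholder bucket
    .option("checkpointLocation", "gs://example-raw-weblogs/checkpoints/")
    .trigger(processingTime="1 minute")                            # micro-batch every minute
    .start()
)
query.awaitTermination()
```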

Usage

Prerequisites

  • Use the bash scripts to bring up the following:

    • Spark cluster (Kafka also runs on this cluster): bash-scripts/create-spark-cluster.sh
    • Hive cluster: bash-scripts/create-hive-cluster.sh
  • Create an Airflow environment using Google Composer and copy the DAGs from the airflow/ folder into the environment. Also copy the HiveQL scripts spark-scripts/hive-sql-visitcnt-script.py and spark-scripts/hive-sql-cart-script.py into GCS so that the DAGs can use them (a minimal DAG sketch follows the note below).

(If you are on GCP's free trial, spinning up these clusters and the Airflow environment uses up the entire IP address quota, so shut down any other VMs that may already be running before spinning up these clusters.)
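For orientation, here is a minimal sketch of what one of the hourly DAGs could look like, assuming the Airflow Google provider's DataprocSubmitJobOperator and placeholder project, region, cluster, and GCS script paths; the DAGs in airflow/ are the source of truth.

```python
# Sketch of an hourly Airflow DAG that submits a Hive job to the Dataproc
# Hive cluster. Project ID, region, cluster name, and the GCS location of
# the HiveQL query are placeholders, not the repository's actual values.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

HIVE_JOB = {
    "reference": {"project_id": "example-project"},
    "placement": {"cluster_name": "hive-cluster"},
    "hive_job": {"query_file_uri": "gs://example-scripts/visit-count.hql"},
}

with DAG(
    dag_id="hourly_visit_count",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    count_visits = DataprocSubmitJobOperator(
        task_id="count_visits",
        project_id="example-project",
        region="us-central1",
        job=HIVE_JOB,
    )
```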

Start the pipeline

  • Copy the Python data-generator script from data/ to the master node of the Spark cluster, install the necessary dependencies, and start the script.
  • Create the Hive external table using hive-scripts/create-hive-ext-table.sh
  • Start the Airflow DAGs from the Airflow web UI
  • Deploy the Cloud Function using bash-scripts/submit-cloud-function.sh

Further Improvements I Plan to Add

  • Bash scripts to create the Airflow Environment
  • ** Currently, the WRITE_DISPOSITION is set to WRITE_TRUNCATE, which is a costly operation and will be improved in the future. The data is also partitioned by event date; since BigQuery allows at most 4,000 partitions per table, this configuration can hold roughly 11 years (4000/365) of data. The plan is to replace WRITE_TRUNCATE with a WRITE_APPEND approach that only adds newly processed data (see the sketch at the end of this list).
  • Connecting BigQuery to Tableau to create visualizations of the data
  • Add Data-Quality checks in the Airflow DAGs
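For reference, the following is a minimal sketch of the load step inside the GCS-triggered Cloud Function, using the google-cloud-bigquery client. The dataset and table names are placeholders, the source format is assumed to be Parquet, and the write disposition shown is the current WRITE_TRUNCATE behaviour discussed in the footnote above.

```python
# Sketch of the Cloud Function body: load the newly written hourly aggregate
# from GCS into BigQuery. Table ID and source format are assumptions.
from google.cloud import bigquery

def load_to_bigquery(event, context):
    """Background Cloud Function triggered by a GCS object-finalize event."""
    client = bigquery.Client()
    uri = "gs://{}/{}".format(event["bucket"], event["name"])

    job_config = bigquery.LoadJobConfig(
        # Adjust if the Hive job writes CSV instead of Parquet.
        source_format=bigquery.SourceFormat.PARQUET,
        # Current behaviour: overwrite the whole table on every load.
        # The planned improvement is WRITE_APPEND plus date partitioning.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    load_job = client.load_table_from_uri(
        uri, "example-project.ecommerce.user_activity", job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
```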
