AWS Realtime Web Analytics Workshop
Knowing what users are doing on your websites in realtime gives you insights you can act on without having to wait for delayed batch processing of clickstream data. There are many use cases for evaluating web traffic analytics in realtime: watching the immediate impact on user behavior after new releases, detecting and responding to anomalies, maintaining situational awareness, and evaluating trends are all benefits of having realtime website analytics.
In this workshop we will build a cost optimized platform to capture web beacon traffic, analyze the traffic for interesting metrics, and display it on a customized dashboard.
To get started with this fun and educational workshop, simply clone this repository and start on module 1 below:
Note: If you don't have Python or Git installed on your computer, I recommend that you use an AWS Cloud9 environment to clone the repository, as well as run the Python scripts that are part of this workshop. Expand the instructions below for details:
Cloud9 Environment Setup (expand for details)
AWS Cloud9 Environment Setup Instructions
Navigate in the AWS console to Services, then select Cloud9. Be sure that you have either the US East (N. Virginia), US West (Oregon), or EU West (Ireland) region selected before you proceed to the next step.
Click the Create Environment button:
- Give your Environment a name, then click the Next step button:
- The default Environment settings should be fine for this workshop (t2.micro instance type), which will allow you to stay within the free tier for your Cloud9 environment usage. If you want to load test the solution from this environment, you may want to provision a larger instance type to increase the network bandwidth available to your environment:
- Review the Environment name and settings, then click the Create environment button to continue:
- Once your environment has started, you can open a Terminal to run the git clone command:
```
git clone https://github.com/aws-samples/realtime-web-analytics-workshop.git
```
- You're now ready to proceed with Module 1. Use the Cloud9 environment whenever you need to access any of the artifacts from the workshop Git repository or run the Python scripts.
Note: The Cloud9 Environment will automatically turn off after being idle for 30 minutes, so you might need to restart it by accessing it through the AWS console.
Note: You are responsible for the cost of AWS services used while running this workshop. Expand for details.
You are responsible for the cost of the AWS services used while running this reference deployment. As of the date of publication, the baseline cost for running this solution with default settings in the US East (N. Virginia) Region is approximately $100 per month. This cost estimate assumes the solution will record 1 million events per day with an average size of one kilobyte per event. Note that the monthly cost will vary depending on the number of events the solution processes. For 10 million events per day, the cost is approximately $170 per month. For 100 million events per day, the cost is approximately $950 per month. Prices are subject to change. For full details, see the pricing webpage for each AWS service you will be using in this solution.
Module 1 – Capturing Realtime Clickstream Events from Web Servers
In this module, you will start with an AutoScaling Group of Apache web servers, representing the front-end of your existing website or application. The AutoScaling Group receives incoming connections from an Application Load Balancer, and is configured to automatically scale out (and back in) based on the amount of incoming network traffic received by the web servers:
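The scale-out behavior described above can be sketched as a target-tracking scaling policy on average network traffic per instance. The snippet below builds the arguments such a policy would take; the policy name, target value, and the choice of the `ASGAverageNetworkIn` predefined metric are illustrative assumptions, not the workshop's exact configuration.

```python
def network_scaling_policy(asg_name, target_bytes_per_minute):
    """Build arguments for a target-tracking scaling policy that keeps the
    average NetworkIn per instance near the given target (illustrative values)."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "keep-network-in-near-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageNetworkIn",
            },
            "TargetValue": target_bytes_per_minute,
        },
    }

# In a real deployment these arguments would be passed to
# boto3.client("autoscaling").put_scaling_policy(**policy).
```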
You will then create an S3 analytics bucket that will store an archive of all the clickstream events for historical analysis, and create a Kinesis Firehose delivery stream that will deliver messages to the S3 analytics bucket. You'll add a Kinesis agent to the fleet of web servers and configure it to send messages that appear in the Apache access logs to the Kinesis Firehose delivery stream:
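To give a feel for what the Kinesis agent does when it converts Apache access log lines into structured records, here is a minimal sketch in Python that parses one combined-format log line into a JSON document. The field names are illustrative; the agent's actual log-to-JSON conversion defines its own schema.

```python
import json
import re

# Apache combined log format, e.g.:
# 203.0.113.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def access_log_to_json(line):
    """Convert one Apache access log line to a JSON string, or None on no match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache logs "-" when no bytes were sent
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return json.dumps(record)
```

Records like this, one JSON document per log line, are what the agent ships to the Firehose delivery stream for archival in the S3 analytics bucket.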
In this scenario, you will leverage Amazon S3, Amazon EC2 Linux Instances, AutoScaling, Amazon Kinesis Data Firehose, and CloudFormation to automate the initial deployment, as well as changes to the stack.
Module 2 – Performing Realtime Analytics with Amazon Kinesis Data Analytics
Amazon Kinesis Data Analytics makes it easy to process streaming data in real time with standard SQL. In this module you will create a Kinesis Data Analytics application and use SQL on streaming data to generate metrics in real time that provide insights into current activity on your web site. Those metrics will be normalized and emitted to a Lambda function, which delivers the data to a DynamoDB table.
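The delivery Lambda described above might look roughly like the following sketch. The record field names (`METRICTYPE`, `EVENTTIMESTAMP`, `UNITVALUEINT`) and the `METRICS_TABLE` environment variable are assumptions for illustration; the DynamoDB table object is injectable so the handler can be exercised without AWS access.

```python
import base64
import json
import os

def build_items(event):
    """Extract metric items from a Kinesis Data Analytics output event.

    The payload field names are illustrative; the real application
    defines its own output schema in its SQL.
    """
    items = []
    for record in event.get("records", []):
        payload = json.loads(base64.b64decode(record["data"]))
        items.append({
            "MetricType": payload["METRICTYPE"],
            "EventTimestamp": payload["EVENTTIMESTAMP"],
            "UnitValueInt": payload["UNITVALUEINT"],
        })
    return items

def handler(event, context, table=None):
    """Write each metric record to DynamoDB and acknowledge delivery."""
    if table is None:  # inside Lambda, fall back to the real table
        import boto3
        table = boto3.resource("dynamodb").Table(os.environ["METRICS_TABLE"])
    for item in build_items(event):
        table.put_item(Item=item)
    # Kinesis Data Analytics expects each record to be acknowledged
    return {"records": [{"recordId": r["recordId"], "result": "Ok"}
                        for r in event.get("records", [])]}
```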
In this scenario, you will leverage Amazon Kinesis Data Analytics, AWS Lambda, and Amazon DynamoDB.
Module 3 – Visualizing Metrics using CloudWatch Dashboards
In this module, which builds on our previous modules, you will start with realtime metric data that is being inserted into a DynamoDB table by our Kinesis Data Analytics application:
You'll learn how to capture the table activity with DynamoDB Streams. Once the stream has been created, you'll create a Lambda function that subscribes to the DynamoDB stream and processes the data change events, publishing them as CloudWatch Metrics using PutMetricData. Finally, after the CloudWatch Metrics are published, we'll visualize the data by creating a CloudWatch Dashboard.
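A sketch of that stream-processing Lambda is shown below: it turns INSERT/MODIFY stream records into `PutMetricData` entries and publishes them in batches of 20 (the API's per-call limit). The attribute names (`MetricType`, `EventTimestamp`, `UnitValueInt`) and the `WebAnalytics` namespace are assumptions carried over for illustration, and the CloudWatch client is injectable for testing.

```python
import datetime

def metrics_from_stream_event(event):
    """Turn DynamoDB stream INSERT/MODIFY records into CloudWatch MetricData.

    Assumes each item carries MetricType (string), EventTimestamp (epoch
    seconds), and UnitValueInt (number) attributes -- illustrative names.
    """
    metric_data = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # ignore deletions
        image = record["dynamodb"]["NewImage"]
        metric_data.append({
            "MetricName": image["MetricType"]["S"],
            "Timestamp": datetime.datetime.fromtimestamp(
                int(image["EventTimestamp"]["N"]), tz=datetime.timezone.utc),
            "Value": float(image["UnitValueInt"]["N"]),
            "Unit": "Count",
        })
    return metric_data

def handler(event, context, cloudwatch=None):
    """Publish the metrics with PutMetricData, max 20 entries per call."""
    if cloudwatch is None:  # inside Lambda, use the real client
        import boto3
        cloudwatch = boto3.client("cloudwatch")
    data = metrics_from_stream_event(event)
    for i in range(0, len(data), 20):
        cloudwatch.put_metric_data(Namespace="WebAnalytics",
                                   MetricData=data[i:i + 20])
```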
In this scenario, you will leverage Amazon DynamoDB Streams, AWS Lambda, and Amazon CloudWatch Metrics and Dashboards.
Module 4 – Adding Custom Metrics and Extending the Solution
In this module you will extend the solution to include a custom metric not already provided by default.
This sample code is made available under a modified MIT license. See the LICENSE file.