official release of v3.1.1 (#1215)
* bumping version to 3.1.1

* fixing bad links in docs, other touch ups (#1214)

* DRY out some code

* fix

* updating role paths used by scheduled queries (#1216)

Co-authored-by: Derek Wang <derek.wang@airbnb.com>
ryandeivert and Ryxias committed Mar 31, 2020
1 parent fdb7b95 commit 1714b1e
Showing 15 changed files with 118 additions and 70 deletions.
2 changes: 1 addition & 1 deletion docs/source/apps.rst
@@ -224,7 +224,7 @@ To update an App's credentials, run the following command:
python manage.py app update-auth --cluster <cluster> --name <app_name>
This will have you follow a process similar to `configuring a new App <app-configuration.html#example-prompts-for-duo-auth>`_.
This will have you follow a process similar to `configuring a new App <#configuring-an-app>`_.
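
For example, to re-enter credentials for a hypothetical Duo app named ``duo_admin`` running in the ``prod`` cluster (both names are illustrative, not defaults):

.. code-block:: bash

   python manage.py app update-auth --cluster prod --name duo_admin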


********************
6 changes: 3 additions & 3 deletions docs/source/config-clusters.rst
@@ -6,7 +6,7 @@ Inbound data is directed to one of StreamAlert's *clusters*, each with its own data sources
and classifier function. For many applications, one cluster may be enough. However, adding
additional clusters can potentially improve performance. For example, you could have:

* A cluster dedicated to `StreamAlert apps <app-configuration.html>`_
* A cluster dedicated to `StreamAlert apps <apps.html>`_
* A separate cluster for each of your inbound `Kinesis Data Streams <https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html>`_
* A separate cluster for data from each environment (prod, staging, corp, etc.)
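
As a sketch, a multi-cluster layout might look like this (cluster names are hypothetical; each file's contents follow the format shown in the Example section below):

.. code-block:: text

   conf/clusters/
   ├── prod.json   # primary inbound Kinesis data stream
   ├── apps.json   # dedicated to StreamAlert apps
   └── corp.json   # corp environment data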

@@ -53,7 +53,7 @@ from that source.
.. note::

Log schemas are defined in one or more files in the ``conf/schemas`` directory. See
the `Schemas <conf-schemas.html>`_ page for more information, or the
the `Schemas <config-schemas.html>`_ page for more information, or the
`Example Schemas <conf-schemas-examples.html>`_ page for some sample log definitions.
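
As an illustration, a minimal schema file might look like the following (a hedged sketch; the fields shown are a subset, and the full syntax is covered on the Schemas page):

.. code-block:: json

   {
     "cloudwatch:events": {
       "parser": "json",
       "schema": {
         "account": "string",
         "detail": {},
         "detail-type": "string",
         "source": "string"
       }
     }
   }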

Each log in the list of logs instructs StreamAlert's classifier function to attempt
@@ -97,7 +97,7 @@ Example
.. important::

Any data source log type that is listed must have an associated log definition
within your `schemas <conf-schemas.html>`_ definitions.
within your `schemas <config-schemas.html>`_ definitions.
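
For example, a cluster's ``data_sources`` block pairing an SNS topic with a ``cloudwatch`` log type might look like this (a sketch; the topic name is illustrative):

.. code-block:: json

   {
     "data_sources": {
       "sns": {
         "streamalert-test-data": [
           "cloudwatch"
         ]
       }
     }
   }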


Classifier Configuration
4 changes: 2 additions & 2 deletions docs/source/datasources.rst
@@ -15,7 +15,7 @@ The services above can accept data from:
* Amazon CloudWatch Events
* And more

To configure datasources, read `datasource configuration <conf-datasources.html>`_
To configure datasources for a cluster, read `datasource configuration <config-clusters.html#datasource-configuration>`_


*********
@@ -41,7 +41,7 @@ Example non-AWS use-cases:
Amazon Kinesis Data Streams
***************************
StreamAlert also utilizes Amazon Kinesis Data Streams for real-time data ingestion and analysis.
By default, StreamAlert creates an Amazon Kinesis Data Stream per `cluster <clusters.html>`_.
By default, StreamAlert creates an Amazon Kinesis Data Stream per `cluster <config-clusters.html>`_.


Sending to Amazon Kinesis Data Streams
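**************************************

As a brief, hedged illustration of the concept, records can be sent to a cluster's stream with ``boto3`` (the stream name below is a hypothetical placeholder, not an actual default; check your deployment for the real name):

.. code-block:: python

   import json

   import boto3

   kinesis = boto3.client('kinesis', region_name='us-east-1')

   # Send a single JSON record to the (hypothetical) cluster stream
   kinesis.put_record(
       StreamName='<your-cluster-stream>',
       Data=json.dumps({'key': 'value'}).encode(),
       PartitionKey='example-partition-key',
   )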
6 changes: 3 additions & 3 deletions docs/source/getting-started.rst
@@ -169,7 +169,7 @@ SNS for both sending the log data and receiving the alert, but StreamAlert also
.. note:: You will need to click the verification link in your email to activate the subscription.

4. Add the ``streamalert-test-data`` SNS topic as an input to the (default) ``prod`` `cluster <clusters.html>`_.
4. Add the ``streamalert-test-data`` SNS topic as an input to the (default) ``prod`` `cluster <config-clusters.html>`_.
Open ``conf/clusters/prod.json`` and change the ``streamalert`` module to look like this:

.. code-block:: json
@@ -189,7 +189,7 @@ Open ``conf/clusters/prod.json`` and change the ``streamalert`` module to look like this:
}
}
5. Tell StreamAlert which `log schemas <conf-schemas.html>`_ will be sent to this input.
5. Tell StreamAlert which `log schemas <config-schemas.html>`_ will be sent to this input.
Open ``conf/clusters/prod.json`` and change the ``data_sources`` section to look like this:

.. code-block:: json
@@ -284,7 +284,7 @@ dropdown on the left and preview the ``alerts`` table:
   :target: _images/athena-alerts-search.png

(Here, my name prefix is ``testv2``.) If no records are returned, look for errors
in the ``athena_partition_refresh`` function or try invoking it directly.
in the Athena Partition Refresh function or try invoking it directly.
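
As a hedged sketch, the function can be invoked manually with the AWS CLI (the function name below assumes a ``<prefix>_streamalert_athena_partition_refresh`` naming convention with the ``testv2`` prefix from above; substitute your own):

.. code-block:: bash

   aws lambda invoke \
       --function-name testv2_streamalert_athena_partition_refresh \
       --payload '{}' out.json && cat out.json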

And there you have it! Ingested log data is parsed, classified, and scanned by the rules engine.
Any resulting alerts are delivered to your configured output(s) within a matter of minutes.
54 changes: 39 additions & 15 deletions docs/source/historical-search.rst
@@ -1,16 +1,32 @@
#################
Historical Search
#################

StreamAlert historical search feature is backed by Amazon S3 and `Athena <https://aws.amazon.com/athena/>`_ services. By default, StreamAlert will send all alerts to S3 and those alerts will be searchable in Athena table. StreamAlert users have option to enable historical search feature for data as well.
StreamAlert's historical search feature is backed by Amazon S3 and `Athena <https://aws.amazon.com/athena/>`_ services.
By default, StreamAlert sends all alerts to S3, where they become searchable in an Athena table. StreamAlert
users also have the option to enable historical search for their data.
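
For example, once partitions are loaded, recent alerts can be queried from the Athena console with standard SQL (a hypothetical query; it assumes the default ``alerts`` table and a ``dt`` partition column with hour granularity):

.. code-block:: sql

   SELECT * FROM alerts
   WHERE dt >= '2020-03-31-00'
   LIMIT 10;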

As of StreamAlert v3.1.0, a new field, ``file_format``, has been added to ``athena_partition_refresh_config``
in ``conf/lambda.json``, defaulting to ``null``. This field allows users to configure how the data processed
by the Classifier is stored in the S3 bucket, either in ``parquet`` or ``json``.

As of StreamAlert v3.1.0, a new field, ``file_format``, has been added to ``athena_partition_refresh_config`` in ``conf/lamba.json``, defaulting to ``null``. This field allows users to configure how the data processed by the Classifier is stored in S3 bucket—either in ``parquet`` or ``json``. Prior to v3.1.0, all data was stored in ``json``. When using this format, Athena's search performance degrades greatly when partition sizes grow. To address this, we've introduce support for ``parquet`` to provide better Athena search performance and cost saving.
Prior to v3.1.0, all data was stored in ``json``. When using this format, Athena's search performance
degrades greatly as partition sizes grow. To address this, we've introduced support for ``parquet``
to provide better Athena search performance and cost savings.

.. note::

* When upgrading StreamAlert to v3.1.0, it is required to change the default ``file_format`` value to either ``parquet`` or ``json``, otherwise StreamAlert will raise ``MisconfigurationError`` exception when run ``python manage.py build``.
* For existing deployments, ``file_format`` can be set to ``json`` and there will have no change occurred. However, if the ``file_format`` is changed to ``parquet``, all Athena tables need to be created to load ``parquet`` format. The existing JSON data won't be searchable anymore unless we build a separated tables to process data in JSON format. (All data stay in S3 bucket, there is no data loss.).
* For new StreamAlert deployments, it is recommended to set ``file_format`` to ``parquet`` to take the advantage of better Athena search performance and save the cost when scanning data.
* In the future release, the default value of ``file_format`` will change to ``parquet``. So let's change now!
* When upgrading to StreamAlert v3.1.0, you must set the ``file_format`` value to either ``parquet``
or ``json``; otherwise, StreamAlert will raise a ``MisconfigurationError`` exception when running
``python manage.py build``.
* For existing deployments, the ``file_format`` value can be set to ``json`` to retain current
functionality. However, if the ``file_format`` is changed to ``parquet``, the Athena tables will
need to be recreated to load the ``parquet`` format. The existing JSON data will no longer be
searchable unless separate tables are built to process data in JSON format. All of the underlying
data remains stored in S3; there is no data loss.
* For new StreamAlert deployments, it is recommended to set ``file_format`` to ``parquet`` to
take advantage of better Athena search performance and cost savings when scanning data.
* In an upcoming release, the ``file_format`` value will default to ``parquet``, so let's change it now!
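
A minimal ``conf/lambda.json`` excerpt with the new field set might look like this (a sketch; other settings in this block are omitted):

.. code-block:: json

   {
     "athena_partition_refresh_config": {
       "file_format": "parquet",
       "log_level": "info"
     }
   }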

************
Architecture
@@ -19,21 +35,29 @@ Architecture
.. image:: ../images/historical-search.png
   :align: left

The pipeline is
* StreamAlert creates an Athena Database, alerts kinesis Firehose and ``alerts`` table during initial deployment
* Optional to create Firehose and Athena tables for data
* S3 events will be sent to SQS to invoke ``athena_partition_refresh`` lambda function to add new partitions when there are new alerts or data saved in S3 bucket via Firehose
* New alerts and data are available for searching via Athena console or SDK
The pipeline is:

#. StreamAlert creates an Athena database, an alerts Kinesis Firehose, and the ``alerts`` table during initial deployment
#. Optionally create Firehose resources and Athena tables for historical data retention
#. S3 events are sent to an SQS queue that is mapped to the Athena Partition Refresh Lambda function
#. The Lambda function adds new partitions when new alerts or data are saved to the S3 bucket via Firehose
#. Alerts, and optionally data, are available for searching via the Athena console or the Athena API

.. _alerts_search:

*************
Alerts Search
*************

* Review alert Firehose configuration, see :ref:`alerts_firehose_configuration` in ``CONFIGURATION`` session. Athena database and Athena alerts table are created automatically when you first deploy StreamAlert.
* If the ``file_format`` is set to ``parquet``, you can run ``MSCK REPAIR TABLE alerts`` command in the Athena to load all available partitions and then alerts can be searchable. However, using ``MSCK REPAIR`` command can not load new partitions automatically.
* StreamAlert provides a lambda function ``athena_partition_refresh`` to load new partitions to Athena tables once the data arrives in the S3 buckets automatically. Update ``athena_partition_refresh_config`` if necessary. Open ``conf/lambda.json``. See more settings :ref:`configure_athena_partition_refresh_lambda`
* Review the settings for the :ref:`Alerts Firehose Configuration <alerts_firehose_configuration>` and
the :ref:`Athena Partition Refresh<configure_athena_partition_refresh_lambda>` function. Note that
the Athena database and alerts table are created automatically when you first deploy StreamAlert.
* If the ``file_format`` value within the :ref:`Athena Partition Refresh<configure_athena_partition_refresh_lambda>`
function config is set to ``parquet``, you can run the ``MSCK REPAIR TABLE alerts`` command in
Athena to load all available partitions, after which alerts become searchable. Note, however, that the
``MSCK REPAIR`` command cannot load new partitions automatically.
* StreamAlert includes a Lambda function to automatically add new partitions for Athena tables when
the data arrives in S3. See :ref:`configure_athena_partition_refresh_lambda`.

.. code-block:: bash
@@ -45,7 +69,7 @@ Alerts Search
}
}
* Deploy athena_partition_refresh lambda function
* Deploy the Athena Partition Refresh Lambda function

.. code-block:: bash
12 changes: 7 additions & 5 deletions docs/source/rules.rst
@@ -51,7 +51,7 @@ The simplest possible rule looks like this:
return True
This rule will be evaluated against all inbound logs that match the ``cloudwatch:events`` schema defined in a schema file in the ``conf/schemas`` directory, i.e. ``conf/schemas/cloudwatch.json``.
In this case, *all* CloudWatch events will generate an alert, which will be sent to the `alerts Athena table <historical-search.html#athena-user-guide>`_.
In this case, *all* CloudWatch events will generate an alert, which will be sent to the `alerts Athena table <historical-search.html#alerts-search>`_.
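
Written out in full, such a rule might look like the following (a hedged sketch; the import path matches StreamAlert v3's ``@rule`` decorator, and the rule name is arbitrary):

.. code-block:: python

   from streamalert.shared.rule import rule

   @rule(logs=['cloudwatch:events'])
   def all_cloudwatch_events(record):  # pylint: disable=unused-argument
       """Alert on every CloudWatch event (illustrative only)"""
       return True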


Example: Logic & Outputs
Expand All @@ -70,7 +70,8 @@ Let's modify the rule to page the security team if anyone ever uses AWS root cre
and record['detail']['eventType'] != 'AwsServiceEvent')
Now, any AWS root account usage is reported to PagerDuty, Slack, and the aforementioned Athena table.
In order for this to work, your `datasources <conf-datasources.html>`_ and `outputs <outputs.html>`_ must be configured so that:
In order for this to work, your `datasources <config-clusters.html#datasource-configuration>`_ and
`outputs <outputs.html>`_ must be configured so that:

* CloudTrail logs are being sent to StreamAlert via CloudWatch events
* The ``pagerduty:csirt`` and ``slack:security`` outputs have the proper credentials
@@ -187,8 +188,9 @@ The following table provides an overview of each rule option, with more details

``logs`` define the log schema(s) supported by the rule.

Log `sources <conf-datasources.html>`_ are defined under the ``data_sources`` field for a cluster defined in ``conf/clusters/<cluster>.json``
and their `schemas <conf-schemas.html>`_ are defined in one or more files in the ``conf/schemas`` directory.
Log `datasources <config-clusters.html#datasource-configuration>`_ are defined within the
``data_sources`` field of a cluster configuration file (e.g. ``conf/clusters/<cluster>.json``) and their
`schemas <config-schemas.html>`_ are defined in one or more files in the ``conf/schemas`` directory.

.. note::

@@ -254,7 +256,7 @@

.. note::

The original (unmerged) alert will always be sent to `Athena <historical-search.html#athena-user-guide>`_.
The original (unmerged) alert will always be sent to `Athena <historical-search.html#alerts-search>`_.

:dynamic_outputs:

2 changes: 1 addition & 1 deletion streamalert/__init__.py
@@ -1,2 +1,2 @@
"""StreamAlert version."""
__version__ = '3.1.0'
__version__ = '3.1.1'
23 changes: 2 additions & 21 deletions streamalert/athena_partition_refresh/main.py
@@ -23,7 +23,7 @@

from streamalert.shared.utils import get_database_name, get_data_file_format
from streamalert.shared.athena import AthenaClient
from streamalert.shared.config import firehose_alerts_bucket, firehose_data_bucket, load_config
from streamalert.shared.config import athena_partition_buckets, load_config
from streamalert.shared.exceptions import ConfigError
from streamalert.shared.logger import get_logger

@@ -83,7 +83,7 @@ def __init__(self):
)
raise ConfigError(message)

self._athena_buckets = self.buckets_from_config(config)
self._athena_buckets = athena_partition_buckets(config)

db_name = get_database_name(config)

@@ -97,25 +97,6 @@

self._create_client(db_name, results_bucket)

    @classmethod
    def buckets_from_config(cls, config):
        """Get the buckets from default buckets and additionally configured ones

        Args:
            config (dict): The loaded config from the 'conf/' directory

        Returns:
            list: Bucket names for which Athena is enabled
        """
        athena_config = config['lambda']['athena_partition_refresh_config']
        data_buckets = athena_config.get('buckets', {})
        data_buckets[firehose_alerts_bucket(config)] = 'alerts'
        data_bucket = firehose_data_bucket(config)  # Data retention is optional, so check for this
        if data_bucket:
            data_buckets[data_bucket] = 'data'

        return data_buckets

    @classmethod
    def _create_client(cls, db_name, results_bucket):
        if cls._ATHENA_CLIENT:
34 changes: 34 additions & 0 deletions streamalert/shared/config.py
@@ -115,6 +115,40 @@ def firehose_alerts_bucket(config):
)


def athena_partition_buckets(config):
    """Get the buckets from default buckets and additionally configured ones

    Args:
        config (dict): The loaded config from the 'conf/' directory

    Returns:
        dict: Bucket names for which Athena is enabled, mapped to the type
            of data they hold ('alerts' or 'data')
    """
    athena_config = config['lambda']['athena_partition_refresh_config']
    data_buckets = athena_config.get('buckets', {})
    data_buckets[firehose_alerts_bucket(config)] = 'alerts'
    data_bucket = firehose_data_bucket(config)  # Data retention is optional, so check for this
    if data_bucket:
        data_buckets[data_bucket] = 'data'

    return data_buckets


def athena_query_results_bucket(config):
    """Get the S3 bucket where Athena queries store results to.

    Args:
        config (dict): The loaded config

    Returns:
        str: The name of the S3 bucket.
    """
    athena_config = config['lambda']['athena_partition_refresh_config']
    prefix = config['global']['account']['prefix']

    return athena_config.get(
        'results_bucket',
        '{}.streamalert.athena-results'.format(prefix)
    ).strip()
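
# A hedged usage sketch (not part of this commit) showing how the helpers
# above combine; the bucket names in the comments are illustrative
# placeholders, not guaranteed defaults:
#
#   from streamalert.shared.config import load_config
#   config = load_config()
#   athena_partition_buckets(config)     # {'<alerts-bucket>': 'alerts', '<data-bucket>': 'data'}
#   athena_query_results_bucket(config)  # '<prefix>.streamalert.athena-results'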


def parse_lambda_arn(function_arn):
"""Extract info on the current environment from the lambda function ARN
@@ -7,6 +7,10 @@ resource "aws_cloudwatch_event_rule" "event" {
  name                = "${var.prefix}_streamalert_scheduled_queries_event_${count.index}"
  description         = var.query_packs[count.index].description
  schedule_expression = var.query_packs[count.index].schedule_expression

  tags = {
    Name = "StreamAlert"
  }
}

resource "aws_cloudwatch_event_target" "run_step_function" {
@@ -15,7 +15,7 @@

# Attach additional permissions to the auto-generated Lambda IAM Role
resource "aws_iam_role_policy" "lambda_permissions" {
  name   = "${var.prefix}_streamalert_scheduled_queries_lambda_permissions"
  name   = "LambdaRequiredPermissions"
  role   = module.scheduled_queries_lambda.role_id
  policy = data.aws_iam_policy_document.lambda_permissions.json
}
@@ -97,7 +97,12 @@ data "aws_iam_policy_document" "lambda_permissions" {
# Setup the IAM Role for the Step Functions
resource "aws_iam_role" "iam_for_step_functions" {
  name               = "${var.prefix}_streamalert_scheduled_queries_state_machines"
  path               = "/streamalert/"
  assume_role_policy = data.aws_iam_policy_document.iam_step_function_assume_role.json

  tags = {
    Name = "StreamAlert"
  }
}

# Only allow Step Functions to assume this role
@@ -116,7 +121,7 @@

# Attach an additional policy to the IAM Role
resource "aws_iam_role_policy" "stepfunction_permissions" {
  name   = "${var.prefix}_streamalert_scheduled_queries_state_machine_permissions"
  name   = "StepFunctionsInvokeLambda"
  role   = aws_iam_role.iam_for_step_functions.id
  policy = data.aws_iam_policy_document.stepfunction_permissions.json
}
@@ -143,7 +148,12 @@ data "aws_iam_policy_document" "stepfunction_permissions" {
# Setup the IAM Role
resource "aws_iam_role" "iam_for_cloudwatch_schedule" {
  name               = "${var.prefix}_streamalert_scheduled_queries_cloudwatch_schedule"
  path               = "/streamalert/"
  assume_role_policy = data.aws_iam_policy_document.iam_cloudwatch_assume_role.json

  tags = {
    Name = "StreamAlert"
  }
}

# Only allow cloudwatch to assume this role
@@ -162,7 +172,7 @@

# Attach additional permissions to the IAM Role
resource "aws_iam_role_policy" "cloudwatch_schedule_permissions" {
  name   = "${var.prefix}_streamalert_scheduled_queries_cloudwatch_schedule_permissions"
  name   = "StepFunctionsStartViaCWE"
  role   = aws_iam_role.iam_for_cloudwatch_schedule.id
  policy = data.aws_iam_policy_document.cloudwatch_schedule_permission.json
}
@@ -64,4 +64,8 @@ resource "aws_sfn_state_machine" "state_machine" {
}
EOF

  tags = {
    Name = "StreamAlert"
  }

}

This file was deleted.

4 changes: 2 additions & 2 deletions streamalert_cli/terraform/athena.py
@@ -13,8 +13,8 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
from streamalert.athena_partition_refresh.main import AthenaRefresher
from streamalert.shared import metrics
from streamalert.shared.config import athena_partition_buckets
from streamalert_cli.manage_lambda.package import AthenaPackage
from streamalert_cli.terraform.common import (
infinitedict,
@@ -35,7 +35,7 @@ def generate_athena(config):
    athena_dict = infinitedict()
    athena_config = config['lambda']['athena_partition_refresh_config']

    data_buckets = sorted(AthenaRefresher.buckets_from_config(config))
    data_buckets = sorted(athena_partition_buckets(config))

    prefix = config['global']['account']['prefix']
    database = athena_config.get('database_name', '{}_streamalert'.format(prefix))