[docs] fixing links in docs, other touch ups #1214

Merged: 1 commit, Mar 31, 2020
2 changes: 1 addition & 1 deletion docs/source/apps.rst
@@ -224,7 +224,7 @@ To update an App's credentials, run the following command:
python manage.py app update-auth --cluster <cluster> --name <app_name>


This will have you follow a process similar to `configuring a new App <app-configuration.html#example-prompts-for-duo-auth>`_.
This will have you follow a process similar to `configuring a new App <#configuring-an-app>`_.


********************
6 changes: 3 additions & 3 deletions docs/source/config-clusters.rst
@@ -6,7 +6,7 @@ Inbound data is directed to one of StreamAlert's *clusters*, each with its own data sources
and classifier function. For many applications, one cluster may be enough. However, adding
additional clusters can potentially improve performance. For example, you could have:

* A cluster dedicated to `StreamAlert apps <app-configuration.html>`_
* A cluster dedicated to `StreamAlert apps <apps.html>`_
* A separate cluster for each of your inbound `Kinesis Data Streams <https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html>`_
* A separate cluster for data from each environment (prod, staging, corp, etc)

@@ -53,7 +53,7 @@ from that source.
.. note::

Log schemas are defined in one or more files in the ``conf/schemas`` directory. See
the `Schemas <conf-schemas.html>`_ page for more information, or the
the `Schemas <config-schemas.html>`_ page for more information, or the
`Example Schemas <conf-schemas-examples.html>`_ page for some sample log definitions.

Each log in the list of logs instructs StreamAlert's classifier function to attempt
@@ -97,7 +97,7 @@ Example
.. important::

Any data source log type that is listed must have an associated log definition
within your `schemas <conf-schemas.html>`_ definitions.
within your `schemas <config-schemas.html>`_ definitions.
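For illustration, here is a minimal sketch of what such a ``data_sources`` block could look like; the ``sns`` source, topic name, and ``cloudwatch`` log type below are placeholders rather than a prescribed configuration:

.. code-block:: json

  {
    "data_sources": {
      "sns": {
        "streamalert-test-data": [
          "cloudwatch"
        ]
      }
    }
  }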


Classifier Configuration
4 changes: 2 additions & 2 deletions docs/source/datasources.rst
@@ -15,7 +15,7 @@ These services above can accept data from:
* Amazon CloudWatch Events
* And more

To configure datasources, read `datasource configuration <conf-datasources.html>`_
To configure datasources for a cluster, read `datasource configuration <config-clusters.html#datasource-configuration>`_


*********
@@ -41,7 +41,7 @@ Example non-AWS use-cases:
Amazon Kinesis Data Streams
***************************
StreamAlert also utilizes Amazon Kinesis Data Streams for real-time data ingestion and analysis.
By default, StreamAlert creates an Amazon Kinesis Data Stream per `cluster <clusters.html>`_.
By default, StreamAlert creates an Amazon Kinesis Data Stream per `cluster <config-clusters.html>`_.


Sending to Amazon Kinesis Data Streams
6 changes: 3 additions & 3 deletions docs/source/getting-started.rst
@@ -169,7 +169,7 @@ SNS for both sending the log data and receiving the alert, but StreamAlert also

.. note:: You will need to click the verification link in your email to activate the subscription.

4. Add the ``streamalert-test-data`` SNS topic as an input to the (default) ``prod`` `cluster <clusters.html>`_.
4. Add the ``streamalert-test-data`` SNS topic as an input to the (default) ``prod`` `cluster <config-clusters.html>`_.
Open ``conf/clusters/prod.json`` and change the ``streamalert`` module to look like this:

.. code-block:: json
@@ -189,7 +189,7 @@ Open ``conf/clusters/prod.json`` and change the ``streamalert`` module to look like this:
}
}

5. Tell StreamAlert which `log schemas <conf-schemas.html>`_ will be sent to this input.
5. Tell StreamAlert which `log schemas <config-schemas.html>`_ will be sent to this input.
Open ``conf/clusters/prod.json`` and change the ``data_sources`` section to look like this:

.. code-block:: json
@@ -284,7 +284,7 @@ dropdown on the left and preview the ``alerts`` table:
:target: _images/athena-alerts-search.png

(Here, my name prefix is ``testv2``.) If no records are returned, look for errors
in the ``athena_partition_refresh`` function or try invoking it directly.
in the Athena Partition Refresh function or try invoking it directly.

And there you have it! Ingested log data is parsed, classified, and scanned by the rules engine.
Any resulting alerts are delivered to your configured output(s) within a matter of minutes.
54 changes: 39 additions & 15 deletions docs/source/historical-search.rst
@@ -1,16 +1,32 @@
#################
Historical Search
#################

StreamAlert historical search feature is backed by Amazon S3 and `Athena <https://aws.amazon.com/athena/>`_ services. By default, StreamAlert will send all alerts to S3 and those alerts will be searchable in Athena table. StreamAlert users have option to enable historical search feature for data as well.
The StreamAlert historical search feature is backed by Amazon S3 and `Athena <https://aws.amazon.com/athena/>`_ services.
By default, StreamAlert sends all alerts to S3, where they are searchable in an Athena table. StreamAlert
users also have the option to enable historical search for data.

As of StreamAlert v3.1.0, a new field, ``file_format``, has been added to ``athena_partition_refresh_config``
in ``conf/lambda.json``, defaulting to ``null``. This field allows users to configure how the data processed
by the Classifier is stored in the S3 bucket, either in ``parquet`` or ``json``.

As of StreamAlert v3.1.0, a new field, ``file_format``, has been added to ``athena_partition_refresh_config`` in ``conf/lamba.json``, defaulting to ``null``. This field allows users to configure how the data processed by the Classifier is stored in S3 bucket—either in ``parquet`` or ``json``. Prior to v3.1.0, all data was stored in ``json``. When using this format, Athena's search performance degrades greatly when partition sizes grow. To address this, we've introduce support for ``parquet`` to provide better Athena search performance and cost saving.
Prior to v3.1.0, all data was stored in ``json``. When using this format, Athena's search performance
degrades greatly as partition sizes grow. To address this, we've introduced support for ``parquet``
to provide better Athena search performance and cost savings.

.. note::

* When upgrading StreamAlert to v3.1.0, it is required to change the default ``file_format`` value to either ``parquet`` or ``json``, otherwise StreamAlert will raise ``MisconfigurationError`` exception when run ``python manage.py build``.
* For existing deployments, ``file_format`` can be set to ``json`` and there will have no change occurred. However, if the ``file_format`` is changed to ``parquet``, all Athena tables need to be created to load ``parquet`` format. The existing JSON data won't be searchable anymore unless we build a separated tables to process data in JSON format. (All data stay in S3 bucket, there is no data loss.).
* For new StreamAlert deployments, it is recommended to set ``file_format`` to ``parquet`` to take the advantage of better Athena search performance and save the cost when scanning data.
* In the future release, the default value of ``file_format`` will change to ``parquet``. So let's change now!
* When upgrading to StreamAlert v3.1.0, you must set the ``file_format`` value to either ``parquet``
or ``json``; otherwise, StreamAlert will raise a ``MisconfigurationError`` exception when running
``python manage.py build``.
* For existing deployments, the ``file_format`` value can be set to ``json`` to retain current
functionality. However, if the ``file_format`` is changed to ``parquet``, the Athena tables will
need to be recreated to load the ``parquet`` format. The existing JSON data won't be searchable
anymore unless separate tables are built to process data in JSON format. All of the underlying
data remains stored in the S3 bucket, so there is no data loss.
* For new StreamAlert deployments, it is recommended to set ``file_format`` to ``parquet`` to
take advantage of better Athena search performance and cost savings when scanning data.
* In an upcoming release, the value for ``file_format`` will default to ``parquet``, so make the change now!
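For reference, a minimal sketch of how this could look in ``conf/lambda.json``; the keys other than ``file_format`` are illustrative and may differ by version:

.. code-block:: json

  {
    "athena_partition_refresh_config": {
      "concurrency_limit": 10,
      "file_format": "parquet",
      "log_level": "info"
    }
  }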

************
Architecture
@@ -19,21 +35,29 @@ Architecture
.. image:: ../images/historical-search.png
:align: left

The pipeline is
* StreamAlert creates an Athena Database, alerts kinesis Firehose and ``alerts`` table during initial deployment
* Optional to create Firehose and Athena tables for data
* S3 events will be sent to SQS to invoke ``athena_partition_refresh`` lambda function to add new partitions when there are new alerts or data saved in S3 bucket via Firehose
* New alerts and data are available for searching via Athena console or SDK
The pipeline is:

#. StreamAlert creates an Athena database, an alerts Kinesis Firehose, and an ``alerts`` table during the initial deployment
#. Optionally, Firehose resources and Athena tables can be created for historical data retention
#. S3 events are sent to an SQS queue that is mapped to the Athena Partition Refresh Lambda function
#. The Lambda function adds new partitions when new alerts or data are saved to the S3 bucket via Firehose
#. Alerts, and optionally data, are then available for searching via the Athena console or the Athena API

.. _alerts_search:

*************
Alerts Search
*************

* Review alert Firehose configuration, see :ref:`alerts_firehose_configuration` in ``CONFIGURATION`` session. Athena database and Athena alerts table are created automatically when you first deploy StreamAlert.
* If the ``file_format`` is set to ``parquet``, you can run ``MSCK REPAIR TABLE alerts`` command in the Athena to load all available partitions and then alerts can be searchable. However, using ``MSCK REPAIR`` command can not load new partitions automatically.
* StreamAlert provides a lambda function ``athena_partition_refresh`` to load new partitions to Athena tables once the data arrives in the S3 buckets automatically. Update ``athena_partition_refresh_config`` if necessary. Open ``conf/lambda.json``. See more settings :ref:`configure_athena_partition_refresh_lambda`
* Review the settings for the :ref:`Alerts Firehose Configuration <alerts_firehose_configuration>` and
the :ref:`Athena Partition Refresh <configure_athena_partition_refresh_lambda>` function. Note that
the Athena database and alerts table are created automatically when you first deploy StreamAlert.
* If the ``file_format`` value within the :ref:`Athena Partition Refresh <configure_athena_partition_refresh_lambda>`
function config is set to ``parquet``, you can run the ``MSCK REPAIR TABLE alerts`` command in
Athena to load all available partitions, after which alerts become searchable. Note, however, that the
``MSCK REPAIR`` command cannot load new partitions automatically.
* StreamAlert includes a Lambda function to automatically add new partitions for Athena tables when
the data arrives in S3. See :ref:`configure_athena_partition_refresh_lambda`.

.. code-block:: bash

@@ -45,7 +69,7 @@ Alerts Search
}
}

* Deploy athena_partition_refresh lambda function
* Deploy the Athena Partition Refresh Lambda function

.. code-block:: bash

12 changes: 7 additions & 5 deletions docs/source/rules.rst
@@ -51,7 +51,7 @@ The simplest possible rule looks like this:
return True

This rule will be evaluated against all inbound logs that match the ``cloudwatch:events`` schema defined in a schema file in the ``conf/schemas`` directory, i.e. ``conf/schemas/cloudwatch.json``.
In this case, *all* CloudWatch events will generate an alert, which will be sent to the `alerts Athena table <historical-search.html#athena-user-guide>`_.
In this case, *all* CloudWatch events will generate an alert, which will be sent to the `alerts Athena table <historical-search.html#alerts-search>`_.
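Pieced together from the fragment above, the complete rule might look like the following sketch; the import path and function name are assumptions, not taken verbatim from this diff:

.. code-block:: python

  from streamalert.shared.rule import rule  # assumed import path for StreamAlert v3

  @rule(logs=['cloudwatch:events'])
  def all_cloudwatch_events(record):  # hypothetical rule name
      """Alert on every inbound CloudWatch event."""
      return True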


Example: Logic & Outputs
@@ -70,7 +70,8 @@ Let's modify the rule to page the security team if anyone ever uses AWS root credentials
and record['detail']['eventType'] != 'AwsServiceEvent')

Now, any AWS root account usage is reported to PagerDuty, Slack, and the aforementioned Athena table.
In order for this to work, your `datasources <conf-datasources.html>`_ and `outputs <outputs.html>`_ must be configured so that:
In order for this to work, your `datasources <config-clusters.html#datasource-configuration>`_ and
`outputs <outputs.html>`_ must be configured so that:

* CloudTrail logs are being sent to StreamAlert via CloudWatch events
* The ``pagerduty:csirt`` and ``slack:security`` outputs have the proper credentials
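As a sketch, the complete modified rule might look like the following; the decorator arguments and rule name are assumptions pieced together from the fragments shown in this diff:

.. code-block:: python

  from streamalert.shared.rule import rule  # assumed import path for StreamAlert v3

  @rule(
      logs=['cloudwatch:events'],
      outputs=['pagerduty:csirt', 'slack:security']
  )
  def cloudtrail_root_account_usage(record):  # hypothetical rule name
      """Page the security team on any use of AWS root credentials."""
      return (record['detail']['userIdentity']['type'] == 'Root'
              and record['detail']['userIdentity'].get('invokedBy') is None
              and record['detail']['eventType'] != 'AwsServiceEvent')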
@@ -187,8 +188,9 @@ The following table provides an overview of each rule option, with more details

``logs`` define the log schema(s) supported by the rule.

Log `sources <conf-datasources.html>`_ are defined under the ``data_sources`` field for a cluster defined in ``conf/clusters/<cluster>.json``
and their `schemas <conf-schemas.html>`_ are defined in one or more files in the ``conf/schemas`` directory.
Log `datasources <config-clusters.html#datasource-configuration>`_ are defined within the
``data_sources`` field of a cluster configuration file, such as ``conf/clusters/<cluster>.json``, and their
`schemas <config-schemas.html>`_ are defined in one or more files in the ``conf/schemas`` directory.

.. note::

@@ -254,7 +256,7 @@

.. note::

The original (unmerged) alert will always be sent to `Athena <historical-search.html#athena-user-guide>`_.
The original (unmerged) alert will always be sent to `Athena <historical-search.html#alerts-search>`_.

:dynamic_outputs:
