[docs] fixing links in docs, other touch ups #1214

Merged: 1 commit, Mar 31, 2020
2 changes: 1 addition & 1 deletion docs/source/apps.rst
@@ -224,7 +224,7 @@ To update an App's credentials, run the following command:
python manage.py app update-auth --cluster <cluster> --name <app_name>


This will have you follow a process similar to `configuring a new App <app-configuration.html#example-prompts-for-duo-auth>`_.
This will have you follow a process similar to `configuring a new App <#configuring-an-app>`_.


********************
6 changes: 3 additions & 3 deletions docs/source/config-clusters.rst
@@ -6,7 +6,7 @@ Inbound data is directed to one of StreamAlert's *clusters*, each with its own data sources
and classifier function. For many applications, one cluster may be enough. However, adding
additional clusters can potentially improve performance. For example, you could have:

* A cluster dedicated to `StreamAlert apps <app-configuration.html>`_
* A cluster dedicated to `StreamAlert apps <apps.html>`_
* A separate cluster for each of your inbound `Kinesis Data Streams <https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html>`_
* A separate cluster for data from each environment (prod, staging, corp, etc)

@@ -53,7 +53,7 @@ from that source.
.. note::

Log schemas are defined in one or more files in the ``conf/schemas`` directory. See
the `Schemas <conf-schemas.html>`_ page for more information, or the
the `Schemas <config-schemas.html>`_ page for more information, or the
`Example Schemas <conf-schemas-examples.html>`_ page for some sample log definitions.

Each log in the list of logs instructs StreamAlert's classifier function to attempt
@@ -97,7 +97,7 @@ Example
.. important::

Any data source log type that is listed must have an associated log definition
within your `schemas <conf-schemas.html>`_ definitions.
within your `schemas <config-schemas.html>`_ definitions.
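For illustration, here is a minimal sketch of what such a ``data_sources`` block could look like; the ``sns`` source, topic name, and ``cloudwatch`` log type below are placeholders rather than a prescribed configuration:

.. code-block:: json

  {
    "data_sources": {
      "sns": {
        "streamalert-test-data": [
          "cloudwatch"
        ]
      }
    }
  }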


Classifier Configuration
4 changes: 2 additions & 2 deletions docs/source/datasources.rst
@@ -15,7 +15,7 @@ These services above can accept data from:
* Amazon CloudWatch Events
* And more

To configure datasources, read `datasource configuration <conf-datasources.html>`_
To configure datasources for a cluster, read `datasource configuration <config-clusters.html#datasource-configuration>`_


*********
@@ -41,7 +41,7 @@ Example non-AWS use-cases:
Amazon Kinesis Data Streams
***************************
StreamAlert also utilizes Amazon Kinesis Data Streams for real-time data ingestion and analysis.
By default, StreamAlert creates an Amazon Kinesis Data Stream per `cluster <clusters.html>`_.
By default, StreamAlert creates an Amazon Kinesis Data Stream per `cluster <config-clusters.html>`_.


Sending to Amazon Kinesis Data Streams
6 changes: 3 additions & 3 deletions docs/source/getting-started.rst
@@ -169,7 +169,7 @@ SNS for both sending the log data and receiving the alert, but StreamAlert also

.. note:: You will need to click the verification link in your email to activate the subscription.

4. Add the ``streamalert-test-data`` SNS topic as an input to the (default) ``prod`` `cluster <clusters.html>`_.
4. Add the ``streamalert-test-data`` SNS topic as an input to the (default) ``prod`` `cluster <config-clusters.html>`_.
Open ``conf/clusters/prod.json`` and change the ``streamalert`` module to look like this:

.. code-block:: json
@@ -189,7 +189,7 @@ Open ``conf/clusters/prod.json`` and change the ``streamalert`` module to look like this:
}
}

5. Tell StreamAlert which `log schemas <conf-schemas.html>`_ will be sent to this input.
5. Tell StreamAlert which `log schemas <config-schemas.html>`_ will be sent to this input.
Open ``conf/clusters/prod.json`` and change the ``data_sources`` section to look like this:

.. code-block:: json
@@ -284,7 +284,7 @@ dropdown on the left and preview the ``alerts`` table:
:target: _images/athena-alerts-search.png

(Here, my name prefix is ``testv2``.) If no records are returned, look for errors
in the ``athena_partition_refresh`` function or try invoking it directly.
in the Athena Partition Refresh function or try invoking it directly.

And there you have it! Ingested log data is parsed, classified, and scanned by the rules engine.
Any resulting alerts are delivered to your configured output(s) within a matter of minutes.
54 changes: 39 additions & 15 deletions docs/source/historical-search.rst
@@ -1,16 +1,32 @@
#################
Historical Search
#################

StreamAlert historical search feature is backed by Amazon S3 and `Athena <https://aws.amazon.com/athena/>`_ services. By default, StreamAlert will send all alerts to S3 and those alerts will be searchable in Athena table. StreamAlert users have option to enable historical search feature for data as well.
The StreamAlert historical search feature is backed by Amazon S3 and `Athena <https://aws.amazon.com/athena/>`_ services.
By default, StreamAlert sends all alerts to S3, where they are searchable in an Athena table. StreamAlert
users also have the option to enable historical search for data.

As of StreamAlert v3.1.0, a new field, ``file_format``, has been added to ``athena_partition_refresh_config``
in ``conf/lambda.json``, defaulting to ``null``. This field allows users to configure how the data processed
by the Classifier is stored in the S3 bucket, either in ``parquet`` or ``json``.

As of StreamAlert v3.1.0, a new field, ``file_format``, has been added to ``athena_partition_refresh_config`` in ``conf/lamba.json``, defaulting to ``null``. This field allows users to configure how the data processed by the Classifier is stored in S3 bucket—either in ``parquet`` or ``json``. Prior to v3.1.0, all data was stored in ``json``. When using this format, Athena's search performance degrades greatly when partition sizes grow. To address this, we've introduce support for ``parquet`` to provide better Athena search performance and cost saving.
Prior to v3.1.0, all data was stored in ``json``. When using this format, Athena's search performance
degrades greatly as partition sizes grow. To address this, we've introduced support for ``parquet``
to provide better Athena search performance and cost savings.

.. note::

* When upgrading StreamAlert to v3.1.0, it is required to change the default ``file_format`` value to either ``parquet`` or ``json``, otherwise StreamAlert will raise ``MisconfigurationError`` exception when run ``python manage.py build``.
* For existing deployments, ``file_format`` can be set to ``json`` and there will have no change occurred. However, if the ``file_format`` is changed to ``parquet``, all Athena tables need to be created to load ``parquet`` format. The existing JSON data won't be searchable anymore unless we build a separated tables to process data in JSON format. (All data stay in S3 bucket, there is no data loss.).
* For new StreamAlert deployments, it is recommended to set ``file_format`` to ``parquet`` to take the advantage of better Athena search performance and save the cost when scanning data.
* In the future release, the default value of ``file_format`` will change to ``parquet``. So let's change now!
* When upgrading to StreamAlert v3.1.0, you must set the ``file_format`` value to either ``parquet``
or ``json``; otherwise, StreamAlert will raise a ``MisconfigurationError`` exception when running
``python manage.py build``.
* For existing deployments, the ``file_format`` value can be set to ``json`` to retain current
functionality. However, if the ``file_format`` is changed to ``parquet``, the Athena tables will
need to be recreated to load the ``parquet`` format. The existing JSON data won't be searchable
anymore unless separate tables are built to process data in JSON format. All of the underlying
data remains stored in the S3 bucket, so there is no data loss.
* For new StreamAlert deployments, it is recommended to set ``file_format`` to ``parquet`` to
take advantage of better Athena search performance and cost savings when scanning data.
* In an upcoming release, the value for ``file_format`` will default to ``parquet``, so make the change now!
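For reference, a minimal sketch of how this could look in ``conf/lambda.json``; the keys other than ``file_format`` are illustrative and may differ by version:

.. code-block:: json

  {
    "athena_partition_refresh_config": {
      "concurrency_limit": 10,
      "file_format": "parquet",
      "log_level": "info"
    }
  }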

************
Architecture
@@ -19,21 +35,29 @@ Architecture
.. image:: ../images/historical-search.png
:align: left

The pipeline is
* StreamAlert creates an Athena Database, alerts kinesis Firehose and ``alerts`` table during initial deployment
* Optional to create Firehose and Athena tables for data
* S3 events will be sent to SQS to invoke ``athena_partition_refresh`` lambda function to add new partitions when there are new alerts or data saved in S3 bucket via Firehose
* New alerts and data are available for searching via Athena console or SDK
The pipeline is:

#. StreamAlert creates an Athena database, an alerts Kinesis Firehose, and an ``alerts`` table during the initial deployment
#. Optionally, Firehose resources and Athena tables can be created for historical data retention
#. S3 events are sent to an SQS queue that is mapped to the Athena Partition Refresh Lambda function
#. The Lambda function adds new partitions when new alerts or data are saved to the S3 bucket via Firehose
#. Alerts, and optionally data, are then available for searching via the Athena console or the Athena API

.. _alerts_search:

*************
Alerts Search
*************

* Review alert Firehose configuration, see :ref:`alerts_firehose_configuration` in ``CONFIGURATION`` session. Athena database and Athena alerts table are created automatically when you first deploy StreamAlert.
* If the ``file_format`` is set to ``parquet``, you can run ``MSCK REPAIR TABLE alerts`` command in the Athena to load all available partitions and then alerts can be searchable. However, using ``MSCK REPAIR`` command can not load new partitions automatically.
* StreamAlert provides a lambda function ``athena_partition_refresh`` to load new partitions to Athena tables once the data arrives in the S3 buckets automatically. Update ``athena_partition_refresh_config`` if necessary. Open ``conf/lambda.json``. See more settings :ref:`configure_athena_partition_refresh_lambda`
* Review the settings for the :ref:`Alerts Firehose Configuration <alerts_firehose_configuration>` and
the :ref:`Athena Partition Refresh <configure_athena_partition_refresh_lambda>` function. Note that
the Athena database and alerts table are created automatically when you first deploy StreamAlert.
* If the ``file_format`` value within the :ref:`Athena Partition Refresh <configure_athena_partition_refresh_lambda>`
function config is set to ``parquet``, you can run the ``MSCK REPAIR TABLE alerts`` command in
Athena to load all available partitions, after which alerts become searchable. Note, however, that the
``MSCK REPAIR`` command cannot load new partitions automatically.
* StreamAlert includes a Lambda function to automatically add new partitions for Athena tables when
the data arrives in S3. See :ref:`configure_athena_partition_refresh_lambda`.

.. code-block:: bash

@@ -45,7 +69,7 @@ Alerts Search
}
}

* Deploy athena_partition_refresh lambda function
* Deploy the Athena Partition Refresh Lambda function

.. code-block:: bash

12 changes: 7 additions & 5 deletions docs/source/rules.rst
@@ -51,7 +51,7 @@ The simplest possible rule looks like this:
return True

This rule will be evaluated against all inbound logs that match the ``cloudwatch:events`` schema defined in a schema file in the ``conf/schemas`` directory, i.e. ``conf/schemas/cloudwatch.json``.
In this case, *all* CloudWatch events will generate an alert, which will be sent to the `alerts Athena table <historical-search.html#athena-user-guide>`_.
In this case, *all* CloudWatch events will generate an alert, which will be sent to the `alerts Athena table <historical-search.html#alerts-search>`_.
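Pieced together from the fragment above, the complete rule might look like the following sketch; the import path and function name are assumptions, not taken verbatim from this diff:

.. code-block:: python

  from streamalert.shared.rule import rule  # assumed import path for StreamAlert v3

  @rule(logs=['cloudwatch:events'])
  def all_cloudwatch_events(record):  # hypothetical rule name
      """Alert on every inbound CloudWatch event."""
      return True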


Example: Logic & Outputs
@@ -70,7 +70,8 @@ Let's modify the rule to page the security team if anyone ever uses AWS root credentials
and record['detail']['eventType'] != 'AwsServiceEvent')

Now, any AWS root account usage is reported to PagerDuty, Slack, and the aforementioned Athena table.
In order for this to work, your `datasources <conf-datasources.html>`_ and `outputs <outputs.html>`_ must be configured so that:
In order for this to work, your `datasources <config-clusters.html#datasource-configuration>`_ and
`outputs <outputs.html>`_ must be configured so that:

* CloudTrail logs are being sent to StreamAlert via CloudWatch events
* The ``pagerduty:csirt`` and ``slack:security`` outputs have the proper credentials
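As a sketch, the complete modified rule might look like the following; the decorator arguments and rule name are assumptions pieced together from the fragments shown in this diff:

.. code-block:: python

  from streamalert.shared.rule import rule  # assumed import path for StreamAlert v3

  @rule(
      logs=['cloudwatch:events'],
      outputs=['pagerduty:csirt', 'slack:security']
  )
  def cloudtrail_root_account_usage(record):  # hypothetical rule name
      """Page the security team on any use of AWS root credentials."""
      return (record['detail']['userIdentity']['type'] == 'Root'
              and record['detail']['userIdentity'].get('invokedBy') is None
              and record['detail']['eventType'] != 'AwsServiceEvent')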
@@ -187,8 +188,9 @@ The following table provides an overview of each rule option, with more details

``logs`` define the log schema(s) supported by the rule.

Log `sources <conf-datasources.html>`_ are defined under the ``data_sources`` field for a cluster defined in ``conf/clusters/<cluster>.json``
and their `schemas <conf-schemas.html>`_ are defined in one or more files in the ``conf/schemas`` directory.
Log `datasources <config-clusters.html#datasource-configuration>`_ are defined within the
``data_sources`` field of a cluster configuration file, such as ``conf/clusters/<cluster>.json``, and their
`schemas <config-schemas.html>`_ are defined in one or more files in the ``conf/schemas`` directory.

.. note::

@@ -254,7 +256,7 @@

.. note::

The original (unmerged) alert will always be sent to `Athena <historical-search.html#athena-user-guide>`_.
The original (unmerged) alert will always be sent to `Athena <historical-search.html#alerts-search>`_.

:dynamic_outputs:
