
Normalization reboot - Refactor normalization code #1238

Merged: 12 commits into feature-artifact-extractor on Apr 28, 2020

Conversation

Contributor

@chunyong-lin chunyong-lin commented Apr 17, 2020

to: @airbnb/streamalert-maintainers
related to: #1237 #1230
resolves:

Background

This is the 3rd PR implementing the Normalization v2 feature. It focuses on refactoring the normalization code to consume both Normalization v1 and v2 configuration, and on attaching additional information (the function) to normalized values.

New Support

The biggest change in this PR is the normalization configuration change in conf/schemas/*.json. StreamAlert schemas now support a new configuration option:

"cloudtrail:something_logs": {
  "schema": {
    "Field1": "string",
    "Field2": "string",
    "Field3": "string"
  },
  "parser": "json",
  "configuration": {
    "normalization": {
      -- NEW CONFIGURATION HERE --
      "ip_address": [
        {
          "path": ["path", "to", "srcIpAddr"],  # standard configuration: path to the key plus a function
          "function": "source"
        },
        {
          "path": ["path", "to", "dstIpAddr"],
          "function": "destination"
        }
      ],
      "hostname": ["path", "to", "hostname"]  # simplified configuration: path to the key only
    }
  }
}
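To make the two configuration forms concrete, here is an illustrative sketch (not StreamAlert's actual implementation; all function and variable names are hypothetical) of how a classifier might walk a record using both the standard (dict with path/function) and simplified (bare path list) forms above:

```python
from typing import Any, Dict, List

def _value_at_path(record: Dict[str, Any], path: List[str]) -> Any:
    """Walk a nested record following the list of keys in ``path``."""
    value: Any = record
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def extract_normalized_values(record, normalization):
    """Return {normalized_type: [(value, function), ...]} for one record.

    Supports both configuration forms: a list of dicts with 'path' and
    'function' keys, or a bare list of keys (simplified form, no function).
    """
    results = {}
    for norm_type, conf in normalization.items():
        # A bare list of strings is the simplified form: wrap it so both
        # forms can be handled by the same loop.
        entries = conf if isinstance(conf[0], dict) else [{'path': conf}]
        for entry in entries:
            value = _value_at_path(record, entry['path'])
            if value is not None:
                results.setdefault(norm_type, []).append(
                    (value, entry.get('function'))
                )
    return results
```

Under this sketch, the configuration above would pull both IP addresses (tagged "source" and "destination") and the hostname (with no function) out of a matching record.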

Architecture

[image: normalization-arch]

Deprecate Normalization v1

We deprecate Normalization v1, and conf/normalized_types.json will have no effect. Make sure to migrate your normalization configuration to conf/schemas/*.json along with the log schemas.

Changes

  • Fixed a couple of bugs in Normalization reboot - Add terraform resources #1237 that were discovered during staging testing
  • Refactored the normalization code to support Normalization v2
  • Updated existing test cases
  • Added normalization configuration to conf/schemas/carbonblack.json, conf/schemas/cloudwatch.json and conf/schemas/osquery.json. Also updated the rule right_to_left_character to use Normalization v2.
  • Added more test cases to improve code coverage
  • Added an architecture diagram to the docs

Testing

Deployed the changes to the staging environment and created a Kinesis stream for testing. Sent two fake CloudWatch events to the Kinesis stream.

  • Those two CloudWatch events showed up in the cloudwatch:events table.

[image: cloudwatch_events]

  • Artifacts extracted from those two CloudWatch events showed up in the artifacts table.

[image: artifacts]

Next Step

One more PR remains to complete this feature: adding advanced options to apply filters during normalization.

@chunyong-lin chunyong-lin changed the title Cylin ae normalization Normalization reboot - Refactor normalization code Apr 17, 2020
@chunyong-lin chunyong-lin added this to the 3.3.0 milestone Apr 17, 2020
@chunyong-lin chunyong-lin marked this pull request as ready for review April 22, 2020 23:44
Architecture
============

.. figure:: ../images/normalization-arch.png

this image could use some love (like attention to detail with capitalized words vs non-capitalized words, spacing between labels + images, etc)... can you share this with me and I can tweak it?


@ryandeivert ryandeivert left a comment


I'd prefer if the athena screenshots didn't show your "cylin2020..."."artifacts" prefix.. and only showed select * from artifacts... the cylin2020.. bit is actually not required (unless you are searching a database that is NOT selected in the dropdown in Athena console.. which is rarely the case)

Coming soon.
In Normalization v1, the normalized types are based on log source (e.g. osquery, cloudwatch, etc) and defined in ``conf/normalized_types.json`` file.

In Normalization v2, the normalized types will be based on log type (e.g. osquery:differential, cloudwatch:cloudtrail, cloudwatch:events, etc) and defined in ``conf/schemas/*.json``. Although, it is recommended to configure normalization in ``conf/schemas/*.json``, the v1 configuration will be still valid and merged to v2.

Suggested change
In Normalization v2, the normalized types will be based on log type (e.g. osquery:differential, cloudwatch:cloudtrail, cloudwatch:events, etc) and defined in ``conf/schemas/*.json``. Although, it is recommended to configure normalization in ``conf/schemas/*.json``, the v1 configuration will be still valid and merged to v2.
In Normalization v2, the normalized types will be based on log type (e.g. ``osquery:differential``, ``cloudwatch:cloudtrail``, ``cloudwatch:events``, etc) and defined in ``conf/schemas/*.json``. However, we recommend configuring normalization in ``conf/schemas/*.json``. The v1 configuration will be still valid and merged to v2.

Configuration
=============

Coming soon.
In Normalization v1, the normalized types are based on log source (e.g. osquery, cloudwatch, etc) and defined in ``conf/normalized_types.json`` file.

Suggested change
In Normalization v1, the normalized types are based on log source (e.g. osquery, cloudwatch, etc) and defined in ``conf/normalized_types.json`` file.
In Normalization v1, the normalized types are based on log source (e.g. ``osquery``, ``cloudwatch``, etc) and defined in ``conf/normalized_types.json`` file.


In Normalization v2, the normalized types will be based on log type (e.g. osquery:differential, cloudwatch:cloudtrail, cloudwatch:events, etc) and defined in ``conf/schemas/*.json``. Although, it is recommended to configure normalization in ``conf/schemas/*.json``, the v1 configuration will be still valid and merged to v2.

Giving some examples to configure normalization v2. All normalized types are arbitrary, but we recommend to use all lower cases and underscores to name the normalized types to have better compatibility with Athena.

Suggested change
Giving some examples to configure normalization v2. All normalized types are arbitrary, but we recommend to use all lower cases and underscores to name the normalized types to have better compatibility with Athena.
Below are some example configurations for normalization v2. All normalized types are arbitrary, but only lower case alphabetic characters and underscores should be used for names in order to be compatible with Athena.


Giving some examples to configure normalization v2. All normalized types are arbitrary, but we recommend to use all lower cases and underscores to name the normalized types to have better compatibility with Athena.

* Normalized all ip addresses (``ip_address``) and user identities (``user_identity``) for ``cloudwatch:events`` events

Suggested change
* Normalized all ip addresses (``ip_address``) and user identities (``user_identity``) for ``cloudwatch:events`` events
* Normalize all ip addresses (``ip_address``) and user identities (``user_identity``) for ``cloudwatch:events`` logs

}
}

* Normalized all commands (``command``) and user identities (``user_identity``) for ``osquery:differential`` events

Suggested change
* Normalized all commands (``command``) and user identities (``user_identity``) for ``osquery:differential`` events
* Normalize all commands (``command``) and user identities (``user_identity``) for ``osquery:differential`` logs

* A new Lambda function
* A new Glue catalog table ``artifacts`` for Historical Search via Athena
* A new Firehose to deliver artifacts to S3 bucket
* Update existing Firehoses to allow to invoke Artifact Extractor lambda if it is enabled on the Firehoses

Suggested change
* Update existing Firehoses to allow to invoke Artifact Extractor lambda if it is enabled on the Firehoses
* Update existing Firehoses to allow to invoke Artifact Extractor Lambda if it is enabled on the Firehose resources


python manage.py deploy --function artifact_extractor

* If normalization configuration changed in ``conf/schemas/*.json``, make sure deploy classifier as well

Suggested change
* If normalization configuration changed in ``conf/schemas/*.json``, make sure deploy classifier as well
* If the normalization configuration has changed in ``conf/schemas/*.json``, make sure to deploy the classifier Lambda function as well

Artifacts
=========

Artifacts will be searching via Athena ``artifacts`` table. During the test in staging environment, two fake ``cloudwatch:events`` were sent to a Kinesis data stream.

Suggested change
Artifacts will be searching via Athena ``artifacts`` table. During the test in staging environment, two fake ``cloudwatch:events`` were sent to a Kinesis data stream.
Artifacts will be searchable within the Athena ``artifacts`` table


Artifacts will be searching via Athena ``artifacts`` table. During the test in staging environment, two fake ``cloudwatch:events`` were sent to a Kinesis data stream.

Those two fake events were searchable in ``cloudwatch_events`` table.

what are these references to "fake" events and "staging" environment in the docs??? please remove or update as necessary

"""
# Enforce all fields are strings in a Artifact to prevent type corruption in Parquet format
self._function = str(kwargs.get('function', 'not_specified'))
self._record_id = str(kwargs.get('record_id', self.RESERVED))
self._function = str(kwargs.get('function'))

if these values ('function' and 'record_id') are no longer optional, please remove the usage of kwargs and add these are required arguments to __init__

# 'awsRegion': 'us-west-2'
# }
# },
# 'normalization': {

I think this key would be streamalert:normalization?

# 'normalization': {
# 'region': {
# 'values': ['us-east-1', 'us-west-2']
# 'function': 'AWS region'

The structure of this object will not work. If you have 2 region type values with different function values it will conflict.

"streamalert:normalization": {
  "ip_address": [
    {
      "values": ["4.3.2.1"],
      "function": "outbound_connection_destination"
    },
    {
      "values": ["2.2.2.2", "4.4.4.4"],
      "function": "dns_lookup"
    }
  ]
}
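As a hypothetical sketch of the grouped structure the reviewer describes (the helper name is illustrative, not StreamAlert's API), bucketing values by function means two different functions can coexist under one normalized type without overwriting each other:

```python
def add_normalized_value(normalization, norm_type, value, function):
    """Append ``value`` under ``norm_type``, grouped by ``function``.

    Each normalized type holds a list of {'values': [...], 'function': ...}
    objects, so entries with different functions never collide.
    """
    groups = normalization.setdefault(norm_type, [])
    for group in groups:
        if group['function'] == function:
            # Same function already present: extend its value list.
            group['values'].append(value)
            return
    # First value seen for this function: start a new group.
    groups.append({'values': [value], 'function': function})
```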

having the following format; otherwise it will raise ConfigError.
[
  {
    'fields': ['source', 'sourceIPAddress'],

This is not how I intended it to work. In this documentation you're implying that each normalizer can map to multiple fields. The way I designed it is each normalizer maps to exactly one field. The field array is a JSON path.


[
  {
    'fields': ['path', 'to', 'the', 'field'],
    'function': 'same_function'
  },
  {
    'fields': ['other', 'path', 'same', 'function'],
    'function': 'same_function'
  },
]
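The shape check implied above can be sketched as follows (a hedged illustration only: the ConfigError class and function name here are stand-ins, not StreamAlert's actual implementation). Each entry maps one normalizer to exactly one field, where 'fields' is a JSON path:

```python
class ConfigError(Exception):
    """Stand-in for StreamAlert's configuration error type."""

def validate_normalizer_entries(entries):
    """Raise ConfigError unless ``entries`` is a list of dicts, each with a
    'fields' JSON path (list of strings) and a 'function' string."""
    if not isinstance(entries, list):
        raise ConfigError('normalizer configuration must be a list')
    for entry in entries:
        if not isinstance(entry, dict):
            raise ConfigError('each normalizer entry must be a dict')
        fields = entry.get('fields')
        if not isinstance(fields, list) or not all(
                isinstance(f, str) for f in fields):
            raise ConfigError(
                "'fields' must be a list of keys forming a JSON path")
        if not isinstance(entry.get('function'), str):
            raise ConfigError("'function' must be a string")
```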

Contributor Author

Yeah, misunderstood the configuration part in the design doc. The new change will be up soon.

@chunyong-lin

@ryandeivert @Ryxias PTAL, I have addressed your comments. I also updated PR description and docs.

@Ryxias Ryxias left a comment


continue

yield value
# for key, value in record.items():

Unnecessary comment block?

@chunyong-lin chunyong-lin merged commit bb2d306 into feature-artifact-extractor Apr 28, 2020
@chunyong-lin chunyong-lin deleted the cylin-ae-normalization branch April 28, 2020 17:20