
Normalization reboot - Refactor normalization code #1238

Merged: 12 commits into feature-artifact-extractor on Apr 28, 2020

Conversation

Contributor

@chunyong-lin chunyong-lin commented Apr 17, 2020

to: @airbnb/streamalert-maintainers
related to: #1237 #1230
resolves:

Background

This is the 3rd PR implementing the Normalization v2 feature. It focuses on refactoring the normalization code to consume both Normalization v1 and v2 configuration, and on attaching additional information (the function) to normalized values.

New Support

The biggest change in this PR is the normalization configuration change in conf/schemas/*.json. StreamAlert schemas now support a new configuration option:

"cloudtrail:something_logs": {
  "schema": {
    "Field1": "string",
    "Field2": "string",
    "Field3": "string"
  },
  "parser": "json",
  "configuration": {
    "normalization": {
      -- NEW CONFIGURATION HERE --
      "ip_address": [
        {
          "path": ["path", "to", "srcIpAddr"],  # standard configuration: path to the key plus a function
          "function": "source"
        },
        {
          "path": ["path", "to", "dstIpAddr"],
          "function": "destination"
        }
      ],
      "hostname": ["path", "to", "hostname"]  # simplified configuration: path to the key only
    }
  }
}
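To make the two configuration forms concrete, here is an illustrative sketch (not StreamAlert's actual implementation; all function and variable names are hypothetical) of how a classifier might walk a record using both the standard (dict with path/function) and simplified (bare path list) forms above:

```python
from typing import Any, Dict, List

def _value_at_path(record: Dict[str, Any], path: List[str]) -> Any:
    """Walk a nested record following the list of keys in ``path``."""
    value: Any = record
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def extract_normalized_values(record, normalization):
    """Return {normalized_type: [(value, function), ...]} for one record.

    Supports both configuration forms: a list of dicts with 'path' and
    'function' keys, or a bare list of keys (simplified form, no function).
    """
    results = {}
    for norm_type, conf in normalization.items():
        # A bare list of strings is the simplified form: wrap it so both
        # forms can be handled by the same loop.
        entries = conf if isinstance(conf[0], dict) else [{'path': conf}]
        for entry in entries:
            value = _value_at_path(record, entry['path'])
            if value is not None:
                results.setdefault(norm_type, []).append(
                    (value, entry.get('function'))
                )
    return results
```

Under this sketch, the configuration above would pull both IP addresses (tagged "source" and "destination") and the hostname (with no function) out of a matching record.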

Architecture

[image: normalization-arch]

Deprecate Normalization v1

We deprecate Normalization v1, and conf/normalized_types.json will have no effect. Make sure to migrate your normalization configuration to conf/schemas/*.json along with the log schemas.

Changes

  • Fixed a couple of bugs in Normalization reboot - Add terraform resources #1237 that were discovered during staging testing
  • Refactored the normalization code to support Normalization v2
  • Updated existing test cases
  • Added normalization configuration to conf/schemas/carbonblack.json, conf/schemas/cloudwatch.json and conf/schemas/osquery.json. Also updated the rule right_to_left_character to use Normalization v2.
  • Added more test cases to improve code coverage
  • Added an architecture diagram to the docs

Testing

Deployed the changes to the staging environment and created a Kinesis stream for testing. Sent two fake CloudWatch events to the Kinesis stream.

  • Those two CloudWatch events showed up in the cloudwatch:events table.

[image: cloudwatch_events]

  • Artifacts extracted from those two CloudWatch events showed up in the artifacts table.

[image: artifacts]

Next Step

One more PR remains to complete this feature: adding advanced options to apply filters during normalization.

@chunyong-lin chunyong-lin changed the title Cylin ae normalization Normalization reboot - Refactor normalization code Apr 17, 2020
@chunyong-lin chunyong-lin added this to the 3.3.0 milestone Apr 17, 2020
@chunyong-lin chunyong-lin marked this pull request as ready for review April 22, 2020 23:44
Architecture
============

.. figure:: ../images/normalization-arch.png

this image could use some love (like attention to detail with capitalized words vs non-capitalized words, spacing between labels + images, etc)... can you share this with me and I can tweak it?


@ryandeivert ryandeivert left a comment


I'd prefer if the athena screenshots didn't show your "cylin2020..."."artifacts" prefix.. and only showed select * from artifacts... the cylin2020.. bit is actually not required (unless you are searching a database that is NOT selected in the dropdown in Athena console.. which is rarely the case)

Coming soon.
In Normalization v1, the normalized types are based on log source (e.g. osquery, cloudwatch, etc) and defined in ``conf/normalized_types.json`` file.

In Normalization v2, the normalized types will be based on log type (e.g. osquery:differential, cloudwatch:cloudtrail, cloudwatch:events, etc) and defined in ``conf/schemas/*.json``. Although, it is recommended to configure normalization in ``conf/schemas/*.json``, the v1 configuration will be still valid and merged to v2.

Suggested change
In Normalization v2, the normalized types will be based on log type (e.g. osquery:differential, cloudwatch:cloudtrail, cloudwatch:events, etc) and defined in ``conf/schemas/*.json``. Although, it is recommended to configure normalization in ``conf/schemas/*.json``, the v1 configuration will be still valid and merged to v2.
In Normalization v2, the normalized types will be based on log type (e.g. ``osquery:differential``, ``cloudwatch:cloudtrail``, ``cloudwatch:events``, etc) and defined in ``conf/schemas/*.json``. However, we recommend configuring normalization in ``conf/schemas/*.json``. The v1 configuration will be still valid and merged to v2.

Configuration
=============

Coming soon.
In Normalization v1, the normalized types are based on log source (e.g. osquery, cloudwatch, etc) and defined in ``conf/normalized_types.json`` file.

Suggested change
In Normalization v1, the normalized types are based on log source (e.g. osquery, cloudwatch, etc) and defined in ``conf/normalized_types.json`` file.
In Normalization v1, the normalized types are based on log source (e.g. ``osquery``, ``cloudwatch``, etc) and defined in ``conf/normalized_types.json`` file.


In Normalization v2, the normalized types will be based on log type (e.g. osquery:differential, cloudwatch:cloudtrail, cloudwatch:events, etc) and defined in ``conf/schemas/*.json``. Although, it is recommended to configure normalization in ``conf/schemas/*.json``, the v1 configuration will be still valid and merged to v2.

Giving some examples to configure normalization v2. All normalized types are arbitrary, but we recommend to use all lower cases and underscores to name the normalized types to have better compatibility with Athena.

Suggested change
Giving some examples to configure normalization v2. All normalized types are arbitrary, but we recommend to use all lower cases and underscores to name the normalized types to have better compatibility with Athena.
Below are some example configurations for normalization v2. All normalized types are arbitrary, but only lower case alphabetic characters and underscores should be used for names in order to be compatible with Athena.


Giving some examples to configure normalization v2. All normalized types are arbitrary, but we recommend to use all lower cases and underscores to name the normalized types to have better compatibility with Athena.

* Normalized all ip addresses (``ip_address``) and user identities (``user_identity``) for ``cloudwatch:events`` events

Suggested change
* Normalized all ip addresses (``ip_address``) and user identities (``user_identity``) for ``cloudwatch:events`` events
* Normalize all ip addresses (``ip_address``) and user identities (``user_identity``) for ``cloudwatch:events`` logs

}
}

* Normalized all commands (``command``) and user identities (``user_identity``) for ``osquery:differential`` events

Suggested change
* Normalized all commands (``command``) and user identities (``user_identity``) for ``osquery:differential`` events
* Normalize all commands (``command``) and user identities (``user_identity``) for ``osquery:differential`` logs

* A new Lambda function
* A new Glue catalog table ``artifacts`` for Historical Search via Athena
* A new Firehose to deliver artifacts to S3 bucket
* Update existing Firehoses to allow to invoke Artifact Extractor lambda if it is enabled on the Firehoses

Suggested change
* Update existing Firehoses to allow to invoke Artifact Extractor lambda if it is enabled on the Firehoses
* Update existing Firehoses to allow to invoke Artifact Extractor Lambda if it is enabled on the Firehose resources


python manage.py deploy --function artifact_extractor

* If normalization configuration changed in ``conf/schemas/*.json``, make sure deploy classifier as well

Suggested change
* If normalization configuration changed in ``conf/schemas/*.json``, make sure deploy classifier as well
* If the normalization configuration has changed in ``conf/schemas/*.json``, make sure to deploy the classifier Lambda function as well

Artifacts
=========

Artifacts will be searching via Athena ``artifacts`` table. During the test in staging environment, two fake ``cloudwatch:events`` were sent to a Kinesis data stream.

Suggested change
Artifacts will be searching via Athena ``artifacts`` table. During the test in staging environment, two fake ``cloudwatch:events`` were sent to a Kinesis data stream.
Artifacts will be searchable within the Athena ``artifacts`` table


Artifacts will be searching via Athena ``artifacts`` table. During the test in staging environment, two fake ``cloudwatch:events`` were sent to a Kinesis data stream.

Those two fake events were searchable in ``cloudwatch_events`` table.

what are these references to "fake" events and "staging" environment in the docs??? please remove or update as necessary

"""
# Enforce all fields are strings in a Artifact to prevent type corruption in Parquet format
self._function = str(kwargs.get('function', 'not_specified'))
self._record_id = str(kwargs.get('record_id', self.RESERVED))
self._function = str(kwargs.get('function'))

if these values ('function' and 'record_id') are no longer optional, please remove the usage of kwargs and add these are required arguments to __init__

# 'awsRegion': 'us-west-2'
# }
# },
# 'normalization': {

I think this key would be streamalert:normalization?

# 'normalization': {
# 'region': {
# 'values': ['us-east-1', 'us-west-2']
# 'function': 'AWS region'

The structure of this object will not work. If you have 2 region type values with different function values it will conflict.

"streamalert:normalization": {
  "ip_address": [
    {
      "values": ["4.3.2.1"],
      "function": "outbound_connection_destination"
    },
    {
      "values": ["2.2.2.2", "4.4.4.4"],
      "function": "dns_lookup"
    }
  ]
}
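As a hypothetical sketch of the grouped structure the reviewer describes (the helper name is illustrative, not StreamAlert's API), bucketing values by function means two different functions can coexist under one normalized type without overwriting each other:

```python
def add_normalized_value(normalization, norm_type, value, function):
    """Append ``value`` under ``norm_type``, grouped by ``function``.

    Each normalized type holds a list of {'values': [...], 'function': ...}
    objects, so entries with different functions never collide.
    """
    groups = normalization.setdefault(norm_type, [])
    for group in groups:
        if group['function'] == function:
            # Same function already present: extend its value list.
            group['values'].append(value)
            return
    # First value seen for this function: start a new group.
    groups.append({'values': [value], 'function': function})
```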

having the following format; otherwise it will raise ConfigError.
[
  {
    'fields': ['source', 'sourceIPAddress'],

This is not how I intended it to work. In this documentation you're implying that each normalizer can map to multiple fields. The way I designed it is each normalizer maps to exactly one field. The field array is a JSON path.


[
  {
    'fields': ['path', 'to', 'the', 'field'],
    'function': 'same_function'
  },
  {
    'fields': ['other', 'path', 'same', 'function'],
    'function': 'same_function'
  },
]
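The shape check implied above can be sketched as follows (a hedged illustration only: the ConfigError class and function name here are stand-ins, not StreamAlert's actual implementation). Each entry maps one normalizer to exactly one field, where 'fields' is a JSON path:

```python
class ConfigError(Exception):
    """Stand-in for StreamAlert's configuration error type."""

def validate_normalizer_entries(entries):
    """Raise ConfigError unless ``entries`` is a list of dicts, each with a
    'fields' JSON path (list of strings) and a 'function' string."""
    if not isinstance(entries, list):
        raise ConfigError('normalizer configuration must be a list')
    for entry in entries:
        if not isinstance(entry, dict):
            raise ConfigError('each normalizer entry must be a dict')
        fields = entry.get('fields')
        if not isinstance(fields, list) or not all(
                isinstance(f, str) for f in fields):
            raise ConfigError(
                "'fields' must be a list of keys forming a JSON path")
        if not isinstance(entry.get('function'), str):
            raise ConfigError("'function' must be a string")
```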

Contributor Author

Yeah, misunderstood the configuration part in the design doc. The new change will be up soon.

@chunyong-lin

@ryandeivert @Ryxias PTAL, I have addressed your comments. I also updated PR description and docs.

@Ryxias Ryxias left a comment


continue

yield value
# for key, value in record.items():

Unnecessary comment block?

@chunyong-lin chunyong-lin merged commit bb2d306 into feature-artifact-extractor Apr 28, 2020
@chunyong-lin chunyong-lin deleted the cylin-ae-normalization branch April 28, 2020 17:20