Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#3 UDT support / Overall Unit & Integration tests #54

Merged
merged 30 commits into from
Dec 9, 2020

Conversation

grighetto
Copy link
Collaborator

@grighetto grighetto commented Nov 23, 2020

This PR adds support for UDT anonymization and includes unit and integration tests for all anonymized elements.

For the integration tests, there's a base CQL schema (generated programmatically) that is executed against a few C* versions in Docker, then the anonymizer is executed and the output is compared to the expected CQL file in the "schemas" folder.

The unit tests are mostly self-explanatory and they validate a few complex UDTs as per the original issue #3.

Steps to run the tests:

  • Build the Python package as usual with: python3 setup.py install
  • From python/adelphi/tests/unit, run the unit tests with: pytest
  • From python/adelphi/integration-tests, run the integration tests with: ./adelphi-docker-tests.sh

absurdfarce and others added 26 commits November 12, 2020 13:06
…that run against different versions of C* in Docker containers.
added comments to the various modules indicating what they're supposed to do.
…vide a higher-order

fn which can be called with the driver's cluster object (when connected) + infrastructure
for extracting what we need from this object.
…nymizer-automated-validation

# Conflicts:
#	schema-anonymizer/cassandra-anonymizer/anonymizer.py
…arly indicate

what is and isn't being included in a given export operation.
…context object

apart more than once and to re-use this object rather than creating new ones.
…k/col prefixes and consolidated them all to the `col_` prefix.
…able comment since it may contain sensitive info. Updated integration tests.
…nymizer-automated-validation

# Conflicts:
#	python/adelphi/adelphi/store.py
…ted-validation

# Conflicts:
#	python/adelphi/adelphi/anonymize.py
#	python/adelphi/adelphi/cql.py
#	python/adelphi/adelphi/store.py
@grighetto grighetto added the hard label Nov 23, 2020
@grighetto grighetto linked an issue Nov 23, 2020 that may be closed by this pull request
@@ -1,3 +1,6 @@
# Test output files
output/
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integration tests output go there

local version=$1
echo "Comparing..."
diff schemas/$version.cql output/$version.cql
diffExitCode=$?
Copy link
Collaborator Author

@grighetto grighetto Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeps track of the exit code, such that even if the last comparison succeeded, but some previous comparison failed, we want the parent process to fail

@@ -0,0 +1,211 @@
CREATE KEYSPACE my_ks_0 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the base schema that gets executed in all C* versions

it's generated with: python3 schema_util.py > ../../integration-tests/base-schema.cql

IndexMetadata,\
SimpleStrategy, \
UserType

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utility script that programmatically generates a C* schema compatible with C* from 2.1.22 to 4.0-beta3

it contains a couple of keyspaces, tables, indexes (regular and custom) and complex UDTs

except ImportError:
import unittest # noqa

class TestCqlAnonymize(unittest.TestCase):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anonymizer unit tests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can expand this to test the command-line params too, for now these gravitate towards the anonymization

@@ -1,12 +1,10 @@
import re
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this should be moved below the top-level comment for this package. At least that's the convention used in the other Adelphi packages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -54,7 +54,6 @@ def partition(pred, iterable):
log.info("Excluding system keyspaces " + ",".join((ks.name for ks in failed)))
return passed


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: there are two spaces between the various functions in all the Python packages. We can change that if desired but a one-off change doesn't seem like a good idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

# remove functions, aggregates and views for now
keyspace.functions = {}
keyspace.aggregates = {}
keyspace.views = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the removal of all of these required for anonymization? Seems like we're giving up a fair amount of useful schema info here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@absurdfarce do you think we can safely anonymize functions and aggregates?

Maybe materialized views could be added, but since views aren't recommended for prod, it seemed like a lower priority that shouldn't block the rest of the PR. We can open a ticket to track this though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination is to say we probably can do this, but yeah, it can be held off for future work. It does seem like including as many of these as possible in some capacity is probably worth it, though, since it will make for more robust validation testing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination is to say we probably can do this

Did you consider anonymizing the function body across all the supported languages? And even so, I think hiding the implementation of an UDF algorithm is more important than just anonymizing its variable names.
Anyway, that's a much larger scope and we can revisit it the future as you said. +1

@absurdfarce
Copy link
Collaborator

pytest is actually a third-party tool for running Python unit tests (see https://docs.pytest.org/en/stable/). It needs to be installed as a discrete package for use here, so make sure you do "pip install pytest" if you wish to use it.

@@ -0,0 +1,133 @@
import pytest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import doesn't appear to be used: everything below uses the conventional Python unittest framework (which seems like the right decision to me) and I don't see any difference in behaviour when this import is removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

name2 = get_name("test_column", COLUMN_PREFIX)
name3 = get_name("another_column", COLUMN_PREFIX)
self.assertEqual(name1, name2)
self.assertNotEqual(name1, name3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding this at the end so that we can run this thing with a simple python test_anonymize.py (i.e. to avoid any explicit dependence on pytest):

if __name__ == "__main__":
    unittest.main()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@absurdfarce
Copy link
Collaborator

absurdfarce commented Nov 30, 2020

The unit test is currently failing when run against Python 2.7.14. Looks mainly like a collection of mismatches due to the integers in use for anonymized names. It's not immediately clear to me if that process just isn't deterministic or if there's something else going on here. Integration tests using Python 2.7.14 also fail across the board for what looks to be the same reasons.

Unit test passes without issue using Python 3.9.0

Integration test is behaving... strangely when using Python 3.9.0. Still looking into what's going on there.

@jdonenine jdonenine added this to the Schema Repository milestone Dec 2, 2020
@grighetto
Copy link
Collaborator Author

@absurdfarce Thanks for the thorough review. PR feedback has been addressed.
I made the change to use pip install instead of python setup.py install to cope with your virtualenv setup (and probably other people would run into that too). Let me know how that goes for you now so we can move on with this PR.

@absurdfarce
Copy link
Collaborator

absurdfarce commented Dec 7, 2020

@grighetto I agree that all the comments on this PR are put to bed so we're very close. At this point the only thing holding this PR up from going in is the test failures: I continue to see unit test failures for both Python 2.7.x (2.7.14 specifically) and 3.x (3.9.0 specifically). I'd like to at least see passing tests before we move this forward. Is this something you can take a look at?

UPDATE: failures for Python 3.x were presumably user error on my part. After a clean rebuild + retest the unit tests pass when using Python 3.9.0.

@absurdfarce
Copy link
Collaborator

I'll report the error output I'm seeing from the unit tests just so there's a record of those errors somewhere.

After doing a "pip uninstall; pip install" of this package I'm observing the following failures using Python 2.x:

python2_unit_test_failures.txt

A similar process (pip uninstall; pip install; manual run of unit tests) looks to pass without errors when using Python 3.x (specifically Python 3.9.0).

…ted-validation

# Conflicts:
#	python/adelphi/adelphi/anonymize.py
#	python/adelphi/adelphi/cql.py
@grighetto
Copy link
Collaborator Author

Tests fixed for Python 2 - it was caused by different dict sorting behavior across Python versions. Manually sorting the keyspace elements by name makes it deterministic and fixes the issues across the board.

@absurdfarce
Copy link
Collaborator

Recent changes to get to passing unit tests seem sensible to me... I think we're finally done with this guy!

@grighetto grighetto merged commit ff6c038 into master Dec 9, 2020
@grighetto grighetto deleted the 3-anonymizer-automated-validation branch December 9, 2020 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rename commit_hash param to git_identifier Validate schema-anonymizer on nested UDTs
3 participants