Refactor FIPS & unify CLI actions #275

adamjanovsky · 2022-10-25T13:26:51Z

High-level plan

1. Unify CLIs into a single CLI (FIPS: Unify CLI actions with CC #212)
2. Migrate tests to pytest, separate download tests from others, refactor test structure (split into cc/fips/others)
- Tests: Make resources locally accessible #76, Refactor CC tests #79
3. Refactor FIPS code
4. Misc: update docs, docker, readme, ...

Tests

Add tests that check working pandas serialization
Allow for default paths of CPEDataset and CVEDataset
- ~~Or forbid it and demand explicit paths (that will not get serialized omg). Either way, act consistently.~~
Get rid of class design in the favour of fiixtures
Refactor module design
Test maintenance updates
Select more appropriate names for the methods
Migrate to pytest
Divide logic of tests into download / rest.
Allow download test to fail on unreliable resources, ~~test download on resources that we trust (e.g. our server)~~
- ~~Forbid download on non-download test by auto-use fixture that raises exception on requests.get / requests.post~~

FIPS

Processing steps progress:

Random notes to do there:

Refactor FIPS to hold auxilary datasets in the designated class
Implement new folder structure
Rewrite redo -> fresh, implement reatempting computations on failed samples
Introduce InternalState both to FIPSCertificate and to FIPSDataset

Done outside of the processing steps:

Delete plot graph methods
~~[ ] Add examples of plot graph into FIPS reference notebook~~
Verify intermediate results on small-scale dataset before proceeding further
~~Verify if FIPS webpages are loading on scroll (historical modules). If yes, figure out how to force full load with BeautifulSoup~~
- This was not confirmed by manual analysis.
Implement copying of FIPSDataset contents
Unify CLIs, closes FIPS: Unify CLI actions with CC #212
FIPSAlgorithmDataset should reside in own json in auxillary datasets

It is unclear what things like clean_cert_ids, or algorithms in the various certicate subobjects (PdfData and WebData) are, are they processed or raw, what sort of cleaning was done and what method needs to be called to do it? Specifically this cleaning that is done on FIPS certs is split between the dataset class and the certificate class in a nasty way, this should be unified and moved, maybe even to a separate class similar to how CC certificate ID canonicalization is separate.

Address the above

Incomplete: See parse_html_main, the logic there just is not good.

Address the above

Misc

Refactor CC to hold auxilary datasets in the designated class
Update docs
Update Docker image
Restrict usage of fresh on places where it does not makes sense
Update project readme
Revisit all TODOs (internal check)
Unify logging messages.
Update code documentation
PyUpgrade and flake plugin into linting pipeline

Sanity check

Perform full FIPS run
Perform full CC run
Run all notebooks

Still part of refactoring, but to-be-addressed by separate PRs

Move content from examples to notebooks/examples
Rename CommonCriteria files -> cc #287
Resolve dependabot issues https://github.com/crocs-muni/sec-certs/security/dependabot
Across tool, unify from __future__ import annotations
Test GH action yields some warnings about usage of deprecated functions, fix.
Improve build system #231
Release

J08nY · 2022-10-25T15:37:43Z

I applaud this effort! When I get more time hopefully I can help as well.

In the meantime, let me add a few pointers/issues I hit when working with the FIPS code:

The attributes of the certificate that are extracted and processed are unclear and incomplete.
- It is unclear what things like clean_cert_ids, or algorithms in the various certicate subobjects (PdfData and WebData) are, are they processed or raw, what sort of cleaning was done and what method needs to be called to do it? Specifically this cleaning that is done on FIPS certs is split between the dataset class and the certificate class in a nasty way, this should be unified and moved, maybe even to a separate class similar to how CC certificate ID canonicalization is separate.
- Incomplete: See parse_html_main, the logic there just is not good.
Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.

adamjanovsky · 2022-10-26T07:10:45Z

* Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.

Thanks for the input. This I'll probably leave for future work, this PR is going to be giant anyway. Could you please create a separate issue to address this?

J08nY · 2022-10-26T23:00:40Z

* Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.
Thanks for the input. This I'll probably leave for future work, this PR is going to be giant anyway. Could you please create a separate issue to address this?

Did this in #276.

sec_certs/dataset/dataset.py

adamjanovsky · 2022-10-27T11:51:19Z

I'll re-implement the folder structure of FIPS Dataset. Outline:

dset
├── FIPS Dataset.json
├── auxillary_datasets
│   └── algorithms
│       └── fips_algorithms.json
├── certs
│   ├── modules
│   └── policies
│       ├── pdf
│       └── txt
└── web
    ├── fips_modules_active.html
    ├── fips_modules_historical.html
    └── fips_modules_revoked.html

sec_certs/sample/fips.py

cli.py

adamjanovsky · 2022-10-27T19:59:30Z

@J08nY looking at the code here and there, I planned the following:

I decomposed whole processing pipeline into four steps: (i) parse metadata, (ii) download artifacts, (iii) process artifacts, (iv) analyze artifacts

Each child class would hold a property that would return a list of callables for each step, e.g. artifact_download_methods, see

sec-certs/sec_certs/dataset/fips.py

Lines 57 to 58 in 9433658

    
           def artifact_download_methods(self) -> List[Callable]: 
        
               return [self._download_modules, self._download_policies]

Parent class would just orchestrate these steps, i.e. call all required methods (that must have no arguments which so far holds), see

sec-certs/sec_certs/dataset/dataset.py

Lines 234 to 246 in 9433658

    
           def download_all_artifacts(self, fresh: bool = True) -> None: 
        
               if self.state.meta_sources_parsed is False: 
        
                   logger.error("Attempting to download pdfs while not having csv/html meta-sources parsed. Returning.") 
        
                   return 
        
               for method in self.artifact_download_methods: 
        
                   method(fresh) 
        
               if fresh: 
        
                   for method in self.artifact_download_methods: 
        
                       method(False) 
        
               self.state.artifacts_downloaded = True

I try to introduce the state variables and execute the method on the individual samples in the same fashion as in CC world.

This comes with some advantages:

Refactoring is very fast, because I just follow the design already implemented in CC
It should also produce fewer errors, same reasoning as above

Some disadvantages that cross my mind:

Whatever we're unsatisfied with in CC will probably hit also FIPS 😆

There's a lot of spaghetti code that we could possibly get rid of. E.g., see _download_policies() and _download_modules():

sec-certs/sec_certs/dataset/fips.py

Lines 94 to 108 in 9433658

    
           def _download_policies(self, fresh: bool = True) -> None: 
        
               self.policies_pdf_dir.mkdir(exist_ok=True) 
        
               if fresh: 
        
                   logger.info("Downloading PDF security policies.") 
        
               else: 
        
                   logger.info("Attempting re-download of failed PDF security policies.") 
        
               certs_to_process = [x for x in self if x.state.policy_is_ok_to_download(fresh)] 
        
               cert_processing.process_parallel( 
        
                   FIPSCertificate.download_policy, 
        
                   certs_to_process, 
        
                   config.n_threads, 
        
                   progress_bar_desc="Downloading PDF security policies", 
        
               )

,

sec-certs/sec_certs/dataset/fips.py

Lines 78 to 92 in 9433658

    
           def _download_modules(self, fresh: bool = True) -> None: 
        
               self.module_dir.mkdir(exist_ok=True) 
        
               if fresh: 
        
                   logger.info("Downloading HTML cryptographic modules.") 
        
               else: 
        
                   logger.info("Attempting re-download of failed HTML cryptographic modules.") 
        
               certs_to_process = [x for x in self if x.state.module_is_ok_to_download(fresh)] 
        
               cert_processing.process_parallel( 
        
                   FIPSCertificate.download_module, 
        
                   certs_to_process, 
        
                   config.n_threads, 
        
                   progress_bar_desc="Downloading HTML modules", 
        
               )

. They are very similar, just one variable that changes in multiple places + some logging. The same problem applies to methods executed on the individual samples

I am not certain whether we want to attempt to solve the spaghetti code by adding additional layer. First, I'm not sure how to do that without getattr("lame string of some variable"). Second, if we break the pattern, we would need to introduce new design...

Now's the time to speak up if you have comments/tips/whatever :).

J08nY · 2022-10-27T23:52:49Z

Interesting approach but I am worried about a few things:

Passing callables around smells. For example "(that must have no arguments which so far holds)" as a limitation sounds pretty bad. Maybe the solution here is to not have the superclass orchestrate but instead have it the other way around? I dunno, but passing callables seems to smell and locks us in a thing that is very constraining.
I would maybe have a quick look at design patterns to see if some fits? https://refactoring.guru/design-patterns/catalog

I guess I am really not sure what problem you are trying to solve by passing the callables. I will have to look at the code.

J08nY · 2022-10-28T00:04:09Z

Looking at the code a bit. I think now I would just match the FIPS Dataset API to the CC Dataset API manually, even at the cost of some code duplication. I don't feel like it is a good time to rework the handling like this during this FIPS unification.

J08nY · 2022-10-28T00:13:28Z

Ah yes, lets do this: https://refactoring.guru/design-patterns/template-method and don't pass callables around. Just call the methods directly, even if there might be some duplication I think it makes sense for future flexibility.

adamjanovsky · 2022-11-09T12:11:20Z

I was thinking about the logic of the download testing. At the moment, the situation is as follows:

Tests that don't test download functionality from unreliable servers (CC, FIPS), but use it (when this is difficult to get rid of) check whether the download succeeded and raise pytest.xfail if it did not
Tests that test download from unreliable servers were decorated with xfail. The problem with tested functionality is discovered only on repeated failures.
- We could possibly re-run these tests, but that would require pytest plugins?
Tests that download from reliable servers were left as they were.

Co-authored-by: Ján Jančár <J08nY@users.noreply.github.com>

adamjanovsky · 2022-12-05T16:41:09Z

It seems the whole PP dataset was included in the test data, was this intended? It shows up at almost 77k lines. I believe just downloading it from our site is a reasonable alternative. Although as it is already in git, removing it now is pointless (without a rebase and history editing the repository will have it forever).

Oh this hurts. It was added by accident. I'll probably rewrite the history...

adamjanovsky · 2022-12-05T16:48:00Z

Connected with the root_dir handling and serialization is the DUMMY_PATH issue. I see the current solution as an acceptable one, although I think there is room for improvement, we just need to identify the core issue in the API design and assumptions to see what needs to change. I think the current situation arises because our assumptions in different places in the code are not consistent or clear, specifically assumptions about whether a dataset object is backed by its JSON (directory structure) at different points in time as it is constructed from different sources.

You're perfectly right about the cause.

One possible way to resolve this is to say that the dataset object is always backed by its directory structure/JSON and explicitly require the path in all constructors, even the web ones. Other possible way is the opposite, to not assume the dataset is always backed, then disable the serialization features unless a path is set.

I was actually opting for the second, which is kind of the current state. The serialization will fail unless a path is set. See

sec-certs/sec_certs/serialization/json.py

Lines 50 to 51 in f14dfe3

    
           if self.json_path == constants.DUMMY_NONEXISTING_PATH:  # type: ignore 
        
               raise ValueError(f"json_path variable for {type(self)} was not yet set.")

It's just implemented with that DUMMY_NONEXISTING_PATH variable...

adamjanovsky · 2022-12-05T16:48:49Z

I dislike that the root_dir property setter on dataset classes actually serializes the dataset. I think this is highly unintuitive and unexpected. I propose a more explicit solution in a comment in the review. The TL;DR; is to create a different method for moving/copying the dataset and make it the only method to change the root_dir.

I agree this can be counterintuitive. I'll take a look at that, thanks for spotting this.

J08nY

lgtm except the hash impl.

sec_certs/cli.py

sec_certs/constants.py

sec_certs/dataset/dataset.py

sec_certs/sample/certificate.py

adamjanovsky added 4 commits October 25, 2022 15:09

unify CLI actions

a900b90

merge dataset constructors, from_web_latest code

0e12ba4

unify _get_certs_by_name methods

7aff8bf

unify get_keywords_df method

d0d9f91

J08nY reviewed Oct 26, 2022

View reviewed changes

sec_certs/dataset/dataset.py Outdated Show resolved Hide resolved

J08nY mentioned this pull request Oct 26, 2022

Build a graph of FIPS algorithm references #277

Open

adamjanovsky added 2 commits October 27, 2022 12:24

unify and generalize dataset method get_keywords_df()

1b150aa

root_dir setter for FIPSDataset

5110b52

WiP: refactor FIPS get_certs_from_web()

39c89c1

J08nY reviewed Oct 27, 2022

View reviewed changes

sec_certs/sample/fips.py Outdated Show resolved Hide resolved

J08nY reviewed Oct 27, 2022

View reviewed changes

sec_certs/sample/fips.py Outdated Show resolved Hide resolved

J08nY reviewed Oct 27, 2022

View reviewed changes

cli.py Outdated Show resolved Hide resolved

implement artifact download FIPS

9433658

adamjanovsky added 4 commits November 4, 2022 16:33

refactor tests unittest -> pytest

933b469

add type hint for json serialization

80af3a2

new object to hold auxillary datasets

6134479

use temp folders for cc analysis test data

fad4fbf

adamjanovsky added 4 commits November 9, 2022 13:18

mark further download tests with xfail

e26fb0c

fix xfail marker on cpe_dset_from_web test

14c369b

pandas tests, cve_dset, cpe_dset unify json_path approach

8482d82

merge main

c6913dc

J08nY mentioned this pull request Dec 5, 2022

Add rudimentary profiling to processing pipeline #288

Closed

adamjanovsky and others added 6 commits December 5, 2022 17:16

fix here, fix there

b37eaaf

rename dataset of maintenance updates

6c02383

Update sec_certs/dataset/common_criteria.py

8ac389a

Co-authored-by: Ján Jančár <J08nY@users.noreply.github.com>

Update sec_certs/model/cpe_matching.py

bc5a532

Co-authored-by: Ján Jančár <J08nY@users.noreply.github.com>

chain.from_iterable() now working with generator expessions

4712279

fix getitem on fips dataset

f14dfe3

adamjanovsky mentioned this pull request Dec 6, 2022

Parse more information from FIPS module html page #289

Open

adamjanovsky added 4 commits December 6, 2022 18:05

test config global fixture

a1ec986

add pyupgrade into linter pipeline

577300e

reimplement dataset serialization constraints

ed8813e

delete pp dataset json

29dd48c

adamjanovsky requested a review from J08nY December 7, 2022 13:40

J08nY reviewed Dec 7, 2022

View reviewed changes

sec_certs/cli.py Outdated Show resolved Hide resolved

sec_certs/constants.py Show resolved Hide resolved

sec_certs/dataset/dataset.py Show resolved Hide resolved

sec_certs/sample/certificate.py Outdated Show resolved Hide resolved

adamjanovsky added 7 commits December 7, 2022 17:09

update docs

52bddce

attempt to fix pipelines

30ef160

don't download spacy model test pipeline

0bcda6b

test pipeline ubuntu 20.04

7710ef8

disable CPE from web test

7af69ca

try ubuntu 22.04 test runner

be7f6d7

cli print -> click.echo()

7d59063

J08nY approved these changes Dec 8, 2022

View reviewed changes

FIPSCertificate no longer hashable

4574a3d

adamjanovsky merged commit 846144e into main Dec 8, 2022

adamjanovsky deleted the issue/212-FIPS-Unify-CLI-actions-with-CC branch December 8, 2022 12:44

This was referenced Dec 8, 2022

Refactor CC tests #79

Closed

Tests: Make resources locally accessible #76

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor FIPS & unify CLI actions #275

Refactor FIPS & unify CLI actions #275

adamjanovsky commented Oct 25, 2022 •

edited

Loading

J08nY commented Oct 25, 2022

adamjanovsky commented Oct 26, 2022

J08nY commented Oct 26, 2022

adamjanovsky commented Oct 27, 2022

adamjanovsky commented Oct 27, 2022 •

edited

Loading

J08nY commented Oct 27, 2022

J08nY commented Oct 28, 2022

J08nY commented Oct 28, 2022

adamjanovsky commented Nov 9, 2022

adamjanovsky commented Dec 5, 2022

adamjanovsky commented Dec 5, 2022

adamjanovsky commented Dec 5, 2022

J08nY left a comment

Refactor FIPS & unify CLI actions #275

Refactor FIPS & unify CLI actions #275

Conversation

adamjanovsky commented Oct 25, 2022 • edited Loading

High-level plan

Tests

FIPS

Misc

Sanity check

Still part of refactoring, but to-be-addressed by separate PRs

J08nY commented Oct 25, 2022

adamjanovsky commented Oct 26, 2022

J08nY commented Oct 26, 2022

adamjanovsky commented Oct 27, 2022

adamjanovsky commented Oct 27, 2022 • edited Loading

J08nY commented Oct 27, 2022

J08nY commented Oct 28, 2022

J08nY commented Oct 28, 2022

adamjanovsky commented Nov 9, 2022

adamjanovsky commented Dec 5, 2022

adamjanovsky commented Dec 5, 2022

adamjanovsky commented Dec 5, 2022

J08nY left a comment

Choose a reason for hiding this comment

adamjanovsky commented Oct 25, 2022 •

edited

Loading

adamjanovsky commented Oct 27, 2022 •

edited

Loading