
How to use fast's jsonparser #14

Closed
gizmomogwai opened this issue Dec 30, 2017 · 7 comments
Comments

@gizmomogwai

Hi Marco, I really like the speed that comes with your pull based approach.
I have a simple program that I would like to implement, but I am struggling to apply the pull-based approach to the problem:
I want to analyse NIST's CVE data (https://nvd.nist.gov/vuln/data-feeds), e.g. by searching through a data file and printing out the whole JSON entry that matches an ID.
The data looks like this:

"CVE_Items" : [ {
  "cve" : {
    "data_type" : "CVE",
    "data_format" : "MITRE",
    "data_version" : "4.0",
    "CVE_data_meta" : {
      "ID" : "CVE-1999-0001",
      "ASSIGNER" : "cve@mitre.org"
    },
    "affects" : {
      ...
    },
    "problemtype" : {
      ...
    },
    "references" : {
      ...
    },
    "description" : {
      ...
    }
  },
  "configurations" : {
    ...
  },
  "impact" : {
    ...
  },
  "publishedDate" : "1999-12-30T05:00Z",
  "lastModifiedDate" : "2010-12-16T05:00Z"
}, {
  "cve" : {
   ...

With your nice library I can easily write something like this:

foreach (cveFile; cves) {
    foreach (item; cveFile.CVE_Items) {
        cveFile.cve.CVE_data_meta.keySwitch!("ID")({
            auto id = cveFile.read!string;
            if (id in toFind) writeln(id);
        });
    }
}

But instead of just outputting the ID, I would like to dump everything that belongs to the object containing the matching ID.

What's the best way to do this?

@mleise
Collaborator

mleise commented Dec 31, 2017

This could be solved if I had implemented some sort of parser "snapshotting". But right now there is no way to get back to the start of the CVE item, once you reach the "ID". I see the use case and it makes sense to implement something like saveSnapshot() and loadSnapshot() as an extension in the future. For now you'll have to digest the entire JSON and lose the benefit of paying for what you use.

@mleise
Collaborator

mleise commented Dec 31, 2017

All that really needs to be saved and restored is m_text and m_nesting from here: https://github.com/mleise/fast/blob/master/source/fast/json.d#L200

@mleise mleise closed this as completed in 093c2eb Dec 31, 2017
@mleise
Collaborator

mleise commented Dec 31, 2017

Example usage:

import fast.json;
import std.stdio;

struct CVE {
	string  data_type;
	string  data_format;
	string  data_version;
	CVEMeta CVE_data_meta;
}

struct CVEMeta {
	string ID;
	string ASSIGNER;
}

void main() {
	bool[string] shoppingList = ["CVE-2017-0006":true, "CVE-2017-9999":true];
	with (parseJSONFile("nvdcve-1.0-2017.json")) {
		foreach (n; CVE_Items) {
			with (cve) {
				const backup = state;
				const id = CVE_data_meta.ID.borrowString();
				if (id in shoppingList) {
					state = backup;
					writeln(json.read!CVE());
				}
			}
		}
	}
}

Runs at ~1100 MiB/s for me when compiled with LDC2 (4th-gen i5 @ 2.3 GHz, DDR3).

@gizmomogwai
Author

Wow ... thanks a lot ... I will give this a try. At the moment I am still struggling to get uncompressed data into D (I do not even reach Java speed right now for gzipped data).
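For the decompression step, one Phobos-only option is std.zlib, which understands the gzip header format. This is a minimal, hedged sketch of the round trip; in practice you would feed file chunks to UnCompress instead of an in-memory buffer, and the file name and sample string here are made up for illustration. Whether this beats a directly linked system zlib is a separate question.

```d
import std.stdio;
import std.zlib : Compress, UnCompress, HeaderFormat;

void main() {
    // Round-trip demo: gzip-compress a small JSON snippet, then
    // stream it back through UnCompress, the class you would feed
    // file chunks to when reading e.g. nvdcve-1.0-2017.json.gz.
    auto data = cast(const(ubyte)[]) `{"CVE_data_meta":{"ID":"CVE-1999-0001"}}`;

    auto comp = new Compress(HeaderFormat.gzip);
    auto gz = comp.compress(data) ~ comp.flush();

    auto uc = new UnCompress(HeaderFormat.gzip);
    auto plain = uc.uncompress(gz) ~ uc.flush();

    // Prints the original JSON snippet.
    writeln(cast(const(char)[]) plain);
}
```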

@gizmomogwai
Author

Two more questions :)

  • How do you measure your throughput?
  • Is there a way to get a whole JSON subtree parsed? I have seen the thing with associative arrays from string to a JSON type like string, int, or float, but this does not work for deeper trees, right?

@mleise
Collaborator

mleise commented Dec 31, 2017

So you need fast.gzip as well? ;-) (Maybe the system zlib is faster when linked into your D program than Phobos.)

For the throughput I downloaded the 2017 CVE JSON (74 MiB unzipped) and used the simple time command on the program I posted above. That showed about 67 ms in the best case.

You can parse JSON sub-trees of any depth (as in the example program above) if you know the structure, i.e. nested structs and arrays work. If you don't know the structure, you need to manually iterate over the elements and store them in "Variants".
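The quoted figures are consistent with the earlier ~1100 MiB/s claim: 74 MiB in roughly 67 ms. A quick check of the arithmetic in D:

```d
import std.stdio;

void main() {
    // Numbers from the measurement above: 74 MiB parsed in ~67 ms.
    enum double fileMiB = 74.0;
    enum double seconds = 0.067;
    // ~1104 MiB/s, matching the "~1100 MiB/s" figure.
    writefln("%.0f MiB/s", fileMiB / seconds);
}
```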

@gizmomogwai
Copy link
Author

Thanks again. I will look into the gzip thing and also look how to work with the subtrees! Happy new year!
