Allow Analysis While Importing Separate Data #237

Closed
Zalgo2462 opened this issue Aug 17, 2018 · 1 comment

Zalgo2462 commented Aug 17, 2018

Currently, if you open two terminals and run rita import in one and rita analyze in the other, rita analyze will pick up the databases that rita import is still populating and try to analyze them. The analysis step reads the imported data at several points. If the data changes between those reads, RITA will produce corrupt results. To prevent this, I propose we add an import_finished flag to the MetaDatabase.

We can implement this ready-to-analyze flag by adding the field import_finished to RITA's MetaDatabase database records.

Current MetaDB Database schema:

DBMetaInfo struct {
	ID             bson.ObjectId `bson:"_id,omitempty"`   // Ident
	Name           string        `bson:"name"`            // Top level name of the database
	Analyzed       bool          `bson:"analyzed"`        // Has this database been analyzed
	ImportVersion  string        `bson:"import_version"`  // Rita version at import
	AnalyzeVersion string        `bson:"analyze_version"` // Rita version at analyze
}
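
For illustration, the record with the proposed field added might look like the sketch below. The Go field name ImportFinished and its comment are assumptions; only the import_finished key comes from this proposal. ObjectID is a local stand-in for bson.ObjectId so the snippet compiles without the MongoDB driver.

```go
package main

// ObjectID is a stand-in for bson.ObjectId (assumption made so this
// sketch has no external dependencies).
type ObjectID string

// DBMetaInfo is the current MetaDB record plus the proposed flag.
type DBMetaInfo struct {
	ID             ObjectID `bson:"_id,omitempty"`   // Ident
	Name           string   `bson:"name"`            // Top level name of the database
	ImportFinished bool     `bson:"import_finished"` // (proposed) No more records will be imported
	Analyzed       bool     `bson:"analyzed"`        // Has this database been analyzed
	ImportVersion  string   `bson:"import_version"`  // Rita version at import
	AnalyzeVersion string   `bson:"analyze_version"` // Rita version at analyze
}
```

A bool field defaults to false in Go, so a record created without the flag reads as "import not finished". That is likely also how records written before this change would decode, so existing databases may need a one-time update before they can be analyzed.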

How to Alter the Import Process

  • Before a record is inserted into RITA, the appropriate MetaDatabase database record is created.
  • Records are inserted into the database referenced by the MetaDatabase database record.
  • (new) When it is known that no more records will be inserted into the database referenced by the MetaDatabase record, the import_finished flag is set to true
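
The three import steps above can be sketched as follows. This is a minimal in-memory model, not RITA's implementation: MetaDB, AddNewDB, MarkImportFinished, and Import are hypothetical names, and the insert callback stands in for whatever writes records to MongoDB.

```go
package main

import "errors"

// DBMetaInfo holds only the fields this sketch needs.
type DBMetaInfo struct {
	Name           string
	ImportFinished bool
}

// MetaDB is a hypothetical in-memory stand-in for the MetaDatabase.
type MetaDB struct {
	records map[string]*DBMetaInfo
}

// ErrUnknownDatabase is returned when a database has no MetaDB record.
var ErrUnknownDatabase = errors.New("database not registered in MetaDB")

// NewMetaDB returns an empty MetaDB.
func NewMetaDB() *MetaDB {
	return &MetaDB{records: map[string]*DBMetaInfo{}}
}

// AddNewDB creates the record for name if it does not exist yet.
// New records start with ImportFinished == false.
func (m *MetaDB) AddNewDB(name string) {
	if _, ok := m.records[name]; !ok {
		m.records[name] = &DBMetaInfo{Name: name}
	}
}

// MarkImportFinished sets the flag once no more records will be inserted.
func (m *MetaDB) MarkImportFinished(name string) error {
	rec, ok := m.records[name]
	if !ok {
		return ErrUnknownDatabase
	}
	rec.ImportFinished = true
	return nil
}

// Import registers the database, inserts every record, and sets the flag
// only if all inserts succeeded. On failure the flag stays false.
func Import(m *MetaDB, name string, records []string, insert func(db, record string) error) error {
	m.AddNewDB(name) // step 1: create the MetaDatabase record first
	for _, r := range records {
		if err := insert(name, r); err != nil { // step 2: insert records
			return err
		}
	}
	return m.MarkImportFinished(name) // step 3 (new): seal the database
}
```

Setting the flag as the very last step means an import that fails partway leaves the database marked as unfinished.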

How to Alter the Analyze Process

  • Loop over the databases registered in the MetaDatabase database collection
    • If the database is already analyzed, remove it from consideration
    • If the database is incompatible with the running version of RITA, remove it from consideration
    • (new) If the import process is still altering the database (import_finished == false), remove it from consideration
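
As a rough sketch, the selection loop above could be written like this. SelectForAnalysis and the compatible callback are hypothetical names, and the version check is left to the caller because this issue does not describe how RITA decides compatibility.

```go
package main

// DBMetaInfo holds only the fields the selection loop reads.
type DBMetaInfo struct {
	Name           string
	Analyzed       bool
	ImportVersion  string
	ImportFinished bool
}

// SelectForAnalysis returns the names of the databases that are safe to
// analyze. compatible reports whether a database imported by the given
// version can be analyzed by the running version.
func SelectForAnalysis(dbs []DBMetaInfo, compatible func(importVersion string) bool) []string {
	var ready []string
	for _, db := range dbs {
		if db.Analyzed {
			continue // already analyzed
		}
		if !compatible(db.ImportVersion) {
			continue // imported by an incompatible version
		}
		if !db.ImportFinished {
			continue // (new) the importer may still be writing
		}
		ready = append(ready, db.Name)
	}
	return ready
}
```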

Additionally, this feature will help support streaming importers, which constantly feed data to the RITA system. Today, if rita analyze is run at any time while a streaming importer is active, RITA will produce corrupt results. With the addition of this field, a streaming importer can guarantee that it won't insert any more records into a given database, and RITA can rely on that guarantee to analyze the database safely.
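
One way a streaming importer could offer that guarantee is to write into one database at a time and seal it before moving on to the next. The sketch below is an assumption about how such an importer might be structured; StreamImporter, the numbered database names, and the finished map (standing in for import_finished in the MetaDatabase) are all hypothetical.

```go
package main

import "strconv"

// StreamImporter writes into a single "current" database and seals it
// when it rotates to the next one.
type StreamImporter struct {
	prefix   string
	seq      int
	finished map[string]bool // stand-in for import_finished per database
}

// NewStreamImporter registers the first database as unfinished.
func NewStreamImporter(prefix string) *StreamImporter {
	s := &StreamImporter{prefix: prefix, finished: map[string]bool{}}
	s.finished[s.Current()] = false
	return s
}

// Current returns the name of the database currently receiving records.
func (s *StreamImporter) Current() string {
	return s.prefix + "-" + strconv.Itoa(s.seq)
}

// Rotate seals the current database, registers the next one as
// unfinished, and returns the name of the sealed database.
func (s *StreamImporter) Rotate() string {
	sealed := s.Current()
	s.finished[sealed] = true // promise: no more inserts into this database
	s.seq++
	s.finished[s.Current()] = false
	return sealed
}
```

With this shape, rita analyze would only ever see sealed databases as candidates, while the database currently receiving records stays excluded.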

@Zalgo2462

This change will affect what happens if RITA crashes or is killed during import. Currently, if an import run crashes, the data imported up to that point can still be used for analysis. However, it is very likely that data is missing because of the crash.

This change will prevent users from analyzing data that didn't come from a clean import session.
The affected databases will still appear under show-databases, and they will ultimately need to be deleted.
