Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema generation for extracted database #34

Closed
markovendelin opened this issue Sep 9, 2020 · 13 comments
Closed

Add schema generation for extracted database #34

markovendelin opened this issue Sep 9, 2020 · 13 comments

Comments

@markovendelin
Copy link

Is your feature request related to a problem? Please describe.

I am evaluating Jailer as a tool to extract part of a database for publishing it. In research, when we collect the data in the lab databases as a part of experimental and analysis routine, it would be helpful if I can extract part of a data together with the schema of the related tables (ideally views and functions as well). Jailer looks to be perfect fit. I can define which table data to extract, associated extraction model, and generate SQL statements. Unfortunately, that generated SQL does not include schema creation.

Describe the solution you'd like

When starting Export tool:

  • include an option to generate schema creation statements
    • as suboptions, suggest to include related views, functions
  • allow to save schema creation SQL statements as a separate file for flexibility

Describe alternatives you've considered

Alternative would be to generate schema dump by database utilities and manually remove all unrelated tables. Sounds like an error-rich solution, though.

Additional context

In principle, Jailer could be used as a part of the publishing model for scientific databases. The generated SQL statements could be published as they are or used to generate extracted database and publish that.

@Wisser
Copy link
Owner

Wisser commented Sep 11, 2020

Would it really make sense to extract a partial data model without the unrelated tables?
The data model would be incomplete. Wouldn't then the applications that work with the model have problems?

@markovendelin
Copy link
Author

Excellent questions and it would depend on the context. For database application testing, probably not. For data extraction from the database used in scientific lab for publishing, it makes great sense.

In the lab, we are using the same database for linking data coming from different experiments. The database is used by several smaller applications that work only on the part of it. So, if I want to publish datasets describing one particular study, I would like to extract the data that is relevant to that study only. This can be done by Jailer through definition of the model and the relevance judged by the researcher. As a part of extraction, I don't need to define database schema with all possible experiments going on as it is irrelevant in the current publication. It would be preferred to extract only the relevant schema as well.

Please let me know if I missed something in your questions.

@Wisser
Copy link
Owner

Wisser commented Sep 15, 2020

I see, that makes sense. Thanks for the insight.
I'll see how this could be realized. I'm just afraid it might be difficult to do this in a DBMS-independent way. Maybe the feature would only be available for some selected DBMSes. Which one would be relevant in this context?

@markovendelin
Copy link
Author

I agree, that could be difficult to do in database-independent way. We are using PostgreSQL. Now thinking of it, maybe I would just need the list of exported tables as I can use pg_dump and specify tables of interest as it's argument as well as dump schema only. That way Jailer would only give the list of tables used in the export and it will be user who would extract schema accordingly.

@Wisser
Copy link
Owner

Wisser commented Sep 15, 2020

This information is contained in the generated SQL script and could theoretically be extracted from it:

$ cat demo-scott.sql
-- generated by Jailer 9.5.5, Tue Sep 15 11:48:36 CEST 2020 from RalfW@W46810

-- Extraction Model:  EMPLOYEE where T.NAME='SCOTT'
-- Source DBMS:       H2
-- ...

-- Exported Rows:     13
--    DEPARTMENT                        2 
--    EMPLOYEE                          3 
--    PROJECT                           2 
--    PROJECT_PARTICIPATION             2 
--    ROLE                              2 
--    SALARYGRADE                       2 

...


$ grep '^--    ' demo-scott.sql | sed 's/--\s*//g' | sed 's/\s*[0-9]*//g'
DEPARTMENT
EMPLOYEE
PROJECT
PROJECT_PARTICIPATION
ROLE
SALARYGRADE

However, all tables from which no rows were exported (e.g. because the table in the source-database happens to be empty), but which are still relevant would be missing. Perhaps a cli-command would be useful, which would return a list of all tables that are potentially (transitively) related to the subject table?

@vsgfe
Copy link
Contributor

vsgfe commented Sep 15, 2020

Perhaps a cli-command would be useful, which would return a list of all tables that are potentially (transitively) related to the subject table?

This list would be useful for me as well. I generate reports about our tables and databases to help manage them. A list of the tables that are transferred between databases using Jailer would be very nice. I could not find an easy way to generate that list.

@markovendelin
Copy link
Author

Perhaps a cli-command would be useful, which would return a list of all tables that are potentially (transitively) related to the subject table?

It would be useful, indeed. Ideally, it should stop associations according to the extraction model. Thus, if we have tables

A -> B -> C

and user disconnected B->C, this would be respected and that CLI tool will not print table C either. Or have such behavior as an option that can be switched on and off.

@Wisser
Copy link
Owner

Wisser commented Sep 17, 2020

In the next release there will be the CLI tool "print-closure":

$ jailer.sh
usage:
...
  jailer print-closure <extraction-model> [<separator>] [-datamodel VAL]
    prints a list of all tables that are directly or transitively associated with a subject table,
    taking into account the restrictions on the associations (the so-called "Closure")
    <separator>: optional separator between table names in the output
...

$ jailer.sh print-closure extractionmodel\Demo-Scott.jm
BONUS
DEPARTMENT
EMPLOYEE
SALARYGRADE

$ jailer.sh print-closure extractionmodel\Demo-Scott.jm ", "
BONUS, DEPARTMENT, EMPLOYEE, SALARYGRADE

If you want to test this in advance, you can unzip the file in the attachment and replace the file "jailer.jar" with it.

jailer.zip

@vsgfe
Copy link
Contributor

vsgfe commented Sep 17, 2020

If you want to test this in advance, you can unzip the file in the attachment and replace the file "jailer.jar" with it.

jailer.zip

I did a quick test with our largest model (118 tables) and it works.
Thank you for this new feature!

@markovendelin
Copy link
Author

Excellent, worked for me as well - exactly as expected. Please feel free to close the issue and thank you very much for your help!

@Wisser
Copy link
Owner

Wisser commented Sep 20, 2020

Available in release 9.5.6.

@Wisser Wisser closed this as completed Sep 20, 2020
@rbeucher
Copy link

rbeucher commented Mar 23, 2022

Hi @Wisser

I am trying to use the CLI tool but I get an error:

2022-03-23 11:44:47,788 [main] ERROR  - './extractionmodel/LT_Canada.jm' does not exist
java.io.FileNotFoundException: './extractionmodel/LT_Canada.jm' does not exist
	at net.sf.jailer.extractionmodel.ExtractionModel.loadDatamodelFolder(ExtractionModel.java:522)
	at net.sf.jailer.Jailer.updateDataModelFolder(Jailer.java:383)
	at net.sf.jailer.Jailer.jailerMain(Jailer.java:274)
	at net.sf.jailer.Jailer.main(Jailer.java:149)
Error: java.io.FileNotFoundException: './extractionmodel/LT_Canada.jm' does not exist

Arguments:  0: {print-closure},  1: {./extractionmodel/LT_Canada.jm}

2022-03-23 11:44:47,797 [main] ERROR  - working directory is /opt/jailer-database-tools/lib/app

The model file definitely exist and is in $HOME/.jailer/extractionmodel

I have installed Jailer from the Arch Linux User Repository. It is installed in /opt. The gui works fine but I'm having trouble with the command line tools. Looks like it is using the wrong working directory. Any idea?

Thanks!

@Wisser
Copy link
Owner

Wisser commented Mar 23, 2022

Hi @rbeucher

The script jailer.sh changes the working directory to /opt/jailer-database-tools/lib/app, so the path to the extraction model must be absolute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants