-
Notifications
You must be signed in to change notification settings - Fork 0
Query Builder
Query Builder is implemented with Python, Html and JavaScript as a Jupyter notebook. Here, we describe how the tool can be utilized from the AEGIS platform.
Query Builder is pre-installed and available in every newly created project. In case the Query Builder Jupyter notebook does not already exist in a specific project or is outdated, user should first download it. Then, upload the QueryBuilder_v1.ipynb file into Jupyter from the corresponding menu, as seen in figure below.
Once the notebook is present, there are three ways to access it.
The notebook can be opened from the AEGIS Jupyter Tools page and clicking on the Query Builder tile.
Visit the Datasets page and then open "Jupyter" dataset which is available in all projects. The user should browse for the Query Builder notebook, right click on it and then select the "Open Jupyter Notebook" context menu option.
This approach makes again use of the context menu. Navigate to the Datasets Page and then open a dataset. Then, by right-clicking on a any CSV file, user is given the ability to open that file directly into the Query Builder, as it can be seen in the screenshot above.
Once the notebook has been loaded, the user sees the following:
This is the initial state. All code cells are hidden using the hide code plugin and all output has been cleared. If you are not interested in reviewing or altering the code (it is not required in the standard workflow) you may proceed with hiding the cell toolbars. In order to do that, select "View" from the menu, then "Cell Toolbar" and click on the "None" option, as shown below.
You should now see the following screen:
In order to initialize the notebook user must run the first cell. Make sure that the first cell is selected and either click on the "Run" button from the top Jupyter menu, or press Ctrl + Enter
.
You should see a button appearing as shown below.
Click on the "Initialise QB" button and wait for the execution to stop. When this is done, user should see a screen like the one below, where all cell prompts have been replaced by numbers and the file selection UI has appeared.
Note that the output regarding the Spark application will only appear when the interpreter is started. This means that if the "Initialise QB" button is pressed again, the message will disappear, but the tool will be still functional.
User can see a list of all "Available Datasets" in the current project. The list can also be refreshed. Select the desired dataset and click "Open dataset", this will populate the "Available Files" list.
Once the user selected a file and presses the "Open file" button, the file is loaded (as a PySpark Dataframe) and becomes available for further processing/querying.
Dataframes are essentially data tables. At each moment, Query Builder has two dataframes: the temporary dataframe (TempDF) which holds the result of the last performed action, i.e. the contents of a file or the result of applying a filter/query on the previous contents of TempDF and the master dataframe (MasterDF) which is explicitly updated by pressing a button that moves the TempDF contents to the MasterDF.
When a file is loaded, a form appears with several options for simple data filtering and processing.
Each selected action will open a new form with the necessary parameters. Once the user fills the required information and presses OK a new filter will be added in the queue to be applied when the user wants. As an example, if the user chooses the "Rename Column" filter he/she will need to select which column to rename and which should be the new name.
Once the user presses "OK", the filter will be added in the queue, under the "Selected Filters" label
The user may add more than one such filters before executing them. They will be added in the same list:
When the user wants to apply them, he/she should click on the "Apply Filters" button. When the execution is finished, all successfully applied filters will have changed colour:
The user may continue adding new filters after that point. The list will show both which filters have been already applied and which are pending:
By pressing the "Refresh temp" button, the user will get the first 40 lines of the TempDF. This is useful both when the user first loads a file but also to review the result of applying some filters.
A "Refresh Master" button is also available to preview the contents of the master dataset. Keep in mind that the two dataframes may hold completely different data. The user may also join the two dataframes through the corresponding option in the filters form. The result will be kept in the TempDF.
You may load, process and combine data following the previous steps in any order, until you are satisfied with the end-result. At any time, you may press the "Initialise QB" button to clear the screen and start over. When you have completed your work with the Query Builder, you may either directly exit the Notebook or clear all the output first so that you can have a fresh start without any leftovers the next time. This can be achieved from Jupyter's top menu. Choose Cell -> All Output -> Clear.
Not applicable in the current version, since Query Builder is actually a Jupyter notebook