Skip to content

Query Builder

Christos Botsikas edited this page Jun 28, 2019 · 11 revisions

Deployment and Usage

Query Builder is implemented with Python, Html and JavaScript as a Jupyter notebook. Here, we describe how the tool can be utilized from the AEGIS platform.

STEP 1 - Installation

Query Builder is pre-installed and available in every newly created project. In case the Query Builder Jupyter notebook does not already exist in a specific project or is outdated, user should first download it. Then, upload the QueryBuilder_v1.ipynb file into Jupyter from the corresponding menu, as seen in figure below.

Upload Notebook

STEP 2 - Open the notebook

Once the notebook is present, there are three ways to access it.

a. Via the AEGIS Jupyter Tools page

The notebook can be opened from the AEGIS Jupyter Tools page and clicking on the Query Builder tile.

QB Startup

b. Via the Jupyter notebook file context menu

Visit the Datasets page and then open "Jupyter" dataset which is available in all projects. The user should browse for the Query Builder notebook, right click on it and then select the "Open Jupyter Notebook" context menu option.

QB Startup

c. Via a CSV file context menu

This approach makes again use of the context menu. Navigate to the Datasets Page and then open a dataset. Then, by right-clicking on a any CSV file, user is given the ability to open that file directly into the Query Builder, as it can be seen in the screenshot above.

QB Startup

Once the notebook has been loaded, the user sees the following:

QB Initial State

This is the initial state. All code cells are hidden using the hide code plugin and all output has been cleared. If you are not interested in reviewing or altering the code (it is not required in the standard workflow) you may proceed with hiding the cell toolbars. In order to do that, select "View" from the menu, then "Cell Toolbar" and click on the "None" option, as shown below.

Hide Toolbar

You should now see the following screen:

QB Initial State No Toolbar

STEP 3 - Initialize the notebook

In order to initialize the notebook user must run the first cell. Make sure that the first cell is selected and either click on the "Run" button from the top Jupyter menu, or press Ctrl + Enter. You should see a button appearing as shown below.

Initialise Notebook

Click on the "Initialise QB" button and wait for the execution to stop. When this is done, user should see a screen like the one below, where all cell prompts have been replaced by numbers and the file selection UI has appeared.

Initialised QB

Note that the output regarding the Spark application will only appear when the interpreter is started. This means that if the "Initialise QB" button is pressed again, the message will disappear, but the tool will be still functional.

STEP 4 - Working with the Query Builder

Selecting a file

User can see a list of all "Available Datasets" in the current project. The list can also be refreshed. Select the desired dataset and click "Open dataset", this will populate the "Available Files" list.

Browse Files

Once the user selected a file and presses the "Open file" button, the file is loaded (as a PySpark Dataframe) and becomes available for further processing/querying.

Dataframes are essentially data tables. At each moment, Query Builder has two dataframes: the temporary dataframe (TempDF) which holds the result of the last performed action, i.e. the contents of a file or the result of applying a filter/query on the previous contents of TempDF and the master dataframe (MasterDF) which is explicitly updated by pressing a button that moves the TempDF contents to the MasterDF.

Manipulating the temporary dataframe (TempDF)

Loaded file

When a file is loaded, a form appears with several options for simple data filtering and processing.

Filters form

Each selected action will open a new form with the necessary parameters. Once the user fills the required information and presses OK a new filter will be added in the queue to be applied when the user wants. As an example, if the user chooses the "Rename Column" filter he/she will need to select which column to rename and which should be the new name.

Rename column

Once the user presses "OK", the filter will be added in the queue, under the "Selected Filters" label

Rename Col Filter

The user may add more than one such filters before executing them. They will be added in the same list:

Filter List

When the user wants to apply them, he/she should click on the "Apply Filters" button. When the execution is finished, all successfully applied filters will have changed colour:

Executed Filters

The user may continue adding new filters after that point. The list will show both which filters have been already applied and which are pending:

All filters

Previewing the contents of a dataframe

By pressing the "Refresh temp" button, the user will get the first 40 lines of the TempDF. This is useful both when the user first loads a file but also to review the result of applying some filters.

Temp Preview 2

A "Refresh Master" button is also available to preview the contents of the master dataset. Keep in mind that the two dataframes may hold completely different data. The user may also join the two dataframes through the corresponding option in the filters form. The result will be kept in the TempDF.

Temp and Master Preview

STEP 5 - Cleaning up

You may load, process and combine data following the previous steps in any order, until you are satisfied with the end-result. At any time, you may press the "Initialise QB" button to clear the screen and start over. When you have completed your work with the Query Builder, you may either directly exit the Notebook or clear all the output first so that you can have a fresh start without any leftovers the next time. This can be achieved from Jupyter's top menu. Choose Cell -> All Output -> Clear.

API

Not applicable in the current version, since Query Builder is actually a Jupyter notebook

Clone this wiki locally