Skip to content

Query Builder

Christos Botsikas edited this page Jun 28, 2019 · 11 revisions

Deployment and Usage

Query Builder is implemented with Python, Html and JavaScript as a Jupyter notebook. This section describes how the tool can be utilized from the AEGIS platform.

STEP 1 - Installation

Query Builder is, by default, pre-installed and available in every newly created project. In case, however, that the Query Builder Jupyter notebook cannot be found in a specific project or the project is outdated, the user can download it and then, upload the QueryBuilder_v1.ipynb file into Jupyter from the corresponding menu, as seen in figure below.

Upload Notebook

STEP 2 - Open the notebook

Once the notebook is present, there are three ways to access it.

a. Via the AEGIS Jupyter Tools page

The notebook can be opened from the AEGIS Jupyter Tools page by clicking on the Query Builder tile.

QB Startup

b. Via the Jupyter notebook file context menu

Visit the Datasets page and then open "Jupyter" dataset which is available in all projects. The user should browse for the Query Builder notebook, right click on it and then select the "Open Jupyter Notebook" context menu option.

QB Startup

c. Via a CSV file context menu

This approach makes again use of the context menu. Navigate to the Datasets Page and then open a dataset. Then, by right-clicking on any CSV file, the user is given the ability to open that file directly into the Query Builder, as it can be seen in the screenshot above.

QB Startup

Once the notebook has been loaded, the user sees the following:

QB Initial State

This is the initial state. All code cells are hidden using the hide code plugin and all output has been cleared. If the user is not interested in reviewing or altering the code (it is not required in the standard workflow) he/she may proceed with hiding the cell toolbars. In order to do that, select "View" from the menu, then "Cell Toolbar" and click on the "None" option, as shown below.

Hide Toolbar

The user should now see the following screen:

QB Initial State No Toolbar

STEP 3 - Initialize the notebook

In order to initialize the notebook user must run the first cell. Make sure that the first cell is selected and either click on the "Run" button from the top Jupyter menu, or press Ctrl + Enter. User should see a button appearing as shown below.

Initialise Notebook

Click on the "Initialise QB" button and wait for the execution to stop. When this is done, user should see a screen like the one below, where all cell prompts have been replaced by numbers and the file selection UI has appeared.

Initialised QB

Note that the output regarding the Spark application will only appear when the interpreter is started. This means that if the "Initialise QB" button is pressed again, the message will disappear, but the tool will be still functional.

STEP 4 - Working with the Query Builder

Selecting a file

User can see a list of all "Available Datasets" in the current project. The list can also be refreshed. Select the desired dataset and click "Open dataset", this will populate the "Available Files" list.

Browse Files

Once the user selected a file and presses the "Open file" button, the file is loaded (as a PySpark Dataframe) and becomes available for further processing/querying.

Dataframes are essentially data tables. At any moment, Query Builder has two dataframes: the temporary dataframe (TempDF) which holds the result of the last performed action, i.e. the contents of a file or the result of applying a filter/query on the previous contents of TempDF and the master dataframe (MasterDF) which is explicitly updated by pressing a button that moves the TempDF contents to the MasterDF.

Manipulating the temporary dataframe (TempDF)

Loaded file

When a file is loaded, a form appears with several options for simple data filtering and processing.

Filters form

Each selected action will open a new form with the necessary parameters. Once the user fills the required information and presses OK a new filter will be added in the queue to be applied when the user wants. As an example, if the user chooses the "Rename Column" filter he/she will then need to select which column to rename and which should be the new name.

Rename column

Once the user presses "OK", the filter will be added in the queue, under the "Selected Filters" label

Rename Col Filter

The user may add more than one filter before executing them. They will be added in the same list:

Filter List

When the user decides to apply them, he/she should click on the "Apply Filters" button. When the execution is finished, all successfully applied filters will change color:

Executed Filters

The user may continue adding new filters after that point. The list will show which filters have been already applied and which are pending:

All filters

Previewing the contents of a dataframe

By pressing the "Refresh temp" button, the user will get the first 40 lines of the TempDF. This is useful both when the user first loads a file but also to review the result of applying some filters.

Temp Preview 2

A "Refresh Master" button is also available to preview the contents of the master dataset. Keep in mind that the two dataframes may hold completely different data. The user may also join the two dataframes through the corresponding option in the filters form. The result will be kept in the TempDF.

Temp and Master Preview

Export dataframe

The user can use the "Update Query Output" button which will reveal a ready to copy and use python snippet which corresponds to the processes that have been applied on the tempDF.

Query Output

By clicking the "Save master to CSV" button, the Save configuration form will appear where the user can select where to store the masterDF's data.

Save to CSV

STEP 5 - Cleaning up

The user may load, process and combine data following the previous steps in any order, until he/she is satisfied with the end-result. At any time, the user may press the "Initialise QB" button to clear the screen and start over. When the user has completed his/her work with the Query Builder, he/she may either directly exit the Notebook or clear all the output first so that he/she can have a fresh start without any leftovers the next time. This can be achieved from Jupyter's top menu. Choose Cell -> All Output -> Clear.

API

Not applicable in the current version, since Query Builder is actually a Jupyter notebook