presto-query-predictor

presto-query-predictor is a Python module that brings machine learning techniques to the Presto ecosystem. It contains a machine learning pipeline for model training and evaluation, and a query predictor web service that predicts the CPU and memory usage of Presto queries.

Installation

After cloning the GitHub repository, run the following from the repository root:

pip3 install -e .  # Installs the presto-query-predictor package locally
pip3 install -r requirements.txt  # Installs dependencies

Alternatively, install the package from PyPI:

pip3 install presto-query-predictor

We recommend installing the package in a Python virtual environment instead of installing it globally.
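
As a quick sanity check before running the pip3 commands above, the following minimal sketch reports whether the current interpreter is running inside a virtual environment. It assumes an environment created with the standard venv module (or a recent virtualenv); the helper name in_virtualenv is hypothetical and not part of the package.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active" if in_virtualenv() else "global interpreter")
```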

Examples

The query_predictor/ folder contains the core of the package. The examples/ folder contains several example scripts:

load_data.py - An example to load the embedded fake TPCH-based dataset.
transform.py - An example to transform datasets for further training.
train.py - An example to train CPU and memory models.
tune.py - An example to tune classification algorithms.
app.py - An example to create a query predictor web service.

Training

A simple way to get a sense of how the CPU and memory models are trained is to run the examples in the examples/ folder.

cd examples
python3 transform.py
python3 train.py

The presto-query-predictor package can only be executed in a Python 3 environment. It does not support Python 2.

Afterward, the trained models should be generated in the models folder, including

models/
    vec-cpu.bin
    vec-memory.bin
    model-cpu.bin
    model-memory.bin
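
To confirm that training produced all four files listed above, a small check like the following can help. This is only a sketch: the helper name missing_model_files is hypothetical, and the default models path assumes the script is run from the same folder as the training step.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = ("vec-cpu.bin", "vec-memory.bin", "model-cpu.bin", "model-memory.bin")


def missing_model_files(models_dir="models"):
    """Return the expected artifact names that are absent from models_dir."""
    base = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    print("All model files present." if not missing else f"Missing: {missing}")
```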

By default, the vectorizers use the TF-IDF algorithm, and the models are XGBoost classifiers. The training dataset is a fake dataset based on the TPC-H benchmark and contains only 22 samples, so the resulting models are suitable for demonstration only.
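
To illustrate what TF-IDF vectorization does to a SQL query before it reaches a classifier, here is a minimal standard-library sketch. It is not the package's implementation: the function names, the tokenizer, the smoothed IDF formula, and the three sample queries are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(query):
    """Lowercase a SQL string and split it into identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", query.lower())


def fit_idf(queries):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(queries)
    doc_freq = Counter()
    for query in queries:
        doc_freq.update(set(tokenize(query)))  # count each token once per query
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so no token gets weight zero.
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def transform(query, idf):
    """Map one query to an L2-normalized {token: tf-idf weight} dictionary."""
    counts = Counter(tok for tok in tokenize(query) if tok in idf)
    weights = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}


# Hypothetical TPC-H-style queries, used only to exercise the functions.
queries = [
    "SELECT count(*) FROM lineitem",
    "SELECT * FROM orders JOIN lineitem ON o_orderkey = l_orderkey",
    "SELECT sum(l_quantity) FROM lineitem GROUP BY l_returnflag",
]
idf = fit_idf(queries)
features = transform(queries[1], idf)
```

Tokens that appear in every query, such as select, receive the lowest weight, while rarer tokens such as join are weighted higher. A classifier then learns a resource usage class from these weights.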

Serving

Start the web service with

python3 app.py

A Flask web application should then be available at http://0.0.0.0:8000/. Its web UI provides a form where you can enter a query to get a resource prediction.
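
Once app.py is running, a short script can confirm that the web UI responds. The sketch below uses only the standard library; the helper name is_service_up is hypothetical, and it connects through 127.0.0.1 on the assumption that the server runs on the same machine, since 0.0.0.0 is a bind address rather than a destination.

```python
import urllib.error
import urllib.request


def is_service_up(url="http://127.0.0.1:8000/", timeout=3.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-success HTTP status.
        return False


if __name__ == "__main__":
    print("service reachable" if is_service_up() else "service not reachable")
```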

Citation

Please cite the following article (arxiv_link) in your publications if the query predictor helps your work:

@inproceedings{tang2021forecasting,
  title={Forecasting {SQL} query cost at {Twitter}},
  author={Tang, Chunxu and Wang, Beinan and Luo, Zhenxiao and Wu, Huijun and Dasan, Shajan and Fu, Maosong and Li, Yao and Ghosh, Mainak and Kabra, Ruchin and Navadiya, Nikhil Kantibhai and others},
  booktitle={2021 IEEE International Conference on Cloud Engineering (IC2E)},
  pages={154--160},
  year={2021},
  organization={IEEE}
}
