GitHub - atakanokan/SyntaxSQLNet: SyntaxSQLNet

This is a Python 3.6.1 and PyTorch 1.0 implementation of the paper referenced below.

SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task

Source code of our EMNLP 2018 paper: SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task.

Citation

@InProceedings{Yu&al.18.emnlp.syntax,
  author =  {Tao Yu and Michihiro Yasunaga and Kai Yang and Rui Zhang and Dongxu Wang and Zifan Li and Dragomir Radev},
  title =   {SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task},
  year =    {2018},  
  booktitle =   {Proceedings of EMNLP},  
  publisher =   {Association for Computational Linguistics},
}

Presentation on the Business Use Case

See Atakan_Okan_Text2SQL.pdf in the repository's root directory.

Environment Setup

The code uses Python 3.7 and PyTorch 1.0.0 with GPU support.
Install the Python dependencies: pip install -r requirements.txt

Download Data, Embeddings, Scripts, and Pretrained Models

Download the dataset from the Spider task website (to be updated), and put tables.json, train.json, and dev.json under the data/ directory.
Download the pretrained GloVe embeddings and save them as glove/glove.%dB.%dd.txt.
Download evaluation.py and process_sql.py from the Spider GitHub page.
Download the preprocessed train/dev datasets and pretrained models from here. The download contains generated_datasets/, which holds:
- generated_data for the original Spider training datasets; pretrained models can be found at generated_data/saved_models
- generated_data_augment for the original Spider + augmented training datasets; pretrained models can be found at generated_data_augment/saved_models
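
The GloVe path template above has two %d slots. In the published GloVe file names (for example glove.6B.300d.txt) these are the corpus size in billions of tokens and the vector dimension. As a minimal sketch, assuming the standard GloVe text format (one token per line, followed by its space-separated vector components), the path could be filled in and the file read as follows. The helper names glove_path and load_glove, the limit parameter, and the example values 42 and 300 are illustrative assumptions and are not taken from this repository's code.

```python
# Path template quoted in this README; the two %d slots are filled below.
GLOVE_TEMPLATE = "glove/glove.%dB.%dd.txt"


def glove_path(billions_of_tokens, dim, template=GLOVE_TEMPLATE):
    """Fill the template with a corpus size (in billions of tokens) and a dimension."""
    return template % (billions_of_tokens, dim)


def load_glove(path, limit=None):
    """Read a GloVe-style text file into a dict of {token: [float, ...]}.

    `limit` is a hypothetical convenience: stop after that many lines,
    which is handy for a quick smoke test on a multi-gigabyte file.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if limit is not None and index >= limit:
                break
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Example values only; use whichever GloVe release you actually downloaded.
print(glove_path(42, 300))  # glove/glove.42B.300d.txt
```

Loading with a small limit first, for instance load_glove(glove_path(42, 300), limit=1000), is a cheap way to confirm that the file sits where the code expects it before starting a long training run.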

Generating Train/dev Data for Modules

You can find preprocessed train/dev data in generated_datasets/.

To generate the data yourself, update the directories marked TODO in preprocess_train_dev_data.py, then run the following command to generate the training files for each module:

python preprocess_train_dev_data.py train|dev
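
The train|dev notation means the script takes one split name per run, so producing both sets takes two invocations. The sketch below drives both runs from Python, assuming it is started from the repository root after the TODO directories have been updated. The helpers preprocess_command and run_preprocessing are hypothetical names introduced here for illustration.

```python
import subprocess
import sys

# The two split names this README lists for the script.
SPLITS = ("train", "dev")


def preprocess_command(split, python=sys.executable):
    """Build the argument list for one preprocessing run."""
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    return [python, "preprocess_train_dev_data.py", split]


def run_preprocessing(splits=SPLITS):
    """Run the script once per split, stopping at the first failure."""
    for split in splits:
        subprocess.run(preprocess_command(split), check=True)


# Show the commands without executing them.
for split in SPLITS:
    print(" ".join(preprocess_command(split, python="python")))
```

Calling run_preprocessing() executes both runs in order; check=True makes a failed train run raise before the dev run starts.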

Folder/File Description

data/ contains the raw train/dev/test data and the table file.
generated_datasets/ is described above.
models/ contains the code for each module.
evaluation.py is for evaluation. It uses process_sql.py.
train.py is the main file for training. Use train_all.sh to train all the modules (see below).
test.py is the main file for testing. It uses supermodel.py to call the trained modules and generate SQL queries. In practice, use test_gen.sh to generate SQL queries.
generate_wikisql_augment.py is for cross-domain data augmentation.

Training

Run train_all.sh to train all the modules. It looks like:

python train.py \
    --data_root       path/to/generated_data \
    --save_dir        path/to/save/trained/module \
    --history_type    full|no \
    --table_type      std|no \
    --train_component <module_name> \
    --epoch           <num_of_epochs>
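
Because train.py trains one module per call, training everything means repeating the command above once per module. The following is a rough sketch of that loop in Python rather than shell. Only the flag names and the full|no and std|no choices come from this README; the helper train_command, the module names "col" and "keyword", the epoch count, and the save directory are placeholders chosen for illustration, so check train_all.sh for the actual values.

```python
import shlex

HISTORY_TYPES = ("full", "no")  # choices listed for --history_type
TABLE_TYPES = ("std", "no")     # choices listed for --table_type


def train_command(module, data_root, save_dir, epochs,
                  history_type="full", table_type="std"):
    """Build the train.py argument list for a single module."""
    if history_type not in HISTORY_TYPES:
        raise ValueError(f"history_type must be one of {HISTORY_TYPES}")
    if table_type not in TABLE_TYPES:
        raise ValueError(f"table_type must be one of {TABLE_TYPES}")
    return [
        "python", "train.py",
        "--data_root", data_root,
        "--save_dir", save_dir,
        "--history_type", history_type,
        "--table_type", table_type,
        "--train_component", module,
        "--epoch", str(epochs),
    ]


# Placeholder module names and epoch count (assumptions, see lead-in).
for module in ["col", "keyword"]:
    cmd = train_command(module, "path/to/generated_data",
                        "path/to/save/trained/module", epochs=100)
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps each argument separate, so a path containing spaces survives intact if the list is later passed to subprocess.run.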

Testing

Run test_gen.sh to generate SQL queries. test_gen.sh looks like:

SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
    --test_data_path  path/to/raw/test/data \
    --models          path/to/trained/module \
    --output_path     path/to/print/generated/SQL \
    --history_type    full|no \
    --table_type      std|no

Flask Testing (Local)

Run the model with the question "What are the maximum and minimum budget of the departments?" and the database name department_management.

Docker image creation and push to Docker Hub

docker build -t model-app .
docker login -> enter your credentials
docker images -> get the image ID of the model-app image
docker tag <your image id> <your docker hub id>/<app name>
docker push <your docker hub id>/<app name>

Kubernetes Deployment

After pushing the Docker image to Docker Hub and creating the Kubernetes cluster, run the following in Cloud Shell:

kubectl run model-app --image=atakanokan/model-app --port 5000
Verify with kubectl get pods
kubectl expose deployment model-app --type=LoadBalancer --port 80 --target-port 5000
Run kubectl get service and note the service's external IP (the EXTERNAL-IP column)

Then run the following from a local terminal:

curl -X GET 'http://<your service IP>/output?english_question=What+are+the+maximum+and+minimum+budget+of+the+departments%3F&database_name=department_management'
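
The query string in the curl call is ordinary form encoding: spaces become + and the question mark becomes %3F. A small sketch of building the same URL from Python with the standard library is shown here, assuming the /output route and the two parameter names from the command above. The helpers build_query_url and ask are hypothetical, and the response is returned as raw text because its format is not documented here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, question, database_name):
    """Return the /output URL with both parameters form-encoded."""
    query = urlencode({
        "english_question": question,
        "database_name": database_name,
    })
    return f"{base_url.rstrip('/')}/output?{query}"


def ask(base_url, question, database_name, timeout=30):
    """Send the GET request and return the response body as text."""
    url = build_query_url(base_url, question, database_name)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only builds and prints the URL; no request is sent here.
print(build_query_url(
    "http://<your service IP>",
    "What are the maximum and minimum budget of the departments?",
    "department_management",
))
```

Letting urlencode handle the escaping avoids hand-editing the query string when the question contains characters such as ?, & or quotes.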

Evaluation

Follow the general evaluation process on the Spider GitHub page.

Cross-Domain Data Augmentation

You can find preprocessed augmented data at generated_datasets/generated_data_augment.

To run the data augmentation yourself, first download wikisql_tables.json and train_patterns.json from here, then run python generate_wikisql_augment.py to generate more training data.

Acknowledgement

The implementation is based on SQLNet. Please cite it too if you use this code.
