GDPRizer

* The project consists of two main Python files:
(1) cleanData.py 
(2) parser.py

In the following we use the TPC-H application; the process can be repeated similarly for any application:
- cleanData.py: this file does the following:
	- It reads the query file ("querylog.csv"), cleans the queries, and writes the cleaned queries to a separate text file ("clean_querylog.txt").
	- It parses the schema file ("schema.sql") and writes the parsed schema to "schema.json". We chose JSON because it can be read directly into a Python dictionary.
	- It reads the cleaned queries from "clean_querylog.txt", parses each of them into JSON using moz_sql_parser (https://github.com/mozilla/moz-sql-parser), and writes the results to "parsed_queries.json" (a sketch of this step follows the list).
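
A minimal sketch of the query-parsing step in cleanData.py, assuming one cleaned query per line in "clean_querylog.txt" (the loop and error handling are illustrative, not the exact implementation):

    import json
    from moz_sql_parser import parse

    parsed = []
    with open("clean_querylog.txt") as f:
        for line in f:
            query = line.strip()
            if not query:
                continue
            try:
                parsed.append(parse(query))  # parse() returns the query as a Python dict
            except Exception:
                pass  # skip queries moz_sql_parser cannot handle

    with open("parsed_queries.json", "w") as out:
        json.dump(parsed, out, indent=2)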

- parser.py: traverses the graph to extract a user's information and compares it against the ground truth to compute precision and recall for each table (the metrics are sketched after this block).
In the main function, you have to specify the locations of the schema JSON (schema.json), the parsed-queries JSON (parsed_queries.json), the redaction JSON (redacted_tpch.json), the ground-truth text file (ground_truth_sql.txt), and the virtual-tables JSON (virtual_tables.json).
Given the primary table, primary key, and starting id as arguments, the traverseLevels function traverses the resulting graph.
The code assumes that a MySQL server is running on localhost at port 3306 with the required database (and data) initialized.
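
How the per-table precision and recall are defined (a standard formulation; extracted and truth are hypothetical sets of row identifiers for one table, not parser.py's actual variables):

    def precision_recall(extracted, truth):
        extracted, truth = set(extracted), set(truth)
        tp = len(extracted & truth)  # rows correctly extracted
        precision = tp / len(extracted) if extracted else 0.0
        recall = tp / len(truth) if truth else 0.0
        return precision, recall

    print(precision_recall([1, 2, 3], [2, 3, 4]))  # (0.666..., 0.666...)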

* Additional code files:
(1) graph.py
(2) queryParser.py
(3) traversal.py
(4) heuristics_graph.py
(5) inclusion_dependency.py
(6) data_structures.py
(7) global_vars.py

- graph.py: this file writes a "graph.txt" file (in Graphviz DOT format) with all the information necessary to construct the graph of foreign-key constraints. A visual representation of the graph can be produced with Graphviz (see step 3 under "How to run the project"); a sketch of the output format follows.
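
An illustrative sketch of emitting a DOT file like "graph.txt" (the edge list below is hypothetical; graph.py derives the edges from the inferred foreign-key constraints):

    # Each edge points from a referencing (child) table to a referenced (parent) table.
    edges = [("orders", "customer"), ("lineitem", "orders")]

    with open("graph.txt", "w") as f:
        f.write("digraph fk_graph {\n")
        for child, parent in edges:
            f.write('  "%s" -> "%s";\n' % (child, parent))
        f.write("}\n")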

- queryParser.py: this file reads the parsed queries from "parsed_queries.json" and infers the foreign-key constraints (a sketch of the idea follows).
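
A hedged sketch of one way to mine join evidence from moz_sql_parser output: an equality between two column references, e.g. {"eq": ["orders.o_custkey", "customer.c_custkey"]}, suggests a foreign-key relationship. The recursive walk below is illustrative, not the exact queryParser.py logic:

    import json

    def find_equalities(node, out):
        # Collect column = column predicates anywhere in the parsed query tree.
        if isinstance(node, dict):
            if "eq" in node and all(isinstance(s, str) for s in node["eq"]):
                out.append(tuple(node["eq"]))
            for v in node.values():
                find_equalities(v, out)
        elif isinstance(node, list):
            for v in node:
                find_equalities(v, out)

    with open("parsed_queries.json") as f:
        queries = json.load(f)

    candidates = []
    for q in queries:
        find_equalities(q, candidates)
    print(candidates)  # candidate foreign-key column pairs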

- traversal.py: this file traverses the relationship graph and extracts the matching data (a generic sketch follows).
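
A generic sketch of the traversal idea: starting from a (table, key, id) triple, follow foreign-key edges breadth-first and collect matching rows. The names below (graph, fetch_rows) are hypothetical stand-ins for traversal.py internals:

    from collections import deque

    def traverse(graph, start_table, start_key, start_id, fetch_rows):
        # graph maps a table to [(child_table, child_key, parent_key), ...];
        # fetch_rows(table, key, ids) returns rows (as dicts) whose key is in ids.
        seen, results = set(), []
        queue = deque([(start_table, start_key, {start_id})])
        while queue:
            table, key, ids = queue.popleft()
            if table in seen or not ids:
                continue
            seen.add(table)
            rows = fetch_rows(table, key, ids)
            results.extend(rows)
            for child, child_key, parent_key in graph.get(table, []):
                queue.append((child, child_key, {r[parent_key] for r in rows}))
        return results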

- heuristics_graph.py: this file creates a relationship graph using data-driven methods.

- inclusion_dependency.py: this file runs all the data-driven methods for identifying inclusion dependencies (the core test is illustrated below).
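
A minimal illustration of the inclusion-dependency test that underlies these data-driven methods: column A is a foreign-key candidate for column B if every value in A also appears in B. In the real pipeline the values would come from the MySQL database:

    def is_inclusion_dependency(values_a, values_b):
        # A is included in B when A's values form a subset of B's values.
        return set(values_a) <= set(values_b)

    print(is_inclusion_dependency([1, 2, 2], [1, 2, 3]))  # True
    print(is_inclusion_dependency([1, 4], [1, 2, 3]))     # False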

- data_structures.py: this file contains the global class definitions for the data structures used.

- global_vars.py: this file contains the global variable definitions.

* Configuration and data files (for each application):
(1) schema.sql: schema of the database
(2) querylog.csv: query log from the run of the application
(3) app_settings.py: per-application configuration: settings for each of the graphs, plus paths to the data files and customization files (an illustrative sketch follows the list)
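
An illustrative sketch of what an app_settings.py for TPC-H might contain (all values below are hypothetical; only the file names come from the descriptions above):

    # Per-application paths used by cleanData.py and parser.py.
    SCHEMA_SQL = "schema.sql"
    QUERY_LOG = "querylog.csv"
    SCHEMA_JSON = "schema.json"
    PARSED_QUERIES = "parsed_queries.json"
    REDACTION_JSON = "redacted_tpch.json"
    GROUND_TRUTH = "ground_truth_sql.txt"
    VIRTUAL_TABLES = "virtual_tables.json"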

* Software needed to run the project:
(1) Python3
(2) Graphviz
(3) moz_sql_parser

* How to run the project?
(1) Run python3 cleanData.py (this takes a long time since it parses all the queries)
(2) Run python3 parser.py -- this automatically calls graph.py to generate the graph and outputs the precision and recall for each table (this step is very fast)
(3) [Optional] Run dot -Tpdf graph.txt -o graph.pdf (this generates the visual representation of the join graph using Graphviz)

* Note: You only need to run cleanData.py once! Once all the parsing is done, you can make changes to parser.py and graph.py without spending a lot of time re-generating the same parsed queries. Of course, if you change your queries, you need to re-run cleanData.py.
