The authors propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. This repository contains a simulation of the authors' work[1] as a Python[2] script, which rewrites queries over a database containing duplicates so that each answer is returned with the probability that it is in the clean database.
Synthetic Data Generator, UIS Database Generator and Cora Dataset
Run the simulator script as:
python simulator.py
Select Attribute1,Attribute2,...,AttributeN
from Table1,Table2
where condition1,condition2..,conditionN
groupBy Attribute1,...AttributeN
Dataset Snippet of Customer Table
id | custId | name | balance | prob |
---|---|---|---|---|
c1 | m1 | John | 20 | 0.7 |
c1 | m2 | John | 30 | 0.3 |
c2 | m3 | Mary | 27 | 0.2 |
c2 | m4 | Marion | 5 | 0.8 |
A normal SQL query that fetches the ids of customers with balance > 10 returns one row per matching duplicate:
select id,prob
from customer
where balance>10
id | prob |
---|---|
c1 | 0.7 |
c1 | 0.3 |
c2 | 0.2 |
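With the clean-answers rewriting over the dirty (probabilistic) database, the query instead sums the probabilities of the matching duplicates within each cluster, so each id is returned once with the probability that it is in the clean database: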
select id,sum(prob)
from customer
where balance>10
groupby id
id | prob |
---|---|
c1 | 1.0 |
c2 | 0.2 |
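
The rewritten query can be checked against the Customer snippet with a minimal sketch like the one below. It is not the repository's `simulator.py`: it assumes an in-memory SQLite database, and the names `ROWS` and `clean_answers` are hypothetical, introduced only for illustration. Probabilities are rounded because floating-point addition of 0.7 and 0.3 is not exact.

```python
import sqlite3

# Rows copied from the Customer table snippet: (id, custId, name, balance, prob).
ROWS = [
    ("c1", "m1", "John", 20, 0.7),
    ("c1", "m2", "John", 30, 0.3),
    ("c2", "m3", "Mary", 27, 0.2),
    ("c2", "m4", "Marion", 5, 0.8),
]


def clean_answers(rows, min_balance=10):
    """Return (id, probability) pairs for customers with balance > min_balance.

    Hypothetical helper: sums the probabilities of the matching duplicates
    in each cluster, mirroring the rewritten query shown above.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer "
        "(id TEXT, custId TEXT, name TEXT, balance INTEGER, prob REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT id, SUM(prob) FROM customer "
        "WHERE balance > ? GROUP BY id ORDER BY id",
        (min_balance,),
    )
    # Round to hide floating-point noise in the summed probabilities.
    result = [(cid, round(p, 6)) for cid, p in cur.fetchall()]
    conn.close()
    return result


if __name__ == "__main__":
    for cid, p in clean_answers(ROWS):
        print(cid, p)
```

Both duplicates of c1 satisfy the condition, so their probabilities add up to 1.0, while only one duplicate of c2 (probability 0.2) qualifies.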
[1] P. Andritsos, A. Fuxman, R.J. Miller, "Clean Answers over Dirty Databases: A Probabilistic Approach", Proceedings of the 22nd International Conference on Data Engineering, 2006.