## Relational Algebra Expression Optimization (Query Optimization)

You may need to install markdown and matplotlib (pip install markdown, pip install matplotlib).

In [None]:
%load_ext sql
%sql sqlite://

%load_ext autoreload
%autoreload 2

# To help render markdown
from IPython.display import display, HTML
from markdown import markdown
def render_markdown_raw(m): return display(HTML(markdown(m))) # must be last element of cell.
def render_markdown(m): return render_markdown_raw(m.toMD())
def cost_markdown(q): 
    q.reset_count()
    get_result(q) # run the counters
    return display(HTML(markdown("Total Reads: {0}\n\n".format(q.total_count()) + q.toCount(0))))

# import the relational algebra operators
from relation_algebra import Select, Project, Union, NJoin, CrossProduct, BaseRelation
from relation_algebra import get_result, compare_results

from display_tools import side_by_side

import random
import matplotlib.pyplot as plt

Create tables R, S, and T, and insert sample data.

In [None]:
%%sql
drop table if exists R; create table R(A int, B int);
drop table if exists S; create table S(B int, C int);
drop table if exists T; create table T(C int, D int);

In [None]:
for b in range(0,5,1):
    for a in range(0,10,2):
        %sql INSERT INTO R VALUES (:a, :b);
for b in range(0,5,1):
    for c in range(0,10,2):
        %sql INSERT INTO S VALUES (:b, :c);
for c in range(0,5,1):
    for d in range(0,10,2):
        %sql INSERT INTO T VALUES (:c, :d);
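
As a sanity check on the sample data, the same tables can be rebuilt with Python's standard `sqlite3` module outside the notebook. This is a minimal sketch, not part of the original notebook; the in-memory connection and the helper name `build_sample_db` are assumptions made for illustration.

```python
import sqlite3

def build_sample_db():
    """Rebuild R, S, T in an in-memory SQLite database (illustrative helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE R(A int, B int);
        CREATE TABLE S(B int, C int);
        CREATE TABLE T(C int, D int);
    """)
    # Same nested loops as the notebook cell: 5 x 5 = 25 rows per table.
    conn.executemany("INSERT INTO R VALUES (?, ?)",
                     [(a, b) for b in range(0, 5) for a in range(0, 10, 2)])
    conn.executemany("INSERT INTO S VALUES (?, ?)",
                     [(b, c) for b in range(0, 5) for c in range(0, 10, 2)])
    conn.executemany("INSERT INTO T VALUES (?, ?)",
                     [(c, d) for c in range(0, 5) for d in range(0, 10, 2)])
    return conn

conn = build_sample_db()
n_r = conn.execute("SELECT COUNT(*) FROM R").fetchone()[0]
# Natural join of R and S written as an explicit equi-join on B.
n_join = conn.execute(
    "SELECT COUNT(*) FROM R JOIN S ON R.B = S.B").fetchone()[0]
n_proj = conn.execute(
    "SELECT COUNT(DISTINCT R.B) FROM R JOIN S ON R.B = S.B").fetchone()[0]
print(n_r, n_join, n_proj)  # 25 125 5
```

These sizes match the arguments used later in the notebook's `cost_1(25, 25, 5, 5, 125, 5)` call.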

Review the basic form of a relational algebra expression.

In [None]:
r = %sql SELECT * FROM R;
R = BaseRelation(r, name="R")
s = %sql SELECT * FROM S;
S = BaseRelation(s, name="S")
t = %sql SELECT * FROM T;
T = BaseRelation(t, name="T")

x = Project(["B"], NJoin(R,S))
render_markdown(x)
print(get_result(x))

Get familiar with the cost_markdown function.

In [None]:
cost_markdown(x)

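In a relational database system, cost is dominated by I/O cost, i.e., the number of data reads (note that this differs from spatial database systems). The cost computation makes two assumptions: 1. the storage system caches nothing, neither through buffer management nor through an on-disk cache; 2. the natural join is implemented with one specific algorithm, so its cost depends on which algorithm that is (section 2 below models a simple nested-loop join).

We can also get the number of reads directly.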
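
To make the read-counting model concrete, the following minimal sketch charges one read per tuple fetched and never caches anything, so rescanning a relation is charged again in full. The `CountedRelation` class and the `nested_loop_join` function are hypothetical names for this illustration; they are not the implementation inside `relation_algebra`.

```python
class CountedRelation:
    """A list of tuples that charges one read per tuple fetched (no caching)."""
    def __init__(self, rows):
        self.rows = list(rows)
        self.reads = 0

    def scan(self):
        for row in self.rows:
            self.reads += 1  # every fetch is charged, even on a rescan
            yield row

def nested_loop_join(outer, inner, outer_col, inner_col):
    """Tuple-at-a-time nested-loop join on outer[outer_col] == inner[inner_col]."""
    out = []
    for o in outer.scan():
        for i in inner.scan():  # the inner relation is rescanned per outer tuple
            if o[outer_col] == i[inner_col]:
                out.append(o + i)
    return out

# Same contents as the sample tables R(A, B) and S(B, C).
r_rel = CountedRelation((a, b) for b in range(5) for a in range(0, 10, 2))
s_rel = CountedRelation((b, c) for b in range(5) for c in range(0, 10, 2))

joined = nested_loop_join(r_rel, s_rel, 1, 0)
total_reads = r_rel.reads + s_rel.reads
print(len(joined), total_reads)  # 125 joined tuples, 25 + 25*25 = 650 reads
```

With a cache or a different join algorithm the count would be lower, which is why both assumptions have to be fixed before costs can be compared.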

In [None]:
x.total_count()

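### 1. Optimizing I/O cost

Look for a logically equivalent relational algebra expression that reduces the total number of reads as much as possible. Strategies for rewriting a relational algebra expression:

* Push selections down below joins to reduce the size of the tables being joined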
* Push project down
* Reorder join operations
* ...
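
The effect of the first strategy can be illustrated without the notebook's operators. The sketch below is a toy under stated assumptions: it charges one read per tuple, uses a tuple-at-a-time nested-loop join, does not charge for reading intermediate results held in memory, and uses hypothetical helper names (`scan`, `join_on_b`).

```python
reads = {"count": 0}

def scan(rows):
    """Yield rows, charging one read per tuple (toy cost model, no caching)."""
    for row in rows:
        reads["count"] += 1
        yield row

def join_on_b(outer_rows, inner_rows):
    """Nested-loop join of (A, B) tuples with (B, C) tuples on B."""
    return [(a, b, c)
            for (a, b) in outer_rows
            for (b2, c) in scan(inner_rows)  # inner side rescanned each time
            if b == b2]

r_rows = [(a, b) for b in range(5) for a in range(0, 10, 2)]
s_rows = [(b, c) for b in range(5) for c in range(0, 10, 2)]

# Plan 1: join first, then select A = 2 on the join result.
reads["count"] = 0
plan1 = [t for t in join_on_b(scan(r_rows), s_rows) if t[0] == 2]
plan1_reads = reads["count"]

# Plan 2: push the selection A = 2 below the join.
reads["count"] = 0
selected = [t for t in scan(r_rows) if t[0] == 2]
plan2 = join_on_b(selected, s_rows)
plan2_reads = reads["count"]

print(sorted(plan1) == sorted(plan2), plan1_reads, plan2_reads)  # True 650 150
```

Both plans return the same 25 tuples, but the pushed-down selection shrinks the outer input from 25 to 5 tuples, so the inner relation is rescanned far fewer times.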

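### 2. Cost comparison

Assume relation $R$ has $N$ tuples and relation $S$ has $M$ tuples. Analyze how the cost of the two relational algebra expressions above changes as the data grows. Write the Python functions cost_1 and cost_2, which take the following input parameters: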

* The number of tuples in $R$, $N$
* **_The number of distinct $B$ values in $R$, $N_B$_**
* The number of tuples in $S$, $M$
* **_The number of distinct $B$ values in $S$, $M_B$_**
* The number of tuples in $R\Join_B S$, $O_1$
* The number of tuples in $\Pi_B(R\Join_B S)$, $O_2$

In [None]:
def cost_simple_nlj(n, m):
    """
    Cost to perform a simple NLJ join
    Assuming 1 tuple / page
    """
    return ???

def cost_1(N, M, N_B, M_B, O_1, O_2):
    # YOUR CODE HERE
    return cost

def cost_2(N, M, N_B, M_B, O_1, O_2):
    # YOUR CODE HERE
    return cost

print(cost_1(25, 25, 5, 5, 125, 5))
print(cost_2(25, 25, 5, 5, 125, 5))
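
For the nested-loop helper, one common textbook model, assumed here and not confirmed by the notebook, charges one read per outer tuple plus a full rescan of the inner relation for each outer tuple, giving $n + n \cdot m$ reads at 1 tuple per page. The function name `nlj_reads_sketch` is hypothetical and is kept separate from the exercise's `cost_simple_nlj`.

```python
def nlj_reads_sketch(n, m):
    """Reads for a tuple-at-a-time nested-loop join under the assumed model:
    n reads for the outer relation, plus m inner reads per outer tuple."""
    return n + n * m

# Sample-data sizes from the notebook: 25 tuples in each of R and S.
same_size = nlj_reads_sketch(25, 25)
print(same_size)  # 650

# Under this model the smaller input is the cheaper outer relation.
small_outer = nlj_reads_sketch(5, 25)
large_outer = nlj_reads_sketch(25, 5)
print(small_outer, large_outer)  # 130 150
```

The asymmetry between the last two values is one reason the order of join inputs appears among the rewriting strategies in section 1.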

Plot the cost as a function of the data size $N$ (assuming $N=M$), with $B=5$ and rough estimates for the intermediate result sizes:

In [None]:
B = 5
nrange = range(5,100)

# Plot
plt.plot(nrange, [cost_1(n, n, B, B, n*B, B) for n in nrange])
plt.plot(nrange, [cost_2(n, n, B, B, n*B, B) for n in nrange])
plt.show()

### 3. Optimizing relational algebra expressions

Use the tools above to optimize the following relational algebra expressions.

#### 3.1

In [None]:
x = Select("A", 2, Project(["A","C"], NJoin(R,S)))
render_markdown(x)
print(get_result(x))
cost_markdown(x)

#### 3.2 (In-class check 1)

In [None]:
x = Select("C", 0, Project(["A","C"], Select("B", 0, NJoin(NJoin(R, S), T))))
render_markdown(x)
print(get_result(x))
cost_markdown(x)

#### 3.3 (In-class check 2)

In [None]:
x = Select("C", 0, Project(["C"], Select("D", 2, Select("A", 3, NJoin(R, NJoin(S,T))))))
render_markdown(x)
print(get_result(x))
cost_markdown(x)