In [1]:
import pandas as pd
import sqlite3 as sql

In [2]:
database = 'parchposey.db'
connection = sql.connect(database)

# JOIN
### This article explains how JOIN works in SQL. In short, JOIN lets us retrieve data from several tables in a single query.
### JOIN is used together with ON

In [3]:
query= "\
    SELECT orders.* \
    FROM orders \
    JOIN accounts \
    ON orders.account_id = accounts.id \
    LIMIT 5 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18
3,4,1001,2016-01-02 01:18:24,144,32,0,176,718.56,239.68,0.0,958.24
4,5,1001,2016-02-01 19:27:27,108,29,28,165,538.92,217.21,227.36,983.49


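Let's walk through the example above. With the overall database structure in view, **JOIN** is quick to understand.<br>
![](Picture/DB_outline.png)
**SELECT** and **FROM** read data from the **orders** table. The **JOIN** that follows declares that **orders** and **accounts** are to be combined and treated as one new table. How are the two tables matched up? The database diagram shows that the account_id column in **orders** and the id column in **accounts** are related, and "ON orders.account_id = accounts.id" declares exactly that link. Because the query selects orders.*, the result contains only the columns of **orders**, even though the rows come from the joined table.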
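
To make the matching rule concrete, here is a minimal sketch on a throwaway in-memory database. The two tables only mimic the shape of **orders** and **accounts**; the rows, and the 'Acme' account in particular, are made up for illustration and are not taken from parchposey.db.

```python
import sqlite3

# Toy in-memory database; the rows below are illustrative, not real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES (1001, 'Walmart'), (2001, 'Acme');
    INSERT INTO orders VALUES (1, 1001, 169), (2, 1001, 288), (3, 2001, 541);
""")

# Each order row is paired with the account row whose id equals its account_id.
rows = conn.execute("""
    SELECT orders.id, orders.account_id, accounts.name
    FROM orders
    JOIN accounts
    ON orders.account_id = accounts.id
    ORDER BY orders.id
""").fetchall()

for row in rows:
    print(row)
```

Every order picks up the name of the one account it points to, which is all that the ON clause asks for.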

Of course, we can also pick only the columns of interest from each table to build the new table.<br>
For example:

In [4]:
query= "\
    SELECT accounts.name, orders.occurred_at \
    FROM orders \
    JOIN accounts \
    ON orders.account_id = accounts.id \
    LIMIT 5 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,occurred_at
0,Walmart,2015-10-06 17:31:14
1,Walmart,2015-11-05 03:34:33
2,Walmart,2015-12-04 04:21:55
3,Walmart,2016-01-02 01:18:24
4,Walmart,2016-02-01 19:27:27


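## PK (Primary Key), FK (Foreign Key)

If you look closely at the ERD of the whole database, you will see that some column names are prefixed with PK or FK, marking that column as the primary key or a foreign key of its table.<br>
Every table in this database has one column that serves as its primary key, and each row's value in that column is unique.<br>
A foreign key is a column in one table that refers to the primary key of another table. For example: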
* region_id
* account_id
* sales_rep_id

![](Picture/PKFK.png)
A close look at the ERD shows that all the tables are linked through primary key and foreign key pairs.


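After all this talk about primary and foreign keys, what do they have to do with **JOIN**?
With primary and foreign keys in mind, look back at the earlier **JOIN** statements: **JOIN** ... **ON** simply links the primary key of one table to the matching foreign key column of another.<br>
In 'ON orders.account_id = accounts.id', id is the primary key of **accounts**, and the foreign key in **orders** that refers to it is account_id.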
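
The two properties described above (primary key values are unique, foreign key values must point at an existing primary key) can be sketched with a toy schema. This is only an illustration: the notebook does not show how parchposey.db declares its constraints, so the CREATE TABLE statements here are my assumption. Also note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign key constraints unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical schema, declared here for illustration only.
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES accounts(id)
    );
    INSERT INTO accounts VALUES (1001, 'Walmart');
    INSERT INTO orders VALUES (1, 1001);
""")

# A second account with an existing primary key value is rejected.
try:
    conn.execute("INSERT INTO accounts VALUES (1001, 'Duplicate')")
    duplicate_pk_rejected = False
except sqlite3.IntegrityError:
    duplicate_pk_rejected = True

# An order pointing at a non-existent account is rejected as well.
try:
    conn.execute("INSERT INTO orders VALUES (2, 9999)")
    orphan_fk_rejected = False
except sqlite3.IntegrityError:
    orphan_fk_rejected = True

print(duplicate_pk_rejected, orphan_fk_rejected)
```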

## Joining multiple tables (more than two)

**JOIN** can also link more than two tables.
![](Picture/3Tables.png)
To join the three tables **web_events**, **accounts**, and **orders**, write the SQL like this:

In [5]:
query= "\
    SELECT * \
    FROM web_events \
    JOIN accounts \
    ON web_events.account_id = accounts.id \
    JOIN orders \
    ON accounts.id = orders.account_id \
    LIMIT 5 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,id,account_id,occurred_at,channel,id.1,name,website,lat,long,primary_poc,...,account_id.1,occurred_at.1,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18
3,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2016-01-02 01:18:24,144,32,0,176,718.56,239.68,0.0,958.24
4,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2016-02-01 19:27:27,108,29,28,165,538.92,217.21,227.36,983.49
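
Notice that the same web event (id 1) appears in every row of the output above, each time next to a different order. Joining **web_events** and **orders** through **accounts** pairs every event of an account with every order of that account. The sketch below reproduces the effect on made-up rows: 2 events and 3 orders for one account give 6 joined rows.

```python
import sqlite3

# Toy data: one account with 2 web events and 3 orders (illustrative rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE web_events (id INTEGER PRIMARY KEY, account_id INTEGER, channel TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES (1001, 'Walmart');
    INSERT INTO web_events VALUES (1, 1001, 'direct'), (2, 1001, 'facebook');
    INSERT INTO orders VALUES (1, 1001, 169), (2, 1001, 288), (3, 1001, 132);
""")

rows = conn.execute("""
    SELECT web_events.id, orders.id
    FROM web_events
    JOIN accounts ON web_events.account_id = accounts.id
    JOIN orders ON accounts.id = orders.account_id
    ORDER BY web_events.id, orders.id
""").fetchall()

print(len(rows))  # every (event, order) pair of the account appears once
for row in rows:
    print(row)
```

This is worth keeping in mind before counting or summing over such a three-way join, since each order is repeated once per web event.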


## AS
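Remember the **AS** keyword from the earlier chapter? When a SQL statement involves many tables and columns, giving tables and columns aliases makes the statement shorter and easier to read.
For example:<br>
FROM tablename AS t1<br>
JOIN tablename2 AS t2<br>
**AS** can even be omitted, although I personally prefer to keep it:<br>
FROM tablename t1<br>
JOIN tablename2 t2<br>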
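
As a quick check that the two spellings are interchangeable, the sketch below runs the same join with and without **AS** on a toy in-memory database (made-up rows, with 'Acme' as a hypothetical account) and compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER);
    INSERT INTO accounts VALUES (1001, 'Walmart'), (2001, 'Acme');
    INSERT INTO orders VALUES (1, 1001), (2, 2001);
""")

# Table and column aliases written with AS.
cursor = conn.execute("""
    SELECT a.name AS account, o.id AS order_id
    FROM orders AS o
    JOIN accounts AS a
    ON o.account_id = a.id
    ORDER BY o.id
""")
columns = [col[0] for col in cursor.description]  # result column names
rows_with_as = cursor.fetchall()

# The same aliases with AS left out.
rows_without_as = conn.execute("""
    SELECT a.name account, o.id order_id
    FROM orders o
    JOIN accounts a
    ON o.account_id = a.id
    ORDER BY o.id
""").fetchall()

print(columns)
print(rows_with_as == rows_without_as)
```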

That covers most of the basics of **JOIN**. Now for a few exercises, since practice makes perfect.

Q: Find all web_events associated with Walmart. The table should include primary_poc, occurred_at, and channel.

In [7]:
query= "\
    SELECT a.primary_poc, w.occurred_at, w.channel, a.name \
    FROM web_events AS w \
    JOIN accounts AS a \
    ON w.account_id = a.id \
    WHERE a.name = 'Walmart' \
    LIMIT 5 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,primary_poc,occurred_at,channel,name
0,Tamara Tuma,2015-10-06 04:22:11,facebook,Walmart
1,Tamara Tuma,2015-10-06 17:13:58,direct,Walmart
2,Tamara Tuma,2015-10-22 05:02:47,organic,Walmart
3,Tamara Tuma,2015-10-22 14:04:20,adwords,Walmart
4,Tamara Tuma,2015-11-05 03:08:26,direct,Walmart


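Q: List all orders for every account in every region, and compute the unit price of each order.<br>
The table should contain at least four columns: region.name, accounts.name, unit_price, and orders.id.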

             

In [11]:
query= "\
    SELECT r.name as region, a.name as account, o.total_amt_usd/(o.total + 0.01) as unit_price, o.id \
    FROM region AS r \
    JOIN sales_reps AS s \
    ON s.region_id = r.id \
    JOIN accounts AS a \
    ON a.sales_rep_id = s.id \
    JOIN orders AS o \
    ON o.account_id = a.id \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,region,account,unit_price,id
0,Northeast,Walmart,5.759600,1
1,Northeast,Walmart,5.965175,2
2,Northeast,Walmart,5.879706,3
3,Northeast,Walmart,5.444236,4
4,Northeast,Walmart,5.960184,5
...,...,...,...,...
6907,West,Pacific Life,7.111389,6720
6908,West,Pacific Life,7.682929,6721
6909,West,Pacific Life,6.855753,6722
6910,West,Pacific Life,7.742934,6723
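
The query above divides by o.total + 0.01 rather than by o.total. I read this as a guard against orders whose total is 0, which is my assumption; the notebook does not say so. The sketch below compares three ways of writing the division on two toy rows: the first reuses the figures of order 1 from the output earlier, the second is a made-up order with total 0. `NULLIF` is a standard SQL function that SQLite supports; using it here is an alternative I am suggesting, not something the original query does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER, total_amt_usd REAL);
    -- Row 1 uses the figures of order 1; row 2 is a hypothetical empty order.
    INSERT INTO orders VALUES (1, 169, 973.43), (2, 0, 0.0);
""")

rows = conn.execute("""
    SELECT id,
           total_amt_usd / total            AS plain,    -- NULL when total is 0
           total_amt_usd / (total + 0.01)   AS padded,   -- the notebook's approach
           total_amt_usd / NULLIF(total, 0) AS guarded   -- explicit NULL for total = 0
    FROM orders
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

In SQLite a division by zero yields NULL instead of an error, so the plain version does not crash. The padded version always returns a number, at the cost of a slightly smaller unit price on every row.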


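## Other JOINs
**JOIN** comes in several more specific variants:
* **INNER JOIN**: the same as **JOIN**; keeps only the keys present in both tables (the intersection)
* **FULL OUTER JOIN**: keeps the keys from either table (the union)
* **LEFT JOIN**: keeps every key from the left table; where the right table has no match, its columns are NULL
* **RIGHT JOIN**: keeps every key from the right table; where the left table has no match, its columns are NULL

Below I use some diagrams to explain the differences between these **JOIN**s.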

## INNER JOIN
![](Picture/Inner_join.png)
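Unless stated otherwise, **JOIN** means **INNER JOIN**. In other words, every **JOIN** we have used so far is an **INNER JOIN**. **INNER JOIN** returns only the rows whose keys exist in all of the joined tables.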
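
A small sketch of that filtering behaviour, on made-up rows: the hypothetical account 'Acme' has no orders, and order 3 points at an account id that does not exist, so neither shows up in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER);
    -- 'Acme' has no orders; order 3 references a missing account (toy data).
    INSERT INTO accounts VALUES (1001, 'Walmart'), (2001, 'Acme');
    INSERT INTO orders VALUES (1, 1001), (2, 1001), (3, 3001);
""")

# Only keys present on both sides survive an INNER JOIN.
inner_rows = conn.execute("""
    SELECT a.name, o.id
    FROM orders AS o
    INNER JOIN accounts AS a
    ON o.account_id = a.id
    ORDER BY o.id
""").fetchall()

print(inner_rows)
```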

## LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN
![](Picture/Left_Right_Outer_Join.png)
![](Picture/Left_Join.png)
![](Picture/Right_Join.png)
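
The diagrams can be tried out on a toy database. This is a sketch under a few assumptions: the rows are made up ('Acme' is a hypothetical account without orders, and order 3 references a missing account), and because older SQLite releases do not accept RIGHT JOIN or FULL OUTER JOIN, I emulate them here with LEFT JOIN, by swapping the table order and by taking a UNION of both directions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER);
    INSERT INTO accounts VALUES (1001, 'Walmart'), (2001, 'Acme');
    INSERT INTO orders VALUES (1, 1001), (2, 1001), (3, 3001);
""")

# LEFT JOIN: every account is kept; 'Acme' gets NULL (None) for the order id.
left_rows = conn.execute("""
    SELECT a.name, o.id
    FROM accounts AS a
    LEFT JOIN orders AS o ON o.account_id = a.id
    ORDER BY a.id, o.id
""").fetchall()

# RIGHT JOIN emulated by swapping the tables: every order is kept.
right_rows = conn.execute("""
    SELECT a.name, o.id
    FROM orders AS o
    LEFT JOIN accounts AS a ON o.account_id = a.id
    ORDER BY o.id
""").fetchall()

# FULL OUTER JOIN emulated as the UNION of both directions.
# UNION removes duplicate rows, which is harmless for this toy data.
full_rows = conn.execute("""
    SELECT a.name, o.id
    FROM accounts AS a
    LEFT JOIN orders AS o ON o.account_id = a.id
    UNION
    SELECT a.name, o.id
    FROM orders AS o
    LEFT JOIN accounts AS a ON o.account_id = a.id
""").fetchall()

print(left_rows)
print(right_rows)
print(len(full_rows))
```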

Here are a few final exercises for this chapter. Good luck.

Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

In [18]:
query= "\
    SELECT r.name as region, s.name as rep, a.name as account \
    FROM sales_reps s \
    JOIN region r \
    ON s.region_id = r.id \
    JOIN accounts a \
    ON a.sales_rep_id = s.id \
    WHERE r.name = 'Midwest'\
    LIMIT 10\
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,region,rep,account
0,Midwest,Sherlene Wetherington,Community Health Systems
1,Midwest,Sherlene Wetherington,Progressive
2,Midwest,Sherlene Wetherington,Rite Aid
3,Midwest,Sherlene Wetherington,Time Warner Cable
4,Midwest,Sherlene Wetherington,U.S. Bancorp
5,Midwest,Chau Rowles,Abbott Laboratories
6,Midwest,Chau Rowles,Alcoa
7,Midwest,Chau Rowles,Halliburton
8,Midwest,Chau Rowles,Staples
9,Midwest,Chau Rowles,Tech Data
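
One caveat about the query above: the exercise asks for the accounts sorted alphabetically, but the query has no ORDER BY, so the ten rows shown come back in whatever order SQLite happens to produce them. A minimal sketch of the sorted version follows, on a toy database. The table shapes mirror the notebook's, while the rows and the rep names 'Rep A', 'Rep B', 'Rep C' are placeholders I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_reps (id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, sales_rep_id INTEGER);
    -- Illustrative rows only.
    INSERT INTO region VALUES (1, 'Midwest'), (2, 'West');
    INSERT INTO sales_reps VALUES (10, 'Rep A', 1), (11, 'Rep B', 1), (12, 'Rep C', 2);
    INSERT INTO accounts VALUES
        (100, 'Staples', 11), (101, 'Alcoa', 11),
        (102, 'Rite Aid', 10), (103, 'Mosaic', 12);
""")

rows = conn.execute("""
    SELECT r.name AS region, s.name AS rep, a.name AS account
    FROM sales_reps AS s
    JOIN region AS r
    ON s.region_id = r.id
    JOIN accounts AS a
    ON a.sales_rep_id = s.id
    WHERE r.name = 'Midwest'
    ORDER BY a.name
""").fetchall()

for row in rows:
    print(row)
```

With ORDER BY a.name in place, the accounts come back in A-Z order regardless of which rep owns them.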


Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100. Your final table should have 3 columns: region name, account name, and unit price.

In [17]:
query= "\
    SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price \
    FROM region r \
    JOIN sales_reps s \
    ON s.region_id = r.id \
    JOIN accounts a \
    ON a.sales_rep_id = s.id \
    JOIN orders o \
    ON o.account_id = a.id \
    WHERE o.standard_qty > 100 \
    LIMIT 10\
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,region,account,unit_price
0,Northeast,Walmart,5.7596
1,Northeast,Walmart,5.965175
2,Northeast,Walmart,5.444236
3,Northeast,Walmart,5.960184
4,Northeast,Walmart,6.168719
5,Northeast,Walmart,6.62891
6,Northeast,Walmart,5.646522
7,Northeast,Walmart,6.033417
8,Northeast,Walmart,6.019492
9,Northeast,Walmart,6.109804


Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100 and the poster order quantity exceeds 50. Your final table should have 3 columns: region name, account name, and unit price. Sort for the largest unit price first.

In [26]:
query= "\
    SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price \
    FROM region r \
    JOIN sales_reps s \
    ON s.region_id = r.id \
    JOIN accounts a \
    ON a.sales_rep_id = s.id \
    JOIN orders o \
    ON o.account_id = a.id \
    WHERE o.standard_qty > 100 AND o.poster_qty > 50 \
    ORDER BY unit_price DESC; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,region,account,unit_price
0,Northeast,IBM,8.089906
1,West,Mosaic,8.066329
2,West,Pacific Life,8.063023
3,Northeast,CHS,8.018849
4,West,Fidelity National Financial,7.992802
...,...,...,...
830,West,Stanley Black & Decker,5.266396
831,Northeast,Best Buy,5.260426
832,Northeast,Travelers Cos.,5.235181
833,Southeast,DISH Network,5.231816
