In [1]:
import pandas as pd
import sqlite3 as sql

In [2]:
database = 'parchposey.db'
connection = sql.connect(database)

# JOIN
### This article explains how to use JOIN in SQL. Put simply, JOIN lets us retrieve data from several tables in a single query.
### JOIN is used together with ON

In [5]:
query= "\
    SELECT orders.* \
    FROM orders \
    JOIN accounts \
    ON orders.account_id = accounts.id \
    LIMIT 5 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18
3,4,1001,2016-01-02 01:18:24,144,32,0,176,718.56,239.68,0.0,958.24
4,5,1001,2016-02-01 19:27:27,108,29,28,165,538.92,217.21,227.36,983.49


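Let's walk through the example above. With the overall database structure in view, the use of **JOIN** quickly becomes clear.<br>
![](Picture/DB_outline.png)
**SELECT** and **FROM** read data from the **orders** table. The **JOIN** that follows declares that the **orders** table and the **accounts** table should be combined and treated as one new table. How are the two tables combined? The database diagram shows that the account_id column in **orders** and the id column in **accounts** are related, and "ON orders.account_id = accounts.id" declares how the rows are linked.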
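To see the matching step in isolation, here is a minimal sketch that does not need parchposey.db. It builds a tiny in-memory SQLite database whose rows are invented for illustration (the 'Acme' account and all order totals are assumptions, not data from the real database) and runs the same kind of **JOIN** ... **ON** query.

```python
import sqlite3

# Toy in-memory database; the rows below are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES (1001, 'Walmart'), (1002, 'Acme');
    INSERT INTO orders VALUES (1, 1001, 169), (2, 1001, 288), (3, 1002, 50);
""")

# Each order row is paired with the account row whose id equals its account_id.
query = """
    SELECT orders.id, orders.total, accounts.name
    FROM orders
    JOIN accounts
    ON orders.account_id = accounts.id
    ORDER BY orders.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each order picks up the name of the account it belongs to. Passing the same query string to pd.read_sql(query, conn) should give the same rows as a DataFrame.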

Of course, we can also pick just the columns of interest from each table to build the new table.<br>
For example:

In [6]:
query= "\
    SELECT accounts.name, orders.occurred_at \
    FROM orders \
    JOIN accounts \
    ON orders.account_id = accounts.id \
    LIMIT 5 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,occurred_at
0,Walmart,2015-10-06 17:31:14
1,Walmart,2015-11-05 03:34:33
2,Walmart,2015-12-04 04:21:55
3,Walmart,2016-01-02 01:18:24
4,Walmart,2016-02-01 19:27:27


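## PK (Primary Key), FK (Foreign Key)

If you look closely at the ERD of the whole database, you will notice that some column names are prefixed with PK or FK, marking the column as that table's primary key or as a foreign key.<br>
Each table in this database has one column that serves as its primary key, and every row has a unique value in that column.<br>
A foreign key is a column in one table that is the primary key of another table. For example: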
* region_id
* account_id
* sales_rep_id

![](Picture/PKFK.png)
If you look closely at the ERD, you will see that the tables are all linked through primary key and foreign key pairs.


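After all this talk of primary and foreign keys, what do they have to do with **JOIN**?
With primary and foreign keys in mind, look back at the earlier **JOIN** statements: in these examples, **JOIN** ... **ON** links the primary key of one table to the corresponding foreign key column in another table.<br>
In 'ON orders.account_id = accounts.id', id is the primary key of **accounts**, and account_id is the matching foreign key in **orders**.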
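The two properties described above (a primary key is unique, a foreign key points at an existing primary key) can be checked with a small sketch. The table definitions below are my own simplified assumptions, not the actual schema of parchposey.db, and note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,                      -- PK: unique per row
        name TEXT
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES accounts(id),  -- FK: must match an accounts.id
        total INTEGER
    );
    INSERT INTO accounts VALUES (1001, 'Walmart');
    INSERT INTO orders VALUES (1, 1001, 169);
""")

def try_insert(sql):
    """Run one INSERT and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

dup_pk = try_insert("INSERT INTO accounts VALUES (1001, 'Duplicate')")  # repeats a PK value
bad_fk = try_insert("INSERT INTO orders VALUES (2, 9999, 10)")          # no account 9999
good = try_insert("INSERT INTO orders VALUES (2, 1001, 10)")            # valid FK value
print(dup_pk)
print(bad_fk)
print(good)
```

The first two inserts are rejected and the third succeeds, which is why joining on `orders.account_id = accounts.id` can find a matching account for every order.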

## Joining multiple tables (more than two)

**JOIN** can also link more than two tables.
![](Picture/3Tables.png)
To join the three tables **web_events**, **accounts**, and **orders**, write the SQL like this:

In [9]:
query= "\
    SELECT * \
    FROM web_events \
    JOIN accounts \
    ON web_events.account_id = accounts.id \
    JOIN orders \
    ON accounts.id = orders.account_id \
    LIMIT 5 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,id,account_id,occurred_at,channel,id.1,name,website,lat,long,primary_poc,...,account_id.1,occurred_at.1,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18
3,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2016-01-02 01:18:24,144,32,0,176,718.56,239.68,0.0,958.24
4,1,1001,2015-10-06 17:13:58,direct,1001,Walmart,www.walmart.com,40.238496,-75.103297,Tamara Tuma,...,1001,2016-02-01 19:27:27,108,29,28,165,538.92,217.21,227.36,983.49
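Notice that the same web event (id 1) appears in every row of the output, each time paired with a different order. Joining **web_events** to **orders** through **accounts** pairs every event of an account with every order of that account. A minimal sketch with invented toy rows (two events and three orders for one account, all assumptions) makes the row count easy to verify:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE web_events (id INTEGER PRIMARY KEY, account_id INTEGER, channel TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES (1001, 'Walmart');
    INSERT INTO web_events VALUES (1, 1001, 'direct'), (2, 1001, 'direct');
    INSERT INTO orders VALUES (1, 1001, 169), (2, 1001, 288), (3, 1001, 132);
""")

query = """
    SELECT web_events.id AS event_id, orders.id AS order_id
    FROM web_events
    JOIN accounts
    ON web_events.account_id = accounts.id
    JOIN orders
    ON accounts.id = orders.account_id
    ORDER BY event_id, order_id
"""
pairs = conn.execute(query).fetchall()
print(len(pairs))  # 2 events x 3 orders for the same account
print(pairs)
```

Keep this in mind when counting or summing over a multi-table join, because each event or order may be repeated several times in the result.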


## AS
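Remember the **AS** keyword from the earlier chapter? When a SQL statement involves many tables and columns, giving the tables and columns aliases makes the statement more concise and readable.
For example:<br>
FROM tablename AS t1<br>
JOIN tablename2 AS t2<br>
**AS** can even be omitted, though I personally prefer to keep it:<br>
FROM tablename t1<br>
JOIN tablename2 t2<br>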
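Here is a short sketch of both spellings on a toy in-memory database (the rows and the alias names `a`, `o`, `account_name`, and `order_total` are my own choices for illustration). It checks that the query with **AS** and the query without it return the same rows, and that a column alias becomes the column name in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
    INSERT INTO accounts VALUES (1001, 'Walmart');
    INSERT INTO orders VALUES (1, 1001, 169), (2, 1001, 288);
""")

# Aliases written with AS, for both tables and columns.
with_as = """
    SELECT a.name AS account_name, o.total AS order_total
    FROM orders AS o
    JOIN accounts AS a
    ON o.account_id = a.id
    ORDER BY o.id
"""
# The same query with AS omitted.
without_as = """
    SELECT a.name account_name, o.total order_total
    FROM orders o
    JOIN accounts a
    ON o.account_id = a.id
    ORDER BY o.id
"""
cursor = conn.execute(with_as)
column_names = [d[0] for d in cursor.description]  # names reported for the result columns
rows_with_as = cursor.fetchall()
rows_without_as = conn.execute(without_as).fetchall()
print(column_names)
print(rows_with_as == rows_without_as)
```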

That covers the basics of **JOIN**. Now let's work through a few exercises, since practice makes perfect.

Provide a table for all web_events associated with account name of Walmart. There should be three columns. Be sure to include the primary_poc, time of the event, and the channel for each event. Additionally, you might choose to add a fourth column to assure only Walmart events were chosen.

In [None]:
query= "\
    SELECT a.primary_poc, w.occurred_at, w.channel, a.name \
    FROM web_events AS w \
    JOIN accounts AS a \
    ON w.account_id = a.id \
    WHERE a.name = 'Walmart' \
"
df = pd.read_sql(query, connection)
df

Provide a table that provides the region for each sales_rep along with their associated accounts. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.
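One possible approach to this exercise, as a sketch only: I am assuming the tables are named `region` and `sales_reps`, that `sales_reps.region_id` and `accounts.sales_rep_id` are the foreign keys listed earlier, and that each table has a `name` column. The toy rows are invented so the block runs without parchposey.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema; all rows are made up for illustration.
conn.executescript("""
    CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_reps (id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, sales_rep_id INTEGER);
    INSERT INTO region VALUES (1, 'Northeast'), (2, 'West');
    INSERT INTO sales_reps VALUES (10, 'Rep A', 1), (20, 'Rep B', 2);
    INSERT INTO accounts VALUES (1001, 'Walmart', 10), (1002, 'Acme', 20), (1003, 'Zeta', 10);
""")

query = """
    SELECT r.name AS region, s.name AS rep, a.name AS account
    FROM sales_reps AS s
    JOIN region AS r
    ON s.region_id = r.id
    JOIN accounts AS a
    ON a.sales_rep_id = s.id
    ORDER BY a.name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because all three tables have a `name` column, the column aliases are what keep the result readable.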

Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. Your final table should have 3 columns: region name, account name, and unit price. A few accounts have 0 for total, so I divided by (total + 0.01) to assure not dividing by zero.
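A sketch for this exercise under the same assumptions about table names (`region`, `sales_reps`) and key columns, again on invented toy rows. It chains four tables from **orders** up to **region** and computes the unit price as total_amt_usd / (total + 0.01), as suggested in the exercise text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema; all rows are made up for illustration.
conn.executescript("""
    CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_reps (id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, sales_rep_id INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER,
                         total INTEGER, total_amt_usd REAL);
    INSERT INTO region VALUES (1, 'Northeast'), (2, 'West');
    INSERT INTO sales_reps VALUES (10, 'Rep A', 1), (20, 'Rep B', 2);
    INSERT INTO accounts VALUES (1001, 'Walmart', 10), (1002, 'Acme', 20);
    INSERT INTO orders VALUES (1, 1001, 169, 973.43), (2, 1002, 0, 0.0);
""")

query = """
    SELECT r.name AS region,
           a.name AS account,
           o.total_amt_usd / (o.total + 0.01) AS unit_price
    FROM orders AS o
    JOIN accounts AS a
    ON o.account_id = a.id
    JOIN sales_reps AS s
    ON a.sales_rep_id = s.id
    JOIN region AS r
    ON s.region_id = r.id
    ORDER BY o.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The second toy order has a total of 0, so it shows that adding 0.01 to the denominator avoids a division by zero.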