## SQL Aggregations

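Aggregate functions perform a calculation on a set of values and return a single value. All of them ignore NULLs except COUNT(*), which counts every row. When COUNT is given a specific column name and that column contains NULLs, COUNT skips those NULLs as well.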

In [1]:
import pandas as pd
import sqlite3 as sql 

In [2]:
database = 'parchposey.db'
connection = sql.connect(database)

## NULLs
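
A minimal sketch of how NULLs interact with the aggregates. It uses a made-up in-memory table (`demo` and its three rows are assumptions for illustration, not part of parchposey.db) and plain `sqlite3` so it runs without the course database:

```python
import sqlite3

# Hypothetical toy table, not part of parchposey.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, qty INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(1, 10), (2, None), (3, 30)],  # the second row has a NULL qty
)

# COUNT(*) counts every row; COUNT(qty), SUM(qty) and AVG(qty) skip the NULL.
row = conn.execute(
    "SELECT COUNT(*), COUNT(qty), SUM(qty), AVG(qty) FROM demo"
).fetchone()
print(row)  # (3, 2, 40, 20.0)
conn.close()
```

Note that AVG divides by the number of non-NULL values (2), not by the number of rows (3).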


## COUNT() counts the number of rows in a table

In [3]:
query = "\
SELECT COUNT(*) \
FROM accounts \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,COUNT(*)
0,351


Alternatively, you can count the values in a specific column:

In [5]:
query = "\
SELECT COUNT(accounts.id) \
FROM accounts \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,COUNT(accounts.id)
0,351


## SUM()

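Unlike **COUNT**, **SUM** works only on numeric columns. It also skips **NULL** values automatically.<br>
Note that aggregate functions only operate down a column; they do not calculate across the values in a row.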
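
To make the column-versus-row distinction concrete, here is a small sketch on a made-up table (`toy_orders` and its values are assumptions, not data from parchposey.db). The aggregate collapses each column to one value, while row-wise totals need ordinary arithmetic:

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_orders (standard_qty INTEGER, poster_qty INTEGER)")
conn.executemany("INSERT INTO toy_orders VALUES (?, ?)", [(1, 2), (3, 4), (5, 6)])

# Aggregate: one value per column, computed over all rows.
col_totals = conn.execute(
    "SELECT SUM(standard_qty), SUM(poster_qty) FROM toy_orders"
).fetchone()

# Row-wise: plain arithmetic, one result per row, no aggregate involved.
row_totals = conn.execute(
    "SELECT standard_qty + poster_qty FROM toy_orders ORDER BY standard_qty"
).fetchall()

print(col_totals)  # (9, 12)
print(row_totals)  # [(3,), (7,), (11,)]
conn.close()
```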

Q: Find the total quantity of poster_qty in the **orders** table.

In [8]:
query = "\
SELECT SUM(poster_qty) \
FROM orders \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,SUM(poster_qty)
0,723646


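## MIN(), MAX(), and AVG()
These aggregates also skip NULLs automatically.<br>
Unlike SUM(), MIN() and MAX() can also be applied to non-numeric columns. Depending on the column's data type, MIN() returns the smallest number, the earliest date, or the string that sorts first alphabetically. MAX() returns the opposite.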
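
A short sketch of MIN() and MAX() on text and date columns, using a made-up table (`toy_events` and its rows are assumptions for illustration). SQLite stores these timestamps as text, and ISO-formatted strings happen to sort chronologically, which is why the comparison works here:

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_events (channel TEXT, occurred_at TEXT)")
conn.executemany(
    "INSERT INTO toy_events VALUES (?, ?)",
    [
        ("organic", "2016-05-01 10:00:00"),
        ("direct", "2014-02-03 08:15:00"),
        ("twitter", "2016-12-31 23:00:00"),
    ],
)

# MIN/MAX on a string column sort alphabetically; on ISO date strings,
# alphabetical order matches chronological order.
extremes = conn.execute(
    "SELECT MIN(channel), MAX(channel), MIN(occurred_at), MAX(occurred_at) "
    "FROM toy_events"
).fetchone()
print(extremes)
conn.close()
```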


Q: When did the most recent (latest) web_event occur?

In [22]:
query = "\
SELECT MAX(occurred_at) \
FROM web_events; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,MAX(occurred_at)
0,2017-01-01 23:51:09


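Q: For each of the three paper types, find the average quantity per order and the average amount spent (in USD) per order, six values in total.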

In [6]:
query = "\
SELECT AVG(standard_qty) as mean_standard, AVG(gloss_qty) as mean_gloss, \
       AVG(poster_qty) as mean_poster, AVG(standard_amt_usd) as mean_standard_usd, \
       AVG(gloss_amt_usd) as mean_gloss_usd, AVG(poster_amt_usd) as mean_poster_usd \
FROM orders;\
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,mean_standard,mean_gloss,mean_poster,mean_standard_usd,mean_gloss_usd,mean_poster_usd
0,280.432002,146.668547,104.694155,1399.355692,1098.54742,850.116539


## GROUP BY
Take care when combining GROUP BY with LIMIT: LIMIT is applied last, after grouping and ordering, so it truncates the grouped result rather than the rows being aggregated.
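
The interaction between GROUP BY and LIMIT can be sketched on a made-up table (`toy_events` and its rows are assumptions, not data from parchposey.db). The counts below are computed over all rows, and LIMIT only trims how many groups come back:

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_events (channel TEXT)")
conn.executemany(
    "INSERT INTO toy_events VALUES (?)",
    [("direct",), ("direct",), ("direct",),
     ("facebook",), ("facebook",), ("twitter",)],
)

# LIMIT 2 keeps the two largest groups; each count still covers every row
# of its group, because LIMIT runs after GROUP BY and ORDER BY.
top_two = conn.execute(
    "SELECT channel, COUNT(*) AS n FROM toy_events "
    "GROUP BY channel ORDER BY n DESC LIMIT 2"
).fetchall()
print(top_two)  # [('direct', 3), ('facebook', 2)]
conn.close()
```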

Find the total sales in usd for each account. You should include two columns - the total sales for each company's orders in usd and the company name.

In [42]:
query = "\
SELECT a.name, SUM(total_amt_usd) total_sales \
FROM orders o \
JOIN accounts a \
ON a.id = o.account_id \
GROUP BY a.name \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,total_sales
0,3M,127945.10
1,ADP,163579.18
2,AECOM,18491.51
3,AES,13038.64
4,AIG,9980.93
...,...,...
345,World Fuel Services,10833.11
346,Xcel Energy,19975.91
347,Xerox,8759.93
348,Yum Brands,28296.53


Via what channel did the most recent (latest) web_event occur, which account was associated with this web_event? Your query should return only three values - the date, channel, and account name.

In [41]:
query = "\
SELECT w.occurred_at, w.channel, a.name \
FROM web_events w \
JOIN accounts a \
ON w.account_id = a.id \
ORDER BY w.occurred_at DESC \
LIMIT 1 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,occurred_at,channel,name
0,2017-01-01 23:51:09,organic,Molina Healthcare


Find the total number of times each type of channel from the web_events was used. Your final table should have two columns - the channel and the number of times the channel was used.

In [43]:
query = "\
SELECT w.channel, COUNT(*) \
FROM web_events w \
GROUP BY w.channel \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,channel,COUNT(*)
0,adwords,906
1,banner,476
2,direct,5298
3,facebook,967
4,organic,952
5,twitter,474


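Key points:
* GROUP BY can be applied to several columns at once. The order of the columns in GROUP BY does not change which groups are formed; it is the order of the columns in ORDER BY that determines how the result is sorted.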
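
A quick way to check the point about column order is to run the same aggregation with the GROUP BY columns swapped. The sketch below uses a made-up table (`toy_events`, `rep`, and the rows are assumptions for illustration) and fixes the output order with ORDER BY so the two results can be compared directly:

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_events (rep TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO toy_events VALUES (?, ?)",
    [("Ann", "direct"), ("Ann", "direct"), ("Ann", "twitter"), ("Bob", "direct")],
)

template = (
    "SELECT rep, channel, COUNT(*) FROM toy_events "
    "GROUP BY {cols} ORDER BY rep, channel"
)
by_rep_first = conn.execute(template.format(cols="rep, channel")).fetchall()
by_channel_first = conn.execute(template.format(cols="channel, rep")).fetchall()

# Same groups and same counts either way.
print(by_rep_first == by_channel_first)  # True
conn.close()
```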

Stepping up the difficulty!<br>
Q: For each account, determine the average amount of each type of paper they purchased across their orders. Your result should have four columns - one for the account name and one for the average quantity purchased of each of the paper types.

In [44]:
query = "\
SELECT a.name, AVG(o.standard_qty) avg_stand, AVG(o.gloss_qty) avg_gloss, AVG(o.poster_qty) avg_post \
FROM accounts a \
JOIN orders o \
ON a.id = o.account_id \
GROUP BY a.name; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,avg_stand,avg_gloss,avg_post
0,3M,313.392857,279.750000,112.107143
1,ADP,354.633333,60.533333,61.983333
2,AECOM,363.777778,13.888889,16.666667
3,AES,254.666667,304.000000,98.333333
4,AIG,300.500000,349.000000,108.000000
...,...,...,...,...
345,World Fuel Services,208.000000,206.750000,15.000000
346,Xcel Energy,417.166667,157.000000,8.833333
347,Xerox,266.750000,19.000000,88.250000
348,Yum Brands,280.733333,38.666667,24.133333


Q: For each account, determine the average amount spent per order on each paper type. Your result should have four columns - one for the account name and one for the average amount spent on each paper type.

In [47]:
query = "\
SELECT a.name, AVG(o.standard_amt_usd) avg_stand, AVG(o.gloss_amt_usd) avg_gloss, AVG(o.poster_amt_usd) avg_post \
FROM accounts a \
JOIN orders o \
ON a.id = o.account_id \
GROUP BY a.name; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,avg_stand,avg_gloss,avg_post
0,3M,1563.830357,2095.327500,910.310000
1,ADP,1769.620333,453.394667,503.304667
2,AECOM,1815.251111,104.027778,135.333333
3,AES,1270.786667,2276.960000,798.466667
4,AIG,1499.495000,2614.010000,876.960000
...,...,...,...,...
345,World Fuel Services,1037.920000,1548.557500,121.800000
346,Xcel Energy,2081.661667,1175.930000,71.726667
347,Xerox,1331.082500,142.310000,716.590000
348,Yum Brands,1400.859333,289.613333,195.962667


Q: Determine the number of times a particular channel was used in the web_events table for each sales rep. Your final table should have three columns - the name of the sales rep, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.

In [49]:
query = "\
SELECT s.name, w.channel, COUNT(*) num_events \
FROM accounts a \
JOIN web_events w \
ON a.id = w.account_id \
JOIN sales_reps s \
ON s.id = a.sales_rep_id \
GROUP BY s.name, w.channel \
ORDER BY num_events DESC; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,channel,num_events
0,Earlie Schleusner,direct,234
1,Vernita Plump,direct,232
2,Moon Torian,direct,194
3,Georgianna Chisholm,direct,188
4,Tia Amato,direct,185
...,...,...,...
290,Nakesha Renn,organic,1
291,Nakesha Renn,twitter,1
292,Shawanda Selke,banner,1
293,Shawanda Selke,facebook,1


## DISTINCT
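DISTINCT returns only unique values. It applies to all the columns written in the SELECT statement together and returns the unique rows, so DISTINCT is used only once in any given SELECT statement.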

Correct usage:<br>
SELECT DISTINCT column1, column2, column3 <br>
FROM table1; <br>

Incorrect usage:<br>
SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3 <br>
FROM table1;<br>
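
To illustrate that DISTINCT works on the combination of selected columns rather than on each column separately, here is a sketch on a made-up table (`toy_accounts`, `region`, and the rows are assumptions for illustration):

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_accounts (region TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO toy_accounts VALUES (?, ?)",
    [("West", "direct"), ("West", "direct"), ("West", "twitter"), ("East", "direct")],
)

# Unique (region, channel) pairs: only the exact duplicate row is dropped.
pairs = conn.execute(
    "SELECT DISTINCT region, channel FROM toy_accounts ORDER BY region, channel"
).fetchall()

# Unique regions only.
regions = conn.execute(
    "SELECT DISTINCT region FROM toy_accounts ORDER BY region"
).fetchall()

print(pairs)    # [('East', 'direct'), ('West', 'direct'), ('West', 'twitter')]
print(regions)  # [('East',), ('West',)]
conn.close()
```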

In [52]:
query = "\
SELECT DISTINCT id, name \
FROM sales_reps \
LIMIT 10 \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,id,name
0,321500,Samuel Racine
1,321510,Eugena Esser
2,321520,Michel Averette
3,321530,Renetta Carew
4,321540,Cara Clarke
5,321550,Lavera Oles
6,321560,Elba Felder
7,321570,Shawanda Selke
8,321580,Sibyl Lauria
9,321590,Necole Victory


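## HAVING
HAVING is the "clean" way to filter a query that has been aggregated, although this is also commonly done with a subquery. Essentially, any time you want to apply a WHERE-style condition to an element of the query that was created by an aggregate, you need to use HAVING instead.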
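
A minimal sketch of the difference, on a made-up table (`toy_accounts` and its rows are assumptions for illustration). It shows HAVING filtering the groups, a subquery that turns the filtered groups into a single "how many" number, and SQLite rejecting an aggregate inside WHERE:

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_accounts (name TEXT, rep TEXT)")
conn.executemany(
    "INSERT INTO toy_accounts VALUES (?, ?)",
    [("a1", "Ann"), ("a2", "Ann"), ("a3", "Ann"),
     ("b1", "Bob"), ("c1", "Cy"), ("c2", "Cy")],
)

# HAVING filters on the aggregated value, after GROUP BY.
having_rows = conn.execute(
    "SELECT rep, COUNT(*) AS n FROM toy_accounts "
    "GROUP BY rep HAVING COUNT(*) > 1 ORDER BY n"
).fetchall()

# Wrapping the grouped query in a subquery answers "how many reps?".
how_many = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT rep FROM toy_accounts GROUP BY rep HAVING COUNT(*) > 1"
    ")"
).fetchone()[0]

# An aggregate inside WHERE is not allowed, so this statement fails.
try:
    conn.execute("SELECT rep FROM toy_accounts WHERE COUNT(*) > 1 GROUP BY rep")
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

print(having_rows, how_many, where_rejected)
conn.close()
```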


Q: How many of the sales reps have more than 5 accounts that they manage?

In [55]:
query = "\
SELECT  s.name, COUNT(*) num_accounts \
FROM accounts a \
JOIN sales_reps s \
ON s.id = a.sales_rep_id \
GROUP BY s.name \
HAVING COUNT(*) > 5 \
ORDER BY num_accounts; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,num_accounts
0,Debroah Wardle,6
1,Elba Felder,6
2,Eugena Esser,6
3,Necole Victory,6
4,Samuel Racine,6
5,Sibyl Lauria,6
6,Babette Soukup,7
7,Charles Bidwell,7
8,Cliff Meints,7
9,Derrick Boggess,7


How many accounts have more than 20 orders?

In [57]:
query = "\
SELECT a.name, COUNT(*) num_orders \
FROM accounts a \
JOIN orders o \
ON a.id = o.account_id \
GROUP BY a.name \
HAVING COUNT(*) > 20 \
ORDER BY num_orders; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,num_orders
0,Anthem,21
1,Performance Food Group,21
2,Thrivent Financial for Lutherans,21
3,Jabil Circuit,22
4,Raytheon,22
...,...,...
115,Mosaic,66
116,Arrow Electronics,67
117,Supervalu,68
118,Sysco,68


Q: Which account used facebook most as a channel?

In [58]:
query = "\
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel \
FROM accounts a \
JOIN web_events w \
ON a.id = w.account_id \
WHERE w.channel = 'facebook' \
GROUP BY a.id, a.name, w.channel \
ORDER BY use_of_channel DESC \
LIMIT 1; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,id,name,channel,use_of_channel
0,1851,Gilead Sciences,facebook,16


## DATE
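DATE_TRUNC lets you truncate a date-time column to a particular level of precision. Common levels are day, month, and year.<br>
DATE_PART extracts a specific part of a date, but note that pulling out the month or the day of the week (dow) means the years are no longer kept in order. Instead, you are grouping by that component regardless of which year it belongs to.<br>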

Q: Find the sales in terms of total dollars for all orders in each year, ordered from greatest to least. Do you notice any trends in the yearly sales totals?
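
DATE_TRUNC and DATE_PART are PostgreSQL functions, and the SQLite engine used in this notebook does not provide them. One possible SQLite equivalent is `strftime`, sketched below on a made-up table (`toy_orders` and its rows are assumptions, so the totals say nothing about the real yearly sales in parchposey.db):

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_orders (occurred_at TEXT, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO toy_orders VALUES (?, ?)",
    [
        ("2015-03-01 10:00:00", 100.0),
        ("2015-11-20 09:30:00", 50.0),
        ("2016-01-05 12:00:00", 300.0),
    ],
)

# strftime('%Y', ...) pulls out the year as text, playing the role that
# DATE_PART('year', ...) would play in PostgreSQL.
yearly = conn.execute(
    "SELECT strftime('%Y', occurred_at) AS year, SUM(total_amt_usd) AS total "
    "FROM toy_orders "
    "GROUP BY strftime('%Y', occurred_at) "
    "ORDER BY total DESC"
).fetchall()
print(yearly)  # [('2016', 300.0), ('2015', 150.0)]
conn.close()
```

Pointing the same query at the `orders` table through the notebook's `connection` should give the yearly totals asked for above, assuming `occurred_at` is stored there as ISO-formatted text.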

## CASE 
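* In this lesson, the CASE statement always goes in the SELECT clause.
* CASE must include the keywords WHEN, THEN, and END. ELSE is optional and catches the cases that meet none of the preceding WHEN conditions.
* Between WHEN and THEN you can write any conditional expression using the same operators as in WHERE, including chaining several conditions together with AND and OR.
* You can include multiple WHEN clauses, plus one ELSE clause to handle any conditions not yet covered.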

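Let's see how we can use the CASE statement to get around a division-by-zero error. The query below divides by standard_qty, which is 0 for some orders; engines such as PostgreSQL can raise an error on those rows, while SQLite returns NULL for them.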

SELECT id, account_id, standard_amt_usd/standard_qty AS unit_price <br>
FROM orders <br>
LIMIT 10; <br>

Now, let's use a CASE statement. This way any time the standard_qty is zero, we will return 0, and otherwise we will return the unit_price. <br>


SELECT account_id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
                 ELSE standard_amt_usd/standard_qty END AS unit_price <br>
FROM orders <br>
LIMIT 10; <br>
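
The guarded division can be tried out on a made-up table (`toy_orders` and its three rows are assumptions for illustration, covering a normal order, a zero quantity, and a NULL quantity). The sketch runs both the unguarded and the CASE-guarded version in SQLite:

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_orders (id INTEGER, standard_qty INTEGER, standard_amt_usd REAL)"
)
conn.executemany(
    "INSERT INTO toy_orders VALUES (?, ?, ?)",
    [(1, 10, 50.0), (2, 0, 0.0), (3, None, None)],
)

# Unguarded: SQLite yields NULL (None in Python) when dividing by 0 or NULL.
unguarded = conn.execute(
    "SELECT standard_amt_usd / standard_qty FROM toy_orders ORDER BY id"
).fetchall()

# Guarded: the CASE expression returns 0 for those rows instead.
guarded = conn.execute(
    "SELECT id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0 "
    "ELSE standard_amt_usd / standard_qty END AS unit_price "
    "FROM toy_orders ORDER BY id"
).fetchall()

print(unguarded)  # [(5.0,), (None,), (None,)]
print(guarded)    # [(1, 5.0), (2, 0), (3, 0)]
conn.close()
```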

Q: Write a query to display for each order, the account ID, total amount of the order, and the level of the order - 'Large' or 'Small' - depending on if the order is $3000 or more, or less than $3000.

In [65]:
query = "\
SELECT account_id, total_amt_usd, \
CASE WHEN total_amt_usd >= 3000 THEN 'Large' \
ELSE 'Small' END AS order_level \
FROM orders; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,account_id,total_amt_usd,order_level
0,1001,973.43,Small
1,1001,1718.03,Small
2,1001,776.18,Small
3,1001,958.24,Small
4,1001,983.49,Small
...,...,...,...
6907,4501,2024.48,Small
6908,4501,1486.06,Small
6909,4501,1449.74,Small
6910,4501,1473.92,Small


Q: We would like to identify top performing sales reps, which are sales reps associated with more than 200 orders. Create a table with the sales rep name, the total number of orders, and a column with top or not depending on if they have more than 200 orders. Place the top sales people first in your final table.

In [69]:
query = "\
SELECT s.name, COUNT(*) num_ords, \
     CASE WHEN COUNT(*) > 200 THEN 'top' \
     ELSE 'not' END AS sales_rep_level \
FROM orders o \
JOIN accounts a \
ON o.account_id = a.id \
JOIN sales_reps s \
ON s.id = a.sales_rep_id \
GROUP BY s.name \
ORDER BY sales_rep_level DESC; \
"
df = pd.read_sql(query, connection)
df

Unnamed: 0,name,num_ords,sales_rep_level
0,Vernita Plump,299,top
1,Tia Amato,267,top
2,Nelle Meaux,241,top
3,Moon Torian,250,top
4,Maryanna Fiorentino,204,top
5,Maren Musto,224,top
6,Georgianna Chisholm,256,top
7,Earlie Schleusner,335,top
8,Dorotha Seawell,208,top
9,Charles Bidwell,205,top


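Chapter review:

You have now picked up a lot of useful SQL skills. The combination of JOINs and aggregations is one of the reasons SQL is such a powerful tool.

If a particular topic gave you trouble, I suggest working through those questions again. The more you practice the better, but you also don't want to stay stuck on the same problem for too long!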