<a href="https://colab.research.google.com/github/ccwu0918/book-sqlfifty/blob/main/ch08-case-when/ch08-case-when.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fifty SQL Exercises: A Beginner-Friendly Introduction to Databases

> Conditional Logic

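Readers who are new to data science can skip the code below. Readers who are not, and who want to run this chapter in JupyterLab, must first run the code below to load the required modules and connect to the databases.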

In [None]:
!git clone https://github.com/datainpoint/book-sqlfifty

In [None]:
# %LOAD sqlite3 db=../databases/imdb.db timeout=2 shared_cache=true

In [None]:
%cd databases
!wget -N https://raw.githubusercontent.com/jpwhite3/northwind-SQLite3/master/dist/northwind.db
!wget -N https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
%cd ..

In [None]:
import sqlite3
import unittest
import json
import os
import numpy as np
import pandas as pd
conn = sqlite3.connect('./databases/imdb.db')
conn.execute("""ATTACH './databases/covid19.db' AS covid19""")
conn.execute("""ATTACH './databases/twElection2020.db' AS twElection2020""")
conn.execute("""ATTACH './databases/nba.db' AS nba""")
conn.execute("""ATTACH './databases/northwind.db' AS Northwind""")
conn.execute("""ATTACH './databases/Chinook_Sqlite.sqlite' AS Chinook""")

In [None]:
# %%capture
# load the SQL magic extension
# https://github.com/catherinedevlin/ipython-sql
# this extension allows us to connect to DBs and issue SQL command
%load_ext sql

# now we can use the magic extension to connect to our SQLite DB
# use %sql to write an inline SQL command
# use %%sql to write SQL commands in a cell
%sql sqlite:///databases/imdb.db

In [None]:
%%sql
ATTACH "./databases/covid19.db" AS covid19;
ATTACH "./databases/twElection2020.db" AS twElection2020;
ATTACH "./databases/nba.db" AS nba;
ATTACH "./databases/northwind.db" AS Northwind;
ATTACH "./databases/Chinook_Sqlite.sqlite" AS Chinook;

In [None]:
%%sql
SELECT sqlite_version();

## A Quick Review

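In Chapter 4, "Derived Calculated Columns," we noted that relational and logical operators play a central role in the later chapters on filtering observations and on conditional logic. Applying a relational operator to constants or columns produces a derived column whose value is either 0 (Boolean `False`) or 1 (Boolean `True`); this is what we call a "condition." A logical operator combines several conditions into one. Besides following the `WHERE` keyword to filter the observations of a table, Boolean values have another common use, which is the subject of this chapter: conditional logic.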

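In Chapter 4 we obtained new columns with four kinds of operators: numeric, text, relational, and logical. In Chapter 5, "Functions," we obtained new columns with two families of functions: general functions and aggregate functions. Conditional logic is a third way to produce derived calculated columns: the Boolean value of a condition decides which value is assigned. In practice this technique is also known as binning, encoding, or categorizing.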
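The review above says that a relational operator yields 0 or 1 and that logical operators combine conditions. A minimal sketch of that behavior follows, using Python's built-in `sqlite3` module against an in-memory database rather than the book's `imdb.db`; the column aliases are made up for illustration.

```python
import sqlite3

# In-memory database: constant expressions need no tables.
conn = sqlite3.connect(":memory:")

row = conn.execute(
    """
    SELECT 5 > 3           AS is_greater,  -- a true condition
           5 < 3           AS is_less,     -- a false condition
           5 > 3 AND 5 < 3 AS both,        -- AND of the two conditions
           5 > 3 OR 5 < 3  AS either       -- OR of the two conditions
    """
).fetchone()

print(row)  # (1, 0, 0, 1)
conn.close()
```

SQLite has no separate Boolean type, so each condition comes back as the integer 1 or 0.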

## Deriving Calculated Columns with `CASE WHEN`

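The most basic conditional logic can be expressed as 0 (Boolean `False`) or 1 (Boolean `True`), that is, as a split into two groups, and a relational operation alone is enough. For example, to split movies by release year into two groups: movies released before 2000 get 0 (Boolean `False`) and movies released in 2000 or later get 1 (Boolean `True`).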

In [None]:
%%sql
SELECT title,
       release_year,
       release_year >= 2000 AS released_after_millennium
  FROM movies
 LIMIT 10;

title,release_year,released_after_millennium
The Shawshank Redemption,1994,0
The Godfather,1972,0
The Dark Knight,2008,1
The Godfather Part II,1974,0
12 Angry Men,1957,0
Schindler's List,1993,0
The Lord of the Rings: The Return of the King,2003,1
Pulp Fiction,1994,0
The Lord of the Rings: The Fellowship of the Ring,2001,1
"The Good, the Bad and the Ugly",1966,0


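So when do we need conditional logic? When we do not want the derived column to be a Boolean value, or when there are more than two groups, we can derive the column with a `CASE WHEN` expression instead.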

```sql
SELECT CASE WHEN condition_1 THEN result_1
            WHEN condition_2 THEN result_2 END AS alias;
```

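For example, to split movies by release year into two groups: movies released before 2000 are labeled `'Before millennium'` and movies released in 2000 or later are labeled `'After millennium'`.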

In [None]:
%%sql
SELECT title,
       release_year,
       CASE WHEN release_year >= 2000 THEN 'After millennium'
            WHEN release_year < 2000 THEN 'Before millennium' END AS before_or_after_millennium
  FROM movies
 LIMIT 10;

title,release_year,before_or_after_millennium
The Shawshank Redemption,1994,Before millennium
The Godfather,1972,Before millennium
The Dark Knight,2008,After millennium
The Godfather Part II,1974,Before millennium
12 Angry Men,1957,Before millennium
Schindler's List,1993,Before millennium
The Lord of the Rings: The Return of the King,2003,After millennium
Pulp Fiction,1994,Before millennium
The Lord of the Rings: The Fellowship of the Ring,2001,After millennium
"The Good, the Bad and the Ugly",1966,Before millennium


When the grouping is binary, just like a Boolean value, `CASE WHEN` can take an `ELSE` clause in place of one of the conditions.

```sql
SELECT CASE WHEN condition_1 THEN result_1
            ELSE result_2 END AS alias;
```
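Swapping a `WHEN` for `ELSE` has a side effect worth knowing. A `CASE` expression without `ELSE` yields `NULL` for any row that satisfies none of its conditions, whereas `ELSE` catches every remaining row, including rows whose value is missing. The sketch below illustrates this with Python's `sqlite3` module and a hypothetical in-memory `films` table (not one of the book's databases) that contains one missing release year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only; Film C has no release year.
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film A", 1994), ("Film B", 2008), ("Film C", None)],
)

rows = conn.execute(
    """
    SELECT title,
           CASE WHEN release_year >= 2000 THEN 'After millennium'
                WHEN release_year < 2000 THEN 'Before millennium' END AS without_else,
           CASE WHEN release_year >= 2000 THEN 'After millennium'
                ELSE 'Before millennium' END AS with_else
      FROM films
     ORDER BY title
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

For Film C the first expression returns `None` (SQL `NULL`) because neither comparison is true, while the second returns `'Before millennium'`. Whether that fallback is desirable depends on the data, so the two forms are interchangeable only when the column has no missing values.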

For example, to split movies by release year into `'Before millennium'` (before 2000) and `'After millennium'` (2000 or later), `ELSE` can replace the earlier condition `release_year < 2000`.

In [None]:
%%sql
SELECT title,
       release_year,
       CASE WHEN release_year >= 2000 THEN 'After millennium'
            ELSE 'Before millennium' END AS before_or_after_millennium
  FROM movies
 LIMIT 10;

title,release_year,before_or_after_millennium
The Shawshank Redemption,1994,Before millennium
The Godfather,1972,Before millennium
The Dark Knight,2008,After millennium
The Godfather Part II,1974,Before millennium
12 Angry Men,1957,Before millennium
Schindler's List,1993,Before millennium
The Lord of the Rings: The Return of the King,2003,After millennium
Pulp Fiction,1994,Before millennium
The Lord of the Rings: The Fellowship of the Ring,2001,After millennium
"The Good, the Bad and the Ugly",1966,Before millennium


When there are more than two groups, simply add more `WHEN` clauses and conditions.

```sql
SELECT CASE WHEN condition_1 THEN result_1
            WHEN condition_2 THEN result_2
            ...
            ELSE result_n END AS alias;
```

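For example, to split the movie length `runtime` into four groups: more than 180 minutes is `'Over 3 hours'`; more than 120 and at most 180 minutes is `'Over 2 hours'`; more than 60 and at most 120 minutes is `'Over 1 hour'`; and 60 minutes or less is `'Below 1 hour'`.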

In [None]:
%%sql
SELECT title,
       runtime,
       CASE WHEN runtime > 180 THEN 'Over 3 hours'
            WHEN runtime > 120 THEN 'Over 2 hours'
            WHEN runtime > 60 THEN 'Over 1 hour'
            WHEN runtime <= 60 THEN 'Below 1 hour' END AS runtime_category
  FROM movies
 LIMIT 10;

title,runtime,runtime_category
The Shawshank Redemption,142,Over 2 hours
The Godfather,175,Over 2 hours
The Dark Knight,152,Over 2 hours
The Godfather Part II,202,Over 3 hours
12 Angry Men,96,Over 1 hour
Schindler's List,195,Over 3 hours
The Lord of the Rings: The Return of the King,201,Over 3 hours
Pulp Fiction,154,Over 2 hours
The Lord of the Rings: The Fellowship of the Ring,178,Over 2 hours
"The Good, the Bad and the Ugly",178,Over 2 hours


Of course, we can also use `ELSE` in place of one of the conditions, `runtime <= 60`.

In [None]:
%%sql
SELECT title,
       runtime,
       CASE WHEN runtime > 180 THEN 'Over 3 hours'
            WHEN runtime > 120 THEN 'Over 2 hours'
            WHEN runtime > 60 THEN 'Over 1 hour'
            ELSE 'Below 1 hour' END AS runtime_category
  FROM movies
 LIMIT 10;

title,runtime,runtime_category
The Shawshank Redemption,142,Over 2 hours
The Godfather,175,Over 2 hours
The Dark Knight,152,Over 2 hours
The Godfather Part II,202,Over 3 hours
12 Angry Men,96,Over 1 hour
Schindler's List,195,Over 3 hours
The Lord of the Rings: The Return of the King,201,Over 3 hours
Pulp Fiction,154,Over 2 hours
The Lord of the Rings: The Fellowship of the Ring,178,Over 2 hours
"The Good, the Bad and the Ugly",178,Over 2 hours


## Mutually Exclusive Conditions and Writing Order

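A key question when writing conditional logic is whether the conditions are mutually exclusive. If they are not, the order in which they are written is the order in which values are assigned. Take the earlier example that splits `runtime` into four groups: conditions one through four are `runtime > 180`, `runtime > 120`, `runtime > 60`, and `runtime <= 60`. The last condition, `runtime <= 60`, is mutually exclusive with the others, but the first three overlap (a movie longer than 120 minutes is also longer than 60 minutes, and a movie longer than 180 minutes is also longer than 120 and 60 minutes).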

```sql
CASE WHEN runtime > 180 THEN 'Over 3 hours'
     WHEN runtime > 120 THEN 'Over 2 hours'
     WHEN runtime > 60 THEN 'Over 1 hour'
     WHEN runtime <= 60 THEN 'Below 1 hour' END AS runtime_category
```

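When the conditions of a `CASE WHEN` overlap, the value assigned follows the writing order: each row takes the result of the first condition it satisfies. The example is therefore written in an order that matches the intended result, and it splits the movies into four groups by `runtime`.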

In [None]:
%%sql
SELECT DISTINCT CASE WHEN runtime > 180 THEN 'Over 3 hours' -- expected result
                WHEN runtime > 120 THEN 'Over 2 hours'
                WHEN runtime > 60 THEN 'Over 1 hour'
                WHEN runtime <= 60 THEN 'Below 1 hour' END AS runtime_category
  FROM movies;

runtime_category
Over 2 hours
Over 3 hours
Over 1 hour
Below 1 hour


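If we overlook whether the conditions are mutually exclusive and the order in which they are written, the result may differ from what we expect. For example, writing the condition `runtime > 60` first leaves the final grouping without `'Over 3 hours'` and `'Over 2 hours'`, because every row that would satisfy those two conditions has already been claimed by `runtime > 60`.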

In [None]:
%%sql
SELECT DISTINCT CASE WHEN runtime > 60 THEN 'Over 1 hour' -- unexpected result
                WHEN runtime > 120 THEN 'Over 2 hours'
                WHEN runtime > 180 THEN 'Over 3 hours'
                WHEN runtime <= 60 THEN 'Below 1 hour' END AS runtime_category
  FROM movies;

runtime_category
Over 1 hour
Below 1 hour


If we would rather not depend on the writing order, we can make the conditions mutually exclusive. For example, we still assign `'Over 1 hour'` first, but spell out both the lower and the upper bound of each condition.

In [None]:
%%sql
SELECT DISTINCT CASE WHEN runtime > 60 AND runtime <= 120 THEN 'Over 1 hour' -- expected result
                WHEN runtime > 120 AND runtime <= 180 THEN 'Over 2 hours'
                WHEN runtime > 180 THEN 'Over 3 hours'
                WHEN runtime <= 60 THEN 'Below 1 hour' END AS runtime_category
  FROM movies;

runtime_category
Over 2 hours
Over 3 hours
Over 1 hour
Below 1 hour
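
The first-match behavior described in this section can be checked on a handful of rows. The sketch below uses Python's `sqlite3` module with a hypothetical in-memory `films` table and an illustrative helper, `categorize`, neither of which comes from the book. It compares overlapping conditions written in the wrong order with mutually exclusive conditions written in that same order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one runtime per intended category.
conn.execute("CREATE TABLE films (runtime INTEGER)")
conn.executemany("INSERT INTO films VALUES (?)", [(45,), (96,), (142,), (202,)])


def categorize(case_expression):
    """Return {runtime: category} for the given CASE expression."""
    query = f"SELECT runtime, {case_expression} FROM films ORDER BY runtime"
    return dict(conn.execute(query).fetchall())


# Overlapping conditions: the first WHEN claims every runtime above 60.
overlapping_wrong_order = categorize("""
    CASE WHEN runtime > 60 THEN 'Over 1 hour'
         WHEN runtime > 120 THEN 'Over 2 hours'
         WHEN runtime > 180 THEN 'Over 3 hours'
         ELSE 'Below 1 hour' END""")

# Mutually exclusive conditions: the same order now gives four groups.
exclusive_same_order = categorize("""
    CASE WHEN runtime > 60 AND runtime <= 120 THEN 'Over 1 hour'
         WHEN runtime > 120 AND runtime <= 180 THEN 'Over 2 hours'
         WHEN runtime > 180 THEN 'Over 3 hours'
         ELSE 'Below 1 hour' END""")

print(overlapping_wrong_order)
print(exclusive_same_order)
conn.close()
```

With overlapping conditions, the runtimes 96, 142, and 202 all land in `'Over 1 hour'`. With bounded conditions each runtime gets its own category, whatever order the `WHEN` clauses are written in.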


## Sorting by a `CASE WHEN` Derived Column

`CASE WHEN` can be used after `ORDER BY` as well as after `SELECT`. One way to sort by a `CASE WHEN` derived column is to give it an alias in `SELECT` and then refer to that alias in `ORDER BY`.

```sql
SELECT CASE WHEN condition_1 THEN result_1
            WHEN condition_2 THEN result_2
            ...
            ELSE result_n END AS alias
  FROM TABLE
 ORDER BY alias;
```

In [None]:
%%sql
SELECT title,
       runtime,
       CASE WHEN runtime > 180 THEN 'Over 3 hours'
            WHEN runtime > 120 THEN 'Over 2 hours'
            WHEN runtime > 60 THEN 'Over 1 hour'
            ELSE 'Below 1 hour' END AS runtime_category
  FROM movies
 ORDER BY runtime_category
 LIMIT 10;

title,runtime,runtime_category
Sherlock Jr.,45,Below 1 hour
12 Angry Men,96,Over 1 hour
The Silence of the Lambs,118,Over 1 hour
Life Is Beautiful,116,Over 1 hour
Back to the Future,116,Over 1 hour
Psycho,109,Over 1 hour
Léon: The Professional,110,Over 1 hour
The Lion King,88,Over 1 hour
American History X,119,Over 1 hour
The Usual Suspects,106,Over 1 hour


The other way is to skip the alias in `SELECT` and write the `CASE WHEN` expression directly after `ORDER BY`; note that the trailing `AS alias` must then be dropped.

```sql
SELECT columns
  FROM TABLE
 ORDER BY CASE WHEN condition_1 THEN result_1
               WHEN condition_2 THEN result_2
               ...
               ELSE result_n END;
```

In [None]:
%%sql
SELECT title,
       runtime
  FROM movies
 ORDER BY CASE WHEN runtime > 180 THEN 'Over 3 hours'
               WHEN runtime > 120 THEN 'Over 2 hours'
               WHEN runtime > 60 THEN 'Over 1 hour'
               ELSE 'Below 1 hour' END
 LIMIT 10;

title,runtime
Sherlock Jr.,45
12 Angry Men,96
The Silence of the Lambs,118
Life Is Beautiful,116
Back to the Future,116
Psycho,109
Léon: The Professional,110
The Lion King,88
American History X,119
The Usual Suspects,106
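
In the examples above the category labels happen to sort alphabetically in the same order as their meaning. When they do not, a `CASE WHEN` in `ORDER BY` can return a number that encodes the intended order. The sketch below is one way to do that, using Python's `sqlite3` module with a hypothetical in-memory `films` table and made-up labels (`'Short'`, `'Medium'`, `'Long'`) that are not part of the book's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE films (title TEXT, runtime INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film A", 142), ("Film B", 45), ("Film C", 202), ("Film D", 96)],
)

rows = conn.execute(
    """
    SELECT title,
           CASE WHEN runtime > 120 THEN 'Long'
                WHEN runtime > 60 THEN 'Medium'
                ELSE 'Short' END AS length_label
      FROM films
     ORDER BY CASE WHEN runtime > 120 THEN 3   -- numeric rank, not the label
                   WHEN runtime > 60 THEN 2
                   ELSE 1 END,
              title
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Sorting by the label itself would put `'Long'` before `'Medium'` and `'Short'`; sorting by the numeric rank lists the shortest film first.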


## Filtering by a `CASE WHEN` Derived Column

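Besides `SELECT` and `ORDER BY`, `CASE WHEN` can also be used together with `WHERE`. To filter the observations of a table by a `CASE WHEN` derived column, give the column an alias in `SELECT`, then build a condition in `WHERE` from that alias and a relational operator.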

```sql
SELECT CASE WHEN condition_1 THEN result_1
            WHEN condition_2 THEN result_2
            ...
            ELSE result_n END AS alias
  FROM TABLE
 WHERE conditions;
```

In [None]:
%%sql
SELECT title,
       runtime,
       CASE WHEN runtime > 180 THEN 'Over 3 hours'
            WHEN runtime > 120 THEN 'Over 2 hours'
            WHEN runtime > 60 THEN 'Over 1 hour'
            ELSE 'Below 1 hour' END AS runtime_category
  FROM movies
 WHERE runtime_category = 'Over 3 hours'
 ORDER BY runtime_category, 
          runtime DESC;

title,runtime,runtime_category
Gone with the Wind,238,Over 3 hours
Once Upon a Time in America,229,Over 3 hours
Lawrence of Arabia,218,Over 3 hours
Ben-Hur,212,Over 3 hours
Seven Samurai,207,Over 3 hours
The Godfather Part II,202,Over 3 hours
The Lord of the Rings: The Return of the King,201,Over 3 hours
Schindler's List,195,Over 3 hours
Gandhi,191,Over 3 hours
The Green Mile,189,Over 3 hours
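
Referring to a `SELECT` alias inside `WHERE` works in SQLite, but to the best of my knowledge several other database systems, such as PostgreSQL and MySQL, reject it. A more portable alternative is to repeat the `CASE WHEN` expression in `WHERE`. The sketch below compares the two forms with Python's `sqlite3` module on a hypothetical in-memory `films` table; the `'Up to 3 hours'` label is made up for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE films (title TEXT, runtime INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film A", 238), ("Film B", 142), ("Film C", 195)],
)

# Form 1: filter on the alias (accepted by SQLite).
alias_in_where = conn.execute(
    """
    SELECT title,
           CASE WHEN runtime > 180 THEN 'Over 3 hours'
                ELSE 'Up to 3 hours' END AS runtime_category
      FROM films
     WHERE runtime_category = 'Over 3 hours'
     ORDER BY title
    """
).fetchall()

# Form 2: repeat the CASE expression in WHERE (no alias needed).
case_in_where = conn.execute(
    """
    SELECT title,
           CASE WHEN runtime > 180 THEN 'Over 3 hours'
                ELSE 'Up to 3 hours' END AS runtime_category
      FROM films
     WHERE CASE WHEN runtime > 180 THEN 'Over 3 hours'
                ELSE 'Up to 3 hours' END = 'Over 3 hours'
     ORDER BY title
    """
).fetchall()

print(alias_in_where)
print(case_in_where)
conn.close()
```

Both queries return the same two films in SQLite. The second form is longer, but it does not depend on alias resolution in `WHERE`.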


## Key Takeaways

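- Conditional logic is a third way to produce derived calculated columns: the Boolean value of a condition decides which value is assigned. In practice this technique is also known as binning, encoding, or categorizing.
- SQL keywords learned in this chapter:
    - `CASE WHEN`
    - `THEN`
    - `ELSE`
    - `END`
- When all the SQL keywords learned so far are combined in one statement, they must be written in the order that standard SQL prescribes.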

```sql
SELECT DISTINCT columns AS alias,
       CASE WHEN condition_1 THEN result_1
            WHEN condition_2 THEN result_2
            ...
            ELSE result_n END AS alias
  FROM table
 WHERE conditions
 ORDER BY columns DESC
 LIMIT m;
```

## Exercises 22-24

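The exercises span four learning databases. Remember to switch the learning database in the editor menu as each exercise requires, and write, in SQLiteStudio on your own computer, a SQL statement whose output matches the expected output. If you get stuck, refer to Appendix 2, "Reference Solutions to the Exercises."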

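### 22. From the `daily_report` table of the `covid19` database, use a derived calculated column to distinguish US observations from non-US observations: label US observations `'Is US'` and non-US observations `'Not US'`. Refer to the expected query result below.

Expected output: a query result of shape (4011, 2).

In [None]:
-- Only the first five rows are shown because of limited space in print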
%%sql


Combined_Key,is_us
"Abbeville, South Carolina, US",Is US
"Abruzzo, Italy",Not US
"Acadia, Louisiana, US",Is US
"Accomack, Virginia, US",Is US
"Acre, Brazil",Not US


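### 23. From the `movies` table of the `imdb` database, classify movies rated above 8.7 (`>8.7`) as `'Awesome'`, movies rated above 8.4 (`>8.4`) as `'Terrific'`, and all remaining movies as `'Great'`. Refer to the expected query result below.

Expected output: a query result of shape (250, 3).

In [None]:
-- Only the first five rows are shown because of limited space in print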
%%sql


title,rating,rating_category
The Shawshank Redemption,9.3,Awesome
The Godfather,9.2,Awesome
The Dark Knight,9.0,Awesome
The Godfather Part II,9.0,Awesome
12 Angry Men,9.0,Awesome


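### 24. From the `admin_regions` table of the `twElection2020` database, classify `county` as `'六都'` (one of the six special municipalities) or `'非六都'` (not one of them). Refer to the expected query result below.

Note: the six special municipalities are 臺北市 (Taipei City), 新北市 (New Taipei City), 桃園市 (Taoyuan City), 臺中市 (Taichung City), 臺南市 (Tainan City), and 高雄市 (Kaohsiung City).

Expected output: a query result of shape (22, 2).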

In [None]:
%%sql


county,county_type
新北市,六都
桃園市,六都
臺中市,六都
臺北市,六都
臺南市,六都
高雄市,六都
南投縣,非六都
嘉義市,非六都
嘉義縣,非六都
基隆市,非六都
