# Introduction to SQL

> Joining tables

郭耀仁

In [1]:
# Connect to the databases
import sqlite3
import pandas as pd
from test_queries.test_queries_03 import extract_test_queries as etq

conn_nba = sqlite3.connect('nba.db')
conn_twelection = sqlite3.connect('twelection.db')

## Summary

- Joining tables in a relational database
- In-class exercises
- Suggested answers to the in-class exercises

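## Joining tables in a relational database

## What is a relational database

> A set of related tables built according to the relational model. In the relational model, the observations in each table are independent and unique at that table's level of observation, and observations from different tables can be related into a single query result by joining the tables.

## Why the relational model

> Adopting the relational model reduces duplicated data and makes maintenance simpler.

## Relating data with the `JOIN` and `ON` keywords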

```sql
SELECT *
  FROM left_table JOIN right_table
    ON left_table.primary_key_column = right_table.foreign_key_column
```
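
The template above can be tried on a throwaway in-memory database. This is only a minimal sketch: the `authors` and `books` tables and their columns are hypothetical and are not part of `nba.db`.

```python
import sqlite3

# Hypothetical toy tables, created in memory for illustration only
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (book_id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO books VALUES (10, 1, 'Notes'), (11, 1, 'Letters'), (12, 2, 'Compilers');
""")

# authors.author_id plays the primary key role, books.author_id the foreign key role
rows = conn.execute("""
SELECT authors.name,
       books.title
  FROM authors
  JOIN books
    ON authors.author_id = books.author_id
 ORDER BY books.book_id;
""").fetchall()
print(rows)
```

Each book row is paired with the author whose `author_id` matches, so an author with two books appears twice in the result.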

## Joining `players` and `careerSummaries`

In [2]:
sql_query = """
SELECT *
  FROM players
  JOIN careerSummaries
    ON players.personId = careerSummaries.personId;
"""

In [3]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,firstName,lastName,temporaryDisplayName,personId,teamId,jersey,isActive,pos,heightFeet,heightInches,...,ftm,fta,pFouls,points,gamesPlayed,gamesStarted,plusMinus,min,dd2,td3
0,Vince,Carter,"Carter, Vince",1713,1610612737,15.0,True,G-F,6,6,...,4852.0,6082.0,3995.0,25728.0,1541.0,982.0,1816.0,46371.0,90.0,5.0
1,Tyson,Chandler,"Chandler, Tyson",2199,1610612745,19.0,True,C,7,0,...,2393.0,3714.0,3268.0,9509.0,1160.0,886.0,325.0,31617.0,292.0,0.0
2,LeBron,James,"James, LeBron",2544,1610612747,23.0,True,F,6,9,...,7379.0,10044.0,2313.0,34087.0,1258.0,1257.0,6887.0,48327.0,485.0,94.0
3,Carmelo,Anthony,"Anthony, Carmelo",2546,1610612757,0.0,True,F,6,8,...,6028.0,7424.0,3204.0,26314.0,1114.0,1106.0,1614.0,39750.0,171.0,2.0
4,Kyle,Korver,"Korver, Kyle",2594,1610612749,26.0,True,G-F,6,7,...,1290.0,1472.0,2512.0,11903.0,1224.0,422.0,2885.0,31056.0,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499,Matt,Thomas,"Thomas, Matt",1629744,1610612761,21.0,True,G,6,4,...,7.0,10.0,31.0,150.0,33.0,0.0,22.0,321.0,0.0,0.0
500,Tariq,Owens,"Owens, Tariq",1629745,1610612756,41.0,True,F,6,10,...,2.0,2.0,1.0,4.0,3.0,0.0,-16.0,15.0,0.0,0.0
501,Javonte,Green,"Green, Javonte",1629750,1610612738,43.0,True,G-F,6,4,...,23.0,36.0,37.0,127.0,44.0,1.0,-5.0,414.0,0.0,0.0
502,Juwan,Morgan,"Morgan, Juwan",1629752,1610612762,16.0,True,F,6,7,...,0.0,0.0,7.0,19.0,16.0,0.0,19.0,73.0,0.0,0.0


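## Annotating variables with keys: the primary key

A primary key identifies each individual observation in a table. What kind of variable can be designated as a primary key?

1. It must be unique
2. It must not contain missing values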
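
The two requirements can be seen by letting SQLite reject rows that break them. This is a sketch under assumptions: the `people` table is hypothetical, and `NOT NULL` is written out explicitly because SQLite otherwise tolerates `NULL` in most non-`INTEGER` primary key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; NOT NULL is explicit so that missing keys are rejected
conn.execute("CREATE TABLE people (person_id TEXT PRIMARY KEY NOT NULL, name TEXT)")
conn.execute("INSERT INTO people VALUES ('p1', 'Ada')")

errors = []
# First row repeats an existing key, second row has a missing key
for row in [("p1", "Grace"), (None, "Alan")]:
    try:
        conn.execute("INSERT INTO people VALUES (?, ?)", row)
    except sqlite3.IntegrityError as exc:
        errors.append(str(exc))

n_rows = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(errors, n_rows)
```

Both inserts fail, so the table still holds only the original observation.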

In [4]:
sql_query = """
SELECT name
  FROM PRAGMA_TABLE_INFO('players')
 WHERE pk = 1;
"""
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,name
0,personId


## Annotating variables with keys: the foreign key

A column that corresponds to the primary key of another, related table can be designated as a foreign key.

In [5]:
sql_query = """
SELECT *
  FROM PRAGMA_FOREIGN_KEY_LIST('players');
"""
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,id,seq,table,from,to,on_update,on_delete,match
0,0,0,careerSummaries,personId,personId,RESTRICT,RESTRICT,NONE
1,1,0,teams,teamId,teamId,RESTRICT,RESTRICT,NONE
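
To see what a foreign key annotation does in practice, the following sketch declares one on hypothetical `teams` and `members` tables (not the tables in `nba.db`). Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE members (
    member_id INTEGER PRIMARY KEY,
    team_id INTEGER REFERENCES teams (team_id)
);
INSERT INTO teams VALUES (1, 'Red');
INSERT INTO members VALUES (100, 1);
""")

# Columns are id, seq, table, from, to, ...; keep the table/from/to part
fk_info = [row[2:5] for row in
           conn.execute("SELECT * FROM pragma_foreign_key_list('members');")]

try:
    conn.execute("INSERT INTO members VALUES (101, 99)")  # team 99 does not exist
    violation = None
except sqlite3.IntegrityError as exc:
    violation = str(exc)
print(fk_info, violation)
```

The row pointing at a non-existent team is rejected, which is how the annotation keeps related tables consistent.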


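## A plain `JOIN` uses the default join logic, the inner join

- `JOIN` (equivalent to `INNER JOIN`) returns the observations in the "intersection" of the left and right tables
- `LEFT JOIN` returns "all" observations of the left table; unmatched ones are filled with missing values
- `RIGHT JOIN` returns "all" observations of the right table; unmatched ones are filled with missing values (not supported by SQLite before version 3.39.0)
- `FULL OUTER JOIN` returns the observations in the "union" of the left and right tables; unmatched ones are filled with missing values (not supported by SQLite before version 3.39.0)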
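
A small sketch can make the difference between the first two join types concrete. The `left_table` and `right_table` contents below are made up for illustration; ids 2 and 3 exist in both tables, id 1 only on the left, and id 4 only on the right.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE left_table (id INTEGER, label TEXT);
CREATE TABLE right_table (id INTEGER, score REAL);
INSERT INTO left_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO right_table VALUES (2, 20.0), (3, 30.0), (4, 40.0);
""")

# Intersection: only ids present in both tables
inner_rows = conn.execute("""
SELECT left_table.id, label, score
  FROM left_table
  JOIN right_table
    ON left_table.id = right_table.id
 ORDER BY left_table.id;
""").fetchall()

# All left rows: the unmatched id 1 gets a missing score (None in Python)
left_rows = conn.execute("""
SELECT left_table.id, label, score
  FROM left_table
  LEFT JOIN right_table
    ON left_table.id = right_table.id
 ORDER BY left_table.id;
""").fetchall()
print(inner_rows)
print(left_rows)
```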

## Building a left table, `veteran_players`

In [6]:
sql_query = """
SELECT personId,
       temporaryDisplayName
  FROM players
 LIMIT 10;
"""

In [7]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,personId,temporaryDisplayName
0,1713,"Carter, Vince"
1,2199,"Chandler, Tyson"
2,2544,"James, LeBron"
3,2546,"Anthony, Carmelo"
4,2594,"Korver, Kyle"
5,2617,"Haslem, Udonis"
6,2730,"Howard, Dwight"
7,2738,"Iguodala, Andre"
8,2747,"Smith, JR"
9,2772,"Ariza, Trevor"


## Building a right table, `top_scorers`

In [8]:
sql_query = """
SELECT personId,
       ppg
  FROM careerSummaries
 ORDER BY ppg DESC
 LIMIT 10;
"""

In [9]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,personId,ppg
0,2544,27.1
1,201142,27.0
2,201935,25.1
3,1629029,24.4
4,203954,24.1
5,203076,24.0
6,203081,24.0
7,2546,23.6
8,1629027,23.6
9,1629627,23.6


## The default `JOIN`

In [10]:
sql_query = """
SELECT *
  FROM (SELECT personId,
               temporaryDisplayName
          FROM players
         LIMIT 10) AS veteran_players
  JOIN (SELECT personId,
               ppg
          FROM careerSummaries
         ORDER BY ppg DESC
         LIMIT 10) AS top_scorers
    ON veteran_players.personId = top_scorers.personId;
"""

In [11]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,personId,temporaryDisplayName,personId.1,ppg
0,2544,"James, LeBron",2544,27.1
1,2546,"Anthony, Carmelo",2546,23.6


## Switching to `LEFT JOIN`

In [12]:
sql_query = """
SELECT *
  FROM (SELECT personId,
               temporaryDisplayName
          FROM players
         LIMIT 10) AS veteran_players
  LEFT JOIN (SELECT personId,
                    ppg
               FROM careerSummaries
              ORDER BY ppg DESC
              LIMIT 10) AS top_scorers
    ON veteran_players.personId = top_scorers.personId;
"""

In [13]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,personId,temporaryDisplayName,personId.1,ppg
0,1713,"Carter, Vince",,
1,2199,"Chandler, Tyson",,
2,2544,"James, LeBron",2544.0,27.1
3,2546,"Anthony, Carmelo",2546.0,23.6
4,2594,"Korver, Kyle",,
5,2617,"Haslem, Udonis",,
6,2730,"Howard, Dwight",,
7,2738,"Iguodala, Andre",,
8,2747,"Smith, JR",,
9,2772,"Ariza, Trevor",,


## Demonstrating `RIGHT JOIN` with Python pandas

In [14]:
left_df = pd.read_sql("""SELECT personId, temporaryDisplayName FROM players LIMIT 10""", conn_nba)
right_df = pd.read_sql("""SELECT personId, ppg FROM careerSummaries ORDER BY ppg DESC LIMIT 10""", conn_nba)
pd.merge(left_df, right_df, on='personId', how='right')

Unnamed: 0,personId,temporaryDisplayName,ppg
0,2544,"James, LeBron",27.1
1,2546,"Anthony, Carmelo",23.6
2,201142,,27.0
3,201935,,25.1
4,1629029,,24.4
5,203954,,24.1
6,203076,,24.0
7,203081,,24.0
8,1629027,,23.6
9,1629627,,23.6


## Demonstrating `FULL OUTER JOIN` with Python pandas

In [15]:
pd.merge(left_df, right_df, on='personId', how='outer')

Unnamed: 0,personId,temporaryDisplayName,ppg
0,1713,"Carter, Vince",
1,2199,"Chandler, Tyson",
2,2544,"James, LeBron",27.1
3,2546,"Anthony, Carmelo",23.6
4,2594,"Korver, Kyle",
5,2617,"Haslem, Udonis",
6,2730,"Howard, Dwight",
7,2738,"Iguodala, Andre",
8,2747,"Smith, JR",
9,2772,"Ariza, Trevor",
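
Where the SQLite version at hand lacks `RIGHT JOIN` and `FULL OUTER JOIN`, both can be approximated in SQL itself. The sketch below is one common workaround, not the only one: a right join is written as a `LEFT JOIN` with the tables swapped, and a full outer join as the `UNION` of the two left joins. The toy tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE left_table (id INTEGER, label TEXT);
CREATE TABLE right_table (id INTEGER, score REAL);
INSERT INTO left_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO right_table VALUES (2, 20.0), (3, 30.0), (4, 40.0);
""")

# RIGHT JOIN emulated by swapping the tables around a LEFT JOIN
right_rows = conn.execute("""
SELECT r.id, l.label, r.score
  FROM right_table AS r
  LEFT JOIN left_table AS l
    ON r.id = l.id
 ORDER BY r.id;
""").fetchall()

# FULL OUTER JOIN emulated by the UNION of both LEFT JOIN directions
full_rows = conn.execute("""
SELECT l.id, l.label, r.score
  FROM left_table AS l
  LEFT JOIN right_table AS r
    ON l.id = r.id
 UNION
SELECT r.id, l.label, r.score
  FROM right_table AS r
  LEFT JOIN left_table AS l
    ON r.id = l.id
 ORDER BY 1;
""").fetchall()
print(right_rows)
print(full_rows)
```

One caveat of this approach: `UNION` removes duplicate rows, so tables that legitimately contain identical rows would lose them in the emulated full outer join.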


## Using `IS NULL` to find observations with missing values

In [16]:
sql_query = """
SELECT veteran_players.personId,
       veteran_players.temporaryDisplayName
  FROM (SELECT personId,
               temporaryDisplayName
          FROM players
         LIMIT 10) AS veteran_players
  LEFT JOIN (SELECT personId,
                    ppg
               FROM careerSummaries
              ORDER BY ppg DESC
              LIMIT 10) AS top_scorers
    ON veteran_players.personId = top_scorers.personId
 WHERE top_scorers.ppg IS NULL;
"""

In [17]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,personId,temporaryDisplayName
0,1713,"Carter, Vince"
1,2199,"Chandler, Tyson"
2,2594,"Korver, Kyle"
3,2617,"Haslem, Udonis"
4,2730,"Howard, Dwight"
5,2738,"Iguodala, Andre"
6,2747,"Smith, JR"
7,2772,"Ariza, Trevor"
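
The query above relies on `IS NULL` rather than `= NULL`. A short sketch on a hypothetical `scores` table shows why: a comparison with `NULL` using `=` never evaluates to true, so it filters out every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (player TEXT, ppg REAL);
INSERT INTO scores VALUES ('a', 27.1), ('b', NULL), ('c', NULL);
""")

# "= NULL" evaluates to NULL, not true, so no row passes the filter
with_equals = conn.execute(
    "SELECT player FROM scores WHERE ppg = NULL;"
).fetchall()

# "IS NULL" is the test that actually finds missing values
with_is_null = conn.execute(
    "SELECT player FROM scores WHERE ppg IS NULL ORDER BY player;"
).fetchall()
print(with_equals, with_is_null)
```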


## The relational model defines the relationships between tables

- One-to-one
- One-to-many
- Many-to-many

In [18]:
sql_query = """
SELECT *
  FROM players
  JOIN careerSummaries
    ON players.personId = careerSummaries.personId;
"""

In [19]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,firstName,lastName,temporaryDisplayName,personId,teamId,jersey,isActive,pos,heightFeet,heightInches,...,ftm,fta,pFouls,points,gamesPlayed,gamesStarted,plusMinus,min,dd2,td3
0,Vince,Carter,"Carter, Vince",1713,1610612737,15.0,True,G-F,6,6,...,4852.0,6082.0,3995.0,25728.0,1541.0,982.0,1816.0,46371.0,90.0,5.0
1,Tyson,Chandler,"Chandler, Tyson",2199,1610612745,19.0,True,C,7,0,...,2393.0,3714.0,3268.0,9509.0,1160.0,886.0,325.0,31617.0,292.0,0.0
2,LeBron,James,"James, LeBron",2544,1610612747,23.0,True,F,6,9,...,7379.0,10044.0,2313.0,34087.0,1258.0,1257.0,6887.0,48327.0,485.0,94.0
3,Carmelo,Anthony,"Anthony, Carmelo",2546,1610612757,0.0,True,F,6,8,...,6028.0,7424.0,3204.0,26314.0,1114.0,1106.0,1614.0,39750.0,171.0,2.0
4,Kyle,Korver,"Korver, Kyle",2594,1610612749,26.0,True,G-F,6,7,...,1290.0,1472.0,2512.0,11903.0,1224.0,422.0,2885.0,31056.0,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499,Matt,Thomas,"Thomas, Matt",1629744,1610612761,21.0,True,G,6,4,...,7.0,10.0,31.0,150.0,33.0,0.0,22.0,321.0,0.0,0.0
500,Tariq,Owens,"Owens, Tariq",1629745,1610612756,41.0,True,F,6,10,...,2.0,2.0,1.0,4.0,3.0,0.0,-16.0,15.0,0.0,0.0
501,Javonte,Green,"Green, Javonte",1629750,1610612738,43.0,True,G-F,6,4,...,23.0,36.0,37.0,127.0,44.0,1.0,-5.0,414.0,0.0,0.0
502,Juwan,Morgan,"Morgan, Juwan",1629752,1610612762,16.0,True,F,6,7,...,0.0,0.0,7.0,19.0,16.0,0.0,19.0,73.0,0.0,0.0


In [20]:
sql_query = """
SELECT *
  FROM teams
  JOIN rosters
    ON teams.teamId = rosters.teamId;
"""
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,isNBAFranchise,isAllStar,city,altCityName,fullName,tricode,teamId,nickname,urlName,teamShortName,confName,divName,personId,teamId.1
0,True,False,Atlanta,Atlanta,Atlanta Hawks,ATL,1610612737,Hawks,hawks,Atlanta,East,Southeast,1713,1610612737
1,True,False,Houston,Houston,Houston Rockets,HOU,1610612745,Rockets,rockets,Houston,West,Southwest,2199,1610612745
2,True,False,Los Angeles,Los Angeles Lakers,Los Angeles Lakers,LAL,1610612747,Lakers,lakers,L.A. Lakers,West,Pacific,2544,1610612747
3,True,False,Portland,Portland,Portland Trail Blazers,POR,1610612757,Trail Blazers,blazers,Portland,West,Northwest,2546,1610612757
4,True,False,Milwaukee,Milwaukee,Milwaukee Bucks,MIL,1610612749,Bucks,bucks,Milwaukee,East,Central,2594,1610612749
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499,True,False,Toronto,Toronto,Toronto Raptors,TOR,1610612761,Raptors,raptors,Toronto,East,Atlantic,1629744,1610612761
500,True,False,Phoenix,Phoenix,Phoenix Suns,PHX,1610612756,Suns,suns,Phoenix,West,Pacific,1629745,1610612756
501,True,False,Boston,Boston,Boston Celtics,BOS,1610612738,Celtics,celtics,Boston,East,Atlantic,1629750,1610612738
502,True,False,Utah,Utah,Utah Jazz,UTA,1610612762,Jazz,jazz,Utah,West,Northwest,1629752,1610612762
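
The two queries above cover a one-to-one and a one-to-many relationship. A many-to-many relationship is usually modelled with a third, linking table. The sketch below assumes hypothetical `students`, `courses`, and `enrollments` tables, none of which exist in `nba.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses (course_id INTEGER PRIMARY KEY, title TEXT);
-- Linking table: one row per (student, course) pair
CREATE TABLE enrollments (
    student_id INTEGER,
    course_id INTEGER,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO courses VALUES (10, 'SQL'), (11, 'Python');
INSERT INTO enrollments VALUES (1, 10), (1, 11), (2, 10);
""")

# Two joins walk from students through enrollments to courses
rows = conn.execute("""
SELECT students.name,
       courses.title
  FROM students
  JOIN enrollments
    ON students.student_id = enrollments.student_id
  JOIN courses
    ON enrollments.course_id = courses.course_id
 ORDER BY students.name, courses.title;
""").fetchall()
print(rows)
```

A student can appear with several courses and a course with several students, which is what distinguishes this case from one-to-many.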


## Joining tables is like merging them horizontally

![Imgur](https://i.imgur.com/hq7fS67.png)

Source: [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

## Tables can also be merged vertically with `UNION`

![Imgur](https://i.imgur.com/B7xawvp.png)

Source: [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

In [21]:
sql_query = """
SELECT firstName,
       lastName,
       'height' AS category,
       heightmeters AS value
  FROM players
 WHERE firstName = 'LeBron'
 UNION
SELECT firstName,
       lastName,
       'weight' AS category,
       weightKilograms AS value
  FROM players
 WHERE firstName = 'LeBron';
"""

In [22]:
pd.read_sql(sql_query, conn_nba)

Unnamed: 0,firstName,lastName,category,value
0,LeBron,James,height,2.06
1,LeBron,James,weight,113.4
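
One detail worth knowing when stacking results: `UNION` drops duplicate rows, while `UNION ALL` keeps every row. The sketch below uses two hypothetical single-column tables, `t2016` and `t2020`, that share the value `'y'`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t2016 (candidate TEXT);
CREATE TABLE t2020 (candidate TEXT);
INSERT INTO t2016 VALUES ('x'), ('y');
INSERT INTO t2020 VALUES ('y'), ('z');
""")

# UNION removes the duplicated 'y'
union_rows = conn.execute("""
SELECT candidate FROM t2016
 UNION
SELECT candidate FROM t2020
 ORDER BY candidate;
""").fetchall()

# UNION ALL keeps both copies of 'y'
union_all_rows = conn.execute("""
SELECT candidate FROM t2016
 UNION ALL
SELECT candidate FROM t2020
 ORDER BY candidate;
""").fetchall()
print(union_rows)
print(union_all_rows)
```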


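## The query keywords covered so far

When writing SQL, the keywords must appear in the prescribed order. `UNION` sits between two `SELECT` statements, and `ORDER BY` and `LIMIT` come after the last of them.

```sql
SELECT DISTINCT CAST(column_name AS data_type) AS alias_name
  FROM left_table
  JOIN right_table
    ON left_table.pk = right_table.fk
 WHERE conditions
 GROUP BY column_name
HAVING conditions
 UNION
SELECT ...
 ORDER BY column_name DESC
 LIMIT n_obs;
```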
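
As a rough illustration of the keyword order, the sketch below runs a single query that uses `JOIN`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT` together. The `teams` and `players` tables here are hypothetical stand-ins built in memory, with `points` stored as text so that `CAST` has something to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE players (player_id INTEGER PRIMARY KEY, team_id INTEGER, points TEXT);
INSERT INTO teams VALUES (1, 'Red'), (2, 'Blue'), (3, 'Green');
INSERT INTO players VALUES (1, 1, '10'), (2, 1, '20'), (3, 2, '5'), (4, 2, '7'), (5, 3, '40');
""")

rows = conn.execute("""
SELECT teams.name,
       SUM(CAST(players.points AS INTEGER)) AS total_points
  FROM teams
  JOIN players
    ON teams.team_id = players.team_id
 WHERE players.points <> '40'                        -- filters rows before grouping
 GROUP BY teams.name
HAVING SUM(CAST(players.points AS INTEGER)) > 10     -- filters groups after aggregation
 ORDER BY total_points DESC
 LIMIT 5;
""").fetchall()
print(rows)
```

`WHERE` removes the only Green player before grouping, so Green never forms a group; `HAVING` then keeps the groups whose total exceeds 10.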

## In-class exercises

[Joining tables: in-class exercises](https://mybinder.org/v2/gh/datainpoint/classroom-introduction-to-sql/master?filepath=04-exercises.ipynb)

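## In-class exercise: vertically merge the vote counts of the three candidate tickets in `presidential2016` and `presidential2020` with `UNION`, creating a `year` variable to tell apart the `number`, `candidates`, and `total_votes` of each election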

In [23]:
expected_output = pd.read_sql(etq('0316'), conn_twelection)

In [24]:
expected_output

Unnamed: 0,year,number,candidates,total_votes
0,2016,1,朱立倫/王如玄,3813365
1,2016,2,蔡英文/陳建仁,6894744
2,2016,3,宋楚瑜/徐欣瑩,1576861
3,2020,1,宋楚瑜/余湘,608590
4,2020,2,韓國瑜/張善政,5522119
5,2020,3,蔡英文/賴清德,8170231


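## In-class exercise: vertically merge the vote shares of the three candidate tickets in `presidential2016` and `presidential2020` with `UNION`, creating a `year` variable to tell apart the `number`, `candidates`, and `votes_percentage` of each election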

In [25]:
expected_output = pd.read_sql(etq('0317'), conn_twelection)

In [26]:
expected_output

Unnamed: 0,year,number,candidates,votes_percentage
0,2016,1,朱立倫/王如玄,0.310409
1,2016,2,蔡英文/陳建仁,0.561234
2,2016,3,宋楚瑜/徐欣瑩,0.128357
3,2020,1,宋楚瑜/余湘,0.042556
4,2020,2,韓國瑜/張善政,0.386137
5,2020,3,蔡英文/賴清德,0.571307


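## In-class exercise: query `nba.db` for the career points per game (`ppg`), rebounds per game (`rpg`), and assists per game (`apg`) of the current Los Angeles Lakers roster; select `fullName`, `firstName`, `lastName`, `ppg`, `rpg`, and `apg`, sorted by `firstName` in ascending order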

In [27]:
expected_output = pd.read_sql(etq('0320'), conn_nba)

In [28]:
expected_output

Unnamed: 0,fullName,firstName,lastName,ppg,rpg,apg
0,Los Angeles Lakers,Alex,Caruso,5.7,2.0,2.1
1,Los Angeles Lakers,Anthony,Davis,24.0,10.4,2.2
2,Los Angeles Lakers,Avery,Bradley,11.8,2.9,1.8
3,Los Angeles Lakers,Danny,Green,8.9,3.5,1.6
4,Los Angeles Lakers,Devontae,Cacok,0.0,0.0,0.0
5,Los Angeles Lakers,Dion,Waiters,13.2,2.7,2.8
6,Los Angeles Lakers,Dwight,Howard,16.8,12.3,1.4
7,Los Angeles Lakers,JR,Smith,12.5,3.2,2.1
8,Los Angeles Lakers,JaVale,McGee,7.9,5.1,0.4
9,Los Angeles Lakers,Jared,Dudley,7.5,3.2,1.6


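## In-class exercise: compute the number of votes that the two tickets 韓國瑜/張善政 and 蔡英文/賴清德 in `presidential2020` each received in the 12 districts of Taipei City; select the three variables `town`, `Kuo_Cheng`, and `Ing_Te`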

In [29]:
expected_output = pd.read_sql(etq('0318'), conn_twelection)

In [30]:
expected_output

Unnamed: 0,town,Kuo_Cheng,Ing_Te
0,中山區,56491,79022
1,中正區,41461,48183
2,信義區,62353,70285
3,內湖區,74437,94269
4,北投區,59851,90060
5,南港區,30968,40969
6,士林區,65183,104881
7,大同區,24673,50006
8,大安區,85490,88977
9,文山區,82305,78129


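## In-class exercise: compute the number of votes that the two tickets 韓國瑜/張善政 and 蔡英文/賴清德 in `presidential2020` each received in the 12 districts of Taipei City; select the three variables `town`, `Kuo_Cheng`, and `Ing_Te`, and find the districts where 韓國瑜/張善政 received more votes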

In [31]:
expected_output = pd.read_sql(etq('0319'), conn_twelection)

In [32]:
expected_output

Unnamed: 0,town,Kuo_Cheng,Ing_Te
0,文山區,82305,78129


## Suggested answers to the in-class exercises

[Joining tables: suggested answers to the in-class exercises](https://mybinder.org/v2/gh/datainpoint/classroom-introduction-to-sql/master?filepath=04-suggested-answers.ipynb)