# Data Visualization with Modern Data Science

> Getting Started with SQL

Yao-Jen Kuo <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com)

In [1]:
%LOAD sqlite3 db=data/taiwan_election_2024.db timeout=2 shared_cache=true

## What is SQL

## The definition

SQL (pronounced "ess-que-ell" or "sequel") is a language designed specifically for working with **relational databases**. SQL enables people to create databases, add new data to them, maintain the data in them, and retrieve selected parts of the data through a **database management system**. Developed at IBM in the 1970s, SQL has grown and advanced over the years to become the industry standard.

## Subcategories of SQL

- DML, short for **Data Manipulation Language**.
- DDL, short for **Data Definition Language**.
- DCL, short for **Data Control Language**.
- TCL, short for **Transaction Control Language**.

## Subcategories of SQL (Cont'd)

- DML: statements begin with `SELECT`, `INSERT`, `UPDATE`, `DELETE`, etc.
- DDL: statements begin with `CREATE`, `DROP`, `ALTER`, etc.
- DCL: statements begin with `GRANT`, `REVOKE`, etc.
- TCL: statements begin with `COMMIT`, `ROLLBACK`, etc.
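
To make the subcategories concrete, here is a minimal sketch that drives SQLite from Python's standard `sqlite3` module. The in-memory database, the table, and the party names are made up for illustration and are not taken from the course database. DCL is left out because SQLite does not implement `GRANT` or `REVOKE`.

```python
import sqlite3

# In-memory database with a made-up table (assumption: illustration only).
conn = sqlite3.connect(":memory:")

# DDL: define a table.
conn.execute("CREATE TABLE parties (id INTEGER PRIMARY KEY, name TEXT)")

# DML: add rows.
conn.execute("INSERT INTO parties (name) VALUES ('Party A'), ('Party B')")
conn.commit()    # TCL: COMMIT makes the inserted rows permanent.

# DML inside a new, still uncommitted transaction.
conn.execute("DELETE FROM parties")
conn.rollback()  # TCL: ROLLBACK undoes the uncommitted DELETE.

# DML: read the rows back; both survive because the DELETE was rolled back.
rows = conn.execute("SELECT id, name FROM parties ORDER BY id").fetchall()
print(rows)
conn.close()
```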

## What will be covered in this course

- DML.
- A bit of DDL.
- ~~DCL.~~
- ~~TCL.~~

## Why SQL

- SQL gives us improved programmatic control over the structure of data, leading to efficiency, speed, and accuracy.
- SQL is also an excellent adjunct to programming languages used in the data sciences, such as R and Python.
- For people with no background in programming languages, SQL often serves as an easy-to-understand introduction to concepts related to data structures and programming logic.

## What is a relational database

> A relational database is a digital database based on the relational model of data.

Source: <https://en.wikipedia.org/wiki/Relational_database>

## What is a database

> A database is like a collection of data items (comic books, product orders and player profile). If we define the term more strictly, a database is a **self-describing** collection of **integrated records**. A record is a representation of some physical or conceptual object.

Source: <https://www.amazon.com/SQL-Dummies-Computer-Tech/dp/1119527074>

## Characteristics of a database

- Integrated records: each observation (row) in a database is stored together with its attributes (columns).
- Self-describing: a database contains metadata.

## What does integrated records mean

In [2]:
SELECT * 
  FROM districts
 LIMIT 5;

id,county,town,polling_place,vote_tallied_at
1,南投縣,中寮鄉,191,2024-01-13 17:36:44
2,南投縣,中寮鄉,192,2024-01-13 17:18:56
3,南投縣,中寮鄉,193,2024-01-13 17:04:33
4,南投縣,中寮鄉,194,2024-01-13 17:44:08
5,南投縣,中寮鄉,195,2024-01-13 17:57:21


## What does self-describing mean

> Metadata is "data that provides information about other data". In other words, it is "data about data".

Source: <https://en.wikipedia.org/wiki/Metadata>

## Seriously, what the f*** is metadata?

![](https://media.giphy.com/media/xT0xeif517lOYwnH2g/giphy.gif)

Source: <https://media.giphy.com/media/xT0xeif517lOYwnH2g/giphy.gif>

## Metadata of `taiwan_election_2024`

In [3]:
SELECT name
  FROM sqlite_master
 WHERE name NOT LIKE 'sqlite_%';

name
districts
villages
parties
election_types
candidates
polling_places
presidents
regional_legislators
aboriginal_legislators
party_legislators


## Metadata of `taiwan_election_2024.districts`

In [4]:
SELECT * 
  FROM PRAGMA_TABLE_INFO('districts');

cid,name,type,notnull,dflt_value,pk
0,id,INTEGER,0,,1
1,county,CHAR(3),0,,0
2,town,VARCHAR(200),0,,0
3,polling_place,INTEGER,0,,0
4,vote_tallied_at,CHAR(19),0,,0


## Change the table name to view a different table's metadata

In [5]:
SELECT * 
  FROM PRAGMA_TABLE_INFO('parties');

cid,name,type,notnull,dflt_value,pk
0,id,INTEGER,0,,1
1,name,VARCHAR(100),0,,0
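
The same two metadata queries can be tried without the course database. The sketch below assumes a throwaway in-memory database with one hypothetical table shaped like `parties`, and reads its metadata back through `sqlite_master` and `pragma_table_info`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table that mimics the shape of `parties` (assumption).
conn.execute("CREATE TABLE parties (id INTEGER PRIMARY KEY, name VARCHAR(100))")

# Table names live in the sqlite_master catalog.
tables = conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
).fetchall()

# Column metadata comes from the pragma_table_info table-valued function.
columns = conn.execute(
    "SELECT name, type, pk FROM pragma_table_info('parties')"
).fetchall()
conn.close()

print(tables)
print(columns)
```

The database describes itself: no separate documentation file is needed to learn which tables and columns exist.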


## What is a database management system

> The database management system (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data.

Source: <https://en.wikipedia.org/wiki/Database>

## What is a database management system, the less technical version

A database management system (DBMS) bridges the application program (could be a web application, programming environment, or just a client interface) and the database.

## There are a lot of relational database management systems

- SQL Server by Microsoft.
- MySQL by Oracle.
- DB2 by IBM.
- PostgreSQL.
- **SQLite**.
- ...etc.

## What is SQLite

> SQLite is a relational database management system (RDBMS). In contrast to many other database management systems, SQLite is not a client–server database engine. Rather, it is embedded into the end program.

Source: <https://en.wikipedia.org/wiki/SQLite>
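
A short sketch of what "embedded" means in practice, using Python's standard `sqlite3` module as the end program. The file name `example.db` and the table `t` are placeholders chosen for this illustration.

```python
import os
import sqlite3
import tempfile

# The whole database is one ordinary file; this path is a throwaway example.
path = os.path.join(tempfile.mkdtemp(), "example.db")

# No host, port, or login: the engine runs inside this process.
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (42)")
conn.commit()
conn.close()

# Reopening the same file gives the data back.
conn = sqlite3.connect(path)
value = conn.execute("SELECT x FROM t").fetchone()[0]
conn.close()

print(os.path.exists(path), value)
```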

## Why SQLite

- It is open source.
- It is self-contained.
- It is light and portable.
- I PERSONALLY think it is the best way for newbies to learn SQL on a local machine.

## Getting SQLiteStudio

## What is SQLiteStudio

> SQLiteStudio is desktop application for browsing and editing SQLite database files. It is aimed for people, who know what SQLite is, or what relational databases are in general.

Source: <https://sqlitestudio.pl/about>

## Download and install SQLiteStudio

- Download links: <https://github.com/pawelsalawa/sqlitestudio/releases>
- Download installers: `.exe` for Windows, `.dmg` for macOS.
- For macOS users: if the system blocks the app, allow it via System Preferences > Security & Privacy > Open Anyway.

## Downloading an established database [taiwan_election_2024.db](https://taiwan-election-2024.s3.ap-northeast-1.amazonaws.com/taiwan_election_2024.db)

## Now we have a SQLite database client that can connect to an established database on our local machine

- Database > Add a database.
- Browse for the existing database file on the local computer > OK.
- Database > Connect to the database.

## A few "Hello, World!" like SQL statements

In [6]:
SELECT *
  FROM districts
 LIMIT 5;

id,county,town,polling_place,vote_tallied_at
1,南投縣,中寮鄉,191,2024-01-13 17:36:44
2,南投縣,中寮鄉,192,2024-01-13 17:18:56
3,南投縣,中寮鄉,193,2024-01-13 17:04:33
4,南投縣,中寮鄉,194,2024-01-13 17:44:08
5,南投縣,中寮鄉,195,2024-01-13 17:57:21


In [7]:
SELECT *
  FROM parties
 LIMIT 11;

id,name
1,中國國民黨
2,中華婦女黨
3,中華愛國同心黨
4,中華文化共和黨
5,中華統一促進黨
6,中華聯合黨
7,人民最大黨
8,人民民主黨
9,制度救世島
10,勞動黨


## What is a SQL statement

- A SQL statement is a combination of keywords, object names (e.g. databases/tables/columns/functions), constants, and operators.
- SQL keywords are NOT case sensitive.
- Semicolon `;` is the standard way to separate each SQL statement in database systems that allow more than one SQL statement to be executed in the same call.
- Indentations or new lines are OPTIONAL in SQL statements.
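
The points above can be checked with a minimal sketch in Python's standard `sqlite3` module. The table and its two rows are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# executescript() runs several statements in one call;
# the semicolons mark where each statement ends.
conn.executescript("""
    CREATE TABLE districts (id INTEGER PRIMARY KEY, county TEXT);
    INSERT INTO districts (county) VALUES ('A');
    INSERT INTO districts (county) VALUES ('B');
""")

# Same query three ways: upper-case keywords, lower-case keywords,
# and spread over several indented lines.
upper = conn.execute("SELECT id, county FROM districts").fetchall()
lower = conn.execute("select id, county from districts").fetchall()
multiline = conn.execute("""
    SELECT id,
           county
      FROM districts
""").fetchall()
conn.close()

print(upper == lower == multiline)
```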

## SQL keywords are NOT case sensitive

In [8]:
select *
  from districts
 limit 5;

id,county,town,polling_place,vote_tallied_at
1,南投縣,中寮鄉,191,2024-01-13 17:36:44
2,南投縣,中寮鄉,192,2024-01-13 17:18:56
3,南投縣,中寮鄉,193,2024-01-13 17:04:33
4,南投縣,中寮鄉,194,2024-01-13 17:44:08
5,南投縣,中寮鄉,195,2024-01-13 17:57:21


## Indentations or new lines are OPTIONAL in SQL statements

In [9]:
SELECT * 
FROM districts
LIMIT 3;

id,county,town,polling_place,vote_tallied_at
1,南投縣,中寮鄉,191,2024-01-13 17:36:44
2,南投縣,中寮鄉,192,2024-01-13 17:18:56
3,南投縣,中寮鄉,193,2024-01-13 17:04:33


In [10]:
SELECT * FROM districts LIMIT 3;

id,county,town,polling_place,vote_tallied_at
1,南投縣,中寮鄉,191,2024-01-13 17:36:44
2,南投縣,中寮鄉,192,2024-01-13 17:18:56
3,南投縣,中寮鄉,193,2024-01-13 17:04:33


## Adopt a style guide before writing your own SQL statements

- **[SQL style guide by Simon Holywell](https://www.sqlstyle.guide/)**
- [SQL Style Guide | GitLab](https://about.gitlab.com/handbook/business-technology/data-team/platform/sql-style-guide/)
- [SQL Style Guide - Mozilla Data Documentation](https://docs.telemetry.mozilla.org/concepts/sql_style.html)
- ...etc.

## Format SQL in SQLiteStudio

- Write down your SQL statements.
- Highlight the statements, right-click to open the context menu, and choose Format SQL.

## Beginning with `SELECT` and `FROM`

## Basic `SELECT` syntax

The following `SELECT` statement fetches every row and column in a table.

```sql
SELECT *
  FROM table_name;
```

## Selecting a subset of rows and columns

It is more practical to limit the rows and columns the query retrieves, especially with large tables. `LIMIT m` returns at most `m` rows, and `OFFSET n` skips the first `n` rows before returning any.

```sql
SELECT column_names
  FROM table_name
 LIMIT m OFFSET n;
```

In [11]:
SELECT id,
       county,
       town
  FROM districts
 LIMIT 5;

id,county,town
1,南投縣,中寮鄉
2,南投縣,中寮鄉
3,南投縣,中寮鄉
4,南投縣,中寮鄉
5,南投縣,中寮鄉


In [12]:
SELECT id,
       county,
       town
  FROM districts
 LIMIT 3 OFFSET 2;

id,county,town
3,南投縣,中寮鄉
4,南投縣,中寮鄉
5,南投縣,中寮鄉
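
One common use of `LIMIT` with `OFFSET` is reading a table page by page. Here is a minimal sketch in Python's standard `sqlite3` module, assuming seven made-up rows. It adds `ORDER BY id` so that every page is cut from the same row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (id INTEGER PRIMARY KEY, town TEXT)")
conn.executemany(
    "INSERT INTO districts (town) VALUES (?)",
    [(f"town_{i}",) for i in range(1, 8)],  # seven made-up rows, ids 1..7
)

page_size = 3
pages = []
for page in range(3):
    # Skip the rows of all earlier pages, then take one page.
    rows = conn.execute(
        "SELECT id FROM districts ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()
    pages.append([row[0] for row in rows])
conn.close()

print(pages)  # [[1, 2, 3], [4, 5, 6], [7]]
```

The last page is shorter because `LIMIT` is an upper bound, not a guarantee.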


## Using `DISTINCT` to find unique values

- It is common for a column to contain duplicate values.
- To understand the range of values in a column, we can use the `DISTINCT` keyword.

```sql
SELECT DISTINCT column_names
  FROM table_name;
```

In [13]:
SELECT DISTINCT county
  FROM districts;

county
南投縣
嘉義市
嘉義縣
基隆市
宜蘭縣
屏東縣
彰化縣
新北市
新竹市
新竹縣


In [14]:
SELECT DISTINCT county,
       town
  FROM districts
 LIMIT 13;

county,town
南投縣,中寮鄉
南投縣,仁愛鄉
南投縣,信義鄉
南投縣,南投市
南投縣,名間鄉
南投縣,國姓鄉
南投縣,埔里鎮
南投縣,水里鄉
南投縣,竹山鎮
南投縣,草屯鎮
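
With more than one column, `DISTINCT` removes duplicate combinations rather than duplicate values in each column. A minimal sketch, assuming four made-up rows in Python's standard `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (county TEXT, town TEXT)")
conn.executemany(
    "INSERT INTO districts VALUES (?, ?)",
    [("A", "x"), ("A", "x"), ("A", "y"), ("B", "x")],  # made-up rows
)

# One column: each county appears once.
counties = conn.execute(
    "SELECT DISTINCT county FROM districts ORDER BY county"
).fetchall()

# Two columns: each (county, town) pair appears once.
pairs = conn.execute(
    "SELECT DISTINCT county, town FROM districts ORDER BY county, town"
).fetchall()
conn.close()

print(counties)  # [('A',), ('B',)]
print(pairs)     # [('A', 'x'), ('A', 'y'), ('B', 'x')]
```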


## Sorting data with `ORDER BY`

We order the results of a query using the keywords `ORDER BY` followed by the name of the column(s) to sort.

```sql
SELECT column_names
  FROM table_name
 ORDER BY column_names;
```

In [15]:
SELECT village_id,
       candidate_id,
       votes
  FROM presidents
 ORDER BY votes
 LIMIT 5;

village_id,candidate_id,votes
3551,331,1
3770,331,2
2706,331,3
3551,330,3
2706,330,4


## By default, `ORDER BY` sorts values in ascending order

If we want to sort in descending order, add the `DESC` keyword.

```sql
SELECT column_names
  FROM table_name
 ORDER BY column_names DESC;
```

In [16]:
SELECT village_id,
       candidate_id,
       votes
  FROM presidents
 ORDER BY votes DESC
 LIMIT 5;

village_id,candidate_id,votes
4930,329,890
4930,329,868
4534,331,831
965,329,800
74,331,783


## We are not limited to sorting on just one column

In [17]:
SELECT village_id,
       candidate_id,
       votes
  FROM presidents
 ORDER BY village_id,
          candidate_id,
          votes DESC
 LIMIT 10;

village_id,candidate_id,votes
1,329,414
1,329,341
1,330,146
1,330,128
1,331,67
1,331,56
2,329,612
2,330,239
2,331,103
3,329,398
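
Note that `DESC` applies only to the column it follows. In the query above, `village_id` and `candidate_id` are still sorted in ascending order. The sketch below shows this with five made-up rows in Python's standard `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE presidents (village_id INTEGER, candidate_id INTEGER, votes INTEGER)"
)
conn.executemany(
    "INSERT INTO presidents VALUES (?, ?, ?)",
    # Made-up vote counts, inserted in no particular order.
    [(1, 330, 100), (1, 329, 300), (2, 329, 50), (1, 329, 200), (2, 330, 75)],
)

# DESC belongs to votes only; the first two columns sort ascending.
mixed = conn.execute(
    "SELECT * FROM presidents ORDER BY village_id, candidate_id, votes DESC"
).fetchall()

# To reverse another column, give it its own DESC.
both_desc = conn.execute(
    "SELECT * FROM presidents ORDER BY village_id DESC, votes DESC"
).fetchall()
conn.close()

print(mixed)
print(both_desc)
```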


## Filtering rows with `WHERE`

The `WHERE` keyword allows us to find rows that match a specific value, a range of values, or multiple values.

```sql
SELECT column_names
  FROM table_name
 WHERE condition;
```

In [18]:
SELECT DISTINCT county,
       town
  FROM districts
 WHERE county = '臺北市';

county,town
臺北市,中山區
臺北市,中正區
臺北市,信義區
臺北市,內湖區
臺北市,北投區
臺北市,南港區
臺北市,士林區
臺北市,大同區
臺北市,大安區
臺北市,文山區


## The above query uses equals `=` to find rows that exactly match, but we can use other operators

## Relational and logical operators

- `=`: Equal to.
- `!=` or `<>`: Not equal to.
- `>`, `>=`: Greater than; greater than or equal to.
- `<`, `<=`: Less than; less than or equal to.
- `BETWEEN`: Within a range, both endpoints included.
- `IN`: Matches any one of a set of values.
- `LIKE`: Matches a pattern.
- `NOT`: Negates a condition.
- `AND`: True only when both conditions are true.
- `OR`: True when at least one condition is true.

## Using the `BETWEEN` operator

In [19]:
SELECT county,
       town,
       polling_place,
       vote_tallied_at
  FROM districts
 WHERE vote_tallied_at BETWEEN '2024-01-13 16:30:00' AND '2024-01-13 16:45:00'
 ORDER BY vote_tallied_at;

county,town,polling_place,vote_tallied_at
宜蘭縣,大同鄉,421,2024-01-13 16:30:49
澎湖縣,湖西鄉,63,2024-01-13 16:39:59
澎湖縣,西嶼鄉,98,2024-01-13 16:40:11
嘉義縣,布袋鎮,159,2024-01-13 16:41:58
澎湖縣,馬公市,53,2024-01-13 16:43:04
澎湖縣,湖西鄉,70,2024-01-13 16:43:09
高雄市,燕巢區,181,2024-01-13 16:43:25
臺南市,龍崎區,1543,2024-01-13 16:44:44
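
Two details of the query above are worth checking: `BETWEEN` includes both endpoints, and because `vote_tallied_at` is stored as text (`CHAR(19)`), the comparison is a string comparison that works only because every timestamp uses the same `YYYY-MM-DD HH:MM:SS` layout. A minimal sketch with five made-up rows in Python's standard `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE districts (polling_place INTEGER, vote_tallied_at CHAR(19))"
)
conn.executemany(
    "INSERT INTO districts VALUES (?, ?)",
    [  # made-up timestamps around the two boundaries
        (1, "2024-01-13 16:29:59"),
        (2, "2024-01-13 16:30:00"),
        (3, "2024-01-13 16:40:11"),
        (4, "2024-01-13 16:45:00"),
        (5, "2024-01-13 16:45:01"),
    ],
)

between = conn.execute(
    "SELECT polling_place FROM districts "
    "WHERE vote_tallied_at BETWEEN '2024-01-13 16:30:00' AND '2024-01-13 16:45:00' "
    "ORDER BY polling_place"
).fetchall()

# BETWEEN is shorthand for a pair of inclusive comparisons.
explicit = conn.execute(
    "SELECT polling_place FROM districts "
    "WHERE vote_tallied_at >= '2024-01-13 16:30:00' "
    "AND vote_tallied_at <= '2024-01-13 16:45:00' "
    "ORDER BY polling_place"
).fetchall()
conn.close()

print(between)  # [(2,), (3,), (4,)]
```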


## Using the `IN` operator

In [20]:
SELECT DISTINCT county,
       town
  FROM districts
 WHERE town IN ('大安區', '中正區', '中山區');

county,town
基隆市,中山區
基隆市,中正區
臺中市,大安區
臺北市,中山區
臺北市,中正區
臺北市,大安區
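
`IN` behaves like a chain of `=` comparisons joined by `OR`, and `NOT IN` selects everything else. A minimal sketch in Python's standard `sqlite3` module; the place names appear in the outputs above, while the ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (id INTEGER PRIMARY KEY, county TEXT, town TEXT)")
conn.executemany(
    "INSERT INTO districts VALUES (?, ?, ?)",
    [(1, "基隆市", "中山區"), (2, "臺北市", "大安區"), (3, "臺北市", "信義區")],
)

with_in = conn.execute(
    "SELECT id FROM districts WHERE town IN ('大安區', '中山區') ORDER BY id"
).fetchall()

# The same filter written as equality tests joined by OR.
with_or = conn.execute(
    "SELECT id FROM districts WHERE town = '大安區' OR town = '中山區' ORDER BY id"
).fetchall()

# NOT IN keeps the rows that match none of the listed values.
with_not_in = conn.execute(
    "SELECT id FROM districts WHERE town NOT IN ('大安區', '中山區') ORDER BY id"
).fetchall()
conn.close()

print(with_in, with_or, with_not_in)
```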


## Using `LIKE` with `WHERE`

Relational operators are fairly straightforward, but `LIKE` deserves additional explanation.

## `LIKE` lets us search for patterns in strings by using two wildcard characters

- Percent sign `%`: A wildcard matching zero or more characters.
- Underscore `_`: A wildcard matching just one character.

## Using `LIKE` with `WHERE` and `%`

In [21]:
SELECT DISTINCT county,
       town
  FROM districts
 WHERE town LIKE '中%';

county,town
南投縣,中寮鄉
嘉義縣,中埔鄉
基隆市,中山區
基隆市,中正區
新北市,中和區
桃園市,中壢區
臺中市,中區
臺北市,中山區
臺北市,中正區
臺南市,中西區


## Using `LIKE` with `WHERE` and `_`

In [22]:
SELECT DISTINCT county,
       town
  FROM districts
 WHERE town LIKE '大_區';

county,town
桃園市,大園區
桃園市,大溪區
臺中市,大安區
臺中市,大甲區
臺中市,大肚區
臺中市,大里區
臺中市,大雅區
臺北市,大同區
臺北市,大安區
臺南市,大內區


## Combining conditions with `AND` and `OR`

We can combine conditions using `AND` and `OR`.

## Combining conditions with `AND`

In [23]:
SELECT DISTINCT county,
       town
  FROM districts
 WHERE town LIKE '中%' AND
       town LIKE '%區';

county,town
基隆市,中山區
基隆市,中正區
新北市,中和區
桃園市,中壢區
臺中市,中區
臺北市,中山區
臺北市,中正區
臺南市,中西區


## Combining conditions with `OR`

In [24]:
SELECT DISTINCT county,
       town
  FROM districts
 WHERE town LIKE '中_區' OR
       town LIKE '大_區';

county,town
基隆市,中山區
基隆市,中正區
新北市,中和區
桃園市,中壢區
桃園市,大園區
桃園市,大溪區
臺中市,大安區
臺中市,大甲區
臺中市,大肚區
臺中市,大里區
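
When `AND` and `OR` appear in the same `WHERE` clause, `AND` is evaluated first, so parentheses are needed to group the `OR` part. A minimal sketch in Python's standard `sqlite3` module; the places appear in the outputs above, while the ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (id INTEGER PRIMARY KEY, county TEXT, town TEXT)")
conn.executemany(
    "INSERT INTO districts VALUES (?, ?, ?)",
    [
        (1, "臺北市", "中山區"),
        (2, "臺北市", "大安區"),
        (3, "臺中市", "中區"),
        (4, "臺中市", "大安區"),
    ],
)

# Read as: (county = 臺北市 AND town LIKE 中%) OR town LIKE 大%
no_parens = conn.execute(
    "SELECT id FROM districts "
    "WHERE county = '臺北市' AND town LIKE '中%' OR town LIKE '大%' "
    "ORDER BY id"
).fetchall()

# Parentheses restrict both town patterns to 臺北市.
with_parens = conn.execute(
    "SELECT id FROM districts "
    "WHERE county = '臺北市' AND (town LIKE '中%' OR town LIKE '大%') "
    "ORDER BY id"
).fetchall()
conn.close()

print(no_parens)    # [(1,), (2,), (4,)]
print(with_parens)  # [(1,), (2,)]
```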


## Putting what we have so far all together

The clauses of a `SELECT` statement must appear in a fixed order, so follow this template:

```sql
SELECT column_names
  FROM table_name
 WHERE conditions
 ORDER BY column_names
 LIMIT m OFFSET n;
```
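
As a closing sketch, the template can be filled in and run with Python's standard `sqlite3` module. The vote counts below are made up. The second query deliberately puts `WHERE` after `ORDER BY` to show that SQLite rejects clauses in the wrong order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE presidents (village_id INTEGER, candidate_id INTEGER, votes INTEGER)"
)
conn.executemany(
    "INSERT INTO presidents VALUES (?, ?, ?)",
    # Made-up vote counts for two hypothetical candidates.
    [(1, 329, 400), (1, 330, 150), (2, 329, 600), (2, 330, 240), (3, 329, 390)],
)

# Filter, sort, skip the top row, and keep the next two.
rows = conn.execute(
    "SELECT village_id, votes "
    "  FROM presidents "
    " WHERE candidate_id = 329 "
    " ORDER BY votes DESC "
    " LIMIT 2 OFFSET 1"
).fetchall()

# Same clauses in the wrong order: SQLite reports a syntax error.
try:
    conn.execute(
        "SELECT village_id, votes FROM presidents "
        "ORDER BY votes DESC WHERE candidate_id = 329"
    )
    misordered_ok = True
except sqlite3.OperationalError:
    misordered_ok = False
conn.close()

print(rows)           # [(1, 400), (3, 390)]
print(misordered_ok)  # False
```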