# SQL in Python

SQL (Structured Query Language) is the language used to query relational databases.


We'll use SQLite, a database system designed to run with minimal resources. Unlike most other systems, an SQLite database is stored in a single file.
It can be used from Python through the 'sqlite3' module, which is included by default with Python 3.


1. [Environment Setup](#1.-Environment-Setup)
2. [Data Model](#2.-Data-Model)
3. [Queries](#3.-Queries)

    3.1 [First query](#3.1-First-query)
    
    3.2 [SELECT](#3.2-SELECT)
    
    3.3 [LIMIT](#3.3-LIMIT)
    
    3.4 [DISTINCT](#3.4-DISTINCT)
    
    3.5 [WHERE](#3.5-WHERE)
    
    3.6 [ORDER BY](#3.6-ORDER-BY)
    
    3.7 [Aggregations](#3.7-Aggregations)
    
    3.8 [GROUP BY](#3.8-GROUP-BY)
    
    3.9 [JOIN](#3.9-JOIN)
    
4. [Errors](#4.-Errors)

5. [Summary](#5.-Summary)


## 1. Environment Setup
In this notebook we'll also use 'Pandas', one of the most widely used Python libraries for processing tabular data.

In [5]:
import pandas as pd
import sqlite3

First we must establish a connection to the database.

SQLite doesn't have user management; having access to the file is enough.

Normally we would need to connect to the company database using other libraries, such as 'pyodbc' for SQL Server or 'cx_Oracle' for Oracle.

In [6]:
# Connect to the chinook database
connection = sqlite3.connect("chinook.db")

# Get a cursor to create the queries
cursor = connection.cursor()

In [7]:
# Write a function to read the data and convert to a Pandas DataFrame
def sql_query(query):
    cursor.execute(query)
    # Store the data of the query
    ans = cursor.fetchall()
    # Obtain the names of the columns
    names = [description[0] for description in cursor.description]

    return pd.DataFrame(ans,columns=names)
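
To see what `cursor.fetchall()` and `cursor.description` return without needing `chinook.db`, here is a minimal sketch on an in-memory database. The table `demo_tracks` and its rows are made up for illustration, and only the standard-library `sqlite3` module is used.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# Hypothetical table, invented for this example (not part of chinook)
demo_cur.execute("CREATE TABLE demo_tracks (TrackId INTEGER, Name TEXT)")
demo_cur.executemany(
    "INSERT INTO demo_tracks VALUES (?, ?)",
    [(1, "Song A"), (2, "Song B")],
)

demo_cur.execute("SELECT TrackId, Name FROM demo_tracks ORDER BY TrackId")
rows = demo_cur.fetchall()                        # list of tuples, one per row
names = [col[0] for col in demo_cur.description]  # column names of the last query

print(names)  # ['TrackId', 'Name']
print(rows)   # [(1, 'Song A'), (2, 'Song B')]

demo_conn.close()
```

Passing `rows` and `names` to `pd.DataFrame(rows, columns=names)` is the step that `sql_query` performs to build the table.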

## 2. Data Model
Before we start interacting with a database, we need to know its structure, and for that we need its **data model**.

![imagen](./img/chinook_data_model.png)

We can list the tables that the database contains with the following query:

In [8]:
query = "SELECT name FROM sqlite_master WHERE type='table'"
sql_query(query)

Unnamed: 0,name
0,albums
1,sqlite_sequence
2,artists
3,customers
4,employees
5,genres
6,invoices
7,invoice_items
8,media_types
9,playlists


In [9]:
pd.read_sql(query, connection)

Unnamed: 0,name
0,albums
1,sqlite_sequence
2,artists
3,customers
4,employees
5,genres
6,invoices
7,invoice_items
8,media_types
9,playlists


## 3. Queries
We write a *query* to select which data to bring from the database. Queries have this structure:

```SQL
SELECT field1, field2, field3...
FROM a_table
WHERE conditions
```

**Reserved SQL keywords are usually capitalized** in order to differentiate them from the rest.

This is only a convention: we wouldn't get an error with lowercase keywords, as **SQL keywords are not case sensitive**.
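
A quick way to check this is the sketch below, which runs the same query in two spellings on an in-memory database. The table `demo` is made up for the example. It also shows that, in SQLite, string values compared with `=` are still matched exactly.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (Name TEXT)")  # hypothetical table
demo_conn.execute("INSERT INTO demo VALUES ('Alpha')")

upper = demo_conn.execute("SELECT Name FROM demo").fetchall()
lower = demo_conn.execute("select name from DEMO").fetchall()

# Keywords, table names and column names are matched case-insensitively...
print(upper == lower)  # True

# ...but string values are compared exactly with "="
exact = demo_conn.execute("SELECT Name FROM demo WHERE Name = 'alpha'").fetchall()
print(exact)  # []

demo_conn.close()
```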

### 3.1 First query

In [10]:
# Show entire tracks table
query = """
SELECT *
FROM tracks
"""
sql_query(query)

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99
1,2,Balls to the Wall,2,2,1,,342562,5510424,0.99
2,3,Fast As a Shark,3,2,1,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho...",230619,3990994,0.99
3,4,Restless and Wild,3,2,1,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...",252051,4331779,0.99
4,5,Princess of the Dawn,3,2,1,Deaffy & R.A. Smith-Diesel,375418,6290521,0.99
...,...,...,...,...,...,...,...,...,...
3498,3499,Pini Di Roma (Pinien Von Rom) \ I Pini Della V...,343,2,24,,286741,4718950,0.99
3499,3500,"String Quartet No. 12 in C Minor, D. 703 ""Quar...",344,2,24,Franz Schubert,139200,2283131,0.99
3500,3501,"L'orfeo, Act 3, Sinfonia (Orchestra)",345,2,24,Claudio Monteverdi,66639,1189062,0.99
3501,3502,"Quintet for Horn, Violin, 2 Violas, and Cello ...",346,2,24,Wolfgang Amadeus Mozart,221331,3665114,0.99


### 3.2 SELECT
**The `SELECT` clause chooses which columns are returned**. We can even rename columns inside `SELECT`. There are two options:
- **SELECT * :** this way we bring ALL the columns. It's not recommended because we may download unnecessary data, which reduces performance. **Databases are resources shared among many users**, and the query will probably be used in a pipeline, so a human may never even look at the data.

- **SELECT field1, field2...:** selects exactly the data that we need.

If you want to rename a column, use the syntax `field AS new_name`. **If you want to use spaces in the name, you need to enclose it in double quotation marks.**

Keywords are not case sensitive, so `as` works the same as `AS`.
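
As a minimal sketch of renaming, the block below gives one column an alias that contains a space and another a plain alias, then reads the resulting column names. The table `demo_tracks` and its single row are invented for the example.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_tracks (Name TEXT, Composer TEXT)")  # hypothetical table
demo_conn.execute("INSERT INTO demo_tracks VALUES ('Song A', 'Composer X')")

# Double quotes are needed only for the alias that contains a space
demo_cur = demo_conn.execute(
    'SELECT Name AS "Song Name", Composer AS Author FROM demo_tracks'
)
alias_names = [col[0] for col in demo_cur.description]
print(alias_names)  # ['Song Name', 'Author']

demo_conn.close()
```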

In [11]:
query = "SELECT name FROM sqlite_master WHERE type='table'"
sql_query(query)

Unnamed: 0,name
0,albums
1,sqlite_sequence
2,artists
3,customers
4,employees
5,genres
6,invoices
7,invoice_items
8,media_types
9,playlists


In [14]:
# From the tracks table, rename "Name" to "SongName" and also show the column "Composer"
query = """
SELECT Name as "SongName", Composer
FROM tracks
"""
sql_query(query)

Unnamed: 0,SongName,Composer
0,For Those About To Rock (We Salute You),"Angus Young, Malcolm Young, Brian Johnson"
1,Balls to the Wall,
2,Fast As a Shark,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho..."
3,Restless and Wild,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D..."
4,Princess of the Dawn,Deaffy & R.A. Smith-Diesel
...,...,...
3498,Pini Di Roma (Pinien Von Rom) \ I Pini Della V...,
3499,"String Quartet No. 12 in C Minor, D. 703 ""Quar...",Franz Schubert
3500,"L'orfeo, Act 3, Sinfonia (Orchestra)",Claudio Monteverdi
3501,"Quintet for Horn, Violin, 2 Violas, and Cello ...",Wolfgang Amadeus Mozart


### 3.3 LIMIT
`LIMIT` restricts the number of rows returned. It always goes at the end of the query, for example `LIMIT 10`.

In [15]:
# Limit the previous query to 10 rows
query = """
SELECT Name as 'SongName', Composer
FROM tracks
LIMIT 10
"""
sql_query(query)

Unnamed: 0,SongName,Composer
0,For Those About To Rock (We Salute You),"Angus Young, Malcolm Young, Brian Johnson"
1,Balls to the Wall,
2,Fast As a Shark,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho..."
3,Restless and Wild,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D..."
4,Princess of the Dawn,Deaffy & R.A. Smith-Diesel
5,Put The Finger On You,"Angus Young, Malcolm Young, Brian Johnson"
6,Let's Get It Up,"Angus Young, Malcolm Young, Brian Johnson"
7,Inject The Venom,"Angus Young, Malcolm Young, Brian Johnson"
8,Snowballed,"Angus Young, Malcolm Young, Brian Johnson"
9,Evil Walks,"Angus Young, Malcolm Young, Brian Johnson"


### 3.4 DISTINCT
`DISTINCT` returns only the unique rows. It is useful to remove duplicates, or to see all the different values in a certain column.

**It may be very slow with tables with thousands or millions of rows.**

In [16]:
# Show all composers from the database
query = """
SELECT DISTINCT Composer
FROM tracks
"""
sql_query(query)

Unnamed: 0,Composer
0,"Angus Young, Malcolm Young, Brian Johnson"
1,
2,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho..."
3,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D..."
4,Deaffy & R.A. Smith-Diesel
...,...
848,Carl Nielsen
849,Niccolò Paganini
850,Pietro Antonio Locatelli
851,Claudio Monteverdi


### 3.5 WHERE
It is used to filter rows. For example:
* **Single numeric value**
    * UnitPrice = 0.99
    * UnitPrice >= 0.99
    * UnitPrice < 0.99
* **Single string value**: Name = 'Restless and Wild'
* **Several values**: GenreId IN (1, 5, 12)
* **String patterns**:
    * strings that start with 'A': Name LIKE 'A%'
    * strings that end in 'A': Name LIKE '%A'
    * strings that contain 'A' anywhere: Name LIKE '%A%'
* **Not equal to**: UnitPrice <> 0.99
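
The conditions in the list can be tried out with the sketch below, which applies several of them to a tiny in-memory table. The table `demo_tracks` and its three rows are made up for illustration. Note that in SQLite `LIKE` ignores case for ASCII letters by default.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, invented for this example
demo_conn.execute("CREATE TABLE demo_tracks (Name TEXT, GenreId INTEGER, UnitPrice REAL)")
demo_conn.executemany(
    "INSERT INTO demo_tracks VALUES (?, ?, ?)",
    [("Angel", 1, 0.99), ("Banana", 5, 1.99), ("Coda", 7, 0.99)],
)

conditions = [
    "UnitPrice > 0.99",       # ['Banana']
    "UnitPrice <> 0.99",      # ['Banana']
    "GenreId IN (1, 5, 12)",  # ['Angel', 'Banana']
    "Name LIKE 'A%'",         # ['Angel']
    "Name LIKE '%a'",         # ['Banana', 'Coda']
    "Name LIKE '%n%'",        # ['Angel', 'Banana']
]

results = {}
for condition in conditions:
    # String formatting is acceptable here only because the conditions are fixed literals
    sql = f"SELECT Name FROM demo_tracks WHERE {condition} ORDER BY Name"
    results[condition] = [row[0] for row in demo_conn.execute(sql)]
    print(condition, "->", results[condition])

demo_conn.close()
```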

`WHERE`

In [17]:
# Filter the tracks table showing only the tracks from album 226 with a unit price above 0.99
query = """
SELECT *
FROM tracks
WHERE (UnitPrice > 0.99) AND (AlbumId = 226)
"""
sql_query(query)

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,2819,Battlestar Galactica: The Story So Far,226,3,18,,2622250,490750393,1.99


`LIKE`

In [18]:
# Filter the tracks table to get the songs whose Composer field ends with "Brian Johnson"
query = """
SELECT *
FROM tracks
WHERE Composer LIKE '%Brian Johnson'
"""
sql_query(query)

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99
1,6,Put The Finger On You,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",205662,6713451,0.99
2,7,Let's Get It Up,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",233926,7636561,0.99
3,8,Inject The Venom,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",210834,6852860,0.99
4,9,Snowballed,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",203102,6599424,0.99
5,10,Evil Walks,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",263497,8611245,0.99
6,11,C.O.D.,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",199836,6566314,0.99
7,12,Breaking The Rules,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",263288,8596840,0.99
8,13,Night Of The Long Knives,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",205688,6706347,0.99
9,14,Spellbound,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",270863,8817038,0.99


`WHERE` with several arguments

In [22]:
# Filter the tracks table to obtain the songs with a unit price greater than 0.99, more than 100_000_000 bytes
# and a genre identifier of 21, 22 or 23
query = """
SELECT *
FROM Tracks
WHERE (UnitPrice > 0.99)
AND (Bytes > 100000000)
AND (GenreId = 21 OR TrackId = 22 OR TrackID = 23) -- 22 and 23 are compared with TrackId here; to filter by genre use GenreId IN (21, 22, 23)
"""
sql_query(query)

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,2840,Don't Look Back,228,3,21,,2571154,493628775,1.99
1,2841,One Giant Leap,228,3,21,,2607649,521616246,1.99
2,2842,Collision,228,3,21,,2605480,526182322,1.99
3,2843,Hiros,228,3,21,,2533575,488835454,1.99
4,2844,Better Halves,228,3,21,,2573031,549353481,1.99
...,...,...,...,...,...,...,...,...,...
57,3360,Something Nice Back Home,261,3,21,,2612779,484711353,1.99
58,3361,Cabin Fever,261,3,21,,2612028,477733942,1.99
59,3362,"There's No Place Like Home, Pt. 1",261,3,21,,2609526,522919189,1.99
60,3363,"There's No Place Like Home, Pt. 2",261,3,21,,2497956,523748920,1.99


### 3.6 ORDER BY
We can **order the table by the field that we want**.
By default `ORDER BY` sorts strings alphabetically and numeric data from smallest to largest.
To sort the other way around, use `DESC`, as in `ORDER BY field DESC`.
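
A minimal sketch of both directions, and of sorting by more than one field, on an in-memory table. The table `demo_tracks` and its rows are made up for the example.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_tracks (Name TEXT, UnitPrice REAL)")  # hypothetical table
demo_conn.executemany(
    "INSERT INTO demo_tracks VALUES (?, ?)",
    [("Beta", 0.99), ("Alpha", 1.99), ("Gamma", 0.99)],
)

# Default direction: ascending
ascending = demo_conn.execute(
    "SELECT Name FROM demo_tracks ORDER BY Name"
).fetchall()
print(ascending)  # [('Alpha',), ('Beta',), ('Gamma',)]

# Highest price first, then by name to break ties
mixed = demo_conn.execute(
    "SELECT Name, UnitPrice FROM demo_tracks ORDER BY UnitPrice DESC, Name"
).fetchall()
print(mixed)  # [('Alpha', 1.99), ('Beta', 0.99), ('Gamma', 0.99)]

demo_conn.close()
```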

In [23]:
# Order the table tracks by name in descending order
query = """
SELECT *
FROM tracks
ORDER BY Name DESC
"""

sql_query(query)

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,1077,Último Pau-De-Arara,85,1,10,Corumbá/José Gumarães/Venancio,200437,6638563,0.99
1,1073,Óia Eu Aqui De Novo,85,1,10,,219454,7469735,0.99
2,2078,Óculos,169,1,7,,219271,7262419,0.99
3,3496,"Étude 1, In C Major - Preludio (Presto) - Liszt",340,4,24,,51780,2229617,0.99
4,333,É que Nessa Encarnação Eu Nasci Manga,29,1,9,Lucina/Luli,196519,6568081,0.99
...,...,...,...,...,...,...,...,...,...
3498,3254,#9 Dream,255,2,9,,278312,4506425,0.99
3499,109,#1 Zero,11,1,4,"Cornell, Commerford, Morello, Wilk",299102,9731988,0.99
3500,3412,"""Eine Kleine Nachtmusik"" Serenade In G, K. 525...",281,2,24,Wolfgang Amadeus Mozart,348971,5760129,0.99
3501,2918,"""?""",231,3,19,,2782333,528227089,1.99


### 3.7 Aggregations
On some occasions we may want a summary statistic, such as the maximum of a field, its standard deviation or simply a count of its non-null values. For that we can use functions like `MAX`, `COUNT` or `AVG`.
[This page](https://www.sqlservertutorial.net/sql-server-aggregate-functions/), written for SQL Server, summarizes the most important functions.
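
The sketch below computes several aggregates in a single query on an in-memory table, including the difference between counting rows and counting non-null values. The table `demo_tracks` and its rows are invented for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_tracks (Name TEXT, Composer TEXT, UnitPrice REAL)")
# Hypothetical rows; the second one has no composer (NULL)
demo_conn.executemany(
    "INSERT INTO demo_tracks VALUES (?, ?, ?)",
    [("Song A", "X", 1.0), ("Song B", None, 2.0), ("Song C", "X", 3.0)],
)

summary = demo_conn.execute(
    """
    SELECT COUNT(*)                 AS n_rows,       -- counts every row
           COUNT(Composer)          AS n_composers,  -- skips NULL values
           COUNT(DISTINCT Composer) AS n_unique,     -- distinct non-NULL values
           MAX(UnitPrice)           AS max_price,
           AVG(UnitPrice)           AS avg_price
    FROM demo_tracks
    """
).fetchone()
print(summary)  # (3, 2, 1, 3.0, 2.0)

demo_conn.close()
```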

In [24]:
# Count the distinct names of the songs that start with the letter 'A'
query = """
SELECT COUNT(DISTINCT Name) AS QtTracks
FROM tracks
WHERE Name LIKE 'A%'
"""
sql_query(query)

Unnamed: 0,QtTracks
0,185


In [25]:
# What is the maximum unit price in the tracks table?
query = """
SELECT MAX(UnitPrice)
FROM tracks
"""
sql_query(query)

Unnamed: 0,MAX(UnitPrice)
0,1.99


### 3.8 GROUP BY
`GROUP BY` is a clause for **computing aggregates for each value of another field**. For example, to compute the mean unit price by genre.


In [30]:
# Obtain the mean unit price by genre in the tracks table
query = """
SELECT GenreID, AVG(UnitPrice) as TotalPrice
FROM tracks
GROUP BY GenreID
ORDER BY TotalPrice DESC
LIMIT 10
"""

sql_query(query)

Unnamed: 0,GenreId,TotalPrice
0,19,1.99
1,20,1.99
2,18,1.99
3,22,1.99
4,21,1.99
5,1,0.99
6,7,0.99
7,3,0.99
8,4,0.99
9,8,0.99


Or count how many songs each composer has written.

In [32]:
query = """
SELECT Composer, COUNT(TrackId) as QtTracks
FROM tracks
-- WHERE Composer <> 'None'
GROUP BY Composer
ORDER BY QtTracks DESC
-- LIMIT 1
"""

sql_query(query)

Unnamed: 0,Composer,QtTracks
0,,978
1,Steve Harris,80
2,U2,44
3,Jagger/Richards,35
4,Billy Corgan,31
...,...,...
848,Aaron Goldberg,1
849,Aaron Copland,1
850,A.Isbell/A.Jones/O.Redding,1
851,A.Bouchard/J.Bouchard/S.Pearlman,1


In [94]:
query = """
SELECT Composer, COUNT(Composer) as QtTracks
FROM tracks
-- WHERE Composer <> 'None'
GROUP BY Composer
ORDER BY QtTracks DESC
-- LIMIT 1
"""

sql_query(query)

Unnamed: 0,Composer,QtTracks
0,Steve Harris,80
1,U2,44
2,Jagger/Richards,35
3,Billy Corgan,31
4,Kurt Cobain,26
...,...,...
848,Aaron Copland,1
849,A.Isbell/A.Jones/O.Redding,1
850,A.Bouchard/J.Bouchard/S.Pearlman,1
851,A. Jamal,1
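
The two results above differ in how tracks without a composer are counted: `COUNT(TrackId)` counts every row of the group, while `COUNT(Composer)` skips missing (`NULL`) values, so the group with no composer gets a count of 0 and moves to the end. A minimal sketch on a made-up in-memory table (`demo_tracks` and its rows are assumptions for the example):

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_tracks (TrackId INTEGER, Composer TEXT)")
# Hypothetical rows; three tracks have no composer (NULL)
demo_conn.executemany(
    "INSERT INTO demo_tracks VALUES (?, ?)",
    [(1, "X"), (2, None), (3, None), (4, None), (5, "X"), (6, "Y")],
)

by_track = demo_conn.execute(
    "SELECT Composer, COUNT(TrackId) AS QtTracks FROM demo_tracks "
    "GROUP BY Composer ORDER BY QtTracks DESC, Composer"
).fetchall()
by_composer = demo_conn.execute(
    "SELECT Composer, COUNT(Composer) AS QtTracks FROM demo_tracks "
    "GROUP BY Composer ORDER BY QtTracks DESC, Composer"
).fetchall()

print(by_track)     # [(None, 3), ('X', 2), ('Y', 1)]
print(by_composer)  # [('X', 2), ('Y', 1), (None, 0)]

demo_conn.close()
```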


### 3.9 JOIN
Until now we've seen how to query a single table, but **what if we want to use data from different tables?**
That is done with `JOIN` operations, which require **one or more columns with common values in both tables, called KEYS**.

**When to use them?** For example, if we have a table with a set of clients and we want to add new fields, we need to go to other tables with the client identifier and apply a `JOIN`.

Or, for example, we have a table with all the orders and many fields (city, address, client...), and another table with only the numbers of the orders that weren't delivered. To filter the full orders table down to the orders that weren't delivered, you can apply an `INNER JOIN`, which keeps only the common values, using the order number as the key.

Different types of JOINs:

![SQL JOINS](./img/joins.jpg)
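
As a small illustration of two of these types, the sketch below runs the same join as `INNER` and as `LEFT` on two made-up in-memory tables (`demo_tracks` and `demo_genres` are assumptions for the example). One track has no genre, so it only survives the `LEFT JOIN`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    -- Hypothetical tables and rows, invented for this example
    CREATE TABLE demo_genres (GenreId INTEGER, Name TEXT);
    CREATE TABLE demo_tracks (TrackId INTEGER, Name TEXT, GenreId INTEGER);
    INSERT INTO demo_genres VALUES (1, 'Rock'), (2, 'Blues');
    INSERT INTO demo_tracks VALUES (10, 'Song A', 1), (11, 'Song B', 2), (12, 'Song C', NULL);
    """
)

join_sql = """
SELECT tra.Name, gen.Name
FROM demo_tracks tra
{kind} JOIN demo_genres gen
ON tra.GenreId = gen.GenreId
ORDER BY tra.TrackId
"""

# INNER JOIN keeps only the tracks that have a matching genre
inner_rows = demo_conn.execute(join_sql.format(kind="INNER")).fetchall()
print(inner_rows)  # [('Song A', 'Rock'), ('Song B', 'Blues')]

# LEFT JOIN keeps every track and fills the missing genre with None
left_rows = demo_conn.execute(join_sql.format(kind="LEFT")).fetchall()
print(left_rows)   # [('Song A', 'Rock'), ('Song B', 'Blues'), ('Song C', None)]

demo_conn.close()
```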

In [33]:
# Join the table tracks and genres with an INNER JOIN, showing the columns name and composer (from tracks) and name (from genres)
# Filter the rows so that the genre starts with B

query='''
SELECT tra.Name AS TrackName ,
        tra.Composer,
        gen.Name as GenreName
FROM tracks tra
JOIN genres gen
ON tra.GenreId = gen.GenreId
WHERE GenreName LIKE 'B%' -- SQLite accepts the alias here; other databases may require gen.Name
'''
sql_query(query)

Unnamed: 0,TrackName,Composer,GenreName
0,First Time I Met The Blues,Eurreal Montgomery,Blues
1,Let Me Love You Baby,Willie Dixon,Blues
2,Stone Crazy,Buddy Guy,Blues
3,Pretty Baby,Willie Dixon,Blues
4,When My Left Eye Jumps,Al Perkins/Willie Dixon,Blues
...,...,...,...
91,Berimbau,,Bossa Nova
92,Deixa,,Bossa Nova
93,Pot-Pourri N.º 2,,Bossa Nova
94,Samba Em Prelúdio,,Bossa Nova


In [34]:
# Join the tables tracks and invoice_items with a LEFT JOIN, showing the columns trackid, name and composer from the table tracks,
# and invoicelineid, invoiceid from the table invoice_items
query = '''
SELECT tra.trackId, tra.Name, tra.Composer, inv.InvoiceLineId, inv.InvoiceId
FROM tracks tra
LEFT JOIN invoice_items inv
ON tra.trackId = inv.trackId
'''
sql_query(query)

Unnamed: 0,TrackId,Name,Composer,InvoiceLineId,InvoiceId
0,1,For Those About To Rock (We Salute You),"Angus Young, Malcolm Young, Brian Johnson",579.0,108.0
1,2,Balls to the Wall,,1.0,1.0
2,2,Balls to the Wall,,1154.0,214.0
3,3,Fast As a Shark,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho...",1728.0,319.0
4,4,Restless and Wild,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...",2.0,1.0
...,...,...,...,...,...
3754,3500,"String Quartet No. 12 in C Minor, D. 703 ""Quar...",Franz Schubert,578.0,108.0
3755,3500,"String Quartet No. 12 in C Minor, D. 703 ""Quar...",Franz Schubert,1727.0,319.0
3756,3501,"L'orfeo, Act 3, Sinfonia (Orchestra)",Claudio Monteverdi,,
3757,3502,"Quintet for Horn, Violin, 2 Violas, and Cello ...",Wolfgang Amadeus Mozart,,


In [35]:
# Join the table tracks and albums with a LEFT JOIN, showing the columns trackid, name and albumid from the table tracks,
# and the column title from the table albums
query = '''
SELECT tra.TrackId, tra.Name, tra.AlbumId, alb.Title
FROM tracks tra
LEFT JOIN albums alb
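ON tra.AlbumId = tra.AlbumId -- always true, so every track is paired with every album; to match each track with its own album use alb.AlbumId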
'''
sql_query(query)

Unnamed: 0,TrackId,Name,AlbumId,Title
0,1,For Those About To Rock (We Salute You),1,For Those About To Rock We Salute You
1,1,For Those About To Rock (We Salute You),1,Balls to the Wall
2,1,For Those About To Rock (We Salute You),1,Restless and Wild
3,1,For Those About To Rock (We Salute You),1,Let There Be Rock
4,1,For Those About To Rock (We Salute You),1,Big Ones
...,...,...,...,...
1215536,3503,Koyaanisqatsi,347,Respighi:Pines of Rome
1215537,3503,Koyaanisqatsi,347,Schubert: The Late String Quartets & String Qu...
1215538,3503,Koyaanisqatsi,347,Monteverdi: L'Orfeo
1215539,3503,Koyaanisqatsi,347,Mozart: Chamber Music
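
The result above has more than a million rows, and the first track appears next to unrelated album titles, because the `ON` condition compares `tra.AlbumId` with itself. That is true for every combination of rows, so each track is paired with every album. A minimal sketch on made-up in-memory tables (`demo_tracks` and `demo_albums` are assumptions for the example) contrasts that condition with the one that compares the key across both tables:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    -- Hypothetical tables and rows, invented for this example
    CREATE TABLE demo_albums (AlbumId INTEGER, Title TEXT);
    CREATE TABLE demo_tracks (TrackId INTEGER, Name TEXT, AlbumId INTEGER);
    INSERT INTO demo_albums VALUES (1, 'Album One'), (2, 'Album Two');
    INSERT INTO demo_tracks VALUES (10, 'Song A', 1), (11, 'Song B', 1), (12, 'Song C', 2);
    """
)

count_sql = """
SELECT COUNT(*)
FROM demo_tracks tra
LEFT JOIN demo_albums alb
ON {condition}
"""

# Comparing a column with itself is true for every (track, album) pair: 3 x 2 rows
self_compare = demo_conn.execute(
    count_sql.format(condition="tra.AlbumId = tra.AlbumId")
).fetchone()[0]

# Comparing the key across both tables keeps one album per track: 3 rows
key_compare = demo_conn.execute(
    count_sql.format(condition="tra.AlbumId = alb.AlbumId")
).fetchone()[0]

print(self_compare, key_compare)  # 6 3

demo_conn.close()
```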


<table align="left">
 <tr><td width="80"><img src="./img/error.png" style="width:auto;height:auto"></td>
     <td style="text-align:left">
         <h3>ERRORS</h3>
 </td></tr>
</table>

The errors below are all of the same type: `OperationalError`. This shows that the error is raised by the SQLite engine, rather than by the Python interpreter.
The message even names the token, table or column that caused the problem.

In [36]:
query = '''
SEECT * 
FROM tracks
'''

sql_query(query)

OperationalError: near "SEECT": syntax error

In [37]:
query = '''
SELECT * 
FROM tracksssss
'''

sql_query(query)

OperationalError: no such table: tracksssss

In [38]:
query = '''
SELECT field1, field2
FROM tracks
'''

sql_query(query)

OperationalError: no such column: field1
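
Since these are ordinary Python exceptions, they can be caught with `try`/`except`. The sketch below is one possible way to collect the messages instead of stopping at the first failure. It uses a made-up in-memory table (`demo_tracks`), so the table name in the messages differs from the ones above.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_tracks (Name TEXT)")  # hypothetical table

bad_queries = [
    "SEECT * FROM demo_tracks",        # misspelled keyword
    "SELECT * FROM demo_trackssss",    # table does not exist
    "SELECT field1 FROM demo_tracks",  # column does not exist
]

messages = []
for sql in bad_queries:
    try:
        demo_conn.execute(sql)
    except sqlite3.OperationalError as err:
        # The exception text is the message produced by the SQLite engine
        messages.append(str(err))

for msg in messages:
    print(msg)

demo_conn.close()
```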

## 5. Summary
**SQL is the standard language for accessing relational databases**, the kind of database that almost every company uses.
The basic SQL syntax is:

```SQL
SELECT field1, field2, field3...
FROM a_table
WHERE conditions
```
With some actions:
1. **Filter columns**: `SELECT`
2. **Rename fields**: `SELECT field as new_name`
3. **Remove duplicates**: `DISTINCT`
4. **Limit the number of rows**: `LIMIT`
5. **Filter rows**: `WHERE`
6. **Order a table**: `ORDER BY field1, field2`, optionally with `DESC`
7. **Aggregate data**: get a KPI with functions like `MAX`, `COUNT`, `AVG`...
8. **Aggregate data at group level**: with `GROUP BY`. The aggregation function is computed for each group.
9. **Join data**: `JOIN` with different types: `INNER`, `LEFT`, `RIGHT` and `FULL` (the last three are `OUTER` joins).
10. **Virtual tables (stored queries)**: with `VIEW`
11. **Delete tables or views**: with `DROP`

The data can also be retrieved with sqlite3 **and processed afterwards with pandas.**
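
`VIEW` and `DROP` from the list are not exercised in the cells above. As a minimal sketch, assuming a made-up in-memory table `demo_tracks` and a hypothetical view name `demo_expensive`, they could be used like this:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    -- Hypothetical table and rows, invented for this example
    CREATE TABLE demo_tracks (Name TEXT, UnitPrice REAL);
    INSERT INTO demo_tracks VALUES ('Song A', 0.99), ('Song B', 1.99);
    -- A view stores the query, not the data
    CREATE VIEW demo_expensive AS
        SELECT Name FROM demo_tracks WHERE UnitPrice > 0.99;
    """
)

# A view is queried like a table
view_rows = demo_conn.execute("SELECT * FROM demo_expensive").fetchall()
print(view_rows)  # [('Song B',)]

# DROP removes the view and the table from the database
demo_conn.execute("DROP VIEW demo_expensive")
demo_conn.execute("DROP TABLE demo_tracks")
remaining = demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
).fetchall()
print(remaining)  # []

demo_conn.close()
```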