# Lecture 0 - Querying


Let's install pandas (via mamba) so that we can run SQL queries against a SQLite database directly from Jupyter Notebook cells and display the results as tables.

In [None]:
%conda install mamba

In [None]:
%mamba install pandas

In [2]:
# Let's import the required libraries
import sqlite3
import pandas as pd

# Let's connect to the SQLite database used in CS50
conn = sqlite3.connect("longlist.db")


In [16]:
# Let's inspect the schema of the longlist table
pd.read_sql_query(
    "PRAGMA table_info(longlist);",
    conn
)


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,isbn,TEXT,0,,0
1,1,title,TEXT,0,,0
2,2,author,TEXT,0,,0
3,3,translator,TEXT,0,,0
4,4,format,TEXT,0,,0
5,5,pages,INTEGER,0,,0
6,6,publisher,TEXT,0,,0
7,7,published,TEXT,0,,0
8,8,year,INTEGER,0,,0
9,9,votes,INTEGER,0,,0


#### SELECT + DISTINCT

SELECT allows us to select some or all columns and rows from a table inside the database.

It is good practice to use double quotes around table and column names, which are called SQL identifiers. SQL also has strings and we use single quotes around strings to differentiate them from identifiers.
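
The difference between an identifier and a string can be checked with a small sketch. It assumes an in-memory database holding two rows copied from the longlist output (not the real longlist.db) and uses plain sqlite3 instead of pandas.

```python
import sqlite3

# In-memory stand-in for longlist.db with two rows (an assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "author" TEXT)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?)',
    [("Boulder", "Eva Baltasar"), ("Whale", "Cheon Myeong-Kwan")],
)

# Double quotes: "title" is an identifier, so the column's values come back
as_identifier = conn.execute('SELECT "title" FROM "longlist"').fetchall()

# Single quotes: 'title' is a string literal, repeated once per row
as_string = conn.execute("SELECT 'title' FROM \"longlist\"").fetchall()

print(as_identifier)  # [('Boulder',), ('Whale',)]
print(as_string)      # [('title',), ('title',)]
conn.close()
```

With single quotes the query never looks at the column at all; it just returns the literal text once for every row.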

In [33]:
# Selects all columns from "longlist" table
pd.read_sql_query(
"SELECT * FROM longlist;",
conn
)


Unnamed: 0,isbn,title,author,translator,format,pages,publisher,published,year,votes,rating
0,9788439736967,Boulder,Eva Baltasar,Nicole d'Amonville Alegría,paperback,112,Literatura Random House,2022-08-02,2023,2779,3.77
1,9781628971538,Whale,Cheon Myeong-Kwan,Jae Won Chung,paperback,368,Europa Editions,2023-01-19,2023,175,3.97
2,9781642861181,The Gospel According to the New World,Maryse Condé,Richard Philcox,paperback,184,World Editions,2023-03-07,2023,114,3.05
3,9781529414431,Standing Heavy,Gauz,Frank Wynne,paperback,252,MacLehose Press,2022-05-26,2023,322,3.57
4,9781474623025,Time Shelter,Georgi Gospodinov,Angela Rodel,hardcover,304,W&N,2022-04-21,2023,3142,4.05
...,...,...,...,...,...,...,...,...,...,...,...
73,9780525573067,The White Book,Han Kang,Deborah Smith,paperback,161,Portobello Books,2018-04-05,2018,14052,3.83
74,9781788160124,The World Goes On,László Krasznahorkai,"Ottilie Mulzet, George Szirtes, and John Batki",paperback,320,Tuskar Rock,2018-05-31,2018,772,3.77
75,9780857055422,Vernon Subutex 1,Virginie Despentes,Frank Wynne,paperback,352,MacLehose Press,2018-03-22,2018,12250,3.89
76,9781999722784,"Die, My Love",Ariana Harwicz,Sarah Moses and Carolina Orloff,paperback,123,Charco Press,2017-09-14,2018,4567,3.53


In [43]:
#Selects the "title" and "author" columns from the "longlist" table
pd.read_sql_query(
"""
SELECT "title", "author" FROM "longlist";
""",
conn
)

Unnamed: 0,title,author
0,Boulder,Eva Baltasar
1,Whale,Cheon Myeong-Kwan
2,The Gospel According to the New World,Maryse Condé
3,Standing Heavy,Gauz
4,Time Shelter,Georgi Gospodinov
...,...,...
73,The White Book,Han Kang
74,The World Goes On,László Krasznahorkai
75,Vernon Subutex 1,Virginie Despentes
76,"Die, My Love",Ariana Harwicz


DISTINCT removes duplicate rows from the result, so each value is returned only once.

In [120]:
#Selects only DISTINCT years in "year" column from "longlist" table
pd.read_sql_query(
"""
SELECT DISTINCT "year" FROM "longlist";
""",
conn
)

Unnamed: 0,year
0,2023
1,2022
2,2021
3,2020
4,2019
5,2018
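
To see the effect of DISTINCT in isolation, here is a minimal sketch. It assumes a three-row in-memory table (titles and years copied from the output above) rather than longlist.db.

```python
import sqlite3

# Three-row in-memory table; a stand-in for longlist.db, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "year" INTEGER)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?)',
    [("Boulder", 2023), ("Whale", 2023), ("The White Book", 2018)],
)

# Without DISTINCT there is one result row per table row
all_years = [row[0] for row in conn.execute('SELECT "year" FROM "longlist"')]

# With DISTINCT the repeated 2023 collapses into a single row
distinct_years = [row[0] for row in conn.execute('SELECT DISTINCT "year" FROM "longlist"')]

print(sorted(all_years))       # [2018, 2023, 2023]
print(sorted(distinct_years))  # [2018, 2023]
conn.close()
```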


#### LIMIT

In [35]:
pd.read_sql_query(
"""
SELECT "title" 
FROM "longlist" 
LIMIT 4;
""",
conn
)

Unnamed: 0,title
0,Boulder
1,Whale
2,The Gospel According to the New World
3,Standing Heavy
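
LIMIT caps the number of rows a query returns. The sketch below assumes a four-row in-memory table (titles copied from the output above) and shows that a limit larger than the table simply returns every row.

```python
import sqlite3

# Four-row in-memory table standing in for longlist.db (an assumption)
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?)',
    [("Boulder",), ("Whale",), ("The Gospel According to the New World",), ("Standing Heavy",)],
)

# LIMIT 2 returns at most two rows
first_two = conn.execute('SELECT "title" FROM "longlist" LIMIT 2').fetchall()

# A limit above the row count is not an error; all four rows come back
up_to_ten = conn.execute('SELECT "title" FROM "longlist" LIMIT 10').fetchall()

print(len(first_two))  # 2
print(len(up_to_ten))  # 4
conn.close()
```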


#### WHERE - Filtering rows with WHERE

WHERE is used to select rows based on a condition; it will output the rows for which the specified condition is true.

Conditions:

- = : equal to
- <> : not equal to
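
As a quick check of the two operators listed above, here is a minimal sketch on an in-memory table with three rows copied from the longlist output (an assumption; it does not read longlist.db).

```python
import sqlite3

# Small in-memory table with three rows, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "format" TEXT)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?)',
    [("Time Shelter", "hardcover"), ("Boulder", "paperback"), ("Whale", "paperback")],
)

# = keeps the rows where the condition is true
equal = conn.execute(
    """SELECT "title" FROM "longlist" WHERE "format" = 'hardcover'"""
).fetchall()

# <> keeps the rows where the values differ (SQLite also accepts != as a synonym)
not_equal = conn.execute(
    """SELECT "title" FROM "longlist" WHERE "format" <> 'hardcover'"""
).fetchall()

print(equal)              # [('Time Shelter',)]
print(sorted(not_equal))  # [('Boulder',), ('Whale',)]
conn.close()
```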

In [45]:
#Selects all books by Fernanda Melchor
pd.read_sql_query(
"""
SELECT "title","author","year" 
FROM "longlist" 
WHERE "author" = 'Fernanda Melchor';
""",
conn
)

Unnamed: 0,title,author,year
0,Paradais,Fernanda Melchor,2022
1,Hurricane Season,Fernanda Melchor,2020


In [44]:
#Selects all authors nominated in 2023
pd.read_sql_query(
"""
SELECT "author","year" 
FROM "longlist" 
WHERE "year" = 2023;

""",
conn
)

Unnamed: 0,author,year
0,Eva Baltasar,2023
1,Cheon Myeong-Kwan,2023
2,Maryse Condé,2023
3,Gauz,2023
4,Georgi Gospodinov,2023
5,Vigdis Hjorth,2023
6,Andrey Kurkov,2023
7,Laurent Mauvignier,2023
8,Clemens Meyer,2023
9,Perumal Murugan,2023


Note that 'paperback' is in single quotes because it is an SQL string and not an identifier.

In [42]:
pd.read_sql_query(
"""
SELECT "title", "format", "year" 
FROM "longlist" 
WHERE "format" <> 'paperback';

""",
conn
)

Unnamed: 0,title,format,year
0,Time Shelter,hardcover,2023
1,Jimi Hendrix Live in Lviv,hardcover,2023
2,The War of the Poor,hardcover,2021
3,When We Cease to Understand the World,hardcover,2021
4,An Inventory of Losses,hardcover,2021
5,At Night All Blood is Black,hardcover,2021
6,I Live in the Slums,hardcover,2021
7,The Perfect Nine,hardcover,2021
8,The Eighth Life,hardcover,2020
9,Tyll,hardcover,2020


#### WHERE + NOT, AND, OR - Compound conditions

In [71]:
pd.read_sql_query(
"""
SELECT "title", "format", "year" 
FROM "longlist" 
WHERE ("year" = 2022 OR "year" = 2023) AND NOT "format" = 'paperback';
""",
conn
)

Unnamed: 0,title,format,year
0,Time Shelter,hardcover,2023
1,Jimi Hendrix Live in Lviv,hardcover,2023
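
The parentheses in the query above matter because AND binds more tightly than OR. The sketch below assumes a made-up five-row table (titles "A" to "E" are hypothetical) and compares the same condition with and without parentheses.

```python
import sqlite3

# Hypothetical rows chosen so that the two queries give different answers
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "format" TEXT, "year" INTEGER)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?, ?)',
    [
        ("A", "hardcover", 2023),
        ("B", "paperback", 2023),
        ("C", "paperback", 2022),
        ("D", "hardcover", 2022),
        ("E", "hardcover", 2020),
    ],
)

# With parentheses: (2022 or 2023) and not paperback
with_parens = sorted(row[0] for row in conn.execute(
    """SELECT "title" FROM "longlist"
       WHERE ("year" = 2022 OR "year" = 2023) AND NOT "format" = 'paperback'"""
))

# Without parentheses: read as 2022 OR (2023 AND NOT paperback),
# so the 2022 paperback "C" slips through
without_parens = sorted(row[0] for row in conn.execute(
    """SELECT "title" FROM "longlist"
       WHERE "year" = 2022 OR "year" = 2023 AND NOT "format" = 'paperback'"""
))

print(with_parens)     # ['A', 'D']
print(without_parens)  # ['A', 'C', 'D']
conn.close()
```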


#### WHERE + Range conditions (>, >=, <, <=, BETWEEN __ AND __) - Filtering numbers

In [61]:
#Find all books between 2022 and 2023 with a range condition
pd.read_sql_query(
"""
SELECT "title", "year" 
FROM "longlist"
WHERE "year" >= 2022 AND "year" <= 2023;
""",
conn
)

Unnamed: 0,title,year
0,Boulder,2023
1,Whale,2023
2,The Gospel According to the New World,2023
3,Standing Heavy,2023
4,Time Shelter,2023
5,Is Mother Dead,2023
6,Jimi Hendrix Live in Lviv,2023
7,The Birthday Party,2023
8,While We Were Dreaming,2023
9,Pyre,2023


In [62]:
#Find all books between 2022 and 2023 with BETWEEN
pd.read_sql_query(
"""
SELECT "title", "year" 
FROM "longlist"
WHERE "year" BETWEEN 2022 AND 2023;
""",
conn
)

Unnamed: 0,title,year
0,Boulder,2023
1,Whale,2023
2,The Gospel According to the New World,2023
3,Standing Heavy,2023
4,Time Shelter,2023
5,Is Mother Dead,2023
6,Jimi Hendrix Live in Lviv,2023
7,The Birthday Party,2023
8,While We Were Dreaming,2023
9,Pyre,2023
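
Both queries return the same rows because BETWEEN includes both endpoints. Here is a minimal sketch of that equivalence, assuming a made-up three-row table (titles "A" to "C" are hypothetical).

```python
import sqlite3

# Hypothetical rows: one year below the range and one on each boundary
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "year" INTEGER)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?)',
    [("A", 2021), ("B", 2022), ("C", 2023)],
)

# Explicit range with >= and <=
explicit = sorted(conn.execute(
    'SELECT "title", "year" FROM "longlist" WHERE "year" >= 2022 AND "year" <= 2023'
).fetchall())

# BETWEEN is inclusive on both ends, so it selects the same rows
between = sorted(conn.execute(
    'SELECT "title", "year" FROM "longlist" WHERE "year" BETWEEN 2022 AND 2023'
).fetchall())

print(explicit)             # [('B', 2022), ('C', 2023)]
print(explicit == between)  # True
conn.close()
```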


In [70]:
#Find all books with a rating of at least 4.08 and more than 10,000 votes
pd.read_sql_query(
"""
SELECT "title", "year","votes","rating"
FROM "longlist"
WHERE "rating" >= 4.08 AND "votes"> 10000;
""",
conn
)

Unnamed: 0,title,year,votes,rating
0,When We Cease to Understand the World,2021,23251,4.14
1,The Eighth Life,2020,16350,4.52
2,Hurricane Season,2020,22551,4.08
3,The Years,2019,16888,4.18


#### WHERE + IS NULL / IS NOT NULL - Missing data

Tables may have missing data. NULL is a special value used to indicate that a piece of data has no value or does not exist in the table. Because NULL is not an ordinary value, we test for it with IS NULL and IS NOT NULL rather than with = or <>.

In [47]:
pd.read_sql_query(
"""
SELECT "title", "year" 
FROM "longlist" 
WHERE "translator" IS NULL;
""",
conn
)

Unnamed: 0,title,year
0,The Perfect Nine,2021
1,The Enlightenment of The Greengage Tree,2020


In [49]:
pd.read_sql_query(
"""
SELECT "title", "year" 
FROM "longlist" 
WHERE "translator" IS NOT NULL;
""",
conn
)

Unnamed: 0,title,year
0,Boulder,2023
1,Whale,2023
2,The Gospel According to the New World,2023
3,Standing Heavy,2023
4,Time Shelter,2023
...,...,...
71,The White Book,2018
72,The World Goes On,2018
73,Vernon Subutex 1,2018
74,"Die, My Love",2018
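
Missing values are tested with IS NULL rather than = because any comparison with NULL is itself NULL, which WHERE treats as not true. A minimal sketch, assuming an in-memory table with two rows copied from the outputs above (not the real longlist.db):

```python
import sqlite3

# Two rows for illustration; Python's None is stored as SQL NULL
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "translator" TEXT)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?)',
    [("The Perfect Nine", None), ("Whale", "Jae Won Chung")],
)

# Comparing with = NULL never evaluates to true, so no rows match
equals_null = conn.execute(
    'SELECT "title" FROM "longlist" WHERE "translator" = NULL'
).fetchall()

# IS NULL is the test that actually finds missing values
is_null = conn.execute(
    'SELECT "title" FROM "longlist" WHERE "translator" IS NULL'
).fetchall()

print(equals_null)  # []
print(is_null)      # [('The Perfect Nine',)]
conn.close()
```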


#### WHERE + LIKE - Filtering text

LIKE selects rows whose text matches a specified pattern rather than an exact string.

LIKE is combined with two wildcards: % (matches any sequence of zero or more characters) and _ (matches exactly one character).

In [90]:
#Selects titles that contain “love”
pd.read_sql_query(
"""
SELECT "title"
FROM "longlist"
WHERE "title" LIKE '%love%';
""",
conn
)

Unnamed: 0,title
0,Love in the Big City
1,More Than I Love My Life
2,Love in the New Millennium
3,"Die, My Love"


% matches 0 or more characters, so this query would match book titles that have 0 or more characters before and after “love” — that is, titles that contain “love”.

In [82]:
#Finds all books that begin with "A" (includes "A New","An","After", etc.)
pd.read_sql_query(
"""
SELECT "title"
FROM "longlist"
WHERE "title" LIKE 'A%';
""",
conn
)

Unnamed: 0,title
0,A System So Magnificent It Is Blinding
1,A New Name: Septology VI-VII
2,After the Sun
3,An Inventory of Losses
4,At Night All Blood is Black
5,At Dusk


In [89]:
#Finds all books whose first word is "A" plus exactly one more character, e.g. "An" or "At" (excludes "A System...", "After...", etc.)
pd.read_sql_query(
"""
SELECT "title"
FROM "longlist"
WHERE "title" LIKE 'A_ %';
""",
conn
)

Unnamed: 0,title
0,An Inventory of Losses
1,At Night All Blood is Black
2,At Dusk


In [87]:
#Finds all books whose whole title is exactly 2 characters long and begins with "A" (no title matches, so the result is empty)
pd.read_sql_query(
"""
SELECT "title"
FROM "longlist"
WHERE "title" LIKE 'A_';
""",
conn
)

Unnamed: 0,title


Is the comparison of strings case-sensitive in SQL?

- In SQLite, LIKE is case-insensitive by default (for ASCII characters), whereas = is case-sensitive. (Note that in other DBMSs, the configuration of your database can change this!)

In [97]:
#LIKE ignores case, so 'pyre' matches "Pyre"
pd.read_sql_query(
"""
SELECT "title"
FROM "longlist"
WHERE "title" LIKE 'pyre';
""",
conn
)

Unnamed: 0,title
0,Pyre


In [98]:
#= is case-sensitive, so 'pyre' does not match "Pyre"
pd.read_sql_query(
"""
SELECT "title"
FROM "longlist"
WHERE "title"='pyre';
""",
conn
)

Unnamed: 0,title
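
The same comparison can be made without any table by evaluating both operators as expressions. This is a minimal sketch assuming SQLite's default settings; a true comparison comes back as 1 and a false one as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Evaluate both comparisons directly; no table is needed
like_result, equals_result = conn.execute(
    "SELECT 'Pyre' LIKE 'pyre', 'Pyre' = 'pyre'"
).fetchone()

print(like_result)    # 1 (LIKE ignores case for ASCII letters by default)
print(equals_result)  # 0 (= compares the exact characters)
conn.close()
```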


#### ORDER BY + DESC, ASC

- ORDER BY sorts in ascending order by default. Numbers are sorted from smallest to largest and text is sorted alphabetically.
- We use the SQL keyword DESC to specify descending order; ASC can be used to explicitly specify ascending order.
- We can include a second column as a tie-break, and so on.

In [115]:
# First 4 books ordered alphabetically
pd.read_sql_query(
"""
SELECT "title" 
FROM "longlist" 
ORDER BY "title"
LIMIT 4;
""",
conn
)

Unnamed: 0,title
0,A New Name: Septology VI-VII
1,A System So Magnificent It Is Blinding
2,After the Sun
3,An Inventory of Losses


In [116]:
# Bottom 3 books by rating
pd.read_sql_query(
"""
SELECT "title", "rating", "year" 
FROM "longlist" 
ORDER BY "rating" LIMIT 3;
""",
conn
)

Unnamed: 0,title,rating,year
0,The Gospel According to the New World,3.05,2023
1,The Pine Islands,3.16,2019
2,Love in the New Millennium,3.17,2019


In [117]:
pd.read_sql_query(
"""
SELECT "title", "rating", "votes", "year" 
FROM "longlist" 
WHERE "rating" BETWEEN 4.10 AND 4.14
ORDER BY "rating" DESC, "votes" DESC;
""",
conn
)

Unnamed: 0,title,rating,votes,year
0,When We Cease to Understand the World,4.14,23251,2021
1,Still Born,4.14,7647,2023
2,Elena Knows,4.1,8212,2022
3,The Flying Mountain,4.1,323,2018


In [111]:
pd.read_sql_query(
"""
SELECT "title", "rating", "votes", "year" 
FROM "longlist" 
WHERE "rating" BETWEEN 4.10 AND 4.14
ORDER BY "rating" DESC, "votes";
""",
conn
)

Unnamed: 0,title,rating,votes,year
0,Still Born,4.14,7647,2023
1,When We Cease to Understand the World,4.14,23251,2021
2,The Flying Mountain,4.1,323,2018
3,Elena Knows,4.1,8212,2022
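
The ordering rules in this section can also be checked without longlist.db. The sketch below assumes an in-memory table holding the four rows shown in the output above and uses plain sqlite3.

```python
import sqlite3

# The four rows from the output above, loaded into an in-memory table
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "rating" REAL, "votes" INTEGER)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?, ?)',
    [
        ("When We Cease to Understand the World", 4.14, 23251),
        ("Still Born", 4.14, 7647),
        ("Elena Knows", 4.1, 8212),
        ("The Flying Mountain", 4.1, 323),
    ],
)

# Text sorts alphabetically, ascending by default
by_title = [row[0] for row in conn.execute(
    'SELECT "title" FROM "longlist" ORDER BY "title"'
)]

# Highest rating first; ties are broken by fewest votes first
by_rating_then_votes = [row[0] for row in conn.execute(
    'SELECT "title" FROM "longlist" ORDER BY "rating" DESC, "votes" ASC'
)]

print(by_title[0])              # Elena Knows
print(by_rating_then_votes[0])  # Still Born
conn.close()
```

Each column in the ORDER BY list carries its own direction, so DESC on "rating" does not affect how "votes" is sorted.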


#### AVG, MIN, MAX, and SUM - Aggregate Functions

COUNT, AVG, MIN, MAX, and SUM are called aggregate functions and allow us to perform the corresponding operations over multiple rows of data. 

Each of the following aggregate functions will return ONLY a SINGLE output—the aggregated value.

In [121]:
pd.read_sql_query(
"""
SELECT MIN("rating"),AVG("rating"),MAX("rating"),SUM("votes")
FROM "longlist";
""",
conn
)


Unnamed: 0,"MIN(""rating"")","AVG(""rating"")","MAX(""rating"")","SUM(""votes"")"
0,3.05,3.753718,4.52,604173
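
To confirm that aggregates collapse many rows into one, here is a minimal sketch on an in-memory table with three rows copied from the longlist output (an assumption; the numbers differ from the full table above).

```python
import sqlite3

# Three rows for illustration, not the full longlist
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "rating" REAL, "votes" INTEGER)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?, ?)',
    [("Boulder", 3.77, 2779), ("Whale", 3.97, 175), ("Time Shelter", 4.05, 3142)],
)

# Three input rows, but the aggregates produce exactly one output row
rows = conn.execute(
    'SELECT MIN("rating"), AVG("rating"), MAX("rating"), SUM("votes") FROM "longlist"'
).fetchall()
lowest, average, highest, total_votes = rows[0]

print(len(rows))        # 1
print(lowest, highest)  # 3.77 4.05
print(total_votes)      # 6096
conn.close()
```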


Would using MAX with the title column give you the longest book title?

- No, using MAX with the title column would give you the “largest” (or in this case, last) title alphabetically. Similarly, MIN will give the first title alphabetically.
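
The sketch below illustrates this on three titles from the longlist, loaded into an in-memory table. It also shows one possible way to get the longest title, using SQLite's LENGTH function, which is not otherwise covered in this section.

```python
import sqlite3

# Three titles for illustration, not the full longlist
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?)',
    [("Whale",), ("Boulder",), ("The Gospel According to the New World",)],
)

# MAX and MIN on text compare alphabetically, not by length
max_title = conn.execute('SELECT MAX("title") FROM "longlist"').fetchone()[0]
min_title = conn.execute('SELECT MIN("title") FROM "longlist"').fetchone()[0]

# One way to find the longest title: sort by character count, longest first
longest_title = conn.execute(
    'SELECT "title" FROM "longlist" ORDER BY LENGTH("title") DESC LIMIT 1'
).fetchone()[0]

print(max_title)      # Whale
print(min_title)      # Boulder
print(longest_title)  # The Gospel According to the New World
conn.close()
```

Here MAX returns "Whale", which happens to be the shortest of the three titles.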

##### ROUND - Rounds the result

To round the average rating to 2 decimal places, we wrap AVG in ROUND:

In [122]:
pd.read_sql_query(
    """
    SELECT ROUND(AVG("rating"), 2)
    FROM "longlist";
    """,
    conn
)


Unnamed: 0,"ROUND(AVG(""rating""), 2)"
0,3.75


##### AS - Renames column with AS

AS renames the column in which the results are displayed:

In [123]:
pd.read_sql_query(
    """
    SELECT ROUND(AVG("rating"), 2) AS "Average Rating"
    FROM "longlist";
    """,
    conn
)


Unnamed: 0,Average Rating
0,3.75
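
The alias is what the database reports back as the column name, which is why pandas displays it as the header. A minimal sketch using plain sqlite3, assuming an in-memory table with three ratings copied from the longlist output:

```python
import sqlite3

# Three ratings for illustration; the average differs from the full table
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "rating" REAL)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?)',
    [("Boulder", 3.77), ("Whale", 3.97), ("The Gospel According to the New World", 3.05)],
)

cursor = conn.execute(
    'SELECT ROUND(AVG("rating"), 2) AS "Average Rating" FROM "longlist"'
)

# cursor.description holds one entry per result column; its first item is the name
column_name = cursor.description[0][0]
average_rating = cursor.fetchone()[0]

print(column_name)     # Average Rating
print(average_rating)  # 3.6
conn.close()
```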


#### COUNT (does not count NULL values and may include duplicates) - Aggregate Functions

To count every row in the table, we use COUNT(*).

In [124]:
pd.read_sql_query(
    """
    SELECT COUNT(*)
    FROM "longlist";
    """,
    conn
)

Unnamed: 0,COUNT(*)
0,78


Counting the translator column gives a smaller number than the number of rows in the table. This is because COUNT applied to a column does not count NULL values.

In [125]:
pd.read_sql_query(
    """
    SELECT COUNT("translator")
    FROM "longlist";
    """,
    conn
)

Unnamed: 0,"COUNT(""translator"")"
0,76


DISTINCT can be used inside COUNT to ensure that only distinct values are counted.

In [127]:
pd.read_sql_query(
    """
    SELECT COUNT(DISTINCT "publisher")
    FROM "longlist";
    """,
    conn
)

Unnamed: 0,"COUNT(DISTINCT ""publisher"")"
0,33
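
The three forms of COUNT can be compared on one small table. This sketch assumes an in-memory table with three rows copied from the outputs above: two books share a translator and one has none.

```python
import sqlite3

# Three rows for illustration; Python's None is stored as SQL NULL
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "longlist" ("title" TEXT, "translator" TEXT)')
conn.executemany(
    'INSERT INTO "longlist" VALUES (?, ?)',
    [
        ("Standing Heavy", "Frank Wynne"),
        ("Vernon Subutex 1", "Frank Wynne"),
        ("The Perfect Nine", None),
    ],
)

count_rows, count_values, count_distinct = conn.execute(
    'SELECT COUNT(*), COUNT("translator"), COUNT(DISTINCT "translator") FROM "longlist"'
).fetchone()

print(count_rows)      # 3 (every row)
print(count_values)    # 2 (the NULL translator is skipped)
print(count_distinct)  # 1 (the repeated translator is counted once)
conn.close()
```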


In [4]:
conn.close()