<a href="https://colab.research.google.com/github/UniVR-DH/DBMS-course/blob/main/notebooks/lab03-duckdb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL exercise with DuckDB in Jupyter Notebooks
In this notebook we use DuckDB as the DBMS, together with a few plugins that make it easier to run SQL queries.

## Library Import and Configuration

In [1]:
!pip install --quiet duckdb
!pip install --quiet jupysql
!pip install --quiet duckdb-engine
!pip install --quiet pandas


In [2]:
import duckdb
import pandas as pd
# Load the jupysql Jupyter extension, which provides SQL cells;
# this avoids having to wrap every SQL query in Python code
%load_ext sql

**We configure jupysql to return query results as pandas DataFrames and to produce less verbose output**

In [3]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

## Initialize the Database

In [4]:
# Run DuckDB in main memory; remember to export the database to file if you want to keep it
%sql duckdb:///:memory:
# If we want to save the DB to file we can use the following,
# but then we need to remember to download the file
# %sql duckdb:///myfile.db

An entire Jupyter cell can be used as a SQL cell by placing `%%sql` at the start of the cell. Query results are displayed as a pandas DataFrame.

In [5]:
%%sql
SELECT 1=2 as test, 'Hello people' as message, 3*12345 as math  ;

Unnamed: 0,test,message,math
0,False,Hello people,37035


**We can use any CSV file**: we can either upload it to Jupyter or download it from the web.

In [6]:
!wget https://gist.github.com/mosesvemana/f9868d6d2980b39bf8bf5287a28c7d21/raw/d6ba88f7952370582ecc206d47c4fd0d5448ae20/reviews.csv

--2024-11-08 09:52:55--  https://gist.github.com/mosesvemana/f9868d6d2980b39bf8bf5287a28c7d21/raw/d6ba88f7952370582ecc206d47c4fd0d5448ae20/reviews.csv
Resolving gist.github.com (gist.github.com)... 20.27.177.113
Connecting to gist.github.com (gist.github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/mosesvemana/f9868d6d2980b39bf8bf5287a28c7d21/raw/d6ba88f7952370582ecc206d47c4fd0d5448ae20/reviews.csv [following]
--2024-11-08 09:52:55--  https://gist.githubusercontent.com/mosesvemana/f9868d6d2980b39bf8bf5287a28c7d21/raw/d6ba88f7952370582ecc206d47c4fd0d5448ae20/reviews.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1277892 (1.2M) [text/plain]
Saving to: ‘

In [7]:
%%sql
SELECT * FROM read_csv('reviews.csv') LIMIT 10 ;

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...
5,7202016,43979139,2015-08-23,1154501,Barent,"Kelly was great, place was great, just what I ..."
6,7202016,45265631,2015-09-01,37853266,Kevin,Kelly was great! Very nice and the neighborhoo...
7,7202016,46749120,2015-09-13,24445447,Rick,hola all bnb erz - Just left Seattle where I h...
8,7202016,47783346,2015-09-21,249583,Todd,Kelly's place is conveniently located on a qui...
9,7202016,48388999,2015-09-26,38110731,Tatiana,"The place was really nice, clean, and the most..."


## Move some data inside a table

In [8]:
%%sql
CREATE TABLE reviewer (
    rid BIGINT PRIMARY KEY,
    rname VARCHAR(255)
);

INSERT INTO reviewer (rid, rname)
    SELECT DISTINCT reviewer_id AS rid, reviewer_name AS rname
    FROM read_csv('reviews.csv');

Unnamed: 0,Success


In [9]:
%%sql
SELECT * FROM reviewer ORDER BY rid LIMIT 10;

Unnamed: 0,rid,rname
0,2543,Mike And Fabian
1,9763,Taylor
2,12793,Kelly
3,15174,Scott
4,17196,Kawika
5,19457,Ron
6,26098,Jonathan
7,37709,Seh
8,38157,Annie
9,41555,Craig


In [11]:
%%sql
DROP TABLE IF EXISTS review;
CREATE TABLE review (
    review_id BIGINT PRIMARY KEY,
    apartment_id BIGINT,
    reviewer_id BIGINT,
    date DATE,
    comment TEXT
);

INSERT INTO review (review_id, apartment_id, reviewer_id, date, comment)
    SELECT  id, listing_id, reviewer_id, date, comments
    FROM read_csv('reviews.csv');

Unnamed: 0,Success


In [12]:
%%sql
SELECT COUNT(*) FROM review;

Unnamed: 0,count_star()
0,3044


In [13]:
%%sql
SELECT * FROM review LIMIT 10;

Unnamed: 0,review_id,apartment_id,reviewer_id,date,comment
0,38917982,7202016,28943674,2015-07-19,Cute and cozy place. Perfect location to every...
1,39087409,7202016,32440555,2015-07-20,Kelly has a great room in a very central locat...
2,39820030,7202016,37722850,2015-07-26,"Very spacious apartment, and in a great neighb..."
3,40813543,7202016,33671805,2015-08-02,Close to Seattle Center and all it has to offe...
4,41986501,7202016,34959538,2015-08-10,Kelly was a great host and very accommodating ...
5,43979139,7202016,1154501,2015-08-23,"Kelly was great, place was great, just what I ..."
6,45265631,7202016,37853266,2015-09-01,Kelly was great! Very nice and the neighborhoo...
7,46749120,7202016,24445447,2015-09-13,hola all bnb erz - Just left Seattle where I h...
8,47783346,7202016,249583,2015-09-21,Kelly's place is conveniently located on a qui...
9,48388999,7202016,38110731,2015-09-26,"The place was really nice, clean, and the most..."


In [17]:
%%sql
SELECT date, comment
FROM review
WHERE comment LIKE '%pool%';

Unnamed: 0,date,comment
0,2015-05-26,Kirsten's home is lovely. We had access to a n...
1,2014-05-28,The apartment was very nice and luxurious. It ...
2,2015-05-21,"Jordan & Stay Alfred provided thorough, detail..."
3,2015-11-27,We had a great trip and loved this condo and l...
4,2014-11-24,The room was as described and the view was won...


In [18]:
%%sql
SELECT COUNT(*)
FROM review
WHERE date BETWEEN  '2015-07-01' AND  '2015-07-31';

Unnamed: 0,count_star()
0,231


SQL provides functions for processing the values stored in columns.
For example, `strftime` formats date/time values and can be used to extract parts of a date, such as the year. See:

https://duckdb.org/docs/sql/functions/dateformat.html

In [19]:
%%sql
SELECT strftime('%Y', date) AS review_year, COUNT(*) AS review_count
FROM review
GROUP BY review_year
ORDER BY review_year DESC;

Unnamed: 0,review_year,review_count
0,2016,6
1,2015,1932
2,2014,650
3,2013,234
4,2012,183
5,2011,39


## Exercises

### Q1: Find the number of different apartments with a review

### Q2: Find the date of the first review written

### Q3: Find the number of apartments with more than 1 review

### Q4: Find the names of all reviewers with more than 3 reviews

__Can you use a nested query?__

### Q5: Find the user that has written the largest number of reviews

### Q6: Find the top-5 apartments with the largest number of reviews


## We can export to file and also load from file
**Note:** DuckDB exports three kinds of files:

1. A file to create the schema
2. One or more files containing the data
3. A file to load the data

In [20]:
%%sql
EXPORT DATABASE 'reviews_db';

Unnamed: 0,Success


In [21]:
%%sql
DROP TABLE IF EXISTS reviewer;
DROP TABLE IF EXISTS review;
IMPORT DATABASE 'reviews_db';

Unnamed: 0,Success
