DuckDB in JupyterLab - Geonames data
====================================

References:
* DuckDB home page: https://duckdb.org/
  + Documentation: https://duckdb.org/docs/
* DuckDB project on GitHub: https://github.com/duckdb/duckdb
* [GitHub - KS / Cheat Sheets - DuckDB (this project)](https://github.com/data-engineering-helpers/ks-cheat-sheets/tree/main/db/duckdb)
  + [GitHub - KS / Cheat Sheets - DuckDB - Jupyter notebook (this notebook)](https://github.com/data-engineering-helpers/ks-cheat-sheets/blob/main/db/duckdb/ipython-notebooks/duckdb-geonames-basic.ipynb)

In [1]:
import duckdb

# Open (or create) the on-disk database in read-write mode
conn = duckdb.connect(database='../db.duckdb', read_only=False)

In [17]:
# Data dir
geoname_base_dir: str = "../data"
geoname_csv_dir: str = f"{geoname_base_dir}/csv"
geoname_pqt_dir: str = f"{geoname_base_dir}/parquet"

In [18]:
from pathlib import Path

def searching_all_files(directory: Path) -> list[Path]:
    """Recursively collect every file found under `directory`."""
    file_list: list[Path] = []  # Files found so far, as a flat list

    for x in directory.iterdir():
        if x.is_file():
            file_list.append(x)
        elif x.is_dir():
            # `x` already includes `directory`, so recurse on it directly
            # and flatten the result into the same list
            file_list.extend(searching_all_files(x))

    return file_list


print(searching_all_files(Path(geoname_base_dir)))

[PosixPath('../data/csv/.gitkeep'), PosixPath('../data/csv/allCountries.txt'), PosixPath('../data/csv/alternateNames.txt'), PosixPath('../data/parquet/.gitkeep'), PosixPath('../data/parquet/geonames.parquet'), PosixPath('../data/parquet/alternateNames.parquet'), PosixPath('../data/parquet/allCountries.parquet')]
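
As an alternative to the hand-written recursion above, `pathlib` can walk the tree itself. The following is a minimal sketch relying only on the standard library; `list_files` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def list_files(directory: Path) -> list[Path]:
    """Return every regular file under `directory`, as a flat, sorted list."""
    # rglob("*") walks the tree recursively; is_file() filters out directories
    return sorted(p for p in directory.rglob("*") if p.is_file())


# Hypothetical usage with the notebook's data directory:
# print(list_files(Path(geoname_base_dir)))
```

Sorting makes the listing independent of the order in which the file system returns entries.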


In [19]:
# allCountries
geoname_allctry_fn: str = "allCountries"
geoname_allctry_csv: str = f"{geoname_csv_dir}/{geoname_allctry_fn}.txt"
geoname_allctry_pqt: str = f"{geoname_pqt_dir}/{geoname_allctry_fn}.parquet"

# Alternate names
geoname_altname_fn: str = "alternateNames"
geoname_altname_csv: str = f"{geoname_csv_dir}/{geoname_altname_fn}.txt"
geoname_altname_pqt: str = f"{geoname_pqt_dir}/{geoname_altname_fn}.parquet"

def countRows():
    """
    Sanity check: return the row count, in millions, of each of the
    allcountries, altnames and geonames tables
    """
    count_query: str = """
    select count(*)/1e6 as nb from allcountries
    union all
    select count(*)/1e6 as nb from altnames
    union all
    select count(*)/1e6 as nb from geonames
    """

    nb_list = conn.execute(count_query).fetchall()
    return nb_list

def getNCErows():
    """
    Retrieve all the records featuring NCE as the IATA code
    """
    geoname_nce_query: str = "select * from geonames where isoLanguage='iata' and alternateName='NCE'"
    nce_recs = conn.execute(geoname_nce_query).fetchall()
    return nce_recs

def getLILrows():
    """
    Retrieve all the records featuring LIL as the IATA code
    """
    geoname_lil_query: str = "select * from geonames where isoLanguage='iata' and alternateName='LIL'"
    lil_recs = conn.execute(geoname_lil_query).fetchall()
    return lil_recs
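
`getNCErows()` and `getLILrows()` differ only in the IATA code. One possible generalisation passes the code as a bound parameter through the `?` placeholder accepted by `conn.execute()`. This is a sketch, assuming the `geonames` table and its `isoLanguage` and `alternateName` columns exist as in the queries above. The name `get_rows_by_iata` is hypothetical.

```python
def get_rows_by_iata(conn, iata_code: str) -> list:
    """
    Retrieve all the records featuring `iata_code` as the IATA code.

    `conn` is expected to be an open connection, such as the one
    returned by duckdb.connect().
    """
    query = "select * from geonames where isoLanguage = 'iata' and alternateName = ?"
    # The value is bound to the `?` placeholder rather than pasted into the SQL
    return conn.execute(query, [iata_code]).fetchall()


# Hypothetical usage with the notebook's connection:
# nce_recs = get_rows_by_iata(conn, "NCE")
# lil_recs = get_rows_by_iata(conn, "LIL")
```

Binding the value avoids one near-identical function per airport code and keeps quoting out of the SQL string.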

In [20]:
nb_list = countRows()
print(f"Number of rows (in millions): {nb_list}")

Number of rows (in millions): [(12.400722,), (15.893953,), (15.893953,)]
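
The bare tuples do not say which count belongs to which table. As a small sketch, the counts can be paired with the table names. It assumes the rows come back in the order of the three sub-selects, which `union all` does not strictly guarantee. `label_counts` is a hypothetical helper.

```python
def label_counts(table_names: list, rows: list) -> dict:
    """Pair each table name with the first column of the matching row."""
    if len(table_names) != len(rows):
        raise ValueError("expected one row per table")
    return {name: row[0] for name, row in zip(table_names, rows)}


# Values copied from the output above (row counts in millions)
nb_list = [(12.400722,), (15.893953,), (15.893953,)]
print(label_counts(["allcountries", "altnames", "geonames"], nb_list))
```

A sturdier variant would select the table name as a literal column next to each count, so that the pairing no longer depends on row order.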


In [21]:
nce_recs = getNCErows()
print("List of records featuring NCE as the IATA code:")
for nce_rec in nce_recs:
    print(nce_rec)

List of records featuring NCE as the IATA code:
(2990440, 'Nice', 'Nice', 'NCE,Nica,Nicaea,Nicc,Nicca,Niccae,Nice,Nicea,Nico,Nikaia,Nis,Nisa,Nissa,Nissa Maritima,Nissa Marìtima,Nitza,Niza,Nizza,Niça,Nìsa,ni si,nis,nisa,niseu,nisu,nitsa,nys,Νίκαια,Ница,Ниццæ,Ницца,Ніца,Ніцца,Նիս,ניס,نيس,نیس,नीस,নিস,ਨੀਸ,நீஸ்,నీస్,นิส,ნიცა,ニース,尼斯,니스', 43.70313, 7.26608, 'P', 'PPLA2', 'FR', None, '93', '06', '062', '06088', 342669, 25, 18, 'Europe/Paris', datetime.date(2023, 2, 13), 'data/csv/allCountries.txt', 6634025, 2990440, 'iata', 'NCE', None, None, None, None, 'data/csv/alternateNames.txt')
(6299418, "Nice Côte d'Azur International Airport", "Nice Cote d'Azur International Airport", "Aehroport Nicca Lazurnyj Bereg,Aeroport de Nice Cote d'Azur,Aéroport de Nice Côte d'Azur,Flughafen Nizza,LFMN,NCE,Nice Airport,Nice Cote d'Azur International Airport,Nice Côte d'Azur International Airport,Nice flygplats,Niza Aeropuerto,frwdgah nys kwt dazwr,koto・dajuru kong gang,mtar nys alryfyra alfrnsy,ni si lan se ha

In [22]:
lil_recs = getLILrows()
print("List of records featuring LIL as the IATA code:")
for lil_rec in lil_recs:
    print(lil_rec)

List of records featuring LIL as the IATA code:
(2998324, 'Lille', 'Lille', "Insula,LIL,Lil,Lil',Lila,Lile,Lilis,Lill,Lill',Lilla,Lille,Lilo,Rijsel,Risel,Rysel,li er,lil,lil.,lila,lili,lly,lyl,riru,Λιλ,Лил,Лилль,Лілль,Ліль,Լիլ,ליל,للی,ليل,لیل,लील,ਲੀਲ,லீல்,ลีล,ლილი,リール,里尔,里爾,릴", 50.63297, 3.05858, 'P', 'PPLA', 'FR', None, '32', '59', '595', '59350', 234475, None, 27, 'Europe/Paris', datetime.date(2022, 8, 14), 'data/csv/allCountries.txt', 7486646, 2998324, 'iata', 'LIL', None, None, None, None, 'data/csv/alternateNames.txt')
(6299448, 'Lille Airport', 'Lille Airport', "Aeroport Lill',Aeroport de Lille-Lesquin,Aeropuerto de Lille-Lesquin,Aéroport de Lille-Lesquin,Flughafen Lille,LFQQ,LIL,Lille - Lesquin Uluslararasi Havalimani,Lille - Lesquin Uluslararası Havalimanı,Lille Airport,Lille-Lesquin Airport,Luchthaven Lille-Lesquin,frwdgah lyl,li er ji chang,mtar lyl,riru kong gang,Аеропорт Лілль,فرودگاه لیل,مطار ليل,リール空港,里爾機場", 50.57037, 3.10643, 'S', 'AIRP', 'FR', None, '32', '59', '595', '

Geonames POR (points of reference):
* Nice airport: https://www.geonames.org/6299418
* Lille airport: https://www.geonames.org/6299448