<a href="https://colab.research.google.com/github/ad17171717/YouTube-Tutorials/blob/main/Google%20Colab%20Tutorials/Google_Colab_%2B_DuckDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DuckDB**

**DuckDB is an open-source relational database management system. DuckDB has no external dependencies; the entire source tree for DuckDB is compiled into two files. This structure simplifies deployment and integration into other build processes. Like SQLite, DuckDB is self-contained and runs in-process, so it can be run on Google Colab. Other database management systems, such as MongoDB and Postgres, use a client-server architecture, which requires access to ports. However, Google Colab's environment is managed and restricts access to ports.**

**DuckDB automatically infers data types. If you read in a table of integers, DuckDB will automatically detect the integer type and you do not need to explicitly set the data type.**

**DuckDB was built with data science in mind and can work with both Python and R. For example, the DuckDB Python package can run queries directly on `pandas` data without importing or copying any data.**

**DuckDB can work with a variety of file types and databases through extensions and connections including: CSV, JSON, Parquet, PostgreSQL, MySQL and SQLite to name a few.**

**DuckDB can operate in both persistent mode, where the data is saved to disk, and in in-memory mode, where the entire data set is stored in a machine's memory.**

<sup>Source: [Data Sources](https://duckdb.org/docs/data/data_sources) from duckdb.org</sup>

In [1]:
import duckdb
import pandas as pd

## **Downloading a Dataset**

In [2]:
#download data
!wget https://archive.ics.uci.edu/static/public/235/individual+household+electric+power+consumption.zip
!unzip /content/individual+household+electric+power+consumption.zip

--2025-01-03 09:07:19--  https://archive.ics.uci.edu/static/public/235/individual+household+electric+power+consumption.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘individual+household+electric+power+consumption.zip’

individual+househol     [     <=>            ]  19.68M  23.5MB/s    in 0.8s    

2025-01-03 09:07:20 (23.5 MB/s) - ‘individual+household+electric+power+consumption.zip’ saved [20640916]

Archive:  /content/individual+household+electric+power+consumption.zip
  inflating: household_power_consumption.txt  


In [3]:
#check data
!head /content/household_power_consumption.txt

Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000


In [4]:
#save path to file
file_path = '/content/household_power_consumption.txt'

## **Creating an in-memory DuckDB Database**

**DuckDB can operate in in-memory mode. In most clients, this can be activated by passing the special value `:memory:` as the database file or by omitting the database file argument. In in-memory mode, no data is persisted to disk, so all data is lost when the process finishes.**

<sup>Source: [Connect](https://duckdb.org/docs/connect/overview.html) from duckdb.org</sup>

In [5]:
#create connection to an in-memory DuckDB database
conn1 = duckdb.connect(database=':memory:')

#load dataset into DuckDB
conn1.execute(f'''
    CREATE TABLE power_consumption AS
    SELECT * FROM read_csv_auto('{file_path}', delim=';')
''');

In [6]:
#retrieve the first row from the dataset
result = conn1.execute('SELECT * FROM power_consumption LIMIT 1').fetchall()
print(result)

[(datetime.date(2006, 12, 16), datetime.time(17, 24), '4.216', '0.418', '234.840', '18.400', '0.000', '1.000', 17.0)]


## **Creating a Persistent DuckDB Database**

**To create or open a persistent database, set the path of the database file (for example `my_database.duckdb`) when creating the connection. This path can point to an existing database or to a file that does not yet exist; DuckDB will open or create a database at that location as needed. The file may have an arbitrary extension, but `.db` and `.duckdb` are two common choices, with `.ddb` also used sometimes.**

In [7]:
#create persistent DuckDB connection
conn2 = duckdb.connect(database='household_power_consumption.duckdb')

#load dataset into DuckDB
conn2.execute(f'''
    CREATE TABLE power_consumption AS
    SELECT * FROM read_csv_auto('{file_path}', delim=';')
''');

In [8]:
#retrieve the first row from the dataset
result = conn2.execute('SELECT * FROM power_consumption LIMIT 1').fetchall()
print(result)

[(datetime.date(2006, 12, 16), datetime.time(17, 24), '4.216', '0.418', '234.840', '18.400', '0.000', '1.000', 17.0)]


In [9]:
#close the connection
conn2.close()

## **`pandas` and DuckDB**

**DuckDB was designed to work with the `pandas` module. A DuckDB table can be read into a `pandas` DataFrame, and a `pandas` DataFrame can be written to a DuckDB database.**

### **Querying a `pandas` DataFrame with DuckDB**

**DuckDB allows querying pandas DataFrames directly using SQL without needing to convert them into tables.**

In [10]:
#create a pandas DataFrame; "?" marks missing values in this dataset
pandas_df = pd.read_csv(file_path, low_memory=False, delimiter=';', na_values=['?'])
pandas_df.head()

| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16/12/2006 | 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 16/12/2006 | 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 16/12/2006 | 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 16/12/2006 | 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 16/12/2006 | 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


In [11]:
result = duckdb.query('SELECT AVG(Global_reactive_power) FROM pandas_df').fetchall()
print(result)

[(0.12371447630389587,)]


### **Reading a DuckDB Database into a `pandas` DataFrame**

In [12]:
#connect to newly created DuckDB database
conn3 = duckdb.connect(database='household_power_consumption.duckdb')

#query database and fetch results into a pandas DataFrame
df_from_duck = conn3.execute("SELECT * FROM power_consumption").df()

In [13]:
#check the columns and first 5 rows of the pandas DataFrame
df_from_duck.head()

| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-12-16 | 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 2006-12-16 | 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 2006-12-16 | 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 2006-12-16 | 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 2006-12-16 | 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


In [14]:
conn3.close()

### **Writing a `pandas` DataFrame to a DuckDB database**

In [15]:
#create persistent DuckDB connection
conn4 = duckdb.connect(database='household_from_pandas.duckdb')
#write DataFrame to a DuckDB table
conn4.execute('CREATE TABLE IF NOT EXISTS pandas_power_consumption AS SELECT * FROM pandas_df')

<duckdb.duckdb.DuckDBPyConnection at 0x7e14f2b58eb0>

In [16]:
result = conn4.execute('SELECT * FROM pandas_power_consumption LIMIT 1').fetchall()
print(result)

[('16/12/2006', '17:24:00', 4.216, 0.418, 234.84, 18.4, 0.0, 1.0, 17.0)]


## **Statistical Queries with DuckDB**

**DuckDB provides statistical aggregate functions, such as `skewness`, `corr` and `mad`, that can be run on a dataset directly in SQL.**

In [17]:
#compute the skewness of a given column of data
result = duckdb.query('''
    SELECT skewness(Global_active_power) AS skewness
    FROM pandas_df
''').fetchall()

print(result)

[(1.7862333920876787,)]


In [18]:
#compute correlations between variables
result = duckdb.query('''
    SELECT
        corr(Global_active_power, Voltage) AS power_voltage_corr
    FROM pandas_df
''').fetchall()

print(result)

[(-0.3997616096291052,)]


In [19]:
#compute the median absolute deviation for a column
result = duckdb.query('''
    SELECT mad(Global_active_power) AS mad
    FROM pandas_df
''').fetchall()

print(result)

[(0.396,)]


# **References and Additional Learning**

## **Data**

- **[Individual Household Electric Power Consumption](https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption) from UC Irvine's Machine Learning Repository**

## **Documentation**

- **[DuckDB Documentation](https://duckdb.org/docs/)**

## **Podcast**

- **[DuckDB and Python: Ducks and Snakes living together](https://www.youtube.com/watch?v=3wGeadcKens) from Talk Python to Me**

# **Connect**

- **Feel free to connect with Adrian on [YouTube](https://www.youtube.com/channel/UCPuDxI3xb_ryUUMfkm0jsRA), [LinkedIn](https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/), [X](https://twitter.com/DolinayG), [GitHub](https://github.com/ad17171717), [Medium](https://adriandolinay.medium.com/) and [Odysee](https://odysee.com/@adriandolinay:0). Happy coding!**

# **Podcast**

- **Check out Adrian's Podcast, The Aspiring STEM Geek on [YouTube](https://www.youtube.com/@AdrianDolinay/podcasts), [Spotify](https://open.spotify.com/show/60dPNJbDPaPw7ru8g5btxV), [Apple Podcasts](https://podcasts.apple.com/us/podcast/the-aspiring-stem-geek/id1765996824), [Audible](https://www.audible.com/podcast/The-Aspiring-STEM-Geek/B0DC73S9SN?eac_link=MCFKvkxuqKYU&ref=web_search_eac_asin_1&eac_selected_type=asin&eac_selected=B0DC73S9SN&qid=IrZ84nGqvz&eac_id=141-8769271-5781515_IrZ84nGqvz&sr=1-1) and [iHeart Radio](https://www.iheart.com/podcast/269-the-aspiring-stem-geek-202676097/)!**