<a href="https://colab.research.google.com/github/ancestor9/2025_Fall_AI-Model-Operations-MLOps/blob/main/week07/Introduction_to_duck_db.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

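# **DuckDB**
- DuckDB is an in-process analytical database written in C++; it can run entirely in memory or persist to a file.
- It is designed for SQL queries and data-intensive analytical workloads, which is where its speed comes from.
- DuckDB focuses on providing SQL functionality, whereas Polars provides a DataFrame API similar to Pandas.
- DuckDB is multithreaded: it uses multiple threads to execute a query, which gives a large performance gain on multicore systems.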

<img src = "https://duckdb.org/images/logo-dl/DuckDB_Logo.png" width =400 height =300>

In [1]:
import duckdb
import pandas as pd
import numpy as np

In [2]:
# Create a Pandas DataFrame
df = pd.DataFrame({
   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
   'C': np.random.randn(8),
   'D': np.random.randn(8)
})
df

Unnamed: 0,A,B,C,D
0,foo,one,0.360459,-0.408747
1,bar,one,-1.026349,-1.171275
2,foo,two,1.233082,1.308011
3,bar,three,-0.408701,-1.862355
4,foo,two,-0.100458,1.810412
5,bar,two,-0.672313,0.382049
6,foo,one,0.459375,-1.349386
7,foo,three,0.054669,0.141064


In [3]:
# Run a SQL query on the DataFrame using DuckDB
result = duckdb.query("SELECT A, AVG(D) FROM df GROUP BY A").to_df()
result

Unnamed: 0,A,avg(D)
0,foo,0.300271
1,bar,-0.883861


In [4]:
# Create two Pandas DataFrames
df1 = pd.DataFrame({
   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
   'C': np.random.randn(8),
   'D': np.random.randn(8)
})

df2 = pd.DataFrame({
   'A': ['foo', 'bar', 'baz', 'bat'],
   'E': ['apple', 'orange', 'banana', 'grape']
})

In [5]:
# Run a SQL join on the DataFrames using DuckDB
result = duckdb.query("SELECT df1.A, df1.B, df2.E FROM df1 JOIN df2 ON df1.A = df2.A").to_df()
result

Unnamed: 0,A,B,E
0,foo,one,apple
1,bar,one,orange
2,foo,two,apple
3,bar,three,orange
4,foo,two,apple
5,bar,two,orange
6,foo,one,apple
7,foo,three,apple


In [6]:
# The same inner join in pandas (merges on the shared column A)
pd.merge(df1, df2).drop(columns=['C', 'D'])

Unnamed: 0,A,B,E
0,foo,one,apple
1,bar,one,orange
2,foo,two,apple
3,bar,three,orange
4,foo,two,apple
5,bar,two,orange
6,foo,one,apple
7,foo,three,apple


In [7]:
df2

Unnamed: 0,A,E
0,foo,apple
1,bar,orange
2,baz,banana
3,bat,grape


# **[1. Weather API Hands-on](https://www.weatherapi.com/)**
### This script demonstrates a straightforward approach to interacting with REST APIs, processing JSON data, and using logging for error handling and status updates. It also shows how to work with databases in Python through the DuckDB library.
### **Call a weather API, retrieve data, and save it to a DuckDB database**
- API key : 55991297847d4e8c92f101705241201

In [8]:
import requests
from pandas import json_normalize
import logging


> **1. call_api(BASE_URL, API_KEY, q)**

>> Purpose:
>>> Makes a request to a weather API to get the current weather forecast for a specified location (q).

>> Parameters:
>>> BASE_URL: The base URL of the weather API.

>>> API_KEY: A key for authenticating requests to the API.

>>> q: The query parameter for the API call, typically the location for which weather data is requested.

>> Process:
>>> Makes a GET request to the weather API using the provided BASE_URL, API_KEY, and location query (q).

>>> If the request is successful, converts the JSON response into a pandas DataFrame using json_normalize.

>>> Selects the columns relevant to the weather data and renames them for clarity.

>>> Logs a success message indicating that the API connectivity check passed, then returns the cleaned and formatted DataFrame.

>> Error Handling:
>>> Catches any request-related exception, logs an error message, and returns None.

### **Checking the URL**

In [9]:
BASE_URL = "http://api.weatherapi.com/v1/current.json"
API_KEY = "55991297847d4e8c92f101705241201"
q = "London"
url = f"{BASE_URL}?key={API_KEY}&q={q}"
url

'http://api.weatherapi.com/v1/current.json?key=55991297847d4e8c92f101705241201&q=London'
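
Building the URL with an f-string works for a simple value such as `London`, but a location containing spaces or special characters needs to be URL-encoded. The sketch below uses the standard library's `urlencode` for that. The key `YOUR_API_KEY` and the location `New York` are placeholders chosen for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.weatherapi.com/v1/current.json"

# Placeholder values: substitute your own key and location.
params = {"key": "YOUR_API_KEY", "q": "New York"}

# urlencode escapes each value, so the space in "New York" becomes "+".
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

`requests.get(BASE_URL, params=params)` performs the same encoding internally, so passing a dict via `params` is an equally reasonable choice.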

In [10]:
# Send a GET request and receive the response object.
response = requests.get(url)

# Check the HTTP status code
print(f"Status Code: {response.status_code}")

# Check the response headers
print("Headers:")
print(response.headers)

# Check the response body (content)
print("Content:")
print(response.text)

Status Code: 200
Headers:
{'Date': 'Tue, 21 Oct 2025 13:47:07 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'BunnyCDN-FR1-1323', 'CDN-PullZone': '93447', 'CDN-Uid': '8fa3a04a-75d9-4707-8056-b7b33c8ac7fe', 'CDN-RequestCountryCode': 'NL', 'Vary': 'Accept-Encoding', 'Age': '0', 'Cache-Control': 'public, max-age=180', 'Content-Encoding': 'zstd', 'Via': '1.1 varnish (Varnish/7.1)', 'x-weatherapi-qpm-left': '999996', 'x-varnish': '763252850', 'CDN-ProxyVer': '1.39', 'CDN-RequestPullSuccess': 'True', 'CDN-RequestPullCode': '200', 'CDN-CachedAt': '10/21/2025 13:47:07', 'CDN-EdgeStorageId': '1216', 'CDN-RequestId': '2f842bbf0997e2ce4e24ffd551511f5a', 'CDN-Cache': 'MISS', 'CDN-Status': '200', 'CDN-RequestTime': '0'}
Content:
{"location":{"name":"London","region":"City of London, Greater London","country":"United Kingdom","lat":51.5171,"lon":-0.1062,"tz_id":"Europe/London","localtime_epoch":1761054493,"localtime":"2025-10-21 14:48"

- The response.json() method parses the response body into a JSON object (a Python dict).
- pd.json_normalize() flattens the response, which contains nested JSON objects, and converts it into a DataFrame.
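
To see the flattening without calling the API, the sketch below applies `json_normalize` to a small hand-written dict. The `sample` payload is a made-up, heavily shortened imitation of the real response shape.

```python
from pandas import json_normalize

# Hand-written stand-in for a parsed API response (not real API output).
sample = {
    "location": {"name": "London", "lat": 51.5171},
    "current": {"humidity": 77, "uv": 1.5},
}

# Nested keys are joined with "." to form flat column names.
flat = json_normalize(sample)
print(list(flat.columns))
print(flat.shape)
```

Each nested key path becomes one column (for example `location.name`), which is why the real response later yields names such as `current.humidity`.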

In [11]:
resp = requests.get(f"{BASE_URL}?key={API_KEY}&q={q}")
json_response = resp.json()
objects = json_normalize(json_response)
objects

Unnamed: 0,location.name,location.region,location.country,location.lat,location.lon,location.tz_id,location.localtime_epoch,location.localtime,current.last_updated_epoch,current.last_updated,...,current.windchill_f,current.heatindex_c,current.heatindex_f,current.dewpoint_c,current.dewpoint_f,current.vis_km,current.vis_miles,current.uv,current.gust_mph,current.gust_kph
0,London,"City of London, Greater London",United Kingdom,51.5171,-0.1062,Europe/London,1761054493,2025-10-21 14:48,1761054300,2025-10-21 14:45,...,60.4,15.8,60.4,9.3,48.7,10.0,6.0,1.5,15.2,24.5


#### **Extract only the 14 required columns out of 33 in total**

In [12]:
# Set up logging
logging.basicConfig(filename="logs.log", level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

BASE_URL = "http://api.weatherapi.com/v1/current.json"
API_KEY = "55991297847d4e8c92f101705241201"
q = "London"

def call_api(BASE_URL, API_KEY, q):
    """Call a weather API to extract the current forecast"""
    try:
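        resp = requests.get(f"{BASE_URL}?key={API_KEY}&q={q}")
        # Use this function's own response object (resp), not the global `response`
        print(f'Http Status : {resp.status_code}')
        # Raise HTTPError (a RequestException) on 4xx/5xx so the except block handles it
        resp.raise_for_status()
        json_response = resp.json()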
        objects = json_normalize(json_response)

        # Extracting only required columns (14 columns)
        objects = objects[["location.name", "location.region", "location.lat", "location.lon", 'current.precip_in',
                           "current.humidity", "current.cloud", "current.feelslike_c", "current.feelslike_f",
                           "current.vis_km", "current.vis_miles", "current.uv", "current.gust_mph", "current.gust_kph"]]

        # Renaming column names
        objects.columns = ["name", "region", "lat", "lon", "precip_in", "humidity", "cloud", "feelslike_c", "feelslike_f",
                           "vis_km", "vis_miles", "uv", "gust_mph", "gust_kph"]
        logging.info('API connectivity check passed.')
        return objects

    except requests.exceptions.RequestException as e:
        logging.error(f'API connectivity check failed: {e}')
        return None


> **2. save_data_to_db(conn, df)**
>> Purpose: Saves the fetched weather data (pandas DataFrame) to a DuckDB database.

>> Parameters:
>>> conn: A connection object to the DuckDB database.

>>> df: The pandas DataFrame containing the weather data to be saved.

>> Process:
>>> Executes an SQL INSERT operation to save the DataFrame data into a table named curr_weather in the DuckDB database.

>>> Logs a success message indicating the data was successfully saved.

>> Error Handling:
>>> Catches exceptions related to database operations (such as issues with inserting data into DuckDB) and logs an error message.

In [13]:
def save_data_to_db(conn, df):
    """Save the data to a DuckDB database"""
    try:
        conn.execute("INSERT INTO curr_weather SELECT * FROM df")
        logging.info('Data saved to DuckDB')
        print(f"Data는 DuckDB로 저장 성공 !!")  # "Data saved to DuckDB successfully!"

    except duckdb.Error as e:
        logging.error(f'Failed to save data to DuckDB: {e}')
        print(f"Data는 DuckDB로 저장 실패 ㅠㅠ")  # "Failed to save the data to DuckDB"

#### **Create the DuckDB database (weather_db.duckdb) and the curr_weather table**

In [14]:
# Main execution
if __name__ == '__main__':

    # Creating a database connection and table
    conn = duckdb.connect('weather_db.duckdb')

    sql = '''CREATE OR REPLACE TABLE curr_weather (name string,
                                        region string,
                                        lat string,
                                        lon string,
                                        precip_in numeric,
                                        humidity numeric,
                                        cloud numeric,
                                        feelslike_c numeric,
                                        feelslike_f numeric,
                                        vis_km numeric,
                                        vis_miles numeric,
                                        uv numeric,
                                        gust_mph numeric,
                                        gust_kph numeric)'''
    conn.execute(sql)

    # Call the API to get data
    data = call_api(BASE_URL, API_KEY, q)

    # If data is retrieved, save it to the database
    if data is not None:
        save_data_to_db(conn, data)

    # Close the database connection
    conn.close()

Http Status : 200
Data는 DuckDB로 저장 성공 !!


In [15]:
data

Unnamed: 0,name,region,lat,lon,precip_in,humidity,cloud,feelslike_c,feelslike_f,vis_km,vis_miles,uv,gust_mph,gust_kph
0,London,"City of London, Greater London",51.5171,-0.1062,0.01,77,25,15.3,59.5,10.0,6.0,1.5,15.2,24.5


In [16]:
# Reconnect (after conn.close())
conn = duckdb.connect('weather_db.duckdb')
cur = conn.cursor()

In [17]:
# fetchall() returns every row of the query result as a list of tuples
cur.execute('SELECT * FROM curr_weather;').fetchall()

[('London',
  'City of London, Greater London',
  '51.5171',
  '-0.1062',
  Decimal('0.010'),
  Decimal('77.000'),
  Decimal('25.000'),
  Decimal('15.300'),
  Decimal('59.500'),
  Decimal('10.000'),
  Decimal('6.000'),
  Decimal('1.500'),
  Decimal('15.200'),
  Decimal('24.500'))]

In [18]:
df = conn.execute("SELECT * FROM curr_weather").fetchdf()
df

Unnamed: 0,name,region,lat,lon,precip_in,humidity,cloud,feelslike_c,feelslike_f,vis_km,vis_miles,uv,gust_mph,gust_kph
0,London,"City of London, Greater London",51.5171,-0.1062,0.01,77.0,25.0,15.3,59.5,10.0,6.0,1.5,15.2,24.5


In [19]:
print(len(cur.execute('SELECT * FROM curr_weather;').fetchall()[0]))
print(len(data.columns))

14
14


## END