# Geospatial Data Analysis using DuckDB
### Simplifying Spatial Data Analysis with DuckDB’s Spatial Extensions

In the ever-evolving world of data analysis, geospatial data has emerged as a critical component across various industries, from urban planning and environmental monitoring to logistics and retail. However, analyzing spatial data often comes with its own set of challenges, including complex workflows, specialized tools, and performance bottlenecks.

Enter DuckDB, a lightweight, fast, and embeddable analytical database designed to simplify data analysis tasks. With its spatial extension, DuckDB can store and query geospatial data directly in SQL, making spatial analysis more accessible and efficient for analysts and developers alike.

This article explores how DuckDB’s spatial capabilities can streamline geospatial data analysis, offering a powerful yet user-friendly alternative to traditional GIS tools. Whether you’re a data scientist, GIS professional, or a developer looking to integrate spatial analysis into your applications, DuckDB’s spatial extension provides a compelling way to simplify and accelerate your workflows.

### Our Sample Dataset
For this article, I will use the dataset from https://www.kaggle.com/datasets/shengjunlim/singapore-mrt-lrt-stations-with-coordinates?resource=download. This dataset contains a list of MRT and LRT stations in Singapore. The following shows the first five rows of the CSV file (MRT Stations.csv):

This dataset includes the latitude and longitude of each MRT and LRT station, along with a field containing a geometry object. For example, POINT (103.9032524667383 1.319778951553637) represents a geometry object in Well-Known Text (WKT) format, which is commonly used for spatial data.
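Note that a WKT point lists the x coordinate (longitude) first and the y coordinate (latitude) second, which is the reverse of the usual "lat, lng" order. The following is a minimal sketch of reading such a string with only the Python standard library; `parse_wkt_point` is a hypothetical helper written for illustration, not part of DuckDB or the dataset, and it handles only simple 2D points:

```python
import re

# Matches a simple 2D WKT point such as "POINT (103.90 1.31)".
# This is an illustrative pattern, not a full WKT parser.
_WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*([^\s()]+)\s+([^\s()]+)\s*\)\s*$",
    re.IGNORECASE,
)

def parse_wkt_point(wkt: str) -> tuple[float, float]:
    """Return (longitude, latitude) from a 2D WKT POINT string."""
    match = _WKT_POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt!r}")
    x, y = match.groups()
    # WKT stores x (longitude) before y (latitude)
    return float(x), float(y)

lng, lat = parse_wkt_point("POINT (103.9032524667383 1.319778951553637)")
print(f"latitude={lat}, longitude={lng}")
```

For anything beyond a quick check, a dedicated geometry parser is the safer choice.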

### Importing the CSV file into DuckDB
The first step in working with this CSV file is to load it into a DuckDB database. To do so, you need to install the duckdb package:

!pip install duckdb

Once this is done, you can create an in-memory DuckDB database:



In [1]:
import duckdb

conn = duckdb.connect()

Once the database is created, you can load the CSV file into a table:

In [None]:
conn.execute('''
    CREATE TABLE MRT_stations
    as
    SELECT
        *
    FROM read_csv_auto('/path to file/MRTStations.csv')
''')

<duckdb.duckdb.DuckDBPyConnection at 0x1104f7bf0>

A table named MRT_stations is created. You can confirm its existence by executing the following SQL statement:

In [5]:
display(conn.execute('SHOW TABLES').df())

You will see the following:

| | name |
|---|---|
| 0 | MRT_stations |

Use the following SQL statement to view the content of the MRT_stations table:

In [6]:
df = conn.execute('''
    SELECT
        Latitude as lat,
        Longitude as lng,  
        STN_NAME as station_name,
        geometry
    FROM MRT_stations
    WHERE
        (lat is not null) and
        (lng is not null)
''').df()
df

| | lat | lng | station_name | geometry |
|---|---|---|---|---|
| 0 | 1.319779 | 103.903252 | EUNOS MRT STATION | POINT (103.9032524667383 1.319778951553637) |
| 1 | 1.342353 | 103.732597 | CHINESE GARDEN MRT STATION | POINT (103.7325967380734 1.342352820874744) |
| 2 | 1.417383 | 103.832980 | KHATIB MRT STATION | POINT (103.8329799077383 1.417383370153547) |
| 3 | 1.425178 | 103.762165 | KRANJI MRT STATION | POINT (103.7621654109002 1.425177698770448) |
| 4 | 1.289563 | 103.816817 | REDHILL MRT STATION | POINT (103.816816670149 1.289562726402453) |
| ... | ... | ... | ... | ... |
| 166 | 1.398161 | 103.818082 | SPRINGLEAF MRT STATION | POINT (103.8180818498627 1.398160861025955) |
| 167 | 1.385062 | 103.836469 | LENTOR MRT STATION | POINT (103.8364694869142 1.385061946926286) |
| 168 | 1.372087 | 103.836824 | MAYFLOWER MRT STATION | POINT (103.8368239320149 1.372086638674201) |
| 169 | 1.363308 | 103.832936 | BRIGHT HILL MRT STATION | POINT (103.8329359578363 1.363308098095808) |


The above snippet extracts four columns from the table (lat, lng, station_name, and geometry), keeps only the rows that have coordinates, and returns the result as a pandas DataFrame.

### Displaying the Locations of the MRT and LRT Stations using Folium
As with all geographical data, it is useful to display the stations on a map. To do that, you can use Folium, a Python library for creating interactive Leaflet maps. Let’s install Folium now:

!pip install folium

The following code snippet uses the location data stored in the DataFrame to draw a circle marker for each station:

In [7]:
import math
import folium

# display a map with Singapore at the center of the map
mymap = folium.Map(location = [1.3521,103.8198],
                   width = 950,
                   height = 550,
                   zoom_start = 11,
                   tiles = 'openstreetmap')

for lat, lng, station_name in zip(df['lat'], df['lng'], df['station_name']):
    station = folium.CircleMarker(
        location = [lat, lng],  # location of the marker
        radius = 4,             # size of the marker
        color = 'red',          # color of the marker
        fill = True,            # fill the marker with color
        fill_color = 'yellow',  # fill the marker with yellow color
        fill_opacity = 0.5,     # make the marker translucent
        popup = station_name)   # name of the station
    
    # add the circle marker to the map
    station.add_to(mymap)
mymap

You can zoom in on the map, and clicking a circle displays the name of the MRT or LRT station.

### Installing and Loading the Spatial Extension in DuckDB
To perform geospatial analysis with DuckDB, install its spatial extension. Extensions add functionality to DuckDB, and the spatial extension provides support for geospatial data types and operations (e.g., points, lines, polygons, and spatial queries).

In [8]:
# Install and load the DuckDB spatial extension
conn.execute('INSTALL spatial;')
conn.execute('LOAD spatial;')

<duckdb.duckdb.DuckDBPyConnection at 0x1104f7bf0>

The above installs and then loads the spatial extension. Once the spatial extension is loaded, you can use geospatial functions and data types in your queries.

### Getting the Coordinates of a Location
When performing geospatial analytics, you often need to convert a human-readable address or place name into latitude and longitude coordinates. This process is known as geocoding. For example, geocoding translates an address such as “1600 Amphitheatre Parkway, Mountain View, CA” into geographic coordinates (latitude 37.4220, longitude -122.0841), which can then be used for mapping, spatial analysis, and visualization.

To do this, you can use the geopy package, a popular Python library used for geocoding and working with geographic data. To use geopy, install it first:

!pip install geopy

The following code snippet uses the geopy package to geocode a place of interest (Sim Lim Square, a popular electronics and IT shopping mall in the Rochor area of Singapore) into its full address, latitude, and longitude:


In [9]:
from geopy.geocoders import Nominatim

# Initialize the geocoder
geolocator = Nominatim(user_agent = "my_geocoding_app")

# Input: Location name or address
location_name = "Sim Lim Square, Singapore"

# Get location information
location = geolocator.geocode(location_name)

# Extract latitude and longitude
if location:
    print(f"Location: {location.address}")
    print(f"Latitude: {location.latitude}")
    print(f"Longitude: {location.longitude}")
else:
    print("Location not found.")

Location: Sim Lim Square, 1, Rochor Canal Road, Selegie, Rochor, Central, Singapore, 188504, Singapore
Latitude: 1.3030332
Longitude: 103.85302554045288
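Once a place has coordinates, it can be compared with the station coordinates. As a rough illustration, the sketch below computes the great-circle distance between the geocoded Sim Lim Square and the first station in the table (Eunos) using the haversine formula. It uses only the Python standard library, `haversine_km` is a hypothetical helper, and the spherical Earth radius is an assumption of this simplified model rather than anything taken from DuckDB:

```python
import math

# Mean Earth radius in kilometres; treating the Earth as a sphere is an
# approximation that is adequate for a quick estimate.
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

sim_lim = (1.3030332, 103.85302554045288)  # geocoding result shown above
eunos = (1.319779, 103.903252)             # first row of the station table

distance_km = haversine_km(*sim_lim, *eunos)
print(f"Sim Lim Square to Eunos MRT: {distance_km:.2f} km")
```

This is a straight-line estimate only; it ignores roads and walking paths.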
