
Improved reading and writing from/to PostGIS (SQL in general?) support #595

jorisvandenbossche opened this issue Oct 22, 2017 · 19 comments


@jorisvandenbossche commented Oct 22, 2017

Currently we only have a very basic read_postgis function, and we certainly want a write function as well (#189). But we currently also have several different open (overlapping) PRs and issues related to improving the IO support for PostGIS, so I thought I'd open a new general issue to get an overview.

Open PRs:

  • #440 PR adding to_postgis
  • #457 PR with both read/write for postgis
  • #546 PR to use geoalchemy in from_postgis
  • one not related to postgis: #101 PR to add support for sqlite

Does somebody have insight into the main differences between the PostGIS PRs? How should we proceed with those?

Some questions related to this that we might need to answer:

  • Do we want to use geoalchemy (and thus add it as an optional requirement)? What does it bring?
  • Can we actually support more than PostGIS? More general SQL support? (#490) E.g. MySQL also has spatial data types, but geoalchemy does not seem to support that.
  • Naming of the functions (#161): currently GeoDataFrame.from_postgis and read_postgis. Depending on the question above, we might want to make it more general (read_sql, to_sql). Personally I would retire 'from_postgis' in favour of read_postgis (or read_sql) anyhow.

There is some relevant discussion in #161 as well.

Other related issues: #451 on adding SRID support in read_postgis

cc @jdmcbr @dimitri-justeau @showjackyang @adamboche @kuanb @emiliom @perrygeo @carsonfarmer

@jdmcbr commented Dec 3, 2017

@jorisvandenbossche Well, my answer is biased, but not being aware of significant advantages geoalchemy2 would provide, I'd be inclined to keep the geopandas writing functionality a little lighter weight. That said, if someone were to point out some real advantages of geoalchemy2, my opinion would likely change.

I don't have any personal experience with spatial MySQL, but if someone has a test database set up they'd be willing to give access to, I'd be happy to try to connect with geopandas and see what breaks.

I'm fine with retiring from_postgis. I remember finding the pandas choice regarding to_* and read_* slightly unintuitive at first, but we might as well be consistent with that standard.

@mrocklin commented Dec 3, 2017

@jdmcbr commented Dec 3, 2017

@mrocklin I would guess mainly that tacking light geometry handling on top of pandas SQL support offered roughly the same capabilities with considerably less work. I've never looked seriously into using GDAL for PostGIS access, but I don't see anything that pops out as motivating a move to rely on GDAL.

@mrocklin commented Dec 3, 2017

@jdmcbr commented Dec 3, 2017

Well, the fiona manual has this to say: "If your data is in a RDBMS like PostGIS, use a Python DB package or ORM like SQLAlchemy or GeoAlchemy. Maybe you’re using GeoDjango already. If so, carry on."

@mrocklin commented Dec 3, 2017

@dimitri-justeau commented Dec 3, 2017

Hello everyone, sorry I am answering a bit late to this topic, I have been very busy lately.

Here are the reasons for my choice to rely on geoalchemy2 in the PR I made:

  • Pandas relies on sqlalchemy for its to_sql method. The advantage is that pandas does not have to worry about building SQL statements: it just binds a dataframe's datatypes to sqlalchemy's datatypes. I think this strategy is a good one, since each library deals with what it was designed for: pandas with dataframes, sqlalchemy with RDBMSs. If new backends are supported by sqlalchemy, they are automatically available within pandas, and API changes are picked up as well.

  • I wanted to stay close to this approach, since geoalchemy2 provides an abstraction layer over PostGIS datatypes. If the code I added in #546 may seem like a lot, it is actually more or less a replication of what exists in pandas. I just had to fully rewrite some methods of the subclassed objects to use the appropriate object types.
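The dtype-binding strategy described above can be sketched with plain sqlalchemy types against an in-memory SQLite database; for PostGIS, geoalchemy2's Geometry type would slot into the same `dtype` dict. A minimal illustration of the mechanism, not the geopandas implementation:

```python
import pandas as pd
from sqlalchemy import create_engine, Integer, Text

# In-memory SQLite stands in for the real database here.
engine = create_engine("sqlite://")
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# pandas maps each column to the SQL type given in `dtype` and lets
# sqlalchemy emit the dialect-specific DDL and INSERT statements.
# With geoalchemy2 one would add e.g. {"geometry": Geometry("POLYGON")}.
df.to_sql("demo", engine, index=False, dtype={"id": Integer(), "name": Text()})

out = pd.read_sql("SELECT * FROM demo", engine)
print(out)
```

Because the type objects carry the DDL, swapping the engine (or adding a spatial column type) requires no change to the pandas-side code.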

@jdmcbr Your approach in #440 is in fact much simpler and does not introduce a new dependency, but in its current state it is also more limited. In the PostGIS tables that are created, the geometry type is just geometry, without any SRID information. The SRID could easily be supplied by relying on EWKT instead of WKT (c.f. #546). The datatypes could also be inferred the same way I did, by subclassing SQLTable to GeoSQLTable.
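For reference, EWKT is simply the WKT with an `SRID=…;` prefix, which is what lets PostGIS attach the SRID to each geometry on insert. A toy helper, not code from any of the PRs:

```python
def to_ewkt(wkt: str, srid: int) -> str:
    """Prefix an SRID onto a WKT string, producing PostGIS-style EWKT."""
    return "SRID={};{}".format(srid, wkt)

print(to_ewkt("POINT (1 2)", 4326))
# SRID=4326;POINT (1 2)
```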

From this point, I think that the choice to rely on geoalchemy2 or not is mainly a design decision, since the same functionality could be achieved with both approaches. I personally don't have a particular preference, but maybe extending @jdmcbr's PR to be more complete would prevent the addition of a new dependency. On the other hand, if geoalchemy2 supports more backends in the future, it might also be interesting to stick with it.



@jorisvandenbossche commented Dec 8, 2017

Would using geoalchemy2 take care of updating the "geometry_columns" table in postgis?

@snowman2 commented Sep 24, 2019

This looks promising: awburgess/geopandas-postgis#7

@HTenkanen commented Oct 21, 2019

Hi all,

I find using GeoAlchemy sometimes frustratingly slow, especially if there is a need to upload a large number of rows (e.g. millions). For large bulk uploads, I implemented the following, which might be of use to someone:

import io

import pandas as pd
import geopandas as gpd
from sqlalchemy import create_engine
from geopandas_postgis import PostGIS  # registers the .postgis accessor used below
from geoalchemy2 import Geometry
from shapely.geometry import MultiLineString, MultiPoint, MultiPolygon

def get_geometry_type(gdf):
    """Get the basic geometry type of a GeoDataFrame, and whether it contains Multi-geometries."""
    geom_types = list(gdf.geometry.geom_type.unique())
    geom_collection = False
    # Reduce Multi* types to their basic geometry type
    basic_types = []
    for gt in geom_types:
        if 'Multi' in gt:
            geom_collection = True
            basic_types.append(gt.replace('Multi', ''))
        else:
            basic_types.append(gt)
    geom_types = list(set(basic_types))
    # Check for mixed geometry types
    assert len(geom_types) < 2, "GeoDataFrame contains mixed geometry types, cannot proceed with mixed geometries."
    geom_type = geom_types[0]
    return (geom_type, geom_collection)

def get_srid_from_crs(gdf):
    """Get EPSG code from CRS if available. If not, return -1."""
    srid = -1
    if gdf.crs is not None:
        # In some cases CRS might be text as well (e.g. proj4)
        # TODO(?): parse epsg from proj4
        try:
            srid = int(gdf.crs['init'].replace('epsg:', ''))
        except (KeyError, TypeError, ValueError):
            srid = -1
    if srid == -1:
        print("Warning: Could not parse coordinate reference system from GeoDataFrame. "
              "Inserting data without defined CRS.")
    return srid

def copy_to_postgis(gdf, engine, table, if_exists='fail',
                    schema=None, dtype=None, index=False):
    """Fast upload of a GeoDataFrame into a PostGIS database using COPY.

    Parameters
    ----------
    gdf : GeoDataFrame
        GeoDataFrame containing the data for upload.
    engine : SQLAlchemy engine
        Engine connected to the target database.
    table : str
        Name of the target table.
    if_exists : str
        What to do if the table exists already: 'replace' | 'append' | 'fail'.
    schema : str
        Database schema where the data will be uploaded (optional).
    dtype : dict of column name to SQL type, default None
        Optional, specifying the datatype for columns. The SQL type should be a
        SQLAlchemy type, or a string for a sqlite3 fallback connection.
    index : bool
        Store the DataFrame index in the database as well.
    """
    gdf = gdf.copy()
    geom_name = gdf.geometry.name
    if schema is not None:
        schema_name = schema
    else:
        schema_name = 'public'

    # Get srid
    srid = get_srid_from_crs(gdf)
    # Check geometry types
    geometry_type, contains_multi_geoms = get_geometry_type(gdf)

    # Build target geometry type
    if contains_multi_geoms:
        target_geom_type = "Multi{geom_type}".format(geom_type=geometry_type)
    else:
        target_geom_type = geometry_type

    # Build dtype with Geometry (srid is updated afterwards)
    if dtype is not None:
        dtype[geom_name] = Geometry(geometry_type=target_geom_type)
    else:
        dtype = {geom_name: Geometry(geometry_type=target_geom_type)}

    # Get a pandas SQLTable object (controls table creation);
    # if dtype is used, the table schema is updated accordingly.
    pandas_sql = pd.io.sql.pandasSQL_builder(engine, schema=schema_name)
    tbl = pd.io.sql.SQLTable(name=table, pandas_sql_engine=pandas_sql,
                             frame=gdf, dtype=dtype, index=index)

    # Check if table exists
    if tbl.exists():
        # If it exists, check whether we should overwrite
        if if_exists == 'replace':
            pandas_sql.drop_table(table)
            tbl.create()
        elif if_exists == 'fail':
            raise Exception("Table '{table}' exists in the database.".format(table=table))
        elif if_exists == 'append':
            pass
    else:
        tbl.create()

    # Promote all geometries to Multi-geometries, if there was a MultiGeom in the table
    if contains_multi_geoms:
        mask = gdf[geom_name].geom_type == geometry_type
        if geometry_type == 'Point':
            gdf.loc[mask, geom_name] = gdf.loc[mask, geom_name].apply(lambda geom: MultiPoint([geom]))
        elif geometry_type == 'LineString':
            gdf.loc[mask, geom_name] = gdf.loc[mask, geom_name].apply(lambda geom: MultiLineString([geom]))
        elif geometry_type == 'Polygon':
            gdf.loc[mask, geom_name] = gdf.loc[mask, geom_name].apply(lambda geom: MultiPolygon([geom]))

    # Convert geometries to WKT so that they can be pushed with COPY
    gdf[geom_name] = gdf[geom_name].apply(lambda geom: geom.wkt)

    # If there are datetime objects they need to be converted to text
    # TODO

    # Create file buffer
    buffer = io.StringIO()
    gdf.to_csv(buffer, sep='\t', index=index, header=False, quotechar="'")
    buffer.seek(0)

    # Push to database with COPY
    conn = engine.raw_connection()
    cur = conn.cursor()
    try:
        cur.copy_from(buffer, table, sep='\t', null='')
        # Update SRID
        cur.execute("SELECT UpdateGeometrySRID('{schema}', '{table}', '{geometry}', {srid})".format(
            schema=schema_name, table=table, geometry=geom_name, srid=srid))
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
It's not perfect yet, but does the basic things quite efficiently.

Performance tests

Below, I compare the COPY approach with the GeoAlchemy-based approach (using the geopandas-postgis package).

In [2]: data = gpd.read_file("")

In [3]: engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/mydb")

In [4]: len(data)
Out[4]: 13108

In [5]: data.head()
  id      ykr                                           geometry
0  1  5785640  POLYGON ((24.85836190889725 60.39886163559551,...
1  2  5785641  POLYGON ((24.86288218613236 60.39893112037132,...
2  3  5785642  POLYGON ((24.86740246336747 60.39900060514714,...
3  4  5785643  POLYGON ((24.87196934139882 60.39907008992294,...
4  5  5787544  POLYGON ((24.84489427878439 60.39638334525826,...

In [6]: %timeit data.postgis.to_postgis(con=engine, table_name='test', if_exists='replace', geometry='Polygon')
3.59 s ± 90.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit copy_to_postgis(gdf=data, engine=engine, table='test2', if_exists='replace')
1.07 s ± 74.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With COPY you can get a ~3.5x performance improvement on this kind of small dataset. With larger datasets the gain is even bigger.


With the current approach, where the data is written into a StringIO buffer, there can be some issues if the data contains special characters (e.g. when uploading natural text such as tweets with emojis etc.).
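The escaping concern can be illustrated with the stdlib alone: a raw tab-separated dump silently gains a column when a value contains a tab, while csv.writer with quoting keeps the row intact. Toy data, unrelated to the function above:

```python
import csv
import io

row = ["1", "tweet with\ttab and \U0001F600", "POINT (1 2)"]

# Naive tab join: the embedded tab makes 3 fields look like 4.
naive = "\t".join(row)
print(len(naive.split("\t")))  # 4

# csv.writer quotes the offending field, so the row round-trips.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)
buf.seek(0)
parsed = next(csv.reader(buf, delimiter="\t"))
print(parsed == row)  # True
```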

EDIT1: Updated a check for mixed geometry types, and handling of possible Geometry collections.
EDIT2: Use pandas sql handling to parse table schema etc.
EDIT3: Add performance comparison.

@jorisvandenbossche commented Oct 24, 2019

@HTenkanen thanks a lot for reviving this!

Yes, the to_sql is known to be slow due to going row by row through Python. I think many people are using the COPY approach with csv to have a faster solution. In pandas we are somewhat hesitant to add functionality specific to a certain database (that might open a can of worms, sqlalchemy gives us this database agnostic approach). But for GeoPandas, given the special status of Postgis, it probably makes sense to actually look into specific approaches that are faster.

Some questions / remarks

  • Pandas has a method argument in to_sql that allows one to use a COPY FROM approach as well. It would be interesting to see if that could be used. A potential problem is that there you get an iterator over the data instead of the actual data (so you can't directly use to_csv; one would need to do a timing to see if that gives a slowdown).
  • Can Postgis also read CSV data where the geometries are stored as (hex) WKB instead of WKT? (if so, that should be faster)
  • get_srid_from_crs can be improved once we use the new pyproj.CRS interface (see #1003)
  • It might be interesting to time the different parts of your function above to see where the remaining bottlenecks are. Is it mainly the actual copy to postgres, or also some of the preprocessing? It might be possible (certainly in light of #1155) to further optimize some of the preprocessing steps.
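On the (hex) WKB question above: the format is easy to illustrate with the stdlib alone. A 2D point in WKB is a byte-order flag, a geometry-type code, and two little-endian doubles; shapely's `geom.wkb_hex` produces the same bytes. A toy encoder for illustration only, not a general WKB writer:

```python
import struct

def point_to_hex_wkb(x, y):
    """Encode a 2D point as hex WKB: little-endian flag (1),
    geometry type 1 (Point), then the two coordinate doubles."""
    return struct.pack("<BIdd", 1, 1, x, y).hex().upper()

print(point_to_hex_wkb(1.0, 2.0))
# 0101000000000000000000F03F0000000000000040
```

PostGIS can ingest such hex strings directly into a geometry column, which avoids re-parsing WKT text on the server side.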

@HTenkanen commented Oct 24, 2019

@jorisvandenbossche Thanks for your comments! I will run some tests based on these ideas and report back with more detailed profiling of the different parts of the code 👍 I think having both the SQLAlchemy/GeoAlchemy approach and the faster COPY approach available would probably be optimal from the user's perspective, but that can of course cause some challenges from the maintenance perspective. Pygeos might indeed provide even better performance in some of the preprocessing steps, I will test!

The new pyproj CRS class also seems superb, by the way! Is the plan to introduce this new class into GeoPandas soon? It would definitely make life a lot easier. Exciting things happening! :)

@Sangarshanan commented Dec 30, 2019

Hey, I was looking to revive this again and maybe help out / contribute in any way possible.

I worked on the above-mentioned points, converting geometries to WKB and using the to_sql COPY method, borrowing from psql_insert_copy and the copy_to_postgis defined above:

import csv
from io import StringIO

import pandas as pd
from geoalchemy2 import Geometry
from shapely.wkb import dumps

def get_geometry(data):
    """Build the geoalchemy2 dtype for a single-geometry-type GeoDataFrame."""
    if len(set(list(data.geom_type))) == 1:
        dtype = {'geometry': Geometry(geometry_type=data.geom_type[0].upper())}
    else:
        raise ValueError("Multiple geometry dtypes, cannot proceed!")
    return dtype

def insert_to_postgis(data, con, table_name, schema=None, dtype=None,
                      if_exists='fail', index=False):
    data = data.copy()
    dtype = get_geometry(data)
    # borrowing from HTenkanen
    srid = get_srid_from_crs(data)
    geom_col = data._geometry_column_name
    # convert geometries to (hex) WKB
    data[geom_col] = data[geom_col].apply(lambda x: dumps(x, hex=True))
    dbapi_conn = con.raw_connection()
    cur = dbapi_conn.cursor()

    pandas_sql = pd.io.sql.pandasSQL_builder(con, schema=schema)
    tbl = pd.io.sql.SQLTable(name=table_name, pandas_sql_engine=pandas_sql,
                             frame=data, dtype=dtype, index=index)

    args = [list(data[i]) for i in data.columns]
    if index:
        args.insert(0, list(data.index))
    # generator of rows (borrowed from pandas)
    data_iter = zip(*args)

    # Check if table exists (from HTenkanen)
    if tbl.exists():
        # If it exists, check whether we should overwrite
        if if_exists == 'replace':
            pandas_sql.drop_table(table_name)
            tbl.create()
        elif if_exists == 'fail':
            raise Exception("Table '{table}' exists in the database.".format(table=table_name))
        elif if_exists == 'append':
            pass
    else:
        tbl.create()

    # get list of columns using pandas
    keys = tbl.insert_data()[0]
    columns = ', '.join('"{}"'.format(k) for k in list(keys))
    # borrowed from the psql_insert_copy recipe in the pandas docs
    s_buf = StringIO()
    writer = csv.writer(s_buf)
    writer.writerows(data_iter)
    s_buf.seek(0)

    sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
    cur.copy_expert(sql=sql, file=s_buf)
    cur.execute("SELECT UpdateGeometrySRID('{table}', '{geometry}', {srid})".format(
        table=table_name, geometry=geom_col, srid=srid))
    dbapi_conn.commit()



In [1]: data = gpd.read_file("")

In [2]: %timeit insert_to_postgis(data, table_name='test', con=con, if_exists='replace', index= None)
13.2 s ± 236 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit copy_to_postgis(data, table='test', engine=con, if_exists='replace', index= None)
13.6 s ± 421 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Appreciate feedback / suggestions to start

@HTenkanen commented Jan 2, 2020

Hi @Sangarshanan and thanks for reviving this and your contributions! 👍

I now finally had time to get back to this; I went through your edits @Sangarshanan and included them in this Gist:

I left some of the earlier parts as they were, so that the function is able to handle a mix of single and multi-geometries automatically (e.g. a mix of Polygon and MultiPolygon).

@jorisvandenbossche: I also updated the CRS reading to use the new pyproj CRS class, so it should now work quite nicely with different types of CRS information. In addition, I tested swapping from shapely.wkb to pygeos.wkb, as that also brought some performance improvements.

I did some time profiling on the different parts (see GIST) and currently the performance is as follows:

With Pygeos WKB:

In [1]: data = gpd.read_file("")
In [2]: engine = create_engine("postgresql+psycopg2://myuser:mypwd@localhost:5432/mydb")
In [3]: %timeit copy_to_postgis(data, engine, table='ykr_test', if_exists='replace')
717 ms ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And this is how long different parts take:

'get_srid_from_crs'  0.01 seconds
'get_geometry_type'  0.02 seconds
'convert_to_wkb'  0.13 seconds
'write_to_db'  0.55 seconds
'copy_to_postgis'  0.77 seconds   # In total

With Shapely WKB

In [2]: %timeit copy_to_postgis(data, engine, table='ykr_test', if_exists='replace',  schema=None, dtype=None, index=True)
874 ms ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And this is how long different parts take:

'get_srid_from_crs'  0.01 seconds
'get_geometry_type'  0.02 seconds
'convert_to_wkb'  0.26 seconds
'write_to_db'  0.50 seconds
'copy_to_postgis'  0.86 seconds  # In total

So as we can see, using pygeos instead of shapely makes the WKB conversion about twice as fast, and most of the time now goes to actually writing the data into the database, which is expected.

Couple of questions for @jorisvandenbossche :

  • I have understood that pygeos will eventually be integrated into shapely. But I guess that might still take some time, so should we continue with this pygeos approach, i.e. converting geometries from shapely to pygeos and then to WKB? Or should we stick with normal shapely WKB dumps for now, as pygeos would naturally bring a new dependency to GeoPandas? What are the current thoughts about integrating pygeos into GeoPandas?

  • I guess the most logical place to add this to_postgis() functionality would be under geopandas/io/, would you agree?

  • Any recommendations / ideas on how we should test these functionalities? I see that for testing the reading from PostGIS there is the create_postgis() function that populates a test_geopandas database. I guess we could take a similar approach here and test that populating the nybb data works with the to_postgis() function?

@HTenkanen commented Jan 3, 2020

@jorisvandenbossche @martinfleis @snowman2:

I have now implemented a method GeoDataFrame.to_postgis() for writing data from GeoDataFrame to PostGIS.
You can check it from this fork:

It now supports all the same functionality as pandas, i.e. you can also use "replace" and "append" when pushing the GeoDataFrame to PostGIS. When appending, the CRS of the GeoDataFrame and the target table need to match. When writing, it uses copy_expert under the hood, as suggested by @jorisvandenbossche and kindly written by @Sangarshanan. For checking the CRS of the GeoDataFrame, it now uses the new pyproj.CRS class, which works quite nicely for extracting the EPSG code (although I needed to lower min_confidence to 25 when parsing the EPSG code with the CRS.to_epsg() function).

All the tests are passing (locally), and I wrote a few basic tests for the write_postgis() functionality. For testing, I used the same nybb dataset that was also used in some other tests.

There are a few things that I think would be good to discuss before making a PR from this.

  • This functionality brings a new (optional) dependency to geopandas: GeoAlchemy2. Are we okay with this? This package cannot be found in the conda defaults channel, so of the current environments only the conda-forge yml works. That made me wonder: should the 37-dev.yaml also include the conda-forge channel? Or is there a specific reason to include only the defaults channel there?
  • I now used pygeos for converting shapely geometries to WKB (which is another new dependency). I implemented it as optional: if pygeos cannot be imported OR the GEOS versions of shapely and pygeos do not match, it falls back to the WKB dumps from shapely. Do we want to keep it like this? It's a bit hackish, as currently pygeos even needs to be built from master (PyPI v0.5 of pygeos does not yet have the from_shapely function that @jorisvandenbossche wrote).
    • For installing pygeos from master on Travis, I added a PYGEOS flag to .travis.yml so that pygeos is installed from source for the 37-latest-conda-forge.yaml environment. This is not needed if/when a new version of pygeos is released, as it can then be added to the yml file. @jorisvandenbossche Any idea when this might happen?
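The optional-pygeos fallback described above could follow the usual guarded-import pattern. This is only a sketch: the pygeos call names (`from_shapely`, `to_wkb`) are taken from the description in this thread, and the fallback accepts anything exposing shapely's `wkb_hex` attribute:

```python
try:
    import pygeos  # optional accelerator
    HAS_PYGEOS = True
except ImportError:
    HAS_PYGEOS = False

def geoms_to_hex_wkb(geoms):
    """Convert a sequence of shapely(-like) geometries to hex WKB,
    taking the pygeos fast path only when the library imported cleanly."""
    if HAS_PYGEOS:
        return list(pygeos.to_wkb(pygeos.from_shapely(geoms), hex=True))
    # fallback: shapely geometries expose .wkb_hex
    return [g.wkb_hex for g in geoms]
```

A real implementation would additionally compare the GEOS versions of shapely and pygeos before taking the fast path, as described above.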

Any comments and all feedback are very welcome. 🙂 I'll try to finish this as soon as possible.

@martinfleis commented Jan 3, 2020

@HTenkanen Great job! A few notes from me.

pyproj.CRS will be used from the next release onwards (#1101), so we will be able to clean up those conditions then.

I am fine with GeoAlchemy2; it is a pure Python package installable from PyPI. Recent GeoPandas is not available on defaults either. @jorisvandenbossche will be able to tell more about the channel support.

The plan was to use pygeos under the hood within geopandas anyway (#1155), but I am not sure what the current situation is after the decision to merge pygeos into shapely. I am not very keen on the logic you implemented, but once the pygeos/shapely/geopandas relationship is clearer we might come up with a simpler way.

@HTenkanen commented Jan 3, 2020

@martinfleis Do you have an idea when the next release is aimed to be rolled out? And have the features for that release already been locked, or could this fit into that release as well?

I agree that the current pygeos implementation is quite a hack and maybe it would be better to wait and go purely with shapely for now until things are settled between pygeos/shapely/geopandas. 👍

@martinfleis commented Jan 3, 2020

@HTenkanen Neither the date nor the feature list has been set. This should certainly be part of the next release. I think it would be best to open a PR now; we can discuss details there.

Yes, I would probably go with shapely for now, assuming that shapely itself will get the pygeos boost at some point, so we don't have to deal with it here.

Re GeoAlchemy, add it to pip install to non conda-forge envs.

@HTenkanen commented Jan 3, 2020

@martinfleis Okay, great 👍 I will remove the pygeos parts from the codes and then make a PR from this, so we can continue from there.

Re GeoAlchemy, add it to pip install to non conda-forge envs.

Ah true, I'll do that.
