How to install Python environment? #12

Closed
Robinlovelace opened this issue Sep 30, 2021 · 11 comments
Labels: help wanted (Extra attention is needed), question (Further information is requested)

Comments

@Robinlovelace
Contributor

Not through pip...

Context: https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_python.sh

@Robinlovelace added the help wanted and question labels Sep 30, 2021
@Robinlovelace
Contributor Author

The current set-up is a bit of an MVP, we can do better than this! https://github.com/geocompr/docker/blob/master/python/Dockerfile

@martinfleis

I'd say it depends on what your plans are for the Python flavour of the container.

Since you are installing just geopandas within an Ubuntu environment, it may be fine to use pip as you do right now. You just need to test it after the build to make sure that GDAL, GEOS and PROJ are properly linked.

If you plan to expand the environment, then I would strongly recommend adopting conda to manage the environment and installing Python packages from the conda-forge channel. Installing multiple independent packages that depend on GDAL, like fiona (used by geopandas) and rasterio, often results in dependency conflicts.

@anitagraser

Here's my most recent attempt: https://github.com/anitagraser/EDA-protocol-movement-data/blob/main/docker/Dockerfile

Use at own risk ;-)

@cboettig

@Robinlovelace awesome stuff here! Would love feedback on ways we can adjust the python install on the rocker images; we've been playing with that a bunch recently, mostly on the ML stack where support for GPU capabilities is the main concern. But it would be nice to get a more general strategy nailed down.

From my experience so far, it's really quite challenging to establish flexible defaults for the python setup that cover all use cases. Like @martinfleis mentions, the flavor of python really depends on the use case -- some python libs need python < 3.7 (looking at you, all that stuff that still depends on tensorflow 1.x), some need the latest python, some want to use conda versions, some need packages not on conda. The dominant model seems to be leaving this up to the users, with most installers installing python, virtualenvs (including pipenv, pyenv, poetry) or conda envs somewhere in user space (i.e. under ~/) rather than following the old linux model of a multi-user install below ~/ with shared system libraries.

The idea of https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_python.sh was to provide a working python environment out of the box visible in either single-user or multi-user deploys of the rocker container, while still making it easy enough for the user to switch into different virtualenvs (i.e. not force RETICULATE_PYTHON), e.g. as outlined here: https://github.com/rocker-org/ml#python-versions-and-virtualenvs. We set WORKON_HOME=/opt/venv so that reticulate would find the default environment (even when user is not the default user), but that may be a poor choice; I think we should probably unset that, create the default environment at the user home level and only for the default user. Anyway, thanks for letting me crash your thread here and feedback/discussion welcome!
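To see which interpreter and environment a session actually picked up, a small standard-library check along these lines might help. This is only a sketch: the function name describe_python_environment is made up for illustration, and the two environment variables are simply the ones mentioned in this comment.

```python
import os
import sys


def describe_python_environment():
    """Return a small report on which interpreter and environment are active."""
    return {
        # Path of the interpreter actually running this code
        "executable": sys.executable,
        # sys.prefix differs from sys.base_prefix inside a venv
        # (conda environments are not detected this way)
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "prefix": sys.prefix,
        # Environment variables discussed in the thread; None when unset
        "WORKON_HOME": os.environ.get("WORKON_HOME"),
        "RETICULATE_PYTHON": os.environ.get("RETICULATE_PYTHON"),
    }


for key, value in describe_python_environment().items():
    print(f"{key}: {value}")
```

Running it once from a terminal and once through reticulate should show whether both routes end up in the same environment.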

@Robinlovelace
Contributor Author

Many thanks everyone for the insight, great to be learning about the Python ecosystem and seeing how different open source communities can learn from and support each other (guessing that will be a theme in @anitagraser's upcoming #FOSS4G talk ; ). My guess is that Docker provides some of the benefits of virtual environments (self-containerisation, known dependencies, ease of duplication) so that using pip here isn't as bad as using pip on a work desktop. Getting it to work quickly, with different upstream versions of OSGeo packages like PROJ and GDAL, and with R, is a priority and my guess is that that would lend itself to pip install. In terms of multi-user sessions, I don't think that's something we're looking to support with this project (the target users are researchers looking for a reproducible environment and a place to test behaviour against different versions of OSGeo and R/Python libraries/packages; but see #11, something we can learn lots from Rocker on). In the (I guess unusual) cases where people want to switch to a different environment, could they not just spin up a different pre-built Docker container?

My question on pip install is: what are the disadvantages of using it in a system like this? Linking to the same (in some cases upstream but always known versions) of OSGeo libraries as the R packages seems like a big advantage for this use case.

In any case I think the most important thing here is whether it works. I will have a play, try installing more awesome Python packages like Momepy / OSMnx, and report back - I may ping you Martin and Anita in the future if that's OK when I inevitably hit a bottleneck, but feeling more confident after seeing general support, and very useful specific comments, from the community!

@Robinlovelace
Contributor Author

it may be fine using pip as you do right now. You just need to test it after the build to make sure that GDAL, GEOS and PROJ are properly linked.

That sounds like great advice, I will seek to add some post build tests in there to check they've linked-up correctly.

@martinfleis

I will seek to add some post build tests in there to check they've linked-up correctly.

Normally, these issues show up on import, so import geopandas should be enough.
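A post-build import check could be sketched roughly as follows. The helper name check_imports and the package list are assumptions made for illustration, based on the packages named in this thread.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to None or an error message."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # e.g. ImportError, or OSError from a missing shared library
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Package list is an assumption; adjust it to whatever the image installs
for name, error in check_imports(["geopandas", "fiona", "shapely", "pyproj"]).items():
    print(f"{name}: {'ok' if error is None else error}")
```

In a Dockerfile, a step like this could exit with a non-zero status when any entry reports an error, so that a broken image fails at build time.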

@Robinlovelace
Contributor Author

Minor update: I just double-checked whether the image contains a working geopandas installation linked to the system version of the OSGeo libs. It seems so from the reprex below.

python3
Python 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import geopandas as gpd
>>> import movingpandas as mpd

@Robinlovelace
Contributor Author

Robinlovelace commented Sep 30, 2021

Another update: it works from the RStudio web-based GUI inside the browser. I still find this amazing!

image

@Robinlovelace
Contributor Author

More tests...

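library(sf)
# Path to the North Carolina shapefile shipped with sf
f = system.file("shape/nc.shp", package = "sf")
nc_sf = read_sf(f)
library(reticulate)
# Install descartes, which older geopandas versions need to plot polygons
system("pip3 install descartes")
# Default import: Python results are converted to R objects where possible
gp = import("geopandas")
nc_gp = gp$read_file(f)
class(nc_gp)
plot(nc_gp$AREA, nc_gp$PERIMETER)
# convert = FALSE keeps results as Python objects, so Python methods can be called
gp = import("geopandas", convert = FALSE)
nc_gp = gp$read_file(f)
nc_gp
plt = import("matplotlib.pyplot", convert = FALSE)
# Plot with geopandas/matplotlib and save the figure to disk
nc_gp$plot()
plt$savefig("test.png")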

Results in this:

image

So it basically works. I think RStudio has got better at showing plots in recent versions, but we're not using a recent RStudio version.

@jorisvandenbossche

jorisvandenbossche commented Oct 1, 2021

I will seek to add some post build tests in there to check they've linked-up correctly.

Normally, these issues show up on import, so import geopandas should be enough.

Note that this won't catch all issues (e.g. it won't catch pyproj not finding PROJ or finding the wrong PROJ, and it won't necessarily raise an import error for fiona, since geopandas no longer requires it to be present).
The best test is probably to read a small file.
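One way such a read test might look is sketched below. The sample feature, its coordinates and the function name read_small_file are invented for illustration; the reprojection step is an extra assumption, added because it exercises pyproj as well as the file reader.

```python
import json
import os
import tempfile

# One point feature in WGS 84; the coordinates and attribute are made up.
SAMPLE = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "test"},
            "geometry": {"type": "Point", "coordinates": [-1.55, 53.8]},
        }
    ],
}


def read_small_file():
    """Write SAMPLE to disk, read it back with geopandas and report the outcome."""
    try:
        import geopandas as gpd
    except Exception as exc:
        return {"status": "import_error", "detail": f"{type(exc).__name__}: {exc}"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.geojson")
        with open(path, "w") as fh:
            json.dump(SAMPLE, fh)
        try:
            gdf = gpd.read_file(path)  # exercises the GDAL-backed I/O layer
            projected = gdf.to_crs(epsg=3857)  # exercises pyproj and its PROJ data
        except Exception as exc:
            return {"status": "failed", "detail": f"{type(exc).__name__}: {exc}"}
    return {"status": "ok", "rows": len(projected), "crs": str(projected.crs)}


print(read_small_file())
```

Anything other than an "ok" status would suggest that the reader or the PROJ data was not found, which a bare import would not necessarily reveal.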

My question on pip install is: what are the disadvantages of using it in a system like this? Linking to the same (in some cases upstream but always known versions) of OSGeo libraries as the R packages seems like a big advantage for this use case.

It's certainly true that in a controlled environment such as a Docker container, it should be easier to use pip than it is to support pip on users' work desktop machines.
But if you want it to link to the same OSGeo libraries that are already installed (assuming this is about GDAL, GEOS, PROJ etc. versions), you will need to install the Python packages from source (especially fiona, rasterio, shapely and pyproj), so they are built against the system-available libraries. A default pip install might install the pre-built binary wheels, which include their own copies of those C libraries (whether this happens depends on the base Docker image: for Alpine Linux there are no wheels at the moment, but for Debian/Ubuntu-based images it will use wheels). I think you can force source installs by adding the --no-binary :all: flag to pip install.
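As a rough sketch of how that flag could be applied from a build script, the helper below only assembles the pip command and does not run it. The function name source_install_command is hypothetical. Note that --no-binary also accepts a comma-separated list of package names instead of :all:.

```python
import sys


def source_install_command(packages, no_binary=":all:"):
    """Build (but do not run) a pip command that skips pre-built wheels."""
    return [sys.executable, "-m", "pip", "install", "--no-binary", no_binary, *packages]


# Restrict source builds to the packages that wrap the OSGeo C libraries
geo_packages = ["fiona", "rasterio", "shapely", "pyproj"]
cmd = source_install_command(geo_packages, no_binary=",".join(geo_packages))
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Naming only the geospatial packages, rather than using :all:, avoids also compiling unrelated dependencies such as numpy from source, which can make image builds much slower.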

And when using wheels, you can get conflicts between system-installed libraries and wheel-included libraries. So that is still a potential disadvantage of using pip (although in general on Linux I think this should be fine: the wheel-included library is preferred, without causing conflicts. But this is, for example, a typical installation problem on Homebrew when mixing GDAL/GEOS from Homebrew with a pip-installed fiona and shapely).
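To check whether the Python packages ended up on the system libraries or on wheel-bundled copies, one option is to compare the versions each side reports. The sketch below assumes the attribute names listed in PYTHON_SIDE (which may differ in older releases, hence the fallback) and assumes that the gdal-config, geos-config and pkg-config tools are available in the image.

```python
import importlib
import shutil
import subprocess

# (module, attribute) pairs reporting the C library version each package uses.
# The attribute names are assumptions about recent releases.
PYTHON_SIDE = {
    "GDAL (fiona)": ("fiona", "__gdal_version__"),
    "GDAL (rasterio)": ("rasterio", "__gdal_version__"),
    "GEOS (shapely)": ("shapely", "geos_version_string"),
    "PROJ (pyproj)": ("pyproj", "proj_version_str"),
}

# Command-line tools that report the system-installed versions, if present.
SYSTEM_SIDE = {
    "GDAL": ["gdal-config", "--version"],
    "GEOS": ["geos-config", "--version"],
    "PROJ": ["pkg-config", "--modversion", "proj"],
}


def python_side_versions():
    """Ask each Python package which C library version it is using."""
    versions = {}
    for label, (module_name, attribute) in PYTHON_SIDE.items():
        try:
            module = importlib.import_module(module_name)
        except Exception:
            versions[label] = "not importable"
            continue
        versions[label] = str(getattr(module, attribute, "unknown"))
    return versions


def system_side_versions():
    """Ask the system tools which library versions are installed."""
    versions = {}
    for label, command in SYSTEM_SIDE.items():
        if shutil.which(command[0]) is None:
            versions[label] = "tool not found"
            continue
        done = subprocess.run(command, capture_output=True, text=True)
        versions[label] = done.stdout.strip() if done.returncode == 0 else "unknown"
    return versions


print(python_side_versions())
print(system_side_versions())
```

If the two sides report different versions for the same library, the package is probably using a copy bundled in its wheel rather than the system one.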

5 participants