Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python to Codespaces & Test #19

Merged
merged 7 commits into from Oct 26, 2022
Merged

Add Python to Codespaces & Test #19

merged 7 commits into from Oct 26, 2022

Conversation

metaskills
Copy link
Member

Testing to make sure Python works as expected with libc. Seems it does not, this will fail on test/libcrypteia/override-python.sh for now and this PR will sort out why. cc @bekiya

@metaskills
Copy link
Member Author

This Stack Overflow suggests that Python needs some coercion. I cant believe their underlying C does not do this already. I'll have to think a bit because the goal of this project is to ensure that code using the environment does not have to change. https://stackoverflow.com/questions/235435/environment-variables-in-python-on-linux

@metaskills
Copy link
Member Author

Reached out here because seems Python is not using getenv but maybe it is using environ if I can get the hook coded up right to test. https://github.com/customink/crypteia

@mmanciop
Copy link

mmanciop commented Sep 2, 2022

Maybe it uses secure_getenv instead? Manipulating the extern environ object on __start could do the trick but it sounds a bit ugly. Run readelf -Ws on the python binary and let’s see what it says :-)

@hmadison
Copy link

hmadison commented Sep 2, 2022

For anyone else looking at this, its due to the fact that CPython uses extern char **environ; to create a posix.enviorn dictionary.

Since this is being shipped as a layer, I think you have a copule of routes:

  • Make a new envp and override __libc_start_main to set it[1]. This way, all of the downstream function calls should be able to pick up on the changes, and anything that's using the global will continue to mostly work. This method (should) also work for musl in addition to glibc.
  • Mutate posix.environ() (or os.environ()) from a site-packages to be a subclass of dict which calls out to getenv() via ctypes. This way, just Python is affected and anything else that requires __libc_start_main to function as expected can continue to work.

Footnotes:

  1. In glibc __libc_start_main is a bit of assembly. If compatibility with muslc isn't desired __libc_init_first should also work as a wrap point.

@mmanciop
Copy link

mmanciop commented Sep 3, 2022

I have it on good authority (a private project I am working on, which is commercial so I cannot share), that hooking into getenv works for all CPythons I could find on GitHub, both on debian derivatives and on alpine. One can for example intercept the lookup of PYTHONPATH this way.

@hmadison
Copy link

hmadison commented Sep 3, 2022

I took another look at the python bevahior and I think I've nailed down the discrepancy. Any of the PYTHON* environment variables call Py_GETENV which wraps genenv() directly.

This means while python does link to the getenv() function, the only usage of it is getting configuration variables at the interpreter level. Anything arbitrary which was set by the user is coming in via **environ.

LTrace getenv: Python vs Ruby
vagrant@ubuntu-jammy:~$ ltrace -e getenv ruby -e "puts ENV['EXAMPLE']"
libruby-3.0.so.3.0->getenv("RUBY_THREAD_VM_STACK_SIZE")                                                                       = nil
libruby-3.0.so.3.0->getenv("RUBY_THREAD_MACHINE_STACK_SIZE")                                                                  = nil
libruby-3.0.so.3.0->getenv("RUBY_FIBER_VM_STACK_SIZE")                                                                        = nil
libruby-3.0.so.3.0->getenv("RUBY_FIBER_MACHINE_STACK_SIZE")                                                                   = nil
libruby-3.0.so.3.0->getenv("RUBY_SHARED_FIBER_POOL_FREE_STAC"...)                                                             = nil
libruby-3.0.so.3.0->getenv("RUBYOPT")                                                                                         = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_FREE_SLOTS")                                                                         = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_INIT_SLOTS")                                                                         = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_GROWTH_FACTOR")                                                                      = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_GROWTH_MAX_SLOTS")                                                                   = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_FREE_SLOTS_MIN_RATI"...)                                                             = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_FREE_SLOTS_MAX_RATI"...)                                                             = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_FREE_SLOTS_GOAL_RAT"...)                                                             = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_HEAP_OLDOBJECT_LIMIT_FAC"...)                                                             = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_MALLOC_LIMIT")                                                                            = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_MALLOC_LIMIT_MAX")                                                                        = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_MALLOC_LIMIT_GROWTH_FACT"...)                                                             = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_OLDMALLOC_LIMIT")                                                                         = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_OLDMALLOC_LIMIT_MAX")                                                                     = nil
libruby-3.0.so.3.0->getenv("RUBY_GC_OLDMALLOC_LIMIT_GROWTH_F"...)                                                             = nil
libruby-3.0.so.3.0->getenv("RUBYLIB")                                                                                         = nil
libruby-3.0.so.3.0->getenv("DEBIAN_RUBY_STANDALONE")                                                                          = nil
libruby-3.0.so.3.0->getenv("DEBIAN_DISABLE_RUBYGEMS_INTEGRAT"...)                                                             = nil
libruby-3.0.so.3.0->getenv("GEM_SKIP")                                                                                        = nil
libruby-3.0.so.3.0->getenv("GEM_REQUIREMENT_DID_YOU_MEAN")                                                                    = nil
libruby-3.0.so.3.0->getenv("GEM_HOME")                                                                                        = nil
libruby-3.0.so.3.0->getenv("GEM_PATH")                                                                                        = nil
libruby-3.0.so.3.0->getenv("HOME")                                                                                            = "/home/vagrant"
libruby-3.0.so.3.0->getenv("XDG_DATA_HOME")                                                                                   = nil
libruby-3.0.so.3.0->getenv("GEM_VENDOR")                                                                                      = nil
libruby-3.0.so.3.0->getenv("GEM_VENDOR")                                                                                      = nil
libruby-3.0.so.3.0->getenv("GEM_SPEC_CACHE")                                                                                  = nil
libruby-3.0.so.3.0->getenv("EXAMPLE")                                                                                         = nil

+++ exited (status 0) +++


vagrant@ubuntu-jammy:~$ ltrace -e getenv python3 -c "import os; print(os.environ.get('EXAMPLE', ''))"

+++ exited (status 0) +++

@mmanciop
Copy link

mmanciop commented Sep 3, 2022

from a site-packages to be a subclass of dict which calls out to getenv() via ctypes

I find that a usercustomize.py + prepending the location of your python library and patching PYTHONPATH with the getenv redirect work reliably for a similar-enough usecase. You could get wrapt in the mix and patch os.environ that way.

@metaskills
Copy link
Member Author

You could get wrapt in the mix and patch os.environ that way.

Does wrapt require a package install be present or could it be vendor'ed into a solution that gets installed next to Crypteia's shared object? The goal is for any code in the container to not take on any burden to change outside adding Crypteia and LD_PRELOAD... or in Python's case... maybe a little more. So curious to see how this would work. Python is the lowest on my programming skill set chain.

@mmanciop
Copy link

mmanciop commented Sep 13, 2022

The wrapt package is in itself independent from LD_PRELOAD. BUT, in the scope of your goal, I think it needs LD_PRELOAD to change the PYTHON_PATH env var, then you can add packages from another folder at your heart's content, including of course wrapt. After all, you need to have a way of adding wrapt as a dependency irrespective of where the particular container keeps its dependencies (I think the right terminology is the "Python site" but I am not 100% sure). If the only thing your planned to support is Lambda + AWS-curated Python runtimes, then we would know where to put wrapt, because /opt/python is the site in those runtimes (if memory serves) :-)

NOTES.md Outdated Show resolved Hide resolved
@metaskills
Copy link
Member Author

metaskills commented Sep 24, 2022

Did some work with @mpeteuil yesterday around the wip-python branch and it is looking really really good. One thing that stumped us was why ctypes was not working as we expected. So I did two things.

First, I added node and python to the base devcontainer. Seems the latest rust devcontainer lost node and was causing some errors on main where I wanted to figure this out. NOTE: This will cause a slight conflict with this branch but easy to resolve. Second, I did a little experiment in the ctypes branch. See linked commit and screenshot below.

So here is a summary of what we co-learned. First, ctypes to me is python's FFI interface. So neat. It did not occur to me that we would be bypassing a core feature of Crypteia's interface which is nix process based. Every process is isolated and if that process needs x-crypteia env vars via getenv it will kick off an isolated SSM request, create a tmp file, load the secrets into memory, and finally deletes the tmp file. Levering a FFI interface to libc within a process does not stack the LD_PRELOAD and hence directly calls getenv and as such does not kick things into motion. Michael had this really neat idea to reach directly to libcrypteia and that works really well. You can see here how I keep it DRY by using the os environ for our LD_PRELOAD.

import ctypes, os
crypteia = os.environ.get('LD_PRELOAD')
getenv = ctypes.cdll.LoadLibrary(crypteia).getenv
getenv.restype = ctypes.c_char_p
getenv(b'SECRET'))

The screenshot and branch further illustrates how our secure tmp file works. We should not need to solve for this since the main python process which is using ctypes would ( I'm guessing in theory ) have the loaded secrets into memory via the python package loaded up front via PYTHONPATH. So seems we have a path ( no pun ) forward now.

Screen Shot 2022-09-24 at 10 13 44 AM

@mmanciop
Copy link

Directly calling into the LD_PRELOAD shared object is a really need idea :-)

@metaskills
Copy link
Member Author

So I got #24 done along with a followup commit (d2927df) to fix how the Docker in Docker tests for ad-hoc tests. There is only one, Amazon Linux. Python will use that technique for Python 2.7. You may not in the commit we are simply doing a setup (build) and high level libcrypteia tests. So the idea is that the DnD test for Python will set the TEST_LANG and invoke Python just as Amazon Linux focuses on Node. Reminder, the DnD test for Python 2.7 will be much simpler than the Amazon Linux since it does not need to build two containers (build & runtime) but just a single one with Python 2.7 in it. We now have language shields too.

@metaskills
Copy link
Member Author

I may rebase this branch to main and get things in order by cherry-picking a commit from the wip-python branch if folks are finding this confusing with so much Docker and cross language testing going on.

@mpeteuil
Copy link
Contributor

mpeteuil commented Oct 3, 2022

Thanks @metaskills for pairing with me a couple times to clarify some things and unblock me when I got hung up on exactly how Crypteia worked. I still have a few open questions and todos.

Todos

  • Clarify in the README that in non-lambda containers you'd need to set the entrypoint to include the Crypteia binary before your application. This is tangentially related to this PR, but not the main concern of it. I could see that happening separately.
  • Use _Environ and the os.environ implementation more generally as a reference for what should be included. May include:
  • Verify Python 2.7+ compatibility
  • Update the amzn and debian build images

Questions

  • How do we get the crypteia python package to be seen as a distribution by pkg_resources.get_distribution("crypteia") in usercustomize.py? Right now that throws DistributionNotFound. This is likely due to only copying ./build/crypteia/ to /opt/crypteia/python/crypteia/, which doesn't include things like egg-info or dist-info. I don't have tons of experience here so will need to read up somewhere like the Setuptools docs.
  • What's the best way to include wrapt for our crypteia package?
    • Do we copy it to /opt/crypteia/python/wrapt? Wouldn't it overwrite anything else using wrapt then since it'd be last in the path? What about using /opt/crypteia/python/crypteia/wrapt? That way it's importable by our crypteia package in /opt/crypteia/python/crypteia but not by other user code (with some caveats)
    • How do we keep it python independent? (the package we're building is specific to 3.9 in some ways when we build it from this development container, and the wrapt c-extension that gets built is also for 3.9)

@mpeteuil
Copy link
Contributor

There are two approaches to including wrapt as a dependency here. We just need to choose one and go with it.

  1. Vendor wrapt inside the crypteia Python package and treat it like a module that's part of the crypteia package. This effectively means crypteia can use it but it's not importable by user code unless they did something like from crypteia import wrapt. In this scenario we shouldn't need to make the dependency conflicts check. You can see that basic approach in 6a41005.
  2. Don't vendor wrapt, but install it alongside the crypteia Python package. In this scenario it is importable by user code and therefore we need to make the dependency check. That's the approach in 95cdbfa.

@metaskills
Copy link
Member Author

I feel option 1 has been the general consensus since we did work in the loader.

@mpeteuil
Copy link
Contributor

mpeteuil commented Oct 25, 2022

Updated based on some offline conversation. If this looks good I'll:

  1. Merge Shrink extensions with scratch, add ltrace for dev, and format #28
  2. Rebase this against main
  3. Squash some commits
  4. Remove notes.md?
  5. Update the changelog
  6. Cut a new release

@mpeteuil mpeteuil force-pushed the Python branch 2 times, most recently from 535d890 to 9634087 Compare October 25, 2022 18:38
README.md Outdated Show resolved Hide resolved
py27/Dockerfile-test Show resolved Hide resolved
py27/Dockerfile-test Show resolved Hide resolved
In order to leverage the Crypteia hook into getenv via LD_PRELOAD, we
need to connect to the OS getenv system call with Python. Python does
not inherently do this via any standard means of getting environment
variables (`os.eviron['VARIABLE']`, `os.environ.get("VARIABLE")`,
`os.getenv("VARIABLE")`, `os.getenvb(b"VARIABLE")`). In order to
actually wire up to the system `getenv` call, we need to use the ctypes
library (or write a C extension, or the like).

Here we leverage ctypes to get a handle to getenv and then use wrapt to
patch os.envion at interpreter boot time via usercustomize.py. The
usercustomize.py file will be imported last in the python boot sequence,
which you can verify yourself by starting python with -v or using
PYTHONVERBOSE (i.e. python -v, PYTHONVERBOSE=1 python). This means that
any use of os.environ after interpreter boot time will be making calls
using the ctypes hook into the system `getenv` call.

We're explicitly trying to support Python 2.7+ at this
time, so the code needs to work in both 2.7 and 3+.

For more information see:
https://github.com/GrahamDumpleton/wrapt/blob/develop/blog/13-ordering-issues-when-monkey-patching-in-python.md
https://docs.python.org/3/library/site.html
https://github.com/GrahamDumpleton/wrapt/blob/develop/src/wrapt/wrappers.py

* Don't check crypteia python for dependency conflicts

The reason we needed to check for dependency conflicts was that if the
host application installed a package that the crypteia Python package
depended on, then we wanted to avoid those collisions. In the case where
we vendor dependencies under the crypteia Python package's umbrella
(essentially treating it like a module within the crypteia package)
there shouldn't be conflicts because those would not be importable by
users just running `import dep` style imports. The only dependency we
have at the moment is wrapt, so we can import it from within crypteia by
using `from crypteia.wrapt import x,y,z`. That allows us to use the
wrapt functionality while not worrying about host applications colliding
with it.
This validates that the crypteia binary can be built in a regular
debian-based docker image. It also runs the tests to verify that Python
2.7 does indeed hook into system getenv when wired up correctly in a
non-lambda context.
@metaskills
Copy link
Member Author

Gonna cut a 1.0 release soon... thank you SO MUCH everyone!

@metaskills metaskills deleted the Python branch October 26, 2022 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants