[Work in Progress DO NOT MERGE] Port math library from clojure to python #1893
Conversation
Now to figure out what is stored where :)
This would be amazing!
We serialize to JSON and, optionally, to temp file. We probably will need to expand to take kw-args into account.
And some unit tests for fun, and even coverage!
The makefile will avoid having to remember the test syntax, making the tests frictionless to run. The partial runs are a way to keep focusing on just the output of whichever part we want -- although they don't save us from the biggest drag: Clojure launch time.
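For concreteness, here is a minimal Python-side sketch of what dumping a call's arguments to JSON (optionally to a temp file) could look like. The function names, the `{"args": ..., "kwargs": ...}` layout, and the temp-file fallback are illustrative assumptions, not the actual format used in this branch:

```python
import json
import tempfile


def dump_call(args, kwargs, path=None):
    """Serialize a call's positional and keyword arguments to JSON.

    If no path is given, write to a temp file and return its name so a
    later replay script can pick it up. (Names and layout are assumptions.)
    """
    payload = {"args": list(args), "kwargs": kwargs}
    if path is None:
        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
            json.dump(payload, f)
            return f.name
    with open(path, "w") as f:
        json.dump(payload, f)
    return path


def load_call(path):
    """Read back the serialized call; kwargs keys come back as strings."""
    with open(path) as f:
        payload = json.load(f)
    return payload["args"], payload["kwargs"]
```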
We now have serialization of arguments in Clojure, with tests (in Docker), and serialization + deserialization of arguments in Python, with tests (locally, not in Docker). Next: wrapping a Clojure function with a serializer of input and output, plus a Python script to replay that call against the Python function and check the output matches.
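A rough sketch of what the Python replay side could look like, assuming the Clojure wrapper records each call as a JSON blob with `"args"`, `"kwargs"` and `"output"` keys (those key names are illustrative, not the real schema):

```python
import json


def replay_call(recorded_path, python_fn):
    """Replay a call recorded by a Clojure-side wrapper against the Python port.

    Assumes a JSON record with "args", "kwargs" and "output" keys; these names
    are placeholders, not the actual format used in this PR.
    """
    with open(recorded_path) as f:
        record = json.load(f)

    result = python_fn(*record["args"], **record["kwargs"])

    # Round-trip the Python result through JSON so both sides are compared
    # in the same representation (lists, floats, strings).
    assert json.loads(json.dumps(result)) == record["output"], (
        "Python output does not match the recorded Clojure output"
    )
    return result
```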
I am adding notes on the PCA. Let's start with the PCA, as it's well known and nicely isolated. First:
Check its Clojure calling graph:

```mermaid
graph TD
wrapped_pca[wrapped-pca] --> powerit_pca[powerit-pca]
powerit_pca --> power_iteration[power-iteration]
powerit_pca --> rand_starting_vec[rand-starting-vec]
powerit_pca --> factor_matrix[factor-matrix]
power_iteration --> xtxr[xtxr]
power_iteration --> repeatv[repeatv]
factor_matrix --> proj_vec[proj-vec]
xtxr --> repeatv
pca_project[pca-project]
sparsity_aware_project_ptpts[sparsity-aware-project-ptpts] --> sparsity_aware_project_ptpt[sparsity-aware-project-ptpt]
pca_project_cmnts[pca-project-cmnts] --> sparsity_aware_project_ptpts
```
- Add cloverage.
- Explicitly deactivate conv-man tests, which require a full database, as warned in the comments.
- Fix a bug in the new test runner when parsing command line arguments.
I am trying to get full _branch_ coverage of the function `power-iteration`, but failing to. There must be some corner cases I am not surfacing (and neither is Cursor). I will leave it aside for now, note that it needs fixing, and move on to covering the entirely missing lines in the `wrapped-pca` function.
In spite of lots of PCA tests, I do not get full branch coverage in the key function. I do wonder whether we really need full branch coverage, given that the PCA method, while currently an elegantly hand-coded power iteration, might eventually be handed off to scikit-learn or LAPACK. But before that, we will need to check whether the iterative nature of power-iteration is exploited for incremental updates of the conversation, as I suspect it is. So for now, we will stick with power-iteration.
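For readers less familiar with the method: this is not the actual Clojure implementation, just a plain numpy sketch of the textbook power-iteration-plus-deflation scheme that the function names (`power-iteration`, `powerit-pca`) suggest. Signatures, defaults, and convergence details are assumptions; the real code has additional sparsity and incremental-update handling.

```python
import numpy as np


def power_iteration(X, n_iters=100, tol=1e-9, rng=None):
    """Leading right singular vector of a centered matrix X via power iteration."""
    rng = np.random.default_rng(rng)
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = X.T @ (X @ v)           # apply X^T X without forming it explicitly
        w_norm = np.linalg.norm(w)
        if w_norm == 0.0:
            return v                # degenerate case: X v is in the null space
        w /= w_norm
        if np.linalg.norm(w - v) < tol:
            return w                # converged
        v = w
    return v


def powerit_pca(X, n_components=2, **kwargs):
    """Top principal directions via repeated power iteration and deflation."""
    X = X - X.mean(axis=0)          # center columns
    components = []
    for _ in range(n_components):
        v = power_iteration(X, **kwargs)
        components.append(v)
        X = X - np.outer(X @ v, v)  # deflate: remove the direction just found
    return np.array(components)
```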
See compdemocracy#1894 and `README.pythonport.md` in this commit for explanations of why the code is unreachable.
Just want to give kudos to @metasoarous for his 2014 notes on his hunt for performance back then, living next to the code in https://github.com/compdemocracy/polis/blame/52652e443c52bb554536ef029ce4c951826039c3/math/perf. It's super helpful to see what scenarios you envisaged, what things you were paying attention to, etc. Very useful context, thanks Chris!
Regarding the purported bug with "including participants that should not be included", this may have actually been intended. If someone joins the conversation before there are 7 comments, we include them in the results as soon as they vote on all available comments. And once someone is in the conversation, I think it would be a bad idea to remove them. Now, we can certainly debate the extent to which this behavior is worthwhile; the rationale at the time was that it would help smooth out the experience on new conversations. That is a pretty marginal benefit for most real-world usage these days, though, and could be looked at as an artifact of early-days startup priorities. I don't know that the new implementation necessarily needs to cargo-cult this decision if it has outgrown its utility, but I did want to flag the original rationale. If there are other issues though, please do raise a dedicated issue with additional context! One final thought: it sounds like this may not be relevant if the goal of the bindings is just to test/compare behavior until the swap is made. But in case you find it useful, there's a fantastic project called libpython-clj which offers zero-copy bindings between Clojure and Python data structures.
Super helpful context, thanks @metasoarous!
Note: the timing comparisons at the moment are very unfair to Clojure, because in my dev setup Clojure runs in Docker whereas (for now) Python runs directly on the host. Thus Python has access to more cores if numpy is properly multicore-optimized, and its BLAS calls are compiled for the actual hardware (Apple Silicon), which is probably not the case for a Linux Docker image running on my Mac. So, a fair comparison will only be possible once everything runs in Docker, and ideally on a machine whose silicon matches the Docker image.
- Improve result comparison with more detailed type checking
- Add timing information for Python vs Clojure function execution
- Introduce performance summary with speedup calculations
- Add more informative discrepancy reporting
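For the timing and speedup bullets above, here is a rough sketch of the kind of comparison harness involved. The function names are assumptions, and the Clojure timing is taken as a number measured separately (inside Docker), so, as noted, the comparison is not apples-to-apples yet:

```python
import time


def time_fn(fn, *args, repeats=5, **kwargs):
    """Best-of-N wall-clock timing; crude, but enough for a sanity check."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


def report_speedup(name, clojure_seconds, python_fn, *args, **kwargs):
    """Compare a separately recorded Clojure timing against the Python port."""
    result, py_seconds = time_fn(python_fn, *args, **kwargs)
    print(f"{name}: clojure={clojure_seconds:.3f}s  python={py_seconds:.3f}s  "
          f"speedup={clojure_seconds / py_seconds:.1f}x")
    return result
```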
After discussing with @colinmegill yesterday, I'm updating our plan. Instead of going purely from the leaves of the call-tree up the call stack, which only allows one big shipping at the end, I will:
This:
TEST DOES NOT RUN YET (type error). That's normal, still working on this.
1. Porting repness to Python by loading group info from a JSON blob computed by Clojure.
2. Testing that the Python representative comments are the same as the Clojure ones.
The computation in Python runs, and the test correctly compares it to the values computed by Clojure and stored in the DB. However, the results are not yet equal. Still working on this.
Previous implementations were completely missing the moderation. Now added. Let's see if it runs...
Remove some dead code left lying around.
Update: after adding filtering of the matrix based on moderation, I'm getting closer. On BG2018, out of 5 representative comments selected by Clojure for each group, the Python port gets 4 right and misses one. I'll drill down on why that is. Probably still getting one count wrong in the p-test, i.e. in how I account for modded-out comments.
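As a sketch of the kind of moderation filtering involved (the matrix layout, names, and the assumption that columns map one-to-one to comment ids are illustrative, not the actual code in this branch):

```python
import numpy as np


def filter_moderated(vote_matrix, comment_ids, moderated_out):
    """Drop modded-out comments (columns) from the vote matrix before repness.

    `vote_matrix` is participants x comments; `moderated_out` is a set of
    comment ids excluded by moderation. Keeping the column count in sync with
    the ids is exactly the kind of off-by-one that can skew the p-test counts.
    """
    vote_matrix = np.asarray(vote_matrix)
    keep = [i for i, cid in enumerate(comment_ids) if cid not in moderated_out]
    return vote_matrix[:, keep], [comment_ids[i] for i in keep]
```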
Update: due to the time-sensitivity of exporting repness via a Python microservice, I've opened #1954. It is not a port of the Clojure code: it uses the analysis notebooks, as asked by @colinmegill.
Sorry to sidetrack here, I didn't find another issue related to the RL discussion. I found a paper that mentions methods to measure the creativity of participants on a given task with/without an AI assistant, using P-creativity (psychological creativity) and H-creativity (historical creativity). I thought that could serve as a reward for the RL system. Heyyy, just saw this RL paper: https://www.nature.com/articles/s41562-023-01686-7
Tidying up by closing this pull request, as this approach does not provide the rapid shipping and iteration that CompDem prefers. I have, however, transferred ownership of the repo with this branch to @colinmegill, in case it can ever be helpful in the future.


Context
We (the Polis core team and advisors) have been discussing for the last few years whether Clojure is still the optimal language for the math library, given the evolution of the landscape and of Polis's needs.
This came up again when @metasoarous raised potential performance issues in #1579 (comment).
Porting any codebase is a Very Big Endeavour and super risky; let alone, in this case, 7000+ lines of scientific Clojure written by a very smart developer (hats off @metasoarous!), with tons of embedded real-world safeguards and ten years of battle-testing. Most of all, we need to keep all the domain knowledge that is embedded in the current codebase. This is not a rewrite from scratch, but a port!
So, crazy, but there's a lot to gain (massive ML ecosystem: people, libraries, etc.), so we would be remiss not to at least explore how far we can go. The worst that can happen is that this completely fails, we lose time, and I've got egg on my face. I'll mitigate the former by still working on the new LLM features with @colinmegill, and as for the latter, well, I can live with that :)
So let's go!
Plan
I'll be focusing first on the core functionality: the math. Once that is clear and done, I will work on the poller and runner. I'll be using `numpy` for all vector operations.
Preparation
Core iteration
Then, iterating this way:
Expectations
It'll be a real slog at first for steps 0 and 1, getting familiar with running the various functions one by one in Clojure when needed (i.e. without the realtime poller etc., which I'll keep for the end), but the pace should then increase.
Performance should also mechanically improve, as per #1579, #1062 and #1580.
The math part will be fun -- although we might start to see some small numerical differences appearing as we go; hopefully we can keep them small.
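When those small numerical differences do show up, a tolerance-based check along these lines could stand in for bit-for-bit equality. The tolerances below are placeholders, not tuned values, and the function name is an assumption:

```python
import numpy as np


def assert_close(python_result, clojure_result, rtol=1e-6, atol=1e-9):
    """Flag drift between the two implementations without demanding bit-equality."""
    a = np.asarray(python_result, dtype=float)
    b = np.asarray(clojure_result, dtype=float)
    if not np.allclose(a, b, rtol=rtol, atol=atol):
        diff = np.abs(a - b)
        worst = np.unravel_index(diff.argmax(), diff.shape)
        raise AssertionError(f"max abs diff {diff.max():.3e} at index {worst}")
```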