
The dictionary implementation is painfully slow #19

Closed
Pfiver opened this issue Sep 25, 2014 · 21 comments

Comments

@Pfiver

Pfiver commented Sep 25, 2014

The performance is really suboptimal. Looking at the code, it quickly becomes clear why: to find an object, the whole dictionary is searched in a loop, i.e.

for(var i=0; i<self.$keys.length; i++){ ... }

I expected to find some sort of hash map. Don't Python objects have hash functions or the like? I don't know, but I guess it would be possible to find something. :-) Anyhow, I converted the hash maps into arrays to solve the performance issues and am very happy with Brython; it is a great project. To whom it may concern: thank you for all the hard work and for making it available open source!
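
For later readers, here is a minimal standalone sketch (hypothetical names, not Brython's actual code) of why the parallel-array scan is O(n) per lookup, compared with a keyed lookup:

// Hypothetical illustration: a dict stored as two parallel arrays must
// scan the keys on every lookup (O(n))...
function scanLookup(keys, values, wanted) {
    for (var i = 0; i < keys.length; i++) {
        if (keys[i] === wanted) { return values[i]; }
    }
    return undefined;
}

// ...whereas a keyed store answers the same question with a single
// property access (amortized O(1)).
function hashedLookup(store, wanted) {
    return store[wanted];
}

var keys = [], values = [], store = {};
for (var i = 0; i < 10000; i++) {
    keys.push("k" + i);
    values.push(i);
    store["k" + i] = i;
}
console.log(scanLookup(keys, values, "k9999"));  // 9999, after 10000 comparisons
console.log(hashedLookup(store, "k9999"));       // 9999, one property access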

@ghost

ghost commented Sep 26, 2014

Yikes; considering JavaScript has objects which can be used like associative arrays, I'm wondering why it would be like this?

@Pfiver
Author

Pfiver commented Sep 26, 2014

This is documented in the source code. JavaScript object property keys are limited to strings, while Python dictionary keys can be arbitrary objects. But I feel the current big array could at least be subdivided into "buckets", using the output of the Python objects' str() method, or a hash or fixed-length prefix of it, as keys to address the buckets, which would be stored in JS objects. But I'd say let's look at the Python source a bit more and see how dictionaries are actually implemented there. There is no hash_code() method or the like, as there is for Java objects, afaik. Anyway, it's all documented pretty well (http://en.wikipedia.org/wiki/Hash_table) but would take some work... :-)
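
A minimal sketch of that "buckets" idea (hypothetical; hashKey() stands in for whatever string is derived from str(), a prefix, or a real hash — this is not how Brython ended up doing it):

// Hypothetical bucket sketch: hashKey() selects a bucket inside a plain JS
// object; only that bucket is scanned, never the whole dictionary.
function hashKey(pyKey) {
    // Assumption for the sketch: keys already stringify usefully.
    return String(pyKey);
}

function bucketSet(dict, pyKey, value) {
    var h = hashKey(pyKey);
    var bucket = dict[h];
    if (bucket === undefined) { bucket = dict[h] = []; }
    for (var i = 0; i < bucket.length; i++) {
        if (bucket[i][0] === pyKey) { bucket[i][1] = value; return; }
    }
    bucket.push([pyKey, value]);   // collisions stay inside one small bucket
}

function bucketGet(dict, pyKey) {
    var bucket = dict[hashKey(pyKey)];
    if (bucket === undefined) { return undefined; }
    for (var i = 0; i < bucket.length; i++) {
        if (bucket[i][0] === pyKey) { return bucket[i][1]; }
    }
    return undefined;
}

var d = {};
bucketSet(d, "spam", 1);
bucketSet(d, "eggs", 2);
console.log(bucketGet(d, "spam"));  // 1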

....

Looks like the hash() builtin is used in CPython: https://hg.python.org/cpython/file/22a46f05ce23/Objects/dictobject.c. And as Jason Sundram explains (http://stackoverflow.com/a/2070320), it is implemented differently (and natively, in C) for different object types. Assuming that finding a good hash function is at the heart of this problem, this might be the reason why it has not been solved yet.

@ghost

ghost commented Sep 26, 2014

__hash__ isn't even used in Brython though (at least not for its intended purpose!). And yes, you can use objects as keys in JavaScript.

I went through the code, changed all references from $keys and $values to $data, and changed the looping code accordingly. Initial tests are promising, although I still have a bit of debugging to do.

@ghost

ghost commented Sep 26, 2014

Python's dict source code is very low-level and would be slow to mimic in JavaScript.

I get what you're saying now, though: you can use objects and whatnot, but that risks collisions, such as d[4] and d['4'] both pointing to the same thing. I forgot about that aspect of JavaScript.

The issue with any solution is that you need to be able to extract an object from the key, and any changes to the underlying object need to be reflected in the key when it is retrieved. In other words, you can't simply rely on a string representation of an object as the key.
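
One way around both points, sketched here purely for illustration (hypothetical names, not the approach either branch necessarily took): tag the lookup string with the key's type so 4 and '4' land in different slots, and keep the original key object in the slot so it can be handed back unchanged later:

// Hypothetical sketch: a type-tagged string avoids the 4 vs '4' collision,
// and the slot stores the original key object so it can be returned as-is.
function slotKey(key) {
    return (typeof key) + ":" + String(key);   // "number:4" vs "string:4"
}

function setItem(store, key, value) {
    store[slotKey(key)] = [key, value];        // keep the real key, not its string
}

function getItem(store, key) {
    var slot = store[slotKey(key)];
    return slot === undefined ? undefined : slot[1];
}

var store = {};
setItem(store, 4, "int four");
setItem(store, "4", "str four");
console.log(getItem(store, 4));     // "int four"
console.log(getItem(store, "4"));   // "str four"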

@ghost

ghost commented Sep 26, 2014

After having some time to sleep on it, I think the best implementation is similar to what we have now, except that it emulates the auto-growing hash table of CPython. You'd use the hash function to figure out which bucket an item falls into, and then probably turn a slot into a list of items if multiple items fall into the same bucket. You'd have an array of keys and an array of items, just like now. In CPython, the minimum size is 8 and the array grows by a factor of 2 when it is >= 75% full. I didn't get a chance to see how it shrinks, but I'd imagine it shrinks when you go back down by a factor of two, i.e. if the array size just grew to 64 (with 24 items), then it would grow again at 48 and shrink at 12.
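
A rough sketch of just that resize policy (the numbers follow the comment above, not verified against CPython's exact rules; the shrink threshold in particular is a guess):

// Hypothetical sketch of the growth/shrink policy only: minimum size 8,
// grow by a factor of 2 at >= 75% full, shrink by a factor of 2 when the
// fill drops below 25%.
var MIN_SIZE = 8;

function nextCapacity(capacity, used) {
    if (used >= 0.75 * capacity) {
        return capacity * 2;                       // grow
    }
    if (capacity > MIN_SIZE && used < 0.25 * capacity) {
        return Math.max(MIN_SIZE, capacity / 2);   // shrink
    }
    return capacity;                               // leave as is
}

console.log(nextCapacity(8, 6));    // 16  (6 >= 75% of 8)
console.log(nextCapacity(64, 48));  // 128 (48 >= 75% of 64)
console.log(nextCapacity(64, 12));  // 32  (12 < 25% of 64)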

Maybe this was the original plan for dicts and it was scrapped for some reason? Are there any objects that hash doesn't work for that it should?

@PierreQuentel
Contributor

You can't use JavaScript objects like Python dicts: the JavaScript below produces unexpected results for us Pythonistas

a={'x':1} 
b={'y':2}
t = {}
t[a]='a'
t[b]='b'
console.log(t[a])
for(attr in t){console.log(attr)}

The current implementation in Brython is slow, and it's not even compliant with CPython, because this code should raise an exception:

t = {}
a=[1]
t[a] = 7
print(t)

A fast and compliant implementation should use the hash value of the objects that are hashable in CPython (https://docs.python.org/3/glossary.html#term-hashable). Most built-in types already have a hash method, but not instances of user-defined classes.
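
To make the compliance point concrete, a minimal sketch (hypothetical helper names, not Brython's internals) of a setitem path that refuses unhashable keys before storing anything, mirroring CPython's "unhashable type" TypeError:

// Hypothetical sketch: lists stand in for Python's unhashable types.
function checkHashable(key) {
    if (Array.isArray(key)) {
        throw new TypeError("unhashable type: 'list'");
    }
    return key;
}

function setItem(store, key, value) {
    store[String(checkHashable(key))] = value;  // throws before storing if unhashable
}

var t = {};
setItem(t, "a", 7);          // fine
try {
    setItem(t, [1], 7);      // the t[a] = 7 case from the snippet above, with a = [1]
} catch (e) {
    console.log(e.message);  // "unhashable type: 'list'"
}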

If you can work on this, it would be very helpful.

@ghost

ghost commented Sep 28, 2014

on it.

@ghost

ghost commented Sep 28, 2014

So far so good

a_list = []
a_dict = {}
for i in range(1000):
    val = "Hello%i" % i
    a_list.append(val)
    a_dict[val] = 1

import time
start = time.time()
for val in a_list:
    idx = a_list.index(val)
    tmp = a_list[idx]
print("%.02f" % (time.time() - start))

start = time.time()
for val in a_list:
    tmp = a_dict[val]
print("%.02f" % (time.time() - start))

On my new implementation, this gives:

2.87
0.03

And yes, you can do this:

a = {}
a['5'] = 10
a[5] = 11
print(a['5'])
print(a[5])
10
11

I implemented traversal as an array scan that basically checks each entry. I believe this is how it works in Python as well, since in my testing traversal was slower than insertion. I might be able to speed this number up a bit as well.

start = time.time()
for itm in a_list:
    pass
print("%.02f" % (time.time() - start))

start = time.time()
for itm in a_dict.items():
    pass
print("%.02f" % (time.time() - start))
k,v = itm
print("k: %s, v: %s" % (k,v))
0.00
0.19
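
For context on the traversal numbers above, here is a hypothetical sketch (not necessarily the layout in the branch) of why iterating a hash-table-sized key/value array costs more per item than walking a densely packed list: every slot has to be visited, including the empty ones.

// Hypothetical sketch: items() scans all slots and skips the unused ones.
var EMPTY = {};                       // sentinel for unused slots

function items(keys, values) {
    var result = [];
    for (var i = 0; i < keys.length; i++) {
        if (keys[i] !== EMPTY) {
            result.push([keys[i], values[i]]);
        }
    }
    return result;
}

// 8 slots, only 3 occupied: 8 checks to yield 3 items.
var keys   = [EMPTY, "a", EMPTY, EMPTY, "b", EMPTY, "c", EMPTY];
var values = [EMPTY,  1,  EMPTY, EMPTY,  2,  EMPTY,  3,  EMPTY];
console.log(items(keys, values));     // [["a", 1], ["b", 2], ["c", 3]]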

I still need to get all the built-in methods updated, and then update the code that unfortunately relied on the previous implementation of dicts in order to work. It might be a few days, but I think having this ready before the end of this week is realistic.

@ghost

ghost commented Oct 5, 2014

To give an update: I unfortunately haven't found time to finish ironing this out, but it sounds like it's best that I hold off on this anyway until we see the rewrite that changes how namespacing works.

Here's the current state of my changes if anyone's interested:

https://github.com/JamesHutchison/brython/blob/dict_refactor/src/py_dict.js

@ghost

ghost commented Oct 18, 2014

Pull request:

#41

@joaoventura
Contributor

Hi, I just saw this pull request now. I am working on another solution for this problem which is much simpler and which I would expect to be much faster.

Instead of managing the dictionary ourselves, we should rely entirely on JavaScript objects, which are JIT-compiled and capable of performing about 300,000,000 ops/sec on my computer (Firefox, OS X 10.9). We can use hash() or the hash functions to generate a hash to index an object in a JS dict, and to avoid collisions, each entry in the JS dict would be an array of (key, value) pairs.

So it is quite similar to James Hutchison's approach, but we don't need to manage the JS dict ourselves. Brython's dict would just be a small wrapper around a JS dict (needed only for the hash translations), and that would give us near-JavaScript performance, at least in theory.
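
A minimal sketch of the wrapper idea under stated assumptions (pyHash and pyEquals are hypothetical stand-ins for Brython's hash() translation and Python equality; this is not the code in either branch):

// Hypothetical sketch: the JS engine's own JIT-compiled property lookup does
// the heavy lifting; each hash slot holds [key, value] pairs in case two
// distinct keys share a hash.
function PyDict(pyHash, pyEquals) {
    this.slots = {};          // plain JS object keyed by hash value
    this.pyHash = pyHash;     // stand-in for Brython's hash()
    this.pyEquals = pyEquals; // stand-in for Python == on keys
}

PyDict.prototype.set = function (key, value) {
    var h = this.pyHash(key);
    var slot = this.slots[h] || (this.slots[h] = []);
    for (var i = 0; i < slot.length; i++) {
        if (this.pyEquals(slot[i][0], key)) { slot[i][1] = value; return; }
    }
    slot.push([key, value]);
};

PyDict.prototype.get = function (key) {
    var slot = this.slots[this.pyHash(key)];
    if (slot) {
        for (var i = 0; i < slot.length; i++) {
            if (this.pyEquals(slot[i][0], key)) { return slot[i][1]; }
        }
    }
    return undefined;
};

// Toy usage with trivial hash/equality, just to show the shape:
var d = new PyDict(function (k) { return String(k); },
                   function (a, b) { return a === b; });
d.set(8, 8);
console.log(d.get(8));  // 8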

@ghost

ghost commented Nov 5, 2014

I considered this approach, but for reasons I can't remember I went with emulating the CPython implementation.

@joaoventura
Contributor

I've tested your approach against a basic version of mine (as in this commit: https://github.com/joaoventura/brython/commit/c0a067b10100cceadb174a52621abecb1c06b1d6) using:

a = {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 0:0}
for _ in range(100000):
    a[8]

and the results were:

Original py_dict: 4.94s

Py_dict "James": 1.40s

Py_dict "JS DIct": 1.15s

The difference between what I have done and your version is quite negligible, but both are roughly four times as fast as the current approach. I still haven't implemented the iterators, which is what the rest of my code really depends on.

@ghost

ghost commented Dec 3, 2014

Not really a good test without a full implementation that also tests insert, delete, and thrashing performance (a ton of inserts, deletes, and look-ups on a single dict). It could be that the performance difference between the two implementations is a wash on typical data.

@ghost

ghost commented Dec 23, 2014

Great work guys! What is the status on this? What can I do to help?

@ghost

ghost commented Dec 29, 2014

Getting it merged into the main branch would be perfect :) It sounds like @joaoventura hasn't gotten around to an alternative implementation, so the one I have will have to do for now. Thanks for picking that up.

@ghost

ghost commented Dec 29, 2014

One of the biggest challenges is merging it into brython-dev/brython. It seems that you forked from PierreQuentel/brython instead, which is several hundred commits behind brython-dev/brython. No worries though; I will work with Pierre to see if he can remove (or delete) PierreQuentel/brython.
I'll also work on migrating the changes so that all our tests execute successfully.

Billy

@ghost

ghost commented Dec 30, 2014

Really, the only code that matters is py_dict.js. The rest of the changes to other files were to fix code that was dependent on the previous dict implementation. It might be easier to just redo those from scratch at this point, since a lot of them might be non-mergeable or non-applicable. You just have to grep out the references to $keys and $values.

@ghost

ghost commented Dec 30, 2014

James,

You are mostly correct with your suggestion of finding and modifying the references to $keys and $values. I have most or all of those changed, but passing all of our tests is a little more complicated. For example, since you use hash to do key lookups (just to get this out there, I believe your implementation is the correct way of doing this), we have some data types (such as complex) that didn't calculate hashes correctly, etc. It's just about finding these small issues and getting them fixed. I've also made a few adjustments to your py_dict.js file (but not too many, or maybe they were issues I introduced while debugging). Right now I'm down to 4 test files that do not work; when I started, it was more like 15 or so that had some type of issue.

I have not run any major tests, but so far I do not see a speed-up. This could be because most of the dicts I'm using for tests do not contain a large number of items, so the overhead cost may be larger than with Pierre's original dict implementation. We hope, when this is finished, to have a more pluggable system so that people can develop a new dict module, plug it in, and do some performance checks.

Billy


@ghost

ghost commented Dec 31, 2014

The speed-up should be night and day when doing random look-ups over 1k+ items. There might be some minor performance gains from optimizing the iteration process for lesser-used functions, but I'd rather tackle that later, after the dict implementation is bug-free.

@ghost

ghost commented Jan 6, 2015

I'm going to close this issue since the new implementation has been put into place. Feel free to reopen this issue (or create a new one) if a related problem arises.

@ghost closed this as completed Jan 6, 2015