Deco with pandas data structures #18

Closed
wikiped opened this issue May 16, 2016 · 7 comments

@wikiped

wikiped commented May 16, 2016

While experimenting with deco and pandas I was hoping that the code below would work.

The intention was to simulate parallel-processing of a dummy pandas.DataFrame, where vectorized implementation is supposedly not possible.

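import numpy as np
import pandas as pd

n = 1000
# Dummy frame: a string column and a random integer column (1 to 9 inclusive).
# numpy is imported directly because the pd.np alias was removed in pandas 2.0.
df = pd.DataFrame({'str': 'Row' + pd.Series(range(n)).astype(str),
                   'num': np.random.randint(1, 10, n)})
df.head()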

Produces:

   num   str
0    3  Row0
1    8  Row1
2    1  Row2
3    2  Row3
4    9  Row4

Now trying to join cells in each row with:

from deco import *

@concurrent
def join_str(series, sep=', '):
    return sep.join(map(str, series))

@synchronized
def joiner(df, cols, sep=', '):
    joined = pd.Series(index=df.index)
    for row in df.index:
        joined[row] = join_str(df.loc[row, cols], sep=sep)
    return joined

joiner(df, ['str','num'])

This raises a ValueError with the message "Assignment attempted on something that is not index based":

ValueError                                Traceback (most recent call last)
..........

D:\Anaconda\envs\py2k\lib\site-packages\deco\astutil.pyc in subscript_name(node)
     38             return node.id
     39         elif type(node) is ast.Subscript:
---> 40             return SchedulerRewriter.subscript_name(node.value)
     41         raise ValueError("Assignment attempted on something that is not index based")
     42 

D:\Anaconda\envs\py2k\lib\site-packages\deco\astutil.pyc in subscript_name(node)
     39         elif type(node) is ast.Subscript:
     40             return SchedulerRewriter.subscript_name(node.value)
---> 41         raise ValueError("Assignment attempted on something that is not index based")
     42 
     43     def is_concurrent_call(self, node):

ValueError: Assignment attempted on something that is not index based

This is somewhat strange, given that the assignment is index based.

Is this because deco doesn't 'understand' this particular data structure (pandas.Series), or is there a problem in the code?

@lelandg

lelandg commented May 16, 2016

It looks like you have a dict type, not an indexed one. What does type(pandas.Series) return?


@wikiped
Author

wikiped commented May 16, 2016

Hm, there might be some confusion about the term "index"...

type(pandas.Series()) returns pandas.core.series.Series

From the docs I referenced:

Internally Series subclasses NDFrame, similarly to the rest of the pandas containers.
Series acts very similarly to a ndarray, and is a valid argument to most NumPy functions.

You access and modify data in a Series through its index:

s = pd.Series(['a','b','c'], index=[0,1,2])
print(s)

0    a
1    b
2    c
dtype: object

print(s[0])
a

s[0:2] = 'd'; print(s)
0    d
1    d
2    c

So in pandas terms, the above is done through indexing. Perhaps indexing means something else for deco?

@alex-sherman
Owner

Unfortunately, the exception is a bit of misdirection. I think the problem is not with the assignment but with the argument df.loc[row, cols]: I call the same get_subscript_name() on the target of the assignment as well as on each argument of the function. I'll look into this further today; I think I have a solution that shouldn't be too hard to implement.
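
To illustrate the point about the argument, here is a minimal sketch using only the standard ast module. The first function mirrors the lookup visible in the traceback (the Name branch is an assumption inferred from the `return node.id` line); the second, subscript_name_with_attributes, is a hypothetical variant and not necessarily the fix that went into deco.

```python
import ast


def subscript_name(node):
    """Lookup as it appears in the traceback: only Name and Subscript are handled.

    The Name check is an assumption inferred from the `return node.id` line.
    """
    if type(node) is ast.Name:
        return node.id
    elif type(node) is ast.Subscript:
        return subscript_name(node.value)
    raise ValueError("Assignment attempted on something that is not index based")


def subscript_name_with_attributes(node):
    """Hypothetical variant that also walks through attribute access (df.loc)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, (ast.Subscript, ast.Attribute)):
        return subscript_name_with_attributes(node.value)
    raise ValueError("Assignment attempted on something that is not index based")


# The assignment target and the call argument from the original post.
target = ast.parse("joined[row]", mode="eval").body
argument = ast.parse("df.loc[row, cols]", mode="eval").body

print(subscript_name(target))  # the target resolves to a plain name

try:
    subscript_name(argument)
except ValueError as exc:
    # df.loc is an ast.Attribute, which the original lookup does not handle.
    print("argument fails:", exc)

print(subscript_name_with_attributes(argument))
```

The target joined[row] is a Subscript whose value is a Name, so it resolves. The argument df.loc[row, cols] is a Subscript whose value is an Attribute, which falls through to the ValueError even though the assignment itself is index based.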

@alex-sherman
Owner

I've been able to run this example in the latest version of deco, and I added a unit test that should cover this case in the future. I'm going to close this for now. Thanks again for bringing it up!

@wikiped
Author

wikiped commented Jun 11, 2016

@alex-sherman
Thanks for looking into this.

I have tried the same code with the latest build, 0.4.1, and get KeyError: 'join_str':

KeyError                                  Traceback (most recent call last)
<ipython-input-5-8fb969f4cccf> in <module>()
----> 1 joined = joiner(df, ['str','num'])
      2 joined.head()

D:\Anaconda\envs\py2k\lib\site-packages\deco\conc.pyc in __call__(self, *args, **kwargs)
     56             exec(out, scope)
     57             self.f = scope[self.orig_f.__name__]
---> 58         return self.f(*args, **kwargs)
     59 
     60 

<string> in joiner(df, cols, sep)

D:\Anaconda\envs\py2k\lib\site-packages\deco\conc.pyc in wait(self)
    127         results = []
    128         while len(self.results) > 0:
--> 129             results.append(self.results.pop().get())
    130         for assign in self.assigns:
    131             assign[0][0][assign[0][1]] = assign[1].get()

D:\Anaconda\envs\py2k\lib\site-packages\deco\conc.pyc in get(self)
    142 
    143     def get(self):
--> 144         result, operations = self.async_result.get()
    145         self.decorator.apply_operations(operations)
    146         return result

D:\Anaconda\envs\py2k\lib\multiprocessing\pool.pyc in get(self, timeout)
    565             return self._value
    566         else:
--> 567             raise self._value
    568 
    569     def _set(self, i, obj):

KeyError: 'join_str'

This happens on Python 2.7 and 3.5. Is the fix perhaps not part of the 0.4.1 release?

@alex-sherman
Owner

Could you provide the exact code you're using? I tried the example you have in the first post and it executed without a problem on my machine.

The error looks like it's caused by some inconsistency between module loads in the processes spawned by the pool. The KeyError is probably the result of a failed lookup of the concurrent function being referenced. If you're using multiple modules, that may be the cause, but I would definitely like to fix this either way.

@alex-sherman alex-sherman reopened this Jun 11, 2016
@wikiped
Author

wikiped commented Jun 12, 2016

It turns out the failure happened because I was running the code from my original post in a Jupyter notebook. I had forgotten that, on Windows, code using the multiprocessing module has to be guarded with if __name__ == '__main__' to avoid exactly the error I got in the last run.

If I run the code from a shell, everything works fine. So thank you again, and sorry for the confusion.
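
For anyone landing here with the same error, the guard can be sketched with the standard library alone. This uses multiprocessing.Pool directly rather than deco, and join_rows and the sample rows are made up for illustration; with deco, the call to joiner(df, ['str', 'num']) would presumably sit under the same guard in a script run from a shell.

```python
import multiprocessing


def join_str(values, sep=', '):
    # Worker function: defined at module level so child processes can import it.
    return sep.join(map(str, values))


def join_rows(rows, processes=2):
    # Hypothetical helper: farm each row out to a pool of worker processes.
    with multiprocessing.Pool(processes) as pool:
        return pool.map(join_str, rows)


if __name__ == '__main__':
    # On Windows, child processes re-import the main module. Without this
    # guard, the pool would be started again on every import.
    rows = [('Row0', 3), ('Row1', 8), ('Row2', 1)]
    print(join_rows(rows))
```

Saved as a script and run from a shell, the worker processes can import join_str by name, which is the lookup that fails when the function only exists in an interactive session.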

@wikiped wikiped closed this as completed Jun 12, 2016