random error with postgresql data source #1191

Closed
wavexx opened this issue Jul 31, 2015 · 23 comments

@wavexx

wavexx commented Jul 31, 2015

I'm new to blaze, so pardon my ignorance here. I have no idea if I have to report this to odo/datashape or something else.

I'm using blaze.Data on a postgresql table ("postgresql://"). When I try to get some data off the table with list(head(10)), about 50% of the time (without any change to the db) I get this error:

  File "/usr/local/lib/python2.7/dist-packages/odo/into.py", line 122, in curried_into
    return into(o, other, **merge(kwargs2, kwargs1))
  File "/usr/local/lib/python2.7/dist-packages/multipledispatch/dispatcher.py", line 164, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/blaze/interactive.py", line 309, in into
    return into(a, result, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/multipledispatch/dispatcher.py", line 164, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/odo/into.py", line 25, in into_type
    return convert(a, b, dshape=dshape, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/odo/core.py", line 30, in __call__
    return _transform(self.graph, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/odo/core.py", line 46, in _transform
    x = f(x, excluded_edges=excluded_edges, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/odo/convert.py", line 21, in dataframe_to_numpy
    dtype = dshape_to_numpy(dshape or discover(df))
  File "/usr/local/lib/python2.7/dist-packages/odo/numpy_dtype.py", line 55, in dshape_to_numpy
    for name, typ in zip(ds.names, ds.types)])
  File "/usr/local/lib/python2.7/dist-packages/odo/numpy_dtype.py", line 26, in unit_to_dtype
    return unit_to_dtype(str(ds).replace('int', 'float').replace('?', ''))
  File "/usr/local/lib/python2.7/dist-packages/odo/numpy_dtype.py", line 22, in unit_to_dtype
    ds = dshape(ds)
  File "/usr/local/lib/python2.7/dist-packages/datashape/util.py", line 49, in dshape
    ds = parser.parse(o, type_symbol_table.sym)
  File "/usr/local/lib/python2.7/dist-packages/datashape/parser.py", line 575, in parse
    dsp.raise_error('Invalid datashape')
  File "/usr/local/lib/python2.7/dist-packages/datashape/parser.py", line 57, in raise_error
    self.ds_str, errmsg)
datashape.error.DataShapeSyntaxError: 

  File <nofile>, line 1
    float16
    ^

DataShapeSyntaxError: Invalid datashape

I actually wonder why this error is not reproducible. Looks like odo is randomly choosing a different conversion/coercion route? In fact, it's so random I cannot even determine whether there's a specific column type that could cause the issue.
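
One way the same expression could take different routes from run to run, offered as a guess rather than a diagnosis: the traceback shows odo walking a graph of converters (`_transform(self.graph, ...)`), and if two routes are equally short, which one wins can depend on the order in which neighbours are visited. The sketch below uses a plain breadth-first search over a made-up graph; the node names, the `first_shortest_path` helper, and the edge layout are all hypothetical and are not odo's actual routing code.

```python
from collections import deque


def first_shortest_path(graph, source, target):
    """Breadth-first search that returns the first shortest path it finds.

    `graph` maps each node to an ordered list of neighbours, so the
    neighbour order decides which of several equally short paths wins.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical conversion graphs: identical edges, different neighbour order.
edges_a = {
    "Select": ["DataFrame", "list"],
    "DataFrame": ["Iterator"],
    "list": ["Iterator"],
}
edges_b = {
    "Select": ["list", "DataFrame"],
    "DataFrame": ["Iterator"],
    "list": ["Iterator"],
}

route_a = first_shortest_path(edges_a, "Select", "Iterator")
route_b = first_shortest_path(edges_b, "Select", "Iterator")
print(route_a)
print(route_b)
```

Both routes have the same length, but only one of them passes through the hypothetical `DataFrame` node. If something similar happens in the real graph, only the runs that go through the DataFrame-to-numpy step would hit the `float16` parse error.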

@cpcloud
Member

cpcloud commented Jul 31, 2015

Looks like it hasn't been added to the parser.

In [5]: import datashape as ds

In [6]: ds.__version__
Out[6]: '0.4.6+67.ge3431f9'

In [7]: ds.float16
Out[7]: ctype("float16")

In [8]: ds.dshape('float16')
  File "<nofile>", line 1
    float16
DataShapeSyntaxError: Invalid datashape

I'm on it!

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 16:19, Phillip Cloud wrote:

Looks like it hasn't been added to the parser.

In [5]: import datashape as ds

In [6]: ds.__version__
Out[6]: '0.4.6+67.ge3431f9'

In [7]: ds.float16
Out[7]: ctype("float16")

In [8]: ds.dshape('float16')
  File "<nofile>", line 1
    float16
DataShapeSyntaxError: Invalid datashape

I'm on it!

Why does it happen randomly though?

@cpcloud
Member

cpcloud commented Jul 31, 2015

@wavexx Can you show

Data('postgresql://your uri here').dshape

@cpcloud cpcloud self-assigned this Jul 31, 2015
@cpcloud cpcloud added this to the 0.8.3 milestone Jul 31, 2015
@cpcloud cpcloud added the bug label Jul 31, 2015
@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 16:22, Phillip Cloud wrote:

@wavexx Can you show

Data('postgresql://your uri here').dshape

One example:

dshape("""var * {
  aid: ?string[13],
  sid: ?int32,
  date: ?date,
  bdate: ?date,
  gender: ?int16,
  place: ?string[100],
  bplace: ?string[50],
  appointment_type_id: ?int32
  }""")
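
Reading this datashape against the traceback gives a plausible source for the `float16` string: the `unit_to_dtype` frame rewrites the type name with `str(ds).replace('int', 'float').replace('?', '')`, and the `gender` column is `?int16`. The sketch below applies the same two replacements to the optional integer types listed above; `naive_float_name` is a hypothetical helper name used only for illustration, not a function in odo.

```python
def naive_float_name(type_name):
    # The same two string replacements that appear in the traceback's
    # unit_to_dtype frame (odo/numpy_dtype.py, line 26).
    return type_name.replace('int', 'float').replace('?', '')


# Optional integer column types taken from the datashape above.
for column_type in ['?int32', '?int16']:
    print(column_type, '->', naive_float_name(column_type))
```

`?int32` maps to `float32`, which the datashape parser accepts, while `?int16` maps to `float16`, the name the parser rejects in the error above.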

@cpcloud
Member

cpcloud commented Jul 31, 2015

what is the exact expression you are trying to run? i don't need to see the database connection string, just the actual line of code that is randomly giving the error

@cpcloud
Member

cpcloud commented Jul 31, 2015

@wavexx when it doesn't fail, does it show you the correct data?

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 16:33, Phillip Cloud wrote:

@wavexx when it doesn't fail, does it show you the correct data?

It seems to, yes.

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 16:33, Phillip Cloud wrote:

what is the exact expression you are trying to run? i don't need to see
the database connection string, just the actual line of code that is
randomly giving the error

It's a slice on both columns and rows:

list(data[data.fields[a:b]][c:d])

@cpcloud
Member

cpcloud commented Jul 31, 2015

are a, b, c and d always the same (not equal to each other, but across runs of whatever function these are in)? ie do you get random errors when repeatedly running the exact same expression?

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 16:39, Phillip Cloud wrote:

are a, b, c and d always the same (not equal to each other, but
across runs of whatever function these are in)? ie do you get random
errors when repeatedly running the exact same expression?

You can try the full code yourself if you want:

https://github.com/wavexx/gtabview

PYTHONPATH=$PWD ./bin/gtabview postgresql://something/db::table

The slicing occurs in gtabview/models.py:157

For the first query, though, the expression is effectively always:

list(data[data.fields[0:len(data.fields)]][0:min(16384, int(data.nrows))])

So I assume it's constant if int(data.nrows) and len(data.fields) return
the same value. [never checked, but the db is read-only]

Despite being constant, it still fails randomly.

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 16:46, Yuri D'Elia wrote:

On 31/07/15 16:39, Phillip Cloud wrote:

are a, b, c and d always the same (not equal to each other, but
across runs of whatever function these are in)? ie do you get random
errors when repeatedly running the exact same expression?

However, do I infer that an empty slice wouldn't be valid?

I do expect data[...][0:0] to return an empty list.

@cpcloud
Member

cpcloud commented Jul 31, 2015

However, do I infer that an empty slice wouldn't be valid?

Yes this is valid.

@wavexx Can you show:

import blaze, odo, datashape

blaze.__version__
odo.__version__
datashape.__version__

@cpcloud
Member

cpcloud commented Jul 31, 2015

FWIW, the only "random" element here is that the order of your result set is undefined without an ORDER BY clause. So, an operation like slicing e.g., table[5:15], is implemented with select * from table limit 10 offset 5 and could potentially have a different set of rows. This doesn't strike me as the issue, since the types of the columns will be the same no matter what result set you get back.
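
A minimal sketch of that point, using the standard-library sqlite3 module as a stand-in for PostgreSQL (the table name `t` and its columns are made up for the example): adding an ORDER BY clause pins down which rows a LIMIT/OFFSET slice returns.

```python
import sqlite3

# In-memory database standing in for the real PostgreSQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

# Insert rows in reverse order so insertion order differs from id order.
conn.executemany(
    "INSERT INTO t (id, name) VALUES (?, ?)",
    [(i, "row%d" % i) for i in reversed(range(20))],
)

# Equivalent of table[5:15], with an explicit ordering on id.
rows = conn.execute(
    "SELECT id, name FROM t ORDER BY id LIMIT 10 OFFSET 5"
).fetchall()
ids = [row[0] for row in rows]
print(ids)

conn.close()
```

Without the ORDER BY, the database is free to return any ten rows, so two runs of the same slice need not agree on their contents even though the column types stay the same.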

@cpcloud
Member

cpcloud commented Jul 31, 2015

@wavexx What do you get when you run the following code? I'm using IPython, but you can use vanilla Python as well

In [30]: from blaze import Data

In [31]: d = Data('postgresql://localhost::table')

In [32]: d.head()

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 17:17, Phillip Cloud wrote:

@wavexx What do you get when you run the
following code? I'm using IPython, but you can use vanilla Python as well

In [30]: from blaze import Data, odo, compute
In [31]: d = Data('postgresql://localhost::table')
In [32]: d.head()

If I run it in IPython, right now it works every time.

If I stick it in a file:

from blaze import Data, odo, compute
d = Data('postgresql://host/db::table')
print(list(d.head()))

and run it with:

python test.py

it fails with float16 not being recognized (same error as reported before).

Now here's the interesting part: if I import os and sys at the beginning, it works:

import os, sys
from blaze import Data, odo, compute
d = Data('postgresql://host/db::table')
print(list(d.head()))

I did that only because that's what I do in the ipython startup.

@cpcloud
Member

cpcloud commented Jul 31, 2015

in both cases can you show the versions of blaze, odo and datashape?

@cpcloud
Member

cpcloud commented Jul 31, 2015

@wavexx also, d.head() and list(d.head()) are doing very different things. It's unrelated to whether it's in a file or not.

In the first case you're seeing something like this

print(repr(odo(compute(d.head()).execute().fetchall(), pd.DataFrame)))

Which works, because sqlalchemy works.

In the second case something like [x for x in odo(d.head(), Iterator)] is called, and the conversion path from sqlalchemy.Selectables to Iterators involves a conversion to a numpy array. This should go directly to an iterator after calling selectable.execute(), so that looks like a bug.

I have no idea why importing os and sys would make this work.

@cpcloud
Member

cpcloud commented Jul 31, 2015

I have no explanation for the randomness. I'd need you to break into a debugger right before the expression is converted to see what the issue is.

@cpcloud
Member

cpcloud commented Jul 31, 2015

I'm pretty sure that blaze/datashape#163 will fix all of these errors.

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 17:47, Phillip Cloud wrote:

@wavexx also, d.head() and list(d.head()) are doing very different
things. It's unrelated to whether it's in a file or not.

Sure, I just noticed now because of the ipython startup I had.

In the first case you're seeing something like this

print(repr(odo(compute(d.head()).execute().fetchall(), pd.DataFrame)))

Which works, because sqlalchemy works.

Which is why I explicitly do list().

In the second case something like [x for x in odo(d.head(), Iterator)]
is called, and the conversion path from sqlalchemy.Selectables to
Iterators involves a conversion to a numpy array. This should go
directly to an iterator after calling selectable.execute(), so that
looks like a bug.

I have no idea why importing os and sys would make this work.

The source I'm using is:

import blaze, odo, datashape
from blaze import Data
print(blaze.__version__)
print(odo.__version__)
print(datashape.__version__)
d = Data('postgresql://host/db::table')
print(list(d.head()))

When run with python:

$ python test.py
0.8.2
0.3.3
0.4.6

  File <nofile>, line 1
    float16
    ^
[and the rest of the traceback]

Your mention of iterators made me think: it seems that adding

from __future__ import generators

is sufficient:

$ python test.py
0.8.2
0.3.3
0.4.6
[(......

Could it be that you expect some builtins to emit generators somewhere?
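
For what it's worth, `from __future__ import generators` has been a no-op since Python 2.3, so on 2.7 it should not change how any builtin behaves. The standard library records when each future feature became mandatory, and the short check below reads that record. This suggests the import affects something incidental rather than generator semantics, though nothing in this thread pins down what.

```python
import __future__

# Each future feature records the release where it became available
# and the release where it became mandatory (i.e. always on).
feature = __future__.generators
optional = feature.getOptionalRelease()
mandatory = feature.getMandatoryRelease()

print("optional since: ", optional)
print("mandatory since:", mandatory)
```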

@wavexx
Author

wavexx commented Jul 31, 2015

On 31/07/15 17:50, Phillip Cloud wrote:

I have no explanation for the randomness. I'd need you to break
into a debugger right before the expression is converted to see what the
issue is.

I need some guidance.
Where exactly would you inspect for differences?

I need at least some function names and/or things to look for.

@cpcloud
Member

cpcloud commented Jul 31, 2015

what are o and other in this line?

  File "/usr/local/lib/python2.7/dist-packages/odo/into.py", line 122, in curried_into
    return into(o, other, **merge(kwargs2, kwargs1))

@cpcloud cpcloud removed this from the 0.8.3 milestone Sep 15, 2015
@cpcloud
Member

cpcloud commented Dec 4, 2015

@wavexx closing. pls reopen if this is still an issue

@cpcloud cpcloud closed this as completed Dec 4, 2015

2 participants