Support serializing numpy and pandas types #1180
Conversation
LGTM
elasticsearch/serializer.py
Outdated
elif isinstance(data, np.ndarray):
    return data.tolist()
if pd:
    if isinstance(data, pd.Series):
I'm definitely not the expert here (so I might be asking the wrong question), but I've recently been working a bit with pandas dtypes, and I'm wondering how this serialiser will handle the types of the elements inside a pd.Series or a pd.DataFrame. Usually these elements are various numerical types, so they're probably handled by the NumPy converters, but sometimes a user can set these dtypes to pandas-specific stuff like category. For example, I've recently done something like this in my Jupyter notebook:
# finally we will correct the mappings on the remaining columns
mappings = {'carat': 'float64',
            'cut': 'category',
            'color': 'category',
            'depth': 'float64',
            'table': 'float64',
            'price': 'float64',
            'x': 'float64',
            'y': 'float64',
            'z': 'float64'}
df_cleaned = df_cleaned.astype(mappings)
How would the serialisation handle category or some of the other pandas-specific dtypes in a Series? (Here is a list of some more exotic dtypes: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#dtypes)
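As a quick sanity check (this snippet is mine, not from the PR), `Series.tolist()` materializes categorical values as plain Python objects, so a string-backed category column would serialize cleanly under a `tolist()`-based approach:

```python
import json

import pandas as pd

# A Series with the pandas-specific "category" dtype, as in the notebook
# snippet above (values are hypothetical).
s = pd.Series(["Ideal", "Premium", "Ideal"], dtype="category")

# Series.tolist() returns the underlying category values as plain Python
# objects. The categorical wrapper is dropped, and what's left (here,
# strings) is already JSON-serializable.
values = s.tolist()
print(json.dumps(values))  # ["Ideal", "Premium", "Ideal"]
```

The open question is whether dropping the categorical wrapper like this is acceptable, which is what the reply below discusses.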
Category is an interesting one: we'd lose the "categorical" aspect of the value if we serialize to a string, but maybe that's fine? Maintaining the categorical aspect would require configuration on the index mapping, but as long as that's done it'd be a solution. So maybe we do category -> str?
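To illustrate the "config on the mapping" part (a hypothetical sketch, not part of the PR): if category values arrive as plain strings, mapping those fields as `keyword` in Elasticsearch preserves the enumerated-value behavior (exact-match filtering, terms aggregations). The field names below are borrowed from the notebook example above.

```python
import json

# Hypothetical index mapping: category-backed columns become "keyword"
# fields, numeric columns become "double". This is Elasticsearch-side
# configuration, separate from the client-side serialization.
mapping = {
    "mappings": {
        "properties": {
            "cut": {"type": "keyword"},
            "color": {"type": "keyword"},
            "carat": {"type": "double"},
        }
    }
}
print(json.dumps(mapping))
```

With a mapping like this in place, serializing category -> str on the client loses nothing that the index itself can't recover.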
This PR attempts to import numpy and pandas, and if either library is found, adds to the list of types that the default JSONSerializer supports. NumPy adds the integers, floats, booleans, ndarray, and datetime. Pandas adds support for Series, Timestamp, and NA -> None. Am I missing any important types that can be safely serialized to JSON?

Notably I left out DataFrame and numpy.nan. NaN is already handled by JSON and doesn't have semantics for Elasticsearch (at least I don't think it does?), and DataFrame seemed a bit too heavy to support natively? Better for users to call DataFrame.to_json() themselves?

Also wanted to confirm my thinking that it is appropriate to support Series and ndarray? Or is that also too presumptive of what a user wants from the library?

Closes #1178
Closes elastic/eland#142
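Pulled together, the approach the description outlines could be sketched as a standalone `default` hook for `json.dumps` (a sketch under the stated behavior, not the PR's actual diff; the real change lives in `JSONSerializer` in `elasticsearch/serializer.py`):

```python
import json

# Mirror the PR's approach: only handle numpy/pandas types when the
# libraries imported successfully.
try:
    import numpy as np
except ImportError:
    np = None

try:
    import pandas as pd
except ImportError:
    pd = None


def default(data):
    """Fallback converter for objects the stdlib json encoder rejects."""
    if np is not None:
        if isinstance(data, np.integer):
            return int(data)
        if isinstance(data, np.floating):
            return float(data)
        if isinstance(data, np.bool_):
            return bool(data)
        if isinstance(data, np.datetime64):
            return str(data)
        if isinstance(data, np.ndarray):
            return data.tolist()
    if pd is not None:
        if isinstance(data, pd.Series):
            return data.tolist()
        if isinstance(data, pd.Timestamp):
            return data.isoformat()
        if data is pd.NA:  # pd.NA exists in pandas >= 1.0
            return None
    raise TypeError("Unable to serialize %r (type: %s)" % (data, type(data)))


# Demo document; falls back to plain ints if numpy is unavailable.
doc = {"count": np.int64(3), "values": np.array([1.5, 2.5])} if np else {"count": 3}
print(json.dumps(doc, default=default))
```

Note that DataFrame is deliberately absent, matching the description above: users would call `DataFrame.to_json()` themselves.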