
Support serializing numpy and pandas types #1180

Merged · 4 commits, Mar 30, 2020

Conversation

@sethmlarson (Contributor) commented Mar 25, 2020

This PR attempts to import numpy and pandas and, if either library is found, adds its types to the list that the default JSONSerializer supports. Numpy adds the integer, float, boolean, ndarray, and datetime types. Pandas adds support for Series, Timestamp, and NA -> None. Am I missing any important types that can be safely serialized to JSON?

Notably I left out DataFrame and numpy.nan. NaN is already handled by JSON and doesn't have semantics for Elasticsearch (at least I don't think it does?), and DataFrame seemed a bit too heavy to support natively. Better for users to call DataFrame.to_json() themselves?

I also wanted to confirm my thinking that it's appropriate to support Series and ndarray. Or is that also too presumptive of what a user wants from the library?
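To make the motivation concrete, here is a minimal sketch of the problem the PR addresses and the general shape of the fix. The `default` function below is illustrative only, not the actual JSONSerializer API from this repository:

```python
import json
import numpy as np

# Plain json.dumps cannot handle numpy integer scalars out of the box:
try:
    json.dumps({"value": np.int64(42)})
except TypeError:
    print("np.int64 is not JSON serializable by default")

# A minimal fallback in the spirit of this PR (hypothetical helper, not
# the library's real serializer): unwrap numpy scalars via .item().
def default(obj):
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    raise TypeError(f"Unable to serialize {obj!r}")

print(json.dumps({"value": np.int64(42)}, default=default))  # {"value": 42}
```

The actual implementation registers these conversions inside the client's default JSONSerializer rather than requiring callers to pass `default=` themselves.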

Closes #1178
Closes elastic/eland#142

@stevedodson left a comment:

LGTM

```python
elif isinstance(data, np.ndarray):
    return data.tolist()
if pd:
    if isinstance(data, pd.Series):
```
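For context, the excerpt above converts both container types to plain Python lists. A self-contained sketch of that logic (`to_jsonable` is an illustrative name, not the function in the diff):

```python
import numpy as np
import pandas as pd

# Hedged sketch of the conversion shown in the diff: both ndarray and
# Series flatten to plain Python lists via .tolist(), which also
# unwraps numpy scalars into native Python types element by element.
def to_jsonable(data):
    if isinstance(data, np.ndarray):
        return data.tolist()
    if isinstance(data, pd.Series):
        return data.tolist()
    raise TypeError(f"Unable to serialize {data!r}")

print(to_jsonable(np.array([1, 2, 3])))    # [1, 2, 3]
print(to_jsonable(pd.Series([1.5, 2.5])))  # [1.5, 2.5]
```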
A contributor commented:
I'm definitely not the expert here (so I might be asking the wrong question), but I've recently been working a bit with pandas dtypes and I'm wondering how this serialiser will handle the types of the elements inside a pd.Series or a pd.DataFrame. Usually these elements are common numerical types, so they're probably handled by the numpy converters, but sometimes a user can set these dtypes to pandas-specific things like category. For example, I recently did something like this in my Jupyter notebook:

```python
# finally we will correct the mappings on the remaining columns
mappings = {'carat': 'float64',
            'cut': 'category',
            'color': 'category',
            'depth': 'float64',
            'table': 'float64',
            'price': 'float64',
            'x': 'float64',
            'y': 'float64',
            'z': 'float64'}

df_cleaned = df_cleaned.astype(mappings)
```

How would the serialisation handle category or some of the other pandas-specific dtypes in a Series? (Here is a list of some more exotic dtypes: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#dtypes)
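A quick illustration of what happens in practice with a categorical Series under the `.tolist()` conversion the PR uses: the values come back as plain Python strings, so the categorical information is dropped at serialization time (the data values survive, the dtype does not):

```python
import pandas as pd

# A categorical Series: dtype is 'category', values are strings.
s = pd.Series(["ideal", "premium", "ideal"], dtype="category")
print(s.dtype)              # category
print(s.tolist())           # ['ideal', 'premium', 'ideal']
print(type(s.tolist()[0]))  # <class 'str'>
```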

@sethmlarson (Contributor, Author) replied:

Category is an interesting one: we'd lose the "categorical" aspect of the value if we serialize it to a string, but maybe that's fine? Maintaining the categorical aspect would require configuration on the mapping, but as long as that's done it would be a solution. So maybe we do category -> str?
