Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation probably needed for sort-values #8

Closed
ezmiller opened this issue Mar 6, 2020 · 4 comments · Fixed by #10
Closed

Implementation probably needed for sort-values #8

ezmiller opened this issue Mar 6, 2020 · 4 comments · Fixed by #10

Comments

@ezmiller
Copy link
Contributor

ezmiller commented Mar 6, 2020

The following piped transformation shows a gap that we might think about filling on this library:

(defn properties-per-host [df]
  (-> df
      (pt/melt {:id-vars :host_id :value-vars [:listing_id]})
      (pt/groupby :host_id)
      (pt/subset-cols :value)
      (pt/n-unique)
      (pt/data-frame)
      (py. sort_values :by :value :ascending false)
      (pt/rename {:columns {:value :num_unique_listings}}))
  ))

In order to sort the n-unique count on the :value column, it was necesssary as things stand to first cast the result as a data frame (it was a series after the group-by and aggregation fn), and then to call the sort_values method on the :pyobject.

It would be nice to set things up such that we don't need to do these extra steps.

@ezmiller
Copy link
Contributor Author

ezmiller commented Mar 6, 2020

@alanmarazzi: I'd be game to open a PR for this. Presuming you are okay with that, is there some context and guidance you'd like to provide?

@alanmarazzi
Copy link
Owner

I'm absolutely ok with that! I think that sort_values might go into the generics ns.

After this the true main step is to decide whether you want to use kw-call or simple-kw-call (here).

The difference between the two is whether you think that having an explicit positional first argument makes sense for the implementation. For example, here you can find rename and to-excel which use the 2 different functions.

The reason why I decided to implement them like that is because to-excel always needs a filename/path, no exceptions, so it makes sense to have a Clojure positional argument dedicated to that, while the same is not true with rename since there are no mandatory arguments and bypassing one you might have to avoid passing another one.

After this, a simple docstring with a couple of examples will be ok, no need to add docs for all the arguments, I have to get rid of those for the other functions as well, since it's impossible to list them all. One cool thing would be to link to the original docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html, but of course even in this way you can't know whether people want the DataFrame or the Series implementation. (Suggestions are well accepted).

The examples can be used for some tests, nothing too extensive, we don't want to test pandas, but the panthera implementation.

After everything is working and test run, the last thing to do is to add the function into the panthera ns

(export-symbols
.

Please let me know about any inquiries you might have, I'd like to turn this into a contributing guide, and you'll see that once you get the hang of it it's very simple to add functionality to this!

@ezmiller
Copy link
Contributor Author

ezmiller commented Mar 9, 2020

@alanmarazzi thanks for this writeup! I'm going to try to turn to this this week, and will certainly come back with any questions I may have. Regarding the docstring, can you give me an example of what you think is a good model for the docstrings going forward? I have to say it has been very convenient to have some of the docstrings that you have written in place. I'm thinking, for example, of the docstrings for group-by or melt. Is your current thinking that those kind of docstrings wouldn't be included?

@alanmarazzi
Copy link
Owner

You're right, ok just to be clear a dumb example of the issue is this: https://pandas.pydata.org/pandas-docs/stable/search.html?q=swaplevel#

Now, the swap-level function will work wether you pass a series, a data-frame or even an index, but we can't know it beforehand what's what.

The problem is that if you look at the docs there are different arguments accepted depending on the type you get.

I'd settle for something like:

Switch levels i and j of a multi index.

Original docs https://pandas.pydata.org/pandas-docs/stable/search.html?q=swaplevel#

Arguments:

  - df-or-srs -> data-frame or series
  - i, j -> int, str: the levels that you want to swap

Examples:

  (swap-level my-df 0 1)
  (swap-level my-srs :a :b)
  (swap-level my-df 0 :b)

So, clarify only args and send people to original docs for attrs (in this case there's none because I decided to not let people easily access copy and inplace).

Anyway I'm really keen to receive suggestions about this, I already know pandas extensively, but anyway I find myself on the docs very often because of its API size and depth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants