Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add series describe function #414

Merged
merged 2 commits into from Aug 14, 2018
Merged

Conversation

alexpantyukhin
Copy link
Contributor

@alexpantyukhin alexpantyukhin commented Aug 11, 2018

@alexpantyukhin
Copy link
Contributor Author

Can't win the message This construct causes code to be less generic than indicated by the type annotations. The type variable 'V has been constrained to be type 'float'. there :) But this is how I see describe for series.

@sebhofer
Copy link

sebhofer commented Aug 11, 2018

Why not just put all the statistics into a frame, with row keys "min", "max",..? This has the additional advantage that you can use the same procedure also for frames, not just for series.

Edit To get rid of the warning, change the 'V in describe's type annotation to float.

@alexpantyukhin
Copy link
Contributor Author

I want to keep 'V because there could be not float . For example later could be add count and uniqueCount It doesn't require float

@sebhofer
Copy link

sebhofer commented Aug 11, 2018

Well at the moment most of the Stats functions are defined only for floats anyway, that's why I opened #410. So at the moment it doesn't really matter if you restrict the type to float or not. But yes, in the long run this should be generic.

@alexpantyukhin
Copy link
Contributor Author

Actually I think that the result of describe should be of the Series type and Frame for Frames. It already has good ways for displaying in interactive mode. And as I know pandas has the same approach.

@sebhofer
Copy link

Ok, sure if you like that better. But why not simply stick to strings as keys?

@tpetricek
Copy link
Member

I think the most appropriate design would be either all dynamic (following Pandas) or all static (using what F# offers). I would be slightly in favour of having a nice F# type as the result, because that works well with other F# code and it can be turned into a data frame easily using Frame.ofRecords. So, my preference would be for returning something like:

type StatisticDescription = 
  { Min : float
    Max : float
    Mean : float
    Std : float }

That said, returning a data frame would also be okay - I can see that if one is working with data frames, then it makes sense to keep the structure. That way, I'd prefer to emulate the pandas patterns and just use string as the key.

The idea of having a DU as a key is a neat trick and I can see how that might be a nice general pattern to use, but if it is a pattern that we are using in just one place in the entire library, then I think it might cause some confusion.

@zyzhu
Copy link
Contributor

zyzhu commented Aug 13, 2018

Pandas supports two kinds of describe
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html

For numerical series, it returns count, mean, std, min, 25%, 50%, 75%, max.

For categorical series, it returns count, unique, top, freq.

Deedle series could have either numerical values or other values such as string or other objects. It would be good to separate describe for two cases.

I would opt for using string as keys as users use describe mostly at exploring stage. It's just used interactively to see a quick output. The result structure is not that important. Keeping key as string would be more flexible while we add more keys in describe output.

@zyzhu
Copy link
Contributor

zyzhu commented Aug 13, 2018

Another benefit of returning data frame instead of FSharp record is to display result in good format in Jupyter notebook. Data frame has a good formatter already. That’s key for Deedle to be adopted by more data scientists.

@alexpantyukhin
Copy link
Contributor Author

Actually I wanted to have some blance between pandas and f# features. That's why the the result of describe is series. But we always know what list of metrics we would have (for now just 4), and it's possible to have just enum StatisticItem for keys.

String as a key also could be possible, but I think that if we could say more strictly what we have in series - is good. Also as @zyzhu mentioned, having Series as result is typically for pandas and in Deedle having result as Series would have a good formating. Need to check how StatisticItem will be serialized to string well.

@zyzhu about two kinds of describe. As I understand it could be one type, but different part filled. For example for the panda use df.describe(include='all').

This arguments made me to have exactly such changes what in this PR. Looking forward for interesting discussion :)

@sebhofer
Copy link

I completely agree with @zyzhu about describe being a tool used for exploration of the data. Why would you use this as part of a larger script instead of just calling the function you really need? That's why I would also go for presenting the result as a frame (or series for that matter) using strings as keys. I don't see why we would benefit from type safety here. That said, I also like @tpetricek's suggestion of returning a record, giving us a compromise between the two approaches.

@alexpantyukhin
Copy link
Contributor Author

Sorry, maybe I don't understand correctly.
Lets keep alltohether.

  • The following suggestion:
    series.describe() -> Series<StatisticItem,'V>
    frame.describe() -> Frame<Series<StatisticItem,'V>>

@tpetricek suggestions

  • with string:
    series.describe() -> Series<string,'V>
    frame.describe() -> Frame<Series<string,'V>>

  • with record:
    type StatisticDescription =
    { Min : float
    Max : float
    Mean : float
    Std : float }

    series.describe() -> StatisticDescription
    frame.describe() -> StatisticDescription list

Table of simple comparison.

Suggestion Simple display Type safety Close to pandas
currentPr + + +/-
with string key + - +
with record need to write self + -

I also kept in mind solutions with string and record when I coding the current approach. But I made a such comaprison and found that the current decision has more advantages.
@sebhofer could you please share some disadvantage of using the StatisticItem for the key. Of course we can have a string as key, but do we really need to do it if we already have all advantages of using string +small benefit of type checking?

@tpetricek
Copy link
Member

@alexpantyukhin Your table looks good to me. I think the only disadvantage of the current PR version is that it introduces a new pattern that we are not using anywhere else in the library. It's not necessarily a bad pattern, but if we only use it in one place, it will appear strange and people will have to figure out what it actually means (and how to work with the DU values). Both records and frame/series with string keys are more straightforward to understand.

@alexpantyukhin
Copy link
Contributor Author

@tpetricek Oh, I see, then the result table would be something like that :)

Suggestion Simple display Type safety Close to pandas Clear for user
currentPr + + +/- -
with string key + - + +
with record need to write self + - +

Then I vote for having string keys, and have it in the docs.

@tpetricek
Copy link
Member

@alexpantyukhin That's roughly what I was thinking! Although there is the usual caveat that "clear for user" is very vaguely defined property, so it really is "tomas thinks it'll be clear for user". But I'm afraid that's all we can get without running a controlled experiment on our users or something like that :-).

@zyzhu
Copy link
Contributor

zyzhu commented Aug 13, 2018

@alexpantyukhin I really like your table analysis. We shall apply similar logic when we add new functionality or refactor current methods.

@zyzhu
Copy link
Contributor

zyzhu commented Aug 14, 2018

@alexpantyukhin I've just merged my pull which make stats functions inline to support inputs such as integer or decimals. Could you follow what I did and make describe function inline and output series/frame with string keys?

@alexpantyukhin
Copy link
Contributor Author

@zyzhu yes, sure! I will.

@alexpantyukhin
Copy link
Contributor Author

@zyzhu Done! rebased and fixed.

@zyzhu
Copy link
Contributor

zyzhu commented Aug 14, 2018

Looks good. I'll see what I can do about min and max to make it not optional and then we can refactor this.

@zyzhu zyzhu merged commit 93c7a98 into fslaborg:master Aug 14, 2018
@alexpantyukhin alexpantyukhin deleted the add_describe branch August 14, 2018 14:22
@alexpantyukhin
Copy link
Contributor Author

thanks! what if to make it nan?

///
/// [category:Series statistics]
static member inline describe (series:Series<'K, 'V>) =
match series with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting a warning here saying that this will make 'V a float:

warning FS0064: This construct causes code to be less generic than indicated by the type annotations. The type variable 'V has been constrained to be type 'float'.

You can fix that by adding box, but then you also need to specify the type argument for the series key in the pattern matching:

match box series with
| :? Series<'K, float> as floatSeries -> 

Also, I think the returned series should probably not be Series<string, float option> but just Series<string, float> because series has missing values built-in. To do that, you can replace Series.ofObservations with Series.ofOptionalObservations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tpetricek I already refactored it in 6ba06fd. There's no longer warning. It will output Series<string, float>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Sorry, I missed that! Sounds good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants