DataFrame.sample #49

thunterdb · 2019-03-29T13:10:52Z

This function will require more thought than most because Spark and pandas use the same function name (sample) with slightly different semantics.

Documentation:

Before designing what the expectations should be, here are some constraints:

The existing Spark code must still behave similarly.
The pandas code may have to pass arguments by name to be compatible. This is usually standard practice anyway.

Some questions which the design doc should explore:

When a number of items is requested, should it return a pandas or a Spark DataFrame? I expect a Spark DataFrame.
Should the number of elements returned be exact? I would expect so, since that is the whole point of specifying the number of elements.
Should the elements always be the same? This is very hard to do with the current implementation of sample() in Spark, so it would have to be changed a bit.
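To make the difference in semantics concrete, here is a minimal single-process sketch in plain Python. It is not Spark or pandas code: the helper names sample_by_fraction and sample_exact are hypothetical, and the sketch only assumes that fraction-based sampling keeps each row independently with a given probability, while count-based sampling returns exactly the requested number of rows.

```python
import random


def sample_by_fraction(rows, fraction, seed=None):
    """Fraction-style sampling: keep each row independently with
    probability `fraction`. The result size is only approximately
    fraction * len(rows)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def sample_exact(rows, n, seed=None):
    """Count-style sampling: return exactly `n` distinct rows."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


rows = list(range(1000))
approx = sample_by_fraction(rows, 0.1, seed=0)
exact = sample_exact(rows, 100, seed=0)

print(len(approx))  # close to 100, but not guaranteed to equal it
print(len(exact))   # always 100
```

A fixed seed makes both helpers reproducible here, but that says nothing about a distributed setting, which is where the third question above becomes difficult.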


AbdealiLoKo · 2019-03-29T13:58:36Z

Somewhat related PR: #48

rxin · 2019-05-14T23:33:37Z

I think we should start with a simple implementation that supports frac first. We can worry about how to do exact or approximate n later. Basically, support the following:

def sample(n, frac, replace):

and throw an exception if n is specified.
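A rough sketch of what this proposal could look like, not the code that was eventually merged. It assumes the wrapper holds a PySpark DataFrame (called sdf here, a hypothetical name) and delegates to its sample(withReplacement, fraction, seed) method. The random_state parameter is an extra assumption borrowed from the pandas signature and is not part of the signature above.

```python
def sample(sdf, n=None, frac=None, replace=False, random_state=None):
    """Frac-only sampling sketch.

    `sdf` is assumed to be a PySpark DataFrame (or any object with a
    compatible `sample` method). `random_state` is an assumed addition
    that maps to Spark's `seed`.
    """
    if n is not None:
        # Exact or approximate `n` is deliberately left for later.
        raise NotImplementedError(
            "sample() does not support `n` yet; use `frac` instead."
        )
    if frac is None:
        raise ValueError("`frac` must be specified.")

    # Delegate to Spark's fraction-based sampling, so the number of
    # returned rows is approximate rather than exact.
    return sdf.sample(withReplacement=replace, fraction=frac, seed=random_state)
```

Callers would pass arguments by name, for example sample(sdf, frac=0.1, random_state=42), which matches the constraint about named arguments in the original post.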


AbdealiLoKo mentioned this issue Mar 29, 2019

Adding a strict setting to force pandas compatibility at the cost of efficiency #47

Closed

garawalid mentioned this issue Apr 21, 2019

Create a design principles doc #119

Closed

rxin added the enhancement New feature or request label May 14, 2019

rxin mentioned this issue May 15, 2019

Implement basic sample function #327

Merged

HyukjinKwon closed this as completed in #327 May 15, 2019

HyukjinKwon pushed a commit that referenced this issue May 15, 2019

Implement basic sample function (#327)

d2231c7

Resolves #49
