Change rows in dask.dataframe #653

Closed
mrocklin opened this issue Sep 2, 2015 · 6 comments

@mrocklin
Member

mrocklin commented Sep 2, 2015

This Stack Overflow question raises a valid point:

How do we change a few rows in a dask.dataframe?

E.g. how do we change all negative entries to NaN? How do we change a particular column in a particular index range to zero?

The equivalent column operations are handled by assign. Is there an analogous row-wise operation within pandas that we should copy? The dask.array version of this is possibly something like where.
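For concreteness, here is a minimal pandas-only sketch of the two cases asked about above. The frame, its column names, and the index range 1 to 2 are made up for illustration, and whether dask.dataframe should expose the same calls is exactly what this issue leaves open.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; not taken from the Stack Overflow question.
df = pd.DataFrame({"a": [1.0, -2.0, 3.0, -4.0], "b": [-1.0, 5.0, -6.0, 7.0]})

# Case 1: keep non-negative entries, replace everything else with NaN.
# where() returns a new frame and leaves df untouched.
no_negatives = df.where(df >= 0)

# Case 2: set column "a" to zero for index labels 1 through 2 (inclusive).
# mask() replaces values where the condition is True; assign() returns a new frame.
in_range = (df.index >= 1) & (df.index <= 2)
zeroed = df.assign(a=df["a"].mask(in_range, 0.0))
```

Both results are new objects, so the pattern composes in the same way assign does for columns.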

@jreback
Contributor

jreback commented Sep 2, 2015

http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking

but note that .where is equivalent to s[mask] (i.e. indexing)

@mrocklin
Member Author

mrocklin commented Sep 2, 2015

Cool. Is there a similar function to change values based on loc and columns? For example the SO questioner asks about the following:

df.loc[[2, 6], 'a'] = np.pi

@jreback
Contributor

jreback commented Sep 2, 2015

This should work directly like the above; is there an issue?

@mrocklin
Member Author

mrocklin commented Sep 2, 2015

I'm looking for syntax that doesn't involve mutating the underlying dataframe. I like assign because it accomplishes this use case

df['c'] = df.a + df.b          # mutates df in place
df = df.assign(c=df.a + df.b)  # returns a new dataframe instead

I think that I'm looking for the same thing that operates on row ranges rather than columns.
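One way to get a non-mutating version of the `df.loc[[2, 6], 'a'] = np.pi` example in plain pandas is to combine `Series.mask` with `assign`. This is only a sketch under assumed data (an eight-row frame with columns `a` and `b`), not a proposal for the dask API.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with index labels 0..7 so that labels 2 and 6 exist.
df = pd.DataFrame({"a": np.arange(8, dtype=float),
                   "b": np.arange(8, dtype=float) * 10})

# Boolean row selector for index labels 2 and 6.
rows = df.index.isin([2, 6])

# mask() returns a new Series with np.pi wherever `rows` is True;
# assign() then returns a new frame, so df itself is never modified.
updated = df.assign(a=df["a"].mask(rows, np.pi))
```

The row selection is expressed as a boolean condition rather than a label lookup, which is what lets it go through mask instead of an in-place setitem.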

@jreback
Contributor

jreback commented Sep 2, 2015

Oh, then .where is your man, so to speak. It returns a new frame (.mask is the inverse).

In [1]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('AB'))

In [2]: df
Out[2]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [3]: df.where(df['A']>6,-df)
Out[3]: 
   A  B
0  0 -1
1 -2 -3
2 -4 -5
3 -6 -7
4  8  9

In [4]: df.mask(df['A']>6,-df)
Out[4]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4 -8 -9

In [5]: df
Out[5]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
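The session above can be restated as a small script that checks the "inverse" relationship directly: `where` with a condition should give the same frame as `mask` with the negated condition. This is a sketch on the same toy frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list("AB"))
cond = df["A"] > 6

# where(): keep rows where cond holds, take values from -df elsewhere.
kept_where_true = df.where(cond, -df)

# mask(): replace rows where the (negated) condition holds, keep the rest.
replaced_where_false = df.mask(~cond, -df)
```

Neither call modifies `df`, which matches the final `In [5]` output in the session.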

@jreback
Contributor

jreback commented Sep 2, 2015

The original purpose is actually masking (while returning a frame of the same shape):

In [15]: df.where(df['A']>6)
Out[15]: 
    A   B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4   8   9
