
Implements DataFrame.persist() with additional tests for DataFrame.cache() #1381

Merged
merged 4 commits into databricks:master from f_persist on Mar 31, 2020

Conversation

itholic
Contributor

@itholic itholic commented Mar 31, 2020

Resolves #1373

Here, we have a `DataFrame` named `df`:

```python
>>> import pyspark
>>> import databricks.koalas as ks
>>> df = ks.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0   0.2   0.3
1   0.0   0.6
2   0.6   0.0
3   0.2   0.1
```

Set the StorageLevel to `MEMORY_ONLY`.

```python
>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.count())
...
dogs    4
cats    4
Name: 0, dtype: int64
```

Set the StorageLevel to `DISK_ONLY`.

```python
>>> with df.persist(pyspark.StorageLevel.DISK_ONLY) as cached_df:
...     print(cached_df.count())
...
dogs    4
cats    4
Name: 0, dtype: int64
```

If no StorageLevel is given, `MEMORY_AND_DISK` is used by default.

```python
>>> with df.persist() as cached_df:
...     print(cached_df.count())
...
dogs    4
cats    4
Name: 0, dtype: int64
```
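
The `with` block is expected to release the cache on exit. For a longer-lived cache, here is a minimal sketch of managing the lifetime explicitly, assuming the object returned by `persist()` exposes an `unpersist()` method like the cached frame returned by `DataFrame.cache()`:

```python
>>> cached_df = df.persist(pyspark.StorageLevel.DISK_ONLY)  # cached until unpersist()
>>> print(cached_df.count())
dogs    4
cats    4
Name: 0, dtype: int64
>>> cached_df.unpersist()  # release the cached data explicitly when done
```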

@itholic itholic changed the title Implements DataFrame.persist() Implements DataFrame.persist() & Adding test for DataFrame.cache() Mar 31, 2020
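
As the retitle notes, the change also adds tests around `DataFrame.cache()`. Below is a hypothetical sketch of the kind of check such tests might perform (not the PR's actual test code, which lives in the Koalas test suite): caching, with any storage level, should not change results.

```python
import unittest

import pyspark
import databricks.koalas as ks


class PersistCacheSketch(unittest.TestCase):
    def test_results_unchanged_by_caching(self):
        df = ks.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
                          columns=['dogs', 'cats'])
        expected = df.to_pandas().sort_index()

        # persist() with an explicit storage level should not change results.
        with df.persist(pyspark.StorageLevel.DISK_ONLY) as cached_df:
            self.assertTrue(cached_df.to_pandas().sort_index().equals(expected))

        # cache() (default MEMORY_AND_DISK level) should behave the same way.
        with df.cache() as cached_df:
            self.assertTrue(cached_df.to_pandas().sort_index().equals(expected))


if __name__ == "__main__":
    unittest.main()
```
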
@codecov-io

codecov-io commented Mar 31, 2020

Codecov Report

Merging #1381 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1381   +/-   ##
=======================================
  Coverage   95.23%   95.24%           
=======================================
  Files          34       34           
  Lines        7792     7799    +7     
=======================================
+ Hits         7421     7428    +7     
  Misses        371      371           
| Impacted Files | Coverage Δ |
| --- | --- |
| databricks/koalas/frame.py | 96.80% <100.00%> (+0.01%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5130b99...52b5539. Read the comment docs.

@HyukjinKwon HyukjinKwon changed the title Implements DataFrame.persist() & Adding test for DataFrame.cache() Implements DataFrame.persist() with additional tests for DataFrame.cache() Mar 31, 2020
Collaborator

@ueshin ueshin left a comment


LGTM.

@ueshin
Collaborator

ueshin commented Mar 31, 2020

Thanks! merging.

@ueshin ueshin merged commit 1e3e093 into databricks:master Mar 31, 2020
HyukjinKwon pushed a commit that referenced this pull request Apr 1, 2020
Implements DataFrame.persist() with additional tests for DataFrame.cache() (#1381)

@itholic itholic deleted the f_persist branch April 1, 2020 11:45
Development

Successfully merging this pull request may close these issues.

df.cache() question (#1373)
3 participants