Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11969: [Rust][DataFusion] Improve Examples in documentation #9710

Closed
wants to merge 2 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Mar 15, 2021

Rationale

When people find datafusion on crates.io they want both a textual description of what it does as well as some example code to see if it would help them. For example, look at how tokio does it: https://crates.io/crates/tokio

Changes

  1. Add an example on the main README.md of datafusion (that appears on the crates.io homepage) that shows a prospective user what DataFusion offers, lifted from the existing example from https://docs.rs/datafusion/3.0.0/datafusion
  2. Add formatted output to the rustdoc examples in the main lib.rs page to show better what the output is
  3. Fix some rustdoc warnings I noticed while testing

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@alamb alamb changed the title [Rust][DataFusion] Improve Examples in documentation ARROW-11969: [Rust][DataFusion] Improve Examples in documentation Mar 15, 2021
@github-actions
Copy link

@alamb alamb marked this pull request as ready for review March 15, 2021 21:58

Use the DataFrame API to process data stored in a CSV:

```rust
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking: this might be a little bit fun with API churn, e.g. I believe the input expr ownership work you've recently opened would change these from slices to vecs and we don't have a way to catch that automatically like we do for the in-crate docs (am I right in thinking that cargo test runs all doctests?).

Edit: to be clear, I don't think it's a reason to not do it, just curious if anyone has ideas for how to prevent doc drift :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point @returnString

The way I justified the danger of drift to myself was "the main usecase of this documentation (the overview) is likely to help them answer the question of "should I even bother to try and use this crate". Once they decide to try and actually use the crate they will look at the real docs on docs.rs (from which they can copy/paste).

For the purpose of an example of "what does this library do" I felt even a slightly out of date example might be valuable.

Or maybe I am just trying to pad my github stats ;) But in all seriousness I am not committed to this PR. If it isn't a good idea I can just close it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me; agreed that (personally, at least) I'll give less consideration to projects without simple readme examples.

It balloons the scope of this PR quite a lot so I'm not saying this is a good idea, but I just did a bit of digging and it looks like people have gone through this particular problem before: https://blog.guillaume-gomez.fr/articles/2019-04-13+Keeping+Rust+projects%27+README.md+code+examples+up-to-date

And the end result of that is https://crates.io/crates/doc-comment, which looks like it'll wire up any rust-tagged code blocks in external files as doctests, optionally only for #[cfg(test)].

If it's useful, I could log a followup task to integrate that and take a look at it myself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@returnString https://crates.io/crates/doc-comment looks super awesome -- I think that would be most helpful

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.limit(100)?
.collect().await?;

let results: Vec<RecordBatch> = df.collect().await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed when using this as a test for #9749 that we're collecting twice here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol "testing for the win!"

@alamb
Copy link
Contributor Author

alamb commented Mar 18, 2021

I plan to wait for #9749 to be merged and then rebase this / ensure it passes the tests !

alamb pushed a commit that referenced this pull request Mar 19, 2021
…e readme examples remain valid

As discussed [here](#9710 (comment)), we were looking into how we might add code examples to the DataFusion readme whilst keeping them in sync with reality as we go through API revisions etc.

This PR pulls in a new dev dependency, `doc-comment`, which allows for detecting all the `rust`-tagged code blocks in a Markdown file and treating them as doctests, and wires this up for `README.md`.

My only concerns are:
- because the end result is a full-blown doctest, you do need to make sure imports etc are present, which makes the samples more verbose than some people would perhaps like
- again on the verbosity front: we have lots of async code which requires a `#[tokio::main] async fn main() { ... }` wrapper

Neither of these are inherently bad imo, but worth noting upfront.

As an example of a readme sample that passes as a doctest (borrowed from @alamb's latest documentation PR, #9710):

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  let mut ctx = ExecutionContext::new();
  // create the dataframe
  let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;

  let df = df.filter(col("a").lt_eq(col("b")))?
            .aggregate(&[col("a")], &[min(col("b"))])?
            .limit(100)?;

  let results: Vec<RecordBatch> = df.collect().await?;
  print_batches(&results)?;

  Ok(())
}
```

Closes #9749 from returnString/readme_doctest

Authored-by: Ruan Pearce-Authers <ruanpa@outlook.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb alamb closed this in 2fab404 Mar 23, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…e readme examples remain valid

As discussed [here](apache#9710 (comment)), we were looking into how we might add code examples to the DataFusion readme whilst keeping them in sync with reality as we go through API revisions etc.

This PR pulls in a new dev dependency, `doc-comment`, which allows for detecting all the `rust`-tagged code blocks in a Markdown file and treating them as doctests, and wires this up for `README.md`.

My only concerns are:
- because the end result is a full-blown doctest, you do need to make sure imports etc are present, which makes the samples more verbose than some people would perhaps like
- again on the verbosity front: we have lots of async code which requires a `#[tokio::main] async fn main() { ... }` wrapper

Neither of these are inherently bad imo, but worth noting upfront.

As an example of a readme sample that passes as a doctest (borrowed from @alamb's latest documentation PR, apache#9710):

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  let mut ctx = ExecutionContext::new();
  // create the dataframe
  let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;

  let df = df.filter(col("a").lt_eq(col("b")))?
            .aggregate(&[col("a")], &[min(col("b"))])?
            .limit(100)?;

  let results: Vec<RecordBatch> = df.collect().await?;
  print_batches(&results)?;

  Ok(())
}
```

Closes apache#9749 from returnString/readme_doctest

Authored-by: Ruan Pearce-Authers <ruanpa@outlook.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
# Rationale

When people find `datafusion` on crates.io they want both a textual description of what it does as well as some example code to see if it would help them. For example, look at how tokio does it: https://crates.io/crates/tokio

# Changes
1.  Add an example on the main README.md of datafusion (that appears on the crates.io homepage) that shows a prospective user what DataFusion offers, lifted from the existing example from https://docs.rs/datafusion/3.0.0/datafusion
2. Add formatted output to the `rustdoc` examples in the main `lib.rs` page to show better what the output is
3. Fix some rustdoc warnings I noticed while testing

Closes apache#9710 from alamb/alamb/better_examples

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…e readme examples remain valid

As discussed [here](apache#9710 (comment)), we were looking into how we might add code examples to the DataFusion readme whilst keeping them in sync with reality as we go through API revisions etc.

This PR pulls in a new dev dependency, `doc-comment`, which allows for detecting all the `rust`-tagged code blocks in a Markdown file and treating them as doctests, and wires this up for `README.md`.

My only concerns are:
- because the end result is a full-blown doctest, you do need to make sure imports etc are present, which makes the samples more verbose than some people would perhaps like
- again on the verbosity front: we have lots of async code which requires a `#[tokio::main] async fn main() { ... }` wrapper

Neither of these are inherently bad imo, but worth noting upfront.

As an example of a readme sample that passes as a doctest (borrowed from @alamb's latest documentation PR, apache#9710):

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  let mut ctx = ExecutionContext::new();
  // create the dataframe
  let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;

  let df = df.filter(col("a").lt_eq(col("b")))?
            .aggregate(&[col("a")], &[min(col("b"))])?
            .limit(100)?;

  let results: Vec<RecordBatch> = df.collect().await?;
  print_batches(&results)?;

  Ok(())
}
```

Closes apache#9749 from returnString/readme_doctest

Authored-by: Ruan Pearce-Authers <ruanpa@outlook.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
# Rationale

When people find `datafusion` on crates.io they want both a textual description of what it does as well as some example code to see if it would help them. For example, look at how tokio does it: https://crates.io/crates/tokio

# Changes
1.  Add an example on the main README.md of datafusion (that appears on the crates.io homepage) that shows a prospective user what DataFusion offers, lifted from the existing example from https://docs.rs/datafusion/3.0.0/datafusion
2. Add formatted output to the `rustdoc` examples in the main `lib.rs` page to show better what the output is
3. Fix some rustdoc warnings I noticed while testing

Closes apache#9710 from alamb/alamb/better_examples

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants