
Docs: Update single-machine.rst with corrections [skip ci] #4020

Merged
merged 3 commits into from Sep 29, 2018
28 changes: 14 additions & 14 deletions docs/source/setup/single-machine.rst
@@ -3,24 +3,24 @@ Single-Machine Scheduler

The default Dask scheduler provides parallelism on a single machine by using
either threads or processes. It is the default choice used by Dask because it
-requires no setup. You don't need to make any choices or set anything up to
-use this scheduler, however you do have a choice between threads and processes:
+requires no setup. You don't need to make any choices or set anything up to
+use this scheduler. However, you do have a choice between threads and processes:

1. **Threads**: Use multiple threads in the same process. This option is good
for numeric code that releases the GIL_ (like NumPy, Pandas, Scikit-Learn,
Numba, ...) because data is free to share. This is the default scheduler for
-   ``dask.array``, ``dask.dataframe``, and ``dask.delayed``
+   ``dask.array``, ``dask.dataframe``, and ``dask.delayed``.

2. **Processes**: Send data to separate processes for processing. This option
is good when operating on pure Python objects like strings or JSON-like
-   dictionary data that holds onto the GIL_ but not very good when operating
-   on numeric data like Pandas dataframes or NumPy arrays. Using processes
-   avoids GIL issues but can also result in a lot of inter-process
+   dictionary data that holds onto the GIL_, but not very good when operating
+   on numeric data like Pandas DataFrames or NumPy arrays. Using processes
Member:

Thoughts on capitalization of DataFrames vs dataframes? The type name is DataFrame, but I wonder if we should use the uncapitalized version in normal prose. This would be analogous to using "arrays".

Member Author:

I don't think using the lower-case format would be the best way to refer to tabular data like Pandas' DataFrames. For instance, what is the correct way to write it? Is it "dataframes" or "data frames"? Surely "data frames" is the correct way to write it, but it does not convey the same meaning as a Pandas or Dask DataFrame object. (Also, "dataframes" looks like a typo.)

Indeed, it would be nice to use "dataframes" as we use "arrays" in normal prose, but that would most likely come at the cost of its being used randomly alongside "DataFrames". If we just bite the bullet and adopt DataFrame in camel case, it would avoid inconsistencies or random usage throughout the text, and it would always be correct and consistent. Also, "DataFrames" is how it is written exclusively in the pandas docs (no exceptions that I can see).

If the goal is to avoid inconsistencies of this kind, I think that DataFrames would be the foolproof way.

Member:

OK, happy to defer to the convention in the Pandas docs.

+   avoids GIL issues, but can also result in a lot of inter-process
communication, which can be slow. This is the default scheduler for
-   ``dask.bag`` and is sometimes useful with ``dask.dataframe``.
+   ``dask.bag``, and it is sometimes useful with ``dask.dataframe``.

-   Note that the dask.distributed scheduler is often a better choice when
-   working with GIL-bound code. See :doc:`Dask.distributed on a single
+   Note that the ``dask.distributed`` scheduler is often a better choice when
+   working with GIL-bound code. See :doc:`dask.distributed on a single
machine <single-distributed>`.

3. **Single-threaded**: Execute computations in a single thread. This option
@@ -35,11 +35,11 @@ use this scheduler, however you do have a choice between threads and processes:
Selecting Threads, Processes, or Single Threaded
------------------------------------------------

-Currently these options are available by selecting different ``get`` functions:
+Currently, these options are available by selecting different ``get`` functions:

-- ``dask.threaded.get``: The threaded scheduler
-- ``dask.multiprocessing.get``: The multiprocessing scheduler
-- ``dask.local.get_sync``: The single-threaded scheduler
+- ``dask.threaded.get``: The threaded scheduler.
+- ``dask.multiprocessing.get``: The multiprocessing scheduler.
+- ``dask.local.get_sync``: The single-threaded scheduler.
Member:

I tend not to end phrases like these with periods because they are not full sentences.

Member Author:

Before deciding which one to use, I dug into the documentation a bit to see which way was more common, and I found several conflicting examples of how numbered lists are terminated (or not). For example, here are some samples of lists terminated with periods:

https://dask.pydata.org/en/latest/index.html#dask

https://dask.pydata.org/en/latest/support.html#discussion

https://dask.pydata.org/en/latest/support.html#asking-for-help

https://dask.pydata.org/en/latest/scheduling.html#scheduling

and the opposite as well:

https://dask.pydata.org/en/latest/spec.html#definitions

https://dask.pydata.org/en/latest/configuration.html#configuration

or even mixed use cases:

https://dask.pydata.org/en/latest/dataframe.html#common-uses-and-anti-uses

It was hard to decide which one to use, so I picked the one that made more sense to me at first.

However, after pondering this a bit and looking at other projects' documentation, I came to the conclusion that it is best to avoid terminal punctuation in such lists altogether, and that these changes (and others before this one) should be reverted to remove the trailing periods.
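For readers skimming this diff: all three ``get`` functions listed above share one calling convention, ``get(dsk, keys)``, so they can be swapped on the same hand-built task graph. Below is a minimal sketch against the Dask API of this era (illustrative only; the doc's own usage examples are collapsed further down, and later Dask versions select schedulers with strings like ``scheduler="threads"`` instead):

```python
from operator import add

import dask.local
import dask.multiprocessing
import dask.threaded

# A tiny hand-built task graph: z = add(x, y) = add(1, 2)
dsk = {"x": 1, "y": 2, "z": (add, "x", "y")}

if __name__ == "__main__":  # guard needed because one scheduler spawns worker processes
    print(dask.threaded.get(dsk, "z"))         # threaded scheduler        -> 3
    print(dask.multiprocessing.get(dsk, "z"))  # multiprocessing scheduler -> 3
    print(dask.local.get_sync(dsk, "z"))       # single-threaded scheduler -> 3
```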


You can specify these functions in any of the following ways:

@@ -67,6 +67,6 @@ You can specify these functions in any of the following ways:
Use the Distributed Scheduler
-----------------------------

-The newer dask.distributed scheduler also works well on a single machine and
+The newer ``dask.distributed`` scheduler also works well on a single machine and
Member:

Maybe just "The newer distributed scheduler"

Member:

I find switching to code highlighting like this a bit distracting. I think we need to find a nice way to talk about the distributed scheduler without using the module name.

Could also be "Dask's newer distributed scheduler also works ..."

Member Author:

This way seems better than using the highlighted module name. I'll rectify this change ASAP.

offers more features and diagnostics. See :doc:`this page
<single-distributed>` for more information.
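As a closing note for readers following that pointer: starting the distributed scheduler on a single machine is a one-liner with the standard ``dask.distributed`` ``Client``. A minimal sketch (the bag computation is just a stand-in workload):

```python
import dask.bag as db
from dask.distributed import Client

if __name__ == "__main__":
    # Client() with no address spins up a local cluster of worker
    # processes and registers itself as the default Dask scheduler.
    client = Client()

    # Any Dask collection now runs on the distributed scheduler.
    total = db.from_sequence(range(100)).map(lambda x: x ** 2).sum().compute()
    print(total)  # 328350

    client.close()
```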