Docs: Update single-machine.rst with corrections [skip ci] #4020
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,24 +3,24 @@ Single-Machine Scheduler | |
|
||
The default Dask scheduler provides parallelism on a single machine by using | ||
either threads or processes. It is the default choice used by Dask because it | ||
requires no setup. You don't need to make any choices or set anything up to | ||
use this scheduler, however you do have a choice between threads and processes: | ||
requires no setup. You don't need to make any choices or set anything up to | ||
use this scheduler. However, you do have a choice between threads and processes: | ||
|
||
1. **Threads**: Use multiple threads in the same process. This option is good | ||
for numeric code that releases the GIL_ (like NumPy, Pandas, Scikit-Learn, | ||
Numba, ...) because data is free to share. This is the default scheduler for | ||
``dask.array``, ``dask.dataframe``, and ``dask.delayed`` | ||
``dask.array``, ``dask.dataframe``, and ``dask.delayed``. | ||
|
||
2. **Processes**: Send data to separate processes for processing. This option | ||
is good when operating on pure Python objects like strings or JSON-like | ||
dictionary data that holds onto the GIL_ but not very good when operating | ||
on numeric data like Pandas dataframes or NumPy arrays. Using processes | ||
avoids GIL issues but can also result in a lot of inter-process | ||
dictionary data that holds onto the GIL_, but not very good when operating | ||
on numeric data like Pandas DataFrames or NumPy arrays. Using processes | ||
avoids GIL issues, but can also result in a lot of inter-process | ||
communication, which can be slow. This is the default scheduler for | ||
``dask.bag`` and is sometimes useful with ``dask.dataframe``. | ||
``dask.bag``, and it is sometimes useful with ``dask.dataframe``. | ||
|
||
Note that the dask.distributed scheduler is often a better choice when | ||
working with GIL-bound code. See :doc:`Dask.distributed on a single | ||
Note that the ``dask.distributed`` scheduler is often a better choice when | ||
working with GIL-bound code. See :doc:`dask.distributed on a single | ||
machine <single-distributed>`. | ||
|
||
3. **Single-threaded**: Execute computations in a single thread. This option | ||
|
@@ -35,11 +35,11 @@ use this scheduler, however you do have a choice between threads and processes: | |
Selecting Threads, Processes, or Single Threaded | ||
------------------------------------------------ | ||
|
||
Currently these options are available by selecting different ``get`` functions: | ||
Currently, these options are available by selecting different ``get`` functions: | ||
|
||
- ``dask.threaded.get``: The threaded scheduler | ||
- ``dask.multiprocessing.get``: The multiprocessing scheduler | ||
- ``dask.local.get_sync``: The single-threaded scheduler | ||
- ``dask.threaded.get``: The threaded scheduler. | ||
- ``dask.multiprocessing.get``: The multiprocessing scheduler. | ||
- ``dask.local.get_sync``: The single-threaded scheduler. | ||
I tend not to end phrases like these with periods because they are not full sentences.
Before deciding which one to use, I dug into the documentation a bit to see which style was more common, and I found several conflicting examples of how lists are terminated (or not). For example, here are some lists terminated with periods:
https://dask.pydata.org/en/latest/index.html#dask
https://dask.pydata.org/en/latest/support.html#discussion
https://dask.pydata.org/en/latest/support.html#asking-for-help
https://dask.pydata.org/en/latest/scheduling.html#scheduling
and the opposite as well:
https://dask.pydata.org/en/latest/spec.html#definitions
https://dask.pydata.org/en/latest/configuration.html#configuration
or even mixed usage:
https://dask.pydata.org/en/latest/dataframe.html#common-uses-and-anti-uses
It was hard to decide which one to use, so I picked the one that made more sense to me at first. However, after pondering this a bit and looking at the documentation of other projects, I came to the conclusion that it is best to avoid terminal punctuation on list items altogether, and that these changes (and others before this one) should be reverted or corrected to remove the punctuation at the end of each item. |
||
|
||
You can specify these functions in any of the following ways: | ||
|
||
|
@@ -67,6 +67,6 @@ You can specify these functions in any of the following ways: | |
Use the Distributed Scheduler | ||
----------------------------- | ||
|
||
The newer dask.distributed scheduler also works well on a single machine and | ||
The newer ``dask.distributed`` scheduler also works well on a single machine and | ||
Maybe just "The newer distributed scheduler"
I find switching to code highlighting like this a bit distracting. I think we need to find a nice way to talk about the distributed scheduler without using the module name. It could also be "Dask's newer distributed scheduler also works ..."
This seems better than using the highlighted module name. I'll rectify this change ASAP. |
||
offers more features and diagnostics. See :doc:`this page | ||
<single-distributed>` for more information. |
Thoughts on capitalization of DataFrames vs dataframes? The type name is DataFrame, but I wonder if we should use the uncapitalized version in normal prose. This would be analogous to using "arrays".
I don't think the lower-case form would be the best way to refer to tabular data like Pandas' DataFrames. For instance, what is the correct way to write it? Is it "dataframes" or "data frames"? Surely "data frames" is the correct way to write it, but it does not convey the same meaning as a Pandas or Dask DataFrame object. (Also, "dataframes" looks like a typo.)
Indeed, it would be nice to use dataframes as we use arrays in normal prose, but that comes at the cost of it most likely being used inconsistently alongside DataFrames. Now, if we just bite the bullet and adopt DataFrame in camel case, it would avoid inconsistent or random usage throughout the text, and it would always be correct. Also, "DataFrames" is how it is written exclusively in the pandas docs (no exceptions as far as I can see).
If the goal is to avoid inconsistencies of this kind, I think that DataFrames would be the foolproof way.
OK, happy to defer to the convention in the Pandas docs
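For readers unfamiliar with the ``get`` functions named in the diff, here is a rough sketch of what "selecting a ``get`` function" means. It uses only the standard library and is not Dask's implementation: `toy_get_sync` and `toy_get_threaded` are hypothetical helpers, and the graph is a simplified stand-in for a Dask-style task graph (a dict mapping keys to values or `(function, *args)` tuples).

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul


def _is_key(graph, value):
    # Assumption for this sketch: keys are always strings.
    return isinstance(value, str) and value in graph


def toy_get_sync(graph, key):
    """Evaluate `key` in the calling thread (hypothetical, not Dask's code)."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name other keys are computed first, recursively.
        resolved = [toy_get_sync(graph, a) if _is_key(graph, a) else a for a in args]
        return func(*resolved)
    return task  # A plain value, not a task.


def toy_get_threaded(graph, keys, num_workers=4):
    """Evaluate several keys concurrently using threads in one process."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda k: toy_get_sync(graph, k), keys))


graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "prod": (mul, "sum", 10),
}

single = toy_get_sync(graph, "prod")
threaded = toy_get_threaded(graph, ["sum", "prod"])
```

Both helpers accept the same graph and differ only in how the work is executed, which is the idea behind swapping ``dask.threaded.get``, ``dask.multiprocessing.get``, and ``dask.local.get_sync``. Unlike a real scheduler, this sketch recomputes shared dependencies rather than caching them.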