[RUNTIME] Support setting CPU affinity #1403
Conversation
src/runtime/thread_pool.cc
Outdated
std::vector<std::pair <unsigned int, int64_t>> max_freqs;
std::vector<unsigned int> sorted_order;

for (unsigned int i = 0; i < threads; i++) {
The frequency query needs to happen in threading_backend.cc.
Restriction: the thread pool cannot contain IO-related code, to accommodate SGX.
src/runtime/thread_pool.cc
Outdated
std::vector<unsigned int> sorted_order;

for (unsigned int i = 0; i < threads; i++) {
snprintf(filepath, sizeof(filepath),
Use std::ostringstream to construct the filename.
src/runtime/thread_pool.cc
Outdated
LOG(WARNING) << "CPU max frequency info empty!";
max_freqs.push_back(std::make_pair(i, cur_freq));
} else {
LOG(WARNING) << "failed to read CPU max frequency!";
If the file is not available, consider adding a default value.
src/runtime/thread_pool.cc
Outdated
ThreadPool::preferred_num = preferred_num;
}

void setNthreadsPref(int nthreads) {
Style: use CamelCase for function names.
src/runtime/thread_pool.cc
Outdated
.set_body([](TVMArgs args, TVMRetValue* rv) {
std::string mode = args[0];
int nthreads = args[1];
if (mode == "big") {
Use integer enums to minimize runtime overhead: 0 -> default, +1 -> big, -1 -> little.
I don't know how enum types work over the Python ffi; do you mean just use an actual integer here?
src/runtime/thread_pool.cc
Outdated
@@ -242,14 +244,18 @@ class SpscTaskQueue {
class ThreadPool {
public:
ThreadPool(): num_workers_(tvm::runtime::threading::MaxConcurrency()) {
if (nthreads != 0) {
num_workers_ = nthreads;
This seems problematic: it does not allow dynamic switching.
This is not necessary; num_workers is set directly, we just need a num_threads_ (restricted to be smaller than num_workers).
(Basically this was because I wasn't sure whether mutating the threads vector was safe, same as the previous case.)
*/
ThreadGroup(int num_workers,
std::function<void(int)> worker_callback,
bool exclude_worker0 = false);
bool exclude_worker0 = false,
If SetAffinity is exposed in some way in ThreadingBackend, we do not need these arguments here. The affinity-setting function can be called in a second round.
same as above, if SetAffinity needs to change (add/remove threads, is that safe?)
src/runtime/thread_pool.cc
Outdated
}

TVM_REGISTER_GLOBAL("runtime.config_threadpool")
.set_body([](TVMArgs args, TVMRetValue* rv) {
There are two parts that matter here: the affinity-order configuration needs to happen inside the threading backend, while the nthreads configuration needs to happen here.
src/runtime/thread_pool.cc
Outdated
}

auto max = [] (std::pair<unsigned int, int64_t> a, std::pair<unsigned int, int64_t> b) {
return a.second > b.second;
No need to have two code paths: sort CPUs by descending frequency and always use that information.
*/
ThreadGroup(int num_workers,
std::function<void(int)> worker_callback,
bool exclude_worker0 = false);
bool exclude_worker0 = false,
std::vector<unsigned int> *affinity_order = NULL);
~ThreadGroup();
Consider adding:
// mode: big, little, default
// nthreads can be 0, which takes the maximum number
// returns the real number of threads needed
int ConfigureAffinity(bool exclude_worker0, int mode, int nthreads);
src/runtime/thread_pool.cc
Outdated
int mode = args[1];
int nthreads = args[2];
std::vector<unsigned int> sorted_order;
unsigned int num_workers_used = threading::configThreadGroup(mode, nthreads, &sorted_order);
This can be merged directly into ThreadLocal->UpdateWorkerConfig(mode, nthreads); which then calls thread_group_->Configure(exclude_worker0, mode, nthreads);
There is no need to pass the sorted list back into the thread pool.
src/runtime/threading_backend.cc
Outdated
// big or LITTLE
if (mode) {
for (unsigned int i = 0; i < threads; ++i) {
read the result once and store the sorted result inside the thread group
src/runtime/threading_backend.cc
Outdated
@@ -124,6 +147,62 @@ int MaxConcurrency() {
return std::max(max_concurrency, 1);
}

unsigned int configThreadGroup(int mode, int nthreads, std::vector<unsigned int> *sorted_order) {
No need to return the sorted order; we can just store the order inside the ThreadGroup and use it to support SetAffinity (Configure).
src/runtime/threading_backend.cc
Outdated
@@ -124,6 +147,62 @@ int MaxConcurrency() {
return std::max(max_concurrency, 1);
}

unsigned int configThreadGroup(int mode, int nthreads, std::vector<unsigned int> *sorted_order) {
Use a CamelCase function name; this can be a member of ThreadGroup and combined with SetAffinity.
src/runtime/thread_pool.cc
Outdated
std::vector<unsigned int> sorted_order;
unsigned int num_workers_used = threading::configThreadGroup(mode, nthreads, &sorted_order);
void UpdateWorkerConfig(int mode, int nthreads) {
unsigned int num_workers_used = threading::ConfigThreadGroup(mode, nthreads, threads_.get());
Again, combine threading::ConfigThreadGroup and SetAffinity into:
// rename num_workers_used_ -> num_active_workers_;
num_active_workers_ = threads_->Configure(exclude_worker0, mode, nthreads);
Change the constructor of ThreadGroup to the simple form, and call threads_->Configure(exclude_worker0, mode, num_workers_); in the constructor.
src/runtime/threading_backend.cc
Outdated
min_count_ = min_count;
}

bool AffinityOrderSet() {
This function is not really necessary.
src/runtime/threading_backend.cc
Outdated
@@ -147,51 +174,55 @@ int MaxConcurrency() {
return std::max(max_concurrency, 1);
}

unsigned int configThreadGroup(int mode, int nthreads, std::vector<unsigned int> *sorted_order) {
unsigned int ConfigThreadGroup(int mode, int nthreads, ThreadGroup *thread_group) {
Move this function into ThreadGroup::Impl
src/runtime/threading_backend.cc
Outdated
bool ThreadGroup::AffinityOrderSet() {
return impl_->AffinityOrderSet();
}
int ThreadGroup::GetPrefCount(bool reverse) {
remove this function, merge most logic into ThreadGroup::Impl
src/runtime/threading_backend.cc
Outdated
void ThreadGroup::SetAffinityOrder(std::vector<unsigned int> order, int max_count, int min_count) {
impl_->SetAffinityOrder(order, max_count, min_count);
}
bool ThreadGroup::AffinityOrderSet() {
remove this function, merge most logic into ThreadGroup::Impl
src/runtime/threading_backend.cc
Outdated
void ThreadGroup::SetAffinity(bool exclude_worker0, bool reverse) {
impl_->SetAffinity(exclude_worker0, reverse);
}
void ThreadGroup::SetAffinityOrder(std::vector<unsigned int> order, int max_count, int min_count) {
remove this function, merge most logic into ThreadGroup::Impl
src/runtime/threading_backend.cc
Outdated
void ThreadGroup::SetAffinity(bool exclude_worker0, const std::vector<unsigned int> *order,
bool reverse) {
impl_->SetAffinity(exclude_worker0, order, reverse);
void ThreadGroup::SetAffinity(bool exclude_worker0, bool reverse) {
remove this function, merge most logic into ThreadGroup::Impl
@@ -66,7 +66,16 @@ class ThreadGroup::Impl {
#endif
#if defined(__linux__) || defined(__ANDROID__)
lazily get the sorted_order here
*/
void SetAffinity(bool exclude_worker0, bool reverse = false);

/*!
Combine all the functions into one:
int ThreadGroup::Configure(bool exclude_worker0, int mode, int nthreads);
which calls into
// put all the logic of the functions into this function
int ThreadGroup::Impl::Configure(bool exclude_worker0, int mode, int nthreads);
src/runtime/thread_pool.cc
Outdated
@@ -297,6 +300,15 @@ class ThreadPool {
return dmlc::ThreadLocalStore<ThreadPool>::Get();
}

void UpdateWorkerConfig(int mode, int nthreads) {
unsigned int num_workers_used = threading::ConfigThreadGroup(mode, nthreads, threads_.get());
Change to:
num_active_worker_ = threads_->Configure(exclude_worker0_, mode, nthreads);
// bind worker threads to disjoint cores
// if worker 0 is offloaded to master, i.e. exclude_worker0 is true,
// the master thread is bound to core 0.
void SetAffinity(bool exclude_worker0) {
void SetAffinity(bool exclude_worker0, bool reverse = false) {
Merge all logic into Configure, and you can directly call Configure in the constructor of ThreadGroup
*
* \return The number of workers to use.
*/
unsigned int ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0);
Returning int is fine, as we use int to store the number of active workers.
Because it is already a member function of ThreadGroup, let us just rename it to Configure.
src/runtime/thread_pool.cc
Outdated
@@ -297,6 +300,14 @@ class ThreadPool {
return dmlc::ThreadLocalStore<ThreadPool>::Get();
}

void UpdateWorkerConfig(int mode, int nthreads) {
// this will also reset the affinity of the ThreadGroup
unsigned int num_workers_used = threads_->ConfigThreadGroup(mode, nthreads,
just do the assignment directly
src/runtime/threading_backend.cc
Outdated
@@ -45,11 +46,71 @@ class ThreadGroup::Impl {
}
}

unsigned int ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0) {
unsigned int threads = std::thread::hardware_concurrency();
std::vector<std::pair <unsigned int, int64_t>> max_freqs;
int64_t>> -> int64_t> > (add a space), for compatibility with some older compilers, which parse >> as the right-shift operator.
src/runtime/threading_backend.cc
Outdated
@@ -45,11 +46,71 @@ class ThreadGroup::Impl {
}
}

unsigned int ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0) {
unsigned int threads = std::thread::hardware_concurrency();
Call MaxConcurrency() because some special rule might apply here
src/runtime/threading_backend.cc
Outdated
ifs.close();
max_freqs.push_back(std::make_pair(i, cur_freq));
if (cur_freq < 0) {
LOG(WARNING) << "failed to read CPU max frequency!";
remove warning and assume the file does not exist
src/runtime/threading_backend.cc
Outdated
if (nthreads)
num_workers_used = nthreads;
// use default
if (!num_workers_used)
move to the previous part
src/runtime/threading_backend.cc
Outdated
@@ -66,7 +127,16 @@ class ThreadGroup::Impl {
#endif
#if defined(__linux__) || defined(__ANDROID__)
for (unsigned i = 0; i < threads_.size(); ++i) {
unsigned core_id = i + exclude_worker0;
unsigned core_id;
if (sorted_order_.size() >= threads_.size()) {
CHECK_EQ(sorted_order_.size(), threads_.size());
// always populate sorted_order_ to match threads_, to allow simplified assumptions
src/runtime/threading_backend.cc
Outdated
}
} else {
LOG(WARNING) << "failed to read CPU max frequency!";
break;
always continue pushing default value
src/runtime/threading_backend.cc
Outdated
if (mode) {
if (sorted_order_.empty()) {
for (unsigned int i = 0; i < threads; ++i) {
std::ostringstream filepath;
#if defined(_WIN32)
// default value
cur_freq = 0;
#else
the file logic
#endif
src/runtime/threading_backend.cc
Outdated
break;
}
}
auto max = [] (std::pair<unsigned int, int64_t> a, std::pair<unsigned int, int64_t> b) {
max -> fcmpbyfreq. When the frequencies are equal, sort by id (to make the ordering stable).
src/runtime/threading_backend.cc
Outdated
for (auto it = max_freqs.begin(); it != max_freqs.end(); it++) {
sorted_order_.push_back(it->first);
if (max_freq == it->second) {
max_count_++;
Consider adding a check here: if max_count_ + min_count_ < num_cores, emit a warning that more than two distinct frequencies were detected.
*
* \return The number of workers to use.
*/
int Config(int mode, int nthreads, bool exclude_worker0);
Config -> Configure
@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
unsigned int ThreadGroup::ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0) {
Configure?
src/runtime/thread_pool.cc
Outdated
@@ -297,6 +301,13 @@ class ThreadPool {
return dmlc::ThreadLocalStore<ThreadPool>::Get();
}

void UpdateWorkerConfig(int mode, int nthreads) {
Configure
src/runtime/threading_backend.cc
Outdated
if (mode) {
if (sorted_order_.empty()) {
for (unsigned int i = 0; i < threads; ++i) {
int64_t cur_freq = -1;
default to 0?
src/runtime/threading_backend.cc
Outdated
num_workers_used = threading::MaxConcurrency();
}
// if a specific number was given, use that
if (nthreads)
enclose with {}
src/runtime/threading_backend.cc
Outdated
filepath << "/sys/devices/system/cpu/cpu" << i << "/cpufreq/cpuinfo_max_freq";
std::ifstream ifs(filepath.str());
if (!ifs.fail()) {
ifs >> cur_freq;
if (!(ifs >> cur_freq)) {
cur_freq = -1;
}
src/runtime/threading_backend.cc
Outdated
@@ -7,6 +7,7 @@
#include <dmlc/logging.h>
#include <thread>
#include <algorithm>
#include <fstream>
Only include <fstream> on Linux and Android.
some final comments, we are almost good to merge
@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
unsigned int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
new line between functions
@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
unsigned int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
We can just make the return type int.
src/runtime/threading_backend.cc
Outdated
std::vector<std::pair <unsigned int, int64_t> > max_freqs;

// big or LITTLE
if (mode) {
The if (mode) guard can be removed.
src/runtime/threading_backend.cc
Outdated
@@ -66,7 +138,16 @@ class ThreadGroup::Impl {
#endif
#if defined(__linux__) || defined(__ANDROID__)
for (unsigned i = 0; i < threads_.size(); ++i) {
unsigned core_id = i + exclude_worker0;
unsigned core_id;
if (sorted_order_.size() >= threads_.size()) {
Again, this seems unnecessary. Can we simply do CHECK_EQ(sorted_order_.size(), threads_.size()); and always produce sorted_order_ before calling SetAffinity? Possibly do the sorted-order initialization in the construction phase. This will likely simplify this function's logic.
Ok, moved the initialization to the constructor. To preserve the old TVM_BIND_THREADS behavior, the branch in the ThreadGroup constructor is moved to ThreadPool, in case Configure->SetAffinity is called, so it can get the correct number of workers for the new default (prefer big).
one final comment
src/runtime/threading_backend.cc
Outdated
max_freqs.push_back(std::make_pair(i, cur_freq));
}

auto fcmpbyfreq = [] (std::pair<unsigned int, int64_t> a,
prefer const reference: const std::pair<unsigned int, int64_t>&
sorry, more things to be fixed
src/runtime/threading_backend.cc
Outdated
@@ -92,8 +116,57 @@ class ThreadGroup::Impl {
#endif
}

void InitSortedOrder() {
unsigned int threads = threading::MaxConcurrency();
Some follow-ups: we might need to remodel MaxConcurrency a bit. Note that the number of workers can indeed differ from the number of cores; the frequency query always needs to query all the CPU info (using hardware concurrency, without considering the env variable).
src/runtime/threading_backend.cc
Outdated
@@ -26,16 +30,7 @@ class ThreadGroup::Impl {
for (int i = exclude_worker0; i < num_workers_; ++i) {
threads_.emplace_back([worker_callback, i] { worker_callback(i); });
}
const char *val = getenv("TVM_BIND_THREADS");
TVM_BIND_THREADS is an env variable asking whether we should set affinity (or let the threads run as they are). We should take this into consideration: when TVM_BIND_THREADS is defined and is 0, we should not call SetAffinity at all, and Configure should have no effect other than returning the corresponding value.
Not sure what should be done here, as this branch is currently moved to ThreadPool; the behavior should be close to what it was before (now Configure is not called, so SetAffinity is not called either, and the default num_workers_ is used).
Currently CHECK_EQ will break on x86 with hyperthreading, where MaxConcurrency reports #logical cores / 2 while std::thread::hardware_concurrency gives #logical cores.
@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
unsigned int max_conc = (unsigned int) MaxConcurrency();
we can just use int here
src/runtime/threading_backend.cc
Outdated
max_freqs.push_back(std::make_pair(i, cur_freq));
}

auto fcmpbyfreq = [] (std::pair<unsigned int, int64_t> &a,
do const reference instead of reference in here.
src/runtime/threading_backend.cc
Outdated
@@ -65,8 +87,18 @@ class ThreadGroup::Impl {
#endif
#endif
#if defined(__linux__) || defined(__ANDROID__)
CHECK_LE(threads_.size() + exclude_worker0, sorted_order_.size());
if (sorted_order_.size() != threads_.size()) {
LOG(WARNING) << "setting affinity with subset of threads!";
This warning could fire quite frequently; instead, do
CHECK_GE(sorted_order_.size(), threads_.size());
// if a specific number was given, use that
if (nthreads) {
num_workers_used = nthreads;
}
const char *val = getenv("TVM_BIND_THREADS");
if (val == nullptr || atoi(val) == 1) {
// Skip if sorted_order_.size() is bigger than threads_.size()
if (sorted_order_.size() > threads_.size() || not_bind) skip SetAffinity.
re: SGX, this change won't affect threading since the trusted threading backend is basically a no-op that shells out to the (herein improved) untrusted backend
@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
int max_conc = MaxConcurrency();
if (!nthreads || ntheads > MaxConcurrency()) {
if (!nthreads || nthreads > max_conc) {
src/runtime/threading_backend.cc
Outdated
return a.first < b.first;
} else {
return a.second > b.second;
}
return a.second == b.second ? a.first < b.first : a.second > b.second
for brevity
src/runtime/threading_backend.cc
Outdated
}
ifs.close();
}
#else
Is this empty #else necessary?
/*!
* \brief configure the CPU id affinity
*
* \param mode The preferred CPU type (1 = big, -1 = little).
Please use an enum for these, AFFINITY_MODE_(BIG|LITTLE), to take advantage of static checking.
src/runtime/threading_backend.cc
Outdated
}

int Configure(int mode, int nthreads, bool exclude_worker0) {
int num_workers_used = 0;
Set int num_workers_used = nthreads and remove the later if block.
@eqy please rebase against the master.
LGTM. Thanks for the contribution.
How do I configure affinity in a Python script? I have an rk3399 board.
@kaishijeng it is better to ask questions on https://discuss.tvm.ai/
}

private:
// bind worker threads to disjoint cores
// if worker 0 is offloaded to master, i.e. exclude_worker0 is true,
// the master thread is bound to core 0.
void SetAffinity(bool exclude_worker0) {
void SetAffinity(bool exclude_worker0, bool reverse = false) {
Currently, one common use mode of tvm is to use all the available cores on a system, as defined in MaxConcurrency. On asymmetric or heterogeneous multicores (e.g., big.LITTLE), however, we may want to specify a subset of CPU cores, with a preference for the CPU core type.

This PR introduces a new global function, runtime.config_threadpool, to change the runtime configuration; it allows the user to specify a CPU type as well as the number of threads to use. Arguments:
mode: {big, little, default}
nthreads: {0, 1, 2, ...}

Mode selects the preferred CPU type (based on CPU clock rate): "big" prefers higher-clocked cores first, while "little" prefers lower-clocked cores first. Choosing "default" leaves the CPU affinity preference order at the system's CPU id ordering.

Nthreads selects the number of CPUs to use, picking CPUs in the order specified by mode. If a value of 0 is given, all of the cores of the preferred mode are used.

Example: a system with 4x Cortex A-53 (LITTLE) + 2x Cortex A-72 (big):
cpuid 0: LITTLE
cpuid 1: LITTLE
cpuid 2: LITTLE
cpuid 3: LITTLE
cpuid 4: big
cpuid 5: big

config_threadpool("big", 1) -> changes affinity order to [4,5,0,1,2,3], will use CPU 4
config_threadpool("big", 2) -> changes affinity order to [4,5,0,1,2,3], will use CPUs 4,5
config_threadpool("big", 3) -> changes affinity order to [4,5,0,1,2,3], will use CPUs 4,5,0
config_threadpool("big", 0) -> changes affinity order to [4,5,0,1,2,3], will use CPUs 4,5
config_threadpool("little", 0) -> leaves affinity order as [0,1,2,3,4,5], will use CPUs 0,1,2,3
config_threadpool("little", 5) -> leaves affinity order as [0,1,2,3,4,5], will use CPUs 0,1,2,3,4
config_threadpool("default", 0) -> leaves affinity order as [0,1,2,3,4,5], will use CPUs 0,1,2,3,4,5

Note that if config_threadpool is not called, the default behavior remains the same as in current tvm. In a few toy experiments, we observe that restricting the runtime to the 2x A-72 cores can actually outperform using all 4x A-53 plus 2x A-72 cores, due to the current static work partitioning scheme.

misc:
I do not know the current standard for docstrings of TVM_REGISTER_GLOBAL functions; let me know where that should go.
Currently this scheme fixes CPU affinity once the runtime spawns the threads for execution. We could modify the scheme to support switching affinity dynamically within the same runtime instance, but this requires considering how we want to handle migrating existing threads, as the existing threading backend implementation only sets thread affinity at thread creation time.