[RUNTIME] Support setting CPU affinity #1403

Merged: 34 commits merged into apache:master on Jul 12, 2018

Conversation

@eqy (Contributor) commented on Jul 8, 2018

Currently, a common usage mode of TVM is to use all of the available cores on a system, as determined by MaxConcurrency. On asymmetric or heterogeneous multicores (e.g., big.LITTLE), however, we may want to specify a subset of CPU cores (with a preference for a CPU core type).

This PR introduces a new global function, runtime.config_threadpool, to change the runtime configuration; it allows the user to specify a CPU type as well as the number of threads to use.
arguments:
mode : {big, little, default}
nthreads: {0, 1, 2, ...}
mode selects the preferred CPU type (based on CPU clock rate): "big" prefers higher-clocked cores first, while "little" prefers lower-clocked cores first. Choosing "default" does not change the CPU affinity preference order from the system's CPU id ordering.
nthreads selects the number of CPUs to use, picking CPUs in the order specified by mode.
If a value of 0 is given, all of the cores of the preferred type will be used.

Example:
System has 4x Cortex A-53 (LITTLE) + 2x Cortex A-72 (big).
cpuid 0: LITTLE
cpuid 1: LITTLE
cpuid 2: LITTLE
cpuid 3: LITTLE
cpuid 4: big
cpuid 5: big

config_threadpool("big", 1) -> changes affinity order to [4,5,1,2,3], will use CPU 4
config_threadpool("big", 2) -> changes affinity order to [4,5,1,2,3], will use CPUs 4, 5
config_threadpool("big", 3) -> changes affinity order to [4,5,1,2,3], will use CPUs 4,5,0
config_threadpool("big", 0) -> changes affinity order to [4,5,1,2,3], will use CPUs 4,5

config_threadpool("little", 0) ->leaves affinity order as [0,1,2,3,4,5], will use CPUs 0,1,2,3
config_threadpool("little", 5) ->leaves affinity order as [0,1,2,3,4,5], will use CPUs 0,1,2,3,4

config_threadpool("default", 0) ->leaves affinity order as [0,1,2,3,4,5], will use CPUs 0,1,2,3,4,5

Note that if config_threadpool is not called, the default behavior remains the same as in current TVM.
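For illustration, a minimal sketch (not part of this PR's diff) of calling the new global from C++ through the runtime registry; the same function is reachable from Python via the FFI. The string-valued mode follows the description above; note that later review comments switch mode to an integer enum.

#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

void PreferBigCores() {
  // Look up the global registered by this PR.
  const tvm::runtime::PackedFunc* f =
      tvm::runtime::Registry::Get("runtime.config_threadpool");
  if (f != nullptr) {
    // Prefer big cores and use all of them (nthreads == 0).
    (*f)("big", 0);
  }
}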

In a few toy experiments, we observe that restricting the runtime to the 2x A-72 cores can actually outperform using 4x A-53 plus 2x A-72 cores, due to the current static work partitioning scheme.

misc:
I do not know the current standard for docstrings of TVM_REGISTER_GLOBAL functions; let me know where that should go.

Currently this scheme fixes CPU affinity once the runtime spawns the threads for execution. We could modify the scheme to support switching affinity dynamically within the same runtime instance, but this requires deciding how to handle migrating existing threads, as the existing threading backend implementation only sets thread affinity at thread creation time.

std::vector<std::pair <unsigned int, int64_t>> max_freqs;
std::vector<unsigned int> sorted_order;

for (unsigned int i = 0; i < threads; i++) {
Member

The frequency query needs to happen in threading_backend.cc.

Member

Restriction: the thread pool cannot have IO-related code, to accommodate SGX.

std::vector<unsigned int> sorted_order;

for (unsigned int i = 0; i < threads; i++) {
snprintf(filepath, sizeof(filepath),
Member

use std::ostringstream to construct the filename
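As a small illustration of the suggestion (a sketch only; it mirrors the revised code that appears later in this thread and assumes <sstream>/<fstream> are included and that i is the loop index from the diff above):

std::ostringstream filepath;
filepath << "/sys/devices/system/cpu/cpu" << i << "/cpufreq/cpuinfo_max_freq";
std::ifstream ifs(filepath.str());
int64_t cur_freq = -1;
if (!ifs.fail()) {
  if (!(ifs >> cur_freq)) {
    cur_freq = -1;  // reading failed; fall back to the sentinel value
  }
  ifs.close();
}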

LOG(WARNING) << "CPU max frequency info empty!";
max_freqs.push_back(std::make_pair(i, cur_freq));
} else {
LOG(WARNING) << "failed to read CPU max frequency!";
Member

If the file is not available, consider adding a default value.

ThreadPool::preferred_num = preferred_num;
}

void setNthreadsPref(int nthreads) {
Member

Style: Use CamelCase for functions

.set_body([](TVMArgs args, TVMRetValue* rv) {
std::string mode = args[0];
int nthreads = args[1];
if (mode == "big") {
Member

Use integer enums to minimize runtime overhead:
0 -> default, +1 -> big, -1 -> little

Contributor Author

I don't know how enum types work over the Python FFI; do you mean to just use an actual integer here?

@@ -242,14 +244,18 @@ class SpscTaskQueue {
class ThreadPool {
public:
ThreadPool(): num_workers_(tvm::runtime::threading::MaxConcurrency()) {
if (nthreads != 0) {
num_workers_ = nthreads;
Member

This seems problematic and does not allow dynamic switching.

Member

This is not necessary; num_workers_ is set directly. We just need a num_threads_ (restricted to be smaller than num_workers_).

Contributor Author

(Basically this was because I wasn't sure whether mutating the threads vector was safe, same as the previous case.)

*/
ThreadGroup(int num_workers,
std::function<void(int)> worker_callback,
bool exclude_worker0 = false);
bool exclude_worker0 = false,
Member

If SetAffinity is exposed in some way in the threading backend, we do not need these arguments here. The affinity-setting function can be called in a second round.

Contributor Author

Same as above: if SetAffinity needs to change the thread set (add/remove threads), is that safe?

}

TVM_REGISTER_GLOBAL("runtime.config_threadpool")
.set_body([](TVMArgs args, TVMRetValue* rv) {
Member

There are two parts that matter here: the affinity order configuration needs to happen inside the threading backend, and the nthreads configuration needs to happen here.

}

auto max = [] (std::pair<unsigned int, int64_t> a, std::pair<unsigned int, int64_t> b) {
return a.second > b.second;
Member

No need to have two ways; sort CPUs by descending frequency and always use that information.

*/
ThreadGroup(int num_workers,
std::function<void(int)> worker_callback,
bool exclude_worker0 = false);
bool exclude_worker0 = false,
std::vector<unsigned int> *affinity_order = NULL);
~ThreadGroup();
Member

Consider adding:

// mode: big, little, default
// nthread can be 0, which takes the maximum number
// returns the real number of threads needed
int ConfigureAffinity(bool exclude_worker0, int mode, int nthread);

@tqchen added the status: review in progress and status: need update labels on Jul 9, 2018
int mode = args[1];
int nthreads = args[2];
std::vector<unsigned int> sorted_order;
unsigned int num_workers_used = threading::configThreadGroup(mode, nthreads, &sorted_order);
Member

This can be directly merged into ThreadLocal->UpdateWorkerConfig(mode, threads);
Which then calls thread_group_->Configure(exclude_worker0, mode, nthreads);

Member

There is no need to pass the sorted list back into the thread pool.


// big or LITTLE
if (mode) {
for (unsigned int i = 0; i < threads; ++i) {
Member

read the result once and store the sorted result inside the thread group

@@ -124,6 +147,62 @@ int MaxConcurrency() {
return std::max(max_concurrency, 1);
}


unsigned int configThreadGroup(int mode, int nthreads, std::vector<unsigned int> *sorted_order) {
Member

No need to return the sorted order; we can just store it inside the ThreadGroup and use it to support SetAffinity (Configure).

@@ -124,6 +147,62 @@ int MaxConcurrency() {
return std::max(max_concurrency, 1);
}


unsigned int configThreadGroup(int mode, int nthreads, std::vector<unsigned int> *sorted_order) {
Member

Use a CamelCase function name. This can be a member of ThreadGroup and combined with SetAffinity.

std::vector<unsigned int> sorted_order;
unsigned int num_workers_used = threading::configThreadGroup(mode, nthreads, &sorted_order);
void UpdateWorkerConfig(int mode, int nthreads) {
unsigned int num_workers_used = threading::ConfigThreadGroup(mode, nthreads, threads_.get());
Member

Again, combine threading::ConfigThreadGroup and SetAffinity into

// rename num_workers_used_ -> num_active_workers_;
num_active_workers_ = threads_->Configure(exclude_worker0, mode, nthreads);

Member

Change the constructor of ThreadGroup to a simple form, and call threads_->Configure(exclude_worker0, mode, num_workers_); in the constructor.

min_count_ = min_count;
}

bool AffinityOrderSet() {
Member

This function is not really necessary.

@@ -147,51 +174,55 @@ int MaxConcurrency() {
return std::max(max_concurrency, 1);
}


unsigned int configThreadGroup(int mode, int nthreads, std::vector<unsigned int> *sorted_order) {
unsigned int ConfigThreadGroup(int mode, int nthreads, ThreadGroup *thread_group) {
Member

Move this function into ThreadGroup::Impl

bool ThreadGroup::AffinityOrderSet() {
return impl_->AffinityOrderSet();
}
int ThreadGroup::GetPrefCount(bool reverse) {
Member

remove this function, merge most logic into ThreadGroup::Impl

void ThreadGroup::SetAffinityOrder(std::vector<unsigned int> order, int max_count, int min_count) {
impl_->SetAffinityOrder(order, max_count, min_count);
}
bool ThreadGroup::AffinityOrderSet() {
Member

remove this function, merge most logic into ThreadGroup::Impl

void ThreadGroup::SetAffinity(bool exclude_worker0, bool reverse) {
impl_->SetAffinity(exclude_worker0, reverse);
}
void ThreadGroup::SetAffinityOrder(std::vector<unsigned int> order, int max_count, int min_count) {
Member

remove this function, merge most logic into ThreadGroup::Impl

void ThreadGroup::SetAffinity(bool exclude_worker0, const std::vector<unsigned int> *order,
bool reverse) {
impl_->SetAffinity(exclude_worker0, order, reverse);
void ThreadGroup::SetAffinity(bool exclude_worker0, bool reverse) {
Member

remove this function, merge most logic into ThreadGroup::Impl

@@ -66,7 +66,16 @@ class ThreadGroup::Impl {
#endif
#if defined(__linux__) || defined(__ANDROID__)
Member

lazily get the sorted_order here

*/
void SetAffinity(bool exclude_worker0, bool reverse = false);

/*!
Member

Combine all the functions into one function:

int ThreadGroup::Configure(bool exclude_worker0, int mode, int nthreads);
which calls into
// put all the logic of these functions into this function
int ThreadGroup::Impl::Configure(bool exclude_worker0, int mode, int nthreads);

@@ -297,6 +300,15 @@ class ThreadPool {
return dmlc::ThreadLocalStore<ThreadPool>::Get();
}

void UpdateWorkerConfig(int mode, int nthreads) {
unsigned int num_workers_used = threading::ConfigThreadGroup(mode, nthreads, threads_.get());
Member

change to

num_active_worker_  = threads_->Configure(exclude_worker0_, mode, nthreads);

// bind worker threads to disjoint cores
// if worker 0 is offloaded to master, i.e. exclude_worker0 is true,
// the master thread is bound to core 0.
void SetAffinity(bool exclude_worker0) {
void SetAffinity(bool exclude_worker0, bool reverse = false) {
Member

Merge all logic into Configure, and you can directly call Configure in the constructor of ThreadGroup

*
* \return The number of workers to use.
*/
unsigned int ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0);
Member

Returning int is fine, as we use int to store the number of active workers.

Member

Because it is already a member function of ThreadGroup, let us just rename it to Configure.

@@ -297,6 +300,14 @@ class ThreadPool {
return dmlc::ThreadLocalStore<ThreadPool>::Get();
}

void UpdateWorkerConfig(int mode, int nthreads) {
// this will also reset the affinity of the ThreadGroup
unsigned int num_workers_used = threads_->ConfigThreadGroup(mode, nthreads,
Member

just do the assignment directly

@@ -45,11 +46,71 @@ class ThreadGroup::Impl {
}
}

unsigned int ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0) {
unsigned int threads = std::thread::hardware_concurrency();
std::vector<std::pair <unsigned int, int64_t>> max_freqs;
Member

int64_t>> -> int64_t> > (add a space) for compatibility with some older compilers.

@@ -45,11 +46,71 @@ class ThreadGroup::Impl {
}
}

unsigned int ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0) {
unsigned int threads = std::thread::hardware_concurrency();
Member

Call MaxConcurrency() because some special rule might apply here

ifs.close();
max_freqs.push_back(std::make_pair(i, cur_freq));
if (cur_freq < 0) {
LOG(WARNING) << "failed to read CPU max frequency!";
Member

Remove the warning and assume the file does not exist.

if (nthreads)
num_workers_used = nthreads;
// use default
if (!num_workers_used)
Member

move to the previous part

@@ -66,7 +127,16 @@ class ThreadGroup::Impl {
#endif
#if defined(__linux__) || defined(__ANDROID__)
for (unsigned i = 0; i < threads_.size(); ++i) {
unsigned core_id = i + exclude_worker0;
unsigned core_id;
if (sorted_order_.size() >= threads_.size()) {
Member

CHECK_EQ(sorted_order_.size(), threads_.size());

// always populate sorted_order_ to be the same size as threads_, to simplify assumptions

}
} else {
LOG(WARNING) << "failed to read CPU max frequency!";
break;
Member

Always continue, pushing a default value (instead of breaking).

if (mode) {
if (sorted_order_.empty()) {
for (unsigned int i = 0; i < threads; ++i) {
std::ostringstream filepath;
Member

#if defined(_WIN32)
// default value
cur_freq = 0;
#else
// the file-reading logic
#endif

break;
}
}
auto max = [] (std::pair<unsigned int, int64_t> a, std::pair<unsigned int, int64_t> b) {
Member

max -> fcmpbyfreq.

When the frequencies are equal, sort by id (to make things stable).
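A sketch of the suggested comparator (the same form shows up in a later revision in this thread; it assumes the max_freqs vector from the diff above and that <algorithm>/<utility> are included):

auto fcmpbyfreq = [](const std::pair<unsigned int, int64_t>& a,
                     const std::pair<unsigned int, int64_t>& b) {
  // Higher max frequency first; equal frequencies fall back to ascending CPU id.
  return a.second == b.second ? a.first < b.first : a.second > b.second;
};
std::sort(max_freqs.begin(), max_freqs.end(), fcmpbyfreq);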

for (auto it = max_freqs.begin(); it != max_freqs.end(); it++) {
sorted_order_.push_back(it->first);
if (max_freq == it->second) {
max_count_++;
Member

Consider adding a check here: if max_count_ + min_count_ < num_cores, emit a warning that more than two distinct frequency levels were detected.
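A hypothetical sketch of that check (max_count_, min_count_, and max_freqs follow the names used in the surrounding diff; this is not the merged code):

// max_count_ / min_count_ count the cores observed at the highest / lowest
// max-frequency; if they do not cover every core, at least a third level exists.
if (static_cast<size_t>(max_count_ + min_count_) < max_freqs.size()) {
  LOG(WARNING) << "more than two distinct CPU frequency levels detected";
}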

*
* \return The number of workers to use.
*/
int Config(int mode, int nthreads, bool exclude_worker0);
Member

Config -> Configure

@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
unsigned int ThreadGroup::ConfigThreadGroup(int mode, int nthreads, bool exclude_worker0) {
Member

Configure?

@@ -297,6 +301,13 @@ class ThreadPool {
return dmlc::ThreadLocalStore<ThreadPool>::Get();
}

void UpdateWorkerConfig(int mode, int nthreads) {
Member

Configure

if (mode) {
if (sorted_order_.empty()) {
for (unsigned int i = 0; i < threads; ++i) {
int64_t cur_freq = -1;
Member

default to 0?

num_workers_used = threading::MaxConcurrency();
}
// if a specific number was given, use that
if (nthreads)
Member

enclose with {}

filepath << "/sys/devices/system/cpu/cpu" << i << "/cpufreq/cpuinfo_max_freq";
std::ifstream ifs(filepath.str());
if (!ifs.fail()) {
ifs >> cur_freq;
Member

if (!(ifs >> cur_freq)) {
  cur_freq = -1;
}

@@ -7,6 +7,7 @@
#include <dmlc/logging.h>
#include <thread>
#include <algorithm>
#include <fstream>
Member

Only include <fstream> on Linux and Android.

@tqchen (Member) left a comment

Some final comments; we are almost good to merge.

@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
unsigned int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
Member

new line between functions

@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
unsigned int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
Member

We can just make the return type int.

std::vector<std::pair <unsigned int, int64_t> > max_freqs;

// big or LITTLE
if (mode) {
Member

The if (mode) guard can be omitted.

@@ -66,7 +138,16 @@ class ThreadGroup::Impl {
#endif
#if defined(__linux__) || defined(__ANDROID__)
for (unsigned i = 0; i < threads_.size(); ++i) {
unsigned core_id = i + exclude_worker0;
unsigned core_id;
if (sorted_order_.size() >= threads_.size()) {
Member

Again, this seems unnecessary. Can we simply do CHECK_EQ(sorted_order_.size(), threads_.size()); and always produce sorted_order_ before calling SetAffinity? Possibly do the sorted-order initialization in the construction phase.

This will likely simplify this function's logic.

Contributor Author

OK, moved the initialization to the constructor. To preserve the old TVM_BIND_THREADS behavior, the branch in the ThreadGroup constructor is moved to ThreadPool in case Configure -> SetAffinity is called, so it can get the correct number of workers for the new default (prefer big).

@tqchen (Member) left a comment

one final comment

max_freqs.push_back(std::make_pair(i, cur_freq));
}

auto fcmpbyfreq = [] (std::pair<unsigned int, int64_t> a,
Member

prefer const reference: const std::pair<unsigned int, int64_t>&

@tqchen (Member) left a comment

sorry, more things to be fixed

@@ -92,8 +116,57 @@ class ThreadGroup::Impl {
#endif
}

void InitSortedOrder() {
unsigned int threads = threading::MaxConcurrency();
Member

Some follow-ups: we might need to remodel MaxConcurrency a bit.

Note that the number of workers can indeed be different from the number of cores. The frequency query will always need to query all of the CPU info (using hardware concurrency, without considering the env variable).

@@ -26,16 +30,7 @@ class ThreadGroup::Impl {
for (int i = exclude_worker0; i < num_workers_; ++i) {
threads_.emplace_back([worker_callback, i] { worker_callback(i); });
}
const char *val = getenv("TVM_BIND_THREADS");
Member

TVM_BIND_THREADS is an env variable asking whether we should set affinity (or let threads run as they are). We should take this into consideration.

When TVM_BIND_THREADS is defined and it is 0, we should not do SetAffinity at all, and Configure should have no effect other than returning the corresponding value.
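A rough sketch of that guard (based on the diff quoted further down in this thread; SetAffinity, exclude_worker0, and mode are assumed from the surrounding code, and <cstdlib> is assumed to be included):

const char* val = std::getenv("TVM_BIND_THREADS");
if (val == nullptr || std::atoi(val) == 1) {
  // mode == -1 ("little") walks the descending-frequency order in reverse.
  SetAffinity(exclude_worker0, mode == -1);
}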

Contributor Author

Not sure what should be done here, as currently this branch is moved to ThreadPool; the behavior should be close to what it was before (now Configure is not called so SetAffinity is not called either, and the default num_workers_ is used).

Contributor Author

Currently CHECK_EQ will break in the case of x86 with hyperthreading, where MaxConcurrency reports #logical_cores/2 and std::thread::hardware_concurrency() gives #logical_cores.

@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
unsigned int max_conc = (unsigned int) MaxConcurrency();
Member

we can just use int here

max_freqs.push_back(std::make_pair(i, cur_freq));
}

auto fcmpbyfreq = [] (std::pair<unsigned int, int64_t> &a,
Member

Use a const reference instead of a non-const reference here.

@@ -65,8 +87,18 @@ class ThreadGroup::Impl {
#endif
#endif
#if defined(__linux__) || defined(__ANDROID__)
CHECK_LE(threads_.size() + exclude_worker0, sorted_order_.size());
if (sorted_order_.size() != threads_.size()) {
LOG(WARNING) << "setting affinity with subset of threads!";
Member

This warning could fire quite frequently; instead, do CHECK_GE(sorted_order_.size(), threads_.size());

// if a specific number was given, use that
if (nthreads) {
num_workers_used = nthreads;
}
const char *val = getenv("TVM_BIND_THREADS");
if (val == nullptr || atoi(val) == 1) {
Member

// Skip if sorted_order_.size() is bigger than threads_.size()
if (sorted_order_.size() > threads_.size() || not_bind), skip setting affinity.

@tqchen (Member) commented on Jul 12, 2018

cc @nhynes @yidawang

@nhynes (Member) left a comment

re: SGX, this change won't affect threading since the trusted threading backend is basically a no-op that shells out to the (herein improved) untrusted backend

@@ -53,6 +53,13 @@ ThreadGroup::ThreadGroup(int num_workers,
bool exclude_worker0)
: impl_(new ThreadGroup::Impl(num_workers, worker_callback, exclude_worker0)) {}
void ThreadGroup::Join() {}
int ThreadGroup::Configure(int mode, int nthreads, bool exclude_worker0) {
int max_conc = MaxConcurrency();
if (!nthreads || ntheads > MaxConcurrency()) {
Member

if (!nthreads || nthreads > max_conc) {

return a.first < b.first;
} else {
return a.second > b.second;
}
Member

return a.second == b.second ? a.first < b.first : a.second > b.second for brevity

}
ifs.close();
}
#else
Member

is this empty #else necessary?

/*!
* \brief configure the CPU id affinity
*
* \param mode The preferred CPU type (1 = big, -1 = little).
Member

please use an enum for these AFFINITY_MODE_(BIG|LITTLE) to take advantage of static checking
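One possible shape for such an enum, as a sketch only (the identifiers here are illustrative, not necessarily the names that were merged):

enum AffinityMode : int {
  kLittle = -1,   // prefer lower-clocked cores first
  kDefault = 0,   // keep the system's CPU id ordering
  kBig = 1,       // prefer higher-clocked cores first
};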

}

int Configure(int mode, int nthreads, bool exclude_worker0) {
int num_workers_used = 0;
Member

set int num_workers_used = nthreads and remove the later if block


@yidawang (Contributor) left a comment

LGTM. Thanks for the contribution.

@tqchen merged commit 95a97f3 into apache:master on Jul 12, 2018
@kaishijeng

@eqy
How do I configure affinity in a Python script? I have an RK3399 board.
Thanks

@tqchen (Member) commented on Jul 13, 2018

@kaishijeng it is better to ask questions on https://discuss.tvm.ai/

}

private:
// bind worker threads to disjoint cores
// if worker 0 is offloaded to master, i.e. exclude_worker0 is true,
// the master thread is bound to core 0.
void SetAffinity(bool exclude_worker0) {
void SetAffinity(bool exclude_worker0, bool reverse = false) {

why add reverse? @eqy @sanallen

tqchen pushed a commit to tqchen/tvm that referenced this pull request Aug 4, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018