prange generates unnecessary python interactions when indexing a 2d view #2987
Creation and (last) deletion of a slice require the GIL (to prevent the underlying memoryview from being deleted before the slice). If not done internally, this happens in the
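The ownership relation described here (a slice must keep the underlying buffer alive) can be illustrated with plain Python memoryviews; this is a sketch of the general buffer-protocol behaviour, not of Cython's typed memoryviews:

```python
data = bytearray(b"abcdefgh")
view = memoryview(data)
sub = view[2:6]              # the slice holds a reference to the exporter

# While any view exists, the exporting object cannot release its buffer:
try:
    data.extend(b"xy")       # resizing a bytearray with live exports fails
except BufferError as exc:
    print("BufferError:", exc)

del view                     # the remaining slice alone still pins the buffer
print(bytes(sub))            # → b'cdef'
```

Creating and dropping those slice references is exactly the bookkeeping that needs refcounting, which is why it interacts with the Python runtime.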
Thanks for your feedback. I'm a bit confused about memory views needing the GIL, because changing
@robertwb to clarify my comment above: When using

```cython
def sum_2d_a(float[:, :] view_2d):
    cdef:
        float out = 0
        int row_idx = 0
    for row_idx in range(view_2d.shape[0], nogil=True):
        out += sum_1d_a(view_2d[row_idx, :])  # no interactions here
```

Using

```cython
def sum_2d_a(float[:, :] view_2d):
    cdef:
        float out = 0
        int row_idx = 0
    for row_idx in prange(view_2d.shape[0], nogil=True):
        out += sum_1d_a(view_2d[row_idx, :])  # interactions here
```

As you can see in the first example, the GIL is released when the slice is created. The only difference between the two examples is the use of `prange`. Again, I'm happy to assist in any way I can. Thanks
We should make sure this raises an error. @NicolasHug, taking a slice of a memoryview needs to create a new reference to the buffer owner object, and refcounting requires the GIL, so there needs to be at least some interaction with the Python runtime. Maybe you can get away with calling the function once per (larger) slice rather than once per row? E.g. use
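The "once per (larger) slice" idea can be sketched with plain Python memoryviews (hypothetical helper names; a Cython version would slice several rows of the typed memoryview at a time instead of one):

```python
def sum_rows_one_slice_per_row(data: bytes, n_rows: int, row_len: int) -> int:
    """One slice object created (and refcounted) per row."""
    view = memoryview(data)
    total = 0
    for r in range(n_rows):
        row = view[r * row_len:(r + 1) * row_len]     # per-row slice
        total += sum(row)
    return total

def sum_rows_one_slice_per_chunk(data: bytes, n_rows: int, row_len: int,
                                 chunk: int = 64) -> int:
    """One slice object per chunk of rows: far fewer slice creations."""
    view = memoryview(data)
    total = 0
    for start in range(0, n_rows, chunk):
        stop = min(start + chunk, n_rows)
        block = view[start * row_len:stop * row_len]  # per-chunk slice
        total += sum(block)
    return total
```

Both compute the same result; the second creates `n_rows / chunk` slices instead of `n_rows`, amortizing the per-slice bookkeeping.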
Thanks for the feedback @scoder. My bad, what I meant to write was this, and I still don't see any Python interaction when creating the slices:

```cython
with nogil:
    for row_idx in range(view_2d.shape[0]):
        out += sum_1d_a(view_2d[row_idx, :])  # no interactions
```
This would work in this simple example, but my actual use-case is trickier than that and this workaround cannot apply: I do need to have the prange over the rows (sometimes columns) of a big array with a custom dtype. Here is where, in case you're interested
Ok, I finally looked at the code that your initial example generates, and it's essentially this:

```c
Py_BEGIN_ALLOW_THREADS
…
#ifdef _OPENMP
#pragma omp for firstprivate(__pyx_v_row_idx) lastprivate(__pyx_v_row_idx)
#endif /* _OPENMP */
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
    if (__pyx_parallel_why < 2)
    {
        __pyx_v_row_idx = (int)(0 + 1 * __pyx_t_2);
        /* "test.pyx":13
         * out += sum_1d_a(view_2d[row_idx, :])  # <<<<<<<<<<<<<<
         */
        __pyx_t_4.data = __pyx_v_view_2d.data;
        __pyx_t_4.memview = __pyx_v_view_2d.memview;
        __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
        {
            Py_ssize_t __pyx_tmp_idx = __pyx_v_row_idx;
            Py_ssize_t __pyx_tmp_shape = __pyx_v_view_2d.shape[0];
            Py_ssize_t __pyx_tmp_stride = __pyx_v_view_2d.strides[0];
            if (__pyx_tmp_idx < 0)
                __pyx_tmp_idx += __pyx_tmp_shape;
            if ((0)) __PYX_ERR(0, 13, __pyx_L8_error)
            __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
        }
        __pyx_t_4.shape[0] = __pyx_v_view_2d.shape[1];
        __pyx_t_4.strides[0] = __pyx_v_view_2d.strides[1];
        __pyx_t_4.suboffsets[0] = -1;
        __pyx_v_out = (__pyx_v_out + __pyx_f_4test_sum_1d_a(__pyx_t_4));
        __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
        __pyx_t_4.memview = NULL; __pyx_t_4.data = NULL;
        …
        #ifdef _OPENMP
        #pragma omp flush(__pyx_parallel_why)
        #endif /* _OPENMP */
    }
}
#ifdef _OPENMP
Py_END_ALLOW_THREADS
```

Bad formatting aside, that looks fairly reasonable to me. I don't see any excessive "Python interaction" here. There is a bit of unused error-handling code in there, which I stripped above and which could probably be avoided, but the C compiler will be happy to discard it.
The documentation claims differently: https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#memoryviews-and-the-gil (And based on the generated code, the documentation looks like it's right.)
Thanks for taking the time @scoder. I might be wrong in my diagnosis that Python interaction is the cause of my problem, but it seems that there is still something fishy going on. Maybe it's time to explain in more detail what motivated me to open this issue: in scikit-learn, we have some benchmark code that runs in about 45 seconds when using the "full" indexing on 2d views, i.e. always using

For info, here's the diff between the two branches (the slow branch is in green):

diff --git a/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx b/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx
index d346aabdac..78f7b3906b 100644
--- a/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx
+++ b/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx
@@ -30,12 +30,12 @@ def _predict_from_numeric_data(
int i
for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
- out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
+ out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
cdef inline Y_DTYPE_C _predict_one_from_numeric_data(
node_struct [:] nodes,
- const X_DTYPE_C [:, :] numeric_data,
+ const X_DTYPE_C [:] numeric_data,
const int row) nogil:
# Need to pass the whole array and the row index, else prange won't work.
# See issue Cython #2798
@@ -47,13 +47,13 @@ cdef inline Y_DTYPE_C _predict_one_from_numeric_data(
if node.is_leaf:
return node.value
- if isnan(numeric_data[row, node.feature_idx]):
+ if isnan(numeric_data[node.feature_idx]):
if node.missing_go_to_left:
node = nodes[node.left]
else:
node = nodes[node.right]
else:
- if numeric_data[row, node.feature_idx] <= node.threshold:
+ if numeric_data[node.feature_idx] <= node.threshold:
node = nodes[node.left]
else:
node = nodes[node.right]
@@ -69,13 +69,13 @@ def _predict_from_binned_data(
int i
for i in prange(binned_data.shape[0], schedule='static', nogil=True):
- out[i] = _predict_one_from_binned_data(nodes, binned_data, i,
+ out[i] = _predict_one_from_binned_data(nodes, binned_data[i], i,
missing_values_bin_idx)
cdef inline Y_DTYPE_C _predict_one_from_binned_data(
node_struct [:] nodes,
- const X_BINNED_DTYPE_C [:, :] binned_data,
+ const X_BINNED_DTYPE_C [:] binned_data,
const int row,
const unsigned char missing_values_bin_idx) nogil:
# Need to pass the whole array and the row index, else prange won't work.
@@ -87,13 +87,13 @@ cdef inline Y_DTYPE_C _predict_one_from_binned_data(
while True:
if node.is_leaf:
return node.value
- if binned_data[row, node.feature_idx] == missing_values_bin_idx:
+ if binned_data[ node.feature_idx] == missing_values_bin_idx:
if node.missing_go_to_left:
node = nodes[node.left]
else:
node = nodes[node.right]
else:
- if binned_data[row, node.feature_idx] <= node.bin_threshold:
+ if binned_data[ node.feature_idx] <= node.bin_threshold:
node = nodes[node.left]
else:
node = nodes[node.right]
diff --git a/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx b/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx
index 8bd7c4ee8b..578b48995f 100644
--- a/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx
+++ b/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx
@@ -155,7 +155,7 @@ cdef class HistogramBuilder:
for feature_idx in prange(n_features, schedule='static'):
# Compute histogram of each feature
self._compute_histogram_brute_single_feature(
- feature_idx, sample_indices, histograms)
+ feature_idx, sample_indices, histograms[feature_idx])
return histograms
@@ -163,7 +163,7 @@ cdef class HistogramBuilder:
HistogramBuilder self,
const int feature_idx,
const unsigned int [::1] sample_indices, # IN
- hist_struct [:, ::1] histograms) nogil: # OUT
+ hist_struct [::1] histograms) nogil: # OUT
"""Compute the histogram for a given feature."""
cdef:
@@ -180,20 +180,20 @@ cdef class HistogramBuilder:
if root_node:
if hessians_are_constant:
- _build_histogram_root_no_hessian(feature_idx, X_binned,
+ _build_histogram_root_no_hessian(X_binned,
ordered_gradients,
histograms)
else:
- _build_histogram_root(feature_idx, X_binned,
+ _build_histogram_root(X_binned,
ordered_gradients, ordered_hessians,
histograms)
else:
if hessians_are_constant:
- _build_histogram_no_hessian(feature_idx,
+ _build_histogram_no_hessian(
sample_indices, X_binned,
ordered_gradients, histograms)
else:
- _build_histogram(feature_idx, sample_indices,
+ _build_histogram(sample_indices,
X_binned, ordered_gradients,
ordered_hessians, histograms)
@@ -234,11 +234,11 @@ cdef class HistogramBuilder:
for feature_idx in prange(n_features, schedule='static', nogil=True):
# Compute histogram of each feature
- _subtract_histograms(feature_idx,
+ _subtract_histograms(
self.n_bins,
- parent_histograms,
- sibling_histograms,
- histograms)
+ parent_histograms[feature_idx],
+ sibling_histograms[feature_idx],
+ histograms[feature_idx])
return histograms
@@ -267,36 +267,34 @@ cpdef void _build_histogram_naive(
cpdef void _subtract_histograms(
- const int feature_idx,
unsigned int n_bins,
- hist_struct [:, ::1] hist_a, # IN
- hist_struct [:, ::1] hist_b, # IN
- hist_struct [:, ::1] out) nogil: # OUT
+ hist_struct [::1] hist_a, # IN
+ hist_struct [::1] hist_b, # IN
+ hist_struct [::1] out) nogil: # OUT
"""compute (hist_a - hist_b) in out"""
cdef:
unsigned int i = 0
for i in range(n_bins):
- out[feature_idx, i].sum_gradients = (
- hist_a[feature_idx, i].sum_gradients -
- hist_b[feature_idx, i].sum_gradients
+ out[i].sum_gradients = (
+ hist_a[i].sum_gradients -
+ hist_b[i].sum_gradients
)
- out[feature_idx, i].sum_hessians = (
- hist_a[feature_idx, i].sum_hessians -
- hist_b[feature_idx, i].sum_hessians
+ out[i].sum_hessians = (
+ hist_a[i].sum_hessians -
+ hist_b[i].sum_hessians
)
- out[feature_idx, i].count = (
- hist_a[feature_idx, i].count -
- hist_b[feature_idx, i].count
+ out[i].count = (
+ hist_a[i].count -
+ hist_b[i].count
)
cpdef void _build_histogram(
- const int feature_idx,
const unsigned int [::1] sample_indices, # IN
const X_BINNED_DTYPE_C [::1] binned_feature, # IN
const G_H_DTYPE_C [::1] ordered_gradients, # IN
const G_H_DTYPE_C [::1] ordered_hessians, # IN
- hist_struct [:, ::1] out) nogil: # OUT
+ hist_struct [::1] out) nogil: # OUT
"""Return histogram for a given feature."""
cdef:
unsigned int i = 0
@@ -315,34 +313,33 @@ cpdef void _build_histogram(
bin_2 = binned_feature[sample_indices[i + 2]]
bin_3 = binned_feature[sample_indices[i + 3]]
- out[feature_idx, bin_0].sum_gradients += ordered_gradients[i]
- out[feature_idx, bin_1].sum_gradients += ordered_gradients[i + 1]
- out[feature_idx, bin_2].sum_gradients += ordered_gradients[i + 2]
- out[feature_idx, bin_3].sum_gradients += ordered_gradients[i + 3]
+ out[bin_0].sum_gradients += ordered_gradients[i]
+ out[bin_1].sum_gradients += ordered_gradients[i + 1]
+ out[bin_2].sum_gradients += ordered_gradients[i + 2]
+ out[bin_3].sum_gradients += ordered_gradients[i + 3]
- out[feature_idx, bin_0].sum_hessians += ordered_hessians[i]
- out[feature_idx, bin_1].sum_hessians += ordered_hessians[i + 1]
- out[feature_idx, bin_2].sum_hessians += ordered_hessians[i + 2]
- out[feature_idx, bin_3].sum_hessians += ordered_hessians[i + 3]
+ out[bin_0].sum_hessians += ordered_hessians[i]
+ out[bin_1].sum_hessians += ordered_hessians[i + 1]
+ out[bin_2].sum_hessians += ordered_hessians[i + 2]
+ out[bin_3].sum_hessians += ordered_hessians[i + 3]
- out[feature_idx, bin_0].count += 1
- out[feature_idx, bin_1].count += 1
- out[feature_idx, bin_2].count += 1
- out[feature_idx, bin_3].count += 1
+ out[bin_0].count += 1
+ out[bin_1].count += 1
+ out[bin_2].count += 1
+ out[bin_3].count += 1
for i in range(unrolled_upper, n_node_samples):
bin_idx = binned_feature[sample_indices[i]]
- out[feature_idx, bin_idx].sum_gradients += ordered_gradients[i]
- out[feature_idx, bin_idx].sum_hessians += ordered_hessians[i]
- out[feature_idx, bin_idx].count += 1
+ out[bin_idx].sum_gradients += ordered_gradients[i]
+ out[bin_idx].sum_hessians += ordered_hessians[i]
+ out[bin_idx].count += 1
cpdef void _build_histogram_no_hessian(
- const int feature_idx,
const unsigned int [::1] sample_indices, # IN
const X_BINNED_DTYPE_C [::1] binned_feature, # IN
const G_H_DTYPE_C [::1] ordered_gradients, # IN
- hist_struct [:, ::1] out) nogil: # OUT
+ hist_struct [::1] out) nogil: # OUT
"""Return histogram for a given feature, not updating hessians.
Used when the hessians of the loss are constant (typically LS loss).
@@ -365,28 +362,27 @@ cpdef void _build_histogram_no_hessian(
bin_2 = binned_feature[sample_indices[i + 2]]
bin_3 = binned_feature[sample_indices[i + 3]]
- out[feature_idx, bin_0].sum_gradients += ordered_gradients[i]
- out[feature_idx, bin_1].sum_gradients += ordered_gradients[i + 1]
- out[feature_idx, bin_2].sum_gradients += ordered_gradients[i + 2]
- out[feature_idx, bin_3].sum_gradients += ordered_gradients[i + 3]
+ out[bin_0].sum_gradients += ordered_gradients[i]
+ out[bin_1].sum_gradients += ordered_gradients[i + 1]
+ out[bin_2].sum_gradients += ordered_gradients[i + 2]
+ out[bin_3].sum_gradients += ordered_gradients[i + 3]
- out[feature_idx, bin_0].count += 1
- out[feature_idx, bin_1].count += 1
- out[feature_idx, bin_2].count += 1
- out[feature_idx, bin_3].count += 1
+ out[bin_0].count += 1
+ out[bin_1].count += 1
+ out[bin_2].count += 1
+ out[bin_3].count += 1
for i in range(unrolled_upper, n_node_samples):
bin_idx = binned_feature[sample_indices[i]]
- out[feature_idx, bin_idx].sum_gradients += ordered_gradients[i]
- out[feature_idx, bin_idx].count += 1
+ out[bin_idx].sum_gradients += ordered_gradients[i]
+ out[bin_idx].count += 1
cpdef void _build_histogram_root(
- const int feature_idx,
const X_BINNED_DTYPE_C [::1] binned_feature, # IN
const G_H_DTYPE_C [::1] all_gradients, # IN
const G_H_DTYPE_C [::1] all_hessians, # IN
- hist_struct [:, ::1] out) nogil: # OUT
+ hist_struct [::1] out) nogil: # OUT
"""Compute histogram of the root node.
Unlike other nodes, the root node has to find the split among *all* the
@@ -412,33 +408,32 @@ cpdef void _build_histogram_root(
bin_2 = binned_feature[i + 2]
bin_3 = binned_feature[i + 3]
- out[feature_idx, bin_0].sum_gradients += all_gradients[i]
- out[feature_idx, bin_1].sum_gradients += all_gradients[i + 1]
- out[feature_idx, bin_2].sum_gradients += all_gradients[i + 2]
- out[feature_idx, bin_3].sum_gradients += all_gradients[i + 3]
+ out[bin_0].sum_gradients += all_gradients[i]
+ out[bin_1].sum_gradients += all_gradients[i + 1]
+ out[bin_2].sum_gradients += all_gradients[i + 2]
+ out[bin_3].sum_gradients += all_gradients[i + 3]
- out[feature_idx, bin_0].sum_hessians += all_hessians[i]
- out[feature_idx, bin_1].sum_hessians += all_hessians[i + 1]
- out[feature_idx, bin_2].sum_hessians += all_hessians[i + 2]
- out[feature_idx, bin_3].sum_hessians += all_hessians[i + 3]
+ out[bin_0].sum_hessians += all_hessians[i]
+ out[bin_1].sum_hessians += all_hessians[i + 1]
+ out[bin_2].sum_hessians += all_hessians[i + 2]
+ out[bin_3].sum_hessians += all_hessians[i + 3]
- out[feature_idx, bin_0].count += 1
- out[feature_idx, bin_1].count += 1
- out[feature_idx, bin_2].count += 1
- out[feature_idx, bin_3].count += 1
+ out[bin_0].count += 1
+ out[bin_1].count += 1
+ out[bin_2].count += 1
+ out[bin_3].count += 1
for i in range(unrolled_upper, n_samples):
bin_idx = binned_feature[i]
- out[feature_idx, bin_idx].sum_gradients += all_gradients[i]
- out[feature_idx, bin_idx].sum_hessians += all_hessians[i]
- out[feature_idx, bin_idx].count += 1
+ out[bin_idx].sum_gradients += all_gradients[i]
+ out[bin_idx].sum_hessians += all_hessians[i]
+ out[bin_idx].count += 1
cpdef void _build_histogram_root_no_hessian(
- const int feature_idx,
const X_BINNED_DTYPE_C [::1] binned_feature, # IN
const G_H_DTYPE_C [::1] all_gradients, # IN
- hist_struct [:, ::1] out) nogil: # OUT
+ hist_struct [::1] out) nogil: # OUT
"""Compute histogram of the root node, not updating hessians.
Used when the hessians of the loss are constant (typically LS loss).
@@ -461,17 +456,17 @@ cpdef void _build_histogram_root_no_hessian(
bin_2 = binned_feature[i + 2]
bin_3 = binned_feature[i + 3]
- out[feature_idx, bin_0].sum_gradients += all_gradients[i]
- out[feature_idx, bin_1].sum_gradients += all_gradients[i + 1]
- out[feature_idx, bin_2].sum_gradients += all_gradients[i + 2]
- out[feature_idx, bin_3].sum_gradients += all_gradients[i + 3]
+ out[bin_0].sum_gradients += all_gradients[i]
+ out[bin_1].sum_gradients += all_gradients[i + 1]
+ out[bin_2].sum_gradients += all_gradients[i + 2]
+ out[bin_3].sum_gradients += all_gradients[i + 3]
- out[feature_idx, bin_0].count += 1
- out[feature_idx, bin_1].count += 1
- out[feature_idx, bin_2].count += 1
- out[feature_idx, bin_3].count += 1
+ out[bin_0].count += 1
+ out[bin_1].count += 1
+ out[bin_2].count += 1
+ out[bin_3].count += 1
for i in range(unrolled_upper, n_samples):
bin_idx = binned_feature[i]
- out[feature_idx, bin_idx].sum_gradients += all_gradients[i]
- out[feature_idx, bin_idx].count += 1
+ out[bin_idx].sum_gradients += all_gradients[i]
+ out[bin_idx].count += 1
diff --git a/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx b/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
index 984cc6767f..cb98d1e069 100644
--- a/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
+++ b/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
@@ -437,7 +437,7 @@ cdef class Splitter:
self._find_best_bin_to_split_left_to_right(
feature_idx, has_missing_values[feature_idx],
- histograms, n_samples, sum_gradients, sum_hessians,
+ histograms[feature_idx], n_samples, sum_gradients, sum_hessians,
value, monotonic_cst[feature_idx],
lower_bound, upper_bound, &split_infos[feature_idx])
@@ -446,7 +446,7 @@ cdef class Splitter:
# sending the nans to the left child would lead to a higher
# gain
self._find_best_bin_to_split_right_to_left(
- feature_idx, histograms, n_samples,
+ feature_idx, histograms[feature_idx], n_samples,
sum_gradients, sum_hessians,
value, monotonic_cst[feature_idx],
lower_bound, upper_bound, &split_infos[feature_idx])
@@ -491,7 +491,7 @@ cdef class Splitter:
Splitter self,
unsigned int feature_idx,
unsigned char has_missing_values,
- const hist_struct [:, ::1] histograms, # IN
+ const hist_struct [::1] histograms, # IN
unsigned int n_samples,
Y_DTYPE_C sum_gradients,
Y_DTYPE_C sum_hessians,
@@ -540,17 +540,17 @@ cdef class Splitter:
loss_current_node = _loss_from_value(value, sum_gradients)
for bin_idx in range(end):
- n_samples_left += histograms[feature_idx, bin_idx].count
+ n_samples_left += histograms[bin_idx].count
n_samples_right = n_samples_ - n_samples_left
if self.hessians_are_constant:
- sum_hessian_left += histograms[feature_idx, bin_idx].count
+ sum_hessian_left += histograms[bin_idx].count
else:
sum_hessian_left += \
- histograms[feature_idx, bin_idx].sum_hessians
+ histograms[bin_idx].sum_hessians
sum_hessian_right = sum_hessians - sum_hessian_left
- sum_gradient_left += histograms[feature_idx, bin_idx].sum_gradients
+ sum_gradient_left += histograms[bin_idx].sum_gradients
sum_gradient_right = sum_gradients - sum_gradient_left
if n_samples_left < self.min_samples_leaf:
@@ -605,7 +605,7 @@ cdef class Splitter:
cdef void _find_best_bin_to_split_right_to_left(
self,
unsigned int feature_idx,
- const hist_struct [:, ::1] histograms, # IN
+ const hist_struct [::1] histograms, # IN
unsigned int n_samples,
Y_DTYPE_C sum_gradients,
Y_DTYPE_C sum_hessians,
@@ -653,18 +653,18 @@ cdef class Splitter:
loss_current_node = _loss_from_value(value, sum_gradients)
for bin_idx in range(start, -1, -1):
- n_samples_right += histograms[feature_idx, bin_idx + 1].count
+ n_samples_right += histograms[bin_idx + 1].count
n_samples_left = n_samples_ - n_samples_right
if self.hessians_are_constant:
- sum_hessian_right += histograms[feature_idx, bin_idx + 1].count
+ sum_hessian_right += histograms[bin_idx + 1].count
else:
sum_hessian_right += \
- histograms[feature_idx, bin_idx + 1].sum_hessians
+ histograms[bin_idx + 1].sum_hessians
sum_hessian_left = sum_hessians - sum_hessian_right
sum_gradient_right += \
- histograms[feature_idx, bin_idx + 1].sum_gradients
+ histograms[bin_idx + 1].sum_gradients
sum_gradient_left = sum_gradients - sum_gradient_right
if n_samples_right < self.min_samples_leaf:
Now, the reason why I think the slow-down comes from the interactions is that, as far as I can tell, the generated C code looks just fine. For example, the 1d-indexing version generates

```c
__pyx_t_28 = __pyx_v_bin_0;
(*((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *)
  ( /* dim=0 */ ((char *)
    (((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *)
      __pyx_v_out.data)
      + __pyx_t_28)) ))).count += 1;
```

while the 2d-indexing version generates

```c
__pyx_t_36 = __pyx_v_feature_idx;
__pyx_t_37 = __pyx_v_bin_0;
(*((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *)
  ( /* dim=1 */ ((char *)
    (((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *)
      ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_36 * __pyx_v_out.strides[0]) ))
      + __pyx_t_37)) ))).count += 1;
```

As far as I can tell, the first snippet uses fewer operations than the second one, as expected. And yet, its branch is much slower. So that's why I'm thinking the slow-down might come from the generated interactions. Thanks again for your time.
@NicolasHug any chance you could run your code through a C-level profiler like callgrind or Linux
I was using 4 threads, but I observe the same slowdown when setting OMP_NUM_THREADS to 1.

I used callgrind on a smaller benchmark (setting OMP_NUM_THREADS to 1 and enabling Cython profiling, though I'm not sure this has any effect), and here are the logs: callgrind.out.slow_one.txt. The "slow" one is the one using 1d views. Loading these in

1c1
< static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
---
> static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
3a4
> __Pyx_TraceDeclarations
8c9,11
< Py_ssize_t __pyx_t_4;
---
> __Pyx_memviewslice __pyx_t_4 = { 0, 0, { 0 }, { 0 }, { 0 } };
> Py_ssize_t __pyx_t_5;
> __Pyx_TraceFrameInit(__pyx_codeobj_)
9a13
> __Pyx_TraceCall("_predict_from_numeric_data", __pyx_f[0], 25, 0, __PYX_ERR(0, 25, __pyx_L1_error));
11c15
< /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
---
> /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
15c19
< * out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
> * out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
25c29
< if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 32, __pyx_L4_error) }
---
> if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 33, __pyx_L4_error) }
28a33,37
> int __pyx_parallel_temp0 = ((int)0xbad0bad0);
> const char *__pyx_parallel_filename = NULL; int __pyx_parallel_lineno = 0, __pyx_parallel_clineno = 0;
> PyObject *__pyx_parallel_exc_type = NULL, *__pyx_parallel_exc_value = NULL, *__pyx_parallel_exc_tb = NULL;
> int __pyx_parallel_why;
> __pyx_parallel_why = 0;
39c48
< #pragma omp parallel private(__pyx_t_4)
---
> #pragma omp parallel private(__pyx_t_5) firstprivate(__pyx_t_4) private(__pyx_filename, __pyx_lineno, __pyx_clineno) shared(__pyx_parallel_why, __pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb)
42a52,57
> #ifdef WITH_THREAD
> PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
> #endif
> Py_BEGIN_ALLOW_THREADS
> #endif /* _OPENMP */
> #ifdef _OPENMP
45a61
> if (__pyx_parallel_why < 2)
49c65
< /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
---
> /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":34
52c68
< * out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i) # <<<<<<<<<<<<<<
---
> * out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i) # <<<<<<<<<<<<<<
56,57c72,121
< __pyx_t_4 = __pyx_v_i;
< *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_4 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_v_numeric_data, __pyx_v_i);
---
> __pyx_t_4.data = __pyx_v_numeric_data.data;
> __pyx_t_4.memview = __pyx_v_numeric_data.memview;
> __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
> {
> Py_ssize_t __pyx_tmp_idx = __pyx_v_i;
> Py_ssize_t __pyx_tmp_stride = __pyx_v_numeric_data.strides[0];
> if ((0)) __PYX_ERR(0, 34, __pyx_L8_error)
> __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
> }
>
> __pyx_t_4.shape[0] = __pyx_v_numeric_data.shape[1];
> __pyx_t_4.strides[0] = __pyx_v_numeric_data.strides[1];
> __pyx_t_4.suboffsets[0] = -1;
>
> __pyx_t_5 = __pyx_v_i;
> *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_5 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_t_4, __pyx_v_i);
> __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
> __pyx_t_4.memview = NULL;
> __pyx_t_4.data = NULL;
> goto __pyx_L11;
> __pyx_L8_error:;
> {
> #ifdef WITH_THREAD
> PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
> #endif
> #ifdef _OPENMP
> #pragma omp flush(__pyx_parallel_exc_type)
> #endif /* _OPENMP */
> if (!__pyx_parallel_exc_type) {
> __Pyx_ErrFetchWithState(&__pyx_parallel_exc_type, &__pyx_parallel_exc_value, &__pyx_parallel_exc_tb);
> __pyx_parallel_filename = __pyx_filename; __pyx_parallel_lineno = __pyx_lineno; __pyx_parallel_clineno = __pyx_clineno;
> __Pyx_GOTREF(__pyx_parallel_exc_type);
> }
> #ifdef WITH_THREAD
> __Pyx_PyGILState_Release(__pyx_gilstate_save);
> #endif
> }
> __pyx_parallel_why = 4;
> goto __pyx_L10;
> __pyx_L10:;
> #ifdef _OPENMP
> #pragma omp critical(__pyx_parallel_lastprivates0)
> #endif /* _OPENMP */
> {
> __pyx_parallel_temp0 = __pyx_v_i;
> }
> __pyx_L11:;
> #ifdef _OPENMP
> #pragma omp flush(__pyx_parallel_why)
> #endif /* _OPENMP */
59a124,159
> #ifdef _OPENMP
> Py_END_ALLOW_THREADS
> #else
> {
> #ifdef WITH_THREAD
> PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
> #endif
> #endif /* _OPENMP */
> /* Clean up any temporaries */
> __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
> #ifdef WITH_THREAD
> __Pyx_PyGILState_Release(__pyx_gilstate_save);
> #endif
> #ifndef _OPENMP
> }
> #endif /* _OPENMP */
> }
> }
> if (__pyx_parallel_exc_type) {
> /* This may have been overridden by a continue, break or return in another thread. Prefer the error. */
> __pyx_parallel_why = 4;
> }
> if (__pyx_parallel_why) {
> __pyx_v_i = __pyx_parallel_temp0;
> switch (__pyx_parallel_why) {
> case 4:
> {
> #ifdef WITH_THREAD
> PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
> #endif
> __Pyx_GIVEREF(__pyx_parallel_exc_type);
> __Pyx_ErrRestoreWithState(__pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb);
> __pyx_filename = __pyx_parallel_filename; __pyx_lineno = __pyx_parallel_lineno; __pyx_clineno = __pyx_parallel_clineno;
> #ifdef WITH_THREAD
> __Pyx_PyGILState_Release(__pyx_gilstate_save);
> #endif
60a161,162
> goto __pyx_L4_error;
> }
71c173
< /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
---
> /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
75c177
< * out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
> * out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
97c199
< /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":24
---
> /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":25
108a211
> __PYX_XDEC_MEMVIEW(&__pyx_t_4, 1);
115a219
> __Pyx_TraceReturn(__pyx_r, 0);
Fast code:

static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
int __pyx_v_i;
PyObject *__pyx_r = NULL;
__Pyx_RefNannyDeclarations
Py_ssize_t __pyx_t_1;
Py_ssize_t __pyx_t_2;
Py_ssize_t __pyx_t_3;
Py_ssize_t __pyx_t_4;
__Pyx_RefNannySetupContext("_predict_from_numeric_data", 0);
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
* int i
*
* for i in prange(numeric_data.shape[0], schedule='static', nogil=True): # <<<<<<<<<<<<<<
* out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
*
*/
{
#ifdef WITH_THREAD
PyThreadState *_save;
Py_UNBLOCK_THREADS
__Pyx_FastGIL_Remember();
#endif
/*try:*/ {
if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 32, __pyx_L4_error) }
__pyx_t_1 = (__pyx_v_numeric_data.shape[0]);
if ((1 == 0)) abort();
{
#if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
#undef likely
#undef unlikely
#define likely(x) (x)
#define unlikely(x) (x)
#endif
__pyx_t_3 = (__pyx_t_1 - 0 + 1 - 1/abs(1)) / 1;
if (__pyx_t_3 > 0)
{
#ifdef _OPENMP
#pragma omp parallel private(__pyx_t_4)
#endif /* _OPENMP */
{
#ifdef _OPENMP
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) schedule(static)
#endif /* _OPENMP */
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
{
__pyx_v_i = (int)(0 + 1 * __pyx_t_2);
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
*
* for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
* out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i) # <<<<<<<<<<<<<<
*
*
*/
__pyx_t_4 = __pyx_v_i;
*((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_4 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_v_numeric_data, __pyx_v_i);
}
}
}
}
}
#if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
#undef likely
#undef unlikely
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif
}
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
* int i
*
* for i in prange(numeric_data.shape[0], schedule='static', nogil=True): # <<<<<<<<<<<<<<
* out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
*
*/
/*finally:*/ {
/*normal exit:*/{
#ifdef WITH_THREAD
__Pyx_FastGIL_Forget();
Py_BLOCK_THREADS
#endif
goto __pyx_L5;
}
__pyx_L4_error: {
#ifdef WITH_THREAD
__Pyx_FastGIL_Forget();
Py_BLOCK_THREADS
#endif
goto __pyx_L1_error;
}
__pyx_L5:;
}
}
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":24
*
*
* def _predict_from_numeric_data( # <<<<<<<<<<<<<<
* node_struct [:] nodes,
* const X_DTYPE_C [:, :] numeric_data,
*/
/* function exit code */
__pyx_r = Py_None; __Pyx_INCREF(Py_None);
goto __pyx_L0;
__pyx_L1_error:;
__Pyx_AddTraceback("sklearn.ensemble._hist_gradient_boosting._predictor._predict_from_numeric_data", __pyx_clineno, __pyx_lineno, __pyx_filename);
__pyx_r = NULL;
__pyx_L0:;
__PYX_XDEC_MEMVIEW(&__pyx_v_nodes, 1);
__PYX_XDEC_MEMVIEW(&__pyx_v_numeric_data, 1);
__PYX_XDEC_MEMVIEW(&__pyx_v_out, 1);
__Pyx_XGIVEREF(__pyx_r);
__Pyx_RefNannyFinishContext();
return __pyx_r;
}

Slow code:

static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
int __pyx_v_i;
PyObject *__pyx_r = NULL;
__Pyx_TraceDeclarations
__Pyx_RefNannyDeclarations
Py_ssize_t __pyx_t_1;
Py_ssize_t __pyx_t_2;
Py_ssize_t __pyx_t_3;
__Pyx_memviewslice __pyx_t_4 = { 0, 0, { 0 }, { 0 }, { 0 } };
Py_ssize_t __pyx_t_5;
__Pyx_TraceFrameInit(__pyx_codeobj_)
__Pyx_RefNannySetupContext("_predict_from_numeric_data", 0);
__Pyx_TraceCall("_predict_from_numeric_data", __pyx_f[0], 25, 0, __PYX_ERR(0, 25, __pyx_L1_error));
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
* int i
*
* for i in prange(numeric_data.shape[0], schedule='static', nogil=True): # <<<<<<<<<<<<<<
* out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
*
*/
{
#ifdef WITH_THREAD
PyThreadState *_save;
Py_UNBLOCK_THREADS
__Pyx_FastGIL_Remember();
#endif
/*try:*/ {
if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 33, __pyx_L4_error) }
__pyx_t_1 = (__pyx_v_numeric_data.shape[0]);
if ((1 == 0)) abort();
{
int __pyx_parallel_temp0 = ((int)0xbad0bad0);
const char *__pyx_parallel_filename = NULL; int __pyx_parallel_lineno = 0, __pyx_parallel_clineno = 0;
PyObject *__pyx_parallel_exc_type = NULL, *__pyx_parallel_exc_value = NULL, *__pyx_parallel_exc_tb = NULL;
int __pyx_parallel_why;
__pyx_parallel_why = 0;
#if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
#undef likely
#undef unlikely
#define likely(x) (x)
#define unlikely(x) (x)
#endif
__pyx_t_3 = (__pyx_t_1 - 0 + 1 - 1/abs(1)) / 1;
if (__pyx_t_3 > 0)
{
#ifdef _OPENMP
#pragma omp parallel private(__pyx_t_5) firstprivate(__pyx_t_4) private(__pyx_filename, __pyx_lineno, __pyx_clineno) shared(__pyx_parallel_why, __pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb)
#endif /* _OPENMP */
{
#ifdef _OPENMP
#ifdef WITH_THREAD
PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
#endif
Py_BEGIN_ALLOW_THREADS
#endif /* _OPENMP */
#ifdef _OPENMP
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) schedule(static)
#endif /* _OPENMP */
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
if (__pyx_parallel_why < 2)
{
__pyx_v_i = (int)(0 + 1 * __pyx_t_2);
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":34
*
* for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
* out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i) # <<<<<<<<<<<<<<
*
*
*/
__pyx_t_4.data = __pyx_v_numeric_data.data;
__pyx_t_4.memview = __pyx_v_numeric_data.memview;
__PYX_INC_MEMVIEW(&__pyx_t_4, 0);
{
Py_ssize_t __pyx_tmp_idx = __pyx_v_i;
Py_ssize_t __pyx_tmp_stride = __pyx_v_numeric_data.strides[0];
if ((0)) __PYX_ERR(0, 34, __pyx_L8_error)
__pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
}
__pyx_t_4.shape[0] = __pyx_v_numeric_data.shape[1];
__pyx_t_4.strides[0] = __pyx_v_numeric_data.strides[1];
__pyx_t_4.suboffsets[0] = -1;
__pyx_t_5 = __pyx_v_i;
*((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_5 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_t_4, __pyx_v_i);
__PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
__pyx_t_4.memview = NULL;
__pyx_t_4.data = NULL;
goto __pyx_L11;
__pyx_L8_error:;
{
#ifdef WITH_THREAD
PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
#endif
#ifdef _OPENMP
#pragma omp flush(__pyx_parallel_exc_type)
#endif /* _OPENMP */
if (!__pyx_parallel_exc_type) {
__Pyx_ErrFetchWithState(&__pyx_parallel_exc_type, &__pyx_parallel_exc_value, &__pyx_parallel_exc_tb);
__pyx_parallel_filename = __pyx_filename; __pyx_parallel_lineno = __pyx_lineno; __pyx_parallel_clineno = __pyx_clineno;
__Pyx_GOTREF(__pyx_parallel_exc_type);
}
#ifdef WITH_THREAD
__Pyx_PyGILState_Release(__pyx_gilstate_save);
#endif
}
__pyx_parallel_why = 4;
goto __pyx_L10;
__pyx_L10:;
#ifdef _OPENMP
#pragma omp critical(__pyx_parallel_lastprivates0)
#endif /* _OPENMP */
{
__pyx_parallel_temp0 = __pyx_v_i;
}
__pyx_L11:;
#ifdef _OPENMP
#pragma omp flush(__pyx_parallel_why)
#endif /* _OPENMP */
}
}
#ifdef _OPENMP
Py_END_ALLOW_THREADS
#else
{
#ifdef WITH_THREAD
PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
#endif
#endif /* _OPENMP */
/* Clean up any temporaries */
__PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
#ifdef WITH_THREAD
__Pyx_PyGILState_Release(__pyx_gilstate_save);
#endif
#ifndef _OPENMP
}
#endif /* _OPENMP */
}
}
if (__pyx_parallel_exc_type) {
/* This may have been overridden by a continue, break or return in another thread. Prefer the error. */
__pyx_parallel_why = 4;
}
if (__pyx_parallel_why) {
__pyx_v_i = __pyx_parallel_temp0;
switch (__pyx_parallel_why) {
case 4:
{
#ifdef WITH_THREAD
PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
#endif
__Pyx_GIVEREF(__pyx_parallel_exc_type);
__Pyx_ErrRestoreWithState(__pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb);
__pyx_filename = __pyx_parallel_filename; __pyx_lineno = __pyx_parallel_lineno; __pyx_clineno = __pyx_parallel_clineno;
#ifdef WITH_THREAD
__Pyx_PyGILState_Release(__pyx_gilstate_save);
#endif
}
goto __pyx_L4_error;
}
}
}
#if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
#undef likely
#undef unlikely
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif
}
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
* int i
*
* for i in prange(numeric_data.shape[0], schedule='static', nogil=True): # <<<<<<<<<<<<<<
* out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
*
*/
/*finally:*/ {
/*normal exit:*/{
#ifdef WITH_THREAD
__Pyx_FastGIL_Forget();
Py_BLOCK_THREADS
#endif
goto __pyx_L5;
}
__pyx_L4_error: {
#ifdef WITH_THREAD
__Pyx_FastGIL_Forget();
Py_BLOCK_THREADS
#endif
goto __pyx_L1_error;
}
__pyx_L5:;
}
}
/* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":25
*
*
* def _predict_from_numeric_data( # <<<<<<<<<<<<<<
* node_struct [:] nodes,
* const X_DTYPE_C [:, :] numeric_data,
*/
/* function exit code */
__pyx_r = Py_None; __Pyx_INCREF(Py_None);
goto __pyx_L0;
__pyx_L1_error:;
__PYX_XDEC_MEMVIEW(&__pyx_t_4, 1);
__Pyx_AddTraceback("sklearn.ensemble._hist_gradient_boosting._predictor._predict_from_numeric_data", __pyx_clineno, __pyx_lineno, __pyx_filename);
__pyx_r = NULL;
__pyx_L0:;
__PYX_XDEC_MEMVIEW(&__pyx_v_nodes, 1);
__PYX_XDEC_MEMVIEW(&__pyx_v_numeric_data, 1);
__PYX_XDEC_MEMVIEW(&__pyx_v_out, 1);
__Pyx_XGIVEREF(__pyx_r);
__Pyx_TraceReturn(__pyx_r, 0);
__Pyx_RefNannyFinishContext();
return __pyx_r;
} Just for completeness the Cython diff is just: @ -30,12 +31,12 @@ def _predict_from_numeric_data(
int i
for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
- out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
+ out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
On top of the big chunk of additional code, I'm wondering if the slowdown could come from the addition of the private/lastprivate OpenMP variables, in particular the firstprivate slice temporary (__pyx_t_4). |
Well, yes. It can make your profile unusable. :) I can't say if it was really compiled in on your side (requires a C macro setting), but CPython profiling and C level profiling do very different things and the CPython profiling (which adds dedicated event callbacks to your code) just creates a lot of overhead that you don't want to see in your C profile.
Could be. I pushed a fix that discards the unused error handling code in 794d21d. That should at least make the diff a bit clearer. Could you maybe retry with the latest master? And with Cython profiling disabled? :) |
Thanks! This did make the benchmark with 1d views faster, so this is definitely an improvement. However, it's still not quite as fast as the benchmark with 2d views, and I still observe about 4 billion additional instructions. I'm now fairly sure these 4 billion come from these two instructions: __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
...
__PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
where __pyx_t_4 is the temporary 1d slice. Indeed, when I comment these out, both benchmarks run just as fast, and I no longer observe these 4 billion instructions :) Here's the diff, just for reference: 8c8,9
< Py_ssize_t __pyx_t_4;
---
> __Pyx_memviewslice __pyx_t_4 = { 0, 0, { 0 }, { 0 }, { 0 } };
> Py_ssize_t __pyx_t_5;
18c19
< * out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
> * out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
41c42
< #pragma omp parallel private(__pyx_t_4)
---
> #pragma omp parallel private(__pyx_t_5) firstprivate(__pyx_t_4)
54c55
< * out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i) # <<<<<<<<<<<<<<
---
> * out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i) # <<<<<<<<<<<<<<
58,59c59,75
< __pyx_t_4 = __pyx_v_i;
< *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_4 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_v_numeric_data, __pyx_v_i);
---
> __pyx_t_4.data = __pyx_v_numeric_data.data;
> __pyx_t_4.memview = __pyx_v_numeric_data.memview;
> __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
> {
> Py_ssize_t __pyx_tmp_idx = __pyx_v_i;
> Py_ssize_t __pyx_tmp_stride = __pyx_v_numeric_data.strides[0];
> __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
> }
>
> __pyx_t_4.shape[0] = __pyx_v_numeric_data.shape[1];
> __pyx_t_4.strides[0] = __pyx_v_numeric_data.strides[1];
> __pyx_t_4.suboffsets[0] = -1;
>
> __pyx_t_5 = __pyx_v_i;
> *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_5 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_t_4, __pyx_v_i);
> __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
> __pyx_t_4.memview = NULL; __pyx_t_4.data = NULL;
77c93
< * out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
> * out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
110a127
> __PYX_XDEC_MEMVIEW(&__pyx_t_4, 1); As a side note, I do observe these 2 instructions being generated when doing just with nogil:
for row_idx in range(view_2d.shape[0]):
out += sum_1d_a(view_2d[row_idx, :])
so I understand @robertwb's #2987 (comment) better now. |
So I guess my new question is this: would you consider introducing a new option that would bypass the __PYX_INC_MEMVIEW / __PYX_XDEC_MEMVIEW calls when slicing? If yes, I would be happy to give this a try! |
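For illustration, here is a minimal plain-C sketch of what such a "borrowed" slice could look like: the row descriptor is built from the parent's data pointer and strides, with no reference counting at all. The struct and function names here are made up for the sketch and are not Cython's actual internals; the safety assumption is that the owner of the buffer stays alive for the whole loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified analogue of a memoryview slice:
 * only the fields needed to address the elements of one 1d row. */
typedef struct {
    char *data;         /* first element of the row */
    ptrdiff_t shape;    /* number of elements */
    ptrdiff_t stride;   /* step between elements, in bytes */
} borrowed_slice;

/* "Borrow" row i of a 2d float buffer without touching any refcount.
 * This is only safe while the buffer owner is guaranteed alive,
 * which is exactly the restriction such an option would impose. */
static borrowed_slice borrow_row(float *base, ptrdiff_t ncols,
                                 ptrdiff_t row_stride_bytes,
                                 ptrdiff_t col_stride_bytes,
                                 ptrdiff_t i)
{
    borrowed_slice s;
    s.data = (char *)base + i * row_stride_bytes;
    s.shape = ncols;
    s.stride = col_stride_bytes;
    return s;
}

static float sum_1d(borrowed_slice s)
{
    float out = 0.0f;
    for (ptrdiff_t j = 0; j < s.shape; j++)
        out += *(float *)(s.data + j * s.stride);
    return out;
}
```

Because nothing is incref'd, creating such a slice is just a handful of scalar assignments, which is the behaviour the benchmark above effectively measured when the INC/DEC calls were commented out.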
BTW here are the callgrind dumps of the benchmarks using Cython's master in case you're interested (without CPython profiling): callgrind.out.2d_views.txt |
Without making any comment about whether it would be useful: this would essentially be a borrowed memoryview slice. It'd have to be unstorable (e.g. it could not be assigned to a class attribute, set as a global variable, or put in a closure). I think the main challenge would be making it easily accessible while keeping the limitations understood. |
Funny idea: could you create and keep one additional slice before entering the loop, i.e. outside of the prange loop? Any slice would do. Just to be sure that we're not hitting the case of the first slice all the time inside of the loop for some unexpected reason. Once you've tried that (to be sure that you get comparable timings), please update to the latest master and give that another try. I sprinkled some branch hints in various exception handling places to keep the C compiler from accidentally guessing wrongly about the fast path through the slicing code. Not sure if it makes any difference, but let's see it from actual numbers. |
I declared an additional slice before entering the loop, as suggested. Also, I made a few additional experiments:
|
That's expected since you wrote that you are only using one thread. prange/OpenMP introduces additional overhead for managing the workload of that one thread that a simple loop does not have.
Can't say if that's a valid comparison. It would mean that the refcount doesn't get increased correctly and I would actually expect that to lead to a crash due to a premature buffer release. Not sure why it still works for you. |
I tried your example, extending it to calculate an actual sum in the nogil functions (so that the C compiler can't just discard the call). Here are my timings:
I would say this is totally in the same ballpark. I also tried it with a larger array of 25000x5000 and still got very close timings for both, within a couple of percent of each other with 4 threads. Here's the code I used (please tell me if I did anything wrong):
#distutils: extra_compile_args=-fopenmp
#distutils: extra_link_args=-fopenmp
#cython: boundscheck=False
#cython: language_level=3
from cython.parallel import prange
def sum_2d_a(float[:, :] view_2d):
cdef:
float out = 0
int row_idx = 0
for row_idx in prange(view_2d.shape[0], nogil=True):
out += sum_1d_a(view_2d[row_idx, :])
return out
cdef float sum_1d_a(float[:] view_1d) nogil:
cdef Py_ssize_t i
cdef float s = 0
for i in range(view_1d.shape[0]):
s += view_1d[i]
return s
def sum_2d_b(float[:, :] view_2d):
cdef:
float out = 0
int row_idx = 0
for row_idx in prange(view_2d.shape[0], nogil=True):
out += sum_1d_b(view_2d, row_idx)
return out
cdef float sum_1d_b(float[:, :] view_2d, int row_idx) nogil:
cdef Py_ssize_t i
cdef float s = 0
for i in range(view_2d.shape[1]):
s += view_2d[row_idx, i]
return s |
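For readers without a Cython toolchain at hand, the same two strategies can be compared in plain C (all names are made up for this sketch). Since there is no memoryview refcounting in C, this only illustrates the addressing difference between passing a row "slice" and passing the whole buffer with a row index, not the overhead discussed in this thread; a C-contiguous layout is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Strategy (a): pass a "slice", i.e. a row pointer plus its length. */
static float sum_1d_a(const float *row, ptrdiff_t n)
{
    float s = 0.0f;
    for (ptrdiff_t i = 0; i < n; i++)
        s += row[i];
    return s;
}

static float sum_2d_a(const float *data, ptrdiff_t nrows, ptrdiff_t ncols)
{
    float out = 0.0f;
    for (ptrdiff_t r = 0; r < nrows; r++)
        out += sum_1d_a(data + r * ncols, ncols);  /* row "slice" */
    return out;
}

/* Strategy (b): pass the whole 2d buffer plus a row index. */
static float sum_1d_b(const float *data, ptrdiff_t ncols, ptrdiff_t row)
{
    float s = 0.0f;
    for (ptrdiff_t i = 0; i < ncols; i++)
        s += data[row * ncols + i];
    return s;
}

static float sum_2d_b(const float *data, ptrdiff_t nrows, ptrdiff_t ncols)
{
    float out = 0.0f;
    for (ptrdiff_t r = 0; r < nrows; r++)
        out += sum_1d_b(data, ncols, r);
    return out;
}
```

Both strategies visit exactly the same elements, which is why a C compiler generates near-identical code for them; the difference in the Cython versions comes from the slice bookkeeping, not the arithmetic.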
I also let callgrind count the instructions and they are not even 1% apart for both versions, with a slight advantage for the 2D non-slicing version. I tested all of this with the latest master branch. |
I also pushed the changes from this ticket over to the 0.29.x branch so that they get included in 0.29.18. Seems worth it. |
Thanks! I can reproduce your results, i.e. I observe no difference when the array is of dimension 5000 x 5000. However, I do observe a significant difference when ncols << nrows. I tried with OMP_NUM_THREADS set to 1 or 4. I do not observe any slowdown when nrows >= ncols, though. |
Doctor, doctor! It hurts when I do this. – Then don't do it! :) |
Seriously, that's pretty much the worst case for slicing: it creates a huge number of very small slices. You also wouldn't create a ton of threads to let each of them sum a handful of integers, because there is overhead involved in setting them up and managing them. Similarly, there is overhead involved in safely passing around a view on a memory-managed buffer. I'm well aware that Cython lets you shoot yourself in the foot in many cases if you really, really ask for it, but I'd be happier if the answer to the problem at hand wasn't "let's add yet another directive". |
Sure, just to clarify: I only came up with this contrived "sum" example because I was trying to find a minimal reproducing example. Our actual use in scikit-learn is more grounded in reality, and I can go into the details if you want, but basically it amounts to calling a Cythonized function on each row of a big matrix where ncols << nrows. Would you recommend using pointers for this instead? We're trying to avoid pointers as much as possible since, ideally, we don't want to have to worry about strides and C/Fortran alignment. I took a deeper dive into the implementation of the memoryview refcounting, and it seems to rely on atomic ints; are those costly? |
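One pointer-based alternative, sketched here in plain C under the assumption of a double buffer (all names are illustrative), is to pass the base pointer together with explicit byte strides. This avoids creating any slices while still handling both C- and Fortran-ordered buffers, addressing the alignment concern:

```c
#include <assert.h>
#include <stddef.h>

/* Sum one row given its base address and the byte stride between
 * consecutive elements of that row. */
static double sum_row_strided(const char *row, ptrdiff_t ncols,
                              ptrdiff_t col_stride_bytes)
{
    double s = 0.0;
    for (ptrdiff_t j = 0; j < ncols; j++)
        s += *(const double *)(row + j * col_stride_bytes);
    return s;
}

/* Walk the matrix row by row using byte strides; works for any
 * memory layout, since the strides fully describe the addressing. */
static double sum_matrix(const char *data, ptrdiff_t nrows, ptrdiff_t ncols,
                         ptrdiff_t row_stride_bytes, ptrdiff_t col_stride_bytes)
{
    double out = 0.0;
    for (ptrdiff_t i = 0; i < nrows; i++)
        out += sum_row_strided(data + i * row_stride_bytes,
                               ncols, col_stride_bytes);
    return out;
}
```

In Cython, the `data` and `strides` members of a memoryview provide exactly these values, so the stride bookkeeping does not have to be hard-coded per layout.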
I can well imagine that that's a use case for something like scikit-learn, thanks for the insights. Atomic ints are not costly per se, they are often implemented in hardware and thus fairly cheap in comparison to many other forms of locking mechanisms. But they do represent a form of locking, and that's still not entirely for free, and (as with all such mechanisms) gets worse under congestion. With very small work packets like in your case, they're probably getting close to saturation. I'm wondering, #2227 has been pending for a while now. Would it be a good idea to tweak the semantics of for-loop iteration by axis to return borrowed slices, with their lifetime bound to the loop iteration? I think that would also help your use case. |
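As a rough illustration of that cost model (a generic C11 sketch, not Cython's actual implementation; all names are invented), an atomic refcount boils down to one atomic add per slice creation and one atomic subtract per destruction. Each operation is cheap in isolation, but one pair per tiny row slice, potentially contended across threads, adds up:

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy stand-in for a refcounted buffer owner. */
typedef struct {
    atomic_int acquisition_count; /* illustrative name */
} fake_memview;

static void fake_incref(fake_memview *mv)
{
    /* one hardware atomic add per slice created */
    atomic_fetch_add_explicit(&mv->acquisition_count, 1,
                              memory_order_relaxed);
}

static void fake_decref(fake_memview *mv)
{
    /* one hardware atomic subtract per slice destroyed; a real
     * implementation would release the buffer (needing the GIL)
     * when the count drops back to its initial value */
    atomic_fetch_sub_explicit(&mv->acquisition_count, 1,
                              memory_order_acq_rel);
}
```

With many threads hammering the same counter, the cache line holding it bounces between cores, which is the congestion effect described above.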
I'm closing this ticket since several things have been improved. I think the future work on this is out of its scope now. |
Indexing a 2d view in a prange will generate some unnecessary Python interactions related to the GIL. Here is a reproducing example. LMK if I can help with anything.
Version: 0.29.6