
prange generates unnecessary python interactions when indexing a 2d view #2987

Closed

NicolasHug opened this issue Jun 7, 2019 · 29 comments

@NicolasHug (Contributor)
NicolasHug commented Jun 7, 2019

Indexing a 2d view in a prange loop generates some unnecessary Python interactions related to the GIL.

Here is a minimal reproducing example. Let me know if I can help with anything.

#cython: boundscheck=False
#cython: language_level=3
from cython.parallel import prange

# Indexing a 2d view in a prange loop: generates lots of python interactions.
def sum_2d_a(float[:, :] view_2d):

    cdef:
        float out = 0
        int row_idx = 0

    for row_idx in prange(view_2d.shape[0], nogil=True):
        out += sum_1d_a(view_2d[row_idx, :])
        # lots of python interactions at the end of the loop

    return out

cdef float sum_1d_a(float[:] view_1d) nogil:
    return 3.4  # irrelevant code


# workaround: pass the whole 2d view and the row index. No interactions in this case.
def sum_2d_b(float[:, :] view_2d):

    cdef:
        float out = 0
        int row_idx = 0

    for row_idx in prange(view_2d.shape[0], nogil=True):
        out += sum_1d_b(view_2d, row_idx)
        # no python interaction

    return out

cdef float sum_1d_b(float[:, :] view_2d, int row_idx) nogil:
    return 3.4  # irrelevant code

Version: 0.29.6

@robertwb (Contributor)

robertwb commented Jun 8, 2019

Creation and (last) deletion of a slice require the GIL (to prevent the underlying memoryview from being deleted before the slice). If not done internally, this happens in the __PYX_{INC,DEC}_MEMVIEW utilities. Even if we had stronger analysis, we'd have to know that sum_1d_b doesn't cache it away somewhere, so it's hard to see how we'd do better here.
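To illustrate the caching concern: nothing in the callee's signature guarantees that it does not keep a reference to the slice beyond the call. A hypothetical sketch (the `stashed` variable is made up for illustration, not part of the original example):

```cython
cdef float[:] stashed  # hypothetical module-level slot

cdef float sum_1d_a(float[:] view_1d) nogil:
    global stashed
    with gil:
        stashed = view_1d  # the slice now outlives the caller's loop iteration
    return 3.4
```

Because possibilities like this cannot be ruled out without whole-program analysis, the caller must increment the reference count of the buffer owner before every call that receives a freshly created slice.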

@NicolasHug (Contributor, Author)

Thanks for your feedback.

I'm a bit confused about memoryviews needing the GIL, because changing prange to range removes the Python interactions, even inside a with nogil: context manager.

@NicolasHug (Contributor, Author)

NicolasHug commented May 14, 2020

@robertwb to clarify my comment above:

When using range, no GIL-related interaction:

def sum_2d_a(float[:, :] view_2d):

    cdef:
        float out = 0
        int row_idx = 0

    for row_idx in range(view_2d.shape[0], nogil=True):
        out += sum_1d_a(view_2d[row_idx, :])  # no interactions here

Using prange, lots of GIL-related interactions:

def sum_2d_a(float[:, :] view_2d):

    cdef:
        float out = 0
        int row_idx = 0

    for row_idx in prange(view_2d.shape[0], nogil=True):
        out += sum_1d_a(view_2d[row_idx, :])  # interactions here

As you can see in the first example, the GIL is released when the slice is created.

The only difference between the two examples is the use of range vs prange.

Again, I'm happy to assist in any way I can.

Thanks

@scoder (Contributor)

scoder commented May 14, 2020

for row_idx in range(view_2d.shape[0], nogil=True):

We should make sure this raises an error. range() doesn't take keyword arguments (definitely not one called nogil). In any case, this does not release the GIL, and thus the loop does not need to re-acquire it.

@NicolasHug, taking a slice of a memoryview needs to create a new reference to the buffer owner object. Refcounting requires the GIL. So there needs to be at least some interaction with the Python runtime. Maybe you can get away with calling the function once per (larger) slice and not once for each row? E.g. use prange(N) for N threads and then slice the data into N slices yourself, passing down a 2D view.
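The per-thread chunking that scoder suggests could be sketched roughly like this (untested; the names `sum_2d_chunked` and `process_chunk` are illustrative only, not from the thread):

```cython
# cython: boundscheck=False
from cython.parallel import prange

def sum_2d_chunked(float[:, :] view_2d, int n_threads):
    cdef:
        float out = 0
        int t, start, stop
        int n_rows = view_2d.shape[0]
        int chunk = (n_rows + n_threads - 1) // n_threads  # ceil division

    for t in prange(n_threads, nogil=True, num_threads=n_threads):
        start = t * chunk
        stop = start + chunk
        if stop > n_rows:
            stop = n_rows
        # One slice -- and hence one __PYX_INC/DEC_MEMVIEW pair -- per
        # thread, instead of one per row.
        out += process_chunk(view_2d[start:stop, :])
    return out

cdef float process_chunk(float[:, :] chunk_2d) nogil:
    cdef float acc = 0
    cdef int i
    for i in range(chunk_2d.shape[0]):
        acc += 3.4  # irrelevant code, as in the original example
    return acc
```

This keeps the number of GIL acquisitions proportional to the number of threads rather than the number of rows.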

@NicolasHug (Contributor, Author)

Thanks for the feedback @scoder.

My bad, what I meant to write was this, and I still don't see any Python interaction when creating the slices:

    with nogil:
        for row_idx in range(view_2d.shape[0]):
            out += sum_1d_a(view_2d[row_idx, :])  # no interactions

Maybe you can get away with calling the function once per (larger) slice and not once for each row? E.g. use prange(N) for N threads and then slice the data into N slices yourself, passing down a 2D view.

This would work in this simple example, but my actual use case is trickier and this workaround would not apply. I do need the prange over the rows (and sometimes columns) of a big array with a custom dtype. Here is where, in case you're interested.

@scoder (Contributor)

scoder commented May 14, 2020

Ok, I finally looked at the code that your initial example generates and it's essentially this:

Py_BEGIN_ALLOW_THREADS
#ifdef _OPENMP
#pragma omp for firstprivate(__pyx_v_row_idx) lastprivate(__pyx_v_row_idx)
#endif /* _OPENMP */
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
    if (__pyx_parallel_why < 2)
    {
        __pyx_v_row_idx = (int)(0 + 1 * __pyx_t_2);

        /* "test.pyx":13
 *         out += sum_1d_a(view_2d[row_idx, :])             # <<<<<<<<<<<<<<
 */
        __pyx_t_4.data = __pyx_v_view_2d.data;
        __pyx_t_4.memview = __pyx_v_view_2d.memview;
        __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
        {
            Py_ssize_t __pyx_tmp_idx = __pyx_v_row_idx;
            Py_ssize_t __pyx_tmp_shape = __pyx_v_view_2d.shape[0];
            Py_ssize_t __pyx_tmp_stride = __pyx_v_view_2d.strides[0];
            if (__pyx_tmp_idx < 0)
                __pyx_tmp_idx += __pyx_tmp_shape;
            if ((0)) __PYX_ERR(0, 13, __pyx_L8_error)
            __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
        }

        __pyx_t_4.shape[0] = __pyx_v_view_2d.shape[1];
        __pyx_t_4.strides[0] = __pyx_v_view_2d.strides[1];
        __pyx_t_4.suboffsets[0] = -1;

        __pyx_v_out = (__pyx_v_out + __pyx_f_4test_sum_1d_a(__pyx_t_4));
        __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
        __pyx_t_4.memview = NULL; __pyx_t_4.data = NULL;
        …
        #ifdef _OPENMP
        #pragma omp flush(__pyx_parallel_why)
        #endif /* _OPENMP */
    }
}
#ifdef _OPENMP
Py_END_ALLOW_THREADS

Bad formatting aside, that looks fairly reasonable to me. I don't see any excessive "Python interaction" here. There is a bit of unused error handling code in there which I stripped above and which could probably be avoided, but the C compiler will be happy to discard it.

@da-woods (Contributor)

da-woods commented May 14, 2020

taking a slice of a memoryview needs to create a new reference to the buffer owner object. Refcounting requires the GIL.

The documentation claims otherwise: https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#memoryviews-and-the-gil

(And based on the generated code the documentation looks like it's right)

@NicolasHug (Contributor, Author)

Thanks for taking the time @scoder

I might be wrong in my diagnosis that Python interaction is the cause of my problem, but it seems there is still something fishy going on. Maybe it's time to explain in more detail what motivated me to open this issue:

In scikit-learn, we have some benchmark code that runs in about 45 seconds when using "full" indexing on 2d views, i.e. always using some_2d_view[row, col].
However, when we simplify the code and pass 1d views (some_2d_view[row, :]), that benchmark takes up to about 100 seconds.

For info, here's the diff between the two branches (the slow branch is in green):

Diff
diff --git a/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx b/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx
index d346aabdac..78f7b3906b 100644
--- a/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx
+++ b/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx
@@ -30,12 +30,12 @@ def _predict_from_numeric_data(
         int i
 
     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
-        out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
+        out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
 
 
 cdef inline Y_DTYPE_C _predict_one_from_numeric_data(
         node_struct [:] nodes,
-        const X_DTYPE_C [:, :] numeric_data,
+        const X_DTYPE_C [:] numeric_data,
         const int row) nogil:
     # Need to pass the whole array and the row index, else prange won't work.
     # See issue Cython #2798
@@ -47,13 +47,13 @@ cdef inline Y_DTYPE_C _predict_one_from_numeric_data(
         if node.is_leaf:
             return node.value
 
-        if isnan(numeric_data[row, node.feature_idx]):
+        if isnan(numeric_data[node.feature_idx]):
             if node.missing_go_to_left:
                 node = nodes[node.left]
             else:
                 node = nodes[node.right]
         else:
-            if numeric_data[row, node.feature_idx] <= node.threshold:
+            if numeric_data[node.feature_idx] <= node.threshold:
                 node = nodes[node.left]
             else:
                 node = nodes[node.right]
@@ -69,13 +69,13 @@ def _predict_from_binned_data(
         int i
 
     for i in prange(binned_data.shape[0], schedule='static', nogil=True):
-        out[i] = _predict_one_from_binned_data(nodes, binned_data, i,
+        out[i] = _predict_one_from_binned_data(nodes, binned_data[i], i,
                                                missing_values_bin_idx)
 
 
 cdef inline Y_DTYPE_C _predict_one_from_binned_data(
         node_struct [:] nodes,
-        const X_BINNED_DTYPE_C [:, :] binned_data,
+        const X_BINNED_DTYPE_C [:] binned_data,
         const int row,
         const unsigned char missing_values_bin_idx) nogil:
     # Need to pass the whole array and the row index, else prange won't work.
@@ -87,13 +87,13 @@ cdef inline Y_DTYPE_C _predict_one_from_binned_data(
     while True:
         if node.is_leaf:
             return node.value
-        if binned_data[row, node.feature_idx] ==  missing_values_bin_idx:
+        if binned_data[ node.feature_idx] ==  missing_values_bin_idx:
             if node.missing_go_to_left:
                 node = nodes[node.left]
             else:
                 node = nodes[node.right]
         else:
-            if binned_data[row, node.feature_idx] <= node.bin_threshold:
+            if binned_data[ node.feature_idx] <= node.bin_threshold:
                 node = nodes[node.left]
             else:
                 node = nodes[node.right]
diff --git a/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx b/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx
index 8bd7c4ee8b..578b48995f 100644
--- a/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx
+++ b/sklearn/ensemble/_hist_gradient_boosting/histogram.pyx
@@ -155,7 +155,7 @@ cdef class HistogramBuilder:
             for feature_idx in prange(n_features, schedule='static'):
                 # Compute histogram of each feature
                 self._compute_histogram_brute_single_feature(
-                    feature_idx, sample_indices, histograms)
+                    feature_idx, sample_indices, histograms[feature_idx])
 
         return histograms
 
@@ -163,7 +163,7 @@ cdef class HistogramBuilder:
             HistogramBuilder self,
             const int feature_idx,
             const unsigned int [::1] sample_indices,  # IN
-            hist_struct [:, ::1] histograms) nogil:  # OUT
+            hist_struct [::1] histograms) nogil:  # OUT
         """Compute the histogram for a given feature."""
 
         cdef:
@@ -180,20 +180,20 @@ cdef class HistogramBuilder:
 
         if root_node:
             if hessians_are_constant:
-                _build_histogram_root_no_hessian(feature_idx, X_binned,
+                _build_histogram_root_no_hessian(X_binned,
                                                  ordered_gradients,
                                                  histograms)
             else:
-                _build_histogram_root(feature_idx, X_binned,
+                _build_histogram_root(X_binned,
                                       ordered_gradients, ordered_hessians,
                                       histograms)
         else:
             if hessians_are_constant:
-                _build_histogram_no_hessian(feature_idx,
+                _build_histogram_no_hessian(
                                             sample_indices, X_binned,
                                             ordered_gradients, histograms)
             else:
-                _build_histogram(feature_idx, sample_indices,
+                _build_histogram(sample_indices,
                                  X_binned, ordered_gradients,
                                  ordered_hessians, histograms)
 
@@ -234,11 +234,11 @@ cdef class HistogramBuilder:
 
         for feature_idx in prange(n_features, schedule='static', nogil=True):
             # Compute histogram of each feature
-            _subtract_histograms(feature_idx,
+            _subtract_histograms(
                                  self.n_bins,
-                                 parent_histograms,
-                                 sibling_histograms,
-                                 histograms)
+                                 parent_histograms[feature_idx],
+                                 sibling_histograms[feature_idx],
+                                 histograms[feature_idx])
         return histograms
 
 
@@ -267,36 +267,34 @@ cpdef void _build_histogram_naive(
 
 
 cpdef void _subtract_histograms(
-        const int feature_idx,
         unsigned int n_bins,
-        hist_struct [:, ::1] hist_a,  # IN
-        hist_struct [:, ::1] hist_b,  # IN
-        hist_struct [:, ::1] out) nogil:  # OUT
+        hist_struct [::1] hist_a,  # IN
+        hist_struct [::1] hist_b,  # IN
+        hist_struct [::1] out) nogil:  # OUT
     """compute (hist_a - hist_b) in out"""
     cdef:
         unsigned int i = 0
     for i in range(n_bins):
-        out[feature_idx, i].sum_gradients = (
-            hist_a[feature_idx, i].sum_gradients -
-            hist_b[feature_idx, i].sum_gradients
+        out[i].sum_gradients = (
+            hist_a[i].sum_gradients -
+            hist_b[i].sum_gradients
         )
-        out[feature_idx, i].sum_hessians = (
-            hist_a[feature_idx, i].sum_hessians -
-            hist_b[feature_idx, i].sum_hessians
+        out[i].sum_hessians = (
+            hist_a[i].sum_hessians -
+            hist_b[i].sum_hessians
         )
-        out[feature_idx, i].count = (
-            hist_a[feature_idx, i].count -
-            hist_b[feature_idx, i].count
+        out[i].count = (
+            hist_a[i].count -
+            hist_b[i].count
         )
 
 
 cpdef void _build_histogram(
-        const int feature_idx,
         const unsigned int [::1] sample_indices,  # IN
         const X_BINNED_DTYPE_C [::1] binned_feature,  # IN
         const G_H_DTYPE_C [::1] ordered_gradients,  # IN
         const G_H_DTYPE_C [::1] ordered_hessians,  # IN
-        hist_struct [:, ::1] out) nogil:  # OUT
+        hist_struct [::1] out) nogil:  # OUT
     """Return histogram for a given feature."""
     cdef:
         unsigned int i = 0
@@ -315,34 +313,33 @@ cpdef void _build_histogram(
         bin_2 = binned_feature[sample_indices[i + 2]]
         bin_3 = binned_feature[sample_indices[i + 3]]
 
-        out[feature_idx, bin_0].sum_gradients += ordered_gradients[i]
-        out[feature_idx, bin_1].sum_gradients += ordered_gradients[i + 1]
-        out[feature_idx, bin_2].sum_gradients += ordered_gradients[i + 2]
-        out[feature_idx, bin_3].sum_gradients += ordered_gradients[i + 3]
+        out[bin_0].sum_gradients += ordered_gradients[i]
+        out[bin_1].sum_gradients += ordered_gradients[i + 1]
+        out[bin_2].sum_gradients += ordered_gradients[i + 2]
+        out[bin_3].sum_gradients += ordered_gradients[i + 3]
 
-        out[feature_idx, bin_0].sum_hessians += ordered_hessians[i]
-        out[feature_idx, bin_1].sum_hessians += ordered_hessians[i + 1]
-        out[feature_idx, bin_2].sum_hessians += ordered_hessians[i + 2]
-        out[feature_idx, bin_3].sum_hessians += ordered_hessians[i + 3]
+        out[bin_0].sum_hessians += ordered_hessians[i]
+        out[bin_1].sum_hessians += ordered_hessians[i + 1]
+        out[bin_2].sum_hessians += ordered_hessians[i + 2]
+        out[bin_3].sum_hessians += ordered_hessians[i + 3]
 
-        out[feature_idx, bin_0].count += 1
-        out[feature_idx, bin_1].count += 1
-        out[feature_idx, bin_2].count += 1
-        out[feature_idx, bin_3].count += 1
+        out[bin_0].count += 1
+        out[bin_1].count += 1
+        out[bin_2].count += 1
+        out[bin_3].count += 1
 
     for i in range(unrolled_upper, n_node_samples):
         bin_idx = binned_feature[sample_indices[i]]
-        out[feature_idx, bin_idx].sum_gradients += ordered_gradients[i]
-        out[feature_idx, bin_idx].sum_hessians += ordered_hessians[i]
-        out[feature_idx, bin_idx].count += 1
+        out[bin_idx].sum_gradients += ordered_gradients[i]
+        out[bin_idx].sum_hessians += ordered_hessians[i]
+        out[bin_idx].count += 1
 
 
 cpdef void _build_histogram_no_hessian(
-        const int feature_idx,
         const unsigned int [::1] sample_indices,  # IN
         const X_BINNED_DTYPE_C [::1] binned_feature,  # IN
         const G_H_DTYPE_C [::1] ordered_gradients,  # IN
-        hist_struct [:, ::1] out) nogil:  # OUT
+        hist_struct [::1] out) nogil:  # OUT
     """Return histogram for a given feature, not updating hessians.
 
     Used when the hessians of the loss are constant (typically LS loss).
@@ -365,28 +362,27 @@ cpdef void _build_histogram_no_hessian(
         bin_2 = binned_feature[sample_indices[i + 2]]
         bin_3 = binned_feature[sample_indices[i + 3]]
 
-        out[feature_idx, bin_0].sum_gradients += ordered_gradients[i]
-        out[feature_idx, bin_1].sum_gradients += ordered_gradients[i + 1]
-        out[feature_idx, bin_2].sum_gradients += ordered_gradients[i + 2]
-        out[feature_idx, bin_3].sum_gradients += ordered_gradients[i + 3]
+        out[bin_0].sum_gradients += ordered_gradients[i]
+        out[bin_1].sum_gradients += ordered_gradients[i + 1]
+        out[bin_2].sum_gradients += ordered_gradients[i + 2]
+        out[bin_3].sum_gradients += ordered_gradients[i + 3]
 
-        out[feature_idx, bin_0].count += 1
-        out[feature_idx, bin_1].count += 1
-        out[feature_idx, bin_2].count += 1
-        out[feature_idx, bin_3].count += 1
+        out[bin_0].count += 1
+        out[bin_1].count += 1
+        out[bin_2].count += 1
+        out[bin_3].count += 1
 
     for i in range(unrolled_upper, n_node_samples):
         bin_idx = binned_feature[sample_indices[i]]
-        out[feature_idx, bin_idx].sum_gradients += ordered_gradients[i]
-        out[feature_idx, bin_idx].count += 1
+        out[bin_idx].sum_gradients += ordered_gradients[i]
+        out[bin_idx].count += 1
 
 
 cpdef void _build_histogram_root(
-        const int feature_idx,
         const X_BINNED_DTYPE_C [::1] binned_feature,  # IN
         const G_H_DTYPE_C [::1] all_gradients,  # IN
         const G_H_DTYPE_C [::1] all_hessians,  # IN
-        hist_struct [:, ::1] out) nogil:  # OUT
+        hist_struct [::1] out) nogil:  # OUT
     """Compute histogram of the root node.
 
     Unlike other nodes, the root node has to find the split among *all* the
@@ -412,33 +408,32 @@ cpdef void _build_histogram_root(
         bin_2 = binned_feature[i + 2]
         bin_3 = binned_feature[i + 3]
 
-        out[feature_idx, bin_0].sum_gradients += all_gradients[i]
-        out[feature_idx, bin_1].sum_gradients += all_gradients[i + 1]
-        out[feature_idx, bin_2].sum_gradients += all_gradients[i + 2]
-        out[feature_idx, bin_3].sum_gradients += all_gradients[i + 3]
+        out[bin_0].sum_gradients += all_gradients[i]
+        out[bin_1].sum_gradients += all_gradients[i + 1]
+        out[bin_2].sum_gradients += all_gradients[i + 2]
+        out[bin_3].sum_gradients += all_gradients[i + 3]
 
-        out[feature_idx, bin_0].sum_hessians += all_hessians[i]
-        out[feature_idx, bin_1].sum_hessians += all_hessians[i + 1]
-        out[feature_idx, bin_2].sum_hessians += all_hessians[i + 2]
-        out[feature_idx, bin_3].sum_hessians += all_hessians[i + 3]
+        out[bin_0].sum_hessians += all_hessians[i]
+        out[bin_1].sum_hessians += all_hessians[i + 1]
+        out[bin_2].sum_hessians += all_hessians[i + 2]
+        out[bin_3].sum_hessians += all_hessians[i + 3]
 
-        out[feature_idx, bin_0].count += 1
-        out[feature_idx, bin_1].count += 1
-        out[feature_idx, bin_2].count += 1
-        out[feature_idx, bin_3].count += 1
+        out[bin_0].count += 1
+        out[bin_1].count += 1
+        out[bin_2].count += 1
+        out[bin_3].count += 1
 
     for i in range(unrolled_upper, n_samples):
         bin_idx = binned_feature[i]
-        out[feature_idx, bin_idx].sum_gradients += all_gradients[i]
-        out[feature_idx, bin_idx].sum_hessians += all_hessians[i]
-        out[feature_idx, bin_idx].count += 1
+        out[bin_idx].sum_gradients += all_gradients[i]
+        out[bin_idx].sum_hessians += all_hessians[i]
+        out[bin_idx].count += 1
 
 
 cpdef void _build_histogram_root_no_hessian(
-        const int feature_idx,
         const X_BINNED_DTYPE_C [::1] binned_feature,  # IN
         const G_H_DTYPE_C [::1] all_gradients,  # IN
-        hist_struct [:, ::1] out) nogil:  # OUT
+        hist_struct [::1] out) nogil:  # OUT
     """Compute histogram of the root node, not updating hessians.
 
     Used when the hessians of the loss are constant (typically LS loss).
@@ -461,17 +456,17 @@ cpdef void _build_histogram_root_no_hessian(
         bin_2 = binned_feature[i + 2]
         bin_3 = binned_feature[i + 3]
 
-        out[feature_idx, bin_0].sum_gradients += all_gradients[i]
-        out[feature_idx, bin_1].sum_gradients += all_gradients[i + 1]
-        out[feature_idx, bin_2].sum_gradients += all_gradients[i + 2]
-        out[feature_idx, bin_3].sum_gradients += all_gradients[i + 3]
+        out[bin_0].sum_gradients += all_gradients[i]
+        out[bin_1].sum_gradients += all_gradients[i + 1]
+        out[bin_2].sum_gradients += all_gradients[i + 2]
+        out[bin_3].sum_gradients += all_gradients[i + 3]
 
-        out[feature_idx, bin_0].count += 1
-        out[feature_idx, bin_1].count += 1
-        out[feature_idx, bin_2].count += 1
-        out[feature_idx, bin_3].count += 1
+        out[bin_0].count += 1
+        out[bin_1].count += 1
+        out[bin_2].count += 1
+        out[bin_3].count += 1
 
     for i in range(unrolled_upper, n_samples):
         bin_idx = binned_feature[i]
-        out[feature_idx, bin_idx].sum_gradients += all_gradients[i]
-        out[feature_idx, bin_idx].count += 1
+        out[bin_idx].sum_gradients += all_gradients[i]
+        out[bin_idx].count += 1
diff --git a/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx b/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
index 984cc6767f..cb98d1e069 100644
--- a/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
+++ b/sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
@@ -437,7 +437,7 @@ cdef class Splitter:
 
                 self._find_best_bin_to_split_left_to_right(
                     feature_idx, has_missing_values[feature_idx],
-                    histograms, n_samples, sum_gradients, sum_hessians,
+                    histograms[feature_idx], n_samples, sum_gradients, sum_hessians,
                     value, monotonic_cst[feature_idx],
                     lower_bound, upper_bound, &split_infos[feature_idx])
 
@@ -446,7 +446,7 @@ cdef class Splitter:
                     # sending the nans to the left child would lead to a higher
                     # gain
                     self._find_best_bin_to_split_right_to_left(
-                        feature_idx, histograms, n_samples,
+                        feature_idx, histograms[feature_idx], n_samples,
                         sum_gradients, sum_hessians,
                         value, monotonic_cst[feature_idx],
                         lower_bound, upper_bound, &split_infos[feature_idx])
@@ -491,7 +491,7 @@ cdef class Splitter:
             Splitter self,
             unsigned int feature_idx,
             unsigned char has_missing_values,
-            const hist_struct [:, ::1] histograms,  # IN
+            const hist_struct [::1] histograms,  # IN
             unsigned int n_samples,
             Y_DTYPE_C sum_gradients,
             Y_DTYPE_C sum_hessians,
@@ -540,17 +540,17 @@ cdef class Splitter:
         loss_current_node = _loss_from_value(value, sum_gradients)
 
         for bin_idx in range(end):
-            n_samples_left += histograms[feature_idx, bin_idx].count
+            n_samples_left += histograms[bin_idx].count
             n_samples_right = n_samples_ - n_samples_left
 
             if self.hessians_are_constant:
-                sum_hessian_left += histograms[feature_idx, bin_idx].count
+                sum_hessian_left += histograms[bin_idx].count
             else:
                 sum_hessian_left += \
-                    histograms[feature_idx, bin_idx].sum_hessians
+                    histograms[bin_idx].sum_hessians
             sum_hessian_right = sum_hessians - sum_hessian_left
 
-            sum_gradient_left += histograms[feature_idx, bin_idx].sum_gradients
+            sum_gradient_left += histograms[bin_idx].sum_gradients
             sum_gradient_right = sum_gradients - sum_gradient_left
 
             if n_samples_left < self.min_samples_leaf:
@@ -605,7 +605,7 @@ cdef class Splitter:
     cdef void _find_best_bin_to_split_right_to_left(
             self,
             unsigned int feature_idx,
-            const hist_struct [:, ::1] histograms,  # IN
+            const hist_struct [::1] histograms,  # IN
             unsigned int n_samples,
             Y_DTYPE_C sum_gradients,
             Y_DTYPE_C sum_hessians,
@@ -653,18 +653,18 @@ cdef class Splitter:
         loss_current_node = _loss_from_value(value, sum_gradients)
 
         for bin_idx in range(start, -1, -1):
-            n_samples_right += histograms[feature_idx, bin_idx + 1].count
+            n_samples_right += histograms[bin_idx + 1].count
             n_samples_left = n_samples_ - n_samples_right
 
             if self.hessians_are_constant:
-                sum_hessian_right += histograms[feature_idx, bin_idx + 1].count
+                sum_hessian_right += histograms[bin_idx + 1].count
             else:
                 sum_hessian_right += \
-                    histograms[feature_idx, bin_idx + 1].sum_hessians
+                    histograms[bin_idx + 1].sum_hessians
             sum_hessian_left = sum_hessians - sum_hessian_right
 
             sum_gradient_right += \
-                histograms[feature_idx, bin_idx + 1].sum_gradients
+                histograms[bin_idx + 1].sum_gradients
             sum_gradient_left = sum_gradients - sum_gradient_right
 
             if n_samples_right < self.min_samples_leaf:

Now, the reason I think the slowdown comes from the interactions is that, as far as I can tell, the generated C code for the indexing itself looks just fine. For example, out[bin_0].count += 1 generates:

__pyx_t_28 = __pyx_v_bin_0;
(*((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *)
   ( /* dim=0 */ ((char *)
   (((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *)
    __pyx_v_out.data)
    + __pyx_t_28)) ))).count += 1

while out[feature_idx, bin_0].count += 1 generates

__pyx_t_36 = __pyx_v_feature_idx;
__pyx_t_37 = __pyx_v_bin_0;
(*((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *)
  ( /* dim=1 */ ((char *)
  (((struct __pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_hist_struct *) 
  ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_36 * __pyx_v_out.strides[0]) ))
  + __pyx_t_37)) ))).count += 1;

(out is either declared as hist_struct [::1] out or as hist_struct [:, ::1] out)

As far as I can tell, the first snippet uses fewer operations than the second one, as expected. And yet its branch is much slower. That is why I think the slowdown might come from the generated interactions.

Thanks again for your time.

@scoder (Contributor)

scoder commented May 14, 2020

@NicolasHug any chance you could run your code through a C level profiler like callgrind or Linux perf to see where the time is spent? How many threads are you using for your timings?

@NicolasHug (Contributor, Author)

NicolasHug commented May 14, 2020

How many threads are you using for your timings?

I was using 4 threads, but I observe the same slowdown when setting OMP_NUM_THREADS to 1.

any chance you could run your code through a C level profiler

Of course

I used callgrind on a smaller benchmark (setting OMP_NUM_THREADS to 1 and enabling cython profiling, though I'm not sure this has any effect), and here are the logs:

callgrind.out.slow_one.txt
callgrind.out.fast_one.txt

The "slow" one is the one using 1d views.

Loading these in KCachegrind, we can observe that the slow one runs about 5 billion more instructions. Most of them are in a function called _predict_from_numeric_data. I'll paste below the code of that function for the "fast" and "slow" branches. Also, as far as I understand, those 5B additional instructions come from the function itself, not from its callees (I could be wrong on this one). So the source of the slowdown should be somewhere in this diff (I have also provided the code of each function, if you need it):

1c1
< static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
---
> static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
3a4
>   __Pyx_TraceDeclarations
8c9,11
<   Py_ssize_t __pyx_t_4;
---
>   __Pyx_memviewslice __pyx_t_4 = { 0, 0, { 0 }, { 0 }, { 0 } };
>   Py_ssize_t __pyx_t_5;
>   __Pyx_TraceFrameInit(__pyx_codeobj_)
9a13
>   __Pyx_TraceCall("_predict_from_numeric_data", __pyx_f[0], 25, 0, __PYX_ERR(0, 25, __pyx_L1_error));
11c15
<   /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
---
>   /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
15c19
<  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
>  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
25c29
<         if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 32, __pyx_L4_error) }
---
>         if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 33, __pyx_L4_error) }
28a33,37
>             int __pyx_parallel_temp0 = ((int)0xbad0bad0);
>             const char *__pyx_parallel_filename = NULL; int __pyx_parallel_lineno = 0, __pyx_parallel_clineno = 0;
>             PyObject *__pyx_parallel_exc_type = NULL, *__pyx_parallel_exc_value = NULL, *__pyx_parallel_exc_tb = NULL;
>             int __pyx_parallel_why;
>             __pyx_parallel_why = 0;
39c48
<                 #pragma omp parallel private(__pyx_t_4)
---
>                 #pragma omp parallel private(__pyx_t_5) firstprivate(__pyx_t_4) private(__pyx_filename, __pyx_lineno, __pyx_clineno) shared(__pyx_parallel_why, __pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb)
42a52,57
>                     #ifdef WITH_THREAD
>                     PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
>                     #endif
>                     Py_BEGIN_ALLOW_THREADS
>                     #endif /* _OPENMP */
>                     #ifdef _OPENMP
45a61
>                         if (__pyx_parallel_why < 2)
49c65
<                             /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
---
>                             /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":34
52c68
<  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)             # <<<<<<<<<<<<<<
---
>  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)             # <<<<<<<<<<<<<<
56,57c72,121
<                             __pyx_t_4 = __pyx_v_i;
<                             *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_4 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_v_numeric_data, __pyx_v_i);
---
>                             __pyx_t_4.data = __pyx_v_numeric_data.data;
>                             __pyx_t_4.memview = __pyx_v_numeric_data.memview;
>                             __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
>                             {
>     Py_ssize_t __pyx_tmp_idx = __pyx_v_i;
>     Py_ssize_t __pyx_tmp_stride = __pyx_v_numeric_data.strides[0];
>         if ((0)) __PYX_ERR(0, 34, __pyx_L8_error)
>         __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
> }
> 
> __pyx_t_4.shape[0] = __pyx_v_numeric_data.shape[1];
> __pyx_t_4.strides[0] = __pyx_v_numeric_data.strides[1];
>     __pyx_t_4.suboffsets[0] = -1;
> 
> __pyx_t_5 = __pyx_v_i;
>                             *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_5 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_t_4, __pyx_v_i);
>                             __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
>                             __pyx_t_4.memview = NULL;
>                             __pyx_t_4.data = NULL;
>                             goto __pyx_L11;
>                             __pyx_L8_error:;
>                             {
>                                 #ifdef WITH_THREAD
>                                 PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
>                                 #endif
>                                 #ifdef _OPENMP
>                                 #pragma omp flush(__pyx_parallel_exc_type)
>                                 #endif /* _OPENMP */
>                                 if (!__pyx_parallel_exc_type) {
>                                   __Pyx_ErrFetchWithState(&__pyx_parallel_exc_type, &__pyx_parallel_exc_value, &__pyx_parallel_exc_tb);
>                                   __pyx_parallel_filename = __pyx_filename; __pyx_parallel_lineno = __pyx_lineno; __pyx_parallel_clineno = __pyx_clineno;
>                                   __Pyx_GOTREF(__pyx_parallel_exc_type);
>                                 }
>                                 #ifdef WITH_THREAD
>                                 __Pyx_PyGILState_Release(__pyx_gilstate_save);
>                                 #endif
>                             }
>                             __pyx_parallel_why = 4;
>                             goto __pyx_L10;
>                             __pyx_L10:;
>                             #ifdef _OPENMP
>                             #pragma omp critical(__pyx_parallel_lastprivates0)
>                             #endif /* _OPENMP */
>                             {
>                                 __pyx_parallel_temp0 = __pyx_v_i;
>                             }
>                             __pyx_L11:;
>                             #ifdef _OPENMP
>                             #pragma omp flush(__pyx_parallel_why)
>                             #endif /* _OPENMP */
59a124,159
>                     #ifdef _OPENMP
>                     Py_END_ALLOW_THREADS
>                     #else
> {
> #ifdef WITH_THREAD
>                     PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
>                     #endif
>                     #endif /* _OPENMP */
>                     /* Clean up any temporaries */
>                     __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
>                     #ifdef WITH_THREAD
>                     __Pyx_PyGILState_Release(__pyx_gilstate_save);
>                     #endif
>                     #ifndef _OPENMP
> }
> #endif /* _OPENMP */
>                 }
>             }
>             if (__pyx_parallel_exc_type) {
>               /* This may have been overridden by a continue, break or return in another thread. Prefer the error. */
>               __pyx_parallel_why = 4;
>             }
>             if (__pyx_parallel_why) {
>               __pyx_v_i = __pyx_parallel_temp0;
>               switch (__pyx_parallel_why) {
>                     case 4:
>                 {
>                     #ifdef WITH_THREAD
>                     PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
>                     #endif
>                     __Pyx_GIVEREF(__pyx_parallel_exc_type);
>                     __Pyx_ErrRestoreWithState(__pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb);
>                     __pyx_filename = __pyx_parallel_filename; __pyx_lineno = __pyx_parallel_lineno; __pyx_clineno = __pyx_parallel_clineno;
>                     #ifdef WITH_THREAD
>                     __Pyx_PyGILState_Release(__pyx_gilstate_save);
>                     #endif
60a161,162
>                 goto __pyx_L4_error;
>               }
71c173
<       /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
---
>       /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
75c177
<  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
>  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
97c199
<   /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":24
---
>   /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":25
108a211
>   __PYX_XDEC_MEMVIEW(&__pyx_t_4, 1);
115a219
>   __Pyx_TraceReturn(__pyx_r, 0);
Fast code
static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
  int __pyx_v_i;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  Py_ssize_t __pyx_t_1;
  Py_ssize_t __pyx_t_2;
  Py_ssize_t __pyx_t_3;
  Py_ssize_t __pyx_t_4;
  __Pyx_RefNannySetupContext("_predict_from_numeric_data", 0);

  /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
 *         int i
 * 
 *     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):             # <<<<<<<<<<<<<<
 *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
 * 
 */
  {
      #ifdef WITH_THREAD
      PyThreadState *_save;
      Py_UNBLOCK_THREADS
      __Pyx_FastGIL_Remember();
      #endif
      /*try:*/ {
        if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 32, __pyx_L4_error) }
        __pyx_t_1 = (__pyx_v_numeric_data.shape[0]);
        if ((1 == 0)) abort();
        {
            #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
                #undef likely
                #undef unlikely
                #define likely(x)   (x)
                #define unlikely(x) (x)
            #endif
            __pyx_t_3 = (__pyx_t_1 - 0 + 1 - 1/abs(1)) / 1;
            if (__pyx_t_3 > 0)
            {
                #ifdef _OPENMP
                #pragma omp parallel private(__pyx_t_4)
                #endif /* _OPENMP */
                {
                    #ifdef _OPENMP
                    #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) schedule(static)
                    #endif /* _OPENMP */
                    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
                        {
                            __pyx_v_i = (int)(0 + 1 * __pyx_t_2);

                            /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
 * 
 *     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
 *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)             # <<<<<<<<<<<<<<
 * 
 * 
 */
                            __pyx_t_4 = __pyx_v_i;
                            *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_4 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_v_numeric_data, __pyx_v_i);
                        }
                    }
                }
            }
        }
        #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
            #undef likely
            #undef unlikely
            #define likely(x)   __builtin_expect(!!(x), 1)
            #define unlikely(x) __builtin_expect(!!(x), 0)
        #endif
      }

      /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":32
 *         int i
 * 
 *     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):             # <<<<<<<<<<<<<<
 *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
 * 
 */
      /*finally:*/ {
        /*normal exit:*/{
          #ifdef WITH_THREAD
          __Pyx_FastGIL_Forget();
          Py_BLOCK_THREADS
          #endif
          goto __pyx_L5;
        }
        __pyx_L4_error: {
          #ifdef WITH_THREAD
          __Pyx_FastGIL_Forget();
          Py_BLOCK_THREADS
          #endif
          goto __pyx_L1_error;
        }
        __pyx_L5:;
      }
  }

  /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":24
 * 
 * 
 * def _predict_from_numeric_data(             # <<<<<<<<<<<<<<
 *         node_struct [:] nodes,
 *         const X_DTYPE_C [:, :] numeric_data,
 */

  /* function exit code */
  __pyx_r = Py_None; __Pyx_INCREF(Py_None);
  goto __pyx_L0;
  __pyx_L1_error:;
  __Pyx_AddTraceback("sklearn.ensemble._hist_gradient_boosting._predictor._predict_from_numeric_data", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XDEC_MEMVIEW(&__pyx_v_nodes, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_numeric_data, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_out, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
Slow code
static PyObject *__pyx_pf_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_from_numeric_data(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_nodes, __Pyx_memviewslice __pyx_v_numeric_data, __Pyx_memviewslice __pyx_v_out) {
  int __pyx_v_i;
  PyObject *__pyx_r = NULL;
  __Pyx_TraceDeclarations
  __Pyx_RefNannyDeclarations
  Py_ssize_t __pyx_t_1;
  Py_ssize_t __pyx_t_2;
  Py_ssize_t __pyx_t_3;
  __Pyx_memviewslice __pyx_t_4 = { 0, 0, { 0 }, { 0 }, { 0 } };
  Py_ssize_t __pyx_t_5;
  __Pyx_TraceFrameInit(__pyx_codeobj_)
  __Pyx_RefNannySetupContext("_predict_from_numeric_data", 0);
  __Pyx_TraceCall("_predict_from_numeric_data", __pyx_f[0], 25, 0, __PYX_ERR(0, 25, __pyx_L1_error));

  /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
 *         int i
 * 
 *     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):             # <<<<<<<<<<<<<<
 *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
 * 
 */
  {
      #ifdef WITH_THREAD
      PyThreadState *_save;
      Py_UNBLOCK_THREADS
      __Pyx_FastGIL_Remember();
      #endif
      /*try:*/ {
        if (unlikely(!__pyx_v_numeric_data.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("numeric_data"); __PYX_ERR(0, 33, __pyx_L4_error) }
        __pyx_t_1 = (__pyx_v_numeric_data.shape[0]);
        if ((1 == 0)) abort();
        {
            int __pyx_parallel_temp0 = ((int)0xbad0bad0);
            const char *__pyx_parallel_filename = NULL; int __pyx_parallel_lineno = 0, __pyx_parallel_clineno = 0;
            PyObject *__pyx_parallel_exc_type = NULL, *__pyx_parallel_exc_value = NULL, *__pyx_parallel_exc_tb = NULL;
            int __pyx_parallel_why;
            __pyx_parallel_why = 0;
            #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
                #undef likely
                #undef unlikely
                #define likely(x)   (x)
                #define unlikely(x) (x)
            #endif
            __pyx_t_3 = (__pyx_t_1 - 0 + 1 - 1/abs(1)) / 1;
            if (__pyx_t_3 > 0)
            {
                #ifdef _OPENMP
                #pragma omp parallel private(__pyx_t_5) firstprivate(__pyx_t_4) private(__pyx_filename, __pyx_lineno, __pyx_clineno) shared(__pyx_parallel_why, __pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb)
                #endif /* _OPENMP */
                {
                    #ifdef _OPENMP
                    #ifdef WITH_THREAD
                    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
                    #endif
                    Py_BEGIN_ALLOW_THREADS
                    #endif /* _OPENMP */
                    #ifdef _OPENMP
                    #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) schedule(static)
                    #endif /* _OPENMP */
                    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
                        if (__pyx_parallel_why < 2)
                        {
                            __pyx_v_i = (int)(0 + 1 * __pyx_t_2);

                            /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":34
 * 
 *     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
 *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)             # <<<<<<<<<<<<<<
 * 
 * 
 */
                            __pyx_t_4.data = __pyx_v_numeric_data.data;
                            __pyx_t_4.memview = __pyx_v_numeric_data.memview;
                            __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
                            {
    Py_ssize_t __pyx_tmp_idx = __pyx_v_i;
    Py_ssize_t __pyx_tmp_stride = __pyx_v_numeric_data.strides[0];
        if ((0)) __PYX_ERR(0, 34, __pyx_L8_error)
        __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
}

__pyx_t_4.shape[0] = __pyx_v_numeric_data.shape[1];
__pyx_t_4.strides[0] = __pyx_v_numeric_data.strides[1];
    __pyx_t_4.suboffsets[0] = -1;

__pyx_t_5 = __pyx_v_i;
                            *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_5 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_t_4, __pyx_v_i);
                            __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
                            __pyx_t_4.memview = NULL;
                            __pyx_t_4.data = NULL;
                            goto __pyx_L11;
                            __pyx_L8_error:;
                            {
                                #ifdef WITH_THREAD
                                PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
                                #endif
                                #ifdef _OPENMP
                                #pragma omp flush(__pyx_parallel_exc_type)
                                #endif /* _OPENMP */
                                if (!__pyx_parallel_exc_type) {
                                  __Pyx_ErrFetchWithState(&__pyx_parallel_exc_type, &__pyx_parallel_exc_value, &__pyx_parallel_exc_tb);
                                  __pyx_parallel_filename = __pyx_filename; __pyx_parallel_lineno = __pyx_lineno; __pyx_parallel_clineno = __pyx_clineno;
                                  __Pyx_GOTREF(__pyx_parallel_exc_type);
                                }
                                #ifdef WITH_THREAD
                                __Pyx_PyGILState_Release(__pyx_gilstate_save);
                                #endif
                            }
                            __pyx_parallel_why = 4;
                            goto __pyx_L10;
                            __pyx_L10:;
                            #ifdef _OPENMP
                            #pragma omp critical(__pyx_parallel_lastprivates0)
                            #endif /* _OPENMP */
                            {
                                __pyx_parallel_temp0 = __pyx_v_i;
                            }
                            __pyx_L11:;
                            #ifdef _OPENMP
                            #pragma omp flush(__pyx_parallel_why)
                            #endif /* _OPENMP */
                        }
                    }
                    #ifdef _OPENMP
                    Py_END_ALLOW_THREADS
                    #else
{
#ifdef WITH_THREAD
                    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
                    #endif
                    #endif /* _OPENMP */
                    /* Clean up any temporaries */
                    __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
                    #ifdef WITH_THREAD
                    __Pyx_PyGILState_Release(__pyx_gilstate_save);
                    #endif
                    #ifndef _OPENMP
}
#endif /* _OPENMP */
                }
            }
            if (__pyx_parallel_exc_type) {
              /* This may have been overridden by a continue, break or return in another thread. Prefer the error. */
              __pyx_parallel_why = 4;
            }
            if (__pyx_parallel_why) {
              __pyx_v_i = __pyx_parallel_temp0;
              switch (__pyx_parallel_why) {
                    case 4:
                {
                    #ifdef WITH_THREAD
                    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
                    #endif
                    __Pyx_GIVEREF(__pyx_parallel_exc_type);
                    __Pyx_ErrRestoreWithState(__pyx_parallel_exc_type, __pyx_parallel_exc_value, __pyx_parallel_exc_tb);
                    __pyx_filename = __pyx_parallel_filename; __pyx_lineno = __pyx_parallel_lineno; __pyx_clineno = __pyx_parallel_clineno;
                    #ifdef WITH_THREAD
                    __Pyx_PyGILState_Release(__pyx_gilstate_save);
                    #endif
                }
                goto __pyx_L4_error;
              }
            }
        }
        #if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
            #undef likely
            #undef unlikely
            #define likely(x)   __builtin_expect(!!(x), 1)
            #define unlikely(x) __builtin_expect(!!(x), 0)
        #endif
      }

      /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":33
 *         int i
 * 
 *     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):             # <<<<<<<<<<<<<<
 *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
 * 
 */
      /*finally:*/ {
        /*normal exit:*/{
          #ifdef WITH_THREAD
          __Pyx_FastGIL_Forget();
          Py_BLOCK_THREADS
          #endif
          goto __pyx_L5;
        }
        __pyx_L4_error: {
          #ifdef WITH_THREAD
          __Pyx_FastGIL_Forget();
          Py_BLOCK_THREADS
          #endif
          goto __pyx_L1_error;
        }
        __pyx_L5:;
      }
  }

  /* "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx":25
 * 
 * 
 * def _predict_from_numeric_data(             # <<<<<<<<<<<<<<
 *         node_struct [:] nodes,
 *         const X_DTYPE_C [:, :] numeric_data,
 */

  /* function exit code */
  __pyx_r = Py_None; __Pyx_INCREF(Py_None);
  goto __pyx_L0;
  __pyx_L1_error:;
  __PYX_XDEC_MEMVIEW(&__pyx_t_4, 1);
  __Pyx_AddTraceback("sklearn.ensemble._hist_gradient_boosting._predictor._predict_from_numeric_data", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XDEC_MEMVIEW(&__pyx_v_nodes, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_numeric_data, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_out, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_TraceReturn(__pyx_r, 0);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

Just for completeness the Cython diff is just:

@@ -30,12 +31,12 @@ def _predict_from_numeric_data(
         int i
 
     for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
-        out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
+        out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
 

On top of the big chunk of additional code, I'm wondering if the slowdown could come from the addition of the private / lastprivate OpenMP variables, in particular __pyx_filename, __pyx_lineno, etc.

@scoder
Copy link
Contributor

scoder commented May 15, 2020

enabling cython profiling, though I'm not sure this has any effect

Well, yes. It can make your profile unusable. :) I can't say if it was really compiled in on your side (requires a C macro setting), but CPython profiling and C level profiling do very different things and the CPython profiling (which adds dedicated event callbacks to your code) just creates a lot of overhead that you don't want to see in your C profile.

I'm wondering if the slowdown could come from the addition of the private / lastprivate openmp variables

Could be. I pushed a fix that discards the unused error handling code in 794d21d. That should at least make the diff a bit clearer. Could you maybe retry with the latest master? And with Cython profiling disabled? :)

@NicolasHug
Copy link
Contributor Author

I pushed a fix that discards the unused error handling code in 794d21d.

Thanks! This did make the benchmark with 1d views faster, so this is definitely an improvement.

However, it's still not quite as fast as the benchmark with 2d views, and I still observe about 4 billion additional instructions. I'm now fairly sure that these 4 billion come from these two calls:

__PYX_INC_MEMVIEW(&__pyx_t_4, 0);
...
__PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);

where __pyx_t_4 is a __Pyx_memviewslice.

Indeed, when I comment these out, both benchmarks run just as fast, and I don't observe these 4 billion instructions :)

Here's the diff just for reference:

8c8,9
<   Py_ssize_t __pyx_t_4;
---
>   __Pyx_memviewslice __pyx_t_4 = { 0, 0, { 0 }, { 0 }, { 0 } };
>   Py_ssize_t __pyx_t_5;
18c19
<  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
>  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
41c42
<                 #pragma omp parallel private(__pyx_t_4)
---
>                 #pragma omp parallel private(__pyx_t_5) firstprivate(__pyx_t_4)
54c55
<  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)             # <<<<<<<<<<<<<<
---
>  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)             # <<<<<<<<<<<<<<
58,59c59,75
<                             __pyx_t_4 = __pyx_v_i;
<                             *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_4 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_v_numeric_data, __pyx_v_i);
---
>                             __pyx_t_4.data = __pyx_v_numeric_data.data;
>                             __pyx_t_4.memview = __pyx_v_numeric_data.memview;
>                             __PYX_INC_MEMVIEW(&__pyx_t_4, 0);
>                             {
>     Py_ssize_t __pyx_tmp_idx = __pyx_v_i;
>     Py_ssize_t __pyx_tmp_stride = __pyx_v_numeric_data.strides[0];
>         __pyx_t_4.data += __pyx_tmp_idx * __pyx_tmp_stride;
> }
> 
> __pyx_t_4.shape[0] = __pyx_v_numeric_data.shape[1];
> __pyx_t_4.strides[0] = __pyx_v_numeric_data.strides[1];
>     __pyx_t_4.suboffsets[0] = -1;
> 
> __pyx_t_5 = __pyx_v_i;
>                             *((__pyx_t_7sklearn_8ensemble_23_hist_gradient_boosting_6common_Y_DTYPE_C *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_5 * __pyx_v_out.strides[0]) )) = __pyx_f_7sklearn_8ensemble_23_hist_gradient_boosting_10_predictor__predict_one_from_numeric_data(__pyx_v_nodes, __pyx_t_4, __pyx_v_i);
>                             __PYX_XDEC_MEMVIEW(&__pyx_t_4, 0);
>                             __pyx_t_4.memview = NULL; __pyx_t_4.data = NULL;
77c93
<  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
---
>  *         out[i] = _predict_one_from_numeric_data(nodes, numeric_data[i], i)
110a127
>   __PYX_XDEC_MEMVIEW(&__pyx_t_4, 1);

As a side note, I do observe these 2 instructions being generated when doing just

    with nogil:
        for row_idx in range(view_2d.shape[0]):
            out += sum_1d_a(view_2d[row_idx, :])

so I understand @robertwb's #2987 (comment) better now.

@NicolasHug
Copy link
Contributor Author

NicolasHug commented May 15, 2020

So I guess my new question is this: would you consider introducing a new option that would bypass the __PYX_X{INC,DEC}_MEMVIEW instructions when creating view slices? That would definitely help us in sklearn where such slices are created for each sample, and when we have millions of samples, these calls add up to a significant overhead.

If yes, I would be happy to give this a try!

@NicolasHug
Copy link
Contributor Author

BTW here are the callgrind dumps of the benchmarks using Cython's master in case you're interested (without CPython profiling):

callgrind.out.2d_views.txt
callgrind.out.1d_views.txt <- 4 billion more instructions
callgrind.out.1d_views_no_INC_DEC.txt <- this one is just as fast as the 2d views one

@da-woods
Copy link
Contributor

would you consider introducing a new option that would bypass the __PYX_X{INC,DEC}_MEMVIEW instructions when creating view slices?

Without making any comment about whether it would be useful: this would essentially be a borrowed memoryview slice. It'd have to be unstorable (e.g. it could not be assigned to a class attribute, set as a global variable, or put in a closure). I think the main challenge would be making it easily accessible while keeping the limitations understood.

@scoder
Copy link
Contributor

scoder commented May 15, 2020

__PYX_X{INC,DEC}_MEMVIEW mostly use atomic ints for the reference counting, and have to resort to Python reference counting only for the first and last slice taken from a memoryview.

Funny idea: could you create and keep one additional slice before entering the loop, i.e. outside of the prange loop? Any slice would do. Just to be sure that we're not hitting the case of the first slice all the time inside of the loop for some unexpected reason.
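In code, that suggestion would look something like this (an untested sketch based on the sum_2d_a example from the top of the issue; the function name is made up):

```cython
def sum_2d_keepalive(float[:, :] view_2d):
    cdef:
        float out = 0
        int row_idx = 0
        # Taken once with the GIL held and kept alive for the whole loop,
        # so the slices created inside prange should never be the first
        # (or last) slice of the underlying memoryview.
        float[:] keepalive = view_2d[0, :]

    for row_idx in prange(view_2d.shape[0], nogil=True):
        out += sum_1d_a(view_2d[row_idx, :])

    return out
```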

Once you've tried that (to be sure that you get comparable timings), please update to the latest master and give that another try. I sprinkled some branch hints in various exception handling places to keep the C compiler from accidentally guessing wrongly about the fast path through the slicing code. Not sure if it makes any difference, but let's see it from actual numbers.

@NicolasHug
Copy link
Contributor Author

NicolasHug commented May 15, 2020

could you create and keep one additional slice before entering the loop, i.e. outside of the prange loop?

I declared a cdef const X_DTYPE_C [:] dummy = numeric_data[0] before the prange in both Cython 0.29.17 and master: it doesn't seem to have any effect.

Also I made a few additional experiments:

  • commenting out only __PYX_INC_MEMVIEW (EDIT: I meant __PYX_DEC_MEMVIEW) leads to about half of the speedup of commenting out both __PYX_X{INC,DEC}_MEMVIEW calls, which seems to confirm that the slowdown comes from these two.
  • When I use range instead of prange:
    • I still observe the same kind of behaviour, i.e. commenting out __PYX_X{INC,DEC}_MEMVIEW leads to faster code. And this is true regardless of whether range is in a with nogil context manager.
    • If I don't comment out anything, the code with range is slightly faster than the code with prange. This seems to indicate that the problem is not directly related to the use of prange, even though it does seem to make it worse.

@scoder
Copy link
Contributor

scoder commented May 16, 2020

the code with range is slightly faster than the code with prange

That's expected since you wrote that you are only using one thread. prange/OpenMP introduces additional overhead for managing the workload of that one thread that a simple loop does not have.

commenting out only __PYX_INC_MEMVIEW

Can't say if that's a valid comparison. It would mean that the refcount doesn't get increased correctly and I would actually expect that to lead to a crash due to a premature buffer release. Not sure why it still works for you.

@scoder
Copy link
Contributor

scoder commented May 16, 2020

I tried your example, extending it to calculate an actual sum in the nogil functions (so that the C compiler can't just discard the call). Here are my timings:

$ python3.8 -m timeit -s "import numpy; from test import sum_2d_a, sum_2d_b; a = numpy.ones((5000, 5000), dtype=numpy.float32); print(sum_2d_a(a), sum_2d_b(a))" "sum_2d_a(a)"
25000000.0 25000000.0
20 loops, best of 5: 12.6 msec per loop

$ python3.8 -m timeit -s "import numpy; from test import sum_2d_a, sum_2d_b; a = numpy.ones((5000, 5000), dtype=numpy.float32); print(sum_2d_a(a), sum_2d_b(a))" "sum_2d_b(a)"
25000000.0 25000000.0
20 loops, best of 5: 12.4 msec per loop

I would say, this is totally in the same ballpark. I also tried it with a larger array of 25000x5000 and still got very close timings for both, within a couple of percent of each other with 4 threads.

Here's the code I used (please tell me if I did anything wrong):
#distutils: extra_compile_args=-fopenmp
#distutils: extra_link_args=-fopenmp

#cython: boundscheck=False
#cython: language_level=3


from cython.parallel import prange


def sum_2d_a(float[:, :] view_2d):
    cdef:
        float out = 0
        int row_idx = 0

    for row_idx in prange(view_2d.shape[0], nogil=True):
        out += sum_1d_a(view_2d[row_idx, :])

    return out

cdef float sum_1d_a(float[:] view_1d) nogil:
    cdef Py_ssize_t i
    cdef float s = 0
    for i in range(view_1d.shape[0]):
        s += view_1d[i]
    return s


def sum_2d_b(float[:, :] view_2d):
    cdef:
        float out = 0
        int row_idx = 0

    for row_idx in prange(view_2d.shape[0], nogil=True):
        out += sum_1d_b(view_2d, row_idx)

    return out

cdef float sum_1d_b(float[:, :] view_2d, int row_idx) nogil:
    cdef Py_ssize_t i
    cdef float s = 0
    for i in range(view_2d.shape[1]):
        s += view_2d[row_idx, i]
    return s

@scoder
Copy link
Contributor

scoder commented May 16, 2020

I also let callgrind count the instructions and they are not even 1% apart for both versions, with a slight advantage for the 2D non-slicing version. I tested all of this with the latest master branch.

@scoder
Copy link
Contributor

scoder commented May 16, 2020

I also compared my master version (fb17dc8) to ebbc5f9 now, before I added the branch prediction hints. They make the slicing version (a) about 9% faster for me, using gcc 7.5.0.

@scoder commented May 16, 2020

I also pushed the changes from this ticket over to the 0.29.x branch so that they get included in 0.29.18. Seems worth it.

@NicolasHug

I tried your example

Thanks!

I can reproduce your results, i.e. I observe no difference when the array is of dimension 5000 x 5000. However, I do observe some significant difference when ncols << nrows: with numpy.ones((5_000_000, 20), dtype=numpy.float32), sum_2d_a runs in about 120 msec while sum_2d_b takes about 70 msec.

I tried with OMP_NUM_THREADS set to 1 and to 4, and with range instead of prange (still without the GIL); all of these lead to the same observation that sum_2d_b is faster.

I do not observe any slowdown when ncols >= nrows, though.

@scoder commented May 16, 2020

I do observe some significant difference when ncols << nrows

Doctor, doctor! It hurts when I do this. – Then don't do it! :)

@scoder commented May 16, 2020

Seriously, that's pretty much the worst case for slicing. It creates a huge number of very small slices. You also wouldn't create a ton of threads to let each of them sum a handful of integers, because there is overhead involved in setting them up and managing them. Similarly, there is overhead involved in safely passing around a view on a memory-managed buffer. I'm well aware that Cython lets you shoot yourself in the foot in many cases if you really, really ask for it, but I'd be happier if the answer to the problem at hand wasn't "let's add yet another directive".
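[Editor's note: one way to keep the convenience of slices while taking far fewer of them is to chunk the rows, so each prange iteration pays the slice overhead once for many rows. This is only an illustrative sketch; sum_2d_chunked, sum_chunk and the chunk size are made-up names, not from this thread.]

```cython
#cython: boundscheck=False
#cython: language_level=3
from cython.parallel import prange

def sum_2d_chunked(float[:, :] view_2d, int chunk=256):
    cdef:
        float out = 0
        int start, stop, n = view_2d.shape[0]

    for start in prange(0, n, chunk, nogil=True):
        stop = start + chunk
        if stop > n:
            stop = n
        # One slice (and one atomic INC/DEC pair) per chunk of rows,
        # instead of one per row.
        out += sum_chunk(view_2d[start:stop, :])

    return out

cdef float sum_chunk(float[:, :] chunk_2d) nogil:
    cdef Py_ssize_t i, j
    cdef float s = 0
    for i in range(chunk_2d.shape[0]):
        for j in range(chunk_2d.shape[1]):
            s += chunk_2d[i, j]
    return s
```

With nrows >> ncols, this amortizes the per-slice refcounting cost over `chunk` rows, which should bring the slicing version much closer to the non-slicing one.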

@NicolasHug

You also wouldn't create a ton of threads to let each of them sum a hand full of integers

Sure, just to clarify, I only came up with this contrived "sum" example because I was trying to find a minimal reproducing example. Our actual use in scikit-learn is more grounded in reality, and I can go into the details if you want, but basically, calling a Cythonized function on each row of a big matrix where nrows >> ncols is something we want to do a lot ;). And now that we support OpenMP, we might want to do that even more.

Would you recommend using pointers for this instead? We're trying to avoid pointers as much as possible since, ideally, we don't want to have to worry about strides and C/Fortran memory layout.

I took a deeper dive into the implementation of __PYX_INC/DEC_MEMVIEW. Most of what happens there is reading or updating the acquisition_count field of the original 2D memoryview, which is an atomic int. When I comment out the __PYX_INC/DEC_MEMVIEW calls, sum_2d_a and sum_2d_b become equivalent in terms of timing. That's a bit confusing to me, because it means these refcounting operations are much more expensive than creating the slice and setting its fields (data, memview, shape, stride...). Do you have any explanation for why that might be the case? Are atomic ints that expensive?
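[Editor's note: for the pointer question above, a pointer-based variant might look like the sketch below. It sidesteps the per-row slice refcounting entirely, but only because the `::1` declaration forces a C-contiguous last axis, which is exactly the stride bookkeeping that memoryviews normally hide. Illustrative only; sum_2d_c and sum_1d_c are made-up names.]

```cython
#cython: boundscheck=False
#cython: language_level=3
from cython.parallel import prange

# float[:, ::1] requires a C-contiguous array, so each row is a
# plain contiguous float* of length shape[1].
def sum_2d_c(float[:, ::1] view_2d):
    cdef:
        float out = 0
        int row_idx

    for row_idx in prange(view_2d.shape[0], nogil=True):
        # Taking the address of an element does not create a slice,
        # so no __PYX_INC/DEC_MEMVIEW calls are emitted here.
        out += sum_1d_c(&view_2d[row_idx, 0], view_2d.shape[1])

    return out

cdef float sum_1d_c(float* row, Py_ssize_t n) nogil:
    cdef Py_ssize_t i
    cdef float s = 0
    for i in range(n):
        s += row[i]
    return s
```

The trade-off is that sum_2d_c rejects non-contiguous input (e.g. a transposed or strided array), which the `float[:, :]` versions accept.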

@scoder commented May 17, 2020

I can well imagine that that's a use case for something like scikit-learn, thanks for the insights.

Atomic ints are not costly per se: they are often implemented in hardware and thus fairly cheap in comparison to many other forms of locking mechanisms. But they do represent a form of locking, which is still not entirely free, and (as with all such mechanisms) it gets worse under contention. With very small work packets like in your case, they're probably getting close to saturation.

I'm wondering, #2227 has been pending for a while now. Would it be a good idea to tweak the semantics of for-loop iteration by axis to return borrowed slices, with their lifetime bound to the loop iteration? I think that would also help your use case.

@scoder commented May 19, 2020

I'm closing this ticket since several things have been improved. I think the future work on this is out of its scope now.

scoder closed this as completed on May 19, 2020
@NicolasHug

Yes, I'll keep an eye on #2227 and the PR.

Thanks a lot for your time and help @scoder
