AirAllocateStackByGraphColoring should use the optimized interference graphs from AirAllocateRegistersByGraphColoring

https://bugs.webkit.org/show_bug.cgi?id=226258

Reviewed by Phil Pizlo.

Source/JavaScriptCore:

The main change in this patch is that AirAllocateStackByGraphColoring now uses the optimized data structures in wtf/InterferenceGraph.h.
This required templating most of the phase over the interference graph used (Small/Large/Huge), but I tried to keep some common parts out of the templated class to minimize the impact on compile times and binary size.

A consequence of that change is that coalescableMoves and remappedStackSlots now store indices instead of direct pointers to StackSlots, which also reduces their memory use by about 3x.
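As an illustration of the shape this takes, here is a minimal sketch (with hypothetical names, not the actual WebKit code): the graph-dependent logic is templated over the graph type, graph-independent helpers are not, and slots are referred to by index.

#include <wtf/Vector.h>

template<typename InterferenceGraph, typename IndexType>
class GraphColoringStackAllocator {
public:
    void run(); // builds m_interference, coalesces moves, assigns offsets
private:
    InterferenceGraph m_interference; // Small/Large/HugeInterferenceGraph
    // Indices into the slot array instead of StackSlot* pointers.
    Vector<IndexType> m_remappedStackSlots;
};

// Kept outside the template so it is compiled once, not three times.
static void reportStackSlotResults();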

Another consequence is that I had to slightly alter the way coalescing works: instead of carefully removing the interference edges of the killed slot, we simply use mayClear(), which is not guaranteed to remove anything.
I believe this is sound because every subsequent access to m_interference first checks whether a slot has been coalesced, so dropping these edges is purely a memory saving with no logical effect.
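Concretely, the pattern is roughly the following (a sketch using the names above; resolve() is a hypothetical helper, and mayClear() is assumed to take the index whose edges may be dropped):

// Redirect the killed slot `v` to the surviving slot `u`.
void coalesce(IndexType u, IndexType v)
{
    m_remappedStackSlots[v] = u;
    // Only a hint: the graph is free to drop v's edges to save memory,
    // but is not required to remove anything.
    m_interference.mayClear(v);
}

// All accesses to m_interference go through resolve() first, so stale
// edges left on a coalesced slot are never observed.
IndexType resolve(IndexType slot) const
{
    while (m_remappedStackSlots[slot] != slot)
        slot = m_remappedStackSlots[slot];
    return slot;
}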

The new code was tested in a few ways:
- running JetStream2 under ASan
- running JetStream2 with TEST_OPTIMIZED_INTERFERENCE_GRAPH enabled
- running JetStream2 while logging the frame sizes at the end of this phase, and comparing them with the same measurement on ToT (same average frame size)

The two functions where this code had the largest memory footprint in JetStream2 were both in tsf-wasm.
One has 751 stack slots, and had an interference graph of 2.1MB and a coalescableMoves vector of 440kB.
The other has 673 stack slots, and had an interference graph of 1.9MB and a coalescableMoves vector of 421kB.
With this patch, they use 79kB+146kB and 67kB+140kB respectively.
The effect on the rest of JetStream2 matters less, since few functions used more than a few dozen kB in this phase, but the improvements in percentage terms are just as large.

More importantly (and the reason I wrote this patch in the first place), I checked mruby-wasm.aotoki.dev, which, along with a few other pages, forced us to lower Options::maximumTmpsForGraphColoring because of jetsams.
It has two massive functions that reach this phase if I increase Options::maximumTmpsForGraphColoring:
- about 6k stack slots -> 215MB + 6MB (interference graph + coalescableMoves)
- about 9k stack slots -> 395MB + 4MB
After this patch, they use 4.5MB+2MB and 9MB+1.5MB respectively: the interference graphs alone shrink by 215/4.5 ~ 48x and 395/9 ~ 44x, or roughly a 40x improvement.
Combined with the recent improvements to the register allocator, I hope to increase Options::maximumTmpsForGraphColoring soon (in a separate patch, for easier bisection if either causes a perf regression).
This would be helpful, since lowering it cratered our performance on some other wasm application by 8x.

In terms of compile times, this patch lowered the time spent in AllocateStackByGraphColoring over the course of a JetStream2 run from roughly 350ms to roughly 270ms.
That difference is almost certainly negligible, but it at least guarantees that compile times did not regress.

* b3/air/AirAllocateRegistersByGraphColoring.cpp:
* b3/air/AirAllocateStackByGraphColoring.cpp:
(JSC::B3::Air::allocateStackByGraphColoring):

Source/WTF:

I moved the interference graph data structures from AirAllocateRegistersByGraphColoring into their own wtf/InterferenceGraph.h file.
There are three of them:
- SmallInterferenceGraph, best for n < 400
- LargeInterferenceGraph, for n < 2**16
- HugeInterferenceGraph, for n up to 2**32
I also added "Iterable" versions of them, that have an operator[] method whose result you can iterate on to get all the indices which interfere with a given index.
SmallIterableInterferenceGraph is the same as the non-iterable version, but the Large and Huge versions are a bit slower than their counterparts and use 2x memory.
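A usage sketch (method names follow the function list below; treating LargeIterableInterferenceGraph as the n < 2**16 iterable variant is an assumption):

#include <wtf/InterferenceGraph.h>

void dumpInterferences()
{
    WTF::LargeIterableInterferenceGraph graph;
    graph.setMaxIndex(1000);           // all indices are promised to be < 1000
    graph.add(3, 5);                   // undirected: (5, 3) now interferes too
    RELEASE_ASSERT(graph.contains(5, 3));
    for (unsigned neighbor : graph[3]) // operator[] yields the neighbors of 3
        WTFLogAlways("3 interferes with %u", neighbor);
}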

All of these were tested by running JetStream2 with TEST_OPTIMIZED_INTERFERENCE_GRAPH set to 1.
This flag makes each optimized data structure run in parallel with a reference implementation, and their results are checked for equality on every method call.
There is one small difference allowed: iteration is not guaranteed to go through elements in the same order.

I also added a clear() method to LikelyDenseUnsignedIntegerSet, and added the NotNull flag to its various uses of placement new.
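For illustration only, such a clear() could look like the following (member names are hypothetical; the real code is in wtf/LikelyDenseUnsignedIntegerSet.h, which switches between a BitVector and a HashSet representation, see transitionToHashSet/transitionToBitVector below):

void clear()
{
    if (isBitVector())
        m_inline.bitVector.~BitVector();
    else
        m_inline.hashSet.~HashSet();
    // The NotNull tag promises the pointer is non-null, letting WTF's
    // placement new skip the null check plain placement new must keep.
    new (NotNull, &m_inline.bitVector) BitVector();
    m_size = 0;
}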

* WTF.xcodeproj/project.pbxproj:
* wtf/CMakeLists.txt:
* wtf/HashSet.h:
(WTF::W>::memoryUse const):
* wtf/InterferenceGraph.h: Added.
(WTF::InterferenceBitVector::contains):
(WTF::InterferenceBitVector::addAndReturnIsNewEntry):
(WTF::InterferenceBitVector::add):
(WTF::InterferenceBitVector::clear):
(WTF::InterferenceBitVector::mayClear):
(WTF::InterferenceBitVector::setMaxIndex):
(WTF::InterferenceBitVector::forEach):
(WTF::InterferenceBitVector::size const):
(WTF::InterferenceBitVector::memoryUse const):
(WTF::InterferenceBitVector::dumpMemoryUseInKB const):
(WTF::InterferenceBitVector::Iterable::iterator::operator++):
(WTF::InterferenceBitVector::Iterable::iterator::operator* const):
(WTF::InterferenceBitVector::Iterable::iterator::operator== const):
(WTF::InterferenceBitVector::Iterable::iterator::operator!= const):
(WTF::InterferenceBitVector::Iterable::begin const):
(WTF::InterferenceBitVector::Iterable::end const):
(WTF::InterferenceBitVector::operator[] const):
(WTF::InterferenceBitVector::index const):
(WTF::InterferenceVector::contains):
(WTF::InterferenceVector::addAndReturnIsNewEntry):
(WTF::InterferenceVector::add):
(WTF::InterferenceVector::clear):
(WTF::InterferenceVector::mayClear):
(WTF::InterferenceVector::setMaxIndex):
(WTF::InterferenceVector::forEach):
(WTF::InterferenceVector::size const):
(WTF::InterferenceVector::memoryUse const):
(WTF::InterferenceVector::dumpMemoryUseInKB const):
(WTF::InterferenceVector::Iterable::begin const):
(WTF::InterferenceVector::Iterable::end const):
(WTF::InterferenceVector::operator[] const):
(WTF::UndirectedEdgesDuplicatingAdapter::contains):
(WTF::UndirectedEdgesDuplicatingAdapter::addAndReturnIsNewEntry):
(WTF::UndirectedEdgesDuplicatingAdapter::add):
(WTF::UndirectedEdgesDuplicatingAdapter::clear):
(WTF::UndirectedEdgesDuplicatingAdapter::mayClear):
(WTF::UndirectedEdgesDuplicatingAdapter::setMaxIndex):
(WTF::UndirectedEdgesDuplicatingAdapter::forEach):
(WTF::UndirectedEdgesDuplicatingAdapter::size const):
(WTF::UndirectedEdgesDuplicatingAdapter::memoryUse const):
(WTF::UndirectedEdgesDuplicatingAdapter::dumpMemoryUseInKB const):
(WTF::UndirectedEdgesDuplicatingAdapter::operator[] const):
(WTF::UndirectedEdgesDedupAdapter::contains):
(WTF::UndirectedEdgesDedupAdapter::addAndReturnIsNewEntry):
(WTF::UndirectedEdgesDedupAdapter::add):
(WTF::UndirectedEdgesDedupAdapter::clear):
(WTF::UndirectedEdgesDedupAdapter::mayClear):
(WTF::UndirectedEdgesDedupAdapter::setMaxIndex):
(WTF::UndirectedEdgesDedupAdapter::forEach):
(WTF::UndirectedEdgesDedupAdapter::size const):
(WTF::UndirectedEdgesDedupAdapter::memoryUse const):
(WTF::UndirectedEdgesDedupAdapter::dumpMemoryUseInKB const):
(WTF::InterferenceHashSet::contains):
(WTF::InterferenceHashSet::addAndReturnIsNewEntry):
(WTF::InterferenceHashSet::add):
(WTF::InterferenceHashSet::clear):
(WTF::InterferenceHashSet::setMaxIndex):
(WTF::InterferenceHashSet::forEach):
(WTF::InterferenceHashSet::size const):
(WTF::InterferenceHashSet::memoryUse const):
(WTF::InterferenceHashSet::dumpMemoryUseInKB const):
(WTF::InstrumentedInterferenceGraph::contains):
(WTF::InstrumentedInterferenceGraph::addAndReturnIsNewEntry):
(WTF::InstrumentedInterferenceGraph::add):
(WTF::InstrumentedInterferenceGraph::clear):
(WTF::InstrumentedInterferenceGraph::mayClear):
(WTF::InstrumentedInterferenceGraph::setMaxIndex):
(WTF::InstrumentedInterferenceGraph::forEach):
(WTF::InstrumentedInterferenceGraph::size const):
(WTF::InstrumentedInterferenceGraph::memoryUse const):
(WTF::InstrumentedInterferenceGraph::dumpMemoryUseInKB const):
(WTF::InstrumentedIterableInterferenceGraph::Iterable::Iterable):
(WTF::InstrumentedIterableInterferenceGraph::Iterable::begin const):
(WTF::InstrumentedIterableInterferenceGraph::Iterable::end const):
(WTF::InstrumentedIterableInterferenceGraph::operator[] const):
* wtf/LikelyDenseUnsignedIntegerSet.h:
(WTF::LikelyDenseUnsignedIntegerSet::LikelyDenseUnsignedIntegerSet):
(WTF::LikelyDenseUnsignedIntegerSet::clear):
(WTF::LikelyDenseUnsignedIntegerSet::add):
(WTF::LikelyDenseUnsignedIntegerSet::estimateHashSetSize):
(WTF::LikelyDenseUnsignedIntegerSet::transitionToHashSet):
(WTF::LikelyDenseUnsignedIntegerSet::transitionToBitVector):


Canonical link: https://commits.webkit.org/238229@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@278186 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Robin Morisset committed May 28, 2021
1 parent 695e088 commit bc14815
Showing 9 changed files with 1,009 additions and 457 deletions.
41 changes: 41 additions & 0 deletions Source/JavaScriptCore/ChangeLog
240 changes: 3 additions & 237 deletions Source/JavaScriptCore/b3/air/AirAllocateRegistersByGraphColoring.cpp
@@ -39,7 +39,7 @@
#include "AirUseCounts.h"
#include <wtf/HashSet.h>
#include <wtf/HashTraits.h>
#include <wtf/LikelyDenseUnsignedIntegerSet.h>
#include <wtf/InterferenceGraph.h>
#include <wtf/SmallSet.h>
#include <wtf/Vector.h>

@@ -52,238 +52,6 @@ static constexpr bool traceDebug = false;
static constexpr bool reportStats = false;
static constexpr bool reportInterferenceGraphMemoryUse = false;

template <typename IndexType>
class InterferenceHashSet {
// Interference edges are not directed. An edge between any two Tmps is represented
// by a pair of their indices (through TmpMapper), with the smaller index first.
using InterferenceEdge = std::pair<IndexType, IndexType>;
public:
bool contains(IndexType u, IndexType v)
{
if (v < u)
std::swap(u, v);
return m_set.contains({ u, v });
}

bool addAndReturnIsNewEntry(IndexType u, IndexType v)
{
if (v < u)
std::swap(u, v);
return m_set.add({ u, v }).isNewEntry;
}

void clear()
{
m_set.clear();
}

void setMaxTmpIndex(unsigned n)
{
ASSERT_UNUSED(n, n < std::numeric_limits<IndexType>::max());
}

template <typename Functor>
void forEach(const Functor& functor)
{
for (auto edge : m_set)
functor(edge);
}

unsigned size() const { return m_set.size(); }

unsigned memoryUse() const { return m_set.capacity() * sizeof(InterferenceEdge); }

void dumpMemoryUseInKB() const { dataLog(memoryUse()/1024); }

private:
HashSet<InterferenceEdge> m_set;
};

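// Each undirected edge is stored once, in the neighbor set of its smaller
// index, so memory use scales with the number of edges actually present.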
template <typename T, typename IndexType>
class InterferenceVector {
public:
bool contains(IndexType u, IndexType v)
{
if (v < u)
std::swap(u, v);
return m_vector[u].contains(v);
}

bool addAndReturnIsNewEntry(IndexType u, IndexType v)
{
if (v < u)
std::swap(u, v);
bool isNewEntry = m_vector[u].add(v).isNewEntry;
m_size += isNewEntry;
return isNewEntry;
}

void clear()
{
m_vector.clear();
m_size = 0;
}

void setMaxTmpIndex(unsigned n)
{
ASSERT(n < std::numeric_limits<IndexType>::max());
m_vector.resize(n);
}

template <typename Functor>
void forEach(const Functor& functor)
{
for (unsigned i = 0; i < m_vector.size() ; ++i) {
for (IndexType j : m_vector[i])
functor({ i, j });
}
}

unsigned size() const { return m_size; }

unsigned memoryUse() const
{
unsigned memory = 0;
for (const T& set : m_vector)
memory += set.memoryUse();
return memory;
}

void dumpMemoryUseInKB() const { dataLog(memoryUse()/1024); }

private:
Vector<T> m_vector;
unsigned m_size {0};
};

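// Dense n*n bit matrix. addAndReturnIsNewEntry() sets both (u, v) and
// (v, u), so contains() never has to reorder its arguments.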
template <typename IndexType>
class InterferenceBitVector {
public:
bool contains(IndexType u, IndexType v)
{
return m_bitVector.quickGet(index(u, v));
}

bool addAndReturnIsNewEntry(IndexType u, IndexType v)
{
// Note: we could do only one quickSet, at the cost of one extra branch in every contains/addAndReturnIsNewEntry (to enforce u < v)
bool alreadyIn = m_bitVector.quickSet(index(u, v));
bool alreadyIn2 = m_bitVector.quickSet(index(v, u));
ASSERT_UNUSED(alreadyIn2, alreadyIn == alreadyIn2);
m_size += !alreadyIn;
return !alreadyIn;
}

void clear()
{
m_bitVector.clearAll();
m_size = 0;
}

void setMaxTmpIndex(unsigned n)
{
ASSERT(n < std::numeric_limits<IndexType>::max());
m_numTmps = n;
m_bitVector.ensureSize(n*n);
}

template <typename Functor>
void forEach(const Functor& functor)
{
for (IndexType i = 0; i < m_numTmps; ++i) {
for (IndexType j = i + 1; j < m_numTmps; ++j) {
if (m_bitVector.quickGet(index(i, j)))
functor({ i, j });
}
}
}

unsigned size() const { return m_size; }

unsigned memoryUse() const { return (m_bitVector.size() + 7) >> 3; }

void dumpMemoryUseInKB() const { dataLog(memoryUse()/1024); }

private:
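// Row-major position of the (i, j) bit in the n*n matrix.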
unsigned index(IndexType i, IndexType j)
{
return i * m_numTmps + j;
}

BitVector m_bitVector;
unsigned m_size {0};
IndexType m_numTmps;
};

#define TEST_OPTIMIZED_INTERFERENCE_GRAPH 0
#if TEST_OPTIMIZED_INTERFERENCE_GRAPH
template <typename T, typename U, typename IndexType>
class InstrumentedInterferenceGraph {
public:
bool contains(IndexType u, IndexType v)
{
bool containsInA = m_setA.contains(u, v);
bool containsInB = m_setB.contains(u, v);
RELEASE_ASSERT(containsInA == containsInB);
return containsInA;
}

bool addAndReturnIsNewEntry(IndexType u, IndexType v)
{
bool isNewEntryA = m_setA.addAndReturnIsNewEntry(u, v);
bool isNewEntryB = m_setB.addAndReturnIsNewEntry(u, v);
RELEASE_ASSERT(isNewEntryA == isNewEntryB);
return isNewEntryA;
}

void clear()
{
m_setA.clear();
m_setB.clear();
}

void setMaxTmpIndex(unsigned n)
{
m_setA.setMaxTmpIndex(n);
m_setB.setMaxTmpIndex(n);
}

template <typename Functor>
void forEach(const Functor& functor)
{
m_setA.forEach(functor);
}

unsigned size() const
{
unsigned sizeA = m_setA.size();
unsigned sizeB = m_setB.size();
RELEASE_ASSERT(sizeA == sizeB);
return sizeA;
}

void dumpMemoryUseInKB() const
{
dataLog(m_setA.memoryUse()/1024, " -> ", m_setB.memoryUse()/1024);
}

private:
T m_setA;
U m_setB;
};

using SmallInterferenceGraph = InterferenceBitVector<uint16_t>;
using LargeInterferenceGraph = InstrumentedInterferenceGraph<InterferenceHashSet<uint16_t>, InterferenceVector<LikelyDenseUnsignedIntegerSet<uint16_t>, uint16_t>, uint16_t>;
using HugeInterferenceGraph = InstrumentedInterferenceGraph<InterferenceHashSet<uint32_t>, InterferenceVector<LikelyDenseUnsignedIntegerSet<uint32_t>, uint32_t>, uint32_t>;

#else // TEST_OPTIMIZED_INTERFERENCE_GRAPH

using SmallInterferenceGraph = InterferenceBitVector<uint16_t>;
using LargeInterferenceGraph = InterferenceVector<LikelyDenseUnsignedIntegerSet<uint16_t>, uint16_t>;
using HugeInterferenceGraph = InterferenceVector<LikelyDenseUnsignedIntegerSet<uint32_t>, uint32_t>;

#endif

// The AbstractColoringAllocator defines all the code that is independent
// from the bank or register and can be shared when allocating registers.
template<typename IndexType, typename InterferenceSet, Bank bank>
@@ -1589,7 +1357,7 @@ class ColoringAllocator : public AllocatorType<IndexType, InterferenceSet, bank>
}
}

m_interferenceEdges.setMaxTmpIndex(AbsoluteTmpMapper<bank>::absoluteIndex(m_code.numTmps(bank)));
m_interferenceEdges.setMaxIndex(AbsoluteTmpMapper<bank>::absoluteIndex(m_code.numTmps(bank)));

initializePrecoloredTmp();
build();
@@ -2042,9 +1810,7 @@ class GraphColoringRegisterAllocation {
return false;
};

// The bit vector approach uses n*n bits, or 20kB for n = 400.
constexpr unsigned maxSizeForSmallInterferenceGraph = 400;
if (m_code.numTmps(bank) < maxSizeForSmallInterferenceGraph) {
if (m_code.numTmps(bank) < WTF::maxSizeForSmallInterferenceGraph) {
if (useIRC()) {
ColoringAllocator<uint16_t, bank, IRC, SmallInterferenceGraph> allocator(m_code, m_tmpWidth, m_useCounts, unspillableTmps);
done = doAllocation(allocator);
