[GSoC] LLVM vectorization improvements #6533

Merged
merged 3 commits into from
Jul 6, 2017

coodie (Contributor) commented Jun 23, 2017

This branch introduces improvements to LLVM vectorization.

The improvements consist of adding "llvm.mem.parallel_loop_access" metadata to instructions inside parallel loops. Specifically, we add the metadata to loads and stores in loops marked as order independent (vectorizeOnly, for example). There are two ways of adding the metadata:

  1. The metadata refers only to the innermost loop the instruction is in.
  2. The metadata refers to the whole stack of nested loops the instruction is in.

This implementation covers only option 1.
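
The difference between the two options can be sketched with a small, self-contained C++ model. This is only an illustration under stated assumptions, not the compiler's code: `Emitter`, `MemAccess`, `enterParallelLoop`, `tagsForNestedStore`, and the integer loop ids are hypothetical stand-ins for the code generator state and for the loop metadata nodes.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an emitted load/store plus the loop ids that its
// parallel-access metadata would refer to.
struct MemAccess {
  std::string text;
  std::vector<int> parallelLoops;
};

// Hypothetical model of a code generator that tracks enclosing parallel loops.
struct Emitter {
  bool innermostOnly;                  // true: option 1, false: option 2
  std::vector<int> parallelLoopStack;  // enclosing parallel loops, outermost first
  std::vector<MemAccess> emitted;

  explicit Emitter(bool innermost) : innermostOnly(innermost) {}

  void enterParallelLoop(int id) { parallelLoopStack.push_back(id); }
  void exitParallelLoop() { parallelLoopStack.pop_back(); }

  // Record a load or store and tag it according to the chosen option.
  void emitAccess(const std::string& text) {
    MemAccess access{text, {}};
    if (!parallelLoopStack.empty()) {
      if (innermostOnly)
        access.parallelLoops.push_back(parallelLoopStack.back());
      else
        access.parallelLoops = parallelLoopStack;
    }
    emitted.push_back(access);
  }
};

// Emit one store inside two nested parallel loops (ids 1 and 2) and return
// the loop ids attached to it.
std::vector<int> tagsForNestedStore(bool innermostOnly) {
  Emitter e(innermostOnly);
  e.enterParallelLoop(1);
  e.enterParallelLoop(2);
  e.emitAccess("store A[i][j]");
  e.exitParallelLoop();
  e.exitParallelLoop();
  return e.emitted.front().parallelLoops;
}
```

In this model, `tagsForNestedStore(true)` attaches only the inner loop's id to the store, while `tagsForNestedStore(false)` attaches both ids, so only option 2 would let the outer loop treat the access as parallel too.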

@@ -0,0 +1 @@
CHPL_LLVM!=llvm
Member

I think this file should end in a newline

Member

Or something... never sure what the red characters in GitHub here at the end mean.

for i in vectorizeOnly(0..511) {
  A[A[i]] = B[i];
  A[i] = B[i+1];
}
Member

Does this case vectorize without addLoopVectorizationHint? With or without the run-time check?

Member

Anyway I'd consider passing -vectorize-scev-check-threshold=0 -runtime-memory-check-threshold=0 LLVM options to disable the runtime check.

Member

(for testing purposes)

Contributor Author

Adding the loop vectorization hint (llvm.loop.vectorize.enable) is not necessary to get anything vectorized; it only makes sense when we want to disable vectorization of a particular loop. What matters is adding llvm.mem.parallel_loop_access, because it greatly improves vectorization: the runtime checks can be removed entirely and the loop still gets vectorized. I couldn't come up with a working case myself, so I took this one from LLVM's test suite. llvm.mem.parallel_loop_access is necessary to get it vectorized, because the loop is tricky and it is not obvious that there is no loop-carried dependency.
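
To see why the compiler cannot prove independence on its own, here is a minimal C++ sketch of the same loop, assuming a simplified model of vectorized execution in which all index loads of a chunk happen before any store of that chunk. It is not LLVM's vectorizer; `scalarLoop`, `chunkedLoop`, and `sameResult` are hypothetical names, and callers are assumed to pass `a.size() >= n` and `b.size() >= n + 1`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics: iterations run one after another, as written.
std::vector<int> scalarLoop(std::vector<int> a, const std::vector<int>& b,
                            std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    assert(a[i] >= 0 && static_cast<std::size_t>(a[i]) < a.size());
    a[a[i]] = b[i];
    a[i] = b[i + 1];
  }
  return a;
}

// Model of vectorized execution with `width` lanes: all index loads of a
// chunk happen before any store of that chunk.
std::vector<int> chunkedLoop(std::vector<int> a, const std::vector<int>& b,
                             std::size_t n, std::size_t width) {
  for (std::size_t base = 0; base < n; base += width) {
    const std::size_t end = std::min(n, base + width);
    std::vector<int> idx(a.begin() + base, a.begin() + end);  // gather A[i]
    for (std::size_t i = base; i < end; ++i) {
      assert(idx[i - base] >= 0 &&
             static_cast<std::size_t>(idx[i - base]) < a.size());
      a[idx[i - base]] = b[i];  // scatter: A[A[i]] = B[i]
    }
    for (std::size_t i = base; i < end; ++i) a[i] = b[i + 1];  // A[i] = B[i+1]
  }
  return a;
}

// True when the chunked order gives the same result as the scalar order.
bool sameResult(const std::vector<int>& a, const std::vector<int>& b,
                std::size_t n, std::size_t width) {
  return scalarLoop(a, b, n) == chunkedLoop(a, b, n, width);
}
```

With made-up inputs, `sameResult({4,5,6,7,0,0,0,0}, {10,11,12,13,14}, 4, 4)` holds because every indirect store lands outside the range the loop reads, whereas `sameResult({1,2,3,4,4,5,6,7}, {5,6,7,4,0}, 4, 4)` does not, because one iteration's store changes the index a later iteration loads. Whether the reordering is safe depends on the contents of A, which is why the promise has to come from the programmer through the metadata.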

Member

ah, cool

mppf (Member) commented Jun 23, 2017

I'd like to see this in two PRs:

  1. PR adding the testing, but as .futures if they don't vectorize right without the hints
  2. PR adding the hints and moving those .futures to regular tests

coodie (Contributor Author) commented Jun 27, 2017

@mppf
I've split this PR into testing and implementation, and synchronized both branches with master.

@@ -54,6 +54,8 @@

#include "llvmDebug.h"

#include "llvm/Support/TargetRegistry.h"
Member

I must be missing something - why is this #include now necessary? You didn't change anything else in this file.

Contributor Author

Oh well, apparently I forgot to remove this include, it's not necessary for anything to work.

mppf (Member) commented Jun 27, 2017

  • test this with --vectorize

#ifdef HAVE_LLVM
namespace
{
llvm::MDNode* addLoopVectorizationHint(llvm::Instruction* instruction)
Member

I'd expect this function to be static rather than in an anonymous namespace. Arguably a style issue, but making it static will fit in with the rest of the compiler. An anonymous namespace is only needed when defining a type, where you can't use static.

if(fNoVectorize == false && isOrderIndependent())
{
  llvm::MDNode* llvmLoopMetadata = addLoopVectorizationHint(endLoopBranch);
  for(auto it = blockStmtBody->begin(); it != blockStmtBody->end(); it++)
Member

Prefer ++it to it++ for this kind of loop

Member

Note with LLVM 5.0 (and not before) the recommendation changes to

for (Instruction &I : BB)

Member

I think this code might only add parallel loop access metadata to simple loops.
What if we have

var sum = 0;
for i in vectorizeOnly(1..100) with (+ reduce sum) {
  if i > 10 then
    sum += i;
  else
    sum += 1;
}

? Won't this loop over the blockStmtBody instructions miss the instructions "inside" the conditional?
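
The concern can be illustrated with a toy C++ model, assuming that blockStmtBody is a single basic block and that the arms of the conditional are lowered into separate blocks. Everything here is hypothetical: `Instr`, `Block`, `tagBlock`, the block names, and the opcode lists are made up for illustration and are not the IR the compiler actually generates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for an instruction and a basic block.
struct Instr {
  std::string opcode;
  bool tagged;
};
struct Block {
  std::string name;
  std::vector<Instr> instrs;
};

bool isMemAccess(const Instr& instr) {
  return instr.opcode == "load" || instr.opcode == "store";
}

// Tag loads and stores in a single block, like a walk over one block's
// instruction list.
void tagBlock(Block& block) {
  for (Instr& instr : block.instrs)
    if (isMemAccess(instr)) instr.tagged = true;
}

// Count memory accesses in the whole loop that were left untagged.
std::size_t untaggedAccesses(const std::vector<Block>& loop) {
  std::size_t count = 0;
  for (const Block& block : loop)
    for (const Instr& instr : block.instrs)
      if (isMemAccess(instr) && !instr.tagged) ++count;
  return count;
}

Block makeBlock(const std::string& name,
                const std::vector<std::string>& opcodes) {
  Block block{name, {}};
  for (const std::string& op : opcodes) block.instrs.push_back(Instr{op, false});
  return block;
}

// A loop whose body branches: the conditional's arms are separate blocks.
std::vector<Block> makeBranchyLoop() {
  std::vector<Block> loop;
  loop.push_back(makeBlock("body", {"load", "icmp", "br"}));
  loop.push_back(makeBlock("then", {"load", "add", "store", "br"}));
  loop.push_back(makeBlock("else", {"load", "add", "store", "br"}));
  loop.push_back(makeBlock("latch", {"br"}));
  return loop;
}

// Untagged accesses left after walking only the first ("body") block.
std::size_t missedByFlatWalk() {
  std::vector<Block> loop = makeBranchyLoop();
  tagBlock(loop.front());
  return untaggedAccesses(loop);
}

// Untagged accesses left after walking every block of the loop.
std::size_t missedByFullWalk() {
  std::vector<Block> loop = makeBranchyLoop();
  for (Block& block : loop) tagBlock(block);
  return untaggedAccesses(loop);
}
```

In this model, walking only the first block leaves the loads and stores in the `then` and `else` blocks untagged, while tagging each access as it is emitted (or walking every block of the loop) covers them all.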

@@ -78,6 +91,8 @@ struct GenInfo {
llvm::IRBuilder<> *builder;
LayeredValueTable *lvt;

std::stack<LoopData> loopStack;
Member

Since this stack only ever contains elements with parallel=true and since you're not extending other loop generation (while loops, other loops, ...), I think this stack should just store llvm::MDNode* for parallel loops and be called something like parallelLoopStack.
Do you envision some case in which this stack will need to contain non-parallel loops?

@@ -389,6 +389,13 @@ llvm::StoreInst* codegenStoreLLVM(llvm::Value* val,
llvm::MDNode* tbaa = NULL;
if( USE_TBAA && valType ) tbaa = valType->symbol->llvmTbaaNode;
if( tbaa ) ret->setMetadata(llvm::LLVMContext::MD_tbaa, tbaa);

if(!info->loopStack.empty()) {
const auto &loopData = info->loopStack.top();
Member

Note in a comment here and in the PR message that you're only adding parallel_loop_access hints for the innermost loops.

mppf (Member) commented Jul 6, 2017

I'd like to see a future PR that adds the parallel loop access to function calls; see https://github.com/chapel-lang/chapel/blob/master/compiler/codegen/expr.cpp#L2222

mppf (Member) commented Jul 6, 2017

I tested this, with this patch:

diff --git a/compiler/codegen/CForLoop.cpp b/compiler/codegen/CForLoop.cpp
index bcf60f8..9f9dc84 100644
--- a/compiler/codegen/CForLoop.cpp
+++ b/compiler/codegen/CForLoop.cpp
@@ -40,7 +40,7 @@ static llvm::MDNode* generateLoopMetadata(bool vectorize)
   auto tmpNode        = llvm::MDNode::getTemporary(ctx, llvm::None);
   args.push_back(tmpNode.get());
 
-  if(vectorize)
+  if(vectorize && 0 /* off for testing */)
   {
     llvm::Metadata *loopVectorizeEnable[] = { llvm::MDString::get(ctx, "llvm.loop.vectorize.enable"),
                                               llvm::ConstantAsMetadata::get(llvm::ConstantInt::get(llvm::Type::getInt1Ty(ctx), true))};

with --llvm --vectorize --fast

and I saw 1 failure:

  compflags/coodie/llvmPrintIr/llvmPrintIrFull.chpl

Please have a look at disabling the vectorize.enable by default (I think it's fine to leave in the code, but make it off by default somehow) and at getting llvmPrintIrFull to work in that configuration:

start_test -compopts --llvm -compopts --vectorize -compopts --fast compflags/coodie/llvmPrintIr/

One way to do that would be to add an llvmPrintIrFull.skipif file containing:

COMPOPTS <= --fast

Once you address these issues I think this is ready to merge. Thanks.

mppf (Member) commented Jul 6, 2017

Passed full local testing.

@mppf mppf merged commit 76d4354 into chapel-lang:master Jul 6, 2017
mppf added a commit that referenced this pull request Jul 10, 2017
Add vectorization improvement tests

[PR by @coodie - thanks! Reviewed/tested by @mppf]

This PR adds vectorization tests covering the changes related to #6533.

The tests use FileCheck. It's best to merge this PR after FileCheck is integrated into Chapel's test suite, because the current solution adds a PREDIFF script that runs FileCheck, and an empty .good file has to be added for every test.
@mppf mppf mentioned this pull request Jul 10, 2017
@coodie coodie changed the title LLVM vectorization improvements [GSoC] LLVM vectorization improvements Aug 28, 2017
@mppf mppf mentioned this pull request Sep 17, 2018