Skip to content

Conversation

jf205
Copy link
Contributor

@jf205 jf205 commented Aug 18, 2021

This PR closes https://github.com/github/docs-content/issues/3585 by adding an article written by @aschackmull to the public CodeQL docs. I've made a few small changes to the text, but it's still pretty technical. I don't think that's a problem given the audience we are aiming this article at.

cc @RasmusWL

Copy link
Contributor

@intrigus-lgtm intrigus-lgtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this documentation is very helpful and I now finally know why I should not use any() for debugging 🎉

Checking sources and sinks
--------------------------

As a first step, make sure that the source and sink definitions contain what you expect. If one of these is empty then there can never be any data flow. The easiest way to verify this is using quick evaluation in CodeQL for VS Code: Select the text ``node instanceof MySource``, right-click, and choose "CodeQL: Quick Evaluation". This will evaluate the highlighted text, which in this case means the set of sources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference between selecting the text node instanceof MySource and selecting the text isSource?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really.

If you do a quick evaluation the isSource member-predicate, your output will contains 2 columns: this (which is the string that identifies the configuration), and node which has the relevant nodes that are sources. So you will end up with a bit more noise in the output 🤷

@RasmusWL
Copy link
Member

RasmusWL commented Aug 18, 2021

A few points in regards to this that we should probably cover in the tutorial as well:

Since this article was written initially, hasPartialFlowRev has been added for debugging flow starting at a sink.

In regards to that, @aschackmull came with the following:

PSA re. using the partial flow features of the shared data flow library: It looks like the (somewhat recent) addition of being able to do reverse partial flow has the side effect that both forward partial flow and reverse partial flow is calculated whenever you request one of the two (they share the type PartialPathNode). So your best bet in terms of getting good performance is to set whichever one of isSource and isSink that's currently irrelevant to none().

@jf205 jf205 requested review from hvitved and aschackmull and removed request for aschackmull August 19, 2021 08:00
@@ -25,3 +26,4 @@ CodeQL queries are used in code scanning analyses to find problems in source cod
- :doc:`About data flow analysis <about-data-flow-analysis>`: Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.
- :doc:`Creating path queries <creating-path-queries>`: You can create path queries to visualize the flow of information through a codebase.
- :doc:`Troubleshooting query performance <troubleshooting-query-performance>`: Improve the performance of your CodeQL queries by following a few simple guidelines.
- :doc:`Debugging data-flow queries using partial flow <debugging-data-flow-queries-using-partial-flow>`: If a data-flow query unexpectedly produces no results, you can use partial flow to debug the problem.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps If a data-flow does not produce expected results instead of If a data-flow query unexpectedly produces no results, as it may also be relevant to debug if a query has results. (Same change in the other file).

where config.hasPath(source, sink)
select sink, "Sink is reached from $@.", source.getNode(), "here"

If a data-flow query that you have written does not produce any results when you expect it to, there may be a problem with your query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not produce any results when you expect it to -> does not produce expected results


A naive next step could be to try changing the sink definition to ``any()``. This would mean that we would get a lot of flow to all the places that are reachable from the sources. While this approach makes sense and can work, it can be hard to get an overview of the results as they can be quite plentiful, and it does come with a couple of additional caveats: Performance might degrade to uselessness and we might not even see all the partial flow paths. The latter point is somewhat subtle and deserves elaboration. Since the data-flow library tries very hard to prune impossible paths and since field stores and reads must be evenly matched along a path, then we will never see paths going through a store that fail to reach a corresponding read. This can make it hard to see where flow actually stops.

Because of this, a ``Configuration`` comes with a mechanism for exploring partial flow that tries to deal with these caveats. This is the ``Configuration.hasPartialFlow`` predicate:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add (there is also a `hasPartialFlowRev` predicate for exploring flow backwards from sinks).

@@ -56,9 +56,9 @@ If there are still no results and performance did not degrade to complete useles
Partial flow
------------

A naive next step could be to try changing the sink definition to ``any()``. This would mean that we would get a lot of flow to all the places that are reachable from the sources. While this approach makes sense and can work, it can be hard to get an overview of the results as they can be quite plentiful, and it does come with a couple of additional caveats: Performance might degrade to uselessness and we might not even see all the partial flow paths. The latter point is somewhat subtle and deserves elaboration. Since the data-flow library tries very hard to prune impossible paths and since field stores and reads must be evenly matched along a path, then we will never see paths going through a store that fail to reach a corresponding read. This can make it hard to see where flow actually stops.
A naive next step could be to try changing the sink definition to ``any()``. This would mean that we would get a lot of flow to all the places that are reachable from the sources. While this approach makes sense and can work in some cases, you might find that it produces so many results that it's very hard to explore the findings, which can also dramtatically affect query performance. More importnatly, you might not even see all the partial flow paths. This is because the data-flow library tries very hard to prune impossible paths and, since field stores and reads must be evenly matched along a path, we will never see paths going through a store that fail to reach a corresponding read. This can make it hard to see where flow actually stops.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "importnatly"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also typo: "dramtatically"

hvitved
hvitved previously approved these changes Aug 24, 2021
@jf205
Copy link
Contributor Author

jf205 commented Aug 24, 2021

Thanks for the reviews so far! I'm going to work on making the text a little easier to follow before asking for a review from the docs team.

aschackmull
aschackmull previously approved these changes Aug 31, 2021
@jf205 jf205 dismissed stale reviews from aschackmull and hvitved via 81a9ce2 September 2, 2021 15:40
@jf205
Copy link
Contributor Author

jf205 commented Sep 2, 2021

I've had a go at further polishing the text following the technical reviews. I think this is now ready for a review by the docs team. Note that this article is aimed at fairly advanced query writers, so i haven't simplified the text too much.

@jf205 jf205 added the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label Sep 2, 2021
@jf205 jf205 closed this Sep 2, 2021
@jf205 jf205 reopened this Sep 2, 2021
Checking sources and sinks
--------------------------

Initially, you should make sure that the ``source`` and ``sink`` definitions contain what you expect. If either the ``source`` or ``sink`` is empty then there can never be any data flow. The easiest way to check this is using quick evaluation in CodeQL for VS Code. Select the text ``node instanceof MySource``, right-click, and choose "CodeQL: Quick Evaluation". This will evaluate the highlighted text, which in this case means the set of sources. For more information, see :ref:`Analyzing your projects <running-a-specific-part-of-a-query-or-library>` in the CodeQL for VS Code help.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The words source and sink are not referring directly to code, so I don't think they should be type-set specially.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed throughout, and I think it should be reverted throughout.

)
}

If you are focusing on a single ``source`` then the ``src`` column is meaningless. You may of course also add other columns of interest based on ``n``, but including the enclosing callable and the distance to the source at the very least is generally recommended, as they can be useful columns to sort on to better inspect the results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was rewritten from "superfluous" to "meaningless", which I don't think makes sense. It should be reverted to "superfluous".

Copy link
Contributor

@mattpollard mattpollard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋🏻 @jf205, thanks for your patience here! Checked out the text and it looks good to me 🚢

@jf205 jf205 merged commit c86311e into main Sep 13, 2021
@jf205 jf205 deleted the dataflow-tutorial branch September 13, 2021 09:25
@shati-patel shati-patel removed the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label Sep 13, 2021
RasmusWL added a commit that referenced this pull request Feb 25, 2022
The things that I mentioned in #6502 (comment) that never got into the document 😳
hmac pushed a commit that referenced this pull request Mar 20, 2022
The things that I mentioned in #6502 (comment) that never got into the document 😳
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants