# Review #3 #4

opened this Issue Dec 7, 2018 · 8 comments


### distillpub-reviewers commented Dec 7, 2018

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Austin Huang for taking the time to review this article.

## On Outstanding Communication:

For a topic in such great need of accessible explanations, I felt that the article’s organization and writing could have been much improved.

More consideration needs to be put into how topics are sequenced in the writing.

For example, under “Marginalization and Conditioning”, the high-level intuition comes after a detailed technical exposition, so that a reader who was able to follow the technical exposition wouldn’t need the figures. On the other hand, a reader who did not already have in mind the intuition of marginalization and conditioning illustrated in the figure would have struggled to follow the last 3-4 paragraphs.

Other sections follow this pattern of poor sequencing, starting with the details and ending with the big picture. The section on the multivariate Gaussian has two paragraphs discussing positive semidefiniteness and the diagonal/off-diagonal elements of the covariance, only then actually defines the covariance matrix, and ends with a visual of what a multivariate Gaussian looks like.

The section on the posterior distribution (arguably the crux of the model) does not read well. It is difficult to follow and is in need of a rewrite.

See section-by-section comments at the end of the review for a detailed discussion of specific communication issues.

## On Scientific Correctness & Integrity:

Given that the article does not make scientific claims and primarily focuses on communicating the basic mathematical structure of a particular area of machine learning, most of these don’t directly apply.

This could be seen as one weakness of the article - that it doesn’t take enough of a stance (a good contrast is the t-SNE article, another communication of ML methodology, which does take positions on its use rather than focusing strictly on the mathematical underpinnings). This article doesn’t comment on or help the reader reason about correct or incorrect application of GPs; it only describes the mathematical machinery behind them.

On the topic of limitations, it would be nice if the article discussed pitfalls / limitations of Gaussian Process models.

# Introduction

"Even if you have spent some time reading about machine learning, chances are that you have never heard of Gaussian processes."

This assumption will often be incorrect (in some circles, Gaussian processes are well known), but more importantly, it misses an opportunity to provide a stronger motivational hook for the article beyond "it's something you haven't heard of / rehearsing the basics is good".

"Their most obvious area of application are regression problems, for example in robotics, but they can also be extended to classification and clustering tasks [2, 3] . Just a quick recap: the goal of regression is to find a function that describes a given set of data points as closely as possible. "

The phrasing here feels more convoluted than it needs to be. Why introduce the overloaded term "regression" and then follow its first use with an awkward definitional statement? The article could simply use the definition in the first sentence and drop the "just a quick recap" sentence altogether.

# Multivariate Gaussian Distributions

"In particular, we are interested in the multivariate case of this distribution, where each random variable is distributed normally and their joint distribution is also Gaussian."

This section misses an opportunity to introduce a figure that illuminates the gap between the intuition most people have of a Gaussian and the way Gaussians are used in GPs.

One of the big stumbling blocks with GPs is a visual one - when students are introduced to Gaussians, they are used to looking at the distribution with each dimension mapped to a visual axis (like the two figures in this section). In GPs, the dimensions of a multivariate Gaussian are visually represented as vertical axes (in effect, sharing a single axis).

This section shows the former, the next section shows the latter, but we’re missing a visual that links the two representations. There is an opportunity to provide an illustration that bridges the familiar representation of a multivariate Gaussian with one dimension per axis (limited to 1D to 3D examples) with the visual representation common to GPs (one dimension per horizontal slice) by showing the correspondence side-by-side in a single interactive visual.
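To make the bridge concrete, here is a minimal numpy sketch (my own illustration, not from the article) of reading one draw from a bivariate Gaussian in both views:

```python
import numpy as np

# My own illustration (not from the article): one draw from a bivariate
# Gaussian, read first as a point in the plane and then as a "function"
# over the index set {0, 1}, the view used for Gaussian processes.
rng = np.random.default_rng(0)
mu = np.zeros(2)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])  # strongly correlated dimensions

sample = rng.multivariate_normal(mu, cov)

# View 1: a single point (x1, x2), one visual axis per dimension.
point_view = tuple(sample)

# View 2: the same numbers as values of a function f with f(0) = x1 and
# f(1) = x2 -- each dimension becomes its own slice of a shared plot.
function_view = [(i, v) for i, v in enumerate(sample)]
```

Correlated dimensions then show up as the two slices moving together across samples, which is exactly the property GP figures exploit.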

"In general, the multivariate Gaussian distribution is defined by a mean vector μ and a covariance matrix Σ",

What purpose does the "In general" clause serve?

"symetric" misspelled - should be "symmetric"

“The standard deviations for each random variable are on the diagonal of the covariance matrix … ”

I believe this is an error - the diagonal of the covariance matrix corresponds to the variance for each random variable, not the standard deviation.
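A quick numerical check (my own sketch, not from the article) makes the distinction concrete:

```python
import numpy as np

# My own sketch: the diagonal of an empirical covariance matrix estimates
# the variances (sigma^2 = 9 here), not the standard deviations (sigma = 3).
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=3.0, size=(100_000, 2))  # std dev 3 per dim

cov = np.cov(data, rowvar=False)
diag = np.diag(cov)  # entries are close to 9.0, not 3.0
```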

The placement of the controls for the bivariate normal figures is counter-intuitive.

The placement of the bottom handle, pointing downwards, implies that:

1. the effect of the control is on the vertical proportions of the Gaussian (since the control goes up and down) and
2. the effect is somehow opposite the control which points upwards.

Neither of these intuitions turns out to be true of the control's effect on the distribution.

To add more confusion, the most positive correlation is obtained when the slider is at its most negative position and the most negative correlation is obtained when the slider is at its most positive position.

"Gaussian distributions are widely used to model the real world: either as a surrogate when the original distributions are unknown, or in the context of the central limit theorem."

This sentence does not make much sense, unpacking the two clauses:

"Gaussian distributions are widely used to model the real world ... when the original distributions are unknown."

This notion of “assume Gaussian by default” reflects a common misuse of Gaussian distributions and is blamed for many modeling failures. The phrasing here implies that using Gaussians when the original distributions are unknown is an acceptable common practice, which is poor modeling practice.

"Gaussian distributions are widely used to model the real world ... in the context of the central limit theorem."

I can guess what this means (that Gaussian distributions are justified if the conditions of the central limit theorem hold), but the phrasing reads very awkwardly.
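The intended reading can at least be demonstrated in a few lines (my own sketch, not from the article): sums of many i.i.d. non-Gaussian variables are approximately Gaussian.

```python
import numpy as np

# My own sketch: sums of 50 i.i.d. uniform variables are approximately
# Gaussian, as the central limit theorem predicts.
rng = np.random.default_rng(1)
n_terms, n_samples = 50, 20_000
sums = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)

# The CLT predicts mean n/2 = 25 and variance n/12; for a Gaussian, about
# 68% of the mass lies within one standard deviation of the mean.
frac_within_1sd = np.mean(np.abs(sums - sums.mean()) < sums.std())
```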

"Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. This means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable."

Use of “This” is ambiguous (does it refer to being closed, to conditioning, to marginalization, etc.?). Could replace "This" with "Being closed under conditioning and marginalization" to reduce ambiguity.

"Marginalization and conditioning both work on subsets of the original distribution and we will use the following notation"

This section needs more elaboration as there are two key stumbling blocks for those coming from other backgrounds.

1. The way in which X and Y are used in this context as subsets of a multivariate model, as opposed to predictor/dependent variables, is going to be confusing for a large fraction of people, and it's worth being precise and having a visual aid to cross that gap.

2. X and Y represent sets of random variables. This is stated in the text, but would be much better illustrated so that X and Y are shown to capture multiple dimensions, perhaps by representing the first equation under "Marginalization and Conditioning" more visually. The breakdown of \mu and \Sigma into subsets would be better illustrated with a figure.
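As a concrete reference for such a figure, here is a minimal numpy sketch (variable names are mine, not the article's) of the block decomposition and the resulting marginalization and conditioning formulas:

```python
import numpy as np

# My own sketch (names are mine): a joint Gaussian over (X, Y), where X is
# the first dimension and Y the remaining two, split into blocks.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])

mu_x, mu_y = mu[:1], mu[1:]
S_xx, S_xy = Sigma[:1, :1], Sigma[:1, 1:]
S_yx, S_yy = Sigma[1:, :1], Sigma[1:, 1:]

# Marginalizing out Y just drops the Y blocks.
mu_marg, Sigma_marg = mu_x, S_xx

# Conditioning on an observation Y = y_obs gives another Gaussian.
y_obs = np.array([0.5, 0.0])
mu_cond = mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)
Sigma_cond = S_xx - S_xy @ np.linalg.solve(S_yy, S_yx)  # variance shrinks
```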

“The interpretation of this equation is straight forward”

I don’t think it’s helpful to qualify the interpretation as "straightforward" or “not straightforward”, just explain the interpretation without the qualification.

"Now that we have worked through the necessary equations, we will think about how we can understand the two operations visually."

I think it would be beneficial to show the visual first and then present marginalization and conditioning equations as elaborations on the visual.

# Gaussian Processes

"First, we will move from the continuous view to the discrete representation of a function: "

This transition sentence is confusing, considering that the continuous view of functions hasn't been discussed. Up to now the main topic has just been the multivariate Gaussian distribution, not continuous functions.

## Kernels

There's a bug in the figure here: the right margin annotation doesn't get updated until after the slider is released. It should get updated as soon as a slider is touched.

"The stationary nature of the RBF kernel can be observed in the banding around the diagonal of its covariance matrix"

This relationship could be made more intuitive visually. Specifically, what exactly is meant by banding around the diagonal and why does it confer stationarity?
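For instance, a few lines of numpy (my own sketch; the article's kernel parameters may differ) show that on evenly spaced inputs an RBF kernel matrix is constant along each diagonal, so its large values form a band around the main diagonal; this follows from the kernel depending only on the distance between inputs, i.e. from stationarity:

```python
import numpy as np

# My own sketch: an RBF kernel matrix on evenly spaced inputs.  Because the
# kernel depends only on |x - x'|, K[i, j] depends only on i - j, so every
# diagonal is constant and values decay away from the main diagonal.
def rbf(x1, x2, length_scale=1.0):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * length_scale ** 2))

x = np.linspace(0.0, 4.0, 9)  # evenly spaced points
K = rbf(x, x)

first_band = np.diag(K, k=1)  # constant along the first off-diagonal
```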

## Prior Distribution

The "click to start" functionality is awkward: why does the user need to start/pause at all? Why not just have the visual continually running by default?

## Posterior Distribution

This section probably needs the most work writing-wise.

"The intuition behind this step is that the training points constrain the set of functions to those that pass through the training points."

This can be a harmful intuition, since noise parameters can allow the fitted functions not to pass through the training points.
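A small numpy sketch (my own; the article's notation may differ) makes the point: with an observation-noise term added to the kernel matrix, the posterior mean no longer interpolates the training data.

```python
import numpy as np

# My own sketch: GP posterior mean at the training inputs, with and without
# observation noise sigma_n.  Noise-free, the mean interpolates the targets;
# with sigma_n > 0 it is free to miss them.
def rbf(a, b, ls=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls ** 2))

X_train = np.array([-2.0, 0.0, 1.5])
y_train = np.array([1.0, -0.5, 0.7])

def posterior_mean_at_train(sigma_n):
    K = rbf(X_train, X_train)
    # mu* = K (K + sigma_n^2 I)^{-1} y, evaluated at the training inputs
    return K @ np.linalg.solve(K + sigma_n ** 2 * np.eye(len(X_train)), y_train)

mean_noise_free = posterior_mean_at_train(1e-8)  # passes through y_train
mean_noisy = posterior_mean_at_train(0.5)        # does not
```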

"Analogous to the prior distribution, we could obtain a prediction by sampling from this distribution. But, since sampling involves randomness, we would have no guarantee that the result is a good fit to the data. "

I think I understand the idea (the desire is to obtain a closed-form representation), but this is a very awkward choice of phrase. Random processes are often used to find fits to data (e.g. MCMC).

“In contrast to the prior distribution, we set the mean to μ=0, so in this context it is not really of importance.”

Very confusing what concept "it" is referring to here.

Distill employs a reviewer worksheet to help reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 3/5
Writing Style 2/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 4/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 5/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 2/5
How easy would it be to replicate (or falsify) the results? 3/5
Does the article cite relevant work? 3/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 3/5


### grtlr commented Jan 30, 2019

 Thank you very much for taking the time to review our article! We really appreciate the detailed comments and feedback! We have just started to incorporate all the changes. I will reference the corresponding PRs and commits in this issue.

## Prior Distribution

The "click to start" functionality is awkward: why does the user need to start/pause at all? Why not just have the visual continually running by default?

All the figures in this article are implemented using JavaScript. We have experienced minor stutter in some of the figures. We hope that the click-to-start makes the article run more smoothly on older machines.


### grtlr added a commit that referenced this issue Jan 31, 2019

 Expand paragraph on noisy inputs (#7) 
We provide more details on the error model for noisy inputs, as a valuable extension to Gaussian processes. (#1, #4)
 c02f66d 

### grtlr commented Feb 6, 2019

The placement of the controls for the bivariate normal figures is counter-intuitive. The placement of the bottom handle, pointing downwards, implies that: the effect of the control is on the vertical proportions of the Gaussian (since the control goes up and down) and the effect is somehow opposite the control which points upwards. Neither of these intuitions turns out to be true of the control's effect on the distribution. To add more confusion, the most positive correlation is obtained when the slider is at its most negative position and the most negative correlation is obtained when the slider is at its most positive position.

We very much agree with these points. In #13 we completely overhaul the interaction for the bivariate figures. Now the MVN can be controlled using its eigenvectors, which, in our opinion, feels more natural. Additionally, we use SVD instead of the Cholesky decomposition, which yields much nicer ellipses.

# Multivariate Gaussian Distributions

"In particular, we are interested in the multivariate case of this distribution, where each random variable is distributed normally and their joint distribution is also Gaussian."

This section misses an opportunity to introduce a figure that illuminates the gap between the intuition most people have of a Gaussian and the way Gaussians are used in GPs.

One of the big stumbling blocks with GPs is a visual one - when students are introduced to Gaussians, they are used to looking at the distribution with each dimension mapped to a visual axis (like the two figures in this section). In GPs, the dimensions of a multivariate Gaussian are visually represented as vertical axes (in effect, sharing a single axis).

This section shows the former, the next section shows the latter, but we’re missing a visual that links the two representations. There is an opportunity to provide an illustration that bridges the familiar representation of a multivariate Gaussian with one dimension per axis (limited to 1D to 3D examples) with the visual representation common to GPs (one dimension per horizontal slice) by showing the correspondence side-by-side in a single interactive visual.

This is a great idea! PR #15 adds such an interactive figure that should make this connection much clearer. It allows the user to explore the two different views (sample from distribution and corresponding horizontal slice; for 2 dimensions).


# Introduction

"Even if you have spent some time reading about machine learning, chances are that you have never heard of Gaussian processes."

This assumption will often be incorrect (in some circles, Gaussian processes are well known), but more importantly, it misses an opportunity to provide a stronger motivational hook for the article beyond "it's something you haven't heard of / rehearsing the basics is good".

"Their most obvious area of application are regression problems, for example in robotics, but they can also be extended to classification and clustering tasks [2, 3] . Just a quick recap: the goal of regression is to find a function that describes a given set of data points as closely as possible. "

The phrasing here feels more convoluted than it needs to be. Why introduce the overloaded term "regression" and then follow its first use with an awkward definitional statement? The article could simply use the definition in the first sentence and drop the "just a quick recap" sentence altogether.

[...]

"In general, the multivariate Gaussian distribution is defined by a mean vector μ and a covariance matrix Σ",

What purpose does the "In general" clause serve?

"symetric" misspelled - should be "symmetric"

“The standard deviations for each random variable are on the diagonal of the covariance matrix … ”

I believe this is an error - the diagonal of the covariance matrix corresponds to the variance for each random variable, not the standard deviation.

[...]

"Gaussian distributions are widely used to model the real world: either as a surrogate when the original distributions are unknown, or in the context of the central limit theorem."

This sentence does not make much sense, unpacking the two clauses:

"Gaussian distributions are widely used to model the real world ... when the original distributions are unknown."

This notion of “assume Gaussian by default” reflects a common misuse of Gaussian distributions and is blamed for many modeling failures. The phrasing here implies that using Gaussians when the original distributions are unknown is an acceptable common practice, which is poor modeling practice.

"Gaussian distributions are widely used to model the real world ... in the context of the central limit theorem."

I can guess what this means (that Gaussian distributions are justified if the conditions of the central limit theorem hold), but the phrasing reads very awkwardly.

"Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. This means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable."

Use of “This” is ambiguous (does it refer to being closed, to conditioning, to marginalization, etc.?). Could replace "This" with "Being closed under conditioning and marginalization" to reduce ambiguity.

PR #10 fixes the corresponding passages in the text.


### RKehlbeck commented Feb 7, 2019 • edited

There's a bug in the figure here: the right margin annotation doesn't get updated until after the slider is released. It should get updated as soon as a slider is touched.

PR #12 fixes the sliders.

## Posterior Distribution

This section probably needs the most work writing-wise.

"The intuition behind this step is that the training points constrain the set of functions to those that pass through the training points."

This can be a harmful intuition, since noise parameters can allow the fitted functions not to pass through the training points.

PR #18 gives additional information about the noise model so that the function does not have to pass directly through the observations.


### grtlr commented Feb 7, 2019

"Analogous to the prior distribution, we could obtain a prediction by sampling from this distribution. But, since sampling involves randomness, we would have no guarantee that the result is a good fit to the data."

I think I understand the idea (the desire is to obtain a closed-form representation), but this is a very awkward choice of phrase. Random processes are often used to find fits to data (e.g. MCMC).

“In contrast to the prior distribution, we set the mean to μ=0, so in this context it is not really of importance.”

Very confusing what concept "it" is referring to here.

We have fixed these minor comments in PR #19.