# SciNet: Promise and Pitfalls

### Abstract

In [ITEN 2020], the authors present a very good interpolation technique that can be used to identify some of the key variables that affect a system, as well as to predict its behavior. Properly contextualized, with an understanding of the limitations of all neural methods (and this one specifically), this can be a powerful tool for an experimentalist or theorist. I hope to showcase this method's potential applications, pitfalls, and some mitigating techniques here, using several toy examples and a full-fledged analysis of the driven damped oscillator.

## 1 Introduction

The study of neural networks has numerous applications in the world today, both in practice, such as translation software [ANY TRANSLATOR], and as a powerful tool for researchers in many different fields [DEEPLABCUT]. Despite this potential, however, many researchers in this field have a bad habit of promising even more and then not delivering: [iNFLUENCER] promised [NO RADIOLOGISTS QUOTE] by [YEAR] for instance, back in [YEAR]. Furthermore, many truly fascinating advances have been advertised incorrectly: Google Talk to Books, which can answer many questions if the answer is directly visible in a text it has been trained on, was described as "reading books." Gary Marcus' [Rebooting AI] has a very good section on pages [PAGES IN GM] blowing that notion out of the water. [REREAD THAT SECTION IN MARCUS, AND SUMMARIZE BETTER].

The paper [ITEN 2020] falls into that last category, introducing an idea with potential, and then grossly exaggerating that potential. The goal of this review is to showcase both the utility of this network and some of its failure modes, using the example system of the driven oscillator. In section $2$, we will review the principles of neural networks and of SciNet in particular; in section $3$, we will discuss the problems and advantages of using neural networks for extrapolation using a very simple model system; in section $4$, we will discuss SciNet's performance on the driven, damped oscillator; and in the conclusion, we will discuss these systems, general tips, and directions for future research.

### What SciNet Is and Isn't

The article introduces a very powerful, general method for interpolating smooth, differentiable functions. Furthermore, using a clever reinterpretation of an old idea in ML - a Variational Autoencoder [VAE SOURCE] - the authors even expose some of the internal workings of their neural architecture, partially reducing the "black box" nature these models are known for. I believe this has a lot of potential in physics and other sciences; given some time, methods based on this framework might even become as useful and ubiquitous as local polynomial and Fourier approximations.

However, the authors are **certainly not** "Discovering Physical Concepts with Neural Networks," as the title claims. SciNet cannot be used outside of a larger theoretical framework that uses symbolic mathematics to describe physical laws. Nor is their agent "unbiased by prior knowledge," as the authors claim in the introduction; in fact, I will show how SciNet has specific, quantifiable biases towards smooth functions that are counterproductive in some cases. The problem with these descriptions is not just that they are clickbait; it's that they are very likely to grossly mislead and misdirect new researchers, especially those who are more familiar with physics than they are with neural networks. If you don't understand the many significant limitations of this approach, and naively throw difficult experimental data at it, you will only become incredibly frustrated, and possibly forgo this exciting field entirely.

### How Should SciNet Be Used? 

SciNet can be used in place of other interpolation techniques, such as polynomial interpolation - that is, on functions that are continuous and smooth, without large jumps, on a restricted domain and with a restricted range, much like $\arcsin$, shown with its polynomial approximations in figure 1. Note that the $\arcsin$ nonlinearity appears in the transformation from geocentric to heliocentric coordinates, essential for the Solar System problem.

[ARCSIN DIAGRAM]
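As a rough numerical companion to the figure, the sketch below compares $\arcsin$ with a few of its Taylor (Maclaurin) polynomials on a restricted domain. The choice of Taylor polynomials, the interval $[-0.9, 0.9]$, and the helper name `arcsin_taylor` are assumptions made for illustration; the figure may use a different family of polynomial approximations.

```python
from math import factorial

import numpy as np


def arcsin_taylor(x, n_terms):
    """Sum of the first n_terms odd-power terms of the Maclaurin series of arcsin."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for n in range(n_terms):
        # Coefficient of x^(2n+1): (2n)! / (4^n (n!)^2 (2n+1))
        coeff = factorial(2 * n) / (4**n * factorial(n) ** 2 * (2 * n + 1))
        total = total + coeff * x ** (2 * n + 1)
    return total


# Hypothetical evaluation grid: a restricted domain away from the endpoints +-1.
xs = np.linspace(-0.9, 0.9, 181)
max_errors = {}
for n_terms in (1, 3, 6):
    approx = arcsin_taylor(xs, n_terms)
    max_errors[n_terms] = np.max(np.abs(approx - np.arcsin(xs)))
    print(f"{n_terms} terms (degree {2 * n_terms - 1}): "
          f"max |error| = {max_errors[n_terms]:.4f}")
```

The error shrinks as terms are added, but only slowly near the ends of the interval, which is the kind of behavior a restricted domain is meant to keep under control.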

Unlike polynomials, it is particularly good with functions that take many input variables, but whose behavior is likely controlled by just a few values that can be derived from those inputs. The latent space might isolate those values, or it might not, but there's certainly a chance that its behavior can guide the scientist. SciNet can be used to make predictions in a region well covered by the training data, and possibly used in lieu of experiments (more in the Conclusion), but it should *never* be used for extrapolation far from that region, even if the underlying function is very simple, as we will see in section $3$.

There are many possible applications of this network, which I will suggest in the conclusion. There are also many cases in which preprocessing can make difficult physical systems more tractable for SciNet, provided the researcher knows some aspects of the system they are studying; we'll discuss some of these in section $4$, and also in the conclusion.

## 2 Neural Nets and VAEs

Let's briefly review the underlying methods, before analyzing their potential.

### Neural Net Review

[Basics, 1 layer example, graphs of sigmoid / ReLU / ELU]

### VAE and $\beta$VAE

[VAE ONE PARAGRAPH]

Formula, including KL divergence.

$\beta$-VAE and what it does.
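
For orientation, the textbook form of the $\beta$-VAE objective is sketched below. It assumes a diagonal Gaussian encoder $q_\phi(z|x) = N(\mu(x), \mathrm{diag}(\sigma^2(x)))$ and a standard normal prior $p(z)$; the symbols $\theta$, $\phi$, $\mu_j$, $\sigma_j$ and the latent dimension $d$ are notation introduced here, and the exact loss used in [ITEN 2020] may differ in its details.

$$ \mathcal{L}(\theta, \phi; x) = - \mathbb{E}_{q_\phi(z|x)} [ \log p_\theta(x|z) ] + \beta \, D_{KL} ( q_\phi(z|x) \, \| \, p(z) ) $$

$$ D_{KL} ( q_\phi(z|x) \, \| \, p(z) ) = \frac{1}{2} \sum_{j = 1}^d ( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 ) $$

Setting $\beta = 1$ gives the ordinary VAE objective; larger values of $\beta$ put more weight on the KL term relative to the reconstruction term.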

### Which hidden parameters are used?

[ONE paragraph, summarizing method and pointing to conclusion]

### SciNet

Put it all together, show two graphs.

## 3 Extrapolation

One of the most impressive features of the laws of physics is how general they are, and how well they extrapolate far outside the original domain whence they were derived. Newton's Laws - even though we now know they are an imperfect approximation - extrapolate so well that the modern space industry still relies on them. The laws of physics as we understand them today apply so incredibly well to almost all the phenomena with which we can directly interact that experiments showcasing the limitations of modern physics require national or international megaprojects. In fact, it could be said that this capacity for extrapolation is *the* feature that distinguishes Physics from the other scientific disciplines; one would imagine that anything aiming to replace Physical laws would have the same feature.

However, neural networks have difficulty identifying even basic patterns and using them outside the training domain; overfitting to that domain is the rule. To see this, we will examine how neural nets perform on the simple function

$$ y = x + 10 $$

We will use a one-dimensional $x$ and $y$, with $x \sim N(0, 1)$, and then evaluate performance for values of $x$ far from the origin.

### Theoretical Discussion

Instead of testing the full SciNet here, I will focus on simple systems with just one hidden layer of $8$ neurons. This is not because SciNet wouldn't suffer from the same problems - I will try to show this by also including the results with $128$ hidden neurons - but because one layer is a simpler system, for which we can fully write out all parameters and understand all functions that can be described with such a system.

Specifically, for a system with one-dimensional input $x$, one-dimensional output $y$, first-layer weights and biases $w_{1, i}$ and $b_{1, i}$ (for $1 \le i \le 8$), second-layer weights $w_{2, i}$, second-layer bias $b_2$, and $ReLU$ nonlinearities, we can write

$$ y = b_2 + \sum_{i = 1}^8 w_{2, i} ReLU [ b_{1, i} + w_{1, i} x ]$$
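
The formula above translates almost line for line into NumPy. The following is a minimal sketch, not the code used for the experiments: the function names, the random parameter values, and the choice of $\alpha = 1$ for the ELU are all assumptions made for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def elu(z):
    # ELU with alpha = 1 (an assumption): z for z > 0, exp(z) - 1 otherwise.
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


def one_layer_net(x, w1, b1, w2, b2, activation=relu):
    """y = b2 + sum_i w2[i] * activation(b1[i] + w1[i] * x)."""
    x = np.asarray(x, dtype=float)
    hidden = activation(b1 + np.multiply.outer(x, w1))  # shape (..., n_hidden)
    return b2 + hidden @ w2


# Arbitrary parameters, only to show shapes and the parameter count.
rng = np.random.default_rng(0)
n_hidden = 8
w1, b1, w2 = (rng.normal(size=n_hidden) for _ in range(3))
b2 = rng.normal()

n_params = w1.size + b1.size + w2.size + 1  # 8 + 8 + 8 + 1
y = one_layer_net(np.linspace(-3, 3, 7), w1, b1, w2, b2)
print(n_params, y.shape)  # prints: 25 (7,)
```

Passing `activation=elu` gives the ELU variant of the same network.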

In effect, the network can only represent piecewise linear functions with at most $8$ inflection points (kinks where the slope changes). With $ELU$ nonlinearities, of course, we replace $ReLU$ with $ELU$ in the equation, and the class of functions we can fit is slightly different, but the principle is similar.

Now, while an $ELU$ network can only approximate the function $f(x) = x + 10$ (at least, for $x$ drawn from the entire real line), clever choices for the weights and biases can express $f$ exactly in a $ReLU$ network. For example,

$$ x + 10 \equiv 10 + ReLU(x) - ReLU(-x) $$

However, there are also numerous ways for the functions to match only on a limited domain. For example, 

$$ x + 10 = ReLU(x + 10) $$

as long as $x \ge -10$ (which covers the entire training domain we are likely to encounter), but these functions quickly diverge for $x$ below this cutoff. 
Furthermore, there is nothing in the design of neural networks to encourage one solution over the other; the expectation is for the set of training points to cover all cases we are likely to see in the wild.
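
The two representations above can be checked numerically. In this sketch the function names and the probe points are chosen for illustration; both functions agree with $x + 10$ on any plausible training range, and only one of them keeps agreeing far to the left of it.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def exact_everywhere(x):
    # x + 10 written with two ReLU neurons: valid on the whole real line.
    return 10.0 + relu(x) - relu(-x)


def exact_on_training_range(x):
    # x + 10 written with one ReLU neuron: valid only for x >= -10.
    return relu(x + 10.0)


x_train_like = np.linspace(-4.0, 4.0, 9)  # where N(0, 1) samples mostly fall
x_far = np.array([-1000.0, -50.0, 50.0, 1000.0])

for name, net in [("two neurons", exact_everywhere),
                  ("one neuron", exact_on_training_range)]:
    err_in = np.max(np.abs(net(x_train_like) - (x_train_like + 10.0)))
    err_out = np.max(np.abs(net(x_far) - (x_far + 10.0)))
    print(f"{name}: max error in range = {err_in}, far away = {err_out}")
```

A loss computed only on the training points cannot tell these two solutions apart.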

It's worth noting that this stands in sharp contrast to the tools typically used in physics, such as polynomial fits. Indeed, if the ground truth function $f$ is a polynomial of degree $k$ (with no noise), we have training data at $k+1$ or more distinct values of $x$, and we use a least squares fit to find a $k$-degree polynomial $g$, it's easy to guarantee that $f(x) \equiv g(x)$ on the entire real number line.
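
A small numerical illustration of this guarantee follows. The cubic `f`, the sample locations, and the far-away probe points are arbitrary choices made for this sketch; in floating point the recovered coefficients match only up to rounding error.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def f(x):
    # Hypothetical noise-free ground truth of degree 3.
    return 2.0 - x + 0.5 * x**3


degree = 3
x_train = np.linspace(-2.0, 2.0, degree + 3)  # more than k + 1 distinct points
coeffs = P.polyfit(x_train, f(x_train), degree)  # lowest order first

x_far = np.array([-100.0, 100.0])  # far outside the training range
print("fitted coefficients:", coeffs)
print("fit at x_far:  ", P.polyval(x_far, coeffs))
print("truth at x_far:", f(x_far))
```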

### Results

Now that we understand why there is no reason to expect a neural network to extrapolate beyond the training domain, the results in Fig \ref{linfailure} are easier to understand:

[graph of failure, x in training range and outside it. FIG 1 LAYER.]
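
An experiment of this kind can be sketched in plain NumPy as follows. This is not the code that produced the figure: the sample size, initialization, learning rate, number of steps, and the use of full-batch gradient descent are all assumptions, and the numbers printed will depend on them and on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: x ~ N(0, 1), y = x + 10, no noise.
x = rng.normal(size=1024)
y = x + 10.0

# One hidden layer with 8 ReLU neurons; the initialization is arbitrary.
n_hidden = 8
w1 = rng.normal(size=n_hidden)
b1 = rng.normal(size=n_hidden)
w2 = 0.5 * rng.normal(size=n_hidden)
b2 = 0.0


def forward(x, w1, b1, w2, b2):
    pre = x[:, None] * w1 + b1      # (N, 8) pre-activations
    hidden = np.maximum(pre, 0.0)   # ReLU
    return pre, hidden, hidden @ w2 + b2


learning_rate = 0.01
losses = []
for step in range(5000):
    pre, hidden, pred = forward(x, w1, b1, w2, b2)
    err = pred - y
    losses.append(np.mean(err**2))
    # Gradients of the mean squared error, written out by hand.
    g_pred = 2.0 * err / x.size
    g_pre = (g_pred[:, None] * w2) * (pre > 0)
    g_w2 = hidden.T @ g_pred
    g_b2 = g_pred.sum()
    g_w1 = (g_pre * x[:, None]).sum(axis=0)
    g_b1 = g_pre.sum(axis=0)
    w1 -= learning_rate * g_w1
    b1 -= learning_rate * g_b1
    w2 -= learning_rate * g_w2
    b2 -= learning_rate * g_b2

# Compare the fit inside and far outside the training range.
x_probe = np.array([-100.0, -10.0, -1.0, 0.0, 1.0, 10.0, 100.0])
_, _, y_probe = forward(x_probe, w1, b1, w2, b2)
print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.6f}")
for xi, yi in zip(x_probe, y_probe):
    print(f"x = {xi:7.1f}   truth = {xi + 10:7.1f}   net = {yi:10.3f}")
```

Rerunning with a different seed or with `n_hidden = 128` is a quick way to see how much the behavior outside the training range varies while the training loss stays small.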

Adding more neurons to the hidden layer, or more hidden layers, increases the flexibility of our model - we'll be able to fit piecewise linear functions with many more inflection points - but it does nothing to guarantee extrapolation.

[FIG WITH 128 NEURONS]

The same performance hurdles will hold true for more sophisticated systems, such as SciNet. There is nothing to guarantee that a system with good results on both the train and test sets will still perform well outside the region those sets cover. Something as simple as increasing the amplitude of an oscillator by several orders of magnitude will break the predictive potential of the system.

We will see SciNet fail at even small extrapolations in Section $4$.

### Within the Training Distribution

With such a strong bias away from functions that extrapolate, why use neural networks at all? Why not simply resort to more traditional techniques, such as polynomial fits?

Like every modeling tool, neural nets have their uses and drawbacks. One of their strengths is their flexibility: they can fit any continuous function on a limited domain (this is the universal approximation property, and it assumes a hidden layer with arbitrarily many neurons).

Specifically, because all of the commonly used nonlinearities are nearly linear or nearly constant on large segments of their domain, neural nets fit functions that have different behavior in different parts of the relevant domain very well.

Perhaps the best function to showcase here is the step function. Assume that $y = 10$ when $x > 0$ and $y = 0$ otherwise. With training data drawn from the unit normal, the graphs of trained ReLU and ELU neural nets (8 hidden neurons, trained for 50,000 epochs) are in figure \ref{stepsuccess}.

[Graph with ReLU and ELU approximation, as well as error.]

Away from the discontinuity, we see that the ReLU net matches the ground truth almost perfectly. The ELU net has small ringing effects, since it is limited to functions with a continuous first derivative, but these errors are also very small relative to the discontinuity.
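
To see why a ReLU net is capable of such a close match, it helps to write one down by hand. The sketch below uses two ReLU neurons with hand-picked weights (not trained ones) to build a steep ramp; the ramp width of $10^{-3}$ and the function names are arbitrary choices for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def relu_step(x, height=10.0, width=1e-3):
    """Two ReLU neurons forming a ramp from 0 to `height` over [0, width]."""
    return (height / width) * (relu(x) - relu(x - width))


def true_step(x, height=10.0):
    return np.where(x > 0, height, 0.0)


x = np.linspace(-3.0, 3.0, 601)
away = (x <= 0) | (x >= 1e-3)  # everything except the ramp itself
max_err_away = np.max(np.abs(relu_step(x[away]) - true_step(x[away])))
print("max |error| away from the ramp:", max_err_away)
```

Away from the narrow ramp, the error is at the level of floating point rounding; a trained network has to find weights like these on its own, so its ramp will generally be wider.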

However, if we take a polynomial of the same degree as the number of free parameters in these neural nets ($25$) and fit it to $204800$ $x$-$y$ pairs (the $x$ values were drawn from the unit normal), we get figure \ref{polystepfailure}. Changing the degree or the number of $x$-$y$ pairs within reasonable limits doesn't change the behavior much.
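
A fit of this kind can be reproduced along the following lines. This is a sketch rather than the original script: it uses NumPy's Chebyshev basis for the least squares fit, which spans the same space of degree-$25$ polynomials but is numerically better conditioned, and the random seed and evaluation grid are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# 204800 x-y pairs from the step function, with x drawn from the unit normal.
x = rng.normal(size=204_800)
y = np.where(x > 0, 10.0, 0.0)

# Least squares fit of a degree-25 polynomial (Chebyshev basis is an assumption).
poly = np.polynomial.Chebyshev.fit(x, y, deg=25)

grid = np.linspace(-2.0, 2.0, 4001)
truth = np.where(grid > 0, 10.0, 0.0)
abs_err = np.abs(poly(grid) - truth)
print(f"max |error| on [-2, 2]:     {abs_err.max():.3f}")
print(f"max |error| for |x| > 0.5: {abs_err[np.abs(grid) > 0.5].max():.3f}")
```

Because the polynomial is continuous, it has to cross the jump gradually, so the largest errors sit right at the discontinuity; the second printed number shows how much error remains further away from it.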

If the ground truth is a polynomial, a polynomial fit will work very well, but if the underlying function is something different and you need to fit a patchwork of local behaviors, neural nets will do a far, far better job. If you have fewer ideas about the underlying system, fitting a neural net can direct you on the right path. SciNet can even identify some useful hidden variables which might make further investigation easier. However, if you want general rules that extrapolate well, further investigation is absolutely necessary.

## 4 Damped Driven Oscillator

[Describe setup and problem]

### boring_patch

### center_very_broad

### peak-focused failures

### Mitigating methods - logspace

[Discussion of success]

[Discussion of biases and sharp peaks]

### Forgetting small effects

[Discussion of that one-dimensional logspace case.]

## 5 Conclusion

[WRITE ME AT THE END!!!!]

[Include all of the stuff discussed, especially applications / replacing experiments]