
2020.12.21 Spring-based regression with different spring strengths during training vs. prediction/inference


Smart Regression or One-of-Many-Possibilities Regression

  • 2021-01-07

    • The problem here is akin to 2D space with multiple lines traveling through it and for each point, finding the closest line.
      • So say you have a hockey-stick-shaped cloud of points, you might use 2 lines to "fit" that cloud that intersect at the origin.
      • Or if you have an hour-glass-shaped cloud, then two lines might intersect in the middle.
      • Or a bar/stripe-shaped cloud, then two lines might be parallel to the axis of the bar, each accounting for half of its points.
      • There's probably a linear programming dual expression of this problem.
    • A problem arises at "inference" time when you need to predict a y value for a given x when you don't know which line to choose.
      • Actually, maybe not: the entire nD space can be split into regions, each closest to one line (or plane) of fit... hmm... the problem is still that all n dimensions need to be known to determine which line is closer; the n-1 independent-variable dimensions alone are insufficient.
    • The (orthogonal) distance between a point (m, n) and a line Ax + By + C = 0 in 2D space is: d = |Am + Bn + C| / sqrt(A^2 + B^2)
      • Except this idea is to choose n-1 single-independent-variable planes in nD space, one per x_i, each with its own intercept.
      • Which, perhaps, is analogous to choosing 2 y-intercepts (horizontal lines) in 2D to represent a cloud of points, and then taking each point's distance to the closer of the two (the L-infinity-norm idea), summed across all of the points, as the quantity to minimize. Again, you can't do "inference" here though, because both lines are horizontal.
    • So then given a set of underspecified models (e.g. with only 1 or 2 independent variables each), the characteristic/utility function is computed as the sum of L-infinity norms (distance to closest model) to each point in the in-sample/training set.
      • For inference, perhaps we can define a probability distribution across the n-1 x_i dimensions so that, given a new x_hat (vector), we can determine which model applies with the highest probability and use that model for inferring a value of y for that point.
      • One way to define these probability distributions could be with k-nearest-neighbors: for a new x_hat, find the k nearest x_hat_j and use the underspecified model that is closest for most of those x_hat_j (a sketch along these lines appears after this list).
    • Let's get back to the original idea here: to allow different ones of the n-1 independent x variables to "take over" in different regions of the nD space to explain what's going on with y.
      • For example, maybe dividend-yield doesn't matter when there's high cash-flow, but when there's low cash-flow it predicts future returns.
      • In other words, there are 2 x variables (dividend-yield and cash-flow) and when cash-flow is high (in that region of x_i/n-1-dimensional space), it's the driver and elsewhere, dividend-yield is the driver. We used to call this modulation of one variable on another.
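A minimal sketch of the multiple-lines-through-one-cloud idea and the k-nearest-neighbors model-selection idea above, assuming a plain alternating scheme in the spirit of k-means: assign each point to its nearest line, refit each line on its assigned points, repeat. It uses vertical rather than orthogonal distances for simplicity (the orthogonal formula above could be swapped in), and all function names are illustrative, not from these notes.

```python
import numpy as np

def line_distances(x, y, coefs):
    """Vertical distance from each point (x_i, y_i) to each line y = a*x + b."""
    preds = coefs[:, 0][None, :] * x[:, None] + coefs[:, 1][None, :]  # shape (n, k)
    return np.abs(y[:, None] - preds)

def fit_k_lines(x, y, k=2, n_iter=50, seed=0):
    """Alternate between assigning points to their nearest line and refitting each line."""
    rng = np.random.default_rng(seed)
    coefs = np.zeros((k, 2))
    for j in range(k):                                    # initialize from random point pairs
        i1, i2 = rng.choice(len(x), size=2, replace=False)
        a = (y[i2] - y[i1]) / (x[i2] - x[i1] + 1e-12)
        coefs[j] = (a, y[i1] - a * x[i1])
    for _ in range(n_iter):
        assign = line_distances(x, y, coefs).argmin(axis=1)   # nearest line per point
        for j in range(k):
            mask = assign == j
            if mask.sum() >= 2:
                coefs[j] = np.polyfit(x[mask], y[mask], 1)    # refit line j on its points
    return coefs, assign

def predict_knn(x_new, x_train, assign, coefs, k_nn=7):
    """Pick the line used by most of the k nearest training points, then apply it."""
    idx = np.argsort(np.abs(x_train - x_new))[:k_nn]
    j = np.bincount(assign[idx], minlength=len(coefs)).argmax()
    return coefs[j, 0] * x_new + coefs[j, 1]

# hockey-stick-shaped cloud: flat on the left, rising on the right
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 400)
y = np.where(x < 0, 0.0, 2.0 * x) + rng.normal(0, 0.2, 400)
coefs, assign = fit_k_lines(x, y, k=2)
print("fitted lines (slope, intercept):", coefs)
print("prediction at x = 2:", predict_knn(2.0, x, assign, coefs))
```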
  • 2020-12-28

    • Rather than allowing for a "curvy" model, as ML would provide, allow for a "wider" model where the center of the model is not a line but rather a region.
    • The effect of this should be that the center of the model is only used to predict points in the belly of the distribution, while the extremes of the distribution are predicted by whichever part of the wider model is closest (i.e. look for a "reason" for an outlier).
    • Effectively, this would also be like using all of the datapoints for training (maybe discounting the outliers or using Ridge/Lasso) but then using a different, datapoint-specific model (orthogonal distance to the "wider" region) for each datapoint during prediction/inference.
    • What is a simple mathematical formulation of such a model? Something akin to Ridge/Lasso/Normalization? (One candidate, an epsilon-insensitive band loss, is sketched after this list.)
    • Consider this approach in contrast to a discounting approach such as dividing by localized variance.
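One candidate formulation of the "wider" model, offered as a sketch rather than what the note necessarily had in mind, is an epsilon-insensitive band loss (the loss used by support vector regression): points inside a band of half-width eps around the line incur no penalty, points outside are penalized by their distance to the band's nearest edge. The names band_loss and eps are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def band_loss(params, x, y, eps=1.0, ridge=0.01):
    """Epsilon-insensitive loss: no penalty inside the band, linear penalty outside."""
    a, b = params
    dist_outside = np.maximum(np.abs(y - (a * x + b)) - eps, 0.0)
    return dist_outside.sum() + ridge * (a * a + b * b)

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 300)
y = 1.5 * x + rng.normal(0, 0.5, 300)
y[:10] += 8.0                                  # a few outliers sitting above the band

fit = minimize(band_loss, x0=[0.0, 0.0], args=(x, y), method="Nelder-Mead")
a, b = fit.x
print("band center: slope %.3f, intercept %.3f" % (a, b))

# points inside the band (the "belly") are treated as fully explained by the center;
# points outside it are measured against the nearest edge, i.e. the closest part of
# the "wider" model
inside = np.abs(y - (a * x + b)) <= 1.0
print("fraction of points in the belly of the band:", inside.mean())
```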
  • 2020-12-21

    • Even massively overfit ML models still don't allow for coding individual neurons to handle individual cases/datapoints. Every independent variable (or transformation/combination thereof) has the same parameters, same activations. The springs are all the same strength.
    • Perhaps what's missing is the allowance for individual parameters to have different strengths based on the dependent variable (i.e. "special case regression"). For example, if the dependent variable (DV) is an outlier, then search the independent variable (IV) space for a value that could explain that outlier, even if not fully.
    • Maybe soft labels (or here) get at this same idea somehow: adjust the outlier DVs so that they are "softer," i.e. more flexible, looser springs when it comes to their effect on the whole model, but tighter springs when it comes to making a prediction of their own E[DV].
    • Kws: smart regression
    • http://feedproxy.google.com/~r/marginalrevolution/feed/~3/JlM_lLt_1wk/my-conversation-with-john-o-brennan.html
    • search the IV space for a value that could explain that outlier

      • Call this "one of many possibilities" regression (or "infinity norm regression" though that appears to already be taken to mean something else); one independent variable being allowed to explain the dependent variable in the absence of the others and for a specific datapoint. Allow specific datapoints, especially if they are outliers, to be explained by individual independent variables.
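A minimal sketch of "one of many possibilities" regression under one reading of it: fit a simple univariate model per independent variable, flag the outlying dependent-variable values, and report which single IV comes closest to explaining each one on its own. All names and the simulated data are illustrative, not from these notes.

```python
import numpy as np

def univariate_fits(X, y):
    """One least-squares line y ~ a * x_j + b per independent variable (column of X)."""
    return [np.polyfit(X[:, j], y, 1) for j in range(X.shape[1])]

def best_single_explanation(X, y, fits, z_thresh=2.5):
    """For each outlier y_i, the single IV whose own fit leaves the smallest residual."""
    z = (y - y.mean()) / y.std()
    out = {}
    for i in np.flatnonzero(np.abs(z) > z_thresh):
        resid = [abs(y[i] - (a * X[i, j] + b)) for j, (a, b) in enumerate(fits)]
        out[i] = int(np.argmin(resid))         # index of the "explaining" IV
    return out

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))                  # e.g. dividend-yield, cash-flow, ...
y = 0.2 * X[:, 0] + rng.normal(0, 0.3, 500)
y[7] = 5.0                                     # an outlier in the dependent variable ...
X[7, 2] = 9.0                                  # ... with something unusual in IV #2
fits = univariate_fits(X, y)
print(best_single_explanation(X, y, fits))     # which single IV best "explains" point 7
```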

  • 2020-12-15

    • What are the most important statistical ideas of the past 50 years?
      • "We argue that the most important statistical ideas of the past half century are: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis."
    • Is there another form of regression where the strength of each spring is proportional to the inverse of the outlier-ness of each dependent variable value? That's what should matter when trying to "explain" an outlier. You might not be able to explain it fully, but if there's even something a little out of the ordinary going on (in the independent variables) then that should be sufficient explanation. (A weighted-least-squares sketch of this idea appears after this list.)
    • Maybe it's similar to going from linear regression with its vertical "springs" to PCA with its orthogonal "springs" to something even more so, with beyond-orthogonal "springs"?
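One concrete reading of "spring strength proportional to the inverse of the outlier-ness of each dependent variable value" is ordinary weighted least squares with weights that shrink as y moves away from the bulk of its distribution. The 1 / (1 + |z|) weighting below is an illustrative choice, not something specified in the notes.

```python
import numpy as np

def outlier_weighted_fit(x, y):
    """Weighted least squares where more-outlying y values get weaker 'springs'."""
    med = np.median(y)
    mad = np.median(np.abs(y - med)) + 1e-12   # robust scale estimate
    z = (y - med) / mad
    w = 1.0 / (1.0 + np.abs(z))                # weight shrinks with the outlier-ness of y
    return np.polyfit(x, y, 1, w=w)

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 0.5 * x + rng.normal(0, 0.5, 200)
y[:5] += 20.0                                  # a handful of outlying dependent values
print("outlier-weighted fit:", outlier_weighted_fit(x, y))
print("ordinary fit:        ", np.polyfit(x, y, 1))
```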
  • 2020-12-15 Least squares as springs

    • "mediately convinced that this intuition pays off, consider the issue that motivated me to think of this in the first place: influential outliers, points with high “leverage” in the statistical s"
    • "sical sense rather than those formal definitions. See the point in the bottom left of the plot above? Since it is pulling closer to the end of the line it has more leverage in the actual physical sense. This is the explanation I prefer to give students"
    • "which in this case is just the potential energy. The potential energy stored in a spring that is stretched a certain distance is the integral of the force over that distance, and since the force scales with the distance this means the energy scales with the squared distance. Hence, the equilibrium of this physical system mi"
    • "in equilibrium). Principal components analysis Although PCA is often considered a more advanced topic than (simple) regression, its justification in our physical analogy is actually simpler. All we need to do is drop the vertical rule that was required for regression. In this case, the springs are allowed to rotate their angle of departure from the points, and their position of attachment to the line (or hyperplane) can slide to accommodate this change in angle. This results in an equilibrium where the springs"
    • "not just distance in the yy-coordinate alone). (This is also called total least squares or a special case of Deming regression.) Model"
  • 2020-11-22

    • When buying a signal (dependent variable)... and neutralizing it wrt some other signal (independent variable, e.g. a Barra factor) via linear regression, you end up buying the extremes of the first signal (DV) but also selling the extremes of the latter (IV), which is suboptimal because you have no opinion of the latter.
    • So is there some sort of point-by-point regression technique? This is similar to the "reasons for movements" theory.
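A small sketch of the neutralization issue, assuming "neutralizing via linear regression" means trading the residual of the first signal regressed on the second: a name with no signal view but an extreme factor exposure still ends up with a sizeable position, purely from the factor we claim no opinion on. Variable names and the simulated data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
factor = rng.normal(size=n)                    # e.g. a Barra-style factor (IV)
signal = 0.3 * factor + rng.normal(size=n)     # the signal we actually like (DV)

beta = np.polyfit(factor, signal, 1)[0]
neutralized = signal - beta * factor           # the residual signal we would trade

# a name with no view from the signal but an extreme factor exposure still gets a
# sizeable (short) position, purely because of the factor
name_signal, name_factor = 0.0, 3.0
print("beta:", beta)
print("position on that name:", name_signal - beta * name_factor)
```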

  • 2020-11-10

    • "Truth is in the extremes. There is no noise in the far tails." - Nassim Taleb
    • What is the implication here for quant development? It's that P(info) is really high in the tails (of a forecast distribution)! So the tails should not be traded. What is the profitability of positions that are the result of large forecasts? What is the profitability of large positions vs. small positions?
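A sketch of the diagnostic implied by those questions: bucket forecasts by magnitude and compare realized P&L per bucket, to see whether the largest forecasts actually pay. The data is simulated with a toy ground truth in which the signal decays in the tails; nothing here refers to a real dataset.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
forecast = rng.normal(size=n)
# toy ground truth in which the forecast works in the belly but decays in the tails
realized = 0.05 * np.clip(forecast, -1.5, 1.5) + rng.normal(0, 1, n)

pnl = forecast * realized                      # P&L from sizing positions with the forecast
edges = np.quantile(np.abs(forecast), [0.5, 0.8, 0.95])
buckets = np.digitize(np.abs(forecast), edges)
for b, name in enumerate(["small", "medium", "large", "very large"]):
    mask = buckets == b
    print(f"{name:>10} forecasts: mean P&L per trade = {pnl[mask].mean():+.5f}")
```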
  • 2020-10-22

    • From an analogy mentioned by Jacob Kline: quant investing works with pictures of companies that have been taken with a camera. You aren't working with real cats, you're working with pictures of cats. And you're trying to predict the picture, not the cat.
    • Try to predict the picture, the image. The image is the return and the fundamentals and all the other "obvious," typical financial datapoints. The typical data are dependent variables, not independent. They are part of the image, part of the story, along with the future return.
    • And that is the analogy to the camera. Predict the picture. Pictures of cats.
    • Look, prices move. To say that you know how much a price should move based on a particular change in information is one thing. But to relate a particular price move, regardless of size, to a particular change in information is something much less (look for any "reason" or "reason for movements"). Who are we to say that we can guess by how much a price should move according to a particular change in information?
  • 2020-10-07

    • Always start with returns and ignore prices
    • Start with returns and look for things that cause them ("reasons"), using nonlinear methods. Reasons for returns are not linear when put through the lens of humanity/groupthink/emotion, which modulate reasons. This is like p(info) but for all ratios including price, not just SRs.
    • All we know is that there are times of stability, of a particular framing/rationalization/regime, and nonlinear shifts between those regimes, either specific-instrument shifts or cross-sectional regime shifts. Either can be used to explain returns; just don't trust the price either after or before, because those prices can lead to prolonged periods of constant forecasts (if price is used in a ratio) and large static positions as a result.
    • Never use prices as numerator or denominator of a forecast because it implies an implicit dependency on return or price change.
    • And use volume and return to fit (discount/neutralize) NLP sentiment.
  • Inference vs. Prediction

    • "Inference: Use the model to learn about the data generation process."
    • "Prediction: Use the model to predict the outcomes for new data points."
    • "Model interpretability is a necessity for inference"