About theoretical background #4

Closed
franchesoni opened this issue Nov 13, 2018 · 10 comments

Comments

@franchesoni

I don't quite understand how the code calculates the coefficients. In particular, how is the log-likelihood related to the eigenvalues that form the cost function?

I'd be glad if you could provide a reference or explanation

Thanks

@FlorianWilhelm
Collaborator

Hi, you could take a look at the references inside https://cran.r-project.org/web/packages/dse/vignettes/Guide.pdf

@franchesoni
Author

@FlorianWilhelm
I've read the guide you sent, and also the references contained there. There is some part of the S code that is similar to your code, but I could not find (maybe I'm not a good searcher) a theoretical explanation of the log-likelihood calculation.

Probably there is some optimization/statistics theory I don't know, but Gilbert's publications don't seem to have an answer. A cost function built upon the eigenvalues of a matrix of correlations (between variables with no lag) puzzles me.

Thanks for the previous response, I hope you can help me.

@FlorianWilhelm
Collaborator

@franchesoni, note that the product of all eigenvalues of a matrix equals the determinant of that matrix. In the pdf of a multivariate normal distribution you need to calculate the determinant of the covariance matrix. Got it?
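
A quick numerical illustration of that identity (just a NumPy sketch with a toy covariance matrix, not the PyDSE code):

```python
import numpy as np

# Toy symmetric positive-definite covariance matrix, purely for illustration.
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])

eigvals = np.linalg.eigvalsh(cov)

# The product of the eigenvalues equals the determinant ...
assert np.isclose(np.prod(eigvals), np.linalg.det(cov))

# ... so the log-determinant needed in the multivariate normal pdf
# can be computed as the sum of the log-eigenvalues.
assert np.isclose(np.sum(np.log(eigvals)), np.log(np.linalg.det(cov)))
```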

@franchesoni
Author

franchesoni commented Nov 15, 2018

I got that. This is what I understood:

I should assume that the residuals are independent and identically distributed according to a multivariate normal distribution with zero mean.
Then I can write down the negative log likelihood and identify the terms 'like1' and 'const' of your code.
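
For concreteness, this is the negative log-likelihood I have in mind (my own notation, assuming i.i.d. zero-mean Gaussian residuals; which of these terms is 'const' and which is 'like1' in the code is my guess, not something I read off the implementation): $-\log L = \frac{n p}{2}\log(2\pi) + \frac{n}{2}\log\det(\mathrm{cov}) + \frac{1}{2}\sum_{i=1}^{n}\mathrm{res}_i^T\,\mathrm{cov}^{-1}\,\mathrm{res}_i$, with $n$ = sampleT samples and $p$ variables.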

It's the other term that I don't understand yet. Somehow (lack of LaTeX here), res_i^T cov^{-1} res_i is the same as len(s).

Could you clarify that last step?
Thanks!

@FlorianWilhelm
Collaborator

FlorianWilhelm commented Nov 15, 2018

It's actually only a bit of linear algebra. Just keep in mind that cov = res^t * res / sampleT, do some transformations, and you arrive at the number of non-zero eigenvalues, which is len(s). Got it?

@franchesoni
Author

Thanks for the response!

Unfortunately, it does not seem immediate at all to me (and I've tried!). I would appreciate it if you could write out that bit of linear algebra I'm lacking. Sorry for the nitpicking, it's the final step!

@FlorianWilhelm
Collaborator

Maybe I underestimated this... it was quite some time ago that I programmed it, and I also peeked a bit into the implementation of DSE. I would say that in general res_i^T cov^{-1} res_i does not equal len(s), as can easily be seen by constructing a counterexample. So maybe it's just an approximation if you sum over all terms.

Also, let's assume we only have a univariate time series, so cov^{-1} is just a scalar, which only equals 1 = len(s) if the residuals have unit variance. In this case the sum over all res_i^T cov^{-1} res_i could be approximated by sampleT * len(s) due to the definition of cov. In the general case, where cov != 1 for p = 1, or where we have eigenvalues not equal to 1 for p > 1, this is not the case. At least those are my findings after putting an hour of thought into it.

Could you tell me a bit more about your motivation for wanting to know this? Is this some student work, or why are you so eager to know this technical detail? Have you also tried to contact the original author of DSE? In general, PyDSE is more of a prototype and is not actively developed anymore...

@franchesoni
Author

@FlorianWilhelm, thanks for taking the time. I appreciate it.

My main motivation, aside from learning ambitions, comes from my job. I work at the Solar Energy Lab on forecasting through time series analysis, and I'm exploring the performance of ARMA models (right now I'm comparing static digital filtering with statsmodels' Kalman filtering approach).

As I don't have anyone double-checking my work, I can't copy or implement things without making sure that they are correct (not that I doubt you, it's just that I can't leave in giant bugs or even think about publishing unfounded research). Nevertheless, your code is great, just what I need.

As I found analytically, when p = 1 and the residuals are non-zero, the equality holds (the squared L2 norms cancel out, leaving only sampleT). As you said, the general case is not that obvious (at least to me). I will continue asking elsewhere, as you suggest, confident in the correctness of the method (I've made a few tests).
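
In symbols (my own sketch of the univariate case, with $\mathrm{cov} = \frac{1}{n}\sum_{j=1}^{n}\mathrm{res}_j^2$ and $n$ = sampleT): $\sum_{i=1}^{n}\frac{\mathrm{res}_i^2}{\mathrm{cov}} = \frac{\sum_{i=1}^{n}\mathrm{res}_i^2}{\frac{1}{n}\sum_{j=1}^{n}\mathrm{res}_j^2} = n$.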

Thanks for answering all of it! Cheers

@FlorianWilhelm
Collaborator

FlorianWilhelm commented Nov 20, 2018

How about this, @franchesoni? It's best if you copy it into some LaTeX-aware editor.

We have for $r\in\mathcal{R}^{n\times p}$ that $\frac{r^t r}{n} = \mathrm{cov}\in\mathcal{R}^{p\times p}$, with $n$ = sampleT, and now we want to calculate $\sum_{i=1}^{n}r_i\,\mathrm{cov}^{-1}r_i^t$, where $r_i\in\mathcal{R}^{1\times p}$ denotes the $i$-th row of $r$.

Putting everything together, we have that $\sum_{i=1}^{n}r_i\,\mathrm{cov}^{-1}r_i^t = \sum_{i,j=1}^{n,p}\left(r\,\mathrm{cov}^{-1}\circ r\right)_{i,j}$, where $\circ$ is the Hadamard product. Since $\sum_{i,j}(A\circ B)_{i,j}=\mathrm{trace}\left(B^{\mathrm{T}}A\right)$, we have $\sum_{i,j=1}^{n,p}\left(r\,\mathrm{cov}^{-1}\circ r\right)_{i,j}=\mathrm{trace}\left(r^t r\,\mathrm{cov}^{-1}\right)=\mathrm{trace}\left(n\,\mathrm{cov}\cdot\mathrm{cov}^{-1}\right)=n\cdot p$. For the degenerate case we take for $p$ the number of non-zero eigenvalues.
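
And a minimal NumPy check of the non-degenerate case (a sketch with synthetic residuals, not the PyDSE code itself):

```python
import numpy as np

np.random.seed(0)
n, p = 200, 3  # n plays the role of sampleT

# Synthetic residuals, purely for illustration.
r = np.random.randn(n, p)

cov = r.T.dot(r) / n
cov_inv = np.linalg.inv(cov)

# sum_i r_i cov^{-1} r_i^t  ==  trace(r^t r cov^{-1})  ==  n * p
quad_sum = sum(row.dot(cov_inv).dot(row) for row in r)
print(quad_sum, np.trace(r.T.dot(r).dot(cov_inv)), n * p)
assert np.isclose(quad_sum, n * p)
```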

@franchesoni
Author

Great, @FlorianWilhelm! It was just some linear algebra indeed.

About the degenerate case, I asked on Cross Validated and there is an intuitive explanation that relates to PCA. Basically, when some eigenvalue of the covariance is negligible, we can use the principal component representation of the data and arrive at the case you just nicely covered.
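
A small NumPy sketch of that intuition (my own, using the pseudo-inverse as a stand-in for "invert only along the principal components"; I have not checked this against the PyDSE code):

```python
import numpy as np

np.random.seed(1)
sampleT, p, k = 200, 3, 2  # covariance of rank k < p

# Rank-deficient residuals: 3 observed columns driven by only 2 independent directions.
z = np.random.randn(sampleT, k)
mixing = np.random.randn(k, p)
res = z.dot(mixing)

cov = res.T.dot(res) / sampleT
eigvals = np.linalg.eigvalsh(cov)
nonzero = np.sum(eigvals > 1e-10)  # number of non-negligible eigenvalues (= k)

# Pseudo-inverse: inverts the covariance on its principal components only.
cov_pinv = np.linalg.pinv(cov)
quad_sum = sum(r.dot(cov_pinv).dot(r) for r in res)

# The quadratic-form sum is sampleT times the number of non-zero eigenvalues.
print(nonzero, quad_sum, sampleT * nonzero)
assert np.isclose(quad_sum, sampleT * nonzero)
```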

I've not done the math yet, but the intuition is there. Thanks for the very helpful responses!
