
bug in agilent .uv parser #28

Closed
ethanbass opened this issue Mar 27, 2022 · 9 comments
ethanbass commented Mar 27, 2022

I was investigating the UV parser more and I think there are still some problems. For example, I was trying to import a UV file from my lab and it looks pretty good for about the first 15 minutes, but then the baseline starts going all over the place. Any idea what might be going on? I'm attaching a picture of the entab imported file in black and the CSV I exported from chemstation in blue.
[image: entab import (black) vs. ChemStation CSV export (blue)]

The example file that ships with entab doesn't look too good either:
[image: 280 nm trace from the example file that ships with entab]

Below is the code to reproduce what I did in R. You can find the file I tried to convert and the CSV version here: https://cornell.box.com/v/example-DAD-files
Thanks!
Ethan

library(entab)

# Read the .uv file with entab and reshape to wide format (one column per wavelength)
path <- "~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/ETHAN_01_19_21 2021-01-20 00-27-52/679.D/dad1.uv"
r <- as.data.frame(Reader(path))
ch.entab <- data.frame(tidyr::pivot_wider(r, id_cols = "time",
                       names_from = "wavelength", values_from = "intensity"))

# Read the CSV exported from ChemStation for comparison
ch.csv <- read.csv("~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/export3D/EXPORT3D_ETHAN_01_19_21 2021-01-20 00-27-52/679.CSV",
                   row.names = 1, header = TRUE,
                   fileEncoding = "utf-16", check.names = FALSE)

# Overlay the 280 nm traces: entab import in black, ChemStation CSV in blue
par(mfrow = c(1, 1))
matplot(ch.entab$time, ch.entab[, "X280"], type = "l", ylim = c(-100, 800))
matplot(ch.entab$time, ch.csv[, "280.00000"], type = "l", add = TRUE, lty = 2, col = "blue")
abline(v = 15, col = "red", lty = 3)

# The example file that ships with entab shows the same problem
example_file <- as.data.frame(Reader("~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv"))
df <- data.frame(tidyr::pivot_wider(example_file, id_cols = "time",
                                    names_from = "wavelength", values_from = "intensity"))
matplot(df$time, df$X280, type = "l")
@ethanbass

I tried the Aston parser and it works beautifully!
[image: Aston import matching the ChemStation CSV export]

bovee commented Mar 30, 2022

I'm not sure if this was the issue (I haven't checked the graphs yet), but there's definitely a bug where it was pulling an unsigned int instead of a signed one (fixed in 7b751f5). I vaguely remember a bug like this happening in Aston a long time ago too, so it's possible there's still something else.
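For anyone following along, this class of bug is easy to illustrate: the intensity values in these files are signed, so reading the same bytes as an unsigned integer turns a small negative value into a huge positive one, which is exactly the kind of thing that makes a baseline jump around. A quick sketch in Python (entab itself is Rust; this is just an illustration of the signed/unsigned mix-up, not entab's actual code):

```python
import struct

# Two bytes encoding the value -10 as a big-endian signed 16-bit integer.
raw = struct.pack(">h", -10)

signed = struct.unpack(">h", raw)[0]    # correct interpretation: -10
unsigned = struct.unpack(">H", raw)[0]  # buggy interpretation:   65526

print(signed, unsigned)  # -10 65526
```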


ethanbass commented Mar 30, 2022

Thanks for looking into this. Your example file now seems to be reading correctly, but my file 679.D still has the crazy shifting baseline in both versions (CLI and entab-R). Also, in the R version there seems to be a newly introduced bug where some values that appear to be retention times are making it into the wavelength column (but this doesn't happen in the CLI version).

Also, I don't have benchmarks, but it seems like something you did slowed down the R version considerably. I'm not sure if this could be related to the retention times appearing with the wavelengths. The slowdown only seems to affect the ChemStation UV parser; the MassHunter parser, for example, is working beautifully from what I can tell.

bovee commented Mar 31, 2022

I think the R slowness/bad data is unrelated to the UV parsing, but might be from 622c036? It's extremely weird.

Thank you for the UV data BTW! I took a quick look and I think there are still two things going on:

  1. The values between Aston and Entab start the same, but go off track after the first record, so there's a parsing bug around record lengths I'll try to track down.

  2. Both of their values are (very slightly) different from the CSV. I think there's a multiplier or offset in the header that they need to be corrected by?
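To make point 2 concrete, a correction like that would just be a rescaling step applied after decoding. Here's a hypothetical sketch in Python (the parameter names and the exact formula are assumptions about what the header might carry, not the documented .uv layout):

```python
def correct_intensities(raw_counts, multiplier=1.0, offset=0.0):
    """Rescale raw detector counts using header-derived values.

    raw_counts: integers decoded from the file's delta-encoded records.
    multiplier, offset: hypothetical scale/offset values read from the header.
    """
    return [count * multiplier + offset for count in raw_counts]

# e.g. if the header said counts were stored in units of 1/2000 mAU:
print(correct_intensities([2000, -500, 10000], multiplier=1 / 2000))
# -> [1.0, -0.25, 5.0]
```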

bovee commented Mar 31, 2022

I refactored the UV parser a bit in 14059d2 and I think both of these issues should be fixed (and there should be metadata available on these files now).

I'm still not sure what's happening with the R bindings, but I can futz with it. You might also try deleting the current ones before reinstalling?


ethanbass commented Mar 31, 2022

Awesome, this is great!!! I tried removing the R package before reinstalling as you suggested and it seems to have helped dramatically with the speed. It also seems to have mostly fixed the issue I mentioned with retention times appearing in the wavelengths column (about 9 times out of 10). The weird part (!?) is that this behavior still happens about one time in ten if I repeatedly run the Reader on the same file 🧐 (and this seems to be independent of the file used). Also, I'm pretty confident the speed issue is related to this behavior: it runs much slower on the runs where it ends up producing the wrong values.

@ethanbass

Also, re: metadata, I'm not quite sure what kind of metadata there should be or how to access it?

bovee commented Apr 1, 2022

I opened a new bug (#29) for the retention time crossover issue to track that on its own since it's weird and I don't fully understand it.

Some of the file parsers read additional metadata out (e.g. sample name, operator name, etc.) if the file contains it and I've figured out the format; you can access it with the -m flag on the CLI or, in R, with Reader(path)$metadata().

bovee closed this as completed Apr 1, 2022
@ethanbass

sounds good!!
