
Support for SPC files from Shimadzu instruments #102

Open
ximeg opened this issue Feb 7, 2020 · 11 comments
Labels
Topic: file IO Input and output (read/write) related functions Type: enhancement 🎈 asking for a new feature.


@ximeg
Collaborator

ximeg commented Feb 7, 2020

read.spc does not work with SPC files from Shimadzu spectrometers, because they use a proprietary binary format. The error message that gets displayed is confusing.

  1. I suggest detecting this file format (the first four bytes are D0 CF 11 E0) and displaying an error message that says 'Support for the Shimadzu SPC file format (OLE CF) is not yet implemented'

  2. After that we can try to implement an import filter for these files. There is experimental support for the Shimadzu SPC format in the spc Python module, which we can look at. There is also an online converter for Shimadzu SPC files – I emailed the author to ask about the availability of its source code.
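The detection proposed in point 1 can be sketched with a simple magic-byte check. This is a minimal illustration in Python, not hyperSpec code; the function names and the wrapper are hypothetical (OLE CF files actually carry an 8-byte signature, of which D0 CF 11 E0 are the first four bytes):

```python
# Minimal sketch of the proposed detection. OLE CF containers
# (the format Shimadzu uses) start with these 8 magic bytes;
# D0 CF 11 E0 are the first four mentioned above.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def is_ole_cf(path):
    """Return True if the file starts with the OLE CF signature."""
    with open(path, "rb") as f:
        return f.read(8) == OLE_MAGIC

def check_spc(path):
    # Hypothetical wrapper: raise the clearer error suggested above
    # instead of a confusing parse failure further down.
    if is_ole_cf(path):
        raise NotImplementedError(
            "Support for Shimadzu SPC file format (OLE CF) "
            "is not yet implemented"
        )
```

In hyperSpec itself the equivalent check would of course be written in R (e.g. with readBin), but the logic is the same.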

Attached are four SPC files from our Shimadzu UV-2600 spectrometer:
Shimadzu_UV-2600.zip

@ximeg ximeg added the Type: enhancement 🎈 asking for a new feature. label Feb 7, 2020
@ximeg ximeg self-assigned this Feb 7, 2020
@uri-t

uri-t commented Feb 10, 2020

I wrote the SPC converter @ximeg mentioned above. I've put the source code up here. In this repo, getSpectrum.py, consts.py, and sample_client.py are the working code for the online converter (kind of a mess, not very well documented). There are also some Ruby files, which I wrote more recently to get back up to speed on the file format.

The basic idea of the format is that it breaks the file into 512-byte sectors. These are organized into streams via allocation tables. One of these streams is filled with directory entries, which organize the information in the file. The directory entries form a tree structure, and some contain pointers out to the data they describe. To extract the spectra, we basically find the directory entry containing the data (e.g. X Data.1) and follow the pointer to the data. This diagram illustrates the general structure, with the streams, allocation tables, and directories.

I still haven't quite remembered how I got to the data itself from the directories, but I'll update when I do. For now, here's some more documentation on the file format in general. I'd recommend the first one in particular.
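The sector layout described above is declared in a fixed header at the start of the file. Here's a minimal sketch of reading the relevant header fields with Python's struct module (the offsets follow Microsoft's published [MS-CFB] compound file specification; this is an illustration, not uri-t's reader):

```python
import struct

def parse_cfb_header(header: bytes) -> dict:
    """Parse the OLE CF (compound file) header fields needed to
    locate sectors and the directory stream.
    Field offsets per the [MS-CFB] specification."""
    assert header[:8] == bytes.fromhex("D0CF11E0A1B11AE1"), "not an OLE CF file"
    # Sector shift: sector size is 2**shift bytes (9 -> the 512-byte
    # sectors mentioned above).
    sector_shift, = struct.unpack_from("<H", header, 30)
    # How many sectors hold the FAT (the allocation table).
    num_fat_sectors, = struct.unpack_from("<I", header, 44)
    # Sector number where the directory-entry stream starts.
    first_dir_sector, = struct.unpack_from("<I", header, 48)
    return {
        "sector_size": 1 << sector_shift,
        "num_fat_sectors": num_fat_sectors,
        "first_dir_sector": first_dir_sector,
    }
```

From there, following a stream means chasing sector numbers through the FAT, one 32-bit entry per sector, until an end-of-chain marker.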

@ximeg
Collaborator Author

ximeg commented Feb 10, 2020

Hi @uri-t ,
many thanks for this information! It helped me understand how the file is structured and how your script works.

OLE CF seems to be quite a complex file format. Ideally, we would find a generic open-source OLE reader for R and adapt it to import Shimadzu SPC files; that would save us a ton of effort.
I also stumbled upon oletools, a collection of Python tools for working with OLE files, which lets us dissect the SPC file and look inside.

Here is an example output of running oledir:
$ oledir 112.spc
oledir 0.54 - http://decalage.info/python/oletools
OLE directory entries in file 112.spc:
----+------+-------+----------------------+-----+-----+-----+--------+------
id  |Status|Type   |Name                  |Left |Right|Child|1st Sect|Size  
----+------+-------+----------------------+-----+-----+-----+--------+------
0   |<Used>|Root   |Root Entry            |-    |-    |1    |6       |2560  
1   |<Used>|Stream |Contents              |2    |3    |-    |5       |4     
2   |<Used>|Storage|Version               |-    |-    |10   |0       |0     
3   |<Used>|Stream |\x05SummaryInformation|5    |4    |-    |2       |132   
4   |<Used>|Stream |DataStorageHeaderInfo |-    |-    |-    |1       |4     
5   |<Used>|Storage|DataStorage1          |-    |6    |7    |0       |0     
6   |<Used>|Stream |NumberofSaved         |-    |-    |-    |0       |4     
7   |<Used>|Stream |DataStorageName       |8    |-    |-    |6       |15    
8   |<Used>|Storage|DataSetGroup          |-    |-    |12   |0       |0     
9   |<Used>|Stream |CLSID                 |-    |-    |-    |9       |16    
10  |<Used>|Stream |Module Version        |9    |11   |-    |8       |8     
11  |<Used>|Stream |File Format Version   |-    |-    |-    |7       |4     
12  |<Used>|Stream |DataSetGroupHeaderInfo|13   |-    |-    |A       |4     
13  |<Used>|Storage|DataSet1              |-    |-    |14   |0       |0     
14  |<Used>|Stream |DataSetHeaderInfo     |15   |16   |-    |B       |125   
15  |<Used>|Storage|MethodStorage         |-    |-    |32   |0       |0     
16  |<Used>|Storage|DataSpectrumStorage   |19   |18   |28   |0       |0     
17  |<Used>|Storage|DataPeakPickStorage   |-    |-    |27   |0       |0     
18  |<Used>|Storage|DataPointPickStorage  |-    |-    |25   |0       |0     
19  |<Used>|Storage|DataAreaCalcStorage   |20   |17   |24   |0       |0     
20  |<Used>|Storage|DataHistoryStorage    |-    |-    |23   |0       |0     
21  |<Used>|Stream |HistoryVersion        |-    |-    |-    |F       |4     
22  |<Used>|Stream |HistoryHeader         |-    |-    |-    |E       |4     
23  |<Used>|Stream |DataSetHistory        |22   |21   |-    |D       |63    
24  |<Used>|Stream |AreaCalcRegions       |-    |-    |-    |10      |50    
25  |<Used>|Stream |PointPickData         |-    |26   |-    |12      |4     
26  |<Used>|Stream |PointPickColWidths    |-    |-    |-    |11      |16    
27  |<Used>|Stream |PeakPickPAV           |-    |-    |-    |13      |288   
28  |<Used>|Stream |Version               |30   |29   |-    |18      |4     
29  |<Used>|Storage|DataHeader            |-    |-    |40   |0       |0     
30  |<Used>|Storage|Data                  |-    |-    |39   |0       |0     
31  |<Used>|Stream |Contents              |-    |-    |-    |25      |60    
32  |<Used>|Stream |PageTexts0            |31   |34   |-    |22      |171   
33  |<Used>|Stream |PageTexts1            |-    |-    |-    |20      |112   
34  |<Used>|Stream |PageTexts2            |33   |35   |-    |1C      |245   
35  |<Used>|Stream |PageTexts3            |-    |36   |-    |1B      |53    
36  |<Used>|Stream |PageTexts4            |-    |-    |-    |19      |95    
37  |<Used>|Stream |Data Header.1         |-    |-    |-    |26      |8     
38  |<Used>|Stream |X Data.1              |-    |-    |-    |2A      |11208 
39  |<Used>|Stream |Y Data.1              |38   |37   |-    |14      |11208 
40  |<Used>|Stream |Header Info           |-    |-    |-    |27      |61    
41  |unused|Empty  |                      |-    |-    |-    |0       |0     
42  |unused|Empty  |                      |-    |-    |-    |0       |0     
43  |unused|Empty  |                      |-    |-    |-    |0       |0     
----+----------------------------+------+--------------------------------------
id  |Name                        |Size  |CLSID                                 
----+----------------------------+------+--------------------------------------
0   |Root Entry                  |-     |                                      
3   |\x05SummaryInformation      |132   |                                      
1   |Contents                    |4     |                                      
5   |DataStorage1                |-     |                                      
8   |  DataSetGroup              |-     |                                      
13  |    DataSet1                |-     |7FAC4E0B-5987-11D0-954C-0800096B7523  
19  |      DataAreaCalcStorage   |-     |60F779CB-D341-11CF-91E2-0800096BCA1F  
24  |        AreaCalcRegions     |50    |                                      
20  |      DataHistoryStorage    |-     |                                      
23  |        DataSetHistory      |63    |                                      
22  |        HistoryHeader       |4     |                                      
21  |        HistoryVersion      |4     |                                      
17  |      DataPeakPickStorage   |-     |D069DE03-FFBB-11CF-A7AD-0800096A3C5E  
27  |        PeakPickPAV         |288   |                                      
18  |      DataPointPickStorage  |-     |2303D603-1C5B-11D0-9649-0800096BAA1D  
26  |        PointPickColWidths  |16    |                                      
25  |        PointPickData       |4     |                                      
14  |      DataSetHeaderInfo     |125   |                                      
16  |      DataSpectrumStorage   |-     |1851B2E3-83F4-11CF-BD45-0800096B1920  
30  |        Data                |-     |                                      
37  |          Data Header.1     |8     |                                      
38  |          X Data.1          |11208 |                                      
39  |          Y Data.1          |11208 |                                      
29  |        DataHeader          |-     |                                      
40  |          Header Info       |61    |                                      
28  |        Version             |4     |                                      
15  |      MethodStorage         |-     |                                      
31  |        Contents            |60    |                                      
32  |        PageTexts0          |171   |                                      
33  |        PageTexts1          |112   |                                      
34  |        PageTexts2          |245   |                                      
35  |        PageTexts3          |53    |                                      
36  |        PageTexts4          |95    |                                      
12  |    DataSetGroupHeaderInfo  |4     |                                      
7   |  DataStorageName           |15    |                                      
4   |DataStorageHeaderInfo       |4     |                                      
6   |NumberofSaved               |4     |                                      
2   |Version                     |-     |                                      
9   |  CLSID                     |16    |                                      
11  |  File Format Version       |4     |                                      
10  |  Module Version            |8     |                                      
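Each row in the oledir listing above comes from a 128-byte directory entry in the directory stream. A minimal sketch of decoding one such entry into the columns oledir prints (offsets per the [MS-CFB] specification; this is illustrative, not oletools code):

```python
import struct

TYPE_NAMES = {0: "Empty", 1: "Storage", 2: "Stream", 5: "Root"}

def parse_dir_entry(entry: bytes) -> dict:
    """Decode one 128-byte OLE CF directory entry into the fields
    shown by oledir: name, type, left/right/child ids, first sector, size."""
    # Name: up to 64 bytes of UTF-16-LE; the stored length includes
    # the terminating NUL character.
    name_len, = struct.unpack_from("<H", entry, 64)
    name = entry[:max(name_len - 2, 0)].decode("utf-16-le")
    obj_type = entry[66]
    # Sibling/child entry ids; 0xFFFFFFFF means "none" (shown as '-').
    left, right, child = struct.unpack_from("<III", entry, 68)
    first_sector, = struct.unpack_from("<I", entry, 116)
    size, = struct.unpack_from("<Q", entry, 120)
    return {
        "name": name,
        "type": TYPE_NAMES.get(obj_type, str(obj_type)),
        "left": left, "right": right, "child": child,
        "first_sector": first_sector,
        "size": size,
    }
```

For example, entry 38 above (X Data.1, first sector 2A, size 11208) decodes exactly this way.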

@uri-t

uri-t commented Feb 11, 2020

I definitely agree that something off-the-shelf would be ideal. I did a bit of looking and couldn't find any R packages for this :( -- I tried the antiword library from CRAN, but alas, it only accepts OLE files that are actually Word documents. I'm guessing you might have better luck finding something, given that you're not a stranger to R like me.

If there ends up not being anything available in R would it be possible to wrap the python library you linked to so it can be used in R? This in particular looks pretty promising.

@ximeg
Collaborator Author

ximeg commented Feb 14, 2020

Thanks for the suggestions; we have to think about how this can be solved. For now I just call the Python bulk converter to make CSV files from the SPC files, and then I read those into R. We can always call Python from R, that is not a problem. We can even pass data back and forth between Python and R without creating any intermediate files (thanks to the rpy2 module). However, this solution assumes that the end user has Python installed on their machine.

So far I see several options for how we can address this.

Option 1: add Python script

  • We include uri-t's short Python script that does the conversion in hyperSpec
  • If the user tries to import a Shimadzu SPC file, we check whether Python can be invoked. If yes, we use the Python script to convert the SPC file. If no Python interpreter is available, we show an error message.

This solution looks kind of dirty, but it requires minimal effort.

Option 2: translate Python script into R

Another option would be to rewrite uri-t's Python script in R. I think this would take about a week, including writing unit tests, Roxygen docs, and vignettes. The downside is that the script looks a bit hacky, but it does work, at least with files from the UV-2600. I am not sure whether it supports all possible combinations of Shimadzu parameters and metadata, but it is a good starting point anyway.

Option 3: Implement a generic OLE CF file reader

We could implement a generic R reader for OLE CF files as a separate project and use it to import Shimadzu SPC files. This would benefit the whole R community, not only hyperSpec users. However, it is a tremendous amount of work, and I believe we don't have the resources for that.

We could go with option 1 and replace it later with option 2. @cbeleites, do you have any opinion on this topic?

@uri-t

uri-t commented Feb 14, 2020

A small update from my end which might inform this decision: over the past few days I've written part of a more general OLE reader. Right now it's relatively short and can do most of the things we'd need from a generic parser -- building the directory tree and retrieving the data corresponding to each entry. Based on this, it seems like building a generic OLE file reader might not be so bad -- not much more effort than translating the existing script, at least.

More generally, now that I have a good handle on the format, I'll upload a more complete explanation either later tonight or tomorrow -- it might be useful for either option 2 or 3.
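The "building the directory tree" step can be sketched as follows: entries at one level are linked by their left/right ids into a binary tree of siblings, and each storage's child id descends one level. This sketch (not uri-t's reader; entry dicts are assumed to look like the parsed directory entries above) reconstructs full paths like the indented oledir listing:

```python
NOSTREAM = 0xFFFFFFFF  # "no entry" marker in sibling/child fields

def walk_tree(entries, root=0):
    """Yield (path, entry) for every entry reachable from `root`.
    `entries` is a list of dicts with name/left/right/child ids."""
    def siblings(i):
        # In-order walk of the left/right sibling tree at one level.
        if i == NOSTREAM:
            return
        e = entries[i]
        yield from siblings(e["left"])
        yield i
        yield from siblings(e["right"])

    def descend(i, prefix):
        for j in siblings(entries[i]["child"]):
            path = prefix + [entries[j]["name"]]
            yield "/".join(path), entries[j]
            if entries[j]["child"] != NOSTREAM:
                # A storage: recurse into its contents.
                yield from descend(j, path)

    yield from descend(root, [])
```

The in-order sibling walk is why oledir's second table comes out sorted by name within each storage.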

@ximeg
Collaborator Author

ximeg commented Feb 14, 2020

@cbeleites I noticed there is a file Vignettes/fileio/spc.Shimadzu/F80A20-1.SPC, which is neither a Galactic SPC nor an OLE CF file. Do you remember where it comes from and what it could be?

@cbeleites
Owner

cbeleites commented Feb 14, 2020

First of all, @uri-t thanks for helping - your experience with those files is valuable for us!

@ximeg , Yes, I have opinions on the topic :-)

  • Even though dependencies are invitations for other people to break your package, this particular file import filter IMHO is something that should go into its own package, which can then depend on hyperSpec and later maybe on a more general OLE reader. With a small package only for the Shimadzu files, heavy changes or experimental code in that package or in its dependencies will not cause trouble with hyperSpec, while still allowing everyone to try out the Shimadzu import if they want.

  • I have not yet tried to package python code in an R package. Interaction/knitr works fine, but packaging may be a different beast (see also what @bryanhanson wrote at the GSoC discussion)
    reticulate has a vignette/article "Using reticulate in an R Package".

  • A general OLE reader would possibly be good for R, but I definitely don't have the capacity to look after any such package.

  • Vignettes/fileio/spc.Shimadzu/F80A20-1.SPC: this file was sent to me in 2015 by someone from Oslo, as an example spectrum produced by a Shimadzu UV-1800. I could try to contact them if we need details about that file. Back then I did not have any documentation on that file format, so I put it into our collection but did not actually do anything about it.
    I'd have expected it to actually be this OLE CF format...
    file (the Linux command) thinks: F80A20-1.SPC: Applesoft BASIC program data, first line number 2

  • @ximeg: is one of those 4 files sufficient as an example or do they internally have important differences (subformat or whatnot)?

@ximeg
Collaborator Author

ximeg commented Feb 17, 2020

@ximeg: is one of those 4 files sufficient as an example or do they internally have important differences (subformat or whatnot)?

These files have the same internal organization; only the data differs. Yes, one is enough!

@uri-t

uri-t commented Feb 18, 2020

The structure doesn't vary, but when the files get bigger there are a couple of extra cases the code has to handle (since the sector allocation tables fill up and have to expand). There's a hulking 9 MB file here that should work as a test file for these cases.

I've also uploaded a collection of files that people have tried to convert on my website here. This should give us better coverage in case other Shimadzu spectrometers have different directory structures for some reason. Of the 903 files, 341 are OLE files, and 46 are the Applesoft BASIC files that you've seen. I'm not sure what the rest are, but some are likely Galactic SPC files. See file_signatures.txt in this folder for each file's signature.
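Sorting a folder like that by signature can be sketched as follows. The OLE magic is definitive; the Galactic SPC heuristic is an assumption based on the Galactic spec (the second byte, the fversn field, is typically 0x4B, 0x4C, or 0x4D), and anything else is left unclassified:

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Galactic SPC fversn values: new-format LSB, new-format MSB, old format.
# This is a heuristic, not a guaranteed signature.
GALACTIC_FVERSN = {0x4B, 0x4C, 0x4D}

def classify_spc(first_bytes: bytes) -> str:
    """Rough classification of a .spc file from its leading bytes."""
    if first_bytes[:8] == OLE_MAGIC:
        return "OLE CF (Shimadzu)"
    if len(first_bytes) >= 2 and first_bytes[1] in GALACTIC_FVERSN:
        return "Galactic SPC (probably)"
    return "unknown"
```

Running something like this over file_signatures.txt would let us bucket the remaining unidentified files.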

@uri-t

uri-t commented Feb 19, 2020

Also on the issue of test coverage, I've extracted the instrument information for all the OLE files in the folder I linked to above. I realized that @ximeg and I have the same UV-2600 model, but other people have tried the UV-1700, UV-1800, and UV-1900 models, so we have examples from those instruments as well.

The instrument info is in instrument_info.txt in the folder.

@GegznaV
Collaborator

GegznaV commented Aug 8, 2022

I think the discussion should continue here:
