# JCAMP-DX and ASDF
JCAMP-DX uses some unusual compression schemes, which are investigated in this notebook. They are described in *JCAMP-DX: A Standard Form for Exchange of Infrared Spectra in Computer Readable Form*:
>(5.2) DATA COMPRESSION. TABULAR DATA can be compressed by (1) converting decimals to integers, (2) expressing data as differences between adjacent points, (3) combining duplicate ordinates, and (4) replacing leading digits by pseudo-digits which represent delimiter, sign, and the digit itself (Table VII). Subsequent digits of a multidigit number are standard ASCII digits.
It is not necessary to specify ASDF forms separately, because the first character of each numerical value contains this information. A single software procedure can be written to decode any combination of the ASDF forms described below. The compressed forms can be encoded or decoded faster than ASCII digits alone, and also faster than most other forms of compressed data.

In this tutorial we will investigate all the data compression strategies described in the paper.

### ASCII FREE FORMAT NUMERIC
This is the simplest way to store uncompressed numerical values. It is defined as follows:

>(5.3) AFFN. ASDF processors should accept ASCII FREE FORMAT NUMERIC _[is similar to free-form input of BASIC. A field which starts with a +, -, decimal point, or digit is treated as a numeric field and converted to the internal form of the target computer. E is the only other allowed character. A numeric field is terminated by E, comma, or blank. If E is followed immediately by either + or - and a two- or three-digit integer, it gives the power of 10 by which the initial field must be multiplied.]_ to simplify user generation of JCAMP-DX files by systems which are not supported by manufacturers.

The number-parsing rules of most programming languages already accept this format. In Python, for example:

In [2]:
eval(".34,66E-2 ")

(0.34, 0.66)

In [3]:
FIXform = "1 2 3 3 2 1 0 -1 -2 -3"
eval(FIXform.replace(" ",","))

(1, 2, 3, 3, 2, 1, 0, -1, -2, -3)
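Because `eval` executes whatever Python expression it is given, it is risky on files from unknown sources. The following is a minimal sketch of an explicit alternative, assuming fields are separated only by commas or whitespace; the helper name `parse_affn` is hypothetical and not part of the paper or of any library.

```python
import re

def parse_affn(field_text):
    """Split an AFFN line on commas/whitespace and convert each field to float.

    Assumption: every field is a plain decimal number with an optional
    exponent (e.g. "66E-2"), which Python's float() parses directly.
    """
    tokens = [t for t in re.split(r"[,\s]+", field_text.strip()) if t]
    return [float(t) for t in tokens]

print(parse_affn(".34,66E-2 "))
print(parse_affn("1 2 3 3 2 1 0 -1 -2 -3"))
```

Unlike the `eval` version, this returns floats for every field, so integer inputs come back as `1.0`, `2.0`, and so on.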

### PACKED Form (PAC)
>(5.4) PACKED Form (PAC). Adjacent values are separated by + or - or blank.

This form can be easily handled with simple replace and split operations:

In [4]:
startingPACstring = "1000+2000-2001+2002 2003 2003 2003"
list(map(float,startingPACstring.replace("+"," +").replace("-"," -").split()))

[1000.0, 2000.0, -2001.0, 2002.0, 2003.0, 2003.0, 2003.0]

Here we convert FIX form to PAC form:

In [5]:
PACform1 = FIXform.replace(" -","-").replace(" ","+")
PACform1

'1+2+3+3+2+1+0-1-2-3'

In [6]:
PACform2 = FIXform.replace(" -","-")
PACform2

'1 2 3 3 2 1 0-1-2-3'
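As a quick check that both PAC variants carry the same information, the sketch below decodes them back to integers with a single regular expression. It assumes integer values without exponents; the function name `pac_decode` is hypothetical.

```python
import re

def pac_decode(pac_text):
    """Decode a PAC line: each value is an optional sign followed by digits.

    Assumption: values are integers (no decimal point, no exponent).
    """
    return [int(tok) for tok in re.findall(r"[+-]?\d+", pac_text)]

expected = [1, 2, 3, 3, 2, 1, 0, -1, -2, -3]
print(pac_decode("1+2+3+3+2+1+0-1-2-3") == expected)
print(pac_decode("1 2 3 3 2 1 0-1-2-3") == expected)
```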

### SQUEEZED Form (SQZ)
>(5.5) SQUEEZED Form (SQZ). Delimiter, leading digit, and sign are replaced by a pseudo-digit from Table VII, lines 2,3.

For this form we first have to rebuild lines 2 and 3 of Table VII of the paper, which can be done with the following code:

In [7]:
import string
convstr = string.ascii_lowercase[:9][::-1] + "@"+ string.ascii_uppercase[:9]
num = list(range(-9,10))
SQZDIGIT = dict(zip(num,convstr))
SQZDIGIT

{-9: 'i',
 -8: 'h',
 -7: 'g',
 -6: 'f',
 -5: 'e',
 -4: 'd',
 -3: 'c',
 -2: 'b',
 -1: 'a',
 0: '@',
 1: 'A',
 2: 'B',
 3: 'C',
 4: 'D',
 5: 'E',
 6: 'F',
 7: 'G',
 8: 'H',
 9: 'I'}

We test the result on the string from the original paper in FIX form. Note that in all the following steps the first number of each line remains unchanged; this is not well described in the paper.

In [8]:
newstr = ""
for ind,i in enumerate(FIXform.split()):
    if ind == 0:
        # note that the first digit remains unchanged
        newstr+= i
    else:
        if i.startswith("-"):
            rep,rest = int(i[:2]),i[2:]
        else:
            rep,rest = int(i[0]),i[1:]
        newstr+= SQZDIGIT[rep]+rest
newstr

'1BCCBA@abc'
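Going the other way is a useful sanity check. The sketch below decodes an SQZ line back to integers, following this notebook's convention that the first value may appear as plain digits. The name `sqz_decode` is hypothetical, and integer values are assumed.

```python
import re

def sqz_decode(text):
    """Decode an SQZ line into a list of integers.

    A token is either a pseudo-digit (@, A-I, a-i) followed by ordinary
    digits, or a plain signed integer (used here for the first value).
    """
    values = []
    for tok in re.findall(r"[@A-Ia-i]\d*|[+-]?\d+", text):
        head, rest = tok[0], tok[1:]
        if head == "@":                      # leading digit 0
            values.append(int("0" + rest))
        elif "A" <= head <= "I":             # positive leading digit 1..9
            values.append(int(str(ord(head) - ord("A") + 1) + rest))
        elif "a" <= head <= "i":             # negative leading digit 1..9
            values.append(-int(str(ord(head) - ord("a") + 1) + rest))
        else:                                # plain AFFN-style integer
            values.append(int(tok))
    return values

print(sqz_decode("1BCCBA@abc"))
```

Multi-digit values work the same way: `A23` decodes to 123 and `b5` to -25, because only the leading digit is replaced by a pseudo-digit.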

### DIFFERENCE Form (DIF)
>(5.6) DIFFERENCE Form (DIF). Delimiter, leading digit, and sign of the difference between adjacent values are represented by a pseudo-digit from Table VII, lines 4,5. The abscissa at the start of each line is in AFFN form, and the first ordinate is an actual value in SQZ form. Remaining ordinates are differences, in DIF form.

In this case, too, we need to build a dictionary, this time with the values from lines 4 and 5 of Table VII of the paper:

In [9]:
import string
import numpy as np

In [10]:
convstr = string.ascii_lowercase[9:18][::-1] + "%"+ string.ascii_uppercase[9:18]
num = list(range(-9,10))
DIFDIGIT = dict(zip(num,convstr))
DIFDIGIT

{-9: 'r',
 -8: 'q',
 -7: 'p',
 -6: 'o',
 -5: 'n',
 -4: 'm',
 -3: 'l',
 -2: 'k',
 -1: 'j',
 0: '%',
 1: 'J',
 2: 'K',
 3: 'L',
 4: 'M',
 5: 'N',
 6: 'O',
 7: 'P',
 8: 'Q',
 9: 'R'}

We start from the `FIXform` string and convert it to a list of integers:

In [11]:
FIXform = list(eval(FIXform.replace(" ",",")))
FIXform

[1, 2, 3, 3, 2, 1, 0, -1, -2, -3]

Then we compute the differences between adjacent values:

In [12]:
diff = [FIXform[n]-FIXform[n-1] for n in range(1,len(FIXform))]
diff

[1, 1, 0, -1, -1, -1, -1, -1, -1]

It is not clear from the definition, but the example in Table VIIb shows that the first value is kept as plain digits (not in SQZ form, as the definition states):

In [13]:
# The line starts with the first actual value, written as plain digits as in
# Table VIIb (even though "the first ordinate is an actual value in SQZ form").
newstr = str(FIXform[0])
for d in diff:
    d = str(d)
    # split the signed leading digit from the remaining digits
    if d.startswith("-"):
        rep, rest = int(d[:2]), d[2:]
    else:
        rep, rest = int(d[0]), d[1:]
    newstr += DIFDIGIT[rep] + rest
newstr

'1JJ%jjjjjj'
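To confirm that the DIF line still contains the original data, it can be decoded by summing the differences cumulatively. This is a minimal sketch under the same convention as above (first value as plain digits, integer data); the name `dif_decode` is hypothetical.

```python
import re
from itertools import accumulate

DIF_POS = "JKLMNOPQR"   # leading digits +1..+9
DIF_NEG = "jklmnopqr"   # leading digits -1..-9

def dif_decode(text):
    """Decode a DIF line: a plain first value followed by DIF tokens."""
    first = re.match(r"[+-]?\d+", text)
    diffs = []
    for tok in re.findall(r"[%J-Rj-r]\d*", text[first.end():]):
        head, rest = tok[0], tok[1:]
        if head == "%":                       # difference with leading digit 0
            diffs.append(int("0" + rest))
        elif head in DIF_POS:
            diffs.append(int(str(DIF_POS.index(head) + 1) + rest))
        else:
            diffs.append(-int(str(DIF_NEG.index(head) + 1) + rest))
    # running sum: value[n] = value[n-1] + diff[n]
    return list(accumulate([int(first.group())] + diffs))

print(dif_decode("1JJ%jjjjjj"))
```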

### DIFDUP Form
Finally, the DIFDUP form builds on duplicate suppression, defined as follows:

>(5.9) DUPLICATE SUPPRESSION (DUP). When two or more adjacent Y-values in a table are identical, all but the first are replaced by a duplicate-count whose initial digit is a pseudo-digit from Table VII line 6. Duplicate-count is the number of successive identical table values, including the first. It can be used with all ASDF forms. Count for four identical values is 4, i.e., 50 50 50 50 becomes 50V in DUP form.
>(5.10) DIFDUP Form. When duplicate suppression is combined with difference form, the duplicate count is obtained by counting identical differences. The above example, 50 50 50 50, becomes 50% % % in DIF form, and 50 % U in DIFDUP form.
DIFDUP form takes the least space and is processed most rapidly. However, spectral data in this form are not easily inspected visually.

In [14]:
convstr = string.ascii_uppercase[18:]+"s"
num = list(range(1,10))
POSITIVEDUP = dict(zip(num,convstr))
POSITIVEDUP

{1: 'S', 2: 'T', 3: 'U', 4: 'V', 5: 'W', 6: 'X', 7: 'Y', 8: 'Z', 9: 's'}
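The rule in (5.9) counts whole values rather than single characters, so a token-based version is sketched here with `itertools.groupby`. It assumes the input has already been split into tokens (for example `"50"` or `"%"`); the names `dup_count` and `dup_compress` are hypothetical.

```python
from itertools import groupby

DUP_CHARS = "STUVWXYZs"  # pseudo-digits for leading digits 1..9

def dup_count(n):
    """Write a count with its leading digit replaced by a DUP pseudo-digit."""
    digits = str(n)
    return DUP_CHARS[int(digits[0]) - 1] + digits[1:]

def dup_compress(tokens):
    """Collapse runs of identical tokens into the token plus a DUP count."""
    out = []
    for tok, run in groupby(tokens):
        n = len(list(run))
        out.append(tok)
        if n > 1:
            out.append(dup_count(n))
    return out

print("".join(dup_compress(["50", "50", "50", "50"])))  # the paper's 50V example
print("".join(dup_compress(["50", "%", "%", "%"])))
```

Working on tokens means that multi-digit values and counts above 9 (written as a pseudo-digit followed by ordinary digits, e.g. `S2` for 12) are handled in the same way as single characters.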

In [18]:
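last = ""
counter = 1
result = ""
# Note: this walks newstr character by character, so it assumes that every
# value is a single character and that no run is longer than 9.
for i in newstr:
    if i == last:
        counter += 1
    else:
        # flush the previous run, adding a DUP count only if it repeats
        result += last + (POSITIVEDUP[counter] if counter > 1 else "")
        last = i
        counter = 1
# flush the final run, whether or not it repeats
result += last + (POSITIVEDUP[counter] if counter > 1 else "")
result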

'1JT%jX'
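The reverse step, expanding the DUP counts, recovers the DIF line. The sketch below assumes a well-formed line in which every DUP count follows a value; the name `dup_expand` is hypothetical.

```python
import re

DUP_CHARS = "STUVWXYZs"  # pseudo-digits for leading digits 1..9

def dup_expand(text):
    """Replace each DUP count by repetitions of the preceding token."""
    out = []
    # token order: DUP count, SQZ/DIF value, plain signed integer
    for tok in re.findall(r"[STUVWXYZs]\d*|[@%A-Ra-r]\d*|[+-]?\d+", text):
        if tok[0] in DUP_CHARS:
            count = int(str(DUP_CHARS.index(tok[0]) + 1) + tok[1:])
            # the count includes the value already emitted
            out.extend([out[-1]] * (count - 1))
        else:
            out.append(tok)
    return "".join(out)

print(dup_expand("1JT%jX"))
```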
