# Data Processing

For details on the planned workflow, refer to Notes.

The starting point will be the file *nubase_4.mas20*.

This is a fixed-width data file. The aim here is to parse and process the data for use in simulation.

### The columns are defined below

                                    Data FORMAT 
    column   quantity   format      description
      1: 3   AAA           a3       Mass Number (AAA)
      5: 8   ZZZi          a4       Atomic Number (ZZZ); i=0 (gs); i=1,2 (isomers); i=3,4 (levels); i=5 (resonance); i=8,9 (IAS)
                                    i=3,4,5,6 can also indicate isomers (when more than two isomers are presented in a nuclide)
    12: 16   A El          a5       A Element 
    17: 17   s             a1       s=m,n (isomers); s=p,q (levels); s=r (resonance); s=i,j (IAS); 
                                    s=p,q,r,x can also indicate isomers (when more than two isomers are presented in a nuclide)
    19: 31   Mass #     f13.6       Mass Excess in keV (# from systematics)
    32: 42   dMass #    f11.6       Mass Excess uncertainty in keV (# from systematics)
    43: 54   Exc #      f12.6       Isomer Excitation Energy in keV (# from systematics)
    55: 65   dE #       f11.6       Isomer Excitation Energy uncertainty in keV (# from systematics)
    66: 67   Orig          a2       Origin of Excitation Energy  
    68: 68   Isom.Unc      a1       Isom.Unc = *  (gs and isomer ordering is uncertain) 
    69: 69   Isom.Inv      a1       Isom.Inv = &  (the ordering of gs and isomer is reversed compared to ENSDF) 
    70: 78   T #         f9.4       Half-life (# from systematics); stbl=stable; p-unst=particle unstable
    79: 80   unit T        a2       Half-life unit 
    82: 88   dT            a7       Half-life uncertainty 
    89:102   Jpi */#/T=    a14      Spin and Parity (* directly measured; # from systematics; T=isospin) 
    103:104   Ensdf year   a2       Ensdf update year 
    115:118   Discovery    a4       Year of Discovery 
    120:209   BR           a90      Decay Modes and their Intensities and Uncertainties in %; IS = Isotopic Abundance in %
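
Because the file is fixed width, one option is to slice each line at the column positions listed in the table rather than splitting on spaces. The following is a minimal sketch under that assumption; the short field names and the `parse_fixed_width` helper are illustrative and are not part of the NUBASE file.

```python
# Column spans copied from the table above: (name, first column, last column),
# 1-indexed and inclusive. The field names are hypothetical labels.
FIELDS = [
    ("A", 1, 3),
    ("ZZZi", 5, 8),
    ("A_El", 12, 16),
    ("s", 17, 17),
    ("mass_excess", 19, 31),
    ("d_mass_excess", 32, 42),
    ("exc_energy", 43, 54),
    ("d_exc_energy", 55, 65),
    ("orig", 66, 67),
    ("isom_unc", 68, 68),
    ("isom_inv", 69, 69),
    ("half_life", 70, 78),
    ("half_life_unit", 79, 80),
    ("d_half_life", 82, 88),
    ("jpi", 89, 102),
    ("ensdf_year", 103, 104),
    ("discovery", 115, 118),
    ("br", 120, 209),
]


def parse_fixed_width(line):
    """Slice one record into a dict of stripped strings, one entry per field."""
    # Convert the 1-indexed inclusive span to a Python slice: [start - 1:end]
    return {name: line[start - 1:end].strip() for name, start, end in FIELDS}
```

Slicing past the end of a short line yields an empty string, so every record has the same 18 keys however much information an isotope carries.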

In [None]:
# Read in data files
with open("nubase_4.mas20", "r") as file:
    data = file.read()

In [None]:
data = data.split("\n") # Separate lines
for line in data:
    print(line)

In [None]:
# Remove lines starting with '#' to remove header section
data = [line for line in data if not line.startswith("#")]

for line in data:
    print(line)



In [None]:
# Test logic for separating data items on a line
line = data[0]
line = [item for item in line.split(" ") if item]
print(line)

Apply the above logic to the entire data set. The output should be a 2D list.

In [None]:
processed_data = []

for line in data:
    processed_data.append([item for item in line.split(" ") if item])

2D list created above - ready to output to a CSV.
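
The export step could look something like the sketch below, which uses the standard `csv` module. The `write_csv` helper and any output filename passed to it are assumptions for illustration.

```python
import csv


def write_csv(rows, path):
    """Write a list of row lists to `path`, one CSV record per row."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

It would be called as `write_csv(processed_data, "nubase_processed.csv")`, where the filename is a placeholder. `csv.writer` accepts rows of differing lengths, so it will not flag ragged data on its own.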

### Note - possible issue noticed

Currently, the rows in the `processed_data` list vary in length, so there is no way to tell which field each item belongs to: some isotopes have fewer fields populated than others.

This needs fixing before going further. Every row must have the same length, so fields that are missing from the data must be left as blank items.

To continue, I will preserve the above and start over below.

# Restart point (see directly above for reason)

In [None]:
# For reference - what is the max length we will require?
max_len = 0

for line in processed_data:
    max_len = max(max_len, len(line))
    
print(max_len)

The largest number of data items in one row is 19, so every row will need 19 items (padded with empty elements where needed).

In [26]:
# Read in data files
with open("nubase_4.mas20", "r") as file:
    data = file.read()
    
data = data.split("\n") # Separate lines

# Remove lines starting with '#' to remove header section
data = [line for line in data if not line.startswith("#")]

In [29]:
# Test logic for separating data items on a line - need to work out how to ensure length is correct
line = data[0]
line = line.split(" ")  # keep the empty strings so the gaps stay visible
print(line)

print(len(line))

['001', '0000', '', '', '1n', '', '', '', '', '', '', '8071.3181', '', '', '', '', '0.0004', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '609.8', '', '', '', 's', '0.6', '', '', '', '1/2+*', '', '', '', '', '', '', '', '', '06', '', '', '', '', '', '', '', '', '', '1932', 'B-=100']
76


Current idea:
- If there is a sequence of two or more empty strings in the list remove all but the first one
- This *may* bring the list length to the correct value (19), with the empty spaces in the places where data is not present

In [31]:
cleaned_line = []
for i, item in enumerate(line):
    if item:
        cleaned_line.append(item)
    elif i + 1 < len(line) and line[i + 1]:
        cleaned_line.append(item)
print(cleaned_line)
print(len(cleaned_line))

['001', '0000', '', '1n', '', '8071.3181', '', '0.0004', '', '609.8', '', 's', '0.6', '', '1/2+*', '', '06', '', '1932', 'B-=100']
20


Close, but not quite - the length of this line is 1 greater than the length of the longest line found beforehand, which suggests that one of the empty strings included here is not needed.

The next step is to find out which item is not needed.
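
One way to look for the surplus item is to record the column at which each non-blank token starts and compare those positions with the column table at the top. This is a minimal sketch only; `token_columns` is a hypothetical helper, and the sample string is a hand-typed fragment rather than a line read from the file.

```python
import re


def token_columns(line):
    """Return (first_column, text) for each whitespace-delimited token, 1-indexed."""
    return [(match.start() + 1, match.group()) for match in re.finditer(r"\S+", line)]


print(token_columns("001 0000   1n"))  # [(1, '001'), (5, '0000'), (12, '1n')]
```

In this fragment the three spaces between `0000` and `1n` fall in columns 9 to 11, which the table does not assign to any field. A blank kept at that position is therefore padding between two populated fields, not a placeholder for a missing value.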