# Working with large FITS files

This tutorial builds on the guide [Create a very large FITS file from scratch](https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html) and shows how to build a large FITS file with multiple HDUs in which the largest HDU is not the last one. It is aimed at users already quite familiar with the FITS format.



## Authors
C. E. Brasseur

## Learning Goals
* Build a *large* FITS file (*large* means too large to fit in memory all at once)
* Access data from a *large* FITS file
* Modify a *large* FITS file

## Keywords
Example, example, example

## Companion Content
LINK TO FITS DOCUMENTATION

## Summary

This is an advanced tutorial. If you don't want to know about the inner workings of the FITS format, just stop here. If you don't want to know but nevertheless need to, proceed with caution; that's how I started, and now here I am writing this tutorial.



## Building a large FITS file

1. [Imports](#Imports)
2. [Primary HDU](#Primary-HDU)
3. [Large Image HDU](#Large-Image-HDU)
4. [Large Table HDU](#Large-Table-HDU)
5. [Adding an Extra Small HDU](#Adding-an-Extra-Small-HDU)
6. [Cleanup](#Cleanup)


https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html

https://fits.gsfc.nasa.gov/fits_standard.html

https://docs.python.org/3/library/mmap.html#mmap.mmap.madvise

https://docs.python.org/3/library/mmap.html#madvise-constants

https://man7.org/linux/man-pages/man2/madvise.2.html

https://github.com/astropy/astropy/issues/1380

https://github.com/astropy/astropy/pull/7597

https://github.com/astropy/astropy/pull/7926

## Imports

In [None]:
import os

from time import time

import numpy as np

from astropy.io import fits
from astropy.table import Table

from mmap import MADV_SEQUENTIAL

And since we're building a huge file, we'll write a little function to give us the file size in whatever units we want.

In [None]:
def print_file_size(path, unit="B"):
    
    size = os.path.getsize(path)
    
    if unit=="KB":
        size /= 1e3
        fmt = '.1f'
    elif unit=="MB":
        size /= 1e6
        fmt = '.1f'
    elif unit=="GB":
        size /= 1e9
        fmt = '.1f'
    elif unit=="FITS":
        size //= 2880
        unit = "FITS block(s)"
        fmt = 'd'
        
    else:
        unit = "Bytes"
        fmt = 'd'
        
    print(f"{size:{fmt}} {unit}")

## Primary HDU

We're going to build this up as a properly formatted multi-extension FITS file, so before we get into the matter of creating a massive FITS file we will build a basic primary header and write that to the file that will become our monster FITS file.

In [None]:
# Make some header entries for important information
primary_header_cards = [("ORIGIN", 'Fancy Archive', "Where the data came from"),
                        ("DATE", '2024-03-05',  "Creation date"),
                        ("MJD", 60374, "Creation date in MJD"),
                        ("CREATOR", 'Me',  "Who created this file")]  

# Build the Primary HDU object and put it in an HDU list
primary_hdu = fits.PrimaryHDU(header=fits.Header(primary_header_cards))
hdu_list = fits.HDUList([primary_hdu])

# Write the HDU list to file
big_fits_fle = "./patagotitan.fits"
hdu_list.writeto(big_fits_fle, overwrite=True)

Before we continue let's verify our (currently tiny) FITS file is valid.

In [None]:
with fits.open(big_fits_fle) as hdu_list:
    hdu_list.info()

Checking out the current size, we see it's one FITS block.

In [None]:
print_file_size(big_fits_fle)
print_file_size(big_fits_fle, "FITS")
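
To see why a data-free primary HDU takes exactly one block: a FITS header is a sequence of 80-character cards, padded with spaces to a multiple of 2880 bytes (36 cards). The sketch below builds a minimal primary header by hand, purely for illustration. The helper name `pad_header` is made up for this example; in practice `astropy.io.fits` does this for you.

```python
CARD_LEN = 80     # every header card is 80 characters
BLOCK_LEN = 2880  # FITS files are organized in 2880-byte blocks


def pad_header(cards):
    """Join header cards into one string padded to whole FITS blocks."""
    text = "".join(card.ljust(CARD_LEN) for card in cards)
    n_blocks = -(-len(text) // BLOCK_LEN)  # ceiling division
    return text.ljust(n_blocks * BLOCK_LEN)


# Keyword in columns 1-8, "=" in column 9, value right-justified to column 30
cards = [
    "SIMPLE  =" + "T".rjust(21),
    "BITPIX  =" + "8".rjust(21),
    "NAXIS   =" + "0".rjust(21),
    "END",
]

header_block = pad_header(cards)
print(len(header_block), len(header_block) // CARD_LEN)
```

Four cards are far fewer than 36, so the whole header fits in a single block, which is the one block reported above.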

## Large Image HDU

Here we are going to expand our FITS file to fit a large (40,000 x 40,000 pixel) image. This will cause the file to grow to ~13 GB in size. If that is too large for your system, adjust `array_dims` below. All of the steps still work as expected with smaller data, although there are simpler ways to do this if the whole FITS file fits in memory.

In [None]:
array_dims = [40_000, 40_000]

First we build an ImageHDU object with a small data array. The data in the array does not matter because we won't be using it, but the data type needs to be correct, and you need to know how many bytes each entry of that data type takes. In this case we are making a `float64` array, so each entry uses 8 bytes of memory.

In [None]:
data = np.zeros((100, 100), dtype=np.float64)
hdu = fits.ImageHDU(data)

Now we pull out just the header, and adjust the `NAXIS` keywords to match our desired large array dimensions. We also set an `EXTNAME`, which is optional but helpful because it allows us to refer to that extension by name as well as by index.

In [None]:
header = hdu.header

header["NAXIS2"] = array_dims[0]
header["NAXIS1"] = array_dims[1]

header["EXTNAME"] = 'BIG_IMG' 

Now we write just the header to the end of our soon to balloon FITS file (at the end of this step it is temporarily NOT a valid FITS file).

In [None]:
with open(big_fits_fle, 'ab') as FITSFLE:  # 'ab' means open to append bytes
    FITSFLE.write(bytearray(header.tostring(), encoding="utf-8"))

Now we calculate the number of bytes we need for our large array, remembering the result needs to be a multiple of 2880 bytes to conform to the FITS standard. Note the multiplication by 8, because our array is of type `float64`; this would need to be adjusted for different data types.

In [None]:
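# Force a 64-bit product so the byte count cannot overflow where NumPy's default integer is 32-bit
n_pixels = int(np.prod(array_dims, dtype=np.int64))
arraysize_in_bytes = ((n_pixels * 8 + 2880 - 1) // 2880) * 2880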
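
As a worked example of this arithmetic, the sketch below derives the bytes per element from the `BITPIX` value (`-64` for `float64`) and rounds up to whole blocks. The function name `padded_data_size` is hypothetical and not part of Astropy.

```python
BLOCK_LEN = 2880  # FITS block size in bytes


def padded_data_size(shape, bitpix):
    """Bytes needed for an image data unit, rounded up to whole FITS blocks."""
    n_elements = 1
    for n in shape:
        n_elements *= n
    n_bytes = n_elements * abs(bitpix) // 8  # |BITPIX| is bits per element
    return -(-n_bytes // BLOCK_LEN) * BLOCK_LEN  # ceiling to a block multiple


raw = 40_000 * 40_000 * 8
padded = padded_data_size((40_000, 40_000), bitpix=-64)
print(raw, padded, padded - raw)
# 12800000000 12800001600 1600
```

The raw array is not a whole number of blocks, so 1600 bytes of padding are added after the data.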

Now we need to expand the file by that many bytes. To do this we seek to the desired new end of the file and write a null byte.

In [None]:
filelen = os.path.getsize(big_fits_fle) 
        
with open(big_fits_fle, 'r+b') as FITSFLE:
    FITSFLE.seek(filelen + arraysize_in_bytes - 1)
    FITSFLE.write(b'\0')
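
The same seek-and-write trick can be tried safely on a throwaway file. This is a minimal sketch using only the standard library; the file name and sizes are arbitrary. On most systems the skipped region reads back as zero bytes, though that detail is platform dependent.

```python
import os
import tempfile

extra_bytes = 5 * 2880  # how much we want the file to grow

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "grow_demo.bin")

    # Start with one block of content
    with open(path, "wb") as f:
        f.write(b" " * 2880)
    size_before = os.path.getsize(path)

    # Seek to the last byte of the desired new size and write a single null byte
    with open(path, "r+b") as f:
        f.seek(size_before + extra_bytes - 1)
        f.write(b"\0")

    size_after = os.path.getsize(path)

print(size_before, size_after)
# 2880 17280
```

Only one byte is actually written, which is why the operation is fast even when the target size is many gigabytes.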

Now let's see how big our FITS file has become.

In [None]:
print_file_size(big_fits_fle, "GB")
print_file_size(big_fits_fle, "FITS")

So just about 13 GB as expected, and a lot more FITS blocks.

### Filling the big array

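Now we have a big ol' empty array, so let's put some stuff in it.

We open the file in `update` mode, so that our changes are written back to disk, and with `memmap=True`, so that the array is memory-mapped rather than read into memory all at once. Memory mapping is critical here because the array is too large to fit in memory.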

In [None]:
hdu_list = fits.open(big_fits_fle, mode='update', memmap=True)

In [None]:
hdu_list.info()

That's what we expect, and this is also the point where you find out if you've messed up these operations. FITS files don't have an index up front, so the reader just has to scan through the file (in chunks of 2880 bytes) looking for more extensions. By default the Astropy FITS module does not do this until necessary (LINK TO DOCS), so it's at the point where we call the `info` function that we find out if our FITS file is still valid. If this operation hangs, most likely the array size calculation is wrong.

We'll pull out the large data array, and then fill it in a loop. 

In [None]:
data_array = hdu_list[1].data

If you are on a system with the `madvise` call (you're on your own figuring that out), you can set madvise to MADV_SEQUENTIAL for the data_array. This tells the memory mapping that you are going to be accessing the array in a sequential manner and allows it to be more efficient in how it handles memory allocation based on that. (Obviously don't set this if you aren't going to be accessing the array sequentially).

In [None]:
mm = fits.util._get_array_mmap(data_array)
mm.madvise(MADV_SEQUENTIAL)
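
One way to find out whether your system has `madvise` is to check for the attributes at run time, since the `MADV_*` constants (and so the `from mmap import MADV_SEQUENTIAL` line above) only exist on platforms that provide the call. The sketch below does this on a plain memory-mapped scratch file rather than a FITS array; the helper name `advise_sequential` is made up for this example.

```python
import mmap
import os
import tempfile


def advise_sequential(mm):
    """Apply MADV_SEQUENTIAL if the platform supports it; return whether it was applied."""
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)
        return True
    return False


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "madvise_demo.bin")

    # A one-page scratch file to map
    with open(path, "wb") as f:
        f.write(b"\0" * mmap.PAGESIZE)

    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
        applied = advise_sequential(mm)
        mm.close()

print("MADV_SEQUENTIAL applied:", applied)
```

The same guard could wrap the `mm.madvise(MADV_SEQUENTIAL)` call above, so the notebook still runs (without the hint) on systems that lack it.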

Now we fill the large array in blocks. We want the block size to comfortably fit in memory. The `block_size` I am using yields a ~1.3 GB array; adjust as your system requires.

In [None]:
block_size = 4000

it = time()
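for i,j in enumerate(range(0, array_dims[0], block_size)):
    # The last block is shorter if block_size does not evenly divide array_dims[0]
    rows = min(block_size, array_dims[0] - j)
    sub_arr = np.full((rows, array_dims[1]), i, dtype=np.float64)
    data_array[j:j+rows,:] = sub_arr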
    print(f"{i}: {time()-it:.0f} sec")
    it = time()

Note that the loop iterations take differing amounts of time. I have not pinned down why.

In [None]:
hdu_list.close()

### Checking the file contents

So now we've theoretically filled the elephantine array, but we want to make sure it actually got filled and saved. So we'll open the file in a non-editable mode and check.

In [None]:
hdu_list = fits.open(big_fits_fle, mode='denywrite', memmap=True)

data_array = hdu_list[1].data

In [None]:
it = time()
for i,j in enumerate(range(0, array_dims[0], block_size)):
    print(f"{i}: Data match is {(data_array[j:j+block_size,:] == i).all()}: {time()-it:.0f} sec")
    it = time()
    
hdu_list.close()
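
Before committing ~13 GB of disk, the whole sequence from this section can be rehearsed at a small scale. The sketch below repeats the same steps (primary HDU, header only, extend the file, fill, verify) for a 30 x 50 image in a temporary directory. The file name, `DEMO_IMG`, and the fill value are arbitrary choices for this example.

```python
import os
import tempfile

import numpy as np
from astropy.io import fits

dims = (30, 50)  # (NAXIS2, NAXIS1): a small stand-in for the large image

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "small_demo.fits")

    # 1. Write a file containing only a primary HDU
    fits.HDUList([fits.PrimaryHDU()]).writeto(path)

    # 2. Build an image header from a dummy array, then claim the real size
    header = fits.ImageHDU(np.zeros((2, 2), dtype=np.float64)).header
    header["NAXIS2"], header["NAXIS1"] = dims
    header["EXTNAME"] = "DEMO_IMG"
    with open(path, "ab") as f:
        f.write(header.tostring().encode("ascii"))

    # 3. Extend the file by the padded data size (8 bytes per float64)
    nbytes = ((dims[0] * dims[1] * 8 + 2880 - 1) // 2880) * 2880
    size_before = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.seek(size_before + nbytes - 1)
        f.write(b"\0")

    # 4. Fill the array through the memory map
    with fits.open(path, mode="update", memmap=True) as hdul:
        hdul["DEMO_IMG"].data[:] = 7.0

    # 5. Read it back and check
    with fits.open(path, memmap=False) as hdul:
        demo_shape = hdul["DEMO_IMG"].data.shape
        demo_ok = bool((hdul["DEMO_IMG"].data == 7.0).all())

    final_size = os.path.getsize(path)

print(demo_shape, demo_ok, final_size)
```

If the size calculation in step 3 is changed to something wrong, the error shows up within seconds here instead of after a long wait on the full-size file.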

## Large Table HDU

In the last section we expanded our FITS file to add a colossal image extension. In this section we will do the same for a table extension. The method is similar, but with a few key differences.

As with the image, the data is not important but the data types are. In particular, the maximum string length for columns cannot be changed on the fly (since the memory has been allocated and is fixed).

In [None]:
small_tbl = Table(names=["Name", "Population", "Prince", "Years since fall", "Imports", "Exports"],
                  dtype=['U128', int, 'U128', np.float64, 'U2048', 'U2048'],
                  rows=[["Vangaveyave", 1297382, "Oriana", 34.6, "wine, cheese", "ahalo cloth, pearls, foamwork"],
                        ["Azilint", 50000, "n/a", 92.3, "none", "none"],
                        ["Amboloyo", 50937253, "Rufus", 1504.2, "pears, textiles, spices", "wine, timber"]])

table_hdu = fits.BinTableHDU(data=small_tbl)
table_hdu.header["EXTNAME"] = "BIG_TABLE"

The header for this table HDU gives us the information to determine how many bytes we need for our mammoth table.

In [None]:
table_hdu.header

The `NAXIS#` keywords hold the dimensions of the table, where `NAXIS1` is the length of a single table row in bytes and `NAXIS2` is the number of rows in the table. So to get the total size of the jumbo table in bytes we simply multiply `NAXIS1` by the number of rows desired (adjusting for the FITS block size). I'm choosing a million rows, which is about 4 GB; adjust as necessary for your system.

In [None]:
num_rows = 1_000_000
tablesize_in_bytes = ((table_hdu.header["NAXIS1"]*num_rows + 2880 - 1) // 2880) * 2880
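
As a worked example, the row length can be added up by hand. This sketch assumes the columns above are written with the FITS formats `128A`, `K`, `128A`, `D`, `2048A`, `2048A` (one byte per character, eight bytes per 64-bit integer or float); check `NAXIS1` in the header printed above rather than relying on these numbers.

```python
BLOCK_LEN = 2880  # FITS block size in bytes

# Assumed bytes per field: 1 byte per character, 8 bytes for int64/float64
field_bytes = {
    "Name": 128,
    "Population": 8,
    "Prince": 128,
    "Years since fall": 8,
    "Imports": 2048,
    "Exports": 2048,
}

row_bytes = sum(field_bytes.values())  # corresponds to NAXIS1
n_rows = 1_000_000                     # corresponds to NAXIS2
raw = row_bytes * n_rows
padded = -(-raw // BLOCK_LEN) * BLOCK_LEN  # round up to whole blocks

print(row_bytes, raw, padded)
# 4368 4368000000 4368000960
```

Almost all of each row is taken up by the two 2048-character string columns, so shrinking those is the quickest way to reduce the table size.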

Now we adjust the `NAXIS2` keyword to match our new table length and write just the header to the end of our towering FITS file.

In [None]:
table_hdu.header["NAXIS2"] = num_rows

with open(big_fits_fle, 'ab') as FITSFLE:
    FITSFLE.write(bytearray(table_hdu.header.tostring(), encoding="utf-8"))

Before we expand the file, let's remind ourselves of the current file size.

In [None]:
print_file_size(big_fits_fle, "GB")

Now, just as for the vast data array, we seek `tablesize_in_bytes` beyond the current end of the file and write a null byte.

In [None]:
filelen = os.path.getsize(big_fits_fle)

with open(big_fits_fle, 'r+b') as FITSFLE:
    FITSFLE.seek(filelen + tablesize_in_bytes - 1)
    FITSFLE.write(b'\0')

And we can see that the filesize has indeed increased by about 4GB.

In [None]:
print_file_size(big_fits_fle, "GB")

### Adding data to the titanic table

We can now open the prodigious FITS file in update mode and fill in our table. Note that this time we don't advise the memory mapper that we will be accessing the memory in sequential order, because we are not doing that.

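Keep in mind that a FITS binary table is stored row by row on disk, so the values of a single column are spread through the whole table rather than sitting next to each other.

Rows can also be filled one at a time by assigning a tuple to a row index, as in this short session with a small three-column table: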

In [6]: hdu.data
Out[6]: 
FITS_rec([(1, 1., 'c'), (2, 2., 'd'), (3, 3., 'e')],
         dtype=(numpy.record, [('a', '<i8'), ('b', '<f8'), ('c', 'S1')]))

In [7]: hdu.data[0]
Out[7]: (1, 1.0, 'c')

In [8]: hdu.data[0] = (5, 5, 'f')

In [9]: hdu.data
Out[9]: 
FITS_rec([(5, 5., 'f'), (2, 2., 'd'), (3, 3., 'e')],
         dtype=(numpy.record, [('a', '<i8'), ('b', '<f8'), ('c', 'S1')]))

In [None]:
hdu_list = fits.open(big_fits_fle, mode='update', memmap=True)

In [None]:
hdu_list.info()

In [None]:
table_data = hdu_list["BIG_TABLE"].data

It's easier to update a FITS table column-wise rather than row-wise, so we'll start by updating a couple of columns.

In [None]:
it = time()
table_data["Years since fall"] = np.linspace(5, 1000, num_rows)  # one value per table row
print(f"Float column: {time()-it:.0f} sec")

In [None]:
it = time()
table_data["Exports"] = ["Magic"]*num_rows  # one value per table row
print(f"String column: {time()-it:.0f} sec")

In [None]:
it = time()
hdu_list.flush()
print(f"Flushing: {time()-it:.0f} sec")

We can of course also add data on a row by row basis. Here I do that by setting each field of a row individually, adding in some data from our original small table in a couple of arbitrary rows.

In [None]:
for i in [0,2]:
    for col in small_tbl.colnames:
        table_data[col][i*50000] = small_tbl[col][i]
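
To make the difference between column-wise and row-wise access concrete, here is a sketch using a small NumPy structured array as a stand-in for the memory-mapped table. The field names and values are made up; the point is the layout, where consecutive values of one column sit a full row length apart.

```python
import numpy as np

# Hypothetical three-field record layout, big-endian as FITS data is on disk
rec = np.zeros(4, dtype=[("a", ">i8"), ("b", ">f8"), ("c", "S1")])

# Column-wise update: touches one field in every row
rec["b"] = np.linspace(0.0, 3.0, 4)

# Row-wise update: assigns all fields of a single row at once
rec[1] = (5, 5.0, b"f")

row_bytes = rec.dtype.itemsize    # bytes per row (8 + 8 + 1)
col_stride = rec["b"].strides[0]  # distance in bytes between consecutive "b" values

print(row_bytes, col_stride)
# 17 17
```

The stride of a column equals the row length, so a column-wise update of a very large table walks through the entire data unit even though it only changes a few bytes in each row.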

When we close the file it has to flush our new data to disk, so this can take some time.

In [None]:
it = time()
hdu_list.close()
print(f"Closing: {time()-it:.0f} sec")

### Checking our data

Now let's again open up our behemothic FITS file and check that the data we just loaded in is still there.

In [None]:
hdu_list = fits.open(big_fits_fle, mode='denywrite', memmap=True)

In [None]:
hdu_list.info()

In [None]:
table_data = hdu_list["BIG_TABLE"].data

Checking the first few rows.

In [None]:
print(table_data[:3])

And looking at the later row where we also put data in.

In [None]:
print(table_data[100000])

# Close the read-only handle before the file is modified again
hdu_list.close()

Note how accessing the middle of the array takes longer than accessing the start.

## Adding an Extra Small HDU

The last thing we will do is add another small HDU to the oversize FITS file. We can do this in the usual way because the extension we are adding is of a normal size.

In [None]:
small_hdu = fits.ImageHDU(data=np.random.random((10,10)))
small_hdu.header["EXTNAME"] = "MINI_IMG"

Because we don't have to do anything funky with the file size we can just open the mighty FITS file in `append` mode and write the whole HDU, and it is a very fast operation.

In [None]:
with fits.open(big_fits_fle, mode='append', memmap=True) as hdu_list:
    hdu_list.append(small_hdu)

And now if we open the mondo FITS file we can see that additional extension.

In [None]:
with fits.open(big_fits_fle, mode='denywrite', memmap=True) as hdu_list:
    hdu_list.info()

## Cleanup

Lastly, we'll remove the behemothic file we created.

In [None]:
os.remove(big_fits_fle)