# How Fortran stores binary files

## Introduction

Fortran is still the go-to language for number crunching.

## Types of Fortran Files

There are three different native ways for Fortran to store data in files:

1. Formatted
2. Unformatted
3. Stream

Then, there are libraries to store the data in specific formats, for example NetCDF.

If you want to store complex data sets for a long time, I strongly recommend NetCDF or another dedicated data format. 
We have detailed on this blog before how to write NetCDF files with Fortran and Python, and they have features like compression and documentation that are very beneficial.

But what if the data isn't very complex and NetCDF would be overkill?
Or if you received the data from someone else and they didn't bother with this?

This blog post will help you with your task of reading the data.

## A few notes on best practices

There are some good practices on how to deal with files in Fortran. 
These make it easier to port data, because they will work on different systems.

### Ensure that you know the kind of the variable

If you write something like

```fortran
integer :: ii
```

you don't really know what kind of integer `ii` will be. 
Often, you can set the default integer and real kind with compiler options, but it's far better to explicitly declare the kind in the code itself.

The intrinsic `iso_fortran_env` module has existed since Fortran 2003, and since Fortran 2008 it provides named kind constants. All compilers we use today support this, so you can use it to get the proper kinds:

```fortran
use iso_fortran_env, only: int32, real64
implicit none
integer(kind=int32) :: ii
real(kind=real64) :: x(10, 100)
```

In old code, you might find statements like:

```fortran
integer*4 ii        ! DO NOT DO THIS
```

This syntax has *never* been standard, and I strongly discourage you from using it.
Slightly better, but still wrong, is this:

```fortran
integer(kind=4) :: ii   ! Still not good
```

There is no guarantee that every compiler will use the same kind values for the same variable types.
If for some reason you cannot use `iso_fortran_env`, use the `selected_int_kind` and `selected_real_kind` functions instead:

```fortran
integer, parameter :: real64 = selected_real_kind(15, 307)
real(kind=real64) :: x(10, 100)
```

See the table below for which kind you need:

| bytes | integer name | integer kind | integer max | real name | real kind |
|-------|--------------|--------------|-------------|-----------|-----------|
|  2  | `int16` | `selected_int_kind(3)` | 32767 | | |
|  4  | `int32` | `selected_int_kind(5)` | > 2*10^9 | `real32` | `selected_real_kind(6, 37)` |
|  8  | `int64` | `selected_int_kind(10)` | > 9*10^18 | `real64` | `selected_real_kind(15, 307)` |
| 16  | ----    | `selected_int_kind(19)` | > 10^38 | `real128` | `selected_real_kind(33, 4931)` |

Note that `iso_fortran_env` does not define a named constant `int128`, though your compiler might support a 16-byte integer kind anyway.
Some compilers also have a 10-byte real kind.
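The `selected_int_kind` arguments in the table follow directly from the byte widths. Here is a minimal Python sketch of that arithmetic, assuming two's-complement integers. The helper names `int_max` and `decimal_range` are made up for this illustration. `selected_int_kind(r)` asks for a kind that holds every `n` with `-10**r < n < 10**r`, so the table uses the smallest `r` that no longer fits into the next-smaller width.

```python
def int_max(nbytes):
    """Largest value of a signed two's-complement integer with this many bytes."""
    return 2 ** (8 * nbytes - 1) - 1


def decimal_range(nbytes):
    """Largest r such that every n with -10**r < n < 10**r fits into nbytes."""
    # An integer maximum with d decimal digits can hold all (d - 1)-digit numbers.
    return len(str(int_max(nbytes))) - 1


for nbytes in (1, 2, 4, 8, 16):
    print(f"{nbytes:2d} bytes: max = {int_max(nbytes)}, "
          f"largest decimal range r = {decimal_range(nbytes)}")
```

A 1-byte integer only covers `r = 2`, so asking for `r = 3` needs at least 2 bytes. In the same way, `r = 5`, `r = 10`, and `r = 19` are the first values that need 4, 8, and 16 bytes.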

### newunit

Whenever you interact with a file, you need a unit, an integer value that references a specific open file.
Standard input, standard output, and standard error are preconnected to units whose numbers depend on the compiler (the `iso_fortran_env` module exposes them as `input_unit`, `output_unit`, and `error_unit`).

Keeping track of which unit numbers are free while remaining compiler-agnostic gets confusing.
Fortunately, Fortran 2008 has an option for that: `newunit`.

Instead of using a hardcoded integer value, declare an integer variable with a meaningful name, then open the file with the `newunit=` specifier instead of `unit=`:

```fortran
integer :: output_handle
...
open(newunit=output_handle, file='data.dat', ...)
...
write(output_handle, *) values(:, i)
...
close(output_handle)
```

A new, unused unit number is assigned every time you open the file, so you no longer have to worry about clashing with a unit that is already in use.

## Stream

Stream access has been part of the standard since Fortran 2003.

The binary representation of the data is written directly to the file, without any metadata.

### Fortran writing stream data

```fortran
program write_stream
    use iso_fortran_env, only: int16
    implicit none
    integer(kind=int16) :: ii
    integer :: output_handle
    ! The specifier is form=, not format=
    open(newunit=output_handle, file='stream_data.dat', action='write',   &
         status='replace', access='stream', form='unformatted')
    write(output_handle) [(ii, ii=1, 10)]
    write(output_handle) "Hello World"
    close(output_handle)
end program write_stream
``` 

```
$ hexdump -C stream_data.dat
00000000  01 00 02 00 03 00 04 00  05 00 06 00 07 00 08 00  |................|
00000010  09 00 0a 00 48 65 6c 6c  6f 20 57 6f 72 6c 64     |....Hello World|
0000001f
```


This has written the 16-bit values from 1 to 10, least significant byte first, followed by the ASCII values for "Hello World".
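To see that there really is nothing else in the file, here is a small sketch that builds the same bytes with Python's standard `struct` module. It assumes a little-endian machine, which is what the hexdump above shows. The big-endian variant is only there for comparison.

```python
import struct

# Ten 16-bit signed integers, little-endian ('<'), then the raw characters.
payload = struct.pack('<10h', *range(1, 11)) + b'Hello World'

print(len(payload))        # 20 bytes of integers plus 11 characters
print(payload[:4].hex())   # the first two values, low byte first

# The same values on a big-endian machine would have their bytes swapped.
big_endian = struct.pack('>10h', *range(1, 11))
print(big_endian[:4].hex())
```

The length matches the final offset `0000001f` (31) in the hexdump, so a stream file is exactly the concatenation of everything that was written.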

### Reading it into Python

If it were purely one large array, it would be very easy to read it into Python:

```python
import numpy as np
np.fromfile('stream_data.dat', '<i2')
```

```
array([    1,     2,     3,     4,     5,     6,     7,     8,     9,
          10, 25928, 27756,  8303, 28503, 27762], dtype=int16)
```

The `np.fromfile` function reads the data stream in as-is, and interprets the values according to the data type you gave it, in the above case a little-endian 2-byte integer.

For an overview of possible data types, see the [NumPy array interface documentation](https://docs.scipy.org/doc/numpy/reference/arrays.interface.html#python-side).

The integer values are correctly read in, but of course the 'H' and 'e' get mashed into a single integer value of 25928, 'll' becomes 27756, and so forth. The final 'd' has no partner byte and is silently dropped.
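The arithmetic behind those numbers is easy to check by hand. The following sketch uses only built-in Python and pairs up the characters the way a little-endian 2-byte integer type would:

```python
text = b'Hello World'

# Pair up the bytes as a 2-byte type does; the 11th byte has no partner.
pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]

# Little-endian: the first byte is the low byte, the second the high byte.
values = [int.from_bytes(pair, byteorder='little', signed=True) for pair in pairs]

for pair, value in zip(pairs, values):
    print(pair, value)
```

For example, 'H' is 72 and 'e' is 101, so the first pair gives 72 + 101 * 256 = 25928.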

Still, this might be the simplest way to transfer a single array bit for bit to Python.
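If the file mixes several items, as the example does, you have to know the layout in advance and read the pieces one after another. Below is a minimal sketch using the standard `struct` module. The function name `read_stream_file` and its default arguments are assumptions made for this example, and the file is recreated in a temporary directory so that the snippet runs on its own.

```python
import os
import struct
import tempfile


def read_stream_file(path, count=10, text_len=11):
    """Read `count` little-endian int16 values followed by `text_len` characters."""
    with open(path, 'rb') as f:
        values = struct.unpack(f'<{count}h', f.read(2 * count))
        text = f.read(text_len).decode('ascii')
    return list(values), text


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the file written by the Fortran program above.
    path = os.path.join(tmp, 'stream_data.dat')
    with open(path, 'wb') as f:
        f.write(struct.pack('<10h', *range(1, 11)) + b'Hello World')

    values, text = read_stream_file(path)

print(values)
print(text)
```

Nothing in the file says where the integers end and the text begins, so the counts have to come from whoever wrote the data.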

## Unformatted sequential

There is no standardised method to store unformatted sequential data, and the exact format might vary between different compilers and platforms.

That said, most compilers seem to store it in a similar way by now.

### Fortran Write

```fortran
program write_unformatted
    use iso_fortran_env, only: int16
    implicit none
    integer(kind=int16) :: ii
    integer :: output_handle
    open(newunit=output_handle, file='unformatted_data.dat', form='unformatted', &
        status='replace', action='write', access='sequential')
    write(output_handle) [(ii, ii=1, 10)]
    write(output_handle) "Hello World"
    close(output_handle)
end program write_unformatted
```

```
$ hexdump -C unformatted_data.dat
00000000  14 00 00 00 01 00 02 00  03 00 04 00 05 00 06 00  |................|
00000010  07 00 08 00 09 00 0a 00  14 00 00 00 0b 00 00 00  |................|
00000020  48 65 6c 6c 6f 20 57 6f  72 6c 64 0b 00 00 00     |Hello World....|
0000002f
```


You can still see the values of 1 through 10 (`01 00` through `0a 00`), but you can also see that they no longer come first.
The file starts with `14 00 00 00`, or 20, which is the number of bytes that make up this record. After the array, the 20 is repeated, so that the file can also be traversed backwards, for example with `backspace`.

Next comes `0b 00 00 00`, or 11 -- exactly the number of bytes in "Hello World", again followed by a repeat of the record length 11.
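The record structure can be decoded by hand with a few lines of Python. This sketch assumes the layout seen in the hexdump: a 4-byte little-endian unsigned length before and after each record. It handles only this simple case, not any compiler-specific variations. The function name `read_records` is made up for this example, and the bytes are copied from the hexdump so the snippet needs no file.

```python
import io
import struct


def read_records(stream, marker='<I'):
    """Split a sequential unformatted file into its raw record payloads."""
    size = struct.calcsize(marker)
    records = []
    while True:
        head = stream.read(size)
        if not head:
            break  # clean end of file
        (length,) = struct.unpack(marker, head)
        payload = stream.read(length)
        (tail,) = struct.unpack(marker, stream.read(size))
        if tail != length:
            raise ValueError('leading and trailing record markers disagree')
        records.append(payload)
    return records


# The 47 bytes from the hexdump above.
raw = bytes.fromhex(
    '14 00 00 00 01 00 02 00 03 00 04 00 05 00 06 00 '
    '07 00 08 00 09 00 0a 00 14 00 00 00 0b 00 00 00 '
    '48 65 6c 6c 6f 20 57 6f 72 6c 64 0b 00 00 00'
)

records = read_records(io.BytesIO(raw))
values = struct.unpack('<10h', records[0])
text = records[1].decode('ascii')
print(values)
print(text)
```

Comparing the two markers is a cheap sanity check: if they disagree, the marker size or byte order is probably wrong for the compiler that wrote the file.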

### Python read

To read this data in Python, you can use `scipy.io.FortranFile`, which handles the record markers for you. You need to tell it the data type of the record header, which is almost always an unsigned integer and usually 4 bytes long:

```python
from scipy.io import FortranFile
ff = FortranFile('unformatted_data.dat', 'r', '<u4')
print(ff.read_record('<i2'))
print(b''.join(ff.read_record('S1')))
ff.close()
```

```
[ 1  2  3  4  5  6  7  8  9 10]
b'Hello World'
```
