## LMA source files

Most of the older LMA source files have filenames like `LYLOUT_20211013_000000_0600.dat.gz`.

Newer ones look like `WTLMA_2011_211013_102700_0060.dat.gz`. 

In the newer files, `LYLOUT` (which was shared by all networks) is replaced with a network identifier, here `WTLMA_2011`, for the West Texas LMA that was established in 2011. All files end with `yyyymmdd_HHMMSS_duration.dat.gz`. The start time of the file is `yyyymmdd_HHMMSS`, and `duration` is the total length of the file in seconds.
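The naming pattern can be sketched as a small parser. This is a minimal sketch, not part of any LMA software: `parse_lma_filename` and `LMA_NAME` are hypothetical names, and because the example filenames in this notebook show both six- and eight-digit dates, the sketch assumes six digits mean `yymmdd` and eight digits mean `yyyymmdd`.

```python
import re
from datetime import datetime

# Hypothetical pattern: prefix, date, start time, duration, then .dat or .dat.gz
LMA_NAME = re.compile(
    r'^(?P<prefix>.+)_(?P<date>\d{8}|\d{6})_(?P<time>\d{6})_(?P<duration>\d+)'
    r'\.dat(\.gz)?$'
)

def parse_lma_filename(name):
    """Return (prefix, start time, duration in seconds) for an LMA filename."""
    match = LMA_NAME.match(name)
    if match is None:
        raise ValueError("Unrecognized LMA filename: {0}".format(name))
    date = match.group('date')
    # Assumption: a six-digit date is yymmdd, an eight-digit date is yyyymmdd.
    date_format = '%y%m%d' if len(date) == 6 else '%Y%m%d'
    start = datetime.strptime(date + match.group('time'),
                              date_format + '%H%M%S')
    return match.group('prefix'), start, int(match.group('duration'))

for name in ['LYLOUT_20211013_000000_0600.dat.gz',
             'WTLMA_2011_211013_102700_0060.dat.gz']:
    print(parse_lma_filename(name))
```

The prefix is matched greedily, so a network identifier that itself contains an underscore and digits (such as `WTLMA_2011`) stays in one piece.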

They are compressed (`.gz`) plain text files. If you decompress them, you can view them with any text editor.

Realtime data files are usually one minute in length, though sometimes they are also provided as bundles of hourly data. Postprocessed data files are almost always ten minutes long.

### Headers of LMA postprocessed data

Main sections:
- `Analysis program:` is the command used to process this data file.
- The next few lines are information about the LMA network read from the `.loc` or `.gps` file provided to the processing program, and other information about the processing settings used.
- The `Station information` table provides the location (here, truncated for privacy) and receive channel (Ch. 3 is 60-66 MHz) of each station.
- The `Station data` table gives the data collection mode of the station. Here we see that an 80 µs collection window was used and that a 70 ns GPS timing error was assumed. The table also gives the number and fraction of sources to which each station contributed, a power metric relative to the median (I think), and whether that station was active and used in processing this data file.
- The `Station mask order` tells us which station corresponds to which bit in the station mask. We'll explain this below.
- Finally, we have a description of the data format for the VHF sources (here, called events), with a descriptive name for each column and its data format, and the total event count. These descriptions are not always correct, as you can see for the realtime data file!

**Postprocessed data header** from 2021-06-27, 0200-0210 UTC - ten minutes of data.

**Realtime data header** from 2021-10-13 0400-0401 UTC - one minute of data.

How do we tell the difference between the file types? Look at the `lma_analysis` line: the program version ending in `RT` indicates realtime, while the postprocessing ends in `R`, for reasons I don't understand. The path to the output filename location (`-o`) also gives a clue about where the data file was first saved, which hopefully was on a somewhat informative path on your data processing server.
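The version-suffix check described above can be sketched in a few lines. This is only an illustration: `lma_file_type` is a hypothetical helper, and it assumes the header contains a line beginning with `Analysis program version:`, as in the header printed at the end of this notebook. The `RT` version string in the example is made up.

```python
def lma_file_type(header_lines):
    """Guess the file type from the analysis program version suffix."""
    for line in header_lines:
        if line.startswith('Analysis program version:'):
            version = line.split(':', 1)[1].strip()
            # Check the longer suffix first so the intent is explicit.
            if version.endswith('RT'):
                return 'realtime'
            if version.endswith('R'):
                return 'postprocessed'
            return 'unknown'
    # No version line was found in the lines provided.
    return 'unknown'

print(lma_file_type(['Analysis program version: 10.11.7R\n']))
# A made-up realtime version string, for illustration only.
print(lma_file_type(['Analysis program version: 10.11.7RT\n']))
```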

## Interpreting the event data columns

The time and location columns in the data file above are self-explanatory, though it is good to know they are with respect to the WGS84 ellipsoid. The event power, in dBW at the source, is also provided.

The station mask column is a [hexadecimal number](https://en.wikipedia.org/wiki/Hexadecimal), where each digit after the `0x` represents a value from 0 to 15. Let's look at how to interpret `0xf09`.

```python
# Convert hexadecimal (base-16) to a decimal (base-10) integer
hexmask = '0xf09'
number = int(hexmask, 16)
print("Hex {0} is decimal integer {1}".format(hexmask, number))

# Print the integer as a decimal, and as binary, with leading zeroes and 16 bits (016b).
print("Integer {0} is binary {0:016b}".format(number))
```

```
Hex 0xf09 is decimal integer 3849
Integer 3849 is binary 0000111100001001
```


We see that `0xf09` converts to this binary string corresponding to the station location order from the header:

```
0000111100001001
    XHAPLRONBWGE
```

so stations X, H, A, P, B, and E contributed to this solution. If we sum the bits in the station mask, we find the number of contributing stations was 6. This is above the minimum required for a solution (5). 
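The same bookkeeping can be done in code. The sketch below assumes, as in the alignment above, that the last letter of the station mask order corresponds to the least significant bit. `stations_in_mask` is a hypothetical helper name, not a function from any LMA package.

```python
def stations_in_mask(hexmask, mask_order):
    """Return the station letters whose bits are set in a hex station mask."""
    number = int(hexmask, 16)
    contributing = []
    # The last letter in the mask order is the least significant bit (bit 0).
    for bit, station in enumerate(reversed(mask_order)):
        if (number >> bit) & 1:
            contributing.append(station)
    # Restore the order used in the header.
    return contributing[::-1]

stations = stations_in_mask('0xf09', 'XHAPLRONBWGE')
print(stations, len(stations))
# → ['X', 'H', 'A', 'P', 'B', 'E'] 6
```

If only the count is needed, `bin(int(hexmask, 16)).count('1')` gives the same number without the station order.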

**The number of contributing stations is an important parameter that lets us control the amount of noise in our science analyses.**

For a fully operational network a minimum of 6 contributing stations is a good starting number, but values of 5 or 7 can be useful.

Noise is primarily caused by false correlations produced by random, local VHF sources at several stations that happen to correlate with each other. More stations in a network means a greater chance of false correlation, so larger LMA networks tend to need a higher minimum station count to keep noise in check.

**The other primary quality control parameter is the reduced $\chi^2$ value.** The processing requires that all sources have a reduced $\chi^2 < 5$.


(If you see values larger than 5 in a data file, a second pass has been used to restore some very high power sources that might be a special class of discharge called "narrow bipolar events". The post-processed header above includes `-y 500.00`, so a second pass was done for that data file, up to $\chi^2 < 500$.)

$\chi^2$ is a measure of the goodness of fit of the solutions. Specifically, it is the [$\chi^2$ statistic](https://mathworld.wolfram.com/Chi-SquaredDistribution.html) of the normalized squared timing errors. Using the notation of [Thomas et al. (2004)](https://doi.org/10.1029/2004JD004549), eq. A2,

\begin{equation}
\Large\chi^2 = \Large\sum_{i=1}^N \frac{(t_i^\mathrm{obs} - t_i^\mathrm{fit})^2}{\Delta t_\mathrm{rms}^2},
\end{equation}

where $t_i^\mathrm{obs}$ is the observed arrival time of the source at each station, and $t_i^\mathrm{fit}$ the predicted arrival time of each source (traveling at the speed of light) at each station. $\Delta t_\mathrm{rms}$ is the expected (root mean square) normalized timing error assumed in the processing, i.e., 70 ns for the two data files above.

The actual value in the file is the **reduced $\chi^2$**, 

\begin{equation}
\chi_{\nu}^2 = \frac{\chi^2}{N - 4}
\end{equation}

which has been further normalized by the number of degrees of freedom $\nu = N - 4$, where $N$ is the number of contributing stations and 4 is the number of retrieved parameters for the source location $(x,y,z,t)$.

Using these equations, we can convert the data file's $\chi_{\nu}^2$ value into the actual timing errors, and/or calculate the true $\chi_{\nu}^2$ for the actual GPS timing noise, which is typically less than 70 ns.
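Combining the two equations gives $\sum_i (t_i^\mathrm{obs} - t_i^\mathrm{fit})^2 / (N - 4) = \chi_{\nu}^2 \, \Delta t_\mathrm{rms}^2$, so both conversions are simple rescalings. The sketch below shows one way to write them. The function names are hypothetical, and the 35 ns "actual" timing error in the example is an assumed value for illustration, not a measured one.

```python
import math

def rms_timing_residual(reduced_chi2, assumed_dt_ns=70.0):
    """RMS timing residual per degree of freedom, in ns.

    Follows from sum(residual**2) / (N - 4) = reduced_chi2 * assumed_dt_ns**2.
    """
    return assumed_dt_ns * math.sqrt(reduced_chi2)

def rescale_reduced_chi2(reduced_chi2, actual_dt_ns, assumed_dt_ns=70.0):
    """Reduced chi-squared recomputed for a different assumed timing error."""
    return reduced_chi2 * (assumed_dt_ns / actual_dt_ns) ** 2

# A file value of 1.0 with the 70 ns assumption corresponds to 70 ns residuals.
print(rms_timing_residual(1.0))
# If the timing error were really 35 ns (an assumed value), the same
# residuals would give a reduced chi-squared four times larger.
print(rescale_reduced_chi2(1.0, actual_dt_ns=35.0))
```

Because the rescaling is a constant factor for a given file, it changes the numerical threshold but not the ranking of sources by fit quality.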

([Thomas et al. (2004)](https://doi.org/10.1029/2004JD004549) shows how the timing errors relate to range, azimuth, and elevation errors, and how to determine $\Delta t_\mathrm{rms}$ for any data file.)

Practically speaking, we usually don't calculate the corrected $\chi_\nu^2$ unless we care about converting the timing errors into location errors. Instead, we simply use the file's $\chi_{\nu}^2$ and reduce our maximum reduced $\chi^2$ threshold until we're satisfied we've removed most noise.

When [working with data from some day for the first time](./FirstLMAplots.ipynb), it is useful to experiment with a minimum number of stations of 5, 6, and 7, and reduced chi-squared values of 1.0 and 5.0, since the best setting for that day depends on the number of active stations and the radio noise in the ambient environment on that day. You will observe that there is a tradeoff between removing noise and removing detail in lightning channels. The balance point is a judgment call, but should always be reported in publications using LMA data.
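As a rough sketch of that experiment, the two quality control parameters can be combined into a single test and tried with each pair of settings. The helper names and the three events below are made up purely for illustration; real values come from the reduced $\chi^2$ and station mask columns of a data file.

```python
def count_stations(hexmask):
    """Number of contributing stations: the count of set bits in the mask."""
    return bin(int(hexmask, 16)).count('1')

def passes_qc(reduced_chi2, hexmask, max_chi2=1.0, min_stations=6):
    """True if an event meets both quality control thresholds."""
    return (reduced_chi2 <= max_chi2
            and count_stations(hexmask) >= min_stations)

# Made-up (reduced chi-squared, station mask) pairs, for illustration only.
events = [(0.45, '0xf09'), (3.20, '0xf09'), (0.80, '0x01f')]

for max_chi2 in (1.0, 5.0):
    for min_stations in (5, 6, 7):
        kept = [event for event in events
                if passes_qc(event[0], event[1], max_chi2, min_stations)]
        print("chi2 <= {0}, stations >= {1}: kept {2} of {3}".format(
            max_chi2, min_stations, len(kept), len(events)))
```

Tightening either threshold can only shrink the set of retained sources, which is the tradeoff between noise and channel detail described above.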

Before we [move on to actually trying to filter and plot some data ourselves](./FirstLMAplots.ipynb), let's use the cell below to look at the header of a file we'll work with later today. You might need to adjust `n_lines` to see more of the header.

```python
import gzip

n_lines = 100
filename = '/data/Houston/realtime-tracer/LYLOUT_200524_210000_0600.dat.gz'

# Open the compressed file in text mode and read the first n_lines lines.
with gzip.open(filename, 'rt', encoding='utf8') as lmafile:
    header_lines = [lmafile.readline() for _ in range(n_lines)]
for line in header_lines:
    print(line, end='')
```

New Mexico Tech Lightning Mapping Array - analyzed data
Analysis program: /data1/hlma_tamu/lma_analysis -d 20200524 -t 210000 -s 600 -l /data1/hlma_tamu/hstnA.loc -o /data1/hlma_data/processed/080us_data/0600/nomdl//2020/05/200524 -x 5.00 -y 500.00
Analysis program version: 10.11.7R
File created: Mon May 25 00:23:04 2020
Data start time: 05/24/20 21:00:00
Number of seconds analyzed: 600
Location: HSTN LMA 2012
Coordinate center (lat,lon,alt): 29.7600000 -95.3700000 -200.00
Coordinate frame: cartesian
Maximum diameter of LMA (km): 205.007
Maximum light-time across LMA (ns): 683965
Number of stations: 13
Number of active stations: 7
Active stations: A B I J K L M
Minimum number of stations per solution: 6
Maximum reduced chi-squared: 5.00
Maximum number of chi-squared iterations: 20
Station information: id, name, lat(d), lon(d), alt(m), delay(ns), board_rev, rec_ch
Sta_info: A  Cy-Fair ISD        29.9392583  -95.6464869    24.31   45 3  3
Sta_info: B  Williams Airport   30.1574342  -95.3