# Process HURDAT

In this notebook we will process the HURDAT text files to produce a single combined `*.csv`.
The notebook will demonstrate the following Python concepts:

* Context managers.
* Reading a file line by line.
* Writing a file line by line.
* Splitting strings by position.
* Cleaning and coercing strings.
* Unit testing.
* I/O buffer optimizations.
* Simple text progress indicators.

The source data and documentation can be found on the
[NOAA website](https://www.aoml.noaa.gov/hrd/hurdat/Data_Storm.html).

In [1]:
import regex as re
import time

## Identify the Type of Row

The text files contain two different types of rows:

* A header row that identifies the storm for the subsequent track records.
* The storm track data.

Identifying which type of row we are working with is an excellent use of
[regular expressions](https://docs.python.org/3/library/re.html). The
[how-to guide](https://docs.python.org/3/howto/regex.html) on regular expressions in Python is
helpful. We can start by looking at an example of each type and reading the documentation in
the `*.pdf` file. We can use [RegEx101](https://regex101.com/) for rapid interactive testing
of our regular expressions.

A familiar Canadian example is the regular expression that matches a postal code:
```regex
[a-z][0-9][a-z][0-9][a-z][0-9]
```
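
As a quick illustration, the postal code pattern above can be tried directly in Python. This
is a minimal sketch using the standard-library `re` module. The `POSTAL` name and the sample
strings are made up for the example, and the `re.IGNORECASE` flag is an addition so that
upper-case codes match as well.

```python
import re

# Hypothetical name; same pattern as above, made case-insensitive.
POSTAL = re.compile(r"[a-z][0-9][a-z][0-9][a-z][0-9]", re.IGNORECASE)

for candidate in ["K1A0B1", "k1a0b1", "12345", "K1A 0B1"]:
    # `fullmatch` only succeeds when the whole string fits the pattern.
    print(candidate, POSTAL.fullmatch(candidate) is not None)
```

The last candidate fails because the pattern has no place for the space in the middle.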

In [2]:
ISHEADER = re.compile(r"^[^,]{8},[^,]{19},[^,]{7},\n$")
ISTRACK = re.compile(r"^[^,]{8},[^,]{5},[^,]{2},[^,]{3},[^,]{6},[^,]{7},[^,]{4}(,[^,]{5}){14}\n$")
testheader = "EP052002,            DOUGLAS,     27,\n"
testtrack = "19800813, 0600,  , TD, 15.2N,  22.5W,  25, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999\n"
testfail = "Hello World\n"
print(ISHEADER.fullmatch(testheader))
print(ISHEADER.fullmatch(testtrack))
print(ISHEADER.fullmatch(testfail))
print(ISTRACK.fullmatch(testheader))
print(ISTRACK.fullmatch(testtrack))
print(ISTRACK.fullmatch(testfail))

<regex.Match object; span=(0, 38), match='EP052002,            DOUGLAS,     27,\n'>
None
None
None
<regex.Match object; span=(0, 126), match='19800813, 0600,  , TD, 15.2N,  22.5W,  25, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999\n'>
None


### Encapsulate the Identification

We take the working logic and put it in a function that we can cleanly reuse.

In [3]:
def isheader(line):
    """
    Determine the type of row and return an indicator of the row type:
    * True if header.
    * False if track.
    * None if no match found.
    """
    ISHEADER = re.compile(r"^[^,]{8},[^,]{19},[^,]{7},\n$")
    ISTRACK = re.compile(r"^[^,]{8},[^,]{5},[^,]{2},[^,]{3},[^,]{6},[^,]{7},[^,]{4}(,[^,]{5}){14}\n$")
    if ISHEADER.fullmatch(line) is not None:
        return True
    elif ISTRACK.fullmatch(line) is not None:
        return False
    else:
        return None

### Unit Test the Function

Using the tests from before we can make sure the function works as expected.

In [4]:
print(isheader(testheader))
print(isheader(testtrack))
print(isheader(testfail))

True
False
None
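
The printed checks above can also be written as assertions, so that a failure stops with an
error instead of relying on someone reading the output. The sketch below is self-contained:
it repeats a compact copy of the classifier (named `classify` here, a hypothetical name, and
using the standard-library `re` module) and asserts the three expected outcomes.

```python
import re

# Patterns copied from the notebook; compiled once at module level.
HEADER = re.compile(r"[^,]{8},[^,]{19},[^,]{7},\n")
TRACK = re.compile(
    r"[^,]{8},[^,]{5},[^,]{2},[^,]{3},[^,]{6},[^,]{7},[^,]{4}(,[^,]{5}){14}\n"
)


def classify(line):
    """Return True for a header, False for a track, None otherwise."""
    if HEADER.fullmatch(line) is not None:
        return True
    if TRACK.fullmatch(line) is not None:
        return False
    return None


def test_classify():
    header = "EP052002,            DOUGLAS,     27,\n"
    track = "19800813, 0600,  , TD, 15.2N,  22.5W,  25" + ", -999" * 14 + "\n"
    assert classify(header) is True
    assert classify(track) is False
    assert classify("Hello World\n") is None


test_classify()
print("All checks passed")
```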


## Encapsulate Header Parsing

This function will extract and coerce the elements of the header:

* Unique identifier in database
* Year of season
* Storm number in season
* Basin Name
* Storm Name
* Tracks

To clean the parsed strings we remove the leading spaces or zeros using `lstrip()` and
then set the missing name flag `UNNAMED` to the empty string using `removesuffix()`.
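
A small illustration of the two cleaning steps, using slices like those found in a header.
Note that `str.removesuffix` needs Python 3.9 or later. The variable names and sample values
are chosen only for this example.

```python
# Fixed-width slices as they might come out of a header line.
rawnumber = "01"
rawname = "            UNNAMED"
rawtracks = "     14"

# `lstrip("0")` drops leading zeros; `lstrip()` drops leading whitespace.
print(repr(rawnumber.lstrip("0")))
print(repr(rawtracks.lstrip()))

# `removesuffix` only removes the text when the string ends with it, so a
# real name is left alone and the missing-name flag becomes "".
print(repr(rawname.lstrip().removesuffix("UNNAMED")))
print(repr("             NADINE".lstrip().removesuffix("UNNAMED")))
```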

In [5]:
def parseheader(line):
    """
    Parse and tidy-up the elements of the header. Returns identification elements of the
    storm:
    * `identifier` - unique identifier of the storm in the source.
    * `basincode` - identifier of the basin.
    * `basinname` - decoded name of the basin.
    * `stormnumber` - ordinal of storm in the season and basin.
    * `seasonyear` - beginning year of the storm season.
    * `stormname` - name of storm when available.
    * `stormtracks` - number of storm track records in source.
    """
    BASIN = {
        "AL": "Atlantic",
        "EP": "East Pacific",
        "CP": "Central Pacific"
    }
    identifier = line[0:8]
    basincode = line[0:2].lstrip()
    basinname = BASIN.get(basincode, "")
    stormnumber = line[2:4].lstrip("0")
    seasonyear = line[4:8]
    stormname = line[9:28].lstrip().removesuffix("UNNAMED")
    stormtracks = line[29:36].lstrip()

    # Send
    return (
        identifier,
        basincode,
        basinname,
        stormnumber,
        seasonyear,
        stormname,
        stormtracks
    )

### Test the Header Process

Test each branch and corner case. In this case, include a test for each element in the
`BASIN` dictionary.

In [6]:
print(parseheader("AL011851,            UNNAMED,     14,"))
print(parseheader("AL142012,             NADINE,     96,"))
print(parseheader("EP061950,            UNNAMED,      9,"))
print(parseheader("CP011990,                AKA,     32,"))


('AL011851', 'AL', 'Atlantic', '1', '1851', '', '14')
('AL142012', 'AL', 'Atlantic', '14', '2012', 'NADINE', '96')
('EP061950', 'EP', 'East Pacific', '6', '1950', '', '9')
('CP011990', 'CP', 'Central Pacific', '1', '1990', 'AKA', '32')


## Encapsulate Track Parsing

In this case we want the details of the storm track:

* Date and time of observation
* Type of record
* Storm status
* Latitude
* Longitude
* Minimum pressure
* Maximum wind
* Eye radius
* 34kt radius

To clean the parsed strings we remove the leading spaces using `lstrip()` and then set any
missing data flags (`-99` or `-999`) to the empty string using `removesuffix()`.

Latitude is reported in Northings with a suffix `N` for north of the equator and a suffix
`S` for south of the equator. To convert this to decimal, all latitudes south of the equator
need a negative sign.

Likewise, longitude is reported in Eastings with a suffix `E` for east of the zero meridian
and `W` for west of the zero meridian. To convert this to decimal, all longitudes to the west
of the zero meridian need a negative sign.

To accomplish this we will use the
[character translation](https://docs.python.org/3/library/stdtypes.html#str.translate)
method of strings, passing a
[translation table](https://docs.python.org/3/library/stdtypes.html#str.maketrans)
built with the static method `str.maketrans`.

In [8]:
# `maketrans` is a static method of strings. You do not need an actual string to call
# this method. You just need the general class definition. The `type` function is a built-in
# function that returns the type (or class) of the data.
print(type("Aaron") == str)
print(type(10.0))
print(type(10.0) == float)
NORTHINGS = str.maketrans(
    {
        "N": "",
        "S": "-",
        "E": "",
        "W": "-"
    }
)
print(NORTHINGS)
"S01234N567E89W".translate(NORTHINGS)


True
<class 'float'>
True
{78: '', 83: '-', 69: '', 87: '-'}


'-0123456789-'
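
Building on the translation table, one possible way to finish the conversion is to turn each
coordinate into a signed `float`. This is only a sketch: the helper name `signedcoordinate`
is hypothetical, and it assumes the hemisphere letter is always the last character of the
field.

```python
# Same mapping as above: N and E vanish, S and W become a minus sign.
HEMISPHERE = str.maketrans({"N": "", "S": "-", "E": "", "W": "-"})


def signedcoordinate(field):
    """Convert a field such as ' 155.4W' to a signed decimal number."""
    field = field.strip()
    # Split the magnitude from the trailing hemisphere letter.
    magnitude, letter = field[:-1], field[-1:]
    sign = letter.translate(HEMISPHERE)
    return float(f"{sign}{magnitude}")


print(signedcoordinate(" 19.2N"))
print(signedcoordinate(" 155.4W"))
print(signedcoordinate(" 178.5E"))
```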

In [10]:
def parsetrack(line):
    """
    Extract and tidy-up individual observation records. Returns the details of the
    observations:
    * `observed` - Date and time of observation in ISO 8601 format.
    * `trackcode` - Code of type of observation.
    * `tracktype` - Decoded type of observation.
    * `stormcode` - Code of type of storm at observation.
    * `stormtype` - Decoded type of storm at observation.
    * `latitude` - Latitude in hemisphere.
    * `longitude` - Longitude in hemisphere.
    * `wind` - Maximum wind speed in kt.
    * `pressure` - Minimum pressure in mbar.
    * `neradius` - Furthest extent of 34 kt winds in the NE quadrant in nmi.
    * `seradius` - Furthest extent of 34 kt winds in the SE quadrant in nmi.
    * `swradius` - Furthest extent of 34 kt winds in the SW quadrant in nmi.
    * `nwradius` - Furthest extent of 34 kt winds in the NW quadrant in nmi.
    * `eyeradius` - Radius of maximum wind speed in nmi.
    """
    TRACK = {
        "C": "Closest to Coast",
        "G": "Genesis", 
        "I": "Peak Pressure and Wind",
        "L": "Landfall",
        "P": "Minimum Pressure",
        "R": "Intensity Details",
        "S": "Storm Type Change", 
        "T": "Position Details",
        "W": "Maximum Wind" 
    }
    STORM = {
        "TD": "Tropical Depression",
        "TS": "Tropical Storm",
        "HU": "Tropical Hurricane",
        "EX": "Extratropical Cyclone",
        "SD": "Subtropical Depression",
        "SS": "Subtropical Storm",
        "LO": "Low Pressure",
        "WV": "Tropical Wave",
        "DB": "Disturbance",
        "TY": "Typhoon",
        "ST": "Subtropical Typhoon",
        "ET": "Extratropical Typhoon",
        "PT": "Extratropical Low"
    }
    HEMISPHERE = str.maketrans(
        {
            "N": "",
            "S": "-",
            "E": "",
            "W": "-"
        }
    )
    year = line[0:4]
    month = line[4:6]
    day = line[6:8]
    hour = line[10:12]
    minute = line[12:14]
    trackcode = line[15:17].lstrip()
    tracktype = TRACK.get(trackcode, "")
    stormcode = line[18:21].lstrip()
    stormtype = STORM.get(stormcode, "")
    latitude = line[22:27].lstrip().lstrip("-")
    northing = line[27:28].translate(HEMISPHERE)
    longitude = line[29:35].lstrip().lstrip("-")
    easting = line[35:36].translate(HEMISPHERE)
    wind = line[37:41].lstrip().removesuffix("-99")
    pressure = line[42:47].lstrip().removesuffix("-999")
    neradius = line[48:53].lstrip().removesuffix("-999")
    seradius = line[54:59].lstrip().removesuffix("-999")
    swradius = line[60:65].lstrip().removesuffix("-999")
    nwradius = line[66:71].lstrip().removesuffix("-999")
    eyeradius = line[120:125].lstrip().removesuffix("-999")

    # Send. Serialize the time in ISO 8601 format.
    return (
        f"{year}-{month}-{day}T{hour}:{minute}Z",
        trackcode,
        tracktype,
        stormcode,
        stormtype,

        # To Do: Format Northing/Southing
        f"{northing}{latitude}",

        # To Do: Format Easting/Westing
        f"{easting}{longitude}",

        wind,
        pressure,
        neradius,
        seradius,
        swradius,
        nwradius,
        eyeradius
    )

### Test the Track Parsing

In [11]:
print(parsetrack("20140808, 1230, L, TS, 19.2N, 155.4W,  50, 1001,  100,   60,    0,   90,   40,    0,    0,   20,    0,    0,    0,    0, -999"))
print(parsetrack("20230831, 1800,  , TD, 17.0N,  27.0W,  30, 1006,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,   50"))
print(parsetrack("20230901, 0600,  , TS, 32.9N,  52.4W,  55,  996,   30,   30,   20,   20,   20,   20,    0,    0,    0,    0,    0,    0,   10"))
print(parsetrack("20150713, 0600,  , TS, 13.7N, 178.5E,  55,  982,   50,   30,   30,   45,   30,    0,    0,   30,    0,    0,    0,    0, -999"))

('2014-08-08T12:30Z', 'L', 'Landfall', 'TS', 'Tropical Storm', '19.2', '-155.4', '50', '1001', '100', '60', '0', '90', '')
('2023-08-31T18:00Z', '', '', 'TD', 'Tropical Depression', '17.0', '-27.0', '30', '1006', '0', '0', '0', '0', '50')
('2023-09-01T06:00Z', '', '', 'TS', 'Tropical Storm', '32.9', '-52.4', '55', '996', '30', '30', '20', '20', '10')
('2015-07-13T06:00Z', '', '', 'TS', 'Tropical Storm', '13.7', '178.5', '55', '982', '50', '30', '30', '45', '')


## Encapsulate Processing

We guard the file processing in a function so that all the resources are released cleanly
when we are done processing.
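
As a minimal sketch of what the `with` statement guarantees, the example below writes and
reads a throwaway file and checks that each file object is closed as soon as its block ends.
The temporary directory and the file name `example.txt` are placeholders for the
illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")

    # The file is closed when the block ends, even if an error is raised.
    with open(path, "w") as outfile:
        outfile.write("first\nsecond\n")
    print(outfile.closed)

    with open(path, "r") as infile:
        lines = [line for line in infile]
    print(infile.closed, lines)
```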

To provide a progress prompt we are going to use the `end` keyword argument of the `print`
function and specify that the end is a carriage return `\r` (back to beginning of same line).
```python
print(f"Processed: {counter}", end = "\r")
```

We need to accomplish the following, for each file:

1. Distinguish between headers and tracks.
2. If it is a storm header, save the values for use on the storm track records.
3. If it is a storm track, extract the values.
4. Write the storm header and storm track to an output file as a new line.
5. Close everything.
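
The steps above can be sketched on a tiny in-memory example. To stay self-contained, this
sketch uses `io.StringIO` in place of real files (so there is nothing to close) and a
deliberately simplified test for headers (a line ending in a comma). The sample lines are
shortened and are not real HURDAT records.

```python
import io

# Shortened, made-up records: a header ends with a comma, a track does not.
source = io.StringIO(
    "AL011851,UNNAMED,2,\n"
    "18510625,HU\n"
    "18510626,TS\n"
    "AL021851,UNNAMED,1,\n"
    "18510705,HU\n"
)
target = io.StringIO()

identifier = ""
for line in source:
    fields = line.rstrip("\n").split(",")
    if line.endswith(",\n"):
        # Header: remember the storm so it can be repeated on each track.
        identifier = fields[0]
    else:
        # Track: write the remembered header value plus the track values.
        target.write(f"\"{identifier}\",\"{fields[0]}\",\"{fields[1]}\"\n")

print(target.getvalue(), end="")
```

The important detail is that `identifier` lives outside the loop body, so the value saved
from a header is still available when the following track lines are written.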

We will also create counters to report the data processing progress.

**Note: Python line iterators in `for` loops include the newline character `\n`.**
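
A quick way to see this, using an in-memory file from `io.StringIO` so that nothing needs to
exist on disk:

```python
import io

infile = io.StringIO("first\nsecond\n")
for line in infile:
    # `repr` makes the trailing newline visible.
    print(repr(line), repr(line.rstrip("\n")))
```

This is why the regular expressions in this notebook end with `\n`.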

In [12]:
# Our program will process each file identically, so we put the processing in a loop
# that loops through each file.
SOURCES = [
    "../data/hurdat2-atl.txt",
    "../data/hurdat2-nepac.txt"
]
for source in SOURCES:
    print(source)

../data/hurdat2-atl.txt
../data/hurdat2-nepac.txt


In [13]:
def processhurdat():
    """
    Process both HURDAT2 files into a single file `hurdat2.csv`. Initially all options are
    hardcoded. We can change these to arguments later.
    """

    # Note that we are using the UNIX/LINUX file separator `/`. If you use the Windows
    # separator `\`, make sure to double it as `\\` to prevent escape sequence mistakes.
    SOURCES = [
        "../data/hurdat2-atl.txt",
        "../data/hurdat2-nepac.txt"
    ]
    TARGET = "../data/hurdat2.csv"
    COLUMNS = (
        "\"Identifier\"," +
        "\"Basin Code\"," +
        "\"Basin Name\"," +
        "\"Storm Number\"," +
        "\"Season Year\"," +
        "\"Storm Name\"," +
        "\"Tracks\"," +
        "\"Observed\"," +
        "\"Track Code\"," +
        "\"Track Type\"," +
        "\"Storm Code\"," +
        "\"Storm Type\"," +
        "\"Latitude\"," +
        "\"Longitude\"," +
        "\"Maximum Wind (kt)\"," +
        "\"Minimum Pressure (mbar)\"," +
        "\"NE Radius (nmi)\"," +
        "\"SE Radius (nmi)\"," +
        "\"SW Radius (nmi)\"," +
        "\"NW Radius (nmi)\"," +
        "\"Eye Radius (nmi)\"\n"
    )

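    # Open the target for writing. Mode "w" replaces any earlier output, so running the
    # function again does not append a second copy of the column names and data.
    with open(TARGET, "w") as outfile: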
        print(f"Writing: {TARGET}")
        outfile.write(COLUMNS)

        # Open each file for reading
        for source in SOURCES:
            print(f"Reading: {source}")

            # Reset progress counters
            reads = 0
            storms = 0
            tracks = 0
            filestart = time.time_ns()
        
            # Loop through each record
            with open(source, "r") as infile:
                for line in infile:

                    # reads = 1 + reads is the same as below
                    reads += 1

                    # What type of line is it
                    foundheader = isheader(line)

                    # Store the storm identifiers of a header. Mirror explicit check.
                    if foundheader is True:
                        storms += 1
                        (
                            identifier,
                            basincode,
                            basinname,
                            stormnumber,
                            seasonyear,
                            stormname,
                            stormtracks
                        ) = parseheader(line)

                    # Write a storm track. Use explicit Boolean check to avoid Falsy.
                    elif foundheader is False:
                        tracks += 1
                        (
                            observed,
                            trackcode,
                            tracktype,
                            stormcode,
                            stormtype,
                            latitude,
                            longitude,
                            wind,
                            pressure,
                            neradius,
                            seradius,
                            swradius,
                            nwradius,
                            eyeradius
                        ) = parsetrack(line)
                        outfile.write(
                            f"\"{identifier}\"," +
                            f"\"{basincode}\"," +
                            f"\"{basinname}\"," +
                            f"\"{stormnumber}\"," +
                            f"\"{seasonyear}\"," +
                            f"\"{stormname}\"," +
                            f"\"{stormtracks}\"," +
                            f"\"{observed}\"," +
                            f"\"{trackcode}\"," +
                            f"\"{tracktype}\"," +
                            f"\"{stormcode}\"," +
                            f"\"{stormtype}\"," +
                            f"\"{latitude}\"," +
                            f"\"{longitude}\"," +
                            f"\"{wind}\"," +
                            f"\"{pressure}\"," +
                            f"\"{neradius}\"," +
                            f"\"{seradius}\"," +
                            f"\"{swradius}\"," +
                            f"\"{nwradius}\"," +
                            f"\"{eyeradius}\"\n"
                        )

                    # Update progress every 10000 reads
                    if reads % 10000 == 1:
                        print(
                            f"Reads {reads} = Storms {storms} + Tracks {tracks}",
                            end = "\r",
                            flush = True
                        )
            print(
                f"Reads {reads} = Storms {storms} + Tracks {tracks}",
                end = "\r",
                flush = True
            )
            fileend = time.time_ns()
            fileduration = (fileend - filestart) / 1000000000
            readrate = (fileend - filestart) / (1000 * reads)
            print()
            print(f"{fileduration:.3f}(s) {readrate:.3f}(\u03BCs/r)")

    # Send
    print("Done")

### Run the Processing

We can quickly test our progress so far.

In [14]:
processhurdat()

Writing: ../data/hurdat2.csv
Reading: ../data/hurdat2-atl.txt
Reads 56722 = Storms 1973 + Tracks 54748
2.162(s) 38.118(μs/r)
Reading: ../data/hurdat2-nepac.txt
Reads 32407 = Storms 1227 + Tracks 31179
1.269(s) 39.170(μs/r)
Done


### Optimizations

There are three optimizations we can do to speed up the processing:

1. Batch the terminal updates into groups of 1000 to 10000.
2. Move the constants out of the functions. Currently the constants are re-allocated on
every function call. We can define them once outside the function and either pass them in as
an argument to the function, or read them from the enclosing scope (`nonlocal` is only needed
when the function rebinds the name).
3. Move the inner code of the `for` loop into its own locally defined function.
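
As a rough sketch of the second optimization, the lookup table can be allocated once outside
the function and then either read from the enclosing scope or passed in as an argument. The
helper `basinname` and its `basins` parameter are hypothetical names for this illustration.

```python
# Allocated once when the module (or notebook cell) runs.
BASIN = {
    "AL": "Atlantic",
    "EP": "East Pacific",
    "CP": "Central Pacific"
}


def basinname(identifier, basins=BASIN):
    """Decode the basin from the first two characters of the identifier."""
    return basins.get(identifier[0:2], "")


# Uses the shared table by default.
print(basinname("AL011851"))

# A different table can be injected, which is handy when testing.
print(basinname("XX011851", basins={"XX": "Test Basin"}))
```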

According to the Python documentation, files opened with `open()` are buffered by default,
so line-by-line file reading is already an optimized stream, and
[writing](https://docs.python.org/3/library/io.html#io.RawIOBase.write) is buffered before
it reaches the OS.
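
If a larger buffer is wanted, the built-in `open` accepts a `buffering` argument giving the
buffer size in bytes. The sketch below writes to a temporary file with a 1 MiB buffer. The
size and the file name are arbitrary choices for illustration, and whether this helps depends
on the system, so it is worth timing before adopting it.

```python
import os
import tempfile

BUFFERSIZE = 1024 * 1024  # 1 MiB, an arbitrary size for the example

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "buffered.csv")

    # Writes collect in memory and reach the file in larger chunks.
    with open(path, "w", buffering=BUFFERSIZE) as outfile:
        for number in range(1000):
            outfile.write(f"\"{number}\"\n")

    # Count the lines to confirm that everything was flushed on close.
    with open(path, "r") as infile:
        written = sum(1 for line in infile)

print(written)
```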

We can time our code to the nanosecond in Python by reading the
[system clock time](https://docs.python.org/3/library/time.html#time.time_ns) as an integer
number of nanoseconds. The resolution actually achieved depends on the platform.
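
For measuring elapsed time specifically, the standard library also offers
`time.perf_counter_ns`, a clock intended for timing intervals. A minimal sketch, with a
made-up workload standing in for the file processing:

```python
import time

start = time.perf_counter_ns()

# Stand-in workload for the example.
total = sum(range(100000))

elapsed = time.perf_counter_ns() - start
print(f"{elapsed / 1000000000:.6f}(s)")
```

Only the difference between two readings of this clock is meaningful, which is all that is
needed for a duration.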

Python allows multiple ways of writing string literals with either single or double quotes.

In [None]:
a = '""'
b = "\"\""
print(a)
print(b)
print(a == b)
print(False is not None)
print(None is False)

### Dependency Injection and Cancellation Tokens

In [None]:
def factory():
    """
    Creates the engine and protects the scope of the state of the engine between repeated
    calls of the engine.
    """
    state = False
    def engine():
        """
        Updates the state and returns whether to continue.
        """
        nonlocal state
        return state
    return engine


def executor(engine):
    """
    Runs the engine and cancels the execution.
    """
    def proceed():
        """
        Sets the conditions to cancel execution.
        """
        return False
    while proceed() and engine():
        pass

# The factory produces an engine that the executor repeatedly calls.
executor(factory())
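
To see the pattern do some work, here is a variation in which the engine keeps a countdown in
its enclosed state and the executor stops once the engine reports that it is finished. This
is a sketch only: the names `countdownfactory` and `run`, and the starting value, are
hypothetical.

```python
def countdownfactory(start):
    """Create an engine whose remaining count is private to the closure."""
    remaining = start

    def engine():
        """Consume one step and return whether there was work to do."""
        nonlocal remaining
        if remaining <= 0:
            return False
        remaining -= 1
        return True

    return engine


def run(engine):
    """Call the engine until it signals that execution should stop."""
    steps = 0
    while engine():
        steps += 1
    return steps


print(run(countdownfactory(3)))
```

Here `nonlocal` is required because `engine` rebinds `remaining`. Each call to
`countdownfactory` creates a fresh, independent counter.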
