# Breaking Out of the Loop
## Refactoring Legacy Software with Polars

In this quick tutorial, we'll be creating a very limited version of the Global Summary of the Month (GSOM). By the end, you should have a handy cookbook for whipping up solutions to common data-crunching requirements.

Let's start by importing Polars, NumPy, and Numba.

In [3]:
import polars as pl
import numpy
import numba

For today's example, we'll be using real NOAA data. More specifically, we'll be pulling station files from the Global Historical Climatology Network's daily dataset (GHCN-Daily). These are just plain-text "flat files" with fixed column widths.

<img src="imgs/dly.png" alt="image" width="500" height="auto">

Unfortunately, Polars doesn't have a native method for reading fixed-width files, so I'm including a utility module, `InputOutputUtils`, here.

In [7]:
import InputOutputUtils as io
from importlib import reload
reload(io)  # Re-import the module so recent edits to it are picked up
station_file = "stations/USW00094789.dly"
df = io.dlyAsDataFrame( station_file )
df.select(["STATION", "Element", "DATE", "daily_values", "qc_flags"])

STATION,Element,DATE,daily_values,qc_flags
str,str,date,list[f32],list[str]
"""USW00094789""","""TMAX""",1948-07-01,"[-9999.0, -9999.0, … 317.0]","["""", """", … """"]"
"""USW00094789""","""TMIN""",1948-07-01,"[-9999.0, -9999.0, … 233.0]","["""", """", … """"]"
"""USW00094789""","""PRCP""",1948-07-01,"[-9999.0, -9999.0, … 0.0]","["""", """", … """"]"
"""USW00094789""","""SNOW""",1948-07-01,"[-9999.0, -9999.0, … 0.0]","["""", """", … """"]"
"""USW00094789""","""SNWD""",1948-07-01,"[-9999.0, -9999.0, … 0.0]","["""", """", … """"]"
…,…,…,…,…
"""USW00094789""","""TAVG""",2025-03-01,"[93.0, -5.0, … -9999.0]","["""", """", … """"]"
"""USW00094789""","""WDF2""",2025-03-01,"[310.0, 330.0, … -9999.0]","["""", """", … """"]"
"""USW00094789""","""WDF5""",2025-03-01,"[310.0, 330.0, … -9999.0]","["""", """", … """"]"
"""USW00094789""","""WSF2""",2025-03-01,"[165.0, 143.0, … -9999.0]","["""", """", … """"]"


What's happening inside `InputOutputUtils` is outside the scope of this talk, but it's a useful module. Shoutout to (GET NAME) over on Stack Overflow for posting the snippet I used to create it. The important thing is that we've converted our fixed-width flat file into a Polars DataFrame.
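For readers curious about what "fixed-width" means in practice, here is a minimal plain-Python sketch of slicing one record into fields. It is not the utility module used above: `parse_dly_line` is a hypothetical helper, and the column offsets are an assumption based on the layout described in the GHCN-Daily readme (11-character station ID, year, month, element code, then 31 eight-character day groups).

```python
from datetime import date


def parse_dly_line(line: str) -> dict:
    """Split one fixed-width .dly record into named fields (hypothetical helper)."""
    values, qc_flags = [], []
    for day in range(31):
        start = 21 + day * 8  # each day group is assumed to be 8 characters wide
        values.append(float(line[start:start + 5]))          # 5-character value
        qc_flags.append(line[start + 6:start + 7].strip())   # quality flag
    return {
        "STATION": line[0:11],
        "Element": line[17:21],
        "DATE": date(int(line[11:15]), int(line[15:17]), 1),
        "daily_values": values,
        "qc_flags": qc_flags,
    }


# A made-up record: 30 missing days followed by one real reading.
sample = "USW00094789" + "1948" + "07" + "TMAX" + "-9999   " * 30 + "  317  W"
record = parse_dly_line(sample)
print(record["Element"], record["DATE"], record["daily_values"][-1])
```

A list of dictionaries like `record` could then be handed to `pl.DataFrame(...)` to get a table shaped like the one above.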

The real GSOM outputs more than 63 possible element columns. In this quick demonstration, our program will output only five:
* Average Minimum Temperature
* Average Maximum Temperature
* Average Temperature
* Cooling Degree Days
* Heating Degree Days

To create these elements, we only need to keep rows where Element is equal to TMIN or TMAX. Let's go ahead and filter for those rows:

In [8]:
needed = ["TMIN", "TMAX"]
df = df.filter(
    pl.col("Element").is_in(needed)  # keep only rows whose Element is in this list
)

df.select(["STATION", "Element", "DATE", "daily_values", "qc_flags"])

STATION,Element,DATE,daily_values,qc_flags
str,str,date,list[f32],list[str]
"""USW00094789""","""TMAX""",1948-07-01,"[-9999.0, -9999.0, … 317.0]","["""", """", … """"]"
"""USW00094789""","""TMIN""",1948-07-01,"[-9999.0, -9999.0, … 233.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",1948-08-01,"[233.0, 261.0, … 283.0]","["""", """", … """"]"
"""USW00094789""","""TMIN""",1948-08-01,"[211.0, 200.0, … 172.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",1948-09-01,"[256.0, 250.0, … -9999.0]","["""", """", … """"]"
…,…,…,…,…
"""USW00094789""","""TMIN""",2025-01-01,"[50.0, 17.0, … 44.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",2025-02-01,"[100.0, 39.0, … -9999.0]","["""", """", … """"]"
"""USW00094789""","""TMIN""",2025-02-01,"[-49.0, -71.0, … -9999.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",2025-03-01,"[194.0, 33.0, … -9999.0]","["""", """", … """"]"


If you look at the shape of the DataFrames, you'll see our row count has dropped from 15,584 rows to 1,842. You don't necessarily need to drive yourself nuts trying to whittle down the size of your dataset, but if it's more than eight times the size it needs to be, your speed will suffer.

In [9]:
#TODO Remove bad data
#TODO Run .mean() on TMIN and TMAX
#TODO Create TAVG with the "free real estate" method
#TODO Create Numba JIT function for CLDD/HTDD
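Until those steps are filled in, the degree-day arithmetic itself can be sketched in plain Python. This is not the Numba function planned above: `monthly_degree_days` is a hypothetical helper, and it assumes that values are stored in tenths of °C, that -9999 marks a missing day, and that the base temperature is 18.3 °C (65 °F). Whatever completeness rules the real GSOM applies are not modeled here.

```python
MISSING = -9999.0  # missing-value sentinel seen in the daily_values lists
BASE_C = 18.3      # assumed base temperature (65 °F), in °C


def monthly_degree_days(tmax, tmin):
    """Return (cooling, heating) degree days in °C for one month (hypothetical helper)."""
    cooling = heating = 0.0
    for hi, lo in zip(tmax, tmin):
        if hi == MISSING or lo == MISSING:
            continue  # skip days where either reading is missing
        mean_c = (hi + lo) / 2 / 10  # assumes readings are in tenths of °C
        cooling += max(0.0, mean_c - BASE_C)
        heating += max(0.0, BASE_C - mean_c)
    return cooling, heating


# Three made-up days: one missing, one warm, one cold.
cldd, htdd = monthly_degree_days([-9999.0, 317.0, 100.0], [-9999.0, 233.0, -49.0])
print(round(cldd, 2), round(htdd, 2))
```

A per-day loop like this is the sort of code that a JIT compiler such as Numba is designed to speed up, which is why it is written without any DataFrame calls.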
