# Breaking Out of the Loop
## Refactoring Legacy Software with Polars

In this quick tutorial, we'll build a very limited version of NOAA's Global Summary of the Month (GSOM). By the end, you should have a handy cookbook for whipping up solutions to common data-crunching requirements.


### Part 1 - Reading Your Input
***

Let's start by importing Polars. We'll also import a library called itables to make exploring our tables easier in the notebook.

In [1]:
import polars as pl
#import itables
#itables.init_notebook_mode(connected=False)

For today's example, we'll be using real NOAA data. More specifically, we'll be pulling station files from the Global Historical Climatology Network's daily dataset (GHCN-Daily). These are plain-text "flat files" with fixed column widths. Each daily value occupies 5 characters and is followed by measurement, QC, and source flags, each taking up a single character:

<img src="imgs/dly.png" alt="image" width="500" height="auto">

Unfortunately, Polars doesn't have a native method for reading fixed-width files, so I'm including a utility class here. You'll find it in the GitHub repo as **InputOutputUtils.py**.

In [9]:
import InputOutputUtils as io
from importlib import reload
reload(io) #Re-import the module so recent edits to InputOutputUtils.py are picked up
station_file = "stations/USW00094789.dly"
df = io.dlyAsDataFrame( station_file )
df.head()
#show(df, layout={"topStart": None, "topEnd": None})

STATION,Element,0_measure,0_source,1_measure,1_source,2_measure,2_source,3_measure,3_source,4_measure,4_source,5_measure,5_source,6_measure,6_source,7_measure,7_source,8_measure,8_source,9_measure,9_source,10_measure,10_source,11_measure,11_source,12_measure,12_source,13_measure,13_source,14_measure,14_source,15_measure,15_source,16_measure,16_source,17_measure,17_source,18_measure,18_source,19_measure,19_source,20_measure,20_source,21_measure,21_source,22_measure,22_source,23_measure,23_source,24_measure,24_source,25_measure,25_source,26_measure,26_source,27_measure,27_source,28_measure,28_source,29_measure,29_source,30_measure,30_source,daily_values,qc_flags,DATE,YEAR_MONTH,days_in_month
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,list[f32],list[str],date,str,i32
"""USW00094789""","""TMAX""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","[-9999.0, -9999.0, … 317.0]","["""", """", … """"]",1948-07-01,"""194807""",31
"""USW00094789""","""TMIN""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","[-9999.0, -9999.0, … 233.0]","["""", """", … """"]",1948-07-01,"""194807""",31
"""USW00094789""","""PRCP""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""X""","""T""","""X""","""T""","""X""","""""","""X""","""""","""X""","""T""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""T""","""X""","""""","""X""","""T""","""X""","""T""","""X""","[-9999.0, -9999.0, … 0.0]","["""", """", … """"]",1948-07-01,"""194807""",31
"""USW00094789""","""SNOW""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","[-9999.0, -9999.0, … 0.0]","["""", """", … """"]",1948-07-01,"""194807""",31
"""USW00094789""","""SNWD""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","""""","""X""","[-9999.0, -9999.0, … 0.0]","["""", """", … """"]",1948-07-01,"""194807""",31




What's happening inside of InputOutputUtils is outside the scope of this talk, but it's a useful class. Shoutout to (GET NAME) over on StackOverflow for posting the snippet I used to create it. The important thing is, we've converted our fixed-width flat file into a Polars DataFrame.

### Part 2 - Data Crunching
***

The real GSOM can output more than 63 possible element columns. In this quick demonstration, our program will produce only 5 of them:
* Average Minimum Temperature
* Average Maximum Temperature
* Average Temperature
* Cooling Degree Days
* Heating Degree Days

#### 2A - Filtering
***

To create these elements, we only need to keep rows where Element is equal to TMIN or TMAX. Let's go ahead and filter for those elements:


In [10]:
#Create a list of the elements we want to keep
needed = ["TMIN", "TMAX"]

#Keep only rows where "Element" matches a string in this list
df = df.filter(
    pl.col("Element").is_in(needed)
)

df.select(["STATION", "Element", "DATE", "daily_values", "qc_flags"])

STATION,Element,DATE,daily_values,qc_flags
str,str,date,list[f32],list[str]
"""USW00094789""","""TMAX""",1948-07-01,"[-9999.0, -9999.0, … 317.0]","["""", """", … """"]"
"""USW00094789""","""TMIN""",1948-07-01,"[-9999.0, -9999.0, … 233.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",1948-08-01,"[233.0, 261.0, … 283.0]","["""", """", … """"]"
"""USW00094789""","""TMIN""",1948-08-01,"[211.0, 200.0, … 172.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",1948-09-01,"[256.0, 250.0, … -9999.0]","["""", """", … """"]"
…,…,…,…,…
"""USW00094789""","""TMIN""",2025-01-01,"[50.0, 17.0, … 44.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",2025-02-01,"[100.0, 39.0, … -9999.0]","["""", """", … """"]"
"""USW00094789""","""TMIN""",2025-02-01,"[-49.0, -71.0, … -9999.0]","["""", """", … """"]"
"""USW00094789""","""TMAX""",2025-03-01,"[194.0, 33.0, … -9999.0]","["""", """", … """"]"


If you compare the shape of the DataFrame before and after filtering, you'll see our row count has dropped from 15,584 rows to 1,842. You don't necessarily need to drive yourself nuts trying to whittle down the size of your dataset, but if it's more than eight times the size it needs to be, your speed will suffer.


#### 2B - Data Validation
***

In [4]:
#TODO Remove bad data


#### 2C - Easy Requirements
***

In [5]:
#TODO Run .mean() on TMIN and TMAX

#### 2D - Challenging Requirements
***

In [6]:
#TODO Create TAVG with the "free real estate" method

#### 2E - "Tap Out" Requirements
***
OK, so you really tried your best to use native Polars or pandas expressions, but neither you nor the AI can get the right output. What now? Now we're going to create a custom, just-in-time (JIT) compiled function using Numba.

In [6]:
#TODO Create CLDD & HTDD with a guvectorize function

### Part 3 - Creating the Output CSV
***

In [7]:
#TODO Pivot the dataframe to the final structure

In [8]:
#TODO Write the CSV to local disk