# Breaking Out of the Loop
## Refactoring Legacy Software with Polars

In this quick tutorial, we'll build a very limited version of NOAA's Global Summary of the Month (GSOM). By the end, you should have a handy cookbook for whipping up solutions to common data-crunching requirements.


### Part 1 - Reading Your Input
***

Let's start by importing Polars. Optionally, you can also import a library called itables to make exploring tables easier in the notebook; those lines are commented out below.

In [1]:
import polars as pl
#import itables
#itables.init_notebook_mode(connected=False)

For today's example, we'll be using real NOAA data. More specifically, we'll be pulling station files from the Global Historical Climatology Network daily (GHCN-Daily) dataset. These are just plain text "flat files" with fixed column widths. Each daily value is 5 characters wide and is followed by measurement, QC, and source flags, each taking up a single character:

<img src="imgs/dly.png" alt="image" width="500" height="auto">

Unfortunately, Polars doesn't have a native method for reading fixed-width files, so I'm including a utility class here. You'll find it in the GitHub repo as **InputOutputUtils.py**.

In [2]:
import InputOutputUtils as io
from importlib import reload
reload(io) #Re-import the module so recent edits to it are picked up
station_file = "stations/USW00094789.dly"
df = io.dlyAsDataFrame( station_file )
df.head()
#show(df, layout={"topStart": None, "topEnd": None})

| STATION | DATE | Element | daily_values | qc_flags | days_in_month |
| --- | --- | --- | --- | --- | --- |
| str | date | str | list[f32] | list[str] | i32 |
| "USW00094789" | 1948-07-01 | "TMAX" | [-9999.0, -9999.0, … 317.0] | ["", "", … ""] | 31 |
| "USW00094789" | 1948-07-01 | "TMIN" | [-9999.0, -9999.0, … 233.0] | ["", "", … ""] | 31 |
| "USW00094789" | 1948-07-01 | "PRCP" | [-9999.0, -9999.0, … 0.0] | ["", "", … ""] | 31 |
| "USW00094789" | 1948-07-01 | "SNOW" | [-9999.0, -9999.0, … 0.0] | ["", "", … ""] | 31 |
| "USW00094789" | 1948-07-01 | "SNWD" | [-9999.0, -9999.0, … 0.0] | ["", "", … ""] | 31 |




What's happening inside of InputOutputUtils is outside the scope of this talk, but it's a useful class. Shoutout to (GET NAME) over on Stack Overflow for posting the snippet I used to create it. The important thing is, we've converted our fixed-width flat file into a Polars DataFrame.

### Part 2 - Data Crunching
***

The real GSOM puts out more than 63 possible element-columns. In this quick demonstration, our program will only put out 5 elements:
* Average Minimum Temperature
* Average Maximum Temperature
* Average Temperature
* Cooling Degree Days
* Heating Degree Days

#### 2A - Filtering
***

To create these elements, we only need the rows where Element is TMIN or TMAX. Let's go ahead and filter for those elements:


In [43]:
#Create a list of the elements we want to keep
needed = ["TMIN", "TMAX"]

#Only keep rows where "Element" matches a string in this list
df = df.filter(
    pl.col("Element").is_in(needed)
)

df.select(["STATION", "Element", "DATE", "daily_values", "qc_flags"]).head()

| STATION | Element | DATE | daily_values | qc_flags |
| --- | --- | --- | --- | --- |
| str | str | date | list[f32] | list[str] |
| "USW00094789" | "TMAX" | 1948-08-01 | [233.0, 261.0, … 283.0] | ["", "", … ""] |
| "USW00094789" | "TMIN" | 1948-08-01 | [211.0, 200.0, … 172.0] | ["", "", … ""] |
| "USW00094789" | "TMAX" | 1948-09-01 | [256.0, 250.0, … -9999.0] | ["", "", … ""] |
| "USW00094789" | "TMIN" | 1948-09-01 | [144.0, 150.0, … -9999.0] | ["", "", … ""] |
| "USW00094789" | "TMAX" | 1948-10-01 | [256.0, 261.0, … 200.0] | ["", "", … ""] |


If you look at the shape of the DataFrame before and after, you'll see our row count has dropped from 15,584 rows to 1,842. You don't necessarily need to drive yourself nuts trying to whittle down the size of your dataset, but if it's more than eight times the size it needs to be, your speed will suffer.


#### 2B - Data Validation
***

Before we run our calculations, let's first delete any invalid data in our daily_values lists. In the real GSOM, we'd remove any daily values where the corresponding QC flag wasn't an empty string. For the purposes of this demo, we're only going to remove the values that are equal to -9999.0.

In [36]:

df = df.with_columns(
        #We could have simply replaced daily_values, but we're not done with it yet!
        filtered_values = pl.col("daily_values").list.eval(
            pl.element().filter(pl.element() != -9999.0)
        )
    )

df.select(
    ["DATE", "Element", "daily_values", "filtered_values"]
    ).head()

| DATE | Element | daily_values | filtered_values |
| --- | --- | --- | --- |
| date | str | list[f32] | list[f32] |
| 1948-08-01 | "TMAX" | [233.0, 261.0, … 283.0] | [233.0, 261.0, … 283.0] |
| 1948-08-01 | "TMIN" | [211.0, 200.0, … 172.0] | [211.0, 200.0, … 172.0] |
| 1948-09-01 | "TMAX" | [256.0, 250.0, … -9999.0] | [256.0, 250.0, … 239.0] |
| 1948-09-01 | "TMIN" | [144.0, 150.0, … -9999.0] | [144.0, 150.0, … 167.0] |
| 1948-10-01 | "TMAX" | [256.0, 261.0, … 200.0] | [256.0, 261.0, … 200.0] |


Once we've filtered our daily_values, we can subtract the length of each filtered list from the days_in_month column to count the missing days. We'll then filter out any rows with 4 or more missing days.

In [40]:
df = df.with_columns(
        missing_days = pl.col("days_in_month") - pl.col("filtered_values").list.len()
    ).filter(
        pl.col("missing_days") < 4
    )

df.select(
    ["DATE", "Element", "daily_values", "filtered_values", "missing_days"]
    ).head()

| DATE | Element | daily_values | filtered_values | missing_days |
| --- | --- | --- | --- | --- |
| date | str | list[f32] | list[f32] | i64 |
| 1948-08-01 | "TMAX" | [233.0, 261.0, … 283.0] | [233.0, 261.0, … 283.0] | 0 |
| 1948-08-01 | "TMIN" | [211.0, 200.0, … 172.0] | [211.0, 200.0, … 172.0] | 0 |
| 1948-09-01 | "TMAX" | [256.0, 250.0, … -9999.0] | [256.0, 250.0, … 239.0] | 0 |
| 1948-09-01 | "TMIN" | [144.0, 150.0, … -9999.0] | [144.0, 150.0, … 167.0] | 0 |
| 1948-10-01 | "TMAX" | [256.0, 261.0, … 200.0] | [256.0, 261.0, … 200.0] | 0 |


#### 2C - Easy Requirements
***
Now that we have a cleaned-up version of our daily_values, we can create a monthly summary for Average Minimum Temperature and Average Maximum Temperature. We'll also need to multiply the result by 0.1, because GHCN-Daily stores temperatures in tenths of a degree Celsius. Luckily, we can handle this all in a single expression!

In [41]:
#Average each month's filtered values, then convert tenths of a degree C to degrees C
df = df.with_columns(
    value = pl.col("filtered_values").list.mean() * 0.1
)

df.select(["DATE", "Element", "filtered_values", "value"]).head()

| DATE | Element | filtered_values | value |
| --- | --- | --- | --- |
| date | str | list[f32] | f32 |
| 1948-08-01 | "TMAX" | [233.0, 261.0, … 283.0] | 27.812902 |
| 1948-08-01 | "TMIN" | [211.0, 200.0, … 172.0] | 19.493549 |
| 1948-09-01 | "TMAX" | [256.0, 250.0, … 239.0] | 24.969999 |
| 1948-09-01 | "TMIN" | [144.0, 150.0, … 167.0] | 15.313334 |
| 1948-10-01 | "TMAX" | [256.0, 261.0, … 200.0] | 17.670967 |


#### 2D - Challenging Requirements
***
We've successfully run calculations on our base elements and decided which values to discard using only native Polars expressions. But what if a requirement depends on two base elements at once? For example, GSOM also produces Average Temperature by combining Average Maximum Temperature and Average Minimum Temperature.

In [42]:
#TODO Create TAVG with the "free real estate" method

#### 2E - "Tap Out" Requirements
***
OK, so you really tried your best to use native Polars or Pandas expressions, but neither you nor the AI can get the right output. What now? Now we're going to create a custom, just-in-time (JIT) compiled function using Numba.

In [7]:
#TODO Create CLDD & HTDD with a guvectorize function

### Part 3 - Creating the Output CSV
***

In [8]:
#TODO Pivot the dataframe to the final structure

In [9]:
#TODO Write to CSV to local