# Filtering Data Based on Criteria

In this lesson, we will use a small, simple, older dataset of weather projections for Chapel Hill covering Thursday, March 25th, through Saturday, April 3rd, 2021. Each row is the projection for the next day in that timeframe.

Our analysis goal is to find the average temperatures on days when rain is unlikely (a chance of rain below 30%).

We will consider approaching this problem from a column-oriented perspective.
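
To make "column-oriented" concrete, here is a minimal sketch that stores the same three days in two layouts, using values taken from the first three days of the dataset introduced next. The names `row_sample` and `col_sample` are illustrative assumptions and are not used elsewhere in the lesson.

```python
# Row-oriented: one dict per day; each dict holds every field for that day.
row_sample: list[dict[str, float]] = [
    {"high": 77, "low": 67, "rain": 0.3},
    {"high": 84, "low": 51, "rain": 0.2},
    {"high": 78, "low": 64, "rain": 0.4},
]

# Column-oriented: one list per field; index i in every list is the same day.
col_sample: dict[str, list[float]] = {
    "high": [77, 84, 78],
    "low": [67, 51, 64],
    "rain": [0.3, 0.2, 0.4],
}

# The second day's chance of rain, reached through each layout.
print(row_sample[1]["rain"])
print(col_sample["rain"][1])
```

In the column-oriented layout, filtering and averaging a single field only requires walking one list, which is the approach the rest of this lesson takes.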

First, let's look at our dataset.

In [1]:
col_data: dict[str, list[float]] = {
    "high": [77, 84, 78, 79, 65, 67, 74, 61, 55, 61],
    "low":  [67, 51, 64, 45, 43, 53, 56, 37, 34, 42],
    "rain": [.3, .2, .4, .8, 0., .2, .4, .5, .1, .1]
}

col_data

{'high': [77, 84, 78, 79, 65, 67, 74, 61, 55, 61],
 'low': [67, 51, 64, 45, 43, 53, 56, 37, 34, 42],
 'rain': [0.3, 0.2, 0.4, 0.8, 0.0, 0.2, 0.4, 0.5, 0.1, 0.1]}

In [2]:
print(col_data["rain"])

[0.3, 0.2, 0.4, 0.8, 0.0, 0.2, 0.4, 0.5, 0.1, 0.1]


## Produce a "Mask" Based on Criteria

In [4]:
def less_than(col: list[float], threshold: float) -> list[bool]:
    """Return a mask that is True wherever the value in col is below threshold."""
    result: list[bool] = []
    for each in col:
        # Append exactly one boolean per item so the mask lines up with col.
        if each < threshold:
            result.append(True)
        else:
            result.append(False)

    return result

# Example testing call:
less_than(col_data["rain"], 0.3)
no_rain_mask: list[bool] = less_than(col_data["rain"], 0.3)
print(no_rain_mask)

[False, True, False, False, True, True, False, False, True, True]
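
A mask holds one boolean per input value, so its length always equals the length of the column it was built from. As a minimal sketch, the same mask can be built with a list comprehension; `less_than_comprehension` is a hypothetical name used only for this illustration, and the rain values are copied from the dataset above.

```python
def less_than_comprehension(col: list[float], threshold: float) -> list[bool]:
    """Return True at each position where the value is strictly below threshold."""
    return [value < threshold for value in col]

rain: list[float] = [.3, .2, .4, .8, 0., .2, .4, .5, .1, .1]
mask: list[bool] = less_than_comprehension(rain, 0.3)

print(mask)
# [False, True, False, False, True, True, False, False, True, True]
print(len(mask) == len(rain))
# True
```

Note that the first day (0.3) is excluded because the comparison is strictly less than the threshold.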


## Masked

In [7]:
def masked(col: list[float], mask: list[bool]) -> list[float]:
    result: list[float] = []
    for i in range(len(mask)):
        if mask[i]:
            result.append(col[i])
    return result

print(col_data["rain"])
print(no_rain_mask)
print(col_data["high"])
highs_of_no_rain_days: list[float] = masked(col_data["high"], no_rain_mask)

[0.3, 0.2, 0.4, 0.8, 0.0, 0.2, 0.4, 0.5, 0.1, 0.1]
[False, True, False, False, True, True, False, False, True, True]
[77, 84, 78, 79, 65, 67, 74, 61, 55, 61]
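
Applying a mask keeps only the values whose matching mask entry is `True`. One possible alternative, sketched here under the assumption that the column and mask must have equal lengths, uses `itertools.compress` from the standard library. The name `masked_compress` is hypothetical; the values are copied from the dataset above.

```python
from itertools import compress

def masked_compress(col: list[float], mask: list[bool]) -> list[float]:
    """Keep the items of col whose corresponding mask entry is True."""
    # Assumption for this sketch: a length mismatch signals a bug upstream.
    if len(col) != len(mask):
        raise ValueError("col and mask must have the same length")
    return list(compress(col, mask))

highs: list[float] = [77, 84, 78, 79, 65, 67, 74, 61, 55, 61]
no_rain: list[bool] = [False, True, False, False, True, True, False, False, True, True]

print(masked_compress(highs, no_rain))
# [84, 65, 67, 55, 61]
```

The explicit length check is a design choice for this sketch: it surfaces a mask that does not line up with its column instead of silently dropping data.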


## Compute the Average

In [9]:
def average(col: list[float]) -> float:
    """Return the arithmetic mean of col (col must not be empty)."""
    total: float = 0
    for each in col:
        total += each

    return total / len(col)

average(highs_of_no_rain_days)

66.4
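
An average is the sum of the values divided by how many there are, so it is undefined for an empty list; that case arises whenever no day passes the filter. The sketch below is one way to handle it, using `statistics.fmean` from the standard library. The helper name `average_or_none` and the choice to return `None` are assumptions made for illustration.

```python
from statistics import fmean
from typing import Optional

def average_or_none(col: list[float]) -> Optional[float]:
    """Return the mean of col, or None when col is empty."""
    if not col:
        return None
    return fmean(col)

# Highs on the days whose chance of rain is below 0.3, from the dataset above.
dry_day_highs: list[float] = [84, 65, 67, 55, 61]

print(average_or_none(dry_day_highs))
# 66.4
print(average_or_none([]))
# None
```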