Implement design doc changes #128
Conversation
This is a great start! Sounds like we missed the Price swing / home flip
characteristic, so I'll hold off on approving until that's in and I can take a look at the whole thing.
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Placing a template for an overarching outlier type data structure here. We can come back to this whenever we do a large refactor of the pipeline.
OUTLIER_TYPES = [
    {
        "colname": "sv_ind_ptax_flag_w_deviation",
        "label": "PTAX-203 Exclusion",
        "condition": {
            # TODO: include cast to int
            # ptax_sd is assumed to be in scope; group_string is passed in
            # via the context dict for both property classes
            "res": lambda df, context: df["ptax_flag_original"]
            & (
                (df[f"sv_price_deviation_{context['group_string']}"] >= ptax_sd[1])
                | (df[f"sv_price_deviation_{context['group_string']}"] <= -ptax_sd[0])
                | (df[f"sv_price_per_sqft_deviation_{context['group_string']}"] >= ptax_sd[1])
                | (df[f"sv_price_per_sqft_deviation_{context['group_string']}"] <= -ptax_sd[0])
            ),
            "condo": lambda df, context: df["ptax_flag_original"]
            & (
                (df[f"sv_price_deviation_{context['group_string']}"] >= ptax_sd[1])
                | (df[f"sv_price_deviation_{context['group_string']}"] <= -ptax_sd[0])
            ),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_high_price",
        "label": "High price",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("High")
                & df["sv_which_price"].str.contains("raw")
            ),
            "condo": lambda df: df["sv_pricing"].str.contains("High"),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_low_price",
        "label": "Low price",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("Low")
                & df["sv_which_price"].str.contains("raw")
            ),
            "condo": lambda df: df["sv_pricing"].str.contains("Low"),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_high_price_sqft",
        "label": "High price per square foot",
        "condition": {
            # No condo condition: the sqft columns don't exist for condos
            "res": lambda df: (
                df["sv_pricing"].str.contains("High")
                & df["sv_which_price"].str.contains("sqft")
            ),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_low_price_sqft",
        "label": "Low price per square foot",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("Low")
                & df["sv_which_price"].str.contains("sqft")
            ),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_char_short_term_owner",
        "label": "Short-term owner",
        "condition": lambda df: df["sv_short_owner"] == "Short-term owner",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_family_sale",
        "label": "Family sale",
        "condition": lambda df: df["sv_name_match"] != "No match",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_non_person_sale",
        "label": "Non-person sale",
        "condition": lambda df: df[["sv_buyer_category", "sv_seller_category"]]
        .eq("legal_entity")
        .any(axis=1),
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_statistical_anomaly",
        "label": "Statistical Anomaly",
        "condition": lambda df: df["sv_anomaly"] == "Outlier",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_price_swing_homeflip",
        "label": "Price swing / Home flip",
        "condition": lambda df: df["sv_pricing"].str.contains("High price swing")
        | df["sv_pricing"].str.contains("Low price swing"),
        "determines_outlier": False,
    },
]
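As a sketch of how such a template might eventually be consumed, the driver below evaluates each entry's condition against a DataFrame. Note that apply_outlier_types and its context argument are hypothetical, not existing pipeline code; only the entry shape mirrors the template above.

```python
import pandas as pd

# Hypothetical driver for the OUTLIER_TYPES template: each entry's condition
# is evaluated against the DataFrame to produce a 0/1 indicator column.
def apply_outlier_types(df, outlier_types, property_class, context=None):
    for ot in outlier_types:
        cond = ot["condition"]
        # A condition may be a dict keyed by property class or a bare lambda
        if isinstance(cond, dict):
            cond = cond.get(property_class)
            if cond is None:
                continue  # e.g. sqft conditions are absent for condos
        try:
            result = cond(df, context)  # lambdas that accept a context dict
        except TypeError:
            result = cond(df)  # lambdas that only take the DataFrame
        df[ot["colname"]] = result.astype(int)
    return df

# Minimal demo entry mirroring the "High price" type in the template
demo_types = [
    {
        "colname": "sv_ind_price_high_price",
        "label": "High price",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("High")
                & df["sv_which_price"].str.contains("raw")
            ),
        },
        "determines_outlier": True,
    }
]

df = pd.DataFrame(
    {"sv_pricing": ["High price", "Normal price"], "sv_which_price": ["raw", "raw"]}
)
df = apply_outlier_types(df, demo_types, "res")
```

One design note: dispatching on the "res"/"condo" key centralizes the residential-versus-condo split, which is currently spread across separate condition/label lists.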
Looking excellent! I have a few tiny suggestions for further code simplification, but they're all optional. We just need to get a path forward on the non-outlier-with-flags edge case and then we should be good to go.
glue/sales_val_flagging.py
Outdated
group_thresh_price_fix = [
    "sv_ind_price_high_price",
    "sv_ind_price_low_price",
    "sv_ind_price_high_price_sqft",
    "sv_ind_price_low_price_sqft",
]

# Using .loc[] to set the desired values for rows meeting the condition
merged_df.loc[condition, "sv_outlier_type"] = "Not outlier"
merged_df.loc[condition, "sv_is_outlier"] = 0

def fill_outlier_reasons(row):
    reason_idx = 1
    for reason in outlier_type_crosswalk:
        # Add a check to ensure that only specific reasons are added
        # if _merge is not 'both'
        if (
            reason in row
            and row[reason]
            and not (row["_merge"] == "both" and reason in group_thresh_price_fix)
        ):
            row[f"sv_outlier_reason{reason_idx}"] = outlier_type_crosswalk[reason]
            if reason_idx >= 3:
                break
            reason_idx += 1
    return row

df = df.apply(fill_outlier_reasons, axis=1)
This is how I handled the implementation of not counting the price outliers but retaining the char outliers given a group threshold violation.
The row["_merge"] == "both" check detects the group threshold violation, and the reason in group_thresh_price_fix check determines whether we are looking at a price indicator column. So it basically says: if not (below threshold and price outlier), then run this iteration of the loop, which means the reason gets filled in.
I thought this approach would be better than manually re-arranging at the end of the function.
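For reference, the _merge column described here comes from pandas' merge indicator. A toy illustration (the frames and the threshold semantics here are simplified stand-ins, not the real pipeline tables):

```python
import pandas as pd

# Toy illustration of the pandas merge indicator behind the
# row["_merge"] == "both" check. Rows whose key appears in both frames
# get _merge == "both"; unmatched left rows get "left_only".
sales = pd.DataFrame({"group": ["A", "A", "B"], "price": [100, 110, 500]})
flagged_groups = pd.DataFrame({"group": ["A"]})  # hypothetical lookup table

merged = sales.merge(flagged_groups, on="group", how="left", indicator=True)
```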
"sv_ind_ptax_flag_w_high_price": "High price",
"sv_ind_price_low_price": "Low price",
"sv_ind_ptax_flag_w_low_price": "Low price",
"sv_ind_price_high_price_sqft": "High price per square foot",
"sv_ind_ptax_flag_w_high_price_sqft": "High price per square foot",
"sv_ind_price_low_price_sqft": "Low price per square foot",
"sv_ind_ptax_flag_w_low_price_sqft": "Low price per square foot",
Added new ptax cols
merged_df.loc[condition, "sv_is_outlier"] = 0

def fill_outlier_reasons(row):
    reasons_added = set()  # Set to track reasons already added
The set allows the loop to iterate over the two high-price keys in the dict without duplicating their shared label
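A minimal illustration of that deduplication: the raw and ptax high-price indicator keys both map to the "High price" label, and the set keeps the label from being written twice.

```python
# Two indicator columns share the "High price" label; the set ensures the
# label is only recorded once even when both indicators are true.
outlier_type_crosswalk = {
    "sv_ind_price_high_price": "High price",
    "sv_ind_ptax_flag_w_high_price": "High price",
}

row = {"sv_ind_price_high_price": True, "sv_ind_ptax_flag_w_high_price": True}
reasons_added = set()
labels = []
for col, label in outlier_type_crosswalk.items():
    if row.get(col) and label not in reasons_added:
        labels.append(label)
        reasons_added.add(label)
```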
Beautifully done! A couple of small nitpicks to simplify the code even further, but I think this is basically ready to go.
@@ -102,45 +102,49 @@ def ptax_adjustment(df, groups, ptax_sd, condos: bool):
     group_string = "_".join(groups)

     if not condos:
-        df["ptax_flag_w_deviation"] = df["ptax_flag_original"] & (
+        df["sv_ind_ptax_flag_w_high_price"] = df["ptax_flag_original"] & (
[Praise] Nice work on this refactor!
glue/sales_val_flagging.py
Outdated
merged_df.loc[condition, "sv_outlier_type"] = "Not outlier"
merged_df.loc[condition, "sv_is_outlier"] = 0

def fill_outlier_reasons(row):
    reason_idx = 1
[Nitpick, non-blocking] Now that we're keeping track of the set of reasons_added, I don't think we need this counter anymore -- we can just replace it with references to len(reasons_added). Note that we'd need to be careful with off-by-one errors in the context of the outlier_type_crosswalk loop, since we want to set the column name to sv_outlier_reason{len(reasons_added) + 1} and then subsequently check if len(reasons_added) >= 3. (We could also consider just setting reason_idx = len(reasons_added) + 1 before setting sv_outlier_reason{reason_idx} in the loop if the if conditional passes.)
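A counter-free version along the lines suggested above might look like this. This is only a sketch: the real function takes just the row, with outlier_type_crosswalk and group_thresh_price_fix in the enclosing scope; they are parameters here only to keep the example self-contained.

```python
# Sketch of fill_outlier_reasons without the reason_idx counter, deriving
# the column index from len(reasons_added) as suggested in the review.
def fill_outlier_reasons(row, outlier_type_crosswalk, group_thresh_price_fix):
    reasons_added = set()
    for reason_ind_col, reason_label in outlier_type_crosswalk.items():
        if (
            reason_ind_col in row
            and row[reason_ind_col]
            and reason_label not in reasons_added
            and not (
                row["_merge"] == "both"
                and reason_ind_col in group_thresh_price_fix
            )
        ):
            # len(reasons_added) is the count BEFORE adding, so +1 gives
            # the next 1-indexed reason column
            row[f"sv_outlier_reason{len(reasons_added) + 1}"] = reason_label
            reasons_added.add(reason_label)
            if len(reasons_added) >= 3:
                break
    return row

crosswalk = {"sv_ind_a": "High price", "sv_ind_b": "Low price"}
out = fill_outlier_reasons(
    {"sv_ind_a": True, "sv_ind_b": True, "_merge": "left_only"}, crosswalk, []
)
```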
glue/sales_val_flagging.py
Outdated
for reason in outlier_type_crosswalk:
    current_reason = outlier_type_crosswalk[reason]
[Nitpick, non-blocking] Two tiny readability improvements you could make here:
- It may not be immediately clear to a reader how current_reason is supposed to differ from reason; perhaps reason_label would be a clearer name for the variable?
- Since we're iterating the keys and values of the crosswalk, we can just loop over the crosswalk's items():

for reason, reason_label in outlier_type_crosswalk.items():
    ...
glue/sales_val_flagging.py
Outdated
and current_reason
not in reasons_added  # Check if the reason is already added
# Apply group threshold logic
and not (row["_merge"] == "both" and reason in group_thresh_price_fix)
[Question, non-blocking] I don't totally get how this line works, can you break it down for me?
Let me know if I am misinterpreting anything here or if you have suggestions to improve.
def fill_outlier_reasons(row):
    reasons_added = set()  # Set to track reasons already added
    for reason_ind_col in outlier_type_crosswalk:
        current_reason = outlier_type_crosswalk[reason_ind_col]
        # Add a check to ensure that only specific reasons are added
        # if _merge is not 'both'
        if (
            reason_ind_col in row
            and row[reason_ind_col]
            and current_reason
            not in reasons_added  # Check if the reason is already added
            # Apply group threshold logic
            and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix)
        ):
            row[f"sv_outlier_reason{len(reasons_added) + 1}"] = current_reason
            reasons_added.add(current_reason)  # Add current reason to the set
            if len(reasons_added) >= 3:
                break
    return row
Condition 1: reason_ind_col in row checks whether the column exists in the dataset. This check is necessary because the sqft columns won't exist in the condos data.

Condition 2: and row[reason_ind_col] checks whether the indicator column is true.

Condition 3: and current_reason not in reasons_added makes sure the reason isn't already in the reasons_added set. Given that it is a set, it is kind of a redundant operation, but does it make sense to keep this for computation reduction? I also think it acts as a good check for row[f"sv_outlier_reason{len(reasons_added) + 1}"] = current_reason, to make sure the column doesn't get populated with an incorrect reason name.

Condition 4: and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix) makes sure a price value won't be assigned to an observation whose group has N < group_thresh.
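The interaction of Conditions 3 and 4 can be exercised on toy rows. This is a simplified stand-in using plain dicts (the real code applies the function across a DataFrame with df.apply); the crosswalk entries are illustrative.

```python
# Toy check of Condition 4: with _merge == "both", a price indicator listed
# in group_thresh_price_fix is suppressed, while a characteristic reason
# still gets recorded. With "left_only", the price reason survives.
outlier_type_crosswalk = {
    "sv_ind_price_high_price": "High price",
    "sv_ind_char_non_person_sale": "Non-person sale",
}
group_thresh_price_fix = ["sv_ind_price_high_price"]

def reasons_for(row):
    reasons_added = set()
    out = {}
    for col, label in outlier_type_crosswalk.items():
        if (
            col in row
            and row[col]
            and label not in reasons_added
            and not (row["_merge"] == "both" and col in group_thresh_price_fix)
        ):
            out[f"sv_outlier_reason{len(reasons_added) + 1}"] = label
            reasons_added.add(label)
    return out

suppressed = reasons_for(
    {"sv_ind_price_high_price": True, "sv_ind_char_non_person_sale": True, "_merge": "both"}
)
kept = reasons_for(
    {"sv_ind_price_high_price": True, "sv_ind_char_non_person_sale": True, "_merge": "left_only"}
)
```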
Thanks for these responses!

[Condition 3] "Given that it is a set, it is kind of a redundant operation, but does it make sense to keep this for computation reduction?" I think it makes sense to keep it -- the set will automatically deduplicate indicator values, but we still want to make sure we don't add duplicate values to row[f"sv_outlier_reason{len(reasons_added) + 1}"].

[Condition 4] "This condition makes sure a price value won't be assigned to an obs with a group of N < group_thresh" Got it, that makes sense! Do you mind if we adjust the comment to make this a little bit clearer?
if (
    reason_ind_col in row
    and row[reason_ind_col]
    and current_reason
    not in reasons_added  # Check if the reason is already added
    # Apply group threshold logic: `row["_merge"]` will be `both` when the group threshold
    # is met, but only price indicators (`group_thresh_price_fix`) should use this threshold,
    # since ptax indicators have thresholds built-in
    and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix)
):
This PR handles #125 and #124. Essentially I take the existing system and make changes such that it fits with these issues and the new design doc.

The biggest change here is that we remove sv_outlier_type and introduce 3 columns:
- sv_outlier_reason1
- sv_outlier_reason2
- sv_outlier_reason3

In the existing setup, if we have an outlier that is classified, then sv_outlier_type() must conform to the following:
- the dev_bounds: [2, 2] definition in params.yaml
- the outlier_type() function in glue/flagging_script_glue/flagging.py
- the ptax_sd: [1, 1] definition in params.yaml

The following changes are implemented in order to maintain the existing specs and bring in the new structure. In sv_outlier_type() the price and characteristic outlier types are split into char_conditions/char_labels and price_conditions/price_labels.

For the three sv_outlier_reason$n columns, we populate them with the following order of priority. Then, we assign the observation as an outlier only if sv_outlier_reason1 is a price label or a ptax label. This allows us to retain sales validation information even if we don't want to technically classify the sale as an outlier. Here are some scenarios:

Scenario 1:
- sv_outlier_reason1 - High Price
- sv_outlier_reason2 - High Price (sqft)
- sv_outlier_reason3 - Non-person sale

Scenario 2:
- sv_outlier_reason1 - Non-person sale
- sv_outlier_reason2 - Statistical anomaly
- sv_outlier_reason3 - null
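Under this scheme, the classification step for the two scenarios could be derived roughly as below. This is a sketch: PRICE_AND_PTAX_LABELS is a hypothetical set standing in for the real price/ptax label lists, and the toy frame encodes the two scenarios above.

```python
import pandas as pd

# Hypothetical set of labels that actually classify a sale as an outlier;
# characteristic labels (e.g. "Non-person sale") are informational only.
PRICE_AND_PTAX_LABELS = {
    "High price",
    "Low price",
    "High price per square foot",
    "Low price per square foot",
    "PTAX-203 Exclusion",
}

# Scenario 1 (row 0) and Scenario 2 (row 1) from the PR description
df = pd.DataFrame(
    {
        "sv_outlier_reason1": ["High price", "Non-person sale"],
        "sv_outlier_reason2": ["High price per square foot", "Statistical anomaly"],
        "sv_outlier_reason3": ["Non-person sale", None],
    }
)

# A sale is an outlier only when its FIRST reason is a price or ptax label
df["sv_is_outlier"] = df["sv_outlier_reason1"].isin(PRICE_AND_PTAX_LABELS).astype(int)
```

So Scenario 1 is classified as an outlier while Scenario 2 is not, even though both retain their reason columns.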