Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement design doc changes #128

Merged
merged 55 commits into from
Jun 18, 2024
Merged

Conversation

wagnerlmichael
Copy link
Member

@wagnerlmichael wagnerlmichael commented May 31, 2024

This PR handles #125 and #124.

Essentially I take the existing system and make changes such that it fits with these issues and the new design doc.

The biggest change here is that we remove sv_outlier_type and introduce 3 columns:

  • sv_outlier_reason1
  • sv_outlier_reason2
  • sv_outlier_reason3

In the existing set up, if we have an outlier that is classified , then sv_outlier_type() must be one of the following:

  • A price outlier defined by the dev_bounds: [2, 2] definition in params.yaml
  • Above price outlier + a characteristic reason determined in the outlier_type() function in glue/flagging_script_glue/flagging.py.
  • One of the ptax questions flagged and a price deviation from ptax_sd: [1, 1]defined in params.yaml

The following changes are implemented in order to maintain the existing specs and bring in the new structure. In sv_outlier_type() the price and characteristic outlier types are split into char_conditions/char_labels and price_conditions/price_labels.

For the three sv_outlier_reason$n columns, we populate them with the following order of priority

  1. PTAX-203 Exclusion
  2. Price label
  3. Char label

Then, we assign the observation as an outlier only if sv_is_outlier1 is a price label or a ptax label. This allows us to retain sales validation information even if we don't want to technically classify the sale as an outlier. Here are some scenarios:

Scenario 1:

  • Outlier Classification: True
  • sv_outlier_reason1 - High Price
  • sv_outlier_reason2 - High Price (sqft)
  • sv_outlier_reason3 - Non-person sale

Scenario 2:

  • Outlier Classification: False
  • sv_outlier_reason1 - Non-person sale
  • sv_outlier_reason2 - Statistical anomaly
  • sv_outlier_reason3 - null

@wagnerlmichael wagnerlmichael marked this pull request as ready for review June 4, 2024 22:14
Copy link
Collaborator

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start! Sounds like we missed the Price swing / home flip characteristic, so I'll hold off on approving until that's in and I can take a look at the whole thing.

glue/flagging_script_glue/flagging.py Outdated Show resolved Hide resolved
glue/flagging_script_glue/flagging.py Outdated Show resolved Hide resolved
glue/flagging_script_glue/flagging.py Outdated Show resolved Hide resolved
glue/flagging_script_glue/flagging.py Outdated Show resolved Hide resolved
glue/flagging_script_glue/flagging.py Outdated Show resolved Hide resolved
glue/sales_val_flagging.py Outdated Show resolved Hide resolved
glue/flagging_script_glue/flagging.py Outdated Show resolved Hide resolved
@wagnerlmichael
Copy link
Member Author

Placing a template for an overarching outlier type data structure here. We can come back to this whenever we do a large re-factor of the pipeline.

OUTLIER_TYPES = [
    {
        "colname": "sv_ind_ptax_flag_w_deviation",
        "label": "PTAX-203 Exclusion",
        "condition": {
            # TODO: include cast to int
            "res": lambda df, context: df["ptax_flag_original"] & (
                (df[f"sv_price_deviation_{context['group_string']}"] >= ptax_sd[1])
                | (df[f"sv_price_deviation_{group_string}"] <= -ptax_sd[0])
                | (df[f"sv_price_per_sqft_deviation_{group_string}"] >= ptax_sd[1])
                | (df[f"sv_price_per_sqft_deviation_{group_string}"] <= -ptax_sd[0])
            ),
           "condo": lambda df: df["ptax_flag_original"] & (
                (df[f"sv_price_deviation_{group_string}"] >= ptax_sd[1])
                | (df[f"sv_price_deviation_{group_string}"] <= -ptax_sd[0])
            )
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_high_price",
        "label": "High price",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("High")
                & (df["sv_which_price"].str.contains("raw"))
            ),
            "condo": lambda df: df["sv_pricing"].str.contains("High"),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_low_price",
        "label": "Low price",
        "condition": {
            "res": (
                lambda df: df["sv_pricing"].str.contains("Low")
                & (df["sv_which_price"].str.contains("raw"))
            ),
            "condo":lambda df: df["sv_pricing"].str.contains("Low"),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_high_price_sqft",
        "label": "High price per square foot",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("High"))
                & (df["sv_which_price"].str.contains("sqft")
            ),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_low_price_sqft",
        "label": "Low price per square foot",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("Low"))
            & (df["sv_which_price"].str.contains("sqft")
            )
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_char_short_term_owner",
        "label": "Short-term owner",
        "condition": lambda df: df["sv_short_owner"] == "Short-term owner",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_family_sale",
        "label": "Family sale",
        "condition": lambda df: df["sv_name_match"] != "No match",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_non_person_sale",
        "label": "Non-person sale",
        "condition": lambda df: df[["sv_buyer_category", "sv_seller_category"]].eq("legal_entity").any(axis=1),
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_statistical_anomaly",
        "label": "Statistical Anomaly",
        "condition": lambda df: df["sv_anomaly"] == "Outlier",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_price_swing_homeflip",
        "label": "Price swing / Home flip",
        "condition":  lambda df: df["sv_pricing"].str.contains("High price swing")
        | df["sv_pricing"].str.contains("Low price swing"),
        "determines_outlier": False,
    },
]

Copy link
Collaborator

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking excellent! I have a few tiny suggestions for further code simplification, but they're all optional. We just need to get a path forward on the non-outlier-with-flags edge case and then we should be good to go.

glue/sales_val_flagging.py Outdated Show resolved Hide resolved
glue/sales_val_flagging.py Outdated Show resolved Hide resolved
glue/sales_val_flagging.py Outdated Show resolved Hide resolved
wagnerlmichael and others added 7 commits June 11, 2024 13:31
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Comment on lines 176 to 199

group_thresh_price_fix = [
"sv_ind_price_high_price"
"sv_ind_price_low_price"
"sv_ind_price_high_price_sqft"
"sv_ind_price_low_price_sqft"
]

# Using .loc[] to set the desired values for rows meeting the condition
merged_df.loc[condition, "sv_outlier_type"] = "Not outlier"
merged_df.loc[condition, "sv_is_outlier"] = 0
def fill_outlier_reasons(row):
reason_idx = 1
for reason in outlier_type_crosswalk:
# Add a check to ensure that only specific reasons are added if _merge is not 'both'
if (
reason in row
and row[reason]
and not (row["_merge"] == "both" and reason in group_thresh_price_fix)
):
row[f"sv_outlier_reason{reason_idx}"] = outlier_type_crosswalk[reason]
if reason_idx >= 3:
break
reason_idx += 1
return row

df = df.apply(fill_outlier_reasons, axis=1)
Copy link
Member Author

@wagnerlmichael wagnerlmichael Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we I handled the implementation of not counting the price outliers but retaining the char outliers given a group threshold violation.

The row["_merge"] == "both" check checks for the group thresh violation, and the reason in group_thresh_price_fix checks to see if we are looking at a price indicator column. So it basically says:

if not (below threshold and price outlier) then implement this iteration in the loop, which means the condition will be filled.

I thought this approach would be better than manually re-arranging at the end of the function.

Comment on lines +178 to +184
"sv_ind_ptax_flag_w_high_price": "High price",
"sv_ind_price_low_price": "Low price",
"sv_ind_ptax_flag_w_low_price": "Low price",
"sv_ind_price_high_price_sqft": "High price per square foot",
"sv_ind_ptax_flag_w_high_price_sqft": "High price per square foot",
"sv_ind_price_low_price_sqft": "Low price per square foot",
"sv_ind_ptax_flag_w_low_price_sqft": "Low price per square foot",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new ptax cols

merged_df.loc[condition, "sv_is_outlier"] = 0
def fill_outlier_reasons(row):
reason_idx = 1
reasons_added = set() # Set to track reasons already added
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The set allows functional iteration over two high price values in the dict

Copy link
Collaborator

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautifully done! A couple of small nitpicks to simplify the code even further, but I think this is basically ready to go.

@@ -102,45 +102,49 @@ def ptax_adjustment(df, groups, ptax_sd, condos: bool):
group_string = "_".join(groups)

if not condos:
df["ptax_flag_w_deviation"] = df["ptax_flag_original"] & (
df["sv_ind_ptax_flag_w_high_price"] = df["ptax_flag_original"] & (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Praise] Nice work on this refactor!

merged_df.loc[condition, "sv_outlier_type"] = "Not outlier"
merged_df.loc[condition, "sv_is_outlier"] = 0
def fill_outlier_reasons(row):
reason_idx = 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nitpick, non-blocking] Now that we're keeping track of the set of reasons_added, I don't think we need this counter anymore -- we can just replace it with references to len(reasons_added). Note that we'd need to be careful with off-by-one errors the context of the outlier_types_crosswalk loop, since we want to set the column name to sv_outlier_reason{len(reasons_added) + 1} and then subsequently check if len(reasons_added) >= 3. (We could also consider just setting reason_idx = len(reasons_added) + 1 before setting sv_outlier_reason{reason_idx} in the loop if the if conditional passes.)

Comment on lines 214 to 215
for reason in outlier_type_crosswalk:
current_reason = outlier_type_crosswalk[reason]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nitpick, non-blocking] Two tiny readability improvements you could make here:

  1. It may not be immediately clear to a reader how current_reason is supposed to differ from reason; perhaps reason_label would be a clearer name for the variable?
  2. Since we're iterating the keys and values of the crosswalk, we can just loop over the crosswalk's items():
for reason, reason_label in outlier_type_crosswalk.items():
    ...

and current_reason
not in reasons_added # Check if the reason is already added
# Apply group threshold logic
and not (row["_merge"] == "both" and reason in group_thresh_price_fix)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question, non-blocking] I don't totally get how this line works, can you break it down for me?

Copy link
Member Author

@wagnerlmichael wagnerlmichael Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if I am misinterpreting anything here of if you have suggestions to improve

    def fill_outlier_reasons(row):
        reasons_added = set()  # Set to track reasons already added

        for reason_ind_col in outlier_type_crosswalk:
            current_reason = outlier_type_crosswalk[reason_ind_col]
            # Add a check to ensure that only specific reasons are added if _merge is not 'both'
            if (
                reason_ind_col in row
                and row[reason_ind_col]
                and current_reason
                not in reasons_added  # Check if the reason is already added
                # Apply group threshold logic
                and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix)
            ):
                row[f"sv_outlier_reason{len(reasons_added) + 1}"] = current_reason
                reasons_added.add(current_reason)  # Add current reason to the set
                if len(reasons_added) >= 3:
                    break

        return row

Condition 1

reason_ind_col in row: this checks if the column exists in the dataset. This check is necessary because the sqft columns won't exist in the condos data

Condition 2

and row[reason_ind_col]: this is the check for whether the ind col is true

Condition 3

This checks to make sure that the column doesn't already exist in the reasons_added_set. Given that it is a set, it is kind of a redundant operation, but does it make sense to keep this for computation reduction?

and current_reason
not in reasons_added

Also I think it acts as a good check for row[f"sv_outlier_reason{len(reasons_added) + 1}"] = current_reason to make sure the column doesn't get populated with an incorrect reason name.

Condition 4

and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix

This condition makes sure a price value won't be assigned to an obs with a group of N < group_thresh

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these responses!

[Condition 3] Given that it is a set, it is kind of a redundant operation, but does it make sense to keep this for computation reduction?

I think it makes sense to keep it -- the set will automatically deduplicate indicator values, but we still want to make sure we don't add duplicate values to row[f"sv_outlier_reason{len(reasons_added) + 1}"].

[Condition 4] This condition makes sure a price value won't be assigned to an obs with a group of N < group_thresh

Got it, that makes sense! Do you mind if we adjust the comment to make this a little bit clearer?

            if (
                reason_ind_col in row
                and row[reason_ind_col]
                and current_reason
                not in reasons_added  # Check if the reason is already added
                # Apply group threshold logic: `row["_merge"]` will be `both` when the group threshold
                # is met, but only price indicators (`group_thresh_price_fix`) should use this threshold,
                # since ptax indicators have thresholds built-in
                and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix)
            ):

manual_flagging/yaml/inputs.yaml Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants