Implement design doc changes #128
Conversation
This is a great start! Sounds like we missed the Price swing / home flip
characteristic, so I'll hold off on approving until that's in and I can take a look at the whole thing.
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Placing a template for an overarching outlier type data structure here. We can come back to this whenever we do a large refactor of the pipeline.
OUTLIER_TYPES = [
    {
        "colname": "sv_ind_ptax_flag_w_deviation",
        "label": "PTAX-203 Exclusion",
        "condition": {
            # TODO: include cast to int
            # ptax_sd is assumed to be in scope; group_string is passed in
            # via the context dict for both property classes
            "res": lambda df, context: df["ptax_flag_original"]
            & (
                (df[f"sv_price_deviation_{context['group_string']}"] >= ptax_sd[1])
                | (df[f"sv_price_deviation_{context['group_string']}"] <= -ptax_sd[0])
                | (df[f"sv_price_per_sqft_deviation_{context['group_string']}"] >= ptax_sd[1])
                | (df[f"sv_price_per_sqft_deviation_{context['group_string']}"] <= -ptax_sd[0])
            ),
            "condo": lambda df, context: df["ptax_flag_original"]
            & (
                (df[f"sv_price_deviation_{context['group_string']}"] >= ptax_sd[1])
                | (df[f"sv_price_deviation_{context['group_string']}"] <= -ptax_sd[0])
            ),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_high_price",
        "label": "High price",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("High")
                & df["sv_which_price"].str.contains("raw")
            ),
            "condo": lambda df: df["sv_pricing"].str.contains("High"),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_low_price",
        "label": "Low price",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("Low")
                & df["sv_which_price"].str.contains("raw")
            ),
            "condo": lambda df: df["sv_pricing"].str.contains("Low"),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_high_price_sqft",
        "label": "High price per square foot",
        "condition": {
            # No condo condition: the sqft columns don't exist for condos
            "res": lambda df: (
                df["sv_pricing"].str.contains("High")
                & df["sv_which_price"].str.contains("sqft")
            ),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_price_low_price_sqft",
        "label": "Low price per square foot",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("Low")
                & df["sv_which_price"].str.contains("sqft")
            ),
        },
        "determines_outlier": True,
    },
    {
        "colname": "sv_ind_char_short_term_owner",
        "label": "Short-term owner",
        "condition": lambda df: df["sv_short_owner"] == "Short-term owner",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_family_sale",
        "label": "Family sale",
        "condition": lambda df: df["sv_name_match"] != "No match",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_non_person_sale",
        "label": "Non-person sale",
        "condition": lambda df: df[["sv_buyer_category", "sv_seller_category"]]
        .eq("legal_entity")
        .any(axis=1),
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_statistical_anomaly",
        "label": "Statistical Anomaly",
        "condition": lambda df: df["sv_anomaly"] == "Outlier",
        "determines_outlier": False,
    },
    {
        "colname": "sv_ind_char_price_swing_homeflip",
        "label": "Price swing / Home flip",
        "condition": lambda df: df["sv_pricing"].str.contains("High price swing")
        | df["sv_pricing"].str.contains("Low price swing"),
        "determines_outlier": False,
    },
]
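As a sketch of how such a template might eventually be consumed, the driver below evaluates each entry's condition against a DataFrame. Note that apply_outlier_types and its context argument are hypothetical, not existing pipeline code; only the entry shape mirrors the template above.

```python
import pandas as pd

# Hypothetical driver for the OUTLIER_TYPES template: each entry's condition
# is evaluated against the DataFrame to produce a 0/1 indicator column.
def apply_outlier_types(df, outlier_types, property_class, context=None):
    for ot in outlier_types:
        cond = ot["condition"]
        # A condition may be a dict keyed by property class or a bare lambda
        if isinstance(cond, dict):
            cond = cond.get(property_class)
            if cond is None:
                continue  # e.g. sqft conditions are absent for condos
        try:
            result = cond(df, context)  # lambdas that accept a context dict
        except TypeError:
            result = cond(df)  # lambdas that only take the DataFrame
        df[ot["colname"]] = result.astype(int)
    return df

# Minimal demo entry mirroring the "High price" type in the template
demo_types = [
    {
        "colname": "sv_ind_price_high_price",
        "label": "High price",
        "condition": {
            "res": lambda df: (
                df["sv_pricing"].str.contains("High")
                & df["sv_which_price"].str.contains("raw")
            ),
        },
        "determines_outlier": True,
    }
]

df = pd.DataFrame(
    {"sv_pricing": ["High price", "Normal price"], "sv_which_price": ["raw", "raw"]}
)
df = apply_outlier_types(df, demo_types, "res")
```

One design note: dispatching on the "res"/"condo" key centralizes the residential-versus-condo split, which is currently spread across separate condition/label lists.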
Looking excellent! I have a few tiny suggestions for further code simplification, but they're all optional. We just need to get a path forward on the non-outlier-with-flags edge case and then we should be good to go.
glue/sales_val_flagging.py
Outdated
group_thresh_price_fix = [
    "sv_ind_price_high_price",
    "sv_ind_price_low_price",
    "sv_ind_price_high_price_sqft",
    "sv_ind_price_low_price_sqft",
]

# Using .loc[] to set the desired values for rows meeting the condition
merged_df.loc[condition, "sv_outlier_type"] = "Not outlier"
merged_df.loc[condition, "sv_is_outlier"] = 0

def fill_outlier_reasons(row):
    reason_idx = 1
    for reason in outlier_type_crosswalk:
        # Add a check to ensure that only specific reasons are added
        # if _merge is not 'both'
        if (
            reason in row
            and row[reason]
            and not (row["_merge"] == "both" and reason in group_thresh_price_fix)
        ):
            row[f"sv_outlier_reason{reason_idx}"] = outlier_type_crosswalk[reason]
            if reason_idx >= 3:
                break
            reason_idx += 1
    return row

df = df.apply(fill_outlier_reasons, axis=1)
This is how I handled the implementation of not counting the price outliers but retaining the char outliers given a group threshold violation.
The row["_merge"] == "both" check detects the group threshold violation, and the reason in group_thresh_price_fix check determines whether we are looking at a price indicator column. So it basically says: if not (below threshold and price outlier), then run this iteration of the loop, which means the reason gets filled in.
I thought this approach would be better than manually re-arranging at the end of the function.
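For reference, the _merge column described here comes from pandas' merge indicator. A toy illustration (the frames and the threshold semantics here are simplified stand-ins, not the real pipeline tables):

```python
import pandas as pd

# Toy illustration of the pandas merge indicator behind the
# row["_merge"] == "both" check. Rows whose key appears in both frames
# get _merge == "both"; unmatched left rows get "left_only".
sales = pd.DataFrame({"group": ["A", "A", "B"], "price": [100, 110, 500]})
flagged_groups = pd.DataFrame({"group": ["A"]})  # hypothetical lookup table

merged = sales.merge(flagged_groups, on="group", how="left", indicator=True)
```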
"sv_ind_ptax_flag_w_high_price": "High price",
"sv_ind_price_low_price": "Low price",
"sv_ind_ptax_flag_w_low_price": "Low price",
"sv_ind_price_high_price_sqft": "High price per square foot",
"sv_ind_ptax_flag_w_high_price_sqft": "High price per square foot",
"sv_ind_price_low_price_sqft": "Low price per square foot",
"sv_ind_ptax_flag_w_low_price_sqft": "Low price per square foot",
Added new ptax cols
merged_df.loc[condition, "sv_is_outlier"] = 0

def fill_outlier_reasons(row):
    reasons_added = set()  # Set to track reasons already added
The set allows the loop to iterate over the two high-price keys in the dict without duplicating their shared label
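A minimal illustration of that deduplication: the raw and ptax high-price indicator keys both map to the "High price" label, and the set keeps the label from being written twice.

```python
# Two indicator columns share the "High price" label; the set ensures the
# label is only recorded once even when both indicators are true.
outlier_type_crosswalk = {
    "sv_ind_price_high_price": "High price",
    "sv_ind_ptax_flag_w_high_price": "High price",
}

row = {"sv_ind_price_high_price": True, "sv_ind_ptax_flag_w_high_price": True}
reasons_added = set()
labels = []
for col, label in outlier_type_crosswalk.items():
    if row.get(col) and label not in reasons_added:
        labels.append(label)
        reasons_added.add(label)
```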
Beautifully done! A couple of small nitpicks to simplify the code even further, but I think this is basically ready to go.
@@ -102,45 +102,49 @@ def ptax_adjustment(df, groups, ptax_sd, condos: bool):
     group_string = "_".join(groups)

     if not condos:
-        df["ptax_flag_w_deviation"] = df["ptax_flag_original"] & (
+        df["sv_ind_ptax_flag_w_high_price"] = df["ptax_flag_original"] & (
[Praise] Nice work on this refactor!
glue/sales_val_flagging.py
Outdated
merged_df.loc[condition, "sv_outlier_type"] = "Not outlier"
merged_df.loc[condition, "sv_is_outlier"] = 0

def fill_outlier_reasons(row):
    reason_idx = 1
[Nitpick, non-blocking] Now that we're keeping track of the set of reasons_added, I don't think we need this counter anymore -- we can just replace it with references to len(reasons_added). Note that we'd need to be careful with off-by-one errors in the context of the outlier_type_crosswalk loop, since we want to set the column name to sv_outlier_reason{len(reasons_added) + 1} and then subsequently check if len(reasons_added) >= 3. (We could also consider just setting reason_idx = len(reasons_added) + 1 before setting sv_outlier_reason{reason_idx} in the loop if the if conditional passes.)
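A counter-free version along the lines suggested above might look like this. This is only a sketch: the real function takes just the row, with outlier_type_crosswalk and group_thresh_price_fix in the enclosing scope; they are parameters here only to keep the example self-contained.

```python
# Sketch of fill_outlier_reasons without the reason_idx counter, deriving
# the column index from len(reasons_added) as suggested in the review.
def fill_outlier_reasons(row, outlier_type_crosswalk, group_thresh_price_fix):
    reasons_added = set()
    for reason_ind_col, reason_label in outlier_type_crosswalk.items():
        if (
            reason_ind_col in row
            and row[reason_ind_col]
            and reason_label not in reasons_added
            and not (
                row["_merge"] == "both"
                and reason_ind_col in group_thresh_price_fix
            )
        ):
            # len(reasons_added) is the count BEFORE adding, so +1 gives
            # the next 1-indexed reason column
            row[f"sv_outlier_reason{len(reasons_added) + 1}"] = reason_label
            reasons_added.add(reason_label)
            if len(reasons_added) >= 3:
                break
    return row

crosswalk = {"sv_ind_a": "High price", "sv_ind_b": "Low price"}
out = fill_outlier_reasons(
    {"sv_ind_a": True, "sv_ind_b": True, "_merge": "left_only"}, crosswalk, []
)
```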
glue/sales_val_flagging.py
Outdated
for reason in outlier_type_crosswalk:
    current_reason = outlier_type_crosswalk[reason]
[Nitpick, non-blocking] Two tiny readability improvements you could make here:
- It may not be immediately clear to a reader how current_reason is supposed to differ from reason; perhaps reason_label would be a clearer name for the variable?
- Since we're iterating the keys and values of the crosswalk, we can just loop over the crosswalk's items():

for reason, reason_label in outlier_type_crosswalk.items():
    ...
glue/sales_val_flagging.py
Outdated
and current_reason
not in reasons_added  # Check if the reason is already added
# Apply group threshold logic
and not (row["_merge"] == "both" and reason in group_thresh_price_fix)
[Question, non-blocking] I don't totally get how this line works, can you break it down for me?
Let me know if I am misinterpreting anything here or if you have suggestions to improve.
def fill_outlier_reasons(row):
    reasons_added = set()  # Set to track reasons already added
    for reason_ind_col in outlier_type_crosswalk:
        current_reason = outlier_type_crosswalk[reason_ind_col]
        # Add a check to ensure that only specific reasons are added
        # if _merge is not 'both'
        if (
            reason_ind_col in row
            and row[reason_ind_col]
            and current_reason
            not in reasons_added  # Check if the reason is already added
            # Apply group threshold logic
            and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix)
        ):
            row[f"sv_outlier_reason{len(reasons_added) + 1}"] = current_reason
            reasons_added.add(current_reason)  # Add current reason to the set
            if len(reasons_added) >= 3:
                break
    return row
Condition 1: reason_ind_col in row checks whether the column exists in the dataset. This check is necessary because the sqft columns won't exist in the condos data.

Condition 2: and row[reason_ind_col] checks whether the indicator column is true.

Condition 3: and current_reason not in reasons_added makes sure the reason isn't already in the reasons_added set. Given that it is a set, it is kind of a redundant operation, but does it make sense to keep this for computation reduction? I also think it acts as a good check for row[f"sv_outlier_reason{len(reasons_added) + 1}"] = current_reason, to make sure the column doesn't get populated with an incorrect reason name.

Condition 4: and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix) makes sure a price value won't be assigned to an observation whose group has N < group_thresh.
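The interaction of Conditions 3 and 4 can be exercised on toy rows. This is a simplified stand-in using plain dicts (the real code applies the function across a DataFrame with df.apply); the crosswalk entries are illustrative.

```python
# Toy check of Condition 4: with _merge == "both", a price indicator listed
# in group_thresh_price_fix is suppressed, while a characteristic reason
# still gets recorded. With "left_only", the price reason survives.
outlier_type_crosswalk = {
    "sv_ind_price_high_price": "High price",
    "sv_ind_char_non_person_sale": "Non-person sale",
}
group_thresh_price_fix = ["sv_ind_price_high_price"]

def reasons_for(row):
    reasons_added = set()
    out = {}
    for col, label in outlier_type_crosswalk.items():
        if (
            col in row
            and row[col]
            and label not in reasons_added
            and not (row["_merge"] == "both" and col in group_thresh_price_fix)
        ):
            out[f"sv_outlier_reason{len(reasons_added) + 1}"] = label
            reasons_added.add(label)
    return out

suppressed = reasons_for(
    {"sv_ind_price_high_price": True, "sv_ind_char_non_person_sale": True, "_merge": "both"}
)
kept = reasons_for(
    {"sv_ind_price_high_price": True, "sv_ind_char_non_person_sale": True, "_merge": "left_only"}
)
```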
Thanks for these responses!

[Condition 3] "Given that it is a set, it is kind of a redundant operation, but does it make sense to keep this for computation reduction?" I think it makes sense to keep it -- the set will automatically deduplicate indicator values, but we still want to make sure we don't add duplicate values to row[f"sv_outlier_reason{len(reasons_added) + 1}"].

[Condition 4] "This condition makes sure a price value won't be assigned to an obs with a group of N < group_thresh" Got it, that makes sense! Do you mind if we adjust the comment to make this a little bit clearer?
if (
    reason_ind_col in row
    and row[reason_ind_col]
    and current_reason
    not in reasons_added  # Check if the reason is already added
    # Apply group threshold logic: `row["_merge"]` will be `both` when the group threshold
    # is met, but only price indicators (`group_thresh_price_fix`) should use this threshold,
    # since ptax indicators have thresholds built-in
    and not (row["_merge"] == "both" and reason_ind_col in group_thresh_price_fix)
):
This PR handles #125 and #124. Essentially I take the existing system and make changes such that it fits with these issues and the new design doc.

The biggest change here is that we remove sv_outlier_type and introduce 3 columns:
- sv_outlier_reason1
- sv_outlier_reason2
- sv_outlier_reason3

In the existing setup, if we have an outlier that is classified, then sv_outlier_type() must conform to the following:
- the dev_bounds: [2, 2] definition in params.yaml
- the outlier_type() function in glue/flagging_script_glue/flagging.py
- the ptax_sd: [1, 1] definition in params.yaml

The following changes are implemented in order to maintain the existing specs and bring in the new structure. In sv_outlier_type() the price and characteristic outlier types are split into char_conditions/char_labels and price_conditions/price_labels.

For the three sv_outlier_reason$n columns, we populate them with the following order of priority. Then, we assign the observation as an outlier only if sv_outlier_reason1 is a price label or a ptax label. This allows us to retain sales validation information even if we don't want to technically classify the sale as an outlier. Here are some scenarios:

Scenario 1:
- sv_outlier_reason1 - High Price
- sv_outlier_reason2 - High Price (sqft)
- sv_outlier_reason3 - Non-person sale

Scenario 2:
- sv_outlier_reason1 - Non-person sale
- sv_outlier_reason2 - Statistical anomaly
- sv_outlier_reason3 - null
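Under this scheme, the classification step for the two scenarios could be derived roughly as below. This is a sketch: PRICE_AND_PTAX_LABELS is a hypothetical set standing in for the real price/ptax label lists, and the toy frame encodes the two scenarios above.

```python
import pandas as pd

# Hypothetical set of labels that actually classify a sale as an outlier;
# characteristic labels (e.g. "Non-person sale") are informational only.
PRICE_AND_PTAX_LABELS = {
    "High price",
    "Low price",
    "High price per square foot",
    "Low price per square foot",
    "PTAX-203 Exclusion",
}

# Scenario 1 (row 0) and Scenario 2 (row 1) from the PR description
df = pd.DataFrame(
    {
        "sv_outlier_reason1": ["High price", "Non-person sale"],
        "sv_outlier_reason2": ["High price per square foot", "Statistical anomaly"],
        "sv_outlier_reason3": ["Non-person sale", None],
    }
)

# A sale is an outlier only when its FIRST reason is a price or ptax label
df["sv_is_outlier"] = df["sv_outlier_reason1"].isin(PRICE_AND_PTAX_LABELS).astype(int)
```

So Scenario 1 is classified as an outlier while Scenario 2 is not, even though both retain their reason columns.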