Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comps check #224

Closed
ccao-jardine opened this issue Mar 25, 2024 · 1 comment
Closed

Comps check #224

ccao-jardine opened this issue Mar 25, 2024 · 1 comment
Assignees

Comments

@ccao-jardine
Copy link
Member

In the Rogers Park desk review spreadsheet, occasionally comp1 and comp2 will share the same PIN and sale price but different comp score. Example is for subject PIN 11304200220000. My hypothesis: these are PINs that sold twice, hence the different comp scores, but the sale price printed in the sheet is the PIN's most recent sale price.

Is that expected/desired behavior, and at what stage of the pipeline does it arise (later for the desk review spreadsheet, or earlier in the comps algorithm)?

Let's identify where this occurs and determine whether it's something to correct.

@ccao-jardine
Copy link
Member Author

Thanks @jeancochrane for investigating. Pasting our discussion here for @julia-klauss and @dfsnow and closing for now.

Query to check dupe comps:

WITH unpivoted_comps AS (
    SELECT pin, comp_pin
    FROM (
        SELECT
            pin,
            ARRAY[
                comp_pin_1, comp_pin_2, comp_pin_3, comp_pin_4, comp_pin_5,
                comp_pin_6, comp_pin_7, comp_pin_8, comp_pin_9, comp_pin_10,
                comp_pin_11, comp_pin_12, comp_pin_13, comp_pin_14, comp_pin_15,
                comp_pin_16, comp_pin_17, comp_pin_18, comp_pin_19, comp_pin_20
            ] AS comp_array
        FROM model.comp
        WHERE run_id = '2024-03-17-stupefied-maya'
    ) AS comp_data
    CROSS JOIN UNNEST(comp_array) AS comp_pin(comp_pin)
)
SELECT pin, COUNT(comp_pin) AS num_comps, COUNT(DISTINCT comp_pin) as num_distinct_comps 
FROM unpivoted_comps
GROUP BY pin
HAVING COUNT(DISTINCT comp_pin) < COUNT(comp_pin);

What we learned:

  • Results suggest that 137,094 / 1,098,988 or about 13% of PINs have a PIN show up more than once in their set of 20 comps
  • But if we restrict the query to just look for dupes in the top two comps (comp_pin_1 and comp_pin_2) we only get 6,204 results, or 0.6% of PINs

Therefore, when analyzing comps, we should take care to not assume distinct rows for PINs in the training data and in the comps. Such that:

  • if a PIN only appears once in a comp set, it's reasonably a safe assumption to select its most recent sale.

  • if a PIN appears more than once in a comp set, assign the most recent sale to the comp with the higher comp_score, then the next lowest comp_score to the next most recent sale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants