Record doc numbers along with PINs when calculating comps #246

@jeancochrane jeancochrane commented Jun 13, 2024

This PR updates the comps calculation step in the interpret stage to save comp document numbers along with PINs. With document numbers, we can update downstream code that consumes these comps to uniquely tie each comp to a specific sale, instead of having to infer the sale based only on the comp's PIN.
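
To illustrate why the document number matters downstream, here is a minimal sketch with made-up data. The `sales` and `comp` tables and all of their column names are hypothetical and are not taken from the pipeline; the only point is that a PIN can match several sales while a document number matches one.

```r
library(dplyr)

# Made-up sales: PIN "A" sold twice, so a PIN alone is ambiguous
sales <- tibble(
  pin = c("A", "A", "B"),
  doc_no = c("doc1", "doc3", "doc2"),
  sale_price = c(100000, 150000, 200000)
)

# One comp, recorded with both its PIN and its document number
comp <- tibble(comp_pin = "A", comp_doc_no = "doc3")

# Joining on PIN alone matches both sales for PIN "A"
by_pin <- inner_join(comp, sales, by = c("comp_pin" = "pin"))

# Joining on the document number picks out exactly one sale
by_doc <- inner_join(comp, sales, by = c("comp_doc_no" = "doc_no"))
```

With PIN alone, downstream code would need an extra rule (for example, taking the most recent sale) to choose between the two matches; with the document number, no such inference is needed.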

I tested this locally on a subset of training/assessment data. The comps aren't great due to the limited size of the data, but it ran reasonably fast and confirmed that this processing code works. I'm happy to share the output of my test, or to kick off a remote comps run if that would make things easier to review. In the meantime, here's a quick peek at the output schema:

> final_comps <- cbind(comps[[1]], comps[[2]])
> final_comps %>% names
 [1] "pin"            "card"           "comp_pin_1"     "comp_pin_2"     "comp_pin_3"     "comp_pin_4"     "comp_pin_5"     "comp_pin_6"    
 [9] "comp_pin_7"     "comp_pin_8"     "comp_pin_9"     "comp_pin_10"    "comp_pin_11"    "comp_pin_12"    "comp_pin_13"    "comp_pin_14"   
[17] "comp_pin_15"    "comp_pin_16"    "comp_pin_17"    "comp_pin_18"    "comp_pin_19"    "comp_pin_20"    "comp_doc_no_1"  "comp_doc_no_2" 
[25] "comp_doc_no_3"  "comp_doc_no_4"  "comp_doc_no_5"  "comp_doc_no_6"  "comp_doc_no_7"  "comp_doc_no_8"  "comp_doc_no_9"  "comp_doc_no_10"
[33] "comp_doc_no_11" "comp_doc_no_12" "comp_doc_no_13" "comp_doc_no_14" "comp_doc_no_15" "comp_doc_no_16" "comp_doc_no_17" "comp_doc_no_18"
[41] "comp_doc_no_19" "comp_doc_no_20" "comp_score_1"   "comp_score_2"   "comp_score_3"   "comp_score_4"   "comp_score_5"   "comp_score_6"  
[49] "comp_score_7"   "comp_score_8"   "comp_score_9"   "comp_score_10"  "comp_score_11"  "comp_score_12"  "comp_score_13"  "comp_score_14" 
[57] "comp_score_15"  "comp_score_16"  "comp_score_17"  "comp_score_18"  "comp_score_19"  "comp_score_20" 
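
For reviewers who want to see how columns like these can be produced, here is a rough sketch of the pattern on toy data. It is not the actual pipeline code: the `comp_idx` table, the `meta_pin` column, and all values are assumptions for illustration. Only `meta_sale_document_num` and the `comp_idx_` prefix come from the diff. It also uses base `sub()` and the documented `{.col}` placeholder rather than `str_remove()` and `{col}`.

```r
library(dplyr)

# Toy training data: one row per sale (values are made up)
training_data <- tibble(
  meta_pin = c("A", "B", "A"),
  meta_sale_document_num = c("doc1", "doc2", "doc3")
)

# Hypothetical comps output: row indices into training_data,
# one row per assessed card and one column per comp rank
comp_idx <- tibble(comp_idx_1 = c(3L, 1L), comp_idx_2 = c(2L, 3L))

comps <- comp_idx %>%
  mutate(
    # Look up the PIN of each comp by its row index
    across(
      starts_with("comp_idx_"),
      \(idx_row) training_data[idx_row, ]$meta_pin,
      .names = "comp_pin_{sub('comp_idx_', '', .col)}"
    ),
    # Look up the sale document number using the same row index
    across(
      starts_with("comp_idx_"),
      \(idx_row) training_data[idx_row, ]$meta_sale_document_num,
      .names = "comp_doc_no_{sub('comp_idx_', '', .col)}"
    )
  ) %>%
  select(-starts_with("comp_idx_"))
```

Because both lookups use the same row index, `comp_pin_N` and `comp_doc_no_N` always describe the same sale.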

Closes https://github.com/ccao-data/pinval/issues/7.

\(idx_row) {
training_data[idx_row, ]$meta_sale_document_num
},
.names = "comp_doc_no_{str_remove(col, 'comp_idx_')}"

@jeancochrane jeancochrane Jun 13, 2024

We refer to this type of data in several different ways: doc_no, instruno, or document_num, depending on the context. What's the best choice for this particular case?

Member

Let's stick with what's used in the model pipeline - document_num.

Contributor Author

Makes sense, done in d510fe7.

@jeancochrane jeancochrane marked this pull request as ready for review June 13, 2024 19:53

@dfsnow dfsnow left a comment

Nice, this should be super helpful. Before we merge, let's also use this as an opportunity to test the Batch action by doing a re-run of the final res model with the updated comps. This will give us the new output and will ensure we haven't busted anything in the interim since we wrapped modeling.

@jeancochrane

@dfsnow I finally got a completed run finished with run ID 2024-06-18-calm-nathan. Want to take a look before I merge?

dfsnow commented Jun 18, 2024

> @dfsnow I finally got a completed run finished with run ID 2024-06-18-calm-nathan. Want to take a look before I merge?

@jeancochrane I took a look and everything looks good. Let's merge it!

@jeancochrane jeancochrane merged commit 4dd0363 into master Jun 18, 2024
10 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/7-update-comps-algorithm-to-save-instruno-in-addition-to-parid branch June 18, 2024 19:10