New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download Pardot prospect data #34293
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible issue with your exit condition for the loop?
total_results_retrieved = 0 | ||
|
||
# Run repeated requests querying for prospects above our highest known Pardot ID. | ||
loop do | ||
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=email,id&sort_by=id" | ||
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is not currently a problem, but thinking about future-proofing this system: Do we need to somehow url-encode the fields
value? I'm wondering if those field names could ever contain spaces or other characters we need to worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Custom fields in Pardot could have space and parenthesis and more in their names. We will have to url-encode those names if we want to download them. However, the fields we care about now (and in the near future) are all in this format db_*
. I'll add a comment explaining the implicit constraint in field names.
Example of custom fields of a prospect https://pi.pardot.com/prospect/read/id/82839621:
doc = post_with_auth_retry(url) | ||
raise_if_response_error(doc) | ||
|
||
# Pardot returns the count total available prospects (not capped to 200), | ||
# although the data for a max of 200 are contained in the response. | ||
# @see http://developer.pardot.com/kb/api-version-4/prospects/#xml-response-format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
lib/cdo/contact_rollups/v2/pardot.rb
Outdated
log "Retrieved #{results_in_response}/#{total_results} new Pardot IDs. Last Pardot ID = #{last_id}." | ||
|
||
# Stop if all the remaining results were in this response | ||
total_results_retrieved += results_in_response | ||
break if results_in_response == total_results | ||
break if results_in_response == total_results || total_results_retrieved >= limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the case where limit
is nil but there is more than one page of results, shouldn't we be checking total_results_retrieved
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went back to Jeremy's original code to make sure we didn't miss something in translation -- I think this works b/c we keep increasing the last_id
value, and as a result the total_results
decreases each time through the loop until we've hit the end of the road?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! When limit
is nill, this comparison total_results_retrieved >= limit
will raise an exception. Will fix.
Looks great! Love the "travel guide" as well. Two overarching questions:
|
1.a If we want to get more fields and still save them to PardotMemory (overwrite existing data), we can simply add new fields to the list in 1.b If we just want to download new prospect data without overwriting PardotMemory, we can call 2 . My rough estimate is it will takes about 7+ hours to run (5M prospects, 200 prospects/second). I will manually run it for a few days with a limit of 1M prospects/day by calling |
…ids to reflect its close relationship with download_pardot_prospects method.
7d4eec1
to
93c9b31
Compare
|
||
import! batch, | ||
validate: false, | ||
on_duplicate_key_update: [:pardot_id, :pardot_id_updated_at] | ||
end | ||
end | ||
|
||
# Downloads and saves prospect data from Pardot. | ||
# *Warning:* This method overwrites existing data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clarification q: does it wipe the existing data and replace it with whatever is downloaded here or do some sort of partial overwrite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It upserts (inserts + updates) downloaded data into contact_rollups_pardot_memory table. Only the update operation will overwrite existing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me once you've sorted out that loop condition!
total_results_retrieved += results_in_response | ||
break if results_in_response == total_results | ||
break if results_in_response == total_results || | ||
(limit && total_results_retrieved >= limit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think there's a possible infinite loop here.
- Suppose
retrieve_prospects
is called withlimit = nil
and there are 300 total results to retrieve. - On our first loop we retrieve 200 prospects. At line 66 we log "Retrieved 200/300 new Pardot IDs"
- On our second loop, we reset
results_in_response
to 0 on line 51, and log "Retrieved 100/300 new Pardot IDs" at line 66. - We should break at line 70, but we do not, because
results_in_response
(100) does not equaltotal_results
(300) andlimit
is nil.
I think the solution here is to use total_results_retrieved
instead of results_in_response
on line 70.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what I was trying to say in my previous comment -- I think total_results
gets reduced to 100 the second time through the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the confusion. Ben is right, total_results
is updated for each request
total_results = doc.xpath('/rsp/result/total_results').text.to_i |
because the
url
we sent to Pardot is updated with new id_greater_than
value.url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id" |
So at step 3 in Brad's list, it will log "Retrieved 100/100 new Pardot IDs".
I'll add a comment to clarify this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! I understand now. That is what confused me, I was expecting total_results
to remain constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like a reasonable assumption :)
PLC-838: Downloading Pardot prospect data to use as seeds for contact roll-ups pipeline.
How to read this PR
(or Travel guide in contact roll-ups town)
Starts at
contact_rollups_pardot_memory.rb
download_pardot_prospects
method: Downloads Pardot prospects and saves them to ContactRollupsPardotMemory. It has a list of 12 specific fields we want.download_pardot_ids
method (previously add_and_update_pardot_ids): Downloads only new Pardot IDs, no prospect data.Then goes to
pardot.rb
retrieve_prospects
method (previously retrieve_new_ids): Sends request to Pardot to retrieve prospect data, parses the response and yields the results back to the caller. Used by the 2 functions mentioned incontact_rollups_pardot_memory.rb
.Then goes to test file
contact_rollups_pardot_memory_test.rb
download_pardot_prospects
method and move common test pre-condition to asetup
block.Other changes are minor.
Background
Contact roll-ups pipeline doesn't know what data is already in Pardot. Thus, every time we add a contact attribute to the pipeline, it will send million of updates to Pardot even though that data may already exist there.
The solution is to do a 1-time download of all data from Pardot and seed them into the pipeline. Then when we add new attributes, only actual data changes will trigger Pardot updates.
Deployment strategy
Manually execute download command in production as described in the manual test below.
Testing story
Manual test from Rails console sandbox (
bin/rails c --sandbox
)Unit tests
bundle exec rails test test/models/contact_rollups_pardot_memory_test.rb
bundle exec ruby -Ilib:test lib/test/cdo/contact_rollups/test_pardot_v2.rb
Reviewer Checklist: