
Download Pardot prospect data #34293

Merged: 12 commits merged into staging from ha/cr-seeding on Apr 23, 2020

Conversation

@hacodeorg (Contributor) commented Apr 17, 2020

PLC-838: Downloading Pardot prospect data to use as seeds for the contact roll-ups pipeline.

How to read this PR

(or Travel guide in contact roll-ups town)

  • Starts at contact_rollups_pardot_memory.rb

    • download_pardot_prospects method: Downloads Pardot prospects and saves them to ContactRollupsPardotMemory. It requests a fixed list of 12 specific fields we want (see the sketch after this list).
    • download_pardot_ids method (previously add_and_update_pardot_ids): Downloads only new Pardot IDs, no prospect data.
  • Then goes to pardot.rb

    • retrieve_prospects method (previously retrieve_new_ids): Sends a request to Pardot to retrieve prospect data, parses the response, and yields the results back to the caller. Used by the two methods mentioned in contact_rollups_pardot_memory.rb.
  • Then goes to test file contact_rollups_pardot_memory_test.rb

    • Adds a new test for the download_pardot_prospects method and moves common test pre-conditions into a setup block.
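
For orientation, here is a minimal sketch of how the two layers fit together. This is not the literal code in this PR: the field names and the save_pardot_prospects helper are illustrative assumptions, and retrieve_prospects is assumed to take (last_id, fields, limit) and yield parsed batches to the block.

    # ContactRollupsPardotMemory (sketch only)
    def self.download_pardot_prospects(last_id = nil, limit = nil)
      # The real method lists the 12 specific db_* fields we want; these are placeholders.
      fields = %w[db_Opt_In db_Has_Teacher_Account]
      PardotV2.retrieve_prospects(last_id, fields, limit) do |prospects|
        # Upsert each yielded batch into the contact_rollups_pardot_memory table.
        save_pardot_prospects prospects   # hypothetical helper name
      end
    end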

Other changes are minor.

Background

The contact roll-ups pipeline doesn't know what data is already in Pardot. Thus, every time we add a contact attribute to the pipeline, it will send millions of updates to Pardot even though that data may already exist there.

The solution is to do a one-time download of all data from Pardot and seed it into the pipeline. Then, when we add new attributes, only actual data changes will trigger Pardot updates.

Deployment strategy

Manually execute the download command in production as described in the manual test below.

Testing story

  • Manual test from Rails console sandbox (bin/rails c --sandbox)

    require 'cdo/contact_rollups/v2/pardot'
    ContactRollupsPardotMemory.count  # should be 0
    
    # Download the first 2 prospects
    ContactRollupsPardotMemory.download_pardot_prospects(nil, 2)
    # Download 3 prospects with IDs greater than a certain number
    ContactRollupsPardotMemory.download_pardot_prospects(50848578, 3)
    
    ContactRollupsPardotMemory.count  # should be 5
  • Unit tests

    • From dashboard: bundle exec rails test test/models/contact_rollups_pardot_memory_test.rb
    • From repo root: bundle exec ruby -Ilib:test lib/test/cdo/contact_rollups/test_pardot_v2.rb

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@hacodeorg hacodeorg changed the title Seed Pardot data to CR pipeline Download Pardot prospect data Apr 17, 2020
@hacodeorg hacodeorg marked this pull request as ready for review April 17, 2020 18:29
@hacodeorg hacodeorg requested review from bencodeorg and a team April 17, 2020 18:29
@islemaster (Contributor) left a comment:

One possible issue with your exit condition for the loop?

total_results_retrieved = 0

# Run repeated requests querying for prospects above our highest known Pardot ID.
loop do
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=email,id&sort_by=id"
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id"
Contributor:

It looks like this is not currently a problem, but thinking about future-proofing this system: Do we need to somehow url-encode the fields value? I'm wondering if those field names could ever contain spaces or other characters we need to worry about.

Contributor Author (@hacodeorg):

Custom fields in Pardot can have spaces, parentheses, and more in their names. We will have to URL-encode those names if we ever want to download them. However, the fields we care about now (and in the near future) all follow the db_* format. I'll add a comment explaining the implicit constraint on field names.

Example of custom fields of a prospect (https://pi.pardot.com/prospect/read/id/82839621): [screenshot taken 2020-04-20]
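
If we ever do need to download such custom fields, one way to make the query string safe (a sketch, not part of this PR) would be to escape each field name before joining:

    require 'cgi'

    # Escape spaces, parentheses, etc. in field names before building the query string.
    encoded_fields = fields.map {|field| CGI.escape(field)}.join(',')
    url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{encoded_fields}&sort_by=id"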

doc = post_with_auth_retry(url)
raise_if_response_error(doc)

# Pardot returns the total count of available prospects (not capped at 200),
# although the response contains data for at most 200 of them.
# @see http://developer.pardot.com/kb/api-version-4/prospects/#xml-response-format
Contributor:

🎉

log "Retrieved #{results_in_response}/#{total_results} new Pardot IDs. Last Pardot ID = #{last_id}."

# Stop if all the remaining results were in this response
total_results_retrieved += results_in_response
break if results_in_response == total_results
break if results_in_response == total_results || total_results_retrieved >= limit
Contributor:

For the case where limit is nil but there is more than one page of results, shouldn't we be checking total_results_retrieved?

Contributor:

Went back to Jeremy's original code to make sure we didn't miss something in translation -- I think this works b/c we keep increasing the last_id value, and as a result the total_results decreases each time through the loop until we've hit the end of the road?

Contributor Author (@hacodeorg):

Good catch! When limit is nil, the comparison total_results_retrieved >= limit will raise an exception. Will fix.

@bencodeorg (Contributor):

Looks great! Love the "travel guide" as well. Two overarching questions:

  1. What does this look like once we want to grab more fields from Pardot?
  2. Do you have concerns about running this all at once? Any idea how long it will take? Idea of when you'd like to kick it off?

@hacodeorg (Contributor, Author) replied:

1.a If we want to get more fields and still save them to PardotMemory (overwrite existing data), we can simply add new fields to the list in ContactRollupsPardotMemory.download_pardot_prospects.

1.b If we just want to download new prospect data without overwriting PardotMemory, we can call PardotV2.retrieve_prospects with a block that saves the yielded results to a different table (rough sketch below).

2. My rough estimate is that it will take about 7 hours to run (5M prospects at 200 prospects/second ≈ 25,000 seconds). I will manually run it over a few days with a limit of 1M prospects/day by calling ContactRollupsPardotMemory.download_pardot_prospects(nil, 1_000_000) directly from production-console. There is a JIRA task to implement a throttling mechanism for the entire pipeline, but this download is a one-time task so it doesn't have to wait for that work to finish.
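
For 1.b, a rough sketch of the idea; this assumes retrieve_prospects keeps the (last_id, fields, limit) arguments and yields arrays of prospect hashes, and the ContactRollupsPardotStaging model and field names are hypothetical:

    new_fields = %w[db_Some_New_Field]  # placeholder field names
    PardotV2.retrieve_prospects(nil, new_fields, nil) do |prospects|
      prospects.each do |prospect|
        # Save to a separate table so existing contact_rollups_pardot_memory rows are untouched.
        ContactRollupsPardotStaging.create!(
          pardot_id: prospect['id'],
          data: prospect,
          data_updated_at: Time.now.utc
        )
      end
    end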


import! batch,
validate: false,
on_duplicate_key_update: [:pardot_id, :pardot_id_updated_at]
end
end

# Downloads and saves prospect data from Pardot.
# *Warning:* This method overwrites existing data.
Contributor:

clarification q: does it wipe the existing data and replace it with whatever is downloaded here or do some sort of partial overwrite?

Contributor Author (@hacodeorg):

It upserts (inserts or updates) downloaded data into the contact_rollups_pardot_memory table. Only the update path overwrites existing data; rows that aren't in the download are left untouched.
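
In other words, each batch goes through activerecord-import's on_duplicate_key_update option, which on MySQL compiles to INSERT ... ON DUPLICATE KEY UPDATE. A small sketch of the idea (the column names here are illustrative, not the PR's exact list):

    # Rows whose unique key already exists are updated in place; rows not present
    # in the downloaded batch are never touched or deleted.
    ContactRollupsPardotMemory.import! batch,
      validate: false,
      on_duplicate_key_update: [:pardot_id_data, :data_synced_at]  # illustrative columns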

@bencodeorg (Contributor) left a comment:

Looks good to me once you've sorted out that loop condition!

total_results_retrieved += results_in_response
break if results_in_response == total_results
break if results_in_response == total_results ||
(limit && total_results_retrieved >= limit)
Contributor:

I still think there's a possible infinite loop here.

  1. Suppose retrieve_prospects is called with limit = nil and there are 300 total results to retrieve.
  2. On our first loop we retrieve 200 prospects. At line 66 we log "Retrieved 200/300 new Pardot IDs"
  3. On our second loop, we reset results_in_response to 0 on line 51, and log "Retrieved 100/300 new Pardot IDs" at line 66.
  4. We should break at line 70, but we do not, because results_in_response (100) does not equal total_results (300) and limit is nil.

I think the solution here is to use total_results_retrieved instead of results_in_response on line 70.

Contributor:

I think that's what I was trying to say in my previous comment -- I think total_results gets reduced to 100 the second time through the loop?

@hacodeorg (Contributor, Author) commented Apr 23, 2020:

I see the confusion. Ben is right; total_results is updated for each request:

total_results = doc.xpath('/rsp/result/total_results').text.to_i

because the URL we send to Pardot is updated with a new id_greater_than value:
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id"

So at step 3 in Brad's list, it will log "Retrieved 100/100 new Pardot IDs".
I'll add a comment to clarify this behavior.
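
Put together, the loop's shape is roughly this (a simplified reconstruction from the snippets quoted in this thread, not the literal merged code):

    total_results_retrieved = 0
    loop do
      # id_greater_than advances each iteration, so the total_results Pardot reports
      # shrinks to "results remaining above last_id" rather than staying constant.
      url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id"
      doc = post_with_auth_retry(url)
      raise_if_response_error(doc)

      total_results = doc.xpath('/rsp/result/total_results').text.to_i
      # ...parse up to 200 prospects from doc, update last_id, set results_in_response...

      total_results_retrieved += results_in_response
      break if results_in_response == total_results ||
        (limit && total_results_retrieved >= limit)
    end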

Contributor:

Oh! I understand now. That is what confused me, I was expecting total_results to remain constant.

Contributor:

That seems like a reasonable assumption :)

@hacodeorg hacodeorg merged commit 459d3ce into staging Apr 23, 2020
@hacodeorg hacodeorg deleted the ha/cr-seeding branch April 23, 2020 19:48