
Download Pardot prospect data #34293

Merged: 12 commits merged into staging from ha/cr-seeding on Apr 23, 2020

Conversation

@hacodeorg (Contributor) commented Apr 17, 2020

PLC-838: Downloading Pardot prospect data to use as seeds for the contact roll-ups pipeline.

How to read this PR

(or Travel guide in contact roll-ups town)

  • Starts at contact_rollups_pardot_memory.rb

    • download_pardot_prospects method: Downloads Pardot prospects and saves them to ContactRollupsPardotMemory. It requests a fixed list of 12 specific fields we want (see the sketch after this list).
    • download_pardot_ids method (previously add_and_update_pardot_ids): Downloads only new Pardot IDs, no prospect data.
  • Then goes to pardot.rb

    • retrieve_prospects method (previously retrieve_new_ids): Sends a request to Pardot to retrieve prospect data, parses the response, and yields the results back to the caller. Used by the two methods mentioned in contact_rollups_pardot_memory.rb.
  • Then goes to test file contact_rollups_pardot_memory_test.rb

    • Adds a new test for the download_pardot_prospects method and moves common test pre-conditions into a setup block.
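
For orientation, here is a minimal sketch of how the two layers fit together. This is not the literal code in this PR: the field names and the save_pardot_prospects helper are illustrative assumptions, and retrieve_prospects is assumed to take (last_id, fields, limit) and yield parsed batches to the block.

    # ContactRollupsPardotMemory (sketch only)
    def self.download_pardot_prospects(last_id = nil, limit = nil)
      # The real method lists the 12 specific db_* fields we want; these are placeholders.
      fields = %w[db_Opt_In db_Has_Teacher_Account]
      PardotV2.retrieve_prospects(last_id, fields, limit) do |prospects|
        # Upsert each yielded batch into the contact_rollups_pardot_memory table.
        save_pardot_prospects prospects   # hypothetical helper name
      end
    end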

Other changes are minor.

Background

The contact roll-ups pipeline doesn't know what data is already in Pardot. Thus, every time we add a contact attribute to the pipeline, it will send millions of updates to Pardot even though that data may already exist there.

The solution is to do a one-time download of all data from Pardot and seed it into the pipeline. Then, when we add new attributes, only actual data changes will trigger Pardot updates.

Deployment strategy

Manually execute the download command in production as described in the manual test below.

Testing story

  • Manual test from Rails console sandbox (bin/rails c --sandbox)

    require 'cdo/contact_rollups/v2/pardot'
    ContactRollupsPardotMemory.count  # should be 0
    
    # Download the first 2 prospects
    ContactRollupsPardotMemory.download_pardot_prospects(nil, 2)
    # Download 3 prospects with IDs greater than a certain number
    ContactRollupsPardotMemory.download_pardot_prospects(50848578, 3)
    
    ContactRollupsPardotMemory.count  # should be 5
  • Unit tests

    • From dashboard: bundle exec rails test test/models/contact_rollups_pardot_memory_test.rb
    • From repo root: bundle exec ruby -Ilib:test lib/test/cdo/contact_rollups/test_pardot_v2.rb

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@hacodeorg hacodeorg changed the title Seed Pardot data to CR pipeline Download Pardot prospect data Apr 17, 2020
@hacodeorg hacodeorg marked this pull request as ready for review April 17, 2020 18:29
@hacodeorg hacodeorg requested review from bencodeorg and a team April 17, 2020 18:29
@islemaster (Contributor) left a comment:

One possible issue with your exit condition for the loop?

total_results_retrieved = 0

# Run repeated requests querying for prospects above our highest known Pardot ID.
loop do
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=email,id&sort_by=id"
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id"
Contributor:

It looks like this is not currently a problem, but thinking about future-proofing this system: Do we need to somehow url-encode the fields value? I'm wondering if those field names could ever contain spaces or other characters we need to worry about.

Contributor Author (@hacodeorg):

Custom fields in Pardot can have spaces, parentheses, and more in their names. We will have to URL-encode those names if we ever want to download them. However, the fields we care about now (and in the near future) all follow the db_* format. I'll add a comment explaining the implicit constraint on field names.

Example of custom fields of a prospect (https://pi.pardot.com/prospect/read/id/82839621): [screenshot taken 2020-04-20]
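
If we ever do need to download such custom fields, one way to make the query string safe (a sketch, not part of this PR) would be to escape each field name before joining:

    require 'cgi'

    # Escape spaces, parentheses, etc. in field names before building the query string.
    encoded_fields = fields.map {|field| CGI.escape(field)}.join(',')
    url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{encoded_fields}&sort_by=id"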

doc = post_with_auth_retry(url)
raise_if_response_error(doc)

# Pardot returns the total count of available prospects (not capped at 200),
# although the response contains data for at most 200 of them.
# @see http://developer.pardot.com/kb/api-version-4/prospects/#xml-response-format
Contributor:

🎉

log "Retrieved #{results_in_response}/#{total_results} new Pardot IDs. Last Pardot ID = #{last_id}."

# Stop if all the remaining results were in this response
total_results_retrieved += results_in_response
break if results_in_response == total_results
break if results_in_response == total_results || total_results_retrieved >= limit
Contributor:

For the case where limit is nil but there is more than one page of results, shouldn't we be checking total_results_retrieved?

Contributor:

Went back to Jeremy's original code to make sure we didn't miss something in translation -- I think this works b/c we keep increasing the last_id value, and as a result the total_results decreases each time through the loop until we've hit the end of the road?

Contributor Author (@hacodeorg):

Good catch! When limit is nil, the comparison total_results_retrieved >= limit will raise an exception. Will fix.

@bencodeorg (Contributor):

Looks great! Love the "travel guide" as well. Two overarching questions:

  1. What does this look like once we want to grab more fields from Pardot?
  2. Do you have concerns about running this all at once? Any idea how long it will take? Idea of when you'd like to kick it off?

@hacodeorg (Contributor, Author) replied:

1.a If we want to get more fields and still save them to PardotMemory (overwrite existing data), we can simply add new fields to the list in ContactRollupsPardotMemory.download_pardot_prospects.

1.b If we just want to download new prospect data without overwriting PardotMemory, we can call PardotV2.retrieve_prospects with a block that saves the yielded results to a different table (rough sketch below).

2. My rough estimate is that it will take about 7 hours to run (5M prospects at 200 prospects/second ≈ 25,000 seconds). I will manually run it over a few days with a limit of 1M prospects/day by calling ContactRollupsPardotMemory.download_pardot_prospects(nil, 1_000_000) directly from production-console. There is a JIRA task to implement a throttling mechanism for the entire pipeline, but this download is a one-time task so it doesn't have to wait for that work to finish.
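
For 1.b, a rough sketch of the idea; this assumes retrieve_prospects keeps the (last_id, fields, limit) arguments and yields arrays of prospect hashes, and the ContactRollupsPardotStaging model and field names are hypothetical:

    new_fields = %w[db_Some_New_Field]  # placeholder field names
    PardotV2.retrieve_prospects(nil, new_fields, nil) do |prospects|
      prospects.each do |prospect|
        # Save to a separate table so existing contact_rollups_pardot_memory rows are untouched.
        ContactRollupsPardotStaging.create!(
          pardot_id: prospect['id'],
          data: prospect,
          data_updated_at: Time.now.utc
        )
      end
    end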


import! batch,
validate: false,
on_duplicate_key_update: [:pardot_id, :pardot_id_updated_at]
end
end

# Downloads and saves prospect data from Pardot.
# *Warning:* This method overwrites existing data.
Contributor:

clarification q: does it wipe the existing data and replace it with whatever is downloaded here or do some sort of partial overwrite?

Contributor Author (@hacodeorg):

It upserts (inserts or updates) downloaded data into the contact_rollups_pardot_memory table. Only the update path overwrites existing data; rows that aren't in the download are left untouched.
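
In other words, each batch goes through activerecord-import's on_duplicate_key_update option, which on MySQL compiles to INSERT ... ON DUPLICATE KEY UPDATE. A small sketch of the idea (the column names here are illustrative, not the PR's exact list):

    # Rows whose unique key already exists are updated in place; rows not present
    # in the downloaded batch are never touched or deleted.
    ContactRollupsPardotMemory.import! batch,
      validate: false,
      on_duplicate_key_update: [:pardot_id_data, :data_synced_at]  # illustrative columns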

@bencodeorg (Contributor) left a comment:

Looks good to me once you've sorted out that loop condition!

total_results_retrieved += results_in_response
break if results_in_response == total_results
break if results_in_response == total_results ||
(limit && total_results_retrieved >= limit)
Contributor:

I still think there's a possible infinite loop here.

  1. Suppose retrieve_prospects is called with limit = nil and there are 300 total results to retrieve.
  2. On our first loop we retrieve 200 prospects. At line 66 we log "Retrieved 200/300 new Pardot IDs"
  3. On our second loop, we reset results_in_response to 0 on line 51, and log "Retrieved 100/300 new Pardot IDs" at line 66.
  4. We should break at line 70, but we do not, because results_in_response (100) does not equal total_results (300) and limit is nil.

I think the solution here is to use total_results_retrieved instead of results_in_response on line 70.

Contributor:

I think that's what I was trying to say in my previous comment -- I think total_results gets reduced to 100 the second time through the loop?

@hacodeorg (Contributor, Author) commented Apr 23, 2020:

I see the confusion. Ben is right; total_results is updated for each request:

total_results = doc.xpath('/rsp/result/total_results').text.to_i

because the URL we send to Pardot is updated with a new id_greater_than value:
url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id"

So at step 3 in Brad's list, it will log "Retrieved 100/100 new Pardot IDs".
I'll add a comment to clarify this behavior.
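
Put together, the loop's shape is roughly this (a simplified reconstruction from the snippets quoted in this thread, not the literal merged code):

    total_results_retrieved = 0
    loop do
      # id_greater_than advances each iteration, so the total_results Pardot reports
      # shrinks to "results remaining above last_id" rather than staying constant.
      url = "#{PROSPECT_QUERY_URL}?id_greater_than=#{last_id}&fields=#{fields.join(',')}&sort_by=id"
      doc = post_with_auth_retry(url)
      raise_if_response_error(doc)

      total_results = doc.xpath('/rsp/result/total_results').text.to_i
      # ...parse up to 200 prospects from doc, update last_id, set results_in_response...

      total_results_retrieved += results_in_response
      break if results_in_response == total_results ||
        (limit && total_results_retrieved >= limit)
    end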

Contributor:

Oh! I understand now. That is what confused me, I was expecting total_results to remain constant.

Contributor:

That seems like a reasonable assumption :)

@hacodeorg hacodeorg merged commit 459d3ce into staging Apr 23, 2020
@hacodeorg hacodeorg deleted the ha/cr-seeding branch April 23, 2020 19:48