Schedule_e pagination and missing transactions bug #3396

Closed
grimmius opened this issue Sep 22, 2018 · 8 comments


grimmius commented Sep 22, 2018

I'm trying to query independent expenditures by committee. I've noticed a bug in the pagination mechanism that results in many transactions being excluded. I'll use the DCCC as an example.

The following will query the first 100 IEs for DCCC during the 2016 cycle:
https://api.open.fec.gov/v1/schedules/schedule_e/?api_key=DEMO_KEY&committee_id=C00000935&per_page=100&cycle=2016&sort=-expenditure_date

It also returns the following pagination object:

"pagination":{"per_page":100,"count":7926,"pages":80,"last_indexes":{"last_index":"4081120171446520621","last_expenditure_date":"2016-11-03T00:00:00"}}

According to the documentation, appending the last_index value to the request (for example, &last_index=4081120171446520621) should return the next page of results in sequence, and so on until no results are left. The pagination object indicates there should be 80 pages of results, but following last_index yields only 9 more pages:

last_index
4081120171446520621
4081120171446520421
4081120171446520221
4081120171446520021
4081120171446519821
4081120171446519621
4061620171409944984
4061620171409944784
4042220161283669188

Missing Transactions:

https://api.open.fec.gov/v1/schedules/schedule_e/?min_amount=134680.46&cycle=2016&committee_id=C00000935&per_page=20&sort=-expenditure_date&api_key=DEMO_KEY&max_amount=134680.46

These transactions (and many more) should be returned by the first request, but are not found on any of the pages.

@AmyKort AmyKort added this to the Sprint 7.4 milestone Oct 9, 2018
@AmyKort AmyKort added the Bug label Oct 9, 2018
@lbeaufort lbeaufort removed their assignment Oct 16, 2018
@JonellaCulmer JonellaCulmer modified the milestones: Sprint 7.4, Sprint 7.5 Oct 25, 2018
@JonellaCulmer
Contributor

Thanks for reporting! We're going to take a look into this issue and get back to you when we have more information on what's happening.

Contributor

qqss88 commented Oct 31, 2018

I ran the same query on the different tiers. Here are the results:

  • DEV:
    "pagination": {
        "count": 2701,
        "per_page": 100,
        "pages": 28,
        "last_indexes": {
            "last_index": "4081120171446520621",
            "last_expenditure_date": "2016-11-03T00:00:00"
        }
    }

  • STG:
    "pagination": {
        "last_indexes": {
            "last_index": "4081120171446520621",
            "last_expenditure_date": "2016-11-03T00:00:00"
        },
        "pages": 28,
        "per_page": 100,
        "count": 2701
    }

  • PRD:
    "pagination": {
        "count": 2701,
        "pages": 28,
        "per_page": 100,
        "last_indexes": {
            "last_index": "4081120171446520621",
            "last_expenditure_date": "2016-11-03T00:00:00"
        }
    }

All three tiers yield the same results. I will check the database to verify.

Contributor

qqss88 commented Oct 31, 2018

A query at the database layer gives the same result, 2701:
select count(*) from ofec_sched_e
where cmte_id='C00000935' and rpt_yr=2016;

@JonellaCulmer
Contributor

Hi @grimmius. We aren't able to replicate your issue. Do you mind trying your search again and seeing if you get the same result?

@lbeaufort lbeaufort removed this from the Sprint 7.5 milestone Nov 26, 2018

AmyKort commented Jun 10, 2019

Thanks for reaching out. I'm going to close this ticket as stale. Please let us know if you have any other concerns.

@AmyKort AmyKort closed this as completed Jun 10, 2019

kcym-3c commented Sep 19, 2019

I'm running into the same issue when requesting from the /schedules/schedule_a endpoint.

I have 600,000 records to retrieve, but I can't iterate past the 24th page out of 6000 pages. At the 24th page, the last_index value does not exist in the response, so I have nothing to pass on to the next request and can't move to the next page.

If last_index does not exist, does that mean there is no more data to retrieve? In my code, a while loop keeps retrieving pages and updating last_index in the parameters until it reaches the end of pagination['pages'].

Here are my parameters:

params = {
    'api_key': fec_key_personal,
    'committee_id': 'C00431569',
    'per_page': 100,
    'two_year_transaction_period': 2008,
}
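
For reference, the while loop described above might look roughly like the sketch below. It is only an illustration under stated assumptions: `iter_results` and `fetch_page` are hypothetical names, the response is assumed to carry a `results` list next to the `pagination` object, and every value in `last_indexes` (not just `last_index`) is forwarded to the next request. The loop stops on an empty page instead of trusting `pagination['pages']`.

```python
from typing import Any, Callable, Dict, Iterator

Page = Dict[str, Any]


def iter_results(fetch_page: Callable[[Dict[str, Any]], Page],
                 params: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield records one at a time, following the last_indexes cursor.

    fetch_page is a hypothetical callable that takes the query parameters
    and returns the decoded JSON response as a dict.
    """
    query = dict(params)  # copy so the caller's dict is not mutated
    while True:
        page = fetch_page(query)
        results = page.get("results", [])  # assumed response key
        if not results:
            break  # an empty page marks the end of the data
        yield from results
        last_indexes = page.get("pagination", {}).get("last_indexes")
        if not last_indexes:
            break  # no cursor to continue from
        # Forward every cursor value, e.g. last_index AND the sort-key value.
        query.update(last_indexes)
```

With the `requests` library, `fetch_page` could be something like `lambda q: requests.get(url, params=q).json()`, where `url` points at the endpoint being queried.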

@jason-upchurch
Contributor

Hello,

Thank you for using the FEC's API. This endpoint has specific pagination requirements, which are described in the endpoint documentation: https://api.open.fec.gov/developers/#/receipts/get_schedules_schedule_a_

The relevant portion for pagination is summarized here:

Due to the large quantity of Schedule A filings, this endpoint is not paginated by
page number. Instead, you can request the next page of results by adding the values in
the last_indexes object from pagination to the URL of your last request. For
example, when sorting by contribution_receipt_date, you might receive a page of
results with the following pagination information:

pagination: {
    pages: 2152643,
    per_page: 20,
    count: 43052850,
    last_indexes: {
        last_index: "230880619",
        last_contribution_receipt_date: "2014-01-01"
    }
}

To fetch the next page of sorted results, append last_index=230880619 and
last_contribution_receipt_date=2014-01-01 to the URL. We strongly advise paging through
these results by using sort indices (defaults to sort by contribution date), otherwise some resources may be
unintentionally filtered out. This resource uses keyset pagination to improve query performance and these indices
are required to properly page through this large dataset.

Note: because the Schedule A data includes many records, counts for
large result sets are approximate; you will want to page through the records until no records are returned.

The key requirement is that, to get a new page of results, both values from last_indexes must be appended to the URL:

To fetch the next page of sorted results, append last_index=230880619 and
last_contribution_receipt_date=2014-01-01 to the URL
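
As a rough illustration of that requirement, the sketch below builds the next-page URL from the previous request URL and its pagination object. The helper name `next_page_url` is hypothetical and not part of the API; it simply copies every key in `last_indexes` into the query string, using only the Python standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(url: str, pagination: dict) -> str:
    """Return url with every last_indexes value set as a query parameter."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    # Overwrite cursor values from an earlier page instead of appending
    # duplicates, so the helper can be applied page after page.
    query.update(pagination.get("last_indexes") or {})
    return urlunsplit(parts._replace(query=urlencode(query)))


pagination = {
    "pages": 2152643,
    "per_page": 20,
    "count": 43052850,
    "last_indexes": {
        "last_index": "230880619",
        "last_contribution_receipt_date": "2014-01-01",
    },
}
url = "https://api.open.fec.gov/v1/schedules/schedule_a/?api_key=DEMO_KEY&per_page=20"
print(next_page_url(url, pagination))
```

Because the parameters are replaced in place, the same helper can be called on each returned URL with the newest pagination object until a page comes back with no records.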


kcym-3c commented Sep 19, 2019

After adding last_contribution_receipt_date to the request along with last_index, I was able to continue to the next page. Thanks for the advice, greatly appreciate it!
