New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Figure out the proper timezones for DateTimeField values #1457

Closed
gordonje opened this Issue Apr 21, 2016 · 2 comments

Comments

Projects
None yet
1 participant
@gordonje
Copy link
Contributor

gordonje commented Apr 21, 2016

After inspecting the raw .csv files (using agate), I've found seven DateTimeFields with values where the time part is something other than 12:00:00 AM. Though in all but one case, these real times are a small percent of all non-null values in each column:

  • 1,144 of 347,061 (0.33%) in CVR_CAMPAIGN_DISCLOSURE_CD.RPT_DATE (in campaign.py)
  • 146,354 of 146,813 (99.69%) in EFS_FILING_LOG_CD.FILING_DATE (in other.py)
  • 373 of 15,318 (2.44%) in F501_502_CD.RPT_DATE (in campaign.py)
  • 617 of 10,575 (5.83%) in F501_502_CD.EXECUTE_DT (in campaign.py)
  • 29 of 15,167 (0.19%) in FILER_ETHICS_CLASS_CD.ETHICS_DATE (in other.py)
  • 8,294 of 404,222 (2.05%) in FILER_ADDRESS_CD.EFFECT_DT (in other.py)
  • 196 of 36,065 (0.54%) in FILER_INTERESTS_CD.EFFECT_DATE (in other.py)

To properly preserve this data, we need to know the timezone that goes along with each of these time values. But the raw data doesn't include any timezone info, and I haven't yet found anything about timezones for time values in the official documentation.

There are two possibilities I can imagine:

  1. Times are in the local timezone of the CAL-ACCESS db server, which is probably either UTC or PST
  2. Times are in the local timezones of the filers, which probably means most of them are in PST, but some might be in different timezones, say, in cases of electronic filings submitted by an out-of-state third-party vendor.

I suppose it's also possible that not all of these fields fall under the same scenario.

Short of asking one of the CAL-ACCESS admins about this, I started looking into each of these fields:

CVR_CAMPAIGN_DISCLOSURE_CD.RPT_DATE: All of the values with real times in this column are on rows where FORM_TYPE is F497 for "Late Contribution Report", and the latest RPT_DATE value is 2000-11-06 16:10:06. I wasn't able to pull up any of these filings via the CAL-ACCESS PDFgen program, just get CGI errors like this one.

I was able to pull up some of these filings via the CAL-ACCESS search tool, but the results contain only date info, no time. I checked the Excel export too.

EFS_FILING_LOG_CD.FILING_DATE: This seems like an internal system table for logging electronic filings from third-party vendors. Not sure it's still in use since the most recent FILING_DATE value is 2004-01-16 10:59:25

F501_502_CD.RPT_DATE and F501_502_CD.EXECUTE_DT: Both of these fields contain values with real times that are as recent as March 2016. Form 501 is the Campaign Intention Statement and Form 502 is the Campaign bank account statement. But these forms don't seem to be available via the CAL-ACCESS PDFgen program, throws the CGI error like this one.

FILER_ETHICS_CLASS_CD.ETHICS_DATE: This table contains a record for each lobbyist training class, and the ETHICS_DATE column should be the date the class was finished. But the most recent value with a real time is 2001-06-07 05:51:59. A lot of these all fall and the same day too, just within a few minutes of each other. Maybe someone was testing something?

FILER_ADDRESS_CD.EFFECT_DT: This is a table of address history for filers, and the EFFECT_DT column contains values with real times that are as recent as April 2016. My guess is this is the datetime when the filer's address change was logged in the system, probably triggered by a filing or amendment. Also, the real times haven't become any more prevalent in the last couple of years.

FILER_INTERESTS_CD.EFFECT_DATE: This is the table that links filers to their interests (e.g., "AGRICULTURE", "FINANCE/INSURANCE"). The EFFECT_DATE column contains values with real times as recent as Sept 2015. Similar to FILER_ADDRESS_CD.EFFECT_DT, these are probably datetimes when the interests are assigned to the filers in the system. As with FILER_ADDRESS_CD.EFFECT_DT, the real times haven't become any more prevalent in the last couple of years.

@gordonje

This comment has been minimized.

Copy link
Contributor

gordonje commented Apr 21, 2016

tl;dr: I believe there are only as many as four DateTimeFields where times are actively being collected:

  • F501_502_CD.RPT_DATE
  • F501_502_CD.EXECUTE_DT
  • FILER_ADDRESS_CD.EFFECT_DT
  • FILER_INTERESTS_CD.EFFECT_DATE

And the values with real times are very small percentage of non-NULL values in these fields.

@gordonje

This comment has been minimized.

Copy link
Contributor

gordonje commented Apr 21, 2016

So for the time being, we are just going to convert these DateTimeFields to DateFields, and add a note in our documentation, explaining this decision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment