Figure out how to wire up Data Set Creation for Massive Data Sets #340

Closed
vforgione opened this issue Jun 28, 2018 · 9 comments

@vforgione
Member

Related to #235

For things like Crime Data and Divvy Rides, where the data set takes minutes just to respond to a request, we need some sort of mechanism to handle this.

One possible solution: add an extra field to the meta that says this is a really big data set. When that flag is true, we can either increase the timeout value for the request or turn it into some sort of background job (and still bump out the request timeout). I'm not so sure about the background job, as it would have a significantly different workflow for the users.

@HeyZoos, @brlodi, @sanilio please comment.

@vforgione vforgione added this to the To a One-dot Release milestone Jun 28, 2018
@vforgione vforgione self-assigned this Jun 28, 2018
@brlodi
Collaborator

brlodi commented Jun 28, 2018

> add an extra field to the meta that says this is a really big data set.

I'd argue it's less that it's a really big dataset (though I'm sure that's the underlying cause) and more that it's a poorly configured, or at least insufficiently configured, source server. Ideally we'd just say that if your server can't be bothered to return at least a 202 Accepted within 30 seconds or so, then that's your problem.

To pick on the Divvy Trips and the City a little bit here for the sake of example:

> GET /api/views/fg6s-gzvg/rows.csv HTTP/1.1
> Host: data.cityofchicago.org
> User-Agent: insomnia/5.16.6
> Accept: */*

< HTTP/1.1 200 OK
< Server: nginx
< Date: Thu, 28 Jun 2018 17:49:33 GMT
< Content-Type: text/csv; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< X-Socrata-RequestId: a8tww493syofhmkgmsbh7u0yb
< Access-Control-Allow-Origin: *
< Content-disposition: attachment; filename=Divvy_Trips.csv
< Cache-Control: public, must-revalidate, max-age=21600
< ETag: W/"YWxwaGEuMTE4NDM4XzI5XzIyMTU1THVYRTFfc3E5eUd6X0xFdmRLOTZhdFBTTEEVhO3X06sMVfUWaJqVNGN6bsfILw-gzip--gzip"
< X-SODA2-Data-Out-Of-Date: false
< X-SODA2-Truth-Last-Modified: Tue, 27 Mar 2018 21:37:31 GMT
< X-SODA2-Secondary-Last-Modified: Tue, 27 Mar 2018 21:37:31 GMT
< Last-Modified: Tue, 27 Mar 2018 21:37:31 GMT
< Vary: Accept-Encoding
< Age: 64
< X-Socrata-Region: aws-us-east-1-fedramp-prod

* Received 15.3 KB chunk
* Received 16 KB chunk
* Received 16 KB chunk
...

It's smart enough to use a chunked (streaming) transfer, but still produces the entire content on the server side before sending the header. It should send a 200 OK response header as soon as it starts compiling the data to transfer. Whether or not it has the first chunk immediately ready is irrelevant.


Unfortunately I think the best we can do (and I have no idea how to implement it) is check if the SSL handshake happens successfully, so we know we've at least connected to the target server, and then wait some arbitrarily long amount of time for it to get around to providing a header response. We have the slight benefit that such servers should be relatively quick about bad requests so we can somewhat safely assume that a present but silent server is formulating some huge response body.
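One way to read the "connect first, then wait" idea is to use two separate timeouts: a short one for the TCP connect and TLS handshake, and a much longer one for the silence before the response header. Below is a minimal Python sketch using only the standard library. The function name `wait_for_headers` and the default timeout values are hypothetical, and this is not how the project (which appears to use HTTPoison) does it.

```python
import http.client
import time


def wait_for_headers(host, path, *, port=443, use_tls=True,
                     connect_timeout=10.0, header_timeout=600.0):
    """Connect quickly, then wait patiently for the response header.

    Returns (connection, response, seconds_waited_for_header).
    The names and default values are illustrative assumptions.
    """
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=connect_timeout)

    # connect() does the TCP connect and, for HTTPS, the TLS handshake.
    # If this fails or times out, the server is unreachable and we give up early.
    conn.connect()

    # We know the server is there, so allow a long silence before the header.
    conn.sock.settimeout(header_timeout)

    started = time.monotonic()
    try:
        conn.request("GET", path)
        # Blocks until the status line and headers arrive (or the timeout fires).
        response = conn.getresponse()
    except Exception:
        conn.close()
        raise
    return conn, response, time.monotonic() - started
```

The caller is expected to read the body from `response` and then close the connection. A timeout while waiting for the header surfaces as `socket.timeout`, which distinguishes "present but silent" from "could not connect".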

@vforgione
Member Author

To be clear, we can verify that the resource exists relatively quickly by sending OPTIONS and HEAD requests, which we already do.

@HeyZoos
Collaborator

HeyZoos commented Jun 28, 2018

Could you clarify where the workflow would be different? Data set ingest, small and large, is always carried out in the background from what I understand. Our issues with those have been due to GenServer timeouts and memory management.

@vforgione
Member Author

The field guesser, specifically. I think I just need to tinker around with that.

@brlodi
Collaborator

brlodi commented Jun 28, 2018

If we know the resource exists, then I don't see any problem with just putting a stupidly long timeout on it. If we want to be fancy, we could set it automatically to something like 125% of the delay before the response the last time we accessed the resource.
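The "125% of the last delay" rule can be sketched in a few lines of Python. The function name, the 30-second floor (borrowed from the "30 seconds or so" mentioned earlier in the thread) and the one-hour ceiling are all assumptions for illustration, not values anyone has agreed on.

```python
def next_timeout(last_delay_s, factor=1.25, floor_s=30.0, ceiling_s=3600.0):
    """Pick the timeout (in seconds) for the next fetch of a resource.

    last_delay_s is how long the source took to start responding last time,
    or None if we have never fetched it. All defaults are hypothetical.
    """
    if last_delay_s is None:
        # No history yet: fall back to the "stupidly long" timeout.
        return ceiling_s
    # Scale the previous delay, but keep the result within sane bounds.
    return min(max(last_delay_s * factor, floor_s), ceiling_s)
```

For example, a source that took 240 seconds to respond last time would get a 300-second timeout on the next attempt, while a source that answered in 4 seconds would still get the 30-second floor.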

@HeyZoos
Collaborator

HeyZoos commented Jun 28, 2018

Hmm, streaming the 1000 or so rows off the top doesn't work for those?
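For reference, taking only the top rows off a streamed CSV looks roughly like this in Python. `sample_rows` is a hypothetical name, and `lines` stands for any iterator of decoded text lines from the response. It bounds how much of the body gets read, but it does nothing about a server that is slow to send its first byte.

```python
import csv
from itertools import islice


def sample_rows(lines, limit=1000):
    """Return (header, rows) using at most `limit` data rows.

    `lines` is any iterable of CSV text lines, e.g. a streamed response
    body decoded line by line. Nothing past the sample is consumed.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        # Empty stream: no header and no rows.
        return [], []
    # islice stops pulling from the reader once `limit` rows are parsed.
    return header, list(islice(reader, limit))
```

Because both `csv.reader` and `islice` are lazy, the rest of the download can be abandoned as soon as the sample is complete.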

@vforgione
Member Author

It just kind of dies. I'm doing some verbose logging locally.

14:22:38.571 [info] POST /data-sets/2
14:22:38.579 [debug] QUERY OK source="users" db=1.1ms
SELECT u0."id", u0."name", u0."email", u0."password_hash", u0."bio", u0."is_active", u0."is_admin", u0."inserted_at", u0."updated_at" FROM "users" AS u0 WHERE (u0."id" = $1) [1]
14:22:38.579 [debug] Processing with PlenarioWeb.Web.DataSetController.update/2
  Parameters: %{"_csrf_token" => "HyJULUlhfyJcOA4LAV8ZD2NDU1QmEAAAEgcDyNFo/pdJp1zc/wc8Vw==", "_method" => "put", "_utf8" => "✓", "id" => "2", "meta" => %{"attribution" => "", "description" => "", "force_fields_reset" => "true", "name" => "Divvy Trips", "refresh_ends_on" => "", "refresh_interval" => "", "refresh_rate" => "", "refresh_starts_on" => "", "source_type" => "csv", "source_url" => "https://data.cityofchicago.org/api/view/fg6s-gzvg/rows.csv?accessType=DOWNLOAD"}}
  Pipelines: [:browser, :maybe_authenticated]
14:22:38.584 [debug] QUERY OK source="metas" db=4.0ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:38.589 [debug] QUERY OK source="metas" db=4.8ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:38.592 [info] beginning download of https://data.cityofchicago.org/api/view/fg6s-gzvg/rows.csv?accessType=DOWNLOAD
14:22:39.099 [debug] https://data.cityofchicago.org/api/view/fg6s-gzvg/rows.csv?accessType=DOWNLOAD streaming response is %HTTPoison.AsyncResponse{id: #Reference<0.910888085.3588227073.86861>}
14:22:39.106 [info] Sent 302 in 534ms
14:22:39.141 [info] GET /data-sets/2
14:22:39.146 [debug] QUERY OK source="users" db=4.0ms
SELECT u0."id", u0."name", u0."email", u0."password_hash", u0."bio", u0."is_active", u0."is_admin", u0."inserted_at", u0."updated_at" FROM "users" AS u0 WHERE (u0."id" = $1) [1]
14:22:39.146 [debug] Processing with PlenarioWeb.Web.DataSetController.show/2
  Parameters: %{"id" => "2"}
  Pipelines: [:browser, :maybe_authenticated]
14:22:39.147 [debug] QUERY OK source="metas" db=0.3ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:39.150 [debug] QUERY OK source="metas" db=3.1ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:39.151 [debug] QUERY OK source="data_set_fields" db=0.9ms
SELECT d0."id", d0."name", d0."type", d0."description", d0."inserted_at", d0."updated_at", d0."meta_id", d0."meta_id" FROM "data_set_fields" AS d0 WHERE (d0."meta_id" = $1) ORDER BY d0."meta_id" [2]
14:22:39.151 [debug] QUERY OK source="users" db=1.1ms
SELECT u0."id", u0."name", u0."email", u0."password_hash", u0."bio", u0."is_active", u0."is_admin", u0."inserted_at", u0."updated_at", u0."id" FROM "users" AS u0 WHERE (u0."id" = $1) [1]
14:22:39.156 [debug] QUERY OK source="metas" db=4.1ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id", m0."user_id" FROM "metas" AS m0 WHERE (m0."user_id" = $1) ORDER BY m0."user_id" [1]
14:22:39.157 [debug] QUERY OK source="virtual_date_fields" db=0.8ms
SELECT v0."id", v0."name", v0."inserted_at", v0."updated_at", v0."meta_id", v0."year_field_id", v0."month_field_id", v0."day_field_id", v0."hour_field_id", v0."minute_field_id", v0."second_field_id" FROM "virtual_date_fields" AS v0 WHERE (v0."meta_id" = $1) [2]
14:22:39.158 [debug] QUERY OK source="virtual_point_fields" db=0.5ms
SELECT v0."id", v0."name", v0."inserted_at", v0."updated_at", v0."meta_id", v0."lat_field_id", v0."lon_field_id", v0."loc_field_id" FROM "virtual_point_fields" AS v0 WHERE (v0."meta_id" = $1) [2]
14:22:39.164 [info] Sent 200 in 23ms

And then that's it...

@vforgione
Member Author

Ok, so Divvy Trips, in particular, takes waaaaaaaaay too long to render server side before it even starts to send a response. I'm going to work with the people at the city to work out a solution.

This may be the lead-up to building out an ingest process for Top-N data sets, where the body is too large to process in a realistic time frame.

@vforgione vforgione removed this from the To a One-dot Release milestone Jul 19, 2018
@vforgione vforgione added Task Work that is not bug related. Blocked Tasks that are unable to be completed. and removed etl labels Jul 19, 2018
@vforgione
Member Author

Blocked until we address #377

vforgione pushed a commit that referenced this issue Oct 26, 2018
Adds new columns for Socrata sourced data sets and makes the original
source field nullable.

Updates #340
Updates #235
vforgione pushed a commit that referenced this issue Oct 26, 2018
Added fields and annotated source url to be default null.

Updates #340
Updates #235
vforgione pushed a commit that referenced this issue Oct 26, 2018
Since I totally rewrote the changesets, I needed to update the actions.
And while I was updating how it worked with changesets, I refactored its
methods as well.

This module has smelled for a while and it was time for a change.

Updates #340
Updates #235
vforgione pushed a commit that referenced this issue Oct 26, 2018
Many internal MetaActions changed signatures and caused some serious
changes to tests and other internal actions.

Also added back some helper functions.

Updates #340
Updates #235
vforgione pushed a commit that referenced this issue Nov 14, 2018
The internal API was getting really nasty -- we had a bunch of one-off
functions that clashed in arity (positional arguments, matches, guards,
options ...).

The web application was also a disaster -- originally I thought it would
make it easier to keep the web, admin and API separate in subapps,
but that ended up making things that much more difficult.

Then that leaves the elephant in the room: Socrata. We've always relied
on them and all of their awful decisions. The changes in here remove
some of the terrible things about the Socrata integration and make
ingesting their data sets a little cleaner.

Breaking Changes:

- total revision of the migrations
- entirely removes the `UserAdminMessage` schema
- entirely removes all the outstanding ETL job stuff
- entirely removes charts -- that was a really stupid idea
- entirely removes exports -- again just a stupid idea
- totally new ingest pipeline
- slimmed down the API (still needs some work)

Closes #235
Closes #340
Closes #360
Closes #361