Figure out how to wire up Data Set Creation for Massive Data Sets #340

vforgione · 2018-06-28T17:28:37Z

Related to #235

For things like Crime Data and Divvy Rides, where the source takes minutes just to respond to a request, we need some sort of mechanism to handle this.

One possible solution: add an extra field to the meta that says this is a really big data set. When that flag is true, we can either increase the timeout value for the request or turn it into some sort of background job (and still bump out the request timeout). I'm not so sure about the background job, as it would have a significantly different workflow for the users.
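A minimal sketch of the flag idea, written in Python only for illustration. The `Meta` class, the `is_massive` field, and both timeout values are hypothetical and do not come from the actual schema.

```python
from dataclasses import dataclass

# Assumed values, chosen only for illustration.
DEFAULT_TIMEOUT_S = 30
MASSIVE_TIMEOUT_S = 15 * 60


@dataclass
class Meta:
    """Hypothetical stand-in for the data set metadata record."""
    name: str
    source_url: str
    is_massive: bool = False  # the proposed "really big data set" flag


def request_timeout(meta: Meta) -> int:
    """Pick the HTTP request timeout (in seconds) from the flag."""
    return MASSIVE_TIMEOUT_S if meta.is_massive else DEFAULT_TIMEOUT_S
```

The same flag could later be used to route the request into a background job instead, without changing the callers of `request_timeout`.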

@HeyZoos, @brlodi, @sanilio please comment.

brlodi · 2018-06-28T18:08:54Z

> add an extra field to the meta that says this is a really big data set.

I'd argue it's less that it's a really big dataset (though I'm sure that's the underlying cause) and more that the source server is poorly configured, or at least insufficiently configured. Ideally we'd just say that if your server can't be bothered to return at least a 202 Accepted within 30 seconds or so, then that's your problem.

To pick on Divvy Trips and the City a little bit here for the sake of example:

> GET /api/views/fg6s-gzvg/rows.csv HTTP/1.1
> Host: data.cityofchicago.org
> User-Agent: insomnia/5.16.6
> Accept: */*

< HTTP/1.1 200 OK
< Server: nginx
< Date: Thu, 28 Jun 2018 17:49:33 GMT
< Content-Type: text/csv; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< X-Socrata-RequestId: a8tww493syofhmkgmsbh7u0yb
< Access-Control-Allow-Origin: *
< Content-disposition: attachment; filename=Divvy_Trips.csv
< Cache-Control: public, must-revalidate, max-age=21600
< ETag: W/"YWxwaGEuMTE4NDM4XzI5XzIyMTU1THVYRTFfc3E5eUd6X0xFdmRLOTZhdFBTTEEVhO3X06sMVfUWaJqVNGN6bsfILw-gzip--gzip"
< X-SODA2-Data-Out-Of-Date: false
< X-SODA2-Truth-Last-Modified: Tue, 27 Mar 2018 21:37:31 GMT
< X-SODA2-Secondary-Last-Modified: Tue, 27 Mar 2018 21:37:31 GMT
< Last-Modified: Tue, 27 Mar 2018 21:37:31 GMT
< Vary: Accept-Encoding
< Age: 64
< X-Socrata-Region: aws-us-east-1-fedramp-prod

* Received 15.3 KB chunk
* Received 16 KB chunk
* Received 16 KB chunk
...

It's smart enough to use a chunked (streaming) transfer, but still produces the entire content on the server side before sending the header. It should send a 200 OK response header as soon as it starts compiling the data to transfer. Whether or not it has the first chunk immediately ready is irrelevant.

Unfortunately I think the best we can do (and I have no idea how to implement it) is check if the SSL handshake happens successfully, so we know we've at least connected to the target server, and then wait some arbitrarily long amount of time for it to get around to providing a header response. We have the slight benefit that such servers should be relatively quick about bad requests so we can somewhat safely assume that a present but silent server is formulating some huge response body.
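One way this two-phase wait could look, sketched in Python with only the standard library. The function names and both timeout values are assumptions made for the example, not anything from the project. The idea is a short timeout for the connection and handshake, then a much longer one while waiting for the response header.

```python
import http.client
from urllib.parse import urlsplit

CONNECT_TIMEOUT_S = 10      # assumed: budget for TCP connect + TLS handshake
HEADER_TIMEOUT_S = 15 * 60  # assumed: budget for the server to start answering


def request_target(url):
    """Return the path-plus-query part of a URL for use in the request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def open_slow_source(url, connect_timeout=CONNECT_TIMEOUT_S,
                     header_timeout=HEADER_TIMEOUT_S):
    """Return (connection, response) once the response header has arrived."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("this sketch only handles https URLs")

    conn = http.client.HTTPSConnection(parts.hostname, parts.port,
                                       timeout=connect_timeout)
    try:
        # Phase 1: connect() does the TCP connect and the TLS handshake, so an
        # unreachable host fails quickly under the short timeout.
        conn.connect()
        # Phase 2: the server is present, so allow it far longer to produce
        # the status line and headers.
        conn.sock.settimeout(header_timeout)
        conn.request("GET", request_target(url))
        response = conn.getresponse()
    except (OSError, http.client.HTTPException):
        conn.close()
        raise
    return conn, response
```

Note that the longer socket timeout also applies to each later read of the body, so a caller that wants a tighter limit while streaming would need to lower it again after the header arrives.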

vforgione · 2018-06-28T18:34:15Z

To be clear, we can verify that the resource exists relatively quickly by sending OPTIONS and HEAD requests, which we already do.

HeyZoos · 2018-06-28T18:39:31Z

Could you clarify where the workflow would be different? Data set ingest, small and large, is always carried out in the background from what I understand. Our issues with those have been due to GenServer timeouts and memory management.

vforgione · 2018-06-28T18:50:47Z

The field guesser, specifically. I think I just need to tinker around with that.

brlodi · 2018-06-28T18:52:43Z

If we know the resource exists, then I don't see any problem with just putting a stupidly long timeout on it. If we want to be fancy, we could set it automatically based on something like 125% of the response delay the last time we accessed the resource.
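A rough sketch of that automatic timeout, in Python for illustration. The function name, the floor, and the ceiling are assumptions; only the 125% factor comes from the suggestion above.

```python
def next_timeout(last_delay_s, factor=1.25, floor_s=30.0, ceiling_s=3600.0):
    """Derive the next request timeout from the previous response delay.

    last_delay_s is the number of seconds the source took to respond last
    time, or None if the resource has never been fetched.
    """
    if last_delay_s is None:
        # No history yet: fall back to the "stupidly long" ceiling.
        return ceiling_s
    # Scale the last delay, then clamp it between the floor and the ceiling.
    return min(max(last_delay_s * factor, floor_s), ceiling_s)
```

For example, a source that took 240 seconds last time would get a 300 second timeout on the next attempt, while a source that answered in 4 seconds would still get the 30 second floor.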

HeyZoos · 2018-06-28T18:58:59Z

Hmm, streaming the 1000 or so rows off the top doesn't work for those?
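For reference, the kind of top-of-stream sampling being asked about might look like the following Python sketch. The function name and the 1000-row default are illustrative and not taken from the field guesser itself.

```python
import csv
import io
from itertools import islice


def sample_rows(lines, limit=1000):
    """Read the header and at most `limit` data rows from an iterable of lines.

    Only the lines needed for the sample are consumed, so the rest of a very
    large stream is never read.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        # Empty stream: nothing to guess fields from.
        return [], []
    return header, list(islice(reader, limit))


# Usage with an in-memory stream standing in for an HTTP response body.
demo = io.StringIO("trip_id,duration\n1,300\n2,450\n3,120\n")
header, rows = sample_rows(demo, limit=2)
```

This only helps once bytes start arriving. It does nothing for a source that spends minutes building the whole body before it sends the first one.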

vforgione · 2018-06-28T19:24:37Z

It just kind of dies. I'm doing some verbose logging locally.

14:22:38.571 [info] POST /data-sets/2
14:22:38.579 [debug] QUERY OK source="users" db=1.1ms
SELECT u0."id", u0."name", u0."email", u0."password_hash", u0."bio", u0."is_active", u0."is_admin", u0."inserted_at", u0."updated_at" FROM "users" AS u0 WHERE (u0."id" = $1) [1]
14:22:38.579 [debug] Processing with PlenarioWeb.Web.DataSetController.update/2
  Parameters: %{"_csrf_token" => "HyJULUlhfyJcOA4LAV8ZD2NDU1QmEAAAEgcDyNFo/pdJp1zc/wc8Vw==", "_method" => "put", "_utf8" => "✓", "id" => "2", "meta" => %{"attribution" => "", "description" => "", "force_fields_reset" => "true", "name" => "Divvy Trips", "refresh_ends_on" => "", "refresh_interval" => "", "refresh_rate" => "", "refresh_starts_on" => "", "source_type" => "csv", "source_url" => "https://data.cityofchicago.org/api/view/fg6s-gzvg/rows.csv?accessType=DOWNLOAD"}}
  Pipelines: [:browser, :maybe_authenticated]
14:22:38.584 [debug] QUERY OK source="metas" db=4.0ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:38.589 [debug] QUERY OK source="metas" db=4.8ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:38.592 [info] beginning download of https://data.cityofchicago.org/api/view/fg6s-gzvg/rows.csv?accessType=DOWNLOAD
14:22:39.099 [debug] https://data.cityofchicago.org/api/view/fg6s-gzvg/rows.csv?accessType=DOWNLOAD streaming response is %HTTPoison.AsyncResponse{id: #Reference<0.910888085.3588227073.86861>}
14:22:39.106 [info] Sent 302 in 534ms
14:22:39.141 [info] GET /data-sets/2
14:22:39.146 [debug] QUERY OK source="users" db=4.0ms
SELECT u0."id", u0."name", u0."email", u0."password_hash", u0."bio", u0."is_active", u0."is_admin", u0."inserted_at", u0."updated_at" FROM "users" AS u0 WHERE (u0."id" = $1) [1]
14:22:39.146 [debug] Processing with PlenarioWeb.Web.DataSetController.show/2
  Parameters: %{"id" => "2"}
  Pipelines: [:browser, :maybe_authenticated]
14:22:39.147 [debug] QUERY OK source="metas" db=0.3ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:39.150 [debug] QUERY OK source="metas" db=3.1ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id" FROM "metas" AS m0 WHERE (m0."id" = $1) [2]
14:22:39.151 [debug] QUERY OK source="data_set_fields" db=0.9ms
SELECT d0."id", d0."name", d0."type", d0."description", d0."inserted_at", d0."updated_at", d0."meta_id", d0."meta_id" FROM "data_set_fields" AS d0 WHERE (d0."meta_id" = $1) ORDER BY d0."meta_id" [2]
14:22:39.151 [debug] QUERY OK source="users" db=1.1ms
SELECT u0."id", u0."name", u0."email", u0."password_hash", u0."bio", u0."is_active", u0."is_admin", u0."inserted_at", u0."updated_at", u0."id" FROM "users" AS u0 WHERE (u0."id" = $1) [1]
14:22:39.156 [debug] QUERY OK source="metas" db=4.1ms
SELECT m0."id", m0."name", m0."slug", m0."table_name", m0."state", m0."description", m0."attribution", m0."source_url", m0."source_type", m0."refresh_rate", m0."refresh_interval", m0."refresh_starts_on", m0."refresh_ends_on", m0."first_import", m0."latest_import", m0."next_import", m0."bbox", m0."time_range", m0."inserted_at", m0."updated_at", m0."user_id", m0."user_id" FROM "metas" AS m0 WHERE (m0."user_id" = $1) ORDER BY m0."user_id" [1]
14:22:39.157 [debug] QUERY OK source="virtual_date_fields" db=0.8ms
SELECT v0."id", v0."name", v0."inserted_at", v0."updated_at", v0."meta_id", v0."year_field_id", v0."month_field_id", v0."day_field_id", v0."hour_field_id", v0."minute_field_id", v0."second_field_id" FROM "virtual_date_fields" AS v0 WHERE (v0."meta_id" = $1) [2]
14:22:39.158 [debug] QUERY OK source="virtual_point_fields" db=0.5ms
SELECT v0."id", v0."name", v0."inserted_at", v0."updated_at", v0."meta_id", v0."lat_field_id", v0."lon_field_id", v0."loc_field_id" FROM "virtual_point_fields" AS v0 WHERE (v0."meta_id" = $1) [2]
14:22:39.164 [info] Sent 200 in 23ms

And then that's it...

vforgione · 2018-06-28T20:08:59Z

Ok, so Divvy Trips, in particular, takes waaaaaaaaay too long to render server side before it even starts to send a response. I'm going to work with the people at the city to work out a solution.

This may be the lead-up to building out an ingest process for Top-N data sets, where the body is too large to process in a realistic time frame.

vforgione · 2018-07-23T15:53:54Z

Blocked until we address #377

Since I totally rewrote the changesets, I needed to update the actions. And while I was updating how it worked with changesets, I refactored its methods as well. This module has smelled for a while and it was time for a change. Updates #340 Updates #235

The internal API was getting really nasty -- we had a bunch of one-off functions that clashed in arity (positional arguments, matches, guards, options ...). The web application was also a disaster -- originally I thought it would make it easier to keep the web, admin and API separate in subapps, but that ended up making things that much more difficult. Then that leaves the elephant in the room: Socrata. We've always relied on them and all of their awful decisions. The changes in here remove some of the terrible things about Socrata integration and make ingesting their data sets a little cleaner.

Breaking Changes:

- total revision of the migrations
- entirely removed the `UserAdminMessage` schema
- entirely removes all the outstanding ETL job stuff
- entirely removes charts -- that was a really stupid idea
- entirely removes exports -- again just a stupid idea
- totally new ingest pipeline
- slimmed down the API (still needs some work)

Closes #235 Closes #340 Closes #360 Closes #361

vforgione added core labels Jun 28, 2018

vforgione added this to the To a One-dot Release milestone Jun 28, 2018

vforgione self-assigned this Jun 28, 2018

vforgione added the blocked label Jun 28, 2018

vforgione removed this from the To a One-dot Release milestone Jul 19, 2018

vforgione added the Task and Blocked labels and removed the etl label Jul 19, 2018

vforgione pushed a commit that referenced this issue Oct 26, 2018

Modify metas table

b9d4d97

Adds new columns for Socrata sourced data sets and make the original source field nullable. Updates #340 Updates #235

vforgione pushed a commit that referenced this issue Oct 26, 2018

Added socrata fields to meta schema

5c22d47

Added fields and annotated source url to be default null. Updates #340 Updates #235

vforgione pushed a commit that referenced this issue Oct 26, 2018

Updated internal API calls

7595339

Many internal MetaActions changed signatures and caused some serious changes to tests and other internal actions. Also added back some helper functions. Updates #340 Updates #235

vforgione mentioned this issue Nov 14, 2018

Toward a sane internal API and Socrata native ingests #460

Merged

vforgione closed this as completed in #460 Nov 19, 2018
