Add Additional Sample Data Sets #21604

Closed
alexfrancoeur opened this issue Aug 2, 2018 · 20 comments
Labels
discuss, Feature:Add Data, Feature:Home, Spacetime, Team:Visualizations

Comments

@alexfrancoeur

alexfrancoeur commented Aug 2, 2018

With the new splash screen (#21353, #18828), we will be surfacing sample data as one of the first things you see when you enter Kibana in 6.5. In order for this feature to be meaningful, it would be better to have more than one sample data set.

These could be more "fun" and generic sample data sets like Flights or use case driven such as logging, metrics, etc. I think a total of 3 might be a good start here. While these data sets are small, we should still be cognizant of what we're adding to Kibana.

I know we have discussed waiting to see adoption of Flights before adding more data sets, but even with the more recent changes to the home page (#20953), sample data is still a bit hidden. We also will not have sample data telemetry in 6.4 (#19319). This splash screen will make sample data much more prominent and is important for Cloud trial users / adoption of the stack. It would also make sense to align with demo.elastic.co needs in order to avoid duplicate work.

Here are some (brief) initial thoughts, but I'd like to discuss what options might be best within this issue.

Would love to hear any ideas

cc: @gingerwizard @asawariS @EthanStrider @AlonaNadler @jamiesmith @epixa @jimgoodwin @rayafratkina @nreese

@alexfrancoeur added the discuss, Feature:Add Data, :Sharing, and Feature:Home labels Aug 2, 2018
@jamiesmith
Contributor

I like the idea of the ecommerce one - it would be really cool if we could get some geo data in there.

@gingerwizard

If we can ensure the ecommerce one matches the structure we use in our current ecommerce data, we can reuse the Canvas workpads that design has been helping with. They are very polished and will be maintained moving forward. Agreed also re the examples repo: this would move us to a more maintainable state.

@asawariS

asawariS commented Aug 3, 2018

I agree with @alexfrancoeur about limiting the effort to 3 for now.

I would vote for eCommerce as 2nd (first being flights). With Canvas in 6.5, huge ++ for shipping with prebuilt Canvas workpads on these datasets.
For the 3rd, I would vote for something that is supported by modules (like Apache or Nginx) as a way to promote modules and the Add Data tutorials in a more concrete way.

If possible I would bake an anomaly or 3 in the datasets. This will make ML docs and tutorials super simple.

@alexfrancoeur
Author

> I like the idea of the ecommerce one - it would be really cool if we could get some geo data in there.

> If we can ensure the ecommerce one matches the structure we use in our current ecommerce data, we can reuse the Canvas workpads that design has been helping with. They are very polished and will be maintained moving forward. Agreed also re the examples repo: this would move us to a more maintainable state.

@gingerwizard @jamiesmith Regarding the ecommerce data set, this was something I threw together for our Business Analytics instruction set. I'm not familiar enough with the current ecommerce data set, but I would gladly replace the one I came up with. The flights data set is around 14k docs and, I think, 6 weeks' worth of data. I was hoping to do something similar for other data sets. Is there any way we can get a snippet of that data set in newline-delimited JSON format? If we have the mappings and saved objects (including the Canvas workpad) exported, we could easily open a PR for a new sample data set.
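
For readers unfamiliar with the format requested above, here is a minimal sketch of newline-delimited JSON: one compact JSON object per line, keeping only the `_source` of each search hit. The function name `hits_to_ndjson` and the sample hits are hypothetical, and the sketch is in Python purely for illustration; it is not the script used for the Kibana sample data.

```python
import json


def hits_to_ndjson(hits):
    """Serialize the _source of each search hit as one JSON object per line."""
    lines = [json.dumps(hit["_source"], sort_keys=True) for hit in hits]
    # Newline-delimited JSON ends every record, including the last, with "\n".
    return "\n".join(lines) + "\n"


# Hypothetical hits, shaped like Elasticsearch search results.
sample_hits = [
    {"_id": "1", "_source": {"category": "Men's Shoes", "price": 64.99}},
    {"_id": "2", "_source": {"category": "Women's Clothing", "price": 20.0}},
]

ndjson = hits_to_ndjson(sample_hits)
```

Each line of `ndjson` can be parsed independently with `json.loads`, which is what makes the format convenient for streaming a data set of this size.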

> I would vote for eCommerce as 2nd (first being flights). With Canvas in 6.5, huge ++ for shipping with prebuilt Canvas workpads on these datasets.
> For the 3rd, I would vote for something that is supported by modules (like Apache or Nginx) as a way to promote modules and the Add Data tutorials in a more concrete way.

@asawariS do you think we'd want to use the actual module dashboards? Like literally re-use the Apache or Nginx data / module? Or introduce a more custom one? The only downside to re-using a beat dashboard is that we'd have to maintain it and make sure the dashboards are in sync. Any preference on data set? Could we borrow one from the examples repo?

> If possible I would bake an anomaly or 3 in the datasets. This will make ML docs and tutorials super simple.

++ At least for the flight sample data and my ecommerce sample data, these are controlled by a script. If we can define the anomalies we want, we can easily create them. If we're baking the anomalies in, it'd be great to bake in the ML jobs themselves. I don't believe they use the saved object service today though, so it may not be possible. Worth looking into.
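
As an illustration of what "defining the anomalies in the script" could look like, here is a rough sketch that generates hourly event counts around a baseline and multiplies the count at chosen hour offsets. Everything here (the function name `generate_hourly_counts`, the baseline of 100, the spike factor of 5) is an assumption made for the example; the actual flight and ecommerce generator scripts are not shown in this thread.

```python
import random
from datetime import datetime, timedelta


def generate_hourly_counts(start, hours, base=100, anomaly_hours=(), factor=5, seed=42):
    """Return (timestamp, count) pairs with spikes at the chosen hour offsets."""
    rng = random.Random(seed)  # fixed seed keeps the generated data reproducible
    anomalies = set(anomaly_hours)
    series = []
    for h in range(hours):
        count = base + rng.randint(-10, 10)  # normal noise around the baseline
        if h in anomalies:
            count *= factor  # deliberate, documented anomaly
        series.append((start + timedelta(hours=h), count))
    return series


# Three days of data with one known spike at hour 30.
series = generate_hourly_counts(datetime(2018, 8, 1), hours=72, anomaly_hours=(30,))
```

Because the anomaly positions are inputs rather than accidents of the data, docs and tutorials can point at the exact time window an ML job is expected to flag.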

@asawariS

asawariS commented Aug 7, 2018

> Is there any way we can get a snippet of that data set in newline-delimited JSON format? If we have the mappings and saved objects (including the Canvas workpad) exported, we could easily open a PR for a new sample data set.

@alexfrancoeur I have included a sample doc and workpad below [1]. I believe all of it is synthetic data except the manufacturer, SKU, product name and prices. I think product name is generic enough that it can stay in. SKU and manufacturer need to be scrambled (since the product catalog comes from Zalando, and I don't think we have rights to distribute it). My guess is that it will be easier to stick with your toy dataset and focus on improving dashboards using themes from the cyclops demo.

@jamiesmith can you action this and respond on this thread? Loop in @EthanStrider as needed since he did a bunch of work on the Canvas workpad for the business analytics demo.

> do you think we'd want to use the actual module dashboards? Like literally re-use the Apache or Nginx data / module? Or introduce a more custom one? The only downside to re-using a beat dashboard is that we'd have to maintain it and make sure the dashboards are in sync. Any preference on data set? Could we borrow one from the examples repo?

@alexfrancoeur I was thinking 3-fold:

  1. Include the default dashboard that ships with modules.
  2. Include a customized one (with a note that says so and that also references the default, as well as the Add Data instructions).
  3. Include a Canvas workpad, because not many people expect presentation-style dashboards for infra data, so it goes above and beyond.

We have an Apache one in the examples repo that could be borrowed, but there may be GDPR constraints for in-product use. If we decide to go with Apache, though, we have a good IP hashing script that can do the trick.
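
The IP hashing script mentioned above is not shown in this thread. As a hedged sketch of the general idea only, the following maps each IPv4 address to a stable, fake IPv4 address using a keyed hash (HMAC-SHA256), so the same client still appears as one client in the logs while the real address is not stored. The function name `pseudonymize_ip` and the choice of HMAC are assumptions for illustration, not a description of the existing script.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ip(ip, secret):
    """Map an IPv4 address to a stable, fake IPv4 address using a keyed hash."""
    ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    digest = hmac.new(secret, ip.encode("ascii"), hashlib.sha256).digest()
    # The first four bytes of the digest become the replacement address.
    return str(ipaddress.IPv4Address(digest[:4]))


# Hypothetical secret; the same secret must be used for the whole data set.
fake_ip = pseudonymize_ip("212.62.5.158", b"example-secret")
```

A keyed hash is used here rather than a plain hash because the IPv4 space is small enough to brute-force an unkeyed digest back to the original address.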


[1]
This is what Ethan has been creating:

image

{
  "_index": "cyclops-sold-product",
  "_type": "sold_product",
  "_id": "BPn9X2MB9UR6Fg0db3X4",
  "_version": 1,
  "_score": null,
  "_source": {
    "previous_order_count": 3200,
    "customer_id": 36,
    "is_anonymous": false,
    "discount_percentage": 0,
    "customer_gender": "MALE",
    "unit_discount_amount": 0,
    "day_of_week_i": 0,
    "manufacturer": "Pier One",
    "geoip": {
      "country_iso_code": "GB",
      "location": {
        "lon": -0.1224,
        "lat": 51.4964
      },
      "continent_name": "Europe"
    },
    "sku": "PI912DA2K-O11",
    "currency": "EUR",
    "base_price": 64.99,
    "customer_last_name": "Jones",
    "created_on": "2017-03-20T23:58:34+00:00",
    "min_price": 34.44,
    "product_name": "Winter boots - dark brown",
    "shipping_city": "Windsor",
    "taxful_price": 64.99,
    "merchant_notes": {
      "customer_type": "Value"
    },
    "version": "1.0",
    "type": "sold_product",
    "day_of_week": "Monday",
    "quantity": 1,
    "ip_address": "212.62.5.158",
    "tax_amount": 0,
    "session_key": "aqQB7LTNKxANvg83YD5TN6L55ksEMhzo",
    "billing_city": [
      "Windsor"
    ],
    "taxless_price": 64.99,
    "user": [
      "boris"
    ],
    "customer_first_name": "Boris",
    "basket_key": "DWw1HVp3Iof8GTSwxOBZbVRVJyyIW7oX",
    "customer_full_name": "Boris Jones",
    "discount_amount": 0,
    "category": "Men's Shoes",
    "price": 64.99,
    "base_unit_price": 64.99
  },
  "fields": {
    "created_on": [
      "2017-03-20T23:58:34.000Z"
    ]
  },
  "sort": [
    1490054314000
  ]
}

@jamiesmith
Contributor

I set up a quick call to chat about this tomorrow. I have a greatly cut down set of data, but we might actually want more than just the bare bones

@jamiesmith
Contributor

jamiesmith commented Aug 8, 2018

Things we would need to do to use that data:

  1. Lose the IPs
  2. Change the geo to be less exact
  3. Modify manufacturers
  4. Just one month, maybe just three weeks
  5. Obfuscate SKU
  6. Don't need:
  • any FKs
  • any frontmatter

Ethan is using: category, customer_gender, order_date, taxless_total_price
Alex is going to go through the JSON and see which he thinks are easily disposable (see the google doc)
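
A minimal sketch of how some of the clean-up steps listed above might be applied to a single document's `_source`: dropping the IP, coarsening the geo point, and obfuscating the SKU. The set of fields to drop is an assumption (treating `session_key` and `basket_key` as the foreign keys mentioned), the function name `sanitize` is hypothetical, and manufacturer renaming and date trimming are left out.

```python
import hashlib

# Assumed to be disposable: the raw IP plus key fields that look like FKs.
DROP_FIELDS = {"ip_address", "session_key", "basket_key"}


def sanitize(source, geo_decimals=1):
    """Return a cleaned copy of a document's _source without mutating the input."""
    doc = {k: v for k, v in source.items() if k not in DROP_FIELDS}
    if "sku" in doc:
        # Replace the real SKU with a short, stable, non-reversible token.
        doc["sku"] = hashlib.sha256(doc["sku"].encode("utf-8")).hexdigest()[:12]
    location = doc.get("geoip", {}).get("location")
    if location:
        # Rounding to one decimal place keeps the point only roughly accurate.
        doc["geoip"] = {
            **doc["geoip"],
            "location": {
                "lat": round(location["lat"], geo_decimals),
                "lon": round(location["lon"], geo_decimals),
            },
        }
    return doc


# A subset of the fields from the sample document in this thread.
original = {
    "sku": "PI912DA2K-O11",
    "ip_address": "212.62.5.158",
    "session_key": "aqQB7LTNKxANvg83YD5TN6L55ksEMhzo",
    "category": "Men's Shoes",
    "geoip": {"country_iso_code": "GB", "location": {"lon": -0.1224, "lat": 51.4964}},
}
cleaned = sanitize(original)
```

Hashing the SKU rather than deleting it keeps the field usable for aggregations, since identical products still share an identical value.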

@alexfrancoeur
Author

Quick update from Jamie and me:

  • We've trimmed down the data set for ecommerce a bit, but it should work with Ethan's workpad still
  • We'll see what dashboards can be reused from the SA demos here, but I may need to create new ones. A Sankey chart could be fun here 😄
  • We're discussing what we should use for a log analytics data source and dashboards. I'm guessing we'll need to create a new workpad for the ops use case. I may do something like this for my Canvas tour talks anyway, so it's possible that could be available for re-use.
  • We still need a flights sample data workpad
  • eCommerce is close, I could probably open a PR for it within the next week or so

@alexfrancoeur
Author

Started a PR for logs sample data #22276

@alexfrancoeur
Author

@asawariS @jamiesmith first pass at a flight sample data workpad. Will add other pages but I'll probably use this as part of my Elastic{ON} tour presentation. Logs up next

screen shot 2018-08-27 at 8 27 43 pm

@asawariS

sooooooo good @alexfrancoeur. I will come back with more questions/comments, if any, later.

quick question: where are the visual assets from? Assuming we have the rights to the images.

@alexfrancoeur
Author

@asawariS we do not, at least not without acknowledging the artist(s) somewhere. That's something we need to talk about. I'm building out prototypes, but I think we'll need input from design for some of these assets if we're going to package them with Kibana.

@jamiesmith
Contributor

Alex, take a look at the blog doc. There is an image section that talks about getting appropriately licensed artwork.

@asawariS

@alexfrancoeur for this we can consider getting our own design team to create something for this purpose. We have done that for past Canvas projects. We share rough mockups / concepts, and design builds something to align with the ask.

Examples:

@AlonaNadler

@alexfrancoeur that's excellent!!

@alexfrancoeur
Author

Logs are in (sans workpad), eCommerce next

@alexfrancoeur
Author

This is what I'm thinking for a logs workpad though

screen shot 2018-08-29 at 5 56 45 pm

@alexfrancoeur
Author

@asawariS I'll open some Design issues shortly

@alexfrancoeur
Author

Opened the following issues to track these sample data workpads. Design asset issues coming soon.

[Sample Data] Add Canvas Workpad for Flight Data #22891
[Sample Data] Add Canvas Workpad for Web Logs #22892
[Sample Data] Add Canvas Workpad for eCommerce Data #22893

Goal is to submit a PR for eCommerce sample data this week.

@timroes added Team:Visualizations and removed Team:Visualizations labels Sep 13, 2018
@timroes added Team:Visualizations and removed Team:Visualizations, :Sharing labels Sep 13, 2018
@alexfrancoeur
Author

alexfrancoeur commented Oct 4, 2018

Closing. Web logs and eCommerce data sets have been merged. Sample data dashboards are in separate issues linked above.

7 participants