
Guidelines for Data Preparation


So, you are creating your own data, or you have found a (slightly messy) external data source and you'd like to transform it for the better. Don't worry, and ...

Keep your data:

  • Simple.
  • Readable.
  • Compact.
  • Consistent.

Good analysis and good visualization are only possible with good data. Successful and rapid data analysis starts with attention and care in how you measure, collect, store, manage, and prepare your data. The following guidelines can help get your data in better shape for analysis.

Also, check out the amazing bad-data guide by Quartz: https://github.com/Quartz/bad-data-guide

The concepts in this page are also relevant to the "Tidy Data" concept described by Hadley Wickham: http://vita.had.co.nz/papers/tidy-data.pdf. It is no surprise that good analysis (with Keshif) starts with tidy data!

Tables

One table shows one entity, record by record.

  • Each data table presents a single entity, such as people, publications, incidents, etc.
  • Ideally, each record should have a unique identifying id. Without one, you cannot create links between records.
  • Avoid aggregating records. Each record should describe a specific item or observation.

Your data might consist of multiple tables! In keshif, you can create one browser for each major entity, and you can link to data in other tables when creating a summary. Other visualization tools may allow for more flexible join operations between data tables.

Note: Tables can also be used as code-books for categories. For example, "E" might map to expedited, "C" may map to complete, and so on.

Numeric Attributes

Just the number. No abbreviations. Same scale for all records.

  • Remove the unit names from the columns. If you list the prices of cellphones, do not encode the price as "$755", or the height as "7inches". Keep only the numeric part, and include the unit name as part of the column title, or in the "metadata" of the attribute.
  • Make sure all columns are encoded with the same unit. If part of the data is in milligrams and the other part in grams, make sure that all values have the same unit.
  • Do not include abbreviations like "3M" or spelled-out values like "3 million". Put down the number in digits, to whatever accuracy you have available. You may store the "accuracy" as metadata per numeric attribute, although keshif currently doesn't use such metadata.
  • Ideally, encode the number in its simplest unit. For example, use "meter" measurements, instead of "kilometer". This is helpful to avoid cases like "1k" kilometers, and allows the interface to show "1M" meters.
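
If your source values already contain unit symbols or abbreviations, a small cleanup step while loading the data (or a spreadsheet formula) can strip them. A minimal sketch in JavaScript; the function name is hypothetical:

function toNumber(value){
  // "$755" -> 755, "7inches" -> 7 : keep only digits, sign, and decimal point
  return parseFloat(String(value).replace(/[^0-9.\-]/g, ""));
}
toNumber("$755");     // 755
toNumber("7inches");  // 7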

Date/Time Attributes

When you load data from CSV or JSON files, date attributes are loaded into Keshif as strings, and thus need to be parsed into Date objects before they can be displayed with date/time semantics. Keshif does not internally parse date strings, and it does not impose any standard or "accepted" formats. Working with date objects is left to data loading, or to specifying attribute summaries. Thus, you will most probably want to use the JS API to create a browser that shows temporal features in your data. If you implement a customized JS callback to load your data, you may also need to parse any date-related strings in your input appropriately.

You can use the default JS Date object to represent your time-attributes. For reference, you may use: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date

If you summarize an attribute of data type Date(), keshif automatically shows a line-chart with an appropriate level of aggregation.

You can use d3's time formatting functions to parse or display date strings. For details on the type of formats d3 recognizes, please refer to https://github.com/mbostock/d3/wiki/Time-Formatting. To use d3's formatting capabilities, you need to create a dateFormat object first, ideally before you create new kshf.Browser().

var dateFormat = d3.time.format("%m/%d/%Y");

One option is to convert the type of the data to create a single summary. Assuming the "DateString" field holds values such as "1/25/2011", and using the dateFormat defined above, you may define the value for your summary as:

  value: function(){ return dateFormat.parse(this.DateString); }

Another option is to update the data directly after it is loaded. This will give you more control over re-using the same Date() object to extract day/month/year/etc, or to re-format it for display.

loadedCb: function(){
  var dateFormat = d3.time.format("%m/%d/%Y");
  // Convert each record's date string into a JS Date object in place
  kshf.dt.Data.forEach(function(d){
    d.data.Date = dateFormat.parse(d.data.Date);
  });
},

You may also use the momentjs library (http://momentjs.com/) for more flexibility (dealing with timezones, more advanced formatting and parsing options, etc.). If you choose to convert the attribute to a moment object, you can extract various time/date properties using functions of the moment library. An example which extracts hour, date, month, year, and also the full JS Date (for the line-chart) is available at:

https://github.com/adilyalcin/Keshif/blob/master/demo/bookdownloads.html
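
A minimal sketch of that approach, using the loadedCb pattern shown above. The "Time" attribute name and the input format are assumptions, not the exact code in the demo:

loadedCb: function(){
  kshf.dt.Data.forEach(function(d){
    var m = moment(d.data.Time, "M/D/YYYY H:mm"); // hypothetical timestamp field and format
    d.data.Hour  = m.hour();        // 0-23
    d.data.Month = m.month() + 1;   // moment months are 0-based
    d.data.Year  = m.year();
    d.data.Time  = m.toDate();      // full JS Date, for the line-chart summary
  });
},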

To display the month/date/year/etc. on a linechart given a JS Date object, keshif retrieves the UTC-based values from date objects. If time-zone accuracy is important, please convert your time objects into the UTC timezone.

  • In the future, Keshif may try to auto-detect certain date formats. However, at the moment, it does not impose a specific required format, and it offers flexible ways to work with Date fields using JS.

  • Make sure all representations are consistent. If one column is in the "May 2015" format, do not have columns formatted like "2014", "2015-07", "07-2015", etc.

  • Use numbers instead of strings. Parsing "05" as a month is easier than parsing "May", and arguably less error prone.

  • If the date attribute of a record cannot be correctly parsed, or is invalid, assign the date value to null or undefined for that record so that it is correctly skipped (see the sketch below).
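
For example, d3's parse() returns null when a string does not match the given format, so a defensive version of the loading callback above might look like this sketch:

kshf.dt.Data.forEach(function(d){
  var parsed = dateFormat.parse(d.data.Date); // null if the string doesn't match "%m/%d/%Y"
  d.data.Date = parsed ? parsed : null;       // invalid/missing dates are skipped
});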

Categorical Attributes

  • You may use standard codes in place of categorical variables, such as ISO 3166 country codes, or FIPS codes in place of county names. Both examples also carry some geographical context, but the same practice can apply to other kinds of data too.
  • If the values of categories can be further described/identified, for example, "authors" in a paper collection may have data related to them (affiliation, gender, location, etc.), use the unique identifier of each such item to describe the categories of the current table.

Multi-Valued Categorical Attributes

Multi,Values,Merged,With,Commas

  • Store multiple values of a categorical attribute using a simple separator within a single cell. For example, an article may have multiple tags (or multiple authors), with the goal of organizing the list of articles in a searchable and indexed format. You may use "+" as a separator; just be aware that any valid category text that contains the separator might be parsed incorrectly (see the sketch after this list). This is counter to database design, which would ask you to ID each tag, ID each record, and keep a separate table that encodes the relations between tags and records. But that violates simplicity and compactness. If all you care about is the list of tags per record, store it that way.

  • If each value has its own attributes, you'll need multiple tables to describe the relations. For example, assume you have a list of employees, each employee can be a member of multiple organizations, and each membership has a starting date. If you are to store this information, separate the tables. Visualization will be trickier.
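
If you store multiple values with a separator as described above, you can split them when defining the summary, similar to the value() example in the Date/Time section. A minimal sketch, assuming a hypothetical "Tags" field such as "politics + economy + europe" and assuming the summary accepts an array of categories:

  value: function(){
    return this.Tags.split("+").map(function(t){ return t.trim(); });
  }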

Note: See keshif:AgreeSet for set-based visualization details.

Using Spreadsheets

Spreadsheet as database

  • Make sure the first (header) row includes the column (attribute) names.
  • The first 1-2 columns of your table should hold the unique record identifiers, if any.
  • You can "freeze/fix" the topmost row and leftmost columns, so that when you scroll in the spreadsheet, you always see the attribute names at the top and the unique record identifiers in the left column.

Data File

Google Sheets: Great for small to moderate sizes (a few thousand records, tens of attributes). Make your sheet publicly readable, so that others can view your visualization.

JSON: If you have JSON files already, keep them that way. You can parse those files easily in keshif using JavaScript.

  • Ideally, the file should be [{},{}, .... , {}], where each {} is an object with key-value pairs, such as {"id": 10, "name": "Jon", "age": 21, "likes": "data, visualization, biking"}. The keys should ideally appear for all data items (a missing key means a missing attribute for that record). If you follow this structure, keshif can detect the attributes in the record collection easily.
  • If you do not want to implement custom data loading or value() functions for aggregates, do not use a hierarchical data representation per record. That is, if you have a record block {}, none of the attributes should be another block, such as { name: {first: 'John', last: 'Doe'} }. Sometimes the reason is semantic grouping; sometimes it is merging multiple tables. Remember the simple tabular structure, and flatten your JSON files if you want to keep things simple (see the sketch after this list).
  • Socrata's JSON formats are a good place to start if you want to do some custom data coding.
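
A minimal sketch of flattening one level of nesting, assuming the hypothetical { name: {first: 'John', last: 'Doe'} } structure above:

// { "id": 10, "name": { "first": "John", "last": "Doe" } }
// becomes { "id": 10, "name_first": "John", "name_last": "Doe" }
function flatten(record){
  var flat = {};
  Object.keys(record).forEach(function(key){
    var value = record[key];
    if (value !== null && typeof value === "object" && !Array.isArray(value)){
      Object.keys(value).forEach(function(sub){ flat[key + "_" + sub] = value[sub]; });
    } else {
      flat[key] = value;
    }
  });
  return flat;
}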

Be careful with missing values

  • If you have to encode a missing value, leave the field empty; or, if you have to differentiate the "meaning" of missing from incomplete, use simple, consistent coding, such as "NA" for empty fields.

Make it compact

  • If you can generate one attribute from another, remove the longer attribute. For example, if each record has an integer id (43) and a url (www.example.com/record/43), remove the URL; you can generate it from the existing column easily (using a simple custom formula), as sketched below.
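
A minimal sketch of regenerating the URL from the id at display time; the base URL follows the example above, and the function name is hypothetical:

function recordURL(record){
  return "http://www.example.com/record/" + record.id;
}
recordURL({ id: 43 }); // "http://www.example.com/record/43"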

Scaling up: more records, more attributes, larger files

Visualizing 1k records is trivial, 10k is fine, and with 100k, things may start to slow down, at least in web browsers. Also remember that all of the data, not a subset of it, is transmitted over the web to create a browser.

The guidelines below are for dealing with files that may be a little too large to be displayed and filtered in a web browser. They are not general guidelines.

  • Remove large chunks that are not "exploratory". Make these resources available on-demand per record.

Does your data include big chunks of text, descriptions, or maybe even embedded binary images, etc., that result in large files? If so, remove the columns with those big chunks, and make the resources available on demand per record. If you have an image, it can be retrieved from www.example.com/images/1.png. Or, if you have a text description, you can return it with simple API calls on demand (when the record is visible or more record details are requested).

  • Filter to recent time.

Is your data collected over time? If so, can you start by focusing on the most recent records for exploration? Then you are lucky: you can preprocess your data to use only the most recent records, such as the last month, the last year, or the most recent 5 years.

  • "Code" categories.

If an attribute is a categorical variable (string), create a codebook and map long strings to 1-2 letter codes. For example, "Architectural Planning" can be "AP". You can convert these codes to their original text form while visualizing/exploring the data.
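
A minimal sketch of converting such codes back to their full labels in a summary; the codebook contents and the "Field" attribute name are assumptions:

  value: function(){
    var fieldCodes = { "AP": "Architectural Planning", "CE": "Civil Engineering" };
    return fieldCodes[this.Field] || this.Field; // fall back to the raw code if unmapped
  }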

  • Aggregate records, and store the number of records per aggregation.

This is counter to one of the first guidelines: storing each record individually. However, there are cases where the data consists of millions of items. A good way to bring down the data size is to have each row represent multiple records of the original data. Use this method only when absolutely necessary. Note that each original record can still appear in only one row in this form of aggregation.

Licensing and attributions

Sample Datasets

Below are some sample datasets that mostly follow these guidelines. These datasets may have some formatting issues too. They are not tailored for absolute conformance. However, their structure allows for effective visual analytics, and they are easy to read. They are chosen from the keshif datasets at Google Sheets for convenience.

Other Resources