
Guidelines for Data Preparation


So, you are creating your own data, or you have found a (slightly messy) external data source that you'd like to transform for the better. Don't worry, and ...

Keep your data:

  • Simple.
  • Readable.
  • Compact.
  • Consistent.

Good analysis and good visualization are only possible with good data. Successful and rapid data analysis starts with attention and care in how you measure, collect, store, manage, and prepare your data. The following guidelines can help get your data in better shape for analysis, from the perspective of keshif.

Tables

One table shows one entity, record by record.

  • Each data table presents a single entity, such as people, publications, incidents, etc.
  • Ideally, each record should have a unique identifier (id). Without one, you cannot create links between records.
  • Avoid aggregating records. Each record should describe a specific item or observation.

Your data might consist of multiple tables! In keshif, you can create one browser for each major entity, and you can link to data in other tables when creating a summary. Other visualization tools may allow for more flexible join operations between data tables.

Note: Tables can also be used as code-books for categories. For example, "E" might map to "expedite", "C" to "complete", and so on.

Numeric Attributes

Just the number. No abbreviations. Same scale for all records.

  • Remove unit names from the values. If you list the prices of cellphones, do not encode the price as $755 or the height as 7inches. Keep only the numeric part, and include the unit name in the column title or in the "metadata" of the attribute (see the cleanup sketch after this list).
  • Make sure all values in a column are encoded with the same unit. If part of the data is in milligrams and the other part in grams, convert so that all values share one unit.
  • Do not include abbreviations like "3M" or "3 million". Write the number using digits, to whatever accuracy you have available. You may store the "accuracy" as metadata per numeric attribute, although keshif currently doesn't use such metadata.
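
A minimal cleanup sketch in JavaScript, assuming string values such as "$755", "7inches", or "3M" arrive from a messy source; the function name and the abbreviation patterns are illustrative, not part of keshif:

```js
// Hypothetical helper: strip unit text and expand abbreviations so that
// only a plain number remains. Extend the patterns to match your data.
function toNumber(raw) {
  var s = String(raw).trim();
  // Expand "3M" / "3 million" style abbreviations first.
  var m = s.match(/^([\d.,]+)\s*(m|million)$/i);
  if (m) return parseFloat(m[1].replace(/,/g, "")) * 1e6;
  // Otherwise drop currency symbols, unit names, and separators.
  return parseFloat(s.replace(/[^0-9.\-]/g, ""));
}

toNumber("$755");    // 755
toNumber("7inches"); // 7
toNumber("3M");      // 3000000
```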

Date/Time Attributes

  • Make sure all representations are consistent. If one column is in the form "May 2015", do not have values like "2014", "2015-07", "07-2015", etc. (see the parsing sketch after this list).
  • Use numbers instead of strings. Parsing "05" as a month is easier than parsing "May", and arguably less error prone.
  • Decide on the granularity you need: "Date" only, or "Date" and "Time".
  • If the data is generated by a computer algorithm, use a standard timestamp format, such as ISO 8601.
  • Be careful about time zones. Either include time-zone information per column or per dataset, or encode all values in a standard time zone such as UTC.
  • (Incomplete)
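
As a sketch of such normalization, the following maps the inconsistent forms named above to a single "YYYY-MM" string; the month table and function name are illustrative assumptions:

```js
// Hypothetical helper: normalize "May 2015", "07-2015", "2015-07" to "2015-07".
var MONTHS = { jan: "01", feb: "02", mar: "03", apr: "04", may: "05", jun: "06",
               jul: "07", aug: "08", sep: "09", oct: "10", nov: "11", dec: "12" };

function toYearMonth(raw) {
  var s = String(raw).trim(), m;
  if (/^\d{4}-\d{2}$/.test(s)) return s;                            // "2015-07"
  if ((m = s.match(/^(\d{2})-(\d{4})$/))) return m[2] + "-" + m[1]; // "07-2015"
  if ((m = s.match(/^([A-Za-z]+)\s+(\d{4})$/)))                     // "May 2015"
    return m[2] + "-" + MONTHS[m[1].slice(0, 3).toLowerCase()];
  return null; // unknown format: flag it instead of guessing
}
```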

Categorical Attributes

  • You may use standard codes in place of categorical values, such as ISO 3166 country codes in place of country names, or FIPS codes in place of county names. Both examples carry some geographical context, but the same practice can apply to other kinds of data too.
  • If the category values can be described or identified further, for example "authors" in a paper collection may have data of their own (affiliation, gender, location, etc.), use each item's unique identifier to describe the categories of the current table.

Multi-Valued Categorical Attributes

Multi,Values,Merged,With,Commas

  • Store multiple values of a categorical attribute using a simple separator within a single cell. For example, an article may have multiple tags or multiple authors, with the goal of keeping the list of articles searchable and indexed. You may use "+" as a separator; just be aware that any valid category text containing that separator will be parsed incorrectly (see the sketch after this list). Note that this is counter to database design, which would ask you to ID each tag, ID each record, and keep a separate table that encodes the relations between tags and records. But that violates simplicity and compactness: if all you care about is the list of tags per record, store it that way.

  • If each value has its own attributes, you'll need multiple tables to describe the relations. For example, assume you have a list of employees, and each employee can be a member of multiple organizations, with a starting date for each membership. To store this information, separate the tables. Visualization will be trickier.
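
A minimal splitting sketch, assuming "+" as the separator and that no category text contains it:

```js
// Hypothetical helper: turn "data + visualization + biking" into an array.
function splitValues(cell, separator) {
  if (cell == null || cell === "") return [];
  return String(cell).split(separator).map(function (v) { return v.trim(); });
}

splitValues("data + visualization + biking", "+");
// ["data", "visualization", "biking"]
```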

Note: See keshif:AgreeSet for set-based visualization details.

Using Spreadsheets

Spreadsheet as database

  • Make sure the first row (the header) includes the column (attribute) names.
  • The first 1-2 columns should hold the unique identifier of each record, if any.
  • You can "freeze" the topmost rows and leftmost columns, so that when you scroll in the spreadsheet, you always see the attribute names and record identifiers.

Data File

Google Sheets: Great for small to moderate sizes (a few thousand records, tens of attributes). Make your sheet publicly readable, so that others can view your visualization.

JSON: If you already have JSON files, keep them that way. You can parse those files easily in keshif using JavaScript.

  • Ideally, the file should be [{},{}, .... , {}], where each {} is an object with key-value pairs, such as {"id": 10, "name": "Jon", "age": 21, "likes": "data, visualization, biking"}. The keys should ideally appear for all data items (a missing key means a missing attribute for that record). If you follow this structure, keshif can detect the attributes in the record collection easily.
  • Do not use hierarchical representations. That is, if you have a record block {}, none of its attributes should be another block. Sometimes the reason is semantic grouping; sometimes it is merging multiple tables. Remember the simple tabular structure, and flatten your JSON files where possible (see the sketch after this list).
  • Socrata's JSON formats are a good place to start if you want a more extensive structure.
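
Where flattening is needed, a sketch like the following can help; the dotted-key convention is an assumption of this example, not a keshif requirement:

```js
// Hypothetical helper: flatten nested record blocks into dotted keys.
function flatten(record, prefix, out) {
  prefix = prefix || "";
  out = out || {};
  Object.keys(record).forEach(function (key) {
    var value = record[key];
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      flatten(value, prefix + key + ".", out); // recurse into nested blocks
    } else {
      out[prefix + key] = value;
    }
  });
  return out;
}

flatten({ id: 10, name: { first: "Jon", last: "Doe" } });
// { "id": 10, "name.first": "Jon", "name.last": "Doe" }
```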

Be careful with missing values

  • To encode a missing value, leave the field empty. If you need to differentiate the "meaning" of missing from incomplete, use simple, consistent coding, such as "NA" for empty fields (a normalization sketch follows).
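
A minimal normalization sketch, where the list of ad-hoc "missing" spellings is an assumption you'd adapt to your data:

```js
// Hypothetical helper: map the many spellings of "missing" to one code.
var MISSING = ["", "na", "n/a", "null", "-", "?"];

function normalizeMissing(value) {
  var s = String(value == null ? "" : value).trim().toLowerCase();
  return MISSING.indexOf(s) >= 0 ? "" : value; // "" = missing, consistently
}
```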

Make it compact

  • If you can generate one attribute from another, remove the longer attribute. For example, if each record has an integer id (43) and a URL (www.example.com/record/43), remove the URL; you can generate it from the id column easily (using a simple formula, as in the sketch below).
  • Ideally, if you want to store such relations, the dataset should encode the formula rather than storing the derived values explicitly.
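
A sketch of such a formula, using the example URL pattern above:

```js
// Derive the URL from the stored id instead of keeping a URL column.
function recordURL(record) {
  return "http://www.example.com/record/" + record.id;
}
```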

Scaling up: More records, more attributes, larger files

Visualizing 1k records is trivial, 10k is fine, and with 100k, things may start to slow down, at least in web browsers. Also remember that in most web-based data browsers, all of the data is transmitted over the web, not just a subset of it.

The guidelines below deal with files that may be a little too large to be displayed and filtered in a web browser; they are not general guidelines.

  • Remove large chunks that are not "exploratory". Make these resources available on-demand per record.

Does your data include big chunks of text, descriptions, or maybe even embedded binary images that result in large files? If so, remove those columns, and make the resources available on demand per record. If you have an image, it can be retrieved from www.example.com/images/1.png. Or, if you have a text description, you can return it with a simple API call on demand (when the item is visible or more item details are requested), as in the sketch below.
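
As a sketch, assuming a hypothetical per-record endpoint (the URL and field names are illustrative):

```js
// Hypothetical helper: fetch a record's long description only when
// its details are actually requested in the interface.
function loadDescription(record) {
  return fetch("http://www.example.com/api/description/" + record.id)
    .then(function (response) { return response.text(); });
}
```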

  • Filter to recent time.

Is your data collected over time? If so, will focusing on the most recent records help your analysis, at least to begin understanding the current state of your data? Then you are lucky: you can sort and filter your data based on the date field.
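
A filtering sketch, where the two-year cutoff and the "Date" column name are assumptions:

```js
// Keep only records from the last two years before exporting the file.
var cutoff = new Date();
cutoff.setFullYear(cutoff.getFullYear() - 2);

var recent = records.filter(function (r) { // records: the full data array
  return new Date(r.Date) >= cutoff;
});
```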

  • "Code" categories.

If an attribute is a categorical variable (string), create a codebook and map long strings to 1-2 letter codes. For example, "Architectural Planning" can be "AP". You can convert these codes to their original text form while visualizing/exploring the data.
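
A minimal codebook sketch, using the example above (only one entry shown):

```js
// Short codes live in the data file; full names are restored at load time.
var CODEBOOK = { "AP": "Architectural Planning" /* , ... */ };

function decode(code) {
  return CODEBOOK[code] || code; // fall back to the raw code if unmapped
}
```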

  • Aggregate by sum.

This is counter to one of the first guidelines: storing each record individually. However, there are cases where the data can consist of millions of items. A good way to bring down the data size is to have each row represent multiple records of the original data. Use this method only when absolutely necessary. Note that each record can still appear in only one row in this form of aggregation.
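
A minimal aggregation sketch; the key and value column names are hypothetical:

```js
// Sum a numeric column over all rows that share the same key,
// producing one row per key instead of one row per original record.
function aggregateBySum(rows, keyColumn, valueColumn) {
  var totals = {};
  rows.forEach(function (row) {
    var key = row[keyColumn];
    totals[key] = (totals[key] || 0) + Number(row[valueColumn]);
  });
  return Object.keys(totals).map(function (key) {
    var rec = {};
    rec[keyColumn] = key;
    rec[valueColumn] = totals[key];
    return rec;
  });
}
```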

Licensing and attributions

Other Resources