Csv Spec

etosch edited this page Feb 23, 2014 · 1 revision

SurveyMan input is a csv that has a specific data format, with prescribed headers. The minimum headers required to run a simple survey are the QUESTION and OPTION headers.

QUESTION and OPTIONS

The question column contains the question text and any HTML associated with this question. For now, we do support the RESOURCE column, but are in the process of deprecating it.

Each new question is entered on its own row. All other columns may be filled out. Each option gets its own row, and the question column must be filled out only once per question. That is, a question with two options, "yes" and "no", that lists "yes" in the option column of the same row as the question must be followed by a row whose question column is blank and whose option column contains "no". For example:

| QUESTION | OPTIONS |
| --- | --- |
| Are you human? | yes |
| | no |
| Are you a Cylon? | yes |
| | no |
| | maybe |
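The row layout above can be read back into questions with a few lines of Python. This is only a minimal sketch, not SurveyMan's own parser: the function name `read_questions` is hypothetical, and it assumes a comma-delimited file whose headers are spelled exactly QUESTION and OPTIONS.

```python
import csv
import io


def read_questions(csv_text):
    """Group option rows under the most recent non-blank QUESTION cell."""
    questions = []  # list of (question_text, [options]) in file order
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = (row.get("QUESTION") or "").strip()
        option = (row.get("OPTIONS") or "").strip()
        if text:
            # A non-blank QUESTION cell starts a new question.
            questions.append((text, []))
        if option:
            if not questions:
                raise ValueError("option row appears before any question")
            # A blank QUESTION cell means the option belongs to the
            # question that was started on an earlier row.
            questions[-1][1].append(option)
    return questions


SAMPLE = """QUESTION,OPTIONS
Are you human?,yes
,no
Are you a Cylon?,yes
,no
,maybe
"""

questions = read_questions(SAMPLE)
for text, options in questions:
    print(text, options)
```

Because grouping depends only on which rows have a blank question cell, the rows of one question must stay together, while the questions themselves can be listed in any order.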

BLOCK and BRANCH

If the BRANCH and BLOCK columns are not provided, SurveyMan assumes that the questions in the survey may appear in any order. Randomization is an integral part of SurveyMan's philosophy and it provides the basis for its statistical guarantees.

However, there are times when some questions must appear before others, or where questions ought to be grouped by topic. For example, demographic questions are conventionally grouped together, rather than scattered throughout the main part of a survey. Grouping by topic is a conventional way to reduce the amount of effort required to take the survey and has been shown to increase participation and improve the quality of results.

Blocks can either be stationary or randomizable. A stationary block is one whose ordering in relation to other blocks is fixed. An example where stationary blocks might be used is a survey that asks the respondent to answer questions on the basis of some passage and then tests recall later in the survey. The immediate reading response questions paired with the text should be grouped together, whereas the questions testing longer-term recall should be placed in another group.

A randomizable block is one whose ordering in relation to other blocks doesn't matter. For example, in a survey that contains a block for demographic information and a block with questions of research interest, the two blocks may themselves be made randomizable, since it may not matter whether they appear at the beginning or at the end of the survey, so long as the questions in each block appear together.
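One way to picture the difference between the two kinds of block is the sketch below. It reflects one possible reading of the description above, not necessarily what SurveyMan does internally: stationary blocks keep the positions they were written in, and randomizable blocks are shuffled among the remaining positions. The function name `order_blocks` and the block names are hypothetical.

```python
import random


def order_blocks(blocks, rng=None):
    """Return block names in presentation order.

    blocks is a list of (name, randomizable) pairs in written order.
    Assumption: stationary blocks stay in place and randomizable blocks
    are shuffled among the slots that randomizable blocks occupy.
    """
    rng = rng or random.Random()
    floating = [name for name, randomizable in blocks if randomizable]
    rng.shuffle(floating)
    remaining = iter(floating)
    return [next(remaining) if randomizable else name
            for name, randomizable in blocks]


blocks = [
    ("reading passage", False),
    ("demographics", True),
    ("research questions", True),
    ("recall test", False),
]
print(order_blocks(blocks))
```

Under this reading the reading passage always comes first and the recall test always comes last, while the two middle blocks may swap places from one respondent to the next.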

Together, branching and blocking provide basic control flow in the survey. Branching is the mechanism that causes one respondent to see one question on the basis of a particular answer and another respondent to see a different question if the answer differs.

Arbitrary branching is not supported. Branching must obey the following restrictions:

  1. Branching must always be forward, to a top level block.
  2. If a question's answer options have branching associated with them, that question must either be the only question in its block (including sub-blocks) that branches, or every question in the block must branch. We call these "branch-one" and "branch-all" semantics.

If the block has branch-one semantics, the branch question can appear in any position permitted by the blocking semantics. However, the answer to that question determines the next block shown only after all questions in that block have been shown. Consequently, branch questions cannot be used to "short-circuit" a block.
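The two restrictions can be checked mechanically. The sketch below is illustrative only: the data layout (a dict from top-level block number to a list of questions, each question being a dict from answer option to target block) and the name `check_branching` are assumptions made for this example, not SurveyMan's internal representation.

```python
def check_branching(blocks):
    """Return a list of rule violations (empty if the survey is valid).

    blocks maps a top-level block number to the questions it contains,
    sub-blocks included. Each question is a dict from answer option to
    target block number, or an empty dict if it does not branch.
    """
    problems = []
    for number, questions in blocks.items():
        branching = [q for q in questions if q]
        # Rule 2: zero, exactly one, or all questions in the block branch.
        if len(branching) not in (0, 1, len(questions)):
            problems.append(
                f"block {number}: neither branch-one nor branch-all")
        for question in branching:
            for option, target in question.items():
                # Rule 1: the target is a top-level block further along.
                if target not in blocks:
                    problems.append(
                        f"block {number}: {option!r} targets unknown "
                        f"top-level block {target}")
                elif target <= number:
                    problems.append(
                        f"block {number}: {option!r} does not branch forward")
    return problems


survey = {
    1: [{}, {}, {"yes": 4, "no": 3, "maybe": 4}],  # branch-one
    3: [{"a": 5, "b": 5}],
    4: [{"yes": 5, "no": 5}],
    5: [{}],
}
print(check_branching(survey))
```

A block in which two out of three questions branch, or an option that points back to an earlier block, would each add an entry to the returned list.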

Block and Branch Syntax

The block column respects the regex _?[1-9][0-9]*(\._?[1-9][0-9]*)*. A simple survey like the demographic survey described above would have block numbers such as:

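| BLOCK | QUESTION | OPTION |
| --- | --- | --- |
| _1 | What is your age? | <18 |
| | | 19-35 |
| | | 36-64 |
| | | 65+ |
| _1 | Where do you live? | USA |
| | | Canada |
| | | Mexico |
| | | Elsewhere |
| _2 | Are you human? | yes |
| | | no |
| _2 | Are you a Cylon? | yes |
| | | no |
| | | maybe |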

The underscore indicates that this block is randomizable, so the respondent may be asked about human/Cylon status before or after being asked about their age and home location.
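The block syntax can be validated directly with the regex given above. The following sketch uses Python's `re.fullmatch`; the helper name `parse_block_id` and the (number, randomizable) pair representation are hypothetical choices for this example.

```python
import re

# The regex quoted in the text, applied to the whole cell.
BLOCK_ID = re.compile(r"_?[1-9][0-9]*(\._?[1-9][0-9]*)*")


def parse_block_id(block_id):
    """Split a block id into (number, randomizable) pairs, one per level."""
    if BLOCK_ID.fullmatch(block_id) is None:
        raise ValueError(f"malformed block id: {block_id!r}")
    return [(int(part.lstrip("_")), part.startswith("_"))
            for part in block_id.split(".")]


print(parse_block_id("_1"))        # [(1, True)]
print(parse_block_id("1._2.1.1"))  # [(1, False), (2, True), (1, False), (1, False)]
```

Ids such as "0.1", "1." or "a" do not match the regex and raise a `ValueError` here.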

Sub-blocks can be used to control the maximum distance between two questions. One application of this is to ensure that two questions appear back-to-back. Survey designers can use this feature to display information as question-answer pairs:

| BLOCK | QUESTION | OPTION |
| --- | --- | --- |
| _1 | What is your age? | <18 |
| | | 19-35 |
| | | 36-64 |
| | | 65+ |
| _1 | Where do you live? | USA |
| | | Canada |
| | | Mexico |
| | | Elsewhere |
| _2 | Are you human? | yes |
| | | no |
| _2 | Are you a Cylon? | yes |
| | | no |
| | | maybe |
| _2.1.1 | Please watch this video before clicking next : <iframe width="420" height="345" src="http://www.youtube.com/embed/somelink"> </iframe> | |
| _2.1.2 | What was your main reaction to the video? | Disgust at the human race |
| | | Fear |
| | | Pity |
| | | Apathy |

To ensure ordering, we created an "empty" sub-block, 2.1. This sub-block contains two other sub-blocks that each have one question. This ensures that the respondent will always watch the video first and answer a question about it immediately after viewing. Since there is only one immediate sub-block in block 2 (i.e. block 2.1), we do not need to specify that 2.1 is randomizable. However, if we had another sub-block that we also wanted to treat as a pair, while maximizing the ability to randomize, we would change the notation "_2.1" to "_2._1".
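To see how the "empty" sub-block arises, the block ids can be folded into a tree. This is a rough sketch under stated assumptions: the `Block` class and `build_tree` function are hypothetical, and the randomization underscore is simply dropped so that only the nesting is shown.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    questions: list = field(default_factory=list)
    children: dict = field(default_factory=dict)  # sub-block number -> Block


def build_tree(rows):
    """Nest questions by block id; rows are (block_id, question) pairs."""
    root = Block()
    for block_id, question in rows:
        node = root
        for part in block_id.split("."):
            number = int(part.lstrip("_"))  # ignore the randomization flag
            # Intermediate levels are created on demand, which is how a
            # sub-block with no questions of its own (2.1) comes to exist.
            node = node.children.setdefault(number, Block())
        node.questions.append(question)
    return root


tree = build_tree([
    ("_2", "Are you human?"),
    ("_2", "Are you a Cylon?"),
    ("_2.1.1", "Please watch this video before clicking next"),
    ("_2.1.2", "What was your main reaction to the video?"),
])
block_2_1 = tree.children[2].children[1]
print(block_2_1.questions)         # []
print(sorted(block_2_1.children))  # [1, 2]
```

Block 2.1 holds no questions itself; it only groups sub-blocks 2.1.1 and 2.1.2 so that the video and the reaction question stay adjacent.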

If we were to add branching, we would need to add a BRANCH column that points to top level blocks according to the answer value. Since blocks are presented in order, if the survey designer wishes to implement parallel paths in the survey, all blocks will need to branch until the paths join. There can only be one endpoint for a survey. We recommend that the survey design include a feedback form or thank-you page at the end of the survey to join all paths.

| BLOCK | QUESTION | OPTION | BRANCH |
| --- | --- | --- | --- |
| 1._1 | What is your age? | <18 | |
| | | 19-35 | |
| | | 36-64 | |
| | | 65+ | |
| 1._1 | Where do you live? | USA | |
| | | Canada | |
| | | Mexico | |
| | | Elsewhere | |
| 1._2 | Are you human? | yes | |
| | | no | |
| 1._2 | Are you a Cylon? | yes | 4 |
| | | no | 3 |
| | | maybe | 4 |
| 1._2.1.1 | Please watch this video before clicking next : <iframe width="420" height="345" src="http://www.youtube.com/embed/somelink"> </iframe> | | |
| 1._2.1.2 | What was your main reaction to the video? | Disgust at the human race | |
| | | Fear | |
| | | Pity | |
| | | Apathy | |
| 3 | How do you know you are human? | I have a bellybutton. | 5 |
| | | I feel shame | 5 |
| | | Up made me cry. | 5 |
| | | Well, I guess I'm not that sure | 5 |
| 4 | Are you a misanthrope? | yes | 5 |
| | | no | 5 |
| | | I don't know what that word means | 5 |
| 5 | Thank you for your time. | | |

The only ordering that matters in the csv is the grouping of question options with questions, so the branches could appear in any order. The block notation gives a partial order over all questions.
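The effect of the BRANCH column on the path a respondent takes can be traced with a small simulation. It is a sketch under two assumptions that the text only implies: a block without a chosen branch falls through to the next top-level block in numeric order, and all top-level blocks are stationary. The function name `block_path` is hypothetical.

```python
def block_path(blocks, chosen_branch):
    """Return the sequence of top-level blocks a respondent sees.

    blocks lists the top-level block numbers; chosen_branch maps a block
    to the target selected by the respondent's answer in that block.
    """
    order = sorted(blocks)
    path = []
    current = order[0]
    while current is not None:
        path.append(current)
        target = chosen_branch.get(current)
        if target is not None:
            if target not in order or target <= current:
                raise ValueError(f"invalid branch {current} -> {target}")
            current = target
        else:
            # Assumed fall-through: the next block in numeric order.
            later = [b for b in order if b > current]
            current = later[0] if later else None
    return path


print(block_path([1, 3, 4, 5], {1: 3, 3: 5}))  # [1, 3, 5]
print(block_path([1, 3, 4, 5], {1: 4, 4: 5}))  # [1, 4, 5]
print(block_path([1, 3, 4, 5], {1: 3}))        # [1, 3, 4, 5]
```

The last call shows why every block on a parallel path needs to branch until the paths join: without a branch out of block 3, a respondent on that path would also be shown block 4.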

FREETEXT, ORDERED, EXCLUSIVE, RANDOMIZE, and default values

There are four Boolean columns used to determine quality control guarantees; they double as display information in the web survey. The following table describes the semantics of particular combinations of these flags. A star indicates that either Boolean value is valid.

| FREETEXT | ORDERED | EXCLUSIVE | RANDOMIZE | Display | QC |
| --- | --- | --- | --- | --- | --- |
| true | * | * | * | HTML text box | The weakest QC option. We currently do not use text responses for quality control. |
| * | true | true | true | HTML radio button | Our second strongest statistical guarantee - these settings are used for "Likert Scale" questions |
| * | false | true | true | HTML radio button | Our strongest statistical guarantee - use this for any question where the order doesn't matter and the user may only choose one of an item. |
| * | true | false | true | HTML check box | Ordered checkboxes rely on a similarity metric between items. We do not currently allow the user to define such a metric, so we default to a QC mechanism that is similar to the one we use for Likert scales. |
| * | false | false | true | HTML check box | Unordered checkboxes are treated as if they are exclusive options of the power set of options |
| * | * | * | false | HTML according to the ORDERED, EXCLUSIVE, and FREETEXT options | When the options cannot be randomized in any way, we must condition on the order. This provides significantly weaker statistical guarantees, but can still be used for testing other properties of the survey. |
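The table can be restated as a small lookup function. This is a sketch, not SurveyMan code: the name `describe_flags` and the short QC labels are paraphrases invented for this example, and it assumes that FREETEXT takes precedence when more than one row of the table matches.

```python
def describe_flags(freetext, ordered, exclusive, randomize):
    """Map the four Boolean columns to (display, QC summary)."""
    # Display depends only on FREETEXT and EXCLUSIVE in the table.
    if freetext:
        display = "text box"
    elif exclusive:
        display = "radio button"
    else:
        display = "check box"

    # Assumed precedence: FREETEXT first, then RANDOMIZE=false.
    if freetext:
        qc = "weakest: text responses are not used for QC"
    elif not randomize:
        qc = "weaker: must condition on option order"
    elif exclusive and not ordered:
        qc = "strongest"
    elif exclusive:
        qc = "second strongest (Likert scale)"
    elif ordered:
        qc = "Likert-like fallback for ordered checkboxes"
    else:
        qc = "exclusive choice over the power set of options"
    return display, qc


print(describe_flags(False, True, True, True))
print(describe_flags(False, False, False, True))
```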

Correlation

The CORRELATION column allows users to flag questions whose answers are expected to be correlated. Any string can serve as a correlation label. Labels are compared to form sets of questions, and the results for each pair of questions in a set are expected to be related by a monotonic function.
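A rough illustration of how labels might partition questions into sets follows. It assumes that two labels match when the strings are exactly equal, which the text does not spell out; the function name `correlated_pairs` and the sample questions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def correlated_pairs(labelled_questions):
    """Group questions by CORRELATION label and list the pairs per group.

    labelled_questions is a list of (question, label) pairs; a blank
    label means the question is not flagged.
    """
    groups = defaultdict(list)
    for question, label in labelled_questions:
        if label:
            groups[label].append(question)
    # Each pair within a group is expected to be monotonically related.
    return {label: list(combinations(questions, 2))
            for label, questions in groups.items()}


pairs = correlated_pairs([
    ("How old are you?", "age"),
    ("What year were you born?", "age"),
    ("Where do you live?", ""),
])
print(pairs)
```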