# Types of Scientific Questions

Before embarking on any data analysis, you should know what question you're trying to answer. It may seem obvious, but the biggest mistake I have seen in my career as a data scientist is an analysis that claims (sometimes implicitly) to answer one question but really answers a different question altogether.

There are different ways to classify data analyses and scientific questions, but I like the classification by [Leek and Peng](https://www.d.umn.edu/~kgilbert/ened5560-1/The%20Research%20Question-2015-Leek-1314-5.pdf) best. It is summed up by this diagram:

![](https://science.sciencemag.org/content/sci/347/6228/1314/F1.large.jpg)

Let's look at some examples.

Many children (and adults) think they are innately "bad at math" and avoid activities that involve numbers or calculation. This is called *math phobia*. Let's say we're interested in understanding math phobia in primary-school-aged children.

## Descriptive and Exploratory Questions

There are many questions we could ask on the subject. For instance, how many children self-report as "good" or "bad" at math? And how many children in each of these groups identify as boys? If we simply reported the numbers we found, that would be, according to our classification, a *descriptive* analysis. If, in addition, we interpreted those findings by suggesting that being raised as a boy appears to decrease (or increase) math phobia, then that would be an *exploratory* analysis.
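As a minimal sketch of the descriptive step, the counts described above can be tabulated in a few lines of Python. The `responses` records and the `describe` helper are invented for illustration; they are not real survey data.

```python
from collections import Counter

# Hypothetical survey responses: (self-reported math ability, gender identity).
# These records are made up purely for illustration.
responses = [
    ("good", "boy"), ("bad", "girl"), ("good", "girl"), ("bad", "boy"),
    ("bad", "girl"), ("good", "boy"), ("bad", "girl"), ("good", "girl"),
]

def describe(records):
    """Count children by self-report, and the boys within each self-report group."""
    by_report = Counter(report for report, _ in records)
    boys_by_report = Counter(report for report, gender in records if gender == "boy")
    return {
        report: {"total": total, "boys": boys_by_report[report]}
        for report, total in by_report.items()
    }

summary = describe(responses)
print(summary)
```

Reporting `summary` as it stands is descriptive. The analysis becomes exploratory only once we attach an interpretation to the numbers.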

The difference between these is that we made a claim based on the data instead of just reporting the data. Sometimes claims are made implicitly, so it's difficult to draw a hard-and-fast distinction between descriptive and exploratory analyses. For that reason, many people use these terms more or less interchangeably, or to describe any kind of analysis that is not meant to be thoroughly rigorous.

## Inferential and Causal Questions

Let's say instead that we performed a regression analysis to tease out the association between gender and math phobia. We report the coefficient of gender in the regression model, which expresses the strength of the association, along with a p-value, which expresses how likely it is that we would have observed an association at least this strong if there really were none (given some assumptions). However, we don't claim that the association is a causal effect, that is, that somehow changing a child's gender would cause a change in their math phobia. In fact, it's hard to interpret precisely what that would entail. All we're saying is that there does seem to be some relationship between gender and math phobia, and that we would be likely to find a similar result if we gathered more data. Since we have quantified our claim and shown that it is statistically robust, this is no longer an exploratory analysis, but an *inferential* analysis. We have produced rigorous evidence of some relationship.
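Here is a rough sketch of what such an inferential analysis might look like, using only the Python standard library. The data are invented, and the p-value comes from a permutation test rather than from the t-test a regression package would typically report. That is a substitution made here to keep the example self-contained. With a single 0/1 predictor, the least-squares slope is simply the difference in math-phobia rates between the two groups.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |slope| is at least the observed |slope|."""
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(ols_slope(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data (assumption): 10 boys followed by 10 girls, 1 = has math phobia.
is_boy = [1] * 10 + [0] * 10
has_math_phobia = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0] + [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

slope = ols_slope(is_boy, has_math_phobia)
p_value = permutation_p_value(is_boy, has_math_phobia)
print(f"coefficient of gender = {slope:.2f}, permutation p-value = {p_value:.3f}")
```

Nothing in this code says anything about causation. It only measures how strong the association is and how surprising it would be if gender and math phobia were unrelated.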

We're talking about relationships between variables here, but it's also possible to ask single-variable questions. For instance, we could ask: is the proportion of students with math phobia less than 1/2? If we took a classroom of students and just reported how many of them have math phobia, that would be a descriptive analysis. If we also performed a one-sample test and reported a p-value, which quantifies how likely it is that random chance alone could have produced a result like ours in the classroom we gathered data from, then our analysis would be inferential.
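One possible one-sample test for this question is an exact binomial test, sketched below. The classroom numbers (10 of 30 students) are hypothetical, and the helper name `binomial_lower_tail_p_value` is my own. The p-value is the probability of seeing this few or fewer students with math phobia if the true proportion were exactly 1/2.

```python
from math import comb

def binomial_lower_tail_p_value(successes, n, null_proportion=0.5):
    """P(X <= successes) when X ~ Binomial(n, null_proportion)."""
    return sum(
        comb(n, k) * null_proportion**k * (1 - null_proportion) ** (n - k)
        for k in range(successes + 1)
    )

# Hypothetical classroom (assumption): 10 of 30 students report math phobia.
n_students = 30
n_phobic = 10

p_value = binomial_lower_tail_p_value(n_phobic, n_students)
print(f"observed proportion = {n_phobic / n_students:.2f}")
print(f"one-sided p-value   = {p_value:.4f}")
```

With these made-up numbers the p-value comes out just under 0.05. Reporting only "10 of 30" would be descriptive. Adding the test is what makes the analysis inferential.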

Let's say that we have data from two classrooms that teach math using different curricula, one of which is called "cool math" and one of which is called "boring math". We do the same regression analysis as before, but this time look at the relationship between the curricula and math phobia. As before, we produce an estimate of the relationship and a p-value. Let's say we find that the kids in the "cool math" classroom have less math phobia. If, in addition, we claim that the kids taught boring math would have less math phobia if they had been taught cool math instead, then our analysis is *causal*. 

If it's obvious to you that the effect we observe is causal (i.e. cool math decreases math phobia), think again. What if the cool math classroom is in a wealthy neighborhood and the boring math classroom is in a poor neighborhood? Then the difference in math phobia could be due to the neighborhoods rather than the curricula.
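To see how this could happen, here is a small simulation sketch. Every number in it is an assumption chosen for illustration: neighborhood wealth influences both which curriculum a child gets and how likely they are to have math phobia, while the curriculum itself has no effect at all.

```python
import random

def simulate_children(n_children=20000, seed=1):
    """Simulate children whose neighborhood wealth drives both curriculum and math phobia."""
    rng = random.Random(seed)
    children = []
    for _ in range(n_children):
        wealthy = rng.random() < 0.5
        # Assumption: wealthy neighborhoods are more likely to teach cool math.
        cool_math = rng.random() < (0.8 if wealthy else 0.2)
        # Assumption: phobia depends on wealth only; the curriculum plays no role.
        phobia = rng.random() < (0.2 if wealthy else 0.5)
        children.append((wealthy, cool_math, phobia))
    return children

def phobia_gap(children):
    """Phobia rate under cool math minus phobia rate under boring math."""
    cool = [phobia for _, cool_math, phobia in children if cool_math]
    boring = [phobia for _, cool_math, phobia in children if not cool_math]
    return sum(cool) / len(cool) - sum(boring) / len(boring)

children = simulate_children()
naive_gap = phobia_gap(children)
gap_wealthy = phobia_gap([c for c in children if c[0]])
gap_poor = phobia_gap([c for c in children if not c[0]])

print(f"gap ignoring wealth:       {naive_gap:+.3f}")
print(f"gap among wealthy children: {gap_wealthy:+.3f}")
print(f"gap among poor children:    {gap_poor:+.3f}")
```

Ignoring wealth, cool math appears to reduce math phobia substantially. Within each neighborhood type the gap is close to zero, which is the truth in this simulated world. The same regression output would be consistent with either story, so the data alone cannot tell us which one we are in.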

Notice that the method we used for both the inferential and causal questions was the same (regression). What's different is that in the causal setting we are interpreting our result to mean that in the future we could affect one variable by changing another. However, inferential analyses are often interpreted causally, even if they only mean to show that there exists some relationship, causal or otherwise. This is natural because ultimately the goal of science is not just to understand, but to be able to manipulate our environment to make it better. In that sense, causal analyses are almost always what we're really after. Establishing an association is only useful insofar as it can generate causal hypotheses.