## PS3 Discussion: Randomization and Experiments

In this notebook, we will analyze some data from Arman Catterson's Psych 101 course, where they went over the psychological phenomenon of "anchoring." According to Wikipedia: "Anchoring is a cognitive bias whereby an individual's decisions are influenced by a particular reference point or anchor."

In this case, students guessed the maximum speed of a house cat in miles per hour. In one condition, the "low" condition, students were prompted with the message

"Is the maximum speed of an average house cat FASTER or SLOWER than **3** miles per hour?"

In the other condition, the "high" condition, students were prompted with

"Is the maximum speed of an average house cat FASTER or SLOWER than **40** miles per hour?"

The student then had to provide an estimate for the speed of a house cat. We want to learn: did the anchoring have an effect on students' estimates? Let's assume that students were randomly assigned to each group.

-------

**Data Dictionary/Codebook**

`cat`: estimated speed in mph

`before`: 1 = student knew the exercise beforehand, 0 = student did not know

`live.cats`: 1 = student lives with cats, 0 = student does not live with cats

`cond`: 1 = student was exposed to the "high" condition (40 mph), 0 = student was exposed to the "low" condition (3 mph)

In [1]:
# Run this cell to read in the data.
anchor <- read.csv("../data/anchoring_data_FA20.csv")
head(anchor)

cat,before,live.cats,cond
2,0,0,0
1,0,0,1
30,0,0,1
25,0,0,1
17,0,0,0
50,0,0,1


## Quick Check 1:

What are the potential outcomes for this experiment? Talk with your neighbors for a few minutes about this.

(**Note:** We do NOT necessarily have a control group in this study. In a perfect study, we would ALSO have asked a third group of people to answer the question "What is the maximum speed of an average house cat?" without an anchor, but we don't have that data in this case.)

**Once you have an idea, go to Ed, move to Quick Check 1 and type in the potential outcomes for the "high" treatment in this experiment.**

*Type any notes here.*

Great!

Now, before we begin, let's "clean" the dataset. In this case, we want to remove some potentially problematic observations: we do NOT want to study people who (1) knew about the exercise beforehand or (2) live with cats, since they'll be more knowledgeable and give different answers.

## Quick Check 2:

Subset the data so we only get rows in which (1) the participant does NOT live with cats, and (2) the participant was NOT aware of the exercise beforehand. Save this to the variable `focus_anchor`.

Use only one line of code to do this. Hint: use the `&` (and) or `|` (or) operators when you do the comparison.

**Copy paste the code cell below into Quick Check 2 on Ed once you think you have the answer.**

In [3]:
focus_anchor <- subset(anchor, live.cats == 0 & before == 0)
head(focus_anchor)

cat,before,live.cats,cond
2,0,0,0
1,0,0,1
30,0,0,1
25,0,0,1
17,0,0,0
50,0,0,1
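
As a cross-check on the filter above, the same rows can be selected with logical indexing instead of `subset()`. The sketch below is only an illustration: `toy` is a small made-up data frame that borrows the column names from the codebook, not the course data.

```r
# Toy data with the same columns as `anchor` (values are made up for illustration).
toy <- data.frame(
  cat       = c(2, 1, 30, 25, 17, 50),
  before    = c(0, 1, 0, 0, 0, 0),
  live.cats = c(0, 0, 1, 0, 0, 0),
  cond      = c(0, 1, 1, 1, 0, 1)
)

# Approach 1: subset(), as in the cell above.
toy_subset <- subset(toy, live.cats == 0 & before == 0)

# Approach 2: logical indexing with square brackets.
keep <- toy$live.cats == 0 & toy$before == 0
toy_bracket <- toy[keep, ]

# Both approaches should keep the same rows.
isTRUE(all.equal(toy_subset, toy_bracket))
nrow(toy_subset)
```

Using `&` keeps a row only when both conditions hold; swapping in `|` would keep rows where either condition holds, which is not what we want here.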


Now, let's compare the results!

In [5]:
# What am I doing in this code below? Explain to your neighbor. 

high_focus <- subset(focus_anchor, cond == 1)
low_focus <- subset(focus_anchor, cond == 0)

mean(high_focus$cat) # What does this number mean?
mean(low_focus$cat) # and this one?
mean(low_focus$cat) - mean(high_focus$cat) # How about this? 
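
The same group comparison can be sketched without creating two separate data frames. This is one possible approach using `tapply()` on a made-up data frame (`toy` and its values are hypothetical, not the course data). Note that the sign of the difference depends on the order of subtraction: the cell above computes low minus high, while this sketch computes high minus low.

```r
# Made-up estimates for illustration only (not the course data).
toy <- data.frame(
  cat  = c(10, 12, 8, 30, 35, 25),
  cond = c(0, 0, 0, 1, 1, 1)
)

# Mean estimate within each condition; names are "0" (low) and "1" (high).
group_means <- tapply(toy$cat, toy$cond, mean)
group_means

# Difference in means, written as high minus low so that a positive value
# means the high anchor went along with higher estimates.
diff_high_low <- unname(group_means["1"] - group_means["0"])
diff_high_low
```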

## Quick Demo: why randomization works (ignore the code, follow along)

As we saw above, the average estimates differ noticeably between students who were exposed to the "high" and "low" questions.

In experimental design, we usually want to **randomize** (i.e. randomly assign our participants to the treatment and control groups). This makes the groups comparable on average, so that other factors, like selection bias or omitted variables, are unlikely to drive the differences we see.

Let's see how this works with a more general example, and then let's do it with the data we used earlier.


In [9]:
# Let's say we have a study of 50 medical patients, 25 of whom are male and 25 of whom are female.
sex_data <- data.frame("Sex" = c(rep("F", 25), rep("M", 25)))

In [10]:
# If I just assigned them to treatment (1-25) and control (26-50) groups in order,
# this is what our groups would look like:
sex_data$Group <- c(rep("Treatment", 25), rep("Control", 25))
table(sex_data$Sex, sex_data$Group)

   
    Control Treatment
  F       0        25
  M      25         0

In [15]:
# That's a problem! Instead, let's randomly assign the treatment and control.
sex_data$random.group <- sample(sex_data$Group, replace = FALSE)
table(sex_data$Sex, sex_data$random.group)

   
    Control Treatment
  F      12        13
  M      13        12

That's a much better experimental design. **But why?** Discuss with your neighbor.
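
One way to explore the question is to repeat the random assignment many times and see how the groups turn out. The sketch below rebuilds the 50-patient example from scratch; the seed and the number of repetitions (`n_sims`) are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

sex <- c(rep("F", 25), rep("M", 25))
group <- c(rep("Treatment", 25), rep("Control", 25))

# Repeat the random assignment many times and record how many
# female patients land in the treatment group each time.
n_sims <- 1000  # arbitrary number of repetitions
f_in_treatment <- replicate(n_sims, {
  shuffled <- sample(group, replace = FALSE)
  sum(sex == "F" & shuffled == "Treatment")
})

# On average the count should be close to 12.5 (half of 25),
# though any single assignment can be off by a few.
mean(f_in_treatment)
range(f_in_treatment)
```

No single random assignment is guaranteed to be perfectly balanced, but the imbalance is due to chance alone rather than to the order in which patients were listed.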

Now, let's see this mathematically. Again, we know that the difference between people who saw the "high" condition versus people who saw the "low" condition was ~17 mph. What if we randomly shuffled the condition labels?

In [34]:
random_anchor <- focus_anchor
random_anchor$rand.cond <- sample(random_anchor$cond, replace = FALSE)
head(random_anchor)

cat,before,live.cats,cond,rand.cond
2,0,0,0,1
1,0,0,1,1
30,0,0,1,1
25,0,0,1,0
17,0,0,0,1
50,0,0,1,0


In [35]:
# Now, we're going to do the same calculation as above
# but remember -- the labels are randomly shuffled
# and in theory, the groups should now be equal (statistically/randomly -- not 100% the same)

high_rand_focus <- subset(random_anchor, rand.cond == 1)
low_rand_focus <- subset(random_anchor, rand.cond == 0)

mean(high_rand_focus$cat) 
mean(low_rand_focus$cat) 
mean(low_rand_focus$cat) - mean(high_rand_focus$cat)  

If the two group means above are very different from each other, try re-running the two cells above and see what happens!
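
Re-running the cells by hand can also be automated. The sketch below shuffles the labels many times and collects the difference in means from each shuffle. It is only an illustration under stated assumptions: `toy` is simulated data standing in for `focus_anchor`, and `diff_in_means` is a hypothetical helper, not part of the course materials.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Simulated stand-in for focus_anchor (made-up numbers, not the course data):
# the "high" group's guesses are centered higher than the "low" group's.
toy <- data.frame(
  cat  = c(rnorm(30, mean = 12, sd = 5), rnorm(30, mean = 28, sd = 5)),
  cond = c(rep(0, 30), rep(1, 30))
)

# Hypothetical helper: difference in means (high minus low) for given labels.
diff_in_means <- function(outcome, label) {
  mean(outcome[label == 1]) - mean(outcome[label == 0])
}

observed <- diff_in_means(toy$cat, toy$cond)

# Shuffle the labels many times and recompute the difference each time.
shuffled_diffs <- replicate(2000, diff_in_means(toy$cat, sample(toy$cond)))

observed              # difference with the real labels
mean(shuffled_diffs)  # shuffled differences should center near 0
# Share of shuffles at least as extreme as the observed difference.
mean(abs(shuffled_diffs) >= abs(observed))
```

If the difference with the real labels sits far outside the range of the shuffled differences, chance alone is an unlikely explanation for it.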