<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_07_InferentialStats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pydataset -q # Install required packages
from pydataset import data # Import required modules
import pandas as pd # More on this below

initiated datasets repo at: /root/.pydataset/


## Background to the Case Study

Imagine you're learning to read better, and your teacher says, "Let's try something new today." She breaks the class into smaller groups and gives each a different task. You're curious: Will this new method actually help you understand what you're reading? This is the crux of the Baumann experiment. It sought to find out whether particular teaching methods could improve how well fourth-grade students understood their reading material.

Sixty-six fourth-grade students were randomly assigned to one of three experimental groups: (a) a Think-Aloud (**TA**, labeled **Strat** in the dataset) group, in which students were taught various comprehension monitoring strategies for reading stories (e.g., self-questioning, prediction, retelling, rereading) through the medium of thinking aloud; (b) a Directed Reading-Thinking Activity (**DRTA**) group, in which students were taught a predict-verify strategy for reading and responding to stories; or (c) a Directed Reading Activity (**DRA**, labeled **Basal** in the dataset) group, an instructed control, in which students engaged in a noninteractive, guided reading of stories.

This is what we call a controlled experiment, a cornerstone of scientific research. In a controlled experiment, one or more groups receive a special treatment (TA and DRTA in this case), while a control group does not (DRA here). This setup allows researchers to compare results and draw conclusions about the effectiveness of the methods being tested.

So, why should you care? Well, the results showed that students in the Strat and DRTA groups were better at understanding their reading than those in the DRA/Basal group. They were more skilled at monitoring their comprehension, as shown by tests and questionnaires. Interestingly, Strat students were particularly good at being aware of their own understanding, while DRTA students were sometimes even better at spotting errors. This is crucial because it shows that teaching methods can significantly affect how well students understand what they read, a vital skill in almost every area of life.

In essence, the Baumann experiment shows us that the way we're taught can make a difference in how well we understand information. That's not just useful for teachers wanting to improve their methods; it's valuable knowledge for anyone who cares about learning, at school or beyond.

### Loading the Baumann Data
Let's start by loading the Baumann data and looking at its first five rows (the head).

In [21]:
read_df = data('Baumann') # Load the Baumann dataset into a pandas DataFrame
read_df.head()

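|   | group | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|---|-------|-----------|-----------|-------------|-------------|-------------|
| 1 | Basal | 4 | 3 | 5 | 4 | 41 |
| 2 | Basal | 6 | 5 | 9 | 5 | 41 |
| 3 | Basal | 9 | 4 | 5 | 3 | 43 |
| 4 | Basal | 12 | 6 | 8 | 5 | 46 |
| 5 | Basal | 16 | 5 | 10 | 9 | 46 |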


These first rows all belong to the "Basal" (control) group. Now, let's look at a few rows from the middle of the data.

In [25]:
read_df.iloc[21:26] # Positional slice: the 22nd through 26th rows

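|    | group | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|----|-------|-----------|-----------|-------------|-------------|-------------|
| 22 | Basal | 9 | 6 | 7 | 8 | 32 |
| 23 | DRTA | 7 | 2 | 7 | 6 | 31 |
| 24 | DRTA | 7 | 6 | 5 | 6 | 40 |
| 25 | DRTA | 12 | 4 | 13 | 3 | 48 |
| 26 | DRTA | 10 | 1 | 5 | 7 | 30 |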


Here the data switches from the last Basal student (row 22) to the first students in the DRTA group (rows 23 to 26). Finally, we can take a look at the tail of the data:

In [22]:
read_df.tail()

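|    | group | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|----|-------|-----------|-----------|-------------|-------------|-------------|
| 62 | Strat | 11 | 4 | 11 | 7 | 48 |
| 63 | Strat | 14 | 4 | 15 | 7 | 49 |
| 64 | Strat | 8 | 2 | 9 | 5 | 33 |
| 65 | Strat | 5 | 3 | 6 | 8 | 45 |
| 66 | Strat | 8 | 3 | 4 | 6 | 42 |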


These last rows belong to the "Strat" group. A closer look at the full data shows that there are exactly 22 students in each group. Now, let's get a summary of the data:

In [10]:
read_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66 entries, 1 to 66
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   group        66 non-null     object
 1   pretest.1    66 non-null     int64 
 2   pretest.2    66 non-null     int64 
 3   post.test.1  66 non-null     int64 
 4   post.test.2  66 non-null     int64 
 5   post.test.3  66 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 3.6+ KB
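
The earlier claim that each group contains exactly 22 students can be checked by counting the labels in the `group` column with `value_counts()`. The block below is only a minimal sketch: `group_sizes` is a hypothetical helper name, and `stand_in` is a hand-built frame that mimics just the `group` column (22 labels per group, as stated above) so that the example runs without installing `pydataset`. In the notebook itself you would pass `read_df` instead.

```python
import pandas as pd

def group_sizes(df: pd.DataFrame, column: str = "group") -> dict:
    """Return a {label: row count} mapping for the given column."""
    return df[column].value_counts().to_dict()

# Stand-in for read_df (assumption: same group labels and sizes as the text states).
stand_in = pd.DataFrame(
    {"group": ["Basal"] * 22 + ["DRTA"] * 22 + ["Strat"] * 22}
)

sizes = group_sizes(stand_in)
print(sizes)
```

Calling `group_sizes(read_df)` on the real data should report one entry per experimental group, and the counts should add up to the 66 entries that `info()` lists.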


In [26]:
read_df.describe()

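|       | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|-------|-----------|-----------|-------------|-------------|-------------|
| count | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 |
| mean | 9.787879 | 5.106061 | 8.075758 | 6.712121 | 44.015152 |
| std | 3.02052 | 2.212752 | 3.393707 | 2.635644 | 6.643661 |
| min | 4.0 | 1.0 | 1.0 | 0.0 | 30.0 |
| 25% | 8.0 | 3.25 | 5.0 | 5.0 | 40.0 |
| 50% | 9.0 | 5.0 | 8.0 | 6.0 | 45.0 |
| 75% | 12.0 | 6.0 | 11.0 | 8.0 | 49.0 |
| max | 16.0 | 13.0 | 15.0 | 13.0 | 57.0 |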
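
`describe()` pools all 66 students, whereas the experiment is about differences between groups. One way to get a per-group view is `groupby("group")` followed by `mean()`. The sketch below rebuilds only the 15 rows printed earlier in this section (head, middle slice, and tail) so that it runs on its own. Its numbers are therefore NOT the dataset's real group means, and `sample_df` and `group_means` are illustrative names rather than part of the original notebook.

```python
import pandas as pd

# The 15 rows shown earlier in this section; NOT the full 66-row dataset.
rows = [
    (1, "Basal", 4, 3, 5, 4, 41),
    (2, "Basal", 6, 5, 9, 5, 41),
    (3, "Basal", 9, 4, 5, 3, 43),
    (4, "Basal", 12, 6, 8, 5, 46),
    (5, "Basal", 16, 5, 10, 9, 46),
    (22, "Basal", 9, 6, 7, 8, 32),
    (23, "DRTA", 7, 2, 7, 6, 31),
    (24, "DRTA", 7, 6, 5, 6, 40),
    (25, "DRTA", 12, 4, 13, 3, 48),
    (26, "DRTA", 10, 1, 5, 7, 30),
    (62, "Strat", 11, 4, 11, 7, 48),
    (63, "Strat", 14, 4, 15, 7, 49),
    (64, "Strat", 8, 2, 9, 5, 33),
    (65, "Strat", 5, 3, 6, 8, 45),
    (66, "Strat", 8, 3, 4, 6, 42),
]
columns = ["row", "group", "pretest.1", "pretest.2",
           "post.test.1", "post.test.2", "post.test.3"]
sample_df = pd.DataFrame(rows, columns=columns).set_index("row")

# Mean of every score column within each group (groups are sorted by label).
group_means = sample_df.groupby("group").mean()
print(group_means.round(2))
```

In the notebook, `read_df.groupby("group").mean()` applies the same idea to all 66 students.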
