<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_07_InferentialStats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inferential Statistics
### Brendan Shea, PhD (Brendan.Shea@rctc.edu)

Imagine you're learning to read better, and your teacher says, "Let's try something new today." She breaks the class into smaller groups and gives each a different task. You're curious: Will this new method actually help you understand what you're reading? This is the crux of the Baumann experiment. It sought to find out whether particular teaching methods could improve how well fourth-grade students understood their reading material.

Sixty-six fourth-grade students were randomly assigned to one of three experimental groups: (a) a Think-Aloud (**TA, or "Strategy"**) group, in which students were taught various comprehension monitoring strategies for reading stories (e.g., self-questioning, prediction, retelling, rereading) through the medium of thinking aloud; (b) a Directed Reading-Thinking Activity (**DRTA**) group, in which students were taught a predict-verify strategy for reading and responding to stories; or (c) a Directed Reading Activity (**DRA or Basal**) group, an instructed control, in which students engaged in a noninteractive, guided reading of stories.
This is what we call a controlled experiment, a cornerstone of scientific research. In a controlled experiment, you have one or more groups who receive a special **treatment** (TA and DRTA in this case), and a **control** group that doesn't (the DRA, or "Basal," group). This setup allows researchers to compare results and draw conclusions about the effectiveness of the methods being tested.
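To make "randomly assigned" concrete, here is a minimal sketch of one way to split 66 students into three equal groups. The function name `assign_groups` and the student IDs 1 to 66 are assumptions made for illustration; this is not a record of the procedure the original researchers used.

```python
import random
from collections import Counter

def assign_groups(student_ids, groups, seed=None):
    """Shuffle the students, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    # Dealing round-robin keeps the group sizes equal (or within one of each other).
    return {sid: groups[i % len(groups)] for i, sid in enumerate(shuffled)}

# Hypothetical student IDs 1..66 and the three group labels used in this chapter.
assignment = assign_groups(range(1, 67), ["Basal", "DRTA", "Strat"], seed=42)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Because chance alone decides who lands in which group, no student characteristic (motivation, prior skill, and so on) can systematically favor one teaching method.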

So, why should you care? Well, the results showed that students in the Strat and DRTA groups were better at understanding their reading than those in the DRA/Basal group. They were more skilled at monitoring their comprehension, as shown by tests and questionnaires. Interestingly, Strat students were particularly good at being aware of their own understanding, while DRTA students were sometimes even better at spotting errors. This is crucial because it shows that teaching methods can significantly affect how well students understand what they read, a vital skill in almost every area of life.

In this chapter, we'll be using the Baumann data to introduce basic concepts of **inferential statistics**, which involves making (inductive) inferences about the world *beyond* our data. (For example, this study is meant to show something about how reading instruction should work for all students, not just the ones in this study!)

## Brendan's Lecture
Run the following cell to launch the lecture for this chapter.

In [None]:
##Click here to launch my lecture
from IPython.display import YouTubeVideo
YouTubeVideo('vcj6TLEKrBI', width=800, height=500)

### Loading the Baumann Data
Let's get started by loading the Baumann data, and take a look at the head.

In [None]:
!pip install pydataset -q # Install required packages
from pydataset import data # Import required modules
import pandas as pd # More on this below

In [None]:
read_df = data('Baumann') # Load the baumann dataset
read_df.head()

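| | group | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|---|---|---|---|---|---|---|
| 1 | Basal | 4 | 3 | 5 | 4 | 41 |
| 2 | Basal | 6 | 5 | 9 | 5 | 41 |
| 3 | Basal | 9 | 4 | 5 | 3 | 43 |
| 4 | Basal | 12 | 6 | 8 | 5 | 46 |
| 5 | Basal | 16 | 5 | 10 | 9 | 46 |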


It looks like this contains students in the "Basal" (control) group. Now, let's look at the middle of the data.

In [None]:
read_df[21:26]

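| | group | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|---|---|---|---|---|---|---|
| 22 | Basal | 9 | 6 | 7 | 8 | 32 |
| 23 | DRTA | 7 | 2 | 7 | 6 | 31 |
| 24 | DRTA | 7 | 6 | 5 | 6 | 40 |
| 25 | DRTA | 12 | 4 | 13 | 3 | 48 |
| 26 | DRTA | 10 | 1 | 5 | 7 | 30 |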


Here we see the last student in the Basal group (row 22), followed by the first students in the DRTA group. Finally, we can take a look at the tail of the data:

In [None]:
read_df.tail()

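| | group | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|---|---|---|---|---|---|---|
| 62 | Strat | 11 | 4 | 11 | 7 | 48 |
| 63 | Strat | 14 | 4 | 15 | 7 | 49 |
| 64 | Strat | 8 | 2 | 9 | 5 | 33 |
| 65 | Strat | 5 | 3 | 6 | 8 | 45 |
| 66 | Strat | 8 | 3 | 4 | 6 | 42 |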


This appears to contain students in the "Strat" group. If we look closer, we'll find that there are exactly 22 students in each group. Now, let's get a summary of the data:

In [None]:
read_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66 entries, 1 to 66
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   group        66 non-null     object
 1   pretest.1    66 non-null     int64 
 2   pretest.2    66 non-null     int64 
 3   post.test.1  66 non-null     int64 
 4   post.test.2  66 non-null     int64 
 5   post.test.3  66 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 3.6+ KB


## A Brief Review of Descriptive Statistics
Before moving on to material about inferential statistics, let's briefly review what we learned earlier about descriptive statistics. We can retrieve many of these statistics as follows:

In [None]:
read_df.describe()

| | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|---|---|---|---|---|---|
| count | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 |
| mean | 9.787879 | 5.106061 | 8.075758 | 6.712121 | 44.015152 |
| std | 3.02052 | 2.212752 | 3.393707 | 2.635644 | 6.643661 |
| min | 4.0 | 1.0 | 1.0 | 0.0 | 30.0 |
| 25% | 8.0 | 3.25 | 5.0 | 5.0 | 40.0 |
| 50% | 9.0 | 5.0 | 8.0 | 6.0 | 45.0 |
| 75% | 12.0 | 6.0 | 11.0 | 8.0 | 49.0 |
| max | 16.0 | 13.0 | 15.0 | 13.0 | 57.0 |


As you can see, descriptive statistics give us a snapshot of our data, allowing us to understand its main features without making broader conclusions. These statistics can be like a magnifying glass, bringing into focus the central tendencies, dispersion, and shape of the data distribution. Let's go through some of the key terms based on the `read_df.describe()` output.

1. The **count** tells us the number of data points in each column. Here, every column has a count of 66 because there are 66 students in total (22 in each of the three groups), and each student has a score on every pretest and post-test. The count is an essential starting point because it lets us know the size of our data set.
2. The **mean** is the average of all the scores. For example, the average score for `pretest.1` is approximately 9.79. This gives us a general idea of how the group performed but doesn't tell us much about individual performance or variability within the group.
3. The **standard deviation** measures how spread out the scores are from the mean. A small standard deviation means the scores are closely clustered around the mean, while a large one indicates a wider range of scores. For example, `post.test.3` has a standard deviation of approximately 6.64, meaning a typical score on post-test 3 falls roughly 6 to 7 points away from the mean of about 44.
4. The **minimum** and **maximum** values tell us the range of the scores. In `pretest.1`, the scores ranged from a minimum of 4 to a maximum of 16. Knowing the range can help us understand the breadth of performance among the students.
5. **Quartiles** divide the data into four equal parts. The 25% (first quartile), 50% (**median**), and 75% (third quartile) give us a sense of the data's distribution. For example, in `post.test.2`, 25% of students scored 5 or below, 50% scored 6 or below (the median), and 75% scored 8 or below.
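To see that nothing mysterious is going on inside `describe()`, here is a small sketch that computes the same statistics step by step. It uses only the five `pretest.1` scores printed by `read_df.head()` above, so the numbers will not match the full 66-student summary; the variable names `scores` and `by_hand` are just for illustration.

```python
import pandas as pd

# The five pretest.1 scores shown in the head of the data (not the full dataset).
scores = pd.Series([4, 6, 9, 12, 16], name="pretest.1")

n = scores.count()
mean = scores.sum() / n
# Sample standard deviation: divide the squared deviations by n - 1, then take the root.
std = (((scores - mean) ** 2).sum() / (n - 1)) ** 0.5

by_hand = {
    "count": n,
    "mean": mean,
    "std": std,
    "min": scores.min(),
    "25%": scores.quantile(0.25),
    "50%": scores.quantile(0.50),  # the median
    "75%": scores.quantile(0.75),
    "max": scores.max(),
}

described = scores.describe()
for name, value in by_hand.items():
    print(f"{name:>5}: by hand = {value:.3f}, describe() = {described[name]:.3f}")
```

Note that pandas divides by `n - 1` rather than `n` when computing the standard deviation, which is the usual choice when the data are a sample from a larger population.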

By exploring these key terms, we can better understand the landscape of our data. And remember, while descriptive statistics are insightful, they're just the tip of the iceberg. They set the stage for inferential statistics, where we'll use this data to make more general conclusions.

## Samples and Populations
Inferential statistics allows us to "learn" things about the wider world (outside of our study) from our data. As a first step to understanding how this works, we should distinguish between a **sample**, which is a subset of individuals from a larger group, and a **population**, the entire group we're interested in. In the Baumann experiment, are the 66 fourth-grade students a sample or a population?

In the realm of statistics, the terms sample and population carry weighty significance. A sample is like a snapshot---a smaller group pulled from a broader context. On the other hand, a population is the entire movie reel, the full context from which the snapshot is taken.

In the Baumann experiment, the 66 fourth-grade students would most likely be considered a sample. Why? Because they stand in for a much larger group---say, all fourth-grade students in a district, state, or even the country. Researchers often use samples because studying an entire population would be impractical or impossible due to time, resources, and logistical constraints.

The purpose of using a sample is to make inferences about the larger population. In the Baumann experiment, we're not just interested in whether these specific 66 students improve their reading comprehension with different teaching methods. Instead, the ultimate goal is to generalize these findings to a broader population of fourth-grade students. We want to answer a bigger question: Can different teaching methods improve reading comprehension for all fourth-graders, or at least a significant subset of them?
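The relationship between a sample and a population can be sketched with a quick simulation. Everything here is made up for illustration: the "population" is 10,000 randomly generated scores (the mean of 45 and spread of 7 are arbitrary assumptions), and we draw a sample of 66 from it.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 10,000 simulated test scores (not real student data).
population = [rng.gauss(45, 7) for _ in range(10_000)]

# Draw a sample of 66 "students" without replacement.
sample = rng.sample(population, 66)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

The sample mean will usually land close to, but not exactly on, the population mean. Inferential statistics is largely about quantifying how far off a sample is likely to be.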


### Sampling Methods
**Sampling methods** are the various techniques used to select a subset of individuals from a larger group for study. The choice of sampling method is crucial because it affects how well the sample represents the population, which in turn influences the validity of the study's conclusions. One key concern here is **bias**, a tendency to systematically favor certain outcomes over others. Bias can creep in through poorly designed surveys, non-representative samples, or even subtle wording in questions. It's a bit like taking a photograph at a strange angle; what you capture won't accurately reflect the whole scene. Common sampling methods include:

1.  **Simple Random Sampling.** Every member of the population has an equal chance of being selected. It's the statistical equivalent of drawing names out of a hat. This method is excellent for reducing bias because it doesn't favor any group.

2.  **Stratified Random Sampling.** The population is divided into smaller groups based on a particular characteristic, like age or income. Then, a random sample is drawn from each group. This ensures that the sample represents all the strata in the population.

3.  **Cluster Sampling.** The population is divided into clusters, often geographically, and a random sample of clusters is chosen. Then, all members, or a random sample of members from those clusters, are surveyed. This method is often used when the population is spread out over a large area.

4.  **Systematic Sampling.** Every nth member of the population is selected, starting from a random point. For example, you might survey every 10th person on a customer list.

5.  **Convenience Sampling.** The sample consists of easily accessible members of the population. This method is the least rigorous and most prone to bias.

In all of these methods, the aim is to select a sample that is as similar as possible to the population in all respects that are relevant to the study. When bias is reduced, the results of the study are more generalizable to the larger population. For example, if you're studying voter behavior and only sample from one neighborhood, you may miss broader trends affecting other areas.
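The five methods above can each be sketched in a line or two of pandas. The population below is entirely hypothetical (1,000 made-up students spread across four made-up schools), and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: 1,000 students, each attending one of four schools.
population = pd.DataFrame({
    "student_id": np.arange(1, 1001),
    "school": rng.choice(["North", "South", "East", "West"], size=1000),
})

# 1. Simple random sampling: every student has the same chance of selection.
simple = population.sample(n=100, random_state=0)

# 2. Stratified random sampling: draw 10% at random from within each school.
stratified = population.groupby("school").sample(frac=0.10, random_state=0)

# 3. Cluster sampling: pick two schools at random and keep all of their students.
chosen_schools = rng.choice(population["school"].unique(), size=2, replace=False)
cluster = population[population["school"].isin(chosen_schools)]

# 4. Systematic sampling: every 10th student, starting from a random position.
start = int(rng.integers(0, 10))
systematic = population.iloc[start::10]

# 5. Convenience sampling: whoever is easiest to reach (here, the first 100 rows).
convenience = population.head(100)

for name, sample in [("simple", simple), ("stratified", stratified),
                     ("cluster", cluster), ("systematic", systematic),
                     ("convenience", convenience)]:
    print(f"{name:>12}: {len(sample)} students from {sample['school'].nunique()} schools")
```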

In the context of the Baumann experiment, simple random sampling would be an appropriate choice for several reasons. First, it minimizes bias by giving every fourth-grade student in the population an equal chance of being part of the experiment. This matters for the integrity of the study, because it makes the sample more likely to be representative of the population. Second, simple random sampling is straightforward to understand and implement, making it a practical choice for many types of research. Lastly, because the Baumann experiment aims to make general claims about teaching methods and reading comprehension for all fourth-graders, a sample that is as unbiased as possible is crucial. Note that the study description above mentions a related but distinct step: the 66 students were *randomly assigned* to the three groups. Random sampling governs who gets into a study, while random assignment governs which treatment each participant receives.


### Control Groups
The concept of a **control group** (or "basal" group) is central to scientific research, acting as a sort of yardstick against which other experimental changes are measured. In an experiment, you have groups that receive some sort of treatment---these are your experimental groups. The control group, however, doesn't receive this special treatment or gets a neutral, standard one. It's like running a race where some runners get a high-tech pair of shoes designed to boost speed, while one group wears ordinary sneakers. That group in the regular footwear? That's your control.

In the Baumann experiment, the Basal group serves as this control. They engage in a "non-interactive, guided reading of stories," which we can consider the standard or traditional method of teaching reading comprehension. This control group is essential for several reasons:

1. The control group provides a **baseline level** of performance against which the effects of the different teaching methods (Think-Aloud and Directed Reading-Thinking Activity) can be compared. It's the "default setting" of the experiment.

2. By comparing the control group with the experimental groups, we can better **isolate** the effects of the specific teaching methods under scrutiny. If the experimental groups show significant improvement over the control group, we have strong evidence that the teaching methods are effective.

3.  In any experiment, various factors could potentially influence the outcome. The control group helps to mitigate the effects of these **confounding variables.** If both the control and experimental groups are subjected to the same conditions apart from the variable being tested, we can be more confident that any differences in outcomes are due to the variable itself.

4. Sometimes, it's not ethical to deprive a group of a standard treatment, especially in medical studies. In educational settings like the Baumann experiment, however, using a control group that receives the standard teaching method is generally considered ethical and helps validate the results. (For example, it would NOT be ethical to simply have one group of fourth graders receive NO reading instruction!)

In summary, the Basal control group in the Baumann experiment acts as a critical anchor, grounding the study and enabling researchers to measure the effectiveness of the new teaching methods with greater confidence. Without this control group, distinguishing the impact of the teaching methods from other factors would be much more challenging.

### What is a "Cause", Anyway?

When we speak of causation, we're often grappling with the deep-seated human desire to understand "why" and "how" things happen. We want to know the roots of effects so we can predict them, control them, and even create them. In this pursuit, the philosopher James Woodward provides a valuable lens through which to view the concept.

Woodward's approach to causation is couched in the language of intervention. For Woodward, saying "A causes B" is not merely an observational statement. It's an actionable one. To **cause** means that if we were to intervene to change A, while keeping everything else the same, then B would change in a specific and predictable manner. This idea is powerful because it maps neatly onto the framework of scientific experiments, where researchers intentionally manipulate one variable to see its impact on another.

Let's apply this to our Baumann reading study. The teaching methods---whether Think-Aloud, Directed Reading-Thinking Activity, or the standard Basal method---act as our independent variable, or our "A" in Woodward's terms. The reading comprehension levels, gauged through tests and questionnaires, are the dependent variable, or "B".

If Woodward's conception of causation holds true, then deliberately altering the teaching method should lead to discernible changes in reading comprehension. But Woodward's idea carries a caveat: we must hold all else constant. This is where the controlled experiment shines. By carefully controlling other variables, such as the time of day when teaching occurs or the age group of the students, the researchers aim to isolate the effect of the teaching method on reading comprehension.

## Independent and Dependent Variables
When we attempt to infer causes, we need to carefully distinguish between the independent variable and the dependent variable. The **independent variable** is the one you, the researcher, manipulate or change. Think of it as the cause in a cause-and-effect relationship. On the other hand, the **dependent variable** is what you measure in the experiment. It's the effect that occurs due to the changes you made to the independent variable.

In the context of the Baumann experiment, let's identify these key players:

1. The independent variable here is the type of teaching method used. Specifically, the experiment employs three teaching methods: Think-Aloud (TA), Directed Reading-Thinking Activity (DRTA), and Directed Reading Activity (DRA or Basal, the control group). These methods are deliberately chosen and manipulated by the researchers to see if they have any impact on students' reading comprehension.

2. The dependent variable is the students' reading comprehension levels, as measured by the various tests and questionnaires. These include the scores in `pretest.1`, `pretest.2`, `post.test.1`, `post.test.2`, and `post.test.3`. The dependent variable is what the researchers are most interested in: Does the teaching method (independent variable) affect reading comprehension (dependent variable)?

By manipulating the teaching method (independent variable), the researchers aim to observe any changes in the students' reading comprehension (dependent variable). If significant changes are found, and other variables are controlled for, the researchers can then conclude that the teaching method has a measurable impact on reading comprehension.
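In pandas terms, the independent variable is the column we group by, and the dependent variable is the column we summarize. The sketch below uses only the 15 rows printed earlier in this chapter (rows 1 to 5, 22 to 26, and 62 to 66), so these averages are NOT the study's results; the name `shown_rows` is just for illustration.

```python
import pandas as pd

# Only the 15 rows printed earlier in this chapter, not the full 66-row dataset.
shown_rows = pd.DataFrame({
    # Independent variable: which teaching method the student received.
    "group": ["Basal"] * 6 + ["DRTA"] * 4 + ["Strat"] * 5,
    # Dependent variable: the score on the first post-test.
    "post.test.1": [5, 9, 5, 8, 10, 7, 7, 5, 13, 5, 11, 15, 9, 6, 4],
})

# Summarize the dependent variable separately for each level of the independent variable.
group_means = shown_rows.groupby("group")["post.test.1"].mean()
print(group_means)
```

With the full dataset loaded, the same `groupby` pattern applied to `read_df` would give the per-group averages for all 66 students.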

### Examples: Independent and Dependent Variables
The basic concepts of independent and dependent variables apply across all areas of science. For example:

1.  Medical Drug Efficacy

    -   Experiment: Medical researchers regularly conduct experiments to determine whether new antiviral drugs are effective against influenza. They administer the new drug to one group of patients and a placebo to another.
    -   Independent Variable: Type of treatment (new drug or placebo)
    -   Dependent Variable: Reduction in flu symptoms after a week
2.  Climate Change Impact

    -   Experiment: Environmental scientists study the effect of carbon dioxide emissions on global warming. They use climate models to simulate different levels of emissions.
    -   Independent Variable: Amount of carbon dioxide released
    -   Dependent Variable: Average global temperature in the simulation
3.  Fuel Efficiency in Cars

    -   Experiment: Automotive engineers test various types of fuel to see which is most efficient. They run cars on different fuels and measure the distance traveled.
    -   Independent Variable: Type of fuel used
    -   Dependent Variable: Miles per gallon achieved
4.  Sleep and Memory Retention

    -   Experiment: Psychologists investigate the impact of sleep on memory. Participants are either deprived of sleep or allowed a full night's rest before taking a memory test.
    -   Independent Variable: Amount of sleep before the test
    -   Dependent Variable: Number of items correctly recalled
5.  Fertilizers and Crop Yield

    -   Experiment: Agricultural scientists apply different fertilizers to separate plots of land where the same crop is grown. At the end of the season, they measure the yield.
    -   Independent Variable: Type of fertilizer used
    -   Dependent Variable: Crop yield at harvest

In each case, researchers start with a hypothesis involving the causal relationship between two variables. They then design an experiment that allows them to intervene on the potential cause (the independent variable) and measure its effect (the dependent variable).

## Challenges in Causal Inference
While establishing causation is an important (and perhaps the *most important*) goal of science, it can be challenging. One of the most formidable obstacles is the existence of confounding variables (mentioned above). These are factors, other than the one we're interested in, that might also influence the outcome. For example, let's say we're keen to find out if a specific teaching method can improve reading comprehension. We must consider that other elements such as a student's home environment, their previous educational background, or even something as mundane as the time of day the class is held, could all influence comprehension levels. These confounding variables make it a tricky task to say definitively that it's the teaching method causing the change, and not something else.

A second challenge is the often-repeated mantra that *correlation doesn't imply causation*. Just because two variables move in tandem doesn't mean one is causing the other to happen. In the realm of reading, it might be observed that students who perform well in reading comprehension also have high attendance rates. However, high attendance doesn't necessarily cause better reading skills. It could be that a third factor, say, parental involvement, is influencing both attendance and reading ability.
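A short simulation can show how a third factor produces a correlation with no causal link behind it. All of the data here are simulated under an assumed model: parental involvement influences both attendance and reading scores, while attendance has no direct effect on reading at all. The helper `residuals` is a hypothetical name for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Assumed model: the confounder drives both variables; neither drives the other.
parental_involvement = rng.normal(size=n)
attendance = parental_involvement + rng.normal(size=n)
reading = parental_involvement + rng.normal(size=n)

raw_corr = np.corrcoef(attendance, reading)[0, 1]

def residuals(y, x):
    """Return what is left of y after removing its straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate attendance and reading after removing the part explained by the confounder.
adjusted_corr = np.corrcoef(
    residuals(attendance, parental_involvement),
    residuals(reading, parental_involvement),
)[0, 1]

print(f"Correlation between attendance and reading:      {raw_corr:.2f}")
print(f"Same correlation after adjusting for confounder: {adjusted_corr:.2f}")
```

Attendance and reading look clearly related until parental involvement is accounted for, at which point the relationship essentially disappears.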

Another hurdle lies in *ethical constraints*. There are situations where it might be morally or ethically questionable to manipulate a variable to observe its effects. In educational studies like the Baumann experiment, imagine the ethical dilemma if preliminary data strongly suggested that a new teaching method was remarkably effective. Would it then be ethical to withhold this method from a control group? Such questions complicate the researcher's ability to establish causation unequivocally.

Finally, we face the challenge of *complexity of systems*. Some systems are so intricate that it becomes a Herculean task to isolate individual factors. Reading comprehension is a case in point. It's a complex skill that's influenced by a myriad of factors including cognitive skills, emotional state, and classroom environment. Determining the impact of a single teaching method on such a multifaceted skill can be like finding a needle in a haystack.

In sum, while the quest for causation is integral to advancing knowledge and policy, it's fraught with complexities. Researchers must navigate a labyrinth of confounding variables, ethical considerations, and systemic complexities, all while avoiding the trap of mistaking correlation for causation. These challenges underscore the need for meticulous experimental design, especially in fields as vital and impactful as education.

## Review Questions

1.  What is a controlled experiment, and why is it important in scientific research? Think of a situation or subject you're interested in; how could you design a controlled experiment to study it?

2.  In your own words, explain what causation means and why it's critical to distinguish it from correlation. Can you think of a real-world example where mistaking correlation for causation could lead to incorrect conclusions?

3. Describe what independent and dependent variables are in an experiment. Choose a hobby or activity you enjoy, and identify potential independent and dependent variables if you were to study it scientifically.

4.  Summarize James Woodward's concept of causation that focuses on "intervention." How could this idea of causation be applied to improve something you care about?

5.  In the Baumann reading study, what were the independent and dependent variables? What were some of the challenges that faced researchers in trying to determine what methods of teaching reading "work"?


### My Answers

1.

2.

3.

4.

5.

### Introducing Hypothesis Testing: What is the Null Hypothesis?

When embarking on the scientific adventure of hypothesis testing, think of the null hypothesis as your starting line. It's the baseline assumption you begin with, a sort of "innocent until proven guilty" stance in the world of research. The **null hypothesis** asserts that there is no effect or relationship between the variables you're scrutinizing. In simpler terms, it posits that whatever you're trying to prove isn't actually happening.

In the context of the Baumann study, the null hypothesis would state that the different teaching methods---Think-Aloud, Directed Reading-Thinking Activity, and the standard Basal method---have no impact on reading comprehension levels in fourth-grade students. Essentially, it claims that if you were to measure how well these students understand what they read, you'd find no significant differences between those taught with different methods.

Why do we start with this assumption? It's a matter of scientific rigor. The null hypothesis serves as a testing ground, a challenge for our **alternative hypothesis**, which is the claim we are actually investigating. By starting with the assumption that our teaching methods don't have an impact, we set up a rigorous test. If the data collected ends up showing a significant difference in reading comprehension based on teaching method, then we can reject the null hypothesis in favor of the alternative one: that the teaching methods do, in fact, make a difference.

### How Do We Collect and Analyze Data?

The next crucial phase in hypothesis testing is data collection and analysis. If the null hypothesis is our starting point, data is the road we travel to reach our conclusion. In the Baumann study, data was collected on students' reading comprehension levels before and after exposure to different teaching methods---Think-Aloud, Directed Reading-Thinking Activity, and the standard Basal method.

Before the experiment, a pretest was conducted to gauge the students' initial reading comprehension levels. This provides a baseline for comparison. After implementing the teaching methods, a post-test was carried out to measure any changes in reading comprehension. The data collected includes scores from these tests, and possibly additional measures like questionnaires or interviews for a more rounded understanding.

Once the data is collected, it's then prepared for analysis. This may involve cleaning the data to remove any outliers or errors, and then using statistical software to perform the hypothesis test. The aim here is to compare the pretest and post-test scores to see if the teaching methods have led to statistically significant changes in reading comprehension.

### What is the P-Value and What Does It Tell Us?

Moving on, let's talk about the p-value, a term that often mystifies those new to statistics but is central to hypothesis testing. Simply put, the **p-value** tells us how likely it is that we would observe data at least as extreme as ours if the null hypothesis were true. A low p-value (often below 0.05) suggests that what we've observed is unlikely under the null hypothesis. This gives us reason to reject the null hypothesis in favor of the alternative: that the teaching methods do have an impact on reading comprehension.

In the context of the Baumann study, we would calculate the p-value to assess how likely it is that differences in reading comprehension this large would arise if only random chance, and not the teaching methods, were at work. If the p-value is low, it strengthens our confidence that the teaching methods are indeed effective.
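One way to build intuition for the p-value is a permutation test: if the null hypothesis were true, the group labels would be interchangeable, so we can shuffle them many times and count how often chance alone produces a difference at least as large as the one we observed. This is a sketch of that idea, not the method used later in this chapter (which relies on a T-test). The function name and the two small score lists are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels were handed out at random
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical scores for two small groups (made up for illustration).
control = [5, 9, 5, 8, 10, 7, 6, 8]
treatment = [11, 13, 9, 12, 10, 14, 11, 9]

p_value = permutation_p_value(control, treatment)
print(f"Estimated p-value: {p_value}")
```

The returned number is simply the fraction of shuffles that looked at least as extreme as the real data, which is exactly what a p-value is meant to capture.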

### How Do We Make a Conclusion?

After all this work, how do we finally draw a conclusion? Well, based on the p-value, we make a decision to either reject or fail to reject the null hypothesis. If the p-value is low (typically below 0.05), we **reject the null hypothesis** and conclude that our teaching methods likely had a significant impact on reading comprehension. If the p-value is high, we **fail to reject the null hypothesis**, meaning we don't have enough evidence to say the teaching methods made a difference.

It's important to note that "failing to reject" the null hypothesis isn't the same as proving it true. It simply means we don't have enough evidence against it. Likewise, rejecting the null hypothesis doesn't prove the alternative hypothesis; it just means it's more likely based on the data we have.
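A simulation helps show why rejecting the null hypothesis is not the same as proof. Below, both groups are drawn from the *same* simulated distribution, so the null hypothesis is true by construction, yet some experiments still produce a p-value below 0.05. The distribution settings (mean 45, standard deviation 7, 22 students per group) are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2_000

# Each row is one simulated experiment with 22 students per group.
# Both groups come from the same distribution, so any "effect" is pure chance.
group_a = rng.normal(loc=45, scale=7, size=(n_experiments, 22))
group_b = rng.normal(loc=45, scale=7, size=(n_experiments, 22))

# Run one independent-samples T-test per row.
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

false_positive_rate = np.mean(p_values < alpha)
print(f"Share of experiments that rejected a true null: {false_positive_rate:.3f}")
```

The share should come out close to the significance level itself, which is what choosing a threshold of 0.05 means: we accept being wrong about a true null hypothesis roughly 5% of the time.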

By following this sequence---from defining the null hypothesis, through data collection and analysis, to interpreting p-values and drawing conclusions---we give ourselves a structured, reliable way to probe the mysteries of causation. It's through this rigorous process that we can make meaningful assertions about the effectiveness of different teaching methods on reading comprehension, or any other topic we might choose to study.

### What is a T-test? How Do I Conduct One?

A T-test is a type of hypothesis test that helps you compare means --- the averages of two groups --- to see if they're significantly different from each other. It's like a magnifying glass for your data; it amplifies differences and lets you see details that might not be immediately obvious. This is particularly useful in experiments like the Baumann study, where we want to determine whether different teaching methods yield different levels of reading comprehension.

In the Baumann study, let's say we're interested in comparing the Basal group (our control group) with the DRTA group (one of the experimental groups). We start with a null hypothesis that asserts there's no significant difference between the two groups' reading comprehension scores. The alternative hypothesis posits the opposite: that there is a significant difference.

The **T-test** uses the means and standard deviations of the two groups to calculate a t-statistic, which is a measure of the difference between the two groups relative to the variation in the data. The formula for the t-statistic in an independent samples T-test is:

$$
t = \frac{\text{difference of means}}{\text{standard error of the difference}}
$$

The standard error of the difference accounts for the spread and size of each group, giving us a normalized measure. Once we have the t-statistic, we use it to find the p-value, a crucial number that tells us whether to reject our null hypothesis. If the p-value is lower than a predefined level of significance (usually $\alpha = 0.05$), we reject the null hypothesis in favor of the alternative. In the context of the Baumann study, a low p-value would mean it's unlikely the difference in reading comprehension between the Basal and DRTA groups happened by chance, thereby strengthening our confidence in the teaching methods' effectiveness.
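The formula can be worked through by hand. The sketch below assumes the classic equal-variance ("pooled") version of the standard error, which is what `scipy.stats.ttest_ind()` uses by default, and then checks the result against SciPy. The helper name `pooled_t_statistic` and the two score lists are hypothetical.

```python
import numpy as np
from scipy import stats

def pooled_t_statistic(a, b):
    """t = (difference of means) / (standard error of the difference)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)

    # Pooled variance: a weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

    # Standard error of the difference between the two means.
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

    return (a.mean() - b.mean()) / standard_error

# Hypothetical scores for two small groups (made up for illustration).
control = [5, 9, 5, 8, 10, 7, 6, 8]
treatment = [11, 13, 9, 12, 10, 14, 11, 9]

manual_t = pooled_t_statistic(control, treatment)
scipy_result = stats.ttest_ind(control, treatment)

print(f"By hand: t = {manual_t:.4f}")
print(f"SciPy:   t = {scipy_result.statistic:.4f}, p = {scipy_result.pvalue:.4f}")
```

The sign of the t-statistic depends on the order of the groups: it is negative here because the first group has the lower mean.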

The T-test is one of the most commonly used tests in statistics, offering a powerful way to dissect your data and glean insights that can drive decision-making. Whether you're a researcher, a policy-maker, or just someone curious about the world, understanding how to properly conduct a T-test can arm you with the tools to make more informed, data-driven conclusions.

### Conducting a T-test in Python: A Practical Guide

Once you've grasped the conceptual framework behind the T-test, the next step is to actually perform one. Python, with its rich ecosystem of data science libraries, makes this process straightforward. We'll focus on using the `scipy.stats` library, which contains a function specifically designed to conduct T-tests on two independent samples.

#### Step 1: Import the Necessary Libraries

The first thing you'll need is to import the relevant Python libraries. For a T-test, the primary library you'll use is `scipy.stats`.

In [None]:
import scipy.stats as stats

#### Step 2: Prepare Your Data

The Baumann dataset is already loaded into a DataFrame called `read_df`. To perform the T-test, we need to extract the scores on the first post-test (`post.test.1`) for the two groups we're comparing: Basal and DRTA.

In [None]:
# Extracting post-test scores for Basal and DRTA groups
basal_scores = read_df['post.test.1'][read_df['group'] == 'Basal']
drta_scores = read_df['post.test.1'][read_df['group'] == 'DRTA']

#### Step 3: Conduct the T-test
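We'll use the `ttest_ind()` function from the `scipy.stats` library to perform an independent samples T-test. This function takes the two groups of scores as arguments and returns the t-statistic and the p-value (recent versions of SciPy also report the degrees of freedom, `df`). By default, it assumes the two groups have equal variances.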

In [None]:
test_results = stats.ttest_ind(basal_scores, drta_scores)
test_results

TtestResult(statistic=-3.733591722324427, pvalue=0.0005618273655552026, df=42.0)
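
By default, `ttest_ind()` assumes the two groups have the same variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below compares the two on made-up scores (not the Baumann data) in which the second group is smaller and more spread out. It is an illustration of the option, not a claim about which version the Baumann analysis should use.

```python
from scipy import stats

# Made-up scores for illustration only (not the Baumann data);
# the second group is smaller and more spread out than the first.
group_a = [5, 6, 6, 7, 5, 6, 7, 6, 5, 7]
group_b = [4, 9, 12, 7, 10, 14]

# Default: pooled-variance t-test (assumes equal variances)
pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.4f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

When the groups have similar sizes and spreads, the two versions give nearly the same answer. They diverge when group sizes and variances differ, as in this example.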

#### Step 4: Interpret the results
From the above, we can see that the p-value is (much) lower than 0.05, which means we can *reject* the null hypothesis (of no difference between these methods of teaching reading) and *accept* the hypothesis that there was a difference!

We could also automate this using Python (this is important when working with large datasets and repeated tests).

In [None]:
alpha = 0.05
p_value = test_results.pvalue

if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between Basal and DRTA.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between Basal and DRTA.")


Reject the null hypothesis: There is a significant difference between Basal and DRTA.
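
For repeated tests, the same decision rule can be wrapped in a function. The sketch below is one possible way to do this: `compare_groups` is a hypothetical helper (not part of SciPy), and the scores are made up for illustration rather than taken from the Baumann dataset.

```python
from itertools import combinations
from scipy import stats

def compare_groups(scores_by_group, alpha=0.05):
    """Run an independent-samples t-test for every pair of groups.

    Returns a dict mapping (name_a, name_b) to (p_value, reject_null).
    """
    decisions = {}
    for name_a, name_b in combinations(scores_by_group, 2):
        result = stats.ttest_ind(scores_by_group[name_a], scores_by_group[name_b])
        decisions[(name_a, name_b)] = (result.pvalue, bool(result.pvalue < alpha))
    return decisions

# Made-up scores for illustration only (not the Baumann data)
example_scores = {
    "A": [4, 6, 7, 5, 8, 6, 5, 7],
    "B": [8, 9, 7, 10, 9, 11, 8, 10],
    "C": [5, 6, 6, 7, 5, 8, 6, 7],
}

decisions = compare_groups(example_scores)
for (name_a, name_b), (p, reject) in decisions.items():
    verdict = "reject" if reject else "fail to reject"
    print(f"{name_a} vs {name_b}: p = {p:.4f} -> {verdict} the null hypothesis")
```

Keep in mind that the more tests you run, the more likely it is that at least one falls below `alpha` by chance alone, so results from many repeated tests deserve extra caution.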


### Coding Exercise: Conducting a T-test on Basal vs Strat Groups

In this exercise, you'll apply what you've learned by conducting a T-test to compare the Basal and Strat groups in the Baumann dataset. By doing this, you'll gain hands-on experience in performing hypothesis testing in Python.

#### Part 1: Import the Necessary Libraries

Your first task is to import the `scipy.stats` library, which contains the functions needed for performing a T-test.

#### Part 2: Extract the Data

The Baumann dataset is loaded in a DataFrame called `read_df`. Extract the post-test scores for the Basal and Strat groups into separate variables.

#### Part 3: State the Hypotheses

Before performing the T-test, state the null hypothesis and alternative hypothesis that you're testing. You can write these in a text cell.

#### Part 4: Perform the T-test

Now, use the `ttest_ind()` function from the `scipy.stats` library to perform an independent samples T-test on the Basal and Strat groups. Store the t-statistic and p-value in variables.

#### Part 5: Make a Conclusion

Finally, interpret the p-value. If it's less than 0.05, reject the null hypothesis. Otherwise, fail to reject the null hypothesis. Write your conclusion below.

### My Answer: Run a T-test
Include your answers below (a mix of code and text cells).

## From T-tests to Confidence Intervals

After delving into the world of T-tests, where we directly compared the means of two groups to decide if they were different, you might wonder, "What's next?" That's where **confidence intervals** come into play. While a T-test tells us whether a difference exists, confidence intervals provide a richer understanding by quantifying the uncertainty around our sample mean. They give us a range of values that are likely to contain the true population mean, thereby adding nuance and depth to our understanding of the data.

Imagine you've found a statistically significant difference in reading comprehension between two teaching methods using a T-test. That's a powerful piece of information, but it doesn't tell you how certain you are that this observed effect would be replicated if you sampled another group of students. A confidence interval fills this gap by providing a range of plausible values for the population parameter. Essentially, it tells us how confident we can be that our sample statistic is a reliable estimate for the population parameter.

#### The Math Behind Confidence Intervals

Now that we have a conceptual grasp of why confidence intervals are so valuable, let's look at the math that brings this concept to life.  The confidence interval is built around the **sample mean $(\bar{x})$** and takes into account the variability in your data, represented by the **standard error of the mean (SEM)**. The formula for SEM is:

$$
\text{SEM} = \frac{s}{\sqrt{n}}
$$

where $s$ is the standard deviation of the sample, and $n$ is the sample size.

To calculate the confidence interval, you use the following formula:

$$
\text{Confidence Interval} = \bar{x} \pm (Z \times \text{SEM})
$$

In this equation, $Z$ is the **Z-score**, which corresponds to your chosen confidence level. For a 95% **confidence level**, the Z-score is approximately 1.96.

By calculating the confidence interval, you get a range of values that likely contain the true population mean. This range is a powerful tool that not only adds depth to your findings from the T-test but also allows you to make more informed decisions and predictions based on your data.
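
The two formulas above can be applied step by step. The following sketch uses a small made-up sample (not the Baumann data), computes the SEM and the interval by hand, and then checks the result against `stats.norm.interval`. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy import stats

# Made-up scores for illustration only (not the Baumann data)
scores = np.array([4, 6, 7, 5, 8, 6, 5, 7, 9, 6])

x_bar = scores.mean()        # sample mean
s = scores.std(ddof=1)       # sample standard deviation (ddof=1 matches pandas' default)
n = len(scores)              # sample size
sem = s / np.sqrt(n)         # standard error of the mean

# Z-score for a 95% confidence level (2.5% in each tail)
z = stats.norm.ppf(0.975)

# Confidence interval = x_bar +/- (Z * SEM)
ci_lower = x_bar - z * sem
ci_upper = x_bar + z * sem

# The same interval computed by SciPy, for comparison
scipy_lower, scipy_upper = stats.norm.interval(0.95, loc=x_bar, scale=sem)

print(f"Z for 95%: {z:.4f}")
print(f"By hand: ({ci_lower:.4f}, {ci_upper:.4f})")
print(f"SciPy:   ({scipy_lower:.4f}, {scipy_upper:.4f})")
```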


## Python Code: Calculate Confidence Intervals
Before we dive into the code, let's pause for a moment to consider why we're doing this. Our overarching mission is to unearth which teaching methods are most effective for improving reading comprehension in fourth-grade students. We've already seen that T-tests can tell us if there's a statistically significant difference between teaching methods. But how certain are we about these findings? What's the range within which the true effectiveness of a method might lie?

This is where confidence intervals come into the picture. In this section, we'll focus on calculating the 95% confidence interval for the post-test scores of students in the 'Basal' group. Why? Because the Basal group serves as our control group—the teaching method that is currently in use. Knowing the range of effectiveness for this standard method will serve as our benchmark. We can then compare this with the ranges for other, more experimental teaching methods to see if they offer any significant benefits.

By calculating this confidence interval, we're adding another layer to our analysis. We're not just asking, "Is this teaching method different?" We're asking, "How different could this teaching method be, and how certain are we about that difference?" These are crucial questions for educators and policymakers aiming to make data-driven decisions about which teaching methods to employ.

Let's roll up our sleeves and dive into the Python code to calculate this.


In [None]:
# Step 1: import libraries
import pandas as pd
from scipy import stats
from math import sqrt

# Step 2: Filtering the DataFrame to only include the 'Basal' group
basal_data = read_df[read_df['group'] == 'Basal']['post.test.1']

# Step 3: Calculating the sample mean
sample_mean = basal_data.mean()
# Calculating the standard deviation
sample_std = basal_data.std()
# Getting the sample size
sample_size = len(basal_data)

# Step 4: Calculating the 95% confidence interval
basal_ci_lower, basal_ci_upper = stats.norm.interval(0.95, loc=sample_mean, scale=sample_std / sqrt(sample_size))

# Step 5: Print results
print(f"The 95% confidence interval for the post-test scores of the Basal group is between {basal_ci_lower} and {basal_ci_upper}.")



The 95% confidence interval for the post-test scores of the Basal group is between 5.525617309400891 and 7.838019054235472.


When we say that the 95% confidence interval for the post-test scores of the Basal group is between approximately 5.53 and 7.84, we're making a specific kind of statistical claim. This doesn't mean that 95% of the scores fall within this range. Rather, it means that if we were to repeat this experiment many times, and each time calculate a 95% confidence interval, we would expect about 95% of those intervals to contain the true population mean. In simpler terms, we're quite confident—but not absolutely certain—that the average post-test score for all fourth-grade students taught using the Basal method would fall between these two numbers. This range serves as our benchmark for evaluating other teaching methods. If the confidence interval for another teaching method does not overlap with this one and is higher, we might consider that method to be more effective at improving reading comprehension.
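
The "repeat the experiment many times" interpretation can be checked with a simulation. The sketch below invents a population with a known mean (the values `true_mean`, `true_sd`, and the sample size are assumptions chosen for illustration, not estimates from the Baumann data), draws many samples from it, and counts how often the 95% interval captures the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical population and experiment sizes, chosen for illustration
true_mean, true_sd = 7.0, 3.0
n_students, n_experiments = 50, 2000

z = stats.norm.ppf(0.975)  # Z-score for a 95% confidence level

# Each row is one simulated experiment with n_students scores
samples = rng.normal(true_mean, true_sd, size=(n_experiments, n_students))

# One confidence interval per experiment
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n_students)
lower = means - z * sems
upper = means + z * sems

# Fraction of intervals that contain the true population mean
contains_true_mean = (lower <= true_mean) & (true_mean <= upper)
coverage = contains_true_mean.mean()
print(f"{coverage:.1%} of the {n_experiments} intervals contain the true mean")
```

The printed proportion should land close to 95%. Any single interval either contains the true mean or it does not; the 95% describes how the procedure performs over many repetitions.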


### Confidence Intervals: Comparison of Basal with DRTA

To compare the Basal ("control") group with DRTA (one of our preferred teaching strategies), we'll calculate the 95% confidence interval for the post-test scores of the DRTA group, just as we did for the Basal group. Then we'll see what this tells us in comparison to the Basal group.

First, we need to calculate the CI for the DRTA group (in exactly the same way we did it for the control/basal group above):

In [None]:
# Filtering the DataFrame to only include the 'DRTA' group
drta_data = read_df[read_df['group'] == 'DRTA']['post.test.1']

# Calculating the sample mean and standard deviation for the DRTA group
drta_mean = drta_data.mean()
drta_std = drta_data.std()

# Getting the sample size for the DRTA group
drta_size = len(drta_data)

# Calculating the 95% confidence interval for the DRTA group
drta_ci_lower, drta_ci_upper = stats.norm.interval(0.95, loc=drta_mean, scale=drta_std / sqrt(drta_size))

print(f"The 95% confidence interval for the post-test scores of the DRTA group is between {drta_ci_lower} and {drta_ci_upper}.")


The 95% confidence interval for the post-test scores of the DRTA group is between 8.634315166852577 and 10.91113937860197.
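
Both intervals printed so far are based on the normal distribution. With small groups, an interval based on the t-distribution is a common alternative and comes out slightly wider. The sketch below recovers each group's mean and SEM from the printed intervals and recomputes them with `stats.t.interval`. The group size of 22 is an assumption (66 students split evenly across three groups, consistent with `df=42.0` in the T-test output), so treat the result as an illustration rather than a replacement for the analysis above.

```python
from scipy import stats

# Normal-based 95% intervals, copied from the printed output above
normal_cis = {
    "Basal": (5.525617309400891, 7.838019054235472),
    "DRTA": (8.634315166852577, 10.91113937860197),
}

n = 22                      # ASSUMED group size (66 students / 3 groups)
z = stats.norm.ppf(0.975)   # Z-score used by the normal-based interval

t_cis = {}
for group, (lower, upper) in normal_cis.items():
    mean = (lower + upper) / 2          # the interval is centered on the mean
    sem = (upper - lower) / (2 * z)     # undo the Z multiplier to recover the SEM
    # t-based interval with n - 1 degrees of freedom
    t_cis[group] = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
    t_lower, t_upper = t_cis[group]
    print(f"{group}: t-based 95% CI = ({t_lower:.2f}, {t_upper:.2f})")
```

Under this assumption the t-based intervals are a little wider than the normal-based ones, but the two groups' intervals still do not overlap.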


### Interpreting the Comparison

After running this code, you'll have the 95% confidence interval for the DRTA group. To interpret the comparison, you need to look at how this new interval relates to the one we calculated earlier for the Basal group.

1.  *Overlap.* If the confidence intervals for Basal and DRTA overlap, it suggests that the difference between the two methods might not be statistically significant. In other words, one method is not clearly better than the other. Overlap alone does not settle the question, though; a T-test on the two groups is the more direct check.

2.  *No Overlap, DRTA Higher.* If the confidence interval for DRTA is entirely above the Basal interval and there's no overlap, this could be a strong indication that the DRTA method is more effective.

3.  *No Overlap, DRTA Lower.* If the DRTA interval is entirely below the Basal interval, then the Basal method might be more effective.

Remember, confidence intervals provide a range of plausible values for the population parameter (in this case, the true mean post-test score). They don't offer definitive proof, but they do give us a more nuanced understanding of the data, which can be extremely valuable when making decisions about teaching methods.

Our results are as follows:



In [None]:
print(f"The 95% confidence interval for the post-test scores of the Basal group is between {basal_ci_lower} and {basal_ci_upper}.")
print(f"The 95% confidence interval for the post-test scores of the DRTA group is between {drta_ci_lower} and {drta_ci_upper}.")



The 95% confidence interval for the post-test scores of the Basal group is between 5.525617309400891 and 7.838019054235472.
The 95% confidence interval for the post-test scores of the DRTA group is between 8.634315166852577 and 10.91113937860197.


We can see that (1) there is NOT any overlap between the two intervals and (2) the DRTA interval is higher. We have evidence that the DRTA teaching method works!
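
The three cases described under "Interpreting the Comparison" can also be checked in code. In this sketch, `intervals_overlap` is a hypothetical helper written for illustration, and the interval endpoints are the printed results above rounded to two decimal places.

```python
def intervals_overlap(interval_a, interval_b):
    """Return True if two (lower, upper) intervals share any values."""
    return interval_a[0] <= interval_b[1] and interval_b[0] <= interval_a[1]

# Printed 95% confidence intervals, rounded to two decimals
basal_ci = (5.53, 7.84)
drta_ci = (8.63, 10.91)

if intervals_overlap(basal_ci, drta_ci):
    message = "Overlap: neither method is clearly better."
elif drta_ci[0] > basal_ci[1]:
    message = "No overlap: the DRTA interval lies entirely above the Basal interval."
else:
    message = "No overlap: the DRTA interval lies entirely below the Basal interval."

print(message)
# prints "No overlap: the DRTA interval lies entirely above the Basal interval."
```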

### Exercise: Confidence Intervals for Strat and Comparison with Basal

#### Part 1: Filter the DataFrame for Strat Group

In the first step, filter the DataFrame to include only the post-test scores of the Strat group. Save this filtered data in a new variable.


#### Part 2: Calculate Sample Mean and Standard Deviation for Strat

Now, calculate the sample mean and standard deviation for the Strat group's post-test scores. Store these values in separate variables.


#### Part 3: Calculate the 95% Confidence Interval for Strat

Use the sample mean and standard deviation to calculate the 95% confidence interval for the Strat group. Remember, the formula for the confidence interval uses the standard error, which is the standard deviation divided by the square root of the sample size.

#### Part 4: Display the Confidence Interval for Strat

Print out the calculated 95% confidence interval for the Strat group's post-test scores.


#### Part 5: Interpret the Comparison with Basal

Finally, compare the 95% confidence interval for Strat with that of the Basal group. Discuss whether the intervals overlap and what this suggests about the effectiveness of the Strat teaching method compared to Basal.

### My Answers: Calculating Confidence Intervals
Your code below.

## Review With Quizlet
Run the following cell to launch the Quizlet review.

In [None]:
%%html
<iframe src="https://quizlet.com/832871577/learn/embed?i=psvlh&x=1jj1" height="500" width="100%" style="border:0"></iframe>

## Glossary
| Term | Definition |
| --- | --- |
| Sample | A subset of individuals or data points extracted from the larger pool, often for analytical investigation.  |
| Population | The complete set from which a sample is drawn, encompassing all individuals or data points of interest. |
| Bias | Systematic errors that skew results, often unintentionally favoring certain outcomes over others. |
| Sampling Method | The procedure used to select a subset from a larger group. Methods include random, stratified, and convenience sampling.  |
| Confounding Variables | Extraneous factors that affect both the dependent and independent variables, muddling the causal relationship.  |
| Controlled Experiment | An investigative setup where all variables, except the one being studied, are held constant.  |
| Control (Basal) Group | A group that receives no treatment or a standard treatment, serving as a baseline for comparison. In Cinderella's case, the control group would be those who never wore the magical glass slipper. |
| Treatment Group | The cohort exposed to the variable being tested, often to assess its impact. In an experiment studying the effects of a new drug, this group would receive the medication. |
| Cause  | The condition or variable that brings about a particular effect or result, making a difference to the outcome.  |
| Independent Variable | The condition being manipulated to observe its effect on another variable.  |
| Dependent Variable | The outcome measured in response to changes in the independent variable. |
| Hypothesis Test | A procedure to assess if a statement about a population parameter is statistically supported.  |
| Null Hypothesis | An initial assumption that there is no effect or relationship between variables, serving as a starting point for statistical testing. This is Scrooge before being convinced that spirits affect his life. |
| P-value | The probability of observing data at least as extreme as those in your sample, assuming the null hypothesis is true. Lower values suggest stronger evidence against the null hypothesis. |
| Reject Null Hypothesis | The decision to discard the initial assumption, often because the p-value is sufficiently low. |
| Fail to Reject Null Hypothesis | Maintaining the starting assumption due to insufficient evidence against it.  |
| T-test | A statistical test used to compare the means of two groups and assess the likelihood that they come from the same population. |
| Confidence Interval | A range of values within which the true population parameter is expected to fall, based on sample data. Imagine Gandalf estimating the time it will take to reach Mordor but allowing for uncertainties. |
| Z-score | The number of standard deviations a data point is from the mean of its distribution. For example, +2 means the data point is two standard deviations above the average. |
| Confidence Level | The long-run proportion of confidence intervals, computed the same way from repeated samples, that would contain the true population parameter (for example, 95%). |

## Code to Know
| Code | Function in English |
| --- | --- |
| `read_df.describe()` | Python: Generates descriptive statistics of the dataframe `read_df`, summarizing central tendency, dispersion, and shape of the dataset's distribution. |
| `import scipy.stats as stats` | Python: Imports the `stats` module from the `scipy` library, which includes a wide variety of statistical functions.  |
| `read_df['post.test.1'][read_df['group'] == 'Basal']` | Python: Filters `post.test.1` scores for rows where the group is 'Basal', similar to isolating the Control (Basal) Group's results in an experiment. |
| `stats.ttest_ind(basal_scores, drta_scores)` | Python: Performs an independent T-test between `basal_scores` and `drta_scores`, comparing their means to see if they come from the same population.  |
| `print("reject the null") if pvalue < alpha else print("Fail to reject null")` | Python: Prints "reject the null" if the `pvalue` is less than a significance level (`alpha`), otherwise prints "Fail to reject null". |
| `basal_data.mean()` | Python: Calculates the mean (average) of the `basal_data` dataset.  |
| `basal_data.std()` | Python: Computes the standard deviation of the `basal_data` dataset, offering a measure of the data's dispersion or spread. |
| `len(basal_data)` | Python: Returns the number of elements in the `basal_data` dataset.  |
| `stats.norm.interval(0.95, loc=sample_mean, scale=sample_std / sqrt(sample_size))` | Python: Calculates a 95% confidence interval for a normal distribution given a sample mean (`sample_mean`), standard deviation (`sample_std`), and sample size. It estimates the range in which the true population parameter likely resides. |