# **Applied Statistics Project - Analysis on PlantGrowth R Dataset** 
___

This notebook contains my analysis of the PlantGrowth R dataset. 

![Plant Growth](image/plant-growth.jpg)

**Author: Brianne McGrath**
___

## **Dataset Overview**

The **PlantGrowth** dataset contains results from an experiment designed to compare the dried weight yields of plants under different treatment conditions. The dataset includes measurements for a **control group** (`ctrl`) and two **treatment groups** (`trt1` and `trt2`). Within the dataset we have the following variables: 
- **weight:** The dried weight of the plants. 
- **group:** The treatment group of the plant. 

There are 30 observations in total. 
___

## **Imports**

In [1]:
# import necessary libraries 
import pandas as pd 
from scipy import stats 
from scipy.stats import f_oneway

In [2]:
# loading dataset
df = pd.read_csv('data/plant_growth.csv')

In [3]:
df.head()

| | rownames | weight | group |
|---|---|---|---|
| 0 | 1 | 4.17 | ctrl |
| 1 | 2 | 5.58 | ctrl |
| 2 | 3 | 5.18 | ctrl |
| 3 | 4 | 6.11 | ctrl |
| 4 | 5 | 4.50 | ctrl |
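
As a quick sanity check on the structure described in the overview, the sketch below summarises each group. It is not part of the original notebook: so that it runs without `data/plant_growth.csv`, the weights are re-typed from the PlantGrowth dataset as shipped with R (an assumption), and `weights`, `demo_df`, and `summary` are illustrative names.

```python
import pandas as pd

# PlantGrowth weights as distributed with R, re-typed here (an assumption)
# so the sketch does not depend on 'data/plant_growth.csv'.
weights = {
    "ctrl": [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14],
    "trt1": [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    "trt2": [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26],
}

# Long format: one row per plant, mirroring the 'weight' and 'group' columns.
demo_df = pd.DataFrame(
    [(w, g) for g, ws in weights.items() for w in ws],
    columns=["weight", "group"],
)

# Per-group sample size, mean, and sample standard deviation.
summary = demo_df.groupby("group")["weight"].agg(["count", "mean", "std"])
print(summary)
```

With the real file, the same `groupby` call could be applied to `df` directly.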


___
## **$t$-Test:**

___

### **What is a $t$-Test?** 
A $t$-test is a statistical test used to determine whether there is a significant difference between the means of two groups. It helps assess whether an observed difference could plausibly be due to chance or whether it reflects a real effect. In this case, we will use it to compare the plant weights between the two treatment groups `trt1` and `trt2`. 

### **How a $t$-test Works:** 
#### **Key Assumptions of the $t$-Test:**
- Data are continuous.
- Sample data have been randomly sampled from a population. 
- There is homogeneity of variance (i.e., the variability of the data in each group is similar).
- The distribution is approximately normal. 

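For two-sample $t$-tests, we must have independent samples. If the samples are not independent, then a paired $t$-test may be appropriate. If the group variances cannot be assumed equal, Welch's $t$-test (`equal_var=False` in `scipy.stats.ttest_ind`) relaxes the homogeneity of variance assumption. 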
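
The normality and equal-variance assumptions listed above can be checked before running the test. The sketch below is one possible way to do so and is not part of the original analysis: it uses SciPy's Shapiro-Wilk test (`stats.shapiro`) and Levene's test (`stats.levene`), with the two treatment groups re-typed from R's PlantGrowth dataset (an assumption) so that the block is self-contained.

```python
from scipy import stats

# Treatment-group weights re-typed from R's PlantGrowth dataset (an assumption).
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
trt2 = [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26]

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
_, p_normal_trt1 = stats.shapiro(trt1)
_, p_normal_trt2 = stats.shapiro(trt2)

# Levene: the null hypothesis is that the groups have equal variances.
_, p_equal_var = stats.levene(trt1, trt2)

print(f"Shapiro-Wilk trt1: p = {p_normal_trt1:.4f}")
print(f"Shapiro-Wilk trt2: p = {p_normal_trt2:.4f}")
print(f"Levene trt1 vs trt2: p = {p_equal_var:.4f}")
```

A p-value above the chosen significance level means the check found no evidence against the assumption. With only 10 observations per group, these checks have limited power, so they are best read alongside a plot of the data.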

In [4]:
# separate data by treatment group 
trt1_data = df[df['group'] == 'trt1']['weight']
trt2_data = df[df['group'] == 'trt2']['weight']

# perform Welch's t-test (equal_var=False does not assume equal variances)
t_stat, p_value = stats.ttest_ind(trt1_data, trt2_data, equal_var=False)

# display results
t_stat, p_value

(-3.0100985421243616, 0.00929840471726984)

### **Results & Interpretation:**
- **t-statistic:** -3.01. The negative value indicates that the mean weight of plants in `trt1` is lower than that in `trt2`. A magnitude of 3.01 means the difference in means is about three standard errors, which is large relative to the variation within the groups. 
- **p-value:** 0.0093. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis that the two group means are equal. This provides evidence of a statistically significant difference between the means of `trt1` and `trt2`. 

The dried plant weights of `trt1` and `trt2` are significantly different, with `trt1` resulting in lower weights. This suggests that the treatments have different effects on plant growth. 
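
To make the t-statistic less of a black box, the sketch below recomputes it by hand using only the standard library. Because the cell above passes `equal_var=False`, it uses the Welch form: the difference in means divided by the square root of the summed per-group squared standard errors, with degrees of freedom from the Welch-Satterthwaite approximation. The data are re-typed from R's PlantGrowth dataset (an assumption), and `t_manual` and `df_welch` are illustrative names.

```python
import math
import statistics

# Treatment-group weights re-typed from R's PlantGrowth dataset (an assumption).
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
trt2 = [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26]

n1, n2 = len(trt1), len(trt2)
mean1, mean2 = statistics.mean(trt1), statistics.mean(trt2)

# Sample variances (n - 1 in the denominator).
var1, var2 = statistics.variance(trt1), statistics.variance(trt2)

# Squared standard error of each group mean.
se1_sq, se2_sq = var1 / n1, var2 / n2

# Welch's t-statistic: difference in means over the combined standard error.
t_manual = (mean1 - mean2) / math.sqrt(se1_sq + se2_sq)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch = (se1_sq + se2_sq) ** 2 / (
    se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
)

print(f"t = {t_manual:.4f}, df = {df_welch:.2f}")
```

The value of `t_manual` should agree with the t-statistic of about -3.01 reported above.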
___

## **ANOVA**
___

In [5]:
# separate data by treatment group 
ctrl_data = df[df['group'] == 'ctrl']['weight']
trt1_data = df[df['group'] == 'trt1']['weight']
trt2_data = df[df['group'] == 'trt2']['weight']

# perform one-way ANOVA
f_stat, p_value = f_oneway(ctrl_data, trt1_data, trt2_data)

# display results 
f_stat, p_value

(4.846087862380136, 0.0159099583256229)

### **Results & Interpretation:**
- **F-statistic:** 4.85. The variability between the group means is about 4.85 times the variability within the groups, which is more than we would expect if all group means were equal. 
- **p-value:** 0.0159. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis. 
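
The F-statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below recomputes it by hand with the standard library as an illustration of that ratio; it is not how `f_oneway` is implemented internally. The data are re-typed from R's PlantGrowth dataset (an assumption), and all variable names are illustrative.

```python
import statistics

# PlantGrowth weights re-typed from R's dataset (an assumption).
groups = {
    "ctrl": [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14],
    "trt1": [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    "trt2": [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26],
}

all_weights = [w for ws in groups.values() for w in ws]
grand_mean = statistics.mean(all_weights)
k = len(groups)             # number of groups
n_total = len(all_weights)  # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(ws) * (statistics.mean(ws) - grand_mean) ** 2 for ws in groups.values()
)

# Within-group sum of squares: how far each plant is from its own group mean.
ss_within = sum(
    (w - statistics.mean(ws)) ** 2 for ws in groups.values() for w in ws
)

# Mean squares divide each sum of squares by its degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
print(f"F = {f_manual:.4f}")
```

The value of `f_manual` should agree with the F-statistic of about 4.85 reported above.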

The ANOVA results suggest there is a statistically significant difference in the mean plant weights among the three groups (`ctrl`, `trt1`, `trt2`). However, ANOVA does not specify which groups are different. 
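
One common way to find out which groups differ is a post-hoc test such as Tukey's HSD. The sketch below is a possible follow-up rather than part of the original analysis. It assumes SciPy 1.8 or later, where `scipy.stats.tukey_hsd` is available, and re-types the data from R's PlantGrowth dataset (an assumption).

```python
from scipy.stats import tukey_hsd

# PlantGrowth weights re-typed from R's dataset (an assumption).
ctrl = [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14]
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
trt2 = [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26]

# Pairwise comparisons of all group means with family-wise error control.
res = tukey_hsd(ctrl, trt1, trt2)

# res.pvalue is a symmetric matrix indexed in the order the groups were passed.
labels = ["ctrl", "trt1", "trt2"]
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]} vs {labels[j]}: p = {res.pvalue[i, j]:.4f}")
```

Each printed p-value is already adjusted for the three comparisons, so it can be compared against 0.05 directly.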
___

### **Why Use ANOVA Instead of Multiple $t$-tests?**

When analysing more than two groups, it is more appropriate to use ANOVA instead of performing several $t$-tests. Here's why: 

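1. **Reduces Errors:** Running multiple $t$-tests increases the risk of false positives (Type I errors), because every additional test is another chance of a false positive. ANOVA avoids this by comparing all groups in a single test, keeping the overall Type I error rate at the chosen significance level. 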
2. **More Efficient:** ANOVA tests all groups in a single step, saving time and avoiding repeated calculations. 
3. **Broad Comparison:** ANOVA checks for differences across all groups together, while a $t$-test only compares two groups at a time. 
4. **Follow-Up Testing:** If ANOVA shows a difference, additional tests like Tukey's HSD can identify which groups are different while still controlling the overall Type I error rate. 

ANOVA is better for analysing more than two groups because it is accurate, efficient, and provides a complete comparison. 
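
The inflation of the Type I error rate can be made concrete with a little arithmetic. The sketch below assumes the pairwise tests are independent, which is only an approximation here because the comparisons share groups; the variable names are illustrative.

```python
from math import comb

alpha = 0.05   # significance level for each individual t-test
n_groups = 3   # ctrl, trt1, trt2

# Number of pairwise t-tests needed to compare every group with every other.
n_comparisons = comb(n_groups, 2)

# Probability of at least one false positive across all tests,
# assuming the tests are independent (an approximation).
family_wise_error = 1 - (1 - alpha) ** n_comparisons

print(f"{n_comparisons} comparisons, family-wise error rate = {family_wise_error:.4f}")
```

Under that assumption, three tests at the 5% level give roughly a 14% chance of at least one false positive, compared with 5% for a single ANOVA.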
___


## **References:**

### **$t$-Test:**
- https://www.investopedia.com/terms/t/t-test.asp#:~:text=two%20sample%20populations.-,What%20Is%20a%20T%2DTest%3F,flipping%20a%20coin%20100%20times. (Definition for T-Test)
- https://www.jmp.com/en_us/statistics-knowledge-portal/t-test.html (T-Test Assumptions)

### **ANOVA:**
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html (f_oneway function for ANOVA)
- https://www.geeksforgeeks.org/how-to-perform-a-one-way-anova-in-python/ (f_oneway for ANOVA)
- https://surveysparrow.com/blog/anova/ (Guide for interpreting ANOVA)
- https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide-2.php (Why use ANOVA instead of multiple $t$-tests?)
- https://stats.stackexchange.com/questions/236877/is-it-wrong-to-use-anova-instead-of-a-t-test-for-comparing-two-means (Why use ANOVA instead of multiple $t$-tests?)
- https://www.voxco.com/blog/anova-vs-t-test-with-a-comparison-chart/ (Why use ANOVA instead of multiple $t$-tests?)
___

# **END**