# **Applied Statistics Project**

---

**Author: Damien Farrell**

---

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

sns.set_theme()

## **Project Task**

> Complete the project in a single notebook called `project.ipynb` in your repository.
The same style should be used as detailed above: explanations in MarkDown and code comments, clean code, and regular commits.
Use plots as appropriate.
<br><br>
In this project, you will analyze the [PlantGrowth R dataset](https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv).
You will find [a short description](https://vincentarelbundock.github.io/Rdatasets/doc/datasets/PlantGrowth.html) of it on [Vincent Arel-Bundock's Rdatasets page](https://vincentarelbundock.github.io/Rdatasets/).
The dataset contains two main variables, a treatment group and the weight of plants within those groups.
>
> Your task is to perform t-tests and ANOVA on this dataset while describing the dataset and explaining your work.
In doing this you should:
>
> 1. Download and save the dataset to your repository.
>
> 2. Describe the data set in your notebook.
>
> 3. Describe what a t-test is, how it works, and what the assumptions are.
>
> 4. Perform a t-test to determine whether there is a significant difference between the two treatment groups `trt1` and `trt2`.
>
> 5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups `ctrl`, `trt1`, and `trt2`.
>
> 6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.
> <br><br>

---
### **References**

1. [Visua](https://seaborn.pydata.org/tutorial/categorical.html)



In [36]:
df = pd.read_csv("data/plantgrowth.csv", index_col=0)

df

rownames,weight,group
1,4.17,ctrl
2,5.58,ctrl
3,5.18,ctrl
4,6.11,ctrl
5,4.5,ctrl
6,4.61,ctrl
7,5.17,ctrl
8,4.53,ctrl
9,5.33,ctrl
10,5.14,ctrl


In [37]:
df.describe()

,weight
count,30.0
mean,5.073
std,0.701192
min,3.59
25%,4.55
50%,5.155
75%,5.53
max,6.31


In [38]:
df['group'] = df['group'].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, 1 to 30
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   weight  30 non-null     float64 
 1   group   30 non-null     category
dtypes: category(1), float64(1)
memory usage: 642.0 bytes


In [39]:
# Pull the groups out.
group_ctrl = df[df['group'] == 'ctrl']['weight']
group_trt1 = df[df['group'] == 'trt1']['weight']
group_trt2 = df[df['group'] == 'trt2']['weight']

## Independent Samples $t$-Test

[scipy.stats.ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#ttest-ind)

**Null Hypothesis:** the population mean weights of `trt1` and `trt2` are equal.
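
For reference, the statistic behind this test can be written out. The form below is the standard pooled-variance (Student) version, which is what `stats.ttest_ind` computes with its default `equal_var=True`. The subscripts 1 and 2 stand for `trt1` and `trt2`; the symbols are introduced only for this note.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

Here $\bar{x}_i$, $s_i^2$ and $n_i$ are the sample mean, sample variance and size of group $i$. Under the null hypothesis, and assuming independent, roughly normal samples with equal variances, $t$ follows a Student $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom.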

In [40]:
# Perform a t-test.
stats.ttest_ind(group_trt1, group_trt2)

TtestResult(statistic=np.float64(-3.0100985421243616), pvalue=np.float64(0.0075184261182198574), df=np.float64(18.0))
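
The p-value of about 0.0075 is below the conventional 0.05 level, so the null hypothesis of equal means would be rejected, provided the test's assumptions hold. The sketch below is one way to check them. It assumes the two arrays, typed in by hand here so the cell is self-contained, match the `trt1` and `trt2` rows of `data/plantgrowth.csv`. The choice of Shapiro-Wilk and Levene as checks is a common convention, not something the project brief prescribes.

```python
import numpy as np
import scipy.stats as stats

# Plant weights per treatment group, entered by hand.
# Assumption: these match the trt1/trt2 rows of data/plantgrowth.csv.
trt1 = np.array([4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69])
trt2 = np.array([6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26])

# Normality within each group (Shapiro-Wilk; H0: the sample is normal).
shapiro_p = {
    name: stats.shapiro(values).pvalue
    for name, values in [("trt1", trt1), ("trt2", trt2)]
}

# Equal variances across groups (Levene; H0: the variances are equal).
levene_p = stats.levene(trt1, trt2).pvalue

# Student's t-test pools the two variances; Welch's t-test does not.
student = stats.ttest_ind(trt1, trt2)
welch = stats.ttest_ind(trt1, trt2, equal_var=False)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

Small p-values from Shapiro-Wilk or Levene would suggest a violated assumption, in which case the Welch result is the safer one to report. Because both groups have ten observations, the Student and Welch statistics coincide and only the degrees of freedom (and hence the p-values) differ.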

In [41]:
# Perform ANOVA.
f, p = stats.f_oneway(group_ctrl, group_trt1, group_trt2)

# Show.
f, p


(np.float64(4.846087862380136), np.float64(0.015909958325622895))
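
With $p \approx 0.016$, the ANOVA indicates that at least one group mean differs from the others at the 0.05 level. A single ANOVA is preferred over several t-tests because each extra test adds another chance of a false positive. The sketch below illustrates the size of that effect. The function name is hypothetical, and the formula assumes the tests are independent, which pairwise tests on shared groups are not, so the figure is only a rough guide.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests.

    Assumption: the tests are independent and each uses the same alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# Three groups (ctrl, trt1, trt2) give 3 * 2 / 2 = 3 pairwise comparisons.
n_groups = 3
n_pairs = n_groups * (n_groups - 1) // 2

print(f"Pairwise t-tests needed: {n_pairs}")
print(f"Approximate familywise error rate: {familywise_error_rate(n_pairs):.3f}")
```

Under these assumptions, three tests at $\alpha = 0.05$ carry a familywise error rate of roughly 14%, whereas the single F-test keeps it at 5%.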

In [42]:
# Tukey's HSD.
res = stats.tukey_hsd(group_ctrl, group_trt1, group_trt2)

# Show.
print(res)

Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)      0.371     0.391    -0.320     1.062
 (0 - 2)     -0.494     0.198    -1.185     0.197
 (1 - 0)     -0.371     0.391    -1.062     0.320
 (1 - 2)     -0.865     0.012    -1.556    -0.174
 (2 - 0)      0.494     0.198    -0.197     1.185
 (2 - 1)      0.865     0.012     0.174     1.556
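
The indices 0, 1 and 2 in the table follow the order in which the groups were passed to `stats.tukey_hsd`, so they correspond to `ctrl`, `trt1` and `trt2`. The sketch below is one way to print the same comparisons with group names and without the mirrored duplicates. It assumes the hand-entered weights match `data/plantgrowth.csv`; the `groups` dictionary and `rows` list are names made up for this example.

```python
from itertools import combinations

import scipy.stats as stats

# Plant weights per group, entered by hand.
# Assumption: these match the rows of data/plantgrowth.csv.
groups = {
    "ctrl": [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14],
    "trt1": [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    "trt2": [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26],
}
names = list(groups)

# Tukey's HSD on all three groups, with 95% confidence intervals.
res = stats.tukey_hsd(*groups.values())
ci = res.confidence_interval(confidence_level=0.95)

# Keep each unordered pair once and attach the group names.
rows = [
    (names[i], names[j], res.statistic[i, j], res.pvalue[i, j],
     ci.low[i, j], ci.high[i, j])
    for i, j in combinations(range(len(names)), 2)
]

for a, b, diff, p, low, high in rows:
    print(f"{a} - {b}: diff={diff:.3f}, p={p:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Read this way, only the `trt1` versus `trt2` comparison has a p-value below 0.05 and a confidence interval that excludes zero. Neither treatment differs significantly from `ctrl`.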




---

# End