# **Fundamentals of Data Analysis Project**

---

**Author: Damien Farrell**

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

sns.set_theme()

## **Project Task**

> Complete the project in a single notebook called `project.ipynb` in your repository.
The same style should be used as detailed above: explanations in Markdown and code comments, clean code, and regular commits.
Use plots as appropriate.
<br><br>
In this project, you will analyze the [PlantGrowth R dataset](https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv).
You will find [a short description](https://vincentarelbundock.github.io/Rdatasets/doc/datasets/PlantGrowth.html) of it on [Vincent Arel-Bundock's Rdatasets page](https://vincentarelbundock.github.io/Rdatasets/).
The dataset contains two main variables, a treatment group and the weight of plants within those groups.
>
> Your task is to perform t-tests and ANOVA on this dataset while describing the dataset and explaining your work.
In doing this you should:
>
> 1. Download and save the dataset to your repository.
>
> 2. Describe the data set in your notebook.
>
> 3. Describe what a t-test is, how it works, and what the assumptions are.
>
> 4. Perform a t-test to determine whether there is a significant difference between the two treatment groups `trt1` and `trt2`.
>
> 5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups `ctrl`, `trt1`, and `trt2`.
>
> 6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.
> <br><br>

---
### **References**

1. [Visualizing categorical data (seaborn tutorial)](https://seaborn.pydata.org/tutorial/categorical.html)



In [32]:
df = pd.read_csv("PlantGrowth.csv", index_col=0)

df.head(10)

| rownames | weight | group |
|---|---|---|
| 1 | 4.17 | ctrl |
| 2 | 5.58 | ctrl |
| 3 | 5.18 | ctrl |
| 4 | 6.11 | ctrl |
| 5 | 4.5 | ctrl |
| 6 | 4.61 | ctrl |
| 7 | 5.17 | ctrl |
| 8 | 4.53 | ctrl |
| 9 | 5.33 | ctrl |
| 10 | 5.14 | ctrl |


In [30]:
df.describe()

| | weight |
|---|---|
| count | 30.0 |
| mean | 5.073 |
| std | 0.701192 |
| min | 3.59 |
| 25% | 4.55 |
| 50% | 5.155 |
| 75% | 5.53 |
| max | 6.31 |


In [31]:
df['group'] = df['group'].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, 1 to 30
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   weight  30 non-null     float64 
 1   group   30 non-null     category
dtypes: category(1), float64(1)
memory usage: 642.0 bytes
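
The overall `describe()` output pools all three groups, so a per-group summary may describe the data set more usefully. The block below is a minimal sketch, assuming a frame with the same `weight` and `group` columns as `PlantGrowth.csv`. `summarize_by_group` is a hypothetical helper name, and the toy values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def summarize_by_group(frame, value_col="weight", group_col="group"):
    """Return count, mean, sample std, min and max of value_col per group."""
    # observed=True skips unused categories if group_col is categorical.
    grouped = frame.groupby(group_col, observed=True)[value_col]
    return grouped.agg(["count", "mean", "std", "min", "max"])


# Hypothetical toy frame with the same column names as PlantGrowth.csv.
# These numbers are illustrative only, NOT the real measurements.
toy = pd.DataFrame({
    "weight": [4.0, 5.0, 6.0, 3.0, 4.0, 5.0],
    "group": ["ctrl", "ctrl", "ctrl", "trt1", "trt1", "trt1"],
})

summary = summarize_by_group(toy)
print(summary)
```

In the notebook itself, `summarize_by_group(df)` would give one row per treatment group, which makes differences in group means and spreads visible before any test is run.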


In [None]:
# Pull the weights for each treatment group out of the PlantGrowth data.
ctrl = df[df['group'] == 'ctrl']['weight']
trt1 = df[df['group'] == 'trt1']['weight']
trt2 = df[df['group'] == 'trt2']['weight']

# Perform one-way ANOVA across the three groups.
f, p = stats.f_oneway(ctrl, trt1, trt2)

# Show the F statistic and p-value.
f, p


In [None]:
# Tukey's HSD: pairwise comparisons between the three groups.
res = stats.tukey_hsd(ctrl, trt1, trt2)

# Show.
print(res)
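
The task also calls for a t-test between `trt1` and `trt2`. The block below is a minimal sketch of an independent two-sample t-test using `scipy.stats.ttest_ind`, assuming a frame with `weight` and `group` columns. `compare_two_groups` is a hypothetical helper name, and the toy values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd
from scipy import stats


def compare_two_groups(frame, group_a, group_b,
                       value_col="weight", group_col="group",
                       equal_var=True):
    """Run an independent two-sample t-test between two groups.

    equal_var=True gives Student's t-test (assumes equal variances);
    equal_var=False gives Welch's t-test.
    """
    a = frame.loc[frame[group_col] == group_a, value_col]
    b = frame.loc[frame[group_col] == group_b, value_col]
    result = stats.ttest_ind(a, b, equal_var=equal_var)
    return result.statistic, result.pvalue


# Hypothetical toy frame with the same column names as PlantGrowth.csv.
# These numbers are illustrative only, NOT the real measurements.
toy = pd.DataFrame({
    "weight": [4.0, 5.0, 6.0, 6.0, 7.0, 8.0],
    "group": ["trt1", "trt1", "trt1", "trt2", "trt2", "trt2"],
})

t_stat, p_value = compare_two_groups(toy, "trt1", "trt2")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

In the notebook, `compare_two_groups(df, "trt1", "trt2")` would run the same test on the real data. Passing `equal_var=False` is a reasonable alternative if the two groups appear to have different variances.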

---

# End