## Fundamentals of Statistics Class: Fill-in Exercises  
#### This notebook is a fill-in workbook for the R exercises in the Fundamentals of Statistics Workshop, which covers data exploration and statistics. Each exercise has specific objectives split into separate labeled cells. Using the provided datasets, the exercises cover basic data summaries, histograms, boxplots, hypothesis tests of mean and variance, linear regression, and ANOVA.

In [None]:
#Import required libraries for all exercises
suppressMessages(library(ggplot2))
suppressMessages(library(readxl))
options(repr.plot.width=7, repr.plot.height=6)

---

### Exercise 0

**Business Question:** A dataset has been shared. Import that dataset and gain basic information about it by answering the following:

* How many rows and columns does it have?
* What are the names of the columns and their data types?
* What does some of the data look like?
* What are some of the characteristics of the data?
* What are the value counts of a specific column?

In [None]:
#Ex0- Read the data from the excel file and create a dataframe
ex0 <- read_xlsx('../Datasets/VAV 3-06 & 4-06 1stHalf.xlsx')

In [None]:
#Ex0- How many rows and columns does it have?
dim(ex0) # Overall dimensions
nrow(ex0) # Number of rows
'<add here>' # Number of columns

In [None]:
#Ex0- What are the names of the columns and their data types?
str(ex0)

In [None]:
#Ex0- What does some of the data look like?
head(ex0)

In [None]:
#Ex0- What are some of the characteristics of the data?
summary(ex0)

In [None]:
#Ex0- What are the value counts of a specific column?
table(ex0$RmTemp) #RmTemp is the column of interest

---

### Exercise 1

**Business Question:** Part tolerance data (Tolerance Stack.xlsx) has been collected from a part manufacturing process. The tolerances are measured at various locations on the part.

* Create a histogram to visualize the overall distribution of tolerances over the entire part
* Using histograms for each location, assess the location effect of the tolerance measurements
* Evaluate the effect of bin size on result

In [None]:
#Ex1- Read the data from the excel file and create a dataframe
ex1 <- '<add here>' 
head(ex1)

In [None]:
#Ex1- Create a histogram to visualize the overall distribution of tolerances over the entire part
ggplot(data = ex1, aes(x=TOL)) + geom_histogram()

In [None]:
#Ex1- Evaluate the effect of bin size on result
for (b in c(5,40,15)) { #iterates over several bin counts (the bins argument sets the number of bins, not their width)
    print(ggplot('<add here>') + geom_histogram(bins = b)) # create ggplot object as above
}

In [None]:
#Ex1- Using histograms for each location, assess the location effect of the tolerance measurements
ggplot('<add here>') + geom_histogram(aes(fill=LOC), bins = 20) + facet_wrap(~LOC) #use ?facet_wrap or Shift+Tab to see a description of the function

---

### Exercise 2 - Boxplots

**Business Question:** Part tolerance data (Tolerance Stack.xlsx) has been collected from a part manufacturing process. The tolerances are measured at various locations on the part.

* Create a Boxplot to visualize the overall distribution of tolerances over the entire part
* Using boxplots for each location, assess the location effect of the tolerance measurements

In [None]:
#Ex2- Data was already read from the excel file and a dataframe called ex1 created in Ex1
head(ex1)

In [None]:
#Ex2- Create a Boxplot to visualize the overall distribution of tolerances over the entire part
boxplot(ex1$TOL, main="Boxplot of Tolerance") # This is the base boxplot function

In [None]:
#Ex2- Using boxplots for each location, assess the location effect of the tolerance measurements
ggplot(ex1, aes(x='<add here>', y=TOL)) + '<add here>' #add x aesthetic in ggplot and find ggplot's function for boxplot

In [None]:
#Ex2 - Optional - Adding overlaid points and colour
ggplot('<add here>') + '<add here>'(aes(fill=LOC)) + geom_jitter(width = 0.2, alpha=0.3) # Add ggplot object and boxplot geom. Fill aesthetic changes the color based on the location

---

### Exercise 3
**Business Question:** Historically, metal elongation has averaged 2% ($\mu_0=2$) with a known $\sigma^2=0.03$. Examine the current data set (Elongation.xlsx) to determine if the sample mean is statistically greater than the historically expected value.
* Conduct a one-sided z-test at $\alpha=.05$ to evaluate the null and alternative hypotheses:
$$
H_0 : \mu \le \mu_0 \\
H_a : \mu > \mu_0
$$
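
For reference, the statistic computed in the cell below can be written out as follows. This is a sketch in the notation above; $\bar{x}$ (sample mean), $n$ (sample size), and $\Phi$ (standard normal CDF, `pnorm` in R) are symbols assumed here rather than taken from the exercise text.
$$
z = \frac{\bar{x} - \mu_0}{\sqrt{\sigma^2 / n}}, \qquad p\text{-value} = 1 - \Phi(z)
$$
For an upper-tailed test, $H_0$ is rejected at $\alpha=.05$ when the p-value is below .05, or equivalently when $z$ exceeds roughly 1.645.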

In [None]:
#Ex3- Read the data from the excel file and create a dataframe
ex3 <- '<add here>'
'<add here>' # Inspect the beginning of ex3

In [None]:
#Ex3- Visualize the data as a boxplot
'<add here>'

In [None]:
#Ex3 - Conduct a one-sided z-test
mu0 = 2
n = '<add here>' # n should be equal to the number of rows of the dataframe

zscore = (mean('<add here>') - mu0) / (sqrt(.03/n)) # we need the mean of the Elongation column
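pvalue= 1-pnorm(zscore) # upper-tail p-value for the one-sided test; abs() would wrongly give a small p-value when the sample mean is well below mu0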
sprintf('Z-score = %.4f', zscore) # %.4f tells the sprintf function that it should give the result as a float with 4 decimal digits
sprintf('p-value = %.4f', pvalue)

**Interpretation:** 'Replace with your comments'

---

### Exercise 4
**Business Question:** Coating thickness of tablets in a drug manufacturing process (Thickness.xlsx) must meet a specification of $\mu=40$. $\sigma^2$ is assumed to be *UNKNOWN*. 
* Conduct a hypothesis test at $\alpha$ = 0.05 (95% confidence level) to determine if $H_0$: $\mu$=40 is an acceptable hypothesis:
$$
H_0: \mu=\mu_0 \\
\text{vs}\\
H_a: \mu \ne \mu_0
$$
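
As a rough illustration of what `t.test` computes for a one-sample, two-sided test, the sketch below reproduces the statistic and p-value by hand. The vector `x` is simulated and purely hypothetical; it is not the Thickness.xlsx data, and the names `t_manual` and `p_manual` are made up for this example.

```r
# Illustrative only: simulated values, not the Thickness.xlsx data
set.seed(1)
x   <- rnorm(25, mean = 40.5, sd = 1)   # hypothetical sample
mu0 <- 40                               # hypothesized mean

n <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))                    # t = (xbar - mu0) / (s / sqrt(n))
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)  # two-sided p-value, n - 1 df

res <- t.test(x, mu = mu0, alternative = "two.sided")

# Stops with an error if the manual and built-in values disagree
stopifnot(isTRUE(all.equal(unname(res$statistic), t_manual)))
stopifnot(isTRUE(all.equal(res$p.value, p_manual)))
```

Because $\sigma^2$ is unknown, the sample standard deviation `sd(x)` stands in for it, which is why the reference distribution is a t distribution with $n-1$ degrees of freedom rather than the standard normal.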

In [None]:
#Ex4- Read the data from the excel file and create a dataframe
ex4 <- '<add here>'
head(ex4)

In [None]:
#Ex4- Conduct a hypothesis test
result <- t.test('<add here>', mu = 40, alternative = "two.sided") # To access a column with a space in its name, wrap the name in backticks: df$`column name`
print(result)

**Interpretation:** 'Replace with your comments'

---

### Exercise 5
**Business Question:** We want to determine information about the mean coating thickness of tablets in a drug manufacturing process (Thickness.xlsx).
* Construct a 95% confidence interval (a range of values that we can be 95% confident contains the true mean of the tablets, $\mu$); $\sigma^2$ is assumed to be *UNKNOWN*.
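
For reference, a two-sided confidence interval for $\mu$ with unknown variance takes the form below. This is a sketch; $\bar{x}$, $s$ (sample standard deviation), and $n$ are symbols assumed here, and $t_{1-\alpha/2,\,n-1}$ is the corresponding t quantile (`qt(0.975, df = n - 1)` in R for a 95% interval).
$$
\bar{x} \pm t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
$$
This interval contains a hypothesized value $\mu_0$ exactly when the two-sided t-test of $H_0: \mu=\mu_0$ at the same $\alpha$ does not reject.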

In [None]:
#Ex5- Data was already read from the excel file and a dataframe called ex4 created in Ex4
head(ex4)

In [None]:
#Ex5- Construct a 95% confidence interval for mu
# You can access the confidence interval part of the result from Exercise 4
print('<add here>') # How can you access the confidence interval part of the 'result' from previous question? Hint: remember how you can access columns of a dataframe by using the $ symbol

---

### Exercise 6
**Business Question:** Batch yield data (Tanks Stacked.xlsx) has been gathered from two reactor tanks in a chemical production process. We will test whether the Tanks produce the same mean yields after we do a comparison of the variances.
* Visually compare the tank data. Give some visual conclusions/conjectures
* Test the hypothesis that the variances are the same for both tanks
* Conduct the hypothesis test that the mean yields are the same
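
For reference, the two tests in this exercise can be sketched as follows. The subscripts 1 and 2 for the two tanks, and the symbols $\bar{x}_i$, $s_i^2$, $n_i$ for each tank's sample mean, sample variance, and sample size, are notation assumed here.
$$
F = \frac{s_1^2}{s_2^2}, \qquad
s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}, \qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}
$$
Under equal variances, $F$ is compared with an F distribution on $(n_1-1,\ n_2-1)$ degrees of freedom, and the pooled $t$ with a t distribution on $n_1+n_2-2$ degrees of freedom.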

In [None]:
#Ex6- Read the data from the excel file and create a dataframe
ex6 <- '<add here>'
'<add here>' # Inspect the first few lines of ex6

In [None]:
#Ex6- Visually compare the tank data. Give some visual conclusions/conjectures
# Create a boxplot to compare the tank data
# With geom_hline we are adding a horizontal line on the mean yield
ggplot('<add here>') + geom_'<add here>'() + geom_hline(yintercept = mean(ex6$Yield), alpha = 0.5)

In [None]:
#Ex6- Test the hypothesis that the variances are the same for both tanks
var.test(Yield ~ Tank, data = ex6) # var.test function does an F test for equal variances
# Notice the different syntax above. You can use a formula (Yield ~ Tank) and specify the data argument (data = ex6)

**Interpretation:** We cannot reject the null hypothesis (p-value > 0.05), so we have no evidence that the variances differ and proceed as if they are equal

In [None]:
#Ex6- Conduct the hypothesis test that the mean yields are the same
t.test('<add here>', data = ex6, '<add here>') # 1) Conduct t.test by using a formula like var.test above. 2) How can you specify that the variances are equal?

**Interpretation:** 'Replace with your comments'

---

### Exercise 7
**Business Question:** 20 daily weight measurements are taken from metal production at two plants (Weights.xlsx). Are the mean daily weights from the plants different?

* Visually compare the North and South Plant weight data. Give some visual conclusions/conjectures.
* Test the hypothesis that the variances are the same for both plants. 
* Conduct the hypothesis test that the mean weights are the same (H<sub>0</sub>: μ<sub>A</sub> - μ<sub>B</sub> = 0 vs H<sub>1</sub>: μ<sub>A</sub> - μ<sub>B</sub> ≠ 0, assuming unknown σ<sup>2</sup><sub>A</sub> ≠ σ<sup>2</sup><sub>B</sub>)
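
For reference, when the two variances are not assumed equal, the usual choice is Welch's t-test. A sketch of its statistic and approximate degrees of freedom follows; $\bar{x}$, $s^2$, and $n$ with subscripts A and B denote each plant's sample mean, sample variance, and sample size, and $\nu$ is a symbol introduced here.
$$
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}, \qquad
\nu \approx \frac{\left(s_A^2/n_A + s_B^2/n_B\right)^2}{\dfrac{(s_A^2/n_A)^2}{n_A-1} + \dfrac{(s_B^2/n_B)^2}{n_B-1}}
$$
Unlike the pooled test, $\nu$ is generally not an integer and is never larger than $n_A+n_B-2$.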

In [None]:
#Ex7- Read the data from the excel file and create a dataframe
ex7 <- '<add here>'
head(ex7)

In [None]:
#Ex7- Visually compare the North and South Plant weight data. Give some visual conclusions/conjectures.
'<add here>' # Create a boxplot to compare the two plants

In [None]:
#Ex7- Test the hypothesis that the variances are the same for both plants
var.test('<add here>', data=ex7) # Add formula for var.test

**Interpretation:** 'Replace with your comments'

In [None]:
#Ex7- Conduct the hypothesis test that the mean weights are the same
t.test('<add here>') # Remember to specify if variances are equal or not

**Interpretation:** 'Replace with your comments'

---

### Exercise 8
**Business Question:** We want to look at the relationship between velocity and strength in our welding data (Welding.xlsx).
* Plot welding data as a scatterplot
* Fit a regression line to determine if the regression coefficients are statistically significant. Fit "Strength" by "Velocity".
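
To make "statistically significant coefficients" concrete, the sketch below computes the slope, its standard error, and the t-test of a zero slope by hand, then checks them against `lm`. The data are simulated and purely hypothetical (not the Welding.xlsx data), and all variable names are made up for this example.

```r
# Illustrative only: simulated data, not the Welding.xlsx data
set.seed(3)
x <- runif(30, min = 1, max = 10)      # hypothetical predictor
y <- 5 + 2 * x + rnorm(30, sd = 1.5)   # hypothetical response

n  <- length(x)
b1 <- cov(x, y) / var(x)               # least-squares slope
b0 <- mean(y) - b1 * mean(x)           # least-squares intercept
res <- y - (b0 + b1 * x)               # residuals
se_b1 <- sqrt(sum(res^2) / (n - 2)) / sqrt(sum((x - mean(x))^2))  # standard error of the slope
t_b1  <- b1 / se_b1                    # t statistic for H0: slope = 0
p_b1  <- 2 * pt(abs(t_b1), df = n - 2, lower.tail = FALSE)        # two-sided p-value

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|)
fit <- summary(lm(y ~ x))$coefficients

# Stops with an error if the manual values differ from the lm output
stopifnot(isTRUE(all.equal(unname(fit["x", ]), c(b1, se_b1, t_b1, p_b1))))
```

A small p-value in the slope row is evidence that the response changes with the predictor; the intercept row tests a different hypothesis (that the line passes through zero).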

In [None]:
#Ex8- Read the data from the excel file and create a dataframe
ex8 <- '<add here>'
head(ex8)

In [None]:
#Ex8- Plot welding data as a scatterplot, with fitted line and confidence intervals
ggplot('<add here>') + geom_point() + geom_smooth(method = "lm") # geom_point just adds the points to the graph.
# geom_smooth adds a fitted line with confidence intervals to the data by specifying different methods. Try method='loess' instead and see the difference

In [None]:
#Ex8- Fit a regression line to determine if the regression coefficients are statistically significant.
mod <- lm('<add here>', data = ex8) #Specify the formula in the form of y ~ x. Creates linear model (called mod) by using the lm function. 
summary(mod) # Summary function used to look at the results of model mod

**Interpretation:** 'Replace with your comments'

---

### Exercise 9
**Business Question:** Machine tolerance data (MachineTol.xlsx) was measured on a part at 5 locations: L1-L5. Conduct an analysis of variance of machine tolerance to determine if the population means associated with locations L1-L5 are the same or are different.
* Plot machine tolerance by location
* Show the results of an ANOVA and provide the conclusion to the hypothesis test
* Conduct a multiple comparison test to determine what pairs might be different using the All Pairs Tukey-Kramer HSD method. What means are different or not?
* Conduct a multiple comparison test to determine if a location exists that exhibits the best (lowest) mean tolerance using the Compare Best, Hsu MCB method. Are any of the means "best"?
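
For reference, the one-way ANOVA F statistic behind the hypothesis test can be sketched as follows. Here $k$ is the number of groups ($k=5$ locations) and $N$ is the total number of measurements; these symbols and the sum-of-squares labels are notation assumed here.
$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$
Under $H_0$ (all location means equal), $F$ follows an F distribution on $(k-1,\ N-k)$ degrees of freedom, so a large $F$ with a small p-value suggests that at least one location mean differs. The ANOVA alone does not say which one, which is why the multiple comparison tests follow.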

In [None]:
#Ex9- Read the data from the excel file and create a dataframe
ex9 <- read_xlsx('../Datasets/MachineTol.xlsx')
head(ex9)

In [None]:
#Ex9- Plot machine tolerance by location
#using boxplot
'<add here>' # Hint: Remember `` notation for columns that have a space in their name

In [None]:
#Ex9- Plot machine tolerance by location
#using violinplot
ggplot(ex9, aes(Location, `Mach Tol`)) + geom_violin(aes(fill=Location), trim=F)

In [None]:
#Ex9- Show the results of an ANOVA and provide the conclusion to the hypothesis test
model_aov <- aov('<add here>') # Provide the formula in the form of y ~ x and specify the data argument as well, as we did in hypothesis testing
'<add here>' # Look at the summary of the anova model created in previous step

In [None]:
#Ex9- Conduct a multiple comparison test to determine what pairs might be different using the All Pairs 
#Tukey-Kramer HSD method. 

tukey_model <- TukeyHSD(model_aov) # Performs Tukey HSD test. Requires an aov object as input
print('<add here>') # Print results of Tukey HSD
plot('<add here>') # Plot results of Tukey HSD

In [None]:
#Ex9- Use JMP to find the best location with Hsu's MCB method