# ANOVA

**Analysis of Variance** (ANOVA) is an umbrella term covering about a dozen related procedures. In this course, we will study 1-way ANOVA, the plain vanilla version.

William Sealy Gosset created the $t$-test. While spectacularly successful, the $t$-test can be used with at most 2 samples. What if our statistical comparison involves 3 or more samples? Suppose we are comparing the *perfectionism* levels at a certain college among:
 
- Freshmen
- Sophomores
- Juniors
- Seniors

The $t$-test by itself will not work, and performing the $\binom{4}{2} = 6$ $t$-tests needed to make all the possible 2-way comparisons will drive the Type I error rate for the overall analysis into the sky.
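To get a feel for how quickly the error rate grows, here is a rough sketch of the arithmetic. It assumes each pairwise test is run at $\alpha = 0.05$ and treats the tests as independent; neither assumption comes from the text, and the independence one is only an approximation because the pairs share groups.

```r
# Familywise Type I error rate when every pair of groups gets its own t-test.
# Assumptions (not from the text): alpha = 0.05 per test, tests independent.
alpha  <- 0.05
groups <- 4

n_tests    <- choose(groups, 2)         # number of pairwise comparisons
familywise <- 1 - (1 - alpha)^n_tests   # P(at least one false positive)

n_tests                # 6
round(familywise, 3)   # 0.265
```

Under these assumptions, the chance of at least one false positive across the whole analysis is roughly 26%, far above the 5% we intended.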

Step up, Ronald Fisher, who extended the $t$-test mathematically so that it could be used with 3 or more groups. The statistic is $F$, but the process came to be called ANOVA.
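
For reference, in the usual 1-way setting the $F$ statistic is a ratio of two variance estimates. The symbols below are standard notation rather than notation taken from this text: $k$ is the number of groups, $N$ is the total number of observations, and $SS$ and $MS$ denote sums of squares and mean squares.

$$
F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
$$

A large $F$ means the group means vary more than we would expect from the spread within the groups alone.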

## Getting Started

Let's load some data to be used with our comparisons:

In [39]:
pers <- read.csv('https://faculty.ung.edu/rsinn/data/personality.csv')
parks <- read.csv('https://faculty.ung.edu/rsinn/data/nationalparks.csv')
head(parks,5)

| | title | acres | area_km_sq | visitors | state | lat | long | established | description | image.url | link | X | X.1 | X.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` | `<chr>` |
| 1 | Acadia | 49057.36 | 198.5 | 3303393 | Maine | 44.35 | -68.21 | 2/26/1919 | Covering most of Mount Desert Island and other coastal islands, Acadia features the tallest mountain on the Atlantic coast of the United States, granite peaks, ocean shoreline, woodlands, and lakes. There are freshwater, estuary, forest, and intertidal habitats. | acadia.jpg | https://www.nps.gov/acad/index.htm | | | |
| 2 | American Samoa | 8256.67 | 33.4 | 28892 | American Samoa | -14.25 | -170.68 | 10/31/1988 | The southernmost National Park is on three Samoan islands and protects coral reefs, rainforests, volcanic mountains, and white beaches. The area is also home to flying foxes, brown boobies, sea turtles, and 900 species of fish. | american-samoa.jpg | https://www.nps.gov/npsa/index.htm | | | |
| 3 | Arches | 76678.98 | 310.3 | 1585718 | Utah | 38.68 | -109.57 | 11/12/1971 | This site features more than 2,000 natural sandstone arches, with some of the most popular arches in the park being Delicate Arch, Landscape Arch and Double Arch. Millions of years of erosion have created these structures located in a desert climate where the arid ground has life-sustaining biological soil crusts and potholes that serve as natural water-collecting basins. Other geologic formations include stone pinnacles, fins, and balancing rocks. | arches.jpg | https://www.nps.gov/arch/index.htm | | | |
| 4 | Badlands | 242755.94 | 982.4 | 996263 | South Dakota | 43.75 | -102.5 | 11/10/1978 | The Badlands are a collection of buttes, pinnacles, spires, and mixed-grass prairies. The White River Badlands contain the largest assemblage of known late Eocene and Oligocene mammal fossils. The wildlife includes bison, bighorn sheep, black-footed ferrets, and prairie dogs. | badlands.jpg | https://www.nps.gov/badl/index.htm | | | |
| 5 | Big Bend | 801163.21 | 3242.2 | 388290 | Texas | 29.25 | -103.25 | 6/12/1944 | Named for the prominent bend in the Rio Grande along the U.S.–Mexico border, this park encompasses a large and remote part of the Chihuahuan Desert. Its main attraction is backcountry recreation in the arid Chisos Mountains and in canyons along the river. A wide variety of Cretaceous and Tertiary fossils as well as cultural artifacts of Native Americans also exist within its borders. | big-bend.jpg | https://www.nps.gov/bibe/index.htm | | | |


## Example 1: National Parks

**Which Region of the U.S. Has National Parks with the Largest Number of Annual Visitors?**

Let's use some creative subsetting for the **parks** data. We want three regions:

1. East -- east of the Mississippi River (-89.5 deg longitude)
2. Northwest -- west of the Mississippi River and north of Denver (40 deg latitude)
3. Southwest -- west of the Mississippi River and south of Denver (40 deg latitude)

We show below the subsetting for the **Southwest** parks. Note that column 1 has the titles of the parks while column 4 lists the average annual visitors to that park.



In [41]:
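# Southwest: west of the Mississippi (long <= -89.5) and south of Denver (lat < 40)
southwest <- subset(parks, long <= -89.5 & lat < 40)
nrow(southwest)        # number of Southwest parks
southwest[,1]          # column 1: park titles
sw <- southwest[,4]    # column 4: annual visitors
sw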
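
The same cutoffs can be used to label every park with its region in one pass, which is the shape the ANOVA needs. The following is only a sketch: `label_region` and `region_anova` are hypothetical helper names, and the choice to put parks exactly on the 40 deg line in the Northwest is an assumption.

```r
# Sketch: label each park with one of the three regions, then compare visitors.
# `label_region` and `region_anova` are hypothetical helpers, not from the source.
label_region <- function(long, lat) {
  region <- ifelse(long > -89.5, "East",
            ifelse(lat >= 40, "Northwest", "Southwest"))
  factor(region, levels = c("East", "Northwest", "Southwest"))
}

region_anova <- function(parks) {
  parks$region <- label_region(parks$long, parks$lat)
  aov(visitors ~ region, data = parks)   # 1-way ANOVA: visitors by region
}

# Usage with the data loaded earlier:
# fit <- region_anova(parks)
# summary(fit)   # F statistic and p-value for the region effect
```

Keeping the region as a single factor column avoids juggling three separate vectors and lets `aov` handle the grouping.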

## Example 2: Primary Humor Style

**Does Humor Style have an Impact upon Self-Esteem?**

In the personality data set, the grouping variable **PHS** provides the *Primary Humor Style* for that participant. The numeric variable **SE** gives us a measure of self-esteem for each participant.
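
One way this comparison might be set up is sketched below. `humor_anova` is a hypothetical helper name, and wrapping **PHS** in `factor()` is an assumption: the text does not say how the humor styles are coded, and `factor()` makes them behave as group labels either way.

```r
# Sketch: 1-way ANOVA of self-esteem (SE) across primary humor styles (PHS).
# `humor_anova` is a hypothetical helper; the coding of PHS is an assumption.
humor_anova <- function(pers) {
  pers$PHS <- factor(pers$PHS)   # treat the style codes as group labels
  aov(SE ~ PHS, data = pers)
}

# Usage with the data loaded earlier:
# fit <- humor_anova(pers)
# summary(fit)                      # overall F test
# tapply(pers$SE, pers$PHS, mean)   # group means for context
```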