# MATH 3350 Course Notes - Module S1
# Exploratory Data Analysis: The Basics

R has several built-in data sets that can be used to explore and demonstrate different concepts.  

We'll examine one below.


In [None]:
head(ToothGrowth)   # See the first few rows of ToothGrowth data set
?ToothGrowth        # Bring up R documentation that explains this data set

# What to explore...

## 1. Types of Variables
Quantitative vs. Categorical  
_Note: In R, categorical variables are typically stored as 'factors'_  
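
One way to check how R has stored each variable is sketched below. It uses only base R functions (`str()`, `sapply()`, `class()`, `is.factor()`); the names `var_classes` and `factor_cols` are introduced here purely for illustration. Whether a variable should be treated as categorical is still a judgment call informed by the documentation.

```r
# Compact overview: one line per variable with its storage type
str(ToothGrowth)

# Class of each column: "numeric" suggests quantitative, "factor" is categorical
var_classes <- sapply(ToothGrowth, class)
print(var_classes)

# Names of the columns that are stored as factors
factor_cols <- names(ToothGrowth)[sapply(ToothGrowth, is.factor)]
print(factor_cols)
```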



## 2. Size of Data Set
How many rows?  How many columns?

In [None]:
nrow(ToothGrowth)
ncol(ToothGrowth)
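
As a small alternative sketch, `dim()` returns both counts at once. The variable name `size` is only an illustrative choice.

```r
# dim() returns c(number of rows, number of columns)
size <- dim(ToothGrowth)
print(size)
cat("Rows:", size[1], " Columns:", size[2], "\n")
```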

## 3. Variable Summaries

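The type of summary depends on the type of variable: for a numeric variable, `summary()` returns the 5-number summary plus the mean; for a factor, it returns a count for each level.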

In [None]:
summary(ToothGrowth$len)

In [None]:
summary(ToothGrowth$supp)

In [None]:
summary(ToothGrowth$dose)
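
The sketch below shows two related tools, assuming only base R. `table()` counts how often each value occurs, which can be informative for a factor and also for a numeric variable such as `dose` that takes only a few distinct values. Calling `summary()` on the whole data frame chooses a summary for each column based on its type. The names `supp_counts` and `dose_counts` are illustrative.

```r
# Counts per level for the factor supp
supp_counts <- table(ToothGrowth$supp)
print(supp_counts)

# dose is numeric but takes only a few distinct values; table() lists them
dose_counts <- table(ToothGrowth$dose)
print(dose_counts)

# summary() on the data frame summarizes each column according to its type
summary(ToothGrowth)
```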

### Let's examine a different data set
What type of variables do you see?

In [None]:
head(mtcars)
?mtcars

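Notice that the variables `vs` and `am` have data type 'double', but the documentation describes them as categorical variables (coded 0/1). Compare the summary of `am` as a number with its summary after conversion to a factor.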

In [None]:
summary(mtcars$am)

In [None]:
summary(as.factor(mtcars$am))
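
A possible refinement, sketched below, is to attach readable labels when converting. The coding 0 = automatic, 1 = manual comes from `?mtcars`; the variable name `transmission` is an assumption made for this example and does not modify `mtcars` itself.

```r
# am is coded 0 = automatic, 1 = manual (per ?mtcars)
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))

# For a factor, summary() reports a count for each level
summary(transmission)
```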

### Numeric Descriptions of a Quantitative Variable
* 5-number summary
* Mean and standard deviation

In [None]:
summary(mtcars$wt)
mean(mtcars$wt)
sd(mtcars$wt)
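
To see what `mean()` and `sd()` compute, the sketch below reproduces them from their definitions (the sample standard deviation divides by n - 1). The names `mean_by_hand`, `sd_by_hand`, and `five_num` are illustrative only.

```r
wt <- mtcars$wt
n <- length(wt)

# Sample mean: sum of the values divided by the number of values
mean_by_hand <- sum(wt) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
sd_by_hand <- sqrt(sum((wt - mean_by_hand)^2) / (n - 1))

c(mean = mean_by_hand, sd = sd_by_hand)

# quantile() returns min, Q1, median, Q3, max (the 5-number summary)
five_num <- quantile(wt)
print(five_num)
```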

### Important Distinction: Population Parameters vs. Sample Statistics

These data are for a particular group of cars; the mean and standard deviation above are _sample statistics_, which are sometimes used as **_estimates_** of _population parameters_. The true values of population parameters are often unknown.
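
The distinction can be illustrated with a small simulation. This is only a sketch: the population below is made up (normal with mean 3.2 and standard deviation 1), the seed is arbitrary, and all variable names are hypothetical. The population mean is known here only because we generated the population ourselves.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 100,000 values
population <- rnorm(100000, mean = 3.2, sd = 1)
pop_mean <- mean(population)   # a population parameter

# Draw 5 samples of size 32 and record each sample mean (a sample statistic)
sample_means <- replicate(5, mean(sample(population, size = 32)))

print(pop_mean)
print(sample_means)   # each is an estimate of pop_mean and varies by sample
```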

### Describing the Distribution of a Quantitative Variable
* Shape: Symmetry or Skew
* Shape: Uniform, Peaks - unimodal, bimodal, others
* Center: Mean and Median
* Spread: Range, IQR, standard deviation
* Outliers

#### Run the cell below to see examples of different shape characteristics.

In [None]:
set.seed(844)
setA <- rnorm(500,21,2)
setB <- rchisq(500,df=800)
setC <- rchisq(500,df=6)
setD <- -1*setC+30
setE <- runif(5000,min=1,max=6)
setF <- c(rnorm(500,21,2),rnorm(500,32,2))

par(mfrow=c(2,3))   # 2 rows x 3 columns so all 6 histograms fit in one figure
hist(setA,main="Approximately Normal", xlab="Data Set A")
hist(setB,main="Unimodal and Roughly Symmetric", xlab="Data Set B")
hist(setC,main="Unimodal and Skewed RIGHT", xlab="Data Set C")
hist(setD,main="Unimodal and Skewed LEFT", xlab="Data Set D")
hist(setE,main="Approximately Uniform", xlab="Data Set E")
hist(setF,main="Bimodal", xlab="Data Set F")
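
Shape and center are connected: in a skewed distribution, the mean is typically pulled toward the long tail, while the median is less affected. The sketch below checks this numerically on data generated with the same recipes as Data Sets C and D. The variable names and the seed are chosen here for illustration, so the values will not necessarily match the sets plotted above.

```r
set.seed(844)                             # arbitrary seed for reproducibility
right_skewed <- rchisq(500, df = 6)       # same recipe as setC
left_skewed  <- -1 * right_skewed + 30    # same recipe as setD

# Skewed right: the mean tends to exceed the median
c(mean = mean(right_skewed), median = median(right_skewed))

# Skewed left: the mean tends to fall below the median
c(mean = mean(left_skewed), median = median(left_skewed))
```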

### Appearance of Histogram Depends on Bins

The cell below creates a histogram of the _weight_ of the vehicles in the **mtcars** data set.  Look at the histogram and see what you can determine about the shape of the distribution.

Note that the histogram's _bin width_ can change the appearance of the histogram, but the underlying distribution is not changing.  Some bin widths make it easier to see certain patterns. The cell below allows you to change the bin width easily to see the effect on the histogram.

In [None]:
# Try this histogram with different binwidth values (e.g., 1, 0.25, 0.5, 0.6, 0.75, 1.25, 1.5, or others if you choose)

binwidth <- 1
hist(mtcars$wt, breaks=seq(1,6,binwidth), main="Distribution of Car Weights", xlab="Weight (1000 lbs)")
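
To compare several bin widths side by side rather than one at a time, a loop along the following lines could be used. The four widths in `bin_widths` are illustrative choices, not recommendations.

```r
bin_widths <- c(0.25, 0.5, 1, 1.5)   # illustrative bin widths

par(mfrow = c(2, 2))                 # 2 x 2 grid of plots
for (bw in bin_widths) {
  hist(mtcars$wt, breaks = seq(1, 6, by = bw),
       main = paste("Bin width =", bw), xlab = "Weight (1000 lbs)")
}
par(mfrow = c(1, 1))                 # restore the single-plot layout
```

The same 32 weights appear in every panel; only the grouping into bins changes.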


## Box plots

Box plots can help visualize other features of a distribution.

### What a boxplot DOES show:

* minimum value
* maximum value
* median
* quartiles (first and third)
* outliers 
* strong skewness

Note that a box plot is a visual representation of the 5-number summary. Based on the values in that summary, a boxplot can also be used to compute measures of spread:
* range: maximum - minimum
* IQR (interquartile range): Q3 - Q1
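
As a sketch, both measures of spread can be computed from the 5-number summary of the car weights in **mtcars**, and checked against base R's `range()` and `IQR()`. The names `wt_summary`, `wt_range`, and `wt_iqr` are illustrative.

```r
wt_summary <- quantile(mtcars$wt)    # min, Q1, median, Q3, max

# range: maximum - minimum
wt_range <- wt_summary[["100%"]] - wt_summary[["0%"]]

# IQR: Q3 - Q1
wt_iqr <- wt_summary[["75%"]] - wt_summary[["25%"]]

c(range = wt_range, IQR = wt_iqr)

# Built-in equivalents
diff(range(mtcars$wt))
IQR(mtcars$wt)
```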

### What a boxplot does NOT show:

* mean
* standard deviation
* number of peaks
* features of specific distributions (e.g., normal or uniform)

#### Comparing Information in Box Plots and Histograms 
The cell below produces box plots for the same 6 data sets that we viewed above as histograms. The titles remind you of the shape that was visible in the histogram, but which features are evident from the box plot? Which features are **not** evident? 

In [None]:
par(mfrow=c(2,3))
boxplot(setA,main="(Approximately Normal)", xlab="Data Set A")
boxplot(setB,main="(Unimodal & Symmetric)", xlab="Data Set B")
boxplot(setC,main="(Skewed RIGHT)", xlab="Data Set C")
boxplot(setD,main="(Skewed LEFT)", xlab="Data Set D")
boxplot(setE,main="(Approximately Uniform)", xlab="Data Set E")
boxplot(setF,main="(Bimodal)", xlab="Data Set F")
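
To make the comparison concrete for one case, the sketch below draws a histogram and a box plot of the same bimodal data side by side. The data are regenerated with the same recipe as Data Set F; the name `bimodal` and the seed are assumptions made for this example.

```r
set.seed(844)                                         # arbitrary seed
bimodal <- c(rnorm(500, 21, 2), rnorm(500, 32, 2))    # same recipe as setF

par(mfrow = c(1, 2))
hist(bimodal, main = "Histogram", xlab = "Bimodal data")
boxplot(bimodal, horizontal = TRUE, main = "Box plot", xlab = "Bimodal data")
par(mfrow = c(1, 1))                                  # restore single-plot layout
```

The two peaks are visible in the histogram, but the box plot gives no indication of them.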

### Making Box Plots Horizontal
By default, box plots in R are vertical. It is often useful to display them horizontally, which makes the direction of a skew (left versus right) easier to read.

The cell below shows box plots of our 6 sample data sets horizontally.

In [None]:
par(mfrow=c(3,2))
boxplot(setA,horizontal=TRUE,main="(Approximately Normal)", xlab="Data Set A")
boxplot(setB,horizontal=TRUE,main="(Unimodal & Symmetric)", xlab="Data Set B")
boxplot(setC,horizontal=TRUE,main="(Skewed RIGHT)", xlab="Data Set C")
boxplot(setD,horizontal=TRUE,main="(Skewed LEFT)", xlab="Data Set D")
boxplot(setE,horizontal=TRUE,main="(Approximately Uniform)", xlab="Data Set E")
boxplot(setF,horizontal=TRUE,main="(Bimodal)", xlab="Data Set F")

### Revisiting the Distribution of Weights 
The cell below shows a horizontal box plot of the weights of cars in the **mtcars** data set. 

What else can you detect about the distribution based on the box plot?

In [None]:
boxplot(mtcars$wt, main="Car Weights (1000 lbs)", horizontal=TRUE)

### Outliers
Box plots can show outliers, as the plot above does (see the points on the right side of the plot).  

We can also verify mathematically whether the data set contains outliers, using the 1.5 IQR rule (also called 'fencing'). For these calculations, we refer again to the values in the 5-number summary: 

In [None]:
summary(mtcars$wt)

### Using IQR to Compute Boundaries for Outliers

**IQR** is _Inter-Quartile Range_, representing the distance between Q1 and Q3.

$$IQR = 3.61 - 2.581 = 1.029$$

$$1.5(IQR) = 1.5 \times 1.029 = 1.5435$$

The "1.5 IQR" value is 1.5435; this is the distance _**below Q1**_ and _**above Q3**_ for the outlier boundaries.

#### Lower boundary: 
$$Q1 - 1.5435 = 2.581 - 1.5435 = 1.0375$$

#### Upper boundary: 
$$Q3 + 1.5435 = 3.61 + 1.5435 = 5.1535$$

From the summary, we can see that the minimum value is 1.513.  Since 1.513 is not below 1.0375, there are no low outliers.

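From the summary, we can see that the maximum value is 5.424.  Since 5.424 is above 5.1535, there is at least one high outlier. In fact, three weights in the data set (5.250, 5.345, and 5.424) exceed this boundary. The box plot appears to show only 2 outliers because `boxplot()` computes its quartiles as 'hinges' (see `?fivenum`), which differ slightly from the quartiles reported by `summary()`; its upper boundary works out to about 5.311, so 5.250 falls within the whisker.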

**NOTE:** 'Whiskers' do NOT necessarily extend to the "boundary" for outliers.  They extend to the most extreme value **_in the data set_** that is _NOT_ an outlier.
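
The boundary calculation above can be reproduced in code, as in the sketch below. It uses unrounded quartiles from `quantile()`, so the boundaries differ from the hand calculation in the later decimal places. For comparison, `boxplot.stats()` reports the points that `boxplot()` would draw individually; it bases its quartiles on hinges (see `?fivenum`), so its list can differ slightly from the manual rule. All variable names are illustrative.

```r
wt <- mtcars$wt

# Quartiles as reported by summary() (unrounded)
q1 <- quantile(wt, 0.25, names = FALSE)
q3 <- quantile(wt, 0.75, names = FALSE)

# 1.5 IQR rule
fence_dist  <- 1.5 * (q3 - q1)
lower_bound <- q1 - fence_dist
upper_bound <- q3 + fence_dist
c(lower = lower_bound, upper = upper_bound)

# Values beyond either boundary
manual_outliers <- sort(wt[wt < lower_bound | wt > upper_bound])
print(manual_outliers)

# Points that boxplot() draws individually (hinge-based quartiles)
plotted_outliers <- sort(boxplot.stats(wt)$out)
print(plotted_outliers)
```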

### Stacking box plots

Box plots are useful for comparing how a variable is distributed when grouped by another variable.

In [None]:
boxplot(mtcars$wt ~ mtcars$am, main="Car Weights by Transmission Type", 
        xlab="Weight (1000 lbs)", ylab="Transmission: Automatic(0) or Manual(1)", 
        horizontal=TRUE)
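
The visual comparison can be backed up with numbers. The sketch below uses base R's `tapply()` to compute the median and IQR of weight within each transmission group; the names `group_medians` and `group_iqrs` are illustrative.

```r
# Median and IQR of weight within each transmission group (0 = automatic, 1 = manual)
group_medians <- tapply(mtcars$wt, mtcars$am, median)
group_iqrs    <- tapply(mtcars$wt, mtcars$am, IQR)

rbind(median = group_medians, IQR = group_iqrs)
```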


### Density Plot
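The distribution can also be viewed as a density plot: a smoothed curve whose appearance depends on a bandwidth, much as a histogram's appearance depends on its bin width.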

In [None]:
plot(density(mtcars$wt), main="Density Plot of Car Weights", xlab="Weight (1000 lbs)")
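
As a sketch of how the smoothing affects the picture, the code below overlays two density curves on a histogram drawn on the density scale (`freq = FALSE`). The `adjust` argument of `density()` scales the default bandwidth; the value 0.5 and the bin width are arbitrary choices for illustration, and `dens_default` is an illustrative name.

```r
dens_default <- density(mtcars$wt)   # default bandwidth

hist(mtcars$wt, freq = FALSE, breaks = seq(1, 6, 0.5),
     main = "Histogram with Density Curves", xlab = "Weight (1000 lbs)")
lines(dens_default, lwd = 2)                       # default smoothing
lines(density(mtcars$wt, adjust = 0.5), lty = 2)   # half the bandwidth: less smooth
legend("topright", lty = c(1, 2), lwd = c(2, 1),
       legend = c("default", "adjust = 0.5"))
```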

### Scatter Plot of 2 Quantitative Variables
A scatter plot can help visualize possible relationships between 2 quantitative variables.  

In [None]:
plot(mtcars$wt,mtcars$mpg,main="Mileage vs Car Weight", xlab="Weight (1000 lbs)", ylab="Miles per Gallon")

Again, a third variable may be part of any relationship that exists. We can visualize that by giving each group its own color and plotting symbol.

In [None]:
# Color and symbol by transmission: 0 = automatic (black, pch 16), 1 = manual (red, pch 17)
plot(mtcars$wt,mtcars$mpg,col=c("black","red")[mtcars$am+1],main="Mileage vs Car Weight by Transmission Type", 
     xlab="Weight (1000 lbs)", ylab="Miles per Gallon", pch=(mtcars$am+16))

# Add legend to plot
legend("topright", pch = c(16,17), legend = c('Automatic','Manual'), col = c("black","red"))
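
The grouping can also be summarized numerically. The sketch below uses base R's `aggregate()` to compute the mean weight and mean mileage within each transmission group; `group_means` is an illustrative name.

```r
# Mean weight and mean mpg for each transmission group (0 = automatic, 1 = manual)
group_means <- aggregate(cbind(wt, mpg) ~ am, data = mtcars, FUN = mean)
print(group_means)
```

In these data, the automatic cars are heavier on average and have lower average mileage, so transmission type and weight are difficult to separate when looking at mileage.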
