# MATH 3350 Course Notes - Module S1
# Exploratory Data Analysis: The Basics

R has several built-in data sets that can be used to explore and demonstrate different concepts.  

We'll examine one below.


In [None]:
head(ToothGrowth)   # See the first few rows of ToothGrowth data set
?ToothGrowth        # Bring up R documentation that explains this data set

## What to explore...
### 1. Types of Variables
Quantitative vs. Categorical  
_Note: In R, categorical variables are called 'factors'_  
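
One way to check how R has stored each variable is sketched below. `str()` and `sapply()` are base R functions; the name `var_classes` is only an illustrative choice.

```r
# Show the structure of the data frame: one line per variable with its type
str(ToothGrowth)

# Storage class of each column; a "factor" column is categorical
var_classes <- sapply(ToothGrowth, class)
var_classes
```

A column reported as `numeric` is stored as a quantitative variable, although the documentation may still describe it as a small set of categories.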



### 2. Size of Data Set
How many rows?  How many columns?

In [None]:
nrow(ToothGrowth)
ncol(ToothGrowth)
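
As a small alternative sketch, `dim()` reports both counts in a single call. The variable name `tg_dim` is just for illustration.

```r
# dim() returns a vector: c(number of rows, number of columns)
tg_dim <- dim(ToothGrowth)
tg_dim
```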

### 3. Variable Summaries

The type of summary depends on the type of variable: `summary()` reports quartiles and the mean for a numeric variable, and counts per level for a factor.

In [None]:
summary(ToothGrowth$len)

In [None]:
summary(ToothGrowth$supp)

In [None]:
summary(ToothGrowth$dose)
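
Because `dose` is stored as a number, `summary()` treats it as quantitative. A rough way to see how few distinct values it actually takes is to tabulate it, as sketched here (`dose_counts` is an illustrative name):

```r
# Count how many observations fall at each distinct dose value
dose_counts <- table(ToothGrowth$dose)
dose_counts
```

If only a handful of values appear, each repeated many times, it may be more natural to treat the variable as categorical.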

#### Let's examine a different data set
What types of variables do you see?

In [None]:
head(mtcars)
?mtcars

Notice that the variables `vs` and `am` are stored with data type 'double', but the documentation describes them as categorical variables coded 0 and 1.

In [None]:
summary(mtcars$am)

In [None]:
summary(as.factor(mtcars$am))
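
The counts above are labeled only 0 and 1. A possible refinement, assuming the coding given in `?mtcars` (0 = automatic, 1 = manual), is to attach readable labels. The names `cars` and `am_f` are made up for this sketch.

```r
# Work on a copy so the built-in mtcars data set is left unchanged
cars <- mtcars

# Convert the 0/1 codes to a factor with descriptive labels
cars$am_f <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# summary() of a factor gives the count in each category
am_counts <- summary(cars$am_f)
am_counts
```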

### Numeric Descriptions of a Quantitative Variable
* 5-number summary
* Mean and standard deviation

In [None]:
summary(mtcars$wt)
mean(mtcars$wt)
sd(mtcars$wt)

#### Important Distinction: Population Parameters vs. Sample Statistics

These data are for a particular group of cars; the mean and standard deviation above are _sample statistics_, which are sometimes used as **_estimates_** of population parameters. The true values of population parameters are often unknown.
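
Because `sd()` computes a _sample_ standard deviation, it divides by $n - 1$ rather than $n$. The following sketch checks that by hand; `manual_sd` is an illustrative name.

```r
x <- mtcars$wt
n <- length(x)

# Sample standard deviation: squared deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))
manual_sd

# Compare with R's built-in function (allowing for rounding error)
all.equal(manual_sd, sd(x))
```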

### Describing the Distribution of a Quantitative Variable
* Shape: Symmetry or Skew
* Shape: Uniform, Peaks - unimodal, bimodal, others
* Center: Mean and Median
* Spread: Range, IQR, standard deviation
* Outliers
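
The center and spread measures listed above can each be computed with a base R function. This is a minimal sketch for car weight; the vectors `center` and `spread` are only a convenient way to print the results together.

```r
x <- mtcars$wt

# Measures of center
center <- c(mean = mean(x), median = median(x))

# Measures of spread: range (max - min), IQR (Q3 - Q1), standard deviation
spread <- c(range = diff(range(x)), IQR = IQR(x), sd = sd(x))

center
spread
```

Shape and outliers are usually easier to judge from a plot than from these numbers.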

In [None]:
# Try this histogram with different numbers of breaks (use breaks=)

hist(mtcars$wt, main="Distribution of Car Weights (1000 lbs)")
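
As one way to try this, the sketch below draws the same histogram with two different `breaks=` values side by side. The specific values 4 and 12 are arbitrary choices for illustration.

```r
# breaks= is treated as a suggestion; R picks "pretty" cut points near that number
par(mfrow = c(1, 2))   # two plots side by side
hist(mtcars$wt, breaks = 4, main = "About 4 breaks", xlab = "Weight (1000 lbs)")
hist(mtcars$wt, breaks = 12, main = "About 12 breaks", xlab = "Weight (1000 lbs)")
par(mfrow = c(1, 1))   # restore the single-plot layout
```

The apparent shape of the distribution can change noticeably with the number of bins, so it is worth looking at more than one setting.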


In [None]:
# Look at the R documentation to learn how this command works and see additional options
?hist

### Box plots

In [None]:
boxplot(mtcars$wt, main="Car Weights (1000 lbs)")

By default, box plots in R are vertical. It can often be useful to display them horizontally.

In [None]:
boxplot(mtcars$wt, main="Car Weights (1000 lbs)", horizontal=TRUE)

### What you can (and cannot) learn from a boxplot
A box plot is a visual representation of the 5-number summary.  It does not show the mean or standard deviation.
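
If you want the mean on the plot as well, it has to be added separately. One possible approach, sketched here, uses `points()` to overlay it; the marker style and color are arbitrary.

```r
boxplot(mtcars$wt, main = "Car Weights (1000 lbs)", horizontal = TRUE)

wt_mean <- mean(mtcars$wt)

# A single horizontal box is drawn at position 1, so the mean goes at (mean, 1)
points(wt_mean, 1, pch = 18, col = "blue", cex = 1.5)
```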

#### Outliers
Box plots can show outliers, as the plot above does (see the points at the right end of the plot).  

We can also verify mathematically whether the data set contains outliers, using the 1.5 IQR rule (also called 'fencing').

**IQR** is _Inter-Quartile Range_, representing the distance between Q1 and Q3.

$IQR = Q3 - Q1 = 3.61 - 2.581 = 1.029$  
$1.5 \times IQR = 1.5 \times 1.029 = 1.5435$

Any value more than 1.5435 below Q1 or more than 1.5435 above Q3 is classified as an outlier.  
Lower boundary: $Q1 - 1.5435 = 2.581 - 1.5435 = 1.0375$  
Upper boundary: $Q3 + 1.5435 = 3.61 + 1.5435 = 5.1535$

From the summary, we can see that the minimum value is 1.513.  Since 1.513 is not below 1.0375, there are no low outliers.

From the summary, we can see that the maximum value is 5.424.  Since 5.424 is above 5.1535, there is at least one high outlier. This is consistent with the boxplot, which appears to show 2 outliers.
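
The same calculation can be sketched in R. The names `lower_fence`, `upper_fence`, and `flagged` are illustrative. Note that `boxplot()` builds its box from "hinges" (see `?fivenum`), which can differ slightly from the quartiles that `summary()` and `quantile()` report, so the number of points drawn as outliers need not match the count from the hand calculation.

```r
wt <- mtcars$wt

# Quartiles as reported by summary() / quantile()
q1 <- quantile(wt, 0.25, names = FALSE)
q3 <- quantile(wt, 0.75, names = FALSE)

# 1.5 IQR rule
step <- 1.5 * (q3 - q1)
lower_fence <- q1 - step
upper_fence <- q3 + step
c(lower = lower_fence, upper = upper_fence)

# Values flagged by the hand calculation
flagged <- wt[wt < lower_fence | wt > upper_fence]
flagged

# Values that boxplot() would draw as individual points
boxplot.stats(wt)$out
```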

**NOTE:** 'Whiskers' do NOT necessarily extend to the "boundary" for outliers.  They extend to the most extreme value **_in the data set_** that is _NOT_ an outlier.
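
To see where the whiskers actually end, one option is `boxplot.stats()`, which returns the five values used to draw the plot. In this sketch, `whisker_ends` is an illustrative name.

```r
bp <- boxplot.stats(mtcars$wt)

# stats holds: lower whisker end, lower hinge, median, upper hinge, upper whisker end
whisker_ends <- bp$stats[c(1, 5)]
whisker_ends

# Each whisker end is an actual observation in the data set
whisker_ends %in% mtcars$wt
```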

### Stacking box plots

Box plots are useful for comparing how a variable is distributed when grouped by another variable.

In [None]:
boxplot(mtcars$wt ~ mtcars$am, main="Car Weights by Transmission Type", 
        xlab="Weight (1000 lbs)", ylab="Transmission: Automatic(0) or Manual(1)", 
        horizontal=TRUE)
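
The grouped box plot can be paired with grouped numeric summaries. A minimal sketch using `tapply()`, which applies a function to one variable within each level of another (`wt_by_am` and `group_medians` are illustrative names):

```r
# 5-number summary (plus mean) of weight within each transmission type
wt_by_am <- tapply(mtcars$wt, mtcars$am, summary)
wt_by_am

# Just the medians, for a quick comparison of centers
group_medians <- tapply(mtcars$wt, mtcars$am, median)
group_medians
```

The results are labeled "0" and "1", matching the coding of `am`.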


### Density Plot
The distribution can also be viewed as a density plot.

In [None]:
plot(density(mtcars$wt), main="Density Plot of Car Weights", xlab="Weight (1000 lbs)")
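
To compare the two views directly, the density curve can be drawn over a histogram. This sketch assumes the histogram is plotted on the density scale (`freq = FALSE`) so that both share the same vertical axis; part of the curve may extend above the tallest bar.

```r
# Histogram on the density scale (total bar area = 1) rather than counts
hist(mtcars$wt, freq = FALSE, main = "Car Weights: Histogram with Density Curve",
     xlab = "Weight (1000 lbs)")

# Overlay the kernel density estimate
wt_density <- density(mtcars$wt)
lines(wt_density, lwd = 2)
```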

### Scatter Plot of 2 Quantitative Variables
A scatter plot can help visualize possible relationships between 2 quantitative variables.  

In [None]:
plot(mtcars$wt,mtcars$mpg,main="Mileage vs Car Weight", xlab="Weight (1000 lbs)", ylab="Miles per Gallon")
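
A numeric companion to the scatter plot, not covered elsewhere in these notes, is the correlation coefficient from `cor()`. It is offered here only as a rough sketch: it measures the strength and direction of a _linear_ relationship and always lies between -1 and 1.

```r
# Correlation between weight and mileage; the sign gives the direction of the trend
r <- cor(mtcars$wt, mtcars$mpg)
r
```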

Again, a third variable may be part of any relationship that exists. We can visualize that by grouping.

In [None]:
# Color and plotting symbol both depend on transmission type (am)
plot(mtcars$wt,mtcars$mpg,col=(as.factor(mtcars$am)),main="Mileage vs Car Weight by Transmission Type", 
     xlab="Weight (1000 lbs)", ylab="Miles per Gallon", pch=(mtcars$am+16))

# Add legend to plot; col = 1:2 uses the same palette colors as the two factor levels
legend("topright", pch = c(16,17), legend = c('Automatic','Manual'), col = 1:2)
