<div class="alert alert-block alert-danger">

# More Practice with Descriptive Statistics (COMPLETE)

    
</div>

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

# Adjust scientific notation
options(scipen = 999)
# Makes smaller plots
options(repr.plot.width=4, repr.plot.height=3)


### 1.0 - Describing Variation

Descriptive statistics, also called *summary statistics*, do just that: they describe and summarize a pattern of variation in data in simplified ways!

They are the core building blocks of statistical inference, and they are very useful for communicating about variation and for thinking about the data generating process (DGP).

These descriptions or summaries can come from visual representations of the data (e.g., scatterplots, histograms, tables, etc.) or from numeric summaries of the data (e.g., mean, standard deviation).

**The main things we are usually trying to describe about the variation in our data are:**

1. Shape
2. Center
3. Spread
4. Outliers

Let's get some more practice describing variation in data--visually and numerically!

We're going to look at the data frame called `StudentSurvey`. It contains the data from an in-class survey given to introductory statistics students over several years.

It has 362 observations on the following 17 variables:

- `Year` Year in school: `FirstYear`, `Sophomore`, `Junior`, or `Senior`
- `Gender` Student's gender: `F` or `M`
- `Smoke` Smoker? `No` or `Yes`
- `Award` Preferred award: `Academy`, `Nobel` or `Olympic`
- `HigherSAT` Which SAT is higher? `Math` or `Verbal`
- `Exercise` Hours of exercise per week
- `TV` Hours of TV viewing per week
- `Height` Height (in inches)
- `Weight` Weight (in pounds)
- `Siblings` Number of siblings
- `BirthOrder` Birth order, 1=oldest
- `VerbalSAT` Verbal SAT score
- `MathSAT` Math SAT score
- `SAT` Combined Verbal + Math SAT
- `GPA` College grade point average
- `Pulse` Pulse rate (beats per minute)
- `Piercings` Number of body piercings

**1.1:** Take a look at the data. Which variable(s) are you most interested in?

In [None]:
head(StudentSurvey)

<div class="alert alert-block alert-warning">

**Sample Response**

*Responses will vary*

</div>

### 2.0 - Visual Summaries: What is the shape of your data and why should we care?

When it comes to creating visualizations of your data, there are lots of options, including (but not limited to):

- histograms (counts)
- density histograms (relative frequency)
- faceted histograms
- using the arguments `fill = ~` or `color = ~` to distinguish groups or cases
- boxplots
- jitter plots
- scatterplots
- tables (e.g., `tally()` output)
- bar plots
- pie charts*

*\*NOTE: instead of a pie chart, we recommend using a bar plot, because bar plots are easier to interpret and make it easier to compare across groups. See [this article](https://www.statmethods.net/graphs/pie.html) for additional information.*

Many of these plots can even be combined in different ways. We can also use arguments to make adjustments, such as:

- color preference
- number of bins (`bins`), bin width (`binwidth`) (histograms)
- density curve (density histograms)
- transparency, point size (jitter/scatterplots)
- adding predictions/models with vertical/horizontal lines or regression lines
- adding predictions as jitter plot points
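
As a rough sketch of how a few of these adjustments might look in code, the cell below uses `ggformula` functions loaded by `coursekata`. The variable choices (`GPA`, `Year`, `Weight`, `Height`), the color, and the specific argument values are illustrative assumptions, not recommended settings.

```r
library(coursekata)  # loads the gf_ plotting functions and StudentSurvey

# Histogram with a chosen fill color and bin width
p_hist <- gf_histogram(~GPA, data = StudentSurvey,
                       fill = "steelblue", binwidth = 0.25)

# Density histogram with a density curve layered on top
p_dens <- gf_dhistogram(~GPA, data = StudentSurvey) %>%
    gf_density()

# Jitter plot with transparency, point size, and limited horizontal jitter
p_jitter <- gf_jitter(GPA ~ Year, data = StudentSurvey,
                      alpha = 0.5, size = 2, width = 0.1)

# Scatterplot with a regression line added
p_line <- gf_point(Weight ~ Height, data = StudentSurvey) %>%
    gf_lm()

# Print any of the plots to display it
p_hist
```

Try changing one argument at a time (for example `binwidth` or `alpha`) to see what each one controls.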

**Challenge**

Try to create a visualization that depicts all of the following at one time:  

- One outcome variable
- Two explanatory variables

***Bonus:***

- Try to combine boxplots with any histograms or jitter plots you made
- Try to add the lines for the model

In [None]:
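# Sample Answer (SAT = Gender + Smoke + Error)

# Fit the model first so that gf_model() has a model to draw
SAT_model <- lm(SAT ~ Gender + Smoke, data = StudentSurvey)

# Histograms of SAT, filled by Smoke and faceted by Gender,
# with boxplots and the model predictions overlaid
# (assumes gf_model() can map both predictors onto this plot)
gf_histogram(~SAT, data = StudentSurvey, fill = ~Smoke) %>%
    gf_facet_grid(Gender ~ .) %>%
    gf_boxplot(width = 4) %>%
    gf_model(SAT_model)

# Boxplots of SAT by Gender, with jittered points colored by Smoke
gf_boxplot(SAT ~ Gender, data = StudentSurvey) %>%
    gf_jitter(color = ~Smoke)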

**Using the Appropriate Viz**

Each type of visualization will be better suited for different needs and for different types of variables. 

As with most things in statistics, there are two important things to consider when deciding how to visualize or summarize your data:

**1. Number of Variables:**

- univariate distributions (a single variable)
- two or more variables (bivariate or multivariate distributions)

**2. Level of Measurement for Outcome/Explanatory Variables:**

- categorical variables
- quantitative variables


**Describing the Shape of a Distribution**

For instance, the way we describe the shape of a distribution depends heavily on the two factors listed above.


**Challenge Questions**

- When can we describe the shape of a distribution as: 

> - skewed, normal, unimodal, bimodal, or uniform?
> - linear, curvilinear, or nonlinear?

- How do we describe the shape of categorical variables?

<div class="alert alert-block alert-warning">

**Sample Responses**

- A quantitative variable can be described as skewed, normal, uniform, bimodal, etc.
- A distribution of two quantitative variables can be described as linear, curvilinear, or nonlinear.
- We cannot describe the shape of categorical variables because the ordering of the categories is arbitrary, but we can describe central tendency using the mode.

</div>
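
To make the idea of using the mode for a categorical variable concrete, here is a minimal base R sketch. The `award` vector is made-up data (not taken from `StudentSurvey`), and `mode_award` is just an illustrative name.

```r
# Hypothetical award preferences for 10 students (made-up data)
award <- c("Olympic", "Nobel", "Olympic", "Academy", "Olympic",
           "Nobel", "Olympic", "Academy", "Nobel", "Olympic")

counts <- table(award)        # raw count for each category
props <- prop.table(counts)   # proportion for each category

# The mode is the category with the largest count
mode_award <- names(counts)[which.max(counts)]

counts
props
mode_award
```

Reordering the categories would change how a bar plot looks, but it would not change the counts, the proportions, or the mode.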

***Bonus***

- When should we use tables?

- How might the shape of a distribution impact our use and interpretation of numeric summaries (e.g., mean vs median, or standard deviation vs IQR)?

<div class="alert alert-block alert-warning">

**Sample Response**

- Tables can help us summarize our data in several ways.
- They are useful for showing counts for categorical variables
- They are useful for comparing raw counts and proportions


- In a skewed distribution, the mean is pulled toward the tail, so the mean and the median can differ and the median may better represent a typical value
- The shape also affects measures of spread: SD is sensitive to extreme values, while IQR is much less affected by them
- The whole shape of the distribution needs to be taken into consideration when interpreting center and spread
</div>
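
One way to see how shape can change numeric summaries is to compare two small vectors that differ only in their largest value. This is a sketch using made-up numbers; `symmetric` and `skewed` are illustrative names.

```r
# Two hypothetical samples: identical except for the largest value
symmetric <- c(2, 3, 4, 5, 6, 7, 8)
skewed <- c(2, 3, 4, 5, 6, 7, 40)

# Center: the extreme value moves the mean but not the median
center <- data.frame(
    sample = c("symmetric", "skewed"),
    mean = c(mean(symmetric), mean(skewed)),
    median = c(median(symmetric), median(skewed))
)

# Spread: the extreme value inflates the SD but not the IQR
spread <- data.frame(
    sample = c("symmetric", "skewed"),
    sd = c(sd(symmetric), sd(skewed)),
    iqr = c(IQR(symmetric), IQR(skewed))
)

center
spread
```

In this example the median and IQR are identical for both samples, while the mean and SD are both larger for the skewed one.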

**Challenge**

- Create a small data frame (n = 12) that would produce a uniform distribution (visualize it to prove it).
- Create a small data frame (n = 12) that would produce a bimodal distribution (visualize it to prove it).

In [None]:
# Complete Version

# Sample 1 - Uniform Distribution
uniform <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
DF1 <- data.frame(uniform)
gf_histogram(~uniform, data = DF1, bins = 4)

# Sample 2 - Bimodal Distribution
bimodal <- c(1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4)
DF2 <- data.frame(bimodal)
gf_histogram(~bimodal, data = DF2, bins = 4)

### 3.0 - Numeric Summaries: One number to represent them all

While we can use visualizations to help us describe the shape of our distributions, we can also use some numeric descriptives to help us capture the other aspects of variation that we care about:

- Center
- Spread
- Outliers

And just like our visualizations, we have to pay attention to the **number of variables** and the **level of measurement** that we are using to ensure valid interpretations.

There are various ways to measure central tendency, spread, and outliers. Below are a few common ways. 

**Center:**
- mean
- regression line (intercept and slope)
- median
- mode
- midrange*

*\*NOTE: the midrange is the mean of just two values, the minimum and the maximum. Add the minimum and the maximum, then divide by 2.*
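
As a small illustration of the midrange, here is a sketch in base R. The helper `midrange()` and the `hours` vector are hypothetical and are not part of `coursekata`.

```r
# Hypothetical helper: midrange = (minimum + maximum) / 2
midrange <- function(x) {
    (min(x, na.rm = TRUE) + max(x, na.rm = TRUE)) / 2
}

# Made-up data with one large value
hours <- c(0, 2, 3, 4, 5, 6, 20)

midrange(hours)  # uses only the two most extreme values
mean(hours)      # uses every value
median(hours)    # uses only the middle of the ordered values
```

Because the midrange depends only on the minimum and maximum, a single extreme value can move it a long way from the mean and the median.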

**Challenge**

Find the midrange for `Exercise` and compare it to the mean and the median. Are they similar? Why do you think that is? What would need to change to make them more similar? Which one seems like a "better" representation of the central tendency of the data? Why?

In [None]:
favstats(~Exercise, data = StudentSurvey)
(0 + 40) / 2

<div class="alert alert-block alert-warning">

**Sample Response**

- The midrange is much higher (20) than the mean (9.05) or the median (8).
- This is because it is literally the middle of the range: it takes only the min and the max into account and finds the point halfway between them. It doesn't care about the rest of the distribution.
- If the distribution were less skewed and did not have outliers, the midrange would be closer to the mean and the median.
- The best measure of central tendency will depend on your purposes, but the mean is still the single number that minimizes the sum of squared errors.

</div>

How good our measure of central tendency is depends on various things, but one thing that will be important is how spread out the data is.

**Spread:**
- standard deviation
- variance
- IQR
- range
- 5-number summary (min, Q1, median, Q3, max)
- skewness (how positively or negatively skewed it is, starting from normal--no skew)
- kurtosis (how pointy or flat the peak of the curve is)
- r (Pearson's r)

**Tip:** Boxplots are great for quickly visualizing the 5-number summary.
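
Several of the spread measures listed above can be computed with base R functions. The sketch below uses a made-up vector `x`. Note that `quantile()` is one of several conventions for computing quartiles, so other tools may report slightly different values for Q1 and Q3.

```r
# Hypothetical data
x <- c(4, 8, 15, 16, 23, 42)

sd_x <- sd(x)                # standard deviation
var_x <- var(x)              # variance (the SD squared)
iqr_x <- IQR(x)              # interquartile range (Q3 - Q1)
range_x <- diff(range(x))    # range as a single number (max - min)
five_num <- quantile(x)      # min, Q1, median, Q3, max

sd_x
var_x
iqr_x
range_x
five_num
```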

We also like to be aware of any "weird things" or outliers in the data. These may be mistakes, or they may be real reflections of the DGP. 

**Outliers:**

There is more than one way to define an outlier, and how you treat outliers (e.g., keep or filter) should be decided case by case, depending on your purposes.

One common way to define outliers is:
- Any data point bigger than $Q3 + (1.5 \times IQR)$ is considered a high outlier
- Any data point smaller than $Q1 - (1.5 \times IQR)$ is considered a low outlier

**Tip:** You can also use boxplots to quickly visually identify any outliers (they appear as dots beyond the whiskers of the boxplot!).
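
The 1.5 × IQR rule described above can be sketched in a few lines of base R. The vector `x` is made-up data, and the names `lower_fence` and `upper_fence` are illustrative.

```r
# Hypothetical data with one unusually large value
x <- c(2, 3, 4, 5, 6, 7, 40)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Cutoffs for low and high outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Keep only the values beyond either cutoff
outliers <- x[x < lower_fence | x > upper_fence]

lower_fence
upper_fence
outliers
```

Whether a flagged value should be kept or filtered out is still a judgment call that depends on what you know about the data.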

**Challenge Questions**

1. Standard deviation is a measure of spread around which measure of center? 
2. Interquartile Range (IQR) is a measure of spread around which measure of center?
3. When are the mean, median, and mode usually the same?

<div class="alert alert-block alert-warning">

**Sample Responses**

1. Standard deviation is a measure of spread around which measure of center? 
> Around the mean.
2. Interquartile Range (IQR) is a measure of spread around which measure of center?
> Around the median.
3. When are the mean, median, and mode usually the same?
> When the distribution is symmetric and unimodal (for example, a normal distribution).

</div>

### 4.0 - Explore and Describe some Variation

Explore the data and make a few different visualizations. Try to look at single variables (univariate distributions) as well as combinations of variables and levels of measurement (categorical vs quantitative). For each exploration, try to describe shape, center, spread, and outliers using the various tools we have discussed.