# Lab 2: Descriptive Statistics and Visualization
In this lab we will look at different ways to explore our sample data and get a feel for the distribution of our variables.  

The lab will start with a brief rundown of functions for common descriptive statistics.  The bulk of the lab will focus on using ggplot2 to create a variety of charts and graphs that will be relevant as we move forward in this course.

- [Descriptive Statistics](#desc)
    - [Univariate](#uni)
    - [Bivariate](#bistat)
- [Visualizations](#viz)
    - [Histogram](#hist)
    - [Bar Chart](#bar)
    - [Boxplot](#box)
    - [Grouped Boxplot](#gbox)
    - [Grouped Means Plot](#gmean)
    - [Scatterplot](#scatter)
    - [Basic Customization](#cust)

In [None]:
#load packages
library(tidyverse) # includes ggplot2
library(magrittr) # so I can use the assignment pipe operator ( %<>% )
library(sjPlot)
##ggpubr?
##GGally?

In this lab I'm going to use polling data collected by the creators of the card game Cards Against Humanity - https://thepulseofthenation.com/#the-poll.  There is a mix of serious and silly questions. This particular poll is from September 2017.

Because they have labeled the variables with the full questions, the first thing I'm going to do is relabel the variables with shorter names for use in the code.  Never use variable names that are full sentences with spaces; it's just an invitation for trouble (and a lot of typing).

In [None]:
#load data

cah_poll <- read_csv('201709-CAH_PulseOfTheNation_Raw.csv')
head(cah_poll)

In [None]:
#view colnames
colnames(cah_poll)

In [None]:
#rename columns
new_names <- c("income", "gender", "age", "age_cat", "polaffil", "apptrump", "educ", "race", "marital", "robots",
              "climate", "transformers", "sci_good", "vaccines", "books", "ghosts", "fedbudget", "fedfundsci", "earthsun",
              "smartdumb", "urinate")
colnames(cah_poll) <- new_names
glimpse(cah_poll)

### Data Cleaning
I'm going to run through some quick data cleaning.  I will include the code here for your reference, but not explain it in detail.  You can refer to the Lab 1 notebook for explanations of these functions in more detail.

In [None]:
## summary for numerical variables
summary(select_if(cah_poll, is.numeric))

In [None]:
## So that I don't lose observations, I'm going to impute the missing numerical values.
## For most variables I will use median imputation, but for the Transformers movies question I'm going to impute 0.

# Note: I'm overwriting my df and my variables here. If I make a mistake, I'll need to reload my data!
cah_poll  %<>% ## assignment pipe!
    mutate(income = if_else(is.na(income), median(income, na.rm = TRUE), income),
          transformers = if_else(is.na(transformers), 0, transformers),
          books = if_else(is.na(books), median(books, na.rm = TRUE), books),
          fedbudget = if_else(is.na(fedbudget), median(fedbudget, na.rm = TRUE), fedbudget)) 


# if_else() is a compact way to write an if/else condition inside mutate() (you can also use it elsewhere)
# if_else(condition, value if TRUE, value if FALSE)
# in this case, if the variable is NA I assign it the median (or 0 for transformers); otherwise I keep the existing value

In [None]:
# make all chr variables factors
cah_poll %<>% mutate_if(is.character, as.factor)
summary(select_if(cah_poll, is.factor))


In [None]:
# only 2 DK/REF on gender, so I'm going to subset the df to not include those obs
cah_poll %<>% filter(gender != "DK/REF")
cah_poll$gender <- droplevels(cah_poll$gender)
summary(cah_poll$gender)
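
To see why the `droplevels()` step is needed after filtering, here is a minimal sketch in base R. The `gender` vector below is made-up toy data, not the poll, and the object names are only for illustration.

```r
# Toy factor standing in for a survey variable (values are made up)
gender <- factor(c("Female", "Male", "DK/REF", "Female", "Male"))

# Subsetting removes the observations, but the unused level stays registered
kept <- gender[gender != "DK/REF"]
levels(kept)   # still lists "DK/REF"
table(kept)    # shows a zero count for "DK/REF"

# droplevels() removes levels that no longer have any observations
kept_clean <- droplevels(kept)
levels(kept_clean)
```

Without this step, the empty level would keep showing up in tables and plot legends.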

<a id="desc"></a>
## Descriptive Statistics
I'll briefly run through univariate and then bivariate descriptive statistics.  Descriptive statistics describe our sample; they are not meant for inference to the population (that is the job of inferential statistics).

<a id="uni"></a>
## Univariate
### Categorical Variables
For categorical variables, we typically look at frequencies and/or percentages to get a feel for the distribution of our variable.  You've already seen several ways to do this: table(), summary(), and count().  Here I'm going to use summarize() to create a table that has both frequencies and percentages.

In [None]:
## frequency table - do you believe in ghosts?
cah_poll  %>% mutate(ghosts = fct_infreq(ghosts)) %>% 
              group_by(ghosts)  %>% 
              summarize(frequency = n(),
                        percentage = n() / nrow(cah_poll) * 100) # nrow() gives the total number of observations
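
For comparison, the same kind of frequency and percentage table can be sketched in base R with `table()` and `prop.table()`. The `ghosts` vector here is toy data invented for the example, so the numbers will not match the poll.

```r
# Toy responses (made up for illustration)
ghosts <- c("Yes", "No", "No", "Yes", "No", "DK/REF", "No", "Yes")

# Counts per category, most frequent first
freq <- sort(table(ghosts), decreasing = TRUE)

# prop.table() converts counts to proportions; multiply by 100 for percentages
pct <- prop.table(freq) * 100

# Combine into a single data frame with both columns
freq_table <- data.frame(
    response   = names(freq),
    frequency  = as.vector(freq),
    percentage = as.vector(pct)
)
freq_table
```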

### Numerical Variables
We learned in the last lab that we can get a summary of the major descriptive statistics (min, 25th percentile, median, mean, 75th percentile, and max) using summary().

In [None]:
summary(cah_poll$income)

There are also individual functions that may come in handy:

In [None]:
#mean
mean(cah_poll$income)

In [None]:
#median
median(cah_poll$income)

In [None]:
# there is no built-in function for the statistical mode (base R's mode() returns an object's storage type), but we can define our own
get_mode <- function(v) {
    unique_value <- unique(v)
    unique_value[which.max(tabulate(match(v, unique_value)))]
}
get_mode(cah_poll$income)
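
One detail worth knowing about this helper is how it behaves when values tie. The sketch below repeats the function so it runs on its own and applies it to small made-up vectors.

```r
# Same helper as above, repeated so this example is self-contained
get_mode <- function(v) {
    unique_value <- unique(v)
    unique_value[which.max(tabulate(match(v, unique_value)))]
}

single_mode <- get_mode(c(2, 7, 7, 3, 2, 7))    # 7 appears most often
tied_mode   <- get_mode(c(3, 1, 3, 2, 1))       # 3 and 1 tie
char_mode   <- get_mode(c("b", "a", "a", "b"))  # also works on character vectors

single_mode
tied_mode
char_mode
```

Because `which.max()` returns the first maximum, a tie is resolved in favor of whichever tied value appears first in the vector. The function returns only one value even when the data are multimodal.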

In [None]:
#variance
var(cah_poll$income)

In [None]:
# standard deviation
sd(cah_poll$income)
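
As a quick check on what these two functions compute: `var()` and `sd()` use the sample formulas, with n - 1 in the denominator. Here is a minimal sketch on a made-up vector.

```r
# Toy data (made up for illustration)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

# Compare with the built-in functions
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```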

In [None]:
#range/min/max
min(cah_poll$income)
max(cah_poll$income)
range(cah_poll$income)

# if I want to see max without scientific notation
format(max(cah_poll$income), scientific = FALSE) # note: this returns a string, not a number

In [None]:
#percentiles
# summary prints 25th, 50th (median), and 75th by default
# you can get any percentile you want with the quantile function
quantile(cah_poll$income, c(.10, .20, .25, .32, .57, .75, .98)) 
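
When a requested percentile falls between two observations, `quantile()` by default (`type = 7`) interpolates linearly between them. A small sketch on made-up values shows this.

```r
# Toy data (made up for illustration)
x <- c(10, 20, 30, 40, 50)

# 25th and 50th percentiles land exactly on observations;
# the 10th falls between 10 and 20 and is interpolated
q <- quantile(x, c(.10, .25, .50))
q

# The result is a named vector; unname() strips the percent labels
unname(q)
```

Other interpolation rules are available through the `type` argument, so results can differ slightly from other software on small samples.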

In [None]:
#IQR - interquartile range
IQR(cah_poll$income) ## the IQR of income is zero because 25th and 75th percentile are identical
print("---------------")
IQR(cah_poll$books)
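
An IQR of zero can look like a bug, but it just means the middle half of the data sits on a single value. Here is a sketch with made-up vectors (not the poll data) that reproduces the effect.

```r
# Most observations share one value, so the 25th and 75th percentiles coincide
clumped <- c(1, 5, 5, 5, 5, 5, 5, 5, 9)
quantile(clumped, c(.25, .75))
iqr_clumped <- IQR(clumped)

# A more spread-out vector for comparison
spread <- 1:9
iqr_spread <- IQR(spread)

# IQR() is the difference between the 75th and 25th percentiles
iqr_by_hand <- unname(diff(quantile(spread, c(.25, .75))))

c(clumped = iqr_clumped, spread = iqr_spread, by_hand = iqr_by_hand)
```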

<a id="bistat"></a>
## Bivariate Statistics 
Bivariate statistics are statistics that involve two variables.  Multivariate statistics involve two or more variables.

### Categorical
Similar to the frequency table above, we can obtain frequency tables that reflect the intersection of two variables.

I'm going to create a two-way table of these two questions:

- 'Do you agree or disagree with the following statement: scientists are generally honest and are serving the public good.'
- 'Do you agree or disagree with the following statement: vaccines are safe and protect children from disease.'

In [None]:
table(cah_poll$sci_good, cah_poll$vaccines)

The long factor labels make the table a bit hard to read.

In [None]:
## as.data.frame.matrix() lets us make the table more attractive
as.data.frame.matrix(table(cah_poll$sci_good, cah_poll$vaccines)) 
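
Raw counts in a two-way table are often easier to interpret with totals and percentages alongside them. This is a minimal base R sketch using `addmargins()` and `prop.table()`. The two vectors are toy data with shortened, made-up labels, not the actual poll responses.

```r
# Toy responses (made up; the real factor labels are longer)
sci_good <- c("Agree", "Agree", "Disagree", "Agree", "Disagree", "Agree")
vaccines <- c("Agree", "Agree", "Disagree", "Disagree", "Disagree", "Agree")

# Two-way table of counts: rows are sci_good, columns are vaccines
tab <- table(sci_good, vaccines)

# Add row and column totals
addmargins(tab)

# Row percentages: within each sci_good answer, how did people answer vaccines?
row_pct <- prop.table(tab, margin = 1) * 100
round(row_pct, 1)
```

Using `margin = 2` instead would give column percentages, and omitting `margin` gives each cell as a share of the whole table.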

In [None]:
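library(stargazer) # stargazer is not loaded at the top of the lab, so load it here
sci_vax_tab <- as.data.frame.matrix(table(cah_poll$sci_good, cah_poll$vaccines)) # avoid the name t, which masks base R's t()
stargazer(sci_vax_tab, type = "text", summary = FALSE) # summary = FALSE prints the table itself, not summary statistics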

In [None]:
# notes: topics for this section
# - correlation
# - two-way table
# - multi-way table
# - grouped statistics

In [None]:
# notes: topics for the visualization section
# - histogram (skewness, kurtosis)
# - bar chart
# - boxplot
# - grouped means
# - scatterplot, with line
