# Lab 2: Descriptive Statistics and Visualization
In this lab we will look at different ways to explore our sample data and get a feel for the distribution of our variables.  

The lab will start with a brief rundown of functions for common descriptive statistics.  The bulk of the lab will focus on using ggplot2 to create a variety of charts and graphs that will be relevant as we move forward in this course.

- [Descriptive Statistics](#desc)
    - [Univariate](#uni)
    - [Bivariate](#bistat)
- [Visualizations](#viz)
    - [Histogram](#hist)
    - [Bar Chart](#bar)
    - [Boxplot](#box)
    - [Grouped Boxplot](#gbox)
    - [Grouped Means Plot](#gmean)
    - [Scatterplot](#scatter)

In [None]:
#load packages
library(tidyverse) # includes ggplot2
## magrittr is installed as a part of tidyverse, but not loaded unless loaded explicitly
library(magrittr) # so I can use the assignment pipe operator ( %<>% )
## install.packages("ggpubr")
library(ggpubr) # contains line/dot plot for visualizing means
## install.packages("descr")
library(descr) ## for "pretty" two-way table CrossTable()


In this lab I'm going to use polling data collected by the creators of the card game Cards Against Humanity - https://thepulseofthenation.com/#the-poll.  There is a mix of serious and silly questions. This particular poll is from September 2017.

Because they have labeled the variables with the full questions, the first thing I'm going to do is rename the variables with shorter names for use in the code.  Never use variable names that are full sentences with spaces; it's just an invitation for trouble (and a lot of typing).

In [None]:
#load data
cah_poll <- read_csv('201709-CAH_PulseOfTheNation_Raw.csv')
head(cah_poll)

In [None]:
#view colnames
colnames(cah_poll)

In [None]:
#rename columns
new_names <- c("income", "gender", "age", "age_cat", "polaffil", "apptrump", "educ", "race", "marital", "robots",
              "climate", "transformers", "sci_good", "vaccines", "books", "ghosts", "fedbudget", "fedfundsci", "earthsun",
              "smartdumb", "urinate")
colnames(cah_poll) <- new_names
glimpse(cah_poll)

### Data Cleaning
I'm going to run through some quick data cleaning.  I will include the code here for your reference, but not explain it in detail.  You can refer to the Lab 1 notebook for explanations of these functions in more detail.

In [None]:
## summary for numerical variables
summary(select_if(cah_poll, is.numeric))

In [None]:
## so I'm not losing observations I'm going to impute the numerical variables.  
## For most I will use median imputation, but for transformers movies I'm going to impute 0.

# note, I'm overwriting my df and my variables here. If I make a mistake I'll need to reload my data!
cah_poll  %<>% ## assignment pipe!
    mutate(income = if_else(is.na(income), median(income, na.rm = TRUE), income),
          transformers = if_else(is.na(transformers), 0, transformers),
          books = if_else(is.na(books), median(books, na.rm = TRUE), books),
          fedbudget = if_else(is.na(fedbudget), median(fedbudget, na.rm = TRUE), fedbudget)) 


# if_else() is a helpful compact way to use an if statement inside mutate (although you can technically use it other places)
# if_else(logical statement, value if TRUE, value if FALSE)
# in this case if the variable is NA I'm assigning it the median (or 0 for transformers), otherwise I'll just use the existing value of that variable
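
To see what if_else() is doing in that mutate call, here is a small sketch on a made-up tibble (the name `toy` and its values are invented for illustration and are not from the poll data). The missing values should be replaced by the median of the observed values, and everything else left alone.

```r
library(dplyr)

# made-up data, NOT the poll data
toy <- tibble(income = c(10, NA, 30, 50, NA))

toy <- toy %>%
    mutate(income_imputed = if_else(is.na(income),                 # logical test
                                    median(income, na.rm = TRUE),  # value if TRUE
                                    income))                       # value if FALSE

toy$income_imputed
```

Note that I wrote the result to a new column here instead of overwriting `income`, which makes it easy to compare the before and after while you are checking your work.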

In [None]:
# make all chr variables factors
cah_poll %<>% mutate_if(is.character, as.factor) ## mutate_if also works as an if statement
summary(select_if(cah_poll, is.factor))


In [None]:
# only 2 DK/REF on gender, so I'm going to subset the df to not include those obs
cah_poll %<>% filter(gender != "DK/REF")
## remember even if you filter out observations based on a factor level, that factor level will persist unless dropped
cah_poll$gender <- droplevels(cah_poll$gender) 
summary(cah_poll$gender)
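
If you want to convince yourself that filtered-out levels really do stick around, here is a minimal base R sketch on a made-up factor (the values are invented, not taken from the poll).

```r
# made-up factor, NOT the poll data
gender <- factor(c("Female", "Male", "DK/REF", "Male", "Female"))

# drop the DK/REF observation
kept <- gender[gender != "DK/REF"]

levels_before <- levels(kept)  # "DK/REF" is still listed as a level
table(kept)                    # ...and shows up with a count of 0

kept <- droplevels(kept)
levels_after <- levels(kept)   # now only the levels that are actually used

levels_before
levels_after
```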

<a id="desc"></a>
## Descriptive Statistics
I'll briefly run through first univariate then bivariate descriptive statistics.  Descriptive statistics describe our sample, but are not meant for inference to the population (that is the job of inferential statistics).

<a id="uni"></a>
## Univariate
### Categorical Variables
For categorical variables, we typically look at frequencies and/or percentages to get a feel for the distribution of our variable.  You've already seen a number of ways to do this: table(), summary(), and count().  Here I'm going to use summarize() to create a table that has both frequencies and percentages.

In [None]:
## frequency table - do you believe in ghosts?
cah_poll  %>% mutate(ghosts = fct_infreq(ghosts)) %>% ## using fct_infreq to order levels by frequency for the purposes of the table
              group_by(ghosts)  %>% 
              summarize(frequency = n(),
                        percentage = n()/dim(cah_poll)[1]*100) ## use dim to get number of rows in df overall (the first dim is rows)
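
For comparison, the same kind of frequency/percentage table can be sketched in base R with table() and prop.table(). The vector below is made up for illustration (it is not the poll data), and `freq_table` is just a name I picked.

```r
# made-up answers, NOT the poll data
ghosts <- c("Yes", "No", "No", "Yes", "No", "DK/REF", "No", "Yes")

freq <- table(ghosts)            # counts per answer
pct  <- prop.table(freq) * 100   # proportions converted to percentages

freq_table <- data.frame(answer     = names(freq),
                         frequency  = as.vector(freq),
                         percentage = as.vector(pct))

# sort from most to least frequent, similar in spirit to fct_infreq()
freq_table[order(-freq_table$frequency), ]
```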

### Numerical Variables
We learned in the last lab that we can get a summary of the major descriptive statistics (min, 25th percentile, median, mean, 75th percentile, and max) using summary().

In [None]:
summary(cah_poll$income)

There are also individual functions that may come in handy:

In [None]:
#mean
mean(cah_poll$income)

In [None]:
#median
median(cah_poll$income)

In [None]:
# there is no built-in function for the statistical mode (base R's mode() returns an object's storage type), so we define our own
get_mode <- function(v) {
    unique_value <- unique(v)
    unique_value[which.max(tabulate(match(v, unique_value)))]
}
get_mode(cah_poll$income)
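
One thing worth knowing about this helper is what it does with ties. Here is a quick sketch on made-up vectors (the function is repeated so the cell runs on its own). Because which.max() returns the first maximum, a tie goes to whichever tied value appears first in the data.

```r
# same helper as above, repeated so this cell is self-contained
get_mode <- function(v) {
    unique_value <- unique(v)
    unique_value[which.max(tabulate(match(v, unique_value)))]
}

# made-up vectors, NOT the poll data
mode_tie  <- get_mode(c(1, 2, 2, 3, 3))   # 2 and 3 both appear twice; 2 appears first
mode_char <- get_mode(c("a", "b", "b"))   # works for character vectors too

mode_tie
mode_char
```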

In [None]:
#variance
var(cah_poll$income)

In [None]:
# standard deviation
sd(cah_poll$income)

In [None]:
#range/min/max
min(cah_poll$income)
max(cah_poll$income)
range(cah_poll$income)

#if I want to see max without scientific notation
format(max(cah_poll$income), scientific=F) # note, this returns a string, not a number

In [None]:
#percentiles
# summary prints 25th, 50th (median), and 75th by default
# you can get any percentile you want with the quantile function
quantile(cah_poll$income, c(.10, .20, .25, .32, .57, .75, .98)) 

In [None]:
#IQR - interquartile range
IQR(cah_poll$income) ## the IQR of income is zero because 25th and 75th percentile are identical
print("---------------")
IQR(cah_poll$books)
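
The IQR is just the 75th percentile minus the 25th percentile. A small sketch on made-up vectors (not the poll data) shows this, along with how the IQR can come out as zero when most of the values are identical.

```r
# made-up vectors, NOT the poll data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

q <- quantile(x, c(.25, .75))
iqr_by_hand <- unname(q[2] - q[1])   # 75th percentile minus 25th percentile
iqr_builtin <- IQR(x)

# when the middle of the distribution is all one value, Q1 == Q3 and the IQR is 0
lumpy <- c(1, 5, 5, 5, 5, 5, 100)
iqr_lumpy <- IQR(lumpy)

iqr_by_hand
iqr_builtin
iqr_lumpy
```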

<a id="bistat"></a>
## Bivariate Statistics 
Bivariate Statistics are statistics that include two variables.  Multivariate statistics include two or more variables.

### Categorical
Similar to the frequency table above, we can obtain frequency tables that reflect the intersection of two categorical variables.

I'm going to create a two-way table of the two questions:
'Do you agree or disagree with the following statement: scientists are generally honest and are serving the public good.' 
'Do you agree or disagree with the following statement: vaccines are safe and protect children from disease.'

In [None]:
table(cah_poll$sci_good, cah_poll$vaccines)

The long factor labels make the table a bit hard to read.

In [None]:
## use prop.table to get proportions instead of frequencies - multiply by 100 to get percentages
prop.table(table(cah_poll$sci_good, cah_poll$vaccines))*100
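
By default prop.table() divides by the grand total. If you want row or column percentages instead, you can pass the `margin` argument (1 for rows, 2 for columns). Here is a sketch on two made-up answer vectors (not the poll data):

```r
# made-up answers, NOT the poll data
sci <- c("Agree", "Agree", "Disagree", "Agree", "Disagree", "Disagree", "Agree", "Agree")
vax <- c("Agree", "Agree", "Agree", "Disagree", "Disagree", "Disagree", "Agree", "Agree")

tab <- table(sci, vax)

row_pct <- prop.table(tab, margin = 1) * 100   # each row sums to 100
col_pct <- prop.table(tab, margin = 2) * 100   # each column sums to 100

row_pct
col_pct
```

Row percentages answer "of the people who gave this answer on sci, what share gave each answer on vax?", and column percentages answer the reverse question.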

In [None]:
## as.data.frame.matrix() lets us make the table a data.frame
as.data.frame.matrix(table(cah_poll$sci_good, cah_poll$vaccines)) 

We can use the package "descr" if we want a "prettier" table with percentages (row %, col % and overall %)

In [None]:
## quickly reorder factor levels for better table
cah_poll %<>% mutate(sci_good = fct_relevel(sci_good, "Strongly Disagree", "Somewhat Disagree", "Neither Agree nor Disagree",
                                          "Somewhat Agree", "Strongly Agree"),
                   vaccines = fct_relevel(vaccines, "Strongly Disagree", "Somewhat Disagree", "Neither Agree nor Disagree",
                                          "Somewhat Agree", "Strongly Agree")) 

In [None]:
CrossTable(cah_poll$sci_good, cah_poll$vaccines, prop.chisq = FALSE, prop.r = FALSE, prop.c = FALSE)
## I've used prop.r = FALSE and prop.c = FALSE to turn off row and col proportions, but you can include those if you wish

### Grouped Statistics
Often we want to look at numerical descriptive statistics (mean, median) but within groups of a categorical variable.  We saw an example of this on the first homework.

In [None]:
# let's look at a summary of age by whether the respondent (R) thinks it's ok to urinate in the shower
cah_poll  %>% 
    mutate(urinate = fct_infreq(urinate)) %>% ## ordering factor by frequency for this table
    group_by(urinate)  %>% 
    summarize(freq = n(),
              mean_age = mean(age), 
              med_age = median(age),
              stddev_age = sd(age))
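
One caution with grouped summaries: mean(), median(), and sd() return NA if the variable has any missing values in that group. Here is a sketch on a made-up tibble (the names `toy` and `toy_summary` and the values are invented, not from the poll) showing one way to handle that with na.rm = TRUE while also counting how many values were missing.

```r
library(dplyr)

# made-up data, NOT the poll data
toy <- tibble(group = c("Yes", "Yes", "No", "No", "No"),
              age   = c(20, NA, 30, 40, 50))

toy_summary <- toy %>%
    group_by(group) %>%
    summarize(freq      = n(),                      # rows in the group, including NA ages
              mean_age  = mean(age, na.rm = TRUE),  # mean of the non-missing ages
              n_missing = sum(is.na(age)))          # how many ages were missing

toy_summary
```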

<a id="viz"></a>
## Visualizations
Now we'll move onto making charts and graphs, mostly with ggplot2.

<a id="hist"></a>
### Histogram
A histogram is a graphical representation of the distribution of a numerical variable.  The values of the variable are plotted on the x-axis and the bars/y-axis represent the frequency of those observations.

The most basic way to make a histogram in R is with base R

In [None]:
hist(cah_poll$income)

This is not particularly attractive, and it is difficult to customize and make publication quality.  We can do better with ggplot2, even with just the default settings.

In [None]:
options(repr.plot.width=6, repr.plot.height=5) ## plot size options for Jupyter notebook ONLY

cah_poll  %>% ggplot(aes(x=income))  +
            geom_histogram()

## nameofdf  %>% ggplot() creates the graph with the data from that df
## we use  %>% to pass the data to ggplot() but once we call ggplot we use + to add our customizations
## aes() is where we define our aesthetics, these are the things we want included, such as x and y variables
## geom_ describes the geometry we want to use (the shape we want our data to be arranged in)
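
To make the "build it up with +" idea concrete, here is a small sketch on a made-up data frame (`toy` and its values are invented, not the poll data). It also uses `binwidth`, an alternative to `bins` where you set how wide each bar is instead of how many bars there are.

```r
library(ggplot2)

# made-up incomes in $1000s, NOT the poll data
toy <- data.frame(income_in_k = c(12, 25, 31, 38, 40, 42, 47, 55, 61, 90, 150))

# step 1: the base plot just holds the data and the aesthetics (nothing is drawn yet)
p <- ggplot(toy, aes(x = income_in_k))

# step 2: add a geometry and labels with +
p_hist <- p +
    geom_histogram(binwidth = 20) +   # each bar covers a $20k range
    labs(x = "Income in $1000s", y = "Frequency", title = "Histogram of made-up incomes")

p_hist
```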

In [None]:
# I can customize the number of bins like this
cah_poll  %>% mutate(income_in_k = income/1000)  %>% ## divide income by 1000 so that axis ticks are not unreadable
    ggplot(aes(x=income_in_k))  + ## create the ggplot, define our aesthetics, which are variables/data that will make our graph
        geom_histogram(bins = 20) + ## use the "geometry" or shape histogram
        labs(x = "Income in $1000s", y = "Frequency", title = "Histogram of Income") ## relabel axes and add overall title

#### Skewness and Kurtosis
We can visually inspect our histograms for evidence of skewness and kurtosis.  We can also use a density plot which smooths the distribution into a curve.

In [None]:
d <- density(cah_poll$income) # returns the density data 
plot(d)

We can see that the distribution of income is positively skewed (the bulk of the distribution is on the left side, with a long tail to the right).  It also appears to have positive kurtosis (a taller, narrower peak and heavier tails than a normal distribution).
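
If you want numbers to go with the visual inspection, skewness and excess kurtosis can be computed from the central moments. This is a minimal sketch using the simple moment-based formulas on made-up vectors (not the poll data); add-on packages may apply small-sample corrections, so their values can differ slightly from these.

```r
# moment-based skewness: third central moment / (second central moment)^1.5
skewness <- function(x) {
    m <- mean(x)
    mean((x - m)^3) / mean((x - m)^2)^1.5
}

# excess kurtosis: fourth central moment / (second central moment)^2, minus 3
# (a normal distribution has kurtosis 3, so its excess kurtosis is 0)
excess_kurtosis <- function(x) {
    m <- mean(x)
    mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# made-up vectors, NOT the poll data
right_skewed <- c(1, 1, 2, 2, 3, 10)   # long tail to the right
symmetric    <- c(1, 2, 3, 4, 5)

skewness(right_skewed)   # positive for a right-skewed vector
skewness(symmetric)      # zero for a symmetric vector
excess_kurtosis(right_skewed)
```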


<a id="bar"></a>
### Bar Chart
Histograms visually show us the frequencies of values of a numerical variable, but what if we wanted to graphically depict the frequencies in each level of a categorical variable?  We would use a bar chart.

In [None]:
cah_poll  %>% ggplot(aes(x = apptrump)) +
                geom_bar()            

This gives us a basic, but boring, graph that needs to be edited to clean up the format, make sure our labels aren't overlapping, etc.


In [None]:
options(repr.plot.width=9, repr.plot.height=6) ## plot size options for Jupyter notebook ONLY
## we can save our plot in progress 
p <- cah_poll  %>% mutate(apptrump = fct_relevel(apptrump, "DK/REF", "Strongly disapprove", "Somewhat disapprove", "Neither approve nor disapprove",
                                          "Somewhat Approve", "Strongly Approve")) %>% ## reorder from disapprove to approve
                ggplot(aes(x = apptrump, fill = apptrump )) + ## using fill inside aes will fill the bars based on the variable specified
                geom_bar() +
                theme(legend.position = "none") + #remove legend that is automatically created when we use fill
                coord_flip() + ## flip the chart from vert to horiz so that the labels are readable
                labs(x = "", title = "Approval of Trump")  ## even though we flipped it, apptrump is still the x variable.
                        ## blanking the x label so it doesn't appear

# set up some text formatting
bold.14.text <- element_text(face = "bold", size = 14) ## define a text style
p + theme(text = bold.14.text) # take saved plot that we called p and add more options - defined text formatting/style

<a id="box"></a>
### Boxplot
A boxplot is another way to visually look at the distribution of a numerical variable.  The box is based on the median and the IQR, and does not depict the mean.

In [None]:
cah_poll  %>% ggplot(aes(y = age)) + ## our numerical variable is y, not x.
                geom_boxplot()    
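
To see the numbers behind a boxplot, here is a sketch using base R's boxplot.stats() on a made-up vector (not the poll data). It returns the five values that define the box and whiskers, plus any points flagged as outliers (more than 1.5 times the box height beyond the box).

```r
# made-up vector with one extreme value, NOT the poll data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 50)

bs <- boxplot.stats(x)

bs$stats   # lower whisker, lower hinge (~Q1), median, upper hinge (~Q3), upper whisker
bs$out     # points beyond the whiskers, drawn as individual dots

mean(x)    # the mean is pulled toward the extreme value; the box ignores it
```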

<a id="gbox"></a>
### Grouped Boxplot
More interestingly, we can show a boxplot of our numerical variable for each level of a categorical variable.  The categorical variable is listed as the x variable.

In [None]:
## again customizing it and cleaning it up as well
bold.14.text <- element_text(face = "bold", size = 14) ## define a text style
cah_poll  %>% mutate(urinate = fct_infreq(urinate)) %>%
            ggplot(aes(y = age, x = urinate, fill = urinate)) + 
                ## our numerical variable is y, not x. x is the categorical variable
                ## we use fill to color our boxes.
                geom_boxplot() +
                theme(legend.position = "none", #remove legend that is automatically created when we use fill
                      text = bold.14.text) + ## theme chart with defined text style
                labs(title = "Distribution of Age by acceptability of urinating in the shower",
                    x = "Is it acceptable to urinate in the shower?") +
                scale_fill_manual(values=c("#FF33CC", "#33FF99", "#660099")) ## specify our own fill colors with hex codes

## get your own hex color codes at https://htmlcolorcodes.com/

When our goal is to compare means with a t-test or ANOVA, a boxplot gives us a feel for the distribution, but it shows the median and not the mean.  We can instead plot our points plus the mean.

<a id="gmean"></a>
### Grouped Means Plot
I don't know if it has a real name, but I call this the grouped means plot.  We get points that represent all of our observations on our numerical variable grouped by levels of our categorical variable, along with the mean of each group (and error bars that reflect the uncertainty around the mean).

In [None]:
cah_poll %>% mutate(urinate = fct_infreq(urinate)) %>%
            ggline(x = "urinate", y = "age", ## define variables
                   add = c("mean_se", "jitter"),  ## add mean and error bars
                   ## jitter separates the points horizontally so that they are not all overlapping each other in one line
                   add.params = list(color="urinate"), ## use categorical variable to color points
                   ylab = "Age",  #y label
                   xlab = "Acceptable to urinate in shower?")  #x label


In [None]:
## because ggline is built on top of ggplot2 we can add our ggplot2 customizations if we want
bold.14.text <- element_text(face = "bold", size = 14)
cah_poll %>% mutate(urinate = fct_infreq(urinate)) %>%
            ggline(x = "urinate", y = "age", ## define variables
                   add = c("mean_se", "jitter"),  ## add mean and error bars
                   add.params = list(color="urinate"), ## use categorical variable to color points
                   ylab = "Age",  #y label
                   xlab = "Acceptable to urinate in shower?") + #x label
            theme(legend.position = "none", #remove legend that is automatically created when we use param color
                      text = bold.14.text) +
            scale_color_manual(values=c("#FF33CC", "#33FF99", "#660099"))  ## specify our own colors with hex codes
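
The "mean_se" option draws the group mean with error bars of plus or minus one standard error. As a rough sketch of what that means for a single group, here is the calculation by hand on a made-up vector of ages (not the poll data), assuming the usual definition of the standard error as sd / sqrt(n).

```r
# made-up ages for one group, NOT the poll data
age <- c(22, 24, 24, 24, 25, 25, 27, 29)

group_mean <- mean(age)
group_se   <- sd(age) / sqrt(length(age))   # standard error of the mean

# the ends of the error bar
c(lower = group_mean - group_se, mean = group_mean, upper = group_mean + group_se)
```

Because the standard error shrinks as the group gets larger, bigger groups get shorter error bars even when the spread of the points looks similar.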

Finally we'll look at scatterplots, which are a graphical way to look at the joint distribution of two numerical variables.

<a id="scatter"></a>
### Scatterplot
Let's look at the correlation between the respondent's (R's) estimate of the percentage of the federal budget spent on science and the respondent's age.

In [None]:
## basic scatterplot
cah_poll  %>% ggplot(aes(x = fedbudget, y = age)) + ## right now it doesn't matter which is x and which is y
                geom_point()    

It looks like these two variables are not closely correlated.  However, we can still attempt to add a "best fit line."

In [None]:
cah_poll  %>% ggplot(aes(x = fedbudget, y = age)) + 
                geom_point() +
                geom_smooth(method = lm, se = FALSE)
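
The line that geom_smooth() draws with the lm method is an ordinary least squares line. Here is a sketch on made-up vectors (not the poll data) of how to get the correlation and the slope of that kind of line as numbers. The correlation is unit-free, while the slope depends on the units of x and y, so a flat-looking line and a weak correlation are related but not the same thing.

```r
# made-up vectors, NOT the poll data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

r   <- cor(x, y)     # Pearson correlation, between -1 and 1
fit <- lm(y ~ x)     # least squares "best fit" line
slope <- unname(coef(fit)["x"])

r
slope
```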

It looks like there really is no correlation (the best fit line is nearly flat).
We could use some of the options we used previously.  Here I'm going to add a categorical variable, which will color the dots by group.

In [None]:
cah_poll  %>% ggplot(aes(x = fedbudget, y = age, color = gender)) + 
                geom_point() +
                geom_smooth(method = lm, se = FALSE)

Now, not only are the dots colored by gender, there are 3 best fit lines, one for each gender.

Note - this graph is not publication quality (PQ) - I haven't adjusted the labels or added a title.

## YOUR TURN!
Create one graph (of any type shown here) to visualize variable(s) in the cah_poll dataset that we have not yet looked at.  You can run glimpse() to remind you what variables are included in the df.