# Lab 2.5: Creating Publication Quality (PQ) Output
In this lab we will build on our previous lab, in which we looked at descriptive statistics and visualizations, and look specifically at how to generate these tables and graphs as PQ output or convert them to it.  For your projects this semester, all charts, graphs, and tables will need to be PQ.  

- [Descriptive Statistics](#desc)
    - [Frequency Tables - Categorical Variables](#freq)
        - [One-way Table](#freq)
        - [Two-way Table](#twoway)
    - [Summary Statistics - Numerical Variables](#numsum)
- [Visualizations - ggplot2](#viz)

In [None]:
#load packages
library(tidyverse) # includes ggplot2
library(magrittr) # so I can use the assignment pipe  %<>% 
library(ggpubr) # contains line/dot plot for visualizing means

#install.packages("flextable") 
library(flextable) ## generates PQ frequency/summary tables in .doc or picture format

#install.packages("sjPlot")
library(sjPlot) ## PQ two-way/contingency table

#install.packages("stargazer")
library(stargazer) # PQ numerical summary stats

## installations required to save flextables and the HTML tables from sjPlot/stargazer as imgs
## (ggplot2 objects are saved with ggsave() and don't need these)
#install.packages("webshot")
#webshot::install_phantomjs()
library(webshot)


In this lab I'm going to again use polling data collected by the creators of the card game Cards Against Humanity - https://thepulseofthenation.com/#the-poll.  There is a mix of serious and silly questions. This particular poll is from September 2017.

See Lab 2 for data cleaning steps.  I've saved that cleaned df as an .rds file that I'm going to directly load here.

In [None]:
#load data
cah_poll <- readRDS('cahpoll9.rds')

#inspect df to make sure it's loaded correctly
glimpse(cah_poll)
head(cah_poll)

<a id="desc"></a>
## Descriptive Statistics 
We'll look at ways to create PQ tables to display basic descriptive statistics of our variables.

We'll look at frequency tables for categorical variables (both one-way and two-way).  Two-way frequency tables are often called cross tabs or contingency tables.

For numerical variables we'll cover PQ tables that list summary statistics (mean, median, quartiles, min, max, etc.) for your numerical variable(s) of interest.

<a id="freq"></a>
### Categorical Variables - Frequency Tables
We'll start with a basic one-way frequency table.  There are a number of ways we can create this; using count() or table() is most common.
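
As a quick illustration of those two functions, here is a minimal sketch on a small made-up data frame.  The `toy` df and its `answer` column are hypothetical and are not part of the poll data.

```r
library(dplyr)

# hypothetical toy data standing in for one categorical survey variable
toy <- data.frame(answer = c("Agree", "Disagree", "Agree", "Agree", "Neutral", "Disagree"))

# base R: table() returns the count for each level
tab <- table(toy$answer)
tab

# prop.table() converts those counts to proportions that sum to 1
prop.table(tab)

# dplyr: count() returns a df with one row per level and the counts in a column named n
cnt <- toy %>% count(answer)
cnt
```

table() is handy for a quick look in the console, while count() gives back a df, which is the shape we need for the PQ steps that follow.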

#### One-way Frequency Table
A one-way table lists the frequencies (counts) of observations for each level of one categorical/factor variable.  We'll create a frequency table of the variable sci_good using group_by() and summarize() from dplyr.  A good PQ table will include both frequencies and percentages.

In [None]:
# create table and print to screen to inspect
#using syntax:
# dfname  %>% group_by(variablename) %>% 
#             summarize(Frequency = n(),
#                       Percentage = n() / dim(df)[1]) ## dim(df)[1] is the overall number of rows in your df

cah_poll %>% group_by(sci_good) %>% 
             summarize(Frequency = n(), ## Note that I'm making my summary vars PQ quality names because these will be the column headings
                       Percentage = n() / dim(cah_poll)[1])

This looks like it is generating the right output (the counts of observations in each category of sci_good), but to make it PQ I need to reorder my factor levels so that they are in an appropriate order.  In this case I want the values ordered from Strongly Agree to Strongly Disagree.  If I want to keep DK/REF as a level, I need to relabel it with a PQ descriptive name and include it at the end of the table.  

Note that your levels must have descriptive names in order to be PQ.  For example, the label "nhwhite" for a race variable would not be PQ; it should be listed as "White, non-Hispanic" (this is a real race category from US gov't surveys, FYI).

I'm going to do the re-labeling of the DK/REF level, the re-ordering of the factors, create the final freq table and save it as an object.  I'm also going to convert the numerical proportions in the Percentage column to strings that represent the percentages formatted with % and rounding to one decimal place.

Note, creating the table in this way creates a df object.  Do not overwrite your df containing your individual observations with your frequency table, save it with a different object name.
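
Before doing this on the poll data, here is a minimal sketch of the two forcats functions on a made-up factor.  The factor `f` and its levels are hypothetical; only `fct_recode()` and `fct_relevel()` themselves come from the lab.

```r
library(forcats)

# hypothetical toy factor; levels default to alphabetical order
f <- factor(c("DK/REF", "No", "Yes", "Yes"))

# fct_recode() uses the pattern "new label" = "old label"
f2 <- fct_recode(f, "Don't Know or Refused" = "DK/REF")

# fct_relevel() moves the listed levels to the front, in the order given;
# any level not listed keeps its place after them
f3 <- fct_relevel(f2, "Yes", "No")
levels(f3)
```

Because the unlisted level falls to the end, I only have to name the levels I want first, which is how the recoded DK/REF level ends up last in the table.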

In [None]:
## note - I'm not making these adjustments to the underlying dataset, I'm only making them as I'm building the table
## I'm doing this all in one statement/pipe but I'm putting comments between lines

sci_good_table <- #save the resulting table object
    cah_poll %>% # start with the df with the observations
# relabel DK/REF - we have to use fct_ functions within mutate because we're editing one variable inside a df
    mutate(sci_good = fct_recode(sci_good, "Don't Know or Refused" = "DK/REF")) %>% 
# reorder factor levels - again I need to do this inside mutate
    mutate(sci_good = fct_relevel(sci_good, "Strongly Agree", "Somewhat Agree", "Neither Agree nor Disagree", 
                                             "Somewhat Disagree", "Strongly Disagree"))  %>% 
# create table
    group_by(sci_good) %>% 
    summarize(Frequency = n(), ## Note that I'm making my summary vars PQ quality names because these will be the column headings
              Percentage = n() / dim(cah_poll)[1])  %>% 
# format percentages - now I'm using mutate on the df that resulted from the summarize statement.  
# This needs to be done after creating the table but still in the same pipe.  We're passing forward a df in its current state
# in the create table step we're creating a different df from our df of observations and that is now the df passed forward
    mutate(Percentage = paste0(format(Percentage * 100, digits = 2), "%")) # multiplying by 100; digits = 2 means 2 significant digits for the smallest value, which usually works out to one decimal place here
    # paste0 is used to add % to the end of the number

#inspect resulting table / df
sci_good_table
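
One thing to be aware of: `format()` with `digits = 2` works in significant digits, so the number of decimal places it prints depends on the smallest value in the column.  If you want exactly one decimal place every time, `sprintf()` is one alternative.  This is only a sketch with made-up proportions, not values from the poll.

```r
# hypothetical proportions, not values from the poll
props <- c(0.4531, 0.3012, 0.0087)

# format(): decimals are chosen so the smallest value shows 2 significant digits,
# and every value is padded to the same width
fmt_format <- paste0(format(props * 100, digits = 2), "%")
fmt_format

# sprintf(): %.1f always prints one decimal place, and %% prints a literal % sign
fmt_sprintf <- sprintf("%.1f%%", props * 100)
fmt_sprintf
```

Inside the pipe above, the equivalent line would be `mutate(Percentage = sprintf("%.1f%%", Percentage * 100))`.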

Now that we have the information in a table format we want, we can proceed to "prettify" our table for output to a paper.

We'll use the package `flextable` for this. https://davidgohel.github.io/flextable/

In [None]:
flextable(sci_good_table)

First, a note: if you run this code in RStudio in an .rmd you will get a preview of the formatted table in the "Viewer" pane.  This doesn't occur within Jupyter; there we instead see the underlying structure of the "flextable" object.

Just passing our df to the flextable() function creates the most basic flextable with default options.  When I ran this code in RStudio I ended up with something that needs a bit of customization:

<img src="pqimages/flex1.jpeg" height = '50'>

The first and most important thing I need to do has nothing to do with the table formatting - I need to relabel the first column "sci_good" with a descriptive name.  I'm going to do this in the df before creating the flextable.

In [None]:
# using vector index [1] to access only the first column name of the df sci_good_table
colnames(sci_good_table)[1] <- "Response"

I get: 
<img src="pqimages/ft2.PNG" height = '50'>
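
If you would rather leave the column names of your df alone, flextable also has `set_header_labels()`, which changes only the header text that is displayed.  A minimal sketch, assuming a made-up two-row table (`toy_table` is hypothetical):

```r
library(flextable)

# hypothetical toy frequency table with a non-descriptive first column name
toy_table <- data.frame(sci_good = c("Agree", "Disagree"),
                        Frequency = c(12, 8))

# set_header_labels() takes column_name = "Displayed label" pairs;
# toy_table itself keeps its original column names
ft <- set_header_labels(flextable(toy_table), sci_good = "Response")
ft
```

Either approach gives a descriptive header; renaming in the df is simpler when you only need the table once.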

Now I need to autofit the table to the contents to make the table more attractive and readable.

In [None]:
## create flextable object and save it as base_table
base_table <- flextable(sci_good_table)

# use autofit() on the base_table object
autofit(base_table)

We now have a good basic structure. 
<img src="pqimages/ft3.PNG" height = '50'>

Time to add a title.

In [None]:
base_table <- flextable(sci_good_table)

bt2 <- autofit(base_table)
add_header_lines(bt2, "Table 1: Scientists are honest and serving public good")

We're getting close to a PQ table.  I also want to alter the justification for the columns of values.

This is current state:
<img src="pqimages/ft4.PNG" height = '50'>


In [None]:
# create base_table
base_table <- flextable(sci_good_table)
# autofit and save as bt2
bt2 <- autofit(base_table)
# add title, save as bt3
bt3 <- add_header_lines(bt2, "Table 1: Scientists are honest and serving public good")
# align numerical columns, save as bt4
# align can equal "center" "left" or "right"
# part = "all" to do the action on the entire table
# j = 2:3 indicates I only want to align the 2nd and 3rd columns. 
bt4 <- align(bt3, align = "center", part = "all", j = 2:3)
bt4

Rendered in RStudio this looks like:
<img src="pqimages/ft5.PNG" height = '50'>

I now need to save this object in a way that I can include it in my paper.  I can do this either as a picture file or in .doc (Microsoft Word) format.  First, this is the image that will be saved, rendered as an R "plot."

In [None]:
## plot is a way to render the image as an internal R image object - 
## this will not save it to your computer as an image
## this is how the plot will appear when saved.  As you can see it's slightly better than the RStudio preview
plot(bt4)

### SAVE AS IMAGE

To save a flextable as an img you need to install `phantomjs` and `webshot`.  

In [None]:
## SAVE AS IMG FILE

save_as_image(bt4, path = "sci_good_table.png") ## need an img extension, .png

I can print this table here in the markdown block to show you the saved image (which will need to be sized to fit appropriately in your document):

![](sci_good_table.png)


### SAVE AS A MICROSOFT WORD DOCX


In [None]:
## SAVE AS .DOCX
save_as_docx(bt4, path = "sci_good_tab.docx")

Running that code won't output anything in RStudio, but you should now have a docx file by the name you selected in `path = ` in your working directory.  Remember, if you don't know what your working directory is (where you are in the file structure of your computer) you can run `getwd()`.
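
A short sketch of how you might confirm the file was written, using only base R.  The filename is the one used in the `save_as_docx()` call above.

```r
# where R is currently reading and writing files
getwd()

# list any Word files in the working directory
list.files(pattern = "\\.docx$")

# TRUE once save_as_docx() has written the file named in path =
docx_written <- file.exists("sci_good_tab.docx")
docx_written
```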

#### More customizations
Flextable documentation is available here - https://davidgohel.github.io/flextable/
There are multiple other options and customizations you can make to your tables.

<a id = "twoway"></a>
### Two-way Frequency Table
I'll quickly show you how to make a crosstab / contingency table / two-way table.  For this we'll use `tab_xtab()` from the `sjPlot` package. http://www.strengejacke.de/sjPlot/

In [None]:
## adjust factor labels
cp2 <- cah_poll %>% ## create new clean version for this table
    mutate(sci_good = fct_recode(sci_good, "Don't Know or Refused" = "DK/REF")) %>% 
    mutate(robots = fct_recode(robots, "Don't Know or Refused" = "DK/REF")) %>% 
    mutate(sci_good = fct_relevel(sci_good, "Strongly Agree", "Somewhat Agree", "Neither Agree nor Disagree", 
                                  "Somewhat Disagree", "Strongly Disagree"))  %>% 
    mutate(robots = fct_infreq(robots))


## create two way table
tab_xtab(var.row = cp2$sci_good, ## variable that makes up the rows
         var.col = cp2$robots,  ### variable that makes up the columns
         ### specify descriptive overall table title
         title = "Table #: Opinion on whether scientists are good versus likelihood of robots taking over R's job",
         ## specify variable labels in order of row then column (as a vector of strings)
         var.labels = c("Scientists are good", "Robots will take over job"),
         show.cell.prc = TRUE, ## show percentages in the cells
         show.summary = FALSE ### don't show overall summary - this is stuff we'll need when we get to chi-square but not now
         )
         

When I run this in Jupyter it pops up the result in a new browser window.  You will see your result in your "Viewer" pane in RStudio.  It looks like this:

<img src="pqimages/sj2.PNG" height = '50'>
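
Before building the PQ version it can help to check the counts in the console with base R.  This is a minimal sketch on made-up data (the `toy` df is hypothetical); I'm assuming the cell percentages correspond to what `show.cell.prc = TRUE` displays, meaning each cell as a share of all observations.

```r
# hypothetical toy data with two categorical variables
toy <- data.frame(
    sci    = c("Agree", "Agree", "Disagree", "Agree", "Disagree", "Disagree"),
    robots = c("Likely", "Unlikely", "Likely", "Likely", "Unlikely", "Unlikely")
)

# two-way table of counts: first variable in the rows, second in the columns
ct <- table(toy$sci, toy$robots)
ct

# cell percentages: each cell as a share of all observations
round(prop.table(ct) * 100, 1)

# add row and column totals to the counts
addmargins(ct)
```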

### Saving your two-way contingency table

sjPlot builds the tables using .html.  We can save that .html and use `webshot` to turn it into a picture.  For this you'll add the `file =` argument when making your tab and give it the filename you want your table saved as.


In [None]:
## create two way table and save it
tab_xtab(var.row = cp2$sci_good, 
         var.col = cp2$robots,  
         title = "Table #: Opinion on whether scientists are good versus likelihood of robots taking over R's job",
         var.labels = c("Scientists are good", "Robots will take over job"),
         show.cell.prc = TRUE, 
         show.summary = FALSE,
         file = "sci_robot_tab.html" ## name of file, must end with html
         )

The file will be saved and nothing will print in RStudio, but if you go to your working directory on your computer you'll see that html file.  We can use webshot to convert it to a picture.

In [None]:
### use webshot to convert html to img

webshot("sci_robot_tab.html", "sci_robot_tab.png")

Now if you look in your working directory you will see a .png file that you can include in your Word document.  Make sure to resize it appropriately without distortion.  There is an option when resizing images in Microsoft Word that should be checked - "lock aspect ratio."

#### More customizations
For other options see the sjPlot documentation - https://strengejacke.github.io/sjPlot/

<a id = "numsum"></a>
### Summary Statistics - Numerical Variables
Often we want to look at numerical descriptive statistics (mean, median) before we begin to work with our data.  We can build tables of these in PQ format using `stargazer`.

I will show a basic layout/format here that is acceptable for PQ.  There are additional options you can review at https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf or https://www.rdocumentation.org/packages/stargazer/versions/5.2.2/topics/stargazer

In [None]:
stargazer(as.data.frame(cah_poll), 
          #note you have to specify type html
          type = "html",
          #note that the argument is "out" not "file"
          out = "cah_numdesc.html",
          title = "Table #: Summary Statistics of Numerical Variables", # descriptive overall table title
          # relabel variable names to descriptive names
          covariate.labels = c("Income", "Age", "Transformers movies seen", 
                               "Books read this year", "Guess of % Fed Budget for science"),
          digits = 2) # round values to two decimal places

The table is rendered in HTML, which I saved to a file called cah_numdesc.html using the "out = " argument.  I again use webshot to convert the table to an image.

In [None]:
### use webshot to convert html to img

webshot("cah_numdesc.html", "cah_numdesc.png")

I'll insert the picture in this markdown block so that you can see what it looks like when the html is rendered.

![](cah_numdesc.png)
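
While you are still adjusting labels, a quick console preview saves a round trip through HTML and webshot.  Here is a minimal sketch using R's built-in `mtcars` data purely for illustration; the labels and the choice of statistics are my own assumptions, not requirements for your project.

```r
library(stargazer)

# built-in mtcars data, used only as a stand-in for a df of numerical variables
preview <- stargazer(mtcars[, c("mpg", "hp", "wt")],
                     type = "text",   # plain-text table printed to the console
                     # choose which statistics appear and in what order
                     summary.stat = c("n", "mean", "median", "sd", "min", "max"),
                     covariate.labels = c("Miles per gallon", "Horsepower", "Weight (1000 lbs)"),
                     digits = 2)
```

Once the preview looks right, switch back to `type = "html"` and add `out = ` to save it.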

#### Grouped Descriptive Statistics Tables
We can also make the grouped descriptive statistics tables we've been generating PQ using flextable.  We build them the same way we built the frequency tables for categorical variables above, with group_by() and summarize().

In [None]:
# build the dataframe of summary data to pass to flextable
## flextable takes a df and prints it, so it needs to be a df of the summaries

urine_desc <- cah_poll  %>% 
    mutate(urinate = fct_infreq(urinate)) %>% ## ordering factor by frequency for this table
    group_by(urinate)  %>% 
    summarize(Frequency = n(),
              Mean = format(mean(age), digits = 4), 
              Median = median(age),
              "Std. Dev." = format(sd(age), digits = 4) )
              ## if I want to have my column name have a space in it it needs to be in quotes 
colnames(urine_desc)[1] <- "Response"

In [None]:
# create the flextable

urine_flextab <- flextable(urine_desc)  %>% autofit()  %>% 
                    add_header_lines("Table #: Distribution of Age by Acceptability of Urinating in the Shower") %>% 
                    align(align = "center", part = "all", j = 2:5)

plot(urine_flextab)

<a id="viz"></a>
## Visualizations
Now we'll move on to making charts and graphs with ggplot2.  In this notebook I'll use one example plot and build it up from basic to fully customized, item by item.  These customizations will work on any ggplot2 objects, including the ones created from ggpubr (the grouped mean plot).

A common customization is to adjust colors.  Remember the colors can be specified using Hex Color Codes. https://www.hexcolortool.com/

For this example I'm going to look at the distribution of respondents' (R's) estimates of the percent of the federal budget spent on scientific research, by political party.


In [None]:
# edit factor levels of political party
cah_poll %<>% #save changes overwriting cah_poll
    mutate(polaffil = fct_recode(polaffil, "Don't Know or Refused" = "DK/REF")) %>% 
    mutate(polaffil = fct_relevel(polaffil, "Democrat", "Independent", "Republican"))

In [None]:
options(repr.plot.width=6, repr.plot.height=4) ## plot size options for Jupyter notebook ONLY

#create base ggplot - grouped boxplot
## note that I can save this base plot as a ggplot object. I'm going to call mine bp (for base plot)
## I can then take that base plot and add other options to it without restating all of the plot setup
bp <- cah_poll %>% ggplot(aes(x = polaffil, y = fedbudget, fill = polaffil)) +
                geom_boxplot()

bp ## print plot - when you save the plot it doesn't print to screen

That is the most basic plot laying out our variables and the type of plot we want (the geom or geometry).  Now we'll add a customization that is vital to PQ format, custom descriptive labels and titles.

In [None]:
bp2 <- bp + # start with saved base plot and add labels, save as bp2
    labs(x = "Political Affiliation", y = "Percent spent on research", 
         title = "Guess of Federal Budget Spent on Science by Political Affiliation")

bp2

Now we have descriptive labels on everything but the Legend for the key of the different colors.  Because those colored bars are labeled on the x-axis we can suppress the legend from printing.  I'll start with bp2, the base ggplot + labels we just added, remove the legend and save it as bp3.

In [None]:
bp3 <- bp2 +
    theme(legend.position = "none") ## legend position = none removes legend.  It's part of the theme() setup of the plot
bp3

This would satisfy PQ at the lowest possible acceptable level, but there is so much more we could do.  In the next cells I'm going to progressively add additional customizations.

In [None]:
## custom colors for box fills.

bp4 <- bp3 +
    scale_fill_manual(values=c("#0b5394", "#38761d", "#990000", "#741b47")) #hex color codes

# use scale_fill_manual when using fill in aes(); use scale_color_manual when using color in aes()

bp4
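
The colors above are matched to the factor levels by position.  If you want to be explicit about which level gets which color, `scale_fill_manual()` also accepts a named vector.  A minimal sketch on made-up data (the `toy` df, its levels "A" and "B", and `party_cols` are hypothetical):

```r
library(ggplot2)

# hypothetical toy data with a two-level grouping factor
toy <- data.frame(party = factor(c("A", "A", "A", "B", "B", "B")),
                  value = c(1, 2, 3, 4, 5, 6))

# the names must match the factor levels exactly; the values are hex color codes
party_cols <- c(A = "#0b5394", B = "#990000")

p <- ggplot(toy, aes(x = party, y = value, fill = party)) +
    geom_boxplot() +
    scale_fill_manual(values = party_cols)
p
```

With a named vector the colors stay attached to the right groups even if you reorder the factor levels later.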

In [None]:
# remove gridlines
bp5 <- bp4 + 
    theme(panel.grid.major.x = element_blank(), ## suppress vertical grids
          panel.grid.minor.y = element_blank(), ## suppress "minor" horiz grid - the ones not labeled
          axis.ticks = element_blank()) ## suppress the little ticks between graph and label
bp5


In [None]:
# adjust background color
bp6 <- bp5 + theme(plot.background = element_rect(fill = "#ffd966"), ## the outside background
             panel.background = element_rect(fill = "#fce5cd")) ## the background of the plotting area
bp6

We can also add other objects (geoms) to our graph.  I'm going to add a horizontal line that indicates the overall median, and a dot and text that displays the mean within each group.

In [None]:
#geom_hline(yintercept, linetype, color, linewidth)
## note: fun = and after_stat() need ggplot2 >= 3.3.0, linewidth needs ggplot2 >= 3.4.0
bp6 + 
    # add a line across the plot that indicates the overall median
    # na.rm = TRUE so that missing values don't turn the median into NA
    geom_hline(yintercept = median(cah_poll$fedbudget, na.rm = TRUE), linetype = "dashed", 
               color = "#e69138", linewidth = 2) +
    # add dots that indicate the group means
    stat_summary(fun = mean, colour = "#8e7cc3", geom = "point", 
                 shape = "circle", size = 3, show.legend = FALSE) +
    # add text that indicates the group means
    stat_summary(fun = mean, colour = "#8e7cc3", geom = "text", aes(label = round(after_stat(y), digits = 1)), 
                  vjust = 0, hjust = -0.2, size = 4)
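
It is worth confirming the numbers printed on the plot by computing the same values directly.  A minimal sketch with made-up data (the `toy` df is hypothetical), using the same group_by() and summarize() pattern as the tables earlier in this lab:

```r
library(dplyr)

# hypothetical toy data: two groups of two observations each
toy <- data.frame(group = c("A", "A", "B", "B"),
                  value = c(10, 20, 30, 50))

# the value the horizontal line is drawn at
overall_median <- median(toy$value, na.rm = TRUE)
overall_median

# the values the dots and text labels show, rounded the same way as on the plot
group_means <- toy %>%
    group_by(group) %>%
    summarize(Mean = round(mean(value, na.rm = TRUE), digits = 1))
group_means
```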

#### Pre-set themes

If you don't want to play around with each individual color yourself, you can use a preset theme from `ggthemes`.
https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/

This theme is meant to look like the one used on the website fivethirtyeight.com

In [None]:
#install.packages("ggthemes")
library(ggthemes)
cah_poll %>% ggplot(aes(x = polaffil, y = fedbudget, fill = polaffil)) +
                geom_boxplot() + 
                theme_fivethirtyeight() +
                labs(x = "Political Affiliation", y = "Percent spent on research", 
                title = "Federal Budget Spent on Science by PartyID") +
                theme(legend.position = "none") 

## Saving a ggplot object
As you learned in the second homework, we use `ggsave()` to save a ggplot object.

In [None]:
# save the last plot you ran/created
ggsave('lastplot.png')

In [None]:
# save a specific plot by plot object name
ggsave(filename = "boxplotsix.png", plot = bp6)
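
`ggsave()` also lets you set the size and resolution of the saved image, which helps avoid resizing and distortion in Word.  A minimal sketch using the built-in `mtcars` data and a temporary folder; the 6 x 4 inch size and 300 dpi are illustrative choices, not requirements.

```r
library(ggplot2)

# simple stand-in plot built from the built-in mtcars data
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
    geom_boxplot()

# write to a temporary folder so this example doesn't clutter your working directory
out_file <- file.path(tempdir(), "boxplot_example.png")

# width and height are in the units given; dpi controls the resolution
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```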

## YOUR TURN!
Create one graph or table to visualize variable(s) in the cah_poll dataset that we have not yet looked at.  You can run glimpse() to remind you what variables are included in the df.  Customize the figure to PQ.