---
title : "Help, My Collaborator Uses R! An Introduction to Reproducible Statistical Analyses in R" # "The title"
shorttitle : "Help, My Collaborator Uses R!" # "Title"
author:
- name : "Wouter de Nooy" # "First Author"
affiliation : "1"
corresponding : yes # Define only one corresponding author
address : "Nieuwe Achtergracht 166, 1018 WV Amsterdam" # "Postal address"
email : "w.denooy@uva.nl"
- name : "Optional Second (Third, ...) Author - see .Rmd file" # An optional second author. Repeat name and affiliation for more authors.
affiliation : "1,2"
affiliation:
- id : "1"
institution : "Amsterdam School of Communication Research ASCoR"
- id : "2"
institution : "University of Amsterdam"
authornote: |
Wouter de Nooy, Department of Communication Science, University of Amsterdam.
abstract: |
This is a concise introduction to using R for reproducible statistical analyses. It offers a guide and reference for researchers who want to perform statistical analyses in R or write reproducible reports in R Markdown but are new to the software or have not used it for some time, who want results as soon as possible, and whose data are ready for analysis.
Instead of a bottom-up introduction to R and R Markdown, this paper presents examples, which researchers can adapt for their own use. Imitate and apply! R code is shown that can be adapted by changing names of data sets and variables. This document itself is written in R Markdown. Download the [zip file](https://wdenooy.github.io/Switch2R/HelpMyCollaboratorUsesR.zip), which contains the R Markdown document, to see how things are done. The zip file also contains a PDF version of this web page.
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
<!-- https://tinyurl.com/ybremelq -->
keywords : "R; statistical analyses; reproducible research; R Markdown"
wordcount : "6874" # added manually using the wordcountaddin add-in
bibliography : ["r-references.bib", "references.bib"]
link-citations : true #citation year is link to list of references
floatsintext : yes #if no: tables and figures at the end of the document in "man" classoption
figurelist : no
tablelist : no
footnotelist : no
linenumbers : no
mask : no
draft : no
documentclass : "apa6"
classoption : "doc" #Other options: "jou" (journal format), "man" (manuscript format)
output :
bookdown::html_document2: #additional HTML output option
toc: yes #with a table of contents
toc_depth: 3 #using three levels
toc_float: yes #which is situated in the left margin
papaja::apa6_pdf:
toc: no #no table of contents
number_sections: true #number sections (for referencing)
papaja::apa6_word:
toc: no #no table of contents
geometry: #page setup for PDF (LaTeX) output
a4paper,
top=30mm,
bottom = 25mm,
asymmetric = FALSE # set paper size and margins
# left=30mm,
# right=30mm
header-includes:
- \usepackage{float} #position of tables and figures
---
```{r setup, include = FALSE, echo=FALSE}
# Set overall options for knitting (rendering) the document.
set.seed(42) # Seed for random number generation
knitr::opts_chunk$set(
cache.extra = knitr::rand_seed,
eval = TRUE, #by default, execute the code in code chunks
echo = TRUE, #by default, show the contents of code chunks (to learn R)
message = FALSE, #by default, do not show messages
warning = FALSE, #by default, do not show warnings
fig.pos = "H" #keep plots as close as possible to the code
)
# Required libraries.
# It is good practice to load all packages that you need at the start of the
# RMarkdown document. This way, you know where to look if you wonder whether the
# package that you need has already been loaded and you do not have to load it
# over and over again.
# Note that all these packages must have been installed (at least once).
# R library (package) for APA6-style tables, figures, and statistical results
# formatting.
library(papaja)
# Load the haven package for reading SPSS data files.
library(haven)
# Load the broom package, which converts statistical (results) objects.
library(broom)
# Load the coefplot package.
library(coefplot)
# Load the visreg package.
library(visreg)
# Load the car package.
library(car)
# Load the mediation package.
library(mediation)
# Library to insert figures and create tables (among other things) in RMarkdown
# documents.
library(knitr)
# Set of R libraries for consistent and transparent data handling and graphing.
# Load this package last, so its functions are used by default.
library(tidyverse)
#Create bib file with references to R packages.
knitr::write_bib(c("base", "knitr", "tidyverse", "papaja", "dplyr", "haven", "readr", "jsonlite", "stargazer", "apaTables", "texreg", "broom", "kableExtra", "coefplot", "ggplot2", "visreg", "effects", "car", "mediation", "psych", "nlme", "lme4", "rstanarm", "lavaan", "OpenMx", "sem", "lubridate", "bookdown", "rvest", "httr"), file = "r-references.bib")
```
# Why Use R?
Why Use R? Because you want to be one of the nerds. Instead of publicly acknowledging this deep-felt need, you resort to arguments such as:
* I want to be up-to-date. The latest developments in data handling and analysis are implemented (first) in R.
* I want to do all my work in one environment. From data scraping via all kinds of analyses to publishing papers and books, all can be done in R.
* I want my analyses to be reproducible. All steps from data cleaning to final results can be specified and documented in one document in R.
* I want maximum flexibility and full control of my analyses, tables, and figures.
* I want to use free, open source software, so everybody can check and improve my work.
# Start Working with R
In contrast to SPSS, R [@R-base] is not a single software package (well, neither is SPSS but we usually install all licensed SPSS parts) and it does not have an attractive user interface. Working with R, you may have to install additional packages (also called _libraries_) for doing the things that you want to do. A more user-friendly interface to R is provided by RStudio, which is free like R. Let us install both R and RStudio, learn how to install additional packages, and how to operate R from within RStudio.
## Install or update R
On your _private computer_ or UvA _self-support laptop_, download and install R directly from https://cloud.r-project.org/.
* Download the R version for your operating system: Linux, (Mac) OSX, or Windows.
* Select the `base` package (or click `install R for the first time`).
* Download and install it.
For a _UvA-supported laptop_:
* Install R from the Software Center. Be sure to select the staff version (_R-Project MDW_), which allows installing packages (Section \@ref(packages)) locally on your computer.
* Install RStudio from the Software Center.
Use the same procedure to update to a newer version of R. If you update R, you do not have to update RStudio (or the other way around).
If you want to know more about R and the organization behind R (CRAN), visit https://www.r-project.org/.
## Install RStudio
When R has been installed, you can install RStudio (Desktop, Open Source License) from http://www.rstudio.com/download:
* Select your operating system.
* Download and install.
Allow RStudio to install additional packages when it asks for permission to do so.
## Create an RStudio Project
When you start working on a new project, create a new RStudio project. A project tells RStudio where to store and find files that you create.
Create a new project:
* Use the menu _File>New Project..._ or the _Create a project_ button.
* Select _New Directory_ to start a project in a brand new working directory or select _Existing Directory_ to create a new project in an existing directory.
The directory name is the name of the project. It is called _Switch2R_ in Figure \@ref(fig:RStudio).
When you start RStudio, R starts automatically. The project that was open when you last closed RStudio will be reopened. If you want to open another project, use _Open Project..._ in the _File_ menu or use the drop-down list of projects in the top-right of the RStudio interface.
Figure \@ref(fig:RStudio) shows the main panels of the RStudio interface with a project opened. It features a syntax file demonstrating some R commands that are discussed in Section \@ref(rcommands).
```{r RStudio, echo=FALSE, fig.cap="Main panels of the RStudio user interface."}
# This is a safe way to include an external picture, add a caption, and assign
# it a name (here: RStudio), so you can cross-reference it.
include_graphics("RStudio.png") #from the knitr package
```
## Running R Commands within RStudio {#rcommands}
RStudio (and R) does not have a menu offering all possible data manipulations and analyses. Instead, you have to type commands.
### Commands are functions
The general layout of an R command: `y <- f(x, arg = z)`
This command means: Do something (function `f(arg = z)`) to `x` and save the result in `y`.
Both `x` and `y` can be a data set, a single value, a vector (series) of values, (code to create) a plot, analysis results, and so on. An important difference between R and SPSS is that R stores analysis results in such a way that you can use them in a next step of the analysis. You can extract the results that you are interested in and use them in calculations, tables, or plots (see Section \@ref(firstanalysis)). This gives you maximum flexibility and full control.
Let us refer to `x` and `y` as _data objects_.
If the left-hand data object `y`:
* does not exist: a new data object is created in the project environment.
* exists: the existing data object is overwritten in the project environment. There cannot be more than one data object with the same name.
* is not named: output is sent to the screen (console or plot area).
If no function is specified as in `y <- x`, the data object (`x`) will be simply copied to another data object or to the screen, depending on `y`.
### Options are arguments to functions
A function can take more input from the user than just a data object (`x`). Additional inputs are arguments, which have names (`arg`, for example). A function's arguments and their default values are listed in the help page for the function. The user may specify the values of arguments or rely on the defaults. For example, the `mean()` function does _not_ remove missing values (`NA` in R) by default: `na.rm = FALSE`. If there is a missing value, the resulting mean is a missing value. That is useful as a warning that there are missing values, but you will usually want to remove them: set the `na.rm =` argument to `TRUE` (note the capitals).
> Separate arguments and the data object by commas!
Sometimes, you can supply more than one value for an argument. For example, the `quantile()` function can calculate several quantiles in one go. If you have to specify more than one value, enumerate them within the `c()` function, which creates a vector or list.
Below are some examples of commands and the textual output that the `summary` function sends to the console, which also feature in Figure \@ref(fig:RStudio). Note that R is _case-sensitive_: uppercase and lowercase letters are treated as different characters.
```{r store_results, eval=FALSE}
# Note that this code chunk is not executed (eval = FALSE).
# Copy the student sleep data set (always available in R) to a data object that
# we can inspect in the environment.
sleepdata <- sleep #no function, so just a copy
# Show summary information about this data set to the console.
summary(sleepdata) #we apply a function but do not store the results
# Store the mean of extra sleeping time.
# Note that we use $ to identify a variable within the sleep data set.
# First, inspect help for the mean function: ? plus function name.
?mean
# Second, store the non-trimmed mean (default), ignoring missing values (NA).
mean_sleep <- mean(sleepdata$extra, na.rm = TRUE) # na.rm stands for NA remove.
# Calculate the quartiles of extra sleeping time.
quartiles <- quantile(sleepdata$extra, probs = c(0.25, 0.5, 0.75))
```
### Data frames are lists of variables
The `mean()` and `quantile()` functions require a variable: extra sleeping time (`extra`) in the example above. Usually, variables are stored in a data frame. A data frame is (more or less) a list of variables.
> In R, you can retrieve an item from a list with the `$` character.
We get the extra sleeping time variable (`extra`) from the `sleepdata` data frame by first specifying the name of the data frame, adding a dollar sign, and finally the name of the variable (list item): `sleepdata$extra`. If you are typing a command at the R prompt in RStudio, a popup menu will list the items that are available once you add the dollar sign to the name of a data frame that is present in the environment.
Results from statistical analyses (Section \@ref(firstanalysis) and on) are also stored as lists. We can use the dollar sign to extract the parts of the results that we are interested in.
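As a minimal sketch of both uses of the dollar sign, using the built-in student sleep data set (`sleep`) — the chunk is not executed here, and the regression model is only a placeholder to show that stored results are lists:
```{r dollarsign, eval=FALSE}
# Retrieve a variable (list item) from a data frame with $.
mean(sleep$extra)
# Stored analysis results are lists too: fit a simple model and
# extract one item from the resulting list.
fit <- lm(extra ~ group, data = sleep)
fit$coefficients #the estimated regression coefficients
```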
### A script is a syntax file {#scriptfile}
Commands can be typed after the prompt (`>`) in the console. Press _Enter_ to execute the command.
It is more efficient, however, to create a syntax file (called a script in R) and type the commands and comments in this file. The script file can be saved and reopened in the next R session. Select one or more commands in the syntax file and press Ctrl-Enter or Cmd-Enter to execute them.
Create a new script file:
* Use the File menu.
* Select _New File_.
* Select _R Script_.
Save script files in the default directory (this is the working directory) and use the default file extension (_.R_).
Reopen a saved script file:
* If the project is not open yet, open it (_File > Open Project..._).
* Open the script file by clicking it in the _Files_ tab (bottom-right panel of RStudio).
Section \@ref(integratingtextcode) discusses a third option for running commands, which integrates code and text.
## Recommended RStudio settings {#rsettings}
A data object created by an R command is saved in memory, which is called the project's environment. Data objects in the environment are directly accessible to R.
When you exit RStudio (and R), the data objects in the environment are deleted: memory is cleared. It is possible to save all data objects in memory as a file on disk, which is called a _work space_, with the default file extension _.RData_.
Actually, RStudio saves the work space by default when you exit RStudio. This may sound great but it is strongly recommended NOT to save the work space. Your R code in the syntax file should create all data objects that you need. Your work space, however, may contain additional objects that you created with the console or with syntax commands that have been removed. If you rerun the syntax when you reopen an RStudio project, you can check that all necessary data objects are created. If not, you will receive an error message.
For this reason, it is a good idea to turn off saving work spaces when exiting RStudio and reloading them when reopening a project in RStudio. To this end, adjust the settings in the _Tools_ menu under _Global Options_ as indicated in Figure \@ref(fig:RStudiosettings).
```{r RStudiosettings, echo=FALSE, fig.cap="Workspace settings in the Tools menu under Global Options."}
# Show a picture from an external file.
include_graphics("Settings.png")
```
## Install and Load Packages {#packages}
The base installation of R includes several important packages, such as the `base` package and the `stats` package, which contains many statistical analyses. These packages are loaded when you start R, so they are always ready for usage. Additional packages must be installed once but they must be loaded in every new R session that uses them.
```{r installscreen, echo=FALSE, fig.cap="The RStudio dialog for installing packages."}
# Picture of the RStudio dialog for installing packages.
include_graphics("installpackage.png")
```
The easiest way to install a package is the _Install Packages..._ option in the _Tools_ menu of RStudio. This opens a dialog box in which you can enter the package names (Fig. \@ref(fig:installscreen)). Auto-completion will assist you. If a package is not available here, you have to install it directly from a web site. This is the case for the `papaja` package (see the code below).
While installing a package, RStudio may ask: "Do you want to install from sources the package which needs compilation?" As a rule, answer _no_ to this question.
Note the _Install to Library:_ field. Here you can see where the packages are placed, which is where R looks for them. By default, the directory is named after the version number of R that you are using. As a consequence, updating to a new R version will create a new directory for the packages and R will look for packages in this new directory. Packages installed in the directory for a previous R version are not automatically copied to the new directory. Do this by hand or re-install all packages. Re-installing them has the additional advantage of working with the latest version.
Installed packages must be loaded in a new R session before they can be used. The code below loads the `tidyverse` package [@R-tidyverse] and the `papaja` package [@R-papaja]. The `tidyverse` package actually loads a set of packages that allow us to handle data in a transparent way. The `papaja` package offers functions for creating tables and figures that conform to APA6 standards.
```{r packages, eval=FALSE}
# Load the packages that you need. Not executed here because the packages were
# loaded in the first code chunk.
# Install the tidyverse package from CRAN (only once!).
install.packages("tidyverse")
# Load a set of R libraries for consistent and transparent data handling and
# graphing.
library(tidyverse)
# papaja cannot be installed from CRAN, so we need the devtools package to
# install it.
# Install devtools package (from CRAN, only once!).
install.packages("devtools")
# Install the stable development version of papaja from GitHub.
devtools::install_github("crsh/papaja")
# Load an R library (package) for APA6-style tables, figures, and statistical
# results formatting.
library(papaja)
```
Don't mind warnings that a package was built under another version of R, but do update R, RStudio, and packages regularly. There is an update button in the _Packages_ tab of the bottom-right panel of the RStudio screen.
When loading a package, R may inform you of conflicts. Different packages can use the same name for a function; if you do not specify a package, R uses the function from the package that was loaded last. For example, a `filter()` function appears in both the `dplyr` and `stats` packages. Package `dplyr` [@R-dplyr] is part of the `tidyverse` set, which is loaded after the `stats` package (loaded when R starts). As a result, R uses `filter()` from `dplyr` instead of from `stats`. If you would like to use the `filter()` function from the `stats` package, you have to prepend the package name to the function, as in `stats::filter()` (note the two colons).
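A minimal sketch of the difference, again using the built-in `sleep` data set (not executed here; note that `stats::filter()` applies a linear filter to a series of numbers, so the example computes a moving average):
```{r masking, eval=FALSE}
# Without a prefix, R uses the filter() that was loaded last (dplyr),
# which selects the rows of a data frame that match a condition.
filter(sleep, group == "1")
# With the package prefix, we explicitly call the stats version,
# which applies a linear filter (here: a 3-point moving average).
stats::filter(sleep$extra, rep(1/3, 3))
```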
## Getting Help
Read more about installing and using R and RStudio in the (web) books [Modern Dive](https://moderndive.com/1-getting-started.html) [@IsmayIntroductionStatisticalData] and [Getting used to R, RStudio, and R Markdown](https://ismayc.github.io/rbasics-book/3-rstudiobasics.html). The latter book has a nice [section on common programming errors in R](https://ismayc.github.io/rbasics-book/6-errors.html). I bet that you will encounter some of these problems, for example using `=` for "is equal to" where you must use `==`.
The R code in Section \@ref(rcommands) shows that running a function name with a question mark prepended, e.g., `?mean`, opens the help page for the function. As an alternative, you can enter the function name in the search box of the _Help_ tab in the RStudio interface. In addition, pressing F1 (Mac: fn-F1) while the cursor is on a typed function name also opens the help page. _For getting help in these ways, the package containing the function must have been loaded._
Of course, you can always Google your question. Start your search string with the capital letter R and a space. For instance, searching for "R chisquare" will quickly lead you to the function for a chi-squared test in R (Section \@ref(chisquared)).
# My First Statistical Analysis In R: Linear Regression {#firstanalysis}
This section applies linear regression to demonstrate the principles and peculiarities of statistical analyses in R.
## Open a Data Set
Our example data set addresses attitudes towards global warming. The data set was collected by Erik Nisbet and it is available at [Andrew Hayes' website](http://afhayes.com/introduction-to-mediation-moderation-and-conditional-process-analysis.html) under __Click here to download the data files used in this book.__. The data files are also included in the zip file [HelpMyCollaboratorUsesR.zip](https://wdenooy.github.io/Switch2R/HelpMyCollaboratorUsesR.zip).
The data set is available in several formats including SPSS and CSV. The SPSS file (_glbwarm.sav_) has variable and value labels but the CSV file (_glbwarm.csv_) only has numeric values. Download the data sets and save them in your R project directory.
### Import SPSS (or STATA or SAS) data
The recommended function to load an SPSS data file in R is `read_spss()` in the `haven` package [@R-haven]. This function reads SPSS data and turns it into a _data frame_, which is the R version of a data matrix. The `haven` package can also import STATA (DTA) and SAS data files.
```{r readSPSS}
# Load the haven package for reading SPSS data files.
library(haven)
# Read the SPSS data file and store it in glbwarm_spss.
# Note: The data file glbwarm.sav must be present in the project directory.
glbwarm_spss <- read_spss("glbwarm.sav")
```
Inspect the data object `glbwarm_spss` in the _Environment_ tab of the RStudio interface. Click on the little 'play' button preceding the data frame in the _Environment_ panel to see the variables and their features. The variable and value labels are preserved as attributes (`attr`). R, however, normally does not use attributes. If you want R to handle the value labels in analyses, it is best to turn the variables into factors.
> In R, a _factor_ is a categorical variable with labels for the values. In contrast to SPSS, you must use the _value labels instead of the values_ if you want to select or manipulate particular values.
The `haven` package contains the command `as_factor()` to turn all variables with value labels into factors. Note the underscore in `as_factor()`. There is another function `as.factor()` in the `base` package, which uses a dot. This happens more often. The `tidyverse` package has improved versions of basic functions and it replaces dots in function names by underscores.
```{r in2factors}
# Change variables with value labels into factors.
# Note: We overwrite the data frame glbwarm_spss now.
glbwarm_spss <- as_factor(glbwarm_spss)
```
If you describe the new data frame (use the command `summary(glbwarm_spss)`), the value labels and their frequencies are shown for the _ideology_, _sex_, and _partyid_ variables.
### Import CSV data
CSV (Comma Separated Values) is a very general file format, which is often used to exchange data between applications. It uses plain text format, so you can easily view the contents of a CSV file.
There is a complication with CSV files. Countries that use the comma as decimal separator tend not to use the comma to separate fields. Instead, a semicolon is used. For this reason, the `readr` package [@R-readr], which is part of `tidyverse`, contains two commands for reading CSV files: `read_csv()` for comma-delimited files and `read_csv2()` for files using semicolons between fields. Open the CSV file in a text editor to see which delimiter is used if you are not sure.
```{r loadCSV, message=TRUE}
# Read the CSV file and store it in a data frame.
# Note: The data file glbwarm.csv must be present in the project directory.
glbwarm_csv <- read_csv("glbwarm.csv")
```
If you read the csv file with one of the `read_csv` functions, the function guesses the type of each variable:
* `numeric`: `double` for numbers with decimal places or `integer` for numbers without decimal places.
* `logical`: `TRUE` or `FALSE` (note the capitals), which are treated as `1` and `0` in calculations.
* `date`, `time`, or `date-time`.
* `factor`: categories with labels.
* `character` for textual data.
In this example, all variables are read as numeric with (possibly) decimal places. For _ideology_, _sex_, and _partyid_ we only have numbers. We must remember the meaning of these numbers ourselves.
If you want to have the value labels instead of the numeric codes for the categorical variables, you can export the SPSS file to CSV (in SPSS: File > Export > CSV Data) with the option _Save value labels where defined instead of data values_ checked. This CSV, named _glbwarm2.csv_ is [available here](https://wdenooy.github.io/Switch2R/glbwarm2.csv). If you import this data file with the `read_csv()` function, _ideology_, _sex_, and _partyid_ are string variables. In contrast to SPSS, R handles string variables as categorical variables, just like factors. The code below shows how you can set the types of variables with `read_csv()`.
```{r loadCSV2, eval=FALSE}
# Read the CSV file exported from SPSS and store it in a data frame.
# Note: The data file glbwarm.csv must be present in the project directory.
glbwarm2_csv <- read_csv("glbwarm2.csv",
col_types = cols(
govact = col_double(),
posemot = col_double(),
negemot = col_double(),
ideology = col_character(),
age = col_integer(),
sex = col_character(),
partyid = col_character()
)
)
```
### Import JSON data
The package `jsonlite` [@R-jsonlite] contains functions, `read_json()` among others, to read JSON data and transform the data into R data structures [@oomsJsonlitePackagePractical2014].
## Estimating a Regression Model {#regressionmodel}
We can estimate linear regression models with the `lm()` function (for _linear model_) in the `stats` package. The regression model that we want to estimate must be formulated as an R formula. In an R formula, a tilde (`~`) separates the outcome(s) from the predictor(s).
If we want to predict the opinion on government intervention (variable `govact`) from positive and negative emotions on global warning (`posemot` and `negemot`), age, and ideology, the R formula would be:
> govact ~ posemot + negemot + age + ideology
In addition to a formula, we have to specify the data frame containing the variables in the `data =` argument. The code below estimates this regression model, using the SPSS data with categorical variables transformed into factors. Note that we can include factors (or character variables) directly in the formula. The `lm()` function will create dummy variables automatically, using the first factor level (value) as the reference group.
It is possible to transform variables within the regression formula. For example, age (in years) is divided by 10 to obtain age in decades. This will increase the effect size of age, which otherwise is very close to zero. Finally, note that we store the results as a data object (`govact_model`) instead of sending them to the screen, as SPSS would do.
```{r regression1}
# A regression model predicting opinions about government intervention.
# The results are saved as data object "govact_model".
govact_model <- lm(govact ~ posemot + negemot + I(age/10) + ideology,
data = glbwarm_spss)
```
The data object storing the results contains information about how the model was estimated as well as all estimates; it even includes the original variables. Have a look at this data object in the _Environment_ tab of the RStudio interface. It is a list containing, among other things, the coefficients and the residuals, which are series (_vectors_) of numbers of different lengths (10 coefficients, 815 residuals). The list also contains single numbers, such as the residual degrees of freedom (_df.residual_).
If we store the analysis results as a data object, we can extract the information that we need at any time. Even better, we can use functions created by others to extract and present the results.
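For example, extractor functions pull individual pieces out of a fitted model at any time. The sketch below uses R's built-in `mtcars` data instead of the global warming data:

```{r extractsketch}
# A small regression on R's built-in mtcars data (illustration only).
fit <- lm(mpg ~ wt + hp, data = mtcars)
# Extractor functions are safer than digging into the list directly.
coef(fit)             # the regression coefficients
head(residuals(fit))  # the first few residuals
fit$df.residual       # a single number in the list: residual df (29 here)
```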
## Presenting Results {#presentingresults}
Storing the analysis results as a data object is all well and good, but we want to see the results. We need functions to extract the relevant results. Let us first tabulate the results and then plot them.
### Basic tabular results
In general, the `summary()` function presents the most basic results. For a linear regression model, it tells us the model formula, some statistics for the residuals, the values of the regression coefficients, and model fit statistics (_R_^2^ and an _F_ test).
```{r summary}
# Show a table of regression results as plain text.
summary(govact_model)
```
The plain-text output shown by `summary()` is not fit for publication. There are two ways of obtaining publication-ready tables: using dedicated packages that create finished tables, or creating the table yourself. The next two subsections present these two approaches.
### APA6-style or journal-style tables {#APAtables}
Your problem has been experienced by many other researchers, so it is quite likely that someone has created an R package that solves your problem. Indeed, several packages have been published that create camera-ready tables summarizing statistical results.
Usually, these packages use LaTeX (or TeX more generally) to create camera-ready tables because this language offers the greatest control over the appearance of the table. If you are used to creating documents in (La)TeX, that is fine. If not, you had best use RMarkdown (see Section \@ref(rmarkdown)) as your word processor for research reports.
The `papaja` package contains functions to create tables of statistical results that conform to the APA6 specifications. It works best if you use the RMarkdown template for an APA6-style paper, which is actually used for the web site or PDF that you are reading right now (see Section \@ref(apa6template)).
```{r papajatable, results='asis'}
# Load the papaja package.
library(papaja)
# Step 1: Extract summary information from the regression results data object.
govact_results <- apa_print(govact_model)
# Step 2 (optional): Rename the predictors with the tidyverse package.
# The predictor names are a variable (predictor) in the results table (table) in
# the list created by apa_print(), which is named govact_results here.
govact_results$table$predictor <- recode(
govact_results$table$predictor, #variable to be recoded
  "Posemot" = "Positive emotions", #old value = new value
"Negemot" = "Negative emotions",
"Iage/10" = "Age (in decades)"
#all other values remain the same
)
#Step 3: Extract and format the table of regression coefficients.
apa_table(govact_results$table, #select the table
caption = "Predicting citizens' opinion about government
intervention concerning climate change.",
note = "Just to show that it is easy to add a note.",
placement="hbt", #order: exact spot (h), page bottom (b), page top (t)
font_size = "small",
escape = TRUE
)
```
If you are looking at the web version of Table \@ref(tab:papajatable), you will probably not be satisfied with the way the table looks. The table uses two different fonts and the lines are not where they should be according to APA6. The [PDF version](HelpMyCollaboratorUsesR.pdf) is much better. Fortunately, you are going to submit the PDF, not the web version of your paper.
The `apa_table()` function can also combine results from two or more regression models in one table. Results for different models are stacked: the first set of rows contains the results for the first model, the second set contains the second model, and so on. The more familiar layout with the models spread over different columns (side-by-side) cannot be made with this package.
The `stargazer` package [@R-stargazer] can create side-by-side tables. It can create tables in the formats required by some important sociology, political science, and economics journals. Unfortunately, it does not support the APA6 style. In contrast to the `papaja` package, it can also create HTML tables, but it cannot create Word tables (import the PDF or HTML table into Word). It is possible to send the table created by `stargazer()` to a file, so you can include the code from the file in your (LaTeX or HTML) document.
You can create a Word document with a regression results table in APA6 format with the `apa.reg.table()` function in the `apaTables` package [@R-apaTables]. Note that it cannot show models side-by-side. This package can also create correlation and ANOVA tables. The `texreg` package [@R-texreg] can create and save a table with several regression models side-by-side as PDF (`texreg()`) or HTML (`htmlreg()`).
### Custom tables with the `broom`, `knitr` and `kableExtra` packages {#customtables}
> This section is meant for readers who want to have full control over the tables they create.
Instead of using packages that may produce tables that are not exactly what you want, you can create results tables yourself using the `broom` package [@R-broom], which is part of `tidyverse` but must be loaded separately. The `broom` package extracts the relevant results from statistical data objects as a data frame. This data frame can be saved as a CSV file, which can be manually transformed into a publication-ready table using your favorite spreadsheet or word processor software. Alternatively, it can be manipulated and printed as a publication-ready table within R, using the general `knitr` package [@R-knitr], with finer details managed by the `kableExtra` package [@R-kableExtra].
It takes some time to learn `knitr`, but it is a good investment because you are probably going to use `knitr` anyway for tables that do not contain statistical results. The nice thing about `knitr` is that it gives good results in HTML, LaTeX, and Word.
The `tidy()` function in the `broom` package extracts the estimated coefficients and associated test statistics and confidence intervals from the data object containing statistical results. There is a version of this function for many types of statistical analyses (more than `apa_print()` can handle). You can simply call `tidy()`: it will recognize the type of statistical analysis and use the appropriate version to extract the results.
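As a small illustration (a sketch on the built-in `mtcars` data), the same `tidy()` call handles results objects of different classes:

```{r tidydispatch}
library(broom)
# tidy() dispatches on the class of the results object:
tidy(lm(mpg ~ wt, data = mtcars))      # regression coefficients
tidy(t.test(mpg ~ am, data = mtcars))  # t test estimates, statistic, CI
```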
Let us estimate an additional regression model with an interaction effect of age with negative emotions. Note how easy it is to specify an interaction effect.
```{r broom}
# Load the broom package, which converts statistical (results) objects.
library(broom)
# Estimate a second regression model with an interaction and store the results.
govact_model2 <- lm(govact ~ posemot + negemot*I(age/10) + ideology,
data = glbwarm_spss)
# Store the regression coefficients in a data frame with broom::tidy.
govact_tidy2 <- tidy(govact_model2, #the data object created with lm()
conf.int = TRUE, #add confidence interval limits
conf.level = 0.95 #confidence level
)
# Show the regression results with the kable() function in the knitr package.
kable(govact_tidy2, #the regression coefficients
#the number of digits for all seven columns
digits = c(0, 2, 2, 2, 3, 2, 2),
caption = "Unedited statistical results of the interaction model
extracted with tidy() and shown with kable().",
format.args = list(zero.print = NULL),
booktabs = TRUE #layout with header and bottom lines in PDF output
)
# Alternatively, save the tidied results data frame as CSV for use in, for
# example, MS Excel and MS Word.
write_csv(govact_tidy2, "myresultstable.csv")
# Note: use write_csv2() if you want semicolons instead of commas to separate
# columns, which is required if your computer uses comma as decimal separator.
```
This table is not as it should be according to the APA6 standard. The dependent variable should be mentioned, the column names should be different, not all columns should be shown, the lower and upper limits of the confidence interval should be merged, and very small p values should not be rounded to .000. In addition, the predictor variable names could be more informative. Finally, the model's _R^2^_ and _F_ values should be reported and significance levels should be signaled with stars, which are explained in a note to the table.
This requires quite some data wrangling, which is exemplified in the code below. In addition, we need the `kableExtra` package [@R-kableExtra] for additional formatting of the table and adding a note.
```{r regressiontablecustom}
# Use this code for your own purpose:
# 1. Change the name of the fitted model: replace govact_model2 everywhere.
# 2. Change the old and new names of variables/parameters in mutate().
# 3. Change the names in functions caption() and add_header_above().
# Required packages.
library(tidyverse)
library(broom)
library(knitr)
# Load the kableExtra package for fine-tuning the table.
library(kableExtra)
# hide NA (missing) values in the table
options(knitr.kable.NA = '')
# The statistics for the model can be retrieved with broom::glance function.
govact_glance2 <- glance(govact_model2)
# We will append them later to the regression coefficients.
# Change the table of regression coefficients
# %>% is a pipe: the result of the preceding step is the input for the next
# step.
govact_tidy2 %>%
mutate(#change predictor variable names (case-sensitive!)
term= case_when( #the variable to be recoded
#if term equals old name = new name
term == "(Intercept)" ~ "Constant",
term == "posemot" ~ "Positive emotions",
term == "negemot" ~ "Negative emotions",
term == "I(age/10)" ~ "Age (in decades)",
term == "ideologyLiberal" ~ "Liberal",
term == "ideologySomewhat Liberal" ~ "Somewhat Liberal",
term == "ideologyModerate; Middle of the Road" ~ "Moderate",
term == "ideologySomewhat Conservative" ~ "Somewhat Conservative",
term == "ideologyConservative" ~ "Conservative",
term == "ideologyVery Conservative" ~ "Very Conservative",
term == "negemot:I(age/10)" ~ "Negative emotions*Age"
),
#join the lower and upper limits of the confidence interval
CI = #name of new variable
paste0( #collate
"[", #opening bracket
format( #lower limit as number with 2 decimal places
round(conf.low, digits = 2), #2 decimal places
nsmall = 2 #keep 0 if it is the 2nd decimal
),
", ", #comma
format( #upper limit as number with 2 decimal places
round(conf.high, digits = 2), #2 decimal places
nsmall = 2 #keep 0 if it is the 2nd decimal
),
"]" #closing bracket
),
#add new variable with significance stars
sig = case_when(
p.value < 0.001 ~ "$***$",
p.value < 0.01 ~ "$**$",
p.value < 0.05 ~ "$*$",
TRUE ~ "" #all remaining cases
)
) %>%
#add rows with R2 and F
#dollar signs mark text as math (other font type)
bind_rows(
data.frame(
term = c("$R^2$", "$F$"),
estimate = c(
govact_glance2$r.squared,
govact_glance2$statistic
),
sig = c(
"",
case_when(
govact_glance2$p.value < 0.001 ~ "$***$",
govact_glance2$p.value < 0.01 ~ "$**$",
govact_glance2$p.value < 0.05 ~ "$*$",
TRUE ~ "" #all remaining cases
)
),
stringsAsFactors = FALSE
)
) %>%
#select the relevant columns (variables) in right order
select(term, estimate, sig, CI) %>%
#create a table from the data
kable( #the data come from the 'pipe',
digits = c(0, 2, 0, 0),
col.names = c("Variable", "B", "", "95\\% CI"),
align = "lrlc",
caption = "APA6 formatted statistical results extracted with tidy() and
shown with kable() and kableExtra().",
booktabs = TRUE, #nicer layout
escape = FALSE #pay attention to special characters
) %>%
kable_styling(
latex_options = c("hold_position")
) %>%
#add dependent variable as additional header
# (all 3 columns after the first column)
add_header_above(c(" ", "Opinion on government intervention" = 3)) %>%
footnote(general = paste0(
"$Note$. $N$ = ",
govact_glance2$df + govact_glance2$df.residual,
". $CI$ = confidence interval."
),
general_title = "",
symbol = "$*$ $p$ < .05. $**$ $p$ < .01. $***$ $p$ < .001."
)
```
It is possible to have models side-by-side (Table \@ref(tab:regressiontablecustom2)) but it requires some data wrangling. If you want to see the code, download the [R Markdown file](https://wdenooy.github.io/Switch2R/HelpMyCollaboratorUsesR.Rmd) that produces this paper. You can adapt this code for your own applications.
```{r regressiontablecustom2, echo=FALSE}
# Table with two regression models side-by-side.
# Store the regression coefficients for the first model in a data frame.
govact_tidy <- tidy(govact_model, #the data object created with lm()
conf.int = TRUE, #add confidence interval limits
conf.level = 0.95 #confidence level
)
# The statistics for the first model.
govact_glance <- glance(govact_model)
# Stack the two models.
govact_tidy %>%
#add results of second model as additional columns
full_join(govact_tidy2, by = "term") %>%
#change and add variables
mutate(#change predictor variable names (case-sensitive!)
term= case_when( #the variable to be recoded
#if term equals old name = new name
term == "(Intercept)" ~ "Constant",
term == "posemot" ~ "Positive emotions",
term == "negemot" ~ "Negative emotions",
term == "I(age/10)" ~ "Age (in decades)",
term == "ideologyLiberal" ~ "Liberal",
term == "ideologySomewhat Liberal" ~ "Somewhat Liberal",
term == "ideologyModerate; Middle of the Road" ~ "Moderate",
term == "ideologySomewhat Conservative" ~ "Somewhat Conservative",
term == "ideologyConservative" ~ "Conservative",
term == "ideologyVery Conservative" ~ "Very Conservative",
term == "negemot:I(age/10)" ~ "Negative emotions*Age"
),
#add CI and p value as new variables for both models
#join the lower and upper limits of the confidence interval
#Model 1
CI.x = #name of new variable
ifelse (!is.na(conf.low.x),
paste0( #collate
"[", #opening bracket
format( #lower limit as number with 2 decimal places
round(conf.low.x, digits = 2), #2 decimal places
nsmall = 2 #keep 0 if it is the 2nd decimal
),
", ", #comma
format( #upper limit as number with 2 decimal places
round(conf.high.x, digits = 2), #2 decimal places
nsmall = 2 #keep 0 if it is the 2nd decimal
),
"]" #closing bracket
), NA
),
#Model2
CI.y = #name of new variable
paste0( #collate
"[", #opening bracket
format( #lower limit as number with 2 decimal places
round(conf.low.y, digits = 2), #2 decimal places
nsmall = 2 #keep 0 if it is the 2nd decimal
),
", ", #comma
format( #upper limit as number with 2 decimal places
round(conf.high.y, digits = 2), #2 decimal places
nsmall = 2 #keep 0 if it is the 2nd decimal
),
"]" #closing bracket
),
#add new variable with significance stars
#Model 1
sig.x = case_when(
p.value.x < 0.001 ~ "$***$",
p.value.x < 0.01 ~ "$**$",
p.value.x< 0.05 ~ "$*$",
TRUE ~ "" #all remaining cases
),
#Model 2
sig.y = case_when(
p.value.y < 0.001 ~ "$***$",
p.value.y < 0.01 ~ "$**$",
p.value.y< 0.05 ~ "$*$",
TRUE ~ "" #all remaining cases
)
) %>%
#select relevant variables
select(term, estimate.x, sig.x, CI.x, estimate.y, sig.y, CI.y) %>%
#add rows with R2 and F
bind_rows(
tibble(
#a line for R2 and F for each model
term = c("$R^2$", "$F$"),
estimate.x = c(
govact_glance$r.squared, #Model 1
govact_glance$statistic), #Model 1
estimate.y = c(
govact_glance2$r.squared, #Model 2
govact_glance2$statistic), #Model 2
sig.x = c(
"", #not for R2
case_when(
govact_glance$p.value < 0.001 ~ "$***$",
govact_glance$p.value < 0.01 ~ "$**$",
govact_glance$p.value < 0.05 ~ "$*$",
TRUE ~ "" #all remaining cases
)
),
sig.y = c(
"", #not for R2
case_when(
govact_glance2$p.value < 0.001 ~ "$***$",
govact_glance2$p.value < 0.01 ~ "$**$",
govact_glance2$p.value < 0.05 ~ "$*$",
TRUE ~ "" #all remaining cases
)
)
)
) %>%
#create a table from the data
kable( #the data come from the 'pipe',
digits = c(0, 2, 0, 0, 2, 0, 0),
col.names = c("Variable", "B", "", "95\\% CI", "B", "", "95\\% CI"),
align = "lrlcrlc",
caption = "APA6 formatted statistical results extracted with tidy() and
shown with kable() and kableExtra().",
booktabs = TRUE, #nicer layout
escape = FALSE #pay attention to special characters
) %>%
kable_styling(
latex_options = c("hold_position")
) %>%
#add model as additional header
# (all 3 columns after the first column)
add_header_above(c(" ", "Model 1" = 3, "Model 2" = 3)) %>%
#add dependent variable as additional header
# (all 6 columns after the first column)
add_header_above(c(" ", "Opinion on government intervention" = 6)) %>%
footnote(general = paste0(
"$Note$. $N$ = ",
govact_glance$df + govact_glance$df.residual,
". $CI$ = confidence interval."
),
general_title = "",
symbol = "$*$ $p$ < .05. $**$ $p$ < .01. $***$ $p$ < .001."
)
```
### Plots {#plots}
It is becoming increasingly common to plot regression coefficients instead of presenting them in a table. The `coefplot` package [@R-coefplot] can plot regression coefficients for one or more models. It has an extensive set of options for changing the appearance of the plot. Figure \@ref(fig:coefplot) and the accompanying R code showcase some options.
```{r coefplot, fig.cap="Unstandardized regression coefficients for a model without and a model with an interaction effect. The fat and thin lines represent confidence intervals based on one and two standard errors."}
# Plot the regression coefficients with their standard errors using the
# coefplot package.
# Load the coefplot package.
library(coefplot)
# Plot of the regression coefficients of two models.
multiplot(Model1 = govact_model, Model2 = govact_model2, #the fitted models (renamed)
intercept = FALSE, #don't show intercept coefficient
title = "", #omit plot title
newNames = c("negemot" = "Negative emotions",
"posemot" = "Positive emotions",
"I(age/10)" = "Age (in decades)",
"negemot:I(age/10)" = "Negative emotions * Age"
),
plot.shapes=TRUE, plot.linetypes=TRUE,
dodgeHeight = 0.8 #distance between shapes
) +
theme_apa() +
theme(legend.position="bottom")
```
The plot created by `coefplot` is a `ggplot2` plot. The `ggplot2` package [@R-ggplot2] is a versatile, widely used plotting package. A nice feature of a `ggplot2` plot is that we can customize it further. In the code above, the APA plotting theme from the `papaja` package is added and the legend is re-positioned at the bottom of the figure. The plotting functions in `coefplot` yield the data to be plotted instead of a plot if you add the option `plot = FALSE`. With this data set, you can create your own custom plot with `ggplot2`.
If you want to have full control of your plots, learn `ggplot2`. The book _R for Data Science_ [@WickhamDataScienceImport2017] offers a concise introduction; it is available [online](https://r4ds.had.co.nz/data-visualisation.html). The details of `ggplot2` can be found in _ggplot2, Elegant Graphics for Data Analysis_ [@wickhamGgplot2ElegantGraphics2009], which has an [online 3rd edition](https://ggplot2-book.org/).
Plotting regression lines is another appealing way of presenting your results (Figure \@ref(fig:plotregline)). The `visreg` package [@R-visreg] can do this. It can create `ggplot2` plots, which can be further customized. More extensive options are available in the `effects` package [@R-effects], but this package cannot produce `ggplot2` plots.
```{r plotregline, fig.cap="The effect of negative emotions at different age levels, 95\\% confidence intervals, and partial residuals. Note that the age groups are defined by default at the 10th, 50th, and 90th percentiles. Partial residuals have the color of the closest percentile."}
# Plot the effect of negative emotions for different age groups.
# Load the visreg package.
library(visreg)
# Create a plot
visreg(govact_model2,
xvar = "negemot", #variable for the x axis
by = "age", #different lines for age groups
jitter = TRUE, #negemot has few (discrete) values
overlay = TRUE, #show lines in one plot
gg = TRUE #use ggplot2
) +
theme_apa() + #add theme
#set the axis labels
labs(x = "Negative emotions",
y = "Opinion on government intervention")
```
Residual plots are commonly used to check regression assumptions. A fitted regression model contains the residuals. The residuals of the second regression model estimated in this section are available as `govact_model2$residuals`.
The `car` package offers the `residualPlot()` function to plot the (unstandardized) residuals against the (unstandardized) fitted (predicted) values. The related `residualPlots()` function (note the plural) also displays the residuals against each of the predictor variables. Note that these are not `ggplot` plots. If you are looking for additional checks and tests for a regression model, consult the `car` package.
```{r plotresiduals, fig.cap="A residuals plot for the model predicting opinion about government intervention from political ideology, positive emotions, and the interaction between negative emotions and age."}
# Plot the residuals against the predicted values.
# Load package car.
library(car)
residualPlot(govact_model2,
quadratic = FALSE)
```
# Other Common Statistical Analyses
R packages offer more statistical analyses than we will probably ever use. In addition, there can be several packages for the same type of analysis. The main problem: How do we find the right package?
Volunteers maintain overviews of packages that are dedicated to a particular task, which are called _task views_. An overview is available at https://cran.rstudio.com/web/views/. The [SocialSciences](https://cran.rstudio.com/web/views/SocialSciences.html) task view is a good place to start. It reviews packages for estimating general linear models (regression, analysis of variance) and for analyzing categorical data.
The most commonly used model types are briefly discussed below in alphabetical order.
## Analysis of Variance (ANOVA)
The `aov()` function in the `stats` package fits an analysis of variance model. It actually uses the `lm()` function, so it can also handle numeric predictors, in which case it performs analysis of covariance. Apply the `anova()` function to the results of `aov()` or `lm()` to obtain an ANOVA table with F tests. If `anova()` is applied to two or more fitted models, the F change test is calculated. The models should be nested; otherwise the F change test does not make sense.
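A minimal sketch of this nested-model comparison, using the built-in `mtcars` data (the choice of variables is arbitrary):

```{r anovanested}
# Two nested models: the second adds a predictor to the first.
fit1 <- lm(mpg ~ factor(cyl), data = mtcars)
fit2 <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
# ANOVA table with F tests for a single model.
anova(fit1)
# F change test: does the larger model fit significantly better?
anova(fit1, fit2)
```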
The `aov()` function uses a "Type I" test and should only be applied to balanced designs. For "Type II" and "Type III" tests or unbalanced designs, use the `Anova()` function in the `car` package. As with the `anova()` function, the model must first be estimated with `lm()`, `aov()`, or another linear model function.
These functions, as well as `manova()` in the `stats` package, can also be used for multivariate models (more than one dependent variable) and repeated measures designs.
The `papaja` package contains several plots for factorial designs that conform to the APA standard. The `visreg` package can be used to visualize results.
## Chi-squared Tests {#chisquared}
A chi-squared test of independence in a contingency table can be executed with the `chisq.test()` function in the `stats` package. Fisher's exact test is available through the `fisher.test()` function in the same package. Both tests require the specification of two categorical variables with the `x =` and `y = ` arguments.
For example, the following command executes a chi-squared test on the independence of political ideology and sex in the global warming data set and stores the result in a data object:
`fit_chisq <- chisq.test(x = glbwarm_spss$ideology, y = glbwarm_spss$sex)`
Note that the data frame must be specified for both variables.
The fitted model is (stored as) an `htest` object, which includes both statistical results (chi-squared value, degrees of freedom, and p value) and the contents of the contingency table: observed frequencies, expected frequencies under the null hypothesis of independence, residuals, and standardized residuals. You can inspect the contingency table by simply sending the list item to the screen, for example, `fit_chisq$observed` or `fit_chisq$stdres`. The code below creates an APA6-style contingency table.
```{r chisqtable}
# Use this code for your own results:
# 1. Change the names of the data frame and variables in chisq.test().
# 2. Change the category names of the columns in select() and
# add_header_above(): replace female and male.
# If there are more than two columns (values), add those.
# 3. Change the table's caption()
# Create a contingency table with observed frequencies and standardized
# residuals.
# Execute the chi-squared test.
fit_chisq <- chisq.test(x = glbwarm_spss$ideology, y = glbwarm_spss$sex)
# Create a table in APA6 style.
data.frame( #create a data frame from
#residuals matrix added to observed freqs matrix
cbind(fit_chisq$observed, fit_chisq$stdres)
) %>% #send to next function
#reorder the columns (variables)
select(female, female.1, male, male.1) %>%
#format as table
kable(caption = "Observed frequencies and standardized residuals
for ideology by sex.",
col.names = rep("", 4), #suppress variable names
#the number of digits for all columns
digits = c(0, 2, 0, 2),
format.args = list(zero.print = NULL),
booktabs = TRUE #layout with header and bottom lines
) %>%
kable_styling(
#don't move table to top or bottom of a page
latex_options = c("hold_position")
) %>%
#add observed versus standardized residuals header
add_header_above(c("", rep(c("Observed", "St.Residual"), 2))) %>%
#add sex as additional header
# (2 columns for female, 2 for male)
add_header_above(c("", "Female" = 2, "Male" = 2)) %>%
#footnote with test result, formatted with the papaja::apa_print function
footnote(general = paste0("$Note$. ",
apa_print(fit_chisq, n = sum(fit_chisq$observed))$statistic),
general_title = "" #suppress standard "Note:"
)
```
The same functions can be used to apply a one-sample chi-squared test or goodness-of-fit test. The `x =` argument must specify the observed frequencies of the categories on the test variable. The `base` package function `table()` does the job. Because there is no second variable, we do not use the `y =` argument. Instead, we specify the hypothesized population proportions (argument `p =`) for the categories. Use the `c()` function to combine the proportions and ensure that they are in the right order. For example, the following code tests the null hypothesis that half of the population is female and the other half male:
`fit_chisq2 <- chisq.test(x = table(glbwarm_spss$sex), p = c(0.5, 0.5))`
The fitted model is stored as an `htest` object.
## Regression: Logistic, Poisson, Negative binomial, Multinomial {#glm}
Some regression models for dependent variables that are not numeric and (in principle) continuous can be estimated with the `glm()` function in the `stats` package. GLM stands for Generalized Linear Model. The regression model is specified in the same way as the linear regression model discussed in Section \@ref(regressionmodel). The only addition is the type of linear predictor and error distribution (family). The most common families are:
* `binomial(link = "logit")` for logistic regression,
* `poisson(link = "log")` for Poisson regression,
* `quasipoisson(link = "log")` for overdispersed Poisson regression.
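A minimal logistic regression sketch on the built-in `mtcars` data (the choice of variables is arbitrary and only serves as an illustration):

```{r glmlogit}
# Logistic regression: transmission type (am, 0/1) predicted by weight.
fit_logit <- glm(am ~ wt, data = mtcars,
                 family = binomial(link = "logit"))
summary(fit_logit)
# Coefficients are on the log-odds scale; exponentiate for odds ratios.
exp(coef(fit_logit))
```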
A fitted `glm()` model is very similar to a fitted `lm()` model, so functions for displaying coefficients (`coefplot` package) and fitted regression lines (`visreg` package) also work for these models. The functions in the `car` package such as `residualPlot()`, however, only work for linear models.
Negative-binomial regression models can be fit with the `glm.nb()` function in the `MASS` package. This function follows the same logic as the `lm()` function for ordinary regression models. Instead of a model family, you can specify a link function as an argument. A fitted negative-binomial model has a structure that is similar to the results of `lm()` and `glm()`. The `car` and `coefplot` packages, however, cannot handle this type of model.
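A brief sketch, using the `quine` data that ships with `MASS` (days of school absence, a count variable):

```{r glmnbsketch}
# Load the MASS package for glm.nb() and the quine data.
library(MASS)
# Negative-binomial regression of days absent on sex and age group.
fit_nb <- glm.nb(Days ~ Sex + Age, data = quine)
summary(fit_nb)
```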
The `MASS` package also contains a function for ordered logistic regression, namely `polr()`. (Unordered) multinomial regression is provided by the `multinom()` function in the `nnet` package. In both cases, the dependent variable must be a factor. Logistic, ordered logistic, and multinomial regression with R are explained in https://rpubs.com/rslbliss/r_logistic_ws.
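A sketch of ordered logistic regression, using the `housing` data that ships with `MASS` (the dependent variable `Sat` is an ordered factor):

```{r polrsketch}
# Load the MASS package for polr() and the housing data.
library(MASS)
# Ordered logistic regression: satisfaction with housing conditions.
# Hess = TRUE stores the Hessian, which summary() needs for standard errors.
fit_polr <- polr(Sat ~ Infl + Type + Cont,
                 weights = Freq, data = housing, Hess = TRUE)
summary(fit_polr)
```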
## Regression: Mediation
### Package `mediation`
For estimating indirect effects in models with one mediator, the recommended R package is `mediation` [@R-mediation]; consult this [manual](https://web.mit.edu/teppei/www/research/mediationR.pdf) for more information. This package can handle a continuous outcome variable in combination with a continuous, ordered, or binary mediator, as well as a binary outcome variable with a continuous or binary mediator. In all cases, it can handle both continuous and binary predictors (or treatment variables). In addition, it can control for continuous and binary covariates.
An important advantage of this package is its ability to perform a sensitivity test on the indirect (mediated) effect. How sensitive is the estimated indirect effect to unobserved common causes of the mediator and the outcome variable? The mediator variable is normally not randomized, so it can be affected by an unobserved covariate that also affects the outcome variable. These effects may be incorrectly attributed to the effect of the mediator on the outcome variable and, as a consequence, to the indirect effect of the predictor via the mediator on the outcome.
An unobserved common cause of the mediator and outcome variable would create a correlation between the residuals in the model predicting the mediator and the residuals in the model predicting the outcome. The sensitivity analysis estimates the indirect effect and its confidence interval for different values of the correlation among the residuals.
A downside to this package is that we have to estimate two regression models ourselves: one with the mediator as the dependent variable, the other with the outcome as the dependent variable. This allows us, however, to report and check the two regression models with the tools provided in Section \@ref(presentingresults).
Also note that interactions must be included in the regression formula as separate terms for the conditional and interaction effects, e.g., `y ~ x1 + x2 + x1:x2` instead of `y ~ x1*x2` and the argument `INT =` must be set to `TRUE` in the `mediate()` function.
```{r mediation1}
# Estimate the indirect effect of sex via negative emotions on opinion about
# governmental intervention.
# Load the mediation package.
library(mediation)
# Estimate the regression model for the mediator.
fit_mediator <- lm(negemot ~ age + sex + ideology, data = glbwarm_spss)
# Estimate the regression model for the outcome.
fit_outcome <- lm(govact ~ age + sex + ideology + negemot, data = glbwarm_spss)
# Estimate the indirect effect (may take some time).
fit_indirect <- mediate(fit_mediator, fit_outcome, #the two regression models
treat = "sex", #the predictor or treatment variable name
mediator = "negemot", #the mediator name
sims = 1000, #number of simulations
boot = FALSE #use bootstrapping or parametric estimation
)
# Show results:
# ACME = Average Causal Mediated Effect (= indirect effect).
# ADE = Average Direct Effect.
summary(fit_indirect)
```
It is more attractive to present the direct and indirect effects in a diagram. The code below creates a diagram using the `ggplot2` package.
```{r mediation1plot, fig.asp=0.3, fig.cap="Main results of a mediation model presented as a causal diagram."}
# Plot the mediation diagram with ggplot2.
# NOTE: Make sure that you have stored the mediation results in an object named
# fit_indirect. Change the outcome variable name in the code below.