-
Notifications
You must be signed in to change notification settings - Fork 2
/
SPSStoR_2018.Rmd
756 lines (521 loc) · 36.7 KB
/
SPSStoR_2018.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
---
title: "Transitioning from SPSS to R - R-Ladies London"
author: "Daryn Cushnie-Sparrow"
date: "November 29, 2018"
output:
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, include = TRUE, tidy = TRUE, error = FALSE, message = FALSE, warning = FALSE, collapse = TRUE)
options(scipen = 10, digits = 5)
```
## Introduction
For many of us, SPSS has been something we've used for statistical analysis since early in our experience with statistics. Running tests in SPSS might be second-nature to you, and there are lots of resources online for how to do things you may be unclear about. The comfort of SPSS can make it daunting to explore other options; however, SPSS has limitations, and you may have stumbled upon an interesting statistical test that can be run in R and not in SPSS. Alternatively, you may have decided that SPSS licenses are very expensive and inconvenient, and you're interested in other options. I have gradually been transitioning from SPSS to R, so I'm hoping to share some of what I've learned that might ease your transition.
In my opinion, the biggest shift one makes in the move from SPSS to R is a change in how you think about your data and your analysis. If you are used to using the SPSS to GUI and clicky navigation to move around and explore your data, you'll notice that won't be as possible0 in R. Once you get over this hurdle, the flexibility and power of R is sure to win you over!
### Packages
SPSS is a one-piece statistical package. All the functionalities and features available to you are included in the installation. R has many useful base functions, but there are tons of **packages** available from CRAN that allow you to do more things or do them better. To install a package from CRAN, use the function `install.packages`: for example, `install.packages("ggplot2")`. In order to use packages, you need to **load** them into R, which makes their functionalities available. Do this using `library`. Here are the packages we'll be using today.
```{r Libraries}
library(psych)
library(tidyverse)
library(ggplot2)
library(haven)
library(car)
library(Hmisc)
library(moments)
library(flextable)
library(broom)
select <- dplyr::select
summarize <- dplyr::summarize
path <- getwd()
```
### R Basics
In R, a single 'line' of data is called a vector - a row or column could be a vector. The most common higher dimensions of data you'll use are matrices (if they're all one variable type eg. numeric), lists (bundles of different kinds of information) or data.frames (like the spreadsheets you'd use in Excel or SPSS). These 3 simple base R functions are some of your best friends - you will use them often! Especially `c()` (combine) - you'll use this on its own and within functions. To assign a piece of information to a name (eg. to make a vector named "item1"), use the <- operator with the Label on the left and the Contents on the right (by convention). <- and = are not the same thing in R!
```{r basics}
# makes a vector
item1 <- c(1, 2, 3, 5)
item2 <- c(3, 6, 7, 2)
# binds these by column - ie. these are variable vectors
cbind(item1, item2)
# binds these by row - ie. these are observation vectors
rbind(item1, item2)
```
## Setting Up Your Data
We're going to be using one of the default datasets available in R through **ggplot2**, `diamonds`. We'll start by loading in that data. You would *usually* need to import your own data. You can do this using `read_csv` from **readr** (e.g. `my_data <- read_csv(/myfolder/my_data.csv)`). A helpful tip is to use `path <- getwd()` at the beginning of your .R or .Rmd - this saves your current path. As long as you keep your .R or .Rmd in the same place as the files you're looking for, you can easily input the path of a file at anytime using `file.path(path, "filename.csv")`.
```{r load-in}
data(diamonds)
diamonds
```
If you'd like to look a the data in R Studio's GUI, you can do this by typing `View(diamonds)` into the console. Coming from SPSS, this is probably the view of your data that you're familiar with. In R, there are also other ways to explore the data.
### Navigating Your Data in R
`str` is a useful base R function that gives an overview of your data, including the number of observations and variables, the classes of your data frame (more on this later), and previews the data by showing you each variable (and its class) and the first few observations. `glimpse` is a similar function from **dplyr** - the major difference here is it just shows the data a bit more cleanly (shows exactly as many observations as will fit in your current window size). `head` previews the first 6 rows of your data, and `tail` previews the last 6 rows. You can also customize the number of rows to display: e.g. `head(diamonds, 10)`. `names` will output the names of your columns, which are usually your variables.
```{r data exploration}
str(diamonds)
head(diamonds)
tail(diamonds)
names(diamonds)
glimpse(diamonds)
```
There are also multiple ways to call specific rows (observations) and columns (variables). You can call row/column indices in square brackets (e.g. `dataframe[row,column]`) or using a `$` and the variable name (`dataframe$variable`). You always need to tell R where to find a particular variable or observation. I've wrapped these all in `head` for convenience, because 54,000 is a lot of diamonds!
```{r indexing}
# 7th column (variable)
head(diamonds[,7])
# Also the 7th column - defaults to column if there's no comma!
head(diamonds[7])
# 7th row (observation)
head(diamonds[7,])
# Value of the 3rd variable for the 50th observation
head(diamonds[50, 3])
# List of all the price values
head(diamonds$price)
```
**dplyr** also gives us great tools for selecting particular rows and variables.
```{r quick dplyr}
# Grab the first 20 rows of diamonds for convenience
short_diamonds <- diamonds[1:20,]
# Select only the physical dimensions
short_diamonds %>% select(x, y, z)
# Select everything but the physical dimensions
short_diamonds %>% select(carat:price)
## OR
short_diamonds %>% select(-x, -y, -z)
# Filter to get only observations with a Premium cut
short_diamonds %>% filter(cut == "Premium")
# Combine filter and select to see only the physical dimensions
# of the observations with a Premium cut
short_diamonds %>% filter(cut == "Premium") %>%
select(-(carat:price))
# Make sure you do this in the right order!
# The following would throw an error because you've already
# selected out 'cut' by the time you asked filter to use it!
# short_diamonds %>% select(-(carat:price)) %>%
# filter(cut == "Premium")
```
-----
##### Exercise 1: Indexing
Can you find 2 different ways to find the price of the 10,000th diamond?
```{r exercise1, include = FALSE}
```
-----
You might also want to sort the rows of your table based on the values of particular variables. You can do this with `arrange` and `desc` from **dplyr**. Sorting by an ordinal factor will sort in order of the levels of the variable. Other variables will sort in alphanumeric order.
```{r arrange}
# Ascending
short_diamonds %>% arrange(cut)
# Descending
short_diamonds %>% arrange(desc(cut))
# Custom Order
short_diamonds %>%
arrange(match(cut, c("Premium", "Ideal", "Very Good", "Fair", "Good")))
```
Looks great... but what is this data trying to show us? If we want to find out what this dataset contains, we can look up the documentation by typing `??diamonds` into the console. In the documentation, we find that `diamonds` is a dataset containing prices and attributes of almost 54,000 diamonds. The variables are:
* price: price in US dollars
* carat: weight of the diamond
* cut: quality of the cut
* colour: diamond colour (J being the worst, D being the best)
* clarity:ranging from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
* physical dimensions in mm: x (length), y (width), z (depth)
* depth: depth percentage calculated based on physical dimensions
* table: width of top of the diamond relative to the widest point (flat part at the top)
Because **ggplot2** gave us this data out of the goodness of its heart, it's very neat! All the variables already have their appropriate *classes* - and we also saw this when we called `str`. You can check a particular variable by calling `class`. Also, we don't see in my variable summary what the levels of `cut` are. We can check this by calling `levels`.
```{r cut_explore}
class(diamonds$cut)
levels(diamonds$cut)
```
We see that `cut` is an ordered factor, and we know based on our knowledge of the data that diamond cut is an ordinal variable, so this is what we expect. The levels of the data seem correct, with no obvious errors, and they appear to go in a logical order.
### Intro to Wrangling
When you are loading in your own data, it rarely will be so neat! It's important that all variables are classed correctly in order for appropriate calculation and visualization down the line. In SPSS, you would make these kinds of changes in Variable View. In order to play in parallel between R and SPSS today, I'm going to start by outputting our `diamonds` data to a .csv.
```{r writing_csv}
write_csv(diamonds, file.path(path, "diamonds.csv"))
```
When I import this into SPSS straight from the .csv, here's what I get:
[![](1_variableview.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
We know that all our categorical variables are ordinal in this dataset, so we change those by clicking in the Measure column.
[![](2_variableview_ordinal.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
You'll notice the Values column is empty - this is important because some tests in SPSS will not allow variables with strings, so you need to set up key-value pairs. For example, for cut, you could assign a number of 1 to Fair, 2 to Good, etc. You can do this manually in SPSS - but R can help us out! We'll come back to this.
When you first import data into R, variables tend to be chr - character (string) if they have **any** letters, or num, int, or dbl (numerical, integer, or double) if they are all numbers. You will need to change your variables' classes to fit what you know them to be. Categorical variables in R are called **factors**, and these can be unordered (nominal) or ordered (ordinal).
In this block of code, I'm going to create a messy version of `diamonds` so we can practice cleaning it up. If you're looking at the .Rmd, no peeking! It'll give away how you fix it.
```{r messy_diamonds, include = FALSE, eval = TRUE}
messy_diamonds <- diamonds
messy_diamonds[90, 5] <- "a"
messy_diamonds$color <- as.character(messy_diamonds$color)
messy_diamonds$price <- factor(messy_diamonds$price)
```
Take a look at `messy_diamonds` with `str`. What's wrong with it?
* depth is a character vector, but should be numeric
* color is a character vector, but should be an ordered factor
* price is factored, but it should be integer
```{r cleaning up}
## DEPTH
# Try putting depth back to numeric by coercing it with as.numeric
head(as.numeric(messy_diamonds$depth))
# The warning 'NAs introduced by coercion' indicates that there's a
# non numerical value somewhere - let's find it to understand
# what's wrong before we erase it!
# I'm asking R to tell me which value is NA when I coerced it
which(is.na(as.numeric(messy_diamonds$depth)))
# Let's look at what the depth of diamond #90 is
messy_diamonds[90,5]
# Oops! Looks like a data-entry error. If we had another place the data
# was entered (such as an original output document or paper questionnaire),
# we could check it and assign the correct value.
# I'm going to cheat and use the original diamonds to find out!
diamonds[90,5]
# Originally, this value was 62.9 - we'll go ahead and assign that now.
messy_diamonds[90,5] <- 62.9
# If you didn't have a way to check the original value, you probably
# have to leave it as a missing value. R will automatically
# turn this character into a NA - which is what the warning was telling us.
# Let's go ahead and coerce depth to numeric now that we've checked it.
messy_diamonds$depth <- as.numeric(messy_diamonds$depth)
## COLOR
# Color should be a factor! It needs to be ordered. If this were our data,
# we'd know what the order should be. In this case, let's check
# the original diamonds to find out.
levels(diamonds$color)
messy_diamonds$color <- factor(messy_diamonds$color, ordered = TRUE, levels = c("D", "E", "F", "G", "H", "I", "J"))
## PRICE
# Price should be integer, but it's a factor right now.
# If we were to call as.integer right away, it would convert
# them based on the factor levels! To avoid this,
# we'll wrap it with as.character first.
messy_diamonds$price <- as.integer(as.character(messy_diamonds$price))
# No warnings! Let's call str to make sure things look right.
str(messy_diamonds)
str(diamonds)
```
-----
##### Exercise 2: Cleaning
No peeking on the Rmd!
```{r messy_cars, include = FALSE, eval = TRUE}
data(mtcars)
cars <- mtcars
cars[20, 3] <- "bbbb"
cars[9, 2] <- "cc"
cars$hp <- factor(cars$hp)
```
I've done the same thing again - messed up a perfectly good dataset and called it `cars`. This comes from the dataset `mtcars` - you can search its documentation by typing `?mtcars` into the console. Make sure the data ends up clean based on your expectations from the documentation.
```{r exercise2, include = FALSE}
```
-----
Now that we've gotten things cleaned up, we're ready to go! Remember how I said that we'd come back to an easier way to get data set up in SPSS? The package **haven** is our best friend. Using `write_sav`, we can create an SPSS data file (.sav) with variable types and values set-up nicely!
```{r haven}
write_sav(diamonds, file.path(path, "diamonds.sav"))
```
Take a look at the Variable View of *diamonds.sav* in SPSS.
[![](3_variableview_haven.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
The categorical values now have numerical values corresponding to their original factor levels, and everything is appropriately classed! Take a look at the Data View for the two files to compare.
**Straight from the CSV:**
[![](4_dataview_csv.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
**Through Haven:**
[![](5_dataview_haven.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
This is great, because many of us don't immediately abandon our old tools when we pick up new ones. You might still want to use SPSS for some things, but R can still help you manage your data. Alternatively, it's helpful if you have collaborators who use SPSS.
#### More Tips for Setting Up Your Data
The 'Missing' column in Variable View of SPSS is a helpful way to quickly identify variables with missing data. In R, you can quickly find out if there are any missing data points using `is.na` and a couple friends.
```{r missingdata}
testvariable <- c(1, 2, 3, 5, 3, 1, 3, 3, 4, 5, 6, NA, 3, 3, 4, 7, 8, 1, 3, NA, 3, 4, 6)
any(is.na(testvariable))
which(is.na(testvariable))
```
You can use Values (key-value pairs) and Labels in SPSS to store information about your variables. In R, you can give your values names (similar to key-value pairs), and you can make a data dictionary to store detailed information that you want access to. **dataMeta** provides some helpful tools for making data dictionaries, but you can scrape by with base R. Here are some simple examples of ways to label data and store additional information about variables. You could of course import these by reading in a .csv (`read_csv` through **readr** is my go-to) or other spreadsheet - you don't have to type these things directly into R.
```{r labels}
var1 <- c(1, 2, 3, 3, 4, 5, 3, 2, 1, 2)
var2 <- c(4, 5, 3, 5, 5, 4, 2, 1, 2, 3)
var3 <- c(1, 1, 2, 4, 5, 3, 3, 3, 2, 2)
df <- data.frame(var1, var2, var3)
# Labeling individual values
df$var1 <- factor(var1, levels = c("1", "2", "3", "4", "5"), labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"), ordered = TRUE)
## All 3 of these questions have the same levels/labels so for convenience
## we'll define vectors that store this info once
q_labels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")
q_levels <- c("1", "2", "3", "4", "5")
df$var1 <- factor(var1, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var2 <- factor(var2, levels = q_levels, labels = q_labels, ordered = TRUE)
df$var3 <- factor(var3, levels = q_levels, labels = q_labels, ordered = TRUE)
## Note: if you use an ordered factor but want to access numerical
## features (ie. to calculate things like mean, min or max) you can
## wrap it in as.numeric - does not matter if factor is labeled as characters
mean(as.numeric(df$var1))
# Naming observations can be helpful to keep track
rownames(df) <- c("PD01", "YC01", "PD02", "PD03", "PD04", "YC02", "OC01", "OC02", "YC03", "OC03")
# Renaming the variables
colnames(df) <- c("delivery", "value", "meal")
# If you want to put these into their own column (helpful if tidying):
rownames_to_column(df)
# Quick and dirty data dictionary with question details and levels
## Because the levels are consistent, I've only specified them once
## and R will multiply it to fill the dataframe. If you have different
## details for each question, make sure you input them all separately.
## This is a good thing to do in your data entry/data prep.
question <- c("How was your delivery?", "Value for cost?", "How was your meal?")
level <- c("1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, 5 = Excellent")
data_dict <- data.frame(colnames(df), question, level)
```
Importing, cleaning, and wrangling your data is a huge topic, but there are tons of resources to help you with this!
## Running Tests
Running statistical tests in R is a bit different from running them in SPSS, but thankfully Google is your best friend here! Let's work through a sample analysis on our `diamonds` data.
Let's say I wanted to do a t-test to compare the average prices of Ideal and Premium diamonds.
To set this up in SPSS, I would select an Indepedendent Samples t-test from the Compare Means menu. This is what my setup window would look like - I know that Ideal and Premium are coded as "4" and "5" by clicking on the Values in Data View.
[![](6_t_setup.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
And this would be my output:
[![](7_t_output.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
In R, to run this t-test I'm going to use the function `t.test`. However, SPSS runs Levene's test to check for homogeneity of variance as part of the t-test. We'll have to do that ourselves, using `leveneTest` from the **car** package. We'll also need to filter in order to just have Ideal and Premium diamonds.
Seeing that Levene's test is significant, our assumption of homogeneity of variance is not upheld, and we should use a Welch t-test, rather than a Student's t-test. This is the default in R's `t.test`.
```{r t-test}
diamonds_subset <- diamonds %>% filter(cut == "Ideal" | cut == "Premium")
leveneTest(diamonds_subset$price, diamonds_subset$cut, center = mean)
# Welch - the one we're using
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)
# Student - if Levene's hadn't been true
t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = TRUE)
```
The output from our t-test includes almost all of the same information as the SPSS output. All we're missing is mean difference and the standard error of the difference. Mean difference is relatively easy to calculate - standard error of the difference is a bit more complex but here it is! Once again, these funky formulas are coming from the SPSS algorithm manual. In most cases, you won't need to get a specific value that comes from the SPSS output - but this is just to provide an example of how to match things up.
```{r}
# Mean difference - the same for Student and Welch tests
ideal <- filter(diamonds, diamonds$cut == "Ideal")
premium <- filter(diamonds, diamonds$cut == "Premium")
meandiff <- abs(mean(ideal$price) - mean(premium$price))
n_ideal <- sum(!is.na(ideal$price))
n_premium <- sum(!is.na(premium$price))
# Standard error of the difference: unpooled - for Welch's t-test
SE_diff_unpooled <- sqrt((var(ideal$price)/n_ideal) + (var(premium$price)/n_premium))
# Standard error of the difference: pooled variance - for Student's t-test
pooled_variance <- ((n_ideal-1)*var(ideal$price) + (n_premium-1)*var(premium$price))/(n_ideal+n_premium - 2)
SE_diff_pooled <- sqrt(pooled_variance)*sqrt(1/n_ideal+1/n_premium)
```
### Creating Summary Documents
Once you've finished your beautiful work in R, you need a way to output your documents! A few ways you can do this are:
* Spreadsheets: e.g. outputting to a .csv with `readr::write_csv` - *Syntax: write_csv(dataFrame, file.path(path, "results.csv"))*
* Images: e.g. outputting a plot to a .png with `ggplot2::ggsave` - *Syntax: ggsave(plotName, file = "plot.png")*
* R Markdown: Knitting to HTML, PDF, or Word
My current favourite package for formatting summary documents in R Markdown is **flextable**, which allows you to output to HTML, PDF and Word with ease. It also offers significant customizability. If you don't use Word documents, **kableExtra** is another fabulous option, and a bit simpler!
The first thing you might want to include in a summary document is some descriptives. Calculating them is easy to do using **dplyr**'s `summarize`. Here's a quick sample - there's more on this in the Explore section of the Resources. Once you've got your summary, you can use **flextable** to prep it for output. **flextable** takes a data.frame as its input.
```{r summarizing}
diamonds %>%
summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price), sd = sd(price), min = min(price), max = max(price)) %>%
mutate(stderror = sd/sqrt(n)) %>%
mutate(confint_lower = mean-2*stderror) %>%
mutate(confint_upper = mean+2*stderror)
summary <- diamonds %>%
summarise_each(vars = c(carat, depth, table, price, x, y, z), funs(mean, sd, min, max, median, IQR))
sum_ft <- flextable(summary)
# Subset tables by stat
flextable(select(summary, contains("mean")))
flextable(select(summary, contains("sd")))
# Subset tables by variable
flextable(select(summary, contains("price")))
# Playing with formatting
sum_ft <- flextable(summary) %>%
theme_zebra %>%
font(part = "header", fontname = "Cambria") %>%
bold(part = "header") %>%
color(part = "header", color = "blue") %>%
fontsize(part = "all", size = 16)
```
Let's say we want to output our t-test output from above. Output from statistical tests is often a little cluttered - `broom::tidy` can clean that up for us.
```{r t-test output}
output <- tidy(t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)) %>%
select(-estimate)
# Very basic
flextable(output)
# How exciting!
out_ft <- flextable(output) %>%
set_header_labels(estimate1 = "Premium Mean", estimate2 = "Ideal Mean", statistic = "Statistic", p.value = "p", parameter = "df", conf.low = "Conf. Int. Lower", conf.high = "Conf. Int. High", method = "Method", alternative = "Alternative") %>%
bold(part = "header") %>%
fontsize(part = "header", size = 16) %>%
fontsize(part = "body", size = 14) %>%
bg(part = "body", bg = "#E9E9E9")
out_ft
```
Great! So we've got a couple tables we'd like to add to our R Markdown file. I think a good beginner strategy for getting easy summary documents is to work in a .Rmd (rather than a .R) and just put all your rough work in a code chunk with `include = FALSE`. That way, when you knit your document, nothing in your code chunk will be included - only things you call outside (such as with in-line code or with another chunk). Here's an example of what I mean.
**Here's what the sample below looks like in R Markdown:**
[![](20_summarydoc.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
#### Summary Document Sample
```{r sample, include = FALSE}
price_summary <- diamonds %>%
summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price), sd = sd(price), min = min(price), max = max(price)) %>%
mutate(stderror = sd/sqrt(n)) %>%
mutate(confint_lower = mean-2*stderror) %>%
mutate(confint_upper = mean+2*stderror)
price_summary_ft <- flextable(price_summary) %>%
set_header_labels(n = "N", n_missing = "N Missing", mean = "Mean", sd = "SD", min = "Minimum", max = "Maximum", stderror = "Standard Error", confint_lower = "Conf. Int. Lower", confint_upper = "Conf. Int. Upper") %>%
bold(part = "header")
diamonds_subset <- diamonds %>% filter(cut == "Ideal" | cut == "Premium")
price_levene <- leveneTest(diamonds_subset$price, diamonds_subset$cut, center = mean)
colnames(price_levene) <- c("df", "F", "p")
price_t <- t.test(diamonds_subset$price ~ diamonds_subset$cut, var.equal = FALSE)
price_t_out <- tidy(price_t) %>% select(-estimate)
price_t_ft <- flextable(price_t_out) %>%
set_header_labels(estimate1 = "Premium Mean", estimate2 = "Ideal Mean", statistic = "Statistic", p.value = "p", parameter = "df", conf.low = "Conf. Int. Lower", conf.high = "Conf. Int. Upper", method = "Method", alternative = "Alternative") %>%
bold(part = "header")
```
The data included a sample of `r price_summary$n` diamonds. The average price among the data was `r price_summary$mean` USD. Descriptive statistics of diamond prices are below:
`r price_summary_ft`
I was interested in the relationship between a diamond's cut and its price at the two highest levels of cut (Ideal and Premium). I ran an independent samples t-test to compare the average price in these two groups. Testing for homogeneity of variance with Levene's test indicated a F-statistic of `r round(price_levene$F[1], 3)` and a p-value of < .001. As a result, a Welch t-test was conducted - results are below.
`r price_t_ft`
-----
##### Exercise 3: Summary Stats
Use what you've learned to do the following:
* Calculate the mean and SD of price
* Calculate the mean and SD of price among diamonds with a cut of Ideal
* Turn the second one into a flextable
* Bold the body of the table
```{r exercise3, include = FALSE}
```
-----
## Tips
* Be careful when running things in your console. Code you run from your .R or .Rmd will stay when you save/close it, but work done in the console will disappear when you close R or if your R session ever needs to restart. I usually use the console for experimentation, and then write things that work/that I'm keeping in my .R or .Rmd file.
* Keep your .R or .Rmd in the same place as the files you use (Also check out R Projects!). Set up your directory at the beginning of your document by assigning it to a variable: e.g. `path <- getwd()`. You can then use the function `file.path` to paste together your directory with a file you're reading (importing) or writing (exporting): e.g. `read_csv(file.path(path, "test.csv"))`. The advantage here is that `file.path` will take care of the Mac/PC/Other OS differences in slash defaults.
* Assign results of a statistical test to an object (e.g. `t_results <- t.test(df$var1 ~ df$var2))`) because it makes it easier access individuals elements (using the `$`) or tidy/export the output.
## Resources
If you're stuck, you can reach out to me - I probably won't have an answer for you off-hand, but I'm certainly willing to look into things with you! Email: *dcushni@uwo.ca*
### More Samples of R's Output vs. SPSS Output
#### Correlation
[![](8_corr_setup.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
[![](9_corr_output.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
Bivariate correlation between carat and price:
```{r correlation}
# Base R - very simple, only gives the r value
cor(diamonds$carat, diamonds$price)
cor(diamonds$carat, diamonds$price, method = "spearman")
# from Hmisc - also works on a matrix; includes p-values
rcorr(diamonds$carat, diamonds$price)
rcorr(diamonds$carat, diamonds$price, type = "spearman")
rcorr(as.matrix(select(diamonds, -cut, -color, -clarity)))
```
#### Regression
Predicting price as a function of carat:
[![](10_regression_setup.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
[![](11_regression_output.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
```{r regression}
price_carat <- lm(diamonds$price ~ diamonds$carat)
summary(price_carat)
```
#### ANOVA
**Note:** If you want your outputted values to match SPSS, you need to use the same contrasts as SPSS. This is discussed [here](http://myowelt.blogspot.ca/2008/05/obtaining-same-anova-results-in-r-as-in.html). To change your contrasts, do the following.
```{r anova contrasts}
options(contrasts = c("contr.sum", "contr.poly"))
```
One-Way ANOVA comparing mean price between all diamond cuts:
[![](12_anova_setup.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
[![](13_anova_output.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
[![](14_anova_posthoc.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
```{r anova}
cut_price <- lm(price ~ cut, data = diamonds)
cut_price_aov <- aov(cut_price)
summary(cut_price_aov)
# Post-hoc with Bonferroni Correction
pairwise.t.test(diamonds$price, diamonds$cut, p.adj = "bonferroni")
```
Two-Way ANOVA comparing mean price as a function of diamond cut and clarity:
[![](15_twoway_setup.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
[![](16_twoway_output.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
[![](17_twoway_posthoc1.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
[![](18_twoway_posthoc2.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
```{r twoway}
cut_clarity_price <- lm(price ~ cut*clarity, data = diamonds)
cut_clarity_price_aov <- aov(cut_clarity_price)
summary(cut_clarity_price_aov)
# Post-hoc with Bonferroni correction
# This print-out would be very long - showing you cut element as a sample
cut_clarity_price_posthoc <- TukeyHSD(cut_clarity_price_aov)
cut_clarity_price_posthoc$cut
```
#### Explore
Running SPSS Explore on Price (limited screenshots because this is long output!):
* Descriptives
[![](19_explore.png)](https://github.com/daryncsparrow/SPSStoR_2018/blob/master/)
* Histograms
* Boxplots
**Note:** Kurtosis is calculated differently between SPSS and R. Different kurtosis calculation methods are discussed [here](https://stats.stackexchange.com/questions/61740/differences-in-kurtosis-definition-and-their-interpretation).
```{r explore_r, fig.height=10}
# Simple investigation of price
diamonds %>%
summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(price, na.rm = TRUE), kurtosis = kurtosis(price, na.rm = TRUE)) %>%
mutate(range = max - min) %>%
mutate(stderror = sd/sqrt(n)) %>%
mutate(confint_lower = mean-2*stderror) %>%
mutate(confint_upper = mean+2*stderror)
hist(diamonds$price)
boxplot(diamonds$price)
# Price grouped by cut
diamonds %>%
group_by(cut) %>%
summarize(n = n(), n_missing = sum(is.na(price)), mean = mean(price, na.rm = TRUE), sd = sd(price, na.rm = TRUE), min = min(price, na.rm = TRUE), max = max(price, na.rm = TRUE), median = median (price, na.rm = TRUE), IQR = IQR(price, na.rm = TRUE), variance = var(price, na.rm = TRUE), trimmed = mean(price, na.rm = TRUE, trim = 0.05), skewness = skewness(diamonds$price, na.rm = TRUE), kurtosis = kurtosis(diamonds$price, na.rm = TRUE)) %>%
mutate(range = max - min) %>%
mutate(stderror = sd/sqrt(n)) %>%
mutate(confint_lower = mean-2*stderror) %>%
mutate(confint_upper = mean+2*stderror)
ggplot(diamonds, aes(x = price)) +
geom_histogram(color = "black", fill = "white") +
facet_grid(cut ~.)
ggplot(diamonds, aes(x = cut, y = price)) +
geom_boxplot(color = "black", fill = "white")
```
### Questionnaires?
Working with questionnaire data offers unique challenges. For those of you who are into that, here are some quick tips on how to manage this kind of data.
We're going to use the `bfi` dataset - documentation can tell you lots about it! The `bfi` documentation actually includes a sample of how to use an item key and score negatively keyed items - so this section is us working through the provided sample. In SPSS, you would negatively key items using Transform.
Briefly, `bfi` is a set of 25 personality self-report items from 2800 subjects, plus demogrpahics on gender, education, and age. The 25 items are oranized around 5 factors: (A) Agreeableness, (C) Conscientiousness, (E) Extraversion, (N) Neuroticism, (O) Openness. As is common with questionnaire data, some of these items are negatively keyed (ie. high scores on the question represent low levels of the dimension measured). This is easy to do in R! To make a keys.list, define vectors for each factor (ie. Agreeableness) and list the included questions by name. Negatively keyed variables are prefaced with a -.
```{r bfi}
data(bfi)
glimpse(bfi)
keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C3","-C4","-C5"),
extraversion=c("-E1","-E2","E3","E4","E5"),neuroticism=c("N1","N2","N3","N4","N5"),
openness = c("O1","-O2","O3","O4","-O5"))
# scoreItems gives a ton of info about items!
scores <- scoreItems(keys.list,bfi,min=1,max=6) # Need to specify the minimum and maximum values
scores
# Call str() on scores to see all the pieces of information you can access
# individually! The *scores* element of scores includes the re-keyed items
# Summarise from dplyr can help you investigate these scores.
keyed_scores <- data.frame(scores$scores)
keyed_scores %>%
summarise_all(mean)
# If you wanted to group by a variable (such as gender) - remember that
# we dropped demographics when we used keys.list. We can get those back!
# According to the documentation, males = 1 and females =2
keyed_scores <- cbind(gender = bfi$gender, keyed_scores)
keyed_scores %>%
group_by(gender) %>%
summarise_all(mean)
# Using dplyr::select to help you subset to look at individual pieces of the data
open <- bfi %>%
select(starts_with("O"))
```
Dummy coding is useful and not exclusive to questionnaire data. This is very simple in R. Let's say we wanted to look at the relationship between education and openness.
```{r dummycoding}
# Factor the variable first mostly to re-label nicely
factored_edu <- factor(bfi$education, labels = c("someHS", "doneHS", "someCollege", "doneCollege", "graduate"), ordered = TRUE, exclude = NA)
edu <- dummy.code(factored_edu)
# Your dummy coded variables are now an island!
# Bind these dummy coded education variables to the openness df
# we made earlier with dplyr::select.
open_edu <- cbind(edu, open)
```
### Links
#### Basics, Importing, Wrangling
* https://datacarpentry.org/R-genomics/index.html
* https://www.statmethods.net/input/importingdata.html
* https://www.datacamp.com/community/tutorials/r-data-import-tutorial
* https://rpubs.com/williamsurles/292547
* http://stcorp.nl/R_course/tutorial_dplyr.html
#### Questionnaires/Surveys
* https://mvuorre.github.io/toolbox/questionnaire.html
#### Stats
* http://www.milanor.net/blog/aggregation-dplyr-summarise-summarise_each/
* http://www.r-tutor.com/elementary-statistics
* http://tutorials.iq.harvard.edu/R/Rstatistics/Rstatistics.html
* http://r-statistics.co/Assumptions-of-Linear-Regression.html
* http://www.statmethods.net/stats/anova.html
* http://myowelt.blogspot.ca/2008/05/obtaining-same-anova-results-in-r-as-in.html
* http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
* https://www.r-bloggers.com/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/
* http://gribblelab.org/stats/index.html
* ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/24.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf
#### Summary Documents
* https://bookdown.org/yihui/rmarkdown/
* https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
* https://davidgohel.github.io/flextable/articles/overview.html
* https://cran.r-project.org/web/packages/flextable/vignettes/format.html
* https://cran.r-project.org/web/packages/flextable/flextable.pdf
* https://cran.r-project.org/web/packages/captioner/vignettes/using_captioner.html
* https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
Thanks for reading/listening!