# <div style="text-align: right"> Chapter __16__</div>

# __Vectors__

In [110]:
# config
repr_html.tbl_df <- function(obj, ..., rows = 6) repr:::repr_html.data.frame(obj, ..., rows = rows)
options(dplyr.summarise.inform = FALSE)

In [4]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.1
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## __Vector Basics__

There are two types of vectors:
* Atomic vectors, of which there are six types: logical, integer, double,
character, complex, and raw. Integer and double vectors are
collectively known as numeric vectors.
* Lists, which are sometimes called recursive vectors because lists
can contain other lists.

__The chief difference between atomic vectors and lists is that atomic
vectors are homogeneous, while lists can be heterogeneous. There’s
one other related object: `NULL`.__

__`NULL` is often used to represent the
absence of a vector (as opposed to `NA`, which is used to represent the
absence of a value in a vector). `NULL` typically behaves like a vector of
length 0.__
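
A minimal sketch of the difference between `NULL` and `NA`, using only base R. The variable names `with_null` and `with_na` are illustrative and not part of the original notes.

```r
# NULL stands for "no vector at all" and has length 0
length(NULL)        # 0
is.null(NULL)       # TRUE

# NA is a missing value inside a vector, so it still occupies a slot
length(NA)          # 1
typeof(NA)          # "logical"

# c() silently drops NULL but keeps NA
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)
length(with_null)   # 2
length(with_na)     # 3
```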

Every vector has two key properties:

* Its type, which you can determine with `typeof()`:

In [5]:
typeof(letters)

In [6]:
typeof(1:10)

* Its length, which you can determine with `length()`:

In [7]:
x <- list('a', 'b', 1:10)
length(x)

Vectors can also contain arbitrary additional metadata in the form
of attributes. These attributes are used to create augmented vectors,
which add behavior on top of the underlying vector. The important
types of augmented vector are:

* Factors are built on top of integer vectors.
* Dates and date-times are built on top of numeric vectors.
* Data frames and tibbles are built on top of lists.

## Important Types of Atomic Vector

The four most important types of atomic vector are logical, integer,
double, and character. Raw and complex are rarely used during a
data analysis.

### Logical

Logical vectors are the simplest type of atomic vector because they
can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, but you can also create them by hand with `c()`:

In [8]:
1:10 %% 3 == 0

In [9]:
c(TRUE, TRUE, FALSE, NA)

### Numeric

Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place
an `L` after the number:

In [10]:
typeof(1)

In [11]:
typeof(1L)

The distinction between integers and doubles is not usually important,
but there are two important differences that you should be
aware of:

* Doubles are approximations. Doubles represent floating-point
numbers that cannot always be precisely represented with a
fixed amount of memory. This means that you should consider
all doubles to be approximations. For example, what is the square of
the square root of two?

In [12]:
x <- sqrt(2) ^ 2
x

In [13]:
x - 2

This behavior is common when working with floating-point
numbers: most calculations include some approximation error.
Instead of comparing floating-point numbers using `==`, you
should use `dplyr::near()`, which allows for some numerical
tolerance.

* Integers have one special value, `NA`, while doubles have four: `NA`,
`NaN`, `Inf`, and `-Inf`. The last three can all arise during
division:

In [14]:
c(-1, 0, 1) / 0

Avoid using `==` to check for these other special values. Instead
use the helper functions `is.finite()`, `is.infinite()`, and
`is.nan()`.
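
The two differences above can be sketched as follows. This is only an illustration: it assumes dplyr is installed (it is attached with the tidyverse earlier in these notes), and the names `special`, `is_fin`, `is_inf`, `is_nan`, and `is_mis` are made up for the example.

```r
x <- sqrt(2) ^ 2
x == 2               # FALSE: a tiny rounding error makes exact comparison fail
dplyr::near(x, 2)    # TRUE: equal within a small tolerance

# Classifying special values with the helper functions
special <- c(0, Inf, -Inf, NA, NaN)
is_fin <- is.finite(special)    # TRUE only for 0
is_inf <- is.infinite(special)  # TRUE for Inf and -Inf
is_nan <- is.nan(special)       # TRUE only for NaN
is_mis <- is.na(special)        # TRUE for both NA and NaN
```

Note that `is.na()` also reports `NaN` as missing, so use `is.nan()` when you need to tell the two apart.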

### Missing Values

Each type of atomic vector has its own missing value:

In [17]:
(NA) # logical
(NA_integer_) # integer
(NA_real_) # double
(NA_character_) # character

## Using Atomic Vectors

Now that you understand the different types of atomic vector, it’s
useful to review some of the important tools for working with them.
These include:
* How to convert from one type to another, and when that happens
automatically.
* How to tell if an object is a specific type of vector.
* What happens when you work with vectors of different lengths.
* How to name the elements of a vector.
* How to pull out elements of interest.

### Coercion

There are two ways to convert, or coerce, one type of vector to
another:

* Explicit coercion happens when you call a function like
`as.logical()`, `as.integer()`, `as.double()`, or `as.character()`. Whenever you find yourself using explicit coercion, you should
always check whether you can make the fix upstream, so that
the vector never had the wrong type in the first place. For example, you may need to tweak your readr `col_types` specification.

* Implicit coercion happens when you use a vector in a specific
context that expects a certain type of vector. For example, when
you use a logical vector with a numeric summary function, or
when you use a double vector where an integer vector is
expected.
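
The two kinds of coercion can be sketched with base R alone. The inputs and the names `bad` and `flags` are illustrative.

```r
# Explicit coercion: you ask for the conversion
as.integer("42")                        # 42L
as.integer(3.9)                         # 3L: the fractional part is dropped, not rounded
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA: "yes" is not recognized
as.character(1:3)                       # "1" "2" "3"

# A failed conversion yields NA (with a warning, suppressed here)
bad <- suppressWarnings(as.integer("forty-two"))
is.na(bad)                              # TRUE

# Implicit coercion: a logical vector used in a numeric context
flags <- c(TRUE, FALSE, TRUE)
flags + 0L                              # 1 0 1
```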

You’ve already seen the most important type of implicit coercion:
using a logical vector in a numeric context. In this case TRUE is converted to 1 and FALSE is converted to 0. That means the sum of a
logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:

In [18]:
x <- sample(20, 100, replace = TRUE)
y <- x > 10

In [19]:
sum(y)

In [20]:
mean(y)

It’s also important to understand what happens when you try to
create a vector containing multiple types with `c()`. The most
complex type always wins:

In [21]:
typeof(c(TRUE, 1L))

In [22]:
typeof(c(1L, 1.5))

In [23]:
typeof(c(1.5, 'a'))

An atomic vector cannot have a mix of different types because the
type is a property of the complete vector, not the individual elements. If you need to mix multiple types in the same vector, you
should use a list, which you’ll learn about shortly.

### Test Functions

Sometimes you want to do different things based on the type of vector.
One option is to use `typeof()`. Another is to use a test function
that returns `TRUE` or `FALSE`. Base R provides many functions like
`is.vector()` and `is.atomic()`, but they often return surprising
results. Instead, it’s safer to use the `is_*` functions provided by
__purrr__.

### Scalars and Recycling Rules

As well as implicitly coercing the types of vectors to be compatible,
R will also implicitly coerce the length of vectors. This is called vector recycling, because the shorter vector is repeated, or recycled, to
the same length as the longer vector.
This is generally most useful when you are mixing vectors and
“scalars.” I put scalars in quotes because R doesn’t actually have
scalars: instead, a single number is a vector of length 1. Because
there are no scalars, most built-in functions are vectorized, meaning
that they will operate on a vector of numbers. That’s why, for example, this code works:

In [26]:
sample(10) + 100

In [27]:
runif(10) > 0.5

In R, basic mathematical operations work with vectors. That means
that you should never need to perform explicit iteration when performing simple mathematical computations.

It’s intuitive what should happen if you add two vectors of the same
length, or a vector and a “scalar,” but what happens if you add two
vectors of different lengths?

In [28]:
1:10 + 1:2

Here, R will expand the shortest vector to the same length as the
longest, so-called recycling. This is silent except when the length of
the longer is not an integer multiple of the length of the shorter:

In [30]:
1:10 + 1:3

Warning message: longer object length is not a multiple of shorter object length


While vector recycling can be used to create very succinct, clever
code, it can also silently conceal problems. For this reason, the vectorized functions in tidyverse will throw errors when you recycle
anything other than a scalar. If you do want to recycle, you’ll need to
do it yourself with `rep()`:

In [31]:
tibble(x = 1:4, y = 1:2)

ERROR: Error: Tibble columns must have compatible sizes.
* Size 4: Existing data.
* Size 2: Column `y`.
ℹ Only values of size one are recycled.


In [32]:
tibble(x = 1:4, y = rep(1:2, 2))

x,y
<int>,<int>
1,1
2,2
3,1
4,2


In [33]:
tibble(x = 1:4, y = rep(1:2, each = 2))

x,y
<int>,<int>
1,1
2,1
3,2
4,2
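
As a small sketch of the explicit recycling options, here are the three most common ways to call `rep()`. The name `short` is illustrative.

```r
short <- 1:2
rep(short, times = 2)       # 1 2 1 2: repeat the whole vector
rep(short, each = 2)        # 1 1 2 2: repeat each element in turn
rep(1:3, length.out = 5)    # 1 2 3 1 2: recycle until the target length is reached
```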


### Naming Vectors

All types of vectors can be named. You can name them during creation
with `c()`:

In [34]:
c(x = 1, y = 2, z = 4)

Or after the fact with `purrr::set_names()`:

In [35]:
set_names(1:3, c('a', 'b', 'c'))

## __Subsetting__

So far we’ve used `dplyr::filter()` to filter the rows in a tibble.
`filter()` only works with tibbles, so we’ll need a new tool for vectors: `[`.
`[` is the subsetting function, and is called like `x[a]`. There are four
types of things that you can subset a vector with:

* A numeric vector containing only integers. The integers must
either be all positive, all negative, or zero.
Subsetting with positive integers keeps the elements at those
positions:

In [36]:
x <- c('one', 'two', 'three', 'four', 'five')
x[c(3, 2, 5)]

By repeating a position, you can actually make a longer output
than input:

In [37]:
x[c(1, 1, 5, 5, 5, 2)]

Negative values drop the elements at the specified positions:

In [38]:
x[c(-1, -3, -5)]

It’s an error to mix positive and negative values:

In [40]:
x[c(1, -1)]

ERROR: Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts


The error message mentions subsetting with zero, which returns
no values:

In [42]:
x[0]

* Subsetting with a logical vector keeps all values corresponding
to a `TRUE` value. This is most often useful in conjunction with
the comparison functions:

In [43]:
x <- c(10, 3, NA, 5, 8, 1, NA)

# all non-missing values of x
x[!is.na(x)]

In [44]:
# all even (or missing!) values of x
x[x %% 2 == 0]

* If you have a named vector, you can subset it with a character
vector:

In [45]:
x <- c(abc = 1, def = 2, xyz = 5)
x[c('xyz', 'def')]

Like with positive integers, you can also use a character vector
to duplicate individual entries.

* The simplest type of subsetting is nothing, `x[]`, which returns
the complete `x`. This is not useful for subsetting vectors, but it is
useful when subsetting matrices (and other high-dimensional
structures) because it lets you select all the rows or all the columns
by leaving that index blank. For example, if `x` is 2D, `x[1, ]`
selects the first row and all the columns, and `x[, -1]`
selects all rows and all columns except the first.
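
A minimal sketch of blank-index subsetting on a small matrix. The matrix `m` and the result names are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)     # filled column by column: row 1 is 1 3 5, row 2 is 2 4 6
first_row <- m[1, ]            # 1 3 5
without_first_col <- m[, -1]   # a 2 x 2 matrix holding columns 2 and 3
everything <- m[]              # the complete matrix
```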


There is an important variation of `[` called `[[` . `[[` only ever extracts a
single element, and always drops names. It’s a good idea to use it
whenever you want to make it clear that you’re extracting a single
item, as in a for loop. The distinction between `[` and `[[` is most
important for lists, as we’ll see shortly.
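
The difference is already visible on a named atomic vector. The sketch below uses base R only, with illustrative names.

```r
x <- c(a = 1, b = 2, c = 3)
x["b"]      # a named vector of length 1
x[["b"]]    # just the value 2, with the name dropped

# Out-of-range positions: `[` returns NA, while `[[` raises an error
x[5]
out_of_range <- tryCatch(x[[5]], error = function(e) "error")
out_of_range
```

Because `[[` fails loudly on a bad index, it is a safer choice inside loops where you expect exactly one element.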

Exercise



What does `mean(is.na(x))` tell you about a vector `x`? What about `sum(!is.finite(x))`?

In [47]:
x <- c(-Inf, -1, 0, 1, Inf, NA, NaN)
(mean(is.na(x)))
(sum(!is.finite(x)))

The expression `mean(is.na(x))` is the proportion of elements that are missing (`NA` or `NaN`), here 2 out of 7. The expression `sum(!is.finite(x))` counts the elements that are not finite (`NA`, `NaN`, `Inf`, or `-Inf`), here 4.

Exercise

Carefully read the documentation of `is.vector()`. What does it actually test for? Why does `is.atomic()` not agree with the definition of atomic vectors above?

The function `is.vector()` only checks whether the object has no attributes other than names. Thus a `list` is a vector:

In [49]:
is.vector(list(a = 1, b = 2))

But any object that has an attribute (other than names) is not:

In [50]:
x <- 1:10
attr(x, 'something') <- TRUE
is.vector(x)

The idea behind this is that object-oriented classes will include attributes, including, but not limited to, `"class"`.

The function `is.atomic()` explicitly checks whether an object is one of the atomic types (“logical”, “integer”, “numeric”, “complex”, “character”, and “raw”) or, in versions of R before 4.4.0, `NULL`.

In [52]:
is.atomic(1:10)

In [54]:
is.atomic(list(A = 1))

The function `is.atomic()` will consider objects to be atomic even if they have extra attributes.

In [55]:
is.atomic(x)

Exercise



Compare and contrast `setNames()` with `purrr::set_names()`.


The function `setNames()` takes two arguments, a vector to be named and a vector of names to apply to its elements.

In [56]:
setNames(1:4, c('a', 'b', 'c', 'd'))

You can use the values of the vector as its names if only the `nm` argument is supplied.

In [57]:
setNames(nm = c('a', 'b', 'c', 'd'))

The function `set_names()` has more ways to set the names than `setNames()`. The names can be specified in the same manner as `setNames()`:

In [58]:
set_names(1:4, c('a', 'b', 'c', 'd'))

The names can also be specified as unnamed arguments:

In [59]:
set_names(1:4, 'a', 'b', 'c', 'd')

The function `set_names()` will name an object with itself if no `nm` argument is provided (`setNames()` does this only when the vector is passed as `nm`).

In [60]:
set_names(c('a', 'b', 'c', 'd'))

The biggest difference between `set_names()` and `setNames()` is that `set_names()` allows for using a function or formula to transform the existing names.

In [61]:
set_names(c(a = 1, b = 2, c = 3), toupper)

In [63]:
set_names(c(a = 1, b = 2, c = 3), ~toupper(.))

The `set_names()` function also checks that the length of the names argument is the same length as the vector that is being named, and will raise an error if it is not.

In [64]:
set_names(1:4, c('a', 'b'))

ERROR: Error: `nm` must be `NULL` or a character vector the same length as `x`


The `setNames()` function will allow the names to be shorter than the vector being named, and will set the missing names to `NA`.
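
A short sketch of that padding behavior in base R. The name `partly_named` is illustrative.

```r
# Only two names are supplied for a vector of length four
partly_named <- setNames(1:4, c("a", "b"))
names(partly_named)   # "a" "b" NA NA
```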

Exercise



Create functions that take a vector as input and return:

* The last value. Should you use `[` or `[[`?
* The elements at even-numbered positions.
* Every element except the last value.
* Only even numbers (and no missing values).



In [68]:
# this function finds the last value in a vector
last_value <- function(x) {
    # check for case with no length
    if (length(x) > 0) {
        x[[length(x)]]
    } else {
        x
    }
}

last_value(c(1, 2, 3, 4, 5, 4, 3, 4, 5, 6, 5, 6, 5, 4))

In [69]:
# this function returns the elements at even-numbered
# positions
even_indices <- function(x) {
    if (length(x) > 0) {
        x[seq_along(x) %% 2 == 0]
    } else {
        x
    }
}

even_indices(letters)

In [70]:
# this function returns a vector with every
# element except the last
not_last <- function(x) {
    n <- length(x)
    if (n) {
        x[-n]
    } else {
        x
    }
}

not_last(1:3)

In [71]:
# this function returns the elements
# of a vector that are even numbers;
# which() drops NA, so missing values are excluded
even_numbers <- function(x) {
    x[which(x %% 2 == 0)]
}

even_numbers(-4:4)

Exercise



Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?


These expressions differ in the way that they treat missing values. Let’s test how they work by creating a vector with positive and negative integers, and special values (`NA`, `NaN`, and `Inf`). These values should encompass all relevant types of values that these expressions would encounter.

In [72]:
x <- c(-1:1, Inf, -Inf, NaN, NA)
x[-which(x > 0)]

In [73]:
x[x <= 0]

The two expressions return the same values except for the `NaN` element: `x[-which(x > 0)]` keeps it as `NaN`, while `x[x <= 0]` returns `NA` in its place.
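
There is a second, less obvious difference that is not part of the original answer: when no element is positive, `which()` returns an empty vector, and negating it still selects nothing. A minimal sketch, with an illustrative vector `no_positives`:

```r
no_positives <- c(-2, -1, 0)
which(no_positives > 0)                  # integer(0): nothing matches
no_positives[-which(no_positives > 0)]   # numeric(0): every element is dropped
no_positives[no_positives <= 0]          # -2 -1 0, as intended
```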

## Recursive Vectors (Lists)

Lists are a step up in complexity from atomic vectors, because lists
can contain other lists. This makes them suitable for representing
hierarchical or tree-like structures. You create a list with `list()`:

In [74]:
x <- list(1, 2, 3)
x

A very useful tool for working with lists is `str()` because it focuses
on the structure, not the contents:

In [75]:
str(x)

List of 3
 $ : num 1
 $ : num 2
 $ : num 3


In [77]:
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)

List of 3
 $ a: num 1
 $ b: num 2
 $ c: num 3


Unlike atomic vectors, lists can contain a mix of objects:

In [78]:
y <- list('a', 1L, 1.5, TRUE)
str(y)

List of 4
 $ : chr "a"
 $ : int 1
 $ : num 1.5
 $ : logi TRUE


Lists can even contain other lists!

In [79]:
z <- list(list(1, 2), list(3, 4))
str(z)

List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ :List of 2
  ..$ : num 3
  ..$ : num 4


### Subsetting

There are three ways to subset a list, which I’ll illustrate with `a`:

In [80]:
a <- list(a = 1:3, b = 'a string', c = pi, d = list(-1, -5))

`[` extracts a sublist. The result will always be a list:

In [82]:
str(a[1:2])

List of 2
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"


In [83]:
str(a[4])

List of 1
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5


Like with vectors, you can subset with a logical, integer, or character vector.

`[[` extracts a single component from a list. It removes a level of
hierarchy from the list:

In [84]:
str(y[[1]])

 chr "a"


In [85]:
str(y[[4]])

 logi TRUE


In [86]:
str(a[[1]])

 int [1:3] 1 2 3


`$` is a shorthand for extracting named elements of a list. It works
similarly to `[[` except that you don’t need to use quotes:

In [87]:
a$a

In [88]:
a[['a']]

The distinction between `[` and `[[` is really important for lists,
because `[[` drills down into the list while `[` returns a new, smaller
list.
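
A brief sketch of that distinction, reusing the list `a` defined earlier in this section (repeated here so the block runs on its own):

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

typeof(a["a"])     # "list": still a list, of length 1
typeof(a[["a"]])   # "integer": the component itself

# Chain extractions to reach into a nested list
a$d[[2]]           # -5
a[["d"]][[1]]      # -1
```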

## Attributes

Any vector can contain arbitrary additional metadata through its
attributes. You can think of attributes as a named list of vectors that
can be attached to any object. You can get and set individual
attribute values with `attr()` or see them all at once with
`attributes()`:

In [89]:
x <- 1:10
attr(x, 'greeting')

NULL

In [90]:
attr(x, 'greeting') <- 'Hi!'
attr(x, 'farewell') <- 'Bye!'
attributes(x)

There are three very important attributes that are used to implement
fundamental parts of R:

* _Names_ are used to name the elements of a vector.
* _Dimensions_ (dims, for short) make a vector behave like a matrix
or array.
* _Class_ is used to implement the S3 object-oriented system.

You’ve seen names earlier, and we won’t cover dimensions because
we don’t use matrices in this book. It remains to describe the class,
which controls how generic functions work. Generic functions are
key to object-oriented programming in R, because they make functions behave differently for different classes of input. Here’s what a typical generic function looks like:

In [91]:
as.Date

The call to “UseMethod” means that this is a generic function, and it
will call a specific method, a function, based on the class of the first
argument. (All methods are functions; not all functions are methods.)
You can list all the methods for a generic with `methods()`:

In [92]:
methods('as.Date')

[1] as.Date.character   as.Date.default     as.Date.factor     
[4] as.Date.numeric     as.Date.POSIXct     as.Date.POSIXlt    
[7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
see '?methods' for accessing help and source code

For example, if `x` is a character vector, `as.Date()` will call
`as.Date.character()`; if it’s a factor, it’ll call `as.Date.factor()`.

You can see the specific implementation of a method with
`getS3method()`:

In [93]:
getS3method('as.Date', 'default')

In [94]:
getS3method('as.Date', 'numeric')

The most important S3 generic is `print()`: it controls how the
object is printed when you type its name at the console. Other
important generics are the subsetting functions `[`, `[[`, and `$`.
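
To make the dispatch mechanism concrete, here is a minimal sketch of a home-made generic. The generic `describe()`, its methods, and the class `"temperature"` are hypothetical names invented for this example; they are not part of base R or the tidyverse.

```r
# A generic function: UseMethod() picks a method based on the class of `x`
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
    paste("a", typeof(x), "vector of length", length(x))
}

# Method for objects whose class attribute is "temperature" (hypothetical class)
describe.temperature <- function(x, ...) {
    paste(length(x), "temperature reading(s)")
}

readings <- structure(c(21.5, 19.0), class = "temperature")
describe(1:3)        # dispatches to describe.default
describe(readings)   # dispatches to describe.temperature
```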

## Augmented Vectors

Atomic vectors and lists are the building blocks for other important
vector types like factors and dates. I call these augmented vectors,
because they are vectors with additional attributes, including class.
Because augmented vectors have a class, they behave differently to
the atomic vector on which they are built.

### Factors
Factors are designed to represent categorical data that can take a
fixed set of possible values. Factors are built on top of integers, and
have a levels attribute:

In [95]:
x <- factor(c('ab', 'cd', 'ab'), levels = c('ab', 'cd', 'ef'))
typeof(x)

In [96]:
attributes(x)
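
As a rough sketch of how the integer codes and the `levels` attribute fit together, using the same factor as above:

```r
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
as.integer(x)              # 1 2 1: positions in the levels vector
levels(x)                  # "ab" "cd" "ef"
levels(x)[as.integer(x)]   # "ab" "cd" "ab": codes mapped back to labels
```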

### Dates and Date-Times

Dates in R are numeric vectors that represent the number of days
since 1 January 1970:

In [97]:
x <- as.Date('1971-01-01')
unclass(x)

In [98]:
typeof(x)

In [99]:
attributes(x)

Date-times are numeric vectors with class POSIXct that represent
the number of seconds since 1 January 1970. (In case you were wondering, “POSIXct” stands for “Portable Operating System Interface,”
calendar time.)

In [100]:
x <- lubridate::ymd_hm('1970-01-01 01:00')

In [101]:
unclass(x)

In [102]:
typeof(x)

In [103]:
attributes(x)

The `tzone` attribute is optional. It controls how the time is printed,
not what absolute time it refers to:

In [104]:
attr(x, 'tzone') <- 'US/Pacific'
x

[1] "1969-12-31 17:00:00 PST"

In [106]:
attr(x, 'tzone') <- 'US/Eastern'
x

[1] "1969-12-31 20:00:00 EST"
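
A small check that changing `tzone` leaves the stored number of seconds untouched. This sketch uses base `as.POSIXct()` instead of lubridate, and it assumes the time zone name `America/Los_Angeles` is available on your system; the names `before` and `after` are illustrative.

```r
x <- as.POSIXct("1970-01-01 01:00:00", tz = "UTC")
before <- as.numeric(x)                     # 3600 seconds after the epoch

attr(x, "tzone") <- "America/Los_Angeles"   # only changes how x is printed
after <- as.numeric(x)                      # still 3600
```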

### Tibbles

Tibbles are augmented lists. They have three classes: `tbl_df`, `tbl`,
and `data.frame`. Besides `class`, they have two attributes: (column) `names` and
`row.names`.

In [107]:
tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
attributes(tb)

Traditional `data.frame`s have a very similar structure:

In [108]:
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
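
The sketch below compares the two objects side by side. It assumes the tibble package is installed (it is part of the tidyverse loaded at the top of these notes).

```r
tb <- tibble::tibble(x = 1:5, y = 5:1)
df <- data.frame(x = 1:5, y = 5:1)

class(tb)    # "tbl_df" "tbl" "data.frame"
class(df)    # "data.frame"

# Both are lists of equal-length columns underneath
is.list(tb)  # TRUE
is.list(df)  # TRUE
length(tb)   # 2: the number of columns
tb[["x"]]    # a column extracted like any list component
```

The extra classes are what let tibbles print and subset differently from a plain data frame while sharing the same underlying list structure.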