<center><h1>The `NA` Type in R</h1></center>
<center><h3>Paul Stey, Ph.D.</h3></center>
<center><h3>2020-09-16</h3></center>


# 1. `NA` Type
  - Used to represent missing data
  - Frequently occurs in "real" data sets

In [1]:
a <- NA           # `NA` is the missing data literal

In [2]:
a + 4             # NA values propagate

In [3]:
(42 + a)/2        # returns NA
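
As a minimal sketch of the propagation rule: arithmetic on a missing value is always missing, while logical operators return a definite answer only when the missing operand cannot change the result. The variable names below are illustrative and not from the original notebook.

```r
a <- NA

# Arithmetic with a missing value is itself missing
zero_times_na <- a * 0        # NA: even multiplying by zero yields NA

# Logical operators return a definite value when the NA cannot matter
and_false <- NA & FALSE       # FALSE, whatever the missing value is
or_true   <- NA | TRUE        # TRUE, whatever the missing value is
and_true  <- NA & TRUE        # NA: the result depends on the missing value
```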

## 1.1 Checking for `NA`

  - The obvious approach, comparing with `==`, does not work

In [4]:
print(a)          # recall `a` is NA

[1] NA


In [7]:
a == NA           # returns NA, not TRUE: the comparison itself propagates NA

In [8]:
NA == NA          # also NA ¯\_(ツ)_/¯

### 1.1.1 Correctly Checking for `NA` Values

In [9]:
is.na(a)          # the `is.na()` function lets us check for missingness

In [10]:
is.na(NA)         # yep

In [13]:
is.na(42)         # nope
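
The difference matters most inside conditions. The sketch below assumes we want to branch on whether `a` is missing; `cmp` and `status` are illustrative names. Because `a == NA` evaluates to `NA`, passing it straight to `if` would raise an error, whereas `is.na()` always returns `TRUE` or `FALSE`.

```r
a <- NA

# The comparison is itself missing, so it is unusable as an `if` condition
cmp <- a == NA
is.na(cmp)                     # TRUE

# `is.na()` always returns TRUE or FALSE, so it is safe in a condition
if (is.na(a)) {
  status <- "missing"
} else {
  status <- "observed"
}
status                         # "missing"
```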

## 1.2 Containers with `NA` Values

  - Propagated `NA`s can lead to surprising behavior

In [15]:
v <- c(3, 2, NA, 5)            # create vector with NA

In [16]:
mean(v)                        # returns NA because the NA propagates

In [18]:
mean(v, na.rm = TRUE)          # `na.rm = TRUE` removes NAs
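
The `na.rm` argument is not specific to `mean()`. As a short sketch, several other base R summary functions accept it in the same way; the result names are illustrative.

```r
v <- c(3, 2, NA, 5)

total_raw  <- sum(v)                   # NA: the missing value propagates
total_obs  <- sum(v, na.rm = TRUE)     # 10
largest    <- max(v, na.rm = TRUE)     # 5
average    <- mean(v, na.rm = TRUE)    # 3.333333
```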

### 1.2.1 Finding `NA`s in a `vector`

In [20]:
w <- c(4, 5, 33, NA, 7)

is.na(w)
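
Since `is.na()` returns a logical vector, it can be combined with other base R functions to count, locate, or drop the missing values. A minimal sketch, with `miss` as an illustrative name:

```r
w <- c(4, 5, 33, NA, 7)

miss <- is.na(w)       # FALSE FALSE FALSE TRUE FALSE

n_miss   <- sum(miss)      # 1: TRUE counts as 1, so this is the number of NAs
pos_miss <- which(miss)    # 4: position of the missing value
observed <- w[!miss]       # 4 5 33 7: the observed values only
```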

# 2. Applied Example

  - We are given a large vector with many missing values
  - We want to replace the missing values with the mean of the non-missing values

In [21]:
n <- 10000                      # make up our sample size

x <- rnorm(n)                   # simulate 10k draws from normal dist'n

n_miss <- 100                   # number of missing values to generate

x[sample(1:n, n_miss)] <- NA    # set `n_miss` values to be NA

## 2.1 Finding `NA` Values

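  - Printing `x` shows the `NA`s in place (for example, element 26 below), but scanning 10,000 values by eye is impractical
  - We can use `is.na()` to find our missing values programmatically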

In [24]:
print(x)

    [1] -0.9567970841  0.5790187232  0.7770726178 -1.1768428857  0.9706907395
    [6] -1.6325974820 -2.5564692389 -0.6132367722  1.2200571644 -0.8608689466
   [11]  0.7242786026  0.2786743861 -0.1195953412  0.0976578918  0.3351682084
   [16] -0.0813928466  1.9986532931 -1.8018300727 -1.1543910947  0.7234624562
   [21] -0.1685455494  0.0952096687  0.3769857343 -0.2327431911  0.5051146168
   [26]            NA  0.6234677604 -0.9710526960  0.9338689702 -0.1214318103
   [31] -1.0313001207 -1.1698370196 -1.1023318733 -0.2514128798 -0.7392868648
   [36]  0.8441845154  0.3447464168  1.4797166939  0.5372408897  0.6583285071
   [41] -0.1653023163  2.0905956325  1.5096391170  0.0453731416  0.7660911021
   [46]  0.1118427629  0.1595289056  1.2602812350 -0.2909843458  0.1416162166
   [51] -1.2184524213  2.2678021280  1.2985946216  0.2876972414  1.3619375964
   [56] -0.5253505944 -0.1673978592  0.0162475822 -0.4259452061  0.1643240506
   [61]  0.2394389410  0.3876420922 -0.8983858970  0.3461135561 
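
For a vector this long, a summary is usually more informative than the printed values. The sketch below rebuilds a comparable `x` so that it runs on its own; the seed is an assumption added only for reproducibility and is not part of the original notebook, so the simulated values will differ from the output above.

```r
set.seed(1)                          # assumed seed, not in the original
n <- 10000
x <- rnorm(n)
x[sample(1:n, 100)] <- NA            # same setup as above: 100 values set to NA

n_na    <- sum(is.na(x))             # count of missing values: 100
prop_na <- mean(is.na(x))            # proportion of missing values: 0.01
```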

### 2.1.1 Replace `NA`s with Mean

In [25]:
idx_miss <- which(is.na(x))          # get indices of missing values

print(idx_miss)

  [1]   26  245  263  629  822  837  866  887  892  939 1013 1148 1207 1474 1698
 [16] 1741 1754 1912 2069 2160 2188 2248 2268 2382 2600 2614 2620 2812 2886 3002
 [31] 3008 3086 3102 3107 3170 3232 3824 3894 3975 3992 4286 4399 4651 4656 4669
 [46] 4834 4951 5139 5172 5210 5672 5739 5862 5948 6071 6080 6123 6161 6240 6383
 [61] 6402 6410 6422 6436 6438 6459 6533 6551 6884 6929 7086 7120 7233 7296 7359
 [76] 7543 7641 7647 7694 7758 7830 7839 7993 8294 8366 8397 8465 8519 8540 8626
 [91] 8702 8971 8984 9008 9111 9259 9588 9806 9817 9920


In [26]:
mean_obs <- mean(x, na.rm = TRUE)   # get mean of observed values

In [27]:
x[idx_miss] <- mean_obs             # replace NAs with `mean_obs`

In [28]:
isTRUE(all.equal(mean_obs, mean(x)))  # confirm mean of `x` matches `mean_obs` (up to floating-point error)

In [29]:
any(is.na(x))                       # confirm there are no longer NA values in `x`
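
The steps above can be collected into a small reusable function. This is only a sketch: `impute_mean()` is a hypothetical helper written for illustration, not a base R function, and it assumes a numeric vector with at least one observed value (if every value is `NA`, the mean is `NaN`).

```r
# Hypothetical helper: replace NAs with the mean of the observed values
impute_mean <- function(x) {
  miss <- is.na(x)                     # logical mask of missing values
  x[miss] <- mean(x, na.rm = TRUE)     # fill them with the observed mean
  x
}

y <- impute_mean(c(3, 2, NA, 5))
y                      # 3.000000 2.000000 3.333333 5.000000
anyNA(y)               # FALSE
```

Mean imputation leaves the mean of the vector unchanged, as confirmed above, but the filled-in values carry no variation of their own.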