In [2]:
library(tidyverse)

# Implicit and explicit NA

Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

- __Explicitly__, i.e. flagged with NA.
- __Implicitly__, i.e. simply not present in the data.

In [3]:
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks

year,qtr,return
2015,1,1.88
2015,2,0.59
2015,3,0.35
2015,4,
2016,2,0.92
2016,3,0.17
2016,4,2.66


There are two missing values in this dataset:

- The return for the fourth quarter of 2015 is _explicitly_ missing, because the cell where its value should be instead contains NA.

- The return for the first quarter of 2016 is _implicitly_ missing, because it simply does not appear in the dataset.
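
Both kinds of missingness can also be found programmatically. The following is a minimal sketch, not part of the original analysis: explicit missing values are rows where `return` is `NA`, while implicit ones are (year, qtr) pairs that never appear as a row. The names `explicit_missing` and `implicit_missing` are illustrative.

```r
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Explicit: the row exists, but its value is NA
explicit_missing <- stocks %>% filter(is.na(return))

# Implicit: build every (year, qtr) pair, then keep the pairs
# that have no matching row in the original data
implicit_missing <- stocks %>%
  expand(year, qtr) %>%
  anti_join(stocks, by = c("year", "qtr"))

explicit_missing
implicit_missing
```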

The way that a dataset is represented can make implicit missing values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:

In [5]:
stocks_explicitly <- stocks %>% pivot_wider(names_from = year, values_from = return)
stocks_explicitly

qtr,2015,2016
1,1.88,
2,0.59,0.92
3,0.35,0.17
4,,2.66


Because these explicit missing values may not be important in other representations of the data, we can turn explicit NAs back into implicit ones by setting `values_drop_na` in `pivot_longer()`:


In [6]:
stocks_explicitly %>% pivot_longer(cols = !qtr, names_to = 'year', values_to = 'return', values_drop_na = TRUE)

qtr,year,return
1,2015,1.88
2,2015,0.59
2,2016,0.92
3,2015,0.35
3,2016,0.17
4,2016,2.66


Another important tool for making missing values explicit in tidy data is **`complete()`**:  
**`complete()`** takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

In [8]:
stocks %>% complete(year, qtr)

year,qtr,return
2015,1,1.88
2015,2,0.59
2015,3,0.35
2015,4,
2016,1,
2016,2,0.92
2016,3,0.17
2016,4,2.66
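
To make the description of `complete()` concrete, here is a rough sketch of the same result built by hand with `expand()` and `left_join()`. This is an illustration of the idea, not a claim about how `complete()` is implemented internally. The names `manual` and `completed` are illustrative.

```r
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Step 1: all unique combinations of year and qtr (2 years x 4 quarters)
# Step 2: join the original data back on; unmatched combinations get NA
manual <- stocks %>%
  expand(year, qtr) %>%
  left_join(stocks, by = c("year", "qtr"))

# The one-step version used in the text
completed <- stocks %>% complete(year, qtr)

manual
```

In this small example both approaches give eight rows, with `NA` for 2015 Q4 (already explicit) and 2016 Q1 (previously implicit).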


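There’s one other important tool that you should know for working with missing values. Sometimes, when a data source has primarily been used for data entry, a missing value indicates that the previous value should be carried forward. `fill()` handles this case by replacing each missing value in the selected columns with the most recent non-missing value (last observation carried forward):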

In [10]:
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

treatment

person,treatment,response
Derrick Whitmore,1,7
,2,10
,3,9
Katherine Burke,1,4


In [12]:
treatment %>% fill(person, .direction = 'down')

person,treatment,response
Derrick Whitmore,1,7
Derrick Whitmore,2,10
Derrick Whitmore,3,9
Katherine Burke,1,4
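
The `.direction` argument controls where the replacement value comes from. As a small sketch on a made-up tibble (`scores` is hypothetical and not part of the original data), `"down"` leaves a leading `NA` untouched because there is no earlier value to carry forward, whereas `"downup"` fills downward first and then upward:

```r
library(tidyverse)

# Hypothetical data with a leading NA
scores <- tibble(
  person = c(NA, "A", NA, "B", NA),
  score  = 1:5
)

# Carry the last non-missing value forward; the first row stays NA
down <- scores %>% fill(person, .direction = "down")

# Fill downward first, then fill any remaining gaps upward
downup <- scores %>% fill(person, .direction = "downup")

down
downup
```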


# How to handle missing values

- Drop observations that have missing values
- Imputation (mean, mode, ...)
- Use a model to predict the missing values
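
The three strategies can be sketched on the small `stocks` tibble from earlier. This is only an illustration under simple assumptions: the linear model `return ~ qtr` is an arbitrary choice made for the example, not a recommendation, and all three variants only address explicit `NA`s (implicit gaps would first need to be made explicit, for example with `complete()`).

```r
library(tidyverse)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# 1. Drop observations whose return is missing
dropped <- stocks %>% drop_na(return)

# 2. Mean imputation: replace NA with the mean of the observed returns
imputed <- stocks %>%
  mutate(return = coalesce(return, mean(return, na.rm = TRUE)))

# 3. Model-based: fit on the observed rows (lm() drops NA rows by default),
#    then use the prediction only where return is missing
model <- lm(return ~ qtr, data = stocks)
predicted <- stocks %>%
  mutate(return = coalesce(return, as.numeric(predict(model, newdata = stocks))))

dropped
imputed
predicted
```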

# Case study

>Look at the `who` dataset.  Are there implicit missing values? What’s the difference between an `NA` and zero?

In [3]:
who %>% head()

country,iso2,iso3,year,new_sp_m014,new_sp_m1524,new_sp_m2534,new_sp_m3544,new_sp_m4554,new_sp_m5564,...,newrel_m4554,newrel_m5564,newrel_m65,newrel_f014,newrel_f1524,newrel_f2534,newrel_f3544,newrel_f4554,newrel_f5564,newrel_f65
Afghanistan,AF,AFG,1980,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1981,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1982,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1983,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1984,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1985,,,,,,,...,,,,,,,,,,


In [4]:
who1 <- who %>% pivot_longer(
    starts_with('new'),
    names_pattern = 'new_?(.*)_(.)(.*)',
    names_to = c('diagnosis', 'sex', 'age'),
    values_to = 'cases'
)

who1 %>% head()

country,iso2,iso3,year,diagnosis,sex,age,cases
Afghanistan,AF,AFG,1980,sp,m,014,
Afghanistan,AF,AFG,1980,sp,m,1524,
Afghanistan,AF,AFG,1980,sp,m,2534,
Afghanistan,AF,AFG,1980,sp,m,3544,
Afghanistan,AF,AFG,1980,sp,m,4554,
Afghanistan,AF,AFG,1980,sp,m,5564,
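
To see how the `names_pattern` regular expression splits a column name, the same pattern can be applied to a few names with `stringr::str_match()`. This is just a sketch for inspection; `pivot_longer()` does the matching itself, and the column labels assigned to `parts` below are illustrative.

```r
library(tidyverse)

pattern <- "new_?(.*)_(.)(.*)"
cols <- c("new_sp_m014", "new_ep_f65", "newrel_m4554")

# str_match() returns a character matrix: the full match first,
# then one column per capture group
parts <- str_match(cols, pattern)
colnames(parts) <- c("column", "diagnosis", "sex", "age")

parts
```

The optional `_?` is what lets the pattern handle both the `new_sp_...` and the `newrel_...` naming styles. The captured age group remains a character string such as `"014"`.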


The main concern is whether a missing value means that there were no cases of TB or whether it means that the WHO does not have data on the number of TB cases. Here are some things we should look for to help distinguish between these cases.

- If there are no 0 values in the data, then missing values may be used to indicate no cases.

- If there are both explicit and implicit missing values, then it suggests that missing values are being used differently. In that case, it is likely that explicit missing values would mean no cases, and implicit missing values would mean no data on the number of cases.

First, I'll check for the presence of 0 in the data:

In [8]:
who1 %>% filter(cases == 0) %>% tbl_sum()

There are zeros in the data, so it appears that zero cases of TB are recorded explicitly, and `NA` is used to indicate missing data.

Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.

In [22]:
# the number of 0 in each TB group column

who %>% summarize(across(starts_with('new'), ~ sum(.x == 0, na.rm = TRUE)))

new_sp_m014,new_sp_m1524,new_sp_m2534,new_sp_m3544,new_sp_m4554,new_sp_m5564,new_sp_m65,new_sp_f014,new_sp_f1524,new_sp_f2534,...,newrel_m4554,newrel_m5564,newrel_m65,newrel_f014,newrel_f1524,newrel_f2534,newrel_f3544,newrel_f4554,newrel_f5564,newrel_f65
862,252,212,198,195,223,259,737,241,226,...,13,13,14,27,15,15,15,17,19,15
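
The per-column counts above show that zeros occur in every group. Whether a (country, year) can have only some of its columns missing could be checked more directly with a sketch like the following, which computes the proportion of missing case counts per (country, year) and keeps the combinations that are only partly missing. The names `who_long`, `key`, `prop_missing` and `partly_missing` are illustrative.

```r
library(tidyverse)

# One row per (country, year, TB group)
who_long <- who %>%
  pivot_longer(starts_with("new"), names_to = "key", values_to = "cases")

# Share of missing case counts within each (country, year);
# keep combinations that are neither fully observed nor fully missing
partly_missing <- who_long %>%
  group_by(country, year) %>%
  summarize(prop_missing = mean(is.na(cases)), .groups = "drop") %>%
  filter(prop_missing > 0, prop_missing < 1)

partly_missing
```

If `partly_missing` has any rows, missingness is recorded per column rather than for the (country, year) as a whole.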


Next, I will check for implicit missing values. Implicit missing values are (year, country) combinations that do not appear in the data.

In [12]:
# The number of (year, country) combinations in the data
who1 %>% distinct(country, year) %>% nrow()

In [14]:
# The number of possible (year, country) combinations in the data
who1 %>% expand(country, year) %>% nrow()

Since the number of possible (country, year) combinations is greater than the number of combinations that actually appear in `who`, there are some implicit missing values. But that doesn’t tell us what those implicit missing values are. Let's find the (country, year) combinations that do not appear in the `who` dataset:

In [17]:
country_year_implicit <- who1 %>% expand(country, year) %>% anti_join(who1, by = c('country', 'year'))

country_year_implicit

country,year
"Bonaire, Saint Eustatius and Saba",1980
"Bonaire, Saint Eustatius and Saba",1981
"Bonaire, Saint Eustatius and Saba",1982
"Bonaire, Saint Eustatius and Saba",1983
"Bonaire, Saint Eustatius and Saba",1984
"Bonaire, Saint Eustatius and Saba",1985
"Bonaire, Saint Eustatius and Saba",1986
"Bonaire, Saint Eustatius and Saba",1987
"Bonaire, Saint Eustatius and Saba",1988
"Bonaire, Saint Eustatius and Saba",1989


In [20]:
country_year_implicit %>% group_by(country) %>% summarize(min_year = min(year), max_year = max(year))

`summarise()` ungrouping output (override with `.groups` argument)


country,min_year,max_year
"Bonaire, Saint Eustatius and Saba",1980,2009
Curacao,1980,2009
Montenegro,1980,2004
Netherlands Antilles,2010,2013
Serbia,1980,2004
Serbia & Montenegro,2005,2013
Sint Maarten (Dutch part),1980,2009
South Sudan,1980,2010
Timor-Leste,1980,2001


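All of these refer to (country, year) combinations for years in which the country did not exist, either because it had not yet come into existence or because it had already ceased to exist (as with Netherlands Antilles and Serbia & Montenegro). For example, Timor-Leste achieved independence in 2002, so years prior to that are not included in the data.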

To summarize:

- `0` is used to represent no cases of TB.
- Explicit missing values (`NA`s) are used to represent missing data for (country, year) combinations in which the country existed in that year.
- Implicit missing values are used to represent missing data because a country did not exist in that year.
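
If a later analysis needs these implicit gaps to be visible as rows, `complete()` could be applied to `who`. The sketch below is an optional follow-up rather than part of the case study. It checks that the number of rows added equals the number of absent (country, year) combinations. The names `who_complete`, `implicit` and `added` are illustrative.

```r
library(tidyverse)

# Add one row for every (country, year) combination that is absent;
# all other columns in the added rows are NA
who_complete <- who %>% complete(country, year)

# The absent combinations, found as in the case study
implicit <- who %>%
  expand(country, year) %>%
  anti_join(who, by = c("country", "year"))

added <- nrow(who_complete) - nrow(who)
added == nrow(implicit)
```

After this step the two kinds of missingness can no longer be told apart from the case counts alone, so it may be worth flagging the added rows before completing if the distinction matters.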