# Missing Values in R
We need to do two things:
1. Count the number of missing values
2. Decide what to do with them

I'll show this process using the Titanic dataframe (passengers who were on the Titanic).

In [1]:
titanic_df = read.csv('titanic.csv')
head(titanic_df, 10)   # This displays the first 10 rows of the dataframe. Without the number, the default is 6.

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<int>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` |
| 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
| 7 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | | S |
| 9 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S |
| 10 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C |


## 1. Counting the missing values in your dataframe
You'll want to count the number of missing values in a column to determine what to do with the column. The following command will do that:
```R
sum( is.na( df$col ) )
```

* `df$col` selects just the one variable (column) you want to examine
* `is.na( )` tests each value in that variable and returns a logical vector:
  * If the value is valid, the result is `FALSE` (counted as 0 when summed)
  * If the value is NA, the result is `TRUE` (counted as 1 when summed)
* `sum( )` adds these up. Because each `TRUE` counts as 1, the total is the number of NA's
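
A minimal sketch of what each step returns, using a made-up vector `ages` (hypothetical values, not read from the Titanic file):

```R
# Toy vector with two missing values (illustrative data only)
ages <- c(22, NA, 26, 35, NA)

is.na(ages)        # logical vector: FALSE  TRUE FALSE FALSE  TRUE
sum(is.na(ages))   # each TRUE counts as 1, so this returns 2
mean(is.na(ages))  # share of values that are missing: 0.4
```

Using `mean( )` instead of `sum( )` gives the proportion of missing values, which can be easier to compare across columns.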

The above command will only count the number of NA's in a single column. To do multiple columns, there are two methods:

### 1.1 sapply
Remember that we can make a function as follows:
```R
ftn <- function(x)
    {
        sum(is.na(x))
    }
```

We can write this all on one line:
```R
ftn <- function(x) sum(is.na(x))
```

Now, we can call `ftn(df$col)` to execute the function for column `col`. But if we want to use the function on all columns, then we need to use the `sapply` function. This will apply the function (*ftn*) to each column of the dataframe (*df*) and return one count per column:
```R
sapply( df , ftn )
```

You can also do this without saving the function first, by replacing *ftn* with the function definition itself:
```R
sapply( df , function(x) sum(is.na(x)) )
```

In [2]:
sum(is.na(titanic_df$Age))

In [3]:
sapply( titanic_df, function(x) sum(is.na(x)))
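
One thing these counts can miss: `read.csv` does not turn blank text fields into NA by default, so a character column such as `Cabin` can report zero NA's while still being mostly empty. Below is a small sketch with made-up CSV text (not the real `titanic.csv`), assuming blank strings should count as missing. `na.strings` is a standard `read.csv` argument, and `colSums(is.na(...))` is a base-R alternative to the `sapply` call.

```R
# Made-up CSV text for illustration (not the real titanic.csv)
csv_text <- "PassengerId,Age,Cabin
1,22,
2,38,C85
3,,
4,35,C123"

# Default read: blank Cabin fields become empty strings, not NA
raw_df <- read.csv(text = csv_text)
colSums(is.na(raw_df))    # Age is 1, Cabin is 0

# Treat blank fields as NA while reading
clean_df <- read.csv(text = csv_text, na.strings = c("", "NA"))
colSums(is.na(clean_df))  # Age is 1, Cabin is 2
```

Whether a blank really means "missing" depends on the dataset, so treat the `na.strings` choice as an assumption to check against the data documentation.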

### 1.2 dplyr
Here, we are going to do 3 things:
1. Call the data frame
2. Filter the data frame to just the data we want/need
3. Count the NA's

In [None]:
library('dplyr')
# If not installed, run * install.packages('dplyr') *

titanic_df %>%
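    #filter(Sex == 'female') %>%     # Optional: filter rows first. summarise() also works without it
    summarise(across(everything(), ~ sum(is.na(.x))))   # Needs dplyr 1.0.0 or later; replaces summarise_all(funs(...)), since funs() is deprecated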

## 2. What to do with Missing Data
There are a few different possibilities. Each one depends on how much data is missing and what you want to do with it. But basically, we can say that we'll do one of three things:
1. Eliminate missing observations (rows)
2. Eliminate variables (columns) that have too much missing data
3. Fill in missing values

### 2.1 Eliminate missing observations
If you are comparing multiple variables, an observation that has data for one variable but not the other cannot be used in the comparison. In this case, you may want to eliminate those observations.

To see this, let's look at the first 10 rows of our dataframe as below.
* Usually, you will use the dataframe itself: `titanic_df`
* For the first 10 rows, call `head(titanic_df,10)`

In [None]:
head(titanic_df,10)  # Notice the missing value under Age for row 6

In [None]:
library('tidyr')
# If not installed, run * install.packages('tidyr') *

head(titanic_df,10) %>% drop_na(Age)  # Row 6 is removed

In [None]:
# Another option: na.omit() removes every row that has an NA in ANY column, not just Age

head(titanic_df,10) %>% na.omit()
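
The two approaches give different results when NA's appear in more than one column. Here is a minimal base-R sketch on a made-up data frame (`toy_df` and its values are hypothetical), where indexing with `!is.na(...)` stands in for `drop_na(Age)`:

```R
# Made-up data: Age is missing in row 2, Fare is missing in row 3
toy_df <- data.frame(
  Name = c("A", "B", "C", "D"),
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 71.28, NA, 53.10)
)

# Drop rows only where Age is missing (same idea as drop_na(Age))
age_known <- toy_df[!is.na(toy_df$Age), ]
nrow(age_known)             # 3 rows remain: A, C, D

# Drop rows with an NA in ANY column
all_complete <- na.omit(toy_df)
nrow(all_complete)          # 2 rows remain: A, D

# complete.cases() flags the same fully observed rows
sum(complete.cases(toy_df)) # 2
```

Row C keeps a valid `Age` but is still removed by `na.omit()` because its `Fare` is missing.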

#### Eliminating values for a calculation
Try finding the mean age for the first 10 rows:

In [None]:
mean(head(titanic_df$Age,10))

This returns `NA` because the values include an NA. We can find the mean once we omit the NA values.

In [None]:
# na.omit() on a single column (a vector) drops only the NA's in that column, so this works fine here.
# Just be aware that na.omit() on a whole dataframe drops a row if ANY variable is NA, even if there is a value for the age.
mean(head(titanic_df$Age, 10) %>% na.omit())

In [None]:
# This is another method, slightly more straightforward: it doesn't remove the NA values, it just ignores them in the calculation
mean(head(titanic_df$Age, 10), na.rm = TRUE)
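
As a quick check, the three calls can be compared on the first 10 `Age` values shown in the table at the top (typed in by hand here so the sketch runs without the CSV file):

```R
# First 10 Age values from the table above; row 6 is missing
ages <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

mean(ages)                 # NA, because one value is missing
mean(na.omit(ages))        # 28.11111 (253 / 9)
mean(ages, na.rm = TRUE)   # 28.11111, same result without modifying ages
```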

### 2.2 Ignoring Variables
Sometimes a variable has too many NA's to be useful. If this is the case, it may be better to leave that variable (column) out of the analysis.
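
A minimal sketch of dropping such variables, assuming a cutoff of 50% missing. The cutoff and the name `max_missing` are assumptions for illustration, not a rule; the data frame is made up.

```R
# Made-up data: Cabin is missing in 3 of 4 rows
toy_df <- data.frame(
  Age   = c(22, NA, 26, 35),
  Cabin = c(NA, "C85", NA, NA),
  Fare  = c(7.25, 71.28, 7.93, 53.10)
)

max_missing <- 0.5                         # hypothetical cutoff: keep columns with at most 50% NA
share_missing <- colMeans(is.na(toy_df))   # Age 0.25, Cabin 0.75, Fare 0

# Keep only the columns at or below the cutoff
kept_df <- toy_df[, share_missing <= max_missing, drop = FALSE]
names(kept_df)                             # "Age" "Fare"
```

What counts as "too many" depends on the analysis, so the cutoff is worth choosing per dataset rather than reusing this value.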

### 2.3 Filling in values
If there isn't too much missing data, then it could be acceptable to insert a dummy value. This could be useful when we still want to compare two variables, but missing data makes comparison of some data points impossible.
* **WARNING**: If there is a lot of missing data, then filling in values *could* affect the result and would amount to falsifying data. As long as the percentage of missing data is low, filling values will help to keep those comparisons possible.

In [None]:
# Original/Unfilled values
titanic_df[106:112,]

#### Filling with Average Values

In [None]:
# This will permanently change the NA values. You can always reload the data to try a different method.
titanic_df$Age[is.na(titanic_df$Age)] <- mean(titanic_df$Age, na.rm = TRUE)

# Could also replace with Min, Max, or some other value
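
Because the assignment above overwrites the column, one alternative is to fill a copy and keep a flag marking which values were filled in. This is only a sketch on a made-up data frame; the column names `Age_filled` and `Age_was_missing` are hypothetical.

```R
# Made-up data with two missing ages
toy_df <- data.frame(Age = c(20, NA, 30, 40, NA))

toy_df$Age_was_missing <- is.na(toy_df$Age)   # remember which rows get a filled value
toy_df$Age_filled <- toy_df$Age               # work on a copy and leave Age untouched
toy_df$Age_filled[toy_df$Age_was_missing] <- mean(toy_df$Age, na.rm = TRUE)

toy_df$Age_filled        # 20 30 30 40 30
sum(is.na(toy_df$Age))   # still 2: the original column is unchanged
```

Keeping the flag makes it possible to rerun an analysis with and without the filled rows and see how much the filling affects the result.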

#### Filling with Neighboring Values

In [None]:
# Start from the bottom and work UP. Fills any NA value with the value below it.
# If you ran the mean fill above, reload the data first; otherwise Age has no NA's left to fill.
# Note entries 108 and 110 for the 'Age' column - Each is filled with the value below it.

titanic_df[106:112,] <- titanic_df[106:112,] %>% fill(Age, .direction="up")

In [None]:
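# Start from the top and work DOWN. Fills any NA value with the value above it.
# Reload the data before running this cell; the UP fill above has already replaced the NA's in these rows.
# Note entries 108 and 110 for the 'Age' column - Each is now filled with the value above it.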

titanic_df[106:112,] <- titanic_df[106:112,] %>% fill(Age, .direction="down")

#### Filling Multiple NA values
Some data is missing multiple data points in a row. All of these could be filled just as in the UP/DOWN passes seen earlier. But if the neighboring value happens to be anomalous, then a high percentage of the data is filled with that anomalous value.

In [None]:
titanic_df[40:53, ]

In [None]:
# fill() carries the nearest non-missing value across a whole run of consecutive NA's.
# .direction = "downup" works in two passes:

# First pass - Start from the top and work DOWN. Fills each NA with the last non-missing value above it.
# NA's at the very top of the slice have nothing above them, so they remain NA.

# Second pass - Start from the bottom and work UP. Fills any remaining NA with the next non-missing value below it.

titanic_df[40:53,] <- titanic_df[40:53,] %>% fill(Age, .direction = "downup") # could also do "updown"
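
To see how `fill()` treats a run of consecutive NA's, here is a small sketch on a made-up column (the data frame `toy_df` is hypothetical). It assumes tidyr is installed in a version that supports `.direction = "downup"`.

```R
library(tidyr)

# Made-up data: a leading NA, a run of two NA's, and a trailing NA
toy_df <- data.frame(Age = c(NA, 21, NA, NA, 65, NA))

fill(toy_df, Age, .direction = "down")$Age     # NA 21 21 21 65 65
fill(toy_df, Age, .direction = "up")$Age       # 21 21 65 65 65 NA
fill(toy_df, Age, .direction = "downup")$Age   # 21 21 21 21 65 65
```

The whole run in rows 3 and 4 takes a single neighboring value, which is why one anomalous neighbor can end up copied across many rows.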