# Consequences of emptiness

Emptiness is not the simplest of concepts for a computer.  To represent true emptiness, a computer would need to be doing nothing, which raises the question "how do you process emptiness?".  The answer is that we designate special values to represent emptiness.  We explore these here.

## Zero is not nothing

Consider the rainfall data for Mount Boyce:

In [None]:
import pandas

mtboyce = pandas.read_csv("data/rainfall/IDCJAC0009_063292_1800_Data.csv")
mtboyce[mtboyce["Rainfall amount (millimetres)"] == 0]

There are 531 rows where the rainfall measurement is missing, and there are 4947 rows where the rainfall measured was 0.  _These are clearly different things_.  If you were to put zero in all the empty spots, you would be suggesting an extra 531 days where it didn't rain.  There was surely rain on at least some of those days, even if it didn't get measured.

Keeping the concept of "Zero" separate from "emptiness" is very important.

The main type of emptiness to consider in pandas is `NaN` (not a number).  Where pandas expects a number but sees nothing, it puts this special value.  `NaN`s are treated very differently to 0.  Note that `NaN` comes from the `numpy` library, which pandas has been using all along without us realising, so we need to import it if we want to make our own `NaN`s.
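Before making `NaN`s by hand, here is a minimal sketch of pandas producing one on its own while reading data.  The three-line CSV and its column names are made up for illustration and are not the Mt Boyce file; the second day simply has nothing after the comma.

```python
import io
import pandas

# Hypothetical three-day record; day 2 has no measurement at all.
csv_text = "day,rainfall_mm\n1,0.0\n2,\n3,4.2\n"

# read_csv accepts any file-like object, so an in-memory string works here.
toy = pandas.read_csv(io.StringIO(csv_text))

print(toy)                       # the empty field is shown as NaN, the 0.0 stays 0.0
print(toy["rainfall_mm"].dtype)  # the column is still numeric
```

The measured zero on day 1 and the empty field on day 2 end up as different values in the same numeric column.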

In [None]:
import numpy

# numpy.nan is the lowercase spelling; the numpy.NaN alias was removed in NumPy 2.0
some_nans = pandas.Series([0, 1, 2, 3, numpy.nan, 4, 5, numpy.nan])
no_nans = pandas.Series([0,1,2,3,0,4,5,0])

print(some_nans.mean())
print(no_nans.mean())

print(some_nans.sum())
print(no_nans.sum())

# info() prints its summary itself and returns None, so no print() is needed
some_nans.info()
no_nans.info()

As you can see, the _aggregation_ operations ignore `NaN`s, treating them as missing instead of as 0: the sums match, but the means differ because the `NaN`s are left out of the count.  Thus you need to be quite careful with what you do to `NaN`s.
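The skipping behaviour can be made explicit.  The sketch below rebuilds the same kind of series and uses the `skipna` argument that pandas aggregations accept, alongside `count()`, which counts only the non-missing entries.  The variable names are illustrative.

```python
import numpy
import pandas

values = pandas.Series([0, 1, 2, 3, numpy.nan, 4, 5, numpy.nan])

# Default behaviour: NaNs are skipped, so the mean is taken over 6 values.
mean_skipping = values.mean()

# With skipna=False a single NaN makes the whole result NaN.
mean_strict = values.mean(skipna=False)

print(len(values))      # 8 entries in total
print(values.count())   # 6 entries that are not NaN
print(mean_skipping)    # 2.5, i.e. 15 / 6
print(mean_strict)      # nan
```

Comparing `len()` with `count()` is a quick way to see how many entries an aggregation silently left out.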

# Finding NaNs

pandas has methods to help us find `NaN`s:
  * `isna()`
  * `isnull()`

In [None]:
some_nans.isna()

In [None]:
some_nans.isnull()

They both do the same thing!  Why are there two names?  There are actually even more names!  "NaN", "na", "Null", "None" are all names you will see used for missing values.  There _are sometimes_ differences between them, but for our purposes we will treat them as synonyms.
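As a small sketch of the synonym idea, the series below mixes Python's `None` with numpy's `NaN`.  The values are made up for illustration; an empty string is included to show that pandas does not count it as missing.

```python
import numpy
import pandas

mixed = pandas.Series(["a", None, numpy.nan, ""])

# None and NaN are both reported as missing; the empty string is not.
print(mixed.isna().tolist())

# isnull() gives exactly the same answer as isna().
print(mixed.isnull().equals(mixed.isna()))

# notna() is the opposite test: True where a value is present.
print(mixed.notna().tolist())
```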

# Exercise - copy me

Work out the two lines of pandas I used to count the missing values and the zeros in the mtboyce rainfall column.

In [None]:
# A boolean mask sums to the number of True entries, i.e. the number of matching rows
count_of_nulls = mtboyce["Rainfall amount (millimetres)"].isna().sum()
count_of_zeros = (mtboyce["Rainfall amount (millimetres)"] == 0).sum()

print("number of null values = " + str(count_of_nulls))
print("number of 0 values = " + str(count_of_zeros))

# Dealing with missing values

You have three options:
  * fillna
  * dropna
  * do it by hand

`fillna` will put some new value everywhere there is a missing value.  `dropna` will, by default, drop every row that has a missing value in any of its columns.  Both of these are very blunt instruments.  We strongly recommend you do it by hand: make a judgement call, and write the pandas code to put your choice into effect.
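The three options can be compared on a toy data frame.  This is only a sketch: the column names and values are hypothetical, and the "by hand" rule shown (keep every row that has a rainfall measurement) is one possible judgement call, not a recommendation for the Mt Boyce data.

```python
import numpy
import pandas

# Hypothetical readings, not the Mt Boyce data.
toy = pandas.DataFrame({
    "rainfall_mm": [0.0, numpy.nan, 4.2, numpy.nan],
    "period_days": [1.0, numpy.nan, numpy.nan, 1.0],
})

# Option 1: every gap in every column becomes 0.
filled = toy.fillna(0)

# Option 2: any row with a gap in any column is removed.
dropped = toy.dropna()

# Option 3: by hand, here keeping rows where rainfall was actually measured.
by_hand = toy[toy["rainfall_mm"].notna()]

print((filled["rainfall_mm"] == 0).sum())  # 3 "dry" days, though only 1 was measured
print(len(dropped))                        # 1 row survives
print(len(by_hand))                        # 2 rows survive
```

`fillna(0)` invents two extra dry days, and `dropna()` throws away a row that had a perfectly good rainfall reading because a different column was empty.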

# Exercise - drop it like it's hot

Try out `dropna` on the Mt Boyce data frame.  How many rows did you lose?  Can you explain why that many were lost?

In [None]:
drop_na_test = mtboyce.dropna()

orig_count = len(mtboyce.axes[0])
drop_count = len(drop_na_test.axes[0])

diff = orig_count - drop_count

print("Original count \t= " + str(orig_count))
print("Dropped count \t= " + str(drop_count))
print("-" * 25)
print("Lost rows \t= " + str(diff))
print("\nRows were lost because dropna drops the whole row if there is a na in any column!")
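One way to explain the number of lost rows is to count the missing values column by column.  The sketch below does this on a small hypothetical frame rather than the Mt Boyce data; calling `isna().sum()` on the real data frame would give the per-column counts there.

```python
import numpy
import pandas

# Hypothetical frame with gaps in different columns on different rows.
toy = pandas.DataFrame({
    "rainfall_mm": [0.0, numpy.nan, 4.2, numpy.nan],
    "period_days": [1.0, numpy.nan, numpy.nan, 1.0],
})

# isna() on a data frame gives a True/False table; sum() counts the Trues per column.
missing_per_column = toy.isna().sum()
rows_lost = len(toy) - len(toy.dropna())

print(missing_per_column)
print(rows_lost)  # 3: more than either column's own count of 2
```

Because the gaps fall on different rows, the number of rows lost can exceed the number of missing values in any single column.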