# Consequences of emptiness

Emptiness is not the simplest of concepts for a computer.  To represent true emptiness, a computer would need to be doing nothing, which raises the question: "how do you process emptiness?".  The answer is that we designate special values as representing emptiness.  We explore these here.

## Zero is not nothing

Consider the rainfall data for Mount Boyce.

In [14]:
import pandas as pd
import numpy as np

data = {
    'A': ['1A', '2A', '3A', '4A'],
    'B': ['1B', '2B', np.nan, '4B'],
    'C': ['1C', np.nan, '3C', '4C'],
    'D': ['1D', '2D', '3D', '4D']}

data = pd.DataFrame(data, index=[0, 1, 2, 3])
data = data.dropna()

data  # rows 1 and 2 each held a NaN, so only rows 0 and 3 remain

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 1A | 1B | 1C | 1D |
| 3 | 4A | 4B | 4C | 4D |


In [4]:
import pandas

mtboyce = pandas.read_csv("data/rainfall/IDCJAC0009_063292_1800_Data.csv")
mtboyce[mtboyce["Rainfall amount (millimetres)"] == 0]

| | Product code | Bureau of Meteorology station number | Year | Month | Day | Rainfall amount (millimetres) | Period over which rainfall was measured (days) | Quality |
|---|---|---|---|---|---|---|---|---|
| 514 | IDCJAC0009 | 63292 | 1995 | 5 | 30 | 0.0 | | N |
| 515 | IDCJAC0009 | 63292 | 1995 | 5 | 31 | 0.0 | | N |
| 516 | IDCJAC0009 | 63292 | 1995 | 6 | 1 | 0.0 | | N |
| 517 | IDCJAC0009 | 63292 | 1995 | 6 | 2 | 0.0 | | N |
| 522 | IDCJAC0009 | 63292 | 1995 | 6 | 7 | 0.0 | | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10609 | IDCJAC0009 | 63292 | 2023 | 1 | 18 | 0.0 | 1.0 | N |
| 10617 | IDCJAC0009 | 63292 | 2023 | 1 | 26 | 0.0 | 1.0 | N |
| 10620 | IDCJAC0009 | 63292 | 2023 | 1 | 29 | 0.0 | 1.0 | N |
| 10623 | IDCJAC0009 | 63292 | 2023 | 2 | 1 | 0.0 | 1.0 | N |


There are 531 rows where the rainfall measurement is missing, and there are 4947 rows where the rainfall measured was 0.  _These are clearly different things_.  If you were to put zero in all the empty spots, you would be suggesting an extra 531 days where it didn't rain.  There was surely rain on at least some of those days, even if it didn't get measured.

Keeping the concept of "zero" separate from "emptiness" is very important.

The main type of emptiness to consider in pandas is `NaN` (not a number).  Where pandas expects a number but sees nothing, it puts this special value.  `NaN`s are treated very differently to 0.  Note that `NaN` comes from the `numpy` library, which pandas has been using all along without us realising, and we need to import it if we want to make our own `NaN`s.

In [6]:
import numpy

some_nans = pandas.Series([0, 1, 2, 3, numpy.nan, 4, 5, numpy.nan])
no_nans = pandas.Series([0,1,2,3,0,4,5,0])

print(some_nans.mean())
print(no_nans.mean())

print(some_nans.sum())
print(no_nans.sum())

print(some_nans.info())
print(no_nans.info())

2.5
1.875
15.0
15
<class 'pandas.core.series.Series'>
RangeIndex: 8 entries, 0 to 7
Series name: None
Non-Null Count  Dtype  
--------------  -----  
6 non-null      float64
dtypes: float64(1)
memory usage: 192.0 bytes
None
<class 'pandas.core.series.Series'>
RangeIndex: 8 entries, 0 to 7
Series name: None
Non-Null Count  Dtype
--------------  -----
8 non-null      int64
dtypes: int64(1)
memory usage: 192.0 bytes
None


As you can see, the _aggregation_ operations ignore `NaN`s, treating them as missing instead of as 0.  Thus you need to be quite careful with what you do to `NaN`s.
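
A minimal sketch of how to make that skipping visible. It assumes the `skipna` parameter accepted by pandas aggregation methods (it defaults to `True`); the variable names are made up for illustration, and the series repeats the values from `some_nans`.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 1, 2, 3, np.nan, 4, 5, np.nan])

# Default behaviour: NaNs are skipped, so the mean is taken over 6 values.
mean_skipping = values.mean()

# skipna=False makes a single NaN turn the whole result into NaN.
mean_strict = values.mean(skipna=False)

# count() reports only the non-missing entries; len() reports all of them.
n_present = values.count()
n_total = len(values)

print(mean_skipping, mean_strict, n_present, n_total)
```

Comparing `count()` with `len()` is a quick way to check whether an aggregate was computed over fewer values than you expected.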

# Finding NaNs

pandas has two methods to help us find `NaN`s:
  * `isna()`
  * `isnull()`

In [7]:
some_nans.isna()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7     True
dtype: bool

In [8]:
some_nans.isnull()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7     True
dtype: bool

They both do the same thing!  Why are there two names?  There are actually even more names!  "NaN", "na", "Null", "None" are all names you will see used for missing values.  There _are sometimes_ differences between them, but for our purposes we will treat them as synonyms.
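
A small sketch of the synonym claim, using a made-up series called `mixed`. It relies on two things: `isnull` is an alias of `isna` in pandas, and Python's `None` is converted to `NaN` when it lands in a numeric series.

```python
import numpy as np
import pandas as pd

mixed = pd.Series([1.0, None, np.nan, 4.0])

# None was converted to NaN on construction, so the dtype is float64.
print(mixed.dtype)

# isna() and isnull() return identical boolean masks.
mask_isna = mixed.isna()
mask_isnull = mixed.isnull()
print(mask_isna.equals(mask_isnull))

# NaN is never equal to anything, including itself, so == cannot find it.
found_by_equality = (mixed == np.nan).any()
print(found_by_equality)
```

The last line is the reason dedicated methods exist at all: an equality test against `NaN` always comes back `False`.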

# Exercise - copy me

Work out the two lines of pandas I used to count how many missing values and how many zeros there are in the `mtboyce` rainfall column.

# Dealing with missing values

You have three options:
  * `fillna`
  * `dropna`
  * do it by hand

`fillna` will put some new value everywhere there is a missing value.  `dropna` will drop all rows that have a missing value in any of the columns.  Both of these are very blunt instruments.  We strongly recommend you do it by hand.  Make a judgement call, and write the pandas to put your choice into effect.
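
The three options can be sketched on a toy data frame. The column names `rain_mm` and `period_days` and all the values are hypothetical, and the "by hand" rule shown (keep rows where the rainfall itself was recorded) is only one possible judgement call, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
toy = pd.DataFrame({
    "rain_mm": [0.0, np.nan, 12.5, 3.0],
    "period_days": [1.0, np.nan, np.nan, 1.0],
})

# fillna: every missing value, in every column, becomes 0.
filled = toy.fillna(0)

# dropna: any row with a missing value in any column is removed.
dropped = toy.dropna()

# By hand: keep rows where the rainfall itself was recorded,
# regardless of whether the other column is missing.
by_hand = toy[toy["rain_mm"].notna()]

print(len(filled), len(dropped), len(by_hand))
```

Here `fillna` keeps all four rows but invents a dry day, `dropna` throws away a row with a perfectly good rainfall reading, and the hand-written filter keeps exactly the rows the chosen rule asks for.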

# Exercise - drop it like it's hot

Try out `dropna` on the Mt Boyce data frame.  How many rows did you lose?  Can you explain why that many were lost?