In [None]:
import numpy as np
import pandas as pd

# Missing Data: In Python

There are many notions of 'missing data', and what data is considered missing vs. inaccurate vs. suitable depends heavily on the specific nature of the data and the _goals_ of analyzing that data.

But there is a _specific type_ of missing data that we can talk about concretely: Missing data in our programs.

## Python and Pandas and Numpy

### Introduction and Warning

Our tools in this class (Python and its libraries) have their own notion of 'missing data'. We can talk about what that is and how to manipulate it, but it's worth repeating: this doesn't tell you _anything_ about what you _should_ do about missing data. Each dataset and analysis presents different problems, and what is reasonable to do with the missing data in one analysis may not be reasonable in another!

### Python values and Missing Data

#### Numpy

Let's start with `numpy`.

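`numpy` has the ability to represent something that is 'not a number', written as `NaN` or `nan`. In code this is `np.nan`; older NumPy releases also accept the alias `np.NaN`, which NumPy 2.0 removed, so the cells that use `np.NaN` need an older NumPy. Consider the following function: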

In [None]:
def is_nan(n):
    return n == np.NaN

It would be reasonable to believe that this function could help you determine whether some field in your dataset is `NaN` or `nan`. Try to predict the following:

In [None]:
print(np.NaN == np.NaN)
print(np.NaN == np.nan)
print(np.nan == np.NaN)
print(np.nan == np.nan)

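Every one of these comparisons prints `False`, because a `NaN` never compares equal to anything, itself included. That also means `is_nan` above always returns `False`. Because of this, it's safer to use `np.isnan()`: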

In [None]:
print(np.isnan(np.nan))
print(np.isnan(np.NaN))
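
As a minimal sketch of how `np.isnan` might be used in practice, the snippet below repairs the `is_nan` function from earlier and applies the same check to a whole array. The name `is_nan_fixed` and the sample array are illustrative and not part of the original material.

```python
import numpy as np

def is_nan_fixed(n):
    # Hypothetical corrected version of is_nan: ask numpy instead of using ==.
    return bool(np.isnan(n))

values = np.array([1.0, np.nan, 3.0])  # made-up sample data

print(is_nan_fixed(np.nan))    # True
print(is_nan_fixed(2.5))       # False

# np.isnan works element-wise on arrays and returns a boolean mask.
mask = np.isnan(values)
print(mask.tolist())           # [False, True, False]

# Inverting the mask keeps only the entries that are real numbers.
print(values[~mask].tolist())  # [1.0, 3.0]
```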

The 'benefit' of `NaN` is that it can be represented efficiently as a floating point number. The downside of `NaN` is twofold:

* under the floating point rules (IEEE 754), a `NaN` never compares equal to anything, including another `NaN`, so you cannot check for it with `==` and must use something like `np.isnan`.
* we lose out on the distinction between 'missing' and 'went wrong'. `NaN` also represents the result of an undefined computation (for example, infinity minus infinity), which isn't the same as 'missing'!
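
Both downsides can be seen in a short sketch. It assumes standard IEEE 754 floating point behaviour, which is what Python floats and numpy use; the variable names are illustrative.

```python
import math
import numpy as np

x = np.nan

# A NaN is not equal to anything, itself included.
print(x == x)   # False
print(x != x)   # True

# NaN can also be the result of a computation, not just a marker for absence.
computed = np.inf - np.inf
print(math.isnan(computed))       # True

# Once stored together, a computed NaN and a "missing" NaN look identical.
column = np.array([1.0, np.nan, computed])
print(np.isnan(column).tolist())  # [False, True, True]
```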

#### Python

Python values can be `None`. Relying on `None` in your own Python code can be dangerous (the same issue as null references in Java), but `None` values do come up in real data, so we'll need to be able to deal with them.
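
Here is a small sketch of how `None` tends to behave next to the numpy and Pandas tools. One thing worth noticing is that `np.isnan` does not accept `None`, while `pd.isna` does. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

value = None

# The idiomatic check for None is an identity test rather than ==.
print(value is None)       # True

# np.isnan only understands numeric input, so None raises a TypeError.
try:
    np.isnan(value)
    isnan_accepts_none = True
except TypeError:
    isnan_accepts_none = False
print(isnan_accepts_none)  # False

# pd.isna treats None as missing.
print(pd.isna(value))      # True

# Inside a numeric Series, pandas reports the None entry as missing.
s = pd.Series([1, None, 3])
print(s.isna().tolist())   # [False, True, False]
```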

#### Pandas

Pandas also has some functionality for representing missing values. In particular there are two main Pandas values:

* `pd.NA`
* `pd.NaT`

To see the Pandas explanation of why we want `pd.NA` and `pd.NaT` in addition to `NaN`, see the Pandas documentation page on missing values: [https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html). In short, sometimes we want to reason about missing values for a type that has no efficient `NaN`-style representation in numpy, such as integers, booleans, or strings.
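
To make this concrete, here is a minimal sketch of where each value shows up. It assumes the nullable `Int64` extension dtype (note the capital `I`) and `pd.to_datetime`; the sample values are made up.

```python
import pandas as pd

# With the default dtype, a missing entry forces the integers to become floats.
as_float = pd.Series([1, None, 3])
print(as_float.isna().tolist())  # [False, True, False]
print(as_float.dtype)

# The nullable Int64 dtype keeps the integers and marks the gap with pd.NA.
as_int = pd.Series([1, None, 3], dtype="Int64")
print(as_int.dtype)              # Int64
print(as_int[1] is pd.NA)        # True

# Datetime data uses pd.NaT for a missing timestamp.
stamps = pd.to_datetime(["2024-01-01", None])
print(stamps[1] is pd.NaT)       # True
```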

* `pd.NA` just means "not available"
* `pd.NaT` means "not a time", which is used when dealing with datetime values. Consider the following:

In [None]:
na = pd.NA
print(na == pd.NA)
print(na == na)
print(pd.NaT == pd.NaT)
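
The comparisons above do not all behave alike, which the following sketch tries to make explicit: comparing with `pd.NA` propagates `pd.NA`, while `pd.NaT` behaves like `NaN` and compares unequal to itself. The flag name `na_usable_in_if` is illustrative.

```python
import pandas as pd

# Comparing with pd.NA does not give True or False; it gives pd.NA back.
result = pd.NA == pd.NA
print(result)             # <NA>
print(result is pd.NA)    # True

# pd.NaT follows the NaN convention instead.
print(pd.NaT == pd.NaT)   # False

# Using pd.NA as a condition raises, because its truth value is ambiguous.
try:
    if pd.NA == pd.NA:
        pass
    na_usable_in_if = True
except TypeError:
    na_usable_in_if = False
print(na_usable_in_if)    # False

# pd.isna is the reliable check for both values.
print(pd.isna(pd.NA))     # True
print(pd.isna(pd.NaT))    # True
```

In other words, neither value can be detected reliably with `==`, so `pd.isna` is the safer habit here as well.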

### Manipulating Missing Values

Sometimes you want to do something with missing values. Pandas does provide a good amount of functionality for doing so, but it raises an important question: _What does Pandas consider a missing value?_

Luckily, we can write some code that will help us answer this question. Study the following dataframe and how it was constructed:

In [None]:
dat = [
    ["Numpy NaN", np.NaN],
    ["Numpy NaN 2", np.nan],
    ["Pandas 'Not Available'", pd.NA],
    ["Pandas 'Not a Time'", pd.NaT],
    ["Python None", None],
    ["A String that just says 'NaN'", "NaN"],
]

df = pd.DataFrame(dat)
df

This dataframe contains all of the 'types' of missing data that we considered above, as well as a string that contains "NaN". It's important to test this string because when you parse data or scrape it from the internet, you sometimes get a string that contains "NaN", "NA", or similar.
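
Whether such text survives as a string depends on how the data was loaded. As a sketch, assuming the data arrives as CSV: `pd.read_csv` converts a default set of placeholder strings (including "NaN" and "NA") into missing values unless `keep_default_na=False` is passed. The CSV content below is made up.

```python
import io
import pandas as pd

csv_text = "name,region\nalice,North\nbob,NaN\ncarol,NA\n"  # made-up data

# Default behaviour: the placeholder strings are parsed as missing values.
parsed = pd.read_csv(io.StringIO(csv_text))
print(parsed["region"].isna().tolist())  # [False, True, True]

# With keep_default_na=False the same text stays as ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw["region"].tolist())            # ['North', 'NaN', 'NA']
print(raw["region"].isna().tolist())     # [False, False, False]
```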

Pandas provides a lot of functionality for dealing with missing data (see the documentation linked above). One such function is `dropna()`, which by default removes every row that contains at least one missing value. Dropping rows like this is _almost never_ appropriate in a real analysis, but here we only want to determine what Pandas considers missing data, so it'll do the trick:

In [None]:
df.dropna()

As you can see, all of the 'types' of missing data we showed above are considered "missing" by Pandas _except_ the string that contained `"NaN"`. The reason for this is simple: it's a string with a value in it, and as such it is not the same as a missing string! A string containing `"NaN"` is just as much a string as one containing `"North"`, and is therefore not 'missing'.
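
A less destructive way to ask the same question is `isna()`, which returns a boolean mask instead of dropping rows. The sketch below rebuilds a similar frame; the column names `label` and `value` are assumptions added for readability, and it uses only `np.nan` so that it also runs on newer NumPy versions.

```python
import numpy as np
import pandas as pd

rows = [
    ["Numpy NaN", np.nan],
    ["Pandas 'Not Available'", pd.NA],
    ["Pandas 'Not a Time'", pd.NaT],
    ["Python None", None],
    ["A String that just says 'NaN'", "NaN"],
]
frame = pd.DataFrame(rows, columns=["label", "value"])

# True wherever Pandas considers the value missing.
missing = frame["value"].isna()
print(missing.tolist())    # [True, True, True, True, False]
print(int(missing.sum()))  # 4

# Labels of the rows that would survive dropping on this column.
kept = frame.loc[~missing, "label"].tolist()
print(kept)                # ["A String that just says 'NaN'"]
```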