
https://www.gormanalysis.com/blog/python-pandas-for-your-grandpa-2-5-series-missing-values/

In this section, we’ll see how to use NaN to represent missing or invalid values in a Series. Let’s start by talking about NaN prior to Pandas version 1.0.0. Back in the day, if you wanted to represent missing or invalid data, you had to use NumPy’s special floating point constant, np.nan. So, if you had a Pandas Series of integers like this

In [1]:
import numpy as np
import pandas as pd

roux = pd.Series([1, 2, 3])
print(roux)
## 0    1
## 1    2
## 2    3
## dtype: int64

0    1
1    2
2    3
dtype: int64


And then you set the 2nd element to np.nan, the Series would get cast to floats, because nan only exists in NumPy as a floating point constant.

In [2]:
roux.iloc[1] = np.nan
print(roux)
## 0    1.0
## 1    NaN
## 2    3.0
## dtype: float64

0    1.0
1    NaN
2    3.0
dtype: float64


By the time you’re reading this article, this may have changed, but at the moment, fixing this problem is still on the NumPy Roadmap.

So, in the past you couldn’t have a Pandas Series of integers with NaN values, because you couldn’t (and still can’t) have a NumPy array of integers with NaN values. If you wanted NaN values, your Series had to be a Series of floats.
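
To see the NumPy limitation directly, here is a minimal sketch. The variable names (ints, floats, assigned) are made up for illustration and are not part of the original article.

```python
import numpy as np

# An integer array has no bit pattern reserved for "missing".
ints = np.array([1, 2, 3])

try:
    ints[1] = np.nan          # NaN is a float, so this cannot be stored
    assigned = True
except ValueError as err:
    assigned = False
    print("assignment failed:", err)

# If NaN is present when the array is built, NumPy picks a float dtype instead.
floats = np.array([1, np.nan, 3])
print(floats.dtype)
```

Pandas has historically avoided the error by casting the whole Series to float64, which is exactly the behavior shown in the roux example above.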

Then Pandas released version 1.0.0, which included a nullable integer datatype that uses the new pd.NA marker for missing values. (The datatype itself first appeared as an experimental feature in version 0.24.0.) It’s called “Int64” with a capital “I” to differentiate it from NumPy’s “int64” with a lower case “i”. So, let’s rebuild that Series, roux, this time specifying dtype='Int64'. And, again, let’s set the 2nd element to np.nan.

In [3]:
roux = pd.Series([1, 2, 3], dtype='Int64')
roux.iloc[1] = np.nan
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64

0       1
1    <NA>
2       3
dtype: Int64


This time, the Series retains its Int64 datatype, and doesn’t get cast to float. In this case, a better way to mark that value as missing is to use pd.NA.

In [4]:
roux.iloc[1] = pd.NA
print(roux)
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64

0       1
1    <NA>
2       3
dtype: Int64


You could also use the None keyword, but I’d probably opt for pd.NA.
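
As a quick check of the None option, the sketch below assigns None into a nullable integer Series and inspects what Pandas actually stores. The Series name s is just an illustrative placeholder.

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")

# Assign Python's None; the nullable integer array stores it as pd.NA.
s.iloc[1] = None

print(s.dtype)                # the dtype is still Int64, no cast to float
print(s.iloc[1] is pd.NA)     # the stored scalar is the pd.NA singleton
print(s.isna().tolist())
```

Since None ends up as pd.NA anyway, writing pd.NA explicitly mostly makes the intent clearer to the reader of the code.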

Alright, now let’s see how this works on a Series of strings. So back in the day, if you wanted to build a Series of strings, you would do something like

In [10]:
gumbo = pd.Series(['a', 'b', 'c'])
print(gumbo)
## 0    a
## 1    b
## 2    c
## dtype: object

0    a
1    b
2    c
dtype: object


And then if you set the 2nd value to np.nan and the third value to None,

In [11]:
gumbo.iloc[1] = np.nan
gumbo.iloc[2] = None

# and then you print the Series, it looks like this worked pretty well... but did it?

print(gumbo)
## 0       a
## 1     NaN
## 2    None
## dtype: object

0       a
1     NaN
2    None
dtype: object


Notice, the Series has dtype object. What this means is, we basically have a Python list. Each element of the Series is actually just a pointer, or a memory address, pointing to some arbitrary location in your computer’s memory that’s storing the value of the element. This is bad because:

*   it’s inefficient for data access and
*   it doesn’t enforce a homogeneous datatype constraint on our Series

We’re supposed to have a Series of strings, but I set the second element to a floating point value. Pandas 1.0.0 fixed both of these issues in one fell swoop with the StringDtype extension type. So, today we’d rebuild that Series just like before, except we’d specify dtype='string'

In [12]:
gumbo = pd.Series(['a', 'b', 'c'], dtype='string')
print(gumbo)
## 0    a
## 1    b
## 2    c
## dtype: string

0    a
1    b
2    c
dtype: string


And now if we set the 2nd value to pd.NA and the third value to None, our Series would end up looking like this

In [13]:
gumbo.iloc[1] = pd.NA
gumbo.iloc[2] = None
print(gumbo)
## 0       a
## 1    <NA>
## 2    <NA>
## dtype: string

0       a
1    <NA>
2    <NA>
dtype: string


If you’re a little confused by this, don’t worry. It’s not that important for using Pandas, and it’s something you’ll probably understand more over time.
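
One way to make the object-dtype issue concrete is to look at the Python type of each element. This is only an illustrative sketch; the names mixed, strings, mixed_types and string_types are invented for the example.

```python
import pandas as pd

# An object Series happily holds unrelated Python objects side by side.
mixed = pd.Series(["a", 1.5, "c"])
mixed_types = [type(v).__name__ for v in mixed]
print(mixed.dtype, mixed_types)

# With dtype='string', every non-missing element is a str.
strings = pd.Series(["a", pd.NA, "c"], dtype="string")
string_types = {type(v).__name__ for v in strings.dropna()}
print(strings.dtype, string_types)
```

Nothing in the first Series stops a float from sitting between two strings, which is the missing homogeneity constraint described above.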

In any case, Pandas provides two helper functions for identifying nan values. If you have a Series x with some nan values,

In [20]:
x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
print(x)
## 0       1
## 1    <NA>
## 2       3
## 3    <NA>
## dtype: Int64

0       1
1    <NA>
2       3
3    <NA>
dtype: Int64


you can use 
*   pd.isna() to check whether each value is nan
*   pd.notna() to do the opposite



In [15]:
pd.isna(x)
## 0    False
## 1     True
## 2    False
## 3     True
## dtype: bool

0    False
1     True
2    False
3     True
dtype: bool

In [16]:
pd.notna(x)
## 0     True
## 1    False
## 2     True
## 3    False
## dtype: bool

0     True
1    False
2     True
3    False
dtype: bool
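
The same checks are also available as Series methods, and because the result is a boolean Series you can sum it to count missing values. A small sketch, with illustrative variable names:

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype="Int64")

mask = x.isna()               # method form of pd.isna(x)
n_missing = int(mask.sum())   # True counts as 1, so this counts the NAs
present = x[x.notna()]        # keep only the non-missing values

print(n_missing)
print(present.tolist())
```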

If you want to replace nan values with -1, you could do something like

In [18]:
x.loc[pd.isna(x)] = -1
print(x)

0     1
1    -1
2     3
3    -1
dtype: Int64


and this works, but Pandas provides a really convenient fillna() method that makes this even simpler. So instead you could just do

In [21]:
x = pd.Series([1, pd.NA, 3, pd.NA], dtype='Int64')
print(x)
## 0       1
## 1    <NA>
## 2       3
## 3    <NA>
## dtype: Int64

x.fillna(-1)

0     1
1    -1
2     3
3    -1
dtype: Int64

Note that this returns a modified copy of x, so x itself doesn’t get changed here. If I print(x), you can see it hasn’t changed.

In [22]:
print(x)

0       1
1    <NA>
2       3
3    <NA>
dtype: Int64


If you want the changes to stick, you can do the same thing and set the inplace parameter equal to True.

In [23]:
x.fillna(-1, inplace=True)
print(x)
## 0     1
## 1    -1
## 2     3
## 3    -1
## dtype: Int64

0     1
1    -1
2     3
3    -1
dtype: Int64
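
A common alternative to inplace=True is to assign the returned copy back to the same name. This is a stylistic choice rather than something the original walkthrough prescribes, so treat it as a sketch:

```python
import pandas as pd

x = pd.Series([1, pd.NA, 3, pd.NA], dtype="Int64")

# fillna() returns a new Series; rebinding x makes the change stick.
x = x.fillna(-1)

print(x.tolist())
print(x.dtype)
```

Reassignment keeps each step an ordinary expression, which tends to be easier to chain with other Series methods.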


It’s also important to understand how NaNs work with boolean indexing. Suppose you have a Series of values like this.

In [24]:
goo = pd.Series([10,20,30,40])
print(goo)
## 0    10
## 1    20
## 2    30
## 3    40
## dtype: int64

0    10
1    20
2    30
3    40
dtype: int64


And a corresponding Series of booleans like this:

In [25]:
choo = pd.Series([True, False, pd.NA, True])
print(choo)
## 0     True
## 1    False
## 2     <NA>
## 3     True
## dtype: object

0     True
1    False
2     <NA>
3     True
dtype: object


What do you think goo.loc[choo] will return?

In [26]:
goo.loc[choo]  # ValueError: Cannot mask with non-boolean array containing NA / NaN values

ValueError: Cannot mask with non-boolean array containing NA / NaN values

In this case we get “ValueError: Cannot mask with non-boolean array containing NA / NaN values”. Notice that choo here is one of those pesky Series with dtype ‘object’. In other words, it’s a Series of pointers. To fix this, we can rebuild choo, specifying dtype = "boolean".

In [27]:
choo = pd.Series([True, False, np.nan, True], dtype = "boolean")  # np.NaN was removed in NumPy 2.0
print(choo)
## 0     True
## 1    False
## 2     <NA>
## 3     True
## dtype: boolean

0     True
1    False
2     <NA>
3     True
dtype: boolean


And now when we do goo.loc[choo] we get back 10 and 40, so the missing value in choo is essentially treated as False.

In [28]:
goo.loc[choo]
## 0    10
## 3    40
## dtype: int64

0    10
3    40
dtype: int64

Keep in mind that the negation of NaN is still NaN, so if we do goo.loc[~choo], we only get back one row, not the two rows excluded in the previous subset.

In [29]:
goo.loc[~choo]
## 1    20
## dtype: int64

1    20
dtype: int64
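
If you would rather not depend on how missing values behave inside a mask, one option is to make the choice explicit. The sketch below (variable names selected, rest and unknown are illustrative) fills the missing entry with False before indexing, and uses isna() to pull out the row that neither mask above returned.

```python
import pandas as pd

goo = pd.Series([10, 20, 30, 40])
choo = pd.Series([True, False, pd.NA, True], dtype="boolean")

filled = choo.fillna(False)          # decide explicitly: missing means "not selected"

selected = goo.loc[filled]           # rows where the mask is True
rest = goo.loc[~filled]              # everything else, including the missing row
unknown = goo.loc[choo.isna()]       # only the row whose mask value is missing

print(selected.tolist(), rest.tolist(), unknown.tolist())
```

With the mask filled first, selected and rest together cover every row of goo, which is not the case for goo.loc[choo] and goo.loc[~choo].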