# Handling Missing Data - Welcome to the Real World!


# Python's None, NumPy's np.nan, and Pandas' pd.NA

### ``None``: Pythonic missing data



In [3]:
import numpy as np
import pandas as pd

In [13]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [5]:
vals2 = np.array([1, 3, 4])
vals2.dtype

dtype('int64')

This ``dtype=object`` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:

In [6]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()
    
# each run sums the integers from 0 to 999,999 (that is, 1E6 - 1)

dtype = object
41.6 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
303 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
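Speed is not the only cost of an object array. As a minimal sketch, assuming the same `vals1` array as above, an aggregation such as `sum()` fails outright, because Python does not define addition between an integer and `None`:

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])  # None forces dtype=object

try:
    vals1.sum()  # falls back to Python-level addition: 1 + None
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)
```

The `raised` flag is only there to make the outcome easy to check; it is not part of any NumPy API.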



### ``NaN``: Missing numerical data

The other missing data representation, ``NaN`` (an acronym for *Not a Number*), is a special floating-point value defined by the IEEE 754 floating-point standard:

In [7]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code.
You should be aware that ``NaN`` is a bit like a data virus: it infects any other object it touches.
Regardless of the operation, the result of arithmetic with ``NaN`` will be another ``NaN``:

In [8]:
1 + np.nan

nan

In [60]:
0 * np.nan

nan

Note that this means that aggregates over the values are well defined (i.e., they don't result in an error) but not always useful:

In [61]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values:

In [62]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)
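The same effect can be reproduced by hand, which may help clarify what the NaN-aware aggregations do. The sketch below assumes the `vals2` array from above; it builds a Boolean mask with `np.isnan` and sums only the remaining entries. The variable names are illustrative.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

mask = np.isnan(vals2)            # True where the entry is NaN
manual_sum = vals2[~mask].sum()   # sum over the non-NaN entries only
mean_ignoring_nan = np.nanmean(vals2)  # NaN-aware mean, analogous to nansum

print(mask)
print(manual_sum, mean_ignoring_nan)
```

Here `manual_sum` matches `np.nansum(vals2)`, and `np.nanmean` averages over the three valid entries rather than all four.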

Keep in mind that ``NaN`` is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
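One practical consequence is that integer data containing a missing value gets upcast to floating point. The sketch below illustrates this with a Pandas ``Series``, and also shows the nullable ``"Int64"`` extension dtype, which stores ``pd.NA`` alongside integers. The variable names are illustrative, not taken from the text above.

```python
import numpy as np
import pandas as pd

# Integers plus NaN: the whole Series is upcast to float64
upcast = pd.Series([1, np.nan, 3])
print(upcast.dtype)

# The nullable integer dtype keeps integers and marks the gap with pd.NA
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```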

### NaN and None in Pandas

``NaN`` and ``None`` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [72]:
data = pd.Series([1, np.nan, 'hello', None, 4, pd.NA])
data


0        1
1      NaN
2    hello
3     None
4        4
5     <NA>
dtype: object

In [73]:
data.isnull()

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

In [74]:
data[data.notnull()]

0        1
2    hello
4        4
dtype: object

In [75]:
data[data.isnull()]

1     NaN
3    None
5    <NA>
dtype: object

The ``isnull()`` and ``notnull()`` methods produce similar Boolean results for ``DataFrame``s.
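As a small sketch of the ``DataFrame`` case, using a hypothetical two-column frame: ``isnull()`` returns a Boolean frame of the same shape, and summing it gives a count of nulls per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only to illustrate the element-wise mask
frame = pd.DataFrame({"a": [1, np.nan], "b": [None, "x"]})

mask = frame.isnull()           # same shape as frame, True where null
nulls_per_column = mask.sum()   # True counts as 1 when summed

print(mask)
print(nulls_per_column)
```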

### Dropping null values

In addition to the masking used before, Pandas provides two convenience methods: ``dropna()``
(which removes NA values) and ``fillna()`` (which fills in NA values). For a ``Series``,
the result of ``dropna()`` is straightforward:

In [76]:
data.dropna()

0        1
2    hello
4        4
dtype: object

For a ``DataFrame``, there are more options.
Consider the following ``DataFrame``:

In [77]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      None,      5],
                   [np.nan, 4,      pd.NA]])
df

     0    1    2
0  1.0  NaN  2.0
1  2.0  NaN  5.0
2  NaN  4.0  NaN


We cannot drop single values from a ``DataFrame``; we can only drop full rows or full columns.
Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options for a ``DataFrame``.

By default, ``dropna()`` will drop all rows in which *any* null value is present:

In [78]:
df.dropna()

Empty DataFrame
Columns: [0, 1, 2]
Index: []


Alternatively, you can drop NA values along a different axis; ``axis=1`` drops all columns containing a null value:

In [30]:
df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [0, 1, 2]



But this drops some good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values.
This can be specified through the ``how`` or ``thresh`` parameters, which allow fine control of the number of nulls to allow through.

The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped.
You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:

In [46]:
df[3] = np.nan
df

     0    1    2    3
0  1.0  NaN  2.0  NaN
1  2.0  NaN  5.0  NaN
2  NaN  4.0  NaN  NaN


In [47]:
df.dropna(axis=0, how='all')

     0    1    2    3
0  1.0  NaN  2.0  NaN
1  2.0  NaN  5.0  NaN
2  NaN  4.0  NaN  NaN
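No row of this frame is entirely null, so ``how='all'`` along ``axis=0`` returns it unchanged. As a sketch of the column-wise variant, assuming the same ``df`` rebuilt from scratch, ``axis=1`` with ``how='all'`` removes only the all-null column ``3``:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the text, including the all-null column 3
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      None,   5],
                   [np.nan, 4,      pd.NA]])
df[3] = np.nan

# Drop only those columns in which every value is null
trimmed = df.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Columns ``0``, ``1``, and ``2`` each contain at least one valid value, so they survive.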


In [48]:
df.dropna(axis=0, how='any')

Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []


For finer-grained control, the ``thresh`` parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [50]:
df.dropna(axis=0, thresh=2)

     0    1  2    3
0  1.0  NaN  2  NaN
1  2.0  NaN  5  NaN
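To see why the last row is dropped, it can help to count the non-null values per row. The sketch below assumes the same four-column ``df`` and compares that count against ``thresh=2``:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      None,   5],
                   [np.nan, 4,      pd.NA]])
df[3] = np.nan

# Number of non-null entries in each row
non_null_per_row = df.notna().sum(axis=1)
print(non_null_per_row.tolist())

# Rows with at least 2 non-null entries are kept
kept = df.dropna(axis=0, thresh=2)
print(kept.index.tolist())
```

Rows ``0`` and ``1`` each have two valid values and are kept; row ``2`` has only one and is dropped.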


### Filling null values


In [51]:
data = pd.Series([1, np.nan, 2, None, 3, pd.NA], index=list('abcdef'))
data

a       1
b     NaN
c       2
d    None
e       3
f    <NA>
dtype: object

We can fill NA entries with a single value, such as zero:

In [52]:
data.fillna(0)

a    1
b    0
c    2
d    0
e    3
f    0
dtype: int64
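The fill value does not have to be a constant chosen in advance. As a sketch that goes slightly beyond the example above, a hypothetical float ``Series`` can be filled with its own mean, since ``mean()`` skips NaN by default:

```python
import numpy as np
import pandas as pd

# Hypothetical readings, used only for illustration
readings = pd.Series([1.0, np.nan, 3.0], index=list("abc"))

# mean() ignores NaN, so the fill value is (1.0 + 3.0) / 2
filled = readings.fillna(readings.mean())
print(filled)
```

Whether a mean is a sensible replacement depends on the data; this only shows the mechanics.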

We can specify a forward-fill to propagate the previous value forward:

In [54]:
# forward-fill (fillna(method='ffill') is deprecated in recent pandas versions)
data.ffill()

a    1
b    1
c    2
d    2
e    3
f    3
dtype: int64

Or we can specify a back-fill to propagate the next values backward:

In [55]:
# back-fill (fillna(method='bfill') is deprecated in recent pandas versions)
data.bfill()

a       1
b       2
c       2
d       3
e       3
f    <NA>
dtype: object

For ``DataFrame``s, the options are similar, but we can also specify an ``axis`` along which the fills take place:

In [56]:
df

     0    1    2    3
0  1.0  NaN  2.0  NaN
1  2.0  NaN  5.0  NaN
2  NaN  4.0  NaN  NaN


In [58]:
df.ffill(axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  2.0  5.0  5.0
2  NaN  4.0  4.0  4.0


Notice that if a previous value is not available during a forward fill, the NA value remains.
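The same behavior shows up when filling down the rows, which is the default direction. The sketch below uses a small hypothetical frame with float columns only: the leading NaN in column ``x`` has nothing above it and therefore stays.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column x starts with a missing value
frame = pd.DataFrame({"x": [np.nan, 1.0, np.nan],
                      "y": [2.0, np.nan, np.nan]})

# Forward-fill down each column (axis=0 is the default)
filled = frame.ffill()
print(filled)
```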