<h1>Handling Missing Data</h1>

In [1]:
# Most interesting data sets have some amount of data missing.
# To complicate matters further, different data sources may indicate missing data in different ways.

# Missing data in general is referred to as null, NaN, or NA values.

<h3>Trade Offs in Missing Data Conventions</h3>

In [2]:
# Strategies for representing missing data in a table or DataFrame:

# Strategy 1: Using a mask that globally indicates missing values. 
# Strategy 2: Choosing a sentinel value that indicates a missing entry. 

# In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriating
# one bit in the data representation to locally indicate the null status of a value.

# In the sentinel approach, the sentinel value could be some data-specific convention.

# Neither approach is free. A separate mask requires allocating an additional Boolean array, which adds
# storage and computation overhead. A sentinel value reduces the range of valid values that can be
# represented and may require extra logic in CPU and GPU arithmetic.
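
A minimal sketch of the two strategies, assuming a small hypothetical array of integer readings and an arbitrary sentinel of -9999 (neither appears in the original notes):

```python
import numpy as np

# Hypothetical readings; the third entry is missing.
readings = np.array([12, 7, 0, 15])

# Strategy 1: a separate Boolean mask marks the missing entry.
is_missing = np.array([False, False, True, False])
masked_mean = readings[~is_missing].mean()

# Strategy 2: a sentinel value marks the missing entry.
# -9999 is an arbitrary choice; it can no longer be used as real data.
SENTINEL = -9999
with_sentinel = np.array([12, 7, SENTINEL, 15])
sentinel_mean = with_sentinel[with_sentinel != SENTINEL].mean()

print(masked_mean, sentinel_mean)
```

Both computations skip the missing entry and give the same mean. The mask costs an extra array, while the sentinel costs one representable value and an extra comparison on every operation.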

<h3>Missing Data in Pandas</h3>

In [3]:
# NumPy supports fourteen basic integer types once you account for available precisions, signedness,
# and endianness of the encoding. Reserving a specific bit pattern as a missing-value sentinel in every
# one of these types would be unwieldy.

# NumPy does have support for masked arrays, that is, arrays that have a separate Boolean mask array attached
# for marking data as "good" or "bad".
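
As a rough illustration of the masked arrays mentioned above, the following sketch uses `np.ma.masked_array`; the values and the mask are made up for the example:

```python
import numpy as np

values = np.array([1, 2, 3, 4])
# True marks an entry as "bad" (masked out); this mask is hypothetical.
bad = np.array([False, True, False, False])

masked = np.ma.masked_array(values, mask=bad)

total = masked.sum()      # masked entries are skipped
n_valid = masked.count()  # number of unmasked entries

print(total, n_valid, masked.dtype)
```

The data keeps its integer dtype here, because the missing-value information lives in the mask rather than in the values themselves.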

In [4]:
# Pandas chose to use sentinels for missing data, and further chose to use two already existing null values:
# - the special floating-point NaN value
# - the Python None object

<h4>None: Pythonic missing data</h4>

In [19]:
# The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing
# data in Python code.

# Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays
# with data type "object".

import numpy as np
import pandas as pd

In [6]:
vals1 = np.array([1, None, 3,4])
vals1

array([1, None, 3, 4], dtype=object)

In [8]:
# With an object array, any operation on the data is done at the Python level, with much more overhead
# than the fast operations seen for arrays with native types:
for dtype in ["object", "int"]:
    print("dtype = ", dtype)
    %timeit np.arange(1E6,dtype=dtype).sum()
    print()

dtype =  object
26.2 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype =  int
713 µs ± 837 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)



In [9]:
# The use of Python objects in an array also means that if you perform aggregations like sum() or min() 
# across an array with a None value you will generally get an error. 
vals1.sum()

Traceback (most recent call last):
  File "/Users/awaneeshtiwari/anaconda3/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
  File "/var/folders/c8/z4lh2j2s5kq5jvv28mrv0rw40000gn/T/ipykernel_95111/3962276766.py", line 3, in <module>
    vals1.sum()
  File "/Users/awaneeshtiwari/anaconda3/lib/python3.10/site-packages/numpy/core/_methods.py", line 48, in _sum
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In [10]:
# This reflects the fact that addition between an integer and None is undefined.

<h4>NaN: Missing Numerical Data</h4>

In [11]:
# The other missing data representation is NaN (an acronym for Not a Number). It is a special floating-point
# value recognized by all systems that use the standard IEEE floating-point representation.

vals2 = np.array([1, np.nan, 3,4])
vals2

array([ 1., nan,  3.,  4.])

In [12]:
vals2.dtype

dtype('float64')

In [13]:
# NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation,
# the result of arithmetic with NaN is another NaN.
1 + np.nan

nan

In [14]:
0 * np.nan

nan

In [16]:
# This means that aggregates over the values are well defined (that is, they don't result in an error),
# but they are not always useful.
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [17]:
# NumPy does provide some special aggregations that ignore these missing values:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)

In [18]:
# Keep in mind that NaN is specifically a floating-point value; there is no equivalent NaN value for integers,
# strings, or other types.
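
To make the floating-point restriction concrete, here is a small sketch (the array contents are arbitrary) showing that an integer array refuses NaN while a float copy accepts it:

```python
import numpy as np

ints = np.array([1, 2, 3])

# Assigning NaN into an integer array fails, because integers have no NaN.
try:
    ints[0] = np.nan
    assignment_failed = False
except ValueError:
    assignment_failed = True

# Converting to float first makes room for NaN.
floats = ints.astype(float)
floats[0] = np.nan

print(assignment_failed, floats)
```

This is why an integer column that gains a missing value ends up being stored as floating point.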

<h4>NaN and None in Pandas</h4>

In [20]:
# NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably,
# converting between them where appropriate:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [21]:
# For types that don't have an available sentinel value, Pandas automatically typecasts when NA values are
# present. For example, setting a value in an integer Series to None upcasts the Series to floating point:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [22]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

In [23]:
# In addition to casting the integer Series to floating point, Pandas automatically converts the None to a
# NaN value.
# In the Pandas version used here, string data is stored with an object dtype by default.

<h3>Operating on Null Values</h3>

In [24]:
# Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. 
# To facilitate this convention, there are several useful methods for detecting, removing, and replacing
# null values in Pandas data structures. 

# isnull() : Generate a Boolean mask indicating missing values
# notnull(): Opposite of isnull()
# dropna() : Return a filtered version of the data
# fillna() : Return a copy of the data with missing values filled or imputed. 

<h4>Detecting null values</h4>

In [25]:
# Pandas data structures have two useful methods for detecting null data: isnull() and notnull().
# Either one returns a Boolean mask over the data.

data = pd.Series([1,np.nan,"hello",None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [27]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [28]:
data[data.notnull()]

0        1
2    hello
dtype: object

In [29]:
# The isnull() and notnull() methods produce similar Boolean results for DataFrames.
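
A short sketch of the DataFrame case, using a made-up two-column frame (the column names `a` and `b` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
frame = pd.DataFrame({"a": [1, np.nan, 3],
                      "b": [None, "x", "y"]})

# isnull() returns a Boolean DataFrame of the same shape.
null_mask = frame.isnull()

# Summing the mask counts the missing values in each column.
nulls_per_column = null_mask.sum()

print(null_mask)
print(nulls_per_column)
```

Because the mask has the same shape as the frame, it can be reduced per column, as here, or per row with `null_mask.sum(axis=1)`.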

<h4>Dropping Null values</h4>

In [30]:
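# In addition to the masking used above, there are the convenience methods dropna() (which removes NA values)
# and fillna() (which fills in NA values).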

data.dropna()

0        1
2    hello
dtype: object

In [32]:
data.fillna(0)

0        1
1        0
2    hello
3        0
dtype: object

In [40]:
# For a DataFrame there are more options:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
# By default, dropna() drops every row in which any null value is present
df.dropna()

     0    1  2
1  2.0  3.0  5
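
A single value cannot be dropped from a DataFrame; only whole rows or whole columns can. The sketch below shows some of the further options, using the `axis`, `how`, and `thresh` parameters of `dropna()`. The extra all-NaN column `3` is added purely for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Drop every column that contains any null value.
cols_without_na = df.dropna(axis="columns")

# Add a column that is entirely NaN (illustrative only).
df[3] = np.nan

# how="all" drops only the columns in which every value is null.
all_na_cols_dropped = df.dropna(axis="columns", how="all")

# thresh=3 keeps only the rows with at least 3 non-null values.
rows_min_3 = df.dropna(axis="rows", thresh=3)

print(cols_without_na)
print(all_na_cols_dropped)
print(rows_min_3)
```

The default `how="any"` can discard a lot of good data, so `how="all"` or `thresh` may be a better fit when only mostly-empty rows or columns should go.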

<h4>Filling Null Values</h4>

In [36]:
# Consider the following series:
data = pd.Series([1, np.nan, 2, None, 3], index = list("abcde"))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [37]:
# NA values can be filled with 0s
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [38]:
# Specify a forward fill to propagate the previous value forward.
# (Recent Pandas versions deprecate fillna(method=...) in favor of data.ffill().)
data.fillna(method="ffill")

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [39]:
# Specify a backward fill to propagate the next value backward.
# (Recent Pandas versions deprecate fillna(method=...) in favor of data.bfill().)
data.fillna(method="bfill")

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [41]:
# If a previous value is not available during a forward fill, the NA value remains.
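
A small sketch of that edge case, using a hypothetical Series whose first entry is missing and the `ffill()` / `bfill()` methods:

```python
import numpy as np
import pandas as pd

# Hypothetical Series that starts with a missing value.
s = pd.Series([np.nan, 1, np.nan, 2], index=list("abcd"))

# Forward fill: "a" has no previous value, so it stays NaN.
filled = s.ffill()

# One possible follow-up: a backward fill picks up the leading gap.
fully_filled = s.ffill().bfill()

print(filled)
print(fully_filled)
```

Whether chaining a backward fill is appropriate depends on the data. It is shown here only as one way to deal with a leading gap.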