# Working with missing data

Handling missing data is an essential part of data analysis, because missing values that go unnoticed can lead to incorrect conclusions. *Missing values propagate through arithmetic operations in NumPy and pandas unless they are dropped or filled with a value.* In NumPy, missing values can be represented with `np.nan`, NumPy's floating-point `NaN` (Not a Number) value.


In [1]:
import numpy as np

# Let's start by creating a NumPy array with some missing values
data = np.array([1, 2, np.nan, 4, 5])
print(data)

[ 1.  2. nan  4.  5.]



In this array, the third element is a missing value represented as `np.nan`.
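
To make the propagation mentioned in the introduction concrete, here is a minimal sketch that reuses the same `data` array. The names `shifted` and `total` are introduced only for this illustration.

```python
import numpy as np

data = np.array([1, 2, np.nan, 4, 5])

# Element-wise arithmetic computes the other entries normally,
# but the missing entry stays NaN
shifted = data + 1
print(shifted)

# A reduction over the whole array is contaminated by a single NaN
total = data.sum()
print(total)  # nan

# NaN is a floating-point value, so the array is stored as float64
# even though the remaining values were written as integers
print(data.dtype)  # float64
```

This is also why the printed array earlier shows `1.` and `2.` rather than `1` and `2`.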

**Note:** `np.isnan` only works on numeric input. Passing another null object, such as Python's `None`, raises a `TypeError`.
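
The following sketch shows what that error looks like in practice, along with one possible workaround: building the array with a float dtype, which maps `None` to `nan`. The names `raised` and `as_float` are hypothetical and exist only for this example.

```python
import numpy as np

# np.isnan is defined for numeric input, so None is rejected
try:
    np.isnan(None)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# One possible workaround: request a float dtype up front,
# which converts None to nan while building the array
as_float = np.array([1.0, None, 3.0], dtype=float)
print(np.isnan(as_float))
```

Whether converting `None` to `nan` is appropriate depends on the data. It only makes sense when the remaining values are numeric.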



## Checking for missing values

NumPy provides the `isnan()` function, which returns a boolean array that is `True` where the array element is `NaN`.


In [2]:
print(np.isnan(data))

[False False  True False False]


In [3]:
print(np.isnan(data).any())  # True: any() reports whether at least one element is True

True



## Handling missing values

How you deal with missing values usually depends on the specific situation. One common strategy is simply removing them. However, this can result in loss of information if you have a lot of missing values. Another strategy is filling missing values with a specific value, like the mean or median of the other values in the array, or a constant value like zero.



*Removing missing values:*


In [4]:
data_without_nan = data[~np.isnan(data)]
print(data_without_nan)

[1. 2. 4. 5.]



*Filling missing values with a constant:*


In [5]:
data_with_constant = np.where(np.isnan(data), 0, data)
print(data_with_constant)

[1. 2. 0. 4. 5.]



*Filling missing values with the mean:*


In [None]:
data_with_mean = np.where(np.isnan(data), np.nanmean(data), data)
print(data_with_mean)

In [6]:
np.nanmean(data)

3.0

In [9]:
np.mean(data)

nan


**Note:** When an array contains `np.nan` values, regular functions like `np.mean` or `np.sum` return `nan`. NumPy provides NaN-aware counterparts such as `np.nanmean`, `np.nanstd` and `np.nansum` that ignore `np.nan` values when computing the result.
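
As a small illustration of the note above, this sketch compares a regular reduction with its NaN-aware counterpart and fills the gap with the median instead of the mean. `np.nanmedian` is not used elsewhere in this notebook, and the names `median` and `data_with_median` are introduced only for this example.

```python
import numpy as np

data = np.array([1, 2, np.nan, 4, 5])

# The regular reduction is contaminated by the NaN
print(np.sum(data))     # nan

# The NaN-aware version sums only the values that are present
print(np.nansum(data))  # 12.0

# Standard deviation of the non-missing values
print(np.nanstd(data))

# Median of the non-missing values, used as the fill value
median = np.nanmedian(data)
data_with_median = np.where(np.isnan(data), median, data)
print(data_with_median)
```

The median is often a reasonable alternative to the mean when the data contain outliers, since extreme values affect it less.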



## Caution with NaN

NaNs are a bit tricky because they never compare equal to anything, including themselves. If you use the equality operator `==` to find NaNs in your array, every comparison returns `False`.

There is also a difference between `is` and `==` when comparing Python objects that may be NaN. Remember that `is` checks whether two variables refer to the same object, while `==` checks whether their values are equal.


In [10]:
# This won't work!
print(data == np.nan)

[False False False False False]


In [11]:
np.nan == np.nan

False

In [12]:
np.isnan(np.nan)

True


Always use built-in methods like `np.isnan` to check for NaN.
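
The difference between `is` and `==` described above can be sketched as follows. The variable `other_nan` is hypothetical and stands for a NaN that came from somewhere other than `np.nan`. The identity results assume CPython.

```python
import numpy as np

# Equality is always False for NaN, even against itself
print(np.nan == np.nan)     # False

# Identity is True here only because both names refer to the same object
print(np.nan is np.nan)     # True

# A NaN created elsewhere is a different object
other_nan = float("nan")
print(other_nan is np.nan)  # False

# np.isnan gives the right answer regardless of where the NaN came from
print(np.isnan(other_nan))  # True

# Membership tests check identity before equality, so they are unreliable too
print(np.nan in [np.nan])     # True
print(other_nan in [np.nan])  # False
```

Because identity depends on where a NaN object was created, neither `is` nor `in` is a dependable NaN check.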



## Conclusion

NumPy provides various functions to handle missing data like `np.isnan()`, `np.nanmean()`, and `np.nansum()`. How to handle missing values depends on the specific case, and you need to be cautious with NaNs as they don't behave like normal values.


In [13]:
np.random.seed(42)
data_with_nan = np.random.rand(10, 3)

In [16]:
data_with_nan[2, 1] = np.nan
data_with_nan[5, 0] = np.nan
data_with_nan[7, 2] = np.nan
data_with_nan[9, 1] = np.nan


In [17]:
data_with_nan

array([[0.37454012, 0.95071431, 0.73199394],
       [0.59865848, 0.15601864, 0.15599452],
       [0.05808361,        nan, 0.60111501],
       [0.70807258, 0.02058449, 0.96990985],
       [0.83244264, 0.21233911, 0.18182497],
       [       nan, 0.30424224, 0.52475643],
       [0.43194502, 0.29122914, 0.61185289],
       [0.13949386, 0.29214465,        nan],
       [0.45606998, 0.78517596, 0.19967378],
       [0.51423444,        nan, 0.04645041]])

In [None]:
# Strategy: Removing any rows that contain missing data
# Strategy: Filling missing values with the mean of each feature
# Strategy: Filling missing values with a constant value (any value you want, like 0)
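
One possible way to carry out the three strategies on the 2-D array is sketched below. It is not the only approach. It rebuilds `data_with_nan` with the same seed so that it runs on its own, and the names `rows_without_nan`, `column_means`, `filled_with_mean` and `filled_with_zero` are introduced only for this illustration.

```python
import numpy as np

# Rebuild the example array so this cell is self-contained
np.random.seed(42)
data_with_nan = np.random.rand(10, 3)
data_with_nan[2, 1] = np.nan
data_with_nan[5, 0] = np.nan
data_with_nan[7, 2] = np.nan
data_with_nan[9, 1] = np.nan

# Strategy 1: keep only the rows that contain no NaN at all
rows_with_nan = np.isnan(data_with_nan).any(axis=1)
rows_without_nan = data_with_nan[~rows_with_nan]
print(rows_without_nan.shape)  # (6, 3)

# Strategy 2: replace each NaN with the mean of its column (feature);
# column_means has shape (3,) and broadcasts across the rows
column_means = np.nanmean(data_with_nan, axis=0)
filled_with_mean = np.where(np.isnan(data_with_nan), column_means, data_with_nan)

# Strategy 3: replace each NaN with a constant, here 0
filled_with_zero = np.where(np.isnan(data_with_nan), 0, data_with_nan)
```

Dropping rows discards four of the ten rows here, which shows how quickly row removal can shrink a dataset when missing values are spread across different rows.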

> Content created by [**Carlos Cruz-Maldonado**](https://www.linkedin.com/in/carloscruzmaldonado/).  
> I am available to answer any questions or provide further assistance.   
> Feel free to reach out to me at any time.