# Identify, calculate and replace the missing values using Python

In this notebook we will explore different ways to identify, calculate and replace missing values using Python.

In Pandas, missing values are usually represented by np.nan. 
In plain Python, None plays that role.
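
As a quick illustration of the two markers, the sketch below (not part of the original notebook) shows that np.nan is an ordinary float, that it never compares equal to itself, and that pd.isna() recognises both np.nan and None. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float; None is Python's own null object
nan_type = type(np.nan)
none_type = type(None)
print(nan_type, none_type)

# NaN never compares equal to anything, including itself,
# so `==` cannot be used to find missing values
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)  # False

# pd.isna() recognises both markers
both_missing = pd.isna(np.nan) and pd.isna(None)
print(both_missing)  # True
```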

Let's start by identifying missing values.

## Identifying missing values
There are two methods in Pandas to check for NaNs in a Series or DataFrame: .isna() and .notna()

- isna() returns an object of the same shape filled with booleans indicating whether each value is NA. It returns True for np.nan, None and pd.NA (displayed as <NA>)
- notna() is the inverse of the previous method, returning False for np.nan, None and pd.NA, and True for everything else
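
A minimal sketch of both methods on a small Series (the values are made up for illustration). Note that the empty string at the end is not reported as missing.

```python
import numpy as np
import pandas as pd

# An object-dtype Series mixing real values with the three missing markers
s = pd.Series(['a', np.nan, None, pd.NA, ''], dtype=object)

na_mask = s.isna()
not_na_mask = s.notna()

print(na_mask.tolist())      # [False, True, True, True, False]
print(not_na_mask.tolist())  # [True, False, False, False, True]
```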

In [4]:
import pandas as pd
import numpy as np

data = [
    ['James', 50, 'Web Developer', 'None'],
    ['Astrid', np.nan, 'Data Analyst', 'astridleiland@hotmail.com.com'],
    ['', 27, 'Cloud Architect', 'louiselane@supercloud.com'],
    ['Shawn', 36, 'Senior Flow Controller', None],
    ]

missing_values = pd.DataFrame(data, columns=['Name', 'Age', 'Position', 'email'])

missing_values.head()

| | Name | Age | Position | email |
|---|---|---|---|---|
| 0 | James | 50.0 | Web Developer | None |
| 1 | Astrid | NaN | Data Analyst | astridleiland@hotmail.com.com |
| 2 | | 27.0 | Cloud Architect | louiselane@supercloud.com |
| 3 | Shawn | 36.0 | Senior Flow Controller | None |


In [5]:
missing_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      4 non-null      object 
 1   Age       3 non-null      float64
 2   Position  4 non-null      object 
 3   email     3 non-null      object 
dtypes: float64(1), object(3)
memory usage: 256.0+ bytes


In [6]:
missing_values.isna()

| | Name | Age | Position | email |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | True | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | True |


### Important notes

None is a missing value to Pandas. The string 'None' is not. An empty string is not treated as missing either.

When a numeric column contains np.nan, the whole column is stored as float64 (as np.nan is itself a float). That is why Age shows 50.0 instead of 50.
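
The dtype change can be seen in isolation with the sketch below. It also shows, as an optional alternative not used in this notebook, the nullable 'Int64' dtype, which keeps whole numbers and displays the gap as <NA>.

```python
import numpy as np
import pandas as pd

ints = pd.Series([50, 27, 36])
with_nan = pd.Series([50, np.nan, 27, 36])
print(ints.dtype)      # an integer dtype
print(with_nan.dtype)  # float64

# Nullable integer dtype: values stay integers, the missing entry becomes <NA>
nullable = pd.Series([50, None, 27, 36], dtype='Int64')
print(nullable.dtype)               # Int64
print(int(nullable.isna().sum()))   # 1
```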


## Calculating missing values
Because isna() returns booleans, and True counts as 1 when summed, we can calculate the number of missing values per column using the sum method.

In [7]:
missing_values.isna().sum()

Name        0
Age         1
Position    0
email       1
dtype: int64
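
The same boolean mask supports a few related summaries. The sketch below uses a small made-up DataFrame (not the one from this notebook) to show the share of missing values per column, the overall total, and the rows that contain at least one gap.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    'Age': [50, np.nan, 27, 36],
    'email': ['a@example.com', 'b@example.com', 'c@example.com', None],
})

mask = df.isna()
per_column = mask.sum()                # count of missing values per column
share = mask.mean()                    # fraction of missing values per column
total = int(mask.sum().sum())          # missing values in the whole frame
rows_with_gaps = df[mask.any(axis=1)]  # rows with at least one missing value

print(per_column)
print(share)
print(total)                           # 2
print(rows_with_gaps.index.tolist())   # [1, 3]
```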

The empty Name in row 2 was not counted by isna().sum(). To get correct results, we can replace empty strings with NaN.

Note that this step is usually not needed when the data is loaded from a csv file with pd.read_csv, as [it performs its own NaN detection and replacement](https://wesmckinney.com/book/accessing-data.html#io_flat_files) for empty fields.
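
To illustrate the csv case, here is a small sketch that reads hypothetical csv text from memory instead of a file. Empty fields and the default marker 'NA' are both loaded as NaN without any extra step.

```python
import io
import pandas as pd

# Hypothetical csv text standing in for a file on disk
csv_text = "Name,Age,email\nJames,50,\nAstrid,,astrid@example.com\n,27,NA\n"

from_csv = pd.read_csv(io.StringIO(csv_text))
csv_missing = from_csv.isna().sum()
print(csv_missing)  # Name 1, Age 1, email 2
```

For a DataFrame built in memory, as in this notebook, the replacement has to be done explicitly.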

In [9]:
# replace empty or whitespace-only strings with NaN using a regular expression
missing_values = missing_values.replace(r'^\s*$', np.nan, regex=True)
missing_values.isna().sum()

Name        1
Age         1
Position    0
email       1
dtype: int64

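We could also use a similar technique to replace 'None' strings with None/NaN. Be careful, though: this also wipes out any legitimate value that is exactly 'None' (there could be someone called Mr. John None whose surname is stored in its own column).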

In [10]:
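# replace the string 'None' with a missing value using replace
# (np.nan is used as the replacement because older pandas versions treat
# value=None as "no value given" and forward-fill instead)
missing_values = missing_values.replace('None', np.nan)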
missing_values.isna().sum()

Name        1
Age         1
Position    0
email       2
dtype: int64
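
One way to lower the risk of removing a legitimate 'None' is to restrict the replacement to the columns where the string cannot be a real value. The sketch below is an assumption about how that could look, using a made-up `people` DataFrame. Without regex=True, replace() only matches whole cell values, so 'John None' would be left alone in any case.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'None' appears both as a real name and as a placeholder
people = pd.DataFrame({
    'Name': ['John None', 'None', 'Shawn'],
    'email': ['None', 'none@example.com', None],
})

# Only touch the email column; the Name column is left as it is
people['email'] = people['email'].replace('None', np.nan)

missing_emails = int(people['email'].isna().sum())
print(missing_emails)           # 2
print(people['Name'].tolist())  # ['John None', 'None', 'Shawn']
```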

## Replacing missing values