# Identify, calculate and replace the missing values using Python
> Topic: Data Management in Python, formerly known as Importing and cleaning data in Python

In this notebook we will explore different ways to identify, calculate and replace missing values using Python.

In Pandas, np.nan is the usual marker for missing values.  
In plain Python, None is used for missing values, and Pandas treats it as missing too.

Let's start by identifying missing values.


## Identifying missing values
There are two methods in Pandas to check for NaNs in a Series or DataFrame: .isna() and .notna()

- isna() returns an array of booleans indicating whether the values are NA. Returns True on np.nan, None and <NA>
- notna() is the inverse of the previous method, returning False on np.nan, None and <NA>, and True on everything else
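
As a small illustration of the two methods, here is a minimal sketch on a made-up Series (the values are hypothetical and not part of this notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical values: one real entry followed by the three missing markers
markers = pd.Series(['x', np.nan, None, pd.NA], dtype=object)

is_missing = markers.isna()    # True where the value is missing
is_present = markers.notna()   # element-wise inverse of isna()

print(pd.DataFrame({'value': markers, 'isna': is_missing, 'notna': is_present}))
```

Each row of the printed frame should show opposite booleans in the isna and notna columns.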

In [1]:
import pandas as pd
import numpy as np

sample_people_data = [
    ['James', 50, 'Web Developer', 'None'], 
    ['Astrid', np.nan, 'Data Analyst', 'astridleiland@hotmail.com.com'],
    ['', 27, 'Cloud Architect', 'louiselane@supercloud.com'],
    ['Shawn', 36, 'Senior Flow Controller', None],
    ]

missing_values = pd.DataFrame(sample_people_data, columns=['Name', 'Age', 'Position', 'email'])

missing_values.head()

Unnamed: 0,Name,Age,Position,email
0,James,50.0,Web Developer,
1,Astrid,,Data Analyst,astridleiland@hotmail.com.com
2,,27.0,Cloud Architect,louiselane@supercloud.com
3,Shawn,36.0,Senior Flow Controller,


In [2]:
missing_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      4 non-null      object 
 1   Age       3 non-null      float64
 2   Position  4 non-null      object 
 3   email     3 non-null      object 
dtypes: float64(1), object(3)
memory usage: 256.0+ bytes


In [3]:
missing_values.isna()

Unnamed: 0,Name,Age,Position,email
0,False,False,False,False
1,False,True,False,False
2,False,False,False,False
3,False,False,False,True


### Important notes

None is a missing value to Pandas. The string 'None' is not. An empty string is not treated as missing either.

When a numeric column contains np.nan, the whole column is stored as float64 (np.nan is itself a float), which is why Age shows up as float64 above.
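
The notes above can be checked with a short sketch. The Series names are made up for illustration, and the nullable 'Int64' dtype at the end is an extra option that this notebook does not otherwise use:

```python
import numpy as np
import pandas as pd

# None and np.nan are missing; the string 'None' and the empty string are not
candidates = pd.Series([None, np.nan, 'None', ''], dtype=object)
print(candidates.isna().tolist())

# Integers mixed with np.nan end up stored as float64
ages = pd.Series([50, np.nan, 27])
print(ages.dtype)

# Optional alternative: the nullable 'Int64' dtype keeps integers and marks gaps as <NA>
ages_nullable = pd.Series([50, None, 27], dtype='Int64')
print(ages_nullable.dtype)
```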


## Calculating missing values
Because isna() returns booleans, and True counts as 1 when summed, we can calculate the number of missing values per column using the sum method.

In [4]:
missing_values.isna().sum()

Name        0
Age         1
Position    0
email       1
dtype: int64
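
The same boolean trick extends to rows and to proportions. A minimal sketch on a small made-up frame (the names and values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Small made-up frame, independent of the notebook's data
people = pd.DataFrame({
    'Name': ['James', 'Astrid', 'Shawn', 'Louise'],
    'Age': [50, np.nan, 36, np.nan],
    'email': [None, 'astrid@example.com', None, None],
})

missing_per_column = people.isna().sum()        # count of missing values per column
missing_per_row = people.isna().sum(axis=1)     # count of missing values per row
missing_share = people.isna().mean()            # fraction of missing values per column

print(missing_per_column)
print(missing_per_row)
print(missing_share)
```

Taking the mean of booleans gives the share of True values, which is often easier to compare across columns than raw counts.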

The empty Name was not counted above. To count it correctly, we can replace empty strings with NaN.

Note that this step is usually not needed when importing the data from a CSV file, as [pd.read_csv performs its own NaN detection and replacement](https://wesmckinney.com/book/accessing-data.html#io_flat_files)
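
To illustrate the CSV remark, here is a minimal sketch that reads made-up CSV text from memory. The 'unknown' marker passed to na_values is a hypothetical sentinel chosen for this example:

```python
import io
import pandas as pd

# Made-up CSV text: the second row has an empty Name, the third an empty Age
csv_text = (
    "Name,Age,email\n"
    "James,50,james@example.com\n"
    ",27,louise@example.com\n"
    "Shawn,,unknown\n"
)

# Empty fields are turned into NaN while reading
people = pd.read_csv(io.StringIO(csv_text))
print(people.isna().sum())

# na_values adds extra markers that should also be read as NaN
people_custom = pd.read_csv(io.StringIO(csv_text), na_values=['unknown'])
print(people_custom.isna().sum())
```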

In [5]:
# replace using regular expressions
missing_values = missing_values.replace(r'^\s*$', np.nan, regex = True)
missing_values.isna().sum()

Name        1
Age         1
Position    0
email       1
dtype: int64

We could also use a similar technique to replace 'None' strings with None/NaN. Be careful, though: a cell that legitimately contains 'None' (for example, somebody's surname) would be replaced too.

In [6]:
# replace the string 'None' with NaN using replace
missing_values = missing_values.replace('None', np.nan)
missing_values.isna().sum()

Name        1
Age         1
Position    0
email       2
dtype: int64
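
One way to reduce the risk of hitting a legitimate 'None' is to restrict the replacement to the columns where it is known to be a placeholder. A minimal sketch, assuming made-up data in which 'None' is a real value in Name but a placeholder in email:

```python
import numpy as np
import pandas as pd

# Made-up data: 'None' is a real surname in Name but a placeholder in email
people = pd.DataFrame({
    'Name': ['John None', 'None', 'Astrid'],
    'email': ['None', 'none@example.com', 'None'],
})

# Replace only in the email column; Name is left untouched
people['email'] = people['email'].replace('None', np.nan)
print(people)
```

Without regex=True, replace only matches whole cell values, so 'John None' would survive a frame-wide replace, but the Name cell that is exactly 'None' would not.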

## Replacing missing values
Once we have identified the missing values, there are several actions we can perform on them:

- Replacing NaN with another value, such as 0 or a default value.
- Imputing: replacing NaN with a value calculated from the other values in the column.
- Filling NaN with other existing values in the column.
- Dropping rows or columns that contain one or all NaNs.

### Replacing NaN with other simple values

In [7]:
# fill NaNs with zeros
missing_values_to_zero = missing_values.copy()
# fillna returns a new DataFrame by default; inplace=True modifies this copy directly
missing_values_to_zero.fillna(0, inplace=True)
missing_values_to_zero

Unnamed: 0,Name,Age,Position,email
0,James,50.0,Web Developer,0
1,Astrid,0.0,Data Analyst,astridleiland@hotmail.com.com
2,0,27.0,Cloud Architect,louiselane@supercloud.com
3,Shawn,36.0,Senior Flow Controller,0
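
As a side note on inplace, the sketch below (with a made-up one-column frame) shows that fillna leaves the original untouched unless the result is assigned back or inplace=True is passed:

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({'Age': [50.0, np.nan, 27.0]})

# Without inplace=True, fillna returns a new DataFrame
filled = ages.fillna(0)

print(ages['Age'].isna().sum())     # the original still has its NaN
print(filled['Age'].isna().sum())   # the returned copy does not
```

Assigning the result to a new name is an alternative to inplace=True that keeps the original data available for comparison.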


In [8]:
# filling with a dictionary of per-column values - not covered in DataCamp but quite useful
# Ref: Wes McKinney
missing_values_from_dictionary = missing_values.copy()
missing_values_from_dictionary.fillna({'Name': 'SonGoku', 'Age': 99, 'email': 'demo@fakemail.com'}, inplace=True)
missing_values_from_dictionary

Unnamed: 0,Name,Age,Position,email
0,James,50.0,Web Developer,demo@fakemail.com
1,Astrid,99.0,Data Analyst,astridleiland@hotmail.com.com
2,SonGoku,27.0,Cloud Architect,louiselane@supercloud.com
3,Shawn,36.0,Senior Flow Controller,demo@fakemail.com


### Imputing calculated values

In [9]:
sample_city_data = {'neighborhood': ['Alameda de Osuna', 'Aeropuerto', 'Casco Histórico de Barajas', 'Timón', 'Corralejos'],
                    'neighborhood_id': [211, 212, 213, 214, 215], 
                    'air_quality': [13, 5, np.nan, 45, 39]}
imputing_missing_values = pd.DataFrame(sample_city_data)
imputing_missing_values

Unnamed: 0,neighborhood,neighborhood_id,air_quality
0,Alameda de Osuna,211,13.0
1,Aeropuerto,212,5.0
2,Casco Histórico de Barajas,213,
3,Timón,214,45.0
4,Corralejos,215,39.0


In [10]:
imputing_missing_values['neighborhood_id'] = imputing_missing_values['neighborhood_id'].astype('category')
# Alternative transformation imputing_missing_values['neighborhood_id'] = pd.Categorical(imputing_missing_values['neighborhood_id'])
imputing_missing_values.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   neighborhood     5 non-null      object  
 1   neighborhood_id  5 non-null      category
 2   air_quality      4 non-null      float64 
dtypes: category(1), float64(1), object(1)
memory usage: 425.0+ bytes


In [11]:
imputing_missing_values.fillna(imputing_missing_values.mean(numeric_only=True))

Unnamed: 0,neighborhood,neighborhood_id,air_quality
0,Alameda de Osuna,211,13.0
1,Aeropuerto,212,5.0
2,Casco Histórico de Barajas,213,25.5
3,Timón,214,45.0
4,Corralejos,215,39.0
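
The mean is only one possible statistic. As a hedged variation, the sketch below imputes with the median instead and writes the result to a new column (air_quality_filled is a hypothetical name) so the original values stay visible:

```python
import numpy as np
import pandas as pd

air = pd.DataFrame({
    'neighborhood_id': [211, 212, 213, 214, 215],
    'air_quality': [13, 5, np.nan, 45, 39],
})

# The median ignores NaN by default and is less sensitive to outliers than the mean
median_value = air['air_quality'].median()

# Keep the original column and store the imputed version next to it
air['air_quality_filled'] = air['air_quality'].fillna(median_value)
print(air)
```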


### Filling NaNs with other values in the column

If you are working with time series or other ordered data, it might make more sense to fill the gaps with the previous or following value instead of imputing. To do so, we can pass the method parameter to fillna, with either ffill for forward fill or bfill for backward fill. Recent Pandas versions deprecate this parameter in favour of calling .ffill() or .bfill() directly, which does the same thing.
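
Before applying this to the station data, a minimal sketch on a made-up Series shows the difference between the two directions:

```python
import numpy as np
import pandas as pd

readings = pd.Series([36.5, np.nan, np.nan, 40.2])

forward = readings.ffill()    # each gap takes the last value seen before it
backward = readings.bfill()   # each gap takes the next value seen after it

print(pd.DataFrame({'original': readings, 'ffill': forward, 'bfill': backward}))
```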

In [12]:
temperature_data = {"station_id": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   "date": ['2022-07-13', '2022-07-13', '2022-07-13', '2022-07-13',
                            '2022-07-14', '2022-07-14', '2022-07-14', '2022-07-14',
                            '2022-07-15', '2022-07-15', '2022-07-15', '2022-07-15'],
                   "temperature": [36.5, 37.8, 34.3, 40.2, 38.2, 39.8, np.nan, 41.7, 34.1, 37.2, 30.9, np.nan]}
temperatures = pd.DataFrame(temperature_data)
temperatures

Unnamed: 0,station_id,date,temperature
0,1,2022-07-13,36.5
1,2,2022-07-13,37.8
2,3,2022-07-13,34.3
3,4,2022-07-13,40.2
4,1,2022-07-14,38.2
5,2,2022-07-14,39.8
6,3,2022-07-14,
7,4,2022-07-14,41.7
8,1,2022-07-15,34.1
9,2,2022-07-15,37.2


Looking at how the data is sorted, a backward or forward fill would not make much sense here, as the gaps would be filled with data from other stations. To make it work, we need to organize our records by station_id, then date.

Right now, they are arranged by date, then station_id. Let's make that easier to see by passing the columns argument to pd.DataFrame.

In [13]:
temperatures = pd.DataFrame(temperature_data, columns=['date', 'station_id', 'temperature'])
temperatures

Unnamed: 0,date,station_id,temperature
0,2022-07-13,1,36.5
1,2022-07-13,2,37.8
2,2022-07-13,3,34.3
3,2022-07-13,4,40.2
4,2022-07-14,1,38.2
5,2022-07-14,2,39.8
6,2022-07-14,3,
7,2022-07-14,4,41.7
8,2022-07-15,1,34.1
9,2022-07-15,2,37.2


In [14]:
temperatures_vs_dates = temperatures.pivot(index='station_id', columns='date')
temperatures_vs_dates

Unnamed: 0_level_0,temperature,temperature,temperature
date,2022-07-13,2022-07-14,2022-07-15
station_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,36.5,38.2,34.1
2,37.8,39.8,37.2
3,34.3,,30.9
4,40.2,41.7,


In [16]:
# We need to do one extra step to get tidy data as a result:
# - Each observation forms a row
# - Each variable forms a column
temperatures_by_station = temperatures_vs_dates.stack(dropna=False)
temperatures_by_station


Unnamed: 0_level_0,Unnamed: 1_level_0,temperature
station_id,date,Unnamed: 2_level_1
1,2022-07-13,36.5
1,2022-07-14,38.2
1,2022-07-15,34.1
2,2022-07-13,37.8
2,2022-07-14,39.8
2,2022-07-15,37.2
3,2022-07-13,34.3
3,2022-07-14,
3,2022-07-15,30.9
4,2022-07-13,40.2


In [21]:
# Time to fill those NaN
# .ffill() is equivalent to fillna(method='ffill'), which recent Pandas versions deprecate
temperatures_by_station.ffill()

Unnamed: 0_level_0,Unnamed: 1_level_0,temperature
station_id,date,Unnamed: 2_level_1
1,2022-07-13,36.5
1,2022-07-14,38.2
1,2022-07-15,34.1
2,2022-07-13,37.8
2,2022-07-14,39.8
2,2022-07-15,37.2
3,2022-07-13,34.3
3,2022-07-14,34.3
3,2022-07-15,30.9
4,2022-07-13,40.2
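
Even with the data ordered by station, a plain forward fill can still leak across stations when a station's first reading is missing. A minimal sketch of one way to avoid that, assuming made-up readings for two stations and hypothetical column names for the results:

```python
import numpy as np
import pandas as pd

# Made-up readings: station 2 has no value on its first date
readings = pd.DataFrame({
    'station_id': [1, 2, 1, 2, 1, 2],
    'date': ['2022-07-13', '2022-07-13', '2022-07-14',
             '2022-07-14', '2022-07-15', '2022-07-15'],
    'temperature': [36.5, np.nan, 38.2, 39.8, np.nan, 37.2],
})

# Order by station, then date, as an alternative to pivot + stack
readings = readings.sort_values(['station_id', 'date']).reset_index(drop=True)

# A plain ffill lets station 1's last value spill into station 2's first row
readings['plain_ffill'] = readings['temperature'].ffill()

# Grouping first keeps every fill inside its own station
readings['per_station_ffill'] = readings.groupby('station_id')['temperature'].ffill()
print(readings)
```

In the grouped version, station 2's first row stays NaN because there is no earlier reading from that station to carry forward.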


In [22]:
# TODO Create different fill patterns based on station_id (e.g. station_id 3 is bfill and station 4 is ffill).
# Mixing different fill patterns may not make sense under real world conditions.
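
One possible way to approach the TODO above is sketched below. It is only an illustration: the fill_methods mapping, the temperature_filled column, and the sample readings are all hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up readings for two stations, already ordered by station and date
readings = pd.DataFrame({
    'station_id': [3, 3, 3, 4, 4, 4],
    'date': ['2022-07-13', '2022-07-14', '2022-07-15'] * 2,
    'temperature': [34.3, np.nan, 30.9, 40.2, 41.7, np.nan],
})

# Hypothetical mapping from station to fill direction
fill_methods = {3: 'bfill', 4: 'ffill'}

filled_parts = []
for station_id, group in readings.groupby('station_id'):
    # Stations missing from the mapping fall back to a forward fill
    method = fill_methods.get(station_id, 'ffill')
    series = group['temperature']
    filled_parts.append(series.bfill() if method == 'bfill' else series.ffill())

# Reassemble the per-station pieces in the original row order
readings['temperature_filled'] = pd.concat(filled_parts).sort_index()
print(readings)
```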

You can limit how many cells are filled by fillna() using the limit=num_rows parameter when doing forward fill or backward fill.
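
A minimal sketch of the limit parameter on a made-up Series with a gap of three consecutive NaNs:

```python
import numpy as np
import pandas as pd

readings = pd.Series([36.5, np.nan, np.nan, np.nan, 40.2])

# Fill at most one consecutive NaN after each valid value
one_step = readings.ffill(limit=1)

# Fill at most two consecutive NaNs
two_steps = readings.ffill(limit=2)

print(pd.DataFrame({'original': readings, 'limit=1': one_step, 'limit=2': two_steps}))
```

Cells beyond the limit stay NaN, which helps avoid stretching a single old value across a long gap.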

### Dropping rows or columns that contain one or all NaNs

The dropna() method allows us to drop rows or columns that contain one or more NaNs.

As a first step, we will create a new DataFrame and then replace some random cells with NaN values.

In [25]:
import numpy as np
nan_df = pd.DataFrame(np.random.randn(7, 5))

for col in nan_df.columns:
    nan_df.loc[nan_df.sample(frac=0.15).index, col] = np.nan

nan_df

Unnamed: 0,0,1,2,3,4
0,0.179805,-0.25018,-2.366158,1.347757,-1.329891
1,1.506101,0.803641,-0.698248,-1.548194,-1.097894
2,-0.409992,,2.117172,2.234353,
3,0.102895,0.345339,-0.208538,0.106483,1.648533
4,-0.5187,-0.789185,-0.172567,-0.071722,-1.464463
5,,1.299306,0.950453,,0.052422
6,0.847351,1.05424,,-0.472565,-2.598221


Given that NaNs are created randomly, different runs of this notebook could return different results.
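
If repeatable output is preferred, one option is to seed both random steps. A minimal sketch, assuming a hypothetical helper make_nan_df and an arbitrary seed value:

```python
import numpy as np
import pandas as pd

def make_nan_df(seed):
    """Hypothetical helper: build a 7x5 frame with seeded random NaNs."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.standard_normal((7, 5)))
    for position, col in enumerate(df.columns):
        # random_state makes the sampled rows repeatable; vary it per column
        rows = df.sample(frac=0.15, random_state=seed + position).index
        df.loc[rows, col] = np.nan
    return df

first_run = make_nan_df(seed=42)
second_run = make_nan_df(seed=42)

# equals() treats NaNs in the same positions as equal
print(first_run.equals(second_run))
```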

In [26]:
# If we only want to drop rows where all columns are NaNs we must pass the how='all' argument
# (no row here is entirely NaN, so nothing is dropped)
nan_df.dropna(how='all')

Unnamed: 0,0,1,2,3,4
0,0.179805,-0.25018,-2.366158,1.347757,-1.329891
1,1.506101,0.803641,-0.698248,-1.548194,-1.097894
2,-0.409992,,2.117172,2.234353,
3,0.102895,0.345339,-0.208538,0.106483,1.648533
4,-0.5187,-0.789185,-0.172567,-0.071722,-1.464463
5,,1.299306,0.950453,,0.052422
6,0.847351,1.05424,,-0.472565,-2.598221


In [31]:
# To drop rows with any NaN values we just use the dropna() method
nan_df.dropna(inplace=True)
nan_df

Unnamed: 0,0,1,2,3,4
0,0.179805,-0.25018,-2.366158,1.347757,-1.329891
1,1.506101,0.803641,-0.698248,-1.548194,-1.097894
3,0.102895,0.345339,-0.208538,0.106483,1.648533
4,-0.5187,-0.789185,-0.172567,-0.071722,-1.464463


In [32]:
# To drop columns with NaN values, we must pass the parameter axis='columns'
# (the rows with NaNs were already dropped in place above, so no column is removed here)
nan_df.dropna(axis='columns')

Unnamed: 0,0,1,2,3,4
0,0.179805,-0.25018,-2.366158,1.347757,-1.329891
1,1.506101,0.803641,-0.698248,-1.548194,-1.097894
3,0.102895,0.345339,-0.208538,0.106483,1.648533
4,-0.5187,-0.789185,-0.172567,-0.071722,-1.464463
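
To see the column-wise options actually remove something, here is a minimal sketch on a small made-up frame that still contains NaNs (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],       # one NaN
    'b': [4.0, 5.0, 6.0],          # complete
    'c': [np.nan, np.nan, np.nan], # entirely NaN
})

# Drop every column that contains at least one NaN
only_complete_columns = scores.dropna(axis='columns')

# Drop only the columns in which every value is NaN
not_entirely_empty = scores.dropna(axis='columns', how='all')

print(list(only_complete_columns.columns))
print(list(not_entirely_empty.columns))
```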
