# Handling missing data points

## Option 1 - Reducing missing data points

In [2]:
import pandas as pd
import numpy as np

Create a sample data frame to work with.

In [3]:
sample_df = pd.DataFrame({'numbers': [1, 2, 3, np.nan, np.nan, np.nan, 7]})
sample_df

Unnamed: 0,numbers
0,1.0
1,2.0
2,3.0
3,
4,
5,
6,7.0


The following counts the missing values in each column of the data frame.

In [5]:
sample_df.isnull().sum()

numbers    3
dtype: int64
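As a small extension of the count above, the same boolean mask can report the share of missing values and show which rows are affected. This is only a sketch; the names `missing_fraction` and `rows_with_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'numbers': [1, 2, 3, np.nan, np.nan, np.nan, 7]})

# isnull() yields booleans, so mean() gives the fraction of missing values per column
missing_fraction = sample_df.isnull().mean()

# Select only the rows that contain at least one missing value
rows_with_missing = sample_df[sample_df.isnull().any(axis=1)]
```

Looking at the fraction rather than the raw count can make it easier to judge whether dropping or filling is the more reasonable choice.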

`dropna` removes every row that contains at least one missing value.

In [6]:
dropped_sample_df = sample_df.dropna()
dropped_sample_df

Unnamed: 0,numbers
0,1.0
1,2.0
2,3.0
6,7.0


This reduces the number of samples you have to work with, but guarantees that no missing values remain in the dataset. Keep in mind that if values are not missing at random, dropping rows can itself skew the remaining data.
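When a data frame has more than one column, dropping every row with any missing value may discard more than intended. The sketch below shows three `dropna` parameters that narrow the rule; `wide_df` and its `labels` column are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, used only for illustration
wide_df = pd.DataFrame({
    'numbers': [1, np.nan, 3, np.nan],
    'labels': ['a', 'b', None, None],
})

# Drop a row only when every column in it is missing
dropped_all = wide_df.dropna(how='all')

# Drop a row only when 'numbers' is missing, ignoring other columns
dropped_subset = wide_df.dropna(subset=['numbers'])

# Keep only rows that have at least 2 non-missing values
dropped_thresh = wide_df.dropna(thresh=2)
```

Which rule fits depends on which columns the later analysis actually needs.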

## Option 2 - Filling missing data points

Another option is to fill missing data points with a flat value, such as `0`.

In [9]:
filled_sample_df = sample_df.fillna(0)
filled_sample_df

Unnamed: 0,numbers
0,1.0
1,2.0
2,3.0
3,0.0
4,0.0
5,0.0
6,7.0


Using a flat value like `0` can bias a data set. Instead, you can backfill (`bfill`): each missing field takes the next valid value that follows it in the column, so the gap is filled with actual data from another row.

In [10]:
# fillna(method='bfill') is deprecated in recent pandas versions; DataFrame.bfill() is the equivalent
backfilled_sample_df = sample_df.bfill()
backfilled_sample_df

Unnamed: 0,numbers
0,1.0
1,2.0
2,3.0
3,7.0
4,7.0
5,7.0
6,7.0


Alternatively, you can forward-fill (`ffill`), so that each missing field takes the last valid value that precedes it.

In [11]:
# fillna(method='ffill') is deprecated in recent pandas versions; DataFrame.ffill() is the equivalent
forwardfilled_sample_df = sample_df.ffill()
forwardfilled_sample_df

Unnamed: 0,numbers
0,1.0
1,2.0
2,3.0
3,3.0
4,3.0
5,3.0
6,7.0
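Forward- and backfilling copy one value across the whole gap, which may be too aggressive for long runs of missing data. As a hedged sketch, the `limit` parameter caps how many consecutive values are filled, and chaining the two directions covers a gap at the start of a column. The names `limited_ffill_df`, `leading_gap` and `filled_both_ways` are illustrative.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'numbers': [1, 2, 3, np.nan, np.nan, np.nan, 7]})

# Forward-fill at most one consecutive missing value; the rest of the gap stays NaN
limited_ffill_df = sample_df.ffill(limit=1)

# A column that starts with a missing value has nothing to forward-fill from,
# so chaining bfill() afterwards covers the leading gap
leading_gap = pd.Series([np.nan, 1.0, np.nan, 3.0])
filled_both_ways = leading_gap.ffill().bfill()
```

Leaving part of a gap as `NaN` keeps it visible, so it can still be dropped or handled differently later.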


However, a better approach might be to fill with the average of the column. This goes back to the flat-value fill, except that the value is computed as the mean of the column in the data frame.

In [12]:
average_filled_sample_df = sample_df.fillna(sample_df['numbers'].mean())
average_filled_sample_df

Unnamed: 0,numbers
0,1.0
1,2.0
2,3.0
3,3.25
4,3.25
5,3.25
6,7.0
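The mean is sensitive to outliers (here the `7` pulls it up to `3.25`), so the median is sometimes preferred. The sketch below also passes a dictionary to `fillna` to use a different statistic per column; the `scores` column and the name `multi_df` are hypothetical additions for illustration only.

```python
import numpy as np
import pandas as pd

# 'scores' is a hypothetical second column, added only for illustration
multi_df = pd.DataFrame({
    'numbers': [1, 2, 3, np.nan, np.nan, np.nan, 7],
    'scores': [10.0, np.nan, 30.0, 40.0, np.nan, 60.0, 70.0],
})

# Map each column name to the value that should replace its missing entries
fill_values = {
    'numbers': multi_df['numbers'].median(),  # median of 1, 2, 3, 7
    'scores': multi_df['scores'].mean(),      # mean of the five valid scores
}

# fillna accepts a dict and applies each value to its own column
stat_filled_df = multi_df.fillna(fill_values)
```

Whether the mean or the median is the better choice depends on how the column's values are distributed.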


You can also interpolate missing values, letting pandas estimate them from the neighboring valid values. By default, `interpolate` uses linear interpolation, which treats the values as equally spaced.

In [13]:
interpolated_sample_df = sample_df.interpolate()
interpolated_sample_df

Unnamed: 0,numbers
0,1.0
1,2.0
2,3.0
3,4.0
4,5.0
5,6.0
6,7.0
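Interpolating across a long gap can invent a trend that the data does not support. As a minimal sketch, the `limit` and `limit_direction` parameters of `interpolate` restrict how far into a gap values are estimated; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'numbers': [1, 2, 3, np.nan, np.nan, np.nan, 7]})

# Fill at most one consecutive missing value, working forward from the last valid one
limited_interp_df = sample_df.interpolate(method='linear', limit=1)

# Fill at most one value inward from each edge of the gap; the middle stays NaN
both_sides_interp_df = sample_df.interpolate(
    method='linear', limit=1, limit_direction='both'
)
```

The filled values still lie on the straight line between `3` and `7`; only the number of filled positions changes.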
