# Managing missing data with pandas

This notebook introduces a few ways to manage nulls using pandas DataFrames. Further information can be found in the pandas documentation: [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

## See also:

* [Missing data cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-missing-data)

In [1]:
import pandas as pd
from numpy import random

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv')

## Check the data

In [3]:
df.head()

| | timestamp | username | temperature | heartrate | build | latest | note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-01-01T12:00:23 | michaelsmith | 12.0 | 67 | 4e6a7805-8faa-2768-6ef6-eb3198b483ac | 0.0 | interval |
| 1 | 2017-01-01T12:01:09 | kharrison | 6.0 | 78 | 7256b7b0-e502-f576-62ec-ed73533c9c84 | 0.0 | wake |
| 2 | 2017-01-01T12:01:34 | smithadam | 5.0 | 89 | 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 | 0.0 | NaN |
| 3 | 2017-01-01T12:02:09 | eddierodriguez | 28.0 | 76 | NaN | 0.0 | update |
| 4 | 2017-01-01T12:02:36 | kenneth94 | 29.0 | 62 | 122f1c6a-403c-2221-6ed1-b5caa08f11e0 | NaN | NaN |


In [4]:
df.dtypes

timestamp       object
username        object
temperature    float64
heartrate        int64
build           object
latest         float64
note            object
dtype: object

In [5]:
df.note.value_counts()

wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: note, dtype: int64

## Remove all null values (including the note `n/a`)

df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv',
                 na_values=['n/a'])
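
As a minimal sketch of what `na_values` does, the example below parses a small inline CSV instead of the real file. The column values and the extra marker `'unknown'` are hypothetical, chosen only for illustration. Note that pandas already treats `n/a` as null by default, so `na_values` matters most for markers that are not in the default list.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for the real data file
csv_text = """username,temperature,note
alice,12.0,wake
bob,,n/a
carol,7.5,unknown
"""

# Default parsing: 'n/a' and empty fields become NaN, 'unknown' stays a string
default_df = pd.read_csv(io.StringIO(csv_text))

# na_values adds extra markers on top of the default list
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=['unknown'])

default_nulls = int(default_df.note.isnull().sum())
custom_nulls = int(custom_df.note.isnull().sum())
print(default_nulls, custom_nulls)
```

If the default markers should not be treated as null, `read_csv` also accepts `keep_default_na=False`.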

### Test if we can use `dropna`

In [6]:
df.shape

(146397, 7)

In [7]:
df.dropna().shape

(46116, 7)

In [8]:
df.dropna(how='all', axis=1).shape

(146397, 7)
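
The shapes above suggest that dropping every row with any null discards most of the data, while no column is entirely empty. The sketch below illustrates how `how` and `axis` change what `dropna` removes; the toy frame is hypothetical and unrelated to the IoT dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'c' is entirely null, row 1 is entirely null
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [np.nan, np.nan, np.nan],
})

# Default (how='any', axis=0): drop every row containing at least one null
rows_any = toy.dropna().shape

# how='all', axis=0: drop only rows in which every value is null
rows_all = toy.dropna(how='all').shape

# how='all', axis=1: drop only columns in which every value is null
cols_all = toy.dropna(how='all', axis=1).shape

print(rows_any, rows_all, cols_all)
```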

### Find all columns in which at least 90% of the data is present

In [9]:
my_columns = list(df.columns)

In [10]:
my_columns

['timestamp',
 'username',
 'temperature',
 'heartrate',
 'build',
 'latest',
 'note']

In [11]:
list(df.dropna(thresh=int(df.shape[0] * .9), axis=1).columns)

['timestamp', 'username', 'heartrate']
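
`thresh` sets the minimum number of non-null values a column needs in order to be kept, so a 90% threshold can also keep columns that still contain a few nulls. A small sketch with hypothetical column names shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows and varying amounts of missing data
toy = pd.DataFrame({
    'full': range(10),                      # 10 non-null values
    'mostly': [1.0] * 9 + [np.nan],         # 9 non-null values
    'sparse': [1.0] * 5 + [np.nan] * 5,     # 5 non-null values
})

# Keep columns with at least 90% non-null values
min_non_null = int(len(toy) * 0.9)
kept = list(toy.dropna(thresh=min_non_null, axis=1).columns)

# Columns with no nulls at all, for comparison
complete = list(toy.columns[toy.notnull().all()])

print(kept, complete)
```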

### Find all columns that are missing data

In [12]:
missing_info = list(df.columns[df.isnull().any()])

In [13]:
missing_info

['temperature', 'build', 'latest', 'note']

In [14]:
for col in missing_info:
    # Count the null entries in this column
    num_missing = df[col].isnull().sum()
    print('number missing for column {}: {}'.format(col, num_missing))

number missing for column temperature: 32357
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704


In [15]:
for col in missing_info:
    # Note: this is a fraction between 0 and 1, not a percentage
    percent_missing = df[col].isnull().sum() / df.shape[0]
    print('percent missing for column {}: {}'.format(
        col, percent_missing))

percent missing for column temperature: 0.22102228870810195
percent missing for column build: 0.22097447352063226
percent missing for column latest: 0.22061927498514314
percent missing for column note: 0.332684412931959
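
The two loops above can also be written without explicit iteration. The following is a sketch on a hypothetical toy frame, assuming the goal is one summary table of null counts and null fractions per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names borrowed from the dataset for readability
toy = pd.DataFrame({
    'temperature': [12.0, np.nan, 5.0, np.nan],
    'heartrate': [67, 78, 89, 76],
    'note': ['wake', None, None, None],
})

null_mask = toy.isnull()

# sum() counts the True values per column; mean() gives the fraction
summary = pd.DataFrame({
    'missing': null_mask.sum(),
    'fraction': null_mask.mean(),
})

# Keep only the columns that actually have missing data
summary = summary[summary.missing > 0]
print(summary)
```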


## Replace missing data with majority values

In [16]:
df.note.value_counts()

wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: note, dtype: int64

In [17]:
df.build.value_counts().head()

b1d3b3a7-6639-9b0b-9b4c-22a976563f74    1
43b11996-707a-0522-23d5-19d17b1f41e6    1
ee8339c4-cbab-8164-a17e-2efb4f80dc18    1
012ba321-84f3-83e6-7d63-b344674bd40c    1
aacd60a6-100c-ac70-8322-13b5909604d9    1
Name: build, dtype: int64

In [18]:
df.latest.value_counts()

0.0    75735
1.0    38364
Name: latest, dtype: int64

In [19]:
df.latest = df.latest.fillna(0)
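
The cell above hard-codes `0` because it is the most frequent value of `latest`. As a hedged alternative, the majority value can be computed with `Series.mode` instead of being read off the counts by hand; the series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'latest' column
latest = pd.Series([0.0, 1.0, 0.0, np.nan, 0.0, np.nan])

# mode() ignores nulls by default and returns a Series (ties yield several
# values), so take the first entry as the majority value
majority = latest.mode()[0]

filled = latest.fillna(majority)
print(majority, filled.tolist())
```

This only makes sense when one value clearly dominates. For a column like `build`, where every value appears once, there is no meaningful majority value to fill with.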

### Example for the missing temperature values

In [20]:
df.username.value_counts().head()

esmith    45
zsmith    43
vsmith    41
ysmith    40
jsmith    37
Name: username, dtype: int64

In [21]:
df = df.set_index('timestamp')

In [22]:
df.head()

| timestamp | username | temperature | heartrate | build | latest | note |
| --- | --- | --- | --- | --- | --- | --- |
| 2017-01-01T12:00:23 | michaelsmith | 12.0 | 67 | 4e6a7805-8faa-2768-6ef6-eb3198b483ac | 0.0 | interval |
| 2017-01-01T12:01:09 | kharrison | 6.0 | 78 | 7256b7b0-e502-f576-62ec-ed73533c9c84 | 0.0 | wake |
| 2017-01-01T12:01:34 | smithadam | 5.0 | 89 | 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 | 0.0 | NaN |
| 2017-01-01T12:02:09 | eddierodriguez | 28.0 | 76 | NaN | 0.0 | update |
| 2017-01-01T12:02:36 | kenneth94 | 29.0 | 62 | 122f1c6a-403c-2221-6ed1-b5caa08f11e0 | 0.0 | NaN |


In [23]:
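# Back-fill up to 3 consecutive missing readings from each user's next
# available reading; bfill replaces fillna(method='backfill'), which is
# deprecated in recent pandas releases
df.temperature = df.groupby('username').temperature.bfill(limit=3)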
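
To make the effect of a per-user backfill with a limit easier to see, here is a small sketch on a hypothetical frame. It uses `bfill(limit=1)` so that the limit is visible in only a few rows; the usernames and readings are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for two users, interleaved in time order
toy = pd.DataFrame({
    'username': ['ann', 'bob', 'ann', 'ann', 'bob', 'ann'],
    'temperature': [np.nan, np.nan, np.nan, 20.0, 15.0, np.nan],
})

# Within each user's rows, fill a null from that user's next valid reading,
# but fill at most one consecutive null per gap
toy['temperature'] = toy.groupby('username').temperature.bfill(limit=1)
print(toy)
```

Row 0 stays null because it is the second consecutive gap before ann's reading of 20.0, and the last row stays null because ann has no later reading to fill from. Values are never borrowed from another user.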