## Common Operations for Missing Data

Some things to remember:

- Missing values are represented as `NaN`, which is a floating-point value (`float64`). When you read
  data from a CSV file via `pd.read_csv()`, empty fields become `NaN`.
- If a column of a pandas DataFrame has dtype `int64`, we know right away that
  it contains no `NaN`s, since `NaN` cannot be stored as `int64`.
- Use `np.nan` to refer to a missing value explicitly, where `np` is short for `numpy`.
- There's also an experimental `pd.NA`, used by nullable dtypes such as `Int64` (note the capital `I`). **Don't use it**.
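The first two points can be checked with a small sketch. The in-memory CSV text and the variable names (`csv_text`, `ints`, `with_nan`) are made up for illustration; they are not part of the notebook's data.

```python
import io

import numpy as np
import pandas as pd

# A tiny in-memory CSV (illustrative); the last row has an empty field in column "b".
csv_text = "a,b\n1,2\n3,\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)             # "a" has no gaps; "b" holds a NaN, so it is read as float
print(df["b"].isna().sum())  # number of missing values in "b"

# An integer Series stays integer, but adding a NaN forces a float dtype.
ints = pd.Series([1, 2, 3])
with_nan = pd.Series([1, 2, np.nan])
print(ints.dtype, with_nan.dtype)
```

So a quick look at `df.dtypes` already tells you which columns cannot contain `NaN`.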

In [1]:
import pandas as pd
import numpy as np

### Most pandas functions ignore `NaN`

In [2]:
ss = pd.Series([1, 2, 3, np.nan])
print("How many non-missing elements does the series have?", ss.count())
print("How many elements, including NaN's, are there?", ss.size)

How many non-missing elements does the series have? 3
How many elements, including NaN's, are there? 4


In [3]:
print("What's the average of the elements in the series?", ss.mean())

What's the average of the elements in the series? 2.0


The result is 2.0, which is (1+2+3) / 3 rather than (1+2+3) / 4, so `ss.mean()` drops the `NaN` before computing the average.
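If you would rather have the `NaN` propagate, most reductions accept a `skipna` argument. A minimal sketch, reusing the same series:

```python
import numpy as np
import pandas as pd

ss = pd.Series([1, 2, 3, np.nan])

# Default behaviour: NaN is skipped before the reduction.
print(ss.mean(), ss.sum())

# With skipna=False the NaN propagates into the result.
print(ss.mean(skipna=False), ss.sum(skipna=False))
```

This is handy as a sanity check when a silent skip would hide a data problem.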

### Count, Filter, and Fill the Missing Values

In [8]:
# make a dataframe
tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns=["Alice", "Bob", "Cobie"],
    index=pd.date_range("1/1/2022", periods=1000),
)
# fill row 4 to 7 and col 2 to 3 with NaNs
tsdf.iloc[3:7, 1:3] = np.nan 
# add an empty column
tsdf['David'] = np.nan
tsdf.head(10)

Unnamed: 0,Alice,Bob,Cobie,David
2022-01-01,0.30222,1.603122,-1.069278,
2022-01-02,0.238815,0.277919,1.876919,
2022-01-03,-1.599163,0.568747,0.088066,
2022-01-04,0.790807,,,
2022-01-05,0.663034,,,
2022-01-06,-1.206111,,,
2022-01-07,-0.300708,,,
2022-01-08,0.071667,2.013236,1.628534,
2022-01-09,0.202228,1.161622,0.057194,
2022-01-10,-3.599803,2.448248,0.325577,


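Since the `David` column is entirely `NaN`, every row has at least one missing value. The output below looks like the result of a row-wise check such as `tsdf.isna().any(axis=1)`:

2022-01-01    True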
2022-01-02    True
2022-01-03    True
2022-01-04    True
2022-01-05    True
              ... 
2024-09-22    True
2024-09-23    True
2024-09-24    True
2024-09-25    True
2024-09-26    True
Freq: D, Length: 1000, dtype: bool

In [5]:
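# find rows with any missing values
# note: tsdf[tsdf.isna()] is not what we want here; it masks element-wise
# and returns a frame of the same shape instead of selecting rows
tsdf[tsdf.isna().any(axis=1)]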

Unnamed: 0,Alice,Bob,Cobie,David
2022-01-01,0.237216,-0.269082,0.903084,
2022-01-02,0.233252,-0.677294,-1.111106,
2022-01-03,-0.947688,1.890816,0.520633,
2022-01-04,-0.409923,,,
2022-01-05,-0.430996,,,
...,...,...,...,...
2024-09-22,-0.971411,-0.598304,0.977353,
2024-09-23,-0.616488,-0.357100,0.605210,
2024-09-24,-0.460412,-0.443838,0.253615,
2024-09-25,0.184447,-1.443742,-0.281709,
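To see why the row-wise mask is needed, here is a small sketch on a made-up three-row frame (`df`, `masked`, and `rows` are illustrative names, not from the notebook). Indexing with a boolean DataFrame keeps the shape and blanks out every cell where the mask is `False`, whereas a boolean Series selects rows.

```python
import numpy as np
import pandas as pd

# Illustrative data: only row 1 has a missing value.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})

# Boolean DataFrame as indexer: same shape, cells where the mask is False become NaN.
masked = df[df.isna()]
print(masked)

# Boolean Series as indexer: keeps only rows with at least one NaN.
rows = df[df.isna().any(axis=1)]
print(rows)
```

In the notebook's `tsdf`, every row is returned because the `David` column is missing everywhere.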


In [6]:
# calc count and fraction (0 to 1) of missing values in each column
s1 = tsdf.isna().sum()
s1.name = 'cnt_missing'
s2 = tsdf.isna().mean()
s2.name = 'pct_missing'
pd.concat([s1, s2], axis=1)

Unnamed: 0,cnt_missing,pct_missing
Alice,0,0.0
Bob,4,0.004
Cobie,4,0.004
David,1000,1.0
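The same summary can also be built in one step by passing both Series to the `DataFrame` constructor. This is only an alternative sketch on a made-up frame; the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Illustrative data: "x" is half missing, "y" is complete.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan], "y": [1.0, 2.0, 3.0, 4.0]})

is_missing = df.isna()
summary = pd.DataFrame(
    {
        "cnt_missing": is_missing.sum(),   # True counts as 1
        "pct_missing": is_missing.mean(),  # fraction between 0 and 1
    }
)
print(summary)
```

Multiply `pct_missing` by 100 if you want an actual percentage.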


In [7]:
# drop empty columns (i.e., 100% missing)
tsdf = tsdf.dropna(axis=1, how='all')
tsdf.head(10)

Unnamed: 0,Alice,Bob,Cobie
2022-01-01,0.237216,-0.269082,0.903084
2022-01-02,0.233252,-0.677294,-1.111106
2022-01-03,-0.947688,1.890816,0.520633
2022-01-04,-0.409923,,
2022-01-05,-0.430996,,
2022-01-06,-0.342682,,
2022-01-07,1.15368,,
2022-01-08,-1.536057,-0.254464,1.052593
2022-01-09,1.648652,-0.429074,0.533771
2022-01-10,0.544285,-0.708646,-1.40897


In [8]:
# drop rows with at least 1 NaN; we often do this when there are relatively
# few missing values compared to the total number of records
df_cleaned = tsdf.dropna()
assert df_cleaned.isna().sum().sum() == 0
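`dropna()` has a few parameters for less aggressive dropping. The sketch below uses a small made-up frame to compare them; `subset`, `thresh`, and `how` are real `dropna` parameters, while the data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 0 is complete, row 3 is entirely missing.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [1.0, 2.0, np.nan, np.nan],
        "c": [1.0, 2.0, 3.0, np.nan],
    }
)

any_missing = df.dropna()               # drop rows with at least one NaN
only_a = df.dropna(subset=["a"])        # only look at column "a"
at_least_two = df.dropna(thresh=2)      # keep rows with >= 2 non-missing values
all_missing = df.dropna(how="all")      # drop rows where every value is NaN

for name, out in [("any", any_missing), ("subset", only_a),
                  ("thresh", at_least_two), ("all", all_missing)]:
    print(name, out.index.tolist())
```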

In [9]:
# alternatively, we can fill the missing values
print(tsdf.fillna(0).iloc[2:8], "\n\n") # with 0
print(tsdf.fillna(tsdf.mean()).iloc[2:8], "\n\n") # with the means
print(tsdf.fillna(tsdf.median()).iloc[2:8], "\n\n") # with the medians

               Alice       Bob     Cobie
2022-01-03  0.295427  0.296839  0.179729
2022-01-04  0.532440  0.000000  0.000000
2022-01-05 -0.714349  0.000000  0.000000
2022-01-06  0.148591  0.000000  0.000000
2022-01-07 -0.461915  0.000000  0.000000
2022-01-08  1.635206  1.142187  1.284139 


               Alice       Bob     Cobie
2022-01-03  0.295427  0.296839  0.179729
2022-01-04  0.532440 -0.010349 -0.010662
2022-01-05 -0.714349 -0.010349 -0.010662
2022-01-06  0.148591 -0.010349 -0.010662
2022-01-07 -0.461915 -0.010349 -0.010662
2022-01-08  1.635206  1.142187  1.284139 


               Alice       Bob     Cobie
2022-01-03  0.295427  0.296839  0.179729
2022-01-04  0.532440 -0.019023 -0.046640
2022-01-05 -0.714349 -0.019023 -0.046640
2022-01-06  0.148591 -0.019023 -0.046640
2022-01-07 -0.461915 -0.019023 -0.046640
2022-01-08  1.635206  1.142187  1.284139 
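For time series, carrying the last known value forward (or the next one backward) is another common choice, and `fillna` also accepts a per-column dict. A minimal sketch on made-up data, not the notebook's `tsdf`:

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps in both columns.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [np.nan, 2.0, np.nan, 8.0],
    }
)

ffilled = df.ffill()  # propagate the last valid value forward
bfilled = df.bfill()  # propagate the next valid value backward
per_col = df.fillna({"a": 0.0, "b": -1.0})  # a different constant per column

print(ffilled, bfilled, per_col, sep="\n\n")
```

Note that a forward fill cannot fill a gap at the very start of a column, since there is no earlier value to carry over.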




In [10]:
# or we can fill the missing values via interpolation
df_cleaned = tsdf.interpolate(method='linear')
df_cleaned.iloc[2:8]

Unnamed: 0,Alice,Bob,Cobie
2022-01-03,0.295427,0.296839,0.179729
2022-01-04,0.53244,0.465909,0.400611
2022-01-05,-0.714349,0.634978,0.621493
2022-01-06,0.148591,0.804048,0.842375
2022-01-07,-0.461915,0.973118,1.063257
2022-01-08,1.635206,1.142187,1.284139
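In the output above, `Bob` climbs from 0.296839 to 1.142187 in five equal steps of about 0.16907, which is what linear interpolation means here: the gap is filled with evenly spaced values between its two neighbours. The sketch below checks this on a made-up series and also shows that, with default settings, a gap at the very start is left unfilled.

```python
import numpy as np
import pandas as pd

# Illustrative series: four missing values between 0 and 5.
s = pd.Series([0.0, np.nan, np.nan, np.nan, np.nan, 5.0])
filled = s.interpolate(method="linear")
print(filled.tolist())

# The filled values match evenly spaced points between the endpoints.
print(np.allclose(filled.to_numpy(), np.linspace(0.0, 5.0, 6)))

# A leading NaN has no left neighbour, so it stays NaN by default.
leading = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate()
print(leading.tolist())
```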


### Summary

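In this notebook, I showed the common operations for missing data: counting missing values, filtering rows that contain them, dropping empty columns or incomplete rows, and filling gaps with a constant, the mean or median, or interpolation.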
