## Common Operations for Missing Data

Some things to remember:

- Missing values are shown as `NaN`, which is a `float64` value. When you read
  data from a CSV file via `pd.read_csv()`, empty values become `NaN`.
- If a column of a pandas DataFrame has dtype `int64`, we know it contains
  no `NaN`s, since `NaN` cannot be stored as `int64`.
- Use `np.nan` to refer to a missing value explicitly, where `np` is short for `numpy`.
- There's an experimental `pd.NA`, used by nullable dtypes such as `Int64` (note the capital `I`). **Don't use it**.
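
The first two points can be checked directly. The sketch below is only an illustration: the CSV text is made-up data read from memory, and `reindex` is just one convenient way to introduce a `NaN` into an integer series.

```python
import io

import numpy as np
import pandas as pd

# A tiny made-up CSV; the second data row has an empty value in column "b".
csv_text = "a,b\n1,10\n2,\n3,30\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)             # "a" stays int64, "b" becomes float64
print(df["b"].isna().sum())  # number of missing values in "b"

# Introducing a NaN into an integer series forces a float dtype.
ints = pd.Series([1, 2, 3])
with_nan = ints.reindex([0, 1, 2, 3])  # label 3 does not exist, so it becomes NaN
print(ints.dtype, with_nan.dtype)
```

Column `a` has no empty values and keeps an integer dtype, while column `b` is upcast to `float64` so that it can hold the `NaN`.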

In [1]:
import pandas as pd
import numpy as np

### Most pandas functions ignore `NaN`

In [2]:
ss = pd.Series([1, 2, 3, np.nan])
print("How many non-missing elements does the series have?", ss.count())
print("How many elements, including NaN's, are there?", ss.size)

How many non-missing elements does the series have? 3
How many elements, including NaN's, are there? 4


In [3]:
print("What's the average of the elements in the series?", ss.mean())

What's the average of the elements in the series? 2.0


The result is 2.0, which is (1+2+3) / 3 rather than (1+2+3) / 4, so `ss.mean()` drops the `NaN` before computing the average.
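
If you want the `NaN` to propagate instead, pandas reductions accept `skipna=False`. The following minimal sketch compares the pandas default with plain NumPy on the same series; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

ss = pd.Series([1, 2, 3, np.nan])

default_mean = ss.mean()              # pandas skips NaN by default
strict_mean = ss.mean(skipna=False)   # NaN propagates into the result
numpy_mean = np.mean(ss.to_numpy())   # plain NumPy does not skip NaN
numpy_nanmean = np.nanmean(ss.to_numpy())  # NumPy's NaN-aware variant

print(default_mean, strict_mean, numpy_mean, numpy_nanmean)
```

This is worth keeping in mind when mixing pandas and NumPy calls: the same data can give a number in one and `NaN` in the other.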

### Count, Filter, and Fill the Missing Values

In [4]:
# make a dataframe
tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns=["Alice", "Bob", "Cobie"],
    index=pd.date_range("1/1/2022", periods=1000),
)
# set rows 4 to 7 of columns 2 and 3 (Bob and Cobie) to NaN
tsdf.iloc[3:7, 1:3] = np.nan
# add an empty column
tsdf['David'] = np.nan
tsdf.head(10)

,Alice,Bob,Cobie,David
2022-01-01,0.230933,0.334012,-1.577554,
2022-01-02,2.531524,1.216325,0.244607,
2022-01-03,-0.184044,1.180566,-0.475057,
2022-01-04,1.084662,,,
2022-01-05,0.361834,,,
2022-01-06,1.103508,,,
2022-01-07,1.016692,,,
2022-01-08,-0.299632,-0.246687,-1.774613,
2022-01-09,-1.431498,0.032685,-0.388391,
2022-01-10,-0.126378,-0.809115,0.963127,


In [5]:
# find rows with any missing values
# (not tsdf[tsdf.isna()], which masks values instead of selecting rows)
tsdf[tsdf.isna().any(axis=1)]

,Alice,Bob,Cobie,David
2022-01-01,0.230933,0.334012,-1.577554,
2022-01-02,2.531524,1.216325,0.244607,
2022-01-03,-0.184044,1.180566,-0.475057,
2022-01-04,1.084662,,,
2022-01-05,0.361834,,,
...,...,...,...,...
2024-09-22,0.361721,-0.463857,1.114484,
2024-09-23,0.606096,0.007074,-1.182563,
2024-09-24,1.751390,-0.488815,0.678080,
2024-09-25,-1.173087,-0.787922,-0.968923,
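
Note that the filter above returns every row, because the `David` column is entirely `NaN`. To look only at rows with gaps in specific columns, restrict the check to those columns. The sketch below rebuilds a frame with the same shape; the seeded generator is an assumption added only to make the example reproducible.

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one above (seeded only for reproducibility).
rng = np.random.default_rng(0)
tsdf = pd.DataFrame(
    rng.standard_normal((1000, 3)),
    columns=["Alice", "Bob", "Cobie"],
    index=pd.date_range("1/1/2022", periods=1000),
)
tsdf.iloc[3:7, 1:3] = np.nan
tsdf["David"] = np.nan

# Checking all columns flags every row, since "David" is always missing.
all_cols = tsdf[tsdf.isna().any(axis=1)]

# Checking only the columns of interest flags just the four rows with gaps.
some_cols = tsdf[tsdf[["Alice", "Bob", "Cobie"]].isna().any(axis=1)]

print(len(all_cols), len(some_cols))
```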


In [6]:
# compute the count and fraction of missing values in each column
s1 = tsdf.isna().sum()
s1.name = 'cnt_missing'
s2 = tsdf.isna().mean()
s2.name = 'pct_missing'
pd.concat([s1, s2], axis=1)

,cnt_missing,pct_missing
Alice,0,0.0
Bob,4,0.004
Cobie,4,0.004
David,1000,1.0


In [7]:
# drop empty columns (i.e., 100% missing)
tsdf = tsdf.dropna(axis=1, how='all')
tsdf.head(10)

,Alice,Bob,Cobie
2022-01-01,0.230933,0.334012,-1.577554
2022-01-02,2.531524,1.216325,0.244607
2022-01-03,-0.184044,1.180566,-0.475057
2022-01-04,1.084662,,
2022-01-05,0.361834,,
2022-01-06,1.103508,,
2022-01-07,1.016692,,
2022-01-08,-0.299632,-0.246687,-1.774613
2022-01-09,-1.431498,0.032685,-0.388391
2022-01-10,-0.126378,-0.809115,0.963127


In [8]:
# drop rows with at least 1 NaN; we often do this when there are relatively
# few missing values compared to the total number of records
df_cleaned = tsdf.dropna()
assert df_cleaned.isna().sum().sum() == 0
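
Dropping every row with any `NaN` can be too aggressive. As a minimal sketch on a small made-up frame, `dropna` also accepts `subset` (only look at certain columns) and `thresh` (keep rows with at least that many non-missing values):

```python
import numpy as np
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame(
    {
        "Alice": [1.0, 2.0, 3.0, 4.0],
        "Bob": [np.nan, 5.0, np.nan, 7.0],
        "Cobie": [np.nan, np.nan, 8.0, 9.0],
    }
)

only_bob = df.dropna(subset=["Bob"])  # drop rows where "Bob" is missing
at_least_two = df.dropna(thresh=2)    # keep rows with >= 2 non-missing values

print(only_bob.index.tolist())
print(at_least_two.index.tolist())
```

Here `subset=["Bob"]` keeps rows 1 and 3, while `thresh=2` drops only row 0, which has a single non-missing value.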

In [9]:
# alternatively, we can fill the missing values
print(tsdf.fillna(0).iloc[2:8], "\n\n") # with 0
print(tsdf.fillna(tsdf.mean()).iloc[2:8], "\n\n") # with the means
print(tsdf.fillna(tsdf.median()).iloc[2:8], "\n\n") # with the medians

               Alice       Bob     Cobie
2022-01-03 -0.184044  1.180566 -0.475057
2022-01-04  1.084662  0.000000  0.000000
2022-01-05  0.361834  0.000000  0.000000
2022-01-06  1.103508  0.000000  0.000000
2022-01-07  1.016692  0.000000  0.000000
2022-01-08 -0.299632 -0.246687 -1.774613 


               Alice       Bob     Cobie
2022-01-03 -0.184044  1.180566 -0.475057
2022-01-04  1.084662  0.025789 -0.008394
2022-01-05  0.361834  0.025789 -0.008394
2022-01-06  1.103508  0.025789 -0.008394
2022-01-07  1.016692  0.025789 -0.008394
2022-01-08 -0.299632 -0.246687 -1.774613 


               Alice       Bob     Cobie
2022-01-03 -0.184044  1.180566 -0.475057
2022-01-04  1.084662  0.002695  0.012617
2022-01-05  0.361834  0.002695  0.012617
2022-01-06  1.103508  0.002695  0.012617
2022-01-07  1.016692  0.002695  0.012617
2022-01-08 -0.299632 -0.246687 -1.774613 
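
Passing a Series such as `tsdf.mean()` to `fillna` fills each column with its own value, matched by column name. The sketch below shows this on a small made-up frame, together with two related options: a per-column dictionary and a forward fill with `ffill()`.

```python
import numpy as np
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame(
    {
        "Bob": [1.0, np.nan, 3.0],
        "Cobie": [10.0, 20.0, np.nan],
    }
)

# Each column is filled with its own mean (Bob: 2.0, Cobie: 15.0).
filled_mean = df.fillna(df.mean())

# A dict gives a different fill value per column.
filled_dict = df.fillna({"Bob": 0.0, "Cobie": df["Cobie"].median()})

# Forward fill carries the last observed value down each column.
ffilled = df.ffill()

print(filled_mean, filled_dict, ffilled, sep="\n\n")
```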




In [10]:
# or we can fill the missing values via interpolation
df_cleaned = tsdf.interpolate(method='linear')
df_cleaned.iloc[2:8]

,Alice,Bob,Cobie
2022-01-03,-0.184044,1.180566,-0.475057
2022-01-04,1.084662,0.895116,-0.734968
2022-01-05,0.361834,0.609665,-0.994879
2022-01-06,1.103508,0.324214,-1.254791
2022-01-07,1.016692,0.038764,-1.514702
2022-01-08,-0.299632,-0.246687,-1.774613
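
With `method='linear'`, the gap is filled in equal steps between the two known endpoints. As a rough check, the sketch below takes the displayed (rounded) `Bob` values for 2022-01-03 and 2022-01-08 from the table above and reproduces the filled values by hand, so small differences in the last digit are expected.

```python
import numpy as np
import pandas as pd

# Displayed endpoints of the "Bob" column, with the four-day gap in between.
bob = pd.Series(
    [1.180566, np.nan, np.nan, np.nan, np.nan, -0.246687],
    index=pd.date_range("2022-01-03", periods=6),
)
filled = bob.interpolate(method="linear")

# Manual version: five equal steps from the first value to the last.
step = (bob.iloc[-1] - bob.iloc[0]) / 5
expected = bob.iloc[0] + step * np.arange(6)

print(filled.round(6))
print(np.allclose(filled.to_numpy(), expected))
```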


### Summary

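This notebook covered the common operations for missing data: counting and filtering missing values, dropping empty columns or incomplete rows, and filling gaps with a constant, a column statistic, or interpolation.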
