In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

### Missing data
Missing data occurs when values are absent, typically represented as NaN (not a number), for one or more features (columns) in a dataset. Many machine learning algorithms cannot handle these values directly, so they have to be dealt with first.

**Missing data can negatively impact:**
- Data visualization
- Arithmetic computations
- Machine learning algorithms

**Common methods to deal with missing data:**
- Remove rows or columns containing missing data
- Impute with mean or median
- Impute with mode (most frequently occurring value)
- Impute with forward or backward fill
- Interpolate data between two points

*Note: Domain knowledge is often needed to decide how to fill nulls.*
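The methods listed above can be sketched on a small example. This is a minimal illustration, not a recommendation of one method over another: the `toy` frame and its values are made up for the example and are not part of this notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0],
    "y": ["a", "b", None, "b", "c"],
})

# Remove rows containing any missing value
dropped = toy.dropna()

# Impute a numeric column with its mean or median
mean_filled = toy["x"].fillna(toy["x"].mean())
median_filled = toy["x"].fillna(toy["x"].median())

# Impute a categorical column with its mode (most frequent value)
mode_filled = toy["y"].fillna(toy["y"].mode()[0])

# Forward fill carries the last valid value forward; backward fill does the reverse
forward = toy["x"].ffill()
backward = toy["x"].bfill()

# Linear interpolation between the neighboring valid points
interpolated = toy["x"].interpolate()
```

Which of these is appropriate depends on the data: forward fill and interpolation, for instance, only make sense when row order is meaningful, as in a time series.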

In [7]:
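# 10x4 frame of standard-normal draws (unseeded, so values differ on each run)
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
# Boolean-mask indexing keeps positive values and replaces the rest with NaN
df = df[df > 0]
df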

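| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.233468 | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN |
| 2 | 2.104171 | NaN | 1.305072 | 0.654175 |
| 3 | NaN | 0.242884 | NaN | 2.707263 |
| 4 | 0.435345 | 2.92375 | 0.050736 | 0.131905 |
| 5 | NaN | NaN | NaN | 0.134448 |
| 6 | NaN | NaN | 0.182764 | NaN |
| 7 | 1.01653 | 0.333731 | 1.573967 | NaN |
| 8 | NaN | NaN | 0.540937 | NaN |
| 9 | NaN | NaN | 1.0249 | 0.886574 |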


In [10]:
# Work on a copy without column D so the original df stays intact
copy = df.copy()
copy.drop(columns="D", inplace=True)

## dropna
Remove rows (default) or columns that contain null values.

**Parameters**
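- **how** = "any" (default) drops a row if any value is missing; "all" drops it only if every value is missing
- **thresh** = minimum number of non-missing values a row must contain in order to *not* be dropped
- **subset** = labels to check for NaN: column labels when dropping rows, row labels when dropping columns
- **axis** = "index" (default) drops rows; "columns" drops columns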
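The parameters above can be compared side by side on a small frame. This is a rough sketch: the `demo` frame is a hypothetical example, separate from the `df` and `copy` frames used in this notebook. Note that, as far as I know, recent pandas versions do not allow `how` and `thresh` in the same call, so they are shown separately.

```python
import numpy as np
import pandas as pd

# Hypothetical frame (an assumption for illustration only)
demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, np.nan, 6.0, 8.0],
    "C": [9.0, np.nan, 11.0, 12.0],
})

# how="any" (default): drop rows with at least one NaN
any_rows = demo.dropna()

# how="all": drop only rows where every value is NaN
all_rows = demo.dropna(how="all")

# thresh=2: keep rows with at least 2 non-missing values
thresh_rows = demo.dropna(thresh=2)

# subset=["A"]: only column A is checked for NaN
subset_rows = demo.dropna(subset=["A"])

# axis="columns" with thresh=3: keep columns with at least 3 non-missing values
thresh_cols = demo.dropna(axis="columns", thresh=3)
```

Each call returns a new DataFrame and leaves `demo` unchanged.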

In [31]:
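# With axis="columns", subset takes row labels: drop every column
# that has NaN in the row labeled 2 (column B here)
copy.dropna(axis="columns", subset=[2])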

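| | A | C |
|---|---|---|
| 0 | 0.233468 | NaN |
| 1 | NaN | NaN |
| 2 | 2.104171 | 1.305072 |
| 3 | NaN | NaN |
| 4 | 0.435345 | 0.050736 |
| 5 | NaN | NaN |
| 6 | NaN | 0.182764 |
| 7 | 1.01653 | 1.573967 |
| 8 | NaN | 0.540937 |
| 9 | NaN | 1.0249 |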
