- Missing data is very common in real-world datasets.
- A value might be unavailable, lost, or simply never recorded.
- Pandas is designed to make handling this kind of data easy and efficient.

Key Points:
1. Pandas handles missing data automatically in operations such as:
    - `mean()`, `sum()`, `describe()`, etc.
    - these computations skip missing values by default (for reductions such as `mean()` and `sum()`, this is the `skipna=True` default)

2. How Missing Data is Represented:
    - For numeric data (float64), pandas uses `NaN` ("Not a Number") to indicate a missing value

3. Sentinel Value:
    - NaN is called a sentinel value - it acts like a flag or placeholder to mark "this value is missing"
    - It's not the same as zero or an empty string - it has a special meaning

In [6]:
import pandas as pd 
import numpy as np 

data = pd.Series([1.0, 2.0, np.nan, 4.0])
data

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In [7]:
data.mean() # ignores NaN

np.float64(2.3333333333333335)
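The default skipping behaviour and the "NaN is not zero" point can be checked directly. The following is a minimal sketch; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

total_default = s.sum()              # NaN is skipped: 1 + 2 + 4
total_strict = s.sum(skipna=False)   # NaN propagates into the result
mean_strict = s.mean(skipna=False)   # also NaN

# NaN is not zero, and it does not even compare equal to itself,
# so test for it with isna() rather than with ==.
nan_equals_zero = (np.nan == 0)
nan_equals_nan = (np.nan == np.nan)
n_missing = int(s.isna().sum())

print(total_default, total_strict, mean_strict)
print(nan_equals_zero, nan_equals_nan, n_missing)
```

Passing `skipna=False` is useful when a missing value should invalidate the whole result instead of being silently ignored.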

- In pandas, missing data is referred to as NA, short for "Not Available"
- This idea comes from the R programming language, which uses NA for missing values

#### What does NA mean?
- NA can represent:
    - data that doesn't exist (e.g., not applicable)
    - data that should exist but was not recorded or observed

#### Why Is This Important?
- Analyzing where and why data is missing is important
    - it may reveal issues in data collection
    - it helps detect bias in the dataset
    - it guides how to handle or fill in missing values properly

- Python's built-in `None` is also treated as NA by pandas in most contexts
- In a numeric Series, `None` is converted to `NaN`, which is why the integers in the next example are stored as `float64`

In [8]:
data = pd.Series([1, None, 3])
data

0    1.0
1    NaN
2    3.0
dtype: float64
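As a small sketch of how `None` behaves in different kinds of Series (the `dtype=object` argument is set explicitly here so that the example does not depend on how a given pandas version infers string dtypes):

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numeric = pd.Series([1, None, 3])

# In an object-dtype Series, None is stored as-is,
# but isna() still reports it as missing.
text = pd.Series(["a", None, "c"], dtype=object)

numeric_mask = numeric.isna().tolist()
text_mask = text.isna().tolist()

print(numeric.dtype, numeric_mask)
print(text.dtype, text_mask)
print(pd.isna(None), pd.isna(np.nan))
```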

#### Function Table

| **Category**             | **Function**         | **Description**                                                                 |
|--------------------------|----------------------|---------------------------------------------------------------------------------|
| 🔍 **Detect Missing Data**  | `isna()` / `isnull()`  | Detects missing values; returns `True` where values are missing (`NaN`, `None`, `NaT`) |
|                          | `notna()` / `notnull()`| Detects non-missing values; returns `True` where data is **not** missing        |
| 🧹 **Remove Missing Data** | `dropna()`            | Removes rows (or columns) containing missing values                             |
|                          | `dropna(axis=1)`      | Removes **columns** with missing values                                         |
|                          | `dropna(how='all')`   | Drops rows/columns **only if all values are missing**                           |
|                          | `dropna(thresh=n)`    | Keeps rows/columns with at least `n` non-NA values                              |
| 🧯 **Fill Missing Data**   | `fillna(value)`       | Fills missing values with a specified scalar, dict, Series, or DataFrame        |
|                          | `ffill()`             | Forward fills using the previous non-null value (replaces the deprecated `fillna(method='ffill')`) |
|                          | `bfill()`             | Backward fills using the next non-null value (replaces the deprecated `fillna(method='bfill')`) |
|                          | `interpolate()`       | Fills missing values using interpolation (linear, polynomial, time, etc.)       |
| 📊 **Summarize Missing Data** | `isna().sum()`         | Counts number of missing values per column                                      |
|                          | `isna().any()`        | Returns `True` for columns with **any** missing values                          |
|                          | `isna().all()`        | Returns `True` for columns where **all** values are missing                     |
|                          | `info()`              | Summary of DataFrame including non-null counts per column                       |
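The summarizing functions and `interpolate()` from the table can be tried on a small frame. This is only a sketch; the column names `x`, `y`, `z` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0],
    "y": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "z": [10, 20, 30, 40, 50],
})

missing_per_column = df.isna().sum()   # number of missing values per column
has_any_missing = df.isna().any()      # True for columns with at least one NaN
all_missing = df.isna().all()          # True only for columns that are entirely NaN
share_missing = df.isna().mean()       # fraction of missing values per column

# Default (linear) interpolation fills each interior gap from its neighbours.
x_interpolated = df["x"].interpolate()

print(missing_per_column)
print(share_missing)
print(x_interpolated.tolist())
```

Taking the mean of the Boolean mask is a convenient way to see the proportion of missing data when columns have many rows.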




## [ Filtering Out Missing Data ] 
- There are a few ways to filter out missing data. You can always do it by hand using `pandas.isna` (or `pandas.notna`) and Boolean indexing.
- `dropna` is often more convenient. On a Series, it returns the Series with only the non-null data and their index values.
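To illustrate that the two approaches give the same result on a Series, here is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

by_mask = s[s.notna()]   # "by hand": Boolean mask built with notna()
by_dropna = s.dropna()   # same result with less typing

print(by_dropna)
print(by_mask.equals(by_dropna))
```

Note that the original index labels (0, 2, 4) are preserved in both cases.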

### Boolean Indexing

In [9]:
# Boolean indexing: build a mask (a Series of True/False values) and use it to filter rows or columns of a DataFrame.

df = pd.DataFrame({
    'A': [1, 2, 7, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [5, np.nan, 7, 8],
    'D': [np.nan, np.nan, np.nan, 5],
    'E': [4,5,6,7]
})
df

Unnamed: 0,A,B,C,D,E
0,1,,5.0,,4
1,2,2.0,,,5
2,7,3.0,7.0,,6
3,4,4.0,8.0,5.0,7


In [10]:
# filter rows where column 'A' is not missing
df[df['A'].notna()] # keeps only rows where A is not NaN (A has no missing values here, so every row is kept)

Unnamed: 0,A,B,C,D,E
0,1,,5.0,,4
1,2,2.0,,,5
2,7,3.0,7.0,,6
3,4,4.0,8.0,5.0,7


In [11]:
# filter rows where column 'B' is missing
df[df['B'].isna()]  # keeps rows where B is NaN

Unnamed: 0,A,B,C,D,E
0,1,,5.0,,4


In [12]:
# filter rows where any column is missing
df[df.isna().any(axis=1)]   # returns all rows that contain at least one missing value


# axis=0: reduce down the rows, giving one result per column
# axis=1: reduce across the columns, giving one result per row

Unnamed: 0,A,B,C,D,E
0,1,,5.0,,4
1,2,2.0,,,5
2,7,3.0,7.0,,6


In [13]:
# filter rows where all columns are missing
df[df.isna().all(axis=1)]   # returns rows where all values are missing

Unnamed: 0,A,B,C,D,E


In [14]:
# filter rows where no columns are missing (fully complete rows)
df[df.notna().all(axis=1)]  # keeps rows with no missing values

Unnamed: 0,A,B,C,D,E
3,4,4.0,8.0,5.0,7


In [16]:
# combine multiple conditions
# ex: A is not missing and B >= 2
df[df['A'].notna() & (df['B'] >= 2)]

# Boolean logic allows powerful combinations

Unnamed: 0,A,B,C,D,E
1,2,2.0,,,5
2,7,3.0,7.0,,6
3,4,4.0,8.0,5.0,7
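One detail worth knowing when combining conditions: a comparison against `NaN` evaluates to `False`, so in the example above row 0 is excluded by `df['B'] >= 2` alone. The sketch below, using the `A` and `B` columns from the same data, shows why negating such a mask needs care.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 7, 4],
    "B": [np.nan, 2, 3, 4],
})

# Any comparison against NaN evaluates to False.
b_at_least_2 = df["B"] >= 2

# Negating the mask does NOT mean "B < 2": the NaN row becomes True.
not_b_at_least_2 = ~b_at_least_2

# To select "B is present and below 2", be explicit about missing values.
b_below_2_present = df["B"].notna() & (df["B"] < 2)

print(b_at_least_2.tolist())
print(not_b_at_least_2.tolist())
print(b_below_2_present.tolist())
```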



#### Summary of Functions

| Function | Purpose |
|----------|---------|
| `isna()` or `isnull()` | Detect missing values |
| `notna()` or `notnull()` | Detect non-missing values |
| `isna().any(axis=1)` | Checks whether **any** value in a row is missing |
| `isna().all(axis=1)` | Checks whether **all** values in a row are missing |
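The effect of the `axis` argument on these reductions can be seen side by side. This sketch reuses the DataFrame from the Boolean indexing examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 7, 4],
    "B": [np.nan, 2, 3, 4],
    "C": [5, np.nan, 7, 8],
    "D": [np.nan, np.nan, np.nan, 5],
    "E": [4, 5, 6, 7],
})

missing_in_column = df.isna().any(axis=0)      # one True/False per column
missing_in_row = df.isna().any(axis=1)         # one True/False per row
missing_count_per_row = df.isna().sum(axis=1)  # number of NaNs in each row

print(missing_in_column.tolist())
print(missing_in_row.tolist())
print(missing_count_per_row.tolist())
```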


### dropna() 

In [27]:
# syntax
# df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan,1],
    'B': [4, 5, np.nan, np.nan,1],
    'C': [7, 8, 9, np.nan,1],
    'D': [1,2,3,4,1]
})
df

Unnamed: 0,A,B,C,D
0,1.0,4.0,7.0,1
1,,5.0,8.0,2
2,3.0,,9.0,3
3,,,,4
4,1.0,1.0,1.0,1


In [28]:
# drop rows with any missing values
df.dropna()

# same as df.dropna(axis=0, how='any')
# drops/removes any row that contains even one NaN value

Unnamed: 0,A,B,C,D
0,1.0,4.0,7.0,1
4,1.0,1.0,1.0,1


In [29]:
# drop rows where all values are missing
df.dropna(how='all')

# removes only rows in which every column is NaN; row 3 stays because its D value is not missing

Unnamed: 0,A,B,C,D
0,1.0,4.0,7.0,1
1,,5.0,8.0,2
2,3.0,,9.0,3
3,,,,4
4,1.0,1.0,1.0,1


In [30]:
# drop columns instead of rows
df.dropna(axis=1)

# removes columns that contain any NaN

Unnamed: 0,D
0,1
1,2
2,3
3,4
4,1


In [31]:
# drop rows with fewer than a certain number of non-NaN values
df.dropna(thresh=2)

# keeps only rows that have at least 2 non-null values (row 3 has just one, so it is dropped)

Unnamed: 0,A,B,C,D
0,1.0,4.0,7.0,1
1,,5.0,8.0,2
2,3.0,,9.0,3
4,1.0,1.0,1.0,1


In [32]:
# drop missing values in specific columns only
df.dropna(subset=['A', 'B'])

# only checks columns 'A' and 'B' for NaNs
# row is dropped if either has NaN

Unnamed: 0,A,B,C,D
0,1.0,4.0,7.0,1
4,1.0,1.0,1.0,1


In [33]:
# make changes permanent
df.dropna(inplace=True)

# removes missing data in place: df itself is modified and the call returns None

In [34]:
# Keep in mind: without inplace=True, these methods return a new object and leave the original unchanged
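A short sketch of the difference between the default behaviour and keeping the result, using a small made-up frame. Reassigning the result is shown here as one common alternative to `inplace=True`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})

cleaned = df.dropna()      # returns a new DataFrame
rows_original = len(df)    # the original still has all of its rows
rows_cleaned = len(cleaned)

# To keep the result under the same name, reassign it.
df = df.dropna()
rows_after_reassign = len(df)

print(rows_original, rows_cleaned, rows_after_reassign)
```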

## Summary Table

| Argument       | Description                                                              |
|----------------|--------------------------------------------------------------------------|
| `axis=0`       | Drop rows (default)                                                      |
| `axis=1`       | Drop columns                                                             |
| `how='any'`    | Drop if **any** value in the row/column is NaN (default)                 |
| `how='all'`    | Drop only if **all** values in the row/column are NaN                    |
| `thresh=n`     | Keep rows (or columns, with `axis=1`) with at least `n` non-NaN values   |
| `subset=[...]` | Only check the listed columns when dropping rows                         |
| `inplace=True` | Modify the DataFrame directly and return `None`                          |
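The arguments can also be combined. As a sketch using the same DataFrame as the `dropna()` examples, `thresh` works on columns when `axis=1`, and `how='all'` together with `subset` drops the row that is empty in the selected columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 3, np.nan, 1],
    "B": [4, 5, np.nan, np.nan, 1],
    "C": [7, 8, 9, np.nan, 1],
    "D": [1, 2, 3, 4, 1],
})

# Keep only columns that have at least 4 non-NaN values.
columns_kept = df.dropna(axis=1, thresh=4).columns.tolist()

# Drop rows in which A, B and C are all missing (D is ignored).
rows_kept = df.dropna(how="all", subset=["A", "B", "C"]).index.tolist()

print(columns_kept)
print(rows_kept)
```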


## [ Filling in Missing Data ]
- Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. 
- For most purposes, the fillna method is the workhorse function to use. 

In [37]:
df = pd.DataFrame(np.random.standard_normal((7,3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,2.91102,,
1,0.019315,,
2,-1.246096,,0.95237
3,0.493455,,0.451122
4,0.200458,0.895797,-0.248329
5,0.649391,0.250523,-1.313816
6,-2.457948,-0.191226,-1.152931


In [38]:
df.fillna(0)

Unnamed: 0,0,1,2
0,2.91102,0.0,0.0
1,0.019315,0.0,0.0
2,-1.246096,0.0,0.95237
3,0.493455,0.0,0.451122
4,0.200458,0.895797,-0.248329
5,0.649391,0.250523,-1.313816
6,-2.457948,-0.191226,-1.152931


In [39]:
# by calling fillna with a dictionary, you can use a different fill value for each column
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,2.91102,0.5,0.0
1,0.019315,0.5,0.0
2,-1.246096,0.5,0.95237
3,0.493455,0.5,0.451122
4,0.200458,0.895797,-0.248329
5,0.649391,0.250523,-1.313816
6,-2.457948,-0.191226,-1.152931


In [41]:
# fill with the previous value (forward fill)

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [4, 5, np.nan, 7]
})
df.ffill()

# fills each NaN with the value above it
# (df.fillna(method='ffill') is the older spelling, deprecated since pandas 2.1)


Unnamed: 0,A,B
0,1.0,4.0
1,1.0,5.0
2,3.0,5.0
3,3.0,7.0


In [42]:
# fill at most 1 consecutive NaN forward per gap
df.ffill(limit=1)


Unnamed: 0,A,B
0,1.0,4.0
1,1.0,5.0
2,3.0,5.0
3,3.0,7.0
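In the example above, `limit=1` gives the same result as an unlimited forward fill because no column contains two NaNs in a row. A sketch with a longer gap (made-up values) makes the effect visible:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

filled_all = s.ffill()           # every NaN in the gap is filled
filled_limit = s.ffill(limit=1)  # only the first NaN after a valid value is filled

print(filled_all.tolist())
print(filled_limit.tolist())
```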


In [43]:
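# fill with the next value (backward fill)
df.bfill()

# fills each NaN with the value below it; the last value in A stays NaN because nothing follows it
# (df.fillna(method='bfill') is the older spelling, deprecated since pandas 2.1)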


Unnamed: 0,A,B
0,1.0,4.0
1,3.0,5.0
2,3.0,7.0
3,,7.0
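A backward fill cannot fill a trailing NaN, just as a forward fill cannot fill a leading one. One possible way to cover both ends, shown here as a sketch and not as the only option, is to chain the two fills:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 3, np.nan],
    "B": [4, 5, np.nan, 7],
})

back_only = df.bfill()             # A keeps a NaN in the last row
back_then_forward = df.bfill().ffill()

remaining_after_bfill = int(back_only.isna().sum().sum())
remaining_after_chain = int(back_then_forward.isna().sum().sum())

print(remaining_after_bfill, remaining_after_chain)
print(back_then_forward)
```

Whether chaining is appropriate depends on the data; for ordered data such as time series, filling from future values may not be desirable.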


In [47]:
# downcast="infer" asks pandas to convert the result to a more compact dtype when it can
# (here float64 -> int64, because every value is a whole number after filling)
# note: the downcast argument is deprecated since pandas 2.2
df.fillna(0, downcast="infer")


Unnamed: 0,A,B
0,1,4
1,0,5
2,3,0
3,0,7


In [46]:
# fill with column mean, median, or mode