#### Part 22: Working with Missing Data in Pandas

In this notebook, we'll explore:
- Additional string methods in pandas
- Working with missing data (NA values)
- Handling, detecting, and replacing missing values

##### Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np

##### 1. Additional String Methods

Let's first look at some additional string methods that were not covered in the previous notebook:

In [None]:
# Create a sample Series
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', 'CABA', 'dog', 'cat'], dtype="string")

# Capitalize
print("Capitalize:")
print(s.str.capitalize())

# Find the position of a substring
print("\nFind 'a':")
print(s.str.find('a'))

# Check if strings are alphanumeric
print("\nIs alphanumeric:")
print(s.str.isalnum())

# Check if strings are alphabetic
print("\nIs alphabetic:")
print(s.str.isalpha())
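
Since the rest of this notebook is about missing data, it helps to see how these string methods behave when an entry is missing. The following is a minimal sketch; the `names` Series is a made-up sample, not data from the previous notebook. With the `"string"` dtype, a missing input should yield a missing result rather than an error.

```python
import pandas as pd

# Hypothetical sample with one missing entry
names = pd.Series(['cat', None, 'Dog'], dtype="string")

# Position of 'a' in each string (-1 when absent, missing when the input is missing)
found = names.str.find('a')

# True/False per string, missing when the input is missing
is_alpha = names.str.isalpha()

print(found)
print(is_alpha)
```

The missing entry propagates through both calls, so no separate check is needed before applying a string method.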

##### 2. Working with Missing Data

### 2.1 Values Considered "Missing"

In pandas, several values are treated as missing:
- `NaN` (Not a Number): the default missing-value marker for floating-point data, chosen for computational speed and convenience
- `None`: Python's built-in null object, which pandas also treats as missing
- `NaT` (Not a Time): Missing value for datetime data
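
As a quick check of the list above, the sketch below passes each marker to the top-level `pd.isna()` function. The dictionary name `markers` is only illustrative.

```python
import numpy as np
import pandas as pd

# Each of these scalars is treated as missing by pandas
markers = {"NaN": np.nan, "None": None, "NaT": pd.NaT}
flags = {name: pd.isna(value) for name, value in markers.items()}
print(flags)

# NaN never compares equal to itself, so == cannot be used to detect it
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)
```

This is why pandas provides dedicated detection functions instead of relying on equality comparisons.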

Let's create a DataFrame with some missing values:

In [None]:
# Create a DataFrame with random values
df = pd.DataFrame(np.random.randn(5, 3), 
                 index=['a', 'c', 'e', 'f', 'h'],
                 columns=['one', 'two', 'three'])

# Add some non-numeric columns
df['four'] = 'bar'
df['five'] = df['one'] > 0

# Display the DataFrame
df

Now, let's reindex the DataFrame. The labels `b`, `d`, and `g` do not exist in the original index, so their rows are filled with missing values:

In [None]:
# Reindex to introduce missing values
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2

### 2.2 Detecting Missing Values

To detect missing values, we can use the `isna()` and `notna()` methods:

In [None]:
# Detect missing values
print("Missing values (isna):")
print(df2.isna())

# Detect non-missing values
print("\nNon-missing values (notna):")
print(df2.notna())

We can also check if any or all values in a Series or DataFrame are missing:

In [None]:
# Check if any values are missing in each column
print("Any missing values per column:")
print(df2.isna().any())

# Check if all values are missing in each column
print("\nAll missing values per column:")
print(df2.isna().all())

# Count the number of missing values in each column
print("\nCount of missing values per column:")
print(df2.isna().sum())
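
The same building blocks can answer row-level questions as well. The sketch below uses a small hand-written DataFrame (`frame` is a hypothetical example, separate from `df2`) to select the rows that contain at least one gap and to compute the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Small deterministic example, independent of df2
frame = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": ["a", "b", None, "d"],
})

# Rows that contain at least one missing value
rows_with_gaps = frame[frame.isna().any(axis=1)]
print(rows_with_gaps)

# Fraction of missing values per column (mean of a boolean mask)
missing_share = frame.isna().mean()
print(missing_share)
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0.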

### 2.3 Filling Missing Values

There are several ways to fill missing values in pandas:

In [None]:
# Fill missing values with a scalar value
print("Fill with scalar:")
print(df2['one'].fillna(0))

# Fill missing values with the mean of the column
print("\nFill with mean:")
print(df2['one'].fillna(df2['one'].mean()))
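
When different columns need different fill values, `fillna()` also accepts a dictionary keyed by column name. The following is a minimal sketch; the column names `price` and `label` and the `"unknown"` placeholder are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column
frame = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "label": ["a", None, "c"],
})

# Fill each column with its own value: the column mean for price,
# a placeholder string for label
filled = frame.fillna({"price": frame["price"].mean(), "label": "unknown"})
print(filled)
```

Columns that are not listed in the dictionary are left untouched.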

We can also fill missing values by propagating neighboring observations with forward fill (`ffill()`) or backward fill (`bfill()`). The older spelling `fillna(method='ffill')` is deprecated in recent pandas releases, so the dedicated methods are used here:

In [None]:
# Forward fill (propagate last valid observation forward)
print("Forward fill:")
print(df2.ffill())

# Backward fill (use next valid observation to fill gap)
print("\nBackward fill:")
print(df2.bfill())
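
Forward and backward fill can propagate a value across an arbitrarily long gap, which is not always desirable. As a sketch of one way to bound this, the `limit` argument caps how many consecutive missing values are filled. The Series `gappy` is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three missing values
gappy = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive gap in each direction
forward = gappy.ffill(limit=1)
backward = gappy.bfill(limit=1)

print(forward)
print(backward)
```

Values beyond the limit stay missing, so long gaps remain visible for later inspection.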

### 2.4 Dropping Missing Values

We can drop rows or columns with missing values using the `dropna()` method:

In [None]:
# Drop rows with any missing values
print("Drop rows with any missing values:")
print(df2.dropna())

# Drop rows where every value is missing
print("\nDrop rows with all missing values:")
print(df2.dropna(how='all'))

# Drop columns with any missing values
# (every column of df2 has gaps in rows b, d, and g, so no columns remain)
print("\nDrop columns with any missing values:")
print(df2.dropna(axis=1))
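
Between "any" and "all" there are two finer-grained options: `subset` restricts the check to chosen columns, and `thresh` keeps rows that have at least a given number of non-missing values. The sketch below uses a small hypothetical frame with columns `a`, `b`, and `c`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in different places
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [9.0, np.nan, 11.0, 12.0],
})

# Only consider column "a" when deciding which rows to drop
by_subset = frame.dropna(subset=["a"])
print(by_subset)

# Keep rows that have at least two non-missing values
by_thresh = frame.dropna(thresh=2)
print(by_thresh)
```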

### 2.5 Replacing Values

We can replace specific values in a DataFrame using the `replace()` method:

In [None]:
# Create a DataFrame with some values to replace
df = pd.DataFrame(np.random.randn(10, 2))

# Overwrite a random subset of rows (about half) with 1.5
df[np.random.rand(df.shape[0]) > 0.5] = 1.5
print("Original DataFrame:")
print(df)

# Replace 1.5 with NaN
print("\nReplace 1.5 with NaN:")
print(df.replace(1.5, np.nan))

We can replace multiple values at once by passing lists:

In [None]:
# Get the value at position (0, 0)
df00 = df.iloc[0, 0]

# Replace 1.5 with NaN and df00 with 'a'
print("Replace multiple values:")
print(df.replace([1.5, df00], [np.nan, 'a']))

# replace() returns a new DataFrame, so column 1 of the original df keeps its float dtype
print("\nData type of column 1:")
print(df[1].dtype)
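
A common use of `replace()` is converting sentinel values into proper missing values. The following sketch assumes a hypothetical dataset in which `-999.0` and `-1.0` stand for "no reading"; both the column names and the sentinels are illustrative. A dictionary maps each old value to its replacement, and a nested dictionary limits the replacement to one column.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999.0 and -1.0 mark missing measurements
frame = pd.DataFrame({
    "temp": [21.5, -999.0, 19.0],
    "humidity": [0.4, 0.5, -1.0],
})

# Map each sentinel to NaN across the whole frame
cleaned = frame.replace({-999.0: np.nan, -1.0: np.nan})
print(cleaned)

# Restrict the replacement to a single column with a nested dictionary
humidity_only = frame.replace({"humidity": {-1.0: np.nan}})
print(humidity_only)
```

In both cases the original `frame` is left unchanged, because `replace()` returns a new object.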

### 2.6 Missing Data Casting Rules and Indexing

When a reindexing operation introduces missing data, the Series is cast to a data type that can hold the missing values:

In [None]:
# Create a Series with random values
s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])
print("Original Series:")
print(s)

# Create a boolean Series
bool_series = s > 0
print("\nBoolean Series:")
print(bool_series)
print("Data type:", bool_series.dtype)

# Reindex the boolean Series to introduce missing values
crit = bool_series.reindex(list(range(8)))
print("\nReindexed Boolean Series:")
print(crit)
print("Data type:", crit.dtype)

Notice that the data type changed from `bool` to `object` when missing values were introduced. This is because boolean arrays in NumPy cannot store missing values.
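
This matters for indexing: a mask that contains missing values is no longer a clean boolean mask. One possible way to handle it, sketched below with fixed values in place of the random ones, is to convert the mask to the nullable `"boolean"` dtype and decide explicitly that unknown entries count as `False`. The names `values`, `full`, and `mask` are illustrative.

```python
import pandas as pd

# Fixed values so the result is reproducible
values = pd.Series([0.5, -1.0, 2.0, -0.3, 1.2], index=[0, 2, 4, 6, 7])

# Reindexing both the data and the mask introduces gaps at labels 1, 3, and 5
full = values.reindex(list(range(8)))
crit = (values > 0).reindex(list(range(8)))

# Convert to the nullable boolean dtype and treat unknown entries as False
mask = crit.astype("boolean").fillna(False)

selected = full.loc[mask]
print(selected)
```

Whether unknown entries should count as `False` or `True` depends on the analysis, so the fill value is a choice rather than a rule.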

Here's a summary of the casting rules when missing values are introduced:

| Data Type | Cast To |
|-----------|--------|
| integer   | float  |
| boolean   | object |
| float     | no cast |
| object    | no cast |
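
The table describes the default NumPy-backed dtypes. As a sketch of an alternative, pandas also offers nullable extension dtypes such as `"Int64"` and `"boolean"`, which can hold missing values without being cast. The Series `ints` below is a made-up example.

```python
import pandas as pd

# Hypothetical integer Series with gaps in its index
ints = pd.Series([1, 2, 3], index=[0, 2, 4])

# Default behavior: integers are cast to float to make room for NaN
default_cast = ints.reindex(range(5))
print(default_cast.dtype)

# Nullable dtypes keep their type and use <NA> for the gaps
nullable_ints = ints.astype("Int64").reindex(range(5))
nullable_bools = (ints > 1).astype("boolean").reindex(range(5))
print(nullable_ints.dtype, nullable_bools.dtype)
```

Using a nullable dtype trades some NumPy interoperability for a stable type, which can be worthwhile when integer or boolean semantics need to be preserved.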

##### Summary

In this notebook, we've explored:

1. Additional string methods in pandas
2. Working with missing data in pandas, including:
   - Detecting missing values with `isna()` and `notna()`
   - Filling missing values with `fillna()`
   - Dropping missing values with `dropna()`
   - Replacing values with `replace()`
3. Understanding the casting rules when missing values are introduced

These techniques are essential for data cleaning and preprocessing in pandas.