#### Part 22: Working with Missing Data in Pandas

In this notebook, we'll explore:
- Additional string methods in pandas
- Working with missing data (NA values)
- Handling, detecting, and replacing missing values

##### Setup
First, let's import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np

##### 1. Additional String Methods

Let's first look at some additional string methods that were not covered in the previous notebook:

In [2]:
# Create a sample Series
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', 'CABA', 'dog', 'cat'], dtype="string")

# Capitalize
print("Capitalize:")
print(s.str.capitalize())

# Find the lowest index of a substring (case-sensitive; -1 means not found)
print("\nFind 'a':")
print(s.str.find('a'))

# Check if strings are alphanumeric
print("\nIs alphanumeric:")
print(s.str.isalnum())

# Check if strings are alphabetic
print("\nIs alphabetic:")
print(s.str.isalpha())

Capitalize:
0       A
1       B
2       C
3    Aaba
4    Baca
5    Caba
6     Dog
7     Cat
dtype: string

Find 'a':
0    -1
1    -1
2    -1
3     1
4     1
5    -1
6    -1
7     1
dtype: Int64

Is alphanumeric:
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
dtype: boolean

Is alphabetic:
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
dtype: boolean
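
The sample Series above has no missing entries. As a minimal sketch (the Series `names` is made up for illustration), the `.str` methods on a `string`-dtype Series pass missing entries through as `<NA>` instead of raising an error:

```python
import pandas as pd

# Hypothetical data: the middle entry is missing
names = pd.Series(['Aaba', None, 'cat'], dtype="string")

# Missing entries stay missing in the results
positions = names.str.find('a')
is_alpha = names.str.isalpha()

print(positions)
print(is_alpha)
```

This is why the outputs above report the nullable `Int64` and `boolean` dtypes: both can hold a missing marker alongside regular values.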


##### 2. Working with Missing Data

###### 2.1 Values Considered "Missing"

In pandas, several values are treated as missing:
- `NaN` (Not a Number): Default missing value marker for computational speed and convenience
- `None`: Python's built-in null object
- `NaT` (Not a Time): Missing value for datetime data
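
As a quick check of the list above, the sketch below passes each marker to `pd.isna()`. The dictionary name `markers` is only illustrative. It also shows why an equality test is not a reliable way to find `NaN`:

```python
import numpy as np
import pandas as pd

markers = {"NaN": np.nan, "None": None, "NaT": pd.NaT}

# pd.isna() recognizes every one of these markers
detected = {name: bool(pd.isna(value)) for name, value in markers.items()}
print(detected)

# NaN compares unequal to itself, so == cannot be used to detect it
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)
```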

Let's create a DataFrame with some missing values:

In [3]:
# Create a DataFrame with random values
df = pd.DataFrame(np.random.randn(5, 3), 
                 index=['a', 'c', 'e', 'f', 'h'],
                 columns=['one', 'two', 'three'])

# Add some non-numeric columns
df['four'] = 'bar'
df['five'] = df['one'] > 0

# Display the DataFrame
df

        one       two     three four   five
a  1.447213  0.983156  0.673623  bar   True
c -0.755494  0.650344 -1.073701  bar  False
e  0.009665  0.692197  0.858147  bar   True
f  1.387295 -1.342553  0.538517  bar   True
h  0.458777 -1.429637 -0.407609  bar   True


Now, let's reindex the DataFrame with labels that are not in the original index (`b`, `d`, `g`). pandas fills the new rows with missing values:

In [4]:
# Reindex to introduce missing values
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2

        one       two     three four   five
a  1.447213  0.983156  0.673623  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -0.755494  0.650344 -1.073701  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.009665  0.692197  0.858147  bar   True
f  1.387295 -1.342553  0.538517  bar   True
g       NaN       NaN       NaN  NaN    NaN
h  0.458777 -1.429637 -0.407609  bar   True


###### 2.2 Detecting Missing Values

To detect missing values, we can use the `isna()` and `notna()` methods:

In [5]:
# Detect missing values
print("Missing values (isna):")
print(df2.isna())

# Detect non-missing values
print("\nNon-missing values (notna):")
print(df2.notna())

Missing values (isna):
     one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g   True   True   True   True   True
h  False  False  False  False  False

Non-missing values (notna):
     one    two  three   four   five
a   True   True   True   True   True
b  False  False  False  False  False
c   True   True   True   True   True
d  False  False  False  False  False
e   True   True   True   True   True
f   True   True   True   True   True
g  False  False  False  False  False
h   True   True   True   True   True


We can also check, column by column, whether any or all values are missing, and count how many are:

In [6]:
# Check if any values are missing in each column
print("Any missing values per column:")
print(df2.isna().any())

# Check if all values are missing in each column
print("\nAll missing values per column:")
print(df2.isna().all())

# Count the number of missing values in each column
print("\nCount of missing values per column:")
print(df2.isna().sum())

Any missing values per column:
one      True
two      True
three    True
four     True
five     True
dtype: bool

All missing values per column:
one      False
two      False
three    False
four     False
five     False
dtype: bool

Count of missing values per column:
one      3
two      3
three    3
four     3
five     3
dtype: int64
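
The same boolean mask supports other summaries. The sketch below uses a small made-up DataFrame (`frame`) with a known pattern of gaps to count missing values per row, compute the share of missing values per column, and get a single total:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a known pattern of gaps
frame = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, 3.0, 4.0],
})

# axis=1 sums across the columns, giving one count per row
missing_per_row = frame.isna().sum(axis=1)

# The mean of a boolean mask is the fraction of True values
missing_share = frame.isna().mean()

# Summing twice collapses the mask to a single total
total_missing = int(frame.isna().sum().sum())

print(missing_per_row)
print(missing_share)
print(total_missing)
```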


###### 2.3 Filling Missing Values

There are several ways to fill missing values in pandas:

In [7]:
# Fill missing values with a scalar value
print("Fill with scalar:")
print(df2['one'].fillna(0))

# Fill missing values with the mean of the column
print("\nFill with mean:")
print(df2['one'].fillna(df2['one'].mean()))

Fill with scalar:
a    1.447213
b    0.000000
c   -0.755494
d    0.000000
e    0.009665
f    1.387295
g    0.000000
h    0.458777
Name: one, dtype: float64

Fill with mean:
a    1.447213
b    0.509491
c   -0.755494
d    0.509491
e    0.009665
f    1.387295
g    0.509491
h    0.458777
Name: one, dtype: float64
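
When columns need different fill values, `fillna()` also accepts a dictionary keyed by column name. A minimal sketch, assuming a made-up frame with one numeric and one text column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one numeric and one text column, each with a gap
frame = pd.DataFrame({
    "count": [1.0, np.nan, 3.0],
    "label": ["a", None, "c"],
})

# Each column gets its own fill value; fillna returns a new DataFrame
filled = frame.fillna({"count": 0, "label": "unknown"})
print(filled)
```

Columns that are not listed in the dictionary are left untouched.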


We can also propagate neighboring values into the gaps with forward fill (`ffill()`) or backward fill (`bfill()`). Recent pandas releases deprecate the older `fillna(method=...)` spelling in favor of these dedicated methods:

In [8]:
# Forward fill (propagate last valid observation forward)
print("Forward fill:")
print(df2.ffill())

# Backward fill (use next valid observation to fill gap)
print("\nBackward fill:")
print(df2.bfill())

Forward fill:
        one       two     three four   five
a  1.447213  0.983156  0.673623  bar   True
b  1.447213  0.983156  0.673623  bar   True
c -0.755494  0.650344 -1.073701  bar  False
d -0.755494  0.650344 -1.073701  bar  False
e  0.009665  0.692197  0.858147  bar   True
f  1.387295 -1.342553  0.538517  bar   True
g  1.387295 -1.342553  0.538517  bar   True
h  0.458777 -1.429637 -0.407609  bar   True

Backward fill:
        one       two     three four   five
a  1.447213  0.983156  0.673623  bar   True
b -0.755494  0.650344 -1.073701  bar  False
c -0.755494  0.650344 -1.073701  bar  False
d  0.009665  0.692197  0.858147  bar   True
e  0.009665  0.692197  0.858147  bar   True
f  1.387295 -1.342553  0.538517  bar   True
g  0.458777 -1.429637 -0.407609  bar   True
h  0.458777 -1.429637 -0.407609  bar   True
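
Propagating a value across a long gap can hide how much data is actually missing. As a sketch on a made-up Series (`readings`), the `limit` parameter of `ffill()` and `bfill()` caps how many consecutive missing values get filled:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three missing values
readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one missing value after each valid observation
forward_limited = readings.ffill(limit=1)

# Fill at most one missing value before each valid observation
backward_limited = readings.bfill(limit=1)

print(forward_limited)
print(backward_limited)
```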


###### 2.4 Dropping Missing Values

We can drop rows or columns with missing values using the `dropna()` method:

In [9]:
# Drop rows with any missing values
print("Drop rows with any missing values:")
print(df2.dropna())

# Drop only rows in which every value is missing
# (same result as above here, because rows b, d and g are entirely missing)
print("\nDrop rows with all missing values:")
print(df2.dropna(how='all'))

# Drop columns with any missing values
print("\nDrop columns with any missing values:")
print(df2.dropna(axis=1))

Drop rows with any missing values:
        one       two     three four   five
a  1.447213  0.983156  0.673623  bar   True
c -0.755494  0.650344 -1.073701  bar  False
e  0.009665  0.692197  0.858147  bar   True
f  1.387295 -1.342553  0.538517  bar   True
h  0.458777 -1.429637 -0.407609  bar   True

Drop rows with all missing values:
        one       two     three four   five
a  1.447213  0.983156  0.673623  bar   True
c -0.755494  0.650344 -1.073701  bar  False
e  0.009665  0.692197  0.858147  bar   True
f  1.387295 -1.342553  0.538517  bar   True
h  0.458777 -1.429637 -0.407609  bar   True

Drop columns with any missing values:
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
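
Between "any" and "all" there are two finer controls: `thresh` keeps rows with at least a given number of non-missing values, and `subset` restricts the check to selected columns. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different numbers of gaps per row
frame = pd.DataFrame({
    "one": [1.0, np.nan, np.nan, 4.0],
    "two": [1.0, 2.0, np.nan, np.nan],
    "three": [1.0, 2.0, np.nan, 4.0],
})

# Keep rows that have at least two non-missing values
at_least_two = frame.dropna(thresh=2)

# Keep rows where column 'one' is present, ignoring gaps elsewhere
has_one = frame.dropna(subset=["one"])

print(at_least_two)
print(has_one)
```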


###### 2.5 Replacing Values

We can replace specific values in a DataFrame using the `replace()` method:

In [10]:
# Create a DataFrame with some values to replace
df = pd.DataFrame(np.random.randn(10, 2))

# Overwrite a random subset of rows (both columns) with 1.5
df[np.random.rand(df.shape[0]) > 0.5] = 1.5
print("Original DataFrame:")
print(df)

# Replace 1.5 with NaN
print("\nReplace 1.5 with NaN:")
print(df.replace(1.5, np.nan))

Original DataFrame:
          0         1
0  1.500000  1.500000
1 -0.495747  1.051746
2  1.500000  1.500000
3  1.500000  1.500000
4 -0.793576  0.176938
5  1.500000  1.500000
6  1.023569 -1.756491
7 -0.831668 -0.058708
8  1.500000  1.500000
9  1.713379  0.649588

Replace 1.5 with NaN:
          0         1
0       NaN       NaN
1 -0.495747  1.051746
2       NaN       NaN
3       NaN       NaN
4 -0.793576  0.176938
5       NaN       NaN
6  1.023569 -1.756491
7 -0.831668 -0.058708
8       NaN       NaN
9  1.713379  0.649588


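We can replace multiple values at once by passing two lists of equal length; each entry in the first list is replaced by the entry at the same position in the second. Note that `replace()` returns a new object and leaves `df` itself unchanged:

In [11]:
# Get the value at position (0, 0); in the run shown below it happens to be 1.5
df00 = df.iloc[0, 0]

# Replace 1.5 with NaN and df00 with 'a'
# (here both list entries target the same value, and the output shows 'a' in those cells)
print("Replace multiple values:")
print(df.replace([1.5, df00], [np.nan, 'a']))

# Check the data type of column 1 in the original df, which is still float64
print("\nData type of column 1:")
print(df[1].dtype)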

Replace multiple values:
          0         1
0         a         a
1 -0.495747  1.051746
2         a         a
3         a         a
4 -0.793576  0.176938
5         a         a
6  1.023569 -1.756491
7 -0.831668 -0.058708
8         a         a
9  1.713379  0.649588

Data type of column 1:
float64
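
A common use of `replace()` in data cleaning is turning a sentinel value into a real missing value. The sketch below assumes a made-up frame in which `-999.0` stands for "no reading". It shows a frame-wide replacement and a nested dictionary that limits the replacement to one column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: -999.0 is used as a "no reading" sentinel
readings = pd.DataFrame({
    "temp": [21.5, -999.0, 19.0],
    "flag": [-999.0, 1.0, 0.0],
})

# Replace the sentinel in every column
everywhere = readings.replace(-999.0, np.nan)

# Nested dict {column: {old: new}} replaces it in 'temp' only
temp_only = readings.replace({"temp": {-999.0: np.nan}})

print(everywhere)
print(temp_only)
```

After the replacement, `isna()`, `fillna()`, and `dropna()` treat those cells like any other missing value.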


###### 2.6 Missing Data Casting Rules and Indexing

When a reindexing operation introduces missing data, the Series is cast to a dtype that can hold missing values:

In [12]:
# Create a Series with random values
s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])
print("Original Series:")
print(s)

# Create a boolean Series
bool_series = s > 0
print("\nBoolean Series:")
print(bool_series)
print("Data type:", bool_series.dtype)

# Reindex the boolean Series to introduce missing values
crit = bool_series.reindex(list(range(8)))
print("\nReindexed Boolean Series:")
print(crit)
print("Data type:", crit.dtype)

Original Series:
0    0.322338
2   -0.801383
4   -1.920282
6   -2.393908
7   -1.368155
dtype: float64

Boolean Series:
0     True
2    False
4    False
6    False
7    False
dtype: bool
Data type: bool

Reindexed Boolean Series:
0     True
1      NaN
2    False
3      NaN
4    False
5      NaN
6    False
7    False
dtype: object
Data type: object


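Notice that the data type changed from `bool` to `object` when missing values were introduced. NumPy boolean arrays have no way to represent a missing value, so pandas falls back to the generic `object` dtype, which can hold `True`, `False`, and `NaN` side by side.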

Here's a summary of the casting rules when missing values are introduced:

| Data Type | Cast To |
|-----------|--------|
| integer   | float  |
| boolean   | object |
| float     | no cast |
| object    | no cast |
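
The table describes the default NumPy-backed dtypes. As a sketch with made-up Series, pandas' nullable extension dtypes (`Int64`, `boolean`) can hold a missing marker directly, so reindexing leaves their dtype unchanged:

```python
import pandas as pd

# Hypothetical integer Series with gaps in its index
ints = pd.Series([1, 2, 3], index=[0, 2, 4])

# Default behavior: integers are cast to float to make room for NaN
as_float = ints.reindex(range(5))

# Nullable integer dtype: stays Int64, gaps shown as <NA>
as_nullable_int = ints.astype("Int64").reindex(range(5))

# Nullable boolean dtype: stays boolean instead of becoming object
bools = pd.Series([True, False, True], index=[0, 2, 4])
as_nullable_bool = bools.astype("boolean").reindex(range(5))

print(as_float.dtype, as_nullable_int.dtype, as_nullable_bool.dtype)
```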

##### Summary

In this notebook, we've explored:

1. Additional string methods in pandas
2. Working with missing data in pandas, including:
   - Detecting missing values with `isna()` and `notna()`
   - Filling missing values with `fillna()`
   - Dropping missing values with `dropna()`
   - Replacing values with `replace()`
3. Understanding the casting rules when missing values are introduced

These techniques are essential for data cleaning and preprocessing in pandas.