# [Handling Missing Values in Pandas](#)

Missing values are a common occurrence in real-world datasets. They can arise due to various reasons such as data entry errors, equipment malfunctions, or simply because the information was not available at the time of data collection. Properly handling missing values is crucial for maintaining data integrity and ensuring accurate analyses.


<img src="../images/missing-values.png" width="800">

In Pandas, missing values are typically represented by `NaN` (Not a Number), a special value defined by the IEEE 754 floating-point standard. Pandas also recognizes `None` as a missing value in certain contexts, and in datetime data missing values are represented by `NaT` (Not a Time).

Understanding and effectively managing missing values is essential because:

1. **Data Quality**: Missing values can skew your analysis and lead to incorrect conclusions if not handled properly.

2. **Statistical Integrity**: Many statistical operations and machine learning algorithms cannot handle missing values directly.

3. **Performance**: Large numbers of missing values can impact the performance of your data processing operations.

4. **Insight Generation**: The pattern of missing values itself can sometimes provide valuable insights about your data.


In this lecture, we'll explore various aspects of missing values in Pandas, including:

- How to detect missing values
- The different types of missing values
- How missing values behave in various operations
- Strategies for handling missing values
- Best practices for dealing with missing data


Let's start by importing Pandas and creating a sample DataFrame with some missing values:


In [4]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5],
    'D': ['a', 'b', 'c', None, 'e']
})

df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1.0,a
1,2.0,,2.0,b
2,,,3.0,c
3,4.0,3.0,,
4,5.0,2.0,5.0,e


This DataFrame will serve as our example throughout this lecture as we explore various aspects of handling missing values in Pandas.


Understanding how to effectively manage missing values is a critical skill in data analysis and preprocessing. It allows you to clean and prepare your data properly, leading to more reliable and accurate results in your subsequent analyses.

## <a id='toc1_'></a>[Understanding Missing Data](#toc0_)

Missing data is a common challenge in data analysis across various fields. Before diving into specific techniques for handling missing values, it's crucial to understand the nature of missing data and its implications for your analysis.


### <a id='toc1_1_'></a>[Types of Missing Data](#toc0_)


Missing data typically falls into three categories:

1. **Missing Completely at Random (MCAR)**: 
   - The missingness is unrelated to any variables in the dataset.
   - Example: A survey respondent accidentally skips a question.
   - Implications: Least problematic; analysis remains unbiased.

2. **Missing at Random (MAR)**:
   - The missingness is related to other observed variables but not to the missing data itself.
   - Example: Men are less likely to answer questions about emotions in a survey.
   - Implications: Can be handled with appropriate imputation techniques.

3. **Missing Not at Random (MNAR)**:
   - The missingness is related to the missing values themselves.
   - Example: People with high incomes are less likely to report their income in surveys.
   - Implications: Most problematic; may lead to biased results if not handled carefully.
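
To make the difference between these mechanisms concrete, here is a minimal sketch using a small, made-up income column. The values, the removed positions, and the cutoff of 80 are all assumptions chosen for illustration. It compares the observed mean when entries are removed without regard to their size (MCAR-like) against the case where only the highest values go unreported (MNAR-like).

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (in thousands); not real data
income = pd.Series([20, 30, 40, 50, 60, 70, 80, 90, 100, 110], dtype=float)
true_mean = income.mean()

# MCAR-like: remove entries at fixed positions, unrelated to their value
mcar = income.copy()
mcar.iloc[[1, 4, 8]] = np.nan

# MNAR-like: the highest incomes are the ones that go unreported
mnar = income.where(income < 80)

print("true mean:", true_mean)
print("observed mean, MCAR-like:", mcar.mean())
print("observed mean, MNAR-like:", mnar.mean())
```

In this toy example the MCAR-like mean stays close to the true mean, while the MNAR-like mean is pulled well below it, which is the kind of bias described above.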


### <a id='toc1_2_'></a>[Impact of Missing Data](#toc0_)


Understanding the impact of missing data is crucial:

1. **Reduced Statistical Power**: Missing data can decrease the sample size, leading to less precise estimates.
2. **Bias**: Especially in MNAR cases, missing data can lead to biased results and incorrect conclusions.
3. **Complication of Analysis**: Many statistical methods are designed for complete datasets, and missing data can complicate their application.


### <a id='toc1_3_'></a>[Assessing Missing Data](#toc0_)


Before deciding how to handle missing data, it's important to assess its extent and pattern:

1. **Quantity**: Calculate the percentage of missing values for each variable.
2. **Pattern**: Visualize the pattern of missingness to identify any systematic issues.
3. **Relationships**: Examine if missingness in one variable is related to values in other variables.
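
As a rough sketch of the quantity and relationship checks, the example below uses a hypothetical survey table (the column names `age` and `income` and all values are invented for illustration). It computes the share of missing values per column and then compares the average age of respondents with and without a reported income.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; column names and values are illustrative only
survey = pd.DataFrame({
    "age": [23, 35, 47, 52, 61, 68],
    "income": [30.0, 42.0, np.nan, 55.0, np.nan, np.nan],
})

# Quantity: share of missing values per column, in percent
pct_missing = survey.isna().mean() * 100

# Relationships: mean age for rows where income is missing vs. present
age_by_missing = survey.groupby(survey["income"].isna())["age"].mean()

print(pct_missing)
print(age_by_missing)
```

A large gap between the two group means, as in this made-up data, hints that missingness in one variable depends on another observed variable.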


### <a id='toc1_4_'></a>[General Approaches to Handling Missing Data](#toc0_)


1. **Deletion Methods**:
   - Listwise deletion (complete case analysis)
   - Pairwise deletion

2. **Single Imputation Methods**:
   - Mean/median/mode imputation
   - Regression imputation
   - Last observation carried forward (for time series)

3. **Multiple Imputation**:
   - Creating multiple plausible imputed datasets
   - Analyzing each dataset and pooling results

4. **Model-Based Methods**:
   - Maximum Likelihood estimation
   - Expectation-Maximization algorithm

5. **Machine Learning Methods**:
   - K-Nearest Neighbors imputation
   - Decision tree-based methods


### <a id='toc1_5_'></a>[Considerations in Choosing a Method](#toc0_)


When deciding how to handle missing data, consider:

1. **The mechanism of missingness** (MCAR, MAR, MNAR)
2. **The amount of missing data**
3. **The type of variables** (categorical, continuous)
4. **The intended analysis** and its assumptions
5. **The computational resources** available


### <a id='toc1_6_'></a>[Best Practices](#toc0_)


1. **Always investigate** the nature and pattern of missing data before deciding on a method.
2. **Document your approach** thoroughly for transparency and reproducibility.
3. **Consider multiple methods** and compare their impact on your results.
4. **Consult domain experts** to understand potential reasons for missingness.
5. **Conduct sensitivity analyses** to assess the impact of your chosen method.
6. **Be transparent** about the presence and handling of missing data when reporting results.


Understanding the nature of missing data and carefully considering how to handle it are crucial steps in ensuring the validity and reliability of your data analysis. The methods and considerations discussed in this lecture will provide you with the tools to address missing data effectively in your work.

## <a id='toc2_'></a>[Values Considered "Missing"](#toc0_)

Pandas recognizes several types of values as "missing." Understanding these different types is crucial for effective data handling and analysis.


### <a id='toc2_1_'></a>[NaN (Not a Number)](#toc0_)


`NaN` (Not a Number) is the primary way Pandas represents missing or undefined values in numeric data. Because `NaN` is itself a floating-point value, an integer column that receives a `NaN` is converted to `float64` by default.

- `NaN` is part of the IEEE floating-point specification.
- It's represented by the `numpy.nan` object.
- `NaN` values are propagated in arithmetic operations.


In [5]:
import numpy as np
import pandas as pd

In [6]:
# Creating a Series with NaN values
s = pd.Series([1, 2, np.nan, 4, 5])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64

In [7]:
# Checking for NaN values
s.isna()

0    False
1    False
2     True
3    False
4    False
dtype: bool

### <a id='toc2_2_'></a>[None](#toc0_)


`None` is Python's built-in null object. Pandas also recognizes `None` as a missing value.

- When used in numeric data, `None` is typically converted to `NaN`.
- In object-dtype arrays, `None` is preserved as-is.


In [8]:
# Creating a Series with None values
s_none = pd.Series([1, 2, None, 4, 5])
s_none

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64

In [9]:
# None is converted to NaN in numeric Series
s_none.astype(float)

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64

In [10]:
# None is preserved in object-dtype Series
s_obj = pd.Series(['a', 'b', None, 'd'], dtype=object)
s_obj

0       a
1       b
2    None
3       d
dtype: object

> `None` and `NaN` sound and look similar, but they are quite different. `None` is Python's built-in null object, roughly equivalent to `NULL` in other languages, and is used to represent the absence of a value. It is not the same as `0`, `False`, or an empty string; it has a type of its own (`NoneType`), and only `None` is `None`. Missing values are usually `NaN` in numeric arrays and `None` in object arrays. Check for `None` with `foo is None` rather than `foo == None`; similarly, because `NaN` does not compare equal to itself, check for `NaN` with `pd.isna()` rather than `==`.

### <a id='toc2_3_'></a>[Custom Missing Value Indicators](#toc0_)


In some datasets, missing values might be represented by custom indicators like -999, 'N/A', or 'Unknown'. Pandas allows you to specify these as missing values during data import or through the `replace()` method.


In [11]:
# Creating a DataFrame with custom missing value indicators
df = pd.DataFrame({
    'A': [1, 2, -999, 4, 5],
    'B': ['a', 'N/A', 'c', 'd', 'Unknown']
})
df

Unnamed: 0,A,B
0,1,a
1,2,N/A
2,-999,c
3,4,d
4,5,Unknown


In [12]:
# Specifying custom missing values
df_cleaned = df.replace([-999, 'N/A', 'Unknown', ''], np.nan)
df_cleaned

Unnamed: 0,A,B
0,1.0,a
1,2.0,
2,,c
3,4.0,d
4,5.0,


In [13]:
# Using custom na_values during CSV import
# df = pd.read_csv('your_file.csv', na_values=[-999, 'N/A', 'Unknown'])
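
The import-time approach can be tried without a file on disk. The sketch below feeds a small, made-up CSV string to `pd.read_csv` through `io.StringIO`; the column names and values are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up CSV content with three different missing-value markers
csv_text = """id,score,grade
1,85,A
2,-999,N/A
3,90,Unknown
"""

# Treat -999, 'N/A', and 'Unknown' as missing while parsing
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=[-999, "N/A", "Unknown"])

print(df_csv)
print(df_csv.isna().sum())
```

Handling the markers at import time means the columns get appropriate dtypes from the start, instead of requiring a `replace()` pass afterwards.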

In datetime data, missing values are typically represented by `NaT` (Not a Time) in Pandas. This is similar to `NaN` but specific to datetime data.

In [14]:
pd.DatetimeIndex(["2017-07-05", "2017-07-06", None, "2017-07-08"])

DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'], dtype='datetime64[ns]', freq=None)
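
`NaT` also appears when date parsing fails. As a small sketch (the input strings are invented), `pd.to_datetime` with `errors="coerce"` turns unparseable entries into `NaT`, which `isna()` then reports as missing:

```python
import pandas as pd

# The second entry is not a valid date and the third is None
raw_dates = ["2017-07-05", "not a date", None]

# errors="coerce" converts anything unparseable to NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed)
print(parsed.isna())
```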

It's important to note that while Pandas recognizes these different types of missing values, it generally treats them similarly in most operations. However, there can be subtle differences in behavior, especially when working with different data types.

In [15]:
# pd.isna() treats NaN and None alike: both are reported as missing
pd.isna(np.nan) == pd.isna(None)

True

In [16]:
# But they're not identical
np.nan is None

False

In [17]:
# NaN doesn't equal itself
np.nan == np.nan

False

Understanding these different representations of missing values is crucial for correctly identifying and handling missing data in your datasets. It allows you to preprocess your data effectively, ensuring that all forms of missing values are properly accounted for in your analyses.

## <a id='toc3_'></a>[Detecting Missing Values](#toc0_)

Identifying missing values in your dataset is a crucial first step in data preprocessing. Pandas provides several methods to detect and quantify missing values in Series and DataFrames.


### <a id='toc3_1_'></a>[Using `.isnull()` and `.notnull()`](#toc0_)


Pandas offers two primary methods for detecting missing values: `.isnull()` and `.notnull()`. They are aliases of `.isna()` and `.notna()`, so the two spellings are interchangeable. Each returns a boolean mask indicating which values are missing (or not missing).


In [18]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5],
    'D': ['a', 'b', 'c', None, 'e']
})
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1.0,a
1,2.0,,2.0,b
2,,,3.0,c
3,4.0,3.0,,
4,5.0,2.0,5.0,e


In [19]:
# Using .isnull()
df.isnull()


Unnamed: 0,A,B,C,D
0,False,False,False,False
1,False,True,False,False
2,True,True,False,False
3,False,False,True,True
4,False,False,False,False


In [20]:
# Using .notnull()
df.notnull()

Unnamed: 0,A,B,C,D
0,True,True,True,True
1,True,False,True,True
2,False,False,True,True
3,True,True,False,False
4,True,True,True,True


These methods work on both Series and DataFrames:


In [21]:
# For a single column (Series)
df['B'].isnull()

0    False
1     True
2     True
3    False
4    False
Name: B, dtype: bool

In [22]:
# Checking for at least one missing value in each row
df.isnull().any(axis=1)

0    False
1     True
2     True
3     True
4    False
dtype: bool

In [23]:
# Checking for any missing values in the entire DataFrame
df.isnull().any().any()

True

### <a id='toc3_2_'></a>[Counting Missing Values](#toc0_)


To quantify missing values, you can use the `.isnull()` method in combination with `.sum()`.


In [24]:
# Count missing values in each column
df.isnull().sum()

A    1
B    2
C    1
D    1
dtype: int64

In [25]:
# Count missing values in each row
df.isnull().sum(axis=1)

0    0
1    1
2    2
3    2
4    0
dtype: int64

In [26]:
# Total number of missing values in the DataFrame
df.isnull().sum().sum()

5

In [27]:
# Percentage of missing values in each column
(df.isnull().sum() / len(df)) * 100

A    20.0
B    40.0
C    20.0
D    20.0
dtype: float64
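
The counts and percentages can be combined into a single summary table. The sketch below rebuilds the sample DataFrame so that it runs on its own; the column names `n_missing` and `pct_missing` are arbitrary choices. It relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [5, np.nan, np.nan, 3, 2],
    "C": [1, 2, 3, np.nan, 5],
    "D": ["a", "b", "c", None, "e"],
})

# One row per column: number and percentage of missing values
summary = pd.DataFrame({
    "n_missing": df_demo.isna().sum(),
    "pct_missing": df_demo.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(summary)
```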

You can also use these methods to filter your data:


In [28]:
# Rows with at least one missing value
df[df.isnull().any(axis=1)]

Unnamed: 0,A,B,C,D
1,2.0,,2.0,b
2,,,3.0,c
3,4.0,3.0,,


In [29]:
# Columns with at least one missing value
df.loc[:, df.isnull().any()]

Unnamed: 0,A,B,C,D
0,1.0,5.0,1.0,a
1,2.0,,2.0,b
2,,,3.0,c
3,4.0,3.0,,
4,5.0,2.0,5.0,e


For a more comprehensive summary of missing values, you can use the `.info()` method:


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      float64
 1   B       3 non-null      float64
 2   C       4 non-null      float64
 3   D       4 non-null      object 
dtypes: float64(3), object(1)
memory usage: 288.0+ bytes


This method provides an overview of your DataFrame, including the number of non-null values in each column.


Additionally, Pandas' `describe()` method excludes missing values by default when computing summary statistics:


In [31]:
df.describe()

Unnamed: 0,A,B,C
count,4.0,3.0,4.0
mean,3.0,3.333333,2.75
std,1.825742,1.527525,1.707825
min,1.0,2.0,1.0
25%,1.75,2.5,1.75
50%,3.0,3.0,2.5
75%,4.25,4.0,3.5
max,5.0,5.0,5.0


Understanding the extent and pattern of missing values in your dataset is crucial for deciding how to handle them. It can inform your choice of imputation method, help you identify potential issues in data collection, or guide your decision on whether to drop certain observations or variables.


Remember that different types of analyses may require different approaches to missing data. Always consider the nature of your data and the requirements of your analysis when deciding how to proceed with missing values.

## <a id='toc4_'></a>[NA Semantics and Behavior](#toc0_)

Understanding how Pandas handles missing values (NA - Not Available) in various operations is crucial for correct data manipulation and analysis. Let's explore the behavior of NA values in different contexts.


### <a id='toc4_1_'></a>[Propagation in Arithmetic and Comparison Operations](#toc0_)


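In arithmetic operations, NA values generally propagate: any arithmetic operation involving an NA value results in NA. Comparisons behave differently. With `NaN`, a comparison such as `>` or `==` returns `False` rather than NA.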


In [32]:
s = pd.Series([1, 2, np.nan, 4, 5])

In [33]:
# Arithmetic operations
s + 1

0    2.0
1    3.0
2    NaN
3    5.0
4    6.0
dtype: float64

In [34]:
s * 2

0     2.0
1     4.0
2     NaN
3     8.0
4    10.0
dtype: float64

In [35]:
s / 2

0    0.5
1    1.0
2    NaN
3    2.0
4    2.5
dtype: float64

In [36]:
# Comparison operations
s > 2

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [37]:
s == np.nan  # Note: this doesn't work as expected

0    False
1    False
2    False
3    False
4    False
dtype: bool

Note that comparing anything to `np.nan` using `==` always returns `False`. To check for NA values, use `pd.isna()` or the `.isnull()` method.


In [38]:
pd.isna(s)

0    False
1    False
2     True
3    False
4    False
dtype: bool

### <a id='toc4_2_'></a>[Logical Operations with Missing Values](#toc0_)



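With NumPy-backed data, `NaN` is not treated as "unknown" in logical operations: in the object-dtype Series below, `&` and `|` behave as if `NaN` were `False`. True three-valued logic (True, False, and Unknown/NA) applies only to the nullable `boolean` dtype, which represents missing entries with `pd.NA`.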


In [39]:
# Create two series with some NA values
a = pd.Series([True, False, np.nan])
b = pd.Series([False, True, np.nan])

In [40]:
# Logical AND
a & b

0    False
1    False
2    False
dtype: bool

In [41]:
# Logical OR
a | b

0     True
1     True
2    False
dtype: bool

In [42]:
# Logical NOT
# ~a  # raises a TypeError because the object-dtype Series contains a float NaN

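The rules for three-valued (Kleene) logic, which apply to the nullable `boolean` dtype rather than to the object-dtype Series above, are: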

- `True` or `Unknown` = `True`
- `False` or `Unknown` = `Unknown`
- `True` and `Unknown` = `Unknown`
- `False` and `Unknown` = `False`
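
Three-valued logic can be observed directly with pandas' nullable `boolean` dtype, which stores missing entries as `pd.NA`. The sketch below is a minimal illustration; the variable names are made up for this example.

```python
import pandas as pd

# Nullable boolean Series: the third entry is unknown (pd.NA)
flags = pd.Series([True, False, pd.NA], dtype="boolean")

or_true = flags | True     # anything or True   -> True
and_false = flags & False  # anything and False -> False
or_na = flags | pd.NA      # False or Unknown   -> Unknown
and_na = flags & pd.NA     # True and Unknown   -> Unknown
negated = ~flags           # not Unknown        -> Unknown

print(pd.DataFrame({
    "flags": flags,
    "flags | NA": or_na,
    "flags & NA": and_na,
    "~flags": negated,
}))
```

Unlike the object-dtype case, the result stays `<NA>` wherever the outcome genuinely depends on the unknown value.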


### <a id='toc4_3_'></a>[NA in a Boolean Context](#toc0_)


When NA values appear in a condition (as in filtering operations), the comparison evaluates to `False` for them, so missing entries never satisfy the condition.


In [43]:
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
df

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,
2,,6.0


In [44]:
# Masking with a condition: entries that fail it (including NaN) become NaN
df[df > 2]

Unnamed: 0,A,B
0,,4.0
1,,
2,,6.0


In [45]:
# Using .bool() method
s = pd.Series([True, False, np.nan])
# s.bool()  # This will raise a ValueError

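Note that `.bool()` only works on a Series or DataFrame holding exactly one boolean element, so the three-element Series above raises a `ValueError` regardless of the NA value. The method is also deprecated in recent pandas releases (2.1 and later). To reduce a Series that contains NA values to a single boolean, fill the NA values first and then use `.any()` or `.all()`.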
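
When a boolean mask itself contains missing entries, one explicit option is to decide how they should count before using the mask. The sketch below assumes a nullable `boolean` mask with made-up values and treats unknown entries as `False`. It also shows that `pd.NA` has no truth value of its own.

```python
import pandas as pd

values = pd.Series([10, 20, 30])

# Hypothetical mask where the second entry is unknown
mask = pd.Series([True, pd.NA, False], dtype="boolean")

# Decide explicitly that unknown means "do not select"
selected = values[mask.fillna(False)]

# pd.NA cannot be used directly in an if-statement
try:
    bool(pd.NA)
    na_is_ambiguous = False
except TypeError:
    na_is_ambiguous = True

print(selected)
print("bool(pd.NA) raises TypeError:", na_is_ambiguous)
```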


## <a id='toc5_'></a>[Inserting Missing Data](#toc0_)

There may be situations where you need to insert missing values into your dataset, either to represent unknown information or to prepare data for specific analyses. Let's explore how to insert missing data and understand the consequences of doing so.


### <a id='toc5_1_'></a>[Explicitly Inserting NaN or None](#toc0_)


Pandas allows you to insert missing values using either `np.nan` or `None`. Here are several ways to insert missing data:


In [46]:
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [47]:
# Insert NaN into a specific location
df.loc[1, 'A'] = np.nan
df

Unnamed: 0,A,B
0,1.0,4
1,,5
2,3.0,6


In [48]:
# Insert None into a specific location
df.loc[2, 'B'] = None
df

Unnamed: 0,A,B
0,1.0,4.0
1,,5.0
2,3.0,


In [49]:
# Add a new column with some missing values
df['C'] = [7, np.nan, 9]
df

Unnamed: 0,A,B,C
0,1.0,4.0,7.0
1,,5.0,
2,3.0,,9.0


In [50]:
# Create a column with all missing values
df['D'] = np.nan
df

Unnamed: 0,A,B,C,D
0,1.0,4.0,7.0,
1,,5.0,,
2,3.0,,9.0,


You can also insert missing values when creating a DataFrame:


In [51]:
# Create a DataFrame with missing values
df2 = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [4, 5, None],
    'C': [np.nan, np.nan, np.nan]
})

df2

Unnamed: 0,A,B,C
0,1.0,4.0,
1,,5.0,
2,3.0,,


### <a id='toc5_2_'></a>[Consequences of Inserting Missing Data](#toc0_)


Inserting missing data can have several consequences that you should be aware of:

1. **Data Type Changes**:
   Inserting `np.nan` into an integer column will cause the column to be converted to float.


In [52]:
# Create an integer Series
s = pd.Series([1, 2, 3], dtype=int)
s.dtype

dtype('int64')

In [53]:
# Insert np.nan
s[1] = np.nan
s.dtype  # Now float64

dtype('float64')
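
If keeping an integer dtype matters, pandas' nullable integer dtype `Int64` (capital "I") is one alternative worth knowing about. This is a brief sketch, not a recommendation for every case; it uses `pd.NA` as the missing marker.

```python
import pandas as pd

# Nullable integer dtype: note the capital "I" in "Int64"
s_int = pd.Series([1, 2, 3], dtype="Int64")

# Inserting a missing value does not convert the Series to float
s_int[1] = pd.NA

print(s_int)
print(s_int.dtype)
```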

2. **Impact on Calculations**:
   Missing values can affect calculations. Many Pandas methods exclude missing values by default.


In [54]:
# Mean calculation
df2['A'].mean()  # Excludes NaN by default

2.0

In [55]:
# Sum calculation
df2['A'].sum()  # Also excludes NaN

4.0
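
The default exclusion can be changed. As a small sketch with made-up values, `skipna=False` makes the result missing whenever any input is missing, and `min_count` guards against an all-missing Series silently summing to zero.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, np.nan, 3.0])

default_sum = s_demo.sum()             # NaN is skipped
strict_sum = s_demo.sum(skipna=False)  # NaN propagates to the result

all_missing = pd.Series([np.nan, np.nan])
empty_sum = all_missing.sum()                # sum of nothing is 0.0
guarded_sum = all_missing.sum(min_count=1)   # require at least one valid value

print(default_sum, strict_sum, empty_sum, guarded_sum)
```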

3. **Filtering Behavior**:
   Missing values are treated as False in boolean indexing.


In [56]:
df2[df2['A'] > 2]  # NaN rows are excluded

Unnamed: 0,A,B,C
2,3.0,,


4. **Performance**:
   Operations on data with missing values can be slower, as Pandas needs to check for and handle these values.


5. **Groupby Operations**:
   By default, group keys with missing values are excluded.


In [57]:
df2.groupby('A').sum()  # Group with NaN is excluded

A,B,C
1.0,4.0,0.0
3.0,0.0,0.0
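
To keep rows whose group key is missing, `groupby` accepts `dropna=False` (available since pandas 1.1). The sketch below uses a small stand-in DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": [1, np.nan, 3],
    "B": [4, 5, None],
})

# Default behaviour: the row with a missing key in 'A' is left out
default_groups = df_demo.groupby("A").sum()

# dropna=False keeps a separate group for the missing key
with_nan_group = df_demo.groupby("A", dropna=False).sum()

print(default_groups)
print(with_nan_group)
```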


6. **Concatenation and Merging**:
   When combining datasets, missing values can be introduced or propagated.


In [58]:
pd.concat([df, df2])

Unnamed: 0,A,B,C,D
0,1.0,4.0,7.0,
1,,5.0,,
2,3.0,,9.0,
0,1.0,4.0,,
1,,5.0,,
2,3.0,,,


7. **Plotting**:
   Missing values are typically excluded from plots, which can lead to discontinuities in line plots or gaps in bar charts.


It's important to be aware of these consequences when inserting missing data. Depending on your analysis requirements, you may need to handle these missing values explicitly or choose appropriate methods that can work with missing data.


Remember, while inserting missing values can be necessary to accurately represent your data, it's generally a good practice to minimize missing data where possible, as it can complicate analyses and reduce the statistical power of your dataset.

## <a id='toc6_'></a>[Handling Missing Values](#toc0_)

Once you've identified missing values in your dataset, you need to decide how to handle them. Pandas provides several methods for dealing with missing data, including dropping rows or columns with missing values, filling missing values with specific values, or using more advanced interpolation techniques.


### <a id='toc6_1_'></a>[Dropping Missing Values](#toc0_)


For a Series, `dropna()` removes all entries containing missing values.


In [59]:
# Create a sample Series
s = pd.Series([1, 2, np.nan, 4, 5, np.nan])
s


0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
5    NaN
dtype: float64

In [60]:
# Drop NA values
s.dropna()

0    1.0
1    2.0
3    4.0
4    5.0
dtype: float64

For DataFrames, `dropna()` is more flexible:


In [61]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 3],
    'C': [1, 2, 3, np.nan]
})
df

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,,2.0
2,,,3.0
3,4.0,3.0,


In [62]:
# Drop rows with any NA values
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1.0


In [63]:
# Drop rows where all values are NA
df.dropna(how='all')

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,,2.0
2,,,3.0
3,4.0,3.0,


In [64]:
# Drop columns with any NA values
df.dropna(axis=1)

0
1
2
3


In [65]:
# Keep only rows with at least 2 non-NA values (drop the rest)
df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,,2.0
3,4.0,3.0,
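
`dropna()` can also be restricted to particular columns with `subset`, which is useful when only some fields are required for an analysis. A minimal sketch on the same sample data, rebuilt here so the block is self-contained:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 3],
    "C": [1, 2, 3, np.nan],
})

# Drop a row only if column 'B' is missing; NaN in other columns is tolerated
required_b = df_demo.dropna(subset=["B"])

print(required_b)
```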


### <a id='toc6_2_'></a>[Filling Missing Values](#toc0_)


You can fill NA values with a constant:


In [66]:
# Fill NA with 0
df.fillna(0)

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,0.0,2.0
2,0.0,0.0,3.0
3,4.0,3.0,0.0


In [67]:
# Fill NA with different values for each column
df.fillna({'A': 0, 'B': 5, 'C': 10})

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,5.0,2.0
2,0.0,5.0,3.0
3,4.0,3.0,10.0


You can propagate the last valid observation forward or backward:


In [68]:
# Forward fill
df.ffill()

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,5.0,2.0
2,2.0,5.0,3.0
3,4.0,3.0,3.0


In [69]:
# Backward fill
df.bfill()


Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,3.0,2.0
2,4.0,3.0,3.0
3,4.0,3.0,


You can use calculated values, such as mean or median:


In [70]:
# Fill with mean of each column
df.fillna(df.mean())

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,4.0,2.0
2,2.333333,4.0,3.0
3,4.0,3.0,2.0


In [71]:
# Fill with median of each column
df.fillna(df.median())

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,4.0,2.0
2,2.0,4.0,3.0
3,4.0,3.0,2.0
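
A common refinement is to fill with a statistic computed within groups rather than over the whole column. The sketch below assumes a hypothetical `store` grouping column and invented `units` values, and combines `groupby().transform("mean")` with `fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data: two stores with very different volumes
sales = pd.DataFrame({
    "store": ["x", "x", "x", "y", "y", "y"],
    "units": [10.0, np.nan, 14.0, 100.0, 120.0, np.nan],
})

# Mean of each store, aligned row by row with the original frame
group_means = sales.groupby("store")["units"].transform("mean")

# Fill each gap with the mean of its own store
sales["units_filled"] = sales["units"].fillna(group_means)

print(sales)
```

In this example the overall mean (61.0) would be a poor guess for either store, while the per-store means (12.0 and 110.0) stay on the right scale.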


### <a id='toc6_3_'></a>[Interpolation Methods](#toc0_)


Pandas offers various interpolation methods for filling missing values:


In [72]:
# Create a Series with missing values
s = pd.Series([1, np.nan, 2, np.nan, 3, 4])
s

0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
5    4.0
dtype: float64

In [73]:
# Linear interpolation
s.interpolate()

0    1.0
1    1.5
2    2.0
3    2.5
4    3.0
5    4.0
dtype: float64

In [74]:
# Polynomial interpolation (requires SciPy to be installed)
s.interpolate(method='polynomial', order=2)

0    1.000000
1    1.529412
2    2.000000
3    2.411765
4    3.000000
5    4.000000
dtype: float64

In [75]:
# Time-based interpolation (for time series data)
time_series = pd.Series(
    [1, np.nan, 2, np.nan, 3],
    index=pd.to_datetime(['2020-01-01', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08'])
)
time_series.interpolate(method='time')

2020-01-01    1.0
2020-01-05    1.8
2020-01-06    2.0
2020-01-07    2.5
2020-01-08    3.0
dtype: float64
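
For long gaps, interpolating every value may be more than the data can support. As a sketch with made-up values, the `limit` and `limit_direction` parameters of `interpolate()` cap how many consecutive missing values are filled:

```python
import numpy as np
import pandas as pd

# A gap of three consecutive missing values
gap = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one value per gap, working forward from the last valid point
forward_only = gap.interpolate(limit=1)

# Fill at most one value from each side of the gap
both_sides = gap.interpolate(limit=1, limit_direction="both")

print(forward_only)
print(both_sides)
```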

When choosing a method to handle missing values, consider:

1. The nature of your data: Is it time series? Categorical? Numerical?
2. The amount of missing data: If too much data is missing, imputation might introduce bias.
3. The mechanism of missingness: Is the data missing completely at random, or is there a pattern?
4. The requirements of your subsequent analysis: Some models can handle missing data, while others cannot.


Remember, there's no one-size-fits-all approach to handling missing data. The best method depends on your specific dataset and analysis goals. It's often a good practice to try multiple approaches and compare their impact on your results.


In [76]:
# Example: Comparing different methods
methods = ['drop', 'fill_mean', 'ffill', 'interpolate']
results = {}

In [77]:
for method in methods:
    df_copy = df.copy()
    if method == 'drop':
        df_copy = df_copy.dropna()
    elif method == 'fill_mean':
        df_copy = df_copy.fillna(df_copy.mean())
    elif method == 'ffill':
        df_copy = df_copy.ffill()
    elif method == 'interpolate':
        df_copy = df_copy.interpolate()

    results[method] = df_copy.mean()

In [78]:
pd.DataFrame(results)

Unnamed: 0,drop,fill_mean,ffill,interpolate
A,1.0,2.333333,2.25,2.5
B,5.0,4.0,4.5,4.0
C,1.0,2.0,2.25,2.25


This comparison can help you understand how different missing value handling techniques affect your data summary statistics.

## <a id='toc7_'></a>[Advanced Techniques for Handling Missing Values](#toc0_)

As you become more proficient in data analysis, you may encounter situations where basic methods for handling missing values are insufficient. This section covers more advanced techniques that can provide more nuanced and effective ways to deal with missing data.


### <a id='toc7_1_'></a>[Using `replace()` for Custom Missing Value Handling](#toc0_)


The `replace()` method allows for more flexible handling of missing or unwanted values, including the ability to replace specific values or patterns.


In [79]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, -999, 4, 5],
    'B': ['a', 'N/A', 'c', 'Missing', 'e']
})
df

Unnamed: 0,A,B
0,1,a
1,2,N/A
2,-999,c
3,4,Missing
4,5,e


In [80]:
# Replace multiple values
df.replace([-999, 'N/A', 'Missing'], np.nan)

Unnamed: 0,A,B
0,1.0,a
1,2.0,
2,,c
3,4.0,
4,5.0,e


In [81]:
# Using a dictionary for column-specific replacements
df.replace({
    'A': {-999: np.nan},
    'B': {'N/A': np.nan, 'Missing': 'Unknown'}
})


Unnamed: 0,A,B
0,1.0,a
1,2.0,
2,,c
3,4.0,Unknown
4,5.0,e


You can also use regular expressions with `replace()`:


In [82]:
# Replace any string starting with 'Miss'
df['B'].replace('^Miss.*', np.nan, regex=True)

0      a
1    N/A
2      c
3    NaN
4      e
Name: B, dtype: object

### <a id='toc7_2_'></a>[Imputation Strategies](#toc0_)


Imputation involves replacing missing values with estimated ones. While simple imputation methods like mean or median filling are common, more sophisticated techniques can provide better estimates.


KNN imputation fills missing values using the mean of K nearest neighbors.


In [83]:
df = pd.DataFrame({
    'age': [2, 8, None, 25, 30, None],
    'height': [None, 1.2, None, 1.7, 1.8, 1.6],
    'weight': [10, 15, 18, 20, None, 30]
})
df

Unnamed: 0,age,height,weight
0,2.0,,10.0
1,8.0,1.2,15.0
2,,,18.0
3,25.0,1.7,20.0
4,30.0,1.8,
5,,1.6,30.0


In [84]:
from sklearn.impute import KNNImputer

# Create an imputer
imputer = KNNImputer(n_neighbors=2)

# Impute missing values
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

df_imputed

Unnamed: 0,age,height,weight
0,2.0,1.45,10.0
1,8.0,1.2,15.0
2,16.5,1.45,18.0
3,25.0,1.7,20.0
4,30.0,1.8,25.0
5,27.5,1.6,30.0


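Multiple imputation creates multiple complete datasets, analyzes each, and pools the results. Scikit-learn's `IterativeImputer`, used below, models each feature with missing values as a function of the other features. A single call like this one returns one imputed dataset, so it is a building block for multiple imputation rather than the full procedure.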


In [85]:
# IterativeImputer is experimental in scikit-learn; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Create an imputer
imputer = IterativeImputer(random_state=0)

# Impute missing values
df_multi_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

df_multi_imputed

Unnamed: 0,age,height,weight
0,2.0,1.404177,10.0
1,8.0,1.2,15.0
2,18.423256,1.516431,18.0
3,25.0,1.7,20.0
4,30.0,1.8,23.056747
5,45.14294,1.6,30.0


These advanced techniques offer more sophisticated ways to handle missing data, potentially leading to more accurate imputations and better preservation of data relationships and patterns. However, they also come with increased complexity and computational cost. 


When using these methods, it's crucial to:
1. Understand the assumptions behind each technique
2. Validate the imputed results to ensure they make sense in the context of your data
3. Consider the impact of imputation on your subsequent analyses
4. Document your imputation strategy clearly for transparency and reproducibility
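
One simple way to support validation and documentation is to record which values were imputed before filling them. The sketch below is only an illustration: the `age_was_missing` column name and the choice of median imputation are assumptions, not part of the methods shown above.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"age": [2.0, 8.0, np.nan, 25.0, 30.0, np.nan]})

# Record where values were missing before any imputation happens
people["age_was_missing"] = people["age"].isna()

# Impute with the median of the observed values
people["age"] = people["age"].fillna(people["age"].median())

print(people)
```

Keeping the indicator column lets later analyses check whether results change when imputed rows are excluded.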


Remember, the choice of imputation method can significantly affect your results, so it's often wise to compare multiple approaches and consider their impact on your specific analytical goals.