In [1]:
import numpy as np
import pandas as pd

In [2]:
np.random.seed(0)

In [3]:
pd.options.mode.copy_on_write = True

# Data Cleaning

Real-world data can be pretty messy, so you’ll often spend a significant amount of time on data cleaning and rearranging.

Common data cleaning tasks include dealing with missing and duplicate data, applying simple transformations, casting types, and renaming columns or indexes.

# Handling Missing Data

Missing data is common in most datasets, and pandas makes it easy to handle: many methods account for missing values by default (descriptive statistics, for example, skip them).

In pandas, missing data representation might seem confusing at first, but it's generally effective for most real-world scenarios. The "sentinel values" used vary depending on the data type. Here are some examples:

- `numpy.nan` for NumPy data types. The downside of using NumPy data types is that introducing a missing value converts the original data type (integers, for example) to `np.float64` or `object`.
- `NaT` for NumPy `np.datetime64`, `np.timedelta64`, and `PeriodDtype`.
- `NA` for `StringDtype`, `Int64Dtype` (and other bit widths), `Float64Dtype` (and other bit widths), `BooleanDtype`, and `ArrowDtype`. These types retain the original data type. For typing applications, use `api.types.NAType`.

The built-in Python `None` value is also treated as `NA`.
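
As a minimal sketch of how these sentinels show up in practice (the series names below are illustrative and not part of the original notebook), the same missing input is stored differently depending on the dtype, while `isna` detects all of them:

```python
import numpy as np
import pandas as pd

# NumPy-backed: the integer input is upcast to float64 and None becomes NaN
float_s = pd.Series([1, None])

# Nullable extension dtype: the integer dtype is kept and None becomes pd.NA
int_s = pd.Series([1, None], dtype="Int64")

# Datetime data: the missing entry is stored as NaT
date_s = pd.Series(pd.to_datetime(["2024-01-01", None]))

for name, s in [("float", float_s), ("Int64", int_s), ("datetime", date_s)]:
    print(name, s.dtype, s.isna().tolist())
```

Whichever sentinel is used, `isna` and `notna` give a uniform way to locate the missing entries.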


In [4]:
s = pd.Series(["apple", np.nan, None, "avocado"])
s

0      apple
1        NaN
2       None
3    avocado
dtype: object

In [5]:
# To detect missing values,
# use the isna()
# or notna() methods

s.isna()

0    False
1     True
2     True
3    False
dtype: bool

In [6]:
s.notna()

0     True
1    False
2    False
3     True
dtype: bool

## Filtering out missing data

There are several ways to filter out missing data in pandas. While you can always use `pandas.isna` and Boolean indexing manually, `dropna` is particularly useful.

- `dropna`: Filters axis labels based on whether values for each label have missing data, with adjustable thresholds for how much missing data to tolerate.

For a `Series`, it returns a new Series containing only the non-null values and their index labels. For a `DataFrame`, you can choose to drop the rows or columns that are entirely `NA`, or those that contain any `NA` value.

Note that these functions return new objects by default and do not modify the original data. To change this behavior, use `inplace=True`.
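
The adjustable threshold mentioned above corresponds to the `thresh` parameter of `dropna`, which sets the minimum number of non-null values a row (or column) needs in order to be kept. A small sketch, using a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Illustrative data: row 0 has 3 non-null values, row 1 has 2, row 2 has 1
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 3.0, np.nan],
    "c": [4.0, 5.0, 6.0],
})

# Keep only the rows with at least 2 non-null values
kept = df.dropna(thresh=2)
print(kept)

# The original DataFrame is left unchanged
print(len(df), len(kept))
```

Here rows 0 and 1 survive, while row 2, with a single non-null value, is dropped.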

In [7]:
s = pd.Series(["apple", np.nan, None, "avocado"])
s

0      apple
1        NaN
2       None
3    avocado
dtype: object

In [8]:
# Drop NA values
# Equivalent to:
# s[s.notna()]

s.dropna()

0      apple
3    avocado
dtype: object

In [9]:
df = pd.DataFrame({
    'a': ['apple', 'acerola', 'avocado', 'acai'],
    'b': ['banana', None, 'blackberry', np.nan]
})
df

Unnamed: 0,a,b
0,apple,banana
1,acerola,
2,avocado,blackberry
3,acai,


In [10]:
# Drop NA values
# based on axis

df.dropna(axis='columns')

Unnamed: 0,a
0,apple
1,acerola
2,avocado
3,acai


In [11]:
# By default (how='any'), dropna removes the rows
# (or the columns, with axis='columns')
# that have at least one null
# With how='all', only fully null ones are dropped

df = pd.DataFrame({
    'a': ['apple', 'acerola', 'avocado', 'acai'],
    'b': ['banana', None, 'blackberry', np.nan],
    'c': [None, None, None, None],
})
df

Unnamed: 0,a,b,c
0,apple,banana,
1,acerola,,
2,avocado,blackberry,
3,acai,,


In [12]:
df.dropna(axis='columns', how='all')

Unnamed: 0,a,b
0,apple,banana
1,acerola,
2,avocado,blackberry
3,acai,


In [13]:
# You can also restrict the null check
# to a subset of columns (or of rows, when dropping columns)

df = pd.DataFrame({
    'a': ['apple', None, 'avocado', 'acai'],
    'b': [None, 'blueberry', 'blackberry', 'banana'],
    'c': ['cherry', 'coconut', None, None],
})
df

Unnamed: 0,a,b,c
0,apple,,cherry
1,,blueberry,coconut
2,avocado,blackberry,
3,acai,banana,


In [14]:
df.dropna(subset=['a', 'b'])

Unnamed: 0,a,b,c
2,avocado,blackberry,
3,acai,banana,


## Filling in missing data

Instead of filtering out missing data and potentially losing valuable information, you might want to replace NAs with some alternative value.

There are several ways to do this. Using `map` or `apply` is a valid approach, but `fillna` and `replace` are usually simpler and more direct.

- `fillna`: Fills in missing data with a specified value (a scalar, or a dict or Series giving one value per column). To propagate neighboring values instead, use the related `ffill` (forward fill) and `bfill` (backward fill) methods.
- `replace`: Replaces given values in the Series/DataFrame with other specified values wherever they occur. Unlike updating with `.loc` or `.iloc`, it does not require specifying the exact location.
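
As a brief sketch of the propagation-based fills (the series is made up for illustration), `bfill` copies the next valid value backward, and the `limit` parameter caps how many consecutive missing values are filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0])

# Backward fill: each gap takes the next valid value
back = s.bfill()
print(back.tolist())

# Forward fill, but fill at most one consecutive missing value per gap
limited = s.ffill(limit=1)
print(limited.tolist())
```

With `ffill`, the leading missing value stays missing because there is no earlier value to propagate, and with `limit=1` the second value of the two-element gap is also left unfilled.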

In [15]:
# Filling NA with a constant

s = pd.Series([1, 2, None, 4])
s.fillna(100)

0      1.0
1      2.0
2    100.0
3      4.0
dtype: float64

In [16]:
# Filling NA with ffill (propagates the last valid value forward)

s = pd.Series([1, 2, None, 4, None, None])
s.ffill()

0    1.0
1    2.0
2    2.0
3    4.0
4    4.0
5    4.0
dtype: float64

In [17]:
# Filling NA with median

s = pd.Series([1, 2, None, 4, None, None])
s.fillna(s.median())

0    1.0
1    2.0
2    2.0
3    4.0
4    2.0
5    2.0
dtype: float64

In [18]:
# Filling NA with a dict that maps each column to its fill value

df = pd.DataFrame(
    data=np.random.uniform(-1, 1, size=(3,3)),
    columns=['a', 'b', 'c']
)
df.iloc[:2, 0] = np.nan
df.iloc[2:, 2] = np.nan
df

Unnamed: 0,a,b,c
0,,0.430379,0.205527
1,,-0.15269,0.291788
2,-0.124826,0.783546,


In [19]:
df.fillna({'a': 0, 'b': 1, 'c': 2})

Unnamed: 0,a,b,c
0,0.0,0.430379,0.205527
1,0.0,-0.15269,0.291788
2,-0.124826,0.783546,2.0


In [20]:
# Replacing values

df = pd.DataFrame({
        'A': [1, 2, None, 4],
        'B': [None, 2, 3, None]
})
df

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,,3.0
3,4.0,


In [21]:
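# The None inputs above were stored as NaN (np.nan)
# in these float64 columns. As the output shows,
# the key None does not match NaN, so the missing
# values are not replaced; only the value 2 is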

df.replace({None: 0, 2: 100})

Unnamed: 0,A,B
0,1.0,
1,100.0,100.0
2,,3.0
3,4.0,
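
To actually replace the missing entries in a case like this, one option is to target `np.nan` explicitly (or to use `fillna`). A minimal sketch using the same data, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [None, 2, 3, None],
})

# Target NaN explicitly; for a constant fill this matches fillna(0)
via_replace = df.replace(np.nan, 0)
via_fillna = df.fillna(0)
print(via_replace.equals(via_fillna))

# Chain a second replace to also swap the value 2 for 100
cleaned = df.fillna(0).replace(2, 100)
print(cleaned)
```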


# Removing Duplicates

Handling duplicate records is another important and frequent step in data cleaning. Again, pandas offers several mechanisms to deal with duplicates in Series and DataFrames, with the most commonly used being:

- `duplicated`: Returns a Boolean Series indicating whether each row is a duplicate (i.e., its column values are exactly the same as those in an earlier row).
- `drop_duplicates`: Returns a DataFrame with duplicate rows removed, that is, it keeps only the rows where the `duplicated` array is `False`.

In [22]:
df = pd.DataFrame({
    'A': [1, 2, 2, 2, 1],
    'B': [5, 6, 6, 1, 1]
})
df

Unnamed: 0,A,B
0,1,5
1,2,6
2,2,6
3,2,1
4,1,1


In [23]:
# By default, duplicated marks duplicates as True
# except for the first occurrence.

df.duplicated()

0    False
1    False
2     True
3    False
4    False
dtype: bool

In [24]:
# Alternatively, keep='last' marks every occurrence
# except the last one, and keep=False
# marks all occurrences as duplicates

df.duplicated(keep=False)

0    False
1     True
2     True
3    False
4    False
dtype: bool

In [25]:
# We can also select a subset of columns

df.duplicated(subset=['B'])

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [26]:
# To drop duplicates directly,
# we can use drop_duplicates
#
# The behavior (parameters) is pretty similar
# to .duplicated

df = pd.DataFrame({
    'A': [1, 2, 2, 2, 1],
    'B': [5, 6, 6, 1, 1]
})
df.drop_duplicates()

Unnamed: 0,A,B
0,1,5
1,2,6
3,2,1
4,1,1
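
The relationship between the two methods, and the shared `subset` and `keep` parameters, can be sketched as follows (same data as above; the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 2, 2, 1],
    "B": [5, 6, 6, 1, 1],
})

# drop_duplicates keeps the rows that duplicated() marks as False
same = df[~df.duplicated()].equals(df.drop_duplicates())
print(same)

# Deduplicate on column B only, keeping the last row of each group
last_per_b = df.drop_duplicates(subset=["B"], keep="last")
print(last_per_b)
```

Note that the surviving rows keep their original index labels; if a fresh 0..n-1 index is wanted, `reset_index(drop=True)` can be chained afterwards.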


# Renaming Axis Indexes

Axis labels in a Series or DataFrame can also be transformed using a function or mapping (just like values), producing new, differently labeled objects. To modify the axes in place, you can simply assign to the `index` or `columns` attribute.

Otherwise, if you prefer to create a transformed version of a dataset without modifying the original, you can use the `rename` method. This method allows you to rename axis labels using a dictionary-like object, which provides new values for a subset of the axis labels. This saves you from manually copying the DataFrame and assigning new values to its index and columns attributes.

In [27]:
df = pd.DataFrame({
    'MY DirtXXl Alpha': [1, 2, 3],
    'Bxta-_': [4, 5, 6]
})
df.rename(columns={'MY DirtXXl Alpha': 'alpha', 'Bxta-_': 'beta'})

Unnamed: 0,alpha,beta
0,1,4
1,2,5
2,3,6
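
Besides a dictionary, `rename` also accepts a function that is applied to each label. The sketch below uses made-up column and index labels to show both a function-based rename and the in-place alternative of assigning to `columns`:

```python
import pandas as pd

df = pd.DataFrame({"Alpha ": [1, 2], "BETA": [3, 4]}, index=["x", "y"])

# Function-based rename: clean up column labels, upper-case index labels
renamed = df.rename(columns=lambda c: c.strip().lower(), index=str.upper)
print(list(renamed.columns), list(renamed.index))

# rename returned a new object, so the original labels are untouched
print(list(df.columns))

# In-place alternative: overwrite the columns attribute on a copy
df2 = df.copy()
df2.columns = ["a", "b"]
print(list(df2.columns))
```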


# Dropping Axis Indexes

There are various ways to drop one or more entries from an axis of a pandas object. For example, you can use `reindex`, label-based indexing, or the `del` keyword.

However, if you want a cleaner and more expressive alternative, you can use the `drop` method. It provides a straightforward way to remove entries and returns a new object with the specified value or values removed from the axis.

In [28]:
df = pd.DataFrame({
    'month': [1, 4, 7, 10],
    'year': [2012, 2014, 2013, 2014],
    'sale': [55, 40, 84, 31]
})
df

Unnamed: 0,month,year,sale
0,1,2012,55
1,4,2014,40
2,7,2013,84
3,10,2014,31


In [29]:
df.drop(columns=['year'])

Unnamed: 0,month,sale
0,1,55
1,4,40
2,7,84
3,10,31


In [30]:
df.drop(index=[0])

Unnamed: 0,month,year,sale
1,4,2014,40
2,7,2013,84
3,10,2014,31
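
To contrast `drop` with the `del` keyword mentioned earlier, here is a small sketch on the same data (the variable name `trimmed` is illustrative): `drop` returns a new object and can remove rows and columns in one call, whereas `del` removes a column from the DataFrame itself.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 4, 7, 10],
    "year": [2012, 2014, 2013, 2014],
    "sale": [55, 40, 84, 31],
})

# drop returns a new object; rows and columns can be removed together
trimmed = df.drop(index=[0], columns=["year"])
print(trimmed.shape)

# The original still has every row and column
print(df.shape)

# del modifies the DataFrame itself
del df["sale"]
print(list(df.columns))
```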


# Axis Indexes with Duplicate Labels

While many pandas functions, such as `reindex`, work only with unique axis labels, uniqueness isn't strictly required. You can have duplicate row or column labels, but they affect data selection: indexing a Series by a label that occurs several times returns a Series, while a label that occurs once returns a scalar value.

From a pragmatic perspective, index labels should be unique, and pandas provides tools to help ensure this. To check whether an index's labels are unique, use the `is_unique` property of the index. Alternatively, the `Index.duplicated()` method returns a Boolean ndarray indicating whether each label is a repeat, which can be used as a filter to remove duplicate rows.

For more sophisticated handling of duplicate labels, beyond simply removing them, `groupby()` on the index is a powerful technique. For example, you can resolve duplicates by averaging all rows that share the same label, which keeps information from every duplicate instead of discarding all but one.
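
The points above can be sketched on a small Series with one repeated label (the data and variable names are made up for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "b", "c"])

# The index contains a repeated label
print(s.index.is_unique)

# A unique label yields a scalar; a repeated label yields a Series
print(type(s["a"]), type(s["b"]))

# Option 1: keep only the first occurrence of each label
first_only = s[~s.index.duplicated(keep="first")]
print(first_only)

# Option 2: resolve duplicates by averaging values that share a label
averaged = s.groupby(level=0).mean()
print(averaged)
```

Filtering with `Index.duplicated()` discards the extra rows, while the `groupby` approach combines them, so the right choice depends on whether the duplicates carry distinct information.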

# References 

- [Python for Data Analysis by Wes McKinney (3e)](https://wesmckinney.com/book/)
- [Pandas Official Documentation](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Frequently Asked Questions (FAQ) on Pandas](https://pandas.pydata.org/docs/user_guide/gotchas.html)