# Data Cleaning and Preparation

Data preparation tasks such as loading, cleaning, transforming, and rearranging are often reported to take up 80% or more of an analyst's time. pandas, together with built-in Python features, provides a flexible and fast set of tools for manipulating data into the right form.

## Index

* [Handling Missing Data](#handling-missing-data)
    * [Filtering Out Missing Data](#filtering-out-missing-data)
    * [Filling In Missing Data](#filling-in-missing-data)
* [Data Transformation](#data-transformation)
    * [Removing Duplicates](#removing-duplicates)
    * [Transforming Data Using a Function or Mapping](#trasforming-data-using-a-function-or-mapping)
    * [Replacing Values](#replacing-values)
    * [Renaming Axis Indexes](#renaming-axis-indexes)
    * [Discretization and Binning](#discretization-and-binning)

## Handling Missing Data

*NA handling object methods*

|Method|Description|
|---|---|
|dropna |Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.|
|fillna |Fill in missing data with some value or using an interpolation method such as "ffill" or "bfill".|
|isna |Return Boolean values indicating which values are missing/NA.|
|notna |Negation of isna; returns True for non-NA values and False for NA values.|

```python
import pandas as pd
import numpy as np

flo_data = pd.Series([1.2, -3.5, np.nan, 0])

print(f"Series \n{flo_data}")
print(f"\nMissing data: \n{flo_data.isna()}")

# Python's None is also treated as NA
flo_data = pd.Series([1.2, -3.5, np.nan, 0, None])

print(f"\nSeries \n{flo_data}")
print(f"\nMissing data: \n{flo_data.isna()}")
```

```text
Series
0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

Missing data:
0    False
1    False
2     True
3    False
dtype: bool

Series
0    1.2
1   -3.5
2    NaN
3    0.0
4    NaN
dtype: float64

Missing data:
0    False
1    False
2     True
3    False
4     True
dtype: bool
```


### Filtering Out Missing Data

```python
import pandas as pd
import numpy as np

flo_data = pd.Series([1.2, -3.5, np.nan, 0, None])

print(f"Series without NA values: \n{flo_data.dropna()}")
# Same result using a Boolean filter
print(f"\nUsing filter flo_data[flo_data.notna()]: \n{flo_data[flo_data.notna()]}")
```

```text
Series without NA values:
0    1.2
1   -3.5
3    0.0
dtype: float64

Using filter flo_data[flo_data.notna()]:
0    1.2
1   -3.5
3    0.0
dtype: float64
```


```python
import pandas as pd
import numpy as np

frame = pd.DataFrame([[2., 5.5, 3.],
                      [2., np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 5.5, 3.]])

print(f"DFrame: \n{frame}")

# By default, dropna() drops every row that contains at least one NA value
print(f"\nDFrame without rows containing NA values: \n{frame.dropna()}")
# how='all' drops only the rows in which every value is NA
print(f"\nDFrame dropping full NA rows: \n{frame.dropna(how='all')}")

# Add a column made up entirely of NA values
frame[4] = np.nan
print(f"DFrame with new column 4 with NA values: \n{frame}")
print(f"\nDropping full NA columns specifying axis:"
      f"\n{frame.dropna(axis='columns', how='all')}")
```

```text
DFrame:
     0    1    2
0  2.0  5.5  3.0
1  2.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  5.5  3.0

DFrame without rows containing NA values:
     0    1    2
0  2.0  5.5  3.0

DFrame dropping full NA rows:
     0    1    2
0  2.0  5.5  3.0
1  2.0  NaN  NaN
3  NaN  5.5  3.0
DFrame with new column 4 with NA values:
     0    1    2   4
0  2.0  5.5  3.0 NaN
1  2.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  5.5  3.0 NaN

Dropping full NA columns specifying axis:
     0    1    2
0  2.0  5.5  3.0
1  2.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  5.5  3.0
```


```python
# Keep only rows that have at least a given number of non-NA values
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

print(f"DataFrame: \n{df}")

print(f"\nDropping rows with NA values: \n{df.dropna()}")

# thresh=2 keeps a row only if it has at least 2 non-NA values
print(f"\nKeeping rows with at least 2 non-NA values: \n{df.dropna(thresh=2)}")
```

```text
DataFrame:
          0         1         2
0 -0.424858       NaN       NaN
1 -0.156521       NaN       NaN
2 -0.475313       NaN  0.014500
3 -0.299724       NaN -0.754694
4 -0.068748 -0.675605  0.069436
5 -0.223925 -0.006411  2.636000
6 -1.614529  1.470182 -1.130516

Dropping rows with NA values:
          0         1         2
4 -0.068748 -0.675605  0.069436
5 -0.223925 -0.006411  2.636000
6 -1.614529  1.470182 -1.130516

Keeping rows with at least 2 non-NA values:
          0         1         2
2 -0.475313       NaN  0.014500
3 -0.299724       NaN -0.754694
4 -0.068748 -0.675605  0.069436
5 -0.223925 -0.006411  2.636000
6 -1.614529  1.470182 -1.130516
```

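Because the example above uses random numbers, it can be hard to see exactly what `thresh` counts. The following is a minimal sketch on a small fixed frame (the name `demo` and its values are made up for illustration) showing that `thresh` refers to the number of non-NA values a row needs in order to be kept:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data, so the result does not depend on random numbers
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [np.nan, np.nan, 9.0],
})

# Count the non-NA values in each row: row 0 has 1, rows 1 and 2 have 2
non_na_per_row = demo.notna().sum(axis="columns")

# thresh=2 keeps only the rows with at least 2 non-NA values
kept = demo.dropna(thresh=2)

print(non_na_per_row.tolist())
print(kept.index.tolist())
```

Row 0 is dropped because it holds a single non-NA value, while rows 1 and 2 survive even though each still contains one NA.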

### Filling In Missing Data

Filling in missing data lets us keep rows that would otherwise be discarded along with their NA values. The `fillna()` method handles most cases: pass a scalar to use the same value everywhere, or a dictionary to use a different value per column. Older code fills each gap from neighboring cells in the same column with the *method='ffill'* argument, optionally capped by *limit*. That argument is deprecated in recent pandas releases; use the `ffill()` or `bfill()` methods instead.

*fillna function arguments*

|Argument|Description|
|---|---|
|value |Scalar value or dictionary-like object to use to fill missing values|
|method |Interpolation method: one of "bfill" (backward fill) or "ffill" (forward fill); default is None. Deprecated in recent pandas releases in favor of the `ffill()` and `bfill()` methods|
|axis |Axis to fill on ("index" or "columns"); default is axis="index"|
|limit |For forward and backward filling, maximum number of consecutive periods to fill|

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

print(f"DataFrame: \n{df}")

# Replace every NA value with the scalar 0
print(f"\nDataFrame filling NA with '0': \n{df.fillna(0)}")
```

```text
DataFrame:
          0         1         2
0 -0.260455       NaN       NaN
1 -1.361895       NaN       NaN
2  0.485779       NaN -0.932451
3 -0.977748       NaN  0.855924
4  0.173131  1.399493  0.354406
5  0.039476 -0.582290 -1.913124
6 -0.291814  0.146749 -0.044685

DataFrame filling NA with '0':
          0         1         2
0 -0.260455  0.000000  0.000000
1 -1.361895  0.000000  0.000000
2  0.485779  0.000000 -0.932451
3 -0.977748  0.000000  0.855924
4  0.173131  1.399493  0.354406
5  0.039476 -0.582290 -1.913124
6 -0.291814  0.146749 -0.044685
```


```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan

print(f"DataFrame: \n{df}")

# Deprecated form, kept for reference:
# print(f"\nDataFrame filling NA with values in the same column:"
#       f"\n{df.fillna(method='ffill', limit=2)}")

# Forward fill, propagating each value over at most 2 consecutive NA cells
print(f"\nDataFrame filling NA with values in the same column:"
      f"\n{df.ffill(limit=2)}")
```

```text
DataFrame:
          0         1         2
0  0.925019  0.332396  1.077413
1  0.360310  0.789294  0.749743
2  0.606949       NaN -0.234555
3 -0.404238       NaN  1.091170
4 -2.075037       NaN       NaN
5 -0.521436       NaN       NaN
6 -0.551778       NaN       NaN

DataFrame filling NA with values in the same column:
          0         1         2
0  0.925019  0.332396  1.077413
1  0.360310  0.789294  0.749743
2  0.606949  0.789294 -0.234555
3 -0.404238  0.789294  1.091170
4 -2.075037       NaN  1.091170
5 -0.521436       NaN  1.091170
6 -0.551778       NaN       NaN
```


```python
# Fill the NA values in a Series with the mean of the Series
import pandas as pd
import numpy as np

data = pd.Series([1., np.nan, 3.5, np.nan, 7])

print(f"\nOriginal Series: \n{data}")

# mean() skips NA values, so it is computed from 1.0, 3.5 and 7.0
print(f"\nFilling NA with 'mean': \n{data.fillna(data.mean())}")
```

```text

Original Series:
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

Filling NA with 'mean':
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
```

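The arguments table lists a dictionary-like `value` and backward filling, neither of which appears in the cells above. Here is a small sketch of both, using a made-up frame (`demo`, with columns `x` and `y`, is purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
demo = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 2.0, np.nan],
})

# A dictionary maps column labels to a fill value for that column
filled_by_column = demo.fillna({"x": 0.0, "y": -1.0})

# bfill() copies the next valid value backward within each column;
# an NA with no later valid value (last row of "y") stays NA
back_filled = demo.bfill()

print(filled_by_column)
print(back_filled)
```

Per-column fill values are handy when a sensible default differs between columns, for example 0 for a count and a column mean for a measurement.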

## Data Transformation

### Removing Duplicates

The DataFrame method `duplicated()` returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row). `drop_duplicates()` returns a DataFrame with the duplicated rows discarded. By default both methods compare all columns and keep the first observed row; pass a `subset` of columns to restrict the comparison, or `keep="last"` to keep the last observed row instead.

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4]
})

print(f"DataFrame: \n{frame}")

print(f"\nDuplicated rows: \n{frame.duplicated()}")

# Return a DataFrame without the duplicated rows
print(f"\nDropping Duplicates: \n{frame.drop_duplicates()}")

# Add column 'v1' with range(7)
frame["v1"] = range(7)

# Detect duplicates based only on the 'k1' column
print(f"\nNon-duplicated values based on 'k1' column:"
      f"\n{frame.drop_duplicates(subset=['k1'])}")

# Keep the last observed row of each duplicate group
print(f"\nDrop duplicates keeping last rows:"
      f"\n{frame.drop_duplicates(['k1', 'k2'], keep='last')}")
```

```text
DataFrame:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

Duplicated rows:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Dropping Duplicates:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

Non-duplicated values based on 'k1' column:
    k1  k2  v1
0  one   1   0
1  two   1   1

Drop duplicates keeping last rows:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
```

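Besides `"first"` and `"last"`, the `keep` argument also accepts `False`, which marks every member of a duplicate group. The sketch below compares the three options on a small made-up frame (`demo` is illustrative, not part of the example above):

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical, rows 1 and 3 are unique
demo = pd.DataFrame({
    "k1": ["one", "two", "one", "three"],
    "k2": [1, 4, 1, 4],
})

first_flags = demo.duplicated()             # default keep="first": flags row 2
last_flags = demo.duplicated(keep="last")   # flags row 0 instead
all_flags = demo.duplicated(keep=False)     # flags rows 0 and 2

# With keep=False, drop_duplicates removes every row that has a twin
only_unique = demo.drop_duplicates(keep=False)

print(first_flags.tolist())
print(last_flags.tolist())
print(all_flags.tolist())
print(only_unique.index.tolist())
```

`keep=False` is useful for inspecting all conflicting rows before deciding which copy to trust.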

### Trasforming Data Using a Function or Mapping

Next, we add a column whose values are derived from an existing column through a dictionary. The Series `map()` method accepts a function or a dictionary-like object containing the mapping and applies it to each value.

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame({
    "food": ["bacon", "pulled pork", "bacon",
             "pastrami", "corned beef", "bacon",
             "pastrami", "honey ham", "nova lox"],
    "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]
})

meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon"
}

print(f"DataFrame: \n{frame}")
print(f"\nDictionary, meat that corresponds to which animal:"
      f"\n{meat_to_animal}")

# Create the 'animal' column by mapping each 'food' value
# through the 'meat_to_animal' dictionary
frame["animal"] = frame["food"].map(meat_to_animal)

# Alternative using a function:
# def get_animal(x):
#     return meat_to_animal[x]
#
# frame["animal"] = frame["food"].map(get_animal)

print(f"\nUpdated DataFrame: \n{frame}")
```

```text
DataFrame:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Dictionary, meat that corresponds to which animal:
{'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}

Updated DataFrame:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
```

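The example above works because every food appears in the dictionary with exactly the same spelling. As a hedged sketch of what happens otherwise, the following uses a made-up Series (`food`, including a capitalized entry and an unmapped one) to show that a dictionary passed to `map()` yields NaN for keys it does not contain:

```python
import pandas as pd

# Hypothetical input: "Bacon" differs in case, "tofu" has no mapping
food = pd.Series(["Bacon", "pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Keys must match exactly, so "Bacon" and "tofu" both map to NaN
raw = food.map(meat_to_animal)

# Lowercasing first recovers "Bacon"; "tofu" is still unmapped
normalized = food.str.lower().map(meat_to_animal)

# fillna supplies a label for anything the mapping does not cover
labeled = normalized.fillna("unknown")

print(raw.isna().tolist())
print(labeled.tolist())
```

Note that a function doing a plain dictionary lookup, such as `meat_to_animal[x]`, would raise a `KeyError` for `"tofu"` instead of producing NaN.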

### Replacing Values

Sometimes sentinel values such as -999 represent missing data. `replace()` provides a simpler and more flexible way to modify such values than `map()`. Note that `data.replace()` is distinct from `data.str.replace()`, which performs element-wise string substitution.
```python
# Replace one value:
data.replace(-999, np.nan)
# Replace a list of values:
data.replace([-999, -1000], np.nan)
# Replace with different values:
data.replace([-999, -1000], [np.nan, 0])
# Replace with a dictionary:
data.replace({-999: np.nan, -1000: 0})
```
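
The snippet above assumes an existing `data` object. A minimal runnable sketch, with a made-up Series in which -999 and -1000 act as sentinel values, might look like this (`codes` is a second illustrative Series used only to contrast `str.replace`):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements where -999 and -1000 are sentinel values
data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Replace each sentinel with its own substitute in one call
cleaned = data.replace({-999: np.nan, -1000: 0})

# Series.str.replace works on substrings inside string values instead
codes = pd.Series(["a-1", "b-2"])
renamed = codes.str.replace("-", "_", regex=False)

print(cleaned.tolist())
print(renamed.tolist())
```

`replace()` matches whole values, while `str.replace()` edits part of each string, which is why the two are not interchangeable.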

### Renaming Axis Indexes



```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

# Function returning the first 4 characters in uppercase
def transform(x):
    return x[:4].upper()

print(f"DataFrame before changes: \n{data}")
# Modify the DataFrame in place by assigning a new index
data.index = data.index.map(transform)

# rename() returns a new object and leaves 'data' unchanged:
# index labels become title case, column labels become uppercase
data2 = data.rename(index=str.title, columns=str.upper)

print(f"\nDataFrame after changes: \n{data2}")

# Rename a subset of the labels with dictionaries
data3 = data.rename(index={"OHIO": "INDIANA"},
                    columns={"three": "peekaboo"})

print(f"\nDataFrame after renaming: \n{data3}")
```

```text
DataFrame before changes:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

DataFrame after changes:
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

DataFrame after renaming:
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
```


### Discretization and Binning

Continuous data is often discretized or otherwise separated into 'bins' for analysis.
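
One way to do this in pandas is `pd.cut`, which assigns each value to an interval defined by a list of bin edges. The sketch below is only an illustration: the `ages` sample, the bin edges, and the group labels are all made up.

```python
import pandas as pd

# Hypothetical ages to be grouped into four ranges
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
labels = ["youth", "young adult", "middle aged", "senior"]

# By default intervals are open on the left and closed on the right,
# so 25 falls in (18, 25] and is labeled "youth"
age_groups = pd.cut(ages, bins=bins, labels=labels)

# Count how many values landed in each bin
counts = pd.Series(age_groups).value_counts()

print(list(age_groups))
print(counts)
```

`pd.cut` returns a categorical object, so the bins keep their order and can be used directly for grouping or counting.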