<a href="https://colab.research.google.com/github/brendenwest/cis276/blob/main/6_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning

Techniques for cleaning & preparing data

### Reading
- Murach's, Chapter 6, 7
- https://wesmckinney.com/book/data-cleaning

### Learning Outcomes
- how to find & fix missing values
- how to simplify your data
- how to fix data-type problems
- how to work with indexes
- how to apply functions


### Data Cleaning

Cleaning data is a crucial and often time-consuming step in data science.

Data scientists might use pure Python, pandas, or other programming tools for this step. The examples here focus on pandas, with a few other approaches for specific scenarios.

Common tasks are:

*   Handling missing data
*   Simplifying data
*   Data-type conversion



#### Handling Missing Data

Often data analysts need to account for missing data values.

pandas uses the floating-point value `NaN` (Not a Number) to represent missing numeric data. This is a sentinel value that can be easily detected.

The built-in Python `None` value is also treated as NaN.

pandas has several methods for detecting NaN values in a Series or DataFrame:
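- `isnull` (alias `isna`) - returns `True` where a value is missing
- `notnull` (alias `notna`) - returns `True` where a value is present

These methods return boolean masks, which can be used as filters in a data query.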

`data[data.notnull()]`

Alternatively, programs can use `dropna` to drop the rows or columns (axis labels) that contain missing data.

The `how` and `thresh` options of `dropna` control how much missing data a row or column must contain before it is dropped.
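
As a minimal sketch of detecting and dropping missing values, assuming a small made-up DataFrame (the column names and values below are hypothetical, not from the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values in both columns
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "score": [90.0, np.nan, np.nan, 75.0],
})

missing_per_column = df.isnull().sum()   # number of missing values in each column
scored = df[df["score"].notnull()]       # keep only rows where score is present

complete_rows = df.dropna()              # drop rows that have any missing value
not_empty_rows = df.dropna(how="all")    # drop rows only if every value is missing
two_or_more = df.dropna(thresh=2)        # keep rows with at least 2 non-missing values

print(missing_per_column)
print(not_empty_rows)
```

Note that `thresh` counts the non-missing values a row must have to be kept, not the number of missing values.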

**Replacing missing values**

Sometimes it's more useful to replace missing data with a fixed or computed value (for example, a column mean), using `fillna`.

```
df_2 = df.fillna(-1) # return new dataframe with -1 in place of missing values
df.fillna(-1, inplace=True) # fill missing values with -1 in original dataframe
```


#### Simplifying Data

- **removing duplicates** - DataFrames have built-in methods to identify which rows are `duplicated` and to `drop_duplicates`. By default, these methods consider all columns, but programs can specify a subset.

- **replacing values** - the `replace()` method is a simple approach for replacing values in a pandas object.

- **handling outliers** - programs may want to find & replace or filter values that exceed some threshold.
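
The three techniques above might look like this in practice. This is only a sketch: the data, the `-999` placeholder, and the threshold of 100 are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one repeated row, a -999 placeholder, one extreme value
df = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Tacoma", "Tacoma", "Olympia"],
    "temp": [55, 55, -999, 58, 250],
})

dup_flags = df.duplicated()                         # True for rows repeating an earlier row
deduped = df.drop_duplicates()                      # compares all columns
one_per_city = df.drop_duplicates(subset=["city"])  # compares only the city column

# Replace the placeholder with NaN so pandas treats it as missing
temps = deduped["temp"].replace(-999, np.nan)

outliers = temps[temps > 100]      # find values above the (assumed) threshold
capped = temps.clip(upper=100)     # or cap them at the threshold instead

print(deduped)
print(capped)
```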

#### String Conversion

Python has a wide range of built-in string methods. Some common ones are:
  - **split** - break a string into a list of substrings based on a delimiter
  - **lower** - convert a string to lower case
  - **upper** - convert a string to upper case
  - **join** - combine a sequence of strings, using the string as the delimiter
  - **index** - return the position where a substring is first found; raises `ValueError` if it is not found
  - **find** - return the position where a substring is first found, or -1 if it is not found
  - **count** - number of non-overlapping occurrences of a substring in a string
  - **replace** - substitute occurrences of one substring with another.
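
A short sketch of these methods on a made-up string (the sample text is an assumption for illustration):

```python
line = "Apple, banana ,  Cherry"

# split on the delimiter, then strip surrounding whitespace from each piece
parts = [p.strip() for p in line.split(",")]   # ['Apple', 'banana', 'Cherry']
lowered = [p.lower() for p in parts]           # ['apple', 'banana', 'cherry']
joined = "|".join(lowered)                     # 'apple|banana|cherry'

found_at = joined.find("banana")       # 6
not_found = joined.find("kiwi")        # -1 (index() would raise ValueError here)
cherry_at = joined.index("cherry")     # 13
a_count = joined.count("a")            # 4
spaced = joined.replace("|", ", ")     # 'apple, banana, cherry'

print(parts, joined, spaced)
```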


**Regular Expressions** provide a (mostly) language-agnostic logical syntax for finding/matching string patterns in text.

`regex` patterns can be applied to strings with Python's [re module](https://docs.python.org/3/library/re.html).



In [None]:
import re
text = "foo    bar\t baz  \tqux"

# inline regex pattern (raw string, so the backslash reaches the regex engine unchanged)
re.split(r'\s+', text)

# reusable compiled regex object
regex = re.compile(r'\s+')
regex.split(text)

### Hierarchical Indexing

- allows multiple index (MultiIndex) levels on an axis
- either axis of a DataFrame can have a hierarchical index
- partial indexing allows concise selection of data subsets
- selections can be made on the `outer` or `inner` level of a MultiIndex
- can be used to reshape data (using `stack` & `unstack`)
- can be used for group-based operations
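
The selection and reshaping points above can be sketched as follows, assuming a small hypothetical Series with a two-level index:

```python
import pandas as pd

# Hypothetical Series indexed by (letter, number)
data = pd.Series(
    [10, 20, 30, 40, 50, 60],
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

outer = data["b"]          # partial indexing on the outer level
inner = data.loc[:, 2]     # selection on the inner level
wide = data.unstack()      # inner level becomes the columns of a DataFrame

print(outer)
print(inner)
print(wide)
```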


#### Creating hierarchical indexes

A hierarchical index can be assigned to a dataset by passing a list of lists (or arrays) as the `index` value, with one inner list per level and each inner list having the same length as the dataset axis. This creates a `MultiIndex` object.

In [None]:
import pandas as pd
import numpy as np
data = pd.Series(
    np.random.randn(9),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)
print(data)
print(data.index)

a  1   -0.605719
   2    0.123044
   3    1.811644
b  1    0.823819
   3    0.385486
c  1    0.327702
   2   -0.159786
d  2    0.869689
   3   -1.515328
dtype: float64
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )


#### Reordering & Sorting Levels

- `swaplevel` interchanges two levels of a MultiIndex (the data itself is unchanged)
- `sort_index` sorts by all index levels by default; pass `level` to sort primarily by a single level
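
A minimal sketch of both methods, assuming a hypothetical Series whose index levels are named `key1` and `key2`:

```python
import pandas as pd

# Hypothetical Series with a deliberately unsorted two-level index
data = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_tuples(
        [("b", 2), ("a", 2), ("b", 1), ("a", 1)], names=["key1", "key2"]
    ),
)

swapped = data.swaplevel("key1", "key2")   # index tuples become (key2, key1)
by_all = data.sort_index()                 # sort by key1, then key2
by_inner = data.sort_index(level="key2")   # sort by key2 first, then key1

print(swapped)
print(by_all)
print(by_inner)
```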

#### Summary Statistics by Level

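- Descriptive and summary statistics can be aggregated by index level on a particular axis. In current pandas this is done by grouping on the level, for example `df.groupby(level='key').sum()`; the `level` argument that older versions accepted directly on methods such as `sum` was removed in pandas 2.0.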
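
One way to aggregate by index level in current pandas is `groupby(level=...)`. The sketch below assumes a hypothetical frame indexed by `region` and `year`:

```python
import pandas as pd

# Hypothetical sales data with a two-level (region, year) index
frame = pd.DataFrame(
    {"sales": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [("east", 2023), ("east", 2024), ("west", 2023), ("west", 2024)],
        names=["region", "year"],
    ),
)

by_region = frame.groupby(level="region").sum()   # total sales per region
by_year = frame.groupby(level="year").mean()      # average sales per year

print(by_region)
print(by_year)
```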

#### Indexing with DataFrame columns

- DataFrame's `set_index` method creates a new DataFrame that uses one or more of its columns as the index
- index columns are removed from the DataFrame's columns unless `drop=False` is passed
- `reset_index` does the opposite: it moves the index levels back into columns
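
A short sketch of the round trip, assuming a hypothetical frame with `state`, `year`, and `population` columns:

```python
import pandas as pd

# Hypothetical data; column names and values are for illustration only
frame = pd.DataFrame({
    "state": ["WA", "WA", "OR"],
    "year": [2023, 2024, 2023],
    "population": [7.8, 7.9, 4.2],
})

indexed = frame.set_index(["state", "year"])           # state and year leave the columns
kept = frame.set_index(["state", "year"], drop=False)  # state and year stay as columns too
flat = indexed.reset_index()                           # index levels become columns again

print(indexed)
print(flat)
```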

### Data Aggregation

An aggregation is any data transformation that produces a scalar value from an array, such as `sum`, `mean`, or `count`.

Programs can use built-in optimized aggregation methods, or custom functions.

Programs can pass any function that aggregates an array to the `aggregate` or `agg` method of a GroupBy object.

Custom aggregation functions are generally much slower than the built-in optimized functions.
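
As a sketch of the difference between a built-in and a custom aggregation, assuming a hypothetical `team`/`points` dataset and a user-defined `peak_to_peak` function:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "points": [3, 7, 2, 10, 6],
})
grouped = df.groupby("team")["points"]

built_in = grouped.mean()    # optimized built-in aggregation


def peak_to_peak(values):
    """Custom aggregation: range (max minus min) of the values in one group."""
    return values.max() - values.min()


custom = grouped.agg(peak_to_peak)   # any function that reduces an array to a scalar

print(built_in)
print(custom)
```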

#### Column-wise & Multiple-function application

Providing a list of functions or function names results in a DataFrame with column names taken from the functions. Programs can override the default column names by passing `(name, function)` tuples:

`grouped_pct.agg([('foo', 'mean'), ('bar', np.std), peak_to_peak])`

You can specify a list of functions to apply to all of the columns, or pass a dict that maps column names to functions to apply different functions per column.
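
The sketch below shows the three forms side by side. The data, the column names, and the `peak_to_peak` helper are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "points": [3, 7, 2, 10],
    "fouls": [1, 3, 0, 2],
})
grouped = df.groupby("team")


def peak_to_peak(values):
    """Custom aggregation: range (max minus min) of the values in one group."""
    return values.max() - values.min()


# A list of functions: result columns are named after the functions
multi = grouped["points"].agg(["mean", "max", peak_to_peak])

# (name, function) tuples: result columns use the given names
renamed = grouped["points"].agg([("avg", "mean"), ("spread", peak_to_peak)])

# A dict: different functions per column (result has hierarchical columns)
per_column = grouped.agg({"points": "mean", "fouls": ["min", "max"]})

print(multi)
print(renamed)
print(per_column)
```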

#### Aggregated data without row indexes

By default, the aggregated data result has an index, potentially hierarchical, composed from the unique group key combinations. You can disable this behavior in most cases by passing `as_index=False` to groupby.
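
A minimal sketch comparing the two behaviors, assuming a hypothetical dataset grouped by two keys:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "team": ["x", "x", "y"],
    "year": [1, 2, 1],
    "points": [3, 7, 2],
})

# Default: the group keys form a hierarchical (team, year) index
indexed = df.groupby(["team", "year"]).sum()

# as_index=False: the group keys stay as ordinary columns
flat = df.groupby(["team", "year"], as_index=False).sum()

print(indexed)
print(flat)
```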