<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Intro to Data Representation and Data Cleaning

_Authors: Dave Yerrington (SF)_

---

<img src="https://snag.gy/ywU34V.jpg" width="250">

### Learning Objectives
*After this lesson, you will be able to:*
- Inspect data types.
- Clean up a column using `df.apply()`.
- Recognize situations in which to use `.value_counts()` in your code.

### Lesson Guide

- [Common Data Cleaning Strategies](#common_strategies)
- [Data Quality Measures](#data_quality_measures)
- [`pandas` Tools for Cleaning Data](#cleaning_tools)
- [Common Operations on Data by Type](#common_operations)
- [Guided Practice: Inspecting Data Types and Applying Functions](#guided_practice)
- [Independent Practice: Sales Data](#independent_pratice)


<a id='common_strategies'></a>

### Common Data Cleaning Strategies

---

 - Remove missing values.
 - Remove incorrect values.
 - Update incorrect values.
  - Removing invalid characters.
  - Truncating part of a value.
  - Adding extra numeric or string-based data.
 - Impute missing or invalid data.
  - Calculating the mean/median/mode of a column, sometimes within group subsets.
  - Implementing model-based imputation (K-Nearest Neighbors, MICE, etc.).
 - Backfill or forward fill.
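
Several of the strategies above map onto a handful of `pandas` calls. The following is a minimal sketch on made-up data; the `raw` DataFrame and its `city` and `price` columns are illustrative assumptions, not part of the lesson's data set.

```python
import pandas as pd

# Hypothetical toy data: one missing price and one price with a stray '$'.
raw = pd.DataFrame({
    'city': ['SF', 'SF', 'NYC', 'NYC'],
    'price': ['10', '$12', None, '20'],
})

# Update incorrect values: strip the invalid character, then convert to numbers.
raw['price'] = pd.to_numeric(raw['price'].str.replace('$', '', regex=False))

# Remove missing values: drop any row that contains a null.
dropped = raw.dropna()

# Impute with the mean of the whole column.
mean_filled = raw['price'].fillna(raw['price'].mean())

# Impute within group subsets: use the mean price of each city.
group_filled = raw['price'].fillna(raw.groupby('city')['price'].transform('mean'))

# Forward fill: carry the last valid value forward into the gap.
forward_filled = raw['price'].ffill()
```

Which of these is appropriate depends on why the values are missing; dropping rows is the simplest option but discards the other columns in those rows as well.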


<a id='data_quality_measures'></a>

### Measures of Data Quality

---

 - What is the relative value of the data column?
 - Is the data encoded properly?
 - Is the data consistently encoded? Does it represent the information it contains appropriately?
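
One rough way to put numbers on the encoding questions above is to try parsing a column and measure how much of it fails. This is only a sketch; the `ages` Series is a made-up example.

```python
import pandas as pd

# Hypothetical column: ages stored as text, with one unparseable entry.
ages = pd.Series(['34', '27', 'unknown', '41'])

# Encoded properly? A text dtype on a numeric quantity is a warning sign.
print(ages.dtype)

# Consistently encoded? Coerce to numbers; entries that fail become NaN.
parsed = pd.to_numeric(ages, errors='coerce')

# Share of entries that could not be parsed as numbers.
bad_share = parsed.isnull().mean()
print(bad_share)
```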

<a id='cleaning_tools'></a>

### `pandas` Tools for Cleaning Data

---

We're starting to get more comfortable with using `pandas` for manipulating and examining data. Now, let's add a couple more tools to our toolbox.

The main data types in `pandas` objects are:
- `float`
- `int`
- `bool`
- `datetime64`
- `timedelta`
- `category`
- `object`

It is always important to evaluate the data types of columns to ensure that the information they contain is properly represented.

See [`pandas`: dtypes](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf) for a more detailed reference.
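
When a column's dtype does not match the information it holds, it can usually be converted. The sketch below assumes a hypothetical DataFrame, `frame`, whose columns were all read in as strings.

```python
import pandas as pd

# Hypothetical frame whose columns all arrived as text.
frame = pd.DataFrame({
    'quantity': ['1', '2', '3'],
    'signup': ['2001-01-02', '2001-01-03', '2001-01-04'],
    'tier': ['low', 'high', 'low'],
})

# Convert each column to a dtype that represents its contents.
converted = frame.assign(
    quantity=frame['quantity'].astype('int64'),
    signup=pd.to_datetime(frame['signup']),
    tier=frame['tier'].astype('category'),
)

print(converted.dtypes)
```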

We will be using two tools extensively in this lesson:

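**The `.apply()` method**

This DataFrame method applies a function along an axis of a DataFrame: to each column by default (`axis=0`) or to each row (`axis=1`). When the function works elementwise, as NumPy functions do, the result covers every cell. We will explore this process in detail below.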

**The Series `.value_counts()` method**

`pandas` Series objects have a `.value_counts()` method that returns a new Series containing the counts of the data's unique values. This Series is sorted in descending order by default, so the first element is the most frequently occurring value.

Note: By default, `.value_counts()` excludes null values from the counts. Pass `dropna=False` to include them.

See [`pandas` Series: value_counts](http://nullege.com/codes/search/pandas.Series.value_counts) for more detailed information.
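
A minimal sketch of how null values affect the counts, using a made-up Series (`colors` is illustrative and not part of the lesson's data):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries.
colors = pd.Series(['red', 'blue', 'red', np.nan, 'red', np.nan])

# Default behavior: nulls are left out of the counts.
counts = colors.value_counts()

# dropna=False adds a row that counts the nulls.
counts_with_nulls = colors.value_counts(dropna=False)

print(counts)
print(counts_with_nulls)
```

Comparing the totals of the two results against `len(colors)` is a quick way to see how many values are missing.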


<a id='common_operations'></a>

### Common Operations on Data by Type

---

- **`float`**: Precision-specific math operations.
- **`int`**: Operations with whole numbers.
- **`bool`**: Control flow conditions.
- **`datetime64`**: Resampling, slicing/selection, frequency back/front filling on a date range.
- **`timedelta`**: Date comparisons.
- **`category`**: A more powerful set type; it can capture, for example, days of the week as categories with ordinal (ordering) information.
- **`object`**: Any data can be stored as an object, but vectorized math and date operations are generally not available until the column is converted. Control flow possibilities are limited unless you are comparing strings.
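
A few of the operations listed above, sketched on small made-up Series (all variable names here are illustrative):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2001-01-02', '2001-01-09']))

# datetime64: the .dt accessor exposes date parts such as the weekday name.
weekdays = dates.dt.day_name()

# timedelta: subtracting one datetime from another yields a timedelta.
gap = dates[1] - dates[0]

# category: an ordered categorical supports comparisons by rank.
sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'medium'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))
bigger_than_small = sizes > 'small'

# object: '+' on strings concatenates instead of adding numbers.
as_text = pd.Series(['1', '2'], dtype='object')
concatenated = as_text + as_text
```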

<a id='guided_practice'></a>

### Guided Practice: Inspecting Data Types and Applying Functions

---

[This guided practice](./practice-inspecting-data-applying-functions.ipynb) follows along with the questions in the Jupyter notebook provided.


In [1]:
import pandas as pd
import numpy as np

**1. Create a small DataFrame with different data types.**

In [2]:
test_data = dict( 
    A = np.random.rand(3),
    B = 1,
    C = 'foo',
    D = pd.Timestamp('20010102'),
    E = pd.Series([1.0]*3).astype('float32'),
    F = False,
    G = pd.Series([1]*3,dtype='int8')
)

In [3]:
test_data

{'A': array([0.23084836, 0.44286921, 0.03029326]),
 'B': 1,
 'C': 'foo',
 'D': Timestamp('2001-01-02 00:00:00'),
 'E': 0    1.0
 1    1.0
 2    1.0
 dtype: float32,
 'F': False,
 'G': 0    1
 1    1
 2    1
 dtype: int8}

In [4]:
dft = pd.DataFrame(test_data)
dft

|   | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.230848 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.442869 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.030293 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


**2. Examine the data types of the columns.**

In [5]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

In [6]:
type(dft.dtypes)

pandas.core.series.Series

In [7]:
# A nice way of breaking lines

pd.DataFrame(dft.dtypes)\
  .reset_index()\
  .rename(columns={'index':'column',0:'type'})

|   | column | type |
|---|---|---|
| 0 | A | float64 |
| 1 | B | int64 |
| 2 | C | object |
| 3 | D | datetime64[ns] |
| 4 | E | float32 |
| 5 | F | bool |
| 6 | G | int8 |


In [8]:
dft.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
A    3 non-null float64
B    3 non-null int64
C    3 non-null object
D    3 non-null datetime64[ns]
E    3 non-null float32
F    3 non-null bool
G    3 non-null int8
dtypes: bool(1), datetime64[ns](1), float32(1), float64(1), int64(1), int8(1), object(1)
memory usage: 194.0+ bytes


**3. Create a Series object with integers 1-5 and float 6.0. What data type is the Series?**

In [9]:
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

If a `pandas` object contains data of multiple types in a single column, `pandas` picks one dtype for that column that can accommodate every value (`object` is the most general).
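
The same widening happens when a missing value appears among integers, because the default integer dtype has no way to represent `NaN`. A small sketch with made-up Series:

```python
import numpy as np
import pandas as pd

ints_only = pd.Series([1, 2, 3])
ints_and_float = pd.Series([1, 2, 3.5])
ints_and_missing = pd.Series([1, 2, np.nan])

# Print each dtype to see which ones were widened.
for name, series in [('ints_only', ints_only),
                     ('ints_and_float', ints_and_float),
                     ('ints_and_missing', ints_and_missing)]:
    print(name, series.dtype)
```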

**4. Create a Series with data `[1, 2, 3, 'foo']`. What data type is the Series?**

In [10]:
pd.Series([1, 2, 3, 'foo'])

0      1
1      2
2      3
3    foo
dtype: object

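**5. Use the `.get_dtype_counts()` method to determine how many columns there are of each type.**

In [11]:
# Note: .get_dtype_counts() was deprecated in pandas 0.25 and removed in 1.0.
# On newer versions, use dft.dtypes.value_counts() instead.
dft.get_dtype_counts()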

float64           1
float32           1
int64             1
int8              1
datetime64[ns]    1
bool              1
object            1
dtype: int64

# With a partner, take three minutes to discuss:

*Without* running this code with a Python interpreter, what would you expect to be the most common `dtype`?

    [1, 3, 9, .33, False, '03-20-1978', np.arange(22)]



You can do a lot more with dtypes. For more information, check out 
[`pandas` Documentation](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).

**Applying Functions to Data with `df.apply()`**

Generally, `df.apply()` passes each column of the DataFrame (or each row, with `axis=1`) to a single function. When that function works elementwise, as `np.sqrt` does, the effect is the same as applying it to every cell.

Note: There is another common method, `Series.map()`, that applies a function to each element of a single Series (column). For example:

```python
df['a'].map(my_func)
```
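
To make the difference concrete, here is a small sketch on a made-up DataFrame; `toy` and `add_one` are hypothetical names used only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [16.0, 25.0, 36.0]})

def add_one(value):
    """Hypothetical helper: under .map() it receives one scalar at a time."""
    return value + 1

# Series.map: the function sees each element of a single column.
mapped = toy['a'].map(add_one)

# DataFrame.apply: the function sees each column as a whole Series.
col_ranges = toy.apply(lambda col: col.max() - col.min())

# With axis=1, the function sees each row instead.
row_sums = toy.apply(lambda row: row['a'] + row['b'], axis=1)
```

Because `.apply()` hands over a whole Series, functions that summarize (such as a range or a mean) return one value per column or row, while elementwise functions return a result with the original shape.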

**6. Create another small DataFrame.**

In [12]:
# Create some more test data.
df = pd.DataFrame(np.random.randn(5, 4), 
                  columns=['a', 'b', 'c', 'd'])
df

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | -0.338457 | 0.227305 | 0.84124 | 1.055888 |
| 1 | 0.786688 | -0.159309 | -0.357414 | 0.247515 |
| 2 | 0.085687 | 1.700259 | -0.664596 | -0.663706 |
| 3 | -0.254013 | -1.505525 | 0.001612 | 1.574972 |
| 4 | -1.370106 | -0.076077 | 0.2467 | -0.211049 |


**7. Use the `.apply()` function to find the square root of all cells.**

In [13]:
# Take the square root of all cells. Negative inputs produce NaN (Not a Number).
df.apply(np.sqrt)

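|   | a | b | c | d |
|---|---|---|---|---|
| 0 | NaN | 0.476766 | 0.917192 | 1.027564 |
| 1 | 0.886954 | NaN | NaN | 0.497509 |
| 2 | 0.292724 | 1.30394 | NaN | NaN |
| 3 | NaN | NaN | 0.040153 | 1.254979 |
| 4 | NaN | NaN | 0.496689 | NaN |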


In [14]:
df.apply(lambda x: x+1)

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.661543 | 1.227305 | 1.84124 | 2.055888 |
| 1 | 1.786688 | 0.840691 | 0.642586 | 1.247515 |
| 2 | 1.085687 | 2.700259 | 0.335404 | 0.336294 |
| 3 | 0.745987 | -0.505525 | 1.001612 | 2.574972 |
| 4 | -0.370106 | 0.923923 | 1.2467 | 0.788951 |


In [None]:
# The same operation as the lambda above, written as a named function.
def myf(x):
    return x + 1

df.apply(myf)

**8. Use `.apply()` to find the mean of the columns.**

In [22]:
df.head()

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | -0.338457 | 0.227305 | 0.84124 | 1.055888 |
| 1 | 0.786688 | -0.159309 | -0.357414 | 0.247515 |
| 2 | 0.085687 | 1.700259 | -0.664596 | -0.663706 |
| 3 | -0.254013 | -1.505525 | 0.001612 | 1.574972 |
| 4 | -1.370106 | -0.076077 | 0.2467 | -0.211049 |


In [15]:
df.apply(np.mean, axis=0)

a   -0.218040
b    0.037331
c    0.013509
d    0.400724
dtype: float64

In [23]:
df.apply(np.mean, axis=1)

0    0.446494
1    0.129370
2    0.114411
3   -0.045739
4   -0.352633
dtype: float64

In [16]:
df.apply(np.mean)

a   -0.218040
b    0.037331
c    0.013509
d    0.400724
dtype: float64

In [17]:
# Here x is an entire column (a Series), not a single element.
# np.mean(x) is that column's mean; x - np.mean(x) subtracts it from every element.
df.apply(lambda x: x-np.mean(x), axis=0)

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | -0.120417 | 0.189975 | 0.827732 | 0.655164 |
| 1 | 1.004728 | -0.19664 | -0.370922 | -0.153209 |
| 2 | 0.303728 | 1.662929 | -0.678104 | -1.06443 |
| 3 | -0.035973 | -1.542856 | -0.011896 | 1.174248 |
| 4 | -1.152066 | -0.113408 | 0.233191 | -0.611773 |
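
The mean-centering cell above can be checked on fixed values, since the lesson's `df` is random. The sketch below uses a made-up DataFrame, `demo`, and also shows an equivalent form without `.apply()`.

```python
import numpy as np
import pandas as pd

# Fixed values so the result is reproducible.
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 60.0]})

# Each column arrives as a Series; subtracting its mean centers it on zero.
centered = demo.apply(lambda col: col - np.mean(col), axis=0)

# Equivalent without .apply(): demo.mean() is a Series of column means,
# and the subtraction lines it up with the matching column labels.
centered_direct = demo - demo.mean()

print(centered)
```

After centering, every column's mean is zero (up to floating-point rounding), which is a handy sanity check.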


**9. Find the mean of the rows.**

In [18]:
df.apply(np.mean, axis=1)

0    0.446494
1    0.129370
2    0.114411
3   -0.045739
4   -0.352633
dtype: float64

### Further Reading

For more advanced information on `.apply` usage, check out these links:

- ["Why Not"'s Gist Examples](https://gist.github.com/why-not/4582705)

- [Chris Albon's Map + Apply Examples](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html)


**Counting Occurrences of Unique Values With `.value_counts()`**

The `.value_counts()` method tells us how many times each unique value appears in a column's data. It's helpful for identifying unexpected values and getting a feel for the data's distribution, especially when looking at group membership. Looking at the value counts per column can give us a quick overview of the values expressed in our data.

Some common use cases of `.value_counts()` include:
 - Finding strings inside of mostly numeric/continuous data.
 - Finding non-numeric values.
 - General distributions of categorical variables.
 - Identifying the most and least common values.
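
As a sketch of the first two use cases above, the made-up Series below (`readings` is a hypothetical name) holds mostly numeric text with a few placeholder strings mixed in.

```python
import pandas as pd

# Hypothetical readings: mostly numeric text, with placeholder strings mixed in.
readings = pd.Series(['1.5', '2.0', 'N/A', '1.5', 'missing', 'N/A', '1.5'])

# Most common value: the result is sorted in descending order of count.
counts = readings.value_counts()
most_common = counts.index[0]

# Flag entries that do not parse as numbers, then count only those.
is_numeric = pd.to_numeric(readings, errors='coerce').notnull()
non_numeric_counts = readings[~is_numeric].value_counts()

print(counts)
print(non_numeric_counts)
```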

**10. Use `numpy` to create a random vector of 50 numbers ranging from 0 to 6.**


In [19]:
data = np.random.randint(0, 7, size = 50)
data

array([1, 2, 4, 5, 0, 4, 1, 5, 5, 5, 0, 4, 4, 4, 5, 0, 4, 2, 4, 0, 4, 3,
       3, 4, 2, 2, 6, 6, 1, 4, 2, 0, 6, 0, 1, 5, 1, 5, 6, 6, 5, 0, 0, 0,
       4, 1, 0, 3, 2, 3])

**11. Convert the vector to a Series and count the occurrences of each number.**

In [20]:
s = pd.Series(data)
s.head()

0    1
1    2
2    4
3    5
4    0
dtype: int64

In [21]:
# List how many times each number occurs in our array.
# (The top-level pd.value_counts() is deprecated in recent pandas; use the Series method.)
s.value_counts()

4    11
0    10
5     8
2     6
1     6
6     5
3     4
dtype: int64

<a id='independent_pratice'></a>

### Independent Practice: Sales Data

---

1. Load the `sales.csv` data set from the `datasets` directory.
2. Inspect the data types.
3. Imagine you've found out that all your values in column 1 are off by one. Use `.apply()` or `.map()` to add `1` to column 1 of the data set.
4. Use `.value_counts()` to count the values of one column of the data set.
