<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">

# Intro to Data Cleaning

Week 2 | Lesson 2.3

<img src="https://snag.gy/ywU34V.jpg" width="250">

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Inspect data types
- Clean up a column using df.apply()
- Know when to use .value_counts() in your code

In [None]:
#--Data Cleaning--
#Core Principles
#Types
#df.apply()

#DATA JANITOR
#70-90% of time is dedicated to cleaning and exploring data





### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 5 min  | [Introduction](#introduction)   | Inspect data types, df.apply(), .value_counts()  |
| 20 min  | [Demo /Guided Practice](#demo)  | Inspect data types |
| 20 min  | [Demo /Guided Practice](#demo)  | df.apply() |
| 20 min  | [Demo /Guided Practice](#demo)  | .value_counts() |
| 20 min  | [Independent Practice](#ind-practice)  |   |
| 5 min  | [Conclusion](#conclusion)  |   |

## Discussion (5 mins):  What problems do you anticipate with bad data?



## Quality Metrics

 - Relative value
 - Encoding
 - Consistency 

In [None]:
#Relative value - how do we deal with missing data?
#Encoding - Is this in the right format?
#Consistency - 

## Common Data Cleaning Strategies
 - Remove missing values
 - Remove incorrect values
 - Update incorrect values
  - Removing invalid characters
  - Truncating part of a value
  - Scaling
  - Adding extra numeral or string-based data
 - Imputation
  - Mean / Median / Mode of common subset
 - Backfill / Forward fill
 - Examine common subsets for consistency

In [None]:
'''
Imputation is common if you want to keep signal

'''
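As a rough sketch of a few of the strategies listed above, the snippet below drops rows with missing values, imputes with the median, and forward/backward fills. The `toy` DataFrame and its `age` and `city` columns are made up for illustration and are not part of the lesson data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration
toy = pd.DataFrame({
    'age': [25.0, np.nan, 31.0, 40.0],
    'city': ['SF', 'SF', None, 'LA'],
})

# Remove missing values: keep only the rows that have no NaN/None
dropped = toy.dropna()

# Imputation: replace missing ages with the median of the observed ages
imputed_age = toy['age'].fillna(toy['age'].median())

# Forward fill / backfill: copy the previous (or next) valid value
city_ffill = toy['city'].ffill()
city_bfill = toy['city'].bfill()
```

Which strategy is appropriate depends on the data: dropping rows loses signal, while imputing or filling keeps the row but introduces values that were never observed.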

<a name="introduction"></a>
## Introduction: Topic (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a
couple more tools to our toolbox. 

The main data types stored in pandas objects are **float, int, bool, datetime64, timedelta, 
category,** and **object**. 

df**.apply()** will apply a function along any axis of the DataFrame. We'll see it in action below. 

pandas.Series**.value_counts** returns a _Series_ containing counts of unique values. The resulting 
Series will be in descending order so that the first element is the most frequently-occurring 
element. NA values are excluded by default.

[Pandas: dtypes](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf)

[Pandas Series: value_counts](http://nullege.com/codes/search/pandas.Series.value_counts)


### Common Operations

- **float**: precision-specific math operations
- **int**: mathematical operations using whole numbers
- **bool**: control flow conditions
- **datetime64**: resampling, slicing/selection, frequency back/front filling, on date range
- **timedelta**: date comparisons
- **category**: like a set type; captures categorical data as a fixed set of values, optionally ordered (ordinal)
- **object**: all types can be represented as an object, but math and date operations are not possible. Limited control flow possibilities unless you are comparing strings.
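To make two of the less familiar types above concrete, here is a small sketch of an ordered **category** and a **timedelta**. The `sizes` data and the category order are invented for this example.

```python
import pandas as pd

# Hypothetical data: clothing sizes stored as plain strings
sizes = pd.Series(['small', 'large', 'medium', 'small'])

# An ordered categorical knows that small < medium < large
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes_cat = sizes.astype(size_type)

# Comparisons and sorting now follow the category order, not the alphabet
at_least_medium = sizes_cat >= 'medium'
sorted_sizes = sizes_cat.sort_values()

# Subtracting two timestamps yields a timedelta
gap = pd.Timestamp('2001-01-05') - pd.Timestamp('2001-01-02')
```

Here `gap.days` is 3, and `sorted_sizes` puts `'large'` last even though it sorts first alphabetically.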

<a name="Inpsect data types "></a>
## Demo /Guided Practice: Inspect data types  (20 mins)

Let's create a small dictionary with different data types in it. 

> [demo code](../code/w2-2.3-demo.ipynb)
can be found in the code folder and contains all the code in this lesson in a Jupyter
notebook. Follow along or create a new notebook.


### Import Pandas + Numpy

In [2]:
import pandas as pd
import numpy as np

### Create Test Data

In [2]:
test_data = dict( 
    A = np.random.rand(3),
    B = 1,
    C = 'foo',
    D = pd.Timestamp('20010102'),
    E = pd.Series([1.0]*3).astype('float32'),
    F = False,
    G = pd.Series([1]*3,dtype='int8')
)

In [3]:
test_data

{'A': array([ 0.14716712,  0.02145911,  0.03801213]),
 'B': 1,
 'C': 'foo',
 'D': Timestamp('2001-01-02 00:00:00'),
 'E': 0    1.0
 1    1.0
 2    1.0
 dtype: float32,
 'F': False,
 'G': 0    1
 1    1
 2    1
 dtype: int8}

### Create our DataFrame

In [4]:
dft = pd.DataFrame(test_data)
dft

|   | A | B | C | D | E | F | G |
|:-:|---|---|---|---|---|---|---|
| 0 | 0.147167 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.021459 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.038012 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


In [5]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

**What dtype might we expect when a single column holds values of mixed types?**

e.g.: [2, 3, 4, 5, 6, 7, 8.9]

If a pandas object contains data of multiple dtypes IN A SINGLE COLUMN, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

### Ints are cast to floats

In [6]:
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

### String elements are cast to ``object`` dtype

In [7]:
pd.Series([1, 2, 3, 'foo'])

0      1
1      2
2      3
3    foo
dtype: object

In [8]:
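# Count the columns of each dtype.
# Note: get_dtype_counts() was removed in pandas 1.0; on newer versions use dft.dtypes.value_counts()
dft.get_dtype_counts().astype(list)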

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: object
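Since `get_dtype_counts()` is not available in newer pandas releases, here is one possible equivalent, rebuilt on the same test data so the snippet runs on its own. The `select_dtypes` call is an extra illustration that is not part of the original demo.

```python
import numpy as np
import pandas as pd

# Same test data as above, rebuilt so this cell is self-contained
dft = pd.DataFrame(dict(
    A=np.random.rand(3),
    B=1,
    C='foo',
    D=pd.Timestamp('20010102'),
    E=pd.Series([1.0] * 3).astype('float32'),
    F=False,
    G=pd.Series([1] * 3, dtype='int8'),
))

# Count how many columns have each dtype
dtype_counts = dft.dtypes.value_counts()

# Select only the numeric columns (ints and floats of any width; bool is excluded)
numeric_cols = dft.select_dtypes(include='number').columns.tolist()
```

The exact dtype labels and their order in `dtype_counts` may differ from the output shown above depending on the pandas version.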

### With a partner, take 3 minutes to discuss:

*Without* running this code with a Python interpreter, which common `dtype` would you expect pandas to select?

    [1, 3, 9, .33, False, '03-20-1978', np.arange(22)]



You can do a lot more with dtypes.  Check out 
[Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).

## Why do you think it might be important to know what kind of dtypes you're working with? 
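One possible illustration, using a made-up column (`raw` is hypothetical): a single bad entry is enough to make a numeric column load as text, and arithmetic on text does not do what you expect.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical column that was read in as text because of one bad entry
raw = pd.Series(['10', '20', 'n/a', '40'], dtype=object)
raw_is_numeric = is_numeric_dtype(raw)

# Summing strings concatenates them instead of adding numbers
concatenated = raw.sum()

# Convert to numbers; entries that cannot be parsed become NaN
numbers = pd.to_numeric(raw, errors='coerce')
n_bad = int(numbers.isna().sum())

# NaN is skipped by default, so the total covers only the valid entries
total = numbers.sum()
```

Checking `dtypes` first tells you whether a conversion step like `pd.to_numeric` is needed before any math.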

<a name=" df.apply()"></a>
## Demo /Guided Practice:  df.apply() (20 mins)

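Generally, df.apply() applies a single function along an axis of the DataFrame: to each column (axis=0, the default) or to each row (axis=1). Element-wise functions such as np.sqrt therefore end up transforming every cell.

Conversely, Series.map() is available when you only want to work with a single column of your dataset, e.g. df['a'].map(my_func)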

In [18]:
#df.apply() and Series.map()
#df.apply() -> apply a single function to each column (or row) of the dataframe
#Series.map() -> available when you only want to work with a single column of your dataset
#


# Create some more test data
df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df

|   | a | b | c | d |
|:-:|---|---|---|---|
| 0 | 0.410968 | 1.065398 | -1.656901 | 0.629236 |
| 1 | -1.599641 | -2.428118 | -2.78833 | 0.005204 |
| 2 | 0.873606 | -1.23312 | 0.412481 | 0.034073 |
| 3 | -0.783972 | -0.940906 | -1.126716 | -0.86664 |
| 4 | 0.495917 | -1.47142 | -0.284235 | -2.549539 |


### Some Examples

In [27]:
# Square root of ALL CELLS; negative values have no real square root and become NaN (Not a Number)
df.apply(np.sqrt)  # np.sqrt is element-wise, so applying it column by column transforms every cell

|   | volume_sold | 2015_margin | 2015_q1_sales | 2016_q1_sales |
|:-:|---|---|---|---|
| 0 | 4.291941 | 9.685158 | 580.660426 | 581.209128 |
| 1 | 2.185523 | 4.591560 | 149.505385 | 147.433477 |
| 2 | 4.074604 | 9.675355 | 527.033642 | 554.023709 |
| 3 | 2.072706 | 4.101793 | 129.634525 | 96.476681 |
| 4 | 2.855875 | 5.917048 | 233.262556 | 242.775411 |
| 5 | 2.237213 | 5.646011 | 505.904942 | 577.043352 |
| 6 | 3.821878 | 8.747512 | 564.819166 | 550.084430 |
| 7 | 2.111034 | 4.397425 | 212.932689 | 235.191900 |
| 8 | 2.246671 | 5.112971 | 240.518669 | 205.909130 |
| 9 | 2.321222 | 4.735718 | 225.900509 | 237.153052 |


_Note: Illustrate with whiteboard DataFrame, with blank axis labels._

### Apply a method along one axis only: axis 0 (columns)
e.g.:  `df['a'] == [-1.369438  , -0.76541512,  1.75835588,  1.17270527,  0.02630271]`

In [4]:
# np.mean is an aggregate function, so apply returns a single result per column
df.apply(np.mean, axis=0)
# Series.map transforms each value of one column; only this last expression is displayed
df['a'].map(lambda value: value + 1)

0    2.657000
1    1.112650
2   -0.485292
3    1.518611
4    1.139144
Name: a, dtype: float64

### Apply a method along axis 1 (rows)
This is what the data slice looks like when we select by row. Here is the first row, which `.apply()` would process with `axis=1`:

`df.iloc[0].values == [-1.369438  ,  0.0804468 , -1.22457047,  0.42207757]`

_We are calculating the mean of lists of "rows", not "columns"._
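A minimal sketch of the difference between the two axes, using a small fixed DataFrame (`grid` is a made-up name) so the results are easy to check by hand:

```python
import numpy as np
import pandas as pd

# 3 rows x 4 columns holding the values 0.0 through 11.0
grid = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=['a', 'b', 'c', 'd'])

# axis=0: the function receives each column, giving one result per column
col_means = grid.apply(np.mean, axis=0)

# axis=1: the function receives each row, giving one result per row
row_means = grid.apply(np.mean, axis=1)
```

`col_means` is indexed by the column labels (`a` through `d`), while `row_means` is indexed by the row labels (0 through 2).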

In [None]:
df = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/drug_use_by_age/drug-use-by-age.csv')
#print(df)



### Further Reading

For more advanced `.apply` usage, check out these links:

["Why Not"'s Gist Examples](https://gist.github.com/why-not/4582705)

[Chris Albon's Map + Apply Examples](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html)


### **Check:** How would you find the std of the columns and rows? 

<a name=".value_counts()"></a>
## Demo /Guided Practice: .value_counts() (< 20 mins)

Why is this important?  Basically, this tells us how many times each unique value occurs.  It's helpful for identifying anything unexpected.  Looking at value_counts(), per series, can give us a quick overview of the values expressed in our data.

 - Strings inside of mostly numeric / continuous data
 - Non-numeric values
 - General counts of values that we might expect to see
 - Most common / least common values

In [None]:
#Series.value_counts() will uncover any inconsistencies
#df.applymap() has the flexibility of both df.apply() and Series.map()
#df.apply() takes an entire Series (column or row) as input
#Series.map() takes each individual cell in a Series as input

#df.applymap() is the child of df.apply() and Series.map()
#This lets you apply a function to every value in every Series in the DataFrame
#(pandas 2.1+ renames it to DataFrame.map() and deprecates applymap)
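The notes above can be sketched side by side as follows. The `frame` data is invented, and the `hasattr` check is just one way to cope with the rename of `applymap` to `DataFrame.map` in pandas 2.1.

```python
import pandas as pd

# Hypothetical data for comparing the three methods
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# DataFrame.apply: the function receives a whole column (a Series) at a time
col_ranges = frame.apply(lambda col: col.max() - col.min())

# Series.map: the function receives one value at a time
a_plus_one = frame['a'].map(lambda value: value + 1)

# Element-wise over the whole DataFrame: DataFrame.map on pandas 2.1+,
# applymap on older versions
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
doubled = elementwise(lambda value: value * 2)
```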

Let's create some random data

In [15]:
data = np.random.randint(0, 7, size = 50)
data

array([2, 2, 3, 1, 5, 6, 0, 6, 5, 5, 0, 4, 2, 6, 0, 4, 1, 0, 3, 3, 4, 5, 6,
       6, 5, 6, 5, 6, 6, 2, 6, 1, 6, 5, 4, 6, 3, 2, 3, 4, 4, 3, 1, 2, 4, 4,
       1, 4, 4, 4])

In [16]:
s = pd.Series(data)
s.head()

0    2
1    2
2    3
3    1
4    5
dtype: int64

In [17]:
# The counts of each number that occurs in our array are listed
s.value_counts()

6    11
4    11
5     7
3     6
2     6
1     5
0     4
dtype: int64
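Random integers are clean by construction, so here is a sketch of the "anything unexpected" case from the bullets above. The `ratings` column is made up, with a numeric string, a word, and a missing value mixed in.

```python
import numpy as np
import pandas as pd

# Hypothetical messy column: mostly numbers, plus stray strings and a NaN
ratings = pd.Series([5, 3, '3', 5, np.nan, 'five', 5], dtype=object)

# Default behaviour: NaN is excluded, most frequent value comes first
counts = ratings.value_counts()

# Include missing values in the tally
counts_with_nan = ratings.value_counts(dropna=False)

# Report proportions instead of raw counts
shares = ratings.value_counts(normalize=True)
```

The integer `3` and the string `'3'` appear as separate entries in `counts`, which is exactly the kind of inconsistency this method helps surface.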

In [None]:
#Import Sales Dataset
import pandas as pd
import numpy as np
df = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/sales_data_simple/sales.csv')
#print(df.head())

#df.info()

#use Series.apply() to add 1 to every value in column 1 of df
#df['volume_sold'] = df['volume_sold'].apply(lambda x: x + 1)
#print(df.head())

#use .value_counts() to count the values of 1 column of the dataset
#df['volume_sold'].value_counts()

#df['2015_q1_sales'].apply(lambda x: x + 3)

#Use .value_counts() for each column of the dataset
#for col in df.columns:
#    print(df[col].value_counts())


<a name="ind-practice"></a>
## Independent Practice: Topic (20+ minutes)
- Use the sales.csv data set, which we've seen a few times in previous lessons
- Inspect the data types
- You've found out that all your values in column 1 are off by 1. Use df.apply to add 1 to column 1 of the dataset
- Use .value_counts to count the values of 1 column of the dataset

**Bonus** 
- Add 3 to column 2
- Use .value_counts for each column of the dataset

**Bonus Bonus -- COMPLETELY OPTIONAL!!!**
<img src="http://vignette3.wikia.nocookie.net/erbparodies/images/a/a3/Troll_Based_On.png/revision/latest?cb=20151109194505" style="width: 100px;">

Ruining data should give you a better sense of how to clean it.  Don't feel like you need to attempt this: it's completely optional and meant to be _extraneous_.  Real-life datasets will not be like this.  The solution isn't as important as the process and thinking behind your approach.  Another way to try this is to map out a step-by-step plan in pseudocode, without actually coding anything.

- Add an extra column to your dataframe that is a copy of an existing column with continuous data
    - Randomly change the value of continuous data cells within it to the following:
      - NaN
      - A blank string
      - A numeric string
      - The same value
    - Report value_counts post-"random data troll" processing. Does it seem random?
    - Convert blank strings and NaN values to float(0)
    - Convert numeric strings to floats rounded to two decimal places (`.2f` precision)
    - Divide by 2 if cell value is prime, use remainder as value
    - Post solution as Gist with comments to Slack

<a name="conclusion"></a>
## Conclusion (5 mins)
So far we've used pandas to look at the head and tail of a data set. We've also taken a look at summary stats and different
data types. We've selected and sliced data too. Today we added inspecting data types, df.apply, and .value_counts to
our pandas arsenal.

