# Intro to Data Cleaning

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Inspect data types
- Clean up a column using df.apply()
- Know when to use .value_counts() in your code


## Introduction: Topic (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a
couple more tools to our toolbox.

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta[ns],
category, and object.

df.apply() applies a function along an axis of the DataFrame: to each column by default (axis=0), or to each row with axis=1. We'll see it in action below.

pandas.Series.value_counts returns a Series containing counts of unique values. The resulting
Series is sorted in descending order, so the first element is the most frequently occurring
value. NA values are excluded by default.
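
As a minimal sketch of the behavior just described (the `colors` Series and its values are made up for illustration), the following shows the descending sort and the exclusion of missing values. Passing `dropna=False` includes them:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration: three distinct labels plus one missing value.
colors = pd.Series(["red", "blue", "red", np.nan, "red", "blue", "green"])

# Default behavior: counts sorted in descending order, NaN excluded.
counts = colors.value_counts()

# dropna=False keeps the missing value as its own entry.
counts_with_na = colors.value_counts(dropna=False)

print(counts)
print(counts_with_na)
```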

- Examples of [dtypes](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).
- Examples of [value_counts](http://nullege.com/codes/search/pandas.Series.value_counts).



<a name="Inspect data types "></a>
## Demo / Guided Practice: Inspect data types (20 mins)

Let's create a small DataFrame from a dictionary with different data types in it.

> Instructor Note: The [demo code](./code/w2-2.3-demo.ipynb) contains all the code for this lesson in a Jupyter notebook. Use it to review the following code output:

In a Jupyter (IPython) notebook, type:

In [16]:
import pandas as pd
import numpy as np
dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3, dtype='int8')))
dft

| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.096202 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.918523 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.082195 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


In [17]:
#There is a really easy way to see what kind of dtypes are in each column.
dft.dtypes



A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

If a pandas object contains data of multiple dtypes **in a single column**, the dtype of the
column will be chosen to accommodate all of the data types (object is the most general).

In [18]:
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

In [19]:
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3      6
4    foo
dtype: object

In [20]:
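# Count the number of columns of each dtype in a DataFrame.
# Older pandas versions offered dft.get_dtype_counts() for this (removed in pandas 1.0);
# the output below comes from that method. dft.dtypes.value_counts() gives the same
# counts, possibly in a different row order.
dft.dtypes.value_counts()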

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: int64

You can do a lot more with dtypes; see the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf) for details.
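
One common next step during cleaning, sketched here under assumptions (the `raw` frame and its column names are hypothetical), is converting a column that holds numbers as text into a numeric dtype with `astype` or `pd.to_numeric`:

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, with one entry that is not a number.
raw = pd.DataFrame({"qty": ["1", "2", "3"], "price": ["4.50", "n/a", "6.25"]})

# Both columns start out with a string dtype (shown as object on many pandas versions).
print(raw.dtypes)

# astype works when every value converts cleanly.
raw["qty"] = raw["qty"].astype("int64")

# pd.to_numeric with errors="coerce" turns unparseable entries into NaN.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

print(raw.dtypes)
```

With numeric dtypes in place, operations such as `raw["price"].sum()` do arithmetic instead of failing or concatenating strings.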

**Check:** Why do you think it might be important to know what kind of dtypes you're working with?



## Demo / Guided Practice: df.apply()

Let's create a small data frame.



In [27]:
df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df

| | a | b | c | d |
|---|---|---|---|---|
| 0 | -1.528529 | -0.159901 | -0.619863 | -1.659255 |
| 1 | -1.032695 | -1.186721 | -1.074595 | 0.245309 |
| 2 | 0.973328 | -1.688048 | -1.930654 | 1.235455 |
| 3 | -0.919974 | 0.556679 | 0.809766 | -0.148931 |
| 4 | -1.218810 | 0.849273 | -1.401231 | 1.476625 |


In [29]:
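# Use df.apply to find the square root of all the values.
# The square root of a negative number is undefined here, so those cells become NaN.
df.apply(np.sqrt)


| | a | b | c | d |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | 0.495286 |
| 2 | 0.986574 | NaN | NaN | 1.111510 |
| 3 | NaN | 0.746109 | 0.899870 | NaN |
| 4 | NaN | 0.921560 | NaN | 1.215165 |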


In [30]:
# Find the mean of each column.
df.apply(np.mean, axis=0)


a   -0.745336
b   -0.325744
c   -0.843316
d    0.229841
dtype: float64

In [34]:
# Find the mean of each row.
df.apply(np.mean, axis=1)

0   -0.991887
1   -0.762176
2   -0.352480
3    0.074385
4   -0.073536
dtype: float64
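
`apply` also accepts your own functions, which is where it becomes useful for cleaning. A minimal sketch, assuming a hypothetical frame `messy` with untidy text and a helper `title_case`, neither of which is part of the lesson's demo notebook:

```python
import pandas as pd

# Hypothetical data for illustration: inconsistent whitespace and capitalization.
messy = pd.DataFrame({
    "city": ["  new york", "Boston ", "BOSTON"],
    "state": ["ny ", " ma", "Ma"],
})

# DataFrame.apply hands each column to the function as a Series (axis=0 is the default),
# so the vectorized .str methods can clean a whole column at once.
cleaned = messy.apply(lambda col: col.str.strip().str.lower())

def title_case(value):
    """Capitalize the first letter of each word in a single string."""
    return value.title()

# Series.apply calls the function once per value in one column.
cleaned["city"] = cleaned["city"].apply(title_case)

print(cleaned)
```

After this runs, the three city entries collapse to two distinct values, "New York" and "Boston", and every state code is lowercase.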

- Examples of [df.apply](https://gist.github.com/why-not/4582705).
- More examples of [df.apply](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html).

**Check:** How would you find the standard deviation (std) of the columns and of the rows?


<a name=".value_counts()"></a>
## Demo / Guided Practice: .value_counts() (20 mins)

Let's create a random array of 50 integers ranging from 0 to 6 (the upper bound passed to np.random.randint, 7, is exclusive).

In [35]:
data = np.random.randint(0, 7, size = 50)

#Convert the array into a series.
s = pd.Series(data)

In [36]:
# How many times does each number occur in the Series? Call value_counts() on it.
# (The top-level pd.value_counts(s) form is deprecated in recent pandas versions.)
s.value_counts()


0    12
4    10
5     7
3     6
2     6
6     5
1     4
dtype: int64
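
`value_counts()` is especially handy during cleaning for spotting inconsistent labels in a categorical column. A sketch with made-up survey answers (the `answers` Series is hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration: the same two answers typed several ways.
answers = pd.Series(["Yes", "no", "yes", "YES", "No", "yes ", "no"])

# The raw counts reveal several spellings of the same answer.
raw_counts = answers.value_counts()
print(raw_counts)

# After stripping whitespace and lowercasing, only two categories remain.
tidy = answers.str.strip().str.lower()
counts = tidy.value_counts()
print(counts)

# normalize=True reports proportions instead of raw counts.
shares = tidy.value_counts(normalize=True)
print(shares)
```

Running `value_counts()` before and after a cleaning step is a quick way to confirm that the step merged the categories you expected.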