# Data Wrangling
Data preparation includes:
* loading
* cleaning
* transforming
* rearranging

## Combining and merging data sets
There are several ways to combine datasets in pandas:
* [pandas.merge](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge) -- a SQL or relational database _join_ method
* [pandas.concat](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) -- glues or stacks together objects along an axis
* combine_first -- splicing together overlapping data to fill in missing values in one object with values from another.

### Database-Style DataFrame join
Pandas provides a single function, __merge__, as the entry point for all standard database join operations between DataFrame objects. The function is also available as a DataFrame instance method.

There are several cases in Database __join__:
* __one-to-one__ joins: joining two DataFrame objects on their indexes which must contain unique values
* __many-to-one__ joins: joining a unique index to one or more columns in a DataFrame
* __many-to-many__ joins: joining columns on columns. If a key combination appears more than once in both tables, the resulting table will have the __Cartesian product__ of the associated data. 
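
The many-to-many case can be sketched on a small example. The frames and column names below (`lval`, `rval`) are hypothetical and chosen only for illustration. The `validate` argument of `merge` is used to check the expected relationship between the keys.

```python
import pandas as pd

# Hypothetical toy frames: key "a" appears twice on each side.
left = pd.DataFrame({"key": ["a", "a", "b"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "c"], "rval": [10, 20, 30]})

# Default inner join: every left "a" row pairs with every right "a" row,
# so the two-by-two combination yields four rows.
merged = pd.merge(left, right, on="key")
print(merged)

# Asking pandas to verify a one-to-one relationship fails for these keys.
try:
    pd.merge(left, right, on="key", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

Checking the relationship explicitly is a cheap guard against an accidental Cartesian product when the keys were assumed to be unique.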

In [1]:
import pandas as pd

In [2]:
left = pd.DataFrame({'key1' : ['K0', 'K0', 'K1', 'K2'],
                    'key2' : ['K0', 'K1', 'K0', 'K1'],
                    'A' : [1, 2, 3, 4],
                    'B': [0.1, 0.2, 0.3,0.4]})
left

Unnamed: 0,A,B,key1,key2
0,1,0.1,K0,K0
1,2,0.2,K0,K1
2,3,0.3,K1,K0
3,4,0.4,K2,K1


In [3]:
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'k2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': [-0.1, -0.2, -0.3, -0.4],
                      'D': [-1, -2, -3, -4]})
right

Unnamed: 0,C,D,key1,key2
0,-0.1,-1,K0,K0
1,-0.2,-2,K1,K0
2,-0.3,-3,K1,K0
3,-0.4,-4,k2,K0


In [4]:
result = pd.merge(left, right, on = ['key1', 'key2'])
result

Unnamed: 0,A,B,key1,key2,C,D
0,1,0.1,K0,K0,-0.1,-1
1,3,0.3,K1,K0,-0.2,-2
2,3,0.3,K1,K0,-0.3,-3


The __merge__ function takes a __how__ argument that determines which keys are included in the resulting table. If a key combination is missing from either the left or the right table, the corresponding values in the joined table will be __NA__. By default, __merge__ performs an _inner join_. Here is a summary:

__Merge method__ | __SQL Join Name__ | __Description__
-----------------|-------------------|----------------
left | LEFT OUTER JOIN | Use keys from left frame only
right | RIGHT OUTER JOIN | Use keys from right frame only
outer | FULL OUTER JOIN | Use union of keys from both frames
inner | INNER JOIN | Use intersection of keys from both frames

In [5]:
result = pd.merge(left, right, how = 'left', on = ['key1', 'key2'])
result

Unnamed: 0,A,B,key1,key2,C,D
0,1,0.1,K0,K0,-0.1,-1.0
1,2,0.2,K0,K1,,
2,3,0.3,K1,K0,-0.2,-2.0
3,3,0.3,K1,K0,-0.3,-3.0
4,4,0.4,K2,K1,,


In [6]:
result = pd.merge(left, right, how = 'right', on = ['key1', 'key2'])
result

Unnamed: 0,A,B,key1,key2,C,D
0,1.0,0.1,K0,K0,-0.1,-1
1,3.0,0.3,K1,K0,-0.2,-2
2,3.0,0.3,K1,K0,-0.3,-3
3,,,k2,K0,-0.4,-4


In [7]:
result = pd.merge(left, right, how = 'outer', on = ['key1', 'key2'])
result

Unnamed: 0,A,B,key1,key2,C,D
0,1.0,0.1,K0,K0,-0.1,-1.0
1,2.0,0.2,K0,K1,,
2,3.0,0.3,K1,K0,-0.2,-2.0
3,3.0,0.3,K1,K0,-0.3,-3.0
4,4.0,0.4,K2,K1,,
5,,,k2,K0,-0.4,-4.0
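
As a compact way to compare the four join types, the sketch below counts the rows each one produces. The frames are hypothetical single-key examples, not the ones used in the cells above.

```python
import pandas as pd

# Hypothetical frames sharing keys K0 and K1; K2 is left-only, K3 is right-only.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "C": [10, 20, 30]})

# Number of rows produced by each value of the `how` argument.
row_counts = {
    how: len(pd.merge(left, right, how=how, on="key"))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)
```

With unique keys on both sides, the inner join keeps the two shared keys, the left and right joins keep three keys each, and the outer join keeps all four.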


The __indicator__ argument adds a column that shows where each row's merge key was found:

__Observation Origin__ | __\_merge value__
---------------------|----------------
Merge key only in 'left' frame|left_only
Merge key only in 'right' frame|right_only
Merge key in both frames|both

In [8]:
result = pd.merge(left, right, how = 'left', on = ['key1', 'key2'], indicator = True)
result

Unnamed: 0,A,B,key1,key2,C,D,_merge
0,1,0.1,K0,K0,-0.1,-1.0,both
1,2,0.2,K0,K1,,,left_only
2,3,0.3,K1,K0,-0.2,-2.0,both
3,3,0.3,K1,K0,-0.3,-3.0,both
4,4,0.4,K2,K1,,,left_only


In [9]:
result = pd.merge(left, right, how = 'left', on = ['key1', 'key2'], indicator = 'take_on_columns')
result

Unnamed: 0,A,B,key1,key2,C,D,take_on_columns
0,1,0.1,K0,K0,-0.1,-1.0,both
1,2,0.2,K0,K1,,,left_only
2,3,0.3,K1,K0,-0.2,-2.0,both
3,3,0.3,K1,K0,-0.3,-3.0,both
4,4,0.4,K2,K1,,,left_only
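
One practical use of the indicator column is an "anti-join": keeping only the left rows that have no match on the right. This is a minimal sketch with hypothetical frames; the column name `_merge` is the default that pandas assigns when `indicator=True`.

```python
import pandas as pd

# Hypothetical frames: only K0 has a match on the right.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K3"], "C": [10, 30]})

# Left join with the indicator column, then keep the unmatched left rows.
merged = pd.merge(left, right, how="left", on="key", indicator=True)
left_only = merged.loc[merged["_merge"] == "left_only", ["key", "A"]]
print(left_only)
```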


Sometimes two DataFrames share the same column names; the __suffixes__ argument disambiguates the resulting columns:

In [10]:
left = pd.DataFrame({'k' : ['k0', 'k1', 'k2'], 'v': [1, 2, 3]})

right = pd.DataFrame({'k' : ['k0', 'k0', 'k3'], 'v': [4, 5, 6]})

left

Unnamed: 0,k,v
0,k0,1
1,k1,2
2,k2,3


In [11]:
right

Unnamed: 0,k,v
0,k0,4
1,k0,5
2,k3,6


In [12]:
result = pd.merge(left, right, on = 'k', suffixes = ['_l', '_r'])
result

Unnamed: 0,k,v_l,v_r
0,k0,1,4
1,k0,1,5


In some cases, the merge key or keys in a DataFrame will be found in its index. 

In [13]:
left = pd.DataFrame( {'a': ['a10', 'a11', 'a12'],
                      'b': ['b10', 'b11', 'b12']},
                      index = ['k0', 'k1', 'k2'])
left

Unnamed: 0,a,b
k0,a10,b10
k1,a11,b11
k2,a12,b12


In [14]:
right = pd.DataFrame( {'c' : ['c10', 'c20', 'c30'],
                       'd' : ['d10', 'd20', 'd30']},
                       index = ['k0', 'k2', 'k3'])
right

Unnamed: 0,c,d
k0,c10,d10
k2,c20,d20
k3,c30,d30


In [15]:
result = pd.merge(left, right, left_index = True, right_index = True)
result

Unnamed: 0,a,b,c,d
k0,a10,b10,c10,d10
k2,a12,b12,c20,d20


DataFrame also has a convenient __join__ method that does a similar job:

In [16]:
result = left.join(right, how = 'inner')
result

Unnamed: 0,a,b,c,d
k0,a10,b10,c10,d10
k2,a12,b12,c20,d20
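
Note that the two entry points have different defaults: `DataFrame.join` performs a left join unless told otherwise, while `pd.merge` performs an inner join. The following sketch, using simplified versions of the frames above, illustrates the difference.

```python
import pandas as pd

# Simplified frames indexed by key; k1 exists only on the left, k3 only on the right.
left = pd.DataFrame({"a": ["a10", "a11", "a12"]}, index=["k0", "k1", "k2"])
right = pd.DataFrame({"c": ["c10", "c20", "c30"]}, index=["k0", "k2", "k3"])

# join defaults to how='left': all left index labels are kept.
joined_default = left.join(right)

# merge defaults to how='inner': only shared index labels are kept.
merged_default = pd.merge(left, right, left_index=True, right_index=True)

print(joined_default)
print(merged_default)
```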


When joining two DataFrames, the index of one can be aligned with a column of the other:

In [17]:
left = pd.DataFrame({'a' : ['a0', 'a1', 'a2', 'a3'],
                     'b' : ['b0', 'b1', 'b2', 'b3'],
                     'key': ['k0', 'k1', 'k0', 'k1']})
left

Unnamed: 0,a,b,key
0,a0,b0,k0
1,a1,b1,k1
2,a2,b2,k0
3,a3,b3,k1


In [18]:
right = pd.DataFrame ({'c' : ['c0', 'c1'],
                       'd' : ['d0', 'd1']},
                       index = ['k0', 'k1'])
right 

Unnamed: 0,c,d
k0,c0,d0
k1,c1,d1


In [19]:
result = pd.merge(left, right, left_on = 'key', right_index = True, how = 'left', sort = False)
result

Unnamed: 0,a,b,key,c,d
0,a0,b0,k0,c0,d0
1,a1,b1,k1,c1,d1
2,a2,b2,k0,c0,d0
3,a3,b3,k1,c1,d1


In [20]:
left.join(right, on = 'key')

Unnamed: 0,a,b,key,c,d
0,a0,b0,k0,c0,d0
1,a1,b1,k1,c1,d1
2,a2,b2,k0,c0,d0
3,a3,b3,k1,c1,d1


To join on multiple keys, the passed DataFrame must have a MultiIndex:

In [21]:
left = pd.DataFrame({'A' : ['A0', 'A1', 'A2', 'A3'],
                     'B' : ['B0', 'B1', 'B2', 'B3'], 
                     'key1' : ['K0', 'K0', 'K1', 'K2'],
                     'key2' : ['K0', 'K1', 'K0', 'K1']})

index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
                                   ('K2', 'K0'), ('K2', 'K1')])
left

Unnamed: 0,A,B,key1,key2
0,A0,B0,K0,K0
1,A1,B1,K0,K1
2,A2,B2,K1,K0
3,A3,B3,K2,K1


In [22]:
right = pd.DataFrame({'C' : ['C0', 'C1', 'C2', 'C3'],
                      'D' : ['D0', 'D1', 'D2', 'D3']},
                      index = index)
right

Unnamed: 0,Unnamed: 1,C,D
K0,K0,C0,D0
K1,K0,C1,D1
K2,K0,C2,D2
K2,K1,C3,D3


In [23]:
result = left.join (right, on = ['key1', 'key2'])
result

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A1,B1,K0,K1,,
2,A2,B2,K1,K0,C1,D1
3,A3,B3,K2,K1,C3,D3


In [24]:
result = pd.merge(left, right, left_on = ['key1', 'key2'], right_index = True, how = 'left')
result

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A1,B1,K0,K1,,
2,A2,B2,K1,K0,C1,D1
3,A3,B3,K2,K1,C3,D3


One can join multiple DataFrames at the same time.

In [25]:
left = pd.DataFrame({'k' : ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
left = left.set_index('k')
left

Unnamed: 0_level_0,v
k,Unnamed: 1_level_1
K0,1
K1,2
K2,3


In [26]:
right = pd.DataFrame({'k' : ['K0', 'K0', 'K3'], 'v' : [4, 5, 6]})
right = right.set_index('k')
right

Unnamed: 0_level_0,v
k,Unnamed: 1_level_1
K0,4
K0,5
K3,6


In [27]:
right2 = pd.DataFrame({'v' : [7, 8, 9]}, index = ['K1', 'K1', 'K2'])
right2

Unnamed: 0,v
K1,7
K1,8
K2,9


In [28]:
result = left.join([right, right2])
result

Unnamed: 0,v_x,v_y,v
K0,1.0,4.0,
K0,1.0,5.0,
K1,2.0,,7.0
K1,2.0,,8.0
K2,3.0,,9.0
K3,,6.0,


### Concatenating along the axis
Concatenation usually has to address several questions:
* If the objects are indexed differently on the other axis, should the collection of axes be unioned or intersected?
* Do the groups need to be identifiable in the resulting object?
* Does the concatenation axis matter at all?

In [29]:
df1 = pd.DataFrame({'a' :['a0', 'a1', 'a2'],
                    'b' :['b0', 'b1', 'b2'],
                    'c' :['c0', 'c1', 'c2']})
df1

Unnamed: 0,a,b,c
0,a0,b0,c0
1,a1,b1,c1
2,a2,b2,c2


In [30]:
df2 = pd.DataFrame({'a' : ['a3', 'a4', 'a5'],
                    'b' : ['b3', 'b4', 'b5'],
                    'c' : ['c3', 'c4', 'c5']})
df2

Unnamed: 0,a,b,c
0,a3,b3,c3
1,a4,b4,c4
2,a5,b5,c5


In [31]:
df3 = pd.DataFrame({'a' : ['a6', 'a7', 'a8'],
                    'b' : ['b6', 'b7', 'b8'],
                    'c' : ['c6', 'c7', 'c8']})
df3

Unnamed: 0,a,b,c
0,a6,b6,c6
1,a7,b7,c7
2,a8,b8,c8


In [32]:
pd.concat([df1, df2, df3])

Unnamed: 0,a,b,c
0,a0,b0,c0
1,a1,b1,c1
2,a2,b2,c2
0,a3,b3,c3
1,a4,b4,c4
2,a5,b5,c5
0,a6,b6,c6
1,a7,b7,c7
2,a8,b8,c8


In [33]:
pd.concat([df1, df2, df3], axis = 1)

Unnamed: 0,a,b,c,a.1,b.1,c.1,a.2,b.2,c.2
0,a0,b0,c0,a3,b3,c3,a6,b6,c6
1,a1,b1,c1,a4,b4,c4,a7,b7,c7
2,a2,b2,c2,a5,b5,c5,a8,b8,c8


By default, concatenation performs an _outer_ join on the other axis; an _inner_ join can be chosen by specifying __join="inner"__.

The concatenated pieces are not identifiable in the result by default. To identify the original pieces, one can pass __keys__, one for each piece, which creates a hierarchical index on the concatenation axis.

In [34]:
result = pd.concat([df1, df2, df3], keys = ['x', 'y', 'z'])
result

Unnamed: 0,Unnamed: 1,a,b,c
x,0,a0,b0,c0
x,1,a1,b1,c1
x,2,a2,b2,c2
y,0,a3,b3,c3
y,1,a4,b4,c4
y,2,a5,b5,c5
z,0,a6,b6,c6
z,1,a7,b7,c7
z,2,a8,b8,c8


In [35]:
result.loc['z']

Unnamed: 0,a,b,c
0,a6,b6,c6
1,a7,b7,c7
2,a8,b8,c8


In some situations the row index is not meaningful in the context of the analysis; passing __ignore_index=True__ discards it:

In [36]:
import numpy as np
df1a  = pd.DataFrame(np.random.randn(3, 4), columns = ['a', 'b', 'c', 'd'])
df1a

Unnamed: 0,a,b,c,d
0,0.56033,1.413733,-0.196134,0.534316
1,0.37832,1.689136,0.057764,-0.659872
2,0.167346,1.694023,-2.566344,-1.390775


In [37]:
df2a = pd.DataFrame(np.random.randn(2, 3), columns = ['b', 'd', 'a'])
df2a

Unnamed: 0,b,d,a
0,-1.906159,-0.798266,0.557759
1,0.898787,-1.602993,0.252979


In [38]:
pd.concat([df1a, df2a], ignore_index = True)

Unnamed: 0,a,b,c,d
0,0.56033,1.413733,-0.196134,0.534316
1,0.37832,1.689136,0.057764,-0.659872
2,0.167346,1.694023,-2.566344,-1.390775
3,0.557759,-1.906159,,-0.798266
4,0.252979,0.898787,,-1.602993
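
When the pieces have different columns, the outer join used by default keeps the union of columns and fills the gaps with NaN. The sketch below contrasts this with `join="inner"`; the frames `df_a` and `df_b` are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share only column "b".
df_a = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df_b = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Outer join (default): union of columns, missing cells become NaN.
outer = pd.concat([df_a, df_b], ignore_index=True)

# Inner join: only the columns present in every piece survive.
inner = pd.concat([df_a, df_b], join="inner", ignore_index=True)

print(outer)
print(inner)
```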


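Series and DataFrame also have an instance method, __append__, that concatenates along the rows. It was deprecated in pandas 1.4 and removed in pandas 2.0, where __pd.concat__ should be used instead.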

In [39]:
result = df1.append(df2)
result

Unnamed: 0,a,b,c
0,a0,b0,c0
1,a1,b1,c1
2,a2,b2,c2
0,a3,b3,c3
1,a4,b4,c4
2,a5,b5,c5


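### Combining data with overlap
Two datasets may have indexes that overlap in full or in part, so one has to decide which values to keep when combining them. __combine_first__ keeps the values of the calling object and fills its missing values from the other object.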

In [40]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index = ['f', 'e', 'd', 'c', 'b', 'a'])
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [41]:
b = pd.Series(np.arange(len(a), dtype = np.float64),
              index = ['f', 'e', 'd', 'c', 'b', 'a'])
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    5.0
dtype: float64

In [42]:
b.combine_first(a)

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    5.0
dtype: float64
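
Because `b` has no missing values, the call above returns `b` unchanged. Calling the method the other way round shows the filling behaviour more clearly. This sketch rebuilds the same two Series so that it is self-contained.

```python
import numpy as np
import pandas as pd

labels = ["f", "e", "d", "c", "b", "a"]
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)

# Keep the values of `a` and fill its NaN entries from `b`.
filled = a.combine_first(b)
print(filled)
```

For two Series with identical indexes, as here, this gives the same values as `a.fillna(b)`.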

In [43]:
df1 = pd.DataFrame({'a' : [1., np.nan, 5., np.nan],
                    'b' : [np.nan, 2., np.nan, 6.],
                    'c' : range(2, 18, 4)})
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [44]:
df2 = pd.DataFrame({'a' : [5., 4., np.nan, 3., 7.],
                    'b' : [np.nan, 3., 4., 6., 8.]})
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [45]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## Reshaping
There are two primary actions for rearranging data in a DataFrame with hierarchical indexing:
* __stack__: this _rotates_ or pivots from the columns in the data to the rows
* __unstack__: this pivots from the rows into the columns

In [46]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index = pd.Index(['Michigan', 'Washington'], name = 'state'),
                    columns = pd.Index(['one', 'two', 'three'], name = 'number'))
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Michigan,0,1,2
Washington,3,4,5


In [47]:
result = data.stack()
result

state       number
Michigan    one       0
            two       1
            three     2
Washington  one       3
            two       4
            three     5
dtype: int64

In [48]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Michigan,0,1,2
Washington,3,4,5


In [49]:
result.unstack('state')

state,Michigan,Washington
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


Unstacking may introduce missing data:

In [50]:
t1 = pd.Series([3, 4, 5], index = ['CA', 'OH', 'WI'])
t2 = pd.Series([10, 48, -9], index = ['NY', 'WA', 'CA'])
t3 = pd.concat([t1, t2], keys = ['first', 'second'])
t3

first   CA     3
        OH     4
        WI     5
second  NY    10
        WA    48
        CA    -9
dtype: int64

In [51]:
t3.unstack()

Unnamed: 0,CA,NY,OH,WA,WI
first,3.0,,4.0,,5.0
second,-9.0,10.0,,48.0,


Stacking, on the other hand, filters out missing data by default.

In [52]:
t3.unstack().stack()

first   CA     3.0
        OH     4.0
        WI     5.0
second  CA    -9.0
        NY    10.0
        WA    48.0
dtype: float64

Multiple time series in databases are often stored in a _long_ or _stacked_ format:

In [53]:
import pandas.util.testing as tm; tm.N = 3
def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.values.ravel('F'), 
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns = ['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())
df

Unnamed: 0,date,variable,value
0,2000-01-03,A,-0.189719
1,2000-01-04,A,-0.34892
2,2000-01-05,A,0.658716
3,2000-01-03,B,0.343313
4,2000-01-04,B,-0.246592
5,2000-01-05,B,1.016401
6,2000-01-03,C,1.418991
7,2000-01-04,C,-1.714499
8,2000-01-05,C,1.190816
9,2000-01-03,D,-0.09276


A better representation would be where the _columns_ are the unique values and an _index_ of dates identifies individual observations.

In [54]:
df.pivot(index = 'date', columns = 'variable', values = 'value')

variable,A,B,C,D
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-03,-0.189719,0.343313,1.418991,-0.09276
2000-01-04,-0.34892,-0.246592,-1.714499,-0.192576
2000-01-05,0.658716,1.016401,1.190816,0.367835


It is a shortcut for creating a hierarchical index using __set_index__ and reshaping with __unstack__.

In [55]:
unstacked = df.set_index(['date', 'variable']).unstack('variable')
unstacked

Unnamed: 0_level_0,value,value,value,value
variable,A,B,C,D
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2000-01-03,-0.189719,0.343313,1.418991,-0.09276
2000-01-04,-0.34892,-0.246592,-1.714499,-0.192576
2000-01-05,0.658716,1.016401,1.190816,0.367835
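
The equivalence between __pivot__ and __set_index__ followed by __unstack__ can be checked on a small hand-built frame. The data below is hypothetical and stands in for the generated time series used above. Selecting the `value` column before unstacking avoids the extra column level seen in the previous output.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, variable) pair.
long_df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04"] * 2,
    "variable": ["A", "A", "B", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Long to wide with pivot.
wide = long_df.pivot(index="date", columns="variable", values="value")

# The same reshape spelled out with set_index and unstack.
via_unstack = long_df.set_index(["date", "variable"])["value"].unstack("variable")

print(wide)
print(via_unstack)
```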


## Data Transformation
### Remove duplicates
Pandas provides methods to identify duplicated rows and to remove them.

In [56]:
data = pd.DataFrame({'a' : ['test'] * 4 + ['duplicated'] * 2,
                     'b' : [1]*3 + [2] * 2 + [9]})
data

Unnamed: 0,a,b
0,test,1
1,test,1
2,test,1
3,test,2
4,duplicated,2
5,duplicated,9


In [57]:
data.duplicated()

0    False
1     True
2     True
3    False
4    False
5    False
dtype: bool

In [58]:
data.drop_duplicates()

Unnamed: 0,a,b
0,test,1
3,test,2
4,duplicated,2
5,duplicated,9


__drop_duplicates__ considers all columns by default. To filter duplicates based on a single column, pass the column name to the method.

In [59]:
data.drop_duplicates(['a'])

Unnamed: 0,a,b
0,test,1
4,duplicated,2
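
By default the first occurrence of each duplicate is kept. The `keep` argument changes this; the sketch below rebuilds the same frame and keeps the last occurrence per value of column `a` instead.

```python
import pandas as pd

data = pd.DataFrame({"a": ["test"] * 4 + ["duplicated"] * 2,
                     "b": [1] * 3 + [2] * 2 + [9]})

# Keep the last row for each distinct value in column "a".
last_per_a = data.drop_duplicates(subset=["a"], keep="last")
print(last_per_a)
```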


### Transform data via a function or mapping
In pandas, a Series has a __map__ method that accepts a function or a dictionary and applies it to each value.

In [60]:
info = pd.DataFrame({'movie' : ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight', '12 Angry Men',
                                'Schindler\'s List', 'Pulp Fiction', 'The lord of the Rings: The Return of the King'],
                     'rate' : [9.5, 9.2, 9.0, 8.5, 8.5, 8.0,8.0]})
info

Unnamed: 0,movie,rate
0,The Shawshank Redemption,9.5
1,The Godfather,9.2
2,The Dark Knight,9.0
3,12 Angry Men,8.5
4,Schindler's List,8.5
5,Pulp Fiction,8.0
6,The lord of the Rings: The Return of the King,8.0


In [61]:
rate_to_review = {
    9.5: 'excellent',
    9.2: 'very good',
    9.0: 'good',
    8.5: 'above average',
    8.0: "average"
}
info['Review'] = info['rate'].map(rate_to_review)
info

Unnamed: 0,movie,rate,Review
0,The Shawshank Redemption,9.5,excellent
1,The Godfather,9.2,very good
2,The Dark Knight,9.0,good
3,12 Angry Men,8.5,above average
4,Schindler's List,8.5,above average
5,Pulp Fiction,8.0,average
6,The lord of the Rings: The Return of the King,8.0,average
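
__map__ also accepts a function, and values missing from a dictionary become NaN. The following sketch uses a hypothetical threshold of 9.0 and a deliberately incomplete dictionary to show both behaviours.

```python
import pandas as pd

rates = pd.Series([9.5, 9.2, 8.0])

# Mapping with a function (the 9.0 threshold is an arbitrary example).
labels = rates.map(lambda r: "high" if r >= 9.0 else "normal")

# Mapping with a dictionary that lacks some keys: unmatched values become NaN.
partial = rates.map({9.5: "excellent"})

print(labels)
print(partial)
```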


For plain value replacement, the simpler __replace__ method can be used instead of __map__.

In [62]:
data = pd.Series([1, 0, 2, 0, 1])
data

0    1
1    0
2    2
3    0
4    1
dtype: int64

In [63]:
data.replace([0, 1, 2], ['test', 'exam', 'pass'])

0    exam
1    test
2    pass
3    test
4    exam
dtype: object

In [64]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index = ['California', 'New York', 'Washington'],
                    columns = ['t1', 't2', 't3', 't4'])
data

Unnamed: 0,t1,t2,t3,t4
California,0,1,2,3
New York,4,5,6,7
Washington,8,9,10,11


In [65]:
data.index = data.index.map(str.upper)
data

Unnamed: 0,t1,t2,t3,t4
CALIFORNIA,0,1,2,3
NEW YORK,4,5,6,7
WASHINGTON,8,9,10,11


In [66]:
data.rename(index = str.title, columns = str.upper)

Unnamed: 0,T1,T2,T3,T4
California,0,1,2,3
New York,4,5,6,7
Washington,8,9,10,11


### Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis.

In [67]:
rates = [8.0, 8.5, 9.0, 9.5]
cats = pd.cut(info['rate'], rates)
cats

0    (9, 9.5]
1    (9, 9.5]
2    (8.5, 9]
3    (8, 8.5]
4    (8, 8.5]
5         NaN
6         NaN
Name: rate, dtype: category
Categories (3, object): [(8, 8.5] < (8.5, 9] < (9, 9.5]]

The __cut__ function returns categorical data (here a Series with the _category_ dtype), which corresponds to a categorical variable in statistics and exposes the _categories_ and _ordered_ properties.

In [68]:
cats.cat.categories

Index(['(8, 8.5]', '(8.5, 9]', '(9, 9.5]'], dtype='object')

In [69]:
cats.cat.ordered

True

In [70]:
pd.value_counts(cats)

(9, 9.5]    2
(8, 8.5]    2
(8.5, 9]    1
Name: rate, dtype: int64
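
In the output of the first `cut` call, the two rows with rate 8.0 are NaN because the bins are open on the left by default, so 8.0 falls outside `(8, 8.5]`. A minimal sketch of one way to include the lowest edge and to name the bins follows; the labels `low`, `mid` and `high` are arbitrary.

```python
import pandas as pd

rates = pd.Series([9.5, 9.2, 9.0, 8.5, 8.5, 8.0, 8.0])
bins = [8.0, 8.5, 9.0, 9.5]

# include_lowest=True makes the first interval closed on the left,
# and labels replaces the interval notation with readable names.
cats = pd.cut(rates, bins, include_lowest=True, labels=["low", "mid", "high"])
print(cats)
```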

One can specify equal-length bins based on the minimum and maximum values in the data.

In [71]:
data = np.random.rand(20)
pd.cut(data, 5, precision=2)

[(0.76, 0.9], (0.76, 0.9], (0.76, 0.9], (0.47, 0.61], (0.47, 0.61], ..., (0.18, 0.33], (0.18, 0.33], (0.47, 0.61], (0.47, 0.61], (0.33, 0.47]]
Length: 20
Categories (5, object): [(0.18, 0.33] < (0.33, 0.47] < (0.47, 0.61] < (0.61, 0.76] < (0.76, 0.9]]

The __qcut__ function bins the data based on sample quantiles.

In [72]:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats

[(0.0345, 0.687], (0.0345, 0.687], (-0.678, 0.0345], [-2.629, -0.678], (-0.678, 0.0345], ..., (0.0345, 0.687], [-2.629, -0.678], (0.687, 3.0557], [-2.629, -0.678], (0.687, 3.0557]]
Length: 1000
Categories (4, object): [[-2.629, -0.678] < (-0.678, 0.0345] < (0.0345, 0.687] < (0.687, 3.0557]]

In [73]:
pd.value_counts(cats)

(0.687, 3.0557]     250
(0.0345, 0.687]     250
(-0.678, 0.0345]    250
[-2.629, -0.678]    250
dtype: int64

### Detecting and Filtering outliers
Using conditional filtering, one can detect and filter outliers.

In [74]:
np.random.seed(123)
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.007502,0.03916,-0.010286,0.024285
std,0.977024,0.973484,1.01223,0.970421
min,-3.167055,-2.920029,-3.801378,-3.231055
25%,-0.662012,-0.63616,-0.687717,-0.599195
50%,-0.024843,0.062549,0.007035,0.038718
75%,0.61395,0.672448,0.664586,0.683228
max,3.050755,2.850708,2.766603,3.571579


In [75]:
col = data[3]
col[np.abs(col) > 3]

48    -3.231055
182    3.571579
Name: 3, dtype: float64

In [77]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.007386,0.03916,-0.008831,0.023944
std,0.97634,0.973484,1.007422,0.967746
min,-3.0,-2.920029,-3.0,-3.0
25%,-0.662012,-0.63616,-0.687717,-0.599195
50%,-0.024843,0.062549,0.007035,0.038718
75%,0.61395,0.672448,0.664586,0.683228
max,3.0,2.850708,2.766603,3.0
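
Capping values to an interval can also be written with the `clip` method, and rows containing at least one outlier can be selected with `any`. This is an alternative sketch, not the approach used in the cells above; the threshold of 3 is kept from the example.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
data = pd.DataFrame(np.random.randn(1000, 4))

# Rows where any column exceeds 3 in absolute value.
outlier_rows = data[(data.abs() > 3).any(axis=1)]

# Cap all values to the interval [-3, 3] without modifying `data`.
capped = data.clip(lower=-3, upper=3)

print(len(outlier_rows))
print(capped.describe())
```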


### Permutation and Random Sampling

In [78]:
sampler = np.random.permutation(5)
sampler

array([3, 4, 1, 2, 0])

In [79]:
dt = pd.DataFrame(np.arange(6* 4).reshape(6, 4))

To select a random subset without replacement, one way is to slice off the first __k__ elements of the array returned by __permutation__.

In [81]:
dt.take(np.random.permutation(len(dt))[:4])

Unnamed: 0,0,1,2,3
5,20,21,22,23
2,8,9,10,11
4,16,17,18,19
1,4,5,6,7


To generate a sample _with_ replacement, __np.random.randint__ can be used to draw random integers.

In [82]:
sampler = np.random.randint(0, len(dt), size=10)
sampler

array([5, 4, 3, 0, 0, 4, 0, 4, 5, 0])

In [83]:
draws = dt.take(sampler)
dt

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19
5,20,21,22,23
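
pandas also offers the `sample` method, which covers both cases directly. The sketch below is an alternative to the `permutation` and `randint` approach; the `random_state` value is arbitrary and only makes the draw reproducible.

```python
import numpy as np
import pandas as pd

dt = pd.DataFrame(np.arange(6 * 4).reshape(6, 4))

# Four distinct rows, drawn without replacement.
without_repl = dt.sample(n=4, random_state=0)

# Ten rows drawn with replacement, so rows may repeat.
with_repl = dt.sample(n=10, replace=True, random_state=0)

print(without_repl)
print(with_repl)
```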


### Computing Indicator or Dummy variables
In machine learning applications, it is very common to convert a categorical variable into a __dummy__ or __indicator__ matrix. If a column in a DataFrame has __k__ distinct values, one derives a matrix or DataFrame with __k__ columns containing only 1's and 0's.

In [84]:
df = pd.DataFrame({'key': ['ta', 'ca', 'ta', 'ba', 'ma', 'ca'],
                   'value': range(6)})
df

Unnamed: 0,key,value
0,ta,0
1,ca,1
2,ta,2
3,ba,3
4,ma,4
5,ca,5


In [85]:
df_with_dummy = df[['value']].join(pd.get_dummies(df['key'], prefix = 'key'))
df_with_dummy

Unnamed: 0,value,key_ba,key_ca,key_ma,key_ta
0,0,0.0,0.0,0.0,1.0
1,1,0.0,1.0,0.0,0.0
2,2,0.0,0.0,0.0,1.0
3,3,1.0,0.0,0.0,0.0
4,4,0.0,0.0,1.0,0.0
5,5,0.0,1.0,0.0,0.0
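
The dtype of the indicator columns depends on the pandas version (the output above shows floats, while recent versions return booleans). Passing `dtype` explicitly, as in this sketch, makes the result independent of that default.

```python
import pandas as pd

df = pd.DataFrame({"key": ["ta", "ca", "ta", "ba", "ma", "ca"],
                   "value": range(6)})

# Request integer 0/1 indicators regardless of the pandas default.
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)
df_with_dummy = df[["value"]].join(dummies)
print(df_with_dummy)
```

Each row has exactly one indicator set to 1, because every row belongs to exactly one category.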


## String Manipulation
Python has many built-in methods and functions for working with strings, and pandas exposes vectorized versions of them through the __str__ accessor of a Series.

In [87]:
import re
text = "foo   bar\t baz  \tqux"
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [88]:
s1 = {'Dave': 'dave@gmail.com', 'Eric': 'eric@gmail.com',
      'Rob': 'rob@hotmail.com', 'Will': np.nan}
data = pd.Series(s1)
data

Dave     dave@gmail.com
Eric     eric@gmail.com
Rob     rob@hotmail.com
Will                NaN
dtype: object

In [89]:
data.str.contains('gmail')

Dave     True
Eric     True
Rob     False
Will      NaN
dtype: object

In [93]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})'
matches = data.str.findall(pattern, flags = re.IGNORECASE)
matches

Dave     [(dave, gmail.com)]
Eric     [(eric, gmail.com)]
Rob     [(rob, hotmail.com)]
Will                     NaN
dtype: object

In [95]:
matches.str[0]

Dave     (dave, gmail.com)
Eric     (eric, gmail.com)
Rob     (rob, hotmail.com)
Will                   NaN
dtype: object
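
When each string contains at most one match, `str.extract` returns the captured groups directly as DataFrame columns, which avoids indexing into lists of tuples. This sketch reuses the same pattern; the column names `user` and `domain` are assigned here for readability and are not part of the original example.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@gmail.com", "Eric": "eric@gmail.com",
                  "Rob": "rob@hotmail.com", "Will": np.nan})
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})"

# One column per capture group; rows without a match are NaN.
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ["user", "domain"]
print(parts)
```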