In [2]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np



# Introduction to pandas Data Structures

## Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

In [3]:
obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Get Series values and index attributes:

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Create a Series with an index identifying each data point:

In [6]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [7]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Index values can be used to select single values or a set of values.

In [8]:
obj2['a']

-5

In [9]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

NumPy array operations preserve the index-value link.
Filtering with a boolean array:

In [10]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

Scalar multiplication:

In [11]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

Applying math functions:

In [12]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can be thought of as a fixed-length, ordered dict, since it maps index values to data values.  
It can be substituted into many functions that expect a dict.

In [13]:
'b' in obj2

True

Create a Series from Python dict by passing the dict:

In [14]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

This results in the dict keys being the index values of the Series.

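Creating a Series from a dict while also specifying an index keeps only the keys found in that index; index labels that are missing from the dict get NaN: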

In [15]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
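
The NaN entry above marks a label without data. The following is a minimal sketch, not part of the original notebook, of how such entries could be detected with the `isna` and `notna` methods; the names `missing` and `present` are chosen only for illustration.

```python
import pandas as pd

# Same data as above: 'California' has no entry in the dict, so it becomes NaN.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

missing = obj4.isna()           # boolean Series, True where the value is NaN
present = obj4[obj4.notna()]    # keep only the labels that have data
print(missing)
print(present)
```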

A critical Series feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations:

In [16]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object and its index have a `name` attribute, which integrates with other key areas of pandas functionality:

In [17]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment:

In [18]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## DataFrame
### Creating DataFrames

Create DataFrame from a dict of equal-length lists or NumPy arrays:

In [19]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


This results in a DataFrame with an index assigned automatically, as with Series, and with the columns in the order of the dict keys (older pandas versions sorted them alphabetically).  
Passing a sequence of columns arranges them in exactly that order:

In [20]:
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


A column that is listed in `columns` but not contained in the data appears with missing values:

In [21]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [22]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

### Retrieving Data from DataFrames

By dict-like notation:

In [23]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

By attribute:

In [24]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

Rows can be retrieved by label with the `loc` indexer (or by position with `iloc`):

In [25]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
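
As a small sketch of the two indexers side by side, the block below rebuilds `frame2` from the cells above and selects the same row once by label and once by position. The variable names `by_label`, `by_position`, and `cell` are illustrative only.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2.loc['three']      # row whose index label is 'three'
by_position = frame2.iloc[2]        # third row, counting from 0
cell = frame2.loc['three', 'pop']   # single cell: row label, then column label
print(by_label)
print(cell)
```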

### Modify Data by Assignment

In [26]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [27]:
frame2['debt'] = np.arange(5.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0


When assigning lists or arrays to a column, the length must match the length of the DataFrame.  
Assigning a Series will conform exactly to the DataFrame's index, inserting missing values in any holes:

In [28]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
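
The difference between the two assignment rules can be sketched on a small, made-up frame (the label `'nine'` and the values are hypothetical): a list of the wrong length is rejected, while a Series is aligned on its labels.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

# A plain list has no labels, so its length has to match the number of rows.
try:
    frame['debt'] = [1.0, 2.0]
except ValueError as exc:
    print('rejected:', exc)

# A Series is aligned on its labels instead: rows without a match become NaN,
# and labels that do not occur in the frame ('nine') are ignored.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'nine'])
print(frame)
```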


Assigning a column that doesn't exist creates a new column:

In [29]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


Columns can be deleted with the `del` keyword, as with a dict:

In [30]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dict of dicts:

In [31]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

Passing the above to a DataFrame will interpret the outer keys as the columns and inner keys as the row indices:

In [32]:
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


The result can be transposed:

In [33]:
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


Dicts of Series are treated the same way:

In [34]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


In [35]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


The 'values' attribute returns the data contained in the DataFrame as a 2D ndarray:

In [36]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

## Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [37]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [38]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable: their elements cannot be changed after creation. This matters because it allows Index objects to be shared safely among data structures:

In [39]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

True
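
A brief sketch of what immutability means in practice: item assignment on an Index raises a `TypeError`, and methods that appear to change an Index return a new object. The variable `longer` is introduced here only for illustration.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'              # item assignment is not supported
except TypeError as exc:
    print('rejected:', exc)

# Methods that "change" an Index return a new object and leave the original alone.
longer = index.append(pd.Index(['d']))
print(index)
print(longer)
```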

There are multiple kinds of Index objects and each has a number of methods and properties for set logic and answering other common questions about the data it contains:

In [40]:
pd.options.display.max_colwidth = 100

methods = ['append', 'difference', 'intersection', 'union', 'isin', 'delete', 'drop', 'insert', 'is_monotonic_increasing', 'is_unique']
description = ['Concatenate with additional Index objects, producing a new Index',
               'Compute set difference as an Index',
               'Compute set intersection',
               'Compute set union',
               'Compute boolean array indicating whether each value is contained in the passed collection',
               'Compute new Index with element at index i deleted',
               'Compute new Index by deleting passed values',
               'Compute new Index by inserting element at index i',
               'Returns True if each element is greater than or equal to the previous element',
               'Returns True if the Index has no duplicate values']
index_objects = DataFrame({'Method': methods, 'Description': description})
index_objects

Unnamed: 0,Method,Description
0,append,"Concatenate with additional Index objects, producing a new Index"
1,difference,Compute set difference as an Index
2,intersection,Compute set intersection
3,union,Compute set union
4,isin,Compute boolean array indicating whether each value is contained in the passed collection
5,delete,Compute new Index with element at index i deleted
6,drop,Compute new Index by deleting passed values
7,insert,Compute new Index by inserting element at index i
8,is_monotonic_increasing,Returns True if each element is greater than or equal to the previous element
9,is_unique,Returns True if the Index has no duplicate values
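
To make a few of these methods concrete, here is a short sketch on two made-up Index objects, `left` and `right` (both names are illustrative):

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))          # labels found in either Index
print(left.intersection(right))   # labels found in both
print(left.difference(right))     # labels found only in `left`
print(left.isin(['a', 'c']))      # boolean NumPy array, one entry per label
print(left.insert(1, 'z'))        # new Index with 'z' at position 1
print(left.is_unique)             # no duplicate labels in `left`
```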


# Essential Functionality
## Reindexing
Creating a new object with the data conformed to a new index:

In [41]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [42]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [43]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [44]:
# do some interpolation e.g. for time series data
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

For DataFrames, 'reindex' can alter either the row index, columns, or both:

In [45]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                    columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [46]:
# reindex the rows by passing a sequence 
frame2 = frame.reindex(['a', 'b', 'c', 'd']) 
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [47]:
# reindex the columns using the columns keyword
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


## Dropping entries from an axis
The drop method will return a new object with the indicated value or values deleted from an axis.

In [48]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [49]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [50]:
# delete index values from either axis for DataFrames
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio']) # axis 0 by default

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [51]:
data.drop('two', axis=1) # axis 1 for columns

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [52]:
data.drop(['two', 'four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
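
Because `drop` returns a new object, the original stays untouched. The sketch below illustrates this and also uses the `index=` and `columns=` keywords, which are an alternative to passing `axis`; the name `trimmed` is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Drop one row and one column in a single call, without an axis argument.
trimmed = data.drop(index=['Ohio'], columns=['two'])
print(trimmed)
print(data.shape)   # the original frame still has all rows and columns
```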


## Indexing, Selection, and Filtering

Series indexing works like NumPy array indexing, but one can use the Series's index values instead of only integers:

In [53]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [54]:
obj['b']

1.0

Note: selecting by position should go through `.iloc`. Treating an integer key such as `obj[1]` as a position on a Series with a non-integer index is deprecated in recent pandas versions.

In [55]:
obj.iloc[1]

1.0

In [56]:
obj.iloc[2:4]

c    2.0
d    3.0
dtype: float64

In [57]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [58]:
obj.iloc[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [59]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels is different to standard Python slicing since the endpoint is inclusive:

In [60]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [61]:
# setting values using slicing
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
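
The contrast between label-based and position-based slicing can be sketched as follows, using the explicit `loc` and `iloc` indexers (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: the endpoint 'c' is included
by_position = obj.iloc[1:2]     # positional slice: the endpoint 2 is excluded
print(by_label)
print(by_position)
```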

Indexing into a DataFrame:

In [62]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [63]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [64]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


### Special Cases
Selecting Rows by Slicing or Boolean Arrays

In [65]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [66]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


A DataFrame can also be indexed with a boolean DataFrame, such as one produced by a scalar comparison:

In [67]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [68]:
data[data < 5] = 0 # set all values less than 5 to 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


## Arithmetic and Data Alignment
When adding objects together, if their indexes differ, the index of the result is the union of the two indexes. Labels that are not present in both objects produce NaN:

In [69]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [70]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [71]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [72]:
# adding both Series together
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

For DataFrames, alignment is performed on both the rows and columns:

In [73]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [74]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [75]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [76]:
# adding DataFrames together results in the union of the indexes and columns
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


### Arithmetic Methods with Fill Values

In [77]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [78]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [79]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [80]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


The `add` method on `df1` accepts another DataFrame together with a fill value to use where one of the two operands has no value.  
The fill value only replaces an entry that is missing in one DataFrame; as long as the other DataFrame has a value at that position, the arithmetic operation still produces a result.

In [81]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
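
One case the example above does not show is a position that is missing in both operands. The sketch below uses two small, made-up frames (`left` and `right`) to illustrate that such a cell stays NaN even with `fill_value`:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [1.0, np.nan]}, index=['x', 'y'])
right = pd.DataFrame({'a': [np.nan, np.nan], 'b': [10.0, 20.0]}, index=['x', 'y'])

total = left.add(right, fill_value=0)
print(total)
# ('x', 'a'): only `right` is missing, so 0 is substituted and the result is 1.0
# ('y', 'a'): missing in both frames, so the result stays NaN
# column 'b': absent from `left`, so 0 is substituted for every row
```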


Similarly, it is possible to reindex a Series/DataFrame while specifying a fill value for missing indices.

In [82]:
df1.reindex(columns=df2.columns, fill_value=0) # reindexing columns

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


In [83]:
df1.reindex(index=df2.index, columns=df2.columns, fill_value=0) # reindexing both rows and columns

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0.0
1,4.0,5.0,6.0,7.0,0.0
2,8.0,9.0,10.0,11.0,0.0
3,0.0,0.0,0.0,0.0,0.0


In [84]:
methods = ['add', 'sub', 'div', 'mul']
description = ['Method for addition',
               'Method for subtraction',
               'Method for division',
               'Method for multiplication']
arithmetic_methods = DataFrame({'Method': methods, 'Description': description})
arithmetic_methods

Unnamed: 0,Method,Description
0,add,Method for addition
1,sub,Method for subtraction
2,div,Method for division
3,mul,Method for multiplication
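
Each of these methods also has a reversed counterpart with an `r` prefix (for example `rdiv`), which swaps the operands. This is not covered in the table above; a minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)), columns=list('ab'))

reciprocal = df.rdiv(1)     # computes 1 / df rather than df / 1
print(reciprocal)
```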


### Operations between DataFrame and Series
Consider the difference between a 2D array and one of its rows:

In [85]:
arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [86]:
arr[0]

array([0., 1., 2., 3.])

In [87]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

This is referred to as __broadcasting__. Operations between a DataFrame and a Series are similar.

In [88]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon']) 
series = frame.iloc[0]

In [89]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [90]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series __matches the index of the Series on the DataFrame's columns__, broadcasting down the rows.

In [91]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is not found in either the DataFrame columns or the Series index, the objects will be reindexed to form the union.

In [92]:
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


Broadcasting over the columns while matching on the rows requires one of the DataFrame arithmetic methods with `axis=0`:

In [93]:
series3 = frame['d']
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [94]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [95]:
frame.sub(series3, axis=0) # match on the row index and broadcast across the columns

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
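
A typical use of this pattern is centering each row around its mean. The sketch below is an illustration with the names `row_means`, `centered`, and `misaligned` chosen here; it also shows why the plain `-` operator does not work for this case.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = frame.mean(axis=1)                  # one value per row label
centered = frame.sub(row_means, axis='index')   # 'index' is an alias for axis=0
print(centered)

# The plain operator matches the Series labels against the column names,
# finds no overlap, and therefore yields only NaN.
misaligned = frame - row_means
print(misaligned.isna().all().all())
```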


## Function Application and Mapping

NumPy element-wise array methods work fine with pandas objects:

In [96]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-1.436331,1.265053,0.260316
Ohio,1.087041,-0.467244,0.658596
Texas,0.34441,0.146711,-1.163123
Oregon,0.012006,-0.42512,0.785075


In [97]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.436331,1.265053,0.260316
Ohio,1.087041,0.467244,0.658596
Texas,0.34441,0.146711,1.163123
Oregon,0.012006,0.42512,0.785075


Applying a function on 1D arrays to each column or row can be done through DataFrame's `apply` method.

In [98]:
f = lambda x: x.max() - x.min()

In [99]:
frame.apply(f) # apply the function to each column

b    2.523373
d    1.732297
e    1.948197
dtype: float64

In [100]:
frame.apply(f, axis=1) # apply the function to each row

Utah      2.701384
Ohio      1.554285
Texas     1.507533
Oregon    1.210194
dtype: float64
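
Many common reductions are already available as DataFrame methods, so `apply` is often not needed. As a sketch (with a seeded random generator, used here only to make the example reproducible), the range per column can be computed both ways:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

via_apply = frame.apply(lambda x: x.max() - x.min())
via_methods = frame.max() - frame.min()   # same result from built-in reductions
print(via_apply)
print(via_methods)
```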

The function passed to `apply` need not return a scalar value; it can also return a Series with multiple values:

In [101]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.436331,-0.467244,-1.163123
max,1.087041,1.265053,0.785075


Element-wise Python functions can be applied to every value of a DataFrame with the `map` method (named `applymap` before pandas 2.1):

In [102]:
format = lambda x: '%.2f' % x # format the values to 2 decimal places
frame.map(format)

Unnamed: 0,b,d,e
Utah,-1.44,1.27,0.26
Ohio,1.09,-0.47,0.66
Texas,0.34,0.15,-1.16
Oregon,0.01,-0.43,0.79


In [103]:
# similar to method for Series
frame['e'].map(format)

Utah       0.26
Ohio       0.66
Texas     -1.16
Oregon     0.79
Name: e, dtype: object

## Sorting and Ranking
### Sorting
To sort lexicographically by row or column index, use the `sort_index` method:

In [104]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [105]:
# DataFrames can be sorted by index on either axis
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [106]:
frame.sort_index() # sort by row index by default

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [107]:
frame.sort_index(axis=1) # sort by column index

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [108]:
# sort in descending order
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its `sort_values` method.

In [109]:
obj = Series([4, 7, -3, 2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [110]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Missing values are sorted to the end by default.

In [111]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

To sort by the values in one or more columns on DataFrames, pass one or more column names to the `sort_values` method.

In [112]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [113]:
frame.sort_values(by='b') # sort by column 'b'

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [114]:
frame.sort_values(by=['a', 'b']) # sort by column 'a' first, then 'b'

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
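
When sorting by several columns, `ascending` may also be given as a list with one entry per column. A small sketch on the same frame, with the name `ordered` chosen for illustration:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, and within equal 'a' values by 'b' descending.
ordered = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(ordered)
```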


### Ranking
Ranking is closely related to sorting, assigning ranks from one through the number of valid data points in an array.  
The `rank` method by default breaks ties by assigning each group the mean rank:

In [115]:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [116]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [117]:
# assign ranks according to the order in which they're observed in the data
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [120]:
# assign rank in descending order
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [121]:
# a DataFrame can compute ranks down each column (default) or across each row
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [122]:
frame.rank(axis=1) # rank the values across the columns within each row

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


In [123]:
# tie-breaking methods with rank
methods = ['average', 'min', 'max', 'first']
description = ['Default: assign the average rank to each entry in the equal group',
               'Use the minimum rank for the whole group',
               'Use the maximum rank for the whole group',
               'Assign ranks in the order the values appear in the data']
rank_methods = DataFrame({'Method': methods, 'Description': description})
rank_methods

Unnamed: 0,Method,Description
0,average,Default: assign the average rank to each entry in the equal group
1,min,Use the minimum rank for the whole group
2,max,Use the maximum rank for the whole group
3,first,Assign ranks in the order the values appear in the data
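
To compare the four tie-breaking methods directly, the ranks can be collected side by side. This is a sketch using the Series from the ranking examples above; the name `comparison` is illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method.
comparison = pd.DataFrame({m: obj.rank(method=m)
                           for m in ['average', 'min', 'max', 'first']})
print(comparison)
```

The tied values (the two 7s and the two 4s) are the only rows where the columns differ.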


## Axis Indexes with Duplicate Values

In [124]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

Check index values' uniqueness using the `is_unique` property:

In [125]:
obj.index.is_unique

False

Indexing a label that occurs several times returns a Series, while a label that occurs only once returns a scalar.

In [126]:
obj['a']

a    0
a    1
dtype: int64

In [127]:
obj['c']

4

In [128]:
# same logic applies to DataFrames
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,0.016265,-0.052138,-0.633478
a,-0.518914,-0.628233,0.902968
b,-0.11935,-1.093004,-1.083729
b,0.573059,0.166416,0.237679


In [130]:
df.loc['b']

Unnamed: 0,0,1,2
b,-0.11935,-1.093004,-1.083729
b,0.573059,0.166416,0.237679
