# Data Selection in Series
## Series as dictionary

In [1]:
import pandas as pd
data = pd.Series([1,99,2,3,4,6],index=['b','c','a','l','l','a'])
data

b     1
c    99
a     2
l     3
l     4
a     6
dtype: int64

- Notice that index labels can repeat, which may be unexpected.

In [2]:
data[['a','b']]

a    2
a    6
b    1
dtype: int64

- Notice also that fancy indexing returns both matches for 'a'.
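The repeated labels can be checked for directly. The following is a minimal sketch, assuming the same six-element Series as above; the variable names are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([1, 99, 2, 3, 4, 6], index=['b', 'c', 'a', 'l', 'l', 'a'])

# is_unique reports whether every label occurs exactly once
index_is_unique = s.index.is_unique

# a label that occurs once gives a scalar, a repeated label gives a Series
unique_lookup = s['b']
repeated_lookup = s['a']

# duplicated(keep=False) marks every occurrence of a repeated label
dup_mask = s.index.duplicated(keep=False)
repeated_labels = list(s.index[dup_mask])

print(index_is_unique, unique_lookup, list(repeated_lookup), repeated_labels)
```

Checking `is_unique` before relying on label lookups is one way to avoid being surprised by a Series coming back where a scalar was expected.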

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [3]:
'l' in data, 'q' in data

(True, False)

In [4]:
data.keys()

Index([u'b', u'c', u'a', u'l', u'l', u'a'], dtype='object')

In [5]:
data.items(),'  ',list(data.items())

(<itertools.izip at 0x7fde5b056098>,
 '  ',
 [('b', 1), ('c', 99), ('a', 2), ('l', 3), ('l', 4), ('a', 6)])

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

In [6]:
data['p'] = 2

In [7]:
data

b     1
c    99
a     2
l     3
l     4
a     6
p     2
dtype: int64
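As with a dictionary, assignment either adds a new entry or overwrites an existing one, depending on whether the label is already present. A minimal sketch, assuming a small Series with unique labels (the values are made up for illustration):

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

s['c'] = 3    # 'c' is not in the index yet, so the Series grows by one entry
s['a'] = 10   # 'a' already exists, so its value is overwritten in place

labels = list(s.index)
values = list(s)
print(labels, values)
```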

## Series as one-dimensional array

In [8]:
# slicing by explicit index

# data['l':'a'] 
# ==> ERROR "Cannot get right slice bound for non-unique label: 'a'"

data['c':'p']
# note that the element with index 'p' is included

c    99
a     2
l     3
l     4
a     6
p     2
dtype: int64

In [9]:
# slicing by implicit integer index
data[1:3]
# notice: the upper bound is not included

c    99
a     2
dtype: int64

In [10]:
# masking
data[data>5]

c    99
a     6
dtype: int64

In [11]:
# fancy indexing
data[['p','l']]

p    2
l    3
l    4
dtype: int64

Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (e.g., data['c':'p']), the final index is included in the slice, while when slicing with an implicit index (e.g., data[0:2]), the final index is excluded from the slice.
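The difference between the two slicing conventions is easiest to see side by side. This is a small sketch on a hypothetical Series with unique string labels, not the data used above:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# label-based slice: the end label 'c' is part of the result
by_label = s['a':'c']

# position-based slice: position 2 is excluded, as in plain Python
by_position = s[0:2]

print(list(by_label.index), list(by_position.index))
```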

## Indexers: loc, iloc, and ix
These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

In [12]:
data = pd.Series(['a','b','c'], index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [13]:
# EXPLICIT index when indexing
data[1]

'a'

In [14]:
# IMPLICIT index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the **loc** attribute allows indexing and slicing that always references the **explicit index**:

In [15]:
data.loc[1]

'a'

In [16]:
data.loc[1:3]

1    a
3    b
dtype: object

The **iloc** attribute allows indexing and slicing that always references the **implicit Python-style** index:

In [17]:
data.iloc[0],' ', data.iloc[1]

('a', ' ', 'b')

In [18]:
data.iloc[1:3]

3    b
5    c
dtype: object

In [19]:
print( data.iloc[1:3],    '\n\n',     data[1:3] )

(3    b
5    c
dtype: object, '\n\n', 3    b
5    c
dtype: object)


In [20]:
data.iloc[1:3] is data[1:3]
# False: each slicing call builds a new Series object, so `is` (identity) fails even though the contents match

False
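The `is` operator compares object identity, not contents, so it is the wrong tool for checking that two selections agree. A minimal sketch of a content comparison with `Series.equals`, using iloc for both selections to keep the example unambiguous:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

first = s.iloc[1:3]
second = s.iloc[1:3]

same_object = first is second          # identity: two distinct Series objects
same_content = first.equals(second)    # same labels, values and dtype

print(same_object, same_content)
```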

In [21]:
%timeit data.iloc[1:3]
%timeit data.ix[1:3]
%timeit data[1:3]

10000 loops, best of 3: 150 µs per loop


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


The slowest run took 5.19 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 169 µs per loop
10000 loops, best of 3: 171 µs per loop


**One guiding principle of Python code is that "explicit is better than implicit."**

The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using them both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.

# Data Selection in DataFrame

Recall that a **DataFrame** acts in many ways like a **two-dimensional or structured array**, and in other ways like a **dictionary of Series structures sharing the same index**.

These analogies can be helpful to keep in mind as we explore data selection within this structure.

In [22]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193


In [23]:
print(data['area'] is data.area,'\n\n')
data['area']

(True, '\n\n')


California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [24]:
%timeit data.area
%timeit data['area']

The slowest run took 4.68 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.75 µs per loop
The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.13 µs per loop


Though accessing a column via **attribute style** is a useful shorthand, keep in mind that it does not work in all cases! For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible. The DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:

In [25]:
data.pop is data['pop']

False
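The name clash can be made visible by checking what each form of access actually returns. A minimal sketch, assuming a two-row frame with the same column names as above:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both forms agree
area_matches = df.area.equals(df['area'])

# 'pop' clashes with DataFrame.pop, so attribute access yields the method
attr_is_method = callable(df.pop)
bracket_is_series = isinstance(df['pop'], pd.Series)

print(area_matches, attr_is_method, bracket_is_series)
```

Bracket access works for any column name, which is why it is the safer default.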

Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [26]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193    38.01874


## DataFrame as two-dimensional array
As mentioned previously, we can also view the DataFrame as an *enhanced two-dimensional array*. We can examine the raw underlying data array using the values attribute:

In [27]:
print( type(data.values) ) 
data.values

<type 'numpy.ndarray'>


array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])
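Note that the array above shows the integer columns in floating-point notation: a NumPy array holds a single dtype, so mixed integer and float columns are converted to a common one. A small sketch of this, assuming a two-row version of the frame and using `to_numpy()`, which returns the same kind of array as the values attribute:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 170312],
                   'pop': [38332521, 19552860]},
                  index=['California', 'Florida'])
df['density'] = df['pop'] / df['area']

# per-column dtypes inside the DataFrame
column_dtypes = df.dtypes.astype(str).to_dict()

# one dtype for the whole array once the data leaves the DataFrame
raw = df.to_numpy()

print(column_dtypes, raw.dtype, raw.shape)
```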

In [28]:
data

              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193    38.01874


In [29]:
data.T

         California     Florida    Illinois    New York       Texas
area       423967.0    170312.0    149995.0    141297.0    695662.0
pop      38332520.0  19552860.0  12882140.0  19651130.0  26448190.0
density    90.41393    114.8061    85.88376    139.0767    38.01874


In [30]:
# transposing offers a clever way of accessing a single row via an explicit key (i.e. an index element)
data.T['California']

area       4.239670e+05
pop        3.833252e+07
density    9.041393e+01
Name: California, dtype: float64

In [31]:
# here's another, more natural way of selecting a single row via an explicit key (i.e. an index element)
data[data.index=='California']

              area       pop    density
California  423967  38332521  90.413926
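The two approaches above return different types: the transpose gives a Series, while the Boolean mask gives a one-row DataFrame. The loc indexer can produce either result directly. A minimal sketch, assuming a two-row version of the frame:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 170312],
                   'pop': [38332521, 19552860]},
                  index=['California', 'Florida'])
df['density'] = df['pop'] / df['area']

# a single label returns the row as a Series indexed by column name
row_as_series = df.loc['California']

# a list with one label returns a DataFrame with one row
row_as_frame = df.loc[['California']]

print(type(row_as_series).__name__, type(row_as_frame).__name__)
```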


When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

In [32]:
# this is one ROW
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [33]:
# this is a column/Series, instead
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [34]:
data['area'] is data.area

True

Thus for array-style indexing, we need another convention. Here Pandas again uses the **loc, iloc, and ix indexers** mentioned earlier. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

In [35]:
data.iloc[:3,:2]

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [36]:
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


The ix indexer allows a hybrid of these two approaches:

In [37]:
data.ix[:3,:'pop']

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  


              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


Keep in mind that for integer indices, the ix indexer is subject to the same potential sources of confusion as discussed for integer-indexed Series objects.
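Since ix is deprecated (see the warning above), the same hybrid selection, positional rows combined with labeled columns, can be expressed with loc or iloc alone. The following is one possible sketch, not the only way to do it; it rebuilds the five-state frame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

# option 1: turn the row positions into labels, then use loc throughout
via_loc = df.loc[df.index[:3], :'pop']

# option 2: turn the column label into a position, then use iloc throughout
# (+1 because positional slices exclude the end point)
via_iloc = df.iloc[:3, :df.columns.get_loc('pop') + 1]

print(via_loc.equals(via_iloc))
```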

Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the loc indexer we can **combine masking and fancy indexing** as in the following:

In [38]:
data.loc[ data.density > 100, ['pop','density'] ]

               pop     density
Florida   19552860  114.806121
New York  19651127  139.076746


Any of these **indexing conventions may also be used to set or modify values**; this is done in the standard way that you might be accustomed to from working with NumPy:

In [39]:
data

              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193    38.01874


In [40]:
data.iloc[0,0] = 3.14
data

                area       pop     density
California      3.14  38332521   90.413926
Florida     170312.0  19552860  114.806121
Illinois    149995.0  12882135   85.883763
New York    141297.0  19651127  139.076746
Texas       695662.0  26448193    38.01874


In [41]:
data.loc['Florida','area'] = 2.7
data

                area       pop     density
California      3.14  38332521   90.413926
Florida          2.7  19552860  114.806121
Illinois    149995.0  12882135   85.883763
New York    141297.0  19651127  139.076746
Texas       695662.0  26448193    38.01874


In [42]:
data.loc['Illinois',:] = 1.6E-19
data

                area         pop   density
California      3.14  38332520.0  90.41393
Florida          2.7  19552860.0  114.8061
Illinois     1.6e-19     1.6e-19   1.6e-19
New York    141297.0  19651130.0  139.0767
Texas       695662.0  26448190.0  38.01874


In [43]:
data.ix[3:,'pop':] = 1492
data

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  


                area         pop   density
California      3.14  38332520.0  90.41393
Florida          2.7  19552860.0  114.8061
Illinois     1.6e-19     1.6e-19   1.6e-19
New York    141297.0      1492.0    1492.0
Texas       695662.0      1492.0    1492.0


## Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, **while *indexing* refers to columns**, ***slicing* refers to rows**:

In [68]:
# explicit indexing refers to columns
print(  type(data['pop'])  )
data['pop']

<class 'pandas.core.series.Series'>


California    3.833252e+07
Florida       1.955286e+07
Illinois      1.600000e-19
New York      1.492000e+03
Texas         1.492000e+03
Name: pop, dtype: float64

In [71]:
# explicit slicing refers to rows
print(  type(data['Florida':'Illinois'])  )
data['Florida':'Illinois']

<class 'pandas.core.frame.DataFrame'>


             area         pop   density
Florida       2.7  19552860.0  114.8061
Illinois  1.6e-19     1.6e-19   1.6e-19
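The row-wise behavior of the bracket operator is not limited to label slices: positional slices and Boolean masks also select rows. A minimal sketch on a freshly built copy of the five-state frame (the frame above has been modified by the earlier assignments, so it is rebuilt here):

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

rows_by_label = df['Florida':'Illinois']   # label slice, end label included
rows_by_position = df[1:3]                 # positional slice, end excluded
rows_by_mask = df[df['density'] > 100]     # Boolean mask, applied row by row

print(list(rows_by_label.index))
print(list(rows_by_position.index))
print(list(rows_by_mask.index))
```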


As a consequence, if I try to select a row by passing its label to the bracket operator, I get an error.

If I want a row by its label, I need to use loc:

In [65]:
# data['Florida']    # error because bracket indexing looks up column names, and 'Florida' is a row label
# data['Florida',:]  # error

print(data.loc['Florida'])        # ok

print('\n ---- \n')
print(data.loc['Florida',:])      # ok

print('\n ---- \n')
print('type is: '+str(type(data.loc['Florida'])) )     # ok
print('type is: '+str(type(data.loc['Florida',:])) )     # ok

print('\n ---- \n')
data.loc['Florida'] is data.loc['Florida':]


area       2.700000e+00
pop        1.955286e+07
density    1.148061e+02
Name: Florida, dtype: float64

 ---- 

area       2.700000e+00
pop        1.955286e+07
density    1.148061e+02
Name: Florida, dtype: float64

 ---- 

type is: <class 'pandas.core.series.Series'>
type is: <class 'pandas.core.series.Series'>

 ---- 



False

To summarize: the bracket operator gives you a column if you select by column name, which is what "indexing refers to columns" means. To get a row by its key, we had to use loc, as in the cell above.