# Data Indexing and Selection

## Data Selection for Series

### Series as dictionary

In [13]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [14]:
data['b']

0.5

In [15]:
'a' in data

True

In [16]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [17]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [18]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

### Series as one-dimensional array

In [22]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [23]:
# slicing with implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [24]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [26]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

Note the difference between the first two slices. When slicing with an explicit index, e.g. data['a':'c'], the final index is **included**, but when slicing with an implicit integer index, e.g. data[0:2], the final index is **excluded**.
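The inclusive/exclusive distinction can be checked directly. The following is a minimal sketch that rebuilds the original four-element Series under the illustrative name `s` (not used elsewhere in these notes) and compares the labels returned by each kind of slice.

```python
import pandas as pd

# Rebuild the original four-element Series (the name `s` is illustrative).
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the stop label 'c' is included.
by_label = s['a':'c']

# Position-based slice: the stop position 2 is excluded.
by_position = s[0:2]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```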

### Indexers: loc, iloc and ix

These conventions can cause confusion. If the Series has an explicit integer index, an indexing operation such as data[1] uses the explicit index, whereas a slicing operation such as data[1:3] uses the implicit Python-style index.

In [27]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [28]:
# explicit index
data[1]

'a'

In [29]:
# implicit index
data[1:3]

3    b
5    c
dtype: object

To prevent this confusion, pandas provides special indexer attributes. Note that these **are not** methods; they are attributes that expose a particular slicing interface to the data in the Series.

The loc attribute always provides the **explicit** index:

In [30]:
data.loc[1]

'a'

In [31]:
data.loc[1:3]

1    a
3    b
dtype: object

The iloc attribute always provides the **implicit** Python-style index:

In [32]:
data.iloc[1]

'b'

In [36]:
data.iloc[1:3]

3    b
5    c
dtype: object

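Finally, the ix attribute is a hybrid of the two. For Series objects it is equivalent to standard [] indexing; it is mainly relevant for DataFrames. Note that ix has been deprecated since pandas 0.20 and was removed in pandas 1.0, so new code should use loc or iloc.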
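When a selection needs both a label and a position, one option is to translate between the two and then use loc or iloc consistently. The sketch below assumes the integer-indexed Series from above, rebuilt under the illustrative name `s`; `Index.get_loc` maps a label to its position, and indexing the `Index` object maps a position back to its label.

```python
import pandas as pd

# Same integer-indexed Series as above (the name `s` is illustrative).
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Label -> position: where does the label 3 sit in the index?
pos = s.index.get_loc(3)

# Position -> label: which label sits at position 2?
label = s.index[2]

# Each indexer is then given the kind of key it expects.
value_by_label = s.loc[label]      # label 5
value_by_position = s.iloc[pos]    # position 1

print(pos, label, value_by_label, value_by_position)
```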

## Data Selection for DataFrames

Recall a DataFrame is much like a two-dimensional or structured array, and also like a dictionary of Series structures sharing an index.

### DataFrame as a dictionary

In [67]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data.sort_index(inplace=True)
data

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
| New York | 141297 | 19651127 |
| Texas | 695662 | 26448193 |


The individual Series (columns) can be accessed using dictionary-style indexing:

In [68]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [69]:
# Or via attribute-style access
data.area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [70]:
# Note these are the same object
data.area is data['area']

True

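Attribute-style access needs care in some cases, for example when the column name is not a string or when it conflicts with a DataFrame method. For example, DataFrame has a pop() method, so data.pop refers to the method rather than the 'pop' column: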

In [71]:
data.pop is data['pop']

False

Never assign columns via attributes: use data['pop'] = z, **not** data.pop = z.
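The pitfalls of attribute-style access can be sketched as follows. This example uses a small two-row DataFrame under the illustrative name `df`; the warning that pandas emits on attribute assignment is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

# Small illustrative DataFrame (two of the states used in these notes).
df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# Attribute access resolves to the DataFrame.pop method, not the column.
pop_is_method = callable(df.pop)

# Assigning to a new attribute name sets a plain Python attribute;
# no column is created.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.density = df['pop'] / df['area']
created_by_attribute = 'density' in df.columns

# Item assignment creates the column as intended.
df['density'] = df['pop'] / df['area']
created_by_item = 'density' in df.columns

print(pop_is_method, created_by_attribute, created_by_item)
```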

In [72]:
data['density'] = data['pop'] / data['area']
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
| New York | 141297 | 19651127 | 139.076746 |
| Texas | 695662 | 26448193 | 38.01874 |


### DataFrame as a two-dimensional array

In [73]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])

Thus we can perform many familiar array operations, such as transposing the DataFrame:

In [74]:
data.T

| | California | Florida | Illinois | New York | Texas |
|---|---|---|---|---|---|
| area | 423967.0 | 170312.0 | 149995.0 | 141297.0 | 695662.0 |
| pop | 38332520.0 | 19552860.0 | 12882140.0 | 19651130.0 | 26448190.0 |
| density | 90.41393 | 114.8061 | 85.88376 | 139.0767 | 38.01874 |


Note, however, that passing a single index to an array accesses a row, whereas passing a single index to a DataFrame accesses a column:

In [75]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [76]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [77]:
try:
    data[0]
except KeyError:
    print("KeyError")

KeyError
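To access a row of a DataFrame rather than a column, an explicit indexer is needed. The following is a minimal sketch on a two-row DataFrame (the name `df` is illustrative) showing the positional and label-based forms side by side.

```python
import pandas as pd

# Small illustrative DataFrame with two integer columns.
df = pd.DataFrame({'area': [423967, 170312],
                   'pop': [38332521, 19552860]},
                  index=['California', 'Florida'])

# Row by position, analogous to df.values[0] but keeping the column labels.
first_row_by_position = df.iloc[0]

# The same row by its index label.
first_row_by_label = df.loc['California']

print(first_row_by_position.equals(first_row_by_label))  # True
print(first_row_by_label['area'])                         # 423967
```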


Using the iloc indexer, we can index the underlying array as if it's a simple NumPy array, but the index and column labels are maintained:

In [78]:
data.iloc[:3, :2]

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


Using the loc indexer, we can index the underlying data in an array-like style but using explicit index and column names:

In [79]:
data.loc[:'Illinois',:'pop']

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


The ix indexer allows a hybrid of these two approaches, although it is deprecated, as the warning in the output shows:

In [81]:
data.ix[:3, :'pop']

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated


| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


Bear in mind that, if you ever use ix, integer-indexed DataFrames are subject to the same confusion between explicit and implicit indexing described for Series.
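Because ix is deprecated, a mixed selection such as "the first three rows, columns up to 'pop'" can be written with loc or iloc alone. The sketch below rebuilds the states DataFrame under the illustrative name `df` and shows two possible translations; neither is the only way to do it.

```python
import pandas as pd

# Rebuild the states DataFrame (the name `df` is illustrative).
df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

# Option 1: translate the row positions to labels, then use loc throughout.
via_loc = df.loc[df.index[:3], :'pop']

# Option 2: translate the column label to a position, then use iloc throughout.
stop = df.columns.get_loc('pop') + 1   # +1 because iloc excludes the stop
via_iloc = df.iloc[:3, :stop]

print(via_loc.equals(via_iloc))
```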

We can combine NumPy-style access patterns with these indexers. For example, with loc we can combine masking and fancy indexing:

In [83]:
data.loc[data.density>100, ['pop', 'density']]

| | pop | density |
|---|---|---|
| Florida | 19552860 | 114.806121 |
| New York | 19651127 | 139.076746 |
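The same combination can be expressed positionally with iloc. As far as I am aware, iloc expects a plain boolean array rather than a boolean Series, so this sketch converts the mask with `to_numpy()` first; the DataFrame is rebuilt under the illustrative name `df`.

```python
import pandas as pd

# Rebuild the states DataFrame (the name `df` is illustrative).
df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

mask = df['density'] > 100

# Label-based: boolean mask for rows, list of column names for columns.
by_label = df.loc[mask, ['pop', 'density']]

# Position-based: boolean array for rows, list of column positions for columns.
by_position = df.iloc[mask.to_numpy(), [1, 2]]

print(by_label.equals(by_position))
```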


Any of these indexing conventions can also be used to set or modify values:

In [86]:
data.iloc[0, 2] = 90
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.0 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
| New York | 141297 | 19651127 | 139.076746 |
| Texas | 695662 | 26448193 | 38.01874 |
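Assignment works the same way with loc, including with a boolean mask. The following sketch rebuilds the states DataFrame under the illustrative name `df`, sets one cell by label, and then caps every density above 100 (the cap is an arbitrary value chosen for illustration).

```python
import pandas as pd

# Rebuild the states DataFrame (the name `df` is illustrative).
df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

# Set a single cell by row and column label.
df.loc['California', 'density'] = 90

# Set every cell selected by a mask: cap densities above 100 at 100.
df.loc[df['density'] > 100, 'density'] = 100

print(df['density'])
```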


### Additional Conventions

While indexing refers to columns, slicing refers to rows:

In [88]:
data['Florida': 'Illinois']

| | area | pop | density |
|---|---|---|---|
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


In [89]:
data[1:3]

| | area | pop | density |
|---|---|---|---|
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


Similarly, direct masking operations are interpreted row-wise rather than column-wise:

In [90]:
data[data.density > 100]

| | area | pop | density |
|---|---|---|---|
| Florida | 170312 | 19552860 | 114.806121 |
| New York | 141297 | 19651127 | 139.076746 |
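Each of these row-wise shorthands has an explicit loc or iloc equivalent, which can be easier to read. The sketch below rebuilds the states DataFrame under the illustrative name `df` and checks that each shorthand returns the same result as its explicit form.

```python
import pandas as pd

# Rebuild the states DataFrame (the name `df` is illustrative).
df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

# Label slice of rows: shorthand vs. explicit loc.
label_slice_same = df['Florida':'Illinois'].equals(df.loc['Florida':'Illinois'])

# Positional slice of rows: shorthand vs. explicit iloc.
position_slice_same = df[1:3].equals(df.iloc[1:3])

# Boolean mask on rows: shorthand vs. explicit loc.
mask = df['density'] > 100
mask_same = df[mask].equals(df.loc[mask])

print(label_slice_same, position_slice_same, mask_same)
```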
