# Data Indexing and Selection

In NumPy, we looked in detail at these methods and tools to access, set and modify values:

indexing (e.g., `arr[2, 1]`) <br>
slicing (e.g., `arr[:, 1:5]`) <br>
masking (e.g., `arr[arr > 0]`) <br>
fancy indexing (e.g., `arr[0, [1, 5]]`) <br>
combinations thereof (e.g., `arr[:, [1, 5]]`)
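
As a quick refresher, the five patterns listed above can be sketched on a small array. The array `arr` and the result names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# A small 3x6 array so that every pattern has something to select
arr = np.arange(18).reshape(3, 6)

single = arr[2, 1]          # indexing: one element (row 2, column 1)
block = arr[:, 1:5]         # slicing: all rows, columns 1 through 4
positive = arr[arr > 0]     # masking: 1-D array of the elements that pass the test
picked = arr[0, [1, 5]]     # fancy indexing: columns 1 and 5 of row 0
combined = arr[:, [1, 5]]   # combination: columns 1 and 5 of every row

print(single)
print(block.shape)
print(combined)
```

The rest of this section shows how pandas carries the same ideas over to `Series` and `DataFrame` objects.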

## Data Selection in Series
### Series as dictionary

Like a dictionary, the `Series` object provides a mapping from a collection of keys to a collection of values:

In [2]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
data['b']

0.5

In [7]:
'a' in data

True

In [8]:
# like a Python dictionary
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [12]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [13]:
# Modifying Series objects with dict-like syntax
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

### Series as one-dimensional array

A `Series` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays: slices, masking, and fancy indexing. Examples of these are as follows:

In [14]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [16]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [17]:
# masking 
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [18]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64
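
One detail in the slicing examples above is easy to miss: a slice by explicit label includes its final label, while a slice by implicit position excludes its final position. The sketch below makes this visible. It uses `loc` and `iloc` only to make the intent of each slice unambiguous, and the variable names are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']      # label slice: the end label 'c' is included
by_position = data.iloc[0:2]      # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```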

### Indexers: loc, iloc and ix

In [19]:
data = pd.Series(['a', 'b', 'c'], 
                index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [20]:
# explicit index when indexing 
data[1]

'a'

In [21]:
# implicit index when slicing 
data[1:3]

3    b
5    c
dtype: object

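Because integer indexes make this behaviour easy to confuse, pandas provides special indexer attributes that explicitly expose one indexing scheme or the other.

First, the `loc` attribute allows indexing and slicing that always references the explicit index: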

In [22]:
data.loc[1]

'a'

In [23]:
data.loc[1:3]

1    a
3    b
dtype: object

The `iloc` attribute allows indexing and slicing that always references the implicit, Python-style positional index:

In [24]:
data.iloc[1]

'b'

In [25]:
data.iloc[1:3]

3    b
5    c
dtype: object
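
The difference between the two indexers can be checked side by side on the same integer-indexed `Series`. This is a minimal sketch; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc works on labels: 1:3 means "from label 1 through label 3", inclusive
by_label = data.loc[1:3]

# iloc works on positions: 1:3 means positions 1 and 2, as in a Python list
by_position = data.iloc[1:3]

print(list(by_label.index), list(by_label))
print(list(by_position.index), list(by_position))

# The same integer means different things to each indexer
print(data.loc[5], data.iloc[0])
```

Spelling out `loc` or `iloc` keeps code readable when the index itself consists of integers.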

## Data Selection in DataFrame
### DataFrame as a dictionary

In [26]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
| New York | 141297 | 19651127 |
| Texas | 695662 | 26448193 |


In [27]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [28]:
data.area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [29]:
# Only values
data.area.values

array([423967, 170312, 149995, 141297, 695662], dtype=int64)

In [30]:
data.area is data['area']

True

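This equivalence does not always hold. For example, if a column name conflicts with a method of the `DataFrame`, attribute-style access returns the method rather than the column. Here `pop` is also the name of a `DataFrame` method: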


In [31]:
data.pop is data['pop']

False
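
The reason for the mismatch can be inspected directly. The sketch below uses a reduced two-row frame named `df` (an illustrative stand-in for `data`) and shows that attribute access on a conflicting name yields a method, while dictionary-style access always yields the column.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with anything, so both spellings give the same values
same_values = df.area.equals(df['area'])

# 'pop' clashes with DataFrame.pop, so attribute access returns the method
attr_is_method = callable(df.pop)
item_is_column = isinstance(df['pop'], pd.Series)

print(same_values, attr_is_method, item_is_column)
```

For this reason dictionary-style access (`df['pop']`) is the safer habit, especially when column names come from external data.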

In [32]:
data['density'] = data['pop'] / data['area']
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
| New York | 141297 | 19651127 | 139.076746 |
| Texas | 695662 | 26448193 | 38.018740 |


### DataFrame as two-dimensional array

As mentioned previously, we can also view the `DataFrame` as an enhanced two-dimensional array. We can examine the raw underlying data array using the `values` attribute:

In [33]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])
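
The integer columns appear as floating-point numbers in this array because a NumPy array holds a single dtype, so mixing integer and float columns yields a float array. A minimal sketch of that effect, using a reduced two-row frame `df` as an illustrative stand-in for `data`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

ints_only = df.values                     # both columns are integers
df['density'] = df['pop'] / df['area']    # add a float column
mixed = df.values                         # integers and floats together

print(ints_only.dtype)
print(mixed.dtype)
```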

In [39]:
# Transpose
data.T

| | California | Florida | Illinois | New York | Texas |
|---|---|---|---|---|---|
| area | 423967.0 | 170312.0 | 149995.0 | 141297.0 | 695662.0 |
| pop | 38332520.0 | 19552860.0 | 12882140.0 | 19651130.0 | 26448190.0 |
| density | 90.41393 | 114.8061 | 85.88376 | 139.0767 | 38.01874 |


In [40]:
# Access a row
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [41]:
# Access a column
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [49]:
# Using iloc (implicit) - positional based indexing
data.iloc[:3, :2]

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


In [50]:
# Using loc (explicit) - label based indexing
data.loc[:'Illinois', :'pop']

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


In [51]:
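# The ix indexer was a hybrid of the two above, but it was deprecated in
# pandas 0.20 and removed in pandas 1.0. The same hybrid selection
# (rows by position, columns by label) can be written with loc:
data.loc[data.index[:3], :'pop']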

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
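
Since `ix` is not available in pandas 1.0 and later, a selection that mixes positions and labels has to be routed through `loc` or `iloc`. The following is a sketch of two possible spellings, not the only ones. The frame `df` is rebuilt here with an explicit row order so that the example is self-contained, and `stop` is an illustrative helper name.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

# Option 1: turn the row positions into labels, then select with loc
via_loc = df.loc[df.index[:3], :'pop']

# Option 2: turn the column label into a position, then select with iloc
stop = df.columns.get_loc('pop') + 1   # +1 because iloc excludes the stop
via_iloc = df.iloc[:3, :stop]

print(via_loc.equals(via_iloc))
```

Both options select the first three rows and the columns up to and including `pop`.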


In [52]:
data.loc[data.density > 100, ['pop', 'density']]

| | pop | density |
|---|---|---|
| Florida | 19552860 | 114.806121 |
| New York | 19651127 | 139.076746 |


In [53]:
data.iloc[0, 2] = 90
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.000000 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
| New York | 141297 | 19651127 | 139.076746 |
| Texas | 695662 | 26448193 | 38.018740 |


## Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding discussion but are nevertheless very useful in practice. First, while indexing refers to columns, slicing refers to rows:

In [54]:
data['Florida':'Illinois']

| | area | pop | density |
|---|---|---|---|
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


In [55]:
data[1:3]

| | area | pop | density |
|---|---|---|---|
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


In [56]:
# Direct masking operations are interpreted row-wise rather than
# column-wise
data[data.density > 100]

| | area | pop | density |
|---|---|---|---|
| Florida | 170312 | 19552860 | 114.806121 |
| New York | 141297 | 19651127 | 139.076746 |
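
The conventions in this section can be summarised in one short sketch: a single key inside `[]` selects a column, whereas a slice or a Boolean mask inside `[]` selects rows. The frame `df` is rebuilt with an explicit row order to keep the example self-contained, and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df['density'] = df['pop'] / df['area']

column = df['area']                        # single key: a column
rows_by_label = df['Florida':'Illinois']   # label slice: rows, end included
rows_by_position = df[1:3]                 # integer slice: rows, end excluded
dense_rows = df[df['density'] > 100]       # Boolean mask: rows

print(list(rows_by_label.index))
print(list(rows_by_position.index))
print(list(dense_rows.index))
```

When the intent might be unclear to a reader, the equivalent `loc` or `iloc` expression states it explicitly.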
