# Mod10 DataFrame Indexing and Selection

## Data Selection in DataFrame

### DataFrame as a dictionary

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.__version__

'1.19.4'

In [3]:
pd.__version__

'1.1.4'

In [4]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


In [None]:
data['area'] #dictionary-style indexing

In [None]:
data['area'].values  # underlying NumPy array

In [None]:
data.area #attribute-style access

This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [None]:
data.area is data['area']

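Attribute-style access does not work when the column name conflicts with a ``DataFrame`` method. For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` points to that method rather than to the ``"pop"`` column: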

In [None]:
# name conflict with the pop() method, so this comparison is False
data.pop is data['pop']
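
A minimal sketch of this pitfall, assuming a shortened two-row copy of the table (the reduced data and the variable names are illustrative only). It checks what each access style actually returns:

```python
import pandas as pd

# Two-row stand-in for the table above (shortened for brevity).
data = pd.DataFrame(
    {'area': [423967, 695662], 'pop': [38332521, 26448193]},
    index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both styles agree.
same_area = data.area.equals(data['area'])

# 'pop' clashes with DataFrame.pop, so attribute access returns the method.
pop_is_method = callable(data.pop)
pop_is_column = isinstance(data['pop'], pd.Series)

print(same_area, pop_is_method, pop_is_column)
```

Dictionary-style access always returns the column, which is why it is the safer default when column names are not known in advance.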

Dictionary-style syntax can also be used to add a new column:

In [6]:
# add a new column
data['density'] = data['pop'] / data['area']
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


### DataFrame as two-dimensional array

In [7]:
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


In [None]:
data.columns

In [None]:
data.index

In [None]:
data.values

Transpose the full ``DataFrame`` to swap rows and columns:

In [None]:
data.T

Passing a single index to an array accesses a row:

In [None]:
data.values[0]

Passing a single "index" to a ``DataFrame`` accesses a column:

In [None]:
data['area']  # returns a Series

In [None]:
# special case of fancy indexing: a one-item list returns a DataFrame
data[['area']]

Access multiple columns:

In [None]:
# fancy indexing
data[['area','pop']]
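
The difference between array-style and dictionary-style indexing, and between a single label and a list of labels, can be checked directly. This is a small sketch on a three-row copy of the table; the shortened data and the variable names are assumptions made for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297],
     'pop': [38332521, 26448193, 19651127]},
    index=['California', 'Texas', 'New York'])

first_row = data.values[0]        # NumPy array: a single index selects a row
area_series = data['area']        # a single label selects a column as a Series
area_frame = data[['area']]       # a one-item list keeps the DataFrame shape
two_cols = data[['area', 'pop']]  # a list of labels selects several columns

print(first_row)
print(type(area_series).__name__, area_series.shape)
print(type(area_frame).__name__, area_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```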

In [None]:
# data[data.density > 100]  # attribute style: easy to misread and may clash with method names
data[data['density'] > 100]  # preferred: select the column by its key

In [None]:
data[(data['density'] > 90) & (data['area'] > 15000)]
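
The two masking cells above rely on a boolean ``Series`` that is aligned with the row index. The following sketch, which rebuilds the table under the same names, makes the intermediate mask explicit; ``dense`` and ``dense_and_large`` are illustrative names, not part of the original notebook.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# A comparison on a column yields a boolean Series, one entry per row.
dense = data['density'] > 100

# Combine conditions with & and wrap each one in parentheses,
# because & binds more tightly than the comparison operators.
dense_and_large = (data['density'] > 90) & (data['area'] > 15000)

print(data[dense].index.tolist())
print(data[dense_and_large].index.tolist())
```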

Using the ``iloc`` indexer, we can index the underlying array by integer position as if it were a plain NumPy array, while the ``DataFrame`` index and column labels are maintained in the result:

In [None]:
data

In [None]:
data.iloc[1, 2]  # for a single scalar, 'iat' is the better choice

In [None]:
data.iloc[0]  # returns the first row (California) as a Series

In [None]:
data.iloc[:3, :2]
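
As a quick check that positional indexing keeps the labels, here is a sketch on a three-row copy of the table (the shortened data and the names ``block``, ``row`` and ``value`` are assumptions for illustration):

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297],
     'pop': [38332521, 26448193, 19651127]},
    index=['California', 'Texas', 'New York'])

# Positional slice: first two rows, first column only.
block = data.iloc[:2, :1]
print(block)  # the result still carries the row and column labels

# A single row comes back as a Series named after its index label.
row = data.iloc[0]
print(row.name)

# A pair of integers (row position, column position) yields one scalar.
value = data.iloc[1, 1]
print(value)
```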

Using the ``loc`` indexer we can index the underlying data using the explicit index and column names:

In [None]:
data

In [None]:
data.loc['Texas'] # return Series

In [None]:
data.loc['Illinois', 'pop']  # returns a scalar value; 'at' is the better choice here

In [None]:
data.loc[:'New York', :'pop']  # explicit labels, endpoints included; same result as the faster 'data.iloc[:3, :2]'
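
The equivalence noted in the comment above depends on one difference between the two indexers: ``loc`` slices include the end label, while ``iloc`` slices exclude the end position. A minimal sketch that rebuilds the table and compares the two results (``by_label`` and ``by_position`` are illustrative names):

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# Label-based slice: 'New York' and 'pop' are both included.
by_label = data.loc[:'New York', :'pop']

# Position-based slice: stops before row 3 and before column 2.
by_position = data.iloc[:3, :2]

same = by_label.equals(by_position)
print(same)
```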

With ``loc`` we can combine masking and fancy indexing, as in the following:

In [None]:
# attribute style
data.loc[data.density > 100, ['pop', 'density']]

In [None]:
# fancy index
data.loc[data['density'] > 100, ['pop', 'density']] 

In [None]:
%%timeit
# fancy index
data.loc[data['density'] > 100, ['pop', 'density']] 

In [None]:
# slicing
data.loc[data['density'] > 100, 'pop':'density']  

In [None]:
%%timeit
# slicing
data.loc[data['density'] > 100, 'pop':'density']  
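
The list form and the slice form above select the same columns here only because ``pop`` and ``density`` are adjacent in the column order. A sketch that rebuilds the table and confirms the two selections match (the names ``mask``, ``by_list`` and ``by_slice`` are illustrative):

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

mask = data['density'] > 100

# Row mask combined with an explicit list of column labels.
by_list = data.loc[mask, ['pop', 'density']]

# Row mask combined with a label slice (both endpoints included).
by_slice = data.loc[mask, 'pop':'density']

print(by_list.equals(by_slice))
```

If the columns were not contiguous, only the list form would pick exactly the two intended columns.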

The same indexers can be used to set or modify values:

In [None]:
data.iloc[0, 2] = 90
data
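
For comparison, the same assignment can be written by position or by label. The sketch below applies the positional form from the cell above to a two-row copy of the table, then a label-based form to a second cell; the value ``38`` written to Texas is a made-up number used only for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662], 'pop': [38332521, 26448193]},
    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

# Positional assignment: row 0, column 2 (California's density).
data.iloc[0, 2] = 90

# Label-based assignment to another cell (hypothetical value).
data.loc['Texas', 'density'] = 38

print(data)
```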

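Access a single value for a row/column pair with ``at`` (labels) or ``iat`` (integer positions), and compare the timings with ``loc`` and ``iloc``: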

In [8]:
%%timeit
data.at['Texas','area']

2.63 µs ± 44.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [9]:
%%timeit
data.loc['Texas','area']

4.46 µs ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [10]:
%%timeit
data.iat[1,0]

12.2 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [11]:
%%timeit
data.iloc[1,0]

13.2 µs ± 57.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
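
Outside a notebook, a comparable measurement can be sketched with the standard-library ``timeit`` module. The snippet below first confirms that all four accessors return the same scalar and then times each one; the ``lookups`` dictionary and the loop count are assumptions, and the timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

# Four ways to read the same cell: Texas is row 1, 'area' is column 0.
lookups = {
    'at': lambda: data.at['Texas', 'area'],
    'loc': lambda: data.loc['Texas', 'area'],
    'iat': lambda: data.iat[1, 0],
    'iloc': lambda: data.iloc[1, 0],
}

# All accessors should agree on the value.
results = {name: fn() for name, fn in lookups.items()}
print(results)

# Time each accessor; report the mean time per call in microseconds.
n_calls = 10_000
for name, fn in lookups.items():
    seconds = timeit.timeit(fn, number=n_calls)
    print(f"{name:>4}: {seconds / n_calls * 1e6:.2f} us per call")
```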
