<div style="position:relative; height: 8em">
	<h1 style="position:absolute; left:0em; top:1em">Index DataFrame</h1>
	<img src="../../../../outfit/images/logos/software-web@256x256.png" style="position:absolute; right:1em; top:0; margin:0">
</div>

In [1]:
import pandas as pd

## Create dataset

In [2]:
prices_dict = {
    "fruits": ["apples", "oranges", "bananas", "strawberries"],
    "prices": [1.5, 2, 2.5, 3],
    "suppliers": ["supplier1", "supplier2", "supplier4", "supplier3"],    
}

prices_df = pd.DataFrame(prices_dict, index = [1,2,3,4])
prices_df

Unnamed: 0,fruits,prices,suppliers
1,apples,1.5,supplier1
2,oranges,2.0,supplier2
3,bananas,2.5,supplier4
4,strawberries,3.0,supplier3


## Basic indexing

df[colname] -> Series corresponding to colname

### Select Single Column 

In [3]:
## select single column - square bracket notation:
prices_col = prices_df['prices']
prices_col

1    1.5
2    2.0
3    2.5
4    3.0
Name: prices, dtype: float64

Remember that a DataFrame column is a Series object.

In [4]:
print(type(prices_col))

<class 'pandas.core.series.Series'>


In [5]:
## select single column - attribute (dot) notation:
prices_df.prices

1    1.5
2    2.0
3    2.5
4    3.0
Name: prices, dtype: float64

#### square bracket vs dot notation
Note that square bracket notation is the more general of the two: it can select one or several columns and accepts any string as a selector. Dot notation does not work if the column name contains spaces, is not a valid Python identifier, or clashes with an existing DataFrame attribute or method (like max, min, etc.).


In [6]:
demo_df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['col 1', 'col 2', 'col 3'])

# the line below would raise a SyntaxError:
# demo_df.'col 1'

# but next is ok:
demo_df['col 1']


0    1
1    4
Name: col 1, dtype: int64
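
The name-clash case mentioned above can be sketched in the same way. The frame `clash_df` and its column names are hypothetical, chosen only because `min` and `max` are also DataFrame methods:

```python
import pandas as pd

# Hypothetical frame whose column names clash with DataFrame methods
clash_df = pd.DataFrame({"min": [1, 4], "max": [3, 6]})

# Dot notation resolves to the DataFrame.min method, not to the column
dot_result = clash_df.min
print(callable(dot_result))

# Square brackets always return the column as a Series
bracket_result = clash_df["min"]
print(type(bracket_result))
```

Here the dot form silently returns a bound method rather than raising an error, which is why square brackets are the safer default.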

### Slicing ranges with [] operator

Slicing inside of [] **slices the rows**. This is provided largely as a convenience since it is such a common operation.

In [7]:
# get the first two rows:
prices_df[0:2]

Unnamed: 0,fruits,prices,suppliers
1,apples,1.5,supplier1
2,oranges,2.0,supplier2


In [8]:
# get every second row, starting with the first (1st, 3rd, ...)
prices_df[0::2]

Unnamed: 0,fruits,prices,suppliers
1,apples,1.5,supplier1
3,bananas,2.5,supplier4
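
Note that the slices above select rows by position, not by index label: the frame's index is 1 to 4, yet `[0:2]` returns the rows labelled 1 and 2. A minimal sketch, rebuilding a reduced version of the frame so that it runs on its own:

```python
import pandas as pd

prices_df = pd.DataFrame(
    {
        "fruits": ["apples", "oranges", "bananas", "strawberries"],
        "prices": [1.5, 2, 2.5, 3],
    },
    index=[1, 2, 3, 4],
)

# An integer slice inside [] counts positions, like iloc does
first_two = prices_df[0:2]
print(first_two.index.tolist())  # labels of the selected rows: [1, 2]

# The same rows selected explicitly by position
same_as_iloc = first_two.equals(prices_df.iloc[0:2])
print(same_as_iloc)
```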


### Select List of Columns

Note that the columns are returned in the order given in the list.

In [9]:
prices_df[['prices', 'fruits']]

Unnamed: 0,prices,fruits
1,1.5,apples
2,2.0,oranges
3,2.5,bananas
4,3.0,strawberries


The returned slice is a DataFrame object!

In [10]:
type(prices_df[['prices', 'fruits']])

pandas.core.frame.DataFrame
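
The difference between a column name and a list of column names can be seen with a single column as well. This sketch rebuilds a reduced version of the frame so that it is self-contained:

```python
import pandas as pd

prices_df = pd.DataFrame(
    {
        "fruits": ["apples", "oranges", "bananas", "strawberries"],
        "prices": [1.5, 2, 2.5, 3],
    },
    index=[1, 2, 3, 4],
)

# A plain column name returns a one-dimensional Series
single = prices_df["prices"]
print(type(single).__name__, single.shape)

# A one-element list returns a two-dimensional DataFrame with one column
as_frame = prices_df[["prices"]]
print(type(as_frame).__name__, as_frame.shape)
```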


## Access data with the loc method

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

Access a group of rows and columns by **label**(s) or a boolean array.

**Syntax**: df.loc[row_indexer,column_indexer]

In [11]:
# select the value in the row with label 1 and column 'prices'
prices_df.loc[1,'prices']

1.5

In [12]:
# get all rows for columns 'fruits' and 'prices'
prices_df.loc[:, ['fruits', 'prices']]

# equivalent to:
# prices_df[['fruits', 'prices']]

Unnamed: 0,fruits,prices
1,apples,1.5
2,oranges,2.0
3,bananas,2.5
4,strawberries,3.0


Label-based selection with loc is most useful when the index holds meaningful labels, so let's set the fruits column as the index.

In [13]:
prices_df.set_index('fruits', inplace=True)
prices_df

fruits,prices,suppliers
apples,1.5,supplier1
oranges,2.0,supplier2
bananas,2.5,supplier4
strawberries,3.0,supplier3


In [14]:
# get the price of 'oranges':
prices_df.loc['oranges', 'prices']

2.0

In [15]:
# get the price of 'oranges' and 'bananas':
prices_df.loc[['oranges','bananas'], 'prices']

fruits
oranges    2.0
bananas    2.5
Name: prices, dtype: float64
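
Besides single labels and lists of labels, loc also accepts label slices. Unlike positional slices, a label slice includes both endpoints. A minimal sketch, rebuilding the fruit-indexed frame so that it runs on its own:

```python
import pandas as pd

prices_df = pd.DataFrame(
    {
        "fruits": ["apples", "oranges", "bananas", "strawberries"],
        "prices": [1.5, 2, 2.5, 3],
    }
).set_index("fruits")

# Label slice: both 'oranges' and 'bananas' are included
label_slice = prices_df.loc["oranges":"bananas", "prices"]
print(label_slice.index.tolist())

# Positional slice over the same start position: the end is excluded
position_slice = prices_df.iloc[1:2]
print(position_slice.index.tolist())
```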

In [16]:
# let's reset the index back to the default integer index
prices_df.reset_index(inplace=True)
prices_df

Unnamed: 0,fruits,prices,suppliers
0,apples,1.5,supplier1
1,oranges,2.0,supplier2
2,bananas,2.5,supplier4
3,strawberries,3.0,supplier3


## Access data with the iloc method

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

Purely **integer-location** based indexing for selection by position

In [17]:
prices_df

Unnamed: 0,fruits,prices,suppliers
0,apples,1.5,supplier1
1,oranges,2.0,supplier2
2,bananas,2.5,supplier4
3,strawberries,3.0,supplier3


In [18]:
# get the data in first row, second column
prices_df.iloc[0,1]

1.5

In [19]:
# get the data from second row till the end for all columns
prices_df.iloc[1:,]

Unnamed: 0,fruits,prices,suppliers
1,oranges,2.0,supplier2
2,bananas,2.5,supplier4
3,strawberries,3.0,supplier3


In [20]:
# get the cells from the second row till the end, and the last column (using the -1 index)
prices_df.iloc[1:,-1]

1    supplier2
2    supplier4
3    supplier3
Name: suppliers, dtype: object
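
The difference between loc and iloc is easiest to see when the index labels do not coincide with the row positions. The sketch below uses the original index 1 to 4 from the start of the notebook, rebuilt here in reduced form:

```python
import pandas as pd

prices_df = pd.DataFrame(
    {
        "fruits": ["apples", "oranges", "bananas", "strawberries"],
        "prices": [1.5, 2, 2.5, 3],
    },
    index=[1, 2, 3, 4],
)

# loc: the row whose label is 1 (the first row)
by_label = prices_df.loc[1, "fruits"]
print(by_label)

# iloc: the row at position 1 (the second row), column at position 0
by_position = prices_df.iloc[1, 0]
print(by_position)
```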

#### pass Boolean array to loc/iloc method

As with a Series, we can pass a Boolean array as the row or column indexer in loc and iloc.
Note that each Boolean array must have the same length as the axis it filters (the DataFrame index or columns).

In [21]:
columns_mask = [False, True, False]
row_mask = [False, False, True, True]
prices_df.loc[row_mask, columns_mask]

Unnamed: 0,prices
2,2.5
3,3.0


In [22]:
# get all rows with price > 2:
mask = prices_df.prices>2
prices_df.loc[mask]

# the same can be done with:
# prices_df[prices_df.prices>2]

Unnamed: 0,fruits,prices,suppliers
2,bananas,2.5,supplier4
3,strawberries,3.0,supplier3
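
A mask built from a comparison is a Boolean Series, which loc aligns on the index. As far as I know, iloc does not accept a Boolean Series directly, so a common workaround is to convert the mask to a plain NumPy array first. A minimal sketch on a rebuilt copy of the frame:

```python
import pandas as pd

prices_df = pd.DataFrame(
    {
        "fruits": ["apples", "oranges", "bananas", "strawberries"],
        "prices": [1.5, 2, 2.5, 3],
        "suppliers": ["supplier1", "supplier2", "supplier4", "supplier3"],
    }
)

# Boolean Series: True where the price is greater than 2
mask = prices_df["prices"] > 2

# loc takes the Series as it is
via_loc = prices_df.loc[mask]

# for iloc, strip the index by converting to a NumPy Boolean array
via_iloc = prices_df.iloc[mask.to_numpy()]

print(via_loc.equals(via_iloc))
```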


In [23]:
# get all fruit names starting with the letter 'a'
mask = prices_df.fruits.str.startswith('a')
prices_df.loc[mask, 'fruits']

0    apples
Name: fruits, dtype: object