#### Part 16: Advanced Indexing and Selection in Pandas

In this notebook, we'll explore:
- Random sampling with seeds
- Setting with enlargement
- Fast scalar value getting and setting
- Dictionary-like get() method
- The lookup() method
- Index objects and metadata

##### Setup
First, let's import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import datetime

##### 1. Random Sampling with Seeds

You can seed the random number generator used by `sample` with the `random_state` argument, which accepts either an integer seed or a NumPy `RandomState` object.

In [2]:
df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# With a given seed, the sample will always draw the same rows
df4.sample(n=2, random_state=2)

,col1,col2
2,3,4
1,2,3


In [3]:
# Running it again with the same seed gives the same result
df4.sample(n=2, random_state=2)

,col1,col2
2,3,4
1,2,3
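
The text above notes that `random_state` also accepts a NumPy `RandomState` object. A minimal sketch of that form follows; the frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 20, 30, 40, 50]})

# Two generators created from the same seed start in the same state,
# so they select the same rows
first = demo.sample(n=3, random_state=np.random.RandomState(42))
second = demo.sample(n=3, random_state=np.random.RandomState(42))
print(first.equals(second))

# Reusing ONE generator object advances its state between calls,
# so successive draws are not forced to match
rng = np.random.RandomState(42)
draw_a = demo.sample(n=3, random_state=rng)
draw_b = demo.sample(n=3, random_state=rng)
print(draw_a.index.tolist(), draw_b.index.tolist())
```

Passing a fresh generator per call behaves like passing the integer seed, while sharing one generator is useful when you want a reproducible sequence of different draws.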


##### 2. Setting with Enlargement

Both `.loc` and `[]` can enlarge an object when you assign to a key that does not yet exist on that axis. For a Series, this is effectively an append.

In [4]:
se = pd.Series([1, 2, 3])
print(se)

0    1
1    2
2    3
dtype: int64


In [5]:
# Setting a value at a label that does not exist yet (the float value upcasts the Series to float64)
se[5] = 5.
print(se)

0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64
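
The dtype change in the output above (int64 to float64) comes from the float value that was assigned. A small sketch, using a made-up Series `ser`, shows the same enlargement through `.loc` and confirms that labels which were never assigned remain missing:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])           # int64, labels 0..2

# Label 5 does not exist, so .loc appends it instead of raising KeyError
ser.loc[5] = 5.0

print(ser.index.tolist())            # labels after enlargement
print(ser.dtype)                     # dtype after storing a float

# Reading a label that was never assigned still raises KeyError
try:
    ser.loc[4]
    label_4_present = True
except KeyError:
    label_4_present = False
print(label_4_present)
```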


A DataFrame can be enlarged on either axis via `.loc`.

In [6]:
dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
                  columns=['A', 'B'])
dfi

,A,B
0,0,1
1,2,3
2,4,5


In [7]:
# Adding a new column
dfi.loc[:, 'C'] = dfi.loc[:, 'A']
dfi

,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4


In [8]:
# Adding a new row
dfi.loc[3] = 5
dfi

,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4
3,5,5,5
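
Assigning a scalar, as above, broadcasts it across the new row. A list-like with one entry per column can be assigned the same way. The sketch below uses a made-up frame `frame` to show both forms:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

# Row enlargement with a list: one value per existing column
frame.loc[3] = [10, 11]

# Column enlargement with a scalar: the value is broadcast to every row
frame.loc[:, 'C'] = 0

print(frame)
```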


##### 3. Fast Scalar Value Getting and Setting

Because indexing with `[]` must handle many cases (single-label access, slicing, boolean indexing, etc.), it carries some overhead to work out what you are asking for. If you only need a single scalar value, the fastest way is to use the `at` and `iat` accessors, which are available on both Series and DataFrame.

- `at` provides label-based scalar lookups
- `iat` provides integer-based lookups
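
As a rough check of the claim above, the sketch below confirms that `at`/`iat` return the same scalars as `loc`/`iloc` and times the two label-based forms. The frame `small` is invented for the example, and timings vary by machine and pandas version, so no figures are quoted.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=['w', 'x', 'y', 'z'],
                     columns=['A', 'B', 'C'])

# Same scalar, different accessors
by_label = small.at['y', 'B']        # label-based
by_position = small.iat[2, 1]        # position-based
print(by_label == small.loc['y', 'B'], by_position == small.iloc[2, 1])

# Rough timing comparison of the two label-based forms
t_at = timeit.timeit(lambda: small.at['y', 'B'], number=2000)
t_loc = timeit.timeit(lambda: small.loc['y', 'B'], number=2000)
print(f"at: {t_at:.4f}s  loc: {t_loc:.4f}s")
```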

In [9]:
# Create a Series and DataFrame for demonstration
s = pd.Series([0, 1, 2, 3, 4, 5])

dates = pd.date_range('20000101', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# Display the DataFrame
df

,A,B,C,D
2000-01-01,-0.17992,-2.839338,1.855987,0.16071
2000-01-02,-2.174306,0.446245,-2.398792,0.983548
2000-01-03,0.692869,-0.308068,0.594531,0.312468
2000-01-04,-0.170267,0.010328,-1.50048,0.487505
2000-01-05,-0.728984,1.071743,-0.953226,0.392945
2000-01-06,-0.320021,0.154224,-0.407157,0.643866
2000-01-07,0.430511,-0.553209,-1.777871,0.282906
2000-01-08,-0.229793,-0.270444,1.13649,0.619445


In [10]:
# Using iat for integer-based lookup
s.iat[5]

np.int64(5)

In [11]:
# Using at for label-based lookup
df.at[dates[5], 'A']

np.float64(-0.3200207421940305)

In [12]:
# Using iat for integer-based lookup in DataFrame
df.iat[3, 0]

np.float64(-0.1702670264595661)

You can also set values using these same indexers:

In [13]:
# Setting values using at
df.at[dates[5], 'E'] = 7

# Setting values using iat
df.iat[3, 0] = 7

df

,A,B,C,D,E
2000-01-01,-0.17992,-2.839338,1.855987,0.16071,
2000-01-02,-2.174306,0.446245,-2.398792,0.983548,
2000-01-03,0.692869,-0.308068,0.594531,0.312468,
2000-01-04,7.0,0.010328,-1.50048,0.487505,
2000-01-05,-0.728984,1.071743,-0.953226,0.392945,
2000-01-06,-0.320021,0.154224,-0.407157,0.643866,7.0
2000-01-07,0.430511,-0.553209,-1.777871,0.282906,
2000-01-08,-0.229793,-0.270444,1.13649,0.619445,


`at` may enlarge the object in place when a label is missing:

In [14]:
# Neither the row label (2000-01-09) nor the column label 0 exists, so at adds both
df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7
df

,A,B,C,D,E,0
2000-01-01,-0.17992,-2.839338,1.855987,0.16071,,
2000-01-02,-2.174306,0.446245,-2.398792,0.983548,,
2000-01-03,0.692869,-0.308068,0.594531,0.312468,,
2000-01-04,7.0,0.010328,-1.50048,0.487505,,
2000-01-05,-0.728984,1.071743,-0.953226,0.392945,,
2000-01-06,-0.320021,0.154224,-0.407157,0.643866,7.0,
2000-01-07,0.430511,-0.553209,-1.777871,0.282906,,
2000-01-08,-0.229793,-0.270444,1.13649,0.619445,,
2000-01-09,,,,,,7.0
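
Only the label-based accessor can enlarge. As a sketch on a made-up frame `grid`, assigning through `at` with a new label adds a row, while `iat` with an out-of-range position raises `IndexError`:

```python
import pandas as pd

grid = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['x', 'y'])

# 'z' is not an existing row label, so at enlarges the frame;
# the untouched column B is filled with NaN in the new row
grid.at['z', 'A'] = 9.0
print(grid.shape)

# iat is purely positional and cannot create new positions
try:
    grid.iat[10, 0] = 9.0
    iat_enlarged = True
except IndexError:
    iat_enlarged = False
print(iat_enlarged)
```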


##### 4. Dictionary-like get() Method

Both Series and DataFrame have a `get` method, which returns a default value when the key is missing.

In [15]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Equivalent to s['a']
s.get('a')

np.int64(1)

In [16]:
# Getting a non-existent key with a default value
s.get('x', default=-1)

-1
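
The same method works on a DataFrame, where the key is a column label. A brief sketch, with a made-up frame `people`:

```python
import pandas as pd

people = pd.DataFrame({'name': ['Ann', 'Bo'], 'age': [31, 27]})

ages = people.get('age')                        # existing column: returns the Series
missing = people.get('email')                   # missing column, no default: returns None
fallback = people.get('email', default='n/a')   # missing column with an explicit default

print(ages.tolist(), missing, fallback)
```

Unlike `people['email']`, which raises `KeyError`, `get` lets the caller decide what a missing column should mean.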

##### 5. The lookup() Method

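Sometimes you want to extract a set of values given a sequence of row labels and a matching sequence of column labels. The `DataFrame.lookup` method used to do this and returned a NumPy array, but it was deprecated in pandas 1.2 and removed in pandas 2.0, so the example below uses an equivalent alternative.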

In [17]:
dflookup = pd.DataFrame(np.random.rand(20, 4), columns=['A', 'B', 'C', 'D'])
dflookup.head()

,A,B,C,D
0,0.667685,0.513656,0.779095,0.831329
1,0.419444,0.538754,0.913358,0.430405
2,0.892438,0.473765,0.990673,0.386426
3,0.57507,0.52093,0.115757,0.460345
4,0.490568,0.20921,0.182056,0.249694


In [20]:
# Extract one value per (row label, column label) pair.
# DataFrame.lookup was removed in pandas 2.0, so this no longer works:
# dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])

# Alternative: pair the labels up and read each scalar with .at
row_labels = list(range(0, 10, 2))
col_labels = ['B', 'C', 'A', 'B', 'D']
values = [dflookup.at[row, col] for row, col in zip(row_labels, col_labels)]
values

[np.float64(0.5136557798123911),
 np.float64(0.9906734043334495),
 np.float64(0.49056787059268303),
 np.float64(0.05708490973874736),
 np.float64(0.3597510107487878)]
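
For many pairs, a vectorized form avoids the Python-level loop. The sketch below is one possible approach, not necessarily the replacement pandas recommends: it translates labels to positions with `Index.get_indexer` and then uses NumPy integer-array indexing. The frame `table` is made up so the result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['A', 'B', 'C'])

row_labels = [0, 2, 4]
col_labels = ['B', 'C', 'A']

# Translate labels to integer positions.
# Note: get_indexer returns -1 for a label it cannot find, which NumPy
# would treat as "last element", so validate the labels first in real code.
row_pos = table.index.get_indexer(row_labels)
col_pos = table.columns.get_indexer(col_labels)

# Pairwise selection: element i is table[row_labels[i], col_labels[i]]
picked = table.to_numpy()[row_pos, col_pos]
print(picked)
```

Like the old `lookup`, this returns a NumPy array. On a frame with mixed column dtypes, `to_numpy()` produces an object array, so the per-pair `.at` approach may be preferable there.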

##### 6. Index Objects

The pandas `Index` class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed.

In [21]:
# Creating an Index directly
index = pd.Index(['e', 'd', 'a', 'b'])
index

Index(['e', 'd', 'a', 'b'], dtype='object')

In [22]:
# Testing membership
'd' in index

True
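
Because duplicates are allowed, it is often worth checking for them explicitly. A short sketch with a made-up index `dup`:

```python
import pandas as pd

dup = pd.Index(['a', 'b', 'a', 'c'])

print(dup.is_unique)                 # whether every label occurs exactly once
print(dup.duplicated().tolist())     # marks repeats after the first occurrence
print(dup.unique().tolist())         # distinct labels in order of first appearance
print('a' in dup)                    # membership ignores multiplicity
```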

###### 6.1 Setting Metadata

You can also pass a name to be stored in the index:

In [23]:
# Creating an index with a name
index = pd.Index(['e', 'd', 'a', 'b'], name='something')
index.name

'something'

In [24]:
# The name will be shown in the console display
index = pd.Index(list(range(5)), name='rows')
columns = pd.Index(['A', 'B', 'C'], name='cols')

df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
df

cols,A,B,C
rows,,,
0,-1.380805,0.467694,0.416283
1,0.678784,2.420706,0.229922
2,-0.615269,-0.123062,-1.293457
3,-0.564634,-2.163362,-1.698356
4,-0.129523,-0.28469,-0.578366


In [25]:
# Selecting a column shows the index name
df['A']

rows
0   -1.380805
1    0.678784
2   -0.615269
3   -0.564634
4   -0.129523
Name: A, dtype: float64

Indexes are "mostly immutable", but it is possible to set and change their metadata, like the index name:

In [26]:
ind = pd.Index([1, 2, 3])

# Create a new index with a different name
ind.rename("apple")

Index([1, 2, 3], dtype='int64', name='apple')

In [27]:
# Original index is unchanged
ind

Index([1, 2, 3], dtype='int64')

In [28]:
# Change the name in-place
ind.set_names(["apple"], inplace=True)
ind

Index([1, 2, 3], dtype='int64', name='apple')

In [29]:
# Another way to change the name
ind.name = "bob"
ind

Index([1, 2, 3], dtype='int64', name='bob')
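
The "mostly immutable" remark above can be checked directly. In this sketch, with a made-up index `labels`, assigning to an element raises `TypeError`, while the name can still be changed:

```python
import pandas as pd

labels = pd.Index([1, 2, 3], name='id')

# Element assignment is rejected: the values of an Index cannot be changed
try:
    labels[0] = 99
    values_mutable = True
except TypeError:
    values_mutable = False

# Metadata such as the name can still be reassigned
labels.name = 'key'
print(values_mutable, labels.name, labels.tolist())
```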