# [Basic Indexing and Selecting Data](#)

Indexing and selecting data are fundamental operations in data manipulation and analysis. They allow you to access specific subsets of data from your datasets, enabling you to focus on the relevant information for your tasks. Pandas, a powerful data manipulation library in Python, provides a wide range of indexing and selection methods to efficiently work with your data.


Indexing and selection are essential operations in data analysis and manipulation for several reasons:

1. **Accessing Specific Data Points**: Indexing and selection methods allow you to retrieve specific data points from your Series or DataFrame. Whether you need to access a single value, a subset of rows or columns, or a combination of both, indexing and selection techniques enable you to pinpoint the desired data quickly.

2. **Filtering and Subsetting**: Often, you may need to work with a specific subset of your data based on certain conditions or criteria. Indexing and selection methods provide ways to filter your data, allowing you to extract the relevant portions of your dataset for further analysis or processing.

3. **Data Exploration and Analysis**: Indexing and selection play a crucial role in data exploration and analysis. By selecting specific subsets of your data, you can examine patterns, trends, or anomalies within those subsets. This helps in gaining insights, testing hypotheses, and making data-driven decisions.

4. **Data Manipulation and Transformation**: Indexing and selection methods are essential for data manipulation and transformation tasks. You can use these methods to update specific values, apply functions to selected subsets, or create new features based on existing data. Indexing and selection provide the foundation for efficient data manipulation operations.


<img src="../images/subset-pandas.png" width="800">

Pandas provides a rich set of indexing and selection methods to cater to various data access and manipulation scenarios. Some of the key indexing and selection methods include:

1. **Square Bracket Notation**: The square bracket notation `[]` is the most basic and widely used indexing method in Pandas. On a Series it selects elements by index label; on a DataFrame it selects columns by name (or rows, when given a slice or a boolean mask).

2. **Dot Notation**: For accessing columns in a DataFrame, you can use the dot notation `.` followed by the column name. This provides a concise and intuitive way to refer to specific columns.

3. **`loc` Accessor**: The `loc` accessor is used for label-based indexing. It allows you to select data based on the index labels of rows and columns. With `loc`, you can select single elements, slices, or arbitrary subsets of data using labels.

4. **`iloc` Accessor**: The `iloc` accessor is used for integer position-based indexing. It allows you to select data based on the integer positions of rows and columns. With `iloc`, you can select single elements, slices, or arbitrary subsets of data using integer positions.

5. **Boolean Indexing**: Boolean indexing allows you to select data based on boolean conditions. By providing a boolean mask or a boolean expression, you can filter rows or columns that satisfy the given condition.

6. **Indexing with Callable**: Pandas also supports indexing using callable functions or objects. This allows you to define custom criteria or conditions for selecting data based on complex logic.


These are just a few examples of the indexing and selection methods available in Pandas. Throughout this lecture, we will explore these methods in detail, along with their variations and combinations, to provide you with a comprehensive understanding of data indexing and selection in Pandas.
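As a quick orientation, the sketch below applies each of the six methods listed above to one small DataFrame. The frame `demo`, its column names, and the variable names are made up for illustration only.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame(
    {"price": [10, 25, 40], "qty": [3, 1, 2]},
    index=["x", "y", "z"],
)

by_brackets = demo["price"]                      # square brackets: one column as a Series
by_dot = demo.price                              # dot notation: the same column
by_label = demo.loc["y", "qty"]                  # loc: row label, column label
by_position = demo.iloc[0, 1]                    # iloc: row position, column position
by_mask = demo[demo["price"] > 20]               # boolean indexing: rows where the mask is True
by_callable = demo.loc[lambda d: d["qty"] >= 2]  # callable: evaluated against the frame itself

print(by_label, by_position)
print(by_mask.index.tolist(), by_callable.index.tolist())
```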


By mastering indexing and selection techniques, you will gain the ability to efficiently access, filter, and manipulate your data, enabling you to perform powerful data analysis and manipulation tasks with ease.

## <a id='toc1_'></a>[Different Choices for Indexing](#toc0_)

Pandas provides a wide range of indexing choices to suit different data access and manipulation requirements. These indexing choices offer flexibility and power in working with your data, allowing you to select, filter, and transform your data efficiently. Let's explore the various indexing choices available in Pandas.

1. **Label-based Indexing**: Label-based indexing allows you to access data based on the labels of the index or columns. In Pandas, you can use the `loc` accessor to perform label-based indexing. With `loc`, you can select data using single labels, lists of labels, or slices of labels.

   Example:
   ```python
   df.loc[['row1', 'row2'], ['col1', 'col2']]
   ```

2. **Integer-based Indexing**: Integer-based indexing allows you to access data based on the integer positions of the index or columns. In Pandas, you can use the `iloc` accessor to perform integer-based indexing. With `iloc`, you can select data using single integers, lists of integers, or slices of integers.

   Example:
   ```python
   df.iloc[[0, 1], [0, 1]]
   ```

3. **Boolean Indexing**: Boolean indexing allows you to select data based on boolean conditions. You can provide a boolean mask or a boolean expression to filter rows or columns that satisfy the given condition. Boolean indexing is a powerful way to select data based on specific criteria.

   Example:
   ```python
   df[df['col1'] > 10]
   ```

4. **Indexing with MultiIndex**: Pandas provides the ability to work with hierarchical or multi-level indexes, known as MultiIndex. MultiIndex allows you to have multiple levels of indexing, enabling more complex data structures. You can use tuple-based indexing or the `xs` method to select data from a MultiIndex.

   Example:
   ```python
   df.loc[('level1', 'level2'), 'col1']
   ```

5. **Indexing with DatetimeIndex**: When working with time series data, Pandas offers specialized indexing using DatetimeIndex. You can select data based on datetime labels, date ranges, or time periods. DatetimeIndex provides convenient methods for resampling and aggregating time series data.

   Example:
   ```python
   df.loc['2023-01-01':'2023-12-31']
   ```
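The MultiIndex and DatetimeIndex snippets above assume a `df` that already has the right kind of index. A self-contained sketch, using hypothetical `sales` and `daily` frames invented for this example, might look like this:

```python
import pandas as pd

# Hypothetical two-level index: (region, year)
idx = pd.MultiIndex.from_tuples(
    [("east", 2022), ("east", 2023), ("west", 2022), ("west", 2023)],
    names=["region", "year"],
)
sales = pd.DataFrame({"units": [5, 7, 4, 9]}, index=idx)

one_cell = sales.loc[("east", 2023), "units"]  # a full tuple selects a single value
by_year = sales.xs(2022, level="year")         # cross-section on the inner level

# Hypothetical daily data indexed by a DatetimeIndex
days = pd.date_range("2023-01-30", periods=4, freq="D")
daily = pd.DataFrame({"visits": [10, 20, 30, 40]}, index=days)

february = daily.loc["2023-02"]                # partial date string selects the whole month
window = daily.loc["2023-01-31":"2023-02-01"]  # label slice, both endpoints included

print(one_cell)
print(by_year["units"].tolist())
print(february["visits"].tolist(), window["visits"].tolist())
```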


In the following sections, we will dive deeper into each indexing choice, exploring their syntax, use cases, and practical examples to help you master the art of indexing in Pandas.

## <a id='toc2_'></a>[Basics of Indexing and Selection](#toc0_)

Indexing and selection are fundamental operations in Pandas that allow you to access and manipulate specific subsets of data in your Series or DataFrame. Let's explore the basics of indexing and selection in Pandas.


The primary function of indexing with `[]` (a.k.a. `__getitem__` for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing Pandas objects with `[]`:

| Object Type | Selection      | Return Value Type                |
|-------------|----------------|-----------------------------------|
| Series      | `series[label]`  | scalar value                      |
| DataFrame   | `frame[colname]` | Series corresponding to `colname` |
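
The return types in the table can be checked directly. The sketch below uses throwaway objects `s` and `frame` (names chosen for this example) and also shows that passing a list of column names returns a DataFrame rather than a Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

scalar = s["a"]          # Series indexed by a label gives a scalar
column = frame["A"]      # DataFrame indexed by a column name gives a Series
subframe = frame[["A"]]  # a list of names gives a DataFrame, even for one column

print(type(scalar).__name__, type(column).__name__, type(subframe).__name__)
```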


### <a id='toc2_1_'></a>[Accessing Elements in a Series](#toc0_)


To access elements in a Pandas Series, you can use square bracket notation `[]` with index labels. For access by integer position, prefer the `iloc` accessor, because positional lookups through `[]` are deprecated.


In [1]:
import pandas as pd

In [2]:
series = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

- Accessing elements using index labels:

In [3]:
series['a']

1

- Accessing elements using integer positions:

> Note: `Series.__getitem__` treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`.

In [4]:
series.iloc[0]

1

- Slicing a Series using index labels:


In [5]:
series['b':'d']

b    2
c    3
d    4
dtype: int64

- Slicing a Series using integer positions:


In [6]:
series[1:3]

b    2
c    3
dtype: int64
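
The difference between labels and positions matters most when the index itself holds integers. The following sketch uses a hypothetical Series `s` whose labels (10, 20, 30) differ from its positions (0, 1, 2), and relies on `loc` and `iloc` to make the intent explicit.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
s = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = s.loc[20]          # label 20
by_position = s.iloc[0]       # first element
label_slice = s.loc[10:20]    # label slice includes both endpoints
position_slice = s.iloc[0:2]  # position slice excludes the stop position

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```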

### <a id='toc2_2_'></a>[Selecting Columns in a DataFrame](#toc0_)


When working with a Pandas DataFrame, you can select specific columns using square bracket notation `[]` or dot notation `.`.


In [7]:
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}, index=['a', 'b', 'c', 'd'])
df

|   | A | B | C  |
|---|---|---|----|
| a | 1 | 5 | 9  |
| b | 2 | 6 | 10 |
| c | 3 | 7 | 11 |
| d | 4 | 8 | 12 |


- Selecting a single column using square bracket notation:

In [8]:
df['A']

a    1
b    2
c    3
d    4
Name: A, dtype: int64

- Selecting a single column using dot notation:

In [9]:
df.A

a    1
b    2
c    3
d    4
Name: A, dtype: int64

Note that while dot notation is concise, it has some limitations and potential issues:

- You can use this access only if the column name (or, for a Series, the index element) is a valid Python identifier, e.g. `s.1` is not allowed.
- The attribute will not be available if it conflicts with an existing method name, e.g. `s.min` refers to the `min` method rather than your data, but `s['min']` is possible.
- Similarly, the attribute will not be available if it conflicts with any of the following list: `index`, `major_axis`, `minor_axis`, `items`.

In any of these cases, standard indexing will still work, e.g. `s['1']`, `s['min']`, and `s['index']` will access the corresponding element or column.
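
A small sketch of these limitations, using a hypothetical frame whose column names were chosen to clash with attribute access:

```python
import pandas as pd

# Hypothetical column names: one shadows a method, one is not a valid identifier
frame = pd.DataFrame({"min": [3, 1, 2], "unit price": [9.5, 8.0, 7.25]})

attr = frame.min              # the DataFrame.min method, not the column
col = frame["min"]            # the column named "min"
spaced = frame["unit price"]  # names containing spaces need square brackets

print(callable(attr))
print(col.tolist(), spaced.tolist())
```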

- Selecting multiple columns using square bracket notation:

In [10]:
df[['A', 'C']]

|   | A | C  |
|---|---|----|
| a | 1 | 9  |
| b | 2 | 10 |
| c | 3 | 11 |
| d | 4 | 12 |


### <a id='toc2_3_'></a>[Selecting Rows in a DataFrame](#toc0_)


To select specific rows in a DataFrame, you can use index labels or integer positions.


- Selecting a single row by **index label**:

In [11]:
df.loc['a']

A    1
B    5
C    9
Name: a, dtype: int64

- Selecting a single row by **integer position**:

In [12]:
df.iloc[0]

A    1
B    5
C    9
Name: a, dtype: int64

- Selecting multiple rows using **index labels**:

In [13]:
df.loc[['a', 'c']]

|   | A | B | C  |
|---|---|---|----|
| a | 1 | 5 | 9  |
| c | 3 | 7 | 11 |


- Selecting multiple rows using **integer positions**:

In [14]:
df.iloc[[0, 2]]

|   | A | B | C  |
|---|---|---|----|
| a | 1 | 5 | 9  |
| c | 3 | 7 | 11 |


### <a id='toc2_4_'></a>[Slicing Rows and Columns](#toc0_)


You can also use slicing to select a range of rows or columns in a DataFrame.


- Slicing rows using **index labels**:

In [15]:
df.loc['a':'c'] # Output: DataFrame with rows from index label 'a' to 'c' (inclusive)


|   | A | B | C  |
|---|---|---|----|
| a | 1 | 5 | 9  |
| b | 2 | 6 | 10 |
| c | 3 | 7 | 11 |


- Slicing rows using **integer positions**:

In [16]:
df.iloc[0:3] # Output: DataFrame with rows at integer positions 0 to 2 (stop position 3 is excluded)

|   | A | B | C  |
|---|---|---|----|
| a | 1 | 5 | 9  |
| b | 2 | 6 | 10 |
| c | 3 | 7 | 11 |


- Slicing columns using **column names (labels)**:

In [17]:
df.loc[:, 'A':'C'] # Output: DataFrame with columns from 'A' to 'C' (inclusive)

|   | A | B | C  |
|---|---|---|----|
| a | 1 | 5 | 9  |
| b | 2 | 6 | 10 |
| c | 3 | 7 | 11 |
| d | 4 | 8 | 12 |


- Slicing columns using **integer positions**:

In [18]:
df.iloc[:, 0:2] # Output: DataFrame with columns at integer positions 0 and 1 (stop position 2 is excluded)

|   | A | B |
|---|---|---|
| a | 1 | 5 |
| b | 2 | 6 |
| c | 3 | 7 |
| d | 4 | 8 |


These are just a few examples of the basic indexing and selection operations in Pandas. Pandas provides a wide range of indexing and selection methods, including boolean indexing, label-based indexing with `loc`, integer-based indexing with `iloc`, and more, which we will explore in the upcoming sections.
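
Row and column selections can also be combined in a single call. The sketch below rebuilds the small frame from this section under the hypothetical name `df_demo` and selects rows and columns together with `loc`, `iloc`, and a boolean mask.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]},
    index=["a", "b", "c", "d"],
)

by_labels = df_demo.loc["b":"c", ["A", "C"]]   # rows b to c inclusive, columns A and C
by_positions = df_demo.iloc[1:3, [0, 2]]       # rows 1 and 2, columns 0 and 2
filtered = df_demo.loc[df_demo["A"] > 2, "B"]  # boolean mask for rows, label for the column

print(by_labels.equals(by_positions))
print(filtered.tolist())
```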


By mastering the basics of indexing and selection, you'll be able to efficiently access and manipulate specific subsets of data in your Pandas Series and DataFrames, enabling you to perform various data analysis and manipulation tasks with ease.

## <a id='toc3_'></a>[Selecting Random Samples](#toc0_)

In data analysis and machine learning, it is often useful to select random samples from your dataset. Pandas provides a convenient way to generate random samples from Series and DataFrames using the `sample()` method. This allows you to create subsets of your data for various purposes, such as testing, validation, or exploratory analysis.


The `sample()` method in Pandas allows you to randomly select a specified number of rows from a Series or DataFrame. It returns a new Series or DataFrame containing the randomly selected samples. It also provides several parameters to control the random sampling process:

- `n`: The number of items to return. Defaults to 1 when `frac` is not given either.
- `frac`: The fraction of items to return. It cannot be combined with `n`, and values greater than 1 require `replace=True`.
- `replace`: Whether to allow sampling with replacement. If `True`, selected items can be chosen again.
- `weights`: Probabilities associated with each item. If not specified, items are chosen with equal probability. Weights that do not sum to 1 are normalized.
- `random_state`: The seed for the random number generator. Specifying a fixed value allows for reproducibility.


Here are a few examples demonstrating different use cases of the `sample()` method:

1. Selecting a fraction of rows from a DataFrame:


In [24]:
df.sample(frac=0.5)

|   | A | B    |
|---|---|------|
| 4 | 5 | 10.0 |
| 0 | 1 | 2.0  |


2. Allowing sampling with replacement:


In [25]:
df.sample(n=3, replace=True)

|   | A  | B   |
|---|----|-----|
| 2 | 3  | 6.0 |
| 3 | 10 | 8.0 |
| 1 | 8  | 4.0 |


3. Specifying probabilities for each row:

In [26]:
weights = [0.1, 0.2, 0.3, 0.2, 0.2]
df.sample(n=3, weights=weights)

|   | A | B    |
|---|---|------|
| 2 | 3 | 6.0  |
| 4 | 5 | 10.0 |
| 1 | 8 | 4.0  |


4. Setting a random seed for reproducibility:

In [27]:
df.sample(n=3, random_state=42)

|   | A | B    |
|---|---|------|
| 1 | 8 | 4.0  |
| 4 | 5 | 10.0 |
| 2 | 3 | 6.0  |


By using the `sample()` method, you can easily generate random samples from your Pandas Series and DataFrames. This is particularly useful when you want to create subsets of your data for various purposes, such as:

- Creating training and testing datasets for machine learning models.
- Performing exploratory data analysis on a smaller subset of data.
- Validating results or testing hypotheses on random samples.
- Generating bootstrap samples for statistical inference.
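
Two of these use cases can be sketched with `sample()` alone. The frame `data` and the names `train`, `test`, and `boot` are hypothetical, and the split assumes the index values are unique so that `drop(train.index)` removes exactly the sampled rows.

```python
import pandas as pd

data = pd.DataFrame({"x": range(10), "y": [v % 2 for v in range(10)]})

# Simple random train/test split: 80% of rows for training, the rest for testing
train = data.sample(frac=0.8, random_state=0)
test = data.drop(train.index)

# Bootstrap sample: same size as the original, drawn with replacement
boot = data.sample(frac=1, replace=True, random_state=0)

print(len(train), len(test), len(boot))
```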


It's important to note that the `sample()` method, by default, selects rows uniformly at random, so a small sample may not preserve the proportions of the categories in your data. If you require a stratified sample, you can sample within groups using `groupby(...).sample()`, or use a library function such as `scikit-learn`'s `train_test_split()` with its `stratify` argument.
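
One way to draw a stratified sample within pandas itself is to sample inside each group. This is a minimal sketch assuming pandas 1.1 or later (where `groupby(...).sample()` is available); the `data` frame and its `group` column are invented for the example.

```python
import pandas as pd

# Hypothetical data with an unbalanced grouping column: six "a" rows, four "b" rows
data = pd.DataFrame({"group": ["a"] * 6 + ["b"] * 4, "value": range(10)})

# Draw half of each group so the a:b ratio of the original data is preserved
stratified = data.groupby("group").sample(frac=0.5, random_state=0)

counts = stratified["group"].value_counts().to_dict()
print(counts)
```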


Selecting random samples is a fundamental operation in data analysis and machine learning, and Pandas provides a straightforward way to accomplish this using the `sample()` method. By leveraging this functionality, you can efficiently create random subsets of your data for various purposes and gain insights from your datasets.

## <a id='toc4_'></a>[Fast Scalar Value Getting and Setting](#toc0_)

When working with large datasets in Pandas, performance becomes a critical factor. Accessing and modifying individual scalar values in a DataFrame can be a common operation, but using the standard indexing methods may not always be the most efficient approach. Pandas provides optimized methods, `at` and `iat`, for fast scalar value getting and setting. Let's explore how to use these methods effectively.


Pandas offers two optimized methods for accessing and modifying individual scalar values in a DataFrame:

1. `at`: This method is used for label-based scalar access and setting. It takes row and column labels as arguments.
2. `iat`: This method is used for integer-based scalar access and setting. It takes row and column integer positions as arguments.


These methods are optimized for performance and provide faster access and modification compared to the standard indexing methods, such as `loc` and `iloc`, when working with scalar values.


### <a id='toc4_1_'></a>[Using `at` for Fast Scalar Value Getting and Setting](#toc0_)


The `at` method is used for label-based scalar access and setting. It allows you to quickly retrieve or modify a single value in a DataFrame by specifying the row and column labels.


Here's an example of using `at` for scalar value getting:


In [28]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df

|   | A | B |
|---|---|---|
| a | 1 | 4 |
| b | 2 | 5 |
| c | 3 | 6 |


In [29]:
df.at['b', 'A']

2

In this example, we have a DataFrame `df` with index labels 'a', 'b', 'c', and columns 'A' and 'B'. By calling `df.at['b', 'A']`, we retrieve the scalar value at row label 'b' and column label 'A'.


Similarly, you can use `at` for scalar value setting:


In [30]:
df.at['b', 'A'] = 10
df

|   | A  | B |
|---|----|---|
| a | 1  | 4 |
| b | 10 | 5 |
| c | 3  | 6 |


This statement modifies the value at row label 'b' and column label 'A' to 10.


### <a id='toc4_2_'></a>[Using `iat` for Fast Scalar Value Getting and Setting](#toc0_)


The `iat` method is used for integer-based scalar access and setting. It allows you to quickly retrieve or modify a single value in a DataFrame by specifying the row and column integer positions.


Here's an example of using `iat` for scalar value getting:


In [31]:
df.iat[1, 0]

10

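In this example, `df` still has the index labels 'a', 'b', 'c', but `iat` ignores labels and works purely with integer positions. By calling `df.iat[1, 0]`, we retrieve the scalar value at row position 1 and column position 0, which is the value 10 set in the previous step.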


Similarly, you can use `iat` for scalar value setting:


In [32]:
df.iat[1, 0] = 100
df

|   | A   | B |
|---|-----|---|
| a | 1   | 4 |
| b | 100 | 5 |
| c | 3   | 6 |


This statement modifies the value at row position 1 and column position 0 to 100.
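
To make the label versus position distinction concrete, the sketch below uses a hypothetical frame whose integer index labels (10, 20, 30) do not match the row positions (0, 1, 2). Here `at` and `iat` reach the same cell through different keys.

```python
import pandas as pd

# Hypothetical frame with integer labels that differ from the positions
frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[10, 20, 30])

via_label = frame.at[20, "A"]   # row label 20, column label "A"
via_position = frame.iat[1, 0]  # second row, first column

frame.iat[1, 0] = 100           # set by position
after_update = frame.at[20, "A"]  # read the same cell back by label

print(via_label, via_position, after_update)
```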


## <a id='toc5_'></a>[Performance Considerations](#toc0_)


Using `at` and `iat` for scalar value access and modification provides performance benefits compared to using `loc` and `iloc` or standard indexing with square brackets `[]`. The `at` and `iat` methods are optimized for fast scalar access and bypass some of the overhead associated with the more flexible indexing methods.


However, it's important to note that the performance gains are most significant when accessing or modifying a single scalar value. If you need to access or modify multiple values or slices of data, using `loc` or `iloc` may be more appropriate and can still provide good performance.


In [33]:
import numpy as np

# Create a sample DataFrame with 1 million rows and 3 columns
data = np.random.randint(0, 100, size=(1000000, 3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

In [34]:
# Access a single value using at
%timeit df.at[500000, 'B']

1.76 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [35]:
# Access a single value using iat
%timeit df.iat[500000, 1]

5.24 µs ± 35.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [36]:
# Access a single value using loc
%timeit df.loc[500000, 'B']

2.93 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [37]:
# Access a single value using iloc
%timeit df.iloc[500000, 1]

6.9 µs ± 44.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
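
The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard-library `timeit` module. The frame size, the repetition count, and the `accessors` dictionary below are arbitrary choices for illustration, not part of the original benchmark.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.integers(0, 100, size=(100_000, 3)), columns=["A", "B", "C"])

# Each entry is a zero-argument callable that reads the same cell
accessors = {
    "at": lambda: frame.at[50_000, "B"],
    "iat": lambda: frame.iat[50_000, 1],
    "loc": lambda: frame.loc[50_000, "B"],
    "iloc": lambda: frame.iloc[50_000, 1],
}

repeats = 10_000
# Average seconds per call over a fixed number of repetitions
timings = {
    name: timeit.timeit(fn, number=repeats) / repeats
    for name, fn in accessors.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.2f} µs per call")
```

Absolute numbers depend on the machine and the pandas version, so treat the output as a relative comparison only.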


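In the `%timeit` results above, `at` is faster than `loc`, and `iat` is faster than `iloc`. It's also worth mentioning that the `at` and `iat` methods accept only scalar keys: they do not support lists, slices, or boolean masks. They are specifically designed for fast scalar access and modification.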
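
The scalar-only restriction can be observed directly. In this sketch (with a hypothetical `frame`), passing a list of keys to `at` or `iat` raises an exception, while `loc` accepts the same list. The exact exception type is not asserted here because it may differ between pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])

attempts = {
    "at": lambda: frame.at[["a", "b"], "A"],  # list of labels: not a scalar key
    "iat": lambda: frame.iat[[0, 1], 0],      # list of positions: not a scalar key
}

errors = {}
for name, attempt in attempts.items():
    try:
        attempt()
    except Exception as exc:  # exact exception type may vary by pandas version
        errors[name] = type(exc).__name__

several = frame.loc[["a", "b"], "A"]  # loc accepts a list of labels

print(errors)
print(several.tolist())
```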


When working with large datasets and performance is a concern, consider using `at` and `iat` for fast scalar value getting and setting. These methods can help optimize your code and improve the efficiency of your data manipulation tasks.


Remember to profile your code and measure the performance impact of using `at` and `iat` in your specific use case. While these methods are optimized for fast scalar access, the actual performance gains may vary depending on the size and structure of your data, as well as the specific operations you are performing.


By leveraging the `at` and `iat` methods for fast scalar value getting and setting, you can write more efficient and performant code when working with large datasets in Pandas.