# Pandas Indexing and Updating Cells


▶️ Import `pandas` and `numpy`.


In [1]:
import pandas as pd
import numpy as np

▶️ Create a new `DataFrame` named `df_companies`.


In [2]:
df_companies = pd.DataFrame(
    {
        "company_name": ["Amazon", "Nvidia", "Google", "Microsoft", "Adobe"],
        "ticker": ["AMZN", "AMD", np.nan, "MSFT", np.nan],
        "headquarters": [
            "Champaign",
            "Santa Clara",
            "Mountain View",
            "Redmond",
            "San Jose",
        ],
    }
)

df_companies

| | company_name | ticker | headquarters |
|---|---|---|---|
| 0 | Amazon | AMZN | Champaign |
| 1 | Nvidia | AMD | Santa Clara |
| 2 | Google | NaN | Mountain View |
| 3 | Microsoft | MSFT | Redmond |
| 4 | Adobe | NaN | San Jose |


:::{caution} Missing and Incorrect Values are Intentional
`df_companies` purposefully contains missing and incorrect values that need to be fixed.

- Google's and Adobe's ticker symbols are missing.
- Nvidia's ticker symbol is incorrect (it should be "NVDA", not "AMD").
- Amazon's headquarters is incorrect (it should be "Seattle" not "Champaign").

We will fix these issues in the upcoming exercises.
:::


---

## 📇 What is Indexing?

In Pandas, _indexing_ refers to the way we access, select, and reference rows and columns in a `Series` or a `DataFrame`. Indexing is a fundamental operation in Pandas that allows us to manipulate and analyze data efficiently. Think of it as the "addressing system" for your data.

We have already seen some indexing methods, such as:

```python
df["dept"]                              # Access a column by its name
df.loc[df["dept"] == "HR", :]           # Access rows based on a condition (all columns)
df.loc[df["dept"] == "HR", "salary"]    # Access rows based on a condition (specific column)
df.iloc[0]                              # Access the first row by its integer position
df.iloc[0, 1]                           # Access the element in the first row and second column
```


---

## 🔖 `.loc`

The `.loc` indexer ("location by label") is used to access rows and columns in a `DataFrame` by their labels (i.e., the index and column names). It allows for more intuitive and flexible indexing based on the actual labels of the data.

```python
df.loc[row_selection, column_selection]  # Access rows and columns by labels
df.loc[:, column_selection]              # Access all rows for specific columns
df.loc[row_selection, :]                 # Access specific rows for all columns
```

The `row_selection` and `column_selection` can be:

- A single label (e.g., `"HR"` or `"salary"`).
- A list of labels (e.g., `["HR", "IT"]` or `["salary", "dept"]`).
- A slice of labels (e.g., `"b":"c"`); unlike ordinary Python slices, both endpoints are included.
- A boolean array-like (e.g., `df["dept"] == "HR"`).

We can use `.loc` to update values in a `DataFrame` by assigning to the selected cells. The general pattern is:

```python
df.loc["row_label", "column_name"] = "new_value"
```

The `.loc` indexer can be used with a `Series` as well:

```python
my_series = pd.Series([10, 20, 30], index=["a", "b", "c"])
my_series.loc["a"]       # Access the element labeled "a" (10)
my_series.loc["b":"c"]   # Access elements labeled "b" through "c", inclusive (20, 30)
```
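
To make the selection forms listed above concrete, here is a minimal sketch. The employee `DataFrame`, its names, and its salary figures are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical employee data, indexed by employee name
df = pd.DataFrame(
    {"dept": ["HR", "IT", "HR"], "salary": [50000, 65000, 52000]},
    index=["Ann", "Bob", "Cara"],
)

# Single row label + single column label -> one scalar value
single_cell = df.loc["Bob", "salary"]

# List of row labels + all columns -> a DataFrame with two rows
two_rows = df.loc[["Ann", "Cara"], :]

# Boolean mask + single column label -> a Series with one entry per matching row
hr_salaries = df.loc[df["dept"] == "HR", "salary"]

# Assigning through .loc updates the selected cell in place
df.loc["Bob", "dept"] = "Finance"
```

Note that the type of the result depends on the selectors: two single labels give a scalar, while a mask or a list gives a `Series` or `DataFrame`.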


---

**🎯 Example: Update Nvidia's ticker using `.loc`**

▶️ Show the `DataFrame` before the update:


In [3]:
df_companies[df_companies["company_name"] == "Nvidia"]

| | company_name | ticker | headquarters |
|---|---|---|---|
| 1 | Nvidia | AMD | Santa Clara |


To update one or more cells in a `DataFrame`, we can use the `.loc` indexer to specify the rows and columns we want to update, and then assign new values to those cells.

In the code below, we update Nvidia's ticker symbol from "AMD" to "NVDA" by specifying the row where the `company_name` is "Nvidia" and the `ticker` column.

▶️ Update Nvidia's ticker symbol from "AMD" to "NVDA":


In [4]:
df_companies.loc[df_companies["company_name"] == "Nvidia", "ticker"] = "NVDA"

▶️ Show the `DataFrame` after the update:


In [5]:
display(df_companies[df_companies["company_name"] == "Nvidia"])

| | company_name | ticker | headquarters |
|---|---|---|---|
| 1 | Nvidia | NVDA | Santa Clara |


---

**🎯 Example: Fill in missing tickers using `.loc`**

▶️ Show the `DataFrame` before the update:


In [6]:
display(df_companies.loc[df_companies["company_name"].isin(["Google", "Adobe"])])

| | company_name | ticker | headquarters |
|---|---|---|---|
| 2 | Google | NaN | Mountain View |
| 4 | Adobe | NaN | San Jose |


▶️ Fill in Google and Adobe's missing ticker symbols.

The syntax is identical to the previous example, except we are updating two cells this time.


In [7]:
df_companies.loc[df_companies["company_name"] == "Google", "ticker"] = "GOOG"
df_companies.loc[df_companies["company_name"] == "Adobe", "ticker"] = "ADBE"

▶️ Show the `DataFrame` after the update:


In [8]:
display(df_companies.loc[df_companies["company_name"].isin(["Google", "Adobe"])])

| | company_name | ticker | headquarters |
|---|---|---|---|
| 2 | Google | GOOG | Mountain View |
| 4 | Adobe | ADBE | San Jose |
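
When many cells need filling, writing one `.loc` statement per company gets repetitive. One possible alternative, sketched below, is to keep the corrections in a dictionary and fill every missing ticker in a single assignment. The `ticker_by_company` lookup is a hypothetical helper, and the small `DataFrame` is a stand-in for `df_companies`.

```python
import numpy as np
import pandas as pd

# Small stand-in for df_companies with two missing tickers
df = pd.DataFrame(
    {
        "company_name": ["Amazon", "Google", "Adobe"],
        "ticker": ["AMZN", np.nan, np.nan],
    }
)

# Hypothetical lookup table: company name -> ticker symbol
ticker_by_company = {"Google": "GOOG", "Adobe": "ADBE"}

# Boolean mask marking the rows whose ticker is missing
missing = df["ticker"].isna()

# Fill only the missing cells by mapping each company name to its ticker
df.loc[missing, "ticker"] = df.loc[missing, "company_name"].map(ticker_by_company)
```

Companies that are absent from the dictionary would simply stay missing, because `.map` returns `NaN` for unmatched keys.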


---

## 🔢 `.iloc`

The `.iloc` indexer ("integer location") is used to access rows and columns in a `DataFrame` by their integer positions (i.e., zero-based row and column numbers). It is useful when you want to access data based on its position rather than its label.

```python
df.iloc[row_index, column_index]  # Access rows and columns by integer positions
```

The `row_index` and `column_index` can be:

- A single integer (e.g., `0` or `1`).
- A list of integers (e.g., `[0, 2]` or `[1, 3]`).
- A slice (e.g., `0:3` to select the first three rows; the stop position is excluded).

The `.iloc` indexer can be used to select a specific element from a `Series` as well.

```python
my_series = pd.Series([10, 20, 30], index=["a", "b", "c"])
my_series.iloc[0]       # Access the first element (10)
my_series.iloc[1:3]     # Access positions 1 and 2; the stop position 3 is excluded (20, 30)
```

We can also use `.iloc` to update values in a `DataFrame`.

```python
df.iloc[row_index, column_index] = new_value
```
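
Slices behave differently in `.iloc` and `.loc`, which is a common source of off-by-one mistakes. The sketch below contrasts the two on a small `Series` and then performs a positional update on a hypothetical two-column `DataFrame`.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# .iloc slices follow Python rules: the stop position is excluded
first_by_position = s.iloc[0:1]    # only position 0

# .loc slices include both endpoint labels
first_by_label = s.loc["a":"b"]    # labels "a" and "b"

# Hypothetical DataFrame for a positional update
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])
df.iloc[0, 1] = 40                 # first row, second column ("y")
```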


---

**🎯 Example: Update Amazon's headquarters using `.iloc`**

▶️ Show the `DataFrame` before the update:


In [9]:
df_companies

| | company_name | ticker | headquarters |
|---|---|---|---|
| 0 | Amazon | AMZN | Champaign |
| 1 | Nvidia | NVDA | Santa Clara |
| 2 | Google | GOOG | Mountain View |
| 3 | Microsoft | MSFT | Redmond |
| 4 | Adobe | ADBE | San Jose |


▶️ Update Amazon's headquarters from "Champaign" to "Seattle" using `.iloc`.


In [10]:
df_companies.iloc[0, 2] = "Seattle"

:::{warning} Avoid using `.iloc` for updates based on conditions

Using `.iloc` to update cells that are really identified by a condition (such as a company name) is generally not recommended. `.iloc` relies on the position of the data, which changes when rows are added, removed, or reordered, so a hard-coded position can silently update the wrong cell. `.loc` is preferred for such tasks because it selects by label or condition rather than by position.

The example above is for demonstration purposes only. Avoid hard-coding row and column positions for updates in real-world code.
:::


▶️ Show the `DataFrame` after the update:


In [11]:
df_companies

| | company_name | ticker | headquarters |
|---|---|---|---|
| 0 | Amazon | AMZN | Seattle |
| 1 | Nvidia | NVDA | Santa Clara |
| 2 | Google | GOOG | Mountain View |
| 3 | Microsoft | MSFT | Redmond |
| 4 | Adobe | ADBE | San Jose |
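
For comparison, the same correction can be written without hard-coded positions. This is a minimal sketch on a two-row stand-in for `df_companies`; it shows a condition-based `.loc` update and, for cases where a position is genuinely needed, looking the column position up by name with `Index.get_loc`.

```python
import pandas as pd

# Two-row stand-in for df_companies
df = pd.DataFrame(
    {
        "company_name": ["Amazon", "Nvidia"],
        "ticker": ["AMZN", "NVDA"],
        "headquarters": ["Champaign", "Santa Clara"],
    }
)

# Condition-based update: keeps working if rows are reordered or inserted
df.loc[df["company_name"] == "Amazon", "headquarters"] = "Seattle"

# If a position is required, derive it from the column name instead of typing 2
hq_position = df.columns.get_loc("headquarters")
amazon_hq = df.iloc[0, hq_position]
```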


---

**🎯 Example: Retrieve Microsoft's ticker symbol using `.loc`**

▶️ Retrieve Microsoft's ticker symbol using `.loc`.


In [12]:
df_companies.loc[df_companies["company_name"] == "Microsoft", "ticker"]

3    MSFT
Name: ticker, dtype: object

:::{caution} The data type of the result is a `Series`, not a single value!

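The result of `df_companies.loc[df_companies["company_name"] == "Microsoft", "ticker"]` is a `Series`, even though it contains only one value. A boolean mask can match any number of rows, so `.loc` returns a `Series` whenever the row selection is a mask (or a list or slice) and the column selection is a single label. It returns a single value only when both the row and the column are given as single labels (and the index labels are unique).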

▶️ We can confirm this by using the `type()` function:

:::


In [13]:
type(df_companies.loc[df_companies["company_name"] == "Microsoft", "ticker"])

pandas.core.series.Series

▶️ We can retrieve the first element of the resulting `Series` using the `.iloc` indexer. Because we know there is only one matching row, we can use `.iloc[0]` to get the first (and only) element.


In [14]:
df_companies.loc[df_companies["company_name"] == "Microsoft", "ticker"].iloc[0]

'MSFT'

▶️ Alternatively, we can use the `.values` attribute to get the underlying NumPy array and then access the first element of that array.

This approach is less common but returns the same value here. Getting the underlying NumPy array can be useful in some scenarios (the pandas documentation recommends `.to_numpy()` over `.values` for that purpose), but for simple retrieval of a single value, `.iloc[0]` is more straightforward.


In [15]:
df_companies.loc[df_companies["company_name"] == "Microsoft", "ticker"].values[0]

'MSFT'
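
Two further ways to get a single value are sketched below, using a two-row stand-in for `df_companies`. `Series.item()` returns the only element and raises a `ValueError` if the `Series` does not contain exactly one, which guards against unexpected duplicates. Setting `company_name` as the index is an assumption made for this example; it lets `.at` return the scalar directly.

```python
import pandas as pd

# Two-row stand-in for df_companies
df = pd.DataFrame(
    {"company_name": ["Microsoft", "Adobe"], "ticker": ["MSFT", "ADBE"]}
)

# .item() returns the single element, or raises ValueError if there is not exactly one
matches = df.loc[df["company_name"] == "Microsoft", "ticker"]
ticker_from_item = matches.item()

# With unique company names as the index, .at looks up one cell by its labels
by_name = df.set_index("company_name")
ticker_from_at = by_name.at["Microsoft", "ticker"]
```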

---

**🎯 Example: Retrieve Adobe's headquarters location using `.iloc`**

▶️ We can retrieve Adobe's headquarters location using `.iloc`. Since Adobe is the last company in the `DataFrame`, we can use negative indexing to access the last row. Similarly, headquarters is the last column, so we can use negative indexing to access the last column as well.


In [16]:
df_companies.iloc[-1, -1]

'San Jose'

:::{caution} Reminder to Avoid Hard-Coding Indices

Hard-coding row and column positions (like `-1` for the last row and last column) is not recommended because it relies on the position of the data, which can change if the `DataFrame` is modified. For example, if a new company is appended to the `DataFrame`, `df_companies.iloc[-1, -1]` no longer returns Adobe's headquarters.

:::
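
The sketch below illustrates the problem on a two-row stand-in for `df_companies`. The appended Apple row is hypothetical and is used only to show how a positional lookup drifts while a condition-based lookup does not.

```python
import pandas as pd

# Two-row stand-in for df_companies
df = pd.DataFrame(
    {
        "company_name": ["Microsoft", "Adobe"],
        "headquarters": ["Redmond", "San Jose"],
    }
)

before = df.iloc[-1, -1]           # Adobe's headquarters

# Hypothetical new row appended to the end
new_row = pd.DataFrame({"company_name": ["Apple"], "headquarters": ["Cupertino"]})
df = pd.concat([df, new_row], ignore_index=True)

after = df.iloc[-1, -1]            # now the new row's headquarters, not Adobe's

# A condition-based lookup still finds Adobe regardless of row order
adobe_hq = df.loc[df["company_name"] == "Adobe", "headquarters"].iloc[0]
```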


---

## 🧩 `SettingWithCopyWarning`

When updating values in a `DataFrame`, you might encounter a `SettingWithCopyWarning`. This warning indicates that you are trying to set a value on a copy of a slice from a `DataFrame`, which may not have the intended effect.

This is one of the most common stumbling blocks when using Pandas. It often occurs when you create a subset of a `DataFrame` and then try to modify that subset.

It's Pandas' way of saying:

> "Hey, I’m not sure if you’re changing the **original DataFrame** or just a **temporary copy** of it. Your change might not stick."

When you filter a `DataFrame`, Pandas may return either:

- a **view**: a "window" into the original data
- a **copy**: a completely separate object in memory

Pandas doesn't always know which one it is, so it issues a warning to make sure you're aware of the potential issue.

We will demonstrate this with an example.

▶️ Create a new `DataFrame` named `df_seats` that contains information about available seats in various courses.


In [17]:
df_seats = pd.DataFrame(
    {
        "CourseID": ["ACCY 201", "FIN 300", "ACCY 202", "BADM 101"],
        "Department": ["ACCY", "FIN", "ACCY", "BADM"],
        "Available Seats": [300, 90, 250, 500],
    }
)

df_seats

| | CourseID | Department | Available Seats |
|---|---|---|---|
| 0 | ACCY 201 | ACCY | 300 |
| 1 | FIN 300 | FIN | 90 |
| 2 | ACCY 202 | ACCY | 250 |
| 3 | BADM 101 | BADM | 500 |


▶️ Filter `df_seats` to create a new `DataFrame` named `df_accy` that only contains accounting courses.


In [18]:
df_accy = df_seats[df_seats["Department"] == "ACCY"]
df_accy

| | CourseID | Department | Available Seats |
|---|---|---|---|
| 0 | ACCY 201 | ACCY | 300 |
| 2 | ACCY 202 | ACCY | 250 |


▶️ Increase the available seats for all accounting courses by 10%.


In [19]:
df_accy["Available Seats"] = df_accy["Available Seats"] * 1.1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accy["Available Seats"] = df_accy["Available Seats"] * 1.1


Pandas issues the `SettingWithCopyWarning` because `df_accy` was created by filtering the original `df_seats` DataFrame. When we then assign to a column of `df_accy`, Pandas cannot tell whether we intend to modify the original `df_seats` or only the filtered subset (`df_accy`).

▶️ Confirm that the available seats in `df_accy` have been increased by 10%.


In [20]:
df_accy

| | CourseID | Department | Available Seats |
|---|---|---|---|
| 0 | ACCY 201 | ACCY | 330.0 |
| 2 | ACCY 202 | ACCY | 275.0 |


▶️ Check the original `df_seats` DataFrame to see if the changes are reflected there as well.


In [21]:
df_seats

| | CourseID | Department | Available Seats |
|---|---|---|---|
| 0 | ACCY 201 | ACCY | 300 |
| 1 | FIN 300 | FIN | 90 |
| 2 | ACCY 202 | ACCY | 250 |
| 3 | BADM 101 | BADM | 500 |


You'll notice that the changes are not reflected in the original `df_seats`. Filtering with a boolean condition returned a separate copy of the data, so modifying `df_accy` does not affect `df_seats`.
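
If the goal is to change the original `DataFrame`, one option is to skip the intermediate subset and assign through `.loc` on `df_seats` itself, as the warning message suggests. The sketch below rebuilds the same data; storing the seat counts as floats is an assumption made here so that the scaled values fit the column's dtype.

```python
import pandas as pd

# Same data as df_seats, with seat counts stored as floats (assumption for this sketch)
df_seats = pd.DataFrame(
    {
        "CourseID": ["ACCY 201", "FIN 300", "ACCY 202", "BADM 101"],
        "Department": ["ACCY", "FIN", "ACCY", "BADM"],
        "Available Seats": [300.0, 90.0, 250.0, 500.0],
    }
)

# Boolean mask selecting the accounting courses
is_accy = df_seats["Department"] == "ACCY"

# Assign through .loc on the original DataFrame so the change sticks
df_seats.loc[is_accy, "Available Seats"] = (
    df_seats.loc[is_accy, "Available Seats"] * 1.1
)
```

Only the rows matched by `is_accy` change; the FIN and BADM rows keep their original values.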



:::{seealso} Use `.copy()` to Avoid `SettingWithCopyWarning`

An easy way to avoid the `SettingWithCopyWarning` is to create an explicit copy of the slice using the `.copy()` method. This ensures that you are working with a separate object in memory, and any modifications will not affect the original `DataFrame`.

```python
df_copy = df.loc[condition].copy()
df_copy["column_name"] = new_value
```

:::
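
Applied to the course data from this section, the `.copy()` pattern might look like the sketch below. It rebuilds `df_seats` so the block runs on its own. Whether the warning appears without `.copy()` can depend on the pandas version and its copy-on-write settings, so treat this as an illustration rather than a guarantee about every setup.

```python
import pandas as pd

df_seats = pd.DataFrame(
    {
        "CourseID": ["ACCY 201", "FIN 300", "ACCY 202", "BADM 101"],
        "Department": ["ACCY", "FIN", "ACCY", "BADM"],
        "Available Seats": [300, 90, 250, 500],
    }
)

# Explicit copy: df_accy is now clearly an independent DataFrame
df_accy = df_seats.loc[df_seats["Department"] == "ACCY"].copy()

# Modifying the copy does not touch df_seats
df_accy["Available Seats"] = df_accy["Available Seats"] * 1.1
```

After this runs, `df_accy` holds the increased seat counts while `df_seats` still holds the original values, which is the intended outcome when you want an independent subset.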
