## Part 3: Indexes, Lookup by Key, and Sorting

### 1. Adding an integer identifier & using a unique `uid` as the index

We'll start with the small DataFrame from before (`df_small`), which already has a unique `uid` column (e.g., `AS100293`). pandas gives every new DataFrame a default integer index (0, 1, 2, ...) that serves as a simple integer identifier; we'll look at it first, then set `uid` as the index.

In [None]:
import pandas as pd

# Raw data as a dictionary of columns
people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "uid": ["AS100293", "BJ240806", "CL150510"],
}

# Build the DataFrame from the dictionary
df_small = pd.DataFrame(people)

In [None]:
# Show current frame
print("Before setting index:")
df_small

In [None]:
# Set uid as the index
df_small.set_index("uid", inplace=True)

# Inspect the index
print("\nAfter setting uid as index:")
print("\nIndex object:", df_small.index)
df_small

**Key points:**

* `set_index("uid", inplace=True)` moves the `uid` column out of the regular columns and makes it the DataFrame's index, replacing the default integer index.
* You can inspect the current index at any time via `df_small.index`.
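As a complement to the key points above, here is a minimal sketch of the non-`inplace` form of `set_index`, together with a uniqueness check. The frame `people_demo` and the name `indexed_demo` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Hypothetical three-row frame that mirrors the shape of df_small
people_demo = pd.DataFrame({
    "first": ["Alice", "Bob", "Carol"],
    "uid": ["AS100293", "BJ240806", "CL150510"],
})

# Check that the candidate column really is unique before using it as an index
print("uid is unique:", people_demo["uid"].is_unique)

# Without inplace=True, set_index returns a new DataFrame and leaves the
# original untouched. verify_integrity=True raises ValueError on duplicates.
indexed_demo = people_demo.set_index("uid", verify_integrity=True)

print("Original columns:", list(people_demo.columns))
print("Indexed columns: ", list(indexed_demo.columns))
print("Index name:      ", indexed_demo.index.name)
```

Keeping the original frame around makes it easy to compare the two layouts side by side while experimenting.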

---

### 2. Accessing rows via `.loc` vs `.iloc` after changing the index

With `uid` as the index, `.loc` looks rows up by their `uid` label, so the old integer labels no longer work (e.g., `.loc[0]` raises a `KeyError` because `0` is not a `uid`). `.iloc` is position-based and still works exactly as before.

In [None]:
# Example: access by uid with .loc
print("Row by uid via .loc:")
print(df_small.loc['BJ240806'])

# .iloc by position still works (e.g., first row)
print("\nRow by position via .iloc:")
print(df_small.iloc[0])
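The difference between label-based and position-based access can be sketched on a small stand-in frame. `loc_demo` and `loc_zero_failed` are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical frame that already uses uid strings as its index
loc_demo = pd.DataFrame(
    {"first": ["Alice", "Bob", "Carol"]},
    index=pd.Index(["AS100293", "BJ240806", "CL150510"], name="uid"),
)

# .loc looks up labels, and 0 is not one of the uid labels
try:
    loc_demo.loc[0]
    loc_zero_failed = False
except KeyError:
    loc_zero_failed = True
print("loc[0] raised KeyError:", loc_zero_failed)

# .loc also accepts a list of labels plus a column label
print(loc_demo.loc[["AS100293", "CL150510"], "first"])

# .iloc ignores labels and uses positions; -1 is the last row
print(loc_demo.iloc[-1])
```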

### 3. Resetting the index if it was modified accidentally

To undo `set_index` and turn the index back into a regular column, call `reset_index()`:

In [None]:
# Reset index back to default integer index
df_small = df_small.reset_index()
print("After reset_index:")
df_small
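`reset_index` also takes a `drop` parameter that controls whether the old index is kept as a column. The following sketch contrasts the two behaviours on a hypothetical frame (`reset_demo`, `restored`, and `discarded` are illustrative names).

```python
import pandas as pd

# Hypothetical frame indexed by uid
reset_demo = pd.DataFrame(
    {"first": ["Alice", "Bob"]},
    index=pd.Index(["AS100293", "BJ240806"], name="uid"),
)

# Default: the old index comes back as an ordinary column
restored = reset_demo.reset_index()

# drop=True: the old index is discarded instead of being kept
discarded = reset_demo.reset_index(drop=True)

print("restored columns: ", list(restored.columns))
print("discarded columns:", list(discarded.columns))
print("new index:        ", list(restored.index))
```

Use `drop=True` only when the index values are no longer needed, since they cannot be recovered afterwards.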

### 4. Setting the index on the big survey DataFrame on load

When reading the large Stack Overflow survey results, we can directly set a meaningful unique identifier as the index. The column `ResponseId` uniquely identifies each respondent.

In [None]:
# Read with ResponseId as index right away
df = pd.read_csv('data/survey_results_public.csv', index_col="ResponseId")

# Quick check
print("Index name and sample:")
print(df.index.name, "— first 5 indices:")
print(df.index[:5])

You can now select any respondent's row with `.loc[...]` by passing their `ResponseId`.
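To illustrate ID-based selection without the full survey file, the sketch below reads a tiny in-memory CSV. The rows and the `Country` and `YearsCode` values are made up for this example, and `mini_survey` is a hypothetical name. Because `ResponseId` is numeric, an integer passed to `.loc` is treated as a label, not a position.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for the survey file (values are invented)
csv_text = io.StringIO(
    "ResponseId,Country,YearsCode\n"
    "1,Germany,10\n"
    "2,India,4\n"
    "3,Brazil,7\n"
)
mini_survey = pd.read_csv(csv_text, index_col="ResponseId")

# One respondent: returns that row as a Series
print(mini_survey.loc[2])

# One cell: row label first, then column label
print(mini_survey.loc[2, "Country"])

# Several respondents and a subset of columns
print(mini_survey.loc[[1, 3], ["Country", "YearsCode"]])
```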

---

### 5. Real-world schema lookup example

Suppose you want to understand what the question code `AISelect` means in the schema, and you have `schema_df` loaded from the schema CSV.

First, set its index to the question name field (commonly `qname`), so lookups are simple:

In [None]:
# Load schema with index set to question names
schema_df = pd.read_csv('data/survey_results_schema.csv', index_col="qname")

# Lookup the full entry for 'AISelect'
print("Schema entry for AISelect:")
schema_df.loc["AISelect"]

In [None]:
# Another code we may not know: "Check"
print("\nSchema entry for 'Check':")
schema_df.loc["Check"]

In [None]:
# If the question text is truncated in display, access the full field explicitly
print("\nFull question text for 'Check':")
print(schema_df.loc["Check", "question"])

**Why this helps:**
By setting `qname` as the index, you can directly do `schema_df.loc[...]` to resolve what each coded column in the main survey means, making interpretation and downstream labeling far easier.
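A lookup with `.loc` raises a `KeyError` when the code is not in the index, so a small guard can be convenient when resolving many codes. The following is a minimal sketch on a hypothetical two-row schema. The codes, the question texts, and the helper `describe` are all invented for illustration and are not part of the survey files.

```python
import pandas as pd

# Hypothetical schema rows; the question wording is placeholder text
mini_schema = pd.DataFrame(
    {"question": ["Placeholder text for question A",
                  "Placeholder text for question B"]},
    index=pd.Index(["CodeA", "CodeB"], name="qname"),
)

def describe(code):
    """Return the question text for a code, or None if the code is absent."""
    # Membership test against the index avoids a KeyError from .loc
    if code in mini_schema.index:
        return mini_schema.loc[code, "question"]
    return None

print(describe("CodeA"))
print(describe("Missing"))
```

This assumes the index values are unique; with duplicate codes, `.loc` would return several rows rather than a single string.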

---

### 6. Sorting alphabetically (ascending and descending) in place

You can sort a DataFrame by its index labels with `sort_index`, or by the values in one or more columns with `sort_values`. The examples here sort `schema_df` by its index:

In [None]:
# Sort schema_df by its index (qname) alphabetically ascending
schema_df.sort_index(ascending=True, inplace=True)
print("First few qnames after ascending sort:")
print(schema_df.index[:5])

# Sort schema_df by index descending
schema_df.sort_index(ascending=False, inplace=True)
print("\nFirst few qnames after descending sort:")
print(schema_df.index[:5])

*Note:* `inplace=True` modifies the DataFrame directly and returns `None`. Assigning the result instead (e.g., `schema_df = schema_df.sort_index(...)`) is generally preferred because it makes the data flow explicit and allows method chaining.
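Sorting by a column works the same way through `sort_values`. Here is a short sketch using the assignment style on a hypothetical frame (`sort_demo`, `by_last_asc`, and `by_last_desc` are illustrative names).

```python
import pandas as pd

# Hypothetical unsorted frame
sort_demo = pd.DataFrame({
    "first": ["Carol", "Alice", "Bob"],
    "last": ["Lee", "Smith", "Jones"],
})

# sort_values returns a new, sorted DataFrame; ascending=True is the default
by_last_asc = sort_demo.sort_values(by="last")
by_last_desc = sort_demo.sort_values(by="last", ascending=False)

print(by_last_asc["last"].tolist())   # ['Jones', 'Lee', 'Smith']
print(by_last_desc["last"].tolist())  # ['Smith', 'Lee', 'Jones']

# The original frame keeps its order because nothing was done in place
print(sort_demo["last"].tolist())     # ['Lee', 'Smith', 'Jones']
```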

### Exercise for Part 3

1. Sort the people in `df_small` by date of birth so that the **oldest** person appears first and the **youngest** last (i.e., ascending by date).
2. After sorting, print:
   - The first row (should be the oldest person).
   - The last row (youngest person).