# Lecture 5 – Accessing, Sorting, and Querying
## DSC 10, Fall 2021

### Announcements

- Lab 2 is due on **Tuesday 10/5 at 11:59pm**.
- Homework 2 is due on **Saturday 10/10 at 11:59pm**.
- If attending in-person office hours, make sure to hit the `#` key on the keypad after typing in the SDSC building access code (491720#).
    - The code and instructions can be found embedded within each relevant calendar event on Canvas.
- Attend discussion today or tomorrow!

### Agenda

- Using a single dataset to illustrate key DataFrame manipulation techniques.
    - Will use lots of motivating questions.

#### Note:

- Remember to check the [Resources tab of the course website](https://dsc10.com/resources/) for programming resources.
- Some key links moving forward:
    - [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view).
    - [BabyPandas Documentation](https://babypandas.readthedocs.io/en/latest/index.html).

## Case study: NBA salaries 🏀

In [None]:
import babypandas as bpd
import numpy as np

### Reading data from a file

- The file `nba_salaries.csv` contains all salaries from the 2015-2016 NBA (National Basketball Association) season.
    - CSV: *comma-separated values*.
- We can read a CSV using `bpd.read_csv()`. Pass it the name of the file if the file is in the same directory as your notebook, or a path to the file otherwise.

In [None]:
salaries = bpd.read_csv('data/nba_salaries.csv')
salaries
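
To try `bpd.read_csv()` without the course data, here is a minimal sketch that writes a tiny CSV to a temporary folder and reads it back using a path. The file name `toy_salaries.csv`, the players, and the salary values are all made up for illustration, and the fallback to plain pandas is an assumption for environments where babypandas is not installed.

```python
import os
import tempfile

try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas offers the same read_csv call used here.
    import pandas as bpd

# Made-up file contents: the first line holds the column labels.
csv_text = (
    "PLAYER,POSITION,TEAM,SALARY\n"
    "Player A,C,Team X,4.0\n"
    "Player B,PG,Team Y,20.0\n"
)

with tempfile.TemporaryDirectory() as folder:
    # The file is not next to the notebook, so we give read_csv a path.
    path = os.path.join(folder, 'toy_salaries.csv')
    with open(path, 'w') as f:
        f.write(csv_text)
    toy = bpd.read_csv(path)

# Two data rows and four columns; the header line is not counted as a row.
print(toy.shape)
```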

### Discussion Question

Which would be a good column to use as the index?

A. `PLAYER`

B. `POSITION`

C. `TEAM`

D. `2015_SALARY`

### To answer, go to **[menti.com](https://menti.com)** and enter the code **4991 2163**.

Is there something we should be worried about?

### Setting the index

In [None]:
salaries_by_player = salaries.set_index('PLAYER')
salaries_by_player

### Shape of a table

- `.shape` returns the number of rows and number of columns.
- Access each with `[]`:

In [None]:
salaries_by_player.shape

In [None]:
# Number of rows
salaries_by_player.shape[0]

In [None]:
# Number of columns
salaries_by_player.shape[1]
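
One detail worth checking on a small table is that the index is not counted as a column. The sketch below uses made-up players and salaries (in millions of dollars), and falls back to plain pandas as an assumption if babypandas is unavailable.

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table: made-up players, teams, and salaries.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    POSITION=['C', 'PG', 'C', 'SF'],
    TEAM=['Team X', 'Team Y', 'Team Z', 'Team X'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
)

# Moving PLAYER into the index leaves one fewer column.
toy_by_player = toy.set_index('PLAYER')

print(toy.shape)            # rows and columns before setting the index
print(toy_by_player.shape)  # same number of rows, one fewer column
```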

## Example 1: Adjusting for inflation

**Key concept:** How do we access columns, perform operations on them, and add new columns?

### Adjusting for inflation

- These salaries are old. We should adjust for inflation.
- $\$1.00$ in 2015 = $\$1.09$ in 2021.
- Workflow:
    - Get the column of salaries.
    - Multiply every element by 1.09.
    - Add new column to table.

#### Step 1 – Getting a column

- We can get a column from a DataFrame using `.get(column_name)`.
- Warning: column names are case sensitive!
- The result looks like a 1-column DataFrame, but is actually a *Series*.

In [None]:
salaries_by_player

In [None]:
salaries_by_player.get("2015_SALARY")

### Digression: Series

- A *Series* is like an array, but with an index.
- In particular, Series support arithmetic.

In [None]:
salaries_by_player.get("2015_SALARY")

#### Step 2 – Adjust the salaries for inflation

- Just like with arrays, we can perform arithmetic operations on every element of a `Series` quite efficiently.

In [None]:
salaries_by_player.get("2015_SALARY") * 1.09

#### Step 3 – Add the adjusted salaries to the table

- Use `.assign(Name_of_column=data_in_array)` to add an array (or Series, or list) to a table as a new column.
- **Warning!** No quotes around `Name_of_column`.
- This creates a new DataFrame! To keep it, we must save it to a variable.

In [None]:
salaries_by_player.assign(
    ADJUSTED_SALARY=salaries_by_player.get("2015_SALARY") * 1.09
)

In [None]:
salaries_by_player

In [None]:
adjusted_salaries = salaries_by_player.assign(
    ADJUSTED_SALARY=salaries_by_player.get("2015_SALARY") * 1.09
)
adjusted_salaries
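
The three steps can be replayed on a small table to confirm that `.assign` leaves the original DataFrame alone. This is a sketch on made-up data (hypothetical players, salaries in millions), with a pandas fallback assumed in case babypandas is not installed.

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table: made-up players and salaries.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    POSITION=['C', 'PG', 'C', 'SF'],
    TEAM=['Team X', 'Team Y', 'Team Z', 'Team X'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

# Step 1: get the column (a Series).
salary = toy.get('SALARY')

# Step 2: multiply every element by the inflation factor from the lecture.
adjusted = salary * 1.09

# Step 3: add the result as a new column, saving the new DataFrame.
toy_adjusted = toy.assign(ADJUSTED_SALARY=adjusted)

print(toy.shape[1])           # the original table keeps its 3 columns
print(toy_adjusted.shape[1])  # the new table has 4 columns
```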

## Example 2: Getting a particular player's salary

**Key concept**: How do we access specific values using an index?

### How much did LeBron James 🐐 make in 2015 (adjusted for inflation)?

In [None]:
# This is a Series!
adjusted_salaries.get('ADJUSTED_SALARY')

### Accessing a Series by row label: `.loc`

- Use `.loc[]` to *access* the element of a Series with a particular row label.

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').loc['LeBron James']

### How to get a particular element from a table

1. `.get()` the column label.
2. `.loc[]` the row label.

In this class, we'll always get column, then row (but row, then column is also possible).

Example: What position does LeBron play?

In [None]:
adjusted_salaries.get('POSITION').loc['LeBron James']
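
The column-then-row pattern looks the same on any table with a meaningful index. Here is a small sketch on made-up data (the players and values are hypothetical, and the pandas fallback is an assumption).

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table, indexed by player name.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    POSITION=['C', 'PG', 'C', 'SF'],
    TEAM=['Team X', 'Team Y', 'Team Z', 'Team X'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

# 1. .get() the column label, 2. .loc[] the row label.
salary_b = toy.get('SALARY').loc['Player B']
position_b = toy.get('POSITION').loc['Player B']

print(salary_b, position_b)
```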

## Example 3: Salary analysis

**Key concept**: How do we compute statistics of columns?

### Questions about salary

- What was the highest/lowest salary? What was the average salary?
- Series have helpful methods, like `.min()`, `.max()`, `.mean()`, etc.

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').min()

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').max()

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').mean()

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').median()
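
On a table small enough to check by hand, the same methods show how the mean and median can differ when one value is much larger than the rest. The numbers below are made up for illustration, and the pandas fallback is an assumption.

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical salaries, in millions of dollars.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

salary = toy.get('SALARY')

lowest = salary.min()      # smallest value in the column
highest = salary.max()     # largest value in the column
average = salary.mean()    # (4.0 + 20.0 + 12.5 + 1.0) / 4
middle = salary.median()   # average of the two middle values, 4.0 and 12.5

print(lowest, highest, average, middle)
```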

## Example 4: *Who* had the highest salary?

**Key concept**: Sorting.

#### Step 1 – sort the table

- Use the `.sort_values(by=column_name)` method to sort.
    - **Notice:** Like most DataFrame methods, this returns a new DataFrame.
- By default, rows are sorted in *ascending* order, but we want *descending* order.

In [None]:
adjusted_salaries.sort_values(by='ADJUSTED_SALARY')

#### Step 1 – sort the table in *descending* order

- Use `.sort_values(by=column_name, ascending=False)` to sort in *descending* order.

In [None]:
highest_salaries = adjusted_salaries.sort_values(by='ADJUSTED_SALARY', ascending=False)
highest_salaries

#### Step 2 – get the *name* of the person with the highest salary

- We saw that it was Kobe Bryant, but how do we get the name using code?
- Remember, the index is like an array.

In [None]:
highest_salaries.index[0]
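
The same two steps can be sketched on a small made-up table (hypothetical players and salaries, with a pandas fallback assumed). It also shows that sorting does not reorder the original table.

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table, indexed by player name.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

# Step 1: sort in descending order, saving the new DataFrame.
toy_sorted = toy.sort_values(by='SALARY', ascending=False)

# Step 2: the first label in the sorted index is the highest-paid player.
top_player = toy_sorted.index[0]

print(top_player)    # label of the row with the largest salary
print(toy.index[0])  # the original table is still in its original order
```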

## Example 5: What team did the person with the third-lowest salary play for?

**Key concept**: Using integer positions.

#### Strategy 1

Using what we already know, we could...

1. Sort the table in ascending order using `.sort_values(by='ADJUSTED_SALARY')`.
2. Get the name of the person using `.index[2]` (remember, positions start at 0).
3. Use `.get('TEAM').loc[their_name]` to get their team name.



In [None]:
lowest_salaries = adjusted_salaries.sort_values(by='ADJUSTED_SALARY')
lowest_salaries

In [None]:
name = lowest_salaries.index[2]
name

In [None]:
lowest_salaries.get('TEAM').loc[name]

#### Another approach?

- To get the third element using `.loc[]`, we first had to find its label.
- Can we just get the third element without knowing its label?
- Yes, with `.iloc[]`:

In [None]:
lowest_salaries.get('TEAM')

In [None]:
lowest_salaries.get('TEAM').loc['Jordan McRae']

In [None]:
lowest_salaries.get('TEAM').iloc[2]

#### Strategy 2

1. Sort the table in ascending order using `.sort_values(by='ADJUSTED_SALARY')`, as before.
2. Use `.get('TEAM').iloc[2]` to get the team name.

In [None]:
adjusted_salaries.sort_values(by='ADJUSTED_SALARY').get('TEAM').iloc[2]

### Summary of accessing a Series

- There are two ways to get an element of a Series:
    - `.loc[]` uses the row label.
    - `.iloc[]` uses the integer position.
- Usually `.loc` is more convenient.
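
To see that both routes reach the same element, here is a sketch on made-up data (hypothetical players and teams, pandas fallback assumed) that looks up the team of the third-lowest earner both ways.

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table, indexed by player name.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    TEAM=['Team X', 'Team Y', 'Team Z', 'Team X'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

# Ascending order of salary: Player D, Player A, Player C, Player B.
lowest = toy.sort_values(by='SALARY')

# Route 1: find the row label at position 2, then use .loc[].
name = lowest.index[2]
team_by_label = lowest.get('TEAM').loc[name]

# Route 2: use the integer position directly with .iloc[].
team_by_position = lowest.get('TEAM').iloc[2]

print(name, team_by_label, team_by_position)
```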

### Note

- Sometimes the integer position and row label are the same.
- This happens by default with `bpd.read_csv`:

In [None]:
bpd.read_csv('data/nba_salaries.csv')

In [None]:
bpd.read_csv('data/nba_salaries.csv').get('PLAYER').loc[3]

In [None]:
bpd.read_csv('data/nba_salaries.csv').get('PLAYER').iloc[3]

## Reflection

### Questions we can answer right now:
- What was the highest salary?
    - `adjusted_salaries.get('ADJUSTED_SALARY').max()`
- How many players were there?
    - `adjusted_salaries.shape[0]`
- What was LeBron James' salary?
    - `adjusted_salaries.get('ADJUSTED_SALARY').loc['LeBron James']`
- _Who_ had the highest salary?
    - `adjusted_salaries.sort_values(by='ADJUSTED_SALARY', ascending=False).index[0]`

### Questions we can't yet answer:
- What was the total payroll of the Cleveland Cavaliers?
- How many players made over 10 million?
- Who was the highest paid center (C)?

The common thread between these questions is that they all involve only a **subset** of the rows in our table.

## Example 6: Who was the highest paid center (C)?

**Key concept**: Selecting rows (via Boolean indexing).

### Selecting rows

- We could determine who was the highest paid center (C) if we had a table consisting of only centers.
- How do we get that table?

### The solution

In [None]:
adjusted_salaries[adjusted_salaries.get('POSITION') == 'C']

In [None]:
'PG' == 'C'

In [None]:
'C' == 'C'

In [None]:
adjusted_salaries.get('POSITION') == 'C'

### Boolean indexing

To select only some rows of `adjusted_salaries`:

1. Make a list/array/Series of `True`s (keep) and `False`s (toss)
2. Then pass it into `adjusted_salaries[]`.

Rather than making the list by hand, we usually generate it by making a comparison.
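
As a sketch of that idea on made-up data (hypothetical players and salaries in millions, pandas fallback assumed), each comparison below produces a Series of `True`s and `False`s, which is then passed into the brackets to keep only the matching rows.

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table, indexed by player name.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    POSITION=['C', 'PG', 'C', 'SF'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

# A Boolean Series: True for centers, False for everyone else.
is_center = toy.get('POSITION') == 'C'

# Keep only the rows where the Series is True.
centers = toy[is_center]

# The same pattern works with a numeric comparison.
over_five = toy[toy.get('SALARY') > 5]

print(list(centers.index))
print(list(over_five.index))
```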

### Elementwise comparisons work as expected

In [None]:
adjusted_salaries.get('2015_SALARY') > 5

### How do we make a table with only the players who made over 5 million?

### How do we make a table with only the players on the Cleveland Cavaliers?

### Original Question: How do we determine who was the highest paid center?

Strategy:
1. Extract a table of just the centers.
2. Sort by salary in descending order.
3. Return the first element in the index.
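
One way to sketch this strategy, using a made-up table in place of `adjusted_salaries` (the players and salaries are hypothetical, and the pandas fallback is an assumption):

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table, indexed by player name.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    POSITION=['C', 'PG', 'C', 'SF'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

# 1. Extract a table of just the centers.
centers = toy[toy.get('POSITION') == 'C']

# 2. Sort by salary, largest first.
centers_sorted = centers.sort_values(by='SALARY', ascending=False)

# 3. The first label in the index is the highest paid center.
top_center = centers_sorted.index[0]

print(top_center)
```

Note that Player B has the largest salary overall but is not a center, so filtering before sorting matters.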

### What if the condition isn't satisfied?

In [None]:
adjusted_salaries[adjusted_salaries.get('TEAM') == 'UC San Diego Tritons']
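
When no row satisfies the condition, the result is an empty table rather than an error. A minimal sketch on made-up data (hypothetical teams, pandas fallback assumed) checks this with `.shape`:

```python
try:
    import babypandas as bpd
except ImportError:
    # Assumption: plain pandas shares the methods used in this sketch.
    import pandas as bpd

# Hypothetical toy table, indexed by player name.
toy = bpd.DataFrame().assign(
    PLAYER=['Player A', 'Player B', 'Player C', 'Player D'],
    TEAM=['Team X', 'Team Y', 'Team Z', 'Team X'],
    SALARY=[4.0, 20.0, 12.5, 1.0],
).set_index('PLAYER')

# No player is on 'Team Q', so every entry of the comparison is False.
nobody = toy[toy.get('TEAM') == 'Team Q']

print(nobody.shape[0])  # number of rows kept
print(nobody.shape[1])  # the columns are still there
```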

### Discussion Question

Which of these three queries determines the total payroll of the Los Angeles Lakers?

A. `adjusted_salaries[adjusted_salaries.get('TEAM') == 'Los Angeles Lakers'].get('ADJUSTED_SALARY').sum()`

B. `adjusted_salaries.get('ADJUSTED_SALARY').sum()[adjusted_salaries.get('TEAM') == 'Los Angeles Lakers']`

C. `adjusted_salaries['Los Angeles Lakers'].get('ADJUSTED_SALARY').sum()`

### To answer, go to **[menti.com](https://menti.com)** and enter the code **4991 2163**.

## Summary

### Summary

- We learned many DataFrame methods and techniques.
- Don't feel the need to memorize them all right away.
- Instead, use this lecture and the aforementioned readings as references when working on the labs and homework assignments.
- Over time, these techniques will become more and more familiar.
- **Practice!** Frame your own questions using this dataset and try to answer them.

### Next time

- On Wednesday, we'll try to answer more involved questions about our data, which will lead us to a new core DataFrame method – `groupby`.