In [None]:
# This code configures how dataframes are displayed. Don't worry about
# understanding this code, but make sure to run it if you're following along.
import numpy as np
import pandas as pd

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Lecture 5 – Accessing, Sorting, and Querying
## DSC 10, Summer 2022

### Announcements

- Lab 2 is due on **Tuesday 7/12 at 11:59pm**.
- Remember to check the [Debugging](https://dsc10.com/debugging) tab on the course website if you run into any issues.
- Please post questions on the main Campuswire feed rather than sending DMs or emails.

### Agenda

- Using a single dataset to illustrate key DataFrame manipulation techniques.
    - Will use lots of motivating questions.

#### Note:

- Remember to check the [Resources tab of the course website](https://dsc10.com/resources/) for programming resources.
- Some key links moving forward:
    - [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view).
    - [BabyPandas Documentation](https://babypandas.readthedocs.io/en/latest/index.html).

## DataFrames

### `pandas`

- DataFrames are provided by a package called `pandas`.
- `pandas` is **the** tool for doing data science in Python.

<center>
<img src='data/pandas.png' width=500>
</center>

### But `pandas` is not so cute...

<center>
<img height=100% src="data/angrypanda.jpg"/>
</center>

### `babypandas`

- We at UCSD have created a smaller, nicer version of `pandas` called `babypandas`.
- It keeps the important stuff and has much better error messages.
- It's easier to learn, but is still valid `pandas` code.

<center>
<img height=75% src="data/babypanda.jpg" width=500>
</center>

### DataFrames in `babypandas` 🐼

- Tables in `babypandas` (and `pandas`) are called "DataFrames."
- To use DataFrames, we'll need to import `babypandas`. (We'll need `numpy` as well.)

In [None]:
import babypandas as bpd
import numpy as np
%load_ext pandas_tutor

### Reading data from a file 📖

- The file `nba-2022.csv` contains all salaries from the 2021-22 NBA (National Basketball Association 🏀) season for players who have played at least 15 games.
    - CSV stands for "comma-separated values".
- We can read a CSV using `bpd.read_csv(...)`. Give it the name of the file if it's in the same directory as your notebook, or a path to the file otherwise.
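
As a minimal sketch of the call, the cell below reads CSV text from an in-memory buffer instead of `nba-2022.csv`, so it runs without the data file. It uses plain `pandas` (imported as `pd`), on the assumption that `bpd.read_csv` is called the same way. The rows are invented for illustration.

```python
import io
import pandas as pd

# Invented CSV text standing in for a file on disk; not real NBA data.
csv_text = """Player,Position,Team,Salary
Ann Lee,C,Team A,8200000
Bo Cruz,PG,Team B,16400000
Cy Diaz,C,Team B,4100000
"""

# read_csv accepts a file path or a file-like object such as StringIO.
salaries = pd.read_csv(io.StringIO(csv_text))
print(salaries)
```

With the actual file, the call would look like `bpd.read_csv('data/nba-2022.csv')`, the path used later in this lecture.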

### Structure of a DataFrame

- DataFrames have *columns* and *rows*.
    - Think of each column as an array.
- Each column has a label: `'Player'`, `'Position'`, etc.
    - This is its name.
    - Column labels are stored as strings.
- Every row has a label too: in this case, 0, 1, 2, 3, 4, ..., 380.
    - Together, the row labels are called the _index_. The index is **not** a column!

In [None]:
salaries

### Setting a new index

- We can set a better index using `.set_index(column_name)`.
- Row labels should be unique identifiers.
    - Row labels = row names; ideally, each row has a different, descriptive name.
- ⚠️ Like most DataFrame methods, `.set_index` returns a new DataFrame; it does not modify the original DataFrame.
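
The warning above can be checked with a small sketch. It uses plain `pandas` and invented rows (the player names, teams, and salaries are made up), and assumes `.set_index` behaves the same way in `babypandas`.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Position': ['C', 'PG', 'C', 'SF'],
    'Team': ['Team A', 'Team B', 'Team B', 'Team A'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
})

# set_index returns a new DataFrame, so save the result under a new name.
salaries_by_player = salaries.set_index('Player')

print(salaries.columns.tolist())          # 'Player' is still a column here
print(salaries_by_player.index.tolist())  # player names are now the row labels
```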

### Shape of a DataFrame

- `.shape` returns the number of rows and columns in a given DataFrame.
- Access each with `[]`:
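
For instance, here is a sketch on a small invented DataFrame (plain `pandas`, made-up rows) showing that position 0 of `.shape` is the row count and position 1 is the column count.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Position': ['C', 'PG', 'C', 'SF'],
    'Team': ['Team A', 'Team B', 'Team B', 'Team A'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

num_rows = salaries_by_player.shape[0]     # number of rows
num_columns = salaries_by_player.shape[1]  # number of columns; the index is not counted
print(num_rows, num_columns)
```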

In [None]:
salaries_by_player

In [None]:
# Number of rows


In [None]:
# Number of columns


## Example 1: Salary per game

**Key concepts:** How do we access columns, perform operations on them, and add new columns?

### Finding per-game salaries

- We have each player's salary this season. An NBA regular season has 82 games in it.
- **Question:** How much money is each player getting paid per game?
- Workflow:
    - Get the column of salaries.
    - Divide each element by 82.
    - Add a new column to the DataFrame.

#### Step 1 – Getting a column

- We can get a column from a DataFrame using `.get(column_name)`.
- ⚠️ Column names are case sensitive!
- The result looks like a 1-column DataFrame, but is actually a *Series*.
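
A short sketch of this step on invented data, written with plain `pandas` on the assumption that `.get` is used the same way in `babypandas`. It also shows why the case of the column name matters: `'salary'` is simply not one of the column labels.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Position': ['C', 'PG', 'C', 'SF'],
    'Team': ['Team A', 'Team B', 'Team B', 'Team A'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

salary_column = salaries_by_player.get('Salary')
print(type(salary_column))  # a Series, not a DataFrame

# Column labels are case sensitive: 'Salary' exists, 'salary' does not.
print('Salary' in salaries_by_player.columns)
print('salary' in salaries_by_player.columns)
```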

In [None]:
salaries_by_player

### Digression: Series

- A *Series* is like an array, but with an index.
- In particular, Series support arithmetic.

In [None]:
salaries_by_player.get('Salary')

#### Step 2 – Dividing salaries by the number of games

- Just like with arrays, we can perform arithmetic operations on every element of a Series quite efficiently.
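
As a sketch of this step, dividing a Series by a number divides every element and keeps the index. The data is invented (salaries chosen so the division comes out evenly) and the code uses plain `pandas`.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# One expression divides all four salaries by the 82 games in a season.
per_game = salaries_by_player.get('Salary') / 82
print(per_game)
```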

#### Step 3 – Adding the per-game salaries to the DataFrame

- Use `.assign(Name_of_column=data_in_array)` to add a Series (or array, or list) to a DataFrame as a new column.
- ⚠️ Don't put quotes around `Name_of_column`.
- Creates a new DataFrame! Must save to variable.
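
Putting the three steps together, a minimal sketch on invented data (plain `pandas`, made-up rows) might look like this. The names `salaries_pg` and `Salary_per_game` match the ones used later in this lecture.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# The new column name is a keyword argument, so it is written without quotes.
salaries_pg = salaries_by_player.assign(
    Salary_per_game=salaries_by_player.get('Salary') / 82
)

print(salaries_pg.columns.tolist())         # includes 'Salary_per_game'
print(salaries_by_player.columns.tolist())  # the original is unchanged
```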

### Using pandas_tutor

- Pandas Tutor is a tool I developed as part of my PhD!
- It draws diagrams to explain `pandas` (and `babypandas`) code.
- Add `%%pt` to the top of the cell to explain the last line of `babypandas` code.
  - Can also use website: https://pandastutor.com/

## Example 2: Getting a particular player's salary

**Key concept**: How do we access specific values using an index?

### How much is LeBron James 🐐 making this season?

### Accessing a Series by row label: `.loc`

- Use `.loc[]` to *access* the element of a Series with a particular row label.
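
A small sketch of `.loc[]` on a Series, using invented players in place of the real dataset (plain `pandas`, made-up values):

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

salary_series = salaries_by_player.get('Salary')

# Square brackets, not parentheses: .loc is indexed by row label.
bo_salary = salary_series.loc['Bo Cruz']
print(bo_salary)
```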

### How to get a particular element from a DataFrame

1. `.get()` the column label.
2. `.loc[]` the row label.

In this class, we'll always get a column, then a row (but row, then column is also possible).

Example: What position does LeBron play?
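
The same pattern answers this kind of question for any player. The sketch below uses an invented player rather than the real data, and is written in plain `pandas`; the row-then-column line is included only to illustrate the parenthetical remark above and is assumed, not confirmed, to carry over to `babypandas`.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Position': ['C', 'PG', 'C', 'SF'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# Column first, then row: the order used in this class.
position = salaries_by_player.get('Position').loc['Bo Cruz']

# Row first, then column: also possible in plain pandas.
same_position = salaries_by_player.loc['Bo Cruz'].get('Position')

print(position, same_position)
```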

## Example 3: Salary analysis

**Key concept**: How do we compute statistics of columns?

### Questions about salary

- What is the highest/lowest salary? What is the average salary?
- Series have helpful methods, like `.min()`, `.max()`, `.mean()`, etc.
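
A quick sketch of these methods on an invented salary column (plain `pandas`, made-up values):

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

salary = salaries_by_player.get('Salary')

highest = salary.max()
lowest = salary.min()
average = salary.mean()
print(highest, lowest, average)
```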

## Example 4: *Who* has the highest salary?

**Key concept**: Sorting.

#### Step 1 – Sorting the DataFrame

- Use the `.sort_values(by=column_name)` method to sort.
    - **Notice:** Like most DataFrame methods, this returns a new DataFrame.
- Everything works as expected, but we wanted *descending* order.

#### Step 1 – Sorting the DataFrame in *descending* order

- Use `.sort_values(by=column_name, ascending=False)` to sort in *descending* order.
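
To compare the two orders side by side, here is a sketch on invented data (plain `pandas`, made-up rows). Neither call changes `salaries_by_player` itself.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

lowest_first = salaries_by_player.sort_values(by='Salary')
highest_first = salaries_by_player.sort_values(by='Salary', ascending=False)

print(lowest_first.index.tolist())
print(highest_first.index.tolist())
```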

#### Step 2 – Getting the *name* of the person with the highest salary

- We saw that it's Stephen Curry, but how do we get the name using code?
- Remember, the index is like an array.
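
Because the index can be accessed by position like an array, the first row label of the sorted DataFrame is the name we want. A sketch on invented data (plain `pandas`, made-up players):

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

highest_first = salaries_by_player.sort_values(by='Salary', ascending=False)

# Position 0 of the index is the label of the first (highest-paid) row.
highest_paid = highest_first.index[0]
print(highest_paid)
```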

## Example 5: What team does the person with the third-lowest salary play for?

**Key concept**: Using integer positions.

#### Strategy 1

Using what we already know, we could...

1. Sort the DataFrame in ascending order using `.sort_values(by='Salary')`.
2. Get the name of the person using `.index[2]` (remember, positions start at 0).
3. Use `.get('Team').loc[their_name]` to get their team name.
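
The three steps can be sketched as follows, on invented data (plain `pandas`; the players and teams are made up):

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Team': ['Team A', 'Team B', 'Team B', 'Team A'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# Step 1: sort from lowest to highest salary.
lowest_first = salaries_by_player.sort_values(by='Salary')

# Step 2: position 2 of the index holds the third-lowest earner's name.
their_name = lowest_first.index[2]

# Step 3: look up that player's team by row label.
their_team = lowest_first.get('Team').loc[their_name]
print(their_name, their_team)
```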



#### Another approach?

- To get the third element of a Series using `.loc[]`, we first had to find its label.
- Can we just get the third element of a Series without knowing its label?
- Yes, with `.iloc[]`:

#### Strategy 2

1. Sort the DataFrame in ascending order using `.sort_values(by='Salary')`, as before.
2. Use `.get('Team').iloc[2]` to get the team name.
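
A sketch of this shorter route on invented data (plain `pandas`, made-up players and teams). No row label is needed, only the integer position.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_by_player = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Team': ['Team A', 'Team B', 'Team B', 'Team A'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# Sort ascending, take the 'Team' column, then grab integer position 2.
third_lowest_team = (
    salaries_by_player
    .sort_values(by='Salary')
    .get('Team')
    .iloc[2]
)
print(third_lowest_team)
```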

### Summary of accessing a Series

- There are two ways to get an element of a Series:
    - `.loc[]` uses the row label.
    - `.iloc[]` uses the integer position.
- Usually `.loc[]` is more convenient, but you'll need to know both.

### Note

- Sometimes the integer position and row label are the same.
- This happens by default with `bpd.read_csv`:

In [None]:
bpd.read_csv('data/nba-2022.csv')

## Reflection

### Questions we can answer right now:
- What is the highest salary?
    - `salaries_pg.get('Salary').max()`
- How many players are there in the dataset?
    - `salaries_pg.shape[0]`
- What is LeBron James' salary per game?
    - `salaries_pg.get('Salary_per_game').loc['LeBron James']`
- Who has the highest salary?
    - `salaries_pg.sort_values(by='Salary', ascending=False).index[0]`

### Questions we can't yet answer:
- Who is the highest paid center (C)?
- How many players make over \$200,000 per game?
- What is the total payroll of the Cleveland Cavaliers?



The common thread between these questions is that they all involve only a **subset** of the rows in our DataFrame.

## Example 6: Who is the highest paid center (C)?

**Key concept**: Selecting rows (via Boolean indexing).

### Selecting rows

- We can determine who the highest paid center (C) is if we have a DataFrame consisting of only centers.
- How do we get that DataFrame?

### The solution

### Boolean indexing

To select only some rows of `salaries_pg`:

1. Make a sequence (list/array/Series) of `True`s (keep) and `False`s (toss).
    - The values `True` and `False` are of the _Boolean_ data type.
    
2. Then pass it into `salaries_pg[sequence_goes_here]`.

Rather than making the sequence by hand, we usually generate it by making a comparison.
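
Both routes can be sketched on a small invented DataFrame (plain `pandas`, made-up rows): a hand-written list of Booleans, and a Boolean Series produced by a comparison. They select the same rows here.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_pg = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Position': ['C', 'PG', 'C', 'SF'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# By hand: one True/False per row, in row order.
keep = [True, False, True, False]
by_hand = salaries_pg[keep]

# By comparison: each element of the column is compared with 'C'.
is_center = salaries_pg.get('Position') == 'C'
centers = salaries_pg[is_center]

print(is_center)
print(centers)
```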

### Elementwise comparisons work as expected

### How do we make a DataFrame with only the players who make over \$200,000 per game?

### How do we make a DataFrame with only the players on the Cleveland Cavaliers?

### Original Question: How do we determine who the highest paid center is?

Strategy:
1. Extract a DataFrame of just the centers.
2. Sort by salary.
3. Return the first element in the index.
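
This strategy can be sketched as follows, on invented data (plain `pandas`; the players, positions, and salaries are made up). Note that the sort must be in descending order for the first index element to be the highest-paid center.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_pg = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Position': ['C', 'PG', 'C', 'SF'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# Step 1: keep only the rows where 'Position' is 'C'.
centers = salaries_pg[salaries_pg.get('Position') == 'C']

# Step 2: sort the centers from highest to lowest salary.
centers_sorted = centers.sort_values(by='Salary', ascending=False)

# Step 3: the first row label is the highest-paid center.
highest_paid_center = centers_sorted.index[0]
print(highest_paid_center)
```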

### What if the condition isn't satisfied?
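
As a sketch of this case on invented data (plain `pandas`; `'Team Z'` is a made-up team that no row matches), the query does not raise an error. It returns a DataFrame with the same columns and zero rows, so a later step such as `.index[0]` would have nothing to return.

```python
import pandas as pd

# Invented rows for illustration; these are not values from nba-2022.csv.
salaries_pg = pd.DataFrame({
    'Player': ['Ann Lee', 'Bo Cruz', 'Cy Diaz', 'Di Park'],
    'Team': ['Team A', 'Team B', 'Team B', 'Team A'],
    'Salary': [8_200_000, 16_400_000, 4_100_000, 24_600_000],
}).set_index('Player')

# No row has this team, so every Boolean in the comparison is False.
no_match = salaries_pg[salaries_pg.get('Team') == 'Team Z']

print(no_match.shape[0])          # number of rows kept
print(no_match.columns.tolist())  # the columns are still there
```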

<div class="menti">
<div>

### Discussion Question

Which of the following queries determines the total payroll of the Golden State Warriors (GSW)?
Assume that `gsw = 'Golden State Warriors'`.

A. `
   (salaries_pg
    [salaries_pg.get('Team') == gsw]
    .get('Salary').sum())
   `

B. `
   (salaries_pg.get('Salary').sum()
    [salaries_pg.get('Team') == gsw])`

C. `
   salaries_pg[gsw].get('Salary').sum()
   `
    
</div>
<div>

### To answer, go to **[menti.com](https://www.menti.com/v42ge81t5d)** and enter the code 2863 3386 or use this QR code:

![](images/menti-qr.png)
    
</div>
</div>

In [None]:
gsw = 'Golden State Warriors'

## Summary

### Summary

- We learned many DataFrame methods and techniques.
- Don't feel the need to memorize them all right away.
- Instead, refer to this lecture, [the course notes](https://notes.dsc10.com/front.html), [the DSC 10 reference sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view), and [the `babypandas` documentation](https://babypandas.readthedocs.io/en/latest/index.html) when working on assignments.
- Over time, these techniques will become more and more familiar.
- **Practice!** Frame your own questions using this dataset and try to answer them.

### Next time

- We'll answer more complicated questions, which will lead us to a new core DataFrame method – `groupby`.