# Lecture 11 – Taking and Filtering Rows

## Data 6, Summer 2022

In [None]:
from datascience import *
import numpy as np
import plotly.express as px  # needed for the px.scatter plot later in the lecture

## SAT Data

Today we will be working with a dataset showing aggregated (average) SAT scores by state. This data is from 2014, so the total score is out of 2400 (over three sections each out of 800) instead of 1600.

In [None]:
sat = Table.read_table('data/sat2014-lecture.csv')
sat

Recall the table methods and properties we can use to learn more about our data and even create new data:

In [None]:
sat.num_rows, sat.num_columns

It would be nice to have a combined score too.

In [None]:
sat.column('Critical Reading') + sat.column('Math') + sat.column('Writing')

In [None]:
sat = sat.with_columns(
    'Combined', sat.column('Critical Reading') + sat.column('Math') + sat.column('Writing')
)

In [None]:
sat
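
As a small aside, the combined column works because adding arrays of the same length adds them position by position. The sketch below uses made-up section averages (hypothetical values, not rows from the SAT table) to show the idea:

```python
import numpy as np

# Hypothetical section averages for three made-up states (not values from the SAT table)
reading = np.array([500, 520, 480])
math = np.array([510, 530, 470])
writing = np.array([490, 500, 460])

# Adding arrays of equal length adds them position by position
combined = reading + math + writing
print(combined)  # [1500 1550 1410]
```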

## `.take`

Sometimes, it is a little tricky to work with a large dataset (even though this specific dataset isn't _that_ large). To make it easier to understand parts of our data, we may want to look at only certain rows of our table. We can do that using `tbl.take(indices)`, which takes a single index or a list or array of indices corresponding to the rows of the table we want to take.

In [None]:
sat.take(2) # Use `tbl.take()` to take the third row in the `sat` table (remember that row indices start at 0)

In [None]:
sat.take(make_array(1, 4, 3)) # Pass in an array with values 1, 4, and 3 to take the 2nd, 5th and 4th rows of `sat`

Recall that `np.arange()` makes it really easy to generate an array of sequential numbers. The [Data 6 Python Reference](https://data6.org/su22/reference/#numpy-array-functions) provides a good explanation of how `np.arange` works.

In [None]:
np.arange(5)
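
For reference, here is a short sketch of the three common ways to call `np.arange`. The variable names are only for illustration:

```python
import numpy as np

up_to_five = np.arange(5)           # starts at 0, stops before 5
two_to_five = np.arange(2, 6)       # starts at 2, stops before 6
every_third = np.arange(0, 10, 3)   # starts at 0, steps by 3, stops before 10

print(up_to_five)    # [0 1 2 3 4]
print(two_to_five)   # [2 3 4 5]
print(every_third)   # [0 3 6 9]
```

Note that the stop value is never included in the result.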

We can pass array ranges into `.take()` just as we would manually created arrays. This is often much easier than typing out arrays by hand.

In [None]:
sat.take(np.arange(5))

When we combine `sort` and `take`, we can answer ranking questions in a single line of code.

What are the five states with the highest math scores?

In [None]:
sat.sort('Math', descending = True).take(np.arange(5))

What are the top 8 states in terms of participation?

In [None]:
sat.sort('Participation Rate', descending = True).take(np.arange(8))
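
To see why sorting first and taking second gives a "top N" answer, here is a plain-Python sketch of the same idea. The pairs below are made up for illustration, and `sorted` with slicing is only a stand-in for `sort` and `take`, not how the `datascience` library works internally:

```python
# Hypothetical (state, math score) pairs; the values are made up for illustration
scores = [('A', 510), ('B', 620), ('C', 480), ('D', 590), ('E', 605), ('F', 530)]

# Sort by score from highest to lowest, like sort(..., descending=True)
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)

# Keep the first three positions, like take(np.arange(3))
top_three = ranked[:3]
print(top_three)  # [('B', 620), ('E', 605), ('D', 590)]
```

If we took the first three rows without sorting, we would just get whichever rows happen to come first.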

Note: `.take` works on arrays too, not just tables!

In [None]:
sat.column('State').take(np.arange(5))

In [None]:
sat.take(np.arange(5)).column('State')
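
The two cells above should produce the same array: one takes from the column, the other takes rows and then pulls out the column. As a minimal sketch of `.take` on a plain NumPy array (the `states` array here is a hypothetical stand-in, not read from the SAT file):

```python
import numpy as np

# Hypothetical stand-in for a column of state names
states = np.array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'])

# Take the first three positions
first_three = states.take(np.arange(3))
print(first_three)  # ['Alabama' 'Alaska' 'Arizona']

# Indices do not need to be in order; the result follows the order we ask for
picked = states.take([1, 4, 3])
print(picked)  # ['Alaska' 'California' 'Arkansas']
```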

### Quick Check 1

In [None]:
animals = Table.read_table('data/animals.csv')
animals

Using the `animals` table, fill in the blanks in the code below so that the result is an array containing the names of the five smallest animals by body weight, in increasing order.

In [None]:
animals._____(_____).column(_____).take(_____) # Replace the blanks with your answer

## Booleans

Another Python data type is the `bool`, or Boolean, which has only two possible values: `True` and `False`.

In [None]:
True

In [None]:
f = False
f

In [None]:
type(True)

In [None]:
type(f)

**Be careful**, because `True` and `False` have special meanings in Python and _cannot_ be used as names. 

In [None]:
# This doesn't work
3 = 4

In [None]:
# This also doesn't work
True = 5
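
If you would like to confirm this without crashing a cell, one option is to ask Python to compile the line and catch the error. This is only a sketch; `is_valid_assignment` is a hypothetical helper written for this example:

```python
import keyword

def is_valid_assignment(source):
    """Return True if `source` compiles, False if Python rejects it with a SyntaxError."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False

print(keyword.iskeyword('True'))         # True: `True` is a reserved word
print(is_valid_assignment('True = 5'))   # False
print(is_valid_assignment('3 = 4'))      # False
print(is_valid_assignment('f = False'))  # True
```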

## `.where`

We've already seen how we can use `tbl.where()` to find rows that _exactly_ match what we're looking for. For example:

In [None]:
sat.where('State', 'California')

But `tbl.where` is also capable of so much more! The second argument in `.where` can accept a **predicate**, which tells Python what condition to match rows on. A few relevant predicates are:

| Predicate | Description |
| --- | --- |
| `are.equal_to(z)` | Is the value from the column equal to `z`? |
| `are.above(x)`, `are.below(x)` | Is the value from the column above/below `x`? |
| `are.between(x, y)` | Is the value from the column between `x` (inclusive) and `y` (exclusive)? |
| `are.containing(s)` | Does the value from the column contain the string `s`? |
| `are.contained_in(s)` | Is the value from the column inside the string/array `s`? |

You can also negate any of these predicates (i.e. find the opposite) by adding `not_` on the front of any of their function names (e.g. `are.not_equal_to(z)`).

A full list of predicates can be found on the [Python Reference](https://data6.org/su22/reference/#tablewhere-predicates).
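
One way to think about a predicate is as a small function that looks at one value and answers `True` or `False`. The sketch below builds a few hypothetical stand-ins in plain Python (`above`, `between`, and `negate` are names invented here and are not the `datascience` implementations) to show the behavior described in the table:

```python
# Hypothetical stand-ins for predicates; these are NOT the datascience implementations
def above(x):
    return lambda value: value > x

def between(x, y):
    # x is included, y is excluded
    return lambda value: x <= value < y

def negate(predicate):
    return lambda value: not predicate(value)

values = [1450, 1600, 1800, 1815]  # made-up combined scores

kept_above = [v for v in values if above(1800)(v)]
kept_between = [v for v in values if between(1600, 1800)(v)]
kept_not_above = [v for v in values if negate(above(1800))(v)]

print(kept_above)      # [1815]
print(kept_between)    # [1600]
print(kept_not_above)  # [1450, 1600, 1800]
```

Notice that 1800 is not "above 1800", and that it falls outside `between(1600, 1800)` because the upper end is exclusive.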

Let's see the power of `.where` in action:

In [None]:
sat.where('Combined', are.above(1800))

In [None]:
sat.where('State', are.equal_to('California'))

Note that `are.equal_to(z)` is the same as just passing in `z` itself as the second argument.

In [None]:
sat.where('State', are.containing('Dakota'))

In [None]:
sat.where('Math', are.between(580, 600))

### Multiple Conditions

We can match rows to multiple conditions/predicates by chaining `.where` method calls together. For example, we can look for states where the participation rate is above 20% and the average combined SAT score is above 1500.

In [None]:
sat.where('Participation Rate', are.above(20)).where('Combined', are.above(1500))

In [None]:
sat.where('Participation Rate', are.below(10)).where('Combined', are.above(1600))
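
Chaining works because each `.where` call returns a new table, so the second call only sees rows that passed the first. In other words, chaining behaves like "and". A plain-Python sketch with made-up rows (hypothetical values, not from the SAT table):

```python
# Hypothetical rows; the numbers are made up and do not come from the SAT table
rows = [
    {'State': 'A', 'Participation Rate': 5.0, 'Combined': 1750},
    {'State': 'B', 'Participation Rate': 60.0, 'Combined': 1510},
    {'State': 'C', 'Participation Rate': 70.0, 'Combined': 1450},
    {'State': 'D', 'Participation Rate': 25.0, 'Combined': 1620},
]

# Two filters applied one after another, like two chained .where calls
step_one = [r for r in rows if r['Participation Rate'] > 20]
chained = [r for r in step_one if r['Combined'] > 1500]

# One filter that requires both conditions at once
both = [r for r in rows if r['Participation Rate'] > 20 and r['Combined'] > 1500]

print([r['State'] for r in chained])  # ['B', 'D']
print(chained == both)                # True
```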

We can match against several different values at once if we put them in an array and then use `are.contained_in`.

In [None]:
deep_south = np.array(['Alabama', 'Georgia', 'Louisiana', 'Mississippi', 'South Carolina'])

In [None]:
sat.where('State', are.contained_in(deep_south))

In [None]:
sat.where('State', are.contained_in(deep_south)) \
   .where('Participation Rate', are.below(10)) \
   .where('Combined', are.above(1600))
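
Under the hood, membership filtering comes down to a `True`/`False` answer for each value. As a rough sketch of that idea using NumPy's `np.isin` (an analogy, not necessarily how `are.contained_in` is implemented; the short `states` array is a hypothetical stand-in for the `State` column):

```python
import numpy as np

# Hypothetical stand-in for the State column
states = np.array(['Alabama', 'Alaska', 'Georgia', 'Ohio'])
deep_south = np.array(['Alabama', 'Georgia', 'Louisiana', 'Mississippi', 'South Carolina'])

# One Boolean per state: is it in deep_south?
mask = np.isin(states, deep_south)
print(mask)          # [ True False  True False]

# Keep only the states where the answer was True
print(states[mask])  # ['Alabama' 'Georgia']
```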

**Just for fun:** consider the scatter plot of all states' participation rates and combined SAT scores. Does this scatter plot imply that lower participation _causes_ higher SAT scores? If not, what else might explain the pattern?

In [None]:
px.scatter(data_frame = sat.to_df(), 
           x = 'Combined', 
           y = 'Participation Rate', 
           hover_data = {'State': True},
           title = 'Participation Rate vs. Combined SAT Score for States in 2014')

### Quick Check 2

For this Quick Check we will return to the `wnba` data from previous lectures.

In [None]:
wnba = Table.read_table('data/wnba-2020.csv').select('Player', 'Tm', 'Pos', 'G', 'PTS')
wnba

Fill in the code below so that the result is the average **PTS** scored last season by **forwards** (players whose `Pos` is “F”) who **played 20 or more games**.

In [None]:
wnba.where(____, ____).where('G',____).column(____).mean() # Replace the blanks with your answer