# Lecture 11 – Taking and Filtering Rows

## Data 6, Fall 2024

In [1]:
from datascience import *
import numpy as np

## SAT Data

Today we will be working with a dataset showing aggregated (average) SAT scores by state. This data is from 2014, so the total score is out of 2400 (over three sections each out of 800) instead of 1600.

In [2]:
sat = Table.read_table('data/sat2014-lecture.csv')
sat

State,Participation Rate,Critical Reading,Math,Writing
Alabama,6.7,547,538,532
Alaska,54.2,507,503,475
Arizona,36.4,522,525,500
Arkansas,4.2,573,571,554
California,60.3,498,510,496
Colorado,14.3,582,586,567
Connecticut,88.4,507,510,508
Delaware,100.0,456,459,444
District of Columbia,100.0,440,438,431
Florida,72.2,491,485,472


Recall the table methods and properties we can use to learn more about our data and even create new data:

In [9]:
sat.num_rows, sat.num_columns

(51, 5)

It would be nice to have a combined score too.

In [10]:
sat.column('Critical Reading') + sat.column('Math') + sat.column('Writing')

array([1617, 1485, 1547, 1698, 1504, 1735, 1525, 1359, 1309, 1448, 1445,
       1460, 1364, 1802, 1474, 1794, 1753, 1746, 1667, 1387, 1468, 1556,
       1784, 1786, 1714, 1771, 1637, 1745, 1458, 1566, 1526, 1617, 1468,
       1483, 1816, 1652, 1697, 1544, 1481, 1480, 1443, 1792, 1714, 1432,
       1690, 1554, 1530, 1519, 1522, 1782, 1762])

In [11]:
sat = sat.with_columns(
    'Combined', sat.column('Critical Reading') + sat.column('Math') + sat.column('Writing')
)

In [12]:
sat

State,Participation Rate,Critical Reading,Math,Writing,Combined
Alabama,6.7,547,538,532,1617
Alaska,54.2,507,503,475,1485
Arizona,36.4,522,525,500,1547
Arkansas,4.2,573,571,554,1698
California,60.3,498,510,496,1504
Colorado,14.3,582,586,567,1735
Connecticut,88.4,507,510,508,1525
Delaware,100.0,456,459,444,1359
District of Columbia,100.0,440,438,431,1309
Florida,72.2,491,485,472,1448


## `.take`

Sometimes, it is a little tricky to work with a large dataset (even though this specific dataset isn't _that_ large). To make it easier for us to understand parts of our data, we may want to just look at certain rows in our table. We can do that using `tbl.take(indices)`, which takes in a single index or an array of indices corresponding to the rows of the table we want to take.

In [13]:
... # Use `tbl.take()` to take the third row in the `sat` table (Remember indices are 0-indexed)

Ellipsis

In [14]:
... # Pass in an array with values 1, 4, and 3 to take the 2nd, 5th and 4th rows of `sat`

Ellipsis

Recall that `np.arange()` makes it really easy to generate an array of sequential numbers. The [Data 6 Python Reference](https://data6.org/fa24/reference/#numpy-array-functions) provides a good explanation of how `np.arange` works.
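As a quick refresher, here is a short sketch of the three common ways to call `np.arange`. The variable names are only for illustration.

```python
import numpy as np

# One argument: start at 0 and stop before 5
first_five = np.arange(5)           # 0, 1, 2, 3, 4

# Two arguments: start at 3 and stop before 8
three_to_seven = np.arange(3, 8)    # 3, 4, 5, 6, 7

# Three arguments: start at 0, stop before 10, count by 2
evens = np.arange(0, 10, 2)         # 0, 2, 4, 6, 8
```

In every case the stop value itself is excluded.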

In [15]:
np.arange(5)

array([0, 1, 2, 3, 4])

We can pass array ranges into `.take()` just as we would manually created arrays. This is often much easier than manually typing out arrays.

In [16]:
... # Take the first five rows of `sat` using `np.arange`

Ellipsis

When we combine `sort` and `take`, we can get some pretty powerful answers.

What are the five states with the highest math scores?

In [17]:
...

Ellipsis

What are the top 8 states in terms of participation?

In [18]:
...

Ellipsis

Note: `.take` works on arrays too, not just tables!

In [19]:
sat.column('State').take(np.arange(5))

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
      dtype='<U20')

In [20]:
sat.take(np.arange(5)).column('State')

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
      dtype='<U20')

## Booleans

Another Python data type is the `bool` or Boolean, which only has two possible values: `True` and `False`.

In [None]:
True

In [None]:
f = False
f

In [None]:
type(True)

In [None]:
type(f)

**Be careful**, because `True` and `False` have special meanings in Python and _cannot_ be used as names. 

In [None]:
# This doesn't work
3 = 4

In [None]:
# This also doesn't work
True = 5

## `.where`

We've already seen how we can use `tbl.where()` to find rows that _exactly_ match what we're looking for. For example:

In [None]:
sat.where('State', 'California')

But `tbl.where` is also capable of so much more! The second argument in `.where` can accept a **predicate**, which tells Python what condition to match rows on. A few relevant predicates are:

| Predicate | Description |
| --- | --- |
| `are.equal_to(z)` | Is the value from the column equal to `z`? |
| `are.above(x)`, `are.below(x)` | Is the value from the column above/below `x`? |
| `are.between(x, y)` | Is the value from the column between `x` (inclusive) and `y` (exclusive)? |
| `are.containing(s)` | Does the value from the column contain the string `s`? |
| `are.contained_in(s)` | Is the value from the column inside the string/array `s`? |

You can also negate any of these predicates (i.e. find the opposite) by adding `not_` on the front of any of their function names (e.g. `are.not_equal_to(z)`).

A full list of predicates can be found on the [Python Reference](https://data6.org/su23/reference/#tablewhere-predicates).

Let's see the power of `.where` in action:

In [None]:
... # Filter the `sat` table to only include states with a combined score above 1800

In [None]:
... # Filter the `sat` table to only include California

Note that `are.equal_to(z)` is the same as just passing in `z` itself as the second argument.

In [None]:
... # Filter the `sat` table to only include North and South Dakota using only one `.where` call

In [None]:
... # Find the states where the math scores are between 580 and 600

### Multiple Conditions

We can match rows to multiple conditions/predicates by chaining `.where` method calls together. For example, we can look for states where the participation rate is above 20% and the average combined SAT score is above 1500.

In [None]:
... # Filter the `sat` table to find states where participation is above 20% and combined score is above 1500

In [None]:
... # Filter the `sat` table to find states where participation is below 10% and combined score is above 1600

We can match against several different values at once if we put them in an array and then use `are.contained_in`.

In [None]:
deep_south = np.array(['Alabama', 'Georgia', 'Louisiana', 'Mississippi', 'South Carolina'])

In [None]:
... # Filter the `sat` table to include only the states listed in the `deep_south` array

In [None]:
... # Find the states in the deep south with participation lower than 10% and combined score above 1600

**Just for fun:** consider the scatter plot of all states' participation rates and combined SAT scores. Does this scatter plot imply that lower participation _causes_ higher SAT scores? Or what is going on here?

In [None]:
import plotly.express as px  # `px` is not imported earlier in this notebook

px.scatter(data_frame = sat.to_df(), 
           x = 'Combined', 
           y = 'Participation Rate', 
           hover_data = {'State': True},
           title = 'Participation Rate vs. Combined SAT Score for States in 2014')