# Lecture 15 – Applying

## Data 6, Summer 2022

In [None]:
from datascience import *
import numpy as np

## Motivation

Until now, we've primarily worked with just the data we've been given in our tables. However, oftentimes some of our data needs to be manipulated or "cleaned" before we can work with it to infer things about the world.

For example, we have been given this table of dogs and their ages in 'human years'. While this may be useful in some contexts, what if we want to know each dog's age in 'dog years'? For this example, we will use the (incorrect) conversion of one human year being equivalent to 7 dog years; you can read about a more accurate conversion [here](https://pets.webmd.com/dogs/how-to-calculate-your-dogs-age).

In [None]:
pups = Table.read_table('data/pups.csv')

In [None]:
pups

We already know how to convert the column `age` to dog years using **array arithmetic**. We can then add this new array as a new column in our table.

In [None]:
... # Add a new column to `pups` called `dog years` that contains each dog's age in dog years (human years * 7)
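As a reminder of how array arithmetic works, here is a minimal sketch using a made-up array of ages (not the actual `pups` data). Multiplying an array by a number multiplies every element at once:

```python
import numpy as np

# Made-up ages in human years, for illustration only
human_years = np.array([3, 5, 1])

# Array arithmetic: every element is multiplied by 7
dog_years = human_years * 7
print(dog_years)
```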

## Apply

Now that we know how to write our own functions, we can leverage these functions to manipulate our tables in particular ways. This is really useful if we want to extract or convert data in our table to generate new insights about it.

We can use `tbl.apply(func, col)` to **apply** the function `func` to the column `col`. This creates an array where each item is the result of calling `func` with the corresponding item in `col` as the input. In effect, it makes one function call per row, all in a single line.
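Conceptually, applying a function to a column behaves like calling the function once per item and collecting the results in an array. The sketch below imitates that idea in plain Python and NumPy; `apply_sketch` is a hypothetical helper written for illustration, not part of the `datascience` library, and the ages are made up:

```python
import numpy as np

def apply_sketch(func, values):
    """Call func on each item of values and collect the results in an array."""
    return np.array([func(v) for v in values])

def seven_times(x):
    return 7 * x

# Made-up ages, for illustration only
ages = [2, 4, 10]
result = apply_sketch(seven_times, ages)
print(result)
```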

In [None]:
def seven_times(x):
    return 7 * x

In [None]:
... # Apply the `seven_times` function to the column `age` in the `pups` table

Note, we wouldn't actually use the above example with `.apply` since we could just write `pups.column('age') * 7`.

Here's a more useful example:

In [None]:
def email_from_name(name):
    first, last = name.split(' ')
    email = first + '.' + last + '@dogschool.edu'
    return email.lower()

In [None]:
# Can use email_from_name on a single argument
email_from_name('Champ Major')

In [None]:
... # Apply the `email_from_name` function to `pups` to generate emails for each dog

In [None]:
... # Add each dog's email as a new column in the `pups` table

Notice how fast and easy that was!

### Quick Check 1

In [None]:
# Large file – this may take ~10 seconds to load
salary = Table.read_table('https://media.githubusercontent.com/media/dailycal-projects/ucb-faculty-salary/master/data/salary/salary_2015.csv')
salary

In [None]:
profs = salary.select('first', 'last', 'title', 'gross').where('title', are.containing('PROF'))
profs

Look at the very last row of the output – that gross income doesn't look right.

In [None]:
profs.sort('gross', descending = True)

The issue here is that the elements in the `gross` column of the table `profs` are currently strings, not the integers we would expect. Fill in the blanks to replace the elements in the `gross` column with integers. _(Hint: use the `fix_income` function.)_
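To see why sorting strings gives a surprising order, consider this small sketch with made-up amounts (not taken from the salary data). Strings are compared character by character, so `'9,000'` sorts above `'100,000'`; converting to integers first gives the numeric order:

```python
# Made-up amounts stored as strings, for illustration only
amounts = ['9,000', '100,000', '85,500']

# Sorting strings compares them character by character
as_strings = sorted(amounts, reverse=True)

# Removing the commas and converting to int sorts by numeric value
as_integers = sorted((int(a.replace(',', '')) for a in amounts), reverse=True)

print(as_strings)
print(as_integers)
```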

In [None]:
def fix_income(income):
    # Remove the thousands separators, then convert the string to an integer
    return int(income.replace(',', ''))

In [None]:
fixed_income = profs.apply(_____, _____) # Fill in the blanks to fix the `gross` column

profs = profs.with_columns(
    'gross', _____
)

In [None]:
profs

## Masking

Python also allows us to select elements of an array or rows in a table based on **boolean masking** (also called boolean indexing).

In [None]:
numbers = np.array([15, 14, -2, 1, 9])

The syntax for boolean masking is not what we're used to, so don't worry too much about understanding it.

In [None]:
... # Use boolean masking to get only the first and third elements of `numbers`

Notice how masking the `numbers` array with an array of booleans allowed us to get only the elements of `numbers` that we wanted.
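The same idea on a separate, made-up array: the mask has one boolean per element, and only the positions marked `True` are kept. The names `values` and `keep` are chosen just for this sketch:

```python
import numpy as np

# Made-up values, for illustration only
values = np.array([10, 20, 30, 40])

# One boolean per element: True means "keep this element"
keep = np.array([True, False, False, True])

selected = values[keep]
print(selected)
```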

Let's see another example:

In [None]:
gradebook = Table().with_columns(
    'Name', np.array(['Carrera', 'Panamera', 'Taycan', 'Cayenne', 'Macan', 'Cayman', 'Boxster']),
    'Grading Option', np.array(['GRD', 'PNP', 'PNP', 'GRD', 'GRD', 'GRD', 'PNP']),
    'Score', np.array([98, 86, 67.5, 45, 82, 88, 71])
)

In [None]:
gradebook

`gradebook` is a table of fake students, their scores/grades, and their grading option (letter graded - "GRD" or Pass/No Pass - "PNP"). Let's use boolean masking to get only the students whose grading option is "GRD". 

In [None]:
... # Use boolean indexing and `.where` to get only the students whose grading option is "GRD"

This weird `.where` call is actually what happens under the hood when we write `gradebook.where("Grading Option", "GRD")`.

In [None]:
gradebook.where("Grading Option", "GRD")

In [None]:
# You'll learn what this line means next lecture
letter_grade = gradebook.column("Grading Option") == 'GRD'

In [None]:
gradebook.where(letter_grade)

That being said, boolean masking is pretty tedious, so we almost exclusively rely on the usual `.where` syntax.
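As a brief preview, comparing an array to a single value produces an array of booleans, which can then serve as a mask. This sketch uses plain NumPy arrays with made-up names rather than the `gradebook` table:

```python
import numpy as np

# Made-up data, for illustration only
names = np.array(['Ana', 'Ben', 'Cho', 'Dev'])
options = np.array(['GRD', 'PNP', 'PNP', 'GRD'])

# Comparing an array to a value checks every element
is_grd = options == 'GRD'

# The boolean array works as a mask on another array of the same length
grd_names = names[is_grd]
print(is_grd)
print(grd_names)
```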

### Example: Countries

Run the following cell – ignore the `lambda` parts:

In [None]:
countries = Table.read_table('data/countries.csv')
countries = countries.relabeled('Country(or dependent territory)', 'Country') \
           .relabeled('% of world', '%') \
           .relabeled('Source(official or UN)', 'Source')
countries = countries.with_columns(
    'Country', countries.apply(lambda s: s[:s.index('[')].lower() if '[' in s else s.lower(), 'Country'),
    'Population', countries.apply(lambda i: int(i.replace(',', '')), 'Population'),
    '%', countries.apply(lambda f: float(f.replace('%', '')), '%')
)

In [None]:
countries
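For the curious, each `lambda` in the cleaning cell is just a short unnamed function. A sketch of equivalent named functions is shown here; the names `clean_country`, `parse_population`, and `parse_percent` are hypothetical and chosen only for this illustration, and the sample inputs are made up:

```python
def clean_country(s):
    # Drop everything from the first '[' onward (if present), then lowercase
    if '[' in s:
        s = s[:s.index('[')]
    return s.lower()

def parse_population(i):
    # Remove thousands separators and convert to an integer
    return int(i.replace(',', ''))

def parse_percent(f):
    # Remove the percent sign and convert to a float
    return float(f.replace('%', ''))

print(clean_country('China[a]'))
print(parse_population('1,234,567'))
print(parse_percent('17.7%'))
```

Any of these could be passed to `.apply` in place of the corresponding `lambda`.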

Let's find all of the countries whose name starts or ends with the letter 'a':

In [None]:
def starts_or_ends_with_a(name):
    return name[0] == 'a' or name[-1] == 'a'

In [None]:
countries.apply(starts_or_ends_with_a, 'Country')

In [None]:
countries.where(countries.apply(starts_or_ends_with_a, 'Country'))
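The pattern above combines the two ideas from this lecture: applying a function that returns `True` or `False` produces a boolean array, and that array then acts as a mask. A minimal sketch of the same pattern on a plain NumPy array of made-up, lowercase names (not the `countries` table):

```python
import numpy as np

def starts_or_ends_with_a(name):
    return name[0] == 'a' or name[-1] == 'a'

# Made-up lowercase names, for illustration only
names = np.array(['albania', 'brazil', 'china', 'peru'])

# Call the function on each name to build the boolean mask
mask = np.array([starts_or_ends_with_a(n) for n in names])

# Keep only the names where the mask is True
matches = names[mask]
print(matches)
```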