# Lecture 09 - Applying

## Data 6, Fall 2024

In [1]:
from datascience import *
import numpy as np

## Motivation

Until now, we've primarily worked with just the data we've been given in our tables. However, oftentimes some of our data needs to be manipulated or "cleaned" before we can work with it to infer things about the world.

For example, we have been given this table of dogs and their ages in 'human years'. While this may be useful in some contexts, what if we want to know each dog's age in 'dog years'? For this example, we will use the (incorrect) conversion of one human year being equivalent to 7 dog years; you can read about a more accurate conversion [here](https://pets.webmd.com/dogs/how-to-calculate-your-dogs-age).

In [2]:
pups = Table.read_table('data/pups.csv')

In [8]:
tbl = Table().with_columns("Name",
                           make_array("Arthur", "Beth", "Chand"),
                           "Birthday",
                           make_array("1987-04-01", "2003-05-23", "2008-01-01"),
                           "Magic Year", make_array("1980", "2003", "2068"))
data_methods = Table().with_columns("Name", make_array("Beth", "Chand", "Arthur"),
                                    "Method of Selection", make_array("Random", "Random", "Lucky Number"))

def age_in_given_year(year, birthday):
    str_year, str_month, str_day = birthday.split("-")
    return int(year) - int(str_year)
update = tbl.with_column("Age in Magic Year", tbl.apply(age_in_given_year, "Magic Year", "Birthday"))

data_magic = update.where("Age in Magic Year", are.above_or_equal_to(0))

data_methods.join("Name", data_magic)

Name,Method of Selection,Birthday,Magic Year,Age in Magic Year
Beth,Random,2003-05-23,2003,0
Chand,Random,2008-01-01,2068,60


In [None]:
#CELL A
num_tries = 0

In [None]:
#CELL B
def draw_triangle():
    print("  *  ")
    print(" *** ")
    print("*****")

num_tries = num_tries + 1

We already know how to convert the column `age` to dog years using **array arithmetic**. We can then add this new array as a new column in our table.

In [None]:
... # Add a new column to `pups` called `dog years` that contains each dog's age in dog years (human years * 7)
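
As a reminder of how array arithmetic behaves, here is a minimal sketch that uses a plain NumPy array in place of `pups.column('age')`. The ages are copied from the `pups` table shown later in this lecture, and the name `ages` is only for illustration.

```python
import numpy as np

# Ages in human years, copied from the `pups` table
# (an illustrative stand-in for pups.column('age'))
ages = np.array([11, 7, 3, 4, 2])

# Multiplying an array by a number multiplies every element
dog_years = ages * 7
print(dog_years)
```

With the real table, an array like this could then be attached as a new column using `with_column`.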

## Apply

Now that we know how to write our own functions, we can leverage these functions to manipulate our tables in particular ways. This is really useful if we want to extract or convert data in our table to generate new insights about it.

We can use `tbl.apply(func, col)` to **apply** the function `func` to the column `col`. This creates an array where each item is the result of calling `func` with the corresponding item in `col` as the input. In effect, a single line performs one function call per row.

In [4]:
def seven_times(x):
    return 7 * x

In [None]:
... # Apply the `seven_times` function to the column `age` in the `pups` table
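
Conceptually, `apply` calls the function once per element of the column and collects the results in order. The sketch below imitates that with a plain Python list instead of a table. The list `ages` is an illustrative stand-in for the `age` column, and the real `apply` returns an array rather than a list.

```python
def seven_times(x):
    return 7 * x

# Stand-in for the `age` column of `pups`
ages = [11, 7, 3, 4, 2]

# One call to seven_times per element, results gathered in order
dog_years = [seven_times(age) for age in ages]
print(dog_years)
```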

Note, we wouldn't actually use the above example with `.apply` since we could just write `pups.column('age') * 7`.

Here's a more useful example:

In [6]:
def email_from_name(name):
    first, last = name.split(' ')
    email = first + '.' + last + '@dogschool.edu'
    return email.lower()

In [7]:
# Can use email_from_name on a single argument
email_from_name('Champ Major')

'champ.major@dogschool.edu'

In [13]:
emails = pups.apply(email_from_name, 'name')

In [14]:
pups.with_column("Email Address", emails)

name,age,size,Email Address
Junior Smith,11,medium,junior.smith@dogschool.edu
Rex Rogers,7,big,rex.rogers@dogschool.edu
Flash Heat,3,big,flash.heat@dogschool.edu
Reese Bo,4,medium,reese.bo@dogschool.edu
Polo Cash,2,small,polo.cash@dogschool.edu


Notice how fast and easy that was!

### Quick Check 1

In [27]:
arr = make_array(1, 2, 3)
int_arr = arr.apply(int, arr)

AttributeError: 'numpy.ndarray' object has no attribute 'apply'

In [20]:
profs = salary.select('first', 'last', 'title', 'gross').where('title', are.containing('PROF'))
profs

first,last,title,gross
ELIZABETH,ABEL,PROF-AY,138775
NORMAN,ABRAHAMSON,ADJ PROF-AY-1/9-B/E/E,19668
BARBARA,ABRAMS,PROF-AY,191162
ILAN,ADLER,PROF-AY-B/E/E,166617
VINOD,AGGARWAL,PROF-AY,167525
ALICE,AGOGINO,PROF-AY-B/E/E,243259
DAVID,ALDOUS,PROF-AY,218666
RONELLE,ALEXANDER,PROF-AY,167642
NEZAR,ALSAYYAD,PROF-AY,210389
GENEVIEVE,AMES,ADJ PROF-AY,9783


Look at the very last row of the sorted output below: that gross income doesn't look right.

In [21]:
profs.sort('gross', descending = True)

first,last,title,gross
STEVEN H,APPLEBAUM,HS ASSOC CLIN PROF-HCOMP,999756
JOHN A,GLASPY,PROF-HCOMP,999631
FRANK P.K.,HSU,PROF OF CLIN-HCOMP,998340
JOHN STUART,NELSON,PROF-HCOMP,997975
HANMIN,LEE,PROF OF CLIN-HCOMP,995434
DENNIS J,SLAMON,PROF-HCOMP,991973
BENJAMIN J,ANSELL,HS CLIN PROF-HCOMP,991543
NICHOLAS C,SAENZ,HS CLIN PROF-HCOMP,991463
JOSEPH F,GRECO,HS ASST CLIN PROF-HCOMP,991458
OMRI Y.,MARIAN,ACT PROF-AY-LAW,99997


The issue is that the elements in the `gross` column of `profs` are currently strings rather than the integers we would expect, so sorting compares them character by character instead of numerically. Fill in the blanks to replace the elements in the `gross` column with integers. _(Hint: use the `fix_income` function.)_
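
To see why sorting strings goes wrong, here is a small sketch in plain Python. It assumes the raw `gross` values are strings with comma separators (the character that `fix_income` strips out), and the three sample values are for illustration only.

```python
# Hypothetical raw values, assumed to be comma-formatted strings
raw = ['99,997', '138,775', '9,783']

# Strings are compared character by character, so anything starting
# with '9' lands ahead of anything starting with '1'
print(sorted(raw, reverse=True))

# Converting to integers first gives the numeric ordering we want
as_ints = [int(s.replace(',', '')) for s in raw]
print(sorted(as_ints, reverse=True))
```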

In [30]:
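def fix_income(income):
    # Remove thousands separators, then convert the string to an integer
    return int(income.replace(',', ''))

In [31]:
# Apply `fix_income` to each element of the `gross` column
fixed_income = profs.apply(fix_income, 'gross')

profs = profs.with_columns(
    'gross', fixed_income
)

profs.sort('gross', descending = True)

AttributeError: 'numpy.int64' object has no attribute 'replace'

This error appears if the cell runs when `gross` already holds integers (for example, on a second run), since integers have no `.replace` method.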

In [None]:
profs

## Masking

Python also allows us to select elements of an array or rows in a table based on **boolean masking** (also called boolean indexing).

In [None]:
numbers = np.array([15, 14, -2, 1, 9])

The syntax for boolean masking is not what we're used to, so don't worry too much about understanding it.

In [None]:
... # Use boolean masking to get only the first and third elements of `numbers`

Notice how masking the `numbers` array with an array of booleans allowed us to get only the elements of `numbers` that we wanted.
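
A minimal sketch of what such a mask might look like, keeping the first and third elements of `numbers`. The names `mask` and `selected` are illustrative.

```python
import numpy as np

numbers = np.array([15, 14, -2, 1, 9])

# One boolean per element: True keeps the element, False drops it
mask = np.array([True, False, True, False, False])

selected = numbers[mask]
print(selected)
```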

Let's see another example:

In [10]:
gradebook = Table().with_columns(
    'Name', np.array(['Carrera', 'Panamera', 'Taycan', 'Cayenne', 'Macan', 'Cayman', 'Boxster']),
    'Grading Option', np.array(['GRD', 'PNP', 'PNP', 'GRD', 'GRD', 'GRD', 'PNP']),
    'Score', np.array([98, 86, 67.5, 45, 82, 88, 71])
)

In [8]:
gradebook

Name,Grading Option,Score
Carrera,GRD,98.0
Panamera,PNP,86.0
Taycan,PNP,67.5
Cayenne,GRD,45.0
Macan,GRD,82.0
Cayman,GRD,88.0
Boxster,PNP,71.0


`gradebook` is a table of fake students, their scores, and their grading option (letter graded, "GRD", or Pass/No Pass, "PNP"). Let's use boolean masking to get only the students whose grading option is "GRD".

In [None]:
... # Use boolean indexing and `.where` to get only the students whose grading option is "GRD"

This weird `.where` call is actually what's happening under the hood when we do `gradebook.where("Grading Option", "GRD")`

In [None]:
gradebook.where("Grading Option", "GRD")

In [None]:
# You'll learn what this line means next lecture
letter_grade = gradebook.column("Grading Option") == 'GRD'

In [None]:
gradebook.where(letter_grade)

That being said, boolean masking is pretty tedious, so we almost exclusively rely on the usual `.where` syntax.
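
To make the boolean array less mysterious, the sketch below rebuilds two of the `gradebook` columns as plain NumPy arrays (the names `names` and `options` are illustrative) and shows what the comparison produces.

```python
import numpy as np

names = np.array(['Carrera', 'Panamera', 'Taycan', 'Cayenne', 'Macan', 'Cayman', 'Boxster'])
options = np.array(['GRD', 'PNP', 'PNP', 'GRD', 'GRD', 'GRD', 'PNP'])

# Comparing an array to a value yields one boolean per element
letter_grade = options == 'GRD'
print(letter_grade)

# The boolean array then picks out the matching names
print(names[letter_grade])
```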

### Example: Countries

Run the following cell – ignore the `lambda` parts:

In [11]:
countries = Table.read_table('data/countries.csv')
countries = countries.relabeled('Country(or dependent territory)', 'Country') \
           .relabeled('% of world', '%') \
           .relabeled('Source(official or UN)', 'Source')
countries = countries.with_columns(
    'Country', countries.apply(lambda s: s[:s.index('[')].lower() if '[' in s else s.lower(), 'Country'),
    'Population', countries.apply(lambda i: int(i.replace(',', '')), 'Population'),
    '%', countries.apply(lambda f: float(f.replace('%', '')), '%')
)
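
For the curious, each `lambda` above is an unnamed one-line function. A rough equivalent with named functions might look like the sketch below. The function names and the sample inputs are hypothetical, since the raw file is not shown here.

```python
def clean_country(s):
    # Keep only the text before the first '[' (if any), then lowercase
    if '[' in s:
        s = s[:s.index('[')]
    return s.lower()

def clean_population(i):
    # Remove comma separators and convert to an integer
    return int(i.replace(',', ''))

def clean_percent(f):
    # Remove the percent sign and convert to a float
    return float(f.replace('%', ''))

# Hypothetical raw values, for illustration only
print(clean_country('China[b]'))
print(clean_population('1,405,936,040'))
print(clean_percent('17.9%'))
```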

In [12]:
countries

Rank,Country,Population,%,Date,Source
1,china,1405936040,17.9,27 Dec 2020,National population clock[3]
2,india,1371366679,17.5,27 Dec 2020,National population clock[4]
3,united states,330888778,4.22,27 Dec 2020,National population clock[5]
4,indonesia,269603400,3.44,1 Jul 2020,National annual projection[6]
5,pakistan,220892331,2.82,1 Jul 2020,UN Projection[2]
6,brazil,212523810,2.71,27 Dec 2020,National population clock[7]
7,nigeria,206139587,2.63,1 Jul 2020,UN Projection[2]
8,bangladesh,169885314,2.17,27 Dec 2020,National population clock[8]
9,russia,146748590,1.87,1 Jan 2020,National annual estimate[9]
10,mexico,127792286,1.63,1 Jul 2020,National annual projection[10]


Let's find all of the countries whose name starts or ends with the letter 'a':

In [None]:
def starts_or_ends_with_a(name):
    return name[0] == 'a' or name[-1] == 'a'

In [None]:
countries.apply(starts_or_ends_with_a, 'Country')

In [None]:
countries.where(countries.apply(starts_or_ends_with_a, 'Country'))
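
The two steps in that last line can be traced by hand with plain Python lists. This is only a sketch of the idea, using the ten country names shown in the table above. The names `mask` and `matches` are illustrative.

```python
def starts_or_ends_with_a(name):
    return name[0] == 'a' or name[-1] == 'a'

# The ten country names shown in the table above
names = ['china', 'india', 'united states', 'indonesia', 'pakistan',
         'brazil', 'nigeria', 'bangladesh', 'russia', 'mexico']

# Step 1: one boolean per name (the role played by `apply`)
mask = [starts_or_ends_with_a(n) for n in names]

# Step 2: keep the names whose boolean is True (the role played by `where`)
matches = [n for n, keep in zip(names, mask) if keep]
print(matches)
```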