In [164]:
import pandas as pd
import numpy as np

# Discussion 3

### Announcements:
- Project 1 due Saturday 1/27
- Lab 03 due Monday 1/29
- Saturday office hours (CSE 2217)
    - Dylan 12:00 - 2:30
    - Jasmine 4:30 - 6:00
- Lab solutions going up on Ed
    - Link also on course website

## SettingWithCopyWarning 

A warning you're likely to run into at some point is the `SettingWithCopyWarning`. It often doesn't change your results, but it is good practice to write code that runs without warnings.

Run the code below to see an example of how it happens, and how to prevent it.

In [230]:
warn_df = pd.DataFrame({"Movie Title": ["Spider-Man: Across the Spider-Verse",
                                        "Scott Pilgrim vs. the World",
                                        "Monty Python and the Holy Grail",
                                        "Joker",
                                        "Fight Club"],
                        "Release Year": [2023,2010,1975,2019,1999],
                        "Rating": ["PG","PG-13","PG","R","R"],
                        "Pretty Visuals": [True,True,False,True,False],
                        "Funny": [False,True,True,False,False]})
warn_df

Unnamed: 0,Movie Title,Release Year,Rating,Pretty Visuals,Funny
0,Spider-Man: Across the Spider-Verse,2023,PG,True,False
1,Scott Pilgrim vs. the World,2010,PG-13,True,True
2,Monty Python and the Holy Grail,1975,PG,False,True
3,Joker,2019,R,True,False
4,Fight Club,1999,R,False,False


In [166]:
# Mask for 'Rating' to 'PG'.
is_pg = warn_df["Rating"] == "PG"

# Apply filter to DataFrame.
warn_df_pg = warn_df[is_pg]
warn_df_pg

Unnamed: 0,Movie Title,Release Year,Rating,Pretty Visuals,Funny
0,Spider-Man: Across the Spider-Verse,2023,PG,True,False
2,Monty Python and the Holy Grail,1975,PG,False,True


In [167]:
# Add a new column on if I would show the movie to a kid.
warn_df_pg["Would Show to a Kid"] = [True, False]
warn_df_pg

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  warn_df_pg["Would Show to a Kid"] = [True, False]


Unnamed: 0,Movie Title,Release Year,Rating,Pretty Visuals,Funny,Would Show to a Kid
0,Spider-Man: Across the Spider-Verse,2023,PG,True,False,True
2,Monty Python and the Holy Grail,1975,PG,False,True,False


### Oh no!

The above code threw a warning even though `warn_df_pg` looks correct. What happened?

Selecting rows with brackets and a boolean Series is called boolean indexing (or masking). When we call `warn_df[is_pg]`, we get back the subset of the original DataFrame that contains PG movies. Depending on the operation, pandas may hand back a subset as a **view** (a window onto the original data, as if we had just covered up the non-PG rows, hence why earlier I called `is_pg` a mask) or as a **copy** (new, independent data).

If you then try to change the contents of that subset, pandas can't tell what you meant: did you want to edit only the subset? Or did you want to edit the matching rows of the original DataFrame?

Here the new column lands on `warn_df_pg` only and `warn_df` is left unchanged, which is probably what you wanted in most cases. However, in case it isn't what you intended, pandas raises the `SettingWithCopyWarning` to let you know!

To avoid this warning, explicitly use `.copy()` or `.loc[]` to specify whether you want to work on a copy or change the original DataFrame. Then pandas doesn't need to assume anything.

In [231]:
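# Overwrite a whole column on the original DataFrame.
# Assigning directly to warn_df (not to a filtered subset) does not trigger the warning.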
warn_df["Pretty Visuals"] = [True, True, True, True, True]

In [232]:
# Solution A: Explicitly set on a copy using .copy().
is_pg = warn_df["Rating"] == "PG"
copy_df = warn_df[is_pg].copy()
copy_df["Would Show to a Kid"] = [True, False]
copy_df

Unnamed: 0,Movie Title,Release Year,Rating,Pretty Visuals,Funny,Would Show to a Kid
0,Spider-Man: Across the Spider-Verse,2023,PG,True,False,True
2,Monty Python and the Holy Grail,1975,PG,True,True,False


In [170]:
# Solution B: Explicitly set on the original using .loc[]
# Note that this edits the original warn_df, not a copy!
is_pg = warn_df["Rating"] == "PG"
warn_df.loc[is_pg, "Would Show to a Kid"] = [True, False]
warn_df

Unnamed: 0,Movie Title,Release Year,Rating,Pretty Visuals,Funny,Would Show to a Kid
0,Spider-Man: Across the Spider-Verse,2023,PG,True,False,True
1,Scott Pilgrim vs. the World,2010,PG-13,True,True,
2,Monty Python and the Holy Grail,1975,PG,True,True,False
3,Joker,2019,R,True,False,
4,Fight Club,1999,R,True,False,
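If you want to convince yourself that the two solutions really do different things, here is a minimal sketch on a tiny made-up DataFrame (the names `movies`, `pg_only`, and `kid_friendly` are hypothetical, not from the notebook). It checks that the `.copy()` route leaves the original alone, while the `.loc[]` route writes into it.

```python
import pandas as pd

# Hypothetical toy data, just for this check.
movies = pd.DataFrame({"title": ["A", "B", "C"],
                       "rating": ["PG", "R", "PG"]})
is_pg = movies["rating"] == "PG"

# Solution A style: work on an explicit copy.
pg_only = movies[is_pg].copy()
pg_only["kid_friendly"] = [True, False]

# The original has no new column at this point.
original_unchanged = "kid_friendly" not in movies.columns
print(original_unchanged)

# Solution B style: write through .loc on the original.
# Rows outside the mask get a missing value (NaN).
movies.loc[is_pg, "kid_friendly"] = [True, False]
print(movies)
```

Which one you pick depends on whether you still need the unmodified original afterwards.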


## Working With `groupby()`
<br/>
<div>
<img src="https://i.imgflip.com/8ddsrh.jpg" width="300"/>
</div>
<br/>

When you group an object, there are a lot of options for how to work with it. The simplest are built-in functions such as `count()`, `sum()`, and `mean()`, but we can also use `transform()`, `apply()`, or `agg()` to perform custom operations.

In [171]:
df = pd.DataFrame({"animal": ["Manta Ray",
                              "Quokka",
                              "Rain Frog",
                              "Binturong",
                              "Sailfish",
                              "Sturgeon",
                              "Rhino",
                              "Platypus"],
                   "who": ["water_thing", "cute", "cute", "weird", "water_thing", "water_thing", "weird", "weird"],
                   "weight (lbs)": [6600, 6, 0.025, 60, 120, 800, 1600, 3],
                   "lifespan": [30, 10, 5, 18, 5, 100, 50, 15]
                  }).set_index("animal")
df

Unnamed: 0_level_0,who,weight (lbs),lifespan
animal,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Manta Ray,water_thing,6600.0,30
Quokka,cute,6.0,10
Rain Frog,cute,0.025,5
Binturong,weird,60.0,18
Sailfish,water_thing,120.0,5
Sturgeon,water_thing,800.0,100
Rhino,weird,1600.0,50
Platypus,weird,3.0,15


In [204]:
def diffs(x):
    print("\tSingle Iteration Input: ")
    print(x)
    print("-"*40)
    return x.max() - x.min()

In [205]:
df.groupby("who").mean()

Unnamed: 0_level_0,weight (lbs),lifespan
who,Unnamed: 1_level_1,Unnamed: 2_level_1
cute,3.0125,7.5
water_thing,2506.666667,45.0
weird,554.333333,27.666667


### .transform()

Use `.transform()` when you want a per-group calculation whose result has the same rows (same index) as the original DataFrame. Each row gets the value computed for its group.

In [206]:
df.groupby("who").transform('mean')

Unnamed: 0_level_0,weight (lbs),lifespan
animal,Unnamed: 1_level_1,Unnamed: 2_level_1
Manta Ray,2506.666667,45.0
Quokka,3.0125,7.5
Rain Frog,3.0125,7.5
Binturong,554.333333,27.666667
Sailfish,2506.666667,45.0
Sturgeon,2506.666667,45.0
Rhino,554.333333,27.666667
Platypus,554.333333,27.666667


In [207]:
df.groupby("who").transform(diffs)

	Single Iteration Input: 
animal
Quokka       6.000
Rain Frog    0.025
Name: weight (lbs), dtype: float64
----------------------------------------
	Single Iteration Input: 
animal
Quokka       10
Rain Frog     5
Name: lifespan, dtype: int64
----------------------------------------
	Single Iteration Input: 
           weight (lbs)  lifespan
animal                           
Quokka            6.000        10
Rain Frog         0.025         5
----------------------------------------
	Single Iteration Input: 
animal
Manta Ray    6600.0
Sailfish      120.0
Sturgeon      800.0
Name: weight (lbs), dtype: float64
----------------------------------------
	Single Iteration Input: 
animal
Manta Ray     30
Sailfish       5
Sturgeon     100
Name: lifespan, dtype: int64
----------------------------------------
	Single Iteration Input: 
           weight (lbs)  lifespan
animal                           
Manta Ray        6600.0        30
Sailfish          120.0         5
Sturgeon          800.0       

Unnamed: 0_level_0,weight (lbs),lifespan
animal,Unnamed: 1_level_1,Unnamed: 2_level_1
Manta Ray,6480.0,95.0
Quokka,5.975,5.0
Rain Frog,5.975,5.0
Binturong,1597.0,35.0
Sailfish,6480.0,95.0
Sturgeon,6480.0,95.0
Rhino,1597.0,35.0
Platypus,1597.0,35.0
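Because the result of `.transform()` lines up row-for-row with the original, a common use is to compare each row against its own group. Below is a small sketch on made-up data (the `animals` DataFrame and the new column name are hypothetical) that computes how far each animal's lifespan is from its group's mean.

```python
import pandas as pd

# Hypothetical toy data, smaller than the df used above.
animals = pd.DataFrame({
    "animal": ["Quokka", "Rain Frog", "Rhino", "Platypus"],
    "who": ["cute", "cute", "weird", "weird"],
    "lifespan": [10, 5, 50, 15],
}).set_index("animal")

# One value per original row: the mean lifespan of that row's group.
group_mean = animals.groupby("who")["lifespan"].transform("mean")

# Same index as `animals`, so we can subtract and assign directly.
animals["lifespan vs group mean"] = animals["lifespan"] - group_mean
print(animals)
```

With `.mean()` or `.agg("mean")` you would get one row per group instead, and would need a merge to line the values back up.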


### .apply()

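Use `.apply()` when your function needs to see each group's rows all together, as one sub-DataFrame (for example, to combine several columns).

*Note that `DataFrame.apply()`, called without `groupby()`, has different behavior and parameters.*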

In [218]:
# selecting the columns is just to avoid a warning, 
# it has the same output if you don't select the columns explicitly.
df.groupby("who")[["weight (lbs)", "lifespan"]].apply(np.mean)

Unnamed: 0_level_0,weight (lbs),lifespan
who,Unnamed: 1_level_1,Unnamed: 2_level_1
cute,3.0125,7.5
water_thing,2506.666667,45.0
weird,554.333333,27.666667


In [219]:
df.groupby("who").apply(diffs)

	Single Iteration Input: 
            who  weight (lbs)  lifespan
animal                                 
Quokka     cute         6.000        10
Rain Frog  cute         0.025         5
----------------------------------------
	Single Iteration Input: 
           weight (lbs)  lifespan
animal                           
Quokka            6.000        10
Rain Frog         0.025         5
----------------------------------------
	Single Iteration Input: 
           weight (lbs)  lifespan
animal                           
Manta Ray        6600.0        30
Sailfish          120.0         5
Sturgeon          800.0       100
----------------------------------------
	Single Iteration Input: 
           weight (lbs)  lifespan
animal                           
Binturong          60.0        18
Rhino            1600.0        50
Platypus            3.0        15
----------------------------------------


Unnamed: 0_level_0,weight (lbs),lifespan
who,Unnamed: 1_level_1,Unnamed: 2_level_1
cute,5.975,5.0
water_thing,6480.0,95.0
weird,1597.0,35.0
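As the printed inputs show, the function passed to `.apply()` receives a whole sub-DataFrame per group, so it can look at more than one column at once. Here is a hedged sketch on made-up data (`animals` and `lifespan_of_heaviest` are hypothetical names) that finds the lifespan of the heaviest animal in each group.

```python
import pandas as pd

# Hypothetical toy data.
animals = pd.DataFrame({
    "animal": ["Quokka", "Rain Frog", "Rhino", "Platypus"],
    "who": ["cute", "cute", "weird", "weird"],
    "weight": [6.0, 0.025, 1600.0, 3.0],
    "lifespan": [10, 5, 50, 15],
}).set_index("animal")

def lifespan_of_heaviest(group):
    # `group` is the sub-DataFrame for a single value of "who".
    heaviest = group["weight"].idxmax()   # index label of the heaviest row
    return group.loc[heaviest, "lifespan"]

# Selecting the columns keeps the grouping column out of `group`.
result = animals.groupby("who")[["weight", "lifespan"]].apply(lifespan_of_heaviest)
print(result)
```

The function returns one value per group here, so the result is a Series indexed by `who`.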


### .agg()

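Use `.agg()` when you want to reduce each group to summary values, using one or more aggregation functions. Each function receives one column of one group at a time.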

In [179]:
df.groupby("who").agg("mean")

Unnamed: 0_level_0,weight (lbs),lifespan
who,Unnamed: 1_level_1,Unnamed: 2_level_1
cute,3.0125,7.5
water_thing,2506.666667,45.0
weird,554.333333,27.666667


In [180]:
df.groupby("who").agg(diffs)

	One Iteration Input: 
Quokka       6.000
Rain Frog    0.025
Name: weight (lbs), dtype: float64

	One Iteration Input: 
Manta Ray    6600.0
Sailfish      120.0
Sturgeon      800.0
Name: weight (lbs), dtype: float64

	One Iteration Input: 
Binturong      60.0
Rhino        1600.0
Platypus        3.0
Name: weight (lbs), dtype: float64

	One Iteration Input: 
Quokka       10
Rain Frog     5
Name: lifespan, dtype: int64

	One Iteration Input: 
Manta Ray     30
Sailfish       5
Sturgeon     100
Name: lifespan, dtype: int64

	One Iteration Input: 
Binturong    18
Rhino        50
Platypus     15
Name: lifespan, dtype: int64



Unnamed: 0_level_0,weight (lbs),lifespan
who,Unnamed: 1_level_1,Unnamed: 2_level_1
cute,5.975,5
water_thing,6480.0,95
weird,1597.0,35


### Some special uses of .agg() and .apply()

In [186]:
df.groupby("who").agg(["mean", diffs])
# df.groupby("who").transform(["mean", diffs]) # Error!
# df.groupby("who").apply(["mean", diffs]) # Error!

	One Iteration Input: 
Quokka       6.000
Rain Frog    0.025
Name: weight (lbs), dtype: float64

	One Iteration Input: 
Manta Ray    6600.0
Sailfish      120.0
Sturgeon      800.0
Name: weight (lbs), dtype: float64

	One Iteration Input: 
Binturong      60.0
Rhino        1600.0
Platypus        3.0
Name: weight (lbs), dtype: float64

	One Iteration Input: 
Quokka       10
Rain Frog     5
Name: lifespan, dtype: int64

	One Iteration Input: 
Manta Ray     30
Sailfish       5
Sturgeon     100
Name: lifespan, dtype: int64

	One Iteration Input: 
Binturong    18
Rhino        50
Platypus     15
Name: lifespan, dtype: int64



Unnamed: 0_level_0,weight (lbs),weight (lbs),lifespan,lifespan
Unnamed: 0_level_1,mean,diffs,mean,diffs
who,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
cute,3.0125,5.975,7.5,5
water_thing,2506.666667,6480.0,45.0,95
weird,554.333333,1597.0,27.666667,35
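Passing a list of functions applies every function to every column. If you would rather choose a different operation per column and name the outputs yourself, `.agg()` also accepts keyword arguments of the form `new_name=(column, function)`. This is a sketch on made-up data (`animals`, `max_weight`, and `mean_lifespan` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy data.
animals = pd.DataFrame({
    "animal": ["Quokka", "Rain Frog", "Rhino", "Platypus"],
    "who": ["cute", "cute", "weird", "weird"],
    "weight": [6.0, 0.025, 1600.0, 3.0],
    "lifespan": [10, 5, 50, 15],
}).set_index("animal")

# Each keyword becomes an output column: (input column, aggregation).
summary = animals.groupby("who").agg(
    max_weight=("weight", "max"),
    mean_lifespan=("lifespan", "mean"),
)
print(summary)
```

This gives flat column names instead of the two-level header that the list form produces.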


In [233]:
def diff_cols(x):
    return x["weight (lbs)"].mean() - x["lifespan"].mean()

df.groupby("who").apply(diff_cols)
# df.groupby("who").transform(diff_cols) # Error!
# df.groupby("who").agg(diff_cols) # Error!

who
cute             -4.487500
water_thing    2461.666667
weird           526.666667
dtype: float64

## Bad Boolean Zen

Something small that I see in a number of students' code...

If an operation evaluates to `True` or `False`, you do not then have to check if the output is `True` to return `True`, or `False` to return `False`. Instead, you can generally just return the operation output directly.

As you can see below, we define two functions that return True if a value is less than 10, and False otherwise. `is_small_bad()` has an example of a bad boolean zen implementation, while `is_small_good()` corrects the implementation.

As a general caution, double check your work if you directly `return True` or `return False`. This is not a guarantee that your function has bad boolean zen, but it can be a sign of it.

In [187]:
# Bad boolean zen
def is_small_bad(n):
    if (n < 10) == True:
        return True
    else:
        return False
    
# Good boolean zen
def is_small_good(n):
    return n < 10

print(is_small_bad(5))
print(is_small_good(5))

True
True
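The same habit shows up when filtering DataFrames: if a column already holds booleans, you can use it as the mask directly instead of comparing it to `True`. A small sketch on made-up data (the `movies` DataFrame and its `funny` column are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a boolean column and no missing values.
movies = pd.DataFrame({"title": ["A", "B", "C"],
                       "funny": [False, True, True]})

# Bad boolean zen: comparing a boolean column to True.
funny_verbose = movies[movies["funny"] == True]

# Good boolean zen: the column is already a mask.
funny_direct = movies[movies["funny"]]

print(funny_verbose.equals(funny_direct))
```

Both versions select the same rows here; the second just says it more directly.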
