# Advanced Applications of Mutate

## Three helpful functions

# Install stuff

!pip install unpythonic # not until I know why

In [1]:
import numpy as np  # used explicitly later (np.random, np.nan)
import pandas as pd
pd.set_option("display.max_columns", None)
import collections
collections.Iterable = collections.abc.Iterable # fix possible issue with case_when in Python 3.10
from dfply import *
import matplotlib.pyplot as plt
%matplotlib inline

## Hiding stack traceback

We hide the exception traceback for didactic reasons (code source: [see this post](https://stackoverflow.com/questions/46222753/how-do-i-suppress-tracebacks-in-jupyter)).  Don't run this cell if you want to see a full traceback.

import sys
ipython = get_ipython()

def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
                   exception_only=False, running_compiled_code=False):
    etype, value, tb = sys.exc_info()
    return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))

ipython.showtraceback = hide_traceback

## Data set

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded each repository.  [Download Instructions](./get_MOMA_data.ipynb)

#### MoMA Exhibitions

In [2]:
exhib_url = "https://github.com/MuseumofModernArt/exhibitions/raw/master/MoMAExhibitions1929to1989.csv"
dat_cols = ['ExhibitionBeginDate', 'ExhibitionEndDate', 'ConstituentBeginDate' ,'ConstituentEndDate']
exhibitions = pd.read_csv(exhib_url, 
                          encoding="ISO-8859-1",
                          parse_dates=dat_cols)
exhibitions.head(2)

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,ConstituentID,ConstituentType,DisplayName,AlphaSort,FirstName,MiddleName,LastName,Suffix,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Curator,Director,9168.0,Individual,"Alfred H. Barr, Jr.",Barr Alfred H. Jr.,Alfred,H.,Barr,Jr.,,American,1902,1981,"American, 19021981",Male,109252853.0,Q711362,500241556.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,1053.0,Individual,Paul Cézanne,Cézanne Paul,Paul,,Cézanne,,,French,1839,1906,"French, 18391906",Male,39374836.0,Q35548,500004793.0,moma.org/artists/1053


#### MoMA Artists

In [3]:
artists = pd.read_csv("./data/Artists.csv")
artists.head(2)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,


#### MoMA Artwork

In [4]:
from more_dfply import fix_names

artwork = (pd.read_csv("./data/Artworks.csv")
           >> fix_names
           >> mutate(id = X.index + 1)
          )
artwork.head(2)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,ThumbnailURL,Circumference_cm,Depth_cm,Diameter_cm,Height_cm,Length_cm,Weight_kg,Width_cm,Seat_Height_cm,Duration_sec,id
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,"19 1/8 x 66 1/2"" (48.6 x 168.9 cm)",Fractional and promised gift of Jo Carole and ...,885.1996,Architecture,Architecture & Design,1996-04-09,Y,2,http://www.moma.org/collection/works/2,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,,1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,"16 x 11 3/4"" (40.6 x 29.8 cm)",Gift of the architect in honor of Lily Auchinc...,1.1995,Architecture,Architecture & Design,1995-01-17,Y,3,http://www.moma.org/collection/works/3,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,,2


# Three helpful column functions

In this section, we will focus on three useful column functions: `ifelse`, `case_when`, and `coalesce`.

# Branching with `ifelse`

The function `ifelse`

* allows us to pick between two options in a `mutate`.
* has the following syntax: `ifelse(cond, then, else)`
* returns the value from `then` wherever `cond` is `True`
* returns the value from `else` wherever `cond` is `False`

In [5]:
from more_dfply import ifelse

(artwork
    >> select(X.Gender)
    >> mutate(recode_gender = ifelse(X.Gender == '(Male)', 'm', 'f'))
    >> head
)

Unnamed: 0,Gender,recode_gender
0,(Male),m
1,(Male),m
2,(Male),m
3,(Male),m
4,(Male),m
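
For comparison, the same element-wise choice can be sketched in plain pandas with `numpy.where`. This is only an illustration of the semantics, not a claim about how `more_dfply.ifelse` is implemented, and the small `gender` Series is made-up data rather than the MoMA table.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the MoMA table).
gender = pd.Series(["(Male)", "(Female)", "(Male)"])

# np.where(cond, then, else) picks element-wise:
# the `then` value where cond is True, the `else` value elsewhere.
recoded = pd.Series(np.where(gender == "(Male)", "m", "f"), index=gender.index)
print(recoded.tolist())
```

Every value other than `'(Male)'` lands in the else branch, so it is worth checking what else the column contains before trusting a two-way recode like this.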


### `then` and `else` are conformed to `len(cond)`

* Singletons are repeated.
* Short vectors are tiled.
* Series/lists that are too long are truncated.
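
The three rules above can be sketched in plain Python. The helper name `conform` is hypothetical and this is not the `more_dfply` source; it only mimics the behavior described in the list.

```python
from itertools import cycle, islice

def conform(values, n):
    """Hypothetical helper: stretch or cut `values` to exactly n items."""
    # Strings and non-iterables are treated as singletons.
    if isinstance(values, (str, bytes)) or not hasattr(values, "__iter__"):
        values = [values]
    # cycle() tiles short input; islice() stops after n items, truncating long input.
    return list(islice(cycle(values), n))

print(conform("singleton", 5))    # singleton repeated
print(conform([1, 2, 3], 5))      # short sequence tiled
print(conform(range(1, 10), 5))   # long sequence truncated
```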

#### Some example conditions

In [6]:
from numpy import repeat, arange
all_true = repeat(True, 5)
all_false = repeat(False, 5)
all_true, all_false

(array([ True,  True,  True,  True,  True]),
 array([False, False, False, False, False]))

#### Series that are too long or too short

In [7]:
short = arange(1,4,1)
long = arange(1,10,1)
short, long

(array([1, 2, 3]), array([1, 2, 3, 4, 5, 6, 7, 8, 9]))

#### Singletons are repeated

In [8]:
ifelse(all_true, 'singleton', long)

0    singleton
1    singleton
2    singleton
3    singleton
4    singleton
dtype: object

#### Short sequences are tiled

The elements of a short sequence are repeated, over and over, until the result has the same length as `cond`.

In [9]:
ifelse(all_true, short, long)

0    1
1    2
2    3
3    1
4    2
dtype: int64

#### Long sequences are truncated

A long sequence is cut off after its first `len(cond)` elements, so the result has the same length as `cond`.

In [10]:
ifelse(all_false, short, long)

0    1
1    2
2    3
3    4
4    5
dtype: int64

### `then` and `else` are only evaluated if needed (when they are `Intention`s)

* If `cond` is all true, then an `else` `Intention` will not be evaluated.
* Similarly, if `cond` is all false, then a `then` `Intention` will not be evaluated.

#### It is important that conditionals don't evaluate the "other" expression

In [11]:
'safe' if True else 1/0

'safe'

In [12]:
def my_ifelse(cond, then, else_):
    return then if cond else else_

In [13]:
# This errors because it evaluates 1/0 before it calls my_ifelse
my_ifelse(True, 'safe', 1/0)

ZeroDivisionError: division by zero
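
One common way around this eager evaluation of arguments is to pass zero-argument callables and call only the branch that is needed. The sketch below is a scalar toy with a hypothetical name, `lazy_ifelse`; it is meant to convey the idea behind deferring work until it is needed, not to reproduce how `dfply` builds its `Intention` objects.

```python
def lazy_ifelse(cond, then, else_):
    """Hypothetical scalar version: `then` and `else_` are zero-argument callables."""
    # Only the selected branch is ever called.
    return then() if cond else else_()

# The lambda wrapping 1 / 0 is created but never called, so nothing is raised.
result = lazy_ifelse(True, lambda: "safe", lambda: 1 / 0)
print(result)
```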

#### An expression that will crash if `else` is evaluated

In [14]:
(X/0).evaluate(2)

ZeroDivisionError: division by zero

#### No crash $\Rightarrow$ `else` was not evaluated

In [15]:
ifelse(all_true, 'safe', X.height_cm/0)

<dfply.base.Intention at 0x7fee8ba5ee00>

In [16]:
# The previous cell only avoids the crash because the Intention is deferred.
# ifelse cannot protect you from an ordinary expression: Python evaluates 1/0
# before ifelse is even called.
ifelse(all_true, "safe", 1/0)

ZeroDivisionError: division by zero

In [17]:
ifelse(all_true, 'safe', X.height_cm/0).evaluate(2)

0    safe
1    safe
2    safe
3    safe
4    safe
dtype: object

In [18]:
# This crashes in a different way: with cond all false the else Intention is evaluated,
# and the int 2 is not a valid substitution for X (it has no attribute height_cm)
ifelse(all_false, "safe", X.height_cm/0).evaluate(2)

AttributeError: 'int' object has no attribute 'height_cm'

In [19]:
# With cond all false, the else Intention X/0 is evaluated and raises, as expected
ifelse(all_false, "safe", X/0).evaluate(2)

ZeroDivisionError: division by zero

In [20]:
# doesn't crash because numpy floats yield inf when divided by 0
ifelse(all_false, "safe", X.Height_cm/0).evaluate(artwork)

0    inf
1    inf
2    inf
3    inf
4    inf
Name: Height_cm, dtype: float64

In [21]:
artwork.Height_cm.dtype

dtype('float64')
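
The difference between the two behaviors seen above (an exception versus `inf`) can be checked directly. This is a small sketch using a NumPy scalar; the value `48.6` is simply the first `Height_cm` shown earlier.

```python
import numpy as np

# NumPy float64 follows IEEE 754: a nonzero value divided by zero gives inf.
with np.errstate(divide="ignore"):        # silence the RuntimeWarning for this demo
    numpy_result = np.float64(48.6) / 0

# A plain Python float raises instead.
try:
    1.0 / 0
    python_raised = False
except ZeroDivisionError:
    python_raised = True

print(numpy_result, python_raised)
```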

## <font color="red"> Exercise 3.4.1 </font>

Consider the `Nationality` column of the `exhibitions` table.  We would like to add a new column titled `"American"` that contains `1` if the artist is of American descent and `0` otherwise.

In [22]:
exhibitions.Nationality.head()

0    American
1      French
2      French
3       Dutch
4      French
Name: Nationality, dtype: object

In [23]:
exhibitions.Nationality.unique()

array(['American', 'French', 'Dutch', 'Italian', nan, 'Spanish', 'German',
       'Mexican', 'Austrian', 'Finnish', 'Swedish', 'architect', 'Swiss',
       'British', 'Czech', 'Belgian', 'Russian', 'Guatemalan',
       'Russian-Lithuanian', 'English', 'Nationality unknown', 'Greek',
       'Norwegian', 'Georgian', 'Latvian', 'Polish', 'Japanese',
       'Milanese', 'Danish', 'Netherlandish', 'Romanian', 'Flemish',
       'Israeli', 'Scottish', 'Hungarian', 'Yugoslav', 'Brazilian',
       'Ukrainian', 'Catalan', 'Florentine', 'Venetian', 'Peruvian',
       'Canadian', 'Bolivian', 'Cuban', 'Irish', 'Chinese', 'Argentine',
       'Chilean', 'Colombian', 'Uruguayan', 'Ecuadorian', 'Venezuelan',
       'Australian', 'Haitian', 'Indian', 'Korean', 'Turkish', 'New',
       'Tanzanian', 'New Zealander', 'South', 'Icelandic', 'Iranian',
       'Panamanian', 'Rhodesian', 'Sudanese', 'Moroccan and American',
       'Canadian Inuit', 'Slovene', 'Bosnian', 'South African',
       'Croatian', 'Luxem

In [24]:
# Judgment call: any nationality containing "American" counts as American here,
# such as "Native American" or "American and Mexican".
(exhibitions
    >> mutate(American = ifelse(X.Nationality.str.contains("American"), 1, 0))
    >> select(X.Nationality, X.American)
    >> head()
)

Unnamed: 0,Nationality,American
0,American,1
1,French,0
2,French,0
3,Dutch,0
4,French,0
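
One detail worth checking in a solution like this is how missing nationalities are treated, since the column contains `nan`. In plain pandas, `Series.str.contains` accepts an `na` argument that fixes the result for missing entries. The sketch below uses a made-up Series and does not depend on `ifelse`.

```python
import pandas as pd

# Toy data, made up for illustration; None stands in for a missing nationality.
nationality = pd.Series(["American", None, "Native American", "French"])

# na=False makes missing entries count as "does not contain".
is_american = nationality.str.contains("American", na=False)
american = is_american.astype(int)
print(american.tolist())
```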


## Generalizing `ifelse` with `case_when`

`case_when` takes one or more `(pred, then)` tuples
* `pred` is a `bool` expression
* `then` supplies the value for each row where `pred` is `True` and no earlier pair has already matched

This is similar to the R `case_when` from `dplyr`. See [case_when docs](https://dplyr.tidyverse.org/reference/case_when.html)

In [25]:
from more_dfply import case_when

#### Some example conditions

In [26]:
df = pd.DataFrame({'cat':['a','b','b','c','c'],
                   'val':[ 1,  1,  2,  1, 2]})
df

Unnamed: 0,cat,val
0,a,1
1,b,1
2,b,2
3,c,1
4,c,2


#### `case_when` with one predicate pair

Unmatched values are `nan`

In [27]:
(df
 >> mutate(new = case_when((X.cat == 'a', df.val + 1))))

Unnamed: 0,cat,val,new
0,a,1,2.0
1,b,1,
2,b,2,
3,c,1,
4,c,2,


#### Left-hand pairs have precedence

In [28]:
(df
 >> mutate(new = case_when((X.cat == 'a', df.val + 1),
                           (X.cat == 'b', df.val + 2))))

Unnamed: 0,cat,val,new
0,a,1,2.0
1,b,1,3.0
2,b,2,4.0
3,c,1,
4,c,2,
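
The two predicates above never overlap, so precedence cannot actually be seen in that output. The sketch below uses overlapping conditions with `numpy.select` as a stand-in. That is an assumption made for illustration: `numpy.select` also lets the first matching condition win, but it is not necessarily how `more_dfply.case_when` is implemented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "b", "c", "c"],
                   "val": [1, 1, 2, 1, 2]})

# Row 2 (cat == 'b', val == 2) satisfies both conditions.
conditions = [df["cat"] == "b", df["val"] == 2]
choices = [df["val"] + 2, df["val"] * 100]

# The first matching condition wins; unmatched rows get the default (nan).
new = pd.Series(np.select(conditions, choices, default=np.nan), index=df.index)
print(new.tolist())
```

Row 2 takes its value from the first pair (`val + 2`), while row 4 matches only the second condition and takes `val * 100`.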


#### Singletons are accepted

In [29]:
(df
 >> mutate(new = case_when((X.cat == 'a', df.val + 1),
                           (X.cat == 'b', df.val + 2),
                           (X.cat == 'c', 18))))

Unnamed: 0,cat,val,new
0,a,1,2.0
1,b,1,3.0
2,b,2,4.0
3,c,1,18.0
4,c,2,18.0


## <font color="red"> Exercise 3.4.2 </font>

Consider the `Nationality` column of the `exhibitions` table.  We would like to make a new column that reclassifies this column as `"North American"`, `"European"`, or `"Other"`.  Use `case_when` to accomplish this task. 

In [30]:
# define our cases (otherwise case_when could get really out of hand)
americans = {'American', 'Mexican', 'Guatemalan', 'Canadian', 'Cuban', 'Haitian', 'Panamanian', 'Moroccan and American',
             'Canadian Inuit', 'Native American'}

europeans = {'French', 'Dutch', 'Italian', 'Spanish', 'German', 'Austrian', 'Finnish', 'Swedish', 'Swiss', 'British',
             'Czech', 'Belgian', 'Russian', 'Russian-Lithuanian', 'English', 'Greek', 'Norwegian', 'Georgian', 'Latvian',
             'Polish', 'Milanese', 'Danish', 'Netherlandish', 'Romanian', 'Flemish', 'Scottish', 'Hungarian', 'Yugoslav',
             'Ukrainian', 'Catalan', 'Florentine', 'Venetian', 'Irish', 'Icelandic', 'Slovene', 'Bosnian',
             'Croatian', 'Luxembourgish'}

others = set(exhibitions.Nationality.unique()) - americans - europeans

In [31]:
# Your code here
(exhibitions
    >> mutate(nationality_group = case_when((X.Nationality.isin(americans), "North American"),
                                            (X.Nationality.isin(europeans), "European"),
                                            (X.Nationality.isin(others), "Other")
                                           )
             )
    >> select(X.Nationality, X.nationality_group)
    >> sample(10)
)

Unnamed: 0,Nationality,nationality_group
17588,Nationality unknown,Other
22859,American,North American
132,,Other
4119,American,North American
32338,American,North American
14911,German,European
29257,Belgian,European
20162,American,North American
30957,French,European
33260,French,European
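
As a cross-check on a solution like the one above, the same grouping can be sketched in plain pandas with a lookup table, where everything not listed (including missing values) falls through to `"Other"`. The dictionary here is deliberately tiny and made up for illustration; it is not the full classification.

```python
import pandas as pd

# Deliberately small lookup table, for illustration only.
groups = {"American": "North American", "Mexican": "North American",
          "French": "European", "German": "European"}

nationality = pd.Series(["American", "German", None, "Nationality unknown"])

# map() yields NaN for anything not in the table; fillna() supplies the catch-all.
nationality_group = nationality.map(groups).fillna("Other")
print(nationality_group.tolist())
```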


# Using `coalesce` to remove missing values

* Syntax: `coalesce(col1, col2, ...)`
* Returns a `pd.Series`
* Each entry is the first non-missing value from `col1`, `col2`, ... (working left to right).
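
The left-to-right rule can be sketched in plain pandas by chaining `Series.combine_first`, which fills the missing entries of one Series from another. This is an illustration of the semantics under that assumption, not the `more_dfply` implementation, and the three Series are made up.

```python
from functools import reduce

import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.0, np.nan])
b = pd.Series([np.nan, 7.0, 1.0])
c = pd.Series([6.0, 1.0, 0.0])

# Fold left to right: keep what is already filled, take the rest from the next column.
d = reduce(lambda left, right: left.combine_first(right), [a, b, c])
print(d.tolist())
```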

In [32]:
from more_dfply import coalesce

In [33]:
df = pd.DataFrame({'cat':['a','b','b','c','c'],
                   'val':[ 1,  1,  2,  1, 2]})
df

Unnamed: 0,cat,val
0,a,1
1,b,1
2,b,2
3,c,1
4,c,2


#### Example `df` with some missing values

In [34]:
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=list('abc'))
df.loc[::2, 'a'] = np.nan
df.loc[::3, 'b'] = np.nan
df

Unnamed: 0,a,b,c
0,,,6
1,2.0,7.0,1
2,,1.0,0
3,3.0,,1
4,,9.0,8


#### `coalesce` the first two columns

In [35]:
(df
    >> mutate(d = coalesce(df.a, df.b)))

Unnamed: 0,a,b,c,d
0,,,6,
1,2.0,7.0,1,2.0
2,,1.0,0,1.0
3,3.0,,1,3.0
4,,9.0,8,9.0


#### `coalesce` all three columns

In [36]:
(df
    >> mutate(d = coalesce(df.a, df.b, df.c)))

Unnamed: 0,a,b,c,d
0,,,6,6.0
1,2.0,7.0,1,2.0
2,,1.0,0,1.0
3,3.0,,1,3.0
4,,9.0,8,9.0


#### `coalesce` handles `dfply.Intention`s

In [37]:
(df
>> mutate(d = coalesce(X.a, X.b)))

Unnamed: 0,a,b,c,d
0,,,6,
1,2.0,7.0,1,2.0
2,,1.0,0,1.0
3,3.0,,1,3.0
4,,9.0,8,9.0


#### `coalesce` ignores unnecessary arguments

In [38]:
(df
 >> mutate(d = coalesce(X.a, df.b, df.c, 
                        X/0 # Ignored
                       )))

Unnamed: 0,a,b,c,d
0,,,6,6.0
1,2.0,7.0,1,2.0
2,,1.0,0,1.0
3,3.0,,1,3.0
4,,9.0,8,9.0
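
The behavior above, where a trailing argument that would crash is never evaluated, can be sketched in plain Python with zero-argument callables. The name `lazy_coalesce` is hypothetical, and using `None` for missing values and lists for columns is a simplification made for this sketch.

```python
def lazy_coalesce(*columns):
    """Hypothetical sketch: each argument is a zero-argument callable returning a list."""
    result = None
    for make_column in columns:
        # Stop as soon as nothing is missing; later arguments are never called.
        if result is not None and all(v is not None for v in result):
            break
        column = make_column()
        if result is None:
            result = list(column)
        else:
            # Keep existing values; fill only the gaps from the new column.
            result = [r if r is not None else c for r, c in zip(result, column)]
    return result

filled = lazy_coalesce(lambda: [None, 2, None],
                       lambda: [6, 7, 1],
                       lambda: 1 / 0)   # never called, so no ZeroDivisionError
print(filled)
```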
