# More on Piping, Intentions, and Column Expressions

In [1]:
import pandas as pd
from dfply import *
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
artists = pd.read_csv("./data/Artists.csv")
artwork = pd.read_csv("./data/Artworks.csv")

In [11]:
artists.head()

|  | ConstituentID | DisplayName | ArtistBio | Nationality | Gender | BeginDate | EndDate | Wiki QID | ULAN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Robert Arneson | American, 1930–1992 | American | Male | 1930 | 1992 |  |  |
| 1 | 2 | Doroteo Arnaiz | Spanish, born 1936 | Spanish | Male | 1936 | 0 |  |  |
| 2 | 3 | Bill Arnold | American, born 1941 | American | Male | 1941 | 0 |  |  |
| 3 | 4 | Charles Arnoldi | American, born 1946 | American | Male | 1946 | 0 | Q1063584 | 500027998.0 |
| 4 | 5 | Per Arnoldi | Danish, born 1941 | Danish | Male | 1941 | 0 |  |  |


In [8]:
artists['Nationality'].unique()

array(['American', 'Spanish', 'Danish', 'Italian', 'French', 'Estonian',
       'Mexican', 'Swedish', nan, 'Israeli', 'British', 'Finnish',
       'Polish', 'Palestinian', 'Japanese', 'Guatemalan', 'Colombian',
       'Romanian', 'Russian', 'German', 'Argentine', 'Kuwaiti', 'Various',
       'Belgian', 'Dutch', 'Norwegian', 'Nationality unknown', 'Chilean',
       'Swiss', 'Costa Rican', 'Czech', 'Brazilian', 'Austrian',
       'Canadian', 'Australian', 'Ukrainian', 'Hungarian', 'Haitian',
       'Congolese', 'Bolivian', 'Cuban', 'Yugoslav', 'Portuguese',
       'Indian', 'Icelandic', 'Irish', 'Guyanese', 'Uruguayan', 'Slovak',
       'Croatian', 'Greek', 'Peruvian', 'Chinese', 'Venezuelan', 'Turkish',
       'Panamanian', 'Algerian', 'Ecuadorian', 'South African', 'Iranian',
       'Korean', 'Canadian Inuit', 'Paraguayan', 'Luxembourgish',
       'Nicaraguan', 'Zimbabwean', 'Moroccan', 'Tanzanian', 'Bulgarian',
       'Tunisian', 'Sudanese', 'Taiwanese', 'Ethiopian', 'Slovenian',
    

In [20]:
# carried over from the last lecture
bad_lbls = (artists >> 
             filter_by(X.Nationality.str.lower().str.startswith('nation').astype('bool')) >>
             pull('Nationality')).unique()

In [21]:
bad_lbls

array([nan, 'Nationality unknown', 'Nationality Unknown',
       'nationality unknown'], dtype=object)

In [3]:
recode_bad_lbls = {old_lbl:'Nationality unknown' for old_lbl in bad_lbls}
replace_zero = {0:np.nan}  # np.nan also works on NumPy 2.x, where the np.NaN alias was removed

## Why do we love piping?

### Reason 1: Composition Baby!

It is very easy to put separate pipes together.

In [4]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID'))
artists_new = (artists >>
                mutate(Nationality = X.Nationality.replace(recode_bad_lbls)))
artists_new = (artists_renamed >>
                mutate(BeginDate = X.BeginDate.replace(replace_zero)))

## To compose separate pipes

1. Switch ending `)` to `>>`
2. Remove the next assignment
3. ??
4. Profit!

In [5]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >> #)
#artists_new = (artists >>
                mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >> #)
#artists_new = (artists_renamed >>
                mutate(BeginDate = X.BeginDate.replace(replace_zero)))

## End product ... full process in a single pipe

In [6]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    mutate(BeginDate = X.BeginDate.replace(replace_zero)))
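
Why does switching `)` to `>>` just work? The following is a minimal sketch of the mechanism in plain Python, assuming nothing about how `dfply` is written internally. `Step`, `drop_zeros`, and `double_all` are hypothetical names used only for illustration.

```python
# Minimal sketch of a pipeable step; `Step` is a hypothetical name, not part of dfply.
class Step:
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python calls this for `data >> step` when `data` itself
        # does not know how to handle `>>` with a Step.
        return self.func(data)


# Two hypothetical steps that operate on a plain list of numbers.
drop_zeros = Step(lambda xs: [x for x in xs if x != 0])
double_all = Step(lambda xs: [2 * x for x in xs])

values = [1, 0, 3]

# Separate pipes with an intermediate name ...
no_zeros = (values >>
            drop_zeros)
separate = (no_zeros >>
            double_all)

# ... and the same work written as a single pipe.
combined = (values >>
            drop_zeros >>
            double_all)

print(separate == combined)
```

Because `>>` groups left to right, each step receives the result of everything before it, so the intermediate name can be dropped without changing the result.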

## Why do we love piping?

### Reason 2: Textual Gravity!

A pipe clearly expresses the intention of our code by

1. Reading left-to-right and top-to-bottom
2. Putting the verbs up front

In [31]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    mutate(BeginDate = X.BeginDate.replace(replace_zero)))

## Why do we love piping?

### Reason 3: Easy debugging

Comments make it easy to debug a pipe.

## Debugging Step 1 - Start at the top

Use comments to remove all parts of the chain

*Don't forget the ending `)`*

In [32]:
artists_renamed = (artists ) #>>
                    #rename(Wiki_QID = 'Wiki QID') >>
                    #mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    #mutate(BeginDate = X.BeginDate.replace(replace_zero)))

## Debugging Step 2 - Work your way down the pipe

Add each part back in, one at a time, checking the results

*Don't forget the ending `)`*

In [33]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID')  #>>
                    #mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    #mutate(BeginDate = X.BeginDate.replace(replace_zero)))
                  )
artists_renamed.columns

Index(['ConstituentID', 'DisplayName', 'ArtistBio', 'Nationality', 'Gender',
       'BeginDate', 'EndDate', 'Wiki_QID', 'ULAN'],
      dtype='object')

In [34]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) ) #>>
                    #mutate(BeginDate = X.BeginDate.replace(replace_zero)))

In [35]:
artists_renamed = (artists >>
                    rename(Wiki_QID = 'Wiki QID') >>
                    mutate(Nationality = X.Nationality.replace(recode_bad_lbls)) >>
                    mutate(BeginDate = X.BeginDate.replace(replace_zero)))
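
The same step-by-step check can be sketched without any comments by applying the steps one at a time and keeping each intermediate result. This is only an illustration in plain Python: `rename_wiki` and `zero_to_none` are hypothetical stand-ins for the `rename` and `mutate` steps, and a dict of lists stands in for the data frame.

```python
# Hypothetical stand-ins for the pipe steps; each takes and returns a dict
# that maps column names to lists of values.
def rename_wiki(table):
    out = dict(table)                      # shallow copy, input stays untouched
    out["Wiki_QID"] = out.pop("Wiki QID")
    return out


def zero_to_none(table):
    out = dict(table)
    out["BeginDate"] = [None if v == 0 else v for v in out["BeginDate"]]
    return out


steps = [rename_wiki, zero_to_none]
table = {"Wiki QID": ["Q1", None], "BeginDate": [1930, 0]}

# Work down the pipe: apply one more step each time and keep the result.
stages = [table]
for step in steps:
    stages.append(step(stages[-1]))

# Inspect what each step produced.
for step, result in zip(steps, stages[1:]):
    print(step.__name__, "->", sorted(result))
```

If a stage looks wrong, the step that produced it is the place to start looking.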

# More about Intentions 

## `X` is an `Intention`

<img src="img/dfply_X_intention_1.png" width = 400>

Think of it as recording an expression for later evaluation

In [36]:
expr = X.BeginDate.head()
expr

<dfply.base.Intention at 0x124e64cc0>

## Use `evaluate` to apply the expression

We can apply an expression *later* using `evaluate` with a dataframe.

In [37]:
expr.evaluate(artists)

0    1930
1    1936
2    1941
3    1946
4    1941
Name: BeginDate, dtype: int64

## Intention expressions are reusable!

In [38]:
# Reusable!
expr.evaluate(artwork)


0    (1841)
1    (1944)
2    (1876)
3    (1944)
4    (1876)
Name: BeginDate, dtype: object
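
To make "recording an expression for later evaluation" concrete, here is a minimal sketch of the idea in plain Python. It is not how `dfply` implements `Intention`; `Deferred` and `Y` are hypothetical names, and the sketch only records attribute lookups and calls.

```python
# Minimal sketch of deferred evaluation; `Deferred` and `Y` are hypothetical names.
class Deferred:
    def __init__(self, func=lambda obj: obj):
        self._func = func  # maps the eventual input to a result

    def __getattr__(self, name):
        # Only reached for names not found normally: record the lookup for later.
        return Deferred(lambda obj: getattr(self._func(obj), name))

    def __call__(self, *args, **kwargs):
        # Record a call to perform later on whatever the lookup returns.
        return Deferred(lambda obj: self._func(obj)(*args, **kwargs))

    def evaluate(self, obj):
        # Replay the recorded steps against a concrete object.
        return self._func(obj)


Y = Deferred()
expr = Y.strip().upper()            # nothing runs yet, the steps are only recorded

print(expr.evaluate("  moma "))     # replay on one input
print(expr.evaluate(" art"))        # and reuse on another
```

The expression holds no reference to any particular input, which is why the same `expr` can be evaluated against different objects.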

## <font color="red"> Exercise 1 </font>
    
**Task:** Use `X` to create an expression that replaces spaces in column names with underscores.  Apply this expression to fresh instances of `artists` and `artwork`.

In [39]:
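# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern literally by default
expr1 = X.columns.str.replace(' ', '_').str.replace('[().]', '', regex=True).str.lower()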

In [40]:
expr1.evaluate(artists)

Index(['constituentid', 'displayname', 'artistbio', 'nationality', 'gender',
       'begindate', 'enddate', 'wiki_qid', 'ulan'],
      dtype='object')

In [41]:
expr1.evaluate(artwork)

Index(['title', 'artist', 'constituentid', 'artistbio', 'nationality',
       'begindate', 'enddate', 'gender', 'date', 'medium', 'dimensions',
       'creditline', 'accessionnumber', 'classification', 'department',
       'dateacquired', 'cataloged', 'objectid', 'url', 'thumbnailurl',
       'circumference_cm', 'depth_cm', 'diameter_cm', 'height_cm', 'length_cm',
       'weight_kg', 'width_cm', 'seat_height_cm', 'duration_sec'],
      dtype='object')

## Not just for data frames ... works for any* expression

In [42]:
double, line = 2*X, 3*X + 5

In [43]:
double.evaluate(3), line.evaluate(6)

(6, 23)
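
Arithmetic such as `2*X` can be deferred because Python lets a class define what `*` and `+` mean. A minimal sketch of that idea follows, assuming nothing about `dfply` internals; `Deferred` and `Y` are hypothetical names and only `+` and `*` are covered.

```python
import operator


# Minimal sketch of deferred arithmetic; `Deferred` and `Y` are hypothetical names.
class Deferred:
    def __init__(self, func=lambda value: value):
        self._func = func

    def _binary(self, op, other, reflected=False):
        def run(value):
            mine = self._func(value)
            theirs = other._func(value) if isinstance(other, Deferred) else other
            # `reflected` means the Deferred was the right-hand operand.
            return op(theirs, mine) if reflected else op(mine, theirs)
        return Deferred(run)

    def __add__(self, other):
        return self._binary(operator.add, other)

    def __radd__(self, other):
        return self._binary(operator.add, other, reflected=True)

    def __mul__(self, other):
        return self._binary(operator.mul, other)

    def __rmul__(self, other):
        return self._binary(operator.mul, other, reflected=True)

    def evaluate(self, value):
        return self._func(value)


Y = Deferred()
double, line = 2*Y, 3*Y + 5         # builds expressions, computes nothing yet

print(double.evaluate(3), line.evaluate(6))
```

`2*Y` reaches `__rmul__` because `int` does not know how to multiply by a `Deferred`, so Python falls back to the right-hand operand.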

## We can make functions lazy too!

Decorate a function with `make_symbolic` to allow lazy evaluation of `Intention` objects

In [44]:
from math import log
log = make_symbolic(log)

In [45]:
log(8, 2) # Works as expected with numbers

3.0

## Passing in `X` now makes a `log` expression

In [46]:
expr = log(X, 2) # Passing in X makes it lazy/symbolic
expr

<dfply.base.Intention at 0x11478e668>

In [47]:
expr.evaluate(16) # Evaluate later

4.0
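
A decorator in the spirit of `make_symbolic` can be sketched in a few lines: call the function right away for ordinary arguments, and return a deferred expression when any argument is a placeholder. This is an illustration only, not `dfply`'s implementation; `Deferred`, `lazy_if_deferred`, `lazy_log`, and `Y` are hypothetical names.

```python
from functools import wraps
from math import log


class Deferred:
    """Hypothetical placeholder for a value that is supplied later."""

    def __init__(self, func=lambda value: value):
        self._func = func

    def evaluate(self, value):
        return self._func(value)


def lazy_if_deferred(func):
    """Hypothetical decorator: defer the call when any argument is Deferred."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if not any(isinstance(a, Deferred) for a in (*args, *kwargs.values())):
            return func(*args, **kwargs)    # ordinary arguments: call right away

        def run(value):
            # Swap each placeholder for its concrete value, then call for real.
            def resolve(a):
                return a.evaluate(value) if isinstance(a, Deferred) else a
            return func(*(resolve(a) for a in args),
                        **{k: resolve(v) for k, v in kwargs.items()})
        return Deferred(run)
    return wrapper


lazy_log = lazy_if_deferred(log)
Y = Deferred()

print(lazy_log(8, 2))       # plain numbers: evaluated immediately
expr = lazy_log(Y, 2)       # placeholder: returns a Deferred instead
print(expr.evaluate(16))    # evaluated later
```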

## `pyspark.sql` set up

In [49]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, column, mean

spark1 = SparkSession.builder.appName('Ops').getOrCreate()

## `pyspark.sql` columns also generate expressions

In [50]:
column('height')

Column<b'height'>

In [51]:
column('height') > 3

Column<b'(height > 3)'>

## `col == column`

In [52]:
5*col('height') + 2 # col is an alias for column

Column<b'((height * 5) + 2)'>

## Column expressions can combine columns

In [42]:
col('height') + col('weight')

Column<b'(height + weight)'>

## Columns work with other `pyspark.sql.functions`

In [43]:
mean(col('height'))

Column<b'avg(height)'>
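
The outputs above show that a column expression describes a computation rather than performing it. A toy sketch of that behaviour in plain Python follows. `Col` and `avg` are hypothetical stand-ins that only build text; they are not the `pyspark` classes, and the operand order in the text can differ from what `pyspark` prints.

```python
class Col:
    """Hypothetical stand-in for a column expression; it stores SQL-like text only."""

    def __init__(self, text):
        self.text = text

    def _combine(self, symbol, other, reflected=False):
        other_text = other.text if isinstance(other, Col) else repr(other)
        left, right = (other_text, self.text) if reflected else (self.text, other_text)
        return Col(f"({left} {symbol} {right})")

    def __gt__(self, other):
        return self._combine(">", other)

    def __add__(self, other):
        return self._combine("+", other)

    def __radd__(self, other):
        return self._combine("+", other, reflected=True)

    def __mul__(self, other):
        return self._combine("*", other)

    def __rmul__(self, other):
        return self._combine("*", other, reflected=True)

    def __repr__(self):
        return f"Col<{self.text}>"


def avg(column):
    # Functions wrap an existing expression in a new one.
    return Col(f"avg({column.text})")


print(Col("height") > 3)
print(5 * Col("height") + 2)
print(Col("height") + Col("weight"))
print(avg(Col("height")))
```

No data is touched at any point; the expression is only a description that an engine can run later.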

## `sqlalchemy` columns generate expressions too

In [53]:
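from sqlalchemy import func as f
f.column('height') # note: func builds a SQL function call named "column"; sqlalchemy.column('height') creates a true column clause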

<sqlalchemy.sql.functions.Function at 0x114a4f0b8; column>

In [67]:
f.column('height') > 3

<sqlalchemy.sql.elements.BinaryExpression object at 0x10f73a6d8>

## `f.col` builds an expression the same way

In [68]:
5*f.col('height') + 2 # func accepts any name, so f.col is a separate SQL function call, not an alias for f.column

<sqlalchemy.sql.elements.BinaryExpression object at 0x10f752b70>

## Column expressions can combine columns

In [69]:
f.col('height') + f.col('weight')

<sqlalchemy.sql.elements.BinaryExpression object at 0x10f752ac8>

## Columns work with other `sqlalchemy` functions

In [71]:
f.avg(f.col('height'))

<sqlalchemy.sql.functions.Function at 0x10f758630; avg>

## Up Next

Now we will continue on to [Lecture 2.4](./2_4_pandas_dtypes.ipynb).
