**Introduction**
+ learning about 'apply' is fundamental to the data cleaning process
+ apply takes a function and "applies" it (runs it) across each row or column of a dataframe
+ it is similar to writing a 'for loop' across each row or column and calling the function
    + 'apply' handles the looping for you, so there is no need to write the loop explicitly
+ in general, this is a preferred way to apply functions across dataframes - it is more concise than a hand-written for loop, and often faster
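A minimal sketch of the comparison described above, assuming a small made-up dataframe and a hypothetical helper named `double` (neither is part of the original notes). The explicit loop and `apply` should give the same values:

```python
import pandas as pd

# small illustrative dataframe (values are made up)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def double(x):
    """Hypothetical helper: doubles a value."""
    return x * 2

# explicit for loop over the values of one column
looped = []
for value in df['a']:
    looped.append(double(value))

# the same work expressed with apply
applied = df['a'].apply(double)

# both approaches produce the same values
print(looped == applied.tolist())  # True
```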

**Functions**
+ Functions are core elements of writing 'apply' statements
+ Functions are a way to group and reuse Python code
+ if you are ever copying/pasting code and changing only a few parts of it, there is a good chance the copied code can be written as a function
+ to create a function, you need to 'define' it with the def keyword

In [2]:
# def my_function():
    # indent 4 spaces
    # function code here

In [4]:
def my_sq(x):
    """squares a given value
    """
    return x ** 2

def avg_2(x, y):
    """Calculates the average of 2 numbers
    """
    return (x + y)/2

In [5]:
# the text within the triple quotes is the 'docstring':
# the text that appears when you look up the help documentation for a function

print(my_sq(4))

16


In [6]:
print(avg_2(10, 20))

15.0
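As a small illustration of the docstring remark, the sketch below redefines `my_sq` so it runs on its own, then looks up its documentation in two standard ways:

```python
def my_sq(x):
    """squares a given value
    """
    return x ** 2

# the docstring is stored on the function object itself
print(my_sq.__doc__)

# help() prints the signature together with the docstring
help(my_sq)
```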


**9.3 Apply (basics)**
+ when working with dataframes, it's more likely that you will want to use a function across rows or columns of data

In [9]:
import pandas as pd
df = pd.DataFrame({'a': [10, 20, 30],'b': [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40


In [10]:
# can then apply functions over a Series (an individual column or row)
# for example, square the 'a' column directly with the ** operator
print(df['a']**2)

0    100
1    400
2    900
Name: a, dtype: int64


**9.3.1 Apply over a Series**
+ if you subset a single column or row, the type of object you get back is a Pandas Series


In [11]:
# get the first column
print(type(df['a']))

<class 'pandas.core.series.Series'>


In [12]:
# get the first row
print(type(df.iloc[0]))

<class 'pandas.core.series.Series'>


In [14]:
# the Series has a method called 'apply'
# to use it, pass in the function to run on each element in the Series

# apply our square function on the 'a' column
sq = df['a'].apply(my_sq)
print(sq)

0    100
1    400
2    900
Name: a, dtype: int64


In [16]:
# writing a function with 2 parameters - value, and then the exponent to raise the value to
def my_exp(x, e):
    return x ** e

cb = my_exp(2, 3)
print(cb)

8


In [17]:
# but, to apply the function to the series, need to pass in the 2nd parameter
# so pass in the 2nd argument as a keyword argument

ex = df['a'].apply(my_exp, e=2)
print(ex)

0    100
1    400
2    900
Name: a, dtype: int64


In [19]:
ex = df['a'].apply(my_exp, e=3)
print(ex)

0     1000
1     8000
2    27000
Name: a, dtype: int64
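Besides keyword arguments, `Series.apply` also accepts extra positional arguments through its `args` parameter. A minimal sketch, reusing the same small dataframe and `my_exp` so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def my_exp(x, e):
    return x ** e

# keyword form, as in the cells above
by_keyword = df['a'].apply(my_exp, e=3)

# positional form: extra arguments go in a tuple passed to `args`
by_args = df['a'].apply(my_exp, args=(3,))

print(by_args.tolist())  # [1000, 8000, 27000]
```

The keyword form is usually easier to read, since it names the parameter being filled in.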


**Apply over a Dataframe**

In [20]:
df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40


In [21]:
# dataframes have 2 dimensions
# when applying a function over a df, need to 1st specify which axis to apply the function over, column or rows
def print_me(x):
    print(x)

In [22]:
# let's apply this to the df
# the syntax is similar to using apply on a Series, but this time need to define the axis
df.apply(print_me, axis=0)

0    10
1    20
2    30
Name: a, dtype: int64
0    20
1    30
2    40
Name: b, dtype: int64


a    None
b    None
dtype: object
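The trailing `None` values appear because `print_me` returns nothing, so `apply` collects one `None` per column. The sketch below uses a hypothetical helper, `col_max` (not part of the original notes), to show that each call receives a whole column as a Series and that the return values are gathered into a new Series:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def col_max(col):
    """Hypothetical helper: each call receives one whole column as a Series."""
    return col.max()

# axis=0 runs the function once per column
result = df.apply(col_max, axis=0)
print(result)
```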

In [23]:
print(df['a'])

0    10
1    20
2    30
Name: a, dtype: int64


In [24]:
print(df['b'])

0    20
1    30
2    40
Name: b, dtype: int64


In [25]:
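def avg_3(x, y, z):
    return (x + y + z) / 3

# will cause an error: apply passes each whole column as a single argument,
# but avg_3 expects 3 separate values
# print(df.apply(avg_3))

def avg_3_apply(col):
    # unpack the 3 values of the column (the index labels here are 0, 1, 2)
    x = col[0]
    y = col[1]
    z = col[2]
    return (x + y + z) / 3
print(df.apply(avg_3_apply))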

a    20.0
b    30.0
dtype: float64
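`avg_3_apply` only works for columns with exactly 3 values. One possible generalization, sketched here with a hypothetical function name `avg_apply`, uses the Series' own `sum` and `size` so the column length no longer matters:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def avg_apply(col):
    """Hypothetical generalization: average of a column of any length."""
    return col.sum() / col.size

col_avg = df.apply(avg_apply)
print(col_avg.tolist())  # [20.0, 30.0]
```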


In [26]:
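# Row-wise operations
# now use axis=1, so the function receives one row at a time

def avg_2_apply(row):
    # select by position, since each row is labeled 'a' and 'b'
    x = row.iloc[0]
    y = row.iloc[1]
    return (x + y) / 2
print(df.apply(avg_2_apply, axis=1))

0    15.0
1    25.0
2    35.0
dtype: float64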
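With `axis=1` each row arrives as a Series whose labels are the column names, so the values can also be picked out by name. A minimal sketch, using a hypothetical function name `avg_2_row`:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def avg_2_row(row):
    """Hypothetical row-wise average that looks up values by column label."""
    return (row['a'] + row['b']) / 2

# axis=1 runs the function once per row
row_avg = df.apply(avg_2_row, axis=1)
print(row_avg.tolist())  # [15.0, 25.0, 35.0]
```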


**Apply (More Advanced)**
+ the seaborn library has a built-in Titanic data set

In [27]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
print(titanic.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
None


In [31]:
# one way to use apply, is to calculate how many 'null' or 'NaN' values there are in the data

# 1. Number of missing values
# use the numpy sum function
import numpy as np

def count_missing(vec):
    """Counts the number of missing values in a vector
    """
    
    # get a vector of True/False values
    # depending whether the value is missing
    null_vec = pd.isnull(vec)
    
    # take the sum of the null_vec
    # True counts as 1 and False as 0, so the sum is the number of missing values
    null_count = np.sum(null_vec)
    
    # return the number of missing values in the vector
    return null_count

# 2. Proportion of missing values
def prop_missing(vec):
    """Proportion of missing values in a vector
    """
    # numerator: number of missing values
    # can use the count_missing function
    num = count_missing(vec)
    
    # denominator: total number of values in the vector
    # vec.size includes the missing values as well
    dem = vec.size
    
    # return the proportion/% of missing
    return num / dem

# 3. Proportion of complete values
def prop_complete(vec):
    """Proportion of nonmissing values in a vector
    """
    # we can utilize the prop_missing function
    # by subtracting its value from 1
    return 1 - prop_missing(vec)

In [32]:
cmis_col = titanic.apply(count_missing)

pmis_col = titanic.apply(prop_missing)

pcom_col = titanic.apply(prop_complete)

print(cmis_col)

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


In [33]:
print(pmis_col)

survived       0.000000
pclass         0.000000
sex            0.000000
age            0.198653
sibsp          0.000000
parch          0.000000
fare           0.000000
embarked       0.002245
class          0.000000
who            0.000000
adult_male     0.000000
deck           0.772166
embark_town    0.002245
alive          0.000000
alone          0.000000
dtype: float64


In [34]:
print(pcom_col)

survived       1.000000
pclass         1.000000
sex            1.000000
age            0.801347
sibsp          1.000000
parch          1.000000
fare           1.000000
embarked       0.997755
class          1.000000
who            1.000000
adult_male     1.000000
deck           0.227834
embark_town    0.997755
alive          1.000000
alone          1.000000
dtype: float64
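For comparison, pandas' built-in `isnull`, `notnull`, `sum` and `mean` methods should produce the same kind of summaries without `apply`. The sketch below uses a small made-up dataframe (`toy`) in place of the Titanic data so that it runs without downloading anything:

```python
import numpy as np
import pandas as pd

# made-up data standing in for a few Titanic columns
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

# number of missing values per column
n_missing = toy.isnull().sum()

# proportion missing: the mean of True/False values is the share of True
p_missing = toy.isnull().mean()

# proportion complete
p_complete = toy.notnull().mean()

print(n_missing.tolist())   # [2, 3, 0]
print(p_missing.tolist())   # [0.5, 0.75, 0.0]
print(p_complete.tolist())  # [0.5, 0.25, 1.0]
```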


In [35]:
# since we have counts of missing values, can judge whether a column is viable to use
# here, look at the 2 rows where embark_town is missing
print(titanic.loc[pd.isnull(titanic.embark_town), :])

     survived  pclass     sex   age  sibsp  parch  fare embarked  class  \
61          1       1  female  38.0      0      0  80.0      NaN  First   
829         1       1  female  62.0      0      0  80.0      NaN  First   

       who  adult_male deck embark_town alive  alone  
61   woman       False    B         NaN   yes   True  
829  woman       False    B         NaN   yes   True  


**Row-wise Operations**
+ since the functions are vectorized, can apply them across rows of data without changing them


In [36]:
cmis_row = titanic.apply(count_missing, axis=1)
pmis_row = titanic.apply(prop_missing, axis=1)
pcom_row = titanic.apply(prop_complete, axis=1)
print(cmis_row.head())

0    1
1    0
2    1
3    0
4    1
dtype: int64


In [37]:
print(pmis_row.head())

0    0.066667
1    0.000000
2    0.066667
3    0.000000
4    0.066667
dtype: float64


In [39]:
print(pcom_row.head())

0    0.933333
1    1.000000
2    0.933333
3    1.000000
4    0.933333
dtype: float64


In [40]:
# can now see whether any rows in the data have multiple missing values
print(cmis_row.value_counts())

1    549
0    182
2    160
dtype: int64
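The same row-wise count can be sketched with built-in methods by summing the `isnull` mask along `axis=1`. The dataframe `toy` below is made up for illustration and is not the Titanic data:

```python
import numpy as np
import pandas as pd

# made-up data with a different number of gaps per row
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

# count missing values in each row
row_missing = toy.isnull().sum(axis=1)
print(row_missing.tolist())  # [1, 1, 1, 2]

# keep only the rows with more than one missing value
multi = toy.loc[row_missing > 1, :]
print(multi.index.tolist())  # [3]
```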


In [41]:
# we can create a new column containing these values
titanic['num_missing'] = titanic.apply(count_missing, axis=1)
print(titanic.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  num_missing  
0    man        True  NaN  Southampton    no  False            1  
1  woman       False    C    Cherbourg   yes  False            0  
2  woman       False  NaN  Southampton   yes   True            1  
3  woman       False    C  Southampton   yes  False            0  
4    man        True  NaN  Southampton    no   True            1  


In [42]:
# can then look at the rows with multiple missing data
print(titanic.loc[titanic.num_missing > 1, :].sample(10))

     survived  pclass     sex  age  sibsp  parch      fare embarked  class  \
495         0       3    male  NaN      0      0   14.4583        C  Third   
306         1       1  female  NaN      0      0  110.8833        C  First   
825         0       3    male  NaN      0      0    6.9500        Q  Third   
260         0       3    male  NaN      0      0    7.7500        Q  Third   
692         1       3    male  NaN      0      0   56.4958        S  Third   
497         0       3    male  NaN      0      0   15.1000        S  Third   
533         1       3  female  NaN      0      2   22.3583        C  Third   
612         1       3  female  NaN      1      0   15.5000        Q  Third   
680         0       3  female  NaN      0      0    8.1375        Q  Third   
48          0       3    male  NaN      2      0   21.6792        C  Third   

       who  adult_male deck  embark_town alive  alone  num_missing  
495    man        True  NaN    Cherbourg    no   True            2  
306

**9.5 Vectorized Functions**

In [43]:
# there are times when it is not feasible to rewrite a function
# you can leverage the 'vectorize' function and decorator, to vectorize any function
df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40


In [44]:
# here is the average function, which can apply on a row-by-row basis
def avg_2(x, y):
    return (x + y) / 2

print(avg_2(df['a'], df['b']))

0    15.0
1    25.0
2    35.0
dtype: float64


In [46]:
# let's perform a non-vectorizable calculation
import numpy as np
def avg_2_mod(x, y):
    """Calculate the average, unless x is 20
    """
    if (x == 20):
        return(np.nan)
    else:
        return (x + y) / 2
# passing whole columns, e.g. avg_2_mod(df['a'], df['b']), will cause an error,
# because `x == 20` is ambiguous for a Series
# with single values, the function works

print(avg_2_mod(10, 20))

15.0
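To see the failure without stopping the notebook, the call with whole columns can be wrapped in `try`/`except`. This is a minimal sketch; the variable name `error_message` is only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def avg_2_mod(x, y):
    """Calculate the average, unless x is 20"""
    if x == 20:
        return np.nan
    else:
        return (x + y) / 2

error_message = None
try:
    # x == 20 yields a Series of True/False, which `if` cannot reduce to one value
    avg_2_mod(df['a'], df['b'])
except ValueError as err:
    error_message = str(err)
    print(error_message)
```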


**9.5.1 Using Numpy**
+ pass the function to np.vectorize

In [47]:
# np.vectorize actually creates a new function
avg_2_mod_vec = np.vectorize(avg_2_mod)
print(avg_2_mod_vec(df['a'], df['b']))

[15. nan 35.]


In [48]:
# can use a Python decorator to vectorize a function
# to use the vectorize decorator
# we use the @ symbol before our function definition
@np.vectorize
def v_avg_2_mod(x, y):
    """Calculate the average, unless x=20
    Same as before, but using the vectorize decorator
    """
    if (x == 20):
        return(np.nan)
    else:
        return (x + y) / 2

# can then directly use the vectorized function
# without having to create a new function
print(v_avg_2_mod(df['a'],df['b']))

[15. nan 35.]
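For a simple condition like this one, `np.where` is another option that avoids the Python-level `if` entirely. This is an alternative technique, not the approach used in the cells above, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

# where a is 20 use NaN, otherwise use the average of a and b
avg = np.where(df['a'] == 20, np.nan, (df['a'] + df['b']) / 2)

# np.where returns a NumPy array; wrap it to get a Series with the original index
avg_series = pd.Series(avg, index=df.index)
print(avg_series)
```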


**Using numba**
+ the numba library is designed to optimize Python code - it has a vectorize decorator

In [55]:
import numba
@numba.vectorize
def v_avg_2_numba(x, y):
    """Calculate the average, unless x is 20
    Use the numba decorator
    """
    # we now have to add type info to our function, so cast x with int()
    if(int(x)== 20):
        return(np.nan)
    else:
        return (x + y) / 2
print(v_avg_2_numba(df['a'].values, df['b'].values))

[15. nan 35.]


In [56]:
# so need to pass in the numpy array part of the data (the .values attribute), not the Series itself
print(v_avg_2_numba(df['a'].values, df['b'].values))

[15. nan 35.]


**9.6 Lambda Functions**
+ sometimes the function used in apply is simple enough that there is no need to create a separate function


In [57]:
docs = pd.read_csv('/Users/BrendanErhard/Desktop/pandas_for_everyone-master/data/doctors.csv', header=None)

In [62]:
# can write a pattern that matches the first two words of each row (the name), and assign those values to a new 'name' column

import re
p = re.compile(r'\w+\s+\w+')

def get_name(s):
    return p.match(s).group()

docs['name_func'] = docs[0].apply(get_name)
print(docs)

                               0              name_func
0     William Hartnell (1963-66)       William Hartnell
1    Patrick Troughton (1966-69)      Patrick Troughton
2          Jon Pertwee (1970 74)            Jon Pertwee
3            Tom Baker (1974-81)              Tom Baker
4        Peter Davison (1982-84)          Peter Davison
5          Colin Baker (1984-86)            Colin Baker
6      Sylvester McCoy (1987-89)        Sylvester McCoy
7             Paul McGann (1996)            Paul McGann
8   Christopher Eccleston (2005)  Christopher Eccleston
9        David Tennant (2005-10)          David Tennant
10          Matt Smith (2010-13)             Matt Smith
11     Peter Capaldi (2014-2017)          Peter Capaldi
12        Jodie Whittaker (2017)        Jodie Whittaker


In [63]:
# the actual function is a simple one-liner
# people often opt to write the one-liner directly in the apply method
# this is known as using lambda functions
docs['name_lamb'] = docs[0].apply(lambda x: p.match(x).group())
print(docs)

                               0              name_func              name_lamb
0     William Hartnell (1963-66)       William Hartnell       William Hartnell
1    Patrick Troughton (1966-69)      Patrick Troughton      Patrick Troughton
2          Jon Pertwee (1970 74)            Jon Pertwee            Jon Pertwee
3            Tom Baker (1974-81)              Tom Baker              Tom Baker
4        Peter Davison (1982-84)          Peter Davison          Peter Davison
5          Colin Baker (1984-86)            Colin Baker            Colin Baker
6      Sylvester McCoy (1987-89)        Sylvester McCoy        Sylvester McCoy
7             Paul McGann (1996)            Paul McGann            Paul McGann
8   Christopher Eccleston (2005)  Christopher Eccleston  Christopher Eccleston
9        David Tennant (2005-10)          David Tennant          David Tennant
10          Matt Smith (2010-13)             Matt Smith             Matt Smith
11     Peter Capaldi (2014-2017)          Peter Capa