 # Functions

## Defining Functions in Python

Defining a function in Python takes only the `def` keyword, a name, a parameter list, and an indented body:

In [1]:
def my_first_function(input):
    output = 2*input
    return output

To use the function, call it by name anywhere after it has been defined:

In [2]:
my_first_function(5)

10

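Note the indentation when writing functions. The function body is everything indented beneath the `def` line, and the function ends at the first line that returns to the previous indentation level. Leaving a blank line after the body is conventional but not required in a script or notebook cell.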

In [3]:
def my_second_function(x):
    y = int(x/2)
    return y 

my_second_function(12)

6

### _Docstrings_

We can write documentation for our functions, called a _docstring_, by writing a string on the line immediately after the `def` statement:

In [7]:
def my_documented_function(input1, input2):
    """
    This function converts inputs to string, concatenates the input, then reverses the order.
    """
    in1 = str(input1)
    in2 = str(input2)
    combi = in1+in2
    rev = combi[::-1]
    output = rev
    return output

my_documented_function(43214, 'cats')

'stac41234'

In [9]:
my_documented_function?
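
The `?` suffix above works only in IPython and Jupyter. As a minimal sketch of how the same docstring can be reached in plain Python, the example below uses a hypothetical function `double` (not part of the original notebook) together with the built-in `help()`, the `__doc__` attribute, and `inspect.getdoc()`:

```python
import inspect


def double(value):
    """Return twice the given value."""
    return 2 * value


# The raw docstring is stored on the function object itself.
print(double.__doc__)

# inspect.getdoc() returns the docstring with indentation cleaned up,
# which matters for multi-line docstrings.
print(inspect.getdoc(double))

# help() prints the signature together with the docstring.
help(double)
```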

### _Default Values_

- We can give arguments default values when defining a function.
- Arguments taking default values become optional arguments; we do not have to pass a value each time we call the function.


In [None]:
def function_with_defaults(x, replace=" ", val=" "):
    "Casts input to string, replaces `replace` with `val`, prints result. Returns None."
    y = str(x)
    y = y.replace(replace, val)
    print(y)

function_with_defaults("1. I love dogs.")
function_with_defaults("2. I love dogs.", " dogs.")
function_with_defaults("3. I love dogs.", val="_")
function_with_defaults("4. I love dogs.", "d", "b")
function_with_defaults("5. I love dogs.", "dog", "cat")
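
To make the effect of each call easier to check, here is a sketch of a variant that returns the string instead of printing it. The name `replace_in_text` is hypothetical; the logic mirrors `function_with_defaults` above.

```python
def replace_in_text(x, replace=" ", val=" "):
    """Cast x to a string and return it with `replace` swapped for `val`."""
    return str(x).replace(replace, val)


# No optional arguments: a space is replaced by a space, so nothing changes.
print(replace_in_text("1. I love dogs."))            # 1. I love dogs.

# The second positional argument fills `replace`; `val` keeps its default " ".
# The result ends with a single trailing space.
print(replace_in_text("2. I love dogs.", " dogs."))

# A keyword argument lets us skip `replace` and set only `val`.
print(replace_in_text("3. I love dogs.", val="_"))   # 3._I_love_dogs.

# Both optional arguments passed positionally.
print(replace_in_text("4. I love dogs.", "d", "b"))  # 4. I love bogs.
```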

### Namespaces

Recall the difference between local and global namespace.

- Variables defined within a function are not accessible outside the function.
- When a variable is referenced within a function, Python first checks whether it is defined locally, then whether it is defined globally.

In [None]:
def function_with_local():
    some_local_variable = 12
    
print(some_local_variable)
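
The cell above is expected to fail with a `NameError`, because `some_local_variable` exists only inside the function. A minimal sketch of the same situation, with the error caught so the script keeps running, and with the value handed back through `return` (an addition for illustration):

```python
def function_with_local():
    some_local_variable = 12
    return some_local_variable  # returning is how a local value leaves the function


try:
    # The name was never defined in the global namespace.
    print(some_local_variable)
except NameError as err:
    message = str(err)
    print(f"NameError: {message}")

# The value itself is still available through the return value.
value = function_with_local()
print(value)  # 12
```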

In [10]:
a = "Global A"

def function1():
    print(a)
    
def function2():
    a = "Local A"
    print(a)
    
def function3(a):
    print(a)

function1()
function2()
function3("Argument A")

Global A
Local A
Argument A


### Namespaces Take Away

When defining variables within the global environment, use _unique, specific and informative names_. Within functions, generic names are fine as long as they describe what the argument or variable does.

# Applying Functions to Vectors

We go over a variety of ways in which you may apply a function to a `pandas.Series` or `pandas.DataFrame`.

- Transformations:
    - Element-wise Operations
    - Cumulative Operations
- Summaries:
    - Point Summaries
    - Grouped Summaries

## Element-wise Operations on a Series

We can use the `pd.Series.apply()` method to apply a function element-wise to a pandas Series.

In [11]:
import pandas as pd

ser = pd.Series(range(0, 12, 2)) # range(start, stop, step)
ser

0     0
1     2
2     4
3     6
4     8
5    10
dtype: int64

In [12]:
def square(x):
    y = x**2
    return y

ser.apply(square)

0      0
1      4
2     16
3     36
4     64
5    100
dtype: int64

In [13]:
def exponentiate(x, e):
    y = x**e
    return y

ser.apply(lambda x: exponentiate(x, 3))

0       0
1       8
2      64
3     216
4     512
5    1000
dtype: int64

In [14]:
ser.apply(lambda x: x**3) 

0       0
1       8
2      64
3     216
4     512
5    1000
dtype: int64

In [15]:
e = 1/2
ser.apply(lambda x: x**e)

0    0.000000
1    1.414214
2    2.000000
3    2.449490
4    2.828427
5    3.162278
dtype: float64
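
As an alternative to wrapping a two-argument function in a `lambda`, `Series.apply` can forward extra arguments itself. The sketch below repeats `exponentiate` and `ser` from the cells above so that it runs on its own; the variable names for the results are illustrative.

```python
import pandas as pd


def exponentiate(x, e):
    return x**e


ser = pd.Series(range(0, 12, 2))

# Extra positional arguments go in `args`; keyword arguments are forwarded too.
cubed_args = ser.apply(exponentiate, args=(3,))
cubed_kwargs = ser.apply(exponentiate, e=3)

# For simple arithmetic, operating on the Series directly is vectorised
# and gives the same values without calling a Python function per element.
cubed_vectorised = ser**3

print(cubed_args.tolist())  # [0, 8, 64, 216, 512, 1000]
```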

## Cumulative Operations on a Series

For cumulative operations, we can use either one of the built-in `cum*` methods (such as `cumsum`, `cumprod`, `cummax` and `cummin`) or the more general `pd.Series.expanding` method.

In [16]:
ser.cumsum()

0     0
1     2
2     6
3    12
4    20
5    30
dtype: int64

In [17]:
ser.expanding()

Expanding [min_periods=1,center=False,axis=0]

In [18]:
ser.expanding().sum()

0     0.0
1     2.0
2     6.0
3    12.0
4    20.0
5    30.0
dtype: float64

In [19]:
ser.expanding(2).sum() # The first argument of `expanding` sets the minimum number of observations (`min_periods`).

0     NaN
1     2.0
2     6.0
3    12.0
4    20.0
5    30.0
dtype: float64
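
The difference between the two approaches is that `expanding()` accepts any aggregation, not only the ones with a dedicated `cum*` method. A short sketch, reusing the same `ser` as above (the result names are illustrative):

```python
import pandas as pd

ser = pd.Series(range(0, 12, 2))

# The cum* methods cover the common cases.
running_total = ser.cumsum()
running_max = ser.cummax()

# expanding() works with other aggregations too, for example a running mean.
running_mean = ser.expanding().mean()

# expanding().sum() matches cumsum() in value, but returns floats.
same_values = bool((ser.expanding().sum() == running_total).all())

print(running_mean.tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(same_values)            # True
```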

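We can also use `apply` with a DataFrame. In this case the function receives a whole column at a time (`axis=0`, the default) or a whole row at a time (`axis=1`), rather than a single value. To apply a function to every individual cell, use `applymap`.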

In [20]:
import numpy as np  # `pd.np` is no longer available in recent pandas versions

df = pd.DataFrame({
    'col1': np.random.randint(-100, 100, 5),
    'col2': np.random.randint(-100, 100, 5),
    'col3': np.random.randint(-100, 100, 5)
})
df.head()

|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | -35  | -76  | -20  |
| 1 | -20  | -8   | -39  |
| 2 | 63   | 9    | 7    |
| 3 | 32   | 37   | -46  |
| 4 | -78  | -99  | -100 |


In [21]:
df.apply(lambda x: x.mean(), axis=0)

col1    -7.6
col2   -27.4
col3   -39.6
dtype: float64

In [22]:
df.apply(lambda x: x.sum(), axis=1)

0   -131
1    -67
2     79
3     23
4   -277
dtype: int64

In [23]:
# In pandas 2.1 and later, `DataFrame.map` is the preferred name for `applymap`.
df.applymap(lambda x: abs(x)**0.5)

|   | col1     | col2     | col3     |
|---|----------|----------|----------|
| 0 | 5.91608  | 8.717798 | 4.472136 |
| 1 | 4.472136 | 2.828427 | 6.244998 |
| 2 | 7.937254 | 3.0      | 2.645751 |
| 3 | 5.656854 | 6.082763 | 6.78233  |
| 4 | 8.831761 | 9.949874 | 10.0     |
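
Because the frame above is random, the direction of `axis` can be hard to verify by eye. The sketch below uses a small fixed DataFrame (`small_df`, a hypothetical example) where the column and row sums are easy to check:

```python
import pandas as pd

small_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column in turn, giving one result per column.
col_sums = small_df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row in turn, giving one result per row.
row_sums = small_df.apply(lambda row: row.sum(), axis=1)

print(col_sums.to_dict())  # {'a': 6, 'b': 60}
print(row_sums.tolist())   # [11, 22, 33]
```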


## Point Summaries

We have already looked at a number of point summary functions in the previous week.

- `pd.Series.mean()`
- `pd.Series.sum()`

We do not spend more time on them here.

## Grouped Summaries

The syntax for group summaries is [explained in detail in the lecture](https://muhark.github.io/dpir-intro-python/Week3/lecture.html#/groupby-syntax-simple-group-operations).

In [28]:
df = pd.read_feather("../Week2/data/bes_data_subset_week2.feather")

In [29]:
df.groupby('region')['Age'].mean()

region
East Midlands         54.903226
Eastern               54.070796
London                46.896552
North East            54.276786
North West            51.388158
Scotland              53.109948
South East            51.971631
South West            54.560241
Wales                 51.269841
West Midlands         54.451327
Yorkshire & Humber    53.152174
Name: Age, dtype: float64

In [30]:
df.groupby('region')[['Age']].mean() # List indexer returns a DataFrame of width 1.

| region        | Age       |
|---------------|-----------|
| East Midlands | 54.903226 |
| Eastern       | 54.070796 |
| London        | 46.896552 |
| North East    | 54.276786 |
| North West    | 51.388158 |
| Scotland      | 53.109948 |
| South East    | 51.971631 |
| South West    | 54.560241 |
| Wales         | 51.269841 |
| West Midlands | 54.451327 |


We can pass custom functions to the groupby object by using `apply`.

In [35]:
df.head(1)

|   | finalserialno | region | Constit_Code | Constit_Name | Interview_Date | total_num_dwel | total_num_hous | num_elig_people | turnoutValidationReg | Age | ... | k08 | y01 | y03 | y06 | y07 | y08 | y09 | y11 | y17 | y18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10115 | East Midlands | Ashfield | E14000535 | 06/09/2017 | 1 | 1 | 2 | Voted | 21.0 | ... | No | GBP 5,200 - GBP 10,399 | Own home on mortgage | No religion |  | No | Female | English/Welsh/Scottish/Northern Irish/British | Working full time - employee (30+ hours) |  |


In [31]:
df.groupby(['region', 'Constit_Code'])['Age'].apply(lambda x: f"{int(x.min())}-{int(x.max())}").rename('Age Range')

region              Constit_Code
East Midlands       Ashfield        21-83
                    Bassetlaw       23-93
                    Bolsover        27-65
                    Broxtowe        35-67
                    Charnwood       36-80
                                    ...  
Yorkshire & Humber  Sheffield       19-71
                    Sheffield,      22-90
                    Skipton an      22-84
                    York Centr      19-77
                    York Outer      46-86
Name: Age Range, Length: 218, dtype: object
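
The survey file is not needed to try this pattern. The sketch below applies the same age-range function to a small made-up DataFrame (`toy` and its values are hypothetical), using a named function instead of a `lambda`:

```python
import pandas as pd

# Made-up data standing in for the survey file.
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "Age": [21, 83, 35, 67, 50],
})


def age_range(ages):
    """Format the minimum and maximum of a group of ages as 'min-max'."""
    return f"{int(ages.min())}-{int(ages.max())}"


# The function is called once per group and receives that group's Age values.
ranges = toy.groupby("region")["Age"].apply(age_range).rename("Age Range")

print(ranges.to_dict())  # {'North': '21-83', 'South': '35-67'}
```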

We can pass multiple functions at once by using the `agg` method.

In [None]:
df.groupby(['region', 'Constit_Code'])[['Age']].agg(['mean', len])
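
A minimal sketch of what a list of aggregations produces, again on made-up data (`toy` is hypothetical). It uses the string names `"mean"` and `"count"`; note that `"count"` ignores missing values, whereas `len` counts every row.

```python
import pandas as pd

# Made-up data standing in for the survey file.
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "Age": [21, 83, 35, 65, 50],
})

# One output column per (input column, function) pair.
summary = toy.groupby("region")[["Age"]].agg(["mean", "count"])

# The columns form a two-level index, so they are selected with tuples.
print(summary[("Age", "mean")].to_dict())   # {'North': 52.0, 'South': 50.0}
print(summary[("Age", "count")].to_dict())  # {'North': 2, 'South': 3}
```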

We can also apply a single function to multiple columns simultaneously:

In [None]:
df.groupby(['region', 'Constit_Code'])[['k03', 'y06', 'y09', 'y11']].apply(lambda x: x.mode().iloc[0])

And finally, we can map different functions to different columns using the `agg()` function with a dictionary:

In [None]:
def group_mode(x):
    "Function for extracting first modal value from pandas groupby object"
    m = x.value_counts().index[0]
    return m

def gender_proportion(x):
    m = x.apply(lambda e: 1 if e=="Female" else 0)
    m = m.astype(int).mean()
    return m


df.groupby(['region', 'Constit_Code']).agg({'k03': group_mode,
                                            'y06': group_mode,
                                            'y09': gender_proportion,
                                            'Age': ['min', 'max']})

Note: to index the above DataFrame, you will need some fancy indexers, namely `pd.IndexSlice`.

For more general notes, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index

In [None]:
temp = df.groupby(['region', 'Constit_Code']).agg({'k03': group_mode,
                                                   'y06': group_mode,
                                                   'y09': gender_proportion,
                                                   'Age': ['min', 'max']})

In [None]:
idx = pd.IndexSlice
temp.loc[idx['East Midlands', :], idx[:, 'group_mode']]
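
The following sketch reproduces the dictionary aggregation and the `pd.IndexSlice` lookup on a small made-up frame, so the result can be inspected without the survey data. The frame `toy`, its column names, and `first_mode` are hypothetical stand-ins for the columns and `group_mode` used above.

```python
import pandas as pd

# Made-up data standing in for the survey file.
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "seat": ["A", "A", "B", "C"],
    "religion": ["No religion", "No religion", "Christian", "No religion"],
    "Age": [30, 50, 40, 60],
})


def first_mode(x):
    """Return the most frequent value in a group."""
    return x.value_counts().index[0]


# Different functions per column; the list for Age makes the columns two-level.
summary = toy.groupby(["region", "seat"]).agg({
    "religion": first_mode,
    "Age": ["min", "max"],
})

idx = pd.IndexSlice
# Rows: every seat in the North. Columns: every column aggregated with first_mode.
north_modes = summary.loc[idx["North", :], idx[:, "first_mode"]]
print(north_modes)
```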


# Combining Datasets

We look at two functions in particular, `pd.concat` and `pd.merge`. For an in-depth explanation, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [None]:
df1 = df.loc[:, ['finalserialno', 'Age', 'y09']]
df2 = df.loc[:, ['finalserialno', 'y06']]

## Using `pd.concat`

In [None]:
print(df1.shape)
print(df2.shape)
print(pd.concat([df1, df2], axis=0, sort=False).shape)
pd.concat([df1, df2], axis=0, sort=False)

In [None]:
print(df1.shape)
print(df2.shape)
print(pd.concat([df1, df2], axis=1).shape)

pd.concat([df1, df2], axis=1)

In [None]:
df3 = df.loc[:2130, ['finalserialno', 'a02']]

In [None]:
pd.concat([df1, df2, df3], axis=1)
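
Since the cells above depend on the survey file, here is a self-contained sketch of the same three cases on made-up frames (`left`, `right` and `shorter` are hypothetical). It shows that `axis=0` stacks rows, `axis=1` places frames side by side aligned on the index, and a shorter frame is padded with missing values.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "age": [30, 40, 50]})
right = pd.DataFrame({"id": [1, 2, 3], "religion": ["a", "b", "c"]})
shorter = pd.DataFrame({"extra": [True, False]})  # only index labels 0 and 1

# axis=0: rows are stacked; columns missing from one frame are filled with NaN.
stacked = pd.concat([left, right], axis=0, sort=False)

# axis=1: columns are placed side by side, matched on the index.
# The shared "id" column therefore appears twice.
side_by_side = pd.concat([left, right], axis=1)

# A frame with fewer rows is padded with NaN where its index has no match.
padded = pd.concat([left, shorter], axis=1)

print(stacked.shape)                # (6, 3)
print(side_by_side.shape)           # (3, 4)
print(padded["extra"].isna().sum()) # 1
```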

## Using `pd.merge`

In [None]:
pd.merge(df1, df2, on="finalserialno")

In [None]:
df4 = df.loc[30:, ['finalserialno', 'y11']]

In [None]:
print(df3.index)
print(df4.index)
for join in ['inner', 'left', 'right', 'outer']:
    print(pd.merge(df3, df4, how=join, on="finalserialno").shape)
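
The effect of `how` is easiest to see when the two frames share only some keys. A minimal sketch on made-up frames (`left` and `right` are hypothetical) where ids 3 and 4 appear in both:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3, 4], "x": list("abcd")})
right = pd.DataFrame({"id": [3, 4, 5], "y": [0.3, 0.4, 0.5]})

# inner: keys in both; left/right: all keys from that side; outer: all keys.
shapes = {
    how: pd.merge(left, right, how=how, on="id").shape
    for how in ["inner", "left", "right", "outer"]
}

print(shapes)
# {'inner': (2, 3), 'left': (4, 3), 'right': (3, 3), 'outer': (5, 3)}
```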

In [None]:
df4 = df4.rename({'finalserialno':'serialno'}, axis=1).set_index('serialno')
df4

In [None]:
pd.merge(df3, df4, how="outer", left_on="finalserialno", right_index=True)
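
When the key is a column in one frame and the index in the other, `left_on` and `right_index` tell `pd.merge` where to look. The sketch below uses made-up frames (`left`, `right` and their values are hypothetical) and a left join rather than the outer join above, so that the row labels of `left` are kept unchanged:

```python
import pandas as pd

left = pd.DataFrame({"finalserialno": [1, 2, 3], "x": ["a", "b", "c"]})

# The key lives in the index of the right-hand frame, as with df4 above.
right = pd.DataFrame(
    {"y": [20, 30]},
    index=pd.Index([2, 3], name="serialno"),
)

merged = pd.merge(
    left, right, how="left", left_on="finalserialno", right_index=True
)

# finalserialno 1 has no match in `right`, so its y value is missing.
print(merged)
```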

# Melting and Pivoting

In [None]:
long_df = pd.DataFrame({
    "Constituency": ['Oxford West', 'Oxford East']*4,
    "Year": [2010, 2010, 2015, 2015, 2017, 2017, 2019, 2019],
    "Party": ["Labour", "Tory"]*2+["Labour", "LibDem"]*2
})

long_df

In [None]:
wide_df = long_df.pivot(index="Constituency", columns="Year", values="Party")
wide_df

In [None]:
wide_df.reset_index().melt(id_vars="Constituency", value_vars=[2010, 2015, 2017, 2019], var_name="Year")
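
As a check that `pivot` and `melt` are inverse operations here, the sketch below rebuilds `long_df`, pivots it to wide form, melts it back, and compares the rows. The `value_name="Party"` argument and the row comparison are additions for illustration; without `value_name`, the melted column is called `value`.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Constituency": ["Oxford West", "Oxford East"] * 4,
    "Year": [2010, 2010, 2015, 2015, 2017, 2017, 2019, 2019],
    "Party": ["Labour", "Tory"] * 2 + ["Labour", "LibDem"] * 2,
})

# Long to wide: one row per constituency, one column per year.
wide_df = long_df.pivot(index="Constituency", columns="Year", values="Party")

# Wide to long: every year column becomes a (Year, Party) pair again.
round_trip = wide_df.reset_index().melt(
    id_vars="Constituency", var_name="Year", value_name="Party"
)

# Compare as sets of rows, since melt returns the rows in a different order.
original_rows = set(long_df.itertuples(index=False, name=None))
round_trip_rows = set(round_trip.itertuples(index=False, name=None))

print(wide_df.shape)                     # (2, 4)
print(original_rows == round_trip_rows)  # True
```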