In [7]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Lecture 3A - Apply & Map, Misc

# Table of Contents
* [Lecture 3A - Apply & Map, Misc](#Lecture-3A---Apply-&-Map,-Misc)
	* [Content](#Content)
	* [Learning Outcomes](#Learning-Outcomes)
	* [1. Functions and Dataframes - Using *apply()* and *applymap()*](#1.-Functions-and-Dataframes---Using-*apply%28%29*-and-*applymap%28%29*)
		* [Functions along an axis](#Functions-along-an-axis)
		* [Functions applied element-wise](#Functions-applied-element-wise)
	* [Dummy Variables](#Dummy-Variables)
	* [2. Removing Duplicates](#2.-Removing-Duplicates)
	* [3. Transpose](#3.-Transpose)


---

### Content

1. Applying functions to dataframes
2. Removing duplicates
3. Re-shaping dataframes with transpose
4. Shift operations for time series

### Learning Outcomes

At the end of this lecture, you should be able to:

* apply functions to dataframes
* remove duplicate rows in dataframes
* transpose dataframes
* apply shift operations to dataframes for time series data


---

In [10]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from pylab import rcParams

%matplotlib inline

In [11]:
# Set some pandas display options as you like
# (the full 'display.' prefix avoids the ambiguous-key OptionError)
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 30)

In [12]:
rcParams['figure.figsize'] = 15, 10
rcParams['font.size'] = 20

## 1. Functions and Dataframes - Using *apply()* and *applymap()* 

Built-in or user-defined functions can be applied along an axis of a dataframe.

To apply a function along an axis, we use the apply() method. It takes an optional axis argument: with axis=0 (the default) the function is applied to each column, and with axis=1 it is applied to each row.
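As a quick illustration of the axis argument, here is a minimal sketch on a small hypothetical dataframe (the name `toy` and its values are made up so the results are easy to check by hand). It applies a user-defined "range" function first to each column and then to each row.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]}, index=['a', 'b', 'c'])

def value_range(vec):
    # Difference between the largest and smallest value of a row or column
    return vec.max() - vec.min()

col_range = toy.apply(value_range, axis=0)  # one result per column
row_range = toy.apply(value_range, axis=1)  # one result per row

print(col_range)
print(row_range)
```

With axis=0 the result is indexed by the column labels, and with axis=1 it is indexed by the row labels.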

### Functions along an axis

In [45]:
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

df = df[['one','two','three']]
df

Unnamed: 0,one,two,three
a,0.787153,1.176522,
b,1.865876,2.019979,0.833163
c,-2.022951,-0.224008,-1.465468
d,,-0.998299,0.100662


Below is an example of applying the built-in NumPy sum function to each column:

In [46]:
df.apply(np.sum, axis=0)

one      0.630077
two      1.974194
three   -0.531643
dtype: float64

**Exercise**: Apply the mean function to the above dataframe in a row-wise manner.

In [47]:
df.apply(np.mean, axis=1)

a    0.981837
b    1.573006
c   -1.237476
d   -0.448818
dtype: float64

**Exercise**: Apply the mean function to columns 'one' and 'two' only in a row-wise manner, and assign the result to a new column in the dataframe called 'Four'.

In [48]:
df[['one','two']].apply(np.mean, axis=1)

a    0.981837
b    1.942927
c   -1.123480
d   -0.998299
dtype: float64

In [49]:
df['Four'] = df[['one','two']].apply(np.mean, axis=1)
df

Unnamed: 0,one,two,three,Four
a,0.787153,1.176522,,0.981837
b,1.865876,2.019979,0.833163,1.942927
c,-2.022951,-0.224008,-1.465468,-1.12348
d,,-0.998299,0.100662,-0.998299


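**Exercise**: Replace the missing values in columns 'one' and 'three' with the row-wise mean value.

In [50]:
# Assign the result back rather than calling fillna(inplace=True) on a column selection
df['one'] = df['one'].fillna(df.apply(np.mean, axis=1))
df['three'] = df['three'].fillna(df.apply(np.mean, axis=1))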
df

Unnamed: 0,one,two,three,Four
a,0.787153,1.176522,0.981837,0.981837
b,1.865876,2.019979,0.833163,1.942927
c,-2.022951,-0.224008,-1.465468,-1.12348
d,-0.631978,-0.998299,0.100662,-0.998299


**Exercise**: Calculate the column-wise product for the first and second columns only.

In [51]:
# np.prod replaces the deprecated np.product alias
df[['one','two']].apply(np.prod, axis=0)

one    1.877713
two    0.531462
dtype: float64

**Exercise**: Write a function which calculates the sum of a vector and then returns the square of the sum. Once you have done this, apply your function to the dataframe in a row-wise manner, creating a new column 'Five' to hold the result.

In [52]:
def sum_squared(vec):
    # Square of the sum of all values in the vector
    return sum(vec) ** 2

# The function can be passed directly; no lambda wrapper is needed
df['Five'] = df.apply(sum_squared, axis=1)
df

Unnamed: 0,one,two,three,Four,Five
a,0.787153,1.176522,0.981837,0.981837,15.424074
b,1.865876,2.019979,0.833163,1.942927,44.381512
c,-2.022951,-0.224008,-1.465468,-1.12348,23.385998
d,-0.631978,-0.998299,0.100662,-0.998299,6.390346


### Functions applied element-wise

The apply() method passes a whole row or column to the function, which typically produces one aggregate value per row or column. applymap(), on the other hand, applies a function to every individual element of a dataframe.

Say we would like to define a function which returns 'pos' for a non-negative number and 'neg' otherwise:

In [53]:
def pos_neg_to_string(x):
    # Label a single value by its sign (zero counts as 'pos')
    if x >= 0:
        return 'pos'
    return 'neg'

We can apply this to our dataframe as follows:

In [54]:
df.applymap(pos_neg_to_string)

Unnamed: 0,one,two,three,Four,Five
a,pos,pos,pos,pos,pos
b,pos,pos,pos,pos,pos
c,neg,neg,neg,neg,pos
d,neg,neg,pos,neg,pos


Having the ability to apply element-wise operations on dataframes is extremely useful when it comes to dataset cleaning and transformations.
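The element-wise idea can be sketched on a small hypothetical dataframe (`toy` and its values are made up for illustration). Note that pandas 2.1 deprecated `DataFrame.applymap` in favour of `DataFrame.map`, so this sketch picks whichever method the installed version provides.

```python
import pandas as pd

def pos_neg_to_string(x):
    # Label a single value by its sign (zero counts as 'pos')
    return 'pos' if x >= 0 else 'neg'

# Hypothetical toy data
toy = pd.DataFrame({'p': [1.5, -0.2], 'q': [0.0, -3.0]})

# Use DataFrame.map where it exists (pandas >= 2.1), otherwise fall back to applymap
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
labels = elementwise(pos_neg_to_string)

print(labels)
```

The result has the same shape and labels as the input, with every cell replaced by the function's return value.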

Let's take a look at a sample from a real-world dataset used for gathering results from a survey:

In [55]:
assig = pd.read_csv("dataset/surveySample.csv")
assig.head()

Unnamed: 0,OCCUPATION_M,supermarket spend in a week
0,e. Self-employed,c. $200 to $300
1,l. Home Duties,d. $300 to $400
2,i. Retired,b. $100 to $200
3,i. Retired,b. $100 to $200
4,h. Machinery operator/driver,b. $100 to $200


In [56]:
assig.OCCUPATION_M.head(20)

0                 e. Self-employed
1                   l. Home Duties
2                       i. Retired
3                       i. Retired
4     h. Machinery operator/driver
5                   l. Home Duties
6                     n. No Answer
7                       k. Student
8                       i. Retired
9                       a. Manager
10                        c. Sales
11                   m. Unemployed
12                      a. Manager
13                      i. Retired
14                 b. Professional
15                   m. Unemployed
16                      i. Retired
17                      i. Retired
18                      i. Retired
19                 b. Professional
Name: OCCUPATION_M, dtype: object

Clearly the values in this column need to be cleaned up.

Let's first find out what all the unique values are in this column.

In [57]:
assig.OCCUPATION_M.unique()

array(['e. Self-employed', 'l. Home Duties', 'i. Retired',
       'h. Machinery operator/driver', 'n. No Answer', 'k. Student',
       'a. Manager', 'c. Sales', 'm. Unemployed', 'b. Professional',
       'g. Labourer', 'f. Technician/trade worker',
       'd. Clerical/administration', 'j. Community/personal'],
      dtype=object)

We can now write a function that removes the first 3 characters of each entry in order to tidy the values.

In [58]:
def remove_first_three_chars(x):
    # Slice off the prefix; x.replace(x[:3], '') would also delete any later repeats of it
    return x[3:]

In [59]:
remove_first_three_chars('hello')

'lo'

In [60]:
assig[['OCCUPATION_M']].applymap(remove_first_three_chars)

Unnamed: 0,OCCUPATION_M
0,Self-employed
1,Home Duties
2,Retired
3,Retired
4,Machinery operator/driver
...,...
64995,No Answer
64996,Labourer
64997,Self-employed
64998,Professional


In order to make the change permanent, we need to assign the result to the dataframe:

In [64]:
assig['OCCUPATION_M'] = assig[['OCCUPATION_M']].applymap(remove_first_three_chars)
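For a single column of strings, the same clean-up can also be written without a custom function by using the `.str` accessor of a Series. The sketch below uses a short hypothetical stand-in for the OCCUPATION_M column rather than the real survey file.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the OCCUPATION_M column
occupation = pd.Series(['e. Self-employed', 'l. Home Duties', 'i. Retired'])

# .str[3:] slices every string in the Series, dropping the 3-character prefix
cleaned = occupation.str[3:]

print(cleaned.tolist())
```

This is an alternative to applymap() for this particular task, not a replacement for element-wise functions in general.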

## Dummy Variables


A dummy variable is a numerical variable used in data analysis to represent subgroups of the sample under study.

In research design, a dummy variable is often used to distinguish different treatment groups. This is accomplished by taking the distinct values from a column and creating a new column for each of them, populated with 0 or 1 to indicate whether or not the particular data point belongs to that group.

This is a frequent operation that can be done easily in pandas.

In [62]:
assig['OCCUPATION_M'].str.get_dummies()

Unnamed: 0,Clerical/administration,Community/personal,Home Duties,Labourer,Machinery operator/driver,Manager,No Answer,Professional,Retired,Sales,Self-employed,Student,Technician/trade worker,Unemployed
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64995,0,0,0,0,0,0,1,0,0,0,0,0,0,0
64996,0,0,0,1,0,0,0,0,0,0,0,0,0,0
64997,0,0,0,0,0,0,0,0,0,0,1,0,0,0
64998,0,0,0,0,0,0,0,1,0,0,0,0,0,0


We can also specify a separator for cells that contain multiple values, so that each value is treated as a separate column. In this example we will say that the forward slash separates distinct values, and we would like a column generated for each one.

In [63]:
assig['OCCUPATION_M'].str.get_dummies('/')

Unnamed: 0,Clerical,Community,Home Duties,Labourer,Machinery operator,Manager,No Answer,Professional,Retired,Sales,Self-employed,Student,Technician,Unemployed,administration,driver,personal,trade worker
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64995,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
64996,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
64997,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
64998,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
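
The effect of the separator can be seen more easily on a tiny hypothetical Series (the values below are made up, in the style of the cleaned OCCUPATION_M column). Without an argument, `str.get_dummies` splits on '|', so a value containing '/' stays in one column.

```python
import pandas as pd

# Hypothetical sample values
jobs = pd.Series(['Retired', 'Machinery operator/driver', 'Sales'])

whole = jobs.str.get_dummies()          # default separator is '|'
split = jobs.str.get_dummies(sep='/')   # treat '/' as a separator between values

print(whole.columns.tolist())
print(split.columns.tolist())
```

In `split`, the second row has a 1 in both the 'Machinery operator' and the 'driver' columns.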


**Exercise:** From the assignment dataset, consider the column 'supermarket spend in a week'. The '\$' character can cause issues in some applications. We want to clean up this column so that the first 3 characters and the '\$' character are removed. We also want entries with 'No Answer' to reflect that they are actually missing values, so replace them with np.nan. Write a function to do this and apply it to this column.

Verify that your code works. 
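One possible approach to this exercise is sketched below. It runs on a short hypothetical sample rather than the real file, and it assumes that missing answers contain the text 'No Answer'. The function name `clean_spend` is made up for illustration.

```python
import numpy as np
import pandas as pd

def clean_spend(x):
    # Assumption: missing answers are marked with the text 'No Answer'
    if 'No Answer' in x:
        return np.nan
    # Drop the 3-character prefix, then remove every '$' character
    return x[3:].replace('$', '')

# Hypothetical sample in the style of 'supermarket spend in a week'
spend = pd.Series(['c. $200 to $300', 'b. $100 to $200', 'h. No Answer'])

cleaned = spend.map(clean_spend)
print(cleaned)
```

To verify the result on the real column, inspecting `unique()` before and after the clean-up is a simple check.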

## 2. Removing Duplicates

Duplicate rows may occur naturally in some datasets, or they might arise from input errors. In many applications, such as machine learning, these duplicate entries need to be removed from the dataset.

Dataframes provide straightforward functionality to remove such records.

Here is an example:


In [None]:
df = pd.DataFrame({'c1': ['one'] * 3 + ['two'] * 4,
                  'c2': [1, 1, 2, 3, 3, 4, 4]})
df

`drop_duplicates` returns a DataFrame where the duplicated rows **across all columns** are dropped:

In [None]:
df.drop_duplicates()

We can also pass a subset of columns, so that only those columns are used to identify duplicates. Let's first make a change to the dataframe:

In [None]:
df.loc[1, 'c1'] = 'five'
df

In [None]:
df.drop_duplicates(['c2'])

Notice that `drop_duplicates` by default keeps the first occurrence of each value combination.
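The behaviour described above can be sketched end to end as follows. The data matches the example in this section, but the dataframe is named `dup` here to keep the sketch self-contained. The `keep='last'` call is an addition that shows how to retain the last occurrence instead of the first.

```python
import pandas as pd

dup = pd.DataFrame({'c1': ['one'] * 3 + ['two'] * 4,
                    'c2': [1, 1, 2, 3, 3, 4, 4]})

# Rows are compared across all columns
all_cols = dup.drop_duplicates()

# After this change, rows 0 and 1 differ in 'c1' but still share the same 'c2'
dup.loc[1, 'c1'] = 'five'

first_kept = dup.drop_duplicates(['c2'])              # keep='first' is the default
last_kept = dup.drop_duplicates(['c2'], keep='last')  # keep the last occurrence instead

print(all_cols)
print(first_kept)
print(last_kept)
```

Note that the original index labels are preserved, which makes it easy to see which rows survived.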

## 3. Transpose

Transposing is a special form of reshaping tabular data in such a way that the rows become columns and likewise the columns become rows.

In [None]:
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

df = df[['one','two','three']]
df

A dataframe can be transposed using either the transpose() method or the shorthand .T attribute:

In [None]:
df.T

Transpose operations are not permanent unless you re-assign the result back to the original dataframe.

In [None]:
df
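
A minimal sketch of this point, using a small hypothetical dataframe named `toy` with fixed values so the shapes are easy to follow:

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

flipped = toy.T              # equivalent to toy.transpose()
original_shape = toy.shape   # toy itself has not changed

print(original_shape, flipped.shape)

# Re-assign to keep the transposed version
toy = toy.T
print(toy.shape)
```

After the transpose, the old column labels become the index and the old index labels become the columns.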

**Exercise:** Slice and select out a dataframe with rows 'c' and 'd' and columns 'one' and 'two', then execute a transpose.  
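
One possible approach is sketched below. Label-based selection with `.loc` picks the rows and columns, and `.T` then transposes the selection. The values are hypothetical fixed numbers standing in for the random data above.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values in place of the random data
df = pd.DataFrame({'one': [1.0, 2.0, 3.0, np.nan],
                   'two': [4.0, 5.0, 6.0, 7.0],
                   'three': [np.nan, 8.0, 9.0, 10.0]},
                  index=['a', 'b', 'c', 'd'])

# Select rows 'c' and 'd' and columns 'one' and 'two' by label
subset = df.loc[['c', 'd'], ['one', 'two']]

# Rows become columns and columns become rows
result = subset.T
print(result)
```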