# Pandas - Summary Functions and Maps
Formatting data to suit the task at hand

In [1]:
import pandas as pd

In [3]:
data = pd.read_csv('Employee-attrition for Module 1.csv')

In [4]:
print(data)

       EmployeeID   recorddate_key birthdate_key orighiredate_key  \
0            1318  12/31/2006 0:00      1/3/1954        8/28/1989   
1            1318  12/31/2007 0:00      1/3/1954        8/28/1989   
2            1318  12/31/2008 0:00      1/3/1954        8/28/1989   
3            1318  12/31/2009 0:00      1/3/1954        8/28/1989   
4            1318  12/31/2010 0:00      1/3/1954        8/28/1989   
...           ...              ...           ...              ...   
49648        8258   12/1/2015 0:00     5/28/1994        8/19/2013   
49649        8264    8/1/2013 0:00     6/13/1994        8/27/2013   
49650        8279   12/1/2015 0:00     7/18/1994        9/15/2013   
49651        8296   12/1/2013 0:00      9/2/1994        10/9/2013   
49652        8321   12/1/2014 0:00    11/28/1994       11/24/2013   

      terminationdate_key  age  length_of_service    city_name  \
0                1/1/1900   52                 17    Vancouver   
1                1/1/1900   53         

## Summary functions
To restructure data in useful ways

### describe()
This method produces a high-level summary of the attributes of a given column. Note that it is type-aware: the output varies with the column's data type. Below is the description of the column 'EmployeeID', which shows the count, mean, standard deviation, minimum, maximum, and quartiles of the column.

In [6]:
data.EmployeeID.describe()

count    49653.000000
mean      4859.495740
std       1826.571142
min       1318.000000
25%       3360.000000
50%       5031.000000
75%       6335.000000
max       8336.000000
Name: EmployeeID, dtype: float64

The above function works well for numerical data (floats or integers), but if we run it on string data, we get a different result:

In [7]:
data.city_name.describe()

count         49653
unique           40
top       Vancouver
freq          11211
Name: city_name, dtype: object
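
The type-aware behaviour can be reproduced on a small frame. This is only a sketch: the column names mirror the dataset, but the values in `toy` are invented for illustration.

```python
import pandas as pd

# Toy data (invented values) mirroring two columns of the employee dataset
toy = pd.DataFrame({
    "length_of_service": [17, 18, 3, 3],
    "city_name": ["Vancouver", "Vancouver", "Kelowna", "Vancouver"],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy["length_of_service"].describe()

# String column: count, unique, top (most frequent value), freq (its frequency)
text_summary = toy["city_name"].describe()

print(numeric_summary)
print(text_summary)
```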

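This result is not an error: for string data, describe() reports the count, the number of unique values, the most frequent value, and how often it occurs. When we need one particular statistic, we can call a specific summary function instead, e.g. mean() to find the average, unique() to list the distinct values, or nunique() to count them.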

In [9]:
data.length_of_service.mean()

10.434596096912573

In [10]:
data.length_of_service.unique()

array([17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 16, 15, 14, 13, 12, 11, 10,
        9,  8,  7,  6,  5,  4,  3,  2,  1,  0], dtype=int64)
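
The difference between listing distinct values and counting them can be sketched as follows. The values in `service` are invented; only the method names come from pandas.

```python
import pandas as pd

service = pd.Series([17, 18, 18, 3, 3, 3], name="length_of_service")  # invented values

distinct_values = service.unique()   # NumPy array of distinct values, in order of first appearance
distinct_count = service.nunique()   # integer count of distinct values

print(distinct_values)
print(distinct_count)
```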

To see a list of unique values AND how often they occur, we can use value_counts()

In [11]:
data.length_of_service.value_counts()

13    2885
12    2567
8     2559
11    2482
10    2432
9     2381
7     2341
6     2294
3     2270
4     2262
5     2258
2     2257
1     2222
14    2203
15    2192
16    2160
17    2066
0     1962
18    1829
19    1656
20    1322
21    1047
22     830
23     608
24     433
25     121
26      14
Name: length_of_service, dtype: int64

From this result, we can see that a length of service of 13 years appears in more records than 8 years, and that while 26 years is the longest length of service, only 14 records reach it. These are counts of rows, not of distinct employees, because an employee can appear in more than one row.
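
value_counts() also accepts options that make such comparisons easier. A brief sketch on invented values: `normalize=True` returns proportions, and `sort_index()` orders the result by value rather than by frequency.

```python
import pandas as pd

service = pd.Series([13, 13, 13, 8, 8, 26], name="length_of_service")  # invented values

counts = service.value_counts()                  # most frequent value first
shares = service.value_counts(normalize=True)    # proportions instead of raw counts
by_value = service.value_counts().sort_index()   # ordered by the value itself

print(counts)
print(shares)
print(by_value)
```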

In [13]:
data.EmployeeID.value_counts()

2047    10
3148    10
2924    10
3020    10
3052    10
        ..
2419     1
2579     1
5245     1
4265     1
4408     1
Name: EmployeeID, Length: 6284, dtype: int64

This second value count shows that many Employee IDs occur more than once, some as many as 10 times. Before drawing conclusions from this data, I would seek to understand why this occurs and whether the repeated IDs are valid. The printout of the data above suggests one explanation: the same employee (e.g. EmployeeID 1318) has a separate record for each year.
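
One possible way to investigate repeated IDs is sketched below. It assumes that a repeat is legitimate when the rows differ in STATUS_YEAR (a column that appears in the dataset's header); the rows in `records` are invented.

```python
import pandas as pd

# Invented rows; the last two share both EmployeeID and STATUS_YEAR
records = pd.DataFrame({
    "EmployeeID": [1318, 1318, 1318, 8264, 8296, 8296],
    "STATUS_YEAR": [2006, 2007, 2008, 2013, 2013, 2013],
})

# Number of rows per employee
rows_per_employee = records["EmployeeID"].value_counts()

# Rows that repeat an (EmployeeID, STATUS_YEAR) pair already seen earlier
repeated_pairs = records[records.duplicated(subset=["EmployeeID", "STATUS_YEAR"])]

print(rows_per_employee)
print(repeated_pairs)
```

An empty `repeated_pairs` would suggest that each repeated ID is simply one record per year.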

## Maps 

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. 
There are two mapping methods that you will use often:
(1) map() is the first, and slightly simpler one. It transforms a single Series one value at a time.
(2) apply() transforms a whole DataFrame by calling a custom function on each row (or each column).

### map() 

In this example, we will re-center length_of_service (LOS) so that its mean is 0, by subtracting the column mean from every value.

In [18]:
LOS_data_mean = data.length_of_service.mean()
data.length_of_service.map(lambda p: p - LOS_data_mean)

0         6.565404
1         7.565404
2         8.565404
3         9.565404
4        10.565404
           ...    
49648    -8.434596
49649   -10.434596
49650    -8.434596
49651   -10.434596
49652    -9.434596
Name: length_of_service, Length: 49653, dtype: float64

The function you pass to map() should expect a single value and return a transformed version of that value. 

map() returns a new Series where all the values have been transformed by your function.
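
map() is not limited to functions: it also accepts a dictionary that maps old values to new ones. A small sketch with invented values, where the code "X" is a made-up example of a value missing from the dictionary:

```python
import pandas as pd

gender_short = pd.Series(["M", "F", "F", "X"], name="gender_short")  # "X" is an invented, unmapped code

full_names = {"M": "Male", "F": "Female"}

# Each value is looked up in the dictionary; values without an entry become NaN
gender_full = gender_short.map(full_names)

print(gender_full)
```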

### apply() 

apply() is the DataFrame counterpart of map(): it transforms a whole DataFrame by calling a custom function on each row. The function receives one row at a time and returns the transformed row.

In [20]:
def remean_length_of_service(row):
    # Subtract the column mean from this row's length_of_service
    row.length_of_service = row.length_of_service - LOS_data_mean
    return row
data.apply(remean_length_of_service, axis = 'columns')

Unnamed: 0,EmployeeID,recorddate_key,birthdate_key,orighiredate_key,terminationdate_key,age,length_of_service,city_name,department_name,job_title,store_name,gender_short,gender_full,termreason_desc,termtype_desc,STATUS_YEAR,STATUS,BUSINESS_UNIT
0,1318,12/31/2006 0:00,1/3/1954,8/28/1989,1/1/1900,52,6.565404,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2006,ACTIVE,HEADOFFICE
1,1318,12/31/2007 0:00,1/3/1954,8/28/1989,1/1/1900,53,7.565404,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2007,ACTIVE,HEADOFFICE
2,1318,12/31/2008 0:00,1/3/1954,8/28/1989,1/1/1900,54,8.565404,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2008,ACTIVE,HEADOFFICE
3,1318,12/31/2009 0:00,1/3/1954,8/28/1989,1/1/1900,55,9.565404,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2009,ACTIVE,HEADOFFICE
4,1318,12/31/2010 0:00,1/3/1954,8/28/1989,1/1/1900,56,10.565404,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2010,ACTIVE,HEADOFFICE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49648,8258,12/1/2015 0:00,5/28/1994,8/19/2013,12/30/2015,21,-8.434596,Valemount,Dairy,Dairy Person,34,M,Male,Layoff,Involuntary,2015,TERMINATED,STORES
49649,8264,8/1/2013 0:00,6/13/1994,8/27/2013,8/30/2013,19,-10.434596,Vancouver,Customer Service,Cashier,44,F,Female,Resignaton,Voluntary,2013,TERMINATED,STORES
49650,8279,12/1/2015 0:00,7/18/1994,9/15/2013,12/30/2015,21,-8.434596,White Rock,Customer Service,Cashier,39,F,Female,Layoff,Involuntary,2015,TERMINATED,STORES
49651,8296,12/1/2013 0:00,9/2/1994,10/9/2013,12/31/2013,19,-10.434596,Kelowna,Customer Service,Cashier,16,F,Female,Resignaton,Voluntary,2013,TERMINATED,STORES


If we had called data.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.
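
A column-wise call can be sketched like this. The frame `numbers` and the helper `center` are hypothetical names, and the values are invented; with axis='index', the function receives each column as a whole Series.

```python
import pandas as pd

numbers = pd.DataFrame({
    "age": [52, 53, 21],
    "length_of_service": [17, 18, 2],
})  # invented values

def center(column):
    # `column` is one entire column of the DataFrame, passed in as a Series
    return column - column.mean()

# axis="index" applies the function to each column in turn
centered = numbers.apply(center, axis="index")

print(centered)
```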

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of data, we can see that it still has its original length_of_service value.
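
This can be checked on a small stand-in frame. The sketch below uses invented values, and `length_of_service_centered` is a hypothetical column name chosen for illustration; to keep a transformed result, it has to be assigned somewhere explicitly.

```python
import pandas as pd

toy = pd.DataFrame({"length_of_service": [17, 18, 2]})  # invented stand-in for the employee data
mean_los = toy["length_of_service"].mean()

# map() returns a new Series and leaves `toy` alone
centered = toy["length_of_service"].map(lambda v: v - mean_los)

print(toy.head(1))  # the first row still holds its original value

# Keep the result by assigning it to a new (hypothetical) column
toy["length_of_service_centered"] = centered
```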

## Pandas built-in operations 

Here is an example of a faster way to re-center our length_of_service column around its mean:

In [21]:
# Keep the mean in a plain variable; data.<new_name> = ... sets an attribute, not a column
length_of_service_mean = data.length_of_service.mean()
data.length_of_service - length_of_service_mean

0         6.565404
1         7.565404
2         8.565404
3         9.565404
4        10.565404
           ...    
49648    -8.434596
49649   -10.434596
49650    -8.434596
49651   -10.434596
49652    -9.434596
Name: length_of_service, Length: 49653, dtype: float64

In this example, we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.
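
A quick way to convince yourself that the built-in operation matches the map() version is to compare the two results. A minimal sketch on invented values:

```python
import pandas as pd

service = pd.Series([17, 18, 2, 3], name="length_of_service")  # invented values
mean_service = service.mean()

via_map = service.map(lambda v: v - mean_service)  # one Python function call per element
via_broadcast = service - mean_service             # a single vectorized operation

# Both approaches produce the same Series
print(via_map.equals(via_broadcast))
```

The vectorized form is usually faster because the loop over values runs inside pandas rather than in Python.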

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining city and job title information in the dataset would be to do the following:

In [28]:
data.city_name + " - " + data.job_title

0                 Vancouver - CEO
1                 Vancouver - CEO
2                 Vancouver - CEO
3                 Vancouver - CEO
4                 Vancouver - CEO
                   ...           
49648    Valemount - Dairy Person
49649         Vancouver - Cashier
49650        White Rock - Cashier
49651           Kelowna - Cashier
49652       Grand Forks - Cashier
Length: 49653, dtype: object

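NOTE: This example works because both columns hold string data. When I used float or integer data (e.g. the column EmployeeID), I received errors, because a number cannot be added to a string. A numeric column must first be converted to strings, e.g. with astype(str).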
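
The conversion might look like the sketch below, which uses a two-row frame with invented contents. astype(str) turns the integer IDs into strings so that they can be concatenated.

```python
import pandas as pd

toy = pd.DataFrame({
    "EmployeeID": [1318, 8264],
    "job_title": ["CEO", "Cashier"],
})  # invented rows

# toy["EmployeeID"] + " - " + toy["job_title"] fails here: integers cannot be added to strings.
# Converting the numeric column to strings first makes the concatenation valid.
labels = toy["EmployeeID"].astype(str) + " - " + toy["job_title"]

print(labels)
```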