# 6. Series Attributes and Statistical Methods

### Objectives

+ Pandas Series and DataFrame objects have a tremendous number of methods
+ Not necessary to know how to use all of them
+ A small subset of these methods is sufficient for most data analysis tasks
+ Know where to find the Series API
+ Know the common attributes
+ Overview of the statistical methods
+ Distinguish between aggregation and non-aggregation methods
+ Know what a vectorized operation is
+ Know how to call a method after applying a mathematical operation to a Series

### Only selected subsets of data
Up to this point, we have only selected subsets of data. We have not changed the data or performed any operations on it. Our selections have happened in two ways:

* Selection by label and integer location
* Selection by actual values of the DataFrame (Boolean Selection)

Both of these methods involve using Python's indexing operator, brackets **`[]`**. Boolean selection formed a Series of booleans with the comparison operators, such as **`<`** and **`>`**.

### Calling methods on a Series/DataFrame
Selecting subsets of data does not usually require calling methods using dot notation. In this notebook we will call many methods that will perform actions on our DataFrame. We have actually already done some of this with **`head`**, **`tail`**, **`isna`**, and **`set_index`**.

There are more than 250 methods available to both DataFrames and Series.
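
One rough way to get a feel for this scale is to count the public names on each class. The sketch below assumes that "public" means any name without a leading underscore, so it counts attributes as well as methods, and the exact numbers will vary with the installed version of Pandas.

```python
import pandas as pd

# Public names (methods and attributes) are those without a leading underscore
series_names = [name for name in dir(pd.Series) if not name.startswith('_')]
df_names = [name for name in dir(pd.DataFrame) if not name.startswith('_')]

# Names that exist on both classes
shared = set(series_names) & set(df_names)

print(len(series_names), len(df_names), len(shared))
```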

### Use a small subset of methods
It can be quite overwhelming to think about having to learn and memorize this staggering amount of functionality. The good news is that many of these methods are quite unnecessary and don't add any extra functionality. Furthermore, many methods are remnants from the early days of Pandas and have few or no use cases, or have been **deprecated**. When a method is deprecated, its use is discouraged and it will likely be removed from the library in the future.

### Minimally Sufficient Pandas
I suggest using a subset of the Pandas library that allows you to do as many tasks as possible. You should strive to write Pandas as simply as possible to maximize both performance and readability. Because there is so much functionality, power users of Pandas can come up with very creative and complex code to accomplish different tasks. This is NOT a positive thing: when working with a group of other data scientists, it can lead to confusion for those who are not familiar with the syntax.

### Knowing lots of tricks doesn't help
It is not uncommon to see Pandas experts provide several different answers to the same question. Nearly all operations can be accomplished with a small subset of the available operations.

# Begin with the Series
We will begin our exploration of the Pandas library by retrieving attributes and calling methods from Series objects.

## View the API for complete list of functionality
Software libraries describe their available functionality in an **Application Programming Interface**, or **API**. The Pandas API reference can be found [here][1]. This is a huge list, but as mentioned above, only a small subset of this page is needed for the vast majority of tasks.

## The best of the Pandas Series API
The Pandas Series object is a single column of data and easier to work with than an entire DataFrame. We start with it and cover the most basic and important methods below. Navigate to the [Series API][2] section of Pandas.

### City of Houston Employee Data
We will use a small public dataset of City of Houston employees with information on their position, race, gender, and salary. Notice that the columns `HIRE_DATE` and `JOB_DATE` can be coerced to datetimes, which we do with the `parse_dates` parameter when reading the file.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html
[2]: http://pandas.pydata.org/pandas-docs/stable/api.html#series

In [1]:
import pandas as pd

In [2]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['HIRE_DATE', 'JOB_DATE'])
emp.head()

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


## Select a single column as a Series
Let's select the **`BASE_SALARY`** column as a Series and use it to explore the API.

In [3]:
salary = emp['BASE_SALARY']
salary.head()

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
Name: BASE_SALARY, dtype: float64

In [4]:
type(salary)

pandas.core.series.Series

## Core Series Attributes
Pandas Series have [many attributes][1], but only a few are important to know. The attributes to be aware of are:

* `index`
* `values`
* `size`
* `dtype`

The **`index`** and **`values`** were covered in a previous notebook. Only **`size`** and **`dtype`** are new. The size represents the total number of values in the Series. And the **`dtype`** is the data type of the values. Let's display these now.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#attributes

In [21]:
salary.size

2000

In [22]:
salary.dtype

dtype('float64')

### `len` function instead of `size` attribute
The built-in **`len`** function returns the same number as the **`size`** attribute. They are both equally acceptable.

In [20]:
len(salary)

2000
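
To see all four attributes side by side, here is a minimal sketch on a small Series. The data and labels are hypothetical and are not taken from the employee dataset. Note that `size` and `len` both count missing values.

```python
import pandas as pd

# Hypothetical toy data with one missing value
s = pd.Series([10.5, 20.0, None, 40.25], index=['a', 'b', 'c', 'd'])

print(s.index)    # the labels
print(s.values)   # the underlying NumPy array
print(s.size)     # 4, the missing value is counted
print(s.dtype)    # float64

# len and size agree for a Series
same_length = len(s) == s.size
print(same_length)
```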

# Basic Mathematical Operations
One of the most basic things we can do with a Series of numbers is to apply a math operation to each value, such as adding 5 to each value. To do this, we simply use the operator symbol with the Series name.

### Adding a number to a Series (and other math operations)
Let's see some examples of this with a small Series to avoid long output:

In [11]:
sal_head = salary.head(10)
sal_head + 5

0    121867.0
1     26130.0
2     45284.0
3     63171.0
4     56352.0
5     66619.0
6     71685.0
7     42395.0
8    107967.0
9     44621.0
Name: BASE_SALARY, dtype: float64

In [12]:
sal_head - 1

0    121861.0
1     26124.0
2     45278.0
3     63165.0
4     56346.0
5     66613.0
6     71679.0
7     42389.0
8    107961.0
9     44615.0
Name: BASE_SALARY, dtype: float64

In [13]:
# raise to the power of three
sal_head ** 3

0    1.809693e+15
1    1.783072e+13
2    9.283046e+13
3    2.520288e+14
4    1.789008e+14
5    2.955946e+14
6    3.682934e+14
7    7.617110e+13
8    1.258383e+15
9    8.881205e+13
Name: BASE_SALARY, dtype: float64

In [14]:
# floor division
sal_head // 17

0    7168.0
1    1536.0
2    2663.0
3    3715.0
4    3314.0
5    3918.0
6    4216.0
7    2493.0
8    6350.0
9    2624.0
Name: BASE_SALARY, dtype: float64

## Isn't this notebook about calling methods?
Although the above operations do not use dot notation to call methods, they are technically invoking **special methods**. Special methods are an advanced Python topic.

### Example of a special method
The Python arithmetic and comparison operators each have an equivalent special method that can be used in their place. Let's use the special methods to accomplish the same tasks as above.

In [15]:
sal_head.__add__(5)

0    121867.0
1     26130.0
2     45284.0
3     63171.0
4     56352.0
5     66619.0
6     71685.0
7     42395.0
8    107967.0
9     44621.0
Name: BASE_SALARY, dtype: float64

In [16]:
sal_head.__sub__(1)

0    121861.0
1     26124.0
2     45278.0
3     63165.0
4     56346.0
5     66613.0
6     71679.0
7     42389.0
8    107961.0
9     44615.0
Name: BASE_SALARY, dtype: float64

In [17]:
sal_head.__pow__(3)

0    1.809693e+15
1    1.783072e+13
2    9.283046e+13
3    2.520288e+14
4    1.789008e+14
5    2.955946e+14
6    3.682934e+14
7    7.617110e+13
8    1.258383e+15
9    8.881205e+13
Name: BASE_SALARY, dtype: float64

In [18]:
sal_head.__floordiv__(17)

0    7168.0
1    1536.0
2    2663.0
3    3715.0
4    3314.0
5    3918.0
6    4216.0
7    2493.0
8    6350.0
9    2624.0
Name: BASE_SALARY, dtype: float64

## What really happens when you use the `+` operator
When you call **`sal_head + 5`**, Python will actually call **`sal_head.__add__(5)`**. This fact isn't too important now but will be if you begin defining your own classes. To learn more, see the [Python data model][1].

[1]: https://docs.python.org/3/reference/datamodel.html
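
As a rough illustration of why this matters for your own classes, the sketch below defines a hypothetical `Dollars` class (not part of Pandas or this dataset) whose `__add__` method is what Python runs when the `+` operator is used.

```python
class Dollars:
    """Hypothetical class used only to illustrate special methods."""

    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        # Python calls this method when evaluating `Dollars(...) + other`
        return Dollars(self.amount + other)

    def __repr__(self):
        return f'Dollars({self.amount})'


total = Dollars(10) + 5   # equivalent to Dollars(10).__add__(5)
print(total)              # Dollars(15)
```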

## Basic math operations are vectorized
All of the basic mathematical operations above were **vectorized**: they were applied to each value in the Series without writing an explicit **`for`** loop. Calling **`sal_head + 5`** adds 5 to each value of the Series. This would not work with a Python list, where adding a number raises an error.
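
The contrast with a plain list can be sketched as follows, using the first three salaries shown earlier. With a list, the loop has to be written out (here as a list comprehension), while the Series handles every value in one expression.

```python
import pandas as pd

values = [121862.0, 26125.0, 45279.0]  # first three salaries shown above

# A plain list does not support element-wise addition with a number
try:
    values + 5
except TypeError as err:
    print('list + 5 fails:', err)

# With a list, an explicit loop (here a list comprehension) is required
looped = [v + 5 for v in values]

# A Series applies the operation to every value at once
vectorized = pd.Series(values) + 5

print(looped)
print(vectorized.tolist())
```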

## We've already done this with the comparison operators

In the boolean indexing notebooks, we used the vectorized comparison operators to produce Series of booleans. The same thing is happening now with the arithmetic operators.

# Descriptive Statistics Methods
We will now call methods that compute [basic descriptive statistics][1] of a numerical Series. We will do so explicitly with dot notation. It is useful to place these methods into two categories - those that **aggregate** and those that do not.

A method that performs an aggregation returns a **single** number to represent the description. Examples of methods that aggregate are:
* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-na values
* `describe` - returns most of the above aggregations in one Series
* `quantile` - returns given percentile of distribution

Any other method that does not return a single value is not an aggregation. Some examples of these methods are:
* `abs` - takes absolute value
* `round` - round to the nearest given decimal
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another
* `pct_change` - percent change from one element to another

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats
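
The distinction between the two categories can be sketched on a small, hypothetical Series: `sum` collapses the data to one value, while `cumsum` returns a Series with the same length as the original.

```python
import pandas as pd

s = pd.Series([3, -1, 4, -1, 5])  # hypothetical toy data

total = s.sum()        # aggregation: one value for the whole Series
running = s.cumsum()   # non-aggregation: one value per original element

print(total)              # 10
print(running.tolist())   # [3, 2, 6, 5, 10]
print(len(running) == len(s))
```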

### Aggregation methods
Let's use some of the popular built-in aggregation methods:

In [5]:
salary.sum()

105178319.0

In [6]:
salary.min()

24960.0

In [7]:
salary.max()

275000.0

In [8]:
salary.mean()

55767.93160127253

In [22]:
salary.median()

54461.0

In [9]:
salary.std()

21693.706679449504

Count non-missing values. Since this number is less than 2,000, we know missing values exist.

In [19]:
salary.count()

1886

# Pandas ignores missing values by default
One big difference between Pandas and NumPy is that Pandas ignores missing values by default. When calling aggregation methods such as **`sum`** or **`mean`**, Pandas will simply ignore any missing value as if that piece of data did not exist.

Pandas gives you control over how missing data is handled through the **`skipna`** parameter, which is available on most of the aggregation methods, including **`sum`** and **`mean`**. By default it is **`True`**. If you set it to **`False`** and your Series contains missing values, then Pandas will return a missing value as the result.

In [70]:
salary.sum(skipna=False)

nan
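
The difference from NumPy can be sketched on a small, hypothetical array containing one missing value. The plain NumPy `sum` propagates the `nan`, while the Pandas `sum` skips it unless told otherwise.

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, np.nan]  # hypothetical toy data with one missing value

numpy_sum = np.array(data).sum()                    # nan: NumPy propagates the missing value
pandas_sum = pd.Series(data).sum()                  # 3.0: the missing value is skipped
pandas_strict = pd.Series(data).sum(skipna=False)   # nan: same result as NumPy

print(numpy_sum, pandas_sum, pandas_strict)
```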

### `describe` and `quantile`
There are more built-in aggregation methods such as `describe` and `quantile`.

In [24]:
salary.describe()

count      1886.000000
mean      55767.931601
std       21693.706679
min       24960.000000
25%       40170.000000
50%       54461.000000
75%       66614.000000
max      275000.000000
Name: BASE_SALARY, dtype: float64

By default, the `quantile` method returns the median. Its `q` parameter takes a number between 0 and 1. We can set it to return the 20th and 99th percentiles, respectively.

In [25]:
salary.quantile(q=.2)

37045.0

In [26]:
salary.quantile(q=.99)

125918.65000000002
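
As a small sketch on hypothetical data, the default call to `quantile` can be checked against `median`. The `q` parameter also accepts a list, in which case the result is a Series rather than a single number.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])  # hypothetical toy data

# With no argument, q defaults to 0.5, which is the median
default_matches_median = s.quantile() == s.median()
print(default_matches_median)

# Passing a list returns a Series indexed by the requested quantiles
quartiles = s.quantile(q=[0.25, 0.75])
print(quartiles)
```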

### Non-Aggregation Descriptive Statistic Methods

In [27]:
sal_head.abs()

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
5     66614.0
6     71680.0
7     42390.0
8    107962.0
9     44616.0
Name: BASE_SALARY, dtype: float64

In [30]:
# round to the nearest thousand
sal_head.round(decimals=-3)

0    122000.0
1     26000.0
2     45000.0
3     63000.0
4     56000.0
5     67000.0
6     72000.0
7     42000.0
8    108000.0
9     45000.0
Name: BASE_SALARY, dtype: float64

In [31]:
sal_head.cummin()

0    121862.0
1     26125.0
2     26125.0
3     26125.0
4     26125.0
5     26125.0
6     26125.0
7     26125.0
8     26125.0
9     26125.0
Name: BASE_SALARY, dtype: float64

In [32]:
sal_head.cummax()

0    121862.0
1    121862.0
2    121862.0
3    121862.0
4    121862.0
5    121862.0
6    121862.0
7    121862.0
8    121862.0
9    121862.0
Name: BASE_SALARY, dtype: float64

In [33]:
sal_head.cumsum()

0    121862.0
1    147987.0
2    193266.0
3    256432.0
4    312779.0
5    379393.0
6    451073.0
7    493463.0
8    601425.0
9    646041.0
Name: BASE_SALARY, dtype: float64

In [34]:
sal_head.rank()

0    10.0
1     1.0
2     4.0
3     6.0
4     5.0
5     7.0
6     8.0
7     2.0
8     9.0
9     3.0
Name: BASE_SALARY, dtype: float64

Use the **`ascending`** parameter to change the direction of the ranking.

In [35]:
sal_head.rank(ascending=False)

0     1.0
1    10.0
2     7.0
3     5.0
4     6.0
5     4.0
6     3.0
7     9.0
8     2.0
9     8.0
Name: BASE_SALARY, dtype: float64
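
The ten salaries above contain no ties, so they do not show how `rank` treats equal values. The sketch below uses hypothetical data with a tie and the `method` parameter to compare three of the available tie-breaking rules.

```python
import pandas as pd

s = pd.Series([50, 20, 50, 10, 70])  # hypothetical toy data with a tie

average_rank = s.rank()                # default: ties share the average of their ranks
min_rank = s.rank(method='min')        # ties share the lowest rank of the group
dense_rank = s.rank(method='dense')    # like 'min', but with no gap after a tie

print(average_rank.tolist())   # [3.5, 2.0, 3.5, 1.0, 5.0]
print(min_rank.tolist())       # [3.0, 2.0, 3.0, 1.0, 5.0]
print(dense_rank.tolist())     # [3.0, 2.0, 3.0, 1.0, 4.0]
```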

By default **`diff`** subtracts the previous value from the current value.

In [47]:
# print out the Series to visually verify
sal_head

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
5     66614.0
6     71680.0
7     42390.0
8    107962.0
9     44616.0
Name: BASE_SALARY, dtype: float64

In [36]:
sal_head.diff()

0        NaN
1   -95737.0
2    19154.0
3    17887.0
4    -6819.0
5    10267.0
6     5066.0
7   -29290.0
8    65572.0
9   -63346.0
Name: BASE_SALARY, dtype: float64

The first parameter, **`periods`**, determines which two values are subtracted. For instance we can subtract the 2nd previous value from the current like this:

In [42]:
sal_head.diff(periods=2)

0        NaN
1        NaN
2   -76583.0
3    37041.0
4    11068.0
5     3448.0
6    15333.0
7   -24224.0
8    36282.0
9     2226.0
Name: BASE_SALARY, dtype: float64

Or even reverse the subtraction by using negative values:

In [44]:
sal_head.diff(-1)

0    95737.0
1   -19154.0
2   -17887.0
3     6819.0
4   -10267.0
5    -5066.0
6    29290.0
7   -65572.0
8    63346.0
9        NaN
Name: BASE_SALARY, dtype: float64

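The **`pct_change`** method works analogously but divides each difference by the value being subtracted, returning the change as a proportion (0.5 means a 50% increase) rather than an absolute difference.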

In [49]:
sal_head.pct_change()

0         NaN
1   -0.785618
2    0.733167
3    0.395040
4   -0.107954
5    0.182210
6    0.076050
7   -0.408622
8    1.546874
9   -0.586743
Name: BASE_SALARY, dtype: float64

In [50]:
sal_head.pct_change(-1)

0    3.664574
1   -0.423022
2   -0.283174
3    0.121018
4   -0.154127
5   -0.070675
6    0.690965
7   -0.607362
8    1.419805
9         NaN
Name: BASE_SALARY, dtype: float64
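
The relationship between `diff` and `pct_change` can be sketched on hypothetical data with the default `periods=1`. Here `shift` is used to line up each value with the one before it. Results may differ in the last decimal places because of floating-point arithmetic.

```python
import pandas as pd

s = pd.Series([100.0, 150.0, 120.0])  # hypothetical toy data

change = s.diff()            # current value minus previous value
previous = s.shift()         # previous value aligned to the current row
manual = change / previous   # change as a proportion of the previous value

print(manual.tolist())
print(s.pct_change().tolist())
```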

# Calling methods after an operation
Let's say you would like to find the sum of all the salaries after giving everyone a $5,000 bonus. You can do this in one line like this:

In [51]:
(salary + 5000).sum()

114608319.0

### Must use parentheses
The above syntax might be confusing but it is doing the same thing as this:

In [52]:
salary_bonus = salary + 5000
salary_bonus.sum()

114608319.0

## Explanation
Writing **`(salary + 5000).sum()`** first adds 5,000 to each value in the salary Series. This produces a temporary Series, which has the same methods as any other Series. We then call the **`sum`** method on this temporary Series.

### What is a temporary Series?
The word **`temporary`** is used to describe a Series object that is not assigned to any variable. It is only held in memory temporarily during the execution of that one statement.
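
To see why the parentheses are required, consider this sketch with hypothetical data and a hypothetical `bonus` variable. Without parentheses, the method call is evaluated before the addition, so Python looks for a `sum` method on the integer rather than on the Series.

```python
import pandas as pd

s = pd.Series([1000.0, 2000.0, 3000.0])  # hypothetical toy data
bonus = 500

# Parentheses force the addition first, then sum is called on the temporary Series
with_parens = (s + bonus).sum()

# Without parentheses, Python evaluates bonus.sum() first, and an int has no sum method
try:
    s + bonus.sum()
except AttributeError as err:
    print('fails:', err)

print(with_parens)   # 7500.0
```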

## A temporary list
Let's see another example of a temporary object. Here, the temporary list **`[-99, -11]`** is never assigned to a variable and only exists during the execution of the second line of code below:

In [56]:
a = [1, 4, 10]
b = [-99, -11] + a
b

[-99, -11, 1, 4, 10]

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index and assign the `imdb_score` as a Series to variable `score`. Output the first 5 values.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">What is the data type of `score` and how many values does it contain?</span>

In [57]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">What are the maximum and minimum scores?</span>

In [58]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">How many missing values are there in the `score`?</span>

In [59]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Rank the first 10 values in `score` from highest to lowest.</span>

In [60]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">How many movies have scores greater than 6? (Remember that True/False evaluates to 1/0)</span>

In [61]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">How many movies have scores greater than 4 and less than 7?</span>

In [62]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Find the difference between the median and mean of the scores.</span>

In [63]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Add 1 to every value of `score` and then calculate the median.</span>

In [64]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Calculate the median of `score` and add 1 to this. Why is this value the same as problem 9?</span>

In [65]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">Return a Series that has only scores above the 99.9th percentile.</span>

In [66]:
# your code here