# 6. Series Attributes and Statistical Methods

### Only selected subsets of data thus far

### We will now call Series/DataFrame methods
We have already called a few methods: **`head`**, **`tail`**, **`isna`**, and **`set_index`**. Both Series and DataFrames have more than 250 methods available.

### Minimally Sufficient Pandas
I suggest using a subset of the Pandas library that allows you to do as many tasks as possible. You should strive to write Pandas code as simply as possible to maximize both performance and readability.

### Knowing lots of tricks doesn't help

# Begin with the Series

## View the API for complete list of functionality
The Pandas API reference can be found [here][1]. This is a huge list, but as mentioned above, only a small subset of this page is needed for the vast majority of tasks.

### City of Houston Employee Data
We will use a small public dataset from City of Houston employees with information on their position, race, gender, and salary.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html

In [1]:
import pandas as pd

In [2]:
emp = pd.read_csv('data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,experience
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,1
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,34
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,32
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,4
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,3


## Select a single column as a Series
Let's select the **`salary`** column as a Series and use it to explore the API.

In [3]:
salary = emp['salary']
salary.head()

0    45279.0
1    63166.0
2    66614.0
3    71680.0
4    42390.0
Name: salary, dtype: float64

In [4]:
type(salary)

pandas.core.series.Series

## Core Series Attributes
Pandas Series have [many attributes][2], but only a few are important to know. The attributes to be aware of are:

* `index`
* `values`
* `size`
* `dtype`

The **`index`** and **`values`** were covered in a previous notebook. Only **`size`** and **`dtype`** are new. The **`size`** is the total number of values in the Series, including missing values, and the **`dtype`** is the data type of the values. Let's display the **`dtype`** now.

[2]: http://pandas.pydata.org/pandas-docs/stable/api.html#attributes

In [5]:
salary.dtype

dtype('float64')

### `len` function
The built-in `len` function returns the number of elements in the Series, which is the same number that **`size`** reports.

In [6]:
len(salary)

1535
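As a quick illustration of the four attributes listed above, the sketch below builds a small stand-alone Series from the first five salaries shown earlier, typed in by hand instead of read from `data/employee.csv`. The variable name `s` is only for illustration.

```python
import pandas as pd

# First five salaries from the output above, typed in by hand
s = pd.Series([45279.0, 63166.0, 66614.0, 71680.0, 42390.0], name='salary')

print(s.index)    # the row labels
print(s.values)   # the underlying NumPy array
print(s.size)     # total number of values
print(s.dtype)    # data type of the values

# size and len report the same number
print(s.size == len(s))
```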

# Basic Arithmetic Operations
Arithmetic operators work directly on a Series. Let's see some examples with the first 10 salaries to avoid long output:

In [7]:
sal_head = salary.head(10)
sal_head + 5

0     45284.0
1     63171.0
2     66619.0
3     71685.0
4     42395.0
5    107967.0
6     52649.0
7    180421.0
8     30352.0
9     55274.0
Name: salary, dtype: float64

In [8]:
sal_head - 1

0     45278.0
1     63165.0
2     66613.0
3     71679.0
4     42389.0
5    107961.0
6     52643.0
7    180415.0
8     30346.0
9     55268.0
Name: salary, dtype: float64

In [9]:
# raise to the power of three
sal_head ** 3

0    9.283046e+13
1    2.520288e+14
2    2.955946e+14
3    3.682934e+14
4    7.617110e+13
5    1.258383e+15
6    1.458971e+14
7    5.872529e+15
8    2.794778e+13
9    1.688281e+14
Name: salary, dtype: float64

In [10]:
# floor division
sal_head // 17

0     2663.0
1     3715.0
2     3918.0
3     4216.0
4     2493.0
5     6350.0
6     3096.0
7    10612.0
8     1785.0
9     3251.0
Name: salary, dtype: float64

## Arithmetic operations are vectorized
All of the basic mathematical operations above were **vectorized** - the operation was applied to each value in the Series without writing an explicit **`for`** loop.
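To make the idea concrete, here is a minimal sketch that compares an explicit loop with the vectorized form. It uses five hand-typed salaries, and the names `sal`, `looped`, and `vectorized` are made up for this example.

```python
import pandas as pd

# Five salaries typed in by hand for illustration
sal = pd.Series([45279.0, 63166.0, 66614.0, 71680.0, 42390.0])

# Explicit loop: visit every value and add 5
looped = pd.Series([value + 5 for value in sal], index=sal.index)

# Vectorized: no loop written by us
vectorized = sal + 5

# Both approaches produce the same values
print(vectorized.equals(looped))
```

The vectorized version is shorter and is the form used throughout this notebook.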

## We've already done this with the comparison operators
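A short reminder of what that looks like, using five hand-typed salaries and an arbitrary threshold of 60,000 chosen only for this sketch:

```python
import pandas as pd

# Five salaries typed in by hand for illustration
sal = pd.Series([45279.0, 63166.0, 66614.0, 71680.0, 42390.0])

# The comparison is applied to every value and returns a boolean Series
high = sal > 60000
print(high)
```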

# Descriptive Statistics Methods
We will now call methods that compute [basic descriptive statistics][3] of a numerical Series. We will do so explicitly with dot notation. It is useful to place these methods into two categories - those that **aggregate** and those that do not.

A method that performs an aggregation returns a **single** number to summarize the Series. Examples of methods that aggregate are:
* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns the number of non-missing values
* `describe` - returns most of the above aggregations together in one Series
* `quantile` - returns the given percentile of the distribution

Any method that does not return a single value is not an aggregation. The ones listed below return a Series with the same length as the original:
* `abs` - takes the absolute value
* `round` - rounds to the given number of decimals
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - ranks values in a variety of different ways
* `diff` - difference between one element and another (by default, the previous one)
* `pct_change` - percent change from one element to another (by default, from the previous one)

[3]: http://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats
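The difference between the two categories can be sketched with a tiny made-up Series. Here `max` stands in for the aggregating methods and `cummax` for the non-aggregating ones; the values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Three made-up values
s = pd.Series([1.0, 3.0, 2.0])

# Aggregation: the whole Series collapses to one number
largest = s.max()
print(largest)

# Non-aggregation: one result per original value
running_max = s.cummax()
print(running_max)
```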

### Aggregation methods
Let's use some of the popular built-in aggregation methods:

In [11]:
salary.sum()

86387875.0

In [12]:
salary.min()

24960.0

In [13]:
salary.describe()

count      1535.000000
mean      56278.745928
std       20328.441253
min       24960.000000
25%       42000.000000
50%       55461.000000
75%       66614.000000
max      210588.000000
Name: salary, dtype: float64

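The **`count`** method returns the number of non-missing values. Since this number equals the length of the Series found earlier with `len` (1535), we know that **`salary`** has no missing values.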

In [14]:
salary.count()

1535

# Pandas ignores missing values by default
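Pass `skipna=False` to include missing values in the calculation, in which case a single missing value makes the result `NaN`. Because **`salary`** has no missing values, the sum below is unchanged: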

In [15]:
salary.sum(skipna=False)

86387875.0
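Since **`salary`** has nothing missing, the effect of `skipna` is easier to see on a small made-up Series that does contain a missing value. This is only a sketch; the values and names below are not from the employee data.

```python
import numpy as np
import pandas as pd

# Made-up Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# Default behavior (skipna=True): the missing value is ignored
default_sum = s.sum()
print(default_sum)

# skipna=False: the missing value propagates and the result is NaN
strict_sum = s.sum(skipna=False)
print(strict_sum)
```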

### `describe` and `quantile`
The `describe` method returns several aggregations together in one Series, while `quantile` returns a single percentile.

In [18]:
salary.describe()

count      1535.000000
mean      56278.745928
std       20328.441253
min       24960.000000
25%       42000.000000
50%       55461.000000
75%       66614.000000
max      210588.000000
Name: salary, dtype: float64

By default, the `quantile` method returns the median (`q=0.5`). Its parameter `q` takes a number between 0 and 1. Here we set it to return the 20th and 99th percentiles, respectively.

In [16]:
salary.quantile(q=.2)

38746.0

In [17]:
salary.quantile(q=.99)

115162.70000000006
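The following sketch checks the default behavior on a small made-up Series and also passes a list to `q`, which returns one value per requested quantile. The values and variable names are assumptions for illustration.

```python
import pandas as pd

# Five made-up values
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# With no argument, quantile returns the median
default_q = s.quantile()
print(default_q == s.median())

# A list of quantiles returns a Series indexed by the requested values
several = s.quantile(q=[.25, .75])
print(several)
```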

### Non-Aggregation Descriptive Statistic Methods

In [19]:
sal_head.abs()

0     45279.0
1     63166.0
2     66614.0
3     71680.0
4     42390.0
5    107962.0
6     52644.0
7    180416.0
8     30347.0
9     55269.0
Name: salary, dtype: float64

In [20]:
# round to the nearest thousand
sal_head.round(decimals=-3)

0     45000.0
1     63000.0
2     67000.0
3     72000.0
4     42000.0
5    108000.0
6     53000.0
7    180000.0
8     30000.0
9     55000.0
Name: salary, dtype: float64

In [21]:
sal_head.cummin()

0    45279.0
1    45279.0
2    45279.0
3    45279.0
4    42390.0
5    42390.0
6    42390.0
7    42390.0
8    30347.0
9    30347.0
Name: salary, dtype: float64
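A few of the other non-aggregating methods from the list, sketched on a tiny made-up Series so the results are easy to verify by eye:

```python
import pandas as pd

# Three made-up values
s = pd.Series([3.0, 1.0, 2.0])

print(s.cumsum())  # running total: 3, 4, 6
print(s.diff())    # difference from the previous value: NaN, -2, 1
print(s.rank())    # rank from smallest (1) to largest: 3, 1, 2
```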

The **`pct_change`** method finds the percentage change between the current observation and the one immediately preceding it. Passing `-1` instead compares each observation with the one immediately following it.

In [22]:
sal_head.pct_change()

0         NaN
1    0.395040
2    0.054586
3    0.076050
4   -0.408622
5    1.546874
6   -0.512384
7    2.427095
8   -0.831794
9    0.821234
Name: salary, dtype: float64

In [23]:
sal_head.pct_change(-1)

0   -0.283174
1   -0.051761
2   -0.070675
3    0.690965
4   -0.607362
5    1.050794
6   -0.708208
7    4.945102
8   -0.450922
9         NaN
Name: salary, dtype: float64
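One way to see what **`pct_change`** computes is to reproduce it by hand with `shift`. The sketch below does this for five hand-typed salaries; the variable names are made up for the example.

```python
import pandas as pd

# Five salaries typed in by hand for illustration
sal = pd.Series([45279.0, 63166.0, 66614.0, 71680.0, 42390.0])

# Change relative to the previous value
by_hand_prev = sal / sal.shift(1) - 1

# Change relative to the next value, matching pct_change(-1)
by_hand_next = sal / sal.shift(-1) - 1

# Largest absolute difference from the built-in results (missing values are skipped)
print((sal.pct_change() - by_hand_prev).abs().max())
print((sal.pct_change(-1) - by_hand_next).abs().max())
```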

## These methods return an entirely new Series
The non-aggregating methods above return an entirely new Series and do not modify the calling Series. Below, **`sal_head`** is unchanged after calling **`pct_change`**.

In [24]:
sal_head_change = sal_head.pct_change()

In [25]:
sal_head_change

0         NaN
1    0.395040
2    0.054586
3    0.076050
4   -0.408622
5    1.546874
6   -0.512384
7    2.427095
8   -0.831794
9    0.821234
Name: salary, dtype: float64

In [26]:
sal_head

0     45279.0
1     63166.0
2     66614.0
3     71680.0
4     42390.0
5    107962.0
6     52644.0
7    180416.0
8     30347.0
9     55269.0
Name: salary, dtype: float64

# Calling methods after an operation
Let's say you would like to find the sum of all the salaries after giving everyone a $5,000 bonus. You can do this in one line:

In [27]:
(salary + 5000).sum()

94062875.0

### Must use parentheses
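The parentheses make the addition happen first, so that **`sum`** is called on the resulting Series. Without them, the method call would attach to `5000` instead of the Series, which is an error. The one-line syntax might be confusing, but it does the same thing as this: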

In [None]:
salary_bonus = salary + 5000
salary_bonus.sum()
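As a sanity check on the arithmetic, adding a constant to every value raises the sum by that constant times the number of values. A minimal sketch with five hand-typed salaries (the variable names are made up):

```python
import pandas as pd

# Five salaries typed in by hand for illustration
sal = pd.Series([45279.0, 63166.0, 66614.0, 71680.0, 42390.0])

# Add the bonus first, then aggregate
one_line = (sal + 5000).sum()

# Same total, computed from the original sum
by_hand = sal.sum() + 5000 * len(sal)

print(one_line == by_hand)
```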

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index and assign the `imdb_score` as a Series to variable `score`. Output the first 5 values.</span>

In [50]:
import pandas as pd
# read the movie data with the title as the index
var_m = pd.read_csv('data/movie.csv', index_col='title')
# select the scores as a Series (the `score` variable from the problem)
var_s = var_m['imdb_score']
var_s.head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

### Problem 3
<span  style="color:green; font-size:16px">What are the maximum and minimum scores?</span>

In [53]:
var_s.min()


1.6000000000000001

In [54]:
var_s.max()

9.5

### Problem 4
<span  style="color:green; font-size:16px">How many missing values are there in the `score`?</span>

In [56]:
len(var_s)-var_s.count()

0
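An alternative way to count missing values is to sum the boolean Series returned by **`isna`**. The sketch below compares both approaches on a made-up Series that does contain missing values.

```python
import numpy as np
import pandas as pd

# Made-up scores with two missing values
s = pd.Series([7.9, np.nan, 6.8, np.nan])

# Total length minus the number of non-missing values
missing_by_count = len(s) - s.count()

# Sum of True values from isna
missing_by_isna = s.isna().sum()

print(missing_by_count, missing_by_isna)
```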

### Problem 6
<span  style="color:green; font-size:16px">How many movies have scores greater than 6? (Remember that True/False evaluates to 1/0)</span>

In [57]:
(var_s>6).sum()

3368
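The same trick gives a proportion as well as a count, because the mean of a boolean Series is the fraction of `True` values. A small sketch with made-up scores and a threshold of 7 chosen only for illustration:

```python
import pandas as pd

# Five made-up scores
s = pd.Series([7.9, 7.1, 6.8, 8.5, 5.2])

# True counts as 1 and False as 0
n_above = (s > 7).sum()
frac_above = (s > 7).mean()

print(n_above, frac_above)
```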

### Problem 8
<span  style="color:green; font-size:16px">Find the difference between the median and mean of the scores.</span>

In [58]:
var_s.median()-var_s.mean()

0.1625711960943912