# Pandas Series

Think of a Pandas Series as a list or array of data, but on crack. It comes with added functionality and metadata that a standard Python list does not have.

For instance, every element of a Series has a label associated with it; together these labels form the `index`.  By default, the index is a sequence of integers, but it can be set to almost anything.  You can then reference any element in a Series by its label.  Keep in mind that most operations involving more than one Series pair up elements by comparing index values, not positions.
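As a minimal sketch of label-based access and index alignment, the snippet below builds two small Series with string labels. The names `q1` and `q2` and their values are made up for illustration and do not come from any dataset used here.

```python
import pandas as pd

# Hypothetical example data: two Series whose labels are in a different order.
q1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
q2 = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# Reference an element by its label rather than its position.
value_b = q1['b']

# Addition pairs values by index label, not by position,
# so q1['a'] is added to q2['a'] even though they sit at opposite ends.
total = q1 + q2

print(value_b)
print(total)
```

If the two Series had labels that did not match, the unmatched labels would show up in the result with a `NaN` value.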

So a Series has a ton of added functionality built right into it, which can be found in the documentation [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html).  Bookmark that site; we will be referring to it often.

Let's create a simple Series from a list of numbers.

In [3]:
# Let's start with importing the package into the notebook
import pandas as pd

In [4]:
# Define a list of numbers
numbers = [1,2,3,4,5,6,7,8,9,10]

In [6]:
# create series object
my_series = pd.Series(numbers)

In [7]:
my_series

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64

We now have a Series object in memory called `my_series`.  The default index (left column) is the result of `range(len(numbers))`, which starts at 0 like Python's own sequence indexing. Ya, cool, so what?  Although we aren't sure why we need to use Series yet, let's learn how to create them from various sources.

## Creating A Pandas Series

We just created a Pandas Series from an ordinary Python list, and saw the index it created.  Pandas has multiple ways to create a Series, which is nice because the data we want to analyze isn't always in the same format.  Let's try a few.

In [25]:
salaries = {
    'John'   : 80000,
    'Mary'   : 200000,
    'Michael': 100000,
    'Betty'  : 76000,
    'Greg'   : 90000,
    'Hillary': 120000,
}
series_from_dict = pd.Series(salaries)

In [26]:
series_from_dict

Betty       76000
Greg        90000
Hillary    120000
John        80000
Mary       200000
Michael    100000
dtype: int64

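We can see that if we create a Series object from a Python dictionary, it automatically uses the keys of the dictionary as the index instead of a numerical index.  (Older versions of Pandas sorted the keys, as in the output above; current versions keep the dictionary's insertion order.)  How about a file?  The "comma-separated values" or .csv format is very common, and Pandas handles it well.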

In [60]:
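# Series.from_csv was removed in pandas 1.0; read_csv plus squeeze is the current equivalent
series_from_csv = pd.read_csv('./salaries.csv', index_col=0, header=None).squeeze('columns')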

In [61]:
series_from_csv

John        80000.0
Mary       200000.0
Michael    100000.0
Betty           NaN
Greg        90000.0
Hillary    120000.0
dtype: float64

Note that Pandas automatically handled the missing value and put a `NaN` in its place.  That is also why the values are now floats: `NaN` is a floating-point value.  By default, built-in aggregations such as `sum()` and `mean()` skip `NaN` values.  If we'd like, we can fill all of these values in with a default.

In [63]:
series_from_csv.fillna(50000)

John        80000.0
Mary       200000.0
Michael    100000.0
Betty       50000.0
Greg        90000.0
Hillary    120000.0
dtype: float64

There is a robust set of tools to deal with null or missing values in a dataset.  It's always good to check whether your data has missing values first.  Note that `fillna` returned a new Series, so `series_from_csv` itself still contains the `NaN`.

In [71]:
# Check for missing values in a Series object
series_from_csv.hasnans

True
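A few of those tools can be sketched on a small stand-in Series. The name `pay` and its values are hypothetical and only mirror the shape of the salary data above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
pay = pd.Series({'John': 80000.0, 'Betty': np.nan, 'Greg': 90000.0})

missing_count = pay.isna().sum()     # how many entries are NaN
mean_skipping = pay.mean()           # NaN is skipped by default (skipna=True)
sum_strict = pay.sum(skipna=False)   # NaN propagates when skipping is turned off
without_missing = pay.dropna()       # drop the missing entries instead of filling them

print(missing_count, mean_skipping, sum_strict)
print(without_missing)
```

Whether to fill, drop, or keep missing values depends on the analysis; none of these calls modifies `pay` in place.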

## Manipulating Series Objects

Ok, that's enough of creating the objects.  Let's learn how to manipulate them once you have them.  We'll deal with more file I/O in the next part.  

A nice feature of dealing with a Series instead of a typical Python list is vectorized operations.  That means you can perform an operation on every element of a Series without writing a loop.  Here are a few examples, using the `my_series` object we created at the start:

In [66]:
# just for reference purposes
my_series

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64

In [65]:
# Adding a Series object with another
my_series + my_series

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int64

In [67]:
my_series - my_series

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64

In [69]:
# adding a scalar to all elements of a Series
# this also works with other operations (* / + -)
my_series + 5

0     6
1     7
2     8
3     9
4    10
5    11
6    12
7    13
8    14
9    15
dtype: int64

In [19]:
# apply function on an element by element basis
def multiply_by_ten(element):
    return element * 10.0
my_series.map(multiply_by_ten)

0     10.0
1     20.0
2     30.0
3     40.0
4     50.0
5     60.0
6     70.0
7     80.0
8     90.0
9    100.0
dtype: float64
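To make the "no loops needed" point concrete, here is a small sketch comparing three ways of multiplying every element by ten. The Series `s` is a made-up stand-in for `my_series`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Explicit Python loop over the values.
looped = pd.Series([x * 10.0 for x in s])

# Element-by-element function application with map.
mapped = s.map(lambda x: x * 10.0)

# Vectorized arithmetic: no loop and no per-element function call.
vectorized = s * 10.0

# All three approaches produce the same Series.
print(looped.equals(mapped) and mapped.equals(vectorized))
```

When a plain arithmetic expression does the job, the vectorized form is usually the shortest and the fastest; `map` is most useful when the per-element logic cannot be written as arithmetic on the whole Series.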

### String operations

If your series is a bunch of string data, then you can perform various built-in operations on each element of that series.  

More info about string operations can be found [here](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling). If you're familiar with regex, then you'll be happy to know that it's integrated.

In [20]:
names = ['John','Carol','Corey','Laura','Mike','Jared']
names_series = pd.Series(names)

In [21]:
names_series

0     John
1    Carol
2    Corey
3    Laura
4     Mike
5    Jared
dtype: object

In [23]:
names_series.str.upper()

0     JOHN
1    CAROL
2    COREY
3    LAURA
4     MIKE
5    JARED
dtype: object

In [24]:
names_series.str.contains('J')

0     True
1    False
2    False
3    False
4    False
5     True
dtype: bool
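Since `str.contains` treats its pattern as a regular expression by default, the same call can express more than a plain substring test. The sketch below rebuilds the names Series so it can run on its own; the patterns are illustrative choices only.

```python
import pandas as pd

names = pd.Series(['John', 'Carol', 'Corey', 'Laura', 'Mike', 'Jared'])

# Names that start with 'C' or 'J' (anchored regular expression).
starts_c_or_j = names.str.contains(r'^[CJ]')

# Names that end in a vowel; regex=True is the default and is spelled out for clarity.
ends_in_vowel = names.str.contains(r'[aeiou]$', regex=True)

# A boolean Series can be used directly to filter the original Series.
print(names[starts_c_or_j])
print(names[ends_in_vowel])
```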

## Useful Built-ins

[Here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) is a list of the available functions you can use on a Series object.  You can see that there are a ton of options.  Let's go over a few highly useful ones.

If I start with numerical data, the `describe()` function is usually the first thing I call.

In [14]:
# Give standard statistics from the Series 
my_series.describe()

count    10.00000
mean      5.50000
std       3.02765
min       1.00000
25%       3.25000
50%       5.50000
75%       7.75000
max      10.00000
dtype: float64
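Each row of the `describe()` output can also be computed on its own, which is handy when only one statistic is needed. A minimal sketch, rebuilding the 1 to 10 Series under the name `s`:

```python
import pandas as pd

s = pd.Series(range(1, 11))
summary = s.describe()

mean = s.mean()
std = s.std()             # sample standard deviation (ddof=1), as in describe()
q25 = s.quantile(0.25)    # linear interpolation between data points by default

print(summary['mean'] == mean, summary['25%'] == q25)
```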

In [73]:
# Get the size of the Series, including `NaN`s
my_series.size

10

In [79]:
# Get the size of the Series, excluding `NaN`s
my_series.count()

10
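Both calls return 10 for `my_series` because it has no missing values. The difference only shows up once a `NaN` is present, as in this small sketch with a hypothetical Series `with_gap`:

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([1.0, np.nan, 3.0])

n_total = with_gap.size      # counts every element, NaN included
n_valid = with_gap.count()   # counts only the non-missing elements

print(n_total, n_valid)
```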

In [75]:
# Returns an array of just the values, no index
my_series.values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [76]:
# sum over the entire series
my_series.sum()

55

In [78]:
# Returns a Series object of the cumulative sum of the series
my_series.cumsum() # also has cummax, cummin, cumprod

0     1
1     3
2     6
3    10
4    15
5    21
6    28
7    36
8    45
9    55
dtype: int64

In [82]:
# Difference between each element and the one 3 positions earlier, e.g. my_series[3] - my_series[0]
my_series.diff(3)

0    NaN
1    NaN
2    NaN
3    3.0
4    3.0
5    3.0
6    3.0
7    3.0
8    3.0
9    3.0
dtype: float64
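One way to read `diff(3)` is as the Series minus a copy of itself shifted down by three positions. The sketch below checks that equivalence on a rebuilt copy of the 1 to 10 Series; it also shows why the first three results are `NaN` and why the dtype becomes float.

```python
import pandas as pd

s = pd.Series(range(1, 11))

by_diff = s.diff(3)

# shift(3) moves every value three positions down and pads the top with NaN,
# so the first three subtractions have nothing to subtract from.
by_shift = s - s.shift(3)

print(by_diff.equals(by_shift))
```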

## Slicing the Series object

Often you import a large dataset, but only want a certain part of it for your analysis.  Pandas has made it easy to take subsets, called slices, of your data.

This is somewhat straightforward for a Series object, but these same capabilities used on DataFrames will be much more useful.

In [83]:
# take from a certain index on
my_series[4:]

4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64

In [84]:
# Just get a certain value
my_series.get(4)

5
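With the default integer index, position and label happen to coincide, which hides the difference between the two. The sketch below uses a hypothetical Series with string labels to show the explicit accessors: `iloc` for positions, `loc` for labels, a boolean mask for conditions, and `get` with a fallback value.

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[1:3]    # positions 1 and 2; the end position is excluded
by_label = s.loc['b':'c']    # labels 'b' through 'c'; the end label is included
by_condition = s[s > 6]      # boolean mask keeps only the matching elements
fallback = s.get('z', -1)    # returns the default instead of raising KeyError

print(by_position)
print(by_label)
print(by_condition)
print(fallback)
```

Being explicit with `iloc` and `loc` avoids ambiguity when the index itself is made of integers that are not in 0, 1, 2, ... order.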