## Introduction
### What is a Series object?
* The Series is the primary building block of pandas.
* A Series represents a one-dimensional, labeled array built on the NumPy ndarray.
* Like an array, a Series can hold zero or more values of any single data type.
* It enables accessing elements through labels instead of integer positions.
* A Series always has an index, even if one is not specified.
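
As a small sketch of the last two points, the following builds one Series with explicit labels and one without, then inspects the index pandas assigns by default. The names `prices` and `unlabeled` and the data are made up for illustration.

```python
import pandas as pd

# A Series with explicit string labels (hypothetical data)
prices = pd.Series([1.5, 2.0, 0.75], index=['apple', 'pear', 'plum'])

# Access an element by label rather than by integer position
pear_price = prices['pear']

# No index given: pandas supplies a default integer index 0..n-1
unlabeled = pd.Series([1.5, 2.0, 0.75])

print(pear_price)
print(list(unlabeled.index))
```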

## Creating Series

In [18]:
# Importing
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

# Set some pandas options for controlling output display
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)


In [19]:
# A Series object can be constructed from a NumPy array, a dictionary or a list
s0 = pd.Series({'cat':0, 'dog':1, 'cow':5})
s0 = pd.Series([0,1,5], index= ['cat', 'dog', 'cow']) # If no index is given, pandas assigns integer labels automatically
s0

cat    0
dog    1
cow    5
dtype: int64

In [20]:
s0.values

array([0, 1, 5], dtype=int64)

In [21]:
s0.index

Index(['cat', 'dog', 'cow'], dtype='object')

In [22]:
s0['cow']

5

In [25]:
# Creating a Series with scalar value
s1 = pd.Series(2, index=s0.index)
s1

cat    2
dog    2
cow    2
dtype: int64

In [26]:
# Generate a Series from 5 normal random numbers
np.random.seed(123456)
pd.Series(np.random.randn(5))

0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
dtype: float64

In [29]:
# The np.linspace() function creates an array of evenly spaced values between two specified values (inclusive)
pd.Series(np.linspace(0,9,9)) # Third param is the number of values

0    0.000
1    1.125
2    2.250
3    3.375
4    4.500
5    5.625
6    6.750
7    7.875
8    9.000
dtype: float64

Notice the dtype: np.linspace() produces floating-point values, so the resulting Series is float64.

In [28]:
# Likewise, the np.arange() function creates an array of values between two specified values (the end value is excluded)
# 0 through 8
pd.Series(np.arange(0, 9))

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int32

## Size, shape, uniqueness, and counts of values

In [35]:
# example series, which also contains a NaN
s = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 7, np.nan])
s

0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    NaN
dtype: float64

In [40]:
len(s) # Length of the Series
s.size # Number of items
s.shape # Returns a tuple whose first item is the number of items
s.count() # Returns the number of non-NaN values
s.unique() # Returns a NumPy array with all the unique values (NaN included)
s.value_counts() # Count of each non-NaN value, sorted from most to least frequent

1.0    2
7.0    1
6.0    1
5.0    1
4.0    1
3.0    1
2.0    1
0.0    1
dtype: int64

## Peeking at data with heads, tails, and take

In [41]:
# first five
s.head()

0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
dtype: float64

In [42]:
# first three
s.head(3)

0    0.0
1    1.0
2    1.0
dtype: float64

In [43]:
# last five
s.tail()

# last 3
s.tail(3)

7    6.0
8    7.0
9    NaN
dtype: float64

In [44]:
# Return the elements in the given positional indices
s.take([0, 3, 9]) 

0    0.0
3    2.0
9    NaN
dtype: float64

## Looking up values in Series


In [47]:
s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s3

a    1
b    2
c    3
dtype: int64

In [46]:
# single item lookup
s3['a']

1

In [52]:
# lookup by position since the index is not an integer
# (recent pandas versions deprecate this fallback; s3.iloc[1] is the explicit form)
s3[1]

2

In [50]:
# force lookup by index label, even when the labels are integer values
# s3.loc[12]

# forced lookup by location / position
s3.iloc[1]

2

In [51]:
# multiple items
s3[['a', 'c']]
s3.loc[['a', 'c']]
s3.iloc[[0, 2]]

a    1
c    3
dtype: int64

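<b>Notice that</b>: if a location/position passed to .iloc[] in a list is out of bounds, an exception will
be thrown. In older versions of pandas, .loc[] behaved differently and returned NaN for a label that does not
exist. Since pandas 1.0, passing a list containing a missing label to .loc[] raises a KeyError as well; use
.reindex() when NaN for missing labels is what you want.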

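Older versions of pandas also gave a Series an .ix property that looked up items either by label or
by zero-based array position. It was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc[] and .iloc[] instead.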
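
The following is a short sketch of how out-of-bounds positions and missing labels behave, assuming pandas 1.0 or later is installed. It reuses `s3` from above; the variable names `iloc_error`, `loc_error` and `with_missing` are introduced here for illustration only.

```python
import pandas as pd

s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# .iloc with an out-of-bounds position in a list raises IndexError
try:
    s3.iloc[[0, 5]]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

# .loc with a missing label in a list raises KeyError (pandas 1.0 and later)
try:
    s3.loc[['a', 'z']]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# .reindex() returns NaN for labels that do not exist
with_missing = s3.reindex(['a', 'z'])

print(iloc_error, loc_error)
print(with_missing)
```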

## Alignment via index labels

In [53]:
s6 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s6

a    1
b    2
c    3
d    4
dtype: int64

In [54]:
s7 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])
s7

d    4
c    3
b    2
a    1
dtype: int64

In [55]:
# add them
s6 + s7

a    2
b    4
c    6
d    8
dtype: int64

This becomes especially powerful
when combining data from several pandas Series: values are matched by label, so you do not have to
order the data manually first.
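
To make the contrast concrete, here is a minimal sketch that adds the same two Series once by label and once by position. Converting to NumPy arrays with `.to_numpy()` is just one way to drop the labels, chosen here for illustration.

```python
import pandas as pd

s6 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s7 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])

# pandas matches values by label, so 'a' pairs with 'a', and so on
by_label = s6 + s7

# NumPy arrays carry no labels, so the same data is added by position
by_position = s6.to_numpy() + s7.to_numpy()

print(by_label.tolist())     # [2, 4, 6, 8]
print(by_position.tolist())  # [5, 5, 5, 5]
```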

## Arithmetic operations
Arithmetic operations can be applied either to a single Series or between two Series objects.

When applied to a single Series, the operation is applied to all of the values in that Series.

<b>Note that:</b> when an arithmetic operation is applied across two Series objects, alignment by index label is performed first.

In [56]:
# multiply all values in s3 by 2
s3 * 2

a    2
b    4
c    6
dtype: int64

In [57]:
s8 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 5})
s9 = pd.Series({'b': 6, 'c': 7, 'd': 9, 'e': 10})
s8 + s9

a     NaN
b     8.0
c    10.0
d    14.0
e     NaN
dtype: float64

Tasks performed with pandas Series (and DataFrame) objects often require multiple sets of
data to be aligned, and the operation should not fail when a label has no match during alignment.
Hence, pandas returns NaN in those situations.


This is common in practice, as datasets used in statistical, financial, and data
science domains are often incomplete.
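
When NaN is not the desired result for unmatched labels, one option is the `Series.add()` method with its `fill_value` parameter. The sketch below reuses `s8` and `s9` from above and treats a missing side as 0; whether 0 is a sensible substitute depends on the data.

```python
import pandas as pd

s8 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 5})
s9 = pd.Series({'b': 6, 'c': 7, 'd': 9, 'e': 10})

# The + operator leaves NaN where a label exists on only one side
plain = s8 + s9

# Series.add() with fill_value substitutes 0 for the missing side
filled = s8.add(s9, fill_value=0)

print(plain.isna().sum())
print(filled)
```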

### Note: duplicate index labels

In [59]:
s10 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
s10

a    1.0
a    2.0
b    3.0
dtype: float64

In [61]:
s11 = pd.Series([4.0, 5.0, 6.0], index=['a', 'a', 'c'])
s11

a    4.0
a    5.0
c    6.0
dtype: float64

In [62]:
s10 + s11

a    5.0
a    6.0
a    6.0
a    7.0
b    NaN
c    NaN
dtype: float64

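The reason for this is that during alignment, pandas pairs every occurrence of a label in one Series
with every occurrence of the same label in the other, which amounts to a Cartesian product for duplicated labels.

Each combination of values for 'a' in both Series is
computed, resulting in the four values: 1+4, 1+5, 2+4 and 2+5. Labels 'b' and 'c' each appear in only one Series, so their results are NaN.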
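
A quick way to see the effect, and to check for it before aligning, is sketched below. It reuses `s10` and `s11`; `Index.is_unique` is used here as a simple guard, which is a suggestion rather than something the original example does.

```python
import pandas as pd

s10 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
s11 = pd.Series([4.0, 5.0, 6.0], index=['a', 'a', 'c'])

total = s10 + s11

# Two 'a' rows on each side give 2 x 2 = 4 'a' rows in the result
a_rows = total.loc['a']

# Index.is_unique reports whether an index contains duplicate labels
print(len(a_rows), s10.index.is_unique, s11.index.is_unique)
```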

### The special case of Not-A-Number (NaN)

In [63]:
# NumPy propagates NaN: the mean of an array containing NaN is NaN
nda = np.array([1, 2, 3, 4, np.nan])
nda.mean()


nan

In [64]:
# pandas ignores NaN values by default (skipna=True)
s = pd.Series(nda)
s.mean()

2.5

NaN values are simply ignored. They are not counted as a 0 value: the mean above is (1 + 2 + 3 + 4) / 4 = 2.5.

pandas
expects that data will be missing, and that you will "tidy" the data over progressive
iterations; until then, you can still produce analysis with data that is not
tidy.
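
The two behaviours can be switched on either side. The sketch below compares them using `np.nanmean()` and the `skipna` parameter of `Series.mean()`; the variable names are introduced here for illustration.

```python
import numpy as np
import pandas as pd

nda = np.array([1, 2, 3, 4, np.nan])
s = pd.Series(nda)

numpy_mean = nda.mean()                # NaN propagates through the mean
numpy_nanmean = np.nanmean(nda)        # NumPy's NaN-aware variant
pandas_mean = s.mean()                 # skipna=True is the default
pandas_strict = s.mean(skipna=False)   # opt in to NaN propagation

print(numpy_mean, numpy_nanmean, pandas_mean, pandas_strict)
```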

## Boolean selection
A Boolean selection applies a logical expression to
the values of the Series and returns a new Series of Boolean values representing the
result for each value.

In [68]:
s = pd.Series(np.arange(0, 10))
s[s > 5]

6    6
7    7
8    8
9    9
dtype: int32

pandas performs this Boolean selection by overloading the Series object's []
operator so that, when passed a Series of Boolean values, it
returns only the values of the outer Series whose corresponding Boolean value is True.
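
The two steps can be written out separately, as in this small sketch. The intermediate names `mask` and `selected` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10))

# Step 1: the comparison yields a Boolean Series with the same index as s
mask = s > 5

# Step 2: indexing with the mask keeps only the rows where it is True
selected = s[mask]

print(mask.tolist())
print(selected.tolist())
```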

In [70]:
# Multiple conditions

# s[s > 5 and s < 8] # commented out: Python's `and` raises a ValueError on Series

# correct syntax: element-wise & with each condition in parentheses
s[(s > 5) & (s < 8)]


6    6
7    7
dtype: int32

In [71]:
# are all items >= 0?
(s >= 0).all()

True

In [72]:
# is any item < 2?
(s < 2).any()

True

In [73]:
# how many values < 2?
(s < 2).sum()

2

## Reindexing a Series
Reindexing makes the data in a Series conform to a given set of labels. It can involve:
1. Reordering existing data to match a set of labels.
2. Inserting NaN markers where no data exists for a label.
3. Optionally, filling missing data for a label using some type of logic (the default is to insert NaN).

In [77]:
# sample series of five items
s = pd.Series(np.random.randn(5), index = ['a', 'b', 'c', 'd', 'e'])
s

a   -0.673690
b    0.113648
c   -1.478427
d    0.524988
e    0.404705
dtype: float64

In [78]:
# The following code concatenates two Series objects resulting in duplicate index labels
np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))
combined = pd.concat([s1, s2])
combined

0    0.469112
1   -0.282863
2   -1.509059
0   -1.135632
1    1.212112
2   -0.173215
dtype: float64

In [79]:
# reset the index
combined.index = np.arange(0, len(combined))
combined

0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
5   -0.173215
dtype: float64

In [80]:
np.random.seed(123456)
s1 = pd.Series(np.random.randn(4), ['a', 'b', 'c', 'd'])
 
# reindex with different number of labels
# results in dropped rows and/or NaN's
s2 = s1.reindex(['a', 'c', 'g'])
s2

a    0.469112
c   -1.509059
g         NaN
dtype: float64

Note that:
* assigning to .index throws an exception if the number of labels does not match the length of the Series.
* the result of the .reindex() method is a new Series.
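
Both points can be checked with a short sketch. The data and the label list `['x', 'y']` are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Assigning an index of the wrong length raises ValueError
try:
    s1.index = ['x', 'y']
    index_error = None
except ValueError as exc:
    index_error = type(exc).__name__

# .reindex() returns a new Series and leaves s1 untouched
s2 = s1.reindex(['a', 'c', 'g'])

print(index_error)
print(list(s1.index), list(s2.index))
```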

In [81]:
# labels that look the same but have different types (int vs. str)
# do not align, so every result is NaN
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
s1 + s2

0   NaN
1   NaN
2   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

In [82]:
# cast the labels of s2 to int so they match s1, and we get the desired result
s2.index = s2.index.values.astype(int)
s1 + s2

0    3
1    5
2    7
dtype: int64

In [83]:
# fill with 0 instead of NaN
s2 = s.copy()
s2.reindex(['a', 'f'], fill_value=0)

a   -0.67369
f    0.00000
dtype: float64

In [84]:
# create example to demonstrate fills
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
s3

0      red
3    green
5     blue
dtype: object

In [85]:
# forward fill example
# often referred to as "last known value." 

s3.reindex(np.arange(0,7), method='ffill')

0      red
1      red
2      red
3    green
4    green
5     blue
6     blue
dtype: object

In [86]:
# backwards fill example
s3.reindex(np.arange(0,7), method='bfill')

0      red
1    green
2    green
3    green
4     blue
5     blue
6      NaN
dtype: object

## Modifying a Series in-place

In [87]:
# generate a Series to play with
np.random.seed(123456)
s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
s

a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [88]:
# add a new item by assigning to a label that does not exist yet
# (assigning to an existing label changes its value)
# this is done in-place; a new Series is not returned
s['d'] = 100
s

a      0.469112
b     -0.282863
c     -1.509059
d    100.000000
dtype: float64

In [89]:
# remove a row / item
del s['a']
s

b     -0.282863
c     -1.509059
d    100.000000
dtype: float64

## Slicing a Series


In [90]:
s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))
s

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
18    108
19    109
dtype: int32

In [92]:
# items at position 0, 2, 4
s[0:6:2]

# equivalent to
s.iloc[[0, 2, 4]]


10    100
12    102
14    104
dtype: int32

In [93]:
# first five by slicing, same as .head(5)
s[:5]

10    100
11    101
12    102
13    103
14    104
dtype: int32

In [94]:
# from position 4 (the fifth item) to the end
s[4:]


14    104
15    105
16    106
17    107
18    108
19    109
dtype: int32

In [95]:
# every other item starting at position 4
s[4::2]

14    104
16    106
18    108
dtype: int32

In [96]:
# reverse the Series
s[::-1]

19    109
18    108
17    107
16    106
15    105
14    104
13    103
12    102
11    101
10    100
dtype: int32

In [97]:
# every other starting at position 4, in reverse
s[4::-2]

14    104
12    102
10    100
dtype: int32

If the Series has n elements, then negative values for the start and end of a slice represent the elements
from position n + start up to, but not including, position n + end.
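
The rule can be verified directly. This sketch uses `.iloc` to make the positional intent explicit, which is a choice made here rather than something the surrounding cells do.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))
n = len(s)

# A negative bound k is interpreted as position n + k
negative = s.iloc[-4:-1]
explicit = s.iloc[n - 4:n - 1]

print(negative.equals(explicit))
print(negative.tolist())
```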

In [99]:
# :-2 means positions 0 up to, but not including, 10 - 2 = 8
s[:-2]

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
dtype: int32

In [100]:
# last three items of the series
s[-3:]

17    107
18    108
19    109
dtype: int32

In [101]:
# equivalent to s.tail(4).head(3)
s[-4:-1]

16    106
17    107
18    108
dtype: int32

An important thing to keep in mind when using slicing is that the result of the slice
can be a view into the original Series. In that case, modifying values through the result
of the slice also modifies the original Series. With Copy-on-Write enabled (the default from
pandas 3.0), a slice behaves like a copy instead.

In [103]:
original = s.copy() # preserve s
first_two = original[:2] # slice with the first two rows

# change the item with label 11 to 1000
first_two[11] = 1000

# and see it in the source
original

10     100
11    1000
12     102
13     103
14     104
15     105
16     106
17     107
18     108
19     109
dtype: int32

Keep this in mind: views are powerful, but if you were expecting
slicing to return a copy of the data, you will likely end up tracking down
some bugs in the future.
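
If a slice should not be tied to the original Series, calling `.copy()` on it is one way to guard against surprises. The following is a minimal sketch; the name `first_two` is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))

# .copy() detaches the slice from s, whatever the pandas version
first_two = s.iloc[:2].copy()
first_two.loc[11] = 1000

print(first_two.tolist())
print(s.loc[11])
```

Here the change is visible only in `first_two`; the value stored under label 11 in `s` stays as it was.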

In [104]:
# With a non-integer index, it is also possible to slice using values of the same type as
# the index
# this slices by the strings in the index; unlike positional slices, the end label is included
s = pd.Series(np.arange(0, 5), index=['a', 'b', 'c', 'd', 'e'])

s['b':'d']

b    1
c    2
d    3
dtype: int32
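
The difference between label-based and position-based slicing can be seen side by side in this sketch, which uses `.loc` and `.iloc` to state the intent explicitly.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 5), index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: both endpoints are included
by_label = s.loc['b':'d']

# Position-based slice: the end position is excluded
by_position = s.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```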