<h1>Importing Pandas</h1>

In [1]:
# bring in NumPy and pandas
import numpy as np
import pandas as pd

Pandas provides several options that can be set to control the formatting of output.
The notebooks in this book use the following code, or a slight variant of it, to control how output is rendered and to set the maximum number of rows and
columns displayed in the output of any code example.

In [2]:
# Set some pandas options for controlling output display
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
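
The options above are set globally for the session. As a minimal sketch, assuming a temporary override is wanted for a single display, pd.option_context() applies an option only inside a with block, and pd.get_option() and pd.reset_option() read and reset it. The values 10 and 50 are arbitrary choices for illustration:

```python
import pandas as pd

# global setting, as in the cell above
pd.set_option('display.max_rows', 10)

# temporarily allow more rows; the previous value is restored on exit
with pd.option_context('display.max_rows', 50):
    inside = pd.get_option('display.max_rows')

after = pd.get_option('display.max_rows')
print(inside, after)

# return the option to the pandas default
pd.reset_option('display.max_rows')
```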

<h2>Creating Series</h2>

In [3]:
# create one item Series
s1 = pd.Series(2)
s1

0    2
dtype: int64

In [4]:
# get value with label 0
s1[0]

2

In [5]:
# The following example creates a Series from a Python list:

# create a series of multiple items from a list
s2 = pd.Series([1, 2, 3, 4, 5])
s2

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [6]:
# The array of values in the Series can be retrieved using the .values property, as shown
# here:

# get the values in the Series
s2.values

array([1, 2, 3, 4, 5])

In [7]:
# Also, the index of the series can be retrieved with the .index property:

# get the index of the Series
s2.index

RangeIndex(start=0, stop=5, step=1)

In [8]:
# To specify the index at the time of creation of the Series, use the
# index parameter of the constructor.

s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s3

a    1
b    2
c    3
dtype: int64

In [9]:
# The index labels are now strings, so the index has dtype object.
s3.index

Index(['a', 'b', 'c'], dtype='object')

In [10]:
# lookup by label value, not integer position
s3['c']

3

A Series created from a single scalar value is useful because it lets you combine that
value with every element of another Series in a single operation. When creating a Series
object with a scalar and specifying an index with multiple labels, pandas copies the
scalar value to each index label. The following code demonstrates this by
creating a Series with a scalar value and an index taken from an existing Series:

In [11]:
# create Series from an existing index
# scalar value will be copied at each index label
s4 = pd.Series(2, index=s2.index)
s4

0    2
1    2
2    2
3    2
4    2
dtype: int64

In [12]:
# generate a Series from 5 normal random numbers
np.random.seed(123456)
pd.Series(np.random.randn(5))

0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
dtype: float64

In [13]:
# 10 evenly spaced values from 0 through 5
pd.Series(np.linspace(0, 5, 10))

0    0.000000
1    0.555556
2    1.111111
3    1.666667
4    2.222222
5    2.777778
6    3.333333
7    3.888889
8    4.444444
9    5.000000
dtype: float64

In [14]:
# The np.arange() function creates an array of values from a start value
# up to, but not including, a stop value:
# 0 through 8

pd.Series(np.arange(0, 9))

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int64

In [15]:
# Finally, a Series can be directly initialized from a Python dictionary. The keys of the
# dictionary are used as the index labels for the Series:

# create Series from dict
s6 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
s6

a    1
b    2
c    3
d    4
dtype: int64

<h2>Size, shape, uniqueness, and counts of values</h2>

In [16]:
# example series, which also contains a NaN
s = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 7, np.nan])
s

0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    NaN
dtype: float64

In [17]:
# length of the Series
len(s)

10

In [18]:
# .size is also the # of items in the Series
s.size

10

In [19]:
# .shape is a tuple with one value
s.shape

(10,)

In [20]:
# all unique values
s.unique()

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7., nan])

In [21]:
# count of each distinct non-NaN value, most frequent first
s.value_counts()

1.0    2
7.0    1
6.0    1
5.0    1
4.0    1
3.0    1
2.0    1
0.0    1
dtype: int64
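
The NaN in s is left out of the counts above. As a small sketch that rebuilds the same example Series, .nunique() and .count() summarise the non-NaN values, and value_counts(dropna=False) reports NaN as its own entry:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 7, np.nan])

# number of distinct non-NaN values
n_unique = s.nunique()

# number of non-NaN items
n_valid = s.count()

# include NaN as its own category in the counts
counts_with_nan = s.value_counts(dropna=False)

print(n_unique, n_valid)
print(counts_with_nan)
```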

<h2>Peeking at data with heads, tails, and take</h2>

In [22]:
# first five
s.head() # or s.head(15) # 15 first lines

0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
dtype: float64

In [23]:
# last five
s.tail()

5    4.0
6    5.0
7    6.0
8    7.0
9    NaN
dtype: float64

In [24]:
# series with an integer index, but not starting with 0
s5 = pd.Series([1, 2, 3], index=[10, 11, 12])
s5

10    1
11    2
12    3
dtype: int64

To alleviate the potential confusion in determining label-based lookup versus position-based lookup, index label based lookup can be enforced using the .loc[] accessor:

In [25]:
s5

10    1
11    2
12    3
dtype: int64

In [26]:
# Lookup by index label
s5.loc[12]

3

In [27]:
# Lookup by position can be enforced using the .iloc[] accessor:

# forced lookup by location / position
s5.iloc[1]

2

These two options also function using lists, as shown in the following example:
<pre>
In [35]:
# multiple items by label (loc)
s5.loc[[12, 10]]
Out[35]:
12    3
10    1
dtype: int64
In [36]:
# multiple items by location / position (iloc)
s5.iloc[[0, 2]]
Out[36]:
10    1
12    3
dtype: int64
</pre>

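<p>If a location/position passed to .iloc[] in a list is out of bounds, an exception will be
thrown. In older versions of pandas this differed from .loc[], which, if passed a label that
does not exist, returned NaN as the value for that label, as in the output below. Since
pandas 1.0, .loc[] raises a KeyError when any label in the list is missing; .reindex()
returns NaN for missing labels instead:</p>
<pre>
In [37]:
# -1 and 15 will be NaN (older pandas; pandas 1.0 and later raises KeyError here)
s5.loc[[12, -1, 15]]
Out[37]:
12 3
-1 NaN
15 NaN
dtype: float64
</pre>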
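
<p>The lookup with missing labels can be sketched with .reindex(), which returns NaN for any label that is not in the index. The try/except around .loc[] is there because pandas 1.0 and later raise KeyError for the same list; the variable names are illustrative only:</p>

```python
import pandas as pd

s5 = pd.Series([1, 2, 3], index=[10, 11, 12])

# .reindex() returns NaN for labels that are not in the index
looked_up = s5.reindex([12, -1, 15])
print(looked_up)

# .loc[] with a missing label: KeyError on pandas 1.0 and later
try:
    s5.loc[[12, -1, 15]]
    loc_raised = False
except KeyError:
    loc_raised = True
print(loc_raised)
```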

<h4>Note</h4>
<p>When looking to write the highest performance code for accessing items in a Series, it is
recommended that you use the .iloc[] accessor, which looks up items by position.</p>

<h2>Alignment via index labels</h2>

<p>A fundamental difference between a NumPy ndarray and a pandas Series is the ability of
a Series to automatically align data from another Series based on label values before
performing an operation.</p>

In [28]:
s6 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s6

a    1
b    2
c    3
d    4
dtype: int64

In [29]:
s7 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])
s7

d    4
c    3
b    2
a    1
dtype: int64

In [30]:
s6+s7

a    2
b    4
c    6
d    8
dtype: int64

Adding two Series objects differs from adding arrays: pandas first aligns the data on
index label values instead of simply applying the operation to elements in the same
position. This is what makes a Series well suited to combining data by label without
having to order the data manually first.
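
The difference between positional and label-based addition can be sketched by adding the same data both ways. The example rebuilds s6 and s7 from the cells above and uses .values to get at the underlying NumPy arrays:

```python
import numpy as np
import pandas as pd

s6 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s7 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])

# NumPy adds element by element, by position
by_position = s6.values + s7.values

# pandas matches the labels first, then adds
by_label = s6 + s7

print(by_position)      # [5 5 5 5]
print(by_label.values)  # [2 4 6 8]
```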

<h2>Arithmetic Operations</h2>

In [31]:
s3 = pd.Series(np.arange(1,4))
s3

0    1
1    2
2    3
dtype: int64

In [32]:
# multiply all values in s3 by 2
s3 * 2

0    2
1    4
2    6
dtype: int64

The preceding code is also roughly equivalent to the following code, which creates a new
series from a scalar value using the index from s3. It has the same result, but it is not as
efficient, as alignment is performed between the Series objects instead of a simple
vectorization of the multiplication:

In [33]:
# scalar series using s3's index
t = pd.Series(2, s3.index)
s3 * t

0    2
1    4
2    6
dtype: int64

Addition of Series derived from dictionaries

In [34]:
s8 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 5})
s8

a    1
b    2
c    3
d    5
dtype: int64

In [35]:
s9 = pd.Series({'b': 6, 'c': 7, 'd': 9, 'e': 10})
s9

b     6
c     7
d     9
e    10
dtype: int64

In [36]:
# NaN's result for a and e
# demonstrates alignment
s8 + s9

a     NaN
b     8.0
c    10.0
d    14.0
e     NaN
dtype: float64
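
If NaN is not wanted for labels that appear in only one Series, the .add() method accepts a fill_value. A minimal sketch, assuming a missing value should count as 0 (that choice depends on the data):

```python
import pandas as pd

s8 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 5})
s9 = pd.Series({'b': 6, 'c': 7, 'd': 9, 'e': 10})

# treat a label missing from one side as 0 instead of producing NaN
total = s8.add(s9, fill_value=0)
print(total)
```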

Addition of Series with duplicate labels: matching labels are combined as a Cartesian product

In [37]:
s10 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
s10

a    1.0
a    2.0
b    3.0
dtype: float64

In [38]:
# going to add this to s10
s11 = pd.Series([4.0, 5.0, 6.0], index=['a', 'a', 'c'])
s11

a    4.0
a    5.0
c    6.0
dtype: float64

In [39]:
s10 + s11

a    5.0
a    6.0
a    6.0
a    7.0
b    NaN
c    NaN
dtype: float64
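
Because duplicate labels multiply the rows of the result, it can help to check for them before combining. The sketch below tests the index with .is_unique and then collapses duplicates by summing per label; summing is only one possible policy, chosen here for illustration:

```python
import pandas as pd

s10 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])

# True only when every label appears exactly once
labels_unique = s10.index.is_unique
print(labels_unique)

# one possible policy: collapse duplicates by summing per label
collapsed = s10.groupby(level=0).sum()
print(collapsed)
```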

<h2>The special case of Not-A-Number (NaN)</h2>

In [40]:
# mean of numpy array values
nda = np.array([1, 2, 3, 4, 5])
nda.mean()

3.0

In [41]:
# mean of numpy array values with a NaN
nda = np.array([1, 2, 3, 4, np.nan])
nda.mean()

nan

When encountering a NaN value, NumPy simply returns NaN. pandas changes this, so that
NaN values are ignored:

In [42]:
# ignores NaN values
s = pd.Series(nda)
s.mean()

2.5

In this case, pandas overrides the mean function of the Series object so that NaN values are
simply ignored; they are not counted as 0. To get the NumPy behavior, pass skipna=False:

In [43]:
# handle NaN values like NumPy
s.mean(skipna=False)

nan
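
NumPy also has NaN-aware functions. As a short sketch, np.nanmean() gives the same answer as the pandas default, and .isna() with .count() shows how many values were skipped:

```python
import numpy as np
import pandas as pd

nda = np.array([1, 2, 3, 4, np.nan])
s = pd.Series(nda)

# NumPy's NaN-aware mean matches the pandas default
numpy_mean = np.nanmean(nda)
pandas_mean = s.mean()

# counts of missing and present values
n_missing = s.isna().sum()
n_present = s.count()

print(numpy_mean, pandas_mean, n_missing, n_present)
```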

<h2>Boolean selection</h2>

In [44]:
# which rows have values that are > 5?
s = pd.Series(np.arange(0, 10), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
s

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

In [45]:
s > 5

a    False
b    False
c    False
d    False
e    False
f    False
g     True
h     True
i     True
j     True
dtype: bool

In [46]:
# select rows where values are > 5
logicalResults = s > 5
s[logicalResults]

g    6
h    7
i    8
j    9
dtype: int64

In [47]:
# a little shorter version
s[s > 5]

g    6
h    7
i    8
j    9
dtype: int64

In [48]:
# correct syntax
s[(s > 5) & (s < 8)]  # not like this: s[s > 5 and s < 8]

g    6
h    7
dtype: int64
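
The same pattern extends to the other element-wise operators: | for OR and ~ for NOT, each with parentheses around the comparisons. The sketch below also shows why the Python keyword and does not work: it asks for the truth value of a whole Series, which pandas refuses with a ValueError:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10), index=list('abcdefghij'))

either = s[(s < 2) | (s > 7)]   # element-wise OR
negated = s[~(s > 5)]           # element-wise NOT

# the keyword `and` needs a single True/False, not a Series
try:
    s[s > 5 and s < 8]
    and_failed = False
except ValueError:
    and_failed = True

print(either.tolist(), negated.tolist(), and_failed)
```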

It is possible to determine whether all the values in a Series match a given expression
using the .all() method. The following asks if all elements in the series are greater than
or equal to 0:

In [49]:
# are all items >= 0?
(s >= 0).all()

True

The .any() method returns True if any value satisfies the expression. The following asks
if any elements are less than 2:

In [50]:
# any items < 2?
(s < 2).any()  # note: s[s < 2].any() is not equivalent; it tests whether any selected value is nonzero

True

There is something important going on here that is worth mentioning. The result of these
logical expressions is a Boolean selection, a Series of True and False values. The .sum()
method of a Series, when given a Series of Boolean values, treats True as 1 and False
as 0. The following demonstrates using this to determine the number of items in a Series
that satisfy a given expression:

In [51]:
# how many values < 2?
(s < 2).sum()

2

In [52]:
# sum of the values that are < 2
s[s<2].sum()

1
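
Since True counts as 1 and False as 0, the mean of a Boolean Series is the fraction of items that satisfy the expression. A minimal sketch with the same ten values:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10))
mask = s < 2

# number of items that satisfy the expression
count = int(mask.sum())

# fraction of items that satisfy the expression
fraction = float(mask.mean())

print(count, fraction)
```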

In [53]:
s

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

<h2>Reindexing a Series</h2>

In [54]:
# sample series of five items
s = pd.Series(np.random.randn(5))
s

0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64

In [55]:
# change the index
s.index = ['a', 'b', 'c', 'd', 'e']
s

a   -0.173215
b    0.119209
c   -1.044236
d   -0.861849
e   -2.104569
dtype: float64

Now, let’s examine a slightly more practical example. The following code concatenates
two Series objects resulting in duplicate index labels, which may not be desired in the
resulting Series:

In [56]:
# concat copies index values verbatim,
# potentially making duplicates
np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))
combined = pd.concat([s1, s2])
combined

0    0.469112
1   -0.282863
2   -1.509059
0   -1.135632
1    1.212112
2   -0.173215
dtype: float64

To fix this, the following creates a new index for the concatenated result which has
sequential and distinct values.

In [57]:
# reset the index
combined.index = np.arange(0, len(combined))
combined

0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
5   -0.173215
dtype: float64
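
pandas can also do this renumbering itself. As a sketch of two alternatives, pd.concat() takes ignore_index=True to discard the incoming labels, and .reset_index(drop=True) renumbers an existing Series:

```python
import numpy as np
import pandas as pd

np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))

# renumber while concatenating
combined = pd.concat([s1, s2], ignore_index=True)

# or renumber an existing Series, discarding the old labels
renumbered = pd.concat([s1, s2]).reset_index(drop=True)

print(combined.index.tolist())
```

Unlike assigning to .index, both calls return a new Series and leave their inputs untouched.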

<p>Assigning to the .index property modifies the Series in place.</p>
<p>Greater flexibility in creating a new index is provided using the .reindex() method. An
example of the flexibility of .reindex() over assigning the .index property directly is
that the list provided to .reindex() can be of a different length than the number of rows
in the Series:</p>

In [58]:
np.random.seed(123456)
s1 = pd.Series(np.random.randn(4), ['a', 'b', 'c', 'd'])
s1

a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
dtype: float64

In [59]:
# reindex with different number of labels
# results in dropped rows and/or NaN's
s2 = s1.reindex(['a', 'c', 'g'])
s2

a    0.469112
c   -1.509059
g         NaN
dtype: float64

The default action of inserting NaN as a missing value during reindexing can be changed by using the fill_value parameter of the method. The following example demonstrates using 0 instead of NaN:

In [60]:
# fill with 0 instead of NaN
s2.reindex(['a', 'f'], fill_value=0)

a    0.469112
f    0.000000
dtype: float64

There are several things here that are important to point out about .reindex(). First, the
result of .reindex() is a new Series. This new Series has an index with the
labels that are provided as the parameter to .reindex(). For each item in the given
parameter list, if the original Series contains that label, then the value is assigned to that
label. If the label does not exist in the original Series, pandas assigns a NaN value. Rows
in the Series whose label is not specified in the parameter of .reindex() are not included in
the result.
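
These points can be checked directly. The sketch below uses made-up values rather than the random ones above, and confirms that the original Series keeps its index while the result has exactly the requested labels:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# returns a new Series; s1 itself is not modified
s2 = s1.reindex(['a', 'c', 'g'])

print(s1.index.tolist())   # original labels are unchanged
print(s2.index.tolist())   # only the requested labels
print(s2.isna().tolist())  # 'g' was not in s1, so it is NaN
```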

In [61]:
# different types for the same values of labels
# causes big trouble
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
s1 + s2

0   NaN
1   NaN
2   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

Once this situation is identified, it is fairly simple to fix by casting the index labels
of the second Series to integers:

In [62]:
# reindex by casting the label types
# and we will get the desired result
s2.index = s2.index.values.astype(int)
s2

0    3
1    4
2    5
dtype: int64

In [63]:
s1 + s2

0    3
1    5
2    7
dtype: int64

The following example demonstrates forward filling, often referred to as “last known
value.” The Series is reindexed to create a contiguous integer index, and with the
method='ffill' parameter, each new index label is assigned the value of the closest
preceding label in the original Series. The fill compares index labels only, so a NaN
already stored in the Series is not filled:

In [64]:
# create example to demonstrate fills
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
s3

0      red
3    green
5     blue
dtype: object

In [65]:
# forward fill example
s3.reindex(np.arange(0,7), method='ffill')

0      red
1      red
2      red
3    green
4    green
5     blue
6     blue
dtype: object

In [66]:
# The following example fills backward using method='bfill':

# backwards fill example
s3.reindex(np.arange(0,7), method='bfill')

0      red
1    green
2    green
3    green
4     blue
5     blue
6      NaN
dtype: object
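
Filling during .reindex() looks only at index labels, not at the stored values. A minimal sketch with a made-up Series that already holds a NaN shows the difference from reindexing first and calling .ffill() afterwards:

```python
import numpy as np
import pandas as pd

# label 2 already holds a NaN
s = pd.Series([1.0, np.nan, 3.0], index=[0, 2, 4])

# fill while reindexing: new labels copy the preceding label's value,
# so the existing NaN at label 2 is kept and also copied to label 3
filled_on_reindex = s.reindex(np.arange(0, 6), method='ffill')

# reindex first, then fill every NaN from the last non-NaN value
filled_afterwards = s.reindex(np.arange(0, 6)).ffill()

print(filled_on_reindex)
print(filled_afterwards)
```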

<h2>Modifying a Series in-place</h2>

A new item can be added to a Series by assigning a value to an index label that does not
already exist. The following code creates a Series object and adds a new item to the
series:

In [67]:
# generate a Series to play with
np.random.seed(123456)
s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
s

a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [68]:
# add a new item by assigning to a label that does not exist yet
# this is done in-place
# a new Series is not returned
s['d'] = 100
s

a      0.469112
b     -0.282863
c     -1.509059
d    100.000000
dtype: float64

Items can be removed from a Series using the del statement and giving the index
label to be removed. The following code removes the item at index label 'a':

In [69]:
# remove a row / item
del s['a']
s

b     -0.282863
c     -1.509059
d    100.000000
dtype: float64
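
When the original Series should stay intact, .drop() is an alternative: it returns a new Series without the given label or list of labels. A short sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# returns a new Series; s itself is not modified
without_a = s.drop('a')
without_ab = s.drop(['a', 'b'])

print(s.index.tolist())
print(without_a.index.tolist())
print(without_ab.index.tolist())
```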

<h2>Slicing a Series</h2>

Just like NumPy arrays, you can pass a slice object to the []
operator of the Series to get the specified values. Slices also work with the .loc[]
and .iloc[] accessors. (The older .ix accessor was removed in pandas 1.0.)

In [70]:
# a Series to use for slicing
# using index labels not starting at 0 to demonstrate
# position based slicing
s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))
s

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
18    108
19    109
dtype: int64

In [71]:
# items at position 0, 2, 4
s[0:6:2]

10    100
12    102
14    104
dtype: int64

In [72]:
# equivalent to
s.iloc[[0, 2, 4]]

10    100
12    102
14    104
dtype: int64

In [73]:
# first five by slicing, same as .head(5)
s[:5]

10    100
11    101
12    102
13    103
14    104
dtype: int64

In [74]:
# position 4 (the fifth item) to the end
s[4:]

14    104
15    105
16    106
17    107
18    108
19    109
dtype: int64

In [75]:
# every other item in the first five positions
s[:5:2]

10    100
12    102
14    104
dtype: int64

In [76]:
# every other item starting at position 4
s[4::2]

14    104
16    106
18    108
dtype: int64

In [77]:
# reverse the Series
s[::-1]

19    109
18    108
17    107
16    106
15    105
14    104
13    103
12    102
11    101
10    100
dtype: int64

In [78]:
# every other starting at position 4, in reverse
s[4::-2]

14    104
12    102
10    100
dtype: int64

In [79]:
# :-2, which means positions 0 up to, but not including, (10-2) [8]
s[:-2]

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
dtype: int64

In [80]:
# last three items of the series
s[-3:]

17    107
18    108
19    109
dtype: int64

In [81]:
# equivalent to s.tail(4).head(3)
s[-4:-1]

16    106
17    107
18    108
dtype: int64

An important thing to keep in mind when using slicing is that, in pandas versions without
Copy-on-Write enabled, the result of the slice is a view into the original Series, so
modifying values through the slice also modifies the original Series. With Copy-on-Write,
which pandas 3.0 makes the default, the original is left unchanged. Consider the following
example, which selects the first two elements in the Series and stores them in a new variable:

In [82]:
copy = s.copy() # preserve s
slice = copy[:2] # slice with first two rows
slice

10    100
11    101
dtype: int64

Now, the assignment of a value to an element of a slice will change the value in the
original Series:

In [83]:
# change item with label 11 to 1000
slice[11] = 1000
# and see it in the source
copy

10     100
11    1000
12     102
13     103
14     104
15     105
16     106
17     107
18     108
19     109
dtype: int64
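
Whichever behaviour the installed pandas has, an explicit .copy() on the slice keeps later assignments away from the original. A minimal sketch, using .iloc[] for the positional slice and an illustrative variable name:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))

# .copy() makes the slice independent of s
first_two = s.iloc[:2].copy()
first_two.loc[11] = 1000

print(first_two.loc[11])  # 1000
print(s.loc[11])          # 101, the original is untouched
```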

Slicing can be performed on Series objects with a noninteger index. The following
Series will be used to demonstrate this:

In [84]:
# used to demonstrate the next two slices
s = pd.Series(np.arange(0, 5),
              index=['a', 'b', 'c', 'd', 'e'])
s

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [85]:
# integer slices select by position, since the index labels are strings
s[1:3]

b    1
c    2
dtype: int64

In [86]:
# this slices by the strings in the index; the end label 'd' is included
s['b':'d']

b    1
c    2
d    3
dtype: int64
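
The two slices above differ in how they treat the end point: a label slice includes the end label, while a position slice stops before the end position. The sketch below makes the choice explicit with .loc[] and .iloc[]:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 5), index=['a', 'b', 'c', 'd', 'e'])

# label slice: both 'b' and 'd' are included
by_label = s.loc['b':'d']

# position slice: position 3 is excluded
by_position = s.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```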