# Pandas 02 - Series

by Nova@Douban

A video recording of this session is available here: https://zoom.us/recording/share/hMxWmW7CRemL7wT8495JdIDXFCiyU6TAkMO4fL7J9GOwIumekTziMw?startTime=1546170816000

---

## 2.1 The Series object

### 2.1.1 Concept

A pandas Series:

1. represents a one-dimensional, labeled array;

2. extends a NumPy array by attaching an index to its values.

### 2.1.2 Examples of pandas Series

In [1]:
import numpy as np
import pandas as pd

aray = np.random.randn(6)
aray

array([ 0.46676643, -0.15545763,  0.44000794,  1.88418346,  1.35743695,
       -0.92247118])

In [2]:
srs = pd.Series(aray, index = ['m', 'a', 'f', '9', 'h', 'l'])
srs

m    0.466766
a   -0.155458
f    0.440008
9    1.884183
h    1.357437
l   -0.922471
dtype: float64

In [3]:
print(srs.values)
print(srs.index)

[ 0.46676643 -0.15545763  0.44000794  1.88418346  1.35743695 -0.92247118]
Index(['m', 'a', 'f', '9', 'h', 'l'], dtype='object')
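
To illustrate what the index adds over a plain NumPy array, here is a minimal sketch. It uses fixed values instead of the random data above, and the names `values` and `labeled` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Fixed values so the result is reproducible (the notebook above uses random data).
values = np.array([0.5, -0.2, 1.9])
labeled = pd.Series(values, index=['m', 'a', '9'])

# A NumPy array only supports access by position ...
first_by_position = values[0]

# ... while a Series can be addressed by label as well as by position.
first_by_label = labeled['m']
first_by_iloc = labeled.iloc[0]

print(first_by_position, first_by_label, first_by_iloc)
```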


---


## 2.2 Creating Series

### 2.2.1 Creating from other data structures

A Series can be created and initialized by passing:

1. a scalar value,
2. a NumPy ndarray,
3. a Python list,
4. a Python dict.

### 2.2.2 Examples of creating Series

In [4]:
# from a scalar value
ind = np.random.randn(6)
pd.Series('a', index=ind)

 1.278977    a
-1.642189    a
 0.300328    a
 1.407208    a
 1.018008    a
 0.101313    a
dtype: object

In [5]:
# from a numpy ndarray
aray = np.random.randn(6)
pd.Series(aray)

0   -1.180448
1    0.274229
2   -1.972792
3   -0.108837
4   -0.255784
5    1.419131
dtype: float64

In [6]:
# from a list
lst = [0, 1, 3, 89]
pd.Series(lst)

0     0
1     1
2     3
3    89
dtype: int64

In [7]:
# from a dict
dic = {'a': 9.1, 'i': 0}
pd.Series(dic)

a    9.1
i    0.0
dtype: float64
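
When a dict is combined with an explicit `index`, the dict keys are matched against the requested labels. The sketch below reuses `dic` from the cell above; the label `'z'` and the name `picked` are assumptions added only for illustration.

```python
import pandas as pd

dic = {'a': 9.1, 'i': 0}

# The index selects and reorders the dict keys.
# 'z' is not a key in dic, so its value becomes NaN.
picked = pd.Series(dic, index=['i', 'a', 'z'])
print(picked)
```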

---

### 2.2.3 Index and values of Series

1. By default, the Series object will construct an index automatically using integer values.


2. To specify the index, use the index parameter of the constructor.


3. A Series created from a scalar value repeats that value for every label, so combining it with another Series applies a single value across all of its elements.

### 2.2.4 Examples of index and values of Series

In [8]:
# Set the index when creating the Series
srs = pd.Series(lst, index = np.random.randn(4))
srs

 0.284722     0
-0.761072     1
-1.104793     3
 0.730485    89
dtype: int64

In [9]:
# get the values of srs
srs.values

array([ 0,  1,  3, 89])

In [10]:
# A Series created from a scalar value is useful
scaler = pd.Series(5, index=srs.index)
print(scaler)
print()
print(srs * scaler)
print()
print(srs * 5)

 0.284722    5
-0.761072    5
-1.104793    5
 0.730485    5
dtype: int64

 0.284722      0
-0.761072      5
-1.104793     15
 0.730485    445
dtype: int64

 0.284722      0
-0.761072      5
-1.104793     15
 0.730485    445
dtype: int64


---

## 2.3 Accessing Series

1. `pd.Series.size`: return the number of elements in the underlying data (a property, not a method);
2. `pd.Series.shape`: return a tuple of the shape of the underlying data;
3. `pd.Series.unique()`: return unique values of Series object;
4. `pd.Series.count()`: return number of non-NA/null observations in the Series;
5. `pd.Series.head()`: return the first `n` rows;
6. `pd.Series.tail()`: return the last `n` rows;
7. `pd.Series.take()`: return the elements in the given *positional* indices along an axis;

In [11]:
srs.size

4

In [12]:
srs.shape

(4,)

In [13]:
srs.unique()

array([ 0,  1,  3, 89])

In [14]:
srs.count()

4

In [15]:
srs.head(2)

 0.284722    0
-0.761072    1
dtype: int64

In [16]:
srs.tail(2)

-1.104793     3
 0.730485    89
dtype: int64

In [17]:
srs.take([0, 1], axis=0)

 0.284722    0
-0.761072    1
dtype: int64

---

## 2.4 More about alignment

### 2.4.1 Always start with alignment

Any computation between multiple Series always starts with the alignment of their index labels.

In [18]:
# A Series * a scalar-valued Series vs. a Series * a plain scalar
scaler = pd.Series(5, index=srs.index)
%timeit srs * scaler
%timeit srs * 5

97.1 µs ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
102 µs ± 27.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
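
Alignment means that values are paired by label, not by position. The following sketch uses two small made-up Series (`left` and `right` are hypothetical names) whose labels only partly overlap:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are matched on labels: 'b' pairs 2 with 20, 'c' pairs 3 with 10.
# Labels present in only one Series ('a' and 'd') produce NaN.
total = left + right
print(total)
```

Because of the NaN entries, the result is upcast to `float64` even though both inputs hold integers.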


---

### 2.4.2 Repeated labels in index

If the index contains repeated labels, the result may be surprising.

    Cartesian product: for each duplicated label, the result contains as many rows as the product of the number of times that label occurs in each Series.

In [19]:
ind = [1, 2, 2, 3]
s1 = pd.Series(np.random.randn(4), index=ind)
s2 = pd.Series(np.random.randn(4), index=reversed(ind))
print(s1+s2)
print()
print(s1)
print()
print(s1.to_dict())

1    1.026273
2    1.339302
2    1.551478
2    0.327764
2    0.539940
3    3.797827
dtype: float64

1    0.007499
2    0.788075
2   -0.223463
3    2.548161
dtype: float64

{1: 0.007498765264798723, 2: -0.22346329852722358, 3: 2.548161210676347}
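
The same effect can be reproduced with fixed values, which makes the row count easier to check. This is a sketch with made-up data; `s_left` and `s_right` are hypothetical names.

```python
import pandas as pd

s_left = pd.Series([1, 2, 3], index=['x', 'x', 'y'])
s_right = pd.Series([10, 20], index=['x', 'x'])

# 'x' occurs twice in each Series, so the result has 2 * 2 = 4 rows for 'x'.
# 'y' exists only in s_left, so it yields one NaN row.
combined = s_left + s_right
print(combined)

# Checking for duplicates before computing can prevent the surprise.
print(s_left.index.is_unique)
```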


---

## 2.5 Boolean selection


1. Boolean selection produces a new Series with a copy of the index and values of the selected rows.


2. Passing a Boolean Series to the `[]` operator returns the values of the original Series where the condition is `True`.


3. Chained comparisons such as `0.5 > s1 > 0` do not work with Series; instead, put parentheses around each condition and combine them with `|` and `&`.

In [20]:
ss = (s1 > 0)
print(ss)
print(type(ss))

1     True
2     True
2    False
3     True
dtype: bool
<class 'pandas.core.series.Series'>


In [21]:
s1[(s1 > 0)]

1    0.007499
2    0.788075
3    2.548161
dtype: float64

In [22]:
s1[(s1 > 0)]._is_view

False

In [23]:
# s1[(0.5 > s1 > 0)]

s1[(0.5 > s1)&(s1 > 0)]

1    0.007499
dtype: float64

In [24]:
s1[(0.5 < s1)|(s1 > 0)]

1    0.007499
2    0.788075
3    2.548161
dtype: float64
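
The commented-out line in cell 23 fails because Python turns a chained comparison into an `and` of two comparisons, and a Series has no single truth value. A minimal sketch with fixed data (the name `s` and the flag `raised` are assumptions for this illustration):

```python
import pandas as pd

s = pd.Series([0.1, 0.8, -0.2, 2.5])

# Python evaluates `0 < s < 0.5` as `(0 < s) and (s < 0.5)`,
# which asks for the truth value of a whole Series and raises ValueError.
raised = False
try:
    s[0 < s < 0.5]
except ValueError as err:
    raised = True
    print('chained comparison failed:', err)

# Element-wise operators with parentheses around each condition work.
in_range = s[(s > 0) & (s < 0.5)]

# `~` negates a Boolean Series element-wise.
out_of_range = s[~((s > 0) & (s < 0.5))]
print(in_range)
print(out_of_range)
```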

---

## 2.6 Slicing a Series

Slicing a Series is similar to slicing a list, and in the pandas version used for this session the result is a view, not a copy.

If the Series has n elements, negative values for the start and end of a slice select the elements from position n + start up to, but not including, position n + end.

In [25]:
print(s1)
print()
s3 = s1[0:]
print(s3)

1    0.007499
2    0.788075
2   -0.223463
3    2.548161
dtype: float64

1    0.007499
2    0.788075
2   -0.223463
3    2.548161
dtype: float64


In [26]:
print(s3._is_view)
print()
print(s3.copy()._is_view)

True

False


In [27]:
s3 = s1[-1:]
s3

3    2.548161
dtype: float64

In [28]:
print(s1)
print()
print(s1[::-1])

1    0.007499
2    0.788075
2   -0.223463
3    2.548161
dtype: float64

3    2.548161
2   -0.223463
2    0.788075
1    0.007499
dtype: float64
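
The rule for negative slice bounds can be checked with a small sketch. It uses `.iloc` so that the slice is explicitly positional; the data and the names `last_three`, `middle` and `equivalent` are made up for this illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))
n = len(s)

# Only a negative start: from position n - 3 to the end.
last_three = s.iloc[-3:]

# Negative start and end: positions n - 4 up to, but not including, n - 1.
middle = s.iloc[-4:-1]
equivalent = s.iloc[n - 4:n - 1]

print(last_three)
print(middle)
print(middle.equals(equivalent))
```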


---

## 2.7 Sorting and ranking

Sorting a Series can be based on its index or on its values, and pandas provides a method for each:

`pd.Series.sort_index()`: sort a Series by its index labels and return a new, sorted object

`pd.Series.sort_values()`: sort a Series by its values and return a new, sorted object


Please note: by default, any missing values are sorted to the end of the Series.

In [29]:
ind = [9, 2, 2, 1]
s1 = pd.Series(np.random.randn(4), index=ind)
s1.sort_index()

1   -0.329720
2    0.083856
2    1.009812
9    0.325275
dtype: float64

In [30]:
s1.sort_values()

1   -0.329720
2    0.083856
9    0.325275
2    1.009812
dtype: float64

In [31]:
s3 = pd.Series([9, None, 2, None, 2, 1])
s3.sort_values()

5    1.0
2    2.0
4    2.0
0    9.0
1    NaN
3    NaN
dtype: float64

In [32]:
print(s3)
print()
print(s3.rank())

0    9.0
1    NaN
2    2.0
3    NaN
4    2.0
5    1.0
dtype: float64

0    4.0
1    NaN
2    2.5
3    NaN
4    2.5
5    1.0
dtype: float64
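
The defaults shown above can be changed through keyword arguments. The sketch below uses the same values as `s3` and a few commonly used options (`ascending`, `na_position`, and the `method` argument of `rank`); it is an illustration, not a full list of the available parameters.

```python
import numpy as np
import pandas as pd

s = pd.Series([9, np.nan, 2, np.nan, 2, 1])

# Descending order; missing values still go to the end.
descending = s.sort_values(ascending=False)

# Put missing values first instead.
nan_first = s.sort_values(na_position='first')

# By default, tied values share the average of their ranks (2.5 above);
# method='min' gives every tied value the lowest rank of the group.
rank_min = s.rank(method='min')

print(descending)
print(nan_first)
print(rank_min)
```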


---

## 2.8 Copy vs. view

This warning often appears when we write pandas code:

> /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead


The fundamental cause of this warning is chained indexing, which should be avoided in pandas. To solve this problem, we first need to clarify the difference between a copy and a view in pandas.

View: 

    1. can be regarded as a reference to the original DataFrame / Series;
    2. a modification of the view affects the original DataFrame / Series.


Copy: 

    1. a new DataFrame / Series based on the original DataFrame / Series;
    2. a modification of the copy doesn't affect the original DataFrame / Series.

<img src="../image/view_copy_1.png">

---

<img src="../image/view_copy_2.png">


---

Chained indexing may return a view in some cases and a copy in others, so the original DataFrame might be modified, or left unmodified, without our noticing. This is very dangerous!

We can avoid chained indexing by using the `pd.DataFrame.loc[]` / `pd.DataFrame.iloc[]` indexers.

In [33]:
s3._is_view

False

In [34]:
# Converting a view to a copy

s4 = s3.copy()
s4._is_view

False

In [35]:
# Note: Series.view() is deprecated in recent pandas versions (2.2 and later)
s5 = s4.view()
s5._is_view

True
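
As a sketch of the advice above on avoiding chained indexing, the example below contrasts the chained form (left commented out) with a single `.loc` call and an explicit copy. The DataFrame `df` and the name `subset` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Chained indexing: two [] operations in a row. Whether df itself changes
# depends on whether the first [] returned a view or a copy, and this form
# is what typically triggers SettingWithCopyWarning:
# df[df['a'] > 1]['b'] = 0

# One .loc call selects rows and column in a single step and modifies df.
df.loc[df['a'] > 1, 'b'] = 0

# If an independent object is wanted, ask for a copy explicitly.
subset = df[df['a'] > 1].copy()
subset['b'] = 99

print(df)
print(subset)
```

Making the intent explicit, either "modify the original" via `.loc` or "work on a separate object" via `.copy()`, removes the ambiguity that the warning is about.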

## 2.9 Exercises

1. For a detailed explanation of the chained indexing warning, please read [this great post](https://www.dataquest.io/blog/settingwithcopywarning/)

2. Look up the parameters of the following pandas methods and indexers:

`pd.Series.reindex()`

`pd.Series.sort_values()`

`pd.Series.sort_index()`

`pd.Series.loc[]`

`pd.Series.iloc[]`

3. Check whether each of the following returns a copy or a view:

`pd.Series.reindex()`

`pd.Series.sort_values()`

`pd.Series.sort_index()`

`pd.Series.loc[]`

`pd.Series.iloc[]`
