In this lecture we're going to explore the pandas Series structure. By the end of this lecture you should be 
familiar with how to store and manipulate single dimensional indexed data in the Series object.

The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary.
The items are all stored in order, and there are labels with which you can retrieve them. An easy way to 
visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary, while the
second is your actual data. It's important to note that the data column has a label of its own, which can be 
retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to 
merging multiple columns of data, which we'll talk about later in the course.
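
As a rough illustration of the "two columns plus a name" picture, here is a minimal sketch. The course names and enrollment counts are made-up values, not data from this lecture.

```python
import pandas as pd

# Hypothetical data: course names form the index, enrollment counts are the values,
# and the data column itself is given the label "enrollment".
enrollment = pd.Series([30, 25, 40],
                       index=['Physics', 'Chemistry', 'English'],
                       name='enrollment')

print(enrollment.index.tolist())   # the labels, much like dictionary keys
print(enrollment.name)             # the label of the data column
print(enrollment['Chemistry'])     # look a value up by its label
```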

In [1]:
import pandas as pd

The simplest way to build a Series is to pass in an array-like object, such as a list. When you do this, pandas automatically assigns an index starting at 0.

In [3]:
students = ['Alice', 'Jack', 'Molly']
pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

The result is a Series object which is nicely rendered to the screen. We see here that pandas has automatically identified the type of data in this Series as "object" and set the dtype parameter as appropriate. We see that the values are indexed with integers, starting at zero.

We don't have to use strings. If we passed in a list of whole numbers, for instance, we would see that pandas sets the type to int64. Underneath, pandas stores Series values in a typed array using the NumPy library. This offers a significant speedup when processing data versus traditional Python lists.

In [4]:
# Let's create a little list of numbers
numbers = [1, 2, 3]

# And turn that into a series
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64
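
To peek at the typed NumPy array that sits underneath a Series, one option is the to_numpy() method. This is only a small sketch of the idea; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# to_numpy() returns the values as a NumPy array
values = s.to_numpy()

print(type(values))              # a numpy.ndarray
print(values.dtype == s.dtype)   # the array shares the dtype of the Series
```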

How do NumPy and pandas handle missing data?

In Python, we have the None type to indicate a lack of data. But what do we do if we want to have a typed list like we do in the Series object?

Underneath, pandas does some type conversion. If we create a list of strings and one of the elements is None, pandas inserts it as a None and uses the type object for the underlying array. 

In [5]:
# Let's recreate our list of students, but leave the last one as a None
students = ['Alice', 'Jack', None]

# And let's convert this to a series
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [6]:
# However, if we create a list of numbers, integers or floats, and put in the None type,
# pandas automatically converts this to a special floating point value designated as NaN, 
# which stands for "Not a Number".

# So let's create a list with a None value in it
numbers = [1, 2, None]
# And turn that into a series
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

You'll notice a couple of things. 

First, NaN is a different value from None. 

Second, pandas set the dtype of this series to floating point numbers instead of object or ints. That's maybe a bit of a surprise - why not just leave this as an integer? Underneath, pandas represents NaN as a floating point number, and because integers can be typecast to floats, pandas went and converted our integers to floats. So when you're wondering why the list of integers you put into a Series became floats, it's probably because there is some missing data.

For those who might not have done scientific computing in Python before, it is important to stress that None and NaN might be used by the data scientist in the same way, to denote missing data, but that underneath they are not represented by pandas in the same way.

NaN is NOT equivalent to None, and when we try the equality test, the result is False.

In [8]:
# Let's bring in numpy, which allows us to generate a NaN value
import numpy as np

# And let's compare it to None
np.nan == None

False

In [9]:
# It turns out that you actually can't do an equality test of NaN against itself. When you do, 
# the answer is always False. 

np.nan == np.nan

False

In [10]:
# Instead, you need to use special functions to test for the presence of not a number, 
# such as NumPy's isnan() function.

np.isnan(np.nan)

True

NaN is similar to None, but it's a numeric value and treated differently for efficiency reasons.
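
Because None and NaN can both stand for missing data, it can be handy to test for either one the same way. The sketch below uses the Series isna() method for that; the two small Series are made up for illustration.

```python
import pandas as pd

names = pd.Series(['Alice', 'Jack', None])   # missing value among strings
scores = pd.Series([1, 2, None])             # missing value among numbers

# isna() flags missing entries regardless of how they are stored underneath
print(names.isna().tolist())
print(scores.isna().tolist())

# The numeric Series was converted to floats to make room for NaN
print(scores.dtype)
```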

Let's talk more about how a pandas Series can be created. While a list might be a common way to create some data, often you have labeled data that you want to manipulate. A Series can be created directly from dictionary data. If you do this, the index is automatically set to the keys of the dictionary that you provided rather than to incrementing integers.

In [12]:
# Here's an example using some data of students and their classes.

students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}
s = pd.Series(students_scores)
s

# We see that, since it was string data, pandas set the data type of the series to "object".
# We see that the index, the first column, is also a list of strings.

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [13]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [15]:
# Now, this is kind of interesting. The dtype of object is not just for strings, but for
# arbitrary objects. Let's create a more complex type of data, say, a list of tuples.

students = [("Alice","Brown"), ("Jack", "White"), ("Molly", "Green")]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [17]:
# Another way to create a Series with a custom index is to pass the index in explicitly

s = pd.Series(['Physics', 'Chemistry', 'English'], index=['Alice', 'Jack', 'Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [18]:
# When building a Series, if the index you provide doesn't match the keys of the dictionary, pandas only
# considers the index values you supplied. That is, it ignores keys that are in the dictionary but not in
# the index list, and for index values that are missing from the dictionary it fills in NaN or None.

student_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_scores, index=['Alice', 'Jack', 'Sam'])
s

Alice      Physics
Jack     Chemistry
Sam            NaN
dtype: object
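
The same rule can be sketched with numeric data, where the missing entry also forces the float conversion described earlier. The scores here are hypothetical.

```python
import pandas as pd

# Hypothetical numeric scores keyed by student name
scores = {'Alice': 90, 'Jack': 80, 'Molly': 70}

# 'Molly' is in the dictionary but not in the index, so she is dropped;
# 'Sam' is in the index but not in the dictionary, so he gets a missing value.
s = pd.Series(scores, index=['Alice', 'Jack', 'Sam'])

print(s)
print('Molly' in s.index)   # dropped key
print(s.isna()['Sam'])      # missing value for the unmatched label
```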

# Querying a Series

A pandas Series can be queried either by the index position or the index label. If you don't give an index to the Series when creating it, the position and the label are effectively the same values. To query by numeric location, starting at zero, use the iloc attribute. To query by the index label, use the loc attribute. 

In [19]:
# Let's start with an example. We'll use students enrolled in classes coming from a dictionary

import pandas as pd
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [20]:
# So, for this series, if you wanted to see the fourth entry, you would use the iloc attribute with the parameter 3.

s.iloc[3]

'History'

In [21]:
# If you wanted to see what class Molly has, we would use the loc attribute with a parameter of Molly.

s.loc['Molly']

'English'

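Note that iloc and loc are not methods but attributes, so you don't use parentheses () with them. Instead, you use square brackets [], the indexing operator.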

In [23]:
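# Pandas tries to make our code a bit more readable and provides a sort of smart syntax using 
# the indexing operator directly on the series itself. For instance, if you pass in an integer parameter
# and the index labels are not integers, the operator behaves as if you wanted to query via the iloc attribute.
# Note: this positional fallback is deprecated in recent pandas versions (2.1 and later emit a FutureWarning
# saying integer keys will be treated as labels in the future), so s.iloc[3] is the safer spelling.

s[3]

# Equivalent to s.iloc[3]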

'History'

In [25]:
# If you pass in a label, such as a string, it will query as if you wanted to use the label-based loc attribute.

s['Molly']

# Equivalent to s.loc['Molly']

'English'

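This raises a problem, though: what if the index itself is made of integers? In that case pandas can't tell whether you want to query by index position or by index label. The safest option is to use the iloc and loc attributes explicitly.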
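
To make the ambiguity concrete, here is a small sketch with a made-up Series whose integer labels are deliberately out of order, so that position 0 and label 0 point at different entries.

```python
import pandas as pd

# Hypothetical Series with integer labels that do not match the positions
s = pd.Series(['Physics', 'Chemistry', 'English'], index=[2, 0, 1])

by_position = s.iloc[0]   # first entry in storage order
by_label = s.loc[0]       # entry whose label is 0

print(by_position)
print(by_label)
```

Spelling out iloc or loc removes any doubt about which of the two lookups is meant.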

# Iteration

Suppose we create a Series to represent students' grades, and we now want to compute the average grade.

In [28]:
grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total += grade
print(total/len(grades))

75.0


This is one way to compute the average, but it is slow. Let's look at the NumPy sum method instead.

In [29]:
import numpy as np

total = np.sum(grades)
print(total/len(grades))

75.0


In [31]:
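# Let's compare which way of computing the average is faster

numbers = pd.Series(np.random.randint(0, 1000, 10000)) # 10,000 random integers from 0 up to (but not including) 1000

# Look at the first five items in the series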
numbers.head()

0    938
1    589
2    208
3    356
4    574
dtype: int32

In [32]:
# Use len() to confirm that the series has the expected length

len(numbers)

10000

Here, we're actually going to use what's called a cell magic function. These start with two percent signs (%%) and apply to all of the code in the current Jupyter cell. The function we're going to use is called timeit. This function will run our code a few times to determine, on average, how long it takes.

Let's run timeit with our original iterative code. You can give timeit the number of loops that you would like to run using the -n option. If you don't, timeit chooses the number of loops automatically so that it gets a reliable measurement. I'll ask timeit here to use 100 loops because we're recording this. Note that in order to use a cell magic function, it has to be the first line in the cell.

In [33]:
%%timeit -n 100
total = 0
for number in numbers:
    total+=number

total/len(numbers)

1.1 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [34]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

48.5 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Clearly, the NumPy sum method is faster.

This is a pretty shocking difference in speed, and it demonstrates why one should be aware of parallel computing features and start thinking in functional programming terms. The speedup comes from vectorization: np.sum hands the whole array to optimized compiled code instead of looping over the values one by one in Python.

Put more simply, vectorization is the ability of a computer to apply the same operation to many values at once rather than one at a time, and with high-performance chips, especially graphics cards, you can get dramatic speedups. Modern graphics cards can run thousands of operations in parallel.
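
Outside of a notebook, where the %%timeit magic isn't available, a similar comparison can be sketched with Python's standard-library timeit module. The helper names loop_mean and vectorized_mean are illustrative, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean(series):
    """Average computed by iterating over the values one at a time."""
    total = 0
    for value in series:
        total += value
    return total / len(series)

def vectorized_mean(series):
    """Average computed with NumPy's vectorized sum."""
    return np.sum(series) / len(series)

# Run each version 100 times and report the total elapsed seconds
loop_time = timeit.timeit(lambda: loop_mean(numbers), number=100)
vectorized_time = timeit.timeit(lambda: vectorized_mean(numbers), number=100)

print(f"loop:       {loop_time:.4f} s for 100 runs")
print(f"vectorized: {vectorized_time:.4f} s for 100 runs")
```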

A related feature in pandas and NumPy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every random value by 2, we could do so quickly using the += operator directly on the Series object. 

In [36]:
numbers.head()

0    938
1    589
2    208
3    356
4    574
dtype: int32

In [38]:
# Now add 2 to every item in the series

numbers += 2

In [39]:
numbers.head()

0    940
1    591
2    210
3    358
4    576
dtype: int32
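
Broadcasting isn't limited to +=. As a small sketch with made-up grades, other arithmetic and comparison operators also apply to every value, and when used without in-place assignment they return a new Series and leave the original alone.

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

curved = grades + 5      # a new Series with 5 added to every grade
passed = grades >= 70    # a boolean Series, one True/False per grade

print(curved.tolist())
print(passed.tolist())
print(grades.tolist())   # the original is unchanged
```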

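The .loc attribute lets you not only modify data but also add new data. If the label you pass in doesn't exist in the index, a new entry is added to the Series. Keep in mind that an index can hold labels of mixed types, and pandas will automatically change the underlying NumPy types as needed.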


In [50]:
# Here's an example using a Series of a few numbers. 
s = pd.Series([1, 2, 3])

# We could add some new value, maybe a university course
s.loc['History'] = 102

s

0            1
1            2
2            3
History    102
dtype: int64
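
Both uses of .loc, overwriting an existing entry and adding a new one, can be sketched side by side. The student data here is illustrative only.

```python
import pandas as pd

s = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})

s.loc['Jack'] = 'Biology'   # label exists: the value is overwritten
s.loc['Sam'] = 'History'    # label does not exist: a new entry is added

print(s)
```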

What if the index values are not unique?

Up until now I've shown only examples of a Series where the index values were unique. I want to end this lecture by showing an example where index values are not unique, and this makes a pandas Series a little different conceptually than, for instance, a relational database.

In [65]:
students_classes = pd.Series({'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [66]:
# Now let's create a Series just for some new student Kelly, which lists all of the courses
# she has taken. We'll set the index to Kelly, and the data to be the names of courses.

kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [67]:
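# Finally, we can append all of the data in this new Series to the first. Older pandas versions offered
# a Series.append() method for this, but it was deprecated in pandas 1.4 and removed in 2.0, so we use
# pd.concat() here, which works across versions.

all_students_classes = pd.concat([students_classes, kelly_classes])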

# This creates a series which has our original people in it as well as all of Kelly's courses

all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

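Note that appending to a Series works differently from append() on a list. For a list num, calling num.append(6) adds the new item 6 to num in place. Here, by contrast, Series.append() (like pd.concat()) returns a new Series and has no effect on the originals, so students_classes is unchanged: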

In [68]:
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [69]:
# If we query the combined series for Kelly, we don't get a single value back, we get a series:

all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
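
One way to check ahead of time whether a label lookup might return several rows is the index's is_unique property. The sketch below rebuilds a shortened version of the data above and uses pd.concat() to combine the two Series.

```python
import pandas as pd

students_classes = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
kelly_classes = pd.Series(['Philosophy', 'Arts'], index=['Kelly', 'Kelly'])

# pd.concat returns a new Series; neither input is modified
all_classes = pd.concat([students_classes, kelly_classes])

print(all_classes.index.is_unique)   # duplicate labels are present

alice = all_classes.loc['Alice']     # unique label: a single value
kelly = all_classes.loc['Kelly']     # repeated label: a Series of matches

print(type(alice))
print(type(kelly))
```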