<img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">

<h1 style="text-align:center;">CSCI 140</h1>
<h1 style="text-align:center;">
pandas Series
</h1>

Next: [Introduction to Data Frames](Data_Frames.ipynb)

# Series

The pandas module contains data structures that are useful for data science. The first one we will consider is the Series. We've discussed lists, which are indexed by position and have an intrinsic order, as well as dictionaries, which are indexed by keys (of various types) rather than by position (since Python 3.7 dictionaries preserve insertion order, but they still cannot be indexed or sliced by position). A Series is somewhere in between, described in the pandas documentation as a:

"one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index." (https://pandas.pydata.org/pandas-docs/stable/dsintro.html)

The index labels are typically integers or strings. We can create a Series in many different ways. Let's consider the population of North and Central America from 2000 to 2016 (data source: https://ourworldindata.org/world-population-growth/).

In [None]:
# We need to import pandas; we'll abbreviate it as pd
import pandas as pd
import numpy as np

In [None]:
data = [486880570, 492366094, 497854858, 503342438, 508828774, 514315776, 519801498, 525290102, 530776734,
        536261864, 541749246, 546802174, 551851944, 556901434, 561950624, 567001696, 572050884]
index = range(2000,2017)
popn = pd.Series(data, index=index)
print(popn)

We created the series using a list of data, and a range of years for the indices. If we want to access the population value for a particular year, we access it using the index:

In [None]:
print(popn[2001])

We can also change values the way we did with a dictionary:

In [None]:
popn[2001] = 10
print(popn[2001])

In [None]:
popn[2001] = 492366094
print(popn[2001])

The syntax is identical to what we've seen with dictionaries. What do you think will happen if we try to use an index that doesn't exist?

In [None]:
print(popn[2019])
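As with a dictionary, looking up a label that is not in the index raises a `KeyError`. Here is a minimal sketch of three ways to deal with that, using only the first three years of the data above (the variable names `has_2001`, `missing`, and so on are just for illustration):

```python
import pandas as pd

# Population values for 2000-2002, copied from the data above
popn = pd.Series([486880570, 492366094, 497854858], index=range(2000, 2003))

# Square-bracket lookup with a label that is not in the index raises KeyError
try:
    print(popn[2019])
except KeyError:
    print("2019 is not in the index")

# `in` checks the index labels, just as it checks the keys of a dictionary
has_2001 = 2001 in popn
has_2019 = 2019 in popn
print(has_2001, has_2019)

# .get returns a default value (None unless we supply one) instead of raising
missing = popn.get(2019)
fallback = popn.get(2019, -1)
print(missing, fallback)
```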

If we don't specify an index, the index defaults to the integers starting at 0. Notice that this makes our Series indexable like a list:

In [None]:
popn_no_index = pd.Series(data)
print(popn_no_index)
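A list plus an index is only one of the ways to build a Series. As a small sketch, we can also pass a dictionary, in which case the keys become the index and the values become the data. The names `popn_by_year` and `popn_from_dict` are made up for this example, and only three of the values from above are used:

```python
import pandas as pd

# Three of the population values from above, keyed by year
popn_by_year = {2000: 486880570, 2001: 492366094, 2002: 497854858}

# The dictionary keys become the index; the values become the data
popn_from_dict = pd.Series(popn_by_year)
print(popn_from_dict)
```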

Let's go back to our original Series. We can take a slice of a Series in several ways. If you want to use positional indices, you can use the `iloc` attribute (this is an attribute of a Series that is accessible to us):

In [None]:
popn.iloc[2:10]

Notice that this slices both the data and the index, and uses positional indexing like a list (starting at 0). We can take a slice using the actual value of the index with the `loc` attribute:

In [None]:
popn.loc[2003:2007]

This works ALMOST the same as it does with lists, but a slice with the `loc` attribute does give us the item at the LAST label specified. Here's another example: a reversed slice that takes every other year from 2009 back to 2003:

In [None]:
popn.loc[2009:2003:-2]
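To make the difference between the two kinds of slice concrete, here is a short sketch on the first five years of the data. The names `by_position` and `by_label` are only for illustration:

```python
import pandas as pd

# First five population values from above, labeled 2000-2004
popn = pd.Series([486880570, 492366094, 497854858, 503342438, 508828774],
                 index=range(2000, 2005))

# iloc slices by position and excludes the stop position, like a list
by_position = popn.iloc[1:3]
print(list(by_position.index))   # [2001, 2002]

# loc slices by label and includes the stop label
by_label = popn.loc[2001:2003]
print(list(by_label.index))      # [2001, 2002, 2003]
```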

## Why Series? Some useful properties and methods

Later on in the readings, we will talk more about Series methods, for example, to change the data type and to process strings. This is an introduction to why we want Series in the first place.

Imagine that you have a list and you want to multiply every item in the list by 2:

In [None]:
a_list = [5,2,5,6,7]
print(a_list*2)

That's not what we want! We want it to take every item out of the list, multiply it by 2, and put it in a new list. We actually have to write that step by step:

In [None]:
result = []
for item in a_list:
    result.append(item*2)
print(result)

Wouldn't it be nice if we could just multiply the list by 2? We can do that with a pandas Series:

In [None]:
a_series = pd.Series([5,2,5,6,7])
print(a_series*2)
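Multiplication is not special here. As a brief sketch, other arithmetic and comparison operators with a single number are also applied to every item of the Series (the result names below are just for illustration):

```python
import pandas as pd

a_series = pd.Series([5, 2, 5, 6, 7])

# Each operation is applied to every item and returns a new Series
plus_ten = a_series + 10    # 15, 12, 15, 16, 17
halved = a_series / 2       # division gives floats: 2.5, 1.0, 2.5, 3.0, 3.5
is_large = a_series > 5     # comparisons give True/False for each item

print(plus_ten)
print(halved)
print(is_large)
```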

We can also add, subtract, multiply, or divide two Series, item by item. Compare and contrast this with how lists work:

In [None]:
a_list = [5,2,5,6,7]
b_list = [1,2,3,4,5]
print(a_list + b_list)  # + joins the two lists end to end
print(a_list * b_list)  # raises a TypeError: a list cannot be multiplied by a list

In [None]:
a_series = pd.Series([5,2,5,6,7])
b_series = pd.Series([1,2,3,4,5])
print(a_series + b_series)
print(a_series * b_series)
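One detail worth knowing: when two Series are combined, pandas matches items by index label, not by position. The two Series above share the default index 0 to 4, so the distinction doesn't show. Here is a small sketch with made-up labels where it does; a label that appears in only one Series produces `NaN` (a missing value) in the result:

```python
import pandas as pd

# Two small Series whose labels only partly overlap (illustrative values)
x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Items are paired up by label: 'b' gets 2 + 10 and 'c' gets 3 + 20.
# 'a' and 'd' each appear in only one Series, so their result is NaN.
total = x + y
print(total)
```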

In [None]:
data = [486880570, 492366094, 497854858, 503342438, 508828774, 514315776, 519801498, 525290102, 530776734,
        536261864, 541749246, 546802174, 551851944, 556901434, 561950624, 567001696, 572050884]
index = range(2000,2017)
popn = pd.Series(data, index=index)

Series objects also have a lot of useful methods:

In [None]:
popn.median()

In [None]:
popn.mean()
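A few more methods, sketched on the first four years of the data: `min` and `max` return values, `idxmax` returns the index label where the maximum occurs, and `diff` gives the change from one item to the next. The variable names are only for illustration:

```python
import pandas as pd

# First four population values from above, labeled 2000-2003
popn = pd.Series([486880570, 492366094, 497854858, 503342438],
                 index=range(2000, 2004))

smallest = popn.min()
largest = popn.max()
peak_year = popn.idxmax()   # the index label of the largest value: 2003

# Year-over-year change; the first year has no previous year, so it is NaN
growth = popn.diff()

print(smallest, largest, peak_year)
print(growth)
```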

We can also give our Series a name, which is useful when Series are used as columns in a DataFrame (we'll talk about these next):

In [None]:
popn = pd.Series(data, index=index, name='NorthAmerica')

In [None]:
print(popn)

## A Playlist

Let's consider another example of a Series, a playlist. We could read in information for the playlist from a file:

In [None]:
play = open('playlist.txt','r') #You will need to change this based on where you stored the file

In [None]:
plist = []
for line in play:
    line = line.rstrip()
    plist.append(line)
playlist = pd.Series(plist)

In [None]:
play.close()

We didn't specify an index, so the songs will be indexed from 0 to 7:

In [None]:
print(playlist)
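The file-reading steps above can also be written with a `with` block, which closes the file for us when the block ends. The sketch below writes its own small stand-in file so that it runs anywhere; the file name `playlist_demo.txt` and the song titles are placeholders, not the contents of `playlist.txt`:

```python
import os
import tempfile

import pandas as pd

# Create a small stand-in file (placeholder titles, not the real playlist)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'playlist_demo.txt')
with open(demo_path, 'w') as out_file:
    out_file.write('Song A\nSong B\nSong C\n')

# Read the file line by line; `with` closes it automatically afterwards
demo_lines = []
with open(demo_path, 'r') as play_demo:
    for line in play_demo:
        demo_lines.append(line.rstrip())

demo_playlist = pd.Series(demo_lines)
print(demo_playlist)
```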

We might like to have them indexed from 1 to 8 instead:

In [None]:
playlist = pd.Series(plist, index = range(1,9))

In [None]:
playlist

Or perhaps we have more descriptive string labels for each song:

In [None]:
playlist = pd.Series(plist, name='Depressing Dance Party',
                     index=['Intro', 'Dance 1', 'Dance 2', 'Dance 3', 'Dance 4',
                            'Dance 5', 'Dance 6', 'Outro'])

In [None]:
print(playlist)

Here is a fancy slice from our playlist, going backwards by label from 'Dance 4' to 'Dance 2':

In [None]:
playlist.loc['Dance 4':'Dance 2':-1]
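With string labels, a `loc` slice follows the order in which the labels appear in the index (not alphabetical order), and it includes both end labels. Here is a short sketch on a made-up six-song Series; the song titles are placeholders:

```python
import pandas as pd

# Placeholder titles with descriptive labels (not the real playlist)
labels = ['Intro', 'Dance 1', 'Dance 2', 'Dance 3', 'Dance 4', 'Outro']
songs = pd.Series(['Song A', 'Song B', 'Song C', 'Song D', 'Song E', 'Song F'],
                  index=labels)

# Forward slice: from 'Dance 2' through 'Dance 4', both ends included
forward = songs.loc['Dance 2':'Dance 4']
print(list(forward.index))    # ['Dance 2', 'Dance 3', 'Dance 4']

# A step of -1 walks the index backwards, so the start label comes later
backward = songs.loc['Dance 4':'Dance 2':-1]
print(list(backward.index))   # ['Dance 4', 'Dance 3', 'Dance 2']
```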