<a href="https://colab.research.google.com/github/csaatechnicalarts/ML_Bootcamp/blob/main/Pandas_02_SeriesIntro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 02 Pandas Data Structure: Series

## Content Outline
* Introduction
* Create a series from a Pandas data frame
* Create a series from a Python collection
* Statistical operations on a series
* Vector operations on a series



In Pandas, the *series* is one of the core data structures for computation. A series is a one-dimensional array with a labeled index. Like a Python list, a series is ordered, and its elements can be accessed with the **[ ]** notation. Element types can be heterogeneous; the index labels must be of a hashable type. While each element of a series is mutable, the Pandas documentation treats the length of a series as fixed.
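
As a minimal sketch of these properties, the cell below builds a small series by hand. The labels and values are made up for illustration and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
s_demo = pd.Series([10, "ten", 10.0], index=["a", "b", "c"])

# Elements can be read by label with the [ ] notation.
first = s_demo["a"]

# Elements are mutable: assigning to an existing label updates the value.
s_demo["a"] = 11

print(s_demo)
print(first, s_demo["a"], len(s_demo))
```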

For this notebook, we're going to demonstrate Pandas series using the [ww2_leaders.csv](https://github.com/csaatechnicalarts/ML_Bootcamp/blob/main/data/ww2_leaders.csv) file, which we upload to Google Colab beforehand.

In [129]:
import pandas as pd
import matplotlib.pyplot as plt

In [130]:
df = pd.read_csv("sample_data/ww2_leaders.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Name     12 non-null     object
 1   Born     12 non-null     object
 2   Died     12 non-null     object
 3   Age      12 non-null     int64 
 4   Title    12 non-null     object
 5   Country  12 non-null     object
dtypes: int64(1), object(5)
memory usage: 708.0+ bytes


In [131]:
df

Unnamed: 0,Name,Born,Died,Age,Title,Country
0,Franklin Roosevelt,1882-01-30,1945-04-12,63,President,United States
1,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
5,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister,United Kingdom
6,Manuel Camacho,1897-04-24,1955-10-13,58,President,Mexico
7,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister,South Africa
8,Ibn Saud,1875-01-15,1953-11-09,78,King,Saudi Arabia
9,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister,Thailand


Recall that a Pandas data frame is a two-dimensional collection made up of rows and columns. Conceptually, each data frame row is a Pandas series. Using the *loc[]* indexer, we can select a row by its label and provision a new series with it.

In [132]:
de_gaulle = 4          # index label of Charles de Gaulle's row
s = df.loc[de_gaulle]  # loc[] already returns a Series for a single row
print(f"{s}\n\nType of s: {type(s)}")

Name       Charles de Gaulle
Born              1890-11-22
Died              1970-11-09
Age                       79
Title              President
Country               France
Name: 4, dtype: object

Type of s: <class 'pandas.core.series.Series'>
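
For comparison, here is a small self-contained sketch of row selection by label (*loc[]*) and by position (*iloc[]*). The two-row frame named `leaders` is a stand-in built from rows 4 and 5 of the table above; the variable names are assumptions made for this example only.

```python
import pandas as pd

# A tiny stand-in frame (two rows copied from the table above).
leaders = pd.DataFrame(
    {
        "Name": ["Charles de Gaulle", "Winston Churchill"],
        "Age": [79, 90],
        "Country": ["France", "United Kingdom"],
    },
    index=[4, 5],
)

row_by_label = leaders.loc[4]      # select by index label
row_by_position = leaders.iloc[0]  # select by integer position

print(type(row_by_label))
print(row_by_label.equals(row_by_position))
```

Both expressions return the same row here because the label 4 happens to sit at position 0 of the stand-in frame.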


A series is made up of a collection of labels (the index, which for a row holds the data frame's column names) and a collection of values (the row's cells). We use the *keys()* method and the *values* attribute to retrieve each collection. Just like lists in Python, we can index into specific elements of the keys and values of a series.

In [133]:
print(f"{s.keys()}\n\n{s.keys()[0]}\n{s.keys()[3]}")

Index(['Name', 'Born', 'Died', 'Age', 'Title', 'Country'], dtype='object')

Name
Age


In [134]:
print(s.values)

['Charles de Gaulle' '1890-11-22' '1970-11-09' np.int64(79) 'President'
 'France']


In [135]:
print(f"Name:\t\t{s.values[0]}\nCountry:\t{s.values[5]}\nAge:\t\t{s.values[3]}")

Name:		Charles de Gaulle
Country:	France
Age:		79
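
Counting positions in *values* gets fragile as the number of fields grows. A possible alternative, sketched below, is to read elements by label. The series `row` is rebuilt by hand from three fields of the de Gaulle row so the cell runs on its own.

```python
import pandas as pd

# Rebuild part of the de Gaulle row by hand so the sketch is self-contained.
row = pd.Series({"Name": "Charles de Gaulle", "Age": 79, "Country": "France"})

# keys() returns the same labels as the index attribute.
print(row.keys().equals(row.index))

# Label-based access avoids counting positions.
print(row["Name"], row["Age"])

# Convert the labels to a plain Python list when needed.
labels = list(row.index)
print(labels)
```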


Can we create our own series object on the fly? Yes, by calling the Pandas *Series()* constructor and passing it a collection of values, of homogeneous or heterogeneous types.

In [136]:
s = pd.Series(range(100, 120, 5))

In [137]:
print(f"{s}\n\nType of s: {type(s)}")

0    100
1    105
2    110
3    115
dtype: int64

Type of s: <class 'pandas.core.series.Series'>


In [138]:
s = pd.Series(['AAA', 32.4907, 100])

In [139]:
print(f"{s}\n\nType of s: {type(s)}")

0        AAA
1    32.4907
2        100
dtype: object

Type of s: <class 'pandas.core.series.Series'>
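
The *dtype* line in the two outputs above differs: the integer series reports *int64*, while the mixed series falls back to *object*. The sketch below makes the comparison explicit; the variable names `ints` and `mixed` are assumptions for this example.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])               # homogeneous integers
mixed = pd.Series(["AAA", 32.4907, 100])  # heterogeneous elements

print(ints.dtype, mixed.dtype)

# In an object series each element keeps its original Python type.
element_types = [type(v).__name__ for v in mixed]
print(element_types)
```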



By default, Pandas assigns integers as the indices, or labels, of the series we've just created. To provision a series with our own labels, we pass them through the *index* parameter of *Series()*.

In [141]:
s = pd.Series(['AAA', 32.4907, 100], index=['word', 'float', 'integer'])

In [142]:
print(f"{s}\n\nType of s: {type(s)}")

word           AAA
float      32.4907
integer        100
dtype: object

Type of s: <class 'pandas.core.series.Series'>


Alternatively, we can pass a Python dictionary to create a Pandas series. The dictionary keys become the labels.

In [143]:
us_states_admission = {"Alabama": "1819-12-14", "Illinois": "1818-12-03", "Nevada": "1864-10-31"}
s = pd.Series(us_states_admission)

In [144]:
print(f"{s}\n\nType of s: {type(s)}")

Alabama     1819-12-14
Illinois    1818-12-03
Nevada      1864-10-31
dtype: object

Type of s: <class 'pandas.core.series.Series'>
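
Note that the dates above are stored as plain strings (dtype *object*). If we want to compare or sort them as dates, one option is to convert the series with *pd.to_datetime()*. The sketch below uses two of the entries from the dictionary; the names `admission` and `dates` are assumptions for this example.

```python
import pandas as pd

admission = pd.Series({"Illinois": "1818-12-03", "Nevada": "1864-10-31"})

# Parse the ISO-formatted strings into datetime values.
dates = pd.to_datetime(admission)
print(dates.dtype)

# idxmin() returns the label of the earliest date.
earliest = dates.idxmin()
print(earliest)
```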


Let's go back to our table of WW2 leaders.

In [145]:
df = pd.read_csv("sample_data/ww2_leaders.csv")
df

Unnamed: 0,Name,Born,Died,Age,Title,Country
0,Franklin Roosevelt,1882-01-30,1945-04-12,63,President,United States
1,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
5,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister,United Kingdom
6,Manuel Camacho,1897-04-24,1955-10-13,58,President,Mexico
7,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister,South Africa
8,Ibn Saud,1875-01-15,1953-11-09,78,King,Saudi Arabia
9,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister,Thailand


In our earlier example, we used the data frame *loc[]* indexer to isolate a row and provision a new series from it. Pandas also allows us to create a series from a data frame column.

In [146]:
age = df['Age']
print(f"Type of age: {type(age)}")

Type of age: <class 'pandas.core.series.Series'>


We can use the Pandas *describe()* method to report descriptive statistics for this series.

In [147]:
age.describe()

Unnamed: 0,Age
count,12.0
mean,72.833333
std,11.784684
min,56.0
25%,62.25
50%,76.0
75%,80.75
max,90.0


Furthermore, if we just need the average age, we can use the *mean()* method.

In [148]:
age.mean()

np.float64(72.83333333333333)
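
The other figures reported by *describe()* are available through their own methods as well. The sketch below is self-contained and uses only the ten ages visible in the table above, so its results differ slightly from the twelve-row output; the series name `ages` is an assumption for this example.

```python
import pandas as pd

# The ten ages shown in the table above (rows 0 to 9).
ages = pd.Series([63, 74, 56, 87, 79, 90, 58, 80, 78, 66], name="Age")

print(ages.mean())         # arithmetic mean
print(ages.median())       # same as the 50% row of describe()
print(ages.min(), ages.max())
print(ages.quantile(0.5))  # quantile() generalizes the median
```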

Let's say we want the age values that are less than or equal to the average age in this series. Pandas lets us use the **[ ]** notation to apply a filter, screening out the elements for which the condition evaluates to false.

In [149]:
age[age <= age.mean()]

Unnamed: 0,Age
0,63
2,56
6,58
9,66
10,60


Behind the scenes, the filter expression evaluates to a series of true or false values, one per element. Pandas uses this series to return only the elements that correspond to true.

In [150]:
print(age <= age.mean())

0      True
1     False
2      True
3     False
4     False
5     False
6      True
7     False
8     False
9      True
10     True
11    False
Name: Age, dtype: bool
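
Because a comparison yields a boolean series, two conditions can be combined element by element with `&` (and) or `|` (or). The sketch below is a self-contained illustration on the ten ages visible in the table above; the thresholds 60 and 80 are arbitrary choices for this example.

```python
import pandas as pd

ages = pd.Series([63, 74, 56, 87, 79, 90, 58, 80, 78, 66], name="Age")

# Each comparison needs its own parentheses because & binds tighter than >= and <.
in_range = (ages >= 60) & (ages < 80)

selected = ages[in_range]
print(selected)

# Summing a boolean series counts the True entries.
print(int(in_range.sum()))
```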


We can see how this works using our own ad-hoc true-and-false list as a mask. Such a list is commonly called a boolean *mask*, or a *vector* of boolean values; it has one entry per element of the series.

In [151]:
names = pd.Series(df['Name'])
potentates = [False, True, True, True, False, False, False, False, True, False, False, True]
names[potentates]

Unnamed: 0,Name
1,Joseph Stalin
2,Adolph Hitler
3,Michinomiya Hirohito
8,Ibn Saud
11,Haile Selassie
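
A hand-written mask is easy to get wrong, so in practice the mask is often derived from another series. The sketch below is one possible way to do that, assuming we treat every title other than President or Prime Minister as a potentate; the list `excluded_titles` and the five-row sample are assumptions made for this example.

```python
import pandas as pd

# First five rows of the table above, rebuilt by hand.
names = pd.Series(
    ["Franklin Roosevelt", "Joseph Stalin", "Adolph Hitler",
     "Michinomiya Hirohito", "Charles de Gaulle"]
)
titles = pd.Series(["President", "Great Leader", "Fuhrer", "Emperor", "President"])

# Hypothetical rule: anyone who is not a President or Prime Minister.
excluded_titles = ["President", "Prime Minister"]
mask = ~titles.isin(excluded_titles)  # ~ inverts the boolean series

print(names[mask])
```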


Now it's time to look at vector operations applied to series. We can add a scalar value to a series, or multiply a series by one, and Pandas will apply the operation to each element individually.

In [152]:
age + 1000

Unnamed: 0,Age
0,1063
1,1074
2,1056
3,1087
4,1079
5,1090
6,1058
7,1080
8,1078
9,1066


In [153]:
age * 2

Unnamed: 0,Age
0,126
1,148
2,112
3,174
4,158
5,180
6,116
7,160
8,156
9,132


Passing the age series itself as the second operand, we can carry out vector addition on the series. Pandas pairs up the elements that share the same index label. In the example below, the outcome is the same as doubling each age, as in the notebook cell above.

In [154]:
age + age

Unnamed: 0,Age
0,126
1,148
2,112
3,174
4,158
5,180
6,116
7,160
8,156
9,132
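
To make the element-by-element behavior concrete, the sketch below compares the vectorized forms with an explicit Python loop. It uses the first three ages from the table; the names `doubled`, `summed`, and `looped` are assumptions for this example.

```python
import pandas as pd

ages = pd.Series([63, 74, 56], name="Age")

doubled = ages * 2    # scalar applied to every element
summed = ages + ages  # elements paired up by index label

# The same result written as an explicit loop over the elements.
looped = pd.Series([a + a for a in ages], name="Age")

print(doubled.equals(summed))
print(summed.equals(looped))
```

The vectorized forms are shorter and avoid looping in Python.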


What happens when the second operand is a series of a different length?

In [155]:
age + pd.Series([100, 100])

Unnamed: 0,0
0,163.0
1,174.0
2,
3,
4,
5,
6,
7,
8,
9,


As we see here, Pandas carries out the vector operation label by label and leaves the result as a missing value (NaN) wherever one of the series has no matching label. Because NaN is a floating-point value, the result is converted to floats.
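
If missing values are not what we want, the *add()* method accepts a *fill_value* that stands in for the absent elements. The sketch below is a minimal illustration on the first four ages; the series `bonus` and `bonus_labeled` are hypothetical and exist only for this example.

```python
import pandas as pd

ages = pd.Series([63, 74, 56, 87], name="Age")
bonus = pd.Series([100, 100])  # labels 0 and 1 only

plain = ages + bonus                    # labels 2 and 3 have no partner
filled = ages.add(bonus, fill_value=0)  # treat the missing partner as 0

print(plain)
print(filled)

# Matching is by label, not by position: this value lands on label 3.
bonus_labeled = pd.Series([100], index=[3])
by_label = ages + bonus_labeled
print(by_label)
```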