# Chapter 02 DataFrames and Series
Pandas for Everyone. See the author's [GitHub page](https://github.com/chendaniely/pandas_for_everyone).

In [2]:
import pandas as pd

## Creating a series

In [3]:
s = pd.Series(['banana', 43])
s

0    banana
1        43
dtype: object

A Pandas series has a single *dtype* that applies to all of its values. If we pass in values of different Python types, Pandas falls back to the most general representation that can hold all of them, which is typically the *object* dtype.
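To make the dtype rule concrete, here is a minimal sketch. The variable names (`mixed`, `int_float`) are illustrative only and do not appear in the original notebook.

```python
import pandas as pd

# A string and an integer have no common numeric type, so the dtype is object.
mixed = pd.Series(['banana', 43])
print(mixed.dtype)          # object

# An integer and a float can both be held as floats, so the dtype is float64.
int_float = pd.Series([1, 2.5])
print(int_float.dtype)      # float64

# With the object dtype, each element keeps its original Python type.
print(type(mixed[0]), type(mixed[1]))
```

Note that in the object case the values are stored as ordinary Python objects, which is why numeric methods on such a series tend to be slower or unavailable.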

Compare |Series | Dictionary | List
-------|-------|-------------|---------
Values| same type | same or mixed type | same or mixed type
Keys | user assigned or system default | user assigned | system default

For a Pandas series, each value has an *index*. The index can be user assigned or system generated (default). The syntax to access a value in a series is s\[i\], where i is the index value.

If the user does not provide the index, then the series looks like a list, because the default index will be 0, 1, 2...

If the user provides the index, then the series looks like a dictionary.
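The list-like and dictionary-like behaviour described above can be sketched as follows. The names `default_idx` and `labelled` are made up for this illustration.

```python
import pandas as pd

# No index given: positions 0, 1, ... become the labels, much like a list.
default_idx = pd.Series(['banana', 43])
print(default_idx[1])            # 43

# Index given: values are looked up by label, much like a dictionary.
labelled = pd.Series([88, 99, 100], index=['x', 'y', 'z'])
print(labelled['y'])             # 99

# The labels themselves are available through the .index attribute.
print(list(labelled.index))      # ['x', 'y', 'z']
```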

In [4]:
s = pd.Series(['Wes McKinney', 'Creator of Pandas'], index=['Person', 'Who']) # provide values and index separately
s

Person         Wes McKinney
Who       Creator of Pandas
dtype: object

In [5]:
s = pd.Series({'x': 88, 'y': 99, 'z': 100}) # provide index and values together
s

x     88
y     99
z    100
dtype: int64

In [6]:
s = pd.Series({'x': 88, 'y': 99, 'z': 100}, index=['z', 'y', 'x', 'a']) # build a series then use another index to rearrange it
s

z    100.0
y     99.0
x     88.0
a      NaN
dtype: float64

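Notice the 'NaN' value here: the label 'a' has no corresponding value in the dictionary, so Pandas marks it as missing. Because NaN is a floating-point value, the dtype of the series also changes from int64 to float64.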
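As a small sketch of how such missing values can be detected, the series from above can be checked with isna(). The variable name `missing` is an assumption made for this example.

```python
import pandas as pd

# Same construction as above: 'a' is not a key of the dictionary.
s = pd.Series({'x': 88, 'y': 99, 'z': 100}, index=['z', 'y', 'x', 'a'])

# isna() returns a boolean series that is True where the value is missing.
missing = s.isna()
print(missing)

# NaN forces a floating-point dtype even though the inputs were integers.
print(s.dtype)   # float64
```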

## Creating a DataFrame

In [7]:
df1 = pd.DataFrame([['x', 88], ['y', 99], ['z', 100]]) # create a dataframe with a list of lists
df1

Unnamed: 0,0,1
0,x,88
1,y,99
2,z,100


Note that both the row index and the column index are system generated (0, 1, 2, ...).

In [8]:
df2 = pd.DataFrame([['x', 88], ['y', 99], ['z', 100]], columns=['Person', 'Score']) # provide column index
df2

Unnamed: 0,Person,Score
0,x,88
1,y,99
2,z,100


In [9]:
df3 = pd.DataFrame(
    [['Hong Kong', 88], ['Singapore', 99], ['Shenzhen', 100]],
    columns=['Location', 'Score'],
    index=['x', 'y', 'z'],
)  # provide column names and index
df3

Unnamed: 0,Location,Score
x,Hong Kong,88
y,Singapore,99
z,Shenzhen,100


In [10]:
df1[1] # access a column by column index

0     88
1     99
2    100
Name: 1, dtype: int64

In [11]:
type(df1[0]) # a column is a series object

pandas.core.series.Series

In [12]:
df2.loc[0] # use .loc attribute to access a row by row index

Person     x
Score     88
Name: 0, dtype: object

In [13]:
type(df2.loc[0])

pandas.core.series.Series

In [14]:
df3['Location']['y'] # access by column index first, then by row index

'Singapore'

In [15]:
df3.loc['z']['Location'] # access by row index first, then by column index

'Shenzhen'

From the above, we can think of a DataFrame object as:

1. A series of columns, where each column is a series of rows. Or,
2. A series of rows, where each row is a series of columns.

The difference is that accessing a row of a DataFrame requires the .loc attribute, whereas accessing a column uses the column index directly.
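The two access paths can be put side by side in a short sketch. It rebuilds df3 so that it runs on its own; the names `col`, `row` and `cell` are illustrative. The last line goes slightly beyond the cells above by passing both the row label and the column label to .loc in a single call.

```python
import pandas as pd

df3 = pd.DataFrame(
    [['Hong Kong', 88], ['Singapore', 99], ['Shenzhen', 100]],
    columns=['Location', 'Score'],
    index=['x', 'y', 'z'],
)

col = df3['Score']              # a column: use the column index directly
row = df3.loc['y']              # a row: use .loc with the row index
cell = df3.loc['y', 'Location'] # a single cell: row label, then column label

print(cell)                     # Singapore
```

Selecting the cell in one .loc call avoids creating an intermediate series, which is usually preferable to chaining two lookups.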

## Series methods
The series class has many methods, such as mean(), hist(), replace(), unique(), nunique(), etc.
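A brief sketch of a few of these methods follows. The two small series, `occupation` and `scores`, are made-up sample data and not part of the original notebook.

```python
import pandas as pd

occupation = pd.Series(['Chemist', 'Statistician', 'Nurse', 'Chemist'])
scores = pd.Series([88, 99, 100, 99])

# unique() lists the distinct values in order of first appearance.
distinct = list(occupation.unique())
print(distinct)                 # ['Chemist', 'Statistician', 'Nurse']

# nunique() counts the distinct values.
print(occupation.nunique())     # 3

# replace() returns a new series with one value swapped for another.
replaced = occupation.replace('Nurse', 'Statistician')
print(replaced.nunique())       # 2

# mean() returns the arithmetic mean of a numeric series.
print(scores.mean())            # 96.5
```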

In [16]:
df3['Score'].mean()

95.66666666666667

In [17]:
s

z    100.0
y     99.0
x     88.0
a      NaN
dtype: float64

In [18]:
s = s.fillna(0) # fillna() replaces NaN values with another value (series -> series)
s

z    100.0
y     99.0
x     88.0
a      0.0
dtype: float64

### Using boolean vector to do selection
Other than using [] or .loc to select certain columns or rows, we can also use a boolean vector to do so.

The series class implements the ==, > and < operators, so comparing a series with a value generates a boolean vector (series).
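The comparison operators can be sketched on a small series as below. The series `scores` and the result names are assumptions for this example. Combining two conditions with & is not covered in the text above and is shown only as a natural extension; each condition needs its own parentheses.

```python
import pandas as pd

scores = pd.Series([88, 99, 100], index=['x', 'y', 'z'])

# Each comparison is applied to every element and yields a boolean series.
is_99 = scores == 99
below_100 = scores < 100

# Two boolean series can be combined element by element with & (and).
both = (scores > 90) & (scores < 100)

# A boolean series selects the elements where it is True.
print(scores[both])
```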

In [19]:
selector = df3['Score'] > 90
selector

x    False
y     True
z     True
Name: Score, dtype: bool

In [20]:
df3[selector]

Unnamed: 0,Location,Score
y,Singapore,99
z,Shenzhen,100


In [21]:
selector = [False, True, True] # list must be of the same length as df3
df3[selector]

Unnamed: 0,Location,Score
y,Singapore,99
z,Shenzhen,100


The following does NOT work. If we use a series to do selection, then the series must have the same index as the DataFrame being selected from. Here the selector has the default index 0, 1, 2, while df3 is indexed by 'x', 'y', 'z'.

In [24]:
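selector = pd.Series([False, True, True]) # default index 0, 1, 2 does not match df3's index
try:
    df3[selector]
except Exception: # avoid a bare except, which would also swallow KeyboardInterrupt
    print('does not work')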

does not work
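One way to make a series selector work, sketched here under the assumption that the boolean values are already in the same order as the rows of df3, is to give the selector the same index as the DataFrame. The name `selected` is illustrative.

```python
import pandas as pd

df3 = pd.DataFrame(
    [['Hong Kong', 88], ['Singapore', 99], ['Shenzhen', 100]],
    columns=['Location', 'Score'],
    index=['x', 'y', 'z'],
)

# Reuse df3's own index so that the labels of the selector line up with the rows.
selector = pd.Series([False, True, True], index=df3.index)
selected = df3[selector]
print(selected)
```

A plain Python list, as in the earlier cell, avoids the issue altogether because a list carries no index to align.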




In [26]:
df = pd.read_csv('data/scientists.csv')
df

Unnamed: 0,Name,Born,Died,Age,Occupation
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist
5,John Snow,1813-03-15,1858-06-16,45,Physician
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician


In [28]:
df['Age'].describe()

count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64

In [29]:
df[df['Age'] > df['Age'].mean()]

Unnamed: 0,Name,Born,Died,Age,Occupation
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician


### Operations on vector
If we apply an operator such as \* or + to a vector, the calculation is performed element by element.

In [34]:
df['Age'] * 2 + 100 # for every element x, do 2x + 100

0    174
1    222
2    280
3    232
4    212
5    190
6    182
7    254
Name: Age, dtype: int64

In [31]:
df['Age'] + df['Age']

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [32]:
df['Age'] * df['Age']

0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64

In [50]:
s = pd.Series(range(50, 55))
s

0    50
1    51
2    52
3    53
4    54
dtype: int64

In [52]:
df['Age'] + s # add two series of different lengths: labels 5-7 exist only in df['Age'], so the result is NaN there

0     87.0
1    112.0
2    142.0
3    119.0
4    110.0
5      NaN
6      NaN
7      NaN
dtype: float64

In [55]:
s = pd.Series(range(50, 57), index=range(3, 10))
s

3    50
4    51
5    52
6    53
7    54
8    55
9    56
dtype: int64

In [57]:
df['Age'] + s # indexes are aligned automatically: matching labels are added, labels present in only one series give NaN

0      NaN
1      NaN
2      NaN
3    116.0
4    107.0
5     97.0
6     94.0
7    131.0
8      NaN
9      NaN
dtype: float64
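If NaN is not the desired result for labels that appear in only one series, one option is the add() method with its fill_value parameter, which substitutes a value for the missing side before adding. The sketch below copies the ages from the table above into a standalone series named `age` so that it runs without the CSV file; `plain` and `filled` are illustrative names.

```python
import pandas as pd

age = pd.Series([37, 61, 90, 66, 56, 45, 41, 77])
s = pd.Series(range(50, 57), index=range(3, 10))

# The + operator leaves NaN wherever a label exists in only one series.
plain = age + s
print(plain.isna().sum())       # 5

# add() with fill_value=0 treats the missing side as 0 instead.
filled = age.add(s, fill_value=0)
print(filled)
```

Whether 0 is a sensible substitute depends on the data; for an addition it simply keeps the value from the series that has one.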

In [59]:
df

Unnamed: 0,Name,Born,Died,Age,Occupation
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist
5,John Snow,1813-03-15,1858-06-16,45,Physician
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician


In [61]:
df * 2 # multiply every column by 2: strings are repeated, ages are doubled

Unnamed: 0,Name,Born,Died,Age,Occupation
0,Rosaline FranklinRosaline Franklin,1920-07-251920-07-25,1958-04-161958-04-16,74,ChemistChemist
1,William GossetWilliam Gosset,1876-06-131876-06-13,1937-10-161937-10-16,122,StatisticianStatistician
2,Florence NightingaleFlorence Nightingale,1820-05-121820-05-12,1910-08-131910-08-13,180,NurseNurse
3,Marie CurieMarie Curie,1867-11-071867-11-07,1934-07-041934-07-04,132,ChemistChemist
4,Rachel CarsonRachel Carson,1907-05-271907-05-27,1964-04-141964-04-14,112,BiologistBiologist
5,John SnowJohn Snow,1813-03-151813-03-15,1858-06-161858-06-16,90,PhysicianPhysician
6,Alan TuringAlan Turing,1912-06-231912-06-23,1954-06-071954-06-07,82,Computer ScientistComputer Scientist
7,Johann GaussJohann Gauss,1777-04-301777-04-30,1855-02-231855-02-23,154,MathematicianMathematician


In [68]:
pd.to_datetime(df['Born'], format='%Y-%m-%d')

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]
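Once a column has been converted to datetime values, it supports date arithmetic and the .dt accessor. The following is only a sketch of what the conversion enables: it uses the first two Born and Died values from the table above in standalone series, and the names `born`, `died` and `lifespan` are assumptions for this example.

```python
import pandas as pd

born = pd.to_datetime(pd.Series(['1920-07-25', '1876-06-13']), format='%Y-%m-%d')
died = pd.to_datetime(pd.Series(['1958-04-16', '1937-10-16']), format='%Y-%m-%d')

# The .dt accessor exposes date components element by element.
print(born.dt.year.tolist())    # [1920, 1876]

# Subtracting two datetime series gives a series of time differences.
lifespan = died - born
print(lifespan.dt.days)
```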