# Chapter 02 DataFrames and Series
Pandas for Everyone. See the author's [github page](https://github.com/chendaniely/pandas_for_everyone)

In [6]:
import pandas as pd

## Creating a Series

In [146]:
s = pd.Series(['banana', 43]) # create a Series using a list
s

0    banana
1        43
dtype: object

The left column is called the *index*; it is used to access values in a Series object. Pandas generates a default index (0, 1, 2, ...) when the user does not provide one.

In [147]:
s = pd.Series({'x': 88, 'y': 99, 'z': 100}) # create a Series using a dictionary, the keys become the index
s

x     88
y     99
z    100
dtype: int64

For a Pandas series, all values must be of the same type. If we pass in different Python types, Pandas picks the most general type that can represent all the values; for mixed types such as strings and numbers, the *dtype* will typically be *object*.
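The dtype rule can be checked directly. This is a minimal sketch; the variable names `ints`, `floats` and `mixed` are illustrative and do not come from the book.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # all integers -> an integer dtype
floats = pd.Series([1, 2.5, 3])      # int and float -> upcast to float64
mixed = pd.Series(['banana', 43])    # str and int -> falls back to object

print(ints.dtype, floats.dtype, mixed.dtype)
```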

Compare |Series | Dictionary | List
-------|-------|-------------|---------
Values| same type | same or mixed type | same or mixed type
Keys | user assigned or system default | user assigned | system default

For a Pandas series, each value has an *index*. The index can be user assigned or system generated (default). The syntax to access a value in a series is s\[i\], where i is the index value.

If the user does not provide the index, then the series looks like a list, because the default index will be 0, 1, 2...

If the user provides the index, then the series looks like a dictionary.
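A small sketch of the two cases described above. The names `as_list` and `as_dict` are illustrative only.

```python
import pandas as pd

as_list = pd.Series([88, 99, 100])                  # default index 0, 1, 2
as_dict = pd.Series({'x': 88, 'y': 99, 'z': 100})   # keys become the index

print(list(as_list.index))   # default integer labels
print(list(as_dict.index))   # user-provided labels
print('y' in as_dict)        # membership tests the index, like a dictionary
print(as_dict.to_dict())     # convert back to a plain dictionary
```

Note that `in` checks the index labels, not the values, which matches how a dictionary checks its keys.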

In [148]:
s = pd.Series(['Wes Mckinney', 'Creator of Pandas'], index=['Person', 'Who']) # another way to provide values and index
s

Person         Wes Mckinney
Who       Creator of Pandas
dtype: object

In [149]:
s = pd.Series({'x': 88, 'y': 99, 'z': 100}, index=['z', 'y', 'x', 'a']) # build a series then use another index to rearrange it
s

z    100.0
y     99.0
x     88.0
a      NaN
dtype: float64

Notice that 'a' has a 'NaN' value, because 'a' has no corresponding key in the dictionary. Since NaN is a floating-point value, the dtype also changes from int64 to float64.
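Missing values can be detected or dropped with `isna()` and `dropna()`. The following is a minimal sketch that rebuilds the same series; the name `missing` is illustrative.

```python
import pandas as pd

s = pd.Series({'x': 88, 'y': 99, 'z': 100}, index=['z', 'y', 'x', 'a'])

missing = s.isna()    # True where the label had no value
print(missing)
print(s.dropna())     # the same series without the missing entry
print(s.dtype)        # float64, because NaN is a float
```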

### Series As an Iterable
We can iterate through a series object like a list.

In [150]:
[x for x in s]

[100.0, 99.0, 88.0, nan]

In [151]:
len(s) # number of values

4

In [152]:
s.apply(lambda x: 2*x + 1) # map the Series to a new Series; NaN stays NaN because 2*NaN + 1 is still NaN

z    201.0
y    199.0
x    177.0
a      NaN
dtype: float64
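For simple arithmetic like this, the same result can usually be written without `apply()`. A minimal sketch, with the illustrative names `via_apply` and `via_arithmetic`:

```python
import pandas as pd

s = pd.Series({'x': 88, 'y': 99, 'z': 100}, index=['z', 'y', 'x', 'a'])

via_apply = s.apply(lambda x: 2 * x + 1)   # call the function on each value
via_arithmetic = 2 * s + 1                 # element-by-element arithmetic

# equals() treats NaN values in the same position as equal
print(via_apply.equals(via_arithmetic))
```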

### Access Using loc\[\] vs Square Bracket \[\] vs iloc\[\]

See this [link](https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte) for a discussion of the difference. Through testing we found the following (as of version 1.0.4, current at the time of writing):

loc\[x\] | square bracket \[x\] | iloc\[x\]
---------|----------------------|-----------
x as index | x as index, and *sometimes* x as location | x as location

In [153]:
s

z    100.0
y     99.0
x     88.0
a      NaN
dtype: float64

In [154]:
s['z'] # access using index

100.0

In [155]:
s[0] # access using location (but this does NOT work for all Series, especially if the Series index is all integer)

100.0

In [156]:
s.loc['z'] # access using index

100.0

In [157]:
s.iloc[0] # access using location

100.0
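The caveat about integer indexes can be sketched as follows. The series `s_int` and the flag `raised` are illustrative names; the behaviour shown is what we would expect when the integer `0` is not one of the index labels.

```python
import pandas as pd

s_int = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

print(s_int.loc[10])    # label 10
print(s_int.iloc[0])    # position 0
print(s_int[10])        # with an integer index, [] looks up the label

try:
    s_int[0]            # there is no label 0, so this is not treated as a position
    raised = False
except KeyError:
    raised = True
print(raised)
```

Using `.loc[]` for labels and `.iloc[]` for positions avoids this ambiguity.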

## Creating a DataFrame
There are different ways to create a DataFrame:

Input | Remarks
------|--------
list of lists | each list as a row
dictionary of lists | each list as a column, with keys as column index

The system will generate both a column index and a row index automatically (0, 1, 2, ...) if the user does not specify them. To provide them explicitly, use the parameters below:

Parameter | Index
----------|---------
columns=*list* | column index
index=*list* | row index

In [158]:
df1 = pd.DataFrame([['x', 88], ['y', 99], ['z', 100]]) # create a dataframe with a list of lists, where each list is a row
df1

Unnamed: 0,0,1
0,x,88
1,y,99
2,z,100


Note that both the row index and column index are system generated (0, 1, 2 ..)

In [159]:
df2 = pd.DataFrame({'Score': [88, 99, 100], 'Person': ['x', 'y', 'z']}) # create using a dictionary of lists, where each
                                                                        # list is a column, keys as column index
df2

Unnamed: 0,Score,Person
0,88,x
1,99,y
2,100,z


In [160]:
df3 = pd.DataFrame( [['Hong Kong', 88], ['Singapore', 99], ['Shenzhen', 100]]
                  , columns=['Location', 'Score']
                  , index=['x', 'y', 'z']
                  ) # provide column index and row index explicitly
df3

Unnamed: 0,Location,Score
x,Hong Kong,88
y,Singapore,99
z,Shenzhen,100


In [161]:
df4 = pd.DataFrame({'Temperature': [25, 28, 31]}, index=['2020-06-01', '2020-06-02', '2020-06-03']) # yet another way
df4

Unnamed: 0,Temperature
2020-06-01,25
2020-06-02,28
2020-06-03,31


### Access Value
We can access a DataFrame using a column index, a row index or a row location, as below:

Access Method | Output | Return Value
--------------|--------|---------------
\[*c*\], by column index | a column as a Series | a copy or internal value
.loc\[*r*\], by row index | a row as a Series | internal value
.iloc\[*r*\], by row position | a row as a Series | internal value

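Depending on the case, an access may return either a copy or a view of the internal data. Chained assignment such as df\[*selector*\]\[*c*\] = value may therefore update a copy, trigger a *SettingWithCopyWarning*, and leave df unchanged. Assigning a whole column with df\[*c*\] = value is fine.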

In [162]:
df3['Score'] # access a column by column index

x     88
y     99
z    100
Name: Score, dtype: int64

In [163]:
type(df3['Score']) # a column is a series object

pandas.core.series.Series

In [164]:
df3.loc['z'] # use .loc attribute to access a row by row index

Location    Shenzhen
Score            100
Name: z, dtype: object

In [165]:
type(df3.loc['z']) # a row is also a Series object

pandas.core.series.Series

In [166]:
df3.iloc[0] # using row location, the first row

Location    Hong Kong
Score              88
Name: x, dtype: object

In [167]:
df3.iloc[-1] # the last row

Location    Shenzhen
Score            100
Name: z, dtype: object

### Access an Element

In [168]:
df3['Location']['y'] # access by column index first, then by row index

'Singapore'

In [169]:
df3.loc['z']['Location'] # access by row index first, then by column index

'Shenzhen'
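The two-step lookups above can also be written as a single indexing operation. This is a minimal sketch that rebuilds `df3` so it runs on its own.

```python
import pandas as pd

df3 = pd.DataFrame([['Hong Kong', 88], ['Singapore', 99], ['Shenzhen', 100]],
                   columns=['Location', 'Score'], index=['x', 'y', 'z'])

print(df3.loc['y', 'Location'])   # row label, column label
print(df3.iloc[1, 0])             # row position, column position
print(df3.at['z', 'Score'])       # scalar access by labels
print(df3.iat[2, 1])              # scalar access by positions
```

A single `.loc[row, column]` lookup also avoids the chained indexing that can cause trouble when assigning.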

### DataFrame As an Iterable

In [170]:
[x for x in df3] # gives the list of column index

['Location', 'Score']

In [171]:
len(df3) # number of rows

3
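Iterating over a DataFrame yields column labels, not rows. To walk through the rows, `iterrows()` and `itertuples()` can be used. A minimal sketch, with the illustrative names `labels` and `records`:

```python
import pandas as pd

df3 = pd.DataFrame([['Hong Kong', 88], ['Singapore', 99], ['Shenzhen', 100]],
                   columns=['Location', 'Score'], index=['x', 'y', 'z'])

# iterrows() yields (row index, row as a Series) pairs
labels = [label for label, row in df3.iterrows()]

# itertuples() yields one named tuple per row, with the row index as .Index
records = [(t.Index, t.Location, t.Score) for t in df3.itertuples()]

print(labels)
print(records)
print(df3.shape)   # (number of rows, number of columns)
```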

## Series methods
The series class has many methods, such as mean(), hist(), replace(), unique(), nunique(), etc.

In [172]:
df3['Score'].mean()

95.66666666666667

In [173]:
s

z    100.0
y     99.0
x     88.0
a      NaN
dtype: float64

In [174]:
s = s.fillna(0) # fillna() replaces NaN values with another value (series -> series)
s

z    100.0
y     99.0
x     88.0
a      0.0
dtype: float64
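A short sketch of a few more of these methods. The series `occupation` is a made-up example, and `value_counts()` is an additional method not named in the list above.

```python
import pandas as pd

occupation = pd.Series(['Chemist', 'Statistician', 'Nurse', 'Chemist'])

print(occupation.unique())         # distinct values, in order of appearance
print(occupation.nunique())        # number of distinct values
print(occupation.value_counts())   # how often each value occurs
print(occupation.replace('Nurse', 'Statistician'))   # returns a new Series
```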

### Using boolean vector to do selection
Besides using [] or .loc to select certain columns or rows, we can also use a boolean vector to do so.

The series class implements the ==, > and < operators, so they can be used to generate a boolean vector (series).

In [175]:
selector = df3['Score'] > 90
selector

x    False
y     True
z     True
Name: Score, dtype: bool

In [176]:
df3[selector]

Unnamed: 0,Location,Score
y,Singapore,99
z,Shenzhen,100


In [177]:
selector = [False, True, True] # list must be of the same length as df3
df3[selector]

Unnamed: 0,Location,Score
y,Singapore,99
z,Shenzhen,100


The following does NOT work. If we use a series to do the selection, the series must have the same index as the DataFrame being selected from.

In [178]:
selector = pd.Series([False, True, True])
try:
    df3[selector]
except Exception: # the selector's index (0, 1, 2) does not match df3's index (x, y, z)
    print('does not work')

does not work
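Two possible ways to make such a selector work are sketched below: give the boolean series the same index as the DataFrame, or pass a plain boolean array, which is matched by position. The names `aligned` and `raw` are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame([['Hong Kong', 88], ['Singapore', 99], ['Shenzhen', 100]],
                   columns=['Location', 'Score'], index=['x', 'y', 'z'])

# Option 1: build the selector with the same index as df3
aligned = pd.Series([False, True, True], index=df3.index)
print(df3[aligned])

# Option 2: strip the index, so the selection is done by position
raw = pd.Series([False, True, True])
print(df3[raw.to_numpy()])
```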




In [179]:
df = pd.read_csv('data/scientists.csv')
df

Unnamed: 0,Name,Born,Died,Age,Occupation
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist
5,John Snow,1813-03-15,1858-06-16,45,Physician
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician


In [180]:
df['Age'].describe()

count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64

In [181]:
df[df['Age'] > df['Age'].mean()]

Unnamed: 0,Name,Born,Died,Age,Occupation
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician
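Boolean vectors can also be combined with `&` (and), `|` (or) and `~` (not). The sketch below rebuilds a few rows of the scientists data inline so that it runs without the CSV file; the names `older` and `chemist` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['William Gosset', 'Marie Curie', 'Rachel Carson', 'John Snow'],
    'Age': [61, 66, 56, 45],
    'Occupation': ['Statistician', 'Chemist', 'Biologist', 'Physician'],
})

older = df['Age'] > 50
chemist = df['Occupation'] == 'Chemist'

print(df[older & chemist])   # both conditions hold
print(df[older | chemist])   # at least one condition holds
print(df[~older])            # the condition does not hold
```

When the conditions are written inline, each one needs its own parentheses, for example `df[(df['Age'] > 50) & (df['Occupation'] == 'Chemist')]`.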


### Update a DataFrame or Series

To update a DataFrame, we usually update only part of it.

In [7]:
dftest = pd.DataFrame({ 'Ticker': ['AAA', 'BBB', 'BBB', 'CCC', 'BBB']
                      , 'Vote': [2012, 1997, 1997, 2020, 1995]
                      , 'Value': range(100, 105)
                      })
dftest

Unnamed: 0,Ticker,Vote,Value
0,AAA,2012,100
1,BBB,1997,101
2,BBB,1997,102
3,CCC,2020,103
4,BBB,1995,104


#### Conditional Selector Update

When we use a conditional selector or slicing to choose certain rows and columns, we must not use \[\] to update. Instead, we need to use .loc\[\] to update.

Approach | Result
---------|----------
df\[*conditional selector*\]\[*column name*\] = new value | does NOT work, because square bracket \[\] returns a copy
df.loc\[*conditional selector*, *column name*\] = new value | works, because .loc\[\] operates on the internal data

In [8]:
selector = (dftest['Ticker'] == 'BBB') & (dftest['Vote'] == 1997)
dftest[selector]['Value'] = -888 # does NOT work
dftest # no update

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,Ticker,Vote,Value
0,AAA,2012,100
1,BBB,1997,101
2,BBB,1997,102
3,CCC,2020,103
4,BBB,1995,104


In [9]:
dftest.loc[selector, 'Value'] = -88 # correct approach
dftest

Unnamed: 0,Ticker,Vote,Value
0,AAA,2012,100
1,BBB,1997,-88
2,BBB,1997,-88
3,CCC,2020,103
4,BBB,1995,104
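The same `.loc[]` pattern extends to in-place arithmetic and to several columns at once. A minimal sketch on a fresh copy of the test frame; the values 1000 and 0 are arbitrary.

```python
import pandas as pd

dftest = pd.DataFrame({'Ticker': ['AAA', 'BBB', 'BBB', 'CCC', 'BBB'],
                       'Vote': [2012, 1997, 1997, 2020, 1995],
                       'Value': list(range(100, 105))})

selector = dftest['Ticker'] == 'BBB'

dftest.loc[selector, 'Value'] += 1000          # change only the selected rows
dftest.loc[~selector, ['Vote', 'Value']] = 0   # set two columns on the other rows
print(dftest)
```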


#### Unconditional Update

We can use square bracket \[\] or .loc\[\] to select a whole column to change.

In [10]:
dftest['Value'] = dftest['Value'].apply(lambda x: x + 100)
dftest

Unnamed: 0,Ticker,Vote,Value
0,AAA,2012,200
1,BBB,1997,12
2,BBB,1997,12
3,CCC,2020,203
4,BBB,1995,204


In [11]:
dftest.loc[:, 'Value'] = dftest.loc[:, 'Value'].apply(lambda x: x/2)
dftest

Unnamed: 0,Ticker,Vote,Value
0,AAA,2012,100.0
1,BBB,1997,6.0
2,BBB,1997,6.0
3,CCC,2020,101.5
4,BBB,1995,102.0


### Operations on vector
If we apply an operator such as \* or + to a vector, the calculation is done element by element.

In [102]:
df['Age'] * 2 + 100 # for every element x, do 2x + 100

0    174
1    222
2    280
3    232
4    212
5    190
6    182
7    254
Name: Age, dtype: int64

In [103]:
df['Age'] + df['Age']

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [104]:
df['Age'] * df['Age']

0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64

In [105]:
s = pd.Series(range(50, 55))
s

0    50
1    51
2    52
3    53
4    54
dtype: int64

In [106]:
df['Age'] + s # add two series; s only has labels 0-4, so labels 5-7 give NaN

0     87.0
1    112.0
2    142.0
3    119.0
4    110.0
5      NaN
6      NaN
7      NaN
dtype: float64

In [107]:
s = pd.Series(range(50, 57), index=range(3, 10))
s

3    50
4    51
5    52
6    53
7    54
8    55
9    56
dtype: int64

In [108]:
df['Age'] + s # auto align index, for those with the same index, perform the operation; for those that don't align, output NaN

0      NaN
1      NaN
2      NaN
3    116.0
4    107.0
5     97.0
6     94.0
7    131.0
8      NaN
9      NaN
dtype: float64
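If NaN is not wanted for labels that exist on only one side, the `add()` method accepts a `fill_value`. A minimal sketch with two small, made-up series; the names `age`, `plain` and `filled` are illustrative.

```python
import pandas as pd

age = pd.Series([37, 61, 90])                   # labels 0, 1, 2
s = pd.Series([50, 51, 52], index=[1, 2, 3])    # labels 1, 2, 3

plain = age + s                     # labels 0 and 3 exist on one side only -> NaN
filled = age.add(s, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```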

In [109]:
df

Unnamed: 0,Name,Born,Died,Age,Occupation
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist
5,John Snow,1813-03-15,1858-06-16,45,Physician
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician


In [110]:
df * 2 # multiply every element by 2: strings get duplicated, ages get doubled

Unnamed: 0,Name,Born,Died,Age,Occupation
0,Rosaline FranklinRosaline Franklin,1920-07-251920-07-25,1958-04-161958-04-16,74,ChemistChemist
1,William GossetWilliam Gosset,1876-06-131876-06-13,1937-10-161937-10-16,122,StatisticianStatistician
2,Florence NightingaleFlorence Nightingale,1820-05-121820-05-12,1910-08-131910-08-13,180,NurseNurse
3,Marie CurieMarie Curie,1867-11-071867-11-07,1934-07-041934-07-04,132,ChemistChemist
4,Rachel CarsonRachel Carson,1907-05-271907-05-27,1964-04-141964-04-14,112,BiologistBiologist
5,John SnowJohn Snow,1813-03-151813-03-15,1858-06-161858-06-16,90,PhysicianPhysician
6,Alan TuringAlan Turing,1912-06-231912-06-23,1954-06-071954-06-07,82,Computer ScientistComputer Scientist
7,Johann GaussJohann Gauss,1777-04-301777-04-30,1855-02-231855-02-23,154,MathematicianMathematician


In [111]:
pd.to_datetime(df['Born'], format='%Y-%m-%d')

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]
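Once the strings are converted to datetimes, date arithmetic becomes possible. The sketch below uses the first two Born and Died values from the table above; the names `born`, `died`, `lifespan` and `age_years` are illustrative, and dividing by 365.25 is only an approximation of whole years.

```python
import pandas as pd

born = pd.to_datetime(pd.Series(['1920-07-25', '1876-06-13']), format='%Y-%m-%d')
died = pd.to_datetime(pd.Series(['1958-04-16', '1937-10-16']), format='%Y-%m-%d')

print(born.dt.year)       # date parts are available through the .dt accessor

lifespan = died - born    # subtracting datetimes gives a timedelta Series
age_years = (lifespan.dt.days / 365.25).astype(int)   # approximate whole years
print(age_years)
```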