# Learning Pandas from Scratch (Part I: Series)

Pandas is a Python library for data processing. It is one of the key packages in the SciPy ecosystem because it offers a rich set of data-processing functions, similar to those of an RDBMS.

## Installation of Pandas

If you are using Anaconda, pandas is probably already included in your installation. If it is not, you can install it with the following command:  

    conda install pandas  
With pip, you can install it from PyPI with the following command:  

    pip install pandas

## Pandas data objects

Pandas has two core data classes:
- Series
- DataFrame

A Series is a data structure designed to hold a one-dimensional sequence of data.  
A DataFrame is a more complex data structure, mainly used for two-dimensional (rows and columns) data, similar to a database table.  
To use pandas, import it first.

In [321]:
import pandas as pd
import numpy as np

## Create Series

A Series has two parts: the index and the actual data. If no index is provided, the Series creates a default integer index starting from 0.  
Normally we create a Series from a list. We can also create one from a dict, in which case dict.keys() is used as the index and dict.values() as the values. Creating a Series from a dict is less common; most of the time a list is used.
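
As a small illustration of the dict case described above, the sketch below builds the same Series once from a dict and once from a list with an explicit index. The names `prices` and `labels` and their values are made up for this example.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
prices = {'apple': 3, 'banana': 1, 'cherry': 7}

# dict keys become the index, dict values become the data.
s_from_dict = pd.Series(prices)

# The same Series built from a list plus an explicit index.
labels = ['apple', 'banana', 'cherry']
s_from_list = pd.Series([3, 1, 7], index=labels)

print(s_from_dict)
print(s_from_dict.equals(s_from_list))  # compares both the data and the index
```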

In [322]:
data = [i for i in range(10,40,2)]
s = pd.Series(data)
s

0     10
1     12
2     14
3     16
4     18
5     20
6     22
7     24
8     26
9     28
10    30
11    32
12    34
13    36
14    38
dtype: int64

In [323]:
s.array

<PandasArray>
[10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38]
Length: 15, dtype: int64

In [324]:
s.index

RangeIndex(start=0, stop=15, step=1)

In [325]:
s[0]

10

In [326]:
s[2:8] # slicing with integer positional index, end index item not included.

2    14
3    16
4    18
5    20
6    22
7    24
dtype: int64

In [327]:
s[[2,9]]

2    14
9    28
dtype: int64

In [328]:
data = ['James', 'Bob', 'Alex','Simon','Kevin','Tom']
index = ['a','b','c','d','e','f']
s = pd.Series(data, index=index)
s

a    James
b      Bob
c     Alex
d    Simon
e    Kevin
f      Tom
dtype: object

In [329]:
s['a']

'James'

In [330]:
s['a':'c'] # label-based slicing includes the stop label, unlike integer positional slicing.

a    James
b      Bob
c     Alex
dtype: object

***
⚠️**NOTE**  
Slicing by index label in pandas differs from slicing in NumPy and Python: it **includes the item at the end label**.  
With **integer positional slicing**, however, the **item at the end position is excluded**, which is the same behavior as in NumPy and Python.

***
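
One way to make the difference explicit is to use the `.loc` (label-based) and `.iloc` (position-based) accessors. The following is a minimal sketch with a shortened version of the name data used here; it is meant as an illustration, not as the only way to slice.

```python
import pandas as pd

s = pd.Series(['James', 'Bob', 'Alex', 'Simon'], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']    # label slice: the stop label 'c' is included
by_position = s.iloc[0:3]    # position slice: position 3 is excluded

# Both selections contain the same three items.
print(by_label.tolist())
print(by_position.tolist())
```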

In [331]:
s[0:3] # slicing with integer positional index, end index item not included.

a    James
b      Bob
c     Alex
dtype: object

In [332]:
s[['a','f']]

a    James
f      Tom
dtype: object

In [333]:
s.iloc[[0,4]] # integer positions can still be used through .iloc although the index holds labels; plain s[[0,4]] works in older pandas but is deprecated in newer versions

a    James
e    Kevin
dtype: object

In [334]:
s.array

<PandasArray>
['James', 'Bob', 'Alex', 'Simon', 'Kevin', 'Tom']
Length: 6, dtype: object

In [335]:
s.index

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

In [336]:
s=s.astype('string')
s

a    James
b      Bob
c     Alex
d    Simon
e    Kevin
f      Tom
dtype: string

In [337]:
s.array

<StringArray>
['James', 'Bob', 'Alex', 'Simon', 'Kevin', 'Tom']
Length: 6, dtype: string

### Change Value

In [338]:
s['a']='Lucas'
s

a    Lucas
b      Bob
c     Alex
d    Simon
e    Kevin
f      Tom
dtype: string

In [339]:
s.iloc[1]='Jack'  # assign by integer position
s

a    Lucas
b     Jack
c     Alex
d    Simon
e    Kevin
f      Tom
dtype: string

In [340]:
s.iloc[-2]='Ben'  # negative positions count from the end
s

a    Lucas
b     Jack
c     Alex
d    Simon
e      Ben
f      Tom
dtype: string

In [341]:
s=pd.concat([s, pd.Series(['John'],index=['z'])])     # add more items (Series.append was removed in pandas 2.0)
s

a    Lucas
b     Jack
c     Alex
d    Simon
e      Ben
f      Tom
z     John
dtype: object

In [342]:
s=s.drop('z')  # remove items; pass a list of index labels to drop more than one.
s

a    Lucas
b     Jack
c     Alex
d    Simon
e      Ben
f      Tom
dtype: object

### Special note when creating a Series from an array or Series: reference vs. copy

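When a new Series is created from a NumPy array or from another Series, the data is not copied: both objects refer to the same underlying values, so changing one affects the other (for numeric values). This is the behavior of the pandas version that produced the outputs below; newer versions with Copy-on-Write enabled do not propagate such changes.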

In [343]:
a = np.array([i for i in range(10,20)])
a

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [344]:
i = np.array([i for i in 'abcdefghia'])
i

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'a'], dtype='<U1')

In [345]:
s1 = pd.Series(a, index=i)
s1

a    10
b    11
c    12
d    13
e    14
f    15
g    16
h    17
i    18
a    19
dtype: int32

In [346]:
s1['a']  # repeated index values are allowed.

a    10
a    19
dtype: int32

In [347]:
s1.array

<PandasArray>
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Length: 10, dtype: int32

In [348]:
type(s1.array)

pandas.core.arrays.numpy_.PandasArray

In [349]:
id(s1.array)

2720434844192

In [350]:
id(a)

2720434753136

In [351]:
id(s1.to_numpy())
# For numeric data, a Series is backed by a PandasArray,
# which in turn wraps a NumPy ndarray.
# The matching object id shows that it is the same ndarray as a.

2720434753136

In [352]:
a is s1.to_numpy()

True

In [353]:
s1.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'a'], dtype='object')

In [354]:
id(s1.index)

2720142831520

In [355]:
id(i)

2720434734256

In [356]:
id(s1.index.array)

2720434919200

In [357]:
type(s1.index.array)

pandas.core.arrays.numpy_.PandasArray

In [358]:
id(s1.index.to_numpy())

2720434559760

In [359]:
i is s1.index.to_numpy()

False

We can see that to_numpy() returns the original array a that we used to create the Series, whereas the index does not share the original array i.  
This means that if we **change a value in the NumPy ndarray a, the Series s1 is affected**.

In [360]:
a[0]=100
a

array([100,  11,  12,  13,  14,  15,  16,  17,  18,  19])

In [361]:
s1

a    100
b     11
c     12
d     13
e     14
f     15
g     16
h     17
i     18
a     19
dtype: int32

In [362]:
s2 = pd.Series(s1)
s2

a    100
b     11
c     12
d     13
e     14
f     15
g     16
h     17
i     18
a     19
dtype: int32

In [363]:
s2.array is s1.array

False

In [364]:
s1.to_numpy() is s2.to_numpy()

True

In [365]:
s1.index is s2.index

True

s1 and s2 share the same index object and the same NumPy ndarray; only the PandasArray wrapper object differs between them.  
This also means that changing the data of s1 changes s2 as well.

In [366]:
s1 is s2

False

In [367]:
s1.loc['e'] = 200
s1

a    100
b     11
c     12
d     13
e    200
f     15
g     16
h     17
i     18
a     19
dtype: int32

In [368]:
s2

a    100
b     11
c     12
d     13
e    200
f     15
g     16
h     17
i     18
a     19
dtype: int32
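
If the sharing shown above is not wanted, one option is to copy the data explicitly. The sketch below assumes small made-up arrays (`a`, `s_orig`) and uses `ndarray.copy()` and `Series.copy()`, which both return independent data.

```python
import numpy as np
import pandas as pd

a = np.arange(10, 15)

# Copy the array first, so the Series owns its own data.
s_copy = pd.Series(a.copy())
a[0] = 100                  # change the original array
print(s_copy.iloc[0])       # the Series keeps its original value

# Series.copy() makes a deep copy by default, so the two Series are independent.
s_orig = pd.Series([1, 2, 3])
s_dup = s_orig.copy()
s_orig.iloc[0] = 99         # change the original Series
print(s_dup.iloc[0])        # the copy keeps its original value
```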

### Filtering Series Data

A comparison on a Series produces a boolean Series of the same shape, which can then be used as a mask to filter the data.

In [369]:
s2=s2-18
f = s2>5
f

a     True
b    False
c    False
d    False
e     True
f    False
g    False
h    False
i    False
a    False
dtype: bool

In [370]:
s2[f]

a     82
e    182
dtype: int32

In [371]:
s2*f.astype('int32')  # convert bool to int and multiply, so the filtered-out values become 0

a     82
b      0
c      0
d      0
e    182
f      0
g      0
h      0
i      0
a      0
dtype: int32

In [372]:
s2*f.map({False:np.nan, True:1})  # this turns the filtered-out items into NaN.

a     82.0
b      NaN
c      NaN
d      NaN
e    182.0
f      NaN
g      NaN
h      NaN
i      NaN
a      NaN
dtype: float64
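
The two multiplication tricks above can also be expressed with `Series.where`, which keeps the values where the mask is True and replaces the rest. This is a sketch of an alternative on made-up values, not a claim about which form is preferable.

```python
import pandas as pd

s = pd.Series([82, -7, -6, 182], index=['a', 'b', 'c', 'e'])
mask = s > 5

kept_or_nan = s.where(mask)        # values failing the mask become NaN
kept_or_zero = s.where(mask, 0)    # values failing the mask become 0

print(kept_or_nan)
print(kept_or_zero)
```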

### Math Functions on a One-Dimensional Series

In [373]:
ss = s1/2
ss

a     50.0
b      5.5
c      6.0
d      6.5
e    100.0
f      7.5
g      8.0
h      8.5
i      9.0
a      9.5
dtype: float64

In [374]:
ss = np.exp(s1)
ss

a    2.688117e+43
b    5.987414e+04
c    1.627548e+05
d    4.424134e+05
e    7.225974e+86
f    3.269017e+06
g    8.886111e+06
h    2.415495e+07
i    6.565997e+07
a    1.784823e+08
dtype: float64

### Series data manipulation methods

In [375]:
d1 = ss.reset_index()  # reset_index turns the original index into a column, so the Series becomes a DataFrame with 2 columns.
d1

  index             0
0     a  2.688117e+43
1     b  5.987414e+04
2     c  1.627548e+05
3     d  4.424134e+05
4     e  7.225974e+86
5     f  3.269017e+06
6     g  8.886111e+06
7     h  2.415495e+07
8     i  6.565997e+07
9     a  1.784823e+08
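
`reset_index` also accepts a `name` argument for the value column and a `drop` argument that discards the old index instead of keeping it. The sketch below uses a small made-up Series to show both; the column name `'value'` is chosen only for this example.

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

# Name the value column instead of getting the default column name 0.
df = s.reset_index(name='value')

# drop=True discards the old index; the result is still a Series.
s_plain = s.reset_index(drop=True)

print(df)
print(s_plain)
```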


In [376]:
ss.unique()

array([2.68811714e+43, 5.98741417e+04, 1.62754791e+05, 4.42413392e+05,
       7.22597377e+86, 3.26901737e+06, 8.88611052e+06, 2.41549528e+07,
       6.56599691e+07, 1.78482301e+08])

In [377]:
ss.value_counts()

2.688117e+43    1
5.987414e+04    1
1.627548e+05    1
4.424134e+05    1
7.225974e+86    1
3.269017e+06    1
8.886111e+06    1
2.415495e+07    1
6.565997e+07    1
1.784823e+08    1
dtype: int64

isin() checks whether each item of a Series is contained in a given list.

In [378]:
selection = ss.isin([2.**i for i in range(10)])
selection

a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
i    False
a    False
dtype: bool

In [379]:
ss[selection]

Series([], dtype: float64)
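
In the cell above nothing matched, so the selection is empty. For comparison, here is a minimal sketch on made-up values where some items do match the list.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 4.0, 10.0, 16.0], index=list('abcde'))
powers_of_two = [2.0 ** k for k in range(5)]   # 1, 2, 4, 8, 16

selection = s.isin(powers_of_two)   # boolean Series, True where the value is in the list
matched = s[selection]

print(matched)
```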

## Calculations Between Two Series

In [380]:
ss

a    2.688117e+43
b    5.987414e+04
c    1.627548e+05
d    4.424134e+05
e    7.225974e+86
f    3.269017e+06
g    8.886111e+06
h    2.415495e+07
i    6.565997e+07
a    1.784823e+08
dtype: float64

In [381]:
s1

a    100
b     11
c     12
d     13
e    200
f     15
g     16
h     17
i     18
a     19
dtype: int32

In [382]:
ss1 = ss * s1
ss1

a    2.688117e+45
b    6.586156e+05
c    1.953057e+06
d    5.751374e+06
e    1.445195e+89
f    4.903526e+07
g    1.421778e+08
h    4.106342e+08
i    1.181879e+09
a    3.391164e+09
dtype: float64

In [383]:
ss2 = ss - s1
ss2

a    2.688117e+43
b    5.986314e+04
c    1.627428e+05
d    4.424004e+05
e    7.225974e+86
f    3.269002e+06
g    8.886095e+06
h    2.415494e+07
i    6.565995e+07
a    1.784823e+08
dtype: float64

In [384]:
s1.index = pd.Index([i for i in 'abcayzuvwd'])
s1

a    100
b     11
c     12
a     13
y    200
z     15
u     16
v     17
w     18
d     19
dtype: int32

In [385]:
ss1 = ss * s1
ss1

a    2.688117e+45
a    3.494552e+44
a    1.784823e+10
a    2.320270e+09
b    6.586156e+05
c    1.953057e+06
d    8.405854e+06
e             NaN
f             NaN
g             NaN
h             NaN
i             NaN
u             NaN
v             NaN
w             NaN
y             NaN
z             NaN
dtype: float64

In [386]:
ss2 = ss - s1
ss2

a    2.688117e+43
a    2.688117e+43
a    1.784822e+08
a    1.784823e+08
b    5.986314e+04
c    1.627428e+05
d    4.423944e+05
e             NaN
f             NaN
g             NaN
h             NaN
i             NaN
u             NaN
v             NaN
w             NaN
y             NaN
z             NaN
dtype: float64

The rules for these calculations:  
- For **labels present in both Series, the values are combined item by item**.
- If **a label is duplicated, every possible combination** of the matching items is computed, which creates more duplicated labels in the result.
- If a label exists in only one Series, the missing value in the other Series is treated as NaN, so the result for that label is NaN.
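
The rules above can be reproduced on a small scale. The sketch below uses two made-up Series, `x` and `y`, with a duplicated label `'a'` in both and one label unique to each. It also shows `Series.add` with `fill_value`, which treats a value missing on one side as 0 instead of NaN.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
y = pd.Series([10, 20, 30], index=['a', 'a', 'c'])

# Label 'a' appears twice on both sides, giving 2 x 2 = 4 combinations.
# Labels 'b' and 'c' exist on one side only, so their results are NaN.
total = x + y
print(total)

# With fill_value=0, a value missing on one side is replaced by 0 before adding.
filled = x.add(y, fill_value=0)
print(filled)
```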