# Pandas Primer - 1
- Pandas
- Series
    - Creating series
    - Indexing & slicing series
    - Series operations
- DataFrames
    - Creating DataFrames
    - Column operations
    - Indexing & Slicing
    - Viewing DataFrame
    - DataFrames and arrays

## 1. Pandas
<img src="https://cdn-images-1.medium.com/max/1600/1*93CVLqnQESmvfOhzvYUgQw.png" style="width: 400px"/>

- Python library for data manipulation and analysis
- Offers data structures similar to relational database (RDB) tables, Excel spreadsheets, and R data frames
    - DataFrame & Series

In [1]:
# importing pandas
# pandas is imported using alias 'pd' in most cases
import pandas as pd
import numpy as np

## 2. Series
- A Series is similar to a *one-dimensional array or list*, but with an **index** label for each element
    - By default, the **index** is set to ```[0, 1, 2, ..., n-1]```
    - However, the user can set a custom **index**, which must have the same length as the data

### Creating series
- From scalar value
- From array/list/tuple
- From Python dictionary

In [2]:
# creating series from scalar
# note that when a scalar is passed, all elements in the Series are the same
s1 = pd.Series(10)                         # without any index, length = 1
print(s1)
s2 = pd.Series(10., index = range(3))     # when index is set of integers
print(s2)
s3 = pd.Series(10., index = ['a', 'b', 'c']) # when index is strings
print(s3)

0    10
dtype: int64
0    10.0
1    10.0
2    10.0
dtype: float64
a    10.0
b    10.0
c    10.0
dtype: float64


In [3]:
# creating series from array/list/tuple
s1 = pd.Series([1, 2, 3])            # from list
print(s1)
s2 = pd.Series((1., 2., 3.))         # from tuple
print(s2)
s3 = pd.Series(np.ones(3))           # from array
print(s3)

0    1
1    2
2    3
dtype: int64
0    1.0
1    2.0
2    3.0
dtype: float64
0    1.0
1    1.0
2    1.0
dtype: float64


In [4]:
# creating series from dictionary
# note that keys function as index & values as elements
dictionary = {'a': 0, 'b': 1, 'c': 2}
s1 = pd.Series(dictionary)
print(s1)
dictionary = {0: 'a', 1: 'b', 2: 'c'}
s2 = pd.Series(dictionary)
print(s2)

a    0
b    1
c    2
dtype: int64
0    a
1    b
2    c
dtype: object


In [6]:
# extracting index and values from series
s1 = pd.Series(['a', 'b', 'c', 'd', 'e'], index = [1, 2, 3, 4, 5])
print(s1)
print(s1.index)
print(s1.values)
print(s1.dtype)

1    a
2    b
3    c
4    d
5    e
dtype: object
Int64Index([1, 2, 3, 4, 5], dtype='int64')
['a' 'b' 'c' 'd' 'e']
object


### Indexing & slicing series
- Indexing & slicing are similar to those of NumPy arrays
- But note that a Pandas series can also be indexed & sliced by its *index* labels, not only by position

In [7]:
# indexing examples
# note that result of indexing is scalar value
s1 = pd.Series([0, 1, 2, 3], index = ['a', 'b', 'c', 'd'])
print(s1.iloc[0])    # first element in series (by position; s1[0] is deprecated with a string index)
print(s1['a'])       # element with index 'a'
print(s1.iloc[-1])   # last element in series (by position)
print(s1['d'])       # element with index 'd'

0
0
3
3


In [8]:
# slicing examples
# note that result of slicing is another series
s1 = pd.Series([0, 1, 2, 3], index = ['a', 'b', 'c', 'd'])
print(s1[1:])         # elements except first (by position)
print(s1[:-1])        # elements except last (by position)
print(s1['b':])       # elements except first (by label)
print(s1[:'c'])       # elements except last (label slice includes end label 'c')

b    1
c    2
d    3
dtype: int64
a    0
b    1
c    2
dtype: int64
b    1
c    2
d    3
dtype: int64
a    0
b    1
c    2
dtype: int64
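
The difference between label-based and position-based access matters most when the index itself consists of integers. The following is a minimal sketch, assuming a small illustrative series `s` (not part of the cells above), that contrasts ```loc``` (labels) with ```iloc``` (positions).

```python
import pandas as pd

# illustrative series whose integer labels do not match positions
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[1, 2, 3, 4, 5])

by_label = s.loc[1]        # element whose index label is 1
by_position = s.iloc[1]    # element at position 1 (the second element)
print(by_label, by_position)

# label slices include the end label; position slices exclude the end position
label_slice = s.loc[2:4]       # labels 2, 3, 4
position_slice = s.iloc[2:4]   # positions 2 and 3 (labels 3 and 4)
print(label_slice)
print(position_slice)
```

Using ```loc``` and ```iloc``` explicitly avoids ambiguity about whether an integer refers to a label or a position.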


### Series operations
- As with NumPy arrays, series operations are mostly *element-wise*
- Also note that most NumPy functions can be applied to series

In [9]:
s1 = pd.Series(np.arange(3))
print(s1)
s2 = pd.Series(np.arange(3, 0, -1))
print(s2)

0    0
1    1
2    2
dtype: int32
0    3
1    2
2    1
dtype: int32


In [10]:
# basic operations - similar to NumPy
print(s1 + s2)
print(s1 * 2)
print(s1 ** 2)
print(np.exp(s2))

0    3
1    3
2    3
dtype: int32
0    0
1    2
2    4
dtype: int32
0    0
1    1
2    4
dtype: int32
0    20.085537
1     7.389056
2     2.718282
dtype: float64
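
One difference from NumPy is worth seeing in code: when two series are combined, elements are matched by index label rather than by position. The sketch below uses two illustrative series `x` and `y` (names chosen here, not from the cells above) whose indexes only partly overlap.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# labels are aligned first; labels present in only one series give NaN
total = x + y
print(total)

# add() with fill_value treats a label missing on one side as 0
filled = x.add(y, fill_value=0)
print(filled)
```

The series in the cells above share the default index, so alignment and position coincide there.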


### Exercise 1-1.
- Create a NumPy array consisting of only the even integers between 0 and 29 (0, 2, ..., 28)
- Convert the array into a Pandas series
- Set the index of the series to the odd integers between 0 and 29 (1, 3, 5, ..., 29)
- Print series

In [None]:
## Your answer

## 3. DataFrames
- A DataFrame is similar to a two-dimensional array or list of lists (i.e., a matrix), but with an index
- Its structure is similar to that of an Excel spreadsheet, RDB table, R data frame, etc.
    - If confused, just think of it as a table!
    
<img src="https://i.stack.imgur.com/G5PWJ.png" style="width: 400px"/>

<center> Pandas DataFrame </center>

<img src="https://cloud.addictivetips.com/wp-content/uploads/2010/04/copy1.jpg" style="width: 400px"/>

<center> Excel Spreadsheet </center>

<img src="https://gonehybrid.com/content/images/2017/02/table.png" style="width: 400px"/>

<center> SQL table </center>

<img src="http://www.zorro-trader.com/manual/images/cars-dataframe.png" style="width: 400px"/>

<center> R Dataframe </center>

### Creating DataFrames
- From Python dictionary (of 1-D lists, arrays, or series)
- From 2-D NumPy array
- From list of dictionaries

In [17]:
# creating df from dict
# note that keys become column names
dictionary = {'col': [1, 2, 3], 'col2': np.arange(3), 'col3': pd.Series([2., 4., 6.])}
df = pd.DataFrame(dictionary)
print(df)

   col  col2  col3
0    1     0   2.0
1    2     1   4.0
2    3     2   6.0


In [None]:
# note that arrays should have equal length! this cell raises ValueError
dictionary = {'col': [1, 2], 'col2': np.arange(3)}
df = pd.DataFrame(dictionary)
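
To see the failure without stopping the notebook, the error can be caught. This is a minimal sketch; the variable names `mismatched` and `raised` are illustrative, and the exact error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# columns of length 2 and 3 cannot form one table
mismatched = {'col': [1, 2], 'col2': np.arange(3)}

try:
    pd.DataFrame(mismatched)
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```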

In [18]:
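# index can be set in DataFrames as well
# redefine the equal-length dictionary, as the previous cell overwrote it
dictionary = {'col': [1, 2, 3], 'col2': np.arange(3), 'col3': pd.Series([2., 4., 6.])}
df = pd.DataFrame(dictionary, index = [0, 1, 2])
print(df)
df = pd.DataFrame(dictionary, index = ['a', 'b', 'c'])
print(df)    # note that col3 is NaN, as the Series index (0, 1, 2) does not match 'a', 'b', 'c'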

   col  col2  col3
0    1     0   2.0
1    2     1   4.0
2    3     2   6.0
   col  col2  col3
a    1     0   NaN
b    2     1   NaN
c    3     2   NaN
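
The NaN values above come from label alignment: lists and arrays are placed by position, while a Series is matched against the requested index. One way to avoid the NaN values, sketched here with illustrative names `labels`, `data`, and `df_aligned`, is to give the Series the same index labels.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
data = {
    'col': [1, 2, 3],                                # placed by position
    'col2': np.arange(3),                            # placed by position
    'col3': pd.Series([2., 4., 6.], index=labels),   # matched by label
}

df_aligned = pd.DataFrame(data, index=labels)
print(df_aligned)
```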


In [19]:
# creating df from 2-D array
a = np.array([[1,2,3], [4,5,6]])
df = pd.DataFrame(a)
print(df)

   0  1  2
0  1  2  3
1  4  5  6


In [20]:
# creating df from 2-D array
# index & column names can be designated
a = np.array([[1,2,3], [4,5,6]])
df = pd.DataFrame(a, index = ['a', 'b'], columns = ['x', 'y', 'z'])
print(df)

   x  y  z
a  1  2  3
b  4  5  6


In [21]:
# creating df from list of dictionaries
# note that keys become column names in df
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df = pd.DataFrame(l)
print(df)

   a  b
0  1  2
1  3  4
2  5  6


In [22]:
# creating df from list of dictionaries
# index & column names can be designated here as well
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df = pd.DataFrame(l, index = ['x', 'y', 'z'], columns = ['a', 'b'])
print(df)

   a  b
x  1  2
y  3  4
z  5  6


### Column operations
- A DataFrame can be thought of as a set of series that share an index, each with a column name attached
- So each column can be selected and manipulated by its column name
    - Also note that since columns are basically series, their operations are *element-wise* by default

In [23]:
l1 = [1., 2., 3., 4., 5.]
l2 = [1, 2, 3, 4, 5]
l3 = ['a', 'b', 'c', 'd', 'e']
l4 = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame({'float': l1, 'int': l2, 'lower': l3, 'upper': l4})
print(df)

   float  int lower upper
0    1.0    1     a     A
1    2.0    2     b     B
2    3.0    3     c     C
3    4.0    4     d     D
4    5.0    5     e     E


In [24]:
# selecting single column
c1 = df['float']
print(type(c1))      # note that it is Series type
print(c1)

<class 'pandas.core.series.Series'>
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
Name: float, dtype: float64


In [25]:
# selecting multiple columns
c23 = df[['int', 'lower']]
print(type(c23))     # note that it is DataFrame type
print(c23)

<class 'pandas.core.frame.DataFrame'>
   int lower
0    1     a
1    2     b
2    3     c
3    4     d
4    5     e


In [26]:
# deleting single column
del df['upper']
print(df)

   float  int lower
0    1.0    1     a
1    2.0    2     b
2    3.0    3     c
3    4.0    4     d
4    5.0    5     e


In [27]:
# creating columns with single element
df['upper'] = 'A'
print(df)

   float  int lower upper
0    1.0    1     a     A
1    2.0    2     b     A
2    3.0    3     c     A
3    4.0    4     d     A
4    5.0    5     e     A


In [28]:
# creating columns with list/array
df['upper'] = ['A', 'B', 'C', 'D', 'E']
print(df)

   float  int lower upper
0    1.0    1     a     A
1    2.0    2     b     B
2    3.0    3     c     C
3    4.0    4     d     D
4    5.0    5     e     E


In [29]:
# showing dtype of each column
print(df.dtypes)

float    float64
int        int64
lower     object
upper     object
dtype: object
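
Since columns are series, arithmetic between them is element-wise, and the result can be stored as a new column. The following is a small sketch on an illustrative DataFrame `df_demo` (a shortened version of the one above; the new column names are chosen here for illustration).

```python
import pandas as pd

df_demo = pd.DataFrame({
    'float': [1., 2., 3.],
    'int': [1, 2, 3],
    'lower': ['a', 'b', 'c'],
})

# element-wise arithmetic between two columns
df_demo['sum'] = df_demo['float'] + df_demo['int']

# element-wise arithmetic between a column and a scalar
df_demo['int_squared'] = df_demo['int'] ** 2

# element-wise string method on a column of strings
df_demo['upper'] = df_demo['lower'].str.upper()

print(df_demo)
```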


### Indexing & Slicing
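- Slicing with ```df[start:stop]``` and selection with ```loc```/```iloc``` are *row-wise*. In other words, they are based on *index*, not column names
- In contrast, ```df['name']``` with a single label selects a column, as shown in the previous section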

| Operation | Syntax | Result |
|-----------|--------|--------|
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[start_idx:last_idx] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |

In [30]:
# creating data
a1 = np.arange(10, dtype = np.float32)
a2 = np.random.random(10)
a3 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame({'col1': a1, 'col2': a2}, index = a3)
print(df)

   col1      col2
a   0.0  0.836176
b   1.0  0.998377
c   2.0  0.467166
d   3.0  0.510646
e   4.0  0.099334
f   5.0  0.831982
g   6.0  0.034774
h   7.0  0.551889
i   8.0  0.796765
j   9.0  0.022302


In [31]:
# slicing rows
df_ = df[5:]       # start from 6th row
print(df_)
df_ = df[-4:]      # start from 4th last row
print(df_)

   col1      col2
f   5.0  0.831982
g   6.0  0.034774
h   7.0  0.551889
i   8.0  0.796765
j   9.0  0.022302
   col1      col2
g   6.0  0.034774
h   7.0  0.551889
i   8.0  0.796765
j   9.0  0.022302


In [32]:
# selecting rows by index label
print(df.loc['a'])       # note that result is series
print(df.loc['a':'c'])   # note that result is df (end label 'c' is included)

col1    0.000000
col2    0.836176
Name: a, dtype: float64
   col1      col2
a   0.0  0.836176
b   1.0  0.998377
c   2.0  0.467166


In [33]:
# selecting rows by index location
print(df.iloc[0])        # note that result is series
print(df.iloc[2:4])      # note that result is df (end position 4 is excluded)

col1    0.000000
col2    0.836176
Name: a, dtype: float64
   col1      col2
c   2.0  0.467166
d   3.0  0.510646


In [34]:
# getting a scalar value
print(df.iloc[0, 0])       # first element in first row
print(df.iloc[1, 1])       # second element in second row

0.0
0.998376972548
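
The table at the start of this section also lists selection by boolean vector, which the cells above do not cover. Here is a minimal sketch using an illustrative DataFrame `df_demo` with fixed values (instead of random ones) so that the selected rows are predictable.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {'col1': np.arange(5, dtype=np.float32), 'col2': [0.9, 0.1, 0.6, 0.3, 0.8]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# comparison on a column gives a boolean series with the same index
mask = df_demo['col2'] > 0.5
print(mask)

# rows where mask is True are kept; result is a DataFrame
selected = df_demo[mask]
print(selected)

# conditions are combined with & (and) or | (or), each wrapped in parentheses
both = df_demo[(df_demo['col2'] > 0.5) & (df_demo['col1'] >= 2)]
print(both)
```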


### Viewing DataFrame
- When a DataFrame is too large to print in full, one can view only a certain number of rows using ```head()``` or ```tail()```

In [36]:
# shape of df can be obtained as with a NumPy array
a1 = np.arange(1000)
a2 = np.arange(1000, 0, -1)
df = pd.DataFrame({'col1': a1, 'col2': a2})
print(df.shape)    # so we have 1000 rows and 2 columns

(1000, 2)


In [37]:
# print only first & last 5 rows
print(df.head())
print(df.tail())

   col1  col2
0     0  1000
1     1   999
2     2   998
3     3   997
4     4   996
     col1  col2
995   995     5
996   996     4
997   997     3
998   998     2
999   999     1


In [38]:
# print only first & last 3 rows
print(df.head(3))
print(df.tail(3))

   col1  col2
0     0  1000
1     1   999
2     2   998
     col1  col2
997   997     3
998   998     2
999   999     1


### DataFrames and arrays
- Like series, DataFrames are interoperable with NumPy arrays
- Also, most functions for NumPy arrays can be applied to DataFrames

In [39]:
a1 = np.random.random(10)
a2 = np.random.random(10)
a3 = np.random.random(10)

df = pd.DataFrame({'A': a1, 'B': a2, 'C': a3})
print(df)

          A         B         C
0  0.916240  0.323132  0.948865
1  0.039777  0.360001  0.381617
2  0.161079  0.326806  0.292763
3  0.627715  0.588707  0.097837
4  0.149464  0.081961  0.617773
5  0.755806  0.328113  0.002934
6  0.705548  0.978438  0.032138
7  0.747693  0.646978  0.356087
8  0.267015  0.022962  0.124409
9  0.547409  0.256276  0.584715


In [40]:
# transposing df
df_transposed = df.transpose()
print(df_transposed)       # rows & columns are swapped

          0         1         2         3         4         5         6  \
A  0.916240  0.039777  0.161079  0.627715  0.149464  0.755806  0.705548   
B  0.323132  0.360001  0.326806  0.588707  0.081961  0.328113  0.978438   
C  0.948865  0.381617  0.292763  0.097837  0.617773  0.002934  0.032138   

          7         8         9  
A  0.747693  0.267015  0.547409  
B  0.646978  0.022962  0.256276  
C  0.356087  0.124409  0.584715  


In [41]:
# applying mathematical function on df
df_exp = np.exp(df)
print(df_exp)
df_dot = np.dot(df_transposed, df)
print(df_dot)

          A         B         C
0  2.499873  1.381447  2.582777
1  1.040579  1.433330  1.464651
2  1.174778  1.386533  1.340125
3  1.873325  1.801657  1.102783
4  1.161211  1.085414  1.854793
5  2.129328  1.388346  1.002938
6  2.024956  2.660299  1.032660
7  2.112121  1.909761  1.427732
8  1.306061  1.023227  1.132479
9  1.728769  1.292110  1.794479
[[ 3.28242927  2.31330297  1.72990755]
 [ 2.31330297  2.24389517  1.0633914 ]
 [ 1.72990755  1.0633914   2.00811071]]


In [42]:
# converting df into 2-D array
df_as_array = df.values
print(type(df_as_array))
print(df_as_array.shape)

<class 'numpy.ndarray'>
(10, 3)
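
Note that ```np.dot``` above returned a plain NumPy array, so the row and column labels were lost. As a sketch of an alternative, assuming a small illustrative DataFrame `df_demo`, the ```dot()``` method of a DataFrame keeps the labels, and ```to_numpy()``` is another way to convert a DataFrame into an array.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1., 2.], 'B': [3., 4.], 'C': [5., 6.]})

# NumPy function: result is a 3 x 3 ndarray without labels
gram_array = np.dot(df_demo.T, df_demo)
print(type(gram_array))

# DataFrame method: same numbers, but labels A, B, C are kept
gram_frame = df_demo.T.dot(df_demo)
print(gram_frame)

# converting to a 2-D array
arr = df_demo.to_numpy()
print(arr.shape)
```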


### Exercise 1-2.
- Create a Pandas DataFrame consisting of three columns ```"X", "Y", and "Z"```
    - ```X column```: integers from 1 to 50
    - ```Y column```: all zeros
    - ```Z column```: all ones
- Double the elements in ```X column```, i.e., multiply each element by 2
- Swap the elements in ```Y column``` and ```Z column```
    - So make ```Y column``` all ones, and ```Z column``` all zeros
- Print the first ten rows of the DataFrame

In [None]:
## Your answer

### Exercise 1-3.
- Create a Pandas DataFrame consisting of four columns ```"A", "B", "C", and "D"```
    - ```A column```: integers from 0 to 9
    - ```B column```: integers from 10 to 19
    - ```C column```: integers from 20 to 29
    - ```D column```: integers from 30 to 39
    - ```Index```: integers from 40 to 49
- Using ```iloc```, select and print the rows with index 45 to 47
- Using ```loc```, select and print the last 3 rows

In [None]:
## Your answer