# Pandas Primer - 1
- Pandas
- Series
    - Creating series
    - Indexing & slicing series
    - Series operations
- DataFrames
    - Creating DataFrames
    - Column operations
    - Indexing & Slicing
    - Viewing DataFrame
    - DataFrames and arrays

## 1. Pandas
<img src="https://cdn-images-1.medium.com/max/1600/1*93CVLqnQESmvfOhzvYUgQw.png" style="width: 400px"/>

- Python library for data manipulation and analysis
- Offers data structures similar to those of relational databases (RDB), Excel, and R
    - DataFrame & Series

In [None]:
# importing pandas
# pandas is imported using alias 'pd' in most cases
import pandas as pd
import numpy as np

## 2. Series
- A Series is similar to a *one-dimensional array or list*, but with an **index** label for each element
    - By default, the **index** is set to ```[0, 1, 2, ..., n-1]```
    - However, the user can set a custom **index**, which must have the same length as the data
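
As a minimal sketch of the length rule above (the names `s_bad`, `s_ok`, and `raised` are illustrative and not part of the original notebook), passing a list together with an index of a different length raises a `ValueError`:

```python
import pandas as pd

# index length (2) does not match data length (3), so pandas raises ValueError
try:
    s_bad = pd.Series([1, 2, 3], index=['a', 'b'])
    raised = False
except ValueError as err:
    raised = True
    print(err)

# matching lengths work as expected
s_ok = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s_ok)
```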

### Creating series
- From scalar value
- From array/list/tuple
- From Python dictionary

In [None]:
# creating series from scalar
# note that when passing scalar, elements in Series are all same
s1 = pd.Series(10)                         # without any index, length = 1
print(s1)
s2 = pd.Series(10., index = range(3))     # when index is set of integers
print(s2)
s3 = pd.Series(10., index = ['a', 'b', 'c']) # when index is strings
print(s3)

In [None]:
# creating series from array/list/tuple
s1 = pd.Series([1, 2, 3])            # from list
print(s1)
s2 = pd.Series((1., 2., 3.))         # from tuple
print(s2)
s3 = pd.Series(np.ones(3))           # from array
print(s3)

In [None]:
# creating series from dictionary
# note that keys function as index & values as elements
dictionary = {'a': 0, 'b': 1, 'c': 2}
s1 = pd.Series(dictionary)
print(s1)
dictionary = {0: 'a', 1: 'b', 2: 'c'}
s2 = pd.Series(dictionary)
print(s2)

In [None]:
# extracting index and values from series
s1 = pd.Series(['a', 'b', 'c', 'd', 'e'], index = [1, 2, 3, 4, 5])
print(s1.index)
print(s1.values)
print(s1.dtype)

### Indexing & slicing series
- Indexing & slicing are similar to those of NumPy arrays
- But note that a Pandas series can be indexed & sliced by its *index* labels, not only by position
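
When the index itself holds integers, labels and positions can disagree. The sketch below, which uses an illustrative series `s` that is not part of the original notebook, shows one way to make the intent explicit: `.loc` looks up by label and `.iloc` by position.

```python
import pandas as pd

# illustrative series whose integer labels start at 1, not 0
s = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])

by_label = s.loc[1]       # label 1 is the first element
by_position = s.iloc[1]   # position 1 is the second element
print(by_label, by_position)
```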

In [None]:
# indexing examples
# note that result of indexing is scalar value
s1 = pd.Series([0, 1, 2, 3], index = ['a', 'b', 'c', 'd'])
print(s1.iloc[0])    # first element in series (by position)
print(s1['a'])       # element with index 'a'
print(s1.iloc[-1])   # last element in series (by position)
print(s1['d'])       # element with index 'd'

In [None]:
# slicing examples
# note that result of slicing is another series
s1 = pd.Series([0, 1, 2, 3], index = ['a', 'b', 'c', 'd'])
print(s1[1:])         # elements except first
print(s1[:-1])        # elements except last
print(s1['b':])       # elements except first (label slice)
print(s1[:'c'])       # elements except last (label slice includes end label 'c')

### Series operations
- As with NumPy arrays, series operations are mostly *element-wise*
- Also note that most NumPy functions can be applied to series
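
One difference from NumPy is worth a quick sketch: element-wise operations between two series are matched on index labels rather than on position. The series `x` and `y` below are illustrative and not part of the original notebook.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# values are added where labels match; labels found in only one series give NaN
total = x + y
print(total)
```

Here only `'b'` and `'c'` appear in both series, so only those two entries hold a number.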

In [None]:
s1 = pd.Series(np.arange(3))
print(s1)
s2 = pd.Series(np.arange(3, 0, -1))
print(s2)

In [None]:
# basic operations - similar to NumPy
print(s1 + s2)
print(s1 * 2)
print(s1 ** 2)
print(np.exp(s2))

### Exercise 1-1.
- Create a NumPy array consisting only of even integers between 0 and 29 (0, 2, ..., 28)
- Convert array into Pandas series
- Set index of series to odd integers between 0 and 29 (1, 3, 5, ..., 29)
- Print series

In [None]:
## Your answer
even = np.arange(0, 30, 2)
odds = np.arange(1, 30, 2)
# print(even)
# print(odds)

srs = pd.Series(even)
print(srs.head())
srs = pd.Series(even, index = odds)
print(srs.head())

## 3. DataFrames
- A DataFrame is similar to a two-dimensional array or list (i.e., a matrix), but with an index
- It has similar structure to that of Excel spreadsheet, RDB Table, R Dataframe, etc.
    - If confused, just think of it as table!
    
<img src="https://i.stack.imgur.com/G5PWJ.png" style="width: 400px"/>

<center> Pandas DataFrame </center>

<img src="https://cloud.addictivetips.com/wp-content/uploads/2010/04/copy1.jpg" style="width: 400px"/>

<center> Excel Spreadsheet </center>

<img src="https://gonehybrid.com/content/images/2017/02/table.png" style="width: 400px"/>

<center> SQL table </center>

<img src="http://www.zorro-trader.com/manual/images/cars-dataframe.png" style="width: 400px"/>

<center> R Dataframe </center>

### Creating DataFrames
- From Python dictionary (of 1-D lists, arrays, or series)
- From 2-D NumPy array
- From list of dictionaries

In [None]:
# creating df from dict
# note that keys become column names
dictionary = {'col': [1, 2, 3], 'col2': np.arange(3), 'col3': pd.Series([2., 4., 6.])}
df = pd.DataFrame(dictionary)
print(df)

In [None]:
# note that arrays should have equal length; otherwise ValueError is raised
dictionary = {'col': [1, 2], 'col2': np.arange(3)}
try:
    df = pd.DataFrame(dictionary)
except ValueError as e:
    print(e)

In [None]:
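# index can be set in DataFrames as well
# (re-creating the equal-length dictionary from the first example)
dictionary = {'col': [1, 2, 3], 'col2': np.arange(3), 'col3': pd.Series([2., 4., 6.])}
df = pd.DataFrame(dictionary, index = [0, 1, 2])
print(df)
df = pd.DataFrame(dictionary, index = ['a', 'b', 'c'])
print(df)    # note that col3 has values NaN as its own index (0, 1, 2) does not match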

In [None]:
# creating df from 2-D array
a = np.array([[1,2,3], [4,5,6]])
df = pd.DataFrame(a)
print(df)

In [None]:
# creating df from 2-D array
# index & column names can be designated
a = np.array([[1,2,3], [4,5,6]])
df = pd.DataFrame(a, index = ['a', 'b'], columns = ['x', 'y', 'z'])
print(df)

In [None]:
# creating df from list of dictionaries
# note that keys become column names in df
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df = pd.DataFrame(l)
print(df)

In [None]:
# creating df from list of dictionaries
# index & column names can be designated here as well
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df = pd.DataFrame(l, index = ['x', 'y', 'z'], columns = ['a', 'b'])
print(df)
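
The dictionaries in the list do not have to share the same keys. As a minimal sketch (the names `records` and `df_records` are illustrative, not from the original notebook), a key missing from one dictionary shows up as `NaN` in that row:

```python
import pandas as pd

# second record has no 'b', first record has no 'c'
records = [{'a': 1, 'b': 2}, {'a': 3, 'c': 4}]
df_records = pd.DataFrame(records)
print(df_records)
```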

### Column operations
- A DataFrame can be thought of as a set of series, each with a column name attached to it
- So each column can be selected and manipulated by its column name
    - Also note that as columns are basically series, their operations are *element-wise* by default
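
Because each column is a series, arithmetic between columns works element by element. A minimal sketch, using an illustrative `df_cols` frame that is not part of the original notebook:

```python
import pandas as pd

df_cols = pd.DataFrame({'float': [1., 2., 3.], 'int': [1, 2, 3]})

# new columns computed element-wise from existing ones
df_cols['sum'] = df_cols['float'] + df_cols['int']
df_cols['int_squared'] = df_cols['int'] ** 2
print(df_cols)
```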

In [None]:
l1 = [1., 2., 3., 4., 5.]
l2 = [1, 2, 3, 4, 5]
l3 = ['a', 'b', 'c', 'd', 'e']
l4 = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame({'float': l1, 'int': l2, 'lower': l3, 'upper': l4})
print(df)

In [None]:
# selecting single column
c1 = df['float']
print(type(c1))      # note that it is Series type
print(c1)

In [None]:
# selecting multiple columns
c23 = df[['int', 'lower']]
print(type(c23))     # note that it is DataFrame type
print(c23)

In [None]:
# deleting single column
del df['upper']
print(df)

In [None]:
# creating columns with single element
df['upper'] = 'A'
print(df)

In [None]:
# creating columns with list/array
df['upper'] = ['A', 'B', 'C', 'D', 'E']
print(df)

In [None]:
# showing dtype of each column
print(df.dtypes)

### Indexing & Slicing
- With the ```[]``` operator, slicing and boolean selection are *row-wise*: they are based on the *index*, not the column names (a single label such as ```df['col1']``` selects a column instead)

| Operation | Syntax | Result |
|-----------|--------|--------|
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[start_idx:last_idx] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |
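
The last row of the table, selection by boolean vector, can be sketched as follows. The frame `df_bool` and the threshold are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

df_bool = pd.DataFrame({'col1': np.arange(5), 'col2': np.arange(5) * 10},
                       index=['a', 'b', 'c', 'd', 'e'])

# comparing a column to a scalar gives a boolean series (one value per row)
mask = df_bool['col1'] > 2
# rows where the mask is True are kept; result is a DataFrame
selected = df_bool[mask]
print(selected)
```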

In [None]:
# creating data
a1 = np.arange(10, dtype = np.float32)
a2 = np.random.random(10)
a3 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame({'col1': a1, 'col2': a2}, index = a3)
print(df)

In [None]:
# slicing rows
df_ = df[5:]       # start from 6th row
print(df_)
df_ = df[-4:]      # last 4 rows
print(df_)

In [None]:
# selecting rows by index label
print(df.loc['a'])       # note that result is series
print(df.loc['a':'c'])   # note that result is df

In [None]:
# selecting rows by index location
print(df.iloc[0])        # note that result is series 
print(df.iloc[2:4])      # note that result is df

In [None]:
# getting a scalar value
print(df.iloc[0, 0])       # first element in first row
print(df.iloc[1, 1])       # second element in second row

### Viewing DataFrame
- When a DataFrame is too large to print in full, one can view only a certain number of rows using ```head()``` or ```tail()```

In [None]:
# One can get shape of df like in NumPy array
a1 = np.arange(1000)
a2 = np.arange(1000, 0, -1)
df = pd.DataFrame({'col1': a1, 'col2': a2})
print(df.shape)    # so we have 1000 rows and 2 columns

In [None]:
# print only first & last 5 rows
print(df.head())
print(df.tail())

In [None]:
# print only first & last 3 rows
print(df.head(3))
print(df.tail(3))

### DataFrames and arrays
- Like series, DataFrames are interoperable with NumPy arrays
- Also, most functions for NumPy arrays can be applied to DataFrames
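
A minimal sketch of the two directions, assuming an illustrative frame `df_small` that is not part of the original notebook: element-wise NumPy functions such as `np.sqrt` return a DataFrame that keeps the labels, while `to_numpy()` gives a plain array for functions such as `np.dot`.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'A': [1., 4.], 'B': [9., 16.]}, index=['r1', 'r2'])

# element-wise function: result is still a DataFrame with same index & columns
rooted = np.sqrt(df_small)
print(rooted)

# explicit conversion to a 2-D array, then a matrix product on plain arrays
arr = df_small.to_numpy()
gram = np.dot(arr.T, arr)
print(gram)
```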

In [None]:
a1 = np.random.random(10)
a2 = np.random.random(10)
a3 = np.random.random(10)

df = pd.DataFrame({'A': a1, 'B': a2, 'C': a3})
print(df)

In [None]:
# transposing df
df_transposed = df.transpose()
print(df_transposed)       # rows & columns are reversed

In [None]:
# applying mathematical function on df
df_exp = np.exp(df)
print(df_exp)
df_dot = np.dot(df_transposed, df)
print(df_dot)

In [None]:
# converting df into 2-D array
df_as_array = df.values
print(type(df_as_array))
print(df_as_array.shape)

### Exercise 1-2.
- Create a Pandas DataFrame consisting of three columns ```"X", "Y", and "Z"```
    - ```X column```: integers from 1 to 50
    - ```Y column```: all zeros
    - ```Z column```: all ones
- Double the elements of ```X column```, i.e., multiply each element by 2
- Swap the elements of ```Y column``` and ```Z column```
    - So make ```Y column``` all ones, and ```Z column``` all zeros
- Print the first ten rows of the DataFrame

In [None]:
## Your answer
x = np.arange(1, 51)
y = np.zeros(50)
z = np.ones(50)

df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df['X'] = df['X'] * 2
df['Y'] = z
df['Z'] = y
print(df.head(10))

### Exercise 1-3.
- Create a Pandas DataFrame consisting of four columns ```"A", "B", "C", and "D"```
    - ```A column```: integers from 0 to 9
    - ```B column```: integers from 10 to 19
    - ```C column```: integers from 20 to 29
    - ```D column```: integers from 30 to 39
    - ```Index```: integers from 40 to 49
- Using ```iloc[]```, select and print rows with index 45 to 47
- Using ```loc[]```, select and print last 3 rows

In [None]:
## Your answer
a = np.arange(10)
b = np.arange(10, 20)
c = np.arange(20, 30)
d = np.arange(30, 40)
idx = np.arange(40, 50)

df = pd.DataFrame({'A': a, 'B': b, 'C': c, 'D': d}, index = idx)
# print(df)
df_ = df.iloc[5:8]
print(df_)
df_ = df.loc[47:]
print(df_)