# Pandas Data Structures

## Overview:

### Part I: Series

*`s = pd.Series()`*

- **1. Characteristics/Attributes of a Series:**
    - `s.name`
    - `s.index`
    - `s.values`
- **2. Accessing Series Elements:**
    - Indexing by integer `s.iloc[start:end:step]`
        - *reverse* $\rightarrow$ `.iloc[-start:-end:-1]`
    - Indexing by name `s.loc["index1":"index2"]`
- **3. Boolean Indexing:**
    - Elements greater than a value: `s.loc[s > n]`
    - Elements between two values: `s.loc[(s > n) & (s < m)]`
    - Index booleans: `s.loc[s.index > "index_name1"]`
- **4. Indexing and Time-Series:**
    - **DateTime Index:** `pd.date_range("startdate", periods=n, freq="D")`
    - **Resampling:**
        - Resample the Datetime Index by Frequency: `resampled = s.resample("M")`
        - Aggregate the Resampling by Function: `resampled.agg(function)`
    - **Re-indexing:**
        - Create a date range at the new frequency: `new_range = pd.date_range("startdate", "enddate", freq="D")`
        - Re-index the Series with the new date range: `reindexed = s.reindex(new_range, method="ffill")` (or `method="bfill"`)
- **5. Missing Values:**
    - `s.dropna()`


### Part II: DataFrames

*`df = pd.DataFrame()`*

- **1. Creating a DataFrame:** `pd.DataFrame(matrix, index, columns)`

- **2. Selecting Data:**
    - **a. Selecting Columns:**
       - Single Column: `df["Column name"]`
       - Multiple Columns (list): `df[["Col name 1", "Col name 2"]]` 
       - Attribute (dot) access: `df.column_name` (works only when the column name is a valid Python identifier)
    - **b. Selecting Rows:**
        - Based on Row Name: `df.loc["Row Name"]`
        - Based on Row Index: `df.iloc[index number]`
    - **c. Subsetting Rows and Columns:**
        - Values: `df.loc["Row Name", "Column Name"]`
        - Subset: `df.loc[["Row Name 1", "Row Name 2"], ["Column Name 1", "Column Name 2"]]`
    - **d. Conditional Selection:**
        - *First Step:* Generate the boolean condition (a boolean Series) `df["col"] > n`
        - *Second Step:* Pass the boolean Series to the DataFrame `df[df["col"] > n]`

        - 1. Multiple Conditions: `df[(df["Col1"] > n) & (df["Col2"] < m)]`

        - 2. Selecting a Column after Conditioning: `df[df["col1"] > m]["col2"]`

        - 3. Selecting a sub-matrix after conditioning: `df[df["col1"] > m][["col2","col3"]]`
    - **e. Creating New Data:** 
        - `df["New Column"] = df["Col1"] + df["Col2"]`
    - **f. Deleting Data:**
        - Drop a Row: `df.drop("row name", axis = 0, inplace=True)`

        - Drop a Column: `df.drop("col name", axis = 1, inplace = True)`

- **3. Indexing:**
    - **a. Resetting an Index:**
        - `df.reset_index()`
    - **b. Setting an Index:**
        - Create a new Index: `new_index = " ... ".split()`
        - Create a new Column as the new Index: `df["new index"] = new_index`
        - Set the new column as the index: `df.set_index("new index")`
    - **c. Multi-Indexing:** Two Factor Index:
        - **Step 1:** Generate the two indices so that each label of index 1 is paired with every label of index 2

            - *The first index:* `["G1", "G1", "G1", "G2", "G2", "G2", ...]`

            - *The second index:* `[1, 2, 3, 1, 2, 3, ...]`

        - **Step 2:** `zip` the two indices to get the combinations of outside and inside indices and place them in a list, i.e. zip the factors into factor(i,j) tuples

            - `[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]`
        - **Step 3:** Generate a multi-index from the list of tuples
            - `pd.MultiIndex.from_tuples(hierarchical_index)`
        - **Step 4:** Apply the multi-index to a DataFrame object

            - `df = pd.DataFrame(np.random.randn(6,2), index = hier_index, columns = "A B".split())`

        - **Step 5:** Name the levels of the multi-index with the attribute 
            - `df.index.names`
        - **Selecting from a Multi-Index DataFrame:**

            - a. `df.loc["first level label"].loc["second level label"]`
            - b. `df.xs(("first level label", "second level label"))`
            - c. Access the Second level for all First Level Indices
                - `df.xs("second level label", level = "second level name")`

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pandas_datareader.data as web

# Part I. Series

A Pandas Series is a 1-dimensional array **with labels** that can contain any data type.

**Main Usage of Series:** Handling time-series data

In [2]:
s = pd.Series([1,2,np.nan, 4, 5])
print(s)

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64


# 1. Characteristics/Attributes of a Series

## a. Name

Every Series has a `name` attribute, which is `None` until a name is assigned.

In [3]:
s.name = "Toy Series"

## b.  Index

The index of a Series holds its axis labels. An index must be exactly the same length as the Series itself, because each label corresponds to exactly one element of the Series.

An index can:

- Be passed to a Series as a **parameter**

- or **added** later

- or **modified** later

If no index is specified, pandas assigns a default integer index `0, 1, ..., n-1`.
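The three options above can be sketched on a small, hypothetical Series (the names `s1`, `s2`, `s3` and the labels are made up for illustration and are separate from the notebook's `s`):

```python
import pandas as pd

# Index passed as a parameter at construction time
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])

# No index given: pandas uses the default integer index 0, 1, 2
s2 = pd.Series([10, 20, 30])
default_labels = list(s2.index)

# Index added later by assignment; the length must match the Series
s2.index = ["x", "y", "z"]

# Index modified later: rename maps old labels to new ones and returns a new Series
s3 = s2.rename(index={"x": "first"})

# Assigning an index of the wrong length is rejected
length_error = None
try:
    s2.index = ["only", "two"]
except ValueError as err:
    length_error = str(err)

print(default_labels, list(s2.index), list(s3.index))
print(length_error)
```

The failed assignment leaves `s2` untouched, which is why its labels are still `x`, `y`, `z` afterwards.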

#### Datetime Index

Pandas has a built-in function for creating date indices, `date_range(start, periods=..., freq=...)`, which plays a role similar to Python's `range()` function.

In [4]:
# Create new index
new_index = pd.date_range("2016-01-01", periods = len(s), freq="D")

# Assign the index to the Series
s.index = new_index

# Preview the newly assigned index
print(s.index)

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05'],
              dtype='datetime64[ns]', freq='D')


## c. Values

We can access the values of a Pandas Series with the `.values` attribute
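A minimal sketch of the attribute, using a hypothetical Series named `toy` built from the same numbers as the notebook's `s`. The `.to_numpy()` call is not part of the original notes; it is the accessor that the pandas documentation recommends for new code.

```python
import numpy as np
import pandas as pd

toy = pd.Series([1, 2, np.nan, 4, 5], name="Toy Series")

# .values exposes the underlying data without the index labels
arr = toy.values
print(type(arr))
print(arr)

# .to_numpy() returns the same data as a NumPy array
arr2 = toy.to_numpy()
```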

# 2. Accessing Series Elements

## 1. By Integer Index

Use the `.iloc[start:end:step]` indexer:

- **First element:** `s.iloc[0]`
- **Last element:** `s.iloc[len(s)-1]` (equivalently `s.iloc[-1]`)
- **Middle elements:** `s.iloc[1:len(s)-1]`
- **Reverse elements:** `s.iloc[::-1]` or, more specifically, `s.iloc[-n:-m:-1]`

In [5]:
print("Series:\n {}, \n\nFirst element: {}, last element:{}, \n\nmiddle elements:{}".format(
                                         s, s.iloc[0], s.iloc[len(s)-1], s.iloc[1:len(s)-1]) +
      "\n\nreverse elements:\n {}".format(
                                         s.iloc[::-1]))

Series:
 2016-01-01    1.0
2016-01-02    2.0
2016-01-03    NaN
2016-01-04    4.0
2016-01-05    5.0
Freq: D, Name: Toy Series, dtype: float64, 

First element: 1.0, last element:5.0, 

middle elements:2016-01-02    2.0
2016-01-03    NaN
2016-01-04    4.0
Freq: D, Name: Toy Series, dtype: float64

reverse elements:
 2016-01-05    5.0
2016-01-04    4.0
2016-01-03    NaN
2016-01-02    2.0
2016-01-01    1.0
Freq: -1D, Name: Toy Series, dtype: float64


## 2. By Index Names

Use the `.loc["index_name1":"index_name2"]` indexer. Unlike `.iloc`, a label slice includes both endpoints.

In [6]:
s.loc["2016-01-02":"2016-01-04"]

2016-01-02    2.0
2016-01-03    NaN
2016-01-04    4.0
Freq: D, Name: Toy Series, dtype: float64

# 3. Boolean Indexing

Filtering Series by **boolean arrays**, i.e. passing boolean expressions into the index of a Series to filter the Series by the boolean index.

- **Elements greater than a value:** `s.loc[s > n]`

- **Elements between two values:** `s.loc[(s > n) & (s < m)]`

- **Index booleans:** `s.loc[s.index > "index_name1"]`

In [7]:
s.loc[(s < 3) & (s > 1)]

2016-01-02    2.0
Freq: D, Name: Toy Series, dtype: float64

In [8]:
s.loc[s.index > "2016-01-03"]

2016-01-04    4.0
2016-01-05    5.0
Freq: D, Name: Toy Series, dtype: float64

# 4. Indexing and Time Series

Pandas represents the time axis of a time series with a `DatetimeIndex`.

A `DatetimeIndex` carries a collection of associated information:

- Associated frequency (`freq`): daily vs monthly vs yearly data
- Associated timezone (`tz`): what locale this index is relative to
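Both attributes can be inspected directly on an index. The sketch below uses a small hypothetical date range rather than downloaded prices; the time zone names are illustrative choices and assume a time zone database is installed.

```python
import pandas as pd

idx = pd.date_range("2016-01-01", periods=3, freq="D")

# A freshly generated range knows its frequency but has no time zone (it is "naive")
print(idx.freq)
print(idx.tz)

# Attach a time zone, then express the same instants in another one
idx_utc = idx.tz_localize("UTC")
idx_ny = idx_utc.tz_convert("America/New_York")
print(idx_ny[0])
```

`tz_localize` declares which zone naive timestamps belong to, while `tz_convert` changes how already-localized timestamps are displayed without moving them in time.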

In [9]:
symbol = "CMG"
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2016,1,1)
prices = web.DataReader(symbol, "yahoo", start = start, end = end)["Adj Close"]

prices.index

DatetimeIndex(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06',
               '2012-01-09', '2012-01-10', '2012-01-11', '2012-01-12',
               '2012-01-13', '2012-01-17',
               ...
               '2015-12-17', '2015-12-18', '2015-12-21', '2015-12-22',
               '2015-12-23', '2015-12-24', '2015-12-28', '2015-12-29',
               '2015-12-30', '2015-12-31'],
              dtype='datetime64[ns]', name='Date', length=1006, freq=None)

## Resampling 

We can adjust the frequency of the data by resampling the DateTime index.

**Monthly:**

- **Resample** the datetime index to monthly

- **Aggregate** the data according to a function such as `np.sum` or `np.mean`

In [10]:
monthly_prices = prices.resample("M").agg(np.mean)
monthly_prices.head()

Date
2012-01-31    354.829002
2012-02-29    379.535503
2012-03-31    407.002272
2012-04-30    422.798997
2012-05-31    405.805456
Freq: M, Name: Adj Close, dtype: float64

**Monthly - Aggregated on Beginning of Month:**

In [15]:
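def custom_resampler(array):
    "Returns the first value of the array"
    # .iloc[0] is explicit positional access; array[0] on a date-indexed
    # Series relies on a positional fallback that newer pandas deprecates
    return array.iloc[0]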

monthly_return_beg = prices.resample("M").agg(custom_resampler)
monthly_return_beg.head()

Date
2012-01-31    341.269989
2012-02-29    370.410004
2012-03-31    394.100006
2012-04-30    418.399994
2012-05-31    419.890015
Freq: M, Name: Adj Close, dtype: float64
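Because the price download depends on an external data source, the same two aggregations can be tried on synthetic data. This is a sketch under stated assumptions: `toy_prices` is a made-up series of the numbers 1 to 60, and the `"MS"` (month start) alias is used instead of `"M"` because newer pandas releases deprecate `"M"` in favor of `"ME"`. The only difference is that each month is labeled by its first day rather than its last.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: 1.0, 2.0, ..., 60.0 starting on 2016-01-01
days = pd.date_range("2016-01-01", periods=60, freq="D")
toy_prices = pd.Series(np.arange(1.0, 61.0), index=days)

# Step 1: resample groups the observations into calendar months
monthly = toy_prices.resample("MS")

# Step 2: aggregate each group
monthly_mean = monthly.mean()    # average of each month
monthly_first = monthly.first()  # first observation of each month

print(monthly_mean)
print(monthly_first)
```

The built-in `.first()` gives the same result as a custom "first value" function passed to `.agg`, without writing the function by hand.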

## Reindexing

Reindexing realigns the existing data to a new set of index labels. If no data exists for a particular label, the value becomes `nan`.

**Reindexing from Weekdays to Calendar days:**

`ffill` = forward fill, i.e. any `nan` values will be filled with the *last value* listed

- The prices on weekends/holidays will be listed as the price from the last market day that we know about

In [12]:
# generate a daily date range

calendar_dates = pd.date_range(start = start, end=end, freq="D")
calendar_dates

DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
               '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
               '2012-01-09', '2012-01-10',
               ...
               '2015-12-23', '2015-12-24', '2015-12-25', '2015-12-26',
               '2015-12-27', '2015-12-28', '2015-12-29', '2015-12-30',
               '2015-12-31', '2016-01-01'],
              dtype='datetime64[ns]', length=1462, freq='D')

In [13]:
# reindex the price time series with the daily date range

calendar_prices = prices.reindex(calendar_dates, method ="ffill")
calendar_prices.tail()

2015-12-28    493.519989
2015-12-29    489.940002
2015-12-30    485.790009
2015-12-31    479.850006
2016-01-01    479.850006
Freq: D, Name: Adj Close, dtype: float64
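The effect of `method="ffill"` is easier to see on a tiny example. The sketch below assumes three hypothetical trading days around a weekend (`toy_prices` and its values are made up):

```python
import pandas as pd

# Friday, Monday and Tuesday: the weekend is missing from the index
trading_days = pd.to_datetime(["2016-01-01", "2016-01-04", "2016-01-05"])
toy_prices = pd.Series([100.0, 101.0, 102.0], index=trading_days)

calendar = pd.date_range("2016-01-01", "2016-01-05", freq="D")

# Without a fill method, the new labels get NaN
plain = toy_prices.reindex(calendar)

# With forward fill, Saturday and Sunday repeat Friday's value
filled = toy_prices.reindex(calendar, method="ffill")

print(plain)
print(filled)
```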

# 5. Missing Data

Sometimes, resampling or re-indexing data can create `nan` values.

There are three ways to deal with missing data:

- **a. `.fillna()`**
-  **b. Forward/Backward Fill**
- **c. `.dropna()`**

## Look-Ahead Bias

#### Fillna

`.fillna()` fills the `NaN` values with a specified value. This is rarely a good way to operate, because inserting an arbitrary constant distorts the series (for example, it can break stationarity).

#### ffill/bfill

`bfill` is applied the same way as `ffill` when reindexing.

- **Backward filling:** `NaN`s are filled with the *next* valid value.
- **Forward filling:** `NaN`s are filled with the *previous* valid value.

Backward filling introduces look-ahead bias: it takes into account *future data* that was not available at the time of the data points that we are trying to fill. This makes no sense, as it is equivalent to saying that the price of a particular security today is tomorrow's price.

Forward filling uses only past data, so it does not look ahead, but it still creates observations for days on which nothing was actually recorded.
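A small sketch makes the difference between the options concrete. The series `gappy` is hypothetical: it has two missing days between two known values.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=4, freq="D")
gappy = pd.Series([1.0, np.nan, np.nan, 4.0], index=idx)

constant = gappy.fillna(0.0)  # arbitrary constant in the gaps
forward = gappy.ffill()       # gaps take the last value already seen (1.0)
backward = gappy.bfill()      # gaps take the next value (4.0), which lies in the future
dropped = gappy.dropna()      # gaps are removed instead of filled

print(pd.DataFrame({"constant": constant, "ffill": forward, "bfill": backward}))
print(dropped)
```

With backward filling, the values on 2016-01-02 and 2016-01-03 come from 2016-01-04, which is exactly the look-ahead problem. Forward filling only reuses the value from 2016-01-01.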

## Dropping the Missing Values

Dropping the missing values is usually much better than filling the data with arbitrary numbers.

In [19]:
dropped_prices = prices.reindex(calendar_dates).dropna()
dropped_prices.head(10)

2012-01-03    341.269989
2012-01-04    348.750000
2012-01-05    350.480011
2012-01-06    348.950012
2012-01-09    339.739990
2012-01-10    341.119995
2012-01-11    347.600006
2012-01-12    347.619995
2012-01-13    354.619995
2012-01-17    353.380005
Name: Adj Close, dtype: float64

# Part II. DataFrames

`DataFrames` are 2-dimensional objects:

- **Index** attribute $\Rightarrow$ Rows
    - Same as in the **Series** case
- **Column** attribute $\Rightarrow$ Cols
    - **Gives the 2-dim characteristics:** allows us to combine named columns (i.e. Series) into a cohesive object with the index lined-up
    
# 1. Creating a DataFrame

`pd.DataFrame(matrix, index, columns)`

The `matrix` can be generated by many different types of data structures:

- nested-lists
- dictionary
- np arrays reshaped as matrices
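As a quick illustration, the same 2x2 table can be built from each of the three structures. The labels and numbers below are hypothetical.

```python
import numpy as np
import pandas as pd

rows = ["A", "B"]
cols = ["Q", "W"]

# Nested lists: each inner list is one row
from_lists = pd.DataFrame([[1, 2], [3, 4]], index=rows, columns=cols)

# Dictionary: each key is a column name, each value is that column's data
from_dict = pd.DataFrame({"Q": [1, 3], "W": [2, 4]}, index=rows)

# NumPy array reshaped into a 2x2 matrix
from_array = pd.DataFrame(np.arange(1, 5).reshape(2, 2), index=rows, columns=cols)

print(from_lists)
```

Note the orientation: nested lists and arrays are read row by row, while a dictionary is read column by column.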

In [30]:
df = pd.DataFrame(np.random.rand(4,6), index = "A B C D".split(), columns = "Q W E R T Y".split())
df

|   | Q | W | E | R | T | Y |
|---|---|---|---|---|---|---|
| A | 0.80351 | 0.142554 | 0.4065 | 0.241446 | 0.508993 | 0.256827 |
| B | 0.507673 | 0.223111 | 0.084345 | 0.830719 | 0.967985 | 0.631227 |
| C | 0.634895 | 0.016561 | 0.928671 | 0.462813 | 0.067493 | 0.218742 |
| D | 0.336325 | 0.573719 | 0.406973 | 0.025647 | 0.143466 | 0.289039 |


-------------------------------------
# 2. Data Selection
-------------------------------------

## a. Selecting Columns
   - Single Column: `df["Column name"]`
   - Multiple Columns (list): `df[["Col name 1", "Col name 2"]]` 
   - Attribute (dot) access: `df.column_name` (works only when the column name is a valid Python identifier)

In [46]:
print("Single Col: \n{}, \n\nMulti Cols: \n{}".format(
    
            df["Q"].head(1),  
    
            df[["Q","W"]].head(1)))

Single Col: 
A    0.80351
Name: Q, dtype: float64, 

Multi Cols: 
         Q         W
A  0.80351  0.142554


## b. Selecting Rows
 
- Based on Row Name: `df.loc["Row Name"]`
- Based on Row Index: `df.iloc[index number]`

In [52]:
print("Row Name: \n{}, \n\nRow Index: \n{}".format(

            df.loc["A"].head(1),

            df.iloc[0].head(1)))

Row Name: 
Q    0.80351
Name: A, dtype: float64, 

Row Index: 
Q    0.80351
Name: A, dtype: float64


## c. Selecting a Subset of Rows and Columns

- Values: `df.loc["Row Name", "Column Name"]`
- Subset: `df.loc[["Row Name 1", "Row Name 2"], ["Column Name 1", "Column Name 2"]]`

In [65]:
print("Subsetting Values: \n{}, \n\nSubsetting Matrix: \n{}".format(

             df.loc["A", "Q"],
     
             df.loc["A B".split(), "Q W".split()]
    
    #also    df.loc[["A","B"],["Q","W"]]
))

Subsetting Values: 
0.8035098063825515, 

Subsetting Matrix: 
          Q         W
A  0.803510  0.142554
B  0.507673  0.223111


## d. Conditional Selecting

General Formula:

- **First Step:** Generate the boolean condition (a boolean Series) `df["col"] > n`

- **Second Step:** Pass the boolean Series to the DataFrame `df[df["col"] > n]`

**1. Multiple Conditions:**

`df[(df["Col1"] > n) & (df["Col2"] < m)]`

**2. Selecting a Column after Conditioning:**

`df[df["col1"] > m]["col2"]`

**3. Selecting a sub-matrix after conditioning:**

`df[df["col1"] > m][["col2","col3"]]`

In [87]:
# First Step
df["Q"]>0

# Second Step
df[df["Q"] > 0]

|   | Q | W | E | R | T | Y | new |
|---|---|---|---|---|---|---|---|
| A | 0.80351 | 0.142554 | 0.4065 | 0.241446 | 0.508993 | 0.256827 | 0.946063 |
| B | 0.507673 | 0.223111 | 0.084345 | 0.830719 | 0.967985 | 0.631227 | 0.730785 |
| C | 0.634895 | 0.016561 | 0.928671 | 0.462813 | 0.067493 | 0.218742 | 0.651457 |
| D | 0.336325 | 0.573719 | 0.406973 | 0.025647 | 0.143466 | 0.289039 | 0.910043 |
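Since every row of the random DataFrame satisfies `Q > 0`, the filter above returns the whole table. The following sketch uses a hypothetical DataFrame `scores` with fixed values so that the three patterns actually remove rows. It also uses `.loc[mask, columns]`, a single-step alternative to the chained `df[mask]["col"]` form; that alternative is an addition to the original notes.

```python
import pandas as pd

scores = pd.DataFrame(
    {"Q": [0.8, 0.5, 0.6, 0.3], "W": [0.1, 0.2, 0.0, 0.6], "E": [0.4, 0.1, 0.9, 0.4]},
    index=["A", "B", "C", "D"],
)

# First step: the condition is a boolean Series aligned with the rows
mask = scores["Q"] > 0.4

# 1. Multiple conditions: wrap each one in parentheses and combine with &
both = scores[(scores["Q"] > 0.4) & (scores["W"] > 0.05)]

# 2. A single column after conditioning
col_after = scores.loc[mask, "W"]

# 3. A sub-matrix after conditioning
sub = scores.loc[mask, ["W", "E"]]

print(both)
print(sub)
```

The parentheses around each condition are required because `&` binds more tightly than the comparison operators in Python.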


## Creating a New Column

Can be done by assigning a value to a new column index of the existing DataFrame:

`df["New Column"] = df["Col1"] + df["Col2"]`

In [79]:
df["new"] = df["Q"] + df["W"]
df.head(1)

|   | Q | W | E | R | T | Y | new |
|---|---|---|---|---|---|---|---|
| A | 0.80351 | 0.142554 | 0.4065 | 0.241446 | 0.508993 | 0.256827 | 0.946063 |


## Removing Rows/Columns 

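We can drop a row or a column. By default `drop` returns a new DataFrame and leaves the original unchanged; pass `inplace=True` to modify the existing DataFrame instead: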

- **Drop a Row:** `df.drop("row name", axis = 0, inplace=True)`

- **Drop a Column:** `df.drop("col name", axis = 1, inplace = True)`

In [80]:
df.drop("new", axis = 1).head(1)

|   | Q | W | E | R | T | Y |
|---|---|---|---|---|---|---|
| A | 0.80351 | 0.142554 | 0.4065 | 0.241446 | 0.508993 | 0.256827 |
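The difference between the two modes of `drop` can be checked on a small hypothetical DataFrame named `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"Q": [1, 2], "W": [3, 4]}, index=["A", "B"])

# Without inplace, drop returns a new DataFrame
without_w = frame.drop("W", axis=1)  # column removed in the copy
without_a = frame.drop("A", axis=0)  # row removed in the copy

# The original still has both columns at this point
cols_before = list(frame.columns)

# With inplace=True, the original itself is modified and None is returned
result = frame.drop("W", axis=1, inplace=True)

print(cols_before, list(frame.columns), result)
```

This also explains why the `new` column is still present in later outputs of the notebook: the cell above drops it without `inplace=True`.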


------------------------------
# 3. Indexing
---------------------------

## a. Resetting an Index

We can use the `df.reset_index()` method to reset a DataFrame index: the old index is moved into a regular column and replaced by the default integer index.

In [95]:
df.reset_index().head(1)

|   | index | Q | W | E | R | T | Y | new |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 0.80351 | 0.142554 | 0.4065 | 0.241446 | 0.508993 | 0.256827 | 0.946063 |


## b. Setting a New Index

- **Create a new Index** `new_index = " ... ".split()`

- **Create a new Column as the new Index** `df["new index"] = new_index`

- **Set the new column as the index:** `df.set_index("new index")` (returns a new DataFrame unless `inplace=True` is passed)

In [98]:
# Create a new index
new_index = "CA NY WY OR".split()

# Create a new column in the df
df["States"] = new_index

# Set the index to the new column index
df.set_index("States", inplace=True)

df.head(2)

| States | Q | W | E | R | T | Y | new |
|---|---|---|---|---|---|---|---|
| CA | 0.80351 | 0.142554 | 0.4065 | 0.241446 | 0.508993 | 0.256827 | 0.946063 |
| NY | 0.507673 | 0.223111 | 0.084345 | 0.830719 | 0.967985 | 0.631227 | 0.730785 |


## c. Multi-Indexing and Index Hierarchy

A multi-index (hierarchical index) labels each row with a combination of two or more levels, so the index spans several index columns, one per factor.

### Two Factor Index:

- **Factor 1:** is the outer index `["G1", "G1", "G1", "G2", "G2", "G2"]`
- **Factor 2:** is the inner index `[1, 2, 3, 1, 2, 3]`

**Step 1:** 

Generate the two indices so that each label of the first (outer) index is paired with every label of the second (inner) index

- *The first index:* `["G1", "G1", "G1", "G2", "G2", "G2", ...]`

- *The second index:* `[1, 2, 3, 1, 2, 3, ...]`

**Step 2:**

`zip` the two indices to get the combinations of outside and inside indices and place them in a list, i.e. zip the factors into factor(i,j) tuples

`[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]`

In [3]:
# Step 1:

## Factor 1
outside = ["G1", "G1", "G1", "G2", "G2", "G2"]

## Factor 2
inside = [1,2,3,1,2,3]

# Step 2: 

hier_index = list(zip(outside, inside))
hier_index

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

**Step 3:**

Generate a multi-index from the list of tuples `pd.MultiIndex.from_tuples(hierarchical_index)`

In [4]:
# Step 3:

hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

**Step 4:** 

Apply the multi-index to a DataFrame object

In [5]:
df = pd.DataFrame(np.random.randn(6,2), index = hier_index, columns = "A B".split())

**Step 5:**

Name the levels of the multi-index with the attribute `df.index.names`

In [6]:
df.index.names = "Group Num".split()
df

| Group | Num | A | B |
|---|---|---|---|
| G1 | 1 | -0.342179 | -0.311597 |
| G1 | 2 | -0.509472 | -0.399181 |
| G1 | 3 | 0.475394 | 0.802766 |
| G2 | 1 | -0.755835 | -1.391883 |
| G2 | 2 | -0.388434 | 0.995909 |
| G2 | 3 | -0.67555 | -0.397031 |


### Selecting from a Multi-Index DataFrame

We can access the different levels of the multi-index dataframe in several ways:

#### a. `df.loc["first level label"].loc["second level label"]`

In [8]:
# First Level
df.loc["G1"]

| Num | A | B |
|---|---|---|
| 1 | -0.342179 | -0.311597 |
| 2 | -0.509472 | -0.399181 |
| 3 | 0.475394 | 0.802766 |


In [9]:
# Second Level
df.loc["G1"].loc[1]

A   -0.342179
B   -0.311597
Name: 1, dtype: float64

#### b. `df.xs(("first level label", "second level label"))`

In [12]:
# Pass the key as a tuple; newer pandas versions do not accept a list here
df.xs(("G1", 1))

A   -0.342179
B   -0.311597
Name: (G1, 1), dtype: float64

#### c. Access the Second level for all First Level Indices

`df.xs("second level label", level = "second level name")`

In [14]:
df.xs(1, level="Num")

| Group | A | B |
|---|---|---|
| G1 | -0.342179 | -0.311597 |
| G2 | -0.755835 | -1.391883 |
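The whole workflow, from building the index to the three selection styles, can be reproduced with deterministic numbers so the results are easy to verify. This is a sketch: the DataFrame `grid` and its integer values are hypothetical, and `pd.MultiIndex.from_product` is shown as a shortcut that is not part of the original steps.

```python
import numpy as np
import pandas as pd

outside = ["G1", "G1", "G1", "G2", "G2", "G2"]
inside = [1, 2, 3, 1, 2, 3]

# Steps 1 to 3 and 5 in one call: zip, build the index, name the levels
index = pd.MultiIndex.from_tuples(list(zip(outside, inside)), names=["Group", "Num"])

# Shortcut: every combination of the outer and inner labels
product_index = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]], names=["Group", "Num"])

# Step 4: rows hold 0..11 so each selection is easy to check by eye
grid = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=["A", "B"])

row_chained = grid.loc["G1"].loc[2]  # style a: one level at a time
row_tuple = grid.xs(("G1", 2))       # style b: both levels as a tuple
num_one = grid.xs(1, level="Num")    # style c: Num == 1 within every Group

print(row_tuple)
print(num_one)
```

`from_product` is only equivalent when every outer label is combined with every inner label; for irregular combinations, `from_tuples` remains the general tool.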
