# Pandas Data Manipulation

## Overview

- **1. GroupBy**
- **2. Concatenation**
- **3. Merging**
- **4. Joining**
- **5. Pivot Tables**
- **6. Stacking/Unstacking**
- **7. Melting**

## 1. GroupBy - Split-Apply-Combine

Groups rows of data together so that aggregate functions can be called on each group.

### Step 1: Split - Group By a Column Name
The result must be saved to a variable, because `groupby` returns a pandas `DataFrameGroupBy` object rather than a DataFrame.

- **Single Grouping:** `by_colname = df.groupby("column name")`
- **Multiple Groupings:** `by_columnames = df.groupby(["col1", "col2"])`

In [38]:
# Create a df
data = {"Company":["GOOG","GOOG","MSFT", "MSFT", "FB", "FB"],
        "Person":["Sam","Charlie", "Amy", "Vanessa", "Carl", "Sarah"],
        "Sales":[200,120,340,124,243,350]}
df = pd.DataFrame(data)

# Group by company
by_company = df.groupby("Company")

df

| | Company | Person | Sales |
|---|---|---|---|
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |
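The multiple-grouping form can be sketched in the same way. The `Region` column below is a hypothetical addition made only for this illustration; it is not part of the original data.

```python
import pandas as pd

# Toy data modelled on the example above; "Region" is a made-up column
df = pd.DataFrame({
    "Company": ["GOOG", "GOOG", "MSFT", "MSFT", "FB", "FB"],
    "Region": ["US", "EU", "US", "US", "EU", "EU"],
    "Sales": [200, 120, 340, 124, 243, 350],
})

# Group by two columns: every (Company, Region) pair becomes one group
by_company_region = df.groupby(["Company", "Region"])

# Aggregating yields a Series indexed by a two-level MultiIndex
totals = by_company_region["Sales"].sum()
print(totals)
```

Individual groups can then be looked up with a tuple, e.g. `totals.loc[("MSFT", "US")]`.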


### Step 2: Apply - Call Aggregate Methods on the Grouped Object

- `by_colname.mean()`
- `by_colname.std()`
- `by_colname.min()`
- `by_colname.count()`
- `by_colname.describe()`
- `by_colname.describe().transpose()`
- `by_colname.describe().transpose()["col name"]`

In [39]:
by_company.describe().transpose()

| | Company | FB | GOOG | MSFT |
|---|---|---|---|---|
| Sales | count | 2.0 | 2.0 | 2.0 |
| Sales | mean | 296.5 | 160.0 | 232.0 |
| Sales | std | 75.660426 | 56.568542 | 152.735065 |
| Sales | min | 243.0 | 120.0 | 124.0 |
| Sales | 25% | 269.75 | 140.0 | 178.0 |
| Sales | 50% | 296.5 | 160.0 | 232.0 |
| Sales | 75% | 323.25 | 180.0 | 286.0 |
| Sales | max | 350.0 | 200.0 | 340.0 |


In [42]:
by_company.std()

| Company | Sales |
|---|---|
| FB | 75.660426 |
| GOOG | 56.568542 |
| MSFT | 152.735065 |
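One caveat, as far as I am aware: on pandas 2.0 and later, calling `mean()` or `std()` on a grouped object that still contains a text column such as `Person` raises a `TypeError` instead of silently dropping that column. A minimal sketch of two ways around this, assuming the same toy data as above:

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["GOOG", "GOOG", "MSFT", "MSFT", "FB", "FB"],
    "Person": ["Sam", "Charlie", "Amy", "Vanessa", "Carl", "Sarah"],
    "Sales": [200, 120, 340, 124, 243, 350],
})
by_company = df.groupby("Company")

# Option 1: select the numeric column first, so "Person" is never aggregated
sales_mean = by_company["Sales"].mean()

# Option 2: ask pandas to skip non-numeric columns explicitly
numeric_mean = by_company.mean(numeric_only=True)

print(sales_mean)
print(numeric_mean)
```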


### Accessing the Grouped DataFrames

#### a. All Grouped DataFrames

Iterate over the grouped object to get the key-value pair for each group:

```python
for key, value in grouped_df:
    print(key, value)
```

- **key:** Group Name
- **value:** DataFrame of that particular group

In [34]:
for key,value in by_company:
    print("\n{}: \n\n{}".format(key,value))


FB: 

  Company Person  Sales
4      FB   Carl    243
5      FB  Sarah    350

GOOG: 

  Company   Person  Sales
0    GOOG      Sam    200
1    GOOG  Charlie    120

MSFT: 

  Company   Person  Sales
2    MSFT      Amy    340
3    MSFT  Vanessa    124


#### b. Single Grouped DataFrame

```python
grouped_df.get_group("Group Name")
```

In [37]:
by_company.get_group("MSFT")

| | Company | Person | Sales |
|---|---|---|---|
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |


### Aggregation

We can also run functions on grouped data with the `grouped_data.agg(function)` method. Aggregate functions reduce each group to a single value per column, so the result has one row per group.

The `as_index` parameter of `groupby` controls where the group labels end up in the aggregated result:

`as_index = True` (default): the group labels become the index of the result and **are not** returned as columns

`as_index = False`: the group labels **are** returned as regular columns
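A minimal sketch of `agg` together with both `as_index` settings, assuming the company sales data from the GroupBy example:

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["GOOG", "GOOG", "MSFT", "MSFT", "FB", "FB"],
    "Person": ["Sam", "Charlie", "Amy", "Vanessa", "Carl", "Sarah"],
    "Sales": [200, 120, 340, 124, 243, 350],
})

# Default as_index=True: "Company" becomes the index of the result;
# passing a list to agg computes several aggregates at once
indexed = df.groupby("Company")["Sales"].agg(["min", "max"])

# as_index=False: "Company" stays an ordinary column next to the aggregate
flat = df.groupby("Company", as_index=False).agg({"Sales": "sum"})

print(indexed)
print(flat)
```

The flat form is often handier when the result is going to be merged or plotted, since the group labels are then addressable like any other column.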

## 2. Concatenation

Concatenation glues DataFrames together: `pd.concat([df1, df2, ...])`

- **Dimensions:** the labels on the other axis should match; with the default `join="outer"`, labels missing from one DataFrame are filled with `NaN`
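A small sketch of how `pd.concat` treats column labels that are not shared by all inputs. The frames `a` and `b` are hypothetical and only serve this illustration.

```python
import pandas as pd

# Two small frames that share only column "B"
a = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
b = pd.DataFrame({"B": ["B2", "B3"], "C": ["C2", "C3"]})

# join="outer" (the default): union of the columns, gaps become NaN
outer = pd.concat([a, b], join="outer", ignore_index=True)

# join="inner": only the columns present in every frame are kept
inner = pd.concat([a, b], join="inner", ignore_index=True)

print(outer)
print(inner)
```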

### a. Along the rows, i.e. `axis=0`

Analogous to `rbind()` in `R`.

In [35]:
pd.concat([df1,df2,df3]).head(2)

| | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |


### b. Along the columns, i.e. `axis=1`

Analogous to `cbind()` in `R`.

In [36]:
pd.concat([df1,df2,df3], axis = 1).head(2)

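| | A | B | C | D | A | B | C | D | A | B | C | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |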


### c. Multi-Index on Rows

The keys will act as the labels of the first level row indices

`pd.concat([df1,df2,df3, ...], keys = ["First level key1", "First level key2"], join = "outer")`

In [47]:
pd.concat([df1,df2,df3], keys = "Key1 Key2".split()).head(6)

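| | | A | B | C | D |
|---|---|---|---|---|---|
| Key1 | 0 | A0 | B0 | C0 | D0 |
| Key1 | 1 | A1 | B1 | C1 | D1 |
| Key1 | 2 | A2 | B2 | C2 | D2 |
| Key1 | 3 | A3 | B3 | C3 | D3 |
| Key2 | 4 | A4 | B4 | C4 | D4 |
| Key2 | 5 | A5 | B5 | C5 | D5 |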


### d. Multi-Index on Columns

The keys will act as the labels of the first level column indices

`pd.concat([df1,df2,df3, ...], axis = 1, keys = ["First level key1", "First level key2"])`

In [19]:
pd.concat([df1,df2,df3], axis = 1,keys = "A B".split()).head(2)

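| | A | A | A | A | B | B | B | B |
|---|---|---|---|---|---|---|---|---|
| | A | B | C | D | A | B | C | D |
| 0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |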


## 3. Merging

Merges DataFrames using logic similar to joining SQL tables.

Merging is done **on key columns that are present in both DataFrames**; rows are matched where the key values are equal.

`pd.merge(df1, df2, how=" ", on="key1 key2".split())`

**Characteristics of Merging:**

- **Key(s):** the columns we are merging on
- **Other columns:** placed side by side, like `cbind()` in `R`
- **Merge Method (how):** 
    - **Outer:** all keys from either DataFrame, $DF_1 \cup DF_2$
    - **Inner:** only keys found in both DataFrames, $DF_1 \cap DF_2$
    - **Left:** all keys from $DF_1$, plus the matching rows of $DF_2$
    - **Right:** all keys from $DF_2$, plus the matching rows of $DF_1$

### View of Two Example DataFrames with Same Indices

In [22]:
display_side_by_side(left,right)

| | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |

| | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |


### Outer

Keeps every key combination from both DataFrames; columns without a match are filled with `NaN`.

In [29]:
pd.merge(left, right, how="outer", on="key1 key2".split())

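| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |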


### Inner

Keeps only the key combinations found in both DataFrames.

In [29]:
pd.merge(left, right, how="inner", on="key1 key2".split())

| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |


### Left

Keeps every row of `left`; rows whose keys have no match in `right` get `NaN` in the columns coming from `right` $\rightarrow \forall i \notin L \cap R: i = NaN$

In [6]:
pd.merge(left,right,how="left",on="key1 key2".split())

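| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |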


### Right

Keeps every row of `right`; rows whose keys have no match in `left` get `NaN` in the columns coming from `left` $\rightarrow \forall i \notin R \cap L: i = NaN$

In [9]:
pd.merge(left,right,how="right",on="key1 key2".split())

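| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
| 3 | K2 | K0 | NaN | NaN | C3 | D3 |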
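To see which side each merged row came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch using the same `left` and `right` frames (redefined here so the block is self-contained):

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})

right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"],
                      "D": ["D0", "D1", "D2", "D3"]})

# indicator=True labels each row as "both", "left_only" or "right_only"
merged = pd.merge(left, right, how="outer", on=["key1", "key2"], indicator=True)

# Count how many rows fall into each category
counts = merged["_merge"].value_counts()
print(merged)
print(counts)
```

Filtering on `_merge` is a quick way to reproduce the inner, left-only, or right-only parts of an outer merge.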


## 4. Joining

Combines the columns of two potentially **differently-indexed** DataFrames into a single DataFrame, matching rows on the index.

`df1.join(df2, how = " ")`

### View of two example DataFrames with Differing Indices

In [11]:
display_side_by_side(left_index1,right_index2)

| | A | B |
|---|---|---|
| K0 | A0 | B0 |
| K1 | A1 | B1 |
| K2 | A2 | B2 |

| | C | D |
|---|---|---|
| K0 | C0 | D0 |
| K2 | C2 | D2 |
| K3 | C3 | D3 |


### Join Example

In [17]:
left_index1.join(right_index2)

| | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |
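The example above relies on the default `how="left"`, which keeps the index of the calling DataFrame. A short sketch of the `outer` and `inner` variants on the same two frames (redefined here so the block runs on its own):

```python
import pandas as pd

left_index1 = pd.DataFrame({"A": ["A0", "A1", "A2"],
                            "B": ["B0", "B1", "B2"]},
                           index=["K0", "K1", "K2"])

right_index2 = pd.DataFrame({"C": ["C0", "C2", "C3"],
                             "D": ["D0", "D2", "D3"]},
                            index=["K0", "K2", "K3"])

# how="outer": union of both indices, gaps filled with NaN
outer = left_index1.join(right_index2, how="outer")

# how="inner": only index labels present in both frames
inner = left_index1.join(right_index2, how="inner")

print(outer)
print(inner)
```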


## 5. Pivot Tables

We can create pivot tables as descriptive summaries of data with categorical variables, i.e. columns whose values repeat:

- **Categorical Variable as Index:** 
    - Single Index: single column
    - Multi-Index: two or more columns
- **Categorical Variable as Column labels:**
    - Single column containing the category values X, Y, Z, ...
- **Value Variable:**
    - The column whose values fill the table, one cell per combination of index and column category (aggregated with the mean by default)
    
`df.pivot_table(index=["Cat.Index1", "Cat.Index2"], columns = "Cat.Col.Labels", values = "Values Column")`

In [53]:
data = {'Index1': ['foo','foo','foo','bar','bar','bar'],  # Categorical Index Level 1
        'Index2': ['one','one','two','two','one','one'],  # Categorical Index Level 2
        'ColumnLabels': ['x','y','x','y','x','y'],  # Categorical Column Labels
        'Variable1': [1,3,2,5,4,1]}  # Values associated with the labels

df = pd.DataFrame(data)
df 

| | Index1 | Index2 | ColumnLabels | Variable1 |
|---|---|---|---|---|
| 0 | foo | one | x | 1 |
| 1 | foo | one | y | 3 |
| 2 | foo | two | x | 2 |
| 3 | bar | two | y | 5 |
| 4 | bar | one | x | 4 |
| 5 | bar | one | y | 1 |


In [55]:
df.pivot_table(index="Index1 Index2".split(), columns = "ColumnLabels", values = "Variable1")

| Index1 | Index2 | x | y |
|---|---|---|---|
| bar | one | 4.0 | 1.0 |
| bar | two | NaN | 5.0 |
| foo | one | 1.0 | 3.0 |
| foo | two | 2.0 | NaN |
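The values come back as floats because `pivot_table` aggregates each cell, using the mean unless told otherwise. A brief sketch of the `aggfunc` and `fill_value` parameters on the same data; pivoting on `Index1` alone is a choice made here so that some cells actually contain more than one value.

```python
import pandas as pd

df = pd.DataFrame({
    "Index1": ["foo", "foo", "foo", "bar", "bar", "bar"],
    "Index2": ["one", "one", "two", "two", "one", "one"],
    "ColumnLabels": ["x", "y", "x", "y", "x", "y"],
    "Variable1": [1, 3, 2, 5, 4, 1],
})

# Default aggregation is the mean of all values that land in one cell
mean_table = df.pivot_table(index="Index1", columns="ColumnLabels",
                            values="Variable1")

# aggfunc="sum" adds the values up instead
sum_table = df.pivot_table(index="Index1", columns="ColumnLabels",
                           values="Variable1", aggfunc="sum")

# fill_value replaces the NaN of empty index/column combinations
filled = df.pivot_table(index=["Index1", "Index2"], columns="ColumnLabels",
                        values="Variable1", fill_value=0)

print(mean_table)
print(sum_table)
print(filled)
```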


## 6. Stacking and Unstacking

### Stacking

Stacks a pivot table by moving its column labels into the innermost level of the row index: `pivoted_df.stack()`. 

It:

- **Preserves:**
    - Indices
- **Stacks:**
    - Categorical column variables into one index level
    - Values into one column

In [56]:
df.pivot_table(index="Index1 Index2".split(), columns = "ColumnLabels", values = "Variable1").stack()

Index1  Index2  ColumnLabels
bar     one     x               4.0
                y               1.0
        two     y               5.0
foo     one     x               1.0
                y               3.0
        two     x               2.0
dtype: float64

### Unstacking

Converts a stacked pivot table back to its original pivot table form: `stacked_df.unstack()`
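A minimal round-trip sketch, assuming the pivot table from section 5. By default `unstack()` moves the innermost index level back to the columns; passing a level name picks a different one.

```python
import pandas as pd

df = pd.DataFrame({
    "Index1": ["foo", "foo", "foo", "bar", "bar", "bar"],
    "Index2": ["one", "one", "two", "two", "one", "one"],
    "ColumnLabels": ["x", "y", "x", "y", "x", "y"],
    "Variable1": [1, 3, 2, 5, 4, 1],
})

pivoted = df.pivot_table(index=["Index1", "Index2"], columns="ColumnLabels",
                         values="Variable1")

# stack: column labels x/y move into the row index, giving a Series
stacked = pivoted.stack()

# unstack: the innermost index level (ColumnLabels) returns to the columns
restored = stacked.unstack()

# unstack a named level instead: Index2 becomes the column labels
by_index2 = stacked.unstack(level="Index2")

print(restored)
print(by_index2)
```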

## 7. Melting

Melting collapses specific (or all) variable columns into two columns: one holding the variable names and one holding the values. 

`pd.melt(df, id_vars = ["Index1", "Index2", ..., "IndexN", "Column Labels"])`

This gives a top down view of the different variables and associated values

In [2]:
data = {'Index1': ['foo','foo','foo','bar','bar','bar'],  # Categorical Index Level 1
        'Index2': ['one','one','two','two','one','one'],  # Categorical Index Level 2
        'ColumnLabels': ['x','y','x','y','x','y'],  # Categorical Column Labels
        'Variable1': [1,3,2,5,4,1],  # Values associated with the labels
        'Variable2': [1,3,2,5,4,1],  # Values associated with the labels
        'Variable3': [1,3,2,5,4,1],  # Values associated with the labels
        'Variable4': [1,3,2,5,4,1]}  # Values associated with the labels

df = pd.DataFrame(data)
df 

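| | Index1 | Index2 | ColumnLabels | Variable1 | Variable2 | Variable3 | Variable4 |
|---|---|---|---|---|---|---|---|
| 0 | foo | one | x | 1 | 1 | 1 | 1 |
| 1 | foo | one | y | 3 | 3 | 3 | 3 |
| 2 | foo | two | x | 2 | 2 | 2 | 2 |
| 3 | bar | two | y | 5 | 5 | 5 | 5 |
| 4 | bar | one | x | 4 | 4 | 4 | 4 |
| 5 | bar | one | y | 1 | 1 | 1 | 1 |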


In [51]:
pd.melt(df, id_vars = "Index1 Index2 ColumnLabels".split())

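| | Index1 | Index2 | ColumnLabels | variable | value |
|---|---|---|---|---|---|
| 0 | foo | one | x | Variable1 | 1 |
| 1 | foo | one | y | Variable1 | 3 |
| 2 | foo | two | x | Variable1 | 2 |
| 3 | bar | two | y | Variable1 | 5 |
| 4 | bar | one | x | Variable1 | 4 |
| 5 | bar | one | y | Variable1 | 1 |
| 6 | foo | one | x | Variable2 | 1 |
| 7 | foo | one | y | Variable2 | 3 |
| 8 | foo | two | x | Variable2 | 2 |
| 9 | bar | two | y | Variable2 | 5 |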
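To melt only some of the variable columns and to rename the two output columns, `pd.melt` also takes `value_vars`, `var_name` and `value_name`. A small sketch on a two-variable version of the data; the names `measure` and `reading` are arbitrary choices for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Index1": ["foo", "foo", "foo", "bar", "bar", "bar"],
    "Index2": ["one", "one", "two", "two", "one", "one"],
    "ColumnLabels": ["x", "y", "x", "y", "x", "y"],
    "Variable1": [1, 3, 2, 5, 4, 1],
    "Variable2": [1, 3, 2, 5, 4, 1],
})

long_df = pd.melt(
    df,
    id_vars=["Index1", "Index2", "ColumnLabels"],  # columns kept as identifiers
    value_vars=["Variable1", "Variable2"],         # columns to collapse
    var_name="measure",                            # replaces the default "variable"
    value_name="reading",                          # replaces the default "value"
)

print(long_df)
```

Each original row appears once per melted column, so 6 rows and 2 value columns give 12 rows.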


# Appendix Code

Run this code first; it is placed at the very end for presentation purposes only.

## Example DataFrames

In [21]:
import pandas as pd
import numpy as np
from IPython.display import display_html
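def display_side_by_side(*args):
    """Render several DataFrames next to each other in one notebook cell."""
    html_str = ''
    for df in args:
        html_str += df.to_html()
    # Inline display makes the HTML tables sit side by side instead of stacking
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)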

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

left_index1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                            'B': ['B0', 'B1', 'B2']},
                           index=['K0', 'K1', 'K2'])

right_index2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                             'D': ['D0', 'D2', 'D3']},
                            index=['K0', 'K2', 'K3'])

city_temperatures = pd.DataFrame([("1/1/2017", "new york", 32, 6, "Rain"),
                                  ("1/2/2017", "new york", 36, 7, "Sunny"),
                                  ("1/3/2017", "new york", 28, 12, "Snow"),
                                  ("1/4/2017", "new york", 33, 7, "Sunny"),
                                  ("1/1/2017", "mumbai", 90, 5, "Sunny"),
                                  ("1/2/2017", "mumbai", 85, 12, "Fog"),
                                  ("1/3/2017", "mumbai", 87, 15, "Fog"),
                                  ("1/4/2017", "mumbai", 92, 5, "Rain"),
                                  ("1/1/2017", "paris", 45, 20, "Sunny"),
                                  ("1/2/2017", "paris", 50, 13, "Cloudy"),
                                  ("1/3/2017", "paris", 54, 8, "Cloudy"),
                                  ("1/4/2017", "paris", 42, 10, "Cloudy")],
                                 columns="day city temperature windspeed event".split())