# Pandas DataFrame Operations

## Groupby

The `groupby` method groups rows that share a value in a column so that an aggregate function can be applied to each group.

In the example below, there are three groups of IDs (1, 2, and 3), each with several values. Grouping by the ID column and applying an aggregate function collapses each group to a single row. Here the values in each group are summed.

<img src='https://www.includehelp.com/python/images/pandas-groupby-1.jpg' width="100%" />
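
The idea in the figure can be sketched in a few lines. The column names `ID` and `Value` and the numbers below are made up for illustration and are not taken from the image.

```python
import pandas as pd

# Hypothetical data: three IDs, each with several values
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3, 3],
                   'Value': [10, 20, 5, 15, 1, 2, 3]})

# Split the rows by ID, sum the values in each group,
# and combine the results into one row per ID
totals = df.groupby('ID')['Value'].sum()
print(totals)
```

The result is a Series indexed by `ID`, with one summed value per group.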

---

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Create sales dataframe
sales_data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
              'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
              'Sales': [200, 120, 340, 124, 243, 350]}

In [3]:
sales = pd.DataFrame(sales_data)
sales

| | Company | Person | Sales |
|---|---|---|---|
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |


**We can use `.groupby()` to group rows together based on a column name. Let's group by `Company`.**

**This will create a `DataFrameGroupBy` object:**

In [4]:
sales.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11bbb6d50>
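
The `DataFrameGroupBy` object does not compute anything until an aggregate is requested, but it can still be inspected. The sketch below rebuilds the same `sales` data so it runs on its own; the variable names `grouped`, `group_labels`, and `goog_rows` are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                      'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
                      'Sales': [200, 120, 340, 124, 243, 350]})
grouped = sales.groupby('Company')

# Mapping from each group key to the row labels that belong to it
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}
print(group_labels)

# Pull out the rows of a single group as an ordinary DataFrame
goog_rows = grouped.get_group('GOOG')
print(goog_rows)

# Iterating yields (key, sub-DataFrame) pairs
for company, rows in grouped:
    print(company, rows['Sales'].tolist())
```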

We can save this object to a new variable:

In [5]:
sales_by_comp = sales.groupby("Company")

And then call aggregate methods:

In [6]:
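# numeric_only=True skips the non-numeric Person column
# (pandas 2.0+ raises a TypeError without it)
sales_by_comp.mean(numeric_only=True)

| Company | Sales |
|---|---|
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |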


In [7]:
# the same result in a single chained call
sales.groupby('Company').mean(numeric_only=True)

| Company | Sales |
|---|---|
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |


**Other aggregate methods:**

In [8]:
sales_by_comp.std(numeric_only=True)

| Company | Sales |
|---|---|
| FB | 75.660426 |
| GOOG | 56.568542 |
| MSFT | 152.735065 |


In [9]:
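# min() is computed per column (alphabetically for strings),
# so Person and Sales in one row may come from different original rows
sales_by_comp.min()

| Company | Person | Sales |
|---|---|---|
| FB | Carl | 243 |
| GOOG | Charlie | 120 |
| MSFT | Amy | 124 |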


In [10]:
sales_by_comp.max()

| Company | Person | Sales |
|---|---|---|
| FB | Sarah | 350 |
| GOOG | Sam | 200 |
| MSFT | Vanessa | 340 |


In [11]:
sales_by_comp.max().loc['FB']

Person    Sarah
Sales       350
Name: FB, dtype: object

In [12]:
sales_by_comp.count()

| Company | Person | Sales |
|---|---|---|
| FB | 2 | 2 |
| GOOG | 2 | 2 |
| MSFT | 2 | 2 |


In [13]:
# returns count, mean, std, min, max, and quartiles
sales_by_comp.describe()

| Company | (Sales, count) | (Sales, mean) | (Sales, std) | (Sales, min) | (Sales, 25%) | (Sales, 50%) | (Sales, 75%) | (Sales, max) |
|---|---|---|---|---|---|---|---|---|
| FB | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 |
| GOOG | 2.0 | 160.0 | 56.568542 | 120.0 | 140.0 | 160.0 | 180.0 | 200.0 |
| MSFT | 2.0 | 232.0 | 152.735065 | 124.0 | 178.0 | 232.0 | 286.0 | 340.0 |


In [14]:
# we can also transpose it to have each company as a column.
sales_by_comp.describe().transpose()

| | | FB | GOOG | MSFT |
|---|---|---|---|---|
| Sales | count | 2.0 | 2.0 | 2.0 |
| Sales | mean | 296.5 | 160.0 | 232.0 |
| Sales | std | 75.660426 | 56.568542 | 152.735065 |
| Sales | min | 243.0 | 120.0 | 124.0 |
| Sales | 25% | 269.75 | 140.0 | 178.0 |
| Sales | 50% | 296.5 | 160.0 | 232.0 |
| Sales | 75% | 323.25 | 180.0 | 286.0 |
| Sales | max | 350.0 | 200.0 | 340.0 |


In [15]:
sales_by_comp.describe().transpose()['GOOG']

Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64
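
When only a few statistics are needed, `.agg()` can compute them in one call instead of running `describe()` and selecting from its output. This is a minimal sketch on the same `sales` data; the output names `total_sales` and `n_people` are chosen here for illustration.

```python
import pandas as pd

sales = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                      'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
                      'Sales': [200, 120, 340, 124, 243, 350]})

# Several aggregates of one column: pass a list of function names
summary = sales.groupby('Company')['Sales'].agg(['sum', 'mean', 'max'])
print(summary)

# Named aggregation: new_column=(input column, function)
named = sales.groupby('Company').agg(total_sales=('Sales', 'sum'),
                                     n_people=('Person', 'count'))
print(named)
```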

---

## Concatenating, Merging, and Joining

There are three ways of combining DataFrames: concatenating (`pd.concat`), merging (`pd.merge`), and joining (`DataFrame.join`).

### Example DataFrames

In [16]:
conc_df1 = pd.DataFrame({'A': range(10, 13),
                         'B': range(20, 23),
                         'C': range(30, 33)},
                         index=[0, 1, 2])
conc_df1

| | A | B | C |
|---|---|---|---|
| 0 | 10 | 20 | 30 |
| 1 | 11 | 21 | 31 |
| 2 | 12 | 22 | 32 |


In [17]:
conc_df2 = pd.DataFrame({'A': range(13, 16),
                         'B': range(23, 26),
                         'C': range(33, 36)},
                         index=[4, 5, 6])
conc_df2

| | A | B | C |
|---|---|---|---|
| 4 | 13 | 23 | 33 |
| 5 | 14 | 24 | 34 |
| 6 | 15 | 25 | 35 |


### Concatenation

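* Concatenation glues DataFrames together along an axis.
* The DataFrames should match along the other axis (the same columns when stacking rows, the same index when placing them side by side); labels that do not match are filled with `NaN`.
* Use `pd.concat` and pass a **list** of DataFrames to concatenate: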

In [18]:
pd.concat([conc_df1, conc_df2])

| | A | B | C |
|---|---|---|---|
| 0 | 10 | 20 | 30 |
| 1 | 11 | 21 | 31 |
| 2 | 12 | 22 | 32 |
| 4 | 13 | 23 | 33 |
| 5 | 14 | 24 | 34 |
| 6 | 15 | 25 | 35 |


In [19]:
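# axis=1 places the DataFrames side by side, aligning on the row index;
# the indexes here do not overlap, so the gaps are filled with NaN
pd.concat([conc_df1, conc_df2], axis=1)

| | A | B | C | A | B | C |
|---|---|---|---|---|---|---|
| 0 | 10.0 | 20.0 | 30.0 | NaN | NaN | NaN |
| 1 | 11.0 | 21.0 | 31.0 | NaN | NaN | NaN |
| 2 | 12.0 | 22.0 | 32.0 | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | 13.0 | 23.0 | 33.0 |
| 5 | NaN | NaN | NaN | 14.0 | 24.0 | 34.0 |
| 6 | NaN | NaN | NaN | 15.0 | 25.0 | 35.0 |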
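
Two common adjustments to the calls above, sketched under the assumption that the original row labels carry no meaning: `ignore_index=True` renumbers the rows after stacking, and resetting the index of one DataFrame lets the rows line up when concatenating side by side. The names `stacked` and `side_by_side` are illustrative.

```python
import pandas as pd

conc_df1 = pd.DataFrame({'A': range(10, 13),
                         'B': range(20, 23),
                         'C': range(30, 33)},
                        index=[0, 1, 2])
conc_df2 = pd.DataFrame({'A': range(13, 16),
                         'B': range(23, 26),
                         'C': range(33, 36)},
                        index=[4, 5, 6])

# Stack the rows and renumber the index 0..5
stacked = pd.concat([conc_df1, conc_df2], ignore_index=True)
print(stacked)

# Give both DataFrames the same row labels (0, 1, 2) before
# concatenating along axis=1, so no NaN values are introduced
side_by_side = pd.concat([conc_df1, conc_df2.reset_index(drop=True)], axis=1)
print(side_by_side)
```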


### Example DataFrames

In [20]:
merge_left = pd.DataFrame({'key': ['a', 'b', 'c', 'd'],
                           'A': [10, 20, 30, 40],
                           'B': [100, 200, 300, 400]})
merge_left

| | key | A | B |
|---|---|---|---|
| 0 | a | 10 | 100 |
| 1 | b | 20 | 200 |
| 2 | c | 30 | 300 |
| 3 | d | 40 | 400 |


In [21]:
merge_right = pd.DataFrame({'key': ['a', 'b', 'c', 'e'],
                            'C': [50, 60, 70, 80],
                            'D': [500, 600, 700, 800]})
merge_right

| | key | C | D |
|---|---|---|---|
| 0 | a | 50 | 500 |
| 1 | b | 60 | 600 |
| 2 | c | 70 | 700 |
| 3 | e | 80 | 800 |


### Merging

The **merge** function combines DataFrames using logic similar to SQL joins. The `on` argument names the key column, and `how` selects the join type: `'inner'` (the default), `'left'`, `'right'`, or `'outer'`.

In [22]:
pd.merge(merge_left, merge_right, on='key')

| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | a | 10 | 100 | 50 | 500 |
| 1 | b | 20 | 200 | 60 | 600 |
| 2 | c | 30 | 300 | 70 | 700 |


In [23]:
pd.merge(merge_left, merge_right, on='key', how='left')

| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | a | 10 | 100 | 50.0 | 500.0 |
| 1 | b | 20 | 200 | 60.0 | 600.0 |
| 2 | c | 30 | 300 | 70.0 | 700.0 |
| 3 | d | 40 | 400 | NaN | NaN |


In [24]:
pd.merge(merge_left, merge_right, on='key', how='right')

| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | a | 10.0 | 100.0 | 50 | 500 |
| 1 | b | 20.0 | 200.0 | 60 | 600 |
| 2 | c | 30.0 | 300.0 | 70 | 700 |
| 3 | e | NaN | NaN | 80 | 800 |


In [25]:
pd.merge(merge_left, merge_right, on='key', how='outer')

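| | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | a | 10.0 | 100.0 | 50.0 | 500.0 |
| 1 | b | 20.0 | 200.0 | 60.0 | 600.0 |
| 2 | c | 30.0 | 300.0 | 70.0 | 700.0 |
| 3 | d | 40.0 | 400.0 | NaN | NaN |
| 4 | e | NaN | NaN | 80.0 | 800.0 |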
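
To see which side each row of an outer merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `both`, `left_only`, or `right_only`. A small sketch on the same example DataFrames (the name `origin` is illustrative):

```python
import pandas as pd

merge_left = pd.DataFrame({'key': ['a', 'b', 'c', 'd'],
                           'A': [10, 20, 30, 40],
                           'B': [100, 200, 300, 400]})
merge_right = pd.DataFrame({'key': ['a', 'b', 'c', 'e'],
                            'C': [50, 60, 70, 80],
                            'D': [500, 600, 700, 800]})

# indicator=True adds a '_merge' column describing each row's origin
merged = pd.merge(merge_left, merge_right, on='key', how='outer', indicator=True)
print(merged[['key', '_merge']])

# Map each key to its origin for a quick lookup
origin = dict(zip(merged['key'], merged['_merge'].astype(str)))
```

This is handy for checking why a left or right join produced `NaN` values: rows marked `left_only` or `right_only` had no matching key on the other side.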


### Joining

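Joining is similar to merging, but `.join()` matches rows on the DataFrame index rather than on a column. It performs a left join by default, so the example below drops the unmatched rows with `.dropna()`.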

In [26]:
join_left = merge_left.set_index('key')
join_right = merge_right.set_index('key')

In [27]:
join_left

| key | A | B |
|---|---|---|
| a | 10 | 100 |
| b | 20 | 200 |
| c | 30 | 300 |
| d | 40 | 400 |


In [28]:
join_right

| key | C | D |
|---|---|---|
| a | 50 | 500 |
| b | 60 | 600 |
| c | 70 | 700 |
| e | 80 | 800 |


In [29]:
join_left.join(join_right).dropna()

| key | A | B | C | D |
|---|---|---|---|---|
| a | 10 | 100 | 50.0 | 500.0 |
| b | 20 | 200 | 60.0 | 600.0 |
| c | 30 | 300 | 70.0 | 700.0 |
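
As an alternative to joining and then dropping rows, `.join()` takes the same `how` argument as `pd.merge`. The sketch below rebuilds the indexed DataFrames so it runs on its own; with `how='inner'` no `NaN` values are introduced, so the integer columns should not need to be converted to floats.

```python
import pandas as pd

join_left = pd.DataFrame({'key': ['a', 'b', 'c', 'd'],
                          'A': [10, 20, 30, 40],
                          'B': [100, 200, 300, 400]}).set_index('key')
join_right = pd.DataFrame({'key': ['a', 'b', 'c', 'e'],
                           'C': [50, 60, 70, 80],
                           'D': [500, 600, 700, 800]}).set_index('key')

# Keep only index labels present in both DataFrames
inner = join_left.join(join_right, how='inner')
print(inner)

# Keep every index label from either DataFrame
outer = join_left.join(join_right, how='outer')
print(outer)
```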


---

## Operations


In [30]:
data = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': [40, 50, 60, 40],
                     'col3': ['a', 'b', 'c', 'd']})

In [31]:
# returns the first n rows
data.head(2)

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 40 | a |
| 1 | 2 | 50 | b |


### Unique Values

In [32]:
# unique values
data['col2'].unique()

array([40, 50, 60])

In [33]:
# number of unique values
data['col2'].nunique()

3

In [34]:
# values and their counts
data['col2'].value_counts()

40    2
60    1
50    1
Name: col2, dtype: int64
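
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A brief sketch on the same `data` (the name `shares` is illustrative):

```python
import pandas as pd

data = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': [40, 50, 60, 40],
                     'col3': ['a', 'b', 'c', 'd']})

# Fraction of rows holding each distinct value instead of the raw count
shares = data['col2'].value_counts(normalize=True)
print(shares)
```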

### Applying Functions

In [35]:
def times2(x):
    return x * 2

In [36]:
data['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [37]:
data['col3'].apply(len)

0    1
1    1
2    1
3    1
Name: col3, dtype: int64
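
`apply` also accepts a lambda, and with `axis=1` on a DataFrame it receives one row at a time. For simple arithmetic such as doubling, the vectorized expression gives the same result and is usually faster. A sketch, with illustrative variable names:

```python
import pandas as pd

data = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': [40, 50, 60, 40],
                     'col3': ['a', 'b', 'c', 'd']})

# A lambda avoids defining a named function for one-off use
doubled_lambda = data['col1'].apply(lambda x: x * 2)

# The vectorized form operates on the whole column at once
doubled_vectorized = data['col1'] * 2

# axis=1 passes each row (as a Series) to the function
combined = data.apply(lambda row: f"{row['col3']}{row['col1']}", axis=1)
print(combined)
```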

In [38]:
data['col1'].sum()

10

**Get column and index names:**

In [39]:
data.columns.values

array(['col1', 'col2', 'col3'], dtype=object)

In [40]:
data.index.values

array([0, 1, 2, 3])

**Sorting a DataFrame:**

In [41]:
data

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 40 | a |
| 1 | 2 | 50 | b |
| 2 | 3 | 60 | c |
| 3 | 4 | 40 | d |


In [42]:
data.sort_values('col2')

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 40 | a |
| 3 | 4 | 40 | d |
| 1 | 2 | 50 | b |
| 2 | 3 | 60 | c |
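
`sort_values` returns a new DataFrame and keeps the original index labels, as the output above shows. It can also sort in descending order or by several columns, which is one way to break the tie between the two rows where `col2` is 40. A sketch with illustrative variable names:

```python
import pandas as pd

data = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': [40, 50, 60, 40],
                     'col3': ['a', 'b', 'c', 'd']})

# Largest col2 first
descending = data.sort_values('col2', ascending=False)
print(descending)

# Sort by col2 ascending, breaking ties by col1 descending
tie_broken = data.sort_values(['col2', 'col1'], ascending=[True, False])
print(tie_broken)
```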


**Pivot tables:**

In [43]:
piv_data = {'ind1': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
            'ind2': ['one', 'one', 'two', 'two', 'one', 'one'],
            'cols': ['x', 'y', 'x', 'y', 'x', 'y'],
            'vals': [1, 3, 2, 5, 4, 1]}

piv = pd.DataFrame(piv_data)

In [44]:
piv

| | ind1 | ind2 | cols | vals |
|---|---|---|---|---|
| 0 | foo | one | x | 1 |
| 1 | foo | one | y | 3 |
| 2 | foo | two | x | 2 |
| 3 | bar | two | y | 5 |
| 4 | bar | one | x | 4 |
| 5 | bar | one | y | 1 |


In [45]:
# We can create a pivot table by specifying the values, index, and columns
piv.pivot_table(values='vals', index=['ind1','ind2'], columns=['cols'])

| ind1 | ind2 | x | y |
|---|---|---|---|
| bar | one | 4.0 | 1.0 |
| bar | two | NaN | 5.0 |
| foo | one | 1.0 | 3.0 |
| foo | two | 2.0 | NaN |
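
By default `pivot_table` aggregates with the mean and leaves missing combinations as `NaN`, which is why the values above are floats. The sketch below shows two optional arguments, `aggfunc` and `fill_value`, on the same data; choosing `'sum'` and `0` here is only for illustration.

```python
import pandas as pd

piv = pd.DataFrame({'ind1': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                    'ind2': ['one', 'one', 'two', 'two', 'one', 'one'],
                    'cols': ['x', 'y', 'x', 'y', 'x', 'y'],
                    'vals': [1, 3, 2, 5, 4, 1]})

# Sum the values for each (ind1, ind2, cols) combination and
# fill combinations that never occur with 0 instead of NaN
table = piv.pivot_table(values='vals', index=['ind1', 'ind2'],
                        columns='cols', aggfunc='sum', fill_value=0)
print(table)
```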
