# Joining Tables (and Appending, etc.)

The documentation for this section is in Merge-Join-Concatenate:
    http://pandas.pydata.org/pandas-docs/stable/merging.html#merge-join-and-concatenate
    
You should especially study the pictures of the rows and columns and their merge results.

The way pandas combines tables is heavily influenced by SQL joins.  You need to understand the join types (left, right, inner, outer) to understand the how option in the commands below.

![joins pic](data/sql_joins.jpg)
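To make the picture concrete, here is a minimal sketch of the four join types on two toy tables. The frames `orders` and `refunds` and their columns are made up for illustration; they are not the Superstore files used below.

```python
import pandas as pd

# Toy tables (hypothetical data, not the Superstore files)
orders = pd.DataFrame({"order_id": [1, 2, 3], "sales": [10.0, 20.0, 30.0]})
refunds = pd.DataFrame({"order_id": [2, 3, 4], "status": ["Returned"] * 3})

# The `how` argument selects the SQL-style join
row_counts = {
    how: len(orders.merge(refunds, on="order_id", how=how))
    for how in ["left", "right", "inner", "outer"]
}
print(row_counts)
```

Only order ids 2 and 3 appear in both tables, so the inner join keeps two rows, left and right keep three each, and outer keeps all four distinct ids.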

In [1]:
ls data

Paris_Rainfall_Unpivoted.csv  chinook.db*
SuperstoreSales.csv           sql_joins.jpg
SuperstoreSales_Returns.csv


In [2]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [3]:
store = pd.read_csv("data/SuperstoreSales.csv", encoding='latin1', parse_dates=['Order Date', 'Ship Date'])

In [4]:
store.head(3)

Unnamed: 0,Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Unit Price,Shipping Cost,Customer Name,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date
0,1,3,2010-10-13,Low,6,261.54,0.04,Regular Air,38.94,35.0,Muhammed MacIntyre,Nunavut,Nunavut,Small Business,Office Supplies,Storage & Organization,"Eldon Base for stackable storage shelf, platinum",Large Box,0.8,2010-10-20
1,49,293,2012-10-01,High,49,10123.02,0.07,Delivery Truck,208.16,68.02,Barry French,Nunavut,Nunavut,Consumer,Office Supplies,Appliances,"1.7 Cubic Foot Compact ""Cube"" Office Refrigera...",Jumbo Drum,0.58,2012-10-02
2,50,293,2012-10-01,High,27,244.57,0.01,Regular Air,8.69,2.99,Barry French,Nunavut,Nunavut,Consumer,Office Supplies,Binders and Binder Accessories,"Cardinal Slant-D¨ Ring Binder, Heavy Gauge Vinyl",Small Box,0.39,2012-10-03


In [5]:
store.tail(2)

Unnamed: 0,Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Unit Price,Shipping Cost,Customer Name,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date
8397,7907,56550,2011-04-08,Not Specified,8,469.8375,0.0,Regular Air,65.99,8.99,Frank Hawley,Alberta,West,Home Office,Technology,Telephones and Communication,Talkabout T8367,Small Box,0.56,2011-04-09
8398,7914,56581,2009-02-08,High,20,2026.01,0.1,Express Air,105.98,13.99,Grant Donatelli,Alberta,West,Consumer,Furniture,Office Furnishings,"Tenex 46"" x 60"" Computer Anti-Static Chairmat,...",Medium Box,0.65,2009-02-11


In [6]:
store.dtypes

Row ID                           int64
Order ID                         int64
Order Date              datetime64[ns]
Order Priority                  object
Order Quantity                   int64
Sales                          float64
Discount                       float64
Ship Mode                       object
Unit Price                     float64
Shipping Cost                  float64
Customer Name                   object
Province                        object
Region                          object
Customer Segment                object
Product Category                object
Product Sub-Category            object
Product Name                    object
Product Container               object
Product Base Margin            float64
Ship Date               datetime64[ns]
dtype: object

In [7]:
returns = pd.read_csv('data/SuperstoreSales_Returns.csv')

In [8]:
returns.head()

Unnamed: 0,Order ID,Status
0,65,Returned
1,69,Returned
2,134,Returned
3,135,Returned
4,230,Returned


In [10]:
# Left join means we keep all the rows from store (the left) and add the columns from returns where the Order ID matches; rows with no match get NaN
leftjoin = store.merge(returns, left_on='Order ID', right_on='Order ID', how="left")

In [9]:
#store.merge?

In [12]:
leftjoin.head()

Unnamed: 0,Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Unit Price,Shipping Cost,...,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date,Status
0,1,3,2010-10-13,Low,6,261.54,0.04,Regular Air,38.94,35.0,...,Nunavut,Nunavut,Small Business,Office Supplies,Storage & Organization,"Eldon Base for stackable storage shelf, platinum",Large Box,0.8,2010-10-20,
1,49,293,2012-10-01,High,49,10123.02,0.07,Delivery Truck,208.16,68.02,...,Nunavut,Nunavut,Consumer,Office Supplies,Appliances,"1.7 Cubic Foot Compact ""Cube"" Office Refrigera...",Jumbo Drum,0.58,2012-10-02,
2,50,293,2012-10-01,High,27,244.57,0.01,Regular Air,8.69,2.99,...,Nunavut,Nunavut,Consumer,Office Supplies,Binders and Binder Accessories,"Cardinal Slant-D¨ Ring Binder, Heavy Gauge Vinyl",Small Box,0.39,2012-10-03,
3,80,483,2011-07-10,High,30,4965.7595,0.08,Regular Air,195.99,3.99,...,Nunavut,Nunavut,Corporate,Technology,Telephones and Communication,R380,Small Box,0.58,2011-07-12,
4,85,515,2010-08-28,Not Specified,19,394.27,0.08,Regular Air,21.78,5.94,...,Nunavut,Nunavut,Consumer,Office Supplies,Appliances,Holmes HEPA Air Purifier,Medium Box,0.5,2010-08-30,


In [13]:
len(leftjoin)

8399

In [14]:
len(store)

8399
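Comparing the two lengths is a useful check, because a left join can return more rows than the left table when a key appears more than once on the right. As a sketch on made-up data (the frames `orders` and `refunds` are hypothetical), the `indicator` and `validate` arguments of `merge` make the same check more explicit:

```python
import pandas as pd

# Hypothetical toy data: order 2 has two line items, order 4 exists only in refunds
orders = pd.DataFrame({"order_id": [1, 2, 2, 3], "sales": [10.0, 20.0, 5.0, 30.0]})
refunds = pd.DataFrame({"order_id": [2, 4], "status": ["Returned", "Returned"]})

checked = orders.merge(
    refunds,
    on="order_id",
    how="left",
    indicator=True,          # adds a _merge column: left_only / right_only / both
    validate="many_to_one",  # raises if order_id is duplicated in refunds
)

# How many left rows found a match on the right?
match_counts = checked["_merge"].value_counts()
print(len(checked) == len(orders))
print(match_counts)
```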

### Inner join instead

In [15]:
inner = store.merge(returns, left_on='Order ID', right_on='Order ID', how="inner")

In [16]:
len(inner)

872

In [17]:
inner.head()

Unnamed: 0,Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Unit Price,Shipping Cost,...,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date,Status
0,107,678,2010-02-26,Low,44,228.41,0.07,Regular Air,4.98,8.33,...,Nunavut,Nunavut,Home Office,Office Supplies,Paper,Xerox 198,Small Box,0.38,2010-02-26,Returned
1,1370,9927,2011-08-16,High,32,4655.07,0.0,Delivery Truck,140.98,53.48,...,Northwest Territories,Northwest Territories,Corporate,Furniture,Bookcases,Bush Heritage Pine Collection 5-Shelf Bookcase...,Jumbo Box,0.65,2011-08-17,Returned
2,1371,9927,2011-08-16,High,44,10087.6,0.01,Regular Air,218.08,18.06,...,Northwest Territories,Northwest Territories,Corporate,Furniture,Chairs & Chairmats,"Lifetime Advantageª Folding Chairs, 4/Carton",Large Box,0.57,2011-08-17,Returned
3,1372,9927,2011-08-16,High,34,1608.08,0.09,Express Air,50.98,6.5,...,Northwest Territories,Northwest Territories,Corporate,Technology,Computer Peripherals,Microsoft Natural Multimedia Keyboard,Small Box,0.73,2011-08-17,Returned
4,1654,11911,2010-11-10,Critical,25,397.84,0.0,Regular Air,15.22,9.73,...,Northwest Territories,Northwest Territories,Consumer,Office Supplies,Binders and Binder Accessories,"GBC Twin Loopª Wire Binding Elements, 9/16"" Sp...",Small Box,0.36,2010-11-12,Returned


## Joins

Another alternative is join.  If two dataframes have the same index, join matches rows on the index and just adds the columns from the second dataframe.
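A minimal sketch of an index-based join, using two small made-up frames (`people`, `towns`, and the `ville` column are assumptions for illustration):

```python
import pandas as pd

# Two hypothetical frames that share the same string index
people = pd.DataFrame({"prenom": ["fred", "sally"]}, index=["1", "2"])
towns = pd.DataFrame({"ville": ["Paris", "Lyon"]}, index=["1", "2"])

# join aligns on the index (a left join by default) and adds the new columns
joined = people.join(towns)
print(joined)
```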

## Testing for NaNs

NaN is "not a number".  It is the pandas/numpy version of a missing value.

In [20]:
leftjoin['Status'].head(1)

0    NaN
Name: Status, dtype: object

In [18]:
leftjoin['Status'][0] is np.nan

True
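The identity test above happens to work here, but it is fragile: not every NaN is the same object as `np.nan`. A more robust sketch, on a made-up Series, uses `pd.isna`, which also treats `None` as missing:

```python
import numpy as np
import pandas as pd

# A freshly created NaN is a different object, so `is` fails
fresh_nan = float("nan")
print(fresh_nan is np.nan)   # identity check
print(pd.isna(fresh_nan))    # value check

# Hypothetical column mixing both kinds of missing value
status = pd.Series(["Returned", np.nan, None], dtype="object")
mask = status.isna()
print(mask.tolist())
```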

## Drop NAs

In [21]:
returns = leftjoin.dropna(subset=['Status'])

returns

Unnamed: 0,Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Unit Price,Shipping Cost,...,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date,Status
9,107,678,2010-02-26,Low,44,228.4100,0.07,Regular Air,4.98,8.33,...,Nunavut,Nunavut,Home Office,Office Supplies,Paper,Xerox 198,Small Box,0.38,2010-02-26,Returned
89,1370,9927,2011-08-16,High,32,4655.0700,0.00,Delivery Truck,140.98,53.48,...,Northwest Territories,Northwest Territories,Corporate,Furniture,Bookcases,Bush Heritage Pine Collection 5-Shelf Bookcase...,Jumbo Box,0.65,2011-08-17,Returned
90,1371,9927,2011-08-16,High,44,10087.6000,0.01,Regular Air,218.08,18.06,...,Northwest Territories,Northwest Territories,Corporate,Furniture,Chairs & Chairmats,"Lifetime Advantageª Folding Chairs, 4/Carton",Large Box,0.57,2011-08-17,Returned
91,1372,9927,2011-08-16,High,34,1608.0800,0.09,Express Air,50.98,6.50,...,Northwest Territories,Northwest Territories,Corporate,Technology,Computer Peripherals,Microsoft Natural Multimedia Keyboard,Small Box,0.73,2011-08-17,Returned
105,1654,11911,2010-11-10,Critical,25,397.8400,0.00,Regular Air,15.22,9.73,...,Northwest Territories,Northwest Territories,Consumer,Office Supplies,Binders and Binder Accessories,"GBC Twin Loopª Wire Binding Elements, 9/16"" Sp...",Small Box,0.36,2010-11-12,Returned
108,1675,12096,2012-09-19,Medium,46,8009.5925,0.02,Regular Air,200.99,8.08,...,Northwest Territories,Northwest Territories,Home Office,Technology,Telephones and Communication,5125,Small Box,0.59,2012-09-19,Returned
109,1676,12096,2012-09-19,Medium,23,4689.6600,0.01,Regular Air,194.30,11.54,...,Northwest Territories,Northwest Territories,Home Office,Furniture,Office Furnishings,Electrix Halogen Magnifier Lamp,Large Box,0.59,2012-09-21,Returned
115,1770,12704,2010-02-09,Low,44,21506.7700,0.06,Regular Air,499.99,24.49,...,Northwest Territories,Northwest Territories,Small Business,Technology,Copiers and Fax,Sharp AL-1530CS Digital Copier,Large Box,0.36,2010-02-09,Returned
116,1771,12704,2010-02-09,Low,28,669.0200,0.02,Delivery Truck,20.98,53.03,...,Northwest Territories,Northwest Territories,Small Business,Office Supplies,Storage & Organization,"Tennsco Lockers, Gray",Jumbo Drum,0.78,2010-02-11,Returned
135,2114,15106,2011-01-27,Not Specified,42,283.5800,0.03,Regular Air,6.64,4.95,...,Northwest Territories,Northwest Territories,Home Office,Furniture,Office Furnishings,G.E. Longer-Life Indoor Recessed Floodlight Bulbs,Small Pack,0.37,2011-01-29,Returned


In [22]:
len(returns)

872

In [26]:
def replace_nan(x):
    if x is np.nan:
        return "Not Returned"
    else:
        return x

In [25]:
leftjoin['Status'].head()

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Status, dtype: object

In [27]:
leftjoin['Status'] = leftjoin['Status'].apply(replace_nan)

In [30]:
leftjoin['Status'].head()

0    Not Returned
1    Not Returned
2    Not Returned
3    Not Returned
4    Not Returned
Name: Status, dtype: object

In [31]:
leftjoin.Status.value_counts()

Not Returned    7527
Returned         872
Name: Status, dtype: int64
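The same replacement can be done without a helper function. As a sketch on a small made-up Series, `fillna` fills every missing value in one call and does not depend on the `is np.nan` identity test:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for leftjoin['Status']
status = pd.Series(["Returned", np.nan, np.nan], name="Status")

# Replace every missing value with a label
filled = status.fillna("Not Returned")
counts = filled.value_counts()
print(counts)
```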

In [33]:
def adjust_sales(x):
    if x['Status'] == 'Returned':
        return 0
    elif x['Status'] == 'Not Returned':
        return x['Sales']

In [34]:
# if we are using row values in our formula, use axis=1 (by-row)
leftjoin['Adjusted Sales'] = leftjoin.apply(adjust_sales, axis=1)

In [37]:
leftjoin['Adjusted Sales'].head()

0      261.5400
1    10123.0200
2      244.5700
3     4965.7595
4      394.2700
Name: Adjusted Sales, dtype: float64

In [38]:
leftjoin['Adjusted Sales'].sum()

13260747.0965

In [39]:
leftjoin['Sales'].sum()

14915600.824000001

In [40]:
# Total returns
leftjoin['Sales'].sum() - leftjoin['Adjusted Sales'].sum()

1654853.727500001
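Note that the row-wise function above returns None for any Status other than the two it names. A vectorized alternative, sketched here on made-up numbers, uses `Series.where` to keep Sales where the order was not returned and put 0 elsewhere; it is usually much faster than `apply` with axis=1 on large frames.

```python
import pandas as pd

# Hypothetical stand-in for the joined table
df = pd.DataFrame({
    "Sales": [100.0, 250.0, 40.0],
    "Status": ["Not Returned", "Returned", "Not Returned"],
})

# Keep Sales where the condition is True, otherwise use 0.0
df["Adjusted Sales"] = df["Sales"].where(df["Status"] != "Returned", 0.0)

# Total value of returned orders
total_returns = df["Sales"].sum() - df["Adjusted Sales"].sum()
print(total_returns)
```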

## Copy a column from one DF onto another DF

In [41]:
len(leftjoin['Adjusted Sales'])

8399

In [42]:
len(store)

8399

In [43]:
store.columns

Index(['Row ID', 'Order ID', 'Order Date', 'Order Priority', 'Order Quantity',
       'Sales', 'Discount', 'Ship Mode', 'Unit Price', 'Shipping Cost',
       'Customer Name', 'Province', 'Region', 'Customer Segment',
       'Product Category', 'Product Sub-Category', 'Product Name',
       'Product Container', 'Product Base Margin', 'Ship Date'],
      dtype='object')

If the two dataframes have the same index, you can assign a column from one to the other, which creates the column in store. (This didn't work with my latitude and longitude example because the indices differed: one was strings and one was integers.)

In [44]:
store['Adjusted Sales'] = leftjoin['Adjusted Sales']

In [45]:
store['Adjusted Sales'].head()

0      261.5400
1    10123.0200
2      244.5700
3     4965.7595
4      394.2700
Name: Adjusted Sales, dtype: float64
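Before assigning a column across dataframes, it can help to confirm that the indexes really match. A minimal sketch with made-up frames (`left`, `right`, and `relabeled` are hypothetical names) using `Index.equals`:

```python
import pandas as pd

# Both frames get the default 0..n-1 row index
left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"b": [10.0, 20.0, 30.0]})

# Only assign when labels and order are identical
same_index = left.index.equals(right.index)
if same_index:
    left["b"] = right["b"]

# The same data with string labels no longer matches
relabeled = right.copy()
relabeled.index = ["x", "y", "z"]
mismatch = left.index.equals(relabeled.index)
print(same_index, mismatch)
```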

## Index Mismatch with Column assignment

Be careful about setting columns equal in cases where the indices aren't the same.  You have to check it worked -- your new columns might all be NaNs like mine were with latitude and longitude.

In [46]:
smalldf = pd.DataFrame({"prenom": ["fred", "sally", "sam"], "id": ['1','2','3']})

In [224]:
smalldf

Unnamed: 0,id,prenom
0,1,fred
1,2,sally
2,3,sam


In [47]:
smalldf = smalldf.set_index("id")
smalldf

id,prenom
1,fred
2,sally
3,sam


By the way, to turn an index back into an ordinary column and go back to plain row numbers, you can use reset_index:

In [48]:
# we aren't saving this into smalldf now, just showing you the output
smalldf.reset_index()

Unnamed: 0,id,prenom
0,1,fred
1,2,sally
2,3,sam


In [49]:
smalldf

id,prenom
1,fred
2,sally
3,sam


In [50]:
nomsdf = pd.DataFrame({"nom": ["Lagrange", "LeStrange", "Fermier"]})

In [51]:
nomsdf  # notice it has a different index -- row number by default

Unnamed: 0,nom
0,Lagrange
1,LeStrange
2,Fermier


In [52]:
smalldf['nom'] = nomsdf['nom']

In [53]:
# This did not work.  It created the new column, but assigned NaN because no index labels matched.
smalldf

id,prenom,nom
1,fred,
2,sally,
3,sam,


If you know these are the same length and in the same order, you can assign the raw values of the source column, which ignores its index:

In [54]:
# check their length
len(smalldf) == len(nomsdf)

True

In [55]:
smalldf['nom'] = nomsdf['nom'].values

In [56]:
smalldf

id,prenom,nom
1,fred,Lagrange
2,sally,LeStrange
3,sam,Fermier


## Adding a Row or Rows With Append

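You can use append to add rows onto another df.  It's best if the indexes match.  Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, pd.concat does the same job.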

In [58]:
newnames = pd.DataFrame({"prenom": ["harry"], "nom": ["Berbier"], "id": ['4']})

In [59]:
newnames

Unnamed: 0,id,nom,prenom
0,4,Berbier,harry


In [60]:
newnames = newnames.set_index("id")

In [61]:
newnames

id,nom,prenom
4,Berbier,harry


In [62]:
smalldf = smalldf.append(newnames)
smalldf

id,nom,prenom
1,Lagrange,fred
2,LeStrange,sally
3,Fermier,sam
4,Berbier,harry
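On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same result can be obtained with `pd.concat`. A minimal sketch on made-up frames shaped like the ones above (`people` and `extra` are hypothetical names):

```python
import pandas as pd

# Hypothetical frames, both indexed by a string id
people = pd.DataFrame(
    {"prenom": ["fred", "sally"], "nom": ["Lagrange", "LeStrange"]},
    index=pd.Index(["1", "2"], name="id"),
)
extra = pd.DataFrame(
    {"prenom": ["harry"], "nom": ["Berbier"]},
    index=pd.Index(["4"], name="id"),
)

# concat stacks the rows and lines up the columns by name
combined = pd.concat([people, extra])
print(combined)
```

Like append, concat returns a new dataframe rather than modifying either input, so the result has to be assigned to a name.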


Another option for the index mismatch problem is to reset the index on both dataframes so they are both just ordered by row number, and then assign.
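A sketch of that approach on made-up frames (`first`, `last`, `flat`, and `restored` are hypothetical names). It assumes, as before, that the rows are in the same order in both dataframes:

```python
import pandas as pd

# One frame indexed by a string id, the other by default row numbers
first = pd.DataFrame(
    {"prenom": ["fred", "sally", "sam"]},
    index=pd.Index(["1", "2", "3"], name="id"),
)
last = pd.DataFrame({"nom": ["Lagrange", "LeStrange", "Fermier"]})

# Move id back to a column so both frames share the 0..n-1 index
flat = first.reset_index()
flat["nom"] = last["nom"]

# Optionally restore id as the index afterwards
restored = flat.set_index("id")
print(restored)
```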