# Combining DataFrames

In this notebook, you'll see two different ways to combine _pandas_ DataFrames.

In [1]:
import pandas as pd

# Merging DataFrames

First, we'll import a DataFrame containing information on the revenue and quantity for sales that occurred in the year 2012.

In [2]:
sales_2012 = pd.read_csv('../data/sales_2012.csv')
sales_2012.head()

Unnamed: 0,Sale_ID,Year,Quarter,Revenue,Quantity
0,1,2012,Q1 2012,59628.66,489
1,3,2012,Q1 2012,89940.48,147
2,4,2012,Q1 2012,165883.41,303
3,5,2012,Q1 2012,119822.2,1415
4,6,2012,Q1 2012,87728.96,352


Next, we'll bring in a dataset which shows product, retailer, and order method information.

In [3]:
products = pd.read_csv('../data/products.csv')
products.head()

Unnamed: 0,Sale_ID,Retailer_country,Order_method_type,Retailer_type,Product_line,Product_type,Product
0,1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set
1,2,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame
2,900000,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome
3,4,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2
4,5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite


Notice that these two DataFrames can be linked through the `Sale_ID` column. Let's merge them so that we can do some further analysis.

Recall that the syntax for merging DataFrames in pandas is:

```pd.merge(left_dataframe, right_dataframe, how=<how to merge>, on=<column to merge on>)```
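As a quick illustration of how the `how` argument changes the result, here is a minimal sketch using two tiny DataFrames. The names `left`, `right`, `right_join`, and `inner_join` and all of the values are invented for this example and are not part of the sales data.

```python
import pandas as pd

# Toy data, invented for illustration only
left = pd.DataFrame({'Sale_ID': [1, 2, 4], 'Product': ['A', 'B', 'C']})
right = pd.DataFrame({'Sale_ID': [1, 3, 4], 'Revenue': [10.0, 20.0, 30.0]})

# how='right' keeps every row of the right DataFrame;
# Sale_ID 3 has no match on the left, so its Product is NaN
right_join = pd.merge(left, right, how='right', on='Sale_ID')

# how='inner' keeps only the Sale_IDs present in both DataFrames
inner_join = pd.merge(left, right, how='inner', on='Sale_ID')

print(right_join)
print(inner_join)
```

A right merge is the natural choice when every row of the right DataFrame must survive, even if the left DataFrame has nothing to say about it.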

In [4]:
combined_data = pd.merge(products, sales_2012, how = 'right', on = 'Sale_ID')

In [5]:
combined_data.head()

Unnamed: 0,Sale_ID,Retailer_country,Order_method_type,Retailer_type,Product_line,Product_type,Product,Year,Quarter,Revenue,Quantity
0,1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489
1,3,,,,,,,2012,Q1 2012,89940.48,147
2,4,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303
3,5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415
4,6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352


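It looks like we have some missing values. Because we used a right merge, every row of `sales_2012` is kept, even when `products` has no matching `Sale_ID`.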

In [6]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33895 entries, 0 to 33894
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Sale_ID            33895 non-null  int64  
 1   Retailer_country   33894 non-null  object 
 2   Order_method_type  33894 non-null  object 
 3   Retailer_type      33894 non-null  object 
 4   Product_line       33894 non-null  object 
 5   Product_type       33894 non-null  object 
 6   Product            33894 non-null  object 
 7   Year               33895 non-null  int64  
 8   Quarter            33895 non-null  object 
 9   Revenue            33895 non-null  float64
 10  Quantity           33895 non-null  int64  
dtypes: float64(1), int64(3), object(7)
memory usage: 3.1+ MB


In [7]:
# How many values are missing from each column?

combined_data.isnull().sum()

Sale_ID              0
Retailer_country     1
Order_method_type    1
Retailer_type        1
Product_line         1
Product_type         1
Product              1
Year                 0
Quarter              0
Revenue              0
Quantity             0
dtype: int64

Notice that the row with a `Sale_ID` of 3 is missing all of its product information. Let's double-check that this sale does not appear in the products data.

In [8]:
products[products['Sale_ID'] == 3]

Unnamed: 0,Sale_ID,Retailer_country,Order_method_type,Retailer_type,Product_line,Product_type,Product
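The empty result above confirms that `Sale_ID` 3 has no entry in `products`. When there may be many unmatched rows, checking them one at a time does not scale. One option is the `indicator` argument of `pd.merge`, which adds a `_merge` column recording where each row came from. The sketch below uses invented toy data (`products_toy`, `sales_toy`) rather than the real files.

```python
import pandas as pd

# Toy stand-ins for the products and sales tables (values are invented)
products_toy = pd.DataFrame({'Sale_ID': [1, 2, 4],
                             'Product': ['Cook Set', 'Double Flame', 'Tent']})
sales_toy = pd.DataFrame({'Sale_ID': [1, 3, 4],
                          'Revenue': [100.0, 200.0, 300.0]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'
checked = pd.merge(products_toy, sales_toy, how='right',
                   on='Sale_ID', indicator=True)

# Sales that found no matching product row
unmatched = checked.loc[checked['_merge'] == 'right_only', 'Sale_ID']
print(unmatched.tolist())
```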


Once combined, we can start asking questions of our data.

#### Question 1: Which product type generated the most total revenue in 2012?

In [None]:
# Try to fill in the code to answer this question

#### Question 2: What was our highest-volume product?

In [None]:
combined_data.groupby('Product')['Quantity'].sum().sort_values(ascending = False)

What is Zone?

In [None]:
combined_data.loc[combined_data['Product'] == 'Zone']

#### Question 3: For which retailer type do we have the highest sales quantity of Zone?

In [None]:
combined_data.loc[combined_data['Product'] == 'Zone'].groupby('Retailer_type')['Quantity'].sum().sort_values(ascending = False)
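The filter, group, sum, and sort pattern used above can be easier to follow on a small example. The sketch below uses an invented DataFrame (`toy`) with made-up quantities. It also shows `idxmax`, which returns only the label of the largest group when the full ranking is not needed.

```python
import pandas as pd

# Invented data for illustration; not taken from the sales files
toy = pd.DataFrame({
    'Product': ['Zone', 'Zone', 'Zone', 'Star Dome'],
    'Retailer_type': ['Outdoors Shop', 'Golf Shop', 'Outdoors Shop', 'Golf Shop'],
    'Quantity': [5, 7, 4, 100],
})

zone_by_retailer = (
    toy.loc[toy['Product'] == 'Zone']      # keep only the Zone rows
       .groupby('Retailer_type')['Quantity']
       .sum()                              # total quantity per retailer type
       .sort_values(ascending=False)       # largest total first
)

# idxmax gives the index label (here, the retailer type) of the largest total
top_retailer = zone_by_retailer.idxmax()
print(zone_by_retailer)
print(top_retailer)
```

Note that the filter runs before the grouping, so the 100 units of Star Dome sold through golf shops do not affect the Zone totals.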

# Concatenating DataFrames

Notice that we also have access to sales data for 2013. Let's read it in.

In [None]:
sales_2013 = pd.read_csv('../data/sales_2013.csv')
sales_2013.head(2)

This data looks to be formatted in the same way as our 2012 sales data.

In [None]:
sales_2012.head(2)

What if we want to combine these two DataFrames? In this case, we don't want to merge, as each record should still have its own row in the result. Instead, this is a time when we want to **concatenate**.

To concatenate DataFrames, we can pass the DataFrames that we want to combine as a list into the `pd.concat` function.

In [None]:
pd.concat([sales_2012, sales_2013])

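Note that while we have 66840 rows, the index value at the end of the DataFrame is only 32944. This is because `pd.concat` keeps each DataFrame's original index by default, so index values repeat. We can renumber the rows of the result from 0 by using the `ignore_index` argument.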

In [None]:
pd.concat([sales_2012, sales_2013], ignore_index = True)
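To see the index behavior on a small scale, here is a minimal sketch with two invented two-row DataFrames (`a` and `b`). Without `ignore_index`, the original index labels are carried over and repeat. With it, the rows are renumbered from 0.

```python
import pandas as pd

# Two tiny DataFrames with the same columns (values are invented)
a = pd.DataFrame({'Sale_ID': [1, 2], 'Revenue': [10.0, 20.0]})
b = pd.DataFrame({'Sale_ID': [3, 4], 'Revenue': [30.0, 40.0]})

# Default: each DataFrame keeps its own index labels
stacked = pd.concat([a, b])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())
```

Repeated index labels matter in practice: with them, `stacked.loc[0]` returns two rows rather than one.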

We've also got sales for 2014. Rather than rewrite the same code multiple times, we could instead make use of a for loop in order to read in the three DataFrames and then combine them.

In [None]:
# Start with an empty list to hold the individual DataFrames
sales_dfs = []

for filename in ['../data/sales_2012.csv', '../data/sales_2013.csv', '../data/sales_2014.csv']:
    df = pd.read_csv(filename)
    sales_dfs.append(df)

In [None]:
sales = pd.concat(sales_dfs, ignore_index = True)

In [None]:
sales
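The loop above can also be written as a list comprehension, and the filenames can be discovered with a glob pattern instead of being typed out. The sketch below is one possible variation, not the notebook's method. To stay self-contained, it writes three small invented CSV files to a temporary directory first; the `sales_*.csv` pattern and the toy contents are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Create three tiny stand-in files (invented data, one row each)
    for year in (2012, 2013, 2014):
        toy = pd.DataFrame({'Sale_ID': [year], 'Year': [year]})
        toy.to_csv(tmp_dir / f'sales_{year}.csv', index=False)

    # Sort the matches so the files are read in a predictable order
    filenames = sorted(tmp_dir.glob('sales_*.csv'))

    # Read every file and stack the results in one step
    sales_toy = pd.concat([pd.read_csv(f) for f in filenames],
                          ignore_index=True)

print(sales_toy)
```

A glob pattern picks up new yearly files automatically, while the explicit list in the loop makes it obvious exactly which files are read.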