### Pandas Lab -- Cleaning, Merging, & Grouping

This lab is designed to introduce students to common use cases for Pandas when working with data:

 - Creating new information out of your existing data set
 - Merging, concatenating, and joining different data sources
 - Grouping -- with both time-based and non-time-based data

### Section I: Creating Data Out of Your Existing Columns

Go ahead and create the following columns in your dataset.

**Column 1:**

  - **Column Name:** Profitable
  - **Values:** `True` if `Profit` > 0, `False` if not.
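
As a hint, here is a minimal sketch on a made-up three-row frame (not the lab dataset). It assumes only that a `Profit` column exists, and shows that comparing a column to a number returns one boolean per row.

```python
import pandas as pd

# Made-up toy data; the real lab dataframe has many more rows and columns.
toy = pd.DataFrame({"Profit": [41.91, -383.03, 0.0]})

# A comparison on a column is evaluated row by row and returns a boolean Series.
toy["Profitable"] = toy["Profit"] > 0

print(toy)
```

Note that a profit of exactly zero counts as not profitable under this rule.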

In [None]:
# your answer here

**Column 2:**

 - **Column Name:** Expected Ship Time
 - **Values:**
   - `0` if `Ship Mode` == `Same Day`
   - `2` if `Ship Mode` == `First Class`
   - `3` if `Ship Mode` == `Second Class`
   - `6` if `Ship Mode` == `Standard Class`
   - `-1` if none of the above.
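
One possible approach, sketched on made-up data: look each value up in a dictionary with `map`, then fill anything that did not match. The `Overnight` ship mode below is hypothetical and is only there to exercise the fallback.

```python
import pandas as pd

# Made-up toy data; "Overnight" is a hypothetical value not covered by the rules.
toy = pd.DataFrame(
    {"Ship Mode": ["Same Day", "First Class", "Second Class", "Standard Class", "Overnight"]}
)

expected_days = {
    "Same Day": 0,
    "First Class": 2,
    "Second Class": 3,
    "Standard Class": 6,
}

# map() returns NaN for values missing from the dictionary,
# so fillna(-1) supplies the "none of the above" case.
toy["Expected Ship Time"] = toy["Ship Mode"].map(expected_days).fillna(-1).astype(int)

print(toy)
```

The `astype(int)` at the end is there because the intermediate `NaN` forces the column to floats.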

In [None]:
# your answer here

**Column 3:**

 - **Column Name:** Actual Ship Time
 - **Values:**
   - `Ship Date` - `Order Date`
 - **Note:** When you subtract these columns, the result will be a **time delta** column. See if you can use the `dt` attribute to convert these values into integers; i.e., if a value reads `3 days`, you want it to be `3` instead. You can read more about time and date components in pandas here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
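
A minimal sketch of the idea on a two-row toy frame, assuming both columns are already parsed as datetimes (if yours are strings, `pd.to_datetime` can convert them first):

```python
import pandas as pd

# Toy frame with two order/ship date pairs.
toy = pd.DataFrame(
    {
        "Order Date": pd.to_datetime(["2016-11-08", "2015-10-11"]),
        "Ship Date": pd.to_datetime(["2016-11-11", "2015-10-18"]),
    }
)

# Subtracting two datetime columns gives a timedelta column ("3 days", "7 days").
delta = toy["Ship Date"] - toy["Order Date"]

# The dt accessor exposes the whole-day component as plain integers.
toy["Actual Ship Time"] = delta.dt.days

print(toy)
```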

In [32]:
# your answer here

**Column 4:**

 - **Column Name:** Late
 - **Values:** `True` if `Actual Ship Time` > `Expected Ship Time`, `False` otherwise

In [None]:
# your answer here

### Section II: Merging Dataframes

This Excel spreadsheet has 3 separate sheets. Look up the documentation for `pd.read_excel` to see how to load the other two.

After that, merge the other two dataframes into your original one, and make sure your original dataset now has the following columns:

 - **Salesperson:** This is the Salesperson in charge of each region.
 - **Returned:** This details whether or not the order was returned.  Fill in null values with the value `no`.
 
Use the `drop()` method if you need to get rid of redundant columns.

**Important:** We want to keep all of the rows in the dataset we first loaded in.  After each merge, it's a good idea to make sure your dataset hasn't shrunk, which will happen if you don't choose the right merge type.  Make sure you have 9,994 rows when you're finished!
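
To illustrate why the merge type matters, here is a sketch on three made-up frames. The names `orders`, `people`, and `returns`, the join keys, and every value are assumptions for illustration only; check the actual sheets for their real column names. When loading the sheets, the `sheet_name` argument of `pd.read_excel` selects a sheet by name or position.

```python
import pandas as pd

# Hypothetical stand-ins for the three sheets.
orders = pd.DataFrame(
    {"Order ID": ["A-1", "A-2", "A-3"], "Region": ["South", "West", "South"]}
)
people = pd.DataFrame({"Region": ["South", "West"], "Salesperson": ["Pat", "Sam"]})
returns = pd.DataFrame({"Order ID": ["A-2"], "Returned": ["Yes"]})

rows_before = len(orders)

# how="left" keeps every row of the left frame, even when there is no match.
merged = orders.merge(people, on="Region", how="left")
merged = merged.merge(returns, on="Order ID", how="left")

# Orders with no match in `returns` come back as NaN; fill them in.
merged["Returned"] = merged["Returned"].fillna("no")

# Sanity check: the row count should not have changed.
assert len(merged) == rows_before
print(merged)
```

A left merge can also *grow* the dataset if the right-hand frame has duplicate keys, so comparing row counts after each merge catches problems in both directions.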

In [None]:
# your work here

### Section III: Grouping

Use the `groupby` or `resample` method to answer the following questions.

**Question 1:** Which salesperson had the highest average sales amount?

In [None]:
# your answer here

**Question 2:** Within each ship mode, compare how likely late orders were to be profitable or unprofitable.
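
One trick that may help here: the mean of a boolean column is the share of `True` values. The sketch below uses made-up data and assumes the `Late` and `Profitable` columns from Section I exist.

```python
import pandas as pd

# Made-up toy data: four orders in each of two ship modes.
toy = pd.DataFrame(
    {
        "Ship Mode": ["First Class"] * 4 + ["Standard Class"] * 4,
        "Late": [True, True, False, False, True, False, False, False],
        "Profitable": [True, False, True, True, False, True, True, False],
    }
)

# Grouping by two keys gives one row per (ship mode, late?) combination;
# the mean of the boolean column is the fraction of profitable orders.
share = toy.groupby(["Ship Mode", "Late"])["Profitable"].mean()

# unstack() pivots the Late level into columns for side-by-side comparison.
print(share.unstack())
```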

In [None]:
# your answer here

**Question 3:** Which ship mode had the most consistently on-time orders?

In [None]:
# your answer here

**Question 4:** For each salesperson, get the average, median, max, and count of their sales.
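
As a hint, `agg` accepts a list of function names and returns one column per statistic. A minimal sketch on made-up data (the salesperson names are hypothetical):

```python
import pandas as pd

# Made-up toy data with two hypothetical salespeople.
toy = pd.DataFrame(
    {
        "Salesperson": ["Pat", "Pat", "Sam", "Sam", "Sam"],
        "Sales": [10.0, 30.0, 5.0, 15.0, 40.0],
    }
)

# Passing a list to agg() computes several statistics in one pass.
summary = toy.groupby("Salesperson")["Sales"].agg(["mean", "median", "max", "count"])

print(summary)
```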

In [None]:
# your answer here

**Question 5:** Group your dataset by `Region` and `Category`, then call the `describe()` method to get the summary statistics for each subgroup.

In [None]:
# your answer here

**Question 6:** Use the `resample()` method to get the sum of sales for each quarter.
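
Resampling needs a datetime index (or a datetime column passed via `on=`). The sketch below uses made-up data and the `"QS"` (quarter start) frequency alias, which labels each quarter by its first day; this alias is one choice among several, and the quarter-end aliases differ between pandas versions.

```python
import pandas as pd

# Made-up toy data spanning three calendar quarters of 2016 (Q3 has no orders).
toy = pd.DataFrame(
    {
        "Order Date": pd.to_datetime(
            ["2016-01-15", "2016-02-20", "2016-05-03", "2016-11-08"]
        ),
        "Sales": [100.0, 50.0, 25.0, 200.0],
    }
)

# Move the dates into the index, then bucket by quarter and sum each bucket.
quarterly = toy.set_index("Order Date")["Sales"].resample("QS").sum()

print(quarterly)

# idxmax() returns the index label (here, the quarter start) of the largest value.
print(quarterly.idxmax())
```

Quarters with no orders still appear in the result, with a sum of zero.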

In [None]:
# your answer here

**Question 7:** Which quarter had the highest total sales amount?

In [None]:
# your answer here

**Question 8:** See if you can use the `groupby` method to get a list of yearly sales for each region inside the dataset.

**Hint:** Try using the `dt` attribute of the `Order Date` column.
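
To expand on the hint: `groupby` accepts a mix of column names and Series, so a year extracted with `dt.year` can be used as a grouping key directly. A sketch on made-up data; renaming the key to `Year` is only for tidier output.

```python
import pandas as pd

# Made-up toy data covering two regions and two years.
toy = pd.DataFrame(
    {
        "Order Date": pd.to_datetime(
            ["2015-10-11", "2016-11-08", "2016-12-01", "2016-03-05"]
        ),
        "Region": ["South", "South", "West", "South"],
        "Sales": [10.0, 20.0, 30.0, 40.0],
    }
)

# dt.year pulls the calendar year out of each date as an integer Series.
year = toy["Order Date"].dt.year.rename("Year")

# Group by a column name and the derived Series together.
yearly = toy.groupby(["Region", year])["Sales"].sum()

print(yearly)
```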

In [None]:
# your answer here

**Bonus:** Creating summary statistics with a `groupby` statement.  

It's often useful to create a summary statistic for a particular category so that you can compare individual samples against it.

For example, in fraud detection, a transaction amount of $175 at a 7-11 would be unusually large for a convenience store purchase, which would raise a red flag that the transaction might be suspicious.

Such comparisons are easy to create: compute the statistic with the `groupby` method, then merge the result back into the original dataframe.

For example, if you wanted to compare every single purchase amount with the average amount for that category, you could do it in the following way:

In [2]:
import pandas as pd
df = pd.read_excel('../data/superstore.xls')
# create the grouping
cat_grouping = df.groupby('Category')[['Sales']].mean()
# this step is mostly just to make the merged dataframe more tidy
cat_grouping.rename({'Sales': 'Cat_Average'}, axis=1, inplace=True)

In [3]:
# join them
df = df.merge(cat_grouping, left_on='Category', right_index=True)

In [4]:
# and now we can see each purchase amount compared to the average amt
# for that category
df.head()

| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | Cat_Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 | 349.834887 |
| 1 | 2 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.582 | 349.834887 |
| 3 | 4 | US-2015-108966 | 2015-10-11 | 2015-10-18 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.031 | 349.834887 |
| 5 | 6 | CA-2014-115812 | 2014-06-09 | 2014-06-14 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | ... | West | FUR-FU-10001487 | Furniture | Furnishings | Eldon Expressions Wood and Plastic Desk Access... | 48.86 | 7 | 0.0 | 14.1694 | 349.834887 |
| 10 | 11 | CA-2014-115812 | 2014-06-09 | 2014-06-14 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | ... | West | FUR-TA-10001539 | Furniture | Tables | Chromcraft Rectangular Conference Tables | 1706.184 | 9 | 0.2 | 85.3092 | 349.834887 |


So, for instance, if we wanted to ask ourselves, "Which customers consistently punch above their weight when it comes to the actual items that they buy?", we could easily do the following:

In [5]:
# turn the difference between the two columns into a percent
df['Cat Difference'] = ((df['Sales'] / df['Cat_Average']) - 1) * 100

In [6]:
# now group and sort the values
df.groupby('Customer Name')['Cat Difference'].mean().sort_values(ascending=False)

Customer Name
Mitch Willingham        845.569285
Christopher Martinez    636.313301
Andy Reiter             449.654450
Adrian Barton           418.548520
Sanjit Chand            386.906876
Amy Cox                 363.031889
Yoseph Carroll          344.794116
Yana Sorensen           317.572571
Sean Miller             308.018602
Tamara Chand            304.928500
Alex Avila              296.561353
Greg Maxwell            292.917810
Grant Thornton          291.980410
Jane Waco               272.432920
Tom Ashbrook            270.172385
Paul Knutson            244.824972
Robert Dilbeck          239.904239
Gary Hwang              231.081468
Ken Lonsdale            216.273514
Dennis Pardue           213.919098
Justin Hirsh            205.806859
Stefanie Holloman       199.625920
Cathy Prescott          194.089483
Bill Shonely            194.068945
Erica Smith             193.504802
Neil Ducich             174.357173
Adam Bellavance         174.177216
Hunter Lopez            170.364451
Karen 

**Your Turn:** Using a methodology similar to the one above, figure out the 10 customers who are the most profitable on average, compared to the subcategories that they purchased from.

If you wanted, you could also limit this to customers who made at least a minimum number of purchases.
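
For the optional minimum-purchase filter, one approach is to aggregate the mean and the count together, then filter on the count before sorting. The sketch below uses made-up names and scores, and the threshold `min_purchases` is a hypothetical value you would choose yourself.

```python
import pandas as pd

# Made-up toy data: a per-purchase score for three hypothetical customers.
toy = pd.DataFrame(
    {
        "Customer Name": ["Ann", "Ann", "Ann", "Bob", "Cy", "Cy"],
        "Cat Difference": [-60.0, 0.0, -30.0, 65.0, -25.0, 100.0],
    }
)

# Compute the average score and the number of purchases per customer.
per_customer = toy.groupby("Customer Name")["Cat Difference"].agg(["mean", "count"])

# Hypothetical threshold: keep only customers with at least this many purchases.
min_purchases = 2
frequent = per_customer[per_customer["count"] >= min_purchases]

# Sort by the average and keep the top 10.
top = frequent.sort_values("mean", ascending=False).head(10)

print(top)
```

Without the filter, a customer with a single large purchase (Bob, in this toy data) would top the ranking.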

In [None]:
# your answer here