<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## DataFrame Manipulation Lab

_Authors: Kiefer Katovich (SF)_

---

This lab is intended to cover a variety of skills for data manipulation in `pandas` using a challenging data set from Chipotle.

In addition to practicing Python function writing, you will be implementing multiple `pandas` EDA skills, including:
- Data cleaning
- Grouping
- Data summarization and aggregation
- [`Pandas` split-apply-combine pattern](http://pandas.pydata.org/pandas-docs/stable/groupby.html)
- Basic plotting


In [None]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

#### 1. Load the `chipotle.tsv` and examine the data.

In [None]:
chip_file = './datasets/chipotle.tsv'

Chipotle's data is contained in a `.tsv` file, which stands for "tab-separated values." This is just like a `.csv`, but the cells are separated by tabs instead of commas. `read_csv` has an argument called `delimiter` (an alias for `sep`) where you can specify the string that separates the cells.
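As a minimal sketch of the `delimiter` argument, the snippet below reads tab-separated text from an in-memory string rather than from `chipotle.tsv`. The two sample rows and the name `tsv_text` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a tab-separated file on disk.
tsv_text = (
    "order_id\titem_name\tprice\n"
    "1\tIzze\t$3.39\n"
    "1\tChicken Bowl\t$10.98\n"
)

# '\t' tells `read_csv` that cells are separated by tab characters.
sample = pd.read_csv(io.StringIO(tsv_text), delimiter='\t')
print(sample.head())
```

The same call should work with a file path in place of the `StringIO` object.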

In [None]:
# A:

_Chipotle's order data is messy. The column containing ingredients is a list of lists, which will need to be dealt with. Plus, the data set will require us to work with both long- and wide-format data._

#### 2. Create a `sub_id` for each `order_id`.

We have an identifier for each order already in `order_id`, but there is no unique identifier for each _`sub_order`_ within the overall order.

Use grouping and the `.apply()` function to assign `sub_id`s for orders.
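One possible shape for this, sketched on a made-up five-row frame (`toy` and `number_rows` are hypothetical names, not part of the lab data): group by `order_id` and have the applied function number the rows inside each group. Passing `group_keys=False` is an assumption made here so the result keeps the original row index and can be assigned straight back.

```python
import pandas as pd

# Hypothetical toy orders: order 1 has two rows, order 2 has three.
toy = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 2],
    'item_name': ['Izze', 'Chicken Bowl', 'Chips', 'Steak Taco', 'Izze'],
})

def number_rows(group):
    # Number the rows 1..n within a single order, keeping the original index.
    return pd.Series(range(1, len(group) + 1), index=group.index)

# group_keys=False keeps the original index so the assignment aligns row by row.
toy['sub_id'] = (
    toy.groupby('order_id', group_keys=False)['item_name'].apply(number_rows)
)
print(toy)
```

For comparison, `toy.groupby('order_id').cumcount() + 1` gives the same numbering without a custom function.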

In [None]:
# A:

#### 3. Clean up the `price` column.

We want the `price` column to be a numeric float value. Currently it is a string (and contains a dollar sign).
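A small sketch of the string-to-float conversion on a hypothetical Series (the values and the name `prices` are illustrative only): remove the dollar sign, trim whitespace, then cast.

```python
import pandas as pd

# Hypothetical price strings with a dollar sign and trailing whitespace.
prices = pd.Series(['$2.39 ', '$10.98 ', '$1.09 '])

# regex=False treats '$' as a literal character rather than a regex anchor.
cleaned = prices.str.replace('$', '', regex=False).str.strip().astype(float)
print(cleaned.dtype)
```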

In [None]:
# A:

#### 4. Make a new categorical column for broader item types.

Currently, we have many different item types. Make a new column that only has five different broad item types. You should include these types in the new column in your DataFrame. Categories should include:

    chips
    drink
    burrito
    taco
    salad
    
(Put `bowl` items into the `burrito` category).
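One way to sketch the mapping is a keyword check on the lowercased item name. The item names below are hypothetical, and treating everything that matches no keyword as a `drink` is an assumption of this sketch, not something the data guarantees. The order of the checks decides how ambiguous names are classified.

```python
import pandas as pd

def broad_type(item_name):
    """Map an item name to one of five broad types by keyword (a rough sketch)."""
    name = item_name.lower()
    if 'chips' in name:
        return 'chips'
    if 'burrito' in name or 'bowl' in name:
        return 'burrito'
    if 'taco' in name:
        return 'taco'
    if 'salad' in name:
        return 'salad'
    # Assumption: anything left over is a drink.
    return 'drink'

# Hypothetical item names for illustration.
items = pd.Series(['Chips and Guacamole', 'Chicken Bowl', 'Steak Soft Tacos',
                   'Veggie Salad', 'Izze'])
broad = items.map(broad_type)
print(broad.tolist())
```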

In [None]:
# A:

#### 5. Calculate the total price by `order_id` and add the results as a new column, `order_total_price`.

There are a variety of different ways you can tackle this problem. One way is to perform a grouped `.apply()` function on `price` and then merge by `order_id` with the total price.

Hints:

- Merging a DataFrame with a Series doesn't work. You need to merge two DataFrames.
- The Series returned by a `.groupby()` with an `.apply()` chained to it is indexed by the group keys, which may form a hierarchical index. Using `.reset_index()` turns the index back into columns and converts the result to a DataFrame, which can then be merged.
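The hints above can be sketched on a hypothetical three-row frame (`toy` and its values are made up): the grouped `.apply()` returns a Series indexed by `order_id`, `.reset_index()` turns it into a DataFrame, and the merge attaches the total to every row of the order.

```python
import pandas as pd

# Hypothetical prices: order 1 has two rows, order 2 has one.
toy = pd.DataFrame({'order_id': [1, 1, 2], 'price': [3.5, 10.5, 2.0]})

totals = (
    toy.groupby('order_id')['price']
       .apply(lambda prices: prices.sum())   # Series indexed by order_id
       .reset_index()                        # back to a two-column DataFrame
       .rename(columns={'price': 'order_total_price'})
)

# Merge the per-order totals back onto the row-level data.
merged = toy.merge(totals, on='order_id')
print(merged)
```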

In [None]:
# A:

#### 6. Make an `adjusted_item_price` column to account for multiple orders per row.

Some items have multiple orders per row, as indicated by `quantity`. Adjust the `price` to account for the number of orders in a new column.

In [None]:
# A:

#### 7. What is the `min`, `max`, `count`, `mean`, and `std` of `price` for each unique item in `item_name`?

A pivot table works well for this. You can pass multiple aggregation functions into the `aggfunc` argument.

`count` won't just be the number of rows for each item: there are sometimes multiple orders per row (as we can see in the `quantity` column).
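A hedged sketch of both points on hypothetical data (`toy` and its values are invented): a list passed to `aggfunc` produces one column per statistic, and the count is taken as the sum of `quantity` rather than the number of rows.

```python
import pandas as pd

# Hypothetical rows: the second Izze row covers two orders.
toy = pd.DataFrame({
    'item_name': ['Izze', 'Izze', 'Chicken Bowl', 'Chicken Bowl'],
    'price': [3.0, 4.0, 9.0, 11.0],
    'quantity': [1, 2, 1, 1],
})

price_stats = pd.pivot_table(
    toy, index='item_name', values='price',
    aggfunc=['min', 'max', 'mean', 'std'],
)

# Keep only the aggregation name from each (aggregation, 'price') column label.
price_stats.columns = [
    col[0] if isinstance(col, tuple) else col for col in price_stats.columns
]

# Count orders by summing quantity, since one row can hold several orders.
price_stats['count'] = toy.groupby('item_name')['quantity'].sum()
print(price_stats)
```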

In [None]:
# A:

#### 8. Plot the `mean` `price` of items against the `count` (popularity).

You already have this information in your summary table from the previous question.

In [None]:
# A:

#### 9. Plot the `max` `price` of items against the `count` (popularity).

In [None]:
# A:

#### 10. Calculate the `mean` of adjusted `price` per `broad_type` category.

You can handle these with a single function or in another way you prefer.

FYI, `.apply()` functions can take keyword arguments that you pass in when you call the `.apply()` function chained to the `.groupby()` function.

For example:

```python
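def my_applier(df, my_kwarg='placeholder'):
    # Use the keyword argument's value as the new column's suffix.
    df['newcol_' + my_kwarg] = 1.
    return df

# Extra keyword arguments to `.apply()` are forwarded to `my_applier`.
data = data.groupby('variable').apply(my_applier, my_kwarg='colsuffix').reset_index(drop=True)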
```


In [None]:
# A:

#### 11. Make a bar plot of your `mean` `price` by `broad_type` category.


In [None]:
# A:

#### 12. Challenge: Parse the `choice_description` column into two new columns: `order_customizations` and `order_customization_id`.

Here is what your inputs and outputs would look like for a hypothetical section of the DataFrame (I'm only showing some of the columns to give you an idea of what the output format will be):

**Input:**

```python
                                  choice_description     item_name  order_id  \
0                                       [Clementine]          Izze         1   
1  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   

   sub_order_id  
0             1  
1             2
```

**Output:**

```python
   order_customization_id order_customizations  \
0                       0           Clementine   
1                       0            Red Salsa   
2                       1          Black Beans   
3                       2            Guacamole   
4                       3           Sour Cream   

                                  choice_description     item_name  order_id  \
0                                       [Clementine]          Izze         1   
1  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   
2  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   
3  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   
4  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   

   sub_order_id  
0             1  
1             2  
2             2  
3             2  
4             2 
```

**Hints**:

- Remember, you can write your own function and pass it into `.apply()`. In this case there will be one item per group, as we have to parse every row, but you may be able to get a solution with `.iterrows()` if you want to try that out.
- Within a function you are passing into `.apply()`, you can create a _new DataFrame and return it._ This is one of the features that makes `.apply()` so powerful, as you can essentially perform any operations you want on a subset of your original DataFrame, as long as you return DataFrames/groups that can be recombined.
- Your output DataFrame will be lengthy, as there will be a row for every customization of every item. Expect there to be a lot of repeated information among rows, but they should not be entirely identical to one another.

**Note: This function may take a while to complete. `.apply()` isn't efficient when it runs a complicated custom operation on every row, as it does here.**
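As a rough sketch of the row-by-row route mentioned in the first hint, the snippet below uses `.iterrows()` on the two input rows from the example above. It assumes every `choice_description` is a bracketed string; missing values would need separate handling. The helper name `expand_choices` is hypothetical.

```python
import pandas as pd

# The two input rows from the example above.
toy = pd.DataFrame({
    'choice_description': ['[Clementine]',
                           '[Red Salsa, [Black Beans, Guacamole, Sour Cream]]'],
    'item_name': ['Izze', 'Chicken Bowl'],
    'order_id': [1, 1],
    'sub_order_id': [1, 2],
})

def expand_choices(row):
    # Assumption: the description is a string; drop brackets and split on commas.
    text = row['choice_description'].replace('[', '').replace(']', '')
    choices = [choice.strip() for choice in text.split(',')]
    # Repeat the original row once per customization.
    out = pd.DataFrame([row.to_dict()] * len(choices))
    out['order_customizations'] = choices
    out['order_customization_id'] = list(range(len(choices)))
    return out

# Build one small DataFrame per row, then stack them into the long format.
pieces = [expand_choices(row) for _, row in toy.iterrows()]
expanded = pd.concat(pieces, ignore_index=True)
print(expanded[['order_customization_id', 'order_customizations', 'sub_order_id']])
```

The same helper could be passed to a grouped `.apply()` instead, with each group holding a single row.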

In [None]:
# A: