<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DataFrame Manipulation Lab with Chipotle Data


---

This lab is intended to cover a variety of skills for data manipulation in pandas with a challenging dataset.

In addition to Python function-writing practice, you will be practicing multiple pandas EDA skills including:
- Data cleaning
- Grouping
- Summarizing and aggregating data
- [Pandas split-apply-combine pattern](http://pandas.pydata.org/pandas-docs/stable/groupby.html)
- Basic plotting

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

#### 1. Load the `chipotle.tsv` and examine the data.

In [2]:
chip_file = '../../../../resource-datasets/chipotle_orders/chipotle.tsv'

The Chipotle data is a .tsv file, which stands for "tab-separated values". It is just like a CSV, except that the cells are separated by tabs instead of commas. `read_csv` has a `delimiter` argument (an alias for `sep`) where you can specify the string that separates the cells:

In [3]:
chip = pd.read_csv(chip_file, delimiter='\t')
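
As a small illustration of the delimiter argument, the sketch below reads a tab-separated string from memory rather than the lab file. The two sample rows mimic the first rows of the dataset; `sample_tsv` and `sample` are names made up for this example.

```python
import io
import pandas as pd

# Toy stand-in for chipotle.tsv (hypothetical in-memory sample, tab-separated).
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tIzze\t[Clementine]\t$3.39\n"
    "1\t1\tChips and Fresh Tomato Salsa\t\t$2.39\n"
)

# sep='\t' tells read_csv that cells are separated by tabs;
# delimiter='\t' is an alias for the same option.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample)
```

The empty `choice_description` cell in the second row is read as a missing value, which is how the blank cells in the real file show up as well.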

In [4]:
chip.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


In [5]:
chip.dtypes

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

_The Chipotle orders data is messy: the `choice_description` column stores each order's ingredients as a string that looks like a list of lists, which will need to be parsed. Working with it is also practice in converting between long and wide format data._

#### 2. Create a sub-id for each order-id

We have an identifier for each order already in `order_id`, but no unique identifier for each _sub-order_ within the overall order.

Use grouping and `.apply()` to assign sub-ids for orders.
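
One possible way to number the rows inside each order is sketched below on toy data. It uses `groupby(...).cumcount()` instead of the `.apply()` route suggested above, so treat it as an alternative rather than the intended solution. The `orders` frame and the `sub_order_id` column name are assumptions made for the example.

```python
import pandas as pd

# Toy orders (hypothetical sample mirroring the lab's columns).
orders = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 3, 3],
    "item_name": ["Chips", "Izze", "Nantucket Nectar",
                  "Chicken Bowl", "Chicken Bowl", "Side of Chips"],
})

# cumcount() numbers the rows inside each order_id group from 0;
# adding 1 makes the sub-ids start at 1.
orders["sub_order_id"] = orders.groupby("order_id").cumcount() + 1
print(orders)
```

Unlike an id derived from the item name, this numbering stays unique even when the same item appears twice in one order.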

In [6]:
item_ids = chip.groupby('item_name').size().reset_index().reset_index()
item_ids.head()

Unnamed: 0,index,item_name,0
0,0,6 Pack Soft Drink,54
1,1,Barbacoa Bowl,66
2,2,Barbacoa Burrito,91
3,3,Barbacoa Crispy Tacos,11
4,4,Barbacoa Salad Bowl,10


In [26]:
chip_merged = pd.merge(chip, item_ids[['index','item_name']], on='item_name').sort_values('order_id')

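# Build the sub-order label from the order id and the item id, selecting columns by name
# (positional access such as x[0] on a row is fragile and deprecated in recent pandas).
chip_merged['order_suborder'] = chip_merged['order_id'].astype(str) + '_' + chip_merged['index'].astype(str)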
chip_merged.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,index,order_suborder
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,24,1_24
157,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,31,1_31
130,1,1,Nantucket Nectar,[Apple],$3.39,35,1_35
110,1,1,Izze,[Clementine],$3.39,34,1_34
188,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98,17,2_17
914,3,1,Side of Chips,,$1.69,37,3_37
189,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98,17,3_17
1383,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25,43,4_43
1015,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75,39,4_39
1438,5,1,Chips and Guacamole,,$4.45,25,5_25


#### 3. Clean up the price column 

We want the price column to be a numeric float value. Currently it is a string (and has the dollar sign in it).

In [28]:
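chip_merged2 = chip_merged.copy()
# Strip the dollar sign and cast to float; assigning the column directly lets its dtype change to float.
chip_merged2['item_price'] = chip_merged2['item_price'].str.replace('$', '', regex=False).astype(float)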

chip_merged2.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,index,order_suborder
0,1,1,Chips and Fresh Tomato Salsa,,2.39,24,1_24
157,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39,31,1_31
130,1,1,Nantucket Nectar,[Apple],3.39,35,1_35
110,1,1,Izze,[Clementine],3.39,34,1_34
188,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",16.98,17,2_17
914,3,1,Side of Chips,,1.69,37,3_37
189,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",10.98,17,3_17
1383,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",9.25,43,4_43
1015,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",11.75,39,4_39
1438,5,1,Chips and Guacamole,,4.45,25,5_25


#### 4. Make a new categorical column for broader item name

Currently we have many different item types. Make a new column that only has 5 different broad item types. You should have these types in the new column in your DataFrame:

    chips
    drink
    burrito
    taco
    salad
    
(Put the `bowl` items into the `burrito` category.)

In [32]:
chip_merged2.groupby('item_name').size()[:5]

item_name
6 Pack Soft Drink        54
Barbacoa Bowl            66
Barbacoa Burrito         91
Barbacoa Crispy Tacos    11
Barbacoa Salad Bowl      10
dtype: int64

In [41]:
def category_lookup(x):
    # Return every broad category whose name appears in the item name.
    # Note: bowls and most drinks match none of these keywords, so they come back as [].
    categories = ['chips', 'drink', 'burrito', 'taco', 'salad']
    return [cat for cat in categories if cat in x.lower()]

# First pass: a Series of matched-category lists, one per row.
chip_merged2_cat = chip_merged2['item_name'].apply(category_lookup)

chip_merged2_cat

0                                                            [chips]
157                                                          [chips]
130                                                               []
110                                                               []
188                                                               []
914                                                          [chips]
189                                                               []
1383                                                          [taco]
1015                                                       [burrito]
1438                                                         [chips]
1016                                                       [burrito]
1964                                                          [taco]
1917                                                          [taco]
1439                                                         [chips]
190                               
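
The first pass above leaves bowls and drinks such as Izze with an empty list. A possible refinement is sketched below: ordered keyword rules that return a single category per item. The `RULES` table, the `broad_category` helper and the fallback to `drink` are assumptions made for illustration; the fallback in particular should be checked against the full list of item names.

```python
import pandas as pd

# Ordered (keyword, category) rules: the first match wins, so 'salad' is
# tested before 'bowl' and "Barbacoa Salad Bowl" lands in 'salad'.
RULES = [
    ("chips", "chips"),
    ("salad", "salad"),
    ("taco", "taco"),
    ("burrito", "burrito"),
    ("bowl", "burrito"),  # bowls go into the burrito category
]

def broad_category(item_name, default="drink"):
    """Return a single broad category for an item name."""
    name = item_name.lower()
    for keyword, category in RULES:
        if keyword in name:
            return category
    # Assumption: anything that matches no keyword is a drink.
    return default

items = pd.Series(["Chips and Guacamole", "Izze", "Chicken Bowl",
                   "Steak Soft Tacos", "Barbacoa Salad Bowl", "Steak Burrito"])
categories = items.apply(broad_category)
print(categories.tolist())
```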

#### 5. Calculate the total price by `order_id` and add it as a new column `order_total_price`.

There are a variety of different ways you can tackle this problem. One way is a grouped apply on the price and then a merge by `order_id` with the total price.

Hints:

- `merge` works on DataFrames (or named Series). An unnamed Series cannot be merged, so the simplest route is to merge two DataFrames.
- A Series coming out of a groupby with an apply will have the group keys as its index (a hierarchical index if you grouped by several columns). Using `reset_index()` will turn these back into columns and also convert the result to a DataFrame, which can then be merged.
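
The grouped-sum-then-merge route from the hints could look roughly like this on toy data. It assumes `item_price` has already been converted to a float; the `orders` and `totals` names are made up for the example.

```python
import pandas as pd

# Toy orders with numeric prices (hypothetical sample).
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "item_price": [2.39, 3.39, 16.98, 10.98, 1.69],
})

# Sum the price within each order; reset_index() turns the grouped
# Series (indexed by order_id) back into a two-column DataFrame.
totals = (
    orders.groupby("order_id")["item_price"]
    .sum()
    .reset_index()
    .rename(columns={"item_price": "order_total_price"})
)

# Merge the totals back so every row carries its order's total.
orders = orders.merge(totals, on="order_id", how="left")
print(orders)
```

A shorter alternative that skips the merge is `orders.groupby("order_id")["item_price"].transform("sum")`, which returns a Series already aligned with the original rows.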

#### 6. Make an `adjusted_item_price` column to account for multiple orders per row.

Some items have multiple orders per row, as indicated by the quantity. Adjust the price to account for the number of orders in a new column.
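
A minimal sketch of the adjustment follows. It assumes that `item_price` is the total for the row (the two Chicken Bowls at 16.98 in the data above suggest this), so the per-item price is the row price divided by the quantity. If the column turned out to hold a unit price instead, the adjustment would be a multiplication.

```python
import pandas as pd

# Toy rows (hypothetical sample using prices shown earlier in the lab).
orders = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Side of Chips"],
    "quantity": [2, 1, 1],
    "item_price": [16.98, 10.98, 1.69],
})

# Assumption: item_price is the row total, so divide by quantity
# to get the price of a single item.
orders["adjusted_item_price"] = orders["item_price"] / orders["quantity"]
print(orders)
```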

#### 7. What is the min, max, count, mean and standard deviation of `adjusted_item_price` for each unique item in `item_name`?

Group by `item_name` and use the `.agg()` method on `adjusted_item_price` with `min`, `max`, `count`, `mean` and `std`, and on `quantity` with `sum`.
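
One way to build such a summary table is pandas named aggregation, sketched here on toy data. The per-item price column from question 6 is called `adjusted_item_price` in this example, and the output column names (`price_min`, `quantity_sum`, and so on) are made up.

```python
import pandas as pd

# Toy rows (hypothetical sample with a per-item price already computed).
orders = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Side of Chips"],
    "quantity": [2, 1, 1],
    "adjusted_item_price": [8.49, 10.98, 1.69],
})

# Named aggregation: each keyword is an output column,
# each value is a (source column, aggregation) pair.
summary = orders.groupby("item_name").agg(
    price_min=("adjusted_item_price", "min"),
    price_max=("adjusted_item_price", "max"),
    price_count=("adjusted_item_price", "count"),
    price_mean=("adjusted_item_price", "mean"),
    price_std=("adjusted_item_price", "std"),
    quantity_sum=("quantity", "sum"),
)
print(summary)
```

Items that appear only once get a missing value for the standard deviation, because the sample standard deviation is undefined for a single observation.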

#### 8. Plot the mean price of items against the sum of quantity (popularity).

You have this info in your summary table from the previous question.
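
A rough sketch of such a plot, using a tiny made-up summary table in place of the real one. The column names `price_mean` and `quantity_sum` and the values are placeholders, not results from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder summary table (hypothetical values, one row per item).
summary = pd.DataFrame(
    {"price_mean": [9.74, 1.69, 3.39], "quantity_sum": [3, 1, 2]},
    index=["Chicken Bowl", "Side of Chips", "Izze"],
)

fig, ax = plt.subplots(figsize=(6, 4))

# Popularity on the x-axis, mean price on the y-axis.
ax.scatter(summary["quantity_sum"], summary["price_mean"])

# Label each point with its item name.
for name, row in summary.iterrows():
    ax.annotate(name, (row["quantity_sum"], row["price_mean"]))

ax.set_xlabel("total quantity ordered")
ax.set_ylabel("mean adjusted price ($)")
```

Swapping `price_mean` for a max-price column gives the plot asked for in the next question.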

#### 9. Plot the max price of items against the sum of quantity (popularity).

#### 10. Calculate the mean of adjusted price per broad category.

You can handle these with a single function if you want, or another way if you prefer.

Note that a function passed to `.apply()` can take keyword arguments; supply them in the `.apply()` call chained to the groupby.

For example:

```python
def my_applier(df, my_kwarg='placeholder'):
    # my_kwarg is forwarded by .apply() and used to build the new column name
    df = df.copy()
    df['newcol_' + my_kwarg] = 1.
    return df
    
data = data.groupby('variable').apply(my_applier, my_kwarg='colsuffix').reset_index(drop=True)
```
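
Applied to this question, a keyword argument could select which column to average. The sketch below runs on toy data; the `item_category` column name and the `group_mean` helper are assumptions made for the example.

```python
import pandas as pd

# Toy rows (hypothetical sample with a broad category already assigned).
orders = pd.DataFrame({
    "item_category": ["burrito", "burrito", "chips", "drink"],
    "adjusted_item_price": [8.49, 10.98, 1.69, 3.39],
})

def group_mean(group, column="adjusted_item_price"):
    """Return the mean of `column` within one group."""
    return group[column].mean()

# Keyword arguments given to .apply() are passed through to group_mean.
category_means = (
    orders.groupby("item_category")[["adjusted_item_price"]]
    .apply(group_mean, column="adjusted_item_price")
)
print(category_means)
```

For a plain mean, `orders.groupby("item_category")["adjusted_item_price"].mean()` gives the same numbers with less code; the custom function only pays off when the per-group logic is more involved.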


#### 11. Make a barplot of your price mean by the broad type category.


#### 12. [Challenge] Parse the `choice_description` column into two new columns: `order_customizations` and `order_customization_id`

Here is what your inputs and outputs would look like for a hypothetical section of the DataFrame (I'm just showing some of the columns to give you an idea of what the output format will be):

**Input:**

```python
                                  choice_description     item_name  order_id  \
0                                       [Clementine]          Izze         1   
1  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   

   sub_order_id  
0             1  
1             2
```

**Output:**

```python
   order_customization_id order_customizations  \
0                       0           Clementine   
1                       0            Red Salsa   
2                       1          Black Beans   
3                       2            Guacamole   
4                       3           Sour Cream   

                                  choice_description     item_name  order_id  \
0                                       [Clementine]          Izze         1   
1  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   
2  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   
3  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   
4  [Red Salsa, [Black Beans, Guacamole, Sour Cream]]  Chicken Bowl         1   

   sub_order_id  
0             1  
1             2  
2             2  
3             2  
4             2 
```

Hints:

- Remember you can write your own function and pass it into apply. In this case there will be one item per group, since we have to do this parsing for every row, but you may be able to get a solution with `.iterrows()` if you want to try that out.
- Within a function that you are passing into `.apply()`, you can create a _new DataFrame and return that._ This is one of the things that makes apply so powerful, since you can essentially perform any operations you want on a subset of your original DataFrame as long as you return DataFrames/groups that can be recombined.
- *Your output DataFrame will be very long, as there will be a row for every customization listed in the `choice_description` column. Expect a lot of repeated information between rows, but no two rows should be exactly identical.*

**Note: the function may take a while to complete. `apply` isn't that efficient with complicated custom operations per row like this.**
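
As one possible shape for a solution, the sketch below reproduces the hypothetical input and output shown above. It swaps the per-row `.apply()` that returns DataFrames for a string-parsing helper followed by `DataFrame.explode`, which tends to be faster. The `parse_choices` helper is an assumption, and it splits on every comma, so it would mis-split any choice whose name itself contains a comma.

```python
import pandas as pd

# The hypothetical input section from the example above.
orders = pd.DataFrame({
    "order_id": [1, 1],
    "sub_order_id": [1, 2],
    "item_name": ["Izze", "Chicken Bowl"],
    "choice_description": [
        "[Clementine]",
        "[Red Salsa, [Black Beans, Guacamole, Sour Cream]]",
    ],
})

def parse_choices(description):
    """Flatten a bracketed description string into a list of choices."""
    if pd.isna(description):
        return []  # items without customizations
    cleaned = description.replace("[", "").replace("]", "")
    return [part.strip() for part in cleaned.split(",") if part.strip()]

# One list of choices per row, then one row per choice.
long_df = (
    orders.assign(order_customizations=orders["choice_description"].apply(parse_choices))
    .explode("order_customizations")
    .reset_index(drop=True)
)

# Number the customizations within each sub-order, starting at 0.
long_df["order_customization_id"] = (
    long_df.groupby(["order_id", "sub_order_id"]).cumcount()
)
print(long_df[["order_customization_id", "order_customizations", "sub_order_id"]])
```

Rows with a missing `choice_description` are kept: `explode` turns their empty list into a single row with a missing customization.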