<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Chipotle Data

_Author: Joseph Nelson (DC)_

---

For Project 2, you will complete a series of exercises exploring [order data from Chipotle](https://github.com/TheUpshot/chipotle), courtesy of _The New York Times'_ "The Upshot."

For these exercises, you will conduct basic exploratory data analysis (Pandas not required) to understand the essentials of Chipotle's order data: how many orders are being made, the average price per order, how many different ingredients are used, etc. These allow you to practice business analysis skills while also becoming comfortable with Python.

---

## Basic Level

### Part 1: Read in the file with `csv.reader()` and store it in an object called `file_nested_list`.

Hint: This is a TSV (tab-separated value) file, and `csv.reader()` needs to be told [how to handle it](https://docs.python.org/2/library/csv.html).

In [55]:
import csv
with open('./data/chipotle.tsv', 'r') as f: # Used 'with' to close the file after accessing it
    datafile = csv.reader(f, delimiter='\t') 
    file_nested_list = list(datafile) # cast the reader object to a list



In [56]:
file_nested_list[0:5] # Checking top five lines of data

[['order_id', 'quantity', 'item_name', 'choice_description', 'item_price'],
 ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
 ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
 ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
 ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 ']]

### Part 2: Separate `file_nested_list` into the `header` and the `data`.


In [60]:
header = file_nested_list[0] # Takes the first row from file_nested_list as a list of column names and stores it in 'header'
data = file_nested_list[1:] # Takes the remaining rows and stores as a list of lists in 'data'

In [63]:
header

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

In [134]:
data[0:10]

[['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
 ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
 ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
 ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
 ['2',
  '2',
  'Chicken Bowl',
  '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]',
  '$16.98 '],
 ['3',
  '1',
  'Chicken Bowl',
  '[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$10.98 '],
 ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
 ['4',
  '1',
  'Steak Burrito',
  '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$11.75 '],
 ['4',
  '1',
  'Steak Soft Tacos',
  '[Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]]',
  '$9.25 '],
 ['5',
  '1',
  'Steak Burrito',
  '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]',
  '$9.25 ']]

In [90]:
# Now I'm going to switch over to Pandas so I don't spend forever on this exercise
import pandas as pd
orders = pd.read_table('./data/chipotle.tsv', sep='\t')
orders.head(10)

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| 5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
| 6 | 3 | 1 | Side of Chips | | $1.69 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 |
| 8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 |


---

## Intermediate Level

### Part 3: Calculate the average price of an order.

Hint: Examine the data to see if the `quantity` column is relevant to this calculation.

Hint: Think carefully about the simplest way to do this!

In [94]:
# Quantity is not relevant to this calculation as the price already reflects the single item price multiplied by the quantity

In [115]:
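# regex=True is required; without it, replace() looks for the literal string '[\$,]' and strips nothing
orders['item_price'] = orders['item_price'].replace(r'[\$,]', '', regex=True).astype(float) # Strips '$' and ',' and converts the price strings to floats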

In [116]:
order_totals = orders.groupby('order_id').item_price.sum() # Groups by order_id and sums the item prices to get the total for each order

In [117]:
order_totals.mean() # Calculates average price of order

18.81142857142869
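
Since the assignment notes that Pandas is not required, here is a minimal sketch of the same calculation using only the nested-list structure from Part 2. The `sample_data` rows are copied from the output shown earlier, and `average_order_price` is a hypothetical helper name introduced for illustration; running it on the full `data` list should follow the same logic, though that is not verified here.

```python
from collections import defaultdict

# Rows copied from the data preview above:
# [order_id, quantity, item_name, choice_description, item_price]
sample_data = [
    ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
    ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
    ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
    ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
    ['2', '2', 'Chicken Bowl',
     '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]',
     '$16.98 '],
    ['3', '1', 'Chicken Bowl',
     '[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]',
     '$10.98 '],
    ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
]


def average_order_price(rows):
    """Sum item prices per order_id, then average the per-order totals."""
    totals = defaultdict(float)
    for order_id, _quantity, _item, _choice, price in rows:
        # '$2.39 ' -> 2.39: drop surrounding whitespace, then the dollar sign
        totals[order_id] += float(price.strip().lstrip('$'))
    return sum(totals.values()) / len(totals)


sample_average = average_order_price(sample_data)
print(round(sample_average, 2))
```

The quantity column is deliberately ignored here, matching the reasoning above that `item_price` already covers the full line item.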

### Part 4: Create a list (or set) named `unique_sodas` containing all of the unique sodas and soft drinks that Chipotle sells.

Note: Just look for `'Canned Soda'` and `'Canned Soft Drink'`, and ignore other drinks like `'Izze'`.

In [132]:
# Looks in item_name for Canned Soda and Canned Soft Drink and stores in a dataframe
unique_items = orders[orders.item_name.isin(['Canned Soda', 'Canned Soft Drink'])] 
unique_sodas = unique_items.choice_description.unique() # looks at the choice_description column to find unique values
print(unique_sodas) # parenthesized form works in both Python 2 and Python 3

['[Sprite]' '[Dr. Pepper]' '[Mountain Dew]' '[Diet Dr. Pepper]'
 '[Coca Cola]' '[Diet Coke]' '[Coke]' '[Lemonade]' '[Nestea]']
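
For comparison, the same idea can be sketched without Pandas using a set comprehension over the nested list. The rows below are hypothetical and for illustration only: the drink names appear in the output above, but the order ids, quantities, and prices are made up, and `unique_sodas_sample` is a name introduced here.

```python
# Hypothetical rows for illustration only (ids, quantities, and prices are made up).
sample_rows = [
    ['101', '1', 'Canned Soda', '[Sprite]', '$1.09 '],
    ['102', '2', 'Canned Soda', '[Dr. Pepper]', '$2.18 '],
    ['103', '1', 'Canned Soft Drink', '[Coke]', '$1.25 '],
    ['104', '1', 'Canned Soft Drink', '[Sprite]', '$1.25 '],
    ['105', '1', 'Izze', '[Clementine]', '$3.39 '],  # ignored: not a canned soda
]

SODA_ITEMS = {'Canned Soda', 'Canned Soft Drink'}

# row[2] is item_name and row[3] is choice_description;
# strip('[]') removes the surrounding brackets from the drink name.
unique_sodas_sample = {
    row[3].strip('[]') for row in sample_rows if row[2] in SODA_ITEMS
}

print(sorted(unique_sodas_sample))
```

A set removes duplicates automatically, so `'[Sprite]'` appearing under both item names is counted once.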


---

## Advanced Level


### Part 5: Calculate the average number of toppings per burrito.

Note: Let's ignore the `quantity` column to simplify this task.

Hint: Think carefully about the easiest way to count the number of toppings!


In [None]:
# Need to look at the item_name column for the word "Burrito," 
# then use the list of toppings in choice_description from that row.
# This is a little complicated because the choice_description separates the salsa from the other toppings, creating nested lists

In [276]:
burritos = orders[orders.item_name.str.contains('Burrito')] # find all rows with burritos

In [376]:
toppings = list(burritos.choice_description.str.split(',')) # split the choice_description into separate countable elements

In [383]:
total_toppings = 0 # counter variable
for topping_list in toppings: # each element is the list of toppings for one burrito
    total_toppings += len(topping_list) # the length of the list is the number of toppings
avg_toppings = total_toppings / float(len(toppings)) # average number of toppings per burrito

print(avg_toppings)

5.39505119454
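
One possibly simpler way to count toppings, in the spirit of the hint, is to count commas: a description with `n` commas lists `n + 1` toppings, which gives the same number as splitting on `','`. The sketch below applies this to the two Steak Burrito descriptions from the data preview; `count_toppings` is a hypothetical helper name, and like the approach above it treats the salsa as one of the toppings.

```python
# Two burrito descriptions copied from the data preview earlier in this notebook
burrito_descriptions = [
    '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, '
    'Cheese, Sour Cream, Guacamole, Lettuce]]',
    '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]',
]


def count_toppings(description):
    """Number of comma-separated entries in a choice_description string."""
    return description.count(',') + 1


topping_counts_sample = [count_toppings(d) for d in burrito_descriptions]
avg_toppings_sample = sum(topping_counts_sample) / float(len(topping_counts_sample))

print(topping_counts_sample)
print(avg_toppings_sample)
```

This avoids building intermediate lists, at the cost of assuming that no topping name contains a comma.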


### Part 6: Create a dictionary. Let the keys represent chip orders and the values represent the total number of orders.

Expected output: `{'Chips and Roasted Chili-Corn Salsa': 18, ... }`

Note: Please take the `quantity` column into account!

Optional: Learn how to use `collections.defaultdict` to simplify your code.

In [384]:
# After playing with defaultdict and other convoluted code, I actually found it far simpler to just
# chain .sum() and .to_dict() once I realized that .groupby(...)['quantity'].sum() returns a Series.

chiporders = orders[orders.item_name.str.contains('Chips')] # create a df with only rows of various chip orders

chip_sales_dict = chiporders.groupby('item_name')['quantity'].sum().to_dict()

chip_sales_dict

{'Chips': 230,
 'Chips and Fresh Tomato Salsa': 130,
 'Chips and Guacamole': 506,
 'Chips and Mild Fresh Tomato Salsa': 1,
 'Chips and Roasted Chili Corn Salsa': 23,
 'Chips and Roasted Chili-Corn Salsa': 18,
 'Chips and Tomatillo Green Chili Salsa': 45,
 'Chips and Tomatillo Red Chili Salsa': 50,
 'Chips and Tomatillo-Green Chili Salsa': 33,
 'Chips and Tomatillo-Red Chili Salsa': 25,
 'Side of Chips': 110}
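
For readers curious about the `defaultdict` route mentioned in the hint, a minimal sketch without Pandas might look like the following. Most rows are copied from the data preview, but the last row is hypothetical and added only to show a quantity greater than 1; `chip_counts` is a name introduced here.

```python
from collections import defaultdict

sample_rows = [
    ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
    ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
    ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
    ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
    # Hypothetical row (not from the dataset) to show a quantity greater than 1
    ['999', '2', 'Side of Chips', 'NULL', '$3.38 '],
]

# defaultdict(int) starts every missing key at 0, so no "if key in dict" check is needed
chip_counts = defaultdict(int)
for _order_id, quantity, item_name, _choice, _price in sample_rows:
    if 'Chips' in item_name:
        chip_counts[item_name] += int(quantity)  # add the quantity, not just 1

print(dict(chip_counts))
```

Adding `int(quantity)` rather than `1` is what takes the `quantity` column into account.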

In [385]:
# These are the results when not taking the quantity column into account. I was able to use this to verify the quantity column
# was incorporated above.
chiporders.item_name.value_counts().to_dict()

{'Chips': 211,
 'Chips and Fresh Tomato Salsa': 110,
 'Chips and Guacamole': 479,
 'Chips and Mild Fresh Tomato Salsa': 1,
 'Chips and Roasted Chili Corn Salsa': 22,
 'Chips and Roasted Chili-Corn Salsa': 18,
 'Chips and Tomatillo Green Chili Salsa': 43,
 'Chips and Tomatillo Red Chili Salsa': 48,
 'Chips and Tomatillo-Green Chili Salsa': 31,
 'Chips and Tomatillo-Red Chili Salsa': 20,
 'Side of Chips': 101}

---

## Bonus: Craft a problem statement about this data that interests you, and then answer it!


In [346]:
# What are the top 10 most popular toppings on Burritos?

In [421]:
df = pd.DataFrame.from_records(toppings) # Converted toppings list to a dataframe because they are cool.

In [423]:
df.replace(to_replace=[r'\[', r'\]', r' '], value='', regex=True, inplace=True) # Removed brackets and spaces to clean the data
df.stack().value_counts().nlargest(10) # Stacked the DF so I can easily get counts across the entire data set and took top 10

Rice                      1063
Cheese                    960 
SourCream                 745 
Lettuce                   691 
BlackBeans                543 
Guacamole                 389 
FreshTomatoSalsa          371 
PintoBeans                284 
FajitaVegetables          226 
FreshTomatoSalsa(Mild)    162 
dtype: int64
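
As an alternative sketch, `collections.Counter` can tally toppings without a DataFrame and without removing the spaces inside topping names. It is shown here on the two Steak Burrito descriptions from the data preview only, so the counts are not comparable to the full-dataset results above; `topping_counter` is a name introduced for illustration.

```python
from collections import Counter

# Two burrito descriptions copied from the data preview earlier in this notebook
burrito_descriptions = [
    '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, '
    'Cheese, Sour Cream, Guacamole, Lettuce]]',
    '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]',
]

topping_counter = Counter()
for description in burrito_descriptions:
    # Drop the brackets, then split on commas and trim whitespace around each name
    cleaned = description.replace('[', '').replace(']', '')
    topping_counter.update(name.strip() for name in cleaned.split(','))

# most_common(n) returns the n most frequent toppings with their counts
print(topping_counter.most_common(5))
```

Trimming only the surrounding whitespace keeps names such as `Sour Cream` readable instead of collapsing them to `SourCream`.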