<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Chipotle Data

_Author: Joseph Nelson (DC)_

---

For Project 2, you will complete a series of exercises exploring [order data from Chipotle](https://github.com/TheUpshot/chipotle), courtesy of _The New York Times'_ "The Upshot."

For these exercises, you will conduct basic exploratory data analysis (pandas not required) to understand the essentials of Chipotle's order data: how many orders are being made, the average price per order, how many different ingredients are used, etc. These exercises let you practice business analysis skills while also becoming comfortable with Python.

---

## Basic Level

### Part 1: Read in the file with `csv.reader()` and store it in an object called `file_nested_list`.

Hint: This is a TSV (tab-separated value) file, and `csv.reader()` needs to be told [how to handle it](https://docs.python.org/2/library/csv.html).
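
As a rough illustration of the hint, the sketch below passes `delimiter='\t'` to `csv.reader()`. It reads from a small in-memory sample rather than the real file; `sample_tsv` and `sample_nested_list` are hypothetical names, and the sample rows only mimic the layout of `chipotle.tsv`.

```python
import csv
import io

# Hypothetical three-line sample in the same tab-separated layout as chipotle.tsv
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tChips and Fresh Tomato Salsa\tNULL\t$2.39\n"
    "1\t1\tIzze\t[Clementine]\t$3.39\n"
)

# delimiter='\t' tells csv.reader to split each line on tabs instead of commas
reader = csv.reader(io.StringIO(sample_tsv), delimiter='\t')

# Each row comes back as a list of strings, so the result is a list of lists
sample_nested_list = [row for row in reader]

print(sample_nested_list[0])     # the header row
print(len(sample_nested_list))   # the header plus two data rows
```

With the real file, the same call would wrap a file object opened on the TSV path instead of `io.StringIO`.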

In [1]:
import csv
from collections import namedtuple   # Convenient to store the data rows
import pandas as pd

DATA_FILE = './data/chipotle.tsv'

In [2]:
# Read the rows with csv.reader and build the DataFrame one row at a time
hasheader = False
created_df = False
with open(DATA_FILE) as tsvfile:
    # Load the TSV file, splitting each line on tabs
    tsv_reader = csv.reader(tsvfile, delimiter='\t')
    for row in tsv_reader:
        # The first row is the header; every later row becomes a one-row df
        if not hasheader:
            the_header = row
            hasheader = True
        elif not created_df:
            file_nested_list = pd.DataFrame([row], columns=the_header)
            created_df = True
        else:
            new_df = pd.DataFrame([row], columns=the_header)
            # Stack the new one-row df under the existing rows (each row keeps index 0)
            file_nested_list = pd.concat([file_nested_list, new_df])
file_nested_list.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
0,1,1,Izze,[Clementine],$3.39
0,1,1,Nantucket Nectar,[Apple],$3.39
0,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
0,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [3]:
# More efficient way to load data
import pandas as pd
chipotle = pd.read_csv(DATA_FILE, sep='\t')

### Part 2: Separate `file_nested_list` into the `header` and the `data`.


In [4]:
header = file_nested_list.columns
print(header)

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')


In [5]:
just_the_data = file_nested_list.to_string(header=False)
print(just_the_data)

0     1   1           Chips and Fresh Tomato Salsa                                               NULL   $2.39 
0     1   1                                   Izze                                       [Clementine]   $3.39 
0     1   1                       Nantucket Nectar                                            [Apple]   $3.39 
0     1   1  Chips and Tomatillo-Green Chili Salsa                                               NULL   $2.39 
0     2   2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98 
0     3   1                           Chicken Bowl  [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...  $10.98 
0     3   1                          Side of Chips                                               NULL   $1.69 
0     4   1                          Steak Burrito  [Tomatillo Red Chili Salsa, [Fajita Vegetables...  $11.75 
0     4   1                       Steak Soft Tacos  [Tomatillo Green Chili Salsa, [Pinto Beans, Ch...   $9.25 
0
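
For comparison, if the file is kept as a plain nested list (as the exercise wording suggests), the header and data can be separated by slicing. This is a minimal sketch on a hypothetical three-row list; `sample_nested_list`, `sample_header`, and `sample_data` are illustrative names, not part of the solution above.

```python
# Hypothetical nested list in the shape csv.reader would produce
sample_nested_list = [
    ['order_id', 'quantity', 'item_name', 'choice_description', 'item_price'],
    ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39'],
    ['1', '1', 'Izze', '[Clementine]', '$3.39'],
]

sample_header = sample_nested_list[0]   # the first row holds the column names
sample_data = sample_nested_list[1:]    # every remaining row is one order line

print(sample_header)
print(len(sample_data))
```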

---

## Intermediate Level

### Part 3: Calculate the average price of an order.

Hint: Examine the data to see if the `quantity` column is relevant to this calculation.

Hint: Think carefully about the simplest way to do this!

In [6]:
# Take the $ signs out of the item price so it can be treated as a numeric
chipotle['item_price'] = chipotle['item_price'].replace(r'[\$,]', '', regex=True).astype(float)


In [7]:
# Groupby order_id so I can get a sum of the price of each item in the order
orders = chipotle.groupby(['order_id'])['item_price'].sum().reset_index() 
orders.head()

Unnamed: 0,order_id,item_price
0,1,11.56
1,2,16.98
2,3,12.67
3,4,21.0
4,5,13.7


In [8]:
# Average price is the sum of all of the orders divided by the count of the orders
avg_price = orders['item_price'].sum() / orders['order_id'].count()
print("The average order price is ${}".format(round(avg_price,2)))

The average order price is $18.81
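
The same calculation can be sketched without pandas. The example below assumes `item_price` is already the line total (so `quantity` is not multiplied in) and uses only the first three orders shown above, so its result differs from the full-data average. The names `sample_rows`, `order_totals`, and `sample_avg` are hypothetical.

```python
from collections import defaultdict

# (order_id, item_price) pairs for orders 1-3, copied from the rows shown above
sample_rows = [
    ('1', '$2.39'), ('1', '$3.39'), ('1', '$3.39'), ('1', '$2.39'),
    ('2', '$16.98'),
    ('3', '$10.98'), ('3', '$1.69'),
]

# Sum the price of every line item within each order
order_totals = defaultdict(float)
for order_id, item_price in sample_rows:
    order_totals[order_id] += float(item_price.strip().lstrip('$'))

# Average order price = total revenue / number of distinct orders
sample_avg = sum(order_totals.values()) / len(order_totals)
print("Average over the sample orders: ${:.2f}".format(sample_avg))
```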


### Part 4: Create a list (or set) named `unique_sodas` containing all of the unique sodas and soft drinks that Chipotle sells.

Note: Just look for `'Canned Soda'` and `'Canned Soft Drink'`, and ignore other drinks like `'Izze'`.

In [9]:
# Grab just the item name and description columns then select only the rows that are sodas
sodas = chipotle[['item_name','choice_description']]
sodas = sodas[(sodas['item_name'] == 'Canned Soda') | (sodas['item_name'] == 'Canned Soft Drink')]

# Select only the unique soda choices
unique_sodas = sodas['choice_description'].unique()
print(unique_sodas)

['[Sprite]' '[Dr. Pepper]' '[Mountain Dew]' '[Diet Dr. Pepper]'
 '[Coca Cola]' '[Diet Coke]' '[Coke]' '[Lemonade]' '[Nestea]']
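
A set comprehension is one pandas-free way to get the same kind of result. The sketch below runs on hypothetical `(item_name, choice_description)` pairs, not the real data; `sample_items`, `soda_names`, and `sample_unique_sodas` are illustrative names.

```python
# Hypothetical (item_name, choice_description) pairs in the dataset's format
sample_items = [
    ('Canned Soda', '[Sprite]'),
    ('Izze', '[Clementine]'),
    ('Canned Soft Drink', '[Coke]'),
    ('Canned Soda', '[Sprite]'),
    ('Canned Soda', '[Dr. Pepper]'),
]

soda_names = {'Canned Soda', 'Canned Soft Drink'}

# A set keeps each choice once; strip('[]') drops the surrounding brackets
sample_unique_sodas = {
    choice.strip('[]') for item, choice in sample_items if item in soda_names
}

print(sorted(sample_unique_sodas))
```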


---

## Advanced Level


### Part 5: Calculate the average number of toppings per burrito.

Note: Let's ignore the `quantity` column to simplify this task.

Hint: Think carefully about the easiest way to count the number of toppings!


In [10]:
# Select all items that end in burrito; .copy() gives an independent df so a column can be added safely
burritos = chipotle[chipotle['item_name'].str[-7:] == 'Burrito'].copy()

# The number of toppings is equal to the number of commas plus 1
burritos['count'] = burritos['choice_description'].str.count(",") + 1

print("The average number of toppings per burrito is {}".format(burritos['count'].mean()))
burritos.head(10)

The average number of toppings per burrito is 5.395051194539249


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,count
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",11.75,8
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",9.25,7
16,8,1,Chicken Burrito,"[Tomatillo-Green Chili Salsa (Medium), [Pinto ...",8.49,4
17,9,1,Chicken Burrito,"[Fresh Tomato Salsa (Mild), [Black Beans, Rice...",8.49,6
21,11,1,Barbacoa Burrito,"[[Fresh Tomato Salsa (Mild), Tomatillo-Green C...",8.99,7
23,12,1,Chicken Burrito,"[[Tomatillo-Green Chili Salsa (Medium), Tomati...",10.98,8
27,14,1,Carnitas Burrito,"[[Tomatillo-Green Chili Salsa (Medium), Roaste...",8.99,6
29,15,1,Chicken Burrito,"[Tomatillo-Green Chili Salsa (Medium), [Pinto ...",8.49,5
31,16,1,Steak Burrito,"[[Roasted Chili Corn Salsa (Medium), Fresh Tom...",8.99,5
43,20,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Pinto Beans, Chees...",11.75,7
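
To make the comma-counting idea concrete, here is a small worked example on a single hypothetical `choice_description` string. It also shows an alternative that strips the brackets and splits on commas, which yields the topping names as well as the count. The names `description`, `by_commas`, and `toppings` are illustrative.

```python
# Hypothetical choice_description string in the dataset's bracketed format
description = "[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream]]"

# Four commas separate five toppings, so the count is commas + 1
by_commas = description.count(',') + 1

# Alternative: remove the brackets, split on commas, and trim the whitespace
toppings = [
    t.strip()
    for t in description.replace('[', '').replace(']', '').split(',')
]

print(by_commas)
print(toppings)
```

Both approaches agree as long as no topping name itself contains a comma.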


### Part 6: Create a dictionary. Let the keys represent chip orders and the values represent the total number of orders.

Expected output: `{'Chips and Roasted Chili-Corn Salsa': 18, ... }`

Note: Please take the `quantity` column into account!

Optional: Learn how to use `defaultdict` from the `collections` module to simplify your code.

In [11]:
# Select all the orders and limit it to the item name and quantity
all_orders = chipotle[['item_name','quantity']]

# Select the orders that have the word chip in them
chip_orders = all_orders[all_orders['item_name'].str.contains('chip', case=False, regex=True)]
chip_orders.head()

Unnamed: 0,item_name,quantity
0,Chips and Fresh Tomato Salsa,1
3,Chips and Tomatillo-Green Chili Salsa,1
6,Side of Chips,1
10,Chips and Guacamole,1
14,Chips and Guacamole,1


In [12]:
# Group by item_name and sum the quantity
grouped = chip_orders.groupby('item_name')['quantity'].sum()

# Convert the grouped Series (item_name -> total quantity) into a dict
chip_dict = grouped.to_dict()
print(chip_dict)

{'Chips': 230, 'Chips and Fresh Tomato Salsa': 130, 'Chips and Guacamole': 506, 'Chips and Mild Fresh Tomato Salsa': 1, 'Chips and Roasted Chili Corn Salsa': 23, 'Chips and Roasted Chili-Corn Salsa': 18, 'Chips and Tomatillo Green Chili Salsa': 45, 'Chips and Tomatillo Red Chili Salsa': 50, 'Chips and Tomatillo-Green Chili Salsa': 33, 'Chips and Tomatillo-Red Chili Salsa': 25, 'Side of Chips': 110}
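
The optional `defaultdict` route might look like the sketch below. It uses hypothetical `(item_name, quantity)` pairs rather than the real data, and `sample_orders` and `sample_chip_counts` are illustrative names.

```python
from collections import defaultdict

# Hypothetical (item_name, quantity) pairs; the real values come from the data
sample_orders = [
    ('Chips and Guacamole', 1),
    ('Side of Chips', 2),
    ('Chicken Bowl', 1),
    ('Chips and Guacamole', 3),
]

# defaultdict(int) starts every missing key at 0, so no existence check is needed
sample_chip_counts = defaultdict(int)
for item_name, quantity in sample_orders:
    if 'chip' in item_name.lower():
        sample_chip_counts[item_name] += quantity

print(dict(sample_chip_counts))
```

Compared with a plain `dict`, this avoids writing an `if key in d` branch before each increment.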


---

## Bonus: Craft a problem statement about this data that interests you, and then answer it!


In [13]:
# Problem Statement: Determine the probability of each topping being put on a burrito

# Select only the burritos
burritos = chipotle[chipotle['item_name'].str[-7:] == 'Burrito']

# reset the index so I can iterate the df and get the number of burritos by summing the quantity
burritos = burritos.reset_index()
number_of_burritos = burritos['quantity'].sum()

# Function that takes a string of toppings and the number of burritos ordered and returns a df with one row per topping
def get_toppings_df(toppings, num):
    quantity = num
    # Calculate the number of toppings in the string for the for loop
    toppings_num = toppings.count(",") + 1
    toppings_string = toppings
    # Collect one record per topping; the df is built once at the end
    records = []
    # The for loop uses the commas via the find function to separate the toppings.
    # Note: range(1, toppings_num) runs once per comma, so the text left after the
    # last comma (the final topping in the string) is not recorded.
    for i in range(1, toppings_num):
        comma = toppings_string.find(',')
        topping = toppings_string[0:comma]
        toppings_string = toppings_string[comma+2:]
        records.append({'toppings': topping, 'quantity': quantity})
    # Return the df
    return pd.DataFrame(records, columns=['toppings', 'quantity'])


# Create an empty list to hold the per-burrito dfs
toppings_frames = []

# For loop removes all the brackets so it looks pretty and calls the function to get the toppings split
for index, row in burritos.iterrows():
    # Remove the brackets
    semi_processed = row['choice_description'].replace(']', '')
    fully_processed = semi_processed.replace('[', '')
    # Get the number of orders for that burrito
    quantity = row['quantity']
    # Call the function and get a df with one row per topping in the string
    new_rows = get_toppings_df(fully_processed, quantity)
    # Add the df to the list
    toppings_frames.append(new_rows)

# Combine the per-burrito dfs into the total df
all_the_toppings = pd.concat(toppings_frames)

# Calculate the sum of each topping
topping_count = all_the_toppings.groupby('toppings')['quantity'].sum().reset_index()

# Sort the toppings by most popular to least
topping_count = topping_count.sort_values(by='quantity', ascending=False)

# Get the probability of each topping
topping_probability = topping_count
topping_probability['probability'] = topping_probability['quantity'] / number_of_burritos

# For loop to iterate the df and print the probability with a nice format
for index, row in topping_probability.iterrows():
    print(" There is a {}% probability that {} will be in a burrito".format(round(row['probability']*1000)/10, row['toppings']))



 There is a 86.1% probability that Rice will be in a burrito
 There is a 70.2% probability that Cheese will be in a burrito
 There is a 47.3% probability that Sour Cream will be in a burrito
 There is a 45.8% probability that Black Beans will be in a burrito
 There is a 32.7% probability that Fresh Tomato Salsa will be in a burrito
 There is a 23.3% probability that Pinto Beans will be in a burrito
 There is a 19.2% probability that Guacamole will be in a burrito
 There is a 19.0% probability that Fajita Vegetables will be in a burrito
 There is a 13.4% probability that Fresh Tomato Salsa (Mild) will be in a burrito
 There is a 12.3% probability that Roasted Chili Corn Salsa will be in a burrito
 There is a 12.2% probability that Tomatillo Red Chili Salsa will be in a burrito
 There is a 11.5% probability that Fajita Veggies will be in a burrito
 There is a 10.8% probability that Roasted Chili Corn Salsa (Medium) will be in a burrito
 There is a 9.4% probability that Tomatillo-Red Chil