## Lab 02 - Pandas and Data Visualization
This notebook will introduce you to the basics of Pandas and Data Visualization. You will learn how to load data into a Pandas DataFrame, how to perform basic data analysis, and how to visualize data. The first part of this notebook will be an interactive tutorial, and the second part will be practice exercises for you to do! **Note that the practice problems will be checked when submitted!**

#### Pandas
Pandas is a popular open-source Python library that provides data structures and analysis tools for structured data, simplifying data manipulation, analysis, and exploration. Some of its uses:
* Tabular Data Handling
* Data Cleaning and Transformation
* Data Exploration
* Data Import/Export
* Data Visualization

#### Matplotlib
Matplotlib is a widely used Python library for creating static, animated, and interactive visualizations. Whether you are conducting data analysis, scientific research, or data communication, Matplotlib helps you present your findings effectively and intuitively.

### Pre-requisites

In [4]:
# Run this cell if pandas is not installed yet

%pip install pandas




### Overview
In this notebook, you will be learning how to use the Pandas library by working with the `cookies.csv` file. 

#### `cookies.csv` file:

The `cookies.csv` file contains information about cookies that were made in Rico's Bisquito's factory. 

The columns are the following:

* `cookie`: the name of the cookie
* `ingredients`: a list of the cookie's ingredients
* `calories`: the number of calories in the cookie
* `radius`: the radius of the cookie, in cm
* `cost_to_make`: the cost of making the cookie, in dollars

### Reading the CSV file

First, we need to import the Pandas library. We will be using the `pd` alias for the Pandas library.

In [5]:
#TODO: import pandas and matplotlib in this cell
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the plotting interface conventionally aliased as plt

We will now look at the `cookies.csv` file. We will use the `pd.read_csv()` function to read in the CSV file. We will store the data in a variable called `df`.

In [6]:
#TODO: read the cookies.csv file into a pandas dataframe
df = pd.read_csv('cookies.csv')

#### Dataframe
Dataframes are a data structure that Pandas uses to store data. Dataframes are similar to tables in a database. Dataframes have rows and columns. Each row represents a single data point, and each column represents a feature of the data point.

We will then make sure we imported the data correctly by printing out the first 10 rows of the data, using the `head()` function.

In [7]:
#TODO: print the head of the dataframe
df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04
3,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
4,bad berry bunny,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39
5,orange ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74


### Checking data types
You can check the data types of each column using the `dtypes` attribute of the DataFrame.

In [8]:
#TODO: check the data types of the columns
df.dtypes

cookie           object
ingredients      object
calories          int64
radius          float64
cost_to_make    float64
dtype: object

Now, let's use the `info()` function to get more information about the DataFrame, including the number of non-null values in each column.

In [9]:
# TODO: use info() to get information about datatypes and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cookie        129 non-null    object 
 1   ingredients   129 non-null    object 
 2   calories      129 non-null    int64  
 3   radius        129 non-null    float64
 4   cost_to_make  114 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 5.2+ KB


### Cleaning the data
Now that we have the data, we need to clean it. For example, some `cost_to_make` fields of some created cookies are missing. To resolve this, we can do many things: we can replace the missing data with the mean of the column, or we can get rid of the row entirely if the `cost_to_make` field is not set. 

In [10]:
#TODO: clean the dataframe and 
# print the head again to make sure 
# the changes took effect
df.info()
df['cost_to_make'].fillna(df['cost_to_make'].mean)
df.info()
df.head(10)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cookie        129 non-null    object 
 1   ingredients   129 non-null    object 
 2   calories      129 non-null    int64  
 3   radius        129 non-null    float64
 4   cost_to_make  114 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 5.2+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cookie        129 non-null    object 
 1   ingredients   129 non-null    object 
 2   calories      129 non-null    int64  
 3   radius        129 non-null    float64
 4   cost_to_make  114 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 5.2+ KB


Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04
3,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
4,bad berry bunny,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39
5,orange ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74


To make sure the null values are gone, we can count the null values in each column by chaining the `isnull()` and `sum()` functions. A clean column shows a count of 0.

In [11]:
#TODO: use the isnull method to make sure your data is clean
df.isnull().sum()

cookie           0
ingredients      0
calories         0
radius           0
cost_to_make    15
dtype: int64
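
If the count for `cost_to_make` is still non-zero at this point, the fill did not take effect. Two common causes: `fillna()` returns a new Series rather than changing the DataFrame in place, so its result has to be assigned back, and `mean` is a method that has to be called (`mean()`). The sketch below shows both cleaning options mentioned earlier on a small made-up DataFrame; `toy`, `filled`, and `dropped` are illustrative names, not part of the lab.

```python
import pandas as pd

# Made-up data standing in for cookies.csv (values are illustrative only)
toy = pd.DataFrame({
    "cookie": ["a", "b", "c", "d"],
    "cost_to_make": [1.0, None, 2.0, None],
})

# Option 1: replace missing costs with the column mean.
# mean() is called, and the filled Series is assigned back to the column.
filled = toy.copy()
filled["cost_to_make"] = filled["cost_to_make"].fillna(filled["cost_to_make"].mean())

# Option 2: drop every row whose cost is missing.
dropped = toy.dropna(subset=["cost_to_make"])

print(filled["cost_to_make"].isnull().sum())  # number of nulls left after filling
print(len(dropped))                           # number of rows left after dropping
```

Filling keeps every row but pulls the filled values toward the average; dropping keeps only measured values but shrinks the dataset. Which one is appropriate depends on the analysis.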

Next, let's check for duplicate rows using the `duplicated()` function. Then, remove those rows using the `drop_duplicates()` function.

In [12]:
# TODO: check for duplicate rows
# then delete those rows from df
#df.duplicated()
duplicated_row = df[df.duplicated()]
df = df.drop_duplicates()

df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04
4,bad berry bunny,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39
5,orange ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74
10,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",216,3.848,1.28


### Parsing the data
Now that we have the data, we could parse it to get the information we want. For example, we can check what types of cookies were made by using the `unique()` function on the `cookie` column.

In [13]:
#TODO: see what cookies are in the dataset
df['cookie'].unique()

array(['laddoo lemon', 'red velvet rauw', 'nevadito', 'bad berry bunny',
       'orange ozuna', 'minty miami', 'chocolate'], dtype=object)

We can also check how many cookies of each type were made by using the `value_counts()` function on the `cookie` column.

In [14]:
#TODO: use value_counts() to see how many 
# cookies of each type there are
df['cookie'].value_counts()

red velvet rauw    18
orange ozuna       17
bad berry bunny    17
minty miami        17
laddoo lemon       17
nevadito           17
chocolate          15
Name: cookie, dtype: int64

Or maybe we don't like how long the names of the cookies are, so we can shorten them by using the `replace()` function on the `cookie` column.

For example, let's try changing `"bad berry bunny"` to `"bbb"`.

In [15]:
#TODO: change bad berry bunny data elements to "bbb"
df['cookie'] = df['cookie'].replace("bad berry bunny", "bbb")

df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04
4,bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39
5,orange ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74
10,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",216,3.848,1.28


We may even like the original names better, but we may want to get rid of the spaces. For example, we can change `"orange ozuna"` to `"orange_ozuna"`. Here, we will use the `str.replace()` function.

In [16]:
#TODO: adjust orange ozuna as described
df['cookie'] = df['cookie'].str.replace("orange ozuna", "orange_ozuna")

df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04
4,bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39
5,orange_ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74
10,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",216,3.848,1.28


We may even just want to keep the first word of the cookie name. For example, we can change `"orange_ozuna"` to `"orange"`.

In [17]:
#TODO: adjust all cookies so only the first word
# is used as the cookie name
df['cookie'] = df['cookie'].str.split("_").str[0]

df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04
4,bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39
5,orange,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74
10,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",216,3.848,1.28
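
In the output above, only `orange_ozuna` was shortened, because splitting on `"_"` has no effect on names that still contain spaces. A minimal sketch of one way to keep the first word regardless of separator, assuming names use either a space or an underscore between words (the `names` Series is made up for illustration):

```python
import pandas as pd

# Illustrative names mixing both separators
names = pd.Series(["laddoo lemon", "orange_ozuna", "red velvet rauw", "chocolate"])

# Normalize underscores to spaces, split on whitespace, keep the first piece
first_word = (
    names.str.replace("_", " ", regex=False)
         .str.split()
         .str[0]
)

print(first_word.tolist())
```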


Another thing that may come to mind is that maybe getting flour could cost more money due to inflation, so we have to adjust our `cost_to_make` values. We can do this by using the `apply()` function on the `cost_to_make` column.

In [18]:
#Don't edit this method
def adjust_cost(cost):
    return cost + 0.5

#print(adjust_cost(df['cost_to_make']))
#TODO: use apply() to adjust the cost_to_make column.

df['cost_to_make'] = df['cost_to_make'].apply(adjust_cost)
df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,1.17
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.57
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.54
4,bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.89
5,orange,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.65
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,1.34
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.67
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,1.24
10,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",216,3.848,1.78


And we can do a lot more! We will see these concepts used in the next homework assignment, along with a couple of new ones to show you how powerful Pandas is.

### More complicated operations: Grouping, Filtering, Aggregating

Before trying out these more complicated operations, let's first sort the DataFrame by the radius of the cookies using the `sort_values()` function.

In [19]:
# TODO: sort the df using sort_values(by='Column'); ascending order is the default, pass ascending=False to reverse it
df = df.sort_values(by='radius')

df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
78,orange,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",166,1.695,1.32
32,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",178,2.952,1.34
22,bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",184,2.982,2.51
70,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",164,3.05,
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,1.17
93,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",198,3.128,1.09
68,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",168,3.132,1.13
37,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",193,3.172,1.47
63,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",189,3.179,1.22
17,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",209,3.182,2.03


We may also want to group data by certain attributes. This can be done by using `groupby()`. This method takes in a column name, and groups the data by the values in that column. For example, we can group the data by the `cookie` column.

In [20]:
#TODO: group by cookie type

cookie = df.groupby("cookie")
cookie.first()

#df1.head(10)

Unnamed: 0_level_0,ingredients,calories,radius,cost_to_make
cookie,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",184,2.982,2.51
chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",209,3.182,2.03
laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",164,3.05,1.17
minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",178,2.952,1.34
nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",208,3.583,1.92
orange,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",166,1.695,1.32
red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",193,3.172,1.47
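
`first()` only shows one row per group. Grouping is usually followed by an aggregation that summarizes each group. The sketch below uses named aggregation on made-up data; the column names `n`, `mean_calories`, and `max_radius` are illustrative choices.

```python
import pandas as pd

# Made-up data with the same column names as the lab's DataFrame
toy = pd.DataFrame({
    "cookie": ["chocolate", "chocolate", "orange", "orange", "orange"],
    "calories": [200, 220, 160, 170, 180],
    "radius": [4.0, 3.0, 3.5, 3.0, 2.5],
})

# One row per cookie type, with a count, a mean, and a maximum
summary = toy.groupby("cookie").agg(
    n=("calories", "size"),
    mean_calories=("calories", "mean"),
    max_radius=("radius", "max"),
)

print(summary)
```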


We can also group by multiple columns. For example, we can group the data by the `cookie` and `ingredients` columns.

In [21]:
#TODO: group by cookie type and ingredients
df.groupby(['cookie', 'ingredients']).head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
78,orange,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",166,1.695,1.32
32,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",178,2.952,1.34
22,bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",184,2.982,2.51
70,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",164,3.050,
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,1.17
...,...,...,...,...,...
82,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",253,4.043,1.62
121,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",227,4.085,1.92
58,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",203,4.112,1.41
107,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",222,4.186,1.36


We may also want to filter the data. For example, we can filter the data to only show the cookies that have a radius greater than 4.3 cm. We can do this by indexing the DataFrame with a boolean expression.

In [22]:
#TODO: filter using the boolean expression

df1 = df[df['radius'] > 4.3]

df1.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make
100,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",185,4.307,
76,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",206,4.319,1.74
62,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",206,4.328,1.84
29,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",219,4.346,1.68
47,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",174,4.388,1.14
88,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",167,4.401,1.74
72,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",196,4.425,1.37
80,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",197,4.461,2.17
87,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",199,4.474,1.64
97,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",191,4.475,1.74


We may even want to combine filtering with `groupby()`! For example, we can filter the data to only show the cookies that have a radius greater than 4.3 cm, and then group the result by the `cookie` column.

In [23]:
#TODO: filter the data using the boolean expression
# then group by cookie column
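
One possible shape for this step, sketched on made-up data rather than the lab's DataFrame: apply the boolean mask first, then group what is left. Taking the mean of `calories` per group is an assumption made here only so that the grouped result has something to display.

```python
import pandas as pd

# Made-up data with the same column names as the lab's DataFrame
toy = pd.DataFrame({
    "cookie": ["chocolate", "chocolate", "nevadito", "nevadito", "orange"],
    "radius": [4.5, 3.9, 4.4, 4.6, 3.2],
    "calories": [200, 210, 220, 230, 160],
})

# Step 1: keep only rows with radius > 4.3
big = toy[toy["radius"] > 4.3]

# Step 2: group the filtered rows and summarize each group
per_cookie = big.groupby("cookie")["calories"].mean()

print(per_cookie)
```

Cookie types with no rows above the threshold (here `orange`) do not appear in the grouped result at all.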

We may also want to derive new values from existing columns. For example, we can compute the ratio of calories to radius for each cookie and store it in a new column. We can do this by using the `apply()` function with `axis=1`, which passes each row to the function.

In [24]:
#TODO: add a column to the dataframe that is the
# calories per radius
def calculate_cal_per_radius(row):
    return row['calories'] / row['radius']
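
The function above is only defined; it still has to be applied to the rows. A minimal sketch on made-up data, where the column name `cal_per_radius` is a hypothetical choice:

```python
import pandas as pd

toy = pd.DataFrame({"calories": [200, 150], "radius": [4.0, 3.0]})

def calculate_cal_per_radius(row):
    return row["calories"] / row["radius"]

# axis=1 calls the function once per row
toy["cal_per_radius"] = toy.apply(calculate_cal_per_radius, axis=1)

# The same ratio computed column-wise, without apply()
vectorized = toy["calories"] / toy["radius"]

# Removing the derived column again by name
toy_without = toy.drop(columns="cal_per_radius")

print(toy)
```

For simple arithmetic the column-wise form is usually shorter and faster; `apply()` is more useful when the per-row logic does not map onto whole-column operations.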

Or we can just get rid of this column if we find it useless. We can do this by using the `drop()` function or indexing.

In [25]:
#TODO: drop the created column
# Note: this call drops the existing `radius` column
df = df.drop('radius', axis='columns')
df.head(10)

Unnamed: 0,cookie,ingredients,calories,cost_to_make
78,orange,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",166,1.32
32,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",178,1.34
22,bbb,"[""flour"",""blueberries"",""sugar"",""butter""]",184,2.51
70,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",164,
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,1.17
93,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",198,1.09
68,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",168,1.13
37,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",193,1.47
63,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",189,1.22
17,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",209,2.03


### Visualizing the data

We can also visualize the data. For example, we can visualize the data by plotting the radius of the cookies against the cost to make the cookies. We can do this by using the `plot()` function.

In [26]:
#TODO: plot the radius (x) versus cost to make (y)
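
A minimal sketch of such a scatter plot, using made-up values and assuming the DataFrame still has both a `radius` and a `cost_to_make` column. The title and axis labels are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with the same column names as the lab's DataFrame
toy = pd.DataFrame({
    "radius": [3.1, 3.8, 4.1, 4.3],
    "cost_to_make": [0.67, 1.07, 1.04, 1.39],
})

# kind="scatter" needs explicit x and y column names
ax = toy.plot(kind="scatter", x="radius", y="cost_to_make",
              title="Cost to make vs. radius")
ax.set_xlabel("radius (cm)")
ax.set_ylabel("cost to make ($)")
```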

We may even want to get more specific and visualize the shape of a distribution of the `laddoo lemon`'s radius by making a boxplot. We can also do this by using the `plot()` function.

In [27]:
#TODO: add the described boxplot
# remember that you changed the name from laddoo lemon to laddoo

Alternatively, we can create a histogram to visualize the distribution of the `laddoo lemon`'s radius. We can also do this by using the `plot()` function.

In [28]:
#TODO: add the described histogram
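
Both single-cookie plots follow the same pattern: select the rows for one cookie, then call `plot()` with a different `kind`. The sketch below uses made-up data and assumes the cookie is still named `laddoo lemon` and that the `radius` column is available.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with the same column names as the lab's DataFrame
toy = pd.DataFrame({
    "cookie": ["laddoo lemon"] * 6 + ["chocolate"] * 2,
    "radius": [3.1, 3.9, 3.0, 3.2, 4.4, 3.6, 3.7, 3.2],
})

# Select the radius values of one cookie type
laddoo = toy.loc[toy["cookie"] == "laddoo lemon", "radius"]

# Draw the boxplot and the histogram side by side
fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
laddoo.plot(kind="box", ax=ax_box, title="laddoo lemon radius")
laddoo.plot(kind="hist", bins=5, ax=ax_hist, title="laddoo lemon radius")
ax_hist.set_xlabel("radius (cm)")
```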

Things can get more complicated too. Maybe we want to analyze the behaviors of `bad berry bunny` and `laddoo lemon`'s radius using a boxplot. But this time, let's try it using the alternative `boxplot()` function. For practice, try doing it with `plot()` too!

In [29]:
#TODO: analyze the two cookies' radii in a boxplot
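
A sketch of the `boxplot()` variant on made-up data: keep only the two cookie types with `isin()`, then let `by="cookie"` draw one box per type. It assumes the original cookie names and a `radius` column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with the same column names as the lab's DataFrame
toy = pd.DataFrame({
    "cookie": ["bad berry bunny"] * 3 + ["laddoo lemon"] * 3 + ["chocolate"] * 2,
    "radius": [4.1, 3.9, 3.0, 3.1, 4.0, 3.2, 3.7, 3.2],
})

# Keep only the two cookie types being compared
subset = toy[toy["cookie"].isin(["bad berry bunny", "laddoo lemon"])]

# One box per cookie type, drawn on an axis we create ourselves
fig, ax = plt.subplots()
subset.boxplot(column="radius", by="cookie", ax=ax)
ax.set_ylabel("radius (cm)")
```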

### Practice Problems
Now that you have learned some of Pandas' capabilities, let's try some practice problems! **This is the part that will be checked when you submit it!**

#### Problem 1
How many cookies were made? (Hint: use the `shape` attribute)

In [30]:
#Add your code here
df = pd.read_csv('cookies.csv')
num_cookies = df.shape[0]
print("number of cookies made: ", num_cookies)

number of cookies made:  129


#### Problem 2
Add a column to the DataFrame that has the value `True` if the cookie has a radius greater than 4 cm, and `False` otherwise. (Hint: use the `apply()` function)

In [31]:
#Add your code here

def is_bigger_than_four(radius):
    return radius > 4

df['radius_greater_four'] = df['radius'].apply(is_bigger_than_four)

df.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make,radius_greater_four
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67,False
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07,False
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04,True
3,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07,False
4,bad berry bunny,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39,True
5,orange ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15,False
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84,False
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17,False
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,,False
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74,False


#### Problem 3

Group the data by the `cookie` column, and find the average radius of each cookie. (Hint: use the `groupby()` and `transform()` functions). Add this column to the DataFrame.

In [34]:
#Add your code here
# transform('mean') returns the group mean aligned to every row of its group
df['aver_rad'] = df.groupby('cookie')['radius'].transform('mean')
df.head(15)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make,radius_greater_four,aver_rad
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",170,3.102,0.67,False,3.782118
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07,False,4.034952
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04,True,4.013588
3,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",198,3.78,1.07,False,4.034952
4,bad berry bunny,"[""flour"",""blueberries"",""sugar"",""butter""]",191,4.148,1.39,True,3.911
5,orange ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",162,3.241,1.15,False,3.4776
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",204,3.964,0.84,False,3.762
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17,False,3.933562
8,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",178,3.989,,False,3.782118
9,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",184,3.743,0.74,False,3.762


#### Problem 4
Create a new DataFrame that only contains the cookies that have the ingredient `"chocolate chips"`. (Hint: use the `str.contains()` function)

In [37]:
#Add your code here
df2 = df[df['ingredients'].str.contains("chocolate chips")]
#print(df2.to_string())
df2.head(10)

Unnamed: 0,cookie,ingredients,calories,radius,cost_to_make,radius_greater_four,aver_rad
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",224,4.069,1.04,True,4.013588
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",243,3.684,1.17,False,3.933562
10,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",216,3.848,1.28,False,4.013588
17,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",209,3.182,1.53,False,3.933562
19,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",236,4.043,1.29,True,4.013588
25,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",205,3.383,,False,3.933562
29,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",219,4.346,1.18,True,4.013588
34,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",205,3.937,,False,3.933562
36,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",211,4.152,1.72,True,4.013588
41,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",206,3.681,,False,3.933562


#### Problem 5

Create a boxplot of `cost_to_make` for all cookies except `chocolate` using the `boxplot()` function.

In [None]:
#Add your code here

#### Problem 6

Create a histogram of the `bad berry bunny`'s calories using the `plot()` function.

In [None]:
#Add your code here


#### Problem 7

Calculate the average calories per cookie type and display the result in a bar chart.

In [40]:
#Add your code here
# keep rows that have a calories value; copy() avoids modifying a slice of df
df2 = df[df['calories'].notna()].copy()
df2['aver_calor'] = df2.groupby('cookie')['calories'].transform('mean')
df3 = df2.drop(columns=["calories", "radius", "cost_to_make", "radius_greater_four", "aver_rad"])
df3 = df3.drop_duplicates()
#print(df3.to_string())
df3.head(20)

Unnamed: 0,cookie,ingredients,aver_calor
0,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",175.235294
1,red velvet rauw,"[""flour"",""cocoa powder"",""butter"",""red food col...",199.52381
2,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",221.352941
4,bad berry bunny,"[""flour"",""blueberries"",""sugar"",""butter""]",186.941176
5,orange ozuna,"[""flour"",""orange juice"",""sugar"",""vegetable oil""]",166.65
6,minty miami,"[""flour"",""mint extract"",""sugar"",""butter""]",188.52381
7,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",217.625


#### Problem 8

Find the top 3 most expensive cookies in terms of `cost_to_make`.

In [43]:
#Add your code here
# keep rows that have a cost; copy() avoids modifying a slice of df
df = df[df['cost_to_make'].notna()].copy()
df['aver_cost'] = df.groupby('cookie')['cost_to_make'].transform('mean')
df_av_cost = df.drop(columns=["calories", "radius", "cost_to_make", "radius_greater_four", "aver_rad"])
df_av_cost = df_av_cost.sort_values(by="aver_cost", ascending=False)
df_av_cost = df_av_cost.drop_duplicates()
df_av_cost_3 = df_av_cost.head(3)  # the three cookie types with the highest average cost
#print(df_av_cost_3.to_string())
df_av_cost_3.head()

Unnamed: 0,cookie,ingredients,aver_cost
29,nevadito,"[""flour"",""chocolate chips"",""milk"",""vanilla ext...",1.364667
49,laddoo lemon,"[""flour"",""lemon juice"",""sugar"",""ghee""]",1.303077
128,chocolate,"[""flour"",""chocolate chips"",""sugar"",""butter""]",1.235385
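
The same ranking, read as "the three cookie types with the highest average cost", can be sketched more compactly with `groupby()` and `nlargest()`. The data below is made up, and `mean()` skips missing values by default.

```python
import pandas as pd

# Made-up data with the same column names as the lab's DataFrame
toy = pd.DataFrame({
    "cookie": ["a", "a", "b", "c", "c", "d", "d"],
    "cost_to_make": [1.0, 2.0, 3.0, 0.5, 1.0, 2.0, 2.5],
})

# Average cost per cookie type, then the three largest averages
top3 = toy.groupby("cookie")["cost_to_make"].mean().nlargest(3)

print(top3)
```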
