## Lab 02 - Pandas and Data Visualization - 07 February, 2023
This notebook will introduce you to the basics of Pandas and Data Visualization. You will learn how to load data into a Pandas DataFrame, how to perform basic data analysis, and how to visualize data. The first part of this notebook will be an interactive tutorial, and the second part will be practice exercises for you to do! Note that the practice problems will be checked when submitted!

### Pre-requisites

In [None]:
# In case you don't have pandas, uncomment
# the following lines and run the cell

# %pip install pandas

### Overview
In this notebook, you will be learning how to use the Pandas library by working with the `cookies.csv` file. 

#### `cookies.csv` file

The `cookies.csv` file contains information about cookies that were made from a single Rico's Bisquito's factory. There are, however, a few differences from the classes defined in homework0.

This dataset lists every cookie made at a single factory, so the `cost_to_make` can differ between cookies of the same type because someone may have, for example, added too much flour.

The columns are the following:

- `cookie`: the name of a cookie
- `ingredients`: a list of the cookie's ingredients
- `calories`: the number of calories the created cookie has
- `radius`: the radius of the created cookie, in cm
- `cost_to_make`: the cost it took to make the created cookie, in dollars

### Reading the CSV file

First, we need to import the Pandas library. We will be using the `pd` alias for the Pandas library.

In [4]:
#TODO: import pandas and matplotlib in this cell
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the plotting interface used with pandas

We will now look at the `cookies.csv` file. We will use the `pd.read_csv()` function to read in the CSV file. We will store the data in a variable called `cookies`.

In [12]:
#TODO: read the cookies.csv file into a pandas dataframe
cookies = pd.read_csv("cookies.csv")

What is a DataFrame? A DataFrame is the data structure that Pandas uses to store tabular data, similar to a table in a database. It has rows and columns: each row represents a single data point, and each column represents a feature of that data point.

We will then make sure we imported the data correctly by printing out the first rows of the data, using the `head()` function, which returns five rows by default.

In [11]:
#TODO: print the head of the dataframe
print(cookies.head())

            cookie                                        ingredients  \
0     laddoo lemon             ["flour","lemon juice","sugar","ghee"]   
1         nevadito  ["flour","chocolate chips","milk","vanilla ext...   
2  red velvet rauw  ["flour","cocoa powder","butter","red food col...   
3  bad berry bunny           ["flour","blueberries","sugar","butter"]   
4     orange ozuna   ["flour","orange juice","sugar","vegetable oil"]   

   calories  radius  cost_to_make  
0       170   3.102          0.67  
1       224   4.069          1.04  
2       198   3.780          1.07  
3       191   4.148          1.39  
4       162   3.241          1.15  
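As a minimal sketch of the same steps that runs without the real file, the snippet below reads a tiny made-up CSV from an in-memory string. The rows and the `sample_csv` name are illustrative assumptions, not the contents of `cookies.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for cookies.csv (values are made up for illustration)
sample_csv = """cookie,calories,radius,cost_to_make
laddoo lemon,170,3.1,0.67
nevadito,224,4.0,1.04
chocolate,210,3.9,1.10
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(sample_csv))

# head(n) returns the first n rows; with no argument it returns five
first_two = sample.head(2)
print(first_two)
```

Passing a number to `head()` is handy when five rows are too few or too many to judge whether the import worked.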


### Checking data types
You can check the data types of each column using the `dtypes` attribute of the DataFrame.

In [14]:
#TODO: check the data types of the columns
print(cookies.dtypes)

cookie           object
ingredients      object
calories          int64
radius          float64
cost_to_make    float64
dtype: object
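Checking `dtypes` matters because a numeric column is sometimes read in as text. The sketch below uses a made-up frame (not the lab data) in which `radius` is stored as strings, and converts it with `astype()`.

```python
import pandas as pd

# Hypothetical data: radius was stored as text (values are made up)
sample = pd.DataFrame({
    "cookie": ["nevadito", "chocolate"],
    "calories": [224, 210],
    "radius": ["4.0", "3.9"],
})

print(sample.dtypes)  # radius is not a numeric type yet

# Convert the text column to floats so arithmetic works on it
sample["radius"] = sample["radius"].astype(float)
print(sample.dtypes)
```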


### Cleaning the data
Now that we have the data, we need to clean it. For example, some `cost_to_make` fields of some created cookies are missing. To resolve this, we can do many things: we can replace the missing data with the mean of the column, or we can get rid of the row entirely if the `cost_to_make` field is not set. 

In [15]:
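#TODO: clean the dataframe and 
# print the head again to make sure 
# the changes took effect
# Fill missing costs with the column mean; assigning back makes the change persist
cookies["cost_to_make"] = cookies["cost_to_make"].fillna(cookies["cost_to_make"].mean())
cookies["cost_to_make"]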

0      0.67
1      1.04
2      1.07
3      1.39
4      1.15
       ... 
113    1.42
114    2.00
115    1.61
116    1.33
117    1.36
Name: cost_to_make, Length: 118, dtype: float64
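The two cleaning strategies described above can be sketched on a small made-up frame. The values are assumptions chosen so the result is easy to check by hand; which strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing cost (values are made up)
sample = pd.DataFrame({
    "cookie": ["laddoo lemon", "nevadito", "chocolate"],
    "cost_to_make": [1.0, np.nan, 2.0],
})

# Option 1: replace missing values with the column mean (NaN is skipped by mean())
filled = sample.copy()
filled["cost_to_make"] = filled["cost_to_make"].fillna(filled["cost_to_make"].mean())

# Option 2: drop every row whose cost_to_make is missing
dropped = sample.dropna(subset=["cost_to_make"])

print(filled)
print(dropped)
```

Both `fillna()` and `dropna()` return new objects, so the result has to be assigned to a variable or column for the change to stick.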

To confirm that no null values remain, we can use the `isnull()` method, which marks each cell as `True` if it is missing. Chaining `.sum()` onto the result counts the null values in each column.

In [16]:
#TODO: use the isnull method to make sure your data is clean
cookies.isnull()

     cookie  ingredients  calories  radius  cost_to_make
0     False        False     False   False         False
1     False        False     False   False         False
2     False        False     False   False         False
3     False        False     False   False         False
4     False        False     False   False         False
...     ...          ...       ...     ...           ...
113   False        False     False   False         False
114   False        False     False   False         False
115   False        False     False   False         False
116   False        False     False   False         False
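A full table of booleans is hard to scan, so it can help to summarize it. The sketch below, on made-up data, counts missing values per column with `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing cost (values are made up)
sample = pd.DataFrame({
    "cookie": ["laddoo lemon", "nevadito", "chocolate"],
    "cost_to_make": [1.0, np.nan, 2.0],
})

# True counts as 1 when summed, so this gives missing values per column
null_counts = sample.isnull().sum()

# A single yes/no answer for the whole frame
has_nulls = sample.isnull().any().any()

print(null_counts)
print(has_nulls)
```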


### Parsing the data
Now that we have the data, we could parse it to get the information we want. For example, we can check what types of cookies were made by using the `unique()` function on the `cookie` column.

In [18]:
#TODO: see what cookies are in the dataset
# Call unique() on the cookie column itself, not on the column labels
cookies["cookie"].unique()

We can also count how many cookies of each type were made by using the `value_counts()` function on the `cookie` column.

In [19]:
#TODO: use value_counts() to see how many 
# cookies of each type there are
# Count the values in the cookie column, not the column labels
cookies["cookie"].value_counts()
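Both `unique()` and `value_counts()` work on the values of a single column (a Series). Calling them on `DataFrame.columns` inspects the column labels instead. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical cookie names (made up for illustration)
sample = pd.Series(
    ["nevadito", "chocolate", "nevadito", "laddoo lemon", "nevadito"],
    name="cookie",
)

names = sample.unique()         # distinct values, in order of first appearance
counts = sample.value_counts()  # occurrences per value, most frequent first

print(names)
print(counts)
```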

Or maybe we don't like how long the names of the cookies are, so we can shorten them by using the `replace()` function on the `cookie` column.

For example, let's try changing `"bad berry bunny"` to `"bbb"`.

In [29]:
#TODO: change bad berry bunny data elements to "bbb"
cookies["cookie"].replace("bad berry bunny", "bbb")

0         laddoo lemon
1             nevadito
2      red velvet rauw
3                  bbb
4         orange ozuna
            ...       
113          chocolate
114       laddoo lemon
115           nevadito
116    red velvet rauw
117                bbb
Name: cookie, Length: 118, dtype: object

We may even like the original names better, but we may want to get rid of the spaces. For example, we can change `"orange ozuna"` to `"orange_ozuna"`. Here, we will use the `str.replace()` function.

In [27]:
#TODO: adjust orange ozuna as described
# str.replace() substitutes substrings; regex=False treats the pattern literally
cookies["cookie"].str.replace("orange ozuna", "orange_ozuna", regex=False)

0         laddoo lemon
1             nevadito
2      red velvet rauw
3      bad berry bunny
4         orange_ozuna
            ...       
113          chocolate
114       laddoo lemon
115           nevadito
116    red velvet rauw
117    bad berry bunny
Name: cookie, Length: 118, dtype: object
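The difference between `replace()` and `str.replace()` is easy to miss: the first swaps whole values, the second swaps substrings inside each value. A small sketch on made-up names:

```python
import pandas as pd

# Hypothetical cookie names (made up for illustration)
names = pd.Series(["orange ozuna", "bad berry bunny", "nevadito"])

# replace() only matches entire values
whole = names.replace("orange ozuna", "orange_ozuna")

# No value is exactly "orange", so nothing changes here
partial = names.replace("orange", "x")

# str.replace() works inside every string; regex=False means a literal match
no_spaces = names.str.replace(" ", "_", regex=False)

print(no_spaces)
```

Like most pandas operations, both return a new Series, so assign the result back to the column if the change should persist.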

We may even just want to keep the first word of the cookie name. For example, we can change `"orange ozuna"` to `"orange"`.

In [28]:
#TODO: adjust all cookies so only the first word
# is used as the cookie name
# Split each name on whitespace and keep the first piece
cookies["cookie"].str.split().str[0]

0         laddoo
1       nevadito
2            red
3            bad
4         orange
         ...    
113    chocolate
114       laddoo
115     nevadito
116          red
117          bad
Name: cookie, Length: 118, dtype: object
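One way to keep only the first word is to split each name on whitespace and take the first piece through the `.str` accessor. This is a sketch on made-up names, not the only possible approach.

```python
import pandas as pd

# Hypothetical cookie names (made up for illustration)
names = pd.Series(["orange ozuna", "red velvet rauw", "nevadito"])

# str.split() turns each name into a list of words; .str[0] picks the first
first_words = names.str.split().str[0]

print(first_words)
```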

Ingredient prices can also change: flour might cost more due to inflation, for example. In that case we need to adjust our `cost_to_make` values, similar to the `price_adjustments` in the homework. We can do this by using the `apply()` function on the `cost_to_make` column.

In [24]:
#Don't edit this method
def adjust_cost(cost):
    return cost + 0.5

#TODO: use apply() to adjust the cost_to_make column.
cookies["cost_to_make"].apply(adjust_cost)

0      1.17
1      1.54
2      1.57
3      1.89
4      1.65
       ... 
113    1.92
114    2.50
115    2.11
116    1.83
117    1.86
Name: cost_to_make, Length: 118, dtype: float64
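`apply()` calls the function once per value and returns a new Series, leaving the original column untouched. The sketch below, on made-up costs, shows this along with the equivalent vectorized arithmetic and how to store the result.

```python
import pandas as pd

def adjust_cost(cost):
    return cost + 0.5

# Hypothetical costs (made up for illustration)
sample = pd.DataFrame({"cost_to_make": [0.5, 1.0, 1.5]})

adjusted = sample["cost_to_make"].apply(adjust_cost)  # new Series, sample unchanged
vectorized = sample["cost_to_make"] + 0.5              # same result without apply()

# Store the result in a new (hypothetically named) column to keep it
sample["adjusted_cost"] = adjusted

print(sample)
```

For simple arithmetic the vectorized form is usually shorter and faster. `apply()` is more useful when the logic does not reduce to a single expression.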

And we can do a lot more! You will see these concepts used in the next homework assignment, along with a couple of new ones that show how powerful Pandas is.

### More complicated operations: Grouping, Filtering, Aggregating

We may also want to group data by certain attributes. This can be done by using `groupby()`. This method takes in a column name, and groups the data by the values in that column. For example, we can group the data by the `cookie` column.

In [31]:
#TODO: group by cookie type
cookies.groupby("cookie")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000011F00039D20>

We can also group by multiple columns. For example, we can group the data by the `cookie` and `ingredients` columns.

In [34]:
#TODO: group by cookie type and ingredients
# Pass a list of column names to group by several columns at once
cookies.groupby(["cookie", "ingredients"])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000011F7DDDE440>
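A `groupby()` call on its own only returns a lazy `DataFrameGroupBy` object, as in the output above. Results appear once an aggregation is applied. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data (values are made up for illustration)
sample = pd.DataFrame({
    "cookie": ["nevadito", "nevadito", "chocolate"],
    "ingredients": ["flour, milk", "flour, milk", "flour, cocoa"],
    "radius": [4.0, 5.0, 3.0],
})

# Group by one column, then average the radius within each group
mean_radius = sample.groupby("cookie")["radius"].mean()

# Group by two columns (passed as a list), then count the rows per group
group_sizes = sample.groupby(["cookie", "ingredients"]).size()

print(mean_radius)
print(group_sizes)
```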

We may also want to filter the data. For example, we can filter the data to only show the cookies that have a radius greater than 4.3 cm. We can do this by indexing the DataFrame with a boolean expression.

In [36]:
#TODO: filter using the boolean expression
# Index with a boolean mask; DataFrame.filter() selects by label, not by value
cookies[cookies["radius"] > 4.3]


We may even want to combine filtering with `groupby()`! For example, we can filter the data to only show the cookies that have a radius greater than 4.3 cm, and then group the filtered data by the `cookie` column.

In [37]:
#TODO: filter the data using the boolean expression
# then group by cookie column
cookies[cookies["radius"] > 4.3].groupby("cookie")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000011F00039ED0>
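Filtering by value is done with a boolean mask inside square brackets. `DataFrame.filter()`, despite its name, selects rows or columns by label. The sketch below filters a made-up frame and then groups what is left.

```python
import pandas as pd

# Hypothetical data (values are made up for illustration)
sample = pd.DataFrame({
    "cookie": ["nevadito", "nevadito", "chocolate", "chocolate"],
    "radius": [4.5, 4.0, 4.4, 4.6],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows
large = sample[sample["radius"] > 4.3]

# Group only the filtered rows and count them per cookie type
large_counts = large.groupby("cookie").size()

print(large)
print(large_counts)
```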

We may also want to derive new columns from existing ones. For example, we can compute the ratio of calories to radius for each cookie. We can do this by using indexing and the `apply()` function.

In [None]:
#TODO: add a column to the dataframe that is the
# calories per radius



Or we can just get rid of this column if we find it useless. We can do this by using the `drop()` function or indexing.

In [None]:
#TODO: drop the created column
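As a hedged sketch of adding and then removing a derived column, the snippet below uses a made-up frame and a hypothetical column name, `calories_per_radius`. It shows both plain column arithmetic and a row-wise `apply()`.

```python
import pandas as pd

# Hypothetical data (values are made up for illustration)
sample = pd.DataFrame({"calories": [170, 224], "radius": [3.4, 4.0]})

# Assigning to a new column name creates the column
sample["calories_per_radius"] = sample["calories"] / sample["radius"]

# The same ratio computed row by row; axis=1 passes each row to the function
via_apply = sample.apply(lambda row: row["calories"] / row["radius"], axis=1)

# drop() returns a new frame without the column; sample itself is unchanged
trimmed = sample.drop(columns=["calories_per_radius"])

print(sample)
print(trimmed)
```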

### Visualizing the data

We can also visualize the data. For example, we can visualize the data by plotting the radius of the cookies against the cost to make the cookies. We can do this by using the `plot()` function.

In [None]:
#TODO: plot the radius (x) versus cost to make (y)

We may even want to get more specific and visualize the distribution of the radius of the `laddoo lemon` cookies by making a boxplot. We can also do this by using the `plot()` function.

In [None]:
#TODO: add the described boxplot

Alternatively, we can create a histogram to visualize the distribution of the `laddoo lemon`'s radius. We can also do this by using the `plot()` function.

In [None]:
#TODO: add the described histogram

Things can get more complicated too. Maybe we want to analyze the behaviors of `bad berry bunny` and `laddoo lemon`'s radius using a boxplot. But this time, let's try it using the alternative `boxplot()` function. For practice, try doing it with `plot()` too!

In [None]:
#TODO: analyze the two cookies' radii in a boxplot
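The plot types described in this section can be sketched together on a small made-up frame. The data, the four-panel layout, and the `Agg` backend (chosen so the code runs without a display) are assumptions for illustration; in a notebook the backend line is usually unnecessary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data (values are made up for illustration)
sample = pd.DataFrame({
    "cookie": ["laddoo lemon"] * 3 + ["bad berry bunny"] * 3,
    "radius": [3.1, 3.4, 3.3, 4.1, 4.3, 4.0],
    "cost_to_make": [0.7, 0.9, 0.8, 1.4, 1.5, 1.3],
})
laddoo = sample[sample["cookie"] == "laddoo lemon"]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Scatter plot: radius on x, cost on y
sample.plot(kind="scatter", x="radius", y="cost_to_make", ax=axes[0])

# Boxplot and histogram of one cookie type's radius via plot()
laddoo["radius"].plot(kind="box", ax=axes[1])
laddoo["radius"].plot(kind="hist", bins=5, ax=axes[2])

# boxplot() with by= draws one box per cookie type
sample.boxplot(column="radius", by="cookie", ax=axes[3])
```

Passing `ax=` keeps each plot on its own panel. Without it, successive `plot()` calls may draw on the same axes.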

### Practice Problems
Now that you have learned some of Pandas' capabilities, let's try some practice problems! **This is the part that will be checked when you submit it!**

#### Problem 1
How many cookies were made? (Hint: use the `shape` attribute)

In [39]:
#Add your code here
print(cookies.shape)

(118, 5)


#### Problem 2
Add a column to the DataFrame that has the value `True` if the cookie has a radius greater than 4 cm, and `False` otherwise. (Hint: use the `apply()` function)

In [42]:
#Add your code here
def bigger_than_4(radius):
    return radius > 4

cookies["bigger_than_4"] = cookies["radius"].apply(bigger_than_4)
print(cookies["bigger_than_4"])

0      False
1       True
2      False
3       True
4      False
       ...  
113     True
114    False
115    False
116    False
117     True
Name: bigger_than_4, Length: 118, dtype: bool


#### Problem 3

Group the data by the `cookie` column, find the average radius of each cookie type, and add the result to the DataFrame as a new column. (Hint: use the `groupby()` and `transform()` functions)

In [None]:
#Add your code here


#### Problem 4
Create a new DataFrame that only contains the cookies that have the ingredient `"chocolate chips"`. (Hint: use the `str.contains()` function)

In [None]:
#Add your code here

#### Problem 5

Create a boxplot of `cost_to_make` for all cookies except `chocolate` using the `boxplot()` function.

In [None]:
#Add your code here

#### Problem 6

Create a histogram of the `bad berry bunny`'s calories using the `plot()` function.

In [None]:
#Add your code here