## Lab 02 - Pandas and Data Visualization
This notebook will introduce you to the basics of Pandas and Data Visualization. You will learn how to load data into a Pandas DataFrame, how to perform basic data analysis, and how to visualize data. The first part of this notebook will be an interactive tutorial, and the second part will be practice exercises for you to do!

#### Pandas
Pandas is a popular open-source Python library that provides data structures and analysis tools for working with structured data. It simplifies data manipulation, analysis, and exploration in Python. Some of its uses:
* Tabular Data Handling
* Data Cleaning and Transformation
* Data Exploration
* Data Import/Export
* Data Visualization

#### Matplotlib
Matplotlib is a widely used Python library for creating static, animated, and interactive visualizations. Whether you are conducting data analysis, scientific research, or data communication, Matplotlib helps you present your findings effectively and intuitively.

### Pre-requisites

In [None]:
# In case you don't have pandas or matplotlib,
# run this cell to install them.
# To upgrade IPython as well, uncomment the following line:
# pip install --upgrade ipython

!pip3 install pandas matplotlib

### Overview
In this notebook, you will be learning how to use the Pandas library by working with the `cookies.csv` file. 

#### `cookies.csv` file

The `cookies.csv` file contains information about cookies that were made in Rico's Bisquito's factory. 

The columns are the following:

* `cookie`: the name of a cookie
* `ingredients`: a list of the cookie's ingredients
* `calories`: the number of calories the created cookie has
* `radius`: the radius of the created cookie, in cm
* `cost_to_make`: the cost it took to make the created cookie, in dollars

### Reading the CSV file

First, we need to import the Pandas library. We will be using the `pd` alias for the Pandas library.

In [None]:
#TODO: import pandas and matplotlib in this cell
import pandas as pd
import matplotlib.pyplot as plt

We will now look at the `cookies.csv` file. We will use the `pd.read_csv()` function to read in the CSV file. We will store the data in a variable called `df`.

In [None]:
#TODO: read the cookies.csv file into a pandas dataframe
df = pd.read_csv('cookies.csv')
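
To see what `pd.read_csv()` does without needing the file on disk, here is a minimal sketch that reads CSV text from memory. The two rows are made-up values, not taken from `cookies.csv`, and `csv_text` and `sample` are names chosen only for this illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; the second row has an empty cost_to_make field.
csv_text = "cookie,radius,cost_to_make\nchocolate,4.1,1.25\norange ozuna,3.9,\n"

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # numeric columns are inferred automatically
```

Notice that the empty field is read as a missing value (`NaN`), which is why the cleaning step later in this notebook is needed.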

#### Dataframe
Dataframes are a data structure that Pandas uses to store data. Dataframes are similar to tables in a database. Dataframes have rows and columns. Each row represents a single data point, and each column represents a feature of the data point.
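
As a small sketch of the row and column idea, we can build a tiny DataFrame by hand from a dictionary. The values below are hypothetical and only loosely mimic the columns of `cookies.csv`.

```python
import pandas as pd

# Hypothetical sample data; these values are not from cookies.csv.
sample = pd.DataFrame({
    "cookie": ["chocolate", "orange ozuna", "chocolate"],
    "radius": [4.1, 3.9, 4.5],
    "calories": [210, 180, 230],
})

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # the column (feature) names
print(sample.iloc[0])        # the first row, i.e. one data point
```

Each dictionary key becomes a column, and each position in the lists becomes a row.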

We will then make sure we imported the data correctly by printing out the first 10 rows of the data, using the `head()` function.

In [None]:
#TODO: print the head of the dataframe
df.head(10)

### Checking data types
You can check the data types of each column using the `dtypes` attribute of the DataFrame.

In [None]:
#TODO: check the data types of the columns
print(df.dtypes)

Now, let's use the `info()` function to get more information about the DataFrame.

In [None]:
# TODO: use info() to get information about datatypes and null values
# info() prints its report directly and returns None, so no print() is needed
df.info()

### Cleaning the data
Now that we have the data, we need to clean it. For example, some cookies are missing a `cost_to_make` value. There are several ways to handle this: we can replace the missing values with the mean of the column, or we can drop every row where `cost_to_make` is not set.

In [None]:
#TODO: clean the dataframe and 
# print the head again to make sure 
# the changes took effect
# Option 1: fill the null values with the mean of the column
df['cost_to_make'] = df['cost_to_make'].fillna(df['cost_to_make'].mean())
# Option 2: drop the rows with null values in the column 'cost_to_make'
# df = df.dropna(subset=['cost_to_make'])
print(df.head(10))
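
To compare the two cleaning strategies side by side, here is a minimal sketch on a toy DataFrame. The names `toy`, `filled`, and `dropped` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "cookie": ["a", "b", "c", "d"],
    "cost_to_make": [1.0, np.nan, 3.0, 2.0],
})

# Strategy 1: replace missing costs with the column mean.
# mean() skips NaN, so here it averages 1.0, 3.0, and 2.0.
mean_cost = toy["cost_to_make"].mean()
filled = toy.assign(cost_to_make=toy["cost_to_make"].fillna(mean_cost))

# Strategy 2: drop every row whose cost is missing.
dropped = toy.dropna(subset=["cost_to_make"])

print(filled)
print(dropped)
```

Filling keeps every row but invents a value; dropping keeps only real values but loses a row. Which one is appropriate depends on how the data will be used.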

To confirm that no null values remain, we can count the null values in each column using the `isnull()` function.

In [None]:
#TODO: use the isnull method to make sure your data is clean
print(df.isnull().sum())

Next, let's check for duplicate rows using the `duplicated()` function. Then, remove those rows using the `drop_duplicates()` function.

In [None]:
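# TODO: check for duplicate rows
# then delete those rows from df
duplicate_rows = df.duplicated()  # True for each row that repeats an earlier row
print(duplicate_rows.sum())       # number of duplicate rows found
df = df.drop_duplicates()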
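
Here is a small sketch of how `duplicated()` decides which rows to flag, using hypothetical data. A row counts as a duplicate only when every column matches an earlier row.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "cookie": ["a", "b", "a", "a"],
    "radius": [4.0, 3.5, 4.0, 4.2],
})

mask = toy.duplicated()          # the first occurrence is kept as False
deduped = toy.drop_duplicates()  # returns a new DataFrame without the flagged rows

print(mask)
print(deduped)
```

The third row repeats the first one exactly, so it is flagged. The fourth row shares the cookie name but has a different radius, so it is kept.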

### Parsing the data
Now that the data is clean, we can parse it to get the information we want. For example, we can check what types of cookies were made by using the `unique()` function on the `cookie` column.

In [None]:
#TODO: see what cookies are in the dataset
unique_cookies = df['cookie'].unique()
print(unique_cookies)

We can also check how many cookies of each type were made by using the `value_counts()` function on the `cookie` column.

In [None]:
#TODO: use value_counts() to see how many 
# cookies of each type there are
cookie_counts = df['cookie'].value_counts()
print(cookie_counts)
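
As a sketch of a related option, `value_counts()` can also report proportions instead of raw counts through its `normalize` parameter. The names below are hypothetical.

```python
import pandas as pd

# Hypothetical cookie names, for illustration only.
names = pd.Series(["a", "a", "b", "c"])

counts = names.value_counts()                # raw counts, most frequent first
shares = names.value_counts(normalize=True)  # fraction of all rows per type

print(counts)
print(shares)
```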

Or maybe we don't like how long the names of the cookies are, so we can shorten them by using the `replace()` function on the `cookie` column.

For example, let's try changing `"bad berry bunny"` to `"bbb"`.

In [None]:
#TODO: change bad berry bunny data elements to "bbb"
df['cookie'] = df['cookie'].replace('bad berry bunny', 'bbb')
df.head(10)

We may even like the original names better, but we may want to get rid of the spaces. For example, we can change `"orange ozuna"` to `"orange_ozuna"`. Here, we will use the `str.replace()` function.

In [None]:
#TODO: adjust orange ozuna as described
df['cookie'] = df['cookie'].str.replace('orange ozuna', 'orange_ozuna')

df.head(10)
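
The difference between `replace()` and `str.replace()` is easy to miss, so here is a minimal sketch with hypothetical values. `replace()` on a Series swaps a value only when the whole cell matches, while `str.replace()` substitutes the text wherever it appears inside a cell.

```python
import pandas as pd

# Hypothetical names, for illustration only.
names = pd.Series(["orange ozuna", "big orange ozuna cookie"])

# Whole-value match: only the first cell is exactly "orange ozuna".
whole = names.replace("orange ozuna", "orange_ozuna")

# Substring match: both cells contain "orange ozuna".
# regex=False treats the pattern as plain text.
partial = names.str.replace("orange ozuna", "orange_ozuna", regex=False)

print(whole.tolist())
print(partial.tolist())
```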

We may even just want to keep the first word of the cookie name. For example, we can change `"orange_ozuna"` to `"orange"`.

In [None]:
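#TODO: adjust all cookies so only the first word
# is used as the cookie name
# note: this splits on underscores, so only names containing "_"
# (such as orange_ozuna) are shortened; names with spaces are unchanged
df['cookie'] = df['cookie'].str.split("_").str[0]
df.head(10)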

Another thing that may come to mind is that maybe getting flour could cost more money due to inflation, so we have to adjust our `cost_to_make` values. We can do this by using the `apply()` function on the `cost_to_make` column.

In [None]:
#Don't edit this method
def adjust_cost(cost):
    return cost + 0.5

#TODO: use apply() to adjust the cost_to_make column.
df['cost_to_make'] = df['cost_to_make'].apply(adjust_cost)
df.head()
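
For simple arithmetic like this, `apply()` is not the only option. The sketch below, on hypothetical costs, shows that adding the constant to the whole column at once gives the same result. It redefines a copy of `adjust_cost` only so that the example is self-contained.

```python
import pandas as pd

# Hypothetical costs, for illustration only.
costs = pd.Series([1.00, 2.25, 0.75])

def adjust_cost(cost):
    return cost + 0.5

via_apply = costs.apply(adjust_cost)  # calls the function once per element
vectorized = costs + 0.5              # operates on the whole column at once

print(via_apply.equals(vectorized))
```

`apply()` is most useful when the per-element logic cannot be written as a direct column operation.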

And we can do a lot more! You will see these concepts used in the next homework assignment, along with a couple of new ones that show how powerful Pandas is.

### More complicated operations: Grouping, Filtering, Aggregating

Before trying out these more complicated operations, let's first sort the DataFrame by the radius of the cookies using the `sort_values()` function.

In [None]:
# TODO: sort the df using sort_values(by='Column', ascending=False)
df = df.sort_values(by='radius', ascending=False)
df.head()

We may also want to group data by certain attributes. This can be done by using `groupby()`. This method takes a column name and groups the rows by the values in that column. For example, we can group the data by the `cookie` column.

In [None]:
#TODO: group by cookie type
grouped_by_cookie = df.groupby('cookie')
print(grouped_by_cookie.size())  # number of rows in each group
grouped_by_cookie.get_group("chocolate")

We can also group by multiple columns. For example, we can group the data by the `cookie` and `ingredients` columns.

In [None]:
#TODO: group by cookie type and ingredients
grouped_by_cookie_ingredients = df.groupby(['cookie', 'ingredients'])
print(grouped_by_cookie_ingredients.size())
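
Grouping becomes more useful once we aggregate each group. Here is a minimal sketch, on hypothetical data, that computes one summary row per cookie type with `agg()`. The output column names `mean_radius` and `max_calories` are chosen only for this illustration.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "cookie": ["a", "a", "b"],
    "radius": [4.0, 5.0, 3.0],
    "calories": [200, 220, 150],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = toy.groupby("cookie").agg(
    mean_radius=("radius", "mean"),
    max_calories=("calories", "max"),
)

print(summary)
```

The result is indexed by the group key, so `summary.loc["a"]` returns the summary for cookie `a`.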

We may also want to filter the data. For example, we can filter the data to only show the cookies that have a radius greater than 4.3 cm. We can do this by indexing the DataFrame with a boolean expression.

In [None]:
#TODO: filter using the boolean expression
# the comparison produces one True/False value per row; indexing with it
# returns a new dataframe containing only the rows marked True
filtered_cookies = df[df['radius'] > 4.3]
filtered_cookies.shape
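
Boolean expressions can also be combined. The sketch below uses hypothetical data to show `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "cookie": ["a", "b", "c", "d"],
    "radius": [4.5, 4.0, 4.8, 4.4],
    "calories": [180, 250, 260, 150],
})

big = toy["radius"] > 4.3
light = toy["calories"] < 200

both = toy[big & light]    # rows where both conditions hold
either = toy[big | light]  # rows where at least one condition holds

print(both["cookie"].tolist())
print(either["cookie"].tolist())
```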

We may even want to combine `groupby()` with filtering! For example, we can filter the data to only show the cookies that have a radius greater than 4.3 cm, and then group the result by the `cookie` column.

In [None]:
#TODO: filter the data using the boolean expression
# then group by cookie column
filtered_grouped_cookies = df[df['radius'] > 4.3].groupby('cookie')
print(filtered_grouped_cookies.size())

We may also want to derive new values by combining columns. For example, we can compute the ratio of calories to radius for each cookie. We can do this by using indexing and the `apply()` function.

In [None]:
# helper that divides one value by another
def find_ratio(col1, col2):
    return col1 / col2

#TODO: add a column to the dataframe that is the
# calories per radius
df['calories_per_radius'] = df.apply(lambda row: find_ratio(row['calories'], row['radius']), axis=1)
df.head()

Or we can just get rid of this column if we find it useless. We can do this by using the `drop()` function or indexing.

In [None]:
#TODO: drop the created column
df.drop('calories_per_radius', axis=1, inplace=True)
df.head()
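
The indexing alternative mentioned above can be sketched as follows, on hypothetical data. Instead of naming the column to remove, we select the columns we want to keep.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "cookie": ["a", "b"],
    "radius": [4.0, 3.5],
    "calories_per_radius": [50.0, 60.0],
})

via_drop = toy.drop(columns=["calories_per_radius"])  # name what to remove
via_indexing = toy[["cookie", "radius"]]              # name what to keep

print(via_drop.equals(via_indexing))
```

Both approaches return a new DataFrame and leave `toy` unchanged.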

### Visualizing the data

We can also visualize the data. For example, we can plot the radius of the cookies against the cost to make them. We can do this by using Matplotlib's `scatter()` function.

In [None]:
#TODO: plot the radius (x) versus cost to make (y)
plt.figure(figsize=(7, 5))
plt.scatter(df['radius'], df['cost_to_make'])
plt.xlabel('Radius (cm)')
plt.ylabel('Cost to Make ($)')
plt.title('Radius vs Cost to Make')
plt.show()

We may even want to get more specific and visualize the distribution of the `laddoo lemon` cookie's radius by making a boxplot. We can do this by using the `plot()` function.

In [None]:
#TODO: add the described boxplot
laddoo_lemon = df[df['cookie'] == 'laddoo lemon']
laddoo_lemon['radius'].plot(kind='box')
plt.title('Laddoo Lemon Radius Boxplot')
plt.show()

Alternatively, we can create a histogram to visualize the distribution of the `laddoo lemon`'s radius. We can also do this by using the `plot()` function.

In [None]:
#TODO: add the described histogram
laddoo_lemon['radius'].plot(kind='hist', bins=5)
plt.title('Laddoo Lemon Radius Histogram')
plt.xlabel('Radius (cm)')
plt.show()

Things can get more complicated too. Maybe we want to compare the radius distributions of `bad berry bunny` and `laddoo lemon` using a boxplot. But this time, let's try it using the alternative `boxplot()` function. For practice, try doing it with `plot()` too!

In [None]:
#TODO: analyze the two cookies' radii in a boxplot
# bad berry bunny was renamed to 'bbb' earlier in this notebook
selected_cookies = df[df['cookie'].isin(['bbb', 'laddoo lemon'])]
selected_cookies.boxplot(column='radius', by='cookie')
plt.title('Radius of Bad Berry Bunny and Laddoo Lemon')
plt.show()

### Practice Problems
Now that you have learned some of Pandas' capabilities, let's try some practice problems! **This is the part that will be checked when you submit it!**

#### Problem 1
How many cookies were made? (Hint: use the `shape` attribute)

In [None]:
#Add your code here
df.shape[0]  # shape is (rows, columns); each row is one cookie

#### Problem 2
Add a column to the DataFrame that has the value `True` if the cookie has a radius greater than 4 cm, and `False` otherwise. (Hint: use the `apply()` function)

In [None]:
#Add your code here
df['big_radius'] = df['radius']>4
df.head()

#### Problem 3

Group the data by the `cookie` column, and find the average radius of each cookie. (Hint: use the `groupby()` and `transform()` functions.) Add this column to the DataFrame.

In [None]:
#Add your code here
df['avg_radius'] = df.groupby('cookie')['radius'].transform('mean')
df.head()

#### Problem 4
Create a new DataFrame that only contains the cookies that have the ingredient `"chocolate chips"`. (Hint: use the `str.contains()` function)

In [None]:
#Add your code here
# na=False treats rows with missing ingredients as non-matches
chocolate_chip_cookies = df[df['ingredients'].str.contains('chocolate chips', na=False)]
chocolate_chip_cookies.head()

#### Problem 5

Create a boxplot of `cost_to_make` for all cookies except `chocolate` using the `boxplot()` function.

In [None]:
#Add your code here


#### Problem 6

Create a histogram of the `bad berry bunny`'s calories using the `plot()` function. (Remember that this cookie was renamed to `bbb` earlier in the notebook.)

In [None]:
#Add your code here

#### Problem 7

Calculate the average calories per cookie type and display the result in a bar chart.

In [None]:
#Add your code here

#### Problem 8

Find the top 3 most expensive cookies in terms of `cost_to_make`.

In [None]:
#Add your code here