# Data Analysis with `pandas`

The first of the computing libraries we will cover is `pandas`. `pandas` is the most common package used in data analysis, with a focus on data manipulation and processing.

We have already worked a bit with data frames, which are the core data type in `pandas`. In this notebook, we will cover some of the basic functionality of `pandas`.

To learn more, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).

In [None]:
# pandas is frequently imported with the alias pd
import pandas as pd

For now, let's use an existing dataset, the [penguins dataset](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data?resource=download&select=penguins_size.csv)! The dataset consists of body measurements for three penguin species (Adelie, Chinstrap, Gentoo). We will load in the file and use `df.head()` to look at the first few rows.

The data has the following columns: 

- Species (Adelie, Gentoo, Chinstrap)
- Island
- Culmen Length (mm)
- Culmen Depth (mm)
- Flipper Length (mm)
- Body Mass (g)
- Sex (MALE / FEMALE)

The culmen is the top part of the penguin's bill!

In [None]:
penguins = pd.read_csv('penguins.csv')
penguins.head()

## DataFrame Methods

There are many methods for summarizing and cleaning a `pd.DataFrame`:
1. `df.describe()`: Summarize the data frame columns.
2. `df.value_counts('column')`: Count the rows for each unique value in a column.
3. `df['column'].unique()`: List the unique values in a column.
4. `df.isnull().sum()`: Count the null values in each column.
5. `df.dropna()`: Drop all rows that contain a null value.

In [None]:
# Why are only some of the columns visible here?
penguins.describe()

In [None]:
print(penguins['species'].unique())

In [None]:
print(penguins.value_counts('species'))

In [None]:
penguins.isnull().sum()

In [None]:
penguins = penguins.dropna()
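To see what `isnull().sum()` and `dropna()` do on a small scale, here is a minimal sketch on a made-up four-row DataFrame. The `toy` frame and its values are illustrative assumptions, not part of the penguins data. Note that the rows that survive keep their original row labels, so the labels can have gaps after dropping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing a body mass, row 2 is missing a sex
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'body_mass_g': [3750.0, np.nan, 5000.0, 3700.0],
    'sex': ['MALE', 'FEMALE', None, 'FEMALE'],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()
print(toy_clean)

# The remaining rows keep their original labels (0 and 3)
print(list(toy_clean.index))
```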

## Selecting Columns and Rows

We can use `.loc[row, column]` to select rows and columns of the DataFrame by their labels. A `:` selects *all* rows or *all* columns.

In [None]:
# First row, all columns
penguins.loc[0, :]

In [None]:
# Select the species column, all rows
penguins.loc[:, 'species']

In [None]:
# This is equivalent to directly selecting the column
penguins['species']

In [None]:
# Select the body_mass for the third penguin
penguins.loc[2, 'body_mass_g']
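Because `.loc` looks rows up by *label* rather than by position, the two can disagree once the labels have gaps, for example after rows have been dropped. The sketch below uses a made-up frame (an assumption for illustration) and compares `.loc` with its position-based counterpart, `.iloc`.

```python
import pandas as pd

# Hypothetical frame whose row labels skip 1, as they might after dropping a row
toy = pd.DataFrame(
    {'body_mass_g': [3750.0, 3250.0, 3450.0]},
    index=[0, 2, 3],
)

# .loc selects the row whose label is 2 (here, the second row)
by_label = toy.loc[2, 'body_mass_g']

# .iloc selects the row at position 2 (here, the third row)
by_position = toy['body_mass_g'].iloc[2]

print(by_label)     # 3250.0
print(by_position)  # 3450.0
```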

In [None]:
# We can also use Boolean masks to subset our data frame 
penguins.loc[penguins['sex'] == 'FEMALE', :]
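A Boolean mask is simply a Series of `True`/`False` values, one per row, and several masks can be combined. The following is a small sketch on made-up data (the `toy` frame and the threshold of 185 are assumptions chosen for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'island': ['Biscoe', 'Dream', 'Biscoe', 'Dream'],
    'sex': ['FEMALE', 'MALE', 'MALE', 'FEMALE'],
    'flipper_length_mm': [181.0, 195.0, 210.0, 187.0],
})

# A comparison on a column yields one Boolean value per row
is_female = toy['sex'] == 'FEMALE'
print(is_female.tolist())  # [True, False, False, True]

# Combine conditions with & (and) or | (or); wrap each comparison in parentheses
long_flipper_females = toy.loc[is_female & (toy['flipper_length_mm'] > 185), :]
print(long_flipper_females)
```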

## Challenge 1: Subsetting a DataFrame

1. Select all Adelie penguins and calculate the mean body mass (**Hint**: use `.mean()`).
2. Do the same for Gentoo and Chinstrap penguins.

In [None]:
# YOUR CODE HERE

## Modifying the DataFrame 

Sometimes, we want to modify a DataFrame. For example, we can create a new column by assigning to it with bracket notation. The computation on the right-hand side of such an assignment is usually **vectorized**: instead of modifying each item in the Series individually with a loop, the operation is applied to the whole column at once. Vectorized operations are generally faster than looping over the values.
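As a rough illustration of the difference, the sketch below converts a few made-up masses (assumed values, not taken from the dataset) from grams to kilograms twice: once with an explicit loop and once with a single vectorized expression. Both give the same numbers.

```python
import pandas as pd

# Hypothetical body masses in grams
masses_g = pd.Series([3750.0, 3800.0, 3250.0])

# Loop version: convert one value at a time
looped = []
for mass in masses_g:
    looped.append(mass / 1000)

# Vectorized version: one expression applied to the whole Series
vectorized = masses_g / 1000

print(looped)
print(vectorized.tolist())
```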

In [None]:
penguins['body_mass_kg'] = penguins['body_mass_g'] / 1000

penguins

A column can also be modified using bracket notation and methods. For example, we can convert a column to another type.

If we convert `body_mass_g` to a string column, then numeric operations will no longer work.

In [None]:
penguins['body_mass_g'] = penguins['body_mass_g'].astype(str)
# This raises a TypeError, because strings cannot be divided by a number
penguins['body_mass_g'] / 1000
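If a numeric column has ended up stored as strings, it can be converted back with `.astype(float)`. A minimal sketch, using a made-up Series of string values rather than the penguins data:

```python
import pandas as pd

# Hypothetical measurements stored as text
masses = pd.Series(['3750.0', '3800.0', '3250.0'])

# Convert the strings to floats so that arithmetic works again
masses_numeric = masses.astype(float)
masses_kg = masses_numeric / 1000

print(masses_numeric.dtype)
print(masses_kg.tolist())
```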

**Vectorized string functions** can also be applied to columns that contain strings, using the syntax `df[column].str.method()`:

In [None]:
# Make island name lower case
penguins['island'].str.lower()
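A few other `.str` methods follow the same pattern. The sketch below applies them to a small made-up Series of island names (an assumption for illustration); each method is applied to every element at once.

```python
import pandas as pd

# Hypothetical Series of island names
islands = pd.Series(['Torgersen', 'Biscoe', 'Dream'])

# Upper-case every name
upper = islands.str.upper()

# Number of characters in each name
lengths = islands.str.len()

# Whether each name contains the letter 'o'
has_o = islands.str.contains('o')

print(upper.tolist())    # ['TORGERSEN', 'BISCOE', 'DREAM']
print(lengths.tolist())  # [9, 6, 5]
print(has_o.tolist())    # [True, True, False]
```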

## Challenge 2: Column Manipulation

1. Calculate the ratio between `culmen_length_mm` and `culmen_depth_mm` and save it to a new column called `culmen_ratio`.
2. Convert the `sex` column to a number (0 = FEMALE, 1 = MALE).

In [None]:
# YOUR CODE HERE


## Plotting with `pandas`

`pandas` also offers some basic plotting functions. In this section, we will cover three basic types of plots: histograms, scatter plots, and bar plots. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for further information on plotting and plot customization.

### Histograms

A histogram shows the distribution of a variable using binned values. We can call this using the syntax: `df[column].plot(kind='hist')`.

The `bins` keyword argument changes the number of bins in the histogram. A few examples of the bins argument are below. Which plot would you pick?

In [None]:
print('Plot A: 5 Bins')
penguins['body_mass_kg'].plot(kind='hist', title='Histogram of body mass values', bins=5)

In [None]:
print('Plot B: 10 Bins')
penguins['body_mass_kg'].plot(kind='hist', title='Histogram of body mass values', bins=10)

In [None]:
print('Plot C: 20 bins')
penguins['body_mass_kg'].plot(kind='hist', title='Histogram of body mass values', bins=20)
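To make the effect of `bins` concrete without drawing anything, the sketch below counts how many values fall into each bin for a small made-up sample. It uses NumPy's `np.histogram` as a stand-in to expose the counts behind a histogram; the sample values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical body masses in kilograms
masses_kg = pd.Series([3.0, 3.2, 3.3, 3.7, 4.2, 4.4, 5.1, 5.8, 6.0])

# Split the range 3.0 to 6.0 into 3 equal-width bins
counts_3, edges_3 = np.histogram(masses_kg, bins=3)

# Split the same range into 6 equal-width bins
counts_6, edges_6 = np.histogram(masses_kg, bins=6)

print(counts_3.tolist())  # [4, 2, 3]
print(counts_6.tolist())  # [3, 1, 2, 0, 1, 2]
```

With more bins, each bar covers a narrower range, so more detail appears (including an empty bin here), but each bar is based on fewer observations.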

### Scatter Plots

Scatter plots visualize bivariate relationships. We can create a scatter plot by specifying the columns to use for the `x` and `y` axes:

In [None]:
penguins.plot(kind='scatter',
              x='culmen_length_mm',
              y='culmen_depth_mm',
              title='Relationship between culmen length and depth')

### Bar Plots

Bar plots visualize the counts of several different groups. We'll often need to do some preprocessing before we can create the bar plot.

For example, if we want to make a bar plot with the number of observations for each species, we summarize those values first, and then plot the result using a bar plot.

In [None]:
penguins.value_counts('species').plot(kind='bar', title='Count of each species')
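The preprocessing step produces a Series whose index holds the group names and whose values hold the counts; these become the bar labels and bar heights. A small sketch on made-up labels (an assumption for illustration), using the closely related `Series.value_counts()`:

```python
import pandas as pd

# Hypothetical species labels
species = pd.Series(['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo'])

# Count the observations per species, most frequent first
counts = species.value_counts()
print(counts)

# Sort alphabetically by species name instead
counts_by_name = counts.sort_index()
print(counts_by_name.index.tolist())  # ['Adelie', 'Chinstrap', 'Gentoo']
```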

## Challenge 3: Customizing a Plot

Most plotting libraries build a figure in "layers" behind the scenes. This makes plots easy to customize, because each customization is added as a new "layer".

Let's create a scatter plot with multiple parts. Specifically, we want to visualize the culmen depth vs. the culmen length for each of the penguin species separately. We'll use different colors for each species.

To do this, we assign the result of the first plot call to the variable `ax`. This represents our plot. Then, in subsequent plot calls, we include the argument `ax=ax`. This tells `pandas` to draw each new plot as a layer on the original plot.
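The layering pattern can be sketched with two made-up DataFrames and simple line plots. The frames `group_a` and `group_b`, their values, and the labels are assumptions for illustration only; the same `ax=ax` idea carries over to other plot kinds.

```python
import pandas as pd

# Hypothetical data for two groups
group_a = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
group_b = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})

# First layer: the plot call returns a matplotlib Axes object
ax = group_a.plot(kind='line', x='x', y='y', label='Group A')

# Second layer: passing ax=ax draws onto the same Axes
group_b.plot(kind='line', x='x', y='y', label='Group B', ax=ax)

# Further customizations are applied to the same Axes
ax.set_title('Two groups on one plot')
```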

1. Make three different sub-DataFrames, one for each species, using `.loc[]` and a Boolean mask.
2. Plot the first layer and set it equal to `ax`.
3. Plot subsequent layers. Use a different color for each species (look at the documentation for the name of the color parameter). Some possible colors to use are `'green'`, `'red'`, `'purple'`, `'black'`, etc. 
4. Add a title and any other modifications to the plot (better x and y labels, for example).

In [None]:
# YOUR CODE HERE

# Subset Data 
chinstrap = ...
adelie = ...
gentoo = ...

# Create plot
ax = ...  # First layer
# Plot other layers


For more on data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).