## Challenge 1: Methods


In [None]:
import pandas as pd
import numpy as np
penguins = pd.read_csv('penguins.csv')
penguins['species'].value_counts(ascending=True)
# Operates on a Series (a single column)
# Returns a Series of counts for each unique value, sorted from least to most common

In [None]:
penguins.isnull()
# Operates on the DataFrame
# Returns a DataFrame of the same shape, with True wherever a value is null

In [None]:
penguins.dropna()
# Operates on the DataFrame
# Returns a new DataFrame without the rows that contain null values; penguins itself is unchanged

In [None]:
penguins['species'].str[0]
# Operates on a column (Series) through the .str accessor
# Returns the first character of every string
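
The four methods above can be checked side by side on a small table. The sketch below uses a hypothetical toy DataFrame (not the real `penguins.csv`) so it runs on its own; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for penguins.csv
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "sex": ["MALE", np.nan, "FEMALE", "MALE"],
})

counts = toy["species"].value_counts(ascending=True)  # Series indexed by unique value
null_mask = toy.isnull()                               # boolean DataFrame, same shape as toy
complete = toy.dropna()                                # new DataFrame without rows containing NaN
initials = toy["species"].str[0]                       # Series holding each string's first character

print(type(counts).__name__, type(null_mask).__name__)
print(toy.shape, complete.shape)  # toy keeps all of its rows
print(initials.tolist())
```

Printing the type and shape of each result is a quick way to confirm whether a method works on a Series or a DataFrame and whether it changes the original object.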

## Challenge 2: Finding the right method

1)  Recall that the penguins data set has one column, `sex`, with two string values: 'MALE' and 'FEMALE'. Let's say that for a model, we want to replace the string values with numbers (FEMALE = 0; MALE = 1). Look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and identify a method to *replace* the strings with their corresponding numbers. Then try to implement the method. What roadblocks do you run across?

In [None]:
penguins['sex_numeric'] = penguins['sex'].replace(['MALE','FEMALE'],[1,0])
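
One roadblock is easier to see on a small example. The sketch below is an alternative to `.replace()`, not the workshop's solution: it uses `Series.map()` with a dictionary on a hypothetical toy Series. The names `sex`, `codes`, and `sex_numeric` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
sex = pd.Series(["MALE", "FEMALE", np.nan, "FEMALE"], name="sex")

# A dictionary keeps each string next to its numeric code
codes = {"FEMALE": 0, "MALE": 1}
sex_numeric = sex.map(codes)

print(sex_numeric)
print(sex_numeric.isna().sum())  # missing values are still missing after the mapping
```

The missing value passes through unchanged, and its presence makes the result a float column rather than an integer one. That is the roadblock the next question addresses.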

2)  Notice that there are some 'NaN' values in the `Series`. You do some research and identify three possible solutions to deal with the NaN values (listed below). For each option, describe what happens to the NaN values in the column and to the DataFrame as a whole. Which option seems most appropriate? Modify that function as necessary.

Consider the following:
- Is the whole DataFrame or just the column (Series) being operated on?
- What exactly is happening to the NaN values?
- What are the consequences, if any, for the solution in the hypothetical model?
- Should removing null values happen before, during or after the conversion in (1)?

In [None]:
penguins['sex'].replace(['MALE','FEMALE',np.nan],[1,0,2])
# Replaces nulls with 2. This might cause issues in the model, since 2 marks missing data rather than a real third category.

In [None]:
penguins.fillna(2)

# Fills every NaN in the whole DataFrame with 2, not only those in the sex column.
# This could work if you modify it to penguins['sex'].fillna(2), although it runs into the same issue as above.


In [None]:
penguins.dropna(subset=['sex'])
# Drops every row with NaN in the sex column. This is the most straightforward option.
# However, it reduces the amount of data in the dataset.
# Passing the column name in a list also works on older pandas versions.
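
For the question of ordering, one reasonable sequence is to drop the rows first and convert afterwards. The sketch below shows that on a hypothetical toy DataFrame; the `cleaned` variable and the use of `.map()` are illustrative choices, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for penguins.csv
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "sex": ["MALE", np.nan, "FEMALE", "FEMALE"],
})

# Step 1: drop rows where sex is missing (returns a new DataFrame)
cleaned = toy.dropna(subset=["sex"]).copy()

# Step 2: convert the remaining strings to numbers
cleaned["sex_numeric"] = cleaned["sex"].map({"FEMALE": 0, "MALE": 1}).astype(int)

print(len(toy), len(cleaned))
print(cleaned["sex_numeric"].tolist())
```

Dropping first means no NaN reaches the conversion, so the new column can be stored as integers. The `.copy()` makes `cleaned` an independent DataFrame, so adding a column to it does not touch `toy`.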


## Challenge 3: Subsetting a DataFrame
1. Modify the .loc[] expression above to subset for all Adelie penguins and save it to a variable `adelie`
2. Calculate the mean body mass for this species (**Hint**: use `.mean()`).
3. Repeat 1-2 for Gentoo and Chinstrap penguins.

In [None]:
adelie = penguins.loc[penguins['species']=='Adelie',:]
chinstrap = penguins.loc[penguins['species']=='Chinstrap',:]
gentoo = penguins.loc[penguins['species']=='Gentoo',:]
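
Step 2 (the mean body mass) can be sketched as follows. This assumes the body mass column is named `body_mass_g`, which does not appear in the code above, so check `penguins.columns` for the actual name. The toy values are made up.

```python
import pandas as pd

# Hypothetical toy data; 'body_mass_g' is an assumed column name
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0, 3750.0],
})

# Subset one species with a Boolean mask, then average a single column
adelie = toy.loc[toy["species"] == "Adelie", :]
adelie_mean = adelie["body_mass_g"].mean()
print(adelie_mean)

# The same calculation for every species at once
means = toy.groupby("species")["body_mass_g"].mean()
print(means)
```

The `groupby` line is an optional shortcut: it gives the same per-species means without creating three separate sub-DataFrames.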

## Challenge 4: Customizing a Plot

One intuition may be that different penguin species have different culmen length/depth, resulting in the pattern observed in the scatterplot above. Let's say we want to explore this pattern by plotting the data for each species in a different color. This will allow us to visualize this pattern if it is present in the data.

The way we implement this in plotting is by plotting individual layers for each species. Most visualizations treat images as "layers" on the backend. This allows us to create customizations to plots pretty easily, because each customization would be a new "layer".

So let's try it! Specifically, we want to visualize the culmen depth vs. the culmen length for each of the penguin species separately. We'll use different colors for each species.

To do this, we assign the first layer to the variable `fig`. This represents our plot (strictly speaking, `.plot()` returns a matplotlib `Axes` object, which is why it can be passed to the `ax` argument). All of our plots thus far have had a single layer. To include multiple layers in a plot, we simply include the argument `ax=fig` in any subsequent layers. This tells `pandas` to put new layers on the original plot rather than to make a new plot.

Follow the steps below to make your own layered visualization!

1. Make three different sub-DataFrames, one for each species, using `.loc[]` and a Boolean mask. (**Hint:** This is the solution to Challenge 3)
2. Plot the first layer and set it equal to `fig`.
3. Plot subsequent layers. Use a different color for each species (look at the documentation for the name of the color parameter). Some possible colors to use are `'green'`, `'red'`, `'purple'`, `'black'`, etc. (Remember to include the argument `ax=fig`!)
4. Do you notice a pattern in culmen measurements based on species? What other elements for the plot would be helpful for interpreting it?

**Bonus:** Add a title and any other modifications to the plot (better x and y labels, for example).

In [None]:
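# One sub-DataFrame per species, selected with a Boolean mask
adelie = penguins.loc[penguins['species'] == 'Adelie', :]
chinstrap = penguins.loc[penguins['species'] == 'Chinstrap', :]
gentoo = penguins.loc[penguins['species'] == 'Gentoo', :]

# First layer: creates the plot and returns the axes it was drawn on
fig = adelie.plot(x='culmen_depth_mm', y='culmen_length_mm', kind='scatter')
# Subsequent layers: ax=fig draws on the existing plot instead of making a new one
gentoo.plot(x='culmen_depth_mm', y='culmen_length_mm', kind='scatter', color='red', ax=fig)
chinstrap.plot(x='culmen_depth_mm', y='culmen_length_mm', kind='scatter', color='green', ax=fig)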
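
For the bonus, a legend, title, and axis labels make the layers easier to interpret. The sketch below is one possible version on a hypothetical toy DataFrame with made-up measurements. The color choices, the loop, and the label text are illustrative, and the `matplotlib.use("Agg")` line is only there so the code runs as a plain script without a display (leave it out in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Hypothetical toy data with the same column names used above
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "culmen_depth_mm": [18.7, 17.4, 13.2, 14.5, 17.9, 19.5],
    "culmen_length_mm": [39.1, 39.5, 46.1, 48.7, 46.5, 50.0],
})

colors = {"Adelie": "purple", "Gentoo": "red", "Chinstrap": "green"}

ax = None  # the first call creates the axes; later calls reuse them
for species, color in colors.items():
    subset = toy.loc[toy["species"] == species, :]
    ax = subset.plot(x="culmen_depth_mm", y="culmen_length_mm", kind="scatter",
                     color=color, label=species, ax=ax)

# Title and axis labels are set on the axes object that .plot() returned
ax.set_title("Culmen length vs. depth by species")
ax.set_xlabel("Culmen depth (mm)")
ax.set_ylabel("Culmen length (mm)")
```

Passing `label=species` gives each layer a legend entry, which addresses the question in step 4 about what else would help a reader interpret the plot.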