# Assignment Pandas 101

In this exercise we'll practice our skills from the Pandas 101 workshop, specifically [DataFrameAccess](DataFrameAccess.ipynb) and [BasicPandasPlotting](BasicPandasPlotting.ipynb). If you want to review later, feel free to watch [the video of the Spring 2022 Pandas 101 workshop](https://warpwire.duke.edu/w/4-MGAA/).

*Don't look at them yet, but the solutions are in [AssignmentPandas101Solutions](AssignmentPandas101Solutions.ipynb).*

## References

Besides the two notebooks referenced above, I've created a new summary of the Pandas 101 workshop material in case you need to review it:

- [Pandas 101 essential review](Pandas101review.ipynb)

---

## Start by importing the Pandas module

We always start by importing the pandas module and giving it the shorter name "pd" so we don't have to write out "pandas" before each command.

In [None]:
import ____ as __

## Load in data from CSV file

Now we want to load in data from a CSV file on **the percentage of women receiving undergraduate degrees in various fields over time**.

- It's stored in a CSV file in the 'data' directory
- The file name is 'percent-bachelors-degrees-women-usa.csv'

In [None]:
df = pd.read_csv(____)
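
As a rough sketch of what `pd.read_csv` does, here is a toy example that reads CSV text from memory instead of from a file. The column names `FieldA` and `FieldB` and the values are made up for illustration; for a real file you pass the path (directory and file name together) as a string.

```python
import io
import pandas as pd

# Hypothetical inline CSV text standing in for a file on disk
csv_text = """Year,FieldA,FieldB
2000,10.5,60.1
2001,11.0,60.4
2002,11.8,61.0
"""

# pd.read_csv accepts a path string or a file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # the data type pandas inferred for each column
```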

### Display the first 6 lines of the DataFrame

In [None]:
df.____(_)

### Display the last lines of the DataFrame

In [None]:
df._____()

### Look at the names of the degree fields (column names)

- Get the column names (they are returned as an Index rather than a plain Python list)

In [None]:
df._____

### Look at the default row names (Index)

- Let's also look at the Index, which holds the "row names"

*We can see that by default we just get a range of sequential integers*

In [None]:
df._____

## Set the DataFrame index to be the Year column

**Some of the tests and plotting will be easier if the Year column is set as the Index of our DataFrame.** We could have done this in the loading stage, but we'll do it here instead.

- Set the Index of the DataFrame to be the Year column
- Display the first five lines of the DataFrame to check the result.

In [None]:
df = df.____('Year')
df.____()
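
To illustrate the two routes mentioned above (setting the Index after loading versus during loading), here is a small sketch on made-up data. The `FieldA` and `FieldB` columns are hypothetical, and the CSV text is read from memory rather than from the assignment's file.

```python
import io
import pandas as pd

# Hypothetical CSV text, not the assignment data
csv_text = "Year,FieldA,FieldB\n2000,10.5,60.1\n2001,11.0,60.4\n"

# Option 1: load first, then move a column into the Index.
# set_index returns a new DataFrame, so we assign the result back.
toy = pd.read_csv(io.StringIO(csv_text))
toy = toy.set_index("Year")

# Option 2: name the index column while loading
toy_at_load = pd.read_csv(io.StringIO(csv_text), index_col="Year")

print(toy.index)
print(toy_at_load.index)
```

Either route should leave `Year` as the Index name and the remaining columns unchanged.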

### The Index values are still integers

It's important to know whether the Index data type is integer or string so we know what type of value to use when we're selecting ranges of the data. *("Int64" is just a type of integer, a whole number stored using 64 binary bits.)*

- Display just the DataFrame index

In [None]:
df._____

### Look at the descriptive statistics in a table

One thing I didn't cover in my Pandas 101 workshop was the simple way of getting summary statistics on all of the columns at once:

`df.describe()`

- Just display the results of running the above function

In [None]:
____

### Transpose the descriptive statistics

I find the `df.describe()` display hard to look at because it has so many columns. In this step we will make it more useful by transposing the resulting DataFrame.

- Transpose the descriptive statistics
- Notice that we can just "chain" the operations together by putting a dot between each successive command. They are run left-to-right.

In [None]:
df.describe()._

### Sort the transposed table by the mean to give a ranking of the fields

- Sort the transposed descriptive statistics ascending (which is the default) by the "mean" column
- Do it by chaining the sort operation at the end of the previous operations
- This gives us a ranking of the fields by the mean percentage of women graduates over time, from lowest to highest
- *Again, we are just displaying these results, not storing them in a new variable*

In [None]:
df.describe()._.____(____)
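
Here is a minimal sketch of the same kind of chaining on a toy DataFrame. The field names and values are invented, so the ranking below says nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data with Year as the Index
toy = pd.DataFrame(
    {
        "FieldA": [10.0, 12.0, 14.0],
        "FieldB": [60.0, 61.0, 62.0],
        "FieldC": [30.0, 35.0, 40.0],
    },
    index=pd.Index([2000, 2001, 2002], name="Year"),
)

# Each step returns a new DataFrame, so the calls chain left to right:
# describe() -> transpose with .T -> sort rows by the "mean" column
summary = toy.describe().T.sort_values("mean")

print(summary[["mean", "min", "max"]])
```

After transposing, each field is a row and each statistic is a column, which is why we can sort by "mean".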

---

## DataFrame access using the `df[]` notation

Let's practice accessing the values in our DataFrame:

### Selecting a single column

- Display the values from the Engineering column using the `df[]` notation

In [None]:
df[____]

### Selecting multiple specific columns

- Still use the `df[]` notation, but display just the Engineering and Biology columns, in that order

In [None]:
df[____]
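
As a quick illustration of the difference between the two selections, here is a sketch on a toy DataFrame (the `FieldA`, `FieldB` and `FieldC` names are hypothetical).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"FieldA": [10.0, 12.0], "FieldB": [60.0, 61.0], "FieldC": [30.0, 35.0]},
    index=pd.Index([2000, 2001], name="Year"),
)

# A single column label returns a Series
one = toy["FieldB"]

# A list of labels returns a DataFrame, with columns in the listed order
two = toy[["FieldC", "FieldA"]]

print(type(one).__name__)
print(type(two).__name__, list(two.columns))
```

Note the double square brackets in the second selection: the outer pair is the `df[]` notation and the inner pair is a Python list.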

### Use the same notation to select a single column of any DataFrame

- Use the `df.describe()` method from above to get summary statistics on the DataFrame
- Again, chain the commands together starting with `df` to produce the transposed DataFrame sorted by the 'mean' column
- Now add the square brackets notation `[]` to the end of the command to just return the 'mean' column of the result

In [None]:
____

## DataFrame access using the `df.loc[]` notation

The `df.loc[]` notation is very readable since it allows us to state explicitly the names of the rows and columns *(hint: in that order)* that we're selecting from the DataFrame.

- Display (select) the Engineering value from the year 2000 using the `df.loc[]` notation

In [None]:
df.loc[____,____]

### Using the slice ":" notation for selections

Similar to selecting a range of values from a Python list, the slice `:` or `start:end` notation allows us to select a range from the rows and/or columns of a DataFrame.

#### Bounded range

- Select the range of years 2000-2008 (inclusive) from the Engineering column
- *Note that, unlike slicing a Python list, label-based slicing with `df.loc[]` includes the `end` value in what's returned*

In [None]:
df.loc[____,____]
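
The sketch below contrasts a label slice with `.loc[]` against a positional slice of a plain Python list, using made-up values. Because the Index holds integers, the row labels are written as integers, not strings.

```python
import pandas as pd

# Hypothetical toy data with integer year labels
years = [1998, 1999, 2000, 2001, 2002]
toy = pd.DataFrame(
    {"FieldA": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.Index(years, name="Year"),
)

# Row label, then column label: returns a single value
value = toy.loc[2000, "FieldA"]

# Label slice: both the start (1999) and the end (2001) are included
by_label = toy.loc[1999:2001, "FieldA"]

# Positional slice of a list: the end position (3) is excluded
by_position = years[1:3]

print(value)
print(list(by_label.index))
print(by_position)
```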

#### Unbounded range

- Still using the `df.loc[]` notation
- Select from the Biology column the years 2005 to the end of the rows

#### Whole range

- Still using the `df.loc[]` notation
- Select all of the years from the English column
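
As a sketch of open-ended and whole-range slices, here is the same idea on toy data (the `FieldA` column and its values are hypothetical). Leaving out one side of the `:` means "from the beginning" or "to the end".

```python
import pandas as pd

# Hypothetical toy data with integer year labels
toy = pd.DataFrame(
    {"FieldA": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.Index([1998, 1999, 2000, 2001, 2002], name="Year"),
)

# No end value: from 2001 through the last row
from_2001 = toy.loc[2001:, "FieldA"]

# No start value: from the first row through 1999 (inclusive)
up_to_1999 = toy.loc[:1999, "FieldA"]

# Bare colon: every row
all_rows = toy.loc[:, "FieldA"]

print(list(from_2001.index))
print(list(up_to_1999.index))
print(len(all_rows))
```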

## Boolean Series

### Testing a column against a value to get a Boolean Series

There was a series of years when Computer Science had more than 25% women graduates, but that is no longer the case.

- Generate a boolean (True/False) Series showing when the "Computer Science" column values were greater than 25

### Boolean Series from comparing two columns

Up until the 1980s, Math and Statistics had a higher percentage of women graduates than Biology.

- Generate a boolean (True/False) Series from testing when "Math and Statistics" was greater than "Biology"
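
Both kinds of comparison can be sketched on toy data as follows. The column names and numbers are invented; the point is only that a comparison returns one True/False value per row.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"FieldA": [20.0, 30.0, 40.0], "FieldB": [35.0, 32.0, 28.0]},
    index=pd.Index([2000, 2001, 2002], name="Year"),
)

# Column compared against a single value
above_25 = toy["FieldA"] > 25

# Column compared against another column, row by row
a_over_b = toy["FieldA"] > toy["FieldB"]

print(above_25.tolist())
print(a_over_b.tolist())
```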

### Selecting rows where a condition is True

We can use a Boolean Series to select/return/display the rows of a DataFrame where the Series value is True.

- Display all the columns of the DataFrame for the rows where "Math and Statistics" was greater than "Biology"
- *Note: Try it with both the `df[]` notation and the `df.loc[]` notation*
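
Here is a minimal sketch of row selection with a Boolean Series, again on made-up data with hypothetical column names. Both notations should return the same rows; `df.loc[]` additionally lets us pick columns at the same time.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"FieldA": [20.0, 30.0, 40.0], "FieldB": [35.0, 32.0, 28.0]},
    index=pd.Index([2000, 2001, 2002], name="Year"),
)

# One True/False value per row
mask = toy["FieldA"] > toy["FieldB"]

# Keep only the rows where the mask is True
rows_brackets = toy[mask]
rows_loc = toy.loc[mask]

# With .loc[] we can also restrict the columns in the same step
rows_one_col = toy.loc[mask, ["FieldA"]]

print(rows_brackets)
print(rows_one_col)
```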

---

## Basic Pandas plotting

### Plot all of the data together as a line plot of percents over years

Create a line plot with a built-in Pandas plotting routine, where the x-axis is Year (remember that's in the Index now, not a normal column), and the y-axis will be percent, with all of the field columns plotted.

- A lot of these options are the default!
- Don't worry if the legend covers a lot of the plot for now...
- If you want to suppress the text output for the plot (for example `<AxesSubplot:xlabel='Year'>`; the exact text depends on your Matplotlib version), add a semicolon to the end of the command

In [None]:
df.____.____()

### Plot each of the lines in its own subplot (small multiples)

- Same plot as above, but now as "subplots"
- Leave everything else (besides the subplots) as default for now
- It's going to look bad, but we'll change that below

In [None]:
df.____.____(____=True);

### Specify a figure size for the group of small multiples

- Leave everything else besides the subplots as default for now
- So we can see the plots, set the size of the figure to 4,12 (width, height in inches)

In [None]:
df.____.____(____=True, ____=____);

### Arrange the subplots in a grid and resize plots

The default arrangement of the subplots we saw above is way too tall, but we can lay out a grid of subplots to make it easier to look at.

- Plot again, but this time use a 5 x 4 grid layout of subplots
- Set a size for the overall figure of 12,12

In [None]:
df.____.____(____=True, ____=____, ____=____);

### Make sure all the subplot y-axis ranges are the same!

- Plot again, this time **sharing Y axes** (so the y-axis limits are the same on all the plots)
- Use a 5 x 4 grid layout of subplots
- Set the figure size to 12,12

*Notice how this gives you a much better sense of the data!*

In [None]:
df.____.____(____=True, ____=____, ____=____, ____=True);
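
To see how the subplot options fit together, here is a sketch on a toy DataFrame with four invented fields, laid out in a 2 x 2 grid. It assumes Matplotlib is installed alongside pandas; the field names, values and figure size are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: four fields, so a 2 x 2 grid is exactly filled
toy = pd.DataFrame(
    {
        "FieldA": [10.0, 12.0, 14.0],
        "FieldB": [60.0, 61.0, 62.0],
        "FieldC": [30.0, 35.0, 40.0],
        "FieldD": [80.0, 79.0, 78.0],
    },
    index=pd.Index([2000, 2001, 2002], name="Year"),
)

# layout is (rows, columns); figsize is (width, height) in inches;
# sharey=True makes every subplot use the same y-axis limits
axes = toy.plot.line(subplots=True, layout=(2, 2), figsize=(6, 6), sharey=True)

# The call returns an array of Axes objects, one per grid cell
print(axes.shape)
print(axes[0, 0].get_ylim() == axes[1, 1].get_ylim())
```

With shared y-axes, a field that stays near 10% and one that stays near 80% are drawn on the same scale, which makes the small multiples directly comparable.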

### Bar chart of sorted summary statistics from above

- Use the `df.describe()` command from above
- Transpose, sort, and select out only the "mean" column from the resulting DataFrame
- Create a horizontal bar chart of the transposed, sorted, mean values
- Again, chain the commands together starting with `df`