Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

This lab is graded based on completion. Every question is worth one point, and you will get credit as long as you attempt the question. 

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

# Lab 2: Pandas Overview

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations and tools in Pandas. At the end of this lab you should be able to:

* Create dataframes using pandas
* Slice dataframes (i.e., select rows and columns)
* Filter data (using boolean arrays)
* Aggregate and group data in dataframes


In this lab, you are going to use several pandas tools such as `drop()`, `.loc`, and `groupby()`. Remember that you can press `shift+tab` on any method to see the documentation for that method. We highly recommend looking through the [Pandas documentation](https://pandas.pydata.org/) to get a sense of the functionality and possibilities!

## Setup

In [None]:
# Import the following packages. Note the shorthand for pandas.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Part 1: Creating DataFrames & Basic Manipulations

A [dataframe](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) is a two-dimensional labeled data structure with rows of unique observations and columns holding data of potentially different types.

**Method 1:** You can create a dataframe by specifying the columns and values as shown below.

Notice the syntax: you're passing a dictionary into the `DataFrame` constructor. The keys become the column names (e.g. `'name'`), and the values are lists holding each column's data (e.g. `['Peter', ...]`).

In [None]:
animals = pd.DataFrame(
    data={'name': ['Peter', 'Nutkin', 'Hunca Munca', 'Jemima'],
          'species': ['rabbit', 'squirrel', 'mouse', 'duck']
          })
animals

**Method 2:** You can also define a dataframe by specifying the rows like below.

Here, you're passing in tuples for each row of data (e.g. `("Peter", "rabbit")`) and specifying the column names separately.

In [None]:
animals2 = pd.DataFrame(
    [("Peter", "rabbit"), ("Nutkin", "squirrel"), ("Hunca Munca", "mouse"),
     ("Jemima", "duck")], 
    columns = ["name", "species"])
animals2

**Other methods**: Usually you won't be creating dataframes in such a manual way. You'll often be loading dataframes from other file types, for example comma-separated values (CSV) files. More on that later.
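As a preview, here is a minimal sketch of loading the same table from CSV text. The CSV content is made up for illustration and is read from an in-memory string rather than a real file; with a file on disk you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """name,species
Peter,rabbit
Nutkin,squirrel
Hunca Munca,mouse
Jemima,duck
"""

# pd.read_csv accepts a file path or any file-like object;
# the first line of the text becomes the column names.
animals_from_csv = pd.read_csv(io.StringIO(csv_text))
print(animals_from_csv.shape)
```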

You can obtain the dimensions of a dataframe by using the shape attribute, `dataframe.shape`

In [None]:
print(animals.shape)

(num_rows, num_columns) = animals.shape
num_rows, num_columns

### Question 1

You can add a column using the syntax `dataframe['new column name'] = [data]`. Add a column called `favorite food` to the `animals` table which contains the strings 'nut', 'carrot', 'corn', and 'cheese'. Use your best guess as to which animal prefers which food. 

In [None]:
# YOUR CODE HERE

In [None]:
animals

### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the `favorite food` column you created, and save the new dataframe (without the `favorite food` column) to `animals_original`. Some notes:

* You'll need to look up `drop` to figure out the right syntax.
* Make sure to use the `axis` parameter correctly

In [None]:
# YOUR CODE HERE

In [None]:
animals_original

In [None]:
assert animals_original.shape[1] == 2

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `animals_original` so they begin with a capital letter. Set the `inplace` parameter correctly to change the `animals_original` dataframe. (hint: in Question 2, `drop` creates and returns a new dataframe instead of changing `animals` because `inplace` by default is `False`)
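To see what `inplace` changes, here is a small sketch on a made-up `scores` table. It uses `sort_values` (another method that accepts `inplace`) so it doesn't give away the answer; the table and variable names are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame({"player": ["a", "b", "c"], "points": [3, 1, 2]})

# Default (inplace=False): a new dataframe is returned and `scores` is untouched.
sorted_copy = scores.sort_values(by="points")
order_before = list(scores["points"])

# inplace=True: `scores` itself is modified and the call returns None.
result = scores.sort_values(by="points", inplace=True)
order_after = list(scores["points"])

print(order_before, order_after, result)
```

A common mistake is writing `df = df.some_method(..., inplace=True)`, which overwrites `df` with `None`.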

In [None]:
# YOUR CODE HERE

In [None]:
animals_original

In [None]:
assert animals_original.columns[1] == 'Species' # the column number might be different for you

*Background*: For the curious, the field values you just worked with were inspired by [Beatrix Potter's](https://en.wikipedia.org/wiki/Beatrix_Potter) characters.

## Part 2: CalEnviroScreen Data
Now that we have learned the basics, we'll use Pandas to wrangle a real-world dataset. Specifically, we will be working with the [California Communities Environmental Health Screening Tool (CalEnviroScreen)](https://oehha.ca.gov/calenviroscreen), which uses demographic and environmental information to identify communities that are susceptible to various types of pollution. The various fields in this dataset contribute to the CES score, which reflects a community's environmental conditions and its vulnerability to environmental pollutants.

Your lab02 folder contains an Excel file downloaded from [here](https://oehha.ca.gov/calenviroscreen/report/draft-calenviroscreen-40).

Start by running the cell below, which creates an Excel file object in Pandas that we can then inspect. The cell below shows you the sheet names in the spreadsheet. Documentation on Pandas' Excel methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). 

In [None]:
# run this cell to import the Excel file and see the names of the tabs
filename = 'CalEnviroScreen_4.0Excel_ADA_D1_2021.xlsx'
xl = pd.ExcelFile(filename)
print(xl.sheet_names) # display a list of the sheets in the spreadsheet

Run the cell below to load the first sheet of the Excel file and assign it to the variable `ces4`. 

In [None]:
ces4 = xl.parse(xl.sheet_names[0]) # load the first sheet as a Pandas dataframe
ces4.head()

Note that the dataframe contains 58 columns, but Pandas truncates the number we are able to see at once. We can show all columns using the `pd.set_option` function.

In [None]:
pd.set_option('display.max_columns', None)
ces4.head()

Notice that this dataset doesn't include the units in many of the column headings. Let's load a different sheet to get more information about what we're looking at.

Run the following cell to load the data dictionary.

In [None]:
dd = xl.parse('Data Dictionary', header = 6)
dd.head(10)

### Question 4
The length of a dataframe is equivalent to its number of rows. Find the length of `ces4`. What does each row represent?

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

## Slicing Data Frames - selecting rows and columns


### Selection Using Label

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). General usage looks like `frame.loc[rowname, colname]`. (Remember that a bare colon `:` means "everything".)

- You can also slice across columns. For example, `ces4.loc[:, 'ZIP':]` would select all rows in the column `ZIP` and every column to the right.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `frame['colname']`.
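The column-selection forms above can be sketched on a small made-up table. The `toy` dataframe and its values are hypothetical and only there to make the example self-contained.

```python
import pandas as pd

toy = pd.DataFrame({
    "tract": [1, 2, 3],
    "ZIP": [94704, 94720, 95814],
    "score": [10.0, 20.0, 30.0],
})

# One column by label, all rows: both forms return the same Series.
zip_loc = toy.loc[:, "ZIP"]
zip_brackets = toy["ZIP"]

# Column slice by label: "ZIP" and every column to its right.
right_of_zip = toy.loc[:, "ZIP":]
print(list(right_of_zip.columns))
```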

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` indexer. In this case, the "label" of each row refers to its value in the dataframe's index (similar to a primary key).

In [None]:
#Example:
ces4.loc[100:110, 'ZIP']

In [None]:
#Example:  Notice the difference between these two methods
ces4.loc[100:110, ['ZIP']]
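
The difference between the two calls above is the type of the result: a single column label gives back a `Series`, while a list of labels (even a list of one) gives back a `DataFrame`. Here is a minimal sketch on a made-up table; `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"ZIP": [94704, 94720, 95814], "score": [10.0, 20.0, 30.0]})

as_series = toy.loc[0:1, "ZIP"]    # single label -> one-dimensional Series
as_frame = toy.loc[0:1, ["ZIP"]]   # list of labels -> two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```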

The `.loc` indexer actually uses the index (the bolded, leftmost series in the dataframe) rather than the row position to perform the selection. In the previous example, it's just a coincidence that the `.loc` syntax resembles array slicing syntax: the index and row position aren't always the same value. (Even here the two differ slightly: a `.loc` slice includes its end label, so `100:110` returns 11 rows.) For example, you could set your index to a non-numeric code, like census tract or other unique ID, if that's how you want to identify your records.

Alternatively, we can use [`.iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to slice the dataframe using row position and column position.
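Before moving to the real data, here is a small sketch of label-based versus position-based selection. The `toy` table and its index labels (10, 20, 30, 40) are made up so that labels and positions clearly differ.

```python
import pandas as pd

toy = pd.DataFrame(
    {"county": ["A", "B", "C", "D"], "score": [10, 20, 30, 40]},
    index=[10, 20, 30, 40],  # index labels, not row positions
)

by_label = toy.loc[20, "score"]           # row whose index label is 20
by_position = toy.iloc[1, 1]              # second row, second column

label_slice = toy.loc[20:40, "score"]     # labels 20 through 40, end included
position_slice = toy.iloc[1:3, 1]         # positions 1 and 2, end excluded

# toy.loc[1, "score"] would raise a KeyError: no row is labeled 1.
print(len(label_slice), len(position_slice))
```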

See the following example:

In [None]:
#Example: We change the index from 0,1,2... to the Census Tract column
df = ces4.set_index("Census Tract") # Why might we want to use Census Tract instead of County or City?
df.head()

We can now look up rows by name directly:

In [None]:
df.loc[[6037205120, 6019000200], :]

However, if we want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [None]:
#Example: 
# df.loc[2:5,"Year"] # You can't do this
df.iloc[1:4,6:9]

### Question 5

Selecting multiple columns is easy using `.loc`. You just need to supply a list of column names. Select the `California County`, `Diesel PM Pctl`, and `PM2.5 Pctl` columns **in that order** from the `ces4` table, and assign the result to `dsl_PM`.

In [None]:
# YOUR CODE HERE

In [None]:
dsl_PM.head()

In [None]:
assert dsl_PM.shape == (8035, 3)
assert dsl_PM.columns[1] == "Diesel PM Pctl"

As you may have noticed above, `.loc` is also a way to reorder the columns within a dataframe.
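A quick sketch of reordering on a made-up three-column table (the names `toy` and `reordered` are hypothetical): the columns come back in the order you list them, and the original dataframe is left unchanged.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Columns are returned in the order given in the list.
reordered = toy.loc[:, ["c", "a", "b"]]

print(list(toy.columns), list(reordered.columns))
```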

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point, whether to clear out cases with missing values, cull fishy outliers, or analyze subgroups of your data set. Example usage looks like `df[df['column name'] < 5]`. Note that each condition in a compound expression has to be wrapped in parentheses, because `&` and `|` bind more tightly than the comparison operators.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)

In the following we construct the DataFrame containing only census tracts in Sacramento County.

In [None]:
ces4_SC = ces4[ces4['California County'] == 'Sacramento '] # Note the space after "Sacramento." This kind of quirk can be remedied with some simple data cleaning techniques, which we'll discuss in future lessons.
ces4_SC.head()
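
The same pattern extends to compound conditions. Below is a minimal sketch on a made-up table (the `toy` dataframe, its county names, and its scores are all hypothetical) showing a boolean mask on its own and combined with `&`, `|`, and `~`.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["Yolo", "Yolo", "Kern", "Kern"],
    "score": [12.0, 48.0, 75.0, 30.0],
})

# A comparison on a column yields a boolean Series, one value per row.
is_kern = toy["county"] == "Kern"
is_high = toy["score"] >= 40

kern_rows = toy[is_kern]

# Inline conditions need parentheses around each comparison.
kern_and_high = toy[(toy["county"] == "Kern") & (toy["score"] >= 40)]

kern_or_high = toy[is_kern | is_high]
not_kern = toy[~is_kern]

print(len(kern_rows), len(kern_and_high), len(kern_or_high), len(not_kern))
```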

### Question 6
Select the census tracts in Alameda County whose CES 4.0 Percentile is 90 or higher, and assign the result to `AC_highCES`.

(If you use condition `p` & condition `q` to filter the dataframe, make sure to use `df[(p) & (q)]`)

Hint: The county names are not "clean." Try using the `.unique()` method to look up the **exact** county names. 
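To illustrate why the exact name matters, here is a small sketch with made-up county names, one of which carries a trailing space. The data is hypothetical; check the real names in `ces4` yourself.

```python
import pandas as pd

toy = pd.DataFrame({"county": ["Yolo ", "Kern", "Yolo ", "Kern"]})

# .unique() lists each distinct value once, in order of first appearance.
names = toy["county"].unique()
print([repr(name) for name in names])  # repr makes stray spaces visible

# An exact comparison only matches when the whitespace matches too.
exact_match = toy[toy["county"] == "Yolo "]
naive_match = toy[toy["county"] == "Yolo"]
print(len(exact_match), len(naive_match))
```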

In [None]:
# YOUR CODE HERE

In [None]:
AC_highCES

In [None]:
assert len(AC_highCES) == 10
assert np.isclose(AC_highCES["DRAFT CES 4.0 Percentile"].max(), 97.9571248423707) # compare floats with a tolerance

## Data Aggregation

### Question 7a
We can perform operations across columns and rows of a dataframe to generate summary statistics. 

Find the mean, minimum, maximum, and standard deviation of PM2.5 across all census tracts. (You may use the `ces4` DataFrame created above.)
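As a warm-up, here is a sketch of summary statistics on a made-up `Series` (the values are hypothetical). It uses `.agg` to compute several statistics at once; each one is also available as its own method, such as `.mean()`. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0])

# .agg with a list of names returns a Series indexed by statistic name.
summary = values.agg(["mean", "min", "max", "std"])
print(summary)

# Population standard deviation, for comparison with the default.
population_std = values.std(ddof=0)
```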

In [None]:
# YOUR CODE HERE

pm_mean = ...
pm_min = ...
pm_max = ...
pm_stdev = ...

print('Mean: {}'.format(pm_mean))
print('Min: {}'.format(pm_min))
print('Max: {}'.format(pm_max))
print('Standard deviation: {}'.format(pm_stdev))

### Question 7b

What is the total population represented in the CalEnviroScreen 4.0 dataset?

In [None]:
# YOUR CODE HERE

### Question 8
To count the number of instances of a value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. 

Count the number of different census tracts in each California county, and assign the result to `num_censusTracts`. (You may use the `ces4` DataFrame created above.) In other words, compute the number of rows in the table for each county.
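Here is a minimal sketch of `value_counts()` on a made-up column; the `toy` table and county names are hypothetical. The result is a `Series` indexed by the distinct values, sorted from most to least frequent.

```python
import pandas as pd

toy = pd.DataFrame({"county": ["Yolo", "Kern", "Yolo", "Yolo", "Kern", "Napa"]})

counts = toy["county"].value_counts()
print(counts)

# Look up a single count by its label.
yolo_rows = counts["Yolo"]
```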

In [None]:
# YOUR CODE HERE

In [None]:
num_censusTracts

In [None]:
assert num_censusTracts["Alameda "] == 360
assert num_censusTracts.sum() == len(ces4)

### Question 9a

A more versatile way to aggregate data is to use the `.groupby()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). Find the sum of `Imp. Water Bodies` for each `California County` in the `ces4` table, and assign the result to `sum_impH2O`. Use the syntax `df.groupby("col_name").sum()`.
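The grouping idea can be sketched on a made-up table; `toy`, `sites`, and `label` are hypothetical names. This sketch selects the numeric column before summing, which is one way to keep the result predictable, since how `.sum()` treats text columns depends on your pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["Yolo", "Kern", "Yolo", "Kern"],
    "sites": [1, 4, 2, 0],
    "label": ["low", "high", "low", "low"],
})

# Group rows by county, keep only the numeric column, then sum within each group.
# The double brackets keep the result a DataFrame indexed by county.
totals = toy.groupby("county")[["sites"]].sum()
print(totals)
```

The grouping column becomes the index of the result, so individual totals can be looked up with `totals.loc["Yolo", "sites"]`.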

In [None]:
# YOUR CODE HERE

In [None]:
# The syntax below will sort sum_impH2O from the most to the fewest impaired water bodies.
sum_impH2O.sort_values(by = "Imp. Water Bodies", ascending=False)['Imp. Water Bodies']

In [None]:
assert sum_impH2O.loc["Los Angeles", "Imp. Water Bodies"] == 7178
assert sum_impH2O.sort_values(by = "Imp. Water Bodies", ascending=False).index[3] == "Ventura "

### Question 9b
Take a look at the Data Dictionary. What does the sum of the `Imp. Water Bodies` column represent? In the process you'll read about "buffers". What is a buffer?

In [None]:
# SCRATCH WORK HERE

*YOUR ANSWER HERE*

### Question 9c

What do the values in `ZIP` represent in the dataframe `sum_impH2O`? Why is the column `DRAFT CES 4.0 Percentile Range` no longer present in the dataframe?

*YOUR ANSWER HERE*

### Question 9d

Find the mean of `Poverty` for each county for census tracts with population greater than or equal to 3,000 and with a `Pollution Burden Pctl` above 85. Assign the result to `poverty_mean`.


In [None]:
# YOUR CODE HERE

In [None]:
poverty_mean.sort_values(by = "Poverty", ascending=False)['Poverty']

In [None]:
assert np.round(poverty_mean.loc["Alameda ", "Poverty"],2) == 33.39
assert len(poverty_mean) == 30

### Question 9e

What does your output to 9d represent? Dig into the data a little further and tell us what you notice about the `Poverty` field values in the county/tract combinations that show up in your result to 9d, compared with the `Poverty` field values for all tracts.

*YOUR ANSWER HERE*

#### You are done! Remember to submit this lab on bCourses in HTML format after clicking Kernel -> Restart & Run All.