# Data Analysis and Visualization in Python
## Indexing, Slicing and Subsetting DataFrames in Python
Questions
* How can we select a subset of data?

Objectives
* Describe what 0-based indexing is.
* Manipulate and extract data using column headings and index locations.
* Employ slicing to select sets of data from a DataFrame.
* Employ label and integer-based indexing to select ranges of data in a DataFrame.
* Reassign values within subsets of a DataFrame.
* Create a copy of a DataFrame.
* Query/select a subset of data based on a set of criteria, using the following operators: ==, !=, >, <, >=, <=.
* Locate subsets of data using masks.
* Describe BOOLEAN objects in Python and manipulate data using BOOLEANs.

### How to Use Jupyter
When a cell is in edit mode:

  Shortcut  | Description
----------- | -----------
Shift+Enter | Run the cell, and go to the next
Tab         | Indent code or auto-completion
Esc         | Go to command mode

When a cell is in command mode:

  Shortcut   | Description
------------ | -----------
Shift+Enter  | Run the cell, and go to the next
Double-click | Go to edit mode
Enter        | Go to edit mode

  Shortcut   | Description
------------ | -----------
A            | Insert a cell above
B            | Insert a cell below
C            | Copy the current cell
V            | Paste the cell below
D D          | Delete the current cell

To reset all cells:
* Go to the top menu, and select Kernel -> Restart & Clear Output

## Making Sure Our Data Are Loaded

In [None]:
# first make sure pandas is loaded
import pandas as pd

# read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv")

## Indexing & Slicing in Python
### Selecting Data Using Labels (Column Headings)

In [None]:
surveys_df['species']

In [None]:
# This syntax, calling the column as an attribute, gives you the same output
surveys_df.species

In [None]:
# Create an object named surveys_species that only contains the `species` column
surveys_species = surveys_df['species']

In [None]:
# Select the species and plot columns from the DataFrame
surveys_df[['species', 'plot']]

In [None]:
# What happens when you flip the order?
surveys_df[['plot', 'species']]

In [None]:
# What happens if you ask for a column that doesn't exist?
surveys_df['speciess']

### Extracting Range-Based Subsets: Slicing

In [None]:
# Create a list of numbers:
a = [1,2,3,4,5]

![Indexing: getting a specific element](../fig/slicing-indexing.png)
![Slicing: selecting a set of elements](../fig/slicing-slicing.png)

### Exercise - Indexing

In [None]:
# What value does the code below return?
a[0]

In [None]:
# How about this:
a[5]

In [None]:
# Or this?
a[len(a)]

In the example above, calling `a[5]` raises an error. Why is that?
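
One way to explore the answer: Python counts positions from 0, so a list of five elements has valid indices 0 through 4, and `len(a)` is always one past the last valid position. The sketch below reuses the list `a` from above; the other variable names are only illustrative.

```python
# Same list as in the exercise above
a = [1, 2, 3, 4, 5]

# Valid positions run from 0 to len(a) - 1
first = a[0]
last_by_position = a[len(a) - 1]

# Negative indices count from the end, so -1 is the last element
last_by_negative = a[-1]

# Asking for position len(a) goes one past the end and raises IndexError
try:
    a[len(a)]
    error_raised = False
except IndexError:
    error_raised = True

print(first, last_by_position, last_by_negative, error_raised)
```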

### Slicing Subsets of Rows in Python

In [None]:
# select rows 0,1,2 (but not 3)
surveys_df[0:3]

In [None]:
# select the first, second and third rows from the surveys variable
surveys_df[0:3]

In [None]:
# select the first 5 rows (rows 0,1,2,3,4)
surveys_df[:5]

In [None]:
# select the last row of the DataFrame
surveys_df[-1:]

In [None]:
# try to "copy" the surveys DataFrame with plain assignment (is this really an independent copy?)
surveys_copy = surveys_df
surveys_copy[0:3]

In [None]:
# set the first three rows of data in the DataFrame to 0
surveys_copy[0:3] = 0

In [None]:
surveys_copy.head()

In [None]:
surveys_df.head()

### Referencing Objects vs Copying Objects in Python

In [None]:
# Reload data from file
surveys_df = pd.read_csv("../data/surveys.csv")
surveys_copy = surveys_df.copy()
surveys_copy[0:3] = 0
surveys_df.head()
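
The difference between a reference and a copy can also be checked on a tiny example. This is a minimal sketch using a small hypothetical DataFrame (`df`, `ref_df` and `true_copy` are illustrative names, not part of the surveys data): plain assignment gives a second name for the same object, while `.copy()` gives an independent DataFrame.

```python
import pandas as pd

# Small hypothetical DataFrame standing in for surveys_df
df = pd.DataFrame({"plot": [2, 3, 2, 7], "wgt": [40.0, 48.0, 29.0, 46.0]})

# Plain assignment binds a second name to the same object
ref_df = df

# .copy() creates an independent DataFrame
true_copy = df.copy()

# Overwrite the first two rows through the reference
ref_df.iloc[0:2] = 0

# Both names point at one object, so df shows the zeros too
print(ref_df is df)
print(df.head())

# The copy still holds the original values
print(true_copy.head())
```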

### Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

* `loc`: label-based indexing (an integer passed to `loc` is treated as an index label, not a position)
* `iloc`: integer position-based indexing
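
The distinction is easiest to see when index labels and positions differ. The sketch below assumes a small hypothetical DataFrame with index labels 10, 20 and 30 (not the surveys data). It also shows that `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

# Hypothetical data with a non-default integer index,
# so that labels (10, 20, 30) differ from positions (0, 1, 2)
df = pd.DataFrame(
    {"species": ["NL", "DM", "PF"], "plot": [2, 3, 7], "wgt": [40.0, 48.0, 29.0]},
    index=[10, 20, 30],
)

# iloc uses integer positions; the end of a slice is excluded
by_position = df.iloc[0:2, 0:2]  # rows at positions 0 and 1, columns 0 and 1

# loc uses labels; the end of a slice is included
by_label = df.loc[10:20, "species":"plot"]  # rows labelled 10 and 20

# A label that is not in the index raises KeyError
try:
    df.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(by_position)
print(by_label)
print(missing_label)
```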


In [None]:
surveys_df.iloc[0:3, 1:4]

In [None]:
# select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]

In [None]:
# what does this do?
surveys_df.loc[0, ['species', 'plot', 'wgt']]

In [None]:
# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]

In [None]:
surveys_df.iloc[2,6]

## Subsetting Data Using Criteria

In [None]:
surveys_df[surveys_df.year == 2002]

In [None]:
surveys_df[surveys_df.year != 2002]

In [None]:
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]

You can use the operators below when querying data from a DataFrame. Experiment with selecting various subsets of the "surveys" data.

* Equals: `==`
* Not equals: `!=`
* Greater than, less than: `>` or `<`
* Greater than or equal to: `>=`
* Less than or equal to: `<=`
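
As a rough illustration of how these operators behave, each comparison on a column returns a Boolean Series, which can then be used to select rows. The sketch assumes a small hypothetical DataFrame with `year` and `wgt` columns; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the surveys data
df = pd.DataFrame({"year": [1979, 1980, 1985, 2002], "wgt": [40.0, 7.0, 8.0, 52.0]})

# Each comparison returns a Boolean Series with one True/False per row
is_2002 = df["year"] == 2002
light = df["wgt"] <= 8

# Combine conditions with & (and) or | (or). Wrap each comparison in
# parentheses, because & and | bind more tightly than comparison operators.
early_eighties = df[(df["year"] >= 1980) & (df["year"] <= 1985)]
light_or_2002 = df[light | is_2002]

print(early_eighties)
print(light_or_2002)
```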

### Exercises - Advanced Selection Challenges
`1`. Select a subset of rows in the `surveys_df` DataFrame that contain data from the year 1999 and that contain weight values less than or equal to 8. How many rows did you end up with? What did your neighbor get?

In [None]:
surveys_df[(surveys_df["year"] == 1999) & (surveys_df["wgt"] <= 8)]

In [None]:
sum((surveys_df["year"] == 1999) & (surveys_df["wgt"] <= 8))

`2`. You can use the pandas `isin` method to query a DataFrame based on a list of values, as follows:
```
surveys_df[surveys_df['species'].isin([listGoesHere])]
```
Use the `isin` method to find all plots that contain particular species (PB and PL) in the surveys DataFrame. How many records contain these values?

In [None]:
surveys_df[surveys_df['species'].isin(['PB', 'PL'])]['plot'].unique()

In [None]:
surveys_df[surveys_df['species'].isin(['PB', 'PL'])].shape

`3`. Create a query that finds all rows with a weight value greater than or equal to 0.

In [None]:
surveys_df[surveys_df["wgt"] >= 0]

`4`. The `~` operator can be used to return the OPPOSITE of the selection that you specify. Applied to a Boolean mask it inverts every value, so combined with `isin` it is equivalent to **is not in**. Write a query that selects all rows where sex is NOT equal to ‘M’ or ‘F’ in the surveys data.

In [None]:
surveys_df[~surveys_df["sex"].isin(['M', 'F'])]

## Using Masks

In [None]:
# set x to 5
x = 5

In [None]:
# what does the code below return?
x > 5

In [None]:
# how about this?
x == 5

In [None]:
pd.isnull(surveys_df)

In [None]:
# To select just the rows with NaN values, we can use the .any() method
surveys_df[pd.isnull(surveys_df).any(axis=1)]

In [None]:
# what does this do?
empty_weights = surveys_df[pd.isnull(surveys_df).any(axis=1)]['wgt']
empty_weights

In [None]:
empty_weights = surveys_df[pd.isnull(surveys_df['wgt'])]['wgt']
empty_weights
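
The mask logic used in the cells above can be checked on a small example. This is a minimal sketch, assuming a hypothetical three-row DataFrame with `sex` and `wgt` columns; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({"sex": ["M", np.nan, "F"], "wgt": [40.0, 48.0, np.nan]})

# DataFrame of True/False values with the same shape as df
null_mask = pd.isnull(df)

# Rows where at least one column is NaN
rows_with_nan = df[null_mask.any(axis=1)]

# Rows where wgt specifically is NaN
missing_wgt = df[pd.isnull(df["wgt"])]

# ~ inverts the mask, leaving only rows without any NaN
complete_rows = df[~null_mask.any(axis=1)]

print(rows_with_nan)
print(missing_wgt)
print(complete_rows)
```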

### Exercises - Removing NaN
`1`. Create a new DataFrame that only contains observations with sex values that are **not** female or male. Assign each sex value in the new DataFrame to a new value of ‘x’. Determine the number of null values in the subset.

In [None]:
surveys_df[~surveys_df['sex'].isin(['M', 'F'])]

In [None]:
sum(surveys_df['sex'].isnull())

`2`. Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.

In [None]:
# make sure figures appear inline in the IPython Notebook
%matplotlib inline

# selection of the data with isin
# each condition needs its own parentheses, because & binds more tightly than >
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) &
                            (surveys_df["wgt"] > 0.)][["sex", "wgt", "plot"]]

# calculate the mean weight for each plot id and sex combination: 
stack_selection = stack_selection.groupby(["plot", "sex"]).mean().unstack()

# and we can make a stacked bar plot from this:
stack_selection.plot(kind='bar', stacked=True)

In [None]:
# alternative selection with notnull; each condition is wrapped in parentheses
stack_selection = surveys_df[(surveys_df['sex'].notnull()) &
                    (surveys_df["wgt"] > 0.)][["sex", "wgt", "plot"]]

stack_selection = stack_selection.groupby(["plot", "sex"]).mean().unstack()

# The legend header contains two levels. To remove the extra level, simplify the column names:
stack_selection.columns = stack_selection.columns.droplevel()

stack_selection.plot(kind='bar', stacked=True)