# Data Analysis and Visualization in Python
## Indexing, Slicing and Subsetting DataFrames in Python
Questions
* How can I access specific data within my data set?
* How can Python and Pandas help me to analyse my data?

Objectives
* Describe what 0-based indexing is.
* Manipulate and extract data using column headings and index locations.
* Employ slicing to select sets of data from a DataFrame.
* Employ label and integer-based indexing to select ranges of data in a DataFrame.
* Reassign values within subsets of a DataFrame.
* Create a copy of a DataFrame.
* Query/select a subset of data using a set of criteria using the following operators: `==`, `!=`, `>`, `<`, `>=`, `<=`.
* Locate subsets of data using masks.
* Describe BOOLEAN objects in Python and manipulate data using BOOLEANs.

## Loading our data

In [None]:
# first make sure pandas is loaded
import pandas as pd

# read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv")

## Indexing & Slicing in Python
### Selecting Data Using Labels (Column Headings)

In [None]:
# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id']

In [None]:
# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id

In [None]:
# Create an object named surveys_species that only contains the `species_id` column
surveys_species = surveys_df['species_id']

In [None]:
# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'plot_id']]

In [None]:
# What happens when you flip the order?
surveys_df[['plot_id', 'species_id']]

In [None]:
# What happens if you ask for a column that doesn't exist?
surveys_df['speciess']

### Extracting Range based Subsets: Slicing

In [None]:
# Create a list of numbers:
a = [1, 2, 3, 4, 5]

![Indexing: getting a specific element](../fig/slicing-indexing.png)
![Slicing: selecting a set of elements](../fig/slicing-slicing.png)

### Exercise - Indexing

In [None]:
# What value does the code below return?
a[0]

In [None]:
# How about this:
a[5]

In the example above, calling `a[5]` returns an error. Why is that?

In [None]:
# What about?
a[len(a)]
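
The indexing exercise above can be summarised with a minimal sketch. It assumes the same five-element list `a`; the helper names (`first`, `last`, `also_last`, `out_of_range`) are introduced here purely for illustration. Because Python counts from 0, the last valid index is `len(a) - 1`, so `a[5]` and `a[len(a)]` both point one position past the end of the list.

```python
# The same list used in the exercise above
a = [1, 2, 3, 4, 5]

first = a[0]            # index 0 is the first element
last = a[len(a) - 1]    # the last valid index is len(a) - 1, here 4
also_last = a[-1]       # negative indices count from the end

# Asking for index len(a) (here 5) goes one past the end of the list
try:
    a[len(a)]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, also_last, out_of_range)
```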

### Slicing Subsets of Rows in Python

In [None]:
# Select rows 0, 1, 2 (row 3 is not selected)
surveys_df[0:3]

In [None]:
# Select the first 5 rows (rows 0, 1, 2, 3, 4)
surveys_df[:5]

In [None]:
# Select the last row of the DataFrame
surveys_df[-1:]

### Copying Objects vs Referencing Objects in Python

In [None]:
# Using the 'copy()' method
true_copy_surveys_df = surveys_df.copy()

# Using the '=' operator
ref_surveys_df = surveys_df

In [None]:
# Assign the value `0` to the first three rows of data in the DataFrame
ref_surveys_df[0:3] = 0

In [None]:
surveys_df.head()

In [None]:
true_copy_surveys_df.head()
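
The difference between copying and referencing can also be seen on a small scale. The sketch below uses a hypothetical four-row `demo_df` as a stand-in for `surveys_df` (the values are made up for illustration) and `.iloc` for the assignment. The `=` operator only creates a second name for the same object, so a change made through that name shows up in the original, while the result of `copy()` is unaffected.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the values are made up
demo_df = pd.DataFrame({"plot_id": [2, 3, 2, 7],
                        "weight": [40.0, 48.0, 29.0, 46.0]})

true_copy = demo_df.copy()   # an independent copy of the data
ref = demo_df                # a second name for the same object

# Overwrite the first two rows through the reference
ref.iloc[0:2, :] = 0

print(ref is demo_df)                  # both names point at one object
print(demo_df["plot_id"].tolist())     # the original has changed
print(true_copy["plot_id"].tolist())   # the copy still holds the old values
```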

In [None]:
# Reload data from file
surveys_df = pd.read_csv("../data/surveys.csv")
surveys_df.head()

### Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

* `loc`: primarily label-based indexing. Integers may be used, but they are interpreted as labels.
* `iloc`: primarily integer-based indexing.
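
As a minimal sketch of the difference between the two indexers, the example below uses a hypothetical five-row `demo_df` (made-up values, default integer index). With `iloc`, a slice behaves like a Python list slice and excludes the end position. With `loc`, the slice bounds are labels and both ends are included.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the values are made up
demo_df = pd.DataFrame({"month": [7, 7, 8, 8, 9],
                        "plot_id": [2, 3, 2, 7, 3],
                        "weight": [40.0, 48.0, 29.0, 46.0, 36.0]})

# Integer positions: rows 0, 1, 2 and columns 0, 1 (end positions excluded)
by_position = demo_df.iloc[0:3, 0:2]

# Labels: rows labelled 0 through 3 and columns 'month' through 'plot_id'
# (both end labels included)
by_label = demo_df.loc[0:3, "month":"plot_id"]

print(by_position.shape)  # (3, 2)
print(by_label.shape)     # (4, 2)
```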


In [None]:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]

In [None]:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]

In [None]:
# What does this do?
surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]

In [None]:
# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]

In [None]:
surveys_df.iloc[2,6]

### Exercise - Range
What happens when you execute:

In [None]:
surveys_df[0:1]

In [None]:
surveys_df[:4]

In [None]:
surveys_df[:-1]

In [None]:
surveys_df.iloc[0:4, 1:4]

In [None]:
surveys_df.loc[0:4, 1:4] # 'month':'plot_id'

## Subsetting Data Using Criteria

In [None]:
surveys_df[surveys_df.year == 2002]

In [None]:
surveys_df[surveys_df.year != 2002]

In [None]:
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]

You can use the syntax below when querying data from a DataFrame. Experiment with selecting various subsets of the "surveys" data.

* Equals: `==`
* Not equals: `!=`
* Greater than, less than: `>` or `<`
* Greater than or equal to `>=`
* Less than or equal to `<=`
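
To see what these operators produce, here is a small sketch on a hypothetical `demo_df` (made-up values). Each comparison returns a Series of `True`/`False` values, one per row, and placing that Series inside the square brackets keeps only the rows marked `True`. When combining conditions with `&`, each comparison needs its own parentheses, because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the values are made up
demo_df = pd.DataFrame({"year": [1979, 1980, 1983, 1985, 2002],
                        "weight": [40.0, 7.0, 8.0, 52.0, 5.0]})

# Each comparison yields one True/False value per row;
# the parentheses around each comparison are required when using &
mask = (demo_df["year"] >= 1980) & (demo_df["year"] <= 1985)

# Keep only the rows where the mask is True
subset = demo_df[mask]

print(mask.tolist())  # [False, True, True, True, False]
print(len(subset))    # 3
```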

### Exercises - Advanced Selection Challenges
`1`. Select a subset of rows in the `surveys_df` DataFrame that contain data from the year 1999 and that contain weight values less than or equal to 8. How many rows did you end up with? What did your neighbor get?

In [None]:
surveys_df[(surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8)]

In [None]:
sum((surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8))

`2`. You can use the pandas `isin` method to query a DataFrame based upon a list of values as follows:
```
surveys_df[surveys_df['species_id'].isin([listGoesHere])]
```
Use the `isin` method to find all plots that contain particular species (PB and PL) in the surveys DataFrame. How many records contain these values?

In [None]:
surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])]['plot_id'].unique()

In [None]:
surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])].shape

`3`. Create a query that finds all rows with a weight value greater than or equal to 0.

In [None]:
surveys_df[surveys_df["weight"] >= 0]

`4`. The `~` symbol can be used to return the OPPOSITE of the selection that you specify. Applied to an `isin` selection, it is equivalent to **is not in**. Write a query that selects all rows with sex values that are NOT ‘M’ or ‘F’ in the surveys data.

In [None]:
surveys_df[~surveys_df["sex"].isin(['M', 'F'])]
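
The effect of `~` is easiest to see on the mask itself. The sketch below uses a hypothetical `demo_df` with two missing sex values (made-up data): `isin` marks the rows whose value is in the list, and `~` flips every `True` to `False` and vice versa, so the missing values end up selected.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; None marks a missing sex value
demo_df = pd.DataFrame({"sex": ["M", "F", None, "M", None]})

is_m_or_f = demo_df["sex"].isin(["M", "F"])   # True where sex is 'M' or 'F'
not_m_or_f = ~is_m_or_f                       # element-wise opposite

print(is_m_or_f.tolist())    # [True, True, False, True, False]
print(not_m_or_f.tolist())   # [False, False, True, False, True]
```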

## Using Masks

In [None]:
# set x to 5
x = 5

In [None]:
# what does the code below return?
x > 5

In [None]:
# how about this?
x == 5

In [None]:
pd.isnull(surveys_df)

In [None]:
# To select just the rows with NaN values, we can use the .any method
surveys_df[pd.isnull(surveys_df).any(axis=1)]

In [None]:
# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)
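
As a small-scale sketch of how these null masks behave, the example below uses a hypothetical `demo_df` with two missing weights (made-up data). `pd.isnull` returns a DataFrame of `True`/`False` values with the same shape as its input, `.any(axis=1)` reduces it to one value per row, and summing a boolean Series counts the `True` entries.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; NaN marks a missing weight
demo_df = pd.DataFrame({"species_id": ["NL", "DM", "PF", "DM"],
                        "weight": [float("nan"), 40.0, float("nan"), 48.0]})

null_mask = pd.isnull(demo_df)              # same shape as demo_df, True where NaN
row_has_nan = null_mask.any(axis=1)         # True if any column in the row is NaN
rows_with_nan = demo_df[row_has_nan]        # keep only the rows with a NaN

n_missing = int(pd.isnull(demo_df["weight"]).sum())  # True counts as 1

print(null_mask.shape)        # (4, 2)
print(row_has_nan.tolist())   # [True, False, True, False]
print(n_missing)              # 2
```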

### Exercises - Removing NaN
`1`. Create a new DataFrame that only contains observations with sex values that are **not** female or male. Assign each sex value in the new DataFrame to a new value of ‘x’. Determine the number of null values in the subset.

In [None]:
nosex = surveys_df[~surveys_df['sex'].isin(['M', 'F'])].copy()

In [None]:
nosex['sex'] = 'x'
nosex.tail()

In [None]:
len(nosex)

`2`. Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by site with male vs female values stacked for each site.

In [None]:
# make sure figures appear inline in the IPython Notebook
%matplotlib inline

# selection of the data with isin
# (each condition needs its own parentheses, because & binds tighter than >)
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) &
                            (surveys_df["weight"] > 0.)][["sex", "weight", "plot_id"]]

# calculate the mean weight for each plot id and sex combination: 
stack_selection = stack_selection.groupby(["plot_id", "sex"]).mean().unstack()

# and we can make a stacked bar plot from this:
stack_selection.plot(kind='bar', stacked=True)

In [None]:
# alternative selection with notnull; each condition is wrapped in parentheses
stack_selection = surveys_df[(surveys_df['sex'].notnull()) &
                    (surveys_df["weight"] > 0.)][["sex", "weight", "plot_id"]]

stack_selection = stack_selection.groupby(["plot_id", "sex"]).mean().unstack()

# The legend header contains two levels. To remove the extra level, simplify the column naming:
stack_selection.columns = stack_selection.columns.droplevel()

stack_selection.plot(kind='bar', stacked=True)