# Data Analysis and Visualization in Python
## Indexing, Slicing and Subsetting DataFrames in Python
Questions
* How can I access specific data within my data set?
* How can Python and Pandas help me to analyse my data?

Objectives
* Describe what 0-based indexing is.
* Manipulate and extract data using column headings and index locations.
* Employ slicing to select sets of data from a DataFrame.
* Employ label-based and integer-based indexing to select ranges of data in a DataFrame.
* Reassign values within subsets of a DataFrame.
* Create a copy of a DataFrame.
* Query/select a subset of data using a set of criteria using the following operators: `==`, `!=`, `>`, `<`, `>=`, `<=`.
* Locate subsets of data using masks.
* Describe BOOLEAN objects in Python and manipulate data using BOOLEANs.

## Loading our data

In [None]:
# first make sure pandas is loaded
import pandas as pd

# read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv")

## Indexing & Slicing in Python
### Selecting Data Using Labels (Column Headings)

In [None]:
# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id']

In [None]:
# Create an object named surveys_species that only contains the `species_id` column
surveys_species = surveys_df['species_id']

In [None]:
# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'plot_id']]

### Slicing Subsets of Rows in Python

In [None]:
# Select rows 0, 1, 2 (row 3 is not selected)
surveys_df[0:3]

In [None]:
# Select the first 5 rows (rows 0, 1, 2, 3, 4)
surveys_df[:5]

In [None]:
# Select the last row of the DataFrame
surveys_df[-1:]

### Copying Objects vs Referencing Objects in Python

In [None]:
# Using the 'copy()' method
true_copy_surveys_df = surveys_df.copy()

# Using the '=' operator
ref_surveys_df = surveys_df

In [None]:
# Assign the value `0` to the first three rows of data in the DataFrame
ref_surveys_df[0:3] = 0

In [None]:
surveys_df.head()

In [None]:
true_copy_surveys_df.head()
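
The difference between the two assignments can be sketched on a small stand-in table. The DataFrame `toy_df` and its values are hypothetical, made up for illustration; only `copy()` and plain `=` assignment come from the cells above. The sketch assumes that `=` merely binds a second name to the same object, so a change made through that name also appears in the original, while the copy is left untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for surveys_df
toy_df = pd.DataFrame({"plot_id": [2, 3, 2, 7], "weight": [40, 45, 8, 29]})

true_copy = toy_df.copy()   # independent copy of the data
reference = toy_df          # second name for the same object

# Overwrite the first two rows through the reference
reference.iloc[0:2] = 0

print(reference is toy_df)            # do both names point at one object?
print(toy_df["weight"].tolist())      # the original, after the change
print(true_copy["weight"].tolist())   # the copy, after the change
```

Comparing the last two printed lists shows which object was affected by the assignment.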

In [None]:
# Reload data from file
surveys_df = pd.read_csv("../data/surveys.csv")
surveys_df.head()

### Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and column directions using label indexing.

* `loc`: primarily label-based indexing. Integers may be used, but they are interpreted as labels, not positions.
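
As a minimal sketch of what "interpreted as a label" means, the toy DataFrame below (hypothetical data, not the surveys file) is given index labels that differ from the row positions. `loc[10, ...]` looks up the row labelled 10, whereas `iloc[0, ...]` looks up the row at position 0. The last lines also illustrate that a `loc` slice includes its end label.

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately not 0, 1, 2
toy_df = pd.DataFrame(
    {"species_id": ["NL", "DM", "PF"], "weight": [40.0, 45.0, 8.0]},
    index=[10, 20, 30],
)

# loc treats 10 as a row label; iloc treats 0 as a row position
by_label = toy_df.loc[10, "species_id"]
by_position = toy_df.iloc[0, 0]

# A loc slice includes the end label, unlike a plain Python slice
label_slice = toy_df.loc[10:20, "weight"]

print(by_label, by_position)
print(label_slice)
```

With the default index of `surveys_df` (0, 1, 2, ...), labels and positions happen to coincide, which is why the distinction is easy to miss.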

In [None]:
# What does this do?
surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]

In [None]:
# Select all columns for the rows with index labels 0 and 10
surveys_df.loc[[0, 10], :]

In [None]:
# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]

### Exercise - Range
What happens when you execute:

In [None]:
surveys_df[:4]

In [None]:
surveys_df[:-1]

In [None]:
surveys_df.loc[0:4, 1:4] # loc expects column labels rather than positions, e.g. 'month':'plot_id'

## Subsetting Data Using Criteria

In [None]:
surveys_df[surveys_df["year"] == 2002]

In [None]:
surveys_df[surveys_df["year"] != 2002]

In [None]:
surveys_df[(surveys_df["year"] >= 1980) & (surveys_df["year"] <= 1985)]

You can use the syntax below when querying data from a DataFrame. Experiment with selecting various subsets of the "surveys" data.

* Equals: `==`
* Not equals: `!=`
* Greater than, less than: `>` or `<`
* Greater than or equal to `>=`
* Less than or equal to `<=`
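
The operators above can be combined, as in this minimal sketch on hypothetical toy data (the column names mirror the surveys file, the values are made up). Each comparison produces a Boolean Series, and `&` (and) or `|` (or) combine them element by element. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the surveys file
toy_df = pd.DataFrame(
    {"year": [1979, 1982, 1985, 2002], "weight": [40.0, 7.0, 45.0, 8.0]}
)

# Each comparison yields a Boolean Series, one True/False per row
in_range = (toy_df["year"] >= 1980) & (toy_df["year"] <= 1985)
light_or_recent = (toy_df["weight"] <= 8) | (toy_df["year"] == 2002)

# Passing the Boolean Series to [] keeps only the rows marked True
print(toy_df[in_range])
print(toy_df[light_or_recent])
```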

### Exercises - Advanced Selection Challenges
`1`. Select a subset of rows in the `surveys_df` DataFrame where the year is 1999 and the weight value is less than or equal to 8.

In [None]:
surveys_df[(surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8)]

`2`. You can use the pandas `isin` method to query a DataFrame based upon a list of values as follows:
```
surveys_df[surveys_df['species_id'].isin([listGoesHere])]
```
Use the `isin` method to find all sites that contain particular species (PB and PL) in the surveys DataFrame.

In [None]:
surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])]['plot_id'].unique()

`3`. The `~` operator inverts a Boolean selection, so it returns the OPPOSITE of the selection that you specify. Combined with `isin`, it is equivalent to **is not in**. Write a query that selects all rows where sex is neither ‘M’ nor ‘F’ in the surveys data.

In [None]:
surveys_df[~surveys_df["sex"].isin(['M', 'F'])]

## Using Masks

In [None]:
pd.isnull(surveys_df)

In [None]:
# To select just the rows with NaN values, we can use the .any method
surveys_df[pd.isnull(surveys_df).any(axis=1)]

In [None]:
# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)
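
As a small illustration of the mask idea used above, the sketch below builds a hypothetical three-row DataFrame (made-up values) with some missing entries. `pd.isnull` returns a Boolean DataFrame of the same shape, `.any(axis=1)` collapses it to one True/False per row, and `~` inverts that row mask to keep only complete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy_df = pd.DataFrame(
    {"sex": ["M", np.nan, "F"], "weight": [40.0, 45.0, np.nan]}
)

null_mask = pd.isnull(toy_df)           # True wherever a value is missing
row_has_nan = null_mask.any(axis=1)     # True for rows with at least one NaN

rows_with_nan = toy_df[row_has_nan]     # rows that contain a missing value
complete_rows = toy_df[~row_has_nan]    # rows with no missing values

print(null_mask)
print(rows_with_nan)
print(complete_rows)
```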

### Exercises - Removing NaN

Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by site with male vs female values stacked for each site.

In [None]:
# make sure figures appear inline in the IPython Notebook
%matplotlib inline

# select male/female rows with weight > 0; each condition needs its own
# parentheses because & binds more tightly than >
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) &
                            (surveys_df["weight"] > 0.)][["weight", "plot_id", "sex"]]

# calculate the mean weight for each plot id and sex combination: 
stack_selection = stack_selection.groupby(["plot_id", "sex"]).mean().unstack()

# The legend header contains two levels. To remove the extra level, simplify the column naming:
stack_selection.columns = stack_selection.columns.droplevel()

# and we can make a stacked bar plot from this:
stack_selection.plot(kind='bar', stacked=True)