# Data Analysis and Visualization in Python
## Indexing, Slicing and Subsetting DataFrames in Python
Questions
* How can I access specific data within my data set?
* How can Python and Pandas help me to analyse my data?

Objectives
* Describe what 0-based indexing is.
* Manipulate and extract data using column headings and index locations.
* Employ slicing to select sets of data from a DataFrame.
* Employ label and integer-based indexing to select ranges of data in a DataFrame.
* Reassign values within subsets of a DataFrame.
* Create a copy of a DataFrame.
* Query/select a subset of data using a set of criteria using the following operators: `==`, `!=`, `>`, `<`, `>=`, `<=`.
* Locate subsets of data using masks.
* Describe Boolean objects in Python and manipulate data using Booleans.

## Loading our data

In [None]:
# First make sure pandas is loaded
import pandas as pd

# Read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv")

## Indexing & Slicing in Python
### Selecting Data Using Labels (Column Headings)

In [None]:
surveys_df###

In [None]:
surveys_### = surveys_df['species_id']

In [None]:
# Select two columns with a list of column names
surveys_df[###'species_id', 'plot_id'###]
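
As a minimal sketch of the two selection forms above, the block below uses a small hypothetical `toy_df` (made-up values, not the real `surveys.csv`) to show that a single column name returns a Series, while a list of column names returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for surveys.csv (values are made up)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", "DM"],
    "plot_id": [2, 3, 2],
})

# A single column name returns a one-dimensional Series
species = toy_df["species_id"]
print(type(species).__name__)   # Series

# A list of column names returns a DataFrame, in the order requested
subset = toy_df[["species_id", "plot_id"]]
print(type(subset).__name__)    # DataFrame
print(list(subset.columns))     # ['species_id', 'plot_id']
```

Note the double brackets in the second form: the outer pair is the indexing operator and the inner pair is an ordinary Python list.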

### Copying Objects vs Referencing Objects in Python

In [None]:
# Using the 'copy()' method
###_surveys_df = surveys_df.copy()

# Using the '=' operator
###_surveys_df = surveys_df

In [None]:
# Assign the value 42 to the 'weight' column
###_surveys_df[###] = ###

In [None]:
surveys_df###

In [None]:
###_surveys_df.head()
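
The difference between copying and referencing can be sketched on a small made-up DataFrame. The names `toy_df`, `ref_df` and `copy_df` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy_df = pd.DataFrame({"record_id": [1, 2, 3], "weight": [40.0, 48.0, 29.0]})

ref_df = toy_df           # '=' binds a second name to the SAME object
copy_df = toy_df.copy()   # copy() creates a new, independent object

# Assign through the reference
ref_df["weight"] = 42

print(ref_df is toy_df)            # True: both names point at one DataFrame
print(toy_df["weight"].tolist())   # [42, 42, 42]: the original changed too
print(copy_df["weight"].tolist())  # [40.0, 48.0, 29.0]: the copy did not
```

This is why the data is reloaded from file afterwards: an assignment made through a reference also alters the original DataFrame.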

In [None]:
# Reload data from file
surveys_df = pd.read_csv("../data/surveys.csv")
surveys_df.head()

### Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and column directions using label indexing.

* `loc`: primarily label-based indexing. Integers may be used, but they are interpreted as labels, not positions.
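
To see what "interpreted as a label" means, here is a small sketch on a hypothetical `toy_df` whose index labels are deliberately not 0, 1, 2. The data and variable names are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index
toy_df = pd.DataFrame(
    {"species_id": ["NL", "DM", "PF"], "plot_id": [2, 3, 1]},
    index=[10, 20, 30],
)

# loc looks up the row whose LABEL is 10, which happens to be the first row
first_species = toy_df.loc[10, "species_id"]
print(first_species)   # NL

# Label-based slices include both endpoints
both_rows = toy_df.loc[10:20, ["species_id", "plot_id"]]
print(len(both_rows))  # 2

# There is no row labelled 0, so loc raises a KeyError
try:
    toy_df.loc[0]
    found_zero = True
except KeyError:
    found_zero = False
print(found_zero)      # False
```

With the default index of `surveys_df`, labels and positions coincide, which is why `loc[0, ...]` appears to behave like positional indexing there.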

In [None]:
# What does this do?
### surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]

In [None]:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[###, ###]

In [None]:
# What happens when you type the code below?
### surveys_df.loc[[0, 10, 35549], :]

### Exercise - Range
What happens when you execute:

In [None]:
# surveys_df.loc[0:4, 1:4] # 'month':'plot_id'

## Subsetting Data Using Criteria

In [None]:
surveys_df###surveys_df['year'] ### ######

In [None]:
surveys_df[surveys_df['year'] ### ###]

In [None]:
surveys_df[###surveys_df['year'] >= 1980### ### ###surveys_df['year'] <= 1985###]

You can use the syntax below when querying data from a DataFrame. Experiment with selecting various subsets of the "surveys" data.

* Equals: `==`
* Not equals: `!=`
* Greater than, less than: `>` or `<`
* Greater than or equal to: `>=`
* Less than or equal to: `<=`
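
The operators above each produce a Boolean mask. The following sketch, on a made-up `toy_df` (hypothetical values), shows a mask on its own and then two conditions combined with `|` (or). The `&` operator (and) works the same way.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy_df = pd.DataFrame({
    "year": [1977, 1981, 1985, 1999],
    "weight": [40.0, 7.0, 12.0, 8.0],
})

# A comparison returns a Boolean Series (a mask), one value per row
mask = toy_df["year"] >= 1981
print(mask.tolist())            # [False, True, True, True]

# Indexing with the mask keeps only the rows where it is True
recent = toy_df[mask]

# Combine conditions with | (or) or & (and); wrap each condition in parentheses
selected = toy_df[(toy_df["year"] < 1980) | (toy_df["weight"] <= 8)]
print(selected.index.tolist())  # [0, 1, 3]
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators in Python.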

### Exercises - Advanced Selection Challenges
`1`. Select a subset of rows in the `surveys_df` DataFrame where the year is 1999 and the weight value is less than or equal to 8.

In [None]:
surveys_df[(surveys_df['year'] ### ###) ### (surveys_df['###'] ### ###)]

`2`. You can use the pandas `isin` method to query a DataFrame based upon a list of values as follows:
```
surveys_df[surveys_df['column_name'].isin([value1, value2, ...])]
```
Use the `isin` method to find all different sites (`plot_id`) that contain particular species (AS, CQ, OX and UL) in the surveys DataFrame.

In [None]:
surveys_df[surveys_df['species_id'].###(###)]['plot_id'].###

`3`. The `~` operator inverts a Boolean selection, returning the OPPOSITE of what you specify. Combined with `isin`, it is equivalent to **is not in**. Write a query that selects all rows where the sex is neither 'M' nor 'F' in the surveys data.

In [None]:
surveys_df[###surveys_df['sex'].###(["M", "F"])]
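
As a rough illustration of `isin` and `~` working together, the sketch below uses a hypothetical `toy_df` with made-up species codes. It is not a solution to the exercises above, only the same pattern on different data.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF", "DM", "OT"],
    "plot_id": [2, 3, 1, 2, 3],
})

# isin returns a Boolean mask: True where the value is in the list
wanted = toy_df["species_id"].isin(["DM", "PF"])
print(wanted.tolist())              # [False, True, True, True, False]

# unique() lists each distinct plot_id among the matching rows
plots = toy_df[wanted]["plot_id"].unique()

# ~ flips every True to False and vice versa, selecting everything else
others = toy_df[~wanted]
print(others.index.tolist())        # [0, 4]
```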

## Selecting undefined values

In [None]:
pd.###(surveys_df)

In [None]:
# To select rows with at least one undefined value, we can use the .any() method
surveys_df[pd.###(surveys_df).###(axis=1)]

In [None]:
# What does this do?
one_selection = surveys_df[pd.###(surveys_df['weight'])]
one_selection.groupby('species_id')['record_id'].###()

### Exercises - Removing NaN

Create a new DataFrame that contains only observations where the sex is male or female and the weight is greater than 0. Then create a stacked bar plot of average weight by site, with male and female values stacked for each site.

In [None]:
# Selection of the data with isin()
stack_selection = surveys_df[(surveys_df['sex'].###(["M", "F"])) & 
                             (surveys_df['weight'] ###)][[###]]
stack_selection.head()

In [None]:
# Calculate the mean weight for each plot_id and sex combination: 
stack_selection = stack_selection.groupby(['plot_id', 'sex']).###().unstack()
stack_selection.head()

In [None]:
# The legend header contains two levels. To remove the extra level,
# the column naming needs to be reduced by one level:
stack_selection.columns = stack_selection.columns.droplevel()
stack_selection.head()
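
The effect of `unstack()` and `droplevel()` on the column labels can be sketched on a small hypothetical DataFrame. The values are made up, and `mean()` is assumed as the aggregation, following the "mean weight" comment above.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy_df = pd.DataFrame({
    "plot_id": [1, 1, 2, 2, 2],
    "sex": ["F", "M", "F", "M", "M"],
    "weight": [40.0, 50.0, 30.0, 20.0, 40.0],
})

# Group by two keys, then move the inner key ('sex') into the columns
by_plot_sex = toy_df.groupby(["plot_id", "sex"]).mean().unstack()
levels_before = by_plot_sex.columns.nlevels
print(levels_before)              # 2: ('weight', 'F'), ('weight', 'M')

# Drop the outer 'weight' level so only 'F' and 'M' remain as column names
by_plot_sex.columns = by_plot_sex.columns.droplevel()
print(list(by_plot_sex.columns))  # ['F', 'M']
print(by_plot_sex.loc[2, "M"])    # 30.0
```

The flattened column names are what end up in the plot legend.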

In [None]:
# And we can make a stacked bar plot from this:
stack_selection.plot(kind='###', ###=True)