# Data Analysis and Visualization in Python
## Indexing, Slicing and Subsetting DataFrames
Questions
* How can I access specific data within my data set?

Objectives
* Employ slicing to select sets of data from a DataFrame.
* Employ label and integer-based indexing
  to select ranges of data in a DataFrame.
* Reassign values within subsets of a DataFrame.
* Query/select a subset of data by a set of criteria built with
  the following operators: `==`, `!=`, `>`, `<`, `>=`, `<=`.
* Manipulate data using boolean masks.

## Loading our data

In [None]:
# First make sure pandas is loaded
import pandas as pd

# Read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv")

## Indexing & Slicing in Python
### Selecting Data Using Labels (Column Headings)

In [None]:
# Assign the column of species IDs
surveys_species = surveys_df['species_id']

In [None]:
# Get the shape of the column: a one-element tuple holding the number of rows
surveys_species.shape

In [None]:
# Select many columns with a list of column names
columns = ['year', 'month', 'day']
surveys_df[columns]

### Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and
column directions using `loc`, which is primarily label-based indexing.
Integers may be used, but they are interpreted as labels, not as positions.
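
The difference between a label and a position is easiest to see when the index does not start at 0. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are assumptions for illustration, not the surveys data). It contrasts `loc` with the position-based `iloc` accessor:

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index
toy_df = pd.DataFrame(
    {"species_id": ["NL", "DM", "PF"], "weight": [40.0, 29.0, 8.0]},
    index=[10, 20, 30],
)

# loc: the row whose index *label* is 10 (here, the first row)
by_label = toy_df.loc[10, "species_id"]

# iloc: the row at integer *position* 0 (also the first row)
by_position = toy_df.iloc[0, 0]

print(by_label, by_position)

# There is no row labelled 0, so loc raises a KeyError
try:
    toy_df.loc[0]
except KeyError as error:
    print(f"No such label: {error}")
```

In `surveys_df` the default index runs 0, 1, 2, ..., so labels and positions happen to coincide for single rows.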

In [None]:
# What does this do?
surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]

In [None]:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]

In [None]:
# What happens when you type the code below?
try:
    print(surveys_df.loc[[0, 10, 35548], :])
except KeyError as error:
    print(f'The problem: {error}')

### Demo - Range
What happens when you execute:

In [None]:
try:
    print(surveys_df.loc[0:4, 'month':'plot_id'])
except KeyError as error:
    print(f'The problem: {error}')
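
Label-based slices with `loc` include both endpoints, unlike ordinary Python slices. The sketch below uses a small made-up DataFrame (`toy_df` is an assumption that only mimics a few survey-like columns) to compare a `loc` slice with the equivalent-looking `iloc` slice:

```python
import pandas as pd

# Hypothetical toy data with survey-like column names
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "month": [7, 7, 7, 7, 7, 7],
    "day": [16, 16, 16, 16, 16, 16],
    "year": [1977, 1977, 1977, 1977, 1977, 1977],
    "plot_id": [2, 3, 2, 7, 3, 1],
})

# loc: rows labelled 0 through 4 and columns 'month' through 'plot_id',
# with both ends of each range included
label_slice = toy_df.loc[0:4, "month":"plot_id"]
print(label_slice.shape)  # (5, 4)

# iloc: positions 0 up to, but not including, 4
position_slice = toy_df.iloc[0:4]
print(len(position_slice))  # 4
```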

## Subsetting Data Using Criteria

In [None]:
# Select records for the year 2002
surveys_df[surveys_df['year'] == 2002]

In [None]:
# Select records for the other years
surveys_df[surveys_df['year'] != 2002]

In [None]:
# With two conditions
surveys_df[(surveys_df['year'] >= 2001) & (surveys_df['weight'] <= 8)]

Here are the most common operators for conditions:

* Equal, not equal: `==`, `!=`
* Greater than, less than: `>`, `<`
* Greater than or equal to, less than or equal to: `>=`, `<=`
* Element-wise AND and OR operators: `&` and `|`
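
Each condition must be wrapped in parentheses when combining them, because `&` and `|` bind more tightly than the comparison operators. As a minimal sketch on made-up values (`toy_df` is an assumption, not the surveys data), the same two conditions can be combined with AND or with OR:

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the numeric column
toy_df = pd.DataFrame({
    "year": [1999, 2001, 2002, 2002],
    "weight": [5.0, 7.0, 40.0, None],
})

recent = toy_df["year"] >= 2001
light = toy_df["weight"] <= 8   # comparisons with NaN evaluate to False

# AND: rows that satisfy both conditions
both = toy_df[(toy_df["year"] >= 2001) & (toy_df["weight"] <= 8)]

# OR: rows that satisfy at least one condition
either = toy_df[recent | light]

print(len(both), len(either))
```

Storing each mask in a named variable, as with `recent` and `light`, avoids the parentheses and tends to read more clearly.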

### Exercises - Selection by presence
`1`. You can use the pandas `isin()` method to query
a DataFrame based upon a list of values as follows:
```
surveys_df[surveys_df['column_name'].isin([value1, value2, ...])]
```
Use the `isin()` method to find all different
sites (`plot_id`) that contain particular species
(`AS`, `CQ`, `OX` and `UL`) in the surveys DataFrame.

In [None]:
# Boolean mask of valid species IDs
species_mask = surveys_df['species_id'].isin(['AS', 'CQ', 'OX', 'UL'])

# List all different sites
surveys_df.loc[species_mask, 'plot_id'].unique()

`2`. Create a stacked bar plot of average weight by site
with male vs female values stacked for each site (`plot_id`).
* Create a new DataFrame that contains only observations that are
  of sex female or male and where weight values are greater than 0
* For the final plot, only select the
  weight, the site and the sex columns

In [None]:
# Selection of the data with isin()
sex_mask = surveys_df['sex'].isin(['F', 'M'])
weight_mask = surveys_df['weight'] > 0
columns = ['weight', 'plot_id', 'sex']

selection = surveys_df.loc[sex_mask & weight_mask, columns]
selection.tail()

In [None]:
# Calculate the mean weight for each plot_id and sex combination
avg_by_site_sex = selection.groupby(['plot_id', 'sex']).mean()
avg_by_site_sex.head()

In [None]:
# Transform categorical values into columns
table_site_sex = avg_by_site_sex.unstack()

# The column header now has two levels ('weight', then sex). To remove the first level:
table_site_sex.columns = table_site_sex.columns.droplevel()
table_site_sex.tail()

In [None]:
# And we can make a stacked bar plot from this:
table_site_sex.plot(kind='bar', stacked=True)

`3`. The `~` operator inverts a boolean mask, so it returns the OPPOSITE
of the selection that you specify. Applied to an `isin()` mask, it is
equivalent to **is not in**. Write a query that selects all rows
whose sex is neither `F` nor `M` in the surveys data.

In [None]:
surveys_df[~sex_mask]

## Technical Summary
* **Selection**
  * `df[]` :
    * With a list: selection of columns
    * With a vector of boolean values: selection of rows
  * `df.loc[rows, columns]` :
    * With a list: for rows and columns
    * With a range with the `:` notation: for rows and columns
      * Both limits in the range are included
    * With a vector of boolean values: selection of rows
    * Allows overwriting the selection with new values
  * `column.isin([value1, value2, ...])`
* **Operators** on values of one or two columns:
  * Comparison: `<`, `<=`, `==`, `!=`, `>=`, `>`
  * Boolean: `~`, `|`, `&`
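
The summary notes that `df.loc[rows, columns]` allows overwriting a selection with new values. A minimal sketch of that pattern, assuming a small made-up DataFrame (`toy_df`, `cleaned` and the `'unknown'` placeholder are illustrative choices, not part of the lesson data), could look like this:

```python
import pandas as pd

# Hypothetical toy data with one missing sex value
toy_df = pd.DataFrame({
    "sex": ["F", "M", None, "F"],
    "weight": [40.0, 29.0, 12.0, 8.0],
})

# Work on a copy so the original DataFrame stays unchanged
cleaned = toy_df.copy()

# Boolean mask of rows whose sex is neither F nor M
not_f_or_m = ~cleaned["sex"].isin(["F", "M"])

# Overwrite only the selected rows of the 'sex' column
cleaned.loc[not_f_or_m, "sex"] = "unknown"

print(cleaned)
```

Assigning through a single `loc` call, rather than chaining `df[mask]['sex'] = ...`, makes sure the change is written to `cleaned` itself and not to a temporary copy.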