# Data Analysis and Visualization in Python
## Indexing, Slicing and Subsetting DataFrames in Python
Questions
* How can I access specific data within my data set?
* How can I manage undefined (null) values?
* How can I save a DataFrame to a file?

Objectives
* Employ slicing to select sets of data from a DataFrame.
* Employ label and integer-based indexing to select ranges of data in a DataFrame.
* Reassign values within subsets of a DataFrame.
* Create a copy of a DataFrame.
* Query or select a subset of data with criteria built from the following operators: `==`, `!=`, `>`, `<`, `>=`, `<=`.
* Manipulate data using boolean masks.
* Transform or remove null values.
* Write modified data to a CSV file.

## Loading our data

In [None]:
# First make sure pandas is loaded
import pandas as pd

# Read in the survey csv
surveys_df = pd.read_csv("data/surveys.csv")

## Indexing & Slicing in Python
### Selecting Data Using Labels (Column Headings)
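Selecting by column heading behaves differently depending on whether we pass one name or a list of names. The sketch below uses a small made-up table (`toy_df` and its values are illustrative, not the survey data) to show both forms.

```python
import pandas as pd

# Small made-up table standing in for the survey data (illustrative values only)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", "PF"],
    "plot_id": [2, 3, 2],
})

# A single column name in brackets returns a pandas Series
one_column = toy_df["species_id"]

# A list of column names (note the double brackets) returns a DataFrame
two_columns = toy_df[["species_id", "plot_id"]]

print(type(one_column).__name__)
print(list(two_columns.columns))
```

The outer brackets are the selection operator and the inner brackets build the list of names, which is why two column names need double brackets.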

In [None]:
surveys_df###

In [None]:
surveys_### = surveys_df['species_id']
surveys_species.###

In [None]:
# Select two columns with a list of column names
surveys_df[['species_id', 'plot_id']]

### Slicing Subsets of Rows and Columns in Python
We can select specific ranges of our data in both the row and column directions using `loc`, which is primarily label-based indexing. Integers may be used, but they are interpreted as labels, not positions.
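To make the "integers are labels" point concrete, here is a minimal sketch on a made-up table whose index labels are deliberately not 0, 1, 2, ... (the table, its values and the variable names are assumptions for illustration). It also contrasts `loc` with the position-based `iloc`.

```python
import pandas as pd

# Made-up table with index labels 10, 20, 30, 40 (illustrative values only)
toy_df = pd.DataFrame(
    {"species_id": ["NL", "DM", "PF", "DM"], "weight": [40.0, 48.0, 7.0, 44.0]},
    index=[10, 20, 30, 40],
)

# loc looks up the row whose label is 10, then picks the listed columns
row_and_columns = toy_df.loc[10, ["species_id", "weight"]]

# A loc slice works on labels and includes the end label
label_slice = toy_df.loc[10:30]

# An iloc slice works on positions and excludes the end position
position_slice = toy_df.iloc[0:2]

print(list(label_slice.index))
print(list(position_slice.index))
```

With the default index (0, 1, 2, ...) labels and positions coincide, which is why the difference is easy to miss on freshly loaded data.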

In [None]:
# What does this do?
# surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]

In [None]:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[###, ###]

In [None]:
# What happens when you type the code below?
# surveys_df.loc[[0, 10, 35549], :]

### Exercise - Range
What happens when you execute:

In [None]:
# surveys_df.loc[0:4, 1:4] # 'month':'plot_id'

## Subsetting Data Using Criteria

In [None]:
surveys_df['year'] ### ###

In [None]:
surveys_df[surveys_df['year'] ### ###]

In [None]:
surveys_df[(surveys_df['year'] >= 2001### ### (surveys_df['weight'] <= 8###]

Here are the most common operators for conditions:

* Equal, not equal: `==`, `!=`
* Greater than, less than: `>` or `<`
* Greater than or equal to, less than or equal to: `>=`, `<=`
* Element-wise AND and OR operators: `&` and `|`
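The operators above can be combined into boolean masks. The following sketch uses a made-up table and arbitrary thresholds (both are assumptions chosen for illustration) to show a single condition, an AND and an OR.

```python
import pandas as pd

# Made-up table (illustrative values only)
toy_df = pd.DataFrame({
    "year": [1999, 2001, 2002, 2002],
    "weight": [40.0, 7.0, 52.0, 8.0],
})

# A comparison on a column yields a boolean Series (the mask)
mask_recent = toy_df["year"] > 2000

# Each condition needs its own parentheses because & and | bind
# more tightly than the comparison operators
both = toy_df[(toy_df["year"] > 2000) & (toy_df["weight"] < 10)]
either = toy_df[(toy_df["year"] == 1999) | (toy_df["weight"] > 50)]

print(mask_recent.tolist())
print(list(both.index), list(either.index))
```

Python's `and` and `or` keywords do not work element-wise on a Series, so `&` and `|` are used instead.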

### Exercises - Selection by presence
`1`. You can use the pandas `isin` method to query a DataFrame based on a list of values, as follows:
```
surveys_df[surveys_df['column_name'].isin([value1, value2, ...])]
```
Use the `isin` method to find all the distinct sites (`plot_id`) that contain particular species (`AS`, `CQ`, `OX` and `UL`) in the surveys DataFrame.
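As a warm-up before the exercise, here is a minimal sketch of `isin` on a made-up table (the table, the species codes in `wanted` and the variable names are illustrative assumptions, not the exercise data).

```python
import pandas as pd

# Made-up table (illustrative values only)
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF", "DM", "OT"],
    "plot_id": [2, 3, 2, 7, 3],
})

# isin returns True for every row whose value appears in the list
wanted = ["DM", "OT"]
matches = toy_df[toy_df["species_id"].isin(wanted)]

# Chain a column selection to look only at the sites of the matching rows
plots = matches["plot_id"].unique()

print(plots.tolist())
```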

In [None]:
surveys_df[surveys_df['species_id'].###(###)]['plot_id'].###

`2`. Create a new DataFrame that contains only observations where sex is female or male and where weight values are greater than 0. Create a stacked bar plot of average weight by site, with male and female values stacked for each site.

In [None]:
# Selection of the data with isin()
selection = surveys_df[(surveys_df['sex'].###(["F", "M"])) & 
                       (surveys_df['weight'] ###)][[###]]
selection.head()

In [None]:
# Calculate the mean weight for each plot_id and sex combination: 
selection = selection.groupby(['plot_id', 'sex']).###().unstack()
selection.head()

In [None]:
# The legend header contains two levels. In order to remove this,
# the column naming needs to be reduced by one level : 
selection.columns = selection.columns.droplevel()
selection.head()

In [None]:
# And we can make a stacked bar plot from this:
selection.plot(kind='###', ###=True)

`3`. In pandas, the `~` operator inverts a boolean selection, returning the OPPOSITE of the selection that you specify. Combined with `isin`, it is equivalent to **is not in**. Write a query that selects all rows where `sex` is neither `F` nor `M` in the surveys data.
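The inversion can be sketched on a made-up table and a different column (both are illustrative assumptions, so the exercise itself is left for you to solve).

```python
import pandas as pd

# Made-up table (illustrative values only)
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF", "DM", "OT"],
    "plot_id": [2, 3, 2, 7, 3],
})

# Mask that is True where the species is one of the listed codes
is_listed = toy_df["species_id"].isin(["DM", "NL"])

# ~ flips every True to False and vice versa, keeping the other rows
not_listed = toy_df[~is_listed]

print(not_listed["species_id"].tolist())
```

Note that `~` is applied to the whole mask, so it goes in front of the expression that produces the boolean Series.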

In [None]:
surveys_df[###surveys_df['sex'].###(["F", "M"])]

## Selecting and cleaning undefined values
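Undefined values show up in pandas as `NaN` (or `None` in text columns). A minimal sketch on a made-up table (the table and variable names are illustrative assumptions) shows how a null mask is built and used.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value per column (illustrative values only)
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", None, "PF"],
    "weight": [40.0, np.nan, 7.0, 12.0],
})

# Same shape as toy_df, True wherever a value is missing
null_mask = pd.isnull(toy_df)

# any(axis=1) collapses each row to True if at least one column is missing
rows_with_null = toy_df[null_mask.any(axis=1)]

# Summing a boolean mask counts the True values in each column
missing_per_column = null_mask.sum()

print(list(rows_with_null.index))
print(missing_per_column.to_dict())
```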

In [None]:
pd.###(surveys_df)

In [None]:
# To select rows with at least one undefined value, we can use the .any() method
surveys_df[pd.###(surveys_df).###(axis=1)]

In [None]:
# What does this do?
one_selection = surveys_df[pd.###(surveys_df['weight'])]
one_selection.groupby('species_id')['record_id'].###()

### Getting Rid of the NaNs
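One common approach is to replace missing values with a summary statistic while working on a copy. The sketch below does this on a made-up column (the values and the names `toy_copy` and `mean_weight` are illustrative assumptions); whether the mean is an appropriate fill value depends on your analysis.

```python
import numpy as np
import pandas as pd

# Made-up weights with two missing values (illustrative values only)
toy_df = pd.DataFrame({"weight": [40.0, np.nan, 8.0, np.nan]})

# Work on a copy so the original table keeps its NaN values
toy_copy = toy_df.copy()

# mean() skips NaN by default, so this is (40 + 8) / 2
mean_weight = toy_copy["weight"].mean()

# Replace each missing weight with that mean
toy_copy["weight"] = toy_copy["weight"].fillna(mean_weight)

print(toy_copy["weight"].tolist())
print(toy_df["weight"].isnull().sum())
```

Filling with the mean leaves the column mean unchanged but reduces its spread, so it can make the data look less variable than it really is.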

In [None]:
# Before the cleanup
print(surveys_df['###']###, surveys_df['weight']###)

In [None]:
# Create a copy to avoid modifying the original object
###_surveys_df = surveys_df.###()

In [None]:
# For a stable mean value
### = copy_surveys_df['weight'].mean()
copy_surveys_df['###'] = copy_surveys_df['weight']###(averageW)

In [None]:
# After the cleanup
print(###_surveys_df['weight']###, copy_surveys_df['weight']###)

### Writing Out Data to CSV
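The round trip of dropping incomplete rows and saving the result can be sketched on a made-up table. The file name `toy_complete.csv` and the temporary directory are illustrative assumptions, chosen so the sketch does not overwrite any real file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up table with one incomplete record (illustrative values only)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "weight": [40.0, np.nan, 7.0],
})

# dropna() keeps only the rows that have no missing values
complete = toy_df.dropna()

# Write to a temporary location; index=False leaves out the row labels
out_path = os.path.join(tempfile.mkdtemp(), "toy_complete.csv")
complete.to_csv(out_path, index=False)

# Read the file back to check what was written
round_trip = pd.read_csv(out_path)
print(round_trip.shape)
```

Without `index=False`, pandas writes the index as an extra unnamed first column, which then reappears as a column when the file is read back.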

In [None]:
# Only keep (complete) records that have no NA
df_no_na = surveys_df.###()
###

In [None]:
df_no_na.###('surveys_complete.###', ###)