# Introduction

This is the Python notebook for the NMSU Pandas Workshop. For the online reference, [**click here**](https://github.com/bleuknight/NMSUworkshop2018/blob/master/learnPandasReference.ipynb).

Selecting specific values of a `pandas` `DataFrame` or `Series` to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

I have remixed an excellent Kaggle tutorial, which can be found [here](https://www.kaggle.com/residentmario/indexing-selecting-assigning).

In this set of exercises we will work on exploring the [pollutants dataset](https://www.kaggle.com/sohier/mussel-watch/data). 

# Relevant Resources
* **[Companion Resource Notebook](https://github.com/bleuknight/NMSUworkshop2018/blob/master/learnPandasReference.ipynb)** 
* [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html) section of pandas documentation
* [Pandas Cheat Sheet](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)




# Set Up

Run the following two cells to load (1) the pandas library and (2) the dataset we will be working with today.

In [None]:
# Import the pandas library

import pandas as pd


In [None]:
# download the pollutants dataset and save it into the folder where you are currently working 
# with this jupyter notebook
pols = pd.read_csv("pollutants.csv", dtype = object)


### A pandas `object` column can hold values of any type
### `dtype = object` tells `read_csv` to load every column of the spreadsheet as `object`

Verify that this is true by running the code below.  

Bonus: what happens if you don't specify the `dtype`?

In [None]:
pols.dtypes
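
To see what `dtype = object` changes without loading the full dataset, here is a minimal sketch. The three-row CSV text is made up for illustration and only borrows a few column names from `pols`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for pollutants.csv
csv_text = (
    "fiscal_year,parameter,result\n"
    "1986,Lead,0.5\n"
    "1987,Zinc,12\n"
    "1988,Copper,3.2\n"
)

# Same data read twice: once forcing object, once letting pandas infer types
as_objects = pd.read_csv(io.StringIO(csv_text), dtype=object)
inferred = pd.read_csv(io.StringIO(csv_text))

print(as_objects.dtypes)  # every column is object
print(inferred.dtypes)    # pandas guesses a type for each column
```

With `dtype = object` the numbers stay as text, which is why the numeric columns need to be converted before doing any arithmetic on them.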

## Let's check out the first 100 rows to find out what types of data are present

In [None]:
pols.head(100)

## Let's specify numeric columns so that we can manipulate them

Find out the column names with this code:

In [None]:
pols.columns

In [None]:
# now specify the numeric columns by creating a list of their names
cols = ['fiscal_year', 'result']

In [None]:
pols[cols] = pols[cols].apply(pd.to_numeric, errors = 'coerce')

In [None]:
pols.dtypes
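
If it is unclear what `errors = 'coerce'` does, this small sketch may help. The values in `raw` are invented for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical text values, as they would look after reading with dtype=object
raw = pd.Series(["1986", "0.25", "not detected", None], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
print(numeric.isna().sum())  # how many values could not be converted
```

Without `errors = 'coerce'`, `pd.to_numeric` raises an error on the first value it cannot parse, so coercing is a convenient way to keep going and inspect the failures afterwards.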

# Exercises

**Exercise 1**: Select the `general_location` column from `pols`.

**Exercise 2**: What is the general location for the 94th entry of `pols`?

**Exercise 3**: Select the first row of data (the first record) from `pols`. Hint: use `loc` or `iloc`.

**Exercise 4**: Select the first 100 values of the `parameter` column in `pols` (a `Series`).

**Exercise 5**: Select the records with the `1`, `2`, `3`, `5`, and `8` row index positions. In other words, generate the following `DataFrame`:

![DataFrame](pols_pic1.png)

**Exercise 6**: Select the `general_location`, `parameter`, `result`, and `units` columns of the records with the `1`, `9`, `19`, and `91` index positions. In other words, generate the following `DataFrame`:

![DataFrame](pols_pic2.png)

**Exercise 7**: Select the `general_location` and `specific_location` columns of records 1 - 1000.

Hint: you may use `loc` or `iloc`. When working on the answer to this question and several of the ones that follow, keep the following in mind:

> `iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.
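
The difference between the two slicing rules can be sketched on a toy `DataFrame`. The frame below is made up and uses the default integer index, so labels and positions happen to coincide.

```python
import pandas as pd

# Toy frame with the default index 0..19
toy = pd.DataFrame({"value": range(20)})

by_position = toy.iloc[0:10]  # positions 0..9, end excluded
by_label = toy.loc[0:10]      # labels 0..10, end included

print(len(by_position))  # 10
print(len(by_label))     # 11
```

The same slice text, `0:10`, returns one more row with `loc` than with `iloc`, which is easy to miss when a question asks for "the first N" records.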


**Exercise 8**: Select tests done in the `Beaufort Sea`. Hint: `pols.general_location` equals what?

**Exercise 9**: Select tests whose `parameter` is not `NaN`.
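
Both of these exercises rely on boolean masks. As a rough sketch of the idea, assuming a made-up three-row frame that only borrows two column names from `pols`:

```python
import pandas as pd

# Hypothetical rows, not real records from the dataset
toy = pd.DataFrame({
    "general_location": ["Beaufort Sea", "Mississippi River", "Beaufort Sea"],
    "parameter": ["Lead", None, "Zinc"],
})

# A comparison produces a Series of True/False values, one per row
is_beaufort = toy["general_location"] == "Beaufort Sea"
print(toy[is_beaufort])

# notna() is True wherever a value is present
has_parameter = toy["parameter"].notna()
print(toy[has_parameter])
```

Passing the mask inside the square brackets keeps only the rows where it is `True`.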

**Exercise 10**: Select the `parameter` column.

**Exercise 11**: Select the `parameter` column for the first 1000 tests.

**Exercise 12**: Select the `parameter` column for the last 1000 tests.

## Now, let's start to ask some more meaningful questions about our data

**Exercise 13**: What are the different pollutants present in the dataset? List each pollutant only once. Hint: the `parameter` column corresponds to the pollutant.

**Exercise 14**: How many pollutants are there?

**Exercise 15**: What pollutants are in the Mississippi River? Hint: Select the `parameter` column, but only for tests in the Mississippi River.

**Exercise 16**: How many different pollutants are in the Mississippi River? Hint: how long is the array of unique pollutants?

**Exercise 17**: What is the oldest test on record for the Mississippi River? Hint: What is the minimum value of `fiscal_year` for tests in the Mississippi River?

**Exercise 18**: What is the most recent test on record for the Mississippi River? Hint: What is the maximum value of `fiscal_year` for tests in the Mississippi River?
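
The last few questions all follow the same pattern: filter the rows, pick a column, then summarize it. Here is a minimal sketch of that pattern on invented data (the rows below are hypothetical and only reuse column names from `pols`):

```python
import pandas as pd

# Hypothetical rows, not real records from the dataset
toy = pd.DataFrame({
    "general_location": ["Mississippi River"] * 3 + ["Beaufort Sea"],
    "parameter": ["Lead", "Zinc", "Lead", "Copper"],
    "fiscal_year": [1990, 1986, 2001, 1995],
})

# Step 1: keep only the rows for one location
river = toy[toy["general_location"] == "Mississippi River"]

# Step 2: summarize a column of the filtered frame
pollutants = river["parameter"].unique()  # each value listed once
print(pollutants)
print(len(pollutants))
print(river["fiscal_year"].min(), river["fiscal_year"].max())
```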

**Exercise 19**: Let's explore the Mississippi River data in more detail. What are the most concentrated pollutants at each specific location? `(Hint: first, create a dataframe corresponding to only Mississippi River data, then sort by the highest result. Then, group by 'parameter' and 'units')`

**Exercise 20**: How do we know that the units are aligned? Find the different units that are used for the Mississippi River tests:

Oh, it looks like there might be some differences. I'm going to go through one way to match up the units so we can be sure we are looking at the most concentrated pollutants in the dataset. **I did this last night at almost midnight and it feels kind of clunky, but this is my solution.** First I am going to duplicate the `result` column in a new column called `adj`:

In [None]:
# MR is the Mississippi River dataframe created in Exercise 19
MR = MR.assign(adj = MR['result'])
MR

We have micrograms, nanograms, and picograms per gram, so I am going to adjust accordingly now:


In [None]:
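# Conversion factors to picograms per gram: micro = 1,000,000 pico, nano = 1,000 pico
# na=False treats rows with missing units as "no match" instead of raising an error
MR.loc[MR['units'] == 'micrograms per dry gram', 'adj'] = 1000000
MR.loc[MR['units'].str.contains("ng", na=False), 'adj'] = 1000
# The remaining units keep a factor of 1
MR.loc[MR['units'].str.contains("cent", na=False), 'adj'] = 1
MR.loc[MR['units'] == 'CFU/g', 'adj'] = 1
MR.loc[MR['units'] == 'Number', 'adj'] = 1
MR.loc[MR['units'] == 'G', 'adj'] = 1
MR.loc[MR['units'] == 'pg/g', 'adj'] = 1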

**Exercise 21**: Did the adjustment catch all of the different types of units? `(Hint: what are the unique values in the 'adj' column?)`

Now I will compensate for the unit adjustment like this:

In [None]:
MR['result'] = MR.result*MR.adj
MR
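
The adjust-then-multiply steps above could also be sketched with a lookup table and `Series.map`. This is an alternative to the approach in this notebook, not a replacement for it. The frame is made up, and the unit strings `"ng/g"` and `"unknown unit"` are assumptions for illustration, since the exact spellings in the dataset may differ.

```python
import pandas as pd

# Hypothetical rows; unit spellings other than the first and third are assumed
toy = pd.DataFrame({
    "units": ["micrograms per dry gram", "ng/g", "pg/g", "unknown unit"],
    "result": [2.0, 3.0, 4.0, 5.0],
})

# Factors that convert each unit to picograms per gram
to_pg_per_g = {
    "micrograms per dry gram": 1_000_000,
    "ng/g": 1_000,
    "pg/g": 1,
}

# map() looks up each unit; units missing from the dict become NaN
toy["adj"] = toy["units"].map(to_pg_per_g)

# Write to a new column so the original result values stay untouched
toy["result_pg"] = toy["result"] * toy["adj"]

print(toy)
print(toy.loc[toy["adj"].isna(), "units"].unique())  # units the table missed
```

Two design choices here are worth a note. Unmatched units show up as `NaN` rather than keeping a leftover value, and the converted numbers go into a separate column, so re-running the cell does not multiply `result` a second time.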

**Exercise 22**: What pollutants have the highest mean result value? `(Hint: Look at the discussion of groupby in the slides)`
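
As a reminder of how `groupby` works, here is a small sketch on invented numbers (the rows are hypothetical; only the column names come from `pols`):

```python
import pandas as pd

# Hypothetical measurements, not real records from the dataset
toy = pd.DataFrame({
    "parameter": ["Lead", "Zinc", "Lead", "Zinc", "Copper"],
    "result": [1.0, 10.0, 3.0, 30.0, 5.0],
})

# Group rows by pollutant, average the result within each group,
# then sort so the largest mean comes first
mean_by_pollutant = (
    toy.groupby("parameter")["result"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_by_pollutant)
```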

**Exercise 23**: Let's see which pollutants changed the most in the Mississippi River over the course of the study. `(Hint: create a dataframe from a pivot table, then find the difference between the highest and lowest values, then sort by that difference)`
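
The pivot-table idea in the hint can be sketched on a toy frame. The numbers below are made up, and averaging the results per year is an assumption about how repeated tests should be combined.

```python
import pandas as pd

# Hypothetical measurements, not real records from the dataset
toy = pd.DataFrame({
    "parameter": ["Lead", "Lead", "Lead", "Zinc", "Zinc", "Zinc"],
    "fiscal_year": [1990, 1995, 2000, 1990, 1995, 2000],
    "result": [1.0, 4.0, 2.0, 10.0, 11.0, 12.0],
})

# One row per pollutant, one column per year, mean result in each cell
by_year = toy.pivot_table(
    index="parameter", columns="fiscal_year", values="result", aggfunc="mean"
)

# Range across the years for each pollutant, largest change first
change = (by_year.max(axis=1) - by_year.min(axis=1)).sort_values(ascending=False)

print(by_year)
print(change)
```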