<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Pandas Transformation Lab

_Authors: Riley Dallas (ATX), Dave Yerrington (SF), Mark Popovich (SF)_

## Objectives

In this lab, you'll get some practice concatenating Pandas dataframes and plotting.

### Imports

scikit-learn ships with a number of built-in datasets. Today we'll be working with the canonical iris data.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

## Munging Data

### Load Data
scikit-learn's built-in datasets are exposed as loader functions that return a dictionary-like object containing the data we need.

In the cell below, call the `load_iris()` function and set the result to a variable called `data`.

### Examining Target Data

In machine learning, the column we're trying to predict is usually called the **target** (or **label**). 

**To see the targets for our dataset, call `data['target']` in the cell below.**

You can also use `data.keys()` to list the elements available inside the `data` object. The exact keys vary with the scikit-learn version, but they include:
```python
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
```

In the case of the iris dataset, the target is the species of iris flower. Because most machine learning estimators expect numeric features and targets, the species are encoded as 0, 1, and 2. These integers are indices into `data.target_names`, which holds the species names.

**Call `data['target_names']` to see the actual names of the three species**

**Get the length of `data['target']` to see how many flowers are in this dataset.**
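As a rough sketch of the inspection steps above (one possible way to do it, assuming scikit-learn is installed), the cells might look like this:

```python
from sklearn.datasets import load_iris

# load_iris() returns a dictionary-like object
data = load_iris()

print(data.keys())            # available elements (varies by scikit-learn version)
print(data['target'][:5])     # the first few encoded species
print(data['target_names'])   # the species names the integers refer to
print(len(data['target']))    # number of flowers in the dataset
```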

### Create `species` DataFrame

**Use `data['target']` to create your first pandas DataFrame, which is just a single column (`"species"`).**
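One way this could look, as a minimal sketch (the variable name `species` follows the heading above; the rest is just one reasonable approach):

```python
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()

# Wrap the 1-D target array in a DataFrame with a single named column
species = pd.DataFrame(data['target'], columns=['species'])
print(species.head())
```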

### Examining Feature Data

The features for this dataset are in `data['data']`. There are 150 rows, one for each flower, and 4 columns, one for each of the following features:
1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)

**Output `data['data']` in the cell below to see the features for this dataset**

### Create `features` DataFrame

Create a `features` DataFrame in pandas, using `data['data']` and `data['feature_names']` 

You can use `pd.DataFrame()` to create a DataFrame, passing the `columns` parameter using `data['feature_names']`
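A minimal sketch of that call, assuming the dataset is loaded as `data` as in the earlier steps:

```python
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()

# The 2-D feature array becomes the rows; feature_names become the column labels
features = pd.DataFrame(data['data'], columns=data['feature_names'])
print(features.head())
```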

### `pd.concat`

**Use `pd.concat` to combine the two dataframes.**

- The `pd.concat` function essentially squishes two dataframes together, along either Axis 0 or Axis 1.
- Axis 0 here refers to the rows. If we concat along axis 0, we would be stacking one DataFrame on top of the other. That is not what we want to accomplish in this case.
- Instead, we want to put the two dataframes side-by-side. In order to do that, we concat the columns by using Axis 1.
- Review / research the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) on the usage of this function.
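To make the difference between the two axes concrete, here is a small sketch on toy data (the frames `left` and `right` are made up for illustration and are not part of the lab):

```python
import pandas as pd

# Two toy frames with the same index but different columns
left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': [10, 20, 30]})

# axis=0 stacks rows; columns missing from one frame are filled with NaN
stacked = pd.concat([left, right], axis=0)

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```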

### `df.join`

Compare concat with `DataFrame.join` by joining the two original DataFrames instead.

In this case, we do not have to specify an axis. `join` always adds the columns of one DataFrame to another, aligning rows on the index by default (or on a column of the calling DataFrame if you pass `on`). You can even pass a list of DataFrames to join several at once; in that case they are aligned on their indexes, and the `on` parameter is not supported. Refer to the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html).
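A short sketch of `join` on toy data (the frames `left`, `right`, and `other` are hypothetical and only serve to illustrate the behavior):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': [10, 20, 30]})
other = pd.DataFrame({'c': [100, 200, 300]})

# No axis argument: the columns of `right` are added, aligned on the index
joined = left.join(right)

# A list of DataFrames can be joined in one call
joined_many = left.join([right, other])

print(joined)
print(joined_many)
```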

### `df.merge`

Compare concat with `DataFrame.merge` by merging the two original DataFrames instead.

Like join, merge combines DataFrames column-wise rather than stacking rows. Without a common column to merge on, you must explicitly tell pandas to merge on the left and right indices, which you can do with `left_index=True` and `right_index=True`. Because we are not merging on a column, we do not pass `on`, and because the two indices match exactly, the default `how='inner'` keeps every row. Both of those parameters will become very important to you as you do more sophisticated merging. Refer to the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).
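The sketch below uses toy frames with deliberately mismatched indices (an assumption made for illustration; the iris frames share the same index) to show an index-on-index merge and how `how` changes the result:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'b': [10, 20, 30]}, index=[1, 2, 3])

# Merge on both indices; the default how='inner' keeps only shared index labels
inner = left.merge(right, left_index=True, right_index=True)

# how='outer' keeps every index label from either frame, filling gaps with NaN
outer = left.merge(right, left_index=True, right_index=True, how='outer')

print(inner)
print(outer)
```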

### Make Features List

Create a list of all the numeric column names by dropping the species column, then calling the `columns` attribute followed by the `values` attribute. Save it to a variable named `num_cols`. 

You may want to reference the order of your numeric columns (so just print them out).
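A minimal sketch of this step on a toy frame (the column names `height` and `width` are placeholders, not the iris columns):

```python
import pandas as pd

df = pd.DataFrame({
    'height': [1.0, 2.0],
    'width': [3.0, 4.0],
    'species': [0, 1],
})

# drop() returns a new frame, so `df` itself still has the species column
num_cols = df.drop(columns='species').columns.values
print(num_cols)
```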

### Insert an Interaction Column

Multiply the four numeric columns together and save the result as a new column named `interaction`.

Inserting a new column is as easy as:
```
df['new_column_name_as_string'] = values
```
The `values` here can be many different things:
- a `pd.Series` that shares an index with our DataFrame will align perfectly and no NaNs will be present.
- a list or array of the same length as our DataFrame will also align perfectly.
- a single value will set that value for all rows.

(It's also important to note that when we set a value at a specific location in a new column, all unspecified locations will be filled with NaNs.)
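The cases listed above can be sketched on a toy frame as follows (all column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# A Series sharing the DataFrame's index aligns label by label
df['from_series'] = pd.Series([10, 20, 30], index=df.index)

# A list of the same length is assigned by position
df['from_list'] = [0.1, 0.2, 0.3]

# A single value is broadcast to every row
df['constant'] = 'x'

# Setting one location in a new column leaves the other rows as NaN
df.loc[0, 'partial'] = 99

print(df)
```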

**We can use `apply` and `lambda` to do this. Here's an example:**

To get the sepal area (length * width), we can use the expression shown below.

**Let's break down this code:**
```
df.apply(lambda x: x[num_cols[0]] * x[num_cols[1]], axis=1)
```

First, let's remember that:
- `num_cols[0] == 'sepal length (cm)'`
- `num_cols[1] == 'sepal width (cm)'`

The apply function will allow us to pass a function over an axis of our DataFrame:
- here, we specify `axis=1`, which means the function is called once per row and receives that row's column values.

`lambda` functions are disposable, and often just use the variable `x` as a placeholder for whatever they're operating on:
- because we're operating on our rows, `x` becomes a row each time our function is applied.
- we can pick out the columns we want by name, so each multiplication uses values taken from the same row.

Thus, we can basically read this code as:
```
For each row in df, multiply sepal length (cm) by sepal width (cm)
```
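As a sanity check on that reading, this sketch runs the same expression on three made-up rows and compares it with plain column-by-column multiplication (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9, 6.3],
    'sepal width (cm)': [3.5, 3.0, 3.3],
})
num_cols = df.columns.values

# Row-wise: x is one row at a time, indexed by column name
area_apply = df.apply(lambda x: x[num_cols[0]] * x[num_cols[1]], axis=1)

# Vectorized equivalent: multiply the two columns directly
area_vectorized = df[num_cols[0]] * df[num_cols[1]]

print(area_apply)
print(area_vectorized)
```

Both approaches give the same values; the vectorized form is usually faster on large frames.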

**Now, do this for ALL numeric columns and assign this the new column `interaction`**
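One possible shape for this step, sketched on a toy frame with four placeholder numeric columns (`a` through `d` are assumptions for illustration). It also shows `DataFrame.prod(axis=1)`, a vectorized alternative to `apply` that the lab does not require:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8],
    'species': [0, 1],
})
num_cols = df.drop(columns='species').columns.values

# apply with a lambda: multiply the four values in each row
via_apply = df.apply(
    lambda x: x[num_cols[0]] * x[num_cols[1]] * x[num_cols[2]] * x[num_cols[3]],
    axis=1,
)

# Alternative: take the row-wise product of the selected columns
via_prod = df[list(num_cols)].prod(axis=1)

df['interaction'] = via_apply
print(df)
```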

### Return All Rows for a Subset of Columns

There are lots of ways to select data of interest in a DataFrame.

Use the `.loc` indexer to display all of the rows (`:`) and only the numeric columns.

Note that we can do the same thing just by passing a list of column names to `df[...]`.
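Both selection styles, sketched on a toy frame (column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    'height': [1.0, 2.0],
    'width': [3.0, 4.0],
    'species': [0, 1],
})
num_cols = ['height', 'width']

# .loc takes a row selector and a column selector; ':' means every row
with_loc = df.loc[:, num_cols]

# Passing a list of column names in square brackets selects the same columns
with_list = df[num_cols]

print(with_loc)
```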

### Insert `target_names`

Let's add a column of string labels (dtype `object` in pandas) derived from the species column, so we have human-readable names.

Let's look at our `target_names` again.

And our `species` column.

*Using `.value_counts()` will get us the count for each of our classes.*

While not extremely well documented, the order of `data.target_names` corresponds with our numeric targets.


We'll start by writing a dictionary `species_name` by using a dictionary comprehension on the `data['target_names']` array from the original iris dataset. We'll use the `enumerate` function as an easy way to accomplish this.

Let's look at those key value pairs:

Then use the `map` pandas method on the `species` column with the `species_name` dict. Save the result to `df['labels']`.

**Why this works:**

When we `map` a dict to a Series, we pass each value of the Series as a key to our dict. Here, our `species` are the integers `[0, 1, 2]` which are the keys in our dict, so we access the corresponding species name currently stored as our values.

**Note**: `map` is much like `apply`, but it works element-wise on a single column (a pandas Series) and also accepts a dict as the mapping.
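The steps in this section can be sketched as follows. To keep the example self-contained, `target_names` and the `species` values are written out by hand here rather than taken from `data`:

```python
import numpy as np
import pandas as pd

# Stand-ins for data['target_names'] and the species column
target_names = np.array(['setosa', 'versicolor', 'virginica'])
df = pd.DataFrame({'species': [0, 1, 2, 1, 0]})

# enumerate yields (position, name) pairs, which become key/value pairs
species_name = {index: name for index, name in enumerate(target_names)}
print(species_name)

# Each integer in the column is looked up as a key in the dict
df['labels'] = df['species'].map(species_name)
print(df['labels'].value_counts())
```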

## Plotting

We'll start by importing our plotting packages using their preferred aliases.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Pairplot

Create a pairplot from your dataframe. Use the seaborn package to do so (`sns.pairplot`) and pass the argument `hue` with the column `'labels'`. This will use the labels column we created above to both color the scatterpoints and create a legend. Add a semicolon `;` at the end of your plotting code to prevent unsightly output above the chart. 

### Histograms

Generate one histogram for each of the 4 original numeric columns. Use at least 3 different plotting methods to do so (pandas built-in, matplotlib, seaborn). BONUS: Use plotly for one of the histograms.

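Hint: for seaborn you will want `sns.distplot`. Note that `distplot` has been deprecated since seaborn 0.11; on newer versions, use `sns.histplot` instead.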
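As a starting point, here is a sketch of the pandas built-in and matplotlib approaches on toy data (the column names, values, and bin count are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'length': [1.0, 1.2, 1.1, 3.0, 3.3, 3.1, 2.0, 2.2],
    'width': [0.5, 0.6, 0.4, 1.5, 1.4, 1.6, 1.0, 0.9],
})

# pandas built-in: draws one subplot per numeric column
axes = df.hist(bins=5, figsize=(8, 3))

# matplotlib: build the figure explicitly and plot a single column
fig, ax = plt.subplots()
ax.hist(df['length'], bins=5)
ax.set_xlabel('length')
ax.set_title('Histogram of length')
```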

In [None]:
# matplotlib

In [None]:
# seaborn distplot (or histplot on seaborn >= 0.11)

#### BONUS: Plotly

If you want to get started with plotly, you will need to pip install the two packages listed in the cell below and then head to [plotly](https://plot.ly/feed/#/) and click the sign-up button. After you create your account, you will see a field marked API Key. Click the `Regenerate Key` button to display a new API key, then copy and paste it, along with your username, into the credentials cell below.

In [None]:
# Plotly histogram
#!pip install plotly --upgrade
#!pip install cufflinks --upgrade

In [None]:
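# In order to get plotly to work, you will need to create a username and api_key.
# The following code writes your credentials to a static file,
# so you will only ever need to run it once.
# Note: in plotly 4 and later, this function lives in the separate chart_studio
# package (chart_studio.tools.set_credentials_file).
import plotly
plotly.tools.set_credentials_file(username='', api_key='')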