# Tutorial 2: exercise

(c) 2016 Justin Bois. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

*This tutorial exercise was generated from a Jupyter notebook. You can download the notebook [here](t2_exercise.ipynb). Use this downloaded Jupyter notebook to fill out your responses.*

In [2]:
import numpy as np
import pandas as pd

### Exercise 1

The [Anderson-Fisher iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a classic data set used in statistical and machine learning applications. Edgar Anderson carefully measured the lengths and widths of the petals and sepals of 50 irises in each of three species, *I. setosa*, *I. versicolor*, and *I. virginica*. Ronald Fisher then used this data set to distinguish the three species from each other.

**a)** Load the data set, which you can download [here](../data/anderson-fisher-iris.csv), into a Pandas `DataFrame` called `df`. Be sure to check out the structure of the data set before loading. You will need to use the `header=[0,1]` kwarg of `pd.read_csv()` to load the data set properly.

**b)** Take a look at `df`. Is it tidy? Why or why not?

**c)** Melt the `DataFrame` into a tidy `DataFrame` called `df_tidy` with columns `['species', 'quantity', 'value']`. Discuss why this is a tidy data frame.

**d)** Using `df_tidy`, slice out all of the sepal lengths for *I. versicolor* as a Numpy array.

In [8]:
f = '../data/anderson-fisher-iris.csv'

df = pd.read_csv(f, comment='#', header=[0, 1])
df

Unnamed: 0_level_0,setosa,setosa,setosa,setosa,versicolor,versicolor,versicolor,versicolor,virginica,virginica,virginica,virginica
Unnamed: 0_level_1,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2,7.0,3.2,4.7,1.4,6.3,3.3,6.0,2.5
1,4.9,3.0,1.4,0.2,6.4,3.2,4.5,1.5,5.8,2.7,5.1,1.9
2,4.7,3.2,1.3,0.2,6.9,3.1,4.9,1.5,7.1,3.0,5.9,2.1
3,4.6,3.1,1.5,0.2,5.5,2.3,4.0,1.3,6.3,2.9,5.6,1.8
4,5.0,3.6,1.4,0.2,6.5,2.8,4.6,1.5,6.5,3.0,5.8,2.2
5,5.4,3.9,1.7,0.4,5.7,2.8,4.5,1.3,7.6,3.0,6.6,2.1
6,4.6,3.4,1.4,0.3,6.3,3.3,4.7,1.6,4.9,2.5,4.5,1.7
7,5.0,3.4,1.5,0.2,4.9,2.4,3.3,1.0,7.3,2.9,6.3,1.8
8,4.4,2.9,1.4,0.2,6.6,2.9,4.6,1.3,6.7,2.5,5.8,1.8
9,4.9,3.1,1.5,0.1,5.2,2.7,3.9,1.4,7.2,3.6,6.1,2.5


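b) The data are not tidy. One reason is that each row does not correspond to one observation: a single row holds measurements from three different flowers, one per species. Also, the column headers encode two variables at once (the species and the quantity being measured) instead of storing each variable in a column of its own.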
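
One way to see this is to inspect the two levels of the column index. The following is a minimal sketch, not part of the original solution; `df_small` is a hypothetical two-row, two-species stand-in for `df`, built by hand from values shown in the table above.

```python
import pandas as pd

# Hypothetical miniature of df: two species, two quantities, two rows
columns = pd.MultiIndex.from_product(
    [['setosa', 'versicolor'], ['sepal length (cm)', 'petal length (cm)']]
)
df_small = pd.DataFrame([[5.1, 1.4, 7.0, 4.7],
                         [4.9, 1.4, 6.4, 4.5]], columns=columns)

# Level 0 of the column index holds the species,
# level 1 holds the quantity being measured
species_labels = list(df_small.columns.get_level_values(0).unique())
quantity_labels = list(df_small.columns.get_level_values(1).unique())

print(species_labels)
print(quantity_labels)
```

Both variables live in the headers rather than in columns, which is what the melt in part c) addresses.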



In [12]:
myCol = ['species', 'quantity', 'value']

df_tidy = pd.melt(df)
df_tidy.columns = myCol
df_tidy

Unnamed: 0,species,quantity,value
0,setosa,sepal length (cm),5.1
1,setosa,sepal length (cm),4.9
2,setosa,sepal length (cm),4.7
3,setosa,sepal length (cm),4.6
4,setosa,sepal length (cm),5.0
5,setosa,sepal length (cm),5.4
6,setosa,sepal length (cm),4.6
7,setosa,sepal length (cm),5.0
8,setosa,sepal length (cm),4.4
9,setosa,sepal length (cm),4.9


c) `df_tidy` is tidy because each column holds exactly one variable, each row constitutes one observation, and each type of observational unit forms a separate table. Each row is a single measurement made on a flower, and the three recorded variables are the species, the quantity being measured, and the value of that quantity. The third criterion is satisfied because every row describes the same kind of observational unit: a length or width measured on an iris.
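
As an alternative to renaming the columns after melting, the column levels can be named first, in which case `pd.melt()` uses those names for the new columns. This is a sketch under that assumption; `df_small` and `tidy_small` are hypothetical names, and the values are copied from the first two rows of the table above.

```python
import pandas as pd

# Hypothetical miniature of df, with the two header levels given names
columns = pd.MultiIndex.from_product(
    [['setosa', 'versicolor'], ['sepal length (cm)', 'petal length (cm)']],
    names=['species', 'quantity'],
)
df_small = pd.DataFrame([[5.1, 1.4, 7.0, 4.7],
                         [4.9, 1.4, 6.4, 4.5]], columns=columns)

# melt() turns each header level into a column named after that level;
# the measurements go into a column called 'value' by default
tidy_small = pd.melt(df_small)

print(tidy_small)
```

The result has one row per measurement, laid out the same way as `df_tidy`.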

In [15]:
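# Keep only the I. versicolor rows
versi = df_tidy['species'] == 'versicolor'
df_tidy_versi = df_tidy.loc[versi, :]
# Of those, keep only the sepal lengths (mask built from the frame it indexes)
seplen = df_tidy_versi['quantity'] == 'sepal length (cm)'
df_tidy_versi_seplen = df_tidy_versi.loc[seplen, :]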
df_tidy_versi_seplen

Unnamed: 0,species,quantity,value
200,versicolor,sepal length (cm),7.0
201,versicolor,sepal length (cm),6.4
202,versicolor,sepal length (cm),6.9
203,versicolor,sepal length (cm),5.5
204,versicolor,sepal length (cm),6.5
205,versicolor,sepal length (cm),5.7
206,versicolor,sepal length (cm),6.3
207,versicolor,sepal length (cm),4.9
208,versicolor,sepal length (cm),6.6
209,versicolor,sepal length (cm),5.2


In [20]:
# Pull out the 'value' column by name and convert it to a NumPy array
versiSepLen = df_tidy_versi_seplen['value']
values = np.asarray(versiSepLen)
type(values)

numpy.ndarray

In [21]:
values

array([ 7. ,  6.4,  6.9,  5.5,  6.5,  5.7,  6.3,  4.9,  6.6,  5.2,  5. ,
        5.9,  6. ,  6.1,  5.6,  6.7,  5.6,  5.8,  6.2,  5.6,  5.9,  6.1,
        6.3,  6.1,  6.4,  6.6,  6.8,  6.7,  6. ,  5.7,  5.5,  5.5,  5.8,
        6. ,  5.4,  6. ,  6.7,  6.3,  5.6,  5.5,  5.5,  6.1,  5.8,  5. ,
        5.6,  5.7,  5.7,  6.2,  5.1,  5.7])
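
The same selection can also be written in one step by combining the two Boolean masks with `&` and asking `.loc` for the `'value'` column directly. A minimal sketch, assuming a hypothetical four-row tidy frame `tidy_small` in place of `df_tidy`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_tidy
tidy_small = pd.DataFrame({
    'species': ['setosa', 'versicolor', 'versicolor', 'versicolor'],
    'quantity': ['sepal length (cm)', 'sepal length (cm)',
                 'sepal length (cm)', 'petal length (cm)'],
    'value': [5.1, 7.0, 6.4, 4.7],
})

# Each comparison needs its own parentheses because & binds tighter than ==
mask = ((tidy_small['species'] == 'versicolor')
        & (tidy_small['quantity'] == 'sepal length (cm)'))

# Select rows by mask and the column by label, then take the underlying array
values_small = tidy_small.loc[mask, 'value'].values

print(values_small)
```

Selecting the column by label rather than by position keeps the code working if the column order ever changes.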

### Exercise 2

**a)** Perform the following operations, starting from the original `DataFrame` you loaded in Exercise 1, to generate a new `DataFrame`. Do these operations one by one and explain what each one does to the `DataFrame`. The Pandas documentation might help.

In [26]:
df_new = df.stack(level=0)
df_new

Unnamed: 0,Unnamed: 1,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm)
0,setosa,1.4,0.2,5.1,3.5
0,versicolor,4.7,1.4,7.0,3.2
0,virginica,6.0,2.5,6.3,3.3
1,setosa,1.4,0.2,4.9,3.0
1,versicolor,4.5,1.5,6.4,3.2
1,virginica,5.1,1.9,5.8,2.7
2,setosa,1.3,0.2,4.7,3.2
2,versicolor,4.9,1.5,6.9,3.1
2,virginica,5.9,2.1,7.1,3.0
3,setosa,1.5,0.2,4.6,3.1


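`stack(level=0)` takes level 0 of the column index (the species) and moves it into the row index as a new inner level. Each original row is thereby split into n sub-rows, one for each of the n species, and the columns that remain are the four measured quantities. The species labels now appear in the row index instead of in the column headers.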
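
The effect is easier to follow on a small frame. The sketch below uses a hypothetical `df_small` with two species and two quantities, with values taken from the first two rows of the original table. The order of the remaining columns can differ between Pandas versions, so it should not be relied upon.

```python
import pandas as pd

# Hypothetical miniature of df: species on level 0, quantity on level 1
columns = pd.MultiIndex.from_product(
    [['setosa', 'versicolor'], ['sepal length (cm)', 'petal length (cm)']]
)
df_small = pd.DataFrame([[5.1, 1.4, 7.0, 4.7],
                         [4.9, 1.4, 6.4, 4.5]], columns=columns)

# Move the species level from the columns into the row index
stacked_small = df_small.stack(level=0)

print(stacked_small)
print(stacked_small.index.nlevels)  # the row index now has two levels
```

Two original rows times two species give four rows, each labeled by the pair (row number, species).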

In [29]:
# sort_index(level=1) replaces sortlevel(1), which was removed in pandas 1.0
df_new = df_new.sort_index(level=1)
df_new

Unnamed: 0,Unnamed: 1,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm)
0,setosa,1.4,0.2,5.1,3.5
1,setosa,1.4,0.2,4.9,3.0
2,setosa,1.3,0.2,4.7,3.2
3,setosa,1.5,0.2,4.6,3.1
4,setosa,1.4,0.2,5.0,3.6
5,setosa,1.7,0.4,5.4,3.9
6,setosa,1.4,0.3,4.6,3.4
7,setosa,1.5,0.2,5.0,3.4
8,setosa,1.4,0.2,4.4,2.9
9,setosa,1.5,0.1,4.9,3.1


Sorting on level 1 of the row index (`sortlevel(1)`, or `sort_index(level=1)` in current versions of Pandas) orders the rows by species, which `stack()` placed in that index level. The species is still part of the row index, not a column. All measurements for a given flower are recorded to its right, and the species run alphabetically from top to bottom.
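
A minimal sketch of this sort on a hypothetical four-row frame, `stacked_small`, whose row index has the same two-level layout as the stacked iris data:

```python
import pandas as pd

# Hypothetical miniature of the stacked frame: (row number, species) index
index = pd.MultiIndex.from_tuples(
    [(0, 'setosa'), (0, 'versicolor'), (1, 'setosa'), (1, 'versicolor')]
)
stacked_small = pd.DataFrame({'sepal length (cm)': [5.1, 7.0, 4.9, 6.4]},
                             index=index)

# Sort by the species level first; ties are broken by the remaining level
sorted_small = stacked_small.sort_index(level=1)

print(sorted_small)
```

After sorting, all *I. setosa* rows come first, followed by all *I. versicolor* rows, each group ordered by row number.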

In [30]:
df_new = df_new.reset_index(level=1)
df_new

Unnamed: 0,level_1,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm)
0,setosa,1.4,0.2,5.1,3.5
1,setosa,1.4,0.2,4.9,3.0
2,setosa,1.3,0.2,4.7,3.2
3,setosa,1.5,0.2,4.6,3.1
4,setosa,1.4,0.2,5.0,3.6
5,setosa,1.7,0.4,5.4,3.9
6,setosa,1.4,0.3,4.6,3.4
7,setosa,1.5,0.2,5.0,3.4
8,setosa,1.4,0.2,4.4,2.9
9,setosa,1.5,0.1,4.9,3.1


`reset_index(level=1)` does not sort anything. It takes level 1 of the row index (the species) and turns it into an ordinary column, which it names `'level_1'` by default because that index level had no name. The remaining index level keeps labeling the rows as 0, 1, 2, and so on. Note that the table from the previous step has 4 columns, while the one created by `reset_index()` has 5.
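
The following sketch shows the move from index level to column on a hypothetical frame, `sorted_small`, with an unnamed two-level row index:

```python
import pandas as pd

# Hypothetical miniature of the sorted frame: (row number, species) index
index = pd.MultiIndex.from_tuples(
    [(0, 'setosa'), (1, 'setosa'), (0, 'versicolor'), (1, 'versicolor')]
)
sorted_small = pd.DataFrame({'sepal length (cm)': [5.1, 4.9, 7.0, 6.4]},
                            index=index)

# Turn index level 1 into a regular column; level 0 stays as the row index
reset_small = sorted_small.reset_index(level=1)

print(reset_small)
print(list(reset_small.columns))
```

The new column is inserted on the left and picks up the default name `'level_1'`, so the frame gains one column and the row index drops to a single level.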

In [31]:
df_new = df_new.rename(columns={'level_1': 'species'})
df_new

Unnamed: 0,species,petal length (cm),petal width (cm),sepal length (cm),sepal width (cm)
0,setosa,1.4,0.2,5.1,3.5
1,setosa,1.4,0.2,4.9,3.0
2,setosa,1.3,0.2,4.7,3.2
3,setosa,1.5,0.2,4.6,3.1
4,setosa,1.4,0.2,5.0,3.6
5,setosa,1.7,0.4,5.4,3.9
6,setosa,1.4,0.3,4.6,3.4
7,setosa,1.5,0.2,5.0,3.4
8,setosa,1.4,0.2,4.4,2.9
9,setosa,1.5,0.1,4.9,3.1


`rename()` lets you rename columns by passing a dictionary that maps old names to new ones. In this case, we change `level_1` to `species`, so that the name describes the column accurately.
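
The renaming step can be avoided by naming the index level before resetting it, since `reset_index()` uses the level name for the new column. This is a sketch of that alternative, not the approach the exercise asks for; `stacked_small` and `named_small` are hypothetical names.

```python
import pandas as pd

# Hypothetical miniature of the stacked frame: (row number, species) index
index = pd.MultiIndex.from_tuples(
    [(0, 'setosa'), (1, 'setosa'), (0, 'versicolor'), (1, 'versicolor')]
)
stacked_small = pd.DataFrame({'sepal length (cm)': [5.1, 4.9, 7.0, 6.4]},
                             index=index)

# Give index level 1 a name, then move it into a column of that name
stacked_small.index = stacked_small.index.set_names('species', level=1)
named_small = stacked_small.reset_index(level='species')

print(named_small)
```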

**b)** Is the resulting `DataFrame` tidy? Why or why not?

Yes, the resulting `DataFrame` is tidy. Each column represents a variable: the species, along with the four measured quantities. Each row represents an observation: a single flower, with its species and the sizes of its sepals and petals recorded. All of the measurements in the table describe the same type of observational unit, so they belong together in one table.

Because it satisfies these three criteria, it is indeed tidy.

**c)** Using `df_new`, slice out all of the sepal lengths for *I. versicolor* as a Numpy array.

In [43]:
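# Keep only the I. versicolor rows
versi = df_new['species'] == 'versicolor'
df_new_versi = df_new.loc[versi, :]
# Select the sepal length column by name rather than by position
df_new_versi_seplen = df_new_versi['sepal length (cm)']
values = np.asarray(df_new_versi_seplen)
type(values)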

numpy.ndarray

In [44]:
values

array([ 7. ,  6.4,  6.9,  5.5,  6.5,  5.7,  6.3,  4.9,  6.6,  5.2,  5. ,
        5.9,  6. ,  6.1,  5.6,  6.7,  5.6,  5.8,  6.2,  5.6,  5.9,  6.1,
        6.3,  6.1,  6.4,  6.6,  6.8,  6.7,  6. ,  5.7,  5.5,  5.5,  5.8,
        6. ,  5.4,  6. ,  6.7,  6.3,  5.6,  5.5,  5.5,  6.1,  5.8,  5. ,
        5.6,  5.7,  5.7,  6.2,  5.1,  5.7])
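
This array matches the one obtained from `df_tidy` in Exercise 1. A check of that kind can be sketched with `np.array_equal()`; the two small frames below, `long_small` and `wide_small`, are hypothetical stand-ins for `df_tidy` and `df_new`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_tidy (one row per measurement)
long_small = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'quantity': ['sepal length (cm)'] * 4,
    'value': [5.1, 4.9, 7.0, 6.4],
})

# Hypothetical miniature of df_new (one row per flower)
wide_small = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'sepal length (cm)': [5.1, 4.9, 7.0, 6.4],
})

# Long format: filter on both species and quantity, then take 'value'
long_mask = ((long_small['species'] == 'versicolor')
             & (long_small['quantity'] == 'sepal length (cm)'))
from_long = long_small.loc[long_mask, 'value'].values

# Wide format: filter on species, then take the quantity's own column
from_wide = wide_small.loc[wide_small['species'] == 'versicolor',
                           'sepal length (cm)'].values

same = np.array_equal(from_long, from_wide)
print(same)
```

In the long format the quantity is selected with a row filter, while in the wide format it is selected as a column.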