# DSE 200 FINAL PROJECT
## Fall 2018
### Due Date:  December 7th, 2018

The final project consists of two parts:
* <b>Part I</b> is a set of coding questions that require the _numpy_ library to analyze the provided dataset.  
* <b>Part II</b> is a guided project in which you build your own end-to-end analysis in Python, drawing on what you learned about Python _IO_ and the _pandas_, _matplotlib_ and _scikit-learn_ libraries.  

<b>Deliverables</b>: Submit both parts as one notebook via GitHub by midnight on the due date above, along with clear instructions on how to download the datasets you used for Part II and reproduce your results. Organize the notebook with a clear table of contents at the top _(see the example in the Pylaski notebook from Day 5)_ and links to the parts/steps outlined. Don't forget to add your name at the top as the author of the notebook. 

# PART I: 20%

### Preliminaries

In [1]:
import numpy as np

### 1.1 Preliminaries

Use numpy to load `iris.npy` into a numpy matrix. Print the dataset's shape and the first 5 rows.<br>

**Output required**: 
<ul>
    <li>Tuple representing dataset's shape</li>
    <li>Matrix representing the first 5 rows</li>
</ul>

In [2]:
# For reference
column_names = ['Id','SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm','Species']
species_encoding = {'Iris-setosa': 1, 'Iris-versicolor': 2, 'Iris-virginica': 3}

In [3]:
data = np.load('iris.npy')
print(data.shape)
print(data[:5])

(150, 6)
[[ 1.   5.1  3.5  1.4  0.2  1. ]
 [ 2.   4.9  3.   1.4  0.2  1. ]
 [ 3.   4.7  3.2  1.3  0.2  1. ]
 [ 4.   4.6  3.1  1.5  0.2  1. ]
 [ 5.   5.   3.6  1.4  0.2  1. ]]
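
The load step can be tried without the course file. The following is a minimal sketch, assuming `iris.npy` is an ordinary array written with `np.save`; the `toy` array and the temporary `toy.npy` path are made up for illustration and are not part of the assignment data.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in array; the real assignment uses iris.npy instead.
toy = np.arange(12, dtype=float).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npy")
    np.save(path, toy)      # write the array in NumPy's .npy format
    loaded = np.load(path)  # read it back fully into memory

print(loaded.shape)  # shape is a tuple of (rows, columns)
print(loaded[:2])    # slicing rows works the same way as data[:5]
```

The round trip preserves both the shape and the dtype, which is why `data.shape` can be printed immediately after `np.load`.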


### 1.2  Transformations

The first column is the id of the sample, which isn't relevant for our purposes. Remove that column from the matrix by creating a new matrix composed of the rest of the columns.<br>
As usual, print the shape of the resulting dataset and the first 5 rows.

**Output required**: 
<ul>
    <li>Tuple representing dataset's shape</li>
    <li>Matrix representing the first 5 rows</li>
</ul>

In [4]:
data = data[:, 1:]
print(data.shape)
print(data[:5])

(150, 5)
[[ 5.1  3.5  1.4  0.2  1. ]
 [ 4.9  3.   1.4  0.2  1. ]
 [ 4.7  3.2  1.3  0.2  1. ]
 [ 4.6  3.1  1.5  0.2  1. ]
 [ 5.   3.6  1.4  0.2  1. ]]
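
Slicing with `data[:, 1:]` is one way to drop the leading column; `np.delete` is another. The sketch below compares the two on a made-up `toy` matrix (not the iris data) to show that they give the same values but differ in whether memory is shared with the original array.

```python
import numpy as np

# Hypothetical toy matrix; column 0 plays the role of the id column.
toy = np.array([[1, 5.1, 3.5, 1.0],
                [2, 4.9, 3.0, 1.0],
                [3, 6.3, 3.3, 3.0]])

sliced = toy[:, 1:]                  # basic slicing returns a view of toy
deleted = np.delete(toy, 0, axis=1)  # np.delete returns a new array

print(sliced.shape, deleted.shape)
print(np.array_equal(sliced, deleted))   # same values either way
print(np.shares_memory(toy, sliced))     # the view shares toy's buffer
print(np.shares_memory(toy, deleted))    # the copy does not
```

Because the slice is a view, writing into it would also change `toy`; call `.copy()` on the slice if an independent array is needed.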


### 1.3 Summary Statistics

*Note: Don't worry about the order in which you display the values in this section. Display them in whatever order/grouping makes most sense to you*

**a)** Print the means and standard deviations of each column.

**Output required**: 
<ul>
    <li>Floats representing the standard deviation of each column</li>
    <li>Floats representing the mean of each column</li>
</ul>

In [5]:
print("Means:\n", data.mean(axis=0))
print("Std devs:\n", data.std(axis=0))

Means:
 [ 5.84333333  3.054       3.75866667  1.19866667  2.        ]
Std devs:
 [ 0.82530129  0.43214658  1.75852918  0.76061262  0.81649658]
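
Note that `ndarray.std` defaults to `ddof=0`, so the values above are population standard deviations. pandas, by contrast, defaults to `ddof=1` (the sample standard deviation), so the two libraries will disagree slightly on the same data unless `ddof` is set explicitly. A minimal sketch on a made-up `toy` array (not the iris data):

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 columns.
toy = np.array([[2.0, 10.0],
                [4.0, 10.0],
                [6.0, 13.0]])

pop_std = toy.std(axis=0)             # divides by N (NumPy's default, ddof=0)
sample_std = toy.std(axis=0, ddof=1)  # divides by N - 1

print("Population std:", pop_std)
print("Sample std:    ", sample_std)
```

With 150 rows the difference is small, but it is worth stating which convention a reported value uses.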


**b)** Print the minimum and maximum values of each column.

**Output required**: 
<ul>
    <li>Floats representing the minimum value found in each column</li>
    <li>Floats representing the maximum value found in each column</li>
</ul>

In [6]:
print("Mins:\n", data.min(axis=0))
print("Maxs:\n", data.max(axis=0))

Mins:
 [ 4.3  2.   1.   0.1  1. ]
Maxs:
 [ 7.9  4.4  6.9  2.5  3. ]


**c)** Calculate the species-wise means and standard deviations.<br>
**Report these values under the actual *name* of each species, using the encoding given in 1.1.**

**Output required**: 
<ul>
    <li>For each of the 3 species in the dataset:<ul>
        <li>Floats representing the standard deviation of each column for this species</li>
        <li>Floats representing the mean of each column for this species</li>
    </ul></li>
</ul>

In [7]:
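# Iterate over (name, numeric code) pairs from the encoding given in 1.1
for species, species_id in species_encoding.items():
    # Select this species' rows once; the last column holds the species code
    subset = data[data[:, -1] == species_id]
    print("Species:", species)
    print("\tMean: ", subset.mean(axis=0))
    print("\tSTD: ", subset.std(axis=0))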


Species: Iris-setosa
	Mean:  [ 5.006  3.418  1.464  0.244  1.   ]
	STD:  [ 0.34894699  0.37719491  0.17176728  0.10613199  0.        ]
Species: Iris-versicolor
	Mean:  [ 5.936  2.77   4.26   1.326  2.   ]
	STD:  [ 0.51098337  0.31064449  0.46518813  0.19576517  0.        ]
Species: Iris-virginica
	Mean:  [ 6.588  2.974  5.552  2.026  3.   ]
	STD:  [ 0.62948868  0.31925538  0.54634787  0.27188968  0.        ]
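
The species-wise grouping can also be driven by the codes actually present in the data. The sketch below is one possible variant, not the required solution: it uses `np.unique` to find the codes, inverts the encoding from **1.1** to recover names, and leaves the label column out of the statistics. The `toy` array, `code_to_name` and `stats` are hypothetical names introduced here.

```python
import numpy as np

species_encoding = {'Iris-setosa': 1, 'Iris-versicolor': 2, 'Iris-virginica': 3}
# Invert name -> code into code -> name for reporting.
code_to_name = {code: name for name, code in species_encoding.items()}

# Hypothetical toy data: one feature column plus the species code in the last column.
toy = np.array([[5.0, 1.0],
                [5.2, 1.0],
                [6.0, 2.0],
                [6.4, 2.0],
                [7.0, 3.0]])

labels = toy[:, -1]
features = toy[:, :-1]  # exclude the label column from the statistics

stats = {}
for code in np.unique(labels):
    rows = features[labels == code]  # boolean mask selects one species
    stats[code_to_name[int(code)]] = (rows.mean(axis=0), rows.std(axis=0))

for name, (mean, std) in stats.items():
    print(name, "mean:", mean, "std:", std)
```

Dropping the label column avoids reporting its trivial mean (the code itself) and standard deviation (zero) alongside the real measurements.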


### 1.4  Advanced list comprehensions and numpy

Use list comprehensions to generate a list of tuples for each species.<br>
Each tuple will have the column name and that column's mean. Note that the column names are listed in **1.1**, but recall that you dropped the id column.
Each list will have the following format:
    `[(column_name, column_mean), (column_name, column_mean), ...]`
    
   *hint*: The enumerate function might be helpful in creating a concise comprehension<br>
   *hint*: Check your intuition using your **1.3c** output

**Output required**: 
<ul>
    <li>Three lists of tuples</li>
</ul>

In [9]:
# column_names[1:] skips the dropped Id column; means are formatted as 2-decimal strings for display
species1 = [(name, '%.2f' % data[data[:, -1] == 1][:,i].mean()) for i, name in enumerate(column_names[1:])]
species2 = [(name, '%.2f' % data[data[:, -1] == 2][:,i].mean()) for i, name in enumerate(column_names[1:])]
species3 = [(name, '%.2f' % data[data[:, -1] == 3][:,i].mean()) for i, name in enumerate(column_names[1:])]

print(species1)
print(species2)
print(species3)

[('SepalLengthCm', '5.01'), ('SepalWidthCm', '3.42'), ('PetalLengthCm', '1.46'), ('PetalWidthCm', '0.24'), ('Species', '1.00')]
[('SepalLengthCm', '5.94'), ('SepalWidthCm', '2.77'), ('PetalLengthCm', '4.26'), ('PetalWidthCm', '1.33'), ('Species', '2.00')]
[('SepalLengthCm', '6.59'), ('SepalWidthCm', '2.97'), ('PetalLengthCm', '5.55'), ('PetalWidthCm', '2.03'), ('Species', '3.00')]
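
The three near-identical comprehensions can be folded into a single nested comprehension keyed by species name. This is a sketch of one alternative, assuming it is acceptable to keep the means as rounded floats rather than strings and to omit the `Species` column; the `toy` array, `feature_names` and `summary` are hypothetical and the numbers are not from `iris.npy`.

```python
import numpy as np

column_names = ['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
species_encoding = {'Iris-setosa': 1, 'Iris-versicolor': 2, 'Iris-virginica': 3}

# Hypothetical toy rows in the post-1.2 layout: four measurements, then the species code.
toy = np.array([[5.0, 3.0, 1.0, 0.2, 1],
                [5.2, 3.4, 1.4, 0.4, 1],
                [6.0, 2.8, 4.0, 1.2, 2],
                [6.6, 3.0, 5.6, 2.0, 3]])

feature_names = column_names[1:-1]  # drop both Id and Species

# Outer comprehension: one entry per species; inner: one (name, mean) tuple per column.
summary = {
    name: [(col, round(float(toy[toy[:, -1] == code][:, i].mean()), 2))
           for i, col in enumerate(feature_names)]
    for name, code in species_encoding.items()
}

for name, pairs in summary.items():
    print(name, pairs)
```

Keeping the values numeric makes the lists usable in later calculations; formatting can then be applied only at print time.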
