# Welcome to the Data Science Gym!

Sharpen your data science skills by just *doing* data science.

<br>

### Workout Overview

**ID:** <a href="https://github.com/dskarbrevik/Data-Science-Gym">DSG1ML1</a>

**Type:** Machine Learning

**Main task:** Simple classification

**Data theme:** Flower types

**Data types:** Numerical

**Data size:** Small (<1GB)

**Special System requirements:** None 

**Difficulty:** &#11088;

[Note: If this workout doesn't seem like a good fit for you at the moment, the <a href="https://github.com/dskarbrevik/Data-Science-Gym">Data Science Gym</a> has other workouts that may be a better fit.]

<br>
***
<br>

### SIGN-IN TO THE GYM!

**Username:** dskarbrevik

**Date:** 2/25/2018

**Favorite flower:** corpse flower... tulips are cool too I guess

<br>
***
<br>

**BASIC GYM RULES (everything you need to know)**

**1)** You have a maximum of **3 hours to complete a workout** in this gym.

**2)** Start your timer immediately after reading these rules.

**3)** You don't have to use all 3 hours, but you should spend **at least 1 full hour** in this gym or else your Data Science muscles might not get much bigger :(

**4)** As long as you are the one typing into this notebook, you may use any resource you like (Python libraries, StackOverflow, phone a friend, etc.). 

**5)** If you copy any code directly from another source (e.g. StackOverflow) please put the link in the "Resources" section at the bottom of this notebook.

<br>
***
<br>

## READY?... OK, start your timer and get started!... and have fun!

<br>
***
<br>

<a id="toc"></a>

## Today's Workout Routine

<br>

<ol>
    <li><a href="#section1">Introduction</a> [just reading here]</li>
    <br>
    <li><a href="#section2">Loading the Data</a> [just run the code]</li>
    <br>
    <li><a href="#section3">Exploratory Data Analysis</a> [optional]</li>
    <br>
    <li><a href="#section4">Modeling</a> [most of your time should be spent here]</li> 
    <br>
    <li><a href="#section5">Conclusion</a> [try to leave at least 15-30 mins for this part]</li>             
</ol>

***

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section1'></a>

## 1) Introduction

This is a VERY small dataset (150 observations) where each row corresponds to one flower from one of three types. There are only five columns in each observation/row: the flower type (the label) plus four numerical features (sepal length, sepal width, petal length, petal width). It is one of the canonical "Hello World"s of machine learning.

More info on dataset: <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris Dataset</a>


### Main Goal:
Your goal is to **build a model that can classify the three types of flowers in the dataset**. That's it!

<br>
***
<br>

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section2'></a>

## 2) Loading the Data

In [40]:
from io import BytesIO
import urllib.request
import pandas as pd

# STEP 1) get the data from S3
url = "http://data-science-gym.s3.amazonaws.com/iris_data.csv"
response = urllib.request.urlopen(url)
data_bytes = response.read()

# STEP 2) make IO object
data_io = BytesIO(data_bytes)

# STEP 3) read into Pandas
iris_df = pd.read_csv(data_io, sep=",")

There you go, `iris_df` is now your Pandas DataFrame, ready to play with!
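
If the S3 link is ever unreachable, one possible fallback is to rebuild a similar DataFrame from the copy of the Iris data that ships with scikit-learn. This is only a sketch: the helper name `load_iris_offline` is made up for illustration, the column names are copied from the `iris_df.head()` output, and the bundled values may not match the CSV exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_offline():
    """Build a DataFrame shaped like iris_df from scikit-learn's bundled copy."""
    bunch = load_iris()
    # Column names are assumed to match the ones in the gym's CSV.
    df = pd.DataFrame(
        bunch.data,
        columns=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"],
    )
    # Map the integer targets (0, 1, 2) to their species names.
    df.insert(0, "Flower Type", bunch.target_names[bunch.target])
    return df


iris_offline = load_iris_offline()
print(iris_offline.shape)
print(iris_offline["Flower Type"].value_counts())
```

Unlike the CSV, this version has no `Unnamed: 0` column, so any code that selects columns by position would need adjusting.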

<br>
***
<br>

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section3'></a>

## 3) Exploratory Data Analysis

The loading step above already gives us the data as a Pandas DataFrame, which is generally the most convenient form for exploring it. So let's take a look at the first few rows.

In [41]:
iris_df.head()

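| Unnamed: 0 | Flower Type | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|---|
| 0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 |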


Now that we have our data in a DataFrame, there are quick functions (such as `iris_df.describe()`) we can run on it to get some basic information.

Note that the first few rows are all "setosa". This is because the dataset is ordered: the first 50 rows are setosa, the next 50 are versicolor, and the last 50 are virginica. It is important to appreciate this, because when we model the data we want to train the model on a random subset of our data (not just give it the first 100 rows and find it can only classify 2 of the 3 types of flowers we have).
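
To see the ordering problem concretely, here is a small sketch. It assumes scikit-learn's bundled Iris data as a stand-in for `iris_df` (it is ordered the same way, 50 rows per type), and the names `first_100` and `shuffled` are just for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()
labels = pd.Series(bunch.target_names[bunch.target], name="Flower Type")

# Taking the first 100 rows of the ordered data misses virginica entirely.
first_100 = labels.iloc[:100]
print(first_100.value_counts().to_dict())

# Shuffling the rows first gives a subset that contains all three types.
shuffled = labels.sample(frac=1, random_state=25)
print(shuffled.iloc[:100].nunique())
```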

Also, the basic statistics of the features show that petal length has a particularly high standard deviation, so it may be the most valuable feature for differentiating one or more types of flowers (just a hunch, not a certainty).
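
One way to check that hunch is to compare the spread of each feature and then look at the per-type means. The sketch below assumes scikit-learn's bundled Iris data as a stand-in for `iris_df`, with column names copied from the `head()` output. A large overall spread does not by itself prove a feature separates the types, which is why the per-type means are printed too.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()
features = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]
df = pd.DataFrame(bunch.data, columns=features)
df["Flower Type"] = bunch.target_names[bunch.target]

# Standard deviation of each feature over the whole dataset, largest first.
stds = df[features].std()
print(stds.sort_values(ascending=False))

# Mean of each feature within each flower type.
class_means = df.groupby("Flower Type")[features].mean()
print(class_means)
```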

<br>
***
<br>

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section4'></a>

## 4) Modeling

Some possible modeling techniques for a discrete classification problem like this:
* logistic regression
* Naive Bayes
* K-NN
* Decision Tree

But first, let's split our data into training and testing groups so we can evaluate our models later...

#### Data split

In [12]:
# prepping our data to train models
from sklearn.model_selection import train_test_split

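# select the four numerical feature columns and the label column from iris_df
feature_columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]
data = iris_df[feature_columns].values
labels = iris_df["Flower Type"].values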
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=.20, random_state=25)
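
With only 150 rows, a purely random split can leave the test set with uneven numbers of each flower type. A possible variation, sketched here on scikit-learn's bundled Iris data as an assumed stand-in for `iris_df`, is to pass `stratify` so each type keeps the same share in both groups. The variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the 50/50/50 class balance in both the train and test groups.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=25, stratify=y
)

print(Counter(y_train))
print(Counter(y_test))
```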

#### k-NN model

In [19]:
# training a k-NN classifier and evaluating it

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

n_neighbors = 3

knn = KNeighborsClassifier(n_neighbors)
knn = knn.fit(train_data, train_labels)
preds = knn.predict(test_data)

accuracy = accuracy_score(test_labels, preds)
print("Accuracy of k-NN with k = {} : {:.2f}%".format(n_neighbors, (accuracy*100)))

Accuracy of k-NN with k = 3 : 93.33%
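
The choice of k = 3 above is arbitrary. One common way to compare a few values of k is cross-validation, sketched below. For brevity this runs on scikit-learn's bundled Iris data (an assumption); in this notebook you would more likely pass `train_data` and `train_labels` so the test group stays untouched.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for a handful of candidate k values.
cv_accuracy = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_accuracy[k] = cross_val_score(knn, X, y, cv=5).mean()

for k, acc in cv_accuracy.items():
    print("k = {} : {:.2f}%".format(k, acc * 100))
```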


#### Decision Tree

In [15]:
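from sklearn.tree import DecisionTreeClassifier

# fit on the training split only, then score fresh predictions on the test split
dtree = DecisionTreeClassifier()
dtree = dtree.fit(train_data, train_labels)
preds = dtree.predict(test_data)
accuracy = accuracy_score(test_labels, preds)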
print("Accuracy of decision tree: {:.2f}%".format((accuracy*100)))

Accuracy of decision tree: 90.00%
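
A fitted decision tree also gives a rough way to revisit the hunch from the exploratory section about which features matter. The sketch below assumes scikit-learn's bundled Iris data and fits on all rows, which is fine for inspection but not for measuring accuracy. The feature names are copied from the `head()` output.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

bunch = load_iris()
features = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]

# random_state makes the tree reproducible from run to run.
clf = DecisionTreeClassifier(random_state=25).fit(bunch.data, bunch.target)

# feature_importances_ sums to 1; larger means the tree relied on it more.
importances = dict(zip(features, clf.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<13} {:.3f}".format(name, value))
```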


#### Random Forest

In [None]:
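from sklearn.ensemble import RandomForestClassifier
# fit on the training split, then score on the held-out test split
forest = RandomForestClassifier(random_state=25).fit(train_data, train_labels)
print("Accuracy of random forest: {:.2f}%".format(accuracy_score(test_labels, forest.predict(test_data)) * 100))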
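
To put the candidate models from the start of this section side by side, one option is to cross-validate each of them on the same data. This is a sketch under assumptions: it uses scikit-learn's bundled Iris data rather than `iris_df`, and the settings (`max_iter=1000`, `n_estimators=100`, `random_state=25`) are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "k-NN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=25),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=25),
}

# Mean 5-fold cross-validated accuracy for each model.
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, acc in results.items():
    print("{:<20} {:.2f}%".format(name, acc * 100))
```

On a dataset this small the differences between models are usually within a flower or two, so it is worth not reading too much into the ranking.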


<br>
***
<br>

<div align="right">
    <a href="#toc">back to top</a>
</div>
<a id='section5'></a>


## 5) Conclusion

This dataset is tiny, and what data it has is extremely specific. There aren't many interesting questions we can ask or much impact we can make with this data, but we can at least see that predicting flower type is trivially easy, which is why it is the "Hello World" of machine learning. Moving on!