 ### An Iris Case Study Notebook with Machine Learning and SKLearn
###  Notebook by [Anchal Gupta]
### Guided By [Dalijit Singh]
###  [Netmax Technologies Pvt Ltd]

## Table of contents

1. [Introduction](#Introduction)

2. [Required libraries](#Required-libraries)

3. [The problem domain](#The-problem-domain)

4. [Step 1: Answering the question](#Step-1:-Answering-the-question)

5. [Step 2: Checking the data](#Step-2:-Checking-the-data)

6. [Step 3: Tidying the data](#Step-3:-Tidying-the-data)

7. [Step 4: Exploratory analysis](#Step-4:-Exploratory-analysis)

8. [Step 5: Classification](#Step-5:-Classification)

9. [Step 6: Conclusion](#Step-6:-Conclusion)



####  Introduction

[[ go back to the top ]](#Table-of-contents)

In the time it took you to read this sentence, terabytes of data have been collectively generated across the world — more data than any of us could ever hope to process, much less make sense of, on the machines we're using to read this notebook.

***In response to this massive influx of data, the field of Data Science has come to the forefront in the past decade. Cobbled together by people from a diverse array of fields — statistics, physics, computer science, design, and many more — the field of Data Science represents our collective desire to understand and harness the abundance of data around us to build a better world.***

In this notebook, I'm going to go over a basic Python data analysis pipeline from start to finish to show you what a typical data science workflow looks like.

#### Required libraries

[[ go back to the top ]](#Table-of-contents)

If you don't have Python on your computer, you can use the [Anaconda Python distribution](http://continuum.io/downloads) to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* ***NumPy***: Provides a fast numerical array structure and helper functions.
* ***pandas***: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* ***scikit-learn***: The essential Machine Learning package in Python.
* ***matplotlib***: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* ***Seaborn***: Advanced statistical plotting library.

To make sure you have all of the packages you need, install them with `conda`:

    conda install numpy pandas scikit-learn matplotlib seaborn
    
`conda` may ask you to update some of them if you don't have the most recent version. Allow it to do so.
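
One quick way to confirm that the install worked is to import each package and print its version. The snippet below is only a sketch: the helper name `installed_versions` is made up for this example, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib

# Import names for the libraries listed above (scikit-learn imports as "sklearn").
packages = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]


def installed_versions(names):
    """Map each import name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions(packages).items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added by re-running the `conda install` command above.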


## The problem domain

[[ go back to the top ]](#Table-of-contents)

For the purposes of this exercise, let's pretend we're working for a startup that just got funded to create a smartphone app that automatically identifies species of flowers from pictures taken on the smartphone. We're working with a moderately-sized team of data scientists and will be building part of the data analysis pipeline for this app.

We've been tasked by our company's Head of Data Science to create a demo machine learning model that takes four measurements from the flowers (sepal length, sepal width, petal length, and petal width) and identifies the species based on those measurements alone.


<div style="float:left;width:200px;">
<img src="images/iris_setosa.jpg" width="150px" height="200px"  />
    <b>Iris Setosa</b>
</div>
<div style="float:left;width:200px;">
<img src="images/irsi_versicolor.jpg" width="150px" height="100px" />
    <b>Iris Versicolor</b>
</div>
<div style="width:200px;">
<img src="images/iris_virginica.jpg" width="150px" height="200px" />
    <b>Iris Virginica</b>
    </div>
<br/>
The four measurements we're using currently come from hand-measurements by the field researchers, but they will be automatically measured by an image processing model in the future.


## Step 1: Answering the question

[[ go back to the top ]](#Table-of-contents)

The first step to any data analysis project is to define the question or problem we're looking to solve, and to define a measure (or set of measures) for our success at solving that task. The data analysis checklist has us answer a handful of questions to accomplish that, so let's work through those questions.

Since we're performing classification, we can use [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision), the fraction of correctly classified flowers, to quantify how well our model is performing. Our company's Head of Data Science has told us that we should achieve at least 90% accuracy.
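
To make the success measure concrete, here is a minimal sketch of accuracy computed by hand and compared against the 90% target. The function name `accuracy` and the ten labels are hypothetical and serve only to illustrate the arithmetic.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute accuracy on zero samples")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for ten flowers (0 = setosa, 1 = versicolor, 2 = virginica).
true_labels = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
predicted_labels = [0, 0, 0, 1, 1, 2, 1, 2, 2, 2]

TARGET_ACCURACY = 0.90  # the threshold set for this project

score = accuracy(true_labels, predicted_labels)
print(f"accuracy = {score:.2f}, meets target: {score >= TARGET_ACCURACY}")
```

In practice, `sklearn.metrics.accuracy_score` computes the same quantity.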

**Thinking about and documenting the problem we're working on is an important, often overlooked step in performing effective data analysis. Don't skip it.**

In [1]:
import pandas as pd
from sklearn import neighbors, datasets

In [2]:
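# Load the iris data: X holds the four measurements, y the species labels
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Wrap the measurements in a DataFrame so the columns carry the feature names
X = pd.DataFrame(X, columns=iris.feature_names)
X.head()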


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [3]:
X.shape

(150, 4)

In [4]:
y.shape

(150,)

In [5]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], 
      dtype='<U10')

In [6]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [7]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [8]:
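# One new flower, with measurements in the same order as iris.feature_names:
# sepal length, sepal width, petal length, petal width (all in cm)
X_pred = [3, 5, 4, 2]
# predict expects a 2D input (one row per sample), hence the extra brackets
result = knn.predict([X_pred])
print(iris.target_names[result])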

['versicolor']


In [9]:
print(iris.target_names)
print(knn.predict_proba([X_pred]))

['setosa' 'versicolor' 'virginica']
[[ 0.   0.8  0.2]]
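
These probabilities are consistent with a simple vote. With `weights='uniform'`, `predict_proba` reports the fraction of the five nearest neighbours that belong to each class, so 0.8 and 0.2 would correspond to four versicolor neighbours and one virginica neighbour. The sketch below refits the same model on the plain arrays and inspects the neighbours directly. The variable names `neighbor_labels` and `vote_fractions` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, neighbors

iris = datasets.load_iris()
X, y = iris.data, iris.target

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X, y)

X_pred = [3, 5, 4, 2]

# kneighbors returns the distances to, and the row indices of,
# the k nearest training samples for each query row.
distances, indices = knn.kneighbors([X_pred])
neighbor_labels = y[indices[0]]

# With uniform weights every neighbour casts one equal vote,
# so the class probabilities are just the vote counts divided by k.
vote_counts = np.bincount(neighbor_labels, minlength=len(iris.target_names))
vote_fractions = vote_counts / knn.n_neighbors

print(iris.target_names[neighbor_labels])
print(vote_fractions)
```

With `weights='distance'` the votes would instead be weighted by inverse distance, and the fractions would no longer be simple multiples of 1/5.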


In [10]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
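
A minimal sketch of how the split might then be used to measure accuracy on data the model has not seen. It imports `train_test_split` from `sklearn.model_selection`, where the function lives in current scikit-learn releases. The `test_size`, `random_state` and `stratify` settings are illustrative assumptions, not values taken from this notebook.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out a quarter of the flowers for testing. stratify=y keeps the
# three species in roughly equal proportion in both subsets.
# (test_size and random_state are illustrative choices.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit on the training portion only.
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, y_train)

# score() returns the mean accuracy on the given data and labels.
test_accuracy = knn.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Evaluating on held-out flowers gives a fairer estimate than scoring the model on the same rows it was fitted on, which is what the 90% accuracy target should be checked against.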

