# Data Loading

First, get some data to play with.

We are going to load a labeled dataset of [hand-written digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html). Each sample is an 8x8 image of a hand-written digit, stored as a vector of size 64. Values in each vector are integers between 0 and 16.

You need Python 3.x to run the examples.

In [None]:
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits()
digits.keys()

In [None]:
digits.data.shape

In [None]:
print(digits.DESCR)

In [None]:
digits.target.shape

There are 10 classes, corresponding to the digits 0 to 9.

In [None]:
digits.target

In [None]:
np.bincount(digits.target)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.matshow(digits.data[0].reshape(8, 8), cmap=plt.cm.Greys)

In [None]:
fig, axes = plt.subplots(4, 4)
for x, y, ax in zip(digits.data, digits.target, axes.ravel()):
    ax.set_title(y)
    ax.imshow(x.reshape(8, 8), cmap="gray_r")
    ax.set_xticks(())
    ax.set_yticks(())
plt.tight_layout()

**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**
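
To make the shape convention concrete, here is a small sketch using the `digits.images` attribute, which holds the same pixels as 8x8 grids. Flattening each image into a row gives an array of shape `(n_samples, n_features)`, which should match `digits.data`. The variable name `X` is only an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images keeps each sample as an 8x8 grid: (n_samples, 8, 8)
n_samples = digits.images.shape[0]

# Flatten every image into one row of 64 features: (n_samples, n_features)
X = digits.images.reshape(n_samples, -1)

print(digits.images.shape)
print(X.shape)

# The flattened images hold the same values as digits.data
print(np.array_equal(X, digits.data))
```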

Split the data into a training set and a test set to get going.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target)

In [None]:
X_train.shape

In [None]:
X_test.shape
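
By default, `train_test_split` shuffles the data randomly and holds out 25% of the samples for testing, so the split above differs from run to run. A minimal sketch, assuming you want a reproducible split that also keeps the class proportions similar in both parts (the value `0` for `random_state` is an arbitrary choice):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# random_state fixes the shuffle so the same split is produced every run;
# stratify keeps the share of each digit class similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0, stratify=digits.target
)

print(X_train.shape, X_test.shape)
```

Passing `test_size` (for example `test_size=0.2`) changes the fraction that is held out.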

# Exercises

Load the iris dataset from the ``sklearn.datasets`` module using the ``load_iris`` function.
The function returns a dictionary-like object with the same core attributes as ``digits``, such as ``data``, ``target`` and ``DESCR``.

How many classes, features and data points does this dataset have?
Use a scatterplot to visualize the dataset.

You can look at the ``DESCR`` attribute to learn more about the dataset.


In [None]:
from sklearn.datasets import load_iris

iris_data = load_iris()
# Print each result explicitly: a notebook cell only displays its last expression
print(iris_data.keys())
print(iris_data.data.shape)
print(iris_data.feature_names)
print(iris_data.data[:10])

### Features in the Iris dataset:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
    
### Target classes to predict:
1. Iris Setosa
2. Iris Versicolour
3. Iris Virginica

<img src="figures/petal_sepal.jpg" alt="Sepal" style="width: 30%;"/>

Petal-sepal image from [Kyle's repo](https://github.com/kastnerkyle/scipy_2015_sklearn_tutorial/blob/master/notebooks/01.3%20Data%20Representation%20for%20Machine%20Learning.ipynb). Licensed under CC BY-SA 3.0 via [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg#/media/File:Petal-sepal.jpg).
   

In [None]:
# %load solutions/load_iris.py

Usually, data doesn't come in such a convenient format. You can find the CSV file that contains the iris dataset at the following path:

```python
import sklearn.datasets
import os
iris_path = os.path.join(sklearn.datasets.__path__[0], 'data', 'iris.csv')
```
Try loading the data from there using the pandas ``pd.read_csv`` function.
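
As a hint, here is a minimal sketch of the general pattern on a made-up inline CSV. The column names and values are hypothetical and are not the contents of `iris.csv`; the real file may lay out its header differently, so inspect its first few lines before choosing `read_csv` options.

```python
import io

import pandas as pd

# Hypothetical CSV content, made up for illustration only
csv_text = """length,width,label
5.1,3.5,0
4.9,3.0,0
6.3,3.3,1
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Separate the feature columns from the target column,
# then convert both to numpy arrays of shape (n_samples, n_features) and (n_samples,)
X = df[["length", "width"]].to_numpy()
y = df["label"].to_numpy()

print(X.shape)
print(y)
```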

In [None]:
# %load solutions/load_iris_csv.py