# Manipulating data using Python and Numpy

## Objective

The goal of this notebook is to start manipulating data with [Python](https://python.org), [Numpy](https://www.numpy.org/), and [scikit-learn](https://scikit-learn.org/). 

## Dataset

We will use the `digits` dataset. It contains digital images of handwritten digits.

## Credits

This notebook was inspired by Chloé-Agathe Azencott.

## 1. How to use Jupyter 

Jupyter is a web application that allows us to create and share documents such as this one, called _notebooks_. A _notebook_ comprises a set of _cells_. A cell may contain raw text, code, images, or [markdown](https://en.wikipedia.org/wiki/Markdown) text, such as this cell. Cells can be edited and executed.

* You can run a cell by clicking inside it and hitting `Shift+Enter` (or the **Run** button in the toolbar).

### Formatting texts

Texts can be formatted using the [markdown](https://en.wikipedia.org/wiki/Markdown) [syntax](https://www.markdownguide.org/basic-syntax/). Furthermore, they can have mathematical equations, both inline $e^{i\pi} + 1 = 0$ and displayed $$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$ 

### Including $\LaTeX$ equations

Inline expressions can be added by surrounding the $\LaTeX$ code with \$

```
$e^{i\pi} + 1 = 0$
```

Expressions on their own line are surrounded by \$\$

```
$$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$ 
```

You can check the official [guide](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html) to learn more about markdown support in Jupyter notebooks.


In [0]:
2 + 2  # hit Shift+Enter to run

* If you want to create a new cell below the one you're running, hit `Alt+Enter` (or the **plus** button in the toolbar).
* If the notebook hangs, you can restart it with "**Restart**" in the "**Kernel**" menu.

Some tips on using a Jupyter notebook with Python:

* A notebook behaves like an interactive python shell! This means that
    * classes, functions, and variables defined at the cell level have global scope throughout the notebook
    * hitting `Tab` will autocomplete the keyword you have started typing
    * typing a question mark after a function name will load the interactive help for this function.
    
* Jupyter has special Python commands (i.e., shortcuts) called _magics_. For instance, 
   * `%%bash`, placed on the first line of a cell, will allow you to run that cell as bash code
   * `%paste` will allow you to paste a block of code while retaining its formatting, and 
   * `%matplotlib inline` will configure the visualization library matplotlib to automatically display its plots inline, that is, below the cell. 
   * A full list is available at: http://ipython.readthedocs.io/en/stable/interactive/magics.html 
   
* Learn more about the interactive Python shell here: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html

For more information on Jupyter, see: https://jupyter.org/
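
The question-mark help mentioned in the tips above is an IPython feature and is not valid syntax in a plain Python script. As a rough sketch of an equivalent, assuming NumPy is installed, the same documentation can be read from the function's docstring (the choice of `np.reshape` here is only for illustration):

```python
import numpy as np

# In a notebook cell, `np.reshape?` opens the interactive help.
# Outside IPython, the same text is available as the docstring.
doc = np.reshape.__doc__

# Print only the first non-empty line, which summarizes the function
first_line = doc.strip().splitlines()[0]
print(first_line)
```

Calling the built-in `help(np.reshape)` prints the full documentation instead of the one-line summary.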

## 2. Loading the data

### Importing NumPy and Matplotlib

* NumPy (short for Numerical Python) is the main Python library for numerical computations, and in particular for the manipulation of vectors and matrices.

* Matplotlib is a plotting library.

In [0]:
import numpy as np
import matplotlib.pyplot as plt

This course relies on the `scikit-learn` library for machine learning in Python. In this notebook, however, we will only use it to load one of the classical datasets that it makes available.

In [0]:
# Import the dataset
from sklearn.datasets import load_digits

digits = load_digits()

Checking the attributes of the `digits` dataset 

In [0]:
digits.keys()

In [0]:
# Get descriptors and target to predict
X, y = digits.data, digits.target

We have loaded the data into two _numpy arrays_ X and y. 

* X is a two-dimensional array (i.e., a matrix), containing the samples as rows and the features describing them as columns. 
* y is a one-dimensional array (i.e., a vector), containing the labels.

The dimensions of an array are accessible via its `shape` attribute:

In [0]:
print(X.shape)
print(y.shape)
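
To make the meaning of `shape` concrete, here is a minimal sketch on toy arrays. The names `X_toy` and `y_toy` and their values are hypothetical stand-ins, not the digits data:

```python
import numpy as np

# Toy stand-ins for X and y (hypothetical values, not the digits data)
X_toy = np.zeros((5, 3))            # 5 samples (rows), 3 features (columns)
y_toy = np.array([0, 1, 0, 2, 1])   # one label per sample

print(X_toy.shape)              # (5, 3)
print(y_toy.shape)              # (5,)
print(X_toy.ndim, y_toy.ndim)   # 2 1
```

The shape of a one-dimensional array is a tuple with a single entry, which is why it prints with a trailing comma.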

In [0]:
# Get the shape of the data
print("Number of samples: %d" % X.shape[0])
print("Number of pixels: %d" % X.shape[1])
print("Number of classes: %d" % len(np.unique(y))) # number of unique values in y

We have loaded 1797 images, each containing 64 pixels (they are 8 x 8 images), and belonging to one of 10 classes (the digits from 0 to 9).
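
The class count above comes from `np.unique`. As a small sketch on a hypothetical label vector (`y_toy` is not part of the dataset), the same function can also report how many samples fall in each class through its `return_counts` argument:

```python
import numpy as np

# Hypothetical labels, chosen only to illustrate np.unique
y_toy = np.array([3, 0, 3, 1, 0, 3])

# Sorted unique values and the number of occurrences of each
classes, counts = np.unique(y_toy, return_counts=True)

print(classes)       # [0 1 3]
print(counts)        # [2 1 3]
print(len(classes))  # 3
```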

In [0]:
# Pick one sample to "visualize" it
sample_idx = 42

print(X[sample_idx, :])
print(y[sample_idx])

__Question 1:__ 
* What is the type of X and y?
* What is the dataset's description provided by `scikit-learn`?
* Play with different values for `sample_idx`. Can you guess `y[sample_idx]`?

In [0]:
# TODO

## 3. Visualizing the data

Each sample is a scanned image, of size 8x8, containing 64 pixels. They have been flattened out to a vector of size 64, such as `X[sample_idx, :]`. Each entry of that vector is the intensity of the corresponding pixel.
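
The flattening described above can be sketched on a toy array. This assumes NumPy's default row-major layout, in which the pixel at `(row, col)` of an 8x8 image lands at position `row * 8 + col` of the flat vector; the array `image` below is made up for illustration:

```python
import numpy as np

# Toy 8x8 "image" whose entries equal their own flat index
image = np.arange(64).reshape(8, 8)

# Flatten to a vector of 64 entries, row after row
flat = image.reshape(-1)
print(flat.shape)  # (64,)

# A pixel at (row, col) sits at index row * 8 + col in the flat vector
row, col = 2, 5
print(flat[row * 8 + col] == image[row, col])  # True

# Reshaping back recovers the original 2D layout
restored = flat.reshape(8, 8)
print(np.array_equal(restored, image))  # True
```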

Let us now visualize the original image.

In [0]:
# Reshape the vector X[sample_idx] into a 2D, 8x8 matrix
sample_image = np.reshape(X[sample_idx, :], (8, 8))
print(sample_image.shape)

In [0]:
# Display the corresponding image
plt.imshow(sample_image)

In [0]:
# Let us improve visualization by using grayscale plotting 
plt.imshow(sample_image, cmap='binary')

# Give the plot a title
plt.title('The digit at index %d is a %d' % (sample_idx, y[sample_idx]))

__Question 2:__ 
Visualize all classes of digits in the data set.

In [0]:
# TODO