In [1]:
# This cell changes parameters of the RISE slideshow,
# such as the window width/height, and enables a scroll bar

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
              'width': 1000,
              'height': 600,
              'scroll': True,
})

{'height': 600, 'scroll': True, 'width': 1000}

In [None]:
# This command transforms the Jupyter notebook into a slideshow
!jupyter nbconvert CMM201_Week10.ipynb --to slides --post serve
# Once the browser opens, replace the "#" after the_notebook.slides.html in the URL with
# ?print-pdf, so that the URL looks something like http://127.0.0.1:8000/the_notebook.slides.html?print-pdf
# Finally, print the page to a PDF file

# Importing and Wrangling Data in Python

## Aims of the Lecture
* Learn how to import numerical data to Python from different sources.
* Understand how to select certain parts of the imported data.

## Example

### Loading Data from a Module

* **scikit-learn** (imported as *sklearn*) is a Python package that contains several datasets commonly used in data science and business analytics.

* For this exercise, we will use the **IRIS** dataset contained in this module.

* This dataset contains the sepal and petal lengths and widths from 150 samples of 3 different types of the iris flower.

![Fig 1: Iris dataset.](https://www.dropbox.com/s/kqgsr9tmjdgou4g/iris.png?raw=1)

* Unlike last week, we will **NOT** work with the actual images, but rather with the numerical information extracted from samples.

* First, we need to install **scikit-learn**, the package that provides the *sklearn* module:

In [None]:
!pip install scikit-learn

* Then, we can load the iris dataset:

In [None]:
## Load iris dataset
from sklearn import datasets
iris = datasets.load_iris()
print(type(iris))

* The dataset is stored in a **dictionary-like** structure called **sklearn.utils.Bunch**.

* If you print it, you will see that it contains many entries:

In [None]:
print(iris)
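
* A shorter way to see what the Bunch holds is to list its keys. The sketch below assumes scikit-learn is installed. The exact set of keys can vary between versions, so only the ones used in this lecture matter here. Each key can also be read as an attribute (for example `iris.data`).

```python
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch behaves like a dictionary, so its entries can be listed by key
keys = list(iris.keys())
print(keys)

# Dictionary-style and attribute-style access reach the same array
print(iris['data'].shape, iris.data.shape)
```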

* Therefore, we need to extract each key of this dictionary into a separate variable to understand and analyse the entries individually.

* First, we will import the data:

In [None]:
data = iris['data']
print(data, type(data), data.shape)

* The data is stored in a *numpy array* of 150 rows and 4 columns. Each row holds the four measurements of one flower.

* Then, we will import the headers of the data:

In [None]:
header = iris['feature_names']
print(header, type(header))

* **Why do you think the data and the header are stored separately?**

* Afterwards, we will import the **class/target**:

In [None]:
target = iris['target']
print(target, type(target), target.shape)

* The class/target is a *numpy array* which contains the **category** of each flower.

* Each sample is labelled as $0$, $1$ or $2$ rather than with the name of the iris type, since numeric labels are easier to use in computations.

 * A separate key called **target_names** contains the name corresponding to each numerical label.

In [None]:
target_names = iris['target_names']
print(target_names, type(target_names), target_names.shape)
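
* Because the labels are integers, they can be used directly as indices into the array of names. The sketch below uses a short, made-up `target` array as a stand-in for the real one; the three names are those of the iris types.

```python
import numpy as np

# Names of the three iris types, in label order (0, 1, 2)
target_names = np.array(['setosa', 'versicolor', 'virginica'])

# Made-up numeric labels standing in for the real 150-entry target array
target = np.array([0, 0, 1, 2, 1])

# Indexing the names with the label array gives one name per sample
labels = target_names[target]
print(labels)
```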

* Finally, just in case you are interested, there is an entry containing the description of the dataset (a string):

In [None]:
print(iris['DESCR'])

### Wrangling Data

* Accessing an individual entry of the dataset (along with its class/target):

In [None]:
print(data[0], target[0])

* Creating a table for each iris type ("manually")

In [None]:
setosa = data[0:50]
print(setosa, setosa.shape)

In [None]:
## Use this cell to create and print versicolor and virginica (with their shapes)

* Creating a table for each iris type ("automatically")

In [None]:
## If the data is not in order, or you don't want to count rows,
## we can use this alternative:
import numpy as np
# np.where returns the indices of the rows whose target equals 0
setosa2 = data[np.where(target==0)]
print(setosa2, setosa2.shape)

In [None]:
## Verify that both approaches give the same result (element-wise comparison)
setosa == setosa2
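
* The element-wise comparison returns one True/False per entry. As a minimal sketch with made-up stand-in arrays (not the real iris values), `np.array_equal` reduces the check to a single answer, and a boolean mask selects the same rows as `np.where`.

```python
import numpy as np

# Toy stand-ins for `data` and `target` (not the real iris values)
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
target = np.array([0, 1, 0, 2])

# np.where returns the row indices where the condition holds
rows_where = data[np.where(target == 0)]

# A boolean mask selects the same rows directly
rows_mask = data[target == 0]

# array_equal collapses the element-wise comparison into a single True/False
same = np.array_equal(rows_where, rows_mask)
print(rows_mask, same)
```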

* Creating a new table with fewer columns (by column number):

In [None]:
## creating a "reduced" table
## with only the first two columns
data_red1 = data[:,:2]
print(data_red1,data_red1.shape)

In [None]:
## Use this cell to create a new dataset called data_red2 
## with the last two columns

In [None]:
## Use this cell to create a new dataset called data_red3 
## with the first and the third columns

In [None]:
## creating a "reduced" table with only the first column
col_0 = data[:,0]
print(col_0,col_0.shape)

* Getting a column by its name:

In [None]:
sepal_length = data[:,header.index('sepal length (cm)')]
print(sepal_length,sepal_length.shape)

## Importing YOUR data

* For the coursework output 2, you will need to import the data from a **.csv** file.

* For instance, the IRIS dataset would look something like this:

![Fig 2: Iris dataset in csv](https://www.dropbox.com/s/q16d6opdperm8yp/dataset.png?raw=1)

* Your dataset will have a **first column** with the id of each entry (**NOT** the same as the row index).

* Your dataset will have the class/target in the **last column**.

* The **first row** contains the header.

* You need to find a pre-existing module that lets you import data from a csv file into a numpy array.

* Try to import the header into a separate variable from the data.

* Since the classes/targets are numeric for all datasets, you can leave them in the same numpy array as the data.

* You don't need the target names, just work with the numbers!
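
* As one possible starting point (other modules can do the same job), NumPy's `loadtxt` can read comma-separated text into an array. The sketch below is an assumption about how your file might look: the CSV text, its ids and its column names are hypothetical stand-ins, and an in-memory string is used in place of a real file.

```python
import io
import numpy as np

# Hypothetical CSV content: id in the first column, target in the last column
csv_text = """id,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
1,5.1,3.5,1.4,0.2,0
2,7.0,3.2,4.7,1.4,1
3,6.3,3.3,6.0,2.5,2
"""

# Keep the first row (the header) in its own variable
header = csv_text.splitlines()[0].split(',')

# Read every row after the header into a numeric numpy array
table = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1)

print(header)
print(table, table.shape)
```

* With a real file, the `io.StringIO(...)` argument would be replaced by the path to the file, and the header would be read from the file's first line.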