# Week 1: Getting Started with Jupyter Notebooks

In this notebook, we will make sure all the packages required for this course are properly installed and working. 

To use this notebook, select the input cells (shown as `In [x]`) in order and press Shift-Enter to execute the code. 

Your installation is working properly if none of the cells below return an error.

## Loading and Testing the Course's Required Packages

In this notebook we will load and test each of the packages that we will mainly be using during the course: NumPy, Matplotlib, Pandas, and SciKit-Learn.

### NumPy
First, we will import NumPy. NumPy is a numerical computing library that provides useful vector and matrix functionality, similar to MATLAB. By convention, NumPy is imported as `np` for the sake of brevity:

In [None]:
import numpy as np

Make a vector with 6 elements:

In [None]:
a = np.array([1,2,3,4,5,6])

# Print the contents of a
a

Get some information about the vector:

In [None]:
print("The vector a has " + str(a.ndim) + " dimension(s) and has the shape " + str(a.shape) + ".")

Create a matrix like this:

In [None]:
m = np.array([[1,2,3], [4,5,6]])

m

Get some information about the matrix:

In [None]:
print("The matrix m has " + str(m.ndim) + " dimension(s) and has the shape " + str(m.shape) + ".")

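_List comprehensions_ are a powerful feature of Python (they are part of the language itself, not of NumPy). They can replace many simple `for` loops with a single, more compact expression, and are often somewhat faster than building a list with repeated `append` calls. Here we square every element in the vector `a` from above: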

In [None]:
a_squared = [i**2 for i in a]

a_squared
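
Because `a` is a NumPy array, the same result can also be obtained without an explicit loop, since arithmetic operators act element-wise on arrays. The following is a minimal sketch; the name `a_squared_vectorised` is introduced here for illustration only:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# List comprehension: iterates over the elements in Python and returns a list
a_squared = [i**2 for i in a]

# Vectorised form: ** is applied to every element and returns a NumPy array
a_squared_vectorised = a ** 2

a_squared_vectorised
```

For large arrays the vectorised form is usually the faster of the two, as the loop runs inside NumPy rather than in the Python interpreter.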

Using NumPy we can select rows and columns of data very easily (known as _array slicing_). 

For example, we can print the first row of the matrix, `m`:

In [None]:
m[0]

Or we can slice the first column only. Using the `,` symbol we can ask for specific rows and columns. The first integer __always specifies the rows__, which is followed by `,` and the second integer __specifies the columns__. The colon character `:` is shorthand for _all rows_ or _all columns_. Here we select all rows of matrix `m` using `:` and select the first column, using `0`:

In [None]:
m[:,0]

We can select a specific element using its row and column index. For example, to select the item in the second row of the second column:

In [None]:
m[1,1]
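
Slices can also cover a range of rows or columns, and negative indices count from the end. A short sketch using the same matrix (the names `last_row` and `last_two_columns` are chosen here for illustration):

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

# Negative indices count from the end, so -1 is the last row
last_row = m[-1]

# All rows, columns 1 up to (but not including) 3
last_two_columns = m[:, 1:3]

print(last_row)
print(last_two_columns)
```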

Entire books have been written about NumPy. Let's move on to Matplotlib.

### Matplotlib

First we will load Matplotlib, a 2D plotting library, and run `%matplotlib inline`. This magic function (functions beginning with the % symbol are called magic functions) will display any plots inline in the notebook.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Using the `plt.plot()` function, you can plot many types of data and Matplotlib will try to figure out what you want to do with the data:

In [None]:
plt.plot([1,2,3,4,5])

The `plt.plot()` function will take an arbitrary number of `x` `y` argument pairs:

In [None]:
# x**y is shorthand for x to the power of y
plt.plot([1,2,3,4,5],[1**2,2**2,3**2,4**2,5**2]) 

Or plot three lines at once by supplying three pairs of `x` `y` values:

In [None]:
p = np.arange(1,10) # Get a range of numbers from 1 to 9 (the stop value, 10, is excluded)
plt.plot(p, p, p, p**2, p, p**3)
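
When several lines share one plot, it can help to label them. The sketch below draws the same three curves one call at a time and adds a legend. The label texts and the axis label are arbitrary choices made for this example:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.arange(1, 10)  # The integers 1 to 9

# Create a figure with a single set of axes to draw on
fig, ax = plt.subplots()

# One call per line, so that each line can carry its own label
ax.plot(p, p, label="p")
ax.plot(p, p**2, label="p squared")
ax.plot(p, p**3, label="p cubed")

ax.set_xlabel("p")
ax.legend()  # Draw a legend box using the labels given above
```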

Combined with functions, more complex curves can be plotted. Let's plot the sigmoid curve:

In [None]:
import math
def sigmoid(x):
    a = []
    for item in x:
        a.append(1/(1+math.exp(-item)))
    return a

x = np.arange(-10., 10., 0.1)
sig = sigmoid(x)

plt.plot(x,sig)
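
The loop inside `sigmoid` can also be written in vectorised form, because `np.exp` operates on a whole array at once. A minimal sketch follows; the function name `sigmoid_vectorised` is made up for this example:

```python
import numpy as np

def sigmoid_vectorised(x):
    # np.exp is applied to every element of x, so no explicit loop is needed
    return 1 / (1 + np.exp(-x))

x = np.arange(-10., 10., 0.1)
sig = sigmoid_vectorised(x)

# The curve approaches 0 at the left end and 1 at the right end
print(sig[0], sig[-1])
```

The result is a NumPy array rather than a list, and it can be passed to `plt.plot(x, sig)` in the same way.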

### Pandas
Let's now import Pandas. This library provides R-style dataframe table functionality. As with NumPy, it is convention to import the Pandas library under a short name, `pd`, for brevity.

In [None]:
import pandas as pd

Let's load the well known Wisconsin Breast Cancer dataset:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/mdbloice/Machine-Learning-for-Health-Informatics/master/data/breast-cancer-wisconsin.csv")

You can view a summary of the data using the `describe()` function:

In [None]:
df.describe()

Let's rename the columns:

In [None]:
df.columns = ["ID","Clump_Thickness","Size_Uniformity","Shape_Uniformity","Marginal_Adhesion","Epithelial_Size","Bare_Nucleoli","Bland_Chromatin","Normal_Nucleoli","Mitoses","Class"]

# Print the new header names:
df.columns

See the first few rows:

In [None]:
df.head()

Notice how the `Class` column (the last column in the table) consists of 2s and 4s. In this dataset 2 stands for benign and 4 stands for malignant. You can check which values occur using:

In [None]:
df.Class.unique()

So you see only 2s and 4s are contained in this column. See a breakdown of the counts using the `value_counts()` function:

In [None]:
df.Class.value_counts()

However, it is convention in machine learning to use a 0-based index to represent classes. Let's replace the 2s with 0s and the 4s with 1s:

In [None]:
df = df.replace({"Class": {2: 0, 4: 1}})
df.Class.value_counts()

Perhaps now we would like to drop the `ID` column:

In [None]:
df = df.drop(["ID"], axis=1)
df.describe()

As you can see you can use Pandas to quickly manipulate and access tabular data. Here we access the first 10 rows of the `Size_Uniformity` column: 

In [None]:
df.Size_Uniformity[0:10]

Columns can also be accessed using the name of the column as an index:

In [None]:
df['Size_Uniformity'][0:10]

You can examine the data types (Pandas dataframes can contain multiple types):

In [None]:
df.dtypes

The column `Bare_Nucleoli` appears as type `object` as it contains some missing data, which appear as `?` in the dataset. Later in the course we will learn how to handle missing data.
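
To see why a single `?` affects the type of a whole column, consider the small sketch below. The toy values are made up, and `pd.to_numeric` with `errors="coerce"` is only one possible way of marking such entries as missing, not necessarily the approach taken later in the course:

```python
import pandas as pd

# Hypothetical toy column: the values are made up for illustration
toy = pd.DataFrame({"Bare_Nucleoli": ["1", "10", "?", "4"]})

# Because of the "?", the column holds strings rather than numbers
print(toy.dtypes)

# Entries that cannot be parsed as numbers are turned into NaN
as_numbers = pd.to_numeric(toy["Bare_Nucleoli"], errors="coerce")
print(as_numbers)
```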

You can also use Pandas to perform quick statistical analyses. Here we calculate the standard deviation for each column:

In [None]:
# numeric_only=True skips non-numeric columns such as Bare_Nucleoli
df.std(numeric_only=True)

Or calculate the standard deviation for a certain column:

In [None]:
df.Clump_Thickness.std()

Pandas also provides useful plotting tools. To look for correlations in data, a scatter matrix is often useful. 

Here we will plot __only three columns of the data__, and only the __first 100 rows of the data__, as a scatter plot with so many columns can take some time to render and can result in a very large plot.

In [None]:
from pandas.plotting import scatter_matrix

# Manually select three of the table's columns by passing an array of column names:
df_subset = df[['Clump_Thickness','Size_Uniformity', 'Shape_Uniformity']]

# The semicolon at the end of this line suppresses informational output (we only want to see the plot)
scatter_matrix(df_subset.head(100), alpha=0.2, figsize=(6,6), diagonal='kde');

### SciKit-Learn

Last but not least, we shall import some modules from SciKit-Learn. SciKit-Learn is the main machine learning library for Python. It is a large library and is not normally loaded entirely; in general you load only the modules you need from the main library. Here we will load the `datasets` module and the _k_-nearest neighbours module (`KNeighborsClassifier`):

In [None]:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

Load the Iris dataset (a flower data set often used for demonstration purposes):

In [None]:
iris = datasets.load_iris()

Convention states that matrices are represented using uppercase letters, often the letter `X`, and label vectors are represented using lower case letters, often `y`:

In [None]:
X = iris.data
y = iris.target

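The _k_-Nearest Neighbour algorithm is possibly the simplest classifier. Given a new observation, it finds the _k_ samples closest to it in the _n_-dimensional feature space and assigns the label that occurs most often among them. With _k_ = 1, this is simply the label of the single closest sample.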
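
To make the idea concrete, here is a minimal sketch of the single-nearest-neighbour case in plain NumPy. It assumes Euclidean distance; the function name and the toy data are made up for this example, and this is not how SciKit-Learn implements the classifier internally:

```python
import numpy as np

def nearest_neighbour_label(X_known, y_known, x_new):
    # Euclidean distance from x_new to every known sample
    distances = np.sqrt(((X_known - x_new) ** 2).sum(axis=1))
    # Return the label of the sample with the smallest distance
    return y_known[np.argmin(distances)]

# Made-up toy data: two samples per class in a 2-dimensional feature space
X_known = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_known = np.array([0, 0, 1, 1])

nearest_neighbour_label(X_known, y_known, np.array([4.8, 5.1]))
```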

First we must randomise the order of the data, but we must ensure we randomise the labels as well __in sync__. We can use NumPy's `permutation` function to create shuffled indices that can then be applied to both the targets and the data:

In [None]:
np.random.seed(376483)
random_indices = np.random.permutation(len(y))

random_indices
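
The following sketch illustrates why a shared set of indices keeps data and labels in sync. The toy arrays are made up so that each label equals the first feature of its row, which makes the pairing easy to check:

```python
import numpy as np

# Made-up toy data: each label equals the first feature of its row
X_toy = np.array([[0, 10], [1, 11], [2, 12], [3, 13], [4, 14]])
y_toy = np.array([0, 1, 2, 3, 4])

toy_indices = np.random.permutation(len(y_toy))

# Indexing both arrays with the same indices reorders them identically
X_shuffled = X_toy[toy_indices]
y_shuffled = y_toy[toy_indices]

# Every row is still paired with its original label
print((X_shuffled[:, 0] == y_shuffled).all())
```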

Then, the data must be split into a test set and a training set (again we are using naming conventions here for the training and test data `X_train` and `X_test` and their labels `y_train` and `y_test`):

In [None]:
X_train = X[random_indices[:-10]]
X_test  = X[random_indices[-10:]]

y_train = y[random_indices[:-10]]
y_test  = y[random_indices[-10:]]

print("Number of training samples: %d. Number of test samples: %d." % (len(X_train), len(X_test)) )

Now we will try to fit the _k_-nearest neighbours classifier to the training data:

In [None]:
knn = KNeighborsClassifier() # Initialise the classifier (uses 5 neighbours by default).
knn.fit(X_train, y_train)    # Fit the classifier.

The classifier has now been trained on the training data (`X_train`). We can now check how well it predicts newly seen data (using our test set, `X_test`):

In [None]:
y_pred = knn.predict(X_test)

# The classifier's predicted labels are now contained in y_pred:
y_pred

We can now element-wise compare our predicted results, in `y_pred`, with the true labels stored in `y_test`:

In [None]:
y_pred == y_test

As you should see, the _k_-Nearest Neighbour classifier predicted each of them correctly! 
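
For larger test sets, checking each comparison by eye becomes impractical, and the comparisons are usually summarised as a single accuracy value. A minimal sketch with made-up label arrays (these are not the outputs of the classifier above):

```python
import numpy as np

# Made-up example values, not the output of the classifier above
y_true_example = np.array([0, 1, 2, 2, 1])
y_pred_example = np.array([0, 1, 2, 1, 1])

# True counts as 1 and False as 0, so the mean is the fraction of matches
accuracy = np.mean(y_pred_example == y_true_example)

accuracy
```

The same expression, `np.mean(y_pred == y_test)`, should work for the predictions made in this notebook.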

# End

There is no deliverable for this week; you only need to ensure this notebook works correctly on your machine. Hopefully, after installing Anaconda, this entire notebook will have worked without any issues.