# Week 1: Getting Started with Jupyter Notebooks

In this notebook, we will make sure all the packages required for this course are properly installed and working. 

To use this notebook, select the input cells (shown as `In [x]`) in order and press Shift-Enter to execute the code. 

Your installation is working properly if none of the cells below raise an error.

## Loading and Testing the Course's Required Packages

The first thing you normally want to do in any project is to import all the required packages. For this course, the main packages we will be using are Matplotlib, NumPy, Pandas, and SciKit-Learn.

### NumPy
Now, we will import NumPy. NumPy is a numerical computing library that provides fast vector and matrix functionality, similar to MATLAB. By convention, NumPy is imported as `np` for the sake of brevity:

In [None]:
import numpy as np

Make a vector with 6 elements:

In [None]:
a = np.array([1,2,3,4,5,6])

a

Get some information about the vector:

In [None]:
print("The vector a has " + str(a.ndim) + " dimension(s) and has the shape " + str(a.shape) + ".")

Create a matrix like this:

In [None]:
m = np.array([[1,2,3], [4,5,6]])

m

Get some information about the matrix:

In [None]:
print("The matrix m has " + str(m.ndim) + " dimension(s) and has the shape " + str(m.shape) + ".")

A very powerful feature of Python is the _list comprehension_. List comprehensions can replace many `for` loops and are usually more concise and somewhat faster. Here we square every element in the vector `a` from above; note that the result is a plain Python list rather than a NumPy array:

In [None]:
a_squared = [i**2 for i in a]

a_squared
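
For purely numerical work, the same result can be obtained without any loop, because NumPy applies arithmetic operators element-wise. A minimal sketch (the name `a_squared_np` is only used here for illustration):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# Arithmetic operators act element-wise on NumPy arrays,
# so this squares every element and returns a new array.
a_squared_np = a ** 2

print(a_squared_np)
```

Unlike the list comprehension, this returns a NumPy array, so it can be sliced and used in further array arithmetic directly.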

Using NumPy we can select rows and columns of data very easily (known as _array slicing_). 

For example, we can print the first row of the matrix, `m`:

In [None]:
m[0]

Or we can slice the first column only. Using the `,` symbol we can ask for specific rows and columns: the index before the `,` specifies the rows and the index after it specifies the columns. The colon character `:` is shorthand for _all rows_ or _all columns_. Here we select all rows of matrix `m` using `:` and select the first column using `0`:

In [None]:
m[:,0]

We can select a specific element using its row and column index. For example, to select the element in the second row and second column:

In [None]:
m[1,1]
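
Slices can also take ranges. As a small sketch using the same matrix `m` (the variable names below are made up for this example), the following selects the last two columns of every row, and then the first two elements of the second row:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

# All rows, columns 1 up to (but not including) 3
last_two_columns = m[:, 1:3]

# Row 1 (the second row), columns 0 and 1
second_row_start = m[1, :2]

print(last_two_columns)
print(second_row_start)
```

Remember that indices start at 0, and that the end of a range is exclusive.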

Entire books have been written about NumPy. Let's move on to Matplotlib.

### Matplotlib

First we will load Matplotlib, a 2D plotting library, and run `%matplotlib inline`. This magic function (commands beginning with the `%` symbol are called magic functions) will display any plots inline in the notebook.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Using the `plt.plot()` function, you can plot many types of data, and Matplotlib will try to figure out what you want to do with the data.

Let's plot the sigmoid curve:

In [None]:
import math
def sigmoid(x):
    a = []
    for item in x:
        a.append(1/(1+math.exp(-item)))
    return a

x = np.arange(-10., 10., 0.1)
sig = sigmoid(x)

plt.plot(x,sig)
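
Because NumPy functions operate element-wise, the loop inside `sigmoid` is not strictly needed. A possible vectorised variant is sketched below (the name `sigmoid_np` is hypothetical); passing `x` and `sig_np` to `plt.plot()` should draw the same curve:

```python
import numpy as np

def sigmoid_np(x):
    # np.exp is applied to every element of x at once
    return 1.0 / (1.0 + np.exp(-x))

x = np.arange(-10., 10., 0.1)
sig_np = sigmoid_np(x)

# Show the first few values of the curve
print(sig_np[:3])
```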

### Pandas
Let's now import Pandas. This library provides R-style dataframe table functionality. Like NumPy, it is convention to import the Pandas library as `pd` for brevity.

In [None]:
import pandas as pd

Let's load the well known Wisconsin Breast Cancer dataset:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/mdbloice/Machine-Learning-for-Health-Informatics/master/data/breast-cancer-wisconsin.csv")

You can view a summary of the data using the `describe()` function:

In [None]:
df.describe()

Let's rename the columns:

In [None]:
df.columns = ["ID","Clump_Thickness","Size_Uniformity","Shape_Uniformity","Marginal_Adhesion","Epithelial_Size","Bare_Nucleoli","Bland_Chromatin","Normal_Nucleoli","Mitoses","Class"]


See the first few rows:

In [None]:
df.head()

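You can use Pandas to perform analyses. Here we calculate the standard deviation for each numeric column (`numeric_only=True` skips non-numeric columns, which would otherwise raise an error in recent versions of Pandas):

In [None]:
df.std(numeric_only=True)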

Or calculate the standard deviation for a certain column:

In [None]:
df.Clump_Thickness.std()

Or access the first 10 rows of the `Size_Uniformity` column:

In [None]:
df.Size_Uniformity[0:10]

Columns can also be accessed using the name of the column as an index:

In [None]:
df['Size_Uniformity'][0:10]

You can examine the data types (Pandas dataframes can contain multiple types):

In [None]:
df.dtypes

The column `Bare_Nucleoli` appears as type `object` as it contains some missing data, which appear as `?` in the dataset. Later in the course we will learn how to handle missing data.
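
To see why a single `?` makes a whole column non-numeric, here is a small self-contained sketch. The three-row table is made up for illustration and is not the course dataset; `pd.to_numeric` with `errors="coerce"` is one way of turning unparseable entries into `NaN`, shown here only as a preview:

```python
import io
import pandas as pd

# Hypothetical miniature CSV, not the real dataset
csv_text = "Bare_Nucleoli,Mitoses\n1,1\n?,2\n10,1\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Bare_Nucleoli is read as a non-numeric column because of the '?'
print(toy.dtypes)

# Convert to numbers; anything that cannot be parsed becomes NaN
toy["Bare_Nucleoli"] = pd.to_numeric(toy["Bare_Nucleoli"], errors="coerce")
print(toy.dtypes)
```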

Pandas has many functions that are useful for data analysis:

In [None]:
df.Mitoses.unique()

Pandas also provides useful plotting tools. To look for correlations in data, a scatter matrix is useful. 

Here we will plot __only three columns of the data__, and only the __first 100 rows of the data__, as a scatter plot with so many columns can take some time to render.

In [None]:
# scatter_matrix lives in pandas.plotting (the old pandas.tools.plotting module has been removed)
from pandas.plotting import scatter_matrix

# Manually select three of the table's columns by passing a list of column names:
df_subset = df[['Clump_Thickness','Size_Uniformity', 'Shape_Uniformity']]

scatter_matrix(df_subset.head(100), alpha=0.2, figsize=(6,6), diagonal='kde')

### SciKit-Learn
Last but not least, we shall import some modules from SciKit-Learn. SciKit-Learn is the main machine learning library for Python. It is a large library and is not normally loaded directly; in general you load individual modules from the main library. Here we will load the `datasets` module and the `svm` (Support Vector Machine) module:

In [None]:
from sklearn import datasets
from sklearn import svm

Load the Iris dataset:

In [None]:
iris = datasets.load_iris()

To help us visualise the data, we will use only the first two features of the dataset (sepal length and sepal width). Convention states that matrices are represented using uppercase letters, in this case `X`, and label vectors are represented using lowercase letters, in this case `y`:

In [None]:
X = iris.data[:, :2] 
y = iris.target

Let's fit the data using a linear kernel support vector machine:

In [None]:
svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)
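
Once fitted, the classifier can be queried. As a rough sketch, `predict()` returns a class label for each sample and `score()` reports the mean accuracy. Note that scoring on the training data, as done here, overestimates how well the model generalises (the name `train_accuracy` is only used for this example):

```python
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# Predicted class labels for the first five samples
first_predictions = svc.predict(X[:5])
print(first_predictions)

# Mean accuracy on the training data
train_accuracy = svc.score(X, y)
print(train_accuracy)
```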

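Finally, we compare the decision boundaries of four different SVM classifiers fitted on the same two features:

In [None]:
h = .02  # step size in the mesh
C = 1.0  # SVM regularisation parameter

# Fit four classifiers on the same data
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)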

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial degree 3 kernel']


for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()