# Class 2 - Programming with Python: packages

## A few announcements

+ **We now have two exercise classes**. One additional class has already been scheduled for Wednesdays, 8:00 - 9:45, in Y27H28 (here) and will be taught by Marcel Fenzl.
+ **Another parallel exercise class**. We are in the process of establishing a third class. We will post more detailed information ASAP on the course website of the Institute of Mathematics and via the mailing list.
+ **Mailing list**. We established a _mailing list_ that will be used for urgent information about the course. If you are not an officially registered UZH student (e.g. you are from ETH and/or you're a ZGSM PhD student), send an email with your name to gabriele.visentin@math.uzh.ch asking to be added to the mailing list.

## A list of Python packages for Machine Learning

Today we are going to see some packages that are very useful for Machine Learning:
+ **NumPy**: functions for basic mathematics, specifically useful for linear algebra [Documentation](https://docs.scipy.org/doc/numpy/user/) [Tutorial](https://docs.scipy.org/doc/numpy/user/quickstart.html)
+ **Matplotlib**: functions for visualization of data via charts [Documentation](https://matplotlib.org/contents.html) [Tutorial](https://matplotlib.org/tutorials/introductory/pyplot.html) 
+ **Scikit-learn**: functions for implementation of Machine Learning models [Documentation](https://scikit-learn.org/stable/documentation.html) [Tutorial](https://scikit-learn.org/stable/tutorial/index.html)

We will discuss some of the fundamental objects and functions in these packages, what their "philosophy" is, and how to use them.

As usual, students are encouraged to familiarize themselves further with these packages at home, and the links above point to some recommended resources. Of course you don't have to go through all the documentation now: today we are going to go rapidly through the tutorials. You can come back to this page for reference when you need to consult the documentation.

## NumPy

NumPy is a Python package (or library, or module) that provides a multidimensional array object and a set of functions for fast operations on arrays, for basic linear algebra, basic statistical operations, random simulation and much more.

### Multi-dimensional arrays: declaration

The main object in NumPy is the homogeneous multidimensional **array**. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers (indices start at 0).

Arrays resemble lists in Python: their values are indexed and can be edited.

In [None]:
import numpy as np

x = np.array([2,3,1,0])
x

In [None]:
print(x)

In [None]:
x[1]

In [None]:
x[1] = 490
x
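The "all of the same type" requirement has a practical consequence when editing values. The following is a minimal sketch (the variable name `mixed` and the value 4.9 are only illustrative choices): NumPy picks one common type when the array is created, and later assignments are converted to that type.

```python
import numpy as np

# A mixed int/float list is converted to one common type (float).
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)        # float64

# An all-integer list gives an integer array...
x = np.array([2, 3, 1, 0])
print(x.dtype)            # an integer type (the exact width depends on the platform)

# ...so a float assigned into it silently loses its fractional part.
x[1] = 4.9
print(x)                  # [2 4 1 0]
```

This is one way in which arrays differ from lists: a Python list would happily store 4.9 next to the integers.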

Arrays are not just vectors! They're used to store matrices as well. **Think of arrays as multi-dimensional objects of unspecified dimension.**

In [None]:
x = np.array([[1,2],[0,0],[1,2]])
x

We can declare new arrays by using specific functions that provide "shortcut" definitions.

In [None]:
x = np.arange(10)
x

arange() is similar to range() in standard Python, but it outputs a NumPy array!

In [None]:
range(10)

In [None]:
x = np.arange(2, 10, dtype=float)
x

In [None]:
x = np.linspace(0, 2, 5) 
x

In [None]:
x = np.zeros((3,4))
x

In [None]:
print(x)
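To summarize the shortcut functions used above, here is a small sketch comparing them side by side. The particular start, stop and step values are arbitrary choices made for illustration.

```python
import numpy as np

# range() gives a Python range object; arange() gives a NumPy array.
r = range(10)
a = np.arange(10)
print(type(r), type(a))

# arange(start, stop, step): the stop value is excluded.
b = np.arange(0, 2, 0.5)
print(b)          # [0.  0.5 1.  1.5]

# linspace(start, stop, num): num evenly spaced points, stop included.
c = np.linspace(0, 2, 5)
print(c)          # [0.  0.5 1.  1.5 2. ]

# zeros() takes the shape as a tuple: here 3 rows and 4 columns.
z = np.zeros((3, 4))
print(z.shape)    # (3, 4)
```

As a rule of thumb, arange is convenient when you know the step, and linspace when you know how many points you want.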

### Multi-dimensional arrays: methods

NumPy arrays look like standard Python lists, but they are much more complex.

In [None]:
type(x)

In particular, they are objects with very useful **methods** (i.e. functions associated to them).

In [None]:
x = np.arange(15).reshape(3, 5)
x

**Attention** Think of "methods" as functions attached to an object, which act on that object. Methods are invoked with the syntax: object.method(arguments)

The number of elements in an array is an integer, known as its size. It can be accessed via the size attribute (an attribute is read without parentheses).

In [None]:
x.size

To each array is associated a tuple, known as its shape, which lists the length of the array along each axis. This can be accessed via the shape attribute.

In [None]:
x.shape

In [None]:
a = (3,3)
y = np.ones(a)
y

In [None]:
a = (4,5,5)
y = np.ones(a)
y

To each array is associated an integer, known as its number of dimensions, in the sense of independent array axes (i.e. vectors have dimension 1, matrices have dimension 2, and so on). It can be accessed via the ndim attribute.

In [None]:
y.ndim
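The following sketch puts the three quantities above next to each other and contrasts them with a method call. The names `flat` and `grid` are introduced here only for illustration.

```python
import numpy as np

y = np.ones((4, 5, 5))

# Attributes: read without parentheses.
print(y.shape)   # (4, 5, 5)
print(y.ndim)    # 3
print(y.size)    # 100 = 4 * 5 * 5

# Methods: called with parentheses, possibly with arguments.
flat = y.reshape(100)         # the same 100 elements along a single axis
grid = y.reshape(10, -1)      # -1 asks NumPy to infer the missing length
print(flat.shape, grid.shape) # (100,) (10, 10)
print(y.sum())                # 100.0
```

Note that reshape returns a new array object and leaves the shape of y unchanged.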

### Multi-dimensional arrays: operations

Arrays are particularly useful because arithmetic and comparison operators act on them element by element, and many linear algebra operations are available as functions or methods.

In [None]:
a = np.array([20, 30, 40, 50])
b = np.arange(4)

In [None]:
a + b

In [None]:
a**2

In [None]:
b>2

In [None]:
a = np.arange(4).reshape((2,2))
b = np.arange(2, 6).reshape((2,2))

In [None]:
a

In [None]:
b

In [None]:
a + b

In [None]:
a.dot(b) # Matrix product a times b 
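A common source of confusion is the difference between the elementwise product and the matrix product. Here is a short sketch using the same two matrices as above; the `@` operator is shown as an alternative spelling of the matrix product for 2-D arrays.

```python
import numpy as np

a = np.arange(4).reshape((2, 2))      # rows: [0 1] and [2 3]
b = np.arange(2, 6).reshape((2, 2))   # rows: [2 3] and [4 5]

elementwise = a * b    # multiplies matching entries: 0, 3, 8, 15
matrix = a.dot(b)      # rows of a times columns of b: 4, 5, 16, 21
same = a @ b           # for 2-D arrays, @ gives the same result as dot

print(elementwise)
print(matrix)
print(np.array_equal(matrix, same))   # True
```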

In [None]:
c = np.array([1,2])
d = a.dot(c) # Matrix a times vector c
d

In [None]:
f = np.ones((1,2))
f

In [None]:
a.T
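The cells above mix a 1-D vector (c) with a 1 x 2 matrix (f). The sketch below, which reuses those names, shows how the shapes behave under dot and transposition. It is meant as an illustration of the shape rules, not as an exhaustive list.

```python
import numpy as np

a = np.arange(4).reshape((2, 2))   # shape (2, 2)
c = np.array([1, 2])               # 1-D vector, shape (2,)
f = np.ones((1, 2))                # 1 x 2 matrix, shape (1, 2)

print(a.dot(c).shape)   # (2,)   matrix times vector gives a vector
print(f.dot(a).shape)   # (1, 2) row matrix times matrix gives a row matrix
print(f.T.shape)        # (2, 1) transposing swaps the two axes
print(c.T.shape)        # (2,)   transposing a 1-D array changes nothing

# a is 2 x 2 and f is 1 x 2: the inner dimensions (2 and 1) do not match.
try:
    a.dot(f)
except ValueError as err:
    print("shape mismatch:", err)
```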

In [None]:
a = np.random.random((2,2))
a

In [None]:
b = 10*np.random.random((2,2))
b

In [None]:
c = np.vstack((a,b))
c

In [None]:
c = np.hstack((a,b))
c

In [None]:
np.linalg.eig(a)

In [None]:
np.trace(a)
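The output of np.linalg.eig can be hard to read at first: it returns the eigenvalues together with a matrix whose columns are the eigenvectors. The sketch below checks this on a fixed matrix (chosen arbitrarily here, instead of a random one, so that the result is reproducible).

```python
import numpy as np

# A fixed symmetric matrix, an arbitrary choice for illustration.
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(a)

# Column i of `eigenvectors` belongs to eigenvalues[i],
# so a @ v should equal lambda * v for each pair.
for i in range(2):
    v = eigenvectors[:, i]
    print(np.allclose(a @ v, eigenvalues[i] * v))   # True

# The trace equals the sum of the eigenvalues.
print(np.isclose(np.trace(a), eigenvalues.sum()))   # True
```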

### Multi-dimensional arrays: indexing and slicing in higher dimensions

Remember indexing and slicing for strings/lists/vectors:

In [None]:
x = np.arange(5)
x

In [None]:
x[3]

In [None]:
x[1:3]

The same techniques are generalized to arrays of arbitrary dimensions, with a particular syntax:

In [None]:
x = np.array([[1, 2], [3, 4]])
x

In [None]:
x[1,1]

In [None]:
x[:,0]

In [None]:
x[0,:]

In [None]:
x = np.arange(15).reshape((3,5))
x

In [None]:
x[1, 2:4]

In [None]:
x[:, 0:3]
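As a recap of the syntax x[rows, columns], here is a small sketch on the same 3 x 5 array. The last lines illustrate one detail that often surprises: an integer index removes an axis, while a slice keeps it.

```python
import numpy as np

x = np.arange(15).reshape((3, 5))
# x is:
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]

print(x[1, 2:4])     # [7 8]     row 1, columns 2 and 3
print(x[:, 0])       # [ 0  5 10] all rows, column 0
print(x[-1, -1])     # 14        negative indices count from the end
print(x[::2, 1])     # [ 1 11]   every second row, column 1

print(x[:, 0].shape)     # (3,)   integer index: the column axis is dropped
print(x[:, 0:1].shape)   # (3, 1) slice: the column axis is kept
```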

### Why NumPy?

Python is a very high-level language (customizable types, loops over all iterables, object-oriented and other paradigms available), but this comes at a computational cost.

C is a lower-level language (i.e. to achieve the same functionality as a piece of Python code, one needs more lines, on average), but it is computationally much faster.

NumPy gives us the best of both worlds: arrays are cool objects with lots of methods, but operations between arrays are speedily executed by pre-compiled C code. 

In [None]:
import time

a = 10*np.random.random(10**6)
b = 10*np.random.random(10**6)

c = []
start = time.time()
for i in range(len(a)):
    c.append(a[i]*b[i])
stop = time.time()
print(stop-start)

In [None]:
start = time.time()
d = a*b
stop = time.time()
print(stop-start)
d
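The two timing cells can also be combined into one self-contained comparison. The sketch below is a variation on them, under a few assumptions of mine: it uses time.perf_counter (a clock intended for measuring short durations) instead of time.time, a smaller array so that it runs quickly, and a final check that both approaches give the same numbers. The actual timings depend on your machine.

```python
import time
import numpy as np

n = 10**5   # smaller than in the cells above, to keep the run short
a = 10 * np.random.random(n)
b = 10 * np.random.random(n)

start = time.perf_counter()
c = [a[i] * b[i] for i in range(n)]   # pure-Python loop, one product at a time
loop_time = time.perf_counter() - start

start = time.perf_counter()
d = a * b                              # vectorized product, executed in compiled code
vector_time = time.perf_counter() - start

print("loop:       ", loop_time)
print("vectorized: ", vector_time)
print("same result:", np.allclose(c, d))
```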

## Matplotlib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

See a few [examples](https://matplotlib.org/gallery/index.html).

The pyplot module (which is what we are going to use) is structured similarly to plotting as done in Matlab: there is a "current figure" (technically, a Figure object) that can be manipulated using functions.

In [None]:
import matplotlib.pyplot as plt

y = [1, 2, 6, 7]

plt.plot(y)
plt.ylabel('some numbers')
plt.show()

In the example above we only gave as input one vector and matplotlib interpreted it as "y" and automatically produced a sequence of "x" values (0, 1, 2, 3) to display it as a 2d chart. 

If you want to plot "y" against specific "x" values, you must provide a list for the "x" coordinates and one for the "y" coordinates of all the points that you want plotted.

In [None]:
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y)
plt.show()

The style of your chart can be modified. Matplotlib assumes you want blue lines by default, but you can specify the style of your graph by passing an additional argument. This "style" argument is just a particular string that must be formatted in a certain way. If you want to discover all the styles of plotting, you must read the documentation for the plot() function.

In [None]:
plt.plot(x, y, 'ro')
plt.show()
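The style string combines up to three pieces of information: a color, a marker and a line style. The sketch below shows three such strings on the same data; the vertical shifts of 2 and 4 are only there to keep the curves apart.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# 'ro'  : red, circle markers, no connecting line
# 'g--' : green, dashed line, no markers
# 'ks-' : black, square markers joined by a solid line
points, = plt.plot(x, y, 'ro')
dashed, = plt.plot(x, [v + 2 for v in y], 'g--')
squares, = plt.plot(x, [v + 4 for v in y], 'ks-')

plt.xlabel('x')
plt.ylabel('y')
plt.show()
```

Each call to plot returns a list of line objects, which is why the names on the left are followed by a comma.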

**Attention!** Notice that successive plotting commands all draw on the same figure until it is displayed. Once a figure has been shown (by calling "show", or at the end of a notebook cell), the next plotting command starts a new, empty figure (have you noticed in the example above that the y-labels have disappeared?). 

You have to imagine that Python works on a "current figure" which it keeps in memory. Every call to the "plot" function adds to it. When you invoke the function "show", Python displays the current figure as it stands at that moment.

**TL;DR:** **all plotting commands apply to the current figure**.
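A minimal sketch of the "current figure" idea within a single cell: two calls to plot and one call to ylabel all end up on the same axes. The call to plt.figure() is added here only to make sure we start from an empty figure, and plt.gca() ("get current axes") is used to inspect the result.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]

plt.figure()                          # explicitly start a new, empty figure
plt.plot(x, [1, 4, 9, 16], 'b-')      # first curve on the current figure
plt.plot(x, [1, 8, 27, 64], 'r--')    # second curve, added to the same axes
plt.ylabel('some numbers')            # label applied to the same axes

ax = plt.gca()                        # the axes of the current figure
n_lines = len(ax.get_lines())
print(n_lines)                        # 2

plt.show()
```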

In [None]:
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

There are many different kinds of plots. 

In [None]:
data = {'a': np.arange(50),
        'c': np.random.randint(0, 50, 50),
        'd': np.random.randn(50)}
data['b'] = data['a'] + 10 * np.random.randn(50)
data['d'] = np.abs(data['d']) * 100

plt.scatter('a', 'b', c='c', s='d', data=data)
plt.xlabel('entry a')
plt.ylabel('entry b')
plt.show()

Working with multiple plots in the same figure is also possible:

In [None]:
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure()
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()
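The argument 211 is shorthand for three numbers: number of rows, number of columns, and the index of the panel to make current. The sketch below spells them out as separate arguments. The titles and the call to tight_layout are optional additions of mine.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0.0, 5.0, 0.02)

fig = plt.figure()

plt.subplot(2, 1, 1)     # same as subplot(211): 2 rows, 1 column, panel 1
plt.plot(t, np.exp(-t), 'k')
plt.title('top panel')

plt.subplot(2, 1, 2)     # panel 2 becomes the current axes
plt.plot(t, np.cos(2 * np.pi * t), 'r--')
plt.title('bottom panel')

plt.tight_layout()       # adjust spacing so titles and tick labels do not overlap
plt.show()
```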

If you want to learn the commands for matplotlib more in depth, feel free to consult the documentation. We will not need to produce particularly good looking plots, just some simple charts to show data, outputs, and especially (estimations of) errors. 

Therefore an in-depth knowledge of matplotlib is not necessary, but you should be able to produce simple charts (lines, scatter, histograms) on the spot.
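Since histograms are mentioned but not shown above, here is a minimal sketch. The data (1000 samples from a standard normal distribution), the seed and the number of bins are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)    # seeded so that the figure is reproducible
samples = rng.normal(size=1000)

# hist returns the bin counts, the bin edges and the drawn rectangles.
counts, edges, patches = plt.hist(samples, bins=30)
plt.xlabel('value')
plt.ylabel('count')
plt.show()

print(counts.sum())   # 1000.0, every sample falls in exactly one bin
print(len(edges))     # 31, one more edge than there are bins
```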

## Scikit-learn

Scikit-learn is a package specifically for Machine Learning. It provides the objects, methods, and functions needed to implement most of the commonly used machine learning models. 

We can import "toy datasets":

In [None]:
from sklearn import datasets
iris = datasets.load_iris()

X, Y = iris.data, iris.target

print(X.shape, Y.shape)

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The data quantifies the morphologic variation of Iris flowers of three related species.

In particular, this dataset gives us data about 150 flowers, for each of which 4 features were measured (petal length, petal width, sepal length, sepal width). Each flower belongs to one of 3 species.

The point is to train an algorithm that, given the four features as input, would output the correct species.
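Before training anything, it helps to look at what the dataset object contains. The sketch below reads a few of its fields; feature_names and target_names are attributes of the object returned by load_iris.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

print(iris.feature_names)   # names of the 4 columns of X
print(iris.target_names)    # species names for the labels 0, 1, 2
print(np.bincount(Y))       # [50 50 50], the number of flowers per species
print(X[0], Y[0])           # the first flower and its label
```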

In [None]:
print(X)

In [None]:
print(Y)

We can then pick our favorite model. In this case, let us declare a Support Vector Machine (SVM) (we will study this model during the course), with some parameter values.

In [None]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

We can then train this particular model on our data. 

In [None]:
clf.fit(X, Y)  

And we can finally ask it to output predictions on new points (here, slightly perturbed copies of some rows of X).

In [None]:
newX = X[np.random.randint(0, 120, 30), :] + 0.0000001*(np.random.random((30, 4))-0.5)
clf.predict(newX)
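In the cells above the model is trained on all 150 flowers and then queried on points that are almost identical to training points. A more common workflow, sketched below as one possible variation, is to hold out part of the data and measure accuracy on it. The 20% test size and the random_state value are arbitrary choices of mine, and the model parameters are the same as above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Hold out 20% of the flowers; the model never sees them during training.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, Y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, Y_test)   # fraction of correct predictions
print(predictions)
print(accuracy)
```

The accuracy on held-out points is a more honest estimate of how the model would behave on genuinely new flowers.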

All the models seen in class can be implemented (more or less) in this fashion using scikit-learn. We will see during our classes how to code these models from scratch, so that you will be able to produce your own models and/or modify existing ones.