# 2.3 Lab: Introduction to Python

This is the first lab in the Introduction to Statistical Learning with Python lab series. In these notebooks, we'll show you how to perform the same (or almost the same) tasks as in the excellent book _An Introduction to Statistical Learning with Applications in R_ by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, published by Springer.

Although the book does not cover any specific R library, we'll make use of some Python libraries. I think there is no good reason to write every algorithm from scratch in Python as there are so many reliable libraries that have already done this job for us. We'll make use of scipy, numpy, pandas, matplotlib and scikit-learn most of the time.

We also do not intend to follow the book rigorously. The key idea is to present the same _concepts_ and to offer a simple guide for anyone who has wanted to see the labs in Python instead of R.

## 2.3.1 Basic Commands

We start with the concept of a function in Python. A function is defined with the keyword `def` and can have any number of formal parameters, including variable positional arguments (collected into a tuple) and variable keyword arguments (collected into a dictionary). You can also give parameters default values, which are used when the caller of your function does not supply them. A basic example is shown below.

In [1]:
def show(first, second, *args, **kwargs):
    print("The first argument is", first)
    print("The second argument is", second)
    
    for arg in args:
        print("Each argument collected by the tuple is", arg)
        
    for parameter, value in kwargs.items():
        print("Each argument collected by keyword is", parameter, "with value", value)
        

show("hello", "world", "a", "different", "argument",
     keyword_arg1="first kwarg", keyword_arg2="second kwarg")

The first argument is hello
The second argument is world
Each argument collected by the tuple is a
Each argument collected by the tuple is different
Each argument collected by the tuple is argument
Each argument collected by keyword is keyword_arg1 with value first kwarg
Each argument collected by keyword is keyword_arg2 with value second kwarg
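
The `show` function above does not use default values, so here is a minimal sketch of that feature. The function `greet` and its parameter names are hypothetical, chosen only for illustration.

```python
def greet(name, greeting="Hello", punctuation="!"):
    """Build a greeting; `greeting` and `punctuation` fall back to their defaults."""
    return greeting + ", " + name + punctuation


# Only the required argument is supplied, so both defaults are used.
default_result = greet("world")

# Defaults can be overridden by keyword, in any order.
custom_result = greet("world", punctuation=".", greeting="Goodbye")

print(default_result)  # Hello, world!
print(custom_result)   # Goodbye, world.
```

Parameters with default values must come after the parameters without them in the function signature.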


We can also use Python's packing and unpacking features (sometimes referred to as _splat_) to pass the elements of a list as positional arguments and the items of a dictionary as keyword arguments:

In [2]:
args_list = ["a", "different", "argument"]
kwargs_dict = {'keyword_arg1': "first kwarg", 'keyword_arg2': "second kwarg"}

show("hello", "world", *args_list, **kwargs_dict)

The first argument is hello
The second argument is world
Each argument collected by the tuple is a
Each argument collected by the tuple is different
Each argument collected by the tuple is argument
Each argument collected by keyword is keyword_arg1 with value first kwarg
Each argument collected by keyword is keyword_arg2 with value second kwarg


Before we move on, we need to introduce the concept of modules. A module is simply a Python file with code. It can be imported by other modules or scripts and its code can be reused. There are a few different ways to import a module (or its objects). For example

In [3]:
import numpy
import scipy as sc
from pandas import DataFrame
from matplotlib.pyplot import *

Each statement is briefly explained:

* The first statement just makes the _numpy_ module available for us to use. Every object in it (instances, classes, functions etc) used in our code must be preceded by the module's full name.

* The second statement imports the module _scipy_ with the alias _sc_. It is handy since we don't need to write the complete name of the module every time we use its objects. Instead, we use its alias.

* The third import statement makes a single object (namely the `DataFrame` class) available. We don't need to refer to the _pandas_ module when using it, but we are limited to this single object.

* The last statement imports all public names from the _matplotlib.pyplot_ module. We can now refer to them by name directly. This is the least recommended form of import, as it can cause name conflicts and make it confusing to tell where a name came from.

Most of the time, we'll use the second form of import. It is clearer and keeps each object under the module's namespace, avoiding potential name conflicts.
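
The name-conflict risk can be seen with two standard-library modules, `math` and `cmath`, which both define a function called `sqrt`. This is only a sketch: the choice of modules is ours and is not part of the original lab.

```python
import math
import cmath

# Importing a bare name binds it directly in our namespace...
from math import sqrt
real_root = sqrt(4)       # math.sqrt returns a float.

# ...and a later import of the same name silently replaces it.
from cmath import sqrt
complex_root = sqrt(4)    # cmath.sqrt returns a complex number.

print(real_root, complex_root)

# With module-qualified names there is no doubt about which function runs.
print(math.sqrt(4), cmath.sqrt(4))
```

The same shadowing happens, less visibly, when two modules are imported with `*`, which is why keeping names under their module (or its alias) is the safer habit.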

That said, we can talk about lists, tuples and the super useful _numpy arrays_.

Lists are simple... well, lists. They are built-in Python data structures created with `[` and `]`, and they are very useful for keeping data organized and together. A list can hold elements of different data types in sequence, and its elements can be accessed via indexing, slicing or the _iterator_ pattern (lists are so-called _iterables_). Lists can also be concatenated with the `+` operator. Basic usage of lists is shown below.

In [4]:
l1 = [1, 2, 3, 4] # Create a list with four numbers.
l2 = ["hello", "world", 4, 5] # Create a second list with strings and numbers.

print("The first element of l1 is", l1[0]) # Indexing starts at zero.
print("The last element of l1 is", l1[-1]) # We can index it backwards using negative indices.
print("Wow we can slice", l1[1:3]) # Slicing is done with the pattern 'start:end'. The end is exclusive.
print("And concatenate them", l1 + l2) # Concatenating two lists is as simple as summing them.

# We can modify any of its elements.
l2[2] = "again"
l2[3] = "and again"

# We can iterate over lists with the for statement.
for elem in l2:
    print("Element of l2:", elem)
    
print("The length of l1 is", len(l1)) # The length of a list can be accessed with the len() function.

The first element of l1 is 1
The last element of l1 is 4
Wow we can slice [2, 3]
And concatenate them [1, 2, 3, 4, 'hello', 'world', 4, 5]
Element of l2: hello
Element of l2: world
Element of l2: again
Element of l2: and again
The length of l1 is 4


Tuples are similar to lists. The key difference is mutability. Lists are said to be _mutable_ data structures since they can be modified at any time. Tuples, on the other hand, are said to be _immutable_, as they cannot be modified after their creation. In other words, once a tuple is created, it remains the same until it is wiped out from memory. They are created with `(` and `)` instead of `[` and `]`. Some examples are shown below.

In [5]:
t1 = (1, 2, 3, 4) # Create a tuple with four numbers.
t2 = ("hello", "world", 4, 5) # Create a second tuple with strings and numbers.

print("The first element of t1 is", t1[0]) # Indexing starts at zero.
print("The last element of t1 is", t1[-1]) # We can index it backwards using negative indices.
print("Wow we can slice", t1[1:3]) # Slicing is done with the pattern 'start:end'. The end is exclusive.
print("And concatenate them", t1 + t2) # Concatenating two tuples is as simple as summing them.

# We can iterate over tuples with the for statement.
for elem in t2:
    print("Element of t2:", elem)
    
print("The length of t1 is", len(t1)) # The length of a tuple can be accessed with the len() function.

The first element of t1 is 1
The last element of t1 is 4
Wow we can slice (2, 3)
And concatenate them (1, 2, 3, 4, 'hello', 'world', 4, 5)
Element of t2: hello
Element of t2: world
Element of t2: 4
Element of t2: 5
The length of t1 is 4
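
The mutability difference between lists and tuples is not exercised in the cell above, so here is a minimal sketch of it. The variable names are illustrative only.

```python
numbers_list = [1, 2, 3]
numbers_tuple = (1, 2, 3)

numbers_list[0] = 100          # Lists can be changed in place.
print(numbers_list)            # [100, 2, 3]

try:
    numbers_tuple[0] = 100     # Tuples reject item assignment.
except TypeError as error:
    tuple_error = error        # Keep a reference to inspect later.
    print("Cannot modify a tuple:", error)

print(numbers_tuple)           # (1, 2, 3)
```

Trying to assign to an element of a tuple raises a `TypeError`, and the tuple is left unchanged.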


Last but not least, we present numpy arrays. Numpy arrays are sequential data structures similar to lists, but with some differences. The first and most obvious is that they must be imported before use, as they belong to the _numpy_ module rather than to Python's built-ins. The second is that all elements of an array share a single data type: we cannot mix data types. Numpy arrays are a deep subject, so we'll introduce only the basics.

In [6]:
import numpy as np # This is the standard way to import numpy.

np.set_printoptions(precision=3) # Just set the number of digits to be printed to 3.

v1 = np.array([1, 2, 3, 4]) # We can create a numpy array from a list as long as all elements have the same data type.
v2 = np.array([10, 20, 30, 40])

print("First element of v1", v1[0], "and last element of v2", v2[-1]) # Same as lists just shown.
print("Slicing also works", v1[1:3]) # The end is exclusive.
print("Slicing indefinitely", v1[1:]) # We don't even need to supply an ending point. Just leave it blank.
print("First, third and fourth elements of v1", v1[[0, 2, 3]]) # This is called fancy indexing.
print("First, third and fourth elements of v1 again", v1[[True, False, True, True]]) # This is called masking.
print("First, sec... again", v1[v1 != 2]) # Boolean comparisons can also be used for indexing.
print("We can even reshape them:") # Reshaping an array changes its structure.
print(v2.reshape(2, 2)) # For example, to a multi-dimensional array (in this case, a 2x2 matrix).
print("Maximum of v2 is", np.max(v2)) # Numpy and scipy provide a number of functions that operate on numpy arrays.

print("Summing numpy arrays element-wise", v1 + v2) # Adding numpy arrays sums their elements pairwise.

print("Concatenation of v1 and v2", np.concatenate([v1, v2])) # Use numpy's concatenate function to concatenate two arrays.

print("The length of v1 is", len(v1)) # The length of a numpy array can be accessed with the len() function.

First element of v1 1 and last element of v2 40
Slicing also works [2 3]
Slicing indefinitely [2 3 4]
First, third and fourth elements of v1 [1 3 4]
First, third and fourth elements of v1 again [1 3 4]
First, sec... again [1 3 4]
We can even reshape them:
[[10 20]
 [30 40]]
Maximum of v2 is 40
Summing numpy arrays element-wise [11 22 33 44]
Concatenation of v1 and v2 [ 1  2  3  4 10 20 30 40]
The length of v1 is 4
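
To see what "a single data type" means in practice, the sketch below inspects the `dtype` attribute of two arrays. It assumes only that numpy is installed; the variable names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # An integer list with one float in it.

print(ints.dtype)    # An integer dtype (the exact width depends on the platform).
print(mixed.dtype)   # float64: every element was converted to a float.
print(mixed)

ints[0] = 9.7        # Assigning a float into an integer array truncates it.
print(ints[0])       # 9
```

Numpy does not reject mixed input. Instead it converts every element to one common type, which is worth keeping in mind when an array unexpectedly holds floats or loses decimals.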


There are many, many functions that accept numpy arrays, and they perform so well (they are implemented in C) that numpy arrays became the _de facto_ data structure for numerical methods, statistics, machine learning and scientific computing in general in Python. If you're not so familiar with them yet, I recommend spending some time understanding and experimenting with them before diving into the world of numerical computing in Python.

As we have already shown, a numpy array can be transformed into a multi-dimensional array with the `reshape` method. Many other functions exist to change the shape or nature of a numpy array. There is even a `matrix` class in numpy, although it is not used much. For the sake of completeness, we demonstrate its basics here.

In [7]:
m1 = np.matrix([[1, 2], [3, 4]])
m2 = np.matrix([[10, 20], [30, 40]])
print("First matrix created with matrix:")
print(m1)

print("Second matrix created with matrix")
print(m2)

m3 = np.array([1, 2, 3, 4]).reshape(2, 2)

print("Same as the first matrix but created with array and reshape:")
print(m3)

print("m1 + m2:") # Addition of two matrices follows the usual rules of linear algebra.
print(m1 + m2)

print("m1 x m2:") # The product is also the usual matrix product.
print(m1 * m2)

First matrix created with matrix:
[[1 2]
 [3 4]]
Second matrix created with matrix
[[10 20]
 [30 40]]
Same as the first matrix but created with array and reshape:
[[1 2]
 [3 4]]
m1 + m2:
[[11 22]
 [33 44]]
m1 x m2:
[[ 70 100]
 [150 220]]


Although they are convenient, numpy's matrices are rarely used in practice. NumPy's own documentation no longer recommends the `matrix` class, even for linear algebra, and suggests regular multi-dimensional arrays instead, so that is what we'll work with in these labs.
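
When working with plain multi-dimensional arrays, keep in mind that `*` does not mean the same thing as it does for `matrix` objects. The sketch below rebuilds the two matrices from the previous cell as arrays (the names `a` and `b` are ours) and compares `*` with the `@` operator.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

elementwise = a * b      # With plain arrays, * multiplies element by element.
matrix_product = a @ b   # The @ operator performs matrix multiplication.

print("a * b:")
print(elementwise)
print("a @ b:")
print(matrix_product)
```

Here `a @ b` reproduces the `m1 * m2` result shown above, while `a * b` gives the element-wise product.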

When working with randomness, it can be very difficult to reproduce results. If we're generating random numbers in our experiments, we get a different result each time we run the script. It can then become cumbersome to reproduce experiments, talk about results or even be consistent between executions. To fix it, we can make use of _seeds_.

Random numbers generated by a computer are not actually random. They are the result of a series of computations (very well defined and deterministic computations) performed by the computer following a given algorithm, and so these numbers are more accurately called _pseudo-random numbers_. To start generating numbers, an algorithm needs a first, initial value called a seed. For each seed used as input, the algorithm will produce a different but deterministic sequence of pseudo-random numbers. We can control our experiments by providing the same seed to the algorithm in each execution of the script.
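
To make the idea concrete, here is a toy linear congruential generator written in plain Python. It is only an illustration of a seeded, deterministic sequence: numpy uses a far more sophisticated algorithm, and the function name `lcg` and its constants are our own choices for this sketch.

```python
def lcg(seed, count, a=1664525, c=1013904223, m=2**32):
    """Toy linear congruential generator returning `count` floats in [0, 1)."""
    values = []
    state = seed
    for _ in range(count):
        state = (a * state + c) % m   # Fully deterministic update rule.
        values.append(state / m)      # Scale the integer state into [0, 1).
    return values


first_run = lcg(seed=1234, count=3)
second_run = lcg(seed=1234, count=3)
other_seed = lcg(seed=4321, count=3)

print(first_run == second_run)   # True: same seed, same sequence.
print(first_run == other_seed)   # False: a different seed gives a different sequence.
```

Nothing in the function is random. The numbers only look random, and fixing the seed is what makes a run repeatable.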

The numpy module provides a way to do just that through the `numpy.random.seed` function. Its use is better understood while generating pseudo-random numbers as shown below.

In [8]:
print("No seed: ", np.random.normal(size=5)) # First time running without setting a seed.
print("No seed: ", np.random.normal(size=5)) # Second time running without setting a seed. The results will be different.

np.random.seed(1234) # Set seed to be 1234.
print("Seed is 1234:", np.random.normal(size=5)) # First time running with seed 1234.

np.random.seed(1234) # Set seed to be 1234.
print("Seed is 1234:", np.random.normal(size=5)) # Second time running with seed 1234.

No seed:  [ 0.226  0.007  0.398 -1.461  1.272]
No seed:  [ 0.213  0.634  0.669  1.178 -1.694]
Seed is 1234: [ 0.471 -1.191  1.433 -0.313 -0.721]
Seed is 1234: [ 0.471 -1.191  1.433 -0.313 -0.721]


Each time we set (or reset) a seed, the algorithm starts fresh, generating the same sequence from the beginning. This way, we can be sure of the numbers that will be generated each time we run the script. Actually, I am so sure about the sequence that I can even tell you here: [0.471 -1.191  1.433 -0.313 -0.721] (but there is no way I can tell you the first two sequences without the seed).
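
Recent versions of numpy also offer a generator object, created with `np.random.default_rng`, that carries its own seed instead of relying on global state. The sketch below assumes a numpy version that provides it (1.17 or later). It uses a different algorithm from `np.random.seed`, so the numbers it draws will not match the ones printed above.

```python
import numpy as np

rng_a = np.random.default_rng(1234)   # Each generator carries its own seed.
rng_b = np.random.default_rng(1234)

sample_a = rng_a.normal(size=5)
sample_b = rng_b.normal(size=5)

# Two generators built from the same seed produce identical draws.
print(np.array_equal(sample_a, sample_b))   # True
```

Because the state lives in the `rng_a` and `rng_b` objects, drawing from one generator does not disturb the sequence of the other.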

There are many interesting things we can do with numpy arrays. The most basic are summary statistics such as the mean, median, variance and standard deviation. We show how below.

In [9]:
np.random.seed(1234) # Set seed to 1234.

x = np.random.normal(size=10) # Generate a new array of pseudo-random numbers drawn from a standard normal distribution.

print("The array is", x) # All functions below come from the numpy module.
print("The mean of x is", np.mean(x))
print("The median of x is", np.median(x))
print("The variance of x is", np.var(x))
print("The standard deviation of x is", np.std(x))

The array is [ 0.471 -1.191  1.433 -0.313 -0.721  0.887  0.86  -0.637  0.016 -2.243]
The mean of x is -0.143683492447
The median of x is -0.148477761989
The variance of x is 1.10648680223
The standard deviation of x is 1.05189676405
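
One detail matters when comparing these numbers with R. By default, `np.var` and `np.std` divide by `n`, whereas R's `var()` and `sd()` divide by `n - 1`. The sketch below regenerates the same `x` and uses the `ddof` argument to get the `n - 1` version.

```python
import numpy as np

np.random.seed(1234)
x = np.random.normal(size=10)
n = len(x)

population_var = np.var(x)        # Divides by n (ddof=0 is the default).
sample_var = np.var(x, ddof=1)    # Divides by n - 1, as R's var() does.

print("Variance with ddof=0:", population_var)
print("Variance with ddof=1:", sample_var)

# The two estimates differ exactly by the factor n / (n - 1).
print(np.isclose(sample_var, population_var * n / (n - 1)))   # True
```

The same `ddof=1` argument works for `np.std`.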
