# Lab 1: Introduction to Python

This lab is adapted from https://github.com/intro-stat-learning/ISLP_labs/blob/main/Ch02-statlearn-lab.ipynb

## Basic Commands


In this lab, we will introduce some simple `Python` commands. 
 For more resources about `Python` in general, readers may want to consult the tutorial at [docs.python.org/3/tutorial/](https://docs.python.org/3/tutorial/). 


 


Like most programming languages, `Python` uses *functions*
to perform operations.   To run a
function called `fun`, we type
`fun(input1,input2)`, where the inputs (or *arguments*)
`input1` and `input2` tell
`Python` how to run the function.  A function can have any number of
inputs. For example, the
`print()`  function outputs a text representation of all of its arguments to the console.

In [9]:
print('fit a model with', 11, 'variables')


fit a model with 11 variables


 The following command will provide information about the `print()` function.

In [10]:
print?


Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method

Adding two integers in `Python` is pretty intuitive.

In [11]:
3 + 5


8

In `Python`, textual data is handled using
*strings*. For instance, `"hello"` and
`'hello'`
are strings. 
We can concatenate them using the addition `+` symbol.

In [12]:
"hello" + " " + "world"


'hello world'

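A string is actually a type of *sequence*: this is a generic term for an ordered collection of items. 
The three most important types of sequences are lists, tuples, and strings. Dictionaries, covered later in this lab, are a related but distinct type that maps keys to values.  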

We introduce lists now. 

The following command instructs `Python` to join together
the numbers 3, 4, and 5, and to save them as a
*list* named `x`. When we
type `x`, it gives us back the list.

In [13]:
x = [3, 4, 5]
x


[3, 4, 5]

Note that we used the brackets
`[]` to construct this list. 

We will often want to add two sets of numbers together. It is reasonable to try the following code,
though it will not produce the desired results.

In [14]:
y = [4, 9, 7]
x + y


[3, 4, 5, 4, 9, 7]

The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists
element-by-element? 
 In `Python`, lists hold *arbitrary* objects, and  are added using  *concatenation*. 
 In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`. 
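
If element-by-element addition of two plain lists is what we want, one way to get it without any extra packages is sketched below. The name `elementwise` is ours, and the sketch relies on the built-in `zip()` function, which pairs up the entries of the two lists by position, together with a list comprehension.

```python
x = [3, 4, 5]
y = [4, 9, 7]

# `+` on lists concatenates, so the result has six entries.
concatenated = x + y

# Pair up entries with zip() and add each pair to get an element-wise sum.
elementwise = [a + b for a, b in zip(x, y)]

print(concatenated)
print(elementwise)
```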
 

Much of `Python`'s  data-specific
functionality comes from other packages, notably `numpy`
and `pandas`. 
We will discuss these packages later.


## For Loops
A `for` loop is a standard tool in many languages that
repeatedly evaluates some chunk of code while
varying different values inside the code.
For example, suppose we loop over elements of a list and compute their sum.

In [15]:
total = 0
for value in [3,2,19]:
    total += value
print(f'Total is: {total}')


Total is: 24


The indented code beneath the line with the `for` statement is run
for each value in the sequence
specified in the `for` statement. The loop ends either
when the cell ends or when code is indented at the same level
as the original `for` statement.
We see that the final line above which prints the total is executed
only once after the for loop has terminated. Loops
can be nested by additional indentation.

In [16]:
total = 0
for value in [2,3,19]:
    for weight in [3, 2, 1]:
        total += value * weight
print(f'Total is: {total}')

Total is: 144


Above, we summed over each combination of `value` and `weight`.
We also took advantage of the *increment* notation
in `Python`: the expression `a += b` is equivalent
to `a = a + b`. Besides
being a convenient notation, this can save time in computationally
heavy tasks in which the intermediate value of `a+b` need not
be explicitly created.
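
The in-place behavior of `+=` is easiest to see with a mutable object such as a list. The following is a small sketch with hypothetical names (`a_alias`, `b_alias`). For immutable objects such as integers, `a += b` simply rebinds the name, just like `a = a + b`.

```python
a = [1, 2]
a_alias = a          # second name for the same list object
a += [3]             # extends the existing list in place
print(a_alias)       # the alias sees the change

b = [1, 2]
b_alias = b
b = b + [3]          # builds a new list and rebinds the name b
print(b_alias)       # the original list is unchanged
```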

Perhaps a more
common task would be to sum over `(value, weight)` pairs. For instance,
to compute the average value of a random variable that takes on
possible values 2, 3 or 19 with probability 0.2, 0.3, 0.5 respectively
we would compute the weighted sum. Tasks such as this
can often be accomplished using the `zip()`  function that
loops over a sequence of tuples.

In [17]:
total = 0
for value, weight in zip([2,3,19],
                         [0.2,0.3,0.5]):
    total += weight * value
print(f'Weighted average is: {total}')


Weighted average is: 10.8
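
To see what `zip()` hands to the loop, we can convert its result to a list. This is only an illustrative sketch. The names `values`, `weights`, `pairs`, and `weighted` are ours.

```python
values = [2, 3, 19]
weights = [0.2, 0.3, 0.5]

# zip() pairs up entries by position; list() makes the pairs visible.
pairs = list(zip(values, weights))
print(pairs)

# The same weighted sum written with the built-in sum() function.
weighted = sum(w * v for v, w in pairs)
print(f'Weighted average is: {weighted:.1f}')
```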


### String Formatting
In the code chunk above we printed a string
displaying the total. However, the object `total`
is a number and not a string.
Inserting the value of something into
a string is a common task, made
simple using
some of the powerful string formatting
tools in `Python`.
Many data cleaning tasks involve
manipulating and producing strings.

For example we may want to loop over the columns of a data frame and
print the percent missing in each column.
Let’s create a data frame `D` with columns in which roughly 20% of the entries are missing, i.e. set
to `np.nan`.  We’ll create the
values in `D` from a normal distribution with mean 0 and variance 1 using `rng.standard_normal()`
and then overwrite some random entries using `rng.choice()`.

In [8]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
A = rng.standard_normal((127, 5))
M = rng.choice([0, np.nan], p=[0.8,0.2], size=A.shape)
A += M
D = pd.DataFrame(A, columns=['food',
                             'bar',
                             'pickle',
                             'snack',
                             'popcorn'])
D[:3]



In [None]:
for col in D.columns:
    miss = np.isnan(D[col]).mean()
    print(f'Column {col} has {miss:.2%} missing values')

For the second value printed, we specify that it should be expressed as a percent with two decimal digits.

The reference
[docs.python.org/3/library/string.html](https://docs.python.org/3/library/string.html)
includes many helpful and more complex examples.
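
As a brief sketch of a few other format specifications that can follow the colon inside the braces of an f-string (the variable names and values here are made up for illustration):

```python
share = 0.2047
estimate = 3.14159
count = 1234567

print(f'{share:.1%}')      # percent with one decimal digit
print(f'{estimate:.3f}')   # fixed-point with three decimal digits
print(f'{count:,}')        # thousands separators
print(f'{"bar":>8}|')      # right-aligned in a field of width 8
```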

## Dictionaries

Another useful data type is the dictionary. Unlike a string or a list, a dictionary is not a sequence: it is a *mapping* that associates keys with values. The material below was adapted from https://developers.google.com/edu/python/dict-files.

A dictionary is made up of a series of key:value pairs within braces `{ }`, e.g. dict1 = {key1:value1, key2:value2, ... }. An empty dictionary is represented by an empty pair of curly braces `{}`.

Looking up or setting a value in a dict uses square brackets, e.g. dict['foo'] looks up the value under the key 'foo'. Strings, numbers, and tuples work as keys, and any type can be a value. Other types may or may not work correctly as keys (strings and tuples work cleanly since they are immutable). Looking up a value which is not in the dict throws a KeyError -- use the keyword `in` to check if the key is in the dictionary, or use dict.get(key) which returns the value or None if the key is not present.

In [None]:
# Can build up a dictionary by starting with the empty {}
# and storing key/value pairs into the dict like this:
# dict[key] = value-for-that-key
dict1 = {}
dict1['a'] = 'alpha'
dict1['g'] = 'gamma'
dict1['o'] = 'omega'

print(dict1)

In [None]:
print(dict1['a'])
dict1['a'] = 6      
print(dict1['a'])
print('a' in dict1)

In [None]:
print(dict1['z'])                 

In [None]:
if 'z' in dict1: 
    print(dict1['z'])
else:
    print("z not found")

In [None]:
print(dict1.get('z'))
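
The `get()` method also accepts an optional second argument, which is returned in place of `None` when the key is missing. A brief sketch (the dictionary `greek` and the fallback string `'unknown'` are ours):

```python
greek = {'a': 'alpha', 'g': 'gamma'}

print(greek.get('a', 'unknown'))   # key present: its value is returned
print(greek.get('z', 'unknown'))   # key absent: the fallback is returned
print(greek.get('z') is None)      # without a fallback, None is returned
```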

A `for` loop on a dictionary iterates over its keys by default. The methods `dict.keys()` and `dict.values()` return *view* objects over the keys and values; wrap them in `list()` to obtain actual lists. There's also an `items()` method, which returns a view of (key, value) tuples and is the most efficient way to examine all the key-value data in the dictionary.

In [None]:
for key in dict1:
    print(key)

# Same as above
for key in dict1.keys():
    print(key)

# Get a list of keys
print(list(dict1.keys()))

# Get a list of values
print(list(dict1.values()))

In [None]:
# Common case -- loop over the keys in sorted order, to access each key/value pair
for key in sorted(dict1.keys()):
    print(key, dict1[key])

# .items() is the dictionary expressed as (key, value) tuples
print(list(dict1.items()))

# This loop syntax accesses the whole dict by looping over the .items() tuple list, 
# accessing one (key, value) pair on each iteration
for k, v in dict1.items(): 
    print(k, '>', v)

## Functions
At the beginning of this lab, we mentioned that `Python` uses functions to perform operations, many of which are built in. You can also create your own functions to use in the main body of your code. Below is a simple function that prints the larger of its two inputs to the screen:

In [None]:
def func(input1, input2):
    if input1 > input2:
        print(input1)
    else:
        print(input2)
        
func(4,5)

The previous example did not return any values, since it printed the desired value to the screen. Let's repeat the previous example, except now we return the larger value:

In [None]:
def func(input1, input2):
    if input1 > input2:
        return input1
    else:
        return input2
        
max1 = func(4,5)
print("The maximum value is:", max1)
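
The difference between printing and returning shows up when we try to store the result of a call. In this sketch, `print_larger` and `return_larger` are hypothetical names for the two versions of `func` above, written more compactly with a conditional expression.

```python
def print_larger(input1, input2):
    """Print the larger input; nothing is returned explicitly."""
    print(input1 if input1 > input2 else input2)

def return_larger(input1, input2):
    """Return the larger input so the caller can reuse it."""
    return input1 if input1 > input2 else input2

printed = print_larger(4, 5)    # prints 5, but the call evaluates to None
returned = return_larger(4, 5)  # prints nothing, but the value is kept
print(printed, returned)
```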

## Introduction to Numerical Python

As mentioned earlier, this book makes use of functionality   that is contained in the `numpy` 
 *library*, or *package*. A package is a collection of modules that are not necessarily included in 
 the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*. 
 
See [docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/numpy/user/quickstart.html) for more information about `numpy`.

  To access `numpy`, we must first `import` it.

In [None]:
import numpy as np 

In the previous line, we named the `numpy` *module* `np`, an abbreviation for easier referencing.

In `numpy`, an *array* is  a generic term for a multidimensional
set of numbers.
We use the `np.array()` function to define   `x` and `y`, which are one-dimensional arrays, i.e. vectors.

In [None]:
x = np.array([3, 4, 5])
y = np.array([4, 9, 7])

Note that if you forgot to run the `import numpy as np` command earlier, then
you will encounter an error in calling the `np.array()` function in the previous line. 
 The syntax `np.array()` indicates that the function being called
is part of the `numpy` package, which we have abbreviated as `np`. 

Since `x` and `y` have been defined using `np.array()`, we get a sensible result when we add them together. Compare this to our results in the previous section,
 when we tried to add two lists without using `numpy`. 

In [None]:
x + y

In `numpy`, matrices are typically represented as two-dimensional arrays, and vectors as one-dimensional arrays. (While it is also possible to create matrices using `np.matrix()`, we will use `np.array()` throughout the labs in this book.)
We can create a two-dimensional array as follows. 

In [None]:
x = np.array([[1, 2], [3, 4]])
x

The object `x` has several 
*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute`
with the name of the attribute. 
For instance, we can access the `ndim` attribute of  `x` as follows. 

In [None]:
x.ndim

The output indicates that `x` is a two-dimensional array.  
Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is 
comprised of 64-bit integers:

In [None]:
x.dtype

Why is `x` comprised of integers? This is because we created `x` by passing in exclusively integers to the `np.array()` function.
  If
we had passed in any decimals, then we would have obtained an array of
*floating point numbers* (i.e. real-valued numbers). 

In [None]:
np.array([[1, 2], [3.0, 4]]).dtype


Typing `fun?` will cause `Python` to display 
documentation associated with the function `fun`, if it exists.
We can try this for `np.array()`. 

In [None]:
np.array?


This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`.

In [None]:
np.array([[1, 2], [3, 4]], float).dtype


The array `x` is two-dimensional. We can find out the number of rows and columns by looking
at its `shape` attribute.

In [None]:
x.shape


A *method* is a function that is associated with an
object. 
For instance, given an array `x`, the expression
`x.sum()` sums all of its elements, using the `sum()`
method for arrays. 
The call `x.sum()` automatically provides `x` as the
first argument to its `sum()` method.

In [None]:
x = np.array([1, 2, 3, 4])
x.sum()

We could also sum the elements of `x` by passing in `x` as an argument to the `np.sum()` function. 

In [None]:
x = np.array([1, 2, 3, 4])
np.sum(x)

 As another example, the
`reshape()` method returns a new array with the same elements as
`x`, but a different shape.
 We do this by passing in a `tuple` in our call to
 `reshape()`, in this case `(2, 3)`.  This tuple specifies that we would like to create a two-dimensional array with 
$2$ rows and $3$ columns. (Like lists, tuples represent a sequence of objects. Why do we need more than one way to create a sequence? There are a few differences between tuples and lists, but perhaps the most important is that elements of a tuple cannot be modified, whereas elements of a list can be.)
 
In what follows, the
`\n` character creates a *new line*.

In [None]:
x = np.array([1, 2, 3, 4, 5, 6])
print('beginning x:\n', x)
x_reshape = x.reshape((2, 3))
print('reshaped x:\n', x_reshape)


The previous output reveals that `numpy` arrays are specified as a sequence
of *rows*. This is  called *row-major ordering*, as opposed to *column-major ordering*. 

`Python` (and hence `numpy`) uses 0-based
indexing. This means that to access the top left element of `x_reshape`, 
we type in `x_reshape[0,0]`.

In [None]:
x_reshape[0, 0] 

Similarly, `x_reshape[1,2]` yields the element in the second row and the third column 
of `x_reshape`. 

In [None]:
x_reshape[1, 2] 

Similarly, `x[2]` yields the
third entry of `x`. 

Now, let's modify the top left element of `x_reshape`.  To our surprise, we discover that the first element of `x` has been modified as well!



In [None]:
print('x before we modify x_reshape:\n', x)
print('x_reshape before we modify x_reshape:\n', x_reshape)
x_reshape[0, 0] = 5
print('x_reshape after we modify its top left element:\n', x_reshape)
print('x after we modify top left element of x_reshape:\n', x)


Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory.
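
We can check this sharing directly, and avoid it when it is unwanted. The sketch below uses `np.shares_memory()` and the array method `copy()`; the names `x_view` and `x_copy` are ours. Note that `reshape()` returns a view like this whenever it can, but not necessarily in every case.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
x_view = x.reshape((2, 3))          # shares data with x
x_copy = x.reshape((2, 3)).copy()   # independent data

print(np.shares_memory(x, x_view))
print(np.shares_memory(x, x_copy))

x_copy[0, 0] = 99                   # does not touch x
print(x)
```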
 

    

We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so raises an *exception*, or error. This is because tuples are *immutable*, unlike lists, dictionaries and arrays, which are *mutable*.

In [None]:
my_tuple = (3, 4, 5)
my_tuple[0] = 2
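
If we want the rest of our code to keep running after such an error, we can catch the exception. The sketch below assumes the error raised is a `TypeError`, which is what `Python` raises for item assignment on a tuple. The name `new_tuple` is ours, and the slice `my_tuple[1:]` selects every entry after the first.

```python
my_tuple = (3, 4, 5)
try:
    my_tuple[0] = 2
except TypeError as err:
    print('Could not modify the tuple:', err)

# To "change" a tuple, build a new one instead.
new_tuple = (2,) + my_tuple[1:]
print(new_tuple)
```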


We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple.
The  `ndim` attribute yields the number of dimensions, and `T` provides its transpose. 

In [None]:
x_reshape.shape, x_reshape.ndim, x_reshape.T


Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple. 
 
We will often want to apply functions to arrays. 
For instance, we can compute the
square root of the entries using the `np.sqrt()` function: 

In [None]:
np.sqrt(x)


We can also square the elements:

In [None]:
x**2


We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2.

In [None]:
x**0.5


Throughout the textbook, we will often want to generate random data. 
The `np.random.normal()`  function generates a vector of random
normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`.
The first line of the help page  reads `normal(loc=0.0, scale=1.0, size=None)`. 
 This  *signature* line tells us that the function's arguments are  `loc`, `scale`, and `size`. These are *keyword* arguments, which means that when they are passed into
 the function, they can be referred to by name (in any order). (`Python` also uses *positional* arguments. Positional arguments do not need to use a keyword. To see an example, type in `np.sum?`. We see that `a` is a positional argument, i.e. this function assumes that the first unnamed argument that it receives is the array to be summed. By contrast, `axis` and `dtype` are keyword arguments: the position in which these arguments are entered into `np.sum()` does not matter.)
 By default, this function will generate random normal variable(s) with mean (`loc`) $0$ and standard deviation (`scale`) $1$; furthermore, 
 a single random variable will be generated unless the argument to `size` is changed. 
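
The following sketch illustrates that keyword arguments can be given in any order. So that the calls can be compared, each one draws from a freshly seeded generator created with `np.random.default_rng()`, whose `normal()` method takes the same `loc`, `scale`, and `size` arguments. The seed `0` and the variable names are arbitrary choices of ours.

```python
import numpy as np

# A seeded generator is used only so that the three calls can be compared.
by_keyword = np.random.default_rng(0).normal(loc=50, scale=1, size=3)
reordered = np.random.default_rng(0).normal(size=3, scale=1, loc=50)
by_position = np.random.default_rng(0).normal(50, 1, 3)

print(by_keyword)
print(np.array_equal(by_keyword, reordered))
print(np.array_equal(by_keyword, by_position))
```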

We now generate 50 independent random variables from a $N(0,1)$ distribution. 

In [None]:
x = np.random.normal(size=50)
x


We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`.

In [None]:
y = x + np.random.normal(loc=50, scale=1, size=50)

The `np.corrcoef()` function computes the correlation matrix between `x` and `y`. The off-diagonal elements give the 
correlation between `x` and `y`. 

In [None]:
np.corrcoef(x, y)
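
To pull a single correlation out of the matrix, we can index it. The sketch below regenerates `x` and `y` with a seeded generator (the seed `0` is an arbitrary choice of ours) so that it runs on its own. Since `y` is `x` plus independent noise of the same variance, the population correlation is $1/\sqrt{2} \approx 0.71$, and the sample value should typically land in that vicinity.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen arbitrarily for this sketch
x = rng.normal(size=50)
y = x + rng.normal(loc=50, scale=1, size=50)

corr = np.corrcoef(x, y)
print(corr.shape)                   # a 2 x 2 matrix for two variables
print(corr[0, 1], corr[1, 0])       # the two off-diagonal entries agree
```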

If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few 
commands. In particular, 
 each
time we call `np.random.normal()`, we will get a different answer, as shown in the following example.

In [None]:
print(np.random.normal(scale=5, size=2))
print(np.random.normal(scale=5, size=2)) 


In order to ensure that our code provides exactly the same results
each time it is run, we can set a *random seed* 
using the 
`np.random.default_rng()` function.
This function takes an arbitrary, user-specified integer argument. If we set a random seed before 
generating random data, then re-running our code will yield the same results. The
object `rng` has essentially all the random number generating methods found in `np.random`. Hence, to
generate normal data we use `rng.normal()`.

In [None]:
rng = np.random.default_rng(1303)
print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2)) 

The `np.mean()`,  `np.var()`, and `np.std()`  functions can be used
to compute the mean, variance, and standard deviation of arrays.  These functions are also
available as methods on the arrays.

In [None]:
rng = np.random.default_rng(3)
y = rng.standard_normal(10)
np.mean(y), y.mean()

In [None]:
np.var(y), y.var(), np.mean((y - y.mean())**2)

In [None]:
np.sqrt(np.var(y)), np.std(y)
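
As the comparison with `np.mean((y - y.mean())**2)` suggests, `np.var()` divides by $n$ by default. If the version that divides by $n-1$ is wanted, the `ddof` argument can be used. A small sketch with made-up data (the array and the names `var_n` and `var_n_minus_1` are ours):

```python
import numpy as np

y = np.array([2.0, 4.0, 4.0, 6.0])
n = len(y)

var_n = np.var(y)                  # divides by n (the default, ddof=0)
var_n_minus_1 = np.var(y, ddof=1)  # divides by n - 1

print(var_n, var_n_minus_1)
print(np.isclose(var_n_minus_1, var_n * n / (n - 1)))
```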

The `np.mean()`,  `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix. 
To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing means along its rows and columns. 

In [None]:
X = rng.standard_normal((10, 3))
X

Since arrays are row-major ordered, the first axis, i.e. `axis=0`, refers to its rows. Passing this argument into the `mean()` method for the object `X` averages over the rows, yielding one mean per column. 

In [None]:
X.mean(axis=0)

The following yields the same result.

In [None]:
X.mean(0)

To average over the columns instead, yielding one mean per row, we pass `axis=1`:

In [None]:
X.mean(axis=1)
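
Because it is easy to mix up the two axes, here is a sketch on a small matrix whose means can be checked by hand. The names `X_small`, `col_means`, and `row_means` are ours.

```python
import numpy as np

X_small = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

col_means = X_small.mean(axis=0)   # averages over rows: one value per column
row_means = X_small.mean(axis=1)   # averages over columns: one value per row

print(col_means, col_means.shape)
print(row_means, row_means.shape)
```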

## Sequences and Slice Notation

The function `np.linspace()` can be used to create a sequence of numbers.

In [None]:
seq1 = np.linspace(0, 10, 11)
seq1


The function `np.arange()`
 returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence
 that starts at $0$ and ends at $10$.

In [None]:
seq2 = np.arange(0, 10)
seq2


Why isn't **10** output above? This has to do with *slice* notation in `Python`. 
Slice notation is used to index sequences such as lists, tuples and arrays.
Suppose we want to retrieve the fourth through sixth (inclusive) entries
of a string. We obtain a slice of the string using the indexing  notation  `[3:6]`.

In [None]:
"hello world"[3:6]

In the code block above, the notation `3:6` is shorthand for  `slice(3,6)` when used inside
`[]`. 

In [None]:
"hello world"[slice(3,6)]


You might have expected  `slice(3,6)` to output the fourth through seventh characters in the text string (recalling that  `Python` begins its indexing at zero),  but instead it output  the fourth through sixth. 
 This also explains why the earlier `np.arange(0, 10)` command output only the integers from $0$ to $9$. 
See the documentation `slice?` for useful options in creating slices. 
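
A few more slices are sketched below, along with the corresponding adjustment to `np.arange()`. The variable name `text` is ours.

```python
import numpy as np

text = "hello world"
print(text[3:6])      # stop index 6 is excluded
print(text[:5])       # an omitted start defaults to the beginning
print(text[-5:])      # negative indices count from the end
print(text[::2])      # a third entry gives the step

print(np.arange(0, 11))       # to include 10, stop one past it
print(np.arange(0, 10, 2))    # step of 2
```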

    



    


    

 

    

 

    


    


## Indexing Data
To begin, we  create a two-dimensional `numpy` array.

In [None]:
A = np.array(np.arange(16)).reshape((4, 4))
A


Typing `A[1,2]` retrieves the element corresponding to the second row and third
column. (As usual, `Python` indexes from $0.$)

In [None]:
A[1,2]


The first number after the open-bracket symbol `[`
 refers to the row, and the second number refers to the column. 

### Indexing Rows, Columns, and Submatrices
 To select multiple rows at a time, we can pass in a list
  specifying our selection. For instance, `[1,3]` will retrieve the second and fourth rows:

In [None]:
A[[1,3]]


To select the first and third columns, we pass in  `[0,2]` as the second argument in the square brackets.
In this case we need to supply the first argument `:` 
which selects all rows.

In [None]:
A[:,[0,2]]


Now, suppose that we want to select the submatrix made up of the second and fourth 
rows as well as the first and third columns. This is where
indexing gets slightly tricky. It is natural to try  to use lists to retrieve the rows and columns:

In [None]:
A[[1,3],[0,2]]


 Oops --- what happened? We got a one-dimensional array of length two identical to

In [None]:
np.array([A[1,0],A[3,2]])


 Similarly,  the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns:

In [None]:
A[[1,3],[0,2,3]]


We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.

The *convenience function* `np.ix_()` allows us  to extract a submatrix
using lists, by creating an intermediate *mesh* object.

In [None]:
idx = np.ix_([1,3],[0,2,3])
A[idx]
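
To get a feel for the mesh object, we can inspect what `np.ix_()` returns. The sketch below rebuilds `A` so that it runs on its own; the names `rows`, `cols`, and `sub` are ours.

```python
import numpy as np

A = np.arange(16).reshape((4, 4))
rows, cols = np.ix_([1, 3], [0, 2, 3])

# The row indices form a column and the column indices form a row,
# so numpy pairs every selected row with every selected column.
print(rows.shape, cols.shape)

sub = A[rows, cols]
print(sub)

# Indexing rows first and columns second gives the same submatrix.
print(np.array_equal(sub, A[[1, 3]][:, [0, 2, 3]]))
```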


Alternatively, we can subset matrices efficiently using slices.
  
The slice
`1:4:2` captures the second and fourth items of a sequence, while the slice `0:3:2` captures
the first and third items (the third element in a slice sequence is the step size).

In [None]:
A[1:4:2,0:3:2]


Why are we able to retrieve a submatrix directly using slices but not using lists?
It's because they are different `Python` types, and
are treated differently by `numpy`.
Slices can be used to extract objects from arbitrary sequences, such as strings, lists, and tuples, while the use of lists for indexing is more limited.




    

 

    

 

### Boolean Indexing
In `numpy`, a *Boolean* is a type  that equals either   `True` or  `False` (also represented as $1$ and $0$, respectively).
The next line creates a vector of $0$'s, represented as Booleans, of length equal to the first dimension of `A`. 

In [None]:
keep_rows = np.zeros(A.shape[0], bool)
keep_rows

We now set two of the elements to `True`. 

In [None]:
keep_rows[[1,3]] = True
keep_rows


Note that the elements of `keep_rows`, when viewed as integers, are the same as the
values of `np.array([0,1,0,1])`. Below, we use  `==` to verify their equality. When
applied to two arrays, the `==`   operation is applied elementwise.

In [None]:
np.all(keep_rows == np.array([0,1,0,1]))


(Here, the function `np.all()` has checked whether
all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`).

   However, even though `np.array([0,1,0,1])`  and `keep_rows` are equal according to `==`, they index different sets of rows!
The former retrieves the first, second, first, and second rows of `A`. 

In [None]:
A[np.array([0,1,0,1])]


 By contrast, `keep_rows` retrieves only the second and fourth rows  of `A` --- i.e. the rows for which the Boolean equals `True`. 

In [None]:
A[keep_rows]


This example shows that Booleans and integers are treated differently by `numpy`.
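
If we need the integer positions that a Boolean array selects, one option is `np.flatnonzero()`. The sketch below rebuilds `A` and `keep_rows` so that it runs on its own; the name `row_numbers` is ours.

```python
import numpy as np

A = np.arange(16).reshape((4, 4))
keep_rows = np.array([False, True, False, True])

# Positions of the True entries, as integers.
row_numbers = np.flatnonzero(keep_rows)
print(row_numbers)

# Indexing with these integers matches indexing with the Boolean array.
print(np.array_equal(A[row_numbers], A[keep_rows]))

# Casting the Booleans to integers gives 0/1 indices, which select rows 0 and 1.
print(A[keep_rows.astype(int)])
```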

We again make use of the `np.ix_()` function
 to create a mesh containing the second and fourth rows, and the first,  third, and fourth columns. This time, we apply the function to Booleans,
 rather than lists.

In [None]:
keep_cols = np.zeros(A.shape[1], bool)
keep_cols[[0, 2, 3]] = True
idx_bool = np.ix_(keep_rows, keep_cols)
A[idx_bool]


We can also mix a list with an array of Booleans in the arguments to `np.ix_()`:

In [None]:
idx_mixed = np.ix_([1,3], keep_cols)
A[idx_mixed]


For more details on indexing in `numpy`, readers are referred
to the `numpy` tutorial mentioned earlier.