# STAT1100 Data Communication and Modelling: Week 3



For this lab we will cover the last of the Python fundamentals for this course. Our focus
will be on two important Python libraries within data science, NumPy and pandas.

# NumPy: Easy Linear Algebra Programming

First, we will look at [NumPy](https://numpy.org/). NumPy provides us with an array data structure, fast mathematical operations
over those arrays, and tools for linear algebra, Fourier transforms, random number generation, etc. To start, we will install numpy with the
following command, which would otherwise be run from the command line. Note that lines in Jupyter code cells that start with the percent
sign are "magic" commands; `%pip` runs pip much as if it were called directly from the terminal:

In [None]:
# In the following, the -U flag is short for --upgrade which just makes sure to update if we already have the package installed
%pip install -U numpy

The basis of numpy, the array data structure, is similar to a list but it can only store data of a single type. While this does
sound limiting, these arrays are substantially faster than lists for performing mathematical operations, especially when
working with vectors, matrices, and higher-rank tensors. Constructing our own arrays with numpy is very simple, as follows:

In [None]:
import numpy as np

a = np.array(5)

In the first line we rename the module `numpy` to `np` with the `import ... as` statement. This is commonplace with numpy, simply to
save typing. Next, we create a numpy array containing the scalar with value 5.
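
As a small sketch of the single-type rule mentioned earlier (the variable names `scalar` and `mixed` are illustrative only), numpy converts mixed integer and float input to one common type, and a scalar array has no axes at all:

```python
import numpy as np

scalar = np.array(5)           # a zero-dimensional array holding one value
mixed = np.array([1, 2, 3.5])  # the integers are converted to floats

print(scalar.ndim, scalar.shape)  # number of axes and shape of the scalar array
print(mixed.dtype)                # one type shared by every element
```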

For many basic mathematical operations, arrays can be treated the same as any other number in python. For example:

In [None]:
# We can use normal python numbers
print(a + 5)
print(a - 4)
print(a * 2)
print(a / 3)
print(a ** 6)

# Or we can use numpy arrays
b = np.array(2)
print(a + b)
print(a * b)

Numpy arrays also have their own unique attributes and functions, which become more important when we look at arrays of higher rank.
To look at those we will first construct a vector, $c = [1, 2, 3]$, then use some of the most common ones on it:

In [None]:
c = np.array([1, 2, 3])  # We use a list to specify the elements and shape of the array

print(c.ndim)  # Gets the number of axes, in this case 1 but for matrices it is 2 and so on
print(c.shape) # Gets a tuple giving the number of elements along each axis
print(c.size)  # Gets the total number of elements in the array
print(c.dtype) # Gives the typing of the elements in the array
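
To see how these attributes change with more axes, here is a minimal sketch using a small matrix (the name `matrix` is just for illustration):

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # two rows, three columns

print(matrix.ndim)   # 2, one axis for rows and one for columns
print(matrix.shape)  # (2, 3)
print(matrix.size)   # 6 elements in total
```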

Numpy also includes more specific mathematical functions, for example:

In [None]:
print(np.sqrt(c))  # Computes the square root of each element in the array
print(np.sin(c))   # Also performed elementwise
print(np.mean(c))  # Finds the mean of the entire array
print(np.std(c))   # Finds the standard deviation of the entire array (population form by default)

# There are also operations between multiple arrays
d = np.array([4, 5, 6])
print(c * d)        # This will be elementwise multiplication
print(c.dot(d))     # But this will be the dot product of the vectors
print(np.dot(c, d)) # This is the same as the previous line
print(np.linalg.norm(c - d, ord=2)) # The Euclidean distance between the vectors
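
The dot product and Euclidean distance above can also be written out by hand, which may help connect the function calls to the underlying formulas. This is only a sketch; the names `manual_dot` and `manual_distance` are illustrative:

```python
import numpy as np

c = np.array([1, 2, 3])
d = np.array([4, 5, 6])

# Dot product: multiply elementwise, then sum (1*4 + 2*5 + 3*6)
manual_dot = np.sum(c * d)

# Euclidean distance: square root of the sum of squared differences
manual_distance = np.sqrt(np.sum((c - d) ** 2))

print(manual_dot, np.dot(c, d))
print(manual_distance, np.linalg.norm(c - d, ord=2))
```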

There are many more functions included in numpy. If you are interested, you can look at the
[documentation](https://numpy.org/doc/stable/reference/index.html#reference).

The final important feature of arrays we will cover is indexing. This is where we take subarrays
by specifying the indices of the elements we want. For numpy arrays, the indices are not limited
to single numbers or slices; you can also use lists/arrays of indices, or even booleans. For example:

In [None]:
print(d[0])  # The first element, counting starts at 0 in python
print(d[1:3]) # Take the slice of second and third elements
print(d[[0, 2]]) # Take the first and last elements using a list
print(d[d > 4]) # Take all elements greater than 4
print(d[np.array([False, True, True])]) # Take all elements where the boolean is true
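
Boolean indexing works because a comparison such as `d > 4` is itself an array of booleans. The following sketch stores that array in a variable (here called `mask`, an illustrative name) and also combines two conditions:

```python
import numpy as np

d = np.array([4, 5, 6])

mask = d > 4   # elementwise comparison gives a boolean array
print(mask)
print(d[mask])  # same result as d[d > 4]

# Conditions are combined with & (and) or | (or), each wrapped in parentheses
between = d[(d > 4) & (d < 6)]
print(between)
```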

For arrays with more than one axis, the indexing is done through a tuple with a number of elements that is less
than or equal to the number of axes. For example:

In [None]:
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # A 3x3 matrix

print(m[0, 0]) # The first element of the first row
print(m[0]) # The first row
print(m[0, :]) # Same as above
print(m[:, 0]) # The first column
print(m[[1, 2], [0, 2]]) # The index lists are paired up: the elements at (1, 0) and (2, 2)
print(m[m > 3]) # All elements greater than 3
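
When two index lists are given, numpy pairs them up element by element rather than taking every combination. If the full block of rows and columns is wanted instead, one option is `np.ix_`, as in this small sketch (the names `paired` and `block` are illustrative):

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

paired = m[[1, 2], [0, 2]]         # the elements at (1, 0) and (2, 2)
block = m[np.ix_([1, 2], [0, 2])]  # rows 1 and 2 crossed with columns 0 and 2

print(paired)
print(block)
```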

### Exercises

1. Create an array called `v` with the values $[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]$.
2. Reshape `v` into a 2x5 matrix.
3. n/a.
4. Generate a random array, `r`, of integers in the range $[0, 10)$ with shape $(3, 4)$.
5. Sort the rows of `r` in ascending order.
6. Create the identity matrix $I$ with shape $(3, 4)$.
7. Create two 4x2x3 tensors of ones and zeros, respectively.

# Pandas: Collecting and Structuring Data

[Pandas](https://pandas.pydata.org/) is a library for handling structured data in Python. Its key feature is the
DataFrame data structure, which is similar to R's dataframes, a table in an SQL database,
or an Excel spreadsheet. We will start by installing pandas:

In [None]:
%pip install -U pandas

## Series

First we will look at the building block of the dataframe, the Series. A Series is a labelled one-dimensional array of data;
it represents a single column of a dataframe. For example:

In [None]:
import pandas as pd  # It is common to rename pandas to pd, saves typing

# We will use a dictionary for the values of the Series,
# but there are many other valid types to pass in, such as numpy arrays.
s = pd.Series({'a': 1, 'b': 2, 'c': 3})

The series is mostly encountered when taking parts of a dataframe. When dealing with them independently, they act
very similarly to numpy arrays, except with labelled indices. For example:

In [None]:
print(s + 2)
print(s / 4)

# When operating on two Series, the elementwise operations are aligned by label
print(s + pd.Series({'b': 4, 'a': 1, 'c': 9}))
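
Label alignment also matters when the two Series do not share all of their labels. In the following sketch (the names `other`, `total`, and `filled` are illustrative), labels found in only one Series produce `NaN`, unless a fill value is supplied through the `add` method:

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3})
other = pd.Series({'c': 10, 'a': 20, 'd': 30})

total = s + other                   # 'b' and 'd' appear in only one Series
filled = s.add(other, fill_value=0)  # treat a missing label as 0 instead

print(total)
print(filled)
```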

Like with numpy arrays, the series also includes numerous functions for more specific operations. For example:

In [None]:
print(s.mean())
print(s.max())
print(s.prod())

## DataFrames

The DataFrame extends the Series into two dimensions, being a matrix with labelled rows and columns. It shares
functionality with the Series, except those functions are called across each column independently and return
a series of the results. In the following, we demonstrate the construction of a dataframe and the calling of
some of its functions:

In [None]:
# If labels for the rows are not specified, they are numbered starting from 0
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

print(df + 2)  # Any normal arithmetic works elementwise
print(df.mean())
print(df.max())
print(df.describe())  # Similar to R's summary()
print(df * pd.DataFrame({'c': [1, 2, 3], 'a': [9, 12, 7], 'b': [6, 5, 4]}))
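
By default these functions work down each column. As a brief sketch, the `axis` parameter can be used to work across each row instead (the names `column_means` and `row_means` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

column_means = df.mean()     # one value per column (the default, axis=0)
row_means = df.mean(axis=1)  # one value per row

print(column_means)
print(row_means)
```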

A key difference between the numpy array and both the dataframe and series is in accessing specific elements.
For accessing single columns or elements, there are two ways. The most intuitive is the "bracket method",
which is the same as with dictionaries, except that lists and arrays are also permitted as indices.

In [None]:
print(df['a'])  # Access a column by name
print(df[['a', 'c']]) # Access multiple columns by name
print(df['b'][1]) # Access a single element

There is also the dot method for getting columns by name, which is similar to accessing items from modules.
It also happens to replicate R's `$` operator.

In [None]:
print(df.a) # Same as the first line of the previous block

You may notice that `df[['a', 'c']]` returns another dataframe, meaning you cannot access the rows in the same
way as we got an element. That is, the following is invalid, because the brackets look for a column labelled `1`:

In [None]:
print(df[['a', 'c']][1])

This is not a complete loss though, as pandas provides two ways of accessing rows by name or position. These are
`loc` and `iloc`, which use row labels and integer row positions respectively. For the following demonstration,
we will first give `df` row names, then use `loc`/`iloc`:

In [None]:
df.index = ['d', 'e', 'f']

print(df.loc['d'])
print(df.loc[['f', 'e']])
print(df[['a', 'c']].loc['e'])  # Working version of print(df[['a', 'c']][1])

# iloc versions
print(df.iloc[0])
print(df.iloc[[2, 1]])
print(df[['a', 'c']].iloc[1])
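
Both `loc` and `iloc` also accept a row and a column together, separated by a comma, which avoids chaining two sets of brackets. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                  index=['d', 'e', 'f'])

by_label = df.loc['e', 'b']              # row labelled 'e', column labelled 'b'
by_position = df.iloc[1, 1]              # second row, second column
subset = df.loc[['d', 'f'], ['a', 'c']]  # several rows and columns at once

print(by_label, by_position)
print(subset)
```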

There are a few other significant summary methods whose importance is better seen in larger data sets: the
`head` and `tail` methods, which show the first and last $n$ rows of the dataframe, respectively, in
addition to the previously mentioned `describe` method. These methods give a general "at-a-glance" view
of the stored data.

In [None]:
# Construct the big dataframe
rng = np.random.default_rng()
big_df = pd.DataFrame(rng.integers(0, 10, size=(100, 5)), columns=['a', 'b', 'c', 'd', 'e'])

# Look at the summaries
print(big_df.head())
print(big_df.tail())
print(big_df.describe())
# The default number of rows is 5, but can be changed through the parameters
print(big_df.head(2))
print(big_df.tail(20))

### Exercises

For these exercises, we will also use the [scikit-learn](https://scikit-learn.org/) and
[matplotlib](https://matplotlib.org/) libraries. We will cover these libraries in more detail in future tutorials,
but for now, we will do some basic data analysis with them. To install these libraries,
run the following:

In [None]:
%pip install -U scikit-learn
%pip install -U matplotlib

From scikit-learn, we will take the iris dataset for our analysis. The following code will
load the dataset into a pandas dataframe:

In [None]:
import matplotlib.pyplot as plt  # For plots later on
%matplotlib inline
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)['frame']

1. Print the column names of `iris`.
2. Print the first twelve rows of `iris`.
3. Get a summary of the data in `iris`.
4. Print the first, fifth, and forty-second rows of the `sepal width (cm)` and `petal length (cm)`
columns in `iris`.
5. Find the product of the `petal width (cm)` and `petal length (cm)` columns in `iris`.
6. Make a scatter plot of the `petal length (cm)` against the `petal width (cm)` columns in `iris`.
Try to color the points according to the `target` column. (Hint: dataframes have a collection of
[plot methods](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html). After plotting, you can
use the `plt.show()` function to display it.)


# Supplementary Material: What is a library?

From here, we will cover some optional material that is not required to understand this course,
but will improve your understanding of Python programming. To complete this, you cannot use a
Jupyter notebook; you must use a text editor or IDE.

By now, you have probably heard of the terms "library", "module", and "package" used almost
interchangeably. They all have subtle differences that will be clear in a moment, but at the same
time they all share a purpose, collecting code for easier reuse and readability.

## Set-up

For this section, we first need to create a folder called "lab3"; we will then add files and
folders to it. You will need to use a text editor to create any Python files. Ensure that
they are saved with the extension ".py", noting that some text editors might add ".txt".

The choice of text editor is up to you. Notably, most lab computers would have notepad++ by default.

## Creating a Module

A module is a Python file that is separate from the file in which you are running your
main code. That module can then be "imported" into the main code to make its functions,
variables, etc. available. We will see what this means with the following example, where we will construct
a module called "stats" which will compute the mean of a collection of numbers stored in our
main file.

First, we will create the module by creating a file called "stats.py" in the lab3 folder.
In this file we will write the following code:

```python
"""
An assortment of basic statistics functions
"""

def mean(numbers):
    return sum(numbers) / len(numbers)
```

The first three lines of that code are an optional "docstring", which is a special string that
provides documentation for what it is attached to. In this case, the docstring is attached to
the module to provide a description of its purpose. You can also put them on functions,
classes, and methods. Including docstrings ensures that when you call the `help` function
on the module, function, etc. you will see that documentation.
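
As a brief sketch of a docstring attached to a function rather than a module, the wording of the docstring below is only an illustration. Python stores the text on the function's `__doc__` attribute, which is where `help` finds it:

```python
def mean(numbers):
    """Return the arithmetic mean of a non-empty collection of numbers."""
    return sum(numbers) / len(numbers)


# The docstring is stored on the function itself
print(mean.__doc__)
print(mean([1, 2, 3, 4, 5]))
```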

Next, we will create our main file, where we will import and use the module. The main file
will be called "main.py" and will also be in the lab3 folder. It will contain the following
code:

```python
import stats

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    print(stats.mean(numbers))
```

The first line of the code imports our module `stats` into the "namespace" called `stats`. A namespace
is similar to a variable, but it collects the variables, functions, and classes inside a module, each of
which can be accessed with `namespace.variable_function_etc`. The next notable line is the `if __name__ == "__main__"`
line, which is not required but is good practice to include. It ensures that the code contained
in the if block is only executed if this is the main file being run. That means that if you import main
from another file, that code will not be executed. Finally, we call the `mean` function from the `stats` namespace on our list of numbers
and print the result.
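
To see why the `__name__` check works, the following sketch writes a tiny throwaway module to a temporary folder and imports it. The module name `demo_module` and the variable `seen_name` are hypothetical and exist only for this demonstration. An imported module sees its own name in `__name__`, whereas the file run directly sees `"__main__"`:

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # A hypothetical module that records the value of __name__ it sees
    Path(folder, "demo_module.py").write_text("seen_name = __name__\n")

    # Temporarily let Python find modules in the temporary folder
    sys.path.insert(0, folder)
    try:
        demo_module = importlib.import_module("demo_module")
    finally:
        sys.path.remove(folder)

# The imported module saw its own name, not "__main__"
print(demo_module.seen_name)
```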

To execute this file, you can use the command `python main.py` from the command line.

### Exercises

1. Add a function called `std` to the `stats` module which computes the standard deviation of a list of numbers.
Call it on the numbers list in the main file.
2. Print the results of calling `dir` on the `stats` module.
3. Add docstrings to both of the functions in the `stats` module. Then, in the main file, call the help
function on the module and each of those functions. (press q to exit the help function)
4. Create a module called `functions` and add in two functions, $f(x) = x^2 + 3$ and $f(x) = x^3 + 5$.
Then, in the main file, import the `functions` module and call the two functions on $4$. Print the results.

## Collecting Modules into a Library

A library is a collection of modules. It creates an additional namespace for the modules, which keeps them better organised.
Now that we have two modules, we can create a library that we will call `lib`. To do this, we will create
a folder called "lib" and move "stats.py" and "functions.py" into this folder.

Now to import the modules into our main file, we will need to replace `import stats` with `from lib import stats`
and `import functions` with `from lib import functions`. This is due to our added namespace.

We can also make `lib` directly contain the modules. We just need to add a file called `__init__.py` to the `lib`
folder. It will contain:

```python
from . import stats
from . import functions
```

Here, `from .` means "take the module from this folder".

With this file, we can now simplify the imports of our main file to:

```python
import lib

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    print(lib.stats.mean(numbers))
    # etc.
```
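
The whole layout can also be tried out without touching your lab3 folder. The following sketch builds a throwaway copy in a temporary folder and imports it. The folder name `demo_lib` is hypothetical and stands in for `lib`, and only the `stats` module is included:

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # Build the folder structure: demo_lib/__init__.py and demo_lib/stats.py
    package = Path(folder, "demo_lib")
    package.mkdir()
    (package / "stats.py").write_text(
        "def mean(numbers):\n"
        "    return sum(numbers) / len(numbers)\n"
    )
    (package / "__init__.py").write_text("from . import stats\n")

    # Temporarily let Python find the library in the temporary folder
    sys.path.insert(0, folder)
    try:
        demo_lib = importlib.import_module("demo_lib")
        result = demo_lib.stats.mean([1, 2, 3, 4, 5])
    finally:
        sys.path.remove(folder)

print(result)
```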

Notice that this form of import gives us a two-level namespace, `lib.module.variable_function_etc`. Beyond sorting code, these namespaces
are important because they allow us to not worry about overwriting variables, functions, etc. from other
modules. This also determines the level from which we should import. For example, since there is nothing else in
`main.py` called `stats`, we can import `stats` from `lib` with `from lib import stats` and not worry about overwriting anything.
### Exercise

Create another library called `strings`. Inside `strings` create two modules, `words` and `letters`. In `words`
create a function called `count` which takes a string and returns the number of words in the string, and also create
a function called `reverse` which reverses the string. In `letters`, create a function called `count` which finds
the number of letters in a string. In the main file, call each of these functions on a string and print the results.

## Then, What is a Package?

Finally, this takes us to packages. A package is simply a library that is built so it can be imported into Python from
any folder, so it appears as an extension of Python itself. We commonly use packages built by other people to make our own
code simpler, and to prevent us from "reinventing the wheel". The `pip` tool that is a part of the Python distribution is
used to manage and install packages, which usually come from [PyPI](https://pypi.org/).