# 1. Tools and introduction

## Getting Started: Jupyter Notebook

This course provides an introduction to data analysis and machine learning with Python. You will learn ideas and concepts for extracting useful information from gathered data, in order to gain new insights and draw conclusions from its contents. Such information is often not easily or directly attainable, and can only be accessed with the help of specialized methods.

The Python programming language offers a large number of libraries for analysing and manipulating data. A convenient way to get started is by installing a free Python distribution called *Anaconda* (www.anaconda.com). In addition to standard libraries, Anaconda automatically incorporates many of the central data science tools and packages - including *Jupyter Notebook*, a web-based interactive environment, which provides a platform to write and execute Python code and complement it with explanatory text and rich annotations. All the material for this course (including this document) will be provided as notebook documents, denoted by the `.ipynb` extension.

After installing the Anaconda distribution, you will find Jupyter Notebook among the installed applications. On launching it, a directory listing opens in a web browser. You can either locate and open an existing notebook, or create a new one: press the "New" button in the upper right corner of the page, and select "Notebook: Python 3 (ipykernel)".

Jupyter Notebook documents contain two kinds of cells: *code cells* for writing and executing Python code, and *markdown cells* (like this one) for annotation purposes. To run the code in one of the code cells, select the cell first and then either press SHIFT + ENTER, or the "Run" button in the above toolbar. The interactive-mode IPython interpreter then displays the resulting output (if any) for inspection. With markdown cells, you can create explanatory content to accompany your code; for additional information, see e.g. https://jupyter-notebook.readthedocs.io/en/stable/.

**NOTE:** When a notebook is reopened after closing it, the kernel restarts and all previously defined variables are lost; you will have to execute the code again to restore their values. Also, when working with notebooks and repeatedly rewriting and executing cells, errors may creep in and cause the kernel to crash or the notebook to get stuck. In this case, it is usually a good idea to restart the kernel and execute all the cells again; this can conveniently be accomplished by selecting "Restart & Run All" from the "Kernel" dropdown menu.

As the next step, execute the first code cell right below to import two Python libraries: **NumPy** is a fundamental package for scientific computing, whereas **pandas** is a specialized library for reading and manipulating data. These two libraries form the foundation of practically any data analysis project with Python. In the following, we take a brief look at both of them.

In [1]:
import numpy as np
import pandas as pd

## Basic Properties of NumPy Arrays

NumPy is an open-source extension library of Python for comprehensive and efficient numerical computation.

In particular, the NumPy library introduces a powerful new object called *ndarray* (N-dimensional array), or *array* for short, which can be thought of as an enhanced version of lists in standard Python. NumPy arrays have many useful computational features, which are optimized for rapid execution. Practically all data analysis and machine learning libraries are based on using NumPy arrays in the background. Below, we get acquainted with some of their basic properties.

A new array can be created by converting a Python list with the use of the `array()` function:

In [2]:
arr1d = np.array([1, 2, 3, 4, 5])
arr1d

array([1, 2, 3, 4, 5])

Note that the second of these two code lines causes the value of the variable `arr1d` to be displayed in the IPython output; this is a useful way to check that the code has the intended effect (however, this only applies to the final line in each code cell). The array object has the following useful attributes:

In [3]:
print(arr1d.dtype) # data type of elements in array
print(arr1d.ndim) # array dimension / number of axes / rank
print(arr1d.size) # number of elements in array
print(arr1d.shape) # array shape = size of each dimension (tuple)

int32
1
5
(5,)


Arrays can have any number of dimensions, but an important special case is the two-dimensional one: such arrays can be used to represent data in the form of a simple table. Here is an example with four rows and three columns:

In [4]:
arr2d = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.], [10., 11., 12.]])
arr2d

array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.],
       [10., 11., 12.]])

As seen below, the inclusion of the decimal point ensures that the elements are treated as floating-point numbers instead of integers (the trailing zeroes in the fractional part can be omitted):

In [5]:
print(arr2d.dtype) 
print(arr2d.ndim) 
print(arr2d.size) 
print(arr2d.shape)

float64
2
12
(4, 3)
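
The element type can also be requested explicitly instead of relying on decimal points. The following is a minimal sketch using the `dtype` argument of `np.array()` and the `astype()` method; the variable names are chosen here for illustration only.

```python
import numpy as np

ints = np.array([1, 2, 3])                 # integer elements (type inferred)
floats = np.array([1, 2, 3], dtype=float)  # same values stored as floats
converted = ints.astype(float)             # astype() returns a converted copy

print(floats.dtype)     # float64
print(converted)        # [1. 2. 3.]
print(ints.dtype.kind)  # 'i' for integer types (exact width depends on the platform)
```

Because `astype()` returns a new array, `ints` itself still holds integers afterwards.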


Individual array elements can be accessed as follows (note that indexing begins at zero, as with lists):

In [6]:
print(arr1d[0]) # first element of 1D array
print(arr2d[1, 2]) # element in 2nd row, 3rd column

1
6.0


Unlike lists, NumPy arrays can be subjected to arithmetic operations, which are performed *elementwise*. The same applies for e.g. roots and trigonometric functions; however, remember to use the NumPy versions of these functions (prefixed with the `np.` alias).

In [7]:
print(arr1d + 3) # adds 3 to each element
print(arr2d / 2.0) # divides all the elements by 2.0
print(arr1d ** 2) # squares all elements
print(np.sqrt(arr1d)) # replaces all elements with their square roots

[4 5 6 7 8]
[[0.5 1.  1.5]
 [2.  2.5 3. ]
 [3.5 4.  4.5]
 [5.  5.5 6. ]]
[ 1  4  9 16 25]
[1.         1.41421356 1.73205081 2.         2.23606798]
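
To make the contrast with lists concrete, the sketch below applies the same operator to a plain list and to an array built from it. The names `values` and `arr` are illustrative.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

print(values * 2)   # list repetition: [1, 2, 3, 1, 2, 3]
print(arr * 2)      # elementwise multiplication: [2 4 6]
print(arr + arr)    # elementwise addition of two arrays: [2 4 6]
print(np.sin(arr))  # NumPy functions also act on every element
```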


Subsets can be extracted out of existing arrays via an operation called *slicing*. The desired selection range is specified by giving the start and end indices, separated by a colon. These ranges can be specified for each of the axes in the array separately. If the start and/or end index is omitted, the default values are 0 and the size of the axis, respectively. Here are some examples of array slicing:

In [8]:
print(arr2d[0:3]) # extract rows with indices 0-2 (3 not included) 
print(arr2d[:,:1]) # extract first column only
print(arr2d[:,-2:]) # extract last two columns
print(arr1d[::2]) # extract every other element (step size 2)

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
[[ 1.]
 [ 4.]
 [ 7.]
 [10.]]
[[ 2.  3.]
 [ 5.  6.]
 [ 8.  9.]
 [11. 12.]]
[1 3 5]
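
A few further slicing patterns are sketched below on a small helper array (`grid` is a name introduced here, not part of the course material). Note in particular the difference between indexing with a single number and with a one-element slice.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns, values 0..11

print(grid[1:, :2])     # rows 1-2, columns 0-1 -> [[4 5] [8 9]]
print(grid[:, ::2])     # every other column
print(grid[::-1])       # a negative step reverses the row order
print(grid[1])          # a single index drops that axis: [4 5 6 7]
print(grid[1:2].shape)  # a slice keeps the axis: (1, 4)
```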


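**NOTE:** Slicing produces a *view* of the original data rather than an independent copy: the slice and the original array share the same underlying memory. Accordingly, any changes in the original array are reflected in the slice as well (and vice versa):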

In [9]:
a = np.array([1, 2, 3, 4, 5, 6])
a_slice = a[0:3] # array([1, 2, 3])
a[0] = 0 # changing the first element of a
a_slice # ... also modifies the slice

array([0, 2, 3])

In order to produce a true separate copy of an array, you can use the `copy()` method of the array:

In [10]:
a = np.array([1, 2, 3, 4, 5, 6])
a_copy = a.copy()
a[0] = 0 # changing the first element of a
a_copy # ... does not affect a_copy

array([1, 2, 3, 4, 5, 6])
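
Whether two arrays share data can also be checked directly. The sketch below uses `np.shares_memory()` and shows that writing through a slice changes the original array, too; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
view = a[0:3]          # slice: shares data with a
copy = a[0:3].copy()   # independent copy of the same elements

view[1] = 99           # writing through the view changes a as well

print(a)                          # [ 1 99  3  4  5  6]
print(copy)                       # [1 2 3]
print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
```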

Sometimes it is useful to generate an array with some desired initial content. The following cell gives some examples.

In [11]:
print(np.zeros((2,2))) # 2x2 array with all zeroes
print(np.random.random((3,3))) # 3x3 array with random values 0 ... 1
print(np.arange(3,10)) # sequence from 3 to 9 (10 not included)
print(np.linspace(0, 10, 5)) # five evenly spaced values from 0 to 10

[[0. 0.]
 [0. 0.]]
[[0.58742978 0.32160727 0.91680687]
 [0.84017056 0.29866969 0.624413  ]
 [0.07672841 0.685516   0.33230241]]
[3 4 5 6 7 8 9]
[ 0.   2.5  5.   7.5 10. ]
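
Some other commonly used constructors are sketched below. The seeded generator from `np.random.default_rng()` is shown as one way to make random values repeatable between runs; the seed value 42 is an arbitrary choice.

```python
import numpy as np

print(np.ones((2, 3)))     # 2x3 array of ones
print(np.full((2, 2), 7))  # 2x2 array filled with the value 7
print(np.eye(3))           # 3x3 identity matrix

rng = np.random.default_rng(seed=42)  # seeded generator for repeatable results
sample = rng.random((3, 3))           # 3x3 array with values in [0, 1)
print(sample)
```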


The above listing of the properties of NumPy arrays is by no means exhaustive; see e.g. the NumPy user guide (https://numpy.org/devdocs/user/index.html) for much more information.

Even though the underlying computation in most data science libraries in Python is performed using NumPy arrays behind the scenes (and, therefore, it is useful to know about their basic properties), it is not always necessary to deal with them directly when writing code. For example, the pandas library introduces another powerful object (DataFrame) for manipulating data in the form of a table. This is our next topic of interest.

## Pandas and Data Tables

In order to conduct data analysis, we obviously need some kind of data to deal with. As our first example, we use the famous *Iris flower dataset* about morphological variation of iris flowers. The dataset comprises measurements of 150 different flower samples belonging to three different species (Iris setosa, Iris virginica, and Iris versicolor). For each sample, four numerical features were measured (length and width of its petals and sepals in units of cm) and recorded together with the species of that sample. 

This data is provided in the course material in the form of a CSV (comma-separated values) file, and can be easily accessed with the `read_csv` function in the pandas library: the path to locate the file just needs to be specified in the function argument. 

In [12]:
iris_df = pd.read_csv('datasets/iris/iris.csv')
iris_df

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica


The data is now available for further analysis as a pandas *DataFrame* object; it has the form of a table with 150 rows and 5 columns (for clarity, only a few rows from the beginning and the end of the table are displayed). Each row in the data corresponds to a single flower **sample**, and the columns of the data represent the recorded **features** of the samples. In this dataset, each sample is characterized by five recorded features: four *numerical* features (sepal length, sepal width, petal length and petal width) and one *categorical* feature (species). The leftmost column in the output, containing the unique row indices from 0 to 149, is auto-generated by pandas and is not part of the actual data. The column names are taken from the header row of the CSV file.
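
For readers without the course file at hand, the following sketch imitates the loading step with a small in-memory CSV. The five data rows are the first five samples of the dataset; using `io.StringIO` as a stand-in for a file path, and the names `csv_text` and `mini_df`, are assumptions made only to keep the example self-contained.

```python
import io
import pandas as pd

# Stand-in for a CSV file: header row followed by five samples
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
"""

mini_df = pd.read_csv(io.StringIO(csv_text))

print(mini_df.shape)          # (5, 5): number of rows and columns
print(list(mini_df.columns))  # column names taken from the header row
print(mini_df.head(2))        # first two rows
```

The row index 0 to 4 is again generated by pandas, since the CSV text itself contains no index column.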

The distinction between numerical and categorical features is very significant in data analysis. Numerical features usually represent results of measurements; their values can be ordered in a meaningful way, and often vary on a continuous scale. In contrast, categorical features (such as the flower species in this example dataset) can only take on a limited number of discrete (often mutually exclusive) values, usually with no particular order among them. 
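
In pandas, this distinction shows up in the column data types. The sketch below uses a hypothetical five-row sample (not the course dataset) to separate numerical columns from a categorical one and to apply an operation suited to each kind.

```python
import pandas as pd

# Hypothetical mini-sample with two numerical features and one categorical feature
df = pd.DataFrame({
    'petal_length': [1.4, 1.3, 5.2, 5.0, 4.5],
    'petal_width': [0.2, 0.2, 2.3, 1.9, 1.5],
    'species': ['setosa', 'setosa', 'virginica', 'virginica', 'versicolor'],
})

numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)                  # ['petal_length', 'petal_width']
print(df['species'].nunique())       # 3 distinct categories
print(df['species'].value_counts())  # number of samples per category
print(df['petal_length'].mean())     # averaging is meaningful for numerical features
```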

Pandas provides a huge selection of operations for manipulating the information stored in DataFrames. First, individual samples (rows) can be accessed by their indices using the `loc` attribute:

In [13]:
print(iris_df.loc[0]) # row with index 0
print(iris_df.loc[[0,3]]) # list of indices (0 and 3)
print(iris_df.loc[0:3]) # index slice (with loc, the end label 3 is included)

sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
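
Since `loc` selects by index *label*, its slices include the end label, unlike ordinary Python slices. The sketch below contrasts this with the position-based `iloc` attribute, which is not used elsewhere in this notebook; the small DataFrame is hypothetical and has labels that deliberately do not start at zero.

```python
import pandas as pd

# Hypothetical frame whose index labels do not start at zero
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=[5, 6, 7, 8])

print(df.loc[5:7])   # label-based: labels 5, 6 and 7 (end label included)
print(df.iloc[0:2])  # position-based: positions 0 and 1 (end excluded)
print(df.iloc[-1])   # last row by position
```

For the iris data the labels and positions coincide, so the difference only becomes visible after rows have been filtered, dropped or reordered.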


Individual columns can be accessed by their index names:

In [14]:
iris_df['sepal_length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

The resulting output is a pandas *Series* object: a one-dimensional counterpart of a DataFrame. Also, a single data item can be accessed as follows:

In [15]:
iris_df.loc[0, 'sepal_width'] # sepal width of sample with index 0

3.5

Arbitrary subsets can be extracted by selecting both rows and columns with slices or lists:

In [16]:
iris_df.loc[147:, ['petal_length', 'petal_width']] # petal sizes for last three samples

     petal_length  petal_width
147           5.2          2.0
148           5.4          2.3
149           5.1          1.8


One interesting property of DataFrames is the possibility to use Boolean conditions as indices to apply filtering based on the feature values:

In [17]:
iris_df[iris_df['petal_length'] < 1.4] # all samples with petal length less than 1.4 cm

    sepal_length  sepal_width  petal_length  petal_width species
2            4.7          3.2           1.3          0.2  setosa
13           4.3          3.0           1.1          0.1  setosa
14           5.8          4.0           1.2          0.2  setosa
16           5.4          3.9           1.3          0.4  setosa
22           4.6          3.6           1.0          0.2  setosa
35           5.0          3.2           1.2          0.2  setosa
36           5.5          3.5           1.3          0.2  setosa
38           4.4          3.0           1.3          0.2  setosa
40           5.0          3.5           1.3          0.3  setosa
41           4.5          2.3           1.3          0.3  setosa
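
A Boolean condition on a column evaluates to a Series of `True`/`False` values, and several conditions can be combined. The sketch below uses a hypothetical five-row sample; note that `&` (and) and `|` (or) are used instead of the Python keywords, and that each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical mini-sample; values chosen for illustration
df = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.7, 5.1, 6.0],
    'petal_width': [0.2, 0.2, 1.4, 1.8, 2.5],
    'species': ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica'],
})

mask = df['petal_length'] > 4.5  # Boolean Series, one value per row
print(mask.tolist())             # [False, False, True, True, True]

# Both conditions must hold for a row to be kept
long_and_wide = df[(df['petal_length'] > 4.5) & (df['petal_width'] > 1.5)]
not_setosa = df[df['species'] != 'setosa']
print(long_and_wide)
print(len(not_setosa))           # 3
```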


Rows or columns can be deleted with the `drop()` method. Note that `drop()` returns a new DataFrame and leaves the original one unchanged.

In [18]:
print(iris_df.drop(iris_df.index[5:145])) # delete rows with indices from 5 up to (but not including) 145
print(iris_df.drop(columns = 'species')) # drop the "species" column

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica
     sepal_length  sepal_width  petal_length  petal_width
0             5.1          3.5           1.4          0.2
1             4.9          3.0           1.4          0.2
2             4.7          3.2           1.3          0.2
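3             4.6          3.1           1.5          0.2
4             5.0          3.6           1.4          0.2
..            ...          ...           ...          ...
145           6.7          3.0           5.2          2.3
146           6.3          2.5           5.0          1.9
147           6.5          3.0           5.2          2.0
148           6.2          3.4           5.4          2.3
149           5.9          3.0           5.1          1.8

[150 rows x 4 columns]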
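
The sketch below illustrates on a hypothetical three-row DataFrame that `drop()` hands back a new object: the original keeps its rows and columns until the result is assigned back to it.

```python
import pandas as pd

df = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'species': ['setosa', 'versicolor', 'virginica'],
})

without_species = df.drop(columns='species')  # new DataFrame without the column
without_first = df.drop(index=0)              # new DataFrame without row label 0

print(list(df.columns))               # original still has both columns
print(list(without_species.columns))  # ['petal_length']
print(without_first.index.tolist())   # [1, 2]

df = df.drop(columns='species')       # reassign to keep the change
```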

Finally, comprehensive descriptive statistics of the DataFrame can be easily accessed with the `describe()` method. If the DataFrame contains mixed data types, only numeric columns are included in the analysis by default. The categorical data can also be included by adding the option `include = 'all'` to the method call.

In [19]:
iris_df.describe(include = 'all')

        sepal_length  sepal_width  petal_length  petal_width species
count          150.0        150.0         150.0        150.0     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333        3.054      3.758667     1.198667     NaN
std         0.828066     0.433594       1.76442     0.763161     NaN
min              4.3          2.0           1.0          0.1     NaN
25%              5.1          2.8           1.6          0.3     NaN
50%              5.8          3.0          4.35          1.3     NaN
75%              6.4          3.3           5.1          1.8     NaN
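
The individual statistics reported by `describe()` can also be computed one at a time. The sketch below does this for a hypothetical four-row sample, so the numbers differ from those of the full dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.7, 5.1],
    'species': ['setosa', 'setosa', 'versicolor', 'virginica'],
})

print(df['petal_length'].mean())          # arithmetic mean
print(df['petal_length'].median())        # same as the 50% row of describe()
print(df['petal_length'].quantile(0.25))  # same as the 25% row
print(df['species'].value_counts())       # frequency of each category
print(df['species'].mode()[0])            # most frequent category ('top')
```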


This concludes our brief introduction to pandas DataFrames. See the pandas user guide (https://pandas.pydata.org/docs/user_guide/index.html) for much more.

## Some Basic Terminology

Assume now that you encounter a new iris flower sample, for which you have the measured lengths and widths of its petals and sepals. You do not, however, know the species of this new sample, and would like to make a sensible prediction about which of the three possible iris species it belongs to. More precisely, consider that you are expected to write a computer program for accomplishing this task: the program should first ask the user for the numerical values of the four measured dimensions, and then output the name of the predicted category. How would you set about solving such a task?

The early approaches for solving these kinds of problems involved first gathering domain-specific knowledge related to the task, expressing this knowledge as a set of rules and, finally, converting these rules to program code. While this type of approach can be quite useful for some simple and well-defined tasks, this is not the case for many interesting and important problem categories (such as computer vision and pattern recognition). For more complex tasks, much better results can be obtained by using methods of **machine learning**.
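
As an illustration of the rule-based approach, the sketch below hard-codes two thresholds into a function. Both cut-off values and the function name `classify_by_rules` are assumptions made up for this example, not rules taken from the course or from botanical sources; for brevity only two of the four measurements are used.

```python
def classify_by_rules(petal_length, petal_width):
    """Return a species name using hand-picked thresholds (illustrative only)."""
    if petal_length < 2.5:   # assumed cut-off, not taken from the course
        return 'setosa'
    if petal_width < 1.7:    # assumed cut-off, not taken from the course
        return 'versicolor'
    return 'virginica'

print(classify_by_rules(1.4, 0.2))  # setosa
print(classify_by_rules(5.2, 2.3))  # virginica
```

Every threshold here has to be found and maintained by a person, which is what becomes impractical for more complex tasks.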

In machine learning, there is no need for the programmer to explicitly figure out and implement the specific rules for solving the problem under consideration. Instead, these rules are **automatically extracted from data**. In terms of our iris example, this means that the programmer does not need to study botany, or interview a botanist, in order to find out explicit rules for identifying the correct species. Instead, the classifier (the computer program for predicting the iris variety) can be constructed directly from the contents of the dataset containing 150 examples of inputs and outputs. Of course, the quality of such a classifier is very much dependent on the quality of the available data, and needs to be tested before deployment. The (often iterative) process of building a machine-learning model is called **training** the model, and the dataset used for this purpose is called the **training set**.
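
To give a flavor of what "extracting rules from data" can mean, the sketch below builds a very simple classifier with NumPy alone. It uses a nearest-centroid rule, chosen here only because it is short; it is not necessarily a method used later in the course. The six training samples are a hypothetical toy set with two features each.

```python
import numpy as np

# Hypothetical training set: (petal_length, petal_width) with known species
X_train = np.array([[1.4, 0.2], [1.3, 0.2],
                    [4.7, 1.4], [4.5, 1.5],
                    [5.2, 2.3], [5.0, 1.9]])
y_train = np.array(['setosa', 'setosa',
                    'versicolor', 'versicolor',
                    'virginica', 'virginica'])

# "Training": compute the mean feature vector (centroid) of each species
classes = np.unique(y_train)
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

def predict(sample):
    """Return the species whose centroid is closest to the sample."""
    distances = np.linalg.norm(centroids - np.asarray(sample), axis=1)
    return classes[np.argmin(distances)]

print(predict([1.5, 0.3]))  # setosa
print(predict([5.1, 2.0]))  # virginica
```

No threshold is written by hand here: the centroids are computed from the training set, so different training data would yield a different classifier.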

There are several different categories of problems within machine learning. The above example, where the value of a categorical target variable (species) has to be predicted in terms of a set of input variables, is a **classification** problem. If the target variable varies on a continuous scale instead of being categorical, the problem is that of **regression**; one might *e.g.* wish to estimate the price of an apartment in terms of the input values defining its size, number of rooms, age, neighborhood etc. Whenever the training set contains the true (correct) values of the target variable for the training samples, the problem is that of **supervised learning**. This, however, is not always the case. As an example, consider the iris dataset without the species column, containing only the values of the four numerical dimensions for each sample. One might *e.g.* ask whether this set of 150 samples could be divided into three subsets so that each of them contains samples "similar" to each other, but distinct from those in other subsets. Such **clustering** problems are often encountered in the context of market segmentation, and belong to the category of **unsupervised learning**.
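
The clustering idea can be sketched in a few lines as well. The example below applies k-means, one common clustering method that the text above does not name, to six hypothetical unlabeled samples; the number of clusters, the starting centers and the number of rounds are all arbitrary choices for this toy data.

```python
import numpy as np

# Hypothetical unlabeled samples: (petal_length, petal_width)
X = np.array([[1.4, 0.2], [1.3, 0.2], [1.5, 0.3],
              [4.7, 1.4], [4.5, 1.5], [5.2, 2.3]])

k = 2
centers = X[[0, 3]].copy()  # initial guesses: two of the samples (arbitrary choice)

for _ in range(10):         # a few refinement rounds suffice for this toy data
    # Assign every sample to its nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each center to the mean of the samples assigned to it
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(labels)  # [0 0 0 1 1 1]
```

The algorithm never sees a species name; the two groups emerge from the distances between the samples alone.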

During this course, we learn methods and tools for dealing with these kinds of problems, and much more. 

 