# Introduction to Python for machine learning

Author: Brian Stucky

## 1. Introduction

We will be using the programming language [Python](https://www.python.org/) for all lessons in this workshop.  Because we don't have enough time to give a complete, from-scratch introduction to programming, we assume that even if you've not used Python before, you have some experience with at least one so-called "scripting" programming language, such as R or MATLAB.  You don't need to know much Python for this workshop, so in this first lesson, we'll cover the most important bits of the language and also introduce two Python packages that are commonly used for machine learning, _NumPy_ and _pandas_.

## 2. Introducing Jupyter notebooks

There are several ways to run Python programs.  For general scripting and programming tasks, it is common to run Python programs directly from the command line, much as you would any other piece of software.  However, for this workshop, we will be using a programming environment called [*Jupyter notebooks*](https://jupyter.org/) that allows you to run Python programs one line (or several lines) at a time and inspect the results at each step.  This approach to programming is especially well suited for data analysis work and will be familiar to you if you've used R programming environments such as RStudio.

Jupyter notebooks are made up of _cells_ that contain either Python code or text.  To run the code in a cell, click in the cell to make it active, then press `shift+enter`.  The results from running the Python code will be displayed below the cell.  Pressing `shift+enter` will also open a new cell below the active cell if there is not already a cell there.

Next, we'll try using a Jupyter notebook to learn some basic Python commands.

## 3. Python basics

First, let's look at how we can write literal values in Python.  Numbers are written just as you'd expect; for example, `12` or `3.141592654`.  Literal text values are called _strings_ and are enclosed in either single or double quotes: `'this is a string'` or `"this is a string"`.

We can use the `print()` function to write output to the console.  If multiple values are provided to the `print()` function, they must be separated by commas.

In [None]:
print(12)
print(3.141592654)
print('this is a string')
print('This is the value of pi:', 3.141592654)

Python provides all of the basic arithmetic operators for working with numerical values.

In [None]:
print(3 + 4)
print(3 - 4)
print(3 * 4)
print(3 / 4)

print(3 * 4 + 1)
print(3 * (4 + 1))
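
Beyond the four operators above, Python also has operators for exponentiation (`**`), floor division (`//`), and remainder (`%`).  The short sketch below is an illustrative addition; the variable names are made up for this example.

```python
# Additional arithmetic operators (illustrative values only).
power = 3 ** 4        # exponentiation: 3 raised to the 4th power
quotient = 14 // 4    # floor division: divide, then round down to an integer
remainder = 14 % 4    # modulo: the remainder left over after floor division

print(power)
print(quotient)
print(remainder)

# In Python 3, the / operator always produces a floating-point result,
# even when both operands are integers.
print(8 / 4)
```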

Just as with other programming languages, data values in Python can be assigned to _variables_.  The `=` operator is used to assign a value to a variable (and create the variable if it does not yet exist).

In [None]:
pi = 3.141592654
print('This is the value of pi:', pi)

radius = 4
area = pi * radius * radius
print('The area of the circle is', area)

## 4. Conditional statements

Like most programming languages, Python provides an `if` statement that can be used to make a decision.  One of the most common patterns is to check the value of some numerical variable using one of the comparison operators: `>` (greater than), `<` (less than), `==` (equal to), or `!=` (not equal to).  Let's look at some examples.

In [None]:
a = 12

if a < 20:
    print('Less than 20!')

if a == 12:
    print('Equal to 12!')

if a != 20:
    print('Not equal to 20!')

if a > 20:
    print('Greater than 20!')

The basic idea is that whenever the test following the keyword `if` evaluates to `True`, the indented lines following the `if` will be run.  As the last example shows, if the test evaluates to `False`, then nothing happens.  If we'd like to also do something when the test is `False`, we can add an `else` clause to the `if` statement.

In [None]:
if a > 20:
    print('Greater than 20!')
else:
    print('Less than or equal to 20!')
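
When there are more than two possible outcomes, an `elif` ("else if") clause can be placed between `if` and `else` to chain several tests.  As a minimal sketch, the hypothetical variable `temperature` below is sorted into one of three categories using the `>=` (greater than or equal to) operator; only the first test that evaluates to `True` has its indented block run.

```python
temperature = 15  # hypothetical value chosen for illustration

if temperature >= 25:
    label = 'warm'
elif temperature >= 10:
    # Reached only if the first test was False.
    label = 'mild'
else:
    # Reached only if every test above was False.
    label = 'cold'

print('The weather is', label)
```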

### Exercise

Given a variable `someval` that can have any real number value, write code that ensures `someval` is in the range -10 to 10, inclusive, by truncating values outside of that range.  E.g., if the starting value of `someval` is -23, the ending value of `someval` would be -10.

## 5. Lists and loops

So far, we've seen variables and literals that represent a single value (e.g., `a = 14`).  A Python _list_ allows us to group multiple values together in a single data structure.  Python's lists are roughly analogous to the data structures called "arrays" or "vectors" in other programming languages.  We can define a list using brackets, `[` and `]`.

In [None]:
fseq = [1, 1, 2, 3, 5, 8, 13, 21, 34]
print(fseq)

Individual elements of a list are accessed using *subscript notation*, which uses an integer index to refer to the desired list item.  The first element of a list is at index 0, the next is at index 1, and so on.

In [None]:
print(fseq[0])
print(fseq[1])
print(fseq[6] + fseq[7])
print(fseq[7] + fseq[8])
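
Two further details of subscript notation are worth a quick sketch: negative indices count backward from the end of a list, and subscript notation can also be used to replace an item.  The list is redefined here so the example runs on its own.

```python
fseq = [1, 1, 2, 3, 5, 8, 13, 21, 34]

# Negative indices count backward from the end of the list.
last = fseq[-1]            # same item as fseq[8]
second_to_last = fseq[-2]  # same item as fseq[7]
print(last, second_to_last)

# Subscript notation on the left side of = replaces an item in place.
fseq[0] = 0
print(fseq)

# Note: fseq[9] would raise an IndexError, because the last valid index is 8.
```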

Python's `for` loop provides a convenient way to sequentially access every item in a list.

In [None]:
for item in fseq:
    print(item)

The indented part of a `for` loop is called the loop's *body*, and it can contain multiple lines of code.

In [None]:
for item in fseq:
    if item > 10:
        print('Greater than 10:', item)

prev_item = 0
for item in fseq:
    if prev_item > 0:
        print(prev_item, '+', item, '=', prev_item + item)
    prev_item = item

The `len` function returns the number of items in a list.

In [None]:
print(len(fseq))

total = 0
for item in fseq:
    total = total + item

mean = total / len(fseq)
print('Arithmetic mean:', mean)
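
The loop above spells out the calculation step by step.  As a cross-check, Python's built-in `sum()` function adds up the items of a list directly; this sketch redefines `fseq` so it can be run on its own.

```python
fseq = [1, 1, 2, 3, 5, 8, 13, 21, 34]

# Accumulate the total with an explicit loop.
total = 0
for item in fseq:
    total = total + item

# The built-in sum() function should give the same total.
print(total == sum(fseq))
print('Arithmetic mean:', sum(fseq) / len(fseq))
```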

### Exercise

Given a non-empty list of non-negative numbers, called `num_list`, write code that uses a `for` loop to find the largest item in the list.

## 6. Working with Python packages, modules, and functions

Python code is often organized into units called _packages_ and _modules_.  Although there are technical differences between the two, for the purposes of this workshop, we can think of them as functionally the same (to avoid typing "module or package" over and over, I'll sometimes use the term "library" to refer to both).  Modules and packages group together related code and make it easy to reuse that code in other programs.  To introduce the basic concepts, we'll work with a standard Python module called `math` that contains a variety of mathematical functions.

To use a package or a module, we use the `import` statement to tell Python that we want to load the library.  Once a library is loaded, the dot operator, `.`, lets us access the objects contained in the library.  For example, the `math` module includes a variable that defines the constant _pi_ and also includes implementations of all of the standard trigonometric functions.  Let's take a look at how we can access them.

In [None]:
import math

print('This is the value of pi:', math.pi)

radius = 4
area = math.pi * radius * radius
print('The area of the circle is', area)

The `math` module also contains a large set of _functions_.  In Python, a function is a block of code that accepts zero or more *arguments*, does some computations using the argument values, and then returns the result.  For example, the `math` module includes a function called `cos()` that implements the standard trigonometric cosine function.

In [None]:
print(math.cos(0))
print(math.cos(math.pi))
print(math.cos(math.pi * 2))

In the examples above, we call the function `cos()` with the values `0`, `math.pi` and `math.pi * 2` as the argument values.  The result of a function call can be assigned to a variable, just like any other value.

In [None]:
result = math.cos(math.pi * 2)
print(result)

Functions can take any number of arguments.  Multiple arguments are separated by commas.  We've already seen examples of this by calling `print()` with more than one argument.  For another example, the `math` library includes a function called `gcd()` that returns the greatest common divisor of two integers.

In [None]:
print(math.gcd(5, 7))
print(math.gcd(12, 48))
print(math.gcd(143, 253))
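
To make the idea of arguments and return values concrete, here is a minimal sketch of defining a function of our own with Python's `def` statement.  The function `circle_area()` is a hypothetical example written for this lesson; it is not part of the `math` module.

```python
import math

def circle_area(radius):
    # Compute the area from the single argument and return it to the caller.
    return math.pi * radius * radius

# Call the function and assign its return value to a variable.
area = circle_area(4)
print('The area of the circle is', area)
```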

To use functions in the `math` module, we've needed to type the full name of the module every time we call one of its functions.  For a library with a relatively short name, such as `math`, this isn't much of a burden, but for libraries with long names, it can quickly get tedious to type the full name over and over again.  To help solve this problem, Python allows us to assign a shortcut name for a library as part of the `import` statement.

In [None]:
import math as m

print(m.pi)
print(m.cos(m.pi * 2))

One last comment about importing libraries.  In the examples so far, every time we've used an object in the library, we've needed to type the library name first, followed by a dot (`.`), and then the object name, as in `math.cos()`.  Sometimes, it is convenient to be able to access the object in the library directly without typing the library name every time.  Python provides an alternative `import` syntax that makes this easy:

In [None]:
from math import cos

print(cos(m.pi * 2))
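
Several names can be imported in one statement by separating them with commas.  The sketch below is self-contained, so it imports `pi` directly instead of relying on the `m` shortcut defined in an earlier cell.

```python
from math import cos, pi

# Both names can now be used without a module prefix.
value = cos(pi * 2)
print(value)
```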

### Exercise

With the help of the math library, write a short Python program to find the length of the hypotenuse of a right triangle given the lengths of the other two sides, represented by the variables `a` and `b`.  Use the [documentation for the math library](https://docs.python.org/3/library/math.html) as needed.

## 7. Using NumPy

[*NumPy*](https://www.numpy.org/) is a Python package that supports fast numerical computation in Python.  For our purposes, the most important component of NumPy is NumPy's *multidimensional array*, which, in this lesson, I will simply refer to as "array".  NumPy arrays can be used in many of the same ways as Python lists, but they also support a large set of mathematical operations that are not provided for lists.  Furthermore, performing numerical calculations on arrays is typically much faster than using lists.

There are many ways to create an array.  Here, we will use the `array()` function to create an array directly from a Python list.

In [None]:
import numpy as np

f_arr = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])
print(f_arr)

print(f_arr[0])
print(f_arr[8])

print('Array length:', len(f_arr))

for item in f_arr:
    print(item)

The code above illustrates several key points:
  1. By convention, the shortcut name `np` is used for `numpy`.
  2. Indexing of NumPy arrays works the same as for Python lists, with the first element at index 0.
  3. Arrays can generally be used in the same ways you'd use lists.

NumPy arrays support all of the basic arithmetic operations, which are performed *element-wise*.  That is, the arithmetic operation is independently applied to each element of the array or pair of corresponding elements.

In [None]:
print(f_arr + 1)
print(f_arr / 2)
print(f_arr * f_arr)
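
The contrast with plain Python lists is easy to miss, so here is a small sketch.  For lists, `+` joins two lists together and `*` with an integer repeats the list; for arrays, the same operators do element-wise arithmetic.  The variable names are chosen for this example only.

```python
import numpy as np

values_list = [1, 2, 3]
values_arr = np.array([1, 2, 3])

# For lists, + concatenates and * repeats.
print(values_list + values_list)
print(values_list * 2)

# For arrays, + and * are applied element by element.
print(values_arr + values_arr)
print(values_arr * 2)
```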

NumPy also provides many common mathematical functions that can be used with arrays.  As with the arithmetic operators, most of NumPy's mathematical functions operate element-wise, but some calculate a single value from the contents of an array.

In [None]:
print(np.sqrt(f_arr))

print('The range is', np.min(f_arr), ',', np.max(f_arr))
print('Arithmetic mean:', np.mean(f_arr))

Common statistical summary functions, such as `min()` and `mean()`, are also available as methods of the array objects themselves, which is sometimes more convenient.

In [None]:
print('The range is', f_arr.min(), ',', f_arr.max())
print('Arithmetic mean:', f_arr.mean())
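
The difference between element-wise functions and summary functions can be seen in what they return.  In this sketch, which redefines `f_arr` so it runs on its own, `np.sqrt()` returns one result per element, while `np.sum()` (or the equivalent `sum()` method) reduces the whole array to a single value.

```python
import numpy as np

f_arr = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])

# An element-wise function returns an array with one result per input element.
roots = np.sqrt(f_arr)
print(len(roots) == len(f_arr))

# A summary function reduces the whole array to a single value.
print(np.sum(f_arr))
print(f_arr.sum())
```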

### Exercise

Consider the code below:
```
arr_1 = np.array([1, 2, 3, 4, 5, 6])
arr_2 = arr_1

arr_1[2] = 2
arr_2[3] = 5
```
What will be the final value of `arr_1`?  What will be the final value of `arr_2`?  Run the code and check your answers.  Were you surprised by the results?

## 8. Using Pandas

[*Pandas*](https://pandas.pydata.org/) is another Python package that is commonly used for doing statistics and machine learning in Python.  Pandas provides a structure called `DataFrame` that implements a convenient way to work with tabular data (i.e., data with one or more variables arranged in columns, with each row containing one observation of each variable).  If you've worked with dataframes in R, pandas' `DataFrame` is roughly analogous.

To illustrate the basic ideas of pandas' `DataFrame`, we'll work with a famous historical dataset, the [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The iris data are provided in a plain-text format known as [*comma-separated values*](https://en.wikipedia.org/wiki/Comma-separated_values), or *CSV*.  CSV is a popular format for tabular data because it is both human and machine readable and it is supported by a huge number of software programs.

First, we'll load pandas and then use the `read_csv()` function to load the iris data.  Note the conventional use of `pd` as an abbreviation for `pandas`.

In [None]:
import pandas as pd

idata = pd.read_csv('../nb-datasets/iris_dataset.csv')
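
If you don't have the workshop's data file at hand, `read_csv()` can be tried out on CSV text held in memory.  The sketch below uses a few made-up rows (not the real iris measurements) and Python's standard `io.StringIO` class to present that text to pandas as if it were a file.

```python
import io

import pandas as pd

# Made-up CSV text for illustration; the first line holds the column names.
csv_text = """sepal_length,petal_length,species
5.0,1.5,A
6.0,4.5,B
7.0,6.0,C
"""

# StringIO wraps the text in a file-like object that read_csv() can read.
demo = pd.read_csv(io.StringIO(csv_text))
print(demo)
print('Number of rows:', len(demo))
```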

To inspect the contents of a DataFrame, we can use the `head()` or `tail()` methods.  By default, `head()` returns the first 5 rows of a DataFrame and `tail()` returns the last 5.

In [None]:
idata.head()

In later lessons, we'll spend more time talking about the contents of the iris dataset.  For now, simply note that the DataFrame comprises 5 columns, where each column contains values for a single variable.  Each row contains a complete set of observations of all variables for one individual iris flower.

Just as with Python lists and NumPy arrays, we can use the `len()` function to find out how many rows (i.e., individual flowers) are in our dataset.

In [None]:
len(idata)

DataFrames include a method called `describe()` that gives you a basic statistical overview of the numeric columns of a DataFrame.

In [None]:
idata.describe()

We can access individual columns of a DataFrame using a special form of subscript notation that uses the column name.

In [None]:
idata['petal_length']

Under the hood, pandas DataFrames are built on top of NumPy arrays.  Each column of a DataFrame is a pandas `Series`, which typically stores its values in a NumPy array and supports indexing and many of the same methods.  Therefore, you can work with a column of a pandas DataFrame much as you would work with a NumPy array.

In [None]:
print('First element:', idata['petal_length'][0])

p_lens = idata['petal_length']
print('First element:', p_lens[0])


print('Mean petal length:', p_lens.mean())
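
A column selected from a DataFrame can be converted to a plain NumPy array with the `to_numpy()` method.  The sketch below builds a tiny DataFrame from made-up values (a stand-in for the iris data) so that it can be run without the data file.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the iris data.
demo = pd.DataFrame({'petal_length': [1.5, 4.5, 6.0]})

# Selecting a single column gives a pandas Series.
column = demo['petal_length']
print(type(column))

# to_numpy() returns the column's values as a NumPy array.
values = column.to_numpy()
print(type(values))
print(values.mean(), column.mean())
```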

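Basic statistical summary methods are defined for DataFrames, too, and they return the summary statistic for each column.  If a DataFrame contains non-numeric columns (the iris data typically include a text column for the species), pass `numeric_only=True` so that only the numeric columns are summarized; recent versions of pandas raise an error otherwise.

In [None]:
idata.mean(numeric_only=True)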

## 9. Conclusion

That's it for our quick introduction to Python, NumPy, and Pandas.  Although we've covered only a tiny fraction of what these tools can do, you should at least be able to begin using them for basic data manipulation tasks.  Upcoming lessons will introduce more features of these tools, as well as give you plenty of opportunity to practice using them.