Python
======

When I discovered Python around the year 2000, it was a revelation.

Before that I had programmed almost entirely in C or Basic. Python
was a huge improvement in usability over those languages. (At the time I didn't know about Scheme or Lisp or Smalltalk or JavaScript
or R, all languages I strongly prefer today.)

In the last few years Python has experienced rapid growth as
a data science language, with packages like numpy, scipy, sklearn, etc.
providing a large library base for data science. If you had to compare R and Python in terms of libraries, you'd say R
is more the statistician's language and Python is more the data scientist's.

I personally think from a design point of view R is better in many ways. But 
some machine learning implementations are better supported in Python. You just
have to put up with it.



Jupyter
=======

Jupyter (specifically JupyterLab) is roughly the equivalent of RStudio for Python. It places a much higher emphasis on notebooks.

Docker
======

You can run a Jupyter lab session from inside an extended rocker/verse image by adding these lines to the Dockerfile:

```
RUN apt update -y && apt install -y python3-pip
RUN pip3 install jupyter jupyterlab
```

A similar command line to the one we've been using to start RStudio can start Jupyter:

```
docker run -p 8765:8765 -v \
 `pwd`:/home/rstudio \
  -e PASSWORD=some_password \
  -it l14 sudo -H -u rstudio /bin/bash \
  -c "cd ~/; jupyter lab --ip 0.0.0.0 --port 8765"
```

Note that you want to replace the backquoted `pwd` with something like "$(pwd)" (double quotes included) if you have spaces in your path on a Mac. You might need to literally type your folder location if you are running on Windows.

Python Basics
=============

Python supports the usual programming language things:

In [1]:
3*12 + 14

50

Python is not, at base, an array language.

In [2]:
3 + [1,2,3]

TypeError: unsupported operand type(s) for +: 'int' and 'list'

In fact, the built-in list type is heterogeneous:

In [4]:
["x",1,"y",[]]
("x","y")

('x', 'y')

Indexing Lists
==============

The objects denoted by `[a,b,c,...]` are lists. Lists are indexed similarly to R but starting from 0.


In [5]:
a = ["Hello", "Cruel", "World"];
a[0]

'Hello'

Note that we can slice lists sort of like we do in R, but since Python has the notion of truly atomic values like numbers and strings, there is a difference between a slice and an index. In R a single bracket index with one value conceptually returns a vector of one value. But not in Python. A slice returns a list; an index returns a value. A slice of a list is also a copy, so modifying it leaves the original untouched:

In [6]:
x = [1,2,3,4,5];
y = x[0:3]
print(x);
print(y)
y[1] = 1000
print(y)
print(x)

[1, 2, 3, 4, 5]
[1, 2, 3]
[1, 1000, 3]
[1, 2, 3, 4, 5]
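
To make the index/slice distinction concrete, here is a small sketch (the variable names are only illustrative). An index hands back the element itself, while a slice always hands back a new list, even when it holds one element or none.

```python
x = [1, 2, 3, 4, 5]

first = x[0]        # an index returns the element itself
head = x[0:1]       # a slice returns a new list, even with one element
empty = x[10:20]    # out-of-range slices are allowed and give an empty list

print(type(first).__name__, first)   # int 1
print(type(head).__name__, head)     # list [1]
print(empty)                         # []

# A list slice is a (shallow) copy: changing it leaves the original alone.
head[0] = 1000
print(x)                             # [1, 2, 3, 4, 5]
```

Note that an out-of-range index such as `x[10]` raises an `IndexError`, while the out-of-range slice above quietly returns an empty list.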


These features make the built-in lists less than efficient for doing numerical computations. We'll have to use a library to get R-like array behavior in Python.

Object Orientation
==================

Python is object oriented in a much more traditional sense than R. Everything in Python is an object. Unlike in R, the primary way to experience their objectness is by calling methods:

In [5]:
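l = [1,2,3]
l.append("Some value")
# In R-flavored notation this would look something like l$append(...)
# or append(l, "Some value"); in Python the method is reached with "."
l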



[1, 2, 3, 'Some value']

Note a few things here.

1. `=` is the assignment operator. The only one. Unlike R.
2. We use `.` to mean "access a method or property of the object before the `.`". In R "." is just another character that might appear in a variable name and has no special properties. R's `$` is the closest thing to `.` but `.` does more. `$` is not an allowed character in Python variable names and has no meaning in the language.
3. `l.append` is the name of the "append" method on the "l" object (a list). Note that calling append "mutates" the list bound to "l". This is atypical for R where we typically create new values rather than mutate old ones.
4. Note again that we can put different types of things in our list.
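
Point 3 is the one that tends to bite people coming from R, so here is a minimal sketch of what mutation means in practice. The names `a`, `b` and `c` are just for illustration.

```python
a = [1, 2, 3]
b = a                 # b is a second name for the same list object
b.append(4)           # mutates the one shared list
print(a)              # [1, 2, 3, 4]
print(a is b)         # True

c = a + [5]           # `+` builds a new list instead of mutating
print(c)              # [1, 2, 3, 4, 5]
print(a)              # [1, 2, 3, 4]

# append returns None, so do not write `a = a.append(...)`.
result = a.append(6)
print(result)         # None
```

Assignment never copies a list. If you want an independent copy you have to ask for one, for instance with `a.copy()` or `a[:]`.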

Numbers are, of course, immutable. 

Everything really is an object in a sense. You can call methods on numbers:

In [7]:
(10).to_bytes(8,"little")
help((10).to_bytes)

Help on built-in function to_bytes:

to_bytes(length, byteorder, *, signed=False) method of builtins.int instance
    Return an array of bytes representing an integer.
    
    length
      Length of bytes object to use.  An OverflowError is raised if the
      integer is not representable with the given number of bytes.
    byteorder
      The byte order used to represent the integer.  If byteorder is 'big',
      the most significant byte is at the beginning of the byte array.  If
      byteorder is 'little', the most significant byte is at the end of the
      byte array.  To request the native byte order of the host system, use
      `sys.byteorder' as the byte order value.
    signed
      Determines whether two's complement is used to represent the integer.
      If signed is False and a negative integer is given, an OverflowError
      is raised.



Variables, Bindings, Environments, Functions
============================================

Python is somewhat simple compared to R here. `=` introduces a variable binding in the local scope exclusively. At the top level `=` introduces a global variable.
Here we create a binding of "x" to 10 at the top level. The "=" sign inside the body of the function f creates a separate local binding, and within that body "x" refers to the local binding.

In [8]:
x = 10
def f():
    x=11
    return x
[f,f(),x]

[<function __main__.f()>, 11, 10]

Things to note:
    
1. Python is whitespace sensitive. The body of functions must be indented compared to the enclosing context. That ":" at the end of the `def` line is also required.
2. Unlike in R we _must_ explicitly return a value from functions using "return". "return" terminates the function immediately if it is placed in some non-tail position. A function that finishes without reaching a "return" returns `None`.
3. These are some of the worst features of python that tell you it was designed by a rube.
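
One consequence of the "`=` always binds locally" rule is worth seeing once. Python decides whether a name is local by scanning the whole function body, so an assignment anywhere in the function makes the name local everywhere in it. A small sketch (function names are made up for the example):

```python
x = 10

def read_only():
    return x          # no assignment here, so x is looked up globally

def assigns_later():
    try:
        value = x     # x is local in this function, and not yet bound
    except UnboundLocalError as err:
        value = "error: " + type(err).__name__
    x = 11            # this assignment is what makes x local in the whole body
    return value

print(read_only())       # 10
print(assigns_later())   # error: UnboundLocalError
print(x)                 # 10
```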

Mutating an Enclosing Variable
==============================

If you want to change a global variable binding (as you would do with "<<-" in R) you have to make this desire known by declaring the variable global in your function.

In [20]:
y = 10;
def set_y(v):
    global y
    y = v
    return y
[set_y(100),y]

[100, 100]

Things become increasingly absurd as you nest scopes, where a different keyword, `nonlocal`, is required:

In [8]:
def make_counter(start_from):
    state = start_from;
    def counter():
        nonlocal state
        cv = state;
        state = state + 1;
        return cv;
    return counter;

c0 = make_counter(0);
c10 = make_counter(10);

print([c0(), c0(), c0()])
print([c10(), c10(), c10()])

[0, 1, 2]
[10, 11, 12]


A global variable cannot be declared "nonlocal" even though the relationship which obtains between a function scope and a global scope is the same as one obtained between two function scopes. This doesn't really matter that much but it chaps my britches.
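
You can check this for yourself without crashing your session. The sketch below compiles a tiny function from a string, so the failure shows up as a catchable exception. The function names are hypothetical and I am not relying on the exact wording of the error message.

```python
source = """
def set_x():
    nonlocal x      # x only exists at the top level, not in an enclosing function
    x = 1
"""

try:
    compile(source, "<example>", "exec")
    outcome = "compiled"
except SyntaxError as err:
    outcome = "SyntaxError: " + str(err.msg)

print(outcome)

# The global keyword is the one that works for top-level names.
counter = 0

def bump():
    global counter
    counter = counter + 1
    return counter

print([bump(), bump()])   # [1, 2]
```

Note that the failure is a `SyntaxError` raised when the function is compiled, not an error at call time.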

Conditionals and Loops
======================

If
--

In [22]:
x = 1
y = 2
if x < y :
    print("Hiho")
    print("Hiho")
else:
    print("Silver")

Hiho
Hiho


Note the ":" and indentation. Also note that you do not need an enclosing () for the conditions.
If statements can have many legs:

In [None]:
if x < y:
    print("smaller")
elif x == y:
    print("equal")
else:
    print("larger")

Finally note that if statements don't produce any values. They only perform side effects. The following function evaluates a string and throws it away, so every call returns `None`.

In [23]:
def if_example(x,y):
    if x < y:
        "smaller"
    elif x == y:
        "equal"
    else:
        "larger"

It should look like this:

In [25]:
def if_example(x,y):
    if x < y:
        return "smaller"
    elif x == y:
        return "equal"
    else:
        return "larger"
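
Python does have a separate expression form, `a if condition else b`, which produces a value the way R's `if` does. A short sketch comparing it with the forgotten-return version (function names are mine):

```python
def compare(x, y):
    # A conditional expression produces a value, so it can be returned directly.
    return "smaller" if x < y else ("equal" if x == y else "larger")

def forgot_return(x, y):
    if x < y:
        "smaller"      # evaluated and thrown away

print(compare(1, 2))         # smaller
print(compare(2, 2))         # equal
print(compare(3, 2))         # larger
print(forgot_return(1, 2))   # None
```

Nested conditional expressions get hard to read quickly, so for more than two branches the statement form with explicit returns is usually the better choice.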

Loops and Comprehensions
------------------------

Loops come in a few flavors.

In [26]:
for x in [1,2,3]:
    print(x)

1
2
3


In [27]:
for x in range(10):
    print(x)

0
1
2
3
4
5
6
7
8
9


This is a good time to remark upon the fact that Python is zero-indexed. Thus `range(n)` produces the valid indexes, 0 through n-1, for an array of length n. (In Python 3 the result is a lazy `range` object rather than a list.)
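
A few variations on `range`, as a quick sketch. The list `names` is made up for the example.

```python
r = range(5)
print(r)                      # range(0, 5)
print(list(r))                # [0, 1, 2, 3, 4]
print(list(range(2, 10, 3)))  # [2, 5, 8]   (start, stop, step)

names = ["a", "b", "c"]
for i in range(len(names)):   # indexes 0, 1, 2
    print(i, names[i])

# enumerate is the usual way to get index and value together
for i, name in enumerate(names):
    print(i, name)
```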

While loops are predictable at this point:

In [28]:
x = 0;
while x < 10:
    print(x)
    x = x + 1;

0
1
2
3
4
5
6
7
8
9


We can see from this example that for and while loops do not create their own scopes in their body. If they did we'd need a "global x" above.
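
The same thing holds inside a function, and it has a side effect you might not expect: the loop variable is still bound after the loop finishes. A minimal sketch (the function name is hypothetical):

```python
def total_and_last(values):
    total = 0
    for v in values:
        total = total + v     # rebinds the function-level `total`, no declaration needed
    # `v` is still bound here, provided the loop ran at least once
    return total, v

print(total_and_last([1, 2, 3]))   # (6, 3)
```

If `values` were empty, `v` would never be bound and the return line would raise an error, so do not lean on this in real code.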

Comprehensions
--------------

Comprehensions are a nice feature if you don't know about functional programming. They let you construct new lists from old lists and often this is what you want when you think you want a loop:

In [9]:
x = [1,2,3]
x_plus_one = [e + 1 for e in x]
x_plus_one

[2, 3, 4]

Comprehensions can get somewhat complex: 

In [32]:
def odd(n):
    return (n % 2) == 1

[e + 1 for e in range(10) if odd(e)]

[2, 4, 6, 8, 10]
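
The same syntax also builds dictionaries and sets, and a comprehension can contain more than one `for`. A short sketch with made-up data:

```python
words = ["apple", "fig", "banana", "kiwi"]

lengths = {w: len(w) for w in words}                   # dict comprehension
first_letters = {w[0] for w in words}                  # set comprehension
pairs = [(i, j) for i in range(3) for j in range(i)]   # nested loops, read left to right

print(lengths)                 # {'apple': 5, 'fig': 3, 'banana': 6, 'kiwi': 4}
print(sorted(first_letters))   # ['a', 'b', 'f', 'k']
print(pairs)                   # [(1, 0), (2, 0), (2, 1)]
```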

Anonymous Functions
===================

Another way that Python is broken is that anonymous functions are pretty limited. Note that we always have to give a name during a `def` in Python. In R, the `function` form returns a function which we bind to a name via `<-`. We don't have to give it a name and often we don't. 

In Python there is no full equivalent. There are `lambda` expressions, however.

In [34]:
def map(l, f):
    return [f(x) for x in l]

def square(x):
    return x*x;

times_two = lambda x: x*2;

map(range(10),times_two)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Lambda expressions are limited to a single expression in their body. You cannot create new variable bindings inside of them.  This is a big limitation on their expressiveness.

All is not lost, however. Functions in Python are first-class objects, so you can say:

In [35]:
def times_3(x):
    return x * 3
map(range(10), times_3)

[0, 3, 6, 9, 12, 15, 18, 21, 24, 27]

Since we can nest function definitions this gives us most of what we want. Note that `lambda` expressions are the only sorts of functions where we don't need to say "return" to return a value.
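
Here is a sketch of what that looks like: a nested `def` doing a job that needs local bindings and several statements, returned as an ordinary value. The names `make_scaler` and `scale` are invented for the example.

```python
def make_scaler(values):
    lo = min(values)
    hi = max(values)

    def scale(v):
        # Several statements and a local binding: not expressible as a lambda.
        span = hi - lo
        if span == 0:
            return 0.0
        return (v - lo) / span

    return scale          # the inner function is an ordinary value

scale = make_scaler([10, 20, 30])
print([scale(v) for v in [10, 20, 30]])   # [0.0, 0.5, 1.0]

# A lambda is still handy where a single expression is enough.
print(sorted(["bb", "a", "ccc"], key=lambda s: -len(s)))   # ['ccc', 'bb', 'a']
```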

Dictionaries
============

Dictionaries are used extensively in Python to represent ad hoc objects. They map keys (most often strings, though any hashable value works) to values of any type. They can be created with the following syntax:

In [13]:
d = {"x":10, "y":11,"z":"tophat"}
print(d["z"])


tophat


It is worth meditating on the fact that most array types in R support named values and thus behave like dictionaries. In Python these types are disjoint. As we'll see, the conceptual simplicity of R really shines when we get to dataframes.
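
A few everyday dictionary operations, as a sketch. The `grid` dictionary is only there to show a non-string key.

```python
d = {"x": 10, "y": 11, "z": "tophat"}

d["w"] = [1, 2, 3]            # add a key
print(d.get("missing", 0))    # 0, get avoids a KeyError
print("x" in d)               # True

for key, value in d.items():  # iteration follows insertion order
    print(key, value)

# Keys do not have to be strings; any hashable value works.
grid = {(0, 0): "origin", (1, 0): "east"}
print(grid[(1, 0)])           # east
```

Unlike a named R vector, a dictionary has no positional index: `d[0]` looks for the key `0` and fails here with a `KeyError`.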

Classes and Objects
===================

R is strongly object oriented but the typical programmer doesn't deal with classes directly. In Python that is less true, so let's go over what classes and objects are.

1. A class is a description of a set of objects. It says "these are values these objects have inside them and the methods that the objects support"
2. An instance is one realization of the class. We create instances and then access their data and methods.

In [1]:
from random import random
from math import sqrt, sin, cos, pi

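class Person:
    first_name = ""
    last_name = ""
    
    def __init__(self, first, last):
        # Assign through self; bare names would only create local variables.
        self.first_name = first
        self.last_name = last
        
class Employee(Person):
    tax_id = 0
    def __init__(self, first, last, tax_id):
        # Run the parent class initializer before adding our own field.
        super().__init__(first, last)
        self.tax_id = tax_id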

class Point:
    x = 0
    y = 0
    def __init__(self, x, y):
        self.x = x;
        self.y = y;
        
    def length(self):
        r = sqrt(self.x*self.x + self.y*self.y);
        return r;
    
    def randomize_dir(self):
        r = self.length();
        theta = random()*2*pi;
        self.x = r*cos(theta);
        self.y = r*sin(theta);
        return self;
    
    def __str__(self):
        return "<{}, {}>".format(self.x, self.y)
    
    def __repr__(self):
        return "<Point: {}, {}>".format(self.x, self.y)
        
        
p = Point(10,0)
print(p)
p.randomize_dir()
print(p)
print(p.length())

<10, 0>
<-2.2392415057677244, 9.746065743614041>
10.000000000000002


As we will see, the idea of special methods (the ones with `__` bookends) will be sort of important with numpy.
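
To give a feel for how far special methods go, here is a minimal sketch of a container that plugs into `len`, indexing, `for` loops, `==` and printing. The `Bag` class is hypothetical and exists only for this illustration.

```python
class Bag:
    """A tiny container, for illustration only."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):             # used by len(bag)
        return len(self.items)

    def __getitem__(self, index):  # used by bag[index], and by for-loops as a fallback
        return self.items[index]

    def __eq__(self, other):       # used by ==
        return isinstance(other, Bag) and self.items == other.items

    def __repr__(self):            # used by the REPL, and by print when __str__ is absent
        return "Bag({!r})".format(self.items)

b = Bag([3, 1, 2])
print(len(b))               # 3
print(b[0])                 # 3
print([v * 2 for v in b])   # [6, 2, 4]
print(b == Bag([3, 1, 2]))  # True
print(b)                    # Bag([3, 1, 2])
```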

Libraries
=========

In R we use a function to install libraries. Python requires us to use an external package manager and unfortunately there are a few choices. The most common among data scientists is probably one called "Anaconda" but there are many things I don't like about it. In this course we just use the more standard package manager "pip". In our Docker container we have both Python 2 and Python 3 and so we invoke library installation from Bash like this:

```
pip3 install numpy
```

This installs numpy (for instance).

Import
======

Once we have a library we use some variation on import. Much like an R package, a Python library can be thought of as a package of exported symbols. When we say "library(ggplot2)" in R we (effectively) pull in all the symbols from ggplot2 into our environment (what really happens is that the ggplot2 package is placed on our environment stack, but the effect is the same). We can accomplish something similar like this:

```
from numpy.random import *
binomial(100, 0.3, 100)
```

However this is not considered best practice among Python people. What you typically see is that the package is just imported and used with dot notation:

```
import pandas
pandas.DataFrame({"x":[1,2,3],"y":[4,5,6]})
```

Since we don't always want to type out the full library name we sometimes give them a shorter, local, name:

```
import pandas as pd
pd.DataFrame({"x":[1,2,3],"y":[4,5,6]})
```

This last one is the most common pattern.
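
The same import styles can be tried with nothing but the standard library. A small sketch, using `math` purely as a stand-in for a bigger package:

```python
import math                 # whole module, used with dot notation
import math as m            # same module under a shorter local name
from math import sqrt, pi   # selected names pulled into our namespace

print(math.sqrt(16.0))      # 4.0
print(m.sqrt(16.0))         # 4.0
print(sqrt(16.0))           # 4.0

# All three routes reach the very same function object.
print(math.sqrt is m.sqrt is sqrt)   # True
print(round(pi, 2))                  # 3.14
```

Importing selected names explicitly is a reasonable middle ground: you avoid the dot notation without the namespace pollution of `import *`.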

Scientific Python
=================

Python, at its base, is a general purpose programming language originally designed for software engineers as a "scripting" language. This notion is less meaningful today than it was back in the dawn of time, but the idea was to let you write quick scripts to do common tasks without involving a compiler or complicated type systems.

Its simple syntax and "batteries included" approach (roughly: most of the original tasks people used Python for were "built in" to the language) made Python very popular. And with popularity came the desire to use it in other contexts.

The basis for all scientific computing in Python is the numpy library. Using numpy we can do many of the array-oriented programming tricks we are used to from R.

In [4]:
import numpy as np

np.array([1,2,3]) + 10

array([11, 12, 13])

Note that we have to explicitly lift up at least one of the operands to a numpy array for `+` to get the idea. Either operand will do:

In [5]:
[1,2,3] + np.array(10)

array([11, 12, 13])

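But adding a plain list and a plain number, as in the next cell, is still an error. It is worth meditating on what is happening here. `+` has a set of default behaviors. But if either operand has a method called `__add__` (or `__radd__`, for the operand on the right) then that method is invoked instead.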


In [3]:
[1,2,3] + 10

TypeError: can only concatenate list (not "int") to list
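
The cells above can be reproduced in miniature without numpy. In the sketch below, `Meters` is a hypothetical class of my own. When the left operand does not know what to do, Python gives the right operand a chance through `__radd__`, which is roughly how a plain list or number on the left can still combine with a numpy array on the right.

```python
class Meters:
    """Hypothetical wrapper type used only to illustrate operator dispatch."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):            # handles Meters + something
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented            # lets Python try the other operand

    def __radd__(self, other):           # handles something + Meters
        return self.__add__(other)

    def __repr__(self):
        return "Meters({})".format(self.value)

print(Meters(1) + Meters(2))   # Meters(3)
print(Meters(1) + 10)          # Meters(11)
print(10 + Meters(1))          # Meters(11), via __radd__

try:
    Meters(1) + "a"            # neither side can handle this pairing
    failed = False
except TypeError:
    failed = True
print(failed)                  # True
```

Returning `NotImplemented` (rather than raising) is what keeps the dispatch chain going. When both sides decline, Python raises the `TypeError` itself.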

We can create our own overloaded `+` behavior by adding a `__add__` method to our point class.

In [6]:
from random import random
from math import sqrt, sin, cos, pi
class Point:
    x = 0
    y = 0
    def __init__(self, x, y):
        self.x = x;
        self.y = y;
        
    def length(self):
        r = sqrt(self.x*self.x + self.y*self.y);
        return r;
    
    def randomize_dir(self):
        r = self.length();
        theta = random()*2*pi;
        self.x = r*cos(theta);
        self.y = r*sin(theta);
        return self;
    
    def __str__(self):
        return "<{}, {}>".format(self.x, self.y)
    
    def __repr__(self):
        return "<Point: {}, {}>".format(self.x, self.y)
    
    def __add__(self, other):
        return Point(self.x + other.x, self.y + other.y)

Point(1,0) + Point(0, 1)

<Point: 1, 1>

Of course, our Point class is made a little superfluous by the existence of the numpy library, which allows us to represent multidimensional vectors fairly straightforwardly.

In [7]:
np.linalg.norm(np.array([10,10]))

14.142135623730951

Numpy examples:

In [8]:
v1 = np.array([1,2,3])
v2 = np.array([4,5,6])
v3 = np.array([7,8,9])

c1 = 10;

print(v1 + v2)
print(v1*v2)
print(c1*v3)
print((c1*v3).sum())
print(v3.max())

s1 = v3[0:2]
s2 = v3[1:3]
print(s1)
s1[0] = 1000
print(v3)
s2[0] = 1001
print(s1)
print(v3)

[5 7 9]
[ 4 10 18]
[70 80 90]
240
9
[7 8]
[1000    8    9]
[1000 1001]
[1000 1001    9]


Note that when we slice an array in numpy we get a reference to the internals of the array. If we modify the slice we modify the original array and all the other slices which might refer to it.

This is actually really bad (though sometimes useful). It shouldn't be the default behavior because it lets our variables "leak". Almost always you want to slice and then copy.

In [46]:
subset = v2[0:2].copy()
print(subset);
subset[0] = 100;
print(subset)
print(v2)

[4 5]
[100   5]
[4 5 6]
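
If you are ever unsure whether two arrays share memory, numpy can tell you. A minimal sketch, assuming numpy is installed as above; `np.shares_memory` and the `.base` attribute are part of numpy, while the variable names are mine.

```python
import numpy as np

v = np.array([4, 5, 6])

view = v[0:2]           # basic slicing gives a view onto v's memory
copied = v[0:2].copy()  # an explicit copy owns its own memory
picked = v[[0, 1]]      # indexing with a list of positions also copies

print(np.shares_memory(v, view))     # True
print(np.shares_memory(v, copied))   # False
print(np.shares_memory(v, picked))   # False
print(view.base is v)                # True

view[0] = 100           # writes through to v
print(v)                # [100   5   6]
print(copied, picked)   # [4 5] [4 5]
```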


Concluding Notes
================

There are many reasons to know Python. 

1. It's a general purpose programming language which is much closer in spirit to most currently popular languages. Thus, if you have any interest in writing software in general or just to support your data science lifestyle, Python is a powerful entrypoint into the broader world of software engineering.
2. It at least gives the appearance of simplicity. 
3. It is extremely popular and thus forms a sort of lingua franca. Even libraries not written in Python often have Python bindings and, for instance, the major cloud infrastructure providers provide Python bindings to their APIs.
4. Python has some of the best or at least most diverse data science tools, as we will cover in the next few classes.