# Python bootcamp

This bootcamp teaches python through a just-in-time programming approach. That is, instead of walking you through fundamentals step by step, we will jump right into analyzing code and teach you the tools necessary to understand and modify that code as we go. You can find a more systematic review of python fundamentals in the subfolders of this repository, organized as follows:
- ``0_getting_started`` contains information on what python is and its current popularity and use.
- ``1_fundamentals`` gives an intro to datatypes, the building blocks of any programming language.
- ``2_functions`` introduces you to functions and scope in python.
- ``3_classes`` gives a basic introduction to python classes and a more advanced example of an OLS class.
- ``5_list_comprehensions`` introduces list comprehensions, an efficient method of creating iterable objects.

This notebook starts with an ambitious goal: the creation of a python ``class`` object (don't worry, we'll tell you what this is) that runs OLS. To do this, we will introduce concepts like importing packages, defining functions, and manipulating numpy matrices, then organize all of these into a class object. Specifically, our goals are going to be:
- Make a linear projection function with b0, b1, x as input, y as output
- Write a data generating function
- Give a brief explanation of the scipy.optimize.minimize function
- Minimize the squared errors to estimate b0 and b1
- Create a class that implements the same minimization, takes data at instantiation, and has an ``estimate`` method.

## Package imports

We will be using object types and methods from a couple of different packages in python. These packages must first be installed in the environment you are working in. For the datahub environment we are using for this bootcamp, the necessary packages are already installed. If you want to work on your own computer, you will need to install these using either ``pip``, python's native installer, or preferably the package manager Anaconda. Lucy will go over installation using Anaconda on Friday. For now, note that python packages are imported using the ``import`` command, as below.

In [1]:
# for later
import numpy as np
from scipy.stats import distributions as iid
from scipy.stats import rv_continuous

## Functions: Building blocks of an OLS class
We are going to start by writing functions for the main actions of an OLS class and for generating the simulated data we will use to test our OLS functions. An OLS estimator must do two things: define a linear model and (to start) minimize the sum of squared errors. We can summarize these activities as:

1. Linear projection: Predict "y" given a set of betas and data X
    - Inputs: b0, b1, x
    - Outputs: y hat
2. Define the data
    - Inputs: N, true betas
    - Outputs: y, X matrices
3. Minimizer function: minimizes the squared distance between the linear projection and y
    - Inputs: a function to minimize (SSE)
    - Outputs: betas that minimize that function
    
### 1. Linear projection
Suppose we have a matrix X that is N by 2, where the first column is a column of ones, and a vector of betas: b = [b0, b1]. The linear projection, or the vector of predicted values of y, is given by $Xb$.
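
Written out for the N by 2 case, the product looks as follows. This is only a worked expansion; the notation $x_i$ for the $i$-th entry of the second column of X is an assumption made here for illustration.

$$
Xb = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} b_0 + b_1 x_1 \\ \vdots \\ b_0 + b_1 x_N \end{bmatrix}
$$

Each row of the result is the predicted value for one observation.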

Note that the text in red (the docstring, the triple-quoted text at the top of the function body) documents the function, telling future users (and your future self!) what the function takes in and what it returns. This is optional, but good practice. Tip: being detailed about what data types are acceptable will help you even more!
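
As a small sketch of why this documentation is useful, the example below shows two ways of reading it back. The function `add_one` is hypothetical and exists only for illustration.

```python
def add_one(x):
    '''
    Inputs:
        - x: an int or float
    Returns:
        - x + 1
    '''
    return x + 1

# The documentation text is stored on the function object itself
doc = add_one.__doc__
print(doc)

# help() prints the function signature together with the same text
help(add_one)
```

In a notebook, typing `add_one?` in a cell is another common way to see the same information.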

### Quick aside: Matrix algebra and some helpful numpy functions.
``numpy`` is a python package that contains number generators, its own matrix objects and methods, and more! Plus it's blazing fast. The building blocks of numpy are called numpy arrays, and they can have as many dimensions as you want. The next block of code shows you how to create some numpy arrays.

In [2]:
m1 = np.array([1, 2, 3]) # shape (3,)
m2 = np.array([[1, 2, 3]]) # shape (1,3)
m3 = np.array([[1, 2, 3], # shape (2,3)
               [4, 5, 6]])
m4 = np.array([[2], [2], [2]]) # shape (3,1)
# We can inspect an array's shape
print("My matrix shapes are\n m1: %s\n m2: %s\n m3: %s\n m4: %s" % (m1.shape, m2.shape, m3.shape, m4.shape))

My matrix shapes are
 m1: (3,)
 m2: (1, 3)
 m3: (2, 3)
 m4: (3, 1)


In [3]:
# numpy arrays can be added with +, matrix multiplied with @, or element-wise multiplied with *:
matmult = m3@m4
elementmult = m2*m1
print("Matrix multiplying m3, m4 gives:\n %s" % matmult)
print("Element-wise multiplying m2, m1 gives:\n %s" % elementmult)

Matrix multiplying m3, m4 gives:
 [[12]
 [30]]
Element-wise multiplying m2, m1 gives:
 [[1 4 9]]


In [4]:
# they can also be (element-wise) raised to powers, etc:
print(m1**2)
print(m1*4)

[1 4 9]
[ 4  8 12]
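
One more tool worth seeing before moving on is ``reshape``, which the functions below use to make sure a vector of betas is a column. The following is a minimal sketch; the variable names are chosen here for illustration only.

```python
import numpy as np

v = np.array([1, 2, 3])        # 1-D array, shape (3,)
col = v.reshape((len(v), 1))   # column vector, shape (3, 1)
row = v.reshape((1, -1))       # row vector, shape (1, 3); -1 lets numpy infer the size

print(v.shape, col.shape, row.shape)

# Shapes matter for arithmetic: numpy "broadcasts" mismatched shapes,
# so adding a (3, 1) array to a (3,) array gives a (3, 3) array
outer_sum = col + v
print(outer_sum.shape)
```

Forcing arrays into an explicit column shape before doing matrix algebra helps avoid this kind of silent broadcasting.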


### Back to the linear projection function...

In [5]:
def linear_projection(X, b):
    '''
    Inputs:
        - X: numpy array of dimensions NxK. The first column is assumed to be a column of ones.
        - b: numpy array of dimensions Kx1.
    Returns:
        - Xb, an Nx1 numpy array
    '''
    # make sure b is the right shape
    b = b.reshape((len(b),1))
    return X@b

In [6]:
# test
X = np.c_[np.ones((5,1)), np.random.rand(5,1)]
b = np.random.rand(2,1)
linear_projection(X,b)

array([[0.91823303],
       [1.00742063],
       [1.1260404 ],
       [1.12395384],
       [1.2477445 ]])

### 2. Data generating process
We will create data that has N observations and a "true" but noisy relationship between $x$ and $y$. This type of data is used in Monte Carlo simulations to test theory.

In [7]:
def dataGenerator(beta, N):
    '''
    Inputs:
        - beta: A Kx1 numpy array. True beta from which to generate data.
        - N: Number of observations the data should have
    Returns:
        - X: A numpy array of random data with shape NxK
        - y: A numpy array generated by Xbeta + e with shape Nx1
    '''
    # create an X vector
    # note I instantiate iid.norm() and call method rvs() in the same step!
    x = iid.norm().rvs((N,1))

    # create a random error
    e = iid.norm().rvs((N,1))
    # add an intercept by horizontally stacking x with an array of ones,
    X = np.c_[np.ones((N,1)), x]
    # make sure beta is the right shape: Kx1
    beta = beta.reshape((beta.shape[0],1))
    # create y
    y = linear_projection(X, beta) + e
    
    return X, y

In [8]:
# test
beta_true = np.array([1,1])
dataGenerator(beta_true, 5)

(array([[ 1.        ,  0.42549221],
        [ 1.        ,  0.36560729],
        [ 1.        , -0.92767122],
        [ 1.        , -0.56178431],
        [ 1.        ,  0.93022811]]),
 array([[ 2.14178002],
        [ 2.41450946],
        [-0.57991515],
        [-0.5496822 ],
        [ 3.76518356]]))

In [9]:
# create data
beta_true = np.array([2,8])
N = 100

X, y = dataGenerator(beta_true, N)

# test shapes!
print("X is %s x %s" % (X.shape[0], X.shape[1]))
print("y is %s x %s" % (y.shape[0], y.shape[1]))

X is 100 x 2
y is 100 x 1
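
Because the data are random, the numbers above change on every run. If you want reproducible draws, one option is the ``random_state`` argument of the ``rvs`` method. The sketch below uses a hypothetical helper, `seeded_draws`, that is not part of the notebook's `dataGenerator`.

```python
import numpy as np
from scipy.stats import distributions as iid

def seeded_draws(N, seed):
    # hypothetical helper: N standard normal draws with a fixed seed
    return iid.norm().rvs(size=(N, 1), random_state=seed)

a = seeded_draws(5, seed=0)
b = seeded_draws(5, seed=0)
c = seeded_draws(5, seed=1)

print(np.array_equal(a, b))  # same seed, same draws
print(np.array_equal(a, c))  # different seed, different draws
```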


### 3. Minimizer function
In order to minimize, we are going to use a minimizer function from ``scipy.optimize``. The documentation for this function can be found [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html), but the key arguments are the following:
* ``fun``: the function to be minimized. This must be a function of only one input; if there are multiple inputs, we will "mask" these using lambda functions. Note that functions can be passed to other functions! Functions are objects just like other python data types, so we can pass them around using their name.
* ``x0``: The starting guess for the solution. When the objective has a single global minimum and no other local minima (as the least squares problem does), the choice will only affect computation time.

Thus the final syntax is `minimize(fun = function, x0 = [start guess])`, where `function` takes a single input `x`.

The function returns an instance of the ``OptimizeResult`` class, which has several attributes. The only one we will be interested in for now is ``x``, the solution that solves the minimization.
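
Before applying this to OLS, here is a minimal sketch of the pattern on a toy problem. The objective `shifted_square` and its `shift` input are hypothetical names used only for this illustration.

```python
from scipy.optimize import minimize

def shifted_square(x, shift):
    # hypothetical objective with two inputs; minimized at x = shift
    return (x[0] - shift) ** 2

# The lambda fixes shift = 3, leaving a function of x only
result = minimize(lambda x: shifted_square(x, 3), x0=[0])

print(result.x)        # the minimizer, close to 3
print(result.fun)      # objective value at the minimizer
print(result.success)  # whether the optimizer reports convergence
```

Note that ``minimize`` passes ``x`` as a numpy array even when there is only one parameter, which is why the objective indexes ``x[0]``.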

Let's set up a function that returns the quantity we want to minimize for OLS: the sum of squared errors. Note that the SSE is given by:

$$SSE = \sum_i (\widehat{y}_i - y_i)^2$$

Implementing this in code:

In [10]:
def sse(y, X, b):
    '''
    Inputs:
        - y: Numpy array Nx1
        - X: Numpy array NxK
        - b: Numpy array Kx1
    '''
    yhat = linear_projection(X, b)
    sse = np.sum((yhat - y)**2)
    return sse

In [11]:
sse(y,X,beta_true)

111.15309346674218

Now we can minimize this function, making it a function of just one variable by masking the other inputs in a lambda function:

In [12]:
from scipy.optimize import minimize

In [13]:
# remember the syntax: 
# minimize(fun = function, x0 = [start guess])
# the lambda function makes sse a function of x only; the other inputs
# come from the variables X and y we already defined.
minimize(lambda x: sse(y, X, x), x0 = [0,0])
# as expected, we get an intercept of around 2 and a slope around 8

      fun: 110.93531136213308
 hess_inv: array([[ 0.00504965, -0.00044726],
       [-0.00044726,  0.00402937]])
      jac: array([ 9.53674316e-07, -9.53674316e-07])
  message: 'Optimization terminated successfully.'
     nfev: 27
      nit: 6
     njev: 9
   status: 0
  success: True
        x: array([2.00911686, 8.04008407])

In [14]:
# let's do it again with a higher N, letting the LLN work for us!
X, y = dataGenerator(beta_true, 10000)
minimize(lambda x: sse(y, X, x), x0 = [0,0])
# now it's even more accurate!

      fun: 10059.21721418913
 hess_inv: array([[5.12932394e-05, 5.13365240e-05],
       [5.13365240e-05, 5.47497049e-05]])
      jac: array([0.00012207, 0.        ])
  message: 'Desired error not necessarily achieved due to precision loss.'
     nfev: 42
      nit: 7
     njev: 14
   status: 2
  success: False
        x: array([1.99789901, 8.00919576])
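
Notice that the second run reports ``success: False`` with a precision-loss message, even though the estimates are very close to the true values. When no gradient is supplied, ``minimize`` approximates it numerically. One thing that may help is supplying the analytic gradient of the SSE, $2X'(Xb - y)$, through the ``jac`` argument. The sketch below is self-contained and generates its own data; the function names and the seed are assumptions made for illustration, and it is not guaranteed to change the reported status in every case.

```python
import numpy as np
from scipy.optimize import minimize

# simulated data with true betas [2, 8] (seed chosen arbitrarily)
rng = np.random.default_rng(0)
N = 1000
X = np.c_[np.ones((N, 1)), rng.normal(size=(N, 1))]
y = X @ np.array([[2.0], [8.0]]) + rng.normal(size=(N, 1))

def sse(b):
    resid = X @ b.reshape((-1, 1)) - y
    return np.sum(resid ** 2)

def sse_grad(b):
    # gradient of the SSE with respect to b: 2 X'(Xb - y), flattened to shape (K,)
    resid = X @ b.reshape((-1, 1)) - y
    return (2 * X.T @ resid).ravel()

sol = minimize(sse, x0=np.zeros(2), jac=sse_grad)
print(sol.x)
print(sol.message)
```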

## Creating an OLS class
Now let's take our functions and organize them into an OLS class. What is a python class? 

Python classes are objects that have attributes and methods. A class is abstract in the sense that it serves as a template used to "instantiate" specific "instances" of the class. For example, here we will create a class called OLS, and will create several instances of it, each a specific OLS model. That is, the OLS class might have attributes like X and y data, but an OLS class _instance_ will have _specific_ data for X and y.

To define a class, we use ``class ClassName:``, followed by indented lines (convention is to name classes with upper case).

Classes have 3 components:
* **The constructor** This is a method (think: function) that sets up a new _instance_ of the class. Nearly every class you write will define one (if you leave it out, python falls back on a default that takes no extra arguments). It looks like ``def __init__(self):``
* **Attributes** These are attributes that all instances of a class have. They can be anything, or not exist at all. For example, consider a Student class that stores information for a student database. In that case, we might want students to have attributes like a student ID, gender, age, etc. These can vary by instance of the Student class, but all instances of ``Student`` have one. If we imagine an alternative class called `Alien()`, attributes might include home planet and number of limbs.
* **Methods** Methods are functions that belong to a class. They may return output or not. For example, we may want to write a method that calculates a student's GPA given a series of numeric grades. In the case of OLS, our main method will be to estimate beta.
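
To make the three components concrete, here is one way the Student example could look. This is only an illustrative sketch: the attribute names and the simple-average GPA rule are assumptions, not part of the bootcamp code.

```python
class Student:
    # constructor: runs when an instance is created
    def __init__(self, student_id, age, grades):
        # attributes: every Student has these, with instance-specific values
        self.student_id = student_id
        self.age = age
        self.grades = grades

    # method: a function that belongs to the class
    def gpa(self):
        # simple average of the numeric grades
        return sum(self.grades) / len(self.grades)

alice = Student(student_id=1, age=20, grades=[4.0, 3.0, 3.5])
print(alice.age)    # attribute access
print(alice.gpa())  # method call
```

We will build up the OLS class in the same way, one component at a time.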

The simplest class is made up of only a constructor, and the simplest constructor doesn't do anything except create an instance of the class. This looks like:

In [15]:
class OLS:
    # constructor
    def __init__(self):
        pass

Now to create an _instance_ of the class, we do the following:

In [16]:
myOLSmodel = OLS()
myOLSmodel

<__main__.OLS at 0x108b7e220>

However, this OLS model is not very interesting. For example, we might think that each OLS model might have data associated with it. Therefore, we can require that the instantiation receive X and y data, and create attributes from these data. Attributes "belong" to the class, and can be accessed with the syntax ``classInstance.attribute``.

In [17]:
class OLS:
    # constructor
    def __init__(self, X, y):
        # define attributes
        self.X = X
        self.y = y

In [18]:
y_test = np.ones((5,1))
x_test = np.ones((5,2))*3

myOLS = OLS(x_test, y_test)
myOLS.y # note that within OLS(), the argument is called y!

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])

Now that our OLS class has attributes, we want it to calculate something! So we are going to give it **methods**, which are functions that belong to a class. For example, let's add our linear_projection function to this class as a method. In order to do this, we define the function within the class body, and we add a special first argument to the method: ``self``. By adding ``self`` as an argument, the method has access to all the attributes and methods that the instance contains! For example:

In [19]:
class OLS:
    # constructor
    def __init__(self, X, y):
        # define attributes
        self.X = X
        self.y = y
    
    # ------- methods --------
    # linear projection
    def linear_projection(self, b):
        b = b.reshape((len(b),1))
        return self.X@b

What do you notice is different about linear_projection the method?
1. It has ``self`` as an argument. This allows it to know about attributes and even other methods contained in the class.
2. I took away X as an argument, instead calling ``self.X`` in the method body. How can this be?? Since X is now an _attribute_ of the class, and the method has the ``self`` argument, ``self.X`` is saying "grab the X that you defined as the class attribute". This way we don't have to constantly be entering our data into all the function calls, because our OLS instance is storing it for us! Here's how we would use this method:

In [20]:
myOLS = OLS(X, y)

# method call with instanceName.method. Only input is b; we don't have to pass "self" to the method.
myOLS.linear_projection(np.array([2,1]))

array([[1.500279  ],
       [0.87359401],
       [2.03864013],
       ...,
       [3.76198732],
       [2.47630347],
       [3.62583155]])
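
If it seems mysterious that we never pass ``self`` ourselves, the sketch below may help. It uses a hypothetical toy class, `Counter`, to show that calling a method on an instance is equivalent to calling it on the class with the instance passed explicitly.

```python
class Counter:
    # hypothetical toy class, used only to illustrate `self`
    def __init__(self, start):
        self.count = start

    def add(self, k):
        # `self` is the instance the method was called on
        return self.count + k

c = Counter(10)

via_instance = c.add(5)        # python passes c as `self` for us
via_class = Counter.add(c, 5)  # the same call with `self` passed explicitly

print(via_instance, via_class)
```

In practice you will almost always use the first form.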

Now let's add our other methods. What other changes do you notice to these expressions?

In [21]:
class OLS:
    # constructor
    def __init__(self, X, y):
        # define attributes
        self.X = X
        self.y = y
    
    # ------- methods --------
    # linear projection
    def linear_projection(self, b):
        b = b.reshape((len(b),1))
        return self.X@b
    
    # SSE
    def sse(self, b): # used to be a f'n of y, X, and b. Now only need b!
        yhat = self.linear_projection(b)
        sse = np.sum((yhat - self.y)**2)
        return sse
    
    # minimize the SSE
    def estimate(self, x0 = [0,0]):
        # default initial guess of [0,0]
        sol = minimize(self.sse, x0 = x0)
        return sol.x

What happened to the arguments in minimize?? Here something cool happens, and it actually starts with the `sse` function. Now that `X` and `y` are attributes of the class, the `self.sse()` method knows what they are thanks to the `self` argument that is implicitly passed into it! Therefore we can call `sse()` as only a function of one argument: `b`. Here's proof:

In [22]:
# instantiate model
model1 = OLS(X, y)

# call sse method
model1.sse(np.array([2,1]))

483818.64196098014

Now that `OLS.sse()` is a function of only one argument, we no longer need a lambda in the minimize function and can pass the method just by its name, `self.sse`. This is an example of passing a function to another function: because the only remaining argument is the one being minimized over, no masking is needed. Let's test it:

In [23]:
# call the estimate() method
model1.estimate()

array([1.99789901, 8.00919576])
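
The same idea works for any class, not just OLS. As a minimal sketch, the hypothetical `Parabola` class below stores a parameter as an attribute, and its method is then handed to ``minimize`` by name.

```python
from scipy.optimize import minimize

class Parabola:
    # hypothetical toy class: f(x) = (x - center)^2
    def __init__(self, center):
        self.center = center

    def value(self, x):
        return (x[0] - self.center) ** 2

p = Parabola(center=2.0)

f = p.value      # no parentheses: this is the function itself, bound to p
print(f([3.0]))  # calling it later still uses p.center

sol = minimize(p.value, x0=[0])
print(sol.x)     # close to 2, the stored center
```

The method "remembers" which instance it came from, so the optimizer only ever sees a function of ``x``.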

## Exercises
Try these out for yourself:
1. After estimating $\widehat{\beta}$, add it as an attribute of the OLS class.
2. Estimate the White-robust SEs. There are a few ways to do this; which do you prefer? Why?
     - Estimate and return them with beta in a tuple
     - Estimate them with beta and add as an attribute
     - Write a method that calculates and returns them upon request (nice code will avoid re-estimating the betas each time you do this. How can this be avoided?)
3. Rewrite the estimation in terms of matrix algebra instead of a minimization.

In [24]:
class OLS:
    # constructor
    def __init__(self, X, y):
        # define attributes
        self.X = X
        self.y = y
    
    # ------- methods --------
    # linear projection
    def linear_projection(self, b):
        b = b.reshape((len(b),1))
        return self.X@b
    # SSE
    def sse(self, b): # used to be a f'n of y, X, and b. Now only need b!
        yhat = self.linear_projection(b)
        sse = np.sum((yhat - self.y)**2)
        return sse
    # minimize the SSE
    def estimate(self, x0 = [0,0]):
        # default initial guess of [0,0]
        sol = minimize(self.sse, x0 = x0)
        # Exercise 1
        ...
        
        return sol.x
        
        
    # Exercise 2
    def whiteSEs(self):
        if not hasattr(self, 'beta'):
            self.estimate()
        ...
    
    # Exercise 3
    def estimate_matAlg(self):
        ...

### Test our work
I test the results against the ``statsmodels`` package.

In [25]:
import statsmodels.api as sm

In [26]:
# instantiate
model = OLS(X, y)
# run
model.estimate()

array([1.99789901, 8.00919576])

In [27]:
print("White SEs: \n%s" % model.whiteSEs()**(1/2))
print("beta_hat (minimizer): \n %s" % model.beta)
print("beta_hat (matrix algebra): \n%s" % model.estimate_matAlg())

White SEs: 
[[0.01003197 0.00160577]
 [0.00160577 0.01030198]]
beta_hat (minimizer): 
 [1.99789901 8.00919576]
beta_hat (matrix algebra): 
[[1.99789902]
 [8.00919577]]


In [28]:
robust_ols = sm.OLS(y, X).fit(cov_type='HC0')

In [29]:
print("statsmodels White SEs:\n %s" % robust_ols.cov_params()**(1/2))
print("statsmodels beta_hat:\n%s" % robust_ols.params)

statsmodels White SEs:
 [[0.01003197 0.00160577]
 [0.00160577 0.01030198]]
statsmodels beta_hat:
[1.99789902 8.00919577]
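
Rather than comparing printed numbers by eye, we can let numpy do the comparison. The sketch below copies the values from the printed output above into new arrays (the variable names are chosen here for illustration). Note that the printed matrices are element-wise square roots of the full covariance matrix, so the standard errors themselves are the diagonal entries.

```python
import numpy as np

# values copied from the printed output above
beta_minimizer = np.array([1.99789901, 8.00919576])
beta_matalg = np.array([[1.99789902], [8.00919577]])
sqrt_cov = np.array([[0.01003197, 0.00160577],
                     [0.00160577, 0.01030198]])

# flatten the Kx1 array so both have shape (K,), then compare within tolerance
same_betas = np.allclose(beta_minimizer, beta_matalg.ravel())
print(same_betas)

# the standard errors are the diagonal entries
ses = np.diag(sqrt_cov)
print(ses)
```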
