# Lecture 4: Object-oriented programming

## 1. Programming: Pandas

* [Introduction to Pandas](lec4files/03.00-Introduction-to-Pandas.ipynb)
* [Introducing Pandas Objects](lec4files/03.01-Introducing-Pandas-Objects.ipynb)
* [Data Indexing and Selection](lec4files/03.02-Data-Indexing-and-Selection.ipynb)
* [Missing Values](lec4files/03.04-Missing-Values.ipynb)
* [Concat And Append](lec4files/03.06-Concat-And-Append.ipynb)
* [Merge and Join](lec4files/03.07-Merge-and-Join.ipynb)
* [Aggregation and Grouping](lec4files/03.08-Aggregation-and-Grouping.ipynb)
* [Working With Strings](lec4files/03.10-Working-With-Strings.ipynb)
* [Working with Time Series](lec4files/03.11-Working-with-Time-Series.ipynb)
* [Performance Eval and Query](lec4files/03.12-Performance-Eval-and-Query.ipynb)
* [Further Resources](lec4files/03.13-Further-Resources.ipynb)


## 2. Topic: Object-oriented programming	

### Benefits

**Instantiation**

Being able to make instances of something. Suppose you have a new data structure. You want to define the structure once and then make instances of it for particular data sets. This can be accomplished with classes: you write a general class, and when faced with a data set, you make a new instance. This ties into the second benefit of object orientation, which is encapsulation.

**Encapsulation**

Being able to bundle related pieces of data into a single object whose attributes always have the same names. This lets us define methods on a class that every instance of that class can use. For instance, in pandas, data sets that would otherwise be plain NumPy arrays carry column names. We can therefore ask any data set for its column names and be sure that the attribute exists.
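A minimal sketch of instantiation and encapsulation together, in plain Python. The class `LabeledMatrix` and its method `column` are hypothetical names made up for illustration; they are not part of pandas or NumPy, and the numbers are toy values.

```python
class LabeledMatrix:
    """Hypothetical toy class: rows of numbers bundled with column names."""

    def __init__(self, rows, columns):
        # Refuse data that does not match the declared columns, so every
        # instance is guaranteed to carry one name per column.
        if any(len(row) != len(columns) for row in rows):
            raise ValueError("every row needs one value per column name")
        self.rows = [list(row) for row in rows]
        self.columns = list(columns)

    def column(self, name):
        """Return the values stored under a column name."""
        j = self.columns.index(name)
        return [row[j] for row in self.rows]


# Two instances of the same class, built from two different data sets.
births = LabeledMatrix([[109, 0], [133, 0], [129, 0]], ["bwght", "cigs"])
prices = LabeledMatrix([[122.3], [138.3]], ["cigprice"])

print(births.columns)              # ['bwght', 'cigs']
print(prices.column("cigprice"))   # [122.3, 138.3]
```

Both instances expose `columns` and `column(...)`, whatever data they were built from. That guarantee is the point of encapsulation.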

**Inheritance**

Suppose you wanted to make a panel data frame. You would need to write a set of panel-specific functions and features. Without inheritance, you would have to build the class from scratch, rewriting much of the pandas DataFrame code. With inheritance, you can simply extend the DataFrame class and write only the new features and functions.
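As a rough sketch of that idea, the class below extends `pd.DataFrame` with a single panel-style method. `PanelFrame` and `entity_means` are hypothetical names chosen for illustration, and the data are made up. The `_constructor` property is the hook pandas documents for subclassing; a real panel class would need considerably more than this.

```python
import pandas as pd


class PanelFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass with one panel-specific method."""

    @property
    def _constructor(self):
        # pandas uses this hook when an operation builds a new frame,
        # so that the result keeps the subclass type.
        return PanelFrame

    def entity_means(self, entity, column):
        """Average `column` within each value of the `entity` column."""
        return self.groupby(entity)[column].mean()


panel = PanelFrame({
    "firm": ["a", "a", "b", "b"],
    "year": [2019, 2020, 2019, 2020],
    "sales": [10.0, 20.0, 30.0, 50.0],
})

print(panel.entity_means("firm", "sales"))  # new method, built on inherited groupby
print(panel["sales"].max())                 # inherited DataFrame behaviour still works
```

Only `entity_means` is new code. Construction, indexing, and `groupby` all come from `DataFrame`.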

### Example

In [1]:
class A:
    def __init__(self,name):
        self.name = name
    def print_name(self):
        print(self.name)

In [5]:
A('jason')

<__main__.A at 0x7fe910190d68>

In [2]:
A('jason').name

'jason'

In [3]:
A('jason').print_name()

jason


In [7]:
class B(A):
    def __init__(self,name,job):
        # Let the parent class A store the name, then add the new attribute.
        super().__init__(name)
        self.job = job
    def print_name_and_job(self):
        print(f"{self.name}'s job is: {self.job}")

In [8]:
B('jason','teacher').print_name()

jason


In [9]:
B('jason','teacher').print_name_and_job()

jason's job is: teacher


## 3. Data Science: Linear model	
Our first goal is to understand the variance of $\hat\beta$. To do this, we will combine the formula for $\hat\beta$ with the linear modeling equation, $y=\mathbf X\beta+e$:
$$\begin{align}
\hat\beta&=\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'y\\
&=\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'\mathbf X\beta+\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'e\\
&=\beta+\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'e\\
\hat\beta-\beta&=\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'e
\end{align}$$
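
The identity $\hat\beta-\beta=\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'e$ can be checked numerically. The sketch below assumes simulated data; the sample size, coefficients, and seed are arbitrary choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated design matrix with an intercept column, true beta, and errors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, -2.0, 0.5])
e = rng.normal(size=n)
y = X @ beta + e

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y         # (X'X)^{-1} X'y
sampling_error = XtX_inv @ X.T @ e   # (X'X)^{-1} X'e

# The two sides of the last line of the derivation agree up to rounding.
print(np.allclose(beta_hat - beta, sampling_error))
```

In real data $e$ is unobserved, so this check is only possible in a simulation where we choose $\beta$ and $e$ ourselves.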
Then we can define the variance:
$$\text{Var}(\hat\beta\vert\mathbf X)=\text{E}\left[\left(\hat\beta-\beta\right)\left(\hat\beta-\beta\right)'\vert\mathbf X\right]$$
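This is true so long as $\text{E}(\hat\beta\vert\mathbf X)=\beta$, i.e., when there is no endogeneity ($\text{E}\left[e\vert\mathbf X\right]=0$). Substituting our expression for $\hat\beta-\beta$ into this definition, and noting that $\mathbf X$ can be pulled out of the conditional expectation, gives: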
$$\text{Var}(\hat\beta\vert\mathbf X)=\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'\text{E}\left[ee'\vert\mathbf X\right]\mathbf X\left(\mathbf X'\mathbf X\right)^{-1}$$
So we have our general formula for the variance. The OLS assumptions (no heteroskedasticity, no autocorrelation) imply in matrix form that $\text{E}\left[ee'\vert\mathbf X\right]=\sigma^2\mathbf I_n$. Now we can simplify our variance as follows:
$$\begin{align}
\text{Var}(\hat\beta\vert\mathbf X)&=\sigma^2\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'\mathbf I_n\mathbf X\left(\mathbf X'\mathbf X\right)^{-1}\\
 &=\sigma^2\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'\mathbf X\left(\mathbf X'\mathbf X\right)^{-1}\\
 &=\sigma^2\left(\mathbf X'\mathbf X\right)^{-1}
\end{align}$$
So we can define the OLS variance estimator:
$$\hat{\text{Var}}(\hat\beta)_{\text{OLS}}=s^2\left(\mathbf X'\mathbf X\right)^{-1}$$
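where $s^2=(n-r)^{-1}\sum_{i=1}^n\hat e_i^2$, $\hat e=y-\mathbf X\hat\beta$ is the vector of residuals, and $r$ is the rank of $\mathbf X$ (the number of estimated coefficients).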
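
A minimal NumPy sketch of this estimator on simulated data. It assumes $r$ is the number of columns of $\mathbf X$; the data, seed, and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated homoskedastic data: errors have standard deviation 3.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(scale=3.0, size=n)

r = X.shape[1]                       # assumed: r = number of columns of X
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat             # e_hat

s2 = resid @ resid / (n - r)         # s^2 = sum(e_hat_i^2) / (n - r)
var_ols = s2 * XtX_inv               # s^2 (X'X)^{-1}
se_ols = np.sqrt(np.diag(var_ols))   # standard errors of each coefficient

print(s2)
print(se_ols)
```

The square roots of the diagonal of `var_ols` are the standard errors usually reported next to each coefficient.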
If we assume only no autocorrelation, allowing heteroskedasticity, then $\text{E}\left[ee'\vert\mathbf X\right]=\text{diag}\left( \sigma_i^2\right)$. Replacing each unknown $\sigma_i^2$ with the squared residual $\hat e_i^2$ gives the White estimator:
$$\hat{\text{Var}}(\hat\beta)_{\text{White}}=\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X' \text{diag}\left(\hat{e}^2\right)\mathbf X\left(\mathbf X'\mathbf X\right)^{-1}$$
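
The sandwich formula above can be sketched in NumPy as follows. The data are simulated with an error variance that depends on the regressor, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated heteroskedastic data: the error spread grows with |x1|.
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
e = rng.normal(size=n) * (0.5 + np.abs(x1))
y = X @ np.array([1.0, 2.0]) + e

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# X' diag(e_hat^2) X, computed by scaling each row of X by its squared
# residual instead of building the n-by-n diagonal matrix.
meat = (X * resid[:, None] ** 2).T @ X
var_white = XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(var_white)))   # heteroskedasticity-robust standard errors
```

Scaling rows avoids allocating an $n\times n$ matrix, which matters once $n$ is large.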
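t-statistics for the null hypothesis $\beta_j=0$ divide each coefficient by its standard error. With $\oslash$ and the square root both acting elementwise:
$$t_{\hat\beta}=\hat\beta\oslash\sqrt{\text{diag}\left(\hat{\text{Var}}(\hat\beta)\right)}$$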
and two-sided p-values follow from the standard normal CDF $\Phi$, a large-sample approximation:
$$p_{\hat\beta}=2\Phi(-\vert t_{\hat\beta}\vert)$$
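
A small sketch of both quantities, dividing each coefficient by its standard error and using the normal approximation for the p-value. The helper names `normal_cdf` and `t_and_p` are hypothetical, and the coefficients and variance matrix are made-up numbers rather than estimates from any data set.

```python
import math

import numpy as np


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def t_and_p(beta_hat, var_hat):
    """t-statistics (coefficient / standard error) and two-sided p-values."""
    se = np.sqrt(np.diag(var_hat))
    t = beta_hat / se
    p = np.array([2.0 * normal_cdf(-abs(ti)) for ti in t])
    return t, p


# Made-up coefficients and a made-up diagonal variance matrix.
beta_hat = np.array([1.0, -0.392, 0.0])
var_hat = np.diag([0.25, 0.04, 1.0])

t, p = t_and_p(beta_hat, var_hat)
print(t)   # t-statistics: 2.0, -1.96, 0.0
print(p)   # p-values: roughly 0.0455, 0.0500, 1.0
```

The same function works with either the OLS or the White variance matrix, since only the diagonal is used.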

## 4. Programming challenges 

### Package structure

Consider how you would organize a data science package. What classes would you use? How would the inheritance structure work?

### Audioactive decay


In [14]:
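#Package structure
import numpy as np

class unsupervised_model:
    """Base class: holds the feature matrix x and its dimensions."""
    def __init__(self,x):
        self.x = x
        self.n_obs = x.shape[0]
        self.n_feats = x.shape[1]  # number of columns (features) in x

class supervised_model(unsupervised_model):
    """Adds a target y to the feature matrix."""
    def __init__(self,x,y):
        super().__init__(x)
        self.y = y

class predictor(supervised_model):
    """Abstract predictor: subclasses must implement predict."""
    def predict(self,newx):
        raise NotImplementedError()
    def fitted(self):
        return self.predict(self.x)
    def residuals(self):
        return self.y-self.fitted()

class linear_model(predictor):
    """Linear predictor: subclasses supply the coefficients through _fit."""
    def __init__(self,x,y):
        super().__init__(x,y)
        self.params = self._fit()
    def _fit(self):
        # Single leading underscore: dunder names are reserved for Python itself.
        raise NotImplementedError()
    def predict(self,newx):
        return newx@self.params

class least_square_regressor(linear_model):
    def _fit(self):
        # Solve the normal equations (X'X) b = X'y without forming an inverse.
        return np.linalg.solve(self.x.T@self.x,self.x.T@self.y)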

In [11]:

import numpy as np
import pandas as pd
bwght = pd.read_csv('BWGHT.csv')
bwght

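| Unnamed: 0 | faminc | cigtax | cigprice | bwght | fatheduc | motheduc | parity | male | white | cigs | lbwght | bwghtlbs | packs | lfaminc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.5 | 16.5 | 122.3 | 109 | 12.0 | 12.0 | 1 | 1 | 1 | 0 | 4.691348 | 6.8125 | 0.0 | 2.602690 |
| 1 | 7.5 | 16.5 | 122.3 | 133 | 6.0 | 12.0 | 2 | 1 | 0 | 0 | 4.890349 | 8.3125 | 0.0 | 2.014903 |
| 2 | 0.5 | 16.5 | 122.3 | 129 |  | 12.0 | 2 | 0 | 0 | 0 | 4.859812 | 8.0625 | 0.0 | -0.693147 |
| 3 | 15.5 | 16.5 | 122.3 | 126 | 12.0 | 12.0 | 2 | 1 | 0 | 0 | 4.836282 | 7.8750 | 0.0 | 2.740840 |
| 4 | 27.5 | 16.5 | 122.3 | 134 | 14.0 | 12.0 | 2 | 1 | 1 | 0 | 4.897840 | 8.3750 | 0.0 | 3.314186 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1383 | 27.5 | 30.0 | 138.3 | 110 | 12.0 | 12.0 | 4 | 1 | 1 | 0 | 4.700480 | 6.8750 | 0.0 | 3.314186 |
| 1384 | 5.5 | 30.0 | 138.3 | 146 |  | 16.0 | 2 | 1 | 1 | 0 | 4.983607 | 9.1250 | 0.0 | 1.704748 |
| 1385 | 65.0 | 8.0 | 118.6 | 135 | 18.0 | 16.0 | 2 | 0 | 1 | 0 | 4.905275 | 8.4375 | 0.0 | 4.174387 |
| 1386 | 27.5 | 8.0 | 118.6 | 118 |  | 14.0 | 2 | 0 | 1 | 0 | 4.770685 | 7.3750 | 0.0 | 3.314186 |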


In [15]:
# Add a column of ones so the model includes an intercept term.
bwght['(intercept)'] = 1

x = bwght[['(intercept)','cigs','faminc', 'male']]
y = bwght['bwght']
least_square_regressor(x,y).params

array([ 1.15227708e+02, -4.61045700e-01,  9.68798348e-02,  3.11396789e+00])