# NumPy

NumPy ("numpy" when you import it) is a scientific computing library for Python. Think matrix algebra on steroids. Its main data structure is the NumPy array.

### Pros of numpy:
* fast (but really): the heavy lifting runs in compiled code rather than in Python loops
* some nice built-in functions, including random number generation
* plays nice with other python packages

### Cons of numpy:
* lower-level than many other packages


Where do I use numpy in economics? My advanced macroeconomics class relied on it heavily.

## A note on python packages:
As you may have noticed, a lot of Python programs start with something like 'import numpy as np'. What does this mean? It tells Python to load the NumPy library, which must already be installed on your computer, so that its functions are available in your program. The 'np' is just shorthand: you could type 'import numpy' and then refer to every NumPy function with the prefix 'numpy.', but when you finish the statement with 'as np' you can type 'np.' instead.
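As a quick check of the aliasing described above, the sketch below imports the library both ways and confirms that both names point to the same module. The variable names a and b are only for illustration.

```python
import numpy
import numpy as np

# Both names refer to the very same module object
print(np is numpy)

# So these two calls are interchangeable
a = numpy.array([1, 2, 3])
b = np.array([1, 2, 3])
print(np.array_equal(a, b))
```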

## Basics

In [1]:
#!pip install numpy
import numpy as np

In [2]:
A = np.array([1,2,3,6,5,4,8,7,9])
print(A)

[1 2 3 6 5 4 8 7 9]


In [3]:
A.shape

(9,)

What?

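NumPy treats one-dimensional arrays (vectors) specially: the shape (9,) is a tuple with a single entry, meaning one axis of length 9, so A is neither a row vector nor a column vector. Let's have NumPy treat it as a two-dimensional array with one row.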

In [4]:
A = A.reshape(1,9)
print(A)

[[1 2 3 6 5 4 8 7 9]]


In [5]:
A.shape

(1, 9)

### The shape of an array
Each NumPy array has a 'shape' attribute. It is always a tuple with one entry per dimension. For a two-dimensional array it has the form (a, b), where a is the number of rows and b is the number of columns.

In [6]:
A = A.reshape(3,3)
print(A)

[[1 2 3]
 [6 5 4]
 [8 7 9]]
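To make the link between shape and number of dimensions concrete, here is a small sketch that prints both for the same nine numbers arranged three ways. The names v, row, square and col are made up for this example.

```python
import numpy as np

v = np.array([1, 2, 3, 6, 5, 4, 8, 7, 9])   # one axis
row = v.reshape(1, 9)                        # two axes: 1 row, 9 columns
square = v.reshape(3, 3)                     # two axes: 3 rows, 3 columns

for name, arr in [("v", v), ("row", row), ("square", square)]:
    # ndim is the number of axes, which equals len(shape)
    print(name, arr.shape, arr.ndim)

# Passing -1 asks NumPy to work out that axis from the array's size
col = v.reshape(-1, 1)
print(col.shape)
```

Reshaping never changes the data, only how NumPy lays it out into rows and columns, so the total number of elements has to stay the same.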


Can I make arrays that are 3-dimensional?

In [7]:
C = np.array(range(1,28))
C = C.reshape((3,3,3))
print(C, '\nThe shape of C is:', C.shape)

[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]

 [[19 20 21]
  [22 23 24]
  [25 26 27]]] 
The shape of C is: (3, 3, 3)


See how we passed a python range object into the array constructor? That's pretty neat.
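NumPy also has its own version of range, np.arange, which builds the array directly. The sketch below checks that both routes give the same 3x3x3 array and shows how a single element is indexed; the variable names are illustrative.

```python
import numpy as np

from_range = np.array(range(1, 28)).reshape((3, 3, 3))
from_arange = np.arange(1, 28).reshape((3, 3, 3))

# Same contents either way
print(np.array_equal(from_range, from_arange))

# Indexing a 3-D array goes block, then row, then column
print(from_arange[0, 1, 2])
```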

## Introducing: zeros and element-wise arithmetic

In [8]:
B = np.zeros((3,3))
B = (B + 1)*2
print(B)

[[2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]]


### Why is everything a '2.' instead of a '2'?

It has to do with data types. Computers store integers and decimal (floating point) numbers differently. Unless you tell it otherwise, np.zeros creates an array of floating point numbers, so B started out as 0. rather than 0, and adding or multiplying by whole numbers keeps it a float. NumPy prints a float with no fractional part as '2.' so you can tell it apart from the integer '2'.

Why care? In some settings, whole numbers stay whole when you divide them. Python 2's '/' and Python 3's '//' both give 12 divided by 10 as 1 instead of 1.2. If you don't want this to happen, you can enter numbers with a decimal point after them: '2' is an integer, but '2.' is a floating point number, and Python handles them differently.

Often, Python and its libraries (like NumPy) are pretty good at doing what you want them to do. Type mix-ups like this are a much bigger deal in C, C++, and similar languages. Still, it's good to be aware that things like this can happen.
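A small sketch of how NumPy picks data types may help here. It relies on np.zeros producing floats by default and accepting a dtype argument; the names floats and ints are made up for the example.

```python
import numpy as np

floats = np.zeros((3, 3))             # default dtype is float64
ints = np.zeros((3, 3), dtype=int)    # ask for integers explicitly

print(floats.dtype)
print(((ints + 1) * 2).dtype)   # integer arithmetic stays integer
print((ints + 1.0).dtype)       # mixing in a float upcasts to float64

# Python 3: / is true division, // is floor division
print(12 / 10, 12 // 10)
```

The exact integer type printed in the second line depends on the platform, which is why it is not spelled out here.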

## Adding, multiplying, inverting

In [9]:
print(A + B)

[[ 3.  4.  5.]
 [ 8.  7.  6.]
 [10.  9. 11.]]


In [10]:
print(A * B)

[[ 2.  4.  6.]
 [12. 10.  8.]
 [16. 14. 18.]]


This may be fine for business majors, but this is not the matrix algebra we want: '*' multiplies the two arrays element by element. For a true matrix product, use np.dot.

In [11]:
D = np.dot(A, B)
print(D)

[[12. 12. 12.]
 [30. 30. 30.]
 [48. 48. 48.]]


This is the old school way. For two-dimensional arrays like these, the '@' operator (available since Python 3.5) gives the same result and seems a little more intuitive:

In [12]:
D = A @ B
print(D)

[[12. 12. 12.]
 [30. 30. 30.]
 [48. 48. 48.]]
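To see the three kinds of multiplication side by side, the sketch below rebuilds A and B as they stood at this point in the notebook (B filled with 2.0, written here with np.full as a shortcut) and compares the results.

```python
import numpy as np

A = np.array([1, 2, 3, 6, 5, 4, 8, 7, 9]).reshape(3, 3)
B = np.full((3, 3), 2.0)

elementwise = A * B      # multiplies matching entries
matrix_prod = A @ B      # rows of A times columns of B

# np.dot and @ agree for 2-D arrays
print(np.allclose(np.dot(A, B), matrix_prod))
# The element-wise product is a different thing entirely
print(np.allclose(elementwise, matrix_prod))
```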


Let's look into inverting a matrix:

In [13]:
B = np.linalg.inv(A)
print(B)

[[-0.80952381 -0.14285714  0.33333333]
 [ 1.04761905  0.71428571 -0.66666667]
 [-0.0952381  -0.42857143  0.33333333]]


In [14]:
idmat = np.dot(A,B)
print(idmat)

[[ 1.00000000e+00 -5.55111512e-17 -5.55111512e-17]
 [-4.44089210e-16  1.00000000e+00  2.22044605e-16]
 [ 5.55111512e-16 -1.66533454e-16  1.00000000e+00]]
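Those tiny off-diagonal numbers (around 1e-16) are floating point rounding error, not a mistake in the inverse. A minimal sketch of how one might check for "equal up to rounding", using np.allclose and np.eye; A_inv is a name introduced here so that B is not overwritten.

```python
import numpy as np

A = np.array([1, 2, 3, 6, 5, 4, 8, 7, 9]).reshape(3, 3)
A_inv = np.linalg.inv(A)
idmat = A @ A_inv

# Exact equality can fail because of rounding, so compare with a tolerance
print(np.allclose(idmat, np.eye(3)))

# Rounding for display hides the tiny off-diagonal terms
print(np.round(idmat, 10))
```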


## A real-life (ish) example:

Let's start off with a regression example. We're going to make up some data, knowing the relationship between all the variables, and then we'll add in some random chance (epsilon). Then, we'll see if our standard Ordinary Least Squares (OLS) regression can estimate the original relationship. Here's our model:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i $$

In [15]:
nobs = 10000
numxvars = 3

x1 = np.random.rand(nobs)*10
x2 = np.random.rand(nobs)*5
x3 = np.random.randint(1,10,nobs)

beta0 = 10
beta1 = 3
beta2 = 2
beta3 = 4

eps = np.random.randn(nobs)*10

Y = beta0 + beta1*x1 + beta2*x2 + beta3*x3 + eps

In [16]:
print(eps[:10])

[ -3.19720576  -5.13551127  -5.46313827   8.84925058   5.45390226
  -3.81774715   5.73988402 -19.81544326 -16.69128432  -5.26036559]


What's with the colon?

It's Python's 'slice' notation. 'eps[:10]' asks for everything from the start of the array up to, but not including, position 10. Since positions start at 0, that is the first ten elements. Slice notation could be its own class. If you're interested, Google it.
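Here are a few of the most common slice patterns, sketched on a made-up array x (and a made-up matrix M) so the positions are easy to read off.

```python
import numpy as np

x = np.arange(20)   # 0, 1, ..., 19

print(x[:10])    # first ten elements (positions 0 to 9)
print(x[10:])    # position 10 to the end
print(x[2:5])    # positions 2, 3, 4
print(x[::2])    # every second element
print(x[-3:])    # last three elements

# For 2-D arrays, slice each axis separately
M = x.reshape(4, 5)
print(M[:, 0])   # first column
print(M[1, :])   # second row
```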

In [17]:
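# Design matrix: a column of ones (the intercept) followed by the regressors
X = np.stack((np.ones(nobs), x1, x2, x3), axis = 1)

Xprime = np.transpose(X)
XprimeXinv = np.linalg.inv(Xprime @ X)

# OLS estimator: bhat = (X'X)^-1 X'Y
bhat = XprimeXinv @ Xprime @ Y

# Residuals: actual Y minus the fitted values X @ bhat
uhat = Y - X @ bhat
# Estimated error variance, dividing by the degrees of freedom
sigma2 = (np.transpose(uhat) @ uhat) / (nobs - (numxvars + 1))
# Variance-covariance matrix of bhat
varbhat = XprimeXinv*sigma2

print(bhat)
# Standard error of the coefficient on x2 (row 2, column 2)
print(np.sqrt(varbhat[2,2]))

[9.97934431 3.01209678 2.00134726 3.99886939]

The second number printed is the standard error of the coefficient on x2. For this setup it should come out near 0.07; the exact digits depend on the random draw.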
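One way to sanity-check a hand-rolled OLS is to compare it with NumPy's built-in least-squares solver, np.linalg.lstsq. The sketch below regenerates similar data with a seeded generator (the seed and the names rng, true_beta, XtX_inv, bhat_lstsq and se are assumptions made for this example, not part of the cells above), computes the estimate both ways, and derives standard errors from the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily, for reproducibility
nobs = 10000

x1 = rng.random(nobs) * 10
x2 = rng.random(nobs) * 5
x3 = rng.integers(1, 10, nobs)
eps = rng.standard_normal(nobs) * 10
true_beta = np.array([10.0, 3.0, 2.0, 4.0])

X = np.column_stack((np.ones(nobs), x1, x2, x3))
Y = X @ true_beta + eps

# Normal equations: bhat = (X'X)^-1 X'Y
XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ X.T @ Y

# Least-squares solver: same estimate without forming the inverse explicitly
bhat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Standard errors from the residuals
uhat = Y - X @ bhat
sigma2 = (uhat @ uhat) / (nobs - X.shape[1])
se = np.sqrt(np.diag(sigma2 * XtX_inv))

print(bhat)
print(bhat_lstsq)
print(se)
```

The two estimates should agree to many decimal places. Since eps has a standard deviation of 10, sigma2 should land near 100.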


# Pandas

"pandas" is a Python library that was specially created to work with tabular data. These tables of data are called DataFrames. Pandas offers a simple interface for simple analysis. It plays nicely with NumPy.

### Pros of pandas:
* completely free
* links with other python packages that provide more advanced utilities
    * numpy 
    * matplotlib for easy visualization
    * sklearn for machine learning
    * hundreds of other packages tailored to data analysis
* widely used, so most problems are easy to Google
* can work with nearly any file type

### Cons of pandas:
* slow on large (2GB+) datasets
* a bit of a learning curve
* not as intuitive or efficient as tools built specifically for economic analysis (such as Stata). You can never beat 'regress y x' for simplicity.