# 1. Introduction: The Numpy ndarray

Welcome to the SSRIC Instructional Modules for the project, "Teaching Statistics and Economic Data Analysis in Python with Jupyter Notebooks", by Daniel MacDonald, Associate Professor and Chair, Economics Department, CSU San Bernardino. These were written in Summer 2023.

Most of the modules draw extensively on Kevin Sheppard's e-book, *Introduction to Python for Econometrics, Statistics, and Data Analysis*, available here: https://bashtage.github.io/kevinsheppard.com/files/teaching/python/notes/python_introduction_2021.pdf. 

Rather than begin instruction in Python through the core tools of computer programming (such as conditions, loops, and functions), Sheppard begins with Python's major "containers", or data structures. Through practice, I have learned that this is an effective method for teaching Python to economics majors. 

The learning objectives of this set of Instructional Modules are as follows. By the end of these modules, students will be able to...

1. Create data structures in Python based on economic data
1. Summarize the statistical properties of economic data (median, mean, max, min, correlation) using Python
1. Create and manage economic data: create new columns and rows, merge and append, and import data from .csv and .xlsx files into Python
1. Visualize economic data using line plots and scatterplots

The objectives/content of Module 2 are as follows. By the end of this module, students will be able to...

1. Import libraries into Python for advanced data analysis
1. Perform basic data manipulation operations on the Numpy `ndarray` datatype
1. Calculate economic statistics using Numpy

## 1.1 Libraries in Python

Previously, we worked through the basic Python data structures - floats, integers, strings, lists, tuples, and dictionaries. These data structures are native to Python - i.e., they constitute the core of the programming language. But most economic work in Python involves **libraries**, which are additional tools and structures that are available to you and which allow you to tackle problems in a more advanced way. 

The issue with libraries is that, since they are not part of the core Python package, you will need to `import` them into your scripts/programs in order to use them. 

One of the most popular libraries is called "Numpy". We can import it as follows and start to work with it. Note that "np" is a standard convention for calling tools from this library.

In [None]:
import numpy as np

## 1.2 Import successful

Once a library has been imported, you can use it by calling the label you applied, such as `np` from above.

In [None]:
x=[0.0, 1, 2, 3, 4]
print('x is type ', type(x))

y=np.array(x) # method for converting a list to a Numpy array
print(y)
print('y is type ', type(y))

# 2. Getting started with Numpy - the array data type

We converted the list `x` into a Numpy array by using `np.array()`. Notice that the data in this form have a different `type` - it is now type 'numpy.ndarray', where 'ndarray' stands for N-dimensional array. You can think of an array as a matrix.

Whenever you want to use a Numpy function, you will need to prefix it with `np.`, as in `np.array()`.

Converting the list to a new data type, an `ndarray`, seems pointless at first - **why not just keep it as a list**? We have functions like `sum()` and `len()` which we used on lists last time, and there are others as well such as `sorted()`.

Of course, converting the list to a Numpy array is *not* pointless - it opens up all sorts of functions that are in Numpy but which are not part of the standard Python library's data structures.
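To make the difference concrete, here is a minimal sketch using made-up wage figures (the numbers and variable names are illustrative assumptions, not data from this module). Multiplying a list by 2 repeats the list, while multiplying an array by 2 doubles each element:

```python
import numpy as np

# Hypothetical hourly wages, for illustration only
wages_list = [10.0, 20.0, 30.0]
wages_array = np.array(wages_list)

# With a list, * means "repeat the list"
doubled_list = wages_list * 2
print(doubled_list)

# With an array, * means "multiply every element"
doubled_array = wages_array * 2
print(doubled_array)

# Arrays also come with their own methods, such as .mean()
mean_wage = wages_array.mean()
print(mean_wage)
```

Element-by-element arithmetic of this kind is usually what we want when working with economic data, which is one reason arrays tend to be more convenient than lists.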

It also allows us to give a little bit more shape or structure to our data containers:

In [None]:
x=[[1.0], [2.0], [3.0], [4.0], [5.0]]
print(x)

y=np.array(x)
print(y)

print('shape (rows, columns): ', np.shape(y))
print(type(np.shape(y)))

## 2.1 Try it yourself: create a list

Suppose you have a sample of wages 10.52, 21.45, 51.78, 70.12, 89.83. Do the following: 

1. Create a list object from these data
1. Convert the list object into a Numpy `ndarray` type

In [None]:
#Try it yourself. Write your code below:



## 2.2 Arrays as matrices and data structures

As you can see in the above example, converting to a Numpy array gives some "shape" to our list. That is a simple 5x1 matrix, but you can see how useful it could be in more advanced applications, especially involving linear algebra. Here is a representation of a 2x2 matrix as well as other common matrix types:

In [None]:
x=[[1,3], [2,5]]
y=np.array(x)

print(y)
print(np.shape(y))

In [None]:
x=[[-4,2,1,0], [-8,1,0,0], [8, -3,-3,0]]
y=np.array(x)

print(y)
print(np.shape(y))

In [None]:
wage_lq=np.array([[1342., 0.91], [732.,0.14],[793.,0.76],[977.,0.97],[840.,0.96],
                  [1084.,0.85],[1280.,0.93],[694.,0.75],[955.,0.76],[995.,0.95]])
print(wage_lq)
print(np.shape(wage_lq))
print('rows: ', np.shape(wage_lq)[0])
print('columns: ', np.shape(wage_lq)[1])

## 2.3 Try it yourself: create a simple economic dataset

Use the following data and do the following:

1. Create a Numpy array with two columns of data: the first column holds the county name and the second column holds the unemployment rate
1. **Each row should have a new county associated with it**

Los Angeles: 4.5 <br>
Riverside: 4.2 <br>
San Bernardino: 4.1 <br>
San Luis Obispo: 2.8 <br>
Santa Barbara: 3.2 <br>
Ventura: 3.7 <br>
Kern: 6.8 <br>
Orange: 3.0 <br>
San Diego: 3.3 <br>
Imperial: 16.7

In [None]:
#Try it yourself. Write your code below:



# 3. Accessing information from an array using slices

For economics and statistics, Numpy has many other useful features. Let's start with a random 4x4 matrix and explore how to access elements from it:

In [None]:
np.random.seed(92407) # Set the seed so the same random numbers appear on every run

x=np.random.randint(1, 10, (4,4))
print(x)

We can access rows and columns of this array in an easy way using the basics of slicing and indexing which we saw first in the previous tutorial - with a small twist:

In [None]:
print('First row of x: ', x[0,:]) # Entire first row
print('Second row of x: ', x[1,:]) # Entire second row
print('First and second rows of x: \n', x[0:2,:]) # Entire first and second row
print('Fourth (i.e., final) column of x: ', x[:,3]) # Entire 4th column
print('Final column of x: ', x[:,-1])

print('Third element of the third row: ', x[2,2])
print('Third element of the third row: ', x[2][2]) #This way of "slicing" might be more or less intuitive - use whatever works!

**The "twist":** instead of the somewhat clunky `x[a][b]` notation to denote the index positions of the outer and inner lists as we might do for a `list` type, Numpy `ndarray` types allow use of the more intuitive `x[a,b]` notation.
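The difference between the two notations can be sketched with a small made-up table (the values and names below are assumptions chosen only for illustration). An array accepts both forms, while a nested list accepts only the chained form:

```python
import numpy as np

# Hypothetical 2x3 table of numbers, for illustration only
nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# Both notations work on an array and give the same element
print(arr[1][2])   # row index 1, then column index 2
print(arr[1, 2])   # same element, one pair of brackets

# A nested list only supports the chained form
print(nested[1][2])
try:
    nested[1, 2]
except TypeError as err:
    print('Lists do not accept [a, b]:', err)

# The comma notation also makes whole columns easy to pull out
second_column = arr[:, 1]
print(second_column)
```

Pulling out a column from a nested list would require a loop, so the comma notation is a real convenience when each column is a variable in a dataset.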

Going back to our `wage_lq` array, which is a small sample of 10 counties' average weekly wages (column 0) and employment location quotients in the logistics industry (column 1), let's access the columns and some individual rows:

In [None]:
wage_lq=np.array([[1342., 0.91], [732.,0.14],[793.,0.76],[977.,0.97],[840.,0.96],
                  [1084.,0.85],[1280.,0.93],[694.,0.75],[955.,0.76],[995.,0.95]])
print(wage_lq)

print(wage_lq[:,0]) # average weekly wage
print(wage_lq[:,1]) # location quotient
print(wage_lq[3,:]) # average weekly wage and location quotient for the 4th location in the dataset

## 3.1 Try it yourself: slicing Numpy arrays

Using the array in the next cell, write code to find the following:

1. Print the 6th row
1. Print the 3rd column
1. Print the 3rd element of the 2nd column
1. Print the last row (hint: recall the position of a "last" element is `-1` in Python)
1. Print the last element of the last column (see above hint)

In [None]:
np.random.seed(92407)
tryit=np.random.randint(400, 600, (8,5))
print(tryit)

#Try it yourself. Write your code below to answer 1-5:



## 3.2 Altering datasets and making copies

Sometimes it's useful to make a copy of a dataset and work with it, or add a column or row of data. While you can do this with list methods, it's easier to use Numpy methods.

In [None]:
np.random.seed(92407)
x=np.random.randint(1,20,(5,7))
print(x)

new_row=np.ones((1,7)) #Remember to enclose the shape as a tuple

y=np.append(x,new_row, axis=0) #Make sure to set the axis (0 for appending a row, 1 for appending a column)
print(y)

new_column=np.ones((5,1))

z=np.append(x, new_column, axis=1) 
print(z)
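
Two details of `np.append()` are easy to miss, so here is a minimal sketch on a small made-up array (the values are assumptions for illustration). First, `np.append()` returns a new array and leaves the original unchanged. Second, if you leave out `axis`, the result is flattened into a single long row of values:

```python
import numpy as np

# Hypothetical 2x3 array, for illustration only
x = np.array([[1, 2, 3], [4, 5, 6]])
new_row = np.ones((1, 3))

y = np.append(x, new_row, axis=0)

print(x.shape)  # the original array is unchanged
print(y.shape)  # the new array has one more row

# np.ones() produces floats, so the combined array holds floats too
print(y.dtype)

# Leaving out axis flattens everything into one long 1-D array
flat = np.append(x, new_row)
print(flat.shape)
```

Because the original is untouched, remember to store the result under a name (such as `y`) if you want to keep it.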

What about a copy of a dataset? Writing `b=a` does not create a copy at all: it gives the same array a second name, so a change made through `b` also changes `a`. What you usually want is a copy that does not share its data with the original (a **deep copy**). To get one, use `np.copy()` or the equivalent `.copy()` method of the array:

In [None]:
np.random.seed(359)
a=np.random.randint(1,20,(2,4))
print(a)

b=a

b[0,0]=100

print('b:', b)
print('a:', a)

print("\n\nOh! That wasn't supposed to happen.")
print("The first element of the first row/column of b was supposed to change to 100, but the first element of the first row/column of a was not supposed to change.")
print("Let's try that again: \n\n")

a=np.random.randint(1,20,(2,4))

b=np.copy(a)

b[0,0]=100

print(b)
print(a)

print("Better.")
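
If you are ever unsure whether two names point to the same data, one possible check is sketched below. The array values and the names `alias` and `copied` are assumptions for illustration; `is` and `np.shares_memory()` are standard Python and Numpy tools:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])  # hypothetical data

alias = a            # a second name for the same array
copied = np.copy(a)  # an independent copy of the data

print(alias is a)                   # True: same object
print(copied is a)                  # False: different object
print(np.shares_memory(a, alias))   # True
print(np.shares_memory(a, copied))  # False

# Changing the copy leaves the original alone
copied[0, 0] = 100
print(a[0, 0])  # still 1
```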

## 3.3 Try it yourself: append a column of ones

Use the `tryit` object defined earlier and add a column of ones to it:

In [None]:
np.random.seed(92407)
tryit=np.random.randint(400, 600, (8,5))
print(tryit)

new_column=np.ones((8,1))

#Write code to add this new column to our dataset:


# 4. Mathematical and statistical functions in Numpy

As mentioned earlier, Numpy is an advanced library for handling data. We can perform many mathematical and statistical operations on arrays using Numpy functions/methods. Here are some examples using a new dataset of normally distributed random numbers:

In [None]:
x=np.random.normal(0,4,(3,5)) #mean (=0) and standard deviation (=4) specified as well as shape
print(x)
print(np.sum(x[:, 1])) # Sum of the elements in the 2nd column
print(np.max(x[:, 3])) # Maximum value of the elements in the 4th column
print(np.std(x[:, 2])) # Standard deviation of the elements in the 3rd column
print(np.mean(x[:, 1])) #Mean of the elements of the 2nd column
print(np.median(x[:, 1])) #Median of the elements of the 2nd column

print(np.corrcoef(x, rowvar=False)) # rowvar=False tells Numpy to correlate columns, not rows
print(np.corrcoef(x, rowvar=False)[1,2]) #extracts correlation coeff. between the 2nd (index pos = 1) and 3rd (pos = 2) columns
print(np.corrcoef(x, rowvar=False)[2,1]) #correlation between 3rd and 2nd; should be same as above
print(np.corrcoef(x, rowvar=False)[2,3]) #correlation coeff. between 3rd and 4th columns
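
Rather than slicing out one column at a time, many Numpy functions accept an `axis` argument that computes a statistic for every column (or row) at once. The sketch below uses a tiny made-up dataset (an assumption for illustration) so that the results are easy to verify by hand. It also shows that `np.std()` computes the population standard deviation by default, and that setting `ddof=1` gives the sample standard deviation more common in statistics courses:

```python
import numpy as np

# Hypothetical data: 3 rows (observations) and 2 columns (variables)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

col_means = np.mean(data, axis=0)  # one mean per column
row_sums = np.sum(data, axis=1)    # one sum per row
print(col_means)
print(row_sums)

# Population (default) versus sample (ddof=1) standard deviation of the first column
pop_std = np.std(data[:, 0])
sample_std = np.std(data[:, 0], ddof=1)
print(pop_std, sample_std)

# The second column is exactly 10 times the first, so the correlation is 1 (up to rounding)
corr = np.corrcoef(data, rowvar=False)[0, 1]
print(corr)
```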

## 4.1 Try it yourself: Calculate mean, medians, and correlations with the wage-location quotient data

Using the example above, take "wage_lq" and do the following tasks:

1. Calculate the mean of the average weekly wage (the first column, index position 0)
1. Calculate the mean location quotient (the second column, index position 1)
1. Calculate the correlation between the two columns

In [None]:
wage_lq=np.array([[1342., 0.91], [732.,0.14],[793.,0.76],[977.,0.97],[840.,0.96],
                  [1084.,0.85],[1280.,0.93],[694.,0.75],[955.,0.76],[995.,0.95]])

# Try it yourself. Write your code for 1-3 below:



## Summary

Numpy arrays give us a structured way to store, access, and summarize economic data. Indeed, the (even more) advanced libraries we will use in later modules are ultimately based on the Numpy `ndarray` type. Having a thorough understanding of Numpy is therefore critical for later modules.