# 1. Introduction: Basic economic data structures in Python

Welcome to the SSRIC Instructional Modules for the project, "Teaching Statistics and Economic Data Analysis in Python with Jupyter Notebooks", by Daniel MacDonald, Associate Professor and Chair, Economics Department, CSU San Bernardino. These were written in Summer 2023.

Most of the modules draw extensively on Kevin Sheppard's e-book, *Introduction to Python for Econometrics, Statistics, and Data Analysis*, available here: https://bashtage.github.io/kevinsheppard.com/files/teaching/python/notes/python_introduction_2021.pdf. 

Rather than begin instruction in Python through the core tools of computer programming (such as conditions, loops, and functions), Sheppard begins with Python's major "containers", or data structures. Through practice, I have learned that this is an effective method for teaching Python to economics majors.

The learning objectives of this set of Instructional Modules are as follows. By the end of these modules, students will be able to...

1. Create data structures in Python based on economic data
1. Summarize the statistical properties of economic data (median, mean, max, min, correlation) using Python
1. Create and manage economic data: create new columns and rows, merge and append, and import data from .csv and .xlsx files into Python
1. Visualize economic data using line and scatterplots

The objectives/content of Module 1 are as follows. By the end of this module, students will be able to...

1. Use basic data structures to build economic datasets
1. Perform basic manipulations of data structures, such as adding, removing, and extracting elements
1. Calculate basic statistics using the native Python data structures

## 1.1 Data types: numeric, Boolean, and string

Chapter 2 of Sheppard's book introduces you to the built-in data types, or containers, that Python has available. In terms of numeric data, for the purposes of this class, you have floats and integers: 

**Note: to run the code block below, click anywhere in it and then hold the "Shift" key and press "Enter" or "Return":**

In [None]:
x=-10.7 # A float 
y=1 # An integer

In [None]:
print('x: ', type(x), '\ny: ', type(y))

The `print` function above is one of the most basic available to you. The **\n** tells Python to start a new line before printing y. It's not necessary, but it makes the printed output more readable. 

The `type` function returns the data type, or class, of the variable you specify. It is highly recommended that you use `type()` often to check a variable's data type.

Another type of data structure is the Boolean, or true/false type:

In [None]:
print(float == type(x))

print(float == type(y))

In [None]:
print(x > 0 and x<2)

print(x < 0)

print(x>0 and x>2)

print(y>0 or y>2)

## 1.2 Try it yourself

In the cell below, your task is to assign the numbers in 1-5 to different variables (Python data types) and check their type. In part 6, you will need to think creatively about a Boolean condition that returns **True** or **False** based on the condition specified. Run all of your code in a single cell and correct any errors. 

Use the following information, and call your variables whatever you want (for example, the first variable could be `avg_wage=28.03`):

1. Average hourly wage in the Inland Empire: 28.03
1. Unemployment rate in Orange County: 3.0
1. Number of counties in Southern California: 10
1. California Real GDP, Q3 2022 (trillions of dollars): 2.893
1. California Real GDP, Q3 2021 (trillions of dollars): 2.895
1. Check whether California Real GDP declined between Q3 2021 and Q3 2022 (hint: use Boolean logic to evaluate whether Q3 2022 GDP was less than Q3 2021 GDP)

In [None]:
#Try it yourself: insert code below



It might not be immediately obvious, but **Boolean data types are very important when working with economic data**, because they are built on the conditions you set for viewing or analyzing data, such as keeping only positive values or looking for values in a specific range (for example, greater than 0 and less than 2).
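As a minimal sketch of how a Boolean condition describes a subset of data, the cell below checks two growth rates against a sign and a range. The variable names and numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical quarterly growth rates (percent); illustrative values only
growth_q1 = 1.2
growth_q2 = -0.4

# A comparison evaluates to True or False
q1_positive = growth_q1 > 0
q2_positive = growth_q2 > 0

# Combine two comparisons with `and` to test for a range (between 0 and 2)
q1_in_range = growth_q1 > 0 and growth_q1 < 2

print(q1_positive, q2_positive, q1_in_range)
```

Each result is itself a Boolean, so it can be stored in a variable and reused later.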

## 1.3 String data types

Another common data structure that has a lot of surprising economic applications is the string:

In [None]:
a='economics'
b="sociology"
print(type(a))
print(type(b))

You can do a lot with strings in Python. One of the first things you learn is that strings in Python are zero-indexed, so if you want to pull out the first character, you need to remember it has an index location of zero. You can access other characters in the string using this zero-indexing as a starting point:

In [None]:
print(a[0])
print(a[3])
print(a[0:4])
print('Number of characters in {}: '.format(a), len(a))

Zero-indexing is tricky because, as you can see in this example, the 3-index position is the fourth character. 

You should also notice that **when extracting a substring, the slice stops one position before the end point you specify (the "4" in "0:4")**. So `a[0:4]` ends at the 4th character, 'n' (index 3), not the fifth. These are quirks of Python programming that you'll just have to get used to! On the other hand, this convention makes the length of the resulting string easy to compute: `a[0:4]` has 4-0 = 4 characters, and so on.
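The length rule can be checked directly. This is a small sketch using a made-up string; `word`, `start`, `stop`, and `piece` are hypothetical names.

```python
word = 'unemployment'  # hypothetical example string

start, stop = 2, 6
piece = word[start:stop]  # characters at index 2, 3, 4, 5 (index 6 is excluded)

print(piece)
print(len(piece) == stop - start)  # the slice has stop - start characters
```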

It's hard to show realistic examples at this stage, but slicing strings is a very common task when programming with economic data. Here is a very simple example of how we might extract the course number from a course listing: 

In [None]:
course1='Econ 3500'
course2='Econ 4300'

course1_number=course1[-4:]
course2_number=course2[-4:]

print(course1_number)
print(course2_number)

We will learn faster ways of doing this in later classes, but it is useful to know that negative values can be used as well: they start at the end of your string (which has location -1) and work backward from there (-2, -3, and so on).
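To illustrate negative indexing a bit further, here is a short sketch with a made-up period label. The string and variable names are hypothetical.

```python
period = '2022-Q3'  # hypothetical label, made up for illustration

last_char = period[-1]   # the last character
quarter = period[-2:]    # the last two characters
year = period[:-3]       # everything except the last three characters

print(last_char)
print(quarter)
print(year)
```

Leaving out the start or end of a slice means "from the beginning" or "to the end", respectively.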

## 1.4 Boolean variables applied to string data types

You can use Booleans to check if certain characters, or groups of characters, exist in strings:

In [None]:
print('e' not in 'economics')
print('econ' in 'economics')
print(a)
print('e' in a)
print('z' in a)

## 1.5 Try it yourself

In a general state-county FIPS code, the state code is the first two digits and the county code is the last three.

Take the following variables and extract the state or county FIPS code for each one. For example, if the variable is `san_bernardino="06071"` and you're asked to find the county FIPS code, create a new variable `sb_countyfips=san_bernardino[-3:]`

In [None]:
orange_county='06059' #Extract county FIPS below:


riverside_county='06065' #Extract county FIPS below:


san_diego_county='06073' #Extract state FIPS below:


los_angeles_msa='SMU063108400500000003' #Extract Metro code below (5 characters after '06'):


ie_msa='SMU064014000500000003' #Extract Metro code below (5 characters after '06'):



## 1.6 Data types: lists

Lists form the backbone of many of the economic data structures we will use in this course. The reason - as you will see in a moment - is that they look very much like arrays or columns of data that you might see in a spreadsheet. 

Lists are indexed just like strings and have many other useful properties:

In [None]:
x = [1.0, 2.0, 5.0, 7.0, 8.0] 
print(x)
print(type(x))

In [None]:
print(x[0])
print(x[0:2])
print(x[-1])
print(x)
print('x has {} elements'.format(len(x)))
print('x has {} elements and the last element is {}'.format(len(x), x[-1]))
print(1.0 in x)

One thing you might notice about the above code, which I also used earlier, is the `.format()` technique. This is referred to as a **method** in Python because it does something to the object in question: in this case, it formats the string by inserting something else into it, using {} to mark the place to insert. 

All data structures have methods in Python. You can look up what you can do with an object *x* using `print(dir(x))`.
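The output of `dir` includes many names that begin and end with double underscores, which Python uses internally. As an optional sketch, the cell below keeps only the remaining names. It relies on a list comprehension, a tool not covered in this module, so treat it as a preview rather than something you need to write yourself. The name `public_names` is hypothetical.

```python
x = [1.0, 2.0, 5.0]

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(x) if not name.startswith('_')]

print(public_names)
```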

There are several useful methods for lists, such as `append` and `pop`, used below:

In [None]:
x = [1.0, 2.0, 5.0, 7.0, 8.0] 

print(dir(x))

print(x)

x.append(9.0)
print(x)

x.pop(-1)
print(x)
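One detail worth knowing is that `pop` also returns the element it removes, so you can store it in a variable. The short sketch below uses made-up numbers; `rates`, `last`, and `first` are hypothetical names.

```python
rates = [4.5, 4.2, 4.1]  # hypothetical values

rates.append(2.8)     # adds 2.8 to the end of the list
last = rates.pop()    # with no argument, removes and returns the last element
first = rates.pop(0)  # removes and returns the element at index 0

print(last, first)
print(rates)
```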

# 2. More things to do with Python lists: methods, functions, and basic statistics

In addition to **methods**, Python has built-in **functions** that can be applied to data structures. I've already shown the `len` function, which counts the number of characters/elements, but other functions exist as well:

In [None]:
x = [11.0, 3.0, -2.0, -13.0, 6.0, 7.0, 8.0, 1.0, -3.0] 

print('Sum of values in x: ', sum(x)) #Adds up the elements of x

x_mean=sum(x)/len(x)
print(x_mean, sum(x)/len(x))

print('The mean of x is {}'.format(x_mean))

In [None]:
x.sort() #permanently sorts the list!
print(x)

x_min=x[0]
x_max=x[-1]
x_median=x[len(x)//2]

print(x_min, x_max, x_median)

In the above code, `//` performs floor (integer) division: it divides and discards the remainder. The list x has 9 elements, and 9 divided by 2 is 4 (remainder 1). So, x_median reports `x[4]` (which equals 3.0).
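To compare the division operators side by side, here is a small sketch using the same number of elements as the list above. The variable names are hypothetical.

```python
n = 9

quotient = n // 2   # floor division: keeps the whole-number part
remainder = n % 2   # modulo: keeps the remainder
ordinary = n / 2    # ordinary division: always returns a float

print(quotient, remainder, ordinary)
```

Because list positions must be integers, `//` is the operator to use when computing an index.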

## 2.1 Try it yourself: calculate the average unemployment rate in Southern California

Consider the unemployment statistics below and do the following with them:

1. Create a list out of them (just the unemployment rates - don't include separate lists for the FIPS or county) 
1. Check that your object is in fact a list (using `type`)
1. Calculate the maximum and minimum of the list and store each in a new variable
1. Calculate the average of the list
1. Calculate the median of the list (hint: with an even number of observations n, the median is the average of the (n/2)th and (n/2 + 1)th observations of the sorted data; remember that Python counts positions starting from zero)

06037 Los Angeles: 4.5 <br>
06065 Riverside: 4.2 <br>
06071 San Bernardino: 4.1 <br>
06079 San Luis Obispo: 2.8 <br>
06083 Santa Barbara: 3.2 <br>
06111 Ventura: 3.7 <br>
06029 Kern: 6.8 <br>
06059 Orange: 3.0 <br>
06073 San Diego: 3.3 <br>
06025 Imperial: 16.7

In [None]:
#Try it yourself... write your code below:



# 3. The last two data types: tuples, dictionaries, and their statistical applications

The last two important data structures for economics in Python are tuples and dictionaries.
First, the tuple:

In [None]:
y=(-4.0, -1.0, 3.0, 5.0, 5.0)
print(y)
print(type(y))

**Tuples are immutable**, which means that, unlike lists, you cannot alter them once they are created. You still have some useful methods, but in general, this data container is seen as more restrictive and therefore not ideal in many cases for holding data.
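A quick way to see immutability in action is to try changing an element. The sketch below uses `try`/`except`, which is not covered in this module, only so that the cell keeps running after the error. The names `changed` and `y_list` are hypothetical.

```python
y = (-4.0, -1.0, 3.0, 5.0, 5.0)

try:
    y[0] = 10.0  # attempting to overwrite an element of a tuple
    changed = True
except TypeError as err:
    changed = False
    print('Tuples cannot be changed:', err)

# One possible workaround: copy the values into a list, which can be altered
y_list = list(y)
y_list[0] = 10.0

print(y_list)
print(y)  # the original tuple is unchanged
```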

In [None]:
print(len(y))
print(dir(y)) # Apart from the names wrapped in double underscores, only .count() and .index() are methods
print(y.count(5.0)) # .count() is a method that returns the number of times that the specified value occurs in the tuple
print(y.index(-1.0)) # .index() returns the index of the first instance of the specified value

**Dictionaries** are like lists (and, indeed, often *contain* lists), except that the index (called a *key*) does not have to be an integer position: it can be any immutable value, such as a string:

In [None]:
z={'a': [1.0, 4.0, 3.0], 'b': [-2.0, -1.0, 3.5]}
print(z)
print(type(z))

You'll notice in the above example that **dictionaries have the potential to look very similar to a spreadsheet**, with a key as a column name and the ensuing list as the values of that column.
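To make the spreadsheet analogy concrete, the sketch below adds a new "column" to a small dictionary and computes the mean of each one. The keys and numbers are hypothetical, and the `for` loop is a tool not covered in this module, so treat it as a preview.

```python
# Hypothetical "spreadsheet": each key is a column name, each list a column
table = {'county_a': [4.0, 5.0, 6.0], 'county_b': [2.0, 3.0, 4.0]}

# Assigning to a new key adds a new column
table['county_c'] = [1.0, 2.0, 3.0]

# Compute the mean of each column and store it under the same name
means = {}
for name, values in table.items():
    means[name] = sum(values) / len(values)

print(means)
```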

In [None]:
print(z['a'])
print('a' in z.keys())
print(1.0 in z['a'])
print(1.0 in z['b'])
print(dir(z))

## 3.1 Example

Now that we have seen the main kinds of data structures in Python, let's take an example:

In [None]:
u_rate={'socal_counties': [4.5, 4.2, 4.1, 2.8, 3.2, 3.7, 6.8, 3.0, 3.3, 16.7],
                    'socal_metros': [4.2, 6.8, 16.7, 6.9, 4.1, 3.3, 2.8, 3.7]}
print(type(u_rate)) # Checks to make sure we have a dictionary
print(u_rate) # prints it out so we can check for accuracy

In [None]:
print(u_rate['socal_counties'])
print(u_rate['socal_metros'])

In [None]:
u_rate['socal_counties'].sort()
max_socal_counties=u_rate['socal_counties'][-1]
print(max_socal_counties)

In [None]:
print('Average county unemployment rate: ', sum(u_rate['socal_counties'])/len(u_rate['socal_counties']))

In [None]:
print('Average metro unemployment rate: ', sum(u_rate['socal_metros'])/len(u_rate['socal_metros']))
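The example above sorts the list in place before reading off the maximum. As one possible alternative, Python's built-in `max`, `min`, and `sorted` functions give the same information without changing the original list. The data and variable names below are hypothetical.

```python
rates = [4.5, 4.2, 16.7, 2.8]  # hypothetical unemployment rates

highest = max(rates)     # largest value
lowest = min(rates)      # smallest value
ordered = sorted(rates)  # returns a new sorted list

print(highest, lowest)
print(ordered)
print(rates)  # the original order is preserved
```

Whether to sort in place or work on a copy depends on whether the original order of the data matters later in your analysis.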

## 3.2 Try it yourself

With the data listed below, do the following: 

1. Create a dictionary that compares unemployment in coastal counties to inland counties in Southern California
1. Sort the data
1. Calculate and report the average, minimum, and maximum in each group 

To set up the dictionary, create two keys: `unemployment_socal={'coastal': [], 'inland': []}` and enter the data accordingly

Los Angeles: 4.5 [COASTAL] <br> 
Riverside: 4.2 [INLAND] <br>
San Bernardino: 4.1 [INLAND] <br>
San Luis Obispo: 2.8 [COASTAL] <br>
Santa Barbara: 3.2 [COASTAL] <br>
Ventura: 3.7 [COASTAL] <br>
Kern: 6.8 [INLAND] <br>
Orange: 3.0 [COASTAL] <br>
San Diego: 3.3 [COASTAL] <br>
Imperial: 16.7 [INLAND]

In [None]:
#Try it yourself...type your code below:




## Summary

Some of the code that we wrote appears cumbersome, but that's because **we're working with the very basic data structures of Python**. 

Soon we will learn more advanced data structures and tools that allow quick computation of summary statistics such as the mean, standard deviation, and correlation coefficient, as well as quick creation of scatter plots and other visualizations. 