# 1. Introduction: Series and DataFrames of economic data

Welcome to the SSRIC Instructional Modules for the project, "Teaching Statistics and Economic Data Analysis in Python with Jupyter Notebooks", by Daniel MacDonald, Associate Professor and Chair, Economics Department, CSU San Bernardino. These were written in Summer 2023.

Most of the modules draw extensively on Kevin Sheppard's e-book, *Introduction to Python for Econometrics, Statistics, and Data Analysis*, available here: https://bashtage.github.io/kevinsheppard.com/files/teaching/python/notes/python_introduction_2021.pdf. 

Rather than begin instruction in Python through the core tools of computer programming (such as conditions, loops, and functions), Sheppard begins with Python's major "containers", or data structures. Through practice, I have learned that this is an effective method for teaching Python to economics majors.

The learning objectives of this set of Instructional Modules are as follows. By the end of these modules, students will be able to...

1. Create data structures in Python based on economic data
1. Summarize the statistical properties of economic data (median, mean, max, min, correlation) using Python
1. Create and manage economic data: create new columns and rows, merge and append, and import data from .csv and .xlsx files into Python
1. Visualize economic data using line plots and scatterplots

The objectives/content of Module 3 are as follows. By the end of this module, students will be able to...

1. Build economic data from Pandas `Series` and `DataFrame` types
1. Use different methods to create "views" of, or "slice", economic data stored in Pandas data types
1. Calculate basic statistics on economic data using Pandas methods and functions

In [None]:
import numpy as np
import pandas as pd

## 1.1 Importing Pandas

Just as we did with Numpy in a previous lecture, we need to import the pandas library in order to use it. We can then begin to "panda-fy" our arrays, just as we did when we converted `list`s into Numpy `ndarray`s:

In [None]:
print('A Numpy array type: \n')

a=np.array([1.,2.,6.,-1.])
print(a)
print(type(a))

print('\nA Pandas series type: \n')

series=pd.Series(a)
print(series)
print(type(series))

## 1.2 Try it yourself: create a `series` based on a Numpy array

In the cell below, a Numpy array has been generated for you. Take the array and convert it into a Pandas `series` type using the code above:

In [None]:
np.random.seed(92407)
x=np.random.randint(50,100,7)
print(x)

#Convert x into a Pandas series-type and print it below:



# 2. The Pandas `series` type: it's about the structure

After printing the "try it yourself" example above, you should notice two things about the `series` format:

1. The `series` type, which is like a spreadsheet with 1 column, *automatically* converts your data into a column format
1. There is an index (similar to "row number" in Excel, but as we will see, even more flexible than that)

Pandas is all about giving additional structure to your data, which is why these two elements are so crucial.
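
These two elements can be inspected directly. The following is a minimal sketch, assuming a small made-up series (the name `demo` and its values are illustrative only and are not part of the modules): the `.values` attribute exposes the column of data, and `.index` exposes the row labels.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, for illustration only
demo = pd.Series(np.array([10., 20., 30.]))

print(demo.values)  # the column of data, stored as a Numpy array
print(demo.index)   # the default index: one integer label per row, starting at 0
```

Because no index was supplied, Pandas filled in a default one; section 2.2 shows how to supply your own.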

## 2.1 More "try it yourself"

After studying the examples below, try it yourself:

1. Access the second element of the series **x** generated above
1. Access the last row of the series **x** generated above
1. Print the total number of observations in the series (hint: how do you count the number of elements in a list, or characters in a string?)

In [None]:
print('First element/row of series: ', series[0])
print('Third element/row of series: ', series[2])

In [None]:
# Try it yourself.



## 2.2 Indexes

Indexes are more flexible than row numbers: they do not have to be numeric. For example, an index can consist of strings:

In [None]:
wage=pd.Series([30.8, 20.1, 44.0, 83.2, 25.6], index=['Bob', 'Ana', 'John', 'Belen', 'Alexis'])
print(wage)
print("Ana's wage:", wage['Ana'])

## 2.3 Try it yourself: create a Pandas series with informative indexes

Create a `series` type from the unemployment rate data below. The index should be the County name, and the column of data refers to the unemployment rates. **Each row should have a new county associated with it**:

Los Angeles: 4.5 <br>
Riverside: 4.2 <br>
San Bernardino: 4.1 <br>
Orange: 3.0 <br>
San Diego: 3.3 <br>
Imperial: 16.7

In [None]:
#Try it yourself. Add your code below:



# 3. Operations on series

You can square or divide a series, convert it to percentages, and perform many other mathematical operations on the series type. While you can also do this in Numpy, the benefit of Pandas is, again, the *structure* it gives to your data:

In [None]:
x=pd.Series(np.random.randint(1,10,5))
print(x)
y=x**2
print(y)
y[0]=100
print(y)
print(x) # Notice that x**2 produced a new Series - so x is not affected by y[0]=100
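
Whether a change to one variable affects another depends on whether the two names refer to the same object. The sketch below uses made-up values and the illustrative names `alias` and `copied` (neither appears in the modules) to contrast plain assignment with the `.copy()` method:

```python
import pandas as pd

x = pd.Series([1, 2, 3])  # hypothetical values, for illustration only

alias = x          # plain assignment: both names refer to the same Series
copied = x.copy()  # .copy() builds a separate Series with the same data

copied[0] = 100    # changes only the copy
print(x[0])        # still 1

alias[1] = 200     # changes x too, because alias and x are the same object
print(x[1])        # now 200
```

When in doubt, call `.copy()` before modifying a series you want to keep intact.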

In [None]:
print(x.mean())
print(x.sum()/x.count())
print('Maximum value in x:', x.max())
print('Minimum value in x: {}'.format(x.min()))
print(round(x.std(),2)) #Standard deviation rounded to 2 decimal places

In [None]:
print(x)
y=x/x.sum() # Just imagine that x is a series of frequencies and you want to convert it to a probability distribution...
print(y)
print(np.round(y,3)) #When you want to round an entire array, use `np.round` instead of `round()` (used above)

## 3.1 Try it yourself: compute summary stats for So Cal counties

In [None]:
socal_counties=pd.Series([4.5, 4.2, 4.1, 2.8, 3.2, 3.7, 6.8, 3.0, 3.3, 16.7], index=['Los Angeles', 'Riverside',
                                                                                    'San Bernardino', 'San Luis Obispo',
                                                                                    'Santa Barbara', 'Ventura', 'Kern',
                                                                                    'Orange', 'San Diego', 'Imperial'])
print(socal_counties)
#Try it yourself: using the above Series...
#...write two lines of code - one that prints the minimum SoCal unemployment rate, and one that prints the maximum:



## 3.2 Use `describe()` for other summary statistics

While you can pull a range of summary statistics in pandas, you can also simply `describe` the series, which will produce its own series of useful information. `describe` is another example of a method that we can perform on our objects:

In [None]:
print(socal_counties.describe())

described=socal_counties.describe()
print('\n\nAverage:', described['mean'])

## 3.3 A reminder to use the directory

If you ever want to see the full range of things you can do to an object like a `Series`, you can use the `dir` command. This is especially useful when you're starting to learn your way around the programming language. If you want to see everything you can do with a Pandas Series object, run the line of code below:

In [None]:
print(dir(socal_counties))
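
The full `dir` output is long and includes many internal names that begin with an underscore. One possible way to shorten it, sketched here with a made-up series `s` and an illustrative list name `public`, is to filter those names out:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])  # hypothetical values, for illustration only

# Keep only the names that do not start with an underscore
public = [name for name in dir(s) if not name.startswith('_')]

print(len(public))   # how many public attributes and methods remain
print(public[:10])   # a first look at the list
```

Once you have spotted a method of interest, Python's built-in `help` function (for example, `help(s.describe)`) prints its documentation.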

# 4. `DataFrame` type

Now that we've seen a bit of what a `Series` type can do, we can consider the `DataFrame`. A DataFrame is essentially multiple columns of data with the same structural benefits as the series, such as indexing.

The code below creates an 8x3 Numpy array. It then converts that array into a DataFrame and gives it some basic column names:

In [None]:
np.random.seed(92407)
df=pd.DataFrame(np.random.randint(30,60,(8,3)), columns=['a', 'b', 'c']) # 8 rows, 3 columns, with column names
print(df)
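
To confirm that the DataFrame keeps the dimensions of the array and that each column behaves like a series, you can check `.shape` and the type of a single column. This is a sketch that repeats the construction above under the illustrative names `arr` and `df_demo`:

```python
import numpy as np
import pandas as pd

np.random.seed(92407)
arr = np.random.randint(30, 60, (8, 3))             # 8 rows, 3 columns
df_demo = pd.DataFrame(arr, columns=['a', 'b', 'c'])

print(arr.shape)           # (rows, columns) of the Numpy array
print(df_demo.shape)       # the DataFrame reports the same dimensions
print(type(df_demo['a']))  # a single column is a Pandas Series
```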

## 4.1 Try it yourself

Take the Numpy array below and...

1. Convert it into a DataFrame
1. Label the columns `weekly_wage` and `loc_q` respectively
1. Print the DataFrame

In [None]:
wage_lq=np.array([[1342., 0.91], [732.,0.14],[793.,0.76],[977.,0.97],[840.,0.96],
                  [1084.,0.85],[1280.,0.93],[694.,0.75],[955.,0.76],[995.,0.95]])
print(wage_lq)

#Try it yourself. Write your code below:



## 4.2 Learning about your `DataFrame`

As you can see, DataFrames are a "step up" from the `Series` type and, indeed, we will have less need of the series now that we have the DataFrame. 

You can do many things and access much information with the DataFrame:

In [None]:
print(df.columns)

In [None]:
print(df)
print(df['a'])

In [None]:
print(df[['a', 'c']]) #When you want to print more than one column, enclose all column names in square brackets (as in a list)

In [None]:
print(df)
print(df.iloc[4]) #prints a row based on its integer position (here position 4, i.e. the fifth row)
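
With the default index, a row's position and its index label are the same number, so it is easy to confuse the two. The sketch below uses a made-up DataFrame (`df_demo`, with index labels 10, 20, 30 chosen purely for illustration) in which they differ:

```python
import pandas as pd

# Hypothetical data: the index labels (10, 20, 30) differ from the positions (0, 1, 2)
df_demo = pd.DataFrame({'a': [31, 45, 52]}, index=[10, 20, 30])

first_by_position = df_demo.iloc[0]  # the row at position 0
first_by_label = df_demo.loc[10]     # the row whose index label is 10
print(first_by_position.equals(first_by_label))  # both refer to the same row

try:
    df_demo.iloc[10]                 # there is no position 10 in a 3-row DataFrame
except IndexError as err:
    print('IndexError:', err)
```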

In [None]:
print(df.describe())

In [None]:
df.columns=['col 1', 'col 2', 'col 3'] #Renames your columns. There are other ways to do this!
print(df)

## 4.3 Try it yourself: add an index

Indexes are added to DataFrames just as they are to series. Add the names of the counties to the `wage_lq` DataFrame we generated above:

In [None]:
counties = ['Alameda', 'Alpine', 'Amador', 'Butte', 'Calaveras', 'Colusa', 'Contra Costa', 'Del Norte', 'El Dorado', 'Fresno']


wage_lq=np.array([[1342., 0.91], [732.,0.14],[793.,0.76],[977.,0.97],[840.,0.96],
                  [1084.,0.85],[1280.,0.93],[694.,0.75],[955.,0.76],[995.,0.95]])

#Try it yourself. Create a dataframe from the above and add an index based on the list of counties provided:



## 4.4 Slicing dataframes by index values

Notice that the code below uses a different format to create the DataFrame. We define the data as a `dict` type, where each key of the dictionary refers to a column.

You might have noticed earlier that we could "call" a column using `df['column name']` - this is kind of like how you'd call all the items associated with a particular dictionary key. The syntax for the Pandas DataFrame is very similar.

In [None]:
unemployment=pd.DataFrame({'state': ['California', 'California', 'Oregon', 'Oregon', 'Arizona'], 
                         'year': [2021, 2022, 2021, 2022, 2022],
                         'urate': [5.8, 4.1, 4.2, 4.5, 4.0]})
unemployment.set_index('state', inplace=True)
print(unemployment)
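
The parallel between a dictionary and a DataFrame can be made concrete. The following sketch uses only the 2022 values for California and Oregon from the data above; the names `data`, `frame`, and `indexed` are illustrative. It also uses `set_index` without `inplace=True`, in which case the method returns a new DataFrame rather than changing the original:

```python
import pandas as pd

# Two rows taken from the unemployment data above (2022 values)
data = {'state': ['California', 'Oregon'], 'urate': [4.1, 4.5]}
frame = pd.DataFrame(data)

print(data['urate'])   # dictionary lookup: returns the list stored under the key
print(frame['urate'])  # DataFrame lookup: returns the column as a Series

# Without inplace=True, set_index leaves `frame` unchanged and returns a new DataFrame
indexed = frame.set_index('state')
print(frame.columns.tolist())
print(indexed.columns.tolist())
```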

In [None]:
#print(unemployment.iloc['California']) # This will not work. 'iloc' uses index position (i.e., 0, 1, 2, etc.) not values
print(unemployment.loc['California'])
print(unemployment.loc['Oregon'])

In [None]:
#If you only want a specific column of data, you can refer to that and then access by .loc:

print(unemployment['urate'].loc['California'])
print('\n\n', unemployment[['year', 'urate']].loc['Oregon'])
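
As an alternative to selecting the column first, `.loc` also accepts a row label and a column label together, separated by a comma. This is a sketch of that form, rebuilding the same `unemployment` data so the block runs on its own (the names `ca_rates` and `oregon` are illustrative):

```python
import pandas as pd

unemployment = pd.DataFrame({'state': ['California', 'California', 'Oregon', 'Oregon', 'Arizona'],
                             'year': [2021, 2022, 2021, 2022, 2022],
                             'urate': [5.8, 4.1, 4.2, 4.5, 4.0]}).set_index('state')

# Row label first, then column label
ca_rates = unemployment.loc['California', 'urate']
print(ca_rates)  # California appears twice in the index, so two values come back

# A list of column labels selects several columns at once
oregon = unemployment.loc['Oregon', ['year', 'urate']]
print(oregon)
```

Both forms select the same data; which one to use is largely a matter of readability.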

## 4.5 Try it yourself: wage/location quotient dataset

Do the following tasks with the wage/location quotient dataset you created earlier (created for you below):

1. Summarize (`describe`) the dataset 
1. Print the "weekly_wage" column only
1. Print all information for Fresno County
1. Print the location quotient only, for Alameda County
1. Print all information for both Fresno and Amador Counties. (Challenge - Hint: similar to accessing information on two columns at once)

In [None]:
wage_lq=np.array([[1342., 0.91], [732.,0.14],[793.,0.76],[977.,0.97],[840.,0.96],
                  [1084.,0.85],[1280.,0.93],[694.,0.75],[955.,0.76],[995.,0.95]])
counties = ['Alameda', 'Alpine', 'Amador', 'Butte', 'Calaveras', 'Colusa', 'Contra Costa', 'Del Norte', 'El Dorado', 'Fresno']

df_wage_lq=pd.DataFrame(wage_lq, columns=['weekly_wage', 'loc_q'], index=counties)
print(df_wage_lq)

#Try it yourself: do the tasks 1-5 assigned above with this dataset

