In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("proj02.ipynb")

<table style="width: 100%;" id="nb-header">
    <tr style="background-color: transparent;"><td>
        <img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
    </td><td>
        <p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, Fall 2023<br>
            Dr. Eric Van Dusen<br>
        Umar Maniku<br>
        Akhil Venkatesh</p></td></tr>
</table>

# Project 2: The Cobb-Douglas Production Function

<!-- ## Due Dates and Important Information:

- This project is in three parts.
- The whole project (all three parts) will be due 2 Mondays from now, on March 16, 2020 at 11:59pm on Gradescope.
- **Do not change any of the given variable or function names as this would cause autograder problems. Make sure to name your columns and tables exactly as the questions ask you to.** -->

In [None]:
from utils import *
import pandas as pd
from datascience import *
from sympy import *
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns
from sklearn.linear_model import LinearRegression
%matplotlib inline

The goal of this project is to gain experience completing the following key steps in the data science pipeline:

1. Cleaning and filtering data collected from online sources
2. Identifying and visualizing overall trends in the data using a process called Exploratory Data Analysis (EDA)
3. Using the data to complete a problem of prediction

We hope that by the end of this project, you will see how the skills you have learned in Data 8 and this class can prepare you for dealing with real world datasets, and how you can use them to answer questions about the economy or the world.

The question you will be answering today is the following: 

> How can we apply the Cobb-Douglas Production Function to understand the different ways countries produce output or GDP?

## Part 1: Simplifying the Problem

Let's load in the data for this project. The cell below loads the data from `pwt100.csv` and saves it to the variable `data`. Take a look at the first 10 rows.

In [None]:
file_name = "pwt100.csv"
data = to_table(file_name)
data

Notice that there are a lot of `-1` values. This dataset uses `-1` to indicate missing data for a given country-year combination.
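
To see why these placeholder values matter, here is a minimal sketch on made-up numbers (the array below is hypothetical and not taken from `pwt100.csv`). It compares a mean computed with the `-1` entries left in against a mean computed after filtering them out.

```python
import numpy as np

# Hypothetical toy values standing in for one column of the dataset;
# -1 marks a missing observation.
values = np.array([-1.0, -1.0, 120.5, 131.2])

# The naive mean treats the -1 placeholders as real measurements.
naive_mean = values.mean()

# Keep only the entries that are actual observations.
valid_values = values[values != -1]
filtered_mean = valid_values.mean()

print(naive_mean, filtered_mean)
```

The naive mean is pulled far below the filtered one, which is why rows containing `-1` should be excluded before any calculation.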

To get an idea of the dataset's geographic scope, let us find out what countries are included, and if they are spelled in interesting ways. This will be helpful for us later on in our analysis.

**Question 1.1:** Create a two-column table called `all_countries`. Its first column will be called `Country` and the second `Earliest Year`. It should contain all of the countries that appear in the `data` table, sorted in alphabetical order, along with the earliest year in which each country's `cgdpe` value is not `-1`.

Hint: You may want to use `where` and `group`. 


In [None]:
all_countries = ...
all_countries

In [None]:
grader.check("q1_1")

Take a look at `data` again. Notice that it has a lot of columns, most of which we won't need. As we will be using the Cobb-Douglas production function, think about what variables are needed in the equation, and which ones are already present in the table. This is an important part of the data science process: understanding the dataset that you are using. Most real-world datasets provide documentation listing the definitions and equations behind each variable.

PWT provides this and has identified three variables that will be helpful to us:
1. `cn` $\Rightarrow$ Capital Stock in millions of USD
2. `cgdpe` $\Rightarrow$ Expenditure-side Real GDP in millions of USD
3. `emp` $\Rightarrow$ Number of Persons employed in millions

**An important note: for the sake of simplicity, we will be assuming that nations exhibit constant returns to scale.**

**Question 1.2:** Without assuming constant returns to scale, the Cobb-Douglas Production Function is given by

$$ Y = A K^\alpha L^\beta $$

Which variable in the Cobb-Douglas function represents the following? Fill in the blanks. 

- $Y$ measures the dollar value of goods and services produced in a country
- \_\_ measures total factor productivity: how effectively a country uses its inputs in producing output
- \_\_ measures the amount of capital in a country
- \_\_ measures the amount of labor in a country
- \_\_ measures how much emphasis is placed on capital 

Assign an array of letters corresponding to your answer to `q1_2` below. For example, `q1_2 = make_array('alpha', 'beta', 'K', 'L')`.


In [None]:
q1_2 = ...

In [None]:
grader.check("q1_2")

**Again, for the remainder of this project, we will assume nations exhibit constant returns to scale, unless otherwise specified.**   
**That is, we will have**

$$ Y = A K^\alpha L^{1-\alpha}$$
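
As a quick numerical illustration of what constant returns to scale means, the sketch below doubles both inputs and checks that output doubles as well. The helper `cobb_douglas` and all numeric values are hypothetical and chosen only for illustration.

```python
def cobb_douglas(A, K, L, alpha):
    """Return Y = A * K**alpha * L**(1 - alpha)."""
    return A * K**alpha * L**(1 - alpha)

# Hypothetical illustrative values, not taken from the dataset.
A, K, L, alpha = 2.0, 100.0, 50.0, 0.3

base_output = cobb_douglas(A, K, L, alpha)
doubled_output = cobb_douglas(A, 2 * K, 2 * L, alpha)

# Because the exponents sum to 1, scaling both inputs by 2 scales output by 2.
scale_ratio = doubled_output / base_output
print(scale_ratio)
```

The ratio equals 2 up to floating-point rounding because $2^\alpha \cdot 2^{1-\alpha} = 2$ for any $\alpha$.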



**Question 1.3:**
Assign the variable <code>missing_variables</code> to an array containing the Cobb-Douglas function variables that are missing from the dataset.


In [None]:
missing_variables = ...

In [None]:
grader.check("q1_3")

**Question 1.4:**
Remove all columns from `data` except for `cn`, `cgdpe`, `emp`, `country` and `year`. Ensure that `country` and `year` are the two left-most columns respectively. Call the new table `cleaned_data` and display its first five rows. Rename the `cn` column to `Capital Stock`, `cgdpe` to `Real GDP` and `emp` to `Labor Force`.


In [None]:
cleaned_data = ...

In [None]:
grader.check("q1_4")

Our goal will be to predict what $\alpha$ and $A$ are for each of the countries that we will be examining. From these, we will be able to explore how output is produced in each of these countries. The question is, how can we use the Cobb-Douglas Production function to solve for the missing variables? An easy way would be to take the natural log of the equation, making it linear, providing us with ways to quantify $\alpha$ and $A$.

<!-- BEGIN QUESTION -->

**Question 1.5:**
In the cell below, using LaTeX, take the natural log of the Cobb-Douglas Production Function and rewrite it as a **linear function** of one variable. Show all of your work. Full credit will not be given if you just display the final simplified equation without showing any work.

_Hint:_ Begin by taking natural log of both sides of the Cobb-Douglas Production Function, that is $\ln{Y} = \ln{(A K^\alpha L^{1-\alpha})}$.

Note: We do not officially cover LaTeX till Week 6: Utility, so if you are completing this project prior to that lecture, feel free to either skip this question and come back to it later, or learn something new!


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Part 2: Exploring the Data

Whenever we are dealing with a large dataset like PWT, it is a good idea to see how the variables interact with each other. A common method, especially when dealing with economic data, is to generate a time series. This is a plot of some variable on the $y$-axis and time on the $x$-axis. We are going to do just that for our important Cobb-Douglas variables for different countries.

In [None]:
# Run this cell to see a table of all the countries in the dataset.
# Select two countries and proceed to the next code question.
all_countries["Country"]

**Question 2.1:**
1. Create an array of four countries, the US, China, and two of your choice from the list above (though not Canada or Mexico), in **alphabetical order** and call this `country_array`.
2. Using this array, construct `comparison_data`, a table containing GDP, Capital Stock and Labor for these four countries in `country_array` from 1990 to 2019. 

_Use the cell above to check if your countries of choice exist in the dataset._ Make sure to use the country name exactly as it appears in the data table.

_Hint:_ Look at the <a href="https://ds-connectors.github.io/econ-fa20/python-reference.html">Python Reference</a> for a table function you can use.


In [None]:
country_array = make_array("China", "United States", ..., ...)
comparison_data = ...

In [None]:
grader.check("q2_1")

**Note:** If you ever need to refer to a list of the countries you selected in your code, do **not** use `country_array`. When you place the data in `comparison_data`, Python will automatically re-order the countries. Using `country_array` will cause a mismatch between the rows of data in `comparison_data` and the countries they actually come from.

<!-- BEGIN QUESTION -->

**Question 2.2:**
To help us in later questions, fill in the blanks in the `country_table_plotter` function below. Its inputs will be a table of the same form as `comparison_data` and the names of the two columns to plot, `columnX` and `columnY`. The `country_table_plotter` function will plot `columnX` versus `columnY` using data from `data_table` for all the countries in that table.

_Hint:_ Look at the <a href="https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html"> Matplotlib Plotting Reference</a> for ideas of what to place in the blanks.


In [None]:
def country_table_plotter(data_table, columnX, columnY):
    
    countries = ...
    for country in countries:
        current_country_table = ...
        plt.plot(..., ..., label = country, linewidth = 1.5)
    
    ### Do not change the code below ###
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize = 'x-large')
    plt.xlabel(columnX)
    plt.ylabel(f"{columnY} (Logarithmic Scale)")
    plt.grid()
    plt.yscale("log")
    
    return countries, columnX, columnY

<!-- END QUESTION -->

**Question 2.3:**
Produce a plot of time and capital stock for the countries in your table.


In [None]:
q2_3 = country_table_plotter(comparison_data, ..., ...)

In [None]:
grader.check("q2_3")

<!-- BEGIN QUESTION -->

**Question 2.4:**
Identify differences between the countries in your plot above and discuss what surprised you.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.5:**
Produce a similar plot, but this time of time and labor for the countries in your table.


In [None]:
q2_5 = country_table_plotter(comparison_data, ..., ...)

In [None]:
grader.check("q2_5")

<!-- BEGIN QUESTION -->

**Question 2.6:**
Identify differences between the countries in your plot above and discuss what surprised you.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.7:**
Create a plot of time and GDP for the countries in your table.


In [None]:
q2_7 = country_table_plotter(comparison_data, ..., ...)

In [None]:
grader.check("q2_7")

<!-- BEGIN QUESTION -->

**Question 2.8:**
Using your knowledge of the Cobb-Douglas Production Function, identify differences between the countries and discuss these in relation to your findings about each country's levels of capital stock and labor. Also note how these have changed over time, if at all.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Part 3: Prediction and Comparison

We are now going to provide numerical backing to your discussions of differences between the nations. We will predict values for $\alpha$ and $A$ for each of the 4 countries that we are examining. To do this, we will revisit the equation that you derived in Question 1.5.

**Question 3.1:**
A key step in the original paper by Cobb and Douglas was converting the data into an index. This is important because the variables are measured in different units. Do this for each country and each of the variables in `comparison_data`. Use 2011 as the base year, so that the 2011 value of each variable in each country equals 100. Place the results in a new table called `indexed_data` together with `country` and `year` columns.

_Hint:_ The formula for calculating an index is as follows:
$$
\dfrac{Q_{\text{current year}}}{Q_{\text{base year}}} \cdot 100 \, \text{ for some variable } Q
$$
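
The index formula can be sketched on a single made-up series as follows. The arrays `years` and `quantity` are hypothetical and only illustrate the arithmetic; they are not drawn from `comparison_data`.

```python
import numpy as np

# Hypothetical series for one variable in one country.
years = np.array([2009, 2010, 2011, 2012])
quantity = np.array([40.0, 45.0, 50.0, 55.0])

base_year = 2011

# Pick out the single value observed in the base year.
base_value = quantity[years == base_year].item()

# Divide every year's value by the base-year value and scale to 100.
index = quantity / base_value * 100
print(index)
```

By construction the base year maps to 100, and every other year is expressed as a percentage of the base-year level.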


In [None]:
# We need to sort comparison_data in alphabetical order because the countries array is in alphabetical order
comparison_data = comparison_data.sort('country')

base_values = ...
countries = ...
indexed_Ks = make_array()
indexed_Ls = make_array()
indexed_Ys = make_array()

for country in countries:
    ...
    
indexed_data = Table().with_columns(
    "country", ...,
    "year", ...,
    "Indexed K", ...,
    "Indexed L", ...,
    "Indexed Y", ...
)

indexed_data

In [None]:
grader.check("q3_1")

**Question 3.2:**
Using the equation you derived in Part 1 and the `indexed_data` table, calculate the two log ratios that you need to perform linear regression. Place them in the table `log_ratios` with `country` and `year` as the two leftmost columns respectively. The `log_ratios` table should have 4 columns: `country`, `year`, `ln(Y/L)`, and `ln(K/L)`.


In [None]:
log_ratios = ...
log_ratios

In [None]:
grader.check("q3_2")

The function `country_table_scatter` defined below takes in a data table, an $x$ column label, and a $y$ column label, and plots these columns against each other for each country in `data_table`.

In [None]:
def country_table_scatter(data_table, columnX, columnY):
    # First getting a list of all the countries in data_table
    country_list = data_table.group("country").column("country")
    
    # For each country, creating a scatter plot of columnX vs. columnY
    for country in country_list:
        curr_data_table = data_table.where("country", country)
        curr_data_table.scatter(columnX, columnY)
        plt.title(country)
        plt.grid()

<!-- BEGIN QUESTION -->

**Question 3.3.1:**
Using the `country_table_scatter` function provided, plot the log ratios for each country from the `log_ratios` table below.


In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.3.2:**
What do you notice about the scatter plots? How do they differ?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Extra Credit Question:**
Let's take a look at the scatter plot for the U.S. in particular. What do you notice as the value of ln(K/L) moves through the range from -0.10 to 0.00? What could be the reason behind the stagnation?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

To help you compare, the code below will plot all of the scatter plots on the same axes. Remember, the axes are in terms of logarithms. Thus, even small differences in slope or intercept correspond to large differences in the underlying levels.

In [None]:
all_scatter(log_ratios, "ln(K/L)", "ln(Y/L)")

Now, we want to make a linear approximation of the curves above. How will we do this? Through linear regression.

We will be using NumPy's `polyfit` function to get the $\alpha$ and $A$ values for each country. You will learn more about linear regression later in Data 8, but think of it as fitting a line to a set of data points. The `polyfit` function returns the slope and intercept of such a line. See the <a href = "https://numpy.org/doc/1.18/reference/generated/numpy.polyfit.html">documentation for the function</a> for details.

Let us break down how this function works by taking a look at the examples section of the documentation:

```python
import numpy as np
x = np.array([0.0, 1.0, 2.0, 3.0,  4.0,  5.0])
y = np.array([0.0, 0.8, 0.9, 1.0, 2.0, 3.0])
model = np.polyfit(x, y, 1)
```

The first argument to `np.polyfit()` is the array of $x$ values, and the second argument is the array of $y$ values we wish to estimate. The last argument specifies the degree of the polynomial we wish to fit. A degree-1 polynomial takes the form $y = a + bx$, and a degree-2 polynomial takes the form $y = a + bx + cx^2$.

`np.polyfit()` returns an array of the fitted polynomial's coefficients, ordered from the highest power to the lowest. For a degree-1 polynomial the array contains two elements (the slope, then the intercept); for a degree-2 polynomial it contains three. Let's look at the structure of the array:

```python
>>> model
array([ 0.53428571, -0.05238095])
>>> model.item(0) # The slope term
0.53428571
>>> model.item(1) # The intercept term
-0.05238095
```

From this, we can construct the equation of the regression line as a function of $x$: $y = -0.05238095 + 0.53428571x$. 
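
As a small sketch building on the same example data, the fitted coefficients can be unpacked directly and passed to `np.polyval` to evaluate the regression line. The variable names `fitted`, `residuals`, and `quadratic` are introduced here only for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 1.0, 2.0, 3.0])

# Coefficients come back highest power first: slope, then intercept.
slope, intercept = np.polyfit(x, y, 1)

# Evaluate the fitted line at each x; equivalent to intercept + slope * x.
fitted = np.polyval([slope, intercept], x)

# Differences between the observed values and the fitted line.
residuals = y - fitted

# A degree-2 fit returns three coefficients instead of two.
quadratic = np.polyfit(x, y, 2)
print(slope, intercept, len(quadratic))
```

Unpacking into `slope, intercept` is just a more readable alternative to `model.item(0)` and `model.item(1)`.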

Take a look at the derivation for the Cobb-Douglas production function we did in Part 1 and think about what kind of polynomial we want to fit on our data.

**Question 3.4:**
Using `np.polyfit`, fit the data taken from each country. 

_Hint:_ Fill in the `<country>_x` and `<country>_y` arrays to make your life easier. You will need to call `np.polyfit` four times (once for each country).


In [None]:
# Create arrays of the data we will need from each country.
# Think about which variable should be on each axis.

china_x = ...
china_y = ...
us_x = ...
us_y = ...
country1_x = ...
country1_y = ...
country2_x = ...
country2_y = ...

model_china = np.polyfit(..., ..., ...)
model_us = np.polyfit(..., ..., ...)
model_country1 = np.polyfit(..., ..., ...)
model_country2 = np.polyfit(..., ..., ...)

In [None]:
grader.check("q3_4")

In [None]:
# for your reference
print(country_array)

**Question 3.5:**
Now that we have fit the data of each country, we can retrieve the slope and intercept of each fit. Using the equation you derived in Question 1.5, fill in the blanks in the print statements below such that they display the $\alpha$ and $A$ values for each country. Note that you will need to transform at least one of the variables.

**Note:** Python has special strings called **f-strings** where it fills in the value of a variable for you. For example:

```python
>>> arr = make_array(1, 2, 3)
>>> print(f"The second element of arr is {arr.item(1)}")
The second element of arr is 2
```
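
f-strings also accept an optional format specifier after a colon, which can be handy for rounding long decimals when printing. The sketch below uses hypothetical names (`label`, `value`, `message`) purely for illustration.

```python
# Hypothetical values used only to demonstrate the formatting syntax.
label = "slope"
value = 0.53428571

# ":.3f" formats the number with three digits after the decimal point.
message = f"The {label} is {value:.3f}"
print(message)
```

The specifier only changes how the number is displayed; the variable itself keeps its full precision.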

Fill `alpha_array` and `A_array` with the correct values from your `model_*` arrays. **Make sure their elements are in the same order as in the** `country_array`. Then, replace the `...` inside the curly braces of the f-string print statements. You can use the example above as a reference.


In [None]:
alpha_array = make_array(...)
A_array = make_array(...) # you may want to use np.exp here
for i in np.arange(len(country_array)):
    print(f"{country_array.item(i)} alpha value: {...}")
    print(f"{country_array.item(i)} A value: {...}")
    print()

In [None]:
grader.check("q3_5")

**Question 3.6:**
What do you notice about USA's $\alpha$ value? What does this say about our model's assumption of constant returns to scale?

<ol type="A" style="list-style-type: lower-alpha;">
    <li>USA's $\alpha$ value is greater than 1, and it violates our model's assumption. </li>
    <li>USA's $\alpha$ value is greater than 1, and it does not violate our model's assumption. </li>
    <li>USA's $\alpha$ value is less than 1, and it violates our model's assumption. </li>
    <li>USA's $\alpha$ value is less than 1, and it does not violate our model's assumption. </li>
</ol>

Assign a letter corresponding to your answer to `q3_6` below. For example, `q3_6 = 'a'`.


In [None]:
q3_6 = ...

In [None]:
grader.check("q3_6")

<!-- BEGIN QUESTION -->

**Question 3.7:**
With reference to the $\alpha$ and $A$ values for each of the countries you have examined, what do they indicate about each country's ability to produce output as measured through GDP? Compare and contrast how each country allocates capital and labor when producing output. How about the role of technology or research and development? 3-4 sentences should suffice.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**That's all! Hope you had fun analyzing the vast amount of data.**

---

**Acknowledgements:**
We would like to thank Professor Raymond Hawkins for his Economics 100B Problem Set that served as the basis for this assignment.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)