# Data processing and empirical modelling

## Processing data in Julia

There are several ways to process data in Julia. We'll be using elements of [Tidier.jl](https://tidierorg.github.io/Tidier.jl/dev/), which mirrors a subset of the popular [Tidyverse](https://www.tidyverse.org/) collection of R packages.

In [None]:
using Tidier

### Reading in datasets

In [None]:
sales = read_csv("datasets/sales.csv")

In [None]:
customers = read_csv("datasets/customers.csv")

### Selecting variables

In [None]:
sls = @chain sales begin
    @select(Product_ID,Product_Name,Unit_Price,Revenue,Units_Sold)
end

In [None]:
# Equivalent call without @chain: pass the dataframe as the first argument
@select(sales, Product_ID,Product_Name,Unit_Price,Revenue,Units_Sold)

In [None]:
@chain sales begin
    @select(-Customer_ID) # exclude a column
end

In [None]:
@chain sales begin
    @select(Category:Revenue) # select a subset of columns using a slice
end

### Filtering and slicing data. Renaming variables

In [None]:
@chain sales begin
    @filter(Category == "Electronics", Revenue >= 100000)
    @rename(Price = Unit_Price, Units = Units_Sold)
end

In [None]:
@chain sales begin
    @slice(1:2,5)
end

### Mutating variables

In [None]:
sls = @chain sls begin
    @mutate(P = Revenue/Units_Sold)    # implied unit price
    @mutate(check = P - Unit_Price)    # zero if the implied price matches Unit_Price
end

In [None]:
@chain sls begin
    @transmute(problem = check != 0) # flag rows where the implied price differs from Unit_Price
end

### Joins

In [None]:
# left_join: first dataframe is the "leading" one

@left_join(sales,customers)

In [None]:
# right_join: second dataframe is the "leading" one

@right_join(sales, customers) 

In [None]:
# inner_join: keeps only rows with keys present in both dataframes; here the result happens to be the same as the left_join

@inner_join(sales, customers) 

In [None]:
# Inside a chain, the second dataframe has to be interpolated: note the @eval macro and the dollar sign

@eval @chain sales begin
    @left_join($customers) # the common key is automatically discovered
end

### Groupby and summarize operations

In [None]:
@chain sales begin
    @group_by(Category)
    @mutate(avgP = mean(Unit_Price))
    @ungroup()
end

In [None]:
@chain sales begin
    @summarize(maxrev = maximum(Revenue), minP = minimum(Unit_Price))
end

In [None]:
@chain sales begin
    @group_by(Customer_ID)
    @summarize(maxrev = maximum(Revenue), minP = minimum(Unit_Price))
end

### Pivot_wider and pivot_longer

In [None]:
@chain sales begin
    @select(Product_Name, Units_Sold)
    @pivot_wider(names_from = Product_Name, values_from = Units_Sold)
end

In [None]:
@chain sales begin
    @select(Product_ID, Category, Units_Sold)
    @pivot_wider(names_from = Category, values_from = Units_Sold)
end

In [None]:
@chain sales begin
    @select(Customer_ID, Unit_Price, Units_Sold)
    @pivot_longer(Unit_Price:Units_Sold, names_to = "type", values_to = "value")
end

# Growth accounting

Consider a two-factor Cobb-Douglas production function of the form
$$ Y_t = A_t K_t^\alpha L_t^{1-\alpha} $$

Here $Y_t$ is output (real GDP) in period $t$, $L_t$ is employment, $K_t$ is the capital stock and $A_t$ is total factor productivity (technology). The parameter $\alpha$ is assumed to be known or is obtained through a separate calculation. Typical estimates are derived from national accounts data and can range from 0.3 to 0.45.

Data for $Y_t$ and $L_t$ is directly available from statistical sources (e.g. Eurostat). 

While data on $K_t$ is not published directly, there are techniques to reconstruct the capital stock under appropriate assumptions.

Then TFP can be computed as a residual by taking logs of the production function:
$$ \ln Y_t = \ln A_t + \alpha \ln K_t + (1-\alpha) \ln L_t ~\Rightarrow~ a_t = y_t - \alpha k_t -(1-\alpha)l_t,$$
where lower-case letters denote logs. This is known as the **Solow residual**.
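
As a rough illustration of the residual calculation, the sketch below applies the log formula to a few made-up numbers. The series `Y`, `K`, `L` and the value of `α` are purely hypothetical and are not taken from any statistical source.

```julia
# Hypothetical illustrative numbers, not actual statistical data
Y = [100.0, 104.0, 109.0]   # real output
K = [300.0, 309.0, 320.0]   # capital stock
L = [50.0, 50.5, 51.0]      # employment
α = 0.35                    # assumed capital share

# Solow residual in logs: a_t = y_t - α k_t - (1-α) l_t
a = log.(Y) .- α .* log.(K) .- (1 - α) .* log.(L)

# The first difference of a_t approximates the growth rate of TFP
tfp_growth = diff(a)
```

Exponentiating `a` recovers the level $A_t$, so `exp.(a) .* K.^α .* L.^(1 - α)` should reproduce `Y` up to rounding.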

Because a production function is only an approximation of the real world, $A_t$, measured as a residual, also picks up many other influences, so it cannot be interpreted directly as a measure of technology. This makes it interesting in its own right and a subject of separate analysis.

In [None]:
using Plots # Import here to speed up module loading

## Data retrieval

In [None]:
using HTTP

# Download both Eurostat datasets used below:
# national accounts (nama_10_gdp) and population and employment (nama_10_pe)
for dataset_id in ("nama_10_gdp", "nama_10_pe")
    url = "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/$(dataset_id)/?format=SDMX-CSV&compressed=true"
    local_file = "datasets/$(dataset_id).csv.gz"

    response = HTTP.get(url)
    open(local_file, "w") do file
        write(file, response.body)
    end
end

## Process data on GDP and components

In [None]:
natl_acc = read_csv("datasets/nama_10_gdp.csv.gz")

In [None]:
nat_acc = @chain natl_acc begin
    @select(unit, na_item, geo, TIME_PERIOD,OBS_VALUE)
    @rename(year=TIME_PERIOD, value=OBS_VALUE)
end

In [None]:
country = "SE"

country_nat_acc = @eval @chain nat_acc begin
    @filter(geo == $country, unit == "CLV10_MNAC", na_item in ("B1GQ", "P51G"))
    @select(na_item,year,value)
    @drop_missing()
    @pivot_wider(names_from = na_item, values_from = value)
    @rename(gdp = B1GQ, inv = P51G)
end

## Process data on employment

In [None]:
labour = read_csv("datasets/nama_10_pe.csv.gz")

In [None]:
lab = @chain labour begin
    @select(unit, na_item, geo, TIME_PERIOD,OBS_VALUE)
    @rename(year=TIME_PERIOD, value=OBS_VALUE)
end

In [None]:
country_lab = @eval @chain lab begin
    @filter(geo == $country, unit ==  "THS_PER", na_item == "EMP_DC")
    @select(year,value)
    @drop_missing()
    @rename(emp = value)
end

## Combine output, investment and employment

In [None]:
gracc = @eval @chain country_nat_acc begin
    @inner_join($country_lab)
end

### Perpetual inventory method

Typically, there is no data on the capital stock in an economy, so it has to be constructed or estimated under certain assumptions. The **perpetual inventory method** provides one possible approach. This method relies on the accumulation of past investments while accounting for depreciation, ensuring a continuous update of the capital stock over time.

The capital evolution equation is the standard one

$$ K_{t+1} = (1-\delta)K_t + I_t ,$$

where $K_t$ is the capital stock at the beginning of period $t$, $I_t$ is real investment (gross fixed capital formation) during the period and $\delta$ is the depreciation rate.

To start the recursion, we need an estimate of the capital stock in some initial period. The simplest approach is to choose an initial period and apply the formula

$$ K_0 = \frac{I_0}{\bar{\gamma}_I + \delta}, $$

where $\bar{\gamma}_I$ is the long-term average growth of investment.

The above formula follows from assuming that the economy is on a steady-state path. Along such a path GDP and capital grow at the same rate, so the growth rate of capital on the left-hand side below equals that of GDP. The capital evolution equation then implies that

$$ \frac{K_{t+1}-K_t}{K_t} = \frac{I_t}{K_t} - \delta \qquad \Rightarrow \qquad K_t = \frac{I_t}{\gamma_{Y} + \delta}, $$

where $\gamma_{Y}$ is the growth rate of GDP (and therefore of capital).
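
As a quick numerical check with made-up values (not data), suppose $I_t = 100$, $\gamma_Y = 0.03$ and $\delta = 0.05$. Then

$$ K_t = \frac{100}{0.03 + 0.05} = 1250, \qquad K_{t+1} = (1-0.05)\cdot 1250 + 100 = 1287.5, \qquad \frac{K_{t+1}-K_t}{K_t} = \frac{37.5}{1250} = 0.03, $$

so the implied capital stock does grow at the assumed steady-state rate.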

Further research has argued that this approach can be extended to apply even away from the steady state if $\gamma_{Y}$ is replaced by the long-term average growth of investment $\bar{\gamma}_I$, which yields the formula for $K_0$ above.
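
The method described above can be sketched as a small stand-alone function. This is only a minimal illustration: the helper name `perpetual_inventory` and the investment series are hypothetical, and the sketch assumes that $\bar{\gamma}_I + \delta > 0$ so that the initial stock is well defined.

```julia
using Statistics  # standard library, provides mean

# Hypothetical helper: build a capital stock series from a real investment series
function perpetual_inventory(investment::AbstractVector{<:Real}, δ::Real)
    n = length(investment)
    # Long-term average growth rate of investment
    γ_I = mean(investment[t] / investment[t-1] - 1 for t in 2:n)
    K = zeros(Float64, n)
    K[1] = investment[1] / (γ_I + δ)                 # initial stock K_0
    for t in 2:n
        K[t] = (1 - δ) * K[t-1] + investment[t-1]    # K_{t+1} = (1-δ)K_t + I_t
    end
    return K
end

# Made-up investment series growing at exactly 3% per period
investment = [100.0 * 1.03^t for t in 0:9]
K = perpetual_inventory(investment, 0.05)
```

With investment growing at a constant rate, the constructed capital stock grows at that same rate, which is the steady-state case the initial-stock formula is derived from.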

In [None]:
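# Capital share α: gross operating surplus and mixed income (B2A3G) relative to GDP (B1GQ), at current prices
country_coefs = @eval @chain nat_acc begin
    @filter(geo == $country, unit == "CP_MEUR", na_item in ("B1GQ", "B2A3G"))
    @select(na_item,year,value)
    @drop_missing()
    @pivot_wider(names_from = na_item, values_from = value)
    @rename(gdp = B1GQ, gos = B2A3G)
    @mutate(alpha = gos/gdp)
end

# Use the sample average as the estimate of α
α = country_coefs.alpha |> mean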

In [None]:
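δ = 0.05                  # assumed annual depreciation rate
inv = gracc[:,:inv]       # real investment series
γ_I = mean([inv[t]/inv[t-1]-1 for t in 2:length(inv)])  # average growth rate of investment
K0 = inv[1]/(γ_I+δ)       # initial capital stock

# Accumulate capital: K_{t+1} = (1-δ)K_t + I_t
K = zeros(size(inv))
K[1] = K0
for t in 2:length(inv)
    K[t] = (1-δ)*K[t-1] + inv[t-1]
end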

ga = @eval @chain gracc begin
    @mutate(
        y = Base.log(gdp),
        k = Base.log($K),
        l = Base.log(emp)
    )
    @select(year,y,k,l)
    @mutate(a = y - $α*k - (1-$α)*l)
end

In [None]:
using Plots
plot(ga.year, ga.a, legend = false, ylabel = "a")

In [None]:
α_cal = 0.4 # calibrated capital share, as an alternative to the national accounts estimate
δ = 0.05
inv = gracc[:,:inv]
γ_I = mean([inv[t]/inv[t-1]-1 for t in 2:length(inv)])
K0 = inv[1]/(γ_I+δ)

K = zeros(size(inv))
K[1] = K0
for t in 2:length(inv)
    K[t] = (1-δ)*K[t-1] + inv[t-1]
end

ga_cal = @eval @chain gracc begin
    @mutate(
        y = Base.log(gdp),
        k = Base.log($K),
        l = Base.log(emp)
    )
    @select(year,y,k,l)
    @mutate(a = y - $α_cal*k - (1-$α_cal)*l)
end



In [None]:
using Plots
plot(ga_cal.year, ga_cal.a, legend = false, ylabel = "a")