#              Lecture 12                  
                                           
## Simple (Linear) Regressions             
   - multiple graphs and descriptive statistics
   - Scatterplots                        
       - to decide functional form       
       - to decide outcome variable      
   - Simple, nonlinear models:           
       - models with log                 
       - polynomials                     
       - piecewise linear spline         
       - extra: weighted OLS             
   - Residual analysis                   
       - with multiple annotations       
                                           
#### Case Study:                               
-  Life-expectancy and income               

___

Import packages

In [None]:
import warnings

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from plotnine import *
from skimpy import skim
from stargazer.stargazer import Stargazer

%matplotlib inline
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("data/WDI_lifeexp_clean.csv")

df.head()

In [None]:
skim(df)

Good to know: a quick check of the histograms of all variables

In [None]:
df.hist()

In [None]:
df.describe()

Create new variable: Total GDP = GDP per Capita * Population


    Note: we could have downloaded a separate total GDP variable for this,
    but for comparison purposes let us use the exact same data and
    concentrate on differences that are due only to transforming the variables.

In [None]:
df["gdptot"] = df["gdppc"] * df["population"]

### Check basic scatter-plots!

Two competing models:
- A) lifeexp = alpha + beta * gdptot
- B) lifeexp = alpha + beta * gdppc

Where to use log-transformation? - level-level vs level-log vs log-level vs log-log

Create the following graphs with loess:

#### Model A) lifeexp = alpha + beta * gdptot
1) lifeexp - gdptot: level-level model without scaling

In [None]:
(
    ggplot(df, aes(x="gdptot", y="lifeexp"))
    + geom_point()
    + geom_smooth(method="loess", color = "blue")
    + labs(x="Total GDP (2017 int. const. $, PPP )", y="Life expectancy  (years)")
    + theme_bw()
)

2) Change the scale for Total GDP for checking log-transformation


In [None]:
(
    ggplot(df, aes(x="gdptot", y="lifeexp"))
    + geom_point()
    + geom_smooth(method="loess", color = "blue")
    + labs(x="Total GDP (2017 int. const. $, PPP )", y="Life expectancy  (years)")
    + scale_x_log10()
    + theme_bw()
)

3) Change the scale for Total GDP and life-expectancy for checking log-transformation

In [None]:
(
    ggplot(df, aes(x="gdptot", y="lifeexp"))
    + geom_point()
    + geom_smooth(method="loess", color="blue")
    + labs(x="Total GDP (2017 int. const. $, PPP )", y="Life expectancy  (years)")
    + scale_x_log10()
    + scale_y_log10()
    + theme_bw()
)

#### Model B) lifeexp = alpha + beta * gdppc:

4) lifeexp - gdppc: level-level model without scaling

In [None]:
(
    ggplot(df, aes(x="gdppc", y="lifeexp"))
    + geom_point()
    + geom_smooth(method="loess", color = "blue")
    + labs(x="GDP/capita (2017 int. const. $, PPP )", y="Life expectancy  (years)")
    + theme_bw()
)

5) Change the scale for GDP/capita for checking log-transformation

In [None]:
(
    ggplot(df, aes(x="gdppc", y="lifeexp"))
    + geom_point()
    + geom_smooth(method="loess", color = "blue")
    + labs(x="GDP/capita (2017 int. const. $, PPP )", y="Life expectancy  (years)")
    + scale_x_log10()
    + theme_bw()
)

 6) Change the scale for GDP/capita and life-expectancy for checking log-transformation

In [None]:
(
    ggplot(df, aes(x="gdppc", y="lifeexp"))
    + geom_point()
    + geom_smooth(method="loess", color="blue")
    + labs(x="GDP/capita (2017 int. const. $, PPP )", y="Life expectancy  (years)")
    + scale_x_log10()
    + scale_y_log10()
    + theme_bw()
)

You should reach the following conclusions:
  1) taking the log of _gdptot_ is needed, but a non-linear pattern remains in the data, so we need to rely on the 'approximation' interpretation

      - it is feasible to check, and we do it here to learn how, 
          but in practice I would skip this -> it over-complicates the analysis 
  2) using _gdppc_ without logs is possible, but then we need to model the non-linearity in the data 
  
      - Substantive: level changes are harder to interpret, and our aim is not a $-based comparison
      - Statistical: the log transformation gives a much better approximation and simplifies the model!
  3) taking the log of _gdppc_ makes the association close to linear!
  4) taking the log of _life-expectancy_ does not matter -> use levels!
  
      - Substantive: it does not give a better interpretation
      - Statistical: you can compare models with the same y, and the fit is no better
      - Remember: the simpler the better!
      
___

Create new variables 
   
   _ln_gdppc  = log of GDP per capita \
   ln_gdptot = log of total GDP_  

Take the log of GDP per capita and of total GDP

In [None]:
df["ln_gdppc"] = np.log(df["gdppc"])
df["ln_gdptot"] = np.log(df["gdptot"])
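To make the level-log reading concrete, here is a minimal sketch on synthetic data (the numbers are made up and are not the WDI data). In a model y = alpha + beta * ln(x), an x that is 1% higher goes together with a y that is roughly beta / 100 units higher.

```python
import numpy as np

# Synthetic data (assumption): y depends linearly on ln(x)
rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=500)
y = 60 + 5 * np.log(x) + rng.normal(0, 1, size=500)

# Fit y = alpha + beta * ln(x) by least squares
beta, alpha = np.polyfit(np.log(x), y, deg=1)

# Exact and approximate difference in y when x is 1% higher
exact_1pct = beta * np.log(1.01)
approx_1pct = beta / 100
print(round(beta, 2), round(exact_1pct, 4), round(approx_1pct, 4))
```

The approximation works well for small percentage differences; for large ones it is safer to use `beta * np.log(1 + p)` directly.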

Run the following competing models:

    with ln_gdptot:
    reg1: lifeexp = alpha + beta * ln_gdptot
    reg2: lifeexp = alpha + beta_1 * ln_gdptot + beta_2 * ln_gdptot^2
    reg3: lifeexp = alpha + beta_1 * ln_gdptot + beta_2 * ln_gdptot^2 + beta_3 * ln_gdptot^3
 
    with ln_gdppc:
    reg4: lifeexp = alpha + beta * ln_gdppc
    reg5: lifeexp = alpha + beta_1 * ln_gdppc + beta_2 * ln_gdppc^2
    reg6: lifeexp = alpha + beta_1 * ln_gdppc * 1(gdppc < 50) + beta_2 * ln_gdppc * 1(gdppc >= 50)
    
    Extra: weighted-ols:
    reg7: lifeexp = alpha + beta * ln_gdppc, weights: population

Two ways to handle polynomials: 

 1) Add powers of the variable(s) to the dataframe:

In [None]:
df["ln_gdptot_sq"] = df["ln_gdptot"] ** 2
df["ln_gdptot_cb"] = df["ln_gdptot"] ** 3
df["ln_gdppc_sq"] = df["ln_gdppc"] ** 2

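2) You can also use powers inside formulas, but wrap them in `I()`, e.g. `I(ln_gdptot**2)`: a bare `**` is read as formula syntax, not as arithmetic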
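A minimal sketch of the second option, on synthetic data (the `toy` dataframe and its columns are made up for illustration). It checks that a power wrapped in `I()` inside the formula gives the same estimates as a precomputed squared column.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption), only used to compare the two approaches
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.uniform(0, 5, size=200)})
toy["y"] = 1 + 2 * toy["x"] - 0.3 * toy["x"] ** 2 + rng.normal(0, 0.5, size=200)

# Option 1: add the squared variable to the dataframe first
toy["x_sq"] = toy["x"] ** 2
fit_column = smf.ols("y ~ x + x_sq", data=toy).fit()

# Option 2: compute the power inside the formula, protected by I()
fit_formula = smf.ols("y ~ x + I(x**2)", data=toy).fit()

print(fit_column.params.values)
print(fit_formula.params.values)
```

Both fits should report the same three coefficients; only the name of the squared term differs in the summary table.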

### Do the regressions

Using statsmodels formula api \
Reminder: formula: _y ~ x1 + x2 + ..._, note: intercept is automatically added

In [None]:
reg_b = smf.ols("lifeexp ~ ln_gdptot",data = df).fit()
reg_b.summary()

First model

In [None]:
reg1 = smf.ols("lifeexp ~ ln_gdptot", data=df).fit(cov_type ="HC3")
reg1.summary()
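The only difference between `reg_b` and `reg1` is the `cov_type` argument. As a rough illustration on synthetic heteroskedastic data (made up here, not the life-expectancy data), the sketch below shows that `cov_type="HC3"` leaves the coefficients unchanged and only changes the standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic heteroskedastic data (assumption): the noise grows with x
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x": rng.uniform(1, 10, size=300)})
toy["y"] = 2 + 0.5 * toy["x"] + rng.normal(0, 1, size=300) * toy["x"]

fit_default = smf.ols("y ~ x", data=toy).fit()               # classical SEs
fit_robust = smf.ols("y ~ x", data=toy).fit(cov_type="HC3")  # robust SEs

# Same coefficients, different standard errors
comparison = pd.DataFrame(
    {
        "coef": fit_default.params,
        "se_default": fit_default.bse,
        "se_HC3": fit_robust.bse,
    }
)
print(comparison)
```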

Visual inspection:

In [None]:
(
    ggplot(df, aes(x="ln_gdptot", y="lifeexp"))
    + geom_point(color="blue")
    + geom_smooth(method="lm", color="red")
    + theme_bw()
)

In [None]:
reg2 = smf.ols("lifeexp ~ ln_gdptot + ln_gdptot_sq", data=df).fit(cov_type ="HC3")
reg2.summary()

In [None]:
(
    ggplot(df, aes(x="ln_gdptot", y="lifeexp"))
    + geom_point(color="blue")
    + geom_smooth(method="lm",formula = "y ~ x + np.square(x)", color="red")
    + theme_bw()
)

In [None]:
reg3 = smf.ols("lifeexp ~ ln_gdptot + ln_gdptot_sq + ln_gdptot_cb", data=df).fit(cov_type ="HC3")
reg3.summary()

In [None]:
(
    ggplot(df, aes(x="ln_gdptot", y="lifeexp"))
    + geom_point(color="blue")
    + geom_smooth(method="lm", formula="y ~ x + np.square(x) + np.power(x, 3)", color="red")
    + theme_bw()
)

Compare these models

In [None]:
table = Stargazer([reg1, reg2,reg3])
table

From these you should consider reg1 and reg3 only!

### Models with gdp per capita:
 reg4: lifeexp = alpha + beta * ln_gdppc

In [None]:
reg4 = smf.ols("lifeexp ~ ln_gdppc", data=df).fit(cov_type ="HC3")
reg4.summary()

In [None]:
(
    ggplot(df, aes(x="ln_gdppc", y="lifeexp"))
    + geom_point(color="blue")
    + geom_smooth(method="lm", color="red")
    + theme_bw()
)

In [None]:
reg5 = smf.ols("lifeexp ~ ln_gdppc + ln_gdppc_sq", data=df).fit(cov_type ="HC3")
reg5.summary()

In [None]:
(
    ggplot(df, aes(x="ln_gdppc", y="lifeexp"))
    + geom_point(color="blue")
    + geom_smooth(method="lm",formula = "y ~ x + np.square(x)",color="red")
    + theme_bw()
)

 Compare results with gdp per capita:

In [None]:
table = Stargazer([reg4, reg5])
table

Conclusion: reg5 is not adding new information

Compare reg1, reg3 and reg4 to get an idea of which log-transformed variable is the better choice:

In [None]:
table = Stargazer([reg1,reg3,reg4])
table

R2 measure is much better for reg4...
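Besides the Stargazer table, the fit statistics can be collected directly from the fitted results. The sketch below uses synthetic data and hypothetical variable names (`x1`, `x2`); `rsquared`, `rsquared_adj` and `nobs` are attributes of statsmodels regression results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): y is driven by x1, x2 matters only a little
rng = np.random.default_rng(3)
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
toy["y"] = 1 + 2 * toy["x1"] + 0.2 * toy["x2"] + rng.normal(size=200)

models = {
    "y ~ x2": smf.ols("y ~ x2", data=toy).fit(cov_type="HC3"),
    "y ~ x1": smf.ols("y ~ x1", data=toy).fit(cov_type="HC3"),
}

# R-squared is comparable here because every model has the same y
fit_stats = pd.DataFrame(
    {
        name: {"R2": m.rsquared, "adj_R2": m.rsquared_adj, "N": int(m.nobs)}
        for name, m in models.items()
    }
).T
print(fit_stats)
```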

#### Regression with piecewise linear spline:

    1st: define the cutoff for gdp per capita

In [None]:
cutoff = 50

    2nd: take care of log transformation -> cutoff needs to be transformed as well

reg6: lifeexp = alpha + beta_1 * ln_gdppc * 1(gdppc < 50) + beta_2 * ln_gdppc * 1(gdppc >= 50)

In [None]:
cutoff_ln = np.log(cutoff)

Note: unlike R, Python does not come with a ready-made `lspline` function, so we wrote one

In [None]:
def lspline(series, knots):
    """Design matrix of a piecewise linear spline: one column per segment."""
    if not isinstance(knots, list):
        knots = [knots]
    vector = np.asarray(series, dtype=float)

    columns = []
    previous_knot = 0
    for knot in knots:
        # Part of the variable that falls into the current segment,
        # capped at the length of that segment
        column = np.minimum(vector, knot - previous_knot)
        columns.append(column)
        # Whatever is left over belongs to the following segments
        vector = vector - column
        previous_knot = knot

    # Last segment: everything above the final knot
    columns.append(vector)
    return np.column_stack(columns)

In [None]:
reg6 = smf.ols("lifeexp ~ lspline(ln_gdppc, cutoff_ln)", data=df).fit(cov_type ="HC3")
reg6.summary()
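To see what the spline terms in the regression table mean, here is a small sketch with a single knot on made-up numbers. The helper `two_segment_basis` is hypothetical and only mirrors the idea of `lspline`: the variable is split into a part below the knot and a part above it, so each coefficient is the slope within its own segment.

```python
import numpy as np


def two_segment_basis(x, knot):
    """Hypothetical helper: split x into a below-knot and an above-knot part."""
    x = np.asarray(x, dtype=float)
    below = np.minimum(x, knot)  # grows with x until the knot, then stays flat
    above = x - below            # zero up to the knot, then grows with x
    return np.column_stack((below, above))


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
basis = two_segment_basis(x, knot=3.0)
print(basis)
```

The two columns always add up to the original variable, which is why the fitted line is continuous at the knot.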

In [None]:
(
    ggplot(df, aes(x="ln_gdppc", y="lifeexp"))
    + geom_point(color="blue")
    + geom_smooth(method="lm",formula = "y ~ lspline(x,cutoff_ln)",color="red")
    + theme_bw()
)

### Extra
 Weighted-OLS: use reg4 setup and weight with population\
 Can be done with the `weights = df["population"]` input!

In [None]:
reg7 = smf.wls("lifeexp ~ ln_gdppc", weights=df["population"], data=df).fit(cov_type ="HC3")
reg7.summary()
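What the weights do can be checked by hand. The sketch below uses synthetic data in which the column `w` plays the role of population (an assumption for illustration). It compares `smf.wls` with the closed-form solution that minimises the weighted sum of squared residuals.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): w stands in for population
rng = np.random.default_rng(4)
toy = pd.DataFrame({"x": rng.normal(size=150), "w": rng.uniform(1, 100, size=150)})
toy["y"] = 3 + 1.5 * toy["x"] + rng.normal(size=150)

fit_wls = smf.wls("y ~ x", data=toy, weights=toy["w"]).fit()

# Same estimates by hand: minimise sum of w * (y - a - b * x)^2
X = np.column_stack((np.ones(len(toy)), toy["x"].to_numpy()))
W = np.diag(toy["w"].to_numpy())
by_hand = np.linalg.solve(X.T @ W @ X, X.T @ W @ toy["y"].to_numpy())

print(fit_wls.params.values)
print(by_hand)
```

Observations with larger weights pull the fitted line closer to themselves, which is why large countries dominate a population-weighted regression.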

Create a pretty graph to visualize this method:

In [None]:
(
    ggplot(df, aes(x="gdppc", y="lifeexp"))
    + geom_point(df, aes(size="population"), color="blue", alpha=0.6, show_legend=False)
    + geom_smooth(
        aes(weight="population"), method="lm", color="red", se=False, size=0.7
    )
    + scale_size(range=(1, 15))
    + coord_cartesian(ylim=(50, 85))
    + scale_x_log10()
    + scale_y_continuous(expand=(0.01, 0.01), breaks=np.arange(50, 85, 5))
    + labs(
        x="GDP per capita, thousand US dollars (ln scale) ",
        y="Life expectancy  (years)",
    )
    + theme_bw()
    + annotate("text", x=70, y=80, label="USA", size=10)
    + annotate("text", x=10, y=82, label="China", size=10)
    + annotate("text", x=7, y=63, label="India", size=10)
)

Compare reg4, reg6 and reg7 models

In [None]:
table = Stargazer([reg4, reg6, reg7])
table.custom_columns(["Simple", "L.Spline", "Weighted"], [1, 1, 1])
table

Based on the model comparison, your chosen model should be reg4: lifeexp ~ ln_gdppc

    Substantive: - level-log interpretation works properly for countries
                 - the magnitudes of the coefficients are meaningful
    Statistical: - simple model, easy to interpret
                 - comparatively high R2, captures the variation well

### Residual analysis

Get the predicted y values from the model

In [None]:
df["reg4_y_pred"] = reg4.fittedvalues

Calculate the errors of the model

In [None]:
df["reg4_res"] = df["lifeexp"] - df["reg4_y_pred"]
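As a cross-check, statsmodels also stores the residuals on the fitted result as `.resid`. The sketch below, on synthetic data (made up for illustration), confirms that they equal the observed y minus the fitted values and that they average to zero when the model has an intercept.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption), only to compare the two ways of getting residuals
rng = np.random.default_rng(5)
toy = pd.DataFrame({"x": rng.normal(size=100)})
toy["y"] = 1 + 2 * toy["x"] + rng.normal(size=100)
fit = smf.ols("y ~ x", data=toy).fit(cov_type="HC3")

# Residual = observed y minus predicted y
manual_res = toy["y"] - fit.fittedvalues

print(np.allclose(manual_res, fit.resid))
print(abs(fit.resid.mean()) < 1e-8)
```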

Find countries with largest negative errors

In [None]:
worst5 = df.sort_values(by=["reg4_res"]).head(5)
worst5

Find countries with largest positive errors

In [None]:
best5 = df.sort_values(by=["reg4_res"]).tail(5)
best5
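An alternative to sorting and then taking `head` or `tail` is pandas' `nsmallest` and `nlargest`. A minimal sketch with made-up residuals (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Made-up residuals (assumption), only to show the calls
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "E", "F"],
        "res": [-4.0, 2.5, -1.0, 6.0, 0.5, -7.5],
    }
)

most_negative = toy.nsmallest(2, "res")  # largest negative residuals first
most_positive = toy.nlargest(2, "res")   # largest positive residuals first

print(most_negative["country"].tolist())
print(most_positive["country"].tolist())
```

Note that `nlargest` returns rows in descending order, while `sort_values(...).tail(...)` keeps them in ascending order.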

Show the scatter plot again, annotated with the best and worst countries

In [None]:
(
    ggplot(df, aes(x="ln_gdppc", y="lifeexp"))
    + geom_point(color="blue")
    + geom_smooth(method="lm", color="red")
    + annotate(
        "text",
        x=worst5["ln_gdppc"],
        y=worst5["lifeexp"] - 1,
        label=worst5["country"].tolist(),
        color="purple",
    )
    + annotate(
        "text",
        x=best5["ln_gdppc"],
        y=best5["lifeexp"] + 1,
        label=best5["country"].tolist(),
        color="green",
    )
    + theme_bw()
)