
**Analyzing Real GDP Using Time-Series and Panel Data Regression Models**

This analysis focuses on predicting real GDP (at constant 2017 national prices) as a function of the price level of household consumption and population. Using both Ordinary Least Squares (OLS) regression and first-differences models, the study explores how these predictors influence GDP over time and across countries. Logged and non-logged variables are used to investigate potential differences in results.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

The data for this analysis is sourced from the Penn World Table version 10.01. It includes economic indicators such as real GDP, price levels, and population for various countries over several years.

Variables Used:

Real GDP (rgdpna): The dependent variable, representing real GDP at constant 2017 prices (in million 2017 US$).
Price Level of Household Consumption (pl_c): A measure of consumer prices, expected to have mixed effects on GDP.
Population (pop): The total population of a country (in millions), hypothesized to positively impact GDP due to larger economies of scale.


In [None]:
# Load the Penn World Table extract uploaded to the Colab session
df = pd.read_excel('/content/LAB4.xlsx')
df.head()

Unnamed: 0,country,currency_unit,year,rgdpe,rgdpo,pop,emp,avh,hc,ccon,...,csh_x,csh_m,csh_r,pl_c,pl_i,pl_g,pl_x,pl_m,pl_n,pl_k
0,Aruba,Aruban Guilder,1950,,,,,,,,...,,,,,,,,,,
1,Aruba,Aruban Guilder,1951,,,,,,,,...,,,,,,,,,,
2,Aruba,Aruban Guilder,1952,,,,,,,,...,,,,,,,,,,
3,Aruba,Aruban Guilder,1953,,,,,,,,...,,,,,,,,,,
4,Aruba,Aruban Guilder,1954,,,,,,,,...,,,,,,,,,,


In [None]:
# original variable
df[['rgdpna']].describe()

Unnamed: 0,rgdpna
count,10399.0
mean,327198.7
std,1234987.0
min,13.91291
25%,7462.823
50%,33576.15
75%,184403.9
max,20572610.0


In [None]:
# Logged variable
df['log_rgdpna'] = np.log(df['rgdpna'])
# Summary statistics of the logged variable
df[['log_rgdpna']].describe()

Unnamed: 0,log_rgdpna
count,10399.0
mean,10.441482
std,2.30363
min,2.632817
25%,8.917689
50%,10.421571
75%,12.124884
max,16.839471
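
The minimum of rgdpna above is positive, so the plain `np.log` call is safe for this dataset. For data that might contain zeros or negative values, a guarded transform is one option. The sketch below is only illustrative: `safe_log` is a hypothetical helper, not part of the original notebook, and the example values are made up apart from the observed minimum.

```python
import numpy as np
import pandas as pd


def safe_log(series: pd.Series) -> pd.Series:
    """Natural log of strictly positive values; everything else becomes NaN."""
    # Hypothetical helper: mask non-positive entries before taking the log,
    # so zeros do not turn into -inf and negatives do not raise warnings.
    positive = series.where(series > 0)
    return np.log(positive)


# Toy values: the observed minimum of rgdpna, a zero, a negative, a missing value.
example = pd.Series([13.91291, 0.0, -5.0, np.nan])
logged = safe_log(example)
print(logged)
```

Rows that become NaN this way would then be dropped by the later `dropna()` calls rather than silently distorting the regression.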


In [None]:
df[['pl_c']].describe()

Unnamed: 0,pl_c
count,10399.0
mean,0.37085
std,0.424091
min,0.015589
25%,0.171226
50%,0.306258
75%,0.484549
max,23.122841


In [None]:
df[['pop']].describe()

Unnamed: 0,pop
count,10399.0
mean,30.962982
std,116.189454
min,0.004425
25%,1.579663
50%,6.150688
75%,19.934229
max,1433.783686


**Ordinary Least Squares (OLS) Regression**

An OLS regression model is fitted to predict logged real GDP (log_rgdpna) as a function of pl_c and pop. Logged variables are used to capture proportional relationships and interpret coefficients in percentage terms.


I now predict log real GDP as a function of the price level of household consumption and the population of a country. I expect countries with a higher population and a lower price level of household consumption to have a higher real GDP. However, the effect of prices is not straightforward: higher prices could reflect economic growth, in which case the association with GDP would be positive.

In [None]:
gdp1 = smf.ols(formula='log_rgdpna ~ pl_c + pop', data=df).fit()
print(gdp1.summary())

                            OLS Regression Results                            
Dep. Variable:             log_rgdpna   R-squared:                       0.153
Model:                            OLS   Adj. R-squared:                  0.153
Method:                 Least Squares   F-statistic:                     941.4
Date:                Thu, 14 Nov 2024   Prob (F-statistic):               0.00
Time:                        01:07:56   Log-Likelihood:                -22567.
No. Observations:               10399   AIC:                         4.514e+04
Df Residuals:                   10396   BIC:                         4.516e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.0015      0.028    353.779      0.0

From the table: a one-unit increase in the price level of household consumption is associated with roughly a 55% increase in real GDP, whilst a one-unit increase in population (one million people) is associated with only about 0.76%. Note that association is not causation: the fact that GDP and prices both increase could simply reflect the condition of the economy, rather than one causing the other. Both predictors are statistically significant, pl_c at the 5% level and pop at the 1% level. Overall, the model explains 15.3% of the variation in logged real GDP (R-squared), meaning that many other variables probably affect it.
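
Reading a log-level coefficient directly as a percentage is an approximation that holds well only for small coefficients; the exact effect of a one-unit change is 100 * (exp(coef) - 1). The sketch below applies that formula. The coefficient values 0.55 and 0.0076 are assumptions inferred from the percentages quoted in the text, since the coefficient rows are cut off in the output above, and `log_level_pct_effect` is a hypothetical helper name.

```python
import math


def log_level_pct_effect(coef: float, delta: float = 1.0) -> float:
    """Exact percent change in y for a `delta` change in x when log(y) = a + coef * x."""
    return 100.0 * (math.exp(coef * delta) - 1.0)


# Assumed coefficients (inferred from the text, not read from the truncated table).
pl_c_effect = log_level_pct_effect(0.55)
pop_effect = log_level_pct_effect(0.0076)

print(f"pl_c: {pl_c_effect:.2f}% per unit")
print(f"pop:  {pop_effect:.4f}% per million people")
```

For the small population coefficient the exact and approximate figures are nearly identical, while for a coefficient as large as the assumed pl_c value the exact effect is noticeably bigger than the direct reading.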

**First-Differences Regression**

A first-differences regression is used to account for unobserved heterogeneity across countries. This approach models the changes in GDP as a function of changes in pl_c and pop.
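
To see why differencing helps, here is a minimal simulated sketch (not the Penn World Table data). Each toy country gets a time-invariant effect `alpha` that is correlated with the predictor; pooled OLS on levels absorbs that effect into the slope, while differencing within each country cancels it. All names and parameter values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_countries, n_years, true_beta = 50, 20, 0.5

frames = []
for c in range(n_countries):
    alpha = rng.normal(0.0, 5.0)  # time-invariant country effect
    # Predictor is correlated with alpha and drifts over time.
    x = alpha + np.cumsum(rng.normal(0.0, 1.0, n_years))
    y = alpha + true_beta * x + rng.normal(0.0, 0.1, n_years)
    frames.append(
        pd.DataFrame({"country": c, "year": np.arange(n_years), "x": x, "y": y})
    )
panel = pd.concat(frames, ignore_index=True)

# Pooled OLS on levels (with intercept): the slope also picks up alpha.
pooled_slope = np.polyfit(panel["x"], panel["y"], 1)[0]

# Within-country first differences: alpha drops out of y_t - y_{t-1}.
diffs = panel.groupby("country")[["x", "y"]].diff().dropna()

# OLS on the differences without an intercept.
fd_slope = np.linalg.lstsq(
    diffs[["x"]].to_numpy(), diffs["y"].to_numpy(), rcond=None
)[0][0]

print(f"true beta: {true_beta}, pooled: {pooled_slope:.3f}, first-diff: {fd_slope:.3f}")
```

In this simulation the first-difference slope should land close to the true value, whereas the pooled slope is pushed upward by the omitted country effect.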

In [None]:
!pip install linearmodels

from linearmodels.panel import FirstDifferenceOLS


In [None]:
columns = ['country', 'year', 'log_rgdpna', 'pl_c', 'pop']
df1 = df[columns].dropna()

In [None]:
# Set the MultiIndex for panel data
df1 = df1.set_index(['country', 'year'])

# Define the dependent and independent variables
y = df1['log_rgdpna']
X = df1[['pl_c', 'pop']]

# Fit the first-differenced panel data model
fdmodel = FirstDifferenceOLS(y, X)
results = fdmodel.fit(cov_type='clustered', cluster_entity=True)

print(results)

                     FirstDifferenceOLS Estimation Summary                      
Dep. Variable:             log_rgdpna   R-squared:                        0.1691
Estimator:         FirstDifferenceOLS   R-squared (Between):              0.0490
No. Observations:               10157   R-squared (Within):               0.1401
Date:                Thu, Nov 14 2024   R-squared (Overall):              0.0537
Time:                        01:08:18   Log-likelihood                    2681.0
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      1033.1
Entities:                         183   P-value                           0.0000
Avg Obs:                       56.825   Distribution:                 F(2,10155)
Min Obs:                       15.000                                           
Max Obs:                       70.000   F-statistic (robust):             3.7307
                            

From the table: a one-unit change in the price level of household consumption is associated with a 12.35% change in real GDP, whilst a one-unit change in population (one million people) is associated with only 0.69%. Again, association is not causation: the fact that GDP and prices move in the same direction, as they did in the previous regression, could simply reflect the condition of the economy rather than imply that one causes the other. In the OLS model both predictors were statistically significant; here pl_c is not statistically significant, while pop still is. The within R-squared is 14.01% (the R-squared of the differenced regression itself is 16.91%), and the F-statistic of 1033.1 with a P-value of 0.0000 indicates that the predictors are jointly significant, although the cluster-robust F-statistic is far smaller at 3.7307.

In [None]:
# Group by 'country' and calculate the first differences for the relevant columns

df1['log_rgdpna_diff'] = df1.groupby('country')['log_rgdpna'].diff()
df1['pl_c_diff'] = df1.groupby('country')['pl_c'].diff()
df1['pop_diff'] = df1.groupby('country')['pop'].diff()

In [None]:
df1[['log_rgdpna_diff', 'pl_c_diff', 'pop_diff']].describe()

Unnamed: 0,log_rgdpna_diff,pl_c_diff,pop_diff
count,10216.0,10216.0,10216.0
mean,0.03707,0.009789,0.472323
std,0.06341,0.197859,1.730812
min,-1.082343,-4.915405,-6.043162
25%,0.01503,-0.004154,0.010645
50%,0.039203,0.006678,0.083711
75%,0.064539,0.023844,0.345722
max,0.724063,18.049239,22.685037
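
The manual differences above have 10,216 non-missing rows, while the estimator reports 10,157 observations. One possible reason, which is an assumption not verified here, is that some countries have gaps in their years after `dropna()`, so a plain `diff()` spans more than one year. A sketch for checking this on a toy panel with made-up values:

```python
import pandas as pd

# Toy panel (hypothetical values): country B is missing the year 2002.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B", "B"],
        "year": [2000, 2001, 2002, 2000, 2001, 2003],
        "value": [1.0, 1.5, 2.5, 3.0, 3.2, 4.0],
    }
).sort_values(["country", "year"])

# Year gap between each row and the previous row of the same country.
toy["year_gap"] = toy.groupby("country")["year"].diff()

# Plain within-country difference, regardless of the gap.
toy["value_diff"] = toy.groupby("country")["value"].diff()

# Keep the difference only when the two rows are exactly one year apart.
toy["value_diff_1y"] = toy["value_diff"].where(toy["year_gap"] == 1)

n_plain = int(toy["value_diff"].notna().sum())
n_one_year = int(toy["value_diff_1y"].notna().sum())
print(n_plain, n_one_year)
```

Applying the same `year_gap` check to the real panel (after resetting the index so that `year` is a column) would show whether gaps account for the difference in counts.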


**Non-Logged Variable Analysis**

The same analysis is repeated using the non-logged real GDP (rgdpna) to examine whether the use of logged variables influences results.

In [None]:
gdp2 = smf.ols(formula='rgdpna ~ pl_c + pop', data=df).fit()
print(gdp2.summary())

                            OLS Regression Results                            
Dep. Variable:                 rgdpna   R-squared:                       0.353
Model:                            OLS   Adj. R-squared:                  0.353
Method:                 Least Squares   F-statistic:                     2838.
Date:                Thu, 14 Nov 2024   Prob (F-statistic):               0.00
Time:                        01:08:26   Log-Likelihood:            -1.5835e+05
No. Observations:               10399   AIC:                         3.167e+05
Df Residuals:                   10396   BIC:                         3.167e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   2552.7677   1.32e+04      0.193      0.8

Compared to the results before: this model explains 35.3% of the variation in real GDP (R-squared), against 15.3% for the logged model above. The two figures are not directly comparable, however, because the dependent variables differ (levels versus logs). Both coefficients are as significant as they were before.



In [None]:
columns = ['country', 'year', 'rgdpna', 'pl_c', 'pop']
df2 = df[columns].dropna()

In [None]:
# Set the MultiIndex for panel data
df2 = df2.set_index(['country', 'year'])

# Define the dependent and independent variables
y = df2['rgdpna']
X = df2[['pl_c', 'pop']]

# Fit the first-differenced panel data model
fdmodel = FirstDifferenceOLS(y, X)
results = fdmodel.fit(cov_type='clustered', cluster_entity=True)

print(results)

                     FirstDifferenceOLS Estimation Summary                      
Dep. Variable:                 rgdpna   R-squared:                        0.3328
Estimator:         FirstDifferenceOLS   R-squared (Between):             -0.0101
No. Observations:               10157   R-squared (Within):               0.3876
Date:                Thu, Nov 14 2024   R-squared (Overall):              0.1051
Time:                        01:08:30   Log-likelihood                -1.376e+05
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      2533.2
Entities:                         183   P-value                           0.0000
Avg Obs:                       56.825   Distribution:                 F(2,10155)
Min Obs:                       15.000                                           
Max Obs:                       70.000   F-statistic (robust):             5.9162
                            

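Comparison with the above results: as before, only pop is significant. The within R-squared is 38.76% (the R-squared of the differenced regression itself is 33.28%), higher than the 14.01% of the logged first-differences model, and the F-statistic of 2533.2 with a P-value of 0.0000 indicates that the predictors are jointly significant, although the cluster-robust F-statistic is much smaller at 5.9162.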

In [None]:
# Group by 'country' and calculate the first differences for the relevant columns

df2['rgdpna_diff'] = df2.groupby('country')['rgdpna'].diff()
df2['pl_c_diff'] = df2.groupby('country')['pl_c'].diff()
df2['pop_diff'] = df2.groupby('country')['pop'].diff()

In [None]:
df2[['rgdpna_diff', 'pl_c_diff', 'pop_diff']].describe()


Unnamed: 0,rgdpna_diff,pl_c_diff,pop_diff
count,10216.0,10216.0,10216.0
mean,10755.7,0.009789,0.472323
std,52352.0,0.197859,1.730812
min,-449837.5,-4.915405,-6.043162
25%,68.12866,-0.004154,0.010645
50%,986.9141,0.006678,0.083711
75%,6059.434,0.023844,0.345722
max,1369392.0,18.049239,22.685037


**Comparison of Models**

OLS vs. First-Differences
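The first-differences model controls for unobserved, time-invariant heterogeneity across countries, making it better suited to longitudinal data. Its R-squared is slightly higher than that of OLS for the logged model (0.1691 versus 0.153) and slightly lower for the non-logged model (0.3328 versus 0.353), but these figures describe changes rather than levels and are not directly comparable.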
The loss of significance for pl_c in the first-differences model suggests that its relationship with GDP may be spurious or driven by omitted variables.

Logged vs. Non-Logged Variables
Logged variables provide proportional interpretations, which are more intuitive for economic analysis.
Non-logged variables, while producing a higher R-squared here, may oversimplify complex relationships and are harder to compare across countries of varying sizes.

**Conclusion**

This analysis demonstrates the importance of model selection and variable transformation in time-series and panel data analysis. Key findings include:

Population is a consistently significant predictor of GDP, regardless of model or transformation, highlighting the role of demographic size in economic output.
Price level’s significance varies, suggesting its effect may depend on country-specific or time-variant factors.
First-differences models are more robust to unobserved, time-invariant heterogeneity across countries, although, as noted above, the estimated associations should still not be read as causal.
