In [22]:
# Install dependencies if they are missing (prefix with ! to run inside the notebook)
# pip install statsmodels requests openpyxl

In [None]:
import pandas as pd
import zipfile
import io
import requests
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

In [24]:
# URL of the World Bank WDI dataset
url = "https://databank.worldbank.org/data/download/WDI_EXCEL.zip"

# Download the ZIP file; raise on an HTTP error instead of leaving df undefined
response = requests.get(url, timeout=120)
response.raise_for_status()

# Open the ZIP archive in memory and read the Excel workbook inside it
with zipfile.ZipFile(io.BytesIO(response.content), 'r') as z:
    excel_file = "WDIEXCEL.xlsx"
    with z.open(excel_file) as file:
        # Read the "Data" sheet into a pandas DataFrame
        df = pd.read_excel(file, sheet_name="Data")

# Select relevant indicators
gdp_indicator = "NY.GDP.PCAP.CD"  # GDP per capita (current US$)
life_exp_indicator = "SP.DYN.LE00.IN"  # Life expectancy at birth, total (years)

# Filter dataset for selected indicators
df_filtered = df[df["Indicator Code"].isin([gdp_indicator, life_exp_indicator])]

# Reshape the dataset (convert years from columns to rows)
df_melted = df_filtered.melt(id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"], 
                             var_name="Year", value_name="Value")

# Convert Year column to numeric
df_melted["Year"] = pd.to_numeric(df_melted["Year"], errors="coerce")

# Filter for the year 2020
df_2020 = df_melted[df_melted["Year"] == 2020]

# Pivot so we have one row per country with GDP and Life Expectancy as separate columns
df_pivot = df_2020.pivot(index=["Country Name", "Country Code"], columns="Indicator Code", values="Value").dropna()

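# Rename columns by indicator code so the mapping does not depend on column order
df_pivot = df_pivot.rename(columns={gdp_indicator: "GDP_per_capita", life_exp_indicator: "Life_expectancy"})
df_pivot.columns.name = None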

# Reset index
df_pivot = df_pivot.reset_index()

# Perform linear regression
model = smf.ols("Life_expectancy ~ GDP_per_capita", data=df_pivot).fit()

# Print regression summary
print(model.summary())



                            OLS Regression Results                            
Dep. Variable:        Life_expectancy   R-squared:                       0.413
Model:                            OLS   Adj. R-squared:                  0.411
Method:                 Least Squares   F-statistic:                     174.5
Date:                Thu, 13 Mar 2025   Prob (F-statistic):           1.61e-30
Time:                        13:49:40   Log-Likelihood:                -782.59
No. Observations:                 250   AIC:                             1569.
Df Residuals:                     248   BIC:                             1576.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         68.7870      0.427    161.

### **OLS Regression Results**

The **Ordinary Least Squares (OLS) Regression** output provides insights into the relationship between **GDP per capita** and **Life Expectancy**. Let's break it down step by step.

---

## **1. Model Summary**
### **Key Metrics**
| Metric                | Value        | Interpretation |
|-----------------------|-------------|----------------|
| **R-squared**        | 0.413       | 41.3% of the variance in life expectancy is explained by GDP per capita. |
| **Adj. R-squared**   | 0.411       | Adjusted for the number of predictors, slightly lower than R-squared. |
| **F-statistic**      | 174.5       | Measures the overall significance of the regression model. |
| **Prob (F-statistic)** | 1.61e-30   | **Very low value**, meaning the model is statistically significant. |
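| **Observations**     | 250         | The model is fit on 250 rows. WDI lists regional and income-group aggregates alongside individual countries, and the code does not filter them out. |
| **AIC / BIC**        | 1569 / 1576 | Information criteria for comparing models fit to the same data; the lower value is preferred. They carry no meaning in isolation. |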

### **Interpretation**
- **R-squared = 0.413** means that **41.3% of the variation in life expectancy** is explained by GDP per capita. This corresponds to a correlation of about 0.64 (the square root of R-squared), a moderately strong linear association; other factors also influence life expectancy.
- The **F-statistic (174.5)** and its **p-value (1.61e-30)** indicate that the overall regression model is highly significant.
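
The figures in the table are tied together by simple identities, which makes them easy to sanity-check. The sketch below recomputes adjusted R-squared, R-squared from the F-statistic, and AIC/BIC from the values printed in the summary (250 observations, one predictor, log-likelihood -782.59). The variable names are illustrative, and small differences from the printed values come from rounding in the summary.

```python
import math

# Values copied from the regression summary above
n = 250                 # No. Observations
k = 1                   # Df Model (number of predictors)
r2 = 0.413              # R-squared
f_stat = 174.5          # F-statistic
log_lik = -782.59       # Log-Likelihood

df_resid = n - k - 1    # 248, matches "Df Residuals"

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# F = (R2 / k) / ((1 - R2) / df_resid); solve this for R2
r2_from_f = f_stat * k / (f_stat * k + df_resid)

# The information criteria count the intercept as a parameter (k + 1 in total)
n_params = k + 1
aic = 2 * n_params - 2 * log_lik
bic = n_params * math.log(n) - 2 * log_lik

print(f"adj R2 = {adj_r2:.3f}, R2 from F = {r2_from_f:.3f}")
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```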

---

## **2. Regression Coefficients**
| Variable         | Coefficient | Std. Error | t-value | p-value | Confidence Interval (95%) |
|-----------------|-------------|------------|---------|---------|---------------------------|
| **Intercept**   | 68.7870     | 0.427      | 161.009 | 0.000   | (67.946, 69.628) |
| **GDP per capita** | 0.0002   | 1.6e-05    | 13.209  | 0.000   | (0.000, 0.000), both bounds rounded to three decimals |

### **Interpretation**
1. **Intercept (68.79):**  
   - When **GDP per capita = 0**, the predicted life expectancy is **68.79 years**.
   - No country has a GDP per capita of 0, so this is an extrapolation; the intercept anchors the fitted line rather than describing a real case.

2. **GDP per capita coefficient (0.0002):**  
   - **Each $1,000 increase in GDP per capita is associated with about 0.2 more years of life expectancy** (0.0002 × 1,000). This is a cross-sectional association, not a causal effect.
   - The **p-value (printed as 0.000, meaning below 0.0005)** indicates that the slope is statistically significant.
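
As a worked example of reading the coefficients, the sketch below plugs a few GDP per capita values into the fitted line. The helper `predict_life_expectancy` is a hypothetical name, and it uses the rounded coefficients from the table (the slope is printed to only one significant digit), so the outputs are approximate.

```python
# Rounded coefficients from the summary table; the slope in particular
# is printed to one significant digit, so results are approximate.
INTERCEPT = 68.7870
SLOPE = 0.0002  # years of life expectancy per US$1 of GDP per capita


def predict_life_expectancy(gdp_per_capita: float) -> float:
    """Fitted value from the simple linear model (hypothetical helper)."""
    return INTERCEPT + SLOPE * gdp_per_capita


for gdp in (1_000, 10_000, 50_000):
    print(f"GDP per capita ${gdp:>6,}: {predict_life_expectancy(gdp):.1f} years")

# The slope per $1,000 is the per-dollar slope times 1,000
slope_per_1000 = SLOPE * 1_000
```

Because the model is a straight line, it predicts the same gain per $1,000 at every income level.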

---

## **3. Diagnostic Statistics**
### **Omnibus & Jarque-Bera Test (Normality)**
| Statistic       | Value  | Interpretation |
|---------------|--------|----------------|
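| **Omnibus**   | 27.831 | Combined skewness-kurtosis test statistic; a large value points away from normally distributed residuals. |
| **Prob(Omnibus)** | 0.000 | A p-value near 0 rejects the hypothesis that the residuals are normal. |
| **Jarque-Bera (JB)** | 33.568 | A second skewness-kurtosis test; it also indicates non-normal residuals. |
| **Prob(JB)** | 5.14e-08 | Very low p-value, again rejecting normality. The test alone does not say whether skewness or excess kurtosis is responsible. |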
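
For reference, the Jarque-Bera statistic combines the sample skewness `S` and kurtosis `K` of the residuals as `JB = n/6 * (S^2 + (K - 3)^2 / 4)` and is compared against a chi-squared distribution with 2 degrees of freedom, whose upper-tail probability is `exp(-JB / 2)`. The sketch below implements that formula in plain Python (the function names are illustrative, not part of statsmodels) and recomputes the p-value from the statistic printed in the summary.

```python
import math


def chi2_2df_pvalue(stat):
    """Upper-tail probability of a chi-squared variable with 2 df."""
    return math.exp(-stat / 2)


def jarque_bera(residuals):
    """Return (JB statistic, p-value) from sample skewness and kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n)
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2          # equals 3 for a normal distribution
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return jb, chi2_2df_pvalue(jb)


# Recompute Prob(JB) from the JB value reported in the summary
p_reported = chi2_2df_pvalue(33.568)
print(f"{p_reported:.2e}")
```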

### **Durbin-Watson (Autocorrelation)**
| Metric       | Value  | Interpretation |
|-------------|--------|----------------|
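| **Durbin-Watson** | 1.889 | No strong first-order autocorrelation in residuals. (Rule-of-thumb range: 1.5 - 2.5) |

- Since the **Durbin-Watson statistic is close to 2**, we **do not** see strong evidence of autocorrelation in the residuals. The test compares residuals in row order and is designed for time series; here the rows are a cross-section sorted by country name, so the result carries little weight.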
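
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, so it depends on the order of the rows. The sketch below (plain Python, with a made-up residual vector) shows the same residuals producing different values under two orderings, which is why the statistic is mainly informative for data with a natural ordering such as time series.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(r ** 2 for r in residuals)
    return num / den


# Made-up residuals, used only to illustrate the effect of row order
alternating = [1.0, -1.0, 1.0, -1.0]
dw_alternating = durbin_watson(alternating)      # signs flip on every row

dw_sorted = durbin_watson(sorted(alternating))   # same values, reordered

print(dw_alternating, dw_sorted)
```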

---

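## **4. Condition Number (Scaling)**
| Metric       | Value  | Interpretation |
|-------------|--------|----------------|
| **Condition Number** | 3.24e+04 | **High value**, reflecting the large numeric scale of GDP per capita relative to the intercept column. |

- statsmodels flags a condition number this large (about 32,400) as a possible sign of multicollinearity or other numerical problems. With a **single predictor**, multicollinearity cannot occur, so the value points to the **scale of GDP per capita** (dollar values in the thousands next to a constant of 1). Expressing GDP in thousands of dollars would lower it.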
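
One way to see how the predictor's scale drives the condition number is to build the design matrix (a column of ones plus GDP per capita) and compare its condition number with GDP measured in dollars and in thousands of dollars. The sketch below uses NumPy and an evenly spaced, made-up GDP range rather than the WDI data, so the numbers will differ from the summary; only the contrast matters.

```python
import numpy as np

# Made-up GDP per capita values in US$, standing in for the real data
gdp_dollars = np.linspace(500, 100_000, 50)
gdp_thousands = gdp_dollars / 1_000


def design_condition_number(x):
    """Condition number of the design matrix [1, x] (intercept plus one predictor)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.cond(X)


cond_dollars = design_condition_number(gdp_dollars)
cond_thousands = design_condition_number(gdp_thousands)

print(f"GDP in dollars:   {cond_dollars:.3g}")
print(f"GDP in thousands: {cond_thousands:.3g}")
```

Rescaling changes the units of the slope (per $1,000 instead of per $1) but not the fitted values, R-squared, or t-statistics.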

---

## **Conclusion**
- **The model is statistically significant**, with GDP per capita having a strong relationship with life expectancy.
- **GDP per capita explains about 41% of the variance** in life expectancy, but other factors (e.g., healthcare access, education, environment) also play a major role.
- **The coefficient is small (0.0002)**, but since GDP per capita ranges in **thousands to tens of thousands**, its effect becomes meaningful.
- **Potential Issues:**
  - The residuals **do not follow a perfect normal distribution**.
  - The **high condition number** reflects the **scale of GDP per capita** rather than multicollinearity, which cannot arise with a single predictor.
