<a href="https://colab.research.google.com/github/YoulanCheng/Github-file-notebook/blob/main/Untitled14.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Title: [Your Project Title Here]

**Author:** [Your Name]

In [1]:
import pandas as pd
import numpy as np

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.regressionplots import plot_partregress

In [4]:
from scipy import stats

In [5]:
sns.set(style="whitegrid", context="notebook")
plt.rcParams['figure.figsize'] = (7, 5)

## 1. Data Acquisition, Vetting, and Preparation


### 1.1. Research Questions and Hypothesis
This project focuses on cross-national differences in life expectancy and asks how environmental and socioeconomic factors jointly shape population health.

**Research Question 1 (Q1)**  
Does higher exposure to PM2.5 air pollution reduce life expectancy across countries?

- **Hypothesis 1 (H1)**  
  Holding other factors constant, higher PM2.5 levels are associated with lower life expectancy.  
  *Theoretical justification:* Epidemiological and public health research shows that fine particulate matter increases the risk of respiratory and cardiovascular diseases, which directly affect mortality.

---

**Research Question 2 (Q2)**  
Does economic development improve life expectancy, net of air pollution and other controls?

- **Hypothesis 2 (H2)**  
  Controlling for air pollution and other predictors, higher GDP per capita is associated with higher life expectancy.  
  *Theoretical justification:* Classic political economy and development literature links income to better nutrition, housing, sanitation, education, and access to healthcare, all of which contribute to longer lives.

---

**Research Question 3 (Q3)**  
Does health expenditure mitigate the negative effect of PM2.5 on life expectancy?

- **Hypothesis 3 (H3)**  
  The negative effect of PM2.5 on life expectancy is weaker in countries that spend a higher share of GDP on health.  
  *Theoretical justification:* Health systems with more resources can invest in prevention, screening, and treatment, and thus partially shield populations from the health consequences of environmental risks.

---

**Research Question 4 (Q4)**  
How do urbanization and health expenditure, together with pollution and income, shape cross-national differences in life expectancy?

- **Hypothesis 4 (H4)**  
  Urbanization and health expenditure are positively associated with life expectancy, once pollution and income are controlled for.  
  *Theoretical justification:* Urban areas often host better medical infrastructure and services, and health spending reflects state capacity and policy commitment to public health. However, urbanization may also increase exposure to pollution and crowding, so its net effect is theoretically ambiguous.
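
The four hypotheses can be collected into a single regression specification. The equation below is a sketch rather than the final model: it assumes every predictor enters linearly in its raw units and that H3 is captured by a product term. The coefficient symbols $\beta_0,\dots,\beta_5$ are introduced here for illustration only.

$$
\text{LifeExpectancy}_i = \beta_0 + \beta_1\,\text{PM25}_i + \beta_2\,\text{GDPpc}_i + \beta_3\,\text{UrbanRate}_i + \beta_4\,\text{HealthExp}_i + \beta_5\,(\text{PM25}_i \times \text{HealthExp}_i) + \varepsilon_i
$$

Under this specification the hypotheses translate into sign expectations: H1 implies $\beta_1 < 0$, H2 implies $\beta_2 > 0$, H3 implies $\beta_5 > 0$ (the marginal effect of pollution, $\beta_1 + \beta_5\,\text{HealthExp}_i$, becomes less negative as health spending rises), and H4 implies $\beta_3 > 0$ and $\beta_4 > 0$. Once the product term is included, $\beta_1$ describes the association of PM2.5 with life expectancy at `HealthExp` = 0, so it should not be read as an average effect.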

### 1.2. Dataset Justification

The analysis uses a cross-sectional dataset constructed from the **World Development Indicators (WDI)** provided by the World Bank for the year **2019**. The dataset `air_health_2019.csv` contains one row per country and includes:

- `country`: Country name (identifier).
- `LifeExpectancy`: Life expectancy at birth (years).
- `PM25`: Annual mean exposure to PM2.5 (µg/m³).
- `GDPpc`: GDP per capita in current US dollars.
- `UrbanRate`: Urban population as a percentage of total population.
- `HealthExp`: Current health expenditure as a percentage of GDP.

**Why this dataset is appropriate:**

1. **Global coverage and variation**  
   The data cover roughly 180 countries, providing substantial variation in environmental conditions, income levels, urbanization, and health spending. This variation is essential for identifying relationships in a cross-national regression framework.

2. **Temporal focus (2019)**  
   The year 2019 is the last pre-COVID year with broad global coverage. Using 2019 avoids the severe distortions created by the COVID-19 pandemic (e.g., excess mortality, sudden GDP contractions) while still relying on recent data.

3. **Conceptual alignment**  
   Each variable corresponds directly to a key concept in the research questions:  
   - `PM25` → environmental risk;  
   - `GDPpc` → economic development;  
   - `UrbanRate` → structural transformation in settlement patterns;  
   - `HealthExp` → health system investment;  
   - `LifeExpectancy` → overall population health outcome.

4. **Data quality and comparability**  
   The WDI are widely used in political science and public policy research. Indicators are based on standardized definitions, consistent country codes, and documented methodologies. This supports both **reliability** (measurement consistency) and **comparability** across countries.

5. **Reproducibility**  
   The CSV file used in this notebook can be stored in the same GitHub repository as the notebook. Anyone can clone the repository and reproduce the entire analysis by running the notebook.


### 1.3. Data Loading, Merging and Initial Exploration
Provide the code to load your dataset into a pandas DataFrame. Merge your data frames if necessary. Display the `.head()`, `.info()`, and `.describe()` outputs. Provide a brief markdown interpretation of the initial state of the data (e.g., number of observations, data types, potential missing values).

In [6]:
# The CSV must be in the working directory (in Colab, upload it or clone the repository first); otherwise this raises FileNotFoundError
df = pd.read_csv("air_health_2019.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

**Initial Interpretation**

1. The dataset contains one observation per country. The main analytical variables (`LifeExpectancy`, `PM25`, `GDPpc`, `UrbanRate`, `HealthExp`) are numerical (`float64`), which is appropriate for multiple linear regression.
2. `country` is a string identifier and will not be used directly as a predictor.
3. The descriptive statistics indicate plausible ranges: life expectancy between roughly 50 and 85 years, PM2.5 mostly in the single or double digits, and a wide range in GDP per capita, as expected in a global sample.
4. Any missing values will be systematically examined and handled in the next section.

To better understand the structure of the data, I next inspect missingness patterns and basic distributions.
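
A minimal sketch of such a missingness check follows. It runs on a hypothetical four-row toy frame (the values and the names `toy`, `missing`, and `incomplete` are illustrative assumptions, not the project data) and reports the count and share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses some of the project's column names
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "LifeExpectancy": [71.2, 80.5, np.nan, 64.3],
    "PM25": [35.0, np.nan, 12.1, np.nan],
    "GDPpc": [2100.0, 41000.0, 9800.0, 900.0],
})

# Count and share of missing values per column, largest first
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

# Countries with at least one missing analysis variable
incomplete = toy.loc[toy.isna().any(axis=1), "country"].tolist()

print(missing)
print(incomplete)
```

Listing the affected countries by name, not only the counts, makes it easier to see whether missingness is concentrated in a particular region or income group.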


## 2. Systematic Data Cleaning and Transformation
Document and justify every step of your data cleaning and preprocessing. Use sub-sections for clarity (e.g., Handling Missing Values, Creating Dummy Variables, Addressing Outliers). Show your code and explain the rationale behind each significant decision.


### 2.1. Handling Missing Values

In [None]:
# Check for missing values
# print(df.isnull().sum())

# Drop rows with missing values
# df_clean = df.dropna()

# Fill missing values with mean/median (assign back; chained inplace fills are deprecated in recent pandas)
# df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

# Fill missing values with a specific value
# df['column_name'] = df['column_name'].fillna('Unknown')

In [None]:
df.isna().sum()

In [None]:
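# Drop countries with a missing outcome first: the dependent variable is never imputed
df_clean = df.dropna(subset=["LifeExpectancy"]).copy()

# Predictors only; LifeExpectancy is deliberately left out of the imputation loop
predictor_cols = ["PM25", "GDPpc", "UrbanRate", "HealthExp"]

for col in predictor_cols:
    if df_clean[col].isna().any():
        median_value = df_clean[col].median()
        df_clean[col] = df_clean[col].fillna(median_value)
        print(f"Filled missing values in {col} with median = {median_value:.2f}")

In [None]:
# Confirm that no missing values remain in the analysis variables
df_clean[["LifeExpectancy"] + predictor_cols].isna().sum()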

**Justification**

1. Missingness in macro-level indicators is usually due to incomplete reporting for specific countries, not to individual-level nonresponse.
2. For skewed variables such as `PM25` and `GDPpc`, **median imputation** is more robust than mean imputation because it is less influenced by extreme values.
3. Imputing independent variables with their median avoids dropping countries entirely, which helps maintain statistical power and preserve global variation.
4. The dependent variable `LifeExpectancy` is not imputed; instead, any remaining missing values are dropped. Imputing the outcome would artificially reduce uncertainty in the regression and complicate interpretation.

This approach strikes a balance between preserving observations and maintaining reasonable distributional properties.
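
The robustness argument in point 2 can be illustrated with a small sketch. The income values below are hypothetical and chosen only to produce a long right tail; `mean_fill` and `median_fill` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed income values with one missing entry
gdp = pd.Series([800.0, 1200.0, 2500.0, 4000.0, 65000.0, np.nan])

# The single very rich case pulls the mean far above the typical country
mean_fill = gdp.fillna(gdp.mean())
median_fill = gdp.fillna(gdp.median())

print(f"mean = {gdp.mean():.0f}, median = {gdp.median():.0f}")
print(f"imputed with mean: {mean_fill.iloc[-1]:.0f}, with median: {median_fill.iloc[-1]:.0f}")
```

In this toy case the mean-imputed country would be assigned an income higher than four of the five observed values, while the median-imputed one sits in the middle of the distribution. Either way, filling with a single value understates the variance of the imputed variable, which is a reason to report how many values were filled.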


### 2.2. Creating Dummy Variables

In [None]:
# Convert categorical variables to dummy variables
# df = pd.get_dummies(df, columns=['categorical_column'], prefix='cat')

# Or create dummy variables manually
# df['dummy_var'] = df['categorical_column'].map({'Category1': 1, 'Category2': 0})


In [8]:
europe_list = [
    "Poland", "Germany", "France", "Spain", "Italy", "United Kingdom", "Netherlands",
    "Belgium", "Sweden", "Norway", "Denmark", "Finland", "Czech Republic", "Austria",
    "Switzerland", "Portugal", "Greece", "Ireland", "Hungary"
]

In [9]:
df_clean["RegionSimple"] = df_clean["country"].apply(
    lambda x: "Europe" if x in europe_list else "Other"
)


In [None]:
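# Build the 0/1 indicator explicitly: get_dummies(..., drop_first=True) would drop "Europe" and keep RegionSimple_Other
df_clean["RegionSimple_Europe"] = (df_clean["RegionSimple"] == "Europe").astype(int)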

In [None]:
df_clean[["country", "RegionSimple_Europe"]].head()

**Justification**

The main models in this project do not rely on categorical predictors, but this example shows how a simple region dummy (`RegionSimple_Europe`) could be constructed.

Such dummies could be used in robustness checks or extended models to control for broad regional differences (e.g. European vs non-European countries). For the core analysis, I keep the specification focused on the theoretically central predictors: pollution, income, urbanization, and health expenditure.
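
When building such a dummy, it helps to be explicit about which category serves as the reference. The sketch below uses a hypothetical four-country frame and a shortened Europe list to contrast an explicit indicator with `pd.get_dummies(..., drop_first=True)`, which drops the alphabetically first level.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["Poland", "Brazil", "France", "Japan"]})
europe = {"Poland", "France"}  # hypothetical short list, for illustration only

toy["RegionSimple"] = toy["country"].map(lambda c: "Europe" if c in europe else "Other")

# Explicit 0/1 indicator: "Other" is the reference category by construction
toy["RegionSimple_Europe"] = (toy["RegionSimple"] == "Europe").astype(int)

# drop_first=True removes the first level in sorted order, which here is "Europe"
auto = pd.get_dummies(toy["RegionSimple"], prefix="RegionSimple", drop_first=True)

print(toy)
print(auto.columns.tolist())
```

With only two levels both encodings carry the same information, but the column name and the sign of the estimated coefficient differ, so the choice matters for interpretation.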


### 2.3. Addressing Outliers

In [None]:

# Identify outliers using IQR method
# Q1 = df['numeric_column'].quantile(0.25)
# Q3 = df['numeric_column'].quantile(0.75)
# IQR = Q3 - Q1
# lower_bound = Q1 - 1.5 * IQR
# upper_bound = Q3 + 1.5 * IQR

# Remove outliers (only if there is a substantive reason to exclude them)
# df_no_outliers = df[(df['numeric_column'] >= lower_bound) & (df['numeric_column'] <= upper_bound)]

# Or cap outliers at percentiles
# df['numeric_column'] = df['numeric_column'].clip(lower=df['numeric_column'].quantile(0.05),
#                                                  upper=df['numeric_column'].quantile(0.95))

# Use Z-score to identify outliers
# from scipy import stats
# z_scores = np.abs(stats.zscore(df['numeric_column']))
# df_no_outliers = df[z_scores < 3]

<!-- Your justification here. -->
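
As a hedged illustration of the IQR rule outlined in the cell above, the sketch below flags extreme values instead of deleting them. The PM2.5 values are hypothetical, with the last one made deliberately extreme.

```python
import pandas as pd

# Hypothetical PM2.5 values; the last one is deliberately extreme
pm25 = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 95.0], name="PM25")

q1, q3 = pm25.quantile(0.25), pm25.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than delete: extreme countries may be real, substantively important cases
is_outlier = (pm25 < lower) | (pm25 > upper)

print(f"bounds: [{lower:.1f}, {upper:.1f}]")
print(pm25[is_outlier])
```

In a cross-national sample an "outlier" is usually a real country, not a measurement error, so flagging and then checking how sensitive the results are is often easier to defend than removal.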

### 2.4. Variable Transformation/Creation

In [None]:
# Create new variables or transform existing ones
# Example: Create interaction terms
# df['interaction_var'] = df['var1'] * df['var2']

# Example: Create categorical variables from continuous ones
# df['category_var'] = pd.cut(df['continuous_var'], bins=3, labels=['Low', 'Medium', 'High'])

<!-- Your justification here. -->
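
For this project, two transformations are natural candidates: a log of GDP per capita (to compress its right tail) and a PM2.5 × health expenditure product term for H3. The sketch below shows both on hypothetical values. The log transform and the mean-centering are assumptions made for illustration, not choices stated elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values using the project's column names
toy = pd.DataFrame({
    "PM25": [10.0, 20.0, 30.0, 40.0],
    "HealthExp": [9.0, 7.0, 5.0, 3.0],
    "GDPpc": [40000.0, 12000.0, 4000.0, 1000.0],
})

# Log income compresses the long right tail of GDP per capita
toy["logGDPpc"] = np.log(toy["GDPpc"])

# Mean-center before multiplying so main effects are read at average values
toy["PM25_c"] = toy["PM25"] - toy["PM25"].mean()
toy["HealthExp_c"] = toy["HealthExp"] - toy["HealthExp"].mean()
toy["PM25_x_HealthExp"] = toy["PM25_c"] * toy["HealthExp_c"]

print(toy[["PM25_c", "HealthExp_c", "PM25_x_HealthExp", "logGDPpc"]])
```

Centering does not change the interaction coefficient itself; it changes what the main-effect coefficients mean, moving their reference point from zero to the sample mean.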

## 3. Rigorous Assumption Checking and Interpretation
This section presents the diagnostic checks for your final chosen model. For each assumption, provide the relevant plot or test, the code that generated it, and a clear interpretation of the result. Discuss any violations and their implications for your conclusions.


### 3.1. Linearity and Homoscedasticity (Residuals vs. Fitted Plot)

In [None]:
# plt.scatter(results.fittedvalues, results.resid)
# plt.xlabel('Fitted values')
# plt.ylabel('Residuals')
# plt.title('Residuals vs Fitted')
# plt.show()

<!-- Your interpretation here. -->

### 3.2. Normality of Residuals (Q-Q Plot & Histogram)

In [None]:
# stats.probplot(results.resid, dist="norm", plot=plt)
# plt.show()
# plt.hist(results.resid, bins=30)
# plt.title('Histogram of Residuals')
# plt.show()

<!-- Your interpretation here. -->
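
Alongside the Q-Q plot and histogram, two standard tests can summarize departures from normality. The sketch below uses synthetic residuals as a stand-in for `results.resid`; both tests take normality as the null hypothesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid = rng.normal(0, 1, 150)  # synthetic stand-in for results.resid

# Shapiro-Wilk and Jarque-Bera both test the null of normally distributed residuals
shapiro_stat, shapiro_p = stats.shapiro(resid)
jb_stat, jb_p = stats.jarque_bera(resid)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Jarque-Bera:  JB = {jb_stat:.3f}, p = {jb_p:.3f}")
```

With a sample of roughly 180 countries, these tests are best read together with the plots: mild non-normality matters less for coefficient inference than heteroscedasticity or influential cases.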

### 3.3. Multicollinearity (VIF)

In [None]:
# X = df[['var1', 'var2', 'var3']] # replace with your predictors
# vif_data = pd.DataFrame()
# vif_data['feature'] = X.columns
# vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# print(vif_data)

<!-- Your interpretation here. -->

### 3.4. Influential Observations (Cook's Distance)

In [None]:
# influence = results.get_influence()
# cooks = influence.cooks_distance[0]
# print(pd.Series(cooks).sort_values(ascending=False).head())

<!-- Your interpretation here. -->

## 4. Purposeful Multiple Linear Model Formulation
Define at least three distinct models that you will estimate. For each model, provide the theoretical and/or empirical justification for its specification (i.e., why you included/excluded certain variables).

**Model 1: Baseline Model**

*Justification:* This model includes... because...

**Model 2: Extended Model**

*Justification:* This model adds... to test...

**Model 3: Interaction/Alternative Model**

*Justification:* This model incorporates an interaction term between... and... to examine...

## 5. Model Estimation, Comprehensive Evaluation, and Comparative Analysis


### 5.1. Model Estimation
Present the full statsmodels summary output for each of your three models.


**Model 1 Results:**

In [None]:
# model1 = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
# print(model1.summary())

<!-- Model 1 summary output -->

**Model 2 Results:**

In [None]:
# model2 = smf.ols('y ~ x1 + x2 + x3 + x4', data=df).fit()
# print(model2.summary())

<!-- Model 2 summary output -->

**Model 3 Results:**

In [None]:
# model3 = smf.ols('y ~ x1 * x2 + x3', data=df).fit()
# print(model3.summary())

<!-- Model 3 summary output -->
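
To make the placeholder formulas above concrete, the sketch below estimates three nested specifications using the project's variable names. The data are synthetic, generated only so the formulas have something to fit, and the particular split into baseline, extended, and interaction models is an assumption, not the notebook's final choice.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 180
toy = pd.DataFrame({
    "PM25": rng.uniform(5, 80, n),
    "GDPpc": rng.lognormal(9, 1, n),
    "UrbanRate": rng.uniform(15, 100, n),
    "HealthExp": rng.uniform(2, 12, n),
})
# Synthetic outcome; the coefficients below are arbitrary illustration values
toy["LifeExpectancy"] = (
    70 - 0.10 * toy["PM25"] + 0.0001 * toy["GDPpc"]
    + 0.02 * toy["UrbanRate"] + 0.3 * toy["HealthExp"] + rng.normal(0, 2, n)
)

model1 = smf.ols("LifeExpectancy ~ PM25 + GDPpc", data=toy).fit()
model2 = smf.ols("LifeExpectancy ~ PM25 + GDPpc + UrbanRate + HealthExp", data=toy).fit()
# PM25 * HealthExp expands to both main effects plus the PM25:HealthExp product
model3 = smf.ols("LifeExpectancy ~ PM25 * HealthExp + GDPpc + UrbanRate", data=toy).fit()

print(model3.params)
```

In the formula syntax, `a * b` is shorthand for `a + b + a:b`, so the interaction model always retains both main effects.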

### 5.2. Comparative Analysis
Compare your models using appropriate metrics (e.g., Adjusted R-squared) and theoretical fit. Create a table summarizing key coefficients and fit statistics across models for easy comparison. Justify which model you consider to be the 'best' for answering your research question, considering parsimony, theoretical coherence, and statistical performance.
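
One way to assemble such a comparison is sketched below. It fits three nested models on synthetic data (hypothetical values and coefficients), collects fit statistics that penalize extra parameters into one table, and runs nested F-tests with `anova_lm`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(5)
n = 180
toy = pd.DataFrame({"PM25": rng.uniform(5, 80, n), "HealthExp": rng.uniform(2, 12, n)})
# Synthetic outcome, for illustration only
toy["LifeExpectancy"] = 75 - 0.15 * toy["PM25"] + 0.5 * toy["HealthExp"] + rng.normal(0, 3, n)

models = {
    "M1": smf.ols("LifeExpectancy ~ PM25", data=toy).fit(),
    "M2": smf.ols("LifeExpectancy ~ PM25 + HealthExp", data=toy).fit(),
    "M3": smf.ols("LifeExpectancy ~ PM25 * HealthExp", data=toy).fit(),
}

# One row per model: parameter count and fit statistics
comparison = pd.DataFrame({
    name: {"n_params": int(m.df_model) + 1, "adj_R2": m.rsquared_adj, "AIC": m.aic, "BIC": m.bic}
    for name, m in models.items()
}).T
print(comparison.round(3))

# Nested F-tests: does each added term reduce residual variance significantly?
print(anova_lm(models["M1"], models["M2"], models["M3"]))
```

Adjusted R-squared, AIC, and BIC can disagree. When they do, the choice between models has to rest on theoretical coherence and parsimony as well as on fit.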

## 6. Substantive Conclusions and Limitations


### 6.1. Conclusions
Summarize your key findings based on your chosen model. Directly address your research question and state whether your hypotheses were supported or not. Discuss the substantive and theoretical implications of your results in a political science context.


### 6.2. Limitations

Acknowledge the limitations of your study. Consider data limitations, potential violations of model assumptions, and issues of generalizability.
